Video Transcription Service
I found out in a meeting that the transcription service we used only generated transcripts for the first ten minutes of a video. After that it just stopped. Mid-sentence. A 45-minute interview came back as a 10-minute stub of truncated garbage, and the accepted fix was to shrug and move on.
Which is absolutely insane. Why are we delivering a partial product when there are so many ways we could deliver more?
So I dug in. The old service was running full Whisper on SageMaker for inference, and SageMaker was timing out. I swapped full Whisper for faster-whisper, moved inference off SageMaker onto ECS running on EC2, and wrapped the whole thing in an event-driven pipeline so it could scale from zero GPU workers up to a fleet and back down when the queue was empty.
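The scale-to-zero behaviour comes down to a small sizing rule. Here is a minimal sketch of one, not the production policy: the function name, `jobs_per_worker`, and `max_workers` are hypothetical values chosen for illustration, and the job counts are assumed to come from the queue's depth.

```python
import math


def desired_workers(queued_jobs: int, in_flight_jobs: int = 0,
                    jobs_per_worker: int = 2, max_workers: int = 8) -> int:
    """Return how many GPU workers should be running for the current backlog.

    jobs_per_worker and max_workers are illustrative assumptions, not values
    taken from the real service.
    """
    if queued_jobs < 0 or in_flight_jobs < 0:
        raise ValueError("job counts must be non-negative")
    backlog = queued_jobs + in_flight_jobs
    if backlog == 0:
        return 0  # empty queue and nothing running: scale all the way down
    # Round up so a single waiting job still gets a worker, then cap the fleet.
    return min(max_workers, math.ceil(backlog / jobs_per_worker))


if __name__ == "__main__":
    for depth in (0, 1, 5, 100):
        print(depth, "queued ->", desired_workers(depth), "workers")
```

Counting in-flight jobs alongside queued ones keeps a worker alive while it is still chewing through a long video, even when nothing else is waiting.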
The new service transcribes the full video, start to finish, in 100+ languages at roughly 4x the speed of the old one. It was also designed to be modular: any team that wants bulk transcription can plug it in without inheriting the SageMaker timeout problem.
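For a sense of what the worker does with a single file, here is a rough sketch using the faster-whisper Python package. The model size, device, and compute type are assumptions rather than the service's actual settings, and the timestamp helpers are hypothetical names added for illustration. faster-whisper yields segments lazily for the whole file, so the loop runs until the audio ends instead of stopping at a fixed duration.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Render a position in seconds as HH:MM:SS (fractions are truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Join (start, end, text) segments into one timestamped line each."""
    lines = []
    for start, end, text in segments:
        lines.append(
            f"[{format_timestamp(start)} - {format_timestamp(end)}] {text.strip()}"
        )
    return "\n".join(lines)


def transcribe_file(path: str, model_size: str = "large-v3") -> str:
    """Transcribe an entire audio/video file on a GPU (settings are assumed)."""
    # Imported lazily so the helpers above work without the package or a GPU.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # `segments` is a generator; transcription happens as it is consumed.
    segments, _info = model.transcribe(path)
    return render_transcript((s.start, s.end, s.text) for s in segments)
```

Calling `transcribe_file("interview.mp4")` on a GPU host would return one timestamped line per segment for the full recording.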
Key Metrics
4x Speed vs. Old: faster-whisper on ECS
100+ Languages Supported
Full Coverage: No 10-minute cap
Architecture
Event-driven pipeline that processes video files through GPU-accelerated transcription workers.
S3 Upload: Video/audio file ingestion
EventBridge: Event routing
SQS Queue: Job buffering
ECS GPU Worker: faster-whisper transcription
S3 Output: Transcript delivery
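To make the hand-off between stages concrete, here is a sketch of how a worker might turn a queued message into a transcription job. It assumes the SQS message body is the S3 "Object Created" event as forwarded by EventBridge. The `TranscriptionJob` type, the `transcripts/` output prefix, the JSON output format, and the list of media extensions are all hypothetical choices for illustration, not the service's actual layout.

```python
import json
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional

# Assumed set of file types worth transcribing.
MEDIA_EXTENSIONS = {".mp4", ".mov", ".mkv", ".mp3", ".wav", ".m4a"}


@dataclass(frozen=True)
class TranscriptionJob:
    bucket: str       # bucket holding the uploaded media
    key: str          # object key of the uploaded media
    output_key: str   # where the transcript would be written


def job_from_message(body: str,
                     output_prefix: str = "transcripts/") -> Optional[TranscriptionJob]:
    """Parse one queue message; return None for events the worker should skip."""
    event = json.loads(body)
    if event.get("source") != "aws.s3" or event.get("detail-type") != "Object Created":
        return None
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]
    path = PurePosixPath(key)
    if path.suffix.lower() not in MEDIA_EXTENSIONS:
        return None  # not audio/video, nothing to transcribe
    # Mirror the upload path under the output prefix, swapping the extension.
    output_key = f"{output_prefix}{path.with_suffix('.json')}"
    return TranscriptionJob(bucket=bucket, key=key, output_key=output_key)
```

Skipping non-media uploads here, rather than failing, keeps stray files from clogging the queue with retries.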
Tech Stack
Languages
Cloud
Tools
AI/ML