Video Transcription Service

I found out in a meeting that the transcription service we used only generated transcripts for the first ten minutes of a video. After that it just stopped. Mid-sentence. A 45-minute interview came back as a 10-minute stub of truncated garbage, and the accepted fix was to shrug and move on.

Which is absolutely insane. Why are we delivering a partial product when there are so many ways we could deliver more?

So I dug in. The old service was running full Whisper on SageMaker for inference, and SageMaker was timing out. I swapped full Whisper for faster-whisper, moved inference off SageMaker onto ECS running on EC2, and wrapped the whole thing in an event-driven pipeline so it could scale from zero GPU workers up to a fleet and back down when the queue was empty.
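To make the scale-from-zero idea concrete, here is a minimal sketch of a queue-driven scaler. It is not the service's actual scaling logic: the function names, the `jobs_per_worker` and `max_workers` parameters, and the choice to set the ECS service's desired count directly from queue depth are all assumptions for illustration.

```python
import math


def desired_workers(visible: int, in_flight: int,
                    jobs_per_worker: int = 1, max_workers: int = 10) -> int:
    """Map queue backlog to a GPU worker count (hypothetical policy).

    visible   -- messages waiting in the queue
    in_flight -- messages currently being processed
    """
    backlog = visible + in_flight
    if backlog <= 0:
        return 0  # empty queue: scale all the way down
    return min(max_workers, math.ceil(backlog / jobs_per_worker))


def scale_service(queue_url: str, cluster: str, service: str) -> int:
    """Read queue depth from SQS and set the ECS service's desired count."""
    import boto3  # imported lazily so desired_workers() works without AWS libs

    sqs = boto3.client("sqs")
    ecs = boto3.client("ecs")
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    count = desired_workers(
        int(attrs["ApproximateNumberOfMessages"]),
        int(attrs["ApproximateNumberOfMessagesNotVisible"]),
    )
    ecs.update_service(cluster=cluster, service=service, desiredCount=count)
    return count
```

Counting in-flight messages along with waiting ones keeps a worker alive while it is still mid-transcription, so the fleet only drops to zero once the queue is truly drained.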

The new service transcribes the full video — all of it — in 100+ languages at roughly 4x the speed of the old one. It was also designed to be modular: any team that wants bulk transcription can plug it in without inheriting the SageMaker problem.
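For a sense of what full-length transcription with faster-whisper looks like, here is a rough sketch. The model size, `beam_size`, and the helper names (`Line`, `to_lines`, `format_timestamp`, `transcribe_file`) are assumptions for illustration, not the service's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Line:
    start: float  # seconds from the start of the media
    end: float
    text: str


def to_lines(segments: Iterable) -> Iterator[Line]:
    """Convert segments (anything with .start, .end, .text) into plain lines."""
    for seg in segments:
        yield Line(seg.start, seg.end, seg.text.strip())


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, dropping fractions of a second."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcribe_file(path: str, model_size: str = "large-v3") -> Tuple[str, List[Line]]:
    """Transcribe a whole file; returns (detected language code, lines)."""
    # Imported lazily so the helpers above work without the GPU stack installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    # With no language argument, faster-whisper detects the language itself.
    segments, info = model.transcribe(path, beam_size=5)
    # `segments` is a generator: decoding runs as it is consumed, so listing
    # it walks the file from the first second to the last.
    return info.language, list(to_lines(segments))
```

Calling `transcribe_file("interview.mp4")` on a GPU machine would return the detected language and one `Line` per segment; `format_timestamp(line.start)` turns each start time into a readable marker.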

Python · AWS · faster-whisper · ECS · GPU

Key Metrics

Speed vs. Old: roughly 4x (faster-whisper on ECS)

Languages Supported: 100+

Coverage: Full (no 10-minute cap)

Architecture

Event-driven pipeline that processes video files through GPU-accelerated transcription workers.

S3 Upload (video/audio file ingestion)

→ EventBridge (event routing)

→ SQS Queue (job buffering)

→ ECS GPU Worker (faster-whisper transcription)

→ S3 Output (transcript delivery)
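The flow above can be sketched as a single worker loop. This is an illustration under assumptions, not the deployed worker: it assumes the SQS message body is the S3 "Object Created" event as EventBridge delivers it, and the `transcripts/` key prefix, the function names, and the injected `transcribe` callable are hypothetical.

```python
import json
import os
import tempfile
from pathlib import PurePosixPath
from typing import Callable, Tuple


def parse_s3_event(body: str) -> Tuple[str, str]:
    """Extract (bucket, key) from an EventBridge S3 'Object Created' event."""
    detail = json.loads(body)["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]


def transcript_key(source_key: str, prefix: str = "transcripts") -> str:
    """Hypothetical naming rule: videos/a.mp4 -> transcripts/videos/a.txt."""
    return str(PurePosixPath(prefix) / PurePosixPath(source_key).with_suffix(".txt"))


def run_worker(queue_url: str, out_bucket: str,
               transcribe: Callable[[str], str]) -> None:
    """Poll SQS, transcribe each uploaded file, write the transcript to S3."""
    import boto3  # imported lazily so the helpers above work without AWS libs

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    while True:
        # Long polling: wait up to 20 seconds for a job instead of spinning.
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            bucket, key = parse_s3_event(msg["Body"])
            with tempfile.TemporaryDirectory() as tmp:
                local_path = os.path.join(tmp, os.path.basename(key))
                s3.download_file(bucket, key, local_path)
                text = transcribe(local_path)
            s3.put_object(
                Bucket=out_bucket,
                Key=transcript_key(key),
                Body=text.encode("utf-8"),
            )
            # Delete only after the transcript is stored, so a job held by a
            # crashed worker becomes visible on the queue again.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
```

One design point worth noting: the queue's visibility timeout would need to be longer than the slowest expected transcription, otherwise a long video could be picked up by a second worker while the first is still working on it.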

Tech Stack

Languages: Python

Cloud: AWS ECS, AWS EC2, AWS EventBridge, AWS SQS, AWS S3

Tools: Docker

AI/ML: faster-whisper