Transcript-First Video Processing
How Summry changed video processing to fetch YouTube transcripts before falling back to yt-dlp audio download and Whisper transcription.
Problem
Summry's original worker pipeline treated audio download as the required first step for every video:
download audio -> transcribe audio -> chunk transcript -> summarize
That worked as a straightforward implementation, but production made the cost of that choice more visible. The app does not actually need video or audio as its primary artifact. It needs timestamped text. Downloading media first meant every processing job paid for the most brittle and expensive part of the pipeline before checking whether YouTube already exposed a usable transcript.
This was especially awkward after deployment. yt-dlp is powerful, but production server traffic can run into bot checks, extractor changes, format issues, cookie rotation, and JavaScript challenge solver problems. When yt-dlp is the first mandatory step, any of those issues can stop the product before summarization begins.
Context
The existing architecture already had most of the right boundaries:
- services/video_processing_service.py owned the worker-side processing workflow.
- integrations/transcription_service.py wrapped faster-whisper.
- integrations/youtube_service.py wrapped yt-dlp metadata extraction.
- Transcript segments were already stored as timestamped video_segments.
The weak point was the workflow order. yt-dlp audio download lived directly inside the processing service, and the service assumed Whisper was the only transcript source. That made transcript fetching look like a replacement for transcription, when it is better modeled as the first transcript acquisition strategy.
The new target workflow is:
fetch YouTube transcript
-> if usable, chunk and summarize
-> if unavailable, download audio with yt-dlp
-> transcribe audio with Whisper
-> chunk and summarize
Implementation
Both transcript sources now normalize into the same shape:
- full transcript text
- timestamped segment records
- language metadata when available
- transcript source
- whether the transcript was generated when that metadata is available
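As a rough illustration of that shared shape, here is a minimal sketch. The names SegmentSketch, TranscriptResultSketch, and build_result, the field names, and the source labels are all hypothetical; Summry's real structure is VideoSegmentPartial, whose exact fields are not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentSketch:
    """One timestamped piece of transcript text (hypothetical stand-in for VideoSegmentPartial)."""
    start_seconds: float
    end_seconds: float
    text: str


@dataclass
class TranscriptResultSketch:
    """Common shape both transcript sources normalize into (hypothetical name and fields)."""
    text: str                            # full transcript text
    segments: list[SegmentSketch]        # timestamped segment records
    source: str                          # assumed labels, e.g. "youtube_transcript" or "whisper"
    language: Optional[str] = None       # language metadata when available
    is_generated: Optional[bool] = None  # None when the source does not report it


def build_result(segments, source, language=None, is_generated=None):
    """Join segment text into the full transcript and wrap everything in one record."""
    full_text = " ".join(segment.text for segment in segments)
    return TranscriptResultSketch(full_text, list(segments), source, language, is_generated)
```

Keeping is_generated optional reflects the "when that metadata is available" caveat: a source that cannot say whether captions were auto-generated leaves it as None rather than guessing.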
Then I added a dedicated YouTube transcript integration:
integrations/youtube_transcript_service.py
It calls youtube-transcript-api, maps each snippet into Summry's existing VideoSegmentPartial structure, skips empty snippets, and raises a local YouTubeTranscriptUnavailable error when the transcript cannot be fetched or contains no usable text.
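The mapping step described above might look roughly like the sketch below. It deliberately leaves out the network call and only shows the normalization. The snippet attributes text, start, and duration are assumed to match what youtube-transcript-api returns, and SegmentSketch and snippets_to_segments are hypothetical names, not Summry's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class YouTubeTranscriptUnavailable(Exception):
    """Raised when no usable transcript text could be obtained."""


class SnippetLike(Protocol):
    # Assumed attribute names, modeled on youtube-transcript-api snippet objects.
    text: str
    start: float
    duration: float


@dataclass
class SegmentSketch:
    """Hypothetical stand-in for VideoSegmentPartial."""
    start_seconds: float
    end_seconds: float
    text: str


def snippets_to_segments(snippets: Iterable[SnippetLike]) -> list[SegmentSketch]:
    """Map caption snippets to timestamped segments, skipping empty ones."""
    segments = []
    for snippet in snippets:
        text = snippet.text.strip()
        if not text:
            continue  # skip whitespace-only snippets
        end = snippet.start + snippet.duration
        segments.append(SegmentSketch(snippet.start, end, text))
    if not segments:
        # An empty result is treated the same as a failed fetch.
        raise YouTubeTranscriptUnavailable("transcript contained no usable text")
    return segments
```

Raising the same local error for "fetch failed" and "nothing usable came back" means the caller only has one exception to handle when deciding whether to fall back.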
The old audio download logic moved into:
integrations/audio_download_service.py
That keeps external downloader behavior in integrations/ instead of embedding yt-dlp setup inside the service layer.
Finally, services/video_processing_service.py now tries the transcript path first. If transcript fetching succeeds, the worker skips audio download and Whisper entirely. If transcript fetching fails, the worker publishes a progress event and falls back to the existing yt-dlp plus faster-whisper path.
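A minimal sketch of that ordering, with the integrations passed in as plain callables so the control flow is visible on its own. The function name acquire_transcript, the callable signatures, the dict-based result, and the source labels are assumptions for illustration; the real service's interfaces are not shown in this post.

```python
from typing import Callable, Optional


class YouTubeTranscriptUnavailable(Exception):
    """Raised by the transcript integration when captions cannot be used."""


def acquire_transcript(
    video_id: str,
    fetch_transcript: Callable[[str], dict],
    download_audio: Callable[[str], str],
    transcribe_audio: Callable[[str], dict],
    publish_progress: Optional[Callable[[str], None]] = None,
) -> dict:
    """Try the transcript path first, then fall back to audio download plus transcription."""
    try:
        result = fetch_transcript(video_id)
        # Fast path: no audio download and no Whisper inference.
        return {**result, "transcript_source": "youtube_transcript"}
    except YouTubeTranscriptUnavailable:
        if publish_progress is not None:
            publish_progress("Transcript unavailable, falling back to audio transcription")

    # Fallback path: the original download-then-transcribe workflow.
    audio_path = download_audio(video_id)
    result = transcribe_audio(audio_path)
    return {**result, "transcript_source": "whisper"}
```

Only the dedicated unavailable error triggers the fallback in this sketch. Whether other exceptions should also fall back is a design choice the sketch leaves open.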
Usage events now include:
- transcript_source
- transcript_is_generated
That should make production debugging easier. A slow or failed job can now be traced back to whether it used the fast transcript path or the heavier audio fallback.
Why youtube-transcript-api
youtube-transcript-api is a better first tool for Summry's core job because it targets the artifact the app actually needs.
For videos with captions, it avoids:
- media extraction
- audio download
- FFmpeg postprocessing
- Whisper inference
- temporary audio cleanup
- some of the production fragility around full video/audio extraction
It is still an unofficial library: YouTube can block its requests, and it fails when captions are disabled. That is why this implementation uses it as the primary path, not the only path.
Tradeoffs
This reduces reliance on yt-dlp in the worker, but it does not remove yt-dlp from the system.
The API still uses yt-dlp for metadata extraction before creating the video record. The worker also still uses yt-dlp when transcript fetching fails. That means production still needs the downloader hardening already documented in docs/production-readiness.md.
The transcript path also introduces its own failure modes:
- captions may be disabled
- captions may be unavailable in the requested language
- auto-generated captions may be lower quality
- YouTube may block transcript requests from production egress
The benefit is that the expensive path is now the fallback instead of the default.
Lessons Learned
The important design shift was separating transcript acquisition from transcript processing.
Summarization, chunking, embeddings, and chat retrieval do not care whether text came from YouTube captions or Whisper. They care that the transcript is normalized into timestamped segments. Once that contract became explicit, the worker could choose the cheapest available transcript source without changing the rest of the processing pipeline.
This also made the service boundaries cleaner. The processing service now orchestrates the workflow, while YouTube transcript fetching, audio downloading, and Whisper transcription each live behind integration modules.
Next Steps
The next practical improvement is metadata extraction. The worker is no longer audio-first, but the submit path still depends on yt-dlp to fetch video metadata. Replacing or backing up that path with a lighter metadata strategy would reduce production fragility before jobs even enter the queue.
Another useful follow-up is production observability. Once this is deployed, the transcript_source metadata should show how often videos use the fast transcript path versus the audio fallback. That ratio will help decide whether the Whisper fallback should remain always-on, move to a slower queue, or eventually become an opt-in recovery path.
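One way to compute that ratio, assuming usage events can be exported as a list of dicts carrying a transcript_source key. The helper name transcript_source_shares and the "unknown" bucket for events without the field are hypothetical.

```python
from collections import Counter


def transcript_source_shares(events):
    """Return the fraction of events per transcript_source value."""
    counts = Counter(event.get("transcript_source", "unknown") for event in events)
    total = sum(counts.values())
    if total == 0:
        return {}  # no events yet, so no ratio to report
    return {source: count / total for source, count in counts.items()}


# Example with made-up events, not production data.
sample_events = [
    {"transcript_source": "youtube_transcript"},
    {"transcript_source": "youtube_transcript"},
    {"transcript_source": "youtube_transcript"},
    {"transcript_source": "whisper"},
]
shares = transcript_source_shares(sample_events)
```

With the made-up sample above, shares maps "youtube_transcript" to 0.75 and "whisper" to 0.25.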