build log | May 22, 2026 | Updated May 30, 2026

Transcript-First Video Processing

How Summry changed video processing to fetch YouTube transcripts before falling back to yt-dlp audio download and Whisper transcription.

summry, build-log, backend, summarization, architecture

Problem

Summry's original worker pipeline treated audio download as the required first step for every video:

download audio -> transcribe audio -> chunk transcript -> summarize

That was a straightforward first implementation, but running it in production exposed the cost of that choice. The app does not need video or audio as its primary artifact; it needs timestamped text. Downloading media first meant every processing job paid for the most brittle and expensive part of the pipeline before checking whether YouTube already exposed a usable transcript.

This was especially awkward after deployment. yt-dlp is powerful, but production server traffic can run into bot checks, extractor changes, format issues, cookie rotation, and JavaScript challenge solver problems. When yt-dlp is the first mandatory step, any of those issues can stop the product before summarization begins.

Context

The existing architecture already had most of the right boundaries:

services/video_processing_service.py owned the worker-side processing workflow.
integrations/transcription_service.py wrapped faster-whisper.
integrations/youtube_service.py wrapped yt-dlp metadata extraction.
Transcript segments were already stored as timestamped video_segments.

The weak point was the workflow order. yt-dlp audio download lived directly inside the processing service, and the service assumed Whisper was the only transcript source. That made transcript fetching look like a replacement for transcription, when it is better modeled as the first transcript acquisition strategy.

The new target workflow is:

fetch YouTube transcript
  -> if usable, chunk and summarize
  -> if unavailable, download audio with yt-dlp
  -> transcribe audio with Whisper
  -> chunk and summarize

Implementation

Both transcript sources now normalize into the same shape:

full transcript text
timestamped segment records
language metadata when available
transcript source
whether the transcript was generated when that metadata is available
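
As a rough illustration of the shape listed above, here is a minimal sketch using dataclasses. The names NormalizedTranscript and Segment, and every field name, are assumptions made for this example; Summry's real structures (such as VideoSegmentPartial) may look different.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """One timestamped piece of transcript text (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass(frozen=True)
class NormalizedTranscript:
    """Hypothetical common shape produced by every transcript source."""
    segments: Tuple[Segment, ...]
    source: str                          # e.g. "youtube_transcript" or "whisper"
    language: Optional[str] = None       # None when the source does not report it
    is_generated: Optional[bool] = None  # None when that metadata is unavailable

    @property
    def text(self) -> str:
        """Full transcript text, derived from the segments."""
        return " ".join(segment.text for segment in self.segments)


example = NormalizedTranscript(
    segments=(Segment(0.0, 2.5, "Hello"), Segment(2.5, 5.0, "world")),
    source="youtube_transcript",
    language="en",
    is_generated=True,
)
```

Deriving the full text from the segments keeps the two from drifting apart. Storing the text separately would work just as well if the source already provides it.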

Then I added a dedicated YouTube transcript integration:

integrations/youtube_transcript_service.py

It calls youtube-transcript-api, maps each snippet into Summry's existing VideoSegmentPartial structure, skips empty snippets, and raises a local YouTubeTranscriptUnavailable error when the transcript cannot be fetched or contains no usable text.
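
The mapping step described above might look roughly like the sketch below. It assumes the 1.x releases of youtube-transcript-api, where YouTubeTranscriptApi().fetch(video_id) returns an iterable of snippets exposing text, start, and duration. SegmentPartial is a hypothetical stand-in for VideoSegmentPartial, and the function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


class YouTubeTranscriptUnavailable(Exception):
    """Raised when no usable transcript text could be obtained."""


@dataclass
class SegmentPartial:
    # Hypothetical stand-in for Summry's VideoSegmentPartial.
    start: float
    end: float
    text: str


def snippets_to_segments(snippets: Iterable[Any]) -> List[SegmentPartial]:
    """Map snippet objects exposing .text, .start and .duration into segments."""
    segments = []
    for snippet in snippets:
        text = (snippet.text or "").strip()
        if not text:
            continue  # skip empty snippets
        start = float(snippet.start)
        end = start + float(snippet.duration)
        segments.append(SegmentPartial(start, end, text))
    if not segments:
        raise YouTubeTranscriptUnavailable("transcript contained no usable text")
    return segments


def fetch_segments(video_id: str) -> List[SegmentPartial]:
    """Fetch and normalize; any library failure becomes the local error."""
    # Imported lazily so the mapping above works without the dependency installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    try:
        fetched = YouTubeTranscriptApi().fetch(video_id)
    except Exception as exc:  # broad on purpose: exception classes vary by version
        raise YouTubeTranscriptUnavailable(str(exc)) from exc
    return snippets_to_segments(fetched)
```

Collapsing every library failure into one local exception means the processing service only has to catch a single error type to decide on the fallback.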

The old audio download logic moved into:

integrations/audio_download_service.py

That keeps external downloader behavior in integrations/ instead of embedding yt-dlp setup inside the service layer.
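
For reference, an audio download integration along these lines could be sketched as follows. The option keys are standard yt-dlp options, but the function names, the mp3 default, and the output template are assumptions for this example, not Summry's actual configuration.

```python
from pathlib import Path
from typing import Any, Dict


def build_audio_options(output_dir: Path, codec: str = "mp3") -> Dict[str, Any]:
    """Build yt-dlp options that download the best audio stream and convert it."""
    return {
        "format": "bestaudio/best",
        "outtmpl": str(output_dir / "%(id)s.%(ext)s"),
        "noplaylist": True,
        "quiet": True,
        # Requires FFmpeg on the host; re-encodes the download to `codec`.
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": codec}
        ],
    }


def download_audio(url: str, output_dir: Path, codec: str = "mp3") -> Path:
    """Download audio for `url` and return the expected output path."""
    import yt_dlp  # imported lazily so the options builder has no hard dependency

    with yt_dlp.YoutubeDL(build_audio_options(output_dir, codec)) as ydl:
        info = ydl.extract_info(url, download=True)
    # After FFmpegExtractAudio the file extension matches the chosen codec.
    return output_dir / f"{info['id']}.{codec}"
```

Keeping the options in a separate function makes the downloader configuration easy to test without touching the network.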

Finally, services/video_processing_service.py now tries the transcript path first. If transcript fetching succeeds, the worker skips audio download and Whisper entirely. If transcript fetching fails, the worker publishes a progress event and falls back to the existing yt-dlp plus faster-whisper path.
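
The fallback order can be sketched as a small function that receives each step as a callable. Everything here is illustrative: acquire_segments, TranscriptUnavailable, the source labels, and the progress message are hypothetical names, not the actual contents of services/video_processing_service.py.

```python
from typing import Callable, List, Tuple

# Each segment is (start_seconds, end_seconds, text) in this simplified sketch.
Segments = List[Tuple[float, float, str]]


class TranscriptUnavailable(Exception):
    """Raised by the transcript fetcher when captions cannot be used."""


def acquire_segments(
    video_id: str,
    fetch_transcript: Callable[[str], Segments],
    download_audio: Callable[[str], str],
    transcribe: Callable[[str], Segments],
    publish: Callable[[str], None],
) -> Tuple[str, Segments]:
    """Return (transcript_source, segments), trying the cheap path first."""
    try:
        return "youtube_transcript", fetch_transcript(video_id)
    except TranscriptUnavailable as exc:
        # Tell the client why the job is about to take longer.
        publish(f"transcript unavailable ({exc}); falling back to audio")
    audio_path = download_audio(video_id)
    return "whisper", transcribe(audio_path)
```

Passing the steps in as callables keeps the orchestration free of yt-dlp and Whisper imports, which also makes the fallback order straightforward to unit test with fakes.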

Usage events now include:

transcript_source
transcript_is_generated

That should make production debugging easier. A slow or failed job can now be traced back to whether it used the fast transcript path or the heavier audio fallback.

Why youtube-transcript-api

youtube-transcript-api is a better first tool for Summry's core job because it targets the artifact the app actually needs.

For videos with captions, it avoids:

media extraction
audio download
FFmpeg postprocessing
Whisper inference
temporary audio cleanup
some of the production fragility around full video/audio extraction

It is still an unofficial library: YouTube can block its requests, and it fails when captions are disabled. That is why this implementation uses it as the primary path, not the only path.

Tradeoffs

This reduces reliance on yt-dlp in the worker, but it does not remove yt-dlp from the system.

The API still uses yt-dlp for metadata extraction before creating the video record. The worker also still uses yt-dlp when transcript fetching fails. That means production still needs the downloader hardening already documented in docs/production-readiness.md.

The transcript path also introduces its own failure modes:

captions may be disabled
captions may be unavailable in the requested language
auto-generated captions may be lower quality
YouTube may block transcript requests from production egress

The benefit is that the expensive path is now the fallback instead of the default.

Lessons Learned

The important design shift was separating transcript acquisition from transcript processing.

Summarization, chunking, embeddings, and chat retrieval do not care whether text came from YouTube captions or Whisper. They care that the transcript is normalized into timestamped segments. Once that contract became explicit, the worker could choose the cheapest available transcript source without changing the rest of the processing pipeline.

This also made the service boundaries cleaner. The processing service now orchestrates the workflow, while YouTube transcript fetching, audio downloading, and Whisper transcription each live behind integration modules.

Next Steps

The next practical improvement is metadata extraction. The worker is no longer audio-first, but the submit path still depends on yt-dlp to fetch video metadata. Replacing or backing up that path with a lighter metadata strategy would reduce production fragility before jobs even enter the queue.

Another useful follow-up is production observability. Once this is deployed, the transcript_source metadata should show how often videos use the fast transcript path versus the audio fallback. That ratio will help decide whether the Whisper fallback should remain always-on, move to a slower queue, or eventually become an opt-in recovery path.
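
One way to compute that ratio from collected usage events is sketched below. It assumes each event is a mapping with a transcript_source key. The function name and the example source labels are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def transcript_source_shares(events: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Return the fraction of events per transcript_source value."""
    counts = Counter(
        str(event["transcript_source"])
        for event in events
        if event.get("transcript_source")  # ignore events without the field
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: count / total for source, count in counts.items()}


sample_events = [
    {"transcript_source": "youtube_transcript"},
    {"transcript_source": "youtube_transcript"},
    {"transcript_source": "whisper"},
    {"transcript_source": "youtube_transcript"},
    {"other_field": 1},  # event without transcript metadata is skipped
]
shares = transcript_source_shares(sample_events)
```

With the sample events above, three of the four tagged events used the transcript path, so its share is 0.75.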