YouTube Data API Metadata
How Summry moved submit-time YouTube metadata lookup to the official YouTube Data API while keeping yt-dlp as a fallback.
Problem
After moving the worker to a transcript-first pipeline, one brittle piece still remained near the front of the product flow: metadata extraction.
Before this change, submitting a YouTube URL still depended on yt-dlp before Summry could create or reuse a video record:
normalize URL -> yt-dlp metadata -> duration check -> create chat room -> enqueue processing
That meant production could still fail before the new transcript-first worker ever had a chance to run. The worker no longer needed to download audio by default, but the API still had to ask yt-dlp for title, channel, duration, thumbnail, tags, and category data during chat creation.
The product needs that metadata, but it does not need a media extractor to get it.
Context
The previous transcript-first change reduced worker-side dependence on yt-dlp by trying youtube-transcript-api before audio download and Whisper transcription.
This change applies the same idea one step earlier in the workflow:
normalize URL
-> YouTube Data API metadata
-> yt-dlp metadata fallback
-> duration check
-> create or reuse video
The goal was not to remove yt-dlp completely. yt-dlp still has value as a fallback and remains necessary for the audio transcription fallback. The goal was to stop making yt-dlp the first required metadata path in production.
Implementation
I kept the public integration contract stable:
integrations/youtube_service.py
get_youtube_metadata_from_url(youtube_url: str) -> YoutubeMetaData
That let the service layer continue using the same metadata object without changing routes, repositories, schemas, or frontend contracts.
The integration now checks for:
YOUTUBE_DATA_API_KEY
When the key is configured, it calls the official YouTube Data API videos.list endpoint with:
part=snippet,contentDetails,status
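As a rough illustration of that request, the sketch below builds the videos.list URL with the part value shown above. The function name build_videos_list_url is hypothetical, and the id and key query parameters come from the public Data API reference rather than from Summry's code.

```python
from urllib.parse import urlencode

# Public videos.list endpoint of the YouTube Data API v3.
YOUTUBE_VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_videos_list_url(video_id: str, api_key: str) -> str:
    """Build a videos.list request URL for a single video id (hypothetical helper)."""
    params = {
        "part": "snippet,contentDetails,status",
        "id": video_id,
        "key": api_key,
    }
    # urlencode percent-encodes the commas in the part value.
    return f"{YOUTUBE_VIDEOS_URL}?{urlencode(params)}"


url = build_videos_list_url("abc123", "demo-key")
```

The actual HTTP call is left out on purpose; any client can fetch this URL, ideally with the configured timeout applied.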
The response is mapped into the existing YoutubeMetaData shape:
- snippet.title -> title
- snippet.description -> description
- snippet.channelTitle -> channel name
- snippet.channelId -> channel id
- snippet.tags -> tags
- snippet.thumbnails -> best available thumbnail URL
- snippet.publishedAt -> upload date
- contentDetails.duration -> duration seconds
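The mapping above could look roughly like the following sketch. The real YoutubeMetaData fields are not shown in this post, so VideoMetadataSketch and its field names are stand-ins, and the thumbnail preference order (largest first) is an assumption about what "best available" means.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Thumbnail keys the Data API may return, ordered from largest to smallest.
THUMBNAIL_PREFERENCE = ("maxres", "standard", "high", "medium", "default")


@dataclass
class VideoMetadataSketch:
    """Hypothetical stand-in for YoutubeMetaData; field names are assumptions."""
    title: str
    description: str
    channel_name: str
    channel_id: str
    tags: list = field(default_factory=list)
    thumbnail_url: Optional[str] = None
    upload_date: Optional[str] = None
    raw_duration: Optional[str] = None  # ISO-8601 string, parsed separately


def pick_best_thumbnail(thumbnails: dict) -> Optional[str]:
    """Return the URL of the largest thumbnail that is present, if any."""
    for key in THUMBNAIL_PREFERENCE:
        url = (thumbnails.get(key) or {}).get("url")
        if url:
            return url
    return None


def map_video_item(item: dict) -> VideoMetadataSketch:
    """Map one videos.list item into the normalized metadata shape."""
    snippet: dict = item.get("snippet", {})
    content: dict = item.get("contentDetails", {})
    return VideoMetadataSketch(
        title=snippet.get("title", ""),
        description=snippet.get("description", ""),
        channel_name=snippet.get("channelTitle", ""),
        channel_id=snippet.get("channelId", ""),
        tags=list(snippet.get("tags", [])),  # tags may be absent entirely
        thumbnail_url=pick_best_thumbnail(snippet.get("thumbnails", {})),
        upload_date=snippet.get("publishedAt"),
        raw_duration=content.get("duration"),
    )
```

Using .get with defaults throughout keeps the mapper tolerant of videos that have no tags or only a subset of thumbnail sizes.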
The Data API returns duration as ISO-8601, so the integration now parses values like:
PT1H2M3S
into seconds before the existing duration limit check runs.
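A minimal sketch of that parsing step, assuming durations only use week, day, hour, minute, and second components with integer values (the function name is illustrative, not necessarily the one in the integration):

```python
import re

# Matches durations such as PT1H2M3S, PT15M33S, or P1DT2H.
_DURATION_RE = re.compile(
    r"P(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_iso8601_duration(value: str) -> int:
    """Convert an ISO-8601 duration string into whole seconds."""
    match = _DURATION_RE.fullmatch(value)
    # Reject non-matching strings and bare "P" / "PT" with no components.
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"Unsupported ISO-8601 duration: {value!r}")
    parts = {name: int(num or 0) for name, num in match.groupdict().items()}
    return (
        parts["weeks"] * 604800
        + parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


seconds = parse_iso8601_duration("PT1H2M3S")  # 3600 + 120 + 3 = 3723
```

Raising ValueError on unparseable input lets the caller treat a bad duration like any other mapping failure.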
Category Names
The existing database stores categories as a list of strings. yt-dlp usually returns human-readable category names, while the Data API video resource gives categoryId.
To preserve a useful human-readable value, the integration calls videoCategories.list for the category id and stores the category title. Category lookups are cached in memory per process, so repeated videos in the same category do not each trigger another lookup request.
If category lookup fails, Summry does not fail the whole submit flow. It logs the category lookup failure and stores the raw category id as the fallback category value.
Fallback Behavior
The Data API path is primary only when YOUTUBE_DATA_API_KEY is configured.
If the key is missing, the app behaves like before and uses yt-dlp metadata extraction.
If the key exists but the Data API request fails, returns no video items, or cannot map the response, the integration logs the failure and falls back to yt-dlp. This keeps local development and production recovery behavior intact while still shifting the happy path to the official API.
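In outline, the provider selection might look like this sketch. The primary and fallback callables are placeholders for the Data API and yt-dlp lookups, and the sketch assumes the primary signals every failure mode (HTTP error, empty items, unmappable response) by raising an exception.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def fetch_metadata_with_fallback(
    youtube_url: str,
    api_key: Optional[str],
    primary: Callable[[str, str], T],
    fallback: Callable[[str], T],
) -> T:
    """Use the Data API when a key is configured, otherwise or on failure use yt-dlp."""
    if not api_key:
        # No key configured: behave exactly as before the change.
        return fallback(youtube_url)
    try:
        return primary(youtube_url, api_key)
    except Exception as exc:
        logger.warning("Data API metadata lookup failed, using yt-dlp: %s", exc)
        return fallback(youtube_url)
```

Because both paths return the same metadata type, callers never need to know which provider answered.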
Configuration
Two config values were added:
YOUTUBE_DATA_API_KEY
YOUTUBE_DATA_API_TIMEOUT_SECONDS=10
The key is optional. The timeout must be a positive integer; it ensures a slow metadata request cannot hang chat creation indefinitely.
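One way to validate these two values is sketched below with plain environment parsing. Summry's actual settings mechanism is not shown in this post, so the YoutubeDataApiConfig class and load_youtube_config function are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class YoutubeDataApiConfig:
    """Hypothetical config holder for the two new settings."""
    api_key: Optional[str]
    timeout_seconds: int


def load_youtube_config(env: Mapping[str, str] = os.environ) -> YoutubeDataApiConfig:
    """Read the optional key and validate the timeout as a positive integer."""
    api_key = env.get("YOUTUBE_DATA_API_KEY") or None  # empty string counts as unset
    raw_timeout = env.get("YOUTUBE_DATA_API_TIMEOUT_SECONDS", "10")
    try:
        timeout = int(raw_timeout)
    except ValueError:
        raise ValueError(
            f"YOUTUBE_DATA_API_TIMEOUT_SECONDS must be an integer, got {raw_timeout!r}"
        ) from None
    if timeout <= 0:
        raise ValueError("YOUTUBE_DATA_API_TIMEOUT_SECONDS must be positive")
    return YoutubeDataApiConfig(api_key=api_key, timeout_seconds=timeout)
```

Failing at startup on a bad timeout is cheaper than discovering it on the first slow request.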
The deployment docs now describe the YouTube stack in three layers:
- YouTube Data API for submit-time metadata
- youtube-transcript-api for worker-side transcript acquisition
- yt-dlp for metadata fallback and audio fallback
Tradeoffs
This change makes the submit path less dependent on yt-dlp, but it introduces a new operational dependency: YouTube Data API quota and key management.
The upside is that quota and API keys are more predictable than media extraction failures. A quota problem is visible and manageable. A YouTube extractor or bot-check failure can be harder to reason about from a production container.
The tradeoffs are:
- production should configure YOUTUBE_DATA_API_KEY
- metadata requests now consume YouTube Data API quota
- category names may require an extra cached lookup
- yt-dlp still needs to remain healthy for fallback cases
That feels like the right trade: make the official API the front door, keep the unofficial extractor as the escape hatch.
Tests
The implementation added coverage for:
- optional Data API config and timeout validation
- videos.list request parameters
- mapping Data API payloads into YoutubeMetaData
- best-thumbnail selection
- empty Data API results falling back to yt-dlp
- HTTP failures falling back to yt-dlp
- category name lookup
- category cache behavior
- category lookup failure falling back to the raw category id
The full backend suite passed:
131 passed
Lessons Learned
The important design move was keeping the metadata contract stable while changing the provider behind it.
video_service should not care whether metadata came from yt-dlp or the YouTube Data API. It needs a normalized title, channel, duration, thumbnail, tags, categories, and upload date. By preserving that boundary, the implementation could reduce production brittleness without expanding the change into the rest of the app.
This also makes the YouTube integration easier to reason about:
metadata: YouTube Data API first, yt-dlp fallback
transcript: youtube-transcript-api first, Whisper fallback
audio: yt-dlp only when transcript fallback needs it
The system still has unofficial YouTube surfaces, but they are no longer the first tool used for every step.
Next Steps
The next useful follow-up is operational visibility. Once deployed, logs should make it clear how often metadata comes from the Data API versus yt-dlp fallback. If fallback use is rare, yt-dlp can stay focused on true recovery paths.
Another follow-up is quota tracking. The API currently uses a simple key and request timeout. If usage grows, Summry may need dashboard alerts or lightweight internal counters for Data API failures and fallback rates.
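As a starting point for those lightweight internal counters, a per-process tally along these lines may be enough. This is only a sketch of one possible shape; the source labels "data_api" and "yt_dlp" and both function names are invented for illustration.

```python
from collections import Counter

# Per-process tally of which provider served each metadata request.
metadata_source_counts = Counter()


def record_metadata_source(source: str) -> None:
    """Count one metadata lookup, e.g. source="data_api" or source="yt_dlp"."""
    metadata_source_counts[source] += 1


def fallback_rate() -> float:
    """Fraction of recorded lookups that were served by the yt-dlp fallback."""
    total = sum(metadata_source_counts.values())
    if total == 0:
        return 0.0
    return metadata_source_counts["yt_dlp"] / total
```

In a multi-process deployment these counts would be per worker, so a shared metrics backend or structured log lines would be needed for a global view.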