build log, May 22, 2026 (updated May 30, 2026)

YouTube Data API Metadata

How Summry moved submit-time YouTube metadata lookup to the official YouTube Data API while keeping yt-dlp as a fallback.

summry build-log backend infra debugging

Problem

After moving the worker to a transcript-first pipeline, one brittle piece still remained near the front of the product flow: metadata extraction.

Before this change, submitting a YouTube URL still depended on yt-dlp before Summry could create or reuse a video record:

normalize URL -> yt-dlp metadata -> duration check -> create chat room -> enqueue processing

That meant production could still fail before the new transcript-first worker ever had a chance to run. The worker no longer needed to download audio by default, but the API still had to ask yt-dlp for title, channel, duration, thumbnail, tags, and category data during chat creation.

The product needs that metadata, but it does not need a media extractor to get it.

Context

The previous transcript-first change reduced worker-side dependence on yt-dlp by trying youtube-transcript-api before audio download and Whisper transcription.

This change applies the same idea one step earlier in the workflow:

normalize URL
  -> YouTube Data API metadata
  -> yt-dlp metadata fallback
  -> duration check
  -> create or reuse video

The goal was not to remove yt-dlp completely. yt-dlp still has value as a fallback and remains necessary for the audio transcription fallback. The goal was to stop making yt-dlp the first required metadata path in production.

Implementation

I kept the public integration contract stable:

integrations/youtube_service.py
get_youtube_metadata_from_url(youtube_url: str) -> YoutubeMetaData

That let the service layer continue using the same metadata object without changing routes, repositories, schemas, or frontend contracts.

The integration now checks for:

YOUTUBE_DATA_API_KEY

When the key is configured, it calls the official YouTube Data API videos.list endpoint with:

part=snippet,contentDetails,status
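
As a rough illustration of that request, here is a minimal sketch using only the standard library. The helper names `build_videos_list_url` and `fetch_video_items` are hypothetical and are not taken from Summry's code; the endpoint URL and the `part`, `id`, and `key` parameters are the documented ones for videos.list.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

VIDEOS_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"


def build_videos_list_url(video_id: str, api_key: str) -> str:
    """Build the videos.list URL for a single video id (hypothetical helper)."""
    params = {
        "part": "snippet,contentDetails,status",
        "id": video_id,
        "key": api_key,
    }
    return f"{VIDEOS_ENDPOINT}?{urlencode(params)}"


def fetch_video_items(video_id: str, api_key: str, timeout_seconds: int = 10) -> list:
    """Call videos.list and return the 'items' list (empty if no video matched)."""
    url = build_videos_list_url(video_id, api_key)
    # The timeout keeps a slow request from blocking the caller indefinitely.
    with urlopen(url, timeout=timeout_seconds) as response:
        payload = json.load(response)
    return payload.get("items", [])
```

Keeping URL construction separate from the network call makes the request parameters easy to check in a unit test without touching the network.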

The response is mapped into the existing YoutubeMetaData shape:

snippet.title -> title
snippet.description -> description
snippet.channelTitle -> channel name
snippet.channelId -> channel id
snippet.tags -> tags
snippet.thumbnails -> best available thumbnail URL
snippet.publishedAt -> upload date
contentDetails.duration -> duration seconds
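
The mapping above could be sketched roughly as follows. `MetadataSketch`, `pick_best_thumbnail`, and `map_video_item` are hypothetical stand-ins, not the real YoutubeMetaData class or its helpers. The thumbnail preference order (largest first) is an assumption about what "best available" means, and the duration is kept as the raw ISO-8601 string to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Thumbnail keys the Data API may return, ordered largest to smallest.
THUMBNAIL_PREFERENCE = ("maxres", "standard", "high", "medium", "default")


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for the YoutubeMetaData shape."""
    title: str
    description: str
    channel_name: str
    channel_id: str
    tags: List[str] = field(default_factory=list)
    thumbnail_url: Optional[str] = None
    upload_date: Optional[str] = None
    duration_iso: Optional[str] = None


def pick_best_thumbnail(thumbnails: Dict[str, Any]) -> Optional[str]:
    """Return the URL of the largest thumbnail present, or None."""
    for size in THUMBNAIL_PREFERENCE:
        url = thumbnails.get(size, {}).get("url")
        if url:
            return url
    return None


def map_video_item(item: Dict[str, Any]) -> MetadataSketch:
    """Map one videos.list item into the normalized metadata shape."""
    snippet = item.get("snippet", {})
    content_details = item.get("contentDetails", {})
    return MetadataSketch(
        title=snippet.get("title", ""),
        description=snippet.get("description", ""),
        channel_name=snippet.get("channelTitle", ""),
        channel_id=snippet.get("channelId", ""),
        tags=list(snippet.get("tags", [])),  # tags are absent when a video has none
        thumbnail_url=pick_best_thumbnail(snippet.get("thumbnails", {})),
        upload_date=snippet.get("publishedAt"),
        duration_iso=content_details.get("duration"),
    )
```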

The Data API returns duration as ISO-8601, so the integration now parses values like:

PT1H2M3S

into seconds before the existing duration limit check runs.
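
A minimal sketch of that conversion, assuming durations only use day, hour, minute, and second components (the function name `parse_iso8601_duration` is hypothetical, and week, month, or year designators are deliberately not handled):

```python
import re

# Matches forms such as PT1H2M3S, PT45S, P1DT2H, or P0D.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_iso8601_duration(value: str) -> int:
    """Convert an ISO-8601 duration string into whole seconds."""
    match = _DURATION_RE.match(value)
    # A bare "P" or "PT" matches the pattern but carries no components.
    if match is None or value in ("P", "PT"):
        raise ValueError(f"Unsupported duration: {value!r}")
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(parse_iso8601_duration("PT1H2M3S"))  # 3723
```

Raising on unrecognized input, rather than returning 0, lets the caller treat a bad duration as a mapping failure instead of silently passing the duration limit check.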

Category Names

The existing database stores categories as a list of strings. yt-dlp usually returns human-readable category names, while the Data API video resource returns only an id in snippet.categoryId.

To preserve a useful human-readable value, the integration calls videoCategories.list for the category id and stores the category title. Category lookups are cached in memory per process, so repeated videos in the same category do not each trigger another lookup.

If category lookup fails, Summry does not fail the whole submit flow. It logs the category lookup failure and stores the raw category id as the fallback category value.

Fallback Behavior

The Data API path is primary only when YOUTUBE_DATA_API_KEY is configured.

If the key is missing, the app behaves like before and uses yt-dlp metadata extraction.

If the key exists but the Data API request fails, returns no video items, or cannot map the response, the integration logs the failure and falls back to yt-dlp. This keeps local development and production recovery behavior intact while still shifting the happy path to the official API.
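
The decision logic described above might look like the following sketch. `fetch_with_fallback` and its callables are hypothetical. It assumes the primary path signals every failure mode (HTTP error, empty items, unmappable response) by raising an exception.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def fetch_with_fallback(
    api_key: Optional[str],
    primary: Callable[[], T],
    fallback: Callable[[], T],
) -> T:
    """Use the Data API path when a key is configured, else (or on failure) yt-dlp."""
    if not api_key:
        # No key configured: behave exactly as before the change.
        return fallback()
    try:
        return primary()
    except Exception:
        logger.warning("Data API metadata lookup failed; falling back to yt-dlp")
        return fallback()
```

If the fallback itself raises, the exception propagates to the caller, which matches the pre-change behavior of a failed yt-dlp lookup.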

Configuration

Two config values were added:

YOUTUBE_DATA_API_KEY
YOUTUBE_DATA_API_TIMEOUT_SECONDS=10

The key is optional. The timeout must be a positive integer; it ensures that a slow metadata request cannot hang chat creation indefinitely.
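
A minimal sketch of that validation, assuming the values are read from environment-style strings (`YoutubeDataApiConfig` and `load_config` are hypothetical names; the real project may use a settings library instead):

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class YoutubeDataApiConfig:
    api_key: Optional[str]
    timeout_seconds: int


def load_config(env: Mapping[str, str]) -> YoutubeDataApiConfig:
    """Read the two settings, treating the key as optional and validating the timeout."""
    # An unset or empty key means "use yt-dlp only".
    api_key = env.get("YOUTUBE_DATA_API_KEY") or None
    raw_timeout = env.get("YOUTUBE_DATA_API_TIMEOUT_SECONDS", "10")
    try:
        timeout_seconds = int(raw_timeout)
    except ValueError:
        raise ValueError(
            f"YOUTUBE_DATA_API_TIMEOUT_SECONDS must be an integer, got {raw_timeout!r}"
        ) from None
    if timeout_seconds <= 0:
        raise ValueError("YOUTUBE_DATA_API_TIMEOUT_SECONDS must be positive")
    return YoutubeDataApiConfig(api_key=api_key, timeout_seconds=timeout_seconds)
```

In an application this would typically be called as `load_config(os.environ)`; passing the mapping in explicitly keeps the function easy to test.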

The deployment docs now describe the YouTube stack in three layers:

YouTube Data API for submit-time metadata
youtube-transcript-api for worker-side transcript acquisition
yt-dlp for metadata fallback and audio fallback

Tradeoffs

This change makes the submit path less dependent on yt-dlp, but it introduces a new operational dependency: YouTube Data API quota and key management.

The upside is that quota and API keys are more predictable than media extraction failures. A quota problem is visible and manageable. A YouTube extractor or bot-check failure can be harder to reason about from a production container.

The tradeoffs are:

production should configure YOUTUBE_DATA_API_KEY
metadata requests now consume YouTube Data API quota
category names may require an extra cached lookup
yt-dlp still needs to remain healthy for fallback cases

That feels like the right trade: make the official API the front door, keep the unofficial extractor as the escape hatch.

Tests

The implementation added coverage for:

optional Data API config and timeout validation
videos.list request parameters
mapping Data API payloads into YoutubeMetaData
ISO-8601 duration parsing
best-thumbnail selection
empty Data API results falling back to yt-dlp
HTTP failures falling back to yt-dlp
category name lookup
category cache behavior
category lookup failure falling back to the raw category id

The full backend suite passed:

131 passed

Lessons Learned

The important design move was keeping the metadata contract stable while changing the provider behind it.

video_service should not care whether metadata came from yt-dlp or the YouTube Data API. It needs a normalized title, channel, duration, thumbnail, tags, categories, and upload date. By preserving that boundary, the implementation could reduce production brittleness without expanding the change into the rest of the app.

This also makes the YouTube integration easier to reason about:

metadata: YouTube Data API first, yt-dlp fallback
transcript: youtube-transcript-api first, Whisper fallback
audio: yt-dlp only when transcript fallback needs it

The system still has unofficial YouTube surfaces, but they are no longer the first tool used for every step.

Next Steps

The next useful follow-up is operational visibility. Once deployed, logs should make it clear how often metadata comes from the Data API versus yt-dlp fallback. If fallback use is rare, yt-dlp can stay focused on true recovery paths.

Another follow-up is quota tracking. The integration currently relies on a single API key and a request timeout. If usage grows, Summry may need dashboard alerts or lightweight internal counters for Data API failures and fallback rates.
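
As one possible shape for such a counter, here is a small in-process sketch. Nothing like this exists in the codebase yet; `MetadataSourceStats` and the source labels "data_api" and "yt_dlp" are hypothetical.

```python
from collections import Counter


class MetadataSourceStats:
    """Counts which provider served each metadata request (per process)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, source: str) -> None:
        """Record one metadata lookup served by the given source label."""
        self._counts[source] += 1

    def fallback_rate(self) -> float:
        """Fraction of recorded lookups that were served by yt-dlp."""
        total = sum(self._counts.values())
        if total == 0:
            return 0.0
        return self._counts["yt_dlp"] / total
```

Counts held in memory reset on restart and are not shared across workers, so this would only be a stopgap before real metrics or log-based dashboards.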