Build log | May 24, 2026 | Updated May 30, 2026

Video Processing Retry Backoff

Moving video processing retries from immediate requeueing to delayed exponential backoff.

Tags: summry, build-log, backend, systems-design, infra

Problem

Summry already had retry and dead-letter behavior for failed video processing jobs, but the retry timing was too eager.

When a processing attempt failed under the configured retry limit, the worker immediately pushed the job back onto the Redis processing queue with an incremented attempt count:

failure -> increment attempts -> push back onto processing queue

That meant retries existed, but backoff did not. A transient failure could be retried almost immediately, which is rarely what we want for work that depends on external systems like YouTube transcript fetching, yt-dlp, Whisper, embeddings, or an LLM provider.

The risk was not just wasted work. Immediate retries can compress several failures into a short window, make logs noisier, and put pressure on the same dependency that may need a little time to recover.

Context

The worker already had a useful reliability shape:

process a Redis video job
mark retryable failures with a retry event
retry up to the configured maximum
write terminal failures to the dead-letter queue
keep progress events flowing through Redis pub/sub
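
As a rough illustration of that shape, here is a minimal sketch of the retry-or-dead-letter decision. It is not Summry's actual code: the key names, the channel name, the retry limit, the event fields, and the handle_failure helper are all hypothetical, and redis_client is assumed to be a redis-py style client.

```python
import json

MAX_ATTEMPTS = 3                              # hypothetical retry limit
DEAD_LETTER_QUEUE = "video:processing:dead"   # hypothetical key name
PROGRESS_CHANNEL = "video:progress"           # hypothetical pub/sub channel


def handle_failure(redis_client, job: dict, error: Exception, requeue) -> str:
    """Route a failed job to another attempt or to the dead-letter queue.

    `requeue` is whatever callable puts the job back in line for a retry.
    """
    attempts = job.get("attempts", 0)
    video_id = job.get("video_id")

    if attempts < MAX_ATTEMPTS:
        # Retryable: publish a retry event, then hand the job back for requeueing.
        event = {"video_id": video_id, "event": "retry", "attempts": attempts + 1}
        redis_client.publish(PROGRESS_CHANNEL, json.dumps(event))
        requeue(job)
        return "retry"

    # Terminal: keep the job and its error in the dead-letter queue for inspection.
    dead = {**job, "error": str(error)}
    redis_client.rpush(DEAD_LETTER_QUEUE, json.dumps(dead))
    event = {"video_id": video_id, "event": "failed", "attempts": attempts}
    redis_client.publish(PROGRESS_CHANNEL, json.dumps(event))
    return "dead_letter"
```

Nothing in this sketch decides when the next attempt runs. That is left entirely to the requeue callable.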

The missing piece was time.

Retries should still be owned by Redis, because Redis already owns execution queueing and pub/sub progress. But a delayed retry should not block the worker with an in-memory sleep(). If the worker sleeps while holding a failed job, a restart can lose the pending retry, not just its timing. Sleeping also ties the retry delay to one worker process instead of to shared queue state.

Implementation

The change adds a Redis sorted set for scheduled retries.

New configuration controls the retry queue name, the initial delay, and the maximum delay. The exact values are environment-specific so they can be tuned without changing worker code.

When a job fails under the retry limit, requeue_video_processing_job now:

increments the attempt count
calculates an exponential delay
caps that delay at VIDEO_PROCESS_RETRY_BACKOFF_MAX_SECONDS
writes the serialized retry job into the configured retry queue
uses the due timestamp as the sorted-set score
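
The steps above can be sketched roughly as follows. This is an assumption-laden illustration rather than the real requeue_video_processing_job: the schedule_retry name, the retry queue key, the initial delay, the doubling factor, and the job fields are hypothetical. MAX_DELAY_SECONDS stands in for VIDEO_PROCESS_RETRY_BACKOFF_MAX_SECONDS with a made-up value. The only Redis call used is redis-py's zadd(name, mapping).

```python
import json
import time
from typing import Optional

RETRY_QUEUE = "video:processing:retry"  # hypothetical retry queue name
INITIAL_DELAY_SECONDS = 5               # hypothetical initial delay
MAX_DELAY_SECONDS = 300                 # hypothetical value for the configured cap


def schedule_retry(redis_client, job: dict, now: Optional[float] = None) -> float:
    """Store a failed job in the retry sorted set, scored by its due time."""
    now = time.time() if now is None else now

    # 1. Increment the attempt count.
    attempts = job.get("attempts", 0) + 1

    # 2-3. Exponential delay (doubling is an assumption), capped at the maximum.
    delay = min(INITIAL_DELAY_SECONDS * 2 ** (attempts - 1), MAX_DELAY_SECONDS)
    due_at = now + delay

    # 4-5. Serialize the job and use the due timestamp as the sorted-set score.
    payload = json.dumps({**job, "attempts": attempts}, sort_keys=True)
    redis_client.zadd(RETRY_QUEUE, {payload: due_at})
    return due_at
```

Because the score is an absolute timestamp, any worker can later ask Redis for "everything due by now" without knowing which process scheduled the retry.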

The delay shape is:

attempt 1 -> short delay
attempt 2 -> longer delay
attempt 3 -> longer delay again
...
cap at configured maximum
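
To make that shape concrete, here is a small worked example. The real initial delay, growth factor, and cap are environment-specific and not given here, so the values below (5 seconds, doubling, 60-second cap) are purely illustrative, and retry_delay_seconds is a hypothetical helper.

```python
def retry_delay_seconds(attempt: int, initial: int, maximum: int) -> int:
    """Delay before a given retry attempt: initial * 2^(attempt - 1), capped."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(initial * 2 ** (attempt - 1), maximum)


# Illustrative values only: 5 second initial delay, 60 second cap.
delays = [retry_delay_seconds(n, initial=5, maximum=60) for n in range(1, 7)]
print(delays)  # [5, 10, 20, 40, 60, 60]
```

With these numbers the cap takes over at the fifth attempt, so later retries stay at a fixed, predictable interval instead of growing without bound.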

The worker now drains due retry jobs before popping new work:

dispatch undispatched outbox rows
drain due retry jobs back onto the processing queue
wait for queued work
process video

One subtle detail mattered here: a blocking queue read can wait indefinitely. If the worker blocked forever while no fresh jobs were arriving, a retry could become due in the sorted set without anything waking the loop to drain it. The worker now uses a short pop timeout so it periodically checks for due retries.
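
A minimal sketch of one loop iteration, assuming a redis-py style client, might look like this. The key names, the timeout value, and the function names are hypothetical, and the outbox dispatch step is left out to keep the example short. The remove-then-push order in the drain is also an assumption, not something the worker is known to do.

```python
import time
from typing import Optional

PROCESSING_QUEUE = "video:processing"    # hypothetical key name
RETRY_QUEUE = "video:processing:retry"   # hypothetical key name
POP_TIMEOUT_SECONDS = 5                  # hypothetical short pop timeout


def drain_due_retries(redis_client, now: Optional[float] = None) -> int:
    """Move retry jobs whose due time has passed back onto the processing queue."""
    now = time.time() if now is None else now
    moved = 0
    for payload in redis_client.zrangebyscore(RETRY_QUEUE, "-inf", now):
        # Only push what this call actually removed, so two drainers do not
        # both enqueue the same job. A crash between the two calls loses it.
        if redis_client.zrem(RETRY_QUEUE, payload):
            redis_client.rpush(PROCESSING_QUEUE, payload)
            moved += 1
    return moved


def worker_iteration(redis_client, process_job) -> bool:
    """Drain due retries, then wait briefly for work. Returns True if a job ran."""
    drain_due_retries(redis_client)
    # A bounded timeout guarantees the loop comes back around to drain retries
    # even when no fresh jobs arrive.
    popped = redis_client.blpop([PROCESSING_QUEUE], timeout=POP_TIMEOUT_SECONDS)
    if popped is None:
        return False
    _key, payload = popped
    process_job(payload)
    return True
```

The timeout bounds how late a due retry can be picked up by an idle worker: at most one timeout interval after its due time.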

The public job contract stayed the same. That keeps the delayed-retry change inside the Redis queue integration and worker loop rather than spreading it through the processing service or API layer.

Benefits

The main benefit is calmer failure behavior.

Video processing can now give transient dependency failures time to recover before trying the same expensive workflow again. That matters because a single video processing attempt can touch several slower or failure-prone systems:

YouTube transcript APIs
yt-dlp fallback downloads
Whisper transcription
embedding generation
LLM summarization
Redis progress events
Postgres persistence

The change also keeps retry timing durable enough for the current architecture. A delayed retry is stored in Redis instead of living in a sleeping worker coroutine.

Operationally, the logs now expose more useful retry information without requiring operators to reconstruct the retry lifecycle by hand:

retry attempt count
calculated retry delay range
count of due retries drained back into the processing queue

Tradeoffs

This introduces one more Redis data structure and one more worker responsibility.

The previous implementation was simple:

failed job -> normal queue

The new implementation is more deliberate:

failed job -> retry sorted set -> due drain -> normal queue

That extra step is worth it because retry timing becomes explicit and configurable, but it does mean the worker must keep draining due retries. The short queue-read timeout is a practical compromise: the worker remains mostly blocking and efficient, while still waking periodically to move due retry jobs.

Moving a job from the retry sorted set to the processing queue does not carry the guarantees of a full transactional queueing system. In the current single-worker-oriented design, that is acceptable. If Summry later runs multiple workers at higher volume, this path may deserve stronger atomic move semantics with a Lua script or a more formal queue library.
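
For reference, one possible shape for that stronger atomic move is sketched below. It is not part of the current implementation. It relies on the fact that Redis runs a Lua script atomically, and on redis-py's register_script, which returns a callable taking keys and args. The script text, the make_due_retry_mover helper, and the batch size are hypothetical.

```python
import time
from typing import Callable, Optional

# Runs atomically inside Redis: select due members, then remove and enqueue each.
# KEYS[1] = retry sorted set, KEYS[2] = processing queue
# ARGV[1] = current timestamp, ARGV[2] = maximum number of jobs to move
MOVE_DUE_RETRIES_LUA = """
local due = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, ARGV[2])
for _, payload in ipairs(due) do
  redis.call('ZREM', KEYS[1], payload)
  redis.call('RPUSH', KEYS[2], payload)
end
return #due
"""


def make_due_retry_mover(redis_client, retry_key: str, queue_key: str) -> Callable[..., int]:
    """Register the script once and return a function that moves due retries."""
    script = redis_client.register_script(MOVE_DUE_RETRIES_LUA)

    def move_due(now: Optional[float] = None, batch_size: int = 100) -> int:
        now = time.time() if now is None else now
        # Returns the number of jobs moved in this call.
        return script(keys=[retry_key, queue_key], args=[now, batch_size])

    return move_due
```

With this approach no other client can observe a job that has left the sorted set but not yet reached the queue, and concurrent workers cannot move the same job twice. Both keys would need to live on the same node if Redis Cluster were ever involved.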

Lessons Learned

Retries are not just a count. They are a timing policy.

The old implementation answered "how many times should we retry?" but not "when should the next attempt happen?" That second question matters once the worker is doing real production work against external services.