Build log, May 23, 2026 (updated May 30, 2026)

Durable Video Processing Dispatch

Why Summry added a Postgres processing outbox so Redis enqueue failures do not strand pending videos.

Tags: summry, build-log, backend, systems-design, architecture

Problem

Summry's video submission flow had a small but important reliability gap.

The API created the video and chat room in Postgres, committed those rows, and then pushed a processing job into Redis. That meant the user-visible state and the worker execution queue were updated in two separate steps:

commit video/chat rows -> enqueue Redis job

If Redis enqueue failed after the database commit, the app could return an error while still leaving behind a real chat room and a pending video. Future submissions for the same YouTube video could reuse that existing pending row without necessarily creating a new processing job. In the worst case, the video looked like it was waiting for work that no worker would ever receive.

The issue was not that Redis was the wrong tool. Redis is still a good execution queue for this stage of the app. The problem was asking Redis to be the only durable memory of work that had to happen after Postgres had already accepted the user-facing state.

Context

The create-chat workflow has this high-level shape:

  • authenticate the user
  • enforce the video creation rate limit
  • normalize and validate the YouTube URL
  • fetch metadata
  • create or reuse the video
  • create a user-scoped chat room
  • record usage events
  • enqueue processing if needed

The weak point was transaction ownership. video_service had helper-level commits, while chat_service also owned the full workflow commit. That made it harder to reason about partial state, retries, and what should happen when Redis was unavailable.
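
A minimal sketch of workflow-level transaction ownership, using Python's sqlite3 as a stand-in for Postgres so it runs anywhere. The schema and the function names (get_or_create_video, create_chat_room, create_chat) are illustrative assumptions, not Summry's actual code. The helpers write but never commit; the workflow function holds the only commit point, so a failure at any step rolls back everything before it.

```python
import sqlite3

# Hypothetical schema for illustration only.
WORKFLOW_SCHEMA = """
CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT UNIQUE NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE chat_rooms (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    video_id INTEGER NOT NULL REFERENCES videos(id)
);
"""


def get_or_create_video(conn, youtube_id):
    """Helper: reads and writes, but never commits. The caller owns the transaction."""
    row = conn.execute(
        "SELECT id FROM videos WHERE youtube_id = ?", (youtube_id,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO videos (youtube_id, status) VALUES (?, 'pending')",
        (youtube_id,),
    )
    return cur.lastrowid


def create_chat_room(conn, user_id, video_id):
    """Helper: inserts the user-scoped room, again without committing."""
    cur = conn.execute(
        "INSERT INTO chat_rooms (user_id, video_id) VALUES (?, ?)",
        (user_id, video_id),
    )
    return cur.lastrowid


def create_chat(conn, user_id, youtube_id):
    """Workflow: the single commit point. Any exception rolls back both helpers."""
    with conn:  # commits on success, rolls back on exception
        video_id = get_or_create_video(conn, youtube_id)
        room_id = create_chat_room(conn, user_id, video_id)
    return video_id, room_id
```

With this shape, a failed room insert cannot leave an orphaned pending video behind, because the video insert was never committed on its own.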

The worker also had a related UX mismatch. process_video published an error event for each failed attempt, but the worker could still requeue the job. The frontend treated error as terminal, so a retryable backend failure could look final in the browser.

Implementation

The fix was to make processing dispatch durable before Redis enqueue.

The API now creates a Postgres outbox row when a video needs processing. That outbox row is written in the same transaction as the chat room, video state, and usage events. After the transaction commits, the API tries to dispatch the outbox row to Redis. If Redis succeeds, the row is marked dispatched. If Redis fails, the row stays undispatched with failure metadata.

The new flow is:

create video/chat rows
create processing outbox row
commit
try queue dispatch
mark outbox dispatched if dispatch succeeds
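
The flow above can be sketched as follows. This is an illustration under stated assumptions, not the production code: sqlite3 stands in for Postgres, FlakyQueue is a hypothetical in-memory stand-in for the Redis queue, and the processing_outbox columns (dispatched_at, attempts, last_error) are guesses at what "failure metadata" might look like.

```python
import sqlite3

# Hypothetical schema; column names are assumptions for this sketch.
OUTBOX_SCHEMA = """
CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT UNIQUE NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE processing_outbox (
    id INTEGER PRIMARY KEY,
    video_id INTEGER NOT NULL REFERENCES videos(id),
    dispatched_at TEXT,                 -- NULL means not yet dispatched
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT
);
"""


class FlakyQueue:
    """In-memory stand-in for the Redis queue that can be told to fail."""

    def __init__(self, fail=False):
        self.fail = fail
        self.jobs = []

    def enqueue(self, video_id):
        if self.fail:
            raise ConnectionError("queue unavailable")
        self.jobs.append(video_id)


def dispatch_outbox_row(conn, queue, outbox_id, video_id):
    """Try the queue; record success or failure metadata on the outbox row."""
    try:
        queue.enqueue(video_id)
    except ConnectionError as exc:
        with conn:
            conn.execute(
                "UPDATE processing_outbox "
                "SET attempts = attempts + 1, last_error = ? WHERE id = ?",
                (str(exc), outbox_id),
            )
        return False
    with conn:
        conn.execute(
            "UPDATE processing_outbox "
            "SET attempts = attempts + 1, dispatched_at = datetime('now') "
            "WHERE id = ?",
            (outbox_id,),
        )
    return True


def submit_video(conn, queue, youtube_id):
    # One transaction: the video row and its outbox row commit together.
    with conn:
        video_id = conn.execute(
            "INSERT INTO videos (youtube_id, status) VALUES (?, 'pending')",
            (youtube_id,),
        ).lastrowid
        outbox_id = conn.execute(
            "INSERT INTO processing_outbox (video_id) VALUES (?)", (video_id,)
        ).lastrowid
    # Best-effort dispatch after commit; a failure leaves the row for recovery.
    dispatch_outbox_row(conn, queue, outbox_id, video_id)
    return video_id
```

Note that submit_video returns normally even when the queue is down. The durable record of pending work already exists, so the request does not need to fail.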

The worker now drains undispatched outbox rows before blocking on the Redis queue. That gives the system a recovery path:

worker starts loop
dispatch undispatched Postgres outbox rows
wait for queued work
process video

This keeps Redis as the fast execution queue while Postgres owns the durable intent that processing must happen.
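
A rough sketch of the drain-then-wait loop, again with sqlite3 standing in for Postgres and a collections.deque standing in for the Redis queue. The names drain_outbox and worker_iteration, the enqueue callable, and the table layout are assumptions made for this example. The real worker would block on Redis where this sketch simply checks whether the deque is non-empty.

```python
import sqlite3
from collections import deque

# Hypothetical minimal outbox table for this sketch.
DRAIN_SCHEMA = """
CREATE TABLE processing_outbox (
    id INTEGER PRIMARY KEY,
    video_id INTEGER NOT NULL,
    dispatched_at TEXT               -- NULL means not yet dispatched
);
"""


def drain_outbox(conn, enqueue, limit=100):
    """Push undispatched outbox rows to the queue; return how many were sent."""
    rows = conn.execute(
        "SELECT id, video_id FROM processing_outbox "
        "WHERE dispatched_at IS NULL ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    sent = 0
    for outbox_id, video_id in rows:
        try:
            enqueue(video_id)
        except ConnectionError:
            break  # queue still down; leave the remaining rows for the next pass
        # A crash between enqueue() and this UPDATE re-sends the job on the
        # next pass. That is the duplicate window the design accepts.
        with conn:
            conn.execute(
                "UPDATE processing_outbox SET dispatched_at = datetime('now') "
                "WHERE id = ?",
                (outbox_id,),
            )
        sent += 1
    return sent


def worker_iteration(conn, jobs, process):
    """One pass of the worker loop: recover stranded work, then take one job."""
    drain_outbox(conn, jobs.append)
    if jobs:  # the real worker would block on the Redis queue here
        process(jobs.popleft())
```

Marking the row only after the enqueue succeeds biases the sketch toward at-least-once delivery: a job may be sent twice, but it is never silently dropped.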

The implementation also cleaned up the retry event semantics:

  • retryable processing failures now publish attempt_failed
  • worker retries still publish retry
  • terminal failures publish dead_letter
  • the SSE stream closes only on done, dead_letter, or an explicit terminal event
  • the frontend displays retry progress instead of treating every processing failure as final
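
One way to picture these event rules is the sketch below. The event names attempt_failed, retry, dead_letter, and done come from the list above; everything else is assumed for illustration, including the dict shape of an event, the "terminal" flag used to model an explicit terminal event, and the max_attempts parameter.

```python
# Event types that always end the stream.
TERMINAL_TYPES = frozenset({"done", "dead_letter"})


def failure_event(attempt, max_attempts):
    """Pick the event for a failed attempt (attempt is 1-based).

    A failure with attempts remaining is retryable; the last one is terminal.
    """
    if attempt < max_attempts:
        return {"type": "attempt_failed", "attempt": attempt}
    return {"type": "dead_letter", "attempt": attempt}


def is_terminal(event):
    """True for done, dead_letter, or any event explicitly flagged terminal."""
    # The "terminal" flag is a hypothetical way to mark an explicit terminal event.
    return event["type"] in TERMINAL_TYPES or event.get("terminal") is True


def stream_events(events):
    """Yield events to the client and stop after the first terminal one."""
    for event in events:
        yield event
        if is_terminal(event):
            return
```

Under these rules an attempt_failed event passes through to the browser without closing the stream, so the frontend can show retry progress until done or dead_letter arrives.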

Tradeoffs

This adds another database table and one more worker responsibility. That is more moving parts than directly pushing to Redis from the API.

The tradeoff is worth it because the boundary is clearer:

  • Postgres owns durable application state and durable processing intent.
  • Redis owns execution queueing, retries, and pub/sub progress.
  • The worker bridges the two when API-side dispatch cannot.

The system can still produce duplicate queued jobs in rare crash windows. That is acceptable for this stage because processing is already idempotent enough to skip videos that are ready. A stricter multi-worker design can wait until worker count and queue pressure justify it.
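
To illustrate why a duplicate job is tolerable, here is a small sketch of the skip-if-ready check. The in-memory videos dict, the status values, and the do_work callable are assumptions for the example; the real process_video presumably reads the status from Postgres.

```python
def process_video(videos, video_id, do_work):
    """Process a video unless an earlier job already finished it."""
    video = videos[video_id]
    if video["status"] == "ready":
        return "skipped"  # a duplicate job arrived after the work was done
    do_work(video)
    video["status"] = "ready"
    return "processed"
```

This check only covers duplicates that arrive after processing has finished. Two workers picking up the same job at the same moment would both see a pending status, which is the case that stricter claiming semantics would need to address.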

Lessons Learned

The main lesson was that "enqueue after commit" is not quite enough when the enqueue is required for the committed state to make progress.

The database did not just need to store videos and chat rooms. It needed to store the fact that a video had work pending. Once that was modeled explicitly, the rest of the design became easier to reason about:

  • the API can return successfully after durable intent is saved
  • Redis outages do not permanently strand pending videos
  • the worker has a clear recovery task before waiting for new jobs
  • retry events can describe the actual lifecycle instead of prematurely ending the browser stream

This also reinforced the value of keeping transaction ownership visible at the workflow level. Hidden commits inside helpers make small reliability issues harder to see because state changes stop lining up with the product operation the user initiated.

Next Steps

The next practical step is production observation.

Useful signals to watch after deployment:

  • undispatched outbox row count
  • outbox dispatch failures
  • Redis queue depth
  • worker retry and dead-letter frequency
  • videos that remain pending for too long
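
As a starting point, the database-side signals could be collected with queries along these lines. This is a sketch against a hypothetical schema: the created_at, last_error, and dispatched_at columns and the ten-minute staleness threshold are assumptions, and sqlite3 again stands in for Postgres (so date arithmetic uses SQLite's datetime function). Redis queue depth and worker retry counts would come from Redis and the worker, and are not shown.

```python
import sqlite3

# Hypothetical schema for illustration only.
METRICS_SCHEMA = """
CREATE TABLE videos (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE processing_outbox (
    id INTEGER PRIMARY KEY,
    video_id INTEGER NOT NULL,
    dispatched_at TEXT,
    last_error TEXT
);
"""


def outbox_health(conn, stale_after_seconds=600):
    """Return counts for undispatched rows, failing rows, and stale pending videos."""
    undispatched, failing = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(last_error IS NOT NULL), 0) "
        "FROM processing_outbox WHERE dispatched_at IS NULL"
    ).fetchone()
    (stale_pending,) = conn.execute(
        "SELECT COUNT(*) FROM videos "
        "WHERE status = 'pending' AND created_at <= datetime('now', ?)",
        (f"-{int(stale_after_seconds)} seconds",),
    ).fetchone()
    return {
        "undispatched": undispatched,
        "failing": failing,
        "stale_pending": stale_pending,
    }
```

A non-zero undispatched count that drains quickly is the outbox doing its job. A count that keeps growing, or a rising stale_pending, suggests the worker's recovery pass is not running or the queue is still unreachable.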

If those metrics show repeated duplicate jobs or long-lived pending rows, the next refinement would be stronger worker claiming semantics.