Make Whisper resilient: chunking, retry caps, transcript cache by JoseBallestas · Pull Request #10 · bradautomates/claude-video

JoseBallestas · 2026-05-02T19:07:43Z

Summary

Fixes a chain of related Whisper transcription bugs that prevented /watch from working on anything longer than ~10 min, and adds a transcript cache so re-runs are free.

Discovered while trying to use /watch on a 40-min local lesson video. A single transient Groq 500 turned into a 4-attempt retry storm that burned through Groq's 7200s/hour quota and produced zero transcript. More problems surfaced from there: the script also re-uploaded the full audio even with --start/--end, the failure message misreported why the transcript was missing, and there was no way to process audio over Groq's 25 MB single-file cap (~52 min).

Five commits, each a self-contained fix:

Cap 5xx retries to prevent quota burn: MAX_5XX_RETRIES = 2 (the initial attempt plus one retry), symmetric with the existing 429 cap. Each retry re-uploads the audio and counts against the per-hour quota, so 4 attempts on a 40-min file bill 160 min (~2.7 hours) of "audio", which exceeds the 7200s/hour quota and locks the user out of the free tier.
Surface the actual failure reason in the report — when Whisper failed for any reason (rate limit, network, parse), the markdown report hardcoded "no API key set, or --no-whisper was used" — neither was true. Now it surfaces the real exception.
Extract only the focused window for Whisper — --start/--end previously constrained frame extraction but Whisper still got the full audio. On a 40-min video focused to 5 min, that meant uploading 18 MB to transcribe 2 MB worth of content. Now extract_audio() honors -ss/-to and transcribe_video() offsets the returned segment timestamps so they align with the source-video timeline.
Chunk Whisper uploads — audio over 10 min is split into chunks, each uploaded independently. A chunk that fails permanently is reported and skipped — caller gets segments from the successful chunks plus a list of (start, end, reason) tuples for failures. Fixes the hard 25 MB / ~52 min cap and means a single 5xx no longer kills the entire transcript.
Cache successful chunk transcripts — successful transcriptions cached to ~/.cache/watch/chunks/ keyed by (file path + size + mtime + window + backend). On re-run with matching inputs, both extraction and the API call are skipped. Means partial-failure recovery (focused re-run on the missing window) is free for the chunks that already worked.

Real-world validation

After landing all five fixes, I used the patched /watch to transcribe a complete 16-lesson storyboarding course: about 14 hours of video, with more than 10,000 Whisper segments returned. Several lessons exceeded 1 hour individually; lesson 13 was 2h 20m (categorically beyond Groq's free-tier single-file cap pre-fix). Multiple lessons hit transient 500s and 429s mid-run; all recovered cleanly. One lesson's first chunk failed both retries; the cache let me kill the run, wait for the rolling-hour window to reset, retry, and pick up exactly where the previous attempt left off without re-paying for the chunks that had already succeeded.

Without these fixes, none of that would have been possible on Groq's free tier.

Test plan

Unit tests for retry-cap behavior (5xx, 429, 4xx, network errors) — see test runs in commits
Unit tests for ffmpeg start/end window arg construction
Unit tests for chunking + partial-failure path + total-failure path
Unit tests for cache hit / miss / invalidation paths (5 scenarios incl. partial-cache recovery)
End-to-end on 14 hours of real audio across 16 separate runs, including focused re-runs and post-quota recovery scenarios

Notes

All changes live in scripts/whisper.py and scripts/watch.py. No new dependencies. The one signature change is that the transcribe_video return value grew from a 2-tuple to a 3-tuple; the only caller is watch.py, which I updated.
The cache is implicit (always on); to opt out, delete ~/.cache/watch/chunks/. It never breaks the pipeline: load failures silently fall through to a fresh upload.
CHUNK_DURATION_SECONDS = 600 chosen to keep each chunk well under the 25 MB cap (~4.7 MB at 64 kbps mono) and divide cleanly into Groq's 7200s/hour quota.
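The sizing argument above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch, not code from the PR: it assumes a constant 64 kbps bitrate and decimal megabytes, and every name other than CHUNK_DURATION_SECONDS is made up for illustration.

```python
# Back-of-envelope check of the chunk-size choice.
# Figures (64 kbps, 25 MB, 7200 s/hour) come from the PR text;
# constant bitrate and decimal megabytes are assumptions.
CHUNK_DURATION_SECONDS = 600
BITRATE_KBPS = 64                # mono audio bitrate quoted in the notes
FILE_CAP_MB = 25                 # per-file upload cap quoted in the notes
QUOTA_SECONDS_PER_HOUR = 7200    # per-hour audio quota quoted in the notes

bytes_per_second = BITRATE_KBPS * 1000 / 8

# Nominal size of one full chunk, in decimal MB.
chunk_mb = CHUNK_DURATION_SECONDS * bytes_per_second / 1_000_000

# Longest audio that fits in a single upload at this bitrate.
max_single_file_minutes = FILE_CAP_MB * 1_000_000 / bytes_per_second / 60

# How many full chunks fit in one hour of quota.
chunks_per_quota_hour = QUOTA_SECONDS_PER_HOUR // CHUNK_DURATION_SECONDS

print(f"{chunk_mb:.1f} MB per chunk")
print(f"{max_single_file_minutes:.0f} min fits under the cap")
print(f"{chunks_per_quota_hour} chunks per quota hour")
```

The nominal chunk size lands close to the ~4.7 MB quoted above (the small gap depends on the MB convention and actual encoder output), and the single-file limit comes out near the ~52 min figure.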

🤖 Generated with Claude Code

Server-error retries re-uploaded the full audio each time, counting against Groq's per-hour ASPH limit. A single 40-min file with 4 attempts exceeded the 7200s/hour cap, locking the user out of their own free tier when the original failure was just a transient 500.

Cap 5xx retries at 2 attempts (initial + 1 retry), symmetric with the existing 429 handling. Network errors still get all 4 attempts since no payload has been uploaded yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
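One way the asymmetric caps described in this commit message could look is sketched below. This is an illustration under assumptions, not the code in scripts/whisper.py: the exception classes, `upload_with_caps`, the `upload` callable, and the backoff schedule are all hypothetical, and 429 handling is left out.

```python
import time

MAX_ATTEMPTS = 4        # overall cap; network errors may use all of these
MAX_5XX_ATTEMPTS = 2    # initial request plus one retry


class ServerError(Exception):
    """Hypothetical: the server replied 5xx after the audio was uploaded."""


class NetworkError(Exception):
    """Hypothetical: the connection failed before any payload was uploaded."""


def upload_with_caps(upload, sleep=time.sleep):
    """Call upload() until it succeeds or a cap is reached, then re-raise."""
    server_errors = 0
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return upload()
        except ServerError as exc:
            # Each 5xx means a full upload was billed, so stop early.
            server_errors += 1
            last_error = exc
            if server_errors >= MAX_5XX_ATTEMPTS:
                break
        except NetworkError as exc:
            # Nothing was uploaded, so retrying costs no quota.
            last_error = exc
        if attempt < MAX_ATTEMPTS:
            sleep(2 ** attempt)  # assumed exponential backoff
    raise last_error
```

Passing `sleep` as a parameter keeps the function easy to unit-test without real delays.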

When Whisper failed for any reason (rate limit, network error, parse error), the final report hardcoded "no API key set, or --no-whisper was used" even when neither was true. Confusing, and steers users toward re-running setup.py when the real problem was elsewhere.

Track the actual failure reason from each fallback path (subtitle parse, no key, --no-whisper, Whisper exception) and surface it in the report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
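The idea of carrying a reason alongside the (possibly empty) transcript can be sketched as follows. The function names, argument names, and message wording here are hypothetical and are not taken from watch.py.

```python
from typing import Callable, List, Optional, Tuple


def pick_transcript(
    api_key: Optional[str],
    no_whisper: bool,
    transcribe: Callable[[], List[dict]],
) -> Tuple[List[dict], Optional[str]]:
    """Return (segments, reason); reason is None when transcription worked."""
    if no_whisper:
        return [], "--no-whisper was used"
    if not api_key:
        return [], "no API key set"
    try:
        return transcribe(), None
    except Exception as exc:
        # Keep the real cause instead of a hardcoded guess.
        return [], f"Whisper failed: {type(exc).__name__}: {exc}"


def transcript_note(segments: List[dict], reason: Optional[str]) -> str:
    """Build the line shown in the report when no transcript is available."""
    if segments:
        return ""
    return f"No transcript available: {reason or 'unknown reason'}"
```

With this shape, a rate-limit or server error shows up verbatim in the report rather than being mistaken for a missing key.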

Previously --start/--end constrained frame extraction but Whisper still got the full audio file. On a 40-min video focused to 5 min, that meant uploading 18 MB to transcribe 2 MB worth of content: wasteful, and 8x more quota burn against Groq's per-hour ASPH limit.

extract_audio() now accepts start/end seconds and passes them to ffmpeg as -ss/-to. transcribe_video() forwards them through and offsets the returned segment timestamps so they align with the source video timeline.

Live test on the same 40-min file focused to first 5 min: audio dropped from 18763 kB → 2345 kB and Whisper succeeded on the first attempt (after Groq had been 500ing the larger payload all session).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
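A minimal sketch of the two pieces described here, windowed extraction arguments and timestamp offsetting, might look like this. The helper names, the segment dict shape, and the encoding flags are assumptions. Placing -ss and -to before -i (so both refer to positions in the input) is one valid arrangement, not necessarily the one extract_audio() uses.

```python
from typing import List, Optional


def build_audio_cmd(
    src: str,
    dst: str,
    start: Optional[float] = None,
    end: Optional[float] = None,
) -> List[str]:
    """Build an ffmpeg argv that extracts mono audio for an optional window."""
    cmd = ["ffmpeg", "-y"]
    # As input options, -ss and -to are both positions in the source file.
    if start is not None:
        cmd += ["-ss", str(start)]
    if end is not None:
        cmd += ["-to", str(end)]
    # -vn drops video; mono at 64 kbps matches the bitrate quoted in the PR.
    cmd += ["-i", src, "-vn", "-ac", "1", "-b:a", "64k", dst]
    return cmd


def offset_segments(segments: List[dict], start: float) -> List[dict]:
    """Shift window-relative timestamps back onto the source timeline."""
    return [
        {**seg, "start": seg["start"] + start, "end": seg["end"] + start}
        for seg in segments
    ]
```

For example, a segment at 1.5 s inside a window that began at 600 s ends up at 601.5 s on the source timeline.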

Audio over 10 min is now split into chunks before upload. Each chunk is independently retryable, and a chunk that fails permanently is reported without taking down the rest of the transcript.

Why: Groq has a 25 MB per-file cap (~52 min at our bitrate) so anything longer can't fit in one request anyway. More importantly, when a single upload hit a transient 500 the entire transcript was lost; now the rest of the chunks still come through.

Live test on the same 40-min file that had been failing repeatedly: 5/5 chunks succeeded (586 segments) despite one transient 500 and one 429 wait, exactly the failure modes that left us with zero transcript before.

transcribe_video() now returns (segments, backend, failures), where failures is a list of (start, end, reason) tuples for any chunks that didn't make it. watch.py surfaces these as a "Partial transcript" note above the transcript block so the user knows which windows are missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
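The chunking and partial-failure behaviour described above can be illustrated with a small sketch. It assumes fixed-size windows with a shorter final chunk; `chunk_windows`, `transcribe_chunks`, and the `transcribe_window` callable are hypothetical stand-ins rather than the functions in whisper.py.

```python
from typing import Callable, List, Tuple

CHUNK_DURATION_SECONDS = 600

Window = Tuple[float, float]
Failure = Tuple[float, float, str]


def chunk_windows(
    start: float, end: float, size: float = CHUNK_DURATION_SECONDS
) -> List[Window]:
    """Split [start, end) into consecutive windows of at most `size` seconds."""
    windows = []
    t = start
    while t < end:
        windows.append((t, min(t + size, end)))
        t += size
    return windows


def transcribe_chunks(
    windows: List[Window],
    transcribe_window: Callable[[float, float], List[dict]],
) -> Tuple[List[dict], List[Failure]]:
    """Transcribe each window; collect failures instead of aborting."""
    segments: List[dict] = []
    failures: List[Failure] = []
    for w_start, w_end in windows:
        try:
            segments.extend(transcribe_window(w_start, w_end))
        except Exception as exc:
            # Record which window is missing and why, then keep going.
            failures.append((w_start, w_end, str(exc)))
    return segments, failures
```

The failures list gives the caller enough information to re-run only the missing windows later.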

Successful Whisper transcriptions are now cached to ~/.cache/watch/chunks/ keyed by source file identity (path + size + mtime), window, and backend. On a subsequent run with matching inputs, both audio extraction and the API call are skipped; segments come straight from disk.

Why: when a chunked run partially fails, the recovery is a focused re-run of the missing window. Without caching, that re-run had to re-extract and re-upload chunks that had already succeeded. With caching, only the truly-missing chunks hit the network. This is the exact recovery path that surfaced on Lesson 03's chunk-1 5xx storm.

Cache key includes file size + mtime_ns so editing the source file invalidates entries automatically. Backend is in the key too: switching between Groq's whisper-large-v3 and OpenAI's whisper-1 produces different transcripts, so they shouldn't share cache. CACHE_VERSION lets future schema changes invalidate everything cleanly.

Refactored both single-upload and chunked paths in transcribe_video to share a common _transcribe_window helper, so the cache logic lives in one place.

Live test on a 14-min lesson video:
- Cold run (cache miss): ~30s, 2 uploads, full Whisper quota cost
- Warm run (cache hit): ~3s, 0 uploads, 0 quota cost

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
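A cache keyed on the fields listed in this commit message could be sketched as below. The hashing scheme, the JSON file layout, the helper names, and the value of CACHE_VERSION are all assumptions made for illustration; only the key ingredients and the fall-through-on-load-failure behaviour come from the PR text.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import List, Optional

CACHE_VERSION = 1  # assumed value; bumping it invalidates every entry


def cache_key(path: str, start: float, end: float, backend: str) -> str:
    """Hash the file identity, window, and backend into a stable key."""
    st = os.stat(path)
    identity = {
        "v": CACHE_VERSION,
        "path": os.path.abspath(path),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,  # edits to the source change the key
        "start": start,
        "end": end,
        "backend": backend,
    }
    blob = json.dumps(identity, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def load_cached(cache_dir: str, key: str) -> Optional[List[dict]]:
    """Return cached segments, or None on any miss or unreadable entry."""
    try:
        return json.loads((Path(cache_dir) / f"{key}.json").read_text())
    except (OSError, ValueError):
        # A missing or corrupt entry is treated as a plain cache miss.
        return None


def store_cached(cache_dir: str, key: str, segments: List[dict]) -> None:
    """Write segments for a successfully transcribed window."""
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    (Path(cache_dir) / f"{key}.json").write_text(json.dumps(segments))
```

Because the loader swallows read and parse errors, a damaged cache entry only costs one fresh upload instead of failing the run.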

JoseBallestas and others added 5 commits May 1, 2026 17:14

JoseBallestas wants to merge 5 commits into bradautomates:main from JoseBallestas:main
