Make Whisper resilient: chunking, retry caps, transcript cache#10
Open
JoseBallestas wants to merge 5 commits into
Server-error retries re-uploaded the full audio each time, counting against Groq's per-hour ASPH (seconds of audio per hour) limit. A single 40-min file with 4 attempts exceeded the 7200s/hour cap, locking the user out of their own free tier when the original failure was just a transient 500.

Cap 5xx retries at 2 attempts (initial + 1 retry), symmetric with the existing 429 handling. Network errors still get all 4 attempts since no payload has been uploaded yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
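A minimal sketch of the per-category retry caps described in this commit. Only `MAX_5XX_RETRIES = 2` and the 4-attempt ceiling come from the PR; the exception classes, the `CAPS` table, the 429 value, the backoff schedule, and the `upload_with_caps` name are assumptions made for illustration, not the code in `scripts/whisper.py`.

```python
import time

MAX_ATTEMPTS = 4  # overall ceiling; the constant name is an assumption


class ServerError(Exception):
    """Hypothetical: the API answered with a 5xx status."""


class RateLimited(Exception):
    """Hypothetical: the API answered with 429."""


# Attempts allowed per failure category (the initial try counts as one).
CAPS = {
    ServerError: 2,                 # MAX_5XX_RETRIES in the commit
    RateLimited: 2,                 # assumed to mirror the existing 429 cap
    ConnectionError: MAX_ATTEMPTS,  # no payload was uploaded, so retry freely
}


def upload_with_caps(send, sleep=time.sleep):
    """Call send() until it succeeds or a cap for the failure kind is hit."""
    seen = {kind: 0 for kind in CAPS}
    attempt = 0
    while True:
        attempt += 1
        try:
            return send()
        except tuple(CAPS) as exc:
            kind = next(k for k in CAPS if isinstance(exc, k))
            seen[kind] += 1
            if seen[kind] >= CAPS[kind] or attempt >= MAX_ATTEMPTS:
                raise  # give up and let the caller report the real error
        sleep(min(2 ** attempt, 30))  # simple capped exponential backoff
```

With these caps a persistent 500 costs at most two uploads, while a flaky network connection still gets all four tries.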
When Whisper failed for any reason (rate limit, network error, parse error), the final report hardcoded "no API key set, or --no-whisper was used" even when neither was true. Confusing, and steers users toward re-running setup.py when the real problem was elsewhere.

Track the actual failure reason from each fallback path (subtitle parse, no key, --no-whisper, Whisper exception) and surface it in the report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
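One way to picture the reason tracking, as a sketch only: the helper name, its parameters, and the message wording are hypothetical, and the real report code in `watch.py` may be organized differently.

```python
def describe_missing_transcript(no_whisper, api_key,
                                whisper_error=None, subtitle_error=None):
    """Return a one-line reason for a missing transcript (hypothetical helper)."""
    reasons = []
    if subtitle_error is not None:
        reasons.append(f"subtitle parse failed: {subtitle_error}")
    if no_whisper:
        reasons.append("--no-whisper was used")
    elif not api_key:
        reasons.append("no API key set")
    elif whisper_error is not None:
        # Surface the real exception instead of a hardcoded guess.
        name = type(whisper_error).__name__
        reasons.append(f"Whisper failed: {name}: {whisper_error}")
    return "; ".join(reasons) or "unknown"
```

The point of the design is that each fallback path records its own cause, so the report never has to guess.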
Previously --start/--end constrained frame extraction but Whisper still got the full audio file. On a 40-min video focused to 5 min, that meant uploading 18 MB to transcribe 2 MB worth of content: wasteful, and 8x more quota burn against Groq's per-hour ASPH limit.

extract_audio() now accepts start/end seconds and passes them to ffmpeg as -ss/-to. transcribe_video() forwards them through and offsets the returned segment timestamps so they align with the source video timeline.

Live test on the same 40-min file focused to the first 5 min: audio dropped from 18763 kB to 2345 kB and Whisper succeeded on the first attempt (after Groq had been 500ing the larger payload all session).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
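The windowing and timestamp offset can be sketched as below. This is an illustration under assumptions: the function names, the segment shape (dicts with `start`, `end`, `text`), and the choice to pass `-ss`/`-to` as input options before `-i` are not confirmed by the PR, and the 64 kbps mono encoding is taken from the PR notes.

```python
def build_ffmpeg_args(src, dst, start=None, end=None):
    """Build an ffmpeg command that extracts only the requested audio window."""
    args = ["ffmpeg", "-y"]
    if start is not None:
        args += ["-ss", str(start)]  # seek in the input to the window start
    if end is not None:
        args += ["-to", str(end)]    # stop reading the input at the window end
    # Drop video, downmix to mono, encode at 64 kbps.
    args += ["-i", src, "-vn", "-ac", "1", "-b:a", "64k", dst]
    return args


def offset_segments(segments, start):
    """Shift window-relative timestamps back onto the source-video timeline."""
    shift = start or 0
    return [
        {**seg, "start": seg["start"] + shift, "end": seg["end"] + shift}
        for seg in segments
    ]
```

The command list can be handed to `subprocess.run(args, check=True)`. Whisper reports times relative to the clip it received, so a segment at 0.0 to 2.5 s in a clip cut from second 60 belongs at 60.0 to 62.5 s in the source.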
Audio over 10 min is now split into chunks before upload. Each chunk is independently retryable, and a chunk that fails permanently is reported without taking down the rest of the transcript.

Why: Groq has a 25 MB per-file cap (~52 min at our bitrate) so anything longer can't fit in one request anyway. More importantly, when a single upload hit a transient 500 the entire transcript was lost; now the rest of the chunks still come through.

Live test on the same 40-min file that had been failing repeatedly: 5/5 chunks succeeded (586 segments) despite one transient 500 and one 429 wait, exactly the failure modes that left us with zero transcript before.

transcribe_video() now returns (segments, backend, failures), where failures is a list of (start, end, reason) tuples for any chunks that didn't make it. watch.py surfaces these as a "Partial transcript" note above the transcript block so the user knows which windows are missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
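A sketch of the chunking idea, assuming a 600 s chunk length (the `CHUNK_DURATION_SECONDS` value given in the PR notes). The helper names `plan_chunks` and `transcribe_in_chunks`, and the `transcribe_window` callback, are hypothetical stand-ins for whatever the patched `transcribe_video()` does internally.

```python
CHUNK_DURATION_SECONDS = 600


def plan_chunks(start, end, chunk=CHUNK_DURATION_SECONDS):
    """Split [start, end) into consecutive windows of at most `chunk` seconds."""
    windows = []
    t = start
    while t < end:
        windows.append((t, min(t + chunk, end)))
        t += chunk
    return windows


def transcribe_in_chunks(windows, transcribe_window):
    """Transcribe each window on its own; collect failures instead of aborting.

    `transcribe_window(start, end)` is a hypothetical callable that returns
    a list of segments or raises once its own retries are exhausted.
    """
    segments, failures = [], []
    for w_start, w_end in windows:
        try:
            segments.extend(transcribe_window(w_start, w_end))
        except Exception as exc:
            # Record which window is missing and why, then keep going.
            failures.append((w_start, w_end, str(exc)))
    return segments, failures
```

Because a failed window only adds an entry to `failures`, one bad upload costs a single chunk of transcript rather than the whole run.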
Successful Whisper transcriptions are now cached to ~/.cache/watch/chunks/ keyed by source file identity (path + size + mtime), window, and backend. On a subsequent run with matching inputs, both audio extraction and the API call are skipped; segments come straight from disk.

Why: when a chunked run partially fails, the recovery is a focused re-run of the missing window. Without caching, that re-run had to re-extract and re-upload chunks that had already succeeded. With caching, only the truly-missing chunks hit the network. This is the exact recovery path that surfaced on Lesson 03's chunk-1 5xx storm.

Cache key includes file size + mtime_ns so editing the source file invalidates entries automatically. Backend is in the key too: switching between Groq's whisper-large-v3 and OpenAI's whisper-1 produces different transcripts, so they shouldn't share cache. CACHE_VERSION lets future schema changes invalidate everything cleanly.

Refactored both single-upload and chunked paths in transcribe_video to share a common _transcribe_window helper, so the cache logic lives in one place.

Live test on a 14-min lesson video:
- Cold run (cache miss): ~30s, 2 uploads, full Whisper quota cost
- Warm run (cache hit): ~3s, 0 uploads, 0 quota cost

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
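The cache key described above could be built roughly like this. It is a sketch under assumptions: the SHA-256 hashing, the JSON file layout, the `CACHE_VERSION` value, and the function names are illustrative choices. Only the key ingredients (path, size, `mtime_ns`, window, backend, version) and the cache directory come from the commit message.

```python
import hashlib
import json
import os
from pathlib import Path

CACHE_VERSION = 1  # assumed value; bump to invalidate every entry
CACHE_DIR = Path.home() / ".cache" / "watch" / "chunks"


def cache_key(path, start, end, backend):
    """Hash file identity + window + backend into a stable cache key."""
    st = os.stat(path)
    identity = [CACHE_VERSION, os.path.abspath(path), st.st_size,
                st.st_mtime_ns, start, end, backend]
    return hashlib.sha256(json.dumps(identity).encode("utf-8")).hexdigest()


def load_cached(key, cache_dir=CACHE_DIR):
    """Return cached segments, or None on any miss or read error."""
    try:
        text = (Path(cache_dir) / f"{key}.json").read_text(encoding="utf-8")
        return json.loads(text)
    except (OSError, ValueError):
        return None  # a broken cache entry just means a fresh upload


def store_cached(key, segments, cache_dir=CACHE_DIR):
    """Write segments to disk so a later run can skip extraction and upload."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{key}.json").write_text(json.dumps(segments),
                                           encoding="utf-8")
```

Editing the source file changes its size or `mtime_ns`, so the key changes and stale entries are simply never looked up again.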
Summary
Fixes a chain of related Whisper transcription bugs that prevented /watch from working on anything longer than ~10 min, and adds a transcript cache so re-runs are free.

Discovered while trying to use /watch on a 40-min local lesson video. A single transient Groq 500 turned into a 4-attempt retry storm that burned through Groq's 7200s/hour quota and produced zero transcript. From there it cascaded: the script also re-uploaded the full audio even with --start/--end, the failure-mode message lied to the user about why the transcript was missing, and there was no way to process audio over Groq's 25 MB single-file cap (~52 min).

Five commits, each a self-contained fix:
- MAX_5XX_RETRIES = 2, symmetric with the existing 429 cap. Each retry re-uploads the audio and counts against the per-hour quota, so 4 attempts of a 40-min file = ~3 hours of "audio" billed, which locks the user out of the free tier.
- The failure report hardcoded "no API key set, or --no-whisper was used" even when neither was true. Now it surfaces the real exception.
- --start/--end previously constrained frame extraction but Whisper still got the full audio. On a 40-min video focused to 5 min, that meant uploading 18 MB to transcribe 2 MB worth of content. Now extract_audio() honors -ss/-to and transcribe_video() offsets the returned segment timestamps so they align with the source-video timeline.
- Audio over 10 min is split into independently retryable chunks, and transcribe_video() returns (start, end, reason) tuples for failures. Fixes the hard 25 MB / ~52 min cap and means a single 5xx no longer kills the entire transcript.
- Successful transcriptions are cached to ~/.cache/watch/chunks/ keyed by (file path + size + mtime + window + backend). On re-run with matching inputs, both extraction and the API call are skipped. Means partial-failure recovery (focused re-run on the missing window) is free for the chunks that already worked.

Real-world validation
After landing all five fixes, I used the patched /watch to transcribe a complete 16-lesson storyboarding course: ~14 hours of video, 10,000+ Whisper segments returned. Several lessons exceeded 1 hour individually; lesson 13 was 2h 20m (categorically beyond Groq's free-tier single-file cap pre-fix). Multiple lessons hit transient 500s and 429s mid-run; all recovered cleanly. One lesson's first chunk failed both attempts; the cache let me kill the run, wait for the rolling-hour window to reset, retry, and pick up exactly where the previous attempt left off without re-paying for the chunks that had already succeeded.

Without these fixes, none of that would have been possible on Groq's free tier.
Test plan
Notes
- Changes are in scripts/whisper.py and scripts/watch.py. No new dependencies and no breaking API changes for existing callers (the transcribe_video return value grew from a 2-tuple to a 3-tuple, but the only caller is watch.py, which I updated).
- The cache is on local disk (~/.cache/watch/chunks/) and never breaks the pipeline: load failures silently fall through to a fresh upload.
- CHUNK_DURATION_SECONDS = 600 was chosen to keep each chunk well under the 25 MB cap (~4.7 MB at 64 kbps mono) and to divide cleanly into Groq's 7200s/hour quota.

🤖 Generated with Claude Code