
Make Whisper resilient: chunking, retry caps, transcript cache #10

Open
JoseBallestas wants to merge 5 commits into bradautomates:main from JoseBallestas:main

@JoseBallestas

Summary

Fixes a chain of related Whisper transcription bugs that prevented /watch from working on anything longer than ~10 min, and adds a transcript cache so re-runs are free.

Discovered while trying to use /watch on a 40-min local lesson video. A single transient Groq 500 turned into a 4-attempt retry storm that burned through Groq's 7200s/hour quota and produced zero transcript. Digging in exposed three more problems: the script re-uploaded the full audio even with --start/--end, the failure message misreported why the transcript was missing, and there was no way to process audio over Groq's 25 MB single-file cap (~52 min).

Five commits, each a self-contained fix:

  1. Cap 5xx retries to prevent quota burn: MAX_5XX_RETRIES = 2, symmetric with the existing 429 cap. Each retry re-uploads the audio and counts against the per-hour quota, so 4 attempts at a 40-min file bill 160 min (~2.7 hours) of "audio" and lock the user out of the free tier.
  2. Surface the actual failure reason in the report: when Whisper failed for any reason (rate limit, network, parse), the markdown report hardcoded "no API key set, or --no-whisper was used", even when neither was true. Now it surfaces the real exception.
  3. Extract only the focused window for Whisper: --start/--end previously constrained frame extraction, but Whisper still got the full audio. On a 40-min video focused to 5 min, that meant uploading 18 MB to transcribe 2 MB worth of content. Now extract_audio() honors -ss/-to and transcribe_video() offsets the returned segment timestamps so they align with the source-video timeline.
  4. Chunk Whisper uploads: audio over 10 min is split into chunks, each uploaded independently. A chunk that fails permanently is reported and skipped; the caller gets segments from the successful chunks plus a list of (start, end, reason) tuples for failures. This removes the hard 25 MB / ~52 min cap and means a single 5xx no longer kills the entire transcript.
  5. Cache successful chunk transcripts: successful transcriptions are cached to ~/.cache/watch/chunks/, keyed by (file path + size + mtime + window + backend). On re-run with matching inputs, both extraction and the API call are skipped, so partial-failure recovery (a focused re-run on the missing window) is free for the chunks that already worked.

Real-world validation

After landing all five fixes, I used the patched /watch to transcribe a complete 16-lesson storyboarding course: ~14 hours of video, with more than 10,000 Whisper segments returned. Several lessons exceeded 1 hour individually; lesson 13 was 2h 20m (categorically beyond Groq's free-tier single-file cap pre-fix). Multiple lessons hit transient 500s and 429s mid-run; all recovered cleanly. One lesson's first chunk failed both attempts; the cache let me kill the run, wait for the rolling-hour window to reset, retry, and pick up exactly where the previous attempt left off without re-paying for the chunks that had already succeeded.

Without these fixes, none of that would have been possible on Groq's free tier.

Test plan

  • Unit tests for retry-cap behavior (5xx, 429, 4xx, network errors) — see test runs in commits
  • Unit tests for ffmpeg start/end window arg construction
  • Unit tests for chunking + partial-failure path + total-failure path
  • Unit tests for cache hit / miss / invalidation paths (5 scenarios incl. partial-cache recovery)
  • End-to-end on 14 hours of real audio across 16 separate runs, including focused re-runs and post-quota recovery scenarios

Notes

  • All changes live in scripts/whisper.py and scripts/watch.py. No new dependencies. The one signature change is that transcribe_video now returns a 3-tuple instead of a 2-tuple; its only caller is watch.py, which I updated.
  • The cache is on by default; opting out is implicit (delete ~/.cache/watch/chunks/). It never breaks the pipeline: load failures silently fall through to a fresh upload.
  • CHUNK_DURATION_SECONDS = 600 chosen to keep each chunk well under the 25 MB cap (~4.7 MB at 64 kbps mono) and divide cleanly into Groq's 7200s/hour quota.
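
The sizing in the last note can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes a nominal 64 kbps stream with no container overhead and decimal megabytes, and the helper name audio_size_mb is made up for illustration. Under those assumptions a 600 s chunk comes to about 4.8 MB (close to the ~4.7 MB quoted above), 12 chunks fit in the 7200 s hourly quota, and the 25 MB cap corresponds to roughly 52 minutes of audio.

```python
# Values taken from the PR description.
CHUNK_DURATION_SECONDS = 600
FILE_CAP_MB = 25
QUOTA_SECONDS_PER_HOUR = 7200

# Assumption: nominal encoder bitrate, ignoring container overhead.
BITRATE_KBPS = 64


def audio_size_mb(duration_s: float, bitrate_kbps: float) -> float:
    """Approximate audio size in MB (10^6 bytes) for a constant bitrate."""
    bits = duration_s * bitrate_kbps * 1000
    return bits / 8 / 1_000_000


chunk_mb = audio_size_mb(CHUNK_DURATION_SECONDS, BITRATE_KBPS)
chunks_per_hour = QUOTA_SECONDS_PER_HOUR // CHUNK_DURATION_SECONDS
max_single_file_min = FILE_CAP_MB / audio_size_mb(60, BITRATE_KBPS)

print(f"chunk size:           {chunk_mb:.1f} MB")
print(f"chunks per quota hour: {chunks_per_hour}")
print(f"single-file ceiling:  {max_single_file_min:.0f} min")
```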

🤖 Generated with Claude Code

JoseBallestas and others added 5 commits May 1, 2026 17:14
Server-error retries re-uploaded the full audio each time, counting
against Groq's per-hour ASPH limit. A single 40-min file with 4 attempts
exceeded the 7200s/hour cap, locking the user out of their own free tier
when the original failure was just a transient 500.

Cap 5xx retries at 2 attempts (initial + 1 retry), symmetric with the
existing 429 handling. Network errors still get all 4 attempts since no
payload has been uploaded yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
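
For readers who want to see the retry policy in one place, here is a minimal sketch of per-class retry caps as described in the commit message above. It is not the code from scripts/whisper.py: upload_with_caps, TranscriptionError, MAX_ATTEMPTS, and MAX_429_RETRIES are hypothetical names, the send callable stands in for the real upload, and the backoff schedule is an assumption. Only MAX_5XX_RETRIES = 2 and the "network errors get all 4 attempts" rule come from the PR.

```python
import time
from typing import Callable, Tuple

MAX_ATTEMPTS = 4       # overall ceiling; network errors may use all four
MAX_5XX_RETRIES = 2    # from the PR: initial attempt + 1 retry
MAX_429_RETRIES = 2    # assumed name for the pre-existing 429 cap


class TranscriptionError(RuntimeError):
    """Raised when an upload fails permanently (hypothetical type)."""


def upload_with_caps(
    send: Callable[[], Tuple[int, str]],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until it returns HTTP 200 or a per-class cap is hit."""
    seen_5xx = seen_429 = 0
    last_error = "no attempt made"
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            status, body = send()
        except OSError as exc:
            # Connection-level failure: retried up to MAX_ATTEMPTS.
            last_error = f"network error: {exc}"
        else:
            if status == 200:
                return body
            last_error = f"HTTP {status}"
            if status == 429:
                seen_429 += 1
                if seen_429 >= MAX_429_RETRIES:
                    break
            elif 500 <= status < 600:
                seen_5xx += 1
                if seen_5xx >= MAX_5XX_RETRIES:
                    break
            else:
                break  # other 4xx: retrying cannot help
        if attempt < MAX_ATTEMPTS:
            sleep(2 ** attempt)  # simple exponential backoff (assumed)
    raise TranscriptionError(f"gave up after {attempt} attempt(s): {last_error}")
```

Counting each status class separately is what keeps a run of 500s from consuming the attempts reserved for network errors, which cost no quota.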
When Whisper failed for any reason — rate limit, network error, parse
error — the final report hardcoded "no API key set, or --no-whisper was
used" even when neither was true. Confusing, and steers users toward
re-running setup.py when the real problem was elsewhere.

Track the actual failure reason from each fallback path (subtitle parse,
no key, --no-whisper, Whisper exception) and surface it in the report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
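
One way to picture the change above is a transcript lookup that returns a reason alongside the segments, so the report never has to guess. The sketch below is an illustration under assumptions, not the code in watch.py: get_transcript, transcript_note, and the parameter names are all hypothetical, and the order of the fallback checks is a guess.

```python
from typing import Callable, List, Optional, Tuple


def get_transcript(
    subtitles: Optional[List[dict]],
    api_key: Optional[str],
    no_whisper: bool,
    transcribe: Callable[[], List[dict]],
) -> Tuple[List[dict], Optional[str]]:
    """Return (segments, failure_reason); reason is None on success."""
    if subtitles:
        return subtitles, None
    if no_whisper:
        return [], "--no-whisper was used"
    if not api_key:
        return [], "no API key set"
    try:
        return transcribe(), None
    except Exception as exc:
        # Keep the real exception text instead of a canned message.
        return [], f"Whisper failed: {type(exc).__name__}: {exc}"


def transcript_note(reason: Optional[str]) -> str:
    """Render the line shown in the report when no transcript exists."""
    return "" if reason is None else f"No transcript available: {reason}"
```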
Previously --start/--end constrained frame extraction but Whisper still
got the full audio file. On a 40-min video focused to 5 min, that meant
uploading 18 MB to transcribe 2 MB worth of content — wasteful, and
8x more quota burn against Groq's per-hour ASPH limit.

extract_audio() now accepts start/end seconds and passes them to ffmpeg
as -ss/-to. transcribe_video() forwards them through and offsets the
returned segment timestamps so they align with the source video timeline.

Live test on the same 40-min file focused to first 5 min: audio dropped
from 18763 kB → 2345 kB and Whisper succeeded on the first attempt
(after Groq had been 500ing the larger payload all session).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
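
A rough sketch of the two pieces described above: building the ffmpeg argument list for a window, and shifting the returned timestamps back onto the source timeline. The function names, the codec flags, and the choice to place -ss/-to before -i (so both are read as positions in the input file) are assumptions for illustration; the actual extract_audio() may be structured differently.

```python
from typing import List, Optional


def build_audio_args(
    src: str,
    dst: str,
    start: Optional[float] = None,
    end: Optional[float] = None,
) -> List[str]:
    """Build an ffmpeg command that extracts mono audio for [start, end]."""
    args = ["ffmpeg", "-y"]
    # As input options, -ss and -to are positions in the source file.
    if start is not None:
        args += ["-ss", f"{start:.3f}"]
    if end is not None:
        args += ["-to", f"{end:.3f}"]
    # -vn drops video; mono at 64 kbps is an assumed encoding choice.
    args += ["-i", src, "-vn", "-ac", "1", "-b:a", "64k", dst]
    return args


def offset_segments(segments: List[dict], offset: float) -> List[dict]:
    """Shift window-relative timestamps onto the source-video timeline."""
    return [
        {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
        for seg in segments
    ]
```

The offset step matters because audio cut with an input-side seek starts at time zero, so a segment Whisper reports at 12 s inside a window that began at 300 s belongs at 312 s in the original video.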
Audio over 10 min is now split into chunks before upload. Each chunk is
independently retryable, and a chunk that fails permanently is reported
without taking down the rest of the transcript.

Why: Groq has a 25 MB per-file cap (~52 min at our bitrate) so anything
longer can't fit in one request anyway. More importantly, when a single
upload hit a transient 500 the entire transcript was lost; now the rest
of the chunks still come through. Live test on the same 40-min file that
had been failing repeatedly: 5/5 chunks succeeded (586 segments) despite
one transient 500 and one 429 wait — exactly the failure modes that left
us with zero transcript before.

transcribe_video() now returns (segments, backend, failures), where
failures is a list of (start, end, reason) tuples for any chunks that
didn't make it. watch.py surfaces these as a "Partial transcript" note
above the transcript block so the user knows which windows are missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
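
The chunk-and-collect-failures flow described above could look roughly like this. It is a sketch under assumptions: plan_chunks and transcribe_chunked are hypothetical names, transcribe_window stands in for whatever uploads a single window, and the real transcribe_video() also returns the backend, which is omitted here.

```python
from typing import Callable, List, Tuple

CHUNK_DURATION_SECONDS = 600  # value from the PR

Window = Tuple[float, float]
Failure = Tuple[float, float, str]


def plan_chunks(start: float, end: float,
                chunk_s: float = CHUNK_DURATION_SECONDS) -> List[Window]:
    """Split [start, end) into consecutive windows of at most chunk_s."""
    windows = []
    t = start
    while t < end:
        windows.append((t, min(t + chunk_s, end)))
        t += chunk_s
    return windows


def transcribe_chunked(
    start: float,
    end: float,
    transcribe_window: Callable[[float, float], List[dict]],
) -> Tuple[List[dict], List[Failure]]:
    """Transcribe each window independently, recording failures."""
    segments: List[dict] = []
    failures: List[Failure] = []
    for w_start, w_end in plan_chunks(start, end):
        try:
            segments.extend(transcribe_window(w_start, w_end))
        except Exception as exc:
            # One bad chunk is reported and skipped, not fatal.
            failures.append((w_start, w_end, str(exc)))
    return segments, failures
```

With this shape, the caller can render the failures list as the "Partial transcript" note and still show everything that did come back.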
Successful Whisper transcriptions are now cached to ~/.cache/watch/chunks/
keyed by source file identity (path + size + mtime), window, and backend.
On a subsequent run with matching inputs, both audio extraction and the
API call are skipped — segments come straight from disk.

Why: when a chunked run partially fails, the recovery is a focused re-run
of the missing window. Without caching, that re-run had to re-extract
and re-upload chunks that had already succeeded. With caching, only the
truly-missing chunks hit the network. This is the exact recovery path
that surfaced on Lesson 03's chunk-1 5xx storm.

Cache key includes file size + mtime_ns so editing the source file
invalidates entries automatically. Backend is in the key too — switching
between Groq's whisper-large-v3 and OpenAI's whisper-1 produces different
transcripts, so they shouldn't share cache. CACHE_VERSION lets future
schema changes invalidate everything cleanly.

Refactored both single-upload and chunked paths in transcribe_video to
share a common _transcribe_window helper, so the cache logic lives in
one place.

Live test on a 14-min lesson video:
  - Cold run (cache miss): ~30s, 2 uploads, full Whisper quota cost
  - Warm run (cache hit):   ~3s,  0 uploads, 0 quota cost

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
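
To make the cache key concrete, here is a minimal sketch of keying on file identity, window, and backend, with a loader that treats any read failure as a miss. The hashing scheme, the JSON file layout, the CACHE_VERSION value, and the function names cache_key, load_cached, and store_cached are all assumptions for illustration; only the key ingredients and the cache directory come from the commit message above.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import List, Optional

CACHE_VERSION = 1  # assumed value; bump to invalidate every entry
CACHE_DIR = Path.home() / ".cache" / "watch" / "chunks"


def cache_key(path: str, start: float, end: float, backend: str) -> str:
    """Hash file identity + window + backend into a stable key."""
    st = os.stat(path)
    payload = {
        "v": CACHE_VERSION,
        "path": os.path.abspath(path),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,  # changes when the file is edited
        "start": start,
        "end": end,
        "backend": backend,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def load_cached(key: str, cache_dir: Path = CACHE_DIR) -> Optional[List[dict]]:
    """Return cached segments, or None on any load failure."""
    try:
        return json.loads((cache_dir / f"{key}.json").read_text())
    except (OSError, ValueError):
        return None  # missing or corrupt entry: fall through to upload


def store_cached(key: str, segments: List[dict],
                 cache_dir: Path = CACHE_DIR) -> None:
    """Persist the segments of a successful transcription."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{key}.json").write_text(json.dumps(segments))
```

Because the window is part of the key, a focused re-run only hits the cache when its window boundaries match the chunk boundaries of the earlier run.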