Three robustness fixes: frame dim clamp, Whisper auto-chunking, transcript-to-file#28
Open
kuhnhomeuk-cell wants to merge 4 commits into
Open
Conversation
… them Claude Code's Read tool rejects images with any dimension >2000px. Portrait sources at higher --resolution values were silently producing oversized frames that Claude couldn't see, leaving the pipeline frames-blind without a clear error. Concrete trigger: a 1320×2868 phone screen recording at --resolution 1024 previously produced 1024×2224 frames, both rejected on Read. frames.extract now probes the source dims via ffprobe and computes an explicit W:H pair that respects --resolution as a width hint but rescales the longer edge down to 1998px when needed. Aspect ratio is preserved; both dimensions are forced even (libx264 requirement). A stderr warning fires whenever clamping triggers so the cause isn't opaque. Landscape sources at any reasonable --resolution are unaffected: a 1920×1080 input at --resolution 1024 still emits 1024×576 as before. Tested against the actual failure case (1320×2868 portrait, --resolution 1024): output is now 918×1998, both dims accepted by Read.
…oad cap
Whisper's API rejects uploads over 25 MB. With the existing 64 kbps mono
extraction (~480 kB/min), any video over ~52 min returns an HTTP 4xx and the
transcript pipeline falls through to "none available" — the SKILL.md already
calls this out as a known failure mode.
This change adds two helpers and wraps the upload site:
- `_audio_duration_seconds(path)` — thin ffprobe wrapper.
- `_split_audio_into_chunks(audio, dir, max_mb)` — ffmpeg segment muxer
with `-c copy` (no re-encode); computes segment length from total bytes
so chunks land close to the cap, with a 2% undershoot to absorb the
keyframe-snap padding. Returns [(path, offset_seconds)] with offsets
derived from each chunk's measured duration (not assumed uniform).
- `_post_whisper_chunked(...)` — passthrough when ≤22 MB; otherwise
splits, uploads sequentially, stitches segments with absolute
timestamps so downstream `filter_range` / `format_transcript` don't
have to know chunking happened.
`transcribe_video` now calls `_post_whisper_chunked` instead of
`_post_whisper` directly. No public signature changes.
Manually verified:
- 90s/704 KB silent audio → 1 chunk, no split (passthrough)
- 1500s/34 MB silent audio → 3 chunks of ≤16.8 MB, offsets [0, 735, 1470]
Long videos previously dumped the entire transcript into the stdout report as
one fenced markdown block. A 60-minute video produces 800+ segments → tens of
thousands of context tokens consumed on every run, with no way to opt out
short of `--no-whisper`. After a sparse full-video pass the user typically
needs to re-focus into 1-2 regions; re-paying the transcript cost each time
is wasteful.
Watch now:
- Writes `transcript.json` (machine-readable) and `transcript.md` (timestamped,
human-readable) to the working directory whenever a transcript is available
(captions or Whisper).
- Prints a head/tail preview in the report — first 30 + last 10 segments —
with the on-disk path so Claude can deliberately Read the full file when
needed.
- Adds `--inline-transcript` for the legacy behavior of dumping the full
transcript into stdout.
Preview is suppressed for short transcripts (≤45 segments) where the head+tail
contains the whole thing anyway.
No public API changes; no new external dependencies. Tested against captions
and Whisper paths with focused and full ranges.
- Adds Unreleased section to CHANGELOG documenting the frame dim clamp, Whisper auto-chunking, and --inline-transcript flag. - Documents --inline-transcript and the auto-clamp / auto-chunk behaviors in SKILL.md's flags table. - Removes the now-stale "25 MB upload limit" caveat from the failure-modes section (it's handled by chunking automatically), and adds a note for the portrait-source auto-clamp warning so the diagnostic isn't surprising.
|
@kuhnhomeuk-cell thanks for this - I pulled two parts of this into my own fork. The chunking I did myself already - probably our clankers came to the same conclusions |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Three independent failure modes I hit while pointing
/watchat long-form local screen recordings (a 65-min and a 25-min phone capture of a webinar). Each one is documented in the body below with the exact symptom, what's going wrong inside the script, and a verification against the original input.The commits stand alone — you can take one, two, or all three. They're ordered shortest-to-longest in the branch.
1.
fix(frames): clamp output dims to ≤1998px so the Read tool can ingest themSymptom. A 1320×2868 portrait phone screen recording at
--resolution 1024produced 1024×2224 frames that Claude's Read tool silently rejected (it caps both dimensions at 2000px). The pipeline ran clean, frame paths were listed, and Claude was nominally given the frames — but every Read returned blank, leaving the entire report transcript-only. There was no error to surface.Root cause.
frames.extractpassesscale={resolution}:-2to ffmpeg, which only constrains the width. Tall portrait sources blow past Read's cap on the height axis.Fix.
frames.pynow probes source dims via ffprobe and computes an explicitW:Hthat respects--resolutionas a width hint but rescales the longer edge down to 1998px when needed. Aspect ratio is preserved; both dims are forced even (libx264 requirement). A stderr warning fires when clamping triggers so the cause isn't opaque.Verification.
Landscape sources are unaffected — a 1920×1080 input at
--resolution 1024still produces 1024×576 as before (verified against anffmpeg lavfi testsrc).2.
feat(whisper): auto-chunk audio when extracted file exceeds 25 MB upload capSymptom. A 65-min webinar with no captions returned no transcript.
whisper.pyextracts mono 16 kHz 64 kbps mp3 (~480 kB/min), which puts any video over ~52 min above Whisper's 25 MB upload cap. The API returns 4xx, the script falls through totranscript: none available. SKILL.md already lists this as a known failure mode.Fix. Two new helpers + a wrapper:
_audio_duration_seconds(path)— ffprobe wrapper._split_audio_into_chunks(audio, dir, max_mb)— ffmpeg segment muxer with-c copy(no re-encode); computes segment length from total bytes so chunks land close to the cap, with a 2% undershoot to absorb keyframe-snap padding. Returns[(path, offset_seconds)]with offsets summed from each chunk's measured duration (not assumed uniform)._post_whisper_chunked(...)— passthrough when ≤22 MB; otherwise splits, uploads sequentially, stitches segments with absolute timestamps so downstreamfilter_range/format_transcriptdon't have to know chunking happened.transcribe_videocalls_post_whisper_chunkedinstead of_post_whisper. No public signature changes.Verification. (with silent ffmpeg lavfi audio — no Whisper credits consumed)
Stitching path is the same shape as
_post_whisper(verbose_json segments with start/end/text), so_segments_from_responseconsumes it unchanged.3.
feat(watch): write transcript to file by default; preview-only in reportSymptom. Long videos dumped the entire transcript into the stdout report as one fenced markdown block. A 60-min video produces 800+ segments → tens of thousands of context tokens consumed on every run, with no way to opt out short of
--no-whisper. After a sparse full-video pass the user typically re-focuses into 1-2 regions; re-paying the transcript cost each time is wasteful.Fix. Watch now:
transcript.json(machine-readable) andtranscript.md(timestamped, human-readable) to the working directory whenever a transcript exists (captions or Whisper).--inline-transcriptfor the legacy behavior of dumping the full transcript into stdout.Preview is suppressed for short transcripts (≤45 segments) where head+tail contains the whole thing anyway.
No public API changes; no new external dependencies.
Verification. Mocked a 50-segment transcript and ran focused mode:
transcript.json(598 B) +transcript.md(186 B) written ✓Scope
5 files, +325 / -21. All changes are additive on top of v0.1.3:
The v0.1.3 path-resolution and UTF-8 hardening from #2 / #4 is preserved on every line I touched.
Not in this PR
I kept a few opinionated extras local to my fork rather than bundle them here, on the theory you may want to discuss them separately or pass:
cache_dirkeyword totranscribe_videoand assumes a long-lived working dir — feels like its own discussion.--archive PATHflag for runs you want to keep, with amanifest.json.--ocr/--qr/--dedupepost-passes that shell out totesseract/zbarimg/dedupe_frames.pywhen available. All new dependencies; opt-in only; happy to PR separately if any are interesting.Happy to split, squash, or rework any of the three commits if the shape isn't right.