Three robustness fixes: frame dim clamp, Whisper auto-chunking, transcript-to-file by kuhnhomeuk-cell · Pull Request #28 · bradautomates/claude-video

kuhnhomeuk-cell · 2026-05-30T15:57:07Z

Why

Three independent failure modes I hit while pointing /watch at long-form local screen recordings (a 65-min and a 25-min phone capture of a webinar). Each one is documented in the body below with the exact symptom, what's going wrong inside the script, and a verification against the original input.

The commits stand alone — you can take one, two, or all three. They're ordered shortest-to-longest in the branch.

1. `fix(frames): clamp output dims to ≤1998px so the Read tool can ingest them`

Symptom. A 1320×2868 portrait phone screen recording at --resolution 1024 produced 1024×2224 frames that Claude's Read tool silently rejected (it caps both dimensions at 2000px). The pipeline ran clean, frame paths were listed, and Claude was nominally given the frames — but every Read returned blank, leaving the entire report transcript-only. There was no error to surface.

Root cause. frames.extract passes scale={resolution}:-2 to ffmpeg, which only constrains the width. Tall portrait sources blow past Read's cap on the height axis.

Fix. frames.py now probes source dims via ffprobe and computes an explicit W:H that respects --resolution as a width hint but rescales the longer edge down to 1998px when needed. Aspect ratio is preserved; both dims are forced even (libx264 requirement). A stderr warning fires when clamping triggers so the cause isn't opaque.

Verification.

$ python3 scripts/frames.py portrait.mp4 out/ --resolution 1024 --start 0 --end 2 --fps 1
[watch] source 1320x2868 at requested width 1024 would have produced 1024x2225
        (Claude's Read tool rejects any edge >1998px). Clamped to 918x1998.

$ sips -g pixelWidth -g pixelHeight out/frame_0001.jpg
  pixelWidth: 918
  pixelHeight: 1998

Landscape sources are unaffected — a 1920×1080 input at --resolution 1024 still produces 1024×576 as before (verified against an ffmpeg lavfi testsrc).

2. `feat(whisper): auto-chunk audio when extracted file exceeds 25 MB upload cap`

Symptom. A 65-min webinar with no captions returned no transcript. whisper.py extracts mono 16 kHz 64 kbps mp3 (~480 kB/min), which puts any video over ~52 min above Whisper's 25 MB upload cap. The API returns 4xx, the script falls through to transcript: none available. SKILL.md already lists this as a known failure mode.

Fix. Two new helpers + a wrapper:

_audio_duration_seconds(path) — ffprobe wrapper.
_split_audio_into_chunks(audio, dir, max_mb) — ffmpeg segment muxer with -c copy (no re-encode); computes segment length from total bytes so chunks land close to the cap, with a 2% undershoot to absorb keyframe-snap padding. Returns [(path, offset_seconds)] with offsets summed from each chunk's measured duration (not assumed uniform).
_post_whisper_chunked(...) — passthrough when ≤22 MB; otherwise splits, uploads sequentially, stitches segments with absolute timestamps so downstream filter_range / format_transcript don't have to know chunking happened.

transcribe_video calls _post_whisper_chunked instead of _post_whisper. No public signature changes.

Verification. (with silent ffmpeg lavfi audio — no Whisper credits consumed)

90s / 704 KB → 1 chunk, no split (passthrough confirmed)
1500s / 34.3 MB → 3 chunks, largest 16.8 MB, offsets [0.0, 735.0, 1470.0]
All chunks ≤ 22 MB cap ✓

Stitching path is the same shape as _post_whisper (verbose_json segments with start/end/text), so _segments_from_response consumes it unchanged.

3. `feat(watch): write transcript to file by default; preview-only in report`

Symptom. Long videos dumped the entire transcript into the stdout report as one fenced markdown block. A 60-min video produces 800+ segments → tens of thousands of context tokens consumed on every run, with no way to opt out short of --no-whisper. After a sparse full-video pass the user typically re-focuses into 1-2 regions; re-paying the transcript cost each time is wasteful.

Fix. Watch now:

Writes transcript.json (machine-readable) and transcript.md (timestamped, human-readable) to the working directory whenever a transcript exists (captions or Whisper).
Prints a head/tail preview in the report — first 30 + last 10 segments — with the on-disk path so Claude can deliberately Read the full file when needed.
Adds --inline-transcript for the legacy behavior of dumping the full transcript into stdout.

Preview is suppressed for short transcripts (≤45 segments) where head+tail contains the whole thing anyway.

No public API changes; no new external dependencies.

Verification. Mocked a 50-segment transcript and ran focused mode:

transcript.json (598 B) + transcript.md (186 B) written ✓
Report dropped from ~140 lines to ~40 ✓
Preview block contains the 6 in-range segments + a file reference ✓

Scope

5 files, +325 / -21. All changes are additive on top of v0.1.3:

CHANGELOG.md       |   9 ++++
SKILL.md           |  10 +++-
scripts/frames.py  |  84 ++++++++++++++++++++++++++++-
scripts/watch.py   |  90 +++++++++++++++++++++++++++----
scripts/whisper.py | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++---

The v0.1.3 path-resolution and UTF-8 hardening from #2 / #4 is preserved on every line I touched.

Not in this PR

I kept a few opinionated extras local to my fork rather than bundle them here, on the theory you may want to discuss them separately or pass:

Transcript caching by source-file fingerprint. Lets focused re-runs against the same source skip Whisper entirely. Useful, but adds a cache_dir keyword to transcribe_video and assumes a long-lived working dir — feels like its own discussion.
--archive PATH flag for runs you want to keep, with a manifest.json.
--ocr / --qr / --dedupe post-passes that shell out to tesseract / zbarimg / dedupe_frames.py when available. All new dependencies; opt-in only; happy to PR separately if any are interesting.

Happy to split, squash, or rework any of the three commits if the shape isn't right.

… them Claude Code's Read tool rejects images with any dimension >2000px. Portrait sources at higher --resolution values were silently producing oversized frames that Claude couldn't see, leaving the pipeline frames-blind without a clear error. Concrete trigger: a 1320×2868 phone screen recording at --resolution 1024 previously produced 1024×2224 frames, both rejected on Read. frames.extract now probes the source dims via ffprobe and computes an explicit W:H pair that respects --resolution as a width hint but rescales the longer edge down to 1998px when needed. Aspect ratio is preserved; both dimensions are forced even (libx264 requirement). A stderr warning fires whenever clamping triggers so the cause isn't opaque. Landscape sources at any reasonable --resolution are unaffected: a 1920×1080 input at --resolution 1024 still emits 1024×576 as before. Tested against the actual failure case (1320×2868 portrait, --resolution 1024): output is now 918×1998, both dims accepted by Read.

…oad cap Whisper's API rejects uploads over 25 MB. With the existing 64 kbps mono extraction (~480 kB/min), any video over ~52 min returns an HTTP 4xx and the transcript pipeline falls through to "none available" — the SKILL.md already calls this out as a known failure mode. This change adds two helpers and wraps the upload site: - `_audio_duration_seconds(path)` — thin ffprobe wrapper. - `_split_audio_into_chunks(audio, dir, max_mb)` — ffmpeg segment muxer with `-c copy` (no re-encode); computes segment length from total bytes so chunks land close to the cap, with a 2% undershoot to absorb the keyframe-snap padding. Returns [(path, offset_seconds)] with offsets derived from each chunk's measured duration (not assumed uniform). - `_post_whisper_chunked(...)` — passthrough when ≤22 MB; otherwise splits, uploads sequentially, stitches segments with absolute timestamps so downstream `filter_range` / `format_transcript` don't have to know chunking happened. `transcribe_video` now calls `_post_whisper_chunked` instead of `_post_whisper` directly. No public signature changes. Manually verified: - 90s/704 KB silent audio → 1 chunk, no split (passthrough) - 1500s/34 MB silent audio → 3 chunks of ≤16.8 MB, offsets [0, 735, 1470]

Long videos previously dumped the entire transcript into the stdout report as one fenced markdown block. A 60-minute video produces 800+ segments → tens of thousands of context tokens consumed on every run, with no way to opt out short of `--no-whisper`. After a sparse full-video pass the user typically needs to re-focus into 1-2 regions; re-paying the transcript cost each time is wasteful. Watch now: - Writes `transcript.json` (machine-readable) and `transcript.md` (timestamped, human-readable) to the working directory whenever a transcript is available (captions or Whisper). - Prints a head/tail preview in the report — first 30 + last 10 segments — with the on-disk path so Claude can deliberately Read the full file when needed. - Adds `--inline-transcript` for the legacy behavior of dumping the full transcript into stdout. Preview is suppressed for short transcripts (≤45 segments) where the head+tail contains the whole thing anyway. No public API changes; no new external dependencies. Tested against captions and Whisper paths with focused and full ranges.

- Adds Unreleased section to CHANGELOG documenting the frame dim clamp, Whisper auto-chunking, and --inline-transcript flag. - Documents --inline-transcript and the auto-clamp / auto-chunk behaviors in SKILL.md's flags table. - Removes the now-stale "25 MB upload limit" caveat from the failure-modes section (it's handled by chunking automatically), and adds a note for the portrait-source auto-clamp warning so the diagnostic isn't surprising.

joweiser · 2026-06-13T11:49:01Z

@kuhnhomeuk-cell thanks for this - I pulled two parts of this into my own fork. The chunking I did myself already - probably our clankers came to the same conclusions

kuhnhomeuk-cell added 4 commits May 30, 2026 16:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Three robustness fixes: frame dim clamp, Whisper auto-chunking, transcript-to-file#28

Three robustness fixes: frame dim clamp, Whisper auto-chunking, transcript-to-file#28
kuhnhomeuk-cell wants to merge 4 commits into
bradautomates:mainfrom
kuhnhomeuk-cell:feat/whisper-chunking-and-frame-clamp

kuhnhomeuk-cell commented May 30, 2026

Uh oh!

joweiser commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kuhnhomeuk-cell commented May 30, 2026

Why

1. fix(frames): clamp output dims to ≤1998px so the Read tool can ingest them

2. feat(whisper): auto-chunk audio when extracted file exceeds 25 MB upload cap

3. feat(watch): write transcript to file by default; preview-only in report

Scope

Not in this PR

Uh oh!

joweiser commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `fix(frames): clamp output dims to ≤1998px so the Read tool can ingest them`

2. `feat(whisper): auto-chunk audio when extracted file exceeds 25 MB upload cap`

3. `feat(watch): write transcript to file by default; preview-only in report`