Three robustness fixes: frame dim clamp, Whisper auto-chunking, transcript-to-file #28

Open
kuhnhomeuk-cell wants to merge 4 commits into bradautomates:main from kuhnhomeuk-cell:feat/whisper-chunking-and-frame-clamp

@kuhnhomeuk-cell

Why

Three independent failure modes I hit while pointing /watch at long-form local screen recordings (a 65-min and a 25-min phone capture of a webinar). Each one is documented in the body below with the exact symptom, what's going wrong inside the script, and a verification against the original input.

The commits stand alone — you can take one, two, or all three. They're ordered shortest-to-longest in the branch.


1. fix(frames): clamp output dims to ≤1998px so the Read tool can ingest them

Symptom. A 1320×2868 portrait phone screen recording at --resolution 1024 produced 1024×2224 frames that Claude's Read tool silently rejected (it caps both dimensions at 2000px). The pipeline ran clean, frame paths were listed, and Claude was nominally given the frames — but every Read returned blank, leaving the entire report transcript-only. There was no error to surface.

Root cause. frames.extract passes scale={resolution}:-2 to ffmpeg, which only constrains the width. Tall portrait sources blow past Read's cap on the height axis.

Fix. frames.py now probes source dims via ffprobe and computes an explicit W:H that respects --resolution as a width hint but rescales the longer edge down to 1998px when needed. Aspect ratio is preserved; both dims are forced even (libx264 requirement). A stderr warning fires when clamping triggers so the cause isn't opaque.
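As a rough illustration of the clamp described above, the dimension math could be sketched like this. The name `clamp_frame_dims`, its parameters, and the floor-to-even rounding are assumptions picked to match the 918×1998 result in the verification, not the code in frames.py. Integer arithmetic is used so the long edge lands exactly on the cap.

```python
def clamp_frame_dims(src_w: int, src_h: int, requested_w: int, max_edge: int = 1998):
    """Return (width, height, clamped) for an ffmpeg scale=W:H filter.

    Hypothetical helper: treats requested_w as a width hint, then shrinks
    the longer edge to max_edge if either output dimension would exceed it.
    """
    def even(n: int) -> int:
        # Floor to the nearest even integer, never below 2.
        return max(2, n // 2 * 2)

    w = even(requested_w)
    h = even(src_h * requested_w // src_w)
    if max(w, h) <= max_edge:
        return w, h, False

    if h >= w:
        # Portrait: pin the height to the cap, derive width from the source aspect ratio.
        h = even(max_edge)
        w = even(src_w * h // src_h)
    else:
        # Landscape: pin the width to the cap, derive height from the source aspect ratio.
        w = even(max_edge)
        h = even(src_h * w // src_w)
    return w, h, True


print(clamp_frame_dims(1320, 2868, 1024))  # portrait phone capture
print(clamp_frame_dims(1920, 1080, 1024))  # landscape source
```

The returned pair would then be formatted into the filter string (for example `scale=918:1998`) in place of `scale={resolution}:-2`, and the `clamped` flag would drive the stderr warning.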

Verification.

$ python3 scripts/frames.py portrait.mp4 out/ --resolution 1024 --start 0 --end 2 --fps 1
[watch] source 1320x2868 at requested width 1024 would have produced 1024x2225
        (Claude's Read tool rejects any edge >1998px). Clamped to 918x1998.

$ sips -g pixelWidth -g pixelHeight out/frame_0001.jpg
  pixelWidth: 918
  pixelHeight: 1998

Landscape sources are unaffected — a 1920×1080 input at --resolution 1024 still produces 1024×576 as before (verified against an ffmpeg lavfi testsrc).


2. feat(whisper): auto-chunk audio when extracted file exceeds 25 MB upload cap

Symptom. A 65-min webinar with no captions returned no transcript. whisper.py extracts mono 16 kHz 64 kbps mp3 (~480 kB/min), which puts any video over ~52 min above Whisper's 25 MB upload cap. The API returns a 4xx and the script falls through to transcript: none available. SKILL.md already lists this as a known failure mode.
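The ~52 min figure follows directly from the bitrate. A quick check, assuming the cap is counted in decimal megabytes (25,000,000 bytes) and ignoring mp3 container overhead; `minutes_until_cap` is a name made up for this note:

```python
def minutes_until_cap(bitrate_bps: int = 64_000, cap_bytes: int = 25_000_000) -> float:
    """Minutes of constant-bitrate audio that fit under an upload cap."""
    bytes_per_minute = bitrate_bps / 8 * 60  # 64 kbps -> 8,000 B/s -> 480,000 B/min
    return cap_bytes / bytes_per_minute


print(round(minutes_until_cap(), 1))  # 52.1
```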

Fix. Two new helpers + a wrapper:

  • _audio_duration_seconds(path) — ffprobe wrapper.
  • _split_audio_into_chunks(audio, dir, max_mb) — ffmpeg segment muxer with -c copy (no re-encode); computes segment length from total bytes so chunks land close to the cap, with a 2% undershoot to absorb keyframe-snap padding. Returns [(path, offset_seconds)] with offsets summed from each chunk's measured duration (not assumed uniform).
  • _post_whisper_chunked(...) — passthrough when ≤22 MB; otherwise splits, uploads sequentially, stitches segments with absolute timestamps so downstream filter_range / format_transcript don't have to know chunking happened.
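To make the segment-length and offset bookkeeping concrete, here is one possible sketch. The function names, the `ceil`-based chunk count, and the way the 2% undershoot is applied are assumptions, not the code in whisper.py. This particular formula happens to give the 735 s spacing reported in the verification for the 1500 s / 34.3 MB case, but the PR's actual computation may differ.

```python
import math


def plan_segment_seconds(total_bytes: int, duration_s: float, max_mb: float,
                         undershoot: float = 0.02) -> float:
    """Pick a segment length (seconds) so each chunk stays under max_mb.

    Hypothetical planner: split into the minimum number of equal parts that
    fit, then shave a small margin off each part.
    """
    max_bytes = max_mb * 1024 * 1024
    n_parts = max(1, math.ceil(total_bytes / max_bytes))
    return duration_s / n_parts * (1.0 - undershoot)


def chunk_offsets(measured_durations):
    """Absolute start offset of each chunk, summed from measured durations."""
    offsets, elapsed = [], 0.0
    for d in measured_durations:
        offsets.append(elapsed)
        elapsed += d
    return offsets


seg = plan_segment_seconds(int(34.3 * 1024 * 1024), 1500.0, 22)
print(round(seg, 1))
print(chunk_offsets([735.0, 735.0, 30.0]))
```

Summing measured durations rather than multiplying an index by the planned length matters because the segment muxer does not guarantee every chunk is exactly the requested length.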

transcribe_video calls _post_whisper_chunked instead of _post_whisper. No public signature changes.

Verification. (with silent ffmpeg lavfi audio — no Whisper credits consumed)

90s / 704 KB → 1 chunk, no split (passthrough confirmed)
1500s / 34.3 MB → 3 chunks, largest 16.8 MB, offsets [0.0, 735.0, 1470.0]
All chunks ≤ 22 MB cap ✓

Stitching path is the same shape as _post_whisper (verbose_json segments with start/end/text), so _segments_from_response consumes it unchanged.
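A minimal sketch of the stitching step, assuming each chunk's response has already been reduced to a list of dicts with `start`, `end`, and `text` keys. `stitch_segments` and its input shape are illustrative, not the PR's actual helper:

```python
def stitch_segments(chunk_results):
    """Merge per-chunk segments into one list with absolute timestamps.

    chunk_results: iterable of (offset_seconds, segments), where segments
    carry chunk-relative start/end times.
    """
    merged = []
    for offset, segments in chunk_results:
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged


stitched = stitch_segments([
    (0.0, [{"start": 1.0, "end": 4.0, "text": "intro"}]),
    (735.0, [{"start": 2.0, "end": 5.0, "text": "second chunk"}]),
])
print(stitched[1]["start"], stitched[1]["end"])  # 737.0 740.0
```

Because the output has the same shape as a single-upload response, anything downstream that filters by time range keeps working without knowing about chunk boundaries.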


3. feat(watch): write transcript to file by default; preview-only in report

Symptom. Long videos dumped the entire transcript into the stdout report as one fenced markdown block. A 60-min video produces 800+ segments → tens of thousands of context tokens consumed on every run, with no way to opt out short of --no-whisper. After a sparse full-video pass the user typically re-focuses into 1-2 regions; re-paying the transcript cost each time is wasteful.

Fix. Watch now:

  • Writes transcript.json (machine-readable) and transcript.md (timestamped, human-readable) to the working directory whenever a transcript exists (captions or Whisper).
  • Prints a head/tail preview in the report — first 30 + last 10 segments — with the on-disk path so Claude can deliberately Read the full file when needed.
  • Adds --inline-transcript for the legacy behavior of dumping the full transcript into stdout.

Preview is suppressed for short transcripts (≤45 segments) where head+tail contains the whole thing anyway.
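The head/tail rule can be sketched as below. The function name and return shape are hypothetical, and "suppressed" is read here as "show every segment when there are 45 or fewer", which is an interpretation of the sentence above rather than confirmed behavior:

```python
def preview_segments(segments, head: int = 30, tail: int = 10, min_total: int = 45):
    """Return (segments_to_print, omitted_count) for the report preview."""
    if len(segments) <= min_total:
        # Short transcript: print everything, nothing omitted.
        return list(segments), 0
    shown = list(segments[:head]) + list(segments[-tail:])
    return shown, len(segments) - head - tail


shown, omitted = preview_segments(list(range(800)))
print(len(shown), omitted)  # 40 760
```

The omitted count gives the report something concrete to print next to the on-disk path, so it is clear how much was left out of the preview.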

No public API changes; no new external dependencies.

Verification. Mocked a 50-segment transcript and ran focused mode:

  • transcript.json (598 B) + transcript.md (186 B) written ✓
  • Report dropped from ~140 lines to ~40 ✓
  • Preview block contains the 6 in-range segments + a file reference ✓

Scope

5 files, +325 / -21. All changes are additive on top of v0.1.3:

CHANGELOG.md       |   9 ++++
SKILL.md           |  10 +++-
scripts/frames.py  |  84 ++++++++++++++++++++++++++++-
scripts/watch.py   |  90 +++++++++++++++++++++++++++----
scripts/whisper.py | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++---

The v0.1.3 path-resolution and UTF-8 hardening from #2 / #4 is preserved on every line I touched.

Not in this PR

I kept a few opinionated extras local to my fork rather than bundle them here, on the theory you may want to discuss them separately or pass:

  • Transcript caching by source-file fingerprint. Lets focused re-runs against the same source skip Whisper entirely. Useful, but adds a cache_dir keyword to transcribe_video and assumes a long-lived working dir — feels like its own discussion.
  • --archive PATH flag for runs you want to keep, with a manifest.json.
  • --ocr / --qr / --dedupe post-passes that shell out to tesseract / zbarimg / dedupe_frames.py when available. All new dependencies; opt-in only; happy to PR separately if any are interesting.

Happy to split, squash, or rework any of the three commits if the shape isn't right.

… them

Claude Code's Read tool rejects images with any dimension >2000px. Portrait
sources at higher --resolution values were silently producing oversized frames
that Claude couldn't see, leaving the pipeline frames-blind without a clear
error. Concrete trigger: a 1320×2868 phone screen recording at --resolution
1024 previously produced 1024×2224 frames, both rejected on Read.

frames.extract now probes the source dims via ffprobe and computes an explicit
W:H pair that respects --resolution as a width hint but rescales the longer
edge down to 1998px when needed. Aspect ratio is preserved; both dimensions
are forced even (libx264 requirement). A stderr warning fires whenever
clamping triggers so the cause isn't opaque.

Landscape sources at any reasonable --resolution are unaffected: a 1920×1080
input at --resolution 1024 still emits 1024×576 as before.

Tested against the actual failure case (1320×2868 portrait, --resolution
1024): output is now 918×1998, both dims accepted by Read.
…oad cap

Whisper's API rejects uploads over 25 MB. With the existing 64 kbps mono
extraction (~480 kB/min), any video over ~52 min returns an HTTP 4xx and the
transcript pipeline falls through to "none available" — the SKILL.md already
calls this out as a known failure mode.

This change adds two helpers and wraps the upload site:

  - `_audio_duration_seconds(path)` — thin ffprobe wrapper.
  - `_split_audio_into_chunks(audio, dir, max_mb)` — ffmpeg segment muxer
    with `-c copy` (no re-encode); computes segment length from total bytes
    so chunks land close to the cap, with a 2% undershoot to absorb the
    keyframe-snap padding. Returns [(path, offset_seconds)] with offsets
    derived from each chunk's measured duration (not assumed uniform).
  - `_post_whisper_chunked(...)` — passthrough when ≤22 MB; otherwise
    splits, uploads sequentially, stitches segments with absolute
    timestamps so downstream `filter_range` / `format_transcript` don't
    have to know chunking happened.

`transcribe_video` now calls `_post_whisper_chunked` instead of
`_post_whisper` directly. No public signature changes.

Manually verified:
  - 90s/704 KB silent audio → 1 chunk, no split (passthrough)
  - 1500s/34 MB silent audio → 3 chunks of ≤16.8 MB, offsets [0, 735, 1470]

Long videos previously dumped the entire transcript into the stdout report as
one fenced markdown block. A 60-minute video produces 800+ segments → tens of
thousands of context tokens consumed on every run, with no way to opt out
short of `--no-whisper`. After a sparse full-video pass the user typically
needs to re-focus into 1-2 regions; re-paying the transcript cost each time
is wasteful.

Watch now:
  - Writes `transcript.json` (machine-readable) and `transcript.md` (timestamped,
    human-readable) to the working directory whenever a transcript is available
    (captions or Whisper).
  - Prints a head/tail preview in the report — first 30 + last 10 segments —
    with the on-disk path so Claude can deliberately Read the full file when
    needed.
  - Adds `--inline-transcript` for the legacy behavior of dumping the full
    transcript into stdout.

Preview is suppressed for short transcripts (≤45 segments) where the head+tail
contains the whole thing anyway.

No public API changes; no new external dependencies. Tested against captions
and Whisper paths with focused and full ranges.

- Adds Unreleased section to CHANGELOG documenting the frame dim clamp,
  Whisper auto-chunking, and --inline-transcript flag.
- Documents --inline-transcript and the auto-clamp / auto-chunk behaviors
  in SKILL.md's flags table.
- Removes the now-stale "25 MB upload limit" caveat from the failure-modes
  section (it's handled by chunking automatically), and adds a note for the
  portrait-source auto-clamp warning so the diagnostic isn't surprising.

@joweiser

@kuhnhomeuk-cell thanks for this - I pulled two parts of it into my own fork. I had already done the chunking myself - our clankers probably came to the same conclusions.
