feat: add optional TwelveLabs Pegasus provider (--provider twelvelabs) for long-video, low-token analysis #36

Open: mohit-twelvelabs wants to merge 1 commit into bradautomates:main from mohit-twelvelabs:feat/twelvelabs-pegasus-provider

@mohit-twelvelabs

Who I am

Hi! I'm Mohit, and I work at TwelveLabs. I use /watch and wanted to contribute an integration with our video model as an optional parser.

What TwelveLabs / Pegasus is

TwelveLabs builds video-understanding foundation models. Pegasus is its generative video-language model: you point it at a video and it returns text grounded in the actual pixels and audio. It performs its own ASR and produces timestamped, temporally grounded output with no pre-indexing required.

What this adds

The default frames pipeline is great for short clips and pixel-level inspection, but every frame is sent as an image. On long videos, per-image token cost and context limits force sparse frame sampling, so the scan ends up both expensive and coarse.

This PR adds --provider twelvelabs. Instead of extracting frames and a Whisper transcript, it hands the whole video to Pegasus, which analyzes it server-side and returns text: a verbatim, timestamped transcript plus a scene-by-scene visual walkthrough.

Claude then reads a few KB of text instead of 80–100 JPEGs, so there is no per-frame image-token cost, no context-length ceiling, and no Whisper key required because Pegasus does its own ASR.

Highlights

  • Default unchanged: --provider defaults to frames; nothing about the existing path changes.
  • Long videos: anything over --chunk-minutes is split with ffmpeg -c copy, analyzed per chunk, and merged into one report with absolute-timestamp segment headings.
  • Auto chunk resizing: chunk length auto-shrinks and re-splits to stay under TwelveLabs' 200 MB direct-upload cap.
  • Pure stdlib: mirrors whisper.py and adds no new pip dependencies.
  • New scripts:
    • scripts/twelvelabs.py: REST client for asset upload, async analyze task creation, and polling.
    • scripts/chunk.py: segmenting and trimming utilities.
  • Setup support: setup.py scaffolds an optional TWELVELABS_API_KEY, treats it as satisfying preflight, and reports has_twelvelabs_key in --json.
  • New flags:
    • --provider
    • --tl-model
    • --tl-prompt
    • --tl-max-tokens
    • --tl-temperature
    • --chunk-minutes
  • Safety/edge-case handling: max_tokens is clamped to the model range, sub-4s clips are guarded, and multipart filenames are sanitized.
  • Tests: tests/test_provider.py uses stdlib unittest and covers chunk planning, segment offsets, prompt building, result extraction, token clamping, and filename sanitization.
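
To illustrate the chunking highlights above, here is a minimal sketch of chunk planning, size-based shrinking, and the stream-copy split command. It is not the code in scripts/chunk.py: the function names, the proportional-shrink rule, and the 10% safety margin are assumptions made for illustration. Only the 200 MB cap, `ffmpeg -c copy`, and the --chunk-minutes concept come from this PR.

```python
from __future__ import annotations

import math

# 200 MB direct-upload cap cited in this PR.
UPLOAD_CAP_BYTES = 200 * 1024 * 1024


def plan_chunks(duration_s: float, chunk_minutes: float) -> list[tuple[float, float]]:
    """Return (start_seconds, length_seconds) pairs that cover the whole video."""
    chunk_s = chunk_minutes * 60
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration and chunk length must be positive")
    count = math.ceil(duration_s / chunk_s)
    # The last chunk is shortened so the plan never runs past the end of the video.
    return [(i * chunk_s, min(chunk_s, duration_s - i * chunk_s)) for i in range(count)]


def shrink_chunk_minutes(chunk_minutes: float, largest_chunk_bytes: int,
                         cap: int = UPLOAD_CAP_BYTES) -> float:
    """Hypothetical shrink rule: scale the chunk length by how far the largest
    chunk overshoots the cap, with a 10% margin (an assumption, not the PR's rule)."""
    if largest_chunk_bytes <= cap:
        return chunk_minutes
    return chunk_minutes * cap / largest_chunk_bytes * 0.9


def ffmpeg_split_cmd(src: str, start_s: float, length_s: float, dst: str) -> list[str]:
    """Build an ffmpeg argument list that cuts one chunk without re-encoding."""
    return ["ffmpeg", "-y", "-ss", f"{start_s:.3f}", "-i", src,
            "-t", f"{length_s:.3f}", "-c", "copy", dst]
```

For a 75-minute video with 30-minute chunks, `plan_chunks(4500, 30)` yields three chunks starting at 0, 1800, and 3600 seconds, the last one 900 seconds long. Because `-c copy` cuts on keyframes, real chunk boundaries may drift slightly from the planned offsets. This sketch also does not cover the sub-4s guard mentioned in the PR.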

Testing

Verified end-to-end against the live TwelveLabs API on both:

  • single-clip videos
  • chunked videos with multi-segment analysis and merged absolute offsets
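
For readers unfamiliar with what "merged absolute offsets" means here, the following is a rough sketch of how per-chunk reports could be joined under absolute-timestamp headings. The helper names and the heading format are hypothetical; the PR's actual output format may differ.

```python
from __future__ import annotations


def fmt_ts(seconds: float) -> str:
    """Format a second count as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def merge_reports(segments: list[tuple[float, float, str]]) -> str:
    """Join per-chunk text under headings that carry absolute video time.

    Each segment is (start_seconds, length_seconds, report_text), where
    start_seconds is the chunk's offset in the original video.
    """
    parts = []
    for index, (start_s, length_s, text) in enumerate(segments, start=1):
        heading = f"## Segment {index} ({fmt_ts(start_s)} to {fmt_ts(start_s + length_s)})"
        parts.append(f"{heading}\n\n{text.strip()}")
    return "\n\n".join(parts)
```

In this sketch only the headings are absolute; any timestamps inside a chunk's text stay relative to that chunk. Whether the PR also rewrites in-text timestamps is not stated in the description.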

Also verified that the default frames path is unchanged.

All unit tests pass.

When to use this provider

This is useful for long videos, tight token budgets, summary/Q&A workflows, and transcript-heavy use cases.

The existing frames provider remains best for short clips and exact visual detail. This is an alternative, not a replacement.

Requirements

Requires only a TWELVELABS_API_KEY. A free tier is available.

Thanks for building /watch!

Add an opt-in parser backend that hands the video to TwelveLabs' Pegasus
video-language model for on-the-fly analysis instead of extracting frames and a
Whisper transcript. Pegasus analyzes the video server-side (pixels + its own
audio ASR) and returns a verbatim, timestamped transcript plus a scene-by-scene
visual walkthrough as TEXT — so Claude reads a few KB instead of 80-100 frame
images. This removes the per-frame image-token cost and the context-length
ceiling that make long videos expensive in the default frames mode, and needs
no Whisper key.

- Default provider stays `frames`; existing behavior is unchanged.
- Videos longer than --chunk-minutes (default 30) are split with
  `ffmpeg -c copy`, analyzed per-chunk, and merged into one report with
  absolute-timestamp segment headings. Chunk length auto-shrinks and re-splits
  to stay under the 200 MB direct-upload cap.
- New scripts/twelvelabs.py (pure-stdlib REST client: asset upload, async
  analyze task, polling — mirrors whisper.py) and scripts/chunk.py
  (segmenting + trimming).
- setup.py scaffolds an optional TWELVELABS_API_KEY, treats it as satisfying
  preflight, and reports has_twelvelabs_key in --json.
- New flags: --provider, --tl-model, --tl-prompt, --tl-max-tokens,
  --tl-temperature, --chunk-minutes. max_tokens is clamped to the model range;
  sub-4s clips and the multipart filename are guarded.
- tests/test_provider.py covers chunk planning, segment offsets, prompt
  building, result extraction, max_tokens clamping, and filename sanitization.
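
As an illustration of the two guards named in the list above, here is a small stdlib-only sketch of max_tokens clamping and multipart filename sanitization. The function names, the allowed-character set, and the fallback filename are assumptions; the real bounds come from the model's documented range and are passed in rather than hardcoded.

```python
import re


def clamp_max_tokens(requested: int, low: int, high: int) -> int:
    """Force a requested token budget into [low, high].

    low and high are supplied by the caller; this sketch does not assume
    any particular model range.
    """
    return max(low, min(high, int(requested)))


def sanitize_filename(name: str, fallback: str = "video.mp4") -> str:
    """Make a filename safe to embed in a multipart Content-Disposition header.

    Drops any directory part, then replaces everything outside a conservative
    character set (including quotes, CR, and LF) with underscores.
    """
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).strip("._")
    return cleaned or fallback
```

Stripping quotes and line breaks matters because the filename is interpolated into a header line of the multipart body, where an unescaped `"` or CRLF could corrupt the request.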

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>