feat: add optional TwelveLabs Pegasus provider (--provider twelvelabs) for long-video, low-token analysis #36
Open
mohit-twelvelabs wants to merge 1 commit into
Add an opt-in parser backend that hands the video to TwelveLabs' Pegasus video-language model for on-the-fly analysis instead of extracting frames and a Whisper transcript. Pegasus analyzes the video server-side (pixels plus its own audio ASR) and returns a verbatim, timestamped transcript plus a scene-by-scene visual walkthrough as TEXT, so Claude reads a few KB instead of 80-100 frame images. This removes the per-frame image-token cost and the context-length ceiling that make long videos expensive in the default frames mode, and needs no Whisper key.

- Default provider stays `frames`; existing behavior is unchanged.
- Videos longer than `--chunk-minutes` (default 30) are split with `ffmpeg -c copy`, analyzed per-chunk, and merged into one report with absolute-timestamp segment headings. Chunk length auto-shrinks and re-splits to stay under the 200 MB direct-upload cap.
- New `scripts/twelvelabs.py` (pure-stdlib REST client: asset upload, async analyze task, polling; mirrors `whisper.py`) and `scripts/chunk.py` (segmenting + trimming).
- `setup.py` scaffolds an optional `TWELVELABS_API_KEY`, treats it as satisfying preflight, and reports `has_twelvelabs_key` in `--json`.
- New flags: `--provider`, `--tl-model`, `--tl-prompt`, `--tl-max-tokens`, `--tl-temperature`, `--chunk-minutes`. `max_tokens` is clamped to the model range; sub-4s clips and the multipart filename are guarded.
- `tests/test_provider.py` covers chunk planning, segment offsets, prompt building, result extraction, `max_tokens` clamping, and filename sanitization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
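For reviewers who want a feel for the chunking rule before opening `scripts/chunk.py`, here is a minimal sketch of the idea. It is not the code in this PR. The function names (`plan_chunks`, `segment_heading`, `ffmpeg_copy_args`) are hypothetical, the size check is simplified to an average-bitrate estimate rather than a real re-split, "200 MB" is assumed to mean decimal megabytes, and the `-ss`/`-t` placement is one common way to cut with `ffmpeg -c copy`. The sub-4s guard is left out.

```python
# Illustrative only: names and the bitrate-based estimate are assumptions,
# not the implementation in scripts/chunk.py.
UPLOAD_CAP_BYTES = 200 * 1000 * 1000  # assumption: "200 MB" as decimal megabytes


def plan_chunks(duration_s, file_bytes, chunk_minutes=30, cap_bytes=UPLOAD_CAP_BYTES):
    """Return (start_s, length_s) pairs covering a video of duration_s whole seconds."""
    chunk_s = int(chunk_minutes * 60)
    if file_bytes > 0 and duration_s > 0:
        # Longest chunk whose estimated size (average bitrate) does not exceed the cap.
        fit_s = cap_bytes * duration_s // file_bytes
        chunk_s = min(chunk_s, fit_s)
    chunk_s = max(chunk_s, 1)  # never plan zero-length chunks
    return [(start, min(chunk_s, duration_s - start))
            for start in range(0, duration_s, chunk_s)]


def hms(seconds):
    """Format whole seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segment_heading(start_s, length_s):
    """Heading that uses timestamps relative to the full video, not the chunk."""
    return f"Segment {hms(start_s)} - {hms(start_s + length_s)}"


def ffmpeg_copy_args(src, dst, start_s, length_s):
    """Argument list for a stream-copy cut (no re-encode)."""
    return ["ffmpeg", "-ss", str(start_s), "-t", str(length_s),
            "-i", src, "-c", "copy", dst]
```

With these assumptions, a one-hour, 800 MB file would be planned as four 15-minute chunks instead of two 30-minute ones, since a 30-minute chunk would be estimated at roughly 400 MB.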
Who I am
Hi! I'm Mohit, and I work at TwelveLabs. I use `/watch` and wanted to contribute an integration with our video model as an optional parser.
What TwelveLabs / Pegasus is
TwelveLabs builds video-understanding foundation models. Pegasus is its generative video-language model: you point it at a video and it returns text grounded in the actual pixels and audio. It performs its own ASR and produces timestamped, temporally grounded output with no pre-indexing required.
What this adds
The default frames pipeline is great for short clips and pixel-level inspection, but every frame is an image. That means token cost and context limits can make long videos a sparse, expensive scan.
This PR adds `--provider twelvelabs`. Instead of extracting frames and a Whisper transcript, it hands the whole video to Pegasus, which analyzes it server-side and returns text: a verbatim, timestamped transcript plus a scene-by-scene visual walkthrough.
Claude then reads a few KB of text instead of 80–100 JPEGs, so there is no per-frame image-token cost, no context-length ceiling, and no Whisper key required because Pegasus does its own ASR.
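As a rough picture of the opt-in switch, the flags could be declared along these lines. This is a sketch, not the parser in this PR. The flag names and the `frames` and 30-minute defaults come from the description; the program name, the value types, and the `None` defaults for the `--tl-*` flags are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical CLI declaration; only flag names and two defaults are from the PR."""
    parser = argparse.ArgumentParser(prog="watch")  # program name is assumed
    parser.add_argument("--provider", choices=["frames", "twelvelabs"],
                        default="frames")  # default path stays frames
    parser.add_argument("--chunk-minutes", type=float, default=30.0)
    # Defaults for the Pegasus-specific flags are not stated here, so None is used.
    parser.add_argument("--tl-model", default=None)
    parser.add_argument("--tl-prompt", default=None)
    parser.add_argument("--tl-max-tokens", type=int, default=None)
    parser.add_argument("--tl-temperature", type=float, default=None)
    return parser
```

Running with no flags leaves `provider` at `frames`, which is the point of making the new backend opt-in.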
Highlights
- `--provider` defaults to `frames`; nothing about the existing path changes.
- A video longer than `--chunk-minutes` is split with `ffmpeg -c copy`, analyzed per chunk, and merged into one report with absolute-timestamp segment headings.
- The client mirrors `whisper.py` and adds no new pip dependencies.
- `scripts/twelvelabs.py`: REST client for asset upload, async analyze task creation, and polling.
- `scripts/chunk.py`: segmenting and trimming utilities.
- `setup.py` scaffolds an optional `TWELVELABS_API_KEY`, treats it as satisfying preflight, and reports `has_twelvelabs_key` in `--json`.
- New flags: `--provider`, `--tl-model`, `--tl-prompt`, `--tl-max-tokens`, `--tl-temperature`, `--chunk-minutes`.
- `max_tokens` is clamped to the model range, sub-4s clips are guarded, and multipart filenames are sanitized.
- `tests/test_provider.py` uses stdlib `unittest` and covers chunk planning, segment offsets, prompt building, result extraction, token clamping, and filename sanitization.
Testing
Verified end-to-end against the live TwelveLabs API on both:
Also verified that the default `frames` path is unchanged.
All unit tests pass.
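To give a sense of the kind of pure-function checks involved, here is a small sketch of token clamping and filename sanitization with `unittest` cases. It is illustrative rather than a copy of `tests/test_provider.py`. The helper names are hypothetical, and the token bounds are placeholders because the actual Pegasus range is not stated in this description.

```python
import re
import unittest

# Placeholder bounds: the real model range is not given in this PR description.
MIN_TOKENS, MAX_TOKENS = 1, 4096


def clamp_max_tokens(value, lo=MIN_TOKENS, hi=MAX_TOKENS):
    """Force a requested max_tokens value into the [lo, hi] range."""
    return max(lo, min(hi, int(value)))


def safe_filename(name, fallback="video.mp4"):
    """Replace anything outside A-Z, a-z, 0-9, '.', '_', '-' so the name is
    safe to place in a multipart Content-Disposition header."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", name).strip("._")
    return cleaned or fallback


class ProviderHelpersTest(unittest.TestCase):
    def test_clamp_bounds(self):
        self.assertEqual(clamp_max_tokens(10**9), MAX_TOKENS)
        self.assertEqual(clamp_max_tokens(0), MIN_TOKENS)
        self.assertEqual(clamp_max_tokens(512), 512)

    def test_filename_has_no_quotes_or_newlines(self):
        cleaned = safe_filename('my "clip"\r\n.mp4')
        self.assertNotIn('"', cleaned)
        self.assertNotIn("\n", cleaned)

    def test_empty_name_falls_back(self):
        self.assertEqual(safe_filename(""), "video.mp4")
```

Saved as a module, a file like this would run with `python -m unittest` and needs nothing beyond the standard library.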
When to use this provider
This is useful for long videos, tight token budgets, summary/Q&A workflows, and transcript-heavy use cases.
The existing `frames` provider remains best for short clips and exact visual detail. This is an alternative, not a replacement.
Requirements
Requires only a `TWELVELABS_API_KEY`. A free tier is available.
Thanks for building `/watch`