Add Deepgram backend: transcribe audio over Whisper's 25 MB limit (45 min+ videos) #35
Open
drlee91 wants to merge 1 commit into
…limit
Groq and OpenAI Whisper both reject uploads larger than 25 MB (HTTP 413).
At the 64 kbps mono extraction rate that is roughly 45 minutes of audio,
so any longer caption-less video currently comes back frames-only with
no way to get a transcript.
This adds Deepgram's pre-recorded API (nova-2) as a third backend:
- Audio <= 24 MB: unchanged - Whisper (Groq preferred, OpenAI fallback).
Deepgram now also catches the case where Whisper errors out.
- Audio > 24 MB: routed straight to Deepgram, which accepts files up to
~2 GB, instead of failing with 413.
- Standalone: when DEEPGRAM_API_KEY is the only key configured, Deepgram
serves as the primary backend for all sizes.
- --whisper deepgram forces it explicitly.
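The routing rules above can be sketched as a single selection function. This is a minimal illustration, not the PR's actual code: `choose_backend`, the `keys` set, and the byte threshold constant are hypothetical names, and it assumes "24 MB" means 24 MiB.

```python
from typing import Optional, Set

# Assumption: the 24 MB cutoff is counted in MiB; the PR does not say which.
SIZE_THRESHOLD_BYTES = 24 * 1024 * 1024


def choose_backend(
    audio_bytes: int, keys: Set[str], forced: Optional[str] = None
) -> Optional[str]:
    """Return the backend to try first, or None if no key is configured."""
    if forced is not None:
        # Mirrors `--whisper deepgram`: an explicit choice always wins.
        return forced
    # Existing Whisper path: Groq preferred, OpenAI as fallback.
    whisper = next((name for name in ("groq", "openai") if name in keys), None)
    if audio_bytes > SIZE_THRESHOLD_BYTES and "deepgram" in keys:
        # Large files skip Whisper entirely instead of failing with 413.
        return "deepgram"
    if whisper is not None:
        return whisper
    # Deepgram-only configuration: Deepgram is primary for all sizes.
    return "deepgram" if "deepgram" in keys else None
```

The sketch only picks the first backend to try. The runtime fallback, where Deepgram catches a Whisper error on a small file, would sit in the caller and is not modeled here.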
Implementation follows the existing whisper.py conventions: pure stdlib
(urllib, no SDK), same {start, end, text} segment shape, same SystemExit
error contract, keys via env or ~/.config/watch/.env. The module is named
deepgram_backend (not deepgram) so it can never shadow the real Deepgram
SDK if one is installed.
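For readers unfamiliar with calling Deepgram without the SDK, a stdlib-only request might look like the sketch below. It is an illustration under stated assumptions, not the module's real code: `build_request` and `transcribe` are hypothetical names, the `audio/mpeg` content type and the 600-second timeout are guesses, and the query parameters are the ones the PR mentions (`nova-2`, utterances, language detection).

```python
import json
import urllib.error
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(
    audio: bytes, api_key: str, content_type: str = "audio/mpeg"
) -> urllib.request.Request:
    """Build the POST request; the raw audio bytes are the request body."""
    query = urllib.parse.urlencode(
        {"model": "nova-2", "utterances": "true", "detect_language": "true"}
    )
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{query}",
        data=audio,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            # Assumption: the extracted audio is MP3; adjust to the real format.
            "Content-Type": content_type,
        },
    )


def transcribe(audio: bytes, api_key: str) -> dict:
    """Send the audio and return the parsed JSON response."""
    try:
        with urllib.request.urlopen(build_request(audio, api_key), timeout=600) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as exc:
        # Same error contract as the Whisper path: fail with SystemExit.
        raise SystemExit(f"Deepgram request failed: HTTP {exc.code}")
    except urllib.error.URLError as exc:
        raise SystemExit(f"Deepgram request failed: {exc.reason}")
```

Keeping request construction separate from the network call makes the URL and headers checkable without an API key.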
setup.py accepts DEEPGRAM_API_KEY as a valid transcription key, scaffolds
a placeholder for it, and mentions it in the install hints. SKILL.md and
README document the new backend.
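Key lookup, as described in the implementation notes (environment first, then `~/.config/watch/.env`), could be sketched as follows. `read_env_file` and `resolve_key` are hypothetical helpers, and the `.env` parsing is deliberately simplistic; the PR description also lists `./.env` as a final fallback, which the default search path below includes.

```python
import os
from pathlib import Path
from typing import Iterable, Mapping, Optional


def read_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_key(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    env_files: Optional[Iterable[Path]] = None,
) -> Optional[str]:
    """Return the first non-empty value for `name`, or None if not found."""
    env = os.environ if env is None else env
    if env.get(name):
        return env[name]
    if env_files is None:
        # Assumed search order: user config first, then the working directory.
        env_files = [Path.home() / ".config" / "watch" / ".env", Path(".env")]
    for path in env_files:
        value = read_env_file(path).get(name)
        if value:
            return value
    return None
```

Because commented lines are skipped, a scaffolded placeholder such as a commented-out `DEEPGRAM_API_KEY=` line would not count as a configured key in this sketch.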
Tested end-to-end on Windows with a 59-minute YouTube video (27 MB audio,
native captions unavailable due to YouTube 429): routed to Deepgram and
returned 628 timestamped segments. Also verified: small file still
prefers Groq; --whisper deepgram forces Deepgram with a Groq key present;
Deepgram-only config auto-selects Deepgram; no key at all produces the
updated guidance message.
Problem
Groq and OpenAI Whisper both reject uploads over 25 MB with HTTP 413. At the 64 kbps mono extraction rate that's roughly 45 minutes of audio, so any longer caption-less video currently comes back frames-only, with no transcript possible at all.
This bites in practice with long tutorials, podcasts, talks, and VODs. YouTube caption pulls can also fail with 429 rate limits, which makes the audio fallback the only path for those videos.
Approach
Add Deepgram's pre-recorded API (`nova-2`) as a third backend. Its file limit is ~2 GB, so the 25 MB cliff disappears. Routing keeps current behavior unchanged for existing users:
- When `DEEPGRAM_API_KEY` is the only configured key, Deepgram serves as the primary backend.
- `--whisper deepgram` forces it.
Design notes
- Follows the existing `whisper.py` conventions: `urllib` only, no SDK dependency. Same `{start, end, text}` segment shape (from Deepgram `utterances`, with paragraph/transcript fallbacks), same `SystemExit` error contract, same key resolution (env → `~/.config/watch/.env` → `./.env`).
- `detect_language=true` keeps it language-agnostic like the Whisper path.
- The module is named `deepgram_backend` (not `deepgram`) so it can never shadow or be shadowed by the real Deepgram SDK.
- `setup.py` accepts `DEEPGRAM_API_KEY` as a valid transcription key, scaffolds a commented placeholder, and includes it in the install hints. `--check`/`--json` semantics are unchanged otherwise.
Verification
- `--whisper deepgram` with a Groq key present → Deepgram used.
- Only `DEEPGRAM_API_KEY` configured → Deepgram auto-selected as primary; `setup.py --json` reports `ready`.
- `--check` exits 3 as before.
Docs (README + SKILL.md) updated accordingly.
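The segment conversion mentioned in the design notes (utterances first, then paragraph and transcript fallbacks) might look roughly like this. It is a sketch under assumptions: `to_segments` is a hypothetical name, the field paths follow Deepgram's pre-recorded response layout as commonly documented, and using the reported duration as the end time of a single transcript segment is an illustrative choice rather than the PR's confirmed behavior.

```python
def to_segments(payload: dict) -> list:
    """Convert a Deepgram response into {start, end, text} segments."""
    results = payload.get("results", {})

    # Preferred source: utterances, each with its own start/end times.
    utterances = results.get("utterances") or []
    if utterances:
        return [
            {"start": u["start"], "end": u["end"], "text": u["transcript"].strip()}
            for u in utterances
        ]

    channels = results.get("channels") or [{}]
    alternatives = channels[0].get("alternatives") or [{}]
    alt = alternatives[0]

    # First fallback: paragraphs, joining each paragraph's sentences.
    paragraphs = (alt.get("paragraphs") or {}).get("paragraphs") or []
    if paragraphs:
        return [
            {
                "start": p["start"],
                "end": p["end"],
                "text": " ".join(s["text"] for s in p.get("sentences", [])).strip(),
            }
            for p in paragraphs
        ]

    # Last fallback: the whole transcript as one segment.
    transcript = (alt.get("transcript") or "").strip()
    if transcript:
        duration = payload.get("metadata", {}).get("duration", 0.0)
        return [{"start": 0.0, "end": duration, "text": transcript}]
    return []
```

Returning an empty list when nothing usable is present leaves the caller free to decide whether that counts as an error.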