Add Deepgram backend: transcribe audio over Whisper's 25 MB limit (45 min+ videos)#35

Open
drlee91 wants to merge 1 commit into bradautomates:main from drlee91:feat/deepgram-transcription

drlee91 commented Jun 9, 2026

Problem

Groq and OpenAI Whisper both reject uploads over 25 MB with HTTP 413. At the 64 kbps mono extraction rate that's roughly 45 minutes of audio — so any longer caption-less video currently comes back frames-only, with no transcript possible at all:

[watch] audio: 27625 kB — uploading to groq Whisper…
[watch] whisper fallback failed: Whisper request failed: HTTP Error 413: Payload Too Large
       — {"error":{"message":"Request Entity Too Large",...,"code":"request_too_large"}}

This bites in practice: long tutorials, podcasts, talks, VODs — and YouTube caption pulls can also fail with 429 rate limits, which makes the audio fallback the only path for those videos.

Approach

Add Deepgram's pre-recorded API (nova-2) as a third backend. Its file limit is ~2 GB, so the 25 MB cliff disappears. Routing keeps current behavior unchanged for existing users:

  • Audio ≤ 24 MB: exactly as before — Groq preferred, OpenAI fallback. Deepgram additionally catches the case where Whisper errors out.
  • Audio > 24 MB: routed straight to Deepgram instead of failing with 413. Without a Deepgram key, Whisper is still attempted so the user sees the clear 413 error.
  • Standalone: if DEEPGRAM_API_KEY is the only configured key, Deepgram serves as the primary backend.
  • Override: --whisper deepgram forces it.
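
The routing rules above can be sketched as a small selection function. This is only an illustration of the described behavior, not the code in this PR: `choose_backends`, `SIZE_LIMIT_BYTES`, and the idea of returning an ordered list of candidates are assumptions made for the example.

```python
from typing import Iterable, List, Optional

# Assumed threshold: the description says "24 MB"; the exact byte count
# used by the real code is not shown in this PR text.
SIZE_LIMIT_BYTES = 24 * 1024 * 1024


def choose_backends(
    audio_bytes: int,
    configured: Iterable[str],
    forced: Optional[str] = None,
) -> List[str]:
    """Return the backends to try, in order (hypothetical helper).

    `configured` holds the names of backends that have an API key set,
    e.g. {"groq", "deepgram"}.
    """
    if forced:
        # --whisper <name> overrides all automatic routing.
        return [forced]

    configured = set(configured)
    # Groq is preferred over OpenAI, as in the existing behavior.
    whisper = [name for name in ("groq", "openai") if name in configured]
    deepgram = ["deepgram"] if "deepgram" in configured else []

    if audio_bytes > SIZE_LIMIT_BYTES:
        # Oversized audio goes straight to Deepgram. Without a Deepgram
        # key, Whisper is still attempted so the user sees the 413 error.
        return deepgram or whisper

    # Small audio: Whisper first, Deepgram only as a catch-all fallback.
    # With Deepgram as the only key, this collapses to ["deepgram"].
    return whisper + deepgram
```

For example, `choose_backends(30 * 1024 * 1024, {"groq", "deepgram"})` yields `["deepgram"]`, while the same call for a small clip yields `["groq", "deepgram"]`.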

Design notes

  • Pure stdlib, matching whisper.py: urllib only, no SDK dependency. Same {start, end, text} segment shape (from Deepgram utterances, with paragraph/transcript fallbacks), same SystemExit error contract, same key resolution (env → ~/.config/watch/.env → ./.env).
  • detect_language=true keeps it language-agnostic like the Whisper path.
  • The module is named deepgram_backend (not deepgram) so it can never shadow or be shadowed by the real Deepgram SDK.
  • setup.py accepts DEEPGRAM_API_KEY as a valid transcription key, scaffolds a commented placeholder, and includes it in the install hints. --check/--json semantics are unchanged otherwise.
  • Considered chunked uploads to stay within Whisper's 25 MB instead, but that needs segment-boundary handling, per-chunk timestamp offsetting, and N sequential uploads — a second provider that follows the existing backend contract is simpler and also gives users a provider choice. Happy to adjust if you'd prefer the chunking route.
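
For reviewers unfamiliar with Deepgram's pre-recorded endpoint, here is a minimal stdlib-only sketch of the request and of the mapping into {start, end, text} segments. It is not the contents of deepgram_backend: the function names, the `audio/mpeg` content type, the timeout, and the exact fallback order are assumptions, and the response fields reflect my understanding of Deepgram's JSON shape.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(audio: bytes, api_key: str,
                  content_type: str = "audio/mpeg") -> urllib.request.Request:
    """Build the POST request; the raw audio bytes are the request body."""
    query = urllib.parse.urlencode({
        "model": "nova-2",
        "utterances": "true",       # ask for timestamped utterances
        "detect_language": "true",  # stay language-agnostic
    })
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{query}",
        data=audio,
        method="POST",
        headers={"Authorization": f"Token {api_key}",
                 "Content-Type": content_type},
    )


def segments_from_response(payload: dict) -> list:
    """Map a Deepgram response to [{start, end, text}, ...]."""
    results = payload.get("results") or {}

    # Preferred source: utterances, each with start, end and transcript.
    segments = [
        {"start": u["start"], "end": u["end"], "text": u["transcript"].strip()}
        for u in results.get("utterances") or []
        if (u.get("transcript") or "").strip()
    ]
    if segments:
        return segments

    channels = results.get("channels") or [{}]
    alt = (channels[0].get("alternatives") or [{}])[0]

    # First fallback: paragraphs (assumed to be present only when requested).
    for para in (alt.get("paragraphs") or {}).get("paragraphs") or []:
        text = " ".join(s.get("text", "") for s in para.get("sentences") or [])
        if text.strip():
            segments.append({"start": para.get("start", 0.0),
                             "end": para.get("end", 0.0),
                             "text": text.strip()})
    if segments:
        return segments

    # Last fallback: one segment spanning the whole transcript.
    text = (alt.get("transcript") or "").strip()
    if not text:
        return []
    duration = (payload.get("metadata") or {}).get("duration", 0.0)
    return [{"start": 0.0, "end": duration, "text": text}]


def transcribe(path: str, api_key: str) -> list:
    """Upload a file and return segments; exits on HTTP errors."""
    with open(path, "rb") as fh:
        request = build_request(fh.read(), api_key)
    try:
        with urllib.request.urlopen(request, timeout=600) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as exc:
        # Mirrors the SystemExit error contract described above.
        raise SystemExit(f"Deepgram request failed: {exc}")
    return segments_from_response(payload)
```

Keeping request construction and response parsing as separate functions lets the segment mapping be tested without network access.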

Verification

  • 59-min YouTube video (caption pull 429'd, audio 27 MB): routed to Deepgram, returned 628 timestamped segments, full report, exit 0. Previously: frames-only.
  • Small file regression: 40 s local clip without flags → still transcribed via Groq (unchanged default).
  • Forced: --whisper deepgram with a Groq key present → Deepgram used.
  • Standalone: only DEEPGRAM_API_KEY configured → Deepgram auto-selected as primary; setup.py --json reports ready.
  • No keys at all: updated guidance message lists all three options; --check exits 3 as before.

Docs (README + SKILL.md) updated accordingly.

…limit

Groq and OpenAI Whisper both reject uploads larger than 25 MB (HTTP 413).
At the 64 kbps mono extraction rate that is roughly 45 minutes of audio,
so any longer caption-less video currently comes back frames-only with
no way to get a transcript.

This adds Deepgram's pre-recorded API (nova-2) as a third backend:

- Audio <= 24 MB: unchanged - Whisper (Groq preferred, OpenAI fallback).
  Deepgram now also catches the case where Whisper errors out.
- Audio > 24 MB: routed straight to Deepgram, which accepts files up to
  ~2 GB, instead of failing with 413.
- Standalone: when DEEPGRAM_API_KEY is the only key configured, Deepgram
  serves as the primary backend for all sizes.
- --whisper deepgram forces it explicitly.

Implementation follows the existing whisper.py conventions: pure stdlib
(urllib, no SDK), same {start, end, text} segment shape, same SystemExit
error contract, keys via env or ~/.config/watch/.env. The module is named
deepgram_backend (not deepgram) so it can never shadow the real Deepgram
SDK if one is installed.
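
As a rough illustration of the key lookup mentioned above, the sketch below checks the environment first and then .env files. It is not the code from whisper.py or deepgram_backend: `parse_env_text`, `resolve_key`, the simple KEY=value parsing, and the inclusion of ./.env as the last location are assumptions for the example.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, Mapping, Optional


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[name.strip()] = value.strip().strip("'\"")
    return values


def resolve_key(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    files: Optional[Iterable] = None,
) -> Optional[str]:
    """Return the first non-empty value for `name`, or None."""
    env = os.environ if env is None else env
    if env.get(name):
        return env[name]

    if files is None:
        # Assumed search order after the environment.
        files = [Path.home() / ".config" / "watch" / ".env", Path(".env")]

    for candidate in files:
        try:
            text = Path(candidate).read_text(encoding="utf-8")
        except OSError:
            continue  # missing or unreadable file: try the next one
        value = parse_env_text(text).get(name)
        if value:
            return value
    return None
```

A commented placeholder such as `# DEEPGRAM_API_KEY=` is skipped by the parser, so a scaffolded file with no real key still resolves to `None`.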

setup.py accepts DEEPGRAM_API_KEY as a valid transcription key, scaffolds
a placeholder for it, and mentions it in the install hints. SKILL.md and
README document the new backend.

Tested end-to-end on Windows with a 59-minute YouTube video (27 MB audio,
native captions unavailable due to YouTube 429): routed to Deepgram and
returned 628 timestamped segments. Also verified: small file still
prefers Groq; --whisper deepgram forces Deepgram with a Groq key present;
Deepgram-only config auto-selects Deepgram; no key at all produces the
updated guidance message.