Add Deepgram (nova-3) as a third transcription backend #13

Open
apicurius wants to merge 1 commit into bradautomates:main from apicurius:add-deepgram-backend

Summary

Adds Deepgram as a third option alongside the existing Groq and OpenAI Whisper backends. The main benefit is that Deepgram's /v1/listen has no per-request size limit, whereas the Whisper APIs cap uploads at 25 MB, which limits long-video coverage even with mono 16 kHz audio.

Why nova-3 / utterances

results.utterances[] already gives us pre-segmented chunks with start, end, and transcript, which map cleanly onto the existing {start, end, text} segment shape used by both the VTT parser (transcribe.parse_vtt) and the Whisper verbose_json adapter (_segments_from_response). No downstream changes are needed in transcribe.filter_range or format_transcript.
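As an illustration of that mapping, here is a minimal sketch of turning results.utterances[] into {start, end, text} segments. The function name utterances_to_segments and the sample payload are hypothetical, and the decision to skip empty utterances is an assumption; the adapter in this PR may differ in detail.

```python
from typing import Any


def utterances_to_segments(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert results.utterances[] into {start, end, text} segments.

    Hypothetical helper for illustration; not the PR's actual adapter.
    """
    utterances = (payload.get("results") or {}).get("utterances") or []
    segments = []
    for utt in utterances:
        text = (utt.get("transcript") or "").strip()
        if not text:
            continue  # assumption: drop empty utterances instead of emitting blank cues
        segments.append(
            {"start": float(utt["start"]), "end": float(utt["end"]), "text": text}
        )
    return segments


# Hand-written sample shaped like a Deepgram response (not real API output).
sample = {
    "results": {
        "utterances": [
            {"start": 0.0, "end": 1.5, "transcript": "Hello there."},
            {"start": 1.5, "end": 3.25, "transcript": "  Second line. "},
        ]
    }
}
print(utterances_to_segments(sample))
```

Because the output already has the same keys as the other adapters' segments, range filtering and formatting can consume it unchanged.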

smart_format=true&punctuate=true&detect_language=true keeps behavior parity with Whisper's defaults (auto language, punctuated output).

Wire-level differences from Whisper

The Deepgram client mirrors _post_whisper's retry/backoff envelope (4 attempts, 2 of them on 429), but differs on three points:

  1. Auth header is Token <key>, not Bearer <key>.
  2. Body is the raw audio bytes — no multipart form.
  3. Response shape is results.utterances[] (or results.channels[0].alternatives[0].transcript as fallback).
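The first two differences can be sketched with the standard library alone. This is a rough illustration, not the code in this PR: build_deepgram_request, the audio/mpeg content type, and the example key are hypothetical, and utterances=true is included on the assumption that it is the parameter that enables results.utterances[] in the response. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_deepgram_request(
    audio: bytes, api_key: str, content_type: str = "audio/mpeg"
) -> urllib.request.Request:
    """Build (but do not send) a POST carrying raw audio bytes.

    Hypothetical helper; parameter choices beyond those named in the
    PR description are assumptions.
    """
    query = urllib.parse.urlencode(
        {
            "model": "nova-3",
            "smart_format": "true",
            "punctuate": "true",
            "detect_language": "true",
            "utterances": "true",  # assumption: needed to get results.utterances[]
        }
    )
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{query}",
        data=audio,  # raw bytes as the body, no multipart form
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",  # Token scheme, not Bearer
            "Content-Type": content_type,
        },
    )


req = build_deepgram_request(b"\x00\x01", "dg_example_key")
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would perform the upload; the retry/backoff envelope would wrap that call.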

Pure stdlib — no deepgram-sdk dependency, consistent with the existing Groq/OpenAI implementation.

Backend selection

Preference order when multiple keys are set: Groq → OpenAI → Deepgram. Override with --whisper {groq,openai,deepgram}. setup.py scaffolds DEEPGRAM_API_KEY= alongside the other placeholders and accepts it as satisfying the preflight key check.
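A minimal sketch of that selection logic, under stated assumptions: pick_backend is a hypothetical name, GROQ_API_KEY and OPENAI_API_KEY are assumed variable names (only DEEPGRAM_API_KEY appears in this description), and returning None when the overridden backend has no key is an illustrative choice rather than the PR's confirmed behavior.

```python
from typing import Mapping, Optional

# Preference order from the description: Groq -> OpenAI -> Deepgram.
# The first two variable names are assumptions.
PREFERENCE = (
    ("groq", "GROQ_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("deepgram", "DEEPGRAM_API_KEY"),
)


def pick_backend(
    env: Mapping[str, str], override: Optional[str] = None
) -> Optional[str]:
    """Return the backend name to use, or None if no usable key is set."""
    key_for = dict(PREFERENCE)
    if override is not None:
        if override not in key_for:
            raise ValueError(f"unknown backend: {override}")
        # An explicit override wins, but only if its key is present.
        return override if env.get(key_for[override]) else None
    for name, var in PREFERENCE:
        if env.get(var):
            return name
    return None


print(pick_backend({"DEEPGRAM_API_KEY": "x"}))
print(pick_backend({"GROQ_API_KEY": "a", "DEEPGRAM_API_KEY": "x"}))
print(pick_backend({"GROQ_API_KEY": "a", "DEEPGRAM_API_KEY": "x"}, override="deepgram"))
```

In practice env would be os.environ and override would come from the --whisper flag.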

Stderr messages and docs refer to "transcription" / "speech-to-text" rather than "Whisper" where the broader concept applies, but the --whisper CLI flag name is preserved for backward compatibility.

Test plan

  • python3 -c "import ast; ast.parse(open('scripts/whisper.py').read())" — syntax clean
  • Smoke-tested against an X.com video without captions: extracted audio, uploaded it to api.deepgram.com/v1/listen, and got back 26 segments aligned to the speaker's actual delivery.
  • Verified _segments_from_deepgram_response falls back to the alternative transcript when utterances is absent (manual response stub).
  • setup.py --check returns 0 with only DEEPGRAM_API_KEY set.
  • setup.py --json reports whisper_backend: "deepgram".
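The fallback check with a manual response stub could look roughly like the sketch below. segments_with_fallback is a hypothetical stand-in for _segments_from_deepgram_response, and emitting a single segment that spans 0.0 to metadata.duration is an assumption about how the fallback is timed, not something this description specifies.

```python
from typing import Any


def segments_with_fallback(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Prefer results.utterances[]; otherwise use the full alternative transcript.

    Hypothetical stand-in for the PR's adapter.
    """
    results = payload.get("results") or {}
    utterances = results.get("utterances") or []
    if utterances:
        return [
            {"start": float(u["start"]), "end": float(u["end"]), "text": u["transcript"].strip()}
            for u in utterances
        ]
    channels = results.get("channels") or []
    alternatives = (channels[0].get("alternatives") or []) if channels else []
    transcript = (alternatives[0].get("transcript") or "").strip() if alternatives else ""
    if not transcript:
        return []
    # Assumption: one segment covering the whole clip, timed from metadata.duration.
    duration = float((payload.get("metadata") or {}).get("duration", 0.0))
    return [{"start": 0.0, "end": duration, "text": transcript}]


# Manual stub with no utterances key, shaped like a Deepgram response.
stub = {
    "metadata": {"duration": 12.5},
    "results": {"channels": [{"alternatives": [{"transcript": "Whole transcript."}]}]},
}
print(segments_with_fallback(stub))
```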

Notes

  • Bumped plugin.json to 0.2.0 (additive feature, no breaking changes to default backend behavior).
  • No new dependencies. No changes to the frames.py / download.py paths.

Whisper API uploads cap at 25 MB, which constrains long-video coverage even
with mono 16 kHz audio. Deepgram's /v1/listen has no per-request size limit
and exposes utterances directly, mapping cleanly onto the existing
{start, end, text} segment shape used by both the VTT parser and the Whisper
verbose_json adapter.

The Deepgram client follows the same retry/backoff envelope as the existing
Whisper client (4 attempts total, 2 of them on 429), but differs on three
wire-level points: Token (not Bearer) auth, raw audio body (not multipart),
and a results.utterances[] response shape with fallback to the full
alternative transcript when utterances are absent.

Backend selection extends the existing chain: Groq -> OpenAI -> Deepgram,
overridable via --whisper {groq,openai,deepgram}. setup.py now scaffolds
DEEPGRAM_API_KEY alongside the other placeholders and accepts it as
satisfying the preflight key check. Stderr messages and docs refer to
"transcription" rather than "Whisper" where the broader concept applies.

Bumps plugin.json to 0.2.0 (additive feature, no breaking changes to the
default backend behavior).