Skip to content
This repository was archived by the owner on Jun 9, 2026. It is now read-only.

priyanshumit/whisperbridge

Repository files navigation

whisperbridge

Real-time voice translation for macOS — capture audio from a call or video, transcribe, translate, and either show captions or speak the translation back to the other side. Every stage is a pluggable provider configured in YAML.

What it does

  • Incoming: capture system audio (e.g. a Spanish video call) → live English captions on a borderless overlay window.
  • Outgoing: capture your mic → translate your speech → write Spanish audio to a virtual mic that the call app uses as its input, so the other party hears you in their language.
  • Bidirectional: both at once. Original goal — a self-hosted equivalent of Apple AirPods Live Translation, working on any audio source.

Pipeline stages: AudioSource → StreamingASR → Translator → CaptionSink + TTS → AudioSink. Each is swappable via config.

Requirements

  • macOS 12+ (only macOS is fully supported; Linux works for non-call use cases via PulseAudio/PipeWire).
  • uv — Python package manager. Install with curl -LsSf https://astral.sh/uv/install.sh | sh.
  • Homebrew — for installing audio drivers.
  • API keys for whichever cloud providers you use. Local-only configs need none.

Quick start (sanity check, no setup)

Confirms the pipeline is wired correctly. Stubs only, no models, no APIs.

git clone https://github.com/priyanshumit/whisperbridge.git
cd whisperbridge
uv sync
uv run python -m translate run --config configs/stub.yaml

You should see a few canned transcripts print and the process exit in ~5 seconds.

Web UI (recommended for everyday use)

A local web UI replaces hand-editing YAML for most flows. Pick languages, devices, ASR/translator providers from dropdowns, hit Start, captions stream into the page.

uv run python -m translate serve

Then open http://127.0.0.1:8080 in your browser.

  • Listen mode: capture system audio (e.g. a Spanish call) → live English captions in the page, and optionally an always-on-top overlay window.
  • Call mode: bidirectional — your mic → translated audio out a virtual mic, the other party's audio → English captions on screen.
  • API keys: a panel in the sidebar shows which keys are configured. Click add next to any provider key, paste the value, and it persists to .env and applies immediately. Providers without their key are auto-disabled in the dropdowns.
  • Setup checklist: a first-run banner reports what's missing for the current mode (BlackHole 2ch / 16ch, Multi-Output Device, Piper voice, translator key) with the exact brew install … / Audio MIDI / voice download command to fix each item. Auto-hides once everything is OK.
  • Glossary: a textarea accepts term = expansion per line (e.g. cascos = headphones). Entries get injected into the LLM translator's system prompt for both directions, so domain terms translate consistently. Ignored by DeepL (which has its own glossary endpoint).
  • Live captions stream from the pipeline to the browser over a WebSocket on port 8766. The Anthropic translator streams tokens, so target-language captions update progressively as the translation arrives instead of appearing all at once.
  • Form state (mode, languages, devices, providers, glossary) persists to localStorage and rehydrates on refresh.

The CLI flow (below) is still available for advanced use, automation, or running configs that don't fit the form.

Setup for real use

1. Install Python deps

uv sync

This creates a .venv/ and installs faster-whisper, sounddevice, piper-tts, httpx, websockets, etc.

2. Install audio drivers (for capturing system audio + virtual mic)

brew install --cask blackhole-2ch       # capture system audio (incoming)
brew install --cask blackhole-16ch      # virtual mic for translated voice (outgoing)
sudo killall coreaudiod                  # reload audio drivers without rebooting

Confirm both show up:

uv run python -m translate devices

3. Audio MIDI Setup (one time)

  1. ⌘-Space → "Audio MIDI Setup".
  2. Bottom-left +Create Multi-Output Device.
  3. Check your earphones (or laptop speakers) AND BlackHole 2ch.
  4. Master Device → set to your earphones (drift correction).
  5. Right-click the new Multi-Output Device → "Use This Device for Sound Output".

4. Video conferencing app

In Zoom / Google Meet / Teams / etc.:

  • Speaker (output) → Multi-Output Device
  • Microphone (input) → BlackHole 16ch

The call app's audio reaches our pipeline via BlackHole 2ch; our translated audio reaches the other party via BlackHole 16ch.

5. Download a Piper TTS voice (only needed if outgoing TTS is enabled)

uv run python scripts/download_voice.py es_ES-sharvard-medium    # Spanish
uv run python scripts/download_voice.py en_US-amy-medium         # English
uv run python scripts/download_voice.py zh_CN-huayan-medium      # Mandarin

Voices land under voices/. Other voices listed at rhasspy/piper-voices.

Some Mandarin voices (zh_CN-chaowen-medium, zh_CN-xiao_ya-medium) use phoneme_type: pinyin, which runs a BERT model per utterance and adds several seconds of latency. Prefer espeak-based voices like zh_CN-huayan-medium for real-time use. If you need a pinyin voice anyway, install the extra (pulls g2pW, sentence-stream, unicode-rbnf, torch, and requests; ~1GB on disk once torch is in):

uv sync --extra zh

6. API keys

Copy .env.example to .env and fill in only the providers you actually use:

ANTHROPIC_API_KEY=sk-ant-...           # translation (validated, recommended)
DEEPGRAM_API_KEY=...                    # streaming ASR (recommended for quality)
GROQ_API_KEY=...                        # alternative translator
ELEVENLABS_API_KEY=...                  # cloud TTS
OPENAI_API_KEY=...                      # alternative translator
DEEPL_API_KEY=...                       # specialty translator

Pure-local setups (faster-whisper + Ollama + Piper) need no keys.

7. Use headphones

For any setup that involves both your mic and outgoing audio, always use headphones. Otherwise call audio leaks from your speakers into your mic and creates a transcription loop.

Running

uv run python -m translate run --config <file>

Configs included:

Config What it does
configs/stub.yaml No-op stubs end-to-end; sanity check only
configs/asr-test.yaml WAV file → faster-whisper → passthrough → stdout. Run scripts/make_sample.py first
configs/demo-mic.yaml Live mic → ASR → Anthropic → caption window + Spanish TTS to speakers (use headphones)
configs/incoming-translate.yaml Listen mode: capture system audio → English captions overlay. No TTS, no virtual mic. Best for podcasts/videos.
configs/example.yaml Full bidirectional: Mic → Spanish to BlackHole 16ch + BlackHole 2ch → English captions overlay. Use headphones.
configs/local-only.yaml Same as example.yaml but Ollama instead of Anthropic, all local
configs/captions-only.yaml Faster-whisper + Groq, no TTS
configs/ollama-test.yaml Smoke test for Ollama translator
configs/mic-debug.yaml Minimal mic + ASR diag config

Stop with Ctrl-C. (Background-launched processes: pkill -f "translate run".)

Other useful subcommands

uv run python -m translate serve                 # start the web UI on :8080
uv run python -m translate list-providers       # all registered providers
uv run python -m translate devices               # list available audio devices

Tuning knobs

Choosing an ASR

deepgram (recommended for live calls). Streams audio to the cloud and commits a caption every time it detects a phrase boundary, even mid-sentence. ~200 ms latency from end of speech to a final caption. ~$0.0043/min. Requires DEEPGRAM_API_KEY.

faster_whisper (local / offline). Runs Whisper on your Mac via ctranslate2. Free, private, no network — but commits a final only after ≥350 ms of silence (configurable via min_silence_ms). On continuous monologue sources (a YouTube lecture, a non-stop speaker) the silence threshold may never fire and finals never commit. Use it for back-and-forth conversations with natural pauses, or pick a larger min_avg_logprob floor and shorter min_silence_ms if your source has any pauses at all.

ASR (faster_whisper):

  • model: tiny / base / small / medium. Bigger = more accurate, slower.
  • min_silence_ms: silence to commit a final (default 500). Lower = faster finals, more cuts.
  • min_avg_logprob: confidence floor; below it, transcripts are dropped (default -1.0).
  • partial_interval_ms: cadence of in-progress partials.

ASR (deepgram):

  • endpointing: silence (ms) to commit a final (default 300). Lower for snappier captions.

Translator:

  • provider: anthropic / groq / openai / ollama / deepl / passthrough.
  • For LLMs, the pipeline maintains a 2-turn rolling source/target context so the model has conversation history (resolves polysemy like Spanish "cascos" → "headphones" vs "helmets").
  • runtime.glossary: dict[str, str] of term → expansion injected into the LLM system prompt. Use for project jargon, names, or to lock in a preferred translation. Ignored by DeepL.
  • Anthropic uses SSE streaming; partial captions update token-by-token. On transient failures the streaming path falls back to the non-streaming retry loop so a blip won't drop the caption.

Captions:

  • provider: window (overlay), stdout, or websocket. Multiple sinks per side allowed.

Audio sources:

  • microphone accepts gain: <float> for software boost on quiet mics.

Per-side language gate: if you set language: en (or any specific code) on a side, transcripts from a different detected language are dropped. Use auto for multi-language sources.

Troubleshooting

Pass -v / --verbose to translate run to print every partial and final transcript to the terminal (with [local] / [remote] prefixes) and bump the log level to info. Quickest way to see whether audio is reaching each side and where translation is failing.

uv run python -m translate run --config configs/example.yaml --verbose
  • "BlackHole 2ch not found" after brew install: run sudo killall coreaudiod and check uv run python -m translate devices again. No reboot needed.
  • Captions never appear: check mic level with python -m translate devices and a quick capture probe. Cheap mics may need gain: 8.016.0 on the source.
  • _tkinter.TclError: Can't find a usable init.tcl: the auto-fix in captions_window.py should handle uv's bundled Python. If not, set TCL_LIBRARY and TK_LIBRARY to your system Tcl/Tk paths.
  • HTTP 401 from translator: API key missing/wrong. The variable name in .env must exactly match the provider's expected env var; no extra whitespace, no quotes, no Bearer prefix.
  • Random foreign-language hallucinations (Hindi/Greek/etc. on silence): set the source side's language: to a specific code so the language gate drops them.
  • "helmets" instead of "headphones" (or other context-dependent mistranslations): the rolling context helps once a couple of turns have established the topic. For one-shot translations, expect literal defaults.

Tests

uv run pytest

Covers the registry, config schema + loader, latency tracker, web server endpoints (including /api/setup-status), glossary formatting, Anthropic SSE streaming (mocked transport), and a stub-pipeline end-to-end smoke. Runs in ~1s with no audio hardware, API keys, or models needed.

License profile per provider

Each provider class declares one of:

  • commercial-safe — MIT/Apache-licensed locally-runnable
  • non-commercial-only — restrictive license (set runtime.hide_non_commercial: true in config to filter)
  • api-tos — cloud API, your usage governed by the vendor's terms
  • depends-on-model — the loaded model decides (e.g., Ollama)

Project layout

configs/                  YAML configs for various scenarios
samples/                  Generated sample WAVs (gitignored)
voices/                   Piper voices (gitignored)
scripts/
  make_sample.py          Generate samples/hello.wav via macOS `say`
  download_voice.py       Pull Piper voices from HuggingFace
tests/                    pytest suite (registry, config, metrics, stub pipeline)
src/translate/
  cli.py                  Subcommands: run, serve, list-providers, devices
  config.py               Pydantic schemas (strict; extra=forbid)
  pipeline.py             Async supervisor + per-side state machines
  metrics.py              Latency tracker
  registry.py             Decorator-based provider registry
  server.py               aiohttp web UI: REST + WebSocket + lifecycle
  static/index.html       Single-page frontend (form + live captions)
  providers/
    base.py               Abstract base classes + dataclasses
    audio_local.py        microphone, virtual_device, system_loopback
    audio_sinks.py        speakers, virtual_device sinks
    asr_local.py          faster-whisper with streaming Silero VAD
    asr_api.py            Deepgram streaming
    translators_llm.py    Ollama, Groq, OpenAI, Anthropic, DeepL
    tts_local.py          Piper, system (`say`/`espeak-ng`)
    tts_api.py            ElevenLabs
    captions_window.py    tkinter overlay
    captions_websocket.py WebSocket broadcast
    stubs.py              Stub providers for tests

License

MIT — see LICENSE.

About

Real-time voice translation for macOS — captions and TTS over a pluggable provider pipeline

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors