Real-time voice translation for macOS — capture audio from a call or video, transcribe, translate, and either show captions or speak the translation back to the other side. Every stage is a pluggable provider configured in YAML.
- Incoming: capture system audio (e.g. a Spanish video call) → live English captions on a borderless overlay window.
- Outgoing: capture your mic → translate your speech → write Spanish audio to a virtual mic that the call app uses as its input, so the other party hears you in their language.
- Bidirectional: both at once. Original goal — a self-hosted equivalent of Apple AirPods Live Translation, working on any audio source.
Pipeline stages: AudioSource → StreamingASR → Translator → CaptionSink + TTS → AudioSink. Each is swappable via config.
- macOS 12+ (only macOS is fully supported; Linux works for non-call use cases via PulseAudio/PipeWire).
uv— Python package manager. Install withcurl -LsSf https://astral.sh/uv/install.sh | sh.Homebrew— for installing audio drivers.- API keys for whichever cloud providers you use. Local-only configs need none.
Confirms the pipeline is wired correctly. Stubs only, no models, no APIs.
git clone https://github.com/priyanshumit/whisperbridge.git
cd whisperbridge
uv sync
uv run python -m translate run --config configs/stub.yaml
You should see a few canned transcripts print and the process exit in ~5 seconds.
A local web UI replaces hand-editing YAML for most flows. Pick languages, devices, ASR/translator providers from dropdowns, hit Start, captions stream into the page.
uv run python -m translate serve
Then open http://127.0.0.1:8080 in your browser.
- Listen mode: capture system audio (e.g. a Spanish call) → live English captions in the page, and optionally an always-on-top overlay window.
- Call mode: bidirectional — your mic → translated audio out a virtual mic, the other party's audio → English captions on screen.
- API keys: a panel in the sidebar shows which keys are configured. Click
addnext to any provider key, paste the value, and it persists to.envand applies immediately. Providers without their key are auto-disabled in the dropdowns. - Setup checklist: a first-run banner reports what's missing for the
current mode (BlackHole 2ch / 16ch, Multi-Output Device, Piper voice,
translator key) with the exact
brew install …/ Audio MIDI / voice download command to fix each item. Auto-hides once everything is OK. - Glossary: a textarea accepts
term = expansionper line (e.g.cascos = headphones). Entries get injected into the LLM translator's system prompt for both directions, so domain terms translate consistently. Ignored by DeepL (which has its own glossary endpoint). - Live captions stream from the pipeline to the browser over a WebSocket on port 8766. The Anthropic translator streams tokens, so target-language captions update progressively as the translation arrives instead of appearing all at once.
- Form state (mode, languages, devices, providers, glossary) persists to
localStorageand rehydrates on refresh.
The CLI flow (below) is still available for advanced use, automation, or running configs that don't fit the form.
uv sync
This creates a .venv/ and installs faster-whisper, sounddevice,
piper-tts, httpx, websockets, etc.
brew install --cask blackhole-2ch # capture system audio (incoming)
brew install --cask blackhole-16ch # virtual mic for translated voice (outgoing)
sudo killall coreaudiod # reload audio drivers without rebooting
Confirm both show up:
uv run python -m translate devices
- ⌘-Space → "Audio MIDI Setup".
- Bottom-left + → Create Multi-Output Device.
- Check your earphones (or laptop speakers) AND BlackHole 2ch.
- Master Device → set to your earphones (drift correction).
- Right-click the new Multi-Output Device → "Use This Device for Sound Output".
In Zoom / Google Meet / Teams / etc.:
- Speaker (output) → Multi-Output Device
- Microphone (input) → BlackHole 16ch
The call app's audio reaches our pipeline via BlackHole 2ch; our translated audio reaches the other party via BlackHole 16ch.
uv run python scripts/download_voice.py es_ES-sharvard-medium # Spanish
uv run python scripts/download_voice.py en_US-amy-medium # English
uv run python scripts/download_voice.py zh_CN-huayan-medium # Mandarin
Voices land under voices/. Other voices listed at
rhasspy/piper-voices.
Some Mandarin voices (zh_CN-chaowen-medium, zh_CN-xiao_ya-medium) use
phoneme_type: pinyin, which runs a BERT model per utterance and adds
several seconds of latency. Prefer espeak-based voices like
zh_CN-huayan-medium for real-time use. If you need a pinyin voice anyway,
install the extra (pulls g2pW, sentence-stream, unicode-rbnf, torch, and
requests; ~1GB on disk once torch is in):
uv sync --extra zh
Copy .env.example to .env and fill in only the providers you actually use:
ANTHROPIC_API_KEY=sk-ant-... # translation (validated, recommended)
DEEPGRAM_API_KEY=... # streaming ASR (recommended for quality)
GROQ_API_KEY=... # alternative translator
ELEVENLABS_API_KEY=... # cloud TTS
OPENAI_API_KEY=... # alternative translator
DEEPL_API_KEY=... # specialty translator
Pure-local setups (faster-whisper + Ollama + Piper) need no keys.
For any setup that involves both your mic and outgoing audio, always use headphones. Otherwise call audio leaks from your speakers into your mic and creates a transcription loop.
uv run python -m translate run --config <file>
Configs included:
| Config | What it does |
|---|---|
configs/stub.yaml |
No-op stubs end-to-end; sanity check only |
configs/asr-test.yaml |
WAV file → faster-whisper → passthrough → stdout. Run scripts/make_sample.py first |
configs/demo-mic.yaml |
Live mic → ASR → Anthropic → caption window + Spanish TTS to speakers (use headphones) |
configs/incoming-translate.yaml |
Listen mode: capture system audio → English captions overlay. No TTS, no virtual mic. Best for podcasts/videos. |
configs/example.yaml |
Full bidirectional: Mic → Spanish to BlackHole 16ch + BlackHole 2ch → English captions overlay. Use headphones. |
configs/local-only.yaml |
Same as example.yaml but Ollama instead of Anthropic, all local |
configs/captions-only.yaml |
Faster-whisper + Groq, no TTS |
configs/ollama-test.yaml |
Smoke test for Ollama translator |
configs/mic-debug.yaml |
Minimal mic + ASR diag config |
Stop with Ctrl-C. (Background-launched processes: pkill -f "translate run".)
uv run python -m translate serve # start the web UI on :8080
uv run python -m translate list-providers # all registered providers
uv run python -m translate devices # list available audio devices
deepgram (recommended for live calls). Streams audio to the cloud and
commits a caption every time it detects a phrase boundary, even mid-sentence.
~200 ms latency from end of speech to a final caption. ~$0.0043/min. Requires
DEEPGRAM_API_KEY.
faster_whisper (local / offline). Runs Whisper on your Mac via
ctranslate2. Free, private, no network — but commits a final only after
≥350 ms of silence (configurable via min_silence_ms). On continuous monologue
sources (a YouTube lecture, a non-stop speaker) the silence threshold may
never fire and finals never commit. Use it for back-and-forth conversations
with natural pauses, or pick a larger min_avg_logprob floor and shorter
min_silence_ms if your source has any pauses at all.
ASR (faster_whisper):
model:tiny/base/small/medium. Bigger = more accurate, slower.min_silence_ms: silence to commit a final (default 500). Lower = faster finals, more cuts.min_avg_logprob: confidence floor; below it, transcripts are dropped (default-1.0).partial_interval_ms: cadence of in-progress partials.
ASR (deepgram):
endpointing: silence (ms) to commit a final (default 300). Lower for snappier captions.
Translator:
provider:anthropic/groq/openai/ollama/deepl/passthrough.- For LLMs, the pipeline maintains a 2-turn rolling source/target context so the model has conversation history (resolves polysemy like Spanish "cascos" → "headphones" vs "helmets").
runtime.glossary:dict[str, str]of term → expansion injected into the LLM system prompt. Use for project jargon, names, or to lock in a preferred translation. Ignored by DeepL.- Anthropic uses SSE streaming; partial captions update token-by-token. On transient failures the streaming path falls back to the non-streaming retry loop so a blip won't drop the caption.
Captions:
provider: window(overlay),stdout, orwebsocket. Multiple sinks per side allowed.
Audio sources:
microphoneacceptsgain: <float>for software boost on quiet mics.
Per-side language gate: if you set language: en (or any specific code) on
a side, transcripts from a different detected language are dropped. Use
auto for multi-language sources.
Pass -v / --verbose to translate run to print every partial and final
transcript to the terminal (with [local] / [remote] prefixes) and bump
the log level to info. Quickest way to see whether audio is reaching each
side and where translation is failing.
uv run python -m translate run --config configs/example.yaml --verbose
- "BlackHole 2ch not found" after
brew install: runsudo killall coreaudiodand checkuv run python -m translate devicesagain. No reboot needed. - Captions never appear: check mic level with
python -m translate devicesand a quick capture probe. Cheap mics may needgain: 8.0–16.0on the source. _tkinter.TclError: Can't find a usable init.tcl: the auto-fix incaptions_window.pyshould handle uv's bundled Python. If not, setTCL_LIBRARYandTK_LIBRARYto your system Tcl/Tk paths.- HTTP 401 from translator: API key missing/wrong. The variable name in
.envmust exactly match the provider's expected env var; no extra whitespace, no quotes, noBearerprefix. - Random foreign-language hallucinations (Hindi/Greek/etc. on silence):
set the source side's
language:to a specific code so the language gate drops them. - "helmets" instead of "headphones" (or other context-dependent mistranslations): the rolling context helps once a couple of turns have established the topic. For one-shot translations, expect literal defaults.
uv run pytest
Covers the registry, config schema + loader, latency tracker, web server
endpoints (including /api/setup-status), glossary formatting, Anthropic
SSE streaming (mocked transport), and a stub-pipeline end-to-end smoke. Runs
in ~1s with no audio hardware, API keys, or models needed.
Each provider class declares one of:
commercial-safe— MIT/Apache-licensed locally-runnablenon-commercial-only— restrictive license (setruntime.hide_non_commercial: truein config to filter)api-tos— cloud API, your usage governed by the vendor's termsdepends-on-model— the loaded model decides (e.g., Ollama)
configs/ YAML configs for various scenarios
samples/ Generated sample WAVs (gitignored)
voices/ Piper voices (gitignored)
scripts/
make_sample.py Generate samples/hello.wav via macOS `say`
download_voice.py Pull Piper voices from HuggingFace
tests/ pytest suite (registry, config, metrics, stub pipeline)
src/translate/
cli.py Subcommands: run, serve, list-providers, devices
config.py Pydantic schemas (strict; extra=forbid)
pipeline.py Async supervisor + per-side state machines
metrics.py Latency tracker
registry.py Decorator-based provider registry
server.py aiohttp web UI: REST + WebSocket + lifecycle
static/index.html Single-page frontend (form + live captions)
providers/
base.py Abstract base classes + dataclasses
audio_local.py microphone, virtual_device, system_loopback
audio_sinks.py speakers, virtual_device sinks
asr_local.py faster-whisper with streaming Silero VAD
asr_api.py Deepgram streaming
translators_llm.py Ollama, Groq, OpenAI, Anthropic, DeepL
tts_local.py Piper, system (`say`/`espeak-ng`)
tts_api.py ElevenLabs
captions_window.py tkinter overlay
captions_websocket.py WebSocket broadcast
stubs.py Stub providers for tests
MIT — see LICENSE.