| title | Who Spoke When |
|---|---|
| emoji | 🎙️ |
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| app_file | app/main.py |
| pinned | false |
Speaker diarization service and web app: upload audio and get "who spoke when" segments.
The project now runs with a hybrid pipeline:
- Preferred: `pyannote/speaker-diarization-3.1` (best quality)
- Fallback: VAD + ECAPA-TDNN embeddings + agglomerative clustering

- FastAPI backend (`/diarize`, `/diarize/url`, `/health`)
- Web UI (`/`) for file upload and timeline view
- CLI demo (`demo.py`)
- Automatic fallback if pyannote models are unavailable
app/
main.py FastAPI app and endpoints
pipeline.py Hybrid diarization pipeline
models/
embedder.py ECAPA-TDNN embedding extractor
clusterer.py Speaker clustering logic
utils/
audio.py Audio and export helpers
static/
index.html Web UI
Dockerfile
requirements.txt
README.md
Windows PowerShell:

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1

Linux/macOS:

    python -m venv .venv
    source .venv/bin/activate

Install dependencies:

    pip install -r requirements.txt

pyannote models are gated. Create a token at https://huggingface.co/settings/tokens.

Windows PowerShell:

    $env:HF_TOKEN="your_token_here"

Linux/macOS:

    export HF_TOKEN="your_token_here"

Start the server:

    uvicorn app.main:app --host 0.0.0.0 --port 8000

Open:
- UI: http://localhost:8000
- API docs: http://localhost:8000/docs

- The UI now defaults to the same-origin API (`/diarize`), so it works on Hugging Face Spaces.
- If you manually set a custom endpoint, ensure it allows CORS and is reachable from the browser.

To deploy on Hugging Face Spaces, first confirm:
- Space created (Docker SDK)
- Space secret `HF_TOKEN` configured
- Terms accepted for:
Push main branch to your Space repo remote:

    git push huggingface main

If push fails with unauthorized:
- Use a token with Write role (not Read)
- Confirm token owner has access to the target namespace

`/health`: returns service health and device.

`/diarize`: upload an audio file.

Form fields:
- `file`: audio file
- `num_speakers` (optional): force known number of speakers
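
For callers who prefer Python to curl, the two form fields above can be sent as a `multipart/form-data` body built with the standard library only. This is a rough sketch under assumptions: the helper name `build_diarize_request` is hypothetical, the server is assumed to listen on `http://localhost:8000`, and filenames are assumed to contain no quotes.

```python
import mimetypes
import uuid
from typing import Optional
from urllib.request import Request


def build_diarize_request(
    base_url: str,
    filename: str,
    audio_bytes: bytes,
    num_speakers: Optional[int] = None,
) -> Request:
    """Build (but do not send) a multipart POST for the /diarize endpoint."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    if num_speakers is not None:
        # Optional form field: force a known number of speakers.
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="num_speakers"\r\n\r\n'
                f"{num_speakers}\r\n"
            ).encode()
        )
    # Required form field: the audio file itself.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        ).encode()
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return Request(
        base_url.rstrip("/") + "/diarize",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


req = build_diarize_request(
    "http://localhost:8000", "meeting.mp3", b"fake-audio-bytes", num_speakers=2
)
# To send it (requires the server to be running):
# from urllib.request import urlopen
# print(urlopen(req).read().decode())
```

Building the request separately from sending it keeps the sketch runnable without a live server.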
Example:

    curl -X POST http://localhost:8000/diarize \
      -F "file=@meeting.mp3" \
      -F "num_speakers=2"

`/diarize/url`: diarize audio from a remote URL.
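
Because the remote URL travels as the `audio_url` query parameter, a URL that itself contains `?`, `&`, or spaces should be percent-encoded before it is appended. The following is a small sketch of that step; the helper name `diarize_url_endpoint` is hypothetical and the base URL is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def diarize_url_endpoint(base_url: str, audio_url: str) -> str:
    """Return the /diarize/url request URL with audio_url safely encoded."""
    query = urlencode({"audio_url": audio_url})
    return f"{base_url.rstrip('/')}/diarize/url?{query}"


remote = "https://example.com/a b.wav?sig=1&x=2"
endpoint = diarize_url_endpoint("http://localhost:8000", remote)
print(endpoint)

# Decoding the query gives back the original remote URL unchanged.
decoded = parse_qs(urlsplit(endpoint).query)["audio_url"][0]
print(decoded == remote)
```

Without encoding, the `&x=2` part of the remote URL would be read as a second query parameter of the request rather than as part of `audio_url`.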
Example:

    curl -X POST "http://localhost:8000/diarize/url?audio_url=https://example.com/sample.wav"

CLI demo:

    python demo.py --audio meeting.wav
    python demo.py --audio meeting.wav --speakers 2
    python demo.py --audio meeting.wav --output result.json --rttm result.rttm --srt result.srt

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | unset | Hugging Face token for gated pyannote models |
| `CACHE_DIR` | temp model cache path | Model download/cache directory |
| `USE_PYANNOTE_DIARIZATION` | `true` | Enable full pyannote diarization first |
| `PYANNOTE_DIARIZATION_MODEL` | `pyannote/speaker-diarization-3.1` | pyannote diarization model id |
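
One way to read these variables with the defaults from the table is sketched below. It is illustrative only and may differ from how the app actually parses them: the `Settings` class, the `load_settings` function, the `model_cache` subdirectory name, and the set of strings treated as true are all assumptions.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    cache_dir: str
    use_pyannote: bool
    pyannote_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    truthy = {"1", "true", "yes", "on"}  # assumed spellings of "enabled"
    return Settings(
        hf_token=env.get("HF_TOKEN") or None,
        # Assumed location: a subdirectory of the system temp directory.
        cache_dir=env.get("CACHE_DIR")
        or os.path.join(tempfile.gettempdir(), "model_cache"),
        use_pyannote=env.get("USE_PYANNOTE_DIARIZATION", "true").strip().lower()
        in truthy,
        pyannote_model=env.get(
            "PYANNOTE_DIARIZATION_MODEL", "pyannote/speaker-diarization-3.1"
        ),
    )


defaults = load_settings({})
print(defaults)
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.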
- Load and normalize audio
- Try full pyannote diarization (best quality)
- If pyannote is unavailable or fails, fall back to:
  - VAD (pyannote VAD or energy VAD)
  - Sliding windows
  - ECAPA embeddings
  - Agglomerative clustering
- Merge adjacent same-speaker segments
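
Two of the steps above, sliding windows and merging adjacent same-speaker segments, are easy to sketch in plain Python. This is not the project's implementation: the function names, the 1.5 s window, the 0.75 s hop, and the `max_gap` parameter are illustrative assumptions.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, speaker_label)


def sliding_windows(
    start: float, end: float, window: float = 1.5, hop: float = 0.75
) -> List[Tuple[float, float]]:
    """Cut one speech region into overlapping windows for embedding extraction."""
    windows = []
    t = start
    while t < end:
        windows.append((t, min(t + window, end)))
        if t + window >= end:
            break  # the last window already reaches the end of the region
        t += hop
    return windows


def merge_adjacent(segments: List[Segment], max_gap: float = 0.0) -> List[Segment]:
    """Merge consecutive segments that share a speaker and are at most max_gap apart."""
    merged: List[Segment] = []
    for start, end, speaker in sorted(segments):
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


windows = sliding_windows(0.0, 3.0)
labeled = [(0.0, 1.0, "A"), (1.0, 2.0, "A"), (2.0, 3.0, "B"), (3.2, 4.0, "B")]
merged = merge_adjacent(labeled, max_gap=0.25)
print(windows)
print(merged)
```

A small `max_gap` tolerance lets short pauses inside one speaker's turn collapse into a single segment instead of fragmenting the timeline.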
Likely a wrong API endpoint. Use the same-origin `/diarize` endpoint in the deployed UI.

You need:
- a valid `HF_TOKEN`
- accepted model terms on both pyannote model pages

- Provide `num_speakers` when known
- Ensure clean audio (minimal background noise)
- Prefer the pyannote path (set token + accept terms)

This is usually a model download, cache, or auth mismatch. Confirm `HF_TOKEN`, cache path write access, and internet connectivity.
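
The first two of those checks can be automated with a small preflight script. The sketch below is an assumption-laden illustration: the `preflight` function is hypothetical, it only tests that the token is set (not that it is valid), and it does not test internet connectivity.

```python
import os
import tempfile
from typing import List, Mapping


def preflight(env: Mapping[str, str], cache_dir: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    try:
        # Creating and removing a temporary file proves the directory is writable.
        os.makedirs(cache_dir, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
    except OSError as exc:
        problems.append(f"cache dir not writable: {exc}")
    return problems


scratch = tempfile.mkdtemp()  # stand-in for the real CACHE_DIR
print(preflight({}, scratch))
print(preflight({"HF_TOKEN": "your_token_here"}, scratch))
```

Running it with `os.environ` and the real cache path before starting the server narrows down which of the listed causes applies.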
- Overlapped speech may still be imperfect in fallback mode
- Quality depends on audio clarity, language mix, and noise
- Very short utterances are harder to classify reliably
Add your preferred license file (LICENSE) if this project is public.