A production pipeline around OpenAI gpt-4o-transcribe-diarize for long-form
2-speaker interviews. Plug in your own domain_terms.json to adapt it to any
field.
| Raw API limitation | This pipeline |
|---|---|
| 25 MB / 1400 s hard cap per call | Two-step preprocess: chunk by duration first, then compress only chunks > 24.9 MB |
| Each chunked call labels speakers independently (chunk 1's "A" ≠ chunk 2's "A") | After chunk 1, auto-extract top-N speaker references (2–10 s clips) and pass them to chunks 2…N via known_speaker_references |
Tested on a 2 h 26 min interview (280 MB) — 7 chunks, single consistent speaker labeling end-to-end.
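The two preprocessing ideas in the table above can be sketched in a few lines. This is an illustration only, not the code in src/preprocessor.py or src/transcriber.py: the function names, the fixed 1400 s window, the MiB-based size check, and the "longest eligible segment per speaker" rule for picking reference clips are all assumptions made for the example.

```python
import math

MAX_CHUNK_S = 1400      # per-call duration cap quoted in the table above
MAX_UPLOAD_MB = 24.9    # only chunks above this size get compressed


def plan_chunks(duration_s, max_chunk_s=MAX_CHUNK_S):
    """Step 1: split by duration into (start, end) windows in seconds."""
    n_chunks = math.ceil(duration_s / max_chunk_s)
    return [
        (i * max_chunk_s, min((i + 1) * max_chunk_s, duration_s))
        for i in range(n_chunks)
    ]


def needs_compression(size_bytes, limit_mb=MAX_UPLOAD_MB):
    """Step 2: flag a chunk for compression (assumes 1 MB = 1024 * 1024 bytes)."""
    return size_bytes / (1024 * 1024) > limit_mb


def pick_reference_clips(segments, min_s=2.0, max_s=10.0):
    """Pick one 2-10 s clip per speaker from chunk 1.

    Hypothetical rule: keep the longest segment whose length is in range.
    Returns {speaker: (start, end)}.
    """
    best = {}
    for seg in segments:
        length = seg["end"] - seg["start"]
        if not (min_s <= length <= max_s):
            continue
        current = best.get(seg["speaker"])
        if current is None or length > current[1] - current[0]:
            best[seg["speaker"]] = (seg["start"], seg["end"])
    return best


if __name__ == "__main__":
    # The 8,758 s test interview splits into 7 windows under these assumptions.
    print(len(plan_chunks(8758)))
```

The clips returned by pick_reference_clips would then be cut from the chunk 1 audio and supplied with the later chunks; how the repository selects and encodes them may differ.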
| Raw API limitation | This pipeline |
|---|---|
| gpt-4o-transcribe-diarize often invents 3–4 speakers for a 2-person interview | Stage 0.2 consolidation: count speakers, keep top-N by frequency (expected_speakers=2), let GPT-4 re-map minor → major (C → A, D → B) |
DER on the test interview: 12.99% → 4.28% (−8.7 pp).
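The counting half of Stage 0.2 is simple enough to sketch. The example below is a hedged illustration, not the repository's implementation: it counts segments per speaker label, keeps the top expected_speakers labels as "major", and applies a minor → major mapping that is passed in. In the pipeline that mapping comes from GPT; here it is just a plain dict, and the function names are hypothetical.

```python
from collections import Counter


def split_major_minor(segments, expected_speakers=2):
    """Return (major, minor) speaker labels ranked by segment count."""
    counts = Counter(seg["speaker"] for seg in segments)
    major = [label for label, _ in counts.most_common(expected_speakers)]
    minor = [label for label in counts if label not in major]
    return major, minor


def apply_remap(segments, remap):
    """Relabel segments using a minor -> major mapping, e.g. {"C": "A"}."""
    return [
        {**seg, "speaker": remap.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]


if __name__ == "__main__":
    segments = [
        {"speaker": "A", "text": "So tell me about the roadmap."},
        {"speaker": "B", "text": "Sure."},
        {"speaker": "A", "text": "And the timeline?"},
        {"speaker": "C", "text": "Right."},
        {"speaker": "B", "text": "About a year."},
        {"speaker": "A", "text": "Great."},
    ]
    major, minor = split_major_minor(segments)
    print(major, minor)
    # The mapping below stands in for the GPT re-map step.
    print(apply_remap(segments, {"C": "A"})[3]["speaker"])
```

Ranking by segment count follows the "top-N by frequency" wording above; ranking by total speaking time would be a reasonable alternative if short interjections dominate the count.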
| Raw API limitation | This pipeline |
|---|---|
| Rare / domain-specific terms get mistranscribed (Nemotron → Neutron, Dennard → Denard, OpenClaw → open claw, CUDA → Kuda) | Stage 2 GPT correction with domain_terms.json injected into a strict whitelist system prompt: only domain spell-fixes allowed, no rewriting / contraction expansion / filler removal |
| Open-loop GPT correction occasionally hallucinates refusals ("I'm sorry, I can't assist...") into the transcript | Fix B refusal guard: regex-detect refusal phrases in GPT output, auto-revert to original |
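A refusal guard of the kind described in the last row can be sketched as follows. The phrase list, the function name, and the extra check that the phrase is absent from the original text are assumptions for illustration; the patterns actually used by the pipeline are not shown in this README.

```python
import re

# Hypothetical refusal patterns; extend to match what your model emits.
REFUSAL_RE = re.compile(
    r"\b(i'?m sorry|i apologi[sz]e|i can(?:'?t|not) (?:assist|help)|as an ai)\b",
    re.IGNORECASE,
)


def guard_correction(original, corrected):
    """Return `corrected` unless it looks like a refusal, else revert.

    A speaker may really say "I'm sorry", so the revert only fires when
    the refusal phrase is new, i.e. not already present in the original.
    """
    # Fold typographic apostrophes so both spellings hit the same pattern.
    corrected_check = corrected.replace("\u2019", "'")
    original_check = original.replace("\u2019", "'")
    if REFUSAL_RE.search(corrected_check) and not REFUSAL_RE.search(original_check):
        return original
    return corrected


if __name__ == "__main__":
    print(guard_correction("we ran it on Kuda", "I'm sorry, I can't assist with that."))
    print(guard_correction("we ran it on Kuda", "we ran it on CUDA"))
```

Reverting to the uncorrected segment trades a possible spelling fix for the guarantee that no refusal text reaches the transcript.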
Lex Fridman × Jensen Huang podcast #494, 22,281 reference words, 2 speakers. WER under NIST sclite-style normalization (contractions + hesitations folded). Q = 100 − ½(WER + DER) × 100 (higher is better).
| Config | Stages on | WER | DER | Q |
|---|---|---|---|---|
| Raw (no post-processing) | none | 6.27% | 12.99% | 90.4 |
| Speaker stages only | 0.1 + 0.2 + 1 | 6.27% | 4.28% | 94.7 |
| Full + NVIDIA dict + gpt-5.5 + rebalance ⭐ | 0.1 + 0.2 + 1 + 1.5 + 2 | 6.05% | 4.28% | 94.8 |
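Takeaway: enabling the speaker stages (0.1 / 0.2 / 1) gives +4.3 Q for free (DER drops from 12.99% to 4.28% with no WER cost). Adding stages 1.5 and 2 with the NVIDIA dictionary lowers WER from 6.27% to 6.05% (+0.1 Q); Stage 2 (GPT correction) needs a domain-appropriate dictionary to be net-positive.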
This comparison was run on a separate 11-interview corpus, with every engine scored on the same audio files using Q = 100 − ½(WER + DER) × 100. Recording conditions are deliberately tough: noisy commodity laptop-microphone audio, Chinese/English code-switching within the same recording, and ~38 min average duration per interview. This is closer to real-world meeting and field-interview audio than to studio podcast benchmarks.
| Rank | Engine | Q | Notes |
|---|---|---|---|
| 🥇 | CWX-Transcribe (this) | 85.8 | integrated WER + DER pipeline |
| 🥈 | AssemblyAI Universal-2 | 84.4 | commercial API |
| 🥉 | WhisperX (Whisper + pyannote) | 81.5 | open-source stacked |
| 4 | gpt-4o-transcribe-diarize (raw) | 74.3 | ← what CWX wraps; +11.5 Q on top |
| 5 | NVIDIA Parakeet (RNNT 1.1B) | 73.0 | open-source ASR-only |
| 6 | Qwen3-ASR | 72.9 | |
| 7 | Deepgram Nova-3 | 71.1 | commercial API |
| Rank | Engine | Q |
|---|---|---|
| 🥇 | CWX-Transcribe | 88.7 |
| 🥈 | AssemblyAI | 88.4 |
| 🥉 | WhisperX | 83.7 |
| Rank | Engine | Q |
|---|---|---|
| 🥇 | CWX-Transcribe | 78.2 |
| 🥈 | WhisperX | 75.6 |
| 🥉 | AssemblyAI | 73.9 |
The +11.5 Q gap between CWX (85.8) and the raw OpenAI API (74.3) is what this production pipeline buys you.
git clone https://github.com/Vincent-WenZX/CWX-Transcribe.git
cd CWX-Transcribe
brew install ffmpeg # macOS (Linux: sudo apt install ffmpeg)
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "OPENAI_API_KEY=sk-..." > .env
# single file
python run_cwx_batch.py meeting.m4a -o ./outputs --lang en --speakers 2
# whole folder (skip-if-exists by default)
python run_cwx_batch.py ./inputs -o ./outputs --lang zh --speakers 2
# force re-run
python run_cwx_batch.py *.m4a -o ./outputs --no-skip
Output:
outputs/<stem>/
├── <stem>_cwx.txt plain (for WER)
├── <stem>_cwx_speaker.txt speaker-tagged (for DER)
├── <stem>_cwx_readable.txt timestamped, human-readable
└── <stem>_cwx.json full structured segments
from src import Audio, Transcriber
from src.services.openai_service import OpenAIService
import os
trx = Transcriber(OpenAIService(api_key=os.environ["OPENAI_API_KEY"]))
result = trx.transcribe(Audio("meeting.m4a"), language="en", max_speakers=2)
# result["segments"] = [{"speaker":"A", "start":0, "end":3.2, "text":"..."}, ...]
Edit config/domain_terms.json. Both arrays are flattened and joined, comma-separated, into the prompt:
{
"terms": ["Salesforce", "HubSpot", "OKR", "ARR", "churn"],
"abbreviations": ["KPI", "ROI", "B2B", "SaaS", "MRR"]
}
The terms, combined with the base prompt, can total up to ~224 tokens.
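The flatten-and-join step is easy to picture with a short sketch. It assumes only the two keys shown in the example file; the function name is hypothetical, and the real prompt_engineer.py may order, deduplicate, or truncate terms differently. Token counting is not attempted here.

```python
import json


def build_term_string(domain_terms):
    """Flatten both arrays and join them with commas for the prompt."""
    flat = [
        *domain_terms.get("terms", []),
        *domain_terms.get("abbreviations", []),
    ]
    return ", ".join(flat)


if __name__ == "__main__":
    raw = """
    {
      "terms": ["Salesforce", "HubSpot", "OKR", "ARR", "churn"],
      "abbreviations": ["KPI", "ROI", "B2B", "SaaS", "MRR"]
    }
    """
    print(build_term_string(json.loads(raw)))
```

Because the budget is shared with the base prompt, putting the terms that are most often mistranscribed first is a sensible default if the list has to be cut.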
| Knob | Default | What it does |
|---|---|---|
| transcription.language | en | language code (en/zh/ja/...) |
| transcription.max_speakers | 2 | max speakers (1–4) |
| post_processing.enabled | true | master switch |
| post_processing.consolidate_speakers | true | Stage 0.2: fixes diarization hallucinations |
| post_processing.expected_speakers | 2 | top-N to keep |
| post_processing.merge_same_speaker | true | Stage 1 |
| post_processing.correct_text | false | Stage 2 GPT correction; default OFF (see Fix A) |
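The knob names above are dotted paths into a nested config. The sketch below shows how such a lookup can work on an already-loaded dict; it is an illustration and not the ConfigManager class in src/config_manager.py, whose interface is not documented here. The function name get_setting is hypothetical.

```python
def get_setting(config, dotted_path, default=None):
    """Walk a nested dict using a dotted path such as 'a.b.c'."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


if __name__ == "__main__":
    # Mirrors the defaults listed in the table above.
    config = {
        "transcription": {"language": "en", "max_speakers": 2},
        "post_processing": {
            "enabled": True,
            "consolidate_speakers": True,
            "expected_speakers": 2,
            "merge_same_speaker": True,
            "correct_text": False,
        },
    }
    print(get_setting(config, "post_processing.expected_speakers"))
    print(get_setting(config, "post_processing.correct_text"))
```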
CWX-Transcribe/
├── README.md this file
├── LICENSE Apache 2.0
├── requirements.txt Python dependencies (openai, pydub, jiwer, …)
├── architecture.jpg pipeline diagram (shown at the top of this README)
├── .gitignore
├── run_cwx_batch.py ⭐ main CLI entry point
│
├── config/
│ ├── settings.yaml main config: model, audio limits, post-processing toggles
│ └── domain_terms.json pluggable domain dictionary (your vocabulary goes here)
│
├── src/ core library
│ ├── audio.py Audio class — load, duration/size, slice
│ ├── preprocessor.py chunk by duration → compress per chunk (2-step)
│ ├── prompt_engineer.py base prompt + inject domain dictionary
│ ├── transcriber.py OpenAI gpt-4o-transcribe-diarize call;
│ │ cross-chunk speaker-reference loop
│ ├── transcript_post_processor.py 4-stage post-processor (normalize / consolidate /
│ │ merge / GPT-correct) with Fix A + Fix B safety guards
│ ├── output_formatter.py re-base chunk timestamps + write 4 output formats
│ ├── speaker_reference_manager.py validate + format known-speaker audio samples
│ ├── config_manager.py YAML config loader (dotted-path access)
│ ├── utils.py helpers: .env loading, duration formatting, etc.
│ ├── services/
│ │ └── openai_service.py unified OpenAI client (timeouts, retries, stats)
│ └── evaluation/
│ └── wer_evaluator.py jiwer-based WER + CER + MER + WIL
│
└── examples/ runnable scripts (run from project root)
├── verify_implementation.py smoke-test imports & basic init (run once after install)
├── example1.py verbose end-to-end transcription walkthrough
├── usage_example.py minimal Python API usage
├── transcribe_with_eval.py transcribe + auto-WER evaluation in one go
└── evaluate_wer.py standalone WER eval against your own ground truth
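The tree above notes that output_formatter.py re-bases chunk timestamps. Each chunked call returns times relative to its own chunk, so they have to be shifted by the chunk's start offset before merging. A minimal sketch of that idea, with a hypothetical function name and input shape:

```python
def rebase_segments(chunks):
    """Merge per-chunk segments into one list with absolute timestamps.

    `chunks` is a list of (chunk_offset_seconds, segments) pairs, where
    each segment has chunk-relative "start" and "end" values.
    """
    merged = []
    for offset, segments in chunks:
        for seg in segments:
            merged.append(
                {**seg, "start": seg["start"] + offset, "end": seg["end"] + offset}
            )
    return merged


if __name__ == "__main__":
    chunks = [
        (0.0, [{"speaker": "A", "start": 0.0, "end": 3.2, "text": "..."}]),
        (1400.0, [{"speaker": "B", "start": 1.5, "end": 4.0, "text": "..."}]),
    ]
    for seg in rebase_segments(chunks):
        print(seg["speaker"], seg["start"], seg["end"])
```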
All numbers in Sections 2 and 3 come from offline runs against published or internally-collected audio with known ground-truth transcripts. This section documents exactly what was scored and how.
| Property | Section 2 test interview |
|---|---|
| Audio | Lex Fridman Podcast #494: Jensen Huang × Lex Fridman |
| Duration | 8,758 s (2 h 26 min) |
| Reference words | 22,281 |
| Speakers | 2 (Lex Fridman = interviewer, Jensen Huang = interviewee) |
| Recording quality | studio-grade, clean, mono 16 kHz |
| Ground truth | Lex Fridman's official published transcript (lexfridman.com) |
| Audio shipped with this repo? | ❌ no (copyrighted material; download separately) |

| Property | Section 3 corpus |
|---|---|
| Number of recordings | 11 long-form interviews |
| Languages | 8 English + 3 Chinese (CN/Eng code-switching within recordings) |
| Average duration | ~38 min |
| Recording quality | noisy commodity laptop microphones, closer to real meeting / field-interview audio than to studio podcasts |
| Ground truth | human-transcribed and human-corrected by domain experts |
| Engines compared | WhisperX (Whisper + pyannote), AssemblyAI Universal-2, Deepgram Nova-3, NVIDIA Parakeet RNNT 1.1B, Qwen3-ASR, gpt-4o-transcribe-diarize (raw), CWX-Transcribe |
WER is computed with jiwer 4.0+ for the alignment, after applying NIST
sclite-style normalization to both reference and hypothesis (consistent
with the research-community convention for conversational ASR; see
NIST sclite glm and Whisper EnglishTextNormalizer):
1. Unicode-fold typographic quotes ("’", "‘") to ASCII apostrophe
2. Lowercase
3. Expand colloquial contractions (applied to both ref AND hyp):
gonna → going to, wanna → want to, gotta → got to,
kinda → kind of, sorta → sort of, dunno → do not know,
cuz → because
4. Strip all punctuation; collapse whitespace
5. Remove hesitation tokens entirely:
{uh, um, mm, hmm, mhmm, mmhmm, ah, er, eh}
6. Compute jiwer.process_words(ref_normalized, hyp_normalized)
This is not jiwer's default behaviour. We document it explicitly because naive jiwer scoring (lowercase + remove-punctuation only) over-counts ~150 errors on the test interview that mainstream ASR evaluation conventions would not count as errors (the same 6.27% jumps to 8.17% under naive scoring).
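Steps 1 to 5 of the normalization can be sketched with the standard library alone. This is an illustration of the listed steps, not the code in src/evaluation/wer_evaluator.py. One assumption is made loudly: apostrophes inside words are kept when punctuation is stripped (otherwise step 1 would have no effect), which the list above does not state. The normalized strings would then be passed to jiwer.process_words as in step 6.

```python
import re

CONTRACTIONS = {
    "gonna": "going to", "wanna": "want to", "gotta": "got to",
    "kinda": "kind of", "sorta": "sort of", "dunno": "do not know",
    "cuz": "because",
}
HESITATIONS = {"uh", "um", "mm", "hmm", "mhmm", "mmhmm", "ah", "er", "eh"}

_CONTRACTION_RE = re.compile(r"\b(" + "|".join(CONTRACTIONS) + r")\b")


def normalize(text):
    """Apply steps 1-5 to one transcript string."""
    # Step 1: fold typographic quotes to the ASCII apostrophe.
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    # Step 2: lowercase.
    text = text.lower()
    # Step 3: expand colloquial contractions.
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    # Step 4: strip punctuation (assumption: apostrophes are kept).
    text = re.sub(r"[^\w\s']", " ", text)
    # Step 5: drop hesitation tokens; split() also collapses whitespace.
    words = [w for w in text.split() if w not in HESITATIONS]
    return " ".join(words)


if __name__ == "__main__":
    print(normalize("I\u2019m gonna, uh, ship it!"))
```

Apply the same function to both the reference and the hypothesis so that neither side is penalized for a convention the other follows.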
DER is computed with
pyannote.metrics.diarization.DiarizationErrorRate(collar=0.25, skip_overlap=False),
using the standard (missed + false_alarm + confusion) / total definition.
Reference and hypothesis annotations are built from the per-segment
(start, end, speaker) tuples without smoothing.
Q = 100 − ½ × (WER + DER) × 100 (range 0–100, higher is better)
A simple linear combination of the two metrics. We use it for single-number ranking in Section 3 because users typically care about "transcript usefulness end-to-end" — both what was said and who said it. Equal weighting is a deliberate choice; if your downstream task weights one metric more (e.g., search vs. interview attribution), recompute with custom weights.
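The formula, including the custom weighting mentioned above, fits in one small function. WER and DER are passed as fractions (6.27% is 0.0627); the wer_weight parameter is an addition for this sketch and defaults to the equal weighting used in this README.

```python
def q_score(wer, der, wer_weight=0.5):
    """Q = 100 - (w * WER + (1 - w) * DER) * 100, with WER/DER as fractions."""
    der_weight = 1.0 - wer_weight
    return 100.0 - (wer_weight * wer + der_weight * der) * 100.0


if __name__ == "__main__":
    # Rows of the Section 2 table, rounded to one decimal.
    print(round(q_score(0.0627, 0.1299), 1))  # raw
    print(round(q_score(0.0627, 0.0428), 1))  # speaker stages only
    print(round(q_score(0.0605, 0.0428), 1))  # full pipeline
    # Example of a task that cares more about who spoke than exact wording.
    print(round(q_score(0.0627, 0.0428, wer_weight=0.3), 1))
```

With the default weight, the three calls reproduce the Q column of the Section 2 table (90.4, 94.7, 94.8).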
- Real-time / streaming latency (this is an offline batch pipeline)
- Multi-speaker scenarios beyond 2 speakers
- Noisy environments beyond what the Section 3 corpus represents
- Languages other than English and Chinese
- Audio under 5 minutes (chunking overhead dominates)
- Adversarial audio (background music, simultaneous speech, etc.)
Apache License 2.0 — see LICENSE. Copyright 2025–2026 Zhenxiong Wen · vincewen2024@gmail.com · @Vincent-WenZX
⚠ All numbers in this README are from a single 2 h 26 min interview (Section 2) or 11 long-form interviews (Section 3). They are illustrative, not statistically rigorous benchmarks. Performance on shorter, noisier, or multi-speaker recordings has not been measured.
