CWX-Transcribe

A production pipeline around OpenAI's `gpt-4o-transcribe-diarize` for long-form, two-speaker interviews. Plug in your own `domain_terms.json` to adapt it to any field.

1. What this fixes about the raw model

1.1 No length cap, with consistent speaker IDs across chunks

| Raw API limitation | This pipeline |
| --- | --- |
| 25 MB / 1400 s hard cap per call | Two-step preprocess: chunk by duration first, then compress only chunks > 24.9 MB |
| Each chunked call labels speakers independently (chunk 1's "A" ≠ chunk 2's "A") | After chunk 1, auto-extract top-N speaker references (2–10 s clips) and pass them to chunks 2…N via `known_speaker_references` |
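To make the reference-extraction step concrete, here is a minimal sketch of how clips could be chosen from the first chunk's segments. It is not the repository's code: the function name `pick_speaker_references`, the "longest clip within 2–10 s" rule, and ranking speakers by total talk time are all assumptions. The segment shape follows the `result["segments"]` format shown in the Python API section.

```python
from collections import defaultdict


def pick_speaker_references(segments, top_n=2, min_len=2.0, max_len=10.0):
    """Pick one reference clip per dominant speaker (hypothetical helper).

    Returns {speaker: (start, end)} for the top_n speakers by talk time,
    using each speaker's longest segment that lasts between min_len and
    max_len seconds.
    """
    talk_time = defaultdict(float)
    candidates = defaultdict(list)
    for seg in segments:
        duration = seg["end"] - seg["start"]
        talk_time[seg["speaker"]] += duration
        if min_len <= duration <= max_len:
            candidates[seg["speaker"]].append(seg)

    # Assumption: "top-N" means the N speakers who talk the most.
    ranked = sorted(talk_time, key=talk_time.get, reverse=True)[:top_n]
    refs = {}
    for speaker in ranked:
        if candidates[speaker]:
            best = max(candidates[speaker], key=lambda s: s["end"] - s["start"])
            refs[speaker] = (best["start"], best["end"])
    return refs


chunk1 = [
    {"speaker": "A", "start": 0.0, "end": 3.2, "text": "..."},
    {"speaker": "B", "start": 3.2, "end": 9.0, "text": "..."},
    {"speaker": "A", "start": 9.0, "end": 30.0, "text": "..."},  # too long for a clip
    {"speaker": "C", "start": 30.0, "end": 31.0, "text": "..."},  # minor speaker
    {"speaker": "B", "start": 31.0, "end": 35.0, "text": "..."},
]
refs = pick_speaker_references(chunk1)
```

The selected time ranges would then be sliced out of the chunk-1 audio and supplied with the later chunk requests through `known_speaker_references`.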

Tested on a 2 h 26 min interview (280 MB) — 7 chunks, single consistent speaker labeling end-to-end.
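The chunk count above can be reproduced with a small duration-based planner. This is a sketch under assumptions: `plan_chunks` is a hypothetical name, and splitting into equal-length chunks is one possible policy (the real preprocessor may cut at fixed 1400 s boundaries or elsewhere). The 8,758 s duration is the test interview's length from Section 6.1.

```python
import math


def plan_chunks(total_seconds, max_chunk_seconds=1400.0):
    """Split a recording into equal chunks no longer than max_chunk_seconds."""
    n_chunks = max(1, math.ceil(total_seconds / max_chunk_seconds))
    size = total_seconds / n_chunks
    # Build boundaries explicitly so the last chunk ends exactly at the total.
    bounds = [i * size for i in range(n_chunks)] + [total_seconds]
    return list(zip(bounds[:-1], bounds[1:]))


chunks = plan_chunks(8758)
for index, (start, end) in enumerate(chunks, start=1):
    print(f"chunk {index}: {start:8.1f} s -> {end:8.1f} s")
```

Only chunks that still exceed the size limit after this step would need the compression pass described in the table.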

1.2 Diarization hallucination reduction

| Raw API limitation | This pipeline |
| --- | --- |
| `gpt-4o-transcribe-diarize` often invents 3–4 speakers for a 2-person interview | Stage 0.2 consolidation: count speakers, keep top-N by frequency (`expected_speakers=2`), let GPT-4 re-map minor → major (`C → A`, `D → B`) |

DER on the test interview: 12.99% → 4.28% (−8.7 pp).
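The counting and top-N part of Stage 0.2 can be sketched as follows. Note the swap: the pipeline lets GPT-4 decide the minor → major mapping, whereas this illustration uses a simple stand-in heuristic (a minor label inherits the speaker of the preceding segment). The function name `consolidate_speakers` is hypothetical.

```python
from collections import Counter


def consolidate_speakers(segments, expected_speakers=2):
    """Keep the most frequent speaker labels and re-map the rest.

    Stand-in heuristic (NOT the GPT-based re-map described above): a minor
    label is replaced by the label of the previous, already-consolidated
    segment, or by the most frequent speaker if it comes first.
    """
    counts = Counter(seg["speaker"] for seg in segments)
    major = [speaker for speaker, _ in counts.most_common(expected_speakers)]

    consolidated = []
    for seg in segments:
        speaker = seg["speaker"]
        if speaker not in major:
            speaker = consolidated[-1]["speaker"] if consolidated else major[0]
        consolidated.append({**seg, "speaker": speaker})
    return consolidated


raw = [{"speaker": s, "text": "..."} for s in ["A", "B", "A", "C", "B", "D", "A"]]
labels = [seg["speaker"] for seg in consolidate_speakers(raw)]
```

A context-aware re-map such as the GPT step can use what is being said to attribute a stray segment, which a positional heuristic like this one cannot do.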

1.3 Domain-term recognition + defensive GPT correction

| Raw API limitation | This pipeline |
| --- | --- |
| Rare / domain-specific terms get mistranscribed (`Nemotron → Neutron`, `Dennard → Denard`, `OpenClaw → open claw`, `CUDA → Kuda`) | Stage 2 GPT correction with `domain_terms.json` injected into a strict whitelist system prompt: only domain spell-fixes allowed, no rewriting / contraction expansion / filler removal |
| Open-loop GPT correction occasionally hallucinates refusals (`"I'm sorry, I can't assist..."`) into the transcript | Fix B refusal guard: regex-detect refusal phrases in GPT output, auto-revert to original |
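A refusal guard of the kind described for Fix B might look like the sketch below. The regular expression and the name `guard_correction` are illustrative assumptions; the pattern list actually used by the pipeline is not shown in this README.

```python
import re

# Illustrative patterns only; a real guard would likely cover more phrasings.
REFUSAL_RE = re.compile(
    r"\b(i'?m sorry,? (but )?i can(?:'?t|not)"
    r"|i can(?:'?t|not) (assist|help)"
    r"|as an ai)\b",
    re.IGNORECASE,
)


def guard_correction(original, corrected):
    """Return the corrected text unless it looks like a model refusal."""
    # Fold the typographic apostrophe so "can’t" and "can't" both match.
    if REFUSAL_RE.search(corrected.replace("\u2019", "'")):
        return original  # revert: never let a refusal reach the transcript
    return corrected


segment = "We ship on kuda cores."
reverted = guard_correction(segment, "I'm sorry, I can't assist with that.")
accepted = guard_correction(segment, "We ship on CUDA cores.")
```

Reverting to the uncorrected segment keeps a possibly misspelled term, which is a smaller error than replacing the speaker's words with a refusal message.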

2. Benchmark — single 2 h 26 min interview, post-processing config sweep

Lex Fridman × Jensen Huang podcast #494, 22,281 reference words, 2 speakers. WER under NIST sclite-style normalization (contractions + hesitations folded). Q = 100 − ½(WER + DER) × 100 (higher is better).

| Config | Stages on | WER | DER | Q |
| --- | --- | --- | --- | --- |
| Raw (no post-processing) | none | 6.27% | 12.99% | 90.4 |
| Speaker stages only | 0.1 + 0.2 + 1 | 6.27% | 4.28% | 94.7 |
| Full + NVIDIA dict + gpt-5.5 + rebalance ⭐ | 0.1 + 0.2 + 1 + 1.5 + 2 | 6.05% | 4.28% | 94.8 |

Takeaway: enabling the speaker stages (0.1 / 0.2 / 1) gains +4.3 Q for free (a DER win at no WER cost). Stage 2 (GPT correction) needs a domain-appropriate dictionary to be net-positive.

3. Benchmark — vs other ASR engines (11 interviews, 3 CN + 8 Eng)

This comparison was run on a separate 11-interview corpus, with every engine scored on the same audio files using Q = 100 − ½(WER + DER) × 100. Recording conditions are deliberately tough: noisy commodity laptop-microphone audio, CN/Eng code-switching within the same recording, and ~38 min average duration per interview. This is closer to real-world meeting / field-interview audio than to studio podcast benchmarks.

Overall (all 11 interviews)

| Rank | Engine | Q | Notes |
| --- | --- | --- | --- |
| 🥇 | CWX-Transcribe (this) | 85.8 | integrated WER + DER pipeline |
| 🥈 | AssemblyAI Universal-2 | 84.4 | commercial API |
| 🥉 | WhisperX (Whisper + pyannote) | 81.5 | open-source stacked |
| 4 | gpt-4o-transcribe-diarize (raw) | 74.3 | ← what CWX wraps; +11.5 Q on top |
| 5 | NVIDIA Parakeet (RNNT 1.1B) | 73.0 | open-source ASR-only |
| 6 | Qwen3-ASR | 72.9 | |
| 7 | Deepgram Nova-3 | 71.1 | commercial API |

English-only (8 interviews)

| Rank | Engine | Q |
| --- | --- | --- |
| 🥇 | CWX-Transcribe | 88.7 |
| 🥈 | AssemblyAI | 88.4 |
| 🥉 | WhisperX | 83.7 |

Chinese-only (3 interviews)

| Rank | Engine | Q |
| --- | --- | --- |
| 🥇 | CWX-Transcribe | 78.2 |
| 🥈 | WhisperX | 75.6 |
| 🥉 | AssemblyAI | 73.9 |

The +11.5 Q gap between CWX (85.8) and the raw OpenAI API (74.3) is what this production pipeline buys you.

4. Install & Use

```bash
git clone https://github.com/Vincent-WenZX/CWX-Transcribe.git
cd CWX-Transcribe
brew install ffmpeg                       # macOS  (Linux: sudo apt install ffmpeg)
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "OPENAI_API_KEY=sk-..." > .env
```

Command line

```bash
# single file
python run_cwx_batch.py meeting.m4a -o ./outputs --lang en --speakers 2

# whole folder (skip-if-exists by default)
python run_cwx_batch.py ./inputs -o ./outputs --lang zh --speakers 2

# force re-run
python run_cwx_batch.py *.m4a -o ./outputs --no-skip
```

Output:

outputs/<stem>/
  ├── <stem>_cwx.txt           plain (for WER)
  ├── <stem>_cwx_speaker.txt   speaker-tagged (for DER)
  ├── <stem>_cwx_readable.txt  timestamped, human-readable
  └── <stem>_cwx.json          full structured segments

Python API (4 lines for a single call)

```python
import os

from src import Audio, Transcriber
from src.services.openai_service import OpenAIService

trx = Transcriber(OpenAIService(api_key=os.environ["OPENAI_API_KEY"]))
result = trx.transcribe(Audio("meeting.m4a"), language="en", max_speakers=2)
# result["segments"] = [{"speaker":"A", "start":0, "end":3.2, "text":"..."}, ...]
```
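Once you have `result["segments"]`, turning it into a timestamped, speaker-tagged view takes only a few lines. The sketch below assumes the segment shape shown in the comment above; `format_readable` and the `[HH:MM:SS] SPEAKER: text` layout are illustrative and may differ from what `<stem>_cwx_readable.txt` actually contains.

```python
def format_readable(segments):
    """Render segments as '[HH:MM:SS] SPEAKER: text' lines (hypothetical layout)."""
    lines = []
    for seg in segments:
        hours, rest = divmod(int(seg["start"]), 3600)
        minutes, seconds = divmod(rest, 60)
        stamp = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
        lines.append(f"[{stamp}] {seg['speaker']}: {seg['text']}")
    return "\n".join(lines)


demo_segments = [
    {"speaker": "A", "start": 0, "end": 3.2, "text": "Hello."},
    {"speaker": "B", "start": 3725.4, "end": 3727.0, "text": "Hi."},
]
readable = format_readable(demo_segments)
print(readable)
```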

Custom domain dictionary

Edit config/domain_terms.json — both arrays get flattened and joined comma-separated into the prompt:

```json
{
  "terms":         ["Salesforce", "HubSpot", "OKR", "ARR", "churn"],
  "abbreviations": ["KPI", "ROI", "B2B", "SaaS", "MRR"]
}
```

The dictionary terms and the base prompt together must fit within roughly 224 tokens.
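The flatten-and-join behaviour can be pictured with the sketch below. The function name `build_term_prompt` and the way the term list is appended to a base prompt are assumptions for illustration; `prompt_engineer.py` may assemble the prompt differently.

```python
import json


def build_term_prompt(raw_json, base_prompt=""):
    """Flatten both dictionary arrays and join them comma-separated."""
    data = json.loads(raw_json)
    terms = []
    for key in ("terms", "abbreviations"):
        terms.extend(data.get(key, []))
    joined = ", ".join(terms)
    # Assumption: the term list simply follows the base prompt.
    return f"{base_prompt} {joined}".strip()


raw = """
{
  "terms":         ["Salesforce", "HubSpot", "OKR", "ARR", "churn"],
  "abbreviations": ["KPI", "ROI", "B2B", "SaaS", "MRR"]
}
"""
term_list = build_term_prompt(raw)
full_prompt = build_term_prompt(raw, base_prompt="Interview about sales tooling.")
```

Because the budget is shared with the base prompt, a long dictionary is best trimmed to the terms most likely to be misheard.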

Key configuration knobs (`config/settings.yaml`)

| Knob | Default | What it does |
| --- | --- | --- |
| `transcription.language` | `en` | language code (en/zh/ja/...) |
| `transcription.max_speakers` | 2 | max speakers (1–4) |
| `post_processing.enabled` | true | master switch |
| `post_processing.consolidate_speakers` | true | Stage 0.2: fixes diarization hallucinations |
| `post_processing.expected_speakers` | 2 | top-N to keep |
| `post_processing.merge_same_speaker` | true | Stage 1 |
| `post_processing.correct_text` | false | Stage 2 GPT correction; default OFF (see Fix A) |
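The knob names above are dotted paths into a nested config. As a rough illustration of how such a lookup works, the sketch below mirrors the documented defaults in a plain dictionary; `DEFAULTS` and `get_setting` are hypothetical names, and the real `config_manager.py` loads these values from `config/settings.yaml` rather than from a literal.

```python
# Documented defaults from the table above, written as a nested dict.
DEFAULTS = {
    "transcription": {"language": "en", "max_speakers": 2},
    "post_processing": {
        "enabled": True,
        "consolidate_speakers": True,
        "expected_speakers": 2,
        "merge_same_speaker": True,
        "correct_text": False,
    },
}


def get_setting(config, dotted_path, default=None):
    """Walk a nested dict using a 'section.key' path."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


language = get_setting(DEFAULTS, "transcription.language")
correct_text = get_setting(DEFAULTS, "post_processing.correct_text")
missing = get_setting(DEFAULTS, "post_processing.not_a_knob", default="n/a")
```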

5. Project Layout — what each file does

CWX-Transcribe/
├── README.md                              this file
├── LICENSE                                Apache 2.0
├── requirements.txt                       Python dependencies (openai, pydub, jiwer, …)
├── architecture.jpg                       pipeline diagram (shown at the top of this README)
├── .gitignore
├── run_cwx_batch.py                       ⭐ main CLI entry point
│
├── config/
│   ├── settings.yaml                      main config: model, audio limits, post-processing toggles
│   └── domain_terms.json                  pluggable domain dictionary (your vocabulary goes here)
│
├── src/                                   core library
│   ├── audio.py                           Audio class — load, duration/size, slice
│   ├── preprocessor.py                    chunk by duration → compress per chunk (2-step)
│   ├── prompt_engineer.py                 base prompt + inject domain dictionary
│   ├── transcriber.py                     OpenAI gpt-4o-transcribe-diarize call;
│   │                                      cross-chunk speaker-reference loop
│   ├── transcript_post_processor.py       4-stage post-processor (normalize / consolidate /
│   │                                      merge / GPT-correct) with Fix A + Fix B safety guards
│   ├── output_formatter.py                re-base chunk timestamps + write 4 output formats
│   ├── speaker_reference_manager.py       validate + format known-speaker audio samples
│   ├── config_manager.py                  YAML config loader (dotted-path access)
│   ├── utils.py                           helpers: .env loading, duration formatting, etc.
│   ├── services/
│   │   └── openai_service.py              unified OpenAI client (timeouts, retries, stats)
│   └── evaluation/
│       └── wer_evaluator.py               jiwer-based WER + CER + MER + WIL
│
└── examples/                              runnable scripts (run from project root)
    ├── verify_implementation.py           smoke-test imports & basic init (run once after install)
    ├── example1.py                        verbose end-to-end transcription walkthrough
    ├── usage_example.py                   minimal Python API usage
    ├── transcribe_with_eval.py            transcribe + auto-WER evaluation in one go
    └── evaluate_wer.py                    standalone WER eval against your own ground truth

6. Methodology & Data Sources

All numbers in Sections 2 and 3 come from offline runs against published or internally-collected audio with known ground-truth transcripts. This section documents exactly what was scored and how.

6.1 Section 2 — single-sample benchmark


| Field | Value |
| --- | --- |
| Audio | Lex Fridman Podcast #494: Jensen Huang × Lex Fridman |
| Duration | 8,758 s (2 h 26 min) |
| Reference words | 22,281 |
| Speakers | 2 (Lex Fridman = interviewer, Jensen Huang = interviewee) |
| Recording quality | studio-grade, clean, mono 16 kHz |
| Ground truth | Lex Fridman's official published transcript (lexfridman.com) |
| Audio shipped with this repo? | ❌ no (copyrighted material; download separately) |

6.2 Section 3 — multi-engine comparison corpus


| Field | Value |
| --- | --- |
| Number of recordings | 11 long-form interviews |
| Languages | 8 English + 3 Chinese (CN/Eng code-switching within recordings) |
| Average duration | ~38 min |
| Recording quality | noisy commodity laptop microphones; closer to real meeting / field-interview audio than to studio podcasts |
| Ground truth | human-transcribed and human-corrected by domain experts |
| Engines compared | WhisperX (Whisper + pyannote), AssemblyAI Universal-2, Deepgram Nova-3, NVIDIA Parakeet RNNT 1.1B, Qwen3-ASR, gpt-4o-transcribe-diarize (raw), CWX-Transcribe |

6.3 WER scoring standard

WER is computed with jiwer 4.0+ for the alignment, with NIST sclite-style normalization applied to both reference and hypothesis before scoring. This follows the research-community convention for conversational ASR (see the NIST sclite GLM and Whisper's EnglishTextNormalizer):

1. Unicode-fold typographic quotes ("’", "‘") to ASCII apostrophe
2. Lowercase
3. Expand colloquial contractions (applied to both ref AND hyp):
     gonna → going to,  wanna → want to,  gotta → got to,
     kinda → kind of,   sorta → sort of,  dunno → do not know,
     cuz   → because
4. Strip all punctuation; collapse whitespace
5. Remove hesitation tokens entirely:
     {uh, um, mm, hmm, mhmm, mmhmm, ah, er, eh}
6. Compute jiwer.process_words(ref_normalized, hyp_normalized)

This is not jiwer's default behaviour. We document it explicitly because naive jiwer scoring (lowercase + remove-punctuation only) over-counts ~150 errors on the test interview that mainstream ASR evaluation conventions would not count as errors (the same 6.27% jumps to 8.17% under naive scoring).
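Steps 1 to 5 of the normalization can be sketched in plain Python as below. This is an approximation rather than the evaluator's code: `normalize` is a hypothetical name, and one detail is an assumption, namely that apostrophes inside words are kept when punctuation is stripped (the list does not say, but step 1 would be redundant otherwise).

```python
import re

CONTRACTIONS = {
    "gonna": "going to", "wanna": "want to", "gotta": "got to",
    "kinda": "kind of", "sorta": "sort of", "dunno": "do not know",
    "cuz": "because",
}
HESITATIONS = {"uh", "um", "mm", "hmm", "mhmm", "mmhmm", "ah", "er", "eh"}
CONTRACTION_RE = re.compile(r"\b(" + "|".join(CONTRACTIONS) + r")\b")


def normalize(text):
    """Apply steps 1-5; run on BOTH reference and hypothesis."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")         # 1. fold quotes
    text = text.lower()                                               # 2. lowercase
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)  # 3. expand
    text = re.sub(r"[^\w\s']", " ", text)      # 4. strip punctuation (apostrophes kept)
    words = [w for w in text.split() if w not in HESITATIONS]         # 5. drop fillers
    return " ".join(words)                     # also collapses whitespace


ref_norm = normalize("Um, I\u2019m gonna go, kinda.")
hyp_norm = normalize("Mm-hmm, we wanna ship CUDA!")
```

The two normalized strings are then what step 6 passes to `jiwer.process_words`, whose result exposes the WER.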

6.4 DER scoring standard

DER is computed with `pyannote.metrics.DiarizationErrorRate(collar=0.25, skip_overlap=False)`, using the standard (missed + false_alarm + confusion) / total definition. Reference and hypothesis annotations are built from the per-segment (start, end, speaker) tuples without smoothing.
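To show what the (missed + false_alarm + confusion) / total definition measures, here is a deliberately simplified, dependency-free sketch. It is not pyannote and will not reproduce the reported numbers: it samples 10 ms frames, ignores the collar, assumes one speaker per frame (no overlap), and finds the speaker mapping by brute force. The names `to_frames` and `simple_der` are hypothetical.

```python
from itertools import permutations


def to_frames(segments, n_frames, step):
    """Label each frame with its speaker (None = silence)."""
    frames = [None] * n_frames
    for start, end, speaker in segments:
        for i in range(round(start / step), min(n_frames, round(end / step))):
            frames[i] = speaker
    return frames


def simple_der(reference, hypothesis, step=0.01):
    """Frame-based DER; assumes the reference contains some speech."""
    n_frames = round(max(end for _, end, _ in reference + hypothesis) / step)
    ref = to_frames(reference, n_frames, step)
    hyp = to_frames(hypothesis, n_frames, step)

    total = sum(r is not None for r in ref)
    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))
    both = [(r, h) for r, h in zip(ref, hyp) if r is not None and h is not None]

    # Try every one-to-one mapping between hypothesis and reference labels.
    ref_spk = sorted({r for r, _ in both})
    hyp_spk = sorted({h for _, h in both})
    if len(hyp_spk) <= len(ref_spk):
        mappings = (dict(zip(hyp_spk, p)) for p in permutations(ref_spk, len(hyp_spk)))
    else:
        mappings = (dict(zip(p, ref_spk)) for p in permutations(hyp_spk, len(ref_spk)))
    best_correct = max(sum(m.get(h) == r for r, h in both) for m in mappings)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / total


reference = [(0.0, 5.0, "lex"), (5.0, 10.0, "jensen")]
hypothesis = [(0.0, 4.0, "A"), (4.0, 10.0, "B")]  # speaker change detected 1 s early
der = simple_der(reference, hypothesis)
```

In this toy case one second out of ten is attributed to the wrong speaker, so the confusion term alone accounts for the error.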

6.5 Quality Score (Q)

Q  =  100  −  ½ × (WER + DER) × 100        (range 0–100, higher is better)

A simple linear combination of the two metrics. We use it for single-number ranking in Section 3 because users typically care about "transcript usefulness end-to-end" — both what was said and who said it. Equal weighting is a deliberate choice; if your downstream task weights one metric more (e.g., search vs. interview attribution), recompute with custom weights.
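Recomputing Q with different weights is a one-line change. The helper below is an illustrative sketch (`quality_score` and its `wer_weight` parameter are hypothetical names); with the default equal weighting it reproduces the Q column of the Section 2 table from the reported WER and DER values.

```python
def quality_score(wer, der, wer_weight=0.5):
    """Q = 100 - (w * WER + (1 - w) * DER) * 100, with WER and DER as fractions."""
    der_weight = 1.0 - wer_weight
    return 100.0 - (wer_weight * wer + der_weight * der) * 100.0


raw_q = quality_score(0.0627, 0.1299)       # raw, no post-processing
speaker_q = quality_score(0.0627, 0.0428)   # speaker stages only
full_q = quality_score(0.0605, 0.0428)      # full pipeline

# Example of a custom weighting that cares more about attribution (DER).
attribution_q = quality_score(0.0627, 0.1299, wer_weight=0.25)

print(round(raw_q, 1), round(speaker_q, 1), round(full_q, 1))
```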

6.6 What this benchmark does not measure

Real-time / streaming latency (this is an offline batch pipeline)
Multi-speaker scenarios beyond 2 speakers
Noisy environments beyond what the Section 3 corpus represents
Languages other than English and Chinese
Audio under 5 minutes (chunking overhead dominates)
Adversarial audio (background music, simultaneous speech, etc.)

License & Contact

⚠ All numbers in this README are from a single 2 h 26 min interview (Section 2) or 11 long-form interviews (Section 3). They are illustrative, not statistically rigorous benchmarks. Performance on shorter, noisier, or multi-speaker recordings has not been measured.
