CWX-Transcribe

[Figure: pipeline architecture diagram (architecture.jpg)]

A production pipeline around OpenAI gpt-4o-transcribe-diarize for long-form 2-speaker interviews. Plug in your own domain_terms.json and adapt to any field.


1. What this fixes about the raw model

1.1 No length cap, with consistent speaker IDs across chunks

| Raw API limitation | This pipeline |
|---|---|
| 25 MB / 1400 s hard cap per call | Two-step preprocess: chunk by duration first, then compress only chunks > 24.9 MB |
| Each chunked call labels speakers independently (chunk 1's "A" ≠ chunk 2's "A") | After chunk 1, auto-extract top-N speaker references (2–10 s clips) and pass them to chunks 2…N via known_speaker_references |

Tested on a 2 h 26 min interview (280 MB) — 7 chunks, single consistent speaker labeling end-to-end.
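
The two-step preprocess can be sketched as follows. This is a minimal illustration, not the code in src/preprocessor.py: the function names are hypothetical, equal-length chunks are an assumption (the README only states the 1400 s cap), and "MB" is read as MiB.

```python
import math

MAX_CHUNK_SECONDS = 1400                    # per-call duration cap from the table above
MAX_CHUNK_BYTES = int(24.9 * 1024 * 1024)   # compression threshold (MB read as MiB: assumption)


def plan_chunks(total_seconds, max_seconds=MAX_CHUNK_SECONDS):
    """Step 1: split [0, total_seconds) into the fewest equal-length chunks under the cap."""
    n_chunks = max(1, math.ceil(total_seconds / max_seconds))
    chunk_len = total_seconds / n_chunks
    return [
        (i * chunk_len, min((i + 1) * chunk_len, total_seconds))
        for i in range(n_chunks)
    ]


def needs_compression(chunk_bytes, max_bytes=MAX_CHUNK_BYTES):
    """Step 2: only chunks above the size threshold get re-encoded."""
    return chunk_bytes > max_bytes


chunks = plan_chunks(8758)   # duration of the 2 h 26 min test interview
print(len(chunks))           # 7
```

With the 8,758 s test interview this yields 7 chunks of about 1,251 s each, which matches the chunk count reported above.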

1.2 Diarization hallucination reduction

| Raw API limitation | This pipeline |
|---|---|
| gpt-4o-transcribe-diarize often invents 3–4 speakers for a 2-person interview | Stage 0.2 consolidation: count speakers, keep top-N by frequency (expected_speakers=2), let GPT-4 re-map minor → major (C → A, D → B) |

DER on the test interview: 12.99% → 4.28% (−8.7 pp).
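
A rough sketch of the counting half of Stage 0.2 is shown below. The names are hypothetical, "frequency" is assumed to mean segment count, and the nearest-neighbour re-mapping is only a stand-in for the GPT-4 re-mapping step, which the README does not spell out.

```python
from collections import Counter


def split_speakers(segments, expected_speakers=2):
    """Rank speaker labels by segment count; return (major, minor) label lists."""
    counts = Counter(seg["speaker"] for seg in segments)
    ranked = [label for label, _ in counts.most_common()]
    return ranked[:expected_speakers], ranked[expected_speakers:]


def remap_by_neighbor(segments, expected_speakers=2):
    """Stand-in for the GPT re-mapping (assumption): relabel each minor-speaker
    segment with the most recent major speaker that precedes it."""
    major, _ = split_speakers(segments, expected_speakers)
    last_major = major[0]
    remapped = []
    for seg in segments:
        if seg["speaker"] in major:
            last_major = seg["speaker"]
            remapped.append(dict(seg))
        else:
            remapped.append({**seg, "speaker": last_major})
    return remapped


segments = [{"speaker": s, "text": "..."} for s in "ABACABDB"]
print(split_speakers(segments))   # (['A', 'B'], ['C', 'D'])
print("".join(seg["speaker"] for seg in remap_by_neighbor(segments)))   # ABAAABBB
```

In this toy input the hallucinated labels C and D each appear once, so they fall outside the top 2 and are folded back into A and B.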

1.3 Domain-term recognition + defensive GPT correction

| Raw API limitation | This pipeline |
|---|---|
| Rare / domain-specific terms get mistranscribed (Nemotron → Neutron, Dennard → Denard, OpenClaw → open claw, CUDA → Kuda) | Stage 2 GPT correction with domain_terms.json injected into a strict whitelist system prompt: only domain spell-fixes allowed, no rewriting / contraction expansion / filler removal |
| Open-loop GPT correction occasionally hallucinates refusals ("I'm sorry, I can't assist...") into the transcript | Fix B refusal guard: regex-detect refusal phrases in GPT output, auto-revert to original |
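
The refusal guard idea can be illustrated with a few lines of Python. The phrase list and function names here are assumptions made for the example; the actual regex used by Fix B is not shown in this README.

```python
import re

# Hypothetical phrase list; the project's real pattern may differ.
REFUSAL_PATTERN = re.compile(
    r"\bi(?:'m| am) sorry,? (?:but )?i can(?:'t|not)\b"
    r"|\bi can(?:'t|not) (?:assist|help) with\b"
    r"|\bas an ai\b",
    re.IGNORECASE,
)


def looks_like_refusal(text):
    """True if the text contains a refusal phrase (curly apostrophes folded first)."""
    return bool(REFUSAL_PATTERN.search(text.replace("\u2019", "'")))


def guard_correction(original, corrected):
    """Keep the GPT-corrected text unless it introduced a refusal phrase."""
    if looks_like_refusal(corrected) and not looks_like_refusal(original):
        return original   # auto-revert to the uncorrected segment
    return corrected


print(guard_correction("We trained Neutron on Kuda.",
                       "I'm sorry, I can't assist with that."))
# We trained Neutron on Kuda.
```

Checking the original as well avoids reverting a correction just because the speaker genuinely said something like "I'm sorry, I can't" in the interview.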

2. Benchmark — single 2 h 26 min interview, post-processing config sweep

Lex Fridman × Jensen Huang podcast #494, 22,281 reference words, 2 speakers. WER under NIST sclite-style normalization (contractions + hesitations folded). Q = 100 − ½(WER + DER) × 100 (higher is better).

| Config | Stages on | WER | DER | Q |
|---|---|---|---|---|
| Raw (no post-processing) | none | 6.27% | 12.99% | 90.4 |
| Speaker stages only | 0.1 + 0.2 + 1 | 6.27% | 4.28% | 94.7 |
| Full + NVIDIA dict + gpt-5.5 + rebalance ⭐ | 0.1 + 0.2 + 1 + 1.5 + 2 | 6.05% | 4.28% | 94.8 |

Takeaway: enabling speaker stages (0.1 / 0.2 / 1) is a free Q +4.3 (DER win, no WER cost). Stage 2 (GPT correction) needs a domain-appropriate dict to be net-positive.


3. Benchmark — vs other ASR engines (11 interviews, 3 CN + 8 Eng)

The comparison was run on a separate 11-interview corpus, with every engine scored on the same audio using Q = 100 − ½(WER + DER) × 100. Recording conditions are deliberately tough: noisy commodity laptop-microphone audio, CN/Eng code-switching within the same recording, and ~38 min average duration per interview. This is closer to real-world meeting / field-interview audio than to studio podcast benchmarks.

Overall (all 11 interviews)

| Rank | Engine | Q | Notes |
|---|---|---|---|
| 🥇 | CWX-Transcribe (this) | 85.8 | integrated WER + DER pipeline |
| 🥈 | AssemblyAI Universal-2 | 84.4 | commercial API |
| 🥉 | WhisperX (Whisper + pyannote) | 81.5 | open-source stacked |
| 4 | gpt-4o-transcribe-diarize (raw) | 74.3 | ← what CWX wraps; +11.5 Q on top |
| 5 | NVIDIA Parakeet (RNNT 1.1B) | 73.0 | open-source ASR-only |
| 6 | Qwen3-ASR | 72.9 | |
| 7 | Deepgram Nova-3 | 71.1 | commercial API |

English-only (8 interviews)

| Rank | Engine | Q |
|---|---|---|
| 🥇 | CWX-Transcribe | 88.7 |
| 🥈 | AssemblyAI | 88.4 |
| 🥉 | WhisperX | 83.7 |

Chinese-only (3 interviews)

| Rank | Engine | Q |
|---|---|---|
| 🥇 | CWX-Transcribe | 78.2 |
| 🥈 | WhisperX | 75.6 |
| 🥉 | AssemblyAI | 73.9 |

The +11.5 Q gap between CWX (85.8) and the raw OpenAI API (74.3) is what this production pipeline buys you.


4. Install & Use

git clone https://github.com/Vincent-WenZX/CWX-Transcribe.git
cd CWX-Transcribe
brew install ffmpeg                       # macOS  (Linux: sudo apt install ffmpeg)
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
echo "OPENAI_API_KEY=sk-..." > .env

Command line

# single file
python run_cwx_batch.py meeting.m4a -o ./outputs --lang en --speakers 2

# whole folder (skip-if-exists by default)
python run_cwx_batch.py ./inputs -o ./outputs --lang zh --speakers 2

# force re-run
python run_cwx_batch.py *.m4a -o ./outputs --no-skip

Output:

outputs/<stem>/
  ├── <stem>_cwx.txt           plain (for WER)
  ├── <stem>_cwx_speaker.txt   speaker-tagged (for DER)
  ├── <stem>_cwx_readable.txt  timestamped, human-readable
  └── <stem>_cwx.json          full structured segments

Python API (a few lines for a single call)

from src import Audio, Transcriber
from src.services.openai_service import OpenAIService
import os

trx = Transcriber(OpenAIService(api_key=os.environ["OPENAI_API_KEY"]))
result = trx.transcribe(Audio("meeting.m4a"), language="en", max_speakers=2)
# result["segments"] = [{"speaker":"A", "start":0, "end":3.2, "text":"..."}, ...]
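
Given the segment shape shown in the comment above, a transcript can be rendered with a small helper like the one below. This is only a sketch: the helper names and the line format are hypothetical and not necessarily what <stem>_cwx_readable.txt contains.

```python
def fmt_ts(seconds):
    """Format a second offset as HH:MM:SS (fractions truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_segments(segments):
    """One line per segment: [start-end] speaker: text (hypothetical layout)."""
    return "\n".join(
        f"[{fmt_ts(seg['start'])}-{fmt_ts(seg['end'])}] {seg['speaker']}: {seg['text']}"
        for seg in segments
    )


demo = [
    {"speaker": "A", "start": 0, "end": 3.2, "text": "Welcome to the show."},
    {"speaker": "B", "start": 3.2, "end": 3725.9, "text": "Thanks for having me."},
]
print(render_segments(demo))
# [00:00:00-00:00:03] A: Welcome to the show.
# [00:00:03-01:02:05] B: Thanks for having me.
```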

Custom domain dictionary

Edit config/domain_terms.json — both arrays get flattened and joined comma-separated into the prompt:

{
  "terms":         ["Salesforce", "HubSpot", "OKR", "ARR", "churn"],
  "abbreviations": ["KPI", "ROI", "B2B", "SaaS", "MRR"]
}

Keep the dictionary short: the joined terms plus the base prompt should stay within ~224 tokens.
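
The flattening described above amounts to something like the following. The function name is hypothetical, and the ", " separator is an assumption (the README only says "comma-separated"); src/prompt_engineer.py may differ in detail.

```python
import json


def build_term_string(config):
    """Flatten both arrays, in order, and join them into one comma-separated string."""
    terms = list(config.get("terms", [])) + list(config.get("abbreviations", []))
    return ", ".join(terms)


raw = """
{
  "terms":         ["Salesforce", "HubSpot", "OKR", "ARR", "churn"],
  "abbreviations": ["KPI", "ROI", "B2B", "SaaS", "MRR"]
}
"""
print(build_term_string(json.loads(raw)))
# Salesforce, HubSpot, OKR, ARR, churn, KPI, ROI, B2B, SaaS, MRR
```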

Key configuration knobs (config/settings.yaml)

| Knob | Default | What it does |
|---|---|---|
| transcription.language | en | language code (en/zh/ja/...) |
| transcription.max_speakers | 2 | max speakers (1–4) |
| post_processing.enabled | true | master switch |
| post_processing.consolidate_speakers | true | Stage 0.2: fixes diarization hallucinations |
| post_processing.expected_speakers | 2 | top-N to keep |
| post_processing.merge_same_speaker | true | Stage 1 |
| post_processing.correct_text | false | Stage 2 GPT correction; default OFF (see Fix A) |
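
The knob names above are dotted paths into the nested YAML. As an illustration of how such a lookup works, here is a small sketch over an already-parsed dict; get_setting is a hypothetical helper, not the API of src/config_manager.py.

```python
def get_setting(config, dotted_key, default=None):
    """Walk a nested dict along a dotted path such as 'post_processing.enabled'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# The defaults from the table above, as they would look after YAML parsing.
settings = {
    "transcription": {"language": "en", "max_speakers": 2},
    "post_processing": {
        "enabled": True,
        "consolidate_speakers": True,
        "expected_speakers": 2,
        "merge_same_speaker": True,
        "correct_text": False,
    },
}

print(get_setting(settings, "post_processing.correct_text"))    # False
print(get_setting(settings, "transcription.language"))          # en
```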

5. Project Layout — what each file does

CWX-Transcribe/
├── README.md                              this file
├── LICENSE                                Apache 2.0
├── requirements.txt                       Python dependencies (openai, pydub, jiwer, …)
├── architecture.jpg                       pipeline diagram (shown at the top of this README)
├── .gitignore
├── run_cwx_batch.py                       ⭐ main CLI entry point
│
├── config/
│   ├── settings.yaml                      main config: model, audio limits, post-processing toggles
│   └── domain_terms.json                  pluggable domain dictionary (your vocabulary goes here)
│
├── src/                                   core library
│   ├── audio.py                           Audio class — load, duration/size, slice
│   ├── preprocessor.py                    chunk by duration → compress per chunk (2-step)
│   ├── prompt_engineer.py                 base prompt + inject domain dictionary
│   ├── transcriber.py                     OpenAI gpt-4o-transcribe-diarize call;
│   │                                      cross-chunk speaker-reference loop
│   ├── transcript_post_processor.py       4-stage post-processor (normalize / consolidate /
│   │                                      merge / GPT-correct) with Fix A + Fix B safety guards
│   ├── output_formatter.py                re-base chunk timestamps + write 4 output formats
│   ├── speaker_reference_manager.py       validate + format known-speaker audio samples
│   ├── config_manager.py                  YAML config loader (dotted-path access)
│   ├── utils.py                           helpers: .env loading, duration formatting, etc.
│   ├── services/
│   │   └── openai_service.py              unified OpenAI client wrapper (timeouts, retries, stats)
│   └── evaluation/
│       └── wer_evaluator.py               jiwer-based WER + CER + MER + WIL
│
└── examples/                              runnable scripts (run from project root)
    ├── verify_implementation.py           smoke-test imports & basic init (run once after install)
    ├── example1.py                        verbose end-to-end transcription walkthrough
    ├── usage_example.py                   minimal Python API usage
    ├── transcribe_with_eval.py            transcribe + auto-WER evaluation in one go
    └── evaluate_wer.py                    standalone WER eval against your own ground truth

6. Methodology & Data Sources

All numbers in Sections 2 and 3 come from offline runs against published or internally-collected audio with known ground-truth transcripts. This section documents exactly what was scored and how.

6.1 Section 2 — single-sample benchmark

| Item | Value |
|---|---|
| Audio | Lex Fridman Podcast #494: Jensen Huang × Lex Fridman |
| Duration | 8,758 s (2 h 26 min) |
| Reference words | 22,281 |
| Speakers | 2 (Lex Fridman = interviewer, Jensen Huang = interviewee) |
| Recording quality | studio-grade, clean, mono 16 kHz |
| Ground truth | Lex Fridman's official published transcript (lexfridman.com) |
| Audio shipped with this repo? | ❌ no (copyrighted material; download separately) |

6.2 Section 3 — multi-engine comparison corpus

| Item | Value |
|---|---|
| Number of recordings | 11 long-form interviews |
| Languages | 8 English + 3 Chinese (CN/Eng code-switching within recordings) |
| Average duration | ~38 min |
| Recording quality | noisy commodity laptop microphones; closer to real meeting / field-interview audio than to studio podcasts |
| Ground truth | human-transcribed and human-corrected by domain experts |
| Engines compared | WhisperX (Whisper + pyannote), AssemblyAI Universal-2, Deepgram Nova-3, NVIDIA Parakeet RNNT 1.1B, Qwen3-ASR, gpt-4o-transcribe-diarize (raw), CWX-Transcribe |

6.3 WER scoring standard

WER alignment uses jiwer 4.0+, with NIST sclite-style normalization applied to both reference and hypothesis before scoring. This follows research-community convention for conversational ASR (see NIST sclite glm, Whisper EnglishTextNormalizer):

1. Unicode-fold typographic quotes ("’", "‘") to ASCII apostrophe
2. Lowercase
3. Expand colloquial contractions (applied to both ref AND hyp):
     gonna → going to,  wanna → want to,  gotta → got to,
     kinda → kind of,   sorta → sort of,  dunno → do not know,
     cuz   → because
4. Strip all punctuation; collapse whitespace
5. Remove hesitation tokens entirely:
     {uh, um, mm, hmm, mhmm, mmhmm, ah, er, eh}
6. Compute jiwer.process_words(ref_normalized, hyp_normalized)

This is not jiwer's default behaviour. We document it explicitly because naive jiwer scoring (lowercase + remove-punctuation only) over-counts ~150 errors on the test interview that mainstream ASR evaluation conventions would not count as errors (the same 6.27% jumps to 8.17% under naive scoring).
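
Steps 1–5 above can be approximated with the standard library alone. The sketch below is not the project's evaluator: the function name is hypothetical, and one detail is an assumption (apostrophes are deleted, all other punctuation becomes a space).

```python
import re

CONTRACTIONS = {
    "gonna": "going to", "wanna": "want to", "gotta": "got to",
    "kinda": "kind of", "sorta": "sort of", "dunno": "do not know",
    "cuz": "because",
}
HESITATIONS = {"uh", "um", "mm", "hmm", "mhmm", "mmhmm", "ah", "er", "eh"}
_CONTRACTION_RE = re.compile(r"\b(" + "|".join(CONTRACTIONS) + r")\b")


def normalize(text):
    """Apply normalization steps 1-5 to one reference or hypothesis string."""
    text = text.replace("\u2019", "'").replace("\u2018", "'")             # 1. fold quotes
    text = text.lower()                                                   # 2. lowercase
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)  # 3. contractions
    text = text.replace("'", "")                                          # 4. punctuation
    text = re.sub(r"[^\w\s]", " ", text)
    words = [w for w in text.split() if w not in HESITATIONS]             # 5. hesitations
    return " ".join(words)                                                # whitespace collapsed


print(normalize("Uh, I\u2019m gonna, um, use CUDA, kinda."))
# im going to use cuda kind of
```

Step 6 would then pass the two normalized strings to jiwer.process_words(normalize(ref), normalize(hyp)) and read the wer field of the result.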

6.4 DER scoring standard

DER is computed with pyannote.metrics.DiarizationErrorRate(collar=0.25, skip_overlap=False), using the standard (missed + false_alarm + confusion) / total definition. Reference and hypothesis annotations are built from the per-segment (start, end, speaker) tuples without smoothing.

6.5 Quality Score (Q)

Q  =  100  −  ½ × (WER + DER) × 100        (range 0–100, higher is better)

A simple linear combination of the two metrics. We use it for single-number ranking in Section 3 because users typically care about "transcript usefulness end-to-end" — both what was said and who said it. Equal weighting is a deliberate choice; if your downstream task weights one metric more (e.g., search vs. interview attribution), recompute with custom weights.
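
For readers who want to recompute Q with their own weighting, a minimal sketch follows. The function name and the wer_weight parameter are hypothetical; WER and DER are passed as fractions (0.0627 for 6.27%), and wer_weight=0.5 reproduces the equal-weight formula above.

```python
def quality_score(wer, der, wer_weight=0.5):
    """Q = 100 - 100 * (w * WER + (1 - w) * DER), with WER and DER as fractions."""
    if not 0.0 <= wer_weight <= 1.0:
        raise ValueError("wer_weight must be between 0 and 1")
    return 100.0 - 100.0 * (wer_weight * wer + (1.0 - wer_weight) * der)


# Section 2 rows, equal weighting
print(round(quality_score(0.0627, 0.1299), 1))   # 90.4  (raw)
print(round(quality_score(0.0627, 0.0428), 1))   # 94.7  (speaker stages only)
print(round(quality_score(0.0605, 0.0428), 1))   # 94.8  (full pipeline)

# Example of a search-oriented weighting that cares mostly about WER
print(round(quality_score(0.0605, 0.0428, wer_weight=0.8), 1))
```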

6.6 What this benchmark does not measure

  • Real-time / streaming latency (this is an offline batch pipeline)
  • Multi-speaker scenarios beyond 2 speakers
  • Noisy environments beyond what the Section 3 corpus represents
  • Languages other than English and Chinese
  • Audio under 5 minutes (chunking overhead dominates)
  • Adversarial audio (background music, simultaneous speech, etc.)

License & Contact

Apache License 2.0 — see LICENSE. Copyright 2025–2026 Zhenxiong Wen · vincewen2024@gmail.com · @Vincent-WenZX

⚠ All numbers in this README are from a single 2 h 26 min interview (Section 2) or 11 long-form interviews (Section 3). They are illustrative, not statistically rigorous benchmarks. Performance on shorter, noisier, or multi-speaker recordings has not been measured.
