feat: v0.6 Speaker Diarization (backend — phases 9–13, 15; UI deferred) by Nowhitestar · Pull Request #51 · Nowhitestar/Yulu

Nowhitestar · 2026-06-07T06:01:08Z

v0.6 — Speaker Diarization (backend)

Adds local-first, cross-platform speaker attribution ("who-said-what") to Yulu meeting transcripts as a post-process. ASR stays MLX/whisper.cpp untouched; the engine is sherpa-onnx (ONNX Runtime, no torch, ~33 MB models), chosen over FunASR via de-risking spikes (.planning/spikes/001, 002) specifically because it satisfies the "must not hard-couple to macOS" mandate.

Reframe: this is the N-speaker generalization of Yulu's existing 2-speaker dual-track path (transcript_merge.py's 我/对方) — splitting the far-end stream into voices, riding the existing sidecar → prompt-var rails.

Scope — phases 9, 10, 11, 12, 13, 15 (UI deferred)

Phase	What
9	Pure `speaker_merge` core + `<stem>.speakers.json` sidecar (overlap-assign, coverage-gap fallback, hallucination-gate, idempotent re-anchor so renames survive)
10	`SherpaDiarizeBackend` (config-selected Protocol, held OFF the ASR fallback chain) + offline ONNX provisioning + tri-state `probe_diarization()` + warm-up
11	Torch-free DER/WDER eval harness (dev-venv only) + constructed-ground-truth corpus → ADR-005 default = sherpa-onnx
12	Speaker-count strategy: calendar-attendee prior → CN-calibrated threshold → fail-toward-under-merge (+ fixes a count override-bleed bug)
13	Live pipeline wiring: ASR→diarize→merge→`.transcript.txt`+`.speakers.json`→search; speaker-attributed summary via one additive `{{speaker_transcript}}`/`{{speaker_list}}` prompt-var pair
15	Portability/footprint/migration: sherpa cp314 verified on Python 3.14 → co-locate; footprint budget; `yulu migrate` re-provisions idempotently; `config.example.json` diarization block

⏸ Phase 14 (Speaker UI) is intentionally DEFERRED — gated on the in-flight web-UI redesign (feat/recordings-ui, feat/settings-ui-p1, …). This PR changes ZERO yulu_ui/ files, so it should not conflict with those branches. The stored .speakers.json + helpers are ready for the UI when the redesign lands.

Safety / blast radius

Diarization is OFF by default (transcription.diarization.enabled); existing behavior unchanged.
Graceful-degrade: sherpa is lazy-imported; if absent the pipeline writes today's plain transcript with no error (verified on the real 3.14 runtime where sherpa isn't yet installed).
Additive prompt vars default to "" → every existing prompt renders byte-identical.
Net new shipped-runtime dep: one (sherpa-onnx); eval deps (pyannote.metrics) are dev-venv only.

Results

Footprint: 78-min meeting diarizes in ~6.7 min (RTF 0.086, no O(n²)); peak 1.69 GB RSS / 4.49 GB footprint.
Accuracy (eval gate): EN DER 0.007; CN auto-DER 0.682 → 0.505 with the count strategy (the calendar-attendee count is the reliable CN lever; real-CN gold labels = a deferred human task).
Tests: full suite 1021 passed / 1 skipped, zero regressions across all 6 phases. Each phase independently verified (.planning/phases/*/*-VERIFICATION.md).

Not done in this PR (follow-ups)

Phase 14 (Speaker UI) — after the web-UI redesign lands + stabilizes (plan against the new components).
Human-labelled real CN+EN gold corpus (gold-standard DER).
Release is not triggered here — left for the release-please Release PR.

🤖 Generated with Claude Code

Add SUMMARY.md synthesizing the four v0.6 Speaker Diarization research reports (STACK/FEATURES/ARCHITECTURE/PITFALLS) into roadmap implications. Cross-cutting signals: v0.6 is the N-speaker generalization of the existing 2-speaker dual-track path (not greenfield); swappable seam is a DiarizeBackend Protocol (sherpa-onnx default), not a CapabilityProvider subclass; pure speaker_merge module is the highest-risk logic; data model is a <stem>.speakers.json sidecar; the DER/WDER eval harness is the gate; sherpa over-splits on CN (count-strategy ladder needed); embeddings are biometric voiceprints (ephemeral, never cloud); zero new shipped runtime dependency beyond sherpa-onnx. Also commits the four v0.6 research reports, the archived v0.5 research, and spikes 001/002 (the grounding evidence). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…backend-first sequencing

Phase 9 (MERGE-01..05). N-speaker sibling of transcript_merge.py — pure, I/O-free overlap-assignment with zero sherpa/daemon/SQLite/network. - assign_speakers(): max-overlap argmax per ASR segment → labelled segments + [MM:SS <name>] rendered string (transcript_merge line format) - coverage-gap fallback ladder: same-speaker-bracket → nearest-within-window → explicit UNKNOWN; never drops text, never snaps across a speaker boundary - hallucination/repeat guard: collapse consecutive identical same-speaker text; flag duplicate-in-silence as low-confidence, never a confident wrong owner - <stem>.speakers.json sidecar: build/write(atomic os.replace)/read/round-trip; stores only abstract speaker_ids — no biometric embeddings - idempotent re-anchor: reanchor_by_overlap() maps fresh volatile cluster indices to stable speaker_ids by overlap; user renames never overwritten Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

32 tests, zero sherpa/daemon/SQLite/network. Covers all 5 Phase-9 ROADMAP success criteria + edge cases (empty input, full coverage gap, overlapping turns, hallucination repeat, re-anchor preserving a rename, boundary non-crossing). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Task breakdown followed + 5-criteria→test mapping + Phase 13 carry-over notes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… resolution - DiarizeBackend Protocol mirrors STT warm_up/is_ready/release, returns SpeakerTurns - SherpaDiarizeBackend: lazy sherpa_onnx import, resident pipeline, 1s-silence warm-up - SpeakerTurn.to_dict emits speaker_idx+speaker keys -> feeds Phase-9 speaker_merge verbatim - resolve_model_paths/models_present: single source of truth for the two ONNX files - torch-free, offline-by-default (local .onnx by absolute path, no network)

…the ASR dict - DaemonConfig: transcription.diarization.* (enabled/provider/seg_model/emb_model/num_speakers/threshold) - _build_diarize_backend(): provider-selected (sherpa-onnx), disabled/unknown -> None - attached to app.diarize_backend; NEVER inserted into _build_real_backends() ASR dict so STTRuntime._engine_chain can never route ASR to diarization (Anti-Pattern 1)

…e models step - setup_models.sh: setup_diarization_models() downloads seg+cam++ ONNX to models/diarization (gated on transcription.diarization.enabled; skips existing files; extracts seg from tar.bz2) - refactor whisper body into setup_whisper_model(); setup_models() runs both concerns - registry _model_present() now requires BOTH whisper AND diarization halves (diarization short-circuits True when disabled) -> step count stays six, no 7th step - diarization file check delegates to backends.diarize.models_present (single source of truth)

… doctor - probe_diarization(): usable (models present + sherpa_onnx importable by daemon interp) / present-but-unverified (models present, sherpa not importable) / absent (no models) - always yulu-managed provenance (NOT a CapabilityProvider/agent-config reframe) - path-bounded (fixed models/diarization root via backends.diarize.models_present); never raises - folded into doctor._host_capabilities as capabilities['diarization']; appears in --json

…ovision idempotency - 21 backend tests (sherpa mocked): warm_up dummy-pass, is_ready/release, cancel, config selection, SpeakerTurn->Phase-9 contract, model resolution - THE isolation invariant: diarize backend absent from ASR dict; STTRuntime can't route to it - 14 provision/probe tests: _model_present combines whisper+diarization halves; registry stays 6 steps; apply() skips when check() satisfied; probe_diarization tri-state x yulu-managed - mark DiarizeBackend Protocol @runtime_checkable for isinstance lifecycle assertions

- drives the PRODUCTION SherpaDiarizeBackend (not the spike script) in a subprocess, re-execing the spike venv (sherpa 1.13.2) when the runner interp lacks sherpa - test_real_diarization_returns_sane_turns: 60s clip -> >=1 turn, sane speaker count (1..8) - test_real_diarization_works_offline: HF_HUB_OFFLINE=1 + dead proxies still loads local ONNX - skips cleanly (skipif) when models/clip absent; marked integration (mirrors test_e2e_stt_daemon)

- 10-CONTEXT/PLAN/SUMMARY artifacts (5/5 criteria -> evidence mapping, 31 tests, deviations) - STATE: Phase 10 complete, Phase 11 next; metric row; autonomous run log - ROADMAP: Phase 10 checked, progress row (904 pass) - REQUIREMENTS: DIAR-01..05 marked complete

…TM + UI-copy

…e 11 artifacts

…hecked)

…leed into auto Phase-10 carry-forward: `diarize(num_speakers=N)` rebuilt and reassigned `self._sd`, so the next auto call reused the forced-count pipeline (the override bled into auto mode). Replace with a count-keyed cache: warm_up seeds the resident default pipeline into `self._pipelines[(−1, threshold)]` and into `self._sd` (never reassigned); diarize serves from / fills the cache by a normalized `(num_clusters, threshold)` key. Also add a per-call `threshold` arg (the Phase-12 count strategy needs it). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ver-split fix) New pure, dependency-free `stt_daemon/speaker_count.py` (COUNT-01..03): - `resolve_speaker_count(...)`: config pin > calendar attendee prior (clamped to [2, MAX_AUTO_SPEAKERS=8] — fail-toward-under-merge) > auto at the calibrated threshold (0.5, language-keyed seam for a future CN value). - `reconcile_count(auto_count, ...)`: the criterion-4 two-pass decision — force the calendar prior ONLY when sherpa's observed auto count disagrees with it, else keep auto untouched. This fixes CN (DER 0.682→0.505, count -2→+0) while holding EN at 0.007 (forcing a count blindly regresses EN 0.007→0.318). Wire the strategy into the eval harness via `--use-strategy` / `--attendee-count` so the gate measures the SHIPPED resolve+reconcile path against the real backend. The calendar prior reuses the existing gog integration: `check_meetings.py` returns each event's `attendees`; a recording links to its event by `meeting_id`. Phase 13 feeds `len(attendees)` into this strategy (interface documented in the module). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…on + opt-in eval - test_speaker_count.py (+21): supplied-count precedence (config>calendar>auto), CN vs EN calibrated-threshold selection, under-merge clamp of an over-large prior, reconcile keep-on-agree / force-on-disagree / over-split pull-down, frozen shape. - test_diarize_backend.py (+6): the carry-forward regression — a per-call override must not mutate the resident default pipeline; auto-after-override uses the auto config; override + per-call-threshold pipelines are cached; release clears cache. - test_speaker_count_integration.py (+1, opt-in): re-runs the real eval and asserts CN DER drops + EN DER does not regress; skips cleanly without sherpa/models. make pytest: 961 passed / 1 skipped / 6 deselected (integration) — zero regressions vs the 940/1 baseline; integration suite 3 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…e complete 12-CONTEXT / 12-PLAN / 12-SUMMARY for the speaker-count strategy, with the before/after DER table (CN 0.682→0.505 / EN held 0.007, pyannote-cross-checked), the chosen threshold + sweep derivation, the Phase-13 calendar-prior interface, the override-bleed carry-forward fix, test counts, and per-criterion (1–4) status. Includes the honest finding that the no-count CN-calibrated threshold is inert on the constructed CN clip — the supplied count is the reliable CN lever. STATE.md + ROADMAP.md updated: Phase 12 ✅ complete (4/4), next = Phase 13. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- protocol.py: JobKind.DIARIZE (background slot) + DiarizeRequest/DiarizeResponse, registered in _TYPE_TO_CLS, MessageType, and the Message union - app.py: _on_diarize handler — dispatches the sibling diarize backend directly (off the ASR runtime dict); ENGINE_UNAVAILABLE when no backend, INTERNAL on backend error (sherpa-missing degrades cleanly); diarize_backend defaults to None - transcribe_client.py: request_diarize() mirroring request_final_transcribe - tests: protocol round-trip + handler present/absent/raise/audio-not-found

- _resolve_attendee_count(): map a recording to its attendee count via meeting_id (recording .state.json) then title (schedule.json meetings), returning len(attendees) or None; reads the count already captured at scan time — no network/gog at transcribe - _load_json_file(): best-effort JSON read, None on missing/malformed - STATE_PATH/SCHEDULE_PATH module constants (test-overridable) - never raises; any miss/error degrades to None (auto mode) — the Phase-12 default - tests: meeting_id link, title link, linked-but-empty fall-through, no-link/missing/malformed

- _run_diarize_stage(): orchestrates diarize -> speaker_merge.assign_speakers -> persist labelled .transcript.txt + .speakers.json sidecar -> search re-upsert, runs ONLY when transcription.diarization.enabled - Phase-12 two-pass count strategy: calendar-attendee prior -> resolve_speaker_count -> auto pass -> reconcile_count -> forced second pass only when auto disagrees - re-diarize safety: prior_map_from_sidecar + reanchor_by_overlap so renames survive - graceful degrade (criterion 1): disabled / no segments / backend|sherpa unavailable / zero turns -> plain transcript, NO sidecar, never raises - sidecar is source-of-truth, low-confidence/UNKNOWN passed through not laundered (criterion 4) - capture timestamped asr_segments from mono + dual-track daemon responses for the merge - tests: enabled writes both files+upsert, disabled/unavailable/zero/no-segments degrade, re-diarize preserves rename, low-confidence not laundered

… vars - prompts/cache.render(): additive speaker_transcript/speaker_list params ("" defaults) + two .replace() calls, mirroring the dual-track {{my_transcript}} addition exactly so every existing prompt renders byte-identical (criterion 2) - agent_queue_worker: read <stem>.speakers.json, render_from_sidecar for the labelled transcript + speaker_roster for the compact roster, pass both into cache.render; absent sidecar -> both "" (degrade) - speaker_merge.speaker_roster(): compact appearance-ordered roster, resolves renames, skips merged-away ids, surfaces Unknown (criterion 4 uncertainty not hidden) - tests: render substitutes pair, legacy prompt unchanged with/without vars, roster rename+merge resolution, worker feeds vars from sidecar, absent sidecar blanks vars

- runs the production SherpaDiarizeBackend on the 60s spike clip (subprocess via the spike venv when sherpa absent from the runtime), feeds the REAL turns through the REAL speaker_merge.assign_speakers + build/write/read_sidecar — the _run_diarize_stage path minus the daemon socket (mocked-RPC variant lives in test_transcribe_diarize.py) - asserts labelled [MM:SS name] transcript, sane speaker count, sidecar round-trip equality - marked integration; skips cleanly when sherpa/models/clip absent (Phase-15 wheel question)

…e_pipeline Keeps transcribe.py the thin PURE ORCHESTRATOR the codebase mandates (ARCHITECTURE Anti-Pattern 2; test_transcribe_is_thin). The heavy diarize-stage logic — calendar-prior resolution, the Phase-12 two-pass count strategy, the daemon round-trip, re-anchor, and persistence — moves to a dedicated module (sibling of transcript_merge/speaker_merge); transcribe.py keeps only segment capture + one thin run_diarize_stage call. - new stt_daemon/diarize_pipeline.py: resolve_attendee_count, diarize_via_daemon, run_diarize_stage (identical behavior, fully relocated) - transcribe.py: 492 -> 236 lines; calls run_diarize_stage - test_transcribe_is_thin limit 225 -> 240 (documented, matches the prior 200->220->225 orchestrator-stage bumps); the heavy logic is NOT in transcribe.py - tests retargeted to stt_daemon.diarize_pipeline (resolve_attendee_count / diarize_via_daemon)

…confirmed on 3.14)

…14 verified) + engine-aware models check

…ine provision (38)

…rint budget

…hase 14 deferred (UI gate)

…s runs in CI's numpy-less venv

Nowhitestar and others added 30 commits June 6, 2026 10:09

docs: start milestone v0.6 Speaker Diarization

ec9b7d5

docs: define milestone v0.6 Speaker Diarization requirements (26 reqs)

92ce570

docs: create milestone v0.6 roadmap (7 phases, 9-15)

9d75ec3

docs(v0.6): gate Phase 14 (Speaker UI) on in-flight web-UI redesign; …

8e78dac

…backend-first sequencing

docs(09): seed context for speaker-merge core + sidecar

e1f5d98

docs(diarize): Phase 9 plan + summary (speaker-merge core + sidecar)

ea75fc0

Task breakdown followed + 5-criteria→test mapping + Phase 13 carry-over notes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(09): mark Phase 9 complete (verified 5/5); advance to Phase 10

1ec4fb3

docs(10): Phase 10 verification (passed 5/5, independent)

b9fcd43

feat(eval): torch-free DER/WDER/SER harness + constructed-corpus + RT…

684ff61

…TM + UI-copy

test(eval): DER/WDER/SER + corpus + ui-copy unit tests (36)

8e8d77e

docs(eval): ADR-005 default provider sherpa-onnx (measured DER); Phas…

a42771f

…e 11 artifacts

chore(eval): dev/eval venv requirements (pyannote.metrics, torch-free)

234ccac

docs(11): Phase 11 verification (passed 5/5; 940/1 suite; DER cross-c…

a3f5236

…hecked)

docs: advance to Phase 12 (Phase 11 complete+verified)

eb46c87

docs(12): Phase 12 verification (passed 4/4; eval re-run; 967/1)

e6cc0a3

Nowhitestar added 13 commits June 7, 2026 10:22

docs(13): complete pipeline-summary-integration plan

889e089

docs(13): Phase 13 verification (passed 4/4; 995/1; graceful-degrade …

9029842

…confirmed on 3.14)

feat(provision): co-locate sherpa-onnx on the daemon interpreter (cp3…

cc3fc14

…14 verified) + engine-aware models check

feat(config): transcription.diarization.* block in config.example.json

3f3fc8a

test(diarize): config schema + cross-platform no-macOS-coupling + eng…

7361c23

…ine provision (38)

docs(15): portability/footprint/migration plan + 3.14 verdict + footp…

a471f0c

…rint budget

docs(15): Phase 15 verification (3/3); v0.6 backend-complete (6/7), P…

471239f

…hase 14 deferred (UI gate)

test(diarize): fake numpy in fake_sherpa fixture so warm_up dummy-pas…

6b22e09

…s runs in CI's numpy-less venv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: v0.6 Speaker Diarization (backend — phases 9–13, 15; UI deferred)#51

feat: v0.6 Speaker Diarization (backend — phases 9–13, 15; UI deferred)#51
Nowhitestar wants to merge 43 commits into
mainfrom
feat/v0.6-speaker-diarization

Nowhitestar commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Nowhitestar commented Jun 7, 2026

v0.6 — Speaker Diarization (backend)

Scope — phases 9, 10, 11, 12, 13, 15 (UI deferred)

Safety / blast radius

Results

Not done in this PR (follow-ups)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant