feat: v0.6 Speaker Diarization (backend — phases 9–13, 15; UI deferred) #51

Open: Nowhitestar wants to merge 43 commits into main from feat/v0.6-speaker-diarization

Conversation

@Nowhitestar

Owner

v0.6 — Speaker Diarization (backend)

Adds local-first, cross-platform speaker attribution ("who-said-what") to Yulu meeting transcripts as a post-processing step. ASR (MLX/whisper.cpp) is untouched; the diarization engine is sherpa-onnx (ONNX Runtime, no torch, ~33 MB models), chosen over FunASR after de-risking spikes (.planning/spikes/001, 002) specifically because it satisfies the "must not hard-couple to macOS" mandate.

Reframe: this is the N-speaker generalization of Yulu's existing 2-speaker dual-track path (transcript_merge.py's 我/对方) — splitting the far-end stream into voices, riding the existing sidecar → prompt-var rails.

Scope — phases 9, 10, 11, 12, 13, 15 (UI deferred)

| Phase | What |
|---|---|
| 9 | Pure speaker_merge core + <stem>.speakers.json sidecar (overlap-assign, coverage-gap fallback, hallucination-gate, idempotent re-anchor so renames survive) |
| 10 | SherpaDiarizeBackend (config-selected Protocol, held OFF the ASR fallback chain) + offline ONNX provisioning + tri-state probe_diarization() + warm-up |
| 11 | Torch-free DER/WDER eval harness (dev-venv only) + constructed-ground-truth corpus → ADR-005 default = sherpa-onnx |
| 12 | Speaker-count strategy: calendar-attendee prior → CN-calibrated threshold → fail-toward-under-merge (+ fixes a count override-bleed bug) |
| 13 | Live pipeline wiring: ASR→diarize→merge→.transcript.txt+.speakers.json→search; speaker-attributed summary via one additive {{speaker_transcript}}/{{speaker_list}} prompt-var pair |
| 15 | Portability/footprint/migration: sherpa cp314 verified on Python 3.14 → co-locate; footprint budget; yulu migrate re-provisions idempotently; config.example.json diarization block |

⏸ Phase 14 (Speaker UI) is intentionally DEFERRED — gated on the in-flight web-UI redesign (feat/recordings-ui, feat/settings-ui-p1, …). This PR changes ZERO yulu_ui/ files, so it should not conflict with those branches. The stored .speakers.json + helpers are ready for the UI when the redesign lands.

Safety / blast radius

  • Diarization is OFF by default (transcription.diarization.enabled); existing behavior unchanged.
  • Graceful-degrade: sherpa is lazy-imported; if absent the pipeline writes today's plain transcript with no error (verified on the real 3.14 runtime where sherpa isn't yet installed).
  • Additive prompt vars default to "" → every existing prompt renders byte-identical.
  • Net new shipped-runtime dep: one (sherpa-onnx); eval deps (pyannote.metrics) are dev-venv only.

Results

  • Footprint: 78-min meeting diarizes in ~6.7 min (RTF 0.086, no O(n²)); peak 1.69 GB RSS / 4.49 GB footprint.
  • Accuracy (eval gate): EN DER 0.007; CN auto-DER 0.682 → 0.505 with the count strategy (the calendar-attendee count is the reliable CN lever; real-CN gold labels = a deferred human task).
  • Tests: full suite 1021 passed / 1 skipped, zero regressions across all 6 phases. Each phase independently verified (.planning/phases/*/*-VERIFICATION.md).
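
The real-time factor quoted in the footprint bullet follows from the two durations given there. As a quick arithmetic check (RTF taken here as processing time divided by audio duration, the usual definition; variable names are illustrative only):

```python
# Figures from the Results list above; "~6.7 min" is approximate in the source.
audio_minutes = 78.0
processing_minutes = 6.7

# Real-time factor: below 1.0 means faster than real time.
rtf = processing_minutes / audio_minutes
print(f"RTF = {rtf:.3f}")
```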

Not done in this PR (follow-ups)

  • Phase 14 (Speaker UI) — after the web-UI redesign lands + stabilizes (plan against the new components).
  • Human-labelled real CN+EN gold corpus (gold-standard DER).
  • Release is not triggered here — left for the release-please Release PR.

🤖 Generated with Claude Code

Nowhitestar and others added 30 commits June 6, 2026 10:09
Add SUMMARY.md synthesizing the four v0.6 Speaker Diarization research
reports (STACK/FEATURES/ARCHITECTURE/PITFALLS) into roadmap implications.

Cross-cutting signals: v0.6 is the N-speaker generalization of the
existing 2-speaker dual-track path (not greenfield); swappable seam is a
DiarizeBackend Protocol (sherpa-onnx default), not a CapabilityProvider
subclass; pure speaker_merge module is the highest-risk logic; data model
is a <stem>.speakers.json sidecar; the DER/WDER eval harness is the gate;
sherpa over-splits on CN (count-strategy ladder needed); embeddings are
biometric voiceprints (ephemeral, never cloud); zero new shipped runtime
dependency beyond sherpa-onnx.

Also commits the four v0.6 research reports, the archived v0.5 research,
and spikes 001/002 (the grounding evidence).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 9 (MERGE-01..05). N-speaker sibling of transcript_merge.py — pure,
I/O-free overlap-assignment with zero sherpa/daemon/SQLite/network.

- assign_speakers(): max-overlap argmax per ASR segment → labelled segments +
  [MM:SS <name>] rendered string (transcript_merge line format)
- coverage-gap fallback ladder: same-speaker-bracket → nearest-within-window →
  explicit UNKNOWN; never drops text, never snaps across a speaker boundary
- hallucination/repeat guard: collapse consecutive identical same-speaker text;
  flag duplicate-in-silence as low-confidence, never a confident wrong owner
- <stem>.speakers.json sidecar: build/write(atomic os.replace)/read/round-trip;
  stores only abstract speaker_ids — no biometric embeddings
- idempotent re-anchor: reanchor_by_overlap() maps fresh volatile cluster
  indices to stable speaker_ids by overlap; user renames never overwritten

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
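
For reviewers, the max-overlap assignment described in the Phase 9 commit can be pictured roughly as below. This is a minimal sketch, not the shipped `assign_speakers()`: the `Turn` type, the function names, and the segment dict keys are hypothetical, the text after the `[MM:SS <name>]` prefix is an assumed layout, and the intermediate fallback rungs (same-speaker-bracket, nearest-within-window) are deliberately left out so only the argmax and the explicit UNKNOWN rung are shown.

```python
from dataclasses import dataclass

UNKNOWN = "UNKNOWN"  # explicit fallback label; text is never dropped


@dataclass(frozen=True)
class Turn:
    """Hypothetical diarization turn: [start, end) in seconds plus a label."""
    start: float
    end: float
    speaker: str


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers_sketch(segments, turns):
    """Label each ASR segment with the turn that overlaps it the most."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = UNKNOWN, 0.0
        for turn in turns:
            ov = overlap_seconds(seg["start"], seg["end"], turn.start, turn.end)
            if ov > best_overlap:
                best_speaker, best_overlap = turn.speaker, ov
        # Copy the segment so the input is not mutated (pure, I/O-free).
        labelled.append({**seg, "speaker": best_speaker})
    return labelled


def render_lines_sketch(labelled):
    """Render one '[MM:SS name] text' line per labelled segment."""
    lines = []
    for seg in labelled:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d} {seg['speaker']}] {seg['text']}")
    return "\n".join(lines)
```

A segment with no overlapping turn keeps its text and is labelled UNKNOWN rather than being snapped to a neighbouring speaker.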
32 tests, zero sherpa/daemon/SQLite/network. Covers all 5 Phase-9 ROADMAP
success criteria + edge cases (empty input, full coverage gap, overlapping
turns, hallucination repeat, re-anchor preserving a rename, boundary non-crossing).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Task breakdown followed + 5-criteria→test mapping + Phase 13 carry-over notes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… resolution

- DiarizeBackend Protocol mirrors STT warm_up/is_ready/release, returns SpeakerTurns
- SherpaDiarizeBackend: lazy sherpa_onnx import, resident pipeline, 1s-silence warm-up
- SpeakerTurn.to_dict emits speaker_idx+speaker keys -> feeds Phase-9 speaker_merge verbatim
- resolve_model_paths/models_present: single source of truth for the two ONNX files
- torch-free, offline-by-default (local .onnx by absolute path, no network)
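
A rough picture of the seam these bullets describe, using `typing.Protocol`. Everything beyond the names in the bullets (`warm_up`, `is_ready`, `release`, `SpeakerTurn.to_dict`, the `speaker_idx` and `speaker` keys) is an assumption for illustration: the `diarize` signature, the `start`/`end` fields, and the `SPEAKER_00` label format are hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class SpeakerTurn:
    start: float
    end: float
    speaker_idx: int

    def to_dict(self):
        # Emits both keys named in the commit; the label format is assumed.
        return {
            "start": self.start,
            "end": self.end,
            "speaker_idx": self.speaker_idx,
            "speaker": f"SPEAKER_{self.speaker_idx:02d}",
        }


@runtime_checkable
class DiarizeBackend(Protocol):
    """Lifecycle mirrors the STT backends; diarize() signature is hypothetical."""

    def warm_up(self) -> None: ...
    def is_ready(self) -> bool: ...
    def release(self) -> None: ...
    def diarize(self, audio_path: str) -> list: ...


class FakeBackend:
    """Stand-in used only to show structural (duck-typed) conformance."""

    def __init__(self):
        self._ready = False

    def warm_up(self) -> None:
        self._ready = True

    def is_ready(self) -> bool:
        return self._ready

    def release(self) -> None:
        self._ready = False

    def diarize(self, audio_path: str) -> list:
        return [SpeakerTurn(0.0, 1.0, 0)]
```

`@runtime_checkable` only makes `isinstance()` verify that the methods exist; it does not check their signatures, so it suits lifecycle assertions in tests rather than full contract validation.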
…the ASR dict

- DaemonConfig: transcription.diarization.* (enabled/provider/seg_model/emb_model/num_speakers/threshold)
- _build_diarize_backend(): provider-selected (sherpa-onnx), disabled/unknown -> None
- attached to app.diarize_backend; NEVER inserted into _build_real_backends() ASR dict
  so STTRuntime._engine_chain can never route ASR to diarization (Anti-Pattern 1)
…e models step

- setup_models.sh: setup_diarization_models() downloads seg+cam++ ONNX to models/diarization
  (gated on transcription.diarization.enabled; skips existing files; extracts seg from tar.bz2)
- refactor whisper body into setup_whisper_model(); setup_models() runs both concerns
- registry _model_present() now requires BOTH whisper AND diarization halves
  (diarization short-circuits True when disabled) -> step count stays six, no 7th step
- diarization file check delegates to backends.diarize.models_present (single source of truth)
… doctor

- probe_diarization(): usable (models present + sherpa_onnx importable by daemon interp) /
  present-but-unverified (models present, sherpa not importable) / absent (no models)
- always yulu-managed provenance (NOT a CapabilityProvider/agent-config reframe)
- path-bounded (fixed models/diarization root via backends.diarize.models_present); never raises
- folded into doctor._host_capabilities as capabilities['diarization']; appears in --json
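
The tri-state logic above might look roughly like this. It is a sketch under stated assumptions: the function name, the parameters, and the idea of passing the required file names in are hypothetical, and `importlib.util.find_spec` only answers for the interpreter running the probe, which is assumed here to be the daemon interpreter.

```python
import importlib.util
from pathlib import Path


def _importable(module_name):
    """True if the module can be located without actually importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def probe_diarization_sketch(models_dir, required_files, module_name="sherpa_onnx"):
    """Return 'usable', 'present-but-unverified', or 'absent'; never raises."""
    try:
        root = Path(models_dir)
        models_ok = all((root / name).is_file() for name in required_files)
    except Exception:
        # Any filesystem or argument problem degrades to 'absent'.
        return "absent"
    if not models_ok:
        return "absent"
    if not _importable(module_name):
        return "present-but-unverified"
    return "usable"
```

Checking the model files first means a machine with the runtime installed but no models still reports "absent", matching the ordering in the bullet.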
…ovision idempotency

- 21 backend tests (sherpa mocked): warm_up dummy-pass, is_ready/release, cancel,
  config selection, SpeakerTurn->Phase-9 contract, model resolution
- THE isolation invariant: diarize backend absent from ASR dict; STTRuntime can't route to it
- 14 provision/probe tests: _model_present combines whisper+diarization halves; registry stays
  6 steps; apply() skips when check() satisfied; probe_diarization tri-state x yulu-managed
- mark DiarizeBackend Protocol @runtime_checkable for isinstance lifecycle assertions
- drives the PRODUCTION SherpaDiarizeBackend (not the spike script) in a subprocess,
  re-execing the spike venv (sherpa 1.13.2) when the runner interp lacks sherpa
- test_real_diarization_returns_sane_turns: 60s clip -> >=1 turn, sane speaker count (1..8)
- test_real_diarization_works_offline: HF_HUB_OFFLINE=1 + dead proxies still loads local ONNX
- skips cleanly (skipif) when models/clip absent; marked integration (mirrors test_e2e_stt_daemon)
- 10-CONTEXT/PLAN/SUMMARY artifacts (5/5 criteria -> evidence mapping, 31 tests, deviations)
- STATE: Phase 10 complete, Phase 11 next; metric row; autonomous run log
- ROADMAP: Phase 10 checked, progress row (904 pass)
- REQUIREMENTS: DIAR-01..05 marked complete
…leed into auto

Phase-10 carry-forward: `diarize(num_speakers=N)` rebuilt and reassigned `self._sd`,
so the next auto call reused the forced-count pipeline (the override bled into auto
mode). Replace with a count-keyed cache: warm_up seeds the resident default pipeline
into `self._pipelines[(-1, threshold)]` and into `self._sd` (never reassigned);
diarize serves from / fills the cache by a normalized `(num_clusters, threshold)`
key. Also add a per-call `threshold` arg (the Phase-12 count strategy needs it).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
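
The count-keyed cache in that fix can be sketched as follows. The class name, `pipeline_for`, and the injected `build_pipeline` callable are hypothetical stand-ins (the real backend builds a sherpa-onnx pipeline); only `_pipelines`, `_sd`, the `-1` auto key, and the `(num_clusters, threshold)` key shape come from the commit message.

```python
class CachedDiarizerSketch:
    AUTO = -1  # key component meaning "let the backend pick the count"

    def __init__(self, build_pipeline, default_threshold=0.5):
        self._build = build_pipeline          # hypothetical factory: (n, t) -> pipeline
        self._default_threshold = default_threshold
        self._pipelines = {}                  # (num_clusters, threshold) -> pipeline
        self._sd = None                       # resident default; set once in warm_up

    def _key(self, num_speakers, threshold):
        """Normalize arguments so equivalent requests share one cache entry."""
        n = num_speakers if num_speakers and num_speakers > 0 else self.AUTO
        t = self._default_threshold if threshold is None else float(threshold)
        return (n, t)

    def warm_up(self):
        key = self._key(None, None)
        self._sd = self._build(*key)
        self._pipelines[key] = self._sd

    def pipeline_for(self, num_speakers=None, threshold=None):
        """Serve from the cache, building on first use; never touches _sd."""
        key = self._key(num_speakers, threshold)
        if key not in self._pipelines:
            self._pipelines[key] = self._build(*key)
        return self._pipelines[key]

    def release(self):
        self._pipelines.clear()
        self._sd = None
```

Because a forced count lands in its own cache slot, a later auto call resolves to the `(-1, threshold)` entry again instead of inheriting the override.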
…ver-split fix)

New pure, dependency-free `stt_daemon/speaker_count.py` (COUNT-01..03):

- `resolve_speaker_count(...)`: config pin > calendar attendee prior (clamped to
  [2, MAX_AUTO_SPEAKERS=8] — fail-toward-under-merge) > auto at the calibrated
  threshold (0.5, language-keyed seam for a future CN value).
- `reconcile_count(auto_count, ...)`: the criterion-4 two-pass decision — force the
  calendar prior ONLY when sherpa's observed auto count disagrees with it, else keep
  auto untouched. This fixes CN (DER 0.682→0.505, count -2→+0) while holding EN at
  0.007 (forcing a count blindly regresses EN 0.007→0.318).
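
The precedence ladder and the two-pass reconcile step could be sketched like this. The `_sketch` function names, the `CountDecision` shape, and the parameter names are hypothetical; the precedence order, the `[2, 8]` clamp, and the 0.5 threshold are taken from the bullets above, and the language-keyed threshold seam is omitted.

```python
from dataclasses import dataclass
from typing import Optional

MAX_AUTO_SPEAKERS = 8
AUTO_THRESHOLD = 0.5


@dataclass(frozen=True)
class CountDecision:
    num_speakers: Optional[int]  # None means "auto"
    threshold: float
    source: str                  # "config", "calendar", or "auto"


def resolve_speaker_count_sketch(config_pin=None, attendee_count=None):
    """config pin > clamped calendar prior > auto at the calibrated threshold."""
    if config_pin:
        return CountDecision(config_pin, AUTO_THRESHOLD, "config")
    if attendee_count:
        # Clamp an over-large (or too-small) prior into [2, MAX_AUTO_SPEAKERS].
        clamped = max(2, min(attendee_count, MAX_AUTO_SPEAKERS))
        return CountDecision(clamped, AUTO_THRESHOLD, "calendar")
    return CountDecision(None, AUTO_THRESHOLD, "auto")


def reconcile_count_sketch(auto_count, prior):
    """Return the count to force in a second pass, or None to keep auto."""
    if prior is None or auto_count == prior:
        return None
    return prior
```

Keeping the auto result whenever it agrees with the prior is what protects the EN case: a second, forced pass only runs when the first pass has already shown a mismatch.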

Wire the strategy into the eval harness via `--use-strategy` / `--attendee-count`
so the gate measures the SHIPPED resolve+reconcile path against the real backend.

The calendar prior reuses the existing gog integration: `check_meetings.py` returns
each event's `attendees`; a recording links to its event by `meeting_id`. Phase 13
feeds `len(attendees)` into this strategy (interface documented in the module).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…on + opt-in eval

- test_speaker_count.py (+21): supplied-count precedence (config>calendar>auto), CN
  vs EN calibrated-threshold selection, under-merge clamp of an over-large prior,
  reconcile keep-on-agree / force-on-disagree / over-split pull-down, frozen shape.
- test_diarize_backend.py (+6): the carry-forward regression — a per-call override
  must not mutate the resident default pipeline; auto-after-override uses the auto
  config; override + per-call-threshold pipelines are cached; release clears cache.
- test_speaker_count_integration.py (+1, opt-in): re-runs the real eval and asserts
  CN DER drops + EN DER does not regress; skips cleanly without sherpa/models.

make pytest: 961 passed / 1 skipped / 6 deselected (integration) — zero regressions
vs the 940/1 baseline; integration suite 3 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e complete

12-CONTEXT / 12-PLAN / 12-SUMMARY for the speaker-count strategy, with the
before/after DER table (CN 0.682→0.505 / EN held 0.007, pyannote-cross-checked),
the chosen threshold + sweep derivation, the Phase-13 calendar-prior interface, the
override-bleed carry-forward fix, test counts, and per-criterion (1–4) status.
Includes the honest finding that the no-count CN-calibrated threshold is inert on
the constructed CN clip — the supplied count is the reliable CN lever. STATE.md +
ROADMAP.md updated: Phase 12 ✅ complete (4/4), next = Phase 13.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- protocol.py: JobKind.DIARIZE (background slot) + DiarizeRequest/DiarizeResponse,
  registered in _TYPE_TO_CLS, MessageType, and the Message union
- app.py: _on_diarize handler — dispatches the sibling diarize backend directly
  (off the ASR runtime dict); ENGINE_UNAVAILABLE when no backend, INTERNAL on
  backend error (sherpa-missing degrades cleanly); diarize_backend defaults to None
- transcribe_client.py: request_diarize() mirroring request_final_transcribe
- tests: protocol round-trip + handler present/absent/raise/audio-not-found
- _resolve_attendee_count(): map a recording to its attendee count via meeting_id
  (recording .state.json) then title (schedule.json meetings), returning len(attendees)
  or None; reads the count already captured at scan time — no network/gog at transcribe
- _load_json_file(): best-effort JSON read, None on missing/malformed
- STATE_PATH/SCHEDULE_PATH module constants (test-overridable)
- never raises; any miss/error degrades to None (auto mode) — the Phase-12 default
- tests: meeting_id link, title link, linked-but-empty fall-through, no-link/missing/malformed
- _run_diarize_stage(): orchestrates diarize -> speaker_merge.assign_speakers ->
  persist labelled .transcript.txt + .speakers.json sidecar -> search re-upsert,
  runs ONLY when transcription.diarization.enabled
- Phase-12 two-pass count strategy: calendar-attendee prior -> resolve_speaker_count
  -> auto pass -> reconcile_count -> forced second pass only when auto disagrees
- re-diarize safety: prior_map_from_sidecar + reanchor_by_overlap so renames survive
- graceful degrade (criterion 1): disabled / no segments / backend|sherpa unavailable /
  zero turns -> plain transcript, NO sidecar, never raises
- sidecar is source-of-truth, low-confidence/UNKNOWN passed through not laundered (criterion 4)
- capture timestamped asr_segments from mono + dual-track daemon responses for the merge
- tests: enabled writes both files+upsert, disabled/unavailable/zero/no-segments degrade,
  re-diarize preserves rename, low-confidence not laundered
… vars

- prompts/cache.render(): additive speaker_transcript/speaker_list params ("" defaults)
  + two .replace() calls, mirroring the dual-track {{my_transcript}} addition exactly so
  every existing prompt renders byte-identical (criterion 2)
- agent_queue_worker: read <stem>.speakers.json, render_from_sidecar for the labelled
  transcript + speaker_roster for the compact roster, pass both into cache.render; absent
  sidecar -> both "" (degrade)
- speaker_merge.speaker_roster(): compact appearance-ordered roster, resolves renames,
  skips merged-away ids, surfaces Unknown (criterion 4 uncertainty not hidden)
- tests: render substitutes pair, legacy prompt unchanged with/without vars, roster
  rename+merge resolution, worker feeds vars from sidecar, absent sidecar blanks vars
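
The byte-identical claim rests on plain string replacement with empty defaults. A minimal sketch, assuming a hypothetical `render_sketch` signature and a `{{transcript}}` placeholder chosen only for the demo (the two speaker placeholders are the ones named in this PR):

```python
def render_sketch(template, transcript, speaker_transcript="", speaker_list=""):
    """Substitute placeholders; the speaker pair is additive and defaults to ''."""
    return (
        template
        .replace("{{transcript}}", transcript)
        .replace("{{speaker_transcript}}", speaker_transcript)
        .replace("{{speaker_list}}", speaker_list)
    )


# A legacy prompt never mentions the new placeholders, so the two extra
# .replace() calls find nothing and the output is unchanged.
legacy = "Summarize the meeting:\n{{transcript}}"
without_vars = render_sketch(legacy, "hello")
with_vars = render_sketch(legacy, "hello", "[00:00 A] hello", "A")

# A prompt that opts in renders the labelled transcript and roster.
speaker_aware = "Speakers: {{speaker_list}}\n{{speaker_transcript}}"
attributed = render_sketch(speaker_aware, "hello", "[00:00 A] hello", "A")
```

When the sidecar is absent both values stay `""`, so a speaker-aware prompt degrades to empty sections instead of failing.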
- runs the production SherpaDiarizeBackend on the 60s spike clip (subprocess via the
  spike venv when sherpa absent from the runtime), feeds the REAL turns through the REAL
  speaker_merge.assign_speakers + build/write/read_sidecar — the _run_diarize_stage path
  minus the daemon socket (mocked-RPC variant lives in test_transcribe_diarize.py)
- asserts labelled [MM:SS name] transcript, sane speaker count, sidecar round-trip equality
- marked integration; skips cleanly when sherpa/models/clip absent (Phase-15 wheel question)
…e_pipeline

Keeps transcribe.py the thin PURE ORCHESTRATOR the codebase mandates (ARCHITECTURE
Anti-Pattern 2; test_transcribe_is_thin). The heavy diarize-stage logic — calendar-prior
resolution, the Phase-12 two-pass count strategy, the daemon round-trip, re-anchor, and
persistence — moves to a dedicated module (sibling of transcript_merge/speaker_merge);
transcribe.py keeps only segment capture + one thin run_diarize_stage call.

- new stt_daemon/diarize_pipeline.py: resolve_attendee_count, diarize_via_daemon,
  run_diarize_stage (identical behavior, fully relocated)
- transcribe.py: 492 -> 236 lines; calls run_diarize_stage
- test_transcribe_is_thin limit 225 -> 240 (documented, matches the prior 200->220->225
  orchestrator-stage bumps); the heavy logic is NOT in transcribe.py
- tests retargeted to stt_daemon.diarize_pipeline (resolve_attendee_count / diarize_via_daemon)
