feat: v0.6 Speaker Diarization (backend — phases 9–13, 15; UI deferred)#51
Open
Nowhitestar wants to merge 43 commits into
Add SUMMARY.md synthesizing the four v0.6 Speaker Diarization research reports (STACK/FEATURES/ARCHITECTURE/PITFALLS) into roadmap implications. Cross-cutting signals: v0.6 is the N-speaker generalization of the existing 2-speaker dual-track path (not greenfield); swappable seam is a DiarizeBackend Protocol (sherpa-onnx default), not a CapabilityProvider subclass; pure speaker_merge module is the highest-risk logic; data model is a <stem>.speakers.json sidecar; the DER/WDER eval harness is the gate; sherpa over-splits on CN (count-strategy ladder needed); embeddings are biometric voiceprints (ephemeral, never cloud); zero new shipped runtime dependency beyond sherpa-onnx. Also commits the four v0.6 research reports, the archived v0.5 research, and spikes 001/002 (the grounding evidence). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…backend-first sequencing
Phase 9 (MERGE-01..05). N-speaker sibling of transcript_merge.py: pure, I/O-free overlap-assignment with zero sherpa/daemon/SQLite/network.
- assign_speakers(): max-overlap argmax per ASR segment → labelled segments + [MM:SS <name>] rendered string (transcript_merge line format)
- coverage-gap fallback ladder: same-speaker-bracket → nearest-within-window → explicit UNKNOWN; never drops text, never snaps across a speaker boundary
- hallucination/repeat guard: collapse consecutive identical same-speaker text; flag duplicate-in-silence as low-confidence, never a confident wrong owner
- <stem>.speakers.json sidecar: build/write(atomic os.replace)/read/round-trip; stores only abstract speaker_ids, no biometric embeddings
- idempotent re-anchor: reanchor_by_overlap() maps fresh volatile cluster indices to stable speaker_ids by overlap; user renames never overwritten
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
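The max-overlap assignment described in this commit can be sketched as below. This is a minimal illustration, not the code in speaker_merge: the `Turn` dataclass, the `assign_speaker` and `render_line` helpers, and the choice to append the segment text after the `[MM:SS <name>]` prefix are all assumptions made for the example. The coverage-gap ladder is reduced to its last rung (explicit UNKNOWN).

```python
from dataclasses import dataclass

UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class Turn:
    """One diarization turn (hypothetical shape for this sketch)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(seg_start, seg_end, turns):
    """Return the speaker whose turns overlap the ASR segment the most.

    Overlap is summed per speaker, so several short turns by the same
    speaker count together. With no overlap at all the segment is
    labelled UNKNOWN; its text is never dropped.
    """
    totals = {}
    for turn in turns:
        ov = overlap(seg_start, seg_end, turn.start, turn.end)
        if ov > 0:
            totals[turn.speaker] = totals.get(turn.speaker, 0.0) + ov
    if not totals:
        return UNKNOWN
    return max(totals, key=totals.get)


def render_line(seg_start, name, text):
    """Render one segment with an [MM:SS name] prefix (text placement assumed)."""
    minutes, seconds = divmod(int(seg_start), 60)
    return f"[{minutes:02d}:{seconds:02d} {name}] {text}"


turns = [Turn(0.0, 5.0, "S1"), Turn(5.0, 10.0, "S2")]
owner = assign_speaker(4.0, 8.0, turns)  # 1 s with S1, 3 s with S2
line = render_line(65.2, owner, "hello")
```

Summing overlap per speaker before taking the argmax keeps a segment with one long turn from being outvoted by turn count alone.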
32 tests, zero sherpa/daemon/SQLite/network. Covers all 5 Phase-9 ROADMAP success criteria + edge cases (empty input, full coverage gap, overlapping turns, hallucination repeat, re-anchor preserving a rename, boundary non-crossing). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Task breakdown followed + 5-criteria→test mapping + Phase 13 carry-over notes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… resolution
- DiarizeBackend Protocol mirrors STT warm_up/is_ready/release, returns SpeakerTurns
- SherpaDiarizeBackend: lazy sherpa_onnx import, resident pipeline, 1s-silence warm-up
- SpeakerTurn.to_dict emits speaker_idx+speaker keys -> feeds Phase-9 speaker_merge verbatim
- resolve_model_paths/models_present: single source of truth for the two ONNX files
- torch-free, offline-by-default (local .onnx by absolute path, no network)
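A structural Protocol with this lifecycle might look roughly like the sketch below. The method names `warm_up`, `is_ready`, `release` and the `speaker_idx`/`speaker` keys come from the commit message; the `diarize` signature, the field names `start`/`end`, the `speaker_N` label format, and the `FakeBackend` stand-in are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class SpeakerTurn:
    start: float
    end: float
    speaker_idx: int

    def to_dict(self):
        # "speaker_idx" and "speaker" are the keys named in the commit;
        # the label format below is an assumption.
        return {
            "start": self.start,
            "end": self.end,
            "speaker_idx": self.speaker_idx,
            "speaker": f"speaker_{self.speaker_idx}",
        }


@runtime_checkable
class DiarizeBackend(Protocol):
    """Lifecycle mirrors the STT backends; diarize() signature is assumed."""

    def warm_up(self) -> None: ...
    def is_ready(self) -> bool: ...
    def release(self) -> None: ...
    def diarize(self, audio_path: str) -> list[SpeakerTurn]: ...


class FakeBackend:
    """Test double: satisfies the Protocol structurally, no inheritance."""

    def __init__(self):
        self._ready = False

    def warm_up(self) -> None:
        self._ready = True

    def is_ready(self) -> bool:
        return self._ready

    def release(self) -> None:
        self._ready = False

    def diarize(self, audio_path: str) -> list[SpeakerTurn]:
        return [SpeakerTurn(0.0, 1.0, 0)]


backend = FakeBackend()
backend.warm_up()
first = backend.diarize("clip.wav")[0].to_dict()
```

Note that `isinstance` against a `runtime_checkable` Protocol only checks that the methods exist, not their signatures or return types.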
…the ASR dict
- DaemonConfig: transcription.diarization.* (enabled/provider/seg_model/emb_model/num_speakers/threshold)
- _build_diarize_backend(): provider-selected (sherpa-onnx), disabled/unknown -> None
- attached to app.diarize_backend; NEVER inserted into _build_real_backends() ASR dict so STTRuntime._engine_chain can never route ASR to diarization (Anti-Pattern 1)
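The selection rule above (provider-selected, disabled or unknown provider yields `None`, and the result kept out of the ASR dict) can be illustrated with a small sketch. The config keys follow the `transcription.diarization.*` names in the commit; the `build_diarize_backend` signature, the `factories` registry, and the `DummyBackend` class are hypothetical.

```python
def build_diarize_backend(config, factories):
    """Return a diarize backend, or None when disabled or the provider is unknown.

    `factories` maps a provider name to a callable taking the diarization
    config block (a hypothetical registry, introduced for this sketch).
    """
    diar = config.get("transcription", {}).get("diarization", {})
    if not diar.get("enabled", False):
        return None
    factory = factories.get(diar.get("provider"))
    if factory is None:
        return None
    return factory(diar)


class DummyBackend:
    def __init__(self, diar_config):
        self.threshold = diar_config.get("threshold")


factories = {"sherpa-onnx": DummyBackend}
config = {
    "transcription": {
        "diarization": {"enabled": True, "provider": "sherpa-onnx", "threshold": 0.5}
    }
}

# The ASR backends live in their own dict; the diarize backend is held
# as a separate attribute and is never added to it.
asr_backends = {"whisper": object()}
diarize_backend = build_diarize_backend(config, factories)
```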
…e models step
- setup_models.sh: setup_diarization_models() downloads seg+cam++ ONNX to models/diarization (gated on transcription.diarization.enabled; skips existing files; extracts seg from tar.bz2)
- refactor whisper body into setup_whisper_model(); setup_models() runs both concerns
- registry _model_present() now requires BOTH whisper AND diarization halves (diarization short-circuits True when disabled) -> step count stays six, no 7th step
- diarization file check delegates to backends.diarize.models_present (single source of truth)
… doctor
- probe_diarization(): usable (models present + sherpa_onnx importable by daemon interp) / present-but-unverified (models present, sherpa not importable) / absent (no models)
- always yulu-managed provenance (NOT a CapabilityProvider/agent-config reframe)
- path-bounded (fixed models/diarization root via backends.diarize.models_present); never raises
- folded into doctor._host_capabilities as capabilities['diarization']; appears in --json
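The tri-state probe can be pictured as follows. This sketch checks importability in the current interpreter with `importlib.util.find_spec`, whereas the commit says the real probe checks the daemon interpreter; the function signature, the result dict shape, and the file names are likewise assumptions.

```python
import importlib.util
from pathlib import Path


def probe_diarization(model_dir, required_files, module_name="sherpa_onnx"):
    """Classify diarization as usable / present-but-unverified / absent.

    Never raises: any unexpected error degrades to "absent".
    """
    state = "absent"
    try:
        root = Path(model_dir)
        if all((root / name).is_file() for name in required_files):
            # Models are on disk; now check whether the engine can be imported.
            importable = importlib.util.find_spec(module_name) is not None
            state = "usable" if importable else "present-but-unverified"
    except Exception:
        state = "absent"
    return {"state": state, "provenance": "yulu-managed"}
```

Because the model directory is passed in as a fixed root and only named files under it are checked, the probe stays path-bounded in the sense the commit describes.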
…ovision idempotency
- 21 backend tests (sherpa mocked): warm_up dummy-pass, is_ready/release, cancel, config selection, SpeakerTurn->Phase-9 contract, model resolution
- THE isolation invariant: diarize backend absent from ASR dict; STTRuntime can't route to it
- 14 provision/probe tests: _model_present combines whisper+diarization halves; registry stays 6 steps; apply() skips when check() satisfied; probe_diarization tri-state x yulu-managed
- mark DiarizeBackend Protocol @runtime_checkable for isinstance lifecycle assertions
- drives the PRODUCTION SherpaDiarizeBackend (not the spike script) in a subprocess, re-execing the spike venv (sherpa 1.13.2) when the runner interp lacks sherpa
- test_real_diarization_returns_sane_turns: 60s clip -> >=1 turn, sane speaker count (1..8)
- test_real_diarization_works_offline: HF_HUB_OFFLINE=1 + dead proxies still loads local ONNX
- skips cleanly (skipif) when models/clip absent; marked integration (mirrors test_e2e_stt_daemon)
- 10-CONTEXT/PLAN/SUMMARY artifacts (5/5 criteria -> evidence mapping, 31 tests, deviations)
- STATE: Phase 10 complete, Phase 11 next; metric row; autonomous run log
- ROADMAP: Phase 10 checked, progress row (904 pass)
- REQUIREMENTS: DIAR-01..05 marked complete
…leed into auto
Phase-10 carry-forward: `diarize(num_speakers=N)` rebuilt and reassigned `self._sd`, so the next auto call reused the forced-count pipeline (the override bled into auto mode). Replace with a count-keyed cache: warm_up seeds the resident default pipeline into `self._pipelines[(-1, threshold)]` and into `self._sd` (never reassigned); diarize serves from / fills the cache by a normalized `(num_clusters, threshold)` key. Also add a per-call `threshold` arg (the Phase-12 count strategy needs it).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
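The count-keyed cache can be sketched as below. The attribute names `_pipelines` and `_sd` and the `(num_clusters, threshold)` key with `-1` for auto follow the commit message; the `CachedDiarizer` class, the `build_pipeline` callable, and the `pipeline_for` method are hypothetical stand-ins for the real sherpa pipeline construction.

```python
class CachedDiarizer:
    AUTO = -1  # key component meaning "let the engine pick the cluster count"

    def __init__(self, build_pipeline, default_threshold=0.5):
        self._build = build_pipeline          # hypothetical factory: (n, threshold) -> pipeline
        self._default_threshold = default_threshold
        self._pipelines = {}
        self._sd = None                       # resident default pipeline, set once in warm_up

    def _key(self, num_speakers, threshold):
        """Normalize arguments so equivalent requests share one cache entry."""
        n = num_speakers if num_speakers and num_speakers > 0 else self.AUTO
        t = self._default_threshold if threshold is None else float(threshold)
        return (n, t)

    def warm_up(self):
        key = self._key(None, None)
        self._sd = self._build(*key)
        self._pipelines[key] = self._sd

    def pipeline_for(self, num_speakers=None, threshold=None):
        """Serve from the cache, building on a miss; never reassigns self._sd."""
        key = self._key(num_speakers, threshold)
        if key not in self._pipelines:
            self._pipelines[key] = self._build(*key)
        return self._pipelines[key]

    def release(self):
        self._pipelines.clear()
        self._sd = None


built = []


def fake_build(n, threshold):
    built.append((n, threshold))
    return {"n": n, "threshold": threshold}


diarizer = CachedDiarizer(fake_build)
diarizer.warm_up()
default = diarizer.pipeline_for()
forced = diarizer.pipeline_for(num_speakers=3)
after = diarizer.pipeline_for()  # auto call after an override
```

Since the override only adds a cache entry, the auto call that follows gets the original default pipeline back, which is the regression the commit fixes.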
…ver-split fix)
New pure, dependency-free `stt_daemon/speaker_count.py` (COUNT-01..03):
- `resolve_speaker_count(...)`: config pin > calendar attendee prior (clamped to [2, MAX_AUTO_SPEAKERS=8], fail-toward-under-merge) > auto at the calibrated threshold (0.5, language-keyed seam for a future CN value).
- `reconcile_count(auto_count, ...)`: the criterion-4 two-pass decision: force the calendar prior ONLY when sherpa's observed auto count disagrees with it, else keep auto untouched. This fixes CN (DER 0.682→0.505, count -2→+0) while holding EN at 0.007 (forcing a count blindly regresses EN 0.007→0.318).
Wire the strategy into the eval harness via `--use-strategy` / `--attendee-count` so the gate measures the SHIPPED resolve+reconcile path against the real backend.
The calendar prior reuses the existing gog integration: `check_meetings.py` returns each event's `attendees`; a recording links to its event by `meeting_id`. Phase 13 feeds `len(attendees)` into this strategy (interface documented in the module).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
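The precedence ladder and the two-pass decision can be illustrated with a short sketch. The function names, the constant `MAX_AUTO_SPEAKERS = 8`, the clamp range, and the precedence order come from the commit; the parameter names and return shapes here are assumptions, since the real signatures are elided in the message.

```python
MAX_AUTO_SPEAKERS = 8


def resolve_speaker_count(config_pin=None, attendee_count=None):
    """Pick the count to request: config pin > calendar prior > auto.

    Returns (count_or_None, source). None means "run in auto mode".
    The calendar prior is clamped to [2, MAX_AUTO_SPEAKERS].
    """
    if config_pin:
        return config_pin, "config"
    if attendee_count:
        return max(2, min(attendee_count, MAX_AUTO_SPEAKERS)), "calendar"
    return None, "auto"


def reconcile_count(auto_count, prior):
    """Second-pass decision after an auto run.

    Returns the count to force in a second pass, or None to keep the
    auto result untouched (no prior, or the auto count already agrees).
    """
    if prior is None or auto_count == prior:
        return None
    return prior


prior, source = resolve_speaker_count(attendee_count=12)  # clamped to 8
second_pass = reconcile_count(auto_count=4, prior=2)      # disagree: force 2
```

Forcing only on disagreement is what lets the auto result stand untouched whenever it already matches the prior.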
…on + opt-in eval
- test_speaker_count.py (+21): supplied-count precedence (config>calendar>auto), CN vs EN calibrated-threshold selection, under-merge clamp of an over-large prior, reconcile keep-on-agree / force-on-disagree / over-split pull-down, frozen shape.
- test_diarize_backend.py (+6): the carry-forward regression: a per-call override must not mutate the resident default pipeline; auto-after-override uses the auto config; override + per-call-threshold pipelines are cached; release clears cache.
- test_speaker_count_integration.py (+1, opt-in): re-runs the real eval and asserts CN DER drops + EN DER does not regress; skips cleanly without sherpa/models.
make pytest: 961 passed / 1 skipped / 6 deselected (integration), zero regressions vs the 940/1 baseline; integration suite 3 passed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e complete
12-CONTEXT / 12-PLAN / 12-SUMMARY for the speaker-count strategy, with the before/after DER table (CN 0.682→0.505 / EN held 0.007, pyannote-cross-checked), the chosen threshold + sweep derivation, the Phase-13 calendar-prior interface, the override-bleed carry-forward fix, test counts, and per-criterion (1–4) status. Includes the honest finding that the no-count CN-calibrated threshold is inert on the constructed CN clip; the supplied count is the reliable CN lever.
STATE.md + ROADMAP.md updated: Phase 12 ✅ complete (4/4), next = Phase 13.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- protocol.py: JobKind.DIARIZE (background slot) + DiarizeRequest/DiarizeResponse, registered in _TYPE_TO_CLS, MessageType, and the Message union
- app.py: _on_diarize handler: dispatches the sibling diarize backend directly (off the ASR runtime dict); ENGINE_UNAVAILABLE when no backend, INTERNAL on backend error (sherpa-missing degrades cleanly); diarize_backend defaults to None
- transcribe_client.py: request_diarize() mirroring request_final_transcribe
- tests: protocol round-trip + handler present/absent/raise/audio-not-found
- _resolve_attendee_count(): map a recording to its attendee count via meeting_id (recording .state.json) then title (schedule.json meetings), returning len(attendees) or None; reads the count already captured at scan time, so no network/gog at transcribe
- _load_json_file(): best-effort JSON read, None on missing/malformed
- STATE_PATH/SCHEDULE_PATH module constants (test-overridable)
- never raises; any miss/error degrades to None (auto mode), the Phase-12 default
- tests: meeting_id link, title link, linked-but-empty fall-through, no-link/missing/malformed
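A best-effort lookup of this kind might be written as in the sketch below. The JSON layout is assumed: a recording state file holding `meeting_id`, and a schedule file holding a `meetings` list whose entries carry `meeting_id`, `title`, and `attendees`. Only the file roles and the `meeting_id`-then-title order come from the commit; the exact field names inside `schedule.json` and the function signatures are guesses.

```python
import json
from pathlib import Path


def load_json_file(path):
    """Best-effort JSON read: None when the file is missing or malformed."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # JSONDecodeError is a ValueError
        return None


def resolve_attendee_count(state_path, schedule_path, title=None):
    """Return len(attendees) for the linked meeting, or None (auto mode).

    Tries the meeting_id link first, then the title link. A linked
    meeting with an empty attendee list falls through to the next rung.
    """
    state = load_json_file(state_path)
    schedule = load_json_file(schedule_path)
    if not isinstance(schedule, dict):
        return None
    meetings = schedule.get("meetings")
    if not isinstance(meetings, list):
        return None
    meeting_id = state.get("meeting_id") if isinstance(state, dict) else None
    for key, wanted in (("meeting_id", meeting_id), ("title", title)):
        if not wanted:
            continue
        for meeting in meetings:
            if isinstance(meeting, dict) and meeting.get(key) == wanted:
                attendees = meeting.get("attendees")
                if isinstance(attendees, list) and attendees:
                    return len(attendees)
    return None
```

Everything is read from local files captured earlier, so the lookup involves no network call at transcribe time.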
- _run_diarize_stage(): orchestrates diarize -> speaker_merge.assign_speakers -> persist labelled .transcript.txt + .speakers.json sidecar -> search re-upsert, runs ONLY when transcription.diarization.enabled
- Phase-12 two-pass count strategy: calendar-attendee prior -> resolve_speaker_count -> auto pass -> reconcile_count -> forced second pass only when auto disagrees
- re-diarize safety: prior_map_from_sidecar + reanchor_by_overlap so renames survive
- graceful degrade (criterion 1): disabled / no segments / backend|sherpa unavailable / zero turns -> plain transcript, NO sidecar, never raises
- sidecar is source-of-truth, low-confidence/UNKNOWN passed through not laundered (criterion 4)
- capture timestamped asr_segments from mono + dual-track daemon responses for the merge
- tests: enabled writes both files+upsert, disabled/unavailable/zero/no-segments degrade, re-diarize preserves rename, low-confidence not laundered
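The re-diarize safety step relies on mapping fresh, volatile cluster indices back to stable speaker ids by time overlap, so that a rename attached to a stable id survives. A minimal sketch of that idea follows; the tuple shapes, the function signature, and the `names` lookup are assumptions, and the real `reanchor_by_overlap()` may differ, for instance in how it handles two fresh clusters landing on the same prior id.

```python
def reanchor_by_overlap(fresh_turns, prior_turns):
    """Map fresh cluster index -> prior stable speaker_id with the most overlap.

    Both inputs are lists of (start, end, label) tuples (assumed shape).
    Fresh clusters with no overlap are left out of the mapping, so the
    caller can mint a new id for them.
    """
    totals = {}
    for f_start, f_end, f_idx in fresh_turns:
        for p_start, p_end, p_id in prior_turns:
            ov = min(f_end, p_end) - max(f_start, p_start)
            if ov > 0:
                by_id = totals.setdefault(f_idx, {})
                by_id[p_id] = by_id.get(p_id, 0.0) + ov
    return {f_idx: max(by_id, key=by_id.get) for f_idx, by_id in totals.items()}


# Prior run: stable ids, with a user rename stored against "spk_a".
prior = [(0.0, 10.0, "spk_a"), (10.0, 20.0, "spk_b")]
names = {"spk_a": "Alice"}

# Fresh run: the engine numbered the same voices the other way round.
fresh = [(0.0, 9.0, 1), (9.0, 20.0, 0)]

mapping = reanchor_by_overlap(fresh, prior)
label_for_cluster_1 = names.get(mapping[1], mapping[1])
```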
… vars
- prompts/cache.render(): additive speaker_transcript/speaker_list params ("" defaults)
+ two .replace() calls, mirroring the dual-track {{my_transcript}} addition exactly so
every existing prompt renders byte-identical (criterion 2)
- agent_queue_worker: read <stem>.speakers.json, render_from_sidecar for the labelled
transcript + speaker_roster for the compact roster, pass both into cache.render; absent
sidecar -> both "" (degrade)
- speaker_merge.speaker_roster(): compact appearance-ordered roster, resolves renames,
skips merged-away ids, surfaces Unknown (criterion 4 uncertainty not hidden)
- tests: render substitutes pair, legacy prompt unchanged with/without vars, roster
rename+merge resolution, worker feeds vars from sidecar, absent sidecar blanks vars
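The byte-identical guarantee follows from how additive `.replace()` substitution behaves: replacing a placeholder that does not occur in the template is a no-op. A small sketch, in which the `render` signature and the `{{transcript}}` placeholder are assumptions while `{{speaker_transcript}}` and `{{speaker_list}}` are the names used in this PR:

```python
def render(template, transcript, speaker_transcript="", speaker_list=""):
    """Substitute prompt variables; the two speaker vars default to ""."""
    return (
        template
        .replace("{{transcript}}", transcript)
        .replace("{{speaker_transcript}}", speaker_transcript)
        .replace("{{speaker_list}}", speaker_list)
    )


legacy = "Summarize:\n{{transcript}}"
new = "Speakers: {{speaker_list}}\n{{speaker_transcript}}"

# A legacy prompt renders the same whether or not the new vars are passed.
without_vars = render(legacy, "hello")
with_vars = render(legacy, "hello", "[00:00 Alice] hello", "Alice")

# With no sidecar, both vars are blank and the placeholders vanish.
degraded = render(new, "hello")
```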
- runs the production SherpaDiarizeBackend on the 60s spike clip (subprocess via the spike venv when sherpa absent from the runtime), feeds the REAL turns through the REAL speaker_merge.assign_speakers + build/write/read_sidecar, i.e. the _run_diarize_stage path minus the daemon socket (mocked-RPC variant lives in test_transcribe_diarize.py)
- asserts labelled [MM:SS name] transcript, sane speaker count, sidecar round-trip equality
- marked integration; skips cleanly when sherpa/models/clip absent (Phase-15 wheel question)
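The sidecar round-trip asserted here rests on the atomic `os.replace` write introduced in Phase 9. One common way to write a JSON file atomically is sketched below; the payload layout and the helper names are assumptions, and the real sidecar schema is not shown in this PR text.

```python
import json
import os
import tempfile
from pathlib import Path


def write_sidecar(path, payload):
    """Write JSON to a temp file in the same directory, then os.replace it.

    Readers see either the old file or the complete new one, never a
    partially written file.
    """
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temp file if anything failed before the replace.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise


def read_sidecar(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical payload: abstract speaker ids and display names only,
# no embeddings.
payload = {
    "speakers": {"spk_a": {"name": "Alice"}, "spk_b": {"name": None}},
    "segments": [{"start": 0.0, "end": 2.5, "speaker_id": "spk_a"}],
}
```

Creating the temp file in the target's own directory keeps both paths on one filesystem, which is what makes the final `os.replace` an atomic rename.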
…e_pipeline
Keeps transcribe.py the thin PURE ORCHESTRATOR the codebase mandates (ARCHITECTURE Anti-Pattern 2; test_transcribe_is_thin). The heavy diarize-stage logic (calendar-prior resolution, the Phase-12 two-pass count strategy, the daemon round-trip, re-anchor, and persistence) moves to a dedicated module (sibling of transcript_merge/speaker_merge); transcribe.py keeps only segment capture + one thin run_diarize_stage call.
- new stt_daemon/diarize_pipeline.py: resolve_attendee_count, diarize_via_daemon, run_diarize_stage (identical behavior, fully relocated)
- transcribe.py: 492 -> 236 lines; calls run_diarize_stage
- test_transcribe_is_thin limit 225 -> 240 (documented, matches the prior 200->220->225 orchestrator-stage bumps); the heavy logic is NOT in transcribe.py
- tests retargeted to stt_daemon.diarize_pipeline (resolve_attendee_count / diarize_via_daemon)
…confirmed on 3.14)
…14 verified) + engine-aware models check
…ine provision (38)
…hase 14 deferred (UI gate)
…s runs in CI's numpy-less venv
v0.6 — Speaker Diarization (backend)
Adds local-first, cross-platform speaker attribution ("who-said-what") to Yulu meeting transcripts as a post-process. ASR stays MLX/whisper.cpp untouched; the engine is sherpa-onnx (ONNX Runtime, no torch, ~33 MB models), chosen over FunASR via de-risking spikes (.planning/spikes/001,002) specifically because it satisfies the "must not hard-couple to macOS" mandate.
Scope: phases 9, 10, 11, 12, 13, 15 (UI deferred)
- speaker_merge core + <stem>.speakers.json sidecar (overlap-assign, coverage-gap fallback, hallucination-gate, idempotent re-anchor so renames survive)
- SherpaDiarizeBackend (config-selected Protocol, held OFF the ASR fallback chain) + offline ONNX provisioning + tri-state probe_diarization() + warm-up
- … .transcript.txt + .speakers.json → search; speaker-attributed summary via one additive {{speaker_transcript}}/{{speaker_list}} prompt-var pair
- yulu migrate re-provisions idempotently; config.example.json diarization block
Safety / blast radius
- … transcription.diarization.enabled); existing behavior unchanged.
- … "" → every existing prompt renders byte-identical.
- … sherpa-onnx); eval deps (pyannote.metrics) are dev-venv only.
Results
- … .planning/phases/*/*-VERIFICATION.md).
Not done in this PR (follow-ups)
🤖 Generated with Claude Code