Spec: speaker diarization on meeting Others track #64
Extends the v1.10.0 meeting feature so the system-audio track is split into Speaker 1..N via FluidAudio's SortformerDiarizer. Mic stays "You"; diarization is real-time, auto-detects speakers, no UI input.
Pull request overview
Documentation-only PR adding a design spec for extending the v1.10.0 meeting transcription feature with real speaker diarization on the system-audio ("Others") track using FluidAudio's SortformerDiarizer. No code changes are introduced.
Changes:
- Adds a new design document under `.kiro/specs/meeting-speaker-diarization/design.md` following the existing per-feature spec convention.
- Describes data-model reshape (`MeetingSpeaker` enum with associated value), a new `MeetingDiarizer` actor, wire-in points in `MeetingStateManager`, UI updates, settings toggle, and test plan.
- Lists representative files to edit and a manual verification procedure.
Summary
This design spec for speaker diarization on meeting transcripts is well-structured and follows the existing .kiro/specs/ convention. The approach of using FluidAudio's SortformerDiarizer for system-audio-only diarization is sound. However, three critical implementation issues must be addressed before implementation:
Critical Issues (Must Fix)
- Index-to-label mapping inconsistency (line 36): The spec shows both 0-based internal indices and 1-based display labels ("Speaker 1" for index 0) without clear documentation of the transform, risking debugging confusion.
- Diarization cold-start race condition (lines 67-69): Transcription can complete before diarization processes audio, causing early chunks to show a `nil` `speakerIndex`. Need explicit handling or grace period documentation.
- Hardcoded sample rate assumption (line 73): The `16000` Hz sample rate is hardcoded but may not match actual `MeetingAudioEngine` output, causing timeline misalignment.
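To make the first two points concrete, here is a minimal sketch of how the index convention and the warmup `nil` case could be kept in one place. Only `MeetingSpeaker`, `.you`, and `.others(speakerIndex: Int?)` come from the spec; the `displayLabel` property and its exact strings are assumptions for illustration, not the spec's API.

```swift
// Illustrative sketch only: the real MeetingSpeaker in the spec may differ.
enum MeetingSpeaker: Equatable {
    case you
    /// 0-based diarizer slot; nil while the diarizer has not yet attributed the chunk.
    case others(speakerIndex: Int?)

    /// Hypothetical display helper. The +1 offset lives only here,
    /// so stored indices stay 0-based everywhere else.
    var displayLabel: String {
        switch self {
        case .you:
            return "You"
        case .others(speakerIndex: .some(let index)):
            return "Speaker \(index + 1)"
        case .others(speakerIndex: .none):
            // Early chunks during diarizer warmup fall back to the plain badge.
            return "Others"
        }
    }
}
```

With this shape, a log line or test that prints the stored index and a UI label that prints `displayLabel` differ by exactly one, and that difference is documented at a single site.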
Additional Concern
- Auto-enabling toggle behavior (line 95): Automatically changing `meetingDiarizationEnabled` from false→true after model download creates surprising UX.
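One way to avoid the surprise is to keep the user's preference and the model's readiness as separate pieces of state. The sketch below assumes a hypothetical `MeetingDiarizationSettings` type with hypothetical `isModelReady`, `isDiarizationActive`, and `modelDownloadDidFinish()` members; only the `meetingDiarizationEnabled` name comes from the spec.

```swift
// Illustrative sketch only: type and member names other than
// meetingDiarizationEnabled are assumptions.
struct MeetingDiarizationSettings {
    /// User preference. Stays false until the user explicitly opts in.
    var meetingDiarizationEnabled = false

    /// Hypothetical readiness flag for the downloaded model.
    private(set) var isModelReady = false

    /// Diarization runs only when the user opted in AND the model is available.
    var isDiarizationActive: Bool {
        meetingDiarizationEnabled && isModelReady
    }

    /// Download completion updates readiness only; it never touches the preference.
    mutating func modelDownloadDidFinish() {
        isModelReady = true
    }
}
```

Because the download handler has no write path to the preference, the toggle can only change through a user action.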
Once these issues are resolved, the spec provides a solid foundation for implementation. The test strategy, verification plan, and out-of-scope boundaries are all well-defined.
- Make 0-based vs 1-based index convention explicit (storage 0-based, display layer adds +1).
- Document the legitimate first-chunks `nil` window from Sortformer's warmup; renderer handles `.others(nil)` as plain "Others".
- Move sample rate to a single MeetingAudioEngine constant and have the engine attach `startTime` directly to each yielded chunk so the consumer no longer divides sample counts.
- Default `meetingDiarizationEnabled` to false with explicit opt-in; drop the surprising auto-enable on model download.
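The producer-side `startTime` stamping described in the third item might look roughly like the following. This is a sketch under stated assumptions: `MeetingAudioFormat`, `MeetingAudioChunk`, and `ChunkStamper` are hypothetical names, and the 16 kHz value is taken from the review comment rather than verified against the engine.

```swift
// Illustrative sketch only: all type names here are hypothetical.
enum MeetingAudioFormat {
    /// Single source of truth for the engine's output rate (assumed 16 kHz).
    static let sampleRate: Double = 16_000
}

struct MeetingAudioChunk {
    let samples: [Float]
    /// Seconds from the start of the track to the first sample of this chunk.
    let startTime: Double

    var duration: Double {
        Double(samples.count) / MeetingAudioFormat.sampleRate
    }
}

/// Stamps each chunk as it is produced, so consumers read `startTime`
/// instead of dividing sample counts by a rate they have to guess.
struct ChunkStamper {
    private var samplesEmitted = 0

    mutating func stamp(_ samples: [Float]) -> MeetingAudioChunk {
        let chunk = MeetingAudioChunk(
            samples: samples,
            startTime: Double(samplesEmitted) / MeetingAudioFormat.sampleRate
        )
        samplesEmitted += samples.count
        return chunk
    }
}
```

The design choice is that the sample rate appears in exactly one declaration, so if the engine's real output rate differs, only that constant changes and the transcription and diarization timelines stay aligned.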
Summary
Adds a design spec to extend the v1.10.0 meeting transcription feature with real speaker diarization on the system-audio ("Others") track only, so the live transcript shows distinct labels (Speaker 1, Speaker 2, …) for remote participants instead of a single "Others" badge.
- `SortformerDiarizer` (already a dependency at 0.13.4: streaming, runs on ANE, ~11% DER, up to 4 speaker slots).

The spec follows the existing `.kiro/specs/<feature>/design.md` convention used by `meeting-transcription`, `hands-free-dictation`, etc. No code changes in this PR; design only.

Test plan
- `MeetingSpeaker` enum reshape (`.you` / `.others(speakerIndex: Int?)`) is acceptable
- `SortformerDiarizer` (spot-checked in repo before authoring)