feat(voice): streamed gapless PCM assistant speech via AudioWorklet #1644
Merged
Replace the buffered whole-clip <audio> playout (nothing audible until the entire clip is synthesized + transferred) with a start-on-first-chunk path, keeping the buffered path as a strict fallback so this can never regress.

- keiko-playback-worklet.js: a hardened playback AudioWorkletProcessor with a pre-allocated Float32 ring buffer (no per-message spread, no per-quantum allocation), a prime/jitter threshold so the first chunks never underrun, underrun→silence (never a glitch), `null`→instant flush (sub-frame barge-in), and a frames-played position report.
- gateway: requestTextToSpeechStream returns the provider audio as a bounded ReadableStream (same auth/egress/error-coding/size cap), not a buffered clip.
- BFF: POST /api/voice/speak/stream streams raw PCM (audio/pcm) via the STREAMING sentinel with abort-on-disconnect + write backpressure; the buffered /api/voice/speak route is unchanged.
- client: an injectable AudioWorklet PCM sink (AudioContext @ 24kHz). The engine tries it first and falls back to the buffered opus path when WebAudio is unavailable (e.g. under test) or it fails to start. Barge-in flushes the worklet immediately. PCM little-endian decoding carries a sample split across network chunks.

Verified: gateway/server/ui suites pass; build:ui ships the worklet to dist/ui/static (served same-origin under script-src 'self', no inline hash); root export-surface contract updated (zero drift); live Azure e2e streams raw 24kHz PCM with ~0.85s time-to-first-audio-byte.

Precise interrupt offset and STT verbose_json are documented follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ailure

Make the streaming sink strictly fallback-safe: an error before audio starts (provider/network error, empty 200) now returns false so useAssistantSpeech runs the buffered path, instead of surfacing a failure. This keeps a turn from being lost when only the buffered route works (e.g. a smoke test that stubs /api/voice/speak but not the stream route) and is the safer production default. Only a mid-stream failure after playback has begun degrades to text.

Harden the worklet for short/empty streams: the "end" marker forces any sub-prime remainder to play out and completes once drained (no dependence on having crossed the prime threshold); a no-audio stream is reported as an error by the client rather than hanging.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
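The fallback rule in this commit can be sketched as a small pure function. This is only an illustration: `StreamEvent`, `Outcome`, and `resolveOutcome` are hypothetical names, not the PR's actual types, and "audio started" is simplified here to "a non-empty chunk arrived", whereas the real client presumably ties it to playback actually beginning.

```typescript
// Hypothetical event and outcome shapes; the PR does not show the real ones.
type StreamEvent =
  | { kind: "chunk"; byteLength: number }
  | { kind: "end" }
  | { kind: "error" };

type Outcome = "streamed" | "fallback-to-buffered" | "degrade-to-text";

// Decide what the speech engine should do given the events seen on the stream.
function resolveOutcome(events: readonly StreamEvent[]): Outcome {
  let audioStarted = false;
  for (const ev of events) {
    if (ev.kind === "chunk") {
      // Assumption: any non-empty chunk counts as "audio has started".
      if (ev.byteLength > 0) audioStarted = true;
    } else if (ev.kind === "error") {
      // Before audio: the buffered path can still save the turn.
      // After audio: replaying the clip would double-speak, so degrade to text.
      return audioStarted ? "degrade-to-text" : "fallback-to-buffered";
    } else {
      // "end": an empty 200 (no audio at all) is treated like an early error.
      return audioStarted ? "streamed" : "fallback-to-buffered";
    }
  }
  // Assumption: a stream that stops without an end marker is handled as an error.
  return audioStarted ? "degrade-to-text" : "fallback-to-buffered";
}
```

Under these assumptions, a smoke test that stubs only the buffered route would see an early `error` event and still get spoken output through the buffered path.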
The prior push did not trigger a CI run (GitHub synchronize glitch); the parent commit's full suite is green and this only adds an empty commit to re-attach the required checks to HEAD. Squashed away on merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Replaces the buffered whole-clip `<audio>` assistant-speech playout (nothing audible until the entire clip is synthesized + transferred) with a start-on-first-chunk AudioWorklet PCM path, keeping the buffered path as a strict fallback so it can never regress. Inspired by the canonical Azure realtime worklet pattern, adapted and hardened for Keiko (not copied). Also closes the audit's outstanding `no-webaudio-scheduling-architecture` finding.

What changed
- `keiko-playback-worklet.js` (new, `packages/keiko-ui/public/`): a hardened playback `AudioWorkletProcessor` with a pre-allocated Float32 ring buffer (no per-message spread, no per-quantum allocation), a prime/jitter threshold so the first chunks never underrun, underrun→silence (never a glitch), `null`→instant flush (sub-frame barge-in), and a frames-played position report.
- gateway: `requestTextToSpeechStream` returns the provider audio as a bounded `ReadableStream` (same auth / egress seam / error coding / size cap), not a fully-buffered clip.
- BFF: `POST /api/voice/speak/stream` streams raw PCM (`audio/pcm`) via the `STREAMING` sentinel with abort-on-disconnect + write backpressure. The buffered `/api/voice/speak` route is untouched.
- client: an injectable AudioWorklet PCM sink (`AudioContext` @ 24 kHz). The engine tries it first and falls back to the buffered opus path when WebAudio is unavailable (e.g. under test) or it fails to start. Barge-in flushes the worklet immediately. PCM little-endian decoding carries a sample split across network chunks.
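To make the worklet's buffering behavior concrete, here is a minimal sketch of a pre-allocated ring buffer with a prime threshold, underrun-to-silence, flush, and a frames-played counter. It is not the code in `keiko-playback-worklet.js`: the class name `PlaybackRing`, its method names, and the capacity/threshold values are assumptions, and it is written as a plain class so it runs outside an `AudioWorkletGlobalScope`. In a real processor, `read` would be called from `process()` with the output channel.

```typescript
// Hypothetical sketch; names and sizes are illustrative, not the PR's.
class PlaybackRing {
  private readonly buf: Float32Array; // allocated once, never resized
  private readonly primeFrames: number;
  private readPos = 0;
  private writePos = 0;
  private available = 0;
  private primed = false;
  private ended = false;
  framesPlayed = 0; // position report: frames actually played out

  constructor(capacity: number, primeFrames: number) {
    this.buf = new Float32Array(capacity);
    this.primeFrames = primeFrames;
  }

  // Copy samples in without allocating; returns how many fit.
  write(samples: Float32Array): number {
    const n = Math.min(samples.length, this.buf.length - this.available);
    for (let i = 0; i < n; i++) {
      this.buf[this.writePos] = samples[i];
      this.writePos = (this.writePos + 1) % this.buf.length;
    }
    this.available += n;
    if (this.available >= this.primeFrames) this.primed = true;
    return n;
  }

  // End marker: force any sub-prime remainder to play out.
  end(): void {
    this.ended = true;
    this.primed = true;
  }

  // Barge-in: drop everything queued so the next quantum is silent.
  flush(): void {
    this.readPos = 0;
    this.writePos = 0;
    this.available = 0;
    this.primed = false;
  }

  // Fill one output quantum; returns true once ended and fully drained.
  read(out: Float32Array): boolean {
    let i = 0;
    if (this.primed) {
      const n = Math.min(out.length, this.available);
      for (; i < n; i++) {
        out[i] = this.buf[this.readPos];
        this.readPos = (this.readPos + 1) % this.buf.length;
      }
      this.available -= n;
      this.framesPlayed += n;
    }
    // Not primed yet, or underrun: pad with silence rather than stale data.
    for (; i < out.length; i++) out[i] = 0;
    return this.ended && this.available === 0;
  }
}
```

One design choice in this sketch is that the buffer stays primed after an underrun; a stricter variant could re-arm the threshold to rebuild jitter headroom, and the description above does not say which the worklet does.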
Verification (reproduced locally)
- `build:ui` ships the worklet to `dist/ui/static/keiko-playback-worklet.js` (served same-origin under `script-src 'self'`, no inline hash; intact + no leaked `window`/`document`).
- gateway/server/ui suites pass, covering the BFF stream route (bytes + abort + error-before-headers + capability gate), the PCM byte→sample decoder (incl. split-sample carry), and the `useAssistantSpeech` streaming/fallback/barge-in wiring.
- Live Azure e2e streams raw 24 kHz PCM with ~0.85s time-to-first-audio-byte, vs the buffered path waiting for the full ~1.1s+ clip; gapless + instant barge-in.
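The split-sample carry covered by the decoder tests can be illustrated with a short sketch. It assumes 16-bit signed little-endian mono PCM, which the description does not state explicitly (it only says raw little-endian PCM at 24 kHz), and `Pcm16Decoder` is a hypothetical name rather than the PR's decoder.

```typescript
// Assumption: samples are 16-bit signed little-endian, one channel.
function toFloat(lo: number, hi: number): number {
  const u = lo | (hi << 8); // assemble unsigned 16-bit value
  const s = u >= 0x8000 ? u - 0x10000 : u; // reinterpret as signed
  return s / 32768; // scale to [-1, 1)
}

// Hypothetical decoder that tolerates a sample split across network chunks.
class Pcm16Decoder {
  private carry = -1; // pending low byte from the previous chunk, or -1

  decode(chunk: Uint8Array): Float32Array {
    const hasCarry = this.carry >= 0;
    const totalBytes = chunk.length + (hasCarry ? 1 : 0);
    const out = new Float32Array(totalBytes >> 1);
    let i = 0; // read index into chunk
    let o = 0; // write index into out
    if (hasCarry && chunk.length > 0) {
      // Complete the sample whose low byte arrived in the previous chunk.
      out[o++] = toFloat(this.carry, chunk[i++]);
      this.carry = -1;
    }
    for (; i + 1 < chunk.length; i += 2) {
      out[o++] = toFloat(chunk[i], chunk[i + 1]);
    }
    // An odd trailing byte is held back until the next chunk arrives.
    if (i < chunk.length) this.carry = chunk[i];
    return out;
  }
}
```

For example, the two samples 16384 and -32768 delivered as chunks of 3 bytes and 1 byte decode to one float per call, with the dangling low byte of the second sample carried between them.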
Invariants
Model Gateway / BFF boundary intact · no new runtime npm deps (AudioWorklet is native) · no raw audio persisted · browser never receives the provider key · capability-gated · no `globals.css` change. The buffered path is the universal fallback, so unsupported browsers/tests are unaffected.
Deferred follow-ups (documented)
Precise interrupt offset via the worklet position (barge-in is already instant via flush), and STT `verbose_json`. Both are orthogonal and minor, and are deferred to keep this PR focused on the playout path.

🤖 Generated with Claude Code