Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -87,6 +87,8 @@
- **Back-to-back dedup fix** — a final within 500 ms of the previous is now dropped only when it is a *near-duplicate* (Deepgram emitting `speech_final` then `is_final` for the same utterance). A genuinely different fast follow-up (e.g. the real interruption right after a suppressed phantom) is kept instead of being silently swallowed into an empty turn.
- **Interrupted-turn context rewrite** — on a confirmed mid-turn barge-in the spoken prefix is recorded in history with an `[interrupted by caller]` marker (instead of an ungrounded full reply), so a stateful agent runtime (Hermes/OpenClaw, keyed by `X-Hermes-Session-Id`) sees on the next turn that it was cut off and what the caller actually heard. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
- **Forward-STT-without-AEC no longer self-interrupts on its own echo.** The remaining live Hermes/OpenClaw barge-in failure: with `PATTER_FORWARD_STT_WHILE_SPEAKING` on, no AEC, and no `barge_in_strategies`, a VAD `speech_start` during TTS cancelled the turn immediately — but on a no-AEC link that `speech_start` is very often the agent's *own* TTS echo (or pre-first-token line noise during a long tool-running turn). The result was a cascade of false-positive interruptions: a short normal reply like "bene bene" produced `agent_text='[interrupted]'` with `bargein_ms≈0`, and the next turn's LLM ran for seconds but emitted `tts_characters=0` because it was torn down before its first token. The echo guard existed only on the *transcript* path, so the raw VAD-energy cancel had no protection. The VAD-energy cancel is now **deferred to transcript confirmation** whenever audio is forwarded during TTS without AEC (`forward_stt_while_speaking && aec is None`), exactly as it already was when `barge_in_strategies` are configured: the `speech_start` marks the barge-in *pending* (the agent keeps talking) and the cancel only fires once `_handle_barge_in` / `handleBargeIn` sees a real transcript that survives the echo guard; if none confirms within `barge_in_confirm_ms` (default 1500 ms) the agent resumes its sentence. The default VAD path and forward-STT *with* AEC keep the responsive immediate cancel — no behaviour change for existing configs. For the cleanest short-echo handling, still pair with `echo_cancellation=True` or `barge_in_strategies`. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
- **Barge-in now works while the carrier is still PLAYING a long buffered reply — the "Hermes detects the interruption but keeps talking" bug.** The pipeline pushes TTS audio to the carrier as fast as the provider synthesizes it (no pacing) while the carrier buffers and plays at realtime. With a token-paced LLM the two stay roughly in sync, but an agent-runtime LLM (`HermesLLM` / `OpenClawLLM`) delivers its whole — often long — reply at once after the thinking pause: TTS outruns realtime and the carrier ends up holding tens of seconds of queued audio. The handler's speaking state ended a fixed `PATTER_TTS_TAIL_GRACE_MS` (1.5 s) after the last *push*, not the last *playback* — so for most of the audible reply `_is_speaking` was already false, every VAD `speech_start` / transcript was treated as a calm next turn instead of a barge-in, `send_clear` was never sent, and the buffered audio kept playing over the caller (with the next turn's reply queued behind it). The handler now tracks an **estimated playback cursor** (`_playback_buffered_until` / `playbackBufferedUntil`, advanced per pushed chunk at the chunk's real byte rate — PCM16@16kHz or carrier-native μ-law@8kHz) and `_end_speaking_with_grace` waits in two phases: phase 1 keeps `_is_speaking=true` with `_tail_grace_active=false` for the whole estimated backlog (barge-in stays armed and takes the full cancel + `send_clear` path, which drops the carrier buffer instantly); phase 2 is the unchanged echo-tail grace. Barge-in cancels reset the cursor (the buffer was just cleared). No new config; token-paced LLMs (no backlog) behave byte-identically to before, and `PATTER_TTS_TAIL_GRACE_MS=0` still forces the legacy synchronous flip. This is the industry-standard semantics (stop + flush client-side regardless of LLM state — cf. Twilio media-stream `clear`). `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
- **Interrupted-turn history now records the reply prefix the caller actually HEARD (LiveKit-style truncation), not everything the LLM generated.** Builds on the playback cursor above. Two gaps closed: (a) on a **mid-turn** barge-in with an agent-runtime LLM, the whole reply had already been synthesized into the carrier buffer, so the `[interrupted by caller]` marker was appended to the FULL text — a stateful runtime (Hermes/OpenClaw) believed the caller heard everything; (b) on a barge-in landing **after the turn completed** (while the carrier still played the buffered tail) no marker was applied at all. The handler now tracks per-turn `(sentence, playback_start)` segments (`_turn_spoken_segments` / `turnSpokenSegments`; filler and `llm_error_message` audio advance the clock but add no segment) and maps `heard = total_pushed − carrier_backlog` to a sentence-granular prefix: the streaming path records `<heard prefix> [interrupted by caller]`, and the post-complete cancel paths rewrite the last assistant history entry the same way before clearing the buffer. No new config; with no tracked segments (e.g. no TTS) the legacy full-text marker is preserved. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
- **(Python) Twilio/Plivo mark frames now carry the caller-supplied name — first-message pacing no longer burns the mark-await timeout on every call.** `TwilioAudioSender.send_mark` (and the Plivo checkpoint equivalent) discarded the `mark_name` argument and sent a locally generated `audio_N` instead, so the `fm_N` echo the first-message pacer waited for never matched and every mark resolved via the 0.5 s fallback timeout (~1.5 s of guaranteed extra latency in the barge-in window of every Twilio call). The wire name is now the caller's, matching the TypeScript behaviour. `libraries/python/getpatter/telephony/twilio.py`, `.../telephony/plivo.py`.
- **(TypeScript) Inbound audio frames are now awaited — a transient audio-path error can no longer kill the whole server.** All three carrier WS message handlers called `handler.handleAudio(...)` without `await`, so a rejection inside the audio path (VAD, resampler, STT send) escaped the surrounding `try/catch` and became an unhandled rejection, which terminates the Node process (Node 15+) together with every active call. `libraries/typescript/src/server.ts`.
- **(TypeScript) Telnyx calls no longer leak `activeCallIds` entries.** The Telnyx WS close handler was the only one of the three carriers that never deleted its `ws → call_control_id` map entry, so the map grew for the server's lifetime and graceful shutdown issued hangup REST calls for long-dead calls. `libraries/typescript/src/server.ts`.
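
The playback-cursor and heard-prefix entries above can be illustrated with a minimal sketch. This is not the `stream_handler.py` implementation: the class `PlaybackEstimator`, its method names, and the format keys are hypothetical, and the rule that a sentence counts as heard once its playback has *started* is an assumption (the changelog only says the prefix is sentence-granular). The byte rates follow from the formats named in the entry: PCM16 at 16 kHz is 32,000 bytes per second and μ-law at 8 kHz is 8,000 bytes per second.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Byte rates for the two audio formats named in the changelog entry.
BYTES_PER_SECOND = {
    "pcm16_16k": 16_000 * 2,  # 16-bit samples at 16 kHz
    "mulaw_8k": 8_000 * 1,    # 8-bit mu-law samples at 8 kHz
}

INTERRUPTED_MARKER = "[interrupted by caller]"


@dataclass
class PlaybackEstimator:
    """Hypothetical helper, for illustration only."""

    buffered_until: float = 0.0  # time at which the carrier buffer should be empty
    total_pushed: float = 0.0    # seconds of audio pushed so far this turn
    segments: List[Tuple[str, float]] = field(default_factory=list)

    def push(self, chunk: bytes, fmt: str, now: float,
             sentence: Optional[str] = None) -> None:
        """Advance the cursor by the chunk's real duration.

        Pass sentence=None for audio that should advance the clock
        without adding a segment (for example filler audio).
        """
        duration = len(chunk) / BYTES_PER_SECOND[fmt]
        if sentence is not None:
            # Record where in the pushed audio this sentence starts.
            self.segments.append((sentence, self.total_pushed))
        # If the buffer already drained, playback restarts from `now`.
        self.buffered_until = max(self.buffered_until, now) + duration
        self.total_pushed += duration

    def backlog(self, now: float) -> float:
        """Estimated seconds of audio still queued at the carrier."""
        return max(0.0, self.buffered_until - now)

    def heard_prefix(self, now: float) -> str:
        """Sentences whose playback has started, plus the marker.

        Assumption: a sentence counts as heard once it has started playing.
        """
        heard = self.total_pushed - self.backlog(now)
        spoken = [s for s, start in self.segments if start < heard]
        return " ".join(spoken + [INTERRUPTED_MARKER])

    def clear(self, now: float) -> None:
        """Reset the cursor after a barge-in clears the carrier buffer."""
        self.buffered_until = now


if __name__ == "__main__":
    est = PlaybackEstimator()
    # An agent-runtime LLM delivers the whole reply at once: three 2 s
    # sentences of mu-law audio are pushed at t=0.
    for text in ("One.", "Two.", "Three."):
        est.push(b"\x00" * 16_000, "mulaw_8k", now=0.0, sentence=text)
    print(est.backlog(now=2.5))
    print(est.heard_prefix(now=2.5))
```

In this sketch a barge-in at 2.5 s finds a non-zero backlog, so a handler built on it would still treat the agent as speaking and could take the cancel and clear path; the history entry would then contain only the sentences that had begun playing.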