From 2df311febc134442c84b1235ddbbe434d2075ff8 Mon Sep 17 00:00:00 2001
From: nicolotognoni <nicolo.tognoni1@gmail.com>
Date: Fri, 5 Jun 2026 19:28:11 +0200
Subject: [PATCH 01/11] =?UTF-8?q?feat(llm):=20Hermes=20DX=20=E2=80=94=20TS?=
 =?UTF-8?q?=20namespace=20exports,=20caller-hash=20session=20key,=20long-t?=
 =?UTF-8?q?urn=20filler?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three opt-in developer-experience improvements for the agent-LLM providers, with full
Python/TypeScript parity.

- TypeScript namespace exports: `import { hermes, openclaw, openaiCompatible }` ->
  `new hermes.LLM()`, mirroring Python's `from getpatter.llm import hermes`. Frozen objects.
- session_key_factory / sessionKeyFactory + session_key_from="caller_hash": derive the
  X-Hermes-Session-Key per call from a SHA-256 caller hash (new public SessionContext +
  hash_caller/hashCaller), so an agent runtime remembers a caller across calls WITHOUT the
  raw phone number ever reaching the wire or the logs. The factory takes precedence over
  the static session_key; a falsy return omits the header. The loop dispatch was
  generalised to thread caller/callee only to providers whose stream() declares them (or
  **kwargs) — built-in and minimal custom providers unchanged. An unknown session_key_from
  raises in both SDKs (parity).
- long_turn_message / longTurnMessage (+ long_turn_message_after_s /
  longTurnMessageAfterS, default 4 s): opt-in spoken filler when a turn is slow and no
  audio has reached the caller yet, distinct from llm_error_message (which fires on
  error). Fires once, gated on emitted audio; the TS timer is serialised via an async
  clear() that awaits an in-flight filler so it can never overlap the real sentence.
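
A rough, self-contained sketch of the caller-hash bullet, for reviewers who want to see
the precedence rules in one place. It re-declares `hash_caller` and `SessionContext` in
the same shape as this patch instead of importing getpatter; `default_factory` and
`resolve_session_key` are hypothetical stand-ins for the provider internals, not the
SDK's public API.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


def hash_caller(caller: Optional[str]) -> Optional[str]:
    """First 16 hex chars of SHA-256 over the UTF-8 caller, or None if empty."""
    if not caller:
        return None
    return hashlib.sha256(caller.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class SessionContext:
    call_id: Optional[str] = None
    caller: Optional[str] = None
    callee: Optional[str] = None
    caller_hash: Optional[str] = None


def default_factory(ctx: SessionContext) -> Optional[str]:
    # Same shape as the key that session_key_from="caller_hash" produces.
    return f"patter-caller-{ctx.caller_hash}" if ctx.caller_hash else None


def resolve_session_key(
    static_key: Optional[str],
    factory: Optional[Callable[[SessionContext], Optional[str]]],
    *,
    call_id: Optional[str] = None,
    caller: Optional[str] = None,
    callee: Optional[str] = None,
) -> Optional[str]:
    # Hypothetical helper: a configured factory wins over the static key,
    # and a falsy return means "send no header for this call".
    if factory is not None:
        ctx = SessionContext(
            call_id=call_id,
            caller=caller,
            callee=callee,
            caller_hash=hash_caller(caller),
        )
        return factory(ctx)
    return static_key


key = resolve_session_key("customer-42", default_factory, caller="+15550100")
print(key)  # patter-caller-<16 hex chars>; the raw number is not part of it
print(resolve_session_key("customer-42", default_factory))  # no caller: factory returns None
print(resolve_session_key("customer-42", None))  # no factory: static key is used
```

The same number always maps to the same key, which is what lets a runtime scope memory
per caller across calls.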

Adversarial review caught and fixed a TS filler double-speak race (the setTimeout callback
could overlap the first real sentence; Python's asyncio path was immune).
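
For context, a minimal asyncio sketch of the fire-once, serialised filler idea. This is
not the code in stream_handler.py: `LongTurnFiller`, `speak`, `start()` and `demo()` are
hypothetical names, and the timings are shortened so the example runs quickly. The point
it illustrates is that `clear()` either cancels a pending timer or waits for an in-flight
filler before the real sentence is spoken.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class LongTurnFiller:
    """Speaks `message` at most once if clear() has not run within after_s."""

    def __init__(
        self,
        speak: Callable[[str], Awaitable[None]],
        message: Optional[str],
        after_s: float = 4.0,
    ) -> None:
        self._speak = speak
        self._message = message
        self._after_s = after_s
        self._speaking = False
        self._task: Optional[asyncio.Task] = None

    def start(self) -> None:
        # An unset or empty message means the feature is off.
        if self._message:
            self._task = asyncio.create_task(self._run())

    async def _run(self) -> None:
        await asyncio.sleep(self._after_s)
        self._speaking = True
        await self._speak(self._message)

    async def clear(self) -> None:
        """Call right before the first real sentence is spoken."""
        task, self._task = self._task, None
        if task is None:
            return
        if not self._speaking:
            task.cancel()  # timer still pending: the filler never fires
        try:
            await task  # filler in flight: wait for it to finish first
        except asyncio.CancelledError:
            pass  # sketch only: this also hides cancellation of the caller


async def demo(turn_s: float) -> List[str]:
    spoken: List[str] = []

    async def speak(text: str) -> None:
        spoken.append(text)
        await asyncio.sleep(0.05)  # stand-in for TTS playback time

    filler = LongTurnFiller(speak, "One moment, let me check.", after_s=0.05)
    filler.start()
    await asyncio.sleep(turn_s)  # stand-in for the slow LLM turn
    await filler.clear()  # serialise against the real reply
    await speak("Here is your answer.")
    return spoken


print(asyncio.run(demo(0.01)))  # fast turn: only the real reply is spoken
print(asyncio.run(demo(0.3)))  # slow turn: filler first, then the real reply
```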

Python 2206 / TypeScript 1758 tests pass; tsc + build clean.
---
 CHANGELOG.md                                  |   6 +
 docs/integrations/hermes.mdx                  |  26 ++
 libraries/python/getpatter/__init__.py        |   4 +
 libraries/python/getpatter/client.py          |  13 +
 libraries/python/getpatter/llm/hermes.py      |  44 ++-
 .../python/getpatter/llm/openai_compatible.py |  86 +++-
 libraries/python/getpatter/models.py          |  58 +++
 .../python/getpatter/services/llm_loop.py     |  99 +++--
 libraries/python/getpatter/stream_handler.py  |  94 +++++
 .../tests/test_llm_session_key_factory.py     | 367 ++++++++++++++++++
 .../tests/unit/test_long_turn_filler.py       | 312 +++++++++++++++
 libraries/typescript/src/index.ts             |  13 +
 libraries/typescript/src/llm-loop.ts          |  44 ++-
 libraries/typescript/src/llm/hermes.ts        |  45 ++-
 .../typescript/src/llm/openai-compatible.ts   |  81 +++-
 libraries/typescript/src/stream-handler.ts    |  81 ++++
 libraries/typescript/src/types.ts             |  42 ++
 .../tests/llm-namespace-exports.test.ts       |  50 +++
 .../llm-session-key-factory.mocked.test.ts    | 286 ++++++++++++++
 .../tests/long-turn-filler.mocked.test.ts     | 341 ++++++++++++++++
 20 files changed, 2016 insertions(+), 76 deletions(-)
 create mode 100644 libraries/python/tests/test_llm_session_key_factory.py
 create mode 100644 libraries/python/tests/unit/test_long_turn_filler.py
 create mode 100644 libraries/typescript/tests/llm-namespace-exports.test.ts
 create mode 100644 libraries/typescript/tests/llm-session-key-factory.mocked.test.ts
 create mode 100644 libraries/typescript/tests/long-turn-filler.mocked.test.ts

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 220bafb..b727355 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,11 @@
 ## Unreleased
 
+### Added
+
+- **TypeScript namespace exports for the agent-LLM presets.** `import { hermes, openclaw, openaiCompatible } from "getpatter"` now works alongside the existing `HermesLLM` / `OpenClawLLM` / `OpenAICompatibleLLM` named exports, so `new hermes.LLM()` mirrors Python's `from getpatter.llm import hermes; hermes.LLM()`. `libraries/typescript/src/index.ts`.
+- **`session_key_factory` / `sessionKeyFactory` — per-call long-term memory scope from a caller hash.** `OpenAICompatibleLLM` (and `HermesLLM`) can derive the `X-Hermes-Session-Key` header per call from a `SessionContext` (`call_id` / `caller` / `callee` / `caller_hash`) instead of a static value, so an agent runtime can remember a caller across calls **without the raw phone number ever reaching the wire or the logs**. Shortcut `HermesLLM(session_key_from="caller_hash")` installs a default `patter-caller-<caller_hash>` factory (SHA-256, 16 hex chars). New public `SessionContext` + `hash_caller` / `hashCaller` helper. The factory takes precedence over the static `session_key`; a falsy return omits the header. The loop dispatch was generalised to thread `caller` / `callee` only to providers whose `stream()` declares them (or `**kwargs`), keeping built-in and minimal custom providers unchanged. `libraries/python/getpatter/models.py`, `.../llm/openai_compatible.py`, `.../llm/hermes.py`, `.../services/llm_loop.py` + TypeScript mirrors.
+- **`long_turn_message` / `longTurnMessage` — opt-in spoken filler during a slow turn.** When an LLM turn takes longer than `long_turn_message_after_s` (default 4 s) and no audio has reached the caller yet, Patter speaks a short configurable line (e.g. "One moment, let me check.") instead of dead silence — useful for agent runtimes (Hermes / OpenClaw) that run tools mid-turn. Distinct from `llm_error_message` (which fires on error): this fires on **slowness**, once per turn, gated on emitted audio so it never double-speaks. `None` / unset = off (no behaviour change). `libraries/python/getpatter/models.py`, `.../stream_handler.py`, `.../client.py` + TypeScript mirrors.
+
 ## 0.6.5 (2026-06-05)
 
 ### Added
diff --git a/docs/integrations/hermes.mdx b/docs/integrations/hermes.mdx
index 7b03ede..006ed0f 100644
--- a/docs/integrations/hermes.mdx
+++ b/docs/integrations/hermes.mdx
@@ -63,6 +63,23 @@ single turn can take **30–90 s**. That is why `HermesLLM` defaults to a **120
 timeout (the generic provider's 60 s, raised for the preset) instead of the short ceiling
 used for raw inference providers — a turn that runs a tool isn't cut off mid-thought.
 
+Because a tool-running turn can leave the caller in **silence** for several seconds, the
+agent supports an opt-in spoken **filler**: set `long_turn_message` / `longTurnMessage`
+(with `long_turn_message_after_s` / `longTurnMessageAfterS`, default 4 s) and Patter speaks
+a short line if no audio has reached the caller yet by then. It fires once per turn, only
+on slowness, and never overlaps the real reply. (A separate `llm_error_message` /
+`llmErrorMessage` covers the gateway-down / timeout **error** case.)
+
+```python
+agent = phone.agent(
+    stt=DeepgramSTT(),
+    llm=HermesLLM(),
+    tts=ElevenLabsTTS(),
+    long_turn_message="One moment, let me check that.",
+    long_turn_message_after_s=4,
+)
+```
+
 <Note>
   **Where the session lives.** Hermes is **stateless** and keys continuity off
   **HTTP headers**, not the OpenAI `user` field. Each phone call maps to **one** Hermes
@@ -83,6 +100,15 @@ used for raw inference providers — a turn that runs a tool isn't cut off mid-t
   const llm = new HermesLLM({ sessionKey: 'customer-42' });
   ```
 
+  For **per-caller memory without storing the raw phone number**, derive the key from a
+  caller hash instead of a static value — `HermesLLM(session_key_from="caller_hash")` /
+  `new HermesLLM({ sessionKeyFrom: 'caller_hash' })` emits
+  `X-Hermes-Session-Key: patter-caller-<hash>` (first 16 hex chars of the SHA-256 digest), so Hermes
+  remembers a caller across calls while the raw number never reaches the wire or the
+  logs. For a custom scheme, pass `session_key_factory` / `sessionKeyFactory`, a callback
+  that receives a `SessionContext` (`call_id` / `caller` / `callee` / `caller_hash`) and
+  returns the scope value (a falsy return omits the header for that call).
+
   (Patter also still sends `user=patter-call-<call_id>` for upstream-log correlation,
   but that field is **not** what drives the Hermes session — the headers are.)
 </Note>
diff --git a/libraries/python/getpatter/__init__.py b/libraries/python/getpatter/__init__.py
index 13f529b..0de24fb 100644
--- a/libraries/python/getpatter/__init__.py
+++ b/libraries/python/getpatter/__init__.py
@@ -47,9 +47,11 @@
     OpenAICompatibleConsult,
     PipelineHooks,
     RealtimeTurnDetection,
+    SessionContext,
     STTConfig,
     TTSConfig,
     TurnMetrics,
+    hash_caller,
 )
 from getpatter.services.barge_in_strategies import (
     BargeInStrategy,
@@ -419,9 +421,11 @@ def mix_pcm(agent: bytes, bg: bytes, ratio: float) -> bytes:
     "LatencyBreakdown",
     "PipelineHooks",
     "RealtimeTurnDetection",
+    "SessionContext",
     "STTConfig",
     "TTSConfig",
     "TurnMetrics",
+    "hash_caller",
     "BargeInStrategy",
     "MinWordsStrategy",
     "evaluate_barge_in_strategies",
diff --git a/libraries/python/getpatter/client.py b/libraries/python/getpatter/client.py
index a3ce686..12075a4 100644
--- a/libraries/python/getpatter/client.py
+++ b/libraries/python/getpatter/client.py
@@ -1441,6 +1441,8 @@ def agent(
         language: str = "en",
         first_message: str = "",
         llm_error_message: str | None = None,
+        long_turn_message: str | None = None,
+        long_turn_message_after_s: float = 4.0,
         tools: list[Tool] | None = None,
         stt: STTProvider | None = None,
         tts: TTSProvider | None = None,
@@ -1482,6 +1484,15 @@ def agent(
             model: OpenAI Realtime model ID.
             language: BCP-47 language code, e.g. ``"en"``.
             first_message: If set, the agent speaks this immediately on connect.
+            long_turn_message: Pipeline mode only. Opt-in short filler spoken
+                when a turn is SLOW (e.g. an agent runtime running tools) and no
+                audio has reached the carrier after
+                ``long_turn_message_after_s`` seconds — distinct from
+                ``llm_error_message`` (which fires on an error). ``None``
+                (default) keeps today's silence-while-thinking behaviour. Speaks
+                at most once per turn and never once real audio has started.
+            long_turn_message_after_s: Seconds to wait before the
+                ``long_turn_message`` filler fires. Default ``4.0``.
             tools: List of ``Tool`` instances (build with the ``tool()`` factory).
             stt: ``STTProvider`` instance for pipeline mode (e.g.
                 ``DeepgramSTT(api_key=...)``).
@@ -1659,6 +1670,8 @@ def agent(
             language=language,
             first_message=first_message,
             llm_error_message=llm_error_message,
+            long_turn_message=long_turn_message,
+            long_turn_message_after_s=long_turn_message_after_s,
             tools=tuple(tools_out) if tools_out is not None else None,
             provider=provider,
             stt=stt_resolved,
diff --git a/libraries/python/getpatter/llm/hermes.py b/libraries/python/getpatter/llm/hermes.py
index 0ebfa69..a0f8ece 100644
--- a/libraries/python/getpatter/llm/hermes.py
+++ b/libraries/python/getpatter/llm/hermes.py
@@ -18,9 +18,10 @@
 from __future__ import annotations
 
 import os
-from typing import ClassVar
+from typing import Callable, ClassVar
 
 from getpatter.llm.openai_compatible import OpenAICompatibleLLMProvider
+from getpatter.models import SessionContext
 
 __all__ = ["LLM"]
 
@@ -57,13 +58,26 @@ class LLM(OpenAICompatibleLLMProvider):
     * per-call continuity → ``X-Hermes-Session-Id: patter-call-<call_id>``
       (always sent with a call id — the primary mechanism)
     * long-term memory → ``X-Hermes-Session-Key: <session_key>`` (only sent
-      when ``session_key`` is configured)
+      when ``session_key`` / ``session_key_from`` / ``session_key_factory`` is
+      configured)
 
     Args:
-        session_key: Optional long-term memory scope. When set, every turn
-            emits ``X-Hermes-Session-Key: <session_key>`` so Hermes namespaces
-            persistent memory across calls. Credential-grade — never logged.
-            ``None`` (default) means the header is not sent.
+        session_key: Optional STATIC long-term memory scope. When set, every
+            turn emits ``X-Hermes-Session-Key: <session_key>`` so Hermes
+            namespaces persistent memory across calls. Credential-grade — never
+            logged. ``None`` (default) means the header is not sent.
+        session_key_from: Convenience selector for a built-in per-call key
+            derivation. Set to ``"caller_hash"`` to derive the session key per
+            call as ``f"patter-caller-{ctx.caller_hash}"`` (a stable,
+            non-reversible hash of the caller — never the raw number), enabling
+            per-caller cross-call memory. ``None`` (default) uses the static
+            ``session_key`` path; any other value raises ``ValueError``. An
+            explicit ``session_key_factory`` overrides the derived key.
+        session_key_factory: Custom callable deriving the
+            ``X-Hermes-Session-Key`` value per call from a
+            :class:`getpatter.models.SessionContext`. Takes precedence over both
+            ``session_key`` and ``session_key_from``. A falsy return omits the
+            header for that call. Credential-grade — never logged.
     """
 
     provider_key: ClassVar[str] = "hermes"
@@ -76,11 +90,28 @@ def __init__(
         model: str | None = None,
         timeout: float = 120.0,
         session_key: str | None = None,
+        session_key_from: str | None = None,
+        session_key_factory: Callable[[SessionContext], str | None] | None = None,
         **kwargs,
     ) -> None:
         resolved_model = model or os.environ.get(
             "API_SERVER_MODEL_NAME", _DEFAULT_MODEL
         )
+        # ``session_key_from="caller_hash"`` installs a default factory that
+        # scopes durable memory per caller via the non-reversible caller hash
+        # (never the raw number). An explicit ``session_key_factory`` always
+        # wins over this convenience selector.
+        if session_key_factory is None and session_key_from == "caller_hash":
+            session_key_factory = (
+                lambda ctx: f"patter-caller-{ctx.caller_hash}"
+                if ctx.caller_hash
+                else None
+            )
+        elif session_key_from is not None and session_key_from != "caller_hash":
+            raise ValueError(
+                "session_key_from must be 'caller_hash' or None, "
+                f"got {session_key_from!r}"
+            )
         super().__init__(
             api_key=api_key,
             base_url=base_url,
@@ -92,5 +123,6 @@ def __init__(
             session_id_prefix=_SESSION_ID_PREFIX,
             session_key_header=_SESSION_KEY_HEADER,
             session_key=session_key,
+            session_key_factory=session_key_factory,
             **kwargs,
         )
diff --git a/libraries/python/getpatter/llm/openai_compatible.py b/libraries/python/getpatter/llm/openai_compatible.py
index b3d02c6..0825c30 100644
--- a/libraries/python/getpatter/llm/openai_compatible.py
+++ b/libraries/python/getpatter/llm/openai_compatible.py
@@ -45,8 +45,9 @@
 import asyncio
 import logging
 import os
-from typing import Any, AsyncIterator, ClassVar
+from typing import Any, AsyncIterator, Callable, ClassVar
 
+from getpatter.models import SessionContext, hash_caller
 from getpatter.services.llm_loop import OpenAILLMProvider
 
 __all__ = ["OpenAICompatibleLLMProvider", "LLM"]
@@ -101,6 +102,17 @@ class OpenAICompatibleLLMProvider(OpenAILLMProvider):
             credential-grade memory scope and is NEVER logged. ``None``
             (default) means the header is omitted even if
             ``session_key_header`` is set.
+        session_key_factory: Optional callable that derives the
+            ``session_key_header`` VALUE per call from a
+            :class:`getpatter.models.SessionContext` (carrying ``call_id`` /
+            ``caller`` / ``callee`` / ``caller_hash``). When set it takes
+            PRECEDENCE over the static ``session_key``: at request-build time
+            the factory is called and its return value is emitted in
+            ``session_key_header``. A falsy return (``None`` / ``""``) omits the
+            header for that call. The static ``session_key`` remains the simple
+            fallback used when no factory is configured. The returned value is a
+            credential-grade memory scope and is NEVER logged. ``None``
+            (default) means the static path is used.
         **kwargs: Sampling kwargs forwarded to :class:`OpenAILLMProvider`.
     """
 
@@ -121,6 +133,7 @@ def __init__(
         session_id_prefix: str = "",
         session_key_header: str | None = None,
         session_key: str | None = None,
+        session_key_factory: Callable[[SessionContext], str | None] | None = None,
         **kwargs,
     ) -> None:
         try:
@@ -155,6 +168,9 @@ def __init__(
         self._session_key_header = session_key_header
         # Credential-grade memory scope — never logged.
         self._session_key = session_key
+        # When set, derives the session_key_header value per call (caller-hash,
+        # etc.) and overrides the static session_key. Never logged.
+        self._session_key_factory = session_key_factory
 
     async def warmup(self) -> None:
         """Pre-call DNS / TLS warmup that omits ``Authorization`` for keyless gateways.
@@ -198,21 +214,50 @@ def _record_completion_cost(
         except Exception:  # pragma: no cover — defense in depth
             logger.debug("_record_completion_cost failed", exc_info=True)
 
+    def _resolve_session_key(
+        self,
+        *,
+        call_id: str | None,
+        caller: str | None,
+        callee: str | None,
+    ) -> str | None:
+        """Resolve the ``session_key_header`` VALUE for this call.
+
+        When a ``session_key_factory`` is configured it is called with a
+        :class:`SessionContext` (the raw ``caller`` plus its non-reversible
+        :func:`hash_caller`) and its return value wins — a falsy return omits
+        the header. Otherwise the static ``session_key`` is used. The result is
+        a credential-grade memory scope and is never logged.
+        """
+        if self._session_key_factory is not None:
+            ctx = SessionContext(
+                call_id=call_id,
+                caller=caller,
+                callee=callee,
+                caller_hash=hash_caller(caller),
+            )
+            return self._session_key_factory(ctx)
+        return self._session_key
+
     def _build_completion_kwargs(
         self,
         messages: list[dict],
         tools: list[dict] | None,
         *,
         call_id: str | None = None,
+        caller: str | None = None,
+        callee: str | None = None,
     ) -> dict[str, Any]:
         """Assemble ``chat.completions.create`` kwargs, adding session continuity.
 
         Extends the parent builder with up to three INDEPENDENT, opt-in
         session signals — the OpenAI ``user`` field, a per-call session-id
-        header, and a static memory-scope header. Each is gated separately, so
-        e.g. a runtime can take the per-call header without the ``user`` field.
-        Per-call signals require a ``call_id``; the memory-scope header does
-        not. When none applies the result is byte-identical to the parent
+        header, and a memory-scope header (static ``session_key`` OR a per-call
+        value from ``session_key_factory``). Each is gated separately, so e.g. a
+        runtime can take the per-call header without the ``user`` field.
+        Per-call ``user`` / session-id signals require a ``call_id``; the
+        memory-scope header does not (a factory may key off the caller hash
+        alone). When none applies the result is byte-identical to the parent
         (no ``user``, no ``extra_headers``).
         """
         kwargs = super()._build_completion_kwargs(messages, tools)
@@ -221,11 +266,16 @@ def _build_completion_kwargs(
             kwargs["user"] = f"{self._session_user_prefix}{call_id}"
         if self._session_id_header is not None and call_id:
             extra[self._session_id_header] = f"{self._session_id_prefix}{call_id}"
-        if self._session_key_header is not None and self._session_key:
+        if self._session_key_header is not None:
             # Truthy check (not ``is not None``): an empty-string session key is
             # not a meaningful memory scope — treat it as unset rather than
-            # emitting a confusing empty header on the wire.
-            extra[self._session_key_header] = self._session_key
+            # emitting a confusing empty header on the wire. The factory (when
+            # configured) takes precedence over the static session_key.
+            session_key_value = self._resolve_session_key(
+                call_id=call_id, caller=caller, callee=callee
+            )
+            if session_key_value:
+                extra[self._session_key_header] = session_key_value
         if extra:
             # Merge over any pre-existing extra_headers (the parent never sets
             # this today, but the spread keeps it future-safe and clobber-free).
@@ -239,15 +289,21 @@ async def stream(
         *,
         cancel_event: asyncio.Event | None = None,
         call_id: str | None = None,
+        caller: str | None = None,
+        callee: str | None = None,
     ) -> AsyncIterator[dict]:
-        """Stream chunks, threading ``call_id`` into the session continuity fields.
-
-        Mirrors :meth:`OpenAILLMProvider.stream` but routes ``call_id`` into
-        ``_build_completion_kwargs`` so the per-call ``user`` / session header
-        are emitted. ``call_id`` is optional — unset means the parent-identical
-        no-session path.
+        """Stream chunks, threading per-call context into the session fields.
+
+        Mirrors :meth:`OpenAILLMProvider.stream` but routes ``call_id`` (plus
+        ``caller`` / ``callee`` when a ``session_key_factory`` needs them) into
+        ``_build_completion_kwargs`` so the per-call ``user`` / session headers
+        are emitted. All three are optional — unset means the parent-identical
+        no-session path. ``caller`` is used only to compute the session-key
+        scope (and its non-reversible hash); it is never logged here.
         """
-        kwargs = self._build_completion_kwargs(messages, tools, call_id=call_id)
+        kwargs = self._build_completion_kwargs(
+            messages, tools, call_id=call_id, caller=caller, callee=callee
+        )
         response = await self._client.chat.completions.create(**kwargs)
 
         last_usage = None
diff --git a/libraries/python/getpatter/models.py b/libraries/python/getpatter/models.py
index f22c21b..bfdb0cc 100644
--- a/libraries/python/getpatter/models.py
+++ b/libraries/python/getpatter/models.py
@@ -10,6 +10,7 @@
 from __future__ import annotations
 
 import asyncio
+import hashlib
 import logging
 import re
 from dataclasses import dataclass, field
@@ -393,6 +394,47 @@ def __post_init__(self) -> None:
             )
 
 
+def hash_caller(caller: str | None) -> str | None:
+    """Stable, non-reversible 16-char hash of a caller for session scoping.
+
+    Used to derive a per-caller memory namespace (e.g. an agent runtime's
+    session key) WITHOUT ever exposing the raw phone number — the call site
+    keys cross-call memory off the hash, never the number itself. Returns the
+    first 16 hex chars of the SHA-256 digest of the UTF-8 ``caller`` string, or
+    ``None`` when ``caller`` is ``None`` / empty (no caller → no scope). The
+    16-char (64-bit) truncation is plenty for namespacing while keeping the
+    emitted header value compact; it is NOT a security primitive (a phone
+    number has too little entropy to make the digest a secret) — its only job
+    is to avoid putting the raw number on the wire / in logs.
+    """
+    if not caller:
+        return None
+    return hashlib.sha256(caller.encode("utf-8")).hexdigest()[:16]
+
+
+@dataclass(frozen=True)
+class SessionContext:
+    """Per-call context handed to a ``session_key_factory``.
+
+    A session-aware LLM provider (e.g. :class:`getpatter.llm.hermes.LLM`) can
+    derive its memory-scope header value per call from this — most usefully
+    from :attr:`caller_hash`, a stable non-reversible hash of the caller, so
+    one phone number maps to one durable memory namespace across calls without
+    the raw number ever being emitted or logged.
+
+    All fields are optional: ``call_id`` / ``caller`` / ``callee`` are present
+    when the call provides them; ``caller_hash`` is :func:`hash_caller` of
+    ``caller`` (``None`` when there is no caller). The raw ``caller`` is carried
+    here only so a factory CAN re-derive its own scope — it must never be put on
+    the wire or logged beyond what already exists.
+    """
+
+    call_id: str | None = None
+    caller: str | None = None
+    callee: str | None = None
+    caller_hash: str | None = None
+
+
 @dataclass(frozen=True)
 class Agent:
     """Configuration for a local-mode voice AI agent.
@@ -429,6 +471,22 @@ class Agent:
     # behaviour: nothing is spoken on LLM error. Pipeline mode only —
     # Realtime / ConvAI surface provider errors on their own audio path.
     llm_error_message: str | None = None
+    # Opt-in spoken filler for pipeline mode when an LLM turn is SLOW (distinct
+    # from ``llm_error_message``, which fires on an ERROR). Agent-runtime
+    # providers (Hermes / OpenClaw) run tools / memory / skills internally, so a
+    # turn can take many seconds before the first word is spoken — the caller
+    # hears dead silence. When set to a non-empty string and the turn has
+    # produced NO audio after ``long_turn_message_after_s`` seconds, the SDK
+    # synthesizes this line ONCE through the normal TTS turn lifecycle (subject
+    # to barge-in) to fill the gap. It never fires once real audio has started
+    # this turn, and never double-speaks. ``None`` (default) keeps today's
+    # behaviour: nothing is spoken while a slow turn runs. Pipeline mode only.
+    long_turn_message: str | None = None
+    # Seconds to wait after the turn starts before the
+    # ``long_turn_message`` filler fires (only consulted when
+    # ``long_turn_message`` is set and no audio has reached the carrier yet).
+    # Default ``4.0`` s.
+    long_turn_message_after_s: float = 4.0
     tools: tuple[dict, ...] | None = None
     provider: ProviderMode = "openai_realtime"
     stt: STTConfig | None = None  # which STT provider to use in pipeline mode
diff --git a/libraries/python/getpatter/services/llm_loop.py b/libraries/python/getpatter/services/llm_loop.py
index 2dc52c3..fb4c9f4 100644
--- a/libraries/python/getpatter/services/llm_loop.py
+++ b/libraries/python/getpatter/services/llm_loop.py
@@ -37,38 +37,59 @@
 logger = logging.getLogger("getpatter")
 
 
-# Per-provider-TYPE memo of whether ``stream`` accepts a ``call_id`` keyword.
+# Per-call-context kwargs the loop MAY thread into ``provider.stream`` — but
+# only those the provider's signature actually declares (or absorbs via
+# ``**kwargs``). ``call_id`` predates ``caller`` / ``callee``: a provider that
+# only declares ``call_id`` (every built-in before the session_key_factory
+# feature) keeps getting just ``call_id`` and is unaffected by the additions.
+_CALL_CONTEXT_STREAM_KWARGS = ("call_id", "caller", "callee")
+
+# Per-provider-TYPE memo of which call-context kwargs ``stream`` accepts.
 # Built-in providers declare ``call_id`` (or ``**kwargs``) and hit the fast
 # path after the first call; a user's minimal custom provider whose ``stream``
 # is ``(self, messages, tools=None, *, cancel_event=None)`` is detected once and
-# called WITHOUT ``call_id`` thereafter — otherwise it would raise TypeError.
-_provider_accepts_call_id: dict[type, bool] = {}
+# called WITHOUT any of these thereafter — otherwise it would raise TypeError.
+_provider_accepted_stream_kwargs: dict[type, frozenset[str]] = {}
 
 
-def _stream_accepts_call_id(provider: object) -> bool:
-    """Whether ``provider.stream`` tolerates a ``call_id`` keyword argument.
+def _stream_accepted_context_kwargs(provider: object) -> frozenset[str]:
+    """Which of :data:`_CALL_CONTEXT_STREAM_KWARGS` ``provider.stream`` tolerates.
 
-    True when the signature declares a parameter named ``call_id`` OR accepts
-    ``**kwargs`` (``VAR_KEYWORD``). Cached per provider type to keep the hot
-    path cheap. Some callables (C-level, ``functools.partial`` without
-    ``__wrapped__``) refuse introspection — those default to ``False`` so the
-    safe no-``call_id`` path is taken rather than risking a new crash site.
+    A name is accepted when the signature declares a parameter of that name OR
+    the signature accepts ``**kwargs`` (``VAR_KEYWORD``), in which case ALL of
+    them are accepted. Cached per provider type to keep the hot path cheap. Some
+    callables (C-level, ``functools.partial`` without ``__wrapped__``) refuse
+    introspection — those default to the empty set so the safe no-context path
+    is taken rather than risking a new crash site.
     """
     provider_type = type(provider)
-    cached = _provider_accepts_call_id.get(provider_type)
+    cached = _provider_accepted_stream_kwargs.get(provider_type)
     if cached is not None:
         return cached
-    accepts = False
+    accepted: set[str] = set()
     try:
         sig = inspect.signature(provider.stream)
         for param in sig.parameters.values():
-            if param.name == "call_id" or param.kind is inspect.Parameter.VAR_KEYWORD:
-                accepts = True
+            if param.kind is inspect.Parameter.VAR_KEYWORD:
+                accepted = set(_CALL_CONTEXT_STREAM_KWARGS)
                 break
+            if param.name in _CALL_CONTEXT_STREAM_KWARGS:
+                accepted.add(param.name)
     except (ValueError, TypeError):  # pragma: no cover - exotic callables
-        accepts = False
-    _provider_accepts_call_id[provider_type] = accepts
-    return accepts
+        accepted = set()
+    result = frozenset(accepted)
+    _provider_accepted_stream_kwargs[provider_type] = result
+    return result
+
+
+def _stream_accepts_call_id(provider: object) -> bool:
+    """Whether ``provider.stream`` tolerates a ``call_id`` keyword argument.
+
+    Back-compat shim around :func:`_stream_accepted_context_kwargs` (some tests
+    and external callers still reference this). True when ``call_id`` is among
+    the accepted call-context kwargs.
+    """
+    return "call_id" in _stream_accepted_context_kwargs(provider)
 
 
 # ---------------------------------------------------------------------------
@@ -961,24 +982,32 @@ async def run(
             _span_cm.__enter__()
             _span_exc_info: tuple = (None, None, None)
             try:
-                # Only thread ``call_id`` into providers whose ``stream``
-                # accepts it (or ``**kwargs``). A user's minimal custom provider
-                # with ``(messages, tools=None, *, cancel_event=None)`` would
-                # otherwise raise TypeError on the added keyword. ``cancel_event``
-                # predates this and every Protocol implementer tolerates it.
-                if _stream_accepts_call_id(self._provider):
-                    stream_iter = self._provider.stream(
-                        messages,
-                        self._openai_tools,
-                        cancel_event=cancel_event,
-                        call_id=call_context.get("call_id"),
-                    )
-                else:
-                    stream_iter = self._provider.stream(
-                        messages,
-                        self._openai_tools,
-                        cancel_event=cancel_event,
-                    )
+                # Thread only the per-call context kwargs the provider's
+                # ``stream`` actually declares (or absorbs via ``**kwargs``). A
+                # provider that declares just ``call_id`` keeps getting only
+                # ``call_id``; one that also declares ``caller`` / ``callee``
+                # (e.g. the OpenAI-compatible provider with a session_key_factory)
+                # gets those too; a minimal custom provider with neither gets
+                # none. Each value is only included when present in
+                # ``call_context``. ``cancel_event`` predates this and every
+                # Protocol implementer tolerates it.
+                accepted = _stream_accepted_context_kwargs(self._provider)
+                context_kwargs = {
+                    name: call_context[name]
+                    for name in _CALL_CONTEXT_STREAM_KWARGS
+                    if name in accepted and name in call_context
+                }
+                # ``call_id`` is threaded even when absent (value None) to
+                # preserve the prior contract where a session-aware provider was
+                # always handed ``call_id=<value-or-None>``.
+                if "call_id" in accepted and "call_id" not in context_kwargs:
+                    context_kwargs["call_id"] = call_context.get("call_id")
+                stream_iter = self._provider.stream(
+                    messages,
+                    self._openai_tools,
+                    cancel_event=cancel_event,
+                    **context_kwargs,
+                )
                 async for chunk in stream_iter:
                     chunk_type = chunk.get("type")
 
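Reviewer note on the dispatch above: the signature-gated threading can be reproduced outside the SDK. The following is a minimal sketch, not the patch's code. It assumes the context-kwarg tuple is `("call_id", "caller", "callee")` (the real `_CALL_CONTEXT_STREAM_KWARGS` is defined outside this hunk), it omits the per-provider-type cache, and `accepted_context_kwargs` / `build_context_kwargs` are hypothetical names.

```python
import inspect

# Assumed contents of the module-level tuple the patch calls
# _CALL_CONTEXT_STREAM_KWARGS; the real definition is not shown in this hunk.
CONTEXT_KWARGS = ("call_id", "caller", "callee")


def accepted_context_kwargs(stream_fn) -> frozenset:
    """Return the context kwargs that ``stream_fn`` declares or absorbs."""
    accepted = set()
    try:
        params = inspect.signature(stream_fn).parameters.values()
    except (ValueError, TypeError):
        # Exotic callables without an introspectable signature get nothing.
        return frozenset()
    for param in params:
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            # ``**kwargs`` absorbs every context kwarg.
            return frozenset(CONTEXT_KWARGS)
        if param.name in CONTEXT_KWARGS:
            accepted.add(param.name)
    return frozenset(accepted)


def build_context_kwargs(stream_fn, call_context: dict) -> dict:
    """Keep only the context values the callee can accept."""
    accepted = accepted_context_kwargs(stream_fn)
    kwargs = {
        name: call_context[name]
        for name in CONTEXT_KWARGS
        if name in accepted and name in call_context
    }
    # Mirror the patch: a call_id-aware provider always receives call_id,
    # even when the value is None.
    if "call_id" in accepted and "call_id" not in kwargs:
        kwargs["call_id"] = call_context.get("call_id")
    return kwargs


# Hypothetical provider signatures (plain functions; only the signature matters).
def minimal(messages, tools=None, *, cancel_event=None): ...
def call_id_only(messages, tools=None, *, cancel_event=None, call_id=None): ...
def absorbing(messages, tools=None, **kwargs): ...
```

With a context of `{"call_id": "c9", "caller": "+15555550100"}`, `minimal` receives no extra kwargs, `call_id_only` receives only `call_id`, and `absorbing` receives both values.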
diff --git a/libraries/python/getpatter/stream_handler.py b/libraries/python/getpatter/stream_handler.py
index a3b9e12..7967e54 100644
--- a/libraries/python/getpatter/stream_handler.py
+++ b/libraries/python/getpatter/stream_handler.py
@@ -3105,6 +3105,70 @@ async def _synthesize_sentence(
             self.audio_sender.reset_pcm_carry()
         return True
 
+    def _schedule_long_turn_filler(
+        self,
+        first_tts_chunk: list,
+        hook_executor: PipelineHookExecutor,
+        hook_ctx: HookContext,
+    ) -> "asyncio.Task | None":
+        """Spawn the opt-in long-turn filler task, or ``None`` when disabled.
+
+        Returns ``None`` (no task) when ``agent.long_turn_message`` is unset /
+        empty — the default, byte-identical to today's behaviour. Otherwise
+        returns a task that waits ``agent.long_turn_message_after_s`` seconds and
+        then, IFF no audio has reached the carrier this turn
+        (``first_tts_chunk[0]`` still ``True``) AND we still own the floor
+        (``self._is_speaking``), synthesizes the filler ONCE via
+        ``_synthesize_sentence``. Guards strictly on "no audio emitted yet" so it
+        cannot double-speak; if the filler's own synthesis fails, it stays silent.
+        """
+        message = getattr(self.agent, "long_turn_message", None)
+        if not message:
+            return None
+        after_s = getattr(self.agent, "long_turn_message_after_s", 4.0)
+
+        async def _filler() -> None:
+            try:
+                await asyncio.sleep(after_s)
+            except asyncio.CancelledError:
+                # Cancelled before firing (real audio started / turn ended).
+                raise
+            # Fire at most once, only if the caller still heard SILENCE this
+            # turn and we still hold the floor (no concurrent barge-in).
+            if first_tts_chunk[0] and self._is_speaking:
+                try:
+                    await self._synthesize_sentence(
+                        message, hook_executor, hook_ctx, first_tts_chunk
+                    )
+                except asyncio.CancelledError:
+                    raise
+                except Exception:  # pragma: no cover - defensive
+                    logger.exception("long_turn_message filler synthesis failed")
+
+        return asyncio.create_task(_filler())
+
+    async def _cancel_long_turn_filler(
+        self, task: "asyncio.Task | None"
+    ) -> None:
+        """Cancel the long-turn filler task and await its teardown.
+
+        Idempotent and race-safe: a ``None`` / already-finished task is a no-op,
+        ``CancelledError`` from the cancel is suppressed, and any exception the
+        task raised before cancellation is swallowed (already logged inside the
+        task). Returns ``None`` so callers can reassign the handle in one line.
+        """
+        if task is None:
+            return None
+        if not task.done():
+            task.cancel()
+        try:
+            await task
+        except asyncio.CancelledError:
+            pass
+        except Exception:  # pragma: no cover - defensive
+            logger.debug("long_turn_message filler task ended with error", exc_info=True)
+        return None
+
     async def _process_streaming_response(self, result, call_id: str) -> str:
         """Process a streaming (async generator) response through TTS with sentence chunking."""
         chunker = SentenceChunker(
@@ -3120,6 +3184,17 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
         hook_executor = PipelineHookExecutor(hooks)
         hook_ctx = self._build_hook_context()
 
+        # Opt-in long-turn filler: when the turn is SLOW (agent runtime running
+        # tools/memory) and NO audio has reached the carrier yet, speak a short
+        # filler instead of dead silence. Distinct from ``llm_error_message``
+        # (that fires on an LLM ERROR; this fires on SLOWNESS). The task waits
+        # ``long_turn_message_after_s`` then, IFF still no audio this turn AND we
+        # still own the floor, synthesizes the filler ONCE. Cancelled just before
+        # real audio is synthesized, on the error branch, and in the ``finally``.
+        long_turn_task = self._schedule_long_turn_filler(
+            first_tts_chunk, hook_executor, hook_ctx
+        )
+
         # Reset the per-turn LLM cancel event so a stale cancel from a
         # previous turn cannot terminate this stream prematurely.  The
         # event is *set* by ``_handle_barge_in`` to break out of the
@@ -3178,6 +3253,15 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
                                 continue  # hook dropped this sentence
                             sentence = transformed
 
+                        # Real audio is about to be synthesized: cancel the
+                        # long-turn filler so it can never fire (or double-speak)
+                        # once the agent's own reply has started. This is
+                        # race-safe because the helper awaits the task's
+                        # teardown, so the filler has fully stopped before the
+                        # synthesis call below begins.
+                        long_turn_task = await self._cancel_long_turn_filler(
+                            long_turn_task
+                        )
                         if not await self._synthesize_sentence(
                             sentence, hook_executor, hook_ctx, first_tts_chunk
                         ):
@@ -3190,6 +3274,9 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
                 llm_error = True
                 chunker.reset()  # discard partial content on LLM error
                 logger.exception("LLM streaming error: %s", exc)
+                # The turn errored — stop the filler so it cannot speak over the
+                # (distinct) error fallback below.
+                long_turn_task = await self._cancel_long_turn_filler(long_turn_task)
                 # Close the active turn as interrupted so the metrics accumulator
                 # does not leak an open turn when LLM throws mid-stream.
                 if self.metrics is not None and self.metrics.turn_active:
@@ -3240,12 +3327,19 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
                             continue
                         sentence = transformed
 
+                    # Real flushed audio about to play — cancel the filler.
+                    long_turn_task = await self._cancel_long_turn_filler(
+                        long_turn_task
+                    )
                     if not await self._synthesize_sentence(
                         sentence, hook_executor, hook_ctx, first_tts_chunk
                     ):
                         interrupted = True
                         break
         finally:
+            # Ensure the long-turn filler task never outlives the turn (clean
+            # cancellation, CancelledError suppressed inside the helper).
+            await self._cancel_long_turn_filler(long_turn_task)
             # Schedule the flip to idle. Keeps the speaking flag set during
             # the audio tail still playing on the carrier so STT echo on
             # the trailing samples doesn't look like a fresh user turn.
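Reviewer note on the filler lifecycle above: the schedule / cancel-and-await pattern can be exercised in isolation. This is a minimal sketch under stated assumptions, not the handler's code. `run_turn`, `speak`, and the `audio_started` flag are hypothetical stand-ins for `_process_streaming_response`, `_synthesize_sentence`, and `first_tts_chunk[0]`, and the barge-in guard (`_is_speaking`) is left out.

```python
import asyncio


async def run_turn(reply_delay_s: float, filler_after_s: float) -> list:
    """Simulate one turn and return the texts spoken, in order."""
    spoken = []
    audio_started = False  # stand-in for "audio has reached the caller"

    async def speak(text: str) -> None:
        nonlocal audio_started
        audio_started = True
        spoken.append(text)

    async def filler() -> None:
        await asyncio.sleep(filler_after_s)
        # Fire at most once, and only if the caller has heard nothing yet.
        if not audio_started:
            await speak("One moment.")

    task = asyncio.create_task(filler())

    # Stand-in for a slow LLM stream producing its first sentence.
    await asyncio.sleep(reply_delay_s)

    # Cancel the filler and wait for its teardown before speaking the real
    # sentence, so the two can never overlap.
    if not task.done():
        task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

    await speak("Here is your answer.")
    return spoken
```

A slow turn (`run_turn(0.05, 0.01)`) speaks the filler and then the reply; a fast turn (`run_turn(0.0, 0.5)`) speaks only the reply because the filler is cancelled while it is still sleeping.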
diff --git a/libraries/python/tests/test_llm_session_key_factory.py b/libraries/python/tests/test_llm_session_key_factory.py
new file mode 100644
index 0000000..4a160cc
--- /dev/null
+++ b/libraries/python/tests/test_llm_session_key_factory.py
@@ -0,0 +1,367 @@
+"""Tests for the per-call session-key factory (Feature #7).
+
+A ``session_key_factory`` derives the memory-scope header value per call from a
+:class:`SessionContext` (carrying ``caller`` + its non-reversible
+:func:`hash_caller`). The Hermes convenience ``session_key_from="caller_hash"``
+installs a default factory that scopes durable memory per caller WITHOUT the raw
+number ever reaching the wire.
+
+Real code throughout — the only mocked surface is the paid external boundary
+(``chat.completions.create``), tagged ``@pytest.mark.mocked``. The factory
+resolution, the SessionContext construction, and the caller threading through
+the REAL ``LLMLoop`` are all exercised against live code.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from getpatter.llm import hermes
+from getpatter.llm.openai_compatible import OpenAICompatibleLLMProvider
+from getpatter.models import SessionContext, hash_caller
+from getpatter.services.llm_loop import LLMLoop, _stream_accepted_context_kwargs
+
+
+# ---------------------------------------------------------------------------
+# hash_caller — stable, non-reversible, never the raw number
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.unit
+def test_hash_caller_is_stable_and_not_the_raw_number() -> None:
+    number = "+15555550100"
+    h1 = hash_caller(number)
+    h2 = hash_caller(number)
+    # Deterministic across calls.
+    assert h1 == h2
+    # 16 hex chars (64-bit truncation), and NOT the raw number.
+    assert len(h1) == 16
+    assert all(c in "0123456789abcdef" for c in h1)
+    assert number not in h1
+    assert h1 != number
+
+
+@pytest.mark.unit
+def test_hash_caller_distinguishes_different_callers() -> None:
+    assert hash_caller("+15555550100") != hash_caller("+15555550101")
+
+
+@pytest.mark.unit
+def test_hash_caller_none_or_empty_returns_none() -> None:
+    assert hash_caller(None) is None
+    assert hash_caller("") is None
+
+
+@pytest.mark.unit
+def test_session_context_defaults_are_all_none() -> None:
+    ctx = SessionContext()
+    assert ctx.call_id is None
+    assert ctx.caller is None
+    assert ctx.callee is None
+    assert ctx.caller_hash is None
+    # Frozen (immutable public config).
+    with pytest.raises(Exception):
+        ctx.caller = "x"  # type: ignore[misc]
+
+
+# ---------------------------------------------------------------------------
+# Factory precedence on the generic provider
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.unit
+def test_factory_overrides_static_session_key_and_sees_caller_hash() -> None:
+    seen: dict = {}
+
+    def factory(ctx: SessionContext) -> str:
+        seen["ctx"] = ctx
+        return f"scope-{ctx.caller_hash}"
+
+    provider = OpenAICompatibleLLMProvider(
+        base_url="http://127.0.0.1:9/v1",
+        model="m",
+        session_key_header="X-Mem",
+        session_key="static-key",  # must be overridden by the factory
+        session_key_factory=factory,
+    )
+    kwargs = provider._build_completion_kwargs(
+        [{"role": "user", "content": "hi"}],
+        None,
+        call_id="c1",
+        caller="+15555550100",
+        callee="+15555550101",
+    )
+    expected_hash = hash_caller("+15555550100")
+    assert kwargs["extra_headers"]["X-Mem"] == f"scope-{expected_hash}"
+    # The factory saw the full SessionContext, including the raw caller and the
+    # callee — but the EMITTED value carries only the hash.
+    ctx = seen["ctx"]
+    assert ctx.call_id == "c1"
+    assert ctx.caller == "+15555550100"
+    assert ctx.callee == "+15555550101"
+    assert ctx.caller_hash == expected_hash
+
+
+@pytest.mark.unit
+def test_factory_returning_none_omits_the_header() -> None:
+    provider = OpenAICompatibleLLMProvider(
+        base_url="http://127.0.0.1:9/v1",
+        model="m",
+        session_key_header="X-Mem",
+        session_key="static-key",
+        session_key_factory=lambda ctx: None,
+    )
+    kwargs = provider._build_completion_kwargs(
+        [{"role": "user", "content": "hi"}], None, call_id="c1", caller="+15555550100"
+    )
+    # Factory returned falsy => header omitted entirely (no extra_headers at all
+    # here, since nothing else is configured).
+    assert "extra_headers" not in kwargs
+
+
+@pytest.mark.unit
+def test_static_session_key_used_when_no_factory() -> None:
+    provider = OpenAICompatibleLLMProvider(
+        base_url="http://127.0.0.1:9/v1",
+        model="m",
+        session_key_header="X-Mem",
+        session_key="static-key",
+    )
+    kwargs = provider._build_completion_kwargs(
+        [{"role": "user", "content": "hi"}], None, call_id="c1", caller="+15555550100"
+    )
+    assert kwargs["extra_headers"] == {"X-Mem": "static-key"}
+
+
+@pytest.mark.unit
+def test_factory_fires_even_without_call_id() -> None:
+    """The memory-scope header does not depend on the call id: a factory keying
+    off the caller hash alone still produces a header when ``call_id`` is None."""
+    provider = OpenAICompatibleLLMProvider(
+        base_url="http://127.0.0.1:9/v1",
+        model="m",
+        session_key_header="X-Mem",
+        session_key_factory=lambda ctx: f"caller-{ctx.caller_hash}",
+    )
+    kwargs = provider._build_completion_kwargs(
+        [{"role": "user", "content": "hi"}], None, call_id=None, caller="+15555550100"
+    )
+    assert kwargs["extra_headers"]["X-Mem"] == f"caller-{hash_caller('+15555550100')}"
+
+
+# ---------------------------------------------------------------------------
+# Hermes convenience: session_key_from="caller_hash"
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.unit
+def test_hermes_session_key_from_installs_caller_hash_factory() -> None:
+    llm = hermes.LLM(session_key_from="caller_hash")
+    kwargs = llm._build_completion_kwargs(
+        [{"role": "user", "content": "hi"}],
+        None,
+        call_id="hid-1",
+        caller="+15555550100",
+    )
+    expected = f"patter-caller-{hash_caller('+15555550100')}"
+    assert kwargs["extra_headers"]["X-Hermes-Session-Key"] == expected
+    # Per-call session id still flows alongside the memory scope.
+    assert kwargs["extra_headers"]["X-Hermes-Session-Id"] == "patter-call-hid-1"
+
+
+@pytest.mark.unit
+def test_hermes_session_key_from_omits_header_without_caller() -> None:
+    llm = hermes.LLM(session_key_from="caller_hash")
+    kwargs = llm._build_completion_kwargs(
+        [{"role": "user", "content": "hi"}], None, call_id="hid-1", caller=None
+    )
+    # No caller => no caller_hash => the default factory returns None => header
+    # omitted. The per-call session-id header is still present.
+    assert "X-Hermes-Session-Key" not in kwargs["extra_headers"]
+    assert kwargs["extra_headers"]["X-Hermes-Session-Id"] == "patter-call-hid-1"
+
+
+@pytest.mark.unit
+def test_hermes_explicit_factory_wins_over_session_key_from() -> None:
+    llm = hermes.LLM(
+        session_key_from="caller_hash",
+        session_key_factory=lambda ctx: "custom-scope",
+    )
+    kwargs = llm._build_completion_kwargs(
+        [{"role": "user", "content": "hi"}], None, call_id="hid-1", caller="+15555550100"
+    )
+    assert kwargs["extra_headers"]["X-Hermes-Session-Key"] == "custom-scope"
+
+
+@pytest.mark.unit
+def test_hermes_rejects_unknown_session_key_from() -> None:
+    with pytest.raises(ValueError, match="caller_hash"):
+        hermes.LLM(session_key_from="something-else")
+
+
+# ---------------------------------------------------------------------------
+# Caller threads through the REAL LLMLoop into the provider's stream()
+# ---------------------------------------------------------------------------
+
+
+class _CallerRecordingProvider:
+    """Records the caller/callee/call_id it was streamed with."""
+
+    def __init__(self) -> None:
+        self.seen: dict = {}
+
+    async def stream(
+        self, messages, tools=None, *, cancel_event=None, call_id=None, caller=None, callee=None
+    ):
+        self.seen = {"call_id": call_id, "caller": caller, "callee": callee}
+        yield {"type": "text", "content": "ok"}
+
+
+class _CallIdOnlyProvider:
+    """Older provider that only declares call_id — must NOT receive caller."""
+
+    def __init__(self) -> None:
+        self.seen_kwargs: object = "<<unset>>"
+
+    async def stream(self, messages, tools=None, *, cancel_event=None, call_id=None):
+        self.seen_kwargs = call_id
+        yield {"type": "text", "content": "ok"}
+
+
+def _make_loop(provider) -> LLMLoop:
+    loop = LLMLoop.__new__(LLMLoop)
+    loop._provider = provider
+    loop._system_prompt = "You are a test assistant."
+    loop._tools = None
+    loop._tool_executor = None
+    loop._metrics = None
+    loop._event_bus = None
+    loop._model = "fake-model"
+    loop._provider_name = "fake"
+    loop._openai_tools = None
+    loop._tool_map = {}
+    loop._on_tool_call = None
+    loop._usage_missing_count = 0
+    loop._logged_usage_fallback = False
+    return loop
+
+
+@pytest.mark.unit
+async def test_caller_callee_thread_through_loop_into_provider() -> None:
+    provider = _CallerRecordingProvider()
+    loop = _make_loop(provider)
+    async for _ in loop.run(
+        "Hi", [], {"call_id": "c9", "caller": "+15555550100", "callee": "+15555550101"}
+    ):
+        pass
+    assert provider.seen == {
+        "call_id": "c9",
+        "caller": "+15555550100",
+        "callee": "+15555550101",
+    }
+
+
+@pytest.mark.unit
+def test_signature_guard_classifies_caller_aware_provider() -> None:
+    accepted = _stream_accepted_context_kwargs(_CallerRecordingProvider())
+    assert accepted == frozenset({"call_id", "caller", "callee"})
+    # An older call_id-only provider is not handed caller/callee.
+    assert _stream_accepted_context_kwargs(_CallIdOnlyProvider()) == frozenset(
+        {"call_id"}
+    )
+
+
+@pytest.mark.unit
+async def test_call_id_only_provider_never_receives_caller() -> None:
+    """A provider that declares only call_id must keep working when the loop has
+    caller/callee in context — it gets call_id only, never the new kwargs."""
+    provider = _CallIdOnlyProvider()
+    loop = _make_loop(provider)
+    async for _ in loop.run(
+        "Hi", [], {"call_id": "c9", "caller": "+15555550100", "callee": "+15555550101"}
+    ):
+        pass
+    assert provider.seen_kwargs == "c9"
+
+
+# ---------------------------------------------------------------------------
+# Wire-level — mocks ONLY the paid boundary (chat.completions.create).
+# ---------------------------------------------------------------------------
+
+
+class _Choice:
+    def __init__(self, content) -> None:
+        self.delta = type("D", (), {"content": content, "tool_calls": None})()
+
+
+class _Chunk:
+    def __init__(self, content) -> None:
+        self.choices = [_Choice(content)]
+        self.usage = None
+
+
+class _FakeStream:
+    def __init__(self, chunks) -> None:
+        self._chunks = chunks
+
+    def __aiter__(self):
+        return self._gen()
+
+    async def _gen(self):
+        for chunk in self._chunks:
+            yield chunk
+
+    async def close(self) -> None:  # pragma: no cover - not exercised
+        pass
+
+
+@pytest.mark.mocked
+async def test_hermes_caller_hash_reaches_the_wire() -> None:
+    """End-to-end on the wire: Hermes(session_key_from='caller_hash') emits
+    X-Hermes-Session-Key=patter-caller-<hash> where <hash>=hash_caller(caller),
+    and the raw caller is NEVER in the header value."""
+    llm = hermes.LLM(session_key_from="caller_hash")
+    captured: dict = {}
+
+    async def fake_create(**kwargs):
+        captured.update(kwargs)
+        return _FakeStream([_Chunk("ok")])
+
+    llm._client.chat.completions.create = fake_create
+
+    caller = "+15555550100"
+    async for _ in llm.stream(
+        [{"role": "user", "content": "hi"}], None, call_id="hid-1", caller=caller
+    ):
+        pass
+
+    headers = captured["extra_headers"]
+    expected = f"patter-caller-{hash_caller(caller)}"
+    assert headers["X-Hermes-Session-Key"] == expected
+    # The raw number is never on the wire in the memory-scope header.
+    assert caller not in headers["X-Hermes-Session-Key"]
+
+
+@pytest.mark.mocked
+async def test_custom_factory_overrides_static_on_the_wire() -> None:
+    provider = OpenAICompatibleLLMProvider(
+        base_url="http://127.0.0.1:9/v1",
+        model="m",
+        session_key_header="X-Mem",
+        session_key="static-key",
+        session_key_factory=lambda ctx: f"dyn-{ctx.caller_hash}",
+    )
+    captured: dict = {}
+
+    async def fake_create(**kwargs):
+        captured.update(kwargs)
+        return _FakeStream([_Chunk("ok")])
+
+    provider._client.chat.completions.create = fake_create
+
+    async for _ in provider.stream(
+        [{"role": "user", "content": "hi"}], None, call_id="c1", caller="+15555550100"
+    ):
+        pass
+
+    assert captured["extra_headers"]["X-Mem"] == f"dyn-{hash_caller('+15555550100')}"
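Reviewer note on the tests above: they pin `hash_caller` to a deterministic 16-character lowercase hex string (a 64-bit truncation), returning `None` for a missing or empty caller, and the commit message says the hash is SHA-256 based. The sketch below is one function that would satisfy those assertions. It is an assumption, not the SDK's implementation: whether the real helper salts or normalises the number is not visible in this patch, and both function names here are hypothetical.

```python
import hashlib
from typing import Optional


def hash_caller_sketch(caller: Optional[str]) -> Optional[str]:
    """Truncated SHA-256 of the caller number, or None when there is no caller."""
    if not caller:
        return None
    # First 16 hex characters = 64 bits of the digest.
    return hashlib.sha256(caller.encode("utf-8")).hexdigest()[:16]


def caller_hash_session_key(caller: Optional[str]) -> Optional[str]:
    """Build a memory-scope key in the ``patter-caller-<hash>`` shape the tests expect."""
    digest = hash_caller_sketch(caller)
    # A falsy return means "omit the header", matching the factory contract.
    return f"patter-caller-{digest}" if digest else None
```

If the real helper is also unsalted, the key keeps the raw number off the wire and out of logs, but it would not resist a brute-force search over the phone-number space, so it is best treated as a pseudonym rather than a secret.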
diff --git a/libraries/python/tests/unit/test_long_turn_filler.py b/libraries/python/tests/unit/test_long_turn_filler.py
new file mode 100644
index 0000000..d162c69
--- /dev/null
+++ b/libraries/python/tests/unit/test_long_turn_filler.py
@@ -0,0 +1,312 @@
+"""Authentic tests for the opt-in long-turn filler (pipeline mode, Feature #8).
+
+When an LLM turn is SLOW (e.g. an agent runtime running tools) and NO audio has
+reached the carrier after ``agent.long_turn_message_after_s`` seconds, the SDK
+speaks a short filler instead of dead silence — distinct from
+``llm_error_message`` (which fires on an ERROR, not on slowness).
+
+Only the external boundary is mocked: the LLM provider's ``stream()`` (its
+timing / the gateway hop) and the TTS byte boundary
+(``_tts.synthesize`` yielding PCM). Everything inward — the real ``LLMLoop.run``
+async generator, the real ``PipelineStreamHandler._process_streaming_response``,
+the real filler task scheduling / cancellation, the real ``_synthesize_sentence``
+speak primitive — runs unmocked. The filler timeout is set tiny (a few ms) so
+the suite stays fast while exercising the real ``asyncio.sleep`` path.
+
+These tests carry ``@pytest.mark.mocked`` because the provider stream is an
+external-boundary mock.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from collections import deque
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from getpatter.stream_handler import PipelineStreamHandler
+
+from tests.conftest import make_agent
+
+_FILLER = "One moment while I look that up."
+
+
+# ---------------------------------------------------------------------------
+# Boundary doubles — the ONLY mocks: the LLM stream timing and the TTS bytes
+# ---------------------------------------------------------------------------
+
+
+class _SlowThenTextLLMProvider:
+    """Sleeps past the filler timeout, THEN yields a complete sentence.
+
+    Models an agent runtime that runs tools for a while before producing its
+    first words: the caller hears silence during ``delay_s``, the filler fires,
+    and only then the real reply arrives.
+    """
+
+    def __init__(self, delay_s: float) -> None:
+        self._delay_s = delay_s
+
+    async def stream(self, messages, tools=None, **_kwargs):
+        await asyncio.sleep(self._delay_s)
+        yield {"type": "text", "content": "Here is your answer. "}
+
+
+class _FastTextLLMProvider:
+    """Yields a complete sentence immediately (no slow gap)."""
+
+    async def stream(self, messages, tools=None, **_kwargs):
+        yield {"type": "text", "content": "Quick reply right away. "}
+
+
+class _FakeTTS:
+    """TTS byte boundary — ``synthesize(text)`` yields a couple of PCM chunks.
+
+    Records every text it was asked to synthesize so a test can assert whether
+    (and in what order) the filler / the real reply were spoken.
+    """
+
+    output_format = "pcm_16000"
+
+    def __init__(self) -> None:
+        self.synthesized: list[str] = []
+
+    async def synthesize(self, text: str):
+        self.synthesized.append(text)
+        yield b"\x00\x00" * 80
+        yield b"\x00\x00" * 80
+
+
+def _make_loop(provider) -> object:
+    """Build a REAL ``LLMLoop`` wrapping the boundary provider double."""
+    from getpatter.services.llm_loop import LLMLoop
+
+    loop = LLMLoop.__new__(LLMLoop)
+    loop._provider = provider
+    loop._system_prompt = "You are a test assistant."
+    loop._tools = None
+    loop._tool_executor = None
+    loop._metrics = None
+    loop._event_bus = None
+    loop._model = "fake-model"
+    loop._provider_name = "fake"
+    loop._openai_tools = None
+    loop._tool_map = {}
+    loop._on_tool_call = None
+    loop._usage_missing_count = 0
+    loop._logged_usage_fallback = False
+    return loop
+
+
+def _make_handler(*, long_turn_message, long_turn_message_after_s, tts):
+    audio_sender = AsyncMock()
+    audio_sender.reset_pcm_carry = MagicMock()
+    overrides: dict = {"long_turn_message": long_turn_message}
+    if long_turn_message_after_s is not None:
+        overrides["long_turn_message_after_s"] = long_turn_message_after_s
+    handler = PipelineStreamHandler(
+        agent=make_agent(**overrides),
+        audio_sender=audio_sender,
+        call_id="call-long-turn",
+        caller="+15551110000",
+        callee="+15552220000",
+        resolved_prompt="p",
+        metrics=None,
+        for_twilio=True,
+        on_transcript=None,
+        conversation_history=deque(maxlen=10),
+        transcript_entries=deque(maxlen=10),
+    )
+    handler.on_message = None
+    handler._tts = tts  # type: ignore[assignment]
+    handler._is_speaking = True
+    return handler
+
+
+# ---------------------------------------------------------------------------
+# Positive: slow turn + message set → filler is spoken before the real reply
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.mocked
+class TestFillerSpokenOnSlowTurn:
+    async def test_filler_spoken_when_turn_is_slow(self) -> None:
+        tts = _FakeTTS()
+        handler = _make_handler(
+            long_turn_message=_FILLER,
+            long_turn_message_after_s=0.02,
+            tts=tts,
+        )
+        # The provider takes 80 ms before its first word; the filler fires at
+        # 20 ms — so the caller hears the filler, then the real reply.
+        loop = _make_loop(_SlowThenTextLLMProvider(delay_s=0.08))
+
+        result = loop.run("Hi", [], {"call_id": "call-long-turn"})
+        await handler._process_streaming_response(result, "call-long-turn")
+
+        # The filler was synthesized (and reached the carrier) FIRST, then the
+        # real reply followed — exactly one filler, no double-speak.
+        assert _FILLER in tts.synthesized
+        assert tts.synthesized.index(_FILLER) == 0
+        assert "Here is your answer." in tts.synthesized
+        assert tts.synthesized.count(_FILLER) == 1
+        handler.audio_sender.send_audio.assert_awaited()
+
+
+# ---------------------------------------------------------------------------
+# Negative: fast turn → filler must NOT fire (no race / no double-speak)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.mocked
+class TestFillerNotSpokenOnFastTurn:
+    async def test_filler_not_spoken_when_audio_starts_quickly(self) -> None:
+        tts = _FakeTTS()
+        handler = _make_handler(
+            long_turn_message=_FILLER,
+            long_turn_message_after_s=0.5,  # well beyond the fast reply
+            tts=tts,
+        )
+        loop = _make_loop(_FastTextLLMProvider())
+
+        result = loop.run("Hi", [], {"call_id": "call-long-turn"})
+        await handler._process_streaming_response(result, "call-long-turn")
+
+        # Real audio started immediately; the filler is cancelled before firing.
+        assert _FILLER not in tts.synthesized
+        assert "Quick reply right away." in tts.synthesized
+
+    async def test_no_orphaned_filler_task_after_fast_turn(self) -> None:
+        """The filler task must be cleanly cancelled — no pending filler task
+        lingers past the turn (the cancellation path awaits/suppresses
+        CancelledError). Captures the actual task handle the handler created."""
+        tts = _FakeTTS()
+        handler = _make_handler(
+            long_turn_message=_FILLER,
+            long_turn_message_after_s=0.5,
+            tts=tts,
+        )
+        loop = _make_loop(_FastTextLLMProvider())
+
+        created: list = []
+        real_schedule = handler._schedule_long_turn_filler
+
+        def _capture(*args, **kwargs):
+            task = real_schedule(*args, **kwargs)
+            if task is not None:
+                created.append(task)
+            return task
+
+        handler._schedule_long_turn_filler = _capture  # type: ignore[assignment]
+
+        result = loop.run("Hi", [], {"call_id": "call-long-turn"})
+        await handler._process_streaming_response(result, "call-long-turn")
+        # Yield once so any cancelled task fully tears down.
+        await asyncio.sleep(0)
+
+        # The filler task was created and is now finished (cancelled cleanly),
+        # not left pending past the turn.
+        assert len(created) == 1
+        assert created[0].done()
+        assert created[0].cancelled()
+
+
+# ---------------------------------------------------------------------------
+# Regression: feature OFF by default → behaviour unchanged
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.mocked
+class TestFillerOffByDefault:
+    async def test_unset_message_speaks_nothing_extra(self) -> None:
+        tts = _FakeTTS()
+        handler = _make_handler(
+            long_turn_message=None,  # default — feature OFF
+            long_turn_message_after_s=None,
+            tts=tts,
+        )
+        # Even on a slow turn, with the message unset nothing extra is spoken.
+        loop = _make_loop(_SlowThenTextLLMProvider(delay_s=0.05))
+
+        result = loop.run("Hi", [], {"call_id": "call-long-turn"})
+        await handler._process_streaming_response(result, "call-long-turn")
+
+        # Only the real reply — no filler ever synthesized.
+        assert tts.synthesized == ["Here is your answer."]
+
+
+# ---------------------------------------------------------------------------
+# Barge-in guard: floor flipped off during the slow gap → filler stays silent
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.mocked
+class TestFillerSuppressedByBargeIn:
+    async def test_filler_not_spoken_when_floor_flips_off_before_firing(self) -> None:
+        tts = _FakeTTS()
+        handler = _make_handler(
+            long_turn_message=_FILLER,
+            long_turn_message_after_s=0.02,
+            tts=tts,
+        )
+
+        class _SlowFlipThenText:
+            async def stream(self, messages, tools=None, **_kwargs):
+                # Concurrent barge-in flips the floor off before the filler
+                # timeout elapses — the filler must observe ``_is_speaking`` is
+                # False and stay silent.
+                handler._is_speaking = False
+                await asyncio.sleep(0.08)
+                handler._is_speaking = True  # restored for the (real) reply path
+                yield {"type": "text", "content": "Late reply. "}
+
+        loop = _make_loop(_SlowFlipThenText())
+        result = loop.run("Hi", [], {"call_id": "call-long-turn"})
+        await handler._process_streaming_response(result, "call-long-turn")
+
+        assert _FILLER not in tts.synthesized
+
+
+# ---------------------------------------------------------------------------
+# Authenticity invariant: the positive test exercises the REAL speak primitive
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.mocked
+class TestExercisesRealSpeakPrimitive:
+    async def test_fails_if_synthesize_sentence_is_not_real(self) -> None:
+        tts = _FakeTTS()
+        handler = _make_handler(
+            long_turn_message=_FILLER,
+            long_turn_message_after_s=0.02,
+            tts=tts,
+        )
+        loop = _make_loop(_SlowThenTextLLMProvider(delay_s=0.08))
+
+        filler_attempts: list[str] = []
+        real_synth = handler._synthesize_sentence
+
+        async def _broken_for_filler(sentence, *args, **kwargs):
+            # The FILLER's speak primitive is broken (its own try/except must
+            # swallow the failure → no filler PCM reaches the carrier). The real
+            # reply still routes to the genuine primitive, so the turn completes.
+            if sentence == _FILLER:
+                filler_attempts.append(sentence)
+                raise NotImplementedError
+            return await real_synth(sentence, *args, **kwargs)
+
+        handler._synthesize_sentence = _broken_for_filler  # type: ignore[assignment]
+
+        result = loop.run("Hi", [], {"call_id": "call-long-turn"})
+        # Must not raise — a filler-primitive outage degrades to silence, not a
+        # handler crash, and the real reply still plays.
+        await handler._process_streaming_response(result, "call-long-turn")
+
+        # The filler attempted to speak through the (now-broken) primitive but
+        # its bytes never reached the carrier — proving the positive test above
+        # depends on the REAL primitive running, not a mock.
+        assert filler_attempts == [_FILLER]
+        assert _FILLER not in tts.synthesized
+        # The real reply still played (broken filler never blocked the turn).
+        assert "Here is your answer." in tts.synthesized
diff --git a/libraries/typescript/src/index.ts b/libraries/typescript/src/index.ts
index 1b99333..cbd16a0 100644
--- a/libraries/typescript/src/index.ts
+++ b/libraries/typescript/src/index.ts
@@ -32,6 +32,7 @@ export type {
   PipelineHooks,
   HookContext,
   RealtimeTurnDetection,
+  SessionContext,
 } from "./types";
 // `Guardrail` is intentionally not re-exported from `./types` — the public
 // `Guardrail` identifier is the class from `./public-api` (exported below),
@@ -192,11 +193,23 @@ export type { GoogleLLMOptions } from "./llm/google";
 // OpenAI-compatible agent runtime / local inference gateway).
 export { LLM as OpenAICompatibleLLM, OpenAICompatibleLLMProvider } from "./llm/openai-compatible";
 export type { OpenAICompatibleLLMOptions } from "./llm/openai-compatible";
+export { hashCaller } from "./llm/openai-compatible";
 export { LLM as HermesLLM } from "./llm/hermes";
 export type { HermesLLMOptions } from "./llm/hermes";
 export { LLM as OpenClawLLM } from "./llm/openclaw";
 export type { OpenClawLLMOptions } from "./llm/openclaw";
 
+// Namespace objects mirroring the Python ``from getpatter.llm import hermes``
+// ergonomics: ``import { hermes } from 'getpatter'; new hermes.LLM()`` builds
+// the SAME class as the ``HermesLLM`` named export above. Provided alongside
+// (not instead of) the named exports.
+import { LLM as HermesLLMClass } from "./llm/hermes";
+import { LLM as OpenClawLLMClass } from "./llm/openclaw";
+import { LLM as OpenAICompatibleLLMClass } from "./llm/openai-compatible";
+export const hermes = Object.freeze({ LLM: HermesLLMClass });
+export const openclaw = Object.freeze({ LLM: OpenClawLLMClass });
+export const openaiCompatible = Object.freeze({ LLM: OpenAICompatibleLLMClass });
+
 // Voice Activity Detection (server-side) — Silero ONNX.
 export { SileroVAD } from "./providers/silero-vad";
 export type { SileroVADOptions, SileroSampleRate } from "./providers/silero-vad";
diff --git a/libraries/typescript/src/llm-loop.ts b/libraries/typescript/src/llm-loop.ts
index 6005f2b..3f2f7d2 100644
--- a/libraries/typescript/src/llm-loop.ts
+++ b/libraries/typescript/src/llm-loop.ts
@@ -432,6 +432,17 @@ export interface LLMStreamOptions {
    * config) no ``user`` field is sent — fully backward compatible.
    */
   callId?: string;
+  /**
+   * Caller / callee for this turn (the same values the stream handler builds
+   * into ``callCtx.caller`` / ``callCtx.callee``). Threaded purely so a
+   * session-aware provider with a ``sessionKeyFactory`` can derive a per-caller
+   * memory scope from the NON-REVERSIBLE caller hash. Additive and optional:
+   * providers that read only ``signal`` / ``callId`` ignore them, and the raw
+   * ``caller`` is never logged. Mirrors the Python loop threading
+   * ``caller`` / ``callee`` into the provider's ``stream``.
+   */
+  caller?: string;
+  callee?: string;
 }
 
 /**
@@ -921,17 +932,30 @@ export class LLMLoop {
     const hasAfterLlmChunk = Boolean(hookExecutor?.hasAfterLlmChunk());
     const allEmittedText: string[] = [];
 
-    // Thread the stable per-call id into the provider stream options so
-    // session-aware providers (OpenAI-compatible / Hermes / OpenClaw) can
-    // emit the ``user`` field for one runtime session per phone call. Purely
-    // additive: providers that read only ``signal`` ignore it. Only spread a
-    // string call id — leave ``opts`` untouched otherwise so existing
-    // behaviour is byte-identical when no call id is present.
+    // Thread the stable per-call id (plus caller / callee for a
+    // sessionKeyFactory) into the provider stream options so session-aware
+    // providers (OpenAI-compatible / Hermes / OpenClaw) can emit the ``user``
+    // field / memory-scope header for one runtime session per phone call.
+    // Purely additive: providers that read only ``signal`` ignore them. Only
+    // build the augmented opts when at least one context value is a non-empty
+    // string — leave ``opts`` untouched otherwise so existing behaviour is
+    // byte-identical when no call context is present. The raw caller is never
+    // logged; a factory keys off its non-reversible hash.
     const callId = callContext.call_id;
-    const streamOpts: LLMStreamOptions | undefined =
-      typeof callId === 'string' && callId.length > 0
-        ? { ...opts, callId }
-        : opts;
+    const caller = callContext.caller;
+    const callee = callContext.callee;
+    const hasContext =
+      (typeof callId === 'string' && callId.length > 0) ||
+      (typeof caller === 'string' && caller.length > 0) ||
+      (typeof callee === 'string' && callee.length > 0);
+    const streamOpts: LLMStreamOptions | undefined = hasContext
+      ? {
+          ...opts,
+          ...(typeof callId === 'string' && callId.length > 0 ? { callId } : {}),
+          ...(typeof caller === 'string' && caller.length > 0 ? { caller } : {}),
+          ...(typeof callee === 'string' && callee.length > 0 ? { callee } : {}),
+        }
+      : opts;
 
     for (let iter = 0; iter < maxIterations; iter++) {
       const toolCallsAccumulated = new Map<number, ToolCallAccumulator>();
diff --git a/libraries/typescript/src/llm/hermes.ts b/libraries/typescript/src/llm/hermes.ts
index ef05517..39a441c 100644
--- a/libraries/typescript/src/llm/hermes.ts
+++ b/libraries/typescript/src/llm/hermes.ts
@@ -14,6 +14,7 @@
  * pass ``sessionKey``. (It also still emits ``user=patter-call-<callId>`` for
  * upstream-log correlation, but that is not what drives the session.)
  */
+import type { SessionContext } from '../types';
 import {
   OpenAICompatibleLLMProvider,
   type OpenAICompatibleLLMOptions,
@@ -49,11 +50,28 @@ export interface HermesLLMOptions {
   /** Per-request timeout in seconds. Default ``120``. */
   timeout?: number;
   /**
-   * Long-term memory scope. When set, emits ``X-Hermes-Session-Key`` so Hermes
-   * scopes durable memory to this value across calls. ``undefined`` (default)
-   * means the header is not sent. Credential-grade — never logged.
+   * Static long-term memory scope. When set, emits ``X-Hermes-Session-Key`` so
+   * Hermes scopes durable memory to this value across calls. ``undefined``
+   * (default) means the header is not sent. Credential-grade — never logged.
    */
   sessionKey?: string;
+  /**
+   * Convenience selector for a built-in per-call key derivation. Set to
+   * ``'caller_hash'`` to derive the session key per call as
+   * ``` `patter-caller-${ctx.callerHash}` ``` (a stable, non-reversible hash of
+   * the caller — never the raw number), enabling per-caller cross-call memory.
+   * ``undefined`` (default) uses the static ``sessionKey`` path. Ignored when
+   * ``sessionKeyFactory`` is given explicitly. Mirrors Python
+   * ``session_key_from``.
+   */
+  sessionKeyFrom?: 'caller_hash';
+  /**
+   * Custom callback deriving the ``X-Hermes-Session-Key`` value per call from a
+   * {@link SessionContext}. Takes precedence over both ``sessionKey`` and
+   * ``sessionKeyFrom``. A falsy return omits the header for that call.
+   * Credential-grade — never logged. Mirrors Python ``session_key_factory``.
+   */
+  sessionKeyFactory?: (ctx: SessionContext) => string | undefined;
   /** Extra headers merged after the SDK ``User-Agent``. */
   extraHeaders?: Record<string, string>;
   /** Sampling temperature [0, 2]. */
@@ -93,6 +111,26 @@ export class LLM extends OpenAICompatibleLLMProvider {
 
   constructor(opts: HermesLLMOptions = {}) {
     const model = opts.model ?? process.env[MODEL_ENV] ?? DEFAULT_MODEL;
+    // ``sessionKeyFrom: 'caller_hash'`` installs a default factory that scopes
+    // durable memory per caller via the non-reversible caller hash (never the
+    // raw number). An explicit ``sessionKeyFactory`` always wins over it.
+    let sessionKeyFactory = opts.sessionKeyFactory;
+    if (!sessionKeyFactory && opts.sessionKeyFrom === 'caller_hash') {
+      sessionKeyFactory = (ctx: SessionContext): string | undefined =>
+        ctx.callerHash ? `patter-caller-${ctx.callerHash}` : undefined;
+    } else if (
+      opts.sessionKeyFrom !== undefined &&
+      opts.sessionKeyFrom !== 'caller_hash'
+    ) {
+      // Runtime validation for non-TypeScript / dynamic-JS / JSON callers — the
+      // literal type already catches this at compile time. Mirrors Python's
+      // ValueError so a misconfigured key derivation fails loudly, not silently.
+      throw new Error(
+        `sessionKeyFrom must be 'caller_hash' or undefined, got ${JSON.stringify(
+          opts.sessionKeyFrom,
+        )}`,
+      );
+    }
     const options: OpenAICompatibleLLMOptions = {
       apiKey: opts.apiKey,
       apiKeyEnv: API_KEY_ENV,
@@ -104,6 +142,7 @@ export class LLM extends OpenAICompatibleLLMProvider {
       sessionIdPrefix: SESSION_ID_PREFIX,
       sessionKeyHeader: SESSION_KEY_HEADER,
       sessionKey: opts.sessionKey,
+      sessionKeyFactory,
       extraHeaders: opts.extraHeaders,
       temperature: opts.temperature,
       maxTokens: opts.maxTokens,
diff --git a/libraries/typescript/src/llm/openai-compatible.ts b/libraries/typescript/src/llm/openai-compatible.ts
index 2dd7f78..a19a399 100644
--- a/libraries/typescript/src/llm/openai-compatible.ts
+++ b/libraries/typescript/src/llm/openai-compatible.ts
@@ -44,16 +44,36 @@
  * ``Bearer EMPTY`` placeholder breaks some gateways).
  */
 
+import { createHash } from 'node:crypto';
 import type { LLMChunk, LLMProvider, LLMStreamOptions } from '../llm-loop';
 import { mergeAbortSignals } from '../llm-loop';
 import { parseOpenAISseStream } from '../providers/groq-llm';
 import { PatterConnectionError } from '../errors';
 import { getLogger } from '../logger';
+import type { SessionContext } from '../types';
 import { VERSION } from '../version';
 
 /** Default per-request timeout in seconds for the generic provider. */
 const DEFAULT_TIMEOUT_S = 60;
 
+/**
+ * Stable, non-reversible 16-char hash of a caller for session scoping.
+ *
+ * Used to derive a per-caller memory namespace (e.g. an agent runtime's session
+ * key) WITHOUT ever exposing the raw phone number — the call site keys cross-
+ * call memory off the hash, never the number itself. Returns the first 16 hex
+ * chars of the SHA-256 digest of the UTF-8 ``caller`` string, or ``undefined``
+ * when ``caller`` is undefined / empty. The 16-char (64-bit) truncation is
+ * plenty for namespacing while keeping the emitted header value compact; it is
+ * NOT a security primitive (a phone number has too little entropy to make the
+ * digest a secret) — its only job is to keep the raw number off the wire / out
+ * of logs. Mirrors Python ``hash_caller``.
+ */
+export function hashCaller(caller?: string): string | undefined {
+  if (!caller) return undefined;
+  return createHash('sha256').update(caller, 'utf8').digest('hex').slice(0, 16);
+}
+
 /** Constructor options for {@link OpenAICompatibleLLMProvider}. */
 export interface OpenAICompatibleLLMOptions {
   /**
@@ -118,6 +138,17 @@ export interface OpenAICompatibleLLMOptions {
    * scope — NEVER logged. ``undefined`` (default) means the header is omitted.
    */
   sessionKey?: string;
+  /**
+   * Optional callback that derives the ``sessionKeyHeader`` VALUE per call from
+   * a {@link SessionContext} (carrying ``callId`` / ``caller`` / ``callee`` /
+   * ``callerHash``). When set it takes PRECEDENCE over the static ``sessionKey``:
+   * at request-build time the factory is called and its return value is emitted
+   * in ``sessionKeyHeader``. A falsy return (``undefined`` / ``''``) omits the
+   * header for that call. The static ``sessionKey`` remains the simple fallback
+   * used when no factory is configured. The returned value is a credential-grade
+   * memory scope and is NEVER logged. Mirrors Python ``session_key_factory``.
+   */
+  sessionKeyFactory?: (ctx: SessionContext) => string | undefined;
   /** Sampling temperature [0, 2]. */
   temperature?: number;
   /** Max tokens in the assistant response (sent as ``max_completion_tokens``). */
@@ -165,6 +196,7 @@ export class OpenAICompatibleLLMProvider implements LLMProvider {
   private readonly sessionIdPrefix?: string;
   private readonly sessionKeyHeader?: string;
   private readonly sessionKey?: string;
+  private readonly sessionKeyFactory?: (ctx: SessionContext) => string | undefined;
   private readonly temperature?: number;
   private readonly maxTokens?: number;
   private readonly responseFormat?: Record<string, unknown>;
@@ -199,6 +231,7 @@ export class OpenAICompatibleLLMProvider implements LLMProvider {
     this.sessionIdPrefix = options.sessionIdPrefix;
     this.sessionKeyHeader = options.sessionKeyHeader;
     this.sessionKey = options.sessionKey;
+    this.sessionKeyFactory = options.sessionKeyFactory;
     this.temperature = options.temperature;
     this.maxTokens = options.maxTokens;
     this.responseFormat = options.responseFormat;
@@ -223,7 +256,11 @@ export class OpenAICompatibleLLMProvider implements LLMProvider {
    *  - ``sessionKeyHeader`` (+ ``sessionKey``) → the static ``sessionKey`` value.
    * ``sessionKey`` is a credential-grade memory scope and is never logged.
    */
-  private buildHeaders(callId?: string): Record<string, string> {
+  private buildHeaders(
+    callId?: string,
+    caller?: string,
+    callee?: string,
+  ): Record<string, string> {
     const headers: Record<string, string> = {
       'Content-Type': 'application/json',
       'User-Agent': `getpatter/${VERSION}`,
@@ -236,15 +273,43 @@ export class OpenAICompatibleLLMProvider implements LLMProvider {
       // Per-call session id for session / transcript continuity.
       headers[this.sessionIdHeader] = `${this.sessionIdPrefix ?? ''}${callId}`;
     }
-    if (this.sessionKeyHeader && this.sessionKey) {
-      // Truthy check (not `!== undefined`): an empty-string session key is not
-      // a meaningful memory scope — treat it as unset rather than emitting a
-      // confusing empty header. Value is the raw key (never logged).
-      headers[this.sessionKeyHeader] = this.sessionKey;
+    if (this.sessionKeyHeader) {
+      // The factory (when configured) wins over the static sessionKey. Truthy
+      // check (not `!== undefined`): an empty-string scope is not meaningful —
+      // treat it as unset rather than emitting a confusing empty header. The
+      // value is credential-grade and never logged.
+      const sessionKeyValue = this.resolveSessionKey(callId, caller, callee);
+      if (sessionKeyValue) {
+        headers[this.sessionKeyHeader] = sessionKeyValue;
+      }
     }
     return headers;
   }
 
+  /**
+   * Resolve the ``sessionKeyHeader`` VALUE for this call. When a
+   * ``sessionKeyFactory`` is configured it is called with a
+   * {@link SessionContext} (the raw ``caller`` plus its non-reversible
+   * {@link hashCaller}) and its return value wins — a falsy return omits the
+   * header. Otherwise the static ``sessionKey`` is used. Never logged.
+   */
+  private resolveSessionKey(
+    callId?: string,
+    caller?: string,
+    callee?: string,
+  ): string | undefined {
+    if (this.sessionKeyFactory) {
+      const ctx: SessionContext = {
+        callId,
+        caller,
+        callee,
+        callerHash: hashCaller(caller),
+      };
+      return this.sessionKeyFactory(ctx);
+    }
+    return this.sessionKey;
+  }
+
   /**
    * Pre-call DNS / TLS warmup for the configured endpoint. Best-effort:
    * 5 s timeout, all exceptions swallowed at debug level. The ``Authorization``
@@ -308,11 +373,13 @@ export class OpenAICompatibleLLMProvider implements LLMProvider {
     opts?: LLMStreamOptions,
   ): AsyncGenerator<LLMChunk, void, unknown> {
     const callId = opts?.callId;
+    const caller = opts?.caller;
+    const callee = opts?.callee;
     const body = this.buildBody(messages, tools, callId);
 
     const response = await fetch(`${this.baseUrl}/chat/completions`, {
       method: 'POST',
-      headers: this.buildHeaders(callId),
+      headers: this.buildHeaders(callId, caller, callee),
       body: JSON.stringify(body),
       signal: mergeAbortSignals(opts?.signal, AbortSignal.timeout(this.timeoutMs)),
     });
diff --git a/libraries/typescript/src/stream-handler.ts b/libraries/typescript/src/stream-handler.ts
index 38a1198..342676c 100644
--- a/libraries/typescript/src/stream-handler.ts
+++ b/libraries/typescript/src/stream-handler.ts
@@ -2690,6 +2690,64 @@ export class StreamHandler {
     return true;
   }
 
+  /**
+   * Schedule the opt-in long-turn filler and return its async ``clear()``.
+   *
+   * When ``agent.longTurnMessage`` is unset / empty the returned clear is a
+   * no-op (byte-identical to today's behaviour). Otherwise a one-shot timer
+   * fires after ``agent.longTurnMessageAfterS`` seconds and, IFF no audio has
+   * reached the carrier this turn (``!ttsFirstByteSent.value``) AND we still own
+   * the floor (``this.isSpeaking``), synthesizes the filler ONCE via the same
+   * per-sentence TTS primitive every sentence uses.
+   *
+   * The returned ``clear()`` is **async**: it stops the timer AND, if the filler
+   * already started synthesizing (its ``setTimeout`` callback runs in a separate
+   * macro-task, so it can fire just before the first real sentence), AWAITS the
+   * in-flight synthesis so the filler audio can never interleave with the real
+   * sentence that follows. Idempotent; a filler synthesis failure degrades to
+   * silence (never crashes the turn). The caller must clear on first real
+   * audio, on the error branch, and in the ``finally`` block.
+   */
+  private scheduleLongTurnFiller(
+    ttsFirstByteSent: { value: boolean },
+    hookExecutor: PipelineHookExecutor,
+    hookCtx: HookContext,
+    label: string,
+  ): () => Promise<void> {
+    const message = this.deps.agent.longTurnMessage;
+    if (!message) return async () => {};
+    const afterS = this.deps.agent.longTurnMessageAfterS ?? 4.0;
+    let cancelled = false;
+    let inFlight: Promise<void> | null = null;
+    const timer = setTimeout(() => {
+      // Fire at most once, and only if the caller has heard only silence this
+      // turn, we still hold the floor, and the turn has not already moved on.
+      if (cancelled || ttsFirstByteSent.value || !this.isSpeaking) return;
+      // Track the in-flight synthesis so clear() can await it — serializing the
+      // filler before the real sentence so their audio can never interleave.
+      inFlight = this.synthesizeSentence(
+        message,
+        hookExecutor,
+        hookCtx,
+        ttsFirstByteSent,
+      ).catch((err) => {
+        getLogger().error(
+          `longTurnMessage filler synthesis failed (${label}):`,
+          err,
+        );
+      });
+    }, Math.max(0, afterS * 1000));
+    return async () => {
+      cancelled = true;
+      clearTimeout(timer);
+      if (inFlight !== null) {
+        const pending = inFlight;
+        inFlight = null;
+        await pending;
+      }
+    };
+  }
+
   /**
    * Streaming built-in LLM path with sentence chunking and per-sentence
    * guardrails/TTS. Returns the concatenated response text.
@@ -2716,6 +2774,20 @@ export class StreamHandler {
     const llmSignal = this.llmAbort.signal;
     let llmError = false;
 
+    // Opt-in long-turn filler: when the turn is SLOW (agent runtime running
+    // tools/memory) and NO audio has reached the carrier yet, speak a short
+    // filler instead of dead silence. Distinct from ``llmErrorMessage`` (that
+    // fires on an LLM ERROR; this fires on SLOWNESS). The timer waits
+    // ``longTurnMessageAfterS`` then, IFF still no audio this turn AND we still
+    // own the floor, synthesizes the filler ONCE. Cleared the moment real audio
+    // is emitted, on the error branch, and in the finally.
+    const clearLongTurnFiller = this.scheduleLongTurnFiller(
+      ttsFirstByteSent,
+      hookExecutor,
+      hookCtx,
+      label,
+    );
+
     // Span lifetime: LLM dispatch → final token / TTS handoff. Always closed
     // in the ``finally`` block so an early throw cannot leak a span.
     const llmSpan = startSpan(SPAN_LLM, { 'patter.call.id': this.callId });
@@ -2736,6 +2808,9 @@ export class StreamHandler {
         if (transformed === null) return; // hook dropped this sentence
         sentenceText = transformed;
       }
+      // Real audio is about to play — cancel the long-turn filler so it can
+      // never fire (or double-speak) once the agent's own reply has started.
+      await clearLongTurnFiller();
       await this.synthesizeSentence(sentenceText, hookExecutor, hookCtx, ttsFirstByteSent);
     };
     let firstSentenceEmitted = false;
@@ -2769,6 +2844,9 @@ export class StreamHandler {
         // Treat AbortError as a clean barge-in cancellation, not an LLM error.
         const isAbort =
           (e as Error)?.name === 'AbortError' || llmSignal.aborted;
+        // The turn ended (error or clean abort) — stop the filler so it cannot
+        // speak over the error fallback below or after a barge-in.
+        await clearLongTurnFiller();
         if (!isAbort) {
           llmError = true;
           chunker.reset(); // discard partial content on LLM error
@@ -2810,6 +2888,9 @@ export class StreamHandler {
         }
       }
     } finally {
+      // Ensure the long-turn filler never outlives the turn (idempotent — a
+      // no-op when already cleared at the first real audio / error branch).
+      await clearLongTurnFiller();
       this.endSpeakingWithGrace();
       // Drop the per-turn abort controller so the next turn starts with a
       // fresh one and barge-ins on the next turn cannot accidentally fire
diff --git a/libraries/typescript/src/types.ts b/libraries/typescript/src/types.ts
index 5a9bb26..8f37186 100644
--- a/libraries/typescript/src/types.ts
+++ b/libraries/typescript/src/types.ts
@@ -489,6 +489,28 @@ export interface BackgroundAudioPlayer {
  *    (see ``Patter.agent()`` for the resolution).
  * 3. Otherwise, the AgentOptions default is used.
  */
+/**
+ * Per-call context handed to a ``sessionKeyFactory`` (see
+ * {@link OpenAICompatibleLLMOptions.sessionKeyFactory}).
+ *
+ * A session-aware LLM provider (e.g. the Hermes preset) can derive its
+ * memory-scope header value per call from this — most usefully from
+ * {@link SessionContext.callerHash}, a stable non-reversible hash of the
+ * caller, so one phone number maps to one durable memory namespace across calls
+ * WITHOUT the raw number ever being emitted or logged.
+ *
+ * All fields are optional: ``callId`` / ``caller`` / ``callee`` are present when
+ * the call provides them; ``callerHash`` is {@link hashCaller} of ``caller``
+ * (``undefined`` when there is no caller). The raw ``caller`` is carried here
+ * only so a factory CAN re-derive its own scope — it must never be put on the
+ * wire or logged beyond what already exists. Mirrors Python ``SessionContext``.
+ */
+export interface SessionContext {
+  readonly callId?: string;
+  readonly caller?: string;
+  readonly callee?: string;
+  readonly callerHash?: string;
+}
 /** Configuration for a local-mode voice AI agent (passed to `phone.agent({...})`). */
 export interface AgentOptions {
   readonly systemPrompt: string;
@@ -524,6 +546,26 @@ export interface AgentOptions {
    * Mirrors Python ``llm_error_message`` on ``Patter.agent()`` / ``Agent``.
    */
   readonly llmErrorMessage?: string;
+  /**
+   * Opt-in short filler spoken when an LLM turn is SLOW (e.g. an agent runtime
+   * running tools / memory) and no audio has reached the carrier yet — DISTINCT
+   * from ``llmErrorMessage`` (which fires on an ERROR; this fires on SLOWNESS).
+   * When set to a non-empty string and the turn has produced NO audio after
+   * ``longTurnMessageAfterS`` seconds, the SDK synthesizes this line ONCE
+   * through the normal TTS turn lifecycle (subject to barge-in) to fill the
+   * gap. It never fires once real audio has started this turn, and never
+   * double-speaks. ``undefined`` (default) keeps today's behaviour: nothing is
+   * spoken while a slow turn runs. Pipeline mode only. Mirrors Python
+   * ``long_turn_message`` on ``Patter.agent()`` / ``Agent``.
+   */
+  readonly longTurnMessage?: string;
+  /**
+   * Seconds to wait from the start of the LLM turn before the
+   * ``longTurnMessage`` filler fires (only consulted when ``longTurnMessage``
+   * is set; skipped if audio already reached the carrier). Default ``4.0``.
+   * Mirrors Python ``long_turn_message_after_s``.
+   */
+  readonly longTurnMessageAfterS?: number;
   /** Tool definitions — ``Tool`` class instances from ``getpatter``. */
   readonly tools?: ReadonlyArray<ToolInstance>;
   /**
diff --git a/libraries/typescript/tests/llm-namespace-exports.test.ts b/libraries/typescript/tests/llm-namespace-exports.test.ts
new file mode 100644
index 0000000..688abb0
--- /dev/null
+++ b/libraries/typescript/tests/llm-namespace-exports.test.ts
@@ -0,0 +1,50 @@
+/**
+ * Tests for the ``hermes`` / ``openclaw`` / ``openaiCompatible`` namespace
+ * objects (Feature #6), mirroring the Python ``from getpatter.llm import
+ * hermes`` ergonomics.
+ *
+ * Real construction throughout — no mocks. Proves ``new hermes.LLM()`` builds
+ * the SAME class as the existing ``HermesLLM`` named export, and that the named
+ * exports still work alongside the namespaces.
+ */
+
+import { describe, expect, it } from 'vitest';
+import {
+  hermes,
+  openclaw,
+  openaiCompatible,
+  HermesLLM,
+  OpenClawLLM,
+  OpenAICompatibleLLM,
+} from '../src';
+
+describe('[unit] LLM namespace exports', () => {
+  it('hermes.LLM constructs the same class as the HermesLLM named export', () => {
+    const fromNamespace = new hermes.LLM();
+    expect(fromNamespace).toBeInstanceOf(HermesLLM);
+    // Same constructor identity — the namespace re-exports the class, not a copy.
+    expect(hermes.LLM).toBe(HermesLLM);
+    expect(fromNamespace.model).toBe('hermes-agent');
+  });
+
+  it('openclaw.LLM constructs the same class as the OpenClawLLM named export', () => {
+    const fromNamespace = new openclaw.LLM({ agent: 'x' });
+    expect(fromNamespace).toBeInstanceOf(OpenClawLLM);
+    expect(openclaw.LLM).toBe(OpenClawLLM);
+    expect(fromNamespace.model).toBe('openclaw/x');
+  });
+
+  it('openaiCompatible.LLM constructs the same class as the named export', () => {
+    const fromNamespace = new openaiCompatible.LLM({
+      baseUrl: 'http://127.0.0.1:11434/v1',
+      model: 'llama3.1',
+    });
+    expect(fromNamespace).toBeInstanceOf(OpenAICompatibleLLM);
+    expect(openaiCompatible.LLM).toBe(OpenAICompatibleLLM);
+    expect(fromNamespace.model).toBe('llama3.1');
+  });
+
+  it('openclaw.LLM enforces the same agent-id validation as the named export', () => {
+    expect(() => new openclaw.LLM({ agent: 'a b' })).toThrow(/agent id/i);
+  });
+});
diff --git a/libraries/typescript/tests/llm-session-key-factory.mocked.test.ts b/libraries/typescript/tests/llm-session-key-factory.mocked.test.ts
new file mode 100644
index 0000000..deb2ef9
--- /dev/null
+++ b/libraries/typescript/tests/llm-session-key-factory.mocked.test.ts
@@ -0,0 +1,286 @@
+/**
+ * Tests for the per-call session-key factory (Feature #7).
+ *
+ * A ``sessionKeyFactory`` derives the memory-scope header value per call from a
+ * {@link SessionContext} (carrying ``caller`` + its non-reversible
+ * {@link hashCaller}). The Hermes convenience ``sessionKeyFrom: 'caller_hash'``
+ * installs a default factory that scopes durable memory per caller WITHOUT the
+ * raw number ever reaching the wire.
+ *
+ * Factory resolution, the SessionContext construction, and the caller threading
+ * through the REAL ``LLMLoop`` are all real code. The only mocked surface is
+ * ``global.fetch`` — used to inspect the request the provider would POST (the
+ * X-Hermes-Session-Key header value) without touching the network.
+ */
+
+import { describe, expect, it, vi, afterEach } from 'vitest';
+import {
+  OpenAICompatibleLLMProvider,
+  hashCaller,
+} from '../src/llm/openai-compatible';
+import { LLM as HermesLLM } from '../src/llm/hermes';
+import { LLMLoop } from '../src/llm-loop';
+import type {
+  LLMChunk,
+  LLMProvider,
+  LLMStreamOptions,
+} from '../src/llm-loop';
+import type { SessionContext } from '../src/types';
+
+const originalFetch = globalThis.fetch;
+
+afterEach(() => {
+  globalThis.fetch = originalFetch;
+  vi.restoreAllMocks();
+});
+
+/** Capture the single fetch a provider issues, returning a 200 + empty body. */
+function captureFetch(): { calls: Array<{ url: string; init: RequestInit }> } {
+  const calls: Array<{ url: string; init: RequestInit }> = [];
+  globalThis.fetch = vi.fn(
+    async (url: string | URL | Request, init?: RequestInit) => {
+      calls.push({ url: String(url), init: init ?? {} });
+      return new Response('', { status: 200 });
+    },
+  ) as unknown as typeof fetch;
+  return { calls };
+}
+
+async function inspectHeaders(
+  provider: {
+    stream: (
+      m: Array<Record<string, unknown>>,
+      t?: unknown,
+      o?: LLMStreamOptions,
+    ) => AsyncGenerator<unknown>;
+  },
+  opts?: LLMStreamOptions,
+): Promise<Record<string, string>> {
+  const { calls } = captureFetch();
+  for await (const _ of provider.stream(
+    [{ role: 'user', content: 'hi' }],
+    null,
+    opts,
+  )) {
+    // drain
+  }
+  return calls[0].init.headers as Record<string, string>;
+}
+
+// ---------------------------------------------------------------------------
+// hashCaller — stable, non-reversible, never the raw number
+// ---------------------------------------------------------------------------
+
+describe('[unit] hashCaller', () => {
+  it('is stable across calls and never the raw number', () => {
+    const number = '+15555550100';
+    const h1 = hashCaller(number);
+    const h2 = hashCaller(number);
+    expect(h1).toBe(h2);
+    expect(h1).toHaveLength(16);
+    expect(h1).toMatch(/^[0-9a-f]{16}$/);
+    expect(h1).not.toContain(number);
+    expect(h1).not.toBe(number);
+  });
+
+  it('distinguishes different callers and returns undefined for empty input', () => {
+    expect(hashCaller('+15555550100')).not.toBe(hashCaller('+15555550101'));
+    expect(hashCaller(undefined)).toBeUndefined();
+    expect(hashCaller('')).toBeUndefined();
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Factory precedence on the generic provider — observed on the wire
+// ---------------------------------------------------------------------------
+
+describe('[mocked] sessionKeyFactory on OpenAICompatibleLLMProvider', () => {
+  it('factory overrides the static sessionKey and sees the caller hash', async () => {
+    let seen: SessionContext | undefined;
+    const provider = new OpenAICompatibleLLMProvider({
+      baseUrl: 'http://127.0.0.1:9/v1',
+      model: 'm',
+      sessionKeyHeader: 'X-Mem',
+      sessionKey: 'static-key', // must be overridden
+      sessionKeyFactory: (ctx) => {
+        seen = ctx;
+        return `scope-${ctx.callerHash}`;
+      },
+    });
+
+    const headers = await inspectHeaders(provider, {
+      callId: 'c1',
+      caller: '+15555550100',
+      callee: '+15555550101',
+    });
+
+    const expectedHash = hashCaller('+15555550100');
+    expect(headers['X-Mem']).toBe(`scope-${expectedHash}`);
+    // The factory saw the full context; the EMITTED value carries only the hash.
+    expect(seen?.callId).toBe('c1');
+    expect(seen?.caller).toBe('+15555550100');
+    expect(seen?.callee).toBe('+15555550101');
+    expect(seen?.callerHash).toBe(expectedHash);
+  });
+
+  it('factory returning undefined omits the header', async () => {
+    const provider = new OpenAICompatibleLLMProvider({
+      baseUrl: 'http://127.0.0.1:9/v1',
+      model: 'm',
+      sessionKeyHeader: 'X-Mem',
+      sessionKey: 'static-key',
+      sessionKeyFactory: () => undefined,
+    });
+
+    const headers = await inspectHeaders(provider, {
+      callId: 'c1',
+      caller: '+15555550100',
+    });
+    expect(headers['X-Mem']).toBeUndefined();
+  });
+
+  it('static sessionKey is used when no factory is configured', async () => {
+    const provider = new OpenAICompatibleLLMProvider({
+      baseUrl: 'http://127.0.0.1:9/v1',
+      model: 'm',
+      sessionKeyHeader: 'X-Mem',
+      sessionKey: 'static-key',
+    });
+
+    const headers = await inspectHeaders(provider, {
+      callId: 'c1',
+      caller: '+15555550100',
+    });
+    expect(headers['X-Mem']).toBe('static-key');
+  });
+
+  it('factory fires even without a callId (keys off the caller hash alone)', async () => {
+    const provider = new OpenAICompatibleLLMProvider({
+      baseUrl: 'http://127.0.0.1:9/v1',
+      model: 'm',
+      sessionKeyHeader: 'X-Mem',
+      sessionKeyFactory: (ctx) => `caller-${ctx.callerHash}`,
+    });
+
+    const headers = await inspectHeaders(provider, { caller: '+15555550100' });
+    expect(headers['X-Mem']).toBe(`caller-${hashCaller('+15555550100')}`);
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Hermes convenience: sessionKeyFrom: 'caller_hash'
+// ---------------------------------------------------------------------------
+
+describe('[mocked] HermesLLM sessionKeyFrom convenience', () => {
+  it('caller_hash derives patter-caller-<hash> on the wire (raw number absent)', async () => {
+    const llm = new HermesLLM({ sessionKeyFrom: 'caller_hash' });
+    const caller = '+15555550100';
+    const headers = await inspectHeaders(llm, { callId: 'hid-1', caller });
+
+    const expected = `patter-caller-${hashCaller(caller)}`;
+    expect(headers['X-Hermes-Session-Key']).toBe(expected);
+    // The raw number is never in the memory-scope header value.
+    expect(headers['X-Hermes-Session-Key']).not.toContain(caller);
+    // Per-call session id still flows alongside the memory scope.
+    expect(headers['X-Hermes-Session-Id']).toBe('patter-call-hid-1');
+  });
+
+  it('omits the memory-scope header when there is no caller', async () => {
+    const llm = new HermesLLM({ sessionKeyFrom: 'caller_hash' });
+    const headers = await inspectHeaders(llm, { callId: 'hid-1' });
+    expect(headers['X-Hermes-Session-Key']).toBeUndefined();
+    expect(headers['X-Hermes-Session-Id']).toBe('patter-call-hid-1');
+  });
+
+  it('an explicit sessionKeyFactory wins over sessionKeyFrom', async () => {
+    const llm = new HermesLLM({
+      sessionKeyFrom: 'caller_hash',
+      sessionKeyFactory: () => 'custom-scope',
+    });
+    const headers = await inspectHeaders(llm, {
+      callId: 'hid-1',
+      caller: '+15555550100',
+    });
+    expect(headers['X-Hermes-Session-Key']).toBe('custom-scope');
+  });
+
+  it('rejects an unknown sessionKeyFrom at runtime (parity with Python ValueError)', () => {
+    // A non-TypeScript / dynamic-JS caller can pass an invalid value the literal
+    // type would reject only at compile time; it must fail loudly, not silently.
+    expect(
+      () => new HermesLLM({ sessionKeyFrom: 'bogus' as unknown as 'caller_hash' }),
+    ).toThrow(/caller_hash/);
+  });
+});
+
+// ---------------------------------------------------------------------------
+// Caller threads through the REAL LLMLoop into the provider's stream()
+// ---------------------------------------------------------------------------
+
+/** Records the stream options it received and yields a single text chunk. */
+class RecordingProvider implements LLMProvider {
+  static readonly providerKey = 'recording';
+  public lastOpts: LLMStreamOptions | undefined;
+
+  async *stream(
+    _messages: Array<Record<string, unknown>>,
+    _tools?: Array<Record<string, unknown>> | null,
+    opts?: LLMStreamOptions,
+  ): AsyncGenerator<LLMChunk, void, unknown> {
+    this.lastOpts = opts;
+    yield { type: 'text', content: 'ok' };
+  }
+}
+
+async function drain(
+  gen: AsyncGenerator<string, void, unknown>,
+): Promise<string> {
+  let out = '';
+  for await (const tok of gen) out += tok;
+  return out;
+}
+
+describe('[unit] LLMLoop threads caller/callee into stream opts', () => {
+  it('forwards caller and callee from call_context into provider.stream opts', async () => {
+    const provider = new RecordingProvider();
+    const loop = new LLMLoop('', 'm', 'be helpful', null, provider);
+
+    const out = await drain(
+      loop.run('hi', [], {
+        call_id: 'xyz',
+        caller: '+15555550100',
+        callee: '+15555550101',
+      }),
+    );
+
+    expect(out).toBe('ok');
+    expect(provider.lastOpts?.callId).toBe('xyz');
+    expect(provider.lastOpts?.caller).toBe('+15555550100');
+    expect(provider.lastOpts?.callee).toBe('+15555550101');
+  });
+
+  it('leaves opts untouched when no call context is present', async () => {
+    const provider = new RecordingProvider();
+    const loop = new LLMLoop('', 'm', 'be helpful', null, provider);
+
+    await drain(loop.run('hi', [], {}));
+
+    expect(provider.lastOpts?.callId).toBeUndefined();
+    expect(provider.lastOpts?.caller).toBeUndefined();
+    expect(provider.lastOpts?.callee).toBeUndefined();
+  });
+
+  it('end-to-end: a Hermes caller_hash key reaches the wire via the real loop', async () => {
+    const llm = new HermesLLM({ sessionKeyFrom: 'caller_hash' });
+    const loop = new LLMLoop('', 'hermes-agent', 'be helpful', null, llm);
+
+    const { calls } = captureFetch();
+    const caller = '+15555550100';
+    await drain(loop.run('hi', [], { call_id: 'c9', caller, callee: '+15555550101' }));
+
+    const headers = calls[0].init.headers as Record<string, string>;
+    expect(headers['X-Hermes-Session-Key']).toBe(
+      `patter-caller-${hashCaller(caller)}`,
+    );
+  });
+});
diff --git a/libraries/typescript/tests/long-turn-filler.mocked.test.ts b/libraries/typescript/tests/long-turn-filler.mocked.test.ts
new file mode 100644
index 0000000..2729413
--- /dev/null
+++ b/libraries/typescript/tests/long-turn-filler.mocked.test.ts
@@ -0,0 +1,341 @@
+/**
+ * [mocked] Pipeline-mode opt-in long-turn filler (longTurnMessage, Feature #8).
+ *
+ * Exercises the REAL pipeline turn path:
+ *   STT final → processTranscript → runPipelineLlm → real LLMLoop.run →
+ *   provider.stream() is SLOW (agent runtime running tools) → the scheduled
+ *   long-turn filler fires after ``longTurnMessageAfterS`` → spoken via the same
+ *   per-sentence TTS primitive (synthesizeSentence) every normal sentence uses.
+ *
+ * AUTHENTIC: the StreamHandler, the real ``LLMLoop`` (constructed inside
+ * ``initPipeline`` from ``agent.llm``), the sentence chunker, the filler
+ * scheduling / cancellation, and the TTS-send path are REAL. The ONLY mocked
+ * surfaces are the two external boundaries:
+ *   1. The LLM provider's ``stream()`` — its TIMING / the paid gateway hop —
+ *      stubbed to delay (or not) before yielding text.
+ *   2. The TTS byte stream (ElevenLabsTTS ``synthesizeStream``) — replaced with
+ *      PCM Buffers so audio-out is observable.
+ * Everything inward runs unmodified.
+ */
+
+import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
+import { StreamHandler } from '../src/stream-handler';
+import type { TelephonyBridge, StreamHandlerDeps } from '../src/stream-handler';
+import { MetricsStore } from '../src/dashboard/store';
+import { RemoteMessageHandler } from '../src/remote-message';
+import type { AgentOptions } from '../src/types';
+import type { LLMProvider, LLMChunk, LLMStreamOptions } from '../src/llm-loop';
+import type { WebSocket as WSWebSocket } from 'ws';
+
+const FILLER = 'One moment while I look that up.';
+
+vi.mock('../src/providers/elevenlabs-tts', async (importOriginal) => {
+  const original =
+    await importOriginal<typeof import('../src/providers/elevenlabs-tts')>();
+  return {
+    ...original,
+    ElevenLabsTTS: vi.fn().mockImplementation(() => ({
+      synthesizeStream: vi.fn(async function* () {
+        yield Buffer.from('tts-audio');
+      }),
+    })),
+  };
+});
+
+vi.mock('../src/dashboard/persistence', () => ({
+  notifyDashboard: vi.fn(),
+}));
+
+import { ElevenLabsTTS } from '../src/providers/elevenlabs-tts';
+
+function makeMockWs(): WSWebSocket {
+  return {
+    send: vi.fn(),
+    close: vi.fn(),
+    on: vi.fn(),
+    once: vi.fn(),
+    readyState: 1,
+    removeListener: vi.fn(),
+    addEventListener: vi.fn(),
+    removeEventListener: vi.fn(),
+  } as unknown as WSWebSocket;
+}
+
+function makeMockStt() {
+  let transcriptCb:
+    | ((t: { isFinal?: boolean; text?: string }) => Promise<void>)
+    | undefined;
+  return {
+    connect: vi.fn().mockResolvedValue(undefined),
+    close: vi.fn(),
+    sendAudio: vi.fn(),
+    onTranscript: vi.fn(
+      (cb: (t: { isFinal?: boolean; text?: string }) => Promise<void>) => {
+        transcriptCb = cb;
+      },
+    ),
+    get requestId() {
+      return 'stt-filler-req';
+    },
+    emitTranscript(text: string): Promise<void> | undefined {
+      return transcriptCb?.({ isFinal: true, text });
+    },
+  };
+}
+
+function makeTwilioBridge(
+  mockStt: ReturnType<typeof makeMockStt>,
+): TelephonyBridge {
+  return {
+    label: 'Twilio',
+    telephonyProvider: 'twilio',
+    sendAudio: vi.fn(),
+    sendMark: vi.fn(),
+    sendClear: vi.fn(),
+    transferCall: vi.fn().mockResolvedValue(undefined),
+    endCall: vi.fn().mockResolvedValue(undefined),
+    createStt: vi.fn().mockReturnValue(mockStt),
+    queryTelephonyCost: vi.fn().mockResolvedValue(undefined),
+  } as unknown as TelephonyBridge;
+}
+
+/**
+ * A real ``LLMProvider`` whose ``stream()`` timing is the only mocked surface.
+ *  - ``delayMs``: wait this long before yielding the (single, complete) reply
+ *    sentence. A delay past ``longTurnMessageAfterS`` lets the filler fire; a
+ *    zero delay starts real audio before the filler timer.
+ */
+function makeSlowProvider(delayMs: number, reply = 'Here is your answer. '): LLMProvider {
+  return {
+    model: 'agent-runtime-1',
+    async *stream(
+      _messages: Array<Record<string, unknown>>,
+      _tools?: Array<Record<string, unknown>> | null,
+      _opts?: LLMStreamOptions,
+    ): AsyncGenerator<LLMChunk, void, unknown> {
+      if (delayMs > 0) {
+        await new Promise<void>((r) => setTimeout(r, delayMs));
+      }
+      yield { type: 'text', content: reply };
+    },
+  } as unknown as LLMProvider;
+}
+
+function setupTtsMock(): { calls: string[] } {
+  const calls: string[] = [];
+  const MockTTS = ElevenLabsTTS as unknown as ReturnType<typeof vi.fn>;
+  MockTTS.mockImplementation(() => ({
+    synthesizeStream: vi.fn(async function* (text: string) {
+      calls.push(text);
+      yield Buffer.from('pcm-chunk-1');
+      yield Buffer.from('pcm-chunk-2');
+    }),
+  }));
+  return { calls };
+}
+
+function makeDeps(
+  bridge: TelephonyBridge,
+  agentOverrides: Partial<AgentOptions>,
+): StreamHandlerDeps {
+  const mockTts = new (ElevenLabsTTS as unknown as new (
+    key: string,
+    voice?: string,
+  ) => { synthesizeStream: (t: string) => AsyncIterable<Buffer> })(
+    'el-key',
+    'rachel',
+  );
+  const agent: AgentOptions = {
+    systemPrompt: 'You are a test pipeline agent.',
+    provider: 'pipeline',
+    tts: mockTts as unknown as AgentOptions['tts'],
+    ...agentOverrides,
+  } as AgentOptions;
+  return {
+    config: {},
+    agent,
+    bridge,
+    metricsStore: new MetricsStore(),
+    pricing: null,
+    remoteHandler: new RemoteMessageHandler(),
+    recording: false,
+    buildAIAdapter: vi.fn(),
+    sanitizeVariables: vi.fn((raw: Record<string, unknown>) => {
+      const safe: Record<string, string> = {};
+      for (const [k, v] of Object.entries(raw)) safe[k] = String(v);
+      return safe;
+    }),
+    resolveVariables: vi.fn((tpl: string) => tpl),
+  } as unknown as StreamHandlerDeps;
+}
+
+describe('[mocked] pipeline long-turn filler (longTurnMessage)', () => {
+  beforeEach(() => {
+    vi.spyOn(globalThis, 'fetch').mockResolvedValue({
+      ok: true,
+      status: 200,
+      json: async () => ({}),
+      text: async () => '',
+    } as Response);
+    const MockTTS = ElevenLabsTTS as unknown as ReturnType<typeof vi.fn>;
+    MockTTS.mockClear();
+    MockTTS.mockImplementation(() => ({
+      synthesizeStream: vi.fn(async function* () {
+        yield Buffer.from('tts-audio');
+      }),
+    }));
+  });
+
+  afterEach(() => {
+    vi.restoreAllMocks();
+  });
+
+  it('speaks the filler when the turn is slow, then the real reply', async () => {
+    const stt = makeMockStt();
+    const bridge = makeTwilioBridge(stt);
+    const { calls: ttsCalls } = setupTtsMock();
+
+    // Provider takes 120 ms before its first word; the filler fires at ~20 ms.
+    const deps = makeDeps(bridge, {
+      llm: makeSlowProvider(120) as unknown as AgentOptions['llm'],
+      longTurnMessage: FILLER,
+      longTurnMessageAfterS: 0.02,
+    });
+    const handler = new StreamHandler(
+      deps,
+      makeMockWs(),
+      '+15551111111',
+      '+15552222222',
+    );
+
+    await handler.handleCallStart('CA-filler-slow');
+    await stt.emitTranscript('What is the weather?');
+
+    await vi.waitFor(() => expect(ttsCalls).toContain(FILLER), {
+      timeout: 5000,
+    });
+    // The filler was spoken FIRST, then the real reply — exactly one filler.
+    expect(ttsCalls.indexOf(FILLER)).toBe(0);
+    expect(ttsCalls).toContain('Here is your answer.');
+    expect(ttsCalls.filter((t) => t === FILLER)).toHaveLength(1);
+    expect(
+      (bridge.sendAudio as ReturnType<typeof vi.fn>).mock.calls.length,
+    ).toBeGreaterThanOrEqual(1);
+  }, 10000);
+
+  it('does NOT speak the filler when real audio starts quickly (no double-speak)', async () => {
+    const stt = makeMockStt();
+    const bridge = makeTwilioBridge(stt);
+    const { calls: ttsCalls } = setupTtsMock();
+
+    // Reply is immediate; the filler timer (500 ms) must be cleared before firing.
+    const deps = makeDeps(bridge, {
+      llm: makeSlowProvider(0) as unknown as AgentOptions['llm'],
+      longTurnMessage: FILLER,
+      longTurnMessageAfterS: 0.5,
+    });
+    const handler = new StreamHandler(
+      deps,
+      makeMockWs(),
+      '+15551111111',
+      '+15552222222',
+    );
+
+    await handler.handleCallStart('CA-filler-fast');
+    await stt.emitTranscript('What is the weather?');
+
+    await vi.waitFor(() => expect(ttsCalls).toContain('Here is your answer.'), {
+      timeout: 5000,
+    });
+    // Give the (cleared) timer ample chance to (not) fire.
+    await new Promise<void>((r) => setTimeout(r, 600));
+
+    expect(ttsCalls).not.toContain(FILLER);
+  }, 10000);
+
+  it('speaks NOTHING extra when longTurnMessage is unset (feature OFF)', async () => {
+    const stt = makeMockStt();
+    const bridge = makeTwilioBridge(stt);
+    const { calls: ttsCalls } = setupTtsMock();
+
+    // Slow turn but no longTurnMessage → no filler, behaviour unchanged.
+    const deps = makeDeps(bridge, {
+      llm: makeSlowProvider(80) as unknown as AgentOptions['llm'],
+    });
+    const handler = new StreamHandler(
+      deps,
+      makeMockWs(),
+      '+15551111111',
+      '+15552222222',
+    );
+
+    await handler.handleCallStart('CA-filler-off');
+    await stt.emitTranscript('What is the weather?');
+
+    await vi.waitFor(() => expect(ttsCalls).toContain('Here is your answer.'), {
+      timeout: 5000,
+    });
+    await new Promise<void>((r) => setTimeout(r, 150));
+
+    // Only the real reply — the filler never appears.
+    expect(ttsCalls).toEqual(['Here is your answer.']);
+  }, 10000);
+
+  it('authenticity: stubbing synthesizeSentence to throw makes the filler emit no audio', async () => {
+    const stt = makeMockStt();
+    const bridge = makeTwilioBridge(stt);
+    setupTtsMock();
+
+    const deps = makeDeps(bridge, {
+      llm: makeSlowProvider(120) as unknown as AgentOptions['llm'],
+      longTurnMessage: FILLER,
+      longTurnMessageAfterS: 0.02,
+    });
+    const handler = new StreamHandler(
+      deps,
+      makeMockWs(),
+      '+15551111111',
+      '+15552222222',
+    );
+
+    // Break the REAL speak primitive only for the filler line. Its scheduled
+    // call is wrapped in a .catch(), so the turn must not crash, but no filler
+    // PCM can reach the carrier — proving the positive test exercised the real
+    // synthesizeSentence rather than a mock.
+    const realSynth = (
+      handler as unknown as {
+        synthesizeSentence: (...args: unknown[]) => Promise<void>;
+      }
+    ).synthesizeSentence.bind(handler);
+    const synthSpy = vi
+      .spyOn(
+        handler as unknown as {
+          synthesizeSentence: (...args: unknown[]) => Promise<void>;
+        },
+        'synthesizeSentence',
+      )
+      .mockImplementation(async (...args: unknown[]) => {
+        if (args[0] === FILLER) {
+          throw new Error('synthesizeSentence disabled for filler');
+        }
+        return realSynth(...args);
+      });
+
+    await handler.handleCallStart('CA-filler-authentic');
+    await stt.emitTranscript('What is the weather?');
+
+    // The filler is attempted (and throws inside the spy); the real reply is still synthesized.
+    await vi.waitFor(() => expect(synthSpy).toHaveBeenCalledWith(
+      FILLER,
+      expect.anything(),
+      expect.anything(),
+      expect.anything(),
+    ), { timeout: 5000 });
+    await vi.waitFor(() => expect(synthSpy).toHaveBeenCalledWith(
+      'Here is your answer.',
+      expect.anything(),
+      expect.anything(),
+      expect.anything(),
+    ), { timeout: 5000 });
+  }, 10000);
+});

From 526d8feb6f40be5080e9cdb1d0630b3dc9db4eb7 Mon Sep 17 00:00:00 2001
From: nicolotognoni <nicolo.tognoni1@gmail.com>
Date: Sat, 6 Jun 2026 16:02:30 +0200
Subject: [PATCH 02/11] =?UTF-8?q?fix(pipeline):=20multi-turn=20turn-taking?=
 =?UTF-8?q?=20=E2=80=94=20tail-grace=20next-turn=20rescue=20+=20per-turn?=
 =?UTF-8?q?=20cancel-event=20reset?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Live-call bug: in pipeline mode (Twilio + Deepgram STT + ElevenLabs TTS + an
agent-LLM provider) the FIRST turn worked end-to-end but every subsequent turn
went silent, leaving a ghost metrics turn of user_text='' agent_text='[interrupted]'.

Two root causes in the pipeline turn-taking state machine, full Python/TypeScript parity:

1. Tail-grace misclassified the next turn as a barge-in.
   After the agent finishes TTS, _end_speaking_with_grace / endSpeakingWithGrace
   keeps _is_speaking=true for PATTER_TTS_TAIL_GRACE_MS (default 1500 ms) to
   swallow the fading echo tail. Humans reply in 200-700 ms — inside that window
   — so the user's next utterance was detected as a barge-in: it recorded an
   interrupted turn and the leading audio was withheld from STT (only a <=260 ms
   echo-contaminated ring), so no final transcript was produced and the agent
   never answered. New _tail_grace_active / tailGraceActive flag distinguishes
   "actively streaming TTS" from "post-TTS echo guard". A VAD speech_start OR a
   transcript during the tail grace now ends the grace and dispatches as a clean
   NEW turn via _end_tail_grace_for_new_turn / endTailGraceForNewTurn —
   recovering the leading audio from the ring instead of dropping it, with no
   spurious send_clear / record_turn_interrupted. Real barge-in during active
   TTS (tail_grace_active=false) is unchanged. The Python grace flip task is now
   tracked and cancelled (parity with TS clearGraceTimer) so at most one is in
   flight.

2. (Python) A barge-in's per-turn cancel event leaked into the next turn.
   _llm_cancel_event was recreated inside _process_streaming_response — AFTER
   LLMLoop.run had already captured the previous (still-set) event for the next
   turn — so the turn after any real barge-in bailed immediately. The reset
   moved to the top of _dispatch_turn, before dispatch; the event object is now
   stable through a turn (generator and consumption loop share it). TypeScript
   already allocates a fresh AbortController per turn in runPipelineLlm.
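
The ordering problem in root cause 2 can be reproduced in isolation. The
sketch below is a reduced model under stated assumptions, not the code in
stream_handler.py: `run`, `Handler`, and `turn_after_barge_in` are
hypothetical stand-ins for LLMLoop.run, the stream handler, and a turn that
follows a barge-in.

```python
import asyncio


async def run(cancel_event: asyncio.Event):
    """Hypothetical stand-in for LLMLoop.run: stops once the event is set."""
    for token in ("Hello", " there"):
        if cancel_event.is_set():
            return
        yield token
        await asyncio.sleep(0)


class Handler:
    """Hypothetical, heavily reduced model of the stream handler."""

    def __init__(self, reset_before_dispatch: bool) -> None:
        self.reset_before_dispatch = reset_before_dispatch
        self.cancel_event = asyncio.Event()

    def barge_in(self) -> None:
        # A barge-in sets the current turn's cancel event.
        self.cancel_event.set()

    async def dispatch_turn(self) -> str:
        if self.reset_before_dispatch:
            # Fixed order: fresh event before the generator captures it.
            self.cancel_event = asyncio.Event()
        stream = run(self.cancel_event)  # binds the current event object
        if not self.reset_before_dispatch:
            # Buggy order: the generator still holds the old, set event.
            self.cancel_event = asyncio.Event()
        return "".join([tok async for tok in stream])


async def turn_after_barge_in(reset_before_dispatch: bool) -> str:
    handler = Handler(reset_before_dispatch)
    handler.barge_in()  # the previous turn was interrupted
    return await handler.dispatch_turn()
```

With the late reset the generator keeps the set event and the turn yields
nothing; with the reset before dispatch the generator and the handler share
one fresh event for the whole turn.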

Tests: new test_pipeline_multiturn_tail_grace.py (6) +
pipeline-multiturn-tail-grace.mocked.test.ts (4) reproduce the bug and assert
the rescue, the flag lifecycle, the active-TTS barge-in regression guard, and
the fresh cancel-event. Python 2212 / TypeScript 1763 pass; tsc + build clean.
Adversarial review: 0 critical / 0 high.
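
The classification rule from root cause 1 above can be sketched as a small
state model. This is an illustration only: `SpeakingState` and its method
names are hypothetical simplifications of the handler's flags, and the real
handler also manages timers, the audio ring, and metrics.

```python
from dataclasses import dataclass


@dataclass
class SpeakingState:
    """Hypothetical, reduced model of the handler's speaking flags."""

    is_speaking: bool = False
    tail_grace_active: bool = False

    def begin_speaking(self) -> None:
        # Agent starts streaming TTS: a real barge-in target exists.
        self.is_speaking = True
        self.tail_grace_active = False

    def end_speaking_with_grace(self) -> None:
        # TTS finished; is_speaking stays held while the echo tail fades.
        self.tail_grace_active = True

    def grace_elapsed(self) -> None:
        # The grace window is over (timer fired or a new turn ended it).
        self.is_speaking = False
        self.tail_grace_active = False

    def on_speech_start(self) -> str:
        if self.is_speaking and not self.tail_grace_active:
            return "barge_in"
        # Idle, or only in the echo-guard window: end any grace, new turn.
        self.grace_elapsed()
        return "new_turn"
```

Only speech that starts while TTS is actively streaming counts as a
barge-in; speech during the grace window ends the grace and is treated as
the next turn.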
---
 CHANGELOG.md                                  |   6 +
 libraries/python/getpatter/stream_handler.py  | 130 +++++++++-
 .../test_pipeline_multiturn_tail_grace.py     | 244 ++++++++++++++++++
 libraries/typescript/src/stream-handler.ts    |  82 ++++++
 ...peline-multiturn-tail-grace.mocked.test.ts | 187 ++++++++++++++
 5 files changed, 642 insertions(+), 7 deletions(-)
 create mode 100644 libraries/python/tests/unit/test_pipeline_multiturn_tail_grace.py
 create mode 100644 libraries/typescript/tests/pipeline-multiturn-tail-grace.mocked.test.ts

diff --git a/CHANGELOG.md b/CHANGELOG.md
index b727355..6baed1d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,12 @@
 - **`session_key_factory` / `sessionKeyFactory` — per-call long-term memory scope from a caller hash.** `OpenAICompatibleLLM` (and `HermesLLM`) can derive the `X-Hermes-Session-Key` header per call from a `SessionContext` (`call_id` / `caller` / `callee` / `caller_hash`) instead of a static value, so an agent runtime can remember a caller across calls **without the raw phone number ever reaching the wire or the logs**. Shortcut `HermesLLM(session_key_from="caller_hash")` installs a default `patter-caller-<caller_hash>` factory (SHA-256, 16 hex chars). New public `SessionContext` + `hash_caller` / `hashCaller` helper. The factory takes precedence over the static `session_key`; a falsy return omits the header. The loop dispatch was generalised to thread `caller` / `callee` only to providers whose `stream()` declares them (or `**kwargs`), keeping built-in and minimal custom providers unchanged. `libraries/python/getpatter/models.py`, `.../llm/openai_compatible.py`, `.../llm/hermes.py`, `.../services/llm_loop.py` + TypeScript mirrors.
 - **`long_turn_message` / `longTurnMessage` — opt-in spoken filler during a slow turn.** When an LLM turn takes longer than `long_turn_message_after_s` (default 4 s) and no audio has reached the caller yet, Patter speaks a short configurable line (e.g. "One moment, let me check.") instead of dead silence — useful for agent runtimes (Hermes / OpenClaw) that run tools mid-turn. Distinct from `llm_error_message` (which fires on error): this fires on **slowness**, once per turn, gated on emitted audio so it never double-speaks. `None` / unset = off (no behaviour change). `libraries/python/getpatter/models.py`, `.../stream_handler.py`, `.../client.py` + TypeScript mirrors.
 
+### Fixed
+
+- **Multi-turn pipeline conversations no longer go silent after the first turn.** The agent answered the first turn but then ignored every subsequent utterance, leaving a ghost metrics turn of `user_text='' agent_text='[interrupted]'`. Two root causes in the pipeline turn-taking state machine:
+  - **Tail-grace misclassified the next turn as a barge-in.** After the agent finishes speaking, `_end_speaking_with_grace` keeps `_is_speaking=true` for `PATTER_TTS_TAIL_GRACE_MS` (default 1500 ms) to swallow the fading TTS echo tail. Humans reply in 200-700 ms — inside that window — so the user's next utterance was treated as a barge-in: it recorded an interrupted turn and the leading audio was withheld from STT (only a ≤260 ms echo-contaminated ring), so no final transcript was produced and the agent never answered. A new `_tail_grace_active` / `tailGraceActive` flag now distinguishes "actively streaming TTS" from "post-TTS echo guard"; a VAD `speech_start` (or a transcript) during the tail grace ends the grace and is dispatched as a clean new turn — recovering the leading audio from the ring instead of dropping it — with no spurious `record_turn_interrupted`. Tunable `PATTER_TTS_TAIL_GRACE_MS` (0 / 200 / 1500) is now safe for fast next-turn speech.
+  - **(Python) A barge-in's per-turn cancel event leaked into the next turn.** `_llm_cancel_event` was only recreated *inside* `_process_streaming_response` — after `LLMLoop.run` had already been handed the (still-set) event for the next turn — so the turn following any real barge-in bailed immediately. The event is now recreated at the top of `_dispatch_turn`, before dispatch (TypeScript already allocated a fresh `AbortController` per turn). `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
+
 ## 0.6.5 (2026-06-05)
 
 ### Added
diff --git a/libraries/python/getpatter/stream_handler.py b/libraries/python/getpatter/stream_handler.py
index 7967e54..beb368f 100644
--- a/libraries/python/getpatter/stream_handler.py
+++ b/libraries/python/getpatter/stream_handler.py
@@ -2348,6 +2348,20 @@ def __init__(
         self._auto_vad = None
         self._stt_task: asyncio.Task | None = None
         self._is_speaking = False
+        # True only while the post-TTS tail-grace window is pending: the
+        # agent has finished its turn but ``_is_speaking`` is still held for
+        # ``PATTER_TTS_TAIL_GRACE_MS`` to swallow the fading echo tail. A VAD
+        # ``speech_start`` (or a transcript) during this window is the user's
+        # NEXT turn, not a barge-in — there is nothing left to interrupt. Set
+        # by ``_end_speaking_with_grace``; cleared by ``_begin_speaking``, the
+        # grace flip, barge-in cancels, and ``_end_tail_grace_for_new_turn``.
+        self._tail_grace_active = False
+        # Handle to the scheduled grace-flip task so it can be cancelled
+        # (parity with TS ``clearGraceTimer``) — at most one pending at a
+        # time. The ``_speaking_generation`` guard already makes a stale flip
+        # a no-op; cancelling avoids leaving an idle ``asyncio.sleep`` task
+        # per turn on long, fast-turn calls.
+        self._grace_task: asyncio.Task | None = None
         # Per-turn LLM cancel event. Recreated on every new turn before LLM
         # consumption so a stale cancel from a previous turn cannot terminate
         # the next stream prematurely. Initialized here so the STT loop's
@@ -3195,12 +3209,14 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
             first_tts_chunk, hook_executor, hook_ctx
         )
 
-        # Reset the per-turn LLM cancel event so a stale cancel from a
-        # previous turn cannot terminate this stream prematurely.  The
-        # event is *set* by ``_handle_barge_in`` to break out of the
-        # consumption loop and close the generator (which propagates
-        # cancellation into the LLM provider's HTTP/WS connection).
-        self._llm_cancel_event = asyncio.Event()
+        # NOTE: the per-turn ``_llm_cancel_event`` is reset at the TOP of
+        # ``_dispatch_turn`` (before ``LLMLoop.run`` is handed the event), not
+        # here. Recreating it at this point — after ``run`` already captured
+        # the previous reference — used to leave the generator bound to a
+        # different event object than the consumption loop reads, and left a
+        # barge-in's set event leaking into the next turn. The event is *set*
+        # by ``_handle_barge_in`` to break out of the loop below and close the
+        # generator (propagating cancellation into the provider connection).
 
         interrupted = False
         llm_error = False
@@ -3426,6 +3442,17 @@ async def _handle_barge_in(self, transcript) -> None:
         """
         if not (transcript.text and self._is_speaking):
             return
+        # Defensive ``getattr`` — test fixtures build the handler via
+        # ``object.__new__`` and skip ``__init__`` (no tail-grace state).
+        if getattr(self, "_tail_grace_active", False):
+            # A transcript arriving during the post-TTS tail grace is the
+            # next turn, not a barge-in (the agent already finished). End the
+            # grace and return WITHOUT cancelling — the same transcript then
+            # flows on to ``_commit_transcript``/``_dispatch_turn`` as a
+            # normal new turn. Closes the race where a transcript lands
+            # before the VAD speech_start rescue fires.
+            await self._end_tail_grace_for_new_turn()
+            return
         if not self._can_barge_in():
             aec_state = "on" if getattr(self, "_aec", None) is not None else "off"
             logger.info(
@@ -3487,6 +3514,7 @@ async def _do_cancel_for_barge_in(self, transcript_text: str) -> None:
             {"patter.call.id": self.call_id},
         ):
             self._is_speaking = False
+            self._tail_grace_active = False
             self._speaking_started_at = None
             self._first_audio_sent_at = None
             self._last_cancel_at = time.time()
@@ -3671,6 +3699,18 @@ async def _dispatch_turn(self, transcript_text: str) -> None:
         """Run the post-commit pipeline (record STT → afterTranscribe →
         LLM dispatch → TTS → turn-complete) inline on the STT loop.
         """
+        # Reset the per-turn LLM cancel event BEFORE dispatch so a stale
+        # cancel set by a previous turn's barge-in (``_do_cancel_for_barge_in``
+        # calls ``cancel_event.set()``) cannot terminate this turn's LLM
+        # stream the instant it starts. This must happen before
+        # ``self._llm_loop.run(..., cancel_event=self._llm_cancel_event)`` is
+        # handed the event — recreating it later (inside
+        # ``_process_streaming_response``) was too late: ``run`` had already
+        # captured the set event, so the next turn after any barge-in went
+        # silent. Parity with TS, which allocates a fresh ``AbortController``
+        # per turn in ``runPipelineLlm``.
+        self._llm_cancel_event = asyncio.Event()
+
         # Record one STT span per final transcript turn. The span is
         # short-lived (just the attribute set) because STT is
         # streaming — we do not re-wrap the long-lived iterator.
@@ -3927,6 +3967,20 @@ async def on_audio_received(self, audio_bytes: bytes) -> None:
                 vad_event = None
             if vad_event is not None:
                 if vad_event.type == "speech_start":
+                    # Tail-grace new-turn rescue: the agent already finished
+                    # its turn and we are only in the post-TTS echo-guard
+                    # window. A VAD speech_start here is the user's next turn,
+                    # not a barge-in — end the grace synchronously so this
+                    # utterance flows to STT as a clean new turn instead of
+                    # being swallowed by the self-hearing guard or mislabelled
+                    # as an empty ``[interrupted]`` turn (the multi-turn
+                    # silence bug). After this ``_is_speaking`` is False, so
+                    # the if/elif below is a no-op and the frame falls through
+                    # to STT. Parity with TS ``endTailGraceForNewTurn``.
+                    if self._is_speaking and getattr(
+                        self, "_tail_grace_active", False
+                    ):
+                        await self._end_tail_grace_for_new_turn()
                     phantom_suppressed = self._is_speaking and not self._can_barge_in()
                     if phantom_suppressed:
                         # Within the per-turn warmup gate. With AEC on
@@ -4107,6 +4161,10 @@ async def _begin_speaking(self, is_first_message: bool = False) -> None:
                 await asyncio.sleep(remaining)
         self._speaking_generation += 1
         self._is_speaking = True
+        # A fresh turn is actively streaming — not in the post-TTS echo
+        # window. Clear the tail-grace flag so a VAD speech_start during this
+        # turn is treated as a real barge-in (not a new-turn rescue).
+        self._tail_grace_active = False
         self._speaking_started_at = time.time()
         # Stamp ``_first_audio_sent_at`` synchronously for EVERY turn so the
         # ``_can_barge_in()`` gate (250 ms anti-flicker for PSTN no-AEC) runs
@@ -4207,6 +4265,7 @@ async def _end_speaking_with_grace(self) -> None:
         # ``_begin_speaking()``.
         if grace_ms <= 0:
             self._is_speaking = False
+            self._tail_grace_active = False
             self._speaking_started_at = None
             self._first_audio_sent_at = None
             self._clear_pending_barge_in()
@@ -4218,6 +4277,14 @@ async def _end_speaking_with_grace(self) -> None:
             return
 
         gen = self._speaking_generation
+        # The agent has finished pushing audio; we now hold ``_is_speaking``
+        # only to suppress the fading echo tail. Mark this as the tail-grace
+        # window so fast next-turn speech is rescued as a new turn rather
+        # than mis-detected as a barge-in.
+        self._tail_grace_active = True
+        # Cancel any still-pending flip from a previous turn so at most one
+        # grace task is ever in flight (parity with TS ``clearGraceTimer``).
+        self._clear_grace_task()
 
         async def _flip_after_grace() -> None:
             try:
@@ -4226,6 +4293,7 @@ async def _flip_after_grace() -> None:
                 # newer turn would have bumped ``_speaking_generation``.
                 if self._speaking_generation == gen:
                     self._is_speaking = False
+                    self._tail_grace_active = False
                     self._speaking_started_at = None
                     self._first_audio_sent_at = None
                     self._clear_pending_barge_in()
@@ -4242,7 +4310,52 @@ async def _flip_after_grace() -> None:
             except Exception as exc:  # pragma: no cover - defensive
                 logger.debug("tts grace flip failed: %s", exc)
 
-        asyncio.create_task(_flip_after_grace())
+        self._grace_task = asyncio.create_task(_flip_after_grace())
+
+    def _clear_grace_task(self) -> None:
+        """Cancel the pending grace-flip task, if any. Idempotent; safe from
+        test fixtures built via ``object.__new__`` (no ``__init__``)."""
+        task = getattr(self, "_grace_task", None)
+        if task is not None and not task.done():
+            task.cancel()
+        self._grace_task = None
+
+    async def _end_tail_grace_for_new_turn(self) -> None:
+        """End the post-TTS tail-grace window because the user has begun
+        their next turn.
+
+        Unlike a barge-in, the agent's response already played out in full —
+        there is nothing to cancel and no turn was interrupted. We flip the
+        speaking flag off (bumping ``_speaking_generation`` so the scheduled
+        grace-flip task no-ops), recover any leading audio the self-hearing
+        guard captured into the ring (the user's first ~250 ms, which VAD
+        needed before it could emit ``speech_start``), and let the live STT
+        stream take over. Crucially we do NOT call ``send_clear``,
+        ``record_bargein_detected`` or ``record_turn_interrupted`` — none of
+        those apply to a turn that completed normally.
+
+        Without this, fast next-turn speech (humans reply in 200-700 ms, well
+        inside the 1500 ms default grace) is withheld from STT and recorded
+        as an empty ``[interrupted]`` turn, after which the agent goes silent
+        for the rest of the call.
+        """
+        self._is_speaking = False
+        self._tail_grace_active = False
+        self._speaking_started_at = None
+        self._first_audio_sent_at = None
+        # Invalidate the pending grace-flip task scheduled by
+        # ``_end_speaking_with_grace`` so it cannot later flip state on a turn
+        # that has already moved on (bump the generation AND cancel the task —
+        # parity with TS ``clearGraceTimer``).
+        self._speaking_generation += 1
+        self._clear_grace_task()
+        self._clear_pending_barge_in()
+        await self._reset_barge_in_strategies()
+        # Recover the user's leading words. Same rationale as the barge-in
+        # flush — but here it is the only audio recovery, since the agent
+        # already stopped and no new TTS will overwrite it.
+        self._suppressed_speech_pending = False
+        await self._flush_inbound_audio_ring()
 
     async def _reset_barge_in_strategies(self) -> None:
         if not self._barge_in_strategies:
@@ -4544,6 +4657,9 @@ async def cleanup(self) -> None:
         # spurious overlap_end events. Idempotent: safe to call when no
         # pending state exists.
         self._clear_pending_barge_in()
+        # Cancel any pending tail-grace flip task so it does not sleep past
+        # teardown and touch a finalised handler.
+        self._clear_grace_task()
         # Resolve every pending firstMessage mark future before tearing
         # down adapters. Without this, a call that ends abnormally mid
         # firstMessage (carrier WS drop, hangup during the paced sender)
diff --git a/libraries/python/tests/unit/test_pipeline_multiturn_tail_grace.py b/libraries/python/tests/unit/test_pipeline_multiturn_tail_grace.py
new file mode 100644
index 0000000..912713d
--- /dev/null
+++ b/libraries/python/tests/unit/test_pipeline_multiturn_tail_grace.py
@@ -0,0 +1,244 @@
+"""Multi-turn regression tests for the pipeline turn-taking state machine.
+
+Reproduces the live-call failure where the *first* turn works end-to-end but
+every *subsequent* turn goes silent, leaving a ghost metrics turn of
+``user_text='' agent_text='[interrupted]'``.
+
+Root causes covered here:
+
+1. **Tail-grace misclassification.** After the agent finishes a turn,
+   ``_end_speaking_with_grace`` keeps ``_is_speaking=True`` for
+   ``PATTER_TTS_TAIL_GRACE_MS`` (default 1500 ms) to swallow the fading TTS
+   echo tail. Humans reply in 200-700 ms — well inside that window — so the
+   user's next utterance was being mis-detected as a *barge-in*:
+   ``record_turn_interrupted`` fired (the ``[interrupted]`` ghost) and the
+   leading audio was withheld from STT (only a <=260 ms echo-contaminated
+   ring), so no final transcript was produced and the agent never answered.
+   The fix treats a VAD ``speech_start`` (or a transcript) during the tail
+   grace as the start of a NEW turn, not a barge-in.
+
+2. **Stale ``_llm_cancel_event``.** A real barge-in sets the per-turn cancel
+   event; it was only recreated *inside* ``_process_streaming_response`` —
+   AFTER ``LLMLoop.run`` had already been handed the (now set) event for the
+   next turn. The next turn's LLM stream then bailed immediately. The fix
+   recreates the event at the top of ``_dispatch_turn``, before dispatch.
+
+Only the external boundary is mocked (STT/TTS/audio sender). The VAD is a
+scripted in-process double so the on_audio_received path runs unmocked.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import os
+import time
+from collections import deque
+from typing import AsyncIterator
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from getpatter.providers.base import VADEvent
+from getpatter.stream_handler import PipelineStreamHandler
+
+from tests.conftest import make_agent
+
+
+# ---------------------------------------------------------------------------
+# Scripted in-process VAD — emits a caller-supplied event per frame
+# ---------------------------------------------------------------------------
+
+
+class _ScriptedVAD:
+    """Returns the next queued VADEvent (or None) on each ``process_frame``."""
+
+    def __init__(self, events: list[VADEvent | None]) -> None:
+        self._events = list(events)
+        self.reset_calls = 0
+
+    async def process_frame(self, pcm: bytes, sample_rate: int) -> VADEvent | None:
+        if self._events:
+            return self._events.pop(0)
+        return None
+
+    async def close(self) -> None:  # pragma: no cover - not exercised
+        pass
+
+    def reset(self) -> None:
+        self.reset_calls += 1
+
+
+def _make_pipeline_handler(*, metrics: MagicMock | None = None) -> PipelineStreamHandler:
+    audio_sender = AsyncMock()
+    handler = PipelineStreamHandler(
+        agent=make_agent(),
+        audio_sender=audio_sender,
+        call_id="call-multiturn",
+        caller="+15551110000",
+        callee="+15552220000",
+        resolved_prompt="p",
+        metrics=metrics,
+        for_twilio=True,
+        on_transcript=None,
+        conversation_history=deque(maxlen=20),
+        transcript_entries=deque(maxlen=20),
+    )
+    handler.on_message = None
+    handler._llm_loop = None
+    handler._stt = AsyncMock()
+    handler._aec = None
+    # Treat inbound as already-PCM16 16 kHz so on_audio_received skips the
+    # mulaw decode path (the scripted VAD ignores the bytes anyway).
+    handler._input_is_mulaw_8k = False
+    return handler
+
+
+def _enter_tail_grace(handler: PipelineStreamHandler) -> None:
+    """Put the handler into the post-TTS tail-grace window: the agent has
+    finished speaking but ``_is_speaking`` is still held for echo suppression.
+    ``_first_audio_sent_at`` is stamped in the past so ``_can_barge_in`` is
+    True (the warmup gate elapsed) — exactly the state that produced the bug.
+    """
+    handler._is_speaking = True
+    handler._tail_grace_active = True
+    handler._speaking_generation = 1
+    handler._speaking_started_at = time.time() - 2.0
+    handler._first_audio_sent_at = time.time() - 2.0
+    handler._inbound_audio_ring = []
+
+
+_FRAME = b"\x00\x01" * 160  # arbitrary PCM16 bytes; scripted VAD ignores content
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestTailGraceNewTurn:
+    """Speech during the tail grace is a new turn, not a barge-in."""
+
+    async def test_speech_during_tail_grace_reaches_stt_without_interrupt(self) -> None:
+        metrics = MagicMock()
+        handler = _make_pipeline_handler(metrics=metrics)
+        handler._auto_vad = _ScriptedVAD(
+            [None, None, VADEvent(type="speech_start"), None]
+        )
+        _enter_tail_grace(handler)
+
+        # Two leading frames while still in tail grace → buffered to ring,
+        # NOT yet forwarded to STT.
+        await handler.on_audio_received(_FRAME)
+        await handler.on_audio_received(_FRAME)
+        assert handler._stt.send_audio.await_count == 0
+        assert len(handler._inbound_audio_ring) == 2
+
+        # VAD speech_start fires → tail grace ends as a NEW TURN.
+        await handler.on_audio_received(_FRAME)
+
+        # Not a barge-in: the agent already finished, nothing was interrupted.
+        metrics.record_bargein_detected.assert_not_called()
+        handler.audio_sender.send_clear.assert_not_awaited()
+        assert handler._is_speaking is False
+        assert handler._tail_grace_active is False
+
+        # Leading audio recovered (ring flushed) + the trigger frame sent live.
+        assert handler._stt.send_audio.await_count >= 3
+
+        # A following frame now streams straight through to STT.
+        await handler.on_audio_received(_FRAME)
+        assert handler._stt.send_audio.await_count >= 4
+
+    async def test_active_tts_speech_still_barges_in(self) -> None:
+        """Regression guard: speech during *active* TTS (not tail grace) must
+        still trigger a real barge-in."""
+        metrics = MagicMock()
+        handler = _make_pipeline_handler(metrics=metrics)
+        handler._auto_vad = _ScriptedVAD([VADEvent(type="speech_start")])
+        # Active TTS: speaking, but NOT in the post-completion tail grace.
+        handler._is_speaking = True
+        handler._tail_grace_active = False
+        handler._speaking_generation = 1
+        handler._speaking_started_at = time.time() - 2.0
+        handler._first_audio_sent_at = time.time() - 2.0
+        handler._inbound_audio_ring = []
+
+        await handler.on_audio_received(_FRAME)
+
+        metrics.record_bargein_detected.assert_called_once()
+        handler.audio_sender.send_clear.assert_awaited_once()
+        assert handler._is_speaking is False
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestTailGraceFlagLifecycle:
+    """``_tail_grace_active`` tracks the post-TTS grace window precisely."""
+
+    async def test_begin_speaking_clears_flag(self) -> None:
+        handler = _make_pipeline_handler()
+        handler._tail_grace_active = True
+        await handler._begin_speaking()
+        assert handler._is_speaking is True
+        assert handler._tail_grace_active is False
+
+    async def test_grace_sets_then_clears_flag(self, monkeypatch) -> None:
+        monkeypatch.setenv("PATTER_TTS_TAIL_GRACE_MS", "20")
+        handler = _make_pipeline_handler()
+        await handler._begin_speaking()
+        handler._first_audio_sent_at = time.time() - 1.0
+
+        await handler._end_speaking_with_grace()
+        # Grace pending: still "speaking" but flagged as tail grace.
+        assert handler._is_speaking is True
+        assert handler._tail_grace_active is True
+
+        await asyncio.sleep(0.06)  # > 20 ms grace
+        assert handler._is_speaking is False
+        assert handler._tail_grace_active is False
+
+    async def test_zero_grace_does_not_enter_tail_grace(self, monkeypatch) -> None:
+        monkeypatch.setenv("PATTER_TTS_TAIL_GRACE_MS", "0")
+        handler = _make_pipeline_handler()
+        await handler._begin_speaking()
+        await handler._end_speaking_with_grace()
+        assert handler._is_speaking is False
+        assert handler._tail_grace_active is False
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestLlmCancelEventReset:
+    """A barge-in's cancel event must not leak into the next turn's dispatch."""
+
+    async def test_dispatch_uses_fresh_cancel_event(self) -> None:
+        handler = _make_pipeline_handler()
+
+        captured: dict = {}
+
+        async def _fake_stream():
+            yield "hello "
+
+        class _FakeLoop:
+            def run(self, text, history, ctx, *, cancel_event=None, **kwargs):
+                # Record whether the event handed to the LLM was already set
+                # (i.e. leaked from a previous turn's barge-in).
+                captured["was_set"] = bool(cancel_event and cancel_event.is_set())
+                return _fake_stream()
+
+        handler._llm_loop = _FakeLoop()
+
+        # Avoid the real TTS speak path: stub the response processor.
+        async def _fake_process(result, call_id):
+            # Drain the generator so the loop's run() actually executed.
+            async for _ in result:
+                pass
+            return "hello"
+
+        handler._process_streaming_response = _fake_process  # type: ignore[assignment]
+        handler._emit_assistant_transcript = AsyncMock()
+
+        # Simulate a stale cancel left set by a previous turn's barge-in.
+        handler._llm_cancel_event.set()
+        assert handler._llm_cancel_event.is_set() is True
+
+        await handler._dispatch_turn("dimmi che ore sono")
+
+        assert captured.get("was_set") is False
diff --git a/libraries/typescript/src/stream-handler.ts b/libraries/typescript/src/stream-handler.ts
index 342676c..fefd056 100644
--- a/libraries/typescript/src/stream-handler.ts
+++ b/libraries/typescript/src/stream-handler.ts
@@ -417,6 +417,17 @@ export class StreamHandler {
   private stt: STTAdapter | null = null;
   private tts: TTSAdapter | null = null;
   private isSpeaking = false;
+  /**
+   * True only while the post-TTS tail-grace window is pending: the agent has
+   * finished its turn but ``isSpeaking`` is still held for
+   * ``PATTER_TTS_TAIL_GRACE_MS`` to swallow the fading echo tail. A VAD
+   * ``speech_start`` (or a transcript) during this window is the user's NEXT
+   * turn, not a barge-in — there is nothing left to interrupt. Set by
+   * ``endSpeakingWithGrace``; cleared by ``beginSpeaking``, the grace flip (or
+   * the zero-grace path), ``cancelSpeaking``, and ``endTailGraceForNewTurn``.
+   * Parity with Python ``_tail_grace_active``.
+   */
+  private tailGraceActive = false;
   /**
    * Ring buffer of inbound PCM16 16 kHz frames captured while the agent
    * is speaking and the self-hearing guard is dropping audio. On
@@ -615,6 +626,10 @@ export class StreamHandler {
     }
     this.speakingGeneration++;
     this.isSpeaking = true;
+    // A fresh turn is actively streaming — not in the post-TTS echo window.
+    // Clear the tail-grace flag so a VAD speech_start during this turn is
+    // treated as a real barge-in (not a new-turn rescue).
+    this.tailGraceActive = false;
     this.speakingStartedAt = Date.now();
     this.suppressedSpeechPending = false;
     // Stamp ``firstAudioSentAt`` synchronously for EVERY turn so the
@@ -670,6 +685,7 @@ export class StreamHandler {
   private cancelSpeaking(): void {
     this.speakingGeneration++; // invalidates pending grace timers
     this.isSpeaking = false;
+    this.tailGraceActive = false;
     this.speakingStartedAt = null;
     this.firstAudioSentAt = null;
     this.lastCancelAt = Date.now();
@@ -782,10 +798,16 @@ export class StreamHandler {
     if (grace > 0) {
       const gen = this.speakingGeneration;
       this.clearGraceTimer();
+      // The agent has finished pushing audio; ``isSpeaking`` is now held only
+      // to suppress the fading echo tail. Mark the tail-grace window so fast
+      // next-turn speech is rescued as a new turn rather than mis-detected as
+      // a barge-in.
+      this.tailGraceActive = true;
       this.graceTimer = setTimeout(() => {
         this.graceTimer = null;
         if (this.speakingGeneration === gen) {
           this.isSpeaking = false;
+          this.tailGraceActive = false;
           this.speakingStartedAt = null;
           this.firstAudioSentAt = null;
           this.clearPendingBargeIn();
@@ -806,6 +828,7 @@ export class StreamHandler {
       }, grace);
     } else {
       this.isSpeaking = false;
+      this.tailGraceActive = false;
       this.speakingStartedAt = null;
       this.firstAudioSentAt = null;
       this.clearPendingBargeIn();
@@ -818,6 +841,38 @@ export class StreamHandler {
     }
   }
 
+  /**
+   * End the post-TTS tail-grace window because the user has begun their next
+   * turn. Unlike a barge-in, the agent's response already played out in full
+   * — there is nothing to cancel and no turn was interrupted. We flip the
+   * speaking flag off (bumping ``speakingGeneration`` so the scheduled grace
+   * timer no-ops), recover any leading audio the self-hearing guard captured
+   * into the ring (the user's first ~250 ms, which VAD needed before it could
+   * emit ``speech_start``), and let the live STT stream take over. We do NOT
+   * call ``sendClear``, ``recordBargeinDetected`` or ``recordTurnInterrupted``
+   * — none apply to a turn that completed normally.
+   *
+   * Without this, fast next-turn speech (humans reply in 200-700 ms, well
+   * inside the 1500 ms default grace) is withheld from STT and recorded as an
+   * empty ``[interrupted]`` turn, after which the agent goes silent for the
+   * rest of the call. Parity with Python ``_end_tail_grace_for_new_turn``.
+   */
+  private endTailGraceForNewTurn(): void {
+    this.isSpeaking = false;
+    this.tailGraceActive = false;
+    this.speakingStartedAt = null;
+    this.firstAudioSentAt = null;
+    this.speakingGeneration++; // invalidates the pending grace timer
+    this.clearGraceTimer();
+    this.clearPendingBargeIn();
+    void this.resetBargeInStrategies();
+    // Recover the user's leading words. Same rationale as the barge-in flush
+    // — but here it is the only audio recovery, since the agent already
+    // stopped and no new TTS will overwrite it.
+    this.suppressedSpeechPending = false;
+    this.flushInboundAudioRing();
+  }
+
   private async resetBargeInStrategies(): Promise<void> {
     if (this.bargeInStrategies.length === 0) return;
     const { resetStrategies } = await import('./services/barge-in-strategies.js');
@@ -1427,6 +1482,18 @@ export class StreamHandler {
             );
           }
           if (evt?.type === 'speech_start') {
+            // Tail-grace new-turn rescue: the agent already finished its turn
+            // and we are only in the post-TTS echo-guard window. A VAD
+            // speech_start here is the user's next turn, not a barge-in — end
+            // the grace so this utterance flows to STT as a clean new turn
+            // instead of being swallowed by the self-hearing guard or
+            // mislabelled as an empty ``[interrupted]`` turn (the multi-turn
+            // silence bug). After this ``isSpeaking`` is false, so the
+            // if/else below is a no-op and the frame falls through to STT.
+            // Parity with Python ``_end_tail_grace_for_new_turn``.
+            if (this.isSpeaking && this.tailGraceActive) {
+              this.endTailGraceForNewTurn();
+            }
             const phantomSuppressed = this.isSpeaking && !this.canBargeIn();
             if (phantomSuppressed) {
               // Within the per-turn warmup gate. With AEC on this is the
@@ -2518,6 +2585,15 @@ export class StreamHandler {
     isFinal?: boolean;
   }): Promise<boolean> {
     if (!transcript.text || !this.isSpeaking) return false;
+    if (this.tailGraceActive) {
+      // A transcript during the post-TTS tail grace is the next turn, not a
+      // barge-in (the agent already finished). End the grace and return
+      // WITHOUT cancelling — the same transcript then flows on to dispatch as
+      // a normal new turn. Closes the race where a transcript lands before
+      // the VAD speech_start rescue fires.
+      this.endTailGraceForNewTurn();
+      return false;
+    }
     if (!this.canBargeIn()) {
       getLogger().info(
         `Barge-in transcript suppressed (agent speaking < gate, aec=${this.aec ? 'on' : 'off'})`,
@@ -2560,6 +2636,12 @@ export class StreamHandler {
    */
   private handleBargeIn(transcript: { text?: string; isFinal?: boolean }): boolean {
     if (!transcript.text || !this.isSpeaking) return false;
+    if (this.tailGraceActive) {
+      // Tail-grace transcript = next turn, not a barge-in. End the grace and
+      // let the transcript dispatch normally (parity with the async path).
+      this.endTailGraceForNewTurn();
+      return false;
+    }
     if (this.bargeInStrategies.length === 0) {
       // Legacy synchronous path — preserve exact byte-for-byte behaviour
       // for users who haven't opted into the confirm pipeline.
diff --git a/libraries/typescript/tests/pipeline-multiturn-tail-grace.mocked.test.ts b/libraries/typescript/tests/pipeline-multiturn-tail-grace.mocked.test.ts
new file mode 100644
index 0000000..0a13846
--- /dev/null
+++ b/libraries/typescript/tests/pipeline-multiturn-tail-grace.mocked.test.ts
@@ -0,0 +1,187 @@
+/**
+ * [mocked] Multi-turn turn-taking — tail-grace new-turn rescue (parity with
+ * Python ``test_pipeline_multiturn_tail_grace.py``).
+ *
+ * Reproduces the live-call failure where the FIRST turn works but every
+ * SUBSEQUENT turn goes silent with a ghost ``[interrupted]`` metrics turn.
+ *
+ * Root cause: after the agent finishes, ``endSpeakingWithGrace`` keeps
+ * ``isSpeaking=true`` for ``PATTER_TTS_TAIL_GRACE_MS`` (default 1500 ms) to
+ * swallow the fading echo tail. Humans reply in 200-700 ms, inside that
+ * window, so the user's next utterance was mis-detected as a barge-in
+ * (``recordTurnInterrupted`` + leading audio withheld from STT). As a result no
+ * transcript was produced and the agent never answered. The fix treats a VAD
+ * ``speech_start`` (or a transcript) during the tail grace as a NEW turn.
+ *
+ * AUTHENTIC: the real StreamHandler + CallMetricsAccumulator. Mocked only at
+ * the external boundary (telephony bridge, STT adapter). Private state/methods
+ * are exercised via casts — the same surface the audio/STT loops drive.
+ */
+import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
+import { StreamHandler } from '../src/stream-handler';
+import type { TelephonyBridge, StreamHandlerDeps } from '../src/stream-handler';
+import { MetricsStore } from '../src/dashboard/store';
+import { RemoteMessageHandler } from '../src/remote-message';
+import type { WebSocket as WSWebSocket } from 'ws';
+import type { AgentOptions } from '../src/types';
+
+function makeMockWs(): WSWebSocket {
+  return {
+    send: vi.fn(),
+    close: vi.fn(),
+    on: vi.fn(),
+    once: vi.fn(),
+    readyState: 1,
+    removeListener: vi.fn(),
+    addEventListener: vi.fn(),
+    removeEventListener: vi.fn(),
+  } as unknown as WSWebSocket;
+}
+
+function makeBridge(): TelephonyBridge {
+  return {
+    label: 'Twilio',
+    telephonyProvider: 'twilio',
+    sendAudio: vi.fn(),
+    sendMark: vi.fn(),
+    sendClear: vi.fn(),
+    transferCall: vi.fn().mockResolvedValue(undefined),
+    endCall: vi.fn().mockResolvedValue(undefined),
+    createStt: vi.fn().mockReturnValue(null),
+    queryTelephonyCost: vi.fn().mockResolvedValue(undefined),
+  } as unknown as TelephonyBridge;
+}
+
+function makeDeps(bridge: TelephonyBridge, store: MetricsStore): StreamHandlerDeps {
+  const agent: AgentOptions = {
+    systemPrompt: 'You are a helpful test agent.',
+    provider: 'pipeline',
+    model: 'gpt-4o-mini',
+    voice: 'alloy',
+  };
+  return {
+    config: {},
+    agent,
+    bridge,
+    metricsStore: store,
+    pricing: null,
+    remoteHandler: new RemoteMessageHandler(),
+    recording: false,
+    buildAIAdapter: vi.fn().mockReturnValue(null),
+    sanitizeVariables: vi.fn((raw: Record<string, unknown>) => raw),
+    resolveVariables: vi.fn((tpl: string) => tpl),
+  } as unknown as StreamHandlerDeps;
+}
+
+function makeHandler(): StreamHandler {
+  const handler = new StreamHandler(
+    makeDeps(makeBridge(), new MetricsStore()),
+    makeMockWs(),
+    '+15551110000',
+    '+15552220000',
+  );
+  handler.setStreamSid('MZ-multiturn');
+  return handler;
+}
+
+const FRAME = Buffer.from([0, 1, 0, 1, 0, 1]);
+
+describe('[mocked] pipeline multi-turn tail-grace rescue', () => {
+  afterEach(() => {
+    vi.useRealTimers();
+  });
+
+  it('endSpeakingWithGrace flags the tail grace, the grace timer clears it', () => {
+    vi.useFakeTimers();
+    const h = makeHandler() as unknown as {
+      isSpeaking: boolean;
+      tailGraceActive: boolean;
+      endSpeakingWithGrace(): void;
+    };
+    h.isSpeaking = true;
+    h.endSpeakingWithGrace();
+    // Grace pending: still "speaking" but flagged as tail grace.
+    expect(h.isSpeaking).toBe(true);
+    expect(h.tailGraceActive).toBe(true);
+
+    vi.advanceTimersByTime(1600); // > 1500 ms default grace
+    expect(h.isSpeaking).toBe(false);
+    expect(h.tailGraceActive).toBe(false);
+  });
+
+  it('beginSpeaking clears the tail-grace flag', async () => {
+    const h = makeHandler();
+    const priv = h as unknown as { tailGraceActive: boolean; isSpeaking: boolean };
+    priv.tailGraceActive = true;
+    await (h as unknown as { beginSpeaking(): Promise<void> }).beginSpeaking();
+    expect(priv.isSpeaking).toBe(true);
+    expect(priv.tailGraceActive).toBe(false);
+  });
+
+  it('a transcript during the tail grace is a new turn, not a barge-in', () => {
+    const bridge = makeBridge();
+    const handler = new StreamHandler(
+      makeDeps(bridge, new MetricsStore()),
+      makeMockWs(),
+      '+15551110000',
+      '+15552220000',
+    );
+    const sttSendAudio = vi.fn();
+    const priv = handler as unknown as {
+      isSpeaking: boolean;
+      tailGraceActive: boolean;
+      speakingStartedAt: number | null;
+      firstAudioSentAt: number | null;
+      inboundAudioRing: Buffer[];
+      stt: unknown;
+      handleBargeIn(t: { text?: string; isFinal?: boolean }): boolean;
+    };
+    priv.stt = { sendAudio: sttSendAudio, finalize: vi.fn() };
+    priv.isSpeaking = true;
+    priv.tailGraceActive = true;
+    priv.speakingStartedAt = Date.now() - 2000;
+    priv.firstAudioSentAt = Date.now() - 2000;
+    priv.inboundAudioRing = [FRAME, FRAME];
+
+    const interrupted = priv.handleBargeIn({ text: 'che ore sono', isFinal: true });
+
+    expect(interrupted).toBe(false);
+    expect(priv.isSpeaking).toBe(false);
+    expect(priv.tailGraceActive).toBe(false);
+    // Not a barge-in: nothing was cleared on the carrier.
+    expect(bridge.sendClear).not.toHaveBeenCalled();
+    // Leading audio recovered: the ring was flushed to STT.
+    expect(sttSendAudio).toHaveBeenCalledTimes(2);
+  });
+
+  it('a transcript during ACTIVE TTS still triggers a real barge-in', () => {
+    const bridge = makeBridge();
+    const handler = new StreamHandler(
+      makeDeps(bridge, new MetricsStore()),
+      makeMockWs(),
+      '+15551110000',
+      '+15552220000',
+    );
+    const priv = handler as unknown as {
+      isSpeaking: boolean;
+      tailGraceActive: boolean;
+      speakingStartedAt: number | null;
+      firstAudioSentAt: number | null;
+      inboundAudioRing: Buffer[];
+      stt: unknown;
+      handleBargeIn(t: { text?: string; isFinal?: boolean }): boolean;
+    };
+    priv.stt = { sendAudio: vi.fn(), finalize: vi.fn() };
+    priv.isSpeaking = true;
+    priv.tailGraceActive = false; // active TTS, NOT the post-completion grace
+    priv.speakingStartedAt = Date.now() - 2000;
+    priv.firstAudioSentAt = Date.now() - 2000;
+    priv.inboundAudioRing = [];
+
+    const interrupted = priv.handleBargeIn({ text: 'actually wait', isFinal: true });
+
+    expect(interrupted).toBe(true);
+    expect(priv.isSpeaking).toBe(false);
+    expect(bridge.sendClear).toHaveBeenCalled();
+  });
+});

From c34a7db4a6961bd1d118d659cb1ba99c31d6d6a8 Mon Sep 17 00:00:00 2001
From: nicolotognoni <nicolo.tognoni1@gmail.com>
Date: Sat, 6 Jun 2026 19:55:55 +0200
Subject: [PATCH 03/11] =?UTF-8?q?fix(pipeline):=20barge-in=20during=20a=20?=
 =?UTF-8?q?turn=20=E2=80=94=20decoupled=20dispatch=20+=20pre-first-token?=
 =?UTF-8?q?=20abort=20(Hermes/OpenClaw)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The caller could not interrupt the agent mid-response. The STT receive loop
awaited the turn's LLM+TTS dispatch inline (`await self._dispatch_turn(...)` /
`await this.runPipelineLlm(...)`), so during a long (30-90 s) Hermes/OpenClaw
tool-running turn it stopped reading transcripts — a barge-in transcript ("ferma",
Italian for "stop") was only processed AFTER the turn ended. On PSTN with
echo-masked/unreliable VAD, the transcript path is the only barge-in fallback and
it was structurally dead.

Three coordinated changes, full Python/TypeScript parity:

1. Decoupled single-in-flight dispatch. The turn runs as one tracked background
   task (_dispatch_task / dispatchTask) so the receive loop keeps draining
   transcripts and runs handleBargeIn against the LIVE turn. The loop settles
   the previous dispatch before launching the next (single-in-flight), so
   conversation_history / metrics ordering is unchanged; the loop still awaits
   the final turn to settle before returning, so existing tests that inspect
   state right after the loop are unaffected.

2. Prompt pre-first-token abort (Python). Agent runtimes run tools for tens of
   seconds before the first token, during which the per-chunk cancel_event
   check never runs. The provider now races create()+first-byte against the
   cancel signal and spawns a watchdog that close()s the response the instant a
   barge-in fires (TS already aborts promptly via fetch + AbortController). The
   VAD legacy barge-in branch now also sets _llm_cancel_event (previously it
   only flipped _is_speaking, which Hermes never observed pre-first-token), and
   the OpenAI-compatible client uses an explicit httpx read/connect timeout so a
   dead gateway fails fast.

3. PATTER_FORWARD_STT_WHILE_SPEAKING (opt-in, default off). Forwards inbound
   audio to STT during TTS even with a VAD configured, so the transcript
   barge-in path can receive a transcript on echo-masked links where the VAD
   never fires. The leading-edge ring is still captured. Echo caveat (WARN on
   enable): without AEC the agent's own voice may be transcribed as a phantom
   interruption — pair with agent.barge_in_strategies.

Observable default behaviour (flag off, VAD present, normal LLM, no barge-in) is
unchanged; the recently landed tail-grace multi-turn fix is preserved.

Tests: new test_pipeline_bargein_backgrounded.py (4), test_provider_prefirsttoken_abort.py (3),
pipeline-bargein-backgrounded.mocked.test.ts (2). Python 2219 / TypeScript 1765 pass;
tsc + build clean.
---
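
Reviewer note on change 1 (decoupled single-in-flight dispatch). The sketch
below is a minimal illustration of the pattern, not the handler's actual code:
`TurnLoop`, `settle`, `dispatch_turn` and the simulated 10 ms "chunks" are
hypothetical names and values chosen for the example. It also assumes, as a
simplification, that any transcript arriving while a turn is running counts as
a barge-in; the real handler decides that in `handleBargeIn`.

```python
import asyncio
from typing import AsyncIterator, List, Optional


class TurnLoop:
    """Hypothetical stand-in for the STT receive loop (illustration only)."""

    def __init__(self) -> None:
        self.dispatch_task: Optional[asyncio.Task] = None
        self.cancel_event = asyncio.Event()
        self.log: List[str] = []

    async def dispatch_turn(self, text: str) -> None:
        # Simulated slow turn that polls the cancel flag between "chunks".
        self.log.append(f"start:{text}")
        for _ in range(20):
            if self.cancel_event.is_set():
                self.log.append(f"interrupted:{text}")
                return
            await asyncio.sleep(0.01)
        self.log.append(f"done:{text}")

    async def settle(self) -> None:
        # Wait for the in-flight turn (if any) and drop the handle, so any
        # exception it raised is retrieved rather than leaked.
        task, self.dispatch_task = self.dispatch_task, None
        if task is not None:
            await asyncio.gather(task, return_exceptions=True)

    async def run(self, transcripts: AsyncIterator[str]) -> None:
        try:
            async for text in transcripts:
                task = self.dispatch_task
                if task is not None and not task.done():
                    # Simplification: treat this transcript as a barge-in.
                    self.cancel_event.set()
                await self.settle()          # single-in-flight guarantee
                self.cancel_event.clear()    # fresh cancel state per turn
                self.dispatch_task = asyncio.create_task(self.dispatch_turn(text))
        finally:
            await self.settle()              # return only after the last turn


async def demo() -> List[str]:
    async def transcripts() -> AsyncIterator[str]:
        yield "first question"
        await asyncio.sleep(0.05)  # the caller speaks again mid-turn
        yield "stop"

    loop = TurnLoop()
    await loop.run(transcripts())
    return loop.log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the loop only awaits the previous task when the next transcript
arrives, it keeps reading while a turn is in progress, yet turns still start
strictly one after another.
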
 CHANGELOG.md                                  |   4 +
 .../python/getpatter/llm/openai_compatible.py | 132 +++++++----
 .../python/getpatter/services/llm_loop.py     | 179 ++++++++++-----
 libraries/python/getpatter/stream_handler.py  | 106 ++++++++-
 .../tests/test_llm_hermes_openclaw_presets.py |   6 +-
 .../tests/test_llm_openai_compatible.py       |   5 +-
 .../test_pipeline_bargein_backgrounded.py     | 206 ++++++++++++++++++
 .../unit/test_provider_prefirsttoken_abort.py | 152 +++++++++++++
 libraries/typescript/src/stream-handler.ts    | 195 +++++++++++------
 .../tests/long-turn-filler.mocked.test.ts     |   6 +
 ...peline-bargein-backgrounded.mocked.test.ts | 187 ++++++++++++++++
 11 files changed, 1009 insertions(+), 169 deletions(-)
 create mode 100644 libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
 create mode 100644 libraries/python/tests/unit/test_provider_prefirsttoken_abort.py
 create mode 100644 libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
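
Reviewer note on change 2 (pre-first-token abort). The idea of racing the
request against the cancel signal can be sketched in isolation as follows.
`await_or_cancel` and `fake_request` are hypothetical helpers invented for this
note; they assume nothing about the OpenAI client beyond "the request is an
awaitable". Unlike the helper in the diff, this sketch also cancels the request
when the waiting coroutine itself is cancelled, which is an addition of the
sketch rather than something the patch states.

```python
import asyncio
from typing import Awaitable, Optional, TypeVar

T = TypeVar("T")


async def await_or_cancel(
    request: Awaitable[T], cancel_event: asyncio.Event
) -> Optional[T]:
    """Await ``request`` unless ``cancel_event`` fires first.

    Returns the request's result, or ``None`` if the event won the race.
    """
    request_task = asyncio.ensure_future(request)
    cancel_task = asyncio.ensure_future(cancel_event.wait())
    try:
        await asyncio.wait(
            {request_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
        )
    except asyncio.CancelledError:
        # The waiter was cancelled: do not leave the request running.
        request_task.cancel()
        raise
    finally:
        cancel_task.cancel()
    if not request_task.done():
        # Interrupted before the request produced anything: abort it and
        # wait for the cancellation to finish before reporting None.
        request_task.cancel()
        await asyncio.gather(request_task, return_exceptions=True)
        return None
    return request_task.result()


async def fake_request(delay_s: float) -> str:
    """Hypothetical slow upstream call (stands in for a tool-running window)."""
    await asyncio.sleep(delay_s)
    return "response"


async def demo() -> None:
    # No interruption: the result is passed through.
    print(await await_or_cancel(fake_request(0.01), asyncio.Event()))

    # Interruption after 20 ms: the 1 s request is abandoned early.
    interrupted = asyncio.Event()
    asyncio.get_running_loop().call_later(0.02, interrupted.set)
    print(await await_or_cancel(fake_request(1.0), interrupted))


if __name__ == "__main__":
    asyncio.run(demo())
```

If both the request and the event complete in the same loop iteration, the
sketch returns the result, matching the `create_task.done()` check in the diff.
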

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 6baed1d..a98937b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,10 @@
 - **Multi-turn pipeline conversations no longer go silent after the first turn.** The agent answered the first turn but then ignored every subsequent utterance, leaving a ghost metrics turn of `user_text='' agent_text='[interrupted]'`. Two root causes in the pipeline turn-taking state machine:
   - **Tail-grace misclassified the next turn as a barge-in.** After the agent finishes speaking, `_end_speaking_with_grace` keeps `_is_speaking=true` for `PATTER_TTS_TAIL_GRACE_MS` (default 1500 ms) to swallow the fading TTS echo tail. Humans reply in 200-700 ms — inside that window — so the user's next utterance was treated as a barge-in: it recorded an interrupted turn and the leading audio was withheld from STT (only a ≤260 ms echo-contaminated ring), so no final transcript was produced and the agent never answered. A new `_tail_grace_active` / `tailGraceActive` flag now distinguishes "actively streaming TTS" from "post-TTS echo guard"; a VAD `speech_start` (or a transcript) during the tail grace ends the grace and is dispatched as a clean new turn — recovering the leading audio from the ring instead of dropping it — with no spurious `record_turn_interrupted`. Tunable `PATTER_TTS_TAIL_GRACE_MS` (0 / 200 / 1500) is now safe for fast next-turn speech.
   - **(Python) A barge-in's per-turn cancel event leaked into the next turn.** `_llm_cancel_event` was only recreated *inside* `_process_streaming_response` — after `LLMLoop.run` had already been handed the (still-set) event for the next turn — so the turn following any real barge-in bailed immediately. The event is now recreated at the top of `_dispatch_turn`, before dispatch (TypeScript already allocated a fresh `AbortController` per turn). `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
+- **Pipeline barge-in now works DURING a turn — including long Hermes/OpenClaw tool-running turns.** The caller could not interrupt the agent mid-response: the STT receive loop awaited the turn's LLM+TTS dispatch inline (`await self._dispatch_turn(...)` / `await this.runPipelineLlm(...)`), so for the whole 30-90 s of a tool-running agent-runtime turn it stopped reading transcripts — a barge-in transcript was only processed *after* the turn ended ("ferma", Italian for "stop", was answered late). Three coordinated changes, full Python/TS parity:
+  - **Decoupled, single-in-flight dispatch.** The turn now runs as one tracked background task (`_dispatch_task` / `dispatchTask`) so the receive loop keeps draining transcripts and runs barge-in detection against the LIVE turn. Exactly one dispatch is in flight: the loop settles the previous one before launching the next, so `conversation_history` / metrics ordering is unchanged. With no barge-in (default, VAD present, normal LLM) behaviour is unchanged — the loop still awaits the final turn to settle before returning.
+  - **Prompt pre-first-token abort (Python).** Agent runtimes run tools for tens of seconds before the first token, during which the per-chunk `cancel_event` check never runs. The provider now races `create()` + first-byte against the cancel signal and spawns a watchdog that `close()`s the response the instant a barge-in fires, so the request is torn down immediately instead of blocking the next turn (TS already aborts promptly via `fetch` + `AbortController`). The VAD legacy barge-in branch now also sets `_llm_cancel_event` (it previously only flipped `_is_speaking`), and the OpenAI-compatible client uses an explicit httpx read/connect timeout so a dead gateway fails fast.
+  - **`PATTER_FORWARD_STT_WHILE_SPEAKING` (opt-in, default off).** Forwards inbound audio to STT during TTS even with a VAD configured, so the transcript barge-in path can receive a transcript on echo-masked PSTN links where the VAD never fires. The leading-edge ring buffer is still captured. **Echo caveat:** without AEC the agent's own voice may be transcribed as a phantom interruption — pair with `agent.barge_in_strategies`. `libraries/python/getpatter/stream_handler.py`, `.../services/llm_loop.py`, `.../llm/openai_compatible.py`, `libraries/typescript/src/stream-handler.ts`.
 
 ## 0.6.5 (2026-06-05)
 
diff --git a/libraries/python/getpatter/llm/openai_compatible.py b/libraries/python/getpatter/llm/openai_compatible.py
index 0825c30..902ba83 100644
--- a/libraries/python/getpatter/llm/openai_compatible.py
+++ b/libraries/python/getpatter/llm/openai_compatible.py
@@ -155,10 +155,17 @@ def __init__(
         super().__init__(api_key=key or _EMPTY_KEY_SENTINEL, model=model, **kwargs)
 
         default_headers = {"User-Agent": self._user_agent, **(extra_headers or {})}
+        # Bound a connect that never lands (gateway down) to ~10 s while keeping
+        # the long read budget for tool-running turns. A scalar timeout would
+        # apply the full (120 s) ceiling to connect too, so a dead Hermes would
+        # hang the first turn for the full budget instead of failing fast.
+        import httpx as _httpx
+
+        _client_timeout: Any = _httpx.Timeout(timeout, connect=min(timeout, 10.0))
         self._client: Any = AsyncOpenAI(
             api_key=key or _EMPTY_KEY_SENTINEL,
             base_url=base_url,
-            timeout=timeout,
+            timeout=_client_timeout,
             default_headers=default_headers,
         )
 
@@ -304,58 +311,87 @@ async def stream(
         kwargs = self._build_completion_kwargs(
             messages, tools, call_id=call_id, caller=caller, callee=callee
         )
-        response = await self._client.chat.completions.create(**kwargs)
+        # Agent runtimes run tools for tens of seconds before the first token;
+        # race create()/first-byte against the cancel signal + spawn a close()
+        # watchdog so a barge-in during that pre-first-token window aborts the
+        # request promptly instead of blocking the next turn (see base
+        # ``OpenAILLMProvider._open_stream_with_cancel`` / ``_abort_on_cancel``).
+        response = await self._open_stream_with_cancel(kwargs, cancel_event)
+        if response is None:
+            return
+        abort_watcher = (
+            asyncio.ensure_future(self._abort_on_cancel(cancel_event, response))
+            if cancel_event is not None
+            else None
+        )
 
         last_usage = None
-        async for chunk in response:
+        try:
+            async for chunk in response:
+                if cancel_event is not None and cancel_event.is_set():
+                    try:
+                        await response.close()
+                    except Exception:  # noqa: BLE001 - best-effort cleanup
+                        pass
+                    return
+                usage = getattr(chunk, "usage", None)
+                if usage is not None:
+                    last_usage = usage
+
+                delta = chunk.choices[0].delta if chunk.choices else None
+                if delta is None:
+                    continue
+
+                if delta.content:
+                    yield {"type": "text", "content": delta.content}
+
+                if delta.tool_calls:
+                    for tc in delta.tool_calls:
+                        yield {
+                            "type": "tool_call",
+                            "index": tc.index,
+                            "id": tc.id,
+                            "name": tc.function.name if tc.function else None,
+                            "arguments": tc.function.arguments if tc.function else None,
+                        }
+
+            if last_usage is not None:
+                cache_read = 0
+                details = getattr(last_usage, "prompt_tokens_details", None)
+                if details is not None:
+                    cache_read = getattr(details, "cached_tokens", 0) or 0
+                # Mirror OpenAILLMProvider.stream exactly: prompt_tokens is the
+                # TOTAL input (uncached + cached); subtract cached so input_tokens
+                # is the uncached portion and cost isn't double-billed.
+                prompt_tokens = getattr(last_usage, "prompt_tokens", 0) or 0
+                uncached_input = max(0, prompt_tokens - cache_read)
+                completion_tokens = getattr(last_usage, "completion_tokens", 0) or 0
+                self._record_completion_cost(
+                    prompt_tokens=prompt_tokens,
+                    completion_tokens=completion_tokens,
+                )
+                yield {
+                    "type": "usage",
+                    "input_tokens": uncached_input,
+                    "output_tokens": completion_tokens,
+                    "cache_read_tokens": cache_read,
+                }
+        except asyncio.CancelledError:
+            raise
+        except Exception:
+            # close()-induced read error during a barge-in is a clean stop, not
+            # an LLM error — swallow when cancelling so llm_error_message does
+            # not fire. Genuine upstream errors (cancel not set) propagate.
             if cancel_event is not None and cancel_event.is_set():
+                return
+            raise
+        finally:
+            if abort_watcher is not None:
+                abort_watcher.cancel()
                 try:
-                    await response.close()
-                except Exception:  # noqa: BLE001 - best-effort cleanup
+                    await abort_watcher
+                except (asyncio.CancelledError, Exception):
                     pass
-                return
-            usage = getattr(chunk, "usage", None)
-            if usage is not None:
-                last_usage = usage
-
-            delta = chunk.choices[0].delta if chunk.choices else None
-            if delta is None:
-                continue
-
-            if delta.content:
-                yield {"type": "text", "content": delta.content}
-
-            if delta.tool_calls:
-                for tc in delta.tool_calls:
-                    yield {
-                        "type": "tool_call",
-                        "index": tc.index,
-                        "id": tc.id,
-                        "name": tc.function.name if tc.function else None,
-                        "arguments": tc.function.arguments if tc.function else None,
-                    }
-
-        if last_usage is not None:
-            cache_read = 0
-            details = getattr(last_usage, "prompt_tokens_details", None)
-            if details is not None:
-                cache_read = getattr(details, "cached_tokens", 0) or 0
-            # Mirror OpenAILLMProvider.stream exactly: prompt_tokens is the
-            # TOTAL input (uncached + cached); subtract cached so input_tokens
-            # is the uncached portion and cost isn't double-billed.
-            prompt_tokens = getattr(last_usage, "prompt_tokens", 0) or 0
-            uncached_input = max(0, prompt_tokens - cache_read)
-            completion_tokens = getattr(last_usage, "completion_tokens", 0) or 0
-            self._record_completion_cost(
-                prompt_tokens=prompt_tokens,
-                completion_tokens=completion_tokens,
-            )
-            yield {
-                "type": "usage",
-                "input_tokens": uncached_input,
-                "output_tokens": completion_tokens,
-                "cache_read_tokens": cache_read,
-            }
 
 
 class LLM(OpenAICompatibleLLMProvider):
diff --git a/libraries/python/getpatter/services/llm_loop.py b/libraries/python/getpatter/services/llm_loop.py
index fb4c9f4..c5bd6ec 100644
--- a/libraries/python/getpatter/services/llm_loop.py
+++ b/libraries/python/getpatter/services/llm_loop.py
@@ -662,6 +662,55 @@ def _build_completion_kwargs(
             kwargs["max_completion_tokens"] = self._max_tokens
         return kwargs
 
+    async def _open_stream_with_cancel(self, kwargs: dict, cancel_event):
+        """Create the streaming completion, aborting promptly if ``cancel_event``
+        fires while awaiting — INCLUDING before the first SSE byte.
+
+        Agent runtimes (Hermes / OpenClaw) run tools/memory/skills for tens of
+        seconds before the first token; the ``create()`` await (and the first
+        ``__anext__``) would otherwise be unabortable, so a barge-in during
+        that window could not free the connection and the next user turn would
+        block behind it. Races ``create()`` against the cancel signal and
+        cancels the in-flight POST if the user interrupts first.
+
+        Returns the streaming response, or ``None`` if cancelled before the
+        response object existed.
+        """
+        if cancel_event is None:
+            return await self._client.chat.completions.create(**kwargs)
+        create_task = asyncio.ensure_future(
+            self._client.chat.completions.create(**kwargs)
+        )
+        cancel_task = asyncio.ensure_future(cancel_event.wait())
+        try:
+            await asyncio.wait(
+                {create_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
+            )
+        finally:
+            cancel_task.cancel()
+        if not create_task.done():
+            # The caller interrupted before the upstream even responded —
+            # abort the in-flight POST so the socket is freed immediately.
+            create_task.cancel()
+            try:
+                await create_task
+            except BaseException:  # noqa: BLE001 - aborting in-flight request
+                pass
+            return None
+        return create_task.result()
+
+    @staticmethod
+    async def _abort_on_cancel(cancel_event, response) -> None:
+        """Close the streaming response the instant ``cancel_event`` fires so a
+        consumer parked on the first SSE byte unblocks immediately instead of
+        waiting out the read timeout. Best-effort; the stream's ``finally``
+        cancels this watcher whenever the stream exits, normally or not."""
+        try:
+            await cancel_event.wait()
+            await response.close()
+        except Exception:  # noqa: BLE001 - best-effort teardown
+            pass
+
     async def stream(
         self,
         messages: list[dict],
@@ -694,64 +743,88 @@ async def stream(
         it into ``_build_completion_kwargs``.
         """
         kwargs = self._build_completion_kwargs(messages, tools)
-        response = await self._client.chat.completions.create(**kwargs)
+        response = await self._open_stream_with_cancel(kwargs, cancel_event)
+        if response is None:
+            # Cancelled before the first byte (barge-in during the agent
+            # runtime's pre-first-token tool window) — nothing to yield.
+            return
+        abort_watcher = (
+            asyncio.ensure_future(self._abort_on_cancel(cancel_event, response))
+            if cancel_event is not None
+            else None
+        )
 
         last_usage = None
-        async for chunk in response:
+        try:
+            async for chunk in response:
+                if cancel_event is not None and cancel_event.is_set():
+                    try:
+                        await response.close()
+                    except Exception:  # noqa: BLE001 - best-effort cleanup
+                        pass
+                    return
+                # Usage chunks have empty ``choices`` and a populated ``usage``.
+                usage = getattr(chunk, "usage", None)
+                if usage is not None:
+                    last_usage = usage
+
+                delta = chunk.choices[0].delta if chunk.choices else None
+                if delta is None:
+                    continue
+
+                if delta.content:
+                    yield {"type": "text", "content": delta.content}
+
+                if delta.tool_calls:
+                    for tc in delta.tool_calls:
+                        yield {
+                            "type": "tool_call",
+                            "index": tc.index,
+                            "id": tc.id,
+                            "name": tc.function.name if tc.function else None,
+                            "arguments": tc.function.arguments if tc.function else None,
+                        }
+
+            if last_usage is not None:
+                cache_read = 0
+                details = getattr(last_usage, "prompt_tokens_details", None)
+                if details is not None:
+                    cache_read = getattr(details, "cached_tokens", 0) or 0
+                # OpenAI's prompt_tokens is the TOTAL input (uncached + cached).
+                # Subtract cached so input_tokens represents only the uncached
+                # portion and calculate_llm_cost doesn't bill cached tokens at
+                # the full input rate (mirrors libraries/typescript/src/llm-loop.ts:296-305).
+                prompt_tokens = getattr(last_usage, "prompt_tokens", 0) or 0
+                uncached_input = max(0, prompt_tokens - cache_read)
+                completion_tokens = getattr(last_usage, "completion_tokens", 0) or 0
+                self._record_completion_cost(
+                    prompt_tokens=prompt_tokens,
+                    completion_tokens=completion_tokens,
+                )
+                yield {
+                    "type": "usage",
+                    "input_tokens": uncached_input,
+                    "output_tokens": completion_tokens,
+                    "cache_read_tokens": cache_read,
+                }
+        except asyncio.CancelledError:
+            raise
+        except Exception:
+            # A read error AFTER the cancel watchdog closed the response is the
+            # expected clean stop on a (possibly pre-first-token) barge-in —
+            # swallow it so it is not surfaced as an LLM error (which would
+            # trip the spoken llm_error_message fallback). Any genuine upstream
+            # error (cancel_event not set) propagates unchanged.
             if cancel_event is not None and cancel_event.is_set():
-                # Best-effort cancel of the upstream stream so the underlying
-                # HTTP connection is freed instead of waiting for the server
-                # to close. ``response.close()`` is sync on AsyncOpenAI and
-                # may raise if the stream already ended — best-effort.
+                return
+            raise
+        finally:
+            if abort_watcher is not None:
+                abort_watcher.cancel()
                 try:
-                    await response.close()
-                except Exception:  # noqa: BLE001 - best-effort cleanup
+                    await abort_watcher
+                except (asyncio.CancelledError, Exception):
                     pass
-                return
-            # Usage chunks have empty ``choices`` and a populated ``usage``.
-            usage = getattr(chunk, "usage", None)
-            if usage is not None:
-                last_usage = usage
-
-            delta = chunk.choices[0].delta if chunk.choices else None
-            if delta is None:
-                continue
-
-            if delta.content:
-                yield {"type": "text", "content": delta.content}
-
-            if delta.tool_calls:
-                for tc in delta.tool_calls:
-                    yield {
-                        "type": "tool_call",
-                        "index": tc.index,
-                        "id": tc.id,
-                        "name": tc.function.name if tc.function else None,
-                        "arguments": tc.function.arguments if tc.function else None,
-                    }
-
-        if last_usage is not None:
-            cache_read = 0
-            details = getattr(last_usage, "prompt_tokens_details", None)
-            if details is not None:
-                cache_read = getattr(details, "cached_tokens", 0) or 0
-            # OpenAI's prompt_tokens is the TOTAL input (uncached + cached).
-            # Subtract cached so input_tokens represents only the uncached
-            # portion and calculate_llm_cost doesn't bill cached tokens at
-            # the full input rate (mirrors libraries/typescript/src/llm-loop.ts:296-305).
-            prompt_tokens = getattr(last_usage, "prompt_tokens", 0) or 0
-            uncached_input = max(0, prompt_tokens - cache_read)
-            completion_tokens = getattr(last_usage, "completion_tokens", 0) or 0
-            self._record_completion_cost(
-                prompt_tokens=prompt_tokens,
-                completion_tokens=completion_tokens,
-            )
-            yield {
-                "type": "usage",
-                "input_tokens": uncached_input,
-                "output_tokens": completion_tokens,
-                "cache_read_tokens": cache_read,
-            }
 
     def _record_completion_cost(
         self, *, prompt_tokens: int, completion_tokens: int
diff --git a/libraries/python/getpatter/stream_handler.py b/libraries/python/getpatter/stream_handler.py
index beb368f..14a0798 100644
--- a/libraries/python/getpatter/stream_handler.py
+++ b/libraries/python/getpatter/stream_handler.py
@@ -2347,6 +2347,27 @@ def __init__(
         # because ``agent`` is a frozen dataclass.
         self._auto_vad = None
         self._stt_task: asyncio.Task | None = None
+        # The in-flight turn dispatch (LLM + TTS) runs as a SINGLE tracked task
+        # so the STT receive loop keeps draining transcripts during a long
+        # (30-90 s) agent-runtime turn and can fire transcript-based barge-in
+        # against the LIVE turn. Exactly one is active at a time — the loop
+        # awaits the previous one to settle before launching the next, so
+        # conversation_history / metrics ordering is unchanged. None when idle.
+        self._dispatch_task: asyncio.Task | None = None
+        # Opt-in (default OFF): forward inbound audio to STT even while the
+        # agent is speaking, so the transcript barge-in path can receive a
+        # transcript on echo-masked PSTN links where the VAD never fires.
+        # ECHO RISK without AEC — see ``on_audio_received`` self-hearing guard.
+        self._forward_stt_while_speaking = os.environ.get(
+            "PATTER_FORWARD_STT_WHILE_SPEAKING", ""
+        ).strip().lower() in ("1", "true", "yes")
+        if self._forward_stt_while_speaking:
+            logger.warning(
+                "PATTER_FORWARD_STT_WHILE_SPEAKING=on: inbound audio is sent to "
+                "STT during TTS so transcript barge-in works on echo-masked "
+                "links. Without AEC the agent's own voice may be transcribed as "
+                "a phantom interruption — pair with agent.barge_in_strategies."
+            )
         self._is_speaking = False
         # True only while the post-TTS tail-grace window is pending: the
         # agent has finished its turn but ``_is_speaking`` is still held for
@@ -3690,14 +3711,58 @@ async def _stt_loop(self) -> None:
                         self.metrics.anchor_user_speech_start()
                     continue
 
-                await self._dispatch_turn(transcript.text)
+                # Decouple dispatch from the receive loop: run the turn as a
+                # SINGLE tracked task so the ``async for`` keeps draining
+                # transcripts during a long (30-90 s) agent-runtime turn and
+                # can fire transcript-based barge-in against the LIVE turn —
+                # the head-of-line-blocking fix. Settle the previous turn
+                # first so exactly one dispatch is in flight and the per-turn
+                # conversation_history / metrics ordering is preserved.
+                await self._await_dispatch_settle()
+                self._dispatch_task = asyncio.create_task(
+                    self._dispatch_turn(transcript.text)
+                )
 
         except Exception as exc:
             logger.exception("Pipeline STT loop error: %s", exc)
+        finally:
+            # Return only once the last dispatch fully settles, so callers and
+            # tests that inspect state right after ``await _stt_loop()`` still
+            # observe completed turn effects (the loop no longer blocks DURING
+            # a turn, but it does block until the FINAL turn is done).
+            await self._await_dispatch_settle()
+
+    async def _await_dispatch_settle(self) -> None:
+        """Await the in-flight turn dispatch to fully settle.
+
+        Called before launching the next turn (single-in-flight) and once
+        more when the STT loop exits. Two cases: the prior dispatch either
+        completed naturally (await is a no-op) or was cancelled by a barge-in
+        (await lets its ``finally`` — grace flip, LLM span close, ring reset,
+        history flush — run BEFORE the next turn's ``_begin_speaking``). Always
+        clears the handle so a backgrounded-task exception is retrieved (no
+        ``Task exception was never retrieved`` leak).
+        """
+        task = self._dispatch_task
+        if task is None:
+            return
+        try:
+            await task
+        except asyncio.CancelledError:  # pragma: no cover - teardown path
+            pass
+        except Exception as exc:  # pragma: no cover - already handled in dispatch
+            logger.debug("backgrounded dispatch raised: %s", exc)
+        finally:
+            # Only clear if it is still the task we awaited — a re-entrant
+            # launch could have replaced it (it cannot today: the loop is the
+            # sole launcher and awaits here first, but be defensive).
+            if self._dispatch_task is task:
+                self._dispatch_task = None
 
     async def _dispatch_turn(self, transcript_text: str) -> None:
         """Run the post-commit pipeline (record STT → afterTranscribe →
-        LLM dispatch → TTS → turn-complete) inline on the STT loop.
+        LLM dispatch → TTS → turn-complete) as a tracked background task so
+        the STT receive loop keeps draining transcripts during the turn.
         """
         # Reset the per-turn LLM cancel event BEFORE dispatch so a stale
         # cancel set by a previous turn's barge-in (``_do_cancel_for_barge_in``
@@ -4038,11 +4103,25 @@ async def on_audio_received(self, audio_bytes: bytes) -> None:
                                     self.metrics.record_tts_stopped()
                                     self.metrics.record_turn_interrupted()
                                 self._is_speaking = False
+                                self._tail_grace_active = False
                                 self._speaking_started_at = None
                                 self._first_audio_sent_at = None
                                 self._speaking_generation += 1
                                 self._last_cancel_at = time.time()
                                 self._suppressed_speech_pending = False
+                                # Tear down the in-flight LLM stream too. The
+                                # consumption loop polls ``_llm_cancel_event``
+                                # per chunk, but a turn parked PRE-first-token
+                                # on a hung agent request never sees a chunk —
+                                # the provider cancel watchdog (see
+                                # ``OpenAICompatibleLLMProvider.stream``) closes
+                                # the request the instant this fires. Parity
+                                # with TS ``cancelSpeaking`` → ``llmAbort.abort``.
+                                cancel_event = getattr(
+                                    self, "_llm_cancel_event", None
+                                )
+                                if cancel_event is not None:
+                                    cancel_event.set()
                     if not phantom_suppressed and self.metrics is not None:
                         # Industry-standard pattern: every legitimate VAD speech_start
                         # re-anchors the turn timestamp pre-commit. This
@@ -4098,7 +4177,16 @@ async def on_audio_received(self, audio_bytes: bytes) -> None:
                 # post-barge-in bleed-transcription entry.
                 if len(self._inbound_audio_ring) > 13:  # ~260 ms at 20 ms/frame
                     self._inbound_audio_ring.pop(0)
-                return
+                # Opt-in: also forward the frame to STT during TTS so the
+                # transcript barge-in path can receive a transcript on
+                # echo-masked links where the VAD never fires. The ring push
+                # above stays unconditional (leading-edge recovery preserved);
+                # only the early-return is gated. ECHO RISK without AEC — the
+                # agent's own voice may be transcribed as a phantom
+                # interruption; pair with agent.barge_in_strategies. Default
+                # OFF → byte-identical push-and-return.
+                if not self._forward_stt_while_speaking:
+                    return
 
         # before_send_to_stt hook — gate/transform the audio chunk before it
         # reaches the STT provider. Returning None drops the chunk (useful
@@ -4648,6 +4736,18 @@ async def cleanup(self) -> None:
                 _tts_cancel()
             except Exception:
                 pass
+        # Hard-cancel the backgrounded turn dispatch (teardown backstop) so no
+        # orphan task touches a finalized handler. The cancel_event.set() above
+        # lets a post-first-token turn break gracefully; the cancel covers a
+        # turn parked pre-first-token on a hung agent request.
+        _dispatch_task = getattr(self, "_dispatch_task", None)
+        if _dispatch_task is not None and not _dispatch_task.done():
+            _dispatch_task.cancel()
+            try:
+                await _dispatch_task
+            except (asyncio.CancelledError, Exception):
+                pass
+        self._dispatch_task = None
         # Drop any pending barge-in timeout BEFORE we tear down metrics /
         # adapters. Without this, a call that ends while a barge-in is
         # pending leaves an asyncio.Task scheduled to fire
diff --git a/libraries/python/tests/test_llm_hermes_openclaw_presets.py b/libraries/python/tests/test_llm_hermes_openclaw_presets.py
index c0a2a6a..1363452 100644
--- a/libraries/python/tests/test_llm_hermes_openclaw_presets.py
+++ b/libraries/python/tests/test_llm_hermes_openclaw_presets.py
@@ -33,7 +33,8 @@ def test_hermes_defaults_base_url_model_timeout(monkeypatch) -> None:
     llm = hermes.LLM()
     assert _base_url_str(llm).startswith("http://127.0.0.1:8642/v1")
     assert llm._model == "hermes-agent"
-    assert llm._client.timeout == 120.0
+    assert llm._client.timeout.read == 120.0
+    assert llm._client.timeout.connect == 10.0
     # Hermes is stateless and keys continuity off HEADERS:
     #   X-Hermes-Session-Id (per call) + optional X-Hermes-Session-Key (memory).
     assert llm._session_user_prefix == "patter-call-"
@@ -113,7 +114,8 @@ def test_openclaw_defaults_match_consult_preset(monkeypatch) -> None:
     assert llm._session_user_prefix == "patter-call-"
     # OpenClaw has no separate memory-scope header.
     assert llm._session_key_header is None
-    assert llm._client.timeout == 120.0
+    assert llm._client.timeout.read == 120.0
+    assert llm._client.timeout.connect == 10.0
     assert llm.provider_key == "openclaw"
 
 
diff --git a/libraries/python/tests/test_llm_openai_compatible.py b/libraries/python/tests/test_llm_openai_compatible.py
index ea00b11..b803c5c 100644
--- a/libraries/python/tests/test_llm_openai_compatible.py
+++ b/libraries/python/tests/test_llm_openai_compatible.py
@@ -27,7 +27,10 @@ def test_openai_compatible_provider_points_client_at_base_url_with_timeout() ->
     )
     # Real client carries the base URL and the long (non-default) timeout.
     assert _base_url_str(provider).startswith("http://127.0.0.1:9/v1")
-    assert provider._client.timeout == 120.0
+    # Long read budget for tool-running turns; connect bounded (~10 s) so a
+    # dead gateway fails fast instead of hanging the full read budget.
+    assert provider._client.timeout.read == 120.0
+    assert provider._client.timeout.connect == 10.0
     assert provider._model == "m"
     # Satisfies the LLMProvider protocol.
     assert isinstance(provider, LLMProvider)
diff --git a/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py b/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
new file mode 100644
index 0000000..295e200
--- /dev/null
+++ b/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
@@ -0,0 +1,206 @@
+"""Backgrounded-dispatch barge-in: the STT receive loop keeps draining
+transcripts during a long agent-runtime turn, so a barge-in (transcript OR
+VAD) can interrupt the LIVE turn instead of being processed only after it ends.
+
+Reproduces the live Hermes failure: while the assistant was speaking, the
+caller said "ferma" (Italian for "stop") but Patter only reacted after the turn
+finished, because ``_stt_loop`` awaited ``_dispatch_turn`` inline and stopped
+reading transcripts for the whole (30-90 s) turn. Covers:
+
+* the decoupled dispatch + transcript barge-in cancelling the in-flight turn;
+* the VAD legacy branch now setting ``_llm_cancel_event`` (pre-first-token
+  teardown parity with TS);
+* the opt-in ``PATTER_FORWARD_STT_WHILE_SPEAKING`` guard.
+
+Only the external boundary (LLM stream timing, TTS bytes, STT) is faked.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import time
+from collections import deque
+from typing import AsyncIterator
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from getpatter.providers.base import Transcript, VADEvent
+from getpatter.stream_handler import PipelineStreamHandler
+
+from tests.conftest import make_agent
+
+
+class _FakeTTS:
+    output_format = "pcm_16000"
+
+    def __init__(self) -> None:
+        self.synthesized: list[str] = []
+
+    async def synthesize(self, text: str):
+        self.synthesized.append(text)
+        yield b"\x00\x00" * 80
+
+
+class _ScriptedVAD:
+    def __init__(self, events: list[VADEvent | None]) -> None:
+        self._events = list(events)
+
+    async def process_frame(self, pcm: bytes, sample_rate: int) -> VADEvent | None:
+        return self._events.pop(0) if self._events else None
+
+    async def close(self) -> None:  # pragma: no cover
+        pass
+
+    def reset(self) -> None:
+        pass
+
+
+def _make_handler(*, metrics: MagicMock | None = None) -> PipelineStreamHandler:
+    handler = PipelineStreamHandler(
+        agent=make_agent(),
+        audio_sender=AsyncMock(),
+        call_id="call-bg",
+        caller="+15551110000",
+        callee="+15552220000",
+        resolved_prompt="p",
+        metrics=metrics,
+        for_twilio=True,
+        on_transcript=None,
+        conversation_history=deque(maxlen=20),
+        transcript_entries=deque(maxlen=20),
+    )
+    handler.on_message = None
+    handler._tts = _FakeTTS()  # type: ignore[assignment]
+    handler._stt = AsyncMock()
+    handler._aec = None
+    handler._input_is_mulaw_8k = False
+    return handler
+
+
+_FRAME = b"\x00\x01" * 160
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestTranscriptBargeInDuringInFlightTurn:
+    """A barge-in transcript cancels the LIVE turn — proving the receive loop
+    is no longer blocked on dispatch."""
+
+    async def test_bargein_transcript_cancels_inflight_long_turn(self) -> None:
+        metrics = MagicMock()
+        handler = _make_handler(metrics=metrics)
+        # Past the warmup gate so barge-in is allowed.
+        handler._can_barge_in = lambda: True  # type: ignore[assignment]
+
+        cancel_seen = asyncio.Event()
+
+        class _ParkUntilCancelLoop:
+            def __init__(self) -> None:
+                self.calls = 0
+
+            def run(self, text, history, ctx, *, cancel_event=None, **kw):
+                self.calls += 1
+                first = self.calls == 1
+
+                async def _gen():
+                    if first:
+                        # Long turn: only ends when the barge-in sets cancel.
+                        while cancel_event is None or not cancel_event.is_set():
+                            await asyncio.sleep(0.005)
+                        cancel_seen.set()
+                        return
+                    yield "ok "  # any later turn replies quickly
+
+                return _gen()
+
+        handler._llm_loop = _ParkUntilCancelLoop()
+
+        class _STT:
+            async def receive_transcripts(self) -> AsyncIterator[Transcript]:
+                yield Transcript(text="dimmi una storia", is_final=True, confidence=0.9)
+                # Let turn 1 begin_speaking + park on its (long) LLM stream.
+                await asyncio.sleep(0.08)
+                # Caller barges in WHILE the agent turn is in flight.
+                yield Transcript(text="ferma per favore", is_final=True, confidence=0.9)
+                await asyncio.sleep(0.08)
+
+        handler._stt = _STT()  # type: ignore[assignment]
+
+        await asyncio.wait_for(handler._stt_loop(), timeout=3.0)
+
+        # The barge-in fired DURING turn 1 (the loop kept reading transcripts):
+        handler.audio_sender.send_clear.assert_awaited()
+        metrics.record_bargein_detected.assert_called()
+        # Turn 1's LLM stream observed the cancel (it was torn down, not left
+        # running until the next turn).
+        assert cancel_seen.is_set()
+        assert handler._is_speaking is False
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestVadLegacyBranchSetsCancelEvent:
+    """A VAD-only barge-in during TTS tears down the LLM stream (layer 2a)."""
+
+    async def test_vad_speech_start_sets_llm_cancel_event(self) -> None:
+        metrics = MagicMock()
+        handler = _make_handler(metrics=metrics)
+        handler._auto_vad = _ScriptedVAD([VADEvent(type="speech_start")])
+        handler._is_speaking = True
+        handler._tail_grace_active = False
+        handler._speaking_generation = 1
+        handler._speaking_started_at = time.time() - 2.0
+        handler._first_audio_sent_at = time.time() - 2.0
+        handler._inbound_audio_ring = []
+        assert handler._llm_cancel_event.is_set() is False
+
+        await handler.on_audio_received(_FRAME)
+
+        # Real barge-in cancel ran AND the LLM stream cancel was signalled
+        # (previously only `_is_speaking` flipped, which Hermes never observed
+        # pre-first-token).
+        metrics.record_bargein_detected.assert_called_once()
+        assert handler._is_speaking is False
+        assert handler._llm_cancel_event.is_set() is True
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestForwardSttWhileSpeakingFlag:
+    """``PATTER_FORWARD_STT_WHILE_SPEAKING`` gates audio-to-STT during TTS."""
+
+    async def test_flag_off_buffers_and_returns(self, monkeypatch) -> None:
+        monkeypatch.delenv("PATTER_FORWARD_STT_WHILE_SPEAKING", raising=False)
+        handler = _make_handler()
+        handler._auto_vad = _ScriptedVAD([None, None])
+        handler._stt = AsyncMock()
+        handler._is_speaking = True
+        handler._tail_grace_active = False
+        handler._inbound_audio_ring = []
+
+        await handler.on_audio_received(_FRAME)
+        await handler.on_audio_received(_FRAME)
+
+        # Default: audio withheld from STT during TTS, only ring-buffered.
+        assert handler._stt.send_audio.await_count == 0
+        assert len(handler._inbound_audio_ring) == 2
+
+    async def test_flag_on_forwards_to_stt_during_tts(self, monkeypatch) -> None:
+        monkeypatch.setenv("PATTER_FORWARD_STT_WHILE_SPEAKING", "1")
+        handler = _make_handler()
+        assert handler._forward_stt_while_speaking is True
+        handler._auto_vad = _ScriptedVAD([None, None])
+        handler._stt = AsyncMock()
+        handler._is_speaking = True
+        handler._tail_grace_active = False
+        handler._inbound_audio_ring = []
+
+        await handler.on_audio_received(_FRAME)
+        await handler.on_audio_received(_FRAME)
+
+        # Flag on: audio ALSO reaches STT during TTS (so the transcript barge-in
+        # path can fire on echo-masked links) AND the ring still captures the
+        # leading edge for flush-on-barge-in.
+        assert handler._stt.send_audio.await_count == 2
+        assert len(handler._inbound_audio_ring) == 2
diff --git a/libraries/python/tests/unit/test_provider_prefirsttoken_abort.py b/libraries/python/tests/unit/test_provider_prefirsttoken_abort.py
new file mode 100644
index 0000000..13391ee
--- /dev/null
+++ b/libraries/python/tests/unit/test_provider_prefirsttoken_abort.py
@@ -0,0 +1,152 @@
+"""Pre-first-token cancel abort for agent-runtime LLM providers.
+
+Hermes / OpenClaw run tools/memory/skills for tens of seconds BEFORE the first
+SSE byte. The per-chunk ``cancel_event.is_set()`` check inside ``async for chunk
+in response`` never runs during that window (the consumer is parked awaiting the
+first byte), so a barge-in could not free the connection and the next user turn
+blocked behind it. The provider now races ``create()`` + first-byte against the
+cancel event and spawns a watchdog that ``close()``s the response the instant
+the event fires, returning promptly without yielding.
+
+Only the external boundary is mocked: a fake AsyncOpenAI client whose streaming
+response parks on an event until ``close()`` is called.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from getpatter.llm.openai_compatible import OpenAICompatibleLLMProvider
+
+
+class _ParkingResponse:
+    """Async-iterable streaming response that parks on the first ``__anext__``
+    until ``close()`` is called — modelling Hermes holding the connection open
+    while it runs tools before the first token."""
+
+    def __init__(self) -> None:
+        self._closed = asyncio.Event()
+        self.close_calls = 0
+        self.yielded_any = False
+
+    def __aiter__(self):
+        return self
+
+    async def __anext__(self):
+        await self._closed.wait()
+        # A read after the httpx stream is closed raises — model that so the
+        # provider's except-branch (swallow-when-cancelling) is exercised.
+        raise RuntimeError("stream closed mid-read")
+
+    async def close(self) -> None:
+        self.close_calls += 1
+        self._closed.set()
+
+
+def _provider_with_fake_client(response: _ParkingResponse) -> OpenAICompatibleLLMProvider:
+    provider = OpenAICompatibleLLMProvider(base_url="http://127.0.0.1:9/v1", model="m")
+    fake_client = MagicMock()
+    fake_client.chat.completions.create = AsyncMock(return_value=response)
+    provider._client = fake_client  # type: ignore[assignment]
+    return provider
+
+
+@pytest.mark.mocked
+@pytest.mark.asyncio
+async def test_cancel_event_closes_response_before_first_token() -> None:
+    provider = _provider_with_fake_client(resp := _ParkingResponse())
+    cancel = asyncio.Event()
+    chunks: list = []
+
+    async def _consume() -> None:
+        async for chunk in provider.stream(
+            [{"role": "user", "content": "hi"}], cancel_event=cancel
+        ):
+            chunks.append(chunk)
+
+    task = asyncio.create_task(_consume())
+    # Let it reach the parked first __anext__.
+    await asyncio.sleep(0.05)
+    assert chunks == []  # parked pre-first-token
+
+    # Barge-in: the watchdog must close the response and stream() must return
+    # promptly WITHOUT yielding or raising.
+    cancel.set()
+    await asyncio.wait_for(task, timeout=1.0)
+
+    assert resp.close_calls >= 1  # watchdog tore the request down
+    assert chunks == []  # nothing spoken
+
+
+@pytest.mark.mocked
+@pytest.mark.asyncio
+async def test_cancel_during_create_aborts_in_flight_post() -> None:
+    """If the cancel fires while ``create()`` itself is still awaiting (the
+    server hasn't even sent headers), the in-flight POST is cancelled and
+    stream() returns nothing — no response object, no yield."""
+    provider = OpenAICompatibleLLMProvider(base_url="http://127.0.0.1:9/v1", model="m")
+    cancel = asyncio.Event()
+    create_started = asyncio.Event()
+
+    async def _never_returns(**_kwargs):
+        create_started.set()
+        await asyncio.Event().wait()  # parks forever (server never responds)
+
+    fake_client = MagicMock()
+    fake_client.chat.completions.create = _never_returns
+    provider._client = fake_client  # type: ignore[assignment]
+
+    chunks: list = []
+
+    async def _consume() -> None:
+        async for chunk in provider.stream(
+            [{"role": "user", "content": "hi"}], cancel_event=cancel
+        ):
+            chunks.append(chunk)
+
+    task = asyncio.create_task(_consume())
+    await asyncio.wait_for(create_started.wait(), timeout=1.0)
+    cancel.set()
+    await asyncio.wait_for(task, timeout=1.0)
+    assert chunks == []
+
+
+@pytest.mark.mocked
+@pytest.mark.asyncio
+async def test_no_cancel_event_streams_normally() -> None:
+    """Regression guard: with no cancel_event the watchdog is never spawned and
+    a normal streamed response yields its text unchanged."""
+
+    class _Chunk:
+        def __init__(self, content):
+            self.usage = None
+            self.choices = [
+                MagicMock(delta=MagicMock(content=content, tool_calls=None))
+            ]
+
+    class _OneShot:
+        def __init__(self):
+            self._items = [_Chunk("Hello "), _Chunk("there.")]
+
+        def __aiter__(self):
+            return self
+
+        async def __anext__(self):
+            if not self._items:
+                raise StopAsyncIteration
+            return self._items.pop(0)
+
+    provider = OpenAICompatibleLLMProvider(base_url="http://127.0.0.1:9/v1", model="m")
+    fake_client = MagicMock()
+    fake_client.chat.completions.create = AsyncMock(return_value=_OneShot())
+    provider._client = fake_client  # type: ignore[assignment]
+
+    texts = [
+        c["content"]
+        async for c in provider.stream([{"role": "user", "content": "hi"}])
+        if c.get("type") == "text"
+    ]
+    assert texts == ["Hello ", "there."]
diff --git a/libraries/typescript/src/stream-handler.ts b/libraries/typescript/src/stream-handler.ts
index fefd056..23d1f51 100644
--- a/libraries/typescript/src/stream-handler.ts
+++ b/libraries/typescript/src/stream-handler.ts
@@ -1021,6 +1021,25 @@ export class StreamHandler {
   private maxDurationTimer: ReturnType<typeof setTimeout> | null = null;
   private transcriptProcessing = false;
   private transcriptQueue: STTTranscript[] = [];
+  /**
+   * The in-flight turn dispatch (LLM + TTS) runs as a SINGLE tracked promise
+   * so the transcript drain loop keeps running ``handleBargeIn`` against the
+   * LIVE turn during a long (30-90 s) agent-runtime response, instead of
+   * head-of-line-blocking on it. Exactly one is in flight: the launcher awaits
+   * the previous one to settle (fast — a barge-in already aborted it) before
+   * starting the next, preserving history/metrics ordering. Parity with
+   * Python ``_dispatch_task``.
+   */
+  private dispatchTask: Promise<void> | null = null;
+  /**
+   * Opt-in (default OFF): forward inbound audio to STT even while the agent is
+   * speaking, so the transcript barge-in path can receive a transcript on
+   * echo-masked PSTN links where the VAD never fires. ECHO RISK without AEC.
+   * Parity with Python ``_forward_stt_while_speaking``.
+   */
+  private readonly forwardSttWhileSpeaking = ['1', 'true', 'yes'].includes(
+    (process.env.PATTER_FORWARD_STT_WHILE_SPEAKING ?? '').trim().toLowerCase(),
+  );
   // Throttle state for back-to-back STT finals — see ``commitTranscript``.
   private lastCommitText = '';
   private lastCommitAt = 0;
@@ -1046,6 +1065,15 @@ export class StreamHandler {
     this.caller = caller;
     this.callee = callee;
 
+    if (this.forwardSttWhileSpeaking) {
+      getLogger().warn(
+        'PATTER_FORWARD_STT_WHILE_SPEAKING=on: inbound audio is sent to STT ' +
+          'during TTS so transcript barge-in works on echo-masked links. ' +
+          "Without AEC the agent's own voice may be transcribed as a phantom " +
+          'interruption — pair with agent.bargeInStrategies.',
+      );
+    }
+
     this.bargeInStrategies = (deps.agent.bargeInStrategies ?? []).slice();
     const confirmMs = deps.agent.bargeInConfirmMs;
     this.bargeInConfirmMs =
@@ -1611,9 +1639,15 @@ export class StreamHandler {
           ) {
             this.inboundAudioRing.shift();
           }
+          // Opt-in: also forward the frame to STT during TTS so the transcript
+          // barge-in path can receive a transcript on echo-masked links where
+          // the VAD never fires. The ring push above stays unconditional
+          // (leading-edge recovery preserved); only the early-return is gated.
+          // ECHO RISK without AEC. Default OFF → byte-identical push-and-return.
+          if (!this.forwardSttWhileSpeaking) return;
+        } else if ((this.deps.agent.bargeInThresholdMs ?? 300) === 0) {
           return;
         }
-        if ((this.deps.agent.bargeInThresholdMs ?? 300) === 0) return;
       }
 
       // beforeSendToStt hook — gate/transform the audio chunk before it
@@ -1738,6 +1772,10 @@ export class StreamHandler {
     if (typeof ttsCancelable?.cancelActiveStream === 'function') {
       try { ttsCancelable.cancelActiveStream(); } catch { /* defensive */ }
     }
+    // Settle the backgrounded turn dispatch (the abort above unblocks it) so
+    // no in-flight LLM/TTS work touches adapters after they close. Parity with
+    // Python cleanup awaiting ``_dispatch_task``.
+    await this.dispatchTask?.catch(() => {});
     // Drop any pending barge-in timer BEFORE we tear down metrics /
     // adapters. Without this, a call that ends while a barge-in is
     // pending leaves a setTimeout scheduled to fire ``bargeInConfirmMs``
@@ -1775,6 +1813,9 @@ export class StreamHandler {
     if (typeof ttsCancelable?.cancelActiveStream === 'function') {
       try { ttsCancelable.cancelActiveStream(); } catch { /* defensive */ }
     }
+    // Settle the backgrounded turn dispatch before tearing down adapters
+    // (parity with handleStop / Python cleanup).
+    await this.dispatchTask?.catch(() => {});
     // See handleStop — drop pending barge-in timer before cleanup so a
     // dead handler can never fire a stale recordOverlapEnd callback.
     this.clearPendingBargeIn();
@@ -2493,84 +2534,114 @@ export class StreamHandler {
     // Push filtered text to history (after hook, so LLM sees redacted/modified text)
     this.history.push({ role: 'user', text: filteredTranscript, timestamp: Date.now() });
 
-    let responseText = '';
-
     // Wave6B: record that the transcript is being committed to the LLM.
     // onUserTurnCompleted hook is not yet wired in TS — record 0 delay so EOU can still emit.
     this.metricsAcc.recordOnUserTurnCompletedDelay(0);
     this.metricsAcc.recordTurnCommitted();
     closeEndpointSpan();
 
-    if (this.deps.onMessage && typeof this.deps.onMessage === 'function') {
-      try {
-        responseText = await this.deps.onMessage({
+    // Settle the previous turn first (single-in-flight). It is either already
+    // done, or this transcript's handleBargeIn above just aborted it — so this
+    // await is fast and does not head-of-line-block the drain loop in
+    // practice, while preserving strict per-turn history/metrics ordering.
+    await this.dispatchTask?.catch(() => {});
+    // Launch the turn as a tracked background task and RETURN immediately so
+    // the transcript drain loop keeps running handleBargeIn against this LIVE
+    // turn (the head-of-line-blocking fix). Parity with Python
+    // ``create_task(_dispatch_turn(...))``.
+    this.dispatchTask = this.dispatchTurn(filteredTranscript, hookExecutor, hookCtx, interrupted);
+  }
+
+  /**
+   * Post-commit turn body (LLM dispatch → TTS → turn-complete) run as a
+   * tracked background task so the transcript drain loop is not blocked for
+   * the whole (possibly 30-90 s) agent-runtime turn. A barge-in — transcript
+   * (now reachable mid-turn) or VAD — aborts the in-flight ``llmAbort`` and
+   * flips ``isSpeaking``, which the LLM/TTS loops here observe and break on.
+   * Parity with Python ``_dispatch_turn``.
+   */
+  private async dispatchTurn(
+    filteredTranscript: string,
+    hookExecutor: PipelineHookExecutor,
+    hookCtx: HookContext,
+    interrupted: boolean,
+  ): Promise<void> {
+    const label = this.deps.bridge.label;
+    let responseText = '';
+    try {
+      if (this.deps.onMessage && typeof this.deps.onMessage === 'function') {
+        try {
+          responseText = await this.deps.onMessage({
+            text: filteredTranscript,
+            call_id: this.callId,
+            caller: this.caller,
+            callee: this.callee,
+            history: [...this.history.entries],
+          });
+        } catch (e) {
+          getLogger().error(`onMessage error (${label}):`, e);
+          return;
+        }
+        if (!responseText) {
+          // Common misuse: onMessage was provided as an observer (returning void)
+          // but it actually replaces the built-in LLM loop. Warn loudly — the caller
+          // will hear no audio until the handler returns a non-empty string.
+          getLogger().warn(
+            `onMessage returned empty/void (${label}) — no TTS will play. ` +
+            `If you intended to observe transcripts, use onTranscript instead; ` +
+            `if you meant to answer via the built-in LLM, remove onMessage and pass openaiKey.`,
+          );
+        }
+      } else if (this.deps.onMessage && isRemoteUrl(this.deps.onMessage)) {
+        const msgData = {
           text: filteredTranscript,
           call_id: this.callId,
           caller: this.caller,
           callee: this.callee,
           history: [...this.history.entries],
-        });
-      } catch (e) {
-        getLogger().error(`onMessage error (${label}):`, e);
-        return;
-      }
-      if (!responseText) {
-        // Common misuse: onMessage was provided as an observer (returning void)
-        // but it actually replaces the built-in LLM loop. Warn loudly — the caller
-        // will hear no audio until the handler returns a non-empty string.
+        };
+        if (isWebSocketUrl(this.deps.onMessage)) {
+          await this.handleWebSocketResponse(msgData);
+          return;
+        }
+        try {
+          responseText = await this.deps.remoteHandler.callWebhook(this.deps.onMessage, msgData);
+        } catch (e) {
+          getLogger().error(`Webhook remote error (${label}):`, e);
+          return;
+        }
+      } else if (this.llmLoop) {
+        responseText = await this.runPipelineLlm(filteredTranscript, hookExecutor, hookCtx);
+      } else {
         getLogger().warn(
-          `onMessage returned empty/void (${label}) — no TTS will play. ` +
-          `If you intended to observe transcripts, use onTranscript instead; ` +
-          `if you meant to answer via the built-in LLM, remove onMessage and pass openaiKey.`,
+          `Pipeline (${label}) has no llm/onMessage handler — transcript ` +
+            `"${sanitizeLogValue(filteredTranscript.slice(0, 60))}" dropped. ` +
+            'Check that agent.llm or onMessage is configured.',
         );
-      }
-    } else if (this.deps.onMessage && isRemoteUrl(this.deps.onMessage)) {
-      const msgData = {
-        text: filteredTranscript,
-        call_id: this.callId,
-        caller: this.caller,
-        callee: this.callee,
-        history: [...this.history.entries],
-      };
-      if (isWebSocketUrl(this.deps.onMessage)) {
-        await this.handleWebSocketResponse(msgData);
         return;
       }
-      try {
-        responseText = await this.deps.remoteHandler.callWebhook(this.deps.onMessage, msgData);
-      } catch (e) {
-        getLogger().error(`Webhook remote error (${label}):`, e);
-        return;
-      }
-    } else if (this.llmLoop) {
-      responseText = await this.runPipelineLlm(filteredTranscript, hookExecutor, hookCtx);
-    } else {
-      getLogger().warn(
-        `Pipeline (${label}) has no llm/onMessage handler — transcript ` +
-          `"${sanitizeLogValue(filteredTranscript.slice(0, 60))}" dropped. ` +
-          'Check that agent.llm or onMessage is configured.',
-      );
-      return;
-    }
 
-    if (!responseText) return;
+      if (!responseText) return;
 
-    if (this.llmLoop) {
-      await this.emitAssistantTranscript(responseText);
-      this.metricsAcc.recordTtsComplete(responseText);
-    } else {
-      interrupted = await this.runRegularLlm(responseText, hookExecutor, hookCtx) || interrupted;
-      // ``runRegularLlm`` returns the possibly-replaced text via side effect on
-      // history; recompute responseText from the last history entry for the
-      // turn-complete record.
-      responseText = this.history.entries[this.history.entries.length - 1]?.text ?? responseText;
-    }
-
-    // Skip turn-complete when barge-in already recorded the turn as
-    // interrupted — mirrors Python ``if not interrupted``. Prevents
-    // double-counting / turn-count inflation / polluting p95.
-    if (!interrupted) {
-      await this.emitTurnMetrics(this.metricsAcc.recordTurnComplete(responseText));
+      if (this.llmLoop) {
+        await this.emitAssistantTranscript(responseText);
+        this.metricsAcc.recordTtsComplete(responseText);
+      } else {
+        interrupted = (await this.runRegularLlm(responseText, hookExecutor, hookCtx)) || interrupted;
+        // ``runRegularLlm`` returns the possibly-replaced text via side effect on
+        // history; recompute responseText from the last history entry for the
+        // turn-complete record.
+        responseText = this.history.entries[this.history.entries.length - 1]?.text ?? responseText;
+      }
+
+      // Skip turn-complete when barge-in already recorded the turn as
+      // interrupted — mirrors Python ``if not interrupted``. Prevents
+      // double-counting / turn-count inflation / polluting p95.
+      if (!interrupted) {
+        await this.emitTurnMetrics(this.metricsAcc.recordTurnComplete(responseText));
+      }
+    } finally {
+      this.dispatchTask = null;
     }
   }
 
diff --git a/libraries/typescript/tests/long-turn-filler.mocked.test.ts b/libraries/typescript/tests/long-turn-filler.mocked.test.ts
index 2729413..bca84d6 100644
--- a/libraries/typescript/tests/long-turn-filler.mocked.test.ts
+++ b/libraries/typescript/tests/long-turn-filler.mocked.test.ts
@@ -214,6 +214,12 @@ describe('[mocked] pipeline long-turn filler (longTurnMessage)', () => {
     await vi.waitFor(() => expect(ttsCalls).toContain(FILLER), {
       timeout: 5000,
     });
+    // Dispatch now runs as a backgrounded task (so the STT loop can barge-in
+    // mid-turn) — await it so the real reply has been synthesized before we
+    // assert on the full TTS sequence.
+    await (handler as unknown as { dispatchTask: Promise<void> | null }).dispatchTask?.catch(
+      () => {},
+    );
     // The filler was spoken FIRST, then the real reply — exactly one filler.
     expect(ttsCalls.indexOf(FILLER)).toBe(0);
     expect(ttsCalls).toContain('Here is your answer.');
diff --git a/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts b/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
new file mode 100644
index 0000000..d23c3e6
--- /dev/null
+++ b/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
@@ -0,0 +1,187 @@
+/**
+ * [mocked] Backgrounded-dispatch barge-in (parity with Python
+ * test_pipeline_bargein_backgrounded.py).
+ *
+ * The transcript drain loop must keep running ``handleBargeIn`` DURING a long
+ * agent-runtime turn — so a caller who speaks over a slow (30-90 s, tool-
+ * running) Hermes/OpenClaw response actually interrupts it, instead of being
+ * answered only after the turn finishes. Previously ``processTranscript``
+ * awaited ``runPipelineLlm`` inline and stopped draining transcripts for the
+ * whole turn (head-of-line blocking).
+ *
+ * AUTHENTIC: the real StreamHandler + LLMLoop + pipeline turn path. Mocked only
+ * at the external boundary: the LLM provider stream (parks until aborted, like
+ * Hermes pre-first-token) and the TTS byte stream.
+ */
+
+import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
+import { StreamHandler } from '../src/stream-handler';
+import type { TelephonyBridge, StreamHandlerDeps } from '../src/stream-handler';
+import { MetricsStore } from '../src/dashboard/store';
+import { RemoteMessageHandler } from '../src/remote-message';
+import type { AgentOptions } from '../src/types';
+import type { LLMProvider, LLMChunk, LLMStreamOptions } from '../src/llm-loop';
+import type { WebSocket as WSWebSocket } from 'ws';
+
+vi.mock('../src/providers/elevenlabs-tts', async (importOriginal) => {
+  const original =
+    await importOriginal<typeof import('../src/providers/elevenlabs-tts')>();
+  return {
+    ...original,
+    ElevenLabsTTS: vi.fn().mockImplementation(() => ({
+      synthesizeStream: vi.fn(async function* () {
+        yield Buffer.from('tts-audio');
+      }),
+    })),
+  };
+});
+vi.mock('../src/dashboard/persistence', () => ({ notifyDashboard: vi.fn() }));
+
+import { ElevenLabsTTS } from '../src/providers/elevenlabs-tts';
+
+function makeMockWs(): WSWebSocket {
+  return {
+    send: vi.fn(), close: vi.fn(), on: vi.fn(), once: vi.fn(), readyState: 1,
+    removeListener: vi.fn(), addEventListener: vi.fn(), removeEventListener: vi.fn(),
+  } as unknown as WSWebSocket;
+}
+
+function makeMockStt() {
+  let cb: ((t: { isFinal?: boolean; text?: string }) => Promise<void>) | undefined;
+  return {
+    connect: vi.fn().mockResolvedValue(undefined),
+    close: vi.fn(),
+    sendAudio: vi.fn(),
+    onTranscript: vi.fn((fn: (t: { isFinal?: boolean; text?: string }) => Promise<void>) => {
+      cb = fn;
+    }),
+    get requestId() { return 'stt-bg-req'; },
+    emitTranscript(text: string): Promise<void> | undefined {
+      return cb?.({ isFinal: true, text });
+    },
+  };
+}
+
+function makeTwilioBridge(mockStt: ReturnType<typeof makeMockStt>): TelephonyBridge {
+  return {
+    label: 'Twilio', telephonyProvider: 'twilio',
+    sendAudio: vi.fn(), sendMark: vi.fn(), sendClear: vi.fn(),
+    transferCall: vi.fn().mockResolvedValue(undefined),
+    endCall: vi.fn().mockResolvedValue(undefined),
+    createStt: vi.fn().mockReturnValue(mockStt),
+    queryTelephonyCost: vi.fn().mockResolvedValue(undefined),
+  } as unknown as TelephonyBridge;
+}
+
+/** Provider that parks until its turn is aborted — models a long Hermes turn
+ * (tools running before the first token) that only ends on barge-in. */
+function makeParkUntilAbortProvider(aborted: { value: boolean }): LLMProvider {
+  return {
+    model: 'agent-runtime-1',
+    async *stream(
+      _messages: Array<Record<string, unknown>>,
+      _tools?: Array<Record<string, unknown>> | null,
+      opts?: LLMStreamOptions,
+    ): AsyncGenerator<LLMChunk, void, unknown> {
+      const signal = opts?.signal;
+      await new Promise<void>((resolve) => {
+        if (signal?.aborted) return resolve();
+        signal?.addEventListener('abort', () => resolve(), { once: true });
+      });
+      aborted.value = true;
+      // Yield nothing after abort — the turn was interrupted pre-first-token.
+    },
+  } as unknown as LLMProvider;
+}
+
+function makeDeps(bridge: TelephonyBridge, agentOverrides: Partial<AgentOptions>): StreamHandlerDeps {
+  const mockTts = new (ElevenLabsTTS as unknown as new (k: string, v?: string) => {
+    synthesizeStream: (t: string) => AsyncIterable<Buffer>;
+  })('el-key', 'rachel');
+  const agent: AgentOptions = {
+    systemPrompt: 'You are a test pipeline agent.',
+    provider: 'pipeline',
+    tts: mockTts as unknown as AgentOptions['tts'],
+    ...agentOverrides,
+  } as AgentOptions;
+  return {
+    config: {}, agent, bridge,
+    metricsStore: new MetricsStore(),
+    pricing: null,
+    remoteHandler: new RemoteMessageHandler(),
+    recording: false,
+    buildAIAdapter: vi.fn(),
+    sanitizeVariables: vi.fn((raw: Record<string, unknown>) => raw),
+    resolveVariables: vi.fn((tpl: string) => tpl),
+  } as unknown as StreamHandlerDeps;
+}
+
+describe('[mocked] pipeline backgrounded-dispatch barge-in', () => {
+  beforeEach(() => {
+    vi.spyOn(globalThis, 'fetch').mockResolvedValue({
+      ok: true, status: 200, json: async () => ({}), text: async () => '',
+    } as unknown as Response);
+  });
+  afterEach(() => {
+    vi.restoreAllMocks();
+    delete process.env.PATTER_FORWARD_STT_WHILE_SPEAKING;
+  });
+
+  it('a barge-in transcript cancels the in-flight long turn (loop not blocked)', async () => {
+    const stt = makeMockStt();
+    const bridge = makeTwilioBridge(stt);
+    const aborted = { value: false };
+    const deps = makeDeps(bridge, {
+      llm: makeParkUntilAbortProvider(aborted) as unknown as AgentOptions['llm'],
+    });
+    const handler = new StreamHandler(deps, makeMockWs(), '+15551111111', '+15552222222');
+    // Past the warmup gate so the barge-in is allowed to fire.
+    (handler as unknown as { canBargeIn: () => boolean }).canBargeIn = () => true;
+
+    await handler.handleCallStart('CA-bg-bargein');
+
+    // Turn 1 starts and parks on its (long) LLM stream.
+    await stt.emitTranscript('dimmi una storia lunga');
+    await vi.waitFor(
+      () => expect((handler as unknown as { isSpeaking: boolean }).isSpeaking).toBe(true),
+      { timeout: 3000 },
+    );
+
+    // Caller barges in WHILE turn 1 is in flight. With the old inline-await
+    // this transcript would not be read until turn 1 ended.
+    await stt.emitTranscript('ferma per favore');
+
+    // The in-flight turn was cancelled: the carrier buffer was cleared and the
+    // LLM stream's abort signal fired (turn torn down pre-first-token).
+    await vi.waitFor(
+      () => expect(bridge.sendClear as ReturnType<typeof vi.fn>).toHaveBeenCalled(),
+      { timeout: 3000 },
+    );
+    expect((handler as unknown as { isSpeaking: boolean }).isSpeaking).toBe(false);
+    await vi.waitFor(() => expect(aborted.value).toBe(true), { timeout: 3000 });
+
+    // Settle the backgrounded dispatch before teardown.
+    await (handler as unknown as { dispatchTask: Promise<void> | null }).dispatchTask?.catch(
+      () => {},
+    );
+  }, 10000);
+
+  it('PATTER_FORWARD_STT_WHILE_SPEAKING is read from env (default off)', () => {
+    const offBridge = makeTwilioBridge(makeMockStt());
+    const offHandler = new StreamHandler(
+      makeDeps(offBridge, {}), makeMockWs(), '+1', '+2',
+    );
+    expect(
+      (offHandler as unknown as { forwardSttWhileSpeaking: boolean }).forwardSttWhileSpeaking,
+    ).toBe(false);
+
+    process.env.PATTER_FORWARD_STT_WHILE_SPEAKING = '1';
+    const onBridge = makeTwilioBridge(makeMockStt());
+    const onHandler = new StreamHandler(
+      makeDeps(onBridge, {}), makeMockWs(), '+1', '+2',
+    );
+    expect(
+      (onHandler as unknown as { forwardSttWhileSpeaking: boolean }).forwardSttWhileSpeaking,
+    ).toBe(true);
+  });
+});

From 762ac6ba102bd13ca640a399e8d807d9ef431787 Mon Sep 17 00:00:00 2001
From: nicolotognoni <nicolo.tognoni1@gmail.com>
Date: Sun, 7 Jun 2026 11:26:51 +0200
Subject: [PATCH 04/11] =?UTF-8?q?fix(pipeline):=20harden=20backgrounded=20?=
 =?UTF-8?q?barge-in=20=E2=80=94=20history=20snapshot,=20abort-on-cancel,?=
 =?UTF-8?q?=20bounded=20teardown?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three defects found by adversarial review of the previous commit's
decoupled-dispatch barge-in, all fixed with full Python/TS parity:

1. (HIGH, TS) Per-turn history was passed to the LLM by LIVE reference. With the
   turn dispatch backgrounded, a following transcript's user push (on the drain
   loop while the turn is in flight) could land in the in-flight turn's prompt
   before buildMessages read it — conflating two turns. Now a history SNAPSHOT
   is captured at launch and threaded through dispatchTurn → runPipelineLlm →
   llmLoop.run (and the onMessage/webhook paths), mirroring Python's
   list(self.conversation_history). Regression test added.

2. (MEDIUM, Python) On cleanup/hangup hard-cancel while the provider was parked
   pre-first-token, asyncio.wait did not cancel the in-flight create() POST,
   orphaning the Hermes/OpenClaw connection ("Task exception was never
   retrieved"). _open_stream_with_cancel now catches CancelledError and aborts
   the create task. Test added.

3. (MEDIUM, TS) handleStop/handleWsClose awaited the backgrounded dispatch with
   no timeout — a hung user onMessage (no AbortSignal) could block call teardown
   indefinitely. Teardown now bounds the wait via settleDispatchForTeardown
   (DISPATCH_SETTLE_TIMEOUT_MS = 30s); Python hard-cancels the task.
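
Item 2 hinges on an asyncio detail: cancelling a task that is awaiting
asyncio.wait() does not cancel the tasks being waited on. A minimal standalone
sketch of the pattern follows. The names open_with_cancel, fake_create and demo
are hypothetical stand-ins for illustration, not the SDK's code.

```python
import asyncio


async def open_with_cancel(create_coro, cancel_event: asyncio.Event):
    """Race a request against a cancel signal; abort it on hard-cancel.

    Hypothetical stand-in for the pattern described in item 2.
    """
    create_task = asyncio.create_task(create_coro)
    cancel_task = asyncio.create_task(cancel_event.wait())
    try:
        await asyncio.wait(
            {create_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
        )
    except asyncio.CancelledError:
        # asyncio.wait() leaves its tasks running when the waiter is
        # cancelled, so the in-flight request is aborted explicitly here.
        create_task.cancel()
        try:
            await create_task
        except BaseException:
            pass
        raise
    finally:
        cancel_task.cancel()
    if not create_task.done():
        # The cancel signal won the race: drop the request.
        create_task.cancel()
        return None
    return create_task.result()


async def demo() -> bool:
    """Return True when the parked request saw the cancellation."""
    started = asyncio.Event()
    aborted = {"value": False}

    async def fake_create():
        started.set()
        try:
            await asyncio.Event().wait()  # parks forever, like a slow server
        except asyncio.CancelledError:
            aborted["value"] = True
            raise

    outer = asyncio.create_task(open_with_cancel(fake_create(), asyncio.Event()))
    await started.wait()
    outer.cancel()  # simulate teardown hard-cancelling the dispatch task
    try:
        await outer
    except asyncio.CancelledError:
        pass
    return aborted["value"]
```

Running asyncio.run(demo()) exercises the hard-cancel path. Without the
except branch, fake_create would stay parked after the outer task is cancelled.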

Python 2220 / TypeScript 1766 pass; tsc + build clean.
---
 .../python/getpatter/services/llm_loop.py     | 13 ++++
 .../unit/test_provider_prefirsttoken_abort.py | 40 ++++++++++
 libraries/typescript/src/stream-handler.ts    | 73 ++++++++++++++++---
 ...peline-bargein-backgrounded.mocked.test.ts | 57 +++++++++++++++
 4 files changed, 172 insertions(+), 11 deletions(-)

diff --git a/libraries/python/getpatter/services/llm_loop.py b/libraries/python/getpatter/services/llm_loop.py
index c5bd6ec..5c038a2 100644
--- a/libraries/python/getpatter/services/llm_loop.py
+++ b/libraries/python/getpatter/services/llm_loop.py
@@ -686,6 +686,19 @@ async def _open_stream_with_cancel(self, kwargs: dict, cancel_event):
             await asyncio.wait(
                 {create_task, cancel_task}, return_when=asyncio.FIRST_COMPLETED
             )
+        except asyncio.CancelledError:
+            # The containing dispatch task was hard-cancelled (cleanup /
+            # hangup teardown) while parked here pre-first-token.
+            # ``asyncio.wait`` does NOT cancel the futures it waits on, so abort
+            # the in-flight POST to free the Hermes/OpenClaw connection instead
+            # of orphaning it (which would later raise "Task exception was never
+            # retrieved" when the abandoned request errors).
+            create_task.cancel()
+            try:
+                await create_task
+            except BaseException:  # noqa: BLE001 - aborting in-flight request
+                pass
+            raise
         finally:
             cancel_task.cancel()
         if not create_task.done():
diff --git a/libraries/python/tests/unit/test_provider_prefirsttoken_abort.py b/libraries/python/tests/unit/test_provider_prefirsttoken_abort.py
index 13391ee..80daee9 100644
--- a/libraries/python/tests/unit/test_provider_prefirsttoken_abort.py
+++ b/libraries/python/tests/unit/test_provider_prefirsttoken_abort.py
@@ -114,6 +114,46 @@ async def _consume() -> None:
     assert chunks == []
 
 
+@pytest.mark.mocked
+@pytest.mark.asyncio
+async def test_task_cancel_aborts_in_flight_create_no_orphan() -> None:
+    """When the containing dispatch task is hard-cancelled (cleanup / hangup)
+    while parked pre-first-token, the in-flight create() POST must be aborted —
+    not orphaned (which would later raise 'Task exception was never retrieved'
+    and leak the Hermes/OpenClaw connection)."""
+    provider = OpenAICompatibleLLMProvider(base_url="http://127.0.0.1:9/v1", model="m")
+    cancel = asyncio.Event()
+    create_started = asyncio.Event()
+    create_cancelled = {"value": False}
+
+    async def _never_returns(**_kwargs):
+        create_started.set()
+        try:
+            await asyncio.Event().wait()  # parks (server running tools)
+        except asyncio.CancelledError:
+            create_cancelled["value"] = True
+            raise
+
+    fake_client = MagicMock()
+    fake_client.chat.completions.create = _never_returns
+    provider._client = fake_client  # type: ignore[assignment]
+
+    async def _consume() -> None:
+        async for _chunk in provider.stream(
+            [{"role": "user", "content": "hi"}], cancel_event=cancel
+        ):
+            pass
+
+    task = asyncio.create_task(_consume())
+    await asyncio.wait_for(create_started.wait(), timeout=1.0)
+    # Simulate cleanup() hard-cancelling _dispatch_task while parked in create().
+    task.cancel()
+    with pytest.raises(asyncio.CancelledError):
+        await task
+    await asyncio.sleep(0)  # let the abort propagate
+    assert create_cancelled["value"] is True
+
+
 @pytest.mark.mocked
 @pytest.mark.asyncio
 async def test_no_cancel_event_streams_normally() -> None:
diff --git a/libraries/typescript/src/stream-handler.ts b/libraries/typescript/src/stream-handler.ts
index 23d1f51..cbacbf3 100644
--- a/libraries/typescript/src/stream-handler.ts
+++ b/libraries/typescript/src/stream-handler.ts
@@ -1031,6 +1031,15 @@ export class StreamHandler {
    * Python ``_dispatch_task``.
    */
   private dispatchTask: Promise<void> | null = null;
+  /**
+   * Cap (ms) on how long teardown waits for the backgrounded dispatch to
+   * settle. JS promises are not cancellable, so a user-supplied ``onMessage``
+   * (which receives no AbortSignal) parked on a hung external call could block
+   * call cleanup indefinitely — `llmAbort.abort()` only unblocks the built-in
+   * LLM/TTS paths. We bound the WAIT (Python hard-cancels the task instead).
+   * 30 s matches the webhook ceiling.
+   */
+  private static readonly DISPATCH_SETTLE_TIMEOUT_MS = 30_000;
   /**
    * Opt-in (default OFF): forward inbound audio to STT even while the agent is
    * speaking, so the transcript barge-in path can receive a transcript on
@@ -1752,6 +1761,27 @@ export class StreamHandler {
     }
   }
 
+  /**
+   * Await the backgrounded turn dispatch during teardown, but never block
+   * longer than ``DISPATCH_SETTLE_TIMEOUT_MS``. The earlier ``llmAbort.abort()``
+   * settles the built-in LLM/TTS paths immediately; the cap only bites a
+   * misbehaving user ``onMessage`` parked on a hung external call (JS promises
+   * can't be cancelled). No-op when nothing is in flight.
+   */
+  private async settleDispatchForTeardown(): Promise<void> {
+    if (!this.dispatchTask) return;
+    const settle = this.dispatchTask.catch(() => {});
+    let timer: ReturnType<typeof setTimeout> | undefined;
+    const cap = new Promise<void>((resolve) => {
+      timer = setTimeout(resolve, StreamHandler.DISPATCH_SETTLE_TIMEOUT_MS);
+    });
+    try {
+      await Promise.race([settle, cap]);
+    } finally {
+      if (timer) clearTimeout(timer);
+    }
+  }
+
   /** Handle call stop / stream end. */
   /** Handle a carrier-emitted `stop` event signalling the call has ended. */
   async handleStop(): Promise<void> {
@@ -1773,9 +1803,10 @@ export class StreamHandler {
       try { ttsCancelable.cancelActiveStream(); } catch { /* defensive */ }
     }
     // Settle the backgrounded turn dispatch (the abort above unblocks it) so
-    // no in-flight LLM/TTS work touches adapters after they close. Parity with
-    // Python cleanup awaiting ``_dispatch_task``.
-    await this.dispatchTask?.catch(() => {});
+    // no in-flight LLM/TTS work touches adapters after they close — bounded so
+    // a hung user onMessage cannot block teardown. Parity with Python cleanup
+    // hard-cancelling ``_dispatch_task``.
+    await this.settleDispatchForTeardown();
     // Drop any pending barge-in timer BEFORE we tear down metrics /
     // adapters. Without this, a call that ends while a barge-in is
     // pending leaves a setTimeout scheduled to fire ``bargeInConfirmMs``
@@ -1813,9 +1844,9 @@ export class StreamHandler {
     if (typeof ttsCancelable?.cancelActiveStream === 'function') {
       try { ttsCancelable.cancelActiveStream(); } catch { /* defensive */ }
     }
-    // Settle the backgrounded turn dispatch before tearing down adapters
-    // (parity with handleStop / Python cleanup).
-    await this.dispatchTask?.catch(() => {});
+    // Settle the backgrounded turn dispatch before tearing down adapters,
+    // bounded so a hung user onMessage cannot block teardown (see handleStop).
+    await this.settleDispatchForTeardown();
     // See handleStop — drop pending barge-in timer before cleanup so a
     // dead handler can never fire a stale recordOverlapEnd callback.
     this.clearPendingBargeIn();
@@ -2545,11 +2576,24 @@ export class StreamHandler {
     // await is fast and does not head-of-line-block the drain loop in
     // practice, while preserving strict per-turn history/metrics ordering.
     await this.dispatchTask?.catch(() => {});
+    // Snapshot history at launch — AFTER this turn's own user push above, BEFORE
+    // any later transcript can mutate it. The dispatch runs in the background,
+    // so passing the LIVE ``this.history.entries`` would let a following
+    // transcript's user push (which happens on the drain loop while this turn is
+    // in flight) contaminate this turn's LLM prompt. Mirrors Python's
+    // ``list(self.conversation_history)`` snapshot.
+    const historySnapshot = [...this.history.entries];
     // Launch the turn as a tracked background task and RETURN immediately so
     // the transcript drain loop keeps running handleBargeIn against this LIVE
     // turn (the head-of-line-blocking fix). Parity with Python
     // ``create_task(_dispatch_turn(...))``.
-    this.dispatchTask = this.dispatchTurn(filteredTranscript, hookExecutor, hookCtx, interrupted);
+    this.dispatchTask = this.dispatchTurn(
+      filteredTranscript,
+      hookExecutor,
+      hookCtx,
+      interrupted,
+      historySnapshot,
+    );
   }
 
   /**
@@ -2565,6 +2609,7 @@ export class StreamHandler {
     hookExecutor: PipelineHookExecutor,
     hookCtx: HookContext,
     interrupted: boolean,
+    historySnapshot: Array<{ role: string; text: string }>,
   ): Promise<void> {
     const label = this.deps.bridge.label;
     let responseText = '';
@@ -2576,7 +2621,7 @@ export class StreamHandler {
             call_id: this.callId,
             caller: this.caller,
             callee: this.callee,
-            history: [...this.history.entries],
+            history: historySnapshot,
           });
         } catch (e) {
           getLogger().error(`onMessage error (${label}):`, e);
@@ -2598,7 +2643,7 @@ export class StreamHandler {
           call_id: this.callId,
           caller: this.caller,
           callee: this.callee,
-          history: [...this.history.entries],
+          history: historySnapshot,
         };
         if (isWebSocketUrl(this.deps.onMessage)) {
           await this.handleWebSocketResponse(msgData);
@@ -2611,7 +2656,12 @@ export class StreamHandler {
           return;
         }
       } else if (this.llmLoop) {
-        responseText = await this.runPipelineLlm(filteredTranscript, hookExecutor, hookCtx);
+        responseText = await this.runPipelineLlm(
+          filteredTranscript,
+          hookExecutor,
+          hookCtx,
+          historySnapshot,
+        );
       } else {
         getLogger().warn(
           `Pipeline (${label}) has no llm/onMessage handler — transcript ` +
@@ -2909,6 +2959,7 @@ export class StreamHandler {
     filteredTranscript: string,
     hookExecutor: PipelineHookExecutor,
     hookCtx: HookContext,
+    historySnapshot: Array<{ role: string; text: string }>,
   ): Promise<string> {
     const label = this.deps.bridge.label;
     const callCtx = { call_id: this.callId, caller: this.caller, callee: this.callee };
@@ -2972,7 +3023,7 @@ export class StreamHandler {
       try {
         for await (const token of this.llmLoop!.run(
           filteredTranscript,
-          this.history.entries,
+          historySnapshot,
           callCtx,
           this.metricsAcc,
           hookExecutor,
diff --git a/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts b/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
index d23c3e6..91e0476 100644
--- a/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
+++ b/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
@@ -94,6 +94,30 @@ function makeParkUntilAbortProvider(aborted: { value: boolean }): LLMProvider {
   } as unknown as LLMProvider;
 }
 
+/** Captures the built messages (prompt) the first stream() call receives, then
+ * yields a quick reply — to assert the in-flight turn's prompt is built from a
+ * history SNAPSHOT and cannot be contaminated by a later transcript's push. */
+function makeMessageCapturingProvider(captured: {
+  messages?: Array<{ role: string; content: string }>;
+}): LLMProvider {
+  return {
+    model: 'agent-runtime-1',
+    async *stream(
+      messages: Array<Record<string, unknown>>,
+      _tools?: Array<Record<string, unknown>> | null,
+      _opts?: LLMStreamOptions,
+    ): AsyncGenerator<LLMChunk, void, unknown> {
+      if (!captured.messages) {
+        captured.messages = JSON.parse(JSON.stringify(messages)) as Array<{
+          role: string;
+          content: string;
+        }>;
+      }
+      yield { type: 'text', content: 'va bene. ' };
+    },
+  } as unknown as LLMProvider;
+}
+
 function makeDeps(bridge: TelephonyBridge, agentOverrides: Partial<AgentOptions>): StreamHandlerDeps {
   const mockTts = new (ElevenLabsTTS as unknown as new (k: string, v?: string) => {
     synthesizeStream: (t: string) => AsyncIterable<Buffer>;
@@ -184,4 +208,37 @@ describe('[mocked] pipeline backgrounded-dispatch barge-in', () => {
       (onHandler as unknown as { forwardSttWhileSpeaking: boolean }).forwardSttWhileSpeaking,
     ).toBe(true);
   });
+
+  it("a later transcript's history push does NOT contaminate the in-flight turn's prompt", async () => {
+    const stt = makeMockStt();
+    const bridge = makeTwilioBridge(stt);
+    const captured: { messages?: Array<{ role: string; content: string }> } = {};
+    const deps = makeDeps(bridge, {
+      llm: makeMessageCapturingProvider(captured) as unknown as AgentOptions['llm'],
+    });
+    const handler = new StreamHandler(deps, makeMockWs(), '+15551111111', '+15552222222');
+    await handler.handleCallStart('CA-bg-snapshot');
+
+    // Turn A is committed and its dispatch launched (backgrounded). emitTranscript
+    // resolves after the snapshot was captured at launch.
+    await stt.emitTranscript('domanda del turno A');
+    // Simulate a FOLLOWING transcript's user push landing on the drain loop while
+    // turn A is still in flight (the exact race the snapshot fix guards against).
+    (handler as unknown as { history: { push: (e: unknown) => void } }).history.push({
+      role: 'user',
+      text: 'TURNO B PIU TARDI',
+      timestamp: Date.now(),
+    });
+
+    await vi.waitFor(() => expect(captured.messages).toBeDefined(), { timeout: 3000 });
+    await (handler as unknown as { dispatchTask: Promise<void> | null }).dispatchTask?.catch(
+      () => {},
+    );
+
+    const contents = (captured.messages ?? []).map((m) => m.content);
+    // Turn A's prompt was built from the launch-time snapshot — the later push
+    // is absent. With the pre-fix LIVE array it would leak in.
+    expect(contents).toContain('domanda del turno A');
+    expect(contents).not.toContain('TURNO B PIU TARDI');
+  }, 10000);
 });

From 83052d1eb1c6c9569b725eb37c8231fc1fc2f76e Mon Sep 17 00:00:00 2001
From: nicolotognoni <nicolo.tognoni1@gmail.com>
Date: Sun, 7 Jun 2026 12:32:36 +0200
Subject: [PATCH 05/11] =?UTF-8?q?fix(pipeline):=20echo-safe=20barge-in=20?=
 =?UTF-8?q?=E2=80=94=20drop=20agent=20self-echo,=20keep=20fast=20follow-up?=
 =?UTF-8?q?,=20mark=20interrupted=20turns?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Residual Hermes/OpenClaw barge-in failure (live test, PATTER_FORWARD_STT_WHILE_SPEAKING=1,
no AEC, no barge_in_strategies): barge-in fired on a PHANTOM transcript ("che tu
l'hai" — the agent's own Italian TTS echoing into Deepgram, not caught by the
English-only hallucination filter), the real follow-up was dropped leaving an
empty [interrupted] turn, and the post-barge-in context was poisoned.

A root-cause analysis (code trace + web research: Coval/Pipecat/LiveKit/Azure)
confirmed this is NOT an interruptibility problem: the abort already works
(bargein_ms=1.0). It is a GATE + ECHO + CONTEXT-REWRITE problem. Fixes, with full
Python/TS parity:

1. Echo guard (language-agnostic). Track the agent's in-flight spoken text
   (_current_agent_spoken_text / currentAgentSpokenText). A new _looks_like_echo
   / looksLikeEcho (substring OR >=60% word overlap) drops any barge-in
   (_handle_barge_in) or commit (_commit_transcript) that is the agent's own
   TTS echoing back. Active ONLY while _forward_stt_while_speaking, so the
   default VAD path and real post-turn replies are unaffected.

2. Back-to-back dedup fix. The <500ms drop now applies only to a NEAR-DUPLICATE
   of the previous final (Deepgram speech_final+is_final for the same
   utterance), via _is_near_duplicate / isNearDuplicate. A genuinely different
   fast follow-up is no longer swallowed into an empty [interrupted] turn.

3. Interrupted-turn context rewrite. On a confirmed mid-turn barge-in the spoken
   prefix is appended to history with an "[interrupted by caller]" marker, so a
   stateful agent runtime (Hermes/OpenClaw, X-Hermes-Session-Id) sees on the next
   turn that it was cut off and what the caller actually heard.

Plus: fixed the stale _can_barge_in docstring (0.25 -> 0.5 s no-AEC gate).
Recommended caller config (unchanged SDK defaults): barge_in_strategies=
(MinWordsStrategy(min_words=2),), echo_cancellation=True.
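
To make the echo guard (item 1) and the near-duplicate check (item 2) concrete,
here is a standalone sketch that restates the logic of the Python helpers this
patch adds. The public-style names and the example sentences are made up for
illustration; only the 0.6 threshold and the matching rules come from the patch.

```python
ECHO_WORD_OVERLAP_THRESHOLD = 0.6  # same value as the patch's constant


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(
        ch if (ch.isalnum() or ch.isspace()) else " " for ch in text.lower()
    )
    return " ".join(cleaned.split())


def looks_like_echo(candidate: str, agent_text: str) -> bool:
    """True when candidate is a substring of, or mostly overlaps, agent_text."""
    a = normalize(agent_text)
    c = normalize(candidate)
    if not a or not c:
        return False
    if c in a:
        return True
    words = c.split()
    agent_words = set(a.split())
    overlap = sum(1 for w in words if w in agent_words) / len(words)
    return overlap >= ECHO_WORD_OVERLAP_THRESHOLD


def is_near_duplicate(a: str, b: str) -> bool:
    """True when two normalised finals are identical or one contains the other."""
    if not a or not b:
        return False
    return a == b or a in b or b in a


# Hypothetical in-flight agent sentence.
AGENT = "Your appointment is confirmed for Tuesday at ten."

# Verbatim fragment of the agent's speech: substring match.
echo_fragment = looks_like_echo("confirmed for Tuesday", AGENT)
# Garbled echo: 3 of 4 words appear in the agent text (0.75 >= 0.6).
echo_garbled = looks_like_echo("appointment Tuesday confirmed ok", AGENT)
# Real interruption: no shared words, so it is kept.
real_speech = looks_like_echo("Stop, please wait", AGENT)

# Double-emitted final versus a genuinely different fast follow-up.
same_utterance = is_near_duplicate("stop please", "stop please wait")
different_utterance = is_near_duplicate("stop please", "what time is it")
```

With these inputs the first two candidates are classified as echo and dropped,
the third is treated as real caller speech, and only the first pair of finals
counts as a near-duplicate.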

Tests: test_pipeline_echo_dedup.py (19) + pipeline-echo-dedup.mocked.test.ts (11);
updated the back-to-back dedup tests to the corrected behaviour. Python 2236 /
TypeScript 1777 pass; tsc + build clean.
---
 CHANGELOG.md                                  |   4 +
 libraries/python/getpatter/stream_handler.py  | 123 +++++++++-
 .../test_pipeline_bargein_backgrounded.py     |   6 +-
 .../python/tests/unit/test_pipeline_dedup.py  |  38 ++-
 .../tests/unit/test_pipeline_echo_dedup.py    | 218 ++++++++++++++++++
 libraries/typescript/src/stream-handler.ts    | 114 ++++++++-
 ...peline-bargein-backgrounded.mocked.test.ts |  18 +-
 .../tests/pipeline-echo-dedup.mocked.test.ts  | 119 ++++++++++
 8 files changed, 618 insertions(+), 22 deletions(-)
 create mode 100644 libraries/python/tests/unit/test_pipeline_echo_dedup.py
 create mode 100644 libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts

diff --git a/CHANGELOG.md b/CHANGELOG.md
index a98937b..7c09b0e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,10 @@
   - **Decoupled, single-in-flight dispatch.** The turn now runs as one tracked background task (`_dispatch_task` / `dispatchTask`) so the receive loop keeps draining transcripts and runs barge-in detection against the LIVE turn. Exactly one dispatch is in flight: the loop settles the previous one before launching the next, so `conversation_history` / metrics ordering is unchanged. With no barge-in (default, VAD present, normal LLM) behaviour is unchanged — the loop still awaits the final turn to settle before returning.
   - **Prompt pre-first-token abort (Python).** Agent runtimes run tools for tens of seconds before the first token, during which the per-chunk `cancel_event` check never runs. The provider now races `create()` + first-byte against the cancel signal and spawns a watchdog that `close()`s the response the instant a barge-in fires, so the request is torn down immediately instead of blocking the next turn (TS already aborts promptly via `fetch` + `AbortController`). The VAD legacy barge-in branch now also sets `_llm_cancel_event` (it previously only flipped `_is_speaking`), and the OpenAI-compatible client uses an explicit httpx read/connect timeout so a dead gateway fails fast.
   - **`PATTER_FORWARD_STT_WHILE_SPEAKING` (opt-in, default off).** Forwards inbound audio to STT during TTS even with a VAD configured, so the transcript barge-in path can receive a transcript on echo-masked PSTN links where the VAD never fires. The leading-edge ring buffer is still captured. **Echo caveat:** without AEC the agent's own voice may be transcribed as a phantom interruption — pair with `agent.barge_in_strategies`. `libraries/python/getpatter/stream_handler.py`, `.../services/llm_loop.py`, `.../llm/openai_compatible.py`, `libraries/typescript/src/stream-handler.ts`.
+- **Echo-safe barge-in: the agent no longer interrupts itself, and a fast real follow-up is no longer lost.** Hardening for the echo-prone agent-runtime case (`PATTER_FORWARD_STT_WHILE_SPEAKING` on, no AEC), where the agent's own TTS bled into STT and was transcribed (e.g. a garbled fragment in another language not covered by the English hallucination filter), firing a phantom barge-in and leaving an empty `[interrupted]` turn:
+  - **Echo guard** — a language-agnostic check (`_looks_like_echo` / `looksLikeEcho`: substring or ≥60% word overlap against the agent's in-flight spoken text) now drops any candidate barge-in/commit that is the agent's own speech echoing back. Active only while forwarding audio during TTS, so the default VAD path and real post-turn replies are untouched.
+  - **Back-to-back dedup fix** — a final within 500 ms of the previous is now dropped only when it is a *near-duplicate* (Deepgram emitting `speech_final` then `is_final` for the same utterance). A genuinely different fast follow-up (e.g. the real interruption right after a suppressed phantom) is kept instead of being silently swallowed into an empty turn.
+  - **Interrupted-turn context rewrite** — on a confirmed mid-turn barge-in the spoken prefix is recorded in history with an `[interrupted by caller]` marker (instead of an ungrounded full reply), so a stateful agent runtime (Hermes/OpenClaw, keyed by `X-Hermes-Session-Id`) sees on the next turn that it was cut off and what the caller actually heard. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
 
 ## 0.6.5 (2026-06-05)
 
diff --git a/libraries/python/getpatter/stream_handler.py b/libraries/python/getpatter/stream_handler.py
index 14a0798..5cf0729 100644
--- a/libraries/python/getpatter/stream_handler.py
+++ b/libraries/python/getpatter/stream_handler.py
@@ -131,6 +131,50 @@
 # ("We'll see you next time. Bye bye.") without importing ``re``.
 _SENTENCE_ENDERS = ".!?…。！？"
 
+# Fraction of a candidate transcript's words that must appear in the agent's
+# in-flight spoken text for it to be treated as the agent's own TTS echoing
+# back (rather than real caller speech). 0.6 keeps real replies that merely
+# share a couple of words while catching garbled echo fragments. Language-
+# agnostic — unlike the English-only ``_STT_HALLUCINATIONS`` set.
+_ECHO_WORD_OVERLAP_THRESHOLD = 0.6
+
+
+def _normalize_for_echo(text: str) -> str:
+    """Lowercase, drop punctuation, collapse whitespace — for echo comparison."""
+    out = []
+    for ch in text.lower():
+        out.append(ch if (ch.isalnum() or ch.isspace()) else " ")
+    return " ".join("".join(out).split())
+
+
+def _looks_like_echo(candidate: str, agent_text: str) -> bool:
+    """True when ``candidate`` looks like a fragment of ``agent_text`` — i.e. the
+    agent's own TTS bleeding into STT (forwarded during TTS without effective
+    AEC) rather than real caller speech. Substring match OR high word-overlap.
+    """
+    a = _normalize_for_echo(agent_text)
+    c = _normalize_for_echo(candidate)
+    if not a or not c:
+        return False
+    if c in a:  # candidate is verbatim a fragment of what the agent said
+        return True
+    words = c.split()
+    if not words:
+        return False
+    agent_words = set(a.split())
+    overlap = sum(1 for w in words if w in agent_words) / len(words)
+    return overlap >= _ECHO_WORD_OVERLAP_THRESHOLD
+
+
+def _is_near_duplicate(a: str, b: str) -> bool:
+    """True when two normalised finals are the same utterance double-emitted
+    (identical, or one a substring of the other) — used to drop Deepgram's
+    ``speech_final``+``is_final`` back-to-back pair WITHOUT swallowing a
+    genuinely different utterance that merely arrives quickly."""
+    if not a or not b:
+        return False
+    return a == b or a in b or b in a
+
 
 def _is_stt_hallucination(text: str) -> bool:
     """True when *text* is — or is composed entirely of — known STT
@@ -2383,6 +2427,17 @@ def __init__(
         # a no-op; cancelling avoids leaving an idle ``asyncio.sleep`` task
         # per turn on long, fast-turn calls.
         self._grace_task: asyncio.Task | None = None
+        # The agent's spoken text for the CURRENT turn, accumulated as tokens
+        # stream. Used by the echo guard to reject the agent's own TTS bleeding
+        # back into STT (when audio is forwarded during TTS without effective
+        # AEC) so it never barges in or becomes a phantom user turn. Reset at
+        # ``_begin_speaking``; read only when ``_forward_stt_while_speaking`` is set.
+        self._current_agent_spoken_text = ""
+        # Whether the last completed turn was cut short by a confirmed barge-in
+        # — set by ``_process_streaming_response`` so the spoken prefix is
+        # appended to history with an ``[interrupted by caller]`` marker (keeps
+        # a stateful agent runtime's context grounded in what was actually heard).
+        self._last_response_interrupted = False
         # Per-turn LLM cancel event. Recreated on every new turn before LLM
         # consumption so a stale cancel from a previous turn cannot terminate
         # the next stream prematurely. Initialized here so the STT loop's
@@ -3253,6 +3308,11 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
                         interrupted = True
                         break
                     full_response_parts.append(token)
+                    # Keep the echo-guard reference current as the agent speaks,
+                    # so a barge-in transcript arriving mid-turn can be compared
+                    # against what the agent has said SO FAR (echo lags the
+                    # tokens, so this is already ahead of the bleed).
+                    self._current_agent_spoken_text = "".join(full_response_parts)
                     # Fix 5: record LLM first-token (TTFT).
                     if llm_first_token_sent[0] and self.metrics is not None:
                         self.metrics.record_llm_first_token()
@@ -3402,6 +3462,14 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
                 self.metrics.record_tts_complete(response_text)
                 turn = self.metrics.record_turn_complete(response_text)
                 await self._emit_turn_metrics(turn, call_id=call_id)
+        # Tell the call site (``_dispatch_turn``) whether this turn was cut short so
+        # the spoken prefix is recorded in history WITH a marker. A stateful
+        # agent runtime (Hermes/OpenClaw) then sees, on the next turn, that it
+        # was interrupted and what the caller actually heard — instead of an
+        # ungrounded full reply that pollutes its context.
+        self._last_response_interrupted = interrupted
+        if interrupted and response_text:
+            response_text = f"{response_text} [interrupted by caller]"
         return response_text
 
     async def _process_regular_response(self, response_text: str, call_id: str) -> None:
@@ -3474,6 +3542,21 @@ async def _handle_barge_in(self, transcript) -> None:
             # before the VAD speech_start rescue fires.
             await self._end_tail_grace_for_new_turn()
             return
+        # Echo guard: when audio is forwarded to STT during TTS (no effective
+        # AEC), the agent's own voice can be transcribed and would otherwise
+        # barge in on itself. Drop any transcript that looks like a fragment of
+        # what the agent is currently saying. Only active under
+        # ``_forward_stt_while_speaking`` (the only path that feeds TTS audio to
+        # STT), so the default VAD path is unaffected. Mirrors TS ``handleBargeIn``.
+        if getattr(self, "_forward_stt_while_speaking", False) and _looks_like_echo(
+            transcript.text, getattr(self, "_current_agent_spoken_text", "")
+        ):
+            logger.info(
+                "Barge-in suppressed: transcript matches agent's own speech "
+                "(echo) — %r",
+                sanitize_log_value(transcript.text[:40]),
+            )
+            return
         if not self._can_barge_in():
             aec_state = "on" if getattr(self, "_aec", None) is not None else "off"
             logger.info(
@@ -3638,8 +3721,10 @@ def _commit_transcript(self, text: str) -> bool:
 
         Mirrors TS ``commitTranscript``. Returns ``True`` if the transcript
         should be committed to a turn, ``False`` if it must be dropped.
-        Drop reasons: common hallucinations, duplicate within 2 s, or any
-        final within 500 ms of the previous one.
+        Drop reasons: common hallucinations, the agent's own TTS echo (when
+        forwarding audio to STT during TTS), exact duplicate within 2 s, or a
+        near-duplicate within 500 ms (the same utterance double-emitted) — a
+        genuinely different fast follow-up is NOT dropped.
         """
         now = time.time()
         normalised = text.strip().lower()
@@ -3649,6 +3734,19 @@ def _commit_transcript(self, text: str) -> bool:
         if stripped in _STT_HALLUCINATIONS or stripped == "":
             logger.debug("Dropped likely STT hallucination: %r", normalised[:40])
             return False
+        # Echo guard: while the agent is still speaking (the forward-STT echo
+        # window), a transcript that matches the agent's own speech is its TTS
+        # bleeding back into STT, not a user turn. Gated on
+        # ``_forward_stt_while_speaking`` + ``_is_speaking`` so a real post-turn
+        # reply (committed when the agent is idle) is never dropped, and the
+        # default VAD path — which withholds audio during TTS — is unaffected.
+        if (
+            getattr(self, "_forward_stt_while_speaking", False)
+            and getattr(self, "_is_speaking", False)
+            and _looks_like_echo(text, getattr(self, "_current_agent_spoken_text", ""))
+        ):
+            logger.debug("Dropped agent-echo transcript (not a user turn): %r", normalised[:40])
+            return False
         if since_last < 2.0 and normalised == self._last_commit_text:
             logger.debug(
                 "Dropped duplicate final transcript (%.1fs since last): %r",
@@ -3656,9 +3754,14 @@ def _commit_transcript(self, text: str) -> bool:
                 normalised[:40],
             )
             return False
-        if since_last < 0.5:
+        # Back-to-back: drop a NEAR-DUPLICATE within 0.5 s (Deepgram emitting
+        # ``speech_final`` then ``is_final`` for the SAME utterance). A
+        # genuinely DIFFERENT utterance arriving this fast (e.g. the real reply
+        # right after a suppressed phantom) must NOT be swallowed — dropping it
+        # unconditionally left an empty ``[interrupted]`` turn before this fix.
+        if since_last < 0.5 and _is_near_duplicate(normalised, self._last_commit_text):
             logger.debug(
-                "Dropped back-to-back final transcript (%.2fs since last): %r",
+                "Dropped back-to-back near-duplicate final (%.2fs since last): %r",
                 since_last,
                 normalised[:40],
             )
@@ -4271,6 +4374,9 @@ async def _begin_speaking(self, is_first_message: bool = False) -> None:
         # turn so we never replay yesterday's audio to STT.
         self._inbound_audio_ring = []
         self._suppressed_speech_pending = False
+        # Fresh turn — reset the echo-guard reference so this turn's barge-in
+        # checks compare against THIS turn's spoken text, not the last turn's.
+        self._current_agent_spoken_text = ""
         # Reset the VAD detector so the next user utterance triggers a clean
         # SILENCE→SPEECH transition. Without this, PSTN echo from the
         # previous turn can keep the smoothed probability above the
@@ -4295,10 +4401,11 @@ def _mark_first_audio_sent(self) -> None:
     def _can_barge_in(self) -> bool:
         """Whether barge-in is allowed to fire right now.
 
-        Gate length depends on whether AEC is active: 1 s with AEC
-        (covers filter warmup), 0.25 s without (anti-flicker only —
-        keeps PSTN barge-in responsive, since on PSTN AEC is a no-op
-        and there is no warmup to protect).
+        Gate length depends on whether AEC is active:
+        ``MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_AEC`` with AEC (covers filter
+        warmup), ``MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_NO_AEC`` (0.5 s) without
+        — an anti-flicker margin that keeps PSTN barge-in responsive while
+        rejecting the first burst of echo/noise before real speech.
 
         ``getattr`` is used so test fixtures that flip ``_is_speaking``
         directly (without going through ``_begin_speaking``) still
diff --git a/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py b/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
index 295e200..f362e0c 100644
--- a/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
+++ b/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
@@ -135,7 +135,11 @@ async def receive_transcripts(self) -> AsyncIterator[Transcript]:
         # Turn 1's LLM stream observed the cancel (it was torn down, not left
         # running until the next turn).
         assert cancel_seen.is_set()
-        assert handler._is_speaking is False
+        # The REAL follow-up "ferma per favore" (different text, arriving <0.5s
+        # after turn 1) is NOT swallowed by the back-to-back dedup — it
+        # dispatches as a fresh turn (calls >= 2). Before the dedup fix it was
+        # dropped, leaving an empty [interrupted] turn and no reply.
+        assert handler._llm_loop.calls >= 2
 
 
 @pytest.mark.unit
diff --git a/libraries/python/tests/unit/test_pipeline_dedup.py b/libraries/python/tests/unit/test_pipeline_dedup.py
index ba1bc25..a85f2fc 100644
--- a/libraries/python/tests/unit/test_pipeline_dedup.py
+++ b/libraries/python/tests/unit/test_pipeline_dedup.py
@@ -272,12 +272,16 @@ async def test_duplicate_normalises_whitespace_and_case(
 @pytest.mark.unit
 @pytest.mark.asyncio
 class TestThrottleFilter:
-    """Rule 3 — drop ANY final that lands within 500 ms of the last turn."""
+    """Rule 3 — drop a NEAR-DUPLICATE final within 500 ms (Deepgram emitting
+    ``speech_final`` then ``is_final`` for the same utterance). A genuinely
+    DIFFERENT fast follow-up must NOT be swallowed (the empty-[interrupted]-turn
+    fix)."""
 
-    async def test_drops_back_to_back_under_500ms(
+    async def test_drops_back_to_back_near_duplicate_under_500ms(
         self, monkeypatch: pytest.MonkeyPatch
     ) -> None:
-        # Different text but only 0.2 s apart — treated as STT over-firing.
+        # Same utterance double-emitted 0.2 s apart (the second a superset of the
+        # first) — real STT over-firing, still de-duplicated.
         times = iter([100.0, 100.2])
         monkeypatch.setattr(
             "getpatter.stream_handler.time.time",
@@ -286,7 +290,7 @@ async def test_drops_back_to_back_under_500ms(
         stt = _StubSTT(
             [
                 Transcript(text="What time is it", is_final=True, confidence=0.9),
-                Transcript(text="Tell me the weather", is_final=True, confidence=0.9),
+                Transcript(text="What time is it now", is_final=True, confidence=0.9),
             ]
         )
         on_transcript = AsyncMock()
@@ -294,9 +298,33 @@ async def test_drops_back_to_back_under_500ms(
 
         await _run_loop(handler)
 
-        # Only the first should have been forwarded.
+        # Only the first should have been forwarded (near-duplicate dropped).
         assert on_transcript.await_count == 1
 
+    async def test_keeps_different_followup_under_500ms(
+        self, monkeypatch: pytest.MonkeyPatch
+    ) -> None:
+        # Two genuinely DIFFERENT utterances 0.2 s apart are NOT over-firing —
+        # both must be kept (before the fix the second was silently dropped,
+        # leaving an empty [interrupted] turn).
+        times = iter([100.0, 100.2])
+        monkeypatch.setattr(
+            "getpatter.stream_handler.time.time",
+            lambda: next(times),
+        )
+        stt = _StubSTT(
+            [
+                Transcript(text="What time is it", is_final=True, confidence=0.9),
+                Transcript(text="Tell me the weather", is_final=True, confidence=0.9),
+            ]
+        )
+        on_transcript = AsyncMock()
+        handler = _make_handler(stt, on_transcript)
+
+        await _run_loop(handler)
+
+        assert on_transcript.await_count == 2
+
     async def test_passes_after_500ms(self, monkeypatch: pytest.MonkeyPatch) -> None:
         # Different text, 700 ms apart — legitimate second turn.
         times = iter([100.0, 100.7])
diff --git a/libraries/python/tests/unit/test_pipeline_echo_dedup.py b/libraries/python/tests/unit/test_pipeline_echo_dedup.py
new file mode 100644
index 0000000..248d9d3
--- /dev/null
+++ b/libraries/python/tests/unit/test_pipeline_echo_dedup.py
@@ -0,0 +1,218 @@
+"""Echo-guard, back-to-back dedup, and interrupted-turn marking for the
+pipeline turn-taking path — the residual Hermes/OpenClaw barge-in fixes.
+
+Root causes (live Hermes test, PATTER_FORWARD_STT_WHILE_SPEAKING=1, no AEC):
+* the agent's own TTS bled into Deepgram and was transcribed as a phantom
+  ("che tu l'hai"), firing a false barge-in (legacy "any transcript = cancel");
+* the real follow-up final arriving <0.5s later was dropped by the back-to-back
+  filter even though its text was completely different → empty [interrupted] turn;
+* the interrupted assistant turn was stored ungrounded, poisoning the next turn.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import time
+from collections import deque
+from unittest.mock import AsyncMock, MagicMock
+
+import pytest
+
+from getpatter.providers.base import Transcript
+from getpatter.stream_handler import (
+    PipelineStreamHandler,
+    _is_near_duplicate,
+    _looks_like_echo,
+    _normalize_for_echo,
+)
+
+from tests.conftest import make_agent
+
+
+def _make_handler() -> PipelineStreamHandler:
+    handler = PipelineStreamHandler(
+        agent=make_agent(),
+        audio_sender=AsyncMock(),
+        call_id="call-echo",
+        caller="+15551110000",
+        callee="+15552220000",
+        resolved_prompt="p",
+        metrics=MagicMock(),
+        for_twilio=True,
+        on_transcript=None,
+        conversation_history=deque(maxlen=20),
+        transcript_entries=deque(maxlen=20),
+    )
+    handler.on_message = None
+    handler._stt = AsyncMock()
+    return handler
+
+
+# ---------------------------------------------------------------------------
+# Pure helpers
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.unit
+class TestEchoHelpers:
+    def test_normalize_strips_punct_and_case(self) -> None:
+        assert _normalize_for_echo("Ciao, come VA?!") == "ciao come va"
+
+    def test_substring_fragment_is_echo(self) -> None:
+        agent = "Certo, ti racconto una storia molto lunga sul mare"
+        assert _looks_like_echo("una storia molto", agent) is True
+
+    def test_high_word_overlap_is_echo(self) -> None:
+        agent = "che tu lo voglia o no, te l'ho già detto"
+        # garbled echo fragment whose words are mostly in the agent text
+        assert _looks_like_echo("che tu l'hai", agent) is True
+
+    def test_unrelated_user_speech_is_not_echo(self) -> None:
+        agent = "Sto bene grazie, sono pronto ad aiutarti col tuo problema"
+        assert _looks_like_echo("fermati dimmi solo interrotto", agent) is False
+
+    def test_empty_inputs_not_echo(self) -> None:
+        assert _looks_like_echo("", "qualcosa") is False
+        assert _looks_like_echo("qualcosa", "") is False
+
+    def test_near_duplicate_substring_and_exact(self) -> None:
+        assert _is_near_duplicate("ciao come va", "ciao come va") is True
+        assert _is_near_duplicate("ciao come", "ciao come va") is True  # prefix
+        assert _is_near_duplicate("ciao come va bene", "ciao come va") is True
+        assert _is_near_duplicate("fermati subito", "dimmi una storia") is False
+
+
+# ---------------------------------------------------------------------------
+# _commit_transcript
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.unit
+class TestCommitTranscriptEchoAndDedup:
+    def test_echo_dropped_while_speaking_with_forward_flag(self) -> None:
+        h = _make_handler()
+        h._forward_stt_while_speaking = True
+        h._is_speaking = True
+        h._current_agent_spoken_text = "ti racconto una storia lunga sul mare"
+        assert h._commit_transcript("una storia lunga") is False
+
+    def test_echo_not_dropped_when_flag_off(self) -> None:
+        h = _make_handler()
+        h._forward_stt_while_speaking = False  # default
+        h._is_speaking = True
+        h._current_agent_spoken_text = "ti racconto una storia lunga sul mare"
+        # Flag off → echo guard inert → normal commit (real user could legitimately
+        # echo words; we only filter under the forward-STT echo-prone config).
+        assert h._commit_transcript("una storia lunga") is True
+
+    def test_echo_not_dropped_when_idle(self) -> None:
+        h = _make_handler()
+        h._forward_stt_while_speaking = True
+        h._is_speaking = False  # post-turn user reply, not an echo window
+        h._current_agent_spoken_text = "ti racconto una storia lunga sul mare"
+        assert h._commit_transcript("una storia lunga") is True
+
+    def test_different_followup_within_500ms_not_dropped(self) -> None:
+        h = _make_handler()
+        h._last_commit_text = "dimmi una storia"
+        h._last_commit_at = time.time()  # just now
+        # A genuinely different utterance arriving <0.5s later must survive
+        # (the empty-[interrupted]-turn fix).
+        assert h._commit_transcript("fermati dimmi solo interrotto") is True
+
+    def test_near_duplicate_within_500ms_dropped(self) -> None:
+        h = _make_handler()
+        h._last_commit_text = "fermati dimmi solo"
+        h._last_commit_at = time.time()
+        # Deepgram speech_final then is_final for the same utterance (a superset)
+        # is still de-duplicated.
+        assert h._commit_transcript("fermati dimmi solo interrotto") is False
+
+
+# ---------------------------------------------------------------------------
+# _handle_barge_in echo guard
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestHandleBargeInEchoGuard:
+    async def test_echo_transcript_does_not_barge_in(self) -> None:
+        h = _make_handler()
+        h._forward_stt_while_speaking = True
+        h._is_speaking = True
+        h._tail_grace_active = False
+        h._can_barge_in = lambda: True  # type: ignore[assignment]
+        h._current_agent_spoken_text = "ti racconto una storia lunga sul mare aperto"
+
+        await h._handle_barge_in(Transcript(text="una storia lunga", is_final=True, confidence=0.9))
+
+        # No cancel: the agent's own echo must not interrupt it.
+        h.audio_sender.send_clear.assert_not_awaited()
+        assert h._is_speaking is True
+
+    async def test_real_speech_still_barges_in(self) -> None:
+        h = _make_handler()
+        h._forward_stt_while_speaking = True
+        h._is_speaking = True
+        h._tail_grace_active = False
+        h._can_barge_in = lambda: True  # type: ignore[assignment]
+        h._current_agent_spoken_text = "ti racconto una storia lunga sul mare aperto"
+
+        await h._handle_barge_in(
+            Transcript(text="fermati dimmi solo interrotto", is_final=True, confidence=0.9)
+        )
+
+        h.audio_sender.send_clear.assert_awaited()
+        assert h._is_speaking is False
+
+
+# ---------------------------------------------------------------------------
+# Interrupted-turn marking
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestInterruptedTurnMarking:
+    async def test_interrupted_response_gets_marker(self) -> None:
+        h = _make_handler()
+
+        class _FakeTTS:
+            output_format = "pcm_16000"
+
+            async def synthesize(self, text: str):
+                yield b"\x00\x00" * 80
+
+        h._tts = _FakeTTS()  # type: ignore[assignment]
+
+        async def _result():
+            yield "Ti racconto. "
+            # Simulate a barge-in cancelling the stream mid-turn.
+            h._llm_cancel_event.set()
+            yield "Questo non si sente."
+
+        text = await h._process_streaming_response(_result(), "call-echo")
+
+        assert h._last_response_interrupted is True
+        assert text.endswith("[interrupted by caller]")
+        assert "Ti racconto." in text
+
+    async def test_complete_response_no_marker(self) -> None:
+        h = _make_handler()
+
+        class _FakeTTS:
+            output_format = "pcm_16000"
+
+            async def synthesize(self, text: str):
+                yield b"\x00\x00" * 80
+
+        h._tts = _FakeTTS()  # type: ignore[assignment]
+
+        async def _result():
+            yield "Tutto bene, grazie. "
+
+        text = await h._process_streaming_response(_result(), "call-echo")
+
+        assert h._last_response_interrupted is False
+        assert "[interrupted by caller]" not in text
diff --git a/libraries/typescript/src/stream-handler.ts b/libraries/typescript/src/stream-handler.ts
index cbacbf3..9338883 100644
--- a/libraries/typescript/src/stream-handler.ts
+++ b/libraries/typescript/src/stream-handler.ts
@@ -342,6 +342,45 @@ export function isSttHallucination(text: string): boolean {
   return pieces.length > 1 && pieces.every((p) => HALLUCINATIONS.has(p));
 }
 
+/** Fraction of a candidate's words that must appear in the agent's spoken text
+ * for it to count as the agent's own TTS echoing back. Mirrors Python
+ * ``_ECHO_WORD_OVERLAP_THRESHOLD``. */
+const ECHO_WORD_OVERLAP_THRESHOLD = 0.6;
+
+/** Lowercase, drop punctuation, collapse whitespace — for echo comparison. */
+export function normalizeForEcho(text: string): string {
+  return text
+    .toLowerCase()
+    .replace(/[^\p{L}\p{N}\s]/gu, ' ')
+    // Collapse whitespace runs, then strip the ends.
+    .replace(/\s+/gu, ' ')
+    .trim();
+}
+
+/** True when ``candidate`` looks like a fragment of ``agentText`` — i.e. the
+ * agent's own TTS bleeding into STT (forwarded during TTS without effective
+ * AEC) rather than real caller speech. Substring OR high word-overlap.
+ * Mirrors Python ``_looks_like_echo``. */
+export function looksLikeEcho(candidate: string, agentText: string): boolean {
+  const a = normalizeForEcho(agentText);
+  const c = normalizeForEcho(candidate);
+  if (!a || !c) return false;
+  if (a.includes(c)) return true;
+  const words = c.split(' ').filter(Boolean);
+  if (words.length === 0) return false;
+  const agentWords = new Set(a.split(' '));
+  const overlap = words.filter((w) => agentWords.has(w)).length / words.length;
+  return overlap >= ECHO_WORD_OVERLAP_THRESHOLD;
+}
+
+/** True when two normalised finals are the same utterance double-emitted
+ * (identical, or one a substring of the other). Mirrors Python
+ * ``_is_near_duplicate``. */
+export function isNearDuplicate(a: string, b: string): boolean {
+  if (!a || !b) return false;
+  return a === b || a.includes(b) || b.includes(a);
+}
+
 // ---------------------------------------------------------------------------
 // StreamHandler context (immutable per-call configuration)
 // ---------------------------------------------------------------------------
@@ -650,6 +689,9 @@ export class StreamHandler {
     // Fresh turn — drop any stale pre-barge-in buffer from a previous turn
     // so we never replay yesterday's audio to STT.
     this.inboundAudioRing = [];
+    // Fresh turn — reset the echo-guard reference so barge-in checks compare
+    // against THIS turn's spoken text, not the last turn's.
+    this.currentAgentSpokenText = '';
     // Reset the VAD detector so the next user utterance triggers a clean
     // SILENCE→SPEECH transition. Without this, PSTN echo from the previous
     // turn can keep the detector's smoothed probability above the
@@ -1052,6 +1094,12 @@ export class StreamHandler {
   // Throttle state for back-to-back STT finals — see ``commitTranscript``.
   private lastCommitText = '';
   private lastCommitAt = 0;
+  /** The agent's spoken text for the CURRENT turn, accumulated as tokens stream.
+   * The echo guard rejects transcripts matching it (the agent's own TTS bleeding
+   * back into STT when audio is forwarded during TTS without effective AEC).
+   * Reset in ``beginSpeaking``; read only when ``forwardSttWhileSpeaking`` is set.
+   * Parity with Python ``_current_agent_spoken_text``. */
+  private currentAgentSpokenText = '';
   // PCM16 byte-alignment carry for TTS streaming (pipeline mode).
   // HTTP streams from ElevenLabs / OpenAI / Cartesia can yield chunks of any
   // size, including odd byte counts. Silently dropping the trailing odd byte
@@ -2715,6 +2763,21 @@ export class StreamHandler {
       this.endTailGraceForNewTurn();
       return false;
     }
+    // Echo guard: when audio is forwarded to STT during TTS (no effective AEC),
+    // the agent's own voice can be transcribed and would barge in on itself.
+    // Drop transcripts that look like a fragment of what the agent is saying.
+    // Only under forwardSttWhileSpeaking, so the default VAD path is unaffected.
+    if (
+      this.forwardSttWhileSpeaking &&
+      looksLikeEcho(transcript.text, this.currentAgentSpokenText)
+    ) {
+      getLogger().info(
+        `Barge-in suppressed: transcript matches agent's own speech (echo) — ${sanitizeLogValue(
+          transcript.text.slice(0, 40),
+        )}`,
+      );
+      return false;
+    }
     if (!this.canBargeIn()) {
       getLogger().info(
         `Barge-in transcript suppressed (agent speaking < gate, aec=${this.aec ? 'on' : 'off'})`,
@@ -2763,6 +2826,19 @@ export class StreamHandler {
       this.endTailGraceForNewTurn();
       return false;
     }
+    // Echo guard (parity with handleBargeInAsync) — never let the agent's own
+    // forwarded TTS echo barge in on itself.
+    if (
+      this.forwardSttWhileSpeaking &&
+      looksLikeEcho(transcript.text, this.currentAgentSpokenText)
+    ) {
+      getLogger().info(
+        `Barge-in suppressed: transcript matches agent's own speech (echo) — ${sanitizeLogValue(
+          transcript.text.slice(0, 40),
+        )}`,
+      );
+      return false;
+    }
     if (this.bargeInStrategies.length === 0) {
       // Legacy synchronous path — preserve exact byte-for-byte behaviour
       // for users who haven't opted into the confirm pipeline.
@@ -2876,15 +2952,34 @@ export class StreamHandler {
       getLogger().debug(`Dropped likely STT hallucination: ${sanitizeLogValue(normalised.slice(0, 40))}`);
       return false;
     }
+    // Echo guard: while the agent is still speaking (the forward-STT echo
+    // window), a transcript that matches the agent's own speech is its TTS
+    // bleeding back into STT, not a user turn. Gated on forwardSttWhileSpeaking
+    // + isSpeaking so a real post-turn reply (committed when idle) is never
+    // dropped, and the default VAD path is unaffected. Parity with Python.
+    if (
+      this.forwardSttWhileSpeaking &&
+      this.isSpeaking &&
+      looksLikeEcho(text, this.currentAgentSpokenText)
+    ) {
+      getLogger().debug(
+        `Dropped agent-echo transcript (not a user turn): ${sanitizeLogValue(normalised.slice(0, 40))}`,
+      );
+      return false;
+    }
     if (sinceLastMs < 2000 && normalised === this.lastCommitText) {
       getLogger().debug(
         `Dropped duplicate final transcript (${(sinceLastMs / 1000).toFixed(1)}s since last): ${sanitizeLogValue(normalised.slice(0, 40))}`,
       );
       return false;
     }
-    if (sinceLastMs < 500) {
+    // Back-to-back: drop a NEAR-DUPLICATE within 500 ms (Deepgram emitting
+    // speech_final then is_final for the SAME utterance). A genuinely DIFFERENT
+    // fast follow-up must NOT be swallowed — dropping it unconditionally left
+    // an empty [interrupted] turn before this fix. Parity with Python.
+    if (sinceLastMs < 500 && isNearDuplicate(normalised, this.lastCommitText)) {
       getLogger().debug(
-        `Dropped back-to-back final transcript (${(sinceLastMs / 1000).toFixed(2)}s since last): ${sanitizeLogValue(normalised.slice(0, 40))}`,
+        `Dropped back-to-back near-duplicate final (${(sinceLastMs / 1000).toFixed(2)}s since last): ${sanitizeLogValue(normalised.slice(0, 40))}`,
       );
       return false;
     }
@@ -3037,6 +3132,10 @@ export class StreamHandler {
           // Idempotent in the dispatcher.
           await this.emitLlmFirstToken();
           allParts.push(token);
+          // Keep the echo-guard reference current as the agent speaks, so a
+          // barge-in transcript mid-turn is compared against what the agent has
+          // said so far (echo lags the tokens). Parity with Python.
+          this.currentAgentSpokenText = allParts.join('');
           for (const sentence of chunker.push(token)) {
             if (!this.isSpeaking) break;
             await guardAndSpeak(sentence, !firstSentenceEmitted);
@@ -3106,7 +3205,16 @@ export class StreamHandler {
         // Swallow — span teardown should never crash the call path.
       }
     }
-    return allParts.join('');
+    const responseText = allParts.join('');
+    // Tag the spoken prefix with an ``[interrupted by caller]`` marker when the
+    // turn was cut short, so a stateful agent runtime (Hermes/OpenClaw) sees,
+    // next turn, that it was interrupted and what the caller actually heard —
+    // not an ungrounded full reply that pollutes its context. Parity with
+    // Python ``_process_streaming_response``.
+    if (llmSignal.aborted && responseText) {
+      return `${responseText} [interrupted by caller]`;
+    }
+    return responseText;
   }
 
   /**
diff --git a/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts b/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
index 91e0476..40ed429 100644
--- a/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
+++ b/libraries/typescript/tests/pipeline-bargein-backgrounded.mocked.test.ts
@@ -73,9 +73,11 @@ function makeTwilioBridge(mockStt: ReturnType<typeof makeMockStt>): TelephonyBri
   } as unknown as TelephonyBridge;
 }
 
-/** Provider that parks until its turn is aborted — models a long Hermes turn
- * (tools running before the first token) that only ends on barge-in. */
+/** Provider whose FIRST turn parks until aborted (models a long Hermes turn
+ * that only ends on barge-in); any later turn replies quickly so the real
+ * follow-up — no longer swallowed by the back-to-back dedup — can complete. */
 function makeParkUntilAbortProvider(aborted: { value: boolean }): LLMProvider {
+  let calls = 0;
   return {
     model: 'agent-runtime-1',
     async *stream(
@@ -83,6 +85,11 @@ function makeParkUntilAbortProvider(aborted: { value: boolean }): LLMProvider {
       _tools?: Array<Record<string, unknown>> | null,
       opts?: LLMStreamOptions,
     ): AsyncGenerator<LLMChunk, void, unknown> {
+      calls += 1;
+      if (calls > 1) {
+        yield { type: 'text', content: 'va bene. ' };
+        return;
+      }
       const signal = opts?.signal;
       await new Promise<void>((resolve) => {
         if (signal?.aborted) return resolve();
@@ -176,15 +183,16 @@ describe('[mocked] pipeline backgrounded-dispatch barge-in', () => {
     await stt.emitTranscript('ferma per favore');
 
     // The in-flight turn was cancelled: the carrier buffer was cleared and the
-    // LLM stream's abort signal fired (turn torn down pre-first-token).
+    // LLM stream's abort signal fired (turn 1 torn down pre-first-token).
     await vi.waitFor(
       () => expect(bridge.sendClear as ReturnType<typeof vi.fn>).toHaveBeenCalled(),
       { timeout: 3000 },
     );
-    expect((handler as unknown as { isSpeaking: boolean }).isSpeaking).toBe(false);
     await vi.waitFor(() => expect(aborted.value).toBe(true), { timeout: 3000 });
 
-    // Settle the backgrounded dispatch before teardown.
+    // Settle the backgrounded dispatch before teardown. The real follow-up
+    // "ferma per favore" (different text, <0.5s) is NOT swallowed by the
+    // back-to-back dedup — it dispatches as turn 2 and replies.
     await (handler as unknown as { dispatchTask: Promise<void> | null }).dispatchTask?.catch(
       () => {},
     );
diff --git a/libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts b/libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts
new file mode 100644
index 0000000..b36fbc1
--- /dev/null
+++ b/libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts
@@ -0,0 +1,119 @@
+/**
+ * [mocked] Echo guard + back-to-back dedup for the pipeline turn-taking path —
+ * parity with Python test_pipeline_echo_dedup.py. Stops the agent's own TTS
+ * echo (forwarded to STT during TTS without AEC) from firing a phantom barge-in
+ * or becoming a user turn, and keeps a genuinely different fast follow-up from
+ * being swallowed by the back-to-back filter.
+ */
+import { describe, it, expect, vi } from 'vitest';
+import {
+  StreamHandler,
+  looksLikeEcho,
+  normalizeForEcho,
+  isNearDuplicate,
+} from '../src/stream-handler';
+import type { TelephonyBridge, StreamHandlerDeps } from '../src/stream-handler';
+import { MetricsStore } from '../src/dashboard/store';
+import { RemoteMessageHandler } from '../src/remote-message';
+import type { AgentOptions } from '../src/types';
+import type { WebSocket as WSWebSocket } from 'ws';
+
+function makeMockWs(): WSWebSocket {
+  return {
+    send: vi.fn(), close: vi.fn(), on: vi.fn(), once: vi.fn(), readyState: 1,
+    removeListener: vi.fn(), addEventListener: vi.fn(), removeEventListener: vi.fn(),
+  } as unknown as WSWebSocket;
+}
+function makeBridge(): TelephonyBridge {
+  return {
+    label: 'Twilio', telephonyProvider: 'twilio',
+    sendAudio: vi.fn(), sendMark: vi.fn(), sendClear: vi.fn(),
+    transferCall: vi.fn().mockResolvedValue(undefined),
+    endCall: vi.fn().mockResolvedValue(undefined),
+    createStt: vi.fn().mockReturnValue(null),
+    queryTelephonyCost: vi.fn().mockResolvedValue(undefined),
+  } as unknown as TelephonyBridge;
+}
+function makeDeps(): StreamHandlerDeps {
+  const agent: AgentOptions = {
+    systemPrompt: 'test', provider: 'pipeline', model: 'gpt-4o-mini', voice: 'alloy',
+  };
+  return {
+    config: {}, agent, bridge: makeBridge(), metricsStore: new MetricsStore(),
+    pricing: null, remoteHandler: new RemoteMessageHandler(), recording: false,
+    buildAIAdapter: vi.fn().mockReturnValue(null),
+    sanitizeVariables: vi.fn((r: Record<string, unknown>) => r),
+    resolveVariables: vi.fn((t: string) => t),
+  } as unknown as StreamHandlerDeps;
+}
+interface CommitHandle {
+  forwardSttWhileSpeaking: boolean;
+  isSpeaking: boolean;
+  currentAgentSpokenText: string;
+  lastCommitText: string;
+  lastCommitAt: number;
+  commitTranscript(text: string): boolean;
+}
+function makeHandler(): CommitHandle {
+  return new StreamHandler(makeDeps(), makeMockWs(), '+1', '+2') as unknown as CommitHandle;
+}
+
+describe('[mocked] echo + dedup helpers', () => {
+  it('normalizeForEcho strips punctuation and case', () => {
+    expect(normalizeForEcho('Ciao, come VA?!')).toBe('ciao come va');
+  });
+  it('looksLikeEcho: substring fragment is echo', () => {
+    expect(looksLikeEcho('una storia molto', 'Certo, ti racconto una storia molto lunga')).toBe(true);
+  });
+  it('looksLikeEcho: high word overlap is echo', () => {
+    expect(looksLikeEcho("che tu l'hai", "che tu lo voglia o no, te l'ho già detto")).toBe(true);
+  });
+  it('looksLikeEcho: unrelated user speech is not echo', () => {
+    expect(looksLikeEcho('fermati dimmi solo interrotto', 'Sto bene grazie sono pronto ad aiutarti')).toBe(false);
+  });
+  it('looksLikeEcho: empty inputs are not echo', () => {
+    expect(looksLikeEcho('', 'qualcosa')).toBe(false);
+    expect(looksLikeEcho('qualcosa', '')).toBe(false);
+  });
+  it('isNearDuplicate: exact and substring', () => {
+    expect(isNearDuplicate('ciao come va', 'ciao come va')).toBe(true);
+    expect(isNearDuplicate('ciao come', 'ciao come va')).toBe(true);
+    expect(isNearDuplicate('fermati subito', 'dimmi una storia')).toBe(false);
+  });
+});
+
+describe('[mocked] commitTranscript echo + dedup', () => {
+  it('drops echo while speaking with the forward flag', () => {
+    const h = makeHandler();
+    h.forwardSttWhileSpeaking = true;
+    h.isSpeaking = true;
+    h.currentAgentSpokenText = 'ti racconto una storia lunga sul mare';
+    expect(h.commitTranscript('una storia lunga')).toBe(false);
+  });
+  it('does NOT drop echo when the flag is off (default)', () => {
+    const h = makeHandler();
+    h.forwardSttWhileSpeaking = false;
+    h.isSpeaking = true;
+    h.currentAgentSpokenText = 'ti racconto una storia lunga sul mare';
+    expect(h.commitTranscript('una storia lunga')).toBe(true);
+  });
+  it('does NOT drop when idle (post-turn user reply)', () => {
+    const h = makeHandler();
+    h.forwardSttWhileSpeaking = true;
+    h.isSpeaking = false;
+    h.currentAgentSpokenText = 'ti racconto una storia lunga sul mare';
+    expect(h.commitTranscript('una storia lunga')).toBe(true);
+  });
+  it('keeps a different follow-up within 500ms (empty-[interrupted]-turn fix)', () => {
+    const h = makeHandler();
+    h.lastCommitText = 'dimmi una storia';
+    h.lastCommitAt = Date.now();
+    expect(h.commitTranscript('fermati dimmi solo interrotto')).toBe(true);
+  });
+  it('drops a near-duplicate within 500ms (Deepgram double-final)', () => {
+    const h = makeHandler();
+    h.lastCommitText = 'fermati dimmi solo';
+    h.lastCommitAt = Date.now();
+    expect(h.commitTranscript('fermati dimmi solo interrotto')).toBe(false);
+  });
+});

From 2d88502285f2ada5f7a72b242df4ba2a478a86aa Mon Sep 17 00:00:00 2001
From: nicolotognoni <nicolo.tognoni1@gmail.com>
Date: Sun, 7 Jun 2026 12:57:38 +0200
Subject: [PATCH 06/11] =?UTF-8?q?fix(pipeline):=20harden=20echo-safe=20bar?=
 =?UTF-8?q?ge-in=20=E2=80=94=20no=20false-positives=20on=20short=20replies?=
 =?UTF-8?q?,=20word-boundary=20dedup,=20clean=20interrupted=20metrics?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adversarial review of the echo-safe barge-in commit found three HIGH-severity
issues (two false-positive risks and one metrics leak); all fixed with full
Python/TS parity:

1. (HIGH) Echo guard could silently drop a legitimate SHORT caller answer that
   repeats the agent's offered words (e.g. agent "lunedì o martedì?", caller
   "lunedì" → substring match → dropped, caller goes unheard). Real TTS echo is
   a long near-complete fragment, not a 1-3 word reply. The echo guard now
   requires >= _ECHO_MIN_CANDIDATE_WORDS (4) words before classifying a
   candidate as echo, so short answers are never dropped. (Short echo blips on a
   no-AEC link are left to AEC / barge_in_strategies.)

2. (HIGH) Back-to-back dedup used a character-level substring test, so a
   genuinely different short follow-up was dropped ("no" matched inside
   "nothing else") — and this ran on the DEFAULT path (not gated on the echo
   flag), affecting all pipeline users. _is_near_duplicate / isNearDuplicate is
   now word-boundary aware (equal, or a true word-prefix double-emit), so
   "nothing else" is no longer a duplicate of "no" while Deepgram's
   speech_final+is_final pair still de-duplicates.

3. (HIGH, TS) The interrupted-turn "[interrupted by caller]" marker leaked into
   metrics: runPipelineLlm returned the marked text and dispatchTurn fed it to
   recordTtsComplete/recordTurnComplete. runPipelineLlm now returns
   { text, interrupted }; dispatchTurn records metrics on the PLAIN text (gated
   on !interrupted) and applies the marker to the history/transcript only —
   mirroring Python, where metrics are recorded before the marker is appended.

Tests updated to the corrected behaviour (>=4-word echo examples + explicit
short-answer-exemption + word-boundary dedup cases). Python 2237 / TypeScript
1779 pass; tsc + build clean.
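
For reviewers, a condensed sketch of the behaviour in items 1 and 2. It restates
the two helpers changed in this patch under simplified names (the real ones are
_looks_like_echo / _is_near_duplicate) and assumes both inputs are already
normalised (lowercase, punctuation stripped), which the real helpers do
themselves:

```python
ECHO_MIN_CANDIDATE_WORDS = 4
ECHO_WORD_OVERLAP_THRESHOLD = 0.6


def looks_like_echo(candidate: str, agent_text: str) -> bool:
    """Sketch only: inputs are assumed to be pre-normalised strings."""
    if not candidate or not agent_text:
        return False
    words = candidate.split()
    # Short replies (1-3 words) are never echo, even if they repeat the agent.
    if len(words) < ECHO_MIN_CANDIDATE_WORDS:
        return False
    # A long verbatim fragment of the agent's speech is echo.
    if candidate in agent_text:
        return True
    # Otherwise fall back to the share of candidate words the agent also said.
    agent_words = set(agent_text.split())
    overlap = sum(1 for w in words if w in agent_words) / len(words)
    return overlap >= ECHO_WORD_OVERLAP_THRESHOLD


def is_near_duplicate(a: str, b: str) -> bool:
    """True for identical finals or a word-prefix double-emit."""
    if not a or not b:
        return False
    if a == b:
        return True
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    # The trailing space enforces a word boundary after the shared prefix.
    return longer.startswith(shorter + " ")
```

With these definitions "no" is not a duplicate of "nothing else", "ciao come"
is still a duplicate of "ciao come va", and a one-word "lunedì" is never
classified as echo.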
---
 libraries/python/getpatter/stream_handler.py  | 27 ++++++---
 .../tests/unit/test_pipeline_echo_dedup.py    | 29 +++++++---
 libraries/typescript/src/stream-handler.ts    | 57 +++++++++++++------
 .../tests/pipeline-echo-dedup.mocked.test.ts  | 31 +++++++---
 4 files changed, 103 insertions(+), 41 deletions(-)

diff --git a/libraries/python/getpatter/stream_handler.py b/libraries/python/getpatter/stream_handler.py
index 5cf0729..3d322ed 100644
--- a/libraries/python/getpatter/stream_handler.py
+++ b/libraries/python/getpatter/stream_handler.py
@@ -137,6 +137,12 @@
 # share a couple of words while catching garbled echo fragments. Language-
 # agnostic — unlike the English-only ``_STT_HALLUCINATIONS`` set.
 _ECHO_WORD_OVERLAP_THRESHOLD = 0.6
+# Minimum word count before a candidate can be classified as echo. Real TTS
+# bleed is a long, near-complete fragment of the agent's speech; a 1-3 word
+# caller reply that happens to repeat the agent's offered words ("lunedì",
+# "yes", "Monday at two") is a legitimate answer and must NEVER be dropped.
+# Short echo blips on a no-AEC link are left to AEC / barge_in_strategies.
+_ECHO_MIN_CANDIDATE_WORDS = 4
 
 
 def _normalize_for_echo(text: str) -> str:
@@ -156,11 +162,13 @@ def _looks_like_echo(candidate: str, agent_text: str) -> bool:
     c = _normalize_for_echo(candidate)
     if not a or not c:
         return False
-    if c in a:  # candidate is verbatim a fragment of what the agent said
-        return True
     words = c.split()
-    if not words:
+    # Never classify a short reply as echo — exempts single-word / few-word
+    # caller answers that legitimately repeat the agent's offered words.
+    if len(words) < _ECHO_MIN_CANDIDATE_WORDS:
         return False
+    if c in a:  # candidate is verbatim a long fragment of what the agent said
+        return True
     agent_words = set(a.split())
     overlap = sum(1 for w in words if w in agent_words) / len(words)
     return overlap >= _ECHO_WORD_OVERLAP_THRESHOLD
@@ -168,12 +176,17 @@ def _looks_like_echo(candidate: str, agent_text: str) -> bool:
 
 def _is_near_duplicate(a: str, b: str) -> bool:
     """True when two normalised finals are the same utterance double-emitted
-    (identical, or one a substring of the other) — used to drop Deepgram's
-    ``speech_final``+``is_final`` back-to-back pair WITHOUT swallowing a
-    genuinely different utterance that merely arrives quickly."""
+    (identical, or one a WORD-PREFIX of the other — Deepgram's
+    ``speech_final``+``is_final`` pair) — used to drop the back-to-back pair
+    WITHOUT swallowing a genuinely different utterance that merely arrives
+    quickly. Word-boundary aware so a character infix ("no" in "nothing
+    else") is NOT treated as a duplicate."""
     if not a or not b:
         return False
-    return a == b or a in b or b in a
+    if a == b:
+        return True
+    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
+    return longer.startswith(shorter + " ")
 
 
 def _is_stt_hallucination(text: str) -> bool:
diff --git a/libraries/python/tests/unit/test_pipeline_echo_dedup.py b/libraries/python/tests/unit/test_pipeline_echo_dedup.py
index 248d9d3..d7992b5 100644
--- a/libraries/python/tests/unit/test_pipeline_echo_dedup.py
+++ b/libraries/python/tests/unit/test_pipeline_echo_dedup.py
@@ -59,13 +59,22 @@ def test_normalize_strips_punct_and_case(self) -> None:
         assert _normalize_for_echo("Ciao, come VA?!") == "ciao come va"
 
     def test_substring_fragment_is_echo(self) -> None:
-        agent = "Certo, ti racconto una storia molto lunga sul mare"
-        assert _looks_like_echo("una storia molto", agent) is True
+        agent = "Certo, ti racconto una storia molto lunga sul mare aperto"
+        # A long (>=4 word) verbatim fragment of the agent's speech is echo.
+        assert _looks_like_echo("ti racconto una storia molto", agent) is True
 
     def test_high_word_overlap_is_echo(self) -> None:
-        agent = "che tu lo voglia o no, te l'ho già detto"
-        # garbled echo fragment whose words are mostly in the agent text
-        assert _looks_like_echo("che tu l'hai", agent) is True
+        agent = "che tu lo voglia o no, te l'ho già detto chiaramente"
+        # garbled >=4-word echo fragment whose words are mostly in the agent text
+        assert _looks_like_echo("che tu lo voglia detto", agent) is True
+
+    def test_short_answer_repeating_agent_is_not_echo(self) -> None:
+        # The key false-positive guard: a 1-3 word caller answer that picks one
+        # of the agent's offered words must NEVER be classified as echo.
+        agent = "preferisci lunedì o martedì per l'appuntamento"
+        assert _looks_like_echo("lunedì", agent) is False
+        assert _looks_like_echo("monday at two", agent) is False
+        assert _looks_like_echo("sì va bene", agent) is False
 
     def test_unrelated_user_speech_is_not_echo(self) -> None:
         agent = "Sto bene grazie, sono pronto ad aiutarti col tuo problema"
@@ -94,7 +103,7 @@ def test_echo_dropped_while_speaking_with_forward_flag(self) -> None:
         h._forward_stt_while_speaking = True
         h._is_speaking = True
         h._current_agent_spoken_text = "ti racconto una storia lunga sul mare"
-        assert h._commit_transcript("una storia lunga") is False
+        assert h._commit_transcript("ti racconto una storia lunga") is False
 
     def test_echo_not_dropped_when_flag_off(self) -> None:
         h = _make_handler()
@@ -103,14 +112,14 @@ def test_echo_not_dropped_when_flag_off(self) -> None:
         h._current_agent_spoken_text = "ti racconto una storia lunga sul mare"
         # Flag off → echo guard inert → normal commit (real user could legitimately
         # echo words; we only filter under the forward-STT echo-prone config).
-        assert h._commit_transcript("una storia lunga") is True
+        assert h._commit_transcript("ti racconto una storia lunga") is True
 
     def test_echo_not_dropped_when_idle(self) -> None:
         h = _make_handler()
         h._forward_stt_while_speaking = True
         h._is_speaking = False  # post-turn user reply, not an echo window
         h._current_agent_spoken_text = "ti racconto una storia lunga sul mare"
-        assert h._commit_transcript("una storia lunga") is True
+        assert h._commit_transcript("ti racconto una storia lunga") is True
 
     def test_different_followup_within_500ms_not_dropped(self) -> None:
         h = _make_handler()
@@ -145,7 +154,9 @@ async def test_echo_transcript_does_not_barge_in(self) -> None:
         h._can_barge_in = lambda: True  # type: ignore[assignment]
         h._current_agent_spoken_text = "ti racconto una storia lunga sul mare aperto"
 
-        await h._handle_barge_in(Transcript(text="una storia lunga", is_final=True, confidence=0.9))
+        await h._handle_barge_in(
+            Transcript(text="ti racconto una storia lunga", is_final=True, confidence=0.9)
+        )
 
         # No cancel: the agent's own echo must not interrupt it.
         h.audio_sender.send_clear.assert_not_awaited()
diff --git a/libraries/typescript/src/stream-handler.ts b/libraries/typescript/src/stream-handler.ts
index 9338883..7757313 100644
--- a/libraries/typescript/src/stream-handler.ts
+++ b/libraries/typescript/src/stream-handler.ts
@@ -347,6 +347,12 @@ export function isSttHallucination(text: string): boolean {
  * ``_ECHO_WORD_OVERLAP_THRESHOLD``. */
 const ECHO_WORD_OVERLAP_THRESHOLD = 0.6;
 
+/** Minimum word count before a candidate can be classified as echo — short
+ * caller replies that repeat the agent's offered words ("lunedì", "yes",
+ * "Monday at two") are legitimate answers, never echo. Mirrors Python
+ * ``_ECHO_MIN_CANDIDATE_WORDS``. */
+const ECHO_MIN_CANDIDATE_WORDS = 4;
+
 /** Lowercase, drop punctuation, collapse whitespace — for echo comparison. */
 export function normalizeForEcho(text: string): string {
   return text
@@ -365,9 +371,11 @@ export function looksLikeEcho(candidate: string, agentText: string): boolean {
   const a = normalizeForEcho(agentText);
   const c = normalizeForEcho(candidate);
   if (!a || !c) return false;
-  if (a.includes(c)) return true;
   const words = c.split(' ').filter(Boolean);
-  if (words.length === 0) return false;
+  // Never classify a short reply as echo — exempts single-word / few-word
+  // caller answers that legitimately repeat the agent's offered words.
+  if (words.length < ECHO_MIN_CANDIDATE_WORDS) return false;
+  if (a.includes(c)) return true;
   const agentWords = new Set(a.split(' '));
   const overlap = words.filter((w) => agentWords.has(w)).length / words.length;
   return overlap >= ECHO_WORD_OVERLAP_THRESHOLD;
@@ -378,7 +386,11 @@ export function looksLikeEcho(candidate: string, agentText: string): boolean {
  * ``_is_near_duplicate``. */
 export function isNearDuplicate(a: string, b: string): boolean {
   if (!a || !b) return false;
-  return a === b || a.includes(b) || b.includes(a);
+  if (a === b) return true;
+  const [shorter, longer] = a.length <= b.length ? [a, b] : [b, a];
+  // Word-boundary aware: a character infix ("no" in "nothing else") is NOT a
+  // duplicate; only a true word-prefix double-emit (speech_final+is_final) is.
+  return longer.startsWith(shorter + ' ');
 }
 
 // ---------------------------------------------------------------------------
@@ -2704,12 +2716,16 @@ export class StreamHandler {
           return;
         }
       } else if (this.llmLoop) {
-        responseText = await this.runPipelineLlm(
+        const llmResult = await this.runPipelineLlm(
           filteredTranscript,
           hookExecutor,
           hookCtx,
           historySnapshot,
         );
+        responseText = llmResult.text;
+        // OR in whether the LLM stream itself was cut short, in addition to a
+        // barge-in already seen by handleBargeIn at the top of this turn.
+        interrupted = interrupted || llmResult.interrupted;
       } else {
         getLogger().warn(
           `Pipeline (${label}) has no llm/onMessage handler — transcript ` +
@@ -2722,8 +2738,14 @@ export class StreamHandler {
       if (!responseText) return;
 
       if (this.llmLoop) {
-        await this.emitAssistantTranscript(responseText);
-        this.metricsAcc.recordTtsComplete(responseText);
+        // Marker goes to the history/transcript ONLY (so a stateful agent
+        // runtime sees it was interrupted); metrics use the PLAIN text and are
+        // gated on !interrupted — mirrors Python.
+        const spokenText = interrupted
+          ? `${responseText} [interrupted by caller]`
+          : responseText;
+        await this.emitAssistantTranscript(spokenText);
+        if (!interrupted) this.metricsAcc.recordTtsComplete(responseText);
       } else {
         interrupted = (await this.runRegularLlm(responseText, hookExecutor, hookCtx)) || interrupted;
         // ``runRegularLlm`` returns the possibly-replaced text via side effect on
@@ -3048,14 +3070,16 @@ export class StreamHandler {
 
   /**
    * Streaming built-in LLM path with sentence chunking and per-sentence
-   * guardrails/TTS. Returns the concatenated response text.
+   * guardrails/TTS. Returns the concatenated (plain) response text plus whether
+   * the turn was cut short by a barge-in — the caller applies the interrupted
+   * marker to history only, keeping metrics on the plain text.
    */
   private async runPipelineLlm(
     filteredTranscript: string,
     hookExecutor: PipelineHookExecutor,
     hookCtx: HookContext,
     historySnapshot: Array<{ role: string; text: string }>,
-  ): Promise<string> {
+  ): Promise<{ text: string; interrupted: boolean }> {
     const label = this.deps.bridge.label;
     const callCtx = { call_id: this.callId, caller: this.caller, callee: this.callee };
     const chunker = new SentenceChunker({
@@ -3205,16 +3229,13 @@ export class StreamHandler {
         // Swallow — span teardown should never crash the call path.
       }
     }
-    const responseText = allParts.join('');
-    // Tag the spoken prefix with an ``[interrupted by caller]`` marker when the
-    // turn was cut short, so a stateful agent runtime (Hermes/OpenClaw) sees,
-    // next turn, that it was interrupted and what the caller actually heard —
-    // not an ungrounded full reply that pollutes its context. Parity with
-    // Python ``_process_streaming_response``.
-    if (llmSignal.aborted && responseText) {
-      return `${responseText} [interrupted by caller]`;
-    }
-    return responseText;
+    // Return the PLAIN text plus whether the turn was cut short. The caller
+    // (dispatchTurn) records metrics on the plain text and applies the
+    // ``[interrupted by caller]`` marker only to the history/transcript, so
+    // metrics (TTS cost, turn-complete) are never polluted by the marker.
+    // Parity with Python, where metrics are recorded on the unmarked text
+    // inside ``_process_streaming_response`` before the marker is appended.
+    return { text: allParts.join(''), interrupted: llmSignal.aborted };
   }
 
   /**
diff --git a/libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts b/libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts
index b36fbc1..9f59674 100644
--- a/libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts
+++ b/libraries/typescript/tests/pipeline-echo-dedup.mocked.test.ts
@@ -62,11 +62,21 @@ describe('[mocked] echo + dedup helpers', () => {
   it('normalizeForEcho strips punctuation and case', () => {
     expect(normalizeForEcho('Ciao, come VA?!')).toBe('ciao come va');
   });
-  it('looksLikeEcho: substring fragment is echo', () => {
-    expect(looksLikeEcho('una storia molto', 'Certo, ti racconto una storia molto lunga')).toBe(true);
+  it('looksLikeEcho: long substring fragment is echo', () => {
+    expect(
+      looksLikeEcho('ti racconto una storia molto', 'Certo, ti racconto una storia molto lunga'),
+    ).toBe(true);
   });
-  it('looksLikeEcho: high word overlap is echo', () => {
-    expect(looksLikeEcho("che tu l'hai", "che tu lo voglia o no, te l'ho già detto")).toBe(true);
+  it('looksLikeEcho: long high-word-overlap fragment is echo', () => {
+    expect(
+      looksLikeEcho('che tu lo voglia detto', "che tu lo voglia o no, te l'ho già detto"),
+    ).toBe(true);
+  });
+  it('looksLikeEcho: short answer repeating the agent is NOT echo', () => {
+    const agent = "preferisci lunedì o martedì per l'appuntamento";
+    expect(looksLikeEcho('lunedì', agent)).toBe(false);
+    expect(looksLikeEcho('monday at two', agent)).toBe(false);
+    expect(looksLikeEcho('sì va bene', agent)).toBe(false);
   });
   it('looksLikeEcho: unrelated user speech is not echo', () => {
     expect(looksLikeEcho('fermati dimmi solo interrotto', 'Sto bene grazie sono pronto ad aiutarti')).toBe(false);
@@ -88,21 +98,28 @@ describe('[mocked] commitTranscript echo + dedup', () => {
     h.forwardSttWhileSpeaking = true;
     h.isSpeaking = true;
     h.currentAgentSpokenText = 'ti racconto una storia lunga sul mare';
-    expect(h.commitTranscript('una storia lunga')).toBe(false);
+    expect(h.commitTranscript('ti racconto una storia lunga')).toBe(false);
   });
   it('does NOT drop echo when the flag is off (default)', () => {
     const h = makeHandler();
     h.forwardSttWhileSpeaking = false;
     h.isSpeaking = true;
     h.currentAgentSpokenText = 'ti racconto una storia lunga sul mare';
-    expect(h.commitTranscript('una storia lunga')).toBe(true);
+    expect(h.commitTranscript('ti racconto una storia lunga')).toBe(true);
   });
   it('does NOT drop when idle (post-turn user reply)', () => {
     const h = makeHandler();
     h.forwardSttWhileSpeaking = true;
     h.isSpeaking = false;
     h.currentAgentSpokenText = 'ti racconto una storia lunga sul mare';
-    expect(h.commitTranscript('una storia lunga')).toBe(true);
+    expect(h.commitTranscript('ti racconto una storia lunga')).toBe(true);
+  });
+  it('does NOT drop a short answer repeating the agent (false-positive guard)', () => {
+    const h = makeHandler();
+    h.forwardSttWhileSpeaking = true;
+    h.isSpeaking = true;
+    h.currentAgentSpokenText = "preferisci lunedì o martedì per l'appuntamento";
+    expect(h.commitTranscript('lunedì')).toBe(true);
   });
   it('keeps a different follow-up within 500ms (empty-[interrupted]-turn fix)', () => {
     const h = makeHandler();

From 200f915ba63fd79d32082f7506ac873f690c315a Mon Sep 17 00:00:00 2001
From: nicolotognoni <nicolo.tognoni1@gmail.com>
Date: Tue, 9 Jun 2026 00:13:03 +0200
Subject: [PATCH 07/11] =?UTF-8?q?fix(pipeline):=20echo-safe=20barge-in=20f?=
 =?UTF-8?q?or=20forward-STT=20without=20AEC=20=E2=80=94=20defer=20VAD=20ca?=
 =?UTF-8?q?ncel=20to=20transcript?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On a no-AEC link with PATTER_FORWARD_STT_WHILE_SPEAKING and no
barge_in_strategies, a VAD speech_start during TTS cancelled the turn
immediately. But that speech_start is very often the agent's own TTS
echo (or pre-first-token line noise on a long tool-running Hermes/OpenClaw
turn), so the agent self-interrupted almost every turn: a short normal
reply "bene bene" produced agent_text='[interrupted]', and the next turn
ran the LLM for seconds yet emitted tts_characters=0 (torn down before
its first token).

The echo guard only protected the transcript path; the raw VAD-energy
cancel had none. Defer the VAD-energy cancel to transcript confirmation
whenever forward_stt_while_speaking && aec is None — exactly as it already
worked when barge_in_strategies are configured. The speech_start now marks
the barge-in PENDING (agent keeps talking); the cancel fires only on a real
transcript that survives the echo guard, else the agent resumes after
barge_in_confirm_ms (default 1500ms). Default VAD path and forward-STT WITH
AEC keep the responsive immediate cancel — no behaviour change for existing
configs.

Full Python/TS parity. New tests drive the VAD path through on_audio_received
/ handleAudio: no-AEC+no-strategies defers to pending; AEC on still cancels
immediately; a real transcript confirms, an echo transcript does not.
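
The deferral rule described above can be sketched as a single predicate. This
is an illustration only: should_defer_vad_cancel and its parameters are
hypothetical names, not the identifiers used in stream_handler.py, and the real
check lives inline in the VAD handling path:

```python
from typing import Optional, Sequence


def should_defer_vad_cancel(
    forward_stt_while_speaking: bool,
    aec: Optional[object],
    barge_in_strategies: Sequence[object],
) -> bool:
    """Hypothetical helper: True when a VAD speech_start during TTS should
    only mark the barge-in as pending instead of cancelling the turn."""
    # Configured strategies already deferred the cancel before this patch.
    if barge_in_strategies:
        return True
    # New in this patch: forwarding audio during TTS without AEC means the
    # speech_start may be the agent's own echo, so wait for a transcript.
    return forward_stt_while_speaking and aec is None
```

When the predicate is False (default VAD path, or forward-STT with AEC) the
speech_start keeps its immediate cancel.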
---
 CHANGELOG.md                                  |   1 +
 libraries/python/getpatter/stream_handler.py  |  56 ++++++----
 .../test_pipeline_bargein_backgrounded.py     | 103 ++++++++++++++++++
 libraries/typescript/src/stream-handler.ts    |  16 ++-
 .../tests/unit/barge-in-two-stage.test.ts     |  80 ++++++++++++++
 5 files changed, 234 insertions(+), 22 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7c09b0e..bc85b86 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -19,6 +19,7 @@
   - **Echo guard** — a language-agnostic check (`_looks_like_echo` / `looksLikeEcho`: substring or ≥60% word overlap against the agent's in-flight spoken text) now drops any candidate barge-in/commit that is the agent's own speech echoing back. Active only while forwarding audio during TTS, so the default VAD path and real post-turn replies are untouched.
   - **Back-to-back dedup fix** — a final within 500 ms of the previous is now dropped only when it is a *near-duplicate* (Deepgram emitting `speech_final` then `is_final` for the same utterance). A genuinely different fast follow-up (e.g. the real interruption right after a suppressed phantom) is kept instead of being silently swallowed into an empty turn.
   - **Interrupted-turn context rewrite** — on a confirmed mid-turn barge-in the spoken prefix is recorded in history with an `[interrupted by caller]` marker (instead of an ungrounded full reply), so a stateful agent runtime (Hermes/OpenClaw, keyed by `X-Hermes-Session-Id`) sees on the next turn that it was cut off and what the caller actually heard. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
+- **Forward-STT-without-AEC no longer self-interrupts on its own echo.** The remaining live Hermes/OpenClaw barge-in failure: with `PATTER_FORWARD_STT_WHILE_SPEAKING` on, no AEC, and no `barge_in_strategies`, a VAD `speech_start` during TTS cancelled the turn immediately — but on a no-AEC link that `speech_start` is very often the agent's *own* TTS echo (or pre-first-token line noise during a long tool-running turn). The result was a cascade of false-positive interruptions: a short normal reply like "bene bene" produced `agent_text='[interrupted]'` with `bargein_ms≈0`, and the next turn's LLM ran for seconds but emitted `tts_characters=0` because it was torn down before its first token. The echo guard existed only on the *transcript* path, so the raw VAD-energy cancel had no protection. The VAD-energy cancel is now **deferred to transcript confirmation** whenever audio is forwarded during TTS without AEC (`forward_stt_while_speaking && aec is None`), exactly as it already was when `barge_in_strategies` are configured: the `speech_start` marks the barge-in *pending* (the agent keeps talking) and the cancel only fires once `_handle_barge_in` / `handleBargeIn` sees a real transcript that survives the echo guard; if none confirms within `barge_in_confirm_ms` (default 1500 ms) the agent resumes its sentence. The default VAD path and forward-STT *with* AEC keep the responsive immediate cancel — no behaviour change for existing configs. For the cleanest short-echo handling, still pair with `echo_cancellation=True` or `barge_in_strategies`. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
 
 ## 0.6.5 (2026-06-05)
 
diff --git a/libraries/python/getpatter/stream_handler.py b/libraries/python/getpatter/stream_handler.py
index 3d322ed..971992f 100644
--- a/libraries/python/getpatter/stream_handler.py
+++ b/libraries/python/getpatter/stream_handler.py
@@ -3250,9 +3250,7 @@ async def _filler() -> None:
 
         return asyncio.create_task(_filler())
 
-    async def _cancel_long_turn_filler(
-        self, task: "asyncio.Task | None"
-    ) -> None:
+    async def _cancel_long_turn_filler(self, task: "asyncio.Task | None") -> None:
         """Cancel the long-turn filler task and await its teardown.
 
         Idempotent and race-safe: a ``None`` / already-finished task is a no-op,
@@ -3269,7 +3267,9 @@ async def _cancel_long_turn_filler(
         except asyncio.CancelledError:
             pass
         except Exception:  # pragma: no cover - defensive
-            logger.debug("long_turn_message filler task ended with error", exc_info=True)
+            logger.debug(
+                "long_turn_message filler task ended with error", exc_info=True
+            )
         return None
 
     async def _process_streaming_response(self, result, call_id: str) -> str:
@@ -3438,9 +3438,7 @@ async def _process_streaming_response(self, result, call_id: str) -> str:
                         sentence = transformed
 
                     # Real flushed audio about to play — cancel the filler.
-                    long_turn_task = await self._cancel_long_turn_filler(
-                        long_turn_task
-                    )
+                    long_turn_task = await self._cancel_long_turn_filler(long_turn_task)
                     if not await self._synthesize_sentence(
                         sentence, hook_executor, hook_ctx, first_tts_chunk
                     ):
@@ -3758,7 +3756,9 @@ def _commit_transcript(self, text: str) -> bool:
             and getattr(self, "_is_speaking", False)
             and _looks_like_echo(text, getattr(self, "_current_agent_spoken_text", ""))
         ):
-            logger.debug("Dropped agent-echo transcript (not a user turn): %r", normalised[:40])
+            logger.debug(
+                "Dropped agent-echo transcript (not a user turn): %r", normalised[:40]
+            )
             return False
         if since_last < 2.0 and normalised == self._last_commit_text:
             logger.debug(
@@ -4158,9 +4158,7 @@ async def on_audio_received(self, audio_bytes: bytes) -> None:
                     # silence bug). After this ``_is_speaking`` is False, so
                     # the if/elif below is a no-op and the frame falls through
                     # to STT. Parity with TS ``endTailGraceForNewTurn``.
-                    if self._is_speaking and getattr(
-                        self, "_tail_grace_active", False
-                    ):
+                    if self._is_speaking and getattr(self, "_tail_grace_active", False):
                         await self._end_tail_grace_for_new_turn()
                     phantom_suppressed = self._is_speaking and not self._can_barge_in()
                     if phantom_suppressed:
@@ -4192,13 +4190,31 @@ async def on_audio_received(self, audio_bytes: bytes) -> None:
                         # STT so the user's words are not silently lost.
                         self._suppressed_speech_pending = True
                     elif self._is_speaking:
-                        # Caller spoke over in-flight TTS. With opt-in
-                        # confirmation strategies the cancel is deferred
-                        # until at least one strategy approves the user's
-                        # transcript; otherwise we keep the legacy
-                        # "cancel immediately" path so existing users
-                        # see no behaviour change.
-                        if self._barge_in_strategies:
+                        # Caller spoke over in-flight TTS. The cancel is
+                        # DEFERRED to transcript confirmation — instead of
+                        # firing on raw VAD energy — when EITHER:
+                        #   (a) opt-in ``barge_in_strategies`` are configured
+                        #       (a strategy must approve the transcript), OR
+                        #   (b) we forward STT during TTS WITHOUT AEC. On a
+                        #       no-AEC link a VAD ``speech_start`` here is very
+                        #       often the agent's OWN TTS echo, not the caller;
+                        #       cancelling on it self-interrupts almost every
+                        #       turn (the "bene bene" → [interrupted] cascade
+                        #       seen on live Hermes/OpenClaw calls). Deferring
+                        #       lets ``_handle_barge_in`` run the echo guard on
+                        #       the resulting transcript and cancel only on real
+                        #       caller speech; if no transcript confirms within
+                        #       ``_barge_in_confirm_s`` the pending state times
+                        #       out and the agent resumes its sentence.
+                        # Otherwise (default VAD path, or forward-STT WITH AEC
+                        # where the canceller makes VAD trustworthy) the legacy
+                        # immediate cancel runs — existing users see no change.
+                        # Parity with TS speech_start ``deferCancel``.
+                        defer_cancel = bool(self._barge_in_strategies) or (
+                            getattr(self, "_forward_stt_while_speaking", False)
+                            and getattr(self, "_aec", None) is None
+                        )
+                        if defer_cancel:
                             await self._start_pending_barge_in()
                         else:
                             if self.metrics is not None:
@@ -4233,9 +4249,7 @@ async def on_audio_received(self, audio_bytes: bytes) -> None:
                                 # ``OpenAICompatibleLLMProvider.stream``) closes
                                 # the request the instant this fires. Parity
                                 # with TS ``cancelSpeaking`` → ``llmAbort.abort``.
-                                cancel_event = getattr(
-                                    self, "_llm_cancel_event", None
-                                )
+                                cancel_event = getattr(self, "_llm_cancel_event", None)
                                 if cancel_event is not None:
                                     cancel_event.set()
                     if not phantom_suppressed and self.metrics is not None:
diff --git a/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py b/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
index f362e0c..fdf6187 100644
--- a/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
+++ b/libraries/python/tests/unit/test_pipeline_bargein_backgrounded.py
@@ -208,3 +208,106 @@ async def test_flag_on_forwards_to_stt_during_tts(self, monkeypatch) -> None:
         # leading edge for flush-on-barge-in.
         assert handler._stt.send_audio.await_count == 2
         assert len(handler._inbound_audio_ring) == 2
+
+
+class _PassthroughAEC:
+    """Minimal AEC stand-in: marks the link as AEC-protected without altering audio."""
+
+    def process_near_end(self, pcm: bytes) -> bytes:
+        return pcm
+
+
+@pytest.mark.unit
+@pytest.mark.asyncio
+class TestForwardSttDeferredBargeIn:
+    """On a forward-STT link WITHOUT AEC, a VAD ``speech_start`` during TTS is
+    very often the agent's own echo. Cancelling on raw VAD energy self-interrupts
+    almost every turn (live Hermes "bene bene" → [interrupted] cascade). The fix
+    defers the cancel to a transcript that survives the echo guard.
+    """
+
+    def _speaking_handler(self, *, forward: bool, aec) -> PipelineStreamHandler:
+        handler = _make_handler()
+        handler._forward_stt_while_speaking = forward
+        handler._aec = aec
+        handler._barge_in_strategies = ()
+        handler._auto_vad = _ScriptedVAD([VADEvent(type="speech_start")])
+        handler._stt = AsyncMock()
+        handler._is_speaking = True
+        handler._tail_grace_active = False
+        handler._speaking_generation = 1
+        handler._speaking_started_at = time.time() - 2.0
+        handler._first_audio_sent_at = time.time() - 2.0
+        handler._inbound_audio_ring = []
+        handler._can_barge_in = lambda: True  # type: ignore[assignment]
+        return handler
+
+    async def test_no_aec_no_strategies_defers_to_pending(self) -> None:
+        """forward-STT + no AEC + no strategies → VAD speech_start goes PENDING,
+        does NOT cancel: the agent keeps talking until a real transcript confirms."""
+        handler = self._speaking_handler(forward=True, aec=None)
+
+        await handler.on_audio_received(_FRAME)
+
+        # Deferred: no cancel, agent still owns the floor, LLM stream untouched.
+        assert handler._is_speaking is True
+        assert handler._barge_in_pending_since is not None
+        handler.audio_sender.send_clear.assert_not_called()
+        assert handler._llm_cancel_event.is_set() is False
+        # Clean up the pending-timeout task so it doesn't outlive the test.
+        handler._clear_pending_barge_in()
+
+    async def test_with_aec_still_immediate_cancel(self) -> None:
+        """forward-STT + AEC ON → the canceller makes VAD trustworthy, so the
+        legacy immediate cancel is preserved (responsive barge-in)."""
+        handler = self._speaking_handler(forward=True, aec=_PassthroughAEC())
+
+        await handler.on_audio_received(_FRAME)
+
+        assert handler._is_speaking is False
+        assert handler._barge_in_pending_since is None
+        handler.audio_sender.send_clear.assert_awaited()
+        assert handler._llm_cancel_event.is_set() is True
+
+    async def test_pending_then_real_transcript_cancels(self) -> None:
+        """After a deferred (pending) VAD barge-in, a real (non-echo) transcript
+        confirms the cancel via the echo-guarded transcript path."""
+        handler = self._speaking_handler(forward=True, aec=None)
+        handler._current_agent_spoken_text = "sto bene grazie e tu come stai oggi"
+
+        await handler.on_audio_received(_FRAME)
+        assert handler._barge_in_pending_since is not None  # pending
+
+        # A genuinely different caller utterance (not the agent's own words).
+        await handler._handle_barge_in(
+            Transcript(
+                text="fermati e dimmi solo questo", is_final=True, confidence=0.9
+            )
+        )
+
+        assert handler._is_speaking is False
+        assert handler._barge_in_pending_since is None  # pending cleared on confirm
+        handler.audio_sender.send_clear.assert_awaited()
+        assert handler._llm_cancel_event.is_set() is True
+
+    async def test_pending_then_echo_transcript_does_not_cancel(self) -> None:
+        """An echo transcript (the agent's own forwarded TTS) must NOT confirm
+        the pending barge-in — the agent keeps talking."""
+        handler = self._speaking_handler(forward=True, aec=None)
+        handler._current_agent_spoken_text = "sto bene grazie e tu come stai oggi"
+
+        await handler.on_audio_received(_FRAME)
+        assert handler._barge_in_pending_since is not None  # pending
+
+        # Transcript is a fragment of what the agent is currently saying.
+        await handler._handle_barge_in(
+            Transcript(
+                text="sto bene grazie e tu come stai", is_final=True, confidence=0.9
+            )
+        )
+
+        # Echo guard dropped it — no cancel, still pending, agent still speaking.
+        assert handler._is_speaking is True
+        handler.audio_sender.send_clear.assert_not_called()
+        assert handler._llm_cancel_event.is_set() is False
+        handler._clear_pending_barge_in()
diff --git a/libraries/typescript/src/stream-handler.ts b/libraries/typescript/src/stream-handler.ts
index 7757313..b47ffc0 100644
--- a/libraries/typescript/src/stream-handler.ts
+++ b/libraries/typescript/src/stream-handler.ts
@@ -1615,7 +1615,21 @@ export class StreamHandler {
               // agent finishes naturally without a barge-in.
               this.suppressedSpeechPending = true;
             } else if (this.isSpeaking) {
-              if (this.bargeInStrategies.length > 0) {
+              // Defer the cancel to transcript confirmation — instead of
+              // firing on raw VAD energy — when EITHER opt-in
+              // ``bargeInStrategies`` are configured OR we forward STT during
+              // TTS WITHOUT AEC. On a no-AEC link a VAD ``speech_start`` here
+              // is very often the agent's OWN echo, and cancelling on it
+              // self-interrupts almost every turn (the "bene bene" →
+              // [interrupted] cascade). Deferring lets ``handleBargeIn`` run
+              // the echo guard on the resulting transcript and cancel only on
+              // real caller speech; the pending state times out after
+              // ``bargeInConfirmS`` so the agent resumes if nothing confirms.
+              // Parity with Python on_audio_received ``defer_cancel``.
+              const deferCancel =
+                this.bargeInStrategies.length > 0 ||
+                (this.forwardSttWhileSpeaking && !this.aec);
+              if (deferCancel) {
                 this.startPendingBargeIn();
                 this.metricsAcc.anchorUserSpeechStart();
                 return;
diff --git a/libraries/typescript/tests/unit/barge-in-two-stage.test.ts b/libraries/typescript/tests/unit/barge-in-two-stage.test.ts
index 2fc35e6..8d7b50d 100644
--- a/libraries/typescript/tests/unit/barge-in-two-stage.test.ts
+++ b/libraries/typescript/tests/unit/barge-in-two-stage.test.ts
@@ -337,6 +337,86 @@ describe('StreamHandler — handleStop / handleWsClose drops pending barge-in ti
   });
 });
 
+/** Scripted VAD: returns queued events frame-by-frame, then silence. */
+function makeScriptedVad(events: Array<{ type: string } | null>) {
+  const queue = [...events];
+  return {
+    async processFrame(): Promise<{ type: string } | null> {
+      return queue.shift() ?? null;
+    },
+    async close(): Promise<void> {},
+    reset(): void {},
+  };
+}
+
+describe('StreamHandler — forward-STT-without-AEC defers VAD-energy barge-in (Hermes/OpenClaw)', () => {
+  beforeEach(() => {
+    // Real timers — handleAudio races the VAD promise against a 25 ms timeout.
+    vi.useRealTimers();
+  });
+  afterEach(() => {
+    vi.restoreAllMocks();
+    delete process.env.PATTER_FORWARD_STT_WHILE_SPEAKING;
+  });
+
+  interface HandleAudioPriv {
+    isSpeaking: boolean;
+    bargeInPendingSince: number | null;
+    forwardSttWhileSpeaking: boolean;
+    aec: unknown;
+    autoVad: unknown;
+    stt: unknown;
+    inboundAudioRing: Buffer[];
+    canBargeIn: () => boolean;
+    clearPendingBargeIn: () => void;
+  }
+
+  function armForwardStt(
+    h: StreamHandler,
+    aec: unknown,
+  ): HandleAudioPriv {
+    armSpeakingState(h);
+    const p = h as unknown as HandleAudioPriv;
+    p.stt = { sendAudio: vi.fn() };
+    p.autoVad = makeScriptedVad([{ type: 'speech_start' }]);
+    p.forwardSttWhileSpeaking = true;
+    p.aec = aec;
+    p.inboundAudioRing = [];
+    p.canBargeIn = () => true;
+    return p;
+  }
+
+  it('no AEC + no strategies → VAD speech_start DEFERS to pending (no immediate cancel)', async () => {
+    const deps = makeDeps([]); // legacy config — no opt-in strategies
+    const h = new StreamHandler(deps, makeMockWs(), '+1', '+2');
+    const p = armForwardStt(h, null);
+
+    await h.handleAudio(Buffer.alloc(160)); // 20 ms mulaw frame
+
+    // Deferred: the agent keeps the floor; the cancel waits for a transcript
+    // that survives the echo guard (the "bene bene" → [interrupted] fix).
+    expect(p.isSpeaking).toBe(true);
+    expect(p.bargeInPendingSince).not.toBeNull();
+    expect(deps.bridge.sendClear).not.toHaveBeenCalled();
+    p.clearPendingBargeIn();
+  });
+
+  it('AEC ON → VAD speech_start still cancels immediately (canceller makes VAD trustworthy)', async () => {
+    const deps = makeDeps([]);
+    const h = new StreamHandler(deps, makeMockWs(), '+1', '+2');
+    const p = armForwardStt(h, {
+      processNearEnd: (b: Buffer) => b,
+      pushFarEnd: () => {},
+    });
+
+    await h.handleAudio(Buffer.alloc(160));
+
+    expect(p.isSpeaking).toBe(false);
+    expect(p.bargeInPendingSince).toBeNull();
+    expect(deps.bridge.sendClear).toHaveBeenCalled();
+  });
+});
+
 describe('MinWordsStrategy threshold parity (TS↔Py)', () => {
   it.each([2, 3, 5])(
     'agent stays talking below threshold and cancels at threshold (minWords=%i)',

From d06bf6294b853f0d0f6ba3b4ee1263df79e6057f Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 9 Jun 2026 07:24:21 +0000
Subject: [PATCH 08/11] =?UTF-8?q?feat(hermes):=20zero-config=20CLI=20?=
 =?UTF-8?q?=E2=80=94=20`patter=20hermes`=20doctor=20/=20setup=20/=20attach?=
 =?UTF-8?q?-number=20+=20example=20app?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Make standing up the Hermes voice shell (Direction A) copy-paste simple, on par
with wiring a hosted custom-LLM voice agent but keeping Hermes on loopback.

New `patter hermes` CLI group (Python):
- `doctor`  — preflight across the Hermes gateway (/v1/models reachability +
  model presence), the Patter providers (HermesLLM constructible, Deepgram /
  ElevenLabs keys, ElevenLabs transport, Silero VAD), and the Twilio carrier
  (creds valid, number webhook). Each problem prints a suggested fix.
  `--no-network` skips live probes; `--json` emits machine-readable output.
- `setup`   — scaffold a ready-to-run hermes-phone-agent project, run the
  checks, optionally attach a Twilio number (`--number`/`--url`). Non-interactive
  with `--yes`.
- `attach-number` / `numbers` — point a Twilio number's voice webhook at your
  Patter URL / list account numbers.
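
For reviewers, the gateway part of `doctor` (reachability plus model presence)
can be pictured roughly as below. This is a minimal sketch, not the code in
`cli_hermes.py`: `model_listed` and `probe_gateway` are hypothetical names, and
the OpenAI-style `{"data": [{"id": ...}]}` response shape is an assumption
about what the gateway returns from `/v1/models`.

```python
from __future__ import annotations

import json
import urllib.request


def model_listed(payload: object, model: str) -> bool:
    """True if an OpenAI-style /v1/models payload lists ``model`` (assumed shape)."""
    if not isinstance(payload, dict):
        return False
    entries = payload.get("data") or []
    return any(isinstance(e, dict) and e.get("id") == model for e in entries)


def probe_gateway(
    base_url: str, model: str, key: str = "", timeout: float = 5.0
) -> tuple[bool, str]:
    """Hypothetical helper: GET {base_url}/models and report (ok, message)."""
    headers = {"Authorization": f"Bearer {key}"} if key else {}
    req = urllib.request.Request(f"{base_url}/models", headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            payload = json.load(resp)
    except Exception as exc:  # unreachable, refused, bad JSON, ...
        return False, f"gateway unreachable: {exc}"
    if not model_listed(payload, model):
        return False, f"model {model!r} is not listed by the gateway"
    return True, "ok"
```

Called as `probe_gateway("http://127.0.0.1:8642/v1", "hermes-agent", key)`, it
returns a pass/fail flag and a message that a caller could pair with a
suggested fix. Keeping the payload check separate from the network call is one
way to leave something testable when `--no-network` skips the live probe.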

Scaffold (`getpatter/_hermes_scaffold.py`) is the single source of truth for the
committed `examples/hermes-phone-agent/` project (app.py, .env.example, README,
docker-compose, doctor/text-turn/outbound-call scripts); a test keeps them in
sync. The example defaults to REST ElevenLabs TTS and caller-hash memory.

TS CLI gains a `hermes` stub pointing to the Python wizard (mirrors the `eval`
stub); the HermesLLM provider stays available in both SDKs. Docs updated with a
zero-config setup section.

https://claude.ai/code/session_01TNysNGx7woXM99fHBjpsts
---
 docs/integrations/hermes.mdx                  |  28 +
 examples/hermes-phone-agent/.env.example      |  21 +
 examples/hermes-phone-agent/README.md         |  59 ++
 examples/hermes-phone-agent/app.py            |  60 ++
 .../hermes-phone-agent/docker-compose.yml     |  14 +
 examples/hermes-phone-agent/scripts/doctor.py |   7 +
 .../scripts/test_outbound_call.py             |  50 ++
 .../scripts/test_text_turn.py                 |  41 ++
 .../python/getpatter/_hermes_scaffold.py      | 333 +++++++++
 libraries/python/getpatter/cli.py             |   7 +
 libraries/python/getpatter/cli_hermes.py      | 649 ++++++++++++++++++
 .../python/tests/unit/test_hermes_cli.py      | 184 +++++
 libraries/typescript/src/cli.ts               |  18 +
 13 files changed, 1471 insertions(+)
 create mode 100644 examples/hermes-phone-agent/.env.example
 create mode 100644 examples/hermes-phone-agent/README.md
 create mode 100644 examples/hermes-phone-agent/app.py
 create mode 100644 examples/hermes-phone-agent/docker-compose.yml
 create mode 100644 examples/hermes-phone-agent/scripts/doctor.py
 create mode 100644 examples/hermes-phone-agent/scripts/test_outbound_call.py
 create mode 100644 examples/hermes-phone-agent/scripts/test_text_turn.py
 create mode 100644 libraries/python/getpatter/_hermes_scaffold.py
 create mode 100644 libraries/python/getpatter/cli_hermes.py
 create mode 100644 libraries/python/tests/unit/test_hermes_cli.py

diff --git a/docs/integrations/hermes.mdx b/docs/integrations/hermes.mdx
index 006ed0f..160c48c 100644
--- a/docs/integrations/hermes.mdx
+++ b/docs/integrations/hermes.mdx
@@ -164,6 +164,34 @@ gateway that isn't listening.
   `hermes-agent`).
 </Note>
 
+### Zero-config setup (Python)
+
+If you'd rather not wire it up by hand, the Python CLI scaffolds a ready-to-run project,
+checks your environment, and can point your Twilio number at Patter:
+
+```bash
+pip install getpatter
+
+patter hermes doctor        # preflight: gateway, providers, carrier — with fixes
+patter hermes setup         # scaffold ./hermes-phone-agent (app.py, .env, scripts)
+```
+
+`patter hermes doctor` probes the Hermes gateway (`/v1/models`), confirms `HermesLLM` is
+constructible, checks your Deepgram / ElevenLabs / Twilio credentials, and prints a
+suggested fix for anything missing (`--no-network` skips live probes, `--json` for
+machine-readable output). `patter hermes setup` writes the same starter project shown in
+[`examples/hermes-phone-agent`](https://github.com/PatterAI/Patter/tree/main/examples/hermes-phone-agent)
+and, given `--number` and `--url`, attaches the Twilio webhook for you. To wire an existing
+number on its own:
+
+```bash
+patter hermes numbers       # list the numbers on your Twilio account
+patter hermes attach-number +15551234567 --url https://<your-tunnel>/calls/inbound
+```
+
+These commands live in the Python SDK today; the `HermesLLM` provider itself is available
+in both the Python and TypeScript SDKs.
+
 ### Running Patter locally
 
 Build a pipeline-mode agent whose LLM is `HermesLLM`. Patter wraps the carrier, STT, and
diff --git a/examples/hermes-phone-agent/.env.example b/examples/hermes-phone-agent/.env.example
new file mode 100644
index 0000000..de760cf
--- /dev/null
+++ b/examples/hermes-phone-agent/.env.example
@@ -0,0 +1,21 @@
+# ── Hermes gateway (the brain — keep it on loopback) ──────────────────
+API_SERVER_ENABLED=true
+API_SERVER_HOST=127.0.0.1
+API_SERVER_PORT=8642
+API_SERVER_KEY=choose-a-strong-key
+API_SERVER_MODEL_NAME=hermes-agent
+
+# ── Patter (the voice shell) ──────────────────────────────────────────
+PATTER_PHONE_NUMBER=+15551234567
+PATTER_LANGUAGE=en
+# REST is the safer default for a first PSTN demo; set to ws for streaming.
+PATTER_ELEVENLABS_TRANSPORT=rest
+
+# ── Twilio carrier ────────────────────────────────────────────────────
+TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+TWILIO_AUTH_TOKEN=your-twilio-auth-token
+
+# ── STT / TTS providers ───────────────────────────────────────────────
+DEEPGRAM_API_KEY=your-deepgram-key
+ELEVENLABS_API_KEY=your-elevenlabs-key
+# ELEVENLABS_VOICE_ID=EXAVITQu4vr4xnSDxMaL
diff --git a/examples/hermes-phone-agent/README.md b/examples/hermes-phone-agent/README.md
new file mode 100644
index 0000000..3b5298c
--- /dev/null
+++ b/examples/hermes-phone-agent/README.md
@@ -0,0 +1,59 @@
+# Hermes phone agent
+
+A self-hosted phone line for your [Hermes Agent](https://github.com/NousResearch/hermes-agent).
+Patter is the **voice shell** (carrier, speech-to-text, turn-taking, barge-in,
+text-to-speech); Hermes is the **brain** on the line. Each conversation turn is
+one `POST http://127.0.0.1:8642/v1/chat/completions` against your local Hermes
+gateway — so Hermes keeps its tools, memory, and skills, and **never leaves
+loopback**. The only thing exposed to the internet is Patter's carrier webhook.
+
+## 1. Configure
+
+```bash
+cp .env.example .env
+# then fill in API_SERVER_KEY, TWILIO_*, DEEPGRAM_API_KEY, ELEVENLABS_API_KEY,
+# and PATTER_PHONE_NUMBER
+```
+
+## 2. Check everything is wired up
+
+```bash
+pip install getpatter
+patter hermes doctor
+```
+
+Fix anything it flags (it prints a suggested command for each problem), then
+smoke-test the brain without spending a phone call:
+
+```bash
+python scripts/test_text_turn.py "say hello in one sentence"
+```
+
+## 3. Answer the phone
+
+```bash
+python app.py
+```
+
+Patter opens a tunnel and prints the public webhook URL. Point your Twilio
+number's voice webhook at it — or let Patter do it for you:
+
+```bash
+patter hermes attach-number "$PATTER_PHONE_NUMBER" --url https://<your-tunnel>/calls/inbound
+```
+
+Now call your number and talk to Hermes.
+
+## 4. Place an outbound call (optional)
+
+```bash
+python scripts/test_outbound_call.py +15557654321
+```
+
+## Why Patter instead of a hosted custom-LLM voice agent?
+
+- **Hermes stays private.** A hosted platform has to reach your "brain" endpoint
+  over the public internet; here Hermes is loopback-only and only Patter is
+  exposed.
+- **You own the voice layer** — STT, turn-taking, barge-in, TTS — and can script it.
+- **Inbound *and* outbound**, plus the Patter MCP server so Hermes can place calls.
diff --git a/examples/hermes-phone-agent/app.py b/examples/hermes-phone-agent/app.py
new file mode 100644
index 0000000..f939d78
--- /dev/null
+++ b/examples/hermes-phone-agent/app.py
@@ -0,0 +1,60 @@
+"""Hermes phone agent — Patter is the voice shell, Hermes is the brain.
+
+A caller dials your number, Patter answers (carrier + STT + turn-taking + TTS),
+and every conversation turn is routed to your local Hermes gateway as the LLM.
+Hermes stays on loopback (127.0.0.1:8642); only Patter's carrier webhook is
+exposed to the internet, via the tunnel.
+
+Run:
+    python app.py
+
+Check your setup first with:
+    patter hermes doctor
+"""
+
+from __future__ import annotations
+
+import os
+
+from getpatter import (
+    DeepgramSTT,
+    ElevenLabsRestTTS,
+    ElevenLabsTTS,
+    HermesLLM,
+    Patter,
+    Twilio,
+)
+
+# REST TTS is the safer default for a first PSTN demo: there is no long-lived
+# WebSocket that can stall before the first audio frame. Set
+# PATTER_ELEVENLABS_TRANSPORT=ws to opt into streaming once the basics work.
+if os.environ.get("PATTER_ELEVENLABS_TRANSPORT", "rest").lower() == "ws":
+    tts = ElevenLabsTTS.for_twilio()
+else:
+    tts = ElevenLabsRestTTS.for_twilio()
+
+phone = Patter(
+    carrier=Twilio(),                       # TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN
+    phone_number=os.environ["PATTER_PHONE_NUMBER"],
+    tunnel=True,                            # auto Cloudflare quick-tunnel (local dev)
+)
+
+agent = phone.agent(
+    system_prompt=(
+        "You are Hermes on a live phone call. Keep replies concise, warm, and "
+        "spoken-friendly. Avoid markdown, code blocks, long lists, and URLs "
+        "unless the caller asks. If you use a tool, say you are checking, then "
+        "summarize the result naturally. If interrupted, stop and answer the "
+        "latest request."
+    ),
+    language=os.environ.get("PATTER_LANGUAGE", "en"),
+    first_message="Hello, this is Hermes. How can I help?",
+    stt=DeepgramSTT(),                      # DEEPGRAM_API_KEY
+    llm=HermesLLM(session_key_from="caller_hash"),   # http://127.0.0.1:8642/v1
+    tts=tts,                                # ELEVENLABS_API_KEY
+    long_turn_message="One moment, let me check that.",
+    llm_error_message="Sorry, I'm having trouble reaching Hermes right now.",
+)
+
+if __name__ == "__main__":
+    phone.serve(agent)                      # answers inbound calls
diff --git a/examples/hermes-phone-agent/docker-compose.yml b/examples/hermes-phone-agent/docker-compose.yml
new file mode 100644
index 0000000..cffcb9b
--- /dev/null
+++ b/examples/hermes-phone-agent/docker-compose.yml
@@ -0,0 +1,14 @@
+# Patter + Hermes on one box. Hermes stays on loopback; only Patter is exposed.
+#
+# This runs the Patter voice shell in a container that shares the host network
+# so it can reach the Hermes gateway on 127.0.0.1:8642. Start your Hermes
+# gateway on the host first (see the Hermes docs), then `docker compose up`.
+services:
+  patter:
+    image: python:3.12-slim
+    working_dir: /app
+    env_file: .env
+    network_mode: host          # so Patter reaches Hermes on 127.0.0.1:8642
+    volumes:
+      - .:/app
+    command: sh -c "pip install --quiet getpatter && python app.py"
diff --git a/examples/hermes-phone-agent/scripts/doctor.py b/examples/hermes-phone-agent/scripts/doctor.py
new file mode 100644
index 0000000..0f13930
--- /dev/null
+++ b/examples/hermes-phone-agent/scripts/doctor.py
@@ -0,0 +1,7 @@
+#!/usr/bin/env python
+"""Run the Patter Hermes preflight checks (wraps `patter hermes doctor`)."""
+
+import subprocess
+import sys
+
+raise SystemExit(subprocess.call(["patter", "hermes", "doctor", *sys.argv[1:]]))
diff --git a/examples/hermes-phone-agent/scripts/test_outbound_call.py b/examples/hermes-phone-agent/scripts/test_outbound_call.py
new file mode 100644
index 0000000..1c35f9b
--- /dev/null
+++ b/examples/hermes-phone-agent/scripts/test_outbound_call.py
@@ -0,0 +1,50 @@
+#!/usr/bin/env python
+"""Place a test outbound call through the Hermes voice shell.
+
+    python scripts/test_outbound_call.py +15557654321
+
+The callee picks up and talks to Hermes. Requires the same env as app.py.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import os
+import sys
+
+from getpatter import (
+    DeepgramSTT,
+    ElevenLabsRestTTS,
+    HermesLLM,
+    Patter,
+    Twilio,
+)
+
+
+async def main() -> int:
+    if len(sys.argv) < 2:
+        print("Usage: python scripts/test_outbound_call.py <+E164>")
+        return 2
+    to = sys.argv[1]
+    phone = Patter(
+        carrier=Twilio(),
+        phone_number=os.environ["PATTER_PHONE_NUMBER"],
+        tunnel=True,
+    )
+    agent = phone.agent(
+        system_prompt=(
+            "You are Hermes on a short test call. Greet the person warmly and "
+            "ask how they are. Keep it brief and spoken-friendly."
+        ),
+        first_message="Hi, this is a Patter and Hermes test call.",
+        stt=DeepgramSTT(),
+        llm=HermesLLM(session_key_from="caller_hash"),
+        tts=ElevenLabsRestTTS.for_twilio(),
+    )
+    result = await phone.call(to, agent=agent, wait=True)
+    print(f"Call outcome: {result.outcome if result else 'unknown'}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(asyncio.run(main()))
diff --git a/examples/hermes-phone-agent/scripts/test_text_turn.py b/examples/hermes-phone-agent/scripts/test_text_turn.py
new file mode 100644
index 0000000..ddda492
--- /dev/null
+++ b/examples/hermes-phone-agent/scripts/test_text_turn.py
@@ -0,0 +1,41 @@
+#!/usr/bin/env python
+"""Text-only smoke test: send one turn to the local Hermes gateway.
+
+Verifies the brain answers before you spend a phone call debugging it. Reads
+API_SERVER_PORT / API_SERVER_MODEL_NAME / API_SERVER_KEY from the environment.
+
+    python scripts/test_text_turn.py "say hello in one short sentence"
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+import urllib.request
+
+base = f"http://127.0.0.1:{os.environ.get('API_SERVER_PORT', '8642')}/v1"
+model = os.environ.get("API_SERVER_MODEL_NAME", "hermes-agent")
+key = os.environ.get("API_SERVER_KEY", "")
+prompt = " ".join(sys.argv[1:]) or "Say hello in one short sentence."
+
+headers = {"Content-Type": "application/json"}
+if key:
+    headers["Authorization"] = f"Bearer {key}"
+
+req = urllib.request.Request(
+    f"{base}/chat/completions",
+    data=json.dumps(
+        {"model": model, "messages": [{"role": "user", "content": prompt}]}
+    ).encode(),
+    headers=headers,
+)
+
+try:
+    with urllib.request.urlopen(req, timeout=120) as resp:  # noqa: S310
+        data = json.load(resp)
+except Exception as exc:  # noqa: BLE001
+    print(f"Hermes did not answer: {exc}")
+    raise SystemExit(1)
+
+print(data["choices"][0]["message"]["content"])
diff --git a/libraries/python/getpatter/_hermes_scaffold.py b/libraries/python/getpatter/_hermes_scaffold.py
new file mode 100644
index 0000000..fe9b7a5
--- /dev/null
+++ b/libraries/python/getpatter/_hermes_scaffold.py
@@ -0,0 +1,333 @@
+"""Project scaffold for the Hermes phone agent.
+
+Single source of truth for the ``hermes-phone-agent`` starter project. The
+``patter hermes setup`` wizard (see :mod:`getpatter.cli_hermes`) writes these
+files for a user, and the committed ``examples/hermes-phone-agent/`` tree is
+generated from the same :data:`FILES` map (a test asserts they stay in sync).
+
+Each entry maps a project-relative path to its file contents. Keep the contents
+runnable against the real public API — they double as the example the docs
+point at.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+__all__ = ["FILES", "scaffold"]
+
+
+_APP_PY = '''\
+"""Hermes phone agent — Patter is the voice shell, Hermes is the brain.
+
+A caller dials your number, Patter answers (carrier + STT + turn-taking + TTS),
+and every conversation turn is routed to your local Hermes gateway as the LLM.
+Hermes stays on loopback (127.0.0.1:8642); only Patter's carrier webhook is
+exposed to the internet, via the tunnel.
+
+Run:
+    python app.py
+
+Check your setup first with:
+    patter hermes doctor
+"""
+
+from __future__ import annotations
+
+import os
+
+from getpatter import (
+    DeepgramSTT,
+    ElevenLabsRestTTS,
+    ElevenLabsTTS,
+    HermesLLM,
+    Patter,
+    Twilio,
+)
+
+# REST TTS is the safer default for a first PSTN demo: there is no long-lived
+# WebSocket that can stall before the first audio frame. Set
+# PATTER_ELEVENLABS_TRANSPORT=ws to opt into streaming once the basics work.
+if os.environ.get("PATTER_ELEVENLABS_TRANSPORT", "rest").lower() == "ws":
+    tts = ElevenLabsTTS.for_twilio()
+else:
+    tts = ElevenLabsRestTTS.for_twilio()
+
+phone = Patter(
+    carrier=Twilio(),                       # TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN
+    phone_number=os.environ["PATTER_PHONE_NUMBER"],
+    tunnel=True,                            # auto Cloudflare quick-tunnel (local dev)
+)
+
+agent = phone.agent(
+    system_prompt=(
+        "You are Hermes on a live phone call. Keep replies concise, warm, and "
+        "spoken-friendly. Avoid markdown, code blocks, long lists, and URLs "
+        "unless the caller asks. If you use a tool, say you are checking, then "
+        "summarize the result naturally. If interrupted, stop and answer the "
+        "latest request."
+    ),
+    language=os.environ.get("PATTER_LANGUAGE", "en"),
+    first_message="Hello, this is Hermes. How can I help?",
+    stt=DeepgramSTT(),                      # DEEPGRAM_API_KEY
+    llm=HermesLLM(session_key_from="caller_hash"),   # http://127.0.0.1:8642/v1
+    tts=tts,                                # ELEVENLABS_API_KEY
+    long_turn_message="One moment, let me check that.",
+    llm_error_message="Sorry, I'm having trouble reaching Hermes right now.",
+)
+
+if __name__ == "__main__":
+    phone.serve(agent)                      # answers inbound calls
+'''
+
+
+_ENV_EXAMPLE = """\
+# ── Hermes gateway (the brain — keep it on loopback) ──────────────────
+API_SERVER_ENABLED=true
+API_SERVER_HOST=127.0.0.1
+API_SERVER_PORT=8642
+API_SERVER_KEY=choose-a-strong-key
+API_SERVER_MODEL_NAME=hermes-agent
+
+# ── Patter (the voice shell) ──────────────────────────────────────────
+PATTER_PHONE_NUMBER=+15551234567
+PATTER_LANGUAGE=en
+# REST is the safer default for a first PSTN demo; set to ws for streaming.
+PATTER_ELEVENLABS_TRANSPORT=rest
+
+# ── Twilio carrier ────────────────────────────────────────────────────
+TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+TWILIO_AUTH_TOKEN=your-twilio-auth-token
+
+# ── STT / TTS providers ───────────────────────────────────────────────
+DEEPGRAM_API_KEY=your-deepgram-key
+ELEVENLABS_API_KEY=your-elevenlabs-key
+# ELEVENLABS_VOICE_ID=EXAVITQu4vr4xnSDxMaL
+"""
+
+
+_README = """\
+# Hermes phone agent
+
+A self-hosted phone line for your [Hermes Agent](https://github.com/NousResearch/hermes-agent).
+Patter is the **voice shell** (carrier, speech-to-text, turn-taking, barge-in,
+text-to-speech); Hermes is the **brain** on the line. Each conversation turn is
+one `POST http://127.0.0.1:8642/v1/chat/completions` against your local Hermes
+gateway — so Hermes keeps its tools, memory, and skills, and **never leaves
+loopback**. The only thing exposed to the internet is Patter's carrier webhook.
+
+## 1. Configure
+
+```bash
+cp .env.example .env
+# then fill in API_SERVER_KEY, TWILIO_*, DEEPGRAM_API_KEY, ELEVENLABS_API_KEY,
+# and PATTER_PHONE_NUMBER
+```
+
+## 2. Check everything is wired up
+
+```bash
+pip install getpatter
+patter hermes doctor
+```
+
+Fix anything it flags (it prints a suggested command for each problem), then
+smoke-test the brain without spending a phone call:
+
+```bash
+python scripts/test_text_turn.py "say hello in one sentence"
+```
+
+## 3. Answer the phone
+
+```bash
+python app.py
+```
+
+Patter opens a tunnel and prints the public webhook URL. Point your Twilio
+number's voice webhook at it — or let Patter do it for you:
+
+```bash
+patter hermes attach-number "$PATTER_PHONE_NUMBER" --url https://<your-tunnel>/calls/inbound
+```
+
+Now call your number and talk to Hermes.
+
+## 4. Place an outbound call (optional)
+
+```bash
+python scripts/test_outbound_call.py +15557654321
+```
+
+## Why Patter instead of a hosted custom-LLM voice agent?
+
+- **Hermes stays private.** A hosted platform has to reach your "brain" endpoint
+  over the public internet; here Hermes is loopback-only and only Patter is
+  exposed.
+- **You own the voice layer** — STT, turn-taking, barge-in, TTS — and can script it.
+- **Inbound *and* outbound**, plus the Patter MCP server so Hermes can place calls.
+"""
+
+
+_DOCKER_COMPOSE = """\
+# Patter + Hermes on one box. Hermes stays on loopback; only Patter is exposed.
+#
+# This runs the Patter voice shell in a container that shares the host network
+# so it can reach the Hermes gateway on 127.0.0.1:8642. Start your Hermes
+# gateway on the host first (see the Hermes docs), then `docker compose up`.
+services:
+  patter:
+    image: python:3.12-slim
+    working_dir: /app
+    env_file: .env
+    network_mode: host          # so Patter reaches Hermes on 127.0.0.1:8642
+    volumes:
+      - .:/app
+    command: sh -c "pip install --quiet getpatter && python app.py"
+"""
+
+
+_SCRIPT_DOCTOR = '''\
+#!/usr/bin/env python
+"""Run the Patter Hermes preflight checks (wraps `patter hermes doctor`)."""
+
+import subprocess
+import sys
+
+raise SystemExit(subprocess.call(["patter", "hermes", "doctor", *sys.argv[1:]]))
+'''
+
+
+_SCRIPT_TEXT_TURN = '''\
+#!/usr/bin/env python
+"""Text-only smoke test: send one turn to the local Hermes gateway.
+
+Verifies the brain answers before you spend a phone call debugging it. Reads
+API_SERVER_PORT / API_SERVER_MODEL_NAME / API_SERVER_KEY from the environment.
+
+    python scripts/test_text_turn.py "say hello in one short sentence"
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+import urllib.request
+
+base = f"http://127.0.0.1:{os.environ.get('API_SERVER_PORT', '8642')}/v1"
+model = os.environ.get("API_SERVER_MODEL_NAME", "hermes-agent")
+key = os.environ.get("API_SERVER_KEY", "")
+prompt = " ".join(sys.argv[1:]) or "Say hello in one short sentence."
+
+headers = {"Content-Type": "application/json"}
+if key:
+    headers["Authorization"] = f"Bearer {key}"
+
+req = urllib.request.Request(
+    f"{base}/chat/completions",
+    data=json.dumps(
+        {"model": model, "messages": [{"role": "user", "content": prompt}]}
+    ).encode(),
+    headers=headers,
+)
+
+try:
+    with urllib.request.urlopen(req, timeout=120) as resp:  # noqa: S310
+        data = json.load(resp)
+except Exception as exc:  # noqa: BLE001
+    print(f"Hermes did not answer: {exc}")
+    raise SystemExit(1)
+
+print(data["choices"][0]["message"]["content"])
+'''
+
+
+_SCRIPT_OUTBOUND = '''\
+#!/usr/bin/env python
+"""Place a test outbound call through the Hermes voice shell.
+
+    python scripts/test_outbound_call.py +15557654321
+
+The callee picks up and talks to Hermes. Requires the same env as app.py.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import os
+import sys
+
+from getpatter import (
+    DeepgramSTT,
+    ElevenLabsRestTTS,
+    HermesLLM,
+    Patter,
+    Twilio,
+)
+
+
+async def main() -> int:
+    if len(sys.argv) < 2:
+        print("Usage: python scripts/test_outbound_call.py <+E164>")
+        return 2
+    to = sys.argv[1]
+    phone = Patter(
+        carrier=Twilio(),
+        phone_number=os.environ["PATTER_PHONE_NUMBER"],
+        tunnel=True,
+    )
+    agent = phone.agent(
+        system_prompt=(
+            "You are Hermes on a short test call. Greet the person warmly and "
+            "ask how they are. Keep it brief and spoken-friendly."
+        ),
+        first_message="Hi, this is a Patter and Hermes test call.",
+        stt=DeepgramSTT(),
+        llm=HermesLLM(session_key_from="caller_hash"),
+        tts=ElevenLabsRestTTS.for_twilio(),
+    )
+    result = await phone.call(to, agent=agent, wait=True)
+    print(f"Call outcome: {result.outcome if result else 'unknown'}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(asyncio.run(main()))
+'''
+
+
+# Project-relative path -> file contents. The committed
+# ``examples/hermes-phone-agent/`` tree is generated from this map.
+FILES: dict[str, str] = {
+    "app.py": _APP_PY,
+    ".env.example": _ENV_EXAMPLE,
+    "README.md": _README,
+    "docker-compose.yml": _DOCKER_COMPOSE,
+    "scripts/doctor.py": _SCRIPT_DOCTOR,
+    "scripts/test_text_turn.py": _SCRIPT_TEXT_TURN,
+    "scripts/test_outbound_call.py": _SCRIPT_OUTBOUND,
+}
+
+
+def scaffold(target_dir: Path | str, *, force: bool = False) -> list[Path]:
+    """Write the project files under ``target_dir``.
+
+    Args:
+        target_dir: Destination directory (created if missing).
+        force: Overwrite existing files. When ``False`` (default), existing
+            files are left untouched and skipped.
+
+    Returns:
+        The list of paths that were written (skipped files are excluded).
+    """
+    root = Path(target_dir)
+    written: list[Path] = []
+    for rel, content in FILES.items():
+        dest = root / rel
+        if dest.exists() and not force:
+            continue
+        dest.parent.mkdir(parents=True, exist_ok=True)
+        dest.write_text(content, encoding="utf-8")
+        written.append(dest)
+    return written
diff --git a/libraries/python/getpatter/cli.py b/libraries/python/getpatter/cli.py
index 09ea2d7..204a061 100644
--- a/libraries/python/getpatter/cli.py
+++ b/libraries/python/getpatter/cli.py
@@ -32,12 +32,19 @@ def main() -> None:
 
     build_eval_parser(subparsers)
 
+    # patter hermes {doctor|setup|attach-number|numbers}
+    from getpatter.cli_hermes import build_hermes_parser, dispatch_hermes
+
+    build_hermes_parser(subparsers)
+
     args = parser.parse_args()
 
     if args.command == "dashboard":
         asyncio.run(_run_dashboard(args.port))
     elif args.command == "eval":
         sys.exit(dispatch_eval(args))
+    elif args.command == "hermes":
+        sys.exit(dispatch_hermes(args))
     else:
         parser.print_help()
         sys.exit(1)
diff --git a/libraries/python/getpatter/cli_hermes.py b/libraries/python/getpatter/cli_hermes.py
new file mode 100644
index 0000000..1c3c603
--- /dev/null
+++ b/libraries/python/getpatter/cli_hermes.py
@@ -0,0 +1,649 @@
+"""``patter hermes ...`` — zero-config setup, diagnostics, and Twilio wiring
+for the Hermes voice shell (Direction A: Patter is the voice, Hermes is the
+brain).
+
+Subcommands:
+
+* ``patter hermes doctor`` — preflight checks across the Hermes gateway, the
+  Patter providers, and the carrier, each with a suggested fix.
+* ``patter hermes setup`` — scaffold a ready-to-run ``hermes-phone-agent``
+  project, run the checks, and optionally attach a Twilio number.
+* ``patter hermes attach-number`` — point a Twilio number's voice webhook at
+  your Patter URL.
+* ``patter hermes numbers`` — list the Twilio numbers on your account.
+
+Live probes (gateway / Twilio API) are best-effort and time-bounded; pass
+``--no-network`` to skip them. Nothing is mutated unless you ask for it
+(interactive ``setup`` prompts before writing; ``attach-number`` is explicit).
+"""
+
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import json
+import os
+import shutil
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+
+# Check statuses.
+OK = "ok"
+WARN = "warn"
+FAIL = "fail"
+SKIP = "skip"
+
+_SYMBOL = {OK: "✓", WARN: "!", FAIL: "✗", SKIP: "·"}
+
+
+def _color(text: str, status: str) -> str:
+    """Colorize a status symbol unless NO_COLOR is set or output isn't a tty."""
+    if os.environ.get("NO_COLOR") or not sys.stdout.isatty():
+        return text
+    code = {OK: "32", WARN: "33", FAIL: "31", SKIP: "90"}.get(status, "0")
+    return f"\033[{code}m{text}\033[0m"
+
+
+@dataclass
+class Check:
+    """One diagnostic result."""
+
+    status: str
+    label: str
+    detail: str = ""
+    fix: str = ""
+
+
+@dataclass
+class Section:
+    """A named group of checks."""
+
+    title: str
+    checks: list[Check] = field(default_factory=list)
+
+
+# ──────────────────────────────────────────────────────────────────────────
+# Hermes gateway base URL resolution (mirrors HermesLLM defaults)
+# ──────────────────────────────────────────────────────────────────────────
+def _hermes_base_url(override: str | None) -> str:
+    if override:
+        return override.rstrip("/")
+    host = os.environ.get("API_SERVER_HOST", "127.0.0.1")
+    port = os.environ.get("API_SERVER_PORT", "8642")
+    return f"http://{host}:{port}/v1"
+
+
+def _get_json(url: str, *, headers: dict | None = None, timeout: float = 4.0):
+    """Best-effort sync GET returning ``(status_code, json_or_none, error)``."""
+    try:
+        import httpx
+    except ImportError:  # pragma: no cover - httpx is a core dep
+        return None, None, "httpx not installed"
+    try:
+        resp = httpx.get(url, headers=headers or {}, timeout=timeout)
+    except Exception as exc:  # noqa: BLE001 - surface any connection failure
+        return None, None, str(exc)
+    try:
+        body = resp.json()
+    except Exception:  # noqa: BLE001 - non-JSON body
+        body = None
+    return resp.status_code, body, ""
+
+
+# ──────────────────────────────────────────────────────────────────────────
+# Check groups
+# ──────────────────────────────────────────────────────────────────────────
+def _check_hermes(base_url: str, *, network: bool) -> Section:
+    sec = Section("Hermes")
+
+    if shutil.which("hermes"):
+        sec.checks.append(Check(OK, "CLI found", "hermes on PATH"))
+    else:
+        sec.checks.append(
+            Check(
+                WARN,
+                "CLI not found",
+                "optional — only the gateway is required",
+                "install Hermes: https://github.com/NousResearch/hermes-agent",
+            )
+        )
+
+    key = os.environ.get("API_SERVER_KEY", "")
+    if key:
+        sec.checks.append(Check(OK, "API_SERVER_KEY set"))
+    else:
+        sec.checks.append(
+            Check(
+                WARN,
+                "API_SERVER_KEY not set",
+                "keyless local gateways work, but a key is recommended",
+                'export API_SERVER_KEY="choose-a-strong-key"',
+            )
+        )
+
+    if not network:
+        sec.checks.append(Check(SKIP, "Gateway reachable", "skipped (--no-network)"))
+        return sec
+
+    headers = {"Authorization": f"Bearer {key}"} if key else {}
+    status, body, err = _get_json(f"{base_url}/models", headers=headers)
+    if status == 200:
+        sec.checks.append(Check(OK, "Gateway reachable", base_url))
+        want = os.environ.get("API_SERVER_MODEL_NAME", "hermes-agent")
+        ids = _model_ids(body)
+        if want in ids:
+            sec.checks.append(Check(OK, "Model available", want))
+        elif ids:
+            sec.checks.append(
+                Check(
+                    WARN,
+                    "Model not found",
+                    f"{want!r} missing; saw {', '.join(sorted(ids)[:5])}",
+                    "set API_SERVER_MODEL_NAME to one of the served models",
+                )
+            )
+        else:
+            sec.checks.append(
+                Check(WARN, "Model list empty", "gateway returned no models")
+            )
+    elif status in (401, 403):
+        sec.checks.append(
+            Check(
+                FAIL,
+                "Gateway rejected key",
+                f"HTTP {status} from {base_url}/models",
+                "check API_SERVER_KEY matches the gateway",
+            )
+        )
+    else:
+        detail = f"HTTP {status}" if status else (err or "no response")
+        sec.checks.append(
+            Check(
+                FAIL,
+                "Gateway unreachable",
+                f"{base_url} — {detail}",
+                "enable + start the gateway (API_SERVER_ENABLED=true), "
+                "or pass --base-url",
+            )
+        )
+    return sec
+
+
+def _model_ids(body) -> set[str]:
+    """Extract model ids from an OpenAI-style ``/models`` payload."""
+    if not isinstance(body, dict):
+        return set()
+    data = body.get("data")
+    if not isinstance(data, list):
+        return set()
+    return {m.get("id") for m in data if isinstance(m, dict) and m.get("id")}
+
+
+def _check_patter() -> Section:
+    sec = Section("Patter")
+
+    try:
+        from getpatter import __version__
+
+        sec.checks.append(Check(OK, "getpatter installed", __version__))
+    except Exception as exc:  # noqa: BLE001
+        sec.checks.append(Check(FAIL, "getpatter import failed", str(exc)))
+        return sec
+
+    try:
+        from getpatter import HermesLLM
+
+        HermesLLM()
+        sec.checks.append(Check(OK, "HermesLLM constructible"))
+    except Exception as exc:  # noqa: BLE001
+        sec.checks.append(Check(FAIL, "HermesLLM construction failed", str(exc)))
+
+    sec.checks.append(_env_key("DEEPGRAM_API_KEY", "Deepgram STT"))
+    sec.checks.append(_env_key("ELEVENLABS_API_KEY", "ElevenLabs TTS"))
+
+    transport = os.environ.get("PATTER_ELEVENLABS_TRANSPORT", "").lower()
+    if transport == "rest":
+        sec.checks.append(Check(OK, "ElevenLabs transport", "rest"))
+    elif transport == "ws":
+        sec.checks.append(
+            Check(
+                WARN,
+                "ElevenLabs transport",
+                "ws — WebSocket can stall before the first frame on PSTN",
+                "PATTER_ELEVENLABS_TRANSPORT=rest for a more robust demo",
+            )
+        )
+    else:
+        sec.checks.append(
+            Check(OK, "ElevenLabs transport", "unset (example defaults to REST)")
+        )
+
+    if importlib.util.find_spec("onnxruntime") is not None:
+        sec.checks.append(Check(OK, "Silero VAD available"))
+    else:
+        sec.checks.append(
+            Check(
+                WARN,
+                "Silero VAD missing",
+                "only needed for the pipeline VAD",
+                'pip install "getpatter[silero]"',
+            )
+        )
+    return sec
+
+
+def _env_key(var: str, label: str) -> Check:
+    if os.environ.get(var):
+        return Check(OK, f"{label} key found")
+    return Check(
+        WARN,
+        f"{label} key missing",
+        f"{var} not set",
+        f"export {var}=...",
+    )
+
+
+def _check_twilio(*, network: bool) -> Section:
+    sec = Section("Twilio")
+    sid = os.environ.get("TWILIO_ACCOUNT_SID", "")
+    token = os.environ.get("TWILIO_AUTH_TOKEN", "")
+    if not sid or not token:
+        sec.checks.append(
+            Check(
+                WARN,
+                "Carrier credentials missing",
+                "TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN not set",
+                "set them, or use Telnyx/Plivo instead",
+            )
+        )
+        return sec
+    sec.checks.append(Check(OK, "Credentials present"))
+
+    if not network:
+        sec.checks.append(Check(SKIP, "Credentials valid", "skipped (--no-network)"))
+        return sec
+
+    try:
+        import httpx
+    except ImportError:  # pragma: no cover
+        sec.checks.append(Check(SKIP, "Credentials valid", "httpx not installed"))
+        return sec
+
+    base = f"https://api.twilio.com/2010-04-01/Accounts/{sid}"
+    try:
+        resp = httpx.get(f"{base}.json", auth=(sid, token), timeout=6.0)
+    except Exception as exc:  # noqa: BLE001
+        sec.checks.append(Check(FAIL, "Twilio API unreachable", str(exc)))
+        return sec
+    if resp.status_code == 200:
+        sec.checks.append(Check(OK, "Credentials valid"))
+    else:
+        sec.checks.append(
+            Check(
+                FAIL,
+                "Credentials rejected",
+                f"HTTP {resp.status_code}",
+                "check TWILIO_ACCOUNT_SID / TWILIO_AUTH_TOKEN",
+            )
+        )
+        return sec
+
+    number = os.environ.get("PATTER_PHONE_NUMBER") or os.environ.get(
+        "TWILIO_PHONE_NUMBER", ""
+    )
+    if not number:
+        sec.checks.append(
+            Check(SKIP, "Webhook configured", "set PATTER_PHONE_NUMBER to check")
+        )
+        return sec
+    try:
+        resp = httpx.get(
+            f"{base}/IncomingPhoneNumbers.json",
+            params={"PhoneNumber": number},
+            auth=(sid, token),
+            timeout=6.0,
+        )
+        rows = resp.json().get("incoming_phone_numbers", []) if resp.status_code == 200 else []
+    except Exception as exc:  # noqa: BLE001
+        sec.checks.append(Check(WARN, "Webhook check failed", str(exc)))
+        return sec
+    if not rows:
+        sec.checks.append(
+            Check(
+                WARN,
+                "Number not on account",
+                f"{number} not found",
+                "buy/port the number in Twilio, or fix PATTER_PHONE_NUMBER",
+            )
+        )
+        return sec
+    voice_url = rows[0].get("voice_url", "")
+    if voice_url:
+        sec.checks.append(Check(OK, "Webhook configured", voice_url))
+    else:
+        sec.checks.append(
+            Check(
+                WARN,
+                "Webhook not configured",
+                f"{number} has no voice webhook",
+                f'patter hermes attach-number {number} --url https://<tunnel>/calls/inbound',
+            )
+        )
+    return sec
+
+
+# ──────────────────────────────────────────────────────────────────────────
+# Rendering
+# ──────────────────────────────────────────────────────────────────────────
+def _print_sections(sections: list[Section]) -> None:
+    for sec in sections:
+        print(f"\n{sec.title}")
+        for c in sec.checks:
+            sym = _color(_SYMBOL.get(c.status, "?"), c.status)
+            line = f"  {sym} {c.label}"
+            if c.detail:
+                line += f": {c.detail}"
+            print(line)
+            if c.fix and c.status in (WARN, FAIL):
+                print(f"      fix: {c.fix}")
+
+
+def _sections_to_dict(sections: list[Section]) -> dict:
+    return {
+        "sections": [
+            {
+                "title": s.title,
+                "checks": [
+                    {
+                        "status": c.status,
+                        "label": c.label,
+                        "detail": c.detail,
+                        "fix": c.fix,
+                    }
+                    for c in s.checks
+                ],
+            }
+            for s in sections
+        ],
+        "failures": sum(
+            1 for s in sections for c in s.checks if c.status == FAIL
+        ),
+        "warnings": sum(
+            1 for s in sections for c in s.checks if c.status == WARN
+        ),
+    }
+
+
+def _run_doctor(args: argparse.Namespace) -> list[Section]:
+    base_url = _hermes_base_url(getattr(args, "base_url", None))
+    network = not getattr(args, "no_network", False)
+    return [
+        _check_hermes(base_url, network=network),
+        _check_patter(),
+        _check_twilio(network=network),
+    ]
+
+
+# ──────────────────────────────────────────────────────────────────────────
+# Subcommands
+# ──────────────────────────────────────────────────────────────────────────
+def cmd_doctor(args: argparse.Namespace) -> int:
+    sections = _run_doctor(args)
+    report = _sections_to_dict(sections)
+    if getattr(args, "json", False):
+        print(json.dumps(report, indent=2))
+    else:
+        _print_sections(sections)
+        print()
+        if report["failures"]:
+            print(
+                f"{report['failures']} problem(s) to fix, "
+                f"{report['warnings']} warning(s)."
+            )
+        elif report["warnings"]:
+            print(f"Ready, with {report['warnings']} warning(s).")
+        else:
+            print("All checks passed. You're ready to take calls.")
+    return 1 if report["failures"] else 0
+
+
+def cmd_setup(args: argparse.Namespace) -> int:
+    from getpatter import _hermes_scaffold
+
+    target = Path(args.dir).resolve()
+    interactive = sys.stdin.isatty() and not args.yes
+
+    print(f"Patter + Hermes setup\n  project: {target}\n")
+
+    # 1. Preflight.
+    print("Checking your environment…")
+    sections = _run_doctor(args)
+    _print_sections(sections)
+    failures = sum(1 for s in sections for c in s.checks if c.status == FAIL)
+    print()
+
+    # 2. Scaffold the project.
+    if interactive and not _confirm(f"Scaffold the project into {target}?"):
+        print("Skipped scaffolding.")
+    else:
+        written = _hermes_scaffold.scaffold(target, force=args.force)
+        if written:
+            print(f"Wrote {len(written)} file(s):")
+            for p in written:
+                print(f"  + {p.relative_to(target)}")
+        else:
+            print("Project already exists (use --force to overwrite).")
+        env_path = target / ".env"
+        if not env_path.exists():
+            env_path.write_text(
+                (target / ".env.example").read_text(encoding="utf-8"),
+                encoding="utf-8",
+            )
+            print("  + .env (from .env.example — fill in your keys)")
+    print()
+
+    # 3. Optionally attach a Twilio number.
+    if args.number and args.url:
+        print(f"Attaching {args.number} → {args.url}")
+        rc = _attach_number(args.number, args.url, args.status_callback)
+        if rc != 0:
+            return rc
+    elif args.number or args.url:
+        print(
+            "Note: pass both --number and --url to auto-configure the Twilio "
+            "webhook, or run `patter hermes attach-number` later."
+        )
+
+    # 4. Next steps.
+    print("\nNext steps:")
+    print(f"  cd {target}")
+    print("  # edit .env with your keys")
+    print("  python scripts/test_text_turn.py   # smoke-test the Hermes brain")
+    print("  python app.py                      # answer the phone")
+    if not (args.number and args.url):
+        print(
+            "  patter hermes attach-number <number> --url <tunnel-url>/calls/inbound"
+        )
+    return 1 if failures else 0
+
+
+def cmd_attach_number(args: argparse.Namespace) -> int:
+    return _attach_number(args.number, args.url, args.status_callback)
+
+
+def cmd_numbers(args: argparse.Namespace) -> int:
+    sid, token = _twilio_creds()
+    if not sid or not token:
+        print(
+            "Twilio credentials not found. Set TWILIO_ACCOUNT_SID and "
+            "TWILIO_AUTH_TOKEN.",
+            file=sys.stderr,
+        )
+        return 2
+    try:
+        import httpx
+
+        resp = httpx.get(
+            f"https://api.twilio.com/2010-04-01/Accounts/{sid}/IncomingPhoneNumbers.json",
+            auth=(sid, token),
+            params={"PageSize": 50},
+            timeout=10.0,
+        )
+    except Exception as exc:  # noqa: BLE001
+        print(f"Twilio API error: {exc}", file=sys.stderr)
+        return 1
+    if resp.status_code != 200:
+        print(f"Twilio returned HTTP {resp.status_code}", file=sys.stderr)
+        return 1
+    rows = resp.json().get("incoming_phone_numbers", [])
+    if not rows:
+        print("No phone numbers on this account.")
+        return 0
+    print(f"{len(rows)} number(s):")
+    for r in rows:
+        num = r.get("phone_number", "?")
+        url = r.get("voice_url", "") or "(no voice webhook)"
+        print(f"  {num}  →  {url}")
+    return 0
+
+
+# ──────────────────────────────────────────────────────────────────────────
+# Twilio helpers
+# ──────────────────────────────────────────────────────────────────────────
+def _twilio_creds() -> tuple[str, str]:
+    return (
+        os.environ.get("TWILIO_ACCOUNT_SID", ""),
+        os.environ.get("TWILIO_AUTH_TOKEN", ""),
+    )
+
+
+def _attach_number(number: str, url: str, status_callback: str | None) -> int:
+    """Set a Twilio number's voice webhook. Returns a process exit code."""
+    sid, token = _twilio_creds()
+    if not sid or not token:
+        print(
+            "Twilio credentials not found. Set TWILIO_ACCOUNT_SID and "
+            "TWILIO_AUTH_TOKEN.",
+            file=sys.stderr,
+        )
+        return 2
+    if not url.lower().startswith("https://"):
+        print(f"Webhook URL must be https:// (got {url!r})", file=sys.stderr)
+        return 2
+    try:
+        import httpx
+    except ImportError:  # pragma: no cover
+        print("httpx is required for attach-number.", file=sys.stderr)
+        return 1
+
+    base = f"https://api.twilio.com/2010-04-01/Accounts/{sid}"
+    # Resolve the number's SID.
+    try:
+        lookup = httpx.get(
+            f"{base}/IncomingPhoneNumbers.json",
+            params={"PhoneNumber": number},
+            auth=(sid, token),
+            timeout=10.0,
+        )
+    except Exception as exc:  # noqa: BLE001
+        print(f"Twilio API error: {exc}", file=sys.stderr)
+        return 1
+    if lookup.status_code != 200:
+        print(f"Twilio returned HTTP {lookup.status_code}", file=sys.stderr)
+        return 1
+    rows = lookup.json().get("incoming_phone_numbers", [])
+    if not rows:
+        print(
+            f"{number} is not on this account. Run `patter hermes numbers` to "
+            "list available numbers.",
+            file=sys.stderr,
+        )
+        return 1
+    number_sid = rows[0].get("sid")
+
+    data = {"VoiceUrl": url, "VoiceMethod": "POST"}
+    if status_callback:
+        data["StatusCallback"] = status_callback
+        data["StatusCallbackMethod"] = "POST"
+    try:
+        upd = httpx.post(
+            f"{base}/IncomingPhoneNumbers/{number_sid}.json",
+            data=data,
+            auth=(sid, token),
+            timeout=10.0,
+        )
+    except Exception as exc:  # noqa: BLE001
+        print(f"Twilio API error: {exc}", file=sys.stderr)
+        return 1
+    if upd.status_code in (200, 201):
+        print(f"✓ {number} voice webhook → {url}")
+        if status_callback:
+            print(f"✓ status callback → {status_callback}")
+        return 0
+    print(
+        f"Failed to update webhook: HTTP {upd.status_code} {upd.text[:200]}",
+        file=sys.stderr,
+    )
+    return 1
+
+
+def _confirm(prompt: str) -> bool:
+    try:
+        return input(f"{prompt} [Y/n] ").strip().lower() in ("", "y", "yes")
+    except (EOFError, KeyboardInterrupt):
+        return False
+
+
+# ──────────────────────────────────────────────────────────────────────────
+# Parser wiring
+# ──────────────────────────────────────────────────────────────────────────
+def build_hermes_parser(subparsers: argparse._SubParsersAction) -> argparse.ArgumentParser:
+    """Attach the ``hermes`` subcommand tree to the parent CLI."""
+    hermes = subparsers.add_parser(
+        "hermes",
+        help="Set up, diagnose, and wire the Hermes voice shell",
+    )
+    hsub = hermes.add_subparsers(dest="hermes_command")
+
+    doctor = hsub.add_parser("doctor", help="Preflight checks for the Hermes voice shell")
+    doctor.add_argument("--base-url", default=None, help="Hermes gateway base URL")
+    doctor.add_argument("--no-network", action="store_true", help="Skip live probes")
+    doctor.add_argument("--json", action="store_true", help="Machine-readable output")
+
+    setup = hsub.add_parser("setup", help="Scaffold a hermes-phone-agent project")
+    setup.add_argument("--dir", default="hermes-phone-agent", help="Target directory")
+    setup.add_argument("--force", action="store_true", help="Overwrite existing files")
+    setup.add_argument("--yes", action="store_true", help="Non-interactive (assume yes)")
+    setup.add_argument("--number", default=None, help="Twilio number to attach")
+    setup.add_argument("--url", default=None, help="Public webhook URL to attach")
+    setup.add_argument("--status-callback", default=None, help="Twilio status callback URL")
+    setup.add_argument("--base-url", default=None, help="Hermes gateway base URL")
+    setup.add_argument("--no-network", action="store_true", help="Skip live probes")
+
+    attach = hsub.add_parser("attach-number", help="Point a Twilio number at your Patter URL")
+    attach.add_argument("number", help="Phone number in E.164 (e.g. +15551234567)")
+    attach.add_argument("--url", required=True, help="Public voice webhook URL (https)")
+    attach.add_argument("--status-callback", default=None, help="Twilio status callback URL")
+
+    hsub.add_parser("numbers", help="List the Twilio numbers on your account")
+    return hermes
+
+
+def dispatch_hermes(args: argparse.Namespace) -> int:
+    """Entry for ``patter hermes ...``. Returns a process exit code."""
+    command = getattr(args, "hermes_command", None)
+    if command == "doctor":
+        return cmd_doctor(args)
+    if command == "setup":
+        return cmd_setup(args)
+    if command == "attach-number":
+        return cmd_attach_number(args)
+    if command == "numbers":
+        return cmd_numbers(args)
+    print(
+        "Usage: patter hermes {doctor|setup|attach-number|numbers}\n"
+        "Try:   patter hermes doctor",
+        file=sys.stderr,
+    )
+    return 2
diff --git a/libraries/python/tests/unit/test_hermes_cli.py b/libraries/python/tests/unit/test_hermes_cli.py
new file mode 100644
index 0000000..68298f5
--- /dev/null
+++ b/libraries/python/tests/unit/test_hermes_cli.py
@@ -0,0 +1,184 @@
+"""Unit tests for the ``patter hermes ...`` CLI (doctor / setup / attach-number).
+
+Live probes are exercised by monkeypatching ``httpx`` at the boundary, so no
+real Hermes gateway or Twilio account is touched.
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import pytest
+
+from getpatter import _hermes_scaffold, cli_hermes
+
+
+# ── scaffold ───────────────────────────────────────────────────────────────
+def test_scaffold_writes_all_files(tmp_path: Path) -> None:
+    written = _hermes_scaffold.scaffold(tmp_path)
+    rel = {p.relative_to(tmp_path).as_posix() for p in written}
+    assert rel == set(_hermes_scaffold.FILES)
+    for name in _hermes_scaffold.FILES:
+        assert (tmp_path / name).exists()
+
+
+def test_scaffold_skips_existing_without_force(tmp_path: Path) -> None:
+    (tmp_path / "app.py").write_text("# mine", encoding="utf-8")
+    written = _hermes_scaffold.scaffold(tmp_path)
+    assert tmp_path / "app.py" not in written
+    assert (tmp_path / "app.py").read_text(encoding="utf-8") == "# mine"
+    # force overwrites.
+    written2 = _hermes_scaffold.scaffold(tmp_path, force=True)
+    assert tmp_path / "app.py" in written2
+    assert (tmp_path / "app.py").read_text(encoding="utf-8") != "# mine"
+
+
+def test_committed_example_matches_scaffold() -> None:
+    """The committed examples/ tree must stay in sync with the scaffold map."""
+    root = Path(__file__).resolve().parents[4] / "examples" / "hermes-phone-agent"
+    assert root.is_dir(), f"missing example dir: {root}"
+    for rel, content in _hermes_scaffold.FILES.items():
+        on_disk = (root / rel).read_text(encoding="utf-8")
+        assert on_disk == content, f"{rel} drifted from the scaffold"
+
+
+# ── doctor ─────────────────────────────────────────────────────────────────
+def _doctor_args(**over) -> argparse.Namespace:
+    base = {"base_url": None, "no_network": True, "json": True}
+    base.update(over)
+    return argparse.Namespace(**base)
+
+
+def test_doctor_no_network_skips_probes(capsys, monkeypatch) -> None:
+    for var in ("API_SERVER_KEY", "DEEPGRAM_API_KEY", "ELEVENLABS_API_KEY"):
+        monkeypatch.delenv(var, raising=False)
+    rc = cli_hermes.cmd_doctor(_doctor_args())
+    out = capsys.readouterr().out
+    assert '"skipped (--no-network)"' in out
+    # warnings don't fail the run.
+    assert rc == 0
+
+
+def test_doctor_gateway_unreachable_is_failure(monkeypatch) -> None:
+    # Force the gateway probe to look like a refused connection.
+    def boom(*_a, **_k):
+        raise OSError("Connection refused")
+
+    monkeypatch.setattr("httpx.get", boom)
+    sections = cli_hermes._check_hermes("http://127.0.0.1:8642/v1", network=True)
+    statuses = {c.label: c.status for c in sections.checks}
+    assert statuses["Gateway unreachable"] == cli_hermes.FAIL
+
+
+def test_doctor_gateway_ok_and_model_present(monkeypatch) -> None:
+    class Resp:
+        status_code = 200
+
+        @staticmethod
+        def json():
+            return {"data": [{"id": "hermes-agent"}]}
+
+    monkeypatch.setenv("API_SERVER_KEY", "k")
+    monkeypatch.setattr("httpx.get", lambda *a, **k: Resp())
+    sec = cli_hermes._check_hermes("http://127.0.0.1:8642/v1", network=True)
+    labels = {c.label: c.status for c in sec.checks}
+    assert labels["Gateway reachable"] == cli_hermes.OK
+    assert labels["Model available"] == cli_hermes.OK
+
+
+def test_doctor_exit_code_one_when_failures(monkeypatch) -> None:
+    monkeypatch.setattr(
+        cli_hermes,
+        "_run_doctor",
+        lambda _a: [cli_hermes.Section("X", [cli_hermes.Check(cli_hermes.FAIL, "bad")])],
+    )
+    assert cli_hermes.cmd_doctor(_doctor_args(json=False)) == 1
+
+
+# ── attach-number ──────────────────────────────────────────────────────────
+def test_attach_number_requires_https(monkeypatch, capsys) -> None:
+    monkeypatch.setenv("TWILIO_ACCOUNT_SID", "AC123")
+    monkeypatch.setenv("TWILIO_AUTH_TOKEN", "tok")
+    rc = cli_hermes._attach_number("+15551234567", "http://x/y", None)
+    assert rc == 2
+    assert "https" in capsys.readouterr().err
+
+
+def test_attach_number_missing_creds(monkeypatch, capsys) -> None:
+    monkeypatch.delenv("TWILIO_ACCOUNT_SID", raising=False)
+    monkeypatch.delenv("TWILIO_AUTH_TOKEN", raising=False)
+    rc = cli_hermes._attach_number("+15551234567", "https://x/y", None)
+    assert rc == 2
+    assert "credentials not found" in capsys.readouterr().err.lower()
+
+
+def test_attach_number_posts_voice_url(monkeypatch, capsys) -> None:
+    monkeypatch.setenv("TWILIO_ACCOUNT_SID", "AC123")
+    monkeypatch.setenv("TWILIO_AUTH_TOKEN", "tok")
+    posted: dict = {}
+
+    class Lookup:
+        status_code = 200
+
+        @staticmethod
+        def json():
+            return {"incoming_phone_numbers": [{"sid": "PN1"}]}
+
+    class Update:
+        status_code = 200
+        text = ""
+
+    def fake_get(url, **kw):
+        assert "IncomingPhoneNumbers.json" in url
+        return Lookup()
+
+    def fake_post(url, **kw):
+        posted["url"] = url
+        posted["data"] = kw.get("data")
+        return Update()
+
+    monkeypatch.setattr("httpx.get", fake_get)
+    monkeypatch.setattr("httpx.post", fake_post)
+    rc = cli_hermes._attach_number(
+        "+15551234567", "https://abc.example.com/calls/inbound", None
+    )
+    assert rc == 0
+    assert "PN1.json" in posted["url"]
+    assert posted["data"]["VoiceUrl"] == "https://abc.example.com/calls/inbound"
+    assert posted["data"]["VoiceMethod"] == "POST"
+    assert "voice webhook" in capsys.readouterr().out
+
+
+def test_attach_number_unknown_number(monkeypatch, capsys) -> None:
+    monkeypatch.setenv("TWILIO_ACCOUNT_SID", "AC123")
+    monkeypatch.setenv("TWILIO_AUTH_TOKEN", "tok")
+
+    class Lookup:
+        status_code = 200
+
+        @staticmethod
+        def json():
+            return {"incoming_phone_numbers": []}
+
+    monkeypatch.setattr("httpx.get", lambda *a, **k: Lookup())
+    rc = cli_hermes._attach_number("+15550000000", "https://x/y", None)
+    assert rc == 1
+    assert "not on this account" in capsys.readouterr().err
+
+
+# ── dispatch ───────────────────────────────────────────────────────────────
+def test_dispatch_unknown_subcommand_returns_usage() -> None:
+    args = argparse.Namespace(hermes_command=None)
+    assert cli_hermes.dispatch_hermes(args) == 2
+
+
+def test_parser_wires_subcommands() -> None:
+    parser = argparse.ArgumentParser()
+    sub = parser.add_subparsers(dest="command")
+    cli_hermes.build_hermes_parser(sub)
+    ns = parser.parse_args(["hermes", "attach-number", "+15551234567", "--url", "https://x/y"])
+    assert ns.command == "hermes"
+    assert ns.hermes_command == "attach-number"
+    assert ns.number == "+15551234567"
+    assert ns.url == "https://x/y"
diff --git a/libraries/typescript/src/cli.ts b/libraries/typescript/src/cli.ts
index fd83779..0d8fd77 100644
--- a/libraries/typescript/src/cli.ts
+++ b/libraries/typescript/src/cli.ts
@@ -43,15 +43,33 @@ function printEvalStub(): void {
   );
 }
 
+function printHermesStub(): void {
+  console.log(
+    'The Hermes setup wizard (doctor / setup / attach-number) lives in the\n' +
+      'Python CLI today. Use it from the Python SDK:\n\n' +
+      '  pip install getpatter\n' +
+      '  patter hermes doctor\n' +
+      '  patter hermes setup\n\n' +
+      'The HermesLLM provider itself is fully available in this TypeScript SDK\n' +
+      "(import { HermesLLM } from 'getpatter'). See\n" +
+      'https://docs.getpatter.com/integrations/hermes for docs.',
+  );
+}
+
 async function main(): Promise<void> {
   const command = process.argv[2];
   if (command === 'eval') {
     printEvalStub();
     process.exit(0);
   }
+  if (command === 'hermes') {
+    printHermesStub();
+    process.exit(0);
+  }
   if (command !== 'dashboard') {
     console.log('Usage: getpatter dashboard [--port 8000]');
     console.log('       getpatter eval          (stub — use Python SDK for evals)');
+    console.log('       getpatter hermes        (stub — use Python SDK for the wizard)');
     process.exit(command ? 1 : 0);
   }
 

From 2b40d1f876c11e357845ee58e2190525fd51f75e Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 9 Jun 2026 08:35:34 +0000
Subject: [PATCH 09/11] =?UTF-8?q?feat(hermes):=20doctor=20&=20setup=20read?=
 =?UTF-8?q?=20real=20config=20=E2=80=94=20.env=20autoload,=20~/.hermes=20d?=
 =?UTF-8?q?etection,=20key-gen,=20--enable-hermes?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address the review gaps on the Hermes wizard: it now reads and (opt-in) writes
real config instead of only consulting os.environ.

doctor:
- Autoloads dotenv files before checking — ~/.hermes/.env then the project/cwd
  .env (non-overriding), with --env-file/--no-env-file to control it. Loaded
  paths are reported; secrets are never echoed.
- Reads ~/.hermes/.env + config.yaml directly: reports API_SERVER_ENABLED,
  surfaces the configured key/port/model, and runs `hermes gateway status` when
  the CLI is present.
- Sharper severity: CLI missing AND gateway unreachable is now a failure, not a
  soft warning; gateway-down fix adapts to whether the CLI is available.
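
For illustration only, the autoload precedence can be sketched as below. This is
a standalone approximation, not the code in cli_hermes.py: `parse_env` and
`load_env_chain` are hypothetical names, quoting and `export` handling are left
out, and it assumes that later files beat earlier ones while values already
present in the process environment always win.

```python
from __future__ import annotations

from pathlib import Path


def parse_env(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments (hypothetical helper)."""
    out: dict[str, str] = {}
    if not path.is_file():
        return out
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip():
            out[key.strip()] = val.strip()
    return out


def load_env_chain(paths: list[Path], environ: dict[str, str]) -> list[Path]:
    """Apply dotenv files in order; return the files that contributed values.

    Assumption: keys present in `environ` before loading are never replaced,
    and among the files themselves a later path overrides an earlier one.
    """
    preexisting = set(environ)  # snapshot taken before any file is applied
    applied: list[Path] = []
    for path in paths:
        values = parse_env(path)
        if not values:
            continue  # missing or empty files are not reported as loaded
        for key, val in values.items():
            if key not in preexisting:
                environ[key] = val
        applied.append(path)
    return applied
```

With `[~/.hermes/.env, ./.env]` as the chain, a key exported in the shell stays
untouched, and a key present in both files takes the project value.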

setup:
- --enable-hermes writes API_SERVER_ENABLED=true (and generates an
  API_SERVER_KEY if absent) into ~/.hermes/.env, backing up to .env.bak first,
  then reminds the operator to restart the gateway.
- --generate-key writes a strong key into the project .env (an existing key is
  kept unless --force). With --enable-hermes the gateway's key is mirrored
  instead, so Patter and Hermes agree (a mismatch is a 401 at call time).
- Autoloads env for the preflight so checks reflect the project's .env.
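
A rough sketch of the key-mirroring idea, operating on dotenv text rather than
files so it stays self-contained. `read_value`, `upsert`, and
`mirror_gateway_key` are hypothetical names used only for this illustration;
the real helpers also handle backups, quoting, and `export` prefixes.

```python
from __future__ import annotations

import secrets


def read_value(text: str, key: str) -> str:
    """Return the first value assigned to `key`, or '' if it is absent."""
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, _, val = line.partition("=")
        if name.strip() == key:
            return val.strip()
    return ""


def upsert(text: str, key: str, value: str) -> str:
    """Replace `key` in place if present, else append it; keep other lines."""
    lines = text.splitlines()
    for i, raw in enumerate(lines):
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        if line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            break
    else:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def mirror_gateway_key(gateway_env: str, project_env: str) -> tuple[str, str, str]:
    """Enable the gateway and make both dotenv texts carry the same key."""
    # Reuse the gateway's existing key; only generate one when none is set.
    key = read_value(gateway_env, "API_SERVER_KEY") or secrets.token_urlsafe(32)
    gateway_env = upsert(gateway_env, "API_SERVER_ENABLED", "true")
    gateway_env = upsert(gateway_env, "API_SERVER_KEY", key)
    project_env = upsert(project_env, "API_SERVER_KEY", key)
    return gateway_env, project_env, key
```

Reusing the gateway's key instead of minting a second one is what keeps the two
files from drifting apart.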

New helpers (_parse_env_file / _upsert_env_file / _load_env_files /
_read_hermes_config / _enable_hermes_gateway / _generate_key), no new deps.
+11 unit tests; docs updated.

https://claude.ai/code/session_01TNysNGx7woXM99fHBjpsts
---
 docs/integrations/hermes.mdx                  |  25 +-
 libraries/python/getpatter/cli_hermes.py      | 339 +++++++++++++++++-
 .../python/tests/unit/test_hermes_cli.py      | 122 ++++++-
 3 files changed, 466 insertions(+), 20 deletions(-)

diff --git a/docs/integrations/hermes.mdx b/docs/integrations/hermes.mdx
index 160c48c..ce7dccf 100644
--- a/docs/integrations/hermes.mdx
+++ b/docs/integrations/hermes.mdx
@@ -176,13 +176,26 @@ patter hermes doctor        # preflight: gateway, providers, carrier — with fi
 patter hermes setup         # scaffold ./hermes-phone-agent (app.py, .env, scripts)
 ```
 
-`patter hermes doctor` probes the Hermes gateway (`/v1/models`), confirms `HermesLLM` is
-constructible, checks your Deepgram / ElevenLabs / Twilio credentials, and prints a
-suggested fix for anything missing (`--no-network` skips live probes, `--json` for
-machine-readable output). `patter hermes setup` writes the same starter project shown in
+`patter hermes doctor` reads your Hermes config directly. It autoloads `~/.hermes/.env`
+and the `.env` in the current directory, reports whether `API_SERVER_ENABLED` is set and
+which gateway port is configured, and runs `hermes gateway status` when the CLI is present.
+It then probes the gateway (`/v1/models`), confirms `HermesLLM` is constructible, checks
+your Deepgram / ElevenLabs / Twilio credentials, and prints a suggested fix for anything
+missing. Flags: `--no-network` skips live probes, `--json` gives machine-readable output,
+and `--env-file` / `--no-env-file` control autoloading.
+
+`patter hermes setup` writes the same starter project shown in
 [`examples/hermes-phone-agent`](https://github.com/PatterAI/Patter/tree/main/examples/hermes-phone-agent)
-and, given `--number` and `--url`, attaches the Twilio webhook for you. To wire an existing
-number on its own:
+and can also wire the two ends together for you:
+
+- `--enable-hermes` writes `API_SERVER_ENABLED=true` (and generates an `API_SERVER_KEY` if
+  absent) into `~/.hermes/.env`, backing the file up first — then reminds you to restart the
+  gateway. The same key is mirrored into the project `.env` so Patter and Hermes agree (a
+  mismatch is a 401 at call time).
+- `--generate-key` puts a strong `API_SERVER_KEY` into the project `.env`.
+- `--number` + `--url` attach the Twilio webhook in the same run.
+
+To wire an existing number on its own:
 
 ```bash
 patter hermes numbers       # list the numbers on your Twilio account
diff --git a/libraries/python/getpatter/cli_hermes.py b/libraries/python/getpatter/cli_hermes.py
index 1c3c603..4e271a1 100644
--- a/libraries/python/getpatter/cli_hermes.py
+++ b/libraries/python/getpatter/cli_hermes.py
@@ -91,25 +91,225 @@ def _get_json(url: str, *, headers: dict | None = None, timeout: float = 4.0):
     return resp.status_code, body, ""
 
 
+# ──────────────────────────────────────────────────────────────────────────
+# .env / Hermes config helpers
+# ──────────────────────────────────────────────────────────────────────────
+def _parse_env_file(path: Path) -> dict[str, str]:
+    """Parse a ``KEY=VALUE`` dotenv file; skips blanks/comments, strips ``export``.
+
+    Surrounding single/double quotes are stripped. Returns an empty dict if the
+    file is missing or unreadable.
+    """
+    out: dict[str, str] = {}
+    try:
+        text = path.read_text(encoding="utf-8")
+    except OSError:
+        return out
+    for raw in text.splitlines():
+        line = raw.strip()
+        if not line or line.startswith("#") or "=" not in line:
+            continue
+        if line.startswith("export "):
+            line = line[len("export ") :].lstrip()
+        key, _, val = line.partition("=")
+        key = key.strip()
+        val = val.strip()
+        if len(val) >= 2 and val[0] == val[-1] and val[0] in ("'", '"'):
+            val = val[1:-1]
+        if key:
+            out[key] = val
+    return out
+
+
+def _hermes_home() -> Path:
+    """Hermes config dir — ``$HERMES_HOME`` or ``~/.hermes`` (override for tests)."""
+    override = os.environ.get("HERMES_HOME")
+    return Path(override) if override else Path.home() / ".hermes"
+
+
+def _read_hermes_config() -> dict[str, str]:
+    """Read Hermes ``api_server`` settings from ``~/.hermes/.env`` and
+    ``~/.hermes/config.yaml``.
+
+    Returns a flat dict that may include ``API_SERVER_ENABLED``,
+    ``API_SERVER_KEY``, ``API_SERVER_HOST``, ``API_SERVER_PORT``, and
+    ``API_SERVER_MODEL_NAME``. The ``.env`` file wins over ``config.yaml``.
+    Returns an empty dict when ``~/.hermes`` is absent.
+    """
+    home = _hermes_home()
+    if not home.exists():
+        return {}
+    cfg: dict[str, str] = {}
+    yaml_path = home / "config.yaml"
+    if yaml_path.exists():
+        try:
+            import yaml
+
+            data = yaml.safe_load(yaml_path.read_text(encoding="utf-8")) or {}
+            api = data.get("api_server") if isinstance(data, dict) else None
+            if isinstance(api, dict):
+                _map = {
+                    "enabled": "API_SERVER_ENABLED",
+                    "key": "API_SERVER_KEY",
+                    "host": "API_SERVER_HOST",
+                    "port": "API_SERVER_PORT",
+                    "model_name": "API_SERVER_MODEL_NAME",
+                }
+                for src, dst in _map.items():
+                    if src in api and api[src] is not None:
+                        cfg[dst] = str(api[src])
+        except Exception:  # noqa: BLE001 - malformed yaml shouldn't crash doctor
+            pass
+    cfg.update(_parse_env_file(home / ".env"))  # .env overrides config.yaml
+    return cfg
+
+
+def _env_files_to_load(
+    explicit: list[str] | None, *, project_dir: Path | None
+) -> list[Path]:
+    """Resolve which dotenv files to autoload, in increasing priority order."""
+    if explicit:
+        return [Path(p) for p in explicit]
+    chain: list[Path] = [_hermes_home() / ".env"]
+    if project_dir is not None:
+        chain.append(project_dir / ".env")
+    chain.append(Path.cwd() / ".env")
+    return chain
+
+
+def _load_env_files(paths: list[Path], *, override: bool = False) -> list[Path]:
+    """Load dotenv files into ``os.environ``; return the files actually applied.
+
+    Later paths win over earlier ones; pre-existing env wins unless ``override``.
+    """
+    applied: list[Path] = []
+    preexisting = set(os.environ)  # snapshot, so later files can beat earlier ones
+    for path in paths:
+        values = _parse_env_file(path)
+        if not values:
+            continue
+        for key, val in values.items():
+            if override or key not in preexisting:
+                os.environ[key] = val
+        applied.append(path)
+    return applied
+
+
+def _upsert_env_file(path: Path, updates: dict[str, str]) -> None:
+    """Set ``KEY=VALUE`` pairs in a dotenv file, preserving other lines.
+
+    Existing keys are replaced in place; new keys are appended. Creates the file
+    (and parent dir) if missing.
+    """
+    path.parent.mkdir(parents=True, exist_ok=True)
+    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
+    remaining = dict(updates)
+    for i, raw in enumerate(lines):
+        stripped = raw.strip()
+        if not stripped or stripped.startswith("#") or "=" not in stripped:
+            continue
+        key = stripped.split("=", 1)[0].strip()
+        if key.startswith("export "):
+            key = key[len("export ") :].strip()
+        if key in remaining:
+            lines[i] = f"{key}={remaining.pop(key)}"
+    for key, val in remaining.items():
+        lines.append(f"{key}={val}")
+    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+
+
+def _generate_key() -> str:
+    """A strong, URL-safe API key."""
+    import secrets
+
+    return secrets.token_urlsafe(32)
+
+
+def _hermes_gateway_status() -> Check | None:
+    """Best-effort ``hermes gateway status`` probe. ``None`` if CLI absent."""
+    if not shutil.which("hermes"):
+        return None
+    import subprocess
+
+    try:
+        proc = subprocess.run(
+            ["hermes", "gateway", "status"],
+            capture_output=True,
+            text=True,
+            timeout=8,
+        )
+    except Exception:  # noqa: BLE001 - missing subcommand / timeout
+        return None
+    if proc.returncode == 0:
+        first = (proc.stdout or proc.stderr or "").strip().splitlines()
+        return Check(OK, "Gateway service", first[0] if first else "status ok")
+    return Check(
+        WARN,
+        "Gateway service",
+        "hermes gateway status reported a problem",
+        "hermes gateway start",
+    )
+
+
 # ──────────────────────────────────────────────────────────────────────────
 # Check groups
 # ──────────────────────────────────────────────────────────────────────────
 def _check_hermes(base_url: str, *, network: bool) -> Section:
     sec = Section("Hermes")
+    have_cli = bool(shutil.which("hermes"))
+    hermes_cfg = _read_hermes_config()
+    home = _hermes_home()
 
-    if shutil.which("hermes"):
+    # CLI presence (severity finalised at the end once we know gateway state).
+    if have_cli:
         sec.checks.append(Check(OK, "CLI found", "hermes on PATH"))
     else:
         sec.checks.append(
             Check(
                 WARN,
                 "CLI not found",
-                "optional — only the gateway is required",
+                "optional when the gateway is already running",
                 "install Hermes: https://github.com/NousResearch/hermes-agent",
             )
         )
 
-    key = os.environ.get("API_SERVER_KEY", "")
+    # Hermes-side config (~/.hermes/.env + config.yaml), read directly.
+    if hermes_cfg:
+        enabled = hermes_cfg.get("API_SERVER_ENABLED", "").lower()
+        if enabled in ("true", "1", "yes"):
+            sec.checks.append(
+                Check(OK, "API server enabled", f"API_SERVER_ENABLED=true in {home}")
+            )
+        elif enabled:
+            sec.checks.append(
+                Check(
+                    WARN,
+                    "API server disabled",
+                    f"API_SERVER_ENABLED={enabled!r} in {home}",
+                    "patter hermes setup --enable-hermes",
+                )
+            )
+        else:
+            sec.checks.append(
+                Check(
+                    WARN,
+                    "API server flag absent",
+                    f"API_SERVER_ENABLED not set in {home}",
+                    "patter hermes setup --enable-hermes",
+                )
+            )
+    else:
+        sec.checks.append(
+            Check(SKIP, "Hermes config", f"no Hermes config found in {home}")
+        )
+
+    # Gateway service status via the CLI (best-effort).
+    gw_status = _hermes_gateway_status()
+    if gw_status is not None:
+        sec.checks.append(gw_status)
+
+    # A key may live in the process env or in ~/.hermes — accept either.
+    key = os.environ.get("API_SERVER_KEY", "") or hermes_cfg.get("API_SERVER_KEY", "")
     if key:
         sec.checks.append(Check(OK, "API_SERVER_KEY set"))
     else:
@@ -118,7 +318,7 @@ def _check_hermes(base_url: str, *, network: bool) -> Section:
                 WARN,
                 "API_SERVER_KEY not set",
                 "keyless local gateways work, but a key is recommended",
-                'export API_SERVER_KEY="choose-a-strong-key"',
+                "patter hermes setup --generate-key",
             )
         )
 
@@ -130,7 +330,9 @@ def _check_hermes(base_url: str, *, network: bool) -> Section:
     status, body, err = _get_json(f"{base_url}/models", headers=headers)
     if status == 200:
         sec.checks.append(Check(OK, "Gateway reachable", base_url))
-        want = os.environ.get("API_SERVER_MODEL_NAME", "hermes-agent")
+        want = os.environ.get("API_SERVER_MODEL_NAME") or hermes_cfg.get(
+            "API_SERVER_MODEL_NAME", "hermes-agent"
+        )
         ids = _model_ids(body)
         if want in ids:
             sec.checks.append(Check(OK, "Model available", want))
@@ -140,7 +342,7 @@ def _check_hermes(base_url: str, *, network: bool) -> Section:
                     WARN,
                     "Model not found",
                     f"{want!r} missing; saw {', '.join(sorted(ids)[:5])}",
-                    f'set API_SERVER_MODEL_NAME to one of the served models',
+                    "set API_SERVER_MODEL_NAME to one of the served models",
                 )
             )
         else:
@@ -158,15 +360,26 @@ def _check_hermes(base_url: str, *, network: bool) -> Section:
         )
     else:
         detail = f"HTTP {status}" if status else (err or "no response")
+        # Gateway down with no CLI to start it is a hard stop — promote the
+        # CLI-not-found note to a failure so the verdict isn't a soft warning.
+        fix = (
+            "hermes gateway start"
+            if have_cli
+            else "install Hermes + start the gateway (API_SERVER_ENABLED=true)"
+        )
         sec.checks.append(
             Check(
                 FAIL,
                 "Gateway unreachable",
                 f"{base_url} — {detail}",
-                "enable + start the gateway (API_SERVER_ENABLED=true), "
-                "or pass --base-url",
+                fix + ", or pass --base-url",
             )
         )
+        if not have_cli:
+            for c in sec.checks:
+                if c.label == "CLI not found":
+                    c.status = FAIL
+                    c.detail = "and the gateway is unreachable"
     return sec
 
 
@@ -388,12 +601,24 @@ def _run_doctor(args: argparse.Namespace) -> list[Section]:
 # ──────────────────────────────────────────────────────────────────────────
 # Subcommands
 # ──────────────────────────────────────────────────────────────────────────
+def _apply_env(args: argparse.Namespace, *, project_dir: Path | None = None) -> list[Path]:
+    """Autoload dotenv files for a command unless ``--no-env-file`` was passed."""
+    if getattr(args, "no_env_file", False):
+        return []
+    paths = _env_files_to_load(getattr(args, "env_file", None), project_dir=project_dir)
+    return _load_env_files(paths)
+
+
 def cmd_doctor(args: argparse.Namespace) -> int:
+    loaded = _apply_env(args)
     sections = _run_doctor(args)
     report = _sections_to_dict(sections)
     if getattr(args, "json", False):
+        report["loaded_env_files"] = [str(p) for p in loaded]
         print(json.dumps(report, indent=2))
     else:
+        if loaded:
+            print("Loaded env from: " + ", ".join(str(p) for p in loaded))
         _print_sections(sections)
         print()
         if report["failures"]:
@@ -416,14 +641,26 @@ def cmd_setup(args: argparse.Namespace) -> int:
 
     print(f"Patter + Hermes setup\n  project: {target}\n")
 
-    # 1. Preflight.
+    # 0. Optionally enable the Hermes API server in ~/.hermes/.env. This is the
+    #    one step that writes to your Hermes install — explicit opt-in, backed up.
+    gateway_key = ""
+    if getattr(args, "enable_hermes", False):
+        gateway_key = _enable_hermes_gateway()
+        print()
+
+    # 1. Load any existing env (project .env, ~/.hermes/.env) for the preflight.
+    loaded = _apply_env(args, project_dir=target)
+    if loaded:
+        print("Loaded env from: " + ", ".join(str(p) for p in loaded))
+
+    # 2. Preflight.
     print("Checking your environment…")
     sections = _run_doctor(args)
     _print_sections(sections)
     failures = sum(1 for s in sections for c in s.checks if c.status == FAIL)
     print()
 
-    # 2. Scaffold the project.
+    # 3. Scaffold the project.
     if interactive and not _confirm(f"Scaffold the project into {target}?"):
         print("Skipped scaffolding.")
     else:
@@ -441,9 +678,22 @@ def cmd_setup(args: argparse.Namespace) -> int:
                 encoding="utf-8",
             )
             print("  + .env (from .env.example — fill in your keys)")
+        # Put an API_SERVER_KEY into the project .env. When we just enabled the
+        # gateway, reuse ITS key so Patter and Hermes agree (a mismatch is a 401
+        # at call time); otherwise generate a fresh one on --generate-key.
+        if gateway_key:
+            _upsert_env_file(env_path, {"API_SERVER_KEY": gateway_key})
+            print("  + API_SERVER_KEY in .env (matches the gateway key)")
+        elif getattr(args, "generate_key", False):
+            existing = _parse_env_file(env_path).get("API_SERVER_KEY", "")
+            if existing and existing != "choose-a-strong-key" and not args.force:
+                print("  · API_SERVER_KEY already set (use --force to regenerate)")
+            else:
+                _upsert_env_file(env_path, {"API_SERVER_KEY": _generate_key()})
+                print("  + API_SERVER_KEY generated in .env")
     print()
 
-    # 3. Optionally attach a Twilio number.
+    # 4. Optionally attach a Twilio number.
     if args.number and args.url:
         print(f"Attaching {args.number} → {args.url}")
         rc = _attach_number(args.number, args.url, args.status_callback)
@@ -455,7 +705,7 @@ def cmd_setup(args: argparse.Namespace) -> int:
             "webhook, or run `patter hermes attach-number` later."
         )
 
-    # 4. Next steps.
+    # 5. Next steps.
     print("\nNext steps:")
     print(f"  cd {target}")
     print("  # edit .env with your keys")
@@ -508,6 +758,46 @@ def cmd_numbers(args: argparse.Namespace) -> int:
     return 0
 
 
+# ──────────────────────────────────────────────────────────────────────────
+# Hermes gateway enablement (the one step that writes to ~/.hermes)
+# ──────────────────────────────────────────────────────────────────────────
+def _enable_hermes_gateway() -> str:
+    """Write ``API_SERVER_ENABLED=true`` (+ a key if absent) to ``~/.hermes/.env``.
+
+    Backs the file up to ``.env.bak`` first. Prints what it changed and reminds
+    the operator to (re)start the gateway — Patter does not manage the service.
+    Returns the gateway's ``API_SERVER_KEY`` (existing or freshly generated) so
+    the caller can keep the project ``.env`` in sync with it.
+    """
+    home = _hermes_home()
+    env_path = home / ".env"
+    existing = _parse_env_file(env_path)
+
+    if env_path.exists():
+        backup = env_path.parent / (env_path.name + ".bak")
+        shutil.copy2(env_path, backup)  # keeps mode bits, so a 0600 .env stays private
+        print(f"Backed up {env_path} → {backup}")
+
+    updates: dict[str, str] = {"API_SERVER_ENABLED": "true"}
+    key = existing.get("API_SERVER_KEY", "")
+    if not key:
+        key = _generate_key()
+        updates["API_SERVER_KEY"] = key
+    _upsert_env_file(env_path, updates)
+
+    print(f"✓ API_SERVER_ENABLED=true written to {env_path}")
+    if "API_SERVER_KEY" in updates:
+        print("✓ API_SERVER_KEY generated for the gateway")
+    if shutil.which("hermes"):
+        print("Now (re)start the gateway:  hermes gateway start")
+    else:
+        print(
+            "Hermes CLI not found — start the gateway your usual way so the new "
+            "settings take effect."
+        )
+    return key
+
+
 # ──────────────────────────────────────────────────────────────────────────
 # Twilio helpers
 # ──────────────────────────────────────────────────────────────────────────
@@ -610,6 +900,15 @@ def build_hermes_parser(subparsers: argparse._SubParsersAction) -> argparse.Argu
     doctor.add_argument("--base-url", default=None, help="Hermes gateway base URL")
     doctor.add_argument("--no-network", action="store_true", help="Skip live probes")
     doctor.add_argument("--json", action="store_true", help="Machine-readable output")
+    doctor.add_argument(
+        "--env-file",
+        action="append",
+        default=None,
+        help="dotenv file(s) to load (repeatable; default: ~/.hermes/.env + ./.env)",
+    )
+    doctor.add_argument(
+        "--no-env-file", action="store_true", help="Do not autoload any .env file"
+    )
 
     setup = hsub.add_parser("setup", help="Scaffold a hermes-phone-agent project")
     setup.add_argument("--dir", default="hermes-phone-agent", help="Target directory")
@@ -620,6 +919,22 @@ def build_hermes_parser(subparsers: argparse._SubParsersAction) -> argparse.Argu
     setup.add_argument("--status-callback", default=None, help="Twilio status callback URL")
     setup.add_argument("--base-url", default=None, help="Hermes gateway base URL")
     setup.add_argument("--no-network", action="store_true", help="Skip live probes")
+    setup.add_argument(
+        "--generate-key",
+        action="store_true",
+        help="Generate a strong API_SERVER_KEY into the project .env",
+    )
+    setup.add_argument(
+        "--enable-hermes",
+        action="store_true",
+        help="Write API_SERVER_ENABLED=true (+ key) to ~/.hermes/.env (backed up)",
+    )
+    setup.add_argument(
+        "--env-file", action="append", default=None, help="dotenv file(s) to load"
+    )
+    setup.add_argument(
+        "--no-env-file", action="store_true", help="Do not autoload any .env file"
+    )
 
     attach = hsub.add_parser("attach-number", help="Point a Twilio number at your Patter URL")
     attach.add_argument("number", help="Phone number in E.164 (e.g. +15551234567)")
diff --git a/libraries/python/tests/unit/test_hermes_cli.py b/libraries/python/tests/unit/test_hermes_cli.py
index 68298f5..5877dbd 100644
--- a/libraries/python/tests/unit/test_hermes_cli.py
+++ b/libraries/python/tests/unit/test_hermes_cli.py
@@ -7,10 +7,9 @@
 from __future__ import annotations
 
 import argparse
+import os
 from pathlib import Path
 
-import pytest
-
 from getpatter import _hermes_scaffold, cli_hermes
 
 
@@ -173,6 +172,125 @@ def test_dispatch_unknown_subcommand_returns_usage() -> None:
     assert cli_hermes.dispatch_hermes(args) == 2
 
 
+# ── env / config helpers ────────────────────────────────────────────────────
+def test_parse_env_file_handles_quotes_export_comments(tmp_path: Path) -> None:
+    p = tmp_path / ".env"
+    p.write_text(
+        "# comment\n"
+        "\n"
+        "export API_SERVER_KEY='secret'\n"
+        'PATTER_LANGUAGE="it"\n'
+        "BARE=value\n"
+        "NOEQUALS\n",
+        encoding="utf-8",
+    )
+    parsed = cli_hermes._parse_env_file(p)
+    assert parsed == {
+        "API_SERVER_KEY": "secret",
+        "PATTER_LANGUAGE": "it",
+        "BARE": "value",
+    }
+
+
+def test_parse_env_file_missing_returns_empty(tmp_path: Path) -> None:
+    assert cli_hermes._parse_env_file(tmp_path / "nope.env") == {}
+
+
+def test_upsert_env_file_replaces_and_appends(tmp_path: Path) -> None:
+    p = tmp_path / ".env"
+    p.write_text("# header\nAPI_SERVER_PORT=8642\nKEEP=me\n", encoding="utf-8")
+    cli_hermes._upsert_env_file(p, {"API_SERVER_PORT": "9000", "NEW": "x"})
+    text = p.read_text(encoding="utf-8")
+    assert "# header" in text  # comments preserved
+    assert "KEEP=me" in text
+    assert "API_SERVER_PORT=9000" in text
+    assert "API_SERVER_PORT=8642" not in text
+    assert "NEW=x" in text
+
+
+def test_upsert_env_file_creates_when_missing(tmp_path: Path) -> None:
+    p = tmp_path / "sub" / ".env"
+    cli_hermes._upsert_env_file(p, {"A": "1"})
+    assert p.read_text(encoding="utf-8").strip() == "A=1"
+
+
+def test_load_env_files_does_not_override_existing(tmp_path: Path, monkeypatch) -> None:
+    p = tmp_path / ".env"
+    p.write_text("FOO=fromfile\nBAR=baz\n", encoding="utf-8")
+    monkeypatch.setenv("FOO", "fromenv")
+    monkeypatch.delenv("BAR", raising=False)
+    applied = cli_hermes._load_env_files([p])
+    assert applied == [p]
+    assert os.environ["FOO"] == "fromenv"  # not overridden
+    assert os.environ["BAR"] == "baz"  # newly loaded
+
+
+def test_read_hermes_config_env_overrides_yaml(tmp_path: Path, monkeypatch) -> None:
+    monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+    (tmp_path / "config.yaml").write_text(
+        "api_server:\n  enabled: true\n  port: 8642\n  key: fromyaml\n",
+        encoding="utf-8",
+    )
+    (tmp_path / ".env").write_text("API_SERVER_KEY=fromenv\n", encoding="utf-8")
+    cfg = cli_hermes._read_hermes_config()
+    assert cfg["API_SERVER_ENABLED"] == "True"
+    assert cfg["API_SERVER_PORT"] == "8642"
+    assert cfg["API_SERVER_KEY"] == "fromenv"  # .env wins
+
+
+def test_read_hermes_config_absent_home(tmp_path: Path, monkeypatch) -> None:
+    monkeypatch.setenv("HERMES_HOME", str(tmp_path / "missing"))
+    assert cli_hermes._read_hermes_config() == {}
+
+
+def test_enable_hermes_gateway_writes_and_backs_up(tmp_path: Path, monkeypatch) -> None:
+    monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+    env_path = tmp_path / ".env"
+    env_path.write_text("API_SERVER_PORT=8642\n", encoding="utf-8")
+    key = cli_hermes._enable_hermes_gateway()
+    assert (tmp_path / ".env.bak").exists()
+    parsed = cli_hermes._parse_env_file(env_path)
+    assert parsed["API_SERVER_ENABLED"] == "true"
+    assert parsed["API_SERVER_KEY"] == key  # returned key matches what was written
+    assert parsed["API_SERVER_PORT"] == "8642"  # preserved
+
+
+def test_enable_hermes_gateway_keeps_existing_key(tmp_path: Path, monkeypatch) -> None:
+    monkeypatch.setenv("HERMES_HOME", str(tmp_path))
+    (tmp_path / ".env").write_text("API_SERVER_KEY=keepme\n", encoding="utf-8")
+    key = cli_hermes._enable_hermes_gateway()
+    assert key == "keepme"
+
+
+# ── severity + autoload integration ──────────────────────────────────────────
+def test_cli_missing_and_gateway_down_is_failure(tmp_path: Path, monkeypatch) -> None:
+    monkeypatch.setenv("HERMES_HOME", str(tmp_path / "missing"))
+    monkeypatch.setattr(cli_hermes.shutil, "which", lambda _name: None)
+    monkeypatch.setattr("httpx.get", lambda *a, **k: (_ for _ in ()).throw(OSError("refused")))
+    sec = cli_hermes._check_hermes("http://127.0.0.1:8642/v1", network=True)
+    by_label = {c.label: c.status for c in sec.checks}
+    assert by_label["CLI not found"] == cli_hermes.FAIL
+    assert by_label["Gateway unreachable"] == cli_hermes.FAIL
+
+
+def test_doctor_autoloads_env_file(tmp_path: Path, monkeypatch, capsys) -> None:
+    monkeypatch.delenv("DEEPGRAM_API_KEY", raising=False)
+    env = tmp_path / ".env"
+    env.write_text("DEEPGRAM_API_KEY=dg-from-file\n", encoding="utf-8")
+    args = argparse.Namespace(
+        base_url=None,
+        no_network=True,
+        json=True,
+        env_file=[str(env)],
+        no_env_file=False,
+    )
+    cli_hermes.cmd_doctor(args)
+    assert os.environ.get("DEEPGRAM_API_KEY") == "dg-from-file"
+    out = capsys.readouterr().out
+    assert "dg-from-file" not in out  # secrets aren't echoed
+    assert str(env) in out  # but the loaded path is reported
+
+
 def test_parser_wires_subcommands() -> None:
     parser = argparse.ArgumentParser()
     sub = parser.add_subparsers(dest="command")

From 86c37d5fd507f173e43b1c116aa742b6f2b3c71e Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 9 Jun 2026 11:06:24 +0000
Subject: [PATCH 10/11] =?UTF-8?q?feat(hermes):=20end-to-end=20setup=20acce?=
 =?UTF-8?q?ptance=20=E2=80=94=20test,=20gateway=20start+wait,=20trace/diag?=
 =?UTF-8?q?nose?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Close the acceptance + debugging gaps so a green run means a real call works.

setup:
- --start-gateway runs `hermes gateway start` then polls /v1/models until the
  gateway answers, completing the enable → start → verify cycle.
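
The start-then-poll step can be sketched as a plain deadline loop. This is a minimal
illustration, not the code in this patch: `wait_until_ready` and its `probe`, `sleep`
and `clock` parameters are hypothetical names, and `probe` stands in for the GET on
/v1/models returning 200.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    *,
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call ``probe`` until it returns True or ``timeout`` seconds elapse."""
    # A monotonic clock keeps the deadline immune to wall-clock adjustments.
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```

Injecting `sleep` and `clock` is only there so the loop can be unit-tested without
real waiting; the defaults behave like an ordinary polling loop.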

New `patter hermes test` — acceptance, not just preflight: GET /v1/models, send a
real /v1/chat/completions turn with the X-Hermes-Session-Id header and report the
latency + reply snippet, confirm HermesLLM is constructible, and check the STT/TTS
keys. Exit non-zero on any blocker.

New `patter hermes trace [call]` / `diagnose [call]` — read the on-disk per-call
log (PATTER_LOG_DIR; services/call_log.py) and classify each pipeline stage
(carrier → STT → Hermes → TTS), with a latency breakdown. `diagnose`
applies a decision tree and names the first broken stage with a fix, e.g.
"Hermes replied but no audio — TTS stage. Check ELEVENLABS_API_KEY / REST
transport." Defaults to the latest call; accepts a call_id or a directory.
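
The "first broken stage" rule described above can be sketched as an ordered scan over
per-stage evidence. This is a simplified illustration under stated assumptions:
`STAGE_ORDER`, `VERDICTS`, `first_broken_stage` and `verdict` are hypothetical names,
and the verdict strings only paraphrase the kind of output the command prints.

```python
from typing import Dict, Optional, Tuple

# Pipeline order: the first stage with no evidence in the log is reported.
STAGE_ORDER = ("carrier", "stt", "llm", "tts")

# Hypothetical verdict strings, paraphrasing the output described above.
VERDICTS: Dict[str, str] = {
    "carrier": "No call record: the carrier webhook never reached Patter.",
    "stt": "Audio arrived but produced no transcript: STT stage.",
    "llm": "Transcript captured but Hermes never replied: LLM stage.",
    "tts": "Hermes replied but no audio: TTS stage.",
}


def first_broken_stage(evidence: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage lacking evidence, or None if all are present."""
    for stage in STAGE_ORDER:
        if not evidence.get(stage, False):
            return stage
    return None


def verdict(evidence: Dict[str, bool]) -> Tuple[Optional[str], str]:
    """Map per-stage evidence to ``(broken_stage, one_line_verdict)``."""
    stage = first_broken_stage(evidence)
    if stage is None:
        return None, "Pipeline looks healthy end-to-end."
    return stage, VERDICTS[stage]
```

Because the scan stops at the first stage without evidence, a downstream failure is
never reported while an upstream one is still unexplained.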

Note: item #3 (auto-attach the tunnel URL to the carrier) is already handled by
the SDK — serve() auto-configures the Twilio/Plivo webhook once the tunnel is up
(server.py) — so the scaffold app does it on `python app.py`; documented.

Scaffold now sets PATTER_LOG_DIR and documents test/trace/diagnose; example dir
regenerated. TS CLI stub lists the new subcommands. +15 unit tests; docs updated.

https://claude.ai/code/session_01TNysNGx7woXM99fHBjpsts
---
 docs/integrations/hermes.mdx                  |  19 +
 examples/hermes-phone-agent/.env.example      |   2 +
 examples/hermes-phone-agent/README.md         |  16 +
 .../python/getpatter/_hermes_scaffold.py      |  18 +
 libraries/python/getpatter/cli_hermes.py      | 431 +++++++++++++++++-
 .../python/tests/unit/test_hermes_cli.py      | 180 ++++++++
 libraries/typescript/src/cli.ts               |   7 +-
 7 files changed, 669 insertions(+), 4 deletions(-)

diff --git a/docs/integrations/hermes.mdx b/docs/integrations/hermes.mdx
index ce7dccf..8c9118b 100644
--- a/docs/integrations/hermes.mdx
+++ b/docs/integrations/hermes.mdx
@@ -202,6 +202,25 @@ patter hermes numbers       # list the numbers on your Twilio account
 patter hermes attach-number +15551234567 --url https://<your-tunnel>/calls/inbound
 ```
 
+To go from a freshly enabled gateway to a verified one in a single run, add
+`--start-gateway` — `setup` then runs `hermes gateway start` and waits for `/v1/models` to
+answer before continuing. Before placing a real call, run the end-to-end acceptance check,
+which sends an actual chat turn through the gateway (with the Hermes session header) and
+confirms your providers are ready:
+
+```bash
+patter hermes test          # /v1/models + a real /v1/chat/completions turn + provider keys
+```
+
+When a call misbehaves, run the tracer against Patter's per-call log (`PATTER_LOG_DIR`) to see
+exactly which stage broke — carrier → STT → Hermes → TTS — with a latency breakdown and a
+one-line verdict:
+
+```bash
+patter hermes trace         # latest call's pipeline stages + stt/llm/tts latency
+patter hermes diagnose      # e.g. "Hermes replied but no audio — TTS stage" + the fix
+```
+
 These commands live in the Python SDK today; the `HermesLLM` provider itself is available
 in both the Python and TypeScript SDKs.
 
diff --git a/examples/hermes-phone-agent/.env.example b/examples/hermes-phone-agent/.env.example
index de760cf..0b382e3 100644
--- a/examples/hermes-phone-agent/.env.example
+++ b/examples/hermes-phone-agent/.env.example
@@ -10,6 +10,8 @@ PATTER_PHONE_NUMBER=+15551234567
 PATTER_LANGUAGE=en
 # REST is the safer default for a first PSTN demo; set to ws for streaming.
 PATTER_ELEVENLABS_TRANSPORT=rest
+# Per-call logs — enables `patter hermes trace` / `patter hermes diagnose`.
+PATTER_LOG_DIR=./patter-logs
 
 # ── Twilio carrier ────────────────────────────────────────────────────
 TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
diff --git a/examples/hermes-phone-agent/README.md b/examples/hermes-phone-agent/README.md
index 3b5298c..cc7af5c 100644
--- a/examples/hermes-phone-agent/README.md
+++ b/examples/hermes-phone-agent/README.md
@@ -50,6 +50,22 @@ Now call your number and talk to Hermes.
 python scripts/test_outbound_call.py +15557654321
 ```
 
+## Debug a call
+
+With `PATTER_LOG_DIR` set (see `.env`), Patter writes a per-call log. After a
+call, inspect what happened stage by stage, or get a one-line verdict:
+
+```bash
+patter hermes trace        # latest call: carrier → STT → Hermes → TTS + latency
+patter hermes diagnose     # "Hermes replied but no audio — TTS stage" + fix
+```
+
+Before placing a call at all, confirm the brain answers and providers are ready:
+
+```bash
+patter hermes test         # /v1/models + a real chat turn + provider keys
+```
+
 ## Why Patter instead of a hosted custom-LLM voice agent?
 
 - **Hermes stays private.** A hosted platform has to reach your "brain" endpoint
diff --git a/libraries/python/getpatter/_hermes_scaffold.py b/libraries/python/getpatter/_hermes_scaffold.py
index fe9b7a5..0c8ba89 100644
--- a/libraries/python/getpatter/_hermes_scaffold.py
+++ b/libraries/python/getpatter/_hermes_scaffold.py
@@ -94,6 +94,8 @@
 PATTER_LANGUAGE=en
 # REST is the safer default for a first PSTN demo; set to ws for streaming.
 PATTER_ELEVENLABS_TRANSPORT=rest
+# Per-call logs — enables `patter hermes trace` / `patter hermes diagnose`.
+PATTER_LOG_DIR=./patter-logs
 
 # ── Twilio carrier ────────────────────────────────────────────────────
 TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
@@ -159,6 +161,22 @@
 python scripts/test_outbound_call.py +15557654321
 ```
 
+## Debug a call
+
+With `PATTER_LOG_DIR` set (see `.env`), Patter writes a per-call log. After a
+call, inspect what happened stage by stage, or get a one-line verdict:
+
+```bash
+patter hermes trace        # latest call: carrier → STT → Hermes → TTS + latency
+patter hermes diagnose     # "Hermes replied but no audio — TTS stage" + fix
+```
+
+Before placing a call at all, confirm the brain answers and providers are ready:
+
+```bash
+patter hermes test         # /v1/models + a real chat turn + provider keys
+```
+
 ## Why Patter instead of a hosted custom-LLM voice agent?
 
 - **Hermes stays private.** A hosted platform has to reach your "brain" endpoint
diff --git a/libraries/python/getpatter/cli_hermes.py b/libraries/python/getpatter/cli_hermes.py
index 4e271a1..7fda0b1 100644
--- a/libraries/python/getpatter/cli_hermes.py
+++ b/libraries/python/getpatter/cli_hermes.py
@@ -653,6 +653,15 @@ def cmd_setup(args: argparse.Namespace) -> int:
     if loaded:
         print("Loaded env from: " + ", ".join(str(p) for p in loaded))
 
+    # 1b. Optionally start the gateway and wait for readiness — completes the
+    #     enable → start → verify cycle so the preflight sees a live gateway.
+    if getattr(args, "start_gateway", False) and not getattr(args, "no_network", False):
+        base_url = _hermes_base_url(getattr(args, "base_url", None))
+        key = gateway_key or os.environ.get("API_SERVER_KEY", "")
+        if _start_gateway():
+            _wait_for_gateway(base_url, key)
+        print()
+
     # 2. Preflight.
     print("Checking your environment…")
     sections = _run_doctor(args)
@@ -718,6 +727,111 @@ def cmd_setup(args: argparse.Namespace) -> int:
     return 1 if failures else 0
 
 
+def _chat_turn_check(base_url: str, key: str, model: str, prompt: str) -> Check:
+    """Send one ``/chat/completions`` turn with Hermes session headers."""
+    import time
+
+    try:
+        import httpx
+    except ImportError:  # pragma: no cover
+        return Check(SKIP, "Chat turn", "httpx not installed")
+    headers = {
+        "Content-Type": "application/json",
+        # Mirror HermesLLM: per-call continuity is carried in headers.
+        "X-Hermes-Session-Id": "patter-cli-test",
+    }
+    if key:
+        headers["Authorization"] = f"Bearer {key}"
+    payload = {
+        "model": model,
+        "messages": [{"role": "user", "content": prompt}],
+        "stream": False,
+    }
+    start = time.monotonic()
+    try:
+        resp = httpx.post(
+            f"{base_url}/chat/completions", json=payload, headers=headers, timeout=120.0
+        )
+    except Exception as exc:  # noqa: BLE001
+        return Check(FAIL, "Chat turn", str(exc), "patter hermes doctor")
+    elapsed = int((time.monotonic() - start) * 1000)
+    if resp.status_code != 200:
+        return Check(
+            FAIL,
+            "Chat turn",
+            f"HTTP {resp.status_code}: {resp.text[:160]}",
+            "check the model name and API_SERVER_KEY",
+        )
+    try:
+        content = resp.json()["choices"][0]["message"]["content"]
+    except Exception:  # noqa: BLE001
+        return Check(FAIL, "Chat turn", "200 but no choices[0].message.content")
+    snippet = " ".join((content or "").split())[:60]
+    if not snippet:
+        return Check(WARN, "Chat turn", f"empty reply ({elapsed} ms)")
+    return Check(OK, "Chat turn", f'{elapsed} ms — "{snippet}…"')
+
+
+def cmd_test(args: argparse.Namespace) -> int:
+    """End-to-end acceptance: gateway + a real chat turn + provider readiness."""
+    loaded = _apply_env(args)
+    base_url = _hermes_base_url(getattr(args, "base_url", None))
+    hermes_cfg = _read_hermes_config()
+    key = os.environ.get("API_SERVER_KEY", "") or hermes_cfg.get("API_SERVER_KEY", "")
+    model = os.environ.get("API_SERVER_MODEL_NAME") or hermes_cfg.get(
+        "API_SERVER_MODEL_NAME", "hermes-agent"
+    )
+
+    sec = Section("Hermes acceptance")
+    status, body, err = _get_json(
+        f"{base_url}/models",
+        headers={"Authorization": f"Bearer {key}"} if key else None,
+    )
+    if status == 200:
+        sec.checks.append(Check(OK, "Gateway reachable", base_url))
+        ids = _model_ids(body)
+        sec.checks.append(
+            Check(OK, "Model available", model)
+            if model in ids
+            else Check(WARN, "Model not found", f"{model!r} not in {sorted(ids)[:5]}")
+        )
+        sec.checks.append(
+            _chat_turn_check(base_url, key, model, getattr(args, "prompt", None) or
+                             "Reply with one short spoken sentence to confirm you are online.")
+        )
+    else:
+        detail = f"HTTP {status}" if status else (err or "no response")
+        sec.checks.append(
+            Check(FAIL, "Gateway reachable", f"{base_url} — {detail}", "patter hermes doctor")
+        )
+
+    # HermesLLM + provider readiness (so a green `test` means a real call can run).
+    try:
+        from getpatter import HermesLLM
+
+        HermesLLM()
+        sec.checks.append(Check(OK, "HermesLLM constructible"))
+    except Exception as exc:  # noqa: BLE001
+        sec.checks.append(Check(FAIL, "HermesLLM construction failed", str(exc)))
+    sec.checks.append(_env_key("DEEPGRAM_API_KEY", "Deepgram STT"))
+    sec.checks.append(_env_key("ELEVENLABS_API_KEY", "ElevenLabs TTS"))
+
+    report = _sections_to_dict([sec])
+    if getattr(args, "json", False):
+        report["loaded_env_files"] = [str(p) for p in loaded]
+        print(json.dumps(report, indent=2))
+    else:
+        if loaded:
+            print("Loaded env from: " + ", ".join(str(p) for p in loaded))
+        _print_sections([sec])
+        print()
+        if report["failures"]:
+            print(f"{report['failures']} blocker(s) — fix before calling.")
+        else:
+            print("Acceptance passed — Hermes is answering and providers are ready.")
+    return 1 if report["failures"] else 0
+
+
 def cmd_attach_number(args: argparse.Namespace) -> int:
     return _attach_number(args.number, args.url, args.status_callback)
 
@@ -758,6 +872,238 @@ def cmd_numbers(args: argparse.Namespace) -> int:
     return 0
 
 
+# ──────────────────────────────────────────────────────────────────────────
+# Call trace / diagnose (reads the on-disk call log; see services/call_log.py)
+# ──────────────────────────────────────────────────────────────────────────
+def _call_log_root(override: str | None) -> Path | None:
+    from getpatter.services.call_log import resolve_log_root
+
+    return resolve_log_root(override)
+
+
+def _read_jsonl(path: Path) -> list[dict]:
+    rows: list[dict] = []
+    try:
+        text = path.read_text(encoding="utf-8")
+    except OSError:
+        return rows
+    for line in text.splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        try:
+            rows.append(json.loads(line))
+        except json.JSONDecodeError:
+            continue
+    return rows
+
+
+def _find_call_dir(root: Path, call: str | None) -> Path | None:
+    """Locate a call directory under ``<root>/calls`` (newest if ``call`` is None)."""
+    if call:
+        direct = Path(call)
+        if (direct / "metadata.json").exists():
+            return direct
+        matches = list(root.glob(f"calls/**/{call}/metadata.json"))
+        return matches[0].parent if matches else None
+    metas = list(root.glob("calls/**/metadata.json"))
+    if not metas:
+        return None
+    newest = max(metas, key=lambda p: p.stat().st_mtime)
+    return newest.parent
+
+
+def _load_call(call_dir: Path) -> dict:
+    meta: dict = {}
+    try:
+        meta = json.loads((call_dir / "metadata.json").read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError):
+        pass
+    return {
+        "dir": call_dir,
+        "metadata": meta,
+        "turns": _read_jsonl(call_dir / "transcript.jsonl"),
+        "events": _read_jsonl(call_dir / "events.jsonl"),
+    }
+
+
+def _turn_latency(turn: dict) -> dict:
+    lat = turn.get("latency")
+    return lat if isinstance(lat, dict) else {}
+
+
+def _classify_stages(call: dict) -> list[tuple[str, str, str]]:
+    """Return ``(stage, status, detail)`` for each pipeline stage of one call."""
+    meta = call["metadata"]
+    turns = call["turns"]
+    events = call["events"]
+
+    def any_turn(pred) -> bool:
+        return any(pred(t) for t in turns)
+
+    has_stt = any_turn(lambda t: bool(t.get("user_text"))) or any_turn(
+        lambda t: (t.get("stt_audio_seconds") or 0) > 0
+    )
+    has_llm = any_turn(lambda t: bool(t.get("agent_text"))) or any_turn(
+        lambda t: (_turn_latency(t).get("llm_ms") or 0) > 0
+    )
+    has_tts = any_turn(lambda t: (t.get("tts_characters") or 0) > 0) or any_turn(
+        lambda t: (_turn_latency(t).get("tts_ms") or 0) > 0
+    )
+    bargeins = sum(1 for e in events if e.get("type") == "barge_in")
+    errors = [e for e in events if e.get("type") == "error"]
+
+    out: list[tuple[str, str, str]] = []
+    provider = meta.get("telephony_provider") or "?"
+    out.append(
+        (
+            "Call reached Patter",
+            OK if meta else FAIL,
+            f"{meta.get('direction', '?')} via {provider}, status={meta.get('status', '?')}"
+            if meta
+            else "no metadata.json",
+        )
+    )
+    out.append(
+        ("Caller transcribed (STT)", OK if has_stt else FAIL, f"{len(turns)} turn(s)")
+    )
+    out.append(("Hermes replied (LLM)", OK if has_llm else FAIL, ""))
+    out.append(("Spoken back (TTS)", OK if has_tts else FAIL, ""))
+    out.append(
+        (
+            "Barge-in",
+            OK if bargeins else SKIP,
+            f"{bargeins} event(s)" if bargeins else "none recorded",
+        )
+    )
+    if errors or meta.get("error"):
+        detail = meta.get("error") or errors[0].get("data", {})
+        out.append(("Errors", WARN, str(detail)[:120]))
+    return out
+
+
+def _latency_summary(turns: list[dict]) -> str:
+    keys = ("stt_ms", "llm_ttft_ms", "llm_ms", "tts_ms", "total_ms")
+    sums: dict[str, float] = {k: 0.0 for k in keys}
+    counts: dict[str, int] = {k: 0 for k in keys}
+    for t in turns:
+        lat = _turn_latency(t)
+        for k in keys:
+            v = lat.get(k)
+            if isinstance(v, (int, float)) and v > 0:
+                sums[k] += v
+                counts[k] += 1
+    parts = [
+        f"{k}={int(sums[k] / counts[k])}ms" for k in keys if counts[k]
+    ]
+    return "  ".join(parts) if parts else "no latency recorded"
+
+
+def _diagnose_verdict(call: dict) -> tuple[str, str]:
+    """Decision tree → ``(verdict, suggested_fix)`` for the first broken stage."""
+    stages = {name: status for name, status, _ in _classify_stages(call)}
+    if stages.get("Call reached Patter") == FAIL:
+        return (
+            "No call record — the carrier webhook never reached Patter.",
+            "Check the number's voice webhook (patter hermes attach-number) and "
+            "that the tunnel was up.",
+        )
+    if stages.get("Caller transcribed (STT)") == FAIL:
+        return (
+            "Audio reached Patter but produced no transcript — STT/VAD stage.",
+            "Check DEEPGRAM_API_KEY, the STT language/model, and that media streamed.",
+        )
+    if stages.get("Hermes replied (LLM)") == FAIL:
+        return (
+            "Transcript captured but Hermes never replied — gateway/LLM stage.",
+            "Run `patter hermes test`; check the gateway is up and the key matches.",
+        )
+    if stages.get("Spoken back (TTS)") == FAIL:
+        return (
+            "Hermes replied but no audio was synthesized — TTS stage.",
+            "Check ELEVENLABS_API_KEY and use REST transport on PSTN "
+            "(PATTER_ELEVENLABS_TRANSPORT=rest).",
+        )
+    return ("Pipeline looks healthy end-to-end.", "")
+
+
+def _resolve_trace_call(args: argparse.Namespace) -> tuple[Path | None, str]:
+    """Shared resolution for trace/diagnose. Returns ``(call_dir, error_msg)``."""
+    root = _call_log_root(getattr(args, "log_dir", None))
+    if root is None:
+        return None, (
+            "Call logging is off. Set PATTER_LOG_DIR (or pass --log-dir) so Patter "
+            "writes per-call logs, then place a call and retry."
+        )
+    call_dir = _find_call_dir(root, getattr(args, "call", None))
+    if call_dir is None:
+        which = getattr(args, "call", None) or "any call"
+        return None, f"No call log found for {which} under {root}/calls."
+    return call_dir, ""
+
+
+def cmd_trace(args: argparse.Namespace) -> int:
+    _apply_env(args)
+    call_dir, errmsg = _resolve_trace_call(args)
+    if call_dir is None:
+        print(errmsg, file=sys.stderr)
+        return 2
+    call = _load_call(call_dir)
+    stages = _classify_stages(call)
+    if getattr(args, "json", False):
+        print(
+            json.dumps(
+                {
+                    "call_id": call["metadata"].get("call_id"),
+                    "dir": str(call_dir),
+                    "stages": [
+                        {"stage": n, "status": s, "detail": d} for n, s, d in stages
+                    ],
+                    "latency": _latency_summary(call["turns"]),
+                },
+                indent=2,
+            )
+        )
+        return 0
+    meta = call["metadata"]
+    print(f"Call {meta.get('call_id', call_dir.name)}  ({call_dir})")
+    for name, status, detail in stages:
+        sym = _color(_SYMBOL.get(status, "?"), status)
+        line = f"  {sym} {name}"
+        if detail:
+            line += f": {detail}"
+        print(line)
+    print(f"\n  latency: {_latency_summary(call['turns'])}")
+    return 0
+
+
+def cmd_diagnose(args: argparse.Namespace) -> int:
+    _apply_env(args)
+    call_dir, errmsg = _resolve_trace_call(args)
+    if call_dir is None:
+        print(errmsg, file=sys.stderr)
+        return 2
+    call = _load_call(call_dir)
+    verdict, fix = _diagnose_verdict(call)
+    if getattr(args, "json", False):
+        print(
+            json.dumps(
+                {
+                    "call_id": call["metadata"].get("call_id"),
+                    "verdict": verdict,
+                    "fix": fix,
+                },
+                indent=2,
+            )
+        )
+        return 0
+    print(f"Call {call['metadata'].get('call_id', call_dir.name)}")
+    print(f"  {verdict}")
+    if fix:
+        print(f"  fix: {fix}")
+    return 0
+
+
 # ──────────────────────────────────────────────────────────────────────────
 # Hermes gateway enablement (the one step that writes to ~/.hermes)
 # ──────────────────────────────────────────────────────────────────────────
@@ -798,6 +1144,56 @@ def _enable_hermes_gateway() -> str:
     return key
 
 
+def _start_gateway() -> bool:
+    """Start the Hermes gateway via the CLI. Returns True on success.
+
+    Patter does not own the service — this is a convenience that shells out to
+    ``hermes gateway start`` when the CLI is available.
+    """
+    if not shutil.which("hermes"):
+        print(
+            "Cannot start the gateway: hermes CLI not found. Start it your usual "
+            "way, then re-run without --no-network to verify."
+        )
+        return False
+    import subprocess
+
+    print("Starting the Hermes gateway (hermes gateway start)…")
+    try:
+        proc = subprocess.run(
+            ["hermes", "gateway", "start"],
+            capture_output=True,
+            text=True,
+            timeout=60,
+        )
+    except Exception as exc:  # noqa: BLE001
+        print(f"Could not start the gateway: {exc}")
+        return False
+    if proc.returncode == 0:
+        return True
+    print((proc.stderr or proc.stdout or "").strip()[:300])
+    return False
+
+
+def _wait_for_gateway(
+    base_url: str, key: str, *, timeout: float = 60.0, interval: float = 2.0
+) -> bool:
+    """Poll ``{base_url}/models`` until it answers 200 or the timeout elapses."""
+    import time
+
+    headers = {"Authorization": f"Bearer {key}"} if key else {}
+    deadline = time.monotonic() + timeout
+    print(f"Waiting for the gateway at {base_url} (up to {int(timeout)}s)…")
+    while time.monotonic() < deadline:
+        status, _body, _err = _get_json(f"{base_url}/models", headers=headers, timeout=3.0)
+        if status == 200:
+            print("✓ Gateway is ready.")
+            return True
+        time.sleep(interval)
+    print("✗ Gateway did not become ready in time.")
+    return False
+
+
 # ──────────────────────────────────────────────────────────────────────────
 # Twilio helpers
 # ──────────────────────────────────────────────────────────────────────────
@@ -929,6 +1325,11 @@ def build_hermes_parser(subparsers: argparse._SubParsersAction) -> argparse.Argu
         action="store_true",
         help="Write API_SERVER_ENABLED=true (+ key) to ~/.hermes/.env (backed up)",
     )
+    setup.add_argument(
+        "--start-gateway",
+        action="store_true",
+        help="Run `hermes gateway start` and wait for /v1/models readiness",
+    )
     setup.add_argument(
         "--env-file", action="append", default=None, help="dotenv file(s) to load"
     )
@@ -936,6 +1337,27 @@ def build_hermes_parser(subparsers: argparse._SubParsersAction) -> argparse.Argu
         "--no-env-file", action="store_true", help="Do not autoload any .env file"
     )
 
+    test = hsub.add_parser("test", help="Acceptance: gateway + a real chat turn + providers")
+    test.add_argument("--base-url", default=None, help="Hermes gateway base URL")
+    test.add_argument("--prompt", default=None, help="Prompt to send for the chat turn")
+    test.add_argument("--json", action="store_true", help="Machine-readable output")
+    test.add_argument("--env-file", action="append", default=None, help="dotenv file(s)")
+    test.add_argument("--no-env-file", action="store_true", help="Do not autoload .env")
+
+    trace = hsub.add_parser("trace", help="Show the pipeline stages of a logged call")
+    trace.add_argument("call", nargs="?", default=None, help="call_id or dir (default: latest)")
+    trace.add_argument("--log-dir", default=None, help="Call log root (else PATTER_LOG_DIR)")
+    trace.add_argument("--json", action="store_true", help="Machine-readable output")
+    trace.add_argument("--env-file", action="append", default=None, help="dotenv file(s)")
+    trace.add_argument("--no-env-file", action="store_true", help="Do not autoload .env")
+
+    diagnose = hsub.add_parser("diagnose", help="Classify where a logged call broke")
+    diagnose.add_argument("call", nargs="?", default=None, help="call_id or dir (default: latest)")
+    diagnose.add_argument("--log-dir", default=None, help="Call log root (else PATTER_LOG_DIR)")
+    diagnose.add_argument("--json", action="store_true", help="Machine-readable output")
+    diagnose.add_argument("--env-file", action="append", default=None, help="dotenv file(s)")
+    diagnose.add_argument("--no-env-file", action="store_true", help="Do not autoload .env")
+
     attach = hsub.add_parser("attach-number", help="Point a Twilio number at your Patter URL")
     attach.add_argument("number", help="Phone number in E.164 (e.g. +15551234567)")
     attach.add_argument("--url", required=True, help="Public voice webhook URL (https)")
@@ -952,12 +1374,19 @@ def dispatch_hermes(args: argparse.Namespace) -> int:
         return cmd_doctor(args)
     if command == "setup":
         return cmd_setup(args)
+    if command == "test":
+        return cmd_test(args)
+    if command == "trace":
+        return cmd_trace(args)
+    if command == "diagnose":
+        return cmd_diagnose(args)
     if command == "attach-number":
         return cmd_attach_number(args)
     if command == "numbers":
         return cmd_numbers(args)
     print(
-        "Usage: patter hermes {doctor|setup|attach-number|numbers}\n"
+        "Usage: patter hermes "
+        "{doctor|setup|test|trace|diagnose|attach-number|numbers}\n"
         "Try:   patter hermes doctor",
         file=sys.stderr,
     )
diff --git a/libraries/python/tests/unit/test_hermes_cli.py b/libraries/python/tests/unit/test_hermes_cli.py
index 5877dbd..0542562 100644
--- a/libraries/python/tests/unit/test_hermes_cli.py
+++ b/libraries/python/tests/unit/test_hermes_cli.py
@@ -300,3 +300,183 @@ def test_parser_wires_subcommands() -> None:
     assert ns.hermes_command == "attach-number"
     assert ns.number == "+15551234567"
     assert ns.url == "https://x/y"
+
+
+def test_parser_wires_test_trace_diagnose() -> None:
+    parser = argparse.ArgumentParser()
+    sub = parser.add_subparsers(dest="command")
+    cli_hermes.build_hermes_parser(sub)
+    for name in ("test", "trace", "diagnose"):
+        ns = parser.parse_args(["hermes", name])
+        assert ns.hermes_command == name
+
+
+# ── gateway lifecycle ────────────────────────────────────────────────────────
+def test_start_gateway_no_cli(monkeypatch, capsys) -> None:
+    monkeypatch.setattr(cli_hermes.shutil, "which", lambda _n: None)
+    assert cli_hermes._start_gateway() is False
+    assert "hermes CLI not found" in capsys.readouterr().out
+
+
+def test_start_gateway_success(monkeypatch) -> None:
+    monkeypatch.setattr(cli_hermes.shutil, "which", lambda _n: "/usr/bin/hermes")
+
+    class Proc:
+        returncode = 0
+        stdout = "started"
+        stderr = ""
+
+    monkeypatch.setattr("subprocess.run", lambda *a, **k: Proc())
+    assert cli_hermes._start_gateway() is True
+
+
+def test_wait_for_gateway_ready(monkeypatch) -> None:
+    monkeypatch.setattr(cli_hermes, "_get_json", lambda *a, **k: (200, {}, ""))
+    assert cli_hermes._wait_for_gateway("http://x/v1", "k", timeout=1, interval=0.01) is True
+
+
+def test_wait_for_gateway_times_out(monkeypatch) -> None:
+    monkeypatch.setattr(cli_hermes, "_get_json", lambda *a, **k: (None, None, "down"))
+    assert (
+        cli_hermes._wait_for_gateway("http://x/v1", "k", timeout=0.05, interval=0.01)
+        is False
+    )
+
+
+# ── acceptance test command ──────────────────────────────────────────────────
+def test_chat_turn_check_ok(monkeypatch) -> None:
+    class Resp:
+        status_code = 200
+
+        @staticmethod
+        def json():
+            return {"choices": [{"message": {"content": "Hi there"}}]}
+
+    monkeypatch.setattr("httpx.post", lambda *a, **k: Resp())
+    c = cli_hermes._chat_turn_check("http://x/v1", "k", "hermes-agent", "hi")
+    assert c.status == cli_hermes.OK
+    assert "Hi there" in c.detail
+
+
+def test_chat_turn_check_http_error(monkeypatch) -> None:
+    class Resp:
+        status_code = 500
+        text = "boom"
+
+    monkeypatch.setattr("httpx.post", lambda *a, **k: Resp())
+    c = cli_hermes._chat_turn_check("http://x/v1", "k", "m", "hi")
+    assert c.status == cli_hermes.FAIL
+
+
+def test_cmd_test_passes_when_gateway_and_turn_ok(monkeypatch, capsys) -> None:
+    monkeypatch.setenv("DEEPGRAM_API_KEY", "dg")
+    monkeypatch.setenv("ELEVENLABS_API_KEY", "el")
+    monkeypatch.setattr(
+        cli_hermes, "_get_json", lambda *a, **k: (200, {"data": [{"id": "hermes-agent"}]}, "")
+    )
+    monkeypatch.setattr(
+        cli_hermes,
+        "_chat_turn_check",
+        lambda *a, **k: cli_hermes.Check(cli_hermes.OK, "Chat turn", "120 ms"),
+    )
+    args = argparse.Namespace(
+        base_url=None, prompt=None, json=True, env_file=None, no_env_file=True
+    )
+    assert cli_hermes.cmd_test(args) == 0
+    assert '"failures": 0' in capsys.readouterr().out
+
+
+# ── trace / diagnose ─────────────────────────────────────────────────────────
+def _make_call(root: Path, call_id: str, *, meta: dict, turns: list[dict], events=None):
+    d = root / "calls" / "2026" / "06" / "09" / call_id
+    d.mkdir(parents=True, exist_ok=True)
+    (d / "metadata.json").write_text(
+        __import__("json").dumps({"call_id": call_id, **meta}), encoding="utf-8"
+    )
+    (d / "transcript.jsonl").write_text(
+        "\n".join(__import__("json").dumps(t) for t in turns), encoding="utf-8"
+    )
+    if events:
+        (d / "events.jsonl").write_text(
+            "\n".join(__import__("json").dumps(e) for e in events), encoding="utf-8"
+        )
+    return d
+
+
+def test_find_call_dir_latest_and_by_id(tmp_path: Path) -> None:
+    _make_call(tmp_path, "CA1", meta={"status": "x"}, turns=[{"user_text": "a"}])
+    d2 = _make_call(tmp_path, "CA2", meta={"status": "x"}, turns=[{"user_text": "b"}])
+    # latest by mtime is CA2
+    assert cli_hermes._find_call_dir(tmp_path, None) == d2
+    assert cli_hermes._find_call_dir(tmp_path, "CA1").name == "CA1"
+    assert cli_hermes._find_call_dir(tmp_path, "nope") is None
+
+
+def test_classify_stages_healthy(tmp_path: Path) -> None:
+    d = _make_call(
+        tmp_path,
+        "CA",
+        meta={"status": "completed", "telephony_provider": "twilio"},
+        turns=[{"user_text": "hi", "agent_text": "hello", "tts_characters": 10}],
+        events=[{"type": "barge_in", "data": {}}],
+    )
+    stages = {n: s for n, s, _ in cli_hermes._classify_stages(cli_hermes._load_call(d))}
+    assert stages["Caller transcribed (STT)"] == cli_hermes.OK
+    assert stages["Hermes replied (LLM)"] == cli_hermes.OK
+    assert stages["Spoken back (TTS)"] == cli_hermes.OK
+
+
+def test_diagnose_tts_stage(tmp_path: Path) -> None:
+    d = _make_call(
+        tmp_path,
+        "CA",
+        meta={"status": "completed"},
+        turns=[{"user_text": "hi", "agent_text": "hello", "tts_characters": 0}],
+    )
+    verdict, fix = cli_hermes._diagnose_verdict(cli_hermes._load_call(d))
+    assert "TTS" in verdict
+    assert "ELEVENLABS_API_KEY" in fix
+
+
+def test_diagnose_llm_stage(tmp_path: Path) -> None:
+    d = _make_call(
+        tmp_path,
+        "CA",
+        meta={"status": "completed"},
+        turns=[{"user_text": "hi", "agent_text": "", "tts_characters": 0}],
+    )
+    verdict, _fix = cli_hermes._diagnose_verdict(cli_hermes._load_call(d))
+    assert "Hermes never replied" in verdict
+
+
+def test_diagnose_stt_stage(tmp_path: Path) -> None:
+    d = _make_call(
+        tmp_path, "CA", meta={"status": "completed"}, turns=[{"user_text": ""}]
+    )
+    verdict, _fix = cli_hermes._diagnose_verdict(cli_hermes._load_call(d))
+    assert "no transcript" in verdict
+
+
+def test_trace_no_log_dir(monkeypatch) -> None:
+    monkeypatch.delenv("PATTER_LOG_DIR", raising=False)
+    args = argparse.Namespace(
+        call=None, log_dir=None, json=False, env_file=None, no_env_file=True
+    )
+    assert cli_hermes.cmd_trace(args) == 2
+
+
+def test_trace_json_output(tmp_path: Path, monkeypatch, capsys) -> None:
+    _make_call(
+        tmp_path,
+        "CA",
+        meta={"status": "completed", "telephony_provider": "twilio"},
+        turns=[{"user_text": "hi", "agent_text": "yo", "tts_characters": 3,
+                "latency": {"llm_ttft_ms": 1000, "total_ms": 1500}}],
+    )
+    args = argparse.Namespace(
+        call=None, log_dir=str(tmp_path), json=True, env_file=None, no_env_file=True
+    )
+    assert cli_hermes.cmd_trace(args) == 0
+    out = capsys.readouterr().out
+    assert '"call_id": "CA"' in out
+    assert "llm_ttft_ms" in out
diff --git a/libraries/typescript/src/cli.ts b/libraries/typescript/src/cli.ts
index 0d8fd77..3616ec2 100644
--- a/libraries/typescript/src/cli.ts
+++ b/libraries/typescript/src/cli.ts
@@ -45,11 +45,12 @@ function printEvalStub(): void {
 
 function printHermesStub(): void {
   console.log(
-    'The Hermes setup wizard (doctor / setup / attach-number) lives in the\n' +
-      'Python CLI today. Use it from the Python SDK:\n\n' +
+    'The Hermes wizard (doctor / setup / test / trace / diagnose /\n' +
+      'attach-number) lives in the Python CLI today. Use it from the Python SDK:\n\n' +
       '  pip install getpatter\n' +
       '  patter hermes doctor\n' +
-      '  patter hermes setup\n\n' +
+      '  patter hermes setup\n' +
+      '  patter hermes test\n\n' +
       'The HermesLLM provider itself is fully available in this TypeScript SDK\n' +
       "(import { HermesLLM } from 'getpatter'). See\n" +
       'https://docs.getpatter.com/integrations/hermes for docs.',

From 1b575eef65b90b3265c21946b81c84ed3aa2d146 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 9 Jun 2026 11:32:18 +0000
Subject: [PATCH 11/11] docs(readme): condense telemetry note to a short
 opt-out callout

https://claude.ai/code/session_01TNysNGx7woXM99fHBjpsts
---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index a012905..70f928f 100644
--- a/README.md
+++ b/README.md
@@ -132,6 +132,12 @@ cp .env.example .env    # fill in your keys
 cd python && pip install -r requirements.txt && python main.py
 ```
 
+## Telemetry
+
+> **Note** Patter collects anonymous, opt-out usage data (SDK version, plus bucketed provider/model and call facts) to help us prioritise. It never collects call content, prompts, phone numbers, keys, or free text.
+>
+> Opt out any time: `Patter(telemetry=False)` in Python, `new Patter({ telemetry: false })` in TypeScript, or set `PATTER_TELEMETRY_DISABLED=1` (`DO_NOT_TRACK=1` is also honoured). Telemetry is off automatically in CI and tests. Full details: [Telemetry](https://docs.getpatter.com/telemetry).
+
 ## Star History
 
 <a href="https://www.star-history.com/?repos=PatterAI%2FPatter&type=date&legend=top-left">