PatterAI · nicolotognoni · Jun 5, 2026 · Jun 6, 2026 · Jun 6, 2026 · Jun 7, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,9 @@
 
 ### Added
 
+- **TypeScript namespace exports for the agent-LLM presets.** `import { hermes, openclaw, openaiCompatible } from "getpatter"` now works alongside the existing `HermesLLM` / `OpenClawLLM` / `OpenAICompatibleLLM` named exports, so `new hermes.LLM()` mirrors Python's `from getpatter.llm import hermes; hermes.LLM()`. `libraries/typescript/src/index.ts`.
+- **`session_key_factory` / `sessionKeyFactory` — per-call long-term memory scope from a caller hash.** `OpenAICompatibleLLM` (and `HermesLLM`) can derive the `X-Hermes-Session-Key` header per call from a `SessionContext` (`call_id` / `caller` / `callee` / `caller_hash`) instead of a static value, so an agent runtime can remember a caller across calls **without the raw phone number ever reaching the wire or the logs**. Shortcut `HermesLLM(session_key_from="caller_hash")` installs a default `patter-caller-<caller_hash>` factory (SHA-256, 16 hex chars). New public `SessionContext` + `hash_caller` / `hashCaller` helper. The factory takes precedence over the static `session_key`; a falsy return omits the header. The loop dispatch was generalised to thread `caller` / `callee` only to providers whose `stream()` declares them (or `**kwargs`), keeping built-in and minimal custom providers unchanged. `libraries/python/getpatter/models.py`, `.../llm/openai_compatible.py`, `.../llm/hermes.py`, `.../services/llm_loop.py` + TypeScript mirrors.
+- **`long_turn_message` / `longTurnMessage` — opt-in spoken filler during a slow turn.** When an LLM turn takes longer than `long_turn_message_after_s` (default 4 s) and no audio has reached the caller yet, Patter speaks a short configurable line (e.g. "One moment, let me check.") instead of dead silence — useful for agent runtimes (Hermes / OpenClaw) that run tools mid-turn. Distinct from `llm_error_message` (which fires on error): this fires on **slowness**, once per turn, gated on emitted audio so it never double-speaks. `None` / unset = off (no behaviour change). `libraries/python/getpatter/models.py`, `.../stream_handler.py`, `.../client.py` + TypeScript mirrors.
 - **Anonymous usage telemetry (opt-out, on by default).** Patter now sends a
   small, anonymous, fail-safe usage event when the SDK is initialised and when
   an engine family is first used, so the maintainers can see which engines,
@@ -69,6 +72,21 @@
   `libraries/python/getpatter/cli.py` / `telemetry/{events,consent,install_id,call_metrics}.py`
   and the TypeScript mirrors; both SDKs verified byte-for-byte at parity.
 
+### Fixed
+
+- **Multi-turn pipeline conversations no longer go silent after the first turn.** The agent answered the first turn but then ignored every subsequent utterance, leaving a ghost metrics turn of `user_text='' agent_text='[interrupted]'`. Two root causes in the pipeline turn-taking state machine:
+  - **Tail-grace misclassified the next turn as a barge-in.** After the agent finishes speaking, `_end_speaking_with_grace` keeps `_is_speaking=true` for `PATTER_TTS_TAIL_GRACE_MS` (default 1500 ms) to swallow the fading TTS echo tail. Humans reply in 200-700 ms — inside that window — so the user's next utterance was treated as a barge-in: it recorded an interrupted turn and the leading audio was withheld from STT (only a ≤260 ms echo-contaminated ring), so no final transcript was produced and the agent never answered. A new `_tail_grace_active` / `tailGraceActive` flag now distinguishes "actively streaming TTS" from "post-TTS echo guard"; a VAD `speech_start` (or a transcript) during the tail grace ends the grace and is dispatched as a clean new turn — recovering the leading audio from the ring instead of dropping it — with no spurious `record_turn_interrupted`. Tunable `PATTER_TTS_TAIL_GRACE_MS` (0 / 200 / 1500) is now safe for fast next-turn speech.
+  - **(Python) A barge-in's per-turn cancel event leaked into the next turn.** `_llm_cancel_event` was only recreated *inside* `_process_streaming_response` — after `LLMLoop.run` had already been handed the (still-set) event for the next turn — so the turn following any real barge-in bailed immediately. The event is now recreated at the top of `_dispatch_turn`, before dispatch (TypeScript already allocated a fresh `AbortController` per turn). `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
+- **Pipeline barge-in now works DURING a turn — including long Hermes/OpenClaw tool-running turns.** The caller could not interrupt the agent mid-response: the STT receive loop awaited the turn's LLM+TTS dispatch inline (`await self._dispatch_turn(...)` / `await this.runPipelineLlm(...)`), so for the whole 30-90 s of a tool-running agent-runtime turn it stopped reading transcripts — a barge-in transcript was only processed *after* the turn ended ("ferma" → answered late). Three coordinated changes, full Python/TS parity:
+  - **Decoupled, single-in-flight dispatch.** The turn now runs as one tracked background task (`_dispatch_task` / `dispatchTask`) so the receive loop keeps draining transcripts and runs barge-in detection against the LIVE turn. Exactly one dispatch is in flight: the loop settles the previous one before launching the next, so `conversation_history` / metrics ordering is unchanged. With no barge-in (default, VAD present, normal LLM) behaviour is unchanged — the loop still awaits the final turn to settle before returning.
+  - **Prompt pre-first-token abort (Python).** Agent runtimes run tools for tens of seconds before the first token, during which the per-chunk `cancel_event` check never runs. The provider now races `create()` + first-byte against the cancel signal and spawns a watchdog that `close()`s the response the instant a barge-in fires, so the request is torn down immediately instead of blocking the next turn (TS already aborts promptly via `fetch` + `AbortController`). The VAD legacy barge-in branch now also sets `_llm_cancel_event` (it previously only flipped `_is_speaking`), and the OpenAI-compatible client uses an explicit httpx read/connect timeout so a dead gateway fails fast.
+  - **`PATTER_FORWARD_STT_WHILE_SPEAKING` (opt-in, default off).** Forwards inbound audio to STT during TTS even with a VAD configured, so the transcript barge-in path can receive a transcript on echo-masked PSTN links where the VAD never fires. The leading-edge ring buffer is still captured. **Echo caveat:** without AEC the agent's own voice may be transcribed as a phantom interruption — pair with `agent.barge_in_strategies`. `libraries/python/getpatter/stream_handler.py`, `.../services/llm_loop.py`, `.../llm/openai_compatible.py`, `libraries/typescript/src/stream-handler.ts`.
+- **Echo-safe barge-in: the agent no longer interrupts itself, and a fast real follow-up is no longer lost.** Hardening for the echo-prone agent-runtime case (`PATTER_FORWARD_STT_WHILE_SPEAKING` on, no AEC), where the agent's own TTS bled into STT and was transcribed (e.g. a garbled fragment in another language not covered by the English hallucination filter), firing a phantom barge-in and leaving an empty `[interrupted]` turn:
+  - **Echo guard** — a language-agnostic check (`_looks_like_echo` / `looksLikeEcho`: substring or ≥60% word overlap against the agent's in-flight spoken text) now drops any candidate barge-in/commit that is the agent's own speech echoing back. Active only while forwarding audio during TTS, so the default VAD path and real post-turn replies are untouched.
+  - **Back-to-back dedup fix** — a final within 500 ms of the previous is now dropped only when it is a *near-duplicate* (Deepgram emitting `speech_final` then `is_final` for the same utterance). A genuinely different fast follow-up (e.g. the real interruption right after a suppressed phantom) is kept instead of being silently swallowed into an empty turn.
+  - **Interrupted-turn context rewrite** — on a confirmed mid-turn barge-in the spoken prefix is recorded in history with an `[interrupted by caller]` marker (instead of an ungrounded full reply), so a stateful agent runtime (Hermes/OpenClaw, keyed by `X-Hermes-Session-Id`) sees on the next turn that it was cut off and what the caller actually heard. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
+- **Forward-STT-without-AEC no longer self-interrupts on its own echo.** The remaining live Hermes/OpenClaw barge-in failure: with `PATTER_FORWARD_STT_WHILE_SPEAKING` on, no AEC, and no `barge_in_strategies`, a VAD `speech_start` during TTS cancelled the turn immediately — but on a no-AEC link that `speech_start` is very often the agent's *own* TTS echo (or pre-first-token line noise during a long tool-running turn). The result was a cascade of false-positive interruptions: a short normal reply like "bene bene" produced `agent_text='[interrupted]'` with `bargein_ms≈0`, and the next turn's LLM ran for seconds but emitted `tts_characters=0` because it was torn down before its first token. The echo guard existed only on the *transcript* path, so the raw VAD-energy cancel had no protection. The VAD-energy cancel is now **deferred to transcript confirmation** whenever audio is forwarded during TTS without AEC (`forward_stt_while_speaking && aec is None`), exactly as it already was when `barge_in_strategies` are configured: the `speech_start` marks the barge-in *pending* (the agent keeps talking) and the cancel only fires once `_handle_barge_in` / `handleBargeIn` sees a real transcript that survives the echo guard; if none confirms within `barge_in_confirm_ms` (default 1500 ms) the agent resumes its sentence. The default VAD path and forward-STT *with* AEC keep the responsive immediate cancel — no behaviour change for existing configs. For the cleanest short-echo handling, still pair with `echo_cancellation=True` or `barge_in_strategies`. `libraries/python/getpatter/stream_handler.py`, `libraries/typescript/src/stream-handler.ts`.
+
 ## 0.6.5 (2026-06-05)
 
 ### Added

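Review note on the echo guard in the CHANGELOG "Fixed" entries above (substring or ≥60% word overlap against the agent's in-flight spoken text). The sketch below is only an illustration of that rule, not the code in `stream_handler.py`. The name `looks_like_echo_sketch`, the tokenisation, the word-boundary substring match, and measuring overlap as the share of candidate words that also occur in the agent text are all assumptions.

```python
import re


def _words(text: str) -> list[str]:
    # Lowercase and split on non-word characters. \w is Unicode-aware in
    # Python 3, so this does not depend on a per-language word list.
    return re.findall(r"\w+", text.lower())


def looks_like_echo_sketch(candidate: str, agent_text: str, threshold: float = 0.6) -> bool:
    """Return True if `candidate` looks like the agent's own speech echoing back.

    Hypothetical helper: the real `_looks_like_echo` may normalise and score
    differently.
    """
    cand_words = _words(candidate)
    agent_words = _words(agent_text)
    if not cand_words or not agent_words:
        return False

    # Rule 1: the candidate is a contiguous run of words from the agent text.
    # Padding with spaces keeps "me" from matching inside "moment".
    cand_norm = " " + " ".join(cand_words) + " "
    agent_norm = " " + " ".join(agent_words) + " "
    if cand_norm in agent_norm:
        return True

    # Rule 2: at least `threshold` of the candidate's words occur in the agent text.
    agent_set = set(agent_words)
    overlap = sum(1 for w in cand_words if w in agent_set) / len(cand_words)
    return overlap >= threshold
```

With this scoring, a fragment such as "moment, let me" is dropped while the agent is saying "One moment, let me check that.", whereas "stop, I need something else" shares no words with it and would pass through as a real interruption.
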
diff --git a/README.md b/README.md
@@ -110,28 +110,11 @@ await phone.serve(agent, tunnel=True)
 
 `tunnel: true` spawns a Cloudflare quick tunnel and points your number at it — ideal for local dev. For production, use a static `webhook_url` (or [ngrok](https://ngrok.com)); see [Tunneling](https://docs.getpatter.com).
 
-## Anonymous Telemetry
+## Telemetry
 
-Patter collects **completely anonymous**, **opt-out** usage telemetry so the maintainers can see how the SDK is used in aggregate — which engines, providers, models, and carriers people choose — and prioritise accordingly. It is on by default, following the open-source norm (Next.js, Astro, Homebrew). **No data we collect is personally identifiable**, and none of it ever contains call content.
-
-**What we collect** (coarse and bucketed):
-
-- SDK version, language (Python/TS), OS family, CPU arch, and runtime version.
-- A random anonymous install id (a UUID, not tied to you) and a per-run id, plus the upgrade funnel (previous → current version) and a first-run activation marker.
-- Deploy shape: container / serverless / cloud / package-manager presence, and whether an AI coding agent invoked the SDK.
-- The composed stack — provider vendor and a sanitized model token per layer (e.g. `anthropic-claude-haiku-4-5`, `deepgram-nova-3`).
-- Agent shape: bucketed tool counts, integration category, and which coarse features are enabled.
-- CLI commands invoked (the command name only) and per-call facts: inbound vs outbound, outcome, error code (the code, never the message), duration, latency, cost, and a bucketed turn count.
-
-**What we never collect:** phone numbers, transcripts, audio, prompts, tool arguments, API keys, customer identifiers, IPs (dropped at the collector), hostnames, file paths, or any free text. Custom or self-hosted model names and custom tool names are structurally impossible to send — they collapse to a vendor bucket or `other` before anything leaves the process.
-
-**Opt out** anytime — any one of:
-
-- `Patter(telemetry=False)` / `new Patter({ telemetry: false })`
-- `getpatter telemetry disable` (persisted; re-enable with `getpatter telemetry enable`)
-- `PATTER_TELEMETRY_DISABLED=1`, or the cross-tool standard `DO_NOT_TRACK=1`
-
-It is auto-disabled in CI and test runs. **Inspect exactly what would be sent, without sending it:** set `PATTER_TELEMETRY_DEBUG=1` (prints each event to stderr and sends nothing), or run `getpatter telemetry status`. Full details: [Telemetry](https://docs.getpatter.com/telemetry).
+> **Note** Patter collects anonymous, opt-out usage data (SDK version, bucketed provider/model and call facts) to help us prioritise — never call content, prompts, phone numbers, keys, or free text.
+>
+> Opt out any time: `Patter(telemetry=False)` (`new Patter({ telemetry: false })`), `getpatter telemetry disable`, or `PATTER_TELEMETRY_DISABLED=1` (also honours `DO_NOT_TRACK=1`); auto-off in CI/tests. Inspect without sending: `PATTER_TELEMETRY_DEBUG=1`. Full details: [Telemetry](https://docs.getpatter.com/telemetry).
 
 ## Templates
 

diff --git a/docs/integrations/hermes.mdx b/docs/integrations/hermes.mdx
@@ -63,6 +63,23 @@ single turn can take **30–90 s**. That is why `HermesLLM` defaults to a **120
 timeout (the generic provider's 60 s, raised for the preset) instead of the short ceiling
 used for raw inference providers — a turn that runs a tool isn't cut off mid-thought.
 
+Because a tool-running turn can leave the caller in **silence** for several seconds, the
+agent supports an opt-in spoken **filler**: set `long_turn_message` / `longTurnMessage`
+(with `long_turn_message_after_s` / `longTurnMessageAfterS`, default 4 s) and Patter speaks
+a short line if no audio has reached the caller yet by then. It fires once per turn, only
+on slowness, and never overlaps the real reply. (A separate `llm_error_message` /
+`llmErrorMessage` covers the gateway-down / timeout **error** case.)
+
+```python
+agent = phone.agent(
+    stt=DeepgramSTT(),
+    llm=HermesLLM(),
+    tts=ElevenLabsTTS(),
+    long_turn_message="One moment, let me check that.",
+    long_turn_message_after_s=4,
+)
+```
+
 <Note>
   **Where the session lives.** Hermes is **stateless** and keys continuity off
   **HTTP headers**, not the OpenAI `user` field. Each phone call maps to **one** Hermes
@@ -83,6 +100,15 @@ used for raw inference providers — a turn that runs a tool isn't cut off mid-t
   const llm = new HermesLLM({ sessionKey: 'customer-42' });
   ```
 
+  For **per-caller memory without storing the raw phone number**, derive the key from a
+  caller hash instead of a static value — `HermesLLM(session_key_from="caller_hash")` /
+  `new HermesLLM({ sessionKeyFrom: 'caller_hash' })` emits
+  `X-Hermes-Session-Key: patter-caller-<hash>` (SHA-256, 16 hex chars), so Hermes
+  remembers a caller across calls while the raw number never reaches the wire or the
+  logs. For a custom scheme, pass `session_key_factory` / `sessionKeyFactory`, a callback
+  that receives a `SessionContext` (`call_id` / `caller` / `callee` / `caller_hash`) and
+  returns the scope value (a falsy return omits the header for that call).
+
   (Patter also still sends `user=patter-call-<call_id>` for upstream-log correlation,
   but that field is **not** what drives the Hermes session — the headers are.)
 </Note>
@@ -138,6 +164,66 @@ gateway that isn't listening.
   `hermes-agent`).
 </Note>
 
+### Zero-config setup (Python)
+
+If you'd rather not wire it up by hand, the Python CLI scaffolds a ready-to-run project,
+checks your environment, and can point your Twilio number at Patter:
+
+```bash
+pip install getpatter
+
+patter hermes doctor        # preflight: gateway, providers, carrier — with fixes
+patter hermes setup         # scaffold ./hermes-phone-agent (app.py, .env, scripts)
+```
+
+`patter hermes doctor` reads your Hermes config directly — it autoloads `~/.hermes/.env`
+and the nearest project `.env`, reports whether `API_SERVER_ENABLED` is set and which
+gateway port is configured, runs `hermes gateway status` when the CLI is present, then
+probes the gateway (`/v1/models`), confirms `HermesLLM` is constructible, and checks your
+Deepgram / ElevenLabs / Twilio credentials — printing a suggested fix for anything missing
+(`--no-network` skips live probes, `--json` for machine-readable output, `--env-file` /
+`--no-env-file` to control autoloading).
+
+`patter hermes setup` writes the same starter project shown in
+[`examples/hermes-phone-agent`](https://github.com/PatterAI/Patter/tree/main/examples/hermes-phone-agent)
+and can also wire the two ends together for you:
+
+- `--enable-hermes` writes `API_SERVER_ENABLED=true` (and generates an `API_SERVER_KEY` if
+  absent) into `~/.hermes/.env`, backing the file up first — then reminds you to restart the
+  gateway. The same key is mirrored into the project `.env` so Patter and Hermes agree (a
+  mismatch is a 401 at call time).
+- `--generate-key` puts a strong `API_SERVER_KEY` into the project `.env`.
+- `--number` + `--url` attach the Twilio webhook in the same run.
+
+To wire an existing number on its own:
+
+```bash
+patter hermes numbers       # list the numbers on your Twilio account
+patter hermes attach-number +15551234567 --url https://<your-tunnel>/calls/inbound
+```
+
+To go from a freshly enabled gateway to a verified one in a single run, add
+`--start-gateway` — `setup` then runs `hermes gateway start` and waits for `/v1/models` to
+answer before continuing. Before placing a real call, run the end-to-end acceptance check,
+which sends an actual chat turn through the gateway (with the Hermes session header) and
+confirms your providers are ready:
+
+```bash
+patter hermes test          # /v1/models + a real /v1/chat/completions turn + provider keys
+```
+
+When a call misbehaves, run the tracer against Patter's per-call logs (`PATTER_LOG_DIR`) to
+see exactly which stage broke (carrier → STT → Hermes → TTS), with a latency breakdown and a
+one-line verdict:
+
+```bash
+patter hermes trace         # latest call's pipeline stages + stt/llm/tts latency
+patter hermes diagnose      # e.g. "Hermes replied but no audio — TTS stage" + the fix
+```
+
+These commands live in the Python SDK today; the `HermesLLM` provider itself is available
+in both the Python and TypeScript SDKs.
+
 ### Running Patter locally
 
 Build a pipeline-mode agent whose LLM is `HermesLLM`. Patter wraps the carrier, STT, and

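Review note on the per-caller memory paragraph added to `hermes.mdx` above. A minimal sketch of what a caller-hash session key and a custom `session_key_factory` callback might look like. `hash_caller_sketch`, `SessionContextSketch`, and `tenant_scoped_key` are hypothetical stand-ins, not the SDK's `hash_caller` / `SessionContext`. The sketch assumes "SHA-256, 16 hex chars" means the first 16 hex characters of the digest of the raw caller string, with no normalisation or salt, and that the factory is a plain synchronous callable.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def hash_caller_sketch(caller: str) -> str:
    # Assumption: truncate the SHA-256 hex digest to 16 characters.
    return hashlib.sha256(caller.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class SessionContextSketch:
    # Field names follow the docs text; the real class may carry more.
    call_id: str
    caller: Optional[str] = None
    callee: Optional[str] = None
    caller_hash: Optional[str] = None


def tenant_scoped_key(ctx: SessionContextSketch) -> Optional[str]:
    # A falsy return omits the header for that call (per the docs),
    # e.g. when the caller id is withheld and no hash is available.
    if not ctx.caller_hash:
        return None
    # "acme" is an illustrative tenant prefix, not an SDK convention.
    return f"acme-{ctx.caller_hash}"


ctx = SessionContextSketch(
    call_id="CA123",
    caller="+15551234567",
    callee="+15557654321",
    caller_hash=hash_caller_sketch("+15551234567"),
)
session_key = tenant_scoped_key(ctx)
```

Going by the docs text, a callback like this would be passed as `HermesLLM(session_key_factory=tenant_scoped_key)`. The key is derived only from `caller_hash`, so the raw number never appears in the header value.
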
diff --git a/examples/hermes-phone-agent/.env.example b/examples/hermes-phone-agent/.env.example
@@ -0,0 +1,23 @@
+# ── Hermes gateway (the brain — keep it on loopback) ──────────────────
+API_SERVER_ENABLED=true
+API_SERVER_HOST=127.0.0.1
+API_SERVER_PORT=8642
+API_SERVER_KEY=choose-a-strong-key
+API_SERVER_MODEL_NAME=hermes-agent
+
+# ── Patter (the voice shell) ──────────────────────────────────────────
+PATTER_PHONE_NUMBER=+15551234567
+PATTER_LANGUAGE=en
+# REST is the safer default for a first PSTN demo; set to ws for streaming.
+PATTER_ELEVENLABS_TRANSPORT=rest
+# Per-call logs — enables `patter hermes trace` / `patter hermes diagnose`.
+PATTER_LOG_DIR=./patter-logs
+
+# ── Twilio carrier ────────────────────────────────────────────────────
+TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+TWILIO_AUTH_TOKEN=your-twilio-auth-token
+
+# ── STT / TTS providers ───────────────────────────────────────────────
+DEEPGRAM_API_KEY=your-deepgram-key
+ELEVENLABS_API_KEY=your-elevenlabs-key
+# ELEVENLABS_VOICE_ID=EXAVITQu4vr4xnSDxMaL
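
Review note on the `.env.example` above. A small stdlib-only sketch of a sanity check over such a file before starting the agent. It is not what `patter hermes doctor` does internally. `parse_env`, `missing_keys`, `gateway_models_url`, and the `REQUIRED` list are hypothetical; the host and port fallbacks are simply the values from the example file, and `/v1/models` is the probe path named in the docs.

```python
REQUIRED = (
    # Assumption: the keys this example project cannot run without.
    "API_SERVER_KEY",
    "PATTER_PHONE_NUMBER",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "DEEPGRAM_API_KEY",
    "ELEVENLABS_API_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    # A key counts as missing when it is absent or set to an empty string.
    return [key for key in REQUIRED if not env.get(key)]


def gateway_models_url(env: dict[str, str]) -> str:
    # Fallbacks mirror the example file; the gateway stays on loopback.
    host = env.get("API_SERVER_HOST", "127.0.0.1")
    port = env.get("API_SERVER_PORT", "8642")
    return f"http://{host}:{port}/v1/models"


sample = """
# Hermes gateway
API_SERVER_HOST=127.0.0.1
API_SERVER_PORT=8642
API_SERVER_KEY=choose-a-strong-key
# ELEVENLABS_VOICE_ID=EXAVITQu4vr4xnSDxMaL
"""
env = parse_env(sample)
print(gateway_models_url(env))
print(missing_keys(env))
```
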