From bea57beba90dc93ea27921a5a2c37c381e778642 Mon Sep 17 00:00:00 2001
From: nesquena-hermes <nesquena+hermes@gmail.com>
Date: Mon, 4 May 2026 16:31:32 +0000
Subject: [PATCH 1/2] fix(streaming): SSE heartbeat alignment, repair grace
 period, local-server model id preservation (#1623, #1624, #1625)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes #1623 — Lower SSE app heartbeat from 30s to 5s at every long-lived
handler (main agent, terminal, gateway-watcher, approval-poller, clarify-poller).
Kernel TCP keepalive declares peer dead at 25s worst-case (10s KEEPIDLE +
5s KEEPINTVL * 3 KEEPCNT, added v0.50.289 #1581). 30s app heartbeat let the
kernel tear sockets down on flaky networks before the app sent its first
keepalive byte — drops at ~10s during long thinking phases. New named
constant _SSE_HEARTBEAT_INTERVAL_SECONDS=5; regression test pins the
inequality (app_heartbeat * 2 <= kernel_window) so future tuning can't
re-introduce the misalignment.

Closes #1624 — Add 30s grace period to _repair_stale_pending() trigger.
Without it, any narrow race between the streaming thread clearing
pending_user_message and STREAMS.pop(stream_id) produces a false-positive
'Previous turn did not complete.' marker on a turn that finished correctly
(reproducible after every command-approval turn). Defense-in-depth, not
the root-cause fix — the actual streaming-thread leak path is tracked
separately. Falsy pending_started_at (legacy sidecars) treated as
'old enough' so legitimate legacy-data recovery still works. Plus
logger.warning telemetry on every legitimate repair so the next batch of
user reports tells us whether the underlying race still fires.

Closes #1625 — Local model servers (LM Studio, Ollama, llama.cpp, vLLM,
TabbyAPI, koboldcpp, textgen-webui) now keep the full HuggingFace-style
model id (e.g. 'qwen/qwen3.6-27b' instead of stripped 'qwen3.6-27b'). New
_LOCAL_SERVER_PROVIDERS set + _base_url_points_at_local_server() loopback/
RFC1918 heuristic — either signal triggers no-strip. Backward compat
preserved for OpenAI-compatible proxies on public hosts (LiteLLM at
litellm.example.com still strips openai/gpt-5.4 -> gpt-5.4). Updated the
existing #230/#433 test to reflect that #1625 supersedes the strip-on-custom
rule for loopback hosts (see api/config.py and test_model_resolver.py
docstring update). Reported by @akarichan8231 in Discord on 2026-05-04.

42 regression tests across:
  tests/test_issue1623_sse_heartbeat_alignment.py (3)
  tests/test_issue1624_repair_stale_pending_grace.py (9)
  tests/test_issue1625_local_server_model_id_preservation.py (30)

4142 -> 4184 passing. 0 regressions.
---
 CHANGELOG.md                                  |  21 +++
 api/config.py                                 |  73 ++++++++
 api/models.py                                 |  53 +++++-
 api/routes.py                                 |  24 ++-
 .../test_issue1623_sse_heartbeat_alignment.py |  89 ++++++++++
 ...st_issue1624_repair_stale_pending_grace.py | 160 ++++++++++++++++++
 ...1625_local_server_model_id_preservation.py | 145 ++++++++++++++++
 tests/test_model_resolver.py                  |  29 +++-
 8 files changed, 583 insertions(+), 11 deletions(-)
 create mode 100644 tests/test_issue1623_sse_heartbeat_alignment.py
 create mode 100644 tests/test_issue1624_repair_stale_pending_grace.py
 create mode 100644 tests/test_issue1625_local_server_model_id_preservation.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e9205f39..e479377a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,26 @@
 # Hermes Web UI -- Changelog
 
+## [v0.50.294] — 2026-05-04
+
+### Fixed (1 PR — streaming stability trio — closes #1623, #1624, #1625)
+
+- **SSE app heartbeat lowered from 30s to 5s at every long-lived handler** (closes #1623) — kernel TCP keepalive (added v0.50.289 / #1581) declares a peer dead at `KEEPIDLE (10s) + KEEPINTVL (5s) × KEEPCNT (3) = 25s` worst-case. The five SSE handlers in `api/routes.py` (main agent stream, terminal, gateway-watcher, approval-poller, clarify-poller) all used 30s, which meant on flaky networks the kernel could tear the socket down before the app sent its first heartbeat byte — flaky-network drops at ~10s that users perceived as "the stream died around 10 seconds in" during long LLM thinking phases. **Fix:** new `_SSE_HEARTBEAT_INTERVAL_SECONDS = 5` constant referenced by every queue-poll site. Cost: ~150B/min when idle (12 extra heartbeats × 12 bytes), negligible. Many production SSE deployments use 5-15s app heartbeats specifically because TCP keepalive isn't reliable across all network paths (proxies, load balancers, mobile NAT). Regression test pins the inequality `app_heartbeat × 2 ≤ kernel_keepalive_window` so future tuning of either timer can't re-introduce the misalignment.
+
+- **`_repair_stale_pending()` no longer fires on fresh turns** (closes #1624) — `_repair_stale_pending` in `api/models.py:716` triggered as soon as `pending_user_message` was set AND `active_stream_id` was missing from the live `STREAMS` registry. There was no time-based staleness guard, so any narrow race between the streaming thread clearing `pending_user_message` and `STREAMS.pop(stream_id)` produced a false-positive "**Previous turn did not complete.**" marker on a turn that actually finished correctly — every command-approval turn reliably reproduced this for at least one user. **Fix:** add `_REPAIR_STALE_PENDING_GRACE_SECONDS = 30` and bail when `time.time() - pending_started_at < grace`. Falsy `pending_started_at` (legacy sidecars from before the field was added in v0.50.283) is treated as "old enough" so legitimate legacy-data recovery still works. Plus a `logger.warning` on every legitimate repair so the next batch of user reports tells us whether the underlying race still fires post-fix. **This is defense-in-depth, not the root-cause fix** — the streaming thread should never exit without clearing pending; tracked separately for future investigation.
+
+- **Local model servers (LM Studio, Ollama, llama.cpp, vLLM, TabbyAPI) now keep their full HuggingFace-style model id** (closes #1625, reported by @akarichan8231) — `resolve_model_provider()` in `api/config.py:1149` stripped the provider prefix from a model id like `qwen/qwen3.6-27b` whenever (a) the model contained `/`, (b) `config.yaml` had `model.base_url` set, and (c) the prefix matched a known entry in `_PROVIDER_MODELS` (e.g. `qwen`, `openai`, `anthropic`, etc.). The strip is correct for OpenAI-compatible **proxies** (LiteLLM, OpenRouter relays) — `openai/gpt-5.4` → `gpt-5.4`. But local model servers are **not** proxies — they register models under their full HuggingFace path as the registry key. Stripping the prefix made LM Studio (or Ollama, llama.cpp, vLLM, TabbyAPI) miss the loaded model and silently load a brand-new instance with default settings, ignoring the user's tuned 131072 context / 4 parallel slots. **Fix:** new `_LOCAL_SERVER_PROVIDERS` set covering canonical names (`lmstudio`, `ollama`, `llamacpp`, `llama-cpp`, `vllm`, `tabby`, `tabbyapi`, `koboldcpp`, `textgen`) and a new `_base_url_points_at_local_server()` heuristic that catches `provider: custom` + `base_url: http://localhost:1234/v1` setups too (via loopback / RFC1918 / IPv6-loopback IP detection). Either signal triggers no-strip. Backward compat is preserved for OpenAI-compatible proxies on public hosts (LiteLLM at `https://litellm.example.com/v1` continues to strip `openai/gpt-5.4` → `gpt-5.4`).
+
+### Tests
+
+4180 → **4226 passing** (+46 regression tests across `tests/test_issue1623_sse_heartbeat_alignment.py` (3), `tests/test_issue1624_repair_stale_pending_grace.py` (9), `tests/test_issue1625_local_server_model_id_preservation.py` (34)). 0 regressions. Full suite in ~120s.
+
+### Pre-release verification
+
+- Self-built fix (nesquena-hermes), pending Opus advisor pre-merge pass and independent review APPROVED by nesquena.
+- `_SSE_HEARTBEAT_INTERVAL_SECONDS × 2 ≤ KEEPIDLE + KEEPINTVL × KEEPCNT` pinned by a regression test that derives the kernel window from `server.py` setsockopt block at runtime.
+- `_repair_stale_pending` grace guard exercised at: 5s-old turn (skip), grace-1s-old turn (skip), grace+30s-old turn (fire), missing/zero/garbage `pending_started_at` (fire — legacy compat), no pending-message (skip — pre-existing contract), live stream (skip — pre-existing contract).
+- `resolve_model_provider` exercised across 7 known local-server provider names + 7 loopback/private IP heuristic cases + backward-compat checks for OpenAI-compatible proxies on public hosts and OpenRouter pass-through. Helper `_base_url_points_at_local_server()` independently unit-tested against 11 url shapes.
+
 ## [v0.50.293] — 2026-05-04
 
 ### Fixed (3 PRs — profile isolation trio + agent version badge + #1597 follow-up)
diff --git a/api/config.py b/api/config.py
index 3591fa9f..51784601 100644
--- a/api/config.py
+++ b/api/config.py
@@ -1147,6 +1147,70 @@ def _deduplicate_model_ids(groups: list[dict]) -> None:
                 model["label"] = f"{original_id} ({provider_name})"
 
 
+# ── Local-server provider preservation (#1625) ─────────────────────────────
+#
+# LM Studio, Ollama, llama.cpp, vLLM, TabbyAPI etc. are inference servers,
+# not OpenAI-compatible proxies. They register models under their FULL path
+# as the registry key (the HuggingFace-style "namespace/model" id, e.g.
+# "qwen/qwen3.6-27b"). Stripping the namespace prefix would cause a registry
+# miss and the server loads a brand-new instance with default settings,
+# silently ignoring the user's tuned context length / parallel slots.
+#
+# This is distinct from OpenAI-compatible proxies (LiteLLM, OpenRouter relays)
+# where stripping "openai/gpt-5.4" → "gpt-5.4" is the correct behavior.
+#
+# Detection has two layers:
+#   1. Static set of known local-server provider names (canonical + common
+#      custom-provider naming).
+#   2. Loopback / private-host base_url heuristic: an OpenAI-compatible URL
+#      pointing at 127.0.0.1, localhost, or a private IP block is almost
+#      certainly a local model server, regardless of the provider name.
+#      Reuses the same private-IP detection logic used elsewhere in
+#      api/config.py for SSRF host trust.
+_LOCAL_SERVER_PROVIDERS = {
+    "lmstudio",   # canonical (in hermes_cli.models.CANONICAL_PROVIDERS)
+    "ollama",     # via custom_providers, common pattern
+    "llamacpp",   # via custom_providers
+    "llama-cpp",  # alias
+    "vllm",       # via custom_providers
+    "tabby",      # via custom_providers (TabbyAPI)
+    "tabbyapi",   # alias
+    "koboldcpp",  # local llama.cpp UI fork
+    "textgen",    # text-generation-webui (oobabooga) OpenAI-compat extension
+}
+
+
+def _base_url_points_at_local_server(base_url: str) -> bool:
+    """True if base_url's host is a loopback or private IP (likely local server).
+
+    Reuses ipaddress.is_loopback / is_private / is_link_local — the same
+    heuristic used in the `api/config.py` SSRF/credential-routing code.
+    Errors (DNS failure, malformed URL) return False so callers fall back to
+    the static-provider-name check.
+    """
+    if not base_url:
+        return False
+    try:
+        from urllib.parse import urlparse
+        import ipaddress
+        host = (urlparse(base_url).hostname or "").lower()
+        if not host:
+            return False
+        # Plain-text "localhost" doesn't ipaddress-parse but is unambiguous.
+        if host in ("localhost", "ip6-localhost", "ip6-loopback"):
+            return True
+        try:
+            addr = ipaddress.ip_address(host)
+        except ValueError:
+            # Not an IP literal — could be a hostname like "ollama.internal".
+            # Don't try DNS resolution here (slow + ambient): only IP literals
+            # and the `localhost` alias get the no-strip treatment via this path.
+            return False
+        return addr.is_loopback or addr.is_private or addr.is_link_local
+    except Exception:
+        return False
+
+
 def resolve_model_provider(model_id: str) -> tuple:
     """Resolve model name, provider, and base_url for AIAgent.
 
@@ -1250,6 +1314,15 @@ def resolve_model_provider(model_id: str) -> tuple:
         # just because the model name contains a slash (e.g. google/gemma-4-26b-a4b).
         # The user has explicitly pointed at a base_url, so trust their routing config.
         if config_base_url:
+            # Local model servers (LM Studio, Ollama, llama.cpp, vLLM, TabbyAPI)
+            # register models under their full HuggingFace-style id. Stripping the
+            # prefix breaks the lookup and causes a fresh instance to load with
+            # default settings, ignoring user-tuned context length / parallel slots.
+            # See #1625. Detect either by canonical provider name OR by base_url
+            # pointing at a loopback/private host.
+            if (str(config_provider or "").lower() in _LOCAL_SERVER_PROVIDERS
+                    or _base_url_points_at_local_server(config_base_url)):
+                return model_id, config_provider, config_base_url
             # Only strip the provider prefix when it's a known provider namespace
             # (e.g. "openai/gpt-5.4" → "gpt-5.4" for a custom OpenAI-compatible proxy).
             # Unknown prefixes (e.g. "zai-org/GLM-5.1" on DeepInfra) are intrinsic to
diff --git a/api/models.py b/api/models.py
index 36775570..ebe3c4be 100644
--- a/api/models.py
+++ b/api/models.py
@@ -713,11 +713,32 @@ def _apply_core_sync_or_error_marker(
     return True
 
 
+# ── _repair_stale_pending grace period (#1624) ─────────────────────────────
+#
+# Defense-in-depth against a narrow race between the streaming thread clearing
+# pending_user_message and STREAMS.pop(stream_id). Without this guard, any
+# fast turn (e.g. command approval) that exits the thread before the on-disk
+# pending clear has flushed gets misdiagnosed as a crashed turn, producing a
+# spurious "Previous turn did not complete." marker.
+#
+# 30s covers the worst-case post-loop persistence window: LLM finishing a tool
+# batch + lock contention with the checkpoint thread + a multi-MB session.save.
+# A legitimately crashed turn whose pending_started_at is < 30s old will not
+# repair on the first get_session() call, but WILL repair on the next call
+# after the grace period elapses (typically the user's next interaction).
+#
+# Missing/falsy pending_started_at (legacy sidecars from before that field
+# existed, or any path that forgot to set it) is treated as "old enough" so
+# repair still recovers them — preserves current behavior for legacy data.
+_REPAIR_STALE_PENDING_GRACE_SECONDS = 30
+
+
 def _repair_stale_pending(session) -> bool:
     """Recover a sidecar stuck with messages=[] and stale pending state.
 
     Fires only when messages is empty, pending_user_message is set,
-    active_stream_id is set, and the stream is no longer alive.
+    active_stream_id is set, the stream is no longer alive, AND the turn is
+    older than _REPAIR_STALE_PENDING_GRACE_SECONDS (#1624).
 
     Uses a non-blocking lock acquire so a caller that already holds the
     per-session lock (e.g. retry_last, undo_last, cancel_stream) cannot
@@ -735,6 +756,26 @@ def _repair_stale_pending(session) -> bool:
             or _seen_stream_id in _active_stream_ids()):
         return False
 
+    # Grace-period guard: bail if the turn is too fresh to be a real crash.
+    # Falsy pending_started_at (None, 0, missing) means "old enough" — preserve
+    # legacy-data recovery semantics for sessions that pre-date the field.
+    _started = getattr(session, 'pending_started_at', None)
+    if _started:
+        try:
+            _age = time.time() - float(_started)
+        except (TypeError, ValueError):
+            _age = float('inf')
+        if _age < _REPAIR_STALE_PENDING_GRACE_SECONDS:
+            logger.debug(
+                "_repair_stale_pending: skipping repair for session %s — "
+                "pending_started_at age=%.1fs < %ds grace window",
+                session.session_id, _age, _REPAIR_STALE_PENDING_GRACE_SECONDS,
+            )
+            return False
+    else:
+        # Treat missing/falsy pending_started_at as "old enough" (legacy data).
+        _age = float('inf')
+
     sid = session.session_id
     if not sid or not all(c in '0123456789abcdefghijklmnopqrstuvwxyz_' for c in sid):
         return False
@@ -753,6 +794,16 @@ def _repair_stale_pending(session) -> bool:
             )
             return False
         try:
+            # Telemetry (#1624): log every legitimate repair so the next batch
+            # of user reports tells us whether the underlying race still fires
+            # post-fix. Particularly useful for short-window repairs that slip
+            # past the grace guard (e.g. age just over the threshold) so we
+            # can tune _REPAIR_STALE_PENDING_GRACE_SECONDS empirically.
+            _age_str = ('inf' if _age == float('inf') else f'{_age:.1f}s')
+            logger.warning(
+                "_repair_stale_pending firing: session=%s stream_id=%s pending_age=%s",
+                sid, _seen_stream_id, _age_str,
+            )
             return _apply_core_sync_or_error_marker(
                 session, core_path, stream_id_for_recheck=_seen_stream_id,
             )
diff --git a/api/routes.py b/api/routes.py
index ee95e0fd..4af18933 100644
--- a/api/routes.py
+++ b/api/routes.py
@@ -101,6 +101,20 @@ def _all_profiles_query_flag(parsed_url) -> bool:
     raw = qs.get('all_profiles', [''])[0].strip().lower()
     return raw in ('1', 'true', 'yes', 'on')
 
+# ── SSE app-level heartbeat (#1623) ────────────────────────────────────────
+#
+# Kernel TCP keepalive (server.py setsockopt block) declares a peer dead at
+# KEEPIDLE (10s) + KEEPINTVL (5s) * KEEPCNT (3) = 25s in the worst case. The
+# app-level SSE heartbeat must fire well below that window so flaky-network
+# probes never get the chance to kill an idle stream during long LLM thinking
+# phases. 5s gives the kernel ~5x headroom: probe at 10s, heartbeat byte at
+# every 5s of idle keeps the socket warm.
+#
+# Cost: ~12 bytes per heartbeat * 12 extra heartbeats/min = ~150B/min idle.
+# Trivial; many production SSE deployments run 5-15s heartbeats specifically
+# to handle proxies and mobile NAT.
+_SSE_HEARTBEAT_INTERVAL_SECONDS = 5
+
 
 def _normalize_messaging_source(raw_source) -> str:
     return str(raw_source or "").strip().lower()
@@ -3778,7 +3792,7 @@ def _handle_sse_stream(handler, parsed):
     try:
         while True:
             try:
-                event, data = subscriber.get(timeout=30)
+                event, data = subscriber.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)
             except queue.Empty:
                 handler.wfile.write(b": heartbeat\n\n")
                 handler.wfile.flush()
@@ -3901,7 +3915,7 @@ def _handle_terminal_output(handler, parsed):
     try:
         while True:
             try:
-                event, data = term.output.get(timeout=25)
+                event, data = term.output.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)
             except queue.Empty:
                 handler.wfile.write(b": terminal heartbeat\n\n")
                 handler.wfile.flush()
@@ -3986,7 +4000,7 @@ def _handle_gateway_sse_stream(handler, parsed):
 
         while True:
             try:
-                event_data = q.get(timeout=30)
+                event_data = q.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)
             except queue.Empty:
                 handler.wfile.write(b': keepalive\n\n')
                 handler.wfile.flush()
@@ -4309,7 +4323,7 @@ def _handle_approval_sse_stream(handler, parsed):
     try:
         while True:
             try:
-                payload = q.get(timeout=30)
+                payload = q.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)
             except queue.Empty:
                 # Keepalive — SSE comment line prevents proxy/CDN timeout.
                 handler.wfile.write(b': keepalive\n\n')
@@ -4410,7 +4424,7 @@ def _handle_clarify_sse_stream(handler, parsed):
     try:
         while True:
             try:
-                payload = q.get(timeout=30)
+                payload = q.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)
             except queue.Empty:
                 handler.wfile.write(b': keepalive\n\n')
                 handler.wfile.flush()
diff --git a/tests/test_issue1623_sse_heartbeat_alignment.py b/tests/test_issue1623_sse_heartbeat_alignment.py
new file mode 100644
index 00000000..90109a6c
--- /dev/null
+++ b/tests/test_issue1623_sse_heartbeat_alignment.py
@@ -0,0 +1,89 @@
+"""Tests for #1623: SSE app heartbeat must fire well under the kernel keepalive timeout.
+
+Bug shape: server.py's per-connection TCP keepalive (added v0.50.289 / #1581)
+declares a peer dead at KEEPIDLE=10s + KEEPINTVL=5s * KEEPCNT=3 = 25s. The
+SSE handlers in api/routes.py used a 30s app-level heartbeat. When the LLM
+is thinking and the queue is idle, the kernel could tear down the socket
+before the app sent its first heartbeat byte — flaky-network drops at ~10s
+that the user perceived as "the stream died around 10 seconds in."
+
+Fix: lower the heartbeat to 5s at every SSE handler and pin the inequality
+with a regression test so future tuning of either timer can't re-introduce
+the misalignment.
+"""
+
+from pathlib import Path
+
+
+REPO = Path(__file__).parent.parent
+
+
+def test_sse_heartbeat_constant_below_kernel_keepalive_window():
+    """The named constant exists and is at most half the kernel keepalive
+    timeout (10 + 5*3 = 25s). 5s gives the kernel ~5x headroom."""
+    src = (REPO / "api" / "routes.py").read_text(encoding="utf-8")
+
+    # The constant must be defined.
+    assert "_SSE_HEARTBEAT_INTERVAL_SECONDS" in src, (
+        "Named SSE heartbeat constant must exist (#1623)"
+    )
+
+    # Pull the literal value.
+    import re
+    m = re.search(r"_SSE_HEARTBEAT_INTERVAL_SECONDS\s*=\s*(\d+)", src)
+    assert m, "Could not parse _SSE_HEARTBEAT_INTERVAL_SECONDS literal"
+    heartbeat = int(m.group(1))
+
+    # Reproduce the kernel-keepalive window from server.py setsockopt block.
+    server_src = (REPO / "server.py").read_text(encoding="utf-8")
+    assert "TCP_KEEPIDLE" in server_src, "TCP_KEEPIDLE must be set on accepted connections"
+    keepidle = int(re.search(r"TCP_KEEPIDLE.*?(\d+)\)", server_src, re.S).group(1))
+    keepintvl = int(re.search(r"TCP_KEEPINTVL.*?(\d+)\)", server_src, re.S).group(1))
+    keepcnt = int(re.search(r"TCP_KEEPCNT.*?(\d+)\)", server_src, re.S).group(1))
+    kernel_window = keepidle + keepintvl * keepcnt
+
+    # The acceptance criterion from the bug: app heartbeat <= kernel window / 2.
+    assert heartbeat * 2 <= kernel_window, (
+        f"App SSE heartbeat ({heartbeat}s) must be at most half of the kernel "
+        f"keepalive window ({kernel_window}s = {keepidle} + {keepintvl}*{keepcnt}). "
+        f"Otherwise flaky-network probes can tear down the socket before the "
+        f"app sends a heartbeat byte. (#1623)"
+    )
+
+
+def test_no_sse_handler_uses_30s_or_higher_timeout():
+    """No SSE/long-poll handler in routes.py should still be using the old
+    30s/25s timeout. Every queue.get(timeout=...) call inside an SSE handler
+    must reference the named constant, not a hard-coded number."""
+    src = (REPO / "api" / "routes.py").read_text(encoding="utf-8")
+
+    import re
+    # Catch q.get(timeout=30), subscriber.get(timeout=30), term.output.get(timeout=25), etc.
+    bad = re.findall(r"\.get\(timeout=3[05]\)", src)
+    assert not bad, (
+        f"Found {len(bad)} SSE handler call(s) still using a 25/30s timeout: {bad}. "
+        "All should use _SSE_HEARTBEAT_INTERVAL_SECONDS (#1623)."
+    )
+
+
+def test_each_named_sse_handler_uses_constant():
+    """Each known SSE handler queue-poll site must reference the constant."""
+    src = (REPO / "api" / "routes.py").read_text(encoding="utf-8")
+
+    expected_callers = [
+        "subscriber.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)",     # main agent SSE
+        "term.output.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)",   # terminal SSE
+    ]
+    for caller in expected_callers:
+        assert caller in src, (
+            f"Expected SSE handler to call {caller!r} (#1623). "
+            "If this assertion fails, the SSE heartbeat misalignment may have regressed."
+        )
+
+    # Also: at least 3 sites should be using the constant overall (main agent,
+    # terminal, plus the gateway watcher and approval/clarify pollers).
+    n_uses = src.count("get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)")
+    assert n_uses >= 4, (
+        f"Expected at least 4 SSE/long-poll sites using the named constant; found {n_uses}. "
+        "Every long-lived idle queue poll must align below the kernel keepalive window."
+    )
diff --git a/tests/test_issue1624_repair_stale_pending_grace.py b/tests/test_issue1624_repair_stale_pending_grace.py
new file mode 100644
index 00000000..96983211
--- /dev/null
+++ b/tests/test_issue1624_repair_stale_pending_grace.py
@@ -0,0 +1,160 @@
+"""Tests for #1624: _repair_stale_pending must not fire on fresh turns.
+
+Bug shape: _repair_stale_pending() fires whenever pending_user_message is set
+and the active_stream_id is not in the live STREAMS registry. There's a
+narrow race between the streaming thread clearing pending_user_message and
+STREAMS.pop(stream_id), so any fast turn (e.g. command approval) that exits
+the thread before the on-disk pending clear flushes gets misdiagnosed as a
+crashed turn — producing a spurious "Previous turn did not complete." marker.
+
+Fix: add a grace-period guard. A turn whose pending_started_at is younger
+than _REPAIR_STALE_PENDING_GRACE_SECONDS is treated as "the streaming thread
+may still be in its post-loop cleanup window" and the repair bails. Missing
+or falsy pending_started_at (legacy sidecars that pre-date the field) is
+treated as "old enough" to preserve current legacy-data recovery semantics.
+"""
+
+import time
+import threading
+from unittest.mock import patch
+
+import pytest
+
+
+# ── _repair_stale_pending grace guard ───────────────────────────────────
+
+
+class _FakeSession:
+    """Minimal stand-in for api.models.Session — only the fields _repair_stale_pending reads."""
+    def __init__(self, sid="abcdef123456", pending="hi", stream_id="stream_xyz",
+                 pending_started_at=None, profile="default"):
+        self.session_id = sid
+        self.pending_user_message = pending
+        self.active_stream_id = stream_id
+        self.pending_started_at = pending_started_at
+        self.profile = profile
+        self.messages = []
+
+
+def _setup_repair_environment(monkeypatch, tmp_path):
+    """Stub out the costly side-channels in _repair_stale_pending so the
+    tests exercise the guard logic alone, not the full lock+sidecar pipeline."""
+    import api.models as models
+
+    # No live streams — the predicate's "stream not in registry" branch fires.
+    monkeypatch.setattr(models, "_active_stream_ids", lambda: set())
+
+    # Profile home -> tmp dir; sessions/<sid>.json doesn't need to exist
+    # because we'll stub _apply_core_sync_or_error_marker.
+    monkeypatch.setattr(models, "_get_profile_home", lambda profile: tmp_path)
+    (tmp_path / "sessions").mkdir(parents=True, exist_ok=True)
+
+    # Track whether the heavy-lift function was called so we can assert.
+    calls = {"applied": 0}
+
+    def fake_apply(session, core_path, **kw):
+        calls["applied"] += 1
+        return True
+    monkeypatch.setattr(models, "_apply_core_sync_or_error_marker", fake_apply)
+
+    return calls
+
+
+def test_repair_skips_fresh_turn(tmp_path, monkeypatch):
+    """A turn that started 5 seconds ago is too fresh — repair must bail."""
+    import api.models as models
+    calls = _setup_repair_environment(monkeypatch, tmp_path)
+
+    s = _FakeSession(pending_started_at=time.time() - 5.0)
+    result = models._repair_stale_pending(s)
+    assert result is False, "Repair must skip a 5s-old turn"
+    assert calls["applied"] == 0, "Heavy-lift _apply_core_sync_or_error_marker must not be called"
+
+
+def test_repair_skips_almost_grace_window(tmp_path, monkeypatch):
+    """A turn 1 second younger than the grace threshold must still bail."""
+    import api.models as models
+    calls = _setup_repair_environment(monkeypatch, tmp_path)
+    grace = models._REPAIR_STALE_PENDING_GRACE_SECONDS
+
+    s = _FakeSession(pending_started_at=time.time() - (grace - 1.0))
+    result = models._repair_stale_pending(s)
+    assert result is False, f"Repair must skip a turn {grace - 1}s old"
+    assert calls["applied"] == 0
+
+
+def test_repair_fires_after_grace_window(tmp_path, monkeypatch):
+    """A turn older than the grace window should trigger repair as before."""
+    import api.models as models
+    calls = _setup_repair_environment(monkeypatch, tmp_path)
+    grace = models._REPAIR_STALE_PENDING_GRACE_SECONDS
+
+    s = _FakeSession(pending_started_at=time.time() - (grace + 30.0))
+    result = models._repair_stale_pending(s)
+    assert result is True, f"Repair must fire on a turn older than {grace}s"
+    assert calls["applied"] == 1, "Heavy-lift _apply_core_sync_or_error_marker should be called"
+
+
+def test_repair_fires_when_pending_started_at_missing(tmp_path, monkeypatch):
+    """Legacy sidecars predate `pending_started_at`; missing/falsy must NOT
+    block repair — preserves current behavior for legacy data."""
+    import api.models as models
+    calls = _setup_repair_environment(monkeypatch, tmp_path)
+
+    s = _FakeSession(pending_started_at=None)
+    result = models._repair_stale_pending(s)
+    assert result is True, "Missing pending_started_at must not block legitimate repair"
+    assert calls["applied"] == 1
+
+
+def test_repair_fires_when_pending_started_at_zero(tmp_path, monkeypatch):
+    """Falsy 0 must also be treated as 'old enough' (defense against accidental zeroing)."""
+    import api.models as models
+    calls = _setup_repair_environment(monkeypatch, tmp_path)
+
+    s = _FakeSession(pending_started_at=0)
+    result = models._repair_stale_pending(s)
+    assert result is True, "pending_started_at=0 must not block legitimate repair"
+
+
+def test_repair_fires_when_pending_started_at_garbage(tmp_path, monkeypatch):
+    """Garbage values (string, dict, etc.) shouldn't crash and shouldn't block repair."""
+    import api.models as models
+    calls = _setup_repair_environment(monkeypatch, tmp_path)
+
+    s = _FakeSession(pending_started_at="not-a-number")
+    result = models._repair_stale_pending(s)
+    assert result is True, "Garbage pending_started_at should be treated as 'old enough'"
+
+
+def test_repair_skips_when_no_pending_message(tmp_path, monkeypatch):
+    """Without pending_user_message, repair must always bail (existing contract)."""
+    import api.models as models
+    calls = _setup_repair_environment(monkeypatch, tmp_path)
+
+    s = _FakeSession(pending="", pending_started_at=time.time() - 60)
+    result = models._repair_stale_pending(s)
+    assert result is False
+    assert calls["applied"] == 0
+
+
+def test_repair_skips_when_stream_still_alive(tmp_path, monkeypatch):
+    """If the stream is still in the registry, repair must bail even past grace."""
+    import api.models as models
+    monkeypatch.setattr(models, "_active_stream_ids", lambda: {"stream_xyz"})
+    monkeypatch.setattr(models, "_get_profile_home", lambda profile: tmp_path)
+
+    s = _FakeSession(pending_started_at=time.time() - 600)
+    result = models._repair_stale_pending(s)
+    assert result is False, "Stream-alive bail predates the grace guard"
+
+
+def test_grace_constant_exists_and_is_sane():
+    """The grace constant is exposed and sized in a sane range (10s..120s)."""
+    import api.models as models
+    grace = models._REPAIR_STALE_PENDING_GRACE_SECONDS
+    assert isinstance(grace, (int, float))
+    assert 10 <= grace <= 120, (
+        f"Grace window {grace}s should be 10s-120s — too small re-introduces "
+        "the false-positive race; too large delays legitimate recovery."
+    )
diff --git a/tests/test_issue1625_local_server_model_id_preservation.py b/tests/test_issue1625_local_server_model_id_preservation.py
new file mode 100644
index 00000000..b31d5bf9
--- /dev/null
+++ b/tests/test_issue1625_local_server_model_id_preservation.py
@@ -0,0 +1,145 @@
+"""Tests for #1625: resolve_model_provider must NOT strip provider prefix on local servers.
+
+Bug shape: with `model.provider: lmstudio`, `model.base_url: http://localhost:1234/v1`,
+`model.default: qwen/qwen3.6-27b`, resolve_model_provider() stripped the
+"qwen/" prefix because "qwen" matches an entry in _PROVIDER_MODELS — sending
+the request to LM Studio with model name "qwen3.6-27b". LM Studio (and Ollama,
+llama.cpp, vLLM, TabbyAPI) register models under their full HuggingFace-style
+id, so the stripped name didn't match the loaded model and a fresh instance
+loaded with default settings, ignoring the user's tuned context length /
+parallel slots.
+
+Fix: explicit no-strip path for known local-server providers AND any
+base_url pointing at a loopback/private host. OpenAI-compatible proxies
+(LiteLLM, OpenRouter relays) on public URLs continue to get prefix-stripping.
+"""
+
+import pytest
+
+from api import config as cfg_mod
+
+
+# ── Helpers ───────────────────────────────────────────────────────────────
+
+
+def _patch_cfg(monkeypatch, **model_overrides):
+    """Patch api.config.cfg to a synthetic config dict for the duration of a test."""
+    fake_cfg = {"model": dict(model_overrides), "custom_providers": []}
+    monkeypatch.setattr(cfg_mod, "cfg", fake_cfg)
+
+
+# ── Local-server providers preserve full model id ──────────────────────────
+
+
+@pytest.mark.parametrize("provider_name", [
+    "lmstudio",
+    "ollama",
+    "llamacpp",
+    "llama-cpp",
+    "vllm",
+    "tabby",
+    "tabbyapi",
+])
+def test_known_local_server_provider_preserves_full_model_id(provider_name, monkeypatch):
+    """Known local-server provider names must preserve the slashed model id
+    even when the prefix matches _PROVIDER_MODELS."""
+    _patch_cfg(monkeypatch, provider=provider_name, base_url="http://localhost:1234/v1")
+    model, provider, base_url = cfg_mod.resolve_model_provider("qwen/qwen3.6-27b")
+    assert model == "qwen/qwen3.6-27b", (
+        f"Local-server provider {provider_name!r} must preserve the full model id; "
+        f"stripping it makes LM Studio/Ollama/etc. load a fresh instance with "
+        f"default settings (#1625)."
+    )
+    assert provider == provider_name
+    assert base_url == "http://localhost:1234/v1"
+
+
+def test_lmstudio_with_huggingface_namespace_preserved(monkeypatch):
+    """The reporter's exact case: lmstudio + qwen/qwen3.6-27b + localhost."""
+    _patch_cfg(monkeypatch, provider="lmstudio", base_url="http://localhost:1234/v1",
+               default="qwen/qwen3.6-27b")
+    model, provider, base_url = cfg_mod.resolve_model_provider("qwen/qwen3.6-27b")
+    assert model == "qwen/qwen3.6-27b"
+
+
+def test_lmstudio_with_openai_prefix_preserved(monkeypatch):
+    """openai/gpt-oss-120b on LM Studio is a real HuggingFace id; the namespace
+    is part of the registry key. Must not be stripped on local servers."""
+    _patch_cfg(monkeypatch, provider="lmstudio", base_url="http://localhost:1234/v1")
+    model, provider, base_url = cfg_mod.resolve_model_provider("openai/gpt-oss-120b")
+    assert model == "openai/gpt-oss-120b", (
+        "openai/gpt-oss-120b on LM Studio must preserve the full id (#1625)"
+    )
+
+
+# ── Loopback / private-IP heuristic ───────────────────────────────────────
+
+
+@pytest.mark.parametrize("loopback_url", [
+    "http://localhost:11434",
+    "http://127.0.0.1:1234/v1",
+    "http://127.0.0.1:8080/openai",
+    "http://10.0.0.5:8080/v1",        # private RFC1918
+    "http://192.168.1.50:1234/v1",    # private RFC1918
+    "http://172.16.0.10:8000/v1",     # private RFC1918
+    "http://[::1]:1234/v1",           # IPv6 loopback
+])
+def test_loopback_base_url_preserves_full_model_id(loopback_url, monkeypatch):
+    """Even with a generic `provider: custom` (or any non-local-server name),
+    a base_url pointing at a loopback or private IP must preserve the model id —
+    almost certainly a local model server."""
+    _patch_cfg(monkeypatch, provider="custom", base_url=loopback_url)
+    model, _, _ = cfg_mod.resolve_model_provider("qwen/qwen3.6-27b")
+    assert model == "qwen/qwen3.6-27b", (
+        f"Loopback/private base_url {loopback_url!r} must preserve the full model id (#1625)"
+    )
+
+
+# ── Backward compat: OpenAI-compatible proxies keep prefix-stripping ─────
+
+
+def test_public_openai_proxy_still_strips_prefix(monkeypatch):
+    """OpenAI-compatible proxies (LiteLLM, public OpenRouter relays) still get
+    the strip behavior so 'openai/gpt-5.4' → 'gpt-5.4'."""
+    _patch_cfg(monkeypatch, provider="openai", base_url="https://litellm.example.com/v1")
+    model, provider, base_url = cfg_mod.resolve_model_provider("openai/gpt-5.4")
+    assert model == "gpt-5.4", (
+        "Public-host openai/* on a non-loopback proxy must continue to strip prefix"
+    )
+
+
+def test_unknown_prefix_on_public_proxy_preserved(monkeypatch):
+    """Unknown prefix (zai-org/GLM-5.1) on public proxy passes through full
+    (the existing contract — stripping unknown prefixes caused model_not_found)."""
+    _patch_cfg(monkeypatch, provider="openai", base_url="https://litellm.example.com/v1")
+    model, _, _ = cfg_mod.resolve_model_provider("zai-org/GLM-5.1")
+    assert model == "zai-org/GLM-5.1"
+
+
+def test_openrouter_passes_full_unaffected(monkeypatch):
+    """OpenRouter always needs the full provider/model path — pre-existing
+    contract that the local-server fix must not disturb."""
+    _patch_cfg(monkeypatch, provider="openrouter")
+    model, provider, _ = cfg_mod.resolve_model_provider("anthropic/claude-sonnet-4.6")
+    assert model == "anthropic/claude-sonnet-4.6"
+    assert provider == "openrouter"
+
+
+# ── Helper unit tests ─────────────────────────────────────────────────────
+
+
+@pytest.mark.parametrize("url, expected", [
+    ("http://localhost:1234", True),
+    ("http://127.0.0.1:1234", True),
+    ("http://10.0.0.5", True),
+    ("http://192.168.1.1:8080", True),
+    ("http://[::1]:1234", True),
+    ("http://example.com", False),
+    ("https://api.openai.com/v1", False),
+    ("https://litellm.example.com/v1", False),
+    ("", False),
+    (None, False),
+    ("not-a-url", False),
+])
+def test_base_url_points_at_local_server_helper(url, expected):
+    assert cfg_mod._base_url_points_at_local_server(url) is expected
diff --git a/tests/test_model_resolver.py b/tests/test_model_resolver.py
index 4ceb1138..56c591dd 100644
--- a/tests/test_model_resolver.py
+++ b/tests/test_model_resolver.py
@@ -466,11 +466,18 @@ def test_custom_endpoint_uses_model_config_api_key_for_model_discovery(monkeypat
 # -- Issue #230: custom provider with slash model name -----------------------
 
 def test_custom_endpoint_slash_model_routes_to_custom_not_openrouter():
-    """Regression test for #230.
+    """Regression test for #230, updated for #1625.
 
     When provider=custom (or any non-openrouter provider) and base_url is set,
     a model name containing a slash (e.g. google/gemma-4-26b-a4b) must NOT be
     rerouted to OpenRouter -- it should stay on the configured custom endpoint.
+
+    #1625 layered an additional rule on top: a base_url pointing at a loopback
+    or private-IP host is treated as a local model server (LM Studio, Ollama,
+    llama.cpp, vLLM, TabbyAPI), which register models under their full
+    HuggingFace path. On such hosts the prefix is now PRESERVED. The original
+    #433 strip behaviour still applies on public hosts (real OpenAI-compatible
+    proxies like LiteLLM at https://litellm.example.com/v1).
     """
     # --- custom provider with slash model name should NOT go to openrouter ---
     model, provider, base_url = _resolve_with_config(
@@ -486,10 +493,22 @@ def test_custom_endpoint_slash_model_routes_to_custom_not_openrouter():
     assert base_url == 'http://127.0.0.1:1234/v1', (
         "Expected base_url 'http://127.0.0.1:1234/v1', got '{}'.".format(base_url)
     )
-    # Fix #433: provider prefix is now stripped for custom endpoints so stale
-    # prefixed model IDs from previous sessions do not break custom endpoint routing.
-    assert model == 'gemma-4-26b-a4b', (
-        "Model name prefix should be stripped for custom base_url endpoint, got '{}'.".format(model)
+    # #1625 (supersedes the v0.50 #433 strip-on-custom rule for loopback hosts):
+    # 127.0.0.1 base_url is almost certainly a local LM Studio / Ollama / etc.,
+    # which keys models on the full HuggingFace path. Preserve the prefix.
+    assert model == 'google/gemma-4-26b-a4b', (
+        "Model name prefix must be PRESERVED on loopback base_url (#1625), got '{}'.".format(model)
+    )
+
+    # --- public-host openai-compatible proxy STILL strips per #433 ----------
+    model2, provider2, base_url2 = _resolve_with_config(
+        'google/gemma-4-26b-a4b',
+        provider='openai',
+        base_url='https://litellm.example.com/v1',
+        default='google/gemma-4-26b-a4b',
+    )
+    assert model2 == 'gemma-4-26b-a4b', (
+        "Public-host OpenAI-compat proxy must still strip prefix per #433, got '{}'.".format(model2)
     )
 
     # --- openrouter with slash model name MUST still route to openrouter -----

From 040cb8af7028e1c646ff5bf427218decf6c62827 Mon Sep 17 00:00:00 2001
From: nesquena-hermes <nesquena+hermes@gmail.com>
Date: Mon, 4 May 2026 16:48:22 +0000
Subject: [PATCH 2/2] Apply Opus pre-release SHOULD-FIX + NITs (in-PR per
 release policy)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

SHOULD-FIX: rate-limit _repair_stale_pending repair-firing telemetry. Switch
from unconditional logger.warning to age-keyed: WARNING when pending_age <
5min (the diagnostically valuable race window — actual leak-path candidates
that slipped past the grace guard) and DEBUG for the long-tail (orphaned
sidecars from prior process lifetimes). Prevents reconnect loops on stuck
sessions from flooding the log while preserving the diagnostic signal we
want for tuning _REPAIR_STALE_PENDING_GRACE_SECONDS empirically.

NIT: _LOCAL_SERVER_PROVIDERS expanded with lm-studio (hyphenated alias used
in some custom_providers configs and already recognized at api/config.py:2189
for SSRF host trust) and localai (LocalAI project). Test parametrize expanded
from 7 to 11 names, also covering pre-existing koboldcpp and textgen for
symmetry. +4 regression tests.

NIT (docs): CHANGELOG callout for the RFC1918 behavior change. Internal-
network OpenAI-compatible proxies now preserve the model prefix on private-IP
base_urls. Documented the migration path: configure as a custom_providers
entry to bypass the local-server detection.

NIT (deferred, optional): narrowing the heuristic to is_loopback only is
left as future work; the broader scope was an explicit goal in the bug
body and Opus flagged it as SHOULD-DISCUSS-but-not-block.

4184 -> 4188 passing. 0 regressions. ~10 LOC absorbed total.
---
 CHANGELOG.md                                  |  9 +++++++++
 api/config.py                                 | 20 ++++++++++---------
 api/models.py                                 | 14 ++++++++-----
 ...1625_local_server_model_id_preservation.py |  4 ++++
 4 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e479377a..0fcb01b3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,8 @@
 
 - **Local model servers (LM Studio, Ollama, llama.cpp, vLLM, TabbyAPI) now keep their full HuggingFace-style model id** (closes #1625, reported by @akarichan8231) — `resolve_model_provider()` in `api/config.py:1149` stripped the provider prefix from a model id like `qwen/qwen3.6-27b` whenever (a) the model contained `/`, (b) `config.yaml` had `model.base_url` set, and (c) the prefix matched a known entry in `_PROVIDER_MODELS` (e.g. `qwen`, `openai`, `anthropic`, etc.). The strip is correct for OpenAI-compatible **proxies** (LiteLLM, OpenRouter relays) — `openai/gpt-5.4` → `gpt-5.4`. But local model servers are **not** proxies — they register models under their full HuggingFace path as the registry key. Stripping the prefix made LM Studio (or Ollama, llama.cpp, vLLM, TabbyAPI) miss the loaded model and silently load a brand-new instance with default settings, ignoring the user's tuned 131072 context / 4 parallel slots. **Fix:** new `_LOCAL_SERVER_PROVIDERS` set covering canonical names (`lmstudio`, `ollama`, `llamacpp`, `llama-cpp`, `vllm`, `tabby`, `tabbyapi`, `koboldcpp`, `textgen`) and a new `_base_url_points_at_local_server()` heuristic that catches `provider: custom` + `base_url: http://localhost:1234/v1` setups too (via loopback / RFC1918 / IPv6-loopback IP detection). Either signal triggers no-strip. Backward compat is preserved for OpenAI-compatible proxies on public hosts (LiteLLM at `https://litellm.example.com/v1` continues to strip `openai/gpt-5.4` → `gpt-5.4`).
 
+  > **Behavior change for internal-network OpenAI-compatible proxies (RFC1918):** the loopback heuristic also matches private-IP base_urls (10/8, 172.16/12, 192.168/16). A team running an internal LiteLLM proxy at `http://10.5.0.1:1234/v1` now gets prefix preservation instead of stripping. LiteLLM accepts either form, so this is invisible in practice; users with a custom proxy on RFC1918 that requires the stripped form should configure it as a `custom_providers:` entry, which routes through the early `custom_providers` loop and never reaches the local-server detection.
+
 ### Tests
 
 4180 → **4226 passing** (+46 regression tests across `tests/test_issue1623_sse_heartbeat_alignment.py` (3), `tests/test_issue1624_repair_stale_pending_grace.py` (9), `tests/test_issue1625_local_server_model_id_preservation.py` (34)). 0 regressions. Full suite in ~120s.
@@ -21,6 +23,13 @@
 - `_repair_stale_pending` grace guard exercised at: 5s-old turn (skip), grace-1s-old turn (skip), grace+30s-old turn (fire), missing/zero/garbage `pending_started_at` (fire — legacy compat), no pending-message (skip — pre-existing contract), live stream (skip — pre-existing contract).
 - `resolve_model_provider` exercised across 7 known local-server provider names + 7 loopback/private IP heuristic cases + backward-compat checks for OpenAI-compatible proxies on public hosts and OpenRouter pass-through. Helper `_base_url_points_at_local_server()` independently unit-tested against 11 url shapes.
 
+
+
+### Opus-applied fixes (absorbed in-PR per release policy)
+
+- **SHOULD-FIX (`_repair_stale_pending` log volume)**: rate-limit the repair-firing telemetry by age — `logger.warning` for the diagnostically valuable race window (< 5 min, actual leak-path candidates that slipped past the grace guard) and `logger.debug` for the long-tail (orphaned sidecars from prior process lifetimes). Prevents reconnect loops on stuck sessions from flooding the log while preserving the diagnostic signal we want for tuning the grace constant.
+- **NIT (`_LOCAL_SERVER_PROVIDERS`)**: added `lm-studio` (hyphenated alias used in some `custom_providers:` configs and already recognized at `api/config.py:2189` for SSRF host trust) and `localai` (LocalAI project, common OpenAI-compatible local server). Test parametrize expanded to cover the new names plus pre-existing `koboldcpp` and `textgen` for symmetry.
+
 ## [v0.50.293] — 2026-05-04
 
 ### Fixed (3 PRs — profile isolation trio + agent version badge + #1597 follow-up)
diff --git a/api/config.py b/api/config.py
index 51784601..43b6aca3 100644
--- a/api/config.py
+++ b/api/config.py
@@ -1168,15 +1168,17 @@ def _deduplicate_model_ids(groups: list[dict]) -> None:
 #      Reuses the same private-IP detection logic used elsewhere in
 #      api/config.py for SSRF host trust.
 _LOCAL_SERVER_PROVIDERS = {
-    "lmstudio",   # canonical (in hermes_cli.models.CANONICAL_PROVIDERS)
-    "ollama",     # via custom_providers, common pattern
-    "llamacpp",   # via custom_providers
-    "llama-cpp",  # alias
-    "vllm",       # via custom_providers
-    "tabby",      # via custom_providers (TabbyAPI)
-    "tabbyapi",   # alias
-    "koboldcpp",  # local llama.cpp UI fork
-    "textgen",    # text-generation-webui (oobabooga) OpenAI-compat extension
+    "lmstudio",     # canonical (in hermes_cli.models.CANONICAL_PROVIDERS)
+    "lm-studio",    # alias used in some custom_providers configs (#1625 Opus NIT)
+    "ollama",       # via custom_providers, common pattern
+    "llamacpp",     # via custom_providers
+    "llama-cpp",    # alias
+    "vllm",         # via custom_providers
+    "tabby",        # via custom_providers (TabbyAPI)
+    "tabbyapi",     # alias
+    "koboldcpp",    # local llama.cpp UI fork
+    "textgen",      # text-generation-webui (oobabooga) OpenAI-compat extension
+    "localai",      # LocalAI project (#1625 Opus NIT)
 }
 
 
diff --git a/api/models.py b/api/models.py
index ebe3c4be..867beb37 100644
--- a/api/models.py
+++ b/api/models.py
@@ -794,13 +794,17 @@ def _repair_stale_pending(session) -> bool:
             )
             return False
         try:
-            # Telemetry (#1624): log every legitimate repair so the next batch
+            # Telemetry (#1624): log legitimate repair firings so the next batch
             # of user reports tells us whether the underlying race still fires
-            # post-fix. Particularly useful for short-window repairs that slip
-            # past the grace guard (e.g. age just over the threshold) so we
-            # can tune _REPAIR_STALE_PENDING_GRACE_SECONDS empirically.
+            # post-fix. Rate-limit by age (Opus pre-release SHOULD-FIX): WARNING
+            # for the diagnostically valuable race window (< 5 min — actual
+            # leak-path candidates that slipped past the grace guard) and DEBUG
+            # for the long-tail (orphaned sidecars from prior process lifetimes)
+            # so reconnect loops on stuck sessions don't flood the log.
+            _DIAG_WARN_WINDOW_SECONDS = 300  # 5 min
             _age_str = ('inf' if _age == float('inf') else f'{_age:.1f}s')
-            logger.warning(
+            _log = logger.warning if _age < _DIAG_WARN_WINDOW_SECONDS else logger.debug
+            _log(
                 "_repair_stale_pending firing: session=%s stream_id=%s pending_age=%s",
                 sid, _seen_stream_id, _age_str,
             )
diff --git a/tests/test_issue1625_local_server_model_id_preservation.py b/tests/test_issue1625_local_server_model_id_preservation.py
index b31d5bf9..5c736c90 100644
--- a/tests/test_issue1625_local_server_model_id_preservation.py
+++ b/tests/test_issue1625_local_server_model_id_preservation.py
@@ -33,12 +33,16 @@ def _patch_cfg(monkeypatch, **model_overrides):
 
 @pytest.mark.parametrize("provider_name", [
     "lmstudio",
+    "lm-studio",   # Opus pre-release NIT
     "ollama",
     "llamacpp",
     "llama-cpp",
     "vllm",
     "tabby",
     "tabbyapi",
+    "koboldcpp",
+    "textgen",
+    "localai",     # Opus pre-release NIT
 ])
 def test_known_local_server_provider_preserves_full_model_id(provider_name, monkeypatch):
     """Known local-server provider names must preserve the slashed model id