diff --git a/CHANGELOG.md b/CHANGELOG.md index 69546608..63535b4d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,20 +2,50 @@ ## [v0.50.294] — 2026-05-04 -### Fixed (1 PR — closes #1633) +### Fixed (3 PRs — streaming stability trio + models cache version stamp + session race + readonly fs guard — closes #1430, #1470, #1623, #1624, #1625, #1633) -- **`/api/models` disk cache now invalidated on every WebUI version change** (closes #1633) — `STATE_DIR/models_cache.json` was persisted across server restarts without any version stamp. A Docker container update from version A to version B read the cache file written by version A — users saw stale picker contents (missing models, phantom provider groups, e.g. the v0.50.281 4-model Nous Portal + `Opencode_Go` phantom) for up to 24 hours until either the TTL expired, an unrelated provider edit triggered `invalidate_models_cache()`, or they manually deleted the file. Reporter Deor (Discord) updated to v0.50.292 — which contained fixes for #1538, #1539, and #1568 — did a hard refresh and cleared site data, and still saw byte-for-byte identical picker contents because the server kept reading the v0.50.281 cache file off the host-mounted volume. **Fix:** `_save_models_cache_to_disk()` now stamps payloads with `_webui_version` (resolved lazily from `api.updates.WEBUI_VERSION` to avoid a circular import) and `_schema_version = 2`. `_load_models_cache_from_disk()` rejects any cache where either field mismatches the runtime — every release auto-rebuilds from live provider data on the very next `/api/models` call. Legacy unstamped caches (pre-#1633 files) are also rejected, so the first read after upgrading to this release rebuilds cleanly. Schema version is independent of the WebUI version stamp so future cache-shape changes can invalidate older releases without relying on a tag bump alone. The early-init edge case (api.updates not yet loaded) skips the version check rather than wedging the boot — at worst an unstamped file is written once and rejected on the next call. +- **SSE app heartbeat lowered from 30s to 5s at every long-lived handler** (closes #1623) — kernel TCP keepalive (added v0.50.289 / #1581) declares a peer dead at `KEEPIDLE (10s) + KEEPINTVL (5s) × KEEPCNT (3) = 25s` worst-case. The five SSE handlers in `api/routes.py` (main agent stream, terminal, gateway-watcher, approval-poller, clarify-poller) all used 30s, which meant on flaky networks the kernel could tear the socket down before the app sent its first heartbeat byte — flaky-network drops at ~10s that users perceived as "the stream died around 10 seconds in" during long LLM thinking phases. **Fix:** new `_SSE_HEARTBEAT_INTERVAL_SECONDS = 5` constant referenced by every queue-poll site. Cost: ~150B/min when idle (12 extra heartbeats × 12 bytes), negligible. Many production SSE deployments use 5-15s app heartbeats specifically because TCP keepalive isn't reliable across all network paths (proxies, load balancers, mobile NAT). Regression test pins the inequality `app_heartbeat × 2 ≤ kernel_keepalive_window` so future tuning of either timer can't re-introduce the misalignment. + +- **`_repair_stale_pending()` no longer fires on fresh turns** (closes #1624) — `_repair_stale_pending` in `api/models.py:716` triggered as soon as `pending_user_message` was set AND `active_stream_id` was missing from the live `STREAMS` registry. There was no time-based staleness guard, so any narrow race between the streaming thread clearing `pending_user_message` and `STREAMS.pop(stream_id)` produced a false-positive "**Previous turn did not complete.**" marker on a turn that actually finished correctly — every command-approval turn reliably reproduced this for at least one user. **Fix:** add `_REPAIR_STALE_PENDING_GRACE_SECONDS = 30` and bail when `time.time() - pending_started_at < grace`. Falsy `pending_started_at` (legacy sidecars from before the field was added in v0.50.283) is treated as "old enough" so legitimate legacy-data recovery still works. Plus a rate-limited `logger.warning`/`logger.debug` on every legitimate repair so the next batch of user reports tells us whether the underlying race still fires post-fix. **This is defense-in-depth, not the root-cause fix** — the streaming thread should never exit without clearing pending; tracked separately for future investigation. + +- **Local model servers (LM Studio, Ollama, llama.cpp, vLLM, TabbyAPI, LocalAI) now keep their full HuggingFace-style model id** (closes #1625, reported by @akarichan8231) — `resolve_model_provider()` in `api/config.py:1149` stripped the provider prefix from a model id like `qwen/qwen3.6-27b` whenever (a) the model contained `/`, (b) `config.yaml` had `model.base_url` set, and (c) the prefix matched a known entry in `_PROVIDER_MODELS` (e.g. `qwen`, `openai`, `anthropic`, etc.). The strip is correct for OpenAI-compatible **proxies** (LiteLLM, OpenRouter relays) — `openai/gpt-5.4` → `gpt-5.4`. But local model servers are **not** proxies — they register models under their full HuggingFace path as the registry key. Stripping the prefix made LM Studio (or Ollama, llama.cpp, vLLM, TabbyAPI) miss the loaded model and silently load a brand-new instance with default settings, ignoring the user's tuned 131072 context / 4 parallel slots. **Fix:** new `_LOCAL_SERVER_PROVIDERS` set covering canonical names (`lmstudio`, `lm-studio`, `localai`, `ollama`, `llamacpp`, `llama-cpp`, `vllm`, `tabby`, `tabbyapi`, `koboldcpp`, `textgen`) and a new `_base_url_points_at_local_server()` heuristic that catches `provider: custom` + `base_url: http://localhost:1234/v1` setups too (via loopback / RFC1918 / IPv6-loopback IP detection). Either signal triggers no-strip. Backward compat is preserved for OpenAI-compatible proxies on public hosts (LiteLLM at `https://litellm.example.com/v1` continues to strip `openai/gpt-5.4` → `gpt-5.4`). + + > **Behavior change for internal-network OpenAI-compatible proxies (RFC1918):** the loopback heuristic also matches private-IP base_urls (10/8, 172.16/12, 192.168/16). A team running an internal LiteLLM proxy at `http://10.5.0.1:1234/v1` now gets prefix preservation instead of stripping. LiteLLM accepts either form, so this is invisible in practice; users with a custom proxy on RFC1918 that requires the stripped form should configure it as a `custom_providers:` entry, which routes through the early `custom_providers` loop and never reaches the local-server detection. + +- **`/api/models` disk cache now invalidated on every WebUI version change** (closes #1633, reported by @Deor on Discord) — `STATE_DIR/models_cache.json` was persisted across server restarts without any version stamp. A Docker container update from version A to version B read the cache file written by version A — users saw stale picker contents (missing models, phantom provider groups, e.g. the v0.50.281 4-model Nous Portal + `Opencode_Go` phantom) for up to 24 hours until either the TTL expired, an unrelated provider edit triggered `invalidate_models_cache()`, or they manually deleted the file. Reporter Deor updated to v0.50.292 — which contained fixes for #1538, #1539, and #1568 — did a hard refresh and cleared site data, and still saw byte-for-byte identical picker contents because the server kept reading the v0.50.281 cache file off the host-mounted volume. **Fix:** `_save_models_cache_to_disk()` now stamps payloads with `_webui_version` (resolved lazily from `api.updates.WEBUI_VERSION` to avoid a circular import) and `_schema_version = 2`. `_load_models_cache_from_disk()` rejects any cache where either field mismatches the runtime — every release auto-rebuilds from live provider data on the very next `/api/models` call. Legacy unstamped caches (pre-#1633 files) are also rejected, so the first read after upgrading to this release rebuilds cleanly. Schema version is independent of the WebUI version stamp so future cache-shape changes can invalidate older releases without relying on a tag bump alone. The early-init edge case (api.updates not yet loaded) skips the version check rather than wedging the boot — at worst an unstamped file is written once and rejected on the next call. + +- **Session list race condition no longer makes today's sessions disappear** (closes #1430, reported by @Olyno) — `renderSessionList()` in `static/sessions.js` had no staleness guard. Multiple callers (message send, rename, session switch) fire it concurrently without awaiting, so a slower previous-day fetch could overwrite `_allSessions` with stale data after a faster newer fetch had already written today's data — manifesting as today's sessions disappearing when the user clicked an older conversation. **Fix:** new module-local `_renderSessionListGen` generation counter pre-incremented before the `await` and re-checked after it; stale calls (older `_gen`) self-discard before mutating state. Lightest-weight correct shape — no AbortController, no debounce, no state machine. Behavioral harness verifies three concurrent calls with varying delays correctly land only the most recently issued response. (PR #1635 by @bergeouss, AI-assisted via Hermes Agent.) + +- **Read-only root filesystem under podman no longer crashes container startup** (closes #1470, reported by @cosmoceus) — `docker_init.bash` unconditionally called `groupmod`/`usermod` even when `/etc/group` and `/etc/passwd` were on a read-only filesystem (typical podman + `read_only=true` setup). `groupmod: cannot lock /etc/group; try again later.` killed the container at boot. **Fix:** writability check via `[ ! -w /etc/group ] || [ ! -w /etc/passwd ]`; on read-only mounts with matching UID/GID skip gracefully with a log message; on read-only mounts with mismatched UID/GID emit a clear `error_exit` directing the user to set matching IDs or disable `read_only=true`. (PR #1635 by @bergeouss.) ### Tests -4180 → **4199 passing** (+19 regression tests on `tests/test_issue1633_models_cache_version_stamp.py`). 0 regressions. Full suite in ~115s. +4180 → **4271 passing** (+91 regression tests across `tests/test_issue1623_sse_heartbeat_alignment.py` (3), `tests/test_issue1624_repair_stale_pending_grace.py` (9), `tests/test_issue1625_local_server_model_id_preservation.py` (34, expanded for `lm-studio`/`localai`), `tests/test_issue1633_models_cache_version_stamp.py` (19); plus `tests/test_model_resolver.py` updates and `tests/test_model_cache_metadata.py` round-trip semantics). 0 regressions. Full suite ~120s. ### Pre-release verification -- Self-built fix (nesquena-hermes), pending Opus advisor pre-merge pass and independent review APPROVED by nesquena. +- **Self-built fixes** (#1631, #1636 — nesquena-hermes), independent review **APPROVED by nesquena** for both, with comprehensive end-to-end traces including reproducer harnesses for Deor's Docker-upgrade scenario (#1633) and the kernel-keepalive math (#1623). +- **External contributor PR** #1635 by @bergeouss (AI-assisted via Hermes Agent), independent review **APPROVED by nesquena** with behavioral harness for the race fix (three concurrent fetches with varying delays — only the latest writes to state). +- **Opus advisor pre-merge pass on #1631**: SHIP — no MUST-FIX, one SHOULD-FIX (rate-limited `_repair_stale_pending` telemetry) and three NITs (expanded `_LOCAL_SERVER_PROVIDERS`, RFC1918 CHANGELOG callout) absorbed in-PR (commit `2161fc1`). +- **Opus advisor pre-merge pass on stage-294**: see "Opus-applied fixes" below. +- `_SSE_HEARTBEAT_INTERVAL_SECONDS × 2 ≤ KEEPIDLE + KEEPINTVL × KEEPCNT` pinned by a regression test that derives the kernel window from `server.py` setsockopt block at runtime. +- `_repair_stale_pending` grace guard exercised at: 5s-old turn (skip), grace-1s-old turn (skip), grace+30s-old turn (fire), missing/zero/garbage `pending_started_at` (fire — legacy compat), no pending-message (skip — pre-existing contract), live stream (skip — pre-existing contract). +- `resolve_model_provider` exercised across local-server provider names + 7 loopback/private IP heuristic cases + backward-compat checks for OpenAI-compatible proxies on public hosts and OpenRouter pass-through. Helper `_base_url_points_at_local_server()` independently unit-tested against 11 url shapes. - End-to-end behavioral test (`test_docker_update_scenario_invalidates_old_cache`) reproduces Deor's exact reported scenario: a cache stamped at `v0.50.281` fails to load when runtime is `v0.50.292`, forcing a fresh rebuild that picks up the picker fixes shipped between releases. - Round-trip + version-mismatch + legacy-unstamped + schema-mismatch + early-init + corrupt-JSON + missing-file + atomic-overwrite + invalidate-cache-tear-down all pinned. -- Smoke test: real `api.updates.WEBUI_VERSION` resolved at runtime (`v0.50.293-dirty` against the worktree); cache stamped with that version; round-trip works end-to-end. +- Cross-tool verified: agent has its own model-cache files at different paths (`hermes_cli/codex_models.py`, `hermes_cli/models.py`) — no collision. + +### Opus-applied fixes (absorbed in-release) + +**From #1631 in-PR Opus pre-merge pass (already on the PR's branch):** + +- **SHOULD-FIX (`_repair_stale_pending` log volume)**: rate-limit the repair-firing telemetry by age — `logger.warning` for the diagnostically valuable race window (< 5 min, actual leak-path candidates that slipped past the grace guard) and `logger.debug` for the long-tail (orphaned sidecars from prior process lifetimes). Prevents reconnect loops on stuck sessions from flooding the log while preserving the diagnostic signal we want for tuning the grace constant. +- **NIT (`_LOCAL_SERVER_PROVIDERS`)**: added `lm-studio` (hyphenated alias used in some `custom_providers:` configs) and `localai` (LocalAI project, common OpenAI-compatible local server). Test parametrize expanded to cover the new names plus pre-existing `koboldcpp` and `textgen` for symmetry. + +**From #1636 stage-294 absorption (this release):** + +- **Minor observation absorbed** — `_is_loadable_disk_cache()` now logs at DEBUG when rejecting (`schema=N vs M`, `version=A vs B`). Useful diagnostic when investigating future "why did my cache rebuild" questions. +- **Code comment** added to `_is_loadable_disk_cache()` documenting that `_webui_version` is a string compare (not semver) — paired with `_schema_version` independent axis for breaking changes that lack a tag bump. ## [v0.50.293] — 2026-05-04 diff --git a/api/config.py b/api/config.py index b2233da3..3614504c 100644 --- a/api/config.py +++ b/api/config.py @@ -1147,6 +1147,72 @@ def _deduplicate_model_ids(groups: list[dict]) -> None: model["label"] = f"{original_id} ({provider_name})" +# ── Local-server provider preservation (#1625) ───────────────────────────── +# +# LM Studio, Ollama, llama.cpp, vLLM, TabbyAPI etc. are inference servers, +# not OpenAI-compatible proxies. They register models under their FULL path +# as the registry key (the HuggingFace-style "namespace/model" id, e.g. +# "qwen/qwen3.6-27b"). Stripping the namespace prefix would cause a registry +# miss and the server loads a brand-new instance with default settings, +# silently ignoring the user's tuned context length / parallel slots. +# +# This is distinct from OpenAI-compatible proxies (LiteLLM, OpenRouter relays) +# where stripping "openai/gpt-5.4" → "gpt-5.4" is the correct behavior. +# +# Detection has two layers: +# 1. Static set of known local-server provider names (canonical + common +# custom-provider naming). +# 2. Loopback / private-host base_url heuristic: an OpenAI-compatible URL +# pointing at 127.0.0.1, localhost, or a private IP block is almost +# certainly a local model server, regardless of the provider name. +# Reuses the same private-IP detection logic used elsewhere in +# api/config.py for SSRF host trust. +_LOCAL_SERVER_PROVIDERS = { + "lmstudio", # canonical (in hermes_cli.models.CANONICAL_PROVIDERS) + "lm-studio", # alias used in some custom_providers configs (#1625 Opus NIT) + "ollama", # via custom_providers, common pattern + "llamacpp", # via custom_providers + "llama-cpp", # alias + "vllm", # via custom_providers + "tabby", # via custom_providers (TabbyAPI) + "tabbyapi", # alias + "koboldcpp", # local llama.cpp UI fork + "textgen", # text-generation-webui (oobabooga) OpenAI-compat extension + "localai", # LocalAI project (#1625 Opus NIT) +} + + +def _base_url_points_at_local_server(base_url: str) -> bool: + """True if base_url's host is a loopback or private IP (likely local server). + + Reuses ipaddress.is_loopback / is_private / is_link_local — the same + heuristic used in the `api/config.py` SSRF/credential-routing code. + Errors (DNS failure, malformed URL) return False so callers fall back to + the static-provider-name check. + """ + if not base_url: + return False + try: + from urllib.parse import urlparse + import ipaddress + host = (urlparse(base_url).hostname or "").lower() + if not host: + return False + # Plain-text "localhost" doesn't ipaddress-parse but is unambiguous. + if host in ("localhost", "ip6-localhost", "ip6-loopback"): + return True + try: + addr = ipaddress.ip_address(host) + except ValueError: + # Not an IP literal — could be a hostname like "ollama.internal". + # Don't try DNS resolution here (slow + ambient): only IP literals + # and the `localhost` alias get the no-strip treatment via this path. + return False + return addr.is_loopback or addr.is_private or addr.is_link_local + except Exception: + return False + + def resolve_model_provider(model_id: str) -> tuple: """Resolve model name, provider, and base_url for AIAgent. @@ -1250,6 +1316,15 @@ def resolve_model_provider(model_id: str) -> tuple: # just because the model name contains a slash (e.g. google/gemma-4-26b-a4b). # The user has explicitly pointed at a base_url, so trust their routing config. if config_base_url: + # Local model servers (LM Studio, Ollama, llama.cpp, vLLM, TabbyAPI) + # register models under their full HuggingFace-style id. Stripping the + # prefix breaks the lookup and causes a fresh instance to load with + # default settings, ignoring user-tuned context length / parallel slots. + # See #1625. Detect either by canonical provider name OR by base_url + # pointing at a loopback/private host. + if (str(config_provider or "").lower() in _LOCAL_SERVER_PROVIDERS + or _base_url_points_at_local_server(config_base_url)): + return model_id, config_provider, config_base_url # Only strip the provider prefix when it's a known provider namespace # (e.g. "openai/gpt-5.4" → "gpt-5.4" for a custom OpenAI-compatible proxy). # Unknown prefixes (e.g. "zai-org/GLM-5.1" on DeepInfra) are intrinsic to @@ -1587,17 +1662,34 @@ def _is_loadable_disk_cache(cache: object) -> bool: immediately, instead of waiting up to 24 hours for the TTL to expire. If the runtime version cannot be resolved (early-init edge case), skip this check rather than wedge the boot. + + Note: ``_webui_version`` is a string equality check, not a semver compare — + two debug builds with the same `WEBUI_VERSION` string but different actual + code wouldn't invalidate via this axis. ``_schema_version`` is the + independent invalidation axis for breaking changes that lack a tag bump; + bump it whenever the cache shape changes incompatibly. """ if not _is_valid_models_cache(cache): return False if not isinstance(cache, dict): # appease type-narrowing — already guarded above return False - if cache.get("_schema_version") != _MODELS_CACHE_SCHEMA_VERSION: + cached_schema = cache.get("_schema_version") + if cached_schema != _MODELS_CACHE_SCHEMA_VERSION: + # DEBUG telemetry per stage-294 absorption: makes "why did my cache + # rebuild" investigations one log-grep away. + logger.debug( + "models cache rejected: schema=%r vs runtime=%r", + cached_schema, _MODELS_CACHE_SCHEMA_VERSION, + ) return False runtime_version = _current_webui_version() if runtime_version is not None: cached_version = cache.get("_webui_version") if not isinstance(cached_version, str) or cached_version != runtime_version: + logger.debug( + "models cache rejected: webui_version=%r vs runtime=%r", + cached_version, runtime_version, + ) return False return True diff --git a/api/models.py b/api/models.py index 36775570..867beb37 100644 --- a/api/models.py +++ b/api/models.py @@ -713,11 +713,32 @@ def _apply_core_sync_or_error_marker( return True +# ── _repair_stale_pending grace period (#1624) ───────────────────────────── +# +# Defense-in-depth against a narrow race between the streaming thread clearing +# pending_user_message and STREAMS.pop(stream_id). Without this guard, any +# fast turn (e.g. command approval) that exits the thread before the on-disk +# pending clear has flushed gets misdiagnosed as a crashed turn, producing a +# spurious "Previous turn did not complete." marker. +# +# 30s covers the worst-case post-loop persistence window: LLM finishing a tool +# batch + lock contention with the checkpoint thread + a multi-MB session.save. +# A legitimately crashed turn whose pending_started_at is < 30s old will not +# repair on the first get_session() call, but WILL repair on the next call +# after the grace period elapses (typically the user's next interaction). +# +# Missing/falsy pending_started_at (legacy sidecars from before that field +# existed, or any path that forgot to set it) is treated as "old enough" so +# repair still recovers them — preserves current behavior for legacy data. +_REPAIR_STALE_PENDING_GRACE_SECONDS = 30 + + def _repair_stale_pending(session) -> bool: """Recover a sidecar stuck with messages=[] and stale pending state. Fires only when messages is empty, pending_user_message is set, - active_stream_id is set, and the stream is no longer alive. + active_stream_id is set, the stream is no longer alive, AND the turn is + older than _REPAIR_STALE_PENDING_GRACE_SECONDS (#1624). Uses a non-blocking lock acquire so a caller that already holds the per-session lock (e.g. retry_last, undo_last, cancel_stream) cannot @@ -735,6 +756,26 @@ def _repair_stale_pending(session) -> bool: or _seen_stream_id in _active_stream_ids()): return False + # Grace-period guard: bail if the turn is too fresh to be a real crash. + # Falsy pending_started_at (None, 0, missing) means "old enough" — preserve + # legacy-data recovery semantics for sessions that pre-date the field. + _started = getattr(session, 'pending_started_at', None) + if _started: + try: + _age = time.time() - float(_started) + except (TypeError, ValueError): + _age = float('inf') + if _age < _REPAIR_STALE_PENDING_GRACE_SECONDS: + logger.debug( + "_repair_stale_pending: skipping repair for session %s — " + "pending_started_at age=%.1fs < %ds grace window", + session.session_id, _age, _REPAIR_STALE_PENDING_GRACE_SECONDS, + ) + return False + else: + # Treat missing/falsy pending_started_at as "old enough" (legacy data). + _age = float('inf') + sid = session.session_id if not sid or not all(c in '0123456789abcdefghijklmnopqrstuvwxyz_' for c in sid): return False @@ -753,6 +794,20 @@ def _repair_stale_pending(session) -> bool: ) return False try: + # Telemetry (#1624): log legitimate repair firings so the next batch + # of user reports tells us whether the underlying race still fires + # post-fix. Rate-limit by age (Opus pre-release SHOULD-FIX): WARNING + # for the diagnostically valuable race window (< 5 min — actual + # leak-path candidates that slipped past the grace guard) and DEBUG + # for the long-tail (orphaned sidecars from prior process lifetimes) + # so reconnect loops on stuck sessions don't flood the log. + _DIAG_WARN_WINDOW_SECONDS = 300 # 5 min + _age_str = ('inf' if _age == float('inf') else f'{_age:.1f}s') + _log = logger.warning if _age < _DIAG_WARN_WINDOW_SECONDS else logger.debug + _log( + "_repair_stale_pending firing: session=%s stream_id=%s pending_age=%s", + sid, _seen_stream_id, _age_str, + ) return _apply_core_sync_or_error_marker( session, core_path, stream_id_for_recheck=_seen_stream_id, ) diff --git a/api/routes.py b/api/routes.py index ee95e0fd..4af18933 100644 --- a/api/routes.py +++ b/api/routes.py @@ -101,6 +101,20 @@ def _all_profiles_query_flag(parsed_url) -> bool: raw = qs.get('all_profiles', [''])[0].strip().lower() return raw in ('1', 'true', 'yes', 'on') +# ── SSE app-level heartbeat (#1623) ──────────────────────────────────────── +# +# Kernel TCP keepalive (server.py setsockopt block) declares a peer dead at +# KEEPIDLE (10s) + KEEPINTVL (5s) * KEEPCNT (3) = 25s in the worst case. The +# app-level SSE heartbeat must fire well below that window so flaky-network +# probes never get the chance to kill an idle stream during long LLM thinking +# phases. 5s gives the kernel ~5x headroom: probe at 10s, heartbeat byte at +# every 5s of idle keeps the socket warm. +# +# Cost: ~12 bytes per heartbeat * 12 extra heartbeats/min = ~150B/min idle. +# Trivial; many production SSE deployments run 5-15s heartbeats specifically +# to handle proxies and mobile NAT. +_SSE_HEARTBEAT_INTERVAL_SECONDS = 5 + def _normalize_messaging_source(raw_source) -> str: return str(raw_source or "").strip().lower() @@ -3778,7 +3792,7 @@ def _handle_sse_stream(handler, parsed): try: while True: try: - event, data = subscriber.get(timeout=30) + event, data = subscriber.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS) except queue.Empty: handler.wfile.write(b": heartbeat\n\n") handler.wfile.flush() @@ -3901,7 +3915,7 @@ def _handle_terminal_output(handler, parsed): try: while True: try: - event, data = term.output.get(timeout=25) + event, data = term.output.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS) except queue.Empty: handler.wfile.write(b": terminal heartbeat\n\n") handler.wfile.flush() @@ -3986,7 +4000,7 @@ def _handle_gateway_sse_stream(handler, parsed): while True: try: - event_data = q.get(timeout=30) + event_data = q.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS) except queue.Empty: handler.wfile.write(b': keepalive\n\n') handler.wfile.flush() @@ -4309,7 +4323,7 @@ def _handle_approval_sse_stream(handler, parsed): try: while True: try: - payload = q.get(timeout=30) + payload = q.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS) except queue.Empty: # Keepalive — SSE comment line prevents proxy/CDN timeout. handler.wfile.write(b': keepalive\n\n') @@ -4410,7 +4424,7 @@ def _handle_clarify_sse_stream(handler, parsed): try: while True: try: - payload = q.get(timeout=30) + payload = q.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS) except queue.Empty: handler.wfile.write(b': keepalive\n\n') handler.wfile.flush() diff --git a/tests/test_issue1623_sse_heartbeat_alignment.py b/tests/test_issue1623_sse_heartbeat_alignment.py new file mode 100644 index 00000000..90109a6c --- /dev/null +++ b/tests/test_issue1623_sse_heartbeat_alignment.py @@ -0,0 +1,89 @@ +"""Tests for #1623: SSE app heartbeat must fire well under the kernel keepalive timeout. + +Bug shape: server.py's per-connection TCP keepalive (added v0.50.289 / #1581) +declares a peer dead at KEEPIDLE=10s + KEEPINTVL=5s * KEEPCNT=3 = 25s. The +SSE handlers in api/routes.py used a 30s app-level heartbeat. When the LLM +is thinking and the queue is idle, the kernel could tear down the socket +before the app sent its first heartbeat byte — flaky-network drops at ~10s +that the user perceived as "the stream died around 10 seconds in." + +Fix: lower the heartbeat to 5s at every SSE handler and pin the inequality +with a regression test so future tuning of either timer can't re-introduce +the misalignment. +""" + +from pathlib import Path + + +REPO = Path(__file__).parent.parent + + +def test_sse_heartbeat_constant_below_kernel_keepalive_window(): + """The named constant exists and is at most half the kernel keepalive + timeout (10 + 5*3 = 25s). 5s gives the kernel ~5x headroom.""" + src = (REPO / "api" / "routes.py").read_text(encoding="utf-8") + + # The constant must be defined. + assert "_SSE_HEARTBEAT_INTERVAL_SECONDS" in src, ( + "Named SSE heartbeat constant must exist (#1623)" + ) + + # Pull the literal value. + import re + m = re.search(r"_SSE_HEARTBEAT_INTERVAL_SECONDS\s*=\s*(\d+)", src) + assert m, "Could not parse _SSE_HEARTBEAT_INTERVAL_SECONDS literal" + heartbeat = int(m.group(1)) + + # Reproduce the kernel-keepalive window from server.py setsockopt block. + server_src = (REPO / "server.py").read_text(encoding="utf-8") + assert "TCP_KEEPIDLE" in server_src, "TCP_KEEPIDLE must be set on accepted connections" + keepidle = int(re.search(r"TCP_KEEPIDLE.*?(\d+)\)", server_src, re.S).group(1)) + keepintvl = int(re.search(r"TCP_KEEPINTVL.*?(\d+)\)", server_src, re.S).group(1)) + keepcnt = int(re.search(r"TCP_KEEPCNT.*?(\d+)\)", server_src, re.S).group(1)) + kernel_window = keepidle + keepintvl * keepcnt + + # The acceptance criterion from the bug: app heartbeat <= kernel window / 2. + assert heartbeat * 2 <= kernel_window, ( + f"App SSE heartbeat ({heartbeat}s) must be at most half of the kernel " + f"keepalive window ({kernel_window}s = {keepidle} + {keepintvl}*{keepcnt}). " + f"Otherwise flaky-network probes can tear down the socket before the " + f"app sends a heartbeat byte. (#1623)" + ) + + +def test_no_sse_handler_uses_30s_or_higher_timeout(): + """No SSE/long-poll handler in routes.py should still be using the old + 30s/25s timeout. Every queue.get(timeout=...) call inside an SSE handler + must reference the named constant, not a hard-coded number.""" + src = (REPO / "api" / "routes.py").read_text(encoding="utf-8") + + import re + # Catch q.get(timeout=30), subscriber.get(timeout=30), term.output.get(timeout=25), etc. + bad = re.findall(r"\.get\(timeout=3[05]\)", src) + assert not bad, ( + f"Found {len(bad)} SSE handler call(s) still using a 25/30s timeout: {bad}. " + "All should use _SSE_HEARTBEAT_INTERVAL_SECONDS (#1623)." + ) + + +def test_each_named_sse_handler_uses_constant(): + """Each known SSE handler queue-poll site must reference the constant.""" + src = (REPO / "api" / "routes.py").read_text(encoding="utf-8") + + expected_callers = [ + "subscriber.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)", # main agent SSE + "term.output.get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)", # terminal SSE + ] + for caller in expected_callers: + assert caller in src, ( + f"Expected SSE handler to call {caller!r} (#1623). " + "If this assertion fails, the SSE heartbeat misalignment may have regressed." + ) + + # Also: at least 3 sites should be using the constant overall (main agent, + # terminal, plus the gateway watcher and approval/clarify pollers). + n_uses = src.count("get(timeout=_SSE_HEARTBEAT_INTERVAL_SECONDS)") + assert n_uses >= 4, ( + f"Expected at least 4 SSE/long-poll sites using the named constant; found {n_uses}. " + "Every long-lived idle queue poll must align below the kernel keepalive window." + ) diff --git a/tests/test_issue1624_repair_stale_pending_grace.py b/tests/test_issue1624_repair_stale_pending_grace.py new file mode 100644 index 00000000..96983211 --- /dev/null +++ b/tests/test_issue1624_repair_stale_pending_grace.py @@ -0,0 +1,160 @@ +"""Tests for #1624: _repair_stale_pending must not fire on fresh turns. + +Bug shape: _repair_stale_pending() fires whenever pending_user_message is set +and the active_stream_id is not in the live STREAMS registry. There's a +narrow race between the streaming thread clearing pending_user_message and +STREAMS.pop(stream_id), so any fast turn (e.g. command approval) that exits +the thread before the on-disk pending clear flushes gets misdiagnosed as a +crashed turn — producing a spurious "Previous turn did not complete." marker. + +Fix: add a grace-period guard. A turn whose pending_started_at is younger +than _REPAIR_STALE_PENDING_GRACE_SECONDS is treated as "the streaming thread +may still be in its post-loop cleanup window" and the repair bails. Missing +or falsy pending_started_at (legacy sidecars that pre-date the field) is +treated as "old enough" to preserve current legacy-data recovery semantics. +""" + +import time +import threading +from unittest.mock import patch + +import pytest + + +# ── _repair_stale_pending grace guard ─────────────────────────────────── + + +class _FakeSession: + """Minimal stand-in for api.models.Session — only the fields _repair_stale_pending reads.""" + def __init__(self, sid="abcdef123456", pending="hi", stream_id="stream_xyz", + pending_started_at=None, profile="default"): + self.session_id = sid + self.pending_user_message = pending + self.active_stream_id = stream_id + self.pending_started_at = pending_started_at + self.profile = profile + self.messages = [] + + +def _setup_repair_environment(monkeypatch, tmp_path): + """Stub out the costly side-channels in _repair_stale_pending so the + tests exercise the guard logic alone, not the full lock+sidecar pipeline.""" + import api.models as models + + # No live streams — the predicate's "stream not in registry" branch fires. + monkeypatch.setattr(models, "_active_stream_ids", lambda: set()) + + # Profile home -> tmp dir; sessions/.json doesn't need to exist + # because we'll stub _apply_core_sync_or_error_marker. + monkeypatch.setattr(models, "_get_profile_home", lambda profile: tmp_path) + (tmp_path / "sessions").mkdir(parents=True, exist_ok=True) + + # Track whether the heavy-lift function was called so we can assert. + calls = {"applied": 0} + + def fake_apply(session, core_path, **kw): + calls["applied"] += 1 + return True + monkeypatch.setattr(models, "_apply_core_sync_or_error_marker", fake_apply) + + return calls + + +def test_repair_skips_fresh_turn(tmp_path, monkeypatch): + """A turn that started 5 seconds ago is too fresh — repair must bail.""" + import api.models as models + calls = _setup_repair_environment(monkeypatch, tmp_path) + + s = _FakeSession(pending_started_at=time.time() - 5.0) + result = models._repair_stale_pending(s) + assert result is False, "Repair must skip a 5s-old turn" + assert calls["applied"] == 0, "Heavy-lift _apply_core_sync_or_error_marker must not be called" + + +def test_repair_skips_almost_grace_window(tmp_path, monkeypatch): + """A turn 1 second younger than the grace threshold must still bail.""" + import api.models as models + calls = _setup_repair_environment(monkeypatch, tmp_path) + grace = models._REPAIR_STALE_PENDING_GRACE_SECONDS + + s = _FakeSession(pending_started_at=time.time() - (grace - 1.0)) + result = models._repair_stale_pending(s) + assert result is False, f"Repair must skip a turn {grace - 1}s old" + assert calls["applied"] == 0 + + +def test_repair_fires_after_grace_window(tmp_path, monkeypatch): + """A turn older than the grace window should trigger repair as before.""" + import api.models as models + calls = _setup_repair_environment(monkeypatch, tmp_path) + grace = models._REPAIR_STALE_PENDING_GRACE_SECONDS + + s = _FakeSession(pending_started_at=time.time() - (grace + 30.0)) + result = models._repair_stale_pending(s) + assert result is True, f"Repair must fire on a turn older than {grace}s" + assert calls["applied"] == 1, "Heavy-lift _apply_core_sync_or_error_marker should be called" + + +def test_repair_fires_when_pending_started_at_missing(tmp_path, monkeypatch): + """Legacy sidecars predate `pending_started_at`; missing/falsy must NOT + block repair — preserves current behavior for legacy data.""" + import api.models as models + calls = _setup_repair_environment(monkeypatch, tmp_path) + + s = _FakeSession(pending_started_at=None) + result = models._repair_stale_pending(s) + assert result is True, "Missing pending_started_at must not block legitimate repair" + assert calls["applied"] == 1 + + +def test_repair_fires_when_pending_started_at_zero(tmp_path, monkeypatch): + """Falsy 0 must also be treated as 'old enough' (defense against accidental zeroing).""" + import api.models as models + calls = _setup_repair_environment(monkeypatch, tmp_path) + + s = _FakeSession(pending_started_at=0) + result = models._repair_stale_pending(s) + assert result is True, "pending_started_at=0 must not block legitimate repair" + + +def test_repair_fires_when_pending_started_at_garbage(tmp_path, monkeypatch): + """Garbage values (string, dict, etc.) shouldn't crash and shouldn't block repair.""" + import api.models as models + calls = _setup_repair_environment(monkeypatch, tmp_path) + + s = _FakeSession(pending_started_at="not-a-number") + result = models._repair_stale_pending(s) + assert result is True, "Garbage pending_started_at should be treated as 'old enough'" + + +def test_repair_skips_when_no_pending_message(tmp_path, monkeypatch): + """Without pending_user_message, repair must always bail (existing contract).""" + import api.models as models + calls = _setup_repair_environment(monkeypatch, tmp_path) + + s = _FakeSession(pending="", pending_started_at=time.time() - 60) + result = models._repair_stale_pending(s) + assert result is False + assert calls["applied"] == 0 + + +def test_repair_skips_when_stream_still_alive(tmp_path, monkeypatch): + """If the stream is still in the registry, repair must bail even past grace.""" + import api.models as models + monkeypatch.setattr(models, "_active_stream_ids", lambda: {"stream_xyz"}) + monkeypatch.setattr(models, "_get_profile_home", lambda profile: tmp_path) + + s = _FakeSession(pending_started_at=time.time() - 600) + result = models._repair_stale_pending(s) + assert result is False, "Stream-alive bail predates the grace guard" + + +def test_grace_constant_exists_and_is_sane(): + """The grace constant is exposed and sized in a sane range (10s..120s).""" + import api.models as models + grace = models._REPAIR_STALE_PENDING_GRACE_SECONDS + assert isinstance(grace, (int, float)) + assert 10 <= grace <= 120, ( + f"Grace window {grace}s should be 10s-120s — too small re-introduces " + "the false-positive race; too large delays legitimate recovery." + ) diff --git a/tests/test_issue1625_local_server_model_id_preservation.py b/tests/test_issue1625_local_server_model_id_preservation.py new file mode 100644 index 00000000..5c736c90 --- /dev/null +++ b/tests/test_issue1625_local_server_model_id_preservation.py @@ -0,0 +1,149 @@ +"""Tests for #1625: resolve_model_provider must NOT strip provider prefix on local servers. + +Bug shape: with `model.provider: lmstudio`, `model.base_url: http://localhost:1234/v1`, +`model.default: qwen/qwen3.6-27b`, resolve_model_provider() stripped the +"qwen/" prefix because "qwen" matches an entry in _PROVIDER_MODELS — sending +the request to LM Studio with model name "qwen3.6-27b". LM Studio (and Ollama, +llama.cpp, vLLM, TabbyAPI) register models under their full HuggingFace-style +id, so the stripped name didn't match the loaded model and a fresh instance +loaded with default settings, ignoring the user's tuned context length / +parallel slots. + +Fix: explicit no-strip path for known local-server providers AND any +base_url pointing at a loopback/private host. OpenAI-compatible proxies +(LiteLLM, OpenRouter relays) on public URLs continue to get prefix-stripping. +""" + +import pytest + +from api import config as cfg_mod + + +# ── Helpers ─────────────────────────────────────────────────────────────── + + +def _patch_cfg(monkeypatch, **model_overrides): + """Patch api.config.cfg to a synthetic config dict for the duration of a test.""" + fake_cfg = {"model": dict(model_overrides), "custom_providers": []} + monkeypatch.setattr(cfg_mod, "cfg", fake_cfg) + + +# ── Local-server providers preserve full model id ────────────────────────── + + +@pytest.mark.parametrize("provider_name", [ + "lmstudio", + "lm-studio", # Opus pre-release NIT + "ollama", + "llamacpp", + "llama-cpp", + "vllm", + "tabby", + "tabbyapi", + "koboldcpp", + "textgen", + "localai", # Opus pre-release NIT +]) +def test_known_local_server_provider_preserves_full_model_id(provider_name, monkeypatch): + """Known local-server provider names must preserve the slashed model id + even when the prefix matches _PROVIDER_MODELS.""" + _patch_cfg(monkeypatch, provider=provider_name, base_url="http://localhost:1234/v1") + model, provider, base_url = cfg_mod.resolve_model_provider("qwen/qwen3.6-27b") + assert model == "qwen/qwen3.6-27b", ( + f"Local-server provider {provider_name!r} must preserve the full model id; " + f"stripping it makes LM Studio/Ollama/etc. load a fresh instance with " + f"default settings (#1625)." + ) + assert provider == provider_name + assert base_url == "http://localhost:1234/v1" + + +def test_lmstudio_with_huggingface_namespace_preserved(monkeypatch): + """The reporter's exact case: lmstudio + qwen/qwen3.6-27b + localhost.""" + _patch_cfg(monkeypatch, provider="lmstudio", base_url="http://localhost:1234/v1", + default="qwen/qwen3.6-27b") + model, provider, base_url = cfg_mod.resolve_model_provider("qwen/qwen3.6-27b") + assert model == "qwen/qwen3.6-27b" + + +def test_lmstudio_with_openai_prefix_preserved(monkeypatch): + """openai/gpt-oss-120b on LM Studio is a real HuggingFace id; the namespace + is part of the registry key. Must not be stripped on local servers.""" + _patch_cfg(monkeypatch, provider="lmstudio", base_url="http://localhost:1234/v1") + model, provider, base_url = cfg_mod.resolve_model_provider("openai/gpt-oss-120b") + assert model == "openai/gpt-oss-120b", ( + "openai/gpt-oss-120b on LM Studio must preserve the full id (#1625)" + ) + + +# ── Loopback / private-IP heuristic ─────────────────────────────────────── + + +@pytest.mark.parametrize("loopback_url", [ + "http://localhost:11434", + "http://127.0.0.1:1234/v1", + "http://127.0.0.1:8080/openai", + "http://10.0.0.5:8080/v1", # private RFC1918 + "http://192.168.1.50:1234/v1", # private RFC1918 + "http://172.16.0.10:8000/v1", # private RFC1918 + "http://[::1]:1234/v1", # IPv6 loopback +]) +def test_loopback_base_url_preserves_full_model_id(loopback_url, monkeypatch): + """Even with a generic `provider: custom` (or any non-local-server name), + a base_url pointing at a loopback or private IP must preserve the model id — + almost certainly a local model server.""" + _patch_cfg(monkeypatch, provider="custom", base_url=loopback_url) + model, _, _ = cfg_mod.resolve_model_provider("qwen/qwen3.6-27b") + assert model == "qwen/qwen3.6-27b", ( + f"Loopback/private base_url {loopback_url!r} must preserve the full model id (#1625)" + ) + + +# ── Backward compat: OpenAI-compatible proxies keep prefix-stripping ───── + + +def test_public_openai_proxy_still_strips_prefix(monkeypatch): + """OpenAI-compatible proxies (LiteLLM, public OpenRouter relays) still get + the strip behavior so 'openai/gpt-5.4' → 'gpt-5.4'.""" + _patch_cfg(monkeypatch, provider="openai", base_url="https://litellm.example.com/v1") + model, provider, base_url = cfg_mod.resolve_model_provider("openai/gpt-5.4") + assert model == "gpt-5.4", ( + "Public-host openai/* on a non-loopback proxy must continue to strip prefix" + ) + + +def test_unknown_prefix_on_public_proxy_preserved(monkeypatch): + """Unknown prefix (zai-org/GLM-5.1) on public proxy passes through full + (the existing contract — stripping unknown prefixes caused model_not_found).""" + _patch_cfg(monkeypatch, provider="openai", base_url="https://litellm.example.com/v1") + model, _, _ = cfg_mod.resolve_model_provider("zai-org/GLM-5.1") + assert model == "zai-org/GLM-5.1" + + +def test_openrouter_passes_full_unaffected(monkeypatch): + """OpenRouter always needs the full provider/model path — pre-existing + contract that the local-server fix must not disturb.""" + _patch_cfg(monkeypatch, provider="openrouter") + model, provider, _ = cfg_mod.resolve_model_provider("anthropic/claude-sonnet-4.6") + assert model == "anthropic/claude-sonnet-4.6" + assert provider == "openrouter" + + +# ── Helper unit tests ───────────────────────────────────────────────────── + + +@pytest.mark.parametrize("url, expected", [ + ("http://localhost:1234", True), + ("http://127.0.0.1:1234", True), + ("http://10.0.0.5", True), + ("http://192.168.1.1:8080", True), + ("http://[::1]:1234", True), + ("http://example.com", False), + ("https://api.openai.com/v1", False), + ("https://litellm.example.com/v1", False), + ("", False), + (None, False), + ("not-a-url", False), +]) +def test_base_url_points_at_local_server_helper(url, expected): + assert cfg_mod._base_url_points_at_local_server(url) is expected diff --git a/tests/test_model_resolver.py b/tests/test_model_resolver.py index 4ceb1138..56c591dd 100644 --- a/tests/test_model_resolver.py +++ b/tests/test_model_resolver.py @@ -466,11 +466,18 @@ def test_custom_endpoint_uses_model_config_api_key_for_model_discovery(monkeypat # -- Issue #230: custom provider with slash model name ----------------------- def test_custom_endpoint_slash_model_routes_to_custom_not_openrouter(): - """Regression test for #230. + """Regression test for #230, updated for #1625. When provider=custom (or any non-openrouter provider) and base_url is set, a model name containing a slash (e.g. google/gemma-4-26b-a4b) must NOT be rerouted to OpenRouter -- it should stay on the configured custom endpoint. + + #1625 layered an additional rule on top: a base_url pointing at a loopback + or private-IP host is treated as a local model server (LM Studio, Ollama, + llama.cpp, vLLM, TabbyAPI), which register models under their full + HuggingFace path. On such hosts the prefix is now PRESERVED. The original + #433 strip behaviour still applies on public hosts (real OpenAI-compatible + proxies like LiteLLM at https://litellm.example.com/v1). """ # --- custom provider with slash model name should NOT go to openrouter --- model, provider, base_url = _resolve_with_config( @@ -486,10 +493,22 @@ def test_custom_endpoint_slash_model_routes_to_custom_not_openrouter(): assert base_url == 'http://127.0.0.1:1234/v1', ( "Expected base_url 'http://127.0.0.1:1234/v1', got '{}'.".format(base_url) ) - # Fix #433: provider prefix is now stripped for custom endpoints so stale - # prefixed model IDs from previous sessions do not break custom endpoint routing. - assert model == 'gemma-4-26b-a4b', ( - "Model name prefix should be stripped for custom base_url endpoint, got '{}'.".format(model) + # #1625 (supersedes the v0.50 #433 strip-on-custom rule for loopback hosts): + # 127.0.0.1 base_url is almost certainly a local LM Studio / Ollama / etc., + # which keys models on the full HuggingFace path. Preserve the prefix. + assert model == 'google/gemma-4-26b-a4b', ( + "Model name prefix must be PRESERVED on loopback base_url (#1625), got '{}'.".format(model) + ) + + # --- public-host openai-compatible proxy STILL strips per #433 ---------- + model2, provider2, base_url2 = _resolve_with_config( + 'google/gemma-4-26b-a4b', + provider='openai', + base_url='https://litellm.example.com/v1', + default='google/gemma-4-26b-a4b', + ) + assert model2 == 'gemma-4-26b-a4b', ( + "Public-host OpenAI-compat proxy must still strip prefix per #433, got '{}'.".format(model2) ) # --- openrouter with slash model name MUST still route to openrouter -----