mirror of
https://github.com/nesquena/hermes-webui.git
synced 2026-05-26 03:30:36 +00:00
Merge pull request #1661 from nesquena/stage-297
Release v0.50.297 — 3-PR batch (Docker regression fix + OAuth cancel race + persistent-host health hardening)
This commit is contained in:
@@ -1,5 +1,54 @@
|
||||
# Hermes Web UI -- Changelog
|
||||
|
||||
## [v0.50.297] — 2026-05-04
|
||||
|
||||
### Fixed (3 PRs — closes #1658; refs #1458, #1652)
|
||||
|
||||
- **Docker container no longer enters a crash loop on every normal Docker setup** (#1659 by @bergeouss, closes #1658) — PR #1635 (v0.50.295) added a writability guard `[ ! -w /etc/group ] || [ ! -w /etc/passwd ]` for podman `read_only=true` containers. Bug: the script runs as the non-root `hermeswebuitoo` user, so `/etc/group` (owned by root) is **always** non-writable from that user — guard fires on EVERY normal Docker setup, container enters a crash loop with `!! ERROR: Cannot modify /etc/group or /etc/passwd (read-only root fs)`. Affects all users running standard Docker after upgrading to v0.50.295. **Fix:** replace `[ ! -w ]` with `! sudo sh -c 'test -w /etc/group && test -w /etc/passwd' 2>/dev/null` — matches the fact that `groupmod`/`usermod` already use sudo a few lines below. Truly read-only rootfs (podman) → sudo can't write → guard fires correctly. Writable rootfs (normal Docker) → sudo can write → guard doesn't fire → groupmod/usermod runs normally. **3 LOC `docker_init.bash` change.** P0 regression fix.
|
||||
|
||||
- **OAuth Cancel during Codex device-token exchange now wins the race** (#1653 by @nesquena, follow-up to #1652 / refs #1362) — race in v0.50.296's Codex OAuth onboarding flow where a `POST /api/onboarding/oauth/cancel` arriving while the worker was mid-network-call would be silently overridden: credentials would still get persisted to `auth.json` and the flow status would flip from `cancelled` → `success`. Net effect: the user's explicit cancel was ignored, credentials persisted, UI reported success. **Fix:** re-check `_OAUTH_FLOWS[flow_id].status` under `_OAUTH_FLOWS_LOCK` immediately AFTER `_exchange_codex_authorization()` returns and BEFORE writing `auth.json`. If status is no longer `pending`, return cleanly — no persistence, no status overwrite. Behavioral test using `threading.Event` deterministically reproduces the race. UX-inconsistency severity, not a security bug (the credentials that get persisted ARE tokens the user authorized in their browser), but the cancel button stops doing what it says, violating the design intent of #1650's server-owned lifecycle.
|
||||
|
||||
- **Persistent-host health diagnostics + watchdog hardening** (#1657 by @Michaelyklam, refs #1458) — addresses the residual #1458 Bug #3 failure mode (process alive + port listening but HTTP requests not advancing), the wedge that survives after v0.50.275's FD-leak fix and v0.50.269's bootstrap fix. Adds three signals process supervisors can use to distinguish "process exists" from "request handling is still advancing":
|
||||
- **Accept-loop heartbeat**: `QuietHTTPServer.accept_loop_requests_total` + `accept_loop_last_request_at` instance attributes, incremented in `_handle_request_noblock()` (single `serve_forever()` thread, un-locked `+=` is safe). Surfaced in `/health` as `accept_loop: {requests_total, last_request_at}`.
|
||||
- **`/health?deep=1` readiness probe**: bounded `STREAMS_LOCK.acquire(timeout=0.5)` + `all_sessions()` walk + `load_projects(_migrate=False)` + `sqlite3.connect(state.db) + PRAGMA schema_version`. Returns 503 with `status: degraded` when streams lock blocks or any deep check errors. Watchdogs polling `/health?deep=1` every 30s open-and-close 2880 short-lived sqlite connections per day per probe — bounded FD usage, no leak surface.
|
||||
- **`RLIMIT_NOFILE` raise to 4096** at startup (best-effort, defense in depth for macOS launchd jobs that start at 256). Doesn't hide future FD leaks; gives diagnostic headroom before request handling falls over.
|
||||
- **`docs/supervisor.md` updates**: launchd/systemd HTTP watchdog recipe using `curl -fsS --max-time 10 /health?deep=1` + `launchctl kickstart -k`. Notes `accept_loop.requests_total` should advance — if it stays flat while the process is alive, the accept loop is wedged.
|
||||
|
||||
Per Opus advisor on stage-297: refactored `_deep_health_checks(stream_check=...)` to accept the pre-computed stream check from `_handle_health()` so we don't acquire `STREAMS_LOCK` twice on the same `/health?deep=1` request (cosmetic inefficiency, not a correctness bug — but also could false-fail when the second acquire times out under contention). Plus a docstring note on `_handle_request_noblock` documenting why the un-locked `+=` is safe (single-thread-only call site in CPython socketserver).
|
||||
|
||||
PR #1656 by the same author (smaller, module-level globals approach) was closed as superseded by #1657 (instance-level + state.db check + projects check + supervisor.md docs).
|
||||
|
||||
### Tests
|
||||
|
||||
4284 → **4288 passing** (+4 regression tests across `tests/test_issue1458_stability_hardening.py` (3) + `tests/test_issue1362_codex_oauth_onboarding.py::test_cancel_during_token_exchange_does_not_persist_credentials` (1)). 0 regressions. Full suite ~118s.
|
||||
|
||||
### Pre-release verification
|
||||
|
||||
- **Opus advisor on stage-297 combined diff: SHIP verdict.** All 9 verification questions cleared:
|
||||
- `_active_state_db_path()` verified at `api/models.py:924`, returns Path without opening connection
|
||||
- 500ms `STREAMS_LOCK.acquire(timeout=...)` ceiling reasonable for watchdog timeouts (10s curl `--max-time` typical)
|
||||
- `with closing(sqlite3.connect(...))` deterministically releases FD, `PRAGMA schema_version` is read-only
|
||||
- `_handle_request_noblock` heartbeat increment is BEFORE super() — counter advances even if request handling raises, correct accept-loop semantics
|
||||
- `_raise_fd_soft_limit()` correctly clamps to hard limit, only RAISES soft limit (won't lower below launchd's `LimitNOFILE` setting)
|
||||
- OAuth fix narrows race window from "seconds-long network call" to "microseconds-long file write" — minimal correct change at the right layer
|
||||
- Docker fix `sudo sh -c 'test -w'` correctly handles all 3 cases (writable+sudo / readonly+sudo / no-sudo)
|
||||
- **Two minor Opus follow-ups absorbed in-release**:
|
||||
- `_deep_health_checks(stream_check=...)` reuses pre-computed stream check from `_handle_health()` — saves redundant lock acquisition
|
||||
- Docstring note on `_handle_request_noblock` documenting single-thread safety of un-locked `+=`
|
||||
- **Self-built #1653** has thorough `threading.Event`-gated behavioral test demonstrating the race exists pre-fix and is fixed post-fix.
|
||||
- **Browser API sanity**: 11/11 endpoints OK on stage server.
|
||||
- **Conflict resolution**: zero file overlap across all 3 PRs (#1659 → docker_init.bash; #1653 → api/oauth.py; #1657 → api/routes.py + server.py + docs/supervisor.md). Auto-merged clean.
|
||||
|
||||
### Authors
|
||||
|
||||
- @bergeouss — 1 PR (#1659, AI-assisted via Hermes Agent) — fixing their own v0.50.295 #1635 regression
|
||||
- @nesquena (self-built) — 1 PR (#1653, follow-up to v0.50.296 #1652)
|
||||
- @Michaelyklam — 1 PR (#1657, hardening for #1458 Bug #3)
|
||||
|
||||
### Note on closed-as-superseded
|
||||
|
||||
PR #1656 (also @Michaelyklam) was closed as superseded by #1657. Both target #1458 Bug #3, both add accept-loop heartbeat + `/health?deep=1` + 503-on-degraded. #1657 adds beyond #1656: state.db connectivity check, projects state check, FD soft-limit raise, and `docs/supervisor.md` watchdog recipe. Same author iterated; the second PR was the keeper.
|
||||
|
||||
## [v0.50.296] — 2026-05-04
|
||||
|
||||
### Fixed (3 PRs — closes #1406, #1617; refs #1362)
|
||||
|
||||
+1
-1
@@ -2,7 +2,7 @@
|
||||
|
||||
> Web companion to the Hermes Agent CLI. Same workflows, browser-native.
|
||||
>
|
||||
> Last updated: v0.50.296 (May 04, 2026) — 4284 tests collected
|
||||
> Last updated: v0.50.297 (May 04, 2026) — 4288 tests collected
|
||||
> Test source: `pytest tests/ --collect-only -q`
|
||||
> Per-version detail: see [CHANGELOG.md](./CHANGELOG.md)
|
||||
|
||||
|
||||
+2
-2
@@ -1835,8 +1835,8 @@ Bridged CLI sessions:
|
||||
|
||||
---
|
||||
|
||||
*Last updated: v0.50.296, May 04, 2026*
|
||||
*Total automated tests collected: 4284*
|
||||
*Last updated: v0.50.297, May 04, 2026*
|
||||
*Total automated tests collected: 4288*
|
||||
*Regression gate: tests/test_regressions.py*
|
||||
*Run: pytest tests/ -v --timeout=60*
|
||||
*Source: <repo>/*
|
||||
|
||||
@@ -347,6 +347,13 @@ def _run_codex_oauth_worker(flow_id: str) -> None:
|
||||
if not authorization_code or not code_verifier:
|
||||
raise RuntimeError("Device auth response missing authorization_code or code_verifier")
|
||||
tokens = _exchange_codex_authorization(authorization_code, code_verifier)
|
||||
# Re-check status under lock before persisting: a cancel/expire that
|
||||
# raced with the device-token + token-exchange network calls must
|
||||
# win, so we don't persist credentials the user explicitly aborted.
|
||||
with _OAUTH_FLOWS_LOCK:
|
||||
current = _OAUTH_FLOWS.get(flow_id)
|
||||
if not current or current.get("status") != "pending":
|
||||
return
|
||||
_persist_codex_credentials(Path(live["hermes_home"]), tokens)
|
||||
_set_flow_status(flow_id, "success")
|
||||
return
|
||||
|
||||
+133
-11
@@ -12,6 +12,7 @@ import queue
|
||||
import re
|
||||
import platform
|
||||
import shutil
|
||||
import sqlite3
|
||||
import subprocess
|
||||
import sys
|
||||
import threading
|
||||
@@ -19,6 +20,7 @@ import time
|
||||
import uuid
|
||||
import re
|
||||
from pathlib import Path
|
||||
from contextlib import closing
|
||||
from urllib.parse import parse_qs
|
||||
from api.agent_sessions import MESSAGING_SOURCES
|
||||
|
||||
@@ -1222,6 +1224,7 @@ from api.models import (
|
||||
title_from,
|
||||
_write_session_index,
|
||||
SESSION_INDEX_FILE,
|
||||
_active_state_db_path,
|
||||
load_projects,
|
||||
save_projects,
|
||||
import_cli_session,
|
||||
@@ -1658,6 +1661,135 @@ def _handle_insights(handler, parsed) -> bool:
|
||||
# ── GET routes ────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def _accept_loop_health(handler) -> dict:
|
||||
server = getattr(handler, "server", None)
|
||||
return {
|
||||
"requests_total": int(getattr(server, "accept_loop_requests_total", 0) or 0),
|
||||
"last_request_at": round(float(getattr(server, "accept_loop_last_request_at", 0.0) or 0.0), 3),
|
||||
}
|
||||
|
||||
|
||||
def _streams_lock_health(timeout_seconds: float = 0.5) -> dict:
|
||||
t0 = time.time()
|
||||
acquired = STREAMS_LOCK.acquire(timeout=timeout_seconds)
|
||||
elapsed_ms = round((time.time() - t0) * 1000, 1)
|
||||
if not acquired:
|
||||
return {
|
||||
"status": "blocked",
|
||||
"timeout_seconds": timeout_seconds,
|
||||
"ms": elapsed_ms,
|
||||
}
|
||||
try:
|
||||
return {
|
||||
"status": "ok",
|
||||
"active_streams": len(STREAMS),
|
||||
"ms": elapsed_ms,
|
||||
}
|
||||
finally:
|
||||
STREAMS_LOCK.release()
|
||||
|
||||
|
||||
def _deep_health_checks(stream_check: dict | None = None) -> tuple[dict, bool]:
|
||||
"""Run cheap probes that exercise the state paths used by the UI shell.
|
||||
|
||||
Plain /health intentionally stays tiny. /health?deep=1 is for supervisors
|
||||
and watchdogs that need to know whether the process can still touch the
|
||||
shared stream map, sidebar/session path, project state, and Hermes state.db
|
||||
without hitting the RST-before-write failure mode from #1458.
|
||||
|
||||
`stream_check` is the result from a prior `_streams_lock_health()` call;
|
||||
if provided, it's reused so we don't acquire `STREAMS_LOCK` twice on the
|
||||
same /health?deep=1 request (per Opus advisor on stage-297).
|
||||
"""
|
||||
checks: dict[str, dict] = {}
|
||||
|
||||
checks["streams_lock"] = stream_check if stream_check is not None else _streams_lock_health()
|
||||
if checks["streams_lock"].get("status") != "ok":
|
||||
return checks, False
|
||||
|
||||
t0 = time.time()
|
||||
try:
|
||||
sessions = all_sessions()
|
||||
checks["sessions"] = {
|
||||
"status": "ok",
|
||||
"count": len(sessions),
|
||||
"ms": round((time.time() - t0) * 1000, 1),
|
||||
}
|
||||
except Exception as exc:
|
||||
checks["sessions"] = {
|
||||
"status": "error",
|
||||
"error": type(exc).__name__,
|
||||
"ms": round((time.time() - t0) * 1000, 1),
|
||||
}
|
||||
|
||||
t0 = time.time()
|
||||
try:
|
||||
projects = load_projects(_migrate=False)
|
||||
checks["projects"] = {
|
||||
"status": "ok",
|
||||
"count": len(projects),
|
||||
"ms": round((time.time() - t0) * 1000, 1),
|
||||
}
|
||||
except Exception as exc:
|
||||
checks["projects"] = {
|
||||
"status": "error",
|
||||
"error": type(exc).__name__,
|
||||
"ms": round((time.time() - t0) * 1000, 1),
|
||||
}
|
||||
|
||||
t0 = time.time()
|
||||
try:
|
||||
db_path = _active_state_db_path()
|
||||
if not db_path.exists():
|
||||
checks["state_db"] = {
|
||||
"status": "missing",
|
||||
"ms": round((time.time() - t0) * 1000, 1),
|
||||
}
|
||||
else:
|
||||
with closing(sqlite3.connect(str(db_path))) as conn:
|
||||
conn.execute("PRAGMA schema_version").fetchone()
|
||||
checks["state_db"] = {
|
||||
"status": "ok",
|
||||
"ms": round((time.time() - t0) * 1000, 1),
|
||||
}
|
||||
except Exception as exc:
|
||||
checks["state_db"] = {
|
||||
"status": "error",
|
||||
"error": type(exc).__name__,
|
||||
"ms": round((time.time() - t0) * 1000, 1),
|
||||
}
|
||||
|
||||
healthy = all(
|
||||
check.get("status") in {"ok", "missing"}
|
||||
for check in checks.values()
|
||||
)
|
||||
return checks, healthy
|
||||
|
||||
|
||||
def _handle_health(handler, parsed):
|
||||
deep = parse_qs(parsed.query or "").get("deep", [""])[0].lower() in {"1", "true", "yes", "on"}
|
||||
stream_check = _streams_lock_health()
|
||||
payload = {
|
||||
"status": "ok" if stream_check.get("status") == "ok" else "degraded",
|
||||
"sessions": len(SESSIONS),
|
||||
"active_streams": int(stream_check.get("active_streams") or 0),
|
||||
"uptime_seconds": round(time.time() - SERVER_START_TIME, 1),
|
||||
"accept_loop": _accept_loop_health(handler),
|
||||
}
|
||||
if deep:
|
||||
if stream_check.get("status") != "ok":
|
||||
payload["checks"] = {"streams_lock": stream_check}
|
||||
return j(handler, payload, status=503)
|
||||
checks, healthy = _deep_health_checks(stream_check=stream_check)
|
||||
payload["checks"] = checks
|
||||
if not healthy:
|
||||
payload["status"] = "degraded"
|
||||
return j(handler, payload, status=503)
|
||||
if payload["status"] != "ok":
|
||||
return j(handler, payload, status=503)
|
||||
return j(handler, payload)
|
||||
|
||||
|
||||
def handle_get(handler, parsed) -> bool:
|
||||
"""Handle all GET routes. Returns True if handled, False for 404."""
|
||||
|
||||
@@ -1776,17 +1908,7 @@ def handle_get(handler, parsed) -> bool:
|
||||
return _handle_insights(handler, parsed)
|
||||
|
||||
if parsed.path == "/health":
|
||||
with STREAMS_LOCK:
|
||||
n_streams = len(STREAMS)
|
||||
return j(
|
||||
handler,
|
||||
{
|
||||
"status": "ok",
|
||||
"sessions": len(SESSIONS),
|
||||
"active_streams": n_streams,
|
||||
"uptime_seconds": round(time.time() - SERVER_START_TIME, 1),
|
||||
},
|
||||
)
|
||||
return _handle_health(handler, parsed)
|
||||
|
||||
if parsed.path == "/api/models":
|
||||
return j(handler, get_available_models())
|
||||
|
||||
+5
-2
@@ -189,10 +189,13 @@ if [ "A${whoami}" == "Ahermeswebuitoo" ]; then
|
||||
# using usermod for the already create hermeswebui user, knowing it is not already in use
|
||||
# per usermod manual: "You must make certain that the named user is not executing any processes when this command is being executed"
|
||||
# Guard for read-only root filesystem (podman with read_only=true, issue #1470).
|
||||
# The script runs as hermeswebuitoo (non-root), but groupmod/usermod use sudo.
|
||||
# So we must check writability via sudo — a non-root user cannot write /etc/group
|
||||
# even on a normal writable rootfs, which caused a false positive (issue #1658).
|
||||
_readonly_root=false
|
||||
if [ ! -w /etc/group ] || [ ! -w /etc/passwd ]; then
|
||||
if ! sudo sh -c 'test -w /etc/group && test -w /etc/passwd' 2>/dev/null; then
|
||||
_readonly_root=true
|
||||
echo " !! Detected read-only root filesystem — /etc/group or /etc/passwd is not writable"
|
||||
echo " !! Detected read-only root filesystem — /etc/group or /etc/passwd is not writable (even via sudo)"
|
||||
fi
|
||||
if [ "A${_readonly_root}" == "Atrue" ]; then
|
||||
_current_hermeswebui_gid=$(id -g hermeswebui 2>/dev/null || echo "")
|
||||
|
||||
@@ -235,3 +235,44 @@ PID PPID CMD
|
||||
If PPID is ``1`` (init) when it should be the supervisor, the orphan-server
|
||||
loop is happening — re-check that ``--foreground`` (or one of the env vars)
|
||||
is reaching the process.
|
||||
|
||||
## HTTP watchdog / deep health
|
||||
|
||||
``KeepAlive`` / ``Restart=always`` only recover a process that exits. If the
|
||||
process is still listening on the port but request handling is wedged, pair your
|
||||
supervisor with an HTTP probe and force a restart when the probe fails.
|
||||
|
||||
Hermes Web UI exposes two health levels:
|
||||
|
||||
- ``/health`` — cheap liveness probe with ``active_streams``, uptime, and an
|
||||
``accept_loop`` heartbeat counter.
|
||||
- ``/health?deep=1`` — readiness probe that briefly acquires the stream lock,
|
||||
reads the sidebar/session path, reads projects state, and touches Hermes
|
||||
``state.db`` if it exists. Use this for watchdogs.
|
||||
|
||||
At startup the server also tries to raise its file-descriptor soft limit to
|
||||
4096 on platforms that support ``RLIMIT_NOFILE``. That is defense in depth for
|
||||
persistent hosts: leaks should still be fixed, but a higher soft limit gives
|
||||
you more diagnostic headroom before request handling falls over.
|
||||
|
||||
Minimal macOS launchd watchdog script:
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
LABEL="com.example.hermes-webui"
|
||||
BASE="http://127.0.0.1:8787"
|
||||
|
||||
if ! curl -fsS --max-time 10 "$BASE/health?deep=1" >/dev/null; then
|
||||
launchctl kickstart -k "gui/$(id -u)/$LABEL"
|
||||
fi
|
||||
```
|
||||
|
||||
Run it every few minutes from a separate ``StartInterval`` LaunchAgent. For
|
||||
systemd, prefer a timer/service pair that runs the same curl probe and
|
||||
``systemctl --user restart hermes-webui.service`` on failure.
|
||||
|
||||
The ``accept_loop.requests_total`` value should increase when probes arrive. If
|
||||
it stays flat while the process is still alive, the server accept loop is not
|
||||
making progress; capture logs/thread samples before restarting if you are
|
||||
collecting diagnostics for a bug report.
|
||||
|
||||
@@ -9,6 +9,11 @@ import sys
|
||||
import time
|
||||
import traceback
|
||||
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
|
||||
|
||||
try:
|
||||
import resource
|
||||
except ImportError: # pragma: no cover - resource is Unix-only
|
||||
resource = None
|
||||
from urllib.parse import urlparse
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
@@ -26,6 +31,28 @@ class QuietHTTPServer(ThreadingHTTPServer):
|
||||
"""Custom HTTP server that silently handles common network errors."""
|
||||
daemon_threads = True
|
||||
request_queue_size = 64
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
self.accept_loop_requests_total = 0
|
||||
self.accept_loop_last_request_at = 0.0
|
||||
|
||||
def _handle_request_noblock(self):
|
||||
"""Record accept-loop progress before dispatching a request handler.
|
||||
|
||||
A process can be alive and still stop accepting/dispatching requests.
|
||||
Exposing this heartbeat on /health gives supervisors and watchdogs a
|
||||
cheap signal that the accept loop is still moving.
|
||||
|
||||
Note: this method is called only from the single ``serve_forever()``
|
||||
thread in CPython socketserver, so the un-locked ``+=`` increment is
|
||||
safe — there is no other thread mutating these counters. The /health
|
||||
readers may see a stale value momentarily but never an inconsistent
|
||||
one (Python int reads are atomic). Per Opus advisor on stage-297.
|
||||
"""
|
||||
self.accept_loop_requests_total += 1
|
||||
self.accept_loop_last_request_at = time.time()
|
||||
return super()._handle_request_noblock()
|
||||
|
||||
def handle_error(self, request, client_address):
|
||||
"""Override to suppress logging for common client disconnect errors."""
|
||||
@@ -129,11 +156,50 @@ class Handler(BaseHTTPRequestHandler):
|
||||
clear_request_profile()
|
||||
|
||||
|
||||
def _raise_fd_soft_limit(target: int = 4096) -> dict:
|
||||
"""Best-effort raise of RLIMIT_NOFILE for persistent WebUI hosts.
|
||||
|
||||
macOS launchd jobs often start with a 256 soft limit. If a future FD leak
|
||||
regresses, that low ceiling turns a leak into a hard HTTP wedge quickly.
|
||||
Raising the soft limit does not hide leaks; it buys enough headroom for
|
||||
diagnostics and watchdog recovery.
|
||||
"""
|
||||
if resource is None:
|
||||
return {"status": "unsupported"}
|
||||
try:
|
||||
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
|
||||
except Exception as exc:
|
||||
return {"status": "error", "error": str(exc)}
|
||||
|
||||
# On Unix, RLIM_INFINITY is commonly a large int; keep the logic explicit
|
||||
# so tests can use ordinary integers without depending on platform values.
|
||||
desired = int(target)
|
||||
if hard not in (-1, getattr(resource, "RLIM_INFINITY", object())):
|
||||
desired = min(desired, int(hard))
|
||||
if soft >= desired:
|
||||
return {"status": "unchanged", "soft": soft, "hard": hard}
|
||||
try:
|
||||
resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
|
||||
except Exception as exc:
|
||||
return {"status": "error", "soft": soft, "hard": hard, "error": str(exc)}
|
||||
return {"status": "raised", "soft": desired, "hard": hard, "previous_soft": soft}
|
||||
|
||||
|
||||
def main() -> None:
|
||||
from api.config import print_startup_config, verify_hermes_imports, _HERMES_FOUND
|
||||
|
||||
print_startup_config()
|
||||
|
||||
fd_limit = _raise_fd_soft_limit()
|
||||
if fd_limit.get("status") == "raised":
|
||||
print(
|
||||
f"[ok] Raised file descriptor soft limit "
|
||||
f"{fd_limit.get('previous_soft')} -> {fd_limit.get('soft')}",
|
||||
flush=True,
|
||||
)
|
||||
elif fd_limit.get("status") == "error":
|
||||
print(f"[!!] WARNING: Could not raise file descriptor limit: {fd_limit.get('error')}", flush=True)
|
||||
|
||||
# Fix sensitive file permissions before doing anything else
|
||||
fix_credential_permissions()
|
||||
|
||||
|
||||
@@ -111,6 +111,60 @@ def test_cancel_marks_flow_cancelled_and_poll_stops(tmp_path):
|
||||
assert polled["status"] == "cancelled"
|
||||
|
||||
|
||||
def test_cancel_during_token_exchange_does_not_persist_credentials(monkeypatch, tmp_path):
|
||||
"""Cancel arriving while the worker is mid-network-call must win.
|
||||
|
||||
Without the post-exchange status re-check, the worker would proceed to
|
||||
persist credentials to auth.json AND override the cancelled status with
|
||||
"success" — silently storing tokens the user explicitly aborted.
|
||||
"""
|
||||
import threading
|
||||
import api.oauth as oauth
|
||||
|
||||
oauth._OAUTH_FLOWS.clear()
|
||||
|
||||
poll_started = threading.Event()
|
||||
poll_continue = threading.Event()
|
||||
|
||||
def _slow_poll(device_auth_id, user_code):
|
||||
poll_started.set()
|
||||
assert poll_continue.wait(timeout=5)
|
||||
return {"authorization_code": "auth-code", "code_verifier": "verifier"}
|
||||
|
||||
def _exchange(authorization_code, code_verifier):
|
||||
return {"access_token": "ACCESS", "refresh_token": "REFRESH"}
|
||||
|
||||
monkeypatch.setattr(oauth, "_poll_codex_authorization", _slow_poll)
|
||||
monkeypatch.setattr(oauth, "_exchange_codex_authorization", _exchange)
|
||||
|
||||
flow_id = "race-flow"
|
||||
oauth._OAUTH_FLOWS[flow_id] = {
|
||||
"provider": "openai-codex",
|
||||
"status": "pending",
|
||||
"device_auth_id": "device-secret",
|
||||
"user_code": "ABCD-EFGH",
|
||||
"expires_at": time.time() + 600,
|
||||
"poll_interval_seconds": 1,
|
||||
"hermes_home": str(tmp_path),
|
||||
"created_at": time.time(),
|
||||
"updated_at": time.time(),
|
||||
}
|
||||
|
||||
worker = threading.Thread(target=oauth._run_codex_oauth_worker, args=(flow_id,), daemon=True)
|
||||
worker.start()
|
||||
assert poll_started.wait(timeout=5)
|
||||
|
||||
oauth.cancel_onboarding_oauth_flow({"flow_id": flow_id})
|
||||
assert oauth._OAUTH_FLOWS[flow_id]["status"] == "cancelled"
|
||||
|
||||
poll_continue.set()
|
||||
worker.join(timeout=5)
|
||||
assert not worker.is_alive()
|
||||
|
||||
assert oauth._OAUTH_FLOWS[flow_id]["status"] == "cancelled"
|
||||
assert not (tmp_path / "auth.json").exists()
|
||||
|
||||
|
||||
def test_expired_flow_reports_expired_and_drops_sensitive_lifecycle(tmp_path):
|
||||
import api.oauth as oauth
|
||||
|
||||
|
||||
@@ -0,0 +1,66 @@
|
||||
"""Regression coverage for issue #1458 persistent-host hardening."""
|
||||
import json
|
||||
import urllib.request
|
||||
|
||||
from tests._pytest_port import BASE
|
||||
|
||||
|
||||
def _get(path):
|
||||
with urllib.request.urlopen(BASE + path, timeout=10) as r:
|
||||
return json.loads(r.read()), r.status
|
||||
|
||||
|
||||
def test_health_exposes_accept_loop_heartbeat():
|
||||
data, status = _get("/health")
|
||||
|
||||
assert status == 200
|
||||
heartbeat = data.get("accept_loop")
|
||||
assert isinstance(heartbeat, dict)
|
||||
assert isinstance(heartbeat.get("requests_total"), int)
|
||||
assert heartbeat["requests_total"] >= 1
|
||||
assert isinstance(heartbeat.get("last_request_at"), (int, float))
|
||||
assert heartbeat["last_request_at"] > 0
|
||||
|
||||
|
||||
def test_deep_health_exercises_session_project_and_sqlite_paths():
|
||||
data, status = _get("/health?deep=1")
|
||||
|
||||
assert status == 200
|
||||
assert data["status"] == "ok"
|
||||
checks = data.get("checks")
|
||||
assert isinstance(checks, dict)
|
||||
assert checks["streams_lock"]["status"] == "ok"
|
||||
assert isinstance(checks["streams_lock"].get("active_streams"), int)
|
||||
assert checks["sessions"]["status"] == "ok"
|
||||
assert isinstance(checks["sessions"].get("count"), int)
|
||||
assert checks["projects"]["status"] == "ok"
|
||||
assert isinstance(checks["projects"].get("count"), int)
|
||||
# The isolated test home may not have a Hermes state.db yet. Deep health
|
||||
# should still report the state-db probe explicitly so watchdogs can tell
|
||||
# whether sqlite was checked or absent.
|
||||
assert checks["state_db"]["status"] in {"ok", "missing"}
|
||||
|
||||
|
||||
def test_server_raises_fd_soft_limit_when_resource_allows(monkeypatch):
|
||||
import server
|
||||
|
||||
calls = []
|
||||
|
||||
class FakeResource:
|
||||
RLIMIT_NOFILE = object()
|
||||
|
||||
@staticmethod
|
||||
def getrlimit(which):
|
||||
return (256, 8192)
|
||||
|
||||
@staticmethod
|
||||
def setrlimit(which, limits):
|
||||
calls.append((which, limits))
|
||||
|
||||
monkeypatch.setattr(server, "resource", FakeResource, raising=False)
|
||||
|
||||
result = server._raise_fd_soft_limit(target=4096)
|
||||
|
||||
assert result["status"] == "raised"
|
||||
assert result["soft"] == 4096
|
||||
assert calls == [(FakeResource.RLIMIT_NOFILE, (4096, 8192))]
|
||||
Reference in New Issue
Block a user