Files
hermes-webui/tests
Sanjay Santhanam efcfff3d7f fix(#1879): cross-container gateway liveness via state-file freshness
The dashboard banner 'Hermes agent is not responding' fires on every
multi-container deployment that doesn't set 'pid: "service:hermes-agent"'
in compose, because get_running_pid() relies on fcntl.flock and
os.kill(pid, 0) — both PID-namespace-scoped and invisible across container
boundaries.

Fix: when get_running_pid() returns None, fall back to a freshness check on
gateway_state.json. The gateway already writes that file on every tick with
gateway_state == 'running' and an aware ISO-8601 updated_at timestamp, so a
recent (<= 120s) timestamp is an equivalent live-process signal that needs
only a shared volume — no PID namespace, no compose workaround, no extra
HTTP probe URL.

Behavior preserved:
- In-namespace deployments still hit the PID-based path first; payload shape
  unchanged (no 'reason' key) so #716 contract holds.
- Cross-container alive path adds reason='cross_container_freshness' so
  support diagnostics can tell which signal succeeded.
- Stale updated_at, non-running gateway_state, malformed/naive/missing
  timestamps, and timestamps far in the future all still report 'down' — the
  fallback never produces a false positive.
- Same redaction rules: argv/command/executable/env/raw pid never leak.

Tests: 15 new cases in test_issue1879_cross_container_gateway_liveness.py
covering the cross-container alive path, every refusal case, clock-skew
tolerance, and backward compat with the #716 PID path. Existing #716
heartbeat tests (8) continue to pass.
2026-05-08 15:15:50 +00:00
..
2026-04-29 19:54:07 -07:00
2026-05-05 01:51:05 +00:00
2026-05-05 01:51:05 +00:00
2026-04-29 17:42:32 -07:00
2026-04-29 17:42:32 -07:00
2026-04-29 21:34:27 -07:00
2026-04-29 21:06:30 -07:00
2026-04-29 17:42:32 -07:00