Adopt the UI media from @Michaelyklam's parallel-discovery PR #1641 which
shipped the same one-character regex relax fix for #1618. PR #1641 is
being closed as superseded by #1642 (which carries nesquena APPROVED +
322 LOC test suite); preserving Michael's UI evidence here so the visual
proof of the fix lives in-tree alongside the canonical PR.
Co-authored-by: Michael Lam <Michaelyklam1@gmail.com>
Three changes from the pre-merge Opus review:
**MUST-FIX** — XPC_SERVICE_NAME false-positive on macOS Terminal
macOS launchd sets `XPC_SERVICE_NAME` in EVERY Terminal-spawned shell, not
just real services. Typical noise values: `"0"` (truthy in Python!) and
`"application.com.apple.Terminal.<UUID>"`. A bare `os.environ.get(name)`
existence check would auto-promote interactive `./start.sh` runs to
foreground mode on every Mac dev machine — silently breaking the most
common installation path (no /health probe, no browser open, no log file,
hanging shell).
Fix: new `_is_real_supervisor_value()` helper that filters noise. For
`XPC_SERVICE_NAME` specifically, reject `"0"` and any `"application.*"`
prefix. Real launchd plists use reverse-DNS Label form (`com.<rdns>.<svc>`)
which still triggers correctly.
7 new tests in `TestXPCServiceNameNoiseFilter`:
- 4 noise values (`0`, Terminal.app, iTerm2, VSCode) → no detection
- 3 real Label forms → correct detection
- Mixed env with XPC noise + real INVOCATION_ID → falls through to systemd
**SHOULD-FIX 1** — Test env leakage
The original `clean_env` fixture stripped supervisor-detection env vars
but not the resolved bootstrap vars (HERMES_WEBUI_HOST/PORT/AGENT_DIR)
that `main()` mutates onto `os.environ`. After
`test_foreground_exports_resolved_env_vars` ran, later tests would import
bootstrap with polluted defaults (DEFAULT_HOST="0.0.0.0" instead of
"127.0.0.1"). Existing assertions still passed (tautological vs DEFAULT_*),
but it was a footgun for future tests.
Fix: extend `clean_env` to also `delenv` the three resolved vars before
each test.
**SHOULD-FIX 2** — Pre-execv executability guard
If `discover_launcher_python` returns a path that doesn't exist or isn't
executable, `os.execv` raises OSError → wrapper catches → SystemExit(1)
→ supervisor restarts → loop forever. That's exactly the failure mode
this PR is supposed to eliminate.
Fix: `os.access(python_exe, os.X_OK)` check before execv. Converts
infinite supervisor loop into a single visible RuntimeError.
1 new test in `TestForegroundExecutabilityGuard` pinning that the guard
fires before execv when the python path is non-executable.
**Docs** — supervisor.md updates
- New section explaining the XPC_SERVICE_NAME noise filter and what
values trigger / don't trigger detection
- New section listing supervisors that are NOT auto-detected (runit,
daemontools, PM2, Foreman/Honcho, custom shell-script supervisors)
with explicit recommendation to set HERMES_WEBUI_FOREGROUND=1
Verification
- 3820 tests pass (+9 from this commit's new tests vs the original PR
push of 3811)
- Filter manually verified end-to-end with the live os.environ:
XPC=0 → None, XPC=application.* → None, XPC=com.example.foo → triggers
- run-browser-tests.sh ALL CHECKS PASSED on the worktree
Items deferred from the Opus review
- #4 chdir target may not exist: REPO_ROOT comes from __file__.resolve()
so it's stable; not a real concern in practice
- #6 two startup messages in foreground mode: cosmetic, useful for
diagnostics
- #7 stricter explicit-only mode: leaves user the override of just not
passing --foreground (current behavior)
- #8 test stub return value: trivial, can fix later if regression surface
- #9 argparse positional-after-option ordering: test reads fine
These can be follow-up issues if anyone hits them.
Issue #1458 reports persistent-host crashes (≥1/day) when running the WebUI
under launchd KeepAlive on macOS. Root cause: `bootstrap.py` calls
`subprocess.Popen([python, "server.py"], start_new_session=True)`, probes
/health, then exits 0. Under any process supervisor (launchd, systemd,
supervisord, runit, s6), the supervisor sees its tracked PID exit, marks
the program as "completed," and respawns it. The new bootstrap fails to
bind port 8787 (orphaned server still has it), exits non-zero, supervisor
respawns again — loop until the orphan crashes for some other reason and
the next respawn finds the port free.
This PR addresses Bug #1 of the three failure modes tracked in #1458:
the `bootstrap.py` double-fork breaking process supervisors. Bug #2
(state.db FD leak) and Bug #3 (HTTP-unhealthy wedge) remain open under
the same issue — they need diagnosis data before a fix can land.
Changes
-------
1. `bootstrap.py`:
- New `--foreground` argparse flag with help text mentioning launchd /
systemd / supervisord.
- New `_detect_supervisor()` that returns the env var name for any
supervisor it detects: `INVOCATION_ID` / `JOURNAL_STREAM` /
`NOTIFY_SOCKET` (systemd, s6), `XPC_SERVICE_NAME` (launchd),
`SUPERVISOR_ENABLED` (supervisord), or `HERMES_WEBUI_FOREGROUND` for
the explicit user opt-in. Truthy values for the explicit opt-in:
`1` / `true` / `yes` / `on` (case-insensitive).
- `main()` branches on `args.foreground or _detect_supervisor()`:
- **Foreground path:** chdir to `agent_dir or REPO_ROOT`, then
`os.execv(python, [python, server_path])` to replace the bootstrap
process image with the server. The supervisor sees the long-lived
server as the original child. No `wait_for_health` probe — the
supervisor's KeepAlive / Restart=on-failure handles liveness.
- **Default path:** unchanged. Spawn server as detached child via
`Popen + start_new_session=True`, probe /health, return 0. This
still works for interactive `bash start.sh` invocations.
- Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are now mutated on
`os.environ` directly instead of into a local `env` copy so they
are inherited across `os.execv`.
2. `docs/supervisor.md` (new): runnable launchd plist, systemd .service,
and supervisord conf examples + a diagnostic recipe (`lsof` + ppid
chain) for catching the orphan-loop in production.
3. `.gitignore`: allowlist `docs/supervisor.md` (the directory uses an
opt-in pattern; matches the existing `!docs/docker.md` precedent).
4. `tests/test_bootstrap_foreground.py` (new): 35 regression tests
covering the argparse flag, `_detect_supervisor()` behavior across all
five supervisor env vars, the explicit opt-in's truthy/falsy values,
and `main()`'s execv-vs-Popen routing decision under each input
combination. `os.execv` is monkeypatched in the routing tests — we
pin the structural choice (which call is made, with which args, in
which cwd, with which env) not the post-exec behavior.
Why this scope and no more
--------------------------
Bug #2 (state.db FD leak) lists 5 candidate paths and asks the reporter
for `lsof -p <pid> | sort | uniq -c | sort -rn | head -20` output to
disambiguate. Until that data lands, any "fix" would be speculative —
explicitly out of scope per the contributor-pickup comment on the issue.
Bug #3 (launchd-running, port-listening, HTTP-unhealthy) was added in
@stefanpieter's reply comment. Diagnosis is in flight; no concrete fix
shape yet. Also out of scope.
Running locally end-to-end verifies the behavior:
```
[bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: --foreground)
$ pgrep -af 'server.py'
2997632 /home/.../python /tmp/wt-fix-1458/server.py
$ ps -o ppid -p 2997632
2997581 ← bash that ran bootstrap.py — same PID as the original bootstrap
$ ps -p 2997581 -o cmd
... bootstrap.py ... ← but exec'd into server.py
```
The same PID that bash forked for `bootstrap.py` is now `server.py`.
A supervisor watching that PID would correctly observe the long-lived
server. No double-fork.
Verification
------------
- 3811 tests pass (`pytest tests/` — full suite, +51 from this PR plus
master-merge-in)
- All 35 new bootstrap-foreground tests pass
- `bash scripts/run-browser-tests.sh` PASS (HTTP API checks against worktree)
- `bash scripts/webui_qa_agent.sh 8789` PASS (23/23 visual QA)
- Live verified: server starts cleanly under both `--foreground` and
`HERMES_WEBUI_FOREGROUND=1`; PID lineage confirms no double-fork
Closes#1458 (Bug #1 only). Bugs #2 and #3 remain tracked under the
issue.
REQUIRED:
- _fully_unquote_path range(3) -> range(10) — defense-in-depth so quadruple-
encoded .. is rejected by validator instead of slipping through (not
exploitable but contract violation)
- docs/EXTENSIONS.md trust-model callout moved to top of file with explicit
'don't enable in untrusted env / don't point at user-writable dir' guidance
NICE-TO-HAVE (taken since Nathan asked for all fixes big and small):
- URL list cap at _MAX_URL_LIST=32 to avoid pathological rendering
- One-shot WARNING log for rejected URLs (silent drop now visible to admin)
- One-shot WARNING log for URL list truncation
- MIME map: ttf (font/ttf), otf (font/otf), wasm (application/wasm)
5 regression tests in tests/test_pr1445_opus_followups.py pin all invariants.
Combines PR #1428 (UID/GID alignment) with a broader Docker reliability pass
that addresses recurring user reports about compose files not working.
Constituent PR:
- #1428 sunnysktsang - Align agent UID/GID with webui (fixes#1399).
Two- and three-container compose files had agent at UID 10000 (image
default) and webui at UID 1000 (WANTED_UID default), causing permission
denied on shared hermes-home volume. All services now use ${UID:-1000}.
Plus broader Docker UX overhaul:
- All 3 compose files document HERMES_SKIP_CHMOD/HERMES_HOME_MODE escape
hatches inline (the v0.50.254 fix wasn't surfaced for Docker users).
- New .env.docker.example template covering UID/GID, paths, password,
permission handling. UID/GID are uncommented with placeholder values
per Opus advisor (so macOS users don't skim past).
- New docs/docker.md - comprehensive guide: 5-min quickstart, failure
mode table with one-line fixes, bind-mount migration, multi-container
architecture diagram, macOS Docker Desktop VirtioFS note, link to
community sunnysktsang/hermes-suite all-in-one image.
- README Docker section rewritten - clearer quickstart, failure-mode
table, link to docs/docker.md. Stale /root/.hermes references removed.
Plus Opus pre-release advisor MUST-FIX:
- HERMES_HOME_MODE has DIFFERENT semantics in the WebUI vs the agent
image. WebUI: credential-file mode threshold (0640 allows group bits).
Agent: HERMES_HOME directory mode (default 0700). 0640 on a directory
has no owner-execute bit, so the agent can't traverse its own home and
bricks. My initial draft recommended HERMES_HOME_MODE=0640 in agent
service blocks - corrected to 0750 across all 4 surfaces (compose
files, .env.docker.example, docs/docker.md). 3 regression tests pin
the asymmetry.
12 regression tests total in test_v050260_docker_invariants.py.
Full suite: 3627 passed, 0 failed.
Nathan explicitly authorized merge with my own review + Opus only, no
independent review needed.
When a user switched profiles and created a new session, the session
was saved to the default profile directory instead of the active
profile directory — because get_hermes_home_for_profile() silently
fell back to _DEFAULT_HERMES_HOME when the profile directory didn't
exist yet on disk.
Root cause: api/profiles.py:156 had `if profile_dir.is_dir(): return
profile_dir; return _DEFAULT_HERMES_HOME`. New profiles (no session
yet, so no dir) routed every session back to default.
Fix: remove the is_dir() guard, return the profile path
unconditionally. The profile directory is created on first use by
the agent/session layer.
5 regression tests in tests/test_issue1195_session_profile_routing.py:
existing-profile, non-existent-profile (the core fix), None, empty-
string, 'default' all return the expected path.
Co-authored-by: bergeouss <bergeouss@users.noreply.github.com>