Commit Graph

116 Commits

Author SHA1 Message Date
Isla Liu 2a303de2a3 fix(session): preserve retry budget while journal is still arriving 2026-05-20 20:55:07 +08:00
Isla Liu d5a185d9c6 fix(session): serialize lazy journal retry per session 2026-05-20 20:48:38 +08:00
Isla Liu 9870e8f111 fix(session): address Copilot review — scope tool-card dedupe by stream id + tighten docs
Four code-review comments from the automated Copilot reviewer on this PR:

1. `_journal_tool_already_present` dedupe was session-wide, so a
   legitimately-repeated tool (e.g. a second `terminal: ls` in an
   earlier turn) could cause the retry path to falsely skip
   materializing the recovered tool card.  The helper now takes a
   keyword `stream_id` argument; when supplied, a tool card whose
   `_recovered_stream_id` is set AND differs from the candidate is no
   longer treated as a duplicate.  Untagged tool cards (live tools, or
   tool cards carried over from a pre-tagging core transcript) still
   match, preserving the existing 'core transcript already has this
   tool, don't duplicate' invariant.  Two new tests in
   `TestJournalToolDedupeScoping` cover both legs of the rule.

2./3. The troubleshooting FAQ pointed at `~/.hermes/webui/sessions/session_<sid>.json`
   and `~/.hermes/_run_journal/...`.  The actual sidecar filename has
   no `session_` prefix and the run-journal lives under the WebUI
   sessions dir (`~/.hermes/webui/sessions/_run_journal/<sid>/<stream>.jsonl`,
   default).  Both paths fixed and an explicit note added about
   `HERMES_WEBUI_STATE_DIR` overriding the state root.

4. Drop unused `json` / `queue` / `Path` imports from
   `tests/test_session_lost_response_regression.py` so the file stops
   carrying noise that future linting would flag.
2026-05-20 12:18:03 +08:00
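The stream-scoped dedupe rule in item 1 of the commit above can be sketched roughly as follows. This is only an illustration, not the project's actual helper: the `tool_name` / `tool_args` message keys and the function name are hypothetical, and only `_recovered_stream_id` and the keyword `stream_id` argument come from the commit message.

```python
from typing import Any, Dict, List, Optional


def journal_tool_already_present(
    messages: List[Dict[str, Any]],
    tool_name: str,
    tool_args: Any,
    *,
    stream_id: Optional[str] = None,
) -> bool:
    """Return True if an equivalent tool card already exists.

    Hypothetical sketch: "tool_name" and "tool_args" are assumed keys.
    """
    for msg in messages:
        if msg.get("tool_name") != tool_name or msg.get("tool_args") != tool_args:
            continue
        tagged = msg.get("_recovered_stream_id")
        # A card recovered from a *different* stream is not a duplicate.
        if stream_id is not None and tagged and tagged != stream_id:
            continue
        # Untagged cards (live tools, pre-tagging transcripts) still match.
        return True
    return False
```

With this shape, a repeated `terminal: ls` recovered from an earlier stream no longer blocks the card for the current stream, while an untagged card from the core transcript still suppresses a duplicate.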
Isla Liu 75a26174aa fix(session): lazily retry run-journal recovery so the interrupted-turn marker self-heals
When the WebUI process restarts mid-stream and sidecar repair runs while
the run-journal for the dead stream is not yet visible on disk (WSL2 9p
/ DrvFs page-cache loss, un-fsynced journal tail on network FS, …),
`_append_journaled_partial_output()` returns False and the marker is
permanently baked with the "no agent output was recovered" wording even
though the journaled tokens appear on disk shortly afterwards.

This commit reframes the recovery contract so the read side can
self-heal:

  * `_interrupted_recovery_marker` gains a `pending_retry=True` mode
    that produces a third wording ("Recovering the partial output …
    reload this session to retry.") and stamps a
    `_pending_journal_recovery` flag.
  * `_apply_core_sync_or_error_marker` now writes that pending-retry
    marker (with `_journal_retry_stream_id`,
    `_journal_retry_attempts`, `_journal_retry_first_seen_ts` meta)
    whenever it cannot recover visible output AND the stream id is
    known. The legacy "no output" wording is reserved for the
    no-stream-id case. The core-sync branch leaves marker emission to
    the existing visible-output check (the core transcript itself is the
    canonical history in that branch).
  * A new `_retry_journal_recovery_in_place(session)` helper re-runs
    `_append_journaled_partial_output(…, dedupe_existing=True)` for the
    latest pending marker. On success the marker is promoted in place to
    the recovered-output wording, the journaled rows are reordered to
    sit above the marker (preserving chronological order), and all
    retry meta is stripped. On failure the attempt count is
    incremented; after _JOURNAL_RETRY_MAX_ATTEMPTS (12) or
    _JOURNAL_RETRY_GIVEUP_SECONDS (24h) the marker is demoted to a
    neutral "Partial output may have been lost." wording.
  * `get_session()` cheaply short-circuits via
    `_session_has_pending_journal_retry()` and invokes the helper on
    both cache-hit and cold-load paths when a pending marker is found.
    `metadata_only=True` skips the helper to keep sidebar refresh
    cheap. The retry call runs OUTSIDE the SESSIONS LOCK to avoid a
    deadlock with `session.save()` write paths.

No streaming write path or run_journal fsync behaviour is changed — the
fix is read-side only.
2026-05-20 11:58:26 +08:00
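The retry bookkeeping described in the commit above might look something like the sketch below. It is an assumption-laden illustration: the function name, the `try_recover` callback, the returned state strings, and the choice to clear the retry meta on give-up are all hypothetical; only the meta key names, the two limits (12 attempts, 24h), and the demoted wording come from the commit message.

```python
import time
from typing import Any, Callable, Dict, Optional

JOURNAL_RETRY_MAX_ATTEMPTS = 12
JOURNAL_RETRY_GIVEUP_SECONDS = 24 * 60 * 60

_RETRY_META_KEYS = (
    "_pending_journal_recovery",
    "_journal_retry_stream_id",
    "_journal_retry_attempts",
    "_journal_retry_first_seen_ts",
)


def retry_pending_marker(
    marker: Dict[str, Any],
    try_recover: Callable[[str], bool],
    now: Optional[float] = None,
) -> str:
    """Retry recovery once; return 'not_pending', 'pending', 'recovered' or 'gave_up'."""
    if not marker.get("_pending_journal_recovery"):
        return "not_pending"
    now = time.time() if now is None else now

    if try_recover(marker["_journal_retry_stream_id"]):
        outcome = "recovered"
    else:
        attempts = marker.get("_journal_retry_attempts", 0) + 1
        marker["_journal_retry_attempts"] = attempts
        age = now - marker.get("_journal_retry_first_seen_ts", now)
        if attempts < JOURNAL_RETRY_MAX_ATTEMPTS and age < JOURNAL_RETRY_GIVEUP_SECONDS:
            return "pending"  # keep the marker; a later reload retries
        outcome = "gave_up"
        marker["content"] = "Partial output may have been lost."

    # Assumption: retry meta is dropped on both terminal outcomes.
    for key in _RETRY_META_KEYS:
        marker.pop(key, None)
    return outcome
```

Passing `now` explicitly keeps the give-up window testable without sleeping.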
nesquena-hermes cc8ef201be Stage 387: PR #2600 2026-05-19 22:10:20 +00:00
nesquena-hermes 93727897b6 Stage 387: PR #2605
# Conflicts:
#	api/routes.py
2026-05-19 22:10:20 +00:00
nesquena-hermes e63de7c15f Stage 387: PR #2593
# Conflicts:
#	CHANGELOG.md
2026-05-19 22:08:56 +00:00
Lumen Yang dc5c8168d1 fix(webui): refresh active session on external sidecar updates 2026-05-19 21:34:08 +00:00
Lumen Yang 8d2b9d4a16 feat(webui): render indexed context metadata 2026-05-19 18:52:50 +00:00
nesquena-hermes 86f52f67b8 Stage 386: PR #2581
# Conflicts:
#	api/streaming.py
2026-05-19 18:20:47 +00:00
Michael Lam 0736e45485 fix: dedupe tool-only partial recovery markers 2026-05-19 11:16:21 -07:00
starship-s 2e9ca283dc fix: display canonical cache hit percentage 2026-05-19 02:27:12 -06:00
Lumen Yang 600bb48970 fix(webui): use active state db for metadata summary 2026-05-19 08:02:43 +00:00
Lumen Yang 6ca63e5815 perf(webui): keep external refresh metadata cheap 2026-05-19 08:02:43 +00:00
Lumen Yang a63ab310b5 fix(webui): preserve reconciled session invariants 2026-05-19 08:02:43 +00:00
Lumen Yang 467ef33a24 feat(webui): reconcile external session updates
When API server runs append messages directly to state.db, reconcile WebUI sidecar sessions with those canonical rows across API responses, model-facing streaming context, and active browser refresh.

Add append-only state.db merge helpers, metadata-only counts for refresh polling, and regression coverage for API visibility, context incorporation, and frontend refresh behavior.
2026-05-19 08:02:43 +00:00
Frank Song 4661a5e94e Recover journal output after core transcript sync 2026-05-17 12:28:05 +08:00
nesquena-hermes 8f98465024 Stage 374: PR #2427 — fix(streaming): recover journaled partial assistant output after WebUI restart by @franksong2702 (fixes #2423)
Co-authored-by: Frank Song <franksong2702@gmail.com>
2026-05-17 02:49:35 +00:00
nesquena-hermes 47c210899e Stage 374: PR #2421 — fix(cache-tokens): surface provider prompt-cache read/write tokens in WebUI usage by @Michaelyklam (fixes #2419)
Co-authored-by: Michael Lam <michael@example.local>
2026-05-17 02:49:34 +00:00
Hermes Agent 026a9957f4 Stage 368: PR #2385 — Keep fuller compression snapshots reachable in sidebar by @franksong2702 2026-05-16 17:19:05 +00:00
Frank Song 4899ae17b9 Keep fuller compression snapshots reachable 2026-05-16 20:58:44 +08:00
Frank Song c415c843df Update interrupted recovery comment wording 2026-05-16 20:05:47 +08:00
Frank Song 49bea3ad01 Clarify interrupted turn recovery marker 2026-05-16 14:29:58 +08:00
Frank Song 40f69a2b75 Keep recovered pending turns in context 2026-05-16 04:07:02 +00:00
Hermes Agent 4826a31fbc Merge pull request #2285 into stage-359
fix: hide pre-compression snapshots from sidebar (dso2ng, refs #2230)

# Conflicts:
#	CHANGELOG.md
2026-05-15 14:55:19 +00:00
Dennis Soong bfccdc5c94 fix: hide pre-compression snapshots from sidebar 2026-05-15 11:20:17 +08:00
ai-ag2026 5110005324 fix: load CLI continuation session transcripts 2026-05-14 23:48:49 +02:00
Dennis Soong 143d9d8ef7 fix: reconcile stale sidebar display titles 2026-05-14 16:18:53 +08:00
starship-s 4084c3cf56 perf(sessions): cache CLI session scans 2026-05-12 11:24:29 -06:00
Frank Song 186453ea0e Add worktree-backed session creation 2026-05-11 12:12:40 +08:00
nesquena-hermes c624770c63 Stage 331: PR #2015 — fix(sessions): stitch continued session transcripts by @Jellypowered 2026-05-10 17:09:21 +00:00
Jellypowered 8aed650b4c Stitch continued session transcripts in WebUI 2026-05-10 11:10:54 -05:00
Frank Song 1bec8070f2 fix(1833): persist compression anchor summary for reload UI 2026-05-10 16:45:16 +08:00
Minimax 08c4ef8d88 feat: persistent composer draft — server-side, cross-client, survives refresh
- Session.composer_draft field: {text, files} stored in session JSON
- POST+GET /api/session/draft endpoint for save/load
- loadSession: save draft before switch, restore from S.session.composer_draft
- textarea input: debounced 400ms auto-save to server
- send(): clear draft after message is sent
- lockComposerForClarify(): save draft before card locks composer
- _restoreComposerDraft: clears textarea when target has no draft, guards
  against stale responses racing new session loads, exact text comparison
- Session.compact(): includes composer_draft in response
- Fix: use handler.command instead of parsed.method (ParseResult has no .method)

Co-authored-by: Minimax <noreply@minimax.io>
2026-05-09 13:47:57 +01:00
Frank Song 7e2709e281 fix: add request wedge diagnostics 2026-05-08 15:37:08 +00:00
ai-ag2026 755c18bdf9 fix: persist generated title refresh marker 2026-05-08 01:36:10 +02:00
Michael Lam 0bd65ef0bf fix: preserve CLI session tool metadata 2026-05-07 02:47:19 +00:00
ai-ag2026 8b34a79f02 fix: preserve imported session lineage visibility 2026-05-05 22:32:19 +02:00
Michael Lam c94ec31dec feat: show LLM Gateway routing metadata 2026-05-05 02:26:55 +00:00
Frank Song 8981d33543 Fix CLI session CI compatibility 2026-05-05 01:52:42 +00:00
Frank Song 79d0762d8c Filter low-value CLI agent sessions 2026-05-05 01:52:42 +00:00
Michael Lam e54a0470f0 Add Claude Code session imports 2026-05-05 01:18:34 +00:00
Michael Lam 876a670387 feat: add session save mode config 2026-05-04 14:05:49 -07:00
nesquena-hermes 040cb8af70 Apply Opus pre-release SHOULD-FIX + NITs (in-PR per release policy)
SHOULD-FIX: rate-limit _repair_stale_pending repair-firing telemetry. Switch
from unconditional logger.warning to age-keyed: WARNING when pending_age <
5min (the diagnostically valuable race window — actual leak-path candidates
that slipped past the grace guard) and DEBUG for the long-tail (orphaned
sidecars from prior process lifetimes). Prevents reconnect loops on stuck
sessions from flooding the log while preserving the diagnostic signal we
want for tuning _REPAIR_STALE_PENDING_GRACE_SECONDS empirically.
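
A minimal sketch of the age-keyed log level described in this paragraph, assuming a helper that receives the pending age in seconds. The function, constant, and logger names are hypothetical; only the 5-minute threshold and the WARNING/DEBUG split come from the text.

```python
import logging

WARNING_WINDOW_SECONDS = 5 * 60  # the "race window" from the commit message

logger = logging.getLogger("repair_sketch")  # hypothetical logger name


def repair_log_level(pending_age_seconds: float) -> int:
    """WARNING for fresh (likely race) repairs, DEBUG for long-tail orphans."""
    if pending_age_seconds < WARNING_WINDOW_SECONDS:
        return logging.WARNING
    return logging.DEBUG


def log_repair(session_id: str, pending_age_seconds: float) -> None:
    logger.log(
        repair_log_level(pending_age_seconds),
        "repaired stale pending turn for %s (age %.0fs)",
        session_id,
        pending_age_seconds,
    )
```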

NIT: _LOCAL_SERVER_PROVIDERS expanded with lm-studio (hyphenated alias used
in some custom_providers configs and already recognized at api/config.py:2189
for SSRF host trust) and localai (LocalAI project). Test parametrize expanded
from 7 to 11 names, also covering pre-existing koboldcpp and textgen for
symmetry. +4 regression tests.

NIT (docs): CHANGELOG callout for the RFC1918 behavior change. Internal-
network OpenAI-compatible proxies now preserve the model prefix on private-IP
base_urls. Documented the migration path: configure as a custom_providers
entry to bypass the local-server detection.

NIT (deferred, optional): narrowing the heuristic to is_loopback only is
left as future work; the broader scope was an explicit goal in the bug
body and Opus flagged it as SHOULD-DISCUSS-but-not-block.

4184 -> 4188 passing. 0 regressions. ~10 LOC absorbed total.
2026-05-04 16:50:22 +00:00
nesquena-hermes bea57beba9 fix(streaming): SSE heartbeat alignment, repair grace period, local-server model id preservation (#1623, #1624, #1625)
Closes #1623 — Lower SSE app heartbeat from 30s to 5s at every long-lived
handler (main agent, terminal, gateway-watcher, approval-poller, clarify-poller).
Kernel TCP keepalive declares peer dead at 25s worst-case (10s KEEPIDLE +
5s KEEPINTVL * 3 KEEPCNT, added v0.50.289 #1581). 30s app heartbeat let the
kernel tear sockets down on flaky networks before the app sent its first
keepalive byte — drops at ~10s during long thinking phases. New named
constant _SSE_HEARTBEAT_INTERVAL_SECONDS=5; regression test pins the
inequality (app_heartbeat * 2 <= kernel_window) so future tuning can't
re-introduce the misalignment.
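
The arithmetic behind the pinned inequality can be sketched as below. The helper names are hypothetical; the keepalive values (10s idle, 5s interval, 3 probes), the 5s heartbeat, and the `app_heartbeat * 2 <= kernel_window` rule are taken from the paragraph above.

```python
TCP_KEEPIDLE_SECONDS = 10
TCP_KEEPINTVL_SECONDS = 5
TCP_KEEPCNT = 3
SSE_HEARTBEAT_INTERVAL_SECONDS = 5


def kernel_keepalive_window(idle: int, interval: int, count: int) -> int:
    """Worst-case seconds before the kernel declares the peer dead."""
    return idle + interval * count


def heartbeat_is_aligned(app_heartbeat: int, kernel_window: int) -> bool:
    """The app must get at least two heartbeats in per kernel window."""
    return app_heartbeat * 2 <= kernel_window


window = kernel_keepalive_window(
    TCP_KEEPIDLE_SECONDS, TCP_KEEPINTVL_SECONDS, TCP_KEEPCNT
)
assert heartbeat_is_aligned(SSE_HEARTBEAT_INTERVAL_SECONDS, window)
```

Under these numbers the old 30s heartbeat fails the check, while 5s passes with room to spare.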

Closes #1624 — Add 30s grace period to _repair_stale_pending() trigger.
Without it, any narrow race between the streaming thread clearing
pending_user_message and STREAMS.pop(stream_id) produces a false-positive
'Previous turn did not complete.' marker on a turn that finished correctly
(reproducible after every command-approval turn). Defense-in-depth, not
the root-cause fix — the actual streaming-thread leak path is tracked
separately. Falsy pending_started_at (legacy sidecars) treated as
'old enough' so legitimate legacy-data recovery still works. Plus
logger.warning telemetry on every legitimate repair so the next batch of
user reports tells us whether the underlying race still fires.
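
One way to express the grace-period guard, as a sketch only: the function name and the `now` parameter are hypothetical, and whether the real check uses `>=` or `>` at exactly 30s is an assumption. The 30s value and the falsy-means-old-enough rule come from the paragraph above.

```python
import time
from typing import Optional

REPAIR_STALE_PENDING_GRACE_SECONDS = 30


def pending_is_old_enough(
    pending_started_at: Optional[float], now: Optional[float] = None
) -> bool:
    """Decide whether a pending turn may be repaired as stale."""
    if not pending_started_at:
        # Legacy sidecars carry no timestamp; treat them as old enough.
        return True
    now = time.time() if now is None else now
    return now - pending_started_at >= REPAIR_STALE_PENDING_GRACE_SECONDS
```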

Closes #1625 — Local model servers (LM Studio, Ollama, llama.cpp, vLLM,
TabbyAPI, koboldcpp, textgen-webui) now keep the full HuggingFace-style
model id (e.g. 'qwen/qwen3.6-27b' instead of stripped 'qwen3.6-27b'). New
_LOCAL_SERVER_PROVIDERS set + _base_url_points_at_local_server() loopback/
RFC1918 heuristic — either signal triggers no-strip. Backward compat
preserved for OpenAI-compatible proxies on public hosts (LiteLLM at
litellm.example.com still strips openai/gpt-5.4 -> gpt-5.4). Updated the
existing #230/#433 test to reflect that #1625 supersedes the strip-on-custom
rule for loopback hosts (see api/config.py and test_model_resolver.py
docstring update). Reported by @akarichan8231 in Discord on 2026-05-04.
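
A rough sketch of the loopback/RFC1918 heuristic and the no-strip rule, under stated assumptions: the function names, the contents of the provider set, and the treatment of the literal host `localhost` are illustrative guesses, not the code in api/config.py.

```python
import ipaddress
from urllib.parse import urlparse

# Illustrative subset; the real set's exact members are not shown in the log.
LOCAL_SERVER_PROVIDERS = {"lm-studio", "localai", "koboldcpp", "textgen"}

_RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def base_url_points_at_local_server(base_url: str) -> bool:
    host = urlparse(base_url).hostname
    if not host:
        return False
    if host == "localhost":  # assumption: treated as loopback
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name such as litellm.example.com
    if ip.is_loopback:
        return True
    return ip.version == 4 and any(ip in net for net in _RFC1918_NETWORKS)


def resolve_model_id(model_id: str, provider: str, base_url: str) -> str:
    """Keep the full id for local servers; otherwise strip the prefix."""
    if provider in LOCAL_SERVER_PROVIDERS or base_url_points_at_local_server(base_url):
        return model_id
    return model_id.split("/", 1)[1] if "/" in model_id else model_id
```

Either signal alone (provider name or base URL) is enough to preserve the prefix, matching the "either signal triggers no-strip" wording.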

42 regression tests across:
  tests/test_issue1623_sse_heartbeat_alignment.py (3)
  tests/test_issue1624_repair_stale_pending_grace.py (9)
  tests/test_issue1625_local_server_model_id_preservation.py (30)

4142 -> 4184 passing. 0 regressions.
2026-05-04 16:49:43 +00:00
Hermes Agent f3e066b53c chore(release): stamp v0.50.293 — 3-PR batch + 2 Opus follow-ups absorbed
Constituent PRs:
  #1627 by @franksong2702 — show Hermes Agent version (closes #1606)
  #1629 by @nesquena-hermes — profile isolation trio (closes #1611, #1612, #1614)
  #1630 by @Michaelyklam — provider config cleanup regression test (#1597 follow-up)

Opus advisor SHIP verdict + 2 SHOULD-FIX absorbed in-release:
- load_projects() re-reads from disk inside lock to close migration startup race
- _detect_agent_version() uses --dirty for symmetry with _detect_webui_version()

4142 → 4180 tests passing.
2026-05-04 16:33:57 +00:00
nesquena-hermes 6bc0f9c4d5 Apply Opus pre-release SHOULD-FIX + NITs (in-PR per release policy)
SHOULD-FIX #1 (renamed-root client cross-alias): drop strict-equality client
filter at static/sessions.js:1853. Server-side _profiles_match cross-aliases
'default'-tagged rows to a renamed root 'kinni'; the strict-equality client
would reject them, dropping every legacy session for renamed-root users. The
server is now solely authoritative for profile scoping.

SHOULD-FIX #2 (messaging-source dedupe ordering): _keep_latest_messaging_session_per_source
now runs AFTER the profile filter at api/routes.py:2078. Before, it ran on
the merged-cross-profile list with profile-blind keys, discarding the older
profile's row across profiles before the scope filter — leaving zero rows for
any messaging identity the active profile shared with another profile.
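
The ordering bug in SHOULD-FIX #2 can be reproduced with a toy example. The row shape (`profile`, `source`, `updated_at`) and both helper names are hypothetical stand-ins for the real functions in api/routes.py.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def keep_latest_per_source(rows: List[Row]) -> List[Row]:
    """Profile-blind dedupe: keep only the newest row per messaging source."""
    latest: Dict[str, Row] = {}
    for row in rows:
        key = row["source"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())


def filter_by_profile(rows: List[Row], profile: str) -> List[Row]:
    return [row for row in rows if row["profile"] == profile]


rows = [
    {"profile": "work", "source": "telegram:42", "updated_at": 100},
    {"profile": "home", "source": "telegram:42", "updated_at": 200},
]

# Old order: the newer "home" row wins the dedupe, then the filter drops it.
dedupe_first = filter_by_profile(keep_latest_per_source(rows), "work")
# Fixed order: filter to the active profile first, then dedupe.
filter_first = keep_latest_per_source(filter_by_profile(rows, "work"))
```

Here `dedupe_first` ends up empty while `filter_first` keeps the "work" row, which is the zero-rows symptom the paragraph describes.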

NIT #3: _projects_migrated flag now set only AFTER successful save_projects.
NIT #4: cleaned dead test code in test_is_root_profile_invalidation_drops_stale.
NIT #5: _create_profile_fallback's clone_from=='default' literal now routes
through _is_root_profile() for parity with the 5 other callsites.

+2 regression tests pin the SHOULD-FIX shapes:
- test_keep_latest_messaging_runs_after_profile_filter (source-string ordering)
- test_static_sessions_js_trusts_server_profile_scoping (no client re-filter)

4173 -> 4175 tests pass. 0 regressions.
2026-05-04 16:17:26 +00:00
nesquena-hermes e8862632ed fix(profiles): scope sessions, projects, and root-profile resolution to active profile (#1611, #1612, #1614)
Closes #1611 — /api/sessions filters by active profile by default; ?all_profiles=1
opt-in for aggregate views; new _profiles_match() helper honours renamed-root
cross-aliasing; static/sessions.js drops the s.is_cli_session bypass; toggle-on
re-fetches with all_profiles=1 instead of slicing client-cached rows.

Closes #1612 — new _is_root_profile() central helper consults list_profiles_api()
for is_default=True matches alongside the legacy 'default' alias. Replaces five
literal-default callsites in api/profiles.py. Memoized with explicit invalidation
hooks at create + delete. Sticky active_profile file write now stores '' for
renamed root, consistent with the legacy empty==root contract.

Closes #1614 — projects carry a profile field stamped at create-time;
/api/projects filters by active profile; /api/projects/{create,rename,delete}
and /api/session/move reject ops on cross-profile projects with 404; new
_PROJECTS_MIGRATION migration in load_projects() back-tags untagged projects
from any session that uses them, falling back to 'default'; ensure_cron_project
keys lookup by (name, profile) so each profile gets its own Cron Jobs project.

31 regression tests (9+11+11) pin the renamed-root resolution, server-side
profile scoping shape, helper invariants, cross-alias matching, migration
behavior, and active-profile guards on every project mutation endpoint.
4148 tests pass.

Reporter: @stefanpieter

Co-authored-by: stefanpieter <noreply@github.com>
2026-05-04 16:03:05 +00:00
Michael Lam 9ed0639319 fix: show first-turn chats in sidebar immediately 2026-05-03 20:10:05 -07:00
Hermes Bot da3932a7ef fix(stage-284): absorb Opus advisor SHOULD-FIX items (5+6 LOC)
Both flagged by pre-release Opus advisor; both clearly defensive and small
enough to absorb in-release per the reviewer-flagged-fix-in-release-not-followup
policy.

SHOULD-FIX #1 (api/routes.py:_clear_stale_stream_state, ~25 LOC):
After the metadata-only reload (#1559 Layer 2), the local 'session'
variable is reassigned to the full-load object but the caller still holds
the original metadata-only stub. /api/session then returns the stale
active_stream_id at routes.py:1791, causing the frontend to attempt one
ghost SSE reconnect before recovering. Fix: capture original_stub at
function entry, then patch its in-memory active_stream_id and pending_*
fields to None after both the early-return (full-load already cleared)
path AND the successful-mutation path. Now the caller's read returns
fresh state, no ghost reconnect.

SHOULD-FIX #2 (api/models.py:Session.save, ~20 LOC):
The .bak write at api/models.py:436 used write_text() which truncates-
then-writes — a crash mid-write or concurrent backup-producing save
could leave a torn .bak. Recovery defends correctly (JSONDecodeError →
returns -1 → 'no_action'), so the failure mode was 'backup lost' not
'spurious restore'. Fix: tmp + os.replace pattern matching the main file
write at lines 446-453. Now backup either lands cleanly or doesn't land
at all.
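
The tmp + os.replace pattern mentioned here generally looks like the sketch below. The function name and the explicit fsync are assumptions for illustration; the commit message only states that the backup write now mirrors the main file write.

```python
import os
import tempfile
from pathlib import Path


def write_text_atomically(path: Path, text: str) -> None:
    """Write text to a temp file in the same directory, then swap it in."""
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(text)
            fh.flush()
            os.fsync(fh.fileno())  # assumption: durability before the swap
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Leave no partial temp file behind if anything failed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Readers then see either the previous backup or the complete new one, never a truncated file.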

4026/4026 tests pass post-absorb.
2026-05-03 20:41:00 +00:00