Replace the hardcoded Skyly cancellation wording with the configured bot_name from settings, falling back to Hermes when unset.
Keep the client-side fallback in sync by using window._botName if the session refresh after cancellation fails.
Co-authored-by: Obryn 🐉 <obryn-ai@dotbeeps.dev>
Opus stage-360 review caught that the docstring at api/streaming.py:40-43
said 'around the entire agent run' which is no longer accurate after the
narrow-lock refactor. The lock is now held only briefly for the env-mutation
critical section; the agent runs outside the lock and the finally block
re-acquires to atomically restore env vars.
Docstring now points to both narrow-lock implementations as references:
- _run_agent_streaming at line ~2719 (the original pattern)
- profile_env_for_background_worker at api/profiles.py:715 (added stage-360)
#2299 introduced profile_env_for_background_worker() in api/profiles.py and
changed _ENV_LOCK from threading.Lock() to threading.RLock(). Both changes
were incorrect:
1. RLock masked rather than fixed the underlying deadlock. The QA
test_env_lock_is_non_reentrant test exists precisely to enforce
non-reentrance — RLock would let a single thread hold _ENV_LOCK across
nested critical sections, which hides bugs while still allowing
different-thread races.
2. The original context manager held _ENV_LOCK for the ENTIRE 'yield'
duration, meaning the lock was held for the full background worker's
runtime (title generation, compression, update summary — possibly
many seconds). That blocked ALL other sessions on _ENV_LOCK, which
the QA test_third_message_completes runtime test caught as a timeout
on the third sequential message.
Fix: mirror the narrow-lock pattern from _run_agent_streaming:
- Acquire _ENV_LOCK only for env mutation (set runtime_env + patch
skill modules)
- Release immediately, yield to worker (no lock held)
- Reacquire in finally to restore env + skill modules
Restored _ENV_LOCK back to threading.Lock(). All 20 QA tests now pass,
including test_third_message_completes (was timing out, now 35s).
The FakeAgent in test_issue1857_usage_overwrite returned only 2 messages
(user + assistant) without the conversation history. The real agent always
returns the full history plus new messages. This mismatch caused the new
_has_new_assistant_reply helper (which checks only messages beyond the
pre-turn offset) to see len(result)==len(prev) and incorrectly flag the
turn as a silent failure.
Fix: prepend conversation_history to the FakeAgent's response so the
message list mirrors production behavior.
When a provider error (401/429/rate-limit) causes the agent to return
without producing a new assistant reply, the WebUI should emit an
apperror event so the user sees an inline error. However, the detection
logic scanned ALL messages in result['messages'] — which includes the
full conversation history. If any prior turn had an assistant response,
_assistant_added would be True and the apperror would be silently
skipped, leaving the user staring at a blank response.
Extract a helper _has_new_assistant_reply(all_messages, prev_count)
that only inspects messages beyond the pre-turn history offset. Apply
it to both the main detection path and the self-heal/retry path.
Tests: 15 new cases covering history masking, empty content, whitespace,
edge-case shrinks, and multi-assistant scenarios.
Opus identified that PR #2227's preservation block had two related bugs in
the parent_session_id handling:
1. During preservation save: code did
_old_parent = s.parent_session_id
s.parent_session_id = None
s.save(touch_updated_at=False, skip_index=True)
s.parent_session_id = _old_parent
The save persisted parent=None to disk. The in-memory restoration didn't
reach the disk copy. Result: a /branch fork session that subsequently
compressed lost its 'Forked from X' badge on the preserved old snapshot.
2. Stamping the continuation: code did
if not s.parent_session_id:
s.parent_session_id = old_sid
The 'if not' guard skipped the stamp when the session already had a
parent_session_id from a prior fork. Result: fork-of-fork compression
broke lineage — the continuation jumped back to the original fork parent
instead of the just-preserved immediate predecessor snapshot.
Fix (matches Opus's recommendation):
- Remove the parent clearing during preservation save (preserve as-is)
- Drop the 'if not' guard; always stamp continuation to old_sid
This makes the lineage chain consistent: new → old → old.parent → ... root.
Traversal from the continuation always walks through the just-preserved
snapshot to get to its parent's parent, never jumping over the snapshot.
Two new regression tests pin both invariants:
- test_parent_session_id_stamped_unconditionally (no 'if not' guard)
- test_old_session_parent_preserved_during_archive_save (no parent=None)
Both pass against the fix. All 8 tests in the file pass.
The previous implementation renamed old_sid.json → new_sid.json during
context compression, destroying the only persistent copy of the full
conversation history. If the summarisation LLM call also failed, the
user was left with zero recoverable messages.
Fix:
- Remove the destructive old_path.rename(new_path) call
- Preserve old_sid.json as an immutable pre-compression archive
- Create new_sid.json as a fresh file via s.save()
- Set parent_session_id on the continuation session for lineage
- Save in-memory messages to old_sid.json if they're newer than disk
Test: test_issue2223_compression_no_rename.py (6 tests, all passing)
Refs #2215 Fix B: remove the mid-response stripping hazard without losing leading multi-line wrapper cleanup.
The pattern now strips only a leading 'the user is asking' wrapper line and preserves the visible answer that follows. Add regression coverage for both the leading-wrapper and mid-response prose cases.
Opus flagged that PR #2151's cancel-handler partial-dedup loop used a
substring check that was too broad: any short prior assistant reply
('OK', 'Here is the answer:') would dedup a longer new partial containing
it, silently dropping the partial and resurrecting the #893 data-loss bug.
Tightened to only dedup against actual prior _partial=True markers with
exact (whitespace-stripped) content match. Three new regression tests
added (short-non-partial-prefix-does-not-dedup, exact-partial-match-still-
dedups, same-content-non-partial-does-not-dedup).
10/10 partial-cancel tests pass after the fix. Also updated CHANGELOG with
the conflict-resolution notes for #2151 vs #2136 and the #2178 test-fix.
fix: clarify cancelled chat turn status (Jordan-SkyLF)
Conflict resolution on api/streaming.py:4549-4567 (the cancel-handler
ownership guard). Both this PR and the already-shipped PR #2136 add a
guard at the same site against stale stream writebacks, from different
angles:
- PR #2136 (HEAD): _stream_writeback_is_current(_cs, stream_id) — strictly
dominates by checking the active_stream_id token equality.
- PR #2151: 'worker won the race' check via (active_stream_id != stream_id
and not pending_user_message), with _emit_cancel_event = False to suppress
the terminal cancel event.
Resolution merges both: keep #2136's strictly-stronger condition for skip
detection, and adopt #2151's _emit_cancel_event = False semantic so the
cancel event isn't emitted in addition to skipping the writeback (when
client may have already received the successful done payload).
55/55 tests pass across cancelled-turn-status + stale-stream-writeback +
the four cancel/data-loss sibling test files.
Providers like Xiaomi MiMo, DeepSeek, and Kimi require reasoning_content
to be echoed back on every assistant message in multi-turn conversations
with tool calls. Omitting it causes HTTP 400: 'The reasoning_content in
the thinking mode must be passed back to the API.'
The WebUI's _sanitize_messages_for_api() strips all fields not in
_API_SAFE_MSG_KEYS before sending conversation history to the LLM API.
reasoning_content was not in this whitelist, so it was silently dropped.
The CLI path (run_agent.py) is unaffected because it has its own
_copy_reasoning_content_for_api() logic that operates on raw message
dicts without going through this filter. This is why the same session
works from CLI but fails from WebUI with HTTP 400.
The fix adds 'reasoning_content' to _API_SAFE_MSG_KEYS so the field
passes through sanitization intact.
fix: guard stale stream writebacks (LumenYoung)
Prevents stale WebUI stream workers from writing old results into a session
after that session has already moved on to another stream. Adds new helper
_stream_writeback_is_current() (a token equality check against the session's
active_stream_id) and short-circuits the two finalize/cancel paths when the
worker no longer owns the session writeback.
Opus advisor pass on stage-341 found three surgical items:
1. static/i18n.js:it — PR #2064 branched before stage-340 landed the 'it'
locale (#2067), missing 9 session_*worktree* keys. Mechanical mirror of
en/ja position. Italian falls back to English silently without this fix.
2. api/streaming.py — PR #2107's new break short-circuit was silent in both
the aux and agent title-generation paths. Added logger.debug calls before
each break so production logs surface the exit shape.
3. api/streaming.py — Expanded _title_should_skip_remaining_attempts docstring
to document the membership criterion explicitly (vs the implicit
reasoning-only-burn case it ships with today). Future additions
(llm_safety_blocked, llm_oauth_quota) have a clear inclusion test.
CHANGELOG updated under the Stage-341 maintainer fixes section to mirror
the stage-340 pattern. All targeted tests pass (57/57 in the affected
modules).
Reasoning models (Qwen3-thinking via LM Studio, DeepSeek-R1, Kimi-K2,
etc.) can burn their entire output budget on hidden reasoning tokens and
emit no visible content. The previous title-generation retry path
classified that as llm_length and doubled the budget — but the second
call produces the same shape, so the retry only doubled the GPU/credit
burn. Repeated across the two prompts in _title_prompts() this came to
~3000 reasoning tokens of GPU work per new chat. On local LM Studio
servers behind a custom: provider (where is_lmstudio=False means
reasoning_effort: none never reaches the model) it manifested as the GPU
never going idle after a prompt.
Fix:
- _extract_title_response: classify reasoning-bearing empty responses
as llm_empty_reasoning regardless of finish_reason. The presence of
reasoning_content is the diagnostic signal, not finish_reason.
- _title_retry_status: drop llm_empty_reasoning from the retry set.
Length-truncated responses WITHOUT reasoning still retry (those are
legitimately recoverable by a larger budget).
- Add _title_should_skip_remaining_attempts() and break out of the
prompt-iteration loop on empty-reasoning. A second prompt against
the same model would produce the same shape.
- Falls through to _fallback_title_from_exchange for a local-summary
title.
Tests updated to invert the previous reasoning-retry assertions:
- test_aux_short_circuits_on_empty_reasoning_without_retrying
- test_aux_still_retries_finish_length_without_reasoning
- test_agent_route_short_circuits_on_empty_reasoning_without_retrying
- test_agent_route_still_retries_finish_length_without_reasoning
Companion agent-side work (LM Studio classifier for custom: providers)
is tracked separately on the hermes-agent side; this WebUI fix is the
belt-and-braces guard so the loop stops regardless of agent classifier
state.
Reported by @darkopetrovic. Closes#2083.
Co-authored-by: darkopetrovic <darkopetrovic@users.noreply.github.com>
(cherry picked from commit efeae4a86e)