From fe6558efac77f0fc86a44ed3456fed133a3ec7f0 Mon Sep 17 00:00:00 2001
From: nesquena-hermes <[email protected]>
Date: Mon, 25 May 2026 00:10:39 +0000
Subject: [PATCH] Stage-batch14: drop pr-artifacts/ scratchpad from #2685 cherry-picks

The contributor used pr-artifacts/ as a working scratchpad during PR
development. The real test count and failure-mode docs are already
covered by inline test comments and CHANGELOG entries; this directory
adds nothing for upstream readers.
---
 pr-artifacts/context-replay-failure-cases.md  | 160 ------------------
 pr-artifacts/pr-body-summary-replay-dedupe.md |  83 ---------
 2 files changed, 243 deletions(-)
 delete mode 100644 pr-artifacts/context-replay-failure-cases.md
 delete mode 100644 pr-artifacts/pr-body-summary-replay-dedupe.md

diff --git a/pr-artifacts/context-replay-failure-cases.md b/pr-artifacts/context-replay-failure-cases.md
deleted file mode 100644
index 46a9a72c..00000000
--- a/pr-artifacts/context-replay-failure-cases.md
+++ /dev/null
@@ -1,160 +0,0 @@
-# Context replay / progress-ring failure cases
-
-This artifact records concrete failure cases observed while debugging WebUI + LCM context replay. It is intended to support an upstream PR with reproducible rationale.
-
-## A. Compression continuation replays active tail into display/context
-
-Observed shape after compression/continuation:
-
-```text
-previous_context = [summary, A, B, C]
-result_messages = previous_context + [A, B, C, D]
-old saved context/display = [summary, A, B, C, A, B, C, D]
-expected = [summary, A, B, C, D]
-```
-
-Code-level cause: writeback assumed `result_messages[len(previous_context):]` was all new delta. After LCM/session rollover, the agent may replay the active tail after the compacted prefix, so this assumption is false.
-
-Regression target: strip candidate prefixes that are already suffixes of existing context/display.
-
-## B. Near-duplicate large Session Arc Summary cards
-
-Observed in WebUI sidecar display transcript for sessions in the `20260520_200424_a43cef` / `20260520_201320_a95eac` lineage:
-
-```text
-[Session Arc Summary (d1, node 39)] ... 62k chars
-[Session Arc Summary (d1, node 39)] ... 62k chars
-[Session Arc Summary (d1, node 39)] ... 74k chars
-```
-
-These summaries shared thousands of identical prefix characters but differed in tails/expand hints. Exact identity checks missed them. One duplicated ~80k char summary explains a ~20k-token jump.
-
-Regression target: treat large `[Session Arc Summary ...]` messages with the same long prefix as replayed summary artifacts.
-
-## C. Non-adjacent replay blocks separated by markers/summaries
-
-Observed display transcript contained repeated blocks that were not immediately adjacent, e.g. best block lengths around 171 messages in historical `messages`:
-
-```text
-A B C ... [compression marker / summary / unrelated rows] ... A B C
-```
-
-Adjacent-only dedupe falsely reported clean. This matters because LCM/continuation can insert compression cards, cron banners, or summary messages between original block and replayed tail.
-
-Regression target: detect and strip replayed non-adjacent blocks when appending model context candidates.
-
-## D. Non-streaming `/api/chat` writeback missed dedupe
-
-Observed session: `20260521_060755_294aed`.
-
-User asked a short question with no meaningful new tool usage:
-
-```text
-这是一个内部服务对么?简答
-```
-
-Before cleanup:
-
-```text
-context_messages: 136
-best replay: 67 messages repeated from index 0 at index 67
-last_prompt_tokens: 136668 (~53.4% of 256k)
-```
-
-Expected shape was:
-
-```text
-previous_context(67) + new_user + new_assistant = 69 messages
-```
-
-Actual cause: streaming writeback used `_dedupe_replayed_context_messages`, but synchronous `/api/chat` wrote `_restore_reasoning_metadata(previous_context, result_messages)` directly to `s.context_messages`.
-
-Regression target: both streaming and non-streaming writeback paths must use the same replay-dedupe guard.
-
-## E. Runtime/progress-ring jump: clean persisted context, polluted turn-start reconciliation
-
-Observed session: `20260521_060755_294aed` after cleanup and deployment.
-
-Persisted sidecar after pause/cancel:
-
-```text
-context_messages: 69
-context chars: 199,972
-rough content tokens: ~49,993
-last_prompt_tokens: 86,723 (~33.9%)
-```
-
-But starting/continuing a streaming turn made the progress ring jump to ~55%. Simulating the turn-start code path showed:
-
-```text
-ctx_before_agent: 154 messages
-chars: 448,438
-rough content tokens: ~112,109
-```
-
-After applying existing final-writeback dedupe to that runtime prompt:
-
-```text
-after_current_dedupe: 85 messages
-chars: 248,466
-rough content tokens: ~62,116
-```
-
-So the ring was not randomly wrong: it reflected a polluted runtime prompt estimate. The persisted sidecar stayed clean because final writeback/cancel did not save the runtime replay.
-
-Code-level cause: streaming turn start does:
-
-```python
-_previous_context_messages = _new_turn_context_from_messages(
-    reconciled_state_db_messages_for_session(
-        s,
-        prefer_context=True,
-        state_messages=_external_state_messages,
-    ),
-    msg_text,
-)
-```
-
-When `prefer_context=True`, sidecar `context_messages` are clean, but `state.db` still contains mirrored/replayed transcript rows. `reconciled_state_db_messages_for_session` append-only merges `context_messages + whole state transcript`, so the agent/runtime prompt temporarily receives old transcript rows again.
-
-Regression target: when `prefer_context=True` and sidecar `context_messages` exists, reconciliation must return:
-
-```text
-clean sidecar context + truly newer state.db delta
-```
-
-not:
-
-```text
-clean sidecar context + full state.db transcript
-```
-
-## PR thesis
-
-The bug family is not a model behavior issue. It is a WebUI persistence/reconciliation invariant violation:
-
-> Model-facing context is append-only, but append candidates may contain replayed context due to LCM/session continuation/state-db mirroring. Every boundary that merges result/state messages into model context must strip replayed prefixes/blocks and must distinguish clean sidecar context from full display/state transcripts.
-
-Key invariant for upstream:
-
-```text
-If context_messages exists, it is the authoritative model-facing prefix.
-State/db/display histories may be fuller/noisier and should only contribute messages that are demonstrably newer than that prefix.
-```
-
-
-## F. Live metering over-counts large in-flight tool results after cancel/retry
-
-Observed after deploying the reconciliation fix and retrying `continue` in session `20260521_060755_294aed`:
-
-```text
-persisted sidecar context: 69 messages, ~49,993 rough content tokens
-state.db messages: 90 messages
-state delta after context: 21 messages, ~21,226 rough content tokens
-turn-start context after fix: 90 messages, ~71,219 rough content tokens
-last_prompt_tokens persisted: 86,723 (~33.9%)
-```
-
-The previous full-transcript turn-start replay was gone, but the ring still jumped during the run. The new jump came from live metering, not persisted context reconciliation. The run executed several large `read_file` tool calls (5k / 17k / 13k chars). `_record_live_tool_complete()` fed each bounded preview into `_bump_live_prompt_estimate()`, which added the full rough tool-result tokens to `last_prompt_tokens` before any exact next-prompt accounting was available. Repeated cancel/retry makes this look like context replay even when final sidecar context remains clean.
-
-Regression target: live tool metering should be a conservative UI hint and must not inflate `last_prompt_tokens` by the full content of large in-flight tool results. Exact provider/compressor prompt accounting should still win when available.
diff --git a/pr-artifacts/pr-body-summary-replay-dedupe.md b/pr-artifacts/pr-body-summary-replay-dedupe.md
deleted file mode 100644
index 56a44007..00000000
--- a/pr-artifacts/pr-body-summary-replay-dedupe.md
+++ /dev/null
@@ -1,83 +0,0 @@
-## Summary
-
-Follow-up to #2651. That PR fixed one replay boundary, but continued testing exposed the same context-invariant violation at additional WebUI merge/metering boundaries.
-
-This PR makes the replay protection context-engine agnostic:
-
-- strips replayed non-adjacent context blocks and near-duplicate large Session Arc Summary cards before writing model context
-- applies the same replay guard to the non-streaming `/api/chat` writeback path
-- treats `context_messages` as the authoritative model-facing prefix when reconciling sidecar state with `state.db`, appending only demonstrably new state rows
-- caps live tool-result prompt estimates so the context ring does not treat large in-flight tool outputs as exact prompt growth
-
-LCM/continuation made these failures easy to reproduce, but the invariant is broader than LCM:
-
-> If `context_messages` exists, it is the authoritative model-facing prefix. `messages`/`state.db` may be fuller or noisier histories and should only contribute true deltas. Live usage estimates must not override exact prompt accounting.
-
-## Why this happens
-
-WebUI currently merges several histories that have different meanings:
-
-- `context_messages`: compact model-facing context for the next call
-- `messages`: visible display transcript
-- `state.db`: append-only runtime/session journal, including tool rows
-
-After compression/continuation, those sources can overlap. The old code sometimes treated append candidates as wholly new:
-
-```text
-clean context_messages + whole state.db transcript
-```
-
-or:
-
-```text
-previous_context + replayed_tail + new_delta
-```
-
-That reintroduced old summaries, tool rows, or active-tail messages into the next model context or into the live usage estimate.
-
-## Failure cases covered
-
-The detailed debugging artifact is in `pr-artifacts/context-replay-failure-cases.md`. The key cases are:
-
-1. **Compression continuation replays the active tail**
-   - `result_messages` can contain `previous_context + replayed_tail + new_delta`.
-   - Prefix slicing alone saves the replayed tail again.
-
-2. **Near-duplicate large Session Arc Summary cards**
-   - Large `[Session Arc Summary ...]` messages can share a huge prefix while differing in refreshed tails/hints.
-   - Exact-match dedupe misses them.
-
-3. **Non-adjacent replay blocks**
-   - Replayed blocks can be separated by compression markers/summaries/tool rows, so adjacent-only dedupe is insufficient.
-
-4. **Non-streaming `/api/chat` writeback missed the replay guard**
-   - The streaming path deduped context writeback; synchronous chat restored reasoning metadata and saved directly.
-
-5. **Turn-start state reconciliation polluted a clean sidecar context**
-   - With `prefer_context=True`, a clean sidecar context could still be followed by mirrored `state.db` transcript rows.
-   - The next runtime prompt grew even though persisted `context_messages` stayed compact.
-
-6. **Live metering over-counted large in-flight tool results**
-   - Tool callbacks can arrive before exact next-prompt accounting.
-   - The old live estimate added full rough tool-result tokens to `last_prompt_tokens`, causing context-ring jumps that disappeared after cancel/persisted refresh.
-
-## Implementation notes
-
-- `_dedupe_replayed_context_messages(...)` now handles non-adjacent replay blocks and large near-duplicate summary cards.
-- `/api/chat` writeback calls the same context replay guard as streaming writeback.
-- `state_db_delta_after_context(...)` uses `context_messages` as the authoritative prefix and only returns state rows after the last state row already represented by sidecar context.
-- `_bounded_live_tool_prompt_delta(...)` bounds live-only tool estimate growth while preserving exact compressor/provider prompt accounting when available.
-
-## Test plan
-
-```bash
-python -m pytest -q tests/test_streaming_live_usage_estimate.py tests/test_issue1217_transcript_compaction.py tests/test_session_save_mode.py
-git diff --check
-python -m compileall -q api/models.py api/streaming.py api/routes.py
-```
-
-Current local result:
-
-```text
-45 passed
-```