Stage-batch14: drop pr-artifacts/ scratchpad from #2685 cherry-picks

The contributor used pr-artifacts/ as a working scratchpad during PR
development. The real test count and failure-mode docs are already
covered by inline test comments and CHANGELOG entries; this directory
adds nothing for upstream readers.
This commit is contained in:
nesquena-hermes
2026-05-25 00:10:39 +00:00
parent d0992730a9
commit fe6558efac
2 changed files with 0 additions and 243 deletions
@@ -1,160 +0,0 @@
# Context replay / progress-ring failure cases
This artifact records concrete failure cases observed while debugging WebUI + LCM context replay. It is intended to support an upstream PR with reproducible rationale.
## A. Compression continuation replays active tail into display/context
Observed shape after compression/continuation:
```text
previous_context = [summary, A, B, C]
result_messages = previous_context + [A, B, C, D]
old saved context/display = [summary, A, B, C, A, B, C, D]
expected = [summary, A, B, C, D]
```
Code-level cause: writeback assumed `result_messages[len(previous_context):]` was all new delta. After LCM/session rollover, the agent may replay the active tail after the compacted prefix, so this assumption is false.
Regression target: strip candidate prefixes that are already suffixes of existing context/display.
## B. Near-duplicate large Session Arc Summary cards
Observed in WebUI sidecar display transcript for sessions in the `20260520_200424_a43cef` / `20260520_201320_a95eac` lineage:
```text
[Session Arc Summary (d1, node 39)] ... 62k chars
[Session Arc Summary (d1, node 39)] ... 62k chars
[Session Arc Summary (d1, node 39)] ... 74k chars
```
These summaries shared thousands of identical prefix characters but differed in tails/expand hints. Exact identity checks missed them. One duplicated ~80k char summary explains a ~20k-token jump.
Regression target: treat large `[Session Arc Summary ...]` messages with the same long prefix as replayed summary artifacts.
## C. Non-adjacent replay blocks separated by markers/summaries
Observed display transcript contained repeated blocks that were not immediately adjacent, e.g. best block lengths around 171 messages in historical `messages`:
```text
A B C ... [compression marker / summary / unrelated rows] ... A B C
```
Adjacent-only dedupe falsely reported clean. This matters because LCM/continuation can insert compression cards, cron banners, or summary messages between original block and replayed tail.
Regression target: detect and strip replayed non-adjacent blocks when appending model context candidates.
## D. Non-streaming `/api/chat` writeback missed dedupe
Observed session: `20260521_060755_294aed`.
User asked a short question with no meaningful new tool usage:
```text
这是一个内部服务对么?简答
```
Before cleanup:
```text
context_messages: 136
best replay: 67 messages repeated from index 0 at index 67
last_prompt_tokens: 136668 (~53.4% of 256k)
```
Expected shape was:
```text
previous_context(67) + new_user + new_assistant = 69 messages
```
Actual cause: streaming writeback used `_dedupe_replayed_context_messages`, but synchronous `/api/chat` wrote `_restore_reasoning_metadata(previous_context, result_messages)` directly to `s.context_messages`.
Regression target: both streaming and non-streaming writeback paths must use the same replay-dedupe guard.
## E. Runtime/progress-ring jump: clean persisted context, polluted turn-start reconciliation
Observed session: `20260521_060755_294aed` after cleanup and deployment.
Persisted sidecar after pause/cancel:
```text
context_messages: 69
context chars: 199,972
rough content tokens: ~49,993
last_prompt_tokens: 86,723 (~33.9%)
```
But starting/continuing a streaming turn made the progress ring jump to ~55%. Simulating the turn-start code path showed:
```text
ctx_before_agent: 154 messages
chars: 448,438
rough content tokens: ~112,109
```
After applying existing final-writeback dedupe to that runtime prompt:
```text
after_current_dedupe: 85 messages
chars: 248,466
rough content tokens: ~62,116
```
So the ring was not randomly wrong: it reflected a polluted runtime prompt estimate. The persisted sidecar stayed clean because final writeback/cancel did not save the runtime replay.
Code-level cause: streaming turn start does:
```python
_previous_context_messages = _new_turn_context_from_messages(
reconciled_state_db_messages_for_session(
s,
prefer_context=True,
state_messages=_external_state_messages,
),
msg_text,
)
```
When `prefer_context=True`, sidecar `context_messages` are clean, but `state.db` still contains mirrored/replayed transcript rows. `reconciled_state_db_messages_for_session` append-only merges `context_messages + whole state transcript`, so the agent/runtime prompt temporarily receives old transcript rows again.
Regression target: when `prefer_context=True` and sidecar `context_messages` exists, reconciliation must return:
```text
clean sidecar context + truly newer state.db delta
```
not:
```text
clean sidecar context + full state.db transcript
```
## PR thesis
The bug family is not a model behavior issue. It is a WebUI persistence/reconciliation invariant violation:
> Model-facing context is append-only, but append candidates may contain replayed context due to LCM/session continuation/state-db mirroring. Every boundary that merges result/state messages into model context must strip replayed prefixes/blocks and must distinguish clean sidecar context from full display/state transcripts.
Key invariant for upstream:
```text
If context_messages exists, it is the authoritative model-facing prefix.
State/db/display histories may be fuller/noisier and should only contribute messages that are demonstrably newer than that prefix.
```
## F. Live metering over-counts large in-flight tool results after cancel/retry
Observed after deploying the reconciliation fix and retrying `continue` in session `20260521_060755_294aed`:
```text
persisted sidecar context: 69 messages, ~49,993 rough content tokens
state.db messages: 90 messages
state delta after context: 21 messages, ~21,226 rough content tokens
turn-start context after fix: 90 messages, ~71,219 rough content tokens
last_prompt_tokens persisted: 86,723 (~33.9%)
```
The previous full-transcript turn-start replay was gone, but the ring still jumped during the run. The new jump came from live metering, not persisted context reconciliation. The run executed several large `read_file` tool calls (5k / 17k / 13k chars). `_record_live_tool_complete()` fed each bounded preview into `_bump_live_prompt_estimate()`, which added the full rough tool-result tokens to `last_prompt_tokens` before any exact next-prompt accounting was available. Repeated cancel/retry makes this look like context replay even when final sidecar context remains clean.
Regression target: live tool metering should be a conservative UI hint and must not inflate `last_prompt_tokens` by the full content of large in-flight tool results. Exact provider/compressor prompt accounting should still win when available.
@@ -1,83 +0,0 @@
## Summary
Follow-up to #2651. That PR fixed one replay boundary, but continued testing exposed the same context-invariant violation at additional WebUI merge/metering boundaries.
This PR makes the replay protection context-engine agnostic:
- strips replayed non-adjacent context blocks and near-duplicate large Session Arc Summary cards before writing model context
- applies the same replay guard to the non-streaming `/api/chat` writeback path
- treats `context_messages` as the authoritative model-facing prefix when reconciling sidecar state with `state.db`, appending only demonstrably new state rows
- caps live tool-result prompt estimates so the context ring does not treat large in-flight tool outputs as exact prompt growth
LCM/continuation made these failures easy to reproduce, but the invariant is broader than LCM:
> If `context_messages` exists, it is the authoritative model-facing prefix. `messages`/`state.db` may be fuller or noisier histories and should only contribute true deltas. Live usage estimates must not override exact prompt accounting.
## Why this happens
WebUI currently merges several histories that have different meanings:
- `context_messages`: compact model-facing context for the next call
- `messages`: visible display transcript
- `state.db`: append-only runtime/session journal, including tool rows
After compression/continuation, those sources can overlap. The old code sometimes treated append candidates as wholly new:
```text
clean context_messages + whole state.db transcript
```
or:
```text
previous_context + replayed_tail + new_delta
```
That reintroduced old summaries, tool rows, or active-tail messages into the next model context or into the live usage estimate.
## Failure cases covered
The detailed debugging artifact is in `pr-artifacts/context-replay-failure-cases.md`. The key cases are:
1. **Compression continuation replays the active tail**
- `result_messages` can contain `previous_context + replayed_tail + new_delta`.
- Prefix slicing alone saves the replayed tail again.
2. **Near-duplicate large Session Arc Summary cards**
- Large `[Session Arc Summary ...]` messages can share a huge prefix while differing in refreshed tails/hints.
- Exact-match dedupe misses them.
3. **Non-adjacent replay blocks**
- Replayed blocks can be separated by compression markers/summaries/tool rows, so adjacent-only dedupe is insufficient.
4. **Non-streaming `/api/chat` writeback missed the replay guard**
- The streaming path deduped context writeback; synchronous chat restored reasoning metadata and saved directly.
5. **Turn-start state reconciliation polluted a clean sidecar context**
- With `prefer_context=True`, a clean sidecar context could still be followed by mirrored `state.db` transcript rows.
- The next runtime prompt grew even though persisted `context_messages` stayed compact.
6. **Live metering over-counted large in-flight tool results**
- Tool callbacks can arrive before exact next-prompt accounting.
- The old live estimate added full rough tool-result tokens to `last_prompt_tokens`, causing context-ring jumps that disappeared after cancel/persisted refresh.
## Implementation notes
- `_dedupe_replayed_context_messages(...)` now handles non-adjacent replay blocks and large near-duplicate summary cards.
- `/api/chat` writeback calls the same context replay guard as streaming writeback.
- `state_db_delta_after_context(...)` uses `context_messages` as the authoritative prefix and only returns state rows after the last state row already represented by sidecar context.
- `_bounded_live_tool_prompt_delta(...)` bounds live-only tool estimate growth while preserving exact compressor/provider prompt accounting when available.
## Test plan
```bash
python -m pytest -q tests/test_streaming_live_usage_estimate.py tests/test_issue1217_transcript_compaction.py tests/test_session_save_mode.py
git diff --check
python -m compileall -q api/models.py api/streaming.py api/routes.py
```
Current local result:
```text
45 passed
```