Skip to content

runtime: recover oversized user message after wire/media overflow#2821

Open
trungutt wants to merge 2 commits into
docker:mainfrom
trungutt:trungutt/overflow-aware-compaction
Open

runtime: recover oversized user message after wire/media overflow#2821
trungutt wants to merge 2 commits into
docker:mainfrom
trungutt:trungutt/overflow-aware-compaction

Conversation

@trungutt
Copy link
Copy Markdown
Contributor

@trungutt trungutt commented May 19, 2026

Stacked on #2819. Review after that one merges.

After #2818 + #2819, an oversized message produces the right error message and fails fast — but the session is still poisoned. The offending user message sits verbatim in the session store, every subsequent call reloads it as part of the conversation history, and the same limit trips again. The user has to start a new chat to recover.

The end result for the user: "I shortened my paste but it kept failing."

What this change does

After a wire- or media-overflow rejection, walk back to the most recent user message and rewrite it in place:

  • Each media part (image, file, document) becomes a short text placeholder that records what was attached (name, size, MIME when known).
  • Plain-text content over a conservative threshold is replaced with a placeholder that records the original size.

The rewrite is then persisted to the session store, so on the next reload the conversation history no longer carries the payload that just got rejected. The error itself is still emitted with the same kind-specific code from #2818, and a Warning event explains to the user that the previous message was rewritten — so the hygiene action is observable rather than silent.

Schemas

Today (after #2818 + #2819)              After this change
───────────────────────────              ─────────────────
User pastes huge content                 User pastes huge content
       │                                        │
       ▼                                        ▼
Provider rejects                         Provider rejects
       │                                        │
       ▼                                        ▼
Error surfaced with the right code       Error surfaced with the right code
       │                                        │
       ▼                                        ▼
User shortens, retries                   Offending message rewritten in
       │                                  place: media → placeholders,
       ▼                                  oversized text → placeholder
History reload contains the                     │
original oversized turn                         ▼
       │                                 User shortens, retries
       ▼                                        │
Same limit tripped again                        ▼
       │                                 History reload contains only
       ▼                                 the slim placeholder + the new
Session permanently broken               smaller message
                                                │
                                                ▼
                                          Provider accepts — session
                                          continues normally

Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

2 medium-severity issues found in the overflow recovery implementation.

Comment thread pkg/runtime/loop_steps.go Outdated
Comment thread pkg/runtime/overflow_recovery.go
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch from a1d06ad to 18a703a Compare May 19, 2026 12:17
@trungutt trungutt changed the base branch from main to trungutt/overflow-skip-compaction-on-wire May 19, 2026 12:24
@trungutt trungutt changed the title feat(runtime): recover oversized user message after wire/media overflow runtime: recover oversized user message after wire/media overflow May 19, 2026
@trungutt trungutt marked this pull request as ready for review May 19, 2026 12:45
@trungutt trungutt requested a review from a team as a code owner May 19, 2026 12:45
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

Two medium-severity findings in the new overflow-recovery code. The core approach — scrubbing oversized messages in-place, persisting the rewrite, and emitting a Warning event — is sound and the two learned patterns (wire/media scrub independence from compaction config; top-level Content vs MultiContent text scrubbing) are both correctly implemented.

Comment thread pkg/runtime/overflow_recovery.go Outdated
Comment thread pkg/session/session.go
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch 2 times, most recently from b22d0fb to c724506 Compare May 19, 2026 13:17
@trungutt trungutt requested a review from docker-agent May 19, 2026 14:11
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟢 APPROVE

Reviewed the overflow recovery changes across pkg/runtime/overflow_recovery.go, pkg/runtime/loop_steps.go, pkg/session/session.go, and the accompanying tests.

What was checked:

  • Correctness of findPersistedMessage when sub-sessions are present — GetAllMessages() appends top-level messages before sub-session messages, so even in a hypothetical ID-collision scenario the first match is always the correctly-rewritten top-level message ✅
  • Notification ordering (emitScrubNotice fires unconditionally after in-memory rewrite, regardless of persistence outcome) — correctly implemented per design intent ✅
  • UpdateMessage call site — called with the post-mutation message content, messageID == 0 guard handles un-persisted synthesised messages ✅
  • Store-write failure path — non-fatal, logged with slog.WarnContext, and user is still notified via the scrub notice ✅
  • Test coverage — happy path, persist-fails path, and no-user-message path all covered ✅

No confirmed or likely bugs found. The implementation correctly addresses the session-poisoning scenario described in the PR description.

@trungutt trungutt changed the base branch from trungutt/overflow-skip-compaction-on-wire to main May 19, 2026 17:09
Auto-compaction is only useful when the rejection is a token-count
overflow — summarising older turns reduces the input token count.

For wire-level overflow ([OverflowKindWire]) the request body itself
exceeds the provider's cap, and the latest turn alone is over the limit;
the compaction call would have to send the same oversized history and
would also be rejected.

For media overflow ([OverflowKindMedia]) we have no media-stripping
during compaction today, so a retry would resend the same attachment
and fail again.

In both cases the recovery attempt always fails, then we surface the
error anyway, while having spent an extra provider call and several
seconds of wall-clock latency. This change skips compaction for those
two kinds and surfaces the error directly. The token-overflow path is
unchanged.
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch from c724506 to b334412 Compare May 19, 2026 17:13
@trungutt trungutt requested a review from docker-agent May 19, 2026 17:14
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🔴 CRITICAL

1 high-severity and 1 low-severity finding in the new overflow recovery code.

// same object that was just mutated in place — race-safe because the
// caller mutates and reads under the same goroutine within
// handleStreamError.
func findPersistedMessage(sess *session.Session, id int64) *session.Message {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH] findPersistedMessage may always return nil — SQLite-backed sessions never get the in-memory Message.ID populated

findPersistedMessage searches sess.GetAllMessages() for a message whose ID matches the one returned by RewriteLatestUserMessage. This works when Message.ID is populated in the in-memory session, but it is not for SQLite-backed sessions:

  1. SQLiteSessionStore.AddMessage inserts the row and returns LastInsertId(), but it does not call session.AddMessage itself — the runtime's PersistenceObserver calls store.AddMessage and discards the returned ID without writing it back into the in-memory session.Message.ID.
  2. loadSessionItems (the path used on reload from disk) reconstructs Message structs without scanning the id column, so Message.ID stays 0 for every row loaded from SQLite.

Consequence: For any SQLite-backed session, item.Message.ID is 0 in memory. RewriteLatestUserMessage therefore returns (0, true). The guard if messageID == 0 || r.sessionStore == nil fires, emitScrubNotice is still called (good), but findPersistedMessage / UpdateMessage are skipped — so the SQLite row is never updated. On the next session load the original oversized payload reloads from disk and the overflow trips again.

The existing test (TestRecoverFromOversizedTurn_NotifiesUserEvenWhenPersistFails) explicitly sets stored.ID = persistedID before sess.AddMessage(stored) to hit the UpdateMessage path — this pattern is not followed in production code paths (e.g. PersistenceObserver.UserMessageEvent).

Fix options:

  • Have SQLiteSessionStore.AddMessage call msg.ID = id on the caller's pointer (mirroring InMemorySessionStore), or
  • Have loadSessionItems scan and populate the id column into Message.ID, or
  • Rewrite findPersistedMessage to use RewriteLatestUserMessage's returned ID directly (skipping the search entirely), since RewriteLatestUserMessage already has the pointer and can expose the rewritten message object without a second scan.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the claim — the bug reproduces in production for both store types because the persistence ID is never written back to the in-memory Message.ID (the observer discards the return value from Store.AddMessage), and even on reload from disk loadSessionItems doesn't SELECT id, so Message.ID stays 0 forever.

Fixed in the latest force-push with two surgical changes. Both are needed; either alone leaves a gap (writes propagate via PersistenceObserver, reads via loadSessionItems):

  1. pkg/session/store.goloadSessionItems: the SELECT now includes id and populates Message.ID on the loaded row. Pinned by TestSQLiteSessionStore_LoadSessionItems_PopulatesMessageID.
  2. pkg/runtime/persistence_observer.goUserMessageEvent: after Store.AddMessage returns, the observer calls a new Session.AssignLatestUserMessageID(id) so the in-memory message picks up the persistence ID. Pinned by TestPersistenceObserver_UserMessageEvent_PropagatesID (SQLite-backed, mirrors the production runtime+store flow) and TestPersistenceObserver_UserMessageEvent_AddMessageFailure_DoesNotAssignID for the failure path.
  3. pkg/session/session.goAssignLatestUserMessageID: new helper used by the observer. Race-safe (under s.mu), refuses to overwrite a non-zero ID so two concurrent writers can't clobber each other. Pinned by TestSession_AssignLatestUserMessageID (covers happy path, zero-ID no-op, overwrite refusal, no-user-message, empty session).

End-to-end coverage: TestRecoverFromOversizedTurn_EndToEndPersistsScrub walks the full pipeline (PersistenceObserver writes via SQLite, scrub fires, reload from DB) and asserts the reloaded row reflects the scrubbed content with the same persistence ID — which is the exact regression the reviewer described.

Left findPersistedMessage and the existing scrub structure intact since the now-populated Message.ID makes it work correctly without further refactoring. The reviewer's option 3 (return the rewritten message pointer directly) is a code-simplification that can land as a follow-up if desired.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent fix — verified the two-pronged approach covers both write (observer propagates ID back to session) and read (loadSessionItems scans the id column on reload) paths, and the test coverage pins all entry points and failure modes as described.

The decision to preserve findPersistedMessage rather than refactor to option 3 is the right call for this PR: the now-populated Message.ID makes the search work correctly, and the "return the pointer directly" simplification can land as a follow-up without blocking the regression fix. Keeping the scope surgical reduces rebase risk and makes the history easier to bisect later.

One thing I'll note for the future: AssignLatestUserMessageID refuses to overwrite a non-zero ID (the "two concurrent writers can't clobber each other" guard), which is great for safety, but that means if two observers ever race on the same AddMessage return value, one silently loses. Not a problem in the current single-observer setup, but worth remembering if the observer wiring changes. The silent false return is appropriate for now — logging would just be noise.

Marking resolved. Thanks for the thorough fix and test coverage!

Comment thread pkg/session/session.go Outdated
//
// Returns the database ID of the rewritten Item.Message (used by
// callers to persist the rewrite via [Store.UpdateMessage]) together
// with ok=true. Returns (0, false) when the session has no user
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Docstring contradicts implementation: RewriteLatestUserMessage returns (0, true) for unpersisted messages, not (0, false)

The doc comment says:

Returns (0, false) when … the latest user message has not been persisted yet (ID is zero, e.g. a synthesised implicit message).

But the implementation returns (item.Message.ID, true) unconditionally after a successful rewrite — which is (0, true) when ID == 0.

The current caller (recoverFromOversizedTurn) handles this correctly via the if messageID == 0 guard. However, a future caller who trusts the docstring might check only ok == false to detect "nothing happened" and miss the case where an unpersisted message was rewritten in memory.

The docstring should be corrected to match the actual contract: (0, true) when the message was rewritten but had no persisted ID.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — the docstring did claim (0, false) for unpersisted messages while the implementation returned (0, true). Fixed in the latest force-push: the doc now says "Returns the database ID of the rewritten Item.Message and ok=true on a successful rewrite. The returned ID is zero when the rewritten message has not been persisted yet ...; callers that intend to mirror the rewrite to the store via Store.UpdateMessage must check for a non-zero ID before doing so. Returns (0, false) when the session has no user messages or when rewrite returns false."

Note: with the related fix in this same force-push (PersistenceObserver now propagates the persistence ID back to the in-memory session, loadSessionItems now reads the id column), the ID == 0 case shrinks to genuinely-unpersisted messages — the production runtime/store flow no longer hits it for real user messages, which was the regression the other reviewer flagged. The docstring contract is now both correct and unambiguous about the rare remaining case.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent fix! The docstring now accurately describes the contract: (0, true) when the message was successfully rewritten but had no persisted ID, and (0, false) only when nothing happened (no user messages or rewrite declined).

Thanks for the context on the broader fix — propagating persistence IDs back from the observer to the in-memory session means the ID == 0 edge case is now genuinely rare (only unpersisted synthesized messages), which makes the API much clearer for future callers.

When the provider rejects a request because the body itself is over the
wire-size cap or contains an oversized attachment, the offending user
message stays verbatim in the session. Every subsequent call reloads
that message as part of the conversation history and trips the same
limit. The session is effectively dead until the user starts over.

Add a hygiene step that runs on wire- and media-overflow rejections:
walk back to the latest user message, replace each media part (image,
file, document) with a text placeholder that records what was attached,
and replace plain-text content over 1 MiB with a size-noting placeholder.
The rewrite is mirrored to the session store so the next session load
reflects it; the in-memory mutation alone keeps the current process
healthy even if the store write fails.

A Warning event is emitted so the UI can tell the user that their
previous message was rewritten in place. The fatal ErrorEvent for the
original rejection is still emitted — scrubbing is in addition to
surfacing the error, not instead of it.

Token-overflow is unchanged: it still goes through auto-compaction,
which is the correct mechanism for that shape of failure.
@trungutt trungutt force-pushed the trungutt/overflow-aware-compaction branch from b334412 to 354f333 Compare May 19, 2026 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants