fix(#1937): close endless-scroll prefetch vs Start-jump race with generation-token + mutex

The originally-proposed fix (gate _ensureAllMessagesLoaded on the existing _loadingOlder flag) does not actually close the race. By the time the prefetch reaches its post-await body, it has already cleared the entry- gate that reads _loadingOlder, so a same-flag check inside the resolved callback would be a no-op for an in-flight request. The actual fix is two-pronged: 1. New module-scoped _messagesGeneration counter, bumped every time S.messages is wholesale-replaced. _loadOlderMessages snapshots it BEFORE its await and re-checks after — if it changed, the prepend is aborted. This is the canonical async-invalidation pattern. 2. _ensureAllMessagesLoaded now claims the _loadingOlder mutex around its body so a new prefetch cannot start mid-replace and concurrent ensure-all calls (rapid double-click on Start) serialize cleanly. It bumps the generation token before mutating S.messages, yields until any in-flight prefetch finishes, and resets _oldestIdx so a subsequent prefetch cannot request stale older messages. Also adds the same-session / _loadingSessionId guards that the original ensure-all body was missing post-await — if the user switched sessions mid-flight, the old code would happily overwrite the new session's messages with the previous session's full history. 12 new regression tests in tests/test_issue1937_endless_scroll_jumpstart_race.py lock in: generation token declaration, bump-helper presence, snapshot- before-await ordering, post-await-abort behaviour, mutex acquisition and finally-release, yield-then-claim ordering when a prefetch is in flight, generation bump during the wait phase, _oldestIdx reset, and the new session-switch guard. Closes #1937.
2026-05-25 03:00:23 +00:00 · 2026-05-08 21:14:22 -07:00
parent 0b7e1e60e8
commit fb822239ea
3 changed files with 292 additions and 10 deletions
@@ -1,5 +1,11 @@
 # Hermes Web UI -- Changelog

+## [Unreleased]
+
+### Fixed
+
+- **#1937 — Race: endless-scroll prefetch vs Start-jump's `_ensureAllMessagesLoaded` could duplicate messages.** With both `session_jump_buttons` AND `session_endless_scroll` enabled, an in-flight `_loadOlderMessages` prefetch racing with `jumpToSessionStart` → `_ensureAllMessagesLoaded` could prepend a duplicate page if the prefetch resolved last. The naive fix suggested in the report (gate ensure-all on the existing `_loadingOlder` flag) does not actually close the race — by the time the prefetch reaches its post-await body, it has already cleared the entry-gate that reads `_loadingOlder`, so a same-flag check inside the resolved callback is a no-op. The actual fix is a generation-token + mutex pair: (1) `_loadOlderMessages` snapshots a new module-scoped `_messagesGeneration` counter BEFORE its `await api(...)` and re-checks it after, aborting the prepend if any wholesale-replace bumped the token mid-flight; (2) `_ensureAllMessagesLoaded` claims the `_loadingOlder` mutex around its body (so a NEW prefetch cannot start mid-replace, and concurrent ensure-all calls from rapid double-clicks on Start serialize cleanly), bumps the generation token before mutating `S.messages`, yields until any in-flight prefetch's `finally`-block releases the mutex, and resets `_oldestIdx` so a subsequent prefetch cannot send a stale `msg_before` index. Also adds the same-session and `_loadingSessionId` guards that the original ensure-all body was missing post-await. (`static/sessions.js`, `tests/test_issue1937_endless_scroll_jumpstart_race.py` — 12 new regression tests)
+
 ## [v0.51.30] — 2026-05-08 — 3-PR contributor batch (Release G: offline recovery + PWA hardening + opt-in session jump buttons + opt-in endless-scroll)

 ### Added (3 PRs, all from @ai-ag2026)
@@ -31,7 +37,7 @@

 ### Follow-up items filed (non-blocking)

- **Race between endless-scroll prefetch and Start-jump's `_ensureAllMessagesLoaded`** — with both opt-ins ON, an in-flight prefetch (started by 1.5x-viewport trigger) racing with `jumpToSessionStart` → `_ensureAllMessagesLoaded` could produce duplicate messages if the prefetch resolves last. Narrow window, but the fix is to gate `_ensureAllMessagesLoaded` on the existing `_loadingOlder` flag.
+- **Race between endless-scroll prefetch and Start-jump's `_ensureAllMessagesLoaded`** — with both opt-ins ON, an in-flight prefetch (started by 1.5x-viewport trigger) racing with `jumpToSessionStart` → `_ensureAllMessagesLoaded` could produce duplicate messages if the prefetch resolves last. Narrow window, but the fix is to gate `_ensureAllMessagesLoaded` on the existing `_loadingOlder` flag. **Resolved in Unreleased — see #1937 entry above; final fix uses generation-token + mutex rather than the originally-suggested flag gate, which would not have closed the race.**
 - **#1928 locale parity** — `session_jump_*` and `settings_*_session_jump_buttons` keys are English literals in ja/ru/es/de/zh/zh-Hant/pt/ko. Default-OFF + English fallback works, but breaks the locale-parity standard set by #1929 and #1891 in the same release.


@@ -998,6 +998,19 @@ let _loadingOlder = false;
 // oldest message currently loaded in S.messages. Starts at 0 when all
 // messages are loaded, or > 0 when truncated by msg_limit.
 let _oldestIdx = 0;
+// Generation token bumped every time S.messages is wholesale-replaced
+// (rather than incrementally extended). _loadOlderMessages snapshots it
+// before its `await` and re-checks after, so a late-resolving prefetch
+// does not prepend onto a transcript that was rebuilt under it
+// (e.g. by _ensureAllMessagesLoaded after a Start-jump). See #1937.
+let _messagesGeneration = 0;
+function _bumpMessagesGeneration() {
+  // Wrap to keep the counter bounded; the only operation that matters is
+  // strict inequality between the snapshot and the post-await read, so any
+  // monotonic bump is sufficient.
+  _messagesGeneration = (_messagesGeneration + 1) | 0;
+  return _messagesGeneration;
+}

 async function _loadOlderMessages() {
  if (_loadingOlder || !_messagesTruncated) return;
@@ -1005,6 +1018,11 @@ async function _loadOlderMessages() {
  if (!sid || !S.messages.length) return;
  if (_oldestIdx <= 0) { _messagesTruncated = false; return; }
  _loadingOlder = true;
+  // Snapshot the generation BEFORE we await. If S.messages is wholesale
+  // replaced while the request is in flight, the post-await check below
+  // bails out so we never prepend stale older messages onto a freshly
+  // rebuilt transcript (#1937).
+  const startGeneration = _messagesGeneration;
  try {
    const data = await api(`/api/session?session_id=${encodeURIComponent(sid)}&messages=1&resolve_model=0&msg_before=${_oldestIdx}&msg_limit=${_INITIAL_MSG_LIMIT}`);
    // Guard: api() may have redirected (401) and returned undefined.
@@ -1017,6 +1035,13 @@ async function _loadOlderMessages() {
    if (!data || !data.session) return;
    if (!S.session || S.session.session_id !== sid) return;
    if (_loadingSessionId !== null && _loadingSessionId !== sid) return;
+    // Generation guard: another code path (typically jumpToSessionStart →
+    // _ensureAllMessagesLoaded) may have replaced S.messages while we were
+    // awaiting. Prepending older messages onto that replacement would
+    // duplicate the head of the transcript. Detect via the generation
+    // counter and abort cleanly. _oldestIdx and _messagesTruncated were
+    // already reset by the wholesale-replace path, so no rollback needed.
+    if (_messagesGeneration !== startGeneration) return;
    const olderMsgs = (data.session.messages || []).filter(m => m && m.role);
    if (!olderMsgs.length) { _messagesTruncated = false; return; }
    // Prepend older messages
@@ -1063,17 +1088,56 @@ async function _loadOlderMessages() {

 // Ensure the full message history is loaded (for undo, export, etc).
 // If the session was loaded with msg_limit, this fetches all messages.
+//
+// Race-safety (#1937): with the endless-scroll opt-in, _loadOlderMessages
+// may be in flight when this runs (e.g. user scrolled near the top, then
+// hit the Start jump pill). Two coordinated guards prevent the prefetch
+// from prepending duplicate messages onto our wholesale replacement:
+//   1. Hold the _loadingOlder mutex around the body so a NEW prefetch
+//      cannot start mid-replace (entry-gate check at line ~1003 returns
+//      early). The mutex is also self-protecting against concurrent
+//      ensure-all calls from rapid double-clicks on Start.
+//   2. Bump _messagesGeneration before mutating S.messages so any
+//      in-flight prefetch's post-await generation check bails out.
 async function _ensureAllMessagesLoaded() {
  if (!_messagesTruncated || !S.session) return;
-  const sid = S.session.session_id;
-  const data = await api(`/api/session?session_id=${encodeURIComponent(sid)}&messages=1&resolve_model=0`);
-  // Guard: api() may have redirected (401) and returned undefined.
-  if (!data || !data.session) return;
-  const msgs = (data.session.messages || []).filter(m => m && m.role);
-  S.messages = msgs;
-  _messagesTruncated = false;
-  if(S.session && S.session.session_id === sid){
-    S.session.message_count = Number(data.session.message_count || msgs.length);
+  if (_loadingOlder) {
+    // A prefetch is mid-flight (between the `_loadingOlder = true` line
+    // and its post-await guards). Bumping the generation token now
+    // poisons that prefetch's continuation, but we still need to claim
+    // the mutex AFTER it releases. Yield until the prefetch finishes
+    // (its finally-block clears _loadingOlder) before fetching the full
+    // history ourselves. The generation bump below ensures any other
+    // future race against this same continuation also fails closed.
+    _bumpMessagesGeneration();
+    while (_loadingOlder) {
+      await new Promise(resolve => setTimeout(resolve, 16));
+    }
+    if (!_messagesTruncated || !S.session) return;
+  }
+  _loadingOlder = true;
+  try {
+    const sid = S.session.session_id;
+    const data = await api(`/api/session?session_id=${encodeURIComponent(sid)}&messages=1&resolve_model=0`);
+    // Guard: api() may have redirected (401) and returned undefined.
+    if (!data || !data.session) return;
+    // Session may have been switched while we awaited. Bail rather than
+    // overwrite the new session's messages.
+    if (!S.session || S.session.session_id !== sid) return;
+    if (_loadingSessionId !== null && _loadingSessionId !== sid) return;
+    const msgs = (data.session.messages || []).filter(m => m && m.role);
+    // Bump the generation BEFORE the wholesale replace so any racing
+    // prefetch (whose snapshot was taken before this call's mutex
+    // acquisition) sees the new value and aborts.
+    _bumpMessagesGeneration();
+    S.messages = msgs;
+    _messagesTruncated = false;
+    _oldestIdx = 0;
+    if (S.session && S.session.session_id === sid) {
+      S.session.message_count = Number(data.session.message_count || msgs.length);
+    }
+  } finally {
+    _loadingOlder = false;
  }
 }

@@ -0,0 +1,212 @@
+"""Regression test for issue #1937 — endless-scroll prefetch vs Start-jump race.
+
+When both ``session_jump_buttons`` and ``session_endless_scroll`` opt-ins
+are enabled, ``_loadOlderMessages`` (the endless-scroll prefetch) can be in
+flight when the user clicks the Start jump pill, which calls
+``_ensureAllMessagesLoaded``.  If the prefetch resolves AFTER the
+ensure-all wholesale-replaces ``S.messages``, it would prepend a duplicate
+page.
+
+The fix uses two coordinated guards:
+
+1. A ``_messagesGeneration`` token that gets bumped any time
+   ``S.messages`` is wholesale-replaced.  ``_loadOlderMessages`` snapshots
+   the token before its ``await`` and re-checks afterwards; if it changed,
+   the prepend is aborted.
+
+2. ``_ensureAllMessagesLoaded`` claims the existing ``_loadingOlder``
+   mutex around its body so no NEW prefetch can start mid-replace, and so
+   concurrent ensure-all invocations (e.g. rapid double-click on Start)
+   serialize cleanly.  It also yields until any in-flight prefetch's
+   ``finally`` clears the flag before claiming the mutex itself.
+
+The old fix shape suggested in the issue (spin-wait on ``_loadingOlder``
+before running ensure-all) does not actually solve the race the report
+describes: by the time the prefetch passes its entry-gate check, it is
+already past the only point where ``_loadingOlder`` is read, so a same-
+flag check inside its post-await body would be a no-op.  The generation
+token is the canonical pattern for invalidating async continuations and
+is what this regression suite locks in.
+"""
+
+from pathlib import Path
+
+REPO = Path(__file__).resolve().parents[1]
+SESSIONS_JS = (REPO / "static" / "sessions.js").read_text(encoding="utf-8")
+
+
+def _function_body(src: str, name: str) -> str:
+    """Slice the body of ``async function <name>`` (or ``function <name>``)."""
+    needle_async = f"async function {name}"
+    needle_sync = f"function {name}"
+    if needle_async in src:
+        start = src.index(needle_async)
+    else:
+        start = src.index(needle_sync)
+    brace = src.index("{", start)
+    depth = 0
+    for i in range(brace, len(src)):
+        if src[i] == "{":
+            depth += 1
+        elif src[i] == "}":
+            depth -= 1
+            if depth == 0:
+                return src[start : i + 1]
+    raise AssertionError(f"function {name!r} body not found")
+
+
+# ---------------------------------------------------------------------------
+# Generation token: declared at module scope, bumped via the helper.
+# ---------------------------------------------------------------------------
+
+def test_generation_token_declared_at_module_scope():
+    """``_messagesGeneration`` exists as a module-scoped mutable counter."""
+    assert "let _messagesGeneration = 0;" in SESSIONS_JS, (
+        "static/sessions.js must declare `let _messagesGeneration = 0;` so "
+        "_loadOlderMessages can snapshot/re-check it across its `await`. "
+        "See #1937."
+    )
+
+
+def test_generation_bump_helper_exists():
+    """A single helper bumps the generation; both consumers route through it."""
+    assert "function _bumpMessagesGeneration()" in SESSIONS_JS, (
+        "static/sessions.js must define `_bumpMessagesGeneration()` so "
+        "wholesale-replace sites have a single, named pivot to call. See #1937."
+    )
+    body = _function_body(SESSIONS_JS, "_bumpMessagesGeneration")
+    assert "_messagesGeneration" in body, (
+        "_bumpMessagesGeneration must mutate _messagesGeneration"
+    )
+
+
+# ---------------------------------------------------------------------------
+# _loadOlderMessages: snapshot before await, re-check after.
+# ---------------------------------------------------------------------------
+
+def test_load_older_snapshots_generation_before_await():
+    """Snapshot must be captured BEFORE the `await api(...)` call."""
+    body = _function_body(SESSIONS_JS, "_loadOlderMessages")
+    snapshot_idx = body.index("const startGeneration = _messagesGeneration;")
+    await_idx = body.index("await api(")
+    assert snapshot_idx < await_idx, (
+        "_loadOlderMessages must snapshot _messagesGeneration before its "
+        "`await`. Capturing it after the await defeats the race guard. "
+        "See #1937."
+    )
+
+
+def test_load_older_aborts_when_generation_changed():
+    """Post-await guard must compare against the snapshot and abort."""
+    body = _function_body(SESSIONS_JS, "_loadOlderMessages")
+    assert "if (_messagesGeneration !== startGeneration) return;" in body, (
+        "_loadOlderMessages must bail out (without prepending) when the "
+        "generation token changed during its await — that is the signal "
+        "that S.messages was wholesale-replaced under it. See #1937."
+    )
+
+
+def test_load_older_generation_check_runs_before_prepend():
+    """Generation check must come BEFORE the `S.messages = [...older, ...]` mutation."""
+    body = _function_body(SESSIONS_JS, "_loadOlderMessages")
+    guard_idx = body.index("if (_messagesGeneration !== startGeneration) return;")
+    prepend_idx = body.index("S.messages = [...olderMsgs, ...S.messages];")
+    assert guard_idx < prepend_idx, (
+        "Generation guard must short-circuit BEFORE the prepend. "
+        "Otherwise duplicate messages can still slip through. See #1937."
+    )
+
+
+# ---------------------------------------------------------------------------
+# _ensureAllMessagesLoaded: claims the mutex, bumps the generation, yields.
+# ---------------------------------------------------------------------------
+
+def test_ensure_all_bumps_generation_before_replace():
+    """Bump must happen BEFORE `S.messages = msgs` so racing prefetch sees it."""
+    body = _function_body(SESSIONS_JS, "_ensureAllMessagesLoaded")
+    bump_idx = body.rindex("_bumpMessagesGeneration()")
+    replace_idx = body.index("S.messages = msgs;")
+    assert bump_idx < replace_idx, (
+        "_ensureAllMessagesLoaded must bump the generation token BEFORE the "
+        "wholesale replace, otherwise an in-flight prefetch's post-await "
+        "check could read the old value and prepend duplicates. See #1937."
+    )
+
+
+def test_ensure_all_claims_loading_older_mutex():
+    """The body must hold `_loadingOlder = true` so no NEW prefetch starts mid-replace."""
+    body = _function_body(SESSIONS_JS, "_ensureAllMessagesLoaded")
+    assert "_loadingOlder = true;" in body, (
+        "_ensureAllMessagesLoaded must claim the _loadingOlder mutex so "
+        "the entry-gate in _loadOlderMessages short-circuits new prefetches "
+        "while ensure-all is mid-replace. See #1937."
+    )
+    assert "_loadingOlder = false;" in body, (
+        "_ensureAllMessagesLoaded must release the _loadingOlder mutex in "
+        "its finally-block. Otherwise endless-scroll silently breaks after "
+        "every Start-jump."
+    )
+
+
+def test_ensure_all_releases_mutex_in_finally():
+    """Mutex release must live inside a `finally` so errors don't leak the lock."""
+    body = _function_body(SESSIONS_JS, "_ensureAllMessagesLoaded")
+    finally_idx = body.index("} finally {")
+    release_idx = body.index("_loadingOlder = false;", finally_idx)
+    assert release_idx > finally_idx, (
+        "_loadingOlder release must be inside the finally-block to survive "
+        "thrown errors during the wholesale replace. See #1937."
+    )
+
+
+def test_ensure_all_yields_when_prefetch_in_flight():
+    """When a prefetch holds the mutex, ensure-all must wait, not wholesale-replace alongside it."""
+    body = _function_body(SESSIONS_JS, "_ensureAllMessagesLoaded")
+    # Look for the yield-loop on _loadingOlder before the mutex claim.
+    yield_idx = body.index("while (_loadingOlder)")
+    claim_idx = body.index("_loadingOlder = true;")
+    assert yield_idx < claim_idx, (
+        "_ensureAllMessagesLoaded must yield (poll _loadingOlder) BEFORE "
+        "claiming the mutex itself, so an in-flight prefetch's finally-"
+        "block fires and the generation guard inside that prefetch resolves "
+        "the race cleanly. See #1937."
+    )
+
+
+def test_ensure_all_bumps_generation_during_wait_phase():
+    """Bumping during the wait poisons any in-flight prefetch immediately, even before ensure-all gets the mutex."""
+    body = _function_body(SESSIONS_JS, "_ensureAllMessagesLoaded")
+    # Find the _loadingOlder branch that runs when a prefetch is in flight,
+    # and verify it bumps the generation before the wait loop.
+    branch_idx = body.index("if (_loadingOlder) {")
+    wait_idx = body.index("while (_loadingOlder)", branch_idx)
+    bump_in_branch = body.index("_bumpMessagesGeneration()", branch_idx)
+    assert branch_idx < bump_in_branch < wait_idx, (
+        "When a prefetch is in flight at entry, _ensureAllMessagesLoaded "
+        "must bump the generation BEFORE the wait loop so the in-flight "
+        "prefetch's post-await check fires the moment its api() resolves, "
+        "not just for future calls. See #1937."
+    )
+
+
+def test_ensure_all_resets_oldest_idx():
+    """After wholesale-replacing with the full history, _oldestIdx must reset to 0."""
+    body = _function_body(SESSIONS_JS, "_ensureAllMessagesLoaded")
+    assert "_oldestIdx = 0;" in body, (
+        "_ensureAllMessagesLoaded must reset _oldestIdx to 0 — without it, "
+        "a subsequent prefetch could send `msg_before=<stale-idx>` and "
+        "request older messages that are already in the now-full transcript."
+    )
+
+
+def test_ensure_all_guards_against_session_switch_mid_await():
+    """Same-session check must run after await — old version skipped this."""
+    body = _function_body(SESSIONS_JS, "_ensureAllMessagesLoaded")
+    await_idx = body.index("await api(")
+    sid_check_idx = body.index("S.session.session_id !== sid", await_idx)
+    replace_idx = body.index("S.messages = msgs;", await_idx)
+    assert await_idx < sid_check_idx < replace_idx, (
+        "_ensureAllMessagesLoaded must guard against session-switch races "
+        "(re-check S.session.session_id after await) BEFORE wholesale-"
+        "replacing S.messages. The pre-fix version had no such guard."
+    )