Skip to content

perf(editor): ttl-refresh traject index snapshots, bound body cache, implements index#794

Open
tdjager wants to merge 3 commits into
mainfrom
perf-editor-api-index-caches
Open

perf(editor): ttl-refresh traject index snapshots, bound body cache, implements index#794
tdjager wants to merge 3 commits into
mainfrom
perf-editor-api-index-caches

Conversation

@tdjager

@tdjager tdjager commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Problem (audit advice #9)

Three staleness/perf problems on the editor-api read paths:

  1. Stale-forever traject cachesTrajectCorpusCache entries were only invalidated on traject create/delete. The per-traject source-map index snapshot never picked up upstream changes (new laws merged, re-harvests, saves on another replica) until process restart, and the lazily-fetched body cache was unbounded and never refreshed either.
  2. O(corpus) display names per requestGET /corpus/laws called resolve_display_name per law per request: a line scan of every body, plus a full serde_yaml_ng parse for name: '#ref' laws. For the unfiltered list that is O(total corpus bytes) per page load.
  3. O(corpus) implementor discovery with serialized fetchesimplementors_in_scope lazily fetched every metadata-only law's body sequentially under the per-source mutex on every request, and silently skipped fetch failures, so a throttled scan returned a silently incomplete list.

Changes

TTL + refresh on the per-traject index snapshot (TRAJECT_INDEX_TTL = 60s)

Follows the existing CHANGED_LAWS_TTL precedent (same 60s convergence target). The first request past the deadline re-enumerates the sources and swaps in a fresh SourceMap; concurrent requests keep serving the stale snapshot while one refresh is in flight (single-flighted per traject), and a failed refresh re-arms the stale snapshot for another TTL window instead of erroring reads. A refresh where any source fails to enumerate is treated as failed — serving the previous complete snapshot beats swapping in one with thousands of laws missing. The failed-refresh warn is rate-limited (first failure and every 10th consecutive one; the rest log at debug) so a permanently broken source doesn't re-warn every minute.

Why this is safe w.r.t. saves: the refreshed instance carries over

  • the backend map (same Arc<Mutex<…>> instances — an in-flight save keeps excluding writers across the swap, and the writable-own → seed lock-ordering invariant from fix(editor): optimistic concurrency (ETag/If-Match) for law and scenario saves #790 is untouched);
  • the post-save overlay (shared Arc) — a refresh never resurrects pre-save content, and a record_save racing the swap on the old instance is visible through the new one;
  • the changed-laws cache (shared Arc) — a refresh cannot undo save_law's invalidation;
  • the implements index (served stale while rebuilt) and its cross-snapshot memo — see below.

Overlay reconciliation at refresh (read-your-writes-or-newer)

The overlay is carried across refreshes, but no longer unconditionally: each refresh compares every overlay entry against its write-target branch file (one per-entry backend lock — never held across the loop, and no seed lock, so the writable-own → seed lock order can't be violated) and drops the entry when the branch content no longer equals the saved body. That is the cross-replica convergence story: a law saved on this process serves the saved body until someone else (another replica, a direct push) overwrites it on the branch, after which reads converge to the newer branch content within a TTL window — instead of this process pinning its own save forever, which made every cross-replica If-Match save 412 against content the loser could never fetch. A read error keeps the entry (a network blip must not drop a save), and removal is compare-and-remove so a save landing concurrently with the reconcile pass is never dropped.

Bounded, per-snapshot body cache (BODY_CACHE_MAX_ENTRIES = 1024)

The save overlay and the lazy-read cache used to share one unbounded map. They are now split: saves stay in the overlay (bounded by laws edited in the traject, survives refreshes), lazily fetched bodies live in a FIFO-bounded per-snapshot cache that the refresh discards — that is how upstream content changes (not just index changes) converge within the TTL. FIFO instead of LRU keeps cache hits on the cheap read guard; at ~tens of KB per body the cap bounds the cache at low tens of MB per traject.

Implements index: stale-while-revalidate + content-addressed memo

  • LoadedLaw now carries display_name and implements, derived once (one shared YAML parse) at load / update_yaml_content time in regelrecht-corpus. The law list reads display_name straight off the entry; the global implementors lookup is an in-memory reverse scan.
  • Traject scopes build an implements index on the first lookup of the process — the one O(corpus) body scan. A TTL refresh does not repeat it: the previous index is carried over flagged stale, the first post-refresh lookup rebuilds it under a try_lock (same single-flight pattern as the snapshot refresh) while concurrent lookups keep serving the carried index — the federated implementors hot path (fix(editor): stop scenario dependency-scan request storm + un-gate result buttons #762) never queues behind a rescan again.
  • The rebuild itself is content-addressed: the GitHub Trees enumeration's blob shas are now plumbed through regelrecht-corpus onto metadata-only LoadedLaw entries (content_sha), and the editor keeps a cross-snapshot (source_id, blob_sha) → implements memo (shared Arc, like the overlay). A rebuild fetches/parses only bodies whose blob changed; an unchanged corpus rebuilds with zero Contents API calls. Overlay-saved laws are parsed from the overlay body (never memoized under the pre-save sha), loaded (local) laws reuse the corpus-parsed list, and each rebuild swaps the memo wholesale so it stays bounded by the current corpus size. record_save still updates the live index in place.
  • Fetch failures during a scan are recorded and reported (the build warns once with skipped/indexed counts; the per-request signal is debug) instead of being indistinguishable from "no implementors"; the failed set is retried at the next rebuild. The response is a bare id array, so there was no non-breaking field for a partiality flag — left as a possible follow-up if the frontend wants to surface it.

Small fix

The global scenario GET leaked raw backend error text via e.to_string(); it now uses the same generic-message/log-details pattern as corpus_write_error.

Possible follow-up

A partial refresh (keep healthy sources' fresh enumeration when one source fails, instead of all-or-nothing serve-stale) would tighten convergence further for multi-source trajects with one flaky seed; deliberately out of scope here.

Tests

  • corpus: display-name/implements precompute on local load, fetched files, metadata-only entries (incl. content_sha), and update_yaml_content recompute.
  • editor-api lib: body-cache FIFO bound, TTL freshness rule, lazy-fetch caching + overlay precedence, implements-index build/reuse, fetch-failure reporting, record_save in-place index updates, refresh carry-over semantics (shared backends/overlay/changed-cache), SWR (carried index served while a rebuild holds the build lock), memo (unchanged corpus rebuilds with zero fetches; a changed sha refetches only that body), and overlay reconciliation (entry kept while branch == saved body, dropped when the branch moved past the save).
  • editor-api integration (traject_reads_test, testcontainers): end-to-end TTL refresh through the real handlers — an upstream-added law appears after the refresh, a fresh save survives it (read-your-writes), and an external overwrite of a saved law converges to the external content instead of being pinned. All 11 tests green.

Note: save_annotations_test and trajects_test do not compile on origin/main either (stale imports like SESSION_KEY_ACTIVE_TRAJECT); untouched here.

just format, just lint, just build-check all green.

tdjager added 2 commits June 11, 2026 13:18
…load time

Listing endpoints called resolve_display_name per law per request (a
line scan of every body, plus a full serde_yaml parse for laws with
'name: #ref'), and implementor discovery re-parsed every body per
request. Derive both fields once, with a single shared YAML parse, when
a law is loaded (local scan or fetched file) and recompute them in
update_yaml_content; metadata-only index entries keep them empty until
their body is known.
…or lookups

Three read-path problems in the per-traject corpus:

- The cached TrajectCorpus was only invalidated on traject create or
  delete, so its index snapshot never picked up upstream changes (new
  laws merged, re-harvests, saves on another replica) until a process
  restart. The snapshot now expires after TRAJECT_INDEX_TTL (60s, the
  CHANGED_LAWS_TTL precedent): the first request past the deadline
  re-enumerates the sources and swaps in a fresh SourceMap while the
  backend mutexes, the post-save overlay and the changed-laws cache
  carry over, so an in-flight save keeps its lock and a refresh can
  never resurrect pre-save content. Concurrent requests serve the stale
  snapshot while one refresh runs; a failed refresh re-arms the stale
  snapshot for another TTL instead of erroring reads.

- The lazily-fetched law-body cache was unbounded and shared a map with
  the read-your-writes save overlay. They are now split: saves stay in
  the (per-edit-bounded, refresh-surviving) overlay, lazy fetches live
  in a per-snapshot FIFO-bounded cache (1024 entries) that a refresh
  discards, so upstream body changes converge within the TTL.

- implementors_in_scope fetched and parsed every law body sequentially
  on every request and silently skipped fetch failures. Traject scopes
  now build a per-snapshot implements index (one O(corpus) scan,
  updated in place by record_save, discarded with the snapshot) and
  report fetch-skipped laws via a warn log instead of passing off an
  incomplete scan as 'no implementors'; the global scope uses the
  implements lists precomputed at load time.

Also routes the global scenario-read error through the generic-message/
log-details pattern instead of leaking raw backend error text.
@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown

Preview Deployment — harvester-admin

Your changes have been deployed to a preview environment:

URL: https://harvester-admin-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

@github-actions

Copy link
Copy Markdown

Preview Deployment — harvester-worker

Your changes have been deployed to a preview environment:

URL: https://harvester-worker-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

@github-actions

Copy link
Copy Markdown

Preview Deployment — enrichworker

Your changes have been deployed to a preview environment:

URL: https://enrichworker-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

@github-actions

Copy link
Copy Markdown

Preview Deployment — editor — editor

Your changes have been deployed to a preview environment:

URL: https://editor-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl

This deployment will be automatically cleaned up when the PR is closed.

… save overlay

Review follow-ups on the TTL-refresh snapshot caches:

- Stale-while-revalidate for the implements index: a TTL refresh carries
  the previous index over flagged stale instead of discarding it. The
  first post-refresh lookup rebuilds it under a try_lock (snapshot
  pattern) while concurrent lookups keep serving the carried index, so
  the implementors hot path never queues behind a corpus rescan again.
- Content-addressed rebuild memo: the Trees enumeration's blob shas are
  now plumbed through regelrecht-corpus onto metadata-only LoadedLaw
  entries (content_sha), and the editor keeps a cross-snapshot
  (source_id, sha) -> implements memo shared like the overlay. A rebuild
  only fetches/parses bodies whose blob actually changed; an unchanged
  corpus rebuilds with zero Contents API calls. Each rebuild swaps the
  memo wholesale so it stays bounded by the current corpus size.
- Overlay reconciliation at refresh: every overlay entry is compared
  against its write-target branch file (one per-entry backend lock, no
  seed lock) and dropped when the branch moved past the save — another
  replica's edit or a direct push now converges within a TTL window
  instead of pinning this process's save forever and 412-looping every
  cross-replica edit. A read error keeps the entry, and removal is
  compare-and-remove so a concurrent save is never dropped.
- Demote the per-request "implementors incomplete" warn to debug (the
  index build already warns once with counts).
- Rate-limit the failed-refresh warn: first failure and every 10th
  consecutive one warn, the rest log at debug; the counter resets on a
  successful refresh.

Tests: SWR serving while a rebuild holds the lock, memo skipping
unchanged bodies (and refetching changed ones), overlay kept while the
branch matches the save and dropped when it moved, plus the integration
test now asserting cross-replica convergence instead of forever-pinned
saves.
@tdjager tdjager changed the title perf(editor): TTL-refresh traject index snapshots, bound body cache, implements index perf(editor): ttl-refresh traject index snapshots, bound body cache, implements index Jun 11, 2026
@tdjager tdjager marked this pull request as ready for review June 11, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant