perf(editor): ttl-refresh traject index snapshots, bound body cache, implements index#794
perf(editor): ttl-refresh traject index snapshots, bound body cache, implements index#794tdjager wants to merge 3 commits into
Conversation
…load time Listing endpoints called resolve_display_name per law per request (a line scan of every body, plus a full serde_yaml parse for laws with 'name: #ref'), and implementor discovery re-parsed every body per request. Derive both fields once, with a single shared YAML parse, when a law is loaded (local scan or fetched file) and recompute them in update_yaml_content; metadata-only index entries keep them empty until their body is known.
…or lookups Three read-path problems in the per-traject corpus: - The cached TrajectCorpus was only invalidated on traject create or delete, so its index snapshot never picked up upstream changes (new laws merged, re-harvests, saves on another replica) until a process restart. The snapshot now expires after TRAJECT_INDEX_TTL (60s, the CHANGED_LAWS_TTL precedent): the first request past the deadline re-enumerates the sources and swaps in a fresh SourceMap while the backend mutexes, the post-save overlay and the changed-laws cache carry over, so an in-flight save keeps its lock and a refresh can never resurrect pre-save content. Concurrent requests serve the stale snapshot while one refresh runs; a failed refresh re-arms the stale snapshot for another TTL instead of erroring reads. - The lazily-fetched law-body cache was unbounded and shared a map with the read-your-writes save overlay. They are now split: saves stay in the (per-edit-bounded, refresh-surviving) overlay, lazy fetches live in a per-snapshot FIFO-bounded cache (1024 entries) that a refresh discards, so upstream body changes converge within the TTL. - implementors_in_scope fetched and parsed every law body sequentially on every request and silently skipped fetch failures. Traject scopes now build a per-snapshot implements index (one O(corpus) scan, updated in place by record_save, discarded with the snapshot) and report fetch-skipped laws via a warn log instead of passing off an incomplete scan as 'no implementors'; the global scope uses the implements lists precomputed at load time. Also routes the global scenario-read error through the generic-message/ log-details pattern instead of leaking raw backend error text.
Preview Deployment — harvester-adminYour changes have been deployed to a preview environment: URL: https://harvester-admin-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl This deployment will be automatically cleaned up when the PR is closed. |
Preview Deployment — harvester-workerYour changes have been deployed to a preview environment: URL: https://harvester-worker-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl This deployment will be automatically cleaned up when the PR is closed. |
Preview Deployment — enrichworkerYour changes have been deployed to a preview environment: URL: https://enrichworker-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl This deployment will be automatically cleaned up when the PR is closed. |
Preview Deployment — editor — editorYour changes have been deployed to a preview environment: URL: https://editor-pr794-regel-k4c.rig.prd1.gn2.quattro.rijksapps.nl This deployment will be automatically cleaned up when the PR is closed. |
… save overlay Review follow-ups on the TTL-refresh snapshot caches: - Stale-while-revalidate for the implements index: a TTL refresh carries the previous index over flagged stale instead of discarding it. The first post-refresh lookup rebuilds it under a try_lock (snapshot pattern) while concurrent lookups keep serving the carried index, so the implementors hot path never queues behind a corpus rescan again. - Content-addressed rebuild memo: the Trees enumeration's blob shas are now plumbed through regelrecht-corpus onto metadata-only LoadedLaw entries (content_sha), and the editor keeps a cross-snapshot (source_id, sha) -> implements memo shared like the overlay. A rebuild only fetches/parses bodies whose blob actually changed; an unchanged corpus rebuilds with zero Contents API calls. Each rebuild swaps the memo wholesale so it stays bounded by the current corpus size. - Overlay reconciliation at refresh: every overlay entry is compared against its write-target branch file (one per-entry backend lock, no seed lock) and dropped when the branch moved past the save — another replica's edit or a direct push now converges within a TTL window instead of pinning this process's save forever and 412-looping every cross-replica edit. A read error keeps the entry, and removal is compare-and-remove so a concurrent save is never dropped. - Demote the per-request "implementors incomplete" warn to debug (the index build already warns once with counts). - Rate-limit the failed-refresh warn: first failure and every 10th consecutive one warn, the rest log at debug; the counter resets on a successful refresh. Tests: SWR serving while a rebuild holds the lock, memo skipping unchanged bodies (and refetching changed ones), overlay kept while the branch matches the save and dropped when it moved, plus the integration test now asserting cross-replica convergence instead of forever-pinned saves.
Problem (audit advice #9)
Three staleness/perf problems on the editor-api read paths:
TrajectCorpusCacheentries were only invalidated on traject create/delete. The per-traject source-map index snapshot never picked up upstream changes (new laws merged, re-harvests, saves on another replica) until process restart, and the lazily-fetched body cache was unbounded and never refreshed either.GET /corpus/lawscalledresolve_display_nameper law per request: a line scan of every body, plus a fullserde_yaml_ngparse forname: '#ref'laws. For the unfiltered list that is O(total corpus bytes) per page load.implementors_in_scopelazily fetched every metadata-only law's body sequentially under the per-source mutex on every request, and silently skipped fetch failures, so a throttled scan returned a silently incomplete list.Changes
TTL + refresh on the per-traject index snapshot (
TRAJECT_INDEX_TTL = 60s)Follows the existing
CHANGED_LAWS_TTLprecedent (same 60s convergence target). The first request past the deadline re-enumerates the sources and swaps in a freshSourceMap; concurrent requests keep serving the stale snapshot while one refresh is in flight (single-flighted per traject), and a failed refresh re-arms the stale snapshot for another TTL window instead of erroring reads. A refresh where any source fails to enumerate is treated as failed — serving the previous complete snapshot beats swapping in one with thousands of laws missing. The failed-refresh warn is rate-limited (first failure and every 10th consecutive one; the rest log at debug) so a permanently broken source doesn't re-warn every minute.Why this is safe w.r.t. saves: the refreshed instance carries over
Arc<Mutex<…>>instances — an in-flight save keeps excluding writers across the swap, and the writable-own → seed lock-ordering invariant from fix(editor): optimistic concurrency (ETag/If-Match) for law and scenario saves #790 is untouched);Arc) — a refresh never resurrects pre-save content, and arecord_saveracing the swap on the old instance is visible through the new one;Arc) — a refresh cannot undosave_law's invalidation;Overlay reconciliation at refresh (read-your-writes-or-newer)
The overlay is carried across refreshes, but no longer unconditionally: each refresh compares every overlay entry against its write-target branch file (one per-entry backend lock — never held across the loop, and no seed lock, so the writable-own → seed lock order can't be violated) and drops the entry when the branch content no longer equals the saved body. That is the cross-replica convergence story: a law saved on this process serves the saved body until someone else (another replica, a direct push) overwrites it on the branch, after which reads converge to the newer branch content within a TTL window — instead of this process pinning its own save forever, which made every cross-replica
If-Matchsave 412 against content the loser could never fetch. A read error keeps the entry (a network blip must not drop a save), and removal is compare-and-remove so a save landing concurrently with the reconcile pass is never dropped.Bounded, per-snapshot body cache (
BODY_CACHE_MAX_ENTRIES = 1024)The save overlay and the lazy-read cache used to share one unbounded map. They are now split: saves stay in the overlay (bounded by laws edited in the traject, survives refreshes), lazily fetched bodies live in a FIFO-bounded per-snapshot cache that the refresh discards — that is how upstream content changes (not just index changes) converge within the TTL. FIFO instead of LRU keeps cache hits on the cheap read guard; at ~tens of KB per body the cap bounds the cache at low tens of MB per traject.
Implements index: stale-while-revalidate + content-addressed memo
LoadedLawnow carriesdisplay_nameandimplements, derived once (one shared YAML parse) at load /update_yaml_contenttime inregelrecht-corpus. The law list readsdisplay_namestraight off the entry; the global implementors lookup is an in-memory reverse scan.try_lock(same single-flight pattern as the snapshot refresh) while concurrent lookups keep serving the carried index — the federated implementors hot path (fix(editor): stop scenario dependency-scan request storm + un-gate result buttons #762) never queues behind a rescan again.regelrecht-corpusonto metadata-onlyLoadedLawentries (content_sha), and the editor keeps a cross-snapshot(source_id, blob_sha) → implementsmemo (sharedArc, like the overlay). A rebuild fetches/parses only bodies whose blob changed; an unchanged corpus rebuilds with zero Contents API calls. Overlay-saved laws are parsed from the overlay body (never memoized under the pre-save sha), loaded (local) laws reuse the corpus-parsed list, and each rebuild swaps the memo wholesale so it stays bounded by the current corpus size.record_savestill updates the live index in place.Small fix
The global scenario GET leaked raw backend error text via
e.to_string(); it now uses the same generic-message/log-details pattern ascorpus_write_error.Possible follow-up
A partial refresh (keep healthy sources' fresh enumeration when one source fails, instead of all-or-nothing serve-stale) would tighten convergence further for multi-source trajects with one flaky seed; deliberately out of scope here.
Tests
content_sha), andupdate_yaml_contentrecompute.record_savein-place index updates, refresh carry-over semantics (shared backends/overlay/changed-cache), SWR (carried index served while a rebuild holds the build lock), memo (unchanged corpus rebuilds with zero fetches; a changed sha refetches only that body), and overlay reconciliation (entry kept while branch == saved body, dropped when the branch moved past the save).traject_reads_test, testcontainers): end-to-end TTL refresh through the real handlers — an upstream-added law appears after the refresh, a fresh save survives it (read-your-writes), and an external overwrite of a saved law converges to the external content instead of being pinned. All 11 tests green.Note:
save_annotations_testandtrajects_testdo not compile onorigin/maineither (stale imports likeSESSION_KEY_ACTIVE_TRAJECT); untouched here.just format,just lint,just build-checkall green.