perf(engine): halve per-write __manifest scans#307
Conversation
Adds `internal_table_scans_grow_without_compaction`, the served-regime twin of `internal_table_scans_are_flat_in_history`. The flat gate `optimize()`s before every measured write, so it only proves the *compacted* invariant and stays green even when a served graph's per-write `__manifest` scan amplifies without bound. This tripwire measures the uncompacted regime and asserts the scan grows — green today, and it flips RED once the amplification is bounded (write-path warm-reuse + version-GC), at which point it inverts to a permanent `assert_flat` gate. RFC-013.
Cuts a same-branch write from ~4 to ~2 `__manifest` scans (measured 50->25 at depth 10, 410->205 at depth 100) with the OCC contract and snapshot isolation preserved: - #1a probe-gate the OCC re-capture in `commit_all` via `occ_snapshot_for_branch` (mirrors the read path's `resolve_target_inner`): reuse the warm coordinator when a cheap incarnation probe proves it current, fall through to a cold read on mismatch. - #1b fold the post-publish `known_state` in-memory from `existing_versions` plus the committed rows instead of an O(fragments) re-scan; extracted the shared `assemble_manifest_state` reduction so the fold is byte-identical to a scan, proven by the new `post_publish_fold_matches_fresh_reopen` test. - #1c project `read_manifest_scan` to the columns it reads (drop `base_objects` always, `object_id` on the table-state path). The two remaining publish scans (`load_publish_state` and the `use_index(false)` merge-insert join) stay O(fragments), bounded by compaction/version-GC (RFC-013 PR1, not in this change).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ac3cde485
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| row.table_key | ||
| )) | ||
| })?, | ||
| table_branch: row.table_branch.clone(), |
There was a problem hiding this comment.
Prefer pending owner handoff in folded state
When a publish updates an existing table_version row without changing the version (the owner-branch handoff explicitly allowed above, e.g. moving a table from an inherited branch to the active branch at the same Lance version), this fold appends the pending row after existing_versions. assemble_manifest_state keeps the first entry when table_version is equal, so the live coordinator retains the old table_branch even though the merge-insert updated the row; a fresh reopen is correct, but same-handle snapshots and subsequent writes on that branch stay stale until refresh. Please replace/remove the existing (table_key, table_version) entry before folding, or otherwise make equal-version handoff rows prefer the pending value.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Valid and fixed in 245cb26 (regression test in 5537cd9, confirmed red→green).
Root cause exactly as described: version_object_id(table_key, version) is deterministic, so an owner-branch handoff is an in-place merge-insert UpdateAll (one row in __manifest), but the fold appended the pending row after existing_versions and assemble_manifest_state kept the first equal-version entry, leaving the warm coordinator on the stale table_branch.
Fix: fold_inputs now keys version entries by (table_key, table_version) (the merge-insert identity) so a pending row replaces the pre-publish entry, mirroring UpdateAll on disk. The fold input now carries the same one-row-per-version uniqueness a fresh scan does, so both feed assemble_manifest_state equivalent inputs. New test test_post_publish_fold_reflects_owner_branch_handoff commits a handoff through the coordinator (exercising the fold) and asserts the warm snapshot equals a fresh reopen.
The PR #307 post-publish fold appends pending table_version rows after existing_versions, and assemble_manifest_state keeps the first equal-version entry. A same-version owner-branch handoff updates a table_version row in place at the same Lance version with a new table_branch (merge-insert UpdateAll on the deterministic version_object_id), so the warm coordinator keeps the stale fork while a fresh re-scan reflects the handoff. This test commits a handoff through the coordinator commit path (exercising the fold) and asserts the warm snapshot equals a fresh reopen. It is red against the current fold; the following commit turns it green. Flagged by Cursor Bugbot (High) and ChatGPT Codex (P2) on PR #307.
fold_inputs now keys version entries by (table_key, table_version), the manifest row identity carried by the deterministic version_object_id that the merge-insert CAS uses. A pending row at the same identity replaces the pre-publish entry, mirroring merge-insert UpdateAll on disk. Previously the fold appended pending rows after existing_versions, so an owner-branch handoff left two equal-version entries and assemble_manifest_state retained the stale one. The fold input now carries the same one-row-per-(table_key, version) uniqueness a fresh scan produces, so both feed assemble_manifest_state equivalent inputs and the warm known_state stays byte-identical to read_manifest_state. This corrects the derivation's identity model structurally and applies to any same-version in-place update. Closes the PR #307 review finding.
| if coord.probe_latest_incarnation().await?.matches(&held) { | ||
| return Ok(coord.snapshot()); | ||
| } |
There was a problem hiding this comment.
This fast path reuses the warm snapshot whenever matches(&held) returns true, but the incarnation comparison can fall back to version-only when the store does not provide an ETag or timestamp. In that state, a branch that is deleted and recreated at the same manifest version can compare equal even though its manifest rows are different. The old path always rebuilt the snapshot from the current manifest; this path can instead derive OCC expectations and table locations from the old branch state. Please fall back to the cold read unless the probe includes a strong identity for the manifest contents.
Context Used: AGENTS.md (source)
Gives IoStats.requests + assert_io_eq!, used by the cost harness to record the __manifest read log (method + path) for failure diagnostics. Dev-dependency only, so production builds (which exclude dev-deps) never compile it.
Consolidate the per-op ProbeHandles into OpProbes plus a persistent GraphIoMeter, and read per-op deltas via lance's incremental_stats() (get-and-reset) instead of cumulative stats() -- the upstream per-request idiom (rust/lance/src/dataset/tests/dataset_io.rs). Add cost_harness(body): it installs one __manifest tracker for a whole test body, so the graph opens under it and every coordinator handle (init plus each post-publish reassignment) carries the same tracker. measure reuses that ambient tracker when present, making manifest_reads ground truth (warm probe plus cold scans, handle-age-irrelevant); outside cost_harness it falls back to a fresh per-op tracker (today's behavior). The body future is boxed so wrapping a whole test body does not overflow the test thread's stack. Also stash each op's __manifest read log on the meter for assert_io_eq!-style failure diagnostics (last_manifest_reads). Behavior-preserving: no test wraps its body in cost_harness yet, so measure takes the fallback path and every cost number is unchanged. write_cost and warm_read_cost stay green.
Wrap the three __manifest-asserting tests (flat, grow, ceiling) in cost_harness so manifest_reads is ground truth -- the warm-coordinator freshness probe rides a long-lived handle a per-op tracker installed at measure time cannot see. The flat/grow gates are depth-difference assertions, so the constant per-write probe offset cancels and they pass unchanged; the absolute ceiling is retightened from 34 to 24 (~18 measured = ~15 publish-path scans + ~3 probe RPCs) with the read log dumped on a breach. Add manifest_reads_capture_warm_probe: it measures the same warm write fresh-only and under cost_harness and asserts ground truth strictly exceeds fresh-only by the probe's RPCs (11 vs 14). Reverting the ground-truth wiring makes the two equal, so this guards that a write's warm-handle probe (3 object-store RPCs that were counted as a single version_probe) cannot silently escape manifest_reads again.
Wrap the warm (== 0) manifest gates in cost_harness so manifest_reads is ground truth. A read's freshness probe is served from Lance's cached manifest at 0 object-store reads (unlike a write's probe, which re-reads after its commit), so the == 0 assertions hold with no re-baseline -- and now also catch any future warm-handle scan a per-op tracker would miss. The stale (> 0) tests are unaffected either way and stay on the fresh fallback.
The cost harness now reads incremental_stats() deltas and, under cost_harness, installs one __manifest tracker before the graph opens so manifest_reads is ground truth (handle-age-irrelevant). Note that version_probes is the probe call count and that ground truth reveals a write's probe does ~3 object-store RPCs.
…anded) Prepend a current-state section (§A) for the __manifest scan-amplification / version-chain thread: the problem, what landed on main (step 2a, Phase 7 #299), what is in flight on this branch / PR #307 (PR2 scan-halving, the owner-branch handoff fold fix, the PR2.1 ground-truth cost harness), the accurate measurement (per-write __manifest ops ~50->410 pre-PR2 vs 28->208 ground truth; the hidden 3-RPC freshness probe), the remaining roadmap (PR1a manual cleanup, PR3-scoping, deferred PR1b/PR4), critical files, and gotchas. Staleness fixes: Phase 7 was listed as a future "step 4" but landed as #299, so mark it LANDED in the TL;DR landed list and in the remaining-steps section.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit eac7872. Configure here.
| let held = coord.manifest_incarnation(); | ||
| if coord.probe_latest_incarnation().await?.matches(&held) { | ||
| return Ok(coord.snapshot()); | ||
| } |
There was a problem hiding this comment.
Weak incarnation reuses OCC snapshot
Medium Severity
occ_snapshot_for_branch reuses the warm coordinator snapshot when probe_latest_incarnation matches the held incarnation, but ManifestIncarnation::matches treats equal manifest version as sufficient when ETag and timestamp are unavailable. After a branch delete/recreate or other same-version manifest change, commit-time OCC can skip the cold rescan and derive table pins from stale known_state instead of current rows.
Reviewed by Cursor Bugbot for commit eac7872. Configure here.


What & why
Cuts a same-branch write from ~4 to ~2
__manifestscans (measured 50→25 reads at depth 10, 410→205 at depth 100) — the per-open scan-cost half of RFC-013's write-path latency work — with snapshot isolation and the OCC contract preserved. #1a probe-gates the commit-time OCC re-capture to reuse the warm coordinator when a cheap incarnation probe proves it current; #1b folds the post-publishknown_statein-memory (byte-identical to a re-scan via a sharedassemble_manifest_state, proven by a new fold-equals-reopen test) instead of an O(fragments) re-scan; #1c projectsread_manifest_scanto the columns it actually reads. The two remaining publish scans stay O(fragments), bounded by compaction/version-GC (RFC-013 PR1, a follow-up).Backing issue / RFC
Checklist
post_publish_fold_matches_fresh_reopen(fold byte-identity) andinternal_table_scans_grow_without_compaction(served-regime cost tripwire)Notes for reviewers
read_manifest_stateor the warm coordinator silently desyncs. The sharedassemble_manifest_stateremoves any dedup/filter/sort divergence;post_publish_fold_matches_fresh_reopenguards theexisting ∪ pending == new rowsassumption, including thetable_locationsunion for newly-registered tables.writes,consistency,composite_flow,write_cost.__manifestinto version-GC — bounds the remaining O(fragments) scans and fixes the RustFS open-failure), and PR3 (branch-op de-amplification).Note
Medium Risk
Changes the write-path manifest snapshot source (in-memory fold + warm OCC reuse); correctness depends on byte-identical fold vs re-scan, but that is centralized in
assemble_manifest_stateand covered by targeted regression tests, with cold paths and publish CAS unchanged on mismatch.Overview
RFC-013 PR2 reduces per-write
__manifestwork (~4 → ~2 full scans; ground-truth IO counts roughly halve and growth slope drops) while keeping OCC and snapshot isolation: publish CAS and cold fallbacks stay the safety net.#1a — probe-gated OCC re-capture: Transactional commits in
commit_allcallocc_snapshot_for_branchinstead of always cold-opening too fresh_snapshot_for_branch_unchecked`. When the bound branch matches and an incarnation probe says the warm coordinator is current, the warm snapshot is reused; any drift falls back to a full re-read.#1b — in-memory post-publish state: After publish,
ManifestCoordinatoradoptsPublishOutcome.known_statefromfold_inputs(pre-publish state ∪ pending rows) run through sharedassemble_manifest_state, skipping the post-commitread_manifest_statere-scan. Same-version owner-branch handoff rows replace existing entries by(table_key, table_version)so the fold matches merge-insertUpdateAll.#1c — scan projection:
read_manifest_scanprojects only columns the reducer uses (dropsbase_objects;object_idonly when lineage is collected).Tests & harness:
cost_harness/GraphIoMeterattach one__manifesttracker before graph open somanifest_readsincludes warm-handle probe RPCs; new fold-vs-reopen, handoff, warm-probe, and served-regime growth tripwire tests. Dev-onlylance-iotest-utilfor read-log diagnostics.Reviewed by Cursor Bugbot for commit 4abf0dd. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR lowers per-write manifest scan cost in the engine.
Confidence Score: 5/5
This looks safe to merge.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Write transaction commit] --> B[Probe manifest incarnation] B -->|matches warm coordinator| C[Reuse warm snapshot] B -->|mismatch or other branch| D[Open fresh snapshot] C --> E[Check expected table versions] D --> E E --> F[Build pending manifest rows] F --> G[Fold existing state with pending rows] G --> H[Publish manifest rows] H --> I[Adopt folded known_state]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[Write transaction commit] --> B[Probe manifest incarnation] B -->|matches warm coordinator| C[Reuse warm snapshot] B -->|mismatch or other branch| D[Open fresh snapshot] C --> E[Check expected table versions] D --> E E --> F[Build pending manifest rows] F --> G[Fold existing state with pending rows] G --> H[Publish manifest rows] H --> I[Adopt folded known_state]Reviews (4): Last reviewed commit: "docs(rfc-013): refresh PR307 handoff sta..." | Re-trigger Greptile
Context used: