Skip to content

perf(engine): halve per-write __manifest scans#307

Merged
ragnorc merged 11 commits into
mainfrom
ragnorc/read-lance-table-docs
Jun 27, 2026
Merged

perf(engine): halve per-write __manifest scans#307
ragnorc merged 11 commits into
mainfrom
ragnorc/read-lance-table-docs

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

What & why

Cuts a same-branch write from ~4 to ~2 __manifest scans (measured 50→25 reads at depth 10, 410→205 at depth 100) — the per-open scan-cost half of RFC-013's write-path latency work — with snapshot isolation and the OCC contract preserved. #1a probe-gates the commit-time OCC re-capture to reuse the warm coordinator when a cheap incarnation probe proves it current; #1b folds the post-publish known_state in-memory (byte-identical to a re-scan via a shared assemble_manifest_state, proven by a new fold-equals-reopen test) instead of an O(fragments) re-scan; #1c projects read_manifest_scan to the columns it actually reads. The two remaining publish scans stay O(fragments), bounded by compaction/version-GC (RFC-013 PR1, a follow-up).

Backing issue / RFC

Checklist

  • Change is focused (one logical change) — RFC-013 PR2, the per-open scan-cost work
  • Tests added/updated for behavior changes — post_publish_fold_matches_fresh_reopen (fold byte-identity) and internal_table_scans_grow_without_compaction (served-regime cost tripwire)
  • Public docs updated if user-facing surface changed — N/A (internal write-path perf; no user-facing surface change)
  • Reviewed against docs/dev/invariants.md — no Hard Invariant weakened: snapshot isolation (§3), manifest-atomic visibility (§2), and OCC strict-conflict all preserved; the warm-reuse mirrors the read path's existing probe idiom, and this removes a deny-listed "cold re-derivation on the hot path"

Notes for reviewers

  • #1b is the correctness-critical piece: the in-memory fold must be byte-identical to read_manifest_state or the warm coordinator silently desyncs. The shared assemble_manifest_state removes any dedup/filter/sort divergence; post_publish_fold_matches_fresh_reopen guards the existing ∪ pending == new rows assumption, including the table_locations union for newly-registered tables.
  • #1a only engages on same-branch writes (cross-branch falls through to the cold read — no regression); it keeps the "OCC re-capture must be fresh" contract because probe-match ⟺ warm == fresh, and the publish CAS remains the final arbiter.
  • Verified green locally: writes, consistency, composite_flow, write_cost.
  • Follow-ups (not in this PR): RFC-013 PR1 (bring __manifest into version-GC — bounds the remaining O(fragments) scans and fixes the RustFS open-failure), and PR3 (branch-op de-amplification).

Note

Medium Risk
Changes the write-path manifest snapshot source (in-memory fold + warm OCC reuse); correctness depends on byte-identical fold vs re-scan, but that is centralized in assemble_manifest_state and covered by targeted regression tests, with cold paths and publish CAS unchanged on mismatch.

Overview
RFC-013 PR2 reduces per-write __manifest work (~4 → ~2 full scans; ground-truth IO counts roughly halve and growth slope drops) while keeping OCC and snapshot isolation: publish CAS and cold fallbacks stay the safety net.

#1a — probe-gated OCC re-capture: Transactional commits in commit_all call occ_snapshot_for_branch instead of always cold-opening too fresh_snapshot_for_branch_unchecked`. When the bound branch matches and an incarnation probe says the warm coordinator is current, the warm snapshot is reused; any drift falls back to a full re-read.

#1b — in-memory post-publish state: After publish, ManifestCoordinator adopts PublishOutcome.known_state from fold_inputs (pre-publish state ∪ pending rows) run through shared assemble_manifest_state, skipping the post-commit read_manifest_state re-scan. Same-version owner-branch handoff rows replace existing entries by (table_key, table_version) so the fold matches merge-insert UpdateAll.

#1c — scan projection: read_manifest_scan projects only columns the reducer uses (drops base_objects; object_id only when lineage is collected).

Tests & harness: cost_harness / GraphIoMeter attach one __manifest tracker before graph open so manifest_reads includes warm-handle probe RPCs; new fold-vs-reopen, handoff, warm-probe, and served-regime growth tripwire tests. Dev-only lance-io test-util for read-log diagnostics.

Reviewed by Cursor Bugbot for commit 4abf0dd. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR lowers per-write manifest scan cost in the engine.

  • Reuses a warm OCC snapshot after a matching manifest incarnation probe.
  • Folds post-publish manifest state in memory through the shared state reducer.
  • Narrows manifest scan projection to the columns the scan reads.
  • Adds write-cost and fold-vs-reopen coverage.

Confidence Score: 5/5

This looks safe to merge.

  • No blocking issues found in the changed code.
  • The fold path now uses the same reducer as fresh manifest scans.
  • The changed test-only IO accounting is isolated to test helpers.

Important Files Changed

Filename Overview
crates/omnigraph/src/db/manifest.rs Adopts the publisher-provided manifest state after a successful publish instead of rescanning.
crates/omnigraph/src/db/manifest/publisher.rs Builds the post-publish manifest state from the loaded state plus committed rows.
crates/omnigraph/src/db/manifest/state.rs Shares manifest-state assembly between scan and fold paths and projects scans to the needed columns.
crates/omnigraph/src/db/omnigraph.rs Adds the warm OCC snapshot reuse helper guarded by a manifest incarnation probe.
crates/omnigraph/src/exec/staging.rs Routes transaction-backed OCC recapture through the warm snapshot helper.
crates/omnigraph/tests/helpers/cost.rs Adds persistent manifest IO metering for cost tests.
crates/omnigraph/tests/write_cost.rs Updates write-path cost checks for the new manifest accounting.
crates/omnigraph/tests/writes.rs Adds coverage comparing folded manifest state with a fresh reopen.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[Write transaction commit] --> B[Probe manifest incarnation]
  B -->|matches warm coordinator| C[Reuse warm snapshot]
  B -->|mismatch or other branch| D[Open fresh snapshot]
  C --> E[Check expected table versions]
  D --> E
  E --> F[Build pending manifest rows]
  F --> G[Fold existing state with pending rows]
  G --> H[Publish manifest rows]
  H --> I[Adopt folded known_state]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
  A[Write transaction commit] --> B[Probe manifest incarnation]
  B -->|matches warm coordinator| C[Reuse warm snapshot]
  B -->|mismatch or other branch| D[Open fresh snapshot]
  C --> E[Check expected table versions]
  D --> E
  E --> F[Build pending manifest rows]
  F --> G[Fold existing state with pending rows]
  G --> H[Publish manifest rows]
  H --> I[Adopt folded known_state]
Loading

Reviews (4): Last reviewed commit: "docs(rfc-013): refresh PR307 handoff sta..." | Re-trigger Greptile

Context used:

  • Context used - AGENTS.md (source)
  • Context used - CLAUDE.md (source)

ragnorc added 2 commits June 25, 2026 21:55
Adds `internal_table_scans_grow_without_compaction`, the served-regime twin of
`internal_table_scans_are_flat_in_history`. The flat gate `optimize()`s before
every measured write, so it only proves the *compacted* invariant and stays
green even when a served graph's per-write `__manifest` scan amplifies without
bound. This tripwire measures the uncompacted regime and asserts the scan
grows — green today, and it flips RED once the amplification is bounded
(write-path warm-reuse + version-GC), at which point it inverts to a permanent
`assert_flat` gate. RFC-013.
Cuts a same-branch write from ~4 to ~2 `__manifest` scans (measured 50->25 at
depth 10, 410->205 at depth 100) with the OCC contract and snapshot isolation
preserved:

- #1a probe-gate the OCC re-capture in `commit_all` via `occ_snapshot_for_branch`
  (mirrors the read path's `resolve_target_inner`): reuse the warm coordinator
  when a cheap incarnation probe proves it current, fall through to a cold read
  on mismatch.
- #1b fold the post-publish `known_state` in-memory from `existing_versions` plus
  the committed rows instead of an O(fragments) re-scan; extracted the shared
  `assemble_manifest_state` reduction so the fold is byte-identical to a scan,
  proven by the new `post_publish_fold_matches_fresh_reopen` test.
- #1c project `read_manifest_scan` to the columns it reads (drop `base_objects`
  always, `object_id` on the table-state path).

The two remaining publish scans (`load_publish_state` and the `use_index(false)`
merge-insert join) stay O(fragments), bounded by compaction/version-GC (RFC-013
PR1, not in this change).
Comment thread crates/omnigraph/src/db/manifest/publisher.rs Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4ac3cde485

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

row.table_key
))
})?,
table_branch: row.table_branch.clone(),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prefer pending owner handoff in folded state

When a publish updates an existing table_version row without changing the version (the owner-branch handoff explicitly allowed above, e.g. moving a table from an inherited branch to the active branch at the same Lance version), this fold appends the pending row after existing_versions. assemble_manifest_state keeps the first entry when table_version is equal, so the live coordinator retains the old table_branch even though the merge-insert updated the row; a fresh reopen is correct, but same-handle snapshots and subsequent writes on that branch stay stale until refresh. Please replace/remove the existing (table_key, table_version) entry before folding, or otherwise make equal-version handoff rows prefer the pending value.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid and fixed in 245cb26 (regression test in 5537cd9, confirmed red→green).

Root cause exactly as described: version_object_id(table_key, version) is deterministic, so an owner-branch handoff is an in-place merge-insert UpdateAll (one row in __manifest), but the fold appended the pending row after existing_versions and assemble_manifest_state kept the first equal-version entry, leaving the warm coordinator on the stale table_branch.

Fix: fold_inputs now keys version entries by (table_key, table_version) (the merge-insert identity) so a pending row replaces the pre-publish entry, mirroring UpdateAll on disk. The fold input now carries the same one-row-per-version uniqueness a fresh scan does, so both feed assemble_manifest_state equivalent inputs. New test test_post_publish_fold_reflects_owner_branch_handoff commits a handoff through the coordinator (exercising the fold) and asserts the warm snapshot equals a fresh reopen.

ragnorc added 2 commits June 25, 2026 22:36
The PR #307 post-publish fold appends pending table_version rows after
existing_versions, and assemble_manifest_state keeps the first equal-version
entry. A same-version owner-branch handoff updates a table_version row in place
at the same Lance version with a new table_branch (merge-insert UpdateAll on the
deterministic version_object_id), so the warm coordinator keeps the stale fork
while a fresh re-scan reflects the handoff.

This test commits a handoff through the coordinator commit path (exercising the
fold) and asserts the warm snapshot equals a fresh reopen. It is red against the
current fold; the following commit turns it green. Flagged by Cursor Bugbot
(High) and ChatGPT Codex (P2) on PR #307.
fold_inputs now keys version entries by (table_key, table_version), the manifest
row identity carried by the deterministic version_object_id that the merge-insert
CAS uses. A pending row at the same identity replaces the pre-publish entry,
mirroring merge-insert UpdateAll on disk. Previously the fold appended pending
rows after existing_versions, so an owner-branch handoff left two equal-version
entries and assemble_manifest_state retained the stale one.

The fold input now carries the same one-row-per-(table_key, version) uniqueness a
fresh scan produces, so both feed assemble_manifest_state equivalent inputs and
the warm known_state stays byte-identical to read_manifest_state. This corrects
the derivation's identity model structurally and applies to any same-version
in-place update. Closes the PR #307 review finding.
Comment on lines +888 to +890
if coord.probe_latest_incarnation().await?.matches(&held) {
return Ok(coord.snapshot());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Require strong incarnation

This fast path reuses the warm snapshot whenever matches(&held) returns true, but the incarnation comparison can fall back to version-only when the store does not provide an ETag or timestamp. In that state, a branch that is deleted and recreated at the same manifest version can compare equal even though its manifest rows are different. The old path always rebuilt the snapshot from the current manifest; this path can instead derive OCC expectations and table locations from the old branch state. Please fall back to the cold read unless the probe includes a strong identity for the manifest contents.

Context Used: AGENTS.md (source)

Fix in Claude Code

@ragnorc ragnorc changed the title perf(engine): halve per-write __manifest scans (RFC-013 PR2) perf(engine): halve per-write __manifest scans Jun 25, 2026
ragnorc added 6 commits June 26, 2026 18:58
Gives IoStats.requests + assert_io_eq!, used by the cost harness to record the
__manifest read log (method + path) for failure diagnostics. Dev-dependency only,
so production builds (which exclude dev-deps) never compile it.
Consolidate the per-op ProbeHandles into OpProbes plus a persistent GraphIoMeter,
and read per-op deltas via lance's incremental_stats() (get-and-reset) instead of
cumulative stats() -- the upstream per-request idiom
(rust/lance/src/dataset/tests/dataset_io.rs). Add cost_harness(body): it installs
one __manifest tracker for a whole test body, so the graph opens under it and
every coordinator handle (init plus each post-publish reassignment) carries the
same tracker. measure reuses that ambient tracker when present, making
manifest_reads ground truth (warm probe plus cold scans, handle-age-irrelevant);
outside cost_harness it falls back to a fresh per-op tracker (today's behavior).
The body future is boxed so wrapping a whole test body does not overflow the test
thread's stack.

Also stash each op's __manifest read log on the meter for assert_io_eq!-style
failure diagnostics (last_manifest_reads).

Behavior-preserving: no test wraps its body in cost_harness yet, so measure takes
the fallback path and every cost number is unchanged. write_cost and
warm_read_cost stay green.
Wrap the three __manifest-asserting tests (flat, grow, ceiling) in cost_harness so
manifest_reads is ground truth -- the warm-coordinator freshness probe rides a
long-lived handle a per-op tracker installed at measure time cannot see. The
flat/grow gates are depth-difference assertions, so the constant per-write probe
offset cancels and they pass unchanged; the absolute ceiling is retightened from
34 to 24 (~18 measured = ~15 publish-path scans + ~3 probe RPCs) with the read log
dumped on a breach.

Add manifest_reads_capture_warm_probe: it measures the same warm write fresh-only
and under cost_harness and asserts ground truth strictly exceeds fresh-only by the
probe's RPCs (11 vs 14). Reverting the ground-truth wiring makes the two equal, so
this guards that a write's warm-handle probe (3 object-store RPCs that were counted
as a single version_probe) cannot silently escape manifest_reads again.
Wrap the warm (== 0) manifest gates in cost_harness so manifest_reads is ground
truth. A read's freshness probe is served from Lance's cached manifest at 0
object-store reads (unlike a write's probe, which re-reads after its commit), so
the == 0 assertions hold with no re-baseline -- and now also catch any future
warm-handle scan a per-op tracker would miss. The stale (> 0) tests are unaffected
either way and stay on the fresh fallback.
The cost harness now reads incremental_stats() deltas and, under cost_harness,
installs one __manifest tracker before the graph opens so manifest_reads is ground
truth (handle-age-irrelevant). Note that version_probes is the probe call count and
that ground truth reveals a write's probe does ~3 object-store RPCs.
…anded)

Prepend a current-state section (§A) for the __manifest scan-amplification /
version-chain thread: the problem, what landed on main (step 2a, Phase 7 #299),
what is in flight on this branch / PR #307 (PR2 scan-halving, the owner-branch
handoff fold fix, the PR2.1 ground-truth cost harness), the accurate measurement
(per-write __manifest ops ~50->410 pre-PR2 vs 28->208 ground truth; the hidden
3-RPC freshness probe), the remaining roadmap (PR1a manual cleanup, PR3-scoping,
deferred PR1b/PR4), critical files, and gotchas.

Staleness fixes: Phase 7 was listed as a future "step 4" but landed as #299, so
mark it LANDED in the TL;DR landed list and in the remaining-steps section.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit eac7872. Configure here.

let held = coord.manifest_incarnation();
if coord.probe_latest_incarnation().await?.matches(&held) {
return Ok(coord.snapshot());
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weak incarnation reuses OCC snapshot

Medium Severity

occ_snapshot_for_branch reuses the warm coordinator snapshot when probe_latest_incarnation matches the held incarnation, but ManifestIncarnation::matches treats equal manifest version as sufficient when ETag and timestamp are unavailable. After a branch delete/recreate or other same-version manifest change, commit-time OCC can skip the cold rescan and derive table pins from stale known_state instead of current rows.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit eac7872. Configure here.

@ragnorc ragnorc merged commit a7d4cba into main Jun 27, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant