perf(engine): scope CSR topology index to traversed edges, reuse it cross-branch#312

Merged
ragnorc merged 7 commits into main from ragnorc/edge-join-timeout
Jun 28, 2026

Conversation

@ragnorc ragnorc commented Jun 28, 2026

Contributor

What & why

A single-edge graph join (match { $x: ExternalID $x identifiesPerson $p }) timed out at 40-60s (428s on the first traversal), while reading either endpoint alone was fast. The cause: the in-memory CSR topology index was built over every edge type in the catalog and cache-keyed by the resolved snapshot id, so one traversal full-scanned every edge table in the graph and a lazy-fork branch cold-rebuilt main's index. This PR scopes the build to only the edge types the query traverses (referenced_edge_types over Expand/AntiJoin) and re-keys the RuntimeCache by each edge table's physical identity (table_key, version, table_branch, e_tag), so a lazy-fork branch reuses main's built index (on local FS, where e_tag is None, it falls back to refresh-invalidation). Supersedes #276 (same fix, consolidated onto current main with a cleaner, exhaustive-match scoping path).

Backing issue / RFC

Checklist

  • Change is focused (one logical change: scoped + cross-branch-reused topology index)
  • Tests added (warm_read_cost.rs: fresh_branch_traversal_reuses_main_graph_index, single_edge_query_builds_only_referenced_edge; engine referenced_edge_types unit tests)
  • Public docs updated (invariants, testing, execution, architecture)
  • Reviewed against docs/dev/invariants.md — no Hard Invariant weakened (invariant 15: derived state held warm + keyed by physical source); no deny-list item hit

Notes for reviewers

The two cuts are correct-by-construction: build by referenced scope, key by physical edge-table identity. Verified the at-risk incarnation tests (recreated_branch_traversal_uses_graph_index_incarnation, recreated_branch_owned_table_handle_uses_table_etag) still pass — dropping snapshot_id is safe because those rely on refresh-invalidation, not the cache key. One documented residual: on stores without per-table e_tags (local FS) a branch recreated at the same version reuses the refresh-invalidation fallback; production object stores carry real e_tags. cargo test --workspace is green except two pre-existing environmental system_local failures (spawned server bearer-token in this sandbox; fail identically on a clean base).


Note

Medium Risk
Changes query-time CSR build scope and graph-index cache invalidation semantics on a hot read path; mitigated by exhaustive IR matching, endpoint-aware keys, and cost/S3 tests, with a documented local-FS e_tag fallback gap.

Overview
Fixes 40s+ timeouts on single-edge joins by stopping whole-catalog CSR builds and snapshot-id-only cache keys.

Scoped builds: Query execution derives referenced_edge_types from the IR (Expand plus nested AntiJoin inners) and passes that set into GraphIndexHandle / RuntimeCache::graph_index, so the CSR path scans only those edge tables—not every type in the catalog.
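To make the scoping step concrete, here is a minimal sketch of collecting traversed edge types from a pipeline. The `IrOp` enum, its variants' fields, and the `blockedBy` edge type used in examples are hypothetical simplifications; only the name `referenced_edge_types` and the Expand/AntiJoin recursion come from the PR description, and the real signature may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified query IR; the real IR has more operators.
#[allow(dead_code)]
#[derive(Debug)]
enum IrOp {
    /// Scan nodes of one type; touches no edge table.
    NodeScan { node_type: String },
    /// Traverse one edge type from already-bound nodes.
    Expand { edge_type: String },
    /// Keep rows for which the inner pipeline produces no match.
    AntiJoin { inner: Vec<IrOp> },
    /// Row filter; touches no edge table.
    Filter,
}

/// Collect every edge type the pipeline traverses, including AntiJoin inners.
fn referenced_edge_types(pipeline: &[IrOp]) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    collect(pipeline, &mut out);
    out
}

fn collect(pipeline: &[IrOp], out: &mut BTreeSet<String>) {
    for op in pipeline {
        // No wildcard arm: adding a new IrOp variant is a compile error here,
        // so a new traversal operator cannot be silently left out of the scope.
        match op {
            IrOp::Expand { edge_type } => {
                out.insert(edge_type.clone());
            }
            IrOp::AntiJoin { inner } => collect(inner, out),
            IrOp::NodeScan { .. } | IrOp::Filter => {}
        }
    }
}
```

Under these assumptions, a pipeline with one Expand over identifiesPerson yields a one-element set, so the build scans a single edge table regardless of how many edge types the catalog holds.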

Cross-branch reuse (A1): The RuntimeCache key drops synthetic snapshot_id and keys on each included edge table's physical identity (table_key, version, table_branch, e_tag) plus (from_type, to_type) endpoints, so lazy-fork branches with unchanged edge tables reuse main's cached index; endpoint remaps get a distinct key.
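The key shape described above might look roughly like the following sketch. `GraphIndexCacheKey` and its `edge_tables` field appear in the PR diff; the `EdgeTableIdentity` struct, the field types, the `cache_key` helper, and all literal values are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Assumed shape: one edge table's physical identity plus its endpoints.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct EdgeTableIdentity {
    table_key: String,
    version: u64,
    /// Branch that physically owns the table (main's for a lazy fork).
    table_branch: String,
    /// None on stores without per-table e_tags, such as local FS.
    e_tag: Option<String>,
    from_type: String,
    to_type: String,
}

/// Cache key over the scoped edge tables only (edge type -> identity).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct GraphIndexCacheKey {
    edge_tables: BTreeMap<String, EdgeTableIdentity>,
}

/// Build the key. The snapshot id is accepted but deliberately ignored,
/// mirroring the PR's removal of snapshot_id from the key.
fn cache_key(_snapshot_id: &str, scoped: &[(&str, EdgeTableIdentity)]) -> GraphIndexCacheKey {
    GraphIndexCacheKey {
        edge_tables: scoped
            .iter()
            .map(|(edge_type, id)| (edge_type.to_string(), id.clone()))
            .collect(),
    }
}

/// Hypothetical identity for the identifiesPerson edge table.
fn identifies_person(to_type: &str) -> EdgeTableIdentity {
    EdgeTableIdentity {
        table_key: "edges/identifiesPerson".to_string(),
        version: 7,
        table_branch: "main".to_string(),
        e_tag: Some("etag-1".to_string()),
        from_type: "ExternalID".to_string(),
        to_type: to_type.to_string(),
    }
}
```

In this sketch a lazy-fork branch that resolves to a different snapshot id but the same physical table produces an equal key (a cache hit), while repointing the edge's `to_type` produces a different key (a rebuild).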

Testability: Adds with_traversal_mode (task-local, overrides OMNIGRAPH_TRAVERSAL_MODE) and graph_build_count / graph_edges_built probes; cost tests in warm_read_cost.rs and S3 s3_fresh_branch_traversal_reuses_main_graph_index_with_etags; proptest/traversal tests drop #[serial] env mutation.
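As a rough illustration of how build-count probes let a cost test assert "zero rebuilds", consider the sketch below. The struct, the simplified key (just the scoped edge-type set), and the assumption that graph_edges_built counts edge types rather than edge rows are all hypothetical; only the probe names come from the PR.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for the cache plus its two probes.
#[derive(Default)]
struct ScopedIndexCache {
    /// Built indexes keyed by the scoped edge-type set (simplified key).
    built: HashMap<BTreeSet<String>, Vec<String>>,
    /// Probe: number of index builds, i.e. cache misses.
    graph_build_count: u64,
    /// Probe (assumed meaning): edge types included across all builds.
    graph_edges_built: u64,
}

impl ScopedIndexCache {
    /// Return the index for `scope`, building it only on a cache miss.
    fn graph_index(&mut self, scope: &BTreeSet<String>) -> &Vec<String> {
        if !self.built.contains_key(scope) {
            self.graph_build_count += 1;
            self.graph_edges_built += scope.len() as u64;
            // Placeholder "index": just the list of edge types it covers.
            let index: Vec<String> = scope.iter().cloned().collect();
            self.built.insert(scope.clone(), index);
        }
        &self.built[scope]
    }
}
```

A test in this style runs the same single-edge query twice, then checks that graph_build_count is 1 and graph_edges_built is 1, which corresponds to the two properties the cost tests are named after.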

Reviewed by Cursor Bugbot for commit 89011eb.

Greptile Summary

This PR improves graph traversal caching and CSR index build cost. The main changes are:

  • Scoped topology-index builds to edge types used by the query.
  • Reused cached graph indexes across branches with matching edge-table identity.
  • Added endpoint mapping to the cache key for schema remaps.
  • Added probes and tests for scoped builds and branch reuse.
  • Updated developer docs for the new traversal-cache behavior.

Confidence Score: 5/5

This looks safe to merge.

  • No blocking issues found in the changed code.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/omnigraph/src/runtime_cache.rs | Updates graph-index cache keys to include scoped edge-table identity, e_tag, and endpoint mapping. |
| crates/omnigraph/src/exec/query.rs | Collects traversed edge types from query IR and passes scoped edge maps into graph-index handles. |
| crates/omnigraph/src/instrumentation.rs | Adds graph build counters and a task-local traversal-mode override for tests. |
| crates/omnigraph/src/db/omnigraph/table_ops.rs | Keeps whole-graph index construction available while query execution uses scoped construction. |
| crates/omnigraph/src/graph_index/mod.rs | Documents the state that must remain reflected in the graph-index cache key. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
  Q[Query IR pipeline] --> R[Collect traversed edge types]
  R --> H[GraphIndexHandle]
  H --> C[RuntimeCache graph index]
  C --> K[Key by table identity, e_tag, endpoints]
  K -->|cache hit| I[Reuse GraphIndex]
  K -->|cache miss| B[Build scoped GraphIndex]
  B --> I
  I --> E[CSR traversal execution]

Reviews (3): Last reviewed commit: "test(engine): scoped with_traversal_mode..."

Context used:

  • AGENTS.md
  • CLAUDE.md

ragnorc added 3 commits June 28, 2026 18:01
…it cross-branch

The in-memory CSR graph index was built over every edge type in the catalog and
cache-keyed by the resolved snapshot id, so a single-edge join
(`$x identifiesPerson $p`) full-scanned every edge table in the graph (the
40-60s / 428s-first-traversal hang), and a lazy-fork branch cold-rebuilt main's
index. Two cuts close that:

- Scope (A2): build only the edge types the query traverses
  (`referenced_edge_types` over Expand/AntiJoin, exhaustive match), not the whole
  catalog. Threaded through GraphIndexHandle -> RuntimeCache; cache-keyed on the
  scoped set.
- Cross-branch reuse (A1): key RuntimeCache by each edge table's physical identity
  (table_key, version, table_branch, e_tag) instead of the snapshot id, so a
  lazy-fork branch whose edge tables physically are main's reuses main's built
  index. Local-FS (e_tag None) falls back to refresh-invalidation.

Adds graph_build_count/graph_edges_built probes for the cost tests.

fresh_branch_traversal_reuses_main_graph_index (A1: a lazy-fork branch reuses
main's cached CSR index, 0 rebuilds) and single_edge_query_builds_only_referenced_edge
(A2: a one-edge query builds only that edge, not the whole catalog), via the
graph_build_count/graph_edges_built probes. Forced CSR mode, #[serial]. Updates the
recreated-branch incarnation test comment for the physical-identity key.

Document the scoped CSR build and the physical-identity (e_tag) graph-index cache
key with its local-FS refresh-invalidation fallback across invariants, testing,
execution, and architecture docs.
Comment thread crates/omnigraph/src/runtime_cache.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 18de823048

snapshot_id: resolved.snapshot_id.as_str().to_string(),
edge_tables,
}
GraphIndexCacheKey { edge_tables }

P2: Preserve incarnation for e_tag-less graph-index keys

On local FS or any store where version_metadata.e_tag() is None, a deleted-and-recreated branch can reuse the same edge-table (table_key, table_version, table_branch) while containing different src/dst rows. If a long-lived handle is bound to another branch and reads that recreated branch via the cold different-branch resolve path, no same-branch refresh invalidates RuntimeCache, so dropping snapshot_id here lets the old CSR entry be reused and traversal can return stale neighbors. Keep a branch/manifest incarnation fallback when e_tags are absent, or invalidate the graph-index cache for cold branch resolves.

fn set_csr_mode() {
// SAFE: every test that sets this is `#[serial]`, so no thread reads the env
// during the write.
unsafe { std::env::set_var("OMNIGRAPH_TRAVERSAL_MODE", "csr") };

P2: Isolate process-global traversal-mode overrides

These new tests mutate the process-wide OMNIGRAPH_TRAVERSAL_MODE, but marking only the mutating tests #[serial] does not stop Cargo from running the other non-serial tests in this same binary concurrently. Several existing warm-read tests issue traversal queries, so if they overlap this window they can be forced onto the CSR path and observe different cache/IO behavior, creating nondeterministic test failures; isolate the forced-mode tests in a separate binary or guard/serialize all tests that can read this env var.

ragnorc added 3 commits June 28, 2026 18:18
The two topology-build cost tests force OMNIGRAPH_TRAVERSAL_MODE via process-
global env mutation, which query.rs reads. In warm_read_cost.rs (a mixed
serial/non-serial binary) a concurrent non-serial traversal test could race the
env write (UB under Rust 2024's unsafe set_var contract) and be forced onto CSR.
Move them to traversal_indexed.rs — the dedicated all-serial binary with no
non-serial env reader (its documented-safe home) — and add a ModeGuard RAII
helper so a panic mid-test cannot leak the override. Addresses a PR review (P2).

The A1 physical-identity key omitted the edge's (from_type, to_type). GraphIndex
keys its TypeIndexes by those endpoint names and execute_expand_csr looks them up
by the current catalog's names, so a schema repoint of an edge type that leaves
the edge table's physical identity unchanged would reuse a stale index built with
the old endpoint namespace and fail with "no type index for <new type>". The old
snapshot_id (carrying the manifest version) masked this; dropping it exposed it.
Adding the endpoints to the key rebuilds on a repoint while preserving lazy-fork
cross-branch reuse (same endpoints -> same key). Addresses a PR review (P1).

ragnorc commented Jun 28, 2026

Contributor Author

Addressed the review comments + merged latest main (resolved the testing.md conflict from #311's commit-graph retirement):

  • P1 — schema mapping cache collision (runtime_cache.rs): valid, fixed in 4d8151e. The key now includes the edge's (from_type, to_type) endpoint mapping. GraphIndex keys its TypeIndexes by those names, so a schema repoint that leaves the edge table's physical identity unchanged now rebuilds instead of reusing a stale index (which would fail with no type index for <new type>). Lazy-fork cross-branch reuse is preserved (same endpoints → same key). Regression guard: endpoint_remap_at_same_physical_identity_splits_cache_key.

  • P2 — process-global traversal-mode env in a mixed binary (warm_read_cost.rs): valid, fixed in ea71fc5. Moved both CSR-forced topology cost tests to traversal_indexed.rs — the dedicated all-serial binary with no non-serial env reader — and added a ModeGuard RAII helper that clears the override even on panic.

  • P2 — e_tag-less incarnation on a cold cross-branch resolve (runtime_cache.rs): valid but it is the fundamental local-FS residual of cross-branch reuse — you can't both reuse across branches and distinguish a branch-recreation by key without a per-table physical token (e_tag), which local FS lacks. Production object stores carry e_tags, so the key alone distinguishes incarnations there; on local FS the same-branch manifest refresh (invalidate_all) is the fallback. The narrow gap the comment identifies (a cold cross-branch resolve of a recreated branch, which the same-branch refresh doesn't cover) is now documented explicitly in docs/dev/invariants.md as a local-FS-only, dev/test-only known gap.
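The residual gap can be reproduced in miniature. The sketch below is an assumption-laden model, not the engine's code: the key is reduced to a tuple of (table_key, version, table_branch, e_tag), the "index" is just the edge rows, and the literal values are invented. Only the name invalidate_all and the role of e_tag are taken from the discussion above.

```rust
use std::collections::HashMap;

/// Simplified key: (table_key, version, table_branch, e_tag).
type EdgeTableKey = (String, u64, String, Option<String>);
/// Placeholder "index": the (src, dst) rows it was built from.
type Edges = Vec<(u32, u32)>;

#[derive(Default)]
struct RuntimeCacheSketch {
    graph_indexes: HashMap<EdgeTableKey, Edges>,
}

impl RuntimeCacheSketch {
    /// Return the cached index for `key`, building from disk rows on a miss.
    fn graph_index(&mut self, key: EdgeTableKey, rows_on_disk: &[(u32, u32)]) -> Edges {
        self.graph_indexes
            .entry(key)
            .or_insert_with(|| rows_on_disk.to_vec())
            .clone()
    }

    /// Fallback used on a same-branch manifest refresh: drop everything.
    fn invalidate_all(&mut self) {
        self.graph_indexes.clear();
    }
}

/// Key for a hypothetical branch-owned edge table at version 3.
fn key(e_tag: Option<&str>) -> EdgeTableKey {
    (
        "edges/identifiesPerson".to_string(),
        3,
        "feature".to_string(),
        e_tag.map(str::to_string),
    )
}
```

With `key(None)`, a branch recreated at the same version maps to the same entry, so the second lookup returns the first incarnation's rows until invalidate_all runs; with distinct e_tags the two incarnations get distinct keys and the stale entry is never consulted.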

…erage

Replace the process-global OMNIGRAPH_TRAVERSAL_MODE env-mutation test hack (which
forced #[serial] + dedicated all-serial binaries and was triplicated as ModeGuard
+ set_mode/clear_mode) with one general abstraction: a task-local
`with_traversal_mode` seam mirroring `with_query_io_probes`. It is scope-bound
(leak-free even on panic) and process-safe (never touches shared state), so a
forced-mode test cannot affect a concurrent test in the same binary.
`traversal_indexed_override` consults the seam first, then the env var (which
stays the documented ops escape hatch).

- Migrate traversal_indexed.rs, proptest_equivalence.rs, and the two topology cost
  tests (moved back to warm_read_cost.rs) to the seam; drop all ModeGuard /
  set_mode / clear_mode / #[serial] / per-file column0 helpers.
- Consolidate the duplicated first-column extractors into one shared
  `helpers::first_column_sorted`.
- Add `s3_storage.rs::s3_fresh_branch_traversal_reuses_main_graph_index_with_etags`:
  the CSR cache-key cross-branch reuse path on a REAL per-table e_tag (None on
  local FS, so local tests can't reach it). Confirmed empirically that RustFS — the
  CI S3 backend — surfaces ETags into version_metadata.e_tag(). CI path filter now
  triggers the rustfs job on runtime_cache/graph_index changes.

ragnorc commented Jun 28, 2026

Contributor Author

Update on the env-isolation comment (warm_read_cost.rs — process-global OMNIGRAPH_TRAVERSAL_MODE): the earlier "moved to an all-serial binary" fix has been superseded by the reviewer's preferred option. Commit 89011eb replaces the env mutation entirely with a scoped task-local seam, instrumentation::with_traversal_mode, mirroring with_query_io_probes:

  • It is scope-bound (leak-free even on panic — no RAII guard) and process-safe (never touches shared state), so a forced-mode test cannot affect a concurrent test. No #[serial] and no dedicated binary required.
  • exec::query::traversal_indexed_override consults the seam first, then the env var (which stays the documented ops escape hatch). Zero production impact — nothing sets the seam outside tests.
  • All CSR-forced tests (traversal_indexed.rs, proptest_equivalence.rs, and the two topology cost tests now back in warm_read_cost.rs) migrated to it; the triplicated ModeGuard/set_mode/clear_mode and the one-off traversal_s3.rs binary are gone. No test in the suite mutates OMNIGRAPH_TRAVERSAL_MODE anymore.
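A scoped override with an env-var fallback can be sketched with the standard library alone. Note the substitutions: this stand-in uses a thread-local with an internal drop guard, whereas the PR's seam is task-local and needs no guard; the `TraversalMode` enum, the function name `traversal_mode_override`, and the "indexed" env value are assumptions (only the "csr" value and the OMNIGRAPH_TRAVERSAL_MODE variable appear in the PR).

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TraversalMode {
    Csr,
    Indexed,
}

thread_local! {
    /// Per-thread override; never visible to other threads or tests.
    static MODE_OVERRIDE: Cell<Option<TraversalMode>> = Cell::new(None);
}

/// Run `f` with the traversal mode forced, restoring the previous value
/// afterwards even if `f` panics.
fn with_traversal_mode<R>(mode: TraversalMode, f: impl FnOnce() -> R) -> R {
    struct Restore(Option<TraversalMode>);
    impl Drop for Restore {
        fn drop(&mut self) {
            MODE_OVERRIDE.with(|m| m.set(self.0));
        }
    }
    let _restore = Restore(MODE_OVERRIDE.with(|m| m.replace(Some(mode))));
    f()
}

/// Consult the scoped seam first, then the process-wide env var.
fn traversal_mode_override() -> Option<TraversalMode> {
    if let Some(mode) = MODE_OVERRIDE.with(|m| m.get()) {
        return Some(mode);
    }
    match std::env::var("OMNIGRAPH_TRAVERSAL_MODE").ok()?.as_str() {
        "csr" => Some(TraversalMode::Csr),
        "indexed" => Some(TraversalMode::Indexed), // assumed value
        _ => None,
    }
}
```

Because the override lives only for the duration of the closure and is never written to the environment, a forced-mode test in this style reads its own setting and leaves concurrent tests on their default path.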

Net status of the three review comments: P1 schema-mapping → fixed (4d8151e, endpoints in the cache key + regression test); P2 env-isolation → fixed via the seam (89011eb); P2 e_tag-less cross-branch → documented local-FS-only residual, with the e_tag-present path now CI-covered on RustFS (s3_storage.rs).

@ragnorc ragnorc merged commit e7e057e into main Jun 28, 2026
8 checks passed