perf(engine): scope CSR topology index to traversed edges, reuse it cross-branch#312
Conversation
…it cross-branch The in-memory CSR graph index was built over every edge type in the catalog and cache-keyed by the resolved snapshot id, so a single-edge join (`$x identifiesPerson $p`) full-scanned every edge table in the graph (the 40-60s / 428s-first-traversal hang), and a lazy-fork branch cold-rebuilt main's index. Two cuts close that: - Scope (A2): build only the edge types the query traverses (`referenced_edge_types` over Expand/AntiJoin, exhaustive match), not the whole catalog. Threaded through GraphIndexHandle -> RuntimeCache; cache-keyed on the scoped set. - Cross-branch reuse (A1): key RuntimeCache by each edge table's physical identity (table_key, version, table_branch, e_tag) instead of the snapshot id, so a lazy-fork branch whose edge tables physically are main's reuses main's built index. Local-FS (e_tag None) falls back to refresh-invalidation. Adds graph_build_count/graph_edges_built probes for the cost tests.
fresh_branch_traversal_reuses_main_graph_index (A1: a lazy-fork branch reuses main's cached CSR index, 0 rebuilds) and single_edge_query_builds_only_referenced_edge (A2: a one-edge query builds only that edge, not the whole catalog), via the graph_build_count/graph_edges_built probes. Forced CSR mode, #[serial]. Updates the recreated-branch incarnation test comment for the physical-identity key.
Document the scoped CSR build and the physical-identity (e_tag) graph-index cache key with its local-FS refresh-invalidation fallback across invariants, testing, execution, and architecture docs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18de823048
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| snapshot_id: resolved.snapshot_id.as_str().to_string(), | ||
| edge_tables, | ||
| } | ||
| GraphIndexCacheKey { edge_tables } |
There was a problem hiding this comment.
Preserve incarnation for e_tag-less graph-index keys
On local FS or any store where version_metadata.e_tag() is None, a deleted-and-recreated branch can reuse the same edge-table (table_key, table_version, table_branch) while containing different src/dst rows. If a long-lived handle is bound to another branch and reads that recreated branch via the cold different-branch resolve path, no same-branch refresh invalidates RuntimeCache, so dropping snapshot_id here lets the old CSR entry be reused and traversal can return stale neighbors. Keep a branch/manifest incarnation fallback when e_tags are absent, or invalidate the graph-index cache for cold branch resolves.
Useful? React with 👍 / 👎.
| fn set_csr_mode() { | ||
| // SAFE: every test that sets this is `#[serial]`, so no thread reads the env | ||
| // during the write. | ||
| unsafe { std::env::set_var("OMNIGRAPH_TRAVERSAL_MODE", "csr") }; |
There was a problem hiding this comment.
Isolate process-global traversal-mode overrides
These new tests mutate the process-wide OMNIGRAPH_TRAVERSAL_MODE, but marking only the mutating tests #[serial] does not stop Cargo from running the other non-serial tests in this same binary concurrently. Several existing warm-read tests issue traversal queries, so if they overlap this window they can be forced onto the CSR path and observe different cache/IO behavior, creating nondeterministic test failures; isolate the forced-mode tests in a separate binary or guard/serialize all tests that can read this env var.
Useful? React with 👍 / 👎.
…eout # Conflicts: # docs/dev/testing.md
The two topology-build cost tests force OMNIGRAPH_TRAVERSAL_MODE via process- global env mutation, which query.rs reads. In warm_read_cost.rs (a mixed serial/non-serial binary) a concurrent non-serial traversal test could race the env write (UB under Rust 2024's unsafe set_var contract) and be forced onto CSR. Move them to traversal_indexed.rs — the dedicated all-serial binary with no non-serial env reader (its documented-safe home) — and add a ModeGuard RAII helper so a panic mid-test cannot leak the override. Addresses a PR review (P2).
The A1 physical-identity key omitted the edge's (from_type, to_type). GraphIndex keys its TypeIndexes by those endpoint names and execute_expand_csr looks them up by the current catalog's names, so a schema repoint of an edge type that leaves the edge table's physical identity unchanged would reuse a stale index built with the old endpoint namespace and fail with "no type index for <new type>". The old snapshot_id (carrying the manifest version) masked this; dropping it exposed it. Adding the endpoints to the key rebuilds on a repoint while preserving lazy-fork cross-branch reuse (same endpoints -> same key). Addresses a PR review (P1).
|
Addressed the review comments + merged latest
|
…erage Replace the process-global OMNIGRAPH_TRAVERSAL_MODE env-mutation test hack (which forced #[serial] + dedicated all-serial binaries and was triplicated as ModeGuard + set_mode/clear_mode) with one general abstraction: a task-local `with_traversal_mode` seam mirroring `with_query_io_probes`. It is scope-bound (leak-free even on panic) and process-safe (never touches shared state), so a forced-mode test cannot affect a concurrent test in the same binary. `traversal_indexed_override` consults the seam first, then the env var (which stays the documented ops escape hatch). - Migrate traversal_indexed.rs, proptest_equivalence.rs, and the two topology cost tests (moved back to warm_read_cost.rs) to the seam; drop all ModeGuard / set_mode / clear_mode / #[serial] / per-file column0 helpers. - Consolidate the duplicated first-column extractors into one shared `helpers::first_column_sorted`. - Add `s3_storage.rs::s3_fresh_branch_traversal_reuses_main_graph_index_with_etags`: the CSR cache-key cross-branch reuse path on a REAL per-table e_tag (None on local FS, so local tests can't reach it). Confirmed empirically that RustFS — the CI S3 backend — surfaces ETags into version_metadata.e_tag(). CI path filter now triggers the rustfs job on runtime_cache/graph_index changes.
|
Update on the env-isolation comment (
Net status of the three review comments: P1 schema-mapping → fixed ( |
What & why
A single-edge graph join (
match { \$x: ExternalID \$x identifiesPerson \$p }) timed out at 40-60s (428s first traversal) while reading either endpoint alone was fast: the in-memory CSR topology index was built over every edge type in the catalog and cache-keyed by the resolved snapshot id, so one traversal full-scanned every edge table in the graph and a lazy-fork branch cold-rebuilt main's index. This scopes the build to only the edge types the query traverses (referenced_edge_typesoverExpand/AntiJoin) and re-keys theRuntimeCacheby each edge table's physical identity(table_key, version, table_branch, e_tag)so a lazy-fork branch reuses main's built index (local-FS e_tag-Nonefalls back to refresh-invalidation). Supersedes #276 (same fix, consolidated onto currentmainwith a cleaner, exhaustive-match scoping path).Backing issue / RFC
Checklist
warm_read_cost.rs:fresh_branch_traversal_reuses_main_graph_index,single_edge_query_builds_only_referenced_edge; enginereferenced_edge_typesunit tests)Notes for reviewers
The two cuts are correct-by-construction: build by referenced scope, key by physical edge-table identity. Verified the at-risk incarnation tests (
recreated_branch_traversal_uses_graph_index_incarnation,recreated_branch_owned_table_handle_uses_table_etag) still pass — droppingsnapshot_idis safe because those rely on refresh-invalidation, not the cache key. One documented residual: on stores without per-table e_tags (local FS) a branch recreated at the same version reuses the refresh-invalidation fallback; production object stores carry real e_tags.cargo test --workspaceis green except two pre-existing environmentalsystem_localfailures (spawned server bearer-token in this sandbox; fail identically on a clean base).Note
Medium Risk
Changes query-time CSR build scope and graph-index cache invalidation semantics on a hot read path; mitigated by exhaustive IR matching, endpoint-aware keys, and cost/S3 tests, with a documented local-FS e_tag fallback gap.
Overview
Fixes 40s+ timeouts on single-edge joins by stopping whole-catalog CSR builds and snapshot-id-only cache keys.
Scoped builds: Query execution derives
referenced_edge_typesfrom the IR (Expandplus nestedAntiJoininners) and passes that set intoGraphIndexHandle/RuntimeCache::graph_index, so the CSR path scans only those edge tables—not every type in the catalog.Cross-branch reuse (A1): The
RuntimeCachekey drops syntheticsnapshot_idand keys on each included edge table's physical identity(table_key, version, table_branch, e_tag)plus(from_type, to_type)endpoints, so lazy-fork branches with unchanged edge tables reuse main's cached index; endpoint remaps get a distinct key.Testability: Adds
with_traversal_mode(task-local, overridesOMNIGRAPH_TRAVERSAL_MODE) andgraph_build_count/graph_edges_builtprobes; cost tests inwarm_read_cost.rsand S3s3_fresh_branch_traversal_reuses_main_graph_index_with_etags; proptest/traversal tests drop#[serial]env mutation.Reviewed by Cursor Bugbot for commit 89011eb. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR improves graph traversal caching and CSR index build cost. The main changes are:
Confidence Score: 5/5
This looks safe to merge.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD Q[Query IR pipeline] --> R[Collect traversed edge types] R --> H[GraphIndexHandle] H --> C[RuntimeCache graph index] C --> K[Key by table identity, e_tag, endpoints] K -->|cache hit| I[Reuse GraphIndex] K -->|cache miss| B[Build scoped GraphIndex] B --> I I --> E[CSR traversal execution]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD Q[Query IR pipeline] --> R[Collect traversed edge types] R --> H[GraphIndexHandle] H --> C[RuntimeCache graph index] C --> K[Key by table identity, e_tag, endpoints] K -->|cache hit| I[Reuse GraphIndex] K -->|cache miss| B[Build scoped GraphIndex] B --> I I --> E[CSR traversal execution]Reviews (3): Last reviewed commit: "test(engine): scoped with_traversal_mode..." | Re-trigger Greptile
Context used: