test(dst): comprehensive DST harness on 0.7.2 + completeness-critic hardening (supersedes #300, #302)#303
test(dst): comprehensive DST harness on 0.7.2 + completeness-critic hardening (supersedes #300, #302)#303ragnorc wants to merge 3 commits into
Conversation
…, critic findings) Port: brings the full harness (B1-B6) onto current main (0.7.2). Verified all 5 bug guards still reproduce on 0.7.2 — its only fix (#296 recovery roll-forward convergence) addresses a DIFFERENT bug than any the harness catches. Adds test-only dev-deps (arrow-array, futures, serde_json, async-trait, proptest-state-machine). Harden (completeness-critic pass): - MATRIX.md — the coverage ledger: enumerates D1-D5 sampled/unsampled cells, names the HIDDEN dimensions found only after a miss, and explains why exhaustive sampling is impossible (path-spaces / non-enumerable schedules / open-world model). Prioritized gap backlog. - Reopen op — open/recovery sweep is now a first-class walk op (drop + reopen mid-walk), sampled across varied table states, not only at the end. ops 10/10, 0 novel violations. - Critic finding #1 (#296 cell): concurrent open / >=2 recovery sweeps on one sidecar is unsampled — the named blind spot that let #296 through. - Critic finding #2 (fault seam): the FaultAdapter wraps StorageAdapter::write_text_if_match, but the __manifest publish is a Lance MergeInsertBuilder CAS (publisher.rs:377) that never flows through it — verified empirically (cas_pct=100 write succeeds, no sidecar). So Phase-2 'manifest CAS fault injection' is off the hot path; real injection needs failpoints or a Lance object_store wrapper. This is the prerequisite for closing the #296 cell.
The StorageAdapter FaultAdapter is off the manifest-publish path, so it can't induce a recovery sidecar. Failpoints can. New tests/dst_recovery.rs (own binary — the process-global fail registry must not leak into the main, non-serial walks; verified the in-binary version broke 3 parallel tests): - recovery_rolls_forward_under_finalize_failure: mutation.post_finalize_pre_publisher leaves a real RolledPastExpected sidecar; reopen rolls it forward; the harness WHITE-BOX structural battery holds (additive vs the engine's count-only test). - concurrent_opens_converge_on_pending_sidecar (the #296 cell): an inline park-first rendezvous at recovery.before_roll_forward_publish forces two open sweeps to race one sidecar; the CAS-loser must CONVERGE, not fail the open. Non-vacuous (rendezvous panics if the race never fires); guards 0.7.2 (the failpoint was added with the #296 fix, so it can't run on true 0.7.1). MATRIX.md backlog: fault-seam (#1), #296 cell (#2), Reopen op (#3) now DONE; cross-process + generative-via-object_store remain.
… proptest-state-machine)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 76fa48f35c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return Err(Finding::Logical(format!( | ||
| "count Person={p} != model={} (lost-write or dup-@key)", |
There was a problem hiding this comment.
Stop labeling all Person count drift as dup-key
When the Person row count diverges, this message always includes the dup-@key marker, and classify() allow-lists any logical finding containing that marker as MR-714. In the faulted walk, a lost write (count < model) or a write that committed after returning an error (count > model) will therefore be recorded as a known dup-key issue instead of failing, so the count==model oracle cannot catch the silent-loss cases it is meant to guard.
Useful? React with 👍 / 👎.
| if self.roll(self.cas_conflict_pct) { | ||
| return Ok(None); |
There was a problem hiding this comment.
Inject faults on a seam the engine write path uses
In the embedded engine context opened by open_faulted, this override never exercises graph manifest publish failures: the engine publishes __manifest through Lance MergeInsertBuilder/ManifestBatchPublisher, and a repo-wide check of write_text_if_match shows it is used by cluster state/adapter tests rather than the Omnigraph write path. As a result, seeded_op_loop_with_cas_faults can run as another clean walk with no injected CAS conflicts, giving false coverage for the manifest-fault cell.
Useful? React with 👍 / 👎.
| for h in handles { | ||
| let _ = h.await; // ignore JoinError: a contained actor panic is judged by the battery |
There was a problem hiding this comment.
Surface panics from concurrent actors
Ignoring JoinError here hides any actor panic that does not leave durable corruption behind. For example, a novel panic in load_jsonl, mutate, or optimize could abort one worker early while the post-join structural battery still passes, causing this concurrency test to report success instead of the panic it was supposed to contain and classify.
Useful? React with 👍 / 👎.
| return Err(Finding::Logical(format!( | ||
| "count Person={p} != model={} (lost-write or dup-@key)", | ||
| model.persons() | ||
| ))); |
There was a problem hiding this comment.
When count Person differs from the model for any reason, this message includes the dup-@key token that classify() allow-lists. A lost write after a faulted op, repair, or reopen can therefore be recorded as the known duplicate-key bug instead of failing the DST walk.
| return Err(Finding::Logical(format!( | |
| "count Person={p} != model={} (lost-write or dup-@key)", | |
| model.persons() | |
| ))); | |
| return Err(Finding::Logical(format!( | |
| "count Person={p} != model={} (lost-write or unexpected duplicate row)", | |
| model.persons() | |
| ))); |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| | repair | ✅ | walk (`Repair`) | | ||
| | read | ✅ | walk, readshape | | ||
| | branch create/write/merge | 🟡 | `branch_isolation_and_merge` (scenario, not generic walk) | | ||
| | **`open` / recovery sweep** | ❌ | **only a fixture step (`reopen`), never a generated op — the `#296` gap** | |
There was a problem hiding this comment.
Recovery Coverage Row Is Stale
This row still marks open / recovery as unsampled, while the same PR adds Reopen to the walk and marks the failpoint recovery cells done later in the file. The ledger now contradicts the harness and can mislead future gap-closure work.
| | **`open` / recovery sweep** | ❌ | **only a fixture step (`reopen`), never a generated op — the `#296` gap** | | |
| | **`open` / recovery sweep** | ✅ | walk (`Reopen`) + failpoint recovery cells (`dst_recovery`) | |
|
Superseded by #309, which consolidates this comprehensive harness with #305 (fuzz/S3) and the omnigraph-dst crate extraction into a single tests-only PR retargeted to |
Supersedes #300 and #302 — consolidates the full DST harness onto current
main(0.7.2) in one PR, plus the completeness-critic hardening pass. Tests-only (no enginesrcchanges).The harness (B1–B6, was #300 + #302)
A seeded, model-based DST harness over the morphological matrix (D1 ops · D2 fragment morphology · D3 read shapes · D4 oracles · D5 context):
Dataset::validate, unique live_rowid(RC-X corruption class, deletion-vector-correct), index-probe, count==model, content==model, edges==model (RI), @key-unique.catch_unwind+classify_panic), branch isolation/merge scenario, read-shape battery × 4 morphologies, concurrent multi-actor walk, and a proptest-state-machine shrinking campaign over the clean op subset.Bugs guarded (5)
Reproduced/pinned as characterization tests, all still present on 0.7.2 (verified):
regression_rc_x_…+ index-proberegression_rc1_…+ HEAD==manifest@key(MR-714)regression_dup_key_…+ concurrent walkKnowsnot traversable (NEW)regression_self_loop_…+ edges==modelHardening (completeness-critic pass)
dst/MATRIX.md— a coverage ledger: D1–D5 sampled/unsampled cells, the hidden dimensions found only after a miss, and why exhaustive sampling is impossible. A living artifact, re-run when an external bug slips through.Reopenop — the recovery sweep is now a first-class walk op.FaultAdapterwrapsStorageAdapter::write_text_if_match, which is off the manifest-publish path (a LanceMergeInsertBuilderCAS) — so it couldn't induce a recovery sidecar.--features failpointsvariant in its own binary (tests/dst_recovery.rs): induces a realRolledPastExpectedsidecar, and a rendezvous-forced concurrent-open race (the#296blind spot) asserting the CAS-loser converges. Run:cargo test -p omnigraph-engine --features failpoints --test dst_recovery.Why 0.7.2
The only fix between 0.7.1 and 0.7.2 (#296, recovery roll-forward convergence) fixes a different recovery bug than any the harness catches — confirmed by running the harness against 0.7.2 (all 5 guards stay green). The
#296cell is now sampled so its class can't slip again.Run
cargo test -p omnigraph-engine --test dst(10 tests)cargo test -p omnigraph-engine --features failpoints --test dst_recovery(2 cells)New test-only dev-deps:
arrow-array,futures,serde_json,async-trait,proptest-state-machine.Note
Low Risk
Test-only additions and dev-dependencies; no production engine code paths change.
Overview
Tests-only — adds a deterministic-simulation / morphological-matrix (DST) harness under
crates/omnigraph/tests/dst/with entrypointstests/dst.rsandtests/dst_recovery.rs(no enginesrcchanges).The harness drives the real omnigraph-engine through a seeded op alphabet, a reference model, and a white-box invariant battery after each step (HEAD==manifest,
Dataset::validate, unique live_rowid, index probe, count/content/edges vs model, @key checks). Findings are classified as known open bugs (allow-listed) vs novel (fail). Coverage is tracked;dst/MATRIX.mddocuments sampled vs gap matrix dimensions and critic findings (e.g.FaultAdapterCAS seam vs manifest publish path).New scenarios: named characterization regressions (RC-1, RC-X, dup-@key, self-loop traversal, etc.); generative walks with optional
FaultAdapterCAS faults, mid-walkReopen, and panic containment; concurrent multi-actor structural walk; branch isolation/merge; read-shape battery × table morphologies; proptest-state-machine shrinking on clean ops.Failpoint binary (
--features failpoints --test dst_recovery): real pending recovery sidecars and deterministic concurrent-open race (#296), isolated from parallel main DST runs.Dev-deps:
proptest-state-machine, plusarrow-array,futures,serde_json,async-traitfor the harness.Reviewed by Cursor Bugbot for commit 76fa48f. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR adds a test-only DST harness for Omnigraph. The main changes are:
Confidence Score: 4/5
The DST harness needs a small correctness fix before merging.
crates/omnigraph/tests/dst/model.rs
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Seeded DST walk] --> B[Run engine op] B --> C{Op succeeds?} C -->|yes| D[Update model] C -->|no| E[Classify error] D --> F[Run invariant battery] E --> F F --> G{Finding?} G -->|known| H[Record known bug] G -->|novel| I[Fail test] G -->|none| A A --> J[Reopen durability check] J --> F%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[Seeded DST walk] --> B[Run engine op] B --> C{Op succeeds?} C -->|yes| D[Update model] C -->|no| E[Classify error] D --> F[Run invariant battery] E --> F F --> G{Finding?} G -->|known| H[Record known bug] G -->|novel| I[Fail test] G -->|none| A A --> J[Reopen durability check] J --> FReviews (1): Last reviewed commit: "chore: lock new test-only dev-deps (arro..." | Re-trigger Greptile
Context used: