feat(dst): deterministic-replay seam + replay-equality oracle (PR-C, stacked on #303)#304
feat(dst): deterministic-replay seam + replay-equality oracle (PR-C, stacked on #303)#304ragnorc wants to merge 2 commits into
Conversation
…y oracle (PR-C) The engine mints observable ULIDs/timestamps at scattered sites; a seeded harness run was not bit-reproducible. New src/dst.rs (mirrors failpoints.rs's no-op-fallback shape): next_ulid()/now_micros() consult a tokio::task_local provider when the `dst` feature is on + a provider is in scope, else fall back to real Ulid::new()/SystemTime — ZERO production cost (cfg block compiles away; verified default suite byte-unchanged, 10 passed, no warnings). Routed the OBSERVABLE sites only: 6 Ulid::new() (mutation x2, loader generate_id, commit_graph genesis+append, recovery sidecar op_id) and the timestamp seam (the 2 now_micros helpers + the raw sidecar started_at). Skipped ephemeral cleanup-cutoff/metrics sites. with_seed(seed, fut) -> (Output, fingerprint) scopes the provider and returns a rolling id/clock fingerprint. New replay_equality_same_seed test (--features dst): same seed -> identical fingerprint + outcome; different seed -> divergent. A determinism tripwire — green while the engine is reproducible, red if non-determinism leaks into the seam. Honest scope: gives bit-identical ENGINE-layer seed replay; Lance internals (compaction/index ids, threadpool) stay non-deterministic, so full on-disk equality is out of scope and the flaky FTS-panic is NOT fixed by this. Main walks not yet wrapped in with_seed (follow-up).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8df1c8e73
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| SystemTime::now() | ||
| .duration_since(UNIX_EPOCH) | ||
| .map(|d| d.as_micros() as i64) | ||
| .unwrap_or(0) |
There was a problem hiding this comment.
Preserve clock errors instead of stamping epoch zero
When the host clock is before UNIX_EPOCH, this fallback now returns 0, and the updated commit-graph and recovery-audit callers wrap it in Ok(...); those paths previously surfaced an error instead of writing a bogus created_at/audit timestamp. In a misconfigured or skewed environment this silently persists epoch-zero metadata, which can misorder commit/audit records and hides an operational failure; keep the production clock path fallible for non-DST callers rather than collapsing it to zero.
Useful? React with 👍 / 👎.
…lible #5: bound the seed-derived clock_micros offset via splitmix64(seed) % 86_400_000_000 so a large u64 seed can't overflow i64 and mint bogus ULID times. #3: add a fallible now_micros_result() (deterministic under with_seed; otherwise the real clock, PROPAGATING a clock-before-epoch error) and route the commit-graph + recovery-audit helpers through it, so a bad clock surfaces OmniError::manifest again instead of silently persisting 0 (deny-list: no silent failures). Verified: prod build clean; --features dst --test dst -> 11 passed (incl. replay_equality_same_seed).
| if let Ok(u) = imp::DST.try_with(|s| s.next_ulid()) { | ||
| return u; | ||
| } | ||
| } | ||
| Ulid::new() |
There was a problem hiding this comment.
with_seed only scopes a Tokio task-local provider, and this branch silently falls back to Ulid::new() when no provider is present. If a seeded replay wraps a workload that calls the engine from a spawned task, that task does not have this task-local provider, so generated commit IDs, sidecar IDs, or keyless row IDs come from the real clock/random source and are not included in the fingerprint. The replay can then contain nondeterministic persisted IDs while the oracle misses those calls. Consider making the seeded path fail loudly when the feature is enabled and deterministic mode was expected, or providing a way to propagate the seeded provider into spawned work.
Context Used: AGENTS.md (source)
Stacked on #303 (base
dst-harness-harden). The one engine change of the DST plan (PR-C) — a feature-gated deterministic-replay seam. Zero production cost.What
The engine mints observable ULIDs/timestamps at scattered sites (commit-graph ids/timestamps, keyless node/edge ids, recovery sidecar op-ids), so a seeded harness run was not bit-reproducible. New
src/dst.rs(mirrorsfailpoints.rs's no-op-fallback shape):next_ulid()/now_micros()consult atokio::task_localprovider when thedstfeature is on AND a provider is in scope, else fall back to realUlid::new()/SystemTime::now(). Without the feature the#[cfg(feature="dst")]block compiles away.with_seed(seed, fut) -> (Output, fingerprint)scopes a seeded provider and returns a rolling id/clock fingerprint.Sites routed (OBSERVABLE only)
6 ULID + the timestamp seam:
exec/mutation.rs(×2),loader/mod.rs(generate_id),db/commit_graph.rs(genesis+append, and thenow_microshelper covering all its callers),db/manifest/recovery.rs(sidecar op_id +started_at),db/recovery_audit.rs(now_microshelper). Skipped ephemeral cleanup-cutoff / metrics sites.Oracle
replay_equality_same_seed(--features dst): the same seeded clean op sequence run twice yields an identical engine fingerprint + outcome; a different seed diverges. A determinism tripwire — green while the engine is reproducible, red if non-determinism leaks into the seam.Verification
cargo test -p omnigraph-engine --test dst— 10 passed, no warnings (production/behaviour byte-unchanged).cargo test -p omnigraph-engine --features dst --test dst— 11 passed (incl. the oracle).Honest scope
Bit-identical replay at the engine layer only. Lance internals (compaction/index ids, threadpool scheduling) stay non-deterministic — full on-disk byte-equality is out of scope, and the flaky Lance FTS-builder panic is not fixed by this. The main generative walks aren't yet wrapped in
with_seed(follow-up); the dedicated oracle is the determinism guard.Note
Low Risk
Production builds without
dstcompile the seam away and keep real IDs/clocks; fallible timestamp helpers preserve the prior error contract. Residual risk is only mis-routing a mint site or enablingdstin production by mistake.Overview
Introduces Cargo feature
dstand moduledst.rs(same no-op-when-off pattern as failpoints):next_ulid,now_micros, and falliblenow_micros_resultconsult a per-task seeded provider when the feature is on andwith_seedis in scope; otherwise behavior staysUlid::new()/ the real clock.Observable mint sites are rewired to the seam: commit-graph ids and
now_microshelpers, recovery sidecaroperation_id/started_at, keyless node/edge inserts, and loadergenerate_id. Persisted timestamps on error paths still usenow_micros_resultso pre-seam clock-before-epoch errors propagate instead of writing0.Under
--features dst,replay_equality_same_seedruns the same seeded load/query workload twice and asserts matching id/clock fingerprints and outcomes; a different seed must diverge.Reviewed by Cursor Bugbot for commit 626ca8d. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR adds a deterministic replay seam for engine-visible IDs and timestamps. The main changes are:
dstfeature with seeded ULID and clock helpers.Confidence Score: 4/5
This is close, but the seeded replay path should handle spawned work before merging.
crates/omnigraph/src/dst.rs
Important Files Changed
dstfeature.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[Engine ID or timestamp call] --> B{dst provider in task scope} B -- yes --> C[Seeded DstState] C --> D[Deterministic ULID or micros] C --> E[Replay fingerprint] B -- no --> F[Real ULID or clock fallback]%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% flowchart TD A[Engine ID or timestamp call] --> B{dst provider in task scope} B -- yes --> C[Seeded DstState] C --> D[Deterministic ULID or micros] C --> E[Replay fingerprint] B -- no --> F[Real ULID or clock fallback]Reviews (2): Last reviewed commit: "fix(dst): bound seeded clock init; keep ..." | Re-trigger Greptile
Context used: