Skip to content

feat(dst): deterministic-replay seam + replay-equality oracle (PR-C, stacked on #303)#304

Open
ragnorc wants to merge 2 commits into
dst-harness-hardenfrom
dst-determinism
Open

feat(dst): deterministic-replay seam + replay-equality oracle (PR-C, stacked on #303)#304
ragnorc wants to merge 2 commits into
dst-harness-hardenfrom
dst-determinism

Conversation

@ragnorc

@ragnorc ragnorc commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Stacked on #303 (base dst-harness-harden). The one engine change of the DST plan (PR-C) — a feature-gated deterministic-replay seam. Zero production cost.

What

The engine mints observable ULIDs/timestamps at scattered sites (commit-graph ids/timestamps, keyless node/edge ids, recovery sidecar op-ids), so a seeded harness run was not bit-reproducible. New src/dst.rs (mirrors failpoints.rs's no-op-fallback shape):

  • next_ulid() / now_micros() consult a tokio::task_local provider when the dst feature is on AND a provider is in scope, else fall back to real Ulid::new() / SystemTime::now(). Without the feature the #[cfg(feature="dst")] block compiles away.
  • with_seed(seed, fut) -> (Output, fingerprint) scopes a seeded provider and returns a rolling id/clock fingerprint.

Sites routed (OBSERVABLE only)

6 ULID + the timestamp seam: exec/mutation.rs (×2), loader/mod.rs (generate_id), db/commit_graph.rs (genesis+append, and the now_micros helper covering all its callers), db/manifest/recovery.rs (sidecar op_id + started_at), db/recovery_audit.rs (now_micros helper). Skipped ephemeral cleanup-cutoff / metrics sites.

Oracle

replay_equality_same_seed (--features dst): the same seeded clean op sequence run twice yields an identical engine fingerprint + outcome; a different seed diverges. A determinism tripwire — green while the engine is reproducible, red if non-determinism leaks into the seam.

Verification

  • cargo test -p omnigraph-engine --test dst10 passed, no warnings (production/behaviour byte-unchanged).
  • cargo test -p omnigraph-engine --features dst --test dst11 passed (incl. the oracle).

Honest scope

Bit-identical replay at the engine layer only. Lance internals (compaction/index ids, threadpool scheduling) stay non-deterministic — full on-disk byte-equality is out of scope, and the flaky Lance FTS-builder panic is not fixed by this. The main generative walks aren't yet wrapped in with_seed (follow-up); the dedicated oracle is the determinism guard.


Note

Low Risk
Production builds without dst compile the seam away and keep real IDs/clocks; fallible timestamp helpers preserve the prior error contract. Residual risk is only mis-routing a mint site or enabling dst in production by mistake.

Overview
Introduces Cargo feature dst and module dst.rs (same no-op-when-off pattern as failpoints): next_ulid, now_micros, and fallible now_micros_result consult a per-task seeded provider when the feature is on and with_seed is in scope; otherwise behavior stays Ulid::new() / the real clock.

Observable mint sites are rewired to the seam: commit-graph ids and now_micros helpers, recovery sidecar operation_id / started_at, keyless node/edge inserts, and loader generate_id. Persisted timestamps on error paths still use now_micros_result so pre-seam clock-before-epoch errors propagate instead of writing 0.

Under --features dst, replay_equality_same_seed runs the same seeded load/query workload twice and asserts matching id/clock fingerprints and outcomes; a different seed must diverge.

Reviewed by Cursor Bugbot for commit 626ca8d. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR adds a deterministic replay seam for engine-visible IDs and timestamps. The main changes are:

  • A new dst feature with seeded ULID and clock helpers.
  • Commit, recovery, mutation, and loader ID/time sites routed through the seam.
  • A replay-equality test that compares same-seed fingerprints and outcomes.
  • Fallible timestamp helpers preserved for commit and recovery audit writes.

Confidence Score: 4/5

This is close, but the seeded replay path should handle spawned work before merging.

  • The clock-error and large-seed fixes look addressed.
  • The seeded provider can be lost across spawned tasks, which can let nondeterministic IDs bypass the replay fingerprint.

crates/omnigraph/src/dst.rs

Important Files Changed

Filename Overview
crates/omnigraph/src/dst.rs Adds the seeded provider, production fallbacks, fallible timestamp helper, and replay fingerprint state.
crates/omnigraph/src/db/commit_graph.rs Routes commit IDs and commit timestamps through the DST seam while preserving fallible clock behavior.
crates/omnigraph/src/db/recovery_audit.rs Routes recovery audit timestamps through the fallible DST timestamp helper.
crates/omnigraph/src/db/manifest/recovery.rs Routes recovery sidecar IDs and diagnostic start times through the DST seam.
crates/omnigraph/src/exec/mutation.rs Routes generated keyless node and edge IDs through the DST seam.
crates/omnigraph/src/loader/mod.rs Routes loader-generated IDs through the DST seam.
crates/omnigraph/src/lib.rs Exports the new DST module.
crates/omnigraph/Cargo.toml Adds the opt-in dst feature.
crates/omnigraph/tests/dst.rs Adds the feature-gated replay-equality oracle.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[Engine ID or timestamp call] --> B{dst provider in task scope}
  B -- yes --> C[Seeded DstState]
  C --> D[Deterministic ULID or micros]
  C --> E[Replay fingerprint]
  B -- no --> F[Real ULID or clock fallback]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
  A[Engine ID or timestamp call] --> B{dst provider in task scope}
  B -- yes --> C[Seeded DstState]
  C --> D[Deterministic ULID or micros]
  C --> E[Replay fingerprint]
  B -- no --> F[Real ULID or clock fallback]
Loading

Fix All in Claude Code

Reviews (2): Last reviewed commit: "fix(dst): bound seeded clock init; keep ..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

Context used:

  • Context used - AGENTS.md (source)
  • Context used - CLAUDE.md (source)

…y oracle (PR-C)

The engine mints observable ULIDs/timestamps at scattered sites; a
seeded harness run was not bit-reproducible. New src/dst.rs (mirrors
failpoints.rs's no-op-fallback shape): next_ulid()/now_micros() consult a
tokio::task_local provider when the `dst` feature is on + a provider is
in scope, else fall back to real Ulid::new()/SystemTime — ZERO production
cost (cfg block compiles away; verified default suite byte-unchanged,
10 passed, no warnings).

Routed the OBSERVABLE sites only: 6 Ulid::new() (mutation x2, loader
generate_id, commit_graph genesis+append, recovery sidecar op_id) and
the timestamp seam (the 2 now_micros helpers + the raw sidecar started_at).
Skipped ephemeral cleanup-cutoff/metrics sites.

with_seed(seed, fut) -> (Output, fingerprint) scopes the provider and
returns a rolling id/clock fingerprint. New replay_equality_same_seed
test (--features dst): same seed -> identical fingerprint + outcome;
different seed -> divergent. A determinism tripwire — green while the
engine is reproducible, red if non-determinism leaks into the seam.

Honest scope: gives bit-identical ENGINE-layer seed replay; Lance
internals (compaction/index ids, threadpool) stay non-deterministic, so
full on-disk equality is out of scope and the flaky FTS-panic is NOT
fixed by this. Main walks not yet wrapped in with_seed (follow-up).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8df1c8e73

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/omnigraph/src/dst.rs Outdated
SystemTime::now()
.duration_since(UNIX_EPOCH)
.map(|d| d.as_micros() as i64)
.unwrap_or(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve clock errors instead of stamping epoch zero

When the host clock is before UNIX_EPOCH, this fallback now returns 0, and the updated commit-graph and recovery-audit callers wrap it in Ok(...); those paths previously surfaced an error instead of writing a bogus created_at/audit timestamp. In a misconfigured or skewed environment this silently persists epoch-zero metadata, which can misorder commit/audit records and hides an operational failure; keep the production clock path fallible for non-DST callers rather than collapsing it to zero.

Useful? React with 👍 / 👎.

Comment thread crates/omnigraph/src/dst.rs Outdated
Comment thread crates/omnigraph/src/dst.rs Outdated
…lible

#5: bound the seed-derived clock_micros offset via splitmix64(seed) %
86_400_000_000 so a large u64 seed can't overflow i64 and mint bogus ULID
times. #3: add a fallible now_micros_result() (deterministic under with_seed;
otherwise the real clock, PROPAGATING a clock-before-epoch error) and route the
commit-graph + recovery-audit helpers through it, so a bad clock surfaces
OmniError::manifest again instead of silently persisting 0 (deny-list: no
silent failures). Verified: prod build clean; --features dst --test dst -> 11
passed (incl. replay_equality_same_seed).
Comment on lines +40 to +44
if let Ok(u) = imp::DST.try_with(|s| s.next_ulid()) {
return u;
}
}
Ulid::new()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Spawned Tasks Escape

with_seed only scopes a Tokio task-local provider, and this branch silently falls back to Ulid::new() when no provider is present. If a seeded replay wraps a workload that calls the engine from a spawned task, that task does not have this task-local provider, so generated commit IDs, sidecar IDs, or keyless row IDs come from the real clock/random source and are not included in the fingerprint. The replay can then contain nondeterministic persisted IDs while the oracle misses those calls. Consider making the seeded path fail loudly when the feature is enabled and deterministic mode was expected, or providing a way to propagate the seeded provider into spawned work.

Context Used: AGENTS.md (source)

Fix in Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant