feat(holograph): production polish — sled + fetch fallback + shutdown#843
Draft
data-bot-coasys wants to merge 7 commits into
…18 D1)

When two HolographSpace instances race to open the same data directory (e.g. a stuck node still draining + a fresh start, or a test harness re-opening a tempdir), sled's exclusive file lock causes the second open to fail immediately. Retry with exponential backoff so the second open waits up to ~1.55s for the first holder to drop.

- Backoff schedule: 50, 100, 200, 400, 800ms (5 attempts).
- `is_lock_contention` recognises both the platform-typed io kinds (`WouldBlock`, `AlreadyExists`) and sled's wrapped form (`kind: Other` with "could not acquire lock" in the message). sled 0.34 takes the wrapping path on Linux/macOS so the message check is the load-bearing match.
- Best-effort stale-lock recovery: after one failed attempt the loop removes any leftover `db/.lock` file once. POSIX advisory locks already die with the owning process so this is mostly belt+braces, but it covers the case of a crashed previous run that left the file behind.

Test `second_open_retries_until_first_drops` spawns two opens against the same path; the first drops after 200ms, the second succeeds within the backoff window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
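The retry behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `open_with_retry` and its closure parameters are hypothetical names, the real `KvOpStore::open` signature is not shown in this conversation, and whether a final attempt follows the last sleep is an assumption. Only the error kinds, the message substring, and the 50 to 800ms schedule come from the commit message.

```rust
use std::io;
use std::thread;
use std::time::Duration;

// Backoff schedule quoted in the commit message, in milliseconds (sums to 1550).
#[allow(dead_code)]
const BACKOFF_MS: [u64; 5] = [50, 100, 200, 400, 800];

// Mirrors the matching rules listed for `is_lock_contention`:
// typed io kinds, or sled's wrapped `Other` error identified by its message.
fn is_lock_contention(err: &io::Error) -> bool {
    match err.kind() {
        io::ErrorKind::WouldBlock | io::ErrorKind::AlreadyExists => true,
        io::ErrorKind::Other => err.to_string().contains("could not acquire lock"),
        _ => false,
    }
}

// Hypothetical helper: `open` stands in for the sled open call and
// `clear_stale_lock` for the one-time removal of a leftover lock file.
fn open_with_retry<T>(
    backoff_ms: &[u64],
    mut open: impl FnMut() -> io::Result<T>,
    mut clear_stale_lock: impl FnMut(),
) -> io::Result<T> {
    let mut cleaned = false;
    for delay in backoff_ms {
        match open() {
            Ok(store) => return Ok(store),
            Err(e) if is_lock_contention(&e) => {
                // Best-effort stale-lock cleanup, at most once.
                if !cleaned {
                    clear_stale_lock();
                    cleaned = true;
                }
                thread::sleep(Duration::from_millis(*delay));
            }
            // Anything that is not lock contention fails immediately.
            Err(e) => return Err(e),
        }
    }
    // Assumption: one last attempt once the schedule is used up.
    open()
}
```

Passing the schedule as a slice keeps the sketch testable with millisecond delays; a caller wanting the documented behaviour would pass `&BACKOFF_MS`.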
…ure event (Wake-18 D2/D5)

D2: `SpaceConfig::fetch_fallback_policy: FetchFallbackPolicy` lifts the previously-implicit `fallback_timeout` + `max_retry_peers` knobs into one structured policy with three fields:

- initial_timeout (default 5s): grace before fallback kicks in
- max_attempts (default 3): peer cap across entry lifetime
- retry_budget (default 30s): wall-clock cap from first_seen

`fallback_pass` now round-robins through arc-overlapping peers within a single tick instead of one-peer-per-tick, with the caps enforced on a per-entry basis. The pending tree's `tried_peers` field is preserved across watcher ticks so a peer that K2 already accepted a request_ops against is never re-picked.

D5: when either cap is hit (`attempts_exhausted || budget_exhausted`) the entry is dropped from the pending tree and a new `NotifyUp::notify_parent_fetch_permanent_failure(op_id, missing_parents, last_error)` fires. The trait method has a default no-op so older `NotifyUp` impls keep compiling; in tests the mock records every call for assertion.

Wiring:
- `IntegrationQueueConfig`: dropped `fallback_timeout` / `max_retry_peers`, added `fallback_policy`.
- `HolographSpaceConfig`: dropped the same two fields. The policy threads in from `cfg.config.fetch_fallback_policy` instead.
- `SpaceConfig::full_replication_single_doc()` now includes the default fetch-fallback policy.

Tests (new):
- `fallback_round_robins_multiple_peers_in_one_tick`: picker yields bob then charlie in one tick; verifies both are reached.
- `fallback_bounded_by_max_attempts`: renamed from `fallback_bounded_by_max_retry_peers`, now also asserts the entry is dropped and the permanent-failure notification fires once.

All 45 holograph lib tests + 4 pdiff_parity + 1 two_node green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
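To make the interaction of the three policy fields concrete, here is a small sketch of how a per-entry decision might be derived from them. The field names and defaults are taken from the commit message; the field types, the `FallbackStep` enum, and the `next_step` function are hypothetical, and the exact comparison operators (`>=` versus `>`) are assumptions.

```rust
use std::time::Duration;

// Field names and defaults follow the commit message; the types are assumed.
#[derive(Debug, Clone, PartialEq)]
struct FetchFallbackPolicy {
    initial_timeout: Duration, // grace before fallback kicks in
    max_attempts: usize,       // peer cap across the entry's lifetime
    retry_budget: Duration,    // wall-clock cap measured from first_seen
}

impl Default for FetchFallbackPolicy {
    fn default() -> Self {
        FetchFallbackPolicy {
            initial_timeout: Duration::from_secs(5),
            max_attempts: 3,
            retry_budget: Duration::from_secs(30),
        }
    }
}

// Hypothetical outcome of evaluating one pending entry.
#[derive(Debug, PartialEq)]
enum FallbackStep {
    Wait,             // still inside the initial grace period
    TryNextPeer,      // pick another peer not yet in tried_peers
    PermanentFailure, // drop the entry and notify upward
}

// Hypothetical decision function for a single pending entry.
fn next_step(
    policy: &FetchFallbackPolicy,
    since_first_seen: Duration,
    peers_tried: usize,
) -> FallbackStep {
    let attempts_exhausted = peers_tried >= policy.max_attempts;
    let budget_exhausted = since_first_seen >= policy.retry_budget;
    if attempts_exhausted || budget_exhausted {
        FallbackStep::PermanentFailure
    } else if since_first_seen < policy.initial_timeout {
        FallbackStep::Wait
    } else {
        FallbackStep::TryNextPeer
    }
}
```

Under these assumptions an entry that has already tried three peers is reported as a permanent failure even if its 30s budget has not elapsed, which matches the "either cap" wording in D5.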
`HolographSpace::shutdown()` does:
1. Set the shutdown flag — `on_local_commit` rejects new commits with
"shutdown in progress."
2. Stop the integration queue's fallback watcher.
3. Drain the queue: poll `pending_len()` until 0 or 10s timeout. Returns
the final pending count so the caller can detect partial drains.
4. `KvOpStore::flush_async()` so the on-disk snapshot is durable.
5. `LocalCommitTarget::close()` for transport teardown (new trait
method with a default no-op; production `K2DynSpaceTarget` can
override to close the iroh DynSpace).
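
The drain in step 3 can be pictured as a bounded polling loop. The sketch below is synchronous and uses only the standard library, whereas the real `shutdown()` is presumably async; `drain_until_empty`, its parameters, and the poll interval are hypothetical, and only the "poll `pending_len()` until 0 or timeout, return the final count" behaviour comes from the list above.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical helper: `pending_len` stands in for the queue's pending_len().
// Returns the last observed pending count, so 0 means a full drain and any
// other value means the timeout expired with work still queued.
fn drain_until_empty(
    mut pending_len: impl FnMut() -> usize,
    timeout: Duration,
    poll_every: Duration,
) -> usize {
    let deadline = Instant::now() + timeout;
    loop {
        let remaining = pending_len();
        if remaining == 0 || Instant::now() >= deadline {
            return remaining;
        }
        thread::sleep(poll_every);
    }
}
```

A caller following the steps above would pass a 10s timeout and treat a non-zero return value as a partial drain.
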
`Drop for HolographSpace` is the safety net for "process exit before
shutdown was called": sets the shutdown flag, stops the watcher, and
runs `KvOpStore::flush_blocking()`. Errors are logged but never panic
— a panicking Drop during unwinding aborts the process.
New surface:
- `KvOpStore::flush_async` + `flush_blocking` — public so tests +
smoketests can verify durability without going through the trait.
- `LocalCommitTarget::close` — default `async { Ok(()) }`.
Test `shutdown_flushes_and_rejects_new_commits` exercises the rejection
path + drain + flush. 46 holograph lib tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nd-trip (Wake-18 D4)

Formalizes SPIKE §2.5 exit-check #6 ("restart survives state via sled") at the HolographSpace integration layer. The unit-level `state_persists_across_reopen` covers a single op; this exercises the substrate end-to-end:

1. Open HolographSpace at path P (fresh tempdir).
2. Commit 100 envelopes: three logical agents round-robin via distinct author tags so the op-ids cover the id space rather than clustering.
3. `space.shutdown()` flushes sled.
4. Reopen `KvOpStore` at the same path.
5. Assert op_count == 100, every op_id retrievable, every op_data bytes-for-bytes identical to the committed envelope.

Also exposes `FetchFallbackPolicy` from the crate root since the test sets its own (the 60s/1-attempt/60s policy ensures the test isn't sensitive to the timing of the fallback watcher during the commit phase).

1/1 green in `tests/restart_survives.rs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ke-18 D6)

Lift the iroh-relay URL from an ad-hoc env read in `holograph_wires.rs` into a structured `SpaceConfig.iroh_relay_url: Option<String>` field plus a `resolve_iroh_relay()` helper.

- `HolographSpace::new` resolves the field from env if it's `None`: `HOLOGRAPH_IROH_RELAY` is preferred, `HOLOGRAPH_IROH_RELAY_URL` is the back-compat fallback for the existing wires-file code that reads the longer name. Whitespace-only env values are treated as unset. The resolved value is folded back into `space.config()` so downstream consumers don't have to re-read the env.
- New `rust-executor/crates/holograph/README.md`: single "Configuration" section with the env-vars table and a programmatic-overrides example. Wake-18's load-bearing documentation entry.

Test `resolve_iroh_relay_prefers_short_name` covers all four permutations (neither set, long only, both set, whitespace-only) with a module-local mutex to serialize against the process-global env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
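The precedence rules described here might look something like the sketch below. The two env-var names and the "whitespace-only counts as unset" rule come from the commit message; `pick_relay` and `resolve_relay_sketch` are hypothetical names (the signature of the real `resolve_iroh_relay()` is not shown), and trimming the returned value is an assumption.

```rust
use std::env;

// Pure precedence logic: the short name wins, the long name is the fallback,
// and blank values are skipped. Kept free of env access so it can be tested
// without touching process-global state.
fn pick_relay(short: Option<String>, long: Option<String>) -> Option<String> {
    short
        .into_iter()
        .chain(long)
        .map(|v| v.trim().to_string()) // assumption: surrounding whitespace is dropped
        .find(|v| !v.is_empty())
}

// Hypothetical wrapper: an explicitly configured value is kept as-is,
// otherwise the environment is consulted.
#[allow(dead_code)]
fn resolve_relay_sketch(configured: Option<String>) -> Option<String> {
    configured.or_else(|| {
        pick_relay(
            env::var("HOLOGRAPH_IROH_RELAY").ok(),
            env::var("HOLOGRAPH_IROH_RELAY_URL").ok(),
        )
    })
}
```

Splitting the pure function from the env read is one way to avoid needing a mutex in most tests; the PR itself serializes its env-touching test with a module-local mutex instead.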
Wake-18 fmt fixup for the D2/D3/D4/D6 commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Jun 4, 2026
…into holograph-pr-d-production-polish
Summary
Fourth slice of the Holograph stack, stacked on PR-C (#841). Production-polish round addressing the v1 substrate's known soft spots from SPIKE §2.5 exit-checks and the Wake-18 dispatch.
`KvOpStore::open` retries on sled lock contention (50/100/200/400/800ms backoff, 5 attempts, ~1.55s budget) with a best-effort stale-lock cleanup after the first failed attempt. Two concurrent `HolographSpace::new` calls against the same data directory no longer race to a hard error.
Test plan
Commits
```
9e2cade feat(holograph): retry sled lock contention in KvOpStore::open (Wake-18 D1)
feb5a66 feat(holograph): fetch-fallback policy + round-robin + permanent-failure event (Wake-18 D2/D5)
32c0d4f feat(holograph): graceful shutdown + sync-flush Drop (Wake-18 D3)
dfa9358 test(holograph): restart_survives_state — 100 ops, 3 agents, full round-trip (Wake-18 D4)
8db80bf feat(holograph): HOLOGRAPH_IROH_RELAY env config surface + README (Wake-18 D6)
a4a7558 chore: cargo fmt --all
```
Notable behavioural changes for reviewers
Stacked split
🤖 Generated with Claude Code