
feat(holograph): production polish — sled + fetch fallback + shutdown #843

Draft
data-bot-coasys wants to merge 7 commits into holograph-pr-c-language-and-js from holograph-pr-d-production-polish


Summary

Fourth slice of the Holograph stack, stacked on PR-C (#841). Production-polish round addressing the v1 substrate's known soft spots from SPIKE §2.5 exit-checks and the Wake-18 dispatch.

  • D1 — `KvOpStore::open` retries on sled lock contention (50/100/200/400/800ms backoff, 5 attempts, ~1.55s budget) with a best-effort stale-lock cleanup after the first failed attempt. Two concurrent `HolographSpace::new` calls against the same data directory no longer race to a hard error.
  • D2 — `SpaceConfig.fetch_fallback_policy: FetchFallbackPolicy` lifts the previously-implicit `fallback_timeout` + `max_retry_peers` knobs into one structured policy (`initial_timeout` / `max_attempts` / `retry_budget`). `fallback_pass` now round-robins through all arc-overlapping peers within a single tick instead of one-peer-per-tick.
  • D3 — `HolographSpace::shutdown()`: flag-flip → stop watcher → drain queue (10s timeout) → sled flush → transport teardown. `Drop` is the safety net: best-effort sync flush, logs on error, never panics.
  • D4 — New integration test `tests/restart_survives.rs`: 100 ops across 3 logical agents, full shutdown / reopen / retrieve round-trip. Formalizes SPIKE §2.5 exit-check #6 ("restart survives state via sled") at substrate scale.
  • D5 — When fetch-fallback exhausts `max_attempts` OR `retry_budget` for a pending entry, the entry is dropped and `NotifyUp::notify_parent_fetch_permanent_failure(op_id, missing_parents, last_error)` fires. The trait method has a default no-op so older `NotifyUp` impls compile.
  • D6 — `SpaceConfig.iroh_relay_url: Option<String>` plus a `resolve_iroh_relay()` helper that reads `HOLOGRAPH_IROH_RELAY` (preferred) or `HOLOGRAPH_IROH_RELAY_URL` (back-compat alias). `HolographSpace::new` resolves from env if the field is `None` and folds the value into `space.config()` so consumers don't have to re-read the env. New `rust-executor/crates/holograph/README.md` with a "Configuration" section + env-vars table.

Test plan

  • `cargo test -p holograph --release -- --test-threads=1` — 53 tests green (47 lib + 4 pdiff_parity + 1 restart_survives + 1 two_node)
  • `cargo check -p ad4m-executor` — clean (with baseline `CUSTOM_DENO_SNAPSHOT.bin` + `dapp/dist` artifacts in place; same baseline as PR-C / `dev`)
  • `cargo fmt --all --check` — clean (after `a4a755840` fmt fixup)
  • JS multi-conductor regression — unchanged surface from PR-C; CI will produce the result

Commits

```
9e2cade feat(holograph): retry sled lock contention in KvOpStore::open (Wake-18 D1)
feb5a66 feat(holograph): fetch-fallback policy + round-robin + permanent-failure event (Wake-18 D2/D5)
32c0d4f feat(holograph): graceful shutdown + sync-flush Drop (Wake-18 D3)
dfa9358 test(holograph): restart_survives_state — 100 ops, 3 agents, full round-trip (Wake-18 D4)
8db80bf feat(holograph): HOLOGRAPH_IROH_RELAY env config surface + README (Wake-18 D6)
a4a7558 chore: cargo fmt --all
```

Notable behavioural changes for reviewers

  • Short name wins on env: `HOLOGRAPH_IROH_RELAY` takes precedence over the older `HOLOGRAPH_IROH_RELAY_URL`. Existing deployments setting only the long name keep working unchanged.
  • NotifyUp surface widened with one default-impl method (`notify_parent_fetch_permanent_failure`). Existing impls in the wires layer compile unchanged.
  • `IntegrationQueueConfig` / `HolographSpaceConfig` swapped `fallback_timeout` + `max_retry_peers` for `fallback_policy`. Callers inside this crate were updated; `holograph_wires.rs` is not affected (it doesn't construct queue configs directly).
  • `on_local_commit` after `shutdown` returns `K2Error::other("... shutdown in progress")`. Callers should treat this as a terminal signal, not a retry.

Stacked split

🤖 Generated with Claude Code

data-bot-coasys and others added 6 commits June 4, 2026 04:25
…18 D1)

When two HolographSpace instances race to open the same data directory
(e.g. a stuck node still draining + a fresh start, or a test harness
re-opening a tempdir), sled's exclusive file lock causes the second open
to fail immediately. Retry with exponential backoff so the second open
waits up to ~1.55s for the first holder to drop.

- Backoff schedule: 50, 100, 200, 400, 800ms (5 attempts).
- `is_lock_contention` recognises both the platform-typed io kinds
  (`WouldBlock`, `AlreadyExists`) and sled's wrapped form
  (`kind: Other` with "could not acquire lock" in the message). sled
  0.34 takes the wrapping path on Linux/macOS so the message check is
  the load-bearing match.
- Best-effort stale-lock recovery: after one failed attempt the loop
  removes any leftover `db/.lock` file once. POSIX advisory locks
  already die with the owning process so this is mostly belt+braces,
  but it covers the case of a crashed previous run that left the file
  behind.

Test `second_open_retries_until_first_drops` spawns two opens against
the same path; the first drops after 200ms, the second succeeds within
the backoff window.
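
The retry loop described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: `open_with_retry`, its three callback parameters, and the choice of one initial try followed by up to five delayed retries are assumptions (the commit message does not say whether the first try counts among the 5 attempts). Only the schedule, the matched error kinds, and the "could not acquire lock" message come from the text.

```rust
use std::io;
use std::time::Duration;

/// Backoff schedule quoted in the commit message, in milliseconds.
pub const BACKOFF_MS: [u64; 5] = [50, 100, 200, 400, 800];

/// Treats an error as lock contention if it has a platform-typed kind
/// or carries sled's wrapped message text.
pub fn is_lock_contention(err: &io::Error) -> bool {
    matches!(
        err.kind(),
        io::ErrorKind::WouldBlock | io::ErrorKind::AlreadyExists
    ) || err.to_string().contains("could not acquire lock")
}

/// Hypothetical helper: `open` stands in for the real store open,
/// `cleanup_stale_lock` for the one-shot lock-file removal, and
/// `sleep` is injected so tests do not have to wait.
pub fn open_with_retry<T>(
    mut open: impl FnMut() -> io::Result<T>,
    mut cleanup_stale_lock: impl FnMut(),
    mut sleep: impl FnMut(Duration),
) -> io::Result<T> {
    // Initial try: anything other than contention is returned as-is.
    let mut last_err = match open() {
        Ok(value) => return Ok(value),
        Err(e) if is_lock_contention(&e) => e,
        Err(e) => return Err(e),
    };
    for (i, ms) in BACKOFF_MS.iter().enumerate() {
        if i == 0 {
            // Best-effort cleanup, performed once after the first failure.
            cleanup_stale_lock();
        }
        sleep(Duration::from_millis(*ms));
        match open() {
            Ok(value) => return Ok(value),
            Err(e) if is_lock_contention(&e) => last_err = e,
            Err(e) => return Err(e),
        }
    }
    Err(last_err)
}
```

In production use the `sleep` argument would simply be `std::thread::sleep`; injecting it keeps the schedule testable without real delays.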

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ure event (Wake-18 D2/D5)

D2 — `SpaceConfig::fetch_fallback_policy: FetchFallbackPolicy` lifts the
previously-implicit `fallback_timeout` + `max_retry_peers` knobs into one
structured policy with three fields:

  initial_timeout (default 5s)  — grace before fallback kicks in
  max_attempts    (default 3)    — peer cap across entry lifetime
  retry_budget    (default 30s)  — wall-clock cap from first_seen

`fallback_pass` now round-robins through arc-overlapping peers within a
single tick instead of one-peer-per-tick, with the caps enforced on a
per-entry basis. The pending tree's `tried_peers` field is preserved
across watcher ticks so a peer that K2 already accepted a request_ops
against is never re-picked.
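
A rough sketch of how the three policy fields could drive a single fallback tick follows. The field names and defaults are taken from the table above; the `Verdict` enum, the `fallback_pass` signature, the string peer ids, and the exact comparison operators (`>=`) are assumptions made for illustration and may differ from the crate.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FetchFallbackPolicy {
    pub initial_timeout: Duration,
    pub max_attempts: usize,
    pub retry_budget: Duration,
}

impl Default for FetchFallbackPolicy {
    fn default() -> Self {
        Self {
            initial_timeout: Duration::from_secs(5),
            max_attempts: 3,
            retry_budget: Duration::from_secs(30),
        }
    }
}

/// Hypothetical outcome of one watcher tick for one pending entry.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Wait,
    Fallback(Vec<String>),
    PermanentFailure,
}

/// `age` is the time since the entry was first seen; `tried_peers`
/// persists across ticks so no peer is picked twice.
pub fn fallback_pass(
    policy: &FetchFallbackPolicy,
    age: Duration,
    tried_peers: &mut Vec<String>,
    overlapping_peers: &[&str],
) -> Verdict {
    if age < policy.initial_timeout {
        return Verdict::Wait;
    }
    let attempts_exhausted = tried_peers.len() >= policy.max_attempts;
    let budget_exhausted = age >= policy.retry_budget;
    if attempts_exhausted || budget_exhausted {
        return Verdict::PermanentFailure;
    }
    // Round-robin within this tick: take every untried peer up to the cap.
    let remaining = policy.max_attempts - tried_peers.len();
    let picked: Vec<String> = overlapping_peers
        .iter()
        .copied()
        .filter(|p| !tried_peers.iter().any(|t| t.as_str() == *p))
        .take(remaining)
        .map(|p| p.to_string())
        .collect();
    tried_peers.extend(picked.iter().cloned());
    Verdict::Fallback(picked)
}
```

With the default policy and peers `bob` and `charlie`, a tick at 6s reaches both in one pass, which mirrors what `fallback_round_robins_multiple_peers_in_one_tick` is described as checking.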

D5 — when either cap is hit (`attempts_exhausted || budget_exhausted`)
the entry is dropped from the pending tree and a new
`NotifyUp::notify_parent_fetch_permanent_failure(op_id,
missing_parents, last_error)` fires. The trait method has a default
no-op so older `NotifyUp` impls keep compiling; in tests the mock
records every call for assertion.
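
The default-method pattern that keeps older implementations compiling looks roughly like this. Only the method name `notify_parent_fetch_permanent_failure` and its three argument names come from the commit message; the argument types, the `notify_integrated` method, and both notifier structs are hypothetical stand-ins.

```rust
use std::cell::RefCell;

pub trait NotifyUp {
    /// Hypothetical pre-existing required method, for illustration only.
    fn notify_integrated(&self, op_id: &str);

    /// New method with a default no-op body, so existing impls that do
    /// not mention it still satisfy the trait.
    fn notify_parent_fetch_permanent_failure(
        &self,
        _op_id: &str,
        _missing_parents: &[String],
        _last_error: &str,
    ) {
    }
}

/// An "older" impl written before the new method existed.
pub struct LegacyNotifier;

impl NotifyUp for LegacyNotifier {
    fn notify_integrated(&self, _op_id: &str) {}
}

/// A test mock that records every permanent-failure call.
#[derive(Default)]
pub struct RecordingNotifier {
    pub failures: RefCell<Vec<(String, Vec<String>, String)>>,
}

impl NotifyUp for RecordingNotifier {
    fn notify_integrated(&self, _op_id: &str) {}

    fn notify_parent_fetch_permanent_failure(
        &self,
        op_id: &str,
        missing_parents: &[String],
        last_error: &str,
    ) {
        self.failures.borrow_mut().push((
            op_id.to_string(),
            missing_parents.to_vec(),
            last_error.to_string(),
        ));
    }
}
```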

Wiring:
- `IntegrationQueueConfig`: dropped `fallback_timeout` /
  `max_retry_peers`, added `fallback_policy`.
- `HolographSpaceConfig`: dropped the same two fields. The policy
  threads in from `cfg.config.fetch_fallback_policy` instead.
- `SpaceConfig::full_replication_single_doc()` now includes the
  default fetch-fallback policy.

Tests (new):
- `fallback_round_robins_multiple_peers_in_one_tick` — picker yields
  bob then charlie in one tick; verifies both are reached.
- `fallback_bounded_by_max_attempts` — renamed from
  `fallback_bounded_by_max_retry_peers`, now also asserts the entry
  is dropped and the permanent-failure notification fires once.

All 45 holograph lib tests + 4 pdiff_parity + 1 two_node green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`HolographSpace::shutdown()` does:
  1. Set the shutdown flag — `on_local_commit` rejects new commits with
     "shutdown in progress."
  2. Stop the integration queue's fallback watcher.
  3. Drain the queue: poll `pending_len()` until 0 or 10s timeout. Returns
     the final pending count so the caller can detect partial drains.
  4. `KvOpStore::flush_async()` so the on-disk snapshot is durable.
  5. `LocalCommitTarget::close()` for transport teardown (new trait
     method with a default no-op; production `K2DynSpaceTarget` can
     override to close the iroh DynSpace).

`Drop for HolographSpace` is the safety net for "process exit before
shutdown was called": sets the shutdown flag, stops the watcher, and
runs `KvOpStore::flush_blocking()`. Errors are logged but never panic
— a panicking Drop during unwinding aborts the process.
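
The ordering of the five steps and the non-panicking `Drop` can be illustrated with a synchronous, std-only sketch. The real `shutdown()` is described as async and talks to sled and the transport; here `SpaceSketch`, its `steps` log, and the always-successful `flush_blocking` are placeholders invented for the example, and the drain timeout is a parameter rather than the fixed 10s.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

pub struct SpaceSketch {
    shutting_down: AtomicBool,
    pub pending: AtomicUsize,
    pub steps: Mutex<Vec<&'static str>>,
}

impl SpaceSketch {
    pub fn new(pending: usize) -> Self {
        Self {
            shutting_down: AtomicBool::new(false),
            pending: AtomicUsize::new(pending),
            steps: Mutex::new(Vec::new()),
        }
    }

    // Records a step without panicking, even if the mutex is poisoned.
    fn record(&self, step: &'static str) {
        if let Ok(mut steps) = self.steps.lock() {
            steps.push(step);
        }
    }

    pub fn on_local_commit(&self) -> Result<(), String> {
        if self.shutting_down.load(Ordering::SeqCst) {
            return Err("shutdown in progress".to_string());
        }
        Ok(())
    }

    // Placeholder for the store flush; always succeeds in this sketch.
    fn flush_blocking(&self) -> Result<(), String> {
        self.record("flush");
        Ok(())
    }

    /// Returns the pending count left after the drain window.
    pub fn shutdown(&self, drain_timeout: Duration) -> Result<usize, String> {
        self.shutting_down.store(true, Ordering::SeqCst); // 1. flag flip
        self.record("stop_watcher"); // 2. stop the fallback watcher
        let deadline = Instant::now() + drain_timeout; // 3. drain
        while self.pending.load(Ordering::SeqCst) > 0 && Instant::now() < deadline {
            thread::sleep(Duration::from_millis(5));
        }
        let remaining = self.pending.load(Ordering::SeqCst);
        self.flush_blocking()?; // 4. durable flush
        self.record("close_transport"); // 5. transport teardown
        Ok(remaining)
    }
}

impl Drop for SpaceSketch {
    fn drop(&mut self) {
        // Safety net: log and continue, never panic.
        self.shutting_down.store(true, Ordering::SeqCst);
        self.record("stop_watcher");
        if let Err(e) = self.flush_blocking() {
            eprintln!("flush on drop failed: {e}");
        }
    }
}
```

Returning the remaining count lets a caller tell a clean drain (`Ok(0)`) from a partial one, and a commit attempted afterwards gets the terminal "shutdown in progress" error rather than something worth retrying.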

New surface:
- `KvOpStore::flush_async` + `flush_blocking` — public so tests +
  smoketests can verify durability without going through the trait.
- `LocalCommitTarget::close` — default `async { Ok(()) }`.

Test `shutdown_flushes_and_rejects_new_commits` exercises the rejection
path + drain + flush. 46 holograph lib tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nd-trip (Wake-18 D4)

Formalizes SPIKE §2.5 exit-check #6 ("restart survives state via sled")
at the HolographSpace integration layer. The unit-level
`state_persists_across_reopen` covers a single op; this exercises the
substrate end-to-end:

  1. Open HolographSpace at path P (fresh tempdir).
  2. Commit 100 envelopes — three logical agents round-robin via
     distinct author tags so the op-ids cover the id space rather than
     clustering.
  3. `space.shutdown()` flushes sled.
  4. Reopen `KvOpStore` at the same path.
  5. Assert op_count == 100, every op_id retrievable, every op_data
     byte-for-byte identical to the committed envelope.
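
The shape of that round-trip can be shown with a std-only stand-in. This is not the actual test: `FileStore` is a hypothetical file-per-op store used in place of the sled-backed `KvOpStore`, `op_id` is an invented hash-based id, and dropping the store plays the role of `space.shutdown()`. The agent names are placeholders.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a persistent op store: one file per op.
struct FileStore {
    dir: PathBuf,
}

impl FileStore {
    fn open(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { dir: dir.to_path_buf() })
    }

    fn put(&self, id: u64, data: &[u8]) -> io::Result<()> {
        fs::write(self.dir.join(format!("{id:016x}")), data)
    }

    fn get(&self, id: u64) -> io::Result<Vec<u8>> {
        fs::read(self.dir.join(format!("{id:016x}")))
    }

    fn op_count(&self) -> io::Result<usize> {
        Ok(fs::read_dir(&self.dir)?.count())
    }
}

/// Invented op id: a hash of the author tag and sequence number.
fn op_id(author: &str, seq: usize) -> u64 {
    let mut hasher = DefaultHasher::new();
    (author, seq).hash(&mut hasher);
    hasher.finish()
}

/// Commits 100 ops across three agents, reopens the store, and returns
/// (ops found on disk, ops whose bytes match what was committed).
pub fn restart_round_trip(dir: &Path) -> io::Result<(usize, usize)> {
    let agents = ["alice", "bob", "charlie"];
    let mut committed = Vec::new();
    {
        let store = FileStore::open(dir)?;
        for seq in 0..100 {
            let author = agents[seq % agents.len()];
            let data = format!("{author}:{seq}").into_bytes();
            let id = op_id(author, seq);
            store.put(id, &data)?;
            committed.push((id, data));
        }
    } // store dropped here, standing in for shutdown + flush

    let reopened = FileStore::open(dir)?;
    let mut matching = 0;
    for (id, data) in &committed {
        if &reopened.get(*id)? == data {
            matching += 1;
        }
    }
    Ok((reopened.op_count()?, matching))
}
```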

Also exposes `FetchFallbackPolicy` from the crate root since the test
sets its own (the 60s/1-attempt/60s policy ensures the test isn't
sensitive to the timing of the fallback watcher during the commit
phase).

1/1 green in `tests/restart_survives.rs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ke-18 D6)

Lift the iroh-relay URL from an ad-hoc env read in `holograph_wires.rs`
into a structured `SpaceConfig.iroh_relay_url: Option<String>` field
plus a `resolve_iroh_relay()` helper.

- `HolographSpace::new` resolves the field from env if it's `None`:
  `HOLOGRAPH_IROH_RELAY` is preferred, `HOLOGRAPH_IROH_RELAY_URL` is
  the back-compat fallback for the existing wires-file code that
  reads the longer name. Whitespace-only env values are treated as
  unset. The resolved value is folded back into `space.config()` so
  downstream consumers don't have to re-read the env.
- New `rust-executor/crates/holograph/README.md` — single
  "Configuration" section with the env-vars table and a
  programmatic-overrides example. Wake-18's load-bearing
  documentation entry.

Test `resolve_iroh_relay_prefers_short_name` covers all four
permutations (neither set, long only, both set, whitespace-only)
with a module-local mutex to serialize against the process-global env.
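
The precedence rule can be written as a pure function over a lookup closure, which sidesteps the process-global env in tests. The two variable names and the "whitespace-only counts as unset" rule come from the text above; `resolve_iroh_relay_with`, the trimming of the returned value, and the no-argument signature of `resolve_iroh_relay` are assumptions of this sketch.

```rust
/// Hypothetical pure core: `lookup` returns the raw value for a
/// variable name, or `None` if it is not set.
pub fn resolve_iroh_relay_with(lookup: impl Fn(&str) -> Option<String>) -> Option<String> {
    // Short name first, long name as the back-compat fallback.
    for name in ["HOLOGRAPH_IROH_RELAY", "HOLOGRAPH_IROH_RELAY_URL"] {
        if let Some(value) = lookup(name) {
            let trimmed = value.trim();
            // Whitespace-only values are treated as unset.
            if !trimmed.is_empty() {
                return Some(trimmed.to_string());
            }
        }
    }
    None
}

/// Thin wrapper that reads the real process environment.
pub fn resolve_iroh_relay() -> Option<String> {
    resolve_iroh_relay_with(|name| std::env::var(name).ok())
}
```

A config constructor would then keep an explicitly set `iroh_relay_url` and fall back to `resolve_iroh_relay()` only when the field is `None`.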

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wake-18 fmt fixup for the D2/D3/D4/D6 commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 4, 2026


Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 028c53be-0b16-467d-989f-6a2850a970c0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.
