Skip to content

fix(store): graceful recovery from an incompatible store format (ADR-009)#49

Merged
RioPlay merged 4 commits into
mainfrom
fix/store-format-recovery
Jun 20, 2026
Merged

fix(store): graceful recovery from an incompatible store format (ADR-009)#49
RioPlay merged 4 commits into
mainfrom
fix/store-format-recovery

Conversation

@RioPlay

@RioPlay RioPlay commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

Implements ADR-009. Closes the fjall 2.11→3 upgrade hazard found while testing the post-#44 release: every existing V2 store hard-errored InvalidVersion on open (a user's first command after upgrading failed), and the obvious recovery (delete store/, re-gen) produced a silently empty graph that status/doctor still called "100/100 healthy".

Recovery is now automatic and seamless across every store entry point — gen, all reads, and heal — with zero user action.

What changed

  • StoreError::IncompatibleVersion — mapped from fjall's typed Error::InvalidVersion in the From impl. Every other fjall error stays on the generic Io path, so a transient I/O failure is never mistaken for a recoverable format change.
  • gen — on IncompatibleVersion, wipe the (rebuildable, ADR-003) store and rebuild from source with a one-line notice. A pinned/shared $ADEN_STORE is never auto-wiped (surfaces an actionable error instead, mirroring regen).
  • Freshness follows the store — a non-existent/empty/just-recovered store forces a full re-scan instead of trusting the stale gen-cache.json, closing the silent-empty trap. (rm -rf store/ && gen now rebuilds correctly and preserves the embeddings cache.)
  • reads + heal — route through a shared recover_if_incompatible_store helper: probe the open and, only on IncompatibleVersion, rebuild silently. heal uses the recovery-only helper (not a staleness-gen), so a normal heal still surfaces real drift.

The originally-proposed gen --rebuild flag was droppedregen already provides a discoverable clean rebuild, and the empty-store guard covers the trap (see ADR-009 part 3).

Safety (the guardrail)

Auto-wipe fires only on the version-mismatch signal, never on Io/corruption/permission errors, and never on a pinned/shared store. Nothing the user authored is touched: source is never modified, on-disk overlays live outside the store and survive, and the embeddings cache is preserved.

Validation (against a real V2 store)

  • gen on a genuine V2 store → auto-rebuilds to a populated 3280-anchor graph (was a hard error).
  • A pure read (understand) on an incompatible store → auto-recovers (was "No symbol found").
  • rm -rf store/ && gen → rebuilds (was "Stored 0 contracts / 230 skipped").
  • heal first-after-upgrade → recovers (was silent no-op).
  • Healthy store still skips incrementally (no regression); embeddings cache preserved.

A blast-radius pass (via aden) confirmed no StoreError consumer needs the new variant and surfaced the heal gap, which this PR closes.

Test plan

  • cargo test --workspace green (new aden-store mapping unit test + corrupt-version-marker integration test)
  • cargo clippy --workspace clean
  • cargo fmt --all applied
  • End-to-end validated against a real V2 store (above)

License checklist

  • No dependency changes — license checklist not applicable

RioPlay (Ernest Hamblen) and others added 4 commits June 20, 2026 15:00
Design note for the fjall-3 migration hazard found while testing the
post-#44 release: no graceful recovery on a store-format mismatch
(InvalidVersion), and freshness decoupled from the store so the obvious
recovery re-gens to a silently empty graph. Proposes a typed
IncompatibleVersion error, version-mismatch-only auto-rebuild on the
create path, and freshness-follows-store (empty-store guard + gen
--rebuild). Status: proposed; LSM-path implementation gated on sign-off.
The fjall 2.11→3 upgrade (#44) changed the on-disk format with no
migration: every existing V2 store hard-errored InvalidVersion on open,
so a user's first command after upgrading failed. Deleting just store/
and re-genning made it worse — gen trusted the per-file mtime cache and
skipped everything, producing a silently EMPTY graph that status/doctor
still called "100/100 healthy".

Recovery is now automatic and seamless across every store entry point:

- aden-store: a new StoreError::IncompatibleVersion, mapped from fjall's
  typed Error::InvalidVersion in the From impl. Every OTHER fjall error
  stays on the generic Io path, so a transient I/O failure can never be
  mistaken for a recoverable format change.
- gen (cmd_gen_inner): on IncompatibleVersion, wipe the (rebuildable,
  ADR-003) store and rebuild from source with a one-line notice. Scoped
  strictly to that signal; a generic Io error still hard-fails without
  wiping. A pinned/shared $ADEN_STORE is never auto-wiped — it surfaces
  an actionable error instead (mirrors regen's caution).
- freshness follows the store: a non-existent/empty store (first gen,
  rm -rf store/, regen, or a just-recovered store) forces a full
  re-scan instead of trusting the stale gen-cache, closing the
  silent-empty trap.
- reads (ensure_fresh) and heal (cmd_heal_scan/_since) route through a
  shared recover_if_incompatible_store helper: probe the open and, on
  IncompatibleVersion only, rebuild silently — so grep/ask/understand
  and heal auto-recover instead of degrading to empty results. heal
  uses the recovery-only helper (NOT a staleness-gen), so a normal heal
  still surfaces real drift.

The proposed `gen --rebuild` flag was dropped: `regen` already provides a
discoverable clean rebuild, and the empty-store guard makes a plain
`rm -rf store/ && gen` rebuild correctly while preserving the expensive
embeddings cache.

Tests: aden-store unit test pins the InvalidVersion→IncompatibleVersion
mapping (and Io stays Io); an integration test corrupts the on-disk
version marker and asserts the typed error surfaces. Validated
end-to-end against a real V2 store (gen, reads, and heal all recover;
healthy stores still skip incrementally; embeddings survive). ADR-009
accepted.
A multi-dimension review of the recovery change surfaced eight findings;
the load-bearing ones are fixed here:

- HIGH (recover_if_incompatible_store returned `true` even when the
  rebuild FAILED): callers (ensure_fresh, heal) treated the store as
  recovered, skipped their own logic, and degraded to empty results — most
  concretely the pinned/shared $ADEN_STORE case, whose error was also
  swallowed. Now returns `true` only on a successful rebuild and prints the
  failure (e.g. the pinned-store refusal) so it is never silent.
- MED (single-file gen lost the graph): `gen <file>` on an incompatible
  store wiped the whole store but re-indexed only that one file, leaving a
  near-empty graph. The recovery arm now expands a single-file gen to the
  full project.
- MED (cmd_heal_apply uncovered): the one heal entry point without a
  recovery guard emitted a misleading "run `aden gen` first" error on an
  incompatible store. Now guarded like the scan paths.
- LOW (test temp-dir leak): the corrupt-version-marker test now cleans up
  before asserting.

Adds hermetic integration tests (CARGO_BIN_EXE_aden + isolated
ADEN_DATA_DIR) covering read-path auto-recovery and the gen recovery +
empty-store guard end-to-end.
@RioPlay RioPlay merged commit 95c8008 into main Jun 20, 2026
6 checks passed
@RioPlay RioPlay deleted the fix/store-format-recovery branch June 20, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant