fix(store): graceful recovery from an incompatible store format (ADR-009)#49
Merged
Conversation
Design note for the fjall-3 migration hazard found while testing the post-#44 release: no graceful recovery on a store-format mismatch (InvalidVersion), and freshness decoupled from the store so the obvious recovery re-gens to a silently empty graph. Proposes a typed IncompatibleVersion error, version-mismatch-only auto-rebuild on the create path, and freshness-follows-store (empty-store guard + gen --rebuild). Status: proposed; LSM-path implementation gated on sign-off.
The fjall 2.11→3 upgrade (#44) changed the on-disk format with no migration: every existing V2 store hard-errored InvalidVersion on open, so a user's first command after upgrading failed. Deleting just store/ and re-genning made it worse — gen trusted the per-file mtime cache and skipped everything, producing a silently EMPTY graph that status/doctor still called "100/100 healthy". Recovery is now automatic and seamless across every store entry point: - aden-store: a new StoreError::IncompatibleVersion, mapped from fjall's typed Error::InvalidVersion in the From impl. Every OTHER fjall error stays on the generic Io path, so a transient I/O failure can never be mistaken for a recoverable format change. - gen (cmd_gen_inner): on IncompatibleVersion, wipe the (rebuildable, ADR-003) store and rebuild from source with a one-line notice. Scoped strictly to that signal; a generic Io error still hard-fails without wiping. A pinned/shared $ADEN_STORE is never auto-wiped — it surfaces an actionable error instead (mirrors regen's caution). - freshness follows the store: a non-existent/empty store (first gen, rm -rf store/, regen, or a just-recovered store) forces a full re-scan instead of trusting the stale gen-cache, closing the silent-empty trap. - reads (ensure_fresh) and heal (cmd_heal_scan/_since) route through a shared recover_if_incompatible_store helper: probe the open and, on IncompatibleVersion only, rebuild silently — so grep/ask/understand and heal auto-recover instead of degrading to empty results. heal uses the recovery-only helper (NOT a staleness-gen), so a normal heal still surfaces real drift. The proposed `gen --rebuild` flag was dropped: `regen` already provides a discoverable clean rebuild, and the empty-store guard makes a plain `rm -rf store/ && gen` rebuild correctly while preserving the expensive embeddings cache. Tests: aden-store unit test pins the InvalidVersion→IncompatibleVersion mapping (and Io stays Io); an integration test corrupts the on-disk version marker and asserts the typed error surfaces. Validated end-to-end against a real V2 store (gen, reads, and heal all recover; healthy stores still skip incrementally; embeddings survive). ADR-009 accepted.
A multi-dimension review of the recovery change surfaced eight findings; the load-bearing ones are fixed here: - HIGH (recover_if_incompatible_store returned `true` even when the rebuild FAILED): callers (ensure_fresh, heal) treated the store as recovered, skipped their own logic, and degraded to empty results — most concretely the pinned/shared $ADEN_STORE case, whose error was also swallowed. Now returns `true` only on a successful rebuild and prints the failure (e.g. the pinned-store refusal) so it is never silent. - MED (single-file gen lost the graph): `gen <file>` on an incompatible store wiped the whole store but re-indexed only that one file, leaving a near-empty graph. The recovery arm now expands a single-file gen to the full project. - MED (cmd_heal_apply uncovered): the one heal entry point without a recovery guard emitted a misleading "run `aden gen` first" error on an incompatible store. Now guarded like the scan paths. - LOW (test temp-dir leak): the corrupt-version-marker test now cleans up before asserting. Adds hermetic integration tests (CARGO_BIN_EXE_aden + isolated ADEN_DATA_DIR) covering read-path auto-recovery and the gen recovery + empty-store guard end-to-end.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements ADR-009. Closes the fjall 2.11→3 upgrade hazard found while testing the post-#44 release: every existing V2 store hard-errored
InvalidVersionon open (a user's first command after upgrading failed), and the obvious recovery (deletestore/, re-gen) produced a silently empty graph thatstatus/doctorstill called "100/100 healthy".Recovery is now automatic and seamless across every store entry point — gen, all reads, and heal — with zero user action.
What changed
StoreError::IncompatibleVersion— mapped from fjall's typedError::InvalidVersionin theFromimpl. Every other fjall error stays on the genericIopath, so a transient I/O failure is never mistaken for a recoverable format change.IncompatibleVersion, wipe the (rebuildable, ADR-003) store and rebuild from source with a one-line notice. A pinned/shared$ADEN_STOREis never auto-wiped (surfaces an actionable error instead, mirroringregen).gen-cache.json, closing the silent-empty trap. (rm -rf store/ && gennow rebuilds correctly and preserves the embeddings cache.)recover_if_incompatible_storehelper: probe the open and, only onIncompatibleVersion, rebuild silently. heal uses the recovery-only helper (not a staleness-gen), so a normal heal still surfaces real drift.The originally-proposed
gen --rebuildflag was dropped —regenalready provides a discoverable clean rebuild, and the empty-store guard covers the trap (see ADR-009 part 3).Safety (the guardrail)
Auto-wipe fires only on the version-mismatch signal, never on
Io/corruption/permission errors, and never on a pinned/shared store. Nothing the user authored is touched: source is never modified, on-disk overlays live outside the store and survive, and the embeddings cache is preserved.Validation (against a real V2 store)
genon a genuine V2 store → auto-rebuilds to a populated 3280-anchor graph (was a hard error).understand) on an incompatible store → auto-recovers (was "No symbol found").rm -rf store/ && gen→ rebuilds (was "Stored 0 contracts / 230 skipped").healfirst-after-upgrade → recovers (was silent no-op).A blast-radius pass (via aden) confirmed no
StoreErrorconsumer needs the new variant and surfaced the heal gap, which this PR closes.Test plan
cargo test --workspacegreen (new aden-store mapping unit test + corrupt-version-marker integration test)cargo clippy --workspacecleancargo fmt --allappliedLicense checklist