fix(celo-reth): correct import-celo-state + add celo-migrate-v2 (storage v1→v2)#198
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 07569efe61
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
07569ef to
a18a516
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e43509fa7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for block in first_block..=tip { | ||
| let mut entries = Vec::new(); | ||
|
|
||
| while let Some(Ok((block_number, _))) = walker.peek() { | ||
| if *block_number != block { | ||
| break; | ||
| } | ||
| let (_, entry) = walker.next().expect("peeked")?; | ||
| entries.push(entry); | ||
| } | ||
|
|
||
| count += entries.len() as u64; | ||
| writer.append_account_changeset(entries, block)?; | ||
| } | ||
|
|
||
| writer.commit()?; |
There was a problem hiding this comment.
Is a single commit for the whole block range acceptable here, or do we need to chunk the commits as in 4269c58?
| for block in first_block..=tip { | ||
| let mut entries = Vec::new(); | ||
|
|
||
| while let Some(Ok((key, _))) = walker.peek() { | ||
| if key.block_number() != block { | ||
| break; | ||
| } | ||
| let (key, entry) = walker.next().expect("peeked")?; | ||
| entries.push(StorageBeforeTx { | ||
| address: key.address(), | ||
| key: entry.key, | ||
| value: entry.value, | ||
| }); | ||
| } | ||
|
|
||
| count += entries.len() as u64; | ||
| writer.append_storage_changeset(entries, block)?; | ||
| } | ||
|
|
||
| writer.commit()?; |
Upstream `reth db migrate-v2` clears the recomputable MDBX tables (senders, indices, trie, plain state) and resets the SenderRecovery / TransactionLookup / IndexAccountHistory / IndexStorageHistory / MerkleExecute / MerkleUnwind stage checkpoints to 0, on the assumption that the pipeline can rebuild from block 0. For a Celo datadir bootstrapped via `import-celo-state`, blocks 1..31_056_499 are header-only dummies with no bodies. After the upstream migrate-v2, the pipeline fails on its first SenderRecovery iteration with "block meta not found for block #1". Manually setting the cleared stages back to the migration block then panics on a static-file <> database consistency check because TransactionSenders is empty. This commit adds `celo-reth celo-migrate-v2` as a sibling to the existing `import-celo-state` interception. It runs the upstream changesets/receipts migration + storage-settings flip + MDBX compaction unchanged, but skips Phase 4 (clear_recomputable_tables) entirely: on a Celo-imported datadir those tables are already in the correct post-migration state (empty for the dummy range, populated at and beyond the migration header) and the stages are already at the migration block. The four upstream helper functions (migrate_account_changesets, migrate_storage_changesets, migrate_receipts, compact_mdbx / swap_compacted_db) are private associated functions on `reth_cli_commands::db::migrate_v2::Command`, so they are copied verbatim into `celo_migrate_v2.rs` with a note to keep them in sync when the reth pin is bumped.
A single `ensure_at_block(31_056_500)` call from an empty segment grew the static-file writer's in-memory state past 18 GB and deadlocked the final commit on a Celo Mainnet datadir, leaving the migration hung in the seed step. The writer accumulates per-block segment metadata until the next commit, and a single 31M-iteration call accumulated more than the writer could flush in one shot. Commit every 500 000 blocks (one segment file's worth) so the in-memory metadata is bounded and freed between chunks. Also log progress per chunk with elapsed + ETA so the operator has a usable signal during the multi- hour seed phase. Also: add this branch to the dev-image push workflow so CI builds an image we can test on fsn-1 without going through PR review. To revert before merging.
EnvironmentArgs::init ends with a call to init_genesis_with_settings, which
writes the chainspec's genesis allocation into PlainAccountState,
PlainStorageState, HashedAccounts, HashedStorages, AccountsHistory,
StoragesHistory and the trie tables. For Celo Mainnet that alloc includes
the Registry contract (0x000000000000000000000000000000000000ce10) with one
storage slot at key
0xb53127684a568b3173ae13b9f8a6016e243e63b6e8ee1178d6a717850b5d6103.
The L1 final state dump consumed by init_from_state_dump (via
scripts/append_l2_allocs.py) also writes that exact Registry slot, with a
different value. Both writes land in HashedStorages for the same
(keccak(0xce10), keccak(0xb53127…)) pair, and DbDupCursorRW::upsert on
HashedStorages appends a second dup-record instead of replacing. The
subsequent compute_state_root iterates HashedStorages by (outer_key,
subkey) and encounters the duplicate, panicking in
alloy_trie::hash_builder::HashBuilder::add_leaf with 'key == self.key' and
leaving the datadir unusable.
Replace the env_args.init() call with a local helper that performs the
same setup but emits only the non-state side-effects of init_genesis:
* StorageSettings (so the provider knows the chosen layout — v1 here);
* the canonical genesis header (so the Headers static-file segment has
block 0, which setup_without_evm needs as a prerequisite for
append_header(1, B256::ZERO));
* stage checkpoints at the genesis block number (so setup_without_evm
can advance them to the migration block);
* empty block ranges for the Receipts / Transactions static-file
segments (so writers that grow these later don't trip on missing
block bounds).
It deliberately does NOT call insert_genesis_state, insert_genesis_hashes,
insert_genesis_history, or compute_state_root — those are the four steps
inside init_genesis_with_settings that touch the alloc and produce the
duplicate.
Reproduced empirically on fsn-1 with the full L1 dump
(https://fsn1.your-objectstorage.com/celo/l1-final-state.json.zst): the
unpatched import crashes during state-root computation after 26+ minutes
with the exact 'add_leaf key == self.key' panic on the Registry storage
slot above; with this patch the same import is expected to complete
cleanly. The HashedStorages discrepancy (360,672,585 entries vs
PlainStorageState 360,672,584) collapses to zero.
Refs #192.
…_l2_allocs op-geth's StateDB.Commit deletes storage slots stored at value 0 (SSTORE to 0 deletes the slot from the trie), whereas reth's init_from_state_dump writes every entry from the input dump verbatim into HashedStorages. The canonical L2 allocs file at storage.googleapis.com/cel2-rollup-files/celo/l2-allocs.json contains four zero-value slots under the OP Stack implementation namespace (0xc0d3...30007/0010/0012/0014); without filtering them out, the imported datadir's computed state_root mismatches the canonical CEL2 migration root 0xed980641a4bd4d2e84c6c8db980b7f05e95733c92be2e0045db3735efeb1d807. Add _filter_zero_storage and apply it to every L2 alloc immediately after loading l2-allocs.json. This makes the merged JSONL feed init_from_state_dump the same set of slots that op-geth would commit, so the post-import state root verification passes. Verified end-to-end on fsn-1 against the full Celo Mainnet L1 final dump: import-celo-state on celo-kona-reth:sha-a881d63 + this fix produces a datadir whose computed state_root equals the canonical migration root exactly. HashedStorages and PlainStorageState end up with identical entry counts (360,672,580). Refs #192.
EnvironmentArgs::init runs a static-file/database consistency check in create_provider_factory that asserts (assert_ne!(unwind_target, PipelineTarget::Unwind(0), …)) when the only safe heal would be a full unwind to block 0. A v1 datadir freshly produced by import-celo-state has SenderRecovery's stage checkpoint at the migration block (31,056,500) but an empty TransactionSenders static-file segment, so the check picks an unwind-to-zero target and trips the assertion before celo-migrate-v2::execute_with_factory's seed_transaction_senders step can fix the inconsistency. Extract the open_env_skip_genesis_state body into a shared helper open_env_skip_init_and_consistency_check that skips BOTH init_genesis_with_settings and the consistency check, leaving any post-open initialization to the caller. Both Celo-only subcommands now use it: * import-celo-state: still calls insert_genesis_header / writes StorageSettings / seeds stage checkpoints / sets static-file block ranges after the helper returns (the freshly created datadir has none of those set up yet). Behavior unchanged from the previous fix. * celo-migrate-v2: switches from env_args.init() to the helper. The datadir is already initialized so there is nothing to write at open time; seed_transaction_senders later in the migration brings TransactionSenders into line with the SenderRecovery checkpoint so the resulting v2 datadir is consistent for any subsequent open. Verified with cargo check + cargo test -p celo-reth --lib (all 131 unit tests pass). Refs #192.
…elo-state open_env_skip_init_and_consistency_check was passing StorageSettings::base() to write_storage_settings, but upstream reth's StorageSettings::base() returns StorageSettings::v2() (default for new nodes). The intent of import-celo-state is to produce a v1/legacy datadir — see the existing comment "// v1 storage only — \`--storage.v2\` is not exposed by this subcommand" — so that celo-migrate-v2 can later perform the v1 -> v2 migration cleanly. With the v2 metadata mistakenly written, celo-migrate-v2's preflight short- circuits on "Storage is already v2, nothing to do" while the underlying tables still have the v1 layout (everything in MDBX, nothing in static files or RocksDB). The datadir ends up inconsistent: v2 metadata + v1 data. Write StorageSettings::v1() explicitly so the import produces a properly v1-flagged datadir. Refs #192.
…trigger; lint Review hardening before opening the PR: - celo-migrate-v2: bail unless the Execution stage checkpoint equals the migration block (CEL2_MIGRATION_BLOCK_NUMBER). seed_transaction_senders writes a zero-entry senders segment up to tip, which is only correct when blocks 0..=tip contain no transactions (the fresh post-import shape). Running it on a synced datadir would silently corrupt the segment; the guard refuses that. - Drop the temporary jcortejoso/celo-migrate-v2 branch trigger from push-container-dev.yaml (was an explicit 'drop before merging' dev-image hack). - Lint: remove redundant args.chain.clone() in the shared env helper; annotate the required db_path owned-copy against a redundant_clone (nursery) false positive; apply nightly rustfmt.
…DBX tables Address PR review: - import-celo-state: hoist the empty-datadir guard above the bootstrap metadata writes. Previously v1 storage settings + stage-checkpoint resets were committed before the last_block_number != 0 check, so an accidental run on a populated datadir would clobber its metadata before erroring. - celo-migrate-v2: replace the Phase 4 skip with a scoped clear of the MDBX tables the v2 flip makes dead (AccountChangeSets, StorageChangeSets now in static files; TransactionSenders in its segment; TransactionHashNumbers/AccountsHistory/ StoragesHistory now read from RocksDB). Reclaims space the compaction otherwise copies. Plain state / hashed / trie are kept (still read from MDBX in v2) and stage checkpoints are NOT reset (Celo dummy blocks can't be re-walked).
write_storage_settings persists v1 metadata but does not update the in-memory storage-settings cache. The cache is seeded by ProviderFactory::new, which at the current reth pin defaults a fresh datadir to v1, so it already matches -- set it explicitly so a future reth that defaults the cache to v2 (or makes the state dump cache-sensitive) cannot leave the cache diverging from the v1 metadata/import layout written here. Correct today, version-dependent. Hardens the Codex r3328416058 thread on #198 across reth pin bumps.
6e43509 to
97661fe
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 97661feba3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
proc-macro-error2 is a newly-flagged unmaintained advisory, pulled in at build time via alloy-sol-macro (the sol! proc-macro); it never ships in the runtime binary and has no safe upgrade (alloy-sol-macro must drop it upstream). Ignore it, matching the existing unmaintained-advisory entries, so `cargo deny check advisories` passes again.
migrate_account_changesets / migrate_storage_changesets advance the static-file segment over Celo's ~31M dummy pre-migration blocks with a single writer.commit(), accumulating per-block segment metadata until flush -- the same pattern that OOM'd seed_transaction_senders at 18 GB on a real Celo Mainnet datadir. Commit every 500k blocks (segment aligned), mirroring the senders fix, so Phase 1 can't OOM before reaching the (already chunked) sender seeding. These two helpers now deviate from their upstream verbatim copies; the in-file note flags re-applying the chunking after a reth pin bump. Addresses the karlb and Codex (r3373429686) review threads on #198.
celo-protocol declares kona-derive and kona-proof but uses neither (not in its src, features, or re-exports), so `cargo udeps` flags them as unused. Remove the dead deps.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 20ad1d275a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The emptiness guard only checked `last_block_number() != 0`, which is 0 for BOTH a fresh datadir and a genesis-only / partially-initialized one (after a stock `init`, or an aborted import that already committed the bootstrap block). Importing onto existing genesis state mixes it with the L1 dump and fails the state-root check after mutating the datadir. Also reject when a genesis header is already present (we insert it ourselves below), so only a truly uninitialized datadir is accepted. Addresses Codex r3373644300 on #198.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e3cda608b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Summary
This PR makes
celo-reth's offline state-import path actually produce a working, reth-storage-v2 Celo Mainnet archive datadir. It contains two related pieces:import-celo-stateoutput (the real cause behind Snapshot-restored op-reth diverges on CIP-64 execution despite identical pre-state #192).celo-reth celo-migrate-v2subcommand that converts the imported v1 datadir to reth's storage-v2 layout without the upstream stage-reset that is incompatible with Celo's dummy pre-migration blocks.Validated end-to-end on a real node (Hetzner fsn-1):
import-celo-state→celo-migrate-v2→op-reth+op-nodesynced cleanly from the migration block (31,056,500) to live tip (68M+). The resulting v2 datadir was additionally snapshot + clone-booted in isolation: reth's startup consistency check passes and it serves the correct head.What was broken
import-celo-stateopens the env via upstreamEnvironmentArgs::init, which ends by writing the chainspec genesis alloc. Celo Mainnet's genesis alloc includes one storage slot on the Registry (0xce10, slot0xb53127…). The L1 final-state dump we ingest writes a different value for that same slot, so both land inHashedStoragesas a duplicate dup-entry and the state-root computation panics inalloy_trie HashBuilder::add_leaf(key == self.key). The datadir was unusable; the symptoms tracked in #192 were downstream of this.Two more issues blocked the v1→v2 path:
init_from_state_dumpwrites them verbatim, but op-geth'sStateDB.Commitprunes SSTORE-to-zero slots, so the computed migration root didn't match canonical.reth db migrate-v2clears recomputable tables and resets 6 stage checkpoints to 0, expecting the pipeline to rebuild by re-walking blocks1..tip. Celo's blocks1..31Mare header-only dummies (no bodies), so SenderRecovery et al. cannot rebuild → the node won't start.Changes
feat: add celo-migrate-v2 subcommand that preserves stagesTransactionSendersseed + metadata flip, skipping the upstream table-clear/stage-reset.fix: commit TransactionSenders seed per-segment + progressensure_at_block(31M)in one shot grew the static-file writer's in-memory state past 18 GB and deadlocked the commit. Chunked 500k-block commits with progress/ETA.fix: skip genesis alloc state writes in import-celo-stateopen_env_skip_init_and_consistency_checkhelper opens the env withoutinit_genesis_with_settings; the import then emits only the non-state genesis side-effects (storage settings, canonical header, stage checkpoints, segment ranges).fix(scripts): prune zero-value storage slots from L2 allocsStateDB.CommitSSTORE-to-zero semantics so the migration state root matches canonical.fix: also bypass the consistency check in celo-migrate-v2TransactionSenderssegment) doesn't trip the destructiveUnwind(0)assertion before the seed step can fix it.fix: write v1 storage settings (not base = v2) in import-celo-stateStorageSettings::v1()explicitly (upstreambase()now defaults to v2), so the import produces a coherent v1 datadir forcelo-migrate-v2to convert.fix: guard celo-migrate-v2 to fresh imports; drop temp CI trigger; lintValidation
0xed980641….db settings get:storage_v2: falseafter import →trueafter migrate.--proofs-history) booted on the v2 datadir, consistency check passed, synced 31M → 68M.Database opened,storage_v2: true, head = 68,157,296.cargo test -p celo-reth: 131 pass.fmt --check+clippy -D warningsclean.Reviewer notes / known limitations
A deep review flagged the following. The first is fixed in this PR; the rest are intentionally out of scope and noted for follow-up:
celo-migrate-v2now bails unless the Execution stage checkpoint equals the migration block.seed_transaction_senderswrites a zero-entry senders segment up totip, which is only correct when blocks0..=tiphave no transactions (the fresh post-import shape). The guard prevents running it on a synced datadir (which would silently corrupt the segment).migrate_account_changesets/migrate_storage_changesets/migrate_receipts/compact_mdbx/swap_compacted_dbare byte-for-byte copies of upstream private fns (they can't be reached via a public API). They are pinned to reth rev88505c7and must be re-audited on every reth bump. Thewhile let Some(Ok(..)) = walker.peek()changeset loops inherit upstream's behavior of breaking (rather than propagating) on a cursor error; acceptable here because the command is a one-shot offline tool gated to the known-good datadir shape, but worth tightening if these are ever generalized.swap_compacted_dbis not crash-atomic between the two renames (db → db_pre_compact,db_compact → db); a crash in that window leaves the DB atdb_pre_compact. Acceptable for an offline maintenance command; recovery path is the backup dir.storage_v2: truewith the senders boundary at the migration block.Relates to #192 (root cause + fix). The snapshot-subcommands work (#191) sits on top of a clean import produced by this PR.