Snapshot-restored op-reth diverges on CIP-64 execution despite identical pre-state #192

@jcortejoso

⚠️ Root-cause superseded — see Update 5. The divergence is intermittent / concurrency-dependent (not deterministic on one "specific" tx) and is a reth in-memory state read-consistency bug, not snapshot/import-specific. Updates 1 & 3 below are retracted; the reproduction details remain valid.

Summary

When celo-kona-reth is started from a snapshot of a previously running canonical archive node and brought up to chain head via op-node, transaction execution intermittently diverges from the canonical chain on some CIP-64 blocks (see Update 5), even though the pre-state (the snapshot's state root) is byte-for-byte identical to the canonical state at the same block.

Two celo-kona-reth image versions both exhibit the bug, with different symptoms:

| Image | Symptom |
| --- | --- |
| sha-04c2a3c (pre-#184) | EVM rejects the canonical tx as invalid: "fee currency not registered: 0x0E2A3e05bc9A16F5292A6170456A710cb89C6f72". Block is dropped; chain stalls. |
| sha-bc50e90 (includes #184) | No outright rejection. State root computed locally differs from the block header's state root → block added to fork chain. Chain forks silently. Different runs produce different fork-block hashes. |

This makes snapshot-based bootstrap (the workflow PR #191 enables) unusable for production without a further fix: the snapshot downloads and extracts correctly and op-reth boots, but the resulting node cannot follow the canonical chain.

Reproduction

Producer side (one-time, already done)

  1. Running celo-kona-reth sha-04c2a3c on /var/lib/op-reth (ZFS dataset tank/may18-fresh), in archive mode, with --proofs-history enabled.
  2. podman stop -t 60 op-reth (clean shutdown, exit 0 in 4.3 s).
  3. zfs snapshot tank/may18-fresh@export-2026-05-24 — captured at L2 block 67,728,745.
  4. podman start op-reth (resumes normally).
  5. zfs clone -o readonly=on tank/may18-fresh@export-2026-05-24 tank/snapshot-src for tooling.
  6. Ran celo-reth snapshot-manifest (from PR #191, "feat(celo-reth): add download and snapshot-manifest subcommands", image sha-bc50e90) against the clone → produced manifest.json + 818 component archives (354 GiB), uploaded to Hetzner Object Storage at https://fsn1.your-objectstorage.com/celo/snapshots/.

Consumer side (the bug)

  1. On a clean dataset, run celo-reth download --datadir /celo/data --chain celo --non-interactive --archive (image sha-bc50e90).
    • Completes successfully: emits "Snapshot download complete. Run celo-reth node to start syncing."
    • Datadir layout: db/ (100 GiB), static_files/ (487 GiB), rocksdb/ (47 GiB).
  2. chown -R 10001:10001 <datadir> (download command creates files as root; op-reth runs as celo UID 10001).
  3. Start celo-kona-reth node against this datadir (image sha-04c2a3c initially — the production image).
  4. Start op-node (image celo-blockchain-public/op-node:celo-v2.2.1).
  5. Engine API delivers payloads from snapshot point (67,728,745) forward.
  6. op-reth processes blocks correctly through block 67,817,510 (verified: block hash matches forno).
  7. At block 67,817,511, op-reth rejects the canonical block with:
ERROR Receipt root task received incomplete receipts, execution likely aborted
WARN Invalid block error on new payload
  invalid_hash:     0x3d7245589c3c147166ecc0b77c68ce2db263cf1af7f9a508bd7b871ad15387bc
  invalid_number:   67817511
  validation_err:   EVM reported invalid transaction (0xc0ce0d52a7594d1d35ad902786245f7665e1b254190a5218c83e8a9a98625d7f):
                    fee currency not registered: 0x0E2A3e05bc9A16F5292A6170456A710cb89C6f72

After this rejection, every subsequent payload from op-node is rejected as "links to previously rejected block". Chain stalls indefinitely.

With sha-bc50e90 (includes PR #184)

Same setup, just swap the image. The "fee currency not registered" symptom goes away, but state computation is still wrong:

WARN Changeset cache MISS, falling back to DB-based computation
       block_hash:    0x88cacb50d7e5479317e95c582b6ccf35be32cf511e59be5d9818b1cc3c8b7a72
       block_number: 67817511
INFO  State root task finished
       state_root: 0x4df15fd214e8cf7edc0207af2d13d482a47e673befa4bd0b738322f802537a15
       elapsed:    409.559µs
WARN  State root task returned incorrect state root
       state_root:        0x4df15fd214e8cf7edc0207af2d13d482a47e673befa4bd0b738322f802537a15   ← computed by op-reth
       block_state_root:  0xa1477420629fdbec1cf2c82a810a8f49d51292709b50864268d7f56e26616ec1   ← what the canonical block header says
WARN  Failed to compute state root in parallel
INFO  Block added to fork chain
INFO  Canonical chain committed

The block is added with a hash different from canonical's. Different op-reth restarts produce different fork-block hashes (e.g. 0x88cacb…, 0x0fcc36…) — strongly suggesting non-deterministic state read from an uninitialized location.
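
The computed-versus-header comparison in the log excerpt above can be extracted mechanically, for example to tally distinct outcomes across restarts. The following is a minimal sketch, assuming each record looks like the excerpt: a WARN line followed by indented `state_root:` and `block_state_root:` fields. The real journald layout may differ, and `parse_state_root_mismatches` is a hypothetical helper, not part of op-reth.

```python
import re

# Marker text copied from the log excerpt in this issue.
MARKER = "State root task returned incorrect state root"

# Matches an indented "state_root: 0x..." or "block_state_root: 0x..." field line.
FIELD = re.compile(r"^\s*(state_root|block_state_root):\s*(0x[0-9a-fA-F]+)")


def parse_state_root_mismatches(log_text):
    """Return a list of (computed_root, header_root) pairs found in log_text."""
    lines = log_text.splitlines()
    mismatches = []
    for i, line in enumerate(lines):
        if MARKER not in line:
            continue
        fields = {}
        # Read the field lines that immediately follow the marker line.
        for follow in lines[i + 1:]:
            match = FIELD.match(follow)
            if not match:
                break
            fields[match.group(1)] = match.group(2)
        if "state_root" in fields and "block_state_root" in fields:
            mismatches.append((fields["state_root"], fields["block_state_root"]))
    return mismatches
```

Running this over the logs of several restarts and collecting the computed roots into a set would show directly whether the locally computed root changes from run to run.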

What we verified

Pre-state (block 67,817,510) matches canonical byte-for-byte

local stateRoot:  0x24276ff7b5766a423d4eace0e78bf371b0d336785ddde4059a4a1f5f8abfec02
forno stateRoot:  0x24276ff7b5766a423d4eace0e78bf371b0d336785ddde4059a4a1f5f8abfec02
                  ✓ IDENTICAL

local block hash: 0x013397ea03f01e3fb6e0d774b69909b037ecd7e1b1bd0c3e11ccb9fe285d4d97
forno block hash: 0x013397ea03f01e3fb6e0d774b69909b037ecd7e1b1bd0c3e11ccb9fe285d4d97
                  ✓ IDENTICAL

FeeCurrencyDirectory storage matches canonical at block 67,817,510

$ eth_getStorageAt 0xefb84935239dAcdecF7c5bA76d8dE40b077B7b33 0x0 0x40ad026
  local: 0x00000000000000000000000158099b74f4acd642da77b4b7966b4138ec5ba458
  forno: 0x00000000000000000000000158099b74f4acd642da77b4b7966b4138ec5ba458
         ✓ IDENTICAL

Because the state root commits to a Merkle Patricia Trie over all account states (including each account's storage root), an identical state root implies, barring a hash collision, that every account and every storage slot matches canonical at that block. The persisted state at 67,817,510 is therefore canonical.
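
The header comparison above can be scripted against any two JSON-RPC endpoints. This is a rough sketch using only the standard `eth_getBlockByNumber` method; the helper names (`block_tag`, `fetch_block`, `header_mismatches`) are made up for illustration, and the endpoint URLs in the usage note are assumptions, not taken from this setup.

```python
import json
import urllib.request


def block_tag(number):
    """Encode a block number as the hex quantity JSON-RPC expects."""
    return hex(number)


def fetch_block(rpc_url, number):
    """Fetch a block (without full transactions) via eth_getBlockByNumber."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getBlockByNumber",
        "params": [block_tag(number), False],
    }).encode()
    request = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["result"]


def header_mismatches(local, reference, fields=("hash", "stateRoot")):
    """Return the header fields that differ or are missing on either side."""
    diffs = []
    for field in fields:
        a, b = local.get(field), reference.get(field)
        # Hex strings are compared case-insensitively.
        if a is None or b is None or a.lower() != b.lower():
            diffs.append(field)
    return diffs
```

For example, `header_mismatches(fetch_block(LOCAL_URL, 67_817_510), fetch_block(REFERENCE_URL, 67_817_510))` should return an empty list for the pre-state block, where `LOCAL_URL` and `REFERENCE_URL` stand for the local op-reth RPC and a canonical endpoint such as forno. `block_tag(67_817_510)` yields `0x40ad026`, the block tag used in the `eth_getStorageAt` check.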

Yet execution at block 67,817,511 produces different state

The only way to reconcile this is: block execution reads or modifies state that is not in the Merkle Patricia Trie.

Hypothesis

Cip64Storage (the per-block transient state that PR #184 made per-EVM instead of factory-scoped) — or something analogous — is lazily populated during normal block-by-block sync and is not captured by ZFS/file-based snapshots. From-genesis nodes accumulate this transient state as they execute every block. Snapshot-restored nodes start with it missing/empty/uninitialized, and when they hit a CIP-64 transaction whose execution depends on that state, the result diverges.

Supporting evidence for this hypothesis:

  1. The same sha-04c2a3c image had been running canonical for ~2 months prior on the same physical machine (no divergence from genesis sync). After a single stop/snapshot/restart cycle, it now fails.
  2. sha-bc50e90 (which post-dates PR #184, "Per evm CIP-64 storage") doesn't fix the issue but does change the symptom, which is consistent with "different transient-state handling, same core missing-state bug".
  3. The fork-block hash is non-deterministic across op-reth restarts on the same input — strongly suggests the divergent state is initialized from uninitialized memory or a non-deterministic source.
  4. The commit message of PR #184 ("Per evm CIP-64 storage") explicitly mentions that Cip64Storage lives in CeloEvm (not in the trie) and that the proofs-history ExEx re-executes blocks through the factory, which is exactly the kind of side-channel state that would not survive a snapshot.

What we tried (none worked)

| Attempt | Result |
| --- | --- |
| ZFS rollback to @export-2026-05-24 and restart with sha-04c2a3c | "static file tip behind checkpoint" + key decode failure during auto-unwind. Different bug class (interaction with tank/op-reth-proofs dataset not being rolled back). |
| celo-reth download to a fresh datadir + sandbox boot (no op-node, no peers) | Boots cleanly, pipeline finishes at snapshot block. False-positive validation. |
| celo-reth download + production op-reth + op-node + chain catchup | Fork divergence as documented above. |
| Restart op-reth several times | Each restart produces a different fork-block hash at 67,817,511. |
| Swap image from sha-04c2a3c → sha-bc50e90 | Symptom changes from "invalid tx" to "wrong state root", but still divergent. |
| Wipe and recreate tank/op-reth-proofs (the proofs-history rocksdb) | No effect on divergence behaviour. |

Environment

  • Host: Hetzner EX44 in FSN1 datacenter, Debian 12, ZFS root pool, sync=standard.
  • Container runtime: podman 4.x, rootful, host networking.
  • Storage: 2 × 3.5 TB NVMe in ZFS stripe.
  • Production image: us-west1-docker.pkg.dev/devopsre/dev-images/celo-kona-reth:sha-04c2a3c.
  • Test image: us-west1-docker.pkg.dev/devopsre/dev-images/celo-kona-reth:sha-bc50e90.
  • op-node: us-west1-docker.pkg.dev/devopsre/celo-blockchain-public/op-node:celo-v2.2.1.
  • reth-l1: ghcr.io/paradigmxyz/reth:latest.
  • Lighthouse: docker.io/sigp/lighthouse:latest.
  • Chain: Celo Mainnet (--chain celo, network 42220). All hardforks active including Jovian (Mar 31 2026).
  • op-reth flags include --proofs-history --proofs-history.storage-path=/celo-proofs/data --proofs-history.window=1209600.

Forensic artifacts available

The reproduction case is preserved on the affected node (celo-mainnet-archive-hetzner-fsn-1) for as long as needed:

  • tank/test-download — the celo-reth download-produced datadir, byte-for-byte equivalent to the published snapshot. Mounted at /var/lib/op-reth.
  • tank/may18-fresh — the previously-canonical broken state (forked at block 67,751,935 on May 24 from a separate incident). Mounted at /var/lib/op-reth.broken-2026-05-25.
  • tank/op-reth-proofs.broken-2026-05-25 — the proofs-history rocksdb from the broken state.
  • ZFS snapshots @backup-anchor-2026-05-17, @backup-anchor-2026-05-15, plus daily autosnaps back to early May.
  • Published snapshot artifacts at https://fsn1.your-objectstorage.com/celo/snapshots/ (manifest.json + 818 chunks, 354 GiB total).
  • Full op-reth + op-node journald logs from the reproduction window (2026-05-25 ~11:00–15:00 UTC).

Happy to grant access or attach raw logs to anyone who wants to reproduce or debug.

Impact on snapshot publication tooling

PR #191 added download and snapshot-manifest subcommands precisely to enable snapshot-based bootstrap. The file shuttling works correctly — the bug is downstream. Until this is resolved:

  • The published snapshot at https://fsn1.your-objectstorage.com/celo/snapshots/ should be marked experimental / not for production use.
  • The validation procedure for snapshots needs to extend beyond "op-reth boots cleanly" to "process N canonical blocks via engine API and verify every block's state root matches" — boot-only validation produced a false positive here.
  • A separate snapshot-export tool that also captures whatever non-trie state is being missed may be needed (or, alternately, op-reth should be able to lazily reconstruct it from MDBX/RocksDB on first start after restore).
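
The extended validation described in the list above (process N canonical blocks, then check every block's state root) could look roughly like the sketch below. It is deliberately transport-agnostic: `local_root` and `reference_root` are hypothetical callables that return the state root for a block number, for example thin wrappers around `eth_getBlockByNumber` on the restored node and on a canonical endpoint. `first_divergence` is likewise an illustrative name, not an existing tool.

```python
from typing import Callable, Optional


def first_divergence(
    local_root: Callable[[int], str],
    reference_root: Callable[[int], str],
    start: int,
    count: int,
) -> Optional[int]:
    """Return the first block in [start, start + count) whose state roots differ.

    Returns None when all `count` blocks match.
    """
    for number in range(start, start + count):
        # Hex strings are compared case-insensitively.
        if local_root(number).lower() != reference_root(number).lower():
            return number
    return None
```

Because the divergence reported here is intermittent, a single clean pass is weak evidence; repeating the check across several restarts of the restored node would be a more meaningful acceptance test.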
