reth: bulk-EOA RPC verify crashes daemon (status=rpc_died_before_spamoor); writer lacks intra-entity MDBX chunking #89

Description

@CPerezz

Summary

State-actor's reth bench on SPEC_TARGET_GB=25 SEED=42 fails reproducibly at pre-spamoor verify. The reth daemon boots, serves 16 simple RPC queries, then stops responding mid-way through the verifier's bulk-EOA sampling. Bench reports status=rpc_died_before_spamoor, no genesis_state_root recovered, no cross-client invariance check possible for reth.

Reproducer

cd <state-actor checkout>
sudo rm -rf <bench-data>/reth <bench-logs>/reth <bench-results>/reth-result.json
CLIENTS=reth SEED=42 KEEP_DBS=1 ./scripts/run-bloatnet.sh

Result: <bench-results>/reth-result.json with "status": "rpc_died_before_spamoor", "genesis_state_root": "unknown".

Timeline

t         event
0         state-actor gen begins
+5m31s    gen completes, exit 0, 21 GB written (54k accounts + 100 contracts + ~5 bulk EOAs, each up to 10M slots)
+5m34s    container bloatnet-reth started
+5m36s    RPC ready at :8545
+5m38s    verify-pre.log begins; sections A-E issue ~16 RPC calls, all pass
+5m47s    verifier enters section F, "Sample bulk EOAs (500 samples)"; Connection refused (os error 111) on first sample
+5m51s    run-bloatnet.sh declares rpc_died_before_spamoor; runs docker rm -f

journalctl shows the container scope Deactivated successfully, Consumed 5.65 s CPU. dmesg shows no OOM events on the bench day, so OOM-killer is ruled out as the proximate cause.

Root cause synthesis (3-agent investigation)

1. Writer-side anti-pattern (primary, with strongest evidence)

client/reth/spec_storage_streaming_cgo.go::consumeEntity (lines 331-368) commits an entire bulk EOA in a single envs.Mdbx.Update. For a 10M-slot EOA that is ~10M txn.Put calls into the HashedStorages DupSort table under a single primary key keccak(address). The chunkSlots = 1024 constant at line 28 is only a goroutine→consumer batch size; it does NOT trigger an MDBX commit.

This is the same anti-pattern that the sister Erigon commit 12fa854 just fixed (internal/erigon/mdbx/state.go::WriteAlloc):

"A single transaction over millions of slots blows MDBX's dirty-page budget and triggers spill_slowpath… ~50× slower than batched."

A grep for chunkSize, 10_000, or 10000 in client/reth/ returns nothing. dbs_cgo.go:113-138 enables WriteMap | SafeNoSync | NoMemInit | LifoReclaim but never raises OptTxnDpLimit, so the env runs at MDBX's default ~64K dirty-page ceiling. Commit 537e280 ("per-entity Update") eliminated the global-Update OOM but did not subdivide individual entities.

Structural consequence: HashedStorages ends up with pathologically large per-addrHash DupSort subDBs. Reader cursors on those primary keys walk the resulting subDB page chains; on cold-cache reads (verifier section F) that is likely far slower than the hot-cache reads in sections A-E.

2. Possible daemon-side contributor (needs confirmation)

Reth boots with --dev --dev.block-time=1s --debug.skip-genesis-validation (scripts/run-bloatnet.sh:201-211). The --dev.block-time=1s LocalMiner ticks once per second invoking engine_forkchoiceUpdated → payload_builder.resolve_kind → engine_newPayload. --debug.skip-genesis-validation only suppresses the genesis hash-mismatch check (db-common/src/init.rs:235-244); it does not short-circuit subsequent payload-building.

If state-actor's writer is missing block-accessory tables (BlockBodyIndices, BlockOmmers) or v2 RocksDB state-cache rows that reth's payload-builder transitively reads on the first empty block, the LocalMiner could panic. The 5.65 s CPU / 17 s wall profile is consistent with ~16 ticks before something blew up.

This angle has no direct evidence (no reth stderr was captured) — needs verification by re-running with daemon logs streamed.

3. Verifier-side amplifier

scripts/verify-bloatnet.sh sections F + G fire 2000 sequential cast RPC calls (F-plain: 500 × eth_getBalance; F-deleg: 500 × (eth_getBalance + eth_getCode); G: 500 × eth_getCode). Each cast spawns a fresh process → fresh TCP → handshake → single JSON-RPC → close. That is ~100× the workload of sections A-E (21 calls), spread across 1500 random, cold-cache addresses versus A-E's ~10 hot addresses.

All loops are plain sequential shell (for loops with no & or xargs -P). The cast invocations have no --timeout and no retry, so the first Connection refused propagates an empty value into the next iteration.

The verifier is client-agnostic — same 2000-call burst hits geth/besu/nethermind/erigon. Erigon survives it (status=ok, 596 blocks past spamoor); reth doesn't. The differential implicates the writer-side state-shape, not the verifier per se.

Recommended fix order

  1. (Primary) Port the Erigon commit 12fa854 chunking pattern to reth's consumeEntity: chunk each entity's writes into ~10k-slot MDBX transactions instead of one transaction per entity. This matches the erigon fix that was confirmed correct at the same bench scale.
  2. (Verify) Re-run reth bench. If status=ok, ship.
  3. (If it still fails) Stream reth's daemon stderr and look for a panic or payload-builder error. Investigate the LocalMiner hypothesis from section 2 (agent B).
  4. (Soft mitigation) Optional: bench-side SAMPLE=${SAMPLE:-50} env var so iteration can run a smaller bulk-EOA sample without modifying the verifier.

Cross-reference

  • Erigon fix (the analog): commit 12fa854; internal/erigon/mdbx/state.go::WriteAlloc chunks at 10k slots/txn
  • Reth's prior global-Update OOM fix: commit 537e280 (eliminated global-Update; intra-entity chunking deferred)
  • Cross-client invariance status: erigon 0x0a57cfc9c19efae524e042f321185aec5d949e86999f59a71fb6b15f576b12af (the only daemon-side root recovered on this spec so far)
