reth: bulk-EOA RPC verify crashes daemon (status=rpc_died_before_spamoor); writer lacks intra-entity MDBX chunking
Summary
State-actor's reth bench on SPEC_TARGET_GB=25 SEED=42 fails reproducibly at pre-spamoor verify. The reth daemon boots, serves 16 simple RPC queries, then stops responding mid-way through the verifier's bulk-EOA sampling. Bench reports status=rpc_died_before_spamoor, no genesis_state_root recovered, no cross-client invariance check possible for reth.
Reproducer
cd <state-actor checkout>
sudo rm -rf <bench-data>/reth <bench-logs>/reth <bench-results>/reth-result.json
CLIENTS=reth SEED=42 KEEP_DBS=1 ./scripts/run-bloatnet.sh
Result: <bench-results>/reth-result.json with "status": "rpc_died_before_spamoor", "genesis_state_root": "unknown".
Timeline
| t | event |
|---|---|
| 0 | state-actor gen begins |
| +5m31s | gen completes, exit 0, 21 GB written (54k accounts + 100 contracts + ~5 bulk EOAs each up to 10M slots) |
| +5m34s | container bloatnet-reth started |
| +5m36s | RPC ready at :8545 |
| +5m38s | verify-pre.log begins; sections A-E hit ~16 RPC calls, all pass |
| +5m47s | verifier enters section F, "Sample bulk EOAs (500 samples)"; Connection refused (os error 111) on first sample |
| +5m51s | run-bloatnet.sh declares rpc_died_before_spamoor; runs docker rm -f |
journalctl shows the container scope Deactivated successfully, Consumed 5.65 s CPU. dmesg shows no OOM events on the bench day, so OOM-killer is ruled out as the proximate cause.
Root cause synthesis (3-agent investigation)
1. Writer-side anti-pattern (primary, with strongest evidence)
client/reth/spec_storage_streaming_cgo.go::consumeEntity (lines 331-368) commits an entire bulk EOA in a single envs.Mdbx.Update. For a 10M-slot EOA that's ~10M txn.Put calls into the HashedStorages DupSort table under a single primary key keccak(address). The chunkSlots = 1024 constant at line 28 is only a goroutine→consumer batch size; it does not trigger an MDBX commit.
This is the exact anti-pattern that the sister Erigon commit 12fa854 just landed a fix for (internal/erigon/mdbx/state.go::WriteAlloc):
"A single transaction over millions of slots blows MDBX's dirty-page budget and triggers spill_slowpath… ~50× slower than batched."
Grepping client/reth/ for chunkSize, 10_000, and 10000 returns nothing. dbs_cgo.go:113-138 enables WriteMap | SafeNoSync | NoMemInit | LifoReclaim but never raises OptTxnDpLimit, so the env runs at MDBX's default ~64K dirty-page ceiling. Commit 537e280 ("per-entity Update") eliminated the global-Update OOM but did not subdivide individual entities.
Structural consequence: HashedStorages ends up with pathologically large per-addrHash DupSort subDBs. Reader cursors on those primary keys walk the resulting subDB page chains; on cold-cache reads (verifier section F) that is far slower than the hot-cache reads in sections A-E.
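The difference between one transaction per entity and batched commits can be sketched in Go. This is a minimal sketch under stated assumptions, not the actual consumeEntity or Erigon's WriteAlloc: the putter and updater interfaces, the slot type, writeChunked, and the in-memory countingEnv are hypothetical stand-ins for the real MDBX bindings. The slots are assumed to sit in a slice rather than arrive in 1024-slot streamed batches, and the dup-value layout (slot key followed by value) is also an assumption. The 10,000-slot chunk size mirrors the figure quoted for the Erigon fix.

```go
package main

import "fmt"

// putter is a hypothetical stand-in for the write side of an MDBX transaction.
type putter interface {
	Put(key, val []byte) error
}

// updater is a hypothetical stand-in for an MDBX environment: Update runs fn
// inside one read-write transaction and commits it if fn returns nil.
type updater interface {
	Update(fn func(txn putter) error) error
}

// slot is one storage entry of an entity (hypothetical type).
type slot struct {
	Key, Val []byte
}

// writeChunked writes all slots of one entity under the primary key addrHash,
// committing after at most chunkSize slots so that no single transaction
// accumulates dirty pages for the whole entity.
func writeChunked(env updater, addrHash []byte, slots []slot, chunkSize int) error {
	if chunkSize <= 0 {
		return fmt.Errorf("chunkSize must be positive, got %d", chunkSize)
	}
	for start := 0; start < len(slots); start += chunkSize {
		end := start + chunkSize
		if end > len(slots) {
			end = len(slots)
		}
		batch := slots[start:end]
		err := env.Update(func(txn putter) error {
			for _, s := range batch {
				// Assumed DupSort layout: dup value = slot key followed by value.
				dup := append(append([]byte{}, s.Key...), s.Val...)
				if err := txn.Put(addrHash, dup); err != nil {
					return err
				}
			}
			return nil
		})
		if err != nil {
			return fmt.Errorf("chunk starting at slot %d: %w", start, err)
		}
	}
	return nil
}

// countingEnv is an in-memory fake that only counts commits and puts.
type countingEnv struct {
	commits, puts int
}

type countingTxn struct{ env *countingEnv }

func (t countingTxn) Put(key, val []byte) error {
	t.env.puts++
	return nil
}

func (e *countingEnv) Update(fn func(txn putter) error) error {
	if err := fn(countingTxn{env: e}); err != nil {
		return err
	}
	e.commits++
	return nil
}

func main() {
	env := &countingEnv{}
	slots := make([]slot, 25000)
	if err := writeChunked(env, []byte("addrHash"), slots, 10000); err != nil {
		panic(err)
	}
	fmt.Printf("commits=%d puts=%d\n", env.commits, env.puts) // commits=3 puts=25000
}
```

In a streaming writer the same effect would come from counting slots across incoming batches and committing whenever the counter crosses the chunk size, rather than slicing a fully materialized array.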
2. Possible daemon-side contributor (needs confirmation)
Reth boots with --dev --dev.block-time=1s --debug.skip-genesis-validation (scripts/run-bloatnet.sh:201-211). The --dev.block-time=1s LocalMiner ticks once per second invoking engine_forkchoiceUpdated → payload_builder.resolve_kind → engine_newPayload. --debug.skip-genesis-validation only suppresses the genesis hash-mismatch check (db-common/src/init.rs:235-244); it does not short-circuit subsequent payload-building.
If state-actor's writer is missing block-accessory tables (BlockBodyIndices, BlockOmmers) or v2 RocksDB state-cache rows that reth's payload-builder transitively reads on the first empty block, the LocalMiner could panic. The 5.65 s CPU / 17 s wall profile is consistent with ~16 ticks before something blew up.
This angle has no direct evidence (no reth stderr was captured) — needs verification by re-running with daemon logs streamed.
3. Verifier-side amplifier
scripts/verify-bloatnet.sh sections F + G fire 2000 sequential cast RPC calls (F-plain: 500 × eth_getBalance; F-deleg: 500 × (eth_getBalance + eth_getCode); G: 500 × eth_getCode). Each cast spawns a fresh process → fresh TCP → handshake → single JSON-RPC → close. ~100× the workload of sections A-E (21 calls), and across 1500 random addresses (cold-cache) vs A-E's ~10 hot addresses.
All loops are plain sequential shell (for with no & / xargs -P). No --timeout on cast invocations, no retry. First Connection refused propagates an empty value into the next iteration.
The verifier is client-agnostic — same 2000-call burst hits geth/besu/nethermind/erigon. Erigon survives it (status=ok, 596 blocks past spamoor); reth doesn't. The differential implicates the writer-side state-shape, not the verifier per se.
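The missing timeout, retry, and fail-fast behavior in the sampling loops can be illustrated with a short Go sketch. It is not a port of verify-bloatnet.sh: rpcFunc, sampleAll, and deadAfter are hypothetical names, the RPC call is abstracted behind a function hook instead of a real JSON-RPC client, and the retry count and timeout are arbitrary. The point is only that a per-attempt timeout plus an abort on the first exhausted retry keeps an unreachable daemon from feeding empty values into later iterations.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rpcFunc is a hypothetical hook for one RPC query (for example eth_getBalance
// for addr). A real implementation should honor ctx cancellation.
type rpcFunc func(ctx context.Context, addr string) (string, error)

// sampleAll queries the addresses sequentially. Every attempt gets its own
// timeout, and a failed attempt is retried up to `retries` more times. The
// first address that exhausts its attempts aborts the run and returns the
// results gathered so far together with the error.
func sampleAll(addrs []string, call rpcFunc, timeout time.Duration, retries int) (map[string]string, error) {
	results := make(map[string]string, len(addrs))
	for i, addr := range addrs {
		var val string
		var err error
		for attempt := 0; attempt <= retries; attempt++ {
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			val, err = call(ctx, addr)
			cancel()
			if err == nil {
				break
			}
		}
		if err != nil {
			return results, fmt.Errorf("sample %d (%s) failed after %d attempts: %w", i, addr, retries+1, err)
		}
		results[addr] = val
	}
	return results, nil
}

// deadAfter returns a fake rpcFunc that answers the first n calls and fails
// every later one, imitating a daemon that stops listening. *calls counts
// every attempt made.
func deadAfter(n int, calls *int) rpcFunc {
	return func(ctx context.Context, addr string) (string, error) {
		*calls++
		if *calls > n {
			return "", errors.New("connection refused")
		}
		return "0x1", nil
	}
}

func main() {
	calls := 0
	addrs := []string{"0xaa", "0xbb", "0xcc", "0xdd"}
	res, err := sampleAll(addrs, deadAfter(2, &calls), time.Second, 1)
	// Two addresses succeed, the third fails twice, the fourth is never tried.
	fmt.Println(len(res), calls, err != nil) // 2 4 true
}
```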
Recommended fix order
- (Primary) Port the Erigon commit 12fa854 chunking pattern to reth's consumeEntity: chunk each entity's writes into ~10k-slot MDBX transactions instead of one transaction per entity. This matches the erigon fix that was confirmed correct at the same bench scale.
- (Verify) Re-run reth bench. If status=ok, ship.
- (If still fails) Stream reth's daemon stderr and look for panic / payload-builder error. Investigate the LocalMiner hypothesis from agent B.
- (Soft mitigation) Optional: bench-side SAMPLE=${SAMPLE:-50} env var so iteration can run a smaller bulk-EOA sample without modifying the verifier.
Cross-reference
- Erigon fix (the analog): commit 12fa854, internal/erigon/mdbx/state.go::WriteAlloc chunks at 10k slots/txn
- Reth's prior global-Update OOM fix: commit 537e280 (eliminated global-Update; intra-entity chunking deferred)
- Cross-client invariance status: erigon 0x0a57cfc9c19efae524e042f321185aec5d949e86999f59a71fb6b15f576b12af (the only daemon-side root recovered on this spec so far)