reth: bulk-EOA RPC verify crashes daemon (status=rpc_died_before_spamoor); writer lacks intra-entity MDBX chunking

# reth: bulk-EOA RPC verify crashes daemon (status=rpc_died_before_spamoor); writer lacks intra-entity MDBX chunking

## Summary

State-actor's reth bench with `SPEC_TARGET_GB=25 SEED=42` fails reproducibly at `pre-spamoor verify`. The reth daemon boots, serves 16 simple RPC queries, then stops responding midway through the verifier's bulk-EOA sampling. The bench reports `status=rpc_died_before_spamoor` and recovers no `genesis_state_root`, so no cross-client invariance check is possible for reth.

## Reproducer

```bash
cd <state-actor checkout>
sudo rm -rf <bench-data>/reth <bench-logs>/reth <bench-results>/reth-result.json
CLIENTS=reth SEED=42 KEEP_DBS=1 ./scripts/run-bloatnet.sh
```

Result: `<bench-results>/reth-result.json` with `"status": "rpc_died_before_spamoor"`, `"genesis_state_root": "unknown"`.

## Timeline

| t       | event                                                                          |
| ------- | ------------------------------------------------------------------------------ |
| 0       | state-actor gen begins                                                          |
| +5m31s  | gen completes, exit 0, 21 GB written (54k accounts + 100 contracts + ~5 bulk EOAs each up to 10M slots) |
| +5m34s  | container `bloatnet-reth` started                                             |
| +5m36s  | RPC ready at :8545                                                              |
| +5m38s  | verify-pre.log begins; sections A-E hit ~16 RPC calls — all pass               |
| +5m47s  | verifier enters section F: "Sample bulk EOAs (500 samples)"; `Connection refused (os error 111)` on first sample |
| +5m51s  | `run-bloatnet.sh` declares `rpc_died_before_spamoor`; runs `docker rm -f`     |

`journalctl` shows the container scope `Deactivated successfully, Consumed 5.65 s CPU`. `dmesg` shows **no OOM events on the bench day**, so OOM-killer is ruled out as the proximate cause.

## Root cause synthesis (3-agent investigation)

### 1. Writer-side anti-pattern (primary, with strongest evidence)

`client/reth/spec_storage_streaming_cgo.go::consumeEntity` (lines 331-368) commits an entire bulk EOA in a single `envs.Mdbx.Update`. For a 10M-slot EOA that's ~10M `txn.Put` calls into the `HashedStorages` DupSort table under a single primary key `keccak(address)`. The `chunkSlots = 1024` constant at line 28 is only a goroutine→consumer batch size — it does NOT commit MDBX.

This is **the exact anti-pattern that the sister Erigon commit 12fa854 just landed a fix for** (`internal/erigon/mdbx/state.go::WriteAlloc`):

> "A single transaction over millions of slots blows MDBX's dirty-page budget and triggers spill_slowpath… ~50× slower than batched."

A grep for `chunkSize`, `10_000`, or `10000` in `client/reth/` returns nothing. `dbs_cgo.go:113-138` enables `WriteMap | SafeNoSync | NoMemInit | LifoReclaim` but never raises `OptTxnDpLimit`, so the env runs at MDBX's default ~64K dirty-page ceiling. Commit `537e280` ("per-entity Update") eliminated the global-Update OOM but did not subdivide individual entities.

Structural consequence: HashedStorages ends up with pathologically large per-`addrHash` DupSort subDBs. Reader cursors on those primary keys walk the resulting subDB page chains; on cold-cache reads (verifier section F) that is likely far slower than the hot-cache reads in sections A-E.
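
To make the batching arithmetic concrete, the sketch below prints the slot ranges a chunked writer would commit, one line per write transaction. It is a shell illustration of the range arithmetic only: the actual change belongs in the Go `consumeEntity`. `chunk_ranges` is a hypothetical helper, and the 10,000-slot chunk size is borrowed from the Erigon fix referenced in this issue, not measured for reth.

```bash
# Hypothetical helper: print "start end" (0-based, half-open) for each chunk of
# at most chunk_size slots. Each printed line stands for one write transaction.
chunk_ranges() {
  local total="$1" chunk_size="$2" start=0 end
  # Guard against a zero or negative chunk size, which would loop forever.
  [ "$chunk_size" -gt 0 ] || return 1
  while [ "$start" -lt "$total" ]; do
    end=$((start + chunk_size))
    if [ "$end" -gt "$total" ]; then
      end="$total"
    fi
    printf '%s %s\n' "$start" "$end"
    start="$end"
  done
}

# A 10M-slot bulk EOA at 10k slots per transaction yields 1000 commits:
#   chunk_ranges 10000000 10000 | wc -l
```

Each range would map to its own `envs.Mdbx.Update` call, so the dirty-page set is bounded by the chunk size instead of the entity size.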

### 2. Possible daemon-side contributor (needs confirmation)

Reth boots with `--dev --dev.block-time=1s --debug.skip-genesis-validation` (`scripts/run-bloatnet.sh:201-211`). The `--dev.block-time=1s` LocalMiner ticks once per second invoking `engine_forkchoiceUpdated → payload_builder.resolve_kind → engine_newPayload`. `--debug.skip-genesis-validation` only suppresses the genesis hash-mismatch check (`db-common/src/init.rs:235-244`); it does not short-circuit subsequent payload-building.

If state-actor's writer is missing block-accessory tables (`BlockBodyIndices`, `BlockOmmers`) or v2 RocksDB state-cache rows that reth's payload-builder transitively reads on the first empty block, the LocalMiner could panic. The `5.65 s CPU / 17 s wall` profile is consistent with ~16 ticks before something blew up.

This angle has no direct evidence (no reth stderr was captured) — needs verification by re-running with daemon logs streamed.
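
A minimal sketch of how the daemon output could be captured on the next run. It assumes the container keeps the name `bloatnet-reth` from the timeline, and that the follower is started before `run-bloatnet.sh` removes the container. `capture_container_logs`, the `DOCKER` override, `LOG_FOLLOWER_PID`, and the log path are hypothetical names, not part of the existing scripts.

```bash
# Hypothetical helper: follow a container's stdout and stderr into a file in the
# background. The follower's PID is stored in LOG_FOLLOWER_PID so the caller
# (same shell) can wait on it later.
capture_container_logs() {
  local name="$1" outfile="$2"
  # DOCKER is an assumed override hook (for example podman, or a test stub).
  "${DOCKER:-docker}" logs --follow --timestamps "$name" >"$outfile" 2>&1 &
  LOG_FOLLOWER_PID=$!
}

# Usage (illustrative):
#   capture_container_logs bloatnet-reth /tmp/reth-daemon.log
#   ... run the verifier ...
#   wait "$LOG_FOLLOWER_PID"   # the follower exits once the container stops
#   grep -n -i -E 'panic|error' /tmp/reth-daemon.log | tail -n 20
```

Because `docker rm -f` discards the container's logs, the follower has to be attached while the container is still alive.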

### 3. Verifier-side amplifier

`scripts/verify-bloatnet.sh` sections F + G fire **2000 sequential `cast` RPC calls** (F-plain: 500 × `eth_getBalance`; F-deleg: 500 × (`eth_getBalance` + `eth_getCode`); G: 500 × `eth_getCode`). Each `cast` spawns a fresh process → fresh TCP → handshake → single JSON-RPC → close. That is ~100× the workload of sections A-E (21 calls), spread across 1500 random (cold-cache) addresses versus A-E's ~10 hot addresses.

All loops are plain sequential shell (`for` with no `&` / `xargs -P`). No `--timeout` on `cast` invocations, no retry. First `Connection refused` propagates an empty value into the next iteration.
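
A minimal sketch of how the sampling loops could fail fast instead of carrying an empty value forward, assuming GNU coreutils `timeout` is available on the bench host. `rpc_try` and `RPC_BACKOFF_SECS` are hypothetical names, not part of `scripts/verify-bloatnet.sh`, and the `cast balance` call in the usage comment is only illustrative.

```bash
# Hypothetical helper: run an external command with a per-attempt timeout and a
# bounded number of attempts. Prints the command's stdout on success; returns
# non-zero if every attempt fails, times out, or produces empty output.
rpc_try() {
  local attempts="$1" per_call_secs="$2" i=1 out
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if out="$(timeout "$per_call_secs" "$@" 2>/dev/null)" && [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    # Pause between attempts only, not after the final failure.
    if [ "$i" -lt "$attempts" ]; then
      sleep "${RPC_BACKOFF_SECS:-1}"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (illustrative; addr and RPC_URL are assumed to be set by the caller):
#   bal="$(rpc_try 3 10 cast balance "$addr" --rpc-url "$RPC_URL")" || {
#     echo "RPC unreachable while sampling $addr; aborting section F" >&2
#     exit 1
#   }
```

Treating empty output as a failure means a dead daemon stops the section at the first sample rather than surfacing as hundreds of downstream mismatches.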

The verifier is **client-agnostic** — same 2000-call burst hits geth/besu/nethermind/erigon. Erigon survives it (status=ok, 596 blocks past spamoor); reth doesn't. The differential implicates the writer-side state-shape, not the verifier per se.

## Recommended fix order

1. **(Primary)** Port the chunking pattern from Erigon commit 12fa854 to reth's `consumeEntity`: split each entity's writes into ~10k-slot MDBX transactions instead of one transaction per entity. This matches the Erigon fix that was confirmed correct at the same bench scale.
2. **(Verify)** Re-run reth bench. If status=ok, ship.
3. **(If still fails)** Stream reth's daemon stderr and look for a panic or payload-builder error. Investigate the LocalMiner hypothesis from section 2.
4. **(Soft mitigation)** Optional: bench-side `SAMPLE=${SAMPLE:-50}` env var so iteration can run a smaller bulk-EOA sample without modifying the verifier.
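
A minimal sketch of the soft mitigation in item 4, assuming the verifier picks up its sample count from a `SAMPLE` environment variable (not confirmed here). `resolve_sample` is a hypothetical helper; the default of 50 is the one proposed above.

```bash
# Hypothetical helper: resolve the bulk-EOA sample count from the environment,
# defaulting to 50 and rejecting anything that is not a positive integer.
resolve_sample() {
  local n="${SAMPLE:-50}"
  case "$n" in
    *[!0-9]*)
      echo "SAMPLE must be a positive integer, got '$n'" >&2
      return 1
      ;;
  esac
  if [ "$n" -le 0 ]; then
    echo "SAMPLE must be greater than zero" >&2
    return 1
  fi
  printf '%s\n' "$n"
}

# Usage (illustrative): resolve once in the bench script, then export.
#   SAMPLE="$(resolve_sample)" || exit 1
#   export SAMPLE
#   ./scripts/verify-bloatnet.sh
```

Validating the value up front keeps a typo such as `SAMPLE=5o` from silently turning into an empty or zero-iteration sampling loop.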

## Cross-reference

- Erigon fix (the analog): commit 12fa854 — `internal/erigon/mdbx/state.go::WriteAlloc` chunks at 10k slots/txn
- Reth's prior global-Update OOM fix: commit 537e280 (eliminated global-Update; intra-entity chunking deferred)
- Cross-client invariance status: erigon `0x0a57cfc9c19efae524e042f321185aec5d949e86999f59a71fb6b15f576b12af` (the only daemon-side root recovered on this spec so far)

