Context
Follow-up to #899, scoped to the current ansible-devnet topology:
- 8 validators per subnet, flat aggregation only: num_children = 0, no recursive child proofs yet
- Aggregator collects raw gossip XMSS signatures for one att_data and calls rec_xmss_aggregate once
- Goal: lean_pq_sig_aggregated_signatures_building_time_seconds p95 ≤ 1.3 s when aggregating a full 8/8 subnet committee
This issue deliberately excludes recursive-aggregation optimizations (child caches, greedy child selection, etc.) — those paths are not exercised today.
Live cross-client snapshot (head ≈ 2280, 2026-05-26)
| Aggregator (subnet) | build count | mean build |
|---|---|---|
| lantern_0 (sn 2) | 423 | 0.27 s |
| qlean_4 (sn 3) | 1298 | 1.01 s |
| zeam_8 (sn 0) | 169 | 1.34 s |
| ethlambda_8 (sn 1) | 4044 | 1.38 s |
On zeam_8 (post-#925 / #933, 0xpartha/zeam:local):
| Percentile | Build time |
|---|---|
| p50 | ~0.5 s |
| p75 | ~2.3 s |
| p95 | ~4.0 s |
| mean | 1.34 s |
Phase metrics on zeam_8 show xmss_prove ≡ build histogram (169/169 samples, identical sums). Snapshot / prep / commit are all sub-ms. The tail is entirely inside xmss_aggregate → rec_xmss_aggregate.
What the zeam code actually does (flat, no children)
Prove path (every non-trivial att_data)
submitAggregateOnInterval → aggregateImpl → aggregateForSlots → aggregateUnlocked
→ computeSingleAggregatedSignature → prepareAggregateAttData → runAggregateAttDataFfi
→ AggregatedSignatureProof.aggregate → xmss.aggregateSignatures → xmss_aggregate (Rust)
→ rec_xmss_aggregate
Key locations:
pkgs/node/src/forkchoice.zig:2091 — aggregateUnlocked (sequential per-att_data loop)
pkgs/types/src/block.zig:794 — prepareAggregateAttData
pkgs/types/src/block.zig:1018 — runAggregateAttDataFfi (starts lean_pq_sig_* timer)
pkgs/types/src/aggregation.zig:78 — AggregatedSignatureProof.aggregate
pkgs/xmss/src/aggregation.zig:178 — xmss_aggregate FFI + zeam_xmss_rec_aggregate_prove_seconds
rust/multisig-glue/src/lib.rs:126 — clones every raw PK + sig before prove
Only fast path today
if (!has_gossip and selected_children.items.len == 1) {
// SSZ clone lone child — no STARK
}
Not applicable when aggregating raw subnet attestations (has_gossip = true, num_children = 0). Every flat 8-raw build goes through the full STARK pipeline. There is no zeam fast path for N raw + 0 children.
Trivial-input filter (aggregator only)
isAggregatorTrivialInput (forkchoice.zig:3781): skips when num_children = 0 and num_gossip_sigs < min_aggregation_inputs (default 2).
So zeam proves for 2–8 raw sigs, never 0–1. The build-time bimodal distribution (p50 ~0.5 s vs p95 ~4 s) is consistent with rec_xmss_aggregate cost scaling with num_raw, not with recursive depth.
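The filter rule above reduces to a single predicate. The following is a minimal Rust sketch for illustration, not zeam's Zig implementation: the function name mirrors isAggregatorTrivialInput, while the signature and the constant name are assumptions.

```rust
/// Default threshold quoted in the issue text (min_aggregation_inputs = 2).
/// The constant name is hypothetical.
const DEFAULT_MIN_AGGREGATION_INPUTS: usize = 2;

/// Returns true when the aggregator would skip proving: there are no child
/// proofs and fewer raw gossip signatures than the configured minimum.
fn is_aggregator_trivial_input(
    num_children: usize,
    num_gossip_sigs: usize,
    min_aggregation_inputs: usize,
) -> bool {
    num_children == 0 && num_gossip_sigs < min_aggregation_inputs
}
```

With the default threshold, inputs of 0 or 1 raw signatures are skipped and inputs of 2 through 8 are proved, which matches the range stated above.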
Aggregation trigger timing
Worker runs on slot interval with window {slot-1, slot} (chain.zig:4462-4472). Logs frequently show agg start … gossip_sigs=subnet0=1/8 — the worker often proves before all 8 subnet attestations have arrived, then may prove again on a later trigger with more sigs. That produces a mix of small-N (fast) and large-N (slow) samples in the same histogram.
What is NOT the problem (please do not optimize these first)
| Previously suggested item | Why irrelevant on current subnet |
|---|---|
| Child proof deserialize cache | num_children = 0 always, so the deserialize loop never runs (multisig-glue/src/lib.rs:143-177) |
| Greedy child selection tuning | No known_payloads / peer child inputs in flat mode |
| Recursive aggregation parallelism | Not deployed |
| Rayon thread auto-tune (#925) | Already done: 12 Rayon threads on 16 vCPU aggregators |
| ThinLTO / multisig-release | Production aggregators already build via -Dprover=dummy → multisig-release (build.zig:56, rust/Cargo.toml:34-37) |
| Parallel per-att_data ThreadPool scope | Explicitly removed in #925: it caused ThreadPool × Rayon oversubscription |
| Coalesce / in-flight skip rate | Affects slot scheduling, not per-prove STARK cost inside one histogram sample |
Root-cause hypothesis (code-grounded)
Two independent factors explain the zeam histogram shape:
- Input size variance: num_raw at snapshot time ranges from 2 to 8 because aggregation fires on the slot tick, not when 8/8 sigs are present. Small-N proves ≈ 0.5 s; full 8-raw ≈ 2–4 s.
- Prove cost gap vs lantern: For the same flat multisig construction, lantern mean 0.27 s vs zeam 1.34 s on live devnet suggests zeam pays substantial overhead beyond leanMultisig's core prove, or lantern uses a lighter aggregation construction. Need direct comparison (see investigation plan).
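As a rough consistency check on the input-size hypothesis, the histogram can be treated as a two-point mixture of fast and slow proves. This is a simplification introduced here for illustration: real samples spread across num_raw = 2..8, and the function name and the choice of 3.0 s as the slow mode (the midpoint of the 2–4 s range) are assumptions.

```rust
/// Fraction of slow samples implied by a two-point mixture:
///   mean = (1 - f) * fast + f * slow   =>   f = (mean - fast) / (slow - fast)
/// Returns None when `slow` is not strictly greater than `fast`.
/// The result is clamped to [0, 1].
fn slow_fraction_from_mean(mean: f64, fast: f64, slow: f64) -> Option<f64> {
    if !(slow > fast) {
        return None;
    }
    Some(((mean - fast) / (slow - fast)).clamp(0.0, 1.0))
}
```

With the observed mean of 1.34 s, a fast mode of 0.5 s and an assumed slow mode of 3.0 s, the implied slow fraction is (1.34 − 0.5) / (3.0 − 0.5) ≈ 0.34. A slow fraction between 0.25 and 0.5 would leave p50 in the fast mode and put p75 in the slow mode, which agrees with the percentiles reported for zeam_8.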
Known zeam per-prove overhead in code:
- Rust: deep-clone all 8 XmssPublicKey + XmssSignature on every FFI call (multisig-glue/src/lib.rs:130-135)
- Zig prep: fromBytes + handle materialization for every gossip sig every prove (block.zig:847-866)
- Timer scope: lean_pq_sig_* includes bitfield merge + prove + serialize; live data shows prove dominates (identical to zeam_xmss_rec_aggregate_prove_seconds)
Investigation plan
1. Instrument num_raw on the build histogram (required first)
Add a label (or companion histogram) on prove entry:
lean_pq_sig_aggregated_signatures_building_time_seconds{num_raw="N"}
Without this, p95 tuning is blind — we cannot tell whether the tail is "8-raw steady state" or "partial input + reprove".
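To show what the label buys, here is a minimal std-only Rust sketch that keeps build times per num_raw and reports a nearest-rank percentile for each label. It is a stand-in for a labeled metrics histogram, not zeam's metrics code; the type and method names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical recorder: build-time samples (seconds) keyed by num_raw.
#[derive(Default)]
struct BuildTimeByNumRaw {
    samples: BTreeMap<usize, Vec<f64>>,
}

impl BuildTimeByNumRaw {
    /// Record one prove duration under its num_raw label.
    fn observe(&mut self, num_raw: usize, seconds: f64) {
        self.samples.entry(num_raw).or_default().push(seconds);
    }

    /// Nearest-rank percentile for one label, with q in (0, 1].
    /// Returns None when the label has no samples.
    fn percentile(&self, num_raw: usize, q: f64) -> Option<f64> {
        let mut sorted = self.samples.get(&num_raw)?.clone();
        if sorted.is_empty() {
            return None;
        }
        sorted.sort_by(|a, b| a.total_cmp(b));
        let rank = ((q * sorted.len() as f64).ceil() as usize).clamp(1, sorted.len());
        Some(sorted[rank - 1])
    }
}
```

Querying percentile(8, 0.95) separately from percentile(2, 0.95) is what distinguishes an "8-raw steady state" tail from a "partial input + reprove" mix.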
2. lean-bench parity on aggregator hardware
Run lean-bench aggregate.flat_{2,4,8}_r2 (referenced in pkgs/metrics/src/lib.zig:937) on the same host class as zeam_8 and compare to zeam_xmss_rec_aggregate_prove_seconds.
| Outcome | Next step |
|---|---|
| Bench flat_8 ≈ zeam p95 (~4 s) | Problem is leanMultisig prove scaling: upstream perf or different security params (LOG_INV_RATE_PROD = 2, aggregation.zig:132) |
| Bench flat_8 ≈ lantern mean (~0.3 s) | Problem is zeam FFI / wrapper overhead: profile clones + serialize |
| Bench flat_8 ≈ 1.3 s | Problem is partial-input mix: fix aggregation timing (item 3) |
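The outcome table can be read as a small decision function. The sketch below is illustrative only: the enum, the function name and the relative tolerance used to interpret "≈" are assumptions, and the three reference values are the ones quoted in the table.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum NextStep {
    UpstreamProveScaling, // bench matches zeam p95 (~4 s)
    FfiWrapperOverhead,   // bench matches lantern mean (~0.3 s)
    PartialInputTiming,   // bench matches ~1.3 s
    Inconclusive,         // none of the reference points match
}

/// Classify a measured flat_8 bench time (seconds) against the table's
/// reference points, treating "≈" as within `rel_tolerance` of the reference.
fn classify_flat8_bench(flat8_secs: f64, rel_tolerance: f64) -> NextStep {
    let references = [
        (4.0, NextStep::UpstreamProveScaling),
        (0.3, NextStep::FfiWrapperOverhead),
        (1.3, NextStep::PartialInputTiming),
    ];
    for &(reference, step) in references.iter() {
        if ((flat8_secs - reference) / reference).abs() <= rel_tolerance {
            return step;
        }
    }
    NextStep::Inconclusive
}
```

A tolerance of 0.25 keeps the three bands disjoint (0.225–0.375 s, 0.975–1.625 s, 3–5 s); a result outside all of them, such as 2.2 s, is reported as inconclusive rather than forced into a row.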
3. Aggregation timing policy (if tail = partial inputs)
Options to evaluate once num_raw is labeled:
- Defer prove until subnet gossip coverage ≥ K/8 (e.g. 6/8 or 8/8) within the slot window
- Or: prove early with partial input for fork-choice weight, but do not count partial proves toward the p95 SLO for published aggregates
Code touchpoints: submitAggregateOnInterval / aggregateImpl slot window (chain.zig:4462), pruneTrivialFromAggregateSnapshot threshold (forkchoice.zig:3913).
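The two timing options can be expressed as one policy with a switch. This is a hedged Rust sketch, not zeam code: the type names, fields and decision labels are hypothetical, the trivial-input filter is assumed to have run already, and what happens when the slot window closes below the threshold is deliberately left out because the options above do not specify it.

```rust
#[derive(Debug, PartialEq)]
enum ProveDecision {
    /// Wait for more gossip signatures.
    Defer,
    /// Coverage threshold met: prove, and count the sample toward the p95 SLO.
    ProveCounted,
    /// Prove on partial input for fork-choice weight, excluded from the SLO.
    ProveEarlyNotCounted,
}

struct CoveragePolicy {
    /// K in "K/8": minimum raw signatures before a counted prove.
    min_coverage: usize,
    /// Selects the second option (prove early, but do not count it).
    allow_early_prove: bool,
}

impl CoveragePolicy {
    fn decide(&self, num_raw: usize) -> ProveDecision {
        if num_raw >= self.min_coverage {
            ProveDecision::ProveCounted
        } else if self.allow_early_prove {
            ProveDecision::ProveEarlyNotCounted
        } else {
            ProveDecision::Defer
        }
    }
}
```

Once num_raw is labeled, the per-label percentiles indicate which value of min_coverage (for example 6 or 8) is worth evaluating.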
4. Cross-client construction comparison (lantern / qlean)
lantern mean 0.27 s is the benchmark to beat. Questions for lantern/qlean teams:
- Same rec_xmss_aggregate / leanMultisig STARK at LOG_INV_RATE=2?
- Same metric semantics on lean_pq_sig_aggregated_signatures_building_time_seconds?
- Do they wait for full committee before proving?
zeam has no lantern/qlean source in-tree; this comparison is external but essential — ethlambda (1.38 s) is no longer the right reference point.
5. FFI clone reduction (if bench shows overhead gap)
If lean-bench flat_8 is fast but zeam is slow with identical num_raw:
- Avoid deep-cloning PKs/sigs in multisig-glue when handles are prove-duration-borrowed
- Reuse Zig-side fromBytes handles across reproves of the same snapshot sigs
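The first bullet amounts to passing references that live for the duration of the prove instead of owned copies. The Rust sketch below shows the shape of that change with stand-in types. It is not the multisig-glue code: every name here is hypothetical, the "prover" only sums input lengths, and whether the real prover entry point accepts borrowed inputs is an assumption that must be checked against leanMultisig.

```rust
#[derive(Clone)]
struct PublicKeyBytes(Vec<u8>);
#[derive(Clone)]
struct SignatureBytes(Vec<u8>);

/// Stand-in for a prover that only reads its inputs; returns total input bytes.
fn prove_borrowed(pairs: &[(&PublicKeyBytes, &SignatureBytes)]) -> usize {
    pairs.iter().map(|(pk, sig)| pk.0.len() + sig.0.len()).sum()
}

/// Shape described in the issue: deep-clone every key and signature first.
fn prove_with_clones(pks: &[PublicKeyBytes], sigs: &[SignatureBytes]) -> usize {
    let owned: Vec<(PublicKeyBytes, SignatureBytes)> =
        pks.iter().cloned().zip(sigs.iter().cloned()).collect();
    let refs: Vec<(&PublicKeyBytes, &SignatureBytes)> =
        owned.iter().map(|(pk, sig)| (pk, sig)).collect();
    prove_borrowed(&refs)
}

/// Proposed shape: borrow the caller's data for the duration of the prove.
fn prove_without_clones(pks: &[PublicKeyBytes], sigs: &[SignatureBytes]) -> usize {
    let refs: Vec<(&PublicKeyBytes, &SignatureBytes)> =
        pks.iter().zip(sigs.iter()).collect();
    prove_borrowed(&refs)
}
```

Both paths hand the prover the same bytes; the second avoids one heap allocation and copy per key and per signature on every call. Whether that saving is measurable next to the STARK prove is exactly what the lean-bench comparison in item 2 is meant to settle.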
Acceptance criteria
- num_raw (and ideally num_children) exposed on aggregate build metrics
- zeam_xmss_rec_aggregate_prove_seconds documented on aggregator hardware
Related
- N≥2 raw