
node, aggregation: flat 8-raw XMSS prove tail (~4s p95) vs lantern/qlean on 8-validator subnet #940

@ch4r10t33r


Context

Follow-up to #899, scoped to the current ansible-devnet topology:

  • 8 validators per subnet, flat aggregation only — num_children = 0, no recursive child proofs yet
  • Aggregator collects raw gossip XMSS signatures for one att_data and calls rec_xmss_aggregate once
  • Goal: lean_pq_sig_aggregated_signatures_building_time_seconds p95 ≤ 1.3 s when aggregating a full 8/8 subnet committee

This issue deliberately excludes recursive-aggregation optimizations (child caches, greedy child selection, etc.) — those paths are not exercised today.


Live cross-client snapshot (head ≈ 2280, 2026-05-26)

| Aggregator (subnet) | Build count | Mean build |
| --- | --- | --- |
| lantern_0 (sn 2) | 423 | 0.27 s |
| qlean_4 (sn 3) | 1298 | 1.01 s |
| zeam_8 (sn 0) | 169 | 1.34 s |
| ethlambda_8 (sn 1) | 4044 | 1.38 s |

On zeam_8 (post-#925 / #933, 0xpartha/zeam:local):

| Percentile | Build time |
| --- | --- |
| p50 | ~0.5 s |
| p75 | ~2.3 s |
| p95 | ~4.0 s |
| mean | 1.34 s |

Phase metrics on zeam_8 show that the xmss_prove phase histogram matches the build histogram exactly (169/169 samples, identical sums). Snapshot, prep, and commit are all sub-millisecond. The tail is entirely inside xmss_aggregate → rec_xmss_aggregate.


What the zeam code actually does (flat, no children)

Prove path (every non-trivial att_data)

submitAggregateOnInterval → aggregateImpl → aggregateForSlots → aggregateUnlocked
  → computeSingleAggregatedSignature → prepareAggregateAttData → runAggregateAttDataFfi
  → AggregatedSignatureProof.aggregate → xmss.aggregateSignatures → xmss_aggregate (Rust)
  → rec_xmss_aggregate

Key locations:

  • pkgs/node/src/forkchoice.zig:2091 aggregateUnlocked (sequential per-att_data loop)
  • pkgs/types/src/block.zig:794 prepareAggregateAttData
  • pkgs/types/src/block.zig:1018 runAggregateAttDataFfi (starts lean_pq_sig_* timer)
  • pkgs/types/src/aggregation.zig:78 AggregatedSignatureProof.aggregate
  • pkgs/xmss/src/aggregation.zig:178 xmss_aggregate FFI + zeam_xmss_rec_aggregate_prove_seconds
  • rust/multisig-glue/src/lib.rs:126 — clones every raw PK + sig before prove

Only fast path today

if (!has_gossip and selected_children.items.len == 1) {
    // SSZ clone lone child — no STARK
}

Not applicable when aggregating raw subnet attestations (has_gossip = true, num_children = 0). Every flat 8-raw build goes through the full STARK pipeline. There is no zeam fast path for N raw + 0 children.

Trivial-input filter (aggregator only)

isAggregatorTrivialInput (forkchoice.zig:3781): skips when num_children = 0 and num_gossip_sigs < min_aggregation_inputs (default 2).

So zeam proves for 2–8 raw sigs, never for 0–1. The bimodal build-time distribution (p50 ~0.5 s vs p95 ~4 s) is consistent with rec_xmss_aggregate cost scaling with num_raw, not with recursive depth.
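The filter rule described above can be sketched as a small predicate. This is an illustrative Rust rendering, not the Zig code in forkchoice.zig; the `AggregateInputs` struct and the function name `is_trivial_input` are hypothetical stand-ins, and only the rule itself (no children and fewer than `min_aggregation_inputs` gossip sigs) comes from the issue.

```rust
/// Hypothetical stand-in for the per-att_data inputs the filter inspects.
struct AggregateInputs {
    num_children: usize,
    num_gossip_sigs: usize,
}

/// Default threshold quoted in the issue (min_aggregation_inputs = 2).
const MIN_AGGREGATION_INPUTS: usize = 2;

/// Returns true when the aggregator should skip proving:
/// no child proofs and fewer raw gossip signatures than the threshold.
fn is_trivial_input(inputs: &AggregateInputs, min_inputs: usize) -> bool {
    inputs.num_children == 0 && inputs.num_gossip_sigs < min_inputs
}
```

Under this rule, with `num_children = 0` and the default threshold, inputs of 0 or 1 raw sigs are skipped and inputs of 2 through 8 go to the prover, which matches the 2–8 range stated above.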

Aggregation trigger timing

Worker runs on slot interval with window {slot-1, slot} (chain.zig:4462-4472). Logs frequently show agg start … gossip_sigs=subnet0=1/8 — the worker often proves before all 8 subnet attestations have arrived, then may prove again on a later trigger with more sigs. That produces a mix of small-N (fast) and large-N (slow) samples in the same histogram.


What is NOT the problem (please do not optimize these first)

| Previously suggested item | Why irrelevant on current subnet |
| --- | --- |
| Child proof deserialize cache | num_children = 0 always; deserialize loop never runs (multisig-glue/src/lib.rs:143-177) |
| Greedy child selection tuning | No known_payloads / peer child inputs in flat mode |
| Recursive aggregation parallelism | Not deployed |
| Rayon thread auto-tune (#925) | Already done: 12 Rayon threads on 16 vCPU aggregators |
| ThinLTO / multisig-release | Production aggregators already build via -Dprover=dummy → multisig-release (build.zig:56, rust/Cargo.toml:34-37) |
| Parallel per-att_data ThreadPool scope | Explicitly removed in #925; caused ThreadPool × Rayon oversubscription |
| Coalesce / in-flight skip rate | Affects slot scheduling, not per-prove STARK cost inside one histogram sample |

Root-cause hypothesis (code-grounded)

Two independent factors explain the zeam histogram shape:

  1. Input size variance: num_raw at snapshot time ranges from 2 to 8 because aggregation fires on the slot tick, not when 8/8 sigs are present. Small-N proves ≈ 0.5 s; full 8-raw ≈ 2–4 s.

  2. Prove cost gap vs lantern: For the same flat multisig construction, lantern mean 0.27 s vs zeam 1.34 s on live devnet suggests zeam pays substantial overhead beyond leanMultisig's core prove — or lantern uses a lighter aggregation construction. Need direct comparison (see investigation plan).
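As a rough consistency check on factor 1, assume (purely for illustration) a two-point model: a fraction f of samples are large-N proves costing t_slow, and the rest are small-N proves costing t_fast ≈ 0.5 s. The observed mean of 1.34 s then fixes f:

```latex
\bar{t} = (1-f)\,t_{\text{fast}} + f\,t_{\text{slow}}
\quad\Longrightarrow\quad
f = \frac{\bar{t} - t_{\text{fast}}}{t_{\text{slow}} - t_{\text{fast}}}
  = \frac{1.34 - 0.5}{t_{\text{slow}} - 0.5}
\qquad
\begin{aligned}
t_{\text{slow}} = 2\,\text{s} &\;\Rightarrow\; f \approx 0.56 \\
t_{\text{slow}} = 3\,\text{s} &\;\Rightarrow\; f \approx 0.34 \\
t_{\text{slow}} = 4\,\text{s} &\;\Rightarrow\; f \approx 0.24
\end{aligned}
```

The percentiles point the same way: p50 at ~0.5 s with p75 at ~2.3 s implies that somewhere between a quarter and a half of samples sit in the slow mode. The model is a simplification, so this supports the input-size explanation without confirming it; the num_raw label in the investigation plan is what would settle it.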

Known zeam per-prove overhead in code:

  • Rust: deep-clone all 8 XmssPublicKey + XmssSignature on every FFI call (multisig-glue/src/lib.rs:130-135)
  • Zig prep: fromBytes + handle materialization for every gossip sig every prove (block.zig:847-866)
  • Timer scope: lean_pq_sig_* includes bitfield merge + prove + serialize; live data shows prove dominates (identical to zeam_xmss_rec_aggregate_prove_seconds)

Investigation plan

1. Instrument num_raw on the build histogram (required first)

Add a label (or companion histogram) on prove entry:

lean_pq_sig_aggregated_signatures_building_time_seconds{num_raw="N"}

Without this, p95 tuning is blind — we cannot tell whether the tail is "8-raw steady state" or "partial input + reprove".
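To illustrate what the label buys, here is a minimal std-only sketch that buckets build times by num_raw and reports a per-bucket percentile. It does not use zeam's metrics library; `BuildTimesByNumRaw`, `observe`, and `percentile` are hypothetical names, and the nearest-rank percentile is one assumed definition among several.

```rust
use std::collections::BTreeMap;

/// Hypothetical companion to the build histogram: raw samples keyed by num_raw.
#[derive(Default)]
struct BuildTimesByNumRaw {
    samples: BTreeMap<usize, Vec<f64>>,
}

impl BuildTimesByNumRaw {
    /// Record one prove duration (seconds) under its num_raw bucket.
    fn observe(&mut self, num_raw: usize, seconds: f64) {
        self.samples.entry(num_raw).or_default().push(seconds);
    }

    /// Nearest-rank percentile for one bucket, with q in (0, 1].
    /// Returns None when no sample was recorded for that num_raw.
    fn percentile(&self, num_raw: usize, q: f64) -> Option<f64> {
        let mut sorted = self.samples.get(&num_raw)?.clone();
        if sorted.is_empty() {
            return None;
        }
        sorted.sort_by(|a, b| a.total_cmp(b));
        let rank = ((q * sorted.len() as f64).ceil() as usize).clamp(1, sorted.len());
        Some(sorted[rank - 1])
    }
}
```

With per-bucket data, `percentile(8, 0.95)` answers the SLO question for full 8/8 committees directly, while the low-num_raw buckets show how much of the histogram comes from partial-input proves.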

2. lean-bench parity on aggregator hardware

Run lean-bench aggregate.flat_{2,4,8}_r2 (referenced in pkgs/metrics/src/lib.zig:937) on the same host class as zeam_8 and compare to zeam_xmss_rec_aggregate_prove_seconds.

| Outcome | Next step |
| --- | --- |
| Bench flat_8 ≈ zeam p95 (~4 s) | Problem is leanMultisig prove scaling: upstream perf or different security params (LOG_INV_RATE_PROD = 2, aggregation.zig:132) |
| Bench flat_8 ≈ lantern mean (~0.3 s) | Problem is zeam FFI / wrapper overhead: profile clones + serialize |
| Bench flat_8 ≈ 1.3 s | Problem is partial-input mix: fix aggregation timing (item 3) |
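The outcome table can be read as a nearest-reference-point classification. The sketch below is one assumed way to make "≈" concrete: it picks the reference value (4.0 s, 0.3 s, or 1.3 s, taken from the table) closest to the measured flat_8 time on a log scale. The `RootCause` enum, the `classify` function, and the log-ratio distance are all illustrative choices, not part of lean-bench or zeam.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RootCause {
    ProveScaling,    // bench flat_8 near zeam p95 (~4 s)
    FfiOverhead,     // bench flat_8 near lantern mean (~0.3 s)
    PartialInputMix, // bench flat_8 near 1.3 s
}

/// Reference points from the outcome table, in seconds.
const REFERENCES: [(f64, RootCause); 3] = [
    (4.0, RootCause::ProveScaling),
    (0.3, RootCause::FfiOverhead),
    (1.3, RootCause::PartialInputMix),
];

/// Classify a measured flat_8 bench time by the nearest reference point,
/// using |ln(measured / reference)| as the distance (an assumption).
/// Returns None for non-finite or non-positive inputs.
fn classify(bench_flat8_secs: f64) -> Option<RootCause> {
    if !bench_flat8_secs.is_finite() || bench_flat8_secs <= 0.0 {
        return None;
    }
    let dist = |reference: f64| (bench_flat8_secs / reference).ln().abs();
    REFERENCES
        .iter()
        .min_by(|a, b| dist(a.0).total_cmp(&dist(b.0)))
        .map(|&(_, cause)| cause)
}
```

A measurement that falls between two reference points is ambiguous. In that case the num_raw-labeled histogram from item 1 is the better discriminator than any single threshold.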

3. Aggregation timing policy (if tail = partial inputs)

Options to evaluate once num_raw is labeled:

  • Defer prove until subnet gossip coverage ≥ K/8 (e.g. 6/8 or 8/8) within the slot window
  • Or: prove early with partial input for fork-choice weight, but do not count partial proves toward the p95 SLO for published aggregates

Code touchpoints: submitAggregateOnInterval / aggregateImpl slot window (chain.zig:4462), pruneTrivialFromAggregateSnapshot threshold (forkchoice.zig:3913).
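The first option (defer until coverage ≥ K/8) could look roughly like the decision function below. It is a sketch under assumptions: `DeferPolicy`, `Decision`, and `decide` are hypothetical names, and the fallback of proving with partial input once the slot window is about to close is an added assumption so that the aggregator still publishes something. The issue does not specify that fallback.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Prove,
    Wait,
    Skip,
}

struct DeferPolicy {
    /// K in "coverage >= K/8", e.g. 6 or 8.
    min_coverage: usize,
    /// Trivial-input threshold (min_aggregation_inputs, default 2 per the issue).
    min_inputs: usize,
}

/// Decide what to do on an aggregation trigger in flat mode (no children).
/// `window_closing` is true on the last trigger inside the slot window.
fn decide(policy: &DeferPolicy, num_raw: usize, window_closing: bool) -> Decision {
    if num_raw >= policy.min_coverage {
        Decision::Prove // enough of the committee has arrived
    } else if !window_closing {
        Decision::Wait // defer: more gossip sigs may still arrive
    } else if num_raw >= policy.min_inputs {
        Decision::Prove // assumed fallback: partial prove at the deadline
    } else {
        Decision::Skip // still trivial input at the deadline
    }
}
```

With K = 6, a trigger that sees 3/8 sigs waits instead of proving, so at most one partial prove per att_data lands in the histogram, and only at the deadline.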

4. Cross-client construction comparison (lantern / qlean)

lantern mean 0.27 s is the benchmark to beat. Questions for lantern/qlean teams:

  • Same rec_xmss_aggregate / leanMultisig STARK at LOG_INV_RATE=2?
  • Same metric semantics on lean_pq_sig_aggregated_signatures_building_time_seconds?
  • Do they wait for full committee before proving?

zeam has no lantern/qlean source in-tree; this comparison is external but essential — ethlambda (1.38 s) is no longer the right reference point.

5. FFI clone reduction (if bench shows overhead gap)

If lean-bench flat_8 is fast but zeam is slow with identical num_raw:

  • Avoid deep-cloning PKs/sigs in multisig-glue when handles are prove-duration-borrowed
  • Reuse Zig-side fromBytes handles across reproves of the same snapshot sigs
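The clone-versus-borrow difference in the first bullet can be sketched as follows. `PublicKey`, `Signature`, and `prove` are hypothetical stand-ins with placeholder byte payloads, not the multisig-glue or leanMultisig types. Whether the real prover entry point can accept borrowed inputs is exactly what this item would need to verify.

```rust
#[derive(Clone)]
struct PublicKey(Vec<u8>);

#[derive(Clone)]
struct Signature(Vec<u8>);

/// Stand-in for the prover: it only reads its inputs.
/// Returns the total input byte count so both paths can be compared.
fn prove(pks: &[PublicKey], sigs: &[Signature]) -> usize {
    let pk_bytes: usize = pks.iter().map(|pk| pk.0.len()).sum();
    let sig_bytes: usize = sigs.iter().map(|sig| sig.0.len()).sum();
    pk_bytes + sig_bytes
}

/// Current shape per the issue: deep-clone every key and signature per call.
fn aggregate_cloning(pks: &[PublicKey], sigs: &[Signature]) -> usize {
    let owned_pks: Vec<PublicKey> = pks.to_vec(); // one heap allocation per key
    let owned_sigs: Vec<Signature> = sigs.to_vec(); // one heap allocation per sig
    prove(&owned_pks, &owned_sigs)
}

/// Proposed shape: borrow the caller's data for the duration of the prove.
fn aggregate_borrowing(pks: &[PublicKey], sigs: &[Signature]) -> usize {
    prove(pks, sigs)
}
```

Both functions hand the prover the same data; the borrowing variant skips the per-call allocations and copies. Whether that saving is visible next to STARK proving cost is what the lean-bench comparison in item 2 should establish before this work is prioritized.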

Acceptance criteria

  • num_raw (and ideally num_children) exposed on aggregate build metrics
  • lean-bench flat_8 vs zeam_xmss_rec_aggregate_prove_seconds documented on aggregator hardware
  • Root cause classified: prove scaling vs FFI overhead vs partial-input timing
  • With 8/8 raw input (steady state, no children): p95 build ≤ 1.3 s on zeam aggregator
  • Mean build within 1.5× of lantern on the same subnet / hardware class (currently ~5×)

Related
