node, aggregation: flat 8-raw XMSS prove tail (~4s p95) vs lantern/qlean on 8-validator subnet #940

## Context

Follow-up to #899, scoped to the **current ansible-devnet topology**:

- **8 validators per subnet**, flat aggregation only — **`num_children = 0`**, no recursive child proofs yet
- Aggregator collects **raw gossip XMSS signatures** for one `att_data` and calls `rec_xmss_aggregate` once
- Goal: **`lean_pq_sig_aggregated_signatures_building_time_seconds` p95 ≤ 1.3 s** when aggregating a full 8/8 subnet committee

This issue deliberately excludes recursive-aggregation optimizations (child caches, greedy child selection, etc.) — those paths are not exercised today.

---

## Live cross-client snapshot (head ≈ 2280, 2026-05-26)

| Aggregator (subnet) | build count | **mean build** |
|---------------------|------------:|---------------:|
| **lantern_0** (sn 2) | 423 | **0.27 s** |
| **qlean_4** (sn 3) | 1298 | **1.01 s** |
| **zeam_8** (sn 0) | 169 | **1.34 s** |
| ethlambda_8 (sn 1) | 4044 | 1.38 s |

On `zeam_8` (post-#925 / #933, `0xpartha/zeam:local`):

| Percentile | Build time |
|------------|------------|
| p50 | ~0.5 s |
| p75 | ~2.3 s |
| **p95** | **~4.0 s** |
| mean | 1.34 s |

Phase metrics on `zeam_8` show **`xmss_prove` ≡ build histogram** (169/169 samples, identical sums). Snapshot / prep / commit are all sub-ms. **The tail is entirely inside `xmss_aggregate` → `rec_xmss_aggregate`.**

---

## What the zeam code actually does (flat, no children)

### Prove path (every non-trivial `att_data`)

```
submitAggregateOnInterval → aggregateImpl → aggregateForSlots → aggregateUnlocked
  → computeSingleAggregatedSignature → prepareAggregateAttData → runAggregateAttDataFfi
  → AggregatedSignatureProof.aggregate → xmss.aggregateSignatures → xmss_aggregate (Rust)
  → rec_xmss_aggregate
```

Key locations:
- `pkgs/node/src/forkchoice.zig:2091` — `aggregateUnlocked` (sequential per-`att_data` loop)
- `pkgs/types/src/block.zig:794` — `prepareAggregateAttData`
- `pkgs/types/src/block.zig:1018` — `runAggregateAttDataFfi` (starts `lean_pq_sig_*` timer)
- `pkgs/types/src/aggregation.zig:78` — `AggregatedSignatureProof.aggregate`
- `pkgs/xmss/src/aggregation.zig:178` — `xmss_aggregate` FFI + `zeam_xmss_rec_aggregate_prove_seconds`
- `rust/multisig-glue/src/lib.rs:126` — clones every raw PK + sig before prove

### Only fast path today

```zig
// pkgs/types/src/block.zig:877-900
if (!has_gossip and selected_children.items.len == 1) {
    // SSZ clone of the lone child, no STARK
}
```

**Not applicable** when aggregating raw subnet attestations (`has_gossip = true`, `num_children = 0`). Every flat 8-raw build goes through the full STARK pipeline. There is **no** zeam fast path for `N raw + 0 children`.

### Trivial-input filter (aggregator only)

`isAggregatorTrivialInput` (`forkchoice.zig:3781`): skips when `num_children = 0` and `num_gossip_sigs < min_aggregation_inputs` (default **2**).

So zeam proves for **2–8 raw sigs** and never for 0–1. The bimodal build-time distribution (p50 ~0.5 s vs p95 ~4 s) is consistent with **`rec_xmss_aggregate` cost scaling with `num_raw`**, not with recursive depth.
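A minimal sketch of the filter rule as described above, to make the 2–8 range concrete. The function names and signatures are hypothetical stand-ins, not the Zig code in `forkchoice.zig`; only the predicate itself comes from this issue.

```rust
/// Hypothetical mirror of the trivial-input rule: skip the prove when there
/// are no children and fewer raw gossip sigs than the configured minimum.
fn is_trivial_input(
    num_children: usize,
    num_gossip_sigs: usize,
    min_aggregation_inputs: usize,
) -> bool {
    num_children == 0 && num_gossip_sigs < min_aggregation_inputs
}

/// Raw-signature counts in `0..=committee` that still reach the prover
/// in flat mode (`num_children = 0`).
fn proved_raw_counts(committee: usize, min_aggregation_inputs: usize) -> Vec<usize> {
    (0..=committee)
        .filter(|&n| !is_trivial_input(0, n, min_aggregation_inputs))
        .collect()
}
```

With `committee = 8` and the default minimum of 2, `proved_raw_counts` yields 2 through 8, which is the input-size spread the histogram currently mixes together.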

### Aggregation trigger timing

The worker runs on the slot interval with window `{slot-1, slot}` (`chain.zig:4462-4472`). Logs frequently show **`agg start … gossip_sigs=subnet0=1/8`**: the worker often proves **before all 8 subnet attestations have arrived**, then may prove again on a later trigger with more sigs. That produces a mix of small-N (fast) and large-N (slow) samples in the same histogram.

---

## What is NOT the problem (please do not optimize these first)

| Previously suggested item | Why irrelevant on current subnet |
|---------------------------|----------------------------------|
| Child proof deserialize cache | `num_children = 0` always — deserialize loop never runs (`multisig-glue/src/lib.rs:143-177`) |
| Greedy child selection tuning | No `known_payloads` / peer child inputs in flat mode |
| Recursive aggregation parallelism | Not deployed |
| Rayon thread auto-tune (#925) | Already done — 12 Rayon threads on 16 vCPU aggregators |
| ThinLTO / `multisig-release` | Production aggregators already build via `-Dprover=dummy` → `multisig-release` (`build.zig:56`, `rust/Cargo.toml:34-37`) |
| Parallel per-`att_data` ThreadPool scope | Explicitly removed in #925 — caused ThreadPool × Rayon oversubscription |
| Coalesce / in-flight skip rate | Affects **slot scheduling**, not per-prove STARK cost inside one histogram sample |

---

## Root-cause hypothesis (code-grounded)

Two independent factors explain the zeam histogram shape:

1. **Input size variance:** `num_raw` at snapshot time ranges from 2 to 8 because aggregation fires on the slot tick, not when 8/8 sigs are present. Small-N proves ≈ 0.5 s; full 8-raw ≈ 2–4 s (inferred from the percentile table above, not yet measured per `num_raw`).

2. **Prove cost gap vs lantern:** For the same flat multisig construction, lantern mean **0.27 s** vs zeam **1.34 s** on live devnet suggests zeam pays substantial overhead **beyond** leanMultisig's core prove — or lantern uses a lighter aggregation construction. Need direct comparison (see investigation plan).

Known zeam per-prove overhead in code:
- **Rust:** deep-clone all 8 `XmssPublicKey` + `XmssSignature` on every FFI call (`multisig-glue/src/lib.rs:130-135`)
- **Zig prep:** `fromBytes` + handle materialization for every gossip sig every prove (`block.zig:847-866`)
- **Timer scope:** `lean_pq_sig_*` includes bitfield merge + prove + serialize; live data shows prove dominates (identical to `zeam_xmss_rec_aggregate_prove_seconds`)

---

## Investigation plan

### 1. Instrument `num_raw` on the build histogram (required first)

Add a label (or companion histogram) on prove entry:

```
lean_pq_sig_aggregated_signatures_building_time_seconds{num_raw="N"}
```

Without this, p95 tuning is blind — we cannot tell whether the tail is "8-raw steady state" or "partial input + reprove".
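To illustrate what the label buys us, here is a rough sketch that groups build-time samples by `num_raw` and reads a percentile per group. `BuildTimesByNumRaw` is a hypothetical in-memory stand-in, not the zeam metrics API, and the nearest-rank percentile is only an approximation of what a Prometheus histogram quantile would report.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a labeled histogram: build-time samples
/// (seconds) grouped by the number of raw signatures in the prove.
#[derive(Default)]
struct BuildTimesByNumRaw {
    samples: BTreeMap<usize, Vec<f64>>,
}

impl BuildTimesByNumRaw {
    /// Record one prove duration under its `num_raw` label.
    fn observe(&mut self, num_raw: usize, seconds: f64) {
        self.samples.entry(num_raw).or_default().push(seconds);
    }

    /// Nearest-rank percentile (`q` in 0.0..=1.0) for one `num_raw` group.
    /// Returns `None` when that group has no samples.
    fn percentile(&self, num_raw: usize, q: f64) -> Option<f64> {
        let mut v = self.samples.get(&num_raw)?.clone();
        if v.is_empty() {
            return None;
        }
        v.sort_by(|a, b| a.total_cmp(b));
        let rank = ((q * v.len() as f64).ceil() as usize).clamp(1, v.len());
        Some(v[rank - 1])
    }
}
```

Comparing `percentile(8, 0.95)` against `percentile(2, 0.95)` would separate "8-raw steady state" from "partial input + reprove" directly, instead of inferring it from one mixed histogram.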

### 2. lean-bench parity on aggregator hardware

Run lean-bench `aggregate.flat_{2,4,8}_r2` (referenced in `pkgs/metrics/src/lib.zig:937`) on the same host class as `zeam_8` and compare to `zeam_xmss_rec_aggregate_prove_seconds`.

| Outcome | Next step |
|---------|-----------|
| Bench flat_8 ≈ zeam p95 (~4 s) | Problem is leanMultisig prove scaling — upstream perf or different security params (`LOG_INV_RATE_PROD = 2`, `aggregation.zig:132`) |
| Bench flat_8 ≈ lantern mean (~0.3 s) | Problem is zeam FFI / wrapper overhead — profile clones + serialize |
| Bench flat_8 ≈ 1.3 s | Problem is partial-input mix — fix aggregation timing (item 3) |
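The table leaves "≈" undefined. One possible reading, sketched below, picks whichever reference point is closest to the bench result on a log scale. The nearest-on-log-scale rule and the `RootCause` / `classify` names are assumptions for illustration; the three reference values are the ones in the table.

```rust
/// The three outcomes from the table above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum RootCause {
    ProveScaling,
    FfiOverhead,
    PartialInputMix,
}

/// Assumed rule: choose the reference point nearest to the bench result
/// on a log scale. Returns `None` for non-positive or non-finite input.
fn classify(bench_flat8_s: f64) -> Option<RootCause> {
    if !bench_flat8_s.is_finite() || bench_flat8_s <= 0.0 {
        return None;
    }
    // Reference points in seconds: zeam p95, lantern mean, p95 target.
    let refs = [
        (4.0_f64, RootCause::ProveScaling),
        (0.3_f64, RootCause::FfiOverhead),
        (1.3_f64, RootCause::PartialInputMix),
    ];
    let mut best: Option<(f64, RootCause)> = None;
    for &(reference, cause) in refs.iter() {
        let distance = (bench_flat8_s / reference).ln().abs();
        let closer = match best {
            Some((best_distance, _)) => distance < best_distance,
            None => true,
        };
        if closer {
            best = Some((distance, cause));
        }
    }
    best.map(|(_, cause)| cause)
}
```

A bench result that lands between two reference points (say ~2 s) would still get a label here, so any real decision should look at the raw numbers rather than trusting the bucket.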

### 3. Aggregation timing policy (if tail = partial inputs)

Options to evaluate once `num_raw` is labeled:

- Defer prove until **subnet gossip coverage ≥ K/8** (e.g. 6/8 or 8/8) within the slot window
- Or: prove early with partial input for fork-choice weight, but **do not count partial proves** toward the p95 SLO for published aggregates

Code touchpoints: `submitAggregateOnInterval` / `aggregateImpl` slot window (`chain.zig:4462`), `pruneTrivialFromAggregateSnapshot` threshold (`forkchoice.zig:3913`).
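The two options above can be sketched as a single gate. This is a hypothetical policy, not existing zeam code: `decide`, `ProveDecision`, and the `window_closing` flag are invented for illustration, and the fallback of proving a partial set when the window is about to close is an assumption this issue does not settle.

```rust
/// What the aggregation worker does with the current snapshot.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ProveDecision {
    Defer,
    ProvePartial,
    ProveFull,
}

/// Hypothetical gate: wait for `min_coverage` raw sigs while the slot
/// window is still open; prove whatever is non-trivial when it is closing.
fn decide(
    num_raw: usize,
    committee: usize,
    min_coverage: usize,
    window_closing: bool,
) -> ProveDecision {
    // Default `min_aggregation_inputs` from the trivial-input filter.
    const MIN_AGGREGATION_INPUTS: usize = 2;
    if num_raw >= committee {
        ProveDecision::ProveFull
    } else if num_raw >= min_coverage.max(MIN_AGGREGATION_INPUTS) {
        ProveDecision::ProvePartial
    } else if window_closing && num_raw >= MIN_AGGREGATION_INPUTS {
        ProveDecision::ProvePartial
    } else {
        ProveDecision::Defer
    }
}

/// Second option: only full-committee proves count toward the p95 SLO.
fn counts_toward_slo(decision: ProveDecision) -> bool {
    decision == ProveDecision::ProveFull
}
```

With `min_coverage = 6` on an 8-validator subnet, a snapshot holding 5 sigs defers mid-window but still proves if the window is closing, so fork-choice weight is not lost entirely.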

### 4. Cross-client construction comparison (lantern / qlean)

lantern mean **0.27 s** is the benchmark to beat. Questions for lantern/qlean teams:

- Same `rec_xmss_aggregate` / leanMultisig STARK at `LOG_INV_RATE=2`?
- Same metric semantics on `lean_pq_sig_aggregated_signatures_building_time_seconds`?
- Do they wait for full committee before proving?

zeam has no lantern/qlean source in-tree; this comparison is external but **essential** — ethlambda (1.38 s) is no longer the right reference point.

### 5. FFI clone reduction (if bench shows overhead gap)

If lean-bench flat_8 is fast but zeam is slow with identical `num_raw`:

- Avoid deep-cloning PKs/sigs in `multisig-glue` when handles are prove-duration-borrowed
- Reuse Zig-side `fromBytes` handles across reproves of the same snapshot sigs
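For the first bullet, the difference between the clone-based and borrow-based shapes looks roughly like this. `PublicKey`, `Signature`, and the helper functions are hypothetical stand-ins, not the `multisig-glue` or leanMultisig types, and whether the real prover entry point can accept borrowed inputs is an open question this sketch does not answer.

```rust
/// Hypothetical stand-ins for the XMSS key and signature types.
#[derive(Clone)]
struct PublicKey {
    bytes: Vec<u8>,
}

#[derive(Clone)]
struct Signature {
    bytes: Vec<u8>,
}

/// Clone-based shape: every key and signature buffer is copied per call.
fn collect_owned(pks: &[PublicKey], sigs: &[Signature]) -> Vec<(PublicKey, Signature)> {
    pks.iter().cloned().zip(sigs.iter().cloned()).collect()
}

/// Borrow-based shape: only references are collected; the caller's buffers
/// must outlive the prove call, which the lifetime `'a` enforces.
fn collect_borrowed<'a>(
    pks: &'a [PublicKey],
    sigs: &'a [Signature],
) -> Vec<(&'a PublicKey, &'a Signature)> {
    pks.iter().zip(sigs.iter()).collect()
}

/// Stand-in for a prover that only needs read access to its inputs.
fn total_input_bytes(inputs: &[(&PublicKey, &Signature)]) -> usize {
    inputs
        .iter()
        .map(|(pk, sig)| pk.bytes.len() + sig.bytes.len())
        .sum()
}
```

Whether this matters depends on item 2: if lean-bench flat_8 already matches zeam at the same `num_raw`, the clones are noise next to the STARK cost and this change is not worth making first.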

---

## Acceptance criteria

- [ ] `num_raw` (and ideally `num_children`) exposed on aggregate build metrics
- [ ] lean-bench flat_8 vs `zeam_xmss_rec_aggregate_prove_seconds` documented on aggregator hardware
- [ ] Root cause classified: **prove scaling** vs **FFI overhead** vs **partial-input timing**
- [ ] With 8/8 raw input (steady state, no children): **p95 build ≤ 1.3 s** on zeam aggregator
- [ ] Mean build within **1.5× of lantern** on the same subnet / hardware class (currently ~5×)
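The two numeric criteria can be checked mechanically once per-`num_raw` data exists. A small sketch, with hypothetical function names; note that the p95 argument is meant to be the 8/8-only p95, which the current mixed histogram cannot provide.

```rust
/// Targets taken from the acceptance criteria above.
const P95_TARGET_S: f64 = 1.3;
const MAX_MEAN_RATIO: f64 = 1.5;

/// Ratio of zeam mean build time to lantern mean build time.
fn mean_ratio(zeam_mean_s: f64, lantern_mean_s: f64) -> f64 {
    zeam_mean_s / lantern_mean_s
}

/// True when both the full-committee p95 and the mean ratio meet target.
fn meets_targets(p95_full_s: f64, zeam_mean_s: f64, lantern_mean_s: f64) -> bool {
    p95_full_s <= P95_TARGET_S && mean_ratio(zeam_mean_s, lantern_mean_s) <= MAX_MEAN_RATIO
}
```

On the snapshot means, `mean_ratio(1.34, 0.27)` is about 4.96, which is where the "~5×" figure comes from.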

---

## Related

- #899 — original cross-client gap (zeam/ethlambda parity largely achieved on mean; tail remains)
- #907 — single-raw STARK cost finding; flat mode still hits full prove for `N≥2` raw
- #925 — Rayon auto-tune + sequential proves (orchestration fixed; prove cost unchanged)
