node, forkchoice: keep libxev tick callbacks off the forkchoice main mutex

## Context

This is the deferred "code (2)" follow-up from PR #894 review. Related issues / PRs:

- #863 — slot-driver starvation under gossip flood (P0/P1/P4 landed via #886, #890, #874)
- #893 / #894 — `blocks_by_range` catch-up retry and fork recovery
- #890 — chain-worker offload of heavy chain mutations
- #874 — aggregate FFI moved off libxev to `Io.Threaded` worker

PR #894 added `Clock.wallSlotNow()` so sync gating no longer reads the stalled `slot_clock.timeSlots` counter. That breaks the loop where a stalled tick driver also suppresses the catch-up that would unstall it. It does not however eliminate the stalls themselves.

## Problem

`Clock.tickInterval()` is invoked by libxev as the slot driver. It transitively calls `forkchoice.tickIntervalUnlocked` which mutates `slot_clock.time` / `slot_clock.timeSlots` and runs `acceptNewAttestationsUnlocked` / `updateSafeTargetUnlocked`. These run under `forkChoice.mutex` (or one of its sub-locks).

If any other path on a worker thread holds `forkChoice.mutex` (or a lock the tick path takes inside it), the libxev tick blocks for the duration of that hold. Observable consequence on aggregators: `lean_tick_interval_duration_seconds` and `zeam_fork_choice_tick_interval_duration_seconds` both grow, `slot_clock.timeSlots` lags real time, and on devnet-4 we saw `zeam_slot_driver_stall_fired_total > 0` with multi-second stalls (#863).

Recent commits removed several specific lock holders from this path:

- `forkChoice.aggregate` no longer holds the main mutex (signatures_mutex is sufficient)
- `computeAggregatedSignatures` snapshot-then-release on `signatures_mutex`
- #890 — heavy chain mutators dispatched to chain-worker rather than running on libxev
- #874 — XMSS aggregate verify off libxev

But this has been a sequence of point fixes. There is no invariant that catches a regression: any new caller that takes `forkChoice.mutex` from a worker can re-introduce the stall. We need a structural rule + an audit.

## Goal

Establish and enforce: **no work item submitted by a `tickInterval` callback may wait on a lock that another (non-libxev) thread can hold for more than O(microseconds).** Concretely, the libxev tick path must never block on `forkChoice.mutex` (or any lock acquired underneath it) while a chain-worker / Io.Threaded / libp2p-bridge thread is doing work.

## Scope

1. **Audit every `Clock.tickInterval` callback chain.**
   - `Clock.tickInterval` → registered `OnIntervalCbWrapper.onInterval` callbacks (`pkgs/node/src/clock.zig:112-141`).
   - In particular `BeamNode.onInterval` → `forkchoice.onInterval` (which already serializes on `forkChoice.mutex`).
   - Identify every lock taken from this path and the maximum hold time.

2. **Audit every non-libxev caller that takes `forkChoice.mutex` (or sub-locks shared with the tick path).**
   - Chain-worker `chainWorkerThunk` paths in `chain_worker.zig`.
   - Io.Threaded aggregate worker (#874).
   - libp2p-bridge gossip / req-resp callbacks that route through `chain.onGossip` / `chain.onBlock`.
   - Status RPC handler paths.

3. **For each pair (tick path × worker path) with overlapping lock acquisition, choose one of:**
   - Move the worker work behind a snapshot-then-release pattern (no shared lock after snapshot).
   - Hand off to a queue and process from the libxev tick (already the chain-worker pattern; verify completeness).
   - Use a finer-grained lock the tick path does not touch.

4. **Add a guard.** Options:
   - `debug` build assertion: track current-thread lock ownership and panic if `tickInterval` enters with `forkChoice.mutex` held by a non-libxev thread for > N ms.
   - Extend `SlotDriverWatchdog` to log the lock-holding thread when a stall fires.
   - A static analysis pass / lint-style script that flags `forkChoice.mutex.lock()` outside the libxev thread.

## Non-goals

- Changing the spec semantics of forkchoice ticks. Slot/interval boundaries and the operations performed at each interval (accept attestations at i=0, update safe target at i=3, etc.) must remain unchanged.
- Replacing `forkChoice.mutex` with a global RW lock or removing serialization. Forkchoice writes still need exclusive access; the goal is just to keep the libxev thread off the *waiting* side of that exclusion.

## Acceptance criteria

- [ ] Audit document committed (markdown under `docs/` or a comment block in `forkchoice.zig`) listing every code path that holds `forkChoice.mutex` and its maximum measured hold time.
- [ ] No tick-path → worker-path pair where the tick path can wait > N ms (suggest N=50ms initially) on a worker. Verified by running devnet-4-style gossip burst in a CI sim and observing `lean_tick_interval_duration_seconds` p99 < N + nominal interval.
- [ ] `zeam_slot_driver_stall_fired_total` stays 0 across `zig build simtest --summary all` under sustained gossip pressure.
- [ ] Some form of regression guard (debug assert, watchdog enrichment, or lint) lands so future regressions surface immediately.

## Why this matters

The recurring zeam_4 / zeam_8 head-stuck symptom on devnet-4 is the user-visible signature of this problem. PR #894 makes the *recovery* deterministic (catch-up triggers correctly even during stalls). This issue is about removing the stalls themselves so recovery isn't needed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

node, forkchoice: keep libxev tick callbacks off the forkchoice main mutex #897

Context

Problem

Goal

Scope

Non-goals

Acceptance criteria

Why this matters

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

node, forkchoice: keep libxev tick callbacks off the forkchoice main mutex #897

Description

Context

Problem

Goal

Scope

Non-goals

Acceptance criteria

Why this matters

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions