Context
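This is the deferred "code (2)" follow-up from PR #894 review. Related issues / PRs:
- slot_interval/ tick duration (event-loop starvation vs nominal 0.8s) #863: slot-driver starvation under gossip flood (P0/P1/P4 landed via "perf(node): fix slot-driver starvation under gossip flood (#863)" #886, "node, metrics: offload heavy chain mutations to chain-worker, parallelize XMSS verify (#863)" #890, and "feat: move aggregate FFI off libxev thread using Io.Threaded worker (closes #873)" #874)
- blocks_by_range catch-up retry and fork recovery
- Io.Threaded worker
PR #894 added Clock.wallSlotNow() so sync gating no longer reads the stalled slot_clock.timeSlots counter. That breaks the loop where a stalled tick driver also suppresses the catch-up that would unstall it. It does not, however, eliminate the stalls themselves.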
Problem
Clock.tickInterval() is invoked by libxev as the slot driver. It transitively calls forkchoice.tickIntervalUnlocked, which mutates slot_clock.time / slot_clock.timeSlots and runs acceptNewAttestationsUnlocked / updateSafeTargetUnlocked. These run under forkChoice.mutex (or one of its sub-locks).
If any other path on a worker thread holds forkChoice.mutex (or a lock the tick path takes inside it), the libxev tick blocks for the duration of that hold. Observable consequence on aggregators: lean_tick_interval_duration_seconds and zeam_fork_choice_tick_interval_duration_seconds both grow, slot_clock.timeSlots lags real time, and on devnet-4 we saw zeam_slot_driver_stall_fired_total > 0 with multi-second stalls (#863).
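Recent commits removed several specific lock holders from this path:
- forkChoice.aggregate no longer holds the main mutex (signatures_mutex is sufficient)
- computeAggregatedSignatures snapshot-then-release on signatures_mutex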
But this has been a sequence of point fixes. There is no invariant that catches a regression: any new caller that takes forkChoice.mutex from a worker can re-introduce the stall. We need a structural rule + an audit.
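The blocking behaviour described above can be reproduced in miniature. The following is a minimal, language-agnostic sketch in Python, not code from the Zig codebase: `fork_choice_mutex`, `worker`, and `tick_interval` are hypothetical stand-ins for forkChoice.mutex, a chain-worker path, and the libxev tick. It only illustrates that the tick's wait is roughly the worker's hold time.

```python
import threading
import time

fork_choice_mutex = threading.Lock()   # hypothetical stand-in for forkChoice.mutex
worker_has_lock = threading.Event()    # lets the demo start the tick mid-hold


def worker(hold_seconds: float) -> None:
    """Simulates a worker thread that holds the mutex while doing slow work."""
    with fork_choice_mutex:
        worker_has_lock.set()
        time.sleep(hold_seconds)       # placeholder for the slow work


def tick_interval() -> float:
    """Simulates the tick path; returns how long it waited for the mutex."""
    start = time.monotonic()
    with fork_choice_mutex:
        waited = time.monotonic() - start
        # the slot clock update would happen here
    return waited


def measure_tick_wait(hold_seconds: float) -> float:
    """Runs one worker hold and one tick, returning the tick's wait time."""
    worker_has_lock.clear()
    t = threading.Thread(target=worker, args=(hold_seconds,))
    t.start()
    worker_has_lock.wait()             # tick arrives while the worker holds the lock
    waited = tick_interval()
    t.join()
    return waited
```

Calling `measure_tick_wait(0.05)` should return a value close to 0.05 seconds: the tick cannot update the slot clock until the worker releases the lock.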
Goal
Establish and enforce: no work item submitted by a tickInterval callback may wait on a lock that another (non-libxev) thread can hold for more than O(microseconds). Concretely, the libxev tick path must never block on forkChoice.mutex (or any lock acquired underneath it) while a chain-worker / Io.Threaded / libp2p-bridge thread is doing work.
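The snapshot-then-release pattern listed under Scope is one way a worker can stay within this rule. The sketch below is an illustration in Python under stated assumptions, not the project's Zig implementation: `AttestationPool`, `expensive_aggregate`, and `worker_aggregate` are hypothetical names, and the lock stands in for a lock such as signatures_mutex.

```python
import threading


class AttestationPool:
    """Hypothetical shared state guarded by a single lock."""

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.pending = []

    def add(self, value: int) -> None:
        with self.lock:
            self.pending.append(value)

    def snapshot(self) -> list:
        # Hold the lock only long enough to copy the data; no slow work here.
        with self.lock:
            return list(self.pending)


def expensive_aggregate(values: list) -> int:
    """Placeholder for slow work such as signature aggregation."""
    return sum(values)


def worker_aggregate(pool: AttestationPool) -> int:
    snap = pool.snapshot()             # lock is taken and released inside
    return expensive_aggregate(snap)   # runs without holding pool.lock
```

The hold time on the shared lock is then bounded by the cost of the copy rather than by the cost of the aggregation, at the price of working on data that may be slightly stale.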
Scope
- Audit every Clock.tickInterval callback chain.
  - Clock.tickInterval → registered OnIntervalCbWrapper.onInterval callbacks (pkgs/node/src/clock.zig:112-141).
  - In particular BeamNode.onInterval → forkchoice.onInterval (which already serializes on forkChoice.mutex).
  - Identify every lock taken from this path and the maximum hold time.
- Audit every non-libxev caller that takes forkChoice.mutex (or sub-locks shared with the tick path).
  - chainWorkerThunk paths in chain_worker.zig.
  - chain.onGossip / chain.onBlock.
- For each pair (tick path × worker path) with overlapping lock acquisition, choose one of:
  - Move the worker work behind a snapshot-then-release pattern (no shared lock after snapshot).
  - Hand off to a queue and process from the libxev tick (already the chain-worker pattern; verify completeness).
  - Use a finer-grained lock the tick path does not touch.
- Add a guard. Options:
  - debug build assertion: track current-thread lock ownership and panic if tickInterval enters with forkChoice.mutex held by a non-libxev thread for > N ms.
  - Extend SlotDriverWatchdog to log the lock-holding thread when a stall fires.
  - A static analysis pass / lint-style script that flags forkChoice.mutex.lock() outside the libxev thread.
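The lint-style option could start as a plain text scan. The sketch below is a rough illustration in Python with several assumptions: the regular expression assumes call sites literally contain `forkChoice.mutex.lock()`, thread affinity is approximated by a per-file allowlist (a text scan cannot know which thread runs a function), and `find_violations` plus the file names in the usage note are hypothetical.

```python
import re

# Assumed textual form of the acquisition; adjust to the real call sites.
LOCK_CALL = re.compile(r"\bforkChoice\.mutex\.lock\(\)")


def find_violations(sources: dict, allowed_files: set) -> list:
    """Return (path, line_number, line) for each lock call outside allowed_files.

    `sources` maps a file path to its text. Files in `allowed_files` are
    assumed to run only on the libxev thread and are skipped.
    """
    violations = []
    for path, text in sorted(sources.items()):
        if path in allowed_files:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            # Crude comment stripping: drop everything after "//".
            code = line.split("//", 1)[0]
            if LOCK_CALL.search(code):
                violations.append((path, lineno, line.strip()))
    return violations
```

For example, scanning a hypothetical `worker.zig` that contains `self.forkChoice.mutex.lock();` on its second line, with only `clock.zig` allowlisted, yields one violation at line 2. A real pass would also need to follow helper functions that take the lock indirectly, which a regex cannot see.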
Non-goals
- Changing the spec semantics of forkchoice ticks. Slot/interval boundaries and the operations performed at each interval (accept attestations at i=0, update safe target at i=3, etc.) must remain unchanged.
- Replacing forkChoice.mutex with a global RW lock or removing serialization. Forkchoice writes still need exclusive access; the goal is just to keep the libxev thread off the waiting side of that exclusion.
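Acceptance criteria
- Audit (docs/ or a comment block in forkchoice.zig) listing every code path that holds forkChoice.mutex and its maximum measured hold time.
- lean_tick_interval_duration_seconds p99 < N + nominal interval.
- zeam_slot_driver_stall_fired_total stays 0 across zig build simtest --summary all under sustained gossip pressure.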
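Measuring the maximum hold time per code path needs some instrumentation around the lock. The sketch below shows one possible shape in Python, as an illustration only: `TimedLock`, its `hold` method, and the call-site labels are hypothetical, and a Zig version would wrap the real mutex instead.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class TimedLock:
    """Lock wrapper that records the longest hold per labelled call site."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.holder = None                    # label of the current holder, if any
        self.max_hold = defaultdict(float)    # label -> longest hold in seconds

    @contextmanager
    def hold(self, label: str):
        self._lock.acquire()
        self.holder = label                   # a watchdog could log this on a stall
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed = time.monotonic() - start
            # Updated while still holding the lock, so writers never race.
            self.max_hold[label] = max(self.max_hold[label], elapsed)
            self.holder = None
            self._lock.release()
```

Each call site would wrap its critical section in `with lock.hold("some label"):`. After a test run, `max_hold` gives the per-path figures for the audit, and `holder` is the value a stall watchdog could report.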
Why this matters
The recurring zeam_4 / zeam_8 head-stuck symptom on devnet-4 is the user-visible signature of this problem. PR #894 makes the recovery deterministic (catch-up triggers correctly even during stalls). This issue is about removing the stalls themselves so recovery isn't needed.