Skip to content

node, forkchoice: keep libxev tick callbacks off the forkchoice main mutex #897

@ch4r10t33r

Description

@ch4r10t33r

Context

This is the deferred "code (2)" follow-up from PR #894 review. Related issues / PRs:

PR #894 added Clock.wallSlotNow() so sync gating no longer reads the stalled slot_clock.timeSlots counter. That breaks the loop where a stalled tick driver also suppresses the catch-up that would unstall it. It does not however eliminate the stalls themselves.

Problem

Clock.tickInterval() is invoked by libxev as the slot driver. It transitively calls forkchoice.tickIntervalUnlocked which mutates slot_clock.time / slot_clock.timeSlots and runs acceptNewAttestationsUnlocked / updateSafeTargetUnlocked. These run under forkChoice.mutex (or one of its sub-locks).

If any other path on a worker thread holds forkChoice.mutex (or a lock the tick path takes inside it), the libxev tick blocks for the duration of that hold. Observable consequence on aggregators: lean_tick_interval_duration_seconds and zeam_fork_choice_tick_interval_duration_seconds both grow, slot_clock.timeSlots lags real time, and on devnet-4 we saw zeam_slot_driver_stall_fired_total > 0 with multi-second stalls (#863).

Recent commits removed several specific lock holders from this path:

But this has been a sequence of point fixes. There is no invariant that catches a regression: any new caller that takes forkChoice.mutex from a worker can re-introduce the stall. We need a structural rule + an audit.

Goal

Establish and enforce: no work item submitted by a tickInterval callback may wait on a lock that another (non-libxev) thread can hold for more than O(microseconds). Concretely, the libxev tick path must never block on forkChoice.mutex (or any lock acquired underneath it) while a chain-worker / Io.Threaded / libp2p-bridge thread is doing work.

Scope

  1. Audit every Clock.tickInterval callback chain.

    • Clock.tickInterval → registered OnIntervalCbWrapper.onInterval callbacks (pkgs/node/src/clock.zig:112-141).
    • In particular BeamNode.onIntervalforkchoice.onInterval (which already serializes on forkChoice.mutex).
    • Identify every lock taken from this path and the maximum hold time.
  2. Audit every non-libxev caller that takes forkChoice.mutex (or sub-locks shared with the tick path).

  3. For each pair (tick path × worker path) with overlapping lock acquisition, choose one of:

    • Move the worker work behind a snapshot-then-release pattern (no shared lock after snapshot).
    • Hand off to a queue and process from the libxev tick (already the chain-worker pattern; verify completeness).
    • Use a finer-grained lock the tick path does not touch.
  4. Add a guard. Options:

    • debug build assertion: track current-thread lock ownership and panic if tickInterval enters with forkChoice.mutex held by a non-libxev thread for > N ms.
    • Extend SlotDriverWatchdog to log the lock-holding thread when a stall fires.
    • A static analysis pass / lint-style script that flags forkChoice.mutex.lock() outside the libxev thread.

Non-goals

  • Changing the spec semantics of forkchoice ticks. Slot/interval boundaries and the operations performed at each interval (accept attestations at i=0, update safe target at i=3, etc.) must remain unchanged.
  • Replacing forkChoice.mutex with a global RW lock or removing serialization. Forkchoice writes still need exclusive access; the goal is just to keep the libxev thread off the waiting side of that exclusion.

Acceptance criteria

  • Audit document committed (markdown under docs/ or a comment block in forkchoice.zig) listing every code path that holds forkChoice.mutex and its maximum measured hold time.
  • No tick-path → worker-path pair where the tick path can wait > N ms (suggest N=50ms initially) on a worker. Verified by running devnet-4-style gossip burst in a CI sim and observing lean_tick_interval_duration_seconds p99 < N + nominal interval.
  • zeam_slot_driver_stall_fired_total stays 0 across zig build simtest --summary all under sustained gossip pressure.
  • Some form of regression guard (debug assert, watchdog enrichment, or lint) lands so future regressions surface immediately.

Why this matters

The recurring zeam_4 / zeam_8 head-stuck symptom on devnet-4 is the user-visible signature of this problem. PR #894 makes the recovery deterministic (catch-up triggers correctly even during stalls). This issue is about removing the stalls themselves so recovery isn't needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions