Cancel TotalTimeoutHandler scheduled timeout on channel close #491

Open
codexcoder21 wants to merge 1 commit into libp2p:develop from CodexCoder21Organization:upstream-cancel-totaltimeouthandler-on-close
Problem

TotalTimeoutHandler (installed by the multistream Negotiator to bound the time a stream may spend in protocol negotiation) cancels its scheduled timeout only in handlerRemoved.

When a substream (MuxChannel) is closed before negotiation completes, application code never removes its TotalTimeoutHandler. The handler is removed only when the substream's pipeline is torn down, which happens during the channel's deferred deregistration, a regular task submitted to the channel's event loop. Cancellation of the negotiation-timeout task is therefore gated on that deferred task actually running.

Under a burst of substreams that open and abort mid-negotiation (a reconnect / negotiation-abort herd) on a CPU-constrained event loop, those deferred deregistration tasks are starved. handlerRemoved does not fire in time, so each scheduled negotiation-timeout ScheduledFutureTask, which captures the entire closed substream pipeline (MuxChannel, Negotiator$ResponderHandler, the negotiation codecs, and their ChannelHandlerContexts), stays pinned in the event loop's scheduled-task queue until its 10 s timeout elapses. When closes outpace timeout expiry, these pipelines accumulate without bound until the heap is exhausted and an OutOfMemoryError is thrown.

This is a retention-after-close leak: the number of concurrently live substreams stays bounded, but closed substreams are not reclaimed. A heap dump from a memory-constrained node (128 MB heap) under such churn shows tens of thousands of pending TotalTimeoutHandler ScheduledFutureTasks, each rooting a closed MuxChannel / Negotiator$ResponderHandler pipeline.
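The retention pattern can be reproduced in miniature with JDK classes only. The sketch below is an illustration under stated assumptions, not the project's code: `FakePipeline` and `pendingAfterChurn` are hypothetical names, and a `ScheduledThreadPoolExecutor` with remove-on-cancel enabled stands in for the event loop's scheduled-task queue. It shows that a pending timeout task keeps its captured object queued until it is either cancelled or expires.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PinnedTimeoutSketch {
    /** Hypothetical stand-in for the closed substream pipeline captured by a timeout task. */
    static final class FakePipeline {
        final byte[] state = new byte[1024];
        volatile boolean closed;
    }

    /** Schedules one 10 s timeout per "substream" and returns how many tasks are still queued. */
    static int pendingAfterChurn(int substreams, boolean cancelOnClose) {
        // Assumption: a JDK executor models the event loop's scheduled-task queue.
        ScheduledThreadPoolExecutor loop = new ScheduledThreadPoolExecutor(1);
        loop.setRemoveOnCancelPolicy(true); // cancelled tasks leave the queue immediately
        try {
            for (int i = 0; i < substreams; i++) {
                FakePipeline pipeline = new FakePipeline();
                // The lambda captures `pipeline`, so the queued task keeps it reachable.
                ScheduledFuture<?> timeout =
                        loop.schedule(() -> { pipeline.closed = true; }, 10, TimeUnit.SECONDS);
                pipeline.closed = true; // the substream aborts mid-negotiation
                if (cancelOnClose) {
                    timeout.cancel(false); // releases the task and what it captured
                }
            }
            return loop.getQueue().size();
        } finally {
            loop.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("pending without cancel on close: " + pendingAfterChurn(1000, false));
        System.out.println("pending with cancel on close:    " + pendingAfterChurn(1000, true));
    }
}
```

In this model every uncancelled task stays queued for the full timeout, which is the behaviour the heap dump describes at much larger scale.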

Fix

Register the timeout cancellation on the channel's close future as well, via a listener added in handlerAdded. The close future completes while the channel is closing — independent of event-loop backlog or channel state — so the scheduled task is cancelled and its captured pipeline released promptly even when the deferred deregistration is starved. The listener is removed in handlerRemoved (and in cancel) so it does not linger on the normal negotiation-success path.
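The shape of this fix can be sketched without Netty. The code below is a simplified model, not the PR's diff: `CloseFuture` and `TimeoutHandler` are hypothetical stand-ins whose method names mirror the callbacks named above, and a JDK scheduled executor plays the role of the event loop. It assumes close listeners run as part of closing rather than as a separately queued task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class CloseFutureCancelSketch {
    /** Hypothetical single-threaded stand-in for a channel's close future. */
    static final class CloseFuture {
        private final List<Runnable> listeners = new ArrayList<>();
        void addListener(Runnable l) { listeners.add(l); }
        void removeListener(Runnable l) { listeners.remove(l); }
        int listenerCount() { return listeners.size(); }
        void complete() {
            // Iterate over a copy because listeners may remove themselves while running.
            for (Runnable l : new ArrayList<>(listeners)) l.run();
            listeners.clear();
        }
    }

    /** Simplified model of a total-timeout handler with a second cancellation point. */
    static final class TimeoutHandler {
        private final CloseFuture closeFuture;
        private final Runnable closeListener = this::cancel;
        private ScheduledFuture<?> timeout;

        TimeoutHandler(CloseFuture closeFuture) { this.closeFuture = closeFuture; }

        void handlerAdded(ScheduledExecutorService loop, long delayMs, Runnable onTimeout) {
            timeout = loop.schedule(onTimeout, delayMs, TimeUnit.MILLISECONDS);
            closeFuture.addListener(closeListener); // cancel on close as well
        }

        void handlerRemoved() { cancel(); }

        void cancel() {
            closeFuture.removeListener(closeListener); // do not linger on the success path
            if (timeout != null) {
                timeout.cancel(false);
                timeout = null;
            }
        }
    }

    /** Returns true if the timeout fired; optionally closes the channel first. */
    static boolean timeoutFires(boolean closeChannel) {
        ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor();
        try {
            AtomicBoolean fired = new AtomicBoolean();
            CloseFuture closeFuture = new CloseFuture();
            new TimeoutHandler(closeFuture).handlerAdded(loop, 300, () -> fired.set(true));
            if (closeChannel) closeFuture.complete(); // handlerRemoved never runs here
            Thread.sleep(600); // wait well past the timeout
            return fired.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            loop.shutdownNow();
        }
    }

    /** Success path: returns the number of close listeners left after handlerRemoved. */
    static int listenersLeftAfterRemoval() {
        ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor();
        try {
            CloseFuture closeFuture = new CloseFuture();
            TimeoutHandler handler = new TimeoutHandler(closeFuture);
            handler.handlerAdded(loop, 300, () -> {});
            handler.handlerRemoved();
            return closeFuture.listenerCount();
        } finally {
            loop.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("fires after close:         " + timeoutFires(true));
        System.out.println("fires with no close:       " + timeoutFires(false));
        System.out.println("listeners after removal:   " + listenersLeftAfterRemoval());
    }
}
```

Routing both the close listener and handlerRemoved through the same `cancel` method keeps cancellation idempotent, so whichever path runs first releases the task and the later one is a no-op.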

channelInactive is not a viable cancellation point: AbstractChildChannel does not fire channelInactive for a child channel that is closed while still in the OPEN state — the common case for an aborted mid-negotiation substream. Instrumentation over a churn run measured channelInactive firing only 141 times across 9.77M substreams, whereas the close-future listener cancelled all 9.77M scheduled tasks and held the heap flat.

Tests

TotalTimeoutHandlerTest (red → green):

  • closing the channel without removing the handler must cancel the scheduled timeout — fails before this change (the timeout still fires and closes the context) and passes after;
  • a sanity case asserting the timeout still fires when neither close nor removal occurs.

Verified locally: ./gradlew :libp2p:test --tests "io.libp2p.etc.util.netty.TotalTimeoutHandlerTest" (2 passed), spotlessCheck, and detekt all green.

🤖 Generated with Claude Code

TotalTimeoutHandler (installed by the multistream Negotiator to bound
negotiation time) cancelled its scheduled timeout only in handlerRemoved. When
a substream (MuxChannel) closes before negotiation completes, application code
does not remove the handler, so handlerRemoved depends on the pipeline being
destroyed during the channel's deferred deregister, which runs as a regular
task on the channel's event loop. When that event loop is backlogged (e.g. a
reconnect / negotiation-abort herd on a CPU-constrained host) the deferred
deregister is starved and handlerRemoved does not fire in time, so the
scheduled timeout task, which captures the whole closed substream pipeline
(MuxChannel + Negotiator$ResponderHandler + codecs), stays pinned in the event
loop's scheduled-task queue until the timeout elapses. Under sustained churn
these closed-but-pinned pipelines accumulate without bound and exhaust the
heap.

Also cancel the timeout via a listener on the channel's close future, which
completes while the channel is closing regardless of event-loop backlog or
channel state. channelInactive is insufficient: AbstractChildChannel does not
fire it for a channel closed while still in the OPEN state, the common case for
an aborted mid-negotiation substream. The listener is removed in
handlerRemoved/cancel so it does not linger on the negotiation-success path.

Adds TotalTimeoutHandlerTest: closing the channel without removing the handler
must cancel the timeout (red before this change, green after), plus a sanity
case that the timeout still fires when neither close nor removal occurs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>