Cancel TotalTimeoutHandler scheduled timeout on channel close#491
Open
codexcoder21 wants to merge 1 commit into
Open
Conversation
TotalTimeoutHandler (installed by the multistream Negotiator to bound negotiation time) cancelled its scheduled timeout only in handlerRemoved. A substream (MuxChannel) that closes before negotiation completes is not removed by application code, so handlerRemoved depends on the pipeline being destroyed during the channel's deferred deregister, which runs as a regular task on the channel's event loop. When that event loop is backlogged (e.g. a reconnect / negotiation-abort herd on a CPU-constrained host) the deferred deregister is starved, handlerRemoved never fires, and the scheduled timeout task — which captures the whole closed substream pipeline (MuxChannel + Negotiator$ResponderHandler + codecs) — stays pinned in the event loop's scheduled-task queue until the timeout elapses. Under sustained churn these closed-but-pinned pipelines accumulate unbounded and exhaust the heap. Also cancel the timeout via a listener on the channel's close future, which completes while the channel is closing regardless of event-loop backlog or channel state. channelInactive is insufficient: AbstractChildChannel does not fire it for a channel closed while still in the OPEN state, the common case for an aborted mid-negotiation substream. The listener is removed in handlerRemoved/cancel so it does not linger on the negotiation-success path. Adds TotalTimeoutHandlerTest: closing the channel without removing the handler must cancel the timeout (red before this change, green after), plus a sanity case that the timeout still fires when neither close nor removal occurs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
TotalTimeoutHandler(installed by the multistreamNegotiatorto bound the time a stream may spend in protocol negotiation) cancels its scheduled timeout only inhandlerRemoved.A substream (
MuxChannel) that is closed before negotiation completes is never removed by application code. ItsTotalTimeoutHandleris removed only when the substream's pipeline is torn down, which happens during the channel's deferred deregistration — a regular task submitted to the channel's event loop. Cancellation of the negotiation-timeout task is therefore gated on that deferred task actually running.Under a burst of substreams that open and abort mid-negotiation (a reconnect / negotiation-abort herd) on a CPU-constrained event loop, those deferred deregistration tasks are starved.
handlerRemovednever fires, so each scheduled negotiation-timeoutScheduledFutureTask— which captures the entire closed substream pipeline (MuxChannel,Negotiator$ResponderHandler, the negotiation codecs, and theirChannelHandlerContexts) — stays pinned in the event loop's scheduled-task queue until its (10s) timeout elapses. When closes outpace timeout expiry these pipelines accumulate without bound untilOutOfMemoryError.This is a retention-after-close leak: the number of concurrently live substreams stays bounded, but closed substreams are not reclaimed. A heap dump from a memory-constrained node (128 MB heap) under such churn shows tens of thousands of pending
TotalTimeoutHandlerScheduledFutureTasks, each rooting a closedMuxChannel/Negotiator$ResponderHandlerpipeline.Fix
Register the timeout cancellation on the channel's close future as well, via a listener added in
handlerAdded. The close future completes while the channel is closing — independent of event-loop backlog or channel state — so the scheduled task is cancelled and its captured pipeline released promptly even when the deferred deregistration is starved. The listener is removed inhandlerRemoved(and incancel) so it does not linger on the normal negotiation-success path.channelInactiveis not a viable cancellation point:AbstractChildChanneldoes not firechannelInactivefor a child channel that is closed while still in theOPENstate — the common case for an aborted mid-negotiation substream. Instrumentation over a churn run measuredchannelInactivefiring only 141 times across 9.77M substreams, whereas the close-future listener cancelled all 9.77M scheduled tasks and held the heap flat.Tests
TotalTimeoutHandlerTest(red → green):Verified locally:
./gradlew :libp2p:test --tests "io.libp2p.etc.util.netty.TotalTimeoutHandlerTest"(2 passed),spotlessCheck, anddetektall green.🤖 Generated with Claude Code