Stream processor drops completion marker #80
Merged
katiewasnothere merged 1 commit intoJun 9, 2026
katiewasnothere
approved these changes
Jun 8, 2026
7046b84 to
6166e3d
Summary
`container build` can hang indefinitely at "transferring context". When this happens, the build must be killed manually. The hang
occurs intermittently and is sensitive to context size, disk latency,
and probably scheduler timing as well.
I tracked this issue to `Demultiplexer.Accept` in container-build-shim, which silently drops packets when its 32-slot per-id channel is full, leaving the
consumer to wait forever on a packet that will never arrive. This
manifests as `container build` hanging indefinitely at "transferring context" whenever a build context produces enough
packets to saturate the channel.
Bug Details
`Demultiplexer.Accept` in pkg/stream/processor.go uses a non-blocking
`select` whose `default` branch drops packets when the channel is full.
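The drop-on-full pattern described above can be illustrated with a minimal, self-contained sketch. This is not the shim's code: `packet`, `acceptNonBlocking`, and `countDrops` are hypothetical names, and the only details taken from this PR are the non-blocking `select` with a `default` branch and the 32-slot channel.

```go
package main

import "fmt"

// packet is a hypothetical stand-in for the shim's transfer packet type.
type packet struct {
	Seq      int
	Complete bool // marks the final packet of a stream
}

// acceptNonBlocking mimics the drop-on-full pattern: if the bounded channel
// has no free slot, the default branch runs and the packet is lost.
// It reports whether the packet was actually enqueued.
func acceptNonBlocking(ch chan<- packet, p packet) bool {
	select {
	case ch <- p:
		return true
	default:
		return false // channel full: packet silently dropped
	}
}

// countDrops sends total packets into a channel of the given capacity while
// nothing is receiving. It returns how many packets were dropped and whether
// the final Complete marker was among them.
func countDrops(total, capacity int) (dropped int, lostMarker bool) {
	ch := make(chan packet, capacity)
	for i := 0; i < total; i++ {
		p := packet{Seq: i, Complete: i == total-1}
		if !acceptNonBlocking(ch, p) {
			dropped++
			if p.Complete {
				lostMarker = true
			}
		}
	}
	return dropped, lostMarker
}

func main() {
	dropped, lostMarker := countDrops(64, 32)
	fmt.Printf("dropped=%d lostMarker=%v\n", dropped, lostMarker)
	// prints: dropped=32 lostMarker=true
}
```

With no consumer draining the channel, every packet past the 32nd is lost, including the last one carrying the completion marker.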
On the FSSync `Walk` path, the `closeFn` is a no-op
(pkg/fssync/walk.go), so the demux ctx is not cancelled on
overflow. The consumer (`startTar` in pkg/fileutils/tarxfer.go)
blocks indefinitely in `demux.Recv()` waiting for packets that were
dropped. The dropped packets can include the final
`complete=true` marker, so the receiver can never finish.
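A small sketch of the consumer side shows why a lost completion marker turns into a hang. The names `drainUntilComplete` and `simulateLostMarker` are hypothetical, and the timeout exists only so the demo terminates; per the description above, the real consumer has no such escape because the ctx is never cancelled.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// packet is a hypothetical stand-in for the shim's transfer packet type.
type packet struct {
	Seq      int
	Complete bool
}

// drainUntilComplete reads packets until it sees the Complete marker or ctx
// ends. It returns the number of packets received.
func drainUntilComplete(ctx context.Context, ch <-chan packet) (int, error) {
	n := 0
	for {
		select {
		case p := <-ch:
			n++
			if p.Complete {
				return n, nil
			}
		case <-ctx.Done():
			return n, ctx.Err()
		}
	}
}

// simulateLostMarker enqueues total packets with a drop-on-full send before
// any consumer runs, then drains with a timeout of waitMillis milliseconds.
func simulateLostMarker(total, capacity, waitMillis int) (int, error) {
	ch := make(chan packet, capacity)
	for i := 0; i < total; i++ {
		select {
		case ch <- packet{Seq: i, Complete: i == total-1}:
		default: // dropped, as in the overflow case
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(waitMillis)*time.Millisecond)
	defer cancel()
	return drainUntilComplete(ctx, ch)
}

func main() {
	n, err := simulateLostMarker(64, 32, 100)
	// The consumer receives the 32 buffered packets, then waits for a marker
	// that was dropped; only the demo timeout ends the wait.
	fmt.Println(n, err)
}
```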
On the producer side, the host (macOS) streams the context tar in 4 MiB chunks
(container/Sources/ContainerBuild/BuildFSSync.swift)
into a Swift `AsyncStream` created with the default unbounded buffering
policy (Builder.swift), so the host produces packets with no backpressure.
A slow consumer can easily saturate the 32-slot demux channel.
Diagnosis
Captured live from a hung `container build` on my machine, inside
the builder VM via `container logs buildkit`.
The `handler_closed=false` field in the captured log confirms the demux ctx is not
cancelled on overflow, matching the no-op `closeFn` in
`pkg/fssync/walk.go`. The build hung for 13 minutes until
killed manually.
Running `sample` on the CLI process showed healthy stdio handling,
suggesting a hang inside the builder VM.
Reproduction
A regression test in `pkg/fileutils/tarxfer_test.go` reproduces the
hang deterministically and validates the fix:

    go test ./pkg/fileutils/... \
      -run TestReceiver_Receive_OverflowsDemuxChannel -race

The test streams a tar split into 64 small `BuildTransfer` packets
through the demux while the receiver consumes concurrently. Without
the fix, the receiver blocks until the test ctx times out. With the
fix, the producer waits on backpressure and all packets are
delivered.
Proposed Fix
Remove the non-blocking `default` branch from `Demultiplexer.Accept`. The send blocks on the bounded channel,
gated by `<-d.ctx.Done()`. Backpressure can propagate end-to-end, and the bounded
channel becomes a backpressure signal rather than a drop point.
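The blocking-send pattern can be sketched as follows. This is an illustration under assumptions, not the code from this PR: `acceptBlocking`, `transfer`, and `runDemo` are hypothetical names, and the 1 ms consumer delay is arbitrary. The point is that a `select` with only a send case and a `ctx.Done()` case makes the producer wait for a free slot instead of dropping.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// packet is a hypothetical stand-in for the shim's transfer packet type.
type packet struct {
	Seq      int
	Complete bool
}

// acceptBlocking waits for a free slot instead of dropping the packet.
// Cancelling ctx unblocks a producer that would otherwise wait forever.
func acceptBlocking(ctx context.Context, ch chan<- packet, p packet) error {
	select {
	case ch <- p:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// transfer pushes total packets through a bounded channel while a slow
// consumer drains it. It returns how many packets arrived and whether the
// Complete marker was seen.
func transfer(ctx context.Context, total, capacity int, delay time.Duration) (int, bool, error) {
	ch := make(chan packet, capacity)
	done := make(chan struct{})
	received, sawMarker := 0, false
	go func() {
		defer close(done)
		for p := range ch {
			time.Sleep(delay) // simulate a slow consumer
			received++
			if p.Complete {
				sawMarker = true
			}
		}
	}()
	var err error
	for i := 0; i < total; i++ {
		if err = acceptBlocking(ctx, ch, packet{Seq: i, Complete: i == total-1}); err != nil {
			break
		}
	}
	close(ch)
	<-done // the consumer's writes are visible after this receive
	return received, sawMarker, err
}

// runDemo runs transfer with a background context and a 1 ms consumer delay.
func runDemo(total, capacity int) (int, bool, error) {
	return transfer(context.Background(), total, capacity, time.Millisecond)
}

func main() {
	n, saw, err := runDemo(64, 32)
	fmt.Println(n, saw, err) // prints: 64 true <nil>
}
```

Even with a channel much smaller than the packet count, every packet is delivered because the producer simply slows to the consumer's pace.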
The new pattern should be safe for all consumers. `startTar` and the
content-store readers are pure draining loops that never wait on a
future packet to make progress on the current one, so they cannot
deadlock with a blocked producer.
I considered cancelling the demux ctx on overflow (instead of blocking).
It would convert the hang into a prompt error, but the build would still fail.
I also considered enlarging the channel beyond 32 slots to reduce the chance of
a race, but under sufficient producer pressure any bounded buffer can still saturate.
Backpressure seemed like the best option.
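For comparison, the first rejected alternative (cancel on overflow) might look roughly like the sketch below. All names here (`acceptOrCancel`, `recv`, `overflowFailsFast`) are hypothetical. It shows the trade-off stated above: the receiver gets a prompt error instead of hanging, but the transfer still fails.

```go
package main

import (
	"context"
	"fmt"
)

// packet is a hypothetical stand-in for the shim's transfer packet type.
type packet struct {
	Seq      int
	Complete bool
}

// acceptOrCancel still drops on overflow, but cancels the shared context so
// that waiting receivers fail fast instead of hanging.
func acceptOrCancel(cancel context.CancelFunc, ch chan<- packet, p packet) bool {
	select {
	case ch <- p:
		return true
	default:
		cancel()
		return false
	}
}

// recv returns the next packet, or the context error once ctx is cancelled.
// Cancellation is checked first so the result is deterministic.
func recv(ctx context.Context, ch <-chan packet) (packet, error) {
	select {
	case <-ctx.Done():
		return packet{}, ctx.Err()
	default:
	}
	select {
	case p := <-ch:
		return p, nil
	case <-ctx.Done():
		return packet{}, ctx.Err()
	}
}

// overflowFailsFast sends total packets with no consumer running, then
// reports whether the next recv returns an error.
func overflowFailsFast(total, capacity int) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ch := make(chan packet, capacity)
	for i := 0; i < total; i++ {
		acceptOrCancel(cancel, ch, packet{Seq: i, Complete: i == total-1})
	}
	_, err := recv(ctx, ch)
	return err != nil
}

func main() {
	fmt.Println(overflowFailsFast(64, 32)) // prints: true
	fmt.Println(overflowFailsFast(10, 32)) // prints: false
}
```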
Verification
`go test ./...` passes.
`TestReceiver_Receive_OverflowsDemuxChannel` passes 20 iterations under `-race`.
`pkg/stream/...` passes 20 iterations under `-race`.
I also tested `container` end to end by overriding the builder to use my own.
Using the builder image with the fix in this PR, I no longer observe the context
transfer hangs after many attempts to reproduce locally.