Stream processor drops completion marker #80

Merged: katiewasnothere merged 1 commit into apple:main from tensorfields:stream-processor-drops-completion-marker on Jun 9, 2026

@tensorfields

Summary

container build can hang indefinitely at "transferring context".
When this happens, the build must be killed manually. The hang
occurs intermittently and is sensitive to context size, disk latency,
and probably scheduler timing as well.

I tracked this issue to Demultiplexer.Accept in container-build-shim,
which silently drops packets when its 32-slot per-id channel is full,
leaving the consumer waiting forever on a packet that will never arrive.
Any build context that produces enough packets to saturate the channel
can trigger the hang.

Bug Details

Demultiplexer.Accept in pkg/stream/processor.go uses a
non-blocking select whose default branch drops the packet when
the channel is full:

select {
case <-d.ctx.Done():
    d.closeFn(d.id)
    return d.ctx.Err()
case d.ch <- c:
    return nil
default:
    d.closeFn(d.id)
    return ErrDemuxChannelFull
}

On the FSSync Walk path, the closeFn is a no-op
(pkg/fssync/walk.go), so the demux ctx is not cancelled on
overflow. The consumer (startTar in pkg/fileutils/tarxfer.go)
blocks indefinitely in demux.Recv() waiting for packets that were
dropped. The dropped packets can include the final complete=true
marker, so the receiver can never finish.

On the producer side, the host (macOS) streams the context tar in 4 MiB chunks
(container/Sources/ContainerBuild/BuildFSSync.swift)
into a Swift AsyncStream created with the default unbounded buffering
policy (Builder.swift), so the host produces chunks with no backpressure.
A slow consumer can easily saturate the 32-slot demux channel.
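
To make the failure mode concrete, here is a minimal standalone sketch of the drop. It is not the shim's code: packet, acceptNonBlocking, and simulateDrops are hypothetical names. It models the worst case of a slow consumer, one that has not started draining yet. It sends 64 packets into a 32-slot channel using the same non-blocking select shape, and the final complete=true packet is among those dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// packet is a hypothetical stand-in for a BuildTransfer packet.
type packet struct {
	seq      int
	complete bool // true only on the final packet of the stream
}

var errChannelFull = errors.New("demux channel full")

// acceptNonBlocking mirrors the shape of the pre-fix select: when the
// channel has no free slot, the packet is dropped and an error is returned.
func acceptNonBlocking(ch chan<- packet, p packet) error {
	select {
	case ch <- p:
		return nil
	default:
		return errChannelFull
	}
}

// simulateDrops sends total packets into a channel with the given capacity
// while nothing is draining it, then reports what actually got through.
func simulateDrops(capacity, total int) (delivered, dropped int, markerDelivered bool) {
	ch := make(chan packet, capacity)
	for i := 0; i < total; i++ {
		p := packet{seq: i, complete: i == total-1}
		if err := acceptNonBlocking(ch, p); err != nil {
			dropped++
			continue
		}
		delivered++
	}
	close(ch)
	// Drain what was buffered and look for the completion marker.
	for p := range ch {
		if p.complete {
			markerDelivered = true
		}
	}
	return delivered, dropped, markerDelivered
}

func main() {
	delivered, dropped, marker := simulateDrops(32, 64)
	// Prints: delivered=32 dropped=32 completion marker delivered=false
	fmt.Printf("delivered=%d dropped=%d completion marker delivered=%t\n",
		delivered, dropped, marker)
}
```

A receiver that loops until it sees complete=true would never terminate on this stream, which matches the hang described above.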

Diagnosis

The following was captured live from a hung container build on my
machine, from inside the builder VM via container logs buildkit:

20:56:03  session started
20:56:04  diffcopy took: 1.169s   load build definition from Dockerfile
20:56:05  diffcopy took: 1.026s   load .dockerignore
20:56:05  reusing ref for local: trg599mgn053v52gpzoabhryz   "[internal] load build context"
20:56:12  WARN  handler refused packet  build_id=bf31add0-... error="demux channel full" handler_closed=false
20:56:13  WARN  handler refused packet  build_id=bf31add0-... error="demux channel full" handler_closed=false
...      (~30 repetitions within ~1s)
(silence: no "diffcopy took: …" ever logged for this Walk)

The handler_closed=false field confirms the demux ctx is not
cancelled on overflow, matching the no-op closeFn in
pkg/fssync/walk.go. The build hung for 13 minutes until
killed manually.

Running sample on the CLI process showed healthy stdio handling,
suggesting that the hang is inside the builder VM rather than in the CLI.

Reproduction

A regression test in pkg/fileutils/tarxfer_test.go reproduces the
hang deterministically and validates the fix:

go test ./pkg/fileutils/... \
  -run TestReceiver_Receive_OverflowsDemuxChannel -race

The test streams a tar split into 64 small BuildTransfer packets
through the demux while the receiver consumes concurrently. Without
the fix, the receiver blocks until the test ctx times out. With the
fix, the producer waits on backpressure and all packets are
delivered.

Proposed Fix

Remove the non-blocking default branch from
Demultiplexer.Accept. The send blocks on the bounded channel,
gated by <-d.ctx.Done():

select {
case <-d.ctx.Done():
    d.closeFn(d.id)
    return d.ctx.Err()
case d.ch <- c:
    return nil
}

With this change, backpressure can propagate end to end, and the bounded
channel becomes a backpressure signal rather than a drop point.

The new pattern should be safe for all consumers. startTar and the
content-store readers are pure draining loops that never wait on a
future packet to make progress on the current one, so they cannot
deadlock with a blocked producer.

I considered cancelling the demux ctx on overflow instead of blocking.
That would convert the hang into a prompt error, but the build would still fail.

I also considered enlarging the channel beyond its 32 slots to reduce the
chance of hitting the race, but under sufficient producer pressure any
bounded buffer can still saturate. Backpressure seemed like the best option.

Verification

  • go test ./... passes.
  • TestReceiver_Receive_OverflowsDemuxChannel passes 20 iterations
    under -race.
  • pkg/stream/... passes 20 iterations under -race.

I also tested container end to end by overriding the builder image to use my own:

$ container system property set image.builder <ref>

With a builder image that includes the fix in this PR, I no longer observe the
context transfer hang after many attempts to reproduce it locally.

@tensorfields tensorfields force-pushed the stream-processor-drops-completion-marker branch from 7046b84 to 6166e3d Compare June 9, 2026 17:10
@katiewasnothere katiewasnothere merged commit fbb4645 into apple:main Jun 9, 2026
2 checks passed