Fix Docker scan hang: bound container removal, surface timeout telemetry#1778
Conversation
Fix 1: Add 30s timeout to RemoveContainerAsync in the finally block of CreateAndRunContainerAsync. Previously used CancellationToken.None which could hang indefinitely if the Docker daemon is slow removing a stuck container, blocking the entire detector and pipeline. Fix 2: Re-throw OperationCanceledException (when the token is cancelled) before the generic catch(Exception) in ResolveImageAsync, ScanDockerImageAsync, and ScanLocalImageAsync. This lets the OCE propagate to ExecuteDetectorAsync where LinuxContainerDetectorTimeout telemetry is emitted, enabling root-cause analysis of stuck scans. Fix 3: Replace JsonSerializer.Serialize(e) with e.ToString() in LinuxScanner.RunSyftAsync. Serializing exceptions with System.Text.Json throws NotSupportedException on MethodBase properties. Fix 4: Add safety-net catch(Exception) in ExecuteDetectorAsync after the OCE catch. Ensures unexpected errors never escape the detector and crash the harness or lose other detectors' results. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Addresses a hang in Linux container image scanning by ensuring Docker container cleanup can’t block indefinitely and by improving timeout/exception propagation so timeout telemetry is emitted.
Changes:
- Bound Docker container removal in
finallywith a 30s timeout to avoid indefinite hangs. - Ensure
OperationCanceledException(when cancellation is requested) rethrows past generic catches soLinuxContainerDetectorTimeouttelemetry is emitted. - Avoid
System.Text.Jsonexception serialization pitfalls by usingException.ToString()for telemetry, and add detector-level safety-net exception handling. - Add unit tests intended to cover timeout/exception scenarios.
Show a summary per file
| File | Description |
|---|---|
src/Microsoft.ComponentDetection.Common/DockerService.cs |
Add bounded-time best-effort container removal in finally to prevent cleanup hangs. |
src/Microsoft.ComponentDetection.Detectors/linux/LinuxContainerDetector.cs |
Re-throw cancellation exceptions to surface timeout telemetry; add outer safety-net catch. |
src/Microsoft.ComponentDetection.Detectors/linux/LinuxScanner.cs |
Use Exception.ToString() for telemetry instead of JSON serialization of exceptions. |
test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs |
Add tests around timeout propagation and safety-net behavior (some need adjustment to truly exercise cancellation paths). |
Copilot's findings
Comments suppressed due to low confidence (2)
test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs:1402
- This test makes
ImageExistsLocallyAsyncthrow an OperationCanceledException, but the detector’s timeout token isn’t actually cancelled in the test. Because the detector only rethrows OCEs whencancellationToken.IsCancellationRequestedis true, this currently validates the generic failure-swallowing path rather than the intended resolve-phase timeout propagation to ExecuteDetectorAsync. Pass and cancel a token into ExecuteDetectorAsync (or otherwise ensure cancellation is requested) to validate the timeout path.
public async Task ExecuteDetectorAsync_ImageResolveThrowsOce_ReturnsSuccessWithEmptyResults()
{
// Verifies Fix 2 in the resolve phase: if the timeout fires
// during image pull/inspect, the OCE is re-thrown from
// ResolveImageAsync and caught by ExecuteDetectorAsync.
var componentRecorder = new ComponentRecorder();
var detectorArgs = new Dictionary<string, string> { { "Linux.ScanningTimeoutSec", "600" } };
var scanRequest = new ScanRequest(
new DirectoryInfo(Path.GetTempPath()),
(_, __) => false,
this.mockLogger.Object,
detectorArgs,
[NodeLatestImage],
componentRecorder
);
this.mockDockerService.Setup(service =>
service.ImageExistsLocallyAsync(It.IsAny<string>(), It.IsAny<CancellationToken>())
)
.ThrowsAsync(new OperationCanceledException());
test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs:1469
- The test name/comment says "OtherImageStillScanned", but it only asserts that the detector returns Success and doesn’t verify that the non-failing image was actually scanned/recorded. Also, the thrown OperationCanceledException isn’t tied to a cancelled token, so it won’t exercise the timeout propagation path guarded by
IsCancellationRequested. Strengthen the assertions (e.g., verify calls/recorded components) and ensure cancellation is requested if the goal is to simulate timeout behavior.
public async Task ExecuteDetectorAsync_ScanThrowsOce_OtherImageStillScanned()
{
// When scanning multiple images and one times out, the entire
// detector should still return Success (not crash).
var componentRecorder = new ComponentRecorder();
const string secondImage = "alpine:latest";
const string secondImageId = "alpine123";
this.mockDockerService.Setup(service =>
service.InspectImageAsync(secondImage, It.IsAny<CancellationToken>())
)
.ReturnsAsync(
new ContainerDetails
{
Id = 2,
ImageId = secondImageId,
Layers = [],
}
);
// First image scans fine, second throws OCE
this.mockSyftLinuxScanner.Setup(scanner =>
scanner.ScanLinuxAsync(
secondImageId,
It.IsAny<IEnumerable<DockerLayer>>(),
It.IsAny<int>(),
It.IsAny<ISet<ComponentType>>(),
It.IsAny<LinuxScannerScope>(),
It.IsAny<CancellationToken>()
)
)
.ThrowsAsync(new OperationCanceledException());
var scanRequest = new ScanRequest(
new DirectoryInfo(Path.GetTempPath()),
(_, __) => false,
this.mockLogger.Object,
null,
[NodeLatestImage, secondImage],
componentRecorder
);
var detector = new LinuxContainerDetector(
this.mockSyftLinuxScanner.Object,
this.mockDockerService.Object,
this.mockLinuxContainerDetectorLogger.Object
);
var result = await detector.ExecuteDetectorAsync(scanRequest);
// Must not crash, must return Success
result.ResultCode.Should().Be(ProcessingResultCode.Success);
}
- Files reviewed: 4/4 changed files
- Comments generated: 2
- Use pre-cancelled CancellationToken in OCE tests so the 'when (cancellationToken.IsCancellationRequested)' guard is exercised - Add Verify/assertion in multi-image test to confirm first image was scanned - Use CancelAsync() instead of Cancel() per analyzer rules Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass the caught exception to LogWarning so Docker daemon cleanup hangs are diagnosable from logs and telemetry. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses a hang in Linux container image scanning by ensuring Docker container cleanup cannot block indefinitely and by improving timeout/error surfacing so telemetry is emitted reliably when scans are canceled/timed out.
Changes:
- Bound Docker container cleanup by adding a 30s timeout around
RemoveContainerAsyncinDockerService. - Ensure
OperationCanceledExceptionpropagates out of per-image operations (resolve/scan) soExecuteDetectorAsynccan emit timeout telemetry. - Avoid exception-serialization failures in Syft telemetry by recording
Exception.ToString()instead of JSON-serializing exceptions. - Add unit tests covering cancellation/timeout propagation and safety-net behavior.
Show a summary per file
| File | Description |
|---|---|
src/Microsoft.ComponentDetection.Common/DockerService.cs |
Adds bounded-time, best-effort container removal in finally to prevent indefinite hangs during cleanup. |
src/Microsoft.ComponentDetection.Detectors/linux/LinuxContainerDetector.cs |
Improves cancellation propagation and adds a safety-net catch to prevent unexpected exceptions from escaping the detector. |
src/Microsoft.ComponentDetection.Detectors/linux/LinuxScanner.cs |
Fixes telemetry recording by using e.ToString() to avoid System.Text.Json serialization failures on exceptions. |
test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs |
Adds tests for cancellation/timeout behavior, safety-net behavior, and multi-image resilience. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 2
A Docker API or transport error from RemoveContainerAsync would escape the finally and mask the original scan result. Catch all exceptions so cleanup failures never override the primary outcome. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
👋 Hi! It looks like you modified some files in the
If none of the above scenarios apply, feel free to ignore this comment 🙂 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1778 +/- ##
============================
============================
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Problem
Customer pipelines hang indefinitely during Linux container image scanning. Despite PR #1729 adding timeout cancellation to the Docker scan read path, the detector still hangs because:
RemoveContainerAsyncusesCancellationToken.None— after the scan timeout fires and the read is cancelled, thefinallyblock callsRemoveContainerAsyncwith no timeout. If the Docker daemon is slow removing a stuck container, this hangs forever.OperationCanceledExceptionis swallowed — the genericcatch (Exception e)catches OCE before it reachesExecuteDetectorAsync, soLinuxContainerDetectorTimeouttelemetry is never emitted.JsonSerializer.Serialize(exception)throws —System.Text.Jsoncannot serializeMethodBaseproperties on exceptions.Fixes
DockerService.csRemoveContainerAsyncin finally blockLinuxContainerDetector.csLinuxScanner.cse.ToString()instead ofJsonSerializer.Serialize(e)LinuxContainerDetector.cscatch (Exception)inExecuteDetectorAsyncTesting
Related