Skip to content

Fix Docker scan hang: bound container removal, surface timeout telemetry#1778

Merged
AMaini503 merged 4 commits into
mainfrom
user/aamaini/fix-docker-scan-hang
Apr 10, 2026
Merged

Fix Docker scan hang: bound container removal, surface timeout telemetry#1778
AMaini503 merged 4 commits into
mainfrom
user/aamaini/fix-docker-scan-hang

Conversation

@AMaini503
Copy link
Copy Markdown
Contributor

Problem

Customer pipelines hang indefinitely during Linux container image scanning. Despite PR #1729 adding timeout cancellation to the Docker scan read path, the detector still hangs because:

  1. RemoveContainerAsync uses CancellationToken.None — after the scan timeout fires and the read is cancelled, the finally block calls RemoveContainerAsync with no timeout. If the Docker daemon is slow removing a stuck container, this hangs forever.

  2. OperationCanceledException is swallowed — the generic catch (Exception e) catches OCE before it reaches ExecuteDetectorAsync, so LinuxContainerDetectorTimeout telemetry is never emitted.

  3. JsonSerializer.Serialize(exception) throwsSystem.Text.Json cannot serialize MethodBase properties on exceptions.

Fixes

Fix File What
1 DockerService.cs 30s timeout on RemoveContainerAsync in finally block
2 LinuxContainerDetector.cs Re-throw OCE before generic catch (3 sites)
3 LinuxScanner.cs e.ToString() instead of JsonSerializer.Serialize(e)
4 LinuxContainerDetector.cs Safety-net catch (Exception) in ExecuteDetectorAsync

Testing

  • 4 new unit tests covering timeout propagation, no-crash safety net, resolve-phase OCE, and multi-image scenarios
  • All existing tests pass

Related

Fix 1: Add 30s timeout to RemoveContainerAsync in the finally block of
CreateAndRunContainerAsync. Previously used CancellationToken.None which
could hang indefinitely if the Docker daemon is slow removing a stuck
container, blocking the entire detector and pipeline.

Fix 2: Re-throw OperationCanceledException (when the token is cancelled)
before the generic catch(Exception) in ResolveImageAsync,
ScanDockerImageAsync, and ScanLocalImageAsync. This lets the OCE
propagate to ExecuteDetectorAsync where LinuxContainerDetectorTimeout
telemetry is emitted, enabling root-cause analysis of stuck scans.

Fix 3: Replace JsonSerializer.Serialize(e) with e.ToString() in
LinuxScanner.RunSyftAsync. Serializing exceptions with System.Text.Json
throws NotSupportedException on MethodBase properties.

Fix 4: Add safety-net catch(Exception) in ExecuteDetectorAsync after the
OCE catch. Ensures unexpected errors never escape the detector and crash
the harness or lose other detectors' results.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@AMaini503 AMaini503 requested a review from a team as a code owner April 9, 2026 22:47
@AMaini503 AMaini503 requested review from Copilot and jpinz April 9, 2026 22:47
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses a hang in Linux container image scanning by ensuring Docker container cleanup can’t block indefinitely and by improving timeout/exception propagation so timeout telemetry is emitted.

Changes:

  • Bound Docker container removal in finally with a 30s timeout to avoid indefinite hangs.
  • Ensure OperationCanceledException (when cancellation is requested) rethrows past generic catches so LinuxContainerDetectorTimeout telemetry is emitted.
  • Avoid System.Text.Json exception serialization pitfalls by using Exception.ToString() for telemetry, and add detector-level safety-net exception handling.
  • Add unit tests intended to cover timeout/exception scenarios.
Show a summary per file
File Description
src/Microsoft.ComponentDetection.Common/DockerService.cs Add bounded-time best-effort container removal in finally to prevent cleanup hangs.
src/Microsoft.ComponentDetection.Detectors/linux/LinuxContainerDetector.cs Re-throw cancellation exceptions to surface timeout telemetry; add outer safety-net catch.
src/Microsoft.ComponentDetection.Detectors/linux/LinuxScanner.cs Use Exception.ToString() for telemetry instead of JSON serialization of exceptions.
test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs Add tests around timeout propagation and safety-net behavior (some need adjustment to truly exercise cancellation paths).

Copilot's findings

Comments suppressed due to low confidence (2)

test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs:1402

  • This test makes ImageExistsLocallyAsync throw an OperationCanceledException, but the detector’s timeout token isn’t actually cancelled in the test. Because the detector only rethrows OCEs when cancellationToken.IsCancellationRequested is true, this currently validates the generic failure-swallowing path rather than the intended resolve-phase timeout propagation to ExecuteDetectorAsync. Pass and cancel a token into ExecuteDetectorAsync (or otherwise ensure cancellation is requested) to validate the timeout path.
    public async Task ExecuteDetectorAsync_ImageResolveThrowsOce_ReturnsSuccessWithEmptyResults()
    {
        // Verifies Fix 2 in the resolve phase: if the timeout fires
        // during image pull/inspect, the OCE is re-thrown from
        // ResolveImageAsync and caught by ExecuteDetectorAsync.
        var componentRecorder = new ComponentRecorder();
        var detectorArgs = new Dictionary<string, string> { { "Linux.ScanningTimeoutSec", "600" } };

        var scanRequest = new ScanRequest(
            new DirectoryInfo(Path.GetTempPath()),
            (_, __) => false,
            this.mockLogger.Object,
            detectorArgs,
            [NodeLatestImage],
            componentRecorder
        );

        this.mockDockerService.Setup(service =>
                service.ImageExistsLocallyAsync(It.IsAny<string>(), It.IsAny<CancellationToken>())
            )
            .ThrowsAsync(new OperationCanceledException());

test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs:1469

  • The test name/comment says "OtherImageStillScanned", but it only asserts that the detector returns Success and doesn’t verify that the non-failing image was actually scanned/recorded. Also, the thrown OperationCanceledException isn’t tied to a cancelled token, so it won’t exercise the timeout propagation path guarded by IsCancellationRequested. Strengthen the assertions (e.g., verify calls/recorded components) and ensure cancellation is requested if the goal is to simulate timeout behavior.
    public async Task ExecuteDetectorAsync_ScanThrowsOce_OtherImageStillScanned()
    {
        // When scanning multiple images and one times out, the entire
        // detector should still return Success (not crash).
        var componentRecorder = new ComponentRecorder();
        const string secondImage = "alpine:latest";
        const string secondImageId = "alpine123";

        this.mockDockerService.Setup(service =>
                service.InspectImageAsync(secondImage, It.IsAny<CancellationToken>())
            )
            .ReturnsAsync(
                new ContainerDetails
                {
                    Id = 2,
                    ImageId = secondImageId,
                    Layers = [],
                }
            );

        // First image scans fine, second throws OCE
        this.mockSyftLinuxScanner.Setup(scanner =>
                scanner.ScanLinuxAsync(
                    secondImageId,
                    It.IsAny<IEnumerable<DockerLayer>>(),
                    It.IsAny<int>(),
                    It.IsAny<ISet<ComponentType>>(),
                    It.IsAny<LinuxScannerScope>(),
                    It.IsAny<CancellationToken>()
                )
            )
            .ThrowsAsync(new OperationCanceledException());

        var scanRequest = new ScanRequest(
            new DirectoryInfo(Path.GetTempPath()),
            (_, __) => false,
            this.mockLogger.Object,
            null,
            [NodeLatestImage, secondImage],
            componentRecorder
        );

        var detector = new LinuxContainerDetector(
            this.mockSyftLinuxScanner.Object,
            this.mockDockerService.Object,
            this.mockLinuxContainerDetectorLogger.Object
        );

        var result = await detector.ExecuteDetectorAsync(scanRequest);

        // Must not crash, must return Success
        result.ResultCode.Should().Be(ProcessingResultCode.Success);
    }
  • Files reviewed: 4/4 changed files
  • Comments generated: 2

Comment thread src/Microsoft.ComponentDetection.Common/DockerService.cs
Aayush Maini and others added 2 commits April 9, 2026 18:27
- Use pre-cancelled CancellationToken in OCE tests so the
  'when (cancellationToken.IsCancellationRequested)' guard is exercised
- Add Verify/assertion in multi-image test to confirm first image was scanned
- Use CancelAsync() instead of Cancel() per analyzer rules

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass the caught exception to LogWarning so Docker daemon
cleanup hangs are diagnosable from logs and telemetry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 10, 2026 01:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a hang in Linux container image scanning by ensuring Docker container cleanup cannot block indefinitely and by improving timeout/error surfacing so telemetry is emitted reliably when scans are canceled/timed out.

Changes:

  • Bound Docker container cleanup by adding a 30s timeout around RemoveContainerAsync in DockerService.
  • Ensure OperationCanceledException propagates out of per-image operations (resolve/scan) so ExecuteDetectorAsync can emit timeout telemetry.
  • Avoid exception-serialization failures in Syft telemetry by recording Exception.ToString() instead of JSON-serializing exceptions.
  • Add unit tests covering cancellation/timeout propagation and safety-net behavior.
Show a summary per file
File Description
src/Microsoft.ComponentDetection.Common/DockerService.cs Adds bounded-time, best-effort container removal in finally to prevent indefinite hangs during cleanup.
src/Microsoft.ComponentDetection.Detectors/linux/LinuxContainerDetector.cs Improves cancellation propagation and adds a safety-net catch to prevent unexpected exceptions from escaping the detector.
src/Microsoft.ComponentDetection.Detectors/linux/LinuxScanner.cs Fixes telemetry recording by using e.ToString() to avoid System.Text.Json serialization failures on exceptions.
test/Microsoft.ComponentDetection.Detectors.Tests/LinuxContainerDetectorTests.cs Adds tests for cancellation/timeout behavior, safety-net behavior, and multi-image resilience.

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 2

Comment thread src/Microsoft.ComponentDetection.Common/DockerService.cs
A Docker API or transport error from RemoveContainerAsync would
escape the finally and mask the original scan result. Catch all
exceptions so cleanup failures never override the primary outcome.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@AMaini503 AMaini503 merged commit d8663f2 into main Apr 10, 2026
27 of 28 checks passed
@AMaini503 AMaini503 deleted the user/aamaini/fix-docker-scan-hang branch April 10, 2026 21:55
@github-actions
Copy link
Copy Markdown

👋 Hi! It looks like you modified some files in the Detectors folder.
You may need to bump the detector versions if any of the following scenarios apply:

  • The detector detects more or fewer components than before
  • The detector generates different parent/child graph relationships than before
  • The detector generates different devDependencies values than before

If none of the above scenarios apply, feel free to ignore this comment 🙂

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.0%. Comparing base (c3ef82c) to head (af64fb2).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@     Coverage Diff      @@
##   main   #1778   +/-   ##
============================
============================

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants