fix(scout): tolerate a dead NVMe during cleanup (do not fail the whole machine) by kirson-git · Pull Request #2914 · NVIDIA/infra-controller

kirson-git · 2026-06-26T12:08:17Z

Problem

In crates/scout/src/deprovision/scrabbing.rs, all_nvme_cleanup returns an error if any NVMe device fails cleanup. Because a single dead or failing drive aborts the entire cleanup, the machine transitions to FAILED/NVMECleanFailed and can never be provisioned again — even though the OS install only requires one healthy drive. A single bad disk should not condemn an otherwise-usable machine.
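As an illustration of the failure mode described above, here is a minimal sketch of "any failure aborts everything" semantics. The actual body of all_nvme_cleanup is not shown in this PR description, so the function name `strict_cleanup`, its signature, and the error strings are hypothetical.

```rust
/// Hypothetical stand-in for the strict behavior: the first per-device
/// error is propagated and the whole cleanup is reported as failed.
/// Each tuple is (device name, result of that device's cleanup).
fn strict_cleanup(results: Vec<(&str, Result<(), String>)>) -> Result<(), String> {
    for (device, result) in results {
        // `?` returns early on the first failing device, so one dead
        // drive fails the cleanup even if every other drive succeeded.
        result.map_err(|e| format!("{}: {}", device, e))?;
    }
    Ok(())
}
```

Under this policy, two healthy drives and one failing drive still yield an overall error, which is what sends the machine to NVMECleanFailed.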

Real incident

A drive returning nvme delete-ns ... Internal Error 0x6006 failed its per-device cleanup. The error propagated out of all_nvme_cleanup, putting the whole machine into NVMECleanFailed and making it unprovisionable, despite the remaining drives being healthy and cleaned successfully.

Fix

Fail the cleanup only when every drive failed (success_count == 0). If at least one drive succeeded, log a tracing::warn! with the success/failure counts and continue — the OS install targets a healthy drive. This extends the per-format error tolerance introduced in #2820 to the overall cleanup result. The analogous HDD/SAS cleanup block is intentionally left unchanged.
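The policy described above can be sketched roughly as follows. This is not the code from the PR: `CleanupOutcome`, `summarize_cleanup`, and the message texts are hypothetical names chosen for illustration, and `eprintln!` stands in for `tracing::warn!` so the sketch has no external dependencies. Only `success_count` and the "fail only when every drive failed" rule come from the description.

```rust
/// Hypothetical summary of a cleanup run across all NVMe devices.
#[derive(Debug, PartialEq)]
enum CleanupOutcome {
    /// Every drive was cleaned.
    AllClean,
    /// At least one drive was cleaned, but some failed.
    Partial { success_count: usize, failure_count: usize },
}

/// Aggregate per-device results: return an error only when no drive
/// was cleaned successfully. Each tuple is (device name, cleanup result).
fn summarize_cleanup(
    results: &[(&str, Result<(), String>)],
) -> Result<CleanupOutcome, String> {
    let success_count = results.iter().filter(|(_, r)| r.is_ok()).count();
    let failure_count = results.len() - success_count;

    if failure_count == 0 {
        // Assumption: a machine with no NVMe devices also lands here.
        return Ok(CleanupOutcome::AllClean);
    }
    if success_count == 0 {
        return Err(format!("NVMe cleanup failed on all {} drive(s)", failure_count));
    }
    // Stand-in for tracing::warn! with the success/failure counts.
    eprintln!(
        "warning: NVMe cleanup succeeded on {} drive(s), failed on {}; continuing",
        success_count, failure_count
    );
    Ok(CleanupOutcome::Partial { success_count, failure_count })
}
```

Returning the counts in the `Partial` variant, rather than a bare `Ok(())`, is a design choice of this sketch: it lets a caller record which machines were provisioned with a failed drive still present.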

copy-pr-bot · 2026-06-26T12:08:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ajf

> A drive returning nvme delete-ns ... Internal Error 0x6006 failed its per-device cleanup. The error propagated out of all_nvme_cleanup, putting the whole machine into NVMECleanFailed and making it unprovisionable, despite the remaining drives being healthy and cleaned successfully.

I think falling into the NVMECleanFailed state on any drive erase failure is the correct behavior. NICo needs to ensure that, after one user is done with the machine, it is safe for the next user. If a transient error can leave an NVMe drive unformatted and we still hand the machine on, that seems like a major trust issue.

I don't think this is the correct logic and we shouldn't merge it.

Why is this an OK outcome for your use-case?

Commit 0495830: fix(scout): tolerate a dead NVMe during cleanup (do not fail the whole machine)

kirson-git requested a review from a team as a code owner June 26, 2026 12:08

ajf requested changes Jun 26, 2026

kirson-git wants to merge 1 commit into NVIDIA:main from kirson-git:fix-scout-tolerate-dead-nvme
