fix(scout): tolerate a dead NVMe during cleanup (do not fail the whole machine)#2914

Open
kirson-git wants to merge 1 commit into NVIDIA:main from kirson-git:fix-scout-tolerate-dead-nvme

@kirson-git

Contributor

Problem

In crates/scout/src/deprovision/scrabbing.rs, all_nvme_cleanup returns an error if any NVMe device fails cleanup. Because a single dead or failing drive aborts the entire cleanup, the machine transitions to FAILED/NVMECleanFailed and can never be provisioned again — even though the OS install only requires one healthy drive. A single bad disk should not condemn an otherwise-usable machine.
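To make the failure mode concrete, here is a minimal sketch of a strict "any failure aborts" aggregation. It is not the actual body of all_nvme_cleanup, which this description does not show: `DeviceError` and `strict_cleanup` are hypothetical names, and the per-device results are assumed to have already been collected into a vector.

```rust
/// Hypothetical stand-in for a per-device cleanup error; the real
/// error type used by scout is not shown in this PR description.
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceError {
    pub device: String,
    pub message: String,
}

/// Strict aggregation (assumed shape of the old behavior): the first
/// failing device aborts the whole cleanup, regardless of how many
/// other devices were cleaned successfully.
pub fn strict_cleanup(results: Vec<Result<(), DeviceError>>) -> Result<(), DeviceError> {
    for result in results {
        // `?` returns early on the first Err, discarding later results.
        result?;
    }
    Ok(())
}
```

Under this policy, one `Err` among many `Ok` results is enough to fail the machine, which is the behavior described above.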

Real incident

A drive returning nvme delete-ns ... Internal Error 0x6006 failed its per-device cleanup. The error propagated out of all_nvme_cleanup, putting the whole machine into NVMECleanFailed and making it unprovisionable, despite the remaining drives being healthy and cleaned successfully.

Fix

Fail the cleanup only when every drive failed (success_count == 0). If at least one drive succeeded, log a tracing::warn! with the success/failure counts and continue — the OS install targets a healthy drive. This extends the per-format error tolerance introduced in #2820 to the overall cleanup result. The analogous HDD/SAS cleanup block is intentionally left unchanged.
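The rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `CleanupOutcome`, `AllDrivesFailed`, and `tolerant_cleanup` are hypothetical names, and `eprintln!` stands in for the `tracing::warn!` call so that the sketch has no external dependencies.

```rust
/// Hypothetical summary of a cleanup pass that did not fail outright.
#[derive(Debug, PartialEq)]
pub enum CleanupOutcome {
    AllSucceeded,
    Partial { succeeded: usize, failed: usize },
}

/// Hypothetical error returned when no drive was cleaned successfully.
#[derive(Debug, PartialEq)]
pub struct AllDrivesFailed {
    pub failed: usize,
}

/// Tolerant aggregation: fail only when success_count == 0.
/// Note: taken literally, that condition also fails for an empty
/// result list; how the real code treats a machine with no NVMe
/// drives is not stated in this description.
pub fn tolerant_cleanup<E>(results: &[Result<(), E>]) -> Result<CleanupOutcome, AllDrivesFailed> {
    let success_count = results.iter().filter(|r| r.is_ok()).count();
    let failure_count = results.len() - success_count;

    if success_count == 0 {
        return Err(AllDrivesFailed { failed: failure_count });
    }
    if failure_count > 0 {
        // The PR logs via tracing::warn!; eprintln! is a stand-in here.
        eprintln!(
            "NVMe cleanup partially failed: {} succeeded, {} failed",
            success_count, failure_count
        );
        return Ok(CleanupOutcome::Partial {
            succeeded: success_count,
            failed: failure_count,
        });
    }
    Ok(CleanupOutcome::AllSucceeded)
}
```

Returning a distinct `Partial` variant, rather than a bare `Ok(())`, is a design choice of this sketch only: it keeps the partial failure visible to the caller instead of leaving it solely in the log.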

@kirson-git kirson-git requested a review from a team as a code owner June 26, 2026 12:08
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ajf ajf left a comment

Collaborator

> A drive returning nvme delete-ns ... Internal Error 0x6006 failed its per-device cleanup. The error propagated out of all_nvme_cleanup, putting the whole machine into NVMECleanFailed and making it unprovisionable, despite the remaining drives being healthy and cleaned successfully.

I think falling into the NVMECleanFailed state on any drive erase failure is the correct behavior. NICo needs to ensure that, after one user is done with the machine, it is safe for the next user. A transient error that prevents an NVMe drive from being formatted seems like a major trust issue if we can't ensure that.

I don't think this is the correct logic and we shouldn't merge it.

Why is this an OK outcome for your use-case?


2 participants