fix(scout): tolerate a dead NVMe during cleanup (do not fail the whole machine) #2914
Open
kirson-git wants to merge 1 commit into
ajf
requested changes
Jun 26, 2026
ajf
left a comment
Collaborator
A drive returning nvme delete-ns ... Internal Error 0x6006 failed its per-device cleanup. The error propagated out of all_nvme_cleanup, putting the whole machine into NVMECleanFailed and making it unprovisionable, despite the remaining drives being healthy and cleaned successfully.
I think falling into the NVMECleanFailed state on any drive erase failure is the correct behavior. NICo needs to ensure that, after one user is done with the machine, it is safe for the next user. A transient error that prevents an NVMe drive from being formatted seems like a major trust issue if we can't guarantee that.
I don't think this is the correct logic and we shouldn't merge it.
Why is this an OK outcome for your use-case?
Problem
In crates/scout/src/deprovision/scrabbing.rs, all_nvme_cleanup returns an error if any NVMe device fails cleanup. Because a single dead or failing drive aborts the entire cleanup, the machine transitions to FAILED/NVMECleanFailed and can never be provisioned again, even though the OS install only requires one healthy drive. A single bad disk should not condemn an otherwise-usable machine.
Real incident
A drive returning nvme delete-ns ... Internal Error 0x6006 failed its per-device cleanup. The error propagated out of all_nvme_cleanup, putting the whole machine into NVMECleanFailed and making it unprovisionable, despite the remaining drives being healthy and cleaned successfully.
Fix
Fail the cleanup only when every drive failed (success_count == 0). If at least one drive succeeded, log a tracing::warn! with the success/failure counts and continue, since the OS install targets a healthy drive. This extends the per-format error tolerance introduced in #2820 to the overall cleanup result. The analogous HDD/SAS cleanup block is intentionally left unchanged.
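The decision rule described in the Fix section can be sketched roughly as follows. This is a minimal illustration, not the code from the diff: DeviceResult, CleanupSummary, and summarize_nvme_cleanup are hypothetical names, eprintln! stands in for the tracing::warn! call so the sketch needs only the standard library, and treating an empty device list as success is an assumption the PR text does not address.

```rust
/// Result of cleaning a single NVMe device: Ok on success, Err with a reason.
type DeviceResult = Result<(), String>;

/// Hypothetical summary type, not taken from the PR.
#[derive(Debug, PartialEq)]
struct CleanupSummary {
    success_count: usize,
    failure_count: usize,
}

/// Hypothetical helper: decide the overall outcome from per-device results.
fn summarize_nvme_cleanup(
    results: &[(String, DeviceResult)],
) -> Result<CleanupSummary, String> {
    let success_count = results.iter().filter(|(_, r)| r.is_ok()).count();
    let failure_count = results.len() - success_count;

    // Fail only when there were failures and no drive succeeded.
    // Assumption: an empty device list is treated as success.
    if failure_count > 0 && success_count == 0 {
        return Err(format!(
            "NVMe cleanup failed on all {failure_count} device(s)"
        ));
    }

    if failure_count > 0 {
        // The PR describes a tracing::warn! here; eprintln! keeps this std-only.
        for (device, result) in results {
            if let Err(reason) = result {
                eprintln!("warning: cleanup failed for {device}: {reason}");
            }
        }
        eprintln!(
            "warning: NVMe cleanup partially failed: \
             {success_count} succeeded, {failure_count} failed; continuing"
        );
    }

    Ok(CleanupSummary { success_count, failure_count })
}
```

Under this sketch, one failed drive alongside one cleaned drive yields Ok with counts of 1 and 1 plus warnings, while a list in which every drive failed yields Err. Note that a partially successful run still leaves the failed drive un-erased, which is the concern raised in the review.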