feat: Machine Validation - Reliable Machine Validation Run Lifecycle #454

Description

@sunilkumar-nvidia

Is this a new feature, an improvement, or a change to existing functionality?

Improvement to existing functionality.

How would you describe the priority of this feature request?

High priority. This addresses a reliability risk: validation failures can leave machines stuck in validation and unavailable.

Please provide a clear description of the problem this feature solves

Machine Validation currently depends on Scout completing successfully and sending the final completion back to the API. If Scout crashes, a command hangs, a retry fails, or the API misses or rejects the completion, the machine can remain in validation indefinitely. Operators have little durable state from which to diagnose or recover the run.

Feature Description

Introduce a durable validation run lifecycle with run items, execution attempts, heartbeats, retry tracking, and stale-run reconciliation. The API will be able to mark failed or stale validation work as terminal and unblock the machine through the existing compatibility path.
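
To make the lifecycle above concrete, here is a minimal sketch of how run items, attempts, heartbeats, retry tracking, and stale-run reconciliation could fit together. It is illustrative only: every name in it (`ValidationRun`, `RunItem`, `Attempt`, `RunState`, `reconcile_stale`, the retry budget, and the heartbeat timeout) is a hypothetical placeholder rather than the project's schema or API, and Python is used purely for brevity. The separate design doc would define the real shape.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import List


class RunState(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    STALE = "stale"


# States from which a run never moves again (assumed set).
TERMINAL_STATES = {RunState.SUCCEEDED, RunState.FAILED, RunState.STALE}


@dataclass
class Attempt:
    """One execution attempt of a run item (hypothetical shape)."""
    number: int
    started_at: datetime
    last_heartbeat: datetime
    state: RunState = RunState.RUNNING

    def heartbeat(self, now: datetime) -> None:
        # Record that the executor is still alive and making progress.
        self.last_heartbeat = now


@dataclass
class RunItem:
    """One validation test within a run, with its retry history."""
    name: str
    max_attempts: int = 3  # assumed retry budget
    attempts: List[Attempt] = field(default_factory=list)

    def start_attempt(self, now: datetime) -> Attempt:
        if len(self.attempts) >= self.max_attempts:
            raise RuntimeError(f"{self.name}: retry budget exhausted")
        attempt = Attempt(len(self.attempts) + 1, now, now)
        self.attempts.append(attempt)
        return attempt


@dataclass
class ValidationRun:
    machine_id: str
    items: List[RunItem]
    state: RunState = RunState.RUNNING


def reconcile_stale(run: ValidationRun, now: datetime,
                    heartbeat_timeout: timedelta) -> bool:
    """Mark the run terminal if a running attempt stopped heartbeating.

    Returns True when the run was just marked stale, which is the point
    where the caller would unblock the machine.
    """
    if run.state in TERMINAL_STATES:
        return False
    found_stale = False
    for item in run.items:
        for attempt in item.attempts:
            silent_for = now - attempt.last_heartbeat
            if attempt.state is RunState.RUNNING and silent_for > heartbeat_timeout:
                attempt.state = RunState.STALE
                found_stale = True
    if found_stale:
        run.state = RunState.STALE
    return found_stale


# Usage: an attempt starts, then the executor goes silent.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
run = ValidationRun("machine-1", [RunItem("memtest")])
run.items[0].start_attempt(t0)
timeout = timedelta(minutes=5)
after_one_minute = reconcile_stale(run, t0 + timedelta(minutes=1), timeout)
after_ten_minutes = reconcile_stale(run, t0 + timedelta(minutes=10), timeout)
```

In this sketch the first reconciliation pass leaves the run alone because the last heartbeat is still within the timeout, and the second pass marks both the attempt and the run as stale. Keeping the reconciliation on the API side means a run can reach a terminal state even when the executor never reports back.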

Describe your ideal solution

A separate design doc will propose the ideal solution.

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow NVIDIA Bare Metal Manager's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Metadata

Type

No fields configured for Epic.

Projects

Status
In Progress

