feat: NICo × Health Check Integration (MVP – Run Catalog Checks after Break/Fix Validation) #2962

Description

@deepak-poornachandra

Is this a new feature, an enhancement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

Health Check provides a set of validation checks that can be executed directly on a node. Each check returns a pass/fail result along with diagnostic output.

  1. NICo will run the checks on the node
  2. NICo will read the output and make decisions
  3. The health check binary does not take any action itself — it only reports results

The goal is to integrate this capability into NICo in a simple and incremental way, starting with a narrow use case.
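To make the "run, read, report only" contract above concrete, here is a minimal Python sketch of how a caller such as NICo might execute one check and capture its result. It rests on assumptions that are not confirmed by this issue: that a check is invoked as a local command, that exit code 0 means pass, and that diagnostics arrive on stdout/stderr. `CheckResult`, `run_check`, and the timeout value are hypothetical names chosen for illustration, and the demo command is a stand-in rather than the real health check binary.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one check run (hypothetical shape, not a NICo type)."""
    name: str
    passed: bool
    exit_code: int
    diagnostics: str


def run_check(name: str, command: list[str], timeout_s: float = 300.0) -> CheckResult:
    """Run one check command and report its result without acting on it."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Assumption: a check that hangs is treated as a failure.
        return CheckResult(name, False, -1, f"timed out after {timeout_s}s")
    except OSError as exc:
        # Assumption: a check that cannot start is treated as a failure.
        return CheckResult(name, False, -1, f"could not start check: {exc}")
    diagnostics = (proc.stdout + proc.stderr).strip()
    # Assumption: exit code 0 means pass, anything else means fail.
    return CheckResult(name, proc.returncode == 0, proc.returncode, diagnostics)


# Stand-in commands; a real integration would invoke the health check binary.
ok = run_check("demo-pass", [sys.executable, "-c", "print('gpu ok')"])
bad = run_check(
    "demo-fail",
    [sys.executable, "-c", "import sys; print('ecc errors'); sys.exit(2)"],
)
```

The function only reports. Whether the node returns to the fleet stays a separate decision on the NICo side.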

Feature Description

✅ In Scope

  1. Run Health checks directly on the node (in-band)
  2. Trigger execution from NICo during post-repair (break/fix) workflow
  3. Capture results:
    a. Pass / Fail
    b. Diagnostic output
  4. Use results to decide whether the node can return to the fleet

❌ Out of Scope

  1. Replacing existing NICo validation logic
  2. Multi-node or cluster-level validation
  3. Out-of-band integration
  4. Continuous monitoring integration
  5. Deep integration into health check source code

Describe your ideal solution

  1. Execute Health Checks
    a. NICo triggers checks on the node
    b. Execution happens in post-repair validation step

  2. Capture and Parse Results
    Read output:
    a. Pass / Fail (exit code or equivalent)
    b. Diagnostic details
    Normalize results for NICo usage

  3. Decision Handling in NICo
    NICo evaluates:
    a. Pass → Node is healthy → return to pool
    b. Fail → Node remains in repair
    Catalog only reports; NICo makes decisions

  4. Workflow Integration
    Integrate into:
    a. Break/fix post-repair validation step
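The decision step described above could look roughly like the following Python sketch. It is illustrative only. The issue does not say how multiple check results are aggregated, so the sketch assumes that every check must pass and that an empty result set keeps the node in repair. `NodeState`, `Decision`, and `decide_post_repair` are hypothetical names, not existing NICo APIs.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class NodeState(Enum):
    IN_REPAIR = "in_repair"
    AVAILABLE = "available"  # node returned to the pool


@dataclass(frozen=True)
class Decision:
    state: NodeState
    failed_checks: tuple[str, ...]


def decide_post_repair(results: Mapping[str, bool]) -> Decision:
    """Map normalized check results (check name -> passed) to a node state.

    Assumptions: all checks must pass, and a run that produced no
    results is not treated as healthy.
    """
    failed = tuple(name for name, passed in results.items() if not passed)
    if results and not failed:
        return Decision(NodeState.AVAILABLE, ())
    return Decision(NodeState.IN_REPAIR, failed)


decision = decide_post_repair({"gpu": True, "nic": False})
print(decision.state.value, decision.failed_checks)
```

Keeping this as a pure function over normalized results leaves the check output format free to change without touching the break/fix workflow, which is consistent with "Catalog only reports; NICo makes decisions".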

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow NVIDIA Infra Controller's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request

Metadata

Assignees

No one assigned

    Labels

    feature

    Projects

    Status
    Evaluate

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests