
fix(agent): default min_dpu_functioning_links to 1 so one dead uplink doesn't fence the DPU #2913

Open
kirson-git wants to merge 2 commits into NVIDIA:main from kirson-git:fix-dpu-min-functioning-links-ha

Conversation

@kirson-git

Contributor

Problem

The DPU agent's BGP health check (crates/agent/src/health/bgp.rs) loops over 0..min_healthy_links and requires each uplink to be BGP-Established. min_healthy_links is conf.min_dpu_functioning_links.unwrap_or(2) in crates/agent/src/main_loop.rs.

When min_dpu_functioning_links is unset (the default), it resolves to 2, i.e. all uplinks on a dual-homed DPU must be Established. If one redundant / failover uplink is down (e.g. a dead cable/optic), the DPU is reported unhealthy and any instance assigned to it stays stuck at Assigned/WaitingForNetworkConfig and never completes. That is the opposite of what dual-uplink redundancy is for.
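A minimal sketch of the check as described above, for illustration only. It is not the code in crates/agent/src/health/bgp.rs: the `BgpState` enum, the function names, and the `usize` threshold type are assumptions made for this example.

```rust
/// BGP session state of one uplink (hypothetical, simplified to two states).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BgpState {
    Established,
    Idle,
}

/// Walks `0..min_healthy_links`, as the description says, and returns the
/// indices of uplinks that are not Established. An uplink that does not
/// exist at a given index also counts as failing.
pub fn failing_uplinks(uplinks: &[BgpState], min_healthy_links: usize) -> Vec<usize> {
    (0..min_healthy_links)
        .filter(|&i| uplinks.get(i) != Some(&BgpState::Established))
        .collect()
}

/// The DPU is considered healthy only if no checked uplink is failing.
pub fn is_healthy(uplinks: &[BgpState], min_healthy_links: usize) -> bool {
    failing_uplinks(uplinks, min_healthy_links).is_empty()
}
```

Under these assumptions, a dual-homed DPU with uplinks `[Established, Idle]` and a threshold of 2 reports index 1 as failing, so `is_healthy` returns false even though the first uplink is up.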

Real-world incident

A BlueField-3 DPU with p0 Established and p1 physically down (dead optic): agent reported BgpPeeringTor [Target: p1_if] ... Idle, instance hung at WaitingForNetworkConfig despite full connectivity over p0 (VTEP reachable via p0).

Fix

Default min_healthy_links to 1: a dual-homed DPU stays usable on a single uplink; losing one redundant link degrades gracefully instead of fencing the DPU. Operators can still set min_dpu_functioning_links higher for stricter health.
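The effect of the default change can be sketched as follows. The constant and function names are hypothetical and the `u32` type is an assumption; only the `unwrap_or` pattern and the values 2 and 1 come from the description.

```rust
/// Threshold used when `min_dpu_functioning_links` is unset, before this PR.
pub const OLD_DEFAULT_MIN_LINKS: u32 = 2;
/// Threshold used when `min_dpu_functioning_links` is unset, with this PR.
pub const NEW_DEFAULT_MIN_LINKS: u32 = 1;

/// Resolves the effective threshold. An explicit operator setting always
/// wins; the default applies only when the value is unset.
pub fn effective_min_links(configured: Option<u32>, default: u32) -> u32 {
    configured.unwrap_or(default)
}
```

With the value unset, the result moves from 2 to 1. An operator who sets `Some(2)` keeps the stricter behavior regardless of the default.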

Operational workaround (no rebuild): set min_dpu_functioning_links = 1 in nico-api's carbide-api-config.toml.

@kirson-git kirson-git requested a review from a team as a code owner June 26, 2026 11:29
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

Review Change Stack

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dab2fb46-ca19-4cd2-9253-42b7a524a6a8

📥 Commits

Reviewing files that changed from the base of the PR and between e9f2f82 and a28af59.

📒 Files selected for processing (1)
  • crates/agent/src/main_loop.rs

Summary by CodeRabbit

  • Bug Fixes
    • Health checks now consider the system healthy with one functioning link by default when the relevant configuration value is not set (previously it required two), improving graceful degradation and reducing unnecessary interruptions in dual-homed setups.

Walkthrough

The agent main loop changes the default minimum healthy links used by health checking from 2 to 1 when the configuration value is unset.

Changes

Main loop update flow and health threshold

| Layer / File(s) | Summary |
| --- | --- |
| Lower health-check default: crates/agent/src/main_loop.rs | health::health_check now receives min_healthy_links as 1 when min_dpu_functioning_links is unset. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: lowering the default uplink health threshold to 1 to avoid fencing a DPU on one link failure. |
| Description check | ✅ Passed | The description is directly related to the code change and correctly explains the issue, fix, and workaround. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/agent/src/main_loop.rs`:
- Around line 433-445: `CurrentNetworkVersion::matches_versions_from` is
treating the default `(None, None)` state as a cache hit when both protobuf
version fields are empty after `get_non_empty_str()` normalization. Update the
version check so it only returns true after at least one successful version has
been recorded in `CurrentNetworkVersion`, and make empty/missing
`managed_host_config_version` or `instance_network_config_version` force a
reapply instead of short-circuiting the HBN update path in `main_loop`.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

🛑 Comments failed to post (1)
crates/agent/src/main_loop.rs (1)

433-445: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Do not treat an uninitialized version tracker as a cache hit.

CurrentNetworkVersion::default() starts as (None, None), and get_non_empty_str() collapses empty protobuf strings to None. If both incoming version fields are unset, matches_versions_from() returns true, so Line 659 skips the HBN update path before anything has ever been applied. Only short-circuit once at least one successful version has been recorded; missing version data should force a reapply.

Suggested fix
 impl CurrentNetworkVersion {
     pub fn matches_versions_from(
         &self,
         conf: impl AsRef<ManagedHostNetworkConfigResponse>,
     ) -> bool {
+        if self.managed_host_config_version.is_none()
+            && self.instance_network_config_version.is_none()
+        {
+            return false;
+        }
+
         let conf = conf.as_ref();
         let managed_host_config_version = get_non_empty_str(&conf.managed_host_config_version);
         let instance_network_config_version =
             get_non_empty_str(&conf.instance_network_config_version);

         self.managed_host_config_version.as_deref() == managed_host_config_version
             && self.instance_network_config_version.as_deref() == instance_network_config_version
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

impl CurrentNetworkVersion {
    pub fn matches_versions_from(
        &self,
        conf: impl AsRef<ManagedHostNetworkConfigResponse>,
    ) -> bool {
        if self.managed_host_config_version.is_none()
            && self.instance_network_config_version.is_none()
        {
            return false;
        }

        let conf = conf.as_ref();
        let managed_host_config_version = get_non_empty_str(&conf.managed_host_config_version);
        let instance_network_config_version =
            get_non_empty_str(&conf.instance_network_config_version);

        self.managed_host_config_version.as_deref() == managed_host_config_version
            && self.instance_network_config_version.as_deref() == instance_network_config_version
    }
}
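The failure mode in this comment can be reproduced in isolation with the sketch below. It is a simplified stand-in, not the agent's code: `Response` is a hypothetical replacement for `ManagedHostNetworkConfigResponse` with plain `String` fields, `get_non_empty_str` is re-implemented from its described behavior, and the method names `matches_unguarded` / `matches_guarded` are invented to compare the two variants side by side.

```rust
/// Simplified stand-in for the agent's version tracker.
#[derive(Debug, Default)]
pub struct CurrentNetworkVersion {
    pub managed_host_config_version: Option<String>,
    pub instance_network_config_version: Option<String>,
}

/// Hypothetical stand-in for the protobuf response; an empty string means unset.
#[derive(Debug, Default)]
pub struct Response {
    pub managed_host_config_version: String,
    pub instance_network_config_version: String,
}

/// Collapses an empty string to `None`, as the review comment describes.
fn get_non_empty_str(s: &str) -> Option<&str> {
    if s.is_empty() { None } else { Some(s) }
}

impl CurrentNetworkVersion {
    /// Comparison without the guard: `(None, None)` equals two empty fields.
    pub fn matches_unguarded(&self, conf: &Response) -> bool {
        self.managed_host_config_version.as_deref()
            == get_non_empty_str(&conf.managed_host_config_version)
            && self.instance_network_config_version.as_deref()
                == get_non_empty_str(&conf.instance_network_config_version)
    }

    /// Comparison with the suggested guard: a tracker that has never
    /// recorded a version never reports a match.
    pub fn matches_guarded(&self, conf: &Response) -> bool {
        if self.managed_host_config_version.is_none()
            && self.instance_network_config_version.is_none()
        {
            return false;
        }
        self.matches_unguarded(conf)
    }
}
```

For a default tracker and a response with both version fields empty, `matches_unguarded` returns true while `matches_guarded` returns false. Once a version has been recorded, both variants agree.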
@DrewBloechl

Contributor

I'm pretty sure this is the way it is on purpose. A machine with one of its uplinks down should be considered degraded and not available for assignment.

@rwthompsonii

Contributor

Yes I actually implemented this precisely because we were getting "healthy" machines with only one working link, which caused problems IIRC. Every link should be healthy for the machine to be healthy, imo.

@ajf

ajf commented Jun 26, 2026

Collaborator

I believe the existing behavior is:

  • Any link down will trigger a health alert with no effects (i.e. host can still be allocated, state machine still does stuff with it, etc.).
  • If all links are down it'll trigger a separate health alert with classifications of prevent allocations and prevent host state machine changes.

This means that any link down raises a health alert with no effect, and all links down raises a health alert that actually prevents the machine from being used.

I think what I describe is correct behavior.
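The two-tier behavior described in this comment could be modeled roughly as below. This is only a sketch of the comment's wording, not the project's health-report types: the `Classification` and `HealthAlert` types and the alert ids `AnyLinkDown` / `AllLinksDown` are hypothetical names chosen for this example.

```rust
/// Effects an alert can carry (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Classification {
    PreventAllocations,
    PreventHostStateChanges,
}

/// A health alert with zero or more classifications (hypothetical shape).
#[derive(Debug, PartialEq, Eq)]
pub struct HealthAlert {
    pub id: &'static str,
    pub classifications: Vec<Classification>,
}

/// Maps per-link status (`true` = up) to alerts:
/// any link down yields an informational alert with no classifications;
/// all links down additionally yields a blocking alert.
pub fn link_alerts(links_up: &[bool]) -> Vec<HealthAlert> {
    let down = links_up.iter().filter(|up| !**up).count();
    let mut alerts = Vec::new();
    if down > 0 {
        alerts.push(HealthAlert { id: "AnyLinkDown", classifications: vec![] });
    }
    if !links_up.is_empty() && down == links_up.len() {
        alerts.push(HealthAlert {
            id: "AllLinksDown",
            classifications: vec![
                Classification::PreventAllocations,
                Classification::PreventHostStateChanges,
            ],
        });
    }
    alerts
}
```

In this model a host with one of two links down gets a single alert with no effects and stays allocatable, while a host with both links down also gets the blocking alert.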

@rwthompsonii

Contributor

I think one link down caused problems, though, I just can't recall what they were. At the least I'd think you'd want to make that "opt in" and default to the existing behavior, which is "all links must be healthy to allocate".

4 participants