fix(agent): default min_dpu_functioning_links to 1 so one dead uplink doesn't fence the DPU#2913
fix(agent): default min_dpu_functioning_links to 1 so one dead uplink doesn't fence the DPU#2913kirson-git wants to merge 2 commits into
Conversation
…n one dead uplink)
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
Summary by CodeRabbit
WalkthroughThe agent main loop changes the default minimum healthy links used by health checking from 2 to 1 when the configuration value is unset. ChangesMain loop update flow and health threshold
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…when one redundant uplink is down)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/agent/src/main_loop.rs`:
- Around line 433-445: `CurrentNetworkVersion::matches_versions_from` is
treating the default `(None, None)` state as a cache hit when both protobuf
version fields are empty after `get_non_empty_str()` normalization. Update the
version check so it only returns true after at least one successful version has
been recorded in `CurrentNetworkVersion`, and make empty/missing
`managed_host_config_version` or `instance_network_config_version` force a
reapply instead of short-circuiting the HBN update path in `main_loop`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dab2fb46-ca19-4cd2-9253-42b7a524a6a8
📒 Files selected for processing (1)
crates/agent/src/main_loop.rs
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/agent/src/main_loop.rs`:
- Around line 433-445: `CurrentNetworkVersion::matches_versions_from` is
treating the default `(None, None)` state as a cache hit when both protobuf
version fields are empty after `get_non_empty_str()` normalization. Update the
version check so it only returns true after at least one successful version has
been recorded in `CurrentNetworkVersion`, and make empty/missing
`managed_host_config_version` or `instance_network_config_version` force a
reapply instead of short-circuiting the HBN update path in `main_loop`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: dab2fb46-ca19-4cd2-9253-42b7a524a6a8
📒 Files selected for processing (1)
crates/agent/src/main_loop.rs
🛑 Comments failed to post (1)
crates/agent/src/main_loop.rs (1)
433-445: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Do not treat an uninitialized version tracker as a cache hit.
CurrentNetworkVersion::default()starts as(None, None), andget_non_empty_str()collapses empty protobuf strings toNone. If both incoming version fields are unset,matches_versions_from()returnstrue, so Line 659 skips the HBN update path before anything has ever been applied. Only short-circuit once at least one successful version has been recorded; missing version data should force a reapply.Suggested fix
impl CurrentNetworkVersion { pub fn matches_versions_from( &self, conf: impl AsRef<ManagedHostNetworkConfigResponse>, ) -> bool { + if self.managed_host_config_version.is_none() + && self.instance_network_config_version.is_none() + { + return false; + } + let conf = conf.as_ref(); let managed_host_config_version = get_non_empty_str(&conf.managed_host_config_version); let instance_network_config_version = get_non_empty_str(&conf.instance_network_config_version); self.managed_host_config_version.as_deref() == managed_host_config_version && self.instance_network_config_version.as_deref() == instance_network_config_version }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.impl CurrentNetworkVersion { pub fn matches_versions_from( &self, conf: impl AsRef<ManagedHostNetworkConfigResponse>, ) -> bool { if self.managed_host_config_version.is_none() && self.instance_network_config_version.is_none() { return false; } let conf = conf.as_ref(); let managed_host_config_version = get_non_empty_str(&conf.managed_host_config_version); let instance_network_config_version = get_non_empty_str(&conf.instance_network_config_version); self.managed_host_config_version.as_deref() == managed_host_config_version && self.instance_network_config_version.as_deref() == instance_network_config_version } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/agent/src/main_loop.rs` around lines 433 - 445, `CurrentNetworkVersion::matches_versions_from` is treating the default `(None, None)` state as a cache hit when both protobuf version fields are empty after `get_non_empty_str()` normalization. Update the version check so it only returns true after at least one successful version has been recorded in `CurrentNetworkVersion`, and make empty/missing `managed_host_config_version` or `instance_network_config_version` force a reapply instead of short-circuiting the HBN update path in `main_loop`.
|
I'm pretty sure this is the way it is on purpose. A machine with one of its uplinks down should be considered degraded and not available for assignment. |
|
Yes I actually implemented this precisely because we were getting "healthy" machines with only one working link, which caused problems IIRC. Every link should be healthy for the machine to be healthy, imo. |
|
I believed the existing behavior is:
This means that any link down makes a health alert with no effect, and all links down makes a health alert with actual prevention measures from that machine being used. I think what I describe is correct behavior. |
|
I think one link down caused problems, though, I just can't recall what they were. At the least I'd think you'd want to make that "opt in" and default to the existing behavior, which is "all links must be healthy to allocate". |
Problem
The DPU agent's BGP health check (
crates/agent/src/health/bgp.rs) loops over0..min_healthy_linksand requires each uplink to be BGP-Established.min_healthy_linksisconf.min_dpu_functioning_links.unwrap_or(2)incrates/agent/src/main_loop.rs.When
min_dpu_functioning_linksis unset (the default), it resolves to 2 — i.e. all uplinks on a dual-homed DPU must be Established. If one redundant / failover uplink is down (e.g. a dead cable/optic), the DPU is held unhealthy and any instance assigned to it gets stuck atAssigned/WaitingForNetworkConfig, never completing — the opposite of what dual-uplink redundancy is for.Real-world incident
A BlueField-3 DPU with
p0Established andp1physically down (dead optic): agent reportedBgpPeeringTor [Target: p1_if] ... Idle, instance hung atWaitingForNetworkConfigdespite full connectivity overp0(VTEP reachable via p0).Fix
Default
min_healthy_linksto 1: a dual-homed DPU stays usable on a single uplink; losing one redundant link degrades gracefully instead of fencing the DPU. Operators can still setmin_dpu_functioning_linkshigher for stricter health.Operational workaround (no rebuild): set
min_dpu_functioning_links = 1in nico-api'scarbide-api-config.toml.