feat: Carbide side changes to support Astra#2904
Conversation
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
|
Important Review skippedThis PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback. ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughAstra config and status fields are added to Forge RPCs and wired through DPU and agent status paths. SVPC handling is added for scout commands and MLX reports. DPA interface lookups now accept explicit search filters and updated call sites use the new API. ChangesAstra, SVPC, and DPA search updates
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (6)
crates/api-core/src/handlers/astra.rs (2)
226-311: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winRecommended: correct the misattributed log lines, the stale comment, and adopt structured fields.
Several diagnostics here are copy-paste artifacts from another handler and will mislead operators:
- Lines 226, 257, 282, 293, 302, 307: messages are prefixed
handle_dpa_message:though the enclosing function isprocess_astra_config_status.- Line 266: the comment "We checked that pf_info is not None above, so unwrap is safe" refers to a
pf_infocheck that does not exist in this function.- Line 283 references
find_by_vni, but the call isdb::spx_partition::find_by(...).Additionally, per the logfmt guideline, prefer structured key/value fields over interpolation (e.g.
tracing::error!(%vni, error = %e, "...")) so logs remain searchable.As per coding guidelines: "prefer placing common fields as attributes passed to tracing functions instead of using string interpolation."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/astra.rs` around lines 226 - 311, Update the diagnostics in process_astra_config_status so the log messages use the correct handler name instead of the copied handle_dpa_message prefix, remove the stale pf_info comment, and fix the find_by_vni wording to match db::spx_partition::find_by. While touching the tracing calls around ConfigVersion::from_str and the spx_partition lookup, switch to structured tracing fields (for example include vni, obs, and error as attributes) instead of interpolating values into the message text.Source: Coding guidelines
82-88: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winRecommended: avoid silently swallowing the transaction-begin failure.
When
database_connection.begin()fails, the handler logs and returnsOk(None), masking an infrastructure error as a legitimate "no Astra config" result and diverging from the?-propagation used at Line 64. Propagate the error instead so callers can distinguish "absent" from "failed".🔧 Proposed fix
- let mut txn = match api.database_connection.begin().await { - Ok(t) => t, - Err(e) => { - tracing::error!("handle_dpa_message: Unable to start txn: {:#?}", e); - return Ok(None); - } - }; + let mut txn = api.txn_begin().await?;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/astra.rs` around lines 82 - 88, The transaction start failure in handle_dpa_message is being swallowed by logging and returning Ok(None), which hides an infrastructure error as a normal “no result” case. Update the database_connection.begin() match so the Err path propagates the error instead of converting it to None, keeping behavior consistent with the existing ? usage in the same handler and preserving the distinction between “absent config” and “failed to begin txn”.crates/rpc/proto/forge.proto (1)
7529-7553: 🗄️ Data Integrity & Integration | 🔵 Trivial | 💤 Low valueAlign
vni(andsubnet_mask) integer types with the rest offorge.proto.
AstraAttachment.vniandAstraAttachmentStatus.vniare declared asint32, whereas every other VNI field in this file (e.g.vpc_vni,internet_l3_vni,site_global_vpc_vni) isuint32. VNIs are unsigned 24-bit values; using a signed type here is inconsistent and invites avoidable casts on the Rust side. The same observation applies tosubnet_mask. As these messages are new, changing them now is wire-cost-free.- int32 vni = 2; + uint32 vni = 2; string subnet_ipv4 = 3; - int32 subnet_mask = 4; + uint32 subnet_mask = 4;As per path instructions (review protobuf for clear naming and validation implications).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/rpc/proto/forge.proto` around lines 7529 - 7553, `AstraAttachment` and `AstraAttachmentStatus` use signed `int32` for `vni` and `subnet_mask`, which is inconsistent with the rest of `forge.proto` and the unsigned nature of these values. Update the `AstraAttachment` and `AstraAttachmentStatus` message fields to use `uint32` for `vni` and `subnet_mask`, keeping the existing field numbers and names unchanged so the new messages remain wire-safe and match the other VNI fields in this proto.Source: Path instructions
crates/api-core/src/instance/mod.rs (1)
705-711: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueHoist the constant config and prefer
Default.The
DpaSearchConfigis identical on every iteration and equalsDefault::default(). Construct it once outside the loop and rely on the derive:+ let dpa_search_config = DpaSearchConfig::default(); for mid in &machine_ids { - let dpa_search_config = DpaSearchConfig { - only_svpc: false, - only_astra: false, - }; let dpa_interfaces = db::dpa_interface::find_by_machine_id(&mut txn, *mid, dpa_search_config).await?;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/instance/mod.rs` around lines 705 - 711, The `DpaSearchConfig` in the machine-id loop is constant and should not be rebuilt on every iteration. Hoist the config construction out of the `for mid in &machine_ids` loop in `instance/mod.rs`, and prefer using `DpaSearchConfig::default()` since the values match the derived default. Keep the loop body using the shared config when calling `db::dpa_interface::find_by_machine_id`.crates/api-model/src/dpa_interface/mod.rs (1)
79-83: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoffConsider modelling the mutually-exclusive filter as an enum to make the contract compiler-enforced.
The downstream consumer
db::dpa_interface::find_by_machine_idrejects theonly_svpc == true && only_astra == truecombination at runtime, returningDatabaseError::Internal. A two-boolean representation makes an invalid state representable and pushes the invariant to runtime. A small enum captures the intent precisely and removes the need for the defensive runtime check entirely:♻️ Suggested shape
#[derive(Default, Clone, Copy)] pub enum DpaInterfaceFilter { #[default] All, OnlySvpc, OnlyAstra, }Separately, since
Defaultalready yields{ only_svpc: false, only_astra: false }, the several "no filter" call sites can simply useDpaSearchConfig::default()rather than spelling out bothfalsefields.As per coding guidelines (STYLE_GUIDE.md: "Prefer designs that are hard to misuse. The more the compiler can catch bugs, the better.").
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-model/src/dpa_interface/mod.rs` around lines 79 - 83, Replace the two-boolean DpaSearchConfig in DpaSearchConfig with a single enum-based filter so the mutually exclusive state is compiler-enforced instead of checked at runtime. Update the DpaSearchConfig definition and any callers of db::dpa_interface::find_by_machine_id to use the new enum (e.g. All, OnlySvpc, OnlyAstra), and remove the defensive “both true” runtime validation once the invalid combination can no longer be expressed. For the no-filter cases, switch call sites to DpaSearchConfig::default() rather than spelling out both false fields.Source: Path instructions
crates/api-core/src/handlers/svpc.rs (1)
364-443: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winEliminate the per-iteration clone of the entire snapshot vector.
get_dpa_by_macconsumes aVec<DpaInterface>, forcingdpa_snapshots.clone()on every observation (Line 373). For a report withmobservations againstninterfaces this turns an O(n·m) lookup into O(n·m) plusmfull-vector deep clones. Borrow the slice and clone only the single matched entry. The subsequentdpa.clone()at Line 432 is likewise redundant, asdpais not read after the call.♻️ Proposed change
- let mut dpa = match get_dpa_by_mac(&devinfo, dpa_snapshots.clone()) { + let mut dpa = match get_dpa_by_mac(&devinfo, &dpa_snapshots) {- match dpa_interface::update_card_state(&mut txn, dpa.clone()).await { + match dpa_interface::update_card_state(&mut txn, dpa).await {And adjust the helper to borrow:
-fn get_dpa_by_mac(devinfo: &MlxDeviceInfo, dpas: Vec<DpaInterface>) -> CarbideResult<DpaInterface> { - dpas.into_iter() - .find(|dpa| dpa.mac_address.to_string() == devinfo.base_mac) +fn get_dpa_by_mac(devinfo: &MlxDeviceInfo, dpas: &[DpaInterface]) -> CarbideResult<DpaInterface> { + dpas.iter() + .find(|dpa| dpa.mac_address.to_string() == devinfo.base_mac) + .cloned() .ok_or_else(|| CarbideError::NotFoundError { kind: "mac_addr", id: devinfo.base_mac.to_string(), }) }As per coding guidelines: "Avoid needless clones. Seeing
.clone()frequently indicates the ownership model may need rethinking."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/svpc.rs` around lines 364 - 443, The observation loop in process_mlx_observation is doing unnecessary full-vector and per-item clones because get_dpa_by_mac takes ownership of dpa_snapshots and update_card_state is called with dpa.clone(). Update get_dpa_by_mac to borrow the snapshot list instead of consuming a Vec, then pass the borrowed slice from the loop so only the matched DpaInterface is cloned if needed. Also remove the redundant dpa.clone() before calling dpa_interface::update_card_state, since dpa is not used afterward.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-core/src/handlers/astra.rs`:
- Around line 315-317: The `handle_astra` path in `astra.rs` is panicking on
untrusted input by calling `MacAddress::from_str(&obs.mac_address).unwrap()`
inside the `dpa_interfaces.iter().find(...)` lookup. Parse `obs.mac_address`
once before the search, handle the parse failure gracefully, and skip the
observation/continue processing when the MAC string is invalid instead of
aborting the request thread.
- Around line 223-352: The SPX status observation write is lost because the
transaction started in handle_dpa_message is never committed after
db::machine::update_spx_status_observation succeeds. After the update call,
explicitly commit the txn before returning Ok(()) so the changes persist, and
keep the existing error handling around begin/update paths intact. Use the
handle_dpa_message flow and update_spx_status_observation call as the key spots
to verify the transaction is finalized.
---
Nitpick comments:
In `@crates/api-core/src/handlers/astra.rs`:
- Around line 226-311: Update the diagnostics in process_astra_config_status so
the log messages use the correct handler name instead of the copied
handle_dpa_message prefix, remove the stale pf_info comment, and fix the
find_by_vni wording to match db::spx_partition::find_by. While touching the
tracing calls around ConfigVersion::from_str and the spx_partition lookup,
switch to structured tracing fields (for example include vni, obs, and error as
attributes) instead of interpolating values into the message text.
- Around line 82-88: The transaction start failure in handle_dpa_message is
being swallowed by logging and returning Ok(None), which hides an infrastructure
error as a normal “no result” case. Update the database_connection.begin() match
so the Err path propagates the error instead of converting it to None, keeping
behavior consistent with the existing ? usage in the same handler and preserving
the distinction between “absent config” and “failed to begin txn”.
In `@crates/api-core/src/handlers/svpc.rs`:
- Around line 364-443: The observation loop in process_mlx_observation is doing
unnecessary full-vector and per-item clones because get_dpa_by_mac takes
ownership of dpa_snapshots and update_card_state is called with dpa.clone().
Update get_dpa_by_mac to borrow the snapshot list instead of consuming a Vec,
then pass the borrowed slice from the loop so only the matched DpaInterface is
cloned if needed. Also remove the redundant dpa.clone() before calling
dpa_interface::update_card_state, since dpa is not used afterward.
In `@crates/api-core/src/instance/mod.rs`:
- Around line 705-711: The `DpaSearchConfig` in the machine-id loop is constant
and should not be rebuilt on every iteration. Hoist the config construction out
of the `for mid in &machine_ids` loop in `instance/mod.rs`, and prefer using
`DpaSearchConfig::default()` since the values match the derived default. Keep
the loop body using the shared config when calling
`db::dpa_interface::find_by_machine_id`.
In `@crates/api-model/src/dpa_interface/mod.rs`:
- Around line 79-83: Replace the two-boolean DpaSearchConfig in DpaSearchConfig
with a single enum-based filter so the mutually exclusive state is
compiler-enforced instead of checked at runtime. Update the DpaSearchConfig
definition and any callers of db::dpa_interface::find_by_machine_id to use the
new enum (e.g. All, OnlySvpc, OnlyAstra), and remove the defensive “both true”
runtime validation once the invalid combination can no longer be expressed. For
the no-filter cases, switch call sites to DpaSearchConfig::default() rather than
spelling out both false fields.
In `@crates/rpc/proto/forge.proto`:
- Around line 7529-7553: `AstraAttachment` and `AstraAttachmentStatus` use
signed `int32` for `vni` and `subnet_mask`, which is inconsistent with the rest
of `forge.proto` and the unsigned nature of these values. Update the
`AstraAttachment` and `AstraAttachmentStatus` message fields to use `uint32` for
`vni` and `subnet_mask`, keeping the existing field numbers and names unchanged
so the new messages remain wire-safe and match the other VNI fields in this
proto.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ae9d3091-e3cf-422a-8b3b-de8e1cb3fb88
📒 Files selected for processing (29)
crates/agent/src/ethernet_virtualization.rscrates/agent/src/main_loop.rscrates/agent/src/tests/full.rscrates/api-core/src/api.rscrates/api-core/src/handlers/astra.rscrates/api-core/src/handlers/dpa.rscrates/api-core/src/handlers/dpu.rscrates/api-core/src/handlers/instance.rscrates/api-core/src/handlers/machine_scout.rscrates/api-core/src/handlers/mod.rscrates/api-core/src/handlers/svpc.rscrates/api-core/src/instance/mod.rscrates/api-core/src/tests/common/api_fixtures/mod.rscrates/api-core/src/tests/dpa_interfaces.rscrates/api-core/src/tests/dpu_agent_upgrade.rscrates/api-core/src/tests/dpu_info_list.rscrates/api-core/src/tests/machine_network.rscrates/api-core/src/tests/network_security_group.rscrates/api-db/src/dpa_interface.rscrates/api-model/src/dpa_interface/mod.rscrates/api-model/src/instance/config/spx.rscrates/dpa-manager/src/card_handler/svpc.rscrates/dpa-manager/src/lib.rscrates/machine-a-tron/src/api_client.rscrates/machine-controller/src/io.rscrates/rpc/build.rscrates/rpc/proto/forge.protocrates/rpc/src/model/machine/network.rscrates/test-harness/src/machine_dpu.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/api-core/src/handlers/svpc.rs (1)
467-471: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winPropagate
update_card_statefailures instead of committing partial success.If this DB update fails, the handler logs the error, commits, and returns success, so Scout can drop a report that never advanced the DPA card state.
Proposed fix
- match dpa_interface::update_card_state(&mut txn, dpa).await { - Ok(_id) => (), - Err(e) => { - tracing::error!("process_mlx_observation update_card_state error: {e}"); - } - } + dpa_interface::update_card_state(&mut txn, dpa) + .await + .map_err(|e| { + tracing::error!(error = %e, "process_mlx_observation update_card_state failed"); + e + })?;As per path instructions, review
crates/api*/**changes for “transaction safety” and “SQLx/query correctness.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/svpc.rs` around lines 467 - 471, The transaction in process_mlx_observation is swallowing failures from dpa_interface::update_card_state, which lets the handler commit and return success after a failed state update. Change the match in process_mlx_observation to propagate the error instead of only logging it, so the surrounding transaction aborts and the caller sees the failure. Use the existing update_card_state call site and the transaction/return path in svpc.rs to ensure no partial success is committed when this DB update fails.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/api-core/src/handlers/svpc.rs`:
- Around line 467-471: The transaction in process_mlx_observation is swallowing
failures from dpa_interface::update_card_state, which lets the handler commit
and return success after a failed state update. Change the match in
process_mlx_observation to propagate the error instead of only logging it, so
the surrounding transaction aborts and the caller sees the failure. Use the
existing update_card_state call site and the transaction/return path in svpc.rs
to ensure no partial success is committed when this DB update fails.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 97537660-1bf4-44a7-9111-c59938a46e35
📒 Files selected for processing (3)
crates/api-core/src/handlers/astra.rscrates/api-core/src/handlers/svpc.rscrates/api-core/src/instance/mod.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/api-core/src/instance/mod.rs
- crates/api-core/src/handlers/astra.rs
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Carbide side changes for Astra support. We still need to make changes in Forge DPU agent, and DPF changes
to ingest Astra NICs.
Related issues
Type of Change
Breaking Changes
Testing
Additional Notes