Skip to content

feat: support retaining a BMC IP address allocation#2955

Open
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2952
Open

feat: support retaining a BMC IP address allocation#2955
chet wants to merge 1 commit into
NVIDIA:mainfrom
chet:gh-issue-2952

Conversation

@chet

@chet chet commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Following #2920 (the machine snapshot now reads a BMC's IP live from the interface), give operators explicit control over how a BMC's IP is assigned -- and make the default keep it. A BMC that picks up a dynamic address now retains it as a static reservation, so it survives DHCP lease expiry instead of churning. BMC IP churn has been a recurring source of trouble -- DPF re-registration, snapshot/topology mismatch, hosts the state machine can't place -- so retaining by default removes it at the root.

A new bmc_ip_allocation on ExpectedMachine selects the behavior: auto (the default) keeps a configured bmc_ip_address fixed and otherwise retains an auto-allocated one; fixed requires an address; retained pins an auto-allocated address; dynamic is the opt-out that leaves the lease expirable. A retained address is recorded as a static allocation, which lease expiry already leaves alone.

  • Add the bmc_ip_allocation field end to end (proto, model, migration, conversions) plus a --bmc-ip-allocation flag on expected-machine add/update/patch
  • Pin a retained BMC's auto-allocated address as static during site-explorer reconciliation, so lease expiry can't reap it
  • Validate the combinations -- fixed needs an address; dynamic and retained reject one

Tests added!

This supports #2952


REST proto regeneration. Adding bmc_ip_allocation to the Core protos trips Check REST Core Proto Sync, which only runs on PRs that touch core .proto files. Regenerating the REST mirror (make -C rest-api core-proto, make -C rest-api/flow gen-nicoapi-pb) also carries in two syncs that are already present in Core but had not yet been mirrored. They ride along because the gate requires the regenerated tree to be fully clean -- they are not part of this change:

@chet chet requested a review from a team as a code owner June 28, 2026 10:06
@chet

chet commented Jun 28, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai PTAL, thanks!

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

@chet I'll review the PR now.

ᘛ⁐̤ᕐᐷ

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 97213ab1-84bc-4058-9be2-281c82590cdf

📥 Commits

Reviewing files that changed from the base of the PR and between 7a7ccaf and c5ecb54.

⛔ Files ignored due to path filters (8)
  • rest-api/flow/internal/nicoapi/gen/nico.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
  • rest-api/flow/internal/nicoapi/gen/nico_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go, !rest-api/**/*_grpc.pb.go
  • rest-api/flow/internal/nicoapi/gen/site_explorer.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico_grpc.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go, !rest-api/**/*_grpc.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/site_explorer_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/site-agent/workflows/v1/nico_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
  • rest-api/workflow-schema/site-agent/workflows/v1/site_explorer_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
📒 Files selected for processing (25)
  • crates/admin-cli/src/expected_machines/add/args.rs
  • crates/admin-cli/src/expected_machines/common.rs
  • crates/admin-cli/src/expected_machines/patch/args.rs
  • crates/admin-cli/src/expected_machines/patch/mod.rs
  • crates/admin-cli/src/expected_machines/tests.rs
  • crates/admin-cli/src/expected_machines/update/mod.rs
  • crates/admin-cli/src/rpc.rs
  • crates/api-core/src/handlers/expected_machine.rs
  • crates/api-core/src/tests/expected_machine.rs
  • crates/api-core/src/tests/finder.rs
  • crates/api-core/src/tests/sku.rs
  • crates/api-db/migrations/20260628075228_expected_machines_bmc_ip_allocation.sql
  • crates/api-db/src/expected_machine.rs
  • crates/api-db/src/expected_machine/tests.rs
  • crates/api-db/src/machine_interface.rs
  • crates/api-db/src/machine_interface/tests.rs
  • crates/api-model/src/expected_machine.rs
  • crates/machine-a-tron/src/api_client.rs
  • crates/rpc/build.rs
  • crates/rpc/proto/forge.proto
  • crates/rpc/src/model/expected_machine.rs
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs
  • rest-api/flow/internal/nicoapi/nicoproto/nico.proto
  • rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto
🚧 Files skipped from review as they are similar to previous changes (24)
  • crates/api-db/migrations/20260628075228_expected_machines_bmc_ip_allocation.sql
  • crates/rpc/build.rs
  • crates/api-core/src/tests/sku.rs
  • rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto
  • crates/rpc/proto/forge.proto
  • crates/api-db/src/expected_machine/tests.rs
  • crates/admin-cli/src/expected_machines/patch/mod.rs
  • crates/api-core/src/tests/expected_machine.rs
  • crates/api-db/src/expected_machine.rs
  • crates/admin-cli/src/expected_machines/common.rs
  • crates/admin-cli/src/expected_machines/patch/args.rs
  • crates/admin-cli/src/expected_machines/tests.rs
  • crates/api-core/src/handlers/expected_machine.rs
  • crates/admin-cli/src/expected_machines/update/mod.rs
  • crates/admin-cli/src/rpc.rs
  • crates/machine-a-tron/src/api_client.rs
  • crates/api-model/src/expected_machine.rs
  • rest-api/flow/internal/nicoapi/nicoproto/nico.proto
  • crates/api-db/src/machine_interface/tests.rs
  • crates/admin-cli/src/expected_machines/add/args.rs
  • crates/api-db/src/machine_interface.rs
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs
  • crates/rpc/src/model/expected_machine.rs

Summary by CodeRabbit

  • New Features

    • Added per-host BMC IP allocation options to host management commands and API flows, including support for auto, dynamic, fixed, and retained behavior.
    • BMC IP settings now persist through add, patch, update, and import workflows.
    • Added site explorer “last run” information to show the most recent exploration status and summary.
  • Bug Fixes

    • Improved BMC IP handling so retained addresses are kept static and classified correctly during reconciliation and lookup.

Walkthrough

Adds per-host BMC IP allocation support across the CLI, RPC, model, database, API, and site-explorer retention flow, and adds a new gRPC endpoint and schema for latest Site Explorer run metadata.

Changes

BmcIpAllocationType end-to-end

Layer / File(s) Summary
Proto, model, and conversions
crates/rpc/proto/forge.proto, crates/rpc/build.rs, crates/api-model/src/expected_machine.rs, crates/rpc/src/model/expected_machine.rs
Defines BmcIpAllocationType, adds it to ExpectedMachineData, and updates RPC/model conversions and validation helpers for optional emission and default handling.
Database persistence and retention
crates/api-db/migrations/*.sql, crates/api-db/src/expected_machine.rs, crates/api-db/src/machine_interface.rs, crates/api-db/src/*tests.rs
Adds the PostgreSQL enum and column, updates expected-machine insert and update SQL bindings, and adds BMC address retention by MAC with database coverage.
API validation and round-trip tests
crates/api-core/src/handlers/expected_machine.rs, crates/api-core/src/tests/expected_machine.rs, crates/api-core/src/tests/finder.rs, crates/api-core/src/tests/sku.rs
Validates bmc_ip_allocation against bmc_ip_address on insert and update, and extends API tests for explicit values, omitted values, and fixture defaults.
Site-explorer retention flow
crates/site-explorer/src/lib.rs, crates/site-explorer/tests/site_explorer.rs
Extends endpoint exploration to pin retained dynamic BMC leases and adds a helper that performs the retention work in its own transaction with warning-only failure handling.
Admin CLI and JSON plumbing
crates/admin-cli/src/expected_machines/add/args.rs, crates/admin-cli/src/expected_machines/patch/args.rs, crates/admin-cli/src/expected_machines/patch/mod.rs, crates/admin-cli/src/expected_machines/update/mod.rs, crates/admin-cli/src/expected_machines/common.rs, crates/admin-cli/src/rpc.rs, crates/admin-cli/src/expected_machines/tests.rs, crates/machine-a-tron/src/api_client.rs
Adds --bmc-ip-allocation to add and patch commands, forwards it through update and RPC request construction, extends JSON-backed expected-machine handling, and updates CLI parsing tests.

Site Explorer last-run API

Layer / File(s) Summary
Proto API and response schema
rest-api/flow/internal/nicoapi/nicoproto/nico.proto, rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto
Adds Forge.GetSiteExplorerLastRun, introduces SiteExplorerLastRun and SiteExplorerLastRunResponse, and extends SiteExplorationReport with last_run metadata.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: retaining BMC IP allocation behavior.
Description check ✅ Passed The description is clearly aligned with the changeset and explains the new BMC IP allocation modes and retention behavior.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 OpenGrep (1.23.0)
crates/admin-cli/src/expected_machines/common.rs

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.28][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

crates/admin-cli/src/expected_machines/patch/args.rs

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.18][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

crates/admin-cli/src/expected_machines/add/args.rs

┌──────────────┐
│ Opengrep CLI │
└──────────────┘

�[32m✔�[39m �[1mOpengrep OSS�[0m
�[32m✔�[39m Basic security coverage for first-party code vulnerabilities.

[00.45][ERROR]: unable to find a config; path .coderabbit-opengrep-fallback.yml does not exist

  • 18 others
🔧 ast-grep (0.44.0)
crates/rpc/build.rs

Comment @coderabbitai help to get the list of available commands.

@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-28 10:09:05 UTC | Commit: 6d8c5e4

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/admin-cli/src/expected_machines/update/mod.rs (1)

41-72: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Omitting bmc_ip_allocation in update JSON does not actually reset it.

ExpectedMachineJson::bmc_ip_allocation defaults to None, but patch_expected_machine() interprets None as “keep the stored value”. That makes expected-machine update inconsistent with both the comment above this call (“result matches the file”) and replace_all_expected_machines, where omission means server-default/Auto. Removing the field from an update file will silently leave an existing dynamic/fixed/retained setting in place. Either send Some(BmcIpAllocationType::Auto) for omitted update-file values or document that update files must spell out Auto explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/expected_machines/update/mod.rs` around lines 41 - 72,
The expected-machine update path is treating omitted bmc_ip_allocation as “keep
existing” via patch_expected_machine, which conflicts with the update-file
semantics in ExpectedMachineJson and the “result matches the file” behavior.
Update the expected_machines::update flow to translate a missing
bmc_ip_allocation in the parsed JSON into an explicit Auto value before calling
ctx.api_client.patch_expected_machine, or otherwise make the CLI/docs clearly
require Auto to be written out; keep the fix localized around
ExpectedMachineJson and patch_expected_machine call site.
crates/admin-cli/src/expected_machines/patch/args.rs (1)

227-238: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

validate() still rejects several valid single-field patches.

run() forwards default_pause_ingestion_and_poweron, bmc_retain_credentials, disable_lockdown, and metadata-only edits, but this guard does not count any of them. Commands that only change one of those fields still fail with the generic “one of the following options must be specified” error. Please derive this check from the actual patchable surface (or the same clap grouping) instead of maintaining a second manual list. As per path instructions, CLI changes should preserve clap behavior and actionable operator-facing error messages.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/expected_machines/patch/args.rs` around lines 227 - 238,
The validation guard in `validate()` is using a stale hand-maintained allowlist,
so it rejects valid single-field patches such as
`default_pause_ingestion_and_poweron`, `bmc_retain_credentials`,
`disable_lockdown`, and metadata-only edits forwarded by `run()`. Update the
check to derive “any patchable field set” from the actual patch surface or the
same clap grouping used for the CLI, rather than manually listing only the
current BMC/chassis fields. Keep the existing operator-facing error style, but
ensure the `validate()` gate matches every field that `run()` can legitimately
forward.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@crates/api-db/migrations/20260628075228_expected_machines_bmc_ip_allocation.sql`:
- Around line 8-9: The new bmc_ip_allocation column is added in the migration,
but existing expected_machines rows are not backfilled from retained BMC leases.
Update this migration to also convert preexisting rows that have a DHCP BMC
address while bmc_ip_address is unset so they persist as static rather than
staying at the default auto; use the migration itself to preserve data for
upgraded deployments, keeping it idempotent and aligned with expected_machines
and bmc_ip_allocation_t.

In `@crates/rpc/build.rs`:
- Around line 56-63: The serde derive for BmcIpAllocationType currently
serializes enum variants with their Rust casing, which will break JSON/CLI
consistency for ExpectedMachineJson.bmc_ip_allocation. Update the type
attributes for rpc::forge::BmcIpAllocationType in build.rs to add a serde rename
policy that emits snake_case names while keeping the existing clap::ValueEnum
derive intact. This change should be applied on the same enum attribute block so
the generated JSON matches the lowercase values used by the CLI.

In `@crates/rpc/src/model/expected_machine.rs`:
- Around line 276-289: The `bmc_ip_allocation` conversion in
`expected_machine.rs` is collapsing `None` into `BmcIpAllocationType::default()`
too early, which causes the update path in `handlers::expected_machine::update`
to overwrite existing non-default values when the field is omitted. Keep
`bmc_ip_allocation` as `Option<_>` through the model/handler boundary and only
apply the default when creating a new record or after merging with the existing
row in the update flow. Add a regression test around the `ExpectedMachine`
update path to verify an omitted `bmc_ip_allocation` preserves an existing
`dynamic` or `retained` value.

In `@crates/site-explorer/src/lib.rs`:
- Around line 1884-1892: The BMC allocation reconciliation only promotes DHCP
leases to static in the `retains_dynamic_ip(false)` branch, so switching a host
from retained/auto to `bmc_ip_allocation=dynamic` never undoes the previous
reservation. Update the reconciliation logic around
`expected_machine.data.bmc_ip_allocation` and `try_retain_bmc` to explicitly
clear or revert the BMC lease back to an expirable DHCP state when dynamic is
selected, and ensure the inverse path uses the existing BMC MAC-based
lookup/retention helpers. Add coverage in `site_explorer.rs` for an
`auto/retained -> dynamic` transition so the database row is reconciled back
correctly.

---

Outside diff comments:
In `@crates/admin-cli/src/expected_machines/patch/args.rs`:
- Around line 227-238: The validation guard in `validate()` is using a stale
hand-maintained allowlist, so it rejects valid single-field patches such as
`default_pause_ingestion_and_poweron`, `bmc_retain_credentials`,
`disable_lockdown`, and metadata-only edits forwarded by `run()`. Update the
check to derive “any patchable field set” from the actual patch surface or the
same clap grouping used for the CLI, rather than manually listing only the
current BMC/chassis fields. Keep the existing operator-facing error style, but
ensure the `validate()` gate matches every field that `run()` can legitimately
forward.

In `@crates/admin-cli/src/expected_machines/update/mod.rs`:
- Around line 41-72: The expected-machine update path is treating omitted
bmc_ip_allocation as “keep existing” via patch_expected_machine, which conflicts
with the update-file semantics in ExpectedMachineJson and the “result matches
the file” behavior. Update the expected_machines::update flow to translate a
missing bmc_ip_allocation in the parsed JSON into an explicit Auto value before
calling ctx.api_client.patch_expected_machine, or otherwise make the CLI/docs
clearly require Auto to be written out; keep the fix localized around
ExpectedMachineJson and patch_expected_machine call site.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f8be7e37-cae7-4fc4-8495-6b61f3ae86a4

📥 Commits

Reviewing files that changed from the base of the PR and between 28cf6eb and 6d8c5e4.

📒 Files selected for processing (22)
  • crates/admin-cli/src/expected_machines/add/args.rs
  • crates/admin-cli/src/expected_machines/common.rs
  • crates/admin-cli/src/expected_machines/patch/args.rs
  • crates/admin-cli/src/expected_machines/patch/mod.rs
  • crates/admin-cli/src/expected_machines/tests.rs
  • crates/admin-cli/src/expected_machines/update/mod.rs
  • crates/admin-cli/src/rpc.rs
  • crates/api-core/src/handlers/expected_machine.rs
  • crates/api-core/src/tests/expected_machine.rs
  • crates/api-core/src/tests/sku.rs
  • crates/api-db/migrations/20260628075228_expected_machines_bmc_ip_allocation.sql
  • crates/api-db/src/expected_machine.rs
  • crates/api-db/src/expected_machine/tests.rs
  • crates/api-db/src/machine_interface.rs
  • crates/api-db/src/machine_interface/tests.rs
  • crates/api-model/src/expected_machine.rs
  • crates/machine-a-tron/src/api_client.rs
  • crates/rpc/build.rs
  • crates/rpc/proto/forge.proto
  • crates/rpc/src/model/expected_machine.rs
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs

Comment on lines +8 to +9
ALTER TABLE expected_machines
ADD COLUMN bmc_ip_allocation bmc_ip_allocation_t NOT NULL DEFAULT 'auto';

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Backfill existing retained BMC leases in the migration.

Rows added here default to auto, but the migration never converts preexisting BMC dhcp address rows to static when bmc_ip_address is unset. On upgraded deployments, those machines remain vulnerable to lease expiry until site-explorer happens to reconcile them, which breaks the new default-retention behavior.

Proposed migration backfill
 ALTER TABLE expected_machines
   ADD COLUMN bmc_ip_allocation bmc_ip_allocation_t NOT NULL DEFAULT 'auto';
+
+UPDATE machine_interface_addresses mia
+SET allocation_type = 'static'
+FROM machine_interfaces mi
+JOIN expected_machines em
+  ON em.bmc_mac_address = mi.mac_address
+WHERE mia.interface_id = mi.id
+  AND mi.interface_type = 'Bmc'
+  AND mia.allocation_type = 'dhcp'
+  AND em.bmc_ip_address IS NULL
+  AND em.bmc_ip_allocation = 'auto';

As per path instructions, crates/api-db/migrations/**: review database migrations for forward/backward compatibility, idempotency assumptions, data preservation, and consistency with Rust and Go data models.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ALTER TABLE expected_machines
ADD COLUMN bmc_ip_allocation bmc_ip_allocation_t NOT NULL DEFAULT 'auto';
ALTER TABLE expected_machines
ADD COLUMN bmc_ip_allocation bmc_ip_allocation_t NOT NULL DEFAULT 'auto';
UPDATE machine_interface_addresses mia
SET allocation_type = 'static'
FROM machine_interfaces mi
JOIN expected_machines em
ON em.bmc_mac_address = mi.mac_address
WHERE mia.interface_id = mi.id
AND mi.interface_type = 'Bmc'
AND mia.allocation_type = 'dhcp'
AND em.bmc_ip_address IS NULL
AND em.bmc_ip_allocation = 'auto';
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@crates/api-db/migrations/20260628075228_expected_machines_bmc_ip_allocation.sql`
around lines 8 - 9, The new bmc_ip_allocation column is added in the migration,
but existing expected_machines rows are not backfilled from retained BMC leases.
Update this migration to also convert preexisting rows that have a DHCP BMC
address while bmc_ip_address is unset so they persist as static rather than
staying at the default auto; use the migration itself to preserve data for
upgraded deployments, keeping it idempotent and aligned with expected_machines
and bmc_ip_allocation_t.

Source: Path instructions

Comment thread crates/rpc/build.rs
Comment on lines +56 to +63
.type_attribute(
".forge.BmcIpAllocationType",
"#[cfg_attr(feature = \"cli\", derive(clap::ValueEnum))]",
)
.type_attribute(
".forge.BmcIpAllocationType",
"#[derive(serde::Serialize, serde::Deserialize)]",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Where the JSON model and new field are declared/used:"
rg -n -C2 'ExpectedMachineJson|bmc_ip_allocation' crates/admin-cli/src crates/machine-a-tron/src

echo
echo "Current generated attributes for BmcIpAllocationType:"
rg -n -C2 'BmcIpAllocationType|serde\(rename_all' crates/rpc/build.rs

Repository: NVIDIA/infra-controller

Length of output: 14925


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Enum definition and values:"
rg -n -C3 'enum BmcIpAllocationType|BmcIpAllocationType' crates proto . --glob '!target' --glob '!*.lock'

echo
echo "Any serde rename attributes already applied to this enum or its JSON wrapper:"
rg -n -C3 'rename_all|rename =|BmcIpAllocationType' crates/admin-cli/src crates/rpc/build.rs --glob '!target'

Repository: NVIDIA/infra-controller

Length of output: 50379


Use snake_case serde names for BmcIpAllocationType ExpectedMachineJson.bmc_ip_allocation serializes rpc::forge::BmcIpAllocationType directly, so this derive will emit Auto/Dynamic/Fixed/Retained in JSON unless the enum gets #[serde(rename_all = "snake_case")]. That keeps file-based CLI flows aligned with the lowercase values used on the command line.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/rpc/build.rs` around lines 56 - 63, The serde derive for
BmcIpAllocationType currently serializes enum variants with their Rust casing,
which will break JSON/CLI consistency for ExpectedMachineJson.bmc_ip_allocation.
Update the type attributes for rpc::forge::BmcIpAllocationType in build.rs to
add a serde rename policy that emits snake_case names while keeping the existing
clap::ValueEnum derive intact. This change should be applied on the same enum
attribute block so the generated JSON matches the lowercase values used by the
CLI.

Comment thread crates/rpc/src/model/expected_machine.rs
Comment on lines +1884 to +1892
} else if expected_machine
.data
.bmc_ip_allocation
.retains_dynamic_ip(false)
{
// No operator-specified BMC IP, but the host's bmc_ip_allocation
// retains its auto-allocated address: pin the BMC interface's
// DHCP lease as Static so it survives lease expiry.
try_retain_bmc(&self.database_connection, expected_machine.bmc_mac_address).await;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Dynamic mode cannot undo a previously retained BMC lease.

Once this path has promoted a DHCP BMC address to static, later changing the same host to bmc_ip_allocation=dynamic never reverts that row. retains_dynamic_ip(false) just stops calling try_retain_bmc, and retain_bmc_address_by_mac only performs dhcp -> static, so the old reservation survives indefinitely. That breaks the new mode contract (dynamic should leave the lease expirable) and makes allocation-mode changes sticky in the DB. Please add the inverse reconciliation path here as well, and cover an auto/retained -> dynamic transition in crates/site-explorer/tests/site_explorer.rs.

Also applies to: 3325-3349

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/lib.rs` around lines 1884 - 1892, The BMC allocation
reconciliation only promotes DHCP leases to static in the
`retains_dynamic_ip(false)` branch, so switching a host from retained/auto to
`bmc_ip_allocation=dynamic` never undoes the previous reservation. Update the
reconciliation logic around `expected_machine.data.bmc_ip_allocation` and
`try_retain_bmc` to explicitly clear or revert the BMC lease back to an
expirable DHCP state when dynamic is selected, and ensure the inverse path uses
the existing BMC MAC-based lookup/retention helpers. Add coverage in
`site_explorer.rs` for an `auto/retained -> dynamic` transition so the database
row is reconciled back correctly.

@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 285 6 25 103 7 144
machine-validation-runner 748 30 189 272 36 221
machine_validation 748 30 189 272 36 221
machine_validation-aarch64 748 30 189 272 36 221
nvmetal-carbide 748 30 189 272 36 221
TOTAL 3283 126 781 1197 151 1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@chet chet changed the title feat: control how a BMC's IP is allocated, retaining it by default feat: support retaining a BMC IP address allocation Jun 28, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto (1)

104-107: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Use google.protobuf.Timestamp for the run timestamps. started_at, finished_at, last_successful_finished_at, and last_failed_finished_at are instants; string leaves RFC 3339 parsing and validation to every client. Switch them to google.protobuf.Timestamp and add google/protobuf/timestamp.proto before this schema ships.

Suggested change
+import "google/protobuf/timestamp.proto";
...
-  string started_at = 1;
+  google.protobuf.Timestamp started_at = 1;
...
-  string finished_at = 2;
+  google.protobuf.Timestamp finished_at = 2;
...
-  optional string last_successful_finished_at = 9;
+  optional google.protobuf.Timestamp last_successful_finished_at = 9;
...
-  optional string last_failed_finished_at = 10;
+  optional google.protobuf.Timestamp last_failed_finished_at = 10;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto` around lines
104 - 107, The run timestamp fields in site_explorer.proto should use a proper
instant type instead of string. Update the schema for the run-related fields,
including started_at, finished_at, last_successful_finished_at, and
last_failed_finished_at, to use google.protobuf.Timestamp, and add the required
google/protobuf/timestamp.proto import so clients do not have to parse RFC 3339
manually.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto`:
- Around line 104-107: The run timestamp fields in site_explorer.proto should
use a proper instant type instead of string. Update the schema for the
run-related fields, including started_at, finished_at,
last_successful_finished_at, and last_failed_finished_at, to use
google.protobuf.Timestamp, and add the required google/protobuf/timestamp.proto
import so clients do not have to parse RFC 3339 manually.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9df0648-2347-4a1f-a467-ff7561bf48ff

📥 Commits

Reviewing files that changed from the base of the PR and between 6d8c5e4 and 7a7ccaf.

⛔ Files ignored due to path filters (8)
  • rest-api/flow/internal/nicoapi/gen/nico.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
  • rest-api/flow/internal/nicoapi/gen/nico_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go, !rest-api/**/*_grpc.pb.go
  • rest-api/flow/internal/nicoapi/gen/site_explorer.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico_grpc.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go, !rest-api/**/*_grpc.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/site_explorer_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/site-agent/workflows/v1/nico_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
  • rest-api/workflow-schema/site-agent/workflows/v1/site_explorer_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
📒 Files selected for processing (24)
  • crates/admin-cli/src/expected_machines/add/args.rs
  • crates/admin-cli/src/expected_machines/common.rs
  • crates/admin-cli/src/expected_machines/patch/args.rs
  • crates/admin-cli/src/expected_machines/patch/mod.rs
  • crates/admin-cli/src/expected_machines/tests.rs
  • crates/admin-cli/src/expected_machines/update/mod.rs
  • crates/admin-cli/src/rpc.rs
  • crates/api-core/src/handlers/expected_machine.rs
  • crates/api-core/src/tests/expected_machine.rs
  • crates/api-core/src/tests/sku.rs
  • crates/api-db/migrations/20260628075228_expected_machines_bmc_ip_allocation.sql
  • crates/api-db/src/expected_machine.rs
  • crates/api-db/src/expected_machine/tests.rs
  • crates/api-db/src/machine_interface.rs
  • crates/api-db/src/machine_interface/tests.rs
  • crates/api-model/src/expected_machine.rs
  • crates/machine-a-tron/src/api_client.rs
  • crates/rpc/build.rs
  • crates/rpc/proto/forge.proto
  • crates/rpc/src/model/expected_machine.rs
  • crates/site-explorer/src/lib.rs
  • crates/site-explorer/tests/site_explorer.rs
  • rest-api/flow/internal/nicoapi/nicoproto/nico.proto
  • rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto
✅ Files skipped from review due to trivial changes (1)
  • crates/api-core/src/tests/sku.rs
🚧 Files skipped from review as they are similar to previous changes (21)
  • crates/admin-cli/src/expected_machines/common.rs
  • crates/admin-cli/src/expected_machines/update/mod.rs
  • crates/api-db/src/expected_machine/tests.rs
  • crates/api-db/src/machine_interface/tests.rs
  • crates/machine-a-tron/src/api_client.rs
  • crates/admin-cli/src/expected_machines/tests.rs
  • crates/api-core/src/handlers/expected_machine.rs
  • crates/rpc/build.rs
  • crates/rpc/proto/forge.proto
  • crates/rpc/src/model/expected_machine.rs
  • crates/api-db/migrations/20260628075228_expected_machines_bmc_ip_allocation.sql
  • crates/admin-cli/src/expected_machines/patch/mod.rs
  • crates/api-core/src/tests/expected_machine.rs
  • crates/admin-cli/src/rpc.rs
  • crates/admin-cli/src/expected_machines/patch/args.rs
  • crates/site-explorer/tests/site_explorer.rs
  • crates/api-db/src/expected_machine.rs
  • crates/api-db/src/machine_interface.rs
  • crates/api-model/src/expected_machine.rs
  • crates/admin-cli/src/expected_machines/add/args.rs
  • crates/site-explorer/src/lib.rs

Following NVIDIA#2920 (the machine snapshot now reads a BMC's IP live from the
interface), give operators explicit control over how a BMC's IP is
assigned -- and make the default keep it. A BMC that picks up a dynamic
address now retains it as a static reservation, so it survives DHCP lease
expiry instead of churning. BMC IP churn has been a recurring source of
trouble -- DPF re-registration, snapshot/topology mismatch, hosts the
state machine can't place -- so retaining by default removes it at the root.

A new bmc_ip_allocation on ExpectedMachine selects the behavior: auto (the
default) keeps a configured bmc_ip_address fixed and otherwise retains an
auto-allocated one; fixed requires an address; retained pins an
auto-allocated address; dynamic is the opt-out that leaves the lease
expirable. A retained address is recorded as a static allocation, which
lease expiry already leaves alone.

- Add the bmc_ip_allocation field end to end (proto, model, migration,
  conversions) plus a --bmc-ip-allocation flag on expected-machine
  add/update/patch
- Pin a retained BMC's auto-allocated address as static during
  site-explorer reconciliation, so lease expiry can't reap it
- Validate the combinations -- fixed needs an address; dynamic and
  retained reject one

Tests added!

This supports NVIDIA#2952

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant