
fix: use instance ID for instance power requests #2948

Open
osu wants to merge 6 commits into NVIDIA:main from osu:issue-931-instance-power

osu (Member) commented Jun 28, 2026

Description

Remove the deprecated machine_id input from InvokeInstancePower and require callers to identify the tenant instance directly.

This change:

  • reserves the removed protobuf field name and number;
  • resolves power requests solely through instance_id in API Core;
  • updates the admin CLI and DPU reprovision helper;
  • carries instance_id through REST API, control-plane workflow, and site workflow paths; and
  • updates mocks/tests and regenerates the checked-in REST protobuf outputs.

Related issues

Closes #931

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

InvokeInstancePower clients must now populate instance_id. The deprecated machine_id field has been removed and its protobuf name/number are reserved.
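
For callers migrating to the new request shape, a minimal shell sketch of the payload might look like the following. The field names mirror the grpcurl call used in dev/bin/reprovision_dpu.sh; the helper name build_power_request is hypothetical and not part of the repository, and API_SERVER is assumed to be set by the caller.

```shell
# Hypothetical helper (not in the repository): prints the JSON body for
# InvokeInstancePower, identifying the target only by instance_id.
build_power_request() {
    local instance_id="$1"
    printf '{"operation": 0, "instance_id": {"value": "%s"}, "apply_updates_on_reboot": true}\n' "$instance_id"
}

# Example invocation, left commented out because it needs a live API server:
#   grpcurl -d "$(build_power_request "$INSTANCE_ID")" -insecure "$API_SERVER" forge.Forge/InvokeInstancePower
```

A request built without an instance ID would be rejected by Core, so callers may want to validate the ID before sending.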

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Passed locally:

  • cargo test -p nico-admin-cli --locked
  • DATABASE_URL=postgresql://postgres:postgres@127.0.0.1:30432/nicotest cargo test -p carbide-api-core --no-default-features --locked instance_ipxe_behaviors
  • cargo clippy -p carbide-api-core -p nico-admin-cli --no-default-features --all-targets --locked -- -D warnings
  • cargo fmt --all -- --check (with the repository-pinned nightly toolchain)
  • protolint lint -config_path=.protolint.yaml crates/rpc/proto/
  • buf breaking crates/rpc/proto --against 'https://github.com/NVIDIA/infra-controller.git#branch=main,subdir=crates/rpc/proto'
  • cargo make --no-workspace check-rest-core-proto-sync
  • make test-api
  • make test-workflow
  • make test-site-workflow
  • make test-flow
  • focused Go tests for the affected workflow, site-workflow, Flow, and DPU reprovision packages

Additional Notes

The REST generated snapshots on main were already stale for the Site Explorer change from #2591 and the DHCP lease status enum change from #2877. The first commit is the deterministic generated baseline produced from clean main; the second commit applies #931 and regenerates the same outputs. All protobuf-generated files were produced by cargo make --no-workspace generate-rest-core-proto, not edited by hand.

The final commit adds the single carbide_site_explorer_last_run_status row emitted by the Core metrics-doc generator; that generated-doc drift also predates this change.

osu added 2 commits June 27, 2026 18:23
Signed-off-by: Hasan Khan <hasank@nvidia.com>
Signed-off-by: Hasan Khan <hasank@nvidia.com>
@osu osu requested a review from a team as a code owner June 28, 2026 02:03
coderabbitai Bot (Contributor) commented Jun 28, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

machine_id is removed from reboot and power request handling across proto, API, workflows, CLI, shell, and tests, leaving instance_id as the required identifier. Separately, a new GetSiteExplorerLastRun RPC and related telemetry messages are added, along with a metrics documentation entry.

Changes

InstancePowerRequest: machine_id removal

| Layer / File(s) | Summary |
| --- | --- |
| Proto contract: reserve machine_id, require instance_id<br>crates/rpc/proto/forge.proto, rest-api/flow/internal/nicoapi/nicoproto/nico.proto | Both proto copies reserve field number 1 and the name machine_id in InstancePowerRequest. |
| Request handling: use instance_id end to end<br>crates/api-core/src/handlers/instance.rs, rest-api/site-workflow/pkg/activity/instance.go, rest-api/workflow/pkg/workflow/instance/reboot.go, rest-api/site-workflow/pkg/workflow/instance.go, rest-api/api/pkg/api/handler/instance.go, crates/admin-cli/src/instance/reboot/cmd.rs, dev/bin/reprovision_dpu.sh | Power and reboot handlers now read or populate InstanceId and no longer use MachineId for reboot requests. |
| Tests and stubs: align request shape and validation<br>crates/api-core/src/tests/..., rest-api/site-workflow/pkg/activity/instance_test.go, rest-api/site-workflow/pkg/grpc/server/nico_test_server.go, rest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go, rest-api/site-workflow/pkg/workflow/instance_test.go, rest-api/workflow/pkg/workflow/instance/reboot_test.go | Instance power call sites in tests drop machine_id, the test server resolves requests by instance map, reboot workflow tests use InstanceId, and a new test asserts InstanceId is required. |

Site Explorer last-run RPC and telemetry

| Layer / File(s) | Summary |
| --- | --- |
| Proto additions: last-run RPC and telemetry messages<br>rest-api/flow/internal/nicoapi/nicoproto/nico.proto, rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto | GetSiteExplorerLastRun is added to the Forge service, and SiteExplorerLastRun, SiteExplorerLastRunResponse, and last_run on SiteExplorationReport are added. |
| Observability metric entry<br>docs/observability/core_metrics.md | The observability metrics table gains a row for carbide_site_explorer_last_run_status. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

high risk

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Out of Scope Changes check | ⚠️ Warning | Unrelated Site Explorer RPC/proto additions and a metrics doc row were changed alongside the instance power work. | Split the unrelated Site Explorer and metrics updates into a separate PR, or explain why they are required for the instance_id migration. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is concise and accurately summarizes the primary change: switching instance power requests to instance ID. |
| Linked Issues check | ✅ Passed | The changes align with #931 by routing reboot/power requests through instance_id and removing machine_id from nico and nico-rest. |
| Description check | ✅ Passed | The description accurately matches the changeset, covering removal of machine_id and the instance_id-based request flow across services, tests, and generated protobufs. |

Comment @coderabbitai help to get the list of available commands.

@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-28 02:06:26 UTC | Commit: 8df6154

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rest-api/site-workflow/pkg/grpc/server/nico_test_server.go (1)

461-471: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject a missing InstanceId as InvalidArgument in the fake server.

req.GetInstanceId().GetValue() collapses an omitted ID to "", so this test double now returns NotFound("") where the real Core handler rejects the request as InvalidArgument. That mismatch makes the site-workflow tests exercise the wrong error path for malformed requests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/grpc/server/nico_test_server.go` around lines
461 - 471, In nico_test_server’s request handling, a missing InstanceId is currently
collapsed to an empty string and treated as NotFound, which diverges from the
real Core handler. Update the logic around req.GetInstanceId().GetValue() and
the instance lookup so the fake server returns InvalidArgument for an
omitted/empty InstanceId before checking f.ins, while keeping the existing
POWER_RESET and invalid-operation behavior unchanged.
🧹 Nitpick comments (1)
crates/admin-cli/src/instance/reboot/cmd.rs (1)

24-33: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add command-level context to the RPC failure.

This now bubbles the raw gRPC error directly, so after removing the preflight lookup the operator loses the breadcrumb about which reboot request failed. Wrap the await with context such as "while attempting to request reboot for instance ..." so the CLI error stays actionable. As per path instructions, review CLI changes for actionable operator-facing error messages.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/admin-cli/src/instance/reboot/cmd.rs` around lines 24 - 33, The reboot
command in handle_reboot now returns the raw gRPC error without operator
context, so add a contextual wrapper around the await on
api_client.0.invoke_instance_power that mentions the reboot request and the
target instance identifier from args.instance. Keep the existing call flow, but
ensure the error surfaced by CarbideCliResult includes actionable wording like
“while attempting to request reboot for instance …” so the CLI remains clear
when the RPC fails.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dev/bin/reprovision_dpu.sh`:
- Around line 67-71: The instance-present check in reprovision_dpu.sh is
inconsistent: the reboot branch in the INSTANCE_ID block treats an empty
INSTANCE_ID as missing, but later logic still only checks for "null". Update the
predicate used around the InvokeInstancePower path and the later
instance-configured path so they both treat empty strings and "null" the same,
keeping the script’s has-instance behavior consistent.

In `@rest-api/site-workflow/pkg/activity/instance_test.go`:
- Around line 350-369: The new reboot instance test cases only assert wantErr,
so they do not verify that the local request validation in RebootInstanceOnSite
is what failed. Update the missing-request and missing-Instance-ID rows in
instance_test.go to assert the Temporal application error type/message returned
for invalid input, using the existing request-validation path and the
cwssaws.InstancePowerRequest/InstanceId guard so the tests fail if the call
instead regresses into a transport or mock error.

In `@rest-api/site-workflow/pkg/workflow/instance_test.go`:
- Around line 454-456: The activity expectation is too loose because it uses
mock.Anything instead of the exact request object built in the test. Update the
expectations in instance_test.go to match the request variable passed to the
workflow so the InstanceId contract is actually verified, using the same
cwssaws.InstancePowerRequest instance created near the test setup and the
related activity expectation sites in the test cases.
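
One way to keep the two instance checks in dev/bin/reprovision_dpu.sh consistent is a shared predicate. The sketch below is illustrative only: has_instance is a hypothetical helper name that does not exist in the script, and it assumes INSTANCE_ID is either empty, the literal string "null", or a real ID.

```shell
# Hypothetical helper (not in the script): succeeds only when the given
# value is a usable instance ID, treating "" and "null" as "no instance".
has_instance() {
    local id="${1:-}"
    [ -n "$id" ] && [ "$id" != "null" ]
}

# Possible use at both call sites:
#   if has_instance "$INSTANCE_ID"; then ...; else ...; fi
```

Routing both the reboot branch and the later instance-configured path through one predicate means the two cannot drift apart again.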


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ef018dc3-3443-407b-9dbd-d8dea40af6bf

📥 Commits

Reviewing files that changed from the base of the PR and between 87a5337 and 8df6154.

⛔ Files ignored due to path filters (8)
  • rest-api/flow/internal/nicoapi/gen/nico.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
  • rest-api/flow/internal/nicoapi/gen/nico_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go, !rest-api/**/*_grpc.pb.go
  • rest-api/flow/internal/nicoapi/gen/site_explorer.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico_grpc.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go, !rest-api/**/*_grpc.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/site_explorer_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/site-agent/workflows/v1/nico_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
  • rest-api/workflow-schema/site-agent/workflows/v1/site_explorer_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
📒 Files selected for processing (17)
  • crates/admin-cli/src/instance/reboot/cmd.rs
  • crates/api-core/src/handlers/instance.rs
  • crates/api-core/src/tests/dpu_reprovisioning.rs
  • crates/api-core/src/tests/host_bmc_firmware_test.rs
  • crates/api-core/src/tests/instance_ipxe_behaviors.rs
  • crates/rpc/proto/forge.proto
  • dev/bin/reprovision_dpu.sh
  • rest-api/api/pkg/api/handler/instance.go
  • rest-api/flow/internal/nicoapi/nicoproto/nico.proto
  • rest-api/flow/internal/nicoapi/nicoproto/site_explorer.proto
  • rest-api/site-workflow/pkg/activity/instance.go
  • rest-api/site-workflow/pkg/activity/instance_test.go
  • rest-api/site-workflow/pkg/grpc/server/nico_test_server.go
  • rest-api/site-workflow/pkg/workflow/instance.go
  • rest-api/site-workflow/pkg/workflow/instance_test.go
  • rest-api/workflow/pkg/workflow/instance/reboot.go
  • rest-api/workflow/pkg/workflow/instance/reboot_test.go
💤 Files with no reviewable changes (2)
  • crates/api-core/src/tests/dpu_reprovisioning.rs
  • crates/api-core/src/tests/host_bmc_firmware_test.rs

Comment thread dev/bin/reprovision_dpu.sh
Comment thread rest-api/site-workflow/pkg/activity/instance_test.go Outdated
Comment thread rest-api/site-workflow/pkg/workflow/instance_test.go
github-actions Bot commented Jun 28, 2026

🔍 Container Scan Summary

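| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 285 | 6 | 25 | 103 | 7 | 144 |
| machine-validation-runner | 748 | 32 | 187 | 272 | 36 | 221 |
| machine_validation | 748 | 32 | 187 | 272 | 36 | 221 |
| machine_validation-aarch64 | 748 | 32 | 187 | 272 | 36 | 221 |
| nvmetal-carbide | 748 | 30 | 189 | 272 | 36 | 221 |
| TOTAL | 3283 | 132 | 775 | 1197 | 151 | 1028 |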

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai Bot (Contributor) left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dev/bin/reprovision_dpu.sh (1)

67-72: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Fail fast when InvokeInstancePower is rejected.

This RPC now gates the instance-backed reprovision path, but its exit status is ignored. Without strict mode, a failed grpcurl here leaves the script continuing as if tenant approval succeeded, and the later wait loop just times out.

As per path instructions, review shell scripts for "strict-mode assumptions" and "error propagation".

Proposed fix
 if [[ -n "$INSTANCE_ID" && "$INSTANCE_ID" != "null" ]]; then
-	echo "Sending reboot message with apply_updates_on_reboot true".
-	grpcurl -d "{\"operation\": 0, \"instance_id\": { \"value\": \"$INSTANCE_ID\" }, \"apply_updates_on_reboot\": true}" -insecure "${API_SERVER}" forge.Forge/InvokeInstancePower
+	echo "Sending reboot message with apply_updates_on_reboot true."
+	if ! grpcurl -d "{\"operation\": 0, \"instance_id\": { \"value\": \"$INSTANCE_ID\" }, \"apply_updates_on_reboot\": true}" -insecure "${API_SERVER}" forge.Forge/InvokeInstancePower; then
+		echo "Failed to approve tenant reboot for instance $INSTANCE_ID."
+		exit 1
+	fi
 else
 	echo "No instance found; skipping tenant reboot approval."
 fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dev/bin/reprovision_dpu.sh` around lines 67 - 72, The InvokeInstancePower RPC
call in the reprovision path is not checked, so a rejected grpcurl still lets
the script continue. Update the reprovision flow around the INSTANCE_ID branch
in the shell script to propagate failures immediately by enabling strict error
handling or explicitly checking the grpcurl exit status and exiting on failure,
so the later wait loop is never reached after a failed tenant approval.

Source: Path instructions
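
The explicit-check option can also be factored into a small wrapper so the failure message and exit status stay in one place. This is a sketch under assumptions: approve_reboot is a hypothetical function name, and the wrapped command stands in for the grpcurl call in the script.

```shell
# Hypothetical wrapper (not in the script): runs the given command and, if it
# fails, prints a contextual error to stderr and returns non-zero.
approve_reboot() {
    local instance_id="$1"
    shift
    if ! "$@"; then
        echo "Failed to approve tenant reboot for instance $instance_id." >&2
        return 1
    fi
}

# Possible use in the reprovision path, exiting before the wait loop:
#   approve_reboot "$INSTANCE_ID" grpcurl -d "$payload" -insecure "$API_SERVER" forge.Forge/InvokeInstancePower || exit 1
```

An explicit check like this behaves the same whether or not the script later adopts strict mode, which may make it the lower-risk of the two options.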


ℹ️ Review info
⚙️ Run configuration

Run ID: a2eaa069-6515-4393-92b2-0a91e112e4f5

📥 Commits

Reviewing files that changed from the base of the PR and between 18c7dd5 and b8e9eaa.

📒 Files selected for processing (3)
  • dev/bin/reprovision_dpu.sh
  • rest-api/site-workflow/pkg/activity/instance_test.go
  • rest-api/site-workflow/pkg/workflow/instance_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • rest-api/site-workflow/pkg/workflow/instance_test.go
  • rest-api/site-workflow/pkg/activity/instance_test.go

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai Bot (Contributor) left a comment


🧹 Nitpick comments (1)
rest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go (1)

83-94: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use testify assertions here.

These hand-rolled t.Fatalf/t.Errorf checks diverge from the repo’s Go test convention and make failures less consistent than the surrounding suite. As per coding guidelines, rest-api/**/*.go: "Use testify (assert/require) for test assertions."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go` around lines
83 - 94, The test loop in nico_test_server_test.go uses manual t.Fatalf and
t.Errorf checks instead of the repo-standard testify assertions. Update the
InvokeInstancePower test body to use require/assert from testify for the
status.Code, status.Convert(err).Message(), and result nil checks, keeping the
same expectations while matching the surrounding Go test convention.

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Run ID: 690bfcf1-8161-4066-a88e-9d5cac612fe8

📥 Commits

Reviewing files that changed from the base of the PR and between b8e9eaa and e1420fe.

📒 Files selected for processing (4)
  • crates/admin-cli/src/instance/reboot/cmd.rs
  • dev/bin/reprovision_dpu.sh
  • rest-api/site-workflow/pkg/grpc/server/nico_test_server.go
  • rest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • rest-api/site-workflow/pkg/grpc/server/nico_test_server.go
  • dev/bin/reprovision_dpu.sh

Signed-off-by: Hasan Khan <hasank@nvidia.com>
Development

Successfully merging this pull request may close these issues.

InvokeInstancePower takes machine_id as argument
