fix: use instance ID for instance power requests#2948
Conversation
Signed-off-by: Hasan Khan <hasank@nvidia.com>
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough
ChangesInstancePowerRequest: machine_id removal
Site Explorer last-run RPC and telemetry
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-06-28 02:06:26 UTC | Commit: 8df6154 |
Signed-off-by: Hasan Khan <hasank@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rest-api/site-workflow/pkg/grpc/server/nico_test_server.go (1)
461-471: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winReject a missing
InstanceIdasInvalidArgumentin the fake server.
req.GetInstanceId().GetValue()collapses an omitted ID to"", so this test double now returnsNotFound("")where the real Core handler rejects the request asInvalidArgument. That mismatch makes the site-workflow tests exercise the wrong error path for malformed requests.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rest-api/site-workflow/pkg/grpc/server/nico_test_server.go` around lines 461 - 471, In nico_test_server’s request handling, a missing InstanceId is currently collapsed to an empty string and treated as NotFound, which diverges from the real Core handler. Update the logic around req.GetInstanceId().GetValue() and the instance lookup so the fake server returns InvalidArgument for an omitted/empty InstanceId before checking f.ins, while keeping the existing POWER_RESET and invalid-operation behavior unchanged.
🧹 Nitpick comments (1)
crates/admin-cli/src/instance/reboot/cmd.rs (1)
24-33: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd command-level context to the RPC failure.
This now bubbles the raw gRPC error directly, so after removing the preflight lookup the operator loses the breadcrumb about which reboot request failed. Wrap the await with context such as
while attempting to request reboot for instance ...so the CLI error stays actionable. As per path instructions, review CLI changes for actionable operator-facing error messages.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/admin-cli/src/instance/reboot/cmd.rs` around lines 24 - 33, The reboot command in handle_reboot now returns the raw gRPC error without operator context, so add a contextual wrapper around the await on api_client.0.invoke_instance_power that mentions the reboot request and the target instance identifier from args.instance. Keep the existing call flow, but ensure the error surfaced by CarbideCliResult includes actionable wording like “while attempting to request reboot for instance …” so the CLI remains clear when the RPC fails.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dev/bin/reprovision_dpu.sh`:
- Around line 67-71: The instance-present check in reprovision_dpu.sh is
inconsistent: the reboot branch in the INSTANCE_ID block treats an empty
INSTANCE_ID as missing, but later logic still only checks for "null". Update the
predicate used around the InvokeInstancePower path and the later
instance-configured path so they both treat empty strings and "null" the same,
keeping the script’s has-instance behavior consistent.
In `@rest-api/site-workflow/pkg/activity/instance_test.go`:
- Around line 350-369: The new reboot instance test cases only assert wantErr,
so they do not verify that the local request validation in RebootInstanceOnSite
is what failed. Update the missing-request and missing-Instance-ID rows in
instance_test.go to assert the Temporal application error type/message returned
for invalid input, using the existing request-validation path and the
cwssaws.InstancePowerRequest/InstanceId guard so the tests fail if the call
instead regresses into a transport or mock error.
In `@rest-api/site-workflow/pkg/workflow/instance_test.go`:
- Around line 454-456: The activity expectation is too loose because it uses
mock.Anything instead of the exact request object built in the test. Update the
expectations in instance_test.go to match the request variable passed to the
workflow so the InstanceId contract is actually verified, using the same
cwssaws.InstancePowerRequest instance created near the test setup and the
related activity expectation sites in the test cases.
---
Outside diff comments:
In `@rest-api/site-workflow/pkg/grpc/server/nico_test_server.go`:
- Around line 461-471: In nico_test_server’s request handling, a missing
InstanceId is currently collapsed to an empty string and treated as NotFound,
which diverges from the real Core handler. Update the logic around
req.GetInstanceId().GetValue() and the instance lookup so the fake server
returns InvalidArgument for an omitted/empty InstanceId before checking f.ins,
while keeping the existing POWER_RESET and invalid-operation behavior unchanged.
---
Nitpick comments:
In `@crates/admin-cli/src/instance/reboot/cmd.rs`:
- Around line 24-33: The reboot command in handle_reboot now returns the raw
gRPC error without operator context, so add a contextual wrapper around the
await on api_client.0.invoke_instance_power that mentions the reboot request and
the target instance identifier from args.instance. Keep the existing call flow,
but ensure the error surfaced by CarbideCliResult includes actionable wording
like “while attempting to request reboot for instance …” so the CLI remains
clear when the RPC fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ef018dc3-3443-407b-9dbd-d8dea40af6bf
⛔ Files ignored due to path filters (8)
rest-api/flow/internal/nicoapi/gen/nico.pb.gois excluded by!**/*.pb.go,!**/gen/**,!rest-api/**/*.pb.gorest-api/flow/internal/nicoapi/gen/nico_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**,!rest-api/**/*.pb.go,!rest-api/**/*_grpc.pb.gorest-api/flow/internal/nicoapi/gen/site_explorer.pb.gois excluded by!**/*.pb.go,!**/gen/**,!rest-api/**/*.pb.gorest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.gois excluded by!**/*.pb.go,!rest-api/**/*.pb.gorest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico_grpc.pb.gois excluded by!**/*.pb.go,!rest-api/**/*.pb.go,!rest-api/**/*_grpc.pb.gorest-api/workflow-schema/schema/site-agent/workflows/v1/site_explorer_nico.pb.gois excluded by!**/*.pb.go,!rest-api/**/*.pb.gorest-api/workflow-schema/site-agent/workflows/v1/nico_nico.protois excluded by!rest-api/workflow-schema/site-agent/workflows/v1/*_nico.protorest-api/workflow-schema/site-agent/workflows/v1/site_explorer_nico.protois excluded by!rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
📒 Files selected for processing (17)
crates/admin-cli/src/instance/reboot/cmd.rscrates/api-core/src/handlers/instance.rscrates/api-core/src/tests/dpu_reprovisioning.rscrates/api-core/src/tests/host_bmc_firmware_test.rscrates/api-core/src/tests/instance_ipxe_behaviors.rscrates/rpc/proto/forge.protodev/bin/reprovision_dpu.shrest-api/api/pkg/api/handler/instance.gorest-api/flow/internal/nicoapi/nicoproto/nico.protorest-api/flow/internal/nicoapi/nicoproto/site_explorer.protorest-api/site-workflow/pkg/activity/instance.gorest-api/site-workflow/pkg/activity/instance_test.gorest-api/site-workflow/pkg/grpc/server/nico_test_server.gorest-api/site-workflow/pkg/workflow/instance.gorest-api/site-workflow/pkg/workflow/instance_test.gorest-api/workflow/pkg/workflow/instance/reboot.gorest-api/workflow/pkg/workflow/instance/reboot_test.go
💤 Files with no reviewable changes (2)
- crates/api-core/src/tests/dpu_reprovisioning.rs
- crates/api-core/src/tests/host_bmc_firmware_test.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Signed-off-by: Hasan Khan <hasank@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dev/bin/reprovision_dpu.sh (1)
67-72: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winFail fast when
InvokeInstancePoweris rejected.This RPC now gates the instance-backed reprovision path, but its exit status is ignored. Without strict mode, a failed
grpcurlhere leaves the script continuing as if tenant approval succeeded, and the later wait loop just times out.As per path instructions, review shell scripts for "strict-mode assumptions" and "error propagation".
Proposed fix
if [[ -n "$INSTANCE_ID" && "$INSTANCE_ID" != "null" ]]; then - echo "Sending reboot message with apply_updates_on_reboot true". - grpcurl -d "{\"operation\": 0, \"instance_id\": { \"value\": \"$INSTANCE_ID\" }, \"apply_updates_on_reboot\": true}" -insecure "${API_SERVER}" forge.Forge/InvokeInstancePower + echo "Sending reboot message with apply_updates_on_reboot true." + if ! grpcurl -d "{\"operation\": 0, \"instance_id\": { \"value\": \"$INSTANCE_ID\" }, \"apply_updates_on_reboot\": true}" -insecure "${API_SERVER}" forge.Forge/InvokeInstancePower; then + echo "Failed to approve tenant reboot for instance $INSTANCE_ID." + exit 1 + fi else echo "No instance found; skipping tenant reboot approval." fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dev/bin/reprovision_dpu.sh` around lines 67 - 72, The InvokeInstancePower RPC call in the reprovision path is not checked, so a rejected grpcurl still lets the script continue. Update the reprovision flow around the INSTANCE_ID branch in the shell script to propagate failures immediately by enabling strict error handling or explicitly checking the grpcurl exit status and exiting on failure, so the later wait loop is never reached after a failed tenant approval.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@dev/bin/reprovision_dpu.sh`:
- Around line 67-72: The InvokeInstancePower RPC call in the reprovision path is
not checked, so a rejected grpcurl still lets the script continue. Update the
reprovision flow around the INSTANCE_ID branch in the shell script to propagate
failures immediately by enabling strict error handling or explicitly checking
the grpcurl exit status and exiting on failure, so the later wait loop is never
reached after a failed tenant approval.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a2eaa069-6515-4393-92b2-0a91e112e4f5
📒 Files selected for processing (3)
dev/bin/reprovision_dpu.shrest-api/site-workflow/pkg/activity/instance_test.gorest-api/site-workflow/pkg/workflow/instance_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- rest-api/site-workflow/pkg/workflow/instance_test.go
- rest-api/site-workflow/pkg/activity/instance_test.go
Signed-off-by: Hasan Khan <hasank@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
rest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go (1)
83-94: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse
testifyassertions here.These hand-rolled
t.Fatalf/t.Errorfchecks diverge from the repo’s Go test convention and make failures less consistent than the surrounding suite. As per coding guidelines,rest-api/**/*.go: "Usetestify(assert/require) for test assertions."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go` around lines 83 - 94, The test loop in nico_test_server_test.go uses manual t.Fatalf and t.Errorf checks instead of the repo-standard testify assertions. Update the InvokeInstancePower test body to use require/assert from testify for the status.Code, status.Convert(err).Message(), and result nil checks, keeping the same expectations while matching the surrounding Go test convention.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@rest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go`:
- Around line 83-94: The test loop in nico_test_server_test.go uses manual
t.Fatalf and t.Errorf checks instead of the repo-standard testify assertions.
Update the InvokeInstancePower test body to use require/assert from testify for
the status.Code, status.Convert(err).Message(), and result nil checks, keeping
the same expectations while matching the surrounding Go test convention.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 690bfcf1-8161-4066-a88e-9d5cac612fe8
📒 Files selected for processing (4)
crates/admin-cli/src/instance/reboot/cmd.rsdev/bin/reprovision_dpu.shrest-api/site-workflow/pkg/grpc/server/nico_test_server.gorest-api/site-workflow/pkg/grpc/server/nico_test_server_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- rest-api/site-workflow/pkg/grpc/server/nico_test_server.go
- dev/bin/reprovision_dpu.sh
Signed-off-by: Hasan Khan <hasank@nvidia.com>
Description
Remove the deprecated
machine_idinput fromInvokeInstancePowerand require callers to identify the tenant instance directly.This change:
instance_idin API Core;instance_idthrough REST API, control-plane workflow, and site workflow paths; andRelated issues
Closes #931
Type of Change
Breaking Changes
InvokeInstancePowerclients must now populateinstance_id. The deprecatedmachine_idfield has been removed and its protobuf name/number are reserved.Testing
Passed locally:
cargo test -p nico-admin-cli --lockedDATABASE_URL=postgresql://postgres:postgres@127.0.0.1:30432/nicotest cargo test -p carbide-api-core --no-default-features --locked instance_ipxe_behaviorscargo clippy -p carbide-api-core -p nico-admin-cli --no-default-features --all-targets --locked -- -D warningscargo fmt --all -- --checkwith the repository-pinned nightly toolchainprotolint lint -config_path=.protolint.yaml crates/rpc/proto/buf breaking crates/rpc/proto --against 'https://github.com/NVIDIA/infra-controller.git#branch=main,subdir=crates/rpc/proto'cargo make --no-workspace check-rest-core-proto-syncmake test-apimake test-workflowmake test-site-workflowmake test-flowAdditional Notes
The REST generated snapshots on
mainwere already stale for the Site Explorer change from #2591 and the DHCP lease status enum change from #2877. The first commit is the deterministic generated baseline produced from cleanmain; the second commit applies #931 and regenerates the same outputs. All protobuf-generated files were produced bycargo make --no-workspace generate-rest-core-proto, not edited by hand.The final commit adds the single
carbide_site_explorer_last_run_statusrow emitted by the Core metrics-doc generator; that generated-doc drift also predates this change.