feat: route machine lifecycle power through component backends#2959
Conversation
Signed-off-by: Hasan Khan <hasank@nvidia.com>
Summary by CodeRabbit
WalkthroughThis PR unifies compute-tray power control through the Component Manager backend interface. Per-machine requests are classified into three routing paths—Core Redfish direct, configured RMS direct, or queued rack-state-controller—with a durable dispatch marker preventing at-most-once replay. A full ChangesCompute Power Routing via Component Manager Backends
Sequence Diagram(s)sequenceDiagram
participant Client
participant component_manager_handler
participant classify_compute_power_targets
participant dispatch_compute_power_direct
participant ComputeTrayManager
participant RackStateController
participant rack_maintenance
Client->>component_manager_handler: ComponentPowerControlRequest(machine_ids, bypass)
component_manager_handler->>classify_compute_power_targets: classify machines
classify_compute_power_targets-->>component_manager_handler: CoreDirect / ConfiguredDirect / RackStateController buckets
rect rgba(100, 180, 255, 0.5)
Note over component_manager_handler,ComputeTrayManager: Direct dispatch path
component_manager_handler->>dispatch_compute_power_direct: endpoints + action
dispatch_compute_power_direct->>ComputeTrayManager: power_control(endpoints)
ComputeTrayManager-->>dispatch_compute_power_direct: per-BMC results
dispatch_compute_power_direct-->>component_manager_handler: ComponentResults
end
rect rgba(255, 160, 80, 0.5)
Note over component_manager_handler,rack_maintenance: Queued rack path
component_manager_handler->>RackStateController: queue_compute_power_control (DB row lock, set maintenance_requested)
RackStateController-->>component_manager_handler: queued ComponentResults
RackStateController->>rack_maintenance: PowerControl Preparing → Prepared → dispatch → Finalizing
rack_maintenance->>ComputeTrayManager: power_control(scoped endpoints)
ComputeTrayManager-->>rack_maintenance: per-BMC results
rack_maintenance-->>RackStateController: finalized desired-state & health overrides
end
component_manager_handler-->>Client: ComponentPowerControlResponse
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-06-28 20:15:02 UTC | Commit: 3e4bbc4 |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/api-core/src/tests/common/api_fixtures/mod.rs (1)
1366-1382: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winPass the Redfish simulator into
build_component_managerfor Core compute overrides.
TestEnvOverrides::component_manager_confignow exposes the backend choice, but this fixture always callsbuild_component_manager(..., None)for the Redfish pool. Any test that setscompute_tray_backend: Corewill fail during env construction even thoughredfish_simis already available in this scope.Proposed fix
let test_component_manager = component_manager::component_manager::build_component_manager( &component_manager_config, component_manager_rack_profiles, rms_sim.as_rms_client(), None, Some(db_pool.clone()), - None, + Some(redfish_sim.clone()), )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/tests/common/api_fixtures/mod.rs` around lines 1366 - 1382, The `build_component_manager` call in the test fixture is still hardcoding the Redfish pool argument to `None`, which breaks env construction when `TestEnvOverrides::component_manager_config` selects `compute_tray_backend: Core`. Update the fixture so the existing `redfish_sim` in scope is passed through to `component_manager::component_manager::build_component_manager` when building `test_component_manager`, while preserving the current `rms_sim.as_rms_client()` and `db_pool` wiring. Use the `build_component_manager` and `TestEnvOverrides::component_manager_config` symbols to locate the change.crates/api-core/src/handlers/component_manager.rs (1)
1374-1571: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winReject duplicate BMC IPs before direct dispatch.
ip_to_machine_idkeeps only one machine per BMC IP, whiledispatch_endpointscan still contain multiple endpoints with that same IP. In that case, direct dispatch can issue duplicate power operations and then attribute backend results to the wrong machine. Mirror the rack-controller guard by detecting duplicate BMC IPs during endpoint resolution and returning per-machine errors instead of dispatching them.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/component_manager.rs` around lines 1374 - 1571, Duplicate BMC IPs are still allowed into the direct-dispatch path, which can make `dispatch_compute_power_direct` send repeated power actions and misattribute results because `ip_to_machine_id` only retains one machine per IP. Add a duplicate-IP check while building endpoints in `resolve_compute_tray_endpoints` (or immediately before dispatch) so repeated BMC IPs are rejected with per-machine `error_result` entries instead of being added to `ResolvedComputeTrayEndpoints` and `dispatch_endpoints`. Use the existing `resolve_compute_tray_endpoints`, `dispatch_compute_power_direct`, and `ip_to_machine_id` flow as the place to mirror the rack-controller guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-core/src/handlers/rack.rs`:
- Around line 831-845: The maintenance scheduling flow in rack.rs is holding the
rack transaction open while awaiting credential_manager.set_credentials, which
can block DB locks on an external dependency. Move credential storage out of the
transaction scope in the maintenance scheduling path (around the rack token
handling in the maintenance_access_token block), or switch to a
request-scoped/two-phase key so the DB transaction only covers rack state
changes. Keep the transaction limited to the DB update logic and perform the
credential backend write after the transaction has been committed.
In `@crates/api-core/src/tests/common/api_fixtures/mod.rs`:
- Around line 337-343: The controller-path test fixture is using only the base
credential manager, so it misses credentials available through the composite
snapshot-based reader. Update the relevant constructors in the test fixture to
inject the composite credential reader built from
test_static_credential_snapshot() plus credential_manager, matching the API path
behavior; use the existing CompositeCredentialManager and credential_reader
symbols to locate the affected setup.
In `@crates/machine-controller/src/handler.rs`:
- Around line 8683-8699: The InitialReset flow in the host state handler is
mixing backends by calling services.power_control before immediately polling
with create_redfish_client_from_machine/get_power_state, which can route rack
machines through RMS and then fail on direct BMC reads. Update the InitialReset
path in handler::StateHandlerOutcome/state transition logic so it stays on
Core/direct Redfish end-to-end, or switch to a backend-owned reset/status path
before any RMS-driven power action; make the same fix in the other matching
InitialReset block referenced by the review.
- Around line 10302-10319: In dispatch_core_host_power_control, the current
action can be downgraded from ACPowercycle to ForceOff, but the later DPU
bookkeeping still sees the original value. Return the effective power action
from dispatch_core_host_power_control, capture that result at the call site, and
use it before the DPU reboot timestamp loop so the recorded action matches the
one actually executed; keep the existing ACPowercycle fallback and the
dispatch_component_manager_power_control path in sync.
---
Outside diff comments:
In `@crates/api-core/src/handlers/component_manager.rs`:
- Around line 1374-1571: Duplicate BMC IPs are still allowed into the
direct-dispatch path, which can make `dispatch_compute_power_direct` send
repeated power actions and misattribute results because `ip_to_machine_id` only
retains one machine per IP. Add a duplicate-IP check while building endpoints in
`resolve_compute_tray_endpoints` (or immediately before dispatch) so repeated
BMC IPs are rejected with per-machine `error_result` entries instead of being
added to `ResolvedComputeTrayEndpoints` and `dispatch_endpoints`. Use the
existing `resolve_compute_tray_endpoints`, `dispatch_compute_power_direct`, and
`ip_to_machine_id` flow as the place to mirror the rack-controller guard.
In `@crates/api-core/src/tests/common/api_fixtures/mod.rs`:
- Around line 1366-1382: The `build_component_manager` call in the test fixture
is still hardcoding the Redfish pool argument to `None`, which breaks env
construction when `TestEnvOverrides::component_manager_config` selects
`compute_tray_backend: Core`. Update the fixture so the existing `redfish_sim`
in scope is passed through to
`component_manager::component_manager::build_component_manager` when building
`test_component_manager`, while preserving the current `rms_sim.as_rms_client()`
and `db_pool` wiring. Use the `build_component_manager` and
`TestEnvOverrides::component_manager_config` symbols to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3d98a856-b97a-4c69-aa11-1a5dde0f6a21
⛔ Files ignored due to path filters (4)
Cargo.lockis excluded by!**/*.lockrest-api/flow/internal/nicoapi/gen/nico.pb.gois excluded by!**/*.pb.go,!**/gen/**,!rest-api/**/*.pb.gorest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.gois excluded by!**/*.pb.go,!rest-api/**/*.pb.gorest-api/workflow-schema/site-agent/workflows/v1/nico_nico.protois excluded by!rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
📒 Files selected for processing (30)
crates/api-core/src/handlers/component_manager.rscrates/api-core/src/handlers/rack.rscrates/api-core/src/setup.rscrates/api-core/src/tests/common/api_fixtures/mod.rscrates/api-core/src/tests/component_manager_compute_power.rscrates/api-core/src/tests/machine_power.rscrates/api-core/src/tests/machine_states.rscrates/api-core/src/tests/mod.rscrates/api-core/src/tests/rack_state_controller/handler.rscrates/api-db/src/power_options.rscrates/api-model/src/machine/mod.rscrates/api-model/src/rack.rscrates/component-manager/src/component_manager.rscrates/component-manager/src/compute_tray_manager.rscrates/component-manager/src/config.rscrates/component-manager/src/core_compute_manager.rscrates/component-manager/src/rms.rscrates/machine-controller/Cargo.tomlcrates/machine-controller/src/context.rscrates/machine-controller/src/handler.rscrates/machine-controller/src/redfish.rscrates/rack-controller/Cargo.tomlcrates/rack-controller/src/context.rscrates/rack-controller/src/io.rscrates/rack-controller/src/maintenance.rscrates/rack-controller/src/ready.rscrates/rpc/proto/forge.protorest-api/flow/docs/component-manager-config.mdrest-api/flow/internal/nicoapi/nicoproto/nico.protorest-api/flow/internal/task/componentmanager/compute/nico/nico.go
| // Keep the rack row locked while installing the shared rack token so | ||
| // a losing concurrent request can never delete or replace the token | ||
| // owned by the request that actually schedules maintenance. | ||
| if let Some(token) = maintenance_access_token.as_ref() | ||
| && let Err(error) = api | ||
| .credential_manager | ||
| .delete_credentials(&rack_maintenance_access_token_key(&rack_id)) | ||
| .set_credentials( | ||
| &rack_maintenance_access_token_key(&rack_id), | ||
| &Credentials::UsernamePassword { | ||
| username: "access_token".into(), | ||
| password: token.clone(), | ||
| }, | ||
| ) | ||
| .await | ||
| { |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift
Do not await credential storage while holding the rack transaction.
This keeps the rack row lock and DB connection open while credential_manager.set_credentials runs. If the credential backend is slow or unavailable, maintenance scheduling for this rack can block behind an external dependency. Use a request-scoped credential key or another two-phase design so the DB transaction only covers DB state. As per coding guidelines, “Do not hold a transaction open while doing long-running work.” As per path instructions, API/database changes should be reviewed for transaction safety.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/api-core/src/handlers/rack.rs` around lines 831 - 845, The maintenance
scheduling flow in rack.rs is holding the rack transaction open while awaiting
credential_manager.set_credentials, which can block DB locks on an external
dependency. Move credential storage out of the transaction scope in the
maintenance scheduling path (around the rack token handling in the
maintenance_access_token block), or switch to a request-scoped/two-phase key so
the DB transaction only covers rack state changes. Keep the transaction limited
to the DB update logic and perform the credential backend write after the
transaction has been committed.
Sources: Coding guidelines, Path instructions
| core_compute_tray_manager: Arc::new( | ||
| component_manager::core_compute_manager::CoreComputeTrayManager::new( | ||
| self.redfish_sim.clone(), | ||
| ), | ||
| ), | ||
| component_manager: self.test_component_manager.clone(), | ||
| credential_reader: self.test_credential_manager.clone(), |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Use the composite credential reader for controller-path tests.
The fixture builds a CompositeCredentialManager from test_static_credential_snapshot() plus credential_manager, but both machine-state service constructors inject only credential_manager.clone(). That makes controller-path credential resolution narrower than the API path and can break tests that rely on CredentialKeys present only in the static snapshot.
Also applies to: 1511-1517
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/api-core/src/tests/common/api_fixtures/mod.rs` around lines 337 - 343,
The controller-path test fixture is using only the base credential manager, so
it misses credentials available through the composite snapshot-based reader.
Update the relevant constructors in the test fixture to inject the composite
credential reader built from test_static_credential_snapshot() plus
credential_manager, matching the API path behavior; use the existing
CompositeCredentialManager and credential_reader symbols to locate the affected
setup.
| services | ||
| .power_control(&state.host_snapshot, PowerAction::ForceOff) | ||
| .await?; | ||
|
|
||
| Ok(StateHandlerOutcome::transition(scenario.actual_new_state( | ||
| HostReprovisionState::InitialReset { | ||
| phase: InitialResetPhase::WaitingForHostOff, | ||
| last_time: Utc::now(), | ||
| }, | ||
| state.managed_state.get_host_repro_retry_count(), | ||
| ))) | ||
| } | ||
| InitialResetPhase::WaitingForHostOff => { | ||
| let redfish_client = services | ||
| .create_redfish_client_from_machine(&state.host_snapshot) | ||
| .await?; | ||
| let status = get_power_state(redfish_client.as_ref()).await?; |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift
Avoid dispatching RMS power before direct BMC polling.
services.power_control can select RMS for rack machines, but the next phases immediately use direct Redfish reads. That contradicts the non-Core path assumption below that rack BMCs may only be reachable through RMS, so this can power off successfully and then fail/stick while polling. Either keep this initial-reset flow on Core/direct-Redfish end-to-end, or add a backend-owned status/reset path before routing rack machines through RMS here.
Also applies to: 8756-8772
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/machine-controller/src/handler.rs` around lines 8683 - 8699, The
InitialReset flow in the host state handler is mixing backends by calling
services.power_control before immediately polling with
create_redfish_client_from_machine/get_power_state, which can route rack
machines through RMS and then fail on direct BMC reads. Update the InitialReset
path in handler::StateHandlerOutcome/state transition logic so it stays on
Core/direct Redfish end-to-end, or switch to a backend-owned reset/status path
before any RMS-driven power action; make the same fix in the other matching
InitialReset block referenced by the review.
| if action == SystemPowerControl::ACPowercycle | ||
| && !redfish_client.ac_powercycle_supported_by_power() | ||
| { | ||
| action = SystemPowerControl::ForceOff; | ||
| } | ||
|
|
||
| if action == SystemPowerControl::ACPowercycle && power_state != libredfish::PowerState::Off { | ||
| tracing::warn!( | ||
| machine_id = %machine.id, | ||
| %power_state, | ||
| "ACPowercycle requires chassis to be Off, forcing off first" | ||
| ); | ||
| ctx.services | ||
| .power_control_with_manager(machine, backend, PowerAction::ForceOff) | ||
| .await?; | ||
| } | ||
|
|
||
| dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
Propagate the effective power action into DPU bookkeeping.
When Core downgrades unsupported ACPowercycle to ForceOff, the host write records ForceOff, but the later DPU timestamp loop still uses the original ACPowercycle. Return the effective action from dispatch_core_host_power_control and assign it back before writing DPU reboot timestamps.
Proposed fix
async fn dispatch_core_host_power_control(
@@
-) -> Result<(), StateHandlerError> {
+) -> Result<SystemPowerControl, StateHandlerError> {
@@
- dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await
+ dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await?;
+ Ok(action)
}
@@
- dispatch_core_host_power_control(
+ action = dispatch_core_host_power_control(
machine,
backend.as_ref(),
redfish_client.as_ref(),
power_state,
action,
ctx,
location,
)
.await?;Also applies to: 10389-10397
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/machine-controller/src/handler.rs` around lines 10302 - 10319, In
dispatch_core_host_power_control, the current action can be downgraded from
ACPowercycle to ForceOff, but the later DPU bookkeeping still sees the original
value. Return the effective power action from dispatch_core_host_power_control,
capture that result at the call site, and use it before the DPU reboot timestamp
loop so the recorded action matches the one actually executed; keep the existing
ACPowercycle fallback and the dispatch_component_manager_power_control path in
sync.
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Route compute-machine lifecycle power operations through the Component Manager abstraction.
Related issues
Closes #2053
Addresses #2421
Addresses #2679
Type of Change
Breaking Changes
Testing
Validated with:
cargo test -p component-manager --lib -- --test-threads=1(81 passed)cargo test -p carbide-machine-controller --lib(39 passed)cargo test -p carbide-rack-controller --lib(35 passed)carbide-api-coredatabase-backed integration tests for standalone Core routing under global RMS, rack queue/bypass dispatch, stale desired-state versions, crash recovery, and backend-failure rollbackcargo test -p carbide-api-model rack::tests::should_run(3 passed)cargo make --no-workspace generate-rest-core-protocargo clippywith warnings denied for the affected Component Manager, controller, API-core, and API-model targetscargo +nightly fmt --all -- --checkcargo test --locked --no-runfor all four affected cratesAdditional Notes
The durable rack dispatch marker intentionally fails a recovered in-flight request closed instead of replaying non-idempotent restart or AC-cycle actions.