feat: route machine lifecycle power through component backends by osu · Pull Request #2959 · NVIDIA/infra-controller

osu · 2026-06-28T20:12:23Z

Route compute-machine lifecycle power operations through the Component Manager abstraction.

- Add a Core compute-tray adapter that preserves the existing Redfish/IPMI behavior for non-rack machines.
- Dispatch rack machines configured for RMS through exact compute-tray power actions, including BMC authentication, vendor, and custom-port data.
- Route RMS rack requests through a durable rack-maintenance state machine unless the caller explicitly bypasses state control.
- Make rack dispatch crash-safe and at-most-once, with scoped desired-power bookkeeping, compare-and-swap updates, result reconciliation, and cleanup.
- Move automated machine lifecycle host-power actions onto the same backend interface while preserving Core fallback behavior.
- Align public model, protobuf, generated Go, and Flow documentation with the new routing contract.
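
To make the BMC authentication and custom-port data concrete, here is a minimal sketch of an endpoint carrying both, plus an RMS-side builder that rejects credential-key auth (the review summary further down says `build_compute_tray_node_info` does this). The type names `ComputeTrayEndpoint` and `ComputeTrayAuthentication` come from this PR; the field layout, `RmsNodeInfo`, `build_node_info`, and the `ip:port` address format are assumptions for illustration only, and IPv4 addresses are assumed.

```rust
// Simplified stand-ins for the PR's types; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ComputeTrayAuthentication {
    /// Inline BMC username and password.
    Credentials { username: String, password: String },
    /// Reference to a stored secret that must be resolved before use.
    CredentialKey(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct ComputeTrayEndpoint {
    pub bmc_ip: String,
    /// Custom BMC port; `None` means the backend default.
    pub bmc_port: Option<u16>,
    pub auth: ComputeTrayAuthentication,
}

/// Hypothetical RMS-side node description.
#[derive(Debug, PartialEq)]
pub struct RmsNodeInfo {
    pub address: String,
    pub username: String,
    pub password: String,
}

/// Builds the RMS node description, rejecting auth modes RMS cannot use.
pub fn build_node_info(endpoint: &ComputeTrayEndpoint) -> Result<RmsNodeInfo, String> {
    let (username, password) = match &endpoint.auth {
        ComputeTrayAuthentication::Credentials { username, password } => {
            (username.clone(), password.clone())
        }
        ComputeTrayAuthentication::CredentialKey(_) => {
            return Err("credential-key authentication is not supported by RMS".to_string());
        }
    };
    // Append the custom port only when one is configured (IPv4 assumed).
    let address = match endpoint.bmc_port {
        Some(port) => format!("{}:{}", endpoint.bmc_ip, port),
        None => endpoint.bmc_ip.clone(),
    };
    Ok(RmsNodeInfo { address, username, password })
}
```

Returning `Result` keeps the unsupported auth mode a per-endpoint error rather than a panic, so a caller can report it for that machine and continue with the rest.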

Related issues

Closes #2053

Addresses #2421
Addresses #2679

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Validated with:

- `cargo test -p component-manager --lib -- --test-threads=1` (81 passed)
- `cargo test -p carbide-machine-controller --lib` (39 passed)
- `cargo test -p carbide-rack-controller --lib` (35 passed)
- focused `carbide-api-core` database-backed integration tests for standalone Core routing under global RMS, rack queue/bypass dispatch, stale desired-state versions, crash recovery, and backend-failure rollback
- focused lifecycle regressions for the phased RMS initial reset (including non-redispatch polling), unavailable/outdated DPUs, DPU reset, and instance creation
- `cargo test -p carbide-api-model rack::tests::should_run` (3 passed)
- `cargo make --no-workspace generate-rest-core-proto`
- `cargo clippy` with warnings denied for the affected Component Manager, controller, API-core, and API-model targets
- `cargo +nightly fmt --all -- --check`
- Linux `cargo test --locked --no-run` for all four affected crates

Additional Notes

The durable rack dispatch marker intentionally fails closed on a recovered in-flight request instead of replaying non-idempotent restart or AC-cycle actions.
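
The fail-closed behavior can be sketched as a claim on a persisted marker. This is an illustration under assumptions, not the PR's code: `RackRecord`, `claim_dispatch`, `finish_dispatch`, and `DispatchDecision` are hypothetical names, the marker is modeled as an in-memory `Option<u64>` timestamp, and the caller is assumed to hold the rack row lock. Only the field name `power_control_dispatch_started_at` is taken from the PR.

```rust
/// What the controller should do when it is about to dispatch a power action.
#[derive(Debug, PartialEq)]
pub enum DispatchDecision {
    /// No marker was present; it is now claimed, so dispatch exactly once.
    Dispatch,
    /// A marker from an earlier attempt exists. Its outcome is unknown, so
    /// report failure rather than risk a second restart or AC cycle.
    FailClosed,
}

/// Hypothetical persisted rack record (the PR keeps the marker on `RackConfig`).
#[derive(Debug, Default)]
pub struct RackRecord {
    /// Persisted before the backend call and cleared after reconciliation.
    pub power_control_dispatch_started_at: Option<u64>,
}

/// Claims the marker if it is free. Assumes the caller holds the row lock.
pub fn claim_dispatch(record: &mut RackRecord, now_unix_secs: u64) -> DispatchDecision {
    match record.power_control_dispatch_started_at {
        Some(_) => DispatchDecision::FailClosed,
        None => {
            record.power_control_dispatch_started_at = Some(now_unix_secs);
            DispatchDecision::Dispatch
        }
    }
}

/// Clears the marker once backend results have been reconciled.
pub fn finish_dispatch(record: &mut RackRecord) {
    record.power_control_dispatch_started_at = None;
}
```

A process that crashes between `claim_dispatch` and `finish_dispatch` leaves the marker set, so the recovered run sees `FailClosed` and never sends the action a second time.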

Signed-off-by: Hasan Khan <hasank@nvidia.com>

coderabbitai · 2026-06-28T20:12:42Z

Summary by CodeRabbit

New Features
- Added support for more precise compute power routing across standalone and rack-associated machines.
- Improved rack maintenance handling for scoped power control, including safer retry behavior and state tracking.
- Added support for additional hardware/vendor configurations and BMC port handling.
Bug Fixes
- Prevented machine power requests from being sent more than once.
- Fixed stale power-state updates and improved concurrency handling during power option changes.
- Improved recovery when backend power actions fail, preserving prior machine state.
Documentation
- Updated power-routing guidance and configuration notes for compute and rack operations.

Walkthrough

This PR unifies compute-tray power control through the Component Manager backend interface. Per-machine requests are classified into three routing paths (Core Redfish direct, configured RMS direct, or queued rack-state-controller), with a durable dispatch marker that enforces at-most-once dispatch by preventing replay. A full PowerControl sub-state-machine is added to the rack maintenance flow, and optimistic concurrency is enforced on power-option updates.
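
A minimal sketch of the three-way classification, assuming the decision depends only on rack membership, whether the rack is configured for RMS, and the caller's bypass flag. The variant names mirror the `CoreDirect`/`ConfiguredDirect`/`RackStateController` buckets named in this review; `MachineRouting`, `classify`, and the rule that a rack machine without RMS falls back to Core are assumptions, and the real `classify_compute_power_targets` may consult more state.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ComputePowerRoute {
    /// Direct Redfish/IPMI through the Core adapter.
    CoreDirect,
    /// Direct dispatch through the configured (RMS) backend.
    ConfiguredDirect,
    /// Queued through the rack-maintenance state machine.
    RackStateController,
}

/// Hypothetical per-machine routing facts.
#[derive(Debug, Clone, Copy)]
pub struct MachineRouting {
    pub in_rack: bool,
    pub rack_backend_is_rms: bool,
}

/// Picks a route for one machine.
pub fn classify(machine: MachineRouting, bypass_state_controller: bool) -> ComputePowerRoute {
    // Standalone machines, and (by assumption) rack machines without RMS, stay on Core.
    if !machine.in_rack || !machine.rack_backend_is_rms {
        return ComputePowerRoute::CoreDirect;
    }
    // RMS rack machines are queued unless the caller explicitly bypasses state control.
    if bypass_state_controller {
        ComputePowerRoute::ConfiguredDirect
    } else {
        ComputePowerRoute::RackStateController
    }
}
```

Classifying per machine rather than per request lets a single request that mixes standalone and rack machines be split into buckets and dispatched along different paths.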

Changes

Compute Power Routing via Component Manager Backends

Layer / File(s)	Summary
ComputeTrayEndpoint model and rack power-control persisted types `crates/component-manager/src/compute_tray_manager.rs`, `crates/api-model/src/rack.rs`, `crates/api-model/src/machine/mod.rs`	`ComputeTrayEndpoint` replaces `bmc_credentials` with `ComputeTrayAuthentication` (Credentials/CredentialKey) and adds `bmc_port`. Rack model gains `RackPowerControlState`/`Result`/`Target` persisted types, `power_control_dispatch_started_at` durable marker on `RackConfig`, and `MaintenanceScope::should_run` now requires explicit opt-in for `PowerControl` activities. `InitialResetPhase` adds `WaitingForHostOff`/`WaitingForHostOn` sub-states.
Core and RMS backend adaptors `crates/component-manager/src/core_compute_manager.rs`, `crates/component-manager/src/rms.rs`, `crates/component-manager/src/config.rs`, `crates/component-manager/src/component_manager.rs`	`CoreComputeTrayManager::power_control` derives auth from `ComputeTrayAuthentication` and passes `bmc_port`. `build_compute_tray_node_info` returns `Result` and rejects credential-key auth in RMS. Adds `to_rms_compute_power_operation` preserving Redfish reset semantics. `LenovoAmi` vendor variant added with explicit `None` mapping in Core.
Machine controller routing and handler_host_power_control unification `crates/machine-controller/src/context.rs`, `crates/machine-controller/src/handler.rs`, `crates/machine-controller/src/redfish.rs`, `crates/machine-controller/Cargo.toml`	Adds `ComputeTrayRoute` selection, credential-loading `compute_tray_endpoint`, and `power_control`/`power_control_with_manager` with result validation to `MachineStateHandlerServices`. Rewrites `handler_host_power_control_with_location` to route non-core backends through component-manager dispatch. Standardizes BOSS, restart-verification, DPU, HGX/BMC, and platform-config power calls through `handler_host_power_control`.
Rack maintenance PowerControl sub-state-machine `crates/rack-controller/src/maintenance.rs`, `crates/rack-controller/src/ready.rs`, `crates/rack-controller/src/io.rs`, `crates/rack-controller/Cargo.toml`	Full compute-tray power-control subsystem: credential resolution, endpoint pairing with BMC-IP drift detection, durable dispatch-marker claiming via DB row lock, health-report override preparation, per-target desired-power-state updates with rollback on failure, backend dispatch with result reconciliation, and best-effort re-exploration. Integrated as `Preparing`/`Prepared`/`Finalizing` sub-states in `handle_maintenance`.
API handler routing pipeline and rack maintenance scheduling fix `crates/api-core/src/handlers/component_manager.rs`, `crates/api-core/src/handlers/rack.rs`	Replaces monolithic direct-dispatch with `classify_compute_power_targets` (CoreDirect/ConfiguredDirect/RackStateController), `dispatch_compute_power_direct` with health-override and reconciliation, and `queue_compute_power_control_via_rack_state_controller`. `on_demand_rack_maintenance` now re-reads and row-locks the rack inside a transaction before validation, with revised token error/commit semantics.
Power options optimistic concurrency `crates/api-db/src/power_options.rs`	`update_desired_state` matches on the current `desired_power_state_version` in the `WHERE` clause; maps `RowNotFound` to `ConcurrentModificationError`.
Service wiring and test infrastructure `crates/api-core/src/setup.rs`, `crates/api-core/src/tests/common/api_fixtures/mod.rs`, `crates/api-core/src/tests/mod.rs`	Wires `core_compute_tray_manager`, `component_manager`, and `credential_reader` into `MachineStateHandlerServices` and `RackStateHandlerServices`. Adds `component_manager_config` override to `TestEnvOverrides`.
Tests `crates/api-core/src/tests/component_manager_compute_power.rs`, `crates/api-core/src/tests/machine_states.rs`, `crates/api-core/src/tests/machine_power.rs`, `crates/api-core/src/tests/rack_state_controller/handler.rs`, `crates/rack-controller/src/maintenance.rs`, `crates/component-manager/src/...`	API-level routing tests (standalone bypass, rack queue, rack bypass direct); rack-controller dispatch-marker idempotency and backend-failure restoration tests; machine-states RMS routing and `InitialResetPhase` polling tests; stale-version rejection test; maintenance unit tests for scope helpers, reconciliation, and state transitions; backend unit tests for vendor mapping and port handling.
Proto and documentation `crates/rpc/proto/forge.proto`, `rest-api/flow/internal/nicoapi/nicoproto/nico.proto`, `rest-api/flow/docs/component-manager-config.md`, `rest-api/flow/internal/task/componentmanager/compute/nico/nico.go`	Updates `bypass_state_controller` field comments and operator documentation to reflect new per-path routing behavior.
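
The optimistic-concurrency row above amounts to a compare-and-swap on a version column. The sketch below models that with an in-memory map instead of SQL. `update_desired_state`, `desired_power_state_version`, and `ConcurrentModificationError` are names from the table; the `PowerOptions` struct, the string-typed state, and the increment-by-one versioning are assumptions.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub struct ConcurrentModificationError {
    pub expected_version: u64,
}

/// Hypothetical row shape for the power-options table.
#[derive(Debug, Clone, PartialEq)]
pub struct PowerOptions {
    pub desired_power_state: String,
    pub desired_power_state_version: u64,
}

/// Updates the desired state only if the stored version still matches.
/// This mirrors `UPDATE ... WHERE id = ? AND version = ?`: a missing row and
/// a stale version both surface as "no row matched".
pub fn update_desired_state(
    table: &mut HashMap<String, PowerOptions>,
    machine_id: &str,
    expected_version: u64,
    new_state: &str,
) -> Result<u64, ConcurrentModificationError> {
    match table.get_mut(machine_id) {
        Some(row) if row.desired_power_state_version == expected_version => {
            row.desired_power_state = new_state.to_string();
            row.desired_power_state_version += 1;
            Ok(row.desired_power_state_version)
        }
        _ => Err(ConcurrentModificationError { expected_version }),
    }
}
```

A writer that read version 1 and loses the race gets an error instead of silently overwriting the newer desired state, which is the property the stale-version rejection test exercises.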

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant component_manager_handler
  participant classify_compute_power_targets
  participant dispatch_compute_power_direct
  participant ComputeTrayManager
  participant RackStateController
  participant rack_maintenance

  Client->>component_manager_handler: ComponentPowerControlRequest(machine_ids, bypass)
  component_manager_handler->>classify_compute_power_targets: classify machines
  classify_compute_power_targets-->>component_manager_handler: CoreDirect / ConfiguredDirect / RackStateController buckets

  rect rgba(100, 180, 255, 0.5)
    Note over component_manager_handler,ComputeTrayManager: Direct dispatch path
    component_manager_handler->>dispatch_compute_power_direct: endpoints + action
    dispatch_compute_power_direct->>ComputeTrayManager: power_control(endpoints)
    ComputeTrayManager-->>dispatch_compute_power_direct: per-BMC results
    dispatch_compute_power_direct-->>component_manager_handler: ComponentResults
  end

  rect rgba(255, 160, 80, 0.5)
    Note over component_manager_handler,rack_maintenance: Queued rack path
    component_manager_handler->>RackStateController: queue_compute_power_control (DB row lock, set maintenance_requested)
    RackStateController-->>component_manager_handler: queued ComponentResults
    RackStateController->>rack_maintenance: PowerControl Preparing → Prepared → dispatch → Finalizing
    rack_maintenance->>ComputeTrayManager: power_control(scoped endpoints)
    ComputeTrayManager-->>rack_maintenance: per-BMC results
    rack_maintenance-->>RackStateController: finalized desired-state & health overrides
  end

  component_manager_handler-->>Client: ComponentPowerControlResponse

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.90% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: routing machine lifecycle power through component backends.
Description check	✅ Passed	The description matches the changeset and explains the backend routing, durability, and lifecycle-power refactor.
Linked Issues check	✅ Passed	The PR aligns with `#2053` by routing rack machines through RMS, non-rack machines through Core, and moving internal lifecycle power onto the backend interface.
Out of Scope Changes check	✅ Passed	The changes are concentrated on backend routing, durability, models, tests, and docs; no clear unrelated scope creep is evident.


github-actions · 2026-06-28T20:15:03Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉


🕐 Last updated: 2026-06-28 20:15:02 UTC | Commit: 3e4bbc4

copy-pr-bot · 2026-06-28T20:16:50Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/api-core/src/tests/common/api_fixtures/mod.rs (1)
1366-1382: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Pass the Redfish simulator into build_component_manager for Core compute overrides.

TestEnvOverrides::component_manager_config now exposes the backend choice, but this fixture always calls build_component_manager(..., None) for the Redfish pool. Any test that sets compute_tray_backend: Core will fail during env construction even though redfish_sim is already available in this scope.
Proposed fix
     let test_component_manager = component_manager::component_manager::build_component_manager(
         &component_manager_config,
         component_manager_rack_profiles,
         rms_sim.as_rms_client(),
         None,
         Some(db_pool.clone()),
-        None,
+        Some(redfish_sim.clone()),
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/tests/common/api_fixtures/mod.rs` around lines 1366 -
1382, The `build_component_manager` call in the test fixture is still hardcoding
the Redfish pool argument to `None`, which breaks env construction when
`TestEnvOverrides::component_manager_config` selects `compute_tray_backend:
Core`. Update the fixture so the existing `redfish_sim` in scope is passed
through to `component_manager::component_manager::build_component_manager` when
building `test_component_manager`, while preserving the current
`rms_sim.as_rms_client()` and `db_pool` wiring. Use the
`build_component_manager` and `TestEnvOverrides::component_manager_config`
symbols to locate the change.
crates/api-core/src/handlers/component_manager.rs (1)
1374-1571: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Reject duplicate BMC IPs before direct dispatch.

ip_to_machine_id keeps only one machine per BMC IP, while dispatch_endpoints can still contain multiple endpoints with that same IP. In that case, direct dispatch can issue duplicate power operations and then attribute backend results to the wrong machine. Mirror the rack-controller guard by detecting duplicate BMC IPs during endpoint resolution and returning per-machine errors instead of dispatching them.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/component_manager.rs` around lines 1374 - 1571,
Duplicate BMC IPs are still allowed into the direct-dispatch path, which can
make `dispatch_compute_power_direct` send repeated power actions and
misattribute results because `ip_to_machine_id` only retains one machine per IP.
Add a duplicate-IP check while building endpoints in
`resolve_compute_tray_endpoints` (or immediately before dispatch) so repeated
BMC IPs are rejected with per-machine `error_result` entries instead of being
added to `ResolvedComputeTrayEndpoints` and `dispatch_endpoints`. Use the
existing `resolve_compute_tray_endpoints`, `dispatch_compute_power_direct`, and
`ip_to_machine_id` flow as the place to mirror the rack-controller guard.
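
One way to express the suggested guard, as a sketch: count BMC IPs across the resolved endpoints and move every machine that shares an IP into a per-machine error list before anything is dispatched. The `Endpoint` struct, `reject_duplicate_bmc_ips`, and the error text are hypothetical; the real endpoint and result types in `component_manager.rs` are not shown on this page.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced endpoint: just the fields the guard needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub machine_id: String,
    pub bmc_ip: String,
}

/// Splits endpoints into dispatchable ones and `(machine_id, error)` pairs.
/// Every machine that shares a BMC IP is rejected, not just the later ones,
/// because there is no safe way to tell which machine owns the BMC.
pub fn reject_duplicate_bmc_ips(
    endpoints: Vec<Endpoint>,
) -> (Vec<Endpoint>, Vec<(String, String)>) {
    // First pass: count how many endpoints use each BMC IP.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for endpoint in &endpoints {
        *counts.entry(endpoint.bmc_ip.clone()).or_insert(0) += 1;
    }

    // Second pass: keep unique IPs, report shared ones per machine.
    let mut dispatchable = Vec::new();
    let mut errors = Vec::new();
    for endpoint in endpoints {
        if counts[&endpoint.bmc_ip] > 1 {
            let message = format!("BMC IP {} is shared by multiple machines", endpoint.bmc_ip);
            errors.push((endpoint.machine_id, message));
        } else {
            dispatchable.push(endpoint);
        }
    }
    (dispatchable, errors)
}
```

With this split, a one-to-one map from BMC IP to machine ID can only ever be built from `dispatchable`, so backend results cannot be attributed to the wrong machine.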



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3d98a856-b97a-4c69-aa11-1a5dde0f6a21

📥 Commits

Reviewing files that changed from the base of the PR and between 0082abd and 3e4bbc4.

⛔ Files ignored due to path filters (4)

Cargo.lock is excluded by !**/*.lock
rest-api/flow/internal/nicoapi/gen/nico.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
rest-api/workflow-schema/site-agent/workflows/v1/nico_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto

📒 Files selected for processing (30)

crates/api-core/src/handlers/component_manager.rs
crates/api-core/src/handlers/rack.rs
crates/api-core/src/setup.rs
crates/api-core/src/tests/common/api_fixtures/mod.rs
crates/api-core/src/tests/component_manager_compute_power.rs
crates/api-core/src/tests/machine_power.rs
crates/api-core/src/tests/machine_states.rs
crates/api-core/src/tests/mod.rs
crates/api-core/src/tests/rack_state_controller/handler.rs
crates/api-db/src/power_options.rs
crates/api-model/src/machine/mod.rs
crates/api-model/src/rack.rs
crates/component-manager/src/component_manager.rs
crates/component-manager/src/compute_tray_manager.rs
crates/component-manager/src/config.rs
crates/component-manager/src/core_compute_manager.rs
crates/component-manager/src/rms.rs
crates/machine-controller/Cargo.toml
crates/machine-controller/src/context.rs
crates/machine-controller/src/handler.rs
crates/machine-controller/src/redfish.rs
crates/rack-controller/Cargo.toml
crates/rack-controller/src/context.rs
crates/rack-controller/src/io.rs
crates/rack-controller/src/maintenance.rs
crates/rack-controller/src/ready.rs
crates/rpc/proto/forge.proto
rest-api/flow/docs/component-manager-config.md
rest-api/flow/internal/nicoapi/nicoproto/nico.proto
rest-api/flow/internal/task/componentmanager/compute/nico/nico.go

coderabbitai · 2026-06-28T20:23:08Z

+        // Keep the rack row locked while installing the shared rack token so
+        // a losing concurrent request can never delete or replace the token
+        // owned by the request that actually schedules maintenance.
+        if let Some(token) = maintenance_access_token.as_ref()
            && let Err(error) = api
                .credential_manager
-                .delete_credentials(&rack_maintenance_access_token_key(&rack_id))
+                .set_credentials(
+                    &rack_maintenance_access_token_key(&rack_id),
+                    &Credentials::UsernamePassword {
+                        username: "access_token".into(),
+                        password: token.clone(),
+                    },
+                )
                .await
        {


🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Do not await credential storage while holding the rack transaction.

This keeps the rack row lock and DB connection open while credential_manager.set_credentials runs. If the credential backend is slow or unavailable, maintenance scheduling for this rack can block behind an external dependency. Use a request-scoped credential key or another two-phase design so the DB transaction only covers DB state. As per coding guidelines, “Do not hold a transaction open while doing long-running work.” As per path instructions, API/database changes should be reviewed for transaction safety.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/handlers/rack.rs` around lines 831 - 845, The maintenance scheduling flow in rack.rs is holding the rack transaction open while awaiting credential_manager.set_credentials, which can block DB locks on an external dependency. Move credential storage out of the transaction scope in the maintenance scheduling path (around the rack token handling in the maintenance_access_token block), or switch to a request-scoped/two-phase key so the DB transaction only covers rack state changes. Keep the transaction limited to the DB update logic and perform the credential backend write after the transaction has been committed.
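
A sketch of the request-scoped key idea mentioned in this comment, under stated assumptions. Each request writes its token under its own key before any transaction is opened, a short transaction then records which request won the rack, and a losing request deletes only its own key. `World`, `schedule_maintenance`, `request_scoped_key`, and the key format are all hypothetical, and the database and credential backend are modeled as in-memory maps.

```rust
use std::collections::HashMap;

/// In-memory stand-ins: `rack_owner` plays the rack row, `credentials` the
/// external credential backend.
#[derive(Default)]
pub struct World {
    /// rack_id -> request_id that scheduled maintenance.
    pub rack_owner: HashMap<String, String>,
    /// credential key -> token.
    pub credentials: HashMap<String, String>,
}

/// Hypothetical key layout; the real key function takes only the rack ID.
fn request_scoped_key(rack_id: &str, request_id: &str) -> String {
    format!("rack/{rack_id}/maintenance-token/{request_id}")
}

/// Returns true if this request scheduled maintenance.
pub fn schedule_maintenance(
    world: &mut World,
    rack_id: &str,
    request_id: &str,
    token: &str,
) -> bool {
    // Phase 1 (no transaction held): store the token under a key that only
    // this request uses, so it cannot clobber another request's token.
    let key = request_scoped_key(rack_id, request_id);
    world.credentials.insert(key.clone(), token.to_string());

    // Phase 2 (short transaction): record the winner if the rack is free.
    let won = if world.rack_owner.contains_key(rack_id) {
        false
    } else {
        world.rack_owner.insert(rack_id.to_string(), request_id.to_string());
        true
    };

    // Cleanup (after commit): a losing request removes only its own key.
    if !won {
        world.credentials.remove(&key);
    }
    won
}
```

The trade-off is that readers must look up the winning request ID on the rack row before they can derive the token key, which is extra plumbing compared with the single shared key used in the diff above.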

Sources: Coding guidelines, Path instructions

coderabbitai · 2026-06-28T20:23:08Z

+            core_compute_tray_manager: Arc::new(
+                component_manager::core_compute_manager::CoreComputeTrayManager::new(
+                    self.redfish_sim.clone(),
+                ),
+            ),
+            component_manager: self.test_component_manager.clone(),
+            credential_reader: self.test_credential_manager.clone(),


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use the composite credential reader for controller-path tests.

The fixture builds a CompositeCredentialManager from test_static_credential_snapshot() plus credential_manager, but both machine-state service constructors inject only credential_manager.clone(). That makes controller-path credential resolution narrower than the API path and can break tests that rely on CredentialKeys present only in the static snapshot.

Also applies to: 1511-1517

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/tests/common/api_fixtures/mod.rs` around lines 337 - 343, The controller-path test fixture is using only the base credential manager, so it misses credentials available through the composite snapshot-based reader. Update the relevant constructors in the test fixture to inject the composite credential reader built from test_static_credential_snapshot() plus credential_manager, matching the API path behavior; use the existing CompositeCredentialManager and credential_reader symbols to locate the affected setup.
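
To illustrate why the narrower reader matters, here is a minimal composite reader that consults several sources in order. The trait `CredentialReader`, `MapReader`, and `CompositeReader` are hypothetical simplifications (the fixture's real type is `CompositeCredentialManager`), and the lookup order with the static snapshot first is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous reduction of a credential reader.
pub trait CredentialReader {
    fn get(&self, key: &str) -> Option<String>;
}

/// A reader backed by a plain map, used for both sources in this sketch.
pub struct MapReader(pub HashMap<String, String>);

impl CredentialReader for MapReader {
    fn get(&self, key: &str) -> Option<String> {
        self.0.get(key).cloned()
    }
}

/// Tries each reader in order and returns the first hit.
pub struct CompositeReader {
    readers: Vec<Box<dyn CredentialReader>>,
}

impl CompositeReader {
    pub fn new(readers: Vec<Box<dyn CredentialReader>>) -> Self {
        Self { readers }
    }
}

impl CredentialReader for CompositeReader {
    fn get(&self, key: &str) -> Option<String> {
        self.readers.iter().find_map(|reader| reader.get(key))
    }
}
```

A key that exists only in the static snapshot resolves through the composite but not through the base manager alone, which is the gap between the API path and the controller path described above.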

coderabbitai · 2026-06-28T20:23:08Z

+                services
+                    .power_control(&state.host_snapshot, PowerAction::ForceOff)
+                    .await?;
+
+                Ok(StateHandlerOutcome::transition(scenario.actual_new_state(
+                    HostReprovisionState::InitialReset {
+                        phase: InitialResetPhase::WaitingForHostOff,
+                        last_time: Utc::now(),
+                    },
+                    state.managed_state.get_host_repro_retry_count(),
+                )))
+            }
+            InitialResetPhase::WaitingForHostOff => {
+                let redfish_client = services
+                    .create_redfish_client_from_machine(&state.host_snapshot)
+                    .await?;
                let status = get_power_state(redfish_client.as_ref()).await?;


🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Avoid dispatching RMS power before direct BMC polling.

services.power_control can select RMS for rack machines, but the next phases immediately use direct Redfish reads. That contradicts the non-Core path assumption below that rack BMCs may only be reachable through RMS, so this can power off successfully and then fail/stick while polling. Either keep this initial-reset flow on Core/direct-Redfish end-to-end, or add a backend-owned status/reset path before routing rack machines through RMS here.

Also applies to: 8756-8772

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler.rs` around lines 8683 - 8699, The InitialReset flow in the host state handler is mixing backends by calling services.power_control before immediately polling with create_redfish_client_from_machine/get_power_state, which can route rack machines through RMS and then fail on direct BMC reads. Update the InitialReset path in handler::StateHandlerOutcome/state transition logic so it stays on Core/direct Redfish end-to-end, or switch to a backend-owned reset/status path before any RMS-driven power action; make the same fix in the other matching InitialReset block referenced by the review.
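
The "backend-owned status path" option can be sketched as a phase machine in which the same backend object both dispatches the action and answers the power-state poll, so a machine routed through RMS is never polled over direct Redfish. The phase names `WaitingForHostOff` and `WaitingForHostOn` come from this PR; the `PowerBackend` trait, the `Start` and `Done` phases, the power-on step, and `FakeBackend` are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PowerState { On, Off }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PowerAction { ForceOff, On }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Phase { Start, WaitingForHostOff, WaitingForHostOn, Done }

/// Hypothetical backend that owns both the action and the status read.
pub trait PowerBackend {
    fn power_control(&mut self, action: PowerAction);
    fn power_state(&self) -> PowerState;
}

/// Advances the reset by one step. Waiting phases only poll; they never
/// re-send the action that an earlier phase already dispatched.
pub fn step(phase: Phase, backend: &mut dyn PowerBackend) -> Phase {
    match phase {
        Phase::Start => {
            backend.power_control(PowerAction::ForceOff);
            Phase::WaitingForHostOff
        }
        Phase::WaitingForHostOff => match backend.power_state() {
            PowerState::Off => {
                backend.power_control(PowerAction::On);
                Phase::WaitingForHostOn
            }
            PowerState::On => Phase::WaitingForHostOff,
        },
        Phase::WaitingForHostOn => match backend.power_state() {
            PowerState::On => Phase::Done,
            PowerState::Off => Phase::WaitingForHostOn,
        },
        Phase::Done => Phase::Done,
    }
}

/// Test double that applies actions immediately and records them.
pub struct FakeBackend {
    pub state: PowerState,
    pub dispatched: Vec<PowerAction>,
}

impl PowerBackend for FakeBackend {
    fn power_control(&mut self, action: PowerAction) {
        self.dispatched.push(action);
        self.state = match action {
            PowerAction::ForceOff => PowerState::Off,
            PowerAction::On => PowerState::On,
        };
    }
    fn power_state(&self) -> PowerState {
        self.state
    }
}
```

Because `step` takes the backend as a trait object, a Core implementation and an RMS implementation can share the flow, and neither needs to know how the other reaches the BMC.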

coderabbitai · 2026-06-28T20:23:08Z

+    if action == SystemPowerControl::ACPowercycle
+        && !redfish_client.ac_powercycle_supported_by_power()
+    {
+        action = SystemPowerControl::ForceOff;
+    }
+
+    if action == SystemPowerControl::ACPowercycle && power_state != libredfish::PowerState::Off {
+        tracing::warn!(
+            machine_id = %machine.id,
+            %power_state,
+            "ACPowercycle requires chassis to be Off, forcing off first"
+        );
+        ctx.services
+            .power_control_with_manager(machine, backend, PowerAction::ForceOff)
+            .await?;
+    }
+
+    dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await


🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Propagate the effective power action into DPU bookkeeping.

When Core downgrades unsupported ACPowercycle to ForceOff, the host write records ForceOff, but the later DPU timestamp loop still uses the original ACPowercycle. Return the effective action from dispatch_core_host_power_control and assign it back before writing DPU reboot timestamps.

Proposed fix

 async fn dispatch_core_host_power_control(
@@
-) -> Result<(), StateHandlerError> {
+) -> Result<SystemPowerControl, StateHandlerError> {
@@
-    dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await
+    dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await?;
+    Ok(action)
 }
@@
-    dispatch_core_host_power_control(
+    action = dispatch_core_host_power_control(
         machine,
         backend.as_ref(),
         redfish_client.as_ref(),
         power_state,
         action,
         ctx,
         location,
     )
     .await?;

Also applies to: 10389-10397

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/machine-controller/src/handler.rs` around lines 10302 - 10319, In dispatch_core_host_power_control, the current action can be downgraded from ACPowercycle to ForceOff, but the later DPU bookkeeping still sees the original value. Return the effective power action from dispatch_core_host_power_control, capture that result at the call site, and use it before the DPU reboot timestamp loop so the recorded action matches the one actually executed; keep the existing ACPowercycle fallback and the dispatch_component_manager_power_control path in sync.

github-actions · 2026-06-28T21:34:45Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	285	6	25	103	7	144
machine-validation-runner	748	30	189	272	36	221
machine_validation	748	30	189	272	36	221
machine_validation-aarch64	748	30	189	272	36	221
TOTAL	2535	96	592	925	115	807

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

feat: route machine power through component backends

3e4bbc4

Signed-off-by: Hasan Khan <hasank@nvidia.com>

osu requested a review from a team as a code owner June 28, 2026 20:12

osu self-assigned this Jun 28, 2026

osu marked this pull request as draft June 28, 2026 20:16

coderabbitai Bot reviewed Jun 28, 2026

osu wants to merge 1 commit into NVIDIA:main from osu:feat/2053-machine-component-backends
