Skip to content

feat: route machine lifecycle power through component backends#2959

Draft
osu wants to merge 1 commit into
NVIDIA:mainfrom
osu:feat/2053-machine-component-backends
Draft

feat: route machine lifecycle power through component backends#2959
osu wants to merge 1 commit into
NVIDIA:mainfrom
osu:feat/2053-machine-component-backends

Conversation

@osu

@osu osu commented Jun 28, 2026

Copy link
Copy Markdown
Member

Route compute-machine lifecycle power operations through the Component Manager abstraction.

  • add a Core compute-tray adapter that preserves the existing Redfish/IPMI behavior for non-rack machines
  • dispatch rack machines configured for RMS through exact compute-tray power actions, including BMC authentication, vendor, and custom-port data
  • route RMS rack requests through a durable rack-maintenance state machine unless the caller explicitly bypasses state control
  • make rack dispatch crash-safe and at-most-once, with scoped desired-power bookkeeping, compare-and-swap updates, result reconciliation, and cleanup
  • move automated machine lifecycle host-power actions onto the same backend interface while preserving Core fallback behavior
  • align public model, protobuf, generated Go, and Flow documentation with the new routing contract

Related issues

Closes #2053

Addresses #2421
Addresses #2679

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Validated with:

  • cargo test -p component-manager --lib -- --test-threads=1 (81 passed)
  • cargo test -p carbide-machine-controller --lib (39 passed)
  • cargo test -p carbide-rack-controller --lib (35 passed)
  • focused carbide-api-core database-backed integration tests for standalone Core routing under global RMS, rack queue/bypass dispatch, stale desired-state versions, crash recovery, and backend-failure rollback
  • focused lifecycle regressions for the phased RMS initial reset (including non-redispatch polling), unavailable/outdated DPUs, DPU reset, and instance creation
  • cargo test -p carbide-api-model rack::tests::should_run (3 passed)
  • cargo make --no-workspace generate-rest-core-proto
  • cargo clippy with warnings denied for the affected Component Manager, controller, API-core, and API-model targets
  • cargo +nightly fmt --all -- --check
  • Linux cargo test --locked --no-run for all four affected crates

Additional Notes

The durable rack dispatch marker intentionally fails a recovered in-flight request closed instead of replaying non-idempotent restart or AC-cycle actions.

Signed-off-by: Hasan Khan <hasank@nvidia.com>
@osu osu requested a review from a team as a code owner June 28, 2026 20:12
@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Summary by CodeRabbit

  • New Features

    • Added support for more precise compute power routing across standalone and rack-associated machines.
    • Improved rack maintenance handling for scoped power control, including safer retry behavior and state tracking.
    • Added support for additional hardware/vendor configurations and BMC port handling.
  • Bug Fixes

    • Prevented duplicate machine power requests from being sent more than once.
    • Fixed stale power-state updates and improved concurrency handling during power option changes.
    • Improved recovery when backend power actions fail, preserving prior machine state.
  • Documentation

    • Updated power-routing guidance and configuration notes for compute and rack operations.

Walkthrough

This PR unifies compute-tray power control through the Component Manager backend interface. Per-machine requests are classified into three routing paths—Core Redfish direct, configured RMS direct, or queued rack-state-controller—with a durable dispatch marker preventing at-most-once replay. A full PowerControl sub-state-machine is added to the rack maintenance flow, and optimistic concurrency is enforced on power-option updates.

Changes

Compute Power Routing via Component Manager Backends

Layer / File(s) Summary
ComputeTrayEndpoint model and rack power-control persisted types
crates/component-manager/src/compute_tray_manager.rs, crates/api-model/src/rack.rs, crates/api-model/src/machine/mod.rs
ComputeTrayEndpoint replaces bmc_credentials with ComputeTrayAuthentication (Credentials/CredentialKey) and adds bmc_port. Rack model gains RackPowerControlState/Result/Target persisted types, power_control_dispatch_started_at durable marker on RackConfig, and MaintenanceScope::should_run now requires explicit opt-in for PowerControl activities. InitialResetPhase adds WaitingForHostOff/WaitingForHostOn sub-states.
Core and RMS backend adaptors
crates/component-manager/src/core_compute_manager.rs, crates/component-manager/src/rms.rs, crates/component-manager/src/config.rs, crates/component-manager/src/component_manager.rs
CoreComputeTrayManager::power_control derives auth from ComputeTrayAuthentication and passes bmc_port. build_compute_tray_node_info returns Result and rejects credential-key auth in RMS. Adds to_rms_compute_power_operation preserving Redfish reset semantics. LenovoAmi vendor variant added with explicit None mapping in Core.
Machine controller routing and handler_host_power_control unification
crates/machine-controller/src/context.rs, crates/machine-controller/src/handler.rs, crates/machine-controller/src/redfish.rs, crates/machine-controller/Cargo.toml
Adds ComputeTrayRoute selection, credential-loading compute_tray_endpoint, and power_control/power_control_with_manager with result validation to MachineStateHandlerServices. Rewrites handler_host_power_control_with_location to route non-core backends through component-manager dispatch. Standardizes BOSS, restart-verification, DPU, HGX/BMC, and platform-config power calls through handler_host_power_control.
Rack maintenance PowerControl sub-state-machine
crates/rack-controller/src/maintenance.rs, crates/rack-controller/src/ready.rs, crates/rack-controller/src/io.rs, crates/rack-controller/Cargo.toml
Full compute-tray power-control subsystem: credential resolution, endpoint pairing with BMC-IP drift detection, durable dispatch-marker claiming via DB row lock, health-report override preparation, per-target desired-power-state updates with rollback on failure, backend dispatch with result reconciliation, and best-effort re-exploration. Integrated as Preparing/Prepared/Finalizing sub-states in handle_maintenance.
API handler routing pipeline and rack maintenance scheduling fix
crates/api-core/src/handlers/component_manager.rs, crates/api-core/src/handlers/rack.rs
Replaces monolithic direct-dispatch with classify_compute_power_targets (CoreDirect/ConfiguredDirect/RackStateController), dispatch_compute_power_direct with health-override and reconciliation, and queue_compute_power_control_via_rack_state_controller. on_demand_rack_maintenance now re-reads and row-locks the rack inside a transaction before validation, with revised token error/commit semantics.
Power options optimistic concurrency
crates/api-db/src/power_options.rs
update_desired_state matches on current desired_power_state_version in the WHERE clause; maps RowNotFound to ConcurrentModificationError.
Service wiring and test infrastructure
crates/api-core/src/setup.rs, crates/api-core/src/tests/common/api_fixtures/mod.rs, crates/api-core/src/tests/mod.rs
Wires core_compute_tray_manager, component_manager, and credential_reader into MachineStateHandlerServices and RackStateHandlerServices. Adds component_manager_config override to TestEnvOverrides.
Tests
crates/api-core/src/tests/component_manager_compute_power.rs, crates/api-core/src/tests/machine_states.rs, crates/api-core/src/tests/machine_power.rs, crates/api-core/src/tests/rack_state_controller/handler.rs, crates/rack-controller/src/maintenance.rs, crates/component-manager/src/...
API-level routing tests (standalone bypass, rack queue, rack bypass direct); rack-controller dispatch-marker idempotency and backend-failure restoration tests; machine-states RMS routing and InitialResetPhase polling tests; stale-version rejection test; maintenance unit tests for scope helpers, reconciliation, and state transitions; backend unit tests for vendor mapping and port handling.
Proto and documentation
crates/rpc/proto/forge.proto, rest-api/flow/internal/nicoapi/nicoproto/nico.proto, rest-api/flow/docs/component-manager-config.md, rest-api/flow/internal/task/componentmanager/compute/nico/nico.go
Updates bypass_state_controller field comments and operator documentation to reflect new per-path routing behavior.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant component_manager_handler
  participant classify_compute_power_targets
  participant dispatch_compute_power_direct
  participant ComputeTrayManager
  participant RackStateController
  participant rack_maintenance

  Client->>component_manager_handler: ComponentPowerControlRequest(machine_ids, bypass)
  component_manager_handler->>classify_compute_power_targets: classify machines
  classify_compute_power_targets-->>component_manager_handler: CoreDirect / ConfiguredDirect / RackStateController buckets

  rect rgba(100, 180, 255, 0.5)
    Note over component_manager_handler,ComputeTrayManager: Direct dispatch path
    component_manager_handler->>dispatch_compute_power_direct: endpoints + action
    dispatch_compute_power_direct->>ComputeTrayManager: power_control(endpoints)
    ComputeTrayManager-->>dispatch_compute_power_direct: per-BMC results
    dispatch_compute_power_direct-->>component_manager_handler: ComponentResults
  end

  rect rgba(255, 160, 80, 0.5)
    Note over component_manager_handler,rack_maintenance: Queued rack path
    component_manager_handler->>RackStateController: queue_compute_power_control (DB row lock, set maintenance_requested)
    RackStateController-->>component_manager_handler: queued ComponentResults
    RackStateController->>rack_maintenance: PowerControl Preparing → Prepared → dispatch → Finalizing
    rack_maintenance->>ComputeTrayManager: power_control(scoped endpoints)
    ComputeTrayManager-->>rack_maintenance: per-BMC results
    rack_maintenance-->>RackStateController: finalized desired-state & health overrides
  end

  component_manager_handler-->>Client: ComponentPowerControlResponse
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.90% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: routing machine lifecycle power through component backends.
Description check ✅ Passed The description matches the changeset and explains the backend routing, durability, and lifecycle-power refactor.
Linked Issues check ✅ Passed The PR aligns with #2053 by routing rack machines through RMS, non-rack machines through Core, and moving internal lifecycle power onto the backend interface.
Out of Scope Changes check ✅ Passed The changes are concentrated on backend routing, durability, models, tests, and docs; no clear unrelated scope creep is evident.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@osu osu self-assigned this Jun 28, 2026
@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-28 20:15:02 UTC | Commit: 3e4bbc4

@osu osu marked this pull request as draft June 28, 2026 20:16
@copy-pr-bot

copy-pr-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/api-core/src/tests/common/api_fixtures/mod.rs (1)

1366-1382: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Pass the Redfish simulator into build_component_manager for Core compute overrides.

TestEnvOverrides::component_manager_config now exposes the backend choice, but this fixture always calls build_component_manager(..., None) for the Redfish pool. Any test that sets compute_tray_backend: Core will fail during env construction even though redfish_sim is already available in this scope.

Proposed fix
     let test_component_manager = component_manager::component_manager::build_component_manager(
         &component_manager_config,
         component_manager_rack_profiles,
         rms_sim.as_rms_client(),
         None,
         Some(db_pool.clone()),
-        None,
+        Some(redfish_sim.clone()),
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/tests/common/api_fixtures/mod.rs` around lines 1366 -
1382, The `build_component_manager` call in the test fixture is still hardcoding
the Redfish pool argument to `None`, which breaks env construction when
`TestEnvOverrides::component_manager_config` selects `compute_tray_backend:
Core`. Update the fixture so the existing `redfish_sim` in scope is passed
through to `component_manager::component_manager::build_component_manager` when
building `test_component_manager`, while preserving the current
`rms_sim.as_rms_client()` and `db_pool` wiring. Use the
`build_component_manager` and `TestEnvOverrides::component_manager_config`
symbols to locate the change.
crates/api-core/src/handlers/component_manager.rs (1)

1374-1571: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Reject duplicate BMC IPs before direct dispatch.

ip_to_machine_id keeps only one machine per BMC IP, while dispatch_endpoints can still contain multiple endpoints with that same IP. In that case, direct dispatch can issue duplicate power operations and then attribute backend results to the wrong machine. Mirror the rack-controller guard by detecting duplicate BMC IPs during endpoint resolution and returning per-machine errors instead of dispatching them.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/component_manager.rs` around lines 1374 - 1571,
Duplicate BMC IPs are still allowed into the direct-dispatch path, which can
make `dispatch_compute_power_direct` send repeated power actions and
misattribute results because `ip_to_machine_id` only retains one machine per IP.
Add a duplicate-IP check while building endpoints in
`resolve_compute_tray_endpoints` (or immediately before dispatch) so repeated
BMC IPs are rejected with per-machine `error_result` entries instead of being
added to `ResolvedComputeTrayEndpoints` and `dispatch_endpoints`. Use the
existing `resolve_compute_tray_endpoints`, `dispatch_compute_power_direct`, and
`ip_to_machine_id` flow as the place to mirror the rack-controller guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-core/src/handlers/rack.rs`:
- Around line 831-845: The maintenance scheduling flow in rack.rs is holding the
rack transaction open while awaiting credential_manager.set_credentials, which
can block DB locks on an external dependency. Move credential storage out of the
transaction scope in the maintenance scheduling path (around the rack token
handling in the maintenance_access_token block), or switch to a
request-scoped/two-phase key so the DB transaction only covers rack state
changes. Keep the transaction limited to the DB update logic and perform the
credential backend write after the transaction has been committed.

In `@crates/api-core/src/tests/common/api_fixtures/mod.rs`:
- Around line 337-343: The controller-path test fixture is using only the base
credential manager, so it misses credentials available through the composite
snapshot-based reader. Update the relevant constructors in the test fixture to
inject the composite credential reader built from
test_static_credential_snapshot() plus credential_manager, matching the API path
behavior; use the existing CompositeCredentialManager and credential_reader
symbols to locate the affected setup.

In `@crates/machine-controller/src/handler.rs`:
- Around line 8683-8699: The InitialReset flow in the host state handler is
mixing backends by calling services.power_control before immediately polling
with create_redfish_client_from_machine/get_power_state, which can route rack
machines through RMS and then fail on direct BMC reads. Update the InitialReset
path in handler::StateHandlerOutcome/state transition logic so it stays on
Core/direct Redfish end-to-end, or switch to a backend-owned reset/status path
before any RMS-driven power action; make the same fix in the other matching
InitialReset block referenced by the review.
- Around line 10302-10319: In dispatch_core_host_power_control, the current
action can be downgraded from ACPowercycle to ForceOff, but the later DPU
bookkeeping still sees the original value. Return the effective power action
from dispatch_core_host_power_control, capture that result at the call site, and
use it before the DPU reboot timestamp loop so the recorded action matches the
one actually executed; keep the existing ACPowercycle fallback and the
dispatch_component_manager_power_control path in sync.

---

Outside diff comments:
In `@crates/api-core/src/handlers/component_manager.rs`:
- Around line 1374-1571: Duplicate BMC IPs are still allowed into the
direct-dispatch path, which can make `dispatch_compute_power_direct` send
repeated power actions and misattribute results because `ip_to_machine_id` only
retains one machine per IP. Add a duplicate-IP check while building endpoints in
`resolve_compute_tray_endpoints` (or immediately before dispatch) so repeated
BMC IPs are rejected with per-machine `error_result` entries instead of being
added to `ResolvedComputeTrayEndpoints` and `dispatch_endpoints`. Use the
existing `resolve_compute_tray_endpoints`, `dispatch_compute_power_direct`, and
`ip_to_machine_id` flow as the place to mirror the rack-controller guard.

In `@crates/api-core/src/tests/common/api_fixtures/mod.rs`:
- Around line 1366-1382: The `build_component_manager` call in the test fixture
is still hardcoding the Redfish pool argument to `None`, which breaks env
construction when `TestEnvOverrides::component_manager_config` selects
`compute_tray_backend: Core`. Update the fixture so the existing `redfish_sim`
in scope is passed through to
`component_manager::component_manager::build_component_manager` when building
`test_component_manager`, while preserving the current `rms_sim.as_rms_client()`
and `db_pool` wiring. Use the `build_component_manager` and
`TestEnvOverrides::component_manager_config` symbols to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3d98a856-b97a-4c69-aa11-1a5dde0f6a21

📥 Commits

Reviewing files that changed from the base of the PR and between 0082abd and 3e4bbc4.

⛔ Files ignored due to path filters (4)
  • Cargo.lock is excluded by !**/*.lock
  • rest-api/flow/internal/nicoapi/gen/nico.pb.go is excluded by !**/*.pb.go, !**/gen/**, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/schema/site-agent/workflows/v1/nico_nico.pb.go is excluded by !**/*.pb.go, !rest-api/**/*.pb.go
  • rest-api/workflow-schema/site-agent/workflows/v1/nico_nico.proto is excluded by !rest-api/workflow-schema/site-agent/workflows/v1/*_nico.proto
📒 Files selected for processing (30)
  • crates/api-core/src/handlers/component_manager.rs
  • crates/api-core/src/handlers/rack.rs
  • crates/api-core/src/setup.rs
  • crates/api-core/src/tests/common/api_fixtures/mod.rs
  • crates/api-core/src/tests/component_manager_compute_power.rs
  • crates/api-core/src/tests/machine_power.rs
  • crates/api-core/src/tests/machine_states.rs
  • crates/api-core/src/tests/mod.rs
  • crates/api-core/src/tests/rack_state_controller/handler.rs
  • crates/api-db/src/power_options.rs
  • crates/api-model/src/machine/mod.rs
  • crates/api-model/src/rack.rs
  • crates/component-manager/src/component_manager.rs
  • crates/component-manager/src/compute_tray_manager.rs
  • crates/component-manager/src/config.rs
  • crates/component-manager/src/core_compute_manager.rs
  • crates/component-manager/src/rms.rs
  • crates/machine-controller/Cargo.toml
  • crates/machine-controller/src/context.rs
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/redfish.rs
  • crates/rack-controller/Cargo.toml
  • crates/rack-controller/src/context.rs
  • crates/rack-controller/src/io.rs
  • crates/rack-controller/src/maintenance.rs
  • crates/rack-controller/src/ready.rs
  • crates/rpc/proto/forge.proto
  • rest-api/flow/docs/component-manager-config.md
  • rest-api/flow/internal/nicoapi/nicoproto/nico.proto
  • rest-api/flow/internal/task/componentmanager/compute/nico/nico.go

Comment on lines +831 to 845
// Keep the rack row locked while installing the shared rack token so
// a losing concurrent request can never delete or replace the token
// owned by the request that actually schedules maintenance.
if let Some(token) = maintenance_access_token.as_ref()
&& let Err(error) = api
.credential_manager
.delete_credentials(&rack_maintenance_access_token_key(&rack_id))
.set_credentials(
&rack_maintenance_access_token_key(&rack_id),
&Credentials::UsernamePassword {
username: "access_token".into(),
password: token.clone(),
},
)
.await
{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Do not await credential storage while holding the rack transaction.

This keeps the rack row lock and DB connection open while credential_manager.set_credentials runs. If the credential backend is slow or unavailable, maintenance scheduling for this rack can block behind an external dependency. Use a request-scoped credential key or another two-phase design so the DB transaction only covers DB state. As per coding guidelines, “Do not hold a transaction open while doing long-running work.” As per path instructions, API/database changes should be reviewed for transaction safety.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/handlers/rack.rs` around lines 831 - 845, The maintenance
scheduling flow in rack.rs is holding the rack transaction open while awaiting
credential_manager.set_credentials, which can block DB locks on an external
dependency. Move credential storage out of the transaction scope in the
maintenance scheduling path (around the rack token handling in the
maintenance_access_token block), or switch to a request-scoped/two-phase key so
the DB transaction only covers rack state changes. Keep the transaction limited
to the DB update logic and perform the credential backend write after the
transaction has been committed.

Sources: Coding guidelines, Path instructions

Comment on lines +337 to +343
core_compute_tray_manager: Arc::new(
component_manager::core_compute_manager::CoreComputeTrayManager::new(
self.redfish_sim.clone(),
),
),
component_manager: self.test_component_manager.clone(),
credential_reader: self.test_credential_manager.clone(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Use the composite credential reader for controller-path tests.

The fixture builds a CompositeCredentialManager from test_static_credential_snapshot() plus credential_manager, but both machine-state service constructors inject only credential_manager.clone(). That makes controller-path credential resolution narrower than the API path and can break tests that rely on CredentialKeys present only in the static snapshot.

Also applies to: 1511-1517

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/tests/common/api_fixtures/mod.rs` around lines 337 - 343,
The controller-path test fixture is using only the base credential manager, so
it misses credentials available through the composite snapshot-based reader.
Update the relevant constructors in the test fixture to inject the composite
credential reader built from test_static_credential_snapshot() plus
credential_manager, matching the API path behavior; use the existing
CompositeCredentialManager and credential_reader symbols to locate the affected
setup.

Comment on lines +8683 to 8699
services
.power_control(&state.host_snapshot, PowerAction::ForceOff)
.await?;

Ok(StateHandlerOutcome::transition(scenario.actual_new_state(
HostReprovisionState::InitialReset {
phase: InitialResetPhase::WaitingForHostOff,
last_time: Utc::now(),
},
state.managed_state.get_host_repro_retry_count(),
)))
}
InitialResetPhase::WaitingForHostOff => {
let redfish_client = services
.create_redfish_client_from_machine(&state.host_snapshot)
.await?;
let status = get_power_state(redfish_client.as_ref()).await?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Avoid dispatching RMS power before direct BMC polling.

services.power_control can select RMS for rack machines, but the next phases immediately use direct Redfish reads. That contradicts the non-Core path assumption below that rack BMCs may only be reachable through RMS, so this can power off successfully and then fail/stick while polling. Either keep this initial-reset flow on Core/direct-Redfish end-to-end, or add a backend-owned status/reset path before routing rack machines through RMS here.

Also applies to: 8756-8772

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 8683 - 8699, The
InitialReset flow in the host state handler is mixing backends by calling
services.power_control before immediately polling with
create_redfish_client_from_machine/get_power_state, which can route rack
machines through RMS and then fail on direct BMC reads. Update the InitialReset
path in handler::StateHandlerOutcome/state transition logic so it stays on
Core/direct Redfish end-to-end, or switch to a backend-owned reset/status path
before any RMS-driven power action; make the same fix in the other matching
InitialReset block referenced by the review.

Comment on lines +10302 to +10319
if action == SystemPowerControl::ACPowercycle
&& !redfish_client.ac_powercycle_supported_by_power()
{
action = SystemPowerControl::ForceOff;
}

if action == SystemPowerControl::ACPowercycle && power_state != libredfish::PowerState::Off {
tracing::warn!(
machine_id = %machine.id,
%power_state,
"ACPowercycle requires chassis to be Off, forcing off first"
);
ctx.services
.power_control_with_manager(machine, backend, PowerAction::ForceOff)
.await?;
}

dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Propagate the effective power action into DPU bookkeeping.

When Core downgrades unsupported ACPowercycle to ForceOff, the host write records ForceOff, but the later DPU timestamp loop still uses the original ACPowercycle. Return the effective action from dispatch_core_host_power_control and assign it back before writing DPU reboot timestamps.

Proposed fix
 async fn dispatch_core_host_power_control(
@@
-) -> Result<(), StateHandlerError> {
+) -> Result<SystemPowerControl, StateHandlerError> {
@@
-    dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await
+    dispatch_component_manager_power_control(machine, backend, action, ctx, trigger_location).await?;
+    Ok(action)
 }
@@
-                dispatch_core_host_power_control(
+                action = dispatch_core_host_power_control(
                     machine,
                     backend.as_ref(),
                     redfish_client.as_ref(),
                     power_state,
                     action,
                     ctx,
                     location,
                 )
                 .await?;

Also applies to: 10389-10397

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/machine-controller/src/handler.rs` around lines 10302 - 10319, In
dispatch_core_host_power_control, the current action can be downgraded from
ACPowercycle to ForceOff, but the later DPU bookkeeping still sees the original
value. Return the effective power action from dispatch_core_host_power_control,
capture that result at the call site, and use it before the DPU reboot timestamp
loop so the recorded action matches the one actually executed; keep the existing
ACPowercycle fallback and the dispatch_component_manager_power_control path in
sync.

@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 285 6 25 103 7 144
machine-validation-runner 748 30 189 272 36 221
machine_validation 748 30 189 272 36 221
machine_validation-aarch64 748 30 189 272 36 221
TOTAL 2535 96 592 925 115 807

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Conduct Machine Lifecycle Operations via Component Manager Backends

1 participant