fix(ssh-console): recover conflicting IPMI SOL sessions#2956
Conversation
Signed-off-by: Hasan Khan <hasank@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Summary by CodeRabbit
WalkthroughAutomatic recovery from conflicting IPMI SOL sessions is implemented by detecting the "SOL payload already active" marker in bounded ChangesIPMI SOL Conflict Auto-Recovery
Sequence Diagram(s)sequenceDiagram
participant Client as BmcClient retry loop
participant Proxy as IpmitoolMessageProxy
participant Ipmitool as ipmitool process
participant BMC as BMC (IPMI LAN)
Client->>Proxy: spawn(sol activate)
Proxy->>Ipmitool: exec ipmitool sol activate
Ipmitool->>BMC: SOL activate request
BMC-->>Ipmitool: SOL payload already active
Ipmitool-->>Proxy: unexpected exit (exit_status=1)
Proxy->>Proxy: detect "SOL payload already active" marker
Proxy->>BMC: ipmitool sol deactivate
BMC-->>Proxy: deactivate success
Proxy-->>Client: SpawnError::ConflictingSolSessionDeactivated (retry_immediately=true)
Client->>Client: should_reset_retry_backoff → true (first SOL recovery)
Client->>Client: reset backoff, clear flag
Client->>Proxy: spawn(sol activate) immediately
Proxy->>Ipmitool: exec ipmitool sol activate
Ipmitool->>BMC: SOL activate request
BMC-->>Ipmitool: SOL Session operational
Proxy-->>Client: connection established
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
crates/ssh-console/src/bmc/client.rs (1)
276-276: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winEmit
connection_timeas a structured tracing field.This keeps the logfmt output queryable instead of burying the duration in the message.
Proposed fix
- tracing::debug!(%machine_id, "last connection lasted {}s, resetting backoff to 0s", connection_time.as_secs()); + tracing::debug!( + %machine_id, + connection_time_secs = connection_time.as_secs(), + "last connection lasted long enough; resetting backoff to 0s", + );As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions instead of using string interpolation.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ssh-console/src/bmc/client.rs` at line 276, The tracing::debug! call in the connection reset path should not interpolate connection_time into the message string; emit it as a structured field instead so the log stays queryable. Update the debug event near machine_id to include connection_time as an attribute alongside the existing fields, and keep the message text generic while preserving the reset-backoff context.Source: Coding guidelines
crates/ssh-console/src/bmc/connection_impl/ipmi.rs (1)
910-913: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winReplace the temporary
Vecwith a slice literal.append_captured_outputtakes&[u8], so&vec![...]is an unnecessary allocation here.Proposed fix
- &vec![b'x'; MAX_CAPTURED_IPMITOOL_OUTPUT_SIZE + 1], + &[b'x'; MAX_CAPTURED_IPMITOOL_OUTPUT_SIZE + 1],🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/ssh-console/src/bmc/connection_impl/ipmi.rs` around lines 910 - 913, The captured-output test setup is doing an unnecessary heap allocation by passing a temporary Vec to append_captured_output. Update the call in ipmi.rs to use a slice literal (or slice reference) of repeated b'x' bytes instead of &vec![...], since append_captured_output accepts &[u8]. Keep the change localized to the append_captured_output call used in the MAX_CAPTURED_IPMITOOL_OUTPUT_SIZE path.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/ssh-console/src/bmc/connection_impl/ipmi.rs`:
- Around line 544-550: Stop retaining diagnostic console output after SOL
becomes operational. In the IPMI connection flow, update the logic around
append_captured_output, captured_output_contains, and the
sol_session_operational flag so that once SOL_SESSION_OPERATIONAL is detected,
captured_output is cleared or disabled for further accumulation. Keep the
activation check intact in the relevant read/processing path, but ensure later
host console data is not appended into the buffer that can be embedded in
SpawnError.
---
Nitpick comments:
In `@crates/ssh-console/src/bmc/client.rs`:
- Line 276: The tracing::debug! call in the connection reset path should not
interpolate connection_time into the message string; emit it as a structured
field instead so the log stays queryable. Update the debug event near machine_id
to include connection_time as an attribute alongside the existing fields, and
keep the message text generic while preserving the reset-backoff context.
In `@crates/ssh-console/src/bmc/connection_impl/ipmi.rs`:
- Around line 910-913: The captured-output test setup is doing an unnecessary
heap allocation by passing a temporary Vec to append_captured_output. Update the
call in ipmi.rs to use a slice literal (or slice reference) of repeated b'x'
bytes instead of &vec![...], since append_captured_output accepts &[u8]. Keep
the change localized to the append_captured_output call used in the
MAX_CAPTURED_IPMITOOL_OUTPUT_SIZE path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7bc33993-d21b-436c-bb9d-eced273b12b5
📒 Files selected for processing (6)
crates/ssh-console/src/bmc/client.rscrates/ssh-console/src/bmc/client_pool.rscrates/ssh-console/src/bmc/connection.rscrates/ssh-console/src/bmc/connection_impl/ipmi.rscrates/ssh-console/tests/main.rscrates/ssh-console/tests/util/ipmi_sim.rs
Signed-off-by: Hasan Khan <hasank@nvidia.com>
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Description
Recover ssh-console automatically when
ipmitool sol activatereports that theSOL payload is already active on another session. The service deactivates that
conflicting payload and retries immediately, while retaining normal backoff for
unrelated failures and repeated conflicts.
The change also makes PTY/process-exit handling deterministic, bounds captured
diagnostic output, and prevents host-console output received after activation
from being misclassified as an activation conflict.
Related issues
Fixes #2787
Type of Change
Breaking Changes
Testing
Validated in the Linux build image:
cargo make --no-workspace check-format-nightlycargo clippy --locked -p carbide-ssh-console --all-targets --all-features -- -D warningscargo test --locked -p carbide-ssh-console --lib(36 passed)REPO_ROOT=/workspace cargo test --locked -p carbide-ssh-console --test main test_ipmi_sol_conflict_recovery -- --exact --nocapture --test-threads=1(1 passed)REPO_ROOT=/workspace cargo test --locked -p carbide-ssh-console --test main -- --nocapture --test-threads=1(4 passed)git diff --checkThe end-to-end regression test establishes a real competing
ipmitoolsessionagainst
ipmi_sim, configures normal reconnect backoff to 30 seconds, andrequires the SSH console to become usable within 20 seconds.
Additional Notes
Recovery is deliberately limited to the exact conflict reported before
SOL Session operational; later host-console output cannot trigger adeactivation. ssh-console multiplexes frontends over one shared, long-lived SOL
session, so recovery intentionally evicts an existing out-of-band SOL holder in
order to restore the service-owned console.
This PR addresses automatic stale/conflicting IPMI SOL recovery. Selecting a
different OpenBMC console transport or exposing a manual reset through the
admin CLI/WebUI requires separate model/vendor-contract and API design.