
fix(ssh-console): recover conflicting IPMI SOL sessions #2956

Open
osu wants to merge 2 commits into NVIDIA:main from osu:fix/2787-ssh-console-sol-recovery

Conversation

@osu commented Jun 28, 2026 (Member)

Description

Recover ssh-console automatically when ipmitool sol activate reports that the
SOL payload is already active on another session. The service deactivates that
conflicting payload and retries immediately, while retaining normal backoff for
unrelated failures and repeated conflicts.

The change also makes PTY/process-exit handling deterministic, bounds captured
diagnostic output, and prevents host-console output received after activation
from being misclassified as an activation conflict.

Related issues

Fixes #2787

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Validated in the Linux build image:

  • cargo make --no-workspace check-format-nightly
  • cargo clippy --locked -p carbide-ssh-console --all-targets --all-features -- -D warnings
  • cargo test --locked -p carbide-ssh-console --lib (36 passed)
  • REPO_ROOT=/workspace cargo test --locked -p carbide-ssh-console --test main test_ipmi_sol_conflict_recovery -- --exact --nocapture --test-threads=1 (1 passed)
  • REPO_ROOT=/workspace cargo test --locked -p carbide-ssh-console --test main -- --nocapture --test-threads=1 (4 passed)
  • git diff --check

The end-to-end regression test establishes a real competing ipmitool session
against ipmi_sim, configures normal reconnect backoff to 30 seconds, and
requires the SSH console to become usable within 20 seconds.

Additional Notes

Recovery is deliberately limited to the exact conflict reported before
SOL Session operational; later host-console output cannot trigger a
deactivation. ssh-console multiplexes frontends over one shared, long-lived SOL
session, so recovery intentionally evicts an existing out-of-band SOL holder in
order to restore the service-owned console.
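
A minimal sketch of how the "conflict only before SOL Session operational" rule could be expressed. The marker strings are taken from this PR's text; `ActivationTracker` and its methods are hypothetical names for illustration and are not the code in ipmi.rs. The buffer here is unbounded to keep the sketch short.

```rust
// Marker text as quoted in this PR; the exact ipmitool wording may differ.
const SOL_PAYLOAD_ALREADY_ACTIVE: &[u8] = b"SOL payload already active";
const SOL_SESSION_OPERATIONAL: &[u8] = b"SOL Session operational";

/// Returns true when `needle` occurs anywhere in `haystack`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|window| window == needle)
}

/// Hypothetical helper: accumulates activation output until the session is up.
#[derive(Default)]
struct ActivationTracker {
    seen: Vec<u8>,
    operational: bool,
}

impl ActivationTracker {
    /// Feed one PTY read. Chunks may split a marker, so matching runs on the
    /// accumulated bytes rather than on the individual chunk.
    fn observe(&mut self, chunk: &[u8]) {
        if self.operational {
            // Host-console output after activation is never inspected.
            return;
        }
        self.seen.extend_from_slice(chunk);
        if contains(&self.seen, SOL_SESSION_OPERATIONAL) {
            self.operational = true;
        }
    }

    /// A conflict counts only if it was reported before the session came up.
    fn is_activation_conflict(&self) -> bool {
        !self.operational && contains(&self.seen, SOL_PAYLOAD_ALREADY_ACTIVE)
    }
}
```

With this shape, a host that later prints the conflict text on its own console cannot cause a deactivation, because nothing is observed once the operational banner has been seen.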

This PR addresses automatic stale/conflicting IPMI SOL recovery. Selecting a
different OpenBMC console transport or exposing a manual reset through the
admin CLI/WebUI requires separate model/vendor-contract and API design.

Signed-off-by: Hasan Khan <hasank@nvidia.com>
@osu osu requested a review from a team as a code owner June 28, 2026 18:37
@coderabbitai (Bot, Contributor) commented Jun 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 683a7fad-6c81-40ac-a880-6ffbae70a2d0

📥 Commits

Reviewing files that changed from the base of the PR and between ac6bfa1 and 7fd9735.

📒 Files selected for processing (1)
  • crates/ssh-console/src/bmc/connection_impl/ipmi.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/ssh-console/src/bmc/connection_impl/ipmi.rs

Summary by CodeRabbit

  • New Features

    • Improved recovery from interrupted or conflicting IPMI SOL sessions, helping connections resume more reliably.
    • Added broader end-to-end coverage for SOL session recovery in the SSH console workflow.
  • Bug Fixes

    • Retry behavior now adapts better after successful reconnects and after SOL-related connection conflicts.
    • Console output handling is more robust, including cases where important status messages arrive in fragments.

Walkthrough

Automatic recovery from conflicting IPMI SOL sessions is implemented by detecting the "SOL payload already active" marker in bounded ipmitool output, running ipmitool sol deactivate, and returning new SpawnError variants that trigger immediate retry. The client retry loop gains a previous_connection_close_was_sol_recovery flag and a should_reset_retry_backoff predicate to suppress consecutive immediate resets.
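
The reset predicate described above might look roughly like the following. This is a sketch under assumptions: the parameter list, the `MIN_SUCCESSFUL_CONNECTION` threshold and its 60-second value are hypothetical, and only the function name and the flag's purpose come from the walkthrough.

```rust
use std::time::Duration;

/// Hypothetical threshold; the real value is not stated in this PR.
const MIN_SUCCESSFUL_CONNECTION: Duration = Duration::from_secs(60);

/// Decide whether the retry loop should drop its backoff to zero.
///
/// * `connection_time`: how long the previous connection stayed up.
/// * `retry_immediately`: the last error asked for an immediate retry
///   (for example, a conflicting SOL session was just deactivated).
/// * `previous_close_was_sol_recovery`: the close before this one was
///   already a SOL recovery, so a second consecutive reset is suppressed.
fn should_reset_retry_backoff(
    connection_time: Duration,
    retry_immediately: bool,
    previous_close_was_sol_recovery: bool,
) -> bool {
    let lasted_long_enough = connection_time >= MIN_SUCCESSFUL_CONNECTION;
    let first_sol_recovery = retry_immediately && !previous_close_was_sol_recovery;
    lasted_long_enough || first_sol_recovery
}
```

Suppressing the second consecutive reset means a BMC that keeps reporting a conflict falls back to normal backoff instead of being retried in a tight loop.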

Changes

IPMI SOL Conflict Auto-Recovery

| Layer / File(s) | Summary |
|---|---|
| SpawnError variants and retry_immediately contracts (crates/ssh-console/src/bmc/connection_impl/ipmi.rs, crates/ssh-console/src/bmc/connection.rs) | Adds ConflictingSolSessionDeactivated and ConflictingSolSessionDeactivationFailed variants to SpawnError, the SolDeactivateError enum, SOL/session constants, and retry_immediately() delegation through both error layers. |
| ipmitool process loop refactor and bounded PTY capture (crates/ssh-console/src/bmc/connection_impl/ipmi.rs) | Replaces fixed-buffer output capture with a bounded VecDeque<u8>; adds sol_session_operational tracking; introduces PtyReadResult enum; refactors the event loop into a stateful select! with ipmitool_exited/pty_closed/kill_requested flags and explicit PTY drain; extracts shared command builders (ipmitool_command, sol_activate_command, sol_deactivate_command); adds handle_unexpected_ipmitool_exit and deactivate_sol. |
| Retry backoff reset predicate and SOL-recovery state (crates/ssh-console/src/bmc/client.rs) | Introduces previous_connection_close_was_sol_recovery boolean in the retry loop, should_reset_retry_backoff() helper centralising the reset predicate (long-success or SOL conflict with consecutive-reset suppression), and clears the flag on spawn failure. |
| Tests and integration coverage (crates/ssh-console/src/bmc/client.rs, crates/ssh-console/src/bmc/client_pool.rs, crates/ssh-console/src/bmc/connection_impl/ipmi.rs, crates/ssh-console/tests/main.rs, crates/ssh-console/tests/util/ipmi_sim.rs) | Adds unit tests for should_reset_retry_backoff, marker detection, fragmented banner handling, PTY drain, and sol deactivate outcomes; adds BmcPool::for_test() helper; adds ActiveSolSession/activate_sol test utility; adds test_ipmi_sol_conflict_recovery integration test. |
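
As an illustration of the bounded VecDeque<u8> capture mentioned in the summary, here is one way a "keep only the newest N bytes" append could be written. `append_bounded` and its explicit `max` parameter are hypothetical; the PR's own append_captured_output may differ in signature and behavior.

```rust
use std::collections::VecDeque;

/// Append `data` to `captured`, keeping at most the newest `max` bytes.
/// Assumes `captured` already holds no more than `max` bytes.
fn append_bounded(captured: &mut VecDeque<u8>, data: &[u8], max: usize) {
    // Only the last `max` bytes of `data` can survive the cap.
    let tail = &data[data.len().saturating_sub(max)..];
    // Evict the oldest bytes so that the tail fits.
    let overflow = (captured.len() + tail.len()).saturating_sub(max);
    captured.drain(..overflow);
    captured.extend(tail);
}
```

Dropping the oldest bytes first keeps the most recent ipmitool output available for diagnostics while memory use stays fixed, even when a single read is larger than the cap.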

Sequence Diagram(s)

sequenceDiagram
  participant Client as BmcClient retry loop
  participant Proxy as IpmitoolMessageProxy
  participant Ipmitool as ipmitool process
  participant BMC as BMC (IPMI LAN)

  Client->>Proxy: spawn(sol activate)
  Proxy->>Ipmitool: exec ipmitool sol activate
  Ipmitool->>BMC: SOL activate request
  BMC-->>Ipmitool: SOL payload already active
  Ipmitool-->>Proxy: unexpected exit (exit_status=1)
  Proxy->>Proxy: detect "SOL payload already active" marker
  Proxy->>BMC: ipmitool sol deactivate
  BMC-->>Proxy: deactivate success
  Proxy-->>Client: SpawnError::ConflictingSolSessionDeactivated (retry_immediately=true)
  Client->>Client: should_reset_retry_backoff → true (first SOL recovery)
  Client->>Client: reset backoff, clear flag
  Client->>Proxy: spawn(sol activate) immediately
  Proxy->>Ipmitool: exec ipmitool sol activate
  Ipmitool->>BMC: SOL activate request
  BMC-->>Ipmitool: SOL Session operational
  Proxy-->>Client: connection established
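
The error path in the diagram can be read as a small classification step followed by delegation of retry_immediately() through an outer error layer. The sketch below assumes a simplified shape: the variant names come from this PR, but their payloads, the `ConnectionError` wrapper, `classify_exit`, and the choice that a failed deactivation does not retry immediately are all assumptions for illustration.

```rust
#[derive(Debug)]
enum SpawnError {
    /// The conflicting payload was deactivated; activation can be retried now.
    ConflictingSolSessionDeactivated,
    /// Deactivation itself failed (payload shape is hypothetical).
    ConflictingSolSessionDeactivationFailed { reason: String },
    /// Any unrelated failure.
    Other(String),
}

impl SpawnError {
    fn retry_immediately(&self) -> bool {
        matches!(self, SpawnError::ConflictingSolSessionDeactivated)
    }
}

/// Hypothetical outer error layer that forwards the retry hint.
#[derive(Debug)]
enum ConnectionError {
    Spawn(SpawnError),
    Closed,
}

impl ConnectionError {
    fn retry_immediately(&self) -> bool {
        match self {
            ConnectionError::Spawn(err) => err.retry_immediately(),
            ConnectionError::Closed => false,
        }
    }
}

/// Map an unexpected ipmitool exit to an error, running `deactivate`
/// only when the conflict marker was seen during activation.
fn classify_exit(
    conflict_detected: bool,
    deactivate: impl FnOnce() -> Result<(), String>,
) -> SpawnError {
    if !conflict_detected {
        return SpawnError::Other("ipmitool exited unexpectedly".to_string());
    }
    match deactivate() {
        Ok(()) => SpawnError::ConflictingSolSessionDeactivated,
        Err(reason) => SpawnError::ConflictingSolSessionDeactivationFailed { reason },
    }
}
```

Passing the deactivation step as a closure keeps the classification testable without a BMC; in the real service that step would run ipmitool sol deactivate.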

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.08% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: recovering from conflicting IPMI SOL sessions. |
| Description check | ✅ Passed | The description matches the implemented fix and its test coverage, so it is clearly related to the changeset. |
| Linked Issues check | ✅ Passed | The code implements automatic recovery for the reported SOL conflict and immediate retry behavior requested in #2787. |
| Out of Scope Changes check | ✅ Passed | The added helpers and tests all support the SOL recovery fix and do not introduce unrelated scope. |

Comment @coderabbitai help to get the list of available commands.

@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
crates/ssh-console/src/bmc/client.rs (1)

276-276: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Emit connection_time as a structured tracing field.

This keeps the logfmt output queryable instead of burying the duration in the message.

Proposed fix
-                            tracing::debug!(%machine_id, "last connection lasted {}s, resetting backoff to 0s", connection_time.as_secs());
+                            tracing::debug!(
+                                %machine_id,
+                                connection_time_secs = connection_time.as_secs(),
+                                "last connection lasted long enough; resetting backoff to 0s",
+                            );

As per coding guidelines, “When writing log messages, prefer placing common fields as attributes passed to tracing functions instead of using string interpolation.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ssh-console/src/bmc/client.rs` at line 276, The tracing::debug! call
in the connection reset path should not interpolate connection_time into the
message string; emit it as a structured field instead so the log stays
queryable. Update the debug event near machine_id to include connection_time as
an attribute alongside the existing fields, and keep the message text generic
while preserving the reset-backoff context.

Source: Coding guidelines

crates/ssh-console/src/bmc/connection_impl/ipmi.rs (1)

910-913: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Replace the temporary Vec with a slice literal.

append_captured_output takes &[u8], so &vec![...] is an unnecessary allocation here.

Proposed fix
-            &vec![b'x'; MAX_CAPTURED_IPMITOOL_OUTPUT_SIZE + 1],
+            &[b'x'; MAX_CAPTURED_IPMITOOL_OUTPUT_SIZE + 1],
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/ssh-console/src/bmc/connection_impl/ipmi.rs` around lines 910 - 913,
The captured-output test setup is doing an unnecessary heap allocation by
passing a temporary Vec to append_captured_output. Update the call in ipmi.rs to
use a slice literal (or slice reference) of repeated b'x' bytes instead of
&vec![...], since append_captured_output accepts &[u8]. Keep the change
localized to the append_captured_output call used in the
MAX_CAPTURED_IPMITOOL_OUTPUT_SIZE path.

Sources: Coding guidelines, Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/ssh-console/src/bmc/connection_impl/ipmi.rs`:
- Around line 544-550: Stop retaining diagnostic console output after SOL
becomes operational. In the IPMI connection flow, update the logic around
append_captured_output, captured_output_contains, and the
sol_session_operational flag so that once SOL_SESSION_OPERATIONAL is detected,
captured_output is cleared or disabled for further accumulation. Keep the
activation check intact in the relevant read/processing path, but ensure later
host console data is not appended into the buffer that can be embedded in
SpawnError.
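
One way to picture the requested behavior is a capture buffer that is dropped once the session becomes operational, so later appends become no-ops. `DiagnosticCapture` and its methods are hypothetical names for this sketch and do not reflect the actual change in ipmi.rs.

```rust
use std::collections::VecDeque;

/// Hypothetical wrapper: holds activation-time output only.
struct DiagnosticCapture {
    buf: Option<VecDeque<u8>>,
}

impl DiagnosticCapture {
    fn new() -> Self {
        Self { buf: Some(VecDeque::new()) }
    }

    /// Record output while capture is still enabled; ignored afterwards.
    fn append(&mut self, data: &[u8]) {
        if let Some(buf) = self.buf.as_mut() {
            buf.extend(data);
        }
    }

    /// Call once the operational banner is detected: discards what was
    /// captured and disables further accumulation.
    fn stop(&mut self) {
        self.buf = None;
    }

    /// Text that would be embedded in an error; empty once capture stopped.
    fn for_error(&self) -> String {
        match &self.buf {
            Some(buf) => {
                let bytes: Vec<u8> = buf.iter().copied().collect();
                String::from_utf8_lossy(&bytes).into_owned()
            }
            None => String::new(),
        }
    }
}
```

Under this design, host-console data such as login prompts or command output can never end up in an error message, because nothing is retained after `stop()`.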


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7bc33993-d21b-436c-bb9d-eced273b12b5

📥 Commits

Reviewing files that changed from the base of the PR and between d3395d8 and ac6bfa1.

📒 Files selected for processing (6)
  • crates/ssh-console/src/bmc/client.rs
  • crates/ssh-console/src/bmc/client_pool.rs
  • crates/ssh-console/src/bmc/connection.rs
  • crates/ssh-console/src/bmc/connection_impl/ipmi.rs
  • crates/ssh-console/tests/main.rs
  • crates/ssh-console/tests/util/ipmi_sim.rs

Comment thread crates/ssh-console/src/bmc/connection_impl/ipmi.rs Outdated
Signed-off-by: Hasan Khan <hasank@nvidia.com>
@github-actions


🔍 Container Scan Summary

| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 285 | 6 | 25 | 103 | 7 | 144 |
| machine-validation-runner | 748 | 30 | 189 | 272 | 36 | 221 |
| machine_validation | 748 | 30 | 189 | 272 | 36 | 221 |
| machine_validation-aarch64 | 748 | 30 | 189 | 272 | 36 | 221 |
| nvmetal-carbide | 748 | 30 | 189 | 272 | 36 | 221 |
| TOTAL | 3283 | 126 | 781 | 1197 | 151 | 1028 |

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@osu osu self-assigned this Jun 28, 2026

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: enhancement when ssh-console-rs fails to connect to host console

1 participant