Skip to content

Add IMDSv2 migration command for ROSA Classic clusters#909

Open
givaldolins wants to merge 4 commits into
openshift:masterfrom
givaldolins:imdsv2
Open

Add IMDSv2 migration command for ROSA Classic clusters#909
givaldolins wants to merge 4 commits into
openshift:masterfrom
givaldolins:imdsv2

Conversation

@givaldolins

@givaldolins givaldolins commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Implements automated IMDSv2 migration workflow with:

  • Pre-flight health checks (ClusterOperators, CPMS, nodes, etcd)
  • Hive MachinePool patching for infra/worker nodes
  • Sequential infra machine replacement with health validation
  • ControlPlaneMachineSet update for master node rollout
  • Post-migration validation

Supports selective migration via --nodes flag (all, master, infra, workers).

This MR automates SOP

Summary by CodeRabbit

  • New Features

    • Added osdctl cluster imdsv2 command to migrate ROSA Classic clusters to require IMDSv2 authentication.
    • Provides pre-flight validation, machine-pool updates, sequential infra node replacement, control-plane rollout, and final validation with confirmation prompts for destructive steps.
  • Documentation

    • New command documentation and CLI usage added, including examples and options for targeting node roles and timeout/reason flags.

Implements automated IMDSv2 migration workflow with:
- Pre-flight health checks (ClusterOperators, CPMS, nodes, etcd)
- Hive MachinePool patching for infra/worker nodes
- Sequential infra machine replacement with health validation
- ControlPlaneMachineSet update for master node rollout
- Post-migration validation

Supports selective migration via --nodes flag (all, master, infra, workers).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7cdd5241-bc02-414e-91a6-ffaf4919ee50

📥 Commits

Reviewing files that changed from the base of the PR and between 38748b6 and 07e230f.

📒 Files selected for processing (4)
  • cmd/cluster/imdsv2.go
  • docs/README.md
  • docs/osdctl_cluster.md
  • docs/osdctl_cluster_imdsv2.md
✅ Files skipped from review due to trivial changes (1)
  • docs/osdctl_cluster.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/cluster/imdsv2.go

Walkthrough

This PR introduces osdctl cluster imdsv2, a new command that orchestrates AWS ROSA Classic cluster node migration to require IMDSv2. The implementation coordinates pre-flight validation, Hive MachinePool patching, infra machine replacement, ControlPlaneMachineSet updates, and final verification across multiple Kubernetes and OCM resources.

Changes

IMDSv2 Migration Command

Layer / File(s) Summary
Command definition and input validation
cmd/cluster/cmd.go, cmd/cluster/imdsv2.go (lines 1–104)
Registers newCmdIMDSv2() subcommand and defines the command with flags (--cluster-id, --reason, --nodes), the imdsv2Options struct, and validation ensuring supported node roles and valid cluster key format.
Client initialization and orchestration
cmd/cluster/imdsv2.go (lines 105–276)
Establishes OCM and Kubernetes clients (including elevated admin client), enforces AWS-classic-only and non-HCP constraints, conditionally initializes Hive clients, and orchestrates the migration sequence across MachinePool patching, infra replacement, CPMS update, and final validation.
Pre-flight safety validation
cmd/cluster/imdsv2.go (lines 278–383)
Verifies ClusterOperator health, checks CPMS Active state and replicas, validates master nodes readiness and infra node readiness when applicable, and ensures exactly 3 etcd pods are Running before mutation.
Hive MachinePool patching and cleanup
cmd/cluster/imdsv2.go (lines 385–468, 619–652, 810–818)
Lists and patches MachinePools (infra/workers) to require IMDSv2 in AWS metadata, adds override annotation for platform spec updates, includes helper to resolve Hive namespace from OCM environment and cluster ID, and provides cleanup to remove the annotation post-migration.
Infra machine sequential replacement and polling
cmd/cluster/imdsv2.go (lines 470–617)
Iteratively deletes infra Machines and polls for Running replacements (excluding original names), waits for infra nodes to reach expected Ready count, re-checks cluster operator health between replacements, and implements timeout/interval-based retry logic.
ControlPlaneMachineSet migration
cmd/cluster/imdsv2.go (lines 654–742)
Unmarshals CPMS AWS provider spec, skips if already configured, prompts for user confirmation, sets metadata authentication to Required, patches via admin client, and monitors rollout until updated and ready replicas match expected counts.
Final validation and verification
cmd/cluster/imdsv2.go (lines 744–818)
Lists all Nodes and fails if any are not Ready, examines each Machine's provider spec to confirm IMDSv2 enforcement, and reports warnings when machines are not configured (e.g., customer-managed workers).
Documentation
docs/README.md, docs/osdctl_cluster.md, docs/osdctl_cluster_imdsv2.md
Adds osdctl cluster imdsv2 entries to the README and cluster docs, and adds a dedicated osdctl_cluster_imdsv2.md page with usage, examples, and flags.

Sequence Diagram

sequenceDiagram
  participant User as User/CLI
  participant Cmd as osdctl imdsv2
  participant OCM as OCM API
  participant K8s as Kubernetes API
  participant Hive as Hive
  participant CPMS as ControlPlaneMachineSet

  User->>Cmd: cluster imdsv2 --cluster-id=... --nodes=...
  Cmd->>OCM: Fetch cluster environment and ID
  Cmd->>K8s: Pre-flight checks (operators, CPMS, nodes, etcd)
  K8s-->>Cmd: Health status

  rect rgba(100, 200, 150, 0.5)
  Cmd->>Hive: List MachinePools in Hive namespace
  Hive-->>Cmd: MachinePool specs
  Cmd->>Hive: Patch selected pools for IMDSv2
  end

  rect rgba(150, 150, 200, 0.5)
  Cmd->>K8s: Delete infra Machines (sequential)
  K8s-->>Cmd: Machines deleted
  Cmd->>K8s: Poll for replacement Machines Running
  K8s-->>Cmd: Replacements ready
  Cmd->>K8s: Wait for infra nodes Ready
  Cmd->>K8s: Verify cluster operator health
  end

  rect rgba(200, 150, 100, 0.5)
  Cmd->>K8s: Get ControlPlaneMachineSet
  Cmd->>Cmd: User confirmation (destructive)
  Cmd->>K8s: Patch CPMS for IMDSv2
  Cmd->>K8s: Monitor CPMS rollout
  K8s-->>Cmd: Rollout complete
  end

  Cmd->>K8s: Final validation (nodes Ready, machines IMDSv2)
  K8s-->>Cmd: Migration verified
  Cmd->>User: Success/warnings
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly summarizes the main change: adding a new IMDSv2 migration command for ROSA Classic clusters, which aligns perfectly with the core implementation in cmd/cluster/imdsv2.go and documentation updates.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed No Ginkgo tests are present in the PR. The added files contain no test code with Ginkgo patterns (It, Describe, Context, etc.).
Test Structure And Quality ✅ Passed No Ginkgo test code was added in this PR; the check is not applicable. PR contains only implementation (imdsv2.go) and documentation changes.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. The changes consist of a CLI command implementation (imdsv2.go) and documentation updates only.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR does not add Ginkgo e2e tests; it only adds a CLI command (osdctl cluster imdsv2) implementation and documentation. SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed CLI tool for ROSA Classic (fixed 3 control-plane nodes) that patches resources, not a deployment manifest or operator introducing scheduling constraints.
Ote Binary Stdout Contract ✅ Passed All stdout writes in imdsv2.go are within method bodies called during command execution, not at package initialization. No module-level init() or process-level stdout writes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests—only a new command implementation (cmd/cluster/imdsv2.go) and documentation. The check is not applicable.
No-Weak-Crypto ✅ Passed No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret comparisons found in the 818 lines of new code in imdsv2.go and related files.
Container-Privileges ✅ Passed PR adds Go CLI code and documentation only; contains no Kubernetes manifests, container configs, or Dockerfile modifications that could expose container privileges.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, API keys, credentials) is logged. Logging includes cluster names, status updates, and error messages which do not expose authentication material.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from devppratik and petrkotas June 3, 2026 20:35
@openshift-ci

openshift-ci Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: givaldolins
Once this PR has been reviewed and has the lgtm label, please assign tafhim for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cmd/cluster/imdsv2.go (1)

348-395: ⚖️ Poor tradeoff

Consider using the Kubernetes client instead of shelling out to oc.

Shelling out to oc introduces a dependency on the CLI being installed and configured. The kubeconfig context used by oc may differ from the client initialized via backplane in init(). Consider using the o.client to fetch ClusterOperator resources directly via the config.openshift.io/v1 API group.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/cluster/imdsv2.go` around lines 348 - 395, The checkClusterOperators
function currently shells out to `oc` to list ClusterOperators; replace that
with direct API calls using the existing o.client so the same
kubeconfig/backplane client is used. Use the OpenShift config v1 ClusterOperator
type (e.g., configv1.ClusterOperatorList) or a dynamic client list against
"config.openshift.io/v1" to fetch all ClusterOperator resources with
o.client.List(ctx, &list) (or equivalent), then iterate list.Items and inspect
each item.Status.Conditions exactly as done now to build unhealthyOps, returning
errors on client.List or unmarshalling failures as appropriate; keep the same
final behavior/message and error formatting but remove the exec.CommandContext
call and JSON unmarshal logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 728-754: The function monitorCPMSRollout creates a timeout context
from context.Background(), breaking cancellation propagation; change the context
creation to derive from the passed ctx (use context.WithTimeout(ctx,
imdsv2RolloutPollTimeout) and keep cancel deferred) so parent cancellations
(SIGINT or upstream cancellation) interrupt the polling loop; update the context
creation in monitorCPMSRollout where pollCtx and cancel are defined (and
continue to pass pollCtx into wait.PollUntilContextTimeout as before).

---

Nitpick comments:
In `@cmd/cluster/imdsv2.go`:
- Around line 348-395: The checkClusterOperators function currently shells out
to `oc` to list ClusterOperators; replace that with direct API calls using the
existing o.client so the same kubeconfig/backplane client is used. Use the
OpenShift config v1 ClusterOperator type (e.g., configv1.ClusterOperatorList) or
a dynamic client list against "config.openshift.io/v1" to fetch all
ClusterOperator resources with o.client.List(ctx, &list) (or equivalent), then
iterate list.Items and inspect each item.Status.Conditions exactly as done now
to build unhealthyOps, returning errors on client.List or unmarshalling failures
as appropriate; keep the same final behavior/message and error formatting but
remove the exec.CommandContext call and JSON unmarshal logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 993b8d49-1d85-43e7-9f2d-d04b1a1555f0

📥 Commits

Reviewing files that changed from the base of the PR and between 7117673 and 38748b6.

📒 Files selected for processing (2)
  • cmd/cluster/cmd.go
  • cmd/cluster/imdsv2.go

Comment thread cmd/cluster/imdsv2.go
givaldolins and others added 2 commits June 3, 2026 14:55
- Fix monitorCPMSRollout to derive timeout context from parent ctx
  instead of context.Background(), allowing proper cancellation
  propagation (SIGINT/upstream cancellation)
- Replace checkClusterOperators oc shell-out with direct API calls
  using configv1.ClusterOperatorList and existing o.client
- Register configv1 scheme in init() for ClusterOperator resources
- Remove unused os/exec import

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@givaldolins

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

@givaldolins: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@givaldolins

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@clcollins clcollins left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of questions.

Also, Claude noticed ""preFlightChecks, monitorCPMSRollout, init, and Hive client setup are near-identical copies of changevolumetype.go" and suggested they should be consolidated/adjusted to work so there's less redundant code.

Comment thread cmd/cluster/imdsv2.go
}

// replaceInfraMachines replaces infra machines one at a time.
func (o *imdsv2Options) replaceInfraMachines(ctx context.Context) error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to try to use

func RunMachinePoolDance(ctx context.Context, clients DanceClients, originalMp, newMp *hivev1.MachinePool, onTimeout func(ctx context.Context, nodes *corev1.NodeList) error) error {
for this piece.

Comment thread cmd/cluster/imdsv2.go
}

// Delete the machine (MachineSet will create replacement with IMDSv2)
if err := o.clientAdmin.Delete(ctx, machine); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are confirmation functions elsewhere in this PR, but there's not one included here. Is that intended or desired?

Comment thread cmd/cluster/imdsv2.go
}
}

// Step 3: Clean up temporary override annotations from MachinePools

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this process fails partway through, before it gets here? The annotations would continue to live on the machine pools:

Comment thread cmd/cluster/imdsv2.go
fmt.Printf("Hive namespace: %s\n", hiveNamespace)

// Retrieve all MachinePools for this cluster
mpList := &hivev1.MachinePoolList{}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance this would match machinepools you don't want to patch? Are ALL machinepools in that namespace supposed to be patched, every time? (Truly asking - i'm not 100% sure)

Comment thread cmd/cluster/imdsv2.go
fmt.Println("Worker nodes must be replaced by the customer.")
}

printer.PrintlnGreen("\n✓ IMDSv2 migration completed successfully!")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Everything before this is an if statement, so there's a chance (however small - errors, or changes in the future) where you might get here with no actual operation happening, and still get the "completed successfully!" message. Is there something you can check on to ensure that it happened? Some state on the nodes, etc?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants