OCPBUGS-86704: Fix MaxParallelUpgrades bypass in userData deletion path#4155

Open
ranjithrajaram wants to merge 2 commits into
openshift:masterfrom
ranjithrajaram:fix-maxparallel-bypass
Conversation

@ranjithrajaram

@ranjithrajaram ranjithrajaram commented May 29, 2026

Trying to address JIRA OCPBUGS-86704

The MaxParallelUpgrades=1 protection is bypassed during userData template changes, causing simultaneous machine deletions across MachineSets during normal WMCO upgrades.

This adds cluster-wide upgrade limit checking at two layers:

  1. Before clearing pub-key-hash annotations (source prevention)
  2. Before machine deletion (deletion path protection)

Root cause: When the userData template changes (e.g., AWS routes, SSH retry logic, firewall rules), SecretReconciler cleared pub-key-hash annotations on ALL Machine nodes simultaneously. WindowsMachineReconciler then used the per-MachineSet isAllowedDeletion() check instead of the cluster-wide markNodeAsUpgrading() lock, so N MachineSets could produce N simultaneous deletions.

Defense in depth approach:

  • SecretReconciler: Check upgrade limit before clearing each node's annotation (prevents simultaneous clearing)
  • WindowsMachineReconciler: Check upgrade limit before deletion (catches any edge cases)
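
The difference between the two checks can be illustrated with a minimal, self-contained sketch. The `Node` type, the `Upgrading` flag, and both helper functions are hypothetical stand-ins written for this illustration; they are not the operator's actual isAllowedDeletion() or markNodeAsUpgrading() implementations.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a Windows Machine node.
type Node struct {
	Name       string
	MachineSet string
	Upgrading  bool // stand-in for whatever marker records an in-progress upgrade
}

// maxParallelUpgrades mirrors the MaxParallelUpgrades=1 setting from the description.
const maxParallelUpgrades = 1

// perMachineSetAllowed counts only upgrading nodes in the candidate's own
// MachineSet, so N MachineSets can admit up to N simultaneous deletions.
func perMachineSetAllowed(nodes []Node, candidate Node) bool {
	inProgress := 0
	for _, n := range nodes {
		if n.MachineSet == candidate.MachineSet && n.Upgrading {
			inProgress++
		}
	}
	return inProgress < maxParallelUpgrades
}

// clusterWideAllowed counts upgrading nodes across every MachineSet.
func clusterWideAllowed(nodes []Node) bool {
	inProgress := 0
	for _, n := range nodes {
		if n.Upgrading {
			inProgress++
		}
	}
	return inProgress < maxParallelUpgrades
}

func main() {
	nodes := []Node{
		{Name: "win-a1", MachineSet: "ms-a", Upgrading: true},
		{Name: "win-b1", MachineSet: "ms-b"},
	}
	candidate := nodes[1]
	fmt.Println("per-MachineSet check allows deletion:", perMachineSetAllowed(nodes, candidate))
	fmt.Println("cluster-wide check allows deletion:", clusterWideAllowed(nodes))
}
```

With one node already upgrading in ms-a, the per-MachineSet check still admits a deletion in ms-b, while the cluster-wide check defers it.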

Summary by CodeRabbit

  • New Features
    • Added cluster-wide upgrade limit enforcement to prevent excessive concurrent upgrade operations. When limits are reached, the system defers node updates and machine deletions, automatically retrying on the next reconciliation cycle.
    • Enhanced logging and event recording for upgrade-related operations to improve operational visibility.

The MaxParallelUpgrades=1 protection is bypassed during userData
template changes, causing simultaneous machine deletions across
MachineSets during normal WMCO upgrades.

This adds cluster-wide upgrade limit checking at two layers:
1. Before clearing pub-key-hash annotations (source prevention)
2. Before machine deletion (deletion path protection)

Root cause: When the userData template changes (e.g., AWS routes, SSH
retry logic, firewall rules), SecretReconciler cleared pub-key-hash
annotations on ALL Machine nodes simultaneously. WindowsMachineReconciler
then used the per-MachineSet isAllowedDeletion() check instead of the
cluster-wide markNodeAsUpgrading() lock, so N MachineSets could
produce N simultaneous deletions.

Defense in depth approach:
- SecretReconciler: Check upgrade limit before clearing each
  node's annotation (prevents simultaneous clearing)
- WindowsMachineReconciler: Check upgrade limit before deletion
  (catches any edge cases)

Signed-off-by: Ranjith Rajaram <ranjith@redhat.com>
@openshift-ci-robot added the following labels on May 29, 2026:

  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • jira/invalid-bug: Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.

@openshift-ci-robot

@ranjithrajaram: This pull request references Jira Issue OCPBUGS-86704, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 29, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c0f921fb-2afb-49fb-a977-280ce577b5ba

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR introduces cluster-wide upgrade-limit gating at two critical junctures in the Windows Machine Config Operator's reconciliation loop. The secret controller now checks upgrade capacity before clearing node public-key-hash annotations, and the windowsmachine controller gates machine deletion on public-key mismatch. Both paths reuse a shared markNodeAsUpgrading guard that returns UpgradeLimitExceededError when the cluster cannot accommodate another simultaneous upgrade. When the limit is reached, both controllers defer their state transitions and requeue for retry on the next reconcile cycle.
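
The defer-and-requeue behavior described in the walkthrough can be sketched as follows. This is an illustration under stated assumptions: the error type below is a local stand-in that borrows the name UpgradeLimitExceededError, and its fields, the `markUpgrading` guard, the `gateDeletion` function, and the 30-second requeue delay are all hypothetical rather than taken from the operator's code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// UpgradeLimitExceededError is a local stand-in for the error named in the
// walkthrough; the fields are assumptions made for this sketch.
type UpgradeLimitExceededError struct {
	InProgress int
	Limit      int
}

func (e *UpgradeLimitExceededError) Error() string {
	return fmt.Sprintf("upgrade limit exceeded: %d in progress, limit %d", e.InProgress, e.Limit)
}

// markUpgrading is a hypothetical guard: it refuses to admit another upgrade
// once the cluster-wide count has reached the limit.
func markUpgrading(inProgress, limit int) error {
	if inProgress >= limit {
		return &UpgradeLimitExceededError{InProgress: inProgress, Limit: limit}
	}
	return nil
}

// gateDeletion reports whether the deletion may proceed. When the limit is
// reached it returns a requeue delay and no error, so the caller can retry on
// a later reconcile instead of treating the condition as a failure.
func gateDeletion(inProgress, limit int) (proceed bool, requeueAfter time.Duration, err error) {
	if err := markUpgrading(inProgress, limit); err != nil {
		var limitErr *UpgradeLimitExceededError
		if errors.As(err, &limitErr) {
			return false, 30 * time.Second, nil // assumed delay, for illustration only
		}
		return false, 0, err
	}
	return true, 0, nil
}

func main() {
	proceed, after, err := gateDeletion(1, 1)
	fmt.Println(proceed, after, err)
	proceed, after, err = gateDeletion(0, 1)
	fmt.Println(proceed, after, err)
}
```

Using `errors.As` rather than a direct type assertion keeps the check working even if the guard's error is wrapped on its way up to the controller.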

🚥 Pre-merge checks | ✅ 5 | ❌ 15

❌ Failed checks (15 inconclusive)

All 15 checks below have status ❓ Inconclusive, with the same explanation and resolution.

Explanation: Repository clone failed, so this custom check could not run with code access.
Resolution: Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.

  • Go Best Practices & Build Tags
  • Security: Secrets, Ssh & Csr
  • Kubernetes Controller Patterns
  • Windows Service Management
  • Platform-Specific Requirements
  • Stable And Deterministic Test Names
  • Test Structure And Quality
  • Microshift Test Compatibility
  • Single Node Openshift (Sno) Test Compatibility
  • Topology-Aware Scheduling Compatibility
  • Ote Binary Stdout Contract
  • Ipv6 And Disconnected Network Test Compatibility
  • No-Weak-Crypto
  • Container-Privileges
  • No-Sensitive-Data-In-Logs

✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and specifically identifies the bug being fixed (MaxParallelUpgrades bypass) and pinpoints the affected code path (userData deletion), accurately reflecting the core changes across both controller files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@openshift-ci

openshift-ci Bot commented May 29, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ranjithrajaram
Once this PR has been reviewed and has the lgtm label, please assign mansikulkarni96 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 29, 2026
@openshift-ci

openshift-ci Bot commented May 29, 2026

Contributor

Hi @ranjithrajaram. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mansikulkarni96

Member

/ok-to-test

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 29, 2026
Enhances the existing testUserDataTamper test to verify that
MaxParallelUpgrades=1 is enforced when userData template changes trigger
machine deletions (pub-key-hash mismatch deletion path).

The test now:
1. Deploys parallel-upgrades-checker monitoring job before userData change
2. Waits for job completion with proper retry logic (prevents flaky tests)
3. Verifies the checker did not fail (which would indicate >1 simultaneous upgrades)
4. Cleans up the checker job after the test

This tests the specific scenario where the bug was bypassing
MaxParallelUpgrades: userData template changes causing simultaneous
deletion of machines across MachineSets.

The implementation leverages existing infrastructure:
- hack/e2e/resources/parallel-upgrade-checker-job.yaml (monitoring job)
- testUserDataTamper (already triggers userData changes)
- waitForNewMachineNodes (already waits for machine recreation)

Helper functions added:
- getRepoRoot() - Resolves repository root for robust path handling
- deployParallelUpgradesChecker() - Deploys the monitoring job
- verifyParallelUpgradesChecker() - Waits for job completion and checks for violations
- cleanupParallelUpgradesChecker() - Removes the job
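
One way a helper like getRepoRoot() could resolve the repository root independently of the working directory is to walk upward until a marker file is found. The sketch below assumes go.mod as the marker; the function names `findRepoRoot` and `hasGoMod` are hypothetical and this is not the PR's actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// findRepoRoot walks upward from start until hasMarker reports true for a
// directory, or the filesystem root is reached.
func findRepoRoot(start string, hasMarker func(dir string) bool) (string, error) {
	dir := filepath.Clean(start)
	for {
		if hasMarker(dir) {
			return dir, nil
		}
		parent := filepath.Dir(dir)
		if parent == dir { // reached the filesystem root without finding the marker
			return "", errors.New("repository root not found")
		}
		dir = parent
	}
}

// hasGoMod reports whether dir contains a go.mod file (the assumed marker).
func hasGoMod(dir string) bool {
	info, err := os.Stat(filepath.Join(dir, "go.mod"))
	return err == nil && !info.IsDir()
}

func main() {
	wd, err := os.Getwd()
	if err != nil {
		fmt.Println("getwd:", err)
		return
	}
	root, err := findRepoRoot(wd, hasGoMod)
	if err != nil {
		fmt.Println(err)
		return
	}
	// Resolve the checker job manifest relative to the discovered root.
	fmt.Println(filepath.Join(root, "hack", "e2e", "resources", "parallel-upgrade-checker-job.yaml"))
}
```

Passing the marker check as a function keeps the directory walk testable without touching the real filesystem.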

Key improvements:
- Robust path resolution works regardless of test working directory
- Wait/retry pattern prevents race conditions and flaky tests
- Comprehensive job completion verification with timeout handling
- Clear error messages for debugging
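
The wait/retry pattern mentioned above can be sketched as a polling loop with a deadline. Everything here is an assumption for illustration: the `jobStatus` type, the `waitForJob` function, and the choice to retry on transient fetch errors are hypothetical, and the real test may rely on existing polling helpers instead.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// jobStatus is a hypothetical, reduced view of a Job's status counters.
type jobStatus struct {
	Succeeded int
	Failed    int
}

var errTimeout = errors.New("timed out waiting for checker job to complete")

// waitForJob polls get until the job succeeds, fails, or the timeout elapses.
// Errors from get are treated as transient and retried until the deadline.
func waitForJob(get func() (jobStatus, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		st, err := get()
		if err == nil {
			if st.Failed > 0 {
				return errors.New("checker job failed: more than one simultaneous upgrade detected")
			}
			if st.Succeeded > 0 {
				return nil
			}
		}
		if time.Now().After(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	err := waitForJob(func() (jobStatus, error) {
		calls++
		if calls < 3 {
			return jobStatus{}, nil // still running
		}
		return jobStatus{Succeeded: 1}, nil
	}, 10*time.Millisecond, time.Second)
	fmt.Println("polls:", calls, "err:", err)
}
```

Checking for failure before success means a job that reports a violation is surfaced even if it also records a completed pod.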

The test will skip if fewer than 2 Machine nodes are present, as
the parallel behavior cannot be verified with a single node.

Signed-off-by: Ranjith Rajaram <ranjith@redhat.com>
@ranjithrajaram ranjithrajaram force-pushed the fix-maxparallel-bypass branch from 2c1bafc to bd640c9 Compare May 30, 2026 10:00
@openshift-ci

openshift-ci Bot commented May 30, 2026

Contributor

@ranjithrajaram: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/vsphere-e2e-operator | bd640c9 | link | true | /test vsphere-e2e-operator |
| ci/prow/vsphere-disconnected-e2e-operator | bd640c9 | link | true | /test vsphere-disconnected-e2e-operator |
| ci/prow/wicd-unit-vsphere | bd640c9 | link | true | /test wicd-unit-vsphere |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

3 participants