OCPBUGS-86704: Fix MaxParallelUpgrades bypass in userData deletion path #4155
ranjithrajaram wants to merge 2 commits into
The MaxParallelUpgrades=1 protection is bypassed during userData template changes, causing simultaneous machine deletions across MachineSets during normal WMCO upgrades.

This adds cluster-wide upgrade limit checking at two layers:
1. Before clearing pub-key-hash annotations (source prevention)
2. Before machine deletion (deletion path protection)

Root cause: When userData template changes (e.g., AWS routes, SSH retry logic, firewall rules), SecretReconciler cleared pub-key-hash annotations on ALL Machine nodes simultaneously. WindowsMachine Reconciler then used per-MachineSet isAllowedDeletion() check instead of cluster-wide markNodeAsUpgrading() lock, allowing N MachineSets = N simultaneous deletions.

Defense in depth approach:
- SecretReconciler: Check upgrade limit before clearing each node's annotation (prevents simultaneous clearing)
- WindowsMachineReconciler: Check upgrade limit before deletion (catches any edge cases)

Signed-off-by: Ranjith Rajaram <ranjith@redhat.com>
@ranjithrajaram: This pull request references Jira Issue OCPBUGS-86704, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Walkthrough

This PR introduces cluster-wide upgrade-limit gating at two critical junctures in the Windows Machine Config Operator's reconciliation loop. The secret controller now checks upgrade capacity before clearing node public-key-hash annotations, and the windowsmachine controller gates machine deletion on public-key mismatch. Both paths reuse a shared

Pre-merge checks: 5 passed, 15 failed (15 inconclusive)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ranjithrajaram
Hi @ranjithrajaram. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. Regular contributors should join the org to skip this step.
/ok-to-test
Enhances the existing testUserDataTamper test to verify that MaxParallelUpgrades=1 is enforced when userData template changes trigger machine deletions (pub-key-hash mismatch deletion path).

The test now:
1. Deploys parallel-upgrades-checker monitoring job before userData change
2. Waits for job completion with proper retry logic (prevents flaky tests)
3. Verifies the checker did not fail (which would indicate >1 simultaneous upgrades)
4. Cleans up the checker job after the test

This tests the specific scenario where the bug was bypassing MaxParallelUpgrades: userData template changes causing simultaneous deletion of machines across MachineSets.

The implementation leverages existing infrastructure:
- hack/e2e/resources/parallel-upgrade-checker-job.yaml (monitoring job)
- testUserDataTamper (already triggers userData changes)
- waitForNewMachineNodes (already waits for machine recreation)

Helper functions added:
- getRepoRoot() - Resolves repository root for robust path handling
- deployParallelUpgradesChecker() - Deploys the monitoring job
- verifyParallelUpgradesChecker() - Waits for job completion and checks for violations
- cleanupParallelUpgradesChecker() - Removes the job

Key improvements:
- Robust path resolution works regardless of test working directory
- Wait/retry pattern prevents race conditions and flaky tests
- Comprehensive job completion verification with timeout handling
- Clear error messages for debugging

The test will skip if fewer than 2 Machine nodes are present, as the parallel behavior cannot be verified with a single node.

Signed-off-by: Ranjith Rajaram <ranjith@redhat.com>
2c1bafc to bd640c9
@ranjithrajaram: The following tests failed, say
Trying to address JIRA OCPBUGS-86704
The MaxParallelUpgrades=1 protection is bypassed during userData template changes, causing simultaneous machine deletions across MachineSets during normal WMCO upgrades.
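This adds cluster-wide upgrade limit checking at two layers:
1. Before clearing pub-key-hash annotations (source prevention)
2. Before machine deletion (deletion path protection)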
Root cause: When the userData template changes (e.g., AWS routes, SSH retry logic, firewall rules), SecretReconciler cleared pub-key-hash annotations on ALL Machine nodes simultaneously. WindowsMachineReconciler then used the per-MachineSet isAllowedDeletion() check instead of the cluster-wide markNodeAsUpgrading() lock, so N MachineSets could have N simultaneous deletions.
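The difference between the two kinds of check can be shown with a toy simulation. This is a sketch under stated assumptions, not WMCO code: the `machine` type and the functions `allowedPerMachineSet`, `allowedClusterWide`, `simulate`, and `newFleet` are hypothetical stand-ins, and the real isAllowedDeletion() and markNodeAsUpgrading() may behave differently in detail.

```go
package main

import "fmt"

// machine is a hypothetical, simplified stand-in for a Machine object.
type machine struct {
	name       string
	machineSet string
	deleting   bool
}

// allowedPerMachineSet mimics a per-MachineSet check: a deletion is allowed
// as long as no other machine in the same MachineSet is being deleted.
func allowedPerMachineSet(all []machine, m machine) bool {
	for _, other := range all {
		if other.machineSet == m.machineSet && other.deleting {
			return false
		}
	}
	return true
}

// allowedClusterWide mimics a cluster-wide check: a deletion is allowed only
// while fewer than maxParallel machines are being deleted anywhere.
func allowedClusterWide(all []machine, maxParallel int) bool {
	inFlight := 0
	for _, other := range all {
		if other.deleting {
			inFlight++
		}
	}
	return inFlight < maxParallel
}

// simulate walks the machines once, marks each one as deleting when the
// given check allows it, and returns the number of simultaneous deletions.
func simulate(machines []machine, allowed func([]machine, machine) bool) int {
	count := 0
	for i := range machines {
		if allowed(machines, machines[i]) {
			machines[i].deleting = true
			count++
		}
	}
	return count
}

// newFleet returns four machines spread across three MachineSets.
func newFleet() []machine {
	return []machine{
		{name: "win-a-1", machineSet: "set-a"},
		{name: "win-a-2", machineSet: "set-a"},
		{name: "win-b-1", machineSet: "set-b"},
		{name: "win-c-1", machineSet: "set-c"},
	}
}

func main() {
	perSet := simulate(newFleet(), allowedPerMachineSet)
	clusterWide := simulate(newFleet(), func(all []machine, _ machine) bool {
		return allowedClusterWide(all, 1)
	})
	fmt.Println("per-MachineSet check:", perSet)
	fmt.Println("cluster-wide check:", clusterWide)
}
```

With three MachineSets, the per-MachineSet check lets three deletions start at once, while the cluster-wide check with a limit of 1 lets only one start.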
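Defense in depth approach:
- SecretReconciler: Check upgrade limit before clearing each node's annotation (prevents simultaneous clearing)
- WindowsMachineReconciler: Check upgrade limit before deletion (catches any edge cases)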