Skip to content

fix: clear canaryStatus when rollback cancellation completes#340

Open
LittleChimera wants to merge 1 commit into
openkruise:masterfrom
LittleChimera:fix-rollback-stale-canary-status
Open

fix: clear canaryStatus when rollback cancellation completes#340
LittleChimera wants to merge 1 commit into
openkruise:masterfrom
LittleChimera:fix-rollback-stale-canary-status

Conversation

@LittleChimera

Copy link
Copy Markdown

Ⅰ. Describe what this PR does

When a canary release is cancelled via rollback-directly, doFinalising drains the BatchRelease and restores traffic routing correctly, but canaryStatus is left frozen at the step-machine state captured when the rollback was triggered (e.g. currentStepIndex: 1, currentStepState: StepPaused, podTemplateHash: <failed-revision>). The rollout settles into phase: Healthy with Progressing: False / Completed, yet the stale canaryStatus makes it look stuck at "step N paused", and kubectl kruise rollout approve is a no-op because the controller no longer dispatches reconcileRolloutProgressing while phase is Healthy.

This PR mirrors what handleContinuousRelease already does: clear the sub-status when the cancellation cleanup is complete, before flipping the Progressing condition to Completed.

Ⅱ. Does this pull request fix one issue?

Fixes #339

Ⅲ. Describe how to verify it

A regression test was added: TestReconcileRolloutProgressing/ReconcileRolloutProgressing_cancelling_->_completed_clears_canaryStatus.

It sets up a rollout mid-cancellation with stale canary status (StepPaused at step 1, podTemplateHash pointing at the failed revision, canaryReplicas: 1), drives reconcileRolloutProgressing through the last finalising step, and asserts that after done=true the rollout has:

  • Progressing: False / Completed
  • Succeeded: False
  • canaryStatus: nil

Without the fix the test fails because the stale canaryStatus is preserved.

Ⅳ. Special notes for reviews

The clear is intentionally scoped to the rollback/cancellation path (ProgressingReasonCancelling branch). The success path (ProgressingReasonFinalising) is left untouched because there the canaryStatus reflects the version that was actually deployed and is a useful observability artefact after a successful rollout.

@kruise-bot kruise-bot requested review from furykerry and veophi May 17, 2026 19:03
@kruise-bot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fillzpp for approval by writing /assign @fillzpp in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented May 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.57%. Comparing base (b2600e9) to head (39dab28).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #340      +/-   ##
==========================================
+ Coverage   51.38%   51.57%   +0.19%     
==========================================
  Files          66       66              
  Lines        8559     8560       +1     
==========================================
+ Hits         4398     4415      +17     
+ Misses       3575     3556      -19     
- Partials      586      589       +3     
Flag Coverage Δ
unittests 51.57% <100.00%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

When a canary release is cancelled via rollback-directly, doFinalising
drains the BatchRelease and restores traffic routing, but canaryStatus
is left frozen at the step-machine state captured when the rollback was
triggered (e.g. currentStepIndex: 1, currentStepState: StepPaused,
podTemplateHash: <failed-revision>). The rollout settles into
phase: Healthy with Progressing: False/Completed, yet the stale
canaryStatus makes it look stuck at "step N paused", and
kubectl kruise rollout approve is a no-op because the controller no
longer dispatches reconcileRolloutProgressing while phase is Healthy.

Mirror what handleContinuousRelease already does: clear the sub-status
when cancellation cleanup is complete, before flipping the Progressing
condition to Completed.

Signed-off-by: Luka Skugor <luka.skugor@chimeramail.com>
@LittleChimera LittleChimera force-pushed the fix-rollback-stale-canary-status branch from 91562eb to 39dab28 Compare May 17, 2026 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] After rollback-directly cancellation, canaryStatus retains stale "step paused" state

2 participants