
feat: Deployment MinReadySeconds progressive delivery strategy #343

Open
Xio-Shark wants to merge 20 commits into openkruise:master from Xio-Shark:proposal/deployment-min-ready-seconds

Conversation

@Xio-Shark

Contributor

Summary

Implement a MinReadySeconds-based progressive delivery strategy for native Kubernetes Deployments, as described in proposal #341.

Instead of pausing the Deployment controller (Recreate style), this approach inflates minReadySeconds and progressDeadlineSeconds to take control of the rollout pace, then progressively adjusts maxUnavailable to drive batch-by-batch updates — while the native Deployment controller remains active.

Key Design Decisions

  • No new API fields — routed entirely by the MinReadySecondsStrategy feature gate (per review feedback)
  • True availability calculation — uses the original minReadySeconds and pod Ready condition LastTransitionTime to count genuinely available replicas, not just Ready pods
  • Inflation persistence — three-layer protection: Initialize(), ensureInflatedDeploymentStrategy() before each batch, and webhook enforceMinReadyInflation() on every Deployment mutation
  • PDB compatible — Deployment controller stays active, so PodDisruptionBudgets work normally
  • maxSurge preserved — user-configured maxSurge > 0 is respected; only maxUnavailable is driven by the controller
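
The inflation and restore idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the `deployment` struct, the annotation keys, and the inflated values are all hypothetical stand-ins (the real constants live in minready_constants.go and are not shown in this conversation). The sketch records the originals only once and ignores the continuous-release refresh discussed later in the review.

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical annotation keys and inflated values, chosen for illustration only.
// The deadline is kept larger than minReadySeconds because Deployment validation
// requires progressDeadlineSeconds to be greater than minReadySeconds.
const (
	annoOriginalMinReady = "example.io/original-min-ready-seconds"
	annoOriginalDeadline = "example.io/original-progress-deadline-seconds"
	inflatedMinReady     = int32(1_000_000_000)
	inflatedDeadline     = inflatedMinReady + 1
)

// deployment is a stripped-down stand-in for appsv1.Deployment.
type deployment struct {
	Annotations             map[string]string
	MinReadySeconds         int32
	ProgressDeadlineSeconds int32
}

// inflate records the user's values once, then overwrites them with large values.
// Calling it again must not overwrite the recorded originals.
func inflate(d *deployment) {
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	if _, ok := d.Annotations[annoOriginalMinReady]; !ok {
		d.Annotations[annoOriginalMinReady] = strconv.Itoa(int(d.MinReadySeconds))
		d.Annotations[annoOriginalDeadline] = strconv.Itoa(int(d.ProgressDeadlineSeconds))
	}
	d.MinReadySeconds = inflatedMinReady
	d.ProgressDeadlineSeconds = inflatedDeadline
}

// restore writes the recorded originals back and removes the bookkeeping annotations.
func restore(d *deployment) error {
	minReady, err := strconv.ParseInt(d.Annotations[annoOriginalMinReady], 10, 32)
	if err != nil {
		return fmt.Errorf("restore minReadySeconds: %w", err)
	}
	deadline, err := strconv.ParseInt(d.Annotations[annoOriginalDeadline], 10, 32)
	if err != nil {
		return fmt.Errorf("restore progressDeadlineSeconds: %w", err)
	}
	d.MinReadySeconds, d.ProgressDeadlineSeconds = int32(minReady), int32(deadline)
	delete(d.Annotations, annoOriginalMinReady)
	delete(d.Annotations, annoOriginalDeadline)
	return nil
}

func main() {
	d := &deployment{MinReadySeconds: 10, ProgressDeadlineSeconds: 600}
	inflate(d)
	inflate(d) // second call keeps the recorded originals intact
	fmt.Println("inflated:", d.MinReadySeconds, d.ProgressDeadlineSeconds)
	if err := restore(d); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	fmt.Println("restored:", d.MinReadySeconds, d.ProgressDeadlineSeconds)
}
```

Recording the originals before the first overwrite is what lets a repeated inflation (for example from a webhook on every mutation) stay idempotent.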

Changes

Core Controller (pkg/controller/batchrelease/control/partitionstyle/deployment/)

  • minready_control.go — Initialize / UpgradeBatch / Finalize / CalculateBatchContext
  • minready_batch_context.go — pod availability calculation with original minReadySeconds
  • minready_constants.go — annotations, inflated values, constants

Status & Metrics

  • minready_status.go — Rollout condition updates (MinReadyInitialized/Batching/Degraded/Finalized)
  • metrics/minready_metrics.go — Prometheus metrics for batch progress and timing

Routing & Webhook

  • batchrelease_executor.go — feature gate routing to MinReady controller
  • workload_update_handler.go — skip Paused mutation, enforce inflation on external edits

Feature Gate

  • MinReadySecondsStrategy (alpha, default off)

Tests

  • Unit: 24 MinReady-specific tests + updated existing package tests
  • Integration: batch lifecycle, concurrency, feature-gate fallback
  • E2E: basic rollout, multi-batch, PDB compatibility, degraded drift detection

Commits

  1. feat: add deployment MinReady strategy — core implementation
  2. fix: correct MinReady batch-ready semantics and feature-gate fallback — fix pod counting and executor routing
  3. fix: align MinReady implementation with proposal review — address review feedback
  4. fix: write BatchReleaseControlAnnotation and clean up finalize labels — proper controller ownership marking

Test Plan

  • go test ./pkg/... — all unit tests pass
  • CI: golangci-lint, unit-tests, E2E (Deployment/CloneSet/StatefulSet/DaemonSet on K8s 1.24/1.26/1.28)
  • Manual: verify batch progression with kubectl rollout status
  • Manual: verify Finalize restores original Deployment fields

Related

  • Proposal: #341
  • GSoC 2026 Project

@kruise-bot kruise-bot requested review from FillZpp and zmberg June 10, 2026 07:22
@kruise-bot


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zmberg for approval by writing /assign @zmberg in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented Jun 10, 2026


Codecov Report

❌ Patch coverage is 69.29461% with 222 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.84%. Comparing base (56efb03) to head (6661336).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...trol/partitionstyle/deployment/minready_control.go | 64.91% | 72 Missing and 35 partials ⚠️ |
| ...tchrelease/control/partitionstyle/control_plane.go | 59.37% | 21 Missing and 5 partials ⚠️ |
| pkg/util/parse_utils.go | 40.62% | 12 Missing and 7 partials ⚠️ |
| ...artitionstyle/deployment/minready_batch_context.go | 45.45% | 9 Missing and 9 partials ⚠️ |
| ...hrelease/control/partitionstyle/minready_status.go | 82.05% | 7 Missing and 7 partials ⚠️ |
| ...ontroller/batchrelease/metrics/minready_metrics.go | 62.96% | 5 Missing and 5 partials ⚠️ |
| ...ol/partitionstyle/deployment/minready_constants.go | 76.92% | 5 Missing and 4 partials ⚠️ |
| ...g/controller/batchrelease/batchrelease_executor.go | 68.75% | 4 Missing and 1 partial ⚠️ |
| ...bhook/workload/mutating/workload_update_handler.go | 88.88% | 3 Missing and 2 partials ⚠️ |
| pkg/util/pod_utils.go | 0.00% | 4 Missing ⚠️ |
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #343      +/-   ##
==========================================
+ Coverage   51.38%   52.84%   +1.46%     
==========================================
  Files          66       79      +13     
  Lines        8559     9605    +1046     
==========================================
+ Hits         4398     5076     +678     
- Misses       3575     3841     +266     
- Partials      586      688     +102     
Flag Coverage Δ
unittests 52.84% <69.29%> (+1.46%) ⬆️


release, release.GetObjectKind().GroupVersionKind()))
inflateDeploymentStrategy(modified)
patch := client.MergeFromWithOptions(original, client.MergeFromWithOptimisticLock{})
return mc.client.Patch(context.TODO(), modified, patch)

Member

L78, L106, L133, L269: every mc.client.Patch(context.TODO(), ...) call in Initialize, UpgradeBatch, Finalize, and ensureInflatedDeploymentStrategy uses context.TODO() instead of the reconcile context. Consider refactoring these methods to accept a context parameter.

rc.recordMinReadyDegraded("MinReadyInitializeFailed", err)
return err
}
rc.recordMinReadyNormal(v1beta1.RolloutConditionMinReadyInitialized, "MinReadyInitialized", "MinReadySeconds strategy initialized")

Member

recordMinReadyNormal/recordMinReadyDegraded fire for ALL partition-style releases, not just MinReady. Although isMinReadyRelease() guards inside these methods, the check adds unnecessary overhead and cognitive burden. Consider moving the recording into the Initialize/UpgradeBatch/Finalize methods, and replacing the other calls to recordMinReadyNormal/recordMinReadyDegraded with plain logging.

// maxUnavailable above the batch target is a legal state after a
// scale-down (HPA or manual) and also self-heals external tampering;
// converge it back to the target instead of reporting degraded drift.
klog.Warningf("MinReadyControl.UpgradeBatch[%d]: deployment %v maxUnavailable=%d exceeds target=%d, reducing it to the target",

Member

Use klog.WarningS with structured key-value pairs: klog.WarningS(nil, "MinReady maxUnavailable exceeds target, reducing", "batch", ctx.CurrentBatch, "deployment", klog.KObj(mc.object), "maxUnavailable", current, "target", target). Note: the existing control_plane.go uses klog.Infof/klog.Warningf too, so this is a pre-existing pattern, but new code should aim higher.

// "Canary and BlueGreen cannot both be set"). With BlueGreen==nil guaranteed,
// Canary!=nil && !EnableExtraWorkloadForCanary is equivalent to the executor's
// GetRollingStyle()==Partition routing, so both sides agree on MinReady.
func shouldSkipRecreateMutationForMinReady(rollout *appsv1beta1.Rollout) bool {

Member

When the gate is OFF and a MinReady rollout is in progress, the executor correctly routes to MinReady (via the annotation fallback), but the webhook's shouldSkipRecreateMutationForMinReady only checks the feature gate.

Consider also checking DeploymentStrategyAnnotation, symmetric with the executor.
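
A sketch of the symmetric check, assuming the webhook can see the Deployment's annotations. The function name `isMinReadyActive` and the annotation key are hypothetical; the real constant is DeploymentStrategyAnnotation, whose value is not shown here, and the real function also inspects the Rollout's canary settings, which this sketch leaves out.

```go
package main

import "fmt"

// Hypothetical key standing in for DeploymentStrategyAnnotation.
const deploymentStrategyAnnotation = "example.io/deployment-strategy"

// isMinReadyActive treats a rollout as MinReady when the feature gate is on,
// or when the gate is off but the Deployment still carries the strategy
// annotation from a rollout that started while the gate was on.
func isMinReadyActive(gateEnabled bool, annotations map[string]string) bool {
	if gateEnabled {
		return true
	}
	_, inProgress := annotations[deploymentStrategyAnnotation]
	return inProgress
}

func main() {
	inFlight := map[string]string{deploymentStrategyAnnotation: "minready"}

	fmt.Println("gate on, no annotation: ", isMinReadyActive(true, nil))
	fmt.Println("gate off, annotation set:", isMinReadyActive(false, inFlight))
	fmt.Println("gate off, no annotation: ", isMinReadyActive(false, nil))
}
```

Using the same predicate on both the executor and the webhook side keeps the two from disagreeing when the gate is flipped mid-rollout.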

if err := writeOriginalAnnotations(snapshot, deployment); err != nil {
return err
}
if hasAnyOriginalAnnotation(snapshot.Annotations) {

Member

writeOriginalAnnotations already checks hasAnyOriginalAnnotation. Consider extracting the hasAnyOriginalAnnotation logic out of writeOriginalAnnotations.

return
}
maxSurge := intstr.FromInt(1)
deployment.Spec.Strategy.RollingUpdate.MaxSurge = &maxSurge

Member

Per the discussion in the proposal, it is not necessary to hack maxSurge.

if ready == nil || ready.Status != corev1.ConditionTrue {
return false
}
return ready.LastTransitionTime.Add(time.Duration(minReadySeconds)*time.Second).Before(now) ||

Member

Before(now) || Equal(now) computes the same time.Time addition twice. Consider simplifying to !After(now).

klog.Warningf("MinReadyControl.UpgradeBatch[%d]: deployment %v maxUnavailable=%d exceeds target=%d, reducing it to the target",
ctx.CurrentBatch, klog.KObj(mc.object), current, target)
}
original := mc.object.DeepCopy()

Member

it is not necessary to deepcopy the original object.

}
return nil
}
original := mc.object.DeepCopy()

Member

it is not necessary to deepcopy the original object.

if validateInflatedDeploymentStrategy(mc.object) == nil {
return nil
}
original := mc.object.DeepCopy()

Member

it is not necessary to deepcopy the original object.

// "Canary and BlueGreen cannot both be set"). With BlueGreen==nil guaranteed,
// Canary!=nil && !EnableExtraWorkloadForCanary is equivalent to the executor's
// GetRollingStyle()==Partition routing, so both sides agree on MinReady.
func shouldSkipRecreateMutationForMinReady(rollout *appsv1beta1.Rollout) bool {

Member

consider renaming to isMinReadySecondsStrategy

return &v, nil
}

func readOriginalAnnotation(annotations map[string]string, key string) (string, error) {

Member

Many funcs in this patch overuse errors to control normal logic flow; consider removing the unnecessary ones. For example, this func could return a bool (exist) instead of an error: if the key does not exist, just return exist=false. In fact, is this func necessary in the first place? Can we just use raw, ok := annotations[key]?
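
One possible shape for that suggestion, as a sketch. `lookupInt32` is a hypothetical helper, not a function from the PR: a missing key is reported through the boolean, and an error is reserved for a value that is present but malformed.

```go
package main

import (
	"fmt"
	"strconv"
)

// lookupInt32 (hypothetical) returns (value, exists, err).
// A missing key is a normal outcome and yields exists=false with a nil error.
func lookupInt32(annotations map[string]string, key string) (int32, bool, error) {
	raw, ok := annotations[key]
	if !ok {
		return 0, false, nil
	}
	v, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return 0, true, fmt.Errorf("annotation %q: %w", key, err)
	}
	return int32(v), true, nil
}

func main() {
	annotations := map[string]string{"good": "30", "bad": "thirty"}
	for _, key := range []string{"good", "missing", "bad"} {
		v, exists, err := lookupInt32(annotations, key)
		fmt.Printf("%-7s value=%d exists=%t err=%v\n", key, v, exists, err)
	}
}
```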


func parseOriginalInt32(annotations map[string]string, key string) (*int32, error) {
raw, err := readOriginalAnnotation(annotations, key)
if err != nil || raw == AnnotationValueKubernetesDefault {

Member

Why is AnnotationValueKubernetesDefault treated differently, and what does AnnotationValueKubernetesDefault mean?

return nil
}

func ensureAllOriginalAnnotations(annotations map[string]string) error {

Member

Is this func really necessary? parseOriginalInt32 will return an error anyway if a required key does not exist.

Xio-Shark and others added 5 commits June 14, 2026 17:03
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
CalculateBatchContext now counts only updated-revision Ready pods, with a
matching-ReplicaSet readyReplicas fallback, instead of Deployment
status.readyReplicas which also counted old-revision Ready pods and could
mark a batch ready before the new pods were actually ready. List owned pods
when RolloutID is set, and require non-empty pods once a batch label is
expected. The executor falls back to the Recreate controller when the
MinReadySecondsStrategy feature gate is disabled, matching the webhook skip.
Add negative unit and integration coverage for these cases.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Initialize now writes BatchReleaseControlAnnotation to mark the
Deployment as controlled by a specific BatchRelease, consistent with
the CloneSet batch release pattern. Finalize now cleans up both
BatchReleaseControlAnnotation and DeploymentStableRevisionLabel to
ensure the Deployment is fully released after rollout completes.
Also removes premature user-facing docs (quickstart, migration guide,
runbook) that will be added in a follow-up after the feature stabilizes.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
MinReady (P0/P1/P2 from code review):
- P0-2: webhook now enforces RollingUpdate/paused=false/non-nil rollingUpdate
  invariants for active MinReady rollouts; controller treats paused drift as
  drift so ensureInflatedDeploymentStrategy self-heals the freeze.
- P1-1: introduce sentinel errors and classify degraded reasons via errors.Is
  instead of matching human-readable message text.
- P1-2: UpgradeBatch converges maxUnavailable back to the batch target on
  scale-down instead of falsely reporting GitOps drift.
- P1-4: lift MinReady annotation constants to api/v1beta1; executor keeps
  routing to MinReady controller (and status keeps reporting) when the feature
  gate is disabled mid-rollout but the Deployment still carries annotations.
- P1-6: EnrollMinReadyDeployment inflates strategy synchronously at admission,
  closing the window where the native controller could observe the original
  budget before Initialize lands.
- P0-1: add batchLabelSatisfied regression matrix (shared hot path).
- P2-3: document the BlueGreen+Canary mutual-exclusion invariant source.

Repo audit:
- Restrict webhook cert dir to 0700 and private keys to 0600 (certs 0644).
- parse_utils: use NestedString to avoid panic on malformed status; surface
  json marshal/unmarshal errors instead of silently swallowing them.
- Align Dockerfile_multiarch Go builder to golang:1.20.14-alpine3.19.
- Update proposal: webhook/executor behavior and add operator runbook
  (feature-gate lifecycle, disable preconditions, controller-death recovery).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
@Xio-Shark Xio-Shark force-pushed the proposal/deployment-min-ready-seconds branch from 1bfd8cc to b850c17 Compare June 14, 2026 09:09
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
@Xio-Shark Xio-Shark force-pushed the proposal/deployment-min-ready-seconds branch from b850c17 to db836f0 Compare June 14, 2026 09:10
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
return err
}
snapshot := deployment.DeepCopy()
if hasAnyOriginalAnnotation(snapshot.Annotations) {

Member

In the continuous release case, the user-specified minReadySeconds and progressDeadlineSeconds can be updated, so we need to update the original annotations in such cases.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
finishMinReadyE2ERollout(namespace, rollout.Name)
})

It("TC2 rollback returns to the stable template", func() {

Member

plz add e2e for continuous release (v1 -> v2 -> v3)

Contributor Author

Added TC3 continuous release coverage for v1 -> v2 -> v3. The case updates user-owned minReadySeconds/progressDeadlineSeconds during an active MinReady rollout, verifies the original availability annotations are refreshed, then verifies final restore uses the v3 values.

partitiondeployment "github.com/openkruise/rollouts/pkg/controller/batchrelease/control/partitionstyle/deployment"
)

var _ = SIGDescribe("Deployment MinReadySeconds", func() {

Member

The e2e suite is not invoked in GitHub Actions. Consider duplicating .github/workflows/e2e-advanced-deployment-1.28.yaml and the related YAML files, and adding the MinReadySeconds tests.

Contributor Author

Added dedicated E2E-Deployment-MinReady workflows for Kubernetes 1.24, 1.26, and 1.28. They run ginkgo with --focus="Deployment MinReadySeconds" so the MinReadySeconds e2e suite is invoked by GitHub Actions.

@@ -0,0 +1,140 @@
/*

Member

consider organizing the related test files in a sub-package

Contributor Author

Kept these tests in the existing e2e package for this PR. test/e2e currently uses a shared suite/client/helper setup and all existing workflows execute the flat test/e2e package; moving MinReady into a sub-package would require a broader e2e harness refactor. I think that is better handled separately from this feature fix.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
@Xio-Shark Xio-Shark force-pushed the proposal/deployment-min-ready-seconds branch from 135226e to 590470c Compare June 15, 2026 02:04
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Xio-Shark and others added 3 commits June 15, 2026 22:01
Use a 4-step rollout (60% keeps target=3) so TC7 resume no longer jumps
past the convergence assertion. Move MinReady E2E into test/e2e/minready,
update CI workflows, and emit structured warning logs for maxUnavailable
convergence.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Move MinReady status/metrics recording into MinReadyControl lifecycle
methods, simplify control_plane to generic logging for non-MinReady paths,
add structured warningS logging, and consolidate original annotation
prepare/enroll helpers per review feedback.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Converge external maxUnavailable tampering during EnsureBatchPodsReadyAndLabeled
so TC7 does not depend on UpgradeBatch running only after rollout resume.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@Xio-Shark Xio-Shark force-pushed the proposal/deployment-min-ready-seconds branch from 2ab1c14 to 8a522e8 Compare June 16, 2026 06:49
Wait for step 2 pause before asserting post-resume maxUnavailable in TC7
so UpgradeBatch has applied the 60% batch target.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@Xio-Shark Xio-Shark force-pushed the proposal/deployment-min-ready-seconds branch from 8a522e8 to 9485d09 Compare June 16, 2026 07:11
TerminatingReasonInTerminating = "InTerminating"
TerminatingReasonCompleted = "Completed"

// MinReadyInitialized indicates MinReadySeconds strategy initialization has completed.

Member

Is it possible to reuse existing rollout conditions such as RolloutConditionProgressing, RolloutConditionSucceeded, and RolloutConditionTerminating?


func countUpdatedAvailablePods(pods []*corev1.Pod, updateRevision string, minReadySeconds int32, now time.Time) int32 {
return int32(util.WrappedPodCount(pods, func(pod *corev1.Pod) bool {
if !pod.DeletionTimestamp.IsZero() {

Member

plz replace deletiontimestamp checking with isPodActive

func (mc *MinReadyControl) Initialize(ctx context.Context, release *v1beta1.BatchRelease) error {
if release == nil {
err := fmt.Errorf("MinReadyControl.Initialize: release is nil")
mc.RecordOperationFailed("MinReadyInitializeFailed", err)

Member

consider simply logging an error and record condition in reconcileRolloutProgressing

Address openkruise#343 review comments:

P0-3: implement sliding window in reconcileMaxUnavailable so a large
batch target (e.g. 99 after a 1-pod canary) no longer writes
maxUnavailable in a single patch. Advance by the user's original
maxUnavailable step, waiting for the current window pods to become
available (UpdatedReadyReplicas >= current) before widening the budget.
maxUnavailable=0 (surge-only) falls back to driving the batch directly.
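
The sliding-window rule described in P0-3 can be sketched as a pure function. This is an illustration of the stated rule only; `nextMaxUnavailable` and its parameters are hypothetical, and how the PR's reconcileMaxUnavailable is actually written is not shown in this conversation.

```go
package main

import "fmt"

// nextMaxUnavailable returns the maxUnavailable value to write next.
//   current:      maxUnavailable currently set on the Deployment
//   target:       maxUnavailable required by the current batch
//   step:         the user's original maxUnavailable, used as the window size
//   updatedReady: updated-revision pods that are ready
func nextMaxUnavailable(current, target, step, updatedReady int32) int32 {
	if step <= 0 {
		return target // surge-only: drive the batch directly
	}
	if current >= target {
		return target // at or above the target: converge to it
	}
	if updatedReady < current {
		return current // current window not yet available: hold
	}
	next := current + step
	if next > target {
		next = target
	}
	return next
}

func main() {
	// Walk from a 1-pod canary toward a target of 99 with a window of 25,
	// assuming each window becomes fully ready before the next call.
	current, target, step := int32(1), int32(99), int32(25)
	for current < target {
		next := nextMaxUnavailable(current, target, step, current)
		fmt.Printf("maxUnavailable %d -> %d\n", current, next)
		current = next
	}
}
```

Holding the budget until the current window is ready is what prevents a single patch from making nearly all replicas unavailable at once.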

P1-7: converge condition/event/metrics reporting to the outer
control_plane trunk via MinReadyLifecycle; lower-layer methods now only
return errors instead of recording conditions themselves.

P1-8: replace DeletionTimestamp check with util.IsPodActive in
countUpdatedAvailablePods, mirroring upstream kubernetes
controller_utils.go.

Add 3 sliding-window unit tests; all 8 affected packages pass.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>
…o 5)

TestDeploymentMinReadyConcurrentScaleUsesLatestReplicas expected the
full batch target (10) in one patch. After P0-3, UpgradeBatch advances
maxUnavailable one sliding-window step at a time (25% of 20 replicas = 5).
Update the assertion and add a comment explaining the sliding-window
first-step semantics.

Signed-off-by: Teng Yanxi <151488904+Xio-Shark@users.noreply.github.com>

3 participants