Skip to content

OCPBUGS-88732: Treat missing cloud config outages as transient#472

Open
RadekManak wants to merge 1 commit into
openshift:mainfrom
RadekManak:regression-ci-failure-investigation
Open

OCPBUGS-88732: Treat missing cloud config outages as transient#472
RadekManak wants to merge 1 commit into
openshift:mainfrom
RadekManak:regression-ci-failure-investigation

Conversation

@RadekManak

@RadekManak RadekManak commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • keep missing source cloud configmaps on the transient retry path instead of immediately degrading CCCMO
  • preserve terminal behavior for real bad-key misconfigurations in infra.spec.cloudConfig
  • add a controller regression test that covers source config disappearance and recovery

Test plan

  • KUBEBUILDER_ASSETS=\"$(pwd)/$(go run ./vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest use 1.33.2 -p path --bin-dir ./bin --index https://raw.githubusercontent.com/openshift/api/master/envtest-releases.yaml)\" go test ./pkg/controllers

Summary by CodeRabbit

  • Bug Fixes

    • Improved cloud configuration synchronization when the infrastructure source ConfigMap is missing.
      • For managed platforms, a minimal platform-appropriate default is now initialized.
      • For non-managed scenarios, the controller now returns a terminal error and clears the related ClusterOperator status.
  • Tests

    • Enhanced reconciliation tests to better verify absence/presence of the synced ConfigMap and ClusterOperator health.
    • Added an AWS regression test for missing source → error and operator removal → recreation → successful recovery.
    • Updated BareMetal checks to assert the synced ConfigMap is not present when expected.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 74c49f82-a3a4-40ee-8c38-876df6920fc7

📥 Commits

Reviewing files that changed from the base of the PR and between 042c11c and c566c55.

📒 Files selected for processing (2)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/cloud_config_sync_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controllers/cloud_config_sync_controller.go

Walkthrough

In CloudConfigReconciler.Reconcile, the not-found path for the source cloud-config ConfigMap is split: if the platform is managed by CCCMO, sourceCM.Data is initialized with a minimal platform config; otherwise, a terminal error is returned. A new test covers the AWS case for this missing-ConfigMap path.

Changes

Missing source ConfigMap handling

Layer / File(s) Summary
Reconcile bifurcation for missing ConfigMap
pkg/controllers/cloud_config_sync_controller.go
When the source ConfigMap is absent from openshift-config, the controller bifurcates: if shouldManageManagedConfigMap is true, sourceCM.Data is initialized with minimal platform-specific config; otherwise, a terminal error is returned with the ConfigMap's namespace/name.
Test updates and coverage for bifurcation
pkg/controllers/cloud_config_sync_controller_test.go
A new AWS test verifies that deleting the infra source ConfigMap causes reconcile error and absent ClusterOperator, and that recreating it restores success with cloudConfigControllerDegraded=false and cloudConfigControllerAvailable=true. BareMetal test assertion changed to expect Get returning IsNotFound for the synced ConfigMap.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning The new transient test lacks meaningful assertion failure messages on critical assertions (lines 561, 569, 572-573). Guideline 4 requires meaningful messages; this test has none on core assertions... Add assertion messages to critical assertions in the transient test: line 561 (Expect(err).To(HaveOccurred())) needs message explaining expected error; lines 569/572/573 need messages explaining why degraded/available conditions matter f...
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: treating missing cloud config scenarios as transient failures rather than terminal errors, which is the primary objective of the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All 34 Ginkgo test names in cloud_config_sync_controller_test.go are stable and deterministic, using only static descriptive strings with no dynamic values such as pod names, timestamps, UUIDs, nod...
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added in this PR. The new tests are controller unit tests in pkg/controllers/cloud_config_sync_controller_test.go using envtest, not e2e tests. The custom check applies...
Single Node Openshift (Sno) Test Compatibility ✅ Passed The PR adds unit tests to pkg/controllers/ using envtest, not e2e tests. The custom check applies only to Ginkgo e2e tests; these controller unit tests are not e2e tests.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies controller logic only (cloud_config_sync_controller.go/.test.go), not deployment manifests. No scheduling constraints, affinity rules, node selectors, or topology-related changes introd...
Ote Binary Stdout Contract ✅ Passed The PR adds new code and tests that do not violate the OTE Binary Stdout Contract. New controller code uses klog (which logs to stderr by default), and new test cases are inside Ginkgo It() blocks...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Tests added are unit/controller tests (not e2e), run locally with envtest, contain no IPv4 assumptions, hardcoded addresses, or external connectivity requirements. Safe for IPv6-only/disconnected e...
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected in the modified cloud_config_sync_contro...
Container-Privileges ✅ Passed Pull request modifies only Go source code files (cloud_config_sync_controller.go and test file) with no changes to container/K8s manifests. The check is not applicable.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data is exposed in logs. All logging statements only output resource identifiers, platform types, and status messages. Actual cloud config content (passwords, tokens, etc.) is never lo...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@RadekManak: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test fmt
/test images
/test lint
/test okd-scos-images
/test unit
/test vendor
/test verify-deps
/test vet

The following commands are available to trigger optional jobs:

/test e2e-aws-ovn-techpreview
/test e2e-azure-manual-oidc
/test e2e-azure-ovn
/test e2e-azure-ovn-upgrade
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-ibmcloud-ovn
/test e2e-nutanix-ovn
/test e2e-openstack-ovn
/test e2e-vsphere-ovn
/test level0-clusterinfra-azure-ipi-proxy-tests
/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-vsphere-ipi-ccm

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-cloud-controller-manager-operator-main-e2e-aws-ovn
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-e2e-aws-ovn-upgrade
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-fmt
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-images
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-level0-clusterinfra-azure-ipi-proxy-tests
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-lint
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-okd-scos-images
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-unit
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-vendor
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-verify-deps
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-vet
Details

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot requested review from nrb and racheljpg June 16, 2026 13:10
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mdbooth for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RadekManak RadekManak changed the title Treat missing cloud config outages as transient OCPBUGS-88732: Treat missing cloud config outages as transient Jun 16, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 16, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@RadekManak: This pull request references Jira Issue OCPBUGS-88732, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • keep missing source cloud configmaps on the transient retry path instead of immediately degrading CCCMO
  • preserve terminal behavior for real bad-key misconfigurations in infra.spec.cloudConfig
  • add a controller regression test that covers source config disappearance and recovery

Test plan

  • KUBEBUILDER_ASSETS=\"$(pwd)/$(go run ./vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest use 1.33.2 -p path --bin-dir ./bin --index https://raw.githubusercontent.com/openshift/api/master/envtest-releases.yaml)\" go test ./pkg/controllers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

Copy link
Copy Markdown

@RadekManak: This pull request references Jira Issue OCPBUGS-88732, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • keep missing source cloud configmaps on the transient retry path instead of immediately degrading CCCMO
  • preserve terminal behavior for real bad-key misconfigurations in infra.spec.cloudConfig
  • add a controller regression test that covers source config disappearance and recovery

Test plan

  • KUBEBUILDER_ASSETS=\"$(pwd)/$(go run ./vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest use 1.33.2 -p path --bin-dir ./bin --index https://raw.githubusercontent.com/openshift/api/master/envtest-releases.yaml)\" go test ./pkg/controllers

Summary by CodeRabbit

  • Bug Fixes

  • Cloud config sync controller now cleanly handles missing source ConfigMap by either initializing with platform-specific defaults (for managed platforms) or returning appropriate error status via ClusterOperator.

  • Tests

  • Added test case verifying controller properly handles missing infrastructure cloud ConfigMap scenario with correct ClusterOperator status reporting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@RadekManak RadekManak force-pushed the regression-ci-failure-investigation branch from b53992f to 042c11c Compare June 17, 2026 11:26
@RadekManak

Copy link
Copy Markdown
Contributor Author

/testwith openshift/cloud-provider-aws/main/regression-clusterinfra-aws-ipi-ccm openshift/cloud-provider-aws#158

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/cloud_config_sync_controller_test.go`:
- Around line 431-438: The code in the deletion loop deletes ConfigMaps but does
not wait for the deletions to be observed before proceeding, which can cause
race conditions with subsequent creates. After the deletion loop that calls
cl.Delete() on each ConfigMap in existingCMs.Items, add an Eventually block that
polls to verify all ConfigMaps have been deleted from the namespace by listing
the ConfigMaps again and confirming the list is empty. This ensures the
deletions are fully observed before the test continues, preventing flaky
AlreadyExists or conflict errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ddd7f818-5aa2-4d7a-8264-0e8a3d087a27

📥 Commits

Reviewing files that changed from the base of the PR and between b53992f and 042c11c.

📒 Files selected for processing (2)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/cloud_config_sync_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controllers/cloud_config_sync_controller.go

Comment on lines +431 to +438
deleteOptions := &client.DeleteOptions{
GracePeriodSeconds: ptr.To[int64](0),
}
existingCMs := &corev1.ConfigMapList{}
Expect(cl.List(ctx, existingCMs, &client.ListOptions{Namespace: targetNamespaceName})).To(Succeed())
for _, cm := range existingCMs.Items {
Expect(cl.Delete(ctx, cm.DeepCopy(), deleteOptions)).To(Succeed())
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Wait for ConfigMap deletions to be observed before proceeding.

Line 431-438 deletes resources but doesn’t verify they’re gone; this can race with subsequent creates in the same namespace and cause flakes (AlreadyExists/conflicts).

As per coding guidelines, “Operations interacting with clusters must include timeouts” and tests should keep setup/cleanup reliable in BeforeEach/AfterEach.

Suggested fix
 		existingCMs := &corev1.ConfigMapList{}
 		Expect(cl.List(ctx, existingCMs, &client.ListOptions{Namespace: targetNamespaceName})).To(Succeed())
 		for _, cm := range existingCMs.Items {
 			Expect(cl.Delete(ctx, cm.DeepCopy(), deleteOptions)).To(Succeed())
+			Eventually(func() error {
+				return cl.Get(ctx, client.ObjectKeyFromObject(cm.DeepCopy()), &corev1.ConfigMap{})
+			}, timeout).Should(MatchError(apierrors.IsNotFound, "IsNotFound"))
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/cloud_config_sync_controller_test.go` around lines 431 - 438,
The code in the deletion loop deletes ConfigMaps but does not wait for the
deletions to be observed before proceeding, which can cause race conditions with
subsequent creates. After the deletion loop that calls cl.Delete() on each
ConfigMap in existingCMs.Items, add an Eventually block that polls to verify all
ConfigMaps have been deleted from the namespace by listing the ConfigMaps again
and confirming the list is empty. This ensures the deletions are fully observed
before the test continues, preventing flaky AlreadyExists or conflict errors.

Source: Coding guidelines

Keep brief source ConfigMap outages on the retry path so CCCMO does not immediately degrade during recovery, while preserving terminal behavior for real bad-key misconfigurations.
@RadekManak RadekManak force-pushed the regression-ci-failure-investigation branch from 042c11c to c566c55 Compare June 22, 2026 10:13
@RadekManak

Copy link
Copy Markdown
Contributor Author

/testwith openshift/cloud-provider-aws/main/regression-clusterinfra-aws-ipi-ccm openshift/cloud-provider-aws#158 https://github.com/openshift/openshift-tests-private/pull/30029

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

@RadekManak: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/level0-clusterinfra-azure-ipi-proxy-tests c566c55 link false /test level0-clusterinfra-azure-ipi-proxy-tests
ci/prow/e2e-aws-ovn c566c55 link true /test e2e-aws-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants