
FIX: Prevent operator crash when topology ConfigMap exceeds 1MB limit #2048

Open
Boomatang wants to merge 1 commit into Kuadrant:main from Boomatang:topology_configmap


@Boomatang Boomatang commented Jun 19, 2026

Member

Problem

The kuadrant-operator crashes when the topology ConfigMap exceeds the Kubernetes 1MB ConfigMap size limit. This happens at scale when approximately 1,500 or more resource sets (HTTPRoute + AuthPolicy + RateLimitPolicy) are present in the cluster. The topology DOT graph grows linearly (~785 bytes per route), and once it exceeds 1MB the API server rejects the write, causing the reconciler to enter a permanent error loop. The operator becomes unable to process any further events, effectively a denial of service.
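
As a rough sanity check on the figures above, the following sketch estimates how many resource sets fit under the limit. It assumes purely linear growth at ~785 bytes per set and ignores any fixed overhead in the ConfigMap (metadata, graph header), so the result is only indicative. The function name `resourceSetsUntil` is made up for this illustration.

```go
package main

import "fmt"

const (
	// 1,048,576 bytes, the limit quoted in the API server error message.
	configMapLimitBytes = 1 << 20
	// Approximate DOT growth per resource set, taken from the description above.
	bytesPerResourceSet = 785
)

// resourceSetsUntil returns how many resource sets fit within limit bytes,
// assuming linear growth and no fixed overhead (a simplification).
func resourceSetsUntil(limit, perSet int) int {
	return limit / perSet
}

func main() {
	// Integer division: the number of whole resource sets that still fit.
	fmt.Println(resourceSetsUntil(configMapLimitBytes, bytesPerResourceSet)) // 1335
}
```

Under these assumptions the limit is crossed somewhere past 1,335 resource sets, which is in the same range as the ~1,500 used in the reproduction below.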

Reproduction

  1. Deploy kuadrant from the operator repo:

    make local-setup
  2. Increase operator resources (speeds up reproduction, does not affect outcome):

    kubectl set resources deployment/kuadrant-operator-controller-manager -n kuadrant-system \
      --limits=cpu=4,memory=2Gi \
      --requests=cpu=4,memory=2Gi
    kubectl wait --for=condition=Available deployment/kuadrant-operator-controller-manager \
      -n kuadrant-system --timeout=120s
  3. Create the Kuadrant CR and a test namespace:

    kubectl apply -f examples/toystore/kuadrant.yaml -n kuadrant-system
    kubectl create namespace poc-0
  4. Create ~1500 resource sets (HTTPRoute + AuthPolicy + RateLimitPolicy each):

    NAMESPACE="poc-0"
    GATEWAY_NAMESPACE="gateway-system"
    GATEWAY_NAME="kuadrant-ingressgateway"
    NUMBER=1500
    
    for i in $(seq 0 $((NUMBER - 1))); do
      kubectl apply -f - <<YAML > /dev/null
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: httpbin-route-$i
      namespace: $NAMESPACE
    spec:
      parentRefs:
      - name: $GATEWAY_NAME
        namespace: $GATEWAY_NAMESPACE
      hostnames:
      - "httpbin-$i.local"
      rules:
      - backendRefs:
        - name: httpbin
          port: 80
    YAML
    
      kubectl apply -f - <<YAML > /dev/null
    apiVersion: kuadrant.io/v1
    kind: AuthPolicy
    metadata:
      name: httpbin-auth-$i
      namespace: $NAMESPACE
    spec:
      targetRef:
        group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: httpbin-route-$i
      rules:
        authentication:
          "api-key":
            apiKey:
              selector:
                matchLabels:
                  app: httpbin
    YAML
    
      kubectl apply -f - <<YAML > /dev/null
    apiVersion: kuadrant.io/v1
    kind: RateLimitPolicy
    metadata:
      name: httpbin-ratelimit-$i
      namespace: $NAMESPACE
    spec:
      targetRef:
        group: gateway.networking.k8s.io
        kind: HTTPRoute
        name: httpbin-route-$i
      limits:
        "general-user":
          rates:
          - limit: 5
            window: 10s
          counters:
          - expression: auth.identity.userid
          when:
          - predicate: "auth.identity.userid != 'bob'"
        "bob-limit":
          rates:
          - limit: 2
            window: 10s
          when:
          - predicate: "auth.identity.userid == 'bob'"
    YAML
    done
  5. Watch the operator logs:

    kubectl logs -f deployment/kuadrant-operator-controller-manager -n kuadrant-system

    Without this fix the operator will enter a permanent error loop with:

    ConfigMap "topology" is invalid: []: Too long: may not be more than 1048576 bytes
    

Solution

This PR adds a size guard to the TopologyReconciler. Before writing the topology ConfigMap, the reconciler checks the size of the serialized DOT graph against a 900KB threshold (safely under the 1MB Kubernetes limit). When the topology exceeds this threshold, a small placeholder DOT graph is written instead, preventing the API server rejection and allowing the operator to continue reconciling other resources.
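
The guard described above can be sketched roughly as follows. This is not the code from the PR: the helper name `capTopology`, the exact value of `maxTopologyBytes` (900 * 1024 is assumed here; it could equally be 900,000) and the placeholder text are all assumptions made for illustration.

```go
package main

import "fmt"

// Assumed definitions: the PR names these two constants, but their exact
// values are not shown in this conversation.
const (
	maxTopologyBytes     = 900 * 1024
	oversizedPlaceholder = `strict digraph "" { "placeholder" [label="topology too large to display"] }`
)

// capTopology (hypothetical helper) returns dot unchanged when it fits within
// maxBytes and the placeholder otherwise. The bool reports whether the
// placeholder was substituted.
func capTopology(dot string, maxBytes int) (string, bool) {
	if len(dot) > maxBytes { // len on a string counts bytes, not runes
		return oversizedPlaceholder, true
	}
	return dot, false
}

func main() {
	small := `strict digraph "" { a -> b }`
	_, replaced := capTopology(small, maxTopologyBytes)
	fmt.Println(replaced) // false: the small graph is written as-is

	huge := string(make([]byte, maxTopologyBytes+1))
	out, replaced := capTopology(huge, maxTopologyBytes)
	fmt.Println(replaced, out == oversizedPlaceholder) // true true
}
```

Passing the threshold as a parameter keeps the check independent of the constant, which makes the oversized path easy to exercise with small inputs.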

Changes

  • Add a maxTopologyBytes constant (900KB) and an oversizedPlaceholder DOT string
  • Check topology size before writing and substitute the placeholder when oversized
  • Add OpenTelemetry tracing spans, attributes, and events for observability
  • Change Client field type from *dynamic.DynamicClient to dynamic.Interface to support test fakes
  • Add span instrumentation for all error and success paths
  • Add unit tests covering create, update, no-op, oversized placeholder validation, and panic on empty namespace
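
To illustrate why widening the Client field to an interface helps testing, here is a dependency-free analogue. The types `configMapWriter`, `reconciler` and `fakeWriter` are invented for this sketch; in the PR the interface is `dynamic.Interface`, for which client-go's `dynamic/fake` package provides a fake implementation.

```go
package main

import "fmt"

// configMapWriter is a hypothetical stand-in for the part of the client the
// reconciler uses. In the PR this role is played by dynamic.Interface.
type configMapWriter interface {
	Write(namespace, name, data string) error
}

// reconciler depends on the interface, so any implementation can be injected.
type reconciler struct {
	client configMapWriter
}

// reconcile writes the DOT graph to a ConfigMap named "topology".
func (r *reconciler) reconcile(namespace, dot string) error {
	return r.client.Write(namespace, "topology", dot)
}

// fakeWriter records writes in memory. A field typed as a concrete client
// pointer would not accept a test double like this.
type fakeWriter struct {
	written map[string]string
}

func (f *fakeWriter) Write(namespace, name, data string) error {
	if f.written == nil {
		f.written = map[string]string{}
	}
	f.written[namespace+"/"+name] = data
	return nil
}

func main() {
	fake := &fakeWriter{}
	r := &reconciler{client: fake}
	if err := r.reconcile("kuadrant-system", "digraph{}"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fake.written["kuadrant-system/topology"]) // digraph{}
}
```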

Limitations

This is a temporary workaround. The placeholder means the console-plugin will not display the full topology graph at scale. A proper solution will require a different storage or serialization strategy, coordinated with the console-plugin.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Oversized topology DOT payloads are now automatically capped and replaced with a syntactically valid placeholder when exceeding the maximum size limit.
  • New Features

    • Added OpenTelemetry tracing for topology reconciliation, including span attributes, explicit lifecycle events, and error reporting.
  • Tests

    • Added unit tests covering topology ConfigMap creation, updates, unchanged detection, and placeholder validation.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0237c23f-5fa6-4813-985c-e4facac34636

📥 Commits

Reviewing files that changed from the base of the PR and between 4055738 and f7486c2.

📒 Files selected for processing (2)
  • internal/controller/topology_reconciler.go
  • internal/controller/topology_reconciler_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/controller/topology_reconciler.go
  • internal/controller/topology_reconciler_test.go

Walkthrough

TopologyReconciler gains a DOT payload size cap (maxTopologyBytes) that substitutes an oversizedPlaceholder when exceeded. The Reconcile method is instrumented with OpenTelemetry spans, attributes, and lifecycle events. The Client field and constructor are widened to dynamic.Interface. A new test file adds five unit tests and a nested-map string helper.

Changes

TopologyReconciler: size cap, OTel tracing, and interface widening

| Layer / File(s) | Summary |
| --- | --- |
| Size-cap constants, interface widening, and imports (internal/controller/topology_reconciler.go) | Adds maxTopologyBytes and oversizedPlaceholder constants, changes TopologyReconciler.Client and NewTopologyReconciler parameter from *dynamic.DynamicClient to dynamic.Interface, and extends imports with fmt and OTel attribute/codes packages. |
| Reconcile: OTel tracing and size-guard logic (internal/controller/topology_reconciler.go) | Reconcile starts an OTel span, sets ConfigMap name/namespace and topology.size_bytes attributes, substitutes oversizedPlaceholder for oversized DOT output with span error recording, and emits span events at destruct failure, creation, already-created, update, and unchanged lifecycle points. |
| Unit tests for TopologyReconciler (internal/controller/topology_reconciler_test.go) | Adds five tests covering placeholder validity, ConfigMap create, update, no-update-when-unchanged, and panic-on-empty-namespace; includes an unstructuredNestedString helper for nested map assertions. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A topology map too large to send?
No fret — a placeholder comes to the rescue, friend!
Spans now trace each ConfigMap twist,
With events and attributes none can resist.
The interface widens, the tests now stand tall,
A tidier reconciler — delightful for all! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: preventing operator crashes when topology ConfigMap exceeds the 1MB Kubernetes limit by implementing a size guard mechanism with placeholder substitution. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (3)
internal/controller/topology_reconciler.go (1)

111-111: 💤 Low value

Use direct string comparison instead of strings.Compare.

Go's strings.Compare documentation recommends: "It is usually clearer and always faster to use the built-in string comparison operators". Direct comparison d != cm.Data["topology"] is idiomatic here.

♻️ Suggested change
-	if d, found := cmTopology.Data["topology"]; !found || strings.Compare(d, cm.Data["topology"]) != 0 {
+	if d, found := cmTopology.Data["topology"]; !found || d != cm.Data["topology"] {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/topology_reconciler.go` at line 111, In the if condition
in topology_reconciler.go, replace the strings.Compare function call with a
direct string comparison operator. Instead of using strings.Compare(d,
cm.Data["topology"]) != 0, use d != cm.Data["topology"] directly. This is more
idiomatic Go code and performs better, as recommended by Go's standard library
documentation for string comparisons.
internal/controller/topology_reconciler_test.go (2)

159-180: 💤 Low value

Consider simplifying return signature since error is always nil.

The error return value is never set to non-nil; all paths return nil. The (string, bool) signature would be clearer and remove redundant assert.NilError(t, err) calls from test code.

♻️ Suggested change
-func unstructuredNestedString(obj map[string]any, fields ...string) (string, bool, error) {
+func unstructuredNestedString(obj map[string]any, fields ...string) (string, bool) {
 	current := obj
 	for i, field := range fields {
 		if i == len(fields)-1 {
 			val, ok := current[field]
 			if !ok {
-				return "", false, nil
+				return "", false
 			}
 			s, ok := val.(string)
-			return s, ok, nil
+			return s, ok
 		}
 		next, ok := current[field]
 		if !ok {
-			return "", false, nil
+			return "", false
 		}
 		current, ok = next.(map[string]any)
 		if !ok {
-			return "", false, nil
+			return "", false
 		}
 	}
-	return "", false, nil
+	return "", false
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/topology_reconciler_test.go` around lines 159 - 180, The
unstructuredNestedString function has an error return value that is never set to
non-nil, making the return signature unnecessarily complex. Simplify the
function signature by removing the error return type, changing from (string,
bool, error) to (string, bool). Then update all return statements in the
function to return only two values instead of three, removing the trailing nil
error value from each return statement (including the final return statement at
the end of the function).

22-27: ⚡ Quick win

Consider adding a test for the oversized topology reconciliation path.

This test validates the placeholder constant, but there's no test that exercises the reconciler's behaviour when topology.ToDot() exceeds maxTopologyBytes. Such a test would confirm the placeholder substitution logic works end-to-end.

Creating a topology large enough to exceed 900KB may be impractical in unit tests, but you could consider extracting the size-check logic into a testable helper or using a test-only lower threshold.
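
One way the "test-only lower threshold" option could look is sketched below. It assumes the limit is held in a package-level variable rather than a constant so that a test can lower it temporarily; `topologyLimit`, `placeholder`, `dotOrPlaceholder` and `withLimit` are hypothetical names, not identifiers from the PR.

```go
package main

import "fmt"

// topologyLimit is a hypothetical variable standing in for maxTopologyBytes.
// Being a variable rather than a constant, it can be lowered in tests.
var topologyLimit = 900 * 1024

// placeholder content is assumed; the PR's actual placeholder is not shown here.
const placeholder = `strict digraph "" { "placeholder" }`

// dotOrPlaceholder applies the size guard using the current topologyLimit.
func dotOrPlaceholder(dot string) string {
	if len(dot) > topologyLimit {
		return placeholder
	}
	return dot
}

// withLimit runs fn with topologyLimit temporarily set to n and restores the
// previous value afterwards, even if fn panics.
func withLimit(n int, fn func()) {
	old := topologyLimit
	topologyLimit = n
	defer func() { topologyLimit = old }()
	fn()
}

func main() {
	dot := `strict digraph "" { a -> b -> c }`

	// With a 16-byte limit, this small graph already counts as oversized.
	withLimit(16, func() {
		fmt.Println(dotOrPlaceholder(dot) == placeholder) // true
	})

	// The default limit is back in effect, so the graph passes through.
	fmt.Println(dotOrPlaceholder(dot) == dot) // true
}
```

The trade-off is that a mutable package-level limit is shared state, so tests that change it should not run in parallel with each other.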

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/topology_reconciler_test.go` around lines 22 - 27, The
TestTopologyReconciler_OversizedPlaceholder test validates the placeholder
constant itself but does not exercise the actual reconciler behavior when
topology.ToDot() output exceeds maxTopologyBytes. To fix this, either extract
the size-checking logic from the reconciler into a separate testable helper
function that can be called with different thresholds, or introduce a test-only
reduced maxTopologyBytes threshold that allows creating a realistically large
topology to trigger the placeholder substitution logic end-to-end. This ensures
the placeholder substitution path in the reconciler works correctly when actual
topology data exceeds the limit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d02da6c1-ebbe-4905-8855-9ebdc8ee4386

📥 Commits

Reviewing files that changed from the base of the PR and between b199cd1 and 4055738.

📒 Files selected for processing (2)
  • internal/controller/topology_reconciler.go
  • internal/controller/topology_reconciler_test.go

@Boomatang

Member Author

@R-Lawton I want to give you a heads-up on this. It will affect the console-plugin if the topology becomes too large. I am not sure how the console-plugin will handle the placeholder topology that is written in that case.

Topology workaround added to allow the creation of large topologies.

Signed-off-by: Jim Fitzpatrick <jfitzpat@redhat.com>