Cleanup orphaned WasmPlugin objects#2053
Conversation
|
Warning Review limit reached
More reviews will be available in 51 minutes and 56 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe Istio extension reconciler gains a per-gateway WasmPlugin deletion step with ChangesWasmPlugin delete permission and reconciler cleanup
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2053 +/- ##
==========================================
- Coverage 75.07% 74.99% -0.09%
==========================================
Files 127 127
Lines 12619 12638 +19
==========================================
+ Hits 9474 9478 +4
- Misses 2663 2671 +8
- Partials 482 489 +7
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
30ee33c to
ea9b977
Compare
thomasmaas
left a comment
There was a problem hiding this comment.
Overall: Reasonable temporary fix for a real upgrade problem. The approach is sound — reuse the known ExtensionName convention, delete-with-NotFound-guard, minimal RBAC additions. A few observations below.
Race condition in the create branch (line 129-138):
The WasmPlugin delete runs even when resource.Create() fails — there's no continue after the failed create (the existing // TODO: handle error already signals this gap). This means a persistent EnvoyFilter creation failure would leave the gateway unprotected. Easy fix: add continue after the create error log, or gate the cleanup on err == nil. I realize the broader race (EnvoyFilter created but not yet applied by Envoy) is acknowledged and harder to solve, but this specific case is avoidable.
Duplication (lines 134-138 and 145-149):
The cleanup block is identical in both branches. Since it should run regardless of whether an EnvoyFilter exists, it can move to a single location before the if !found check — right after desiredEnvoyFilter is built, before the branching logic.
Tracking removal:
The comments say "temporary to be removed" — worth filing a tracking issue and referencing it in a TODO so this doesn't become permanent.
RBAC (config/rbac/role.yaml):
get and list are included but the code only calls Delete. Fine to keep if you plan to add a list-before-delete guard later, otherwise they're unnecessary permissions.
No tests added: Given this is temporary and idempotent, not a hard objection, but a simple integration test (create a WasmPlugin, trigger reconcile, assert it's gone) would catch regressions if the ExtensionName convention ever changes.
ea9b977 to
e3f0ad0
Compare
|
Thanks for the feedback @thomasmaas, addressed those so it's ready once we decide how to proceed (whether we use this or not). Although I disagree with this section so I've disregarded - we should still best effort only remove once the
|
Signed-off-by: Adam Cattermole <a.d.cattermole@gmail.com>
e3f0ad0 to
9e513aa
Compare
maleck13
left a comment
There was a problem hiding this comment.
change looks reasonable to me. Only thought was around the logging of the error rather than returning it? Do we want to force a re-try if it fails
Yeah good question - I tend to agree that while both the |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/controller/istio_extension_reconciler.go`:
- Around line 137-140: The WasmPlugin cleanup in IstioExtensionReconciler is
deleting by generated name only, which can remove a user-created object with the
same name. Update the deletion flow in the reconciler to first fetch the
WasmPlugin, inspect its labels, and only call Delete when the Kuadrant
ownership/managed-by label is present. Keep the existing error handling in the
delete path and use the gateway-derived wasmPluginName and
kuadrantistio.WasmPluginsResource to locate the object.
- Line 177: The reconcile flow is only returning the accumulated write errors,
so failures from the EnvoyFilter delete/update/destruct paths can be lost and
skip retries. Update the reconcile logic in IstioExtensionReconciler to add
those path errors into the same accumulator used by reconcileErr before
returning, and keep returning the joined error from the final return site so any
failed write operation is retried.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0b042144-1ed2-476a-9139-bb0d79a8f258
📒 Files selected for processing (4)
bundle/manifests/kuadrant-operator.clusterserviceversion.yamlcharts/kuadrant-operator/templates/manifests.yamlconfig/rbac/role.yamlinternal/controller/istio_extension_reconciler.go
Signed-off-by: Adam Cattermole <a.d.cattermole@gmail.com>
253f885 to
ed89ee9
Compare
|
I'm collecting errors and return after we've tried all at the end - addressed the creation/update/delete/destruct errors to also retrigger (the TODO has been there for 1 year 8 months..) |
|
This should be OK alongside clear documentation on the upgrade path, besides, it would be required that this approach is understood/accepted by Product / team leads / architects, since there are a couple of questions in the description that require some answers. It would be ideal to have a test in our testsuite asserting the failure to upgrade before this fix and the correct enforcement of the policies afterwards. I'm particularly drawn to the alternative of having communication between the shim and operator, making sure the correct configuration is there and as a corollary the actual enforcement state in the policies |
thomasmaas
left a comment
There was a problem hiding this comment.
All feedback from prior review addressed. The code is in good shape:
- Cleanup placement is correct — only runs when EnvoyFilter already exists in topology, avoiding the race where WasmPlugin is removed before the replacement is in place.
- Error handling — the second commit properly accumulates all write errors into
reconcileErrand returns them for retry. This also fixes the pre-existing// TODO: handle errorgaps for EnvoyFilter create/update/delete. - RBAC minimal — only
deleteonwasmplugins, nothing more. - Ownership —
kuadrant-prefix convention is sufficient per maleck13's comment; no need for get+label-check.
Minor: the controller.Destruct error at line 167-169 (update branch) still only logs+continues without adding to reconcileErr, but that's a pre-existing pattern for a non-API-server error and not introduced by this PR.
Agree with Didier that this needs product/architect sign-off on the upgrade path and documentation before shipping, but the code itself is ready.
| if err := resource.Delete(ctx, existingEnvoyFilter.GetName(), metav1.DeleteOptions{}); err != nil { | ||
| logger.Error(err, "failed to delete envoyfilter object", "gateway", gatewayKey.String(), "envoyfilter", fmt.Sprintf("%s/%s", existingEnvoyFilter.GetNamespace(), existingEnvoyFilter.GetName())) | ||
| // TODO: handle error | ||
| reconcileErr = errors.Join(reconcileErr, fmt.Errorf("failed to delete envoyfilter %s/%s: %w", existingEnvoyFilter.GetNamespace(), existingEnvoyFilter.GetName(), err)) |
There was a problem hiding this comment.
Slightly worried about the proliferation of this pattern in the STOW reconciliation workflow. Failing the entire workflow because one task returned an error won't automatically trigger a retry as one coming from a controller-runtime background would expect.
I sincerely believe we should give #2043 some attention.
There was a problem hiding this comment.
I recalled seeing that PR but that's my mistake for not investigating what it was doing before returning the error here. What would you advise to proceed here - a follow up removing the errors or leaving as is for the imminent patch release?
There was a problem hiding this comment.
I guess the error handling is not required for the patch which is more about deleting the WasmPlugin leftovers, right? I would rollback ed89ee9.
There was a problem hiding this comment.
That's right, luckily I kept the changes separate here so here's a follow up to revert that commit only #2067
We recently migrated away from
WasmPlugintoEnvoyFilterfor our wasm configuration to configureallow_on_headers_stop_iteration, to move toward supporting request body processing.As part of this work we missed the cleanup of existing
WasmPluginobjects. This would result in the envoy configuration having two merged copies of the wasm config. The protection policy configuration for one of these would be snapshotted at the state at the time of upgrade, and the other would reflect current state. This would result in overly restrictive policies where double auth/double rate limiting is happening on each request.A workaround to mitigate, the existing
WasmPluginobjects in gateway namespaces, with the kuadrant ownership label should be removed, once it’s been confirmed that the newEnvoyFilteris in place - guide outlines this here https://gist.github.com/adam-cattermole/5dd1fa53aa36af57254170776ad53601.Questions
This PR implements a naive solution to remove the orphaned
WasmPluginobjects as part of the regular reconcile logic. This comes with a major racey caveat that there is no guarantee theEnvoyFilterhas been applied correctly before theWasmPluginis removed (there's no status on these and it requires extracting the envoy config from the gateway) - leaving the upstream unprotected.Curious on thoughts and how to proceed here, perhaps @maleck13, @guicassolato
Alternatives
EnvoyFilterwould ever be ready for a gateway - does the initContainer have timeout conditions, could this make our upgrade process flakey and error prone?Enforced, and in this scenario, we would know when it's safe to remove theWasmPlugin. This however is not a small amount of work and I'm concerned about rushing this for the fix here.Summary by CodeRabbit