Cleanup orphaned WasmPlugin objects#2053

Merged
adam-cattermole merged 2 commits into main from wasmplugin-cleanup on Jun 26, 2026

@adam-cattermole adam-cattermole commented Jun 22, 2026

Member

We recently migrated our wasm configuration from WasmPlugin to EnvoyFilter so that we can configure allow_on_headers_stop_iteration, as a step toward supporting request body processing.

As part of this work we missed the cleanup of existing WasmPlugin objects. As a result, the Envoy configuration ends up with two merged copies of the wasm config: the protection policy configuration in one is frozen at its state at the time of upgrade, while the other reflects the current state. This leads to overly restrictive policies, with double auth and double rate limiting on each request.

As a workaround, the existing WasmPlugin objects in gateway namespaces that carry the kuadrant ownership label should be removed once it has been confirmed that the new EnvoyFilter is in place. This guide outlines the steps: https://gist.github.com/adam-cattermole/5dd1fa53aa36af57254170776ad53601.

Questions

This PR implements a naive solution that removes the orphaned WasmPlugin objects as part of the regular reconcile logic. It comes with a major caveat: it is racy, because there is no guarantee the EnvoyFilter has been applied correctly before the WasmPlugin is removed (there's no status on these, and checking requires extracting the Envoy config from the gateway), which could leave the upstream unprotected.

  • What's our stance on upgrades, and uptime - do we have guides on expected behaviour for an istio version upgrade for example?
  • Is it acceptable to have this behaviour documented with a warning indicating that for workloads where service protection is critical you cut off access to your cluster/gateways prior to upgrade?

Curious on thoughts and how to proceed here, perhaps @maleck13, @guicassolato

Alternatives

  • I've considered using an initContainer as part of the upgrade process instead of the reconciler. I see one major drawback: there's no guarantee the EnvoyFilter would ever be ready for a gateway. Does the initContainer have timeout conditions, and could this make our upgrade process flaky and error-prone?
  • We have discussed having proper communication between the wasm-shim and the kuadrant-operator. We now have a channel where these communicate, so there's the option of having the wasm-shim confirm it has loaded the config for a specific identifier. This would mean we can update policy statuses much more reliably and say at what time they are Enforced, and in this scenario we would know when it's safe to remove the WasmPlugin. However, this is not a small amount of work and I'm concerned about rushing it for the fix here.

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of configuration updates so failed changes are reported properly instead of being silently ignored.
    • Added safer cleanup for related Istio resources during reconciliation, helping prevent stale resources from lingering.
  • Chores
    • Updated generated deployment metadata and permissions to match the latest resource access requirements.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 253f885 and ed89ee9.

📒 Files selected for processing (1)
  • internal/controller/istio_extension_reconciler.go

Walkthrough

The Istio extension reconciler gains a per-gateway WasmPlugin deletion step with NotFound-tolerant error accumulation, replacing the previous hard nil return. A new kubebuilder RBAC annotation grants delete on wasmplugins, propagated to config/rbac/role.yaml, the Helm chart, and the OLM CSV.

Changes

WasmPlugin delete permission and reconciler cleanup

| Layer / File(s) | Summary |
|---|---|
| Reconciler: WasmPlugin deletion, error accumulation, and RBAC annotation (internal/controller/istio_extension_reconciler.go) | Adds k8s.io/apimachinery/pkg/api/errors import and a //+kubebuilder:rbac annotation for delete on wasmplugins; introduces a reconcileErr accumulator; inserts a per-gateway WasmPlugin delete call (ignoring NotFound); joins EnvoyFilter create errors into the accumulator; returns the accumulated error instead of always nil. |
| RBAC manifests: propagate delete permission for wasmplugins (config/rbac/role.yaml, charts/kuadrant-operator/templates/manifests.yaml, bundle/manifests/kuadrant-operator.clusterserviceversion.yaml) | Adds the extensions.istio.io/wasmplugins/delete rule to the ClusterRole, Helm chart template, and OLM CSV cluster permissions; bumps the CSV createdAt timestamp. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • alexsnaps
  • crstrn13

Poem

🐇 A plugin left behind? Not on my watch!
I'll delete it cleanly, down to the last notch.
With errors collected, not silently lost,
The reconciler tidies — whatever the cost.
NotFound? No bother, I'll hop right along,
The RBAC is granted, the manifests are strong! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarises the main change: cleaning up orphaned WasmPlugin resources. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@codecov

codecov Bot commented Jun 22, 2026


Codecov Report

❌ Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.99%. Comparing base (2aeb878) to head (ed89ee9).
⚠️ Report is 22 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/controller/istio_extension_reconciler.go | 60.00% | 3 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2053      +/-   ##
==========================================
- Coverage   75.07%   74.99%   -0.09%     
==========================================
  Files         127      127              
  Lines       12619    12638      +19     
==========================================
+ Hits         9474     9478       +4     
- Misses       2663     2671       +8     
- Partials      482      489       +7     

| Flag | Coverage Δ |
|---|---|
| bare-k8s-integration | 20.08% <0.00%> (-0.13%) ⬇️ |
| controllers-integration | 53.38% <60.00%> (-0.08%) ⬇️ |
| envoygateway-integration | 41.01% <0.00%> (+0.27%) ⬆️ |
| gatewayapi-integration | 16.00% <0.00%> (-0.11%) ⬇️ |
| istio-integration | 44.65% <60.00%> (-0.21%) ⬇️ |
| unit | 28.66% <0.00%> (+0.02%) ⬆️ |

| Components | Coverage Δ |
|---|---|
| api (u) | 85.88% <ø> (ø) |
| internal (u) | 76.42% <42.85%> (-0.10%) ⬇️ |
| pkg (u) | 39.13% <ø> (ø) |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/controller/istio_extension_reconciler.go | 78.82% <60.00%> (-1.07%) ⬇️ |

... and 19 files with indirect coverage changes


@thomasmaas thomasmaas left a comment

Contributor

Overall: Reasonable temporary fix for a real upgrade problem. The approach is sound — reuse the known ExtensionName convention, delete-with-NotFound-guard, minimal RBAC additions. A few observations below.

Race condition in the create branch (lines 129-138):
The WasmPlugin delete runs even when resource.Create() fails — there's no continue after the failed create (the existing // TODO: handle error already signals this gap). This means a persistent EnvoyFilter creation failure would leave the gateway unprotected. Easy fix: add continue after the create error log, or gate the cleanup on err == nil. I realize the broader race (EnvoyFilter created but not yet applied by Envoy) is acknowledged and harder to solve, but this specific case is avoidable.

Duplication (lines 134-138 and 145-149):
The cleanup block is identical in both branches. Since it should run regardless of whether an EnvoyFilter exists, it can move to a single location before the if !found check — right after desiredEnvoyFilter is built, before the branching logic.

Tracking removal:
The comments say "temporary to be removed" — worth filing a tracking issue and referencing it in a TODO so this doesn't become permanent.

RBAC (config/rbac/role.yaml):
get and list are included but the code only calls Delete. Fine to keep if you plan to add a list-before-delete guard later, otherwise they're unnecessary permissions.

No tests added: Given this is temporary and idempotent, not a hard objection, but a simple integration test (create a WasmPlugin, trigger reconcile, assert it's gone) would catch regressions if the ExtensionName convention ever changes.

@adam-cattermole

Member Author

Thanks for the feedback @thomasmaas, addressed those so it's ready once we decide how to proceed (whether we use this or not).

However, I disagree with the following suggestion, so I've disregarded it. We should still, on a best-effort basis, only remove the WasmPlugin once the EnvoyFilter is present:

> Since it should run regardless of whether an EnvoyFilter exists, it can move to a single location before the if !found check — right after desiredEnvoyFilter is built, before the branching logic.

Signed-off-by: Adam Cattermole <a.d.cattermole@gmail.com>

@maleck13 maleck13 left a comment

Collaborator

Change looks reasonable to me. My only thought was around logging the error rather than returning it. Do we want to force a retry if it fails?

@adam-cattermole

adam-cattermole commented Jun 24, 2026

Member Author

> Only thought was around the logging of the error rather than returning it? Do we want to force a re-try if it fails

Yeah, good question. I tend to agree: while both the EnvoyFilter and WasmPlugin are present, things could be overly restrictive or fail, so we should probably return an error and trigger a retry. The only caveat is that we're within the loop over gateways, so I think we should attempt them all first and then return the error, instead of erroring out of the loop. At least, I assume that's why the EnvoyFilter creation errors are also ignored.

@adam-cattermole adam-cattermole marked this pull request as ready for review June 24, 2026 13:06

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/istio_extension_reconciler.go`:
- Around line 137-140: The WasmPlugin cleanup in IstioExtensionReconciler is
deleting by generated name only, which can remove a user-created object with the
same name. Update the deletion flow in the reconciler to first fetch the
WasmPlugin, inspect its labels, and only call Delete when the Kuadrant
ownership/managed-by label is present. Keep the existing error handling in the
delete path and use the gateway-derived wasmPluginName and
kuadrantistio.WasmPluginsResource to locate the object.
- Line 177: The reconcile flow is only returning the accumulated write errors,
so failures from the EnvoyFilter delete/update/destruct paths can be lost and
skip retries. Update the reconcile logic in IstioExtensionReconciler to add
those path errors into the same accumulator used by reconcileErr before
returning, and keep returning the joined error from the final return site so any
failed write operation is retried.
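
The first inline comment (fetch, inspect labels, then delete) could look roughly like the sketch below. It uses an in-memory map instead of a Kubernetes client so that it stays self-contained. The label key `managedLabel`, the `object` and `store` types, and `deleteIfOwned` are all hypothetical, since the exact Kuadrant ownership label is not spelled out in this thread.

```go
package main

import "fmt"

// managedLabel is a hypothetical label key used only for this illustration;
// the real ownership label may differ.
const managedLabel = "example.com/managed-by-kuadrant"

// object is a minimal stand-in for a WasmPlugin's metadata.
type object struct {
	Name   string
	Labels map[string]string
}

// store is a hypothetical in-memory stand-in for the cluster, keyed by name.
type store map[string]object

// deleteIfOwned removes the named object only when it exists and carries the
// ownership label. It reports whether a delete happened.
func deleteIfOwned(s store, name string) bool {
	obj, found := s[name]
	if !found {
		return false // nothing to clean up
	}
	if obj.Labels[managedLabel] != "true" {
		return false // same name, but not ours: leave it alone
	}
	delete(s, name)
	return true
}

func main() {
	s := store{
		"kuadrant-gw-a": {Name: "kuadrant-gw-a", Labels: map[string]string{managedLabel: "true"}},
		"kuadrant-gw-b": {Name: "kuadrant-gw-b"}, // user-created, no label
	}
	fmt.Println(deleteIfOwned(s, "kuadrant-gw-a")) // owned object is removed
	fmt.Println(deleteIfOwned(s, "kuadrant-gw-b")) // unlabelled object is kept
	fmt.Println(len(s))
}
```

The trade-off is an extra read per gateway (and the matching get permission) in exchange for never deleting an object that merely shares the generated name.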

📥 Commits

Reviewing files that changed from the base of the PR and between 5edef44 and 253f885.

📒 Files selected for processing (4)
  • bundle/manifests/kuadrant-operator.clusterserviceversion.yaml
  • charts/kuadrant-operator/templates/manifests.yaml
  • config/rbac/role.yaml
  • internal/controller/istio_extension_reconciler.go

Signed-off-by: Adam Cattermole <a.d.cattermole@gmail.com>
@adam-cattermole

Member Author

I'm now collecting errors and returning them at the end, after we've attempted every gateway. I also addressed the creation/update/delete/destruct errors so they retrigger as well (the TODO has been there for 1 year 8 months...).

@didierofrivia

Member

This should be OK alongside clear documentation on the upgrade path. It would also need to be understood and accepted by Product, team leads and architects, since a couple of questions in the description still require answers. Ideally we would have a test in our testsuite asserting the failure to upgrade before this fix and the correct enforcement of the policies afterwards.

I'm particularly drawn to the alternative of having communication between the shim and operator, making sure the correct configuration is there and, as a corollary, reflecting the actual enforcement state in the policies.

@thomasmaas thomasmaas left a comment

Contributor

All feedback from prior review addressed. The code is in good shape:

  • Cleanup placement is correct — only runs when EnvoyFilter already exists in topology, avoiding the race where WasmPlugin is removed before the replacement is in place.
  • Error handling — the second commit properly accumulates all write errors into reconcileErr and returns them for retry. This also fixes the pre-existing // TODO: handle error gaps for EnvoyFilter create/update/delete.
  • RBAC minimal — only delete on wasmplugins, nothing more.
  • Ownership: the kuadrant- prefix convention is sufficient per maleck13's comment; no need for get+label-check.

Minor: the controller.Destruct error at lines 167-169 (update branch) still only logs and continues without adding to reconcileErr, but that's a pre-existing pattern for a non-API-server error and not introduced by this PR.

Agree with Didier that this needs product/architect sign-off on the upgrade path and documentation before shipping, but the code itself is ready.

@adam-cattermole adam-cattermole added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit 374d055 Jun 26, 2026
43 of 44 checks passed
@adam-cattermole adam-cattermole deleted the wasmplugin-cleanup branch June 26, 2026 09:13
@github-project-automation github-project-automation Bot moved this to Done in Kuadrant Jun 26, 2026
if err := resource.Delete(ctx, existingEnvoyFilter.GetName(), metav1.DeleteOptions{}); err != nil {
	logger.Error(err, "failed to delete envoyfilter object", "gateway", gatewayKey.String(), "envoyfilter", fmt.Sprintf("%s/%s", existingEnvoyFilter.GetNamespace(), existingEnvoyFilter.GetName()))
	// TODO: handle error
	reconcileErr = errors.Join(reconcileErr, fmt.Errorf("failed to delete envoyfilter %s/%s: %w", existingEnvoyFilter.GetNamespace(), existingEnvoyFilter.GetName(), err))

Contributor

Slightly worried about the proliferation of this pattern in the STOW reconciliation workflow. Failing the entire workflow because one task returned an error won't automatically trigger a retry as one coming from a controller-runtime background would expect.

I sincerely believe we should give #2043 some attention.

Member Author

I recalled seeing that PR, but it's my mistake for not investigating what it was doing before returning the error here. How would you advise proceeding: a follow-up removing the errors, or leaving it as is for the imminent patch release?

Contributor

I guess the error handling is not required for the patch, which is more about deleting the WasmPlugin leftovers, right? I would roll back ed89ee9.

Member Author

That's right. Luckily I kept the changes separate, so here's a follow-up that reverts only that commit: #2067

6 participants