fix: add cleanup of published ResourceSlices on shutdown #67
Open
ttsuuubasa wants to merge 4 commits into
- Add cleanupResourceSlices defer in StartCDIManager to stop controllers and DeleteCollection slices by spec.driver field selector
- Wait for the manager goroutine after SIGTERM so cleanup finishes before the process exits
- Grant deletecollection verb on resourceslices to the ClusterRole
Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
Also filter by empty spec.nodeName so per-node ResourceSlices published by other drivers sharing the same driver name are not accidentally deleted on shutdown. Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
On shutdown, list the driver's cluster-scoped ResourceSlices and delete only those whose pool name matches one we published, instead of DeleteCollection by driver name. This avoids removing slices owned by another driver that happens to share the same driver name. Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
….yaml Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>
Summary
Ensure the driver deletes the ResourceSlices it published when it is shut down. Previously, on SIGTERM the driver simply stopped without removing its slices, leaving stale ResourceSlice objects in the cluster.
Motivation
The driver currently leaves the ResourceSlice objects it published in the cluster even after it is shut down. This is especially problematic when switching between dynamic scaling (DDS + CDI DRA) and manual scaling (CRO only): leftover slices from CDI DRA interfere with the manual mode, forcing operators to identify and delete those slices by hand.
Reclaiming slices automatically on CDI DRA shutdown removes that manual step and reduces operator burden. It also improves the uninstall experience.
This change targets the normal shutdown path (e.g., scaling the Deployment from 1 to 0, or a regular pod termination). Failure-mode shutdowns (crash, OOMKill, node failure, etc.) are out of scope — orphaned slices in those cases are accepted as an acknowledged limitation.
Changes
- pkg/manager/manager.go: add cleanupResourceSlices, invoked via defer in StartCDIManager. It calls Controller.Stop() for each driver, then lists the cluster-scoped slices for that driver (FieldSelector: spec.driver=<name>,spec.nodeName=) and deletes only those whose pool name matches a pool we published. A fresh 30s context is used since the parent context is already canceled at this point.
- main.go: on SIGTERM, call cancel() and then wait on errChan so the deferred cleanup in StartCDIManager completes before the process exits.
Design notes
- Controller.Stop() is called explicitly even though ctx cancellation alone would stop the goroutines, because only Stop() blocks via wg.Wait() and thus prevents a race where an in-flight publish recreates a slice we just deleted.
- Slices are listed and deleted one by one after matching on pool name, rather than removed with DeleteCollection with spec.driver=<name>. This guards against deleting slices owned by another driver that happens to share the same driver name.
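
To make the selection rule concrete, here is a minimal sketch of the filter in plain Go, without client-go. The `slice` struct, the `selectForCleanup` helper, and the driver and pool names are hypothetical stand-ins chosen for illustration; they are not the types or functions used in this PR. The first check mirrors what the field selector spec.driver=<name>,spec.nodeName= would return, and the second is the pool-name match.

```go
package main

import "fmt"

// slice is a hypothetical, simplified stand-in for a ResourceSlice.
// Only the fields that matter for the shutdown filter are modeled.
type slice struct {
	Name     string
	Driver   string
	NodeName string
	Pool     string
}

// selectForCleanup returns the slices the shutdown path would delete:
// same driver, cluster-scoped (empty node name), and a pool we published.
func selectForCleanup(all []slice, driver string, publishedPools map[string]struct{}) []slice {
	var out []slice
	for _, s := range all {
		// Equivalent of the field selector spec.driver=<name>,spec.nodeName=
		if s.Driver != driver || s.NodeName != "" {
			continue
		}
		// Skip slices from pools this process did not publish.
		if _, ok := publishedPools[s.Pool]; !ok {
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	all := []slice{
		{Name: "a", Driver: "example.com/cdi", Pool: "pool-1"},
		{Name: "b", Driver: "example.com/cdi", Pool: "foreign-pool"},
		{Name: "c", Driver: "example.com/cdi", NodeName: "node-1", Pool: "pool-1"},
		{Name: "d", Driver: "other.example.com", Pool: "pool-1"},
	}
	published := map[string]struct{}{"pool-1": {}}
	for _, s := range selectForCleanup(all, "example.com/cdi", published) {
		fmt.Println("delete", s.Name) // prints "delete a"
	}
}
```

Slice b shares the driver name but belongs to a pool we never published, so it survives. That is the case a DeleteCollection by driver name would get wrong.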