fix: add cleanup of published ResourceSlices on shutdown#67

Open
ttsuuubasa wants to merge 4 commits into CoHDI:main from ttsuuubasa:cohdi-dev

ttsuuubasa (Collaborator) commented Jun 24, 2026

Summary

Ensure the driver deletes the ResourceSlices it published when it is shut down. Previously, on SIGTERM the driver simply stopped without removing its slices, leaving stale ResourceSlice objects in the cluster.

Motivation

The driver currently leaves the ResourceSlice objects it published in the cluster even after it is shut down. This is especially problematic when switching between dynamic scaling (DDS + CDI DRA) and manual scaling (CRO only): leftover slices from CDI DRA interfere with the manual mode, forcing operators to identify and delete those slices by hand.

Reclaiming slices automatically on CDI DRA shutdown removes that manual step and reduces operator burden. It also improves the uninstall experience.

This change targets the normal shutdown path (e.g., scaling the Deployment from 1 to 0, or a regular pod termination). Failure-mode shutdowns (crash, OOMKill, node failure, etc.) are out of scope — orphaned slices in those cases are accepted as an acknowledged limitation.

Changes

  • pkg/manager/manager.go:
    • Add cleanupResourceSlices invoked via defer in StartCDIManager.
    • It calls Controller.Stop() for each driver, then lists the cluster-scoped slices for that driver (FieldSelector: spec.driver=<name>,spec.nodeName=) and deletes only those whose pool name matches a pool we published. A fresh 30s context is used since the parent context is already canceled at this point.
  • main.go:
    • On SIGTERM, explicitly cancel() and then wait on errChan so the deferred cleanup in StartCDIManager completes before the process exits.
    • Without this, the goroutine and the main routine race and the cleanup may be killed mid-flight.

Design notes

  • Controller.Stop() is called explicitly even though ctx cancellation alone would stop the goroutines, because only Stop() blocks via wg.Wait() and thus prevents a race where an in-flight publish recreates a slice we just deleted.
  • Cleanup deletes slices one-by-one after filtering by published pool name, rather than using DeleteCollection with spec.driver=<name>. This guards against deleting slices owned by another driver that happens to share the same driver name.

- Add cleanupResourceSlices defer in StartCDIManager to stop controllers
  and DeleteCollection slices by spec.driver field selector
- Wait for the manager goroutine after SIGTERM so cleanup finishes
  before the process exits
- Grant deletecollection verb on resourceslices to the ClusterRole

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>

Also filter by empty spec.nodeName so per-node ResourceSlices published by
other drivers sharing the same driver name are not accidentally deleted on shutdown.

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>

On shutdown, list the driver's cluster-scoped ResourceSlices and
delete only those whose pool name matches one we published, instead
of DeleteCollection by driver name. This avoids removing slices
owned by another driver that happens to share the same driver name.

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>

….yaml

Signed-off-by: Tsubasa Watanabe <w.tsubasa@fujitsu.com>