Drain the DRA kubelet-plugin and validator across driver reloads #203
Draft
karthikvetrivel wants to merge 1 commit into
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
This PR adds the NVIDIA DRA driver `kubelet-plugin` and the DRA validator to the set of GPU clients that the `driver-manager` drains off a node before reloading the driver. When the GPU Operator manages GPUs through Dynamic Resource Allocation (the new DRA CRD path), these pods hold the driver just like the device-plugin, GFD, and DCGM do today, so they must be drained and recreated across a driver reload. Otherwise the `kubelet-plugin` keeps serving CDI specs enumerated against the old driver and GPU claims fail until it is manually restarted.

All new behavior is gated on the `nvidia.com/gpu.deploy.dra-driver` and `nvidia.com/gpu.deploy.dra-validator` node labels, so deployments without the DRA stack are unaffected, mirroring the existing optional components (mig-manager, sandbox-*, vgpu-device-manager).

Functionally, this PR depends on companion GPU Operator changes that set those labels and add the matching `nodeSelector` to the kubelet-plugin and validator DaemonSets.

Testing
I tested the following scenarios manually on a 2-node cluster (A100 worker + T4 control-plane) running the GPU Operator DRA stack:
- [...] `nvidia.com/gpu.deploy.dra-driver=true` and `nvidia.com/gpu.deploy.dra-validator=true` and come up on each GPU node.
- [...] `Terminating` and no manual intervention. Verified the kubelet-plugin pods were recreated, re-enumerated against the new driver, and a GPU claim succeeded on each node afterward.
- [...] `Terminating` (their DRA claims cannot be released once the plugin is gone) and deadlocked the upgrade; draining the plugin last after them avoids this.
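
To make the label gating and the drain ordering described above concrete, here is a minimal Python sketch. It is not the `driver-manager` implementation: only the two label keys come from this PR description, while the component names, the `"true"` label value check, the placement of the validator, and the helper functions `gated_dra_components` and `drain_plan` are assumptions made for illustration.

```python
from typing import Dict, List

# Label keys are taken from the PR description. The component names used as
# dictionary keys are hypothetical and exist only for this sketch.
DRA_COMPONENT_LABELS: Dict[str, str] = {
    "dra-validator": "nvidia.com/gpu.deploy.dra-validator",
    "dra-kubelet-plugin": "nvidia.com/gpu.deploy.dra-driver",
}


def gated_dra_components(node_labels: Dict[str, str]) -> List[str]:
    """Return the DRA components enabled on a node via its labels.

    Assumption: a component is enabled when its label value is "true".
    Nodes without these labels yield an empty list, so clusters that do
    not run the DRA stack see no change in behavior.
    """
    return [
        name
        for name, key in DRA_COMPONENT_LABELS.items()
        if node_labels.get(key) == "true"
    ]


def drain_plan(node_labels: Dict[str, str], claim_holding_pods: List[str]) -> List[str]:
    """Build a drain order in which the kubelet-plugin always goes last.

    Pods holding DRA claims are drained first so the plugin is still
    running when their claims are released. Where the validator sits
    relative to those pods is an assumption of this sketch.
    """
    components = gated_dra_components(node_labels)
    plan = list(claim_holding_pods)
    plan += [c for c in components if c != "dra-kubelet-plugin"]
    if "dra-kubelet-plugin" in components:
        plan.append("dra-kubelet-plugin")
    return plan


if __name__ == "__main__":
    labels = {
        "nvidia.com/gpu.deploy.dra-driver": "true",
        "nvidia.com/gpu.deploy.dra-validator": "true",
    }
    print(drain_plan(labels, ["workload-a"]))
```

The one invariant the sketch is meant to show is that `dra-kubelet-plugin` appears after every claim-holding pod in the plan, which matches the deadlock avoidance noted in the testing results.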