Drain the DRA kubelet-plugin and validator across driver reloads by karthikvetrivel · Pull Request #203 · NVIDIA/k8s-driver-manager

karthikvetrivel · 2026-06-24T14:49:05Z

This PR adds the NVIDIA DRA driver kubelet-plugin and the DRA validator to the set of GPU clients that the driver-manager drains off a node before reloading the driver. When the GPU Operator manages GPUs through Dynamic Resource Allocation (the new DRA CRD path), these pods hold the driver just like the device-plugin, GFD, and DCGM do today, so they must be drained and recreated across a driver reload. Otherwise the kubelet-plugin keeps serving CDI specs enumerated against the old driver and GPU claims fail until it is manually restarted.

All new behavior is gated on the nvidia.com/gpu.deploy.dra-driver and nvidia.com/gpu.deploy.dra-validator node labels, so deployments without the DRA stack are unaffected, mirroring the existing optional components (mig-manager, sandbox-*, vgpu-device-manager).
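The label gating described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the function name `componentsToDrain` and the constant names are hypothetical, and treating a label value of `true` as "deployed" is an assumption based on the label values shown in the Testing section.

```go
package main

// Label keys taken from the PR description. The constant names are hypothetical.
const (
	labelDRADriver    = "nvidia.com/gpu.deploy.dra-driver"
	labelDRAValidator = "nvidia.com/gpu.deploy.dra-validator"
)

// componentsToDrain returns the DRA deploy labels that are set to "true" on
// the node. A node without the DRA stack carries neither label, so the result
// is empty and no DRA-specific drain work is scheduled for it.
func componentsToDrain(nodeLabels map[string]string) []string {
	var enabled []string
	for _, key := range []string{labelDRAValidator, labelDRADriver} {
		// Assumption: only the literal value "true" marks a component as deployed.
		if nodeLabels[key] == "true" {
			enabled = append(enabled, key)
		}
	}
	return enabled
}
```

With an empty label map the function returns nothing, which is the "deployments without the DRA stack are unaffected" case.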

Functionally, this PR depends on companion GPU Operator changes that set those labels and add the matching nodeSelector to the kubelet-plugin and validator DaemonSets.

Testing

I tested the following scenarios manually on a 2-node cluster (A100 worker + T4 control-plane) running the GPU Operator DRA stack:

Deployed the DRA stack. Verified that each GPU node is labeled with nvidia.com/gpu.deploy.dra-driver=true and nvidia.com/gpu.deploy.dra-validator=true, and that the kubelet-plugin and dra-validator come up on each GPU node.
Triggered a driver upgrade by bumping the NVIDIADriver version. Verified that the driver-manager drains the validator and GFD first, then the kubelet-plugin, reloads the driver, and restores them. Both nodes completed the rolling upgrade with no pods stuck Terminating and no manual intervention. The kubelet-plugin pods were recreated and re-enumerated against the new driver, and a GPU claim succeeded on each node afterward.
Verified that the drain ordering is required. Draining the kubelet-plugin in the same batch as the claim-holders left the validator and GFD stuck Terminating (their DRA claims cannot be released once the plugin is gone) and deadlocked the upgrade. Draining the plugin last, after the claim-holders, avoids this.
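The ordering constraint from the last scenario can be illustrated with a minimal sketch. It assumes a hypothetical `drainFunc` callback that evicts a batch of components and blocks until their pods are gone. The component names and the helpers `drainInOrder` and `draBatches` are illustrative and do not come from the PR's code.

```go
package main

import "fmt"

// drainFunc is a hypothetical callback that evicts the pods of the named
// components and returns only once they have terminated.
type drainFunc func(components []string) error

// drainInOrder drains each batch completely before starting the next one,
// and stops at the first failure.
func drainInOrder(batches [][]string, drain drainFunc) error {
	for i, batch := range batches {
		if len(batch) == 0 {
			continue
		}
		if err := drain(batch); err != nil {
			return fmt.Errorf("drain batch %d %v: %w", i, batch, err)
		}
	}
	return nil
}

// draBatches places the claim-holders (validator, GFD) in the first batch and
// the kubelet-plugin alone in the last one, so the claim-holders can release
// their DRA claims while the plugin is still running.
func draBatches() [][]string {
	return [][]string{
		{"dra-validator", "gpu-feature-discovery"}, // illustrative names
		{"dra-kubelet-plugin"},
	}
}
```

Putting all three components in a single batch would correspond to the deadlocking case described above, since the plugin could terminate before the claim-holders release their claims.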

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

Drain the DRA kubelet-plugin and validator across driver reloads

f3535f7

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

karthikvetrivel self-assigned this Jun 24, 2026

karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:kv-drain-dra-operands
