Drain the DRA kubelet-plugin and validator across driver reloads #203

Draft
karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:kv-drain-dra-operands

karthikvetrivel commented Jun 24, 2026

Member

This PR adds the NVIDIA DRA driver kubelet-plugin and the DRA validator to the set of GPU clients that the driver-manager drains from a node before reloading the driver. When the GPU Operator manages GPUs through Dynamic Resource Allocation (the new DRA CRD path), these pods hold the driver just as the device-plugin, GFD, and DCGM do today, so they must be drained before a driver reload and recreated afterward. Otherwise, the kubelet-plugin keeps serving CDI specs enumerated against the old driver, and GPU claims fail until it is manually restarted.

All new behavior is gated on the nvidia.com/gpu.deploy.dra-driver and nvidia.com/gpu.deploy.dra-validator node labels, so deployments without the DRA stack are unaffected. This mirrors how the existing optional components (mig-manager, sandbox-*, vgpu-device-manager) are handled.
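
As a rough illustration of the label gating described above (not the driver-manager's actual code), the check could look like the sketch below. The two label keys and the `true` value come from this description; the function name, the component names, and the plain-dict representation of node labels are hypothetical.

```python
# Label keys are taken from the PR description; everything else is illustrative.
DRA_DRIVER_LABEL = "nvidia.com/gpu.deploy.dra-driver"
DRA_VALIDATOR_LABEL = "nvidia.com/gpu.deploy.dra-validator"


def dra_components_to_drain(node_labels: dict) -> list:
    """Return the DRA operands to drain on a node (hypothetical helper).

    A component is selected only when its deploy label is exactly "true",
    so nodes without the DRA stack yield an empty list.
    """
    components = []
    if node_labels.get(DRA_VALIDATOR_LABEL) == "true":
        components.append("dra-validator")
    if node_labels.get(DRA_DRIVER_LABEL) == "true":
        components.append("dra-kubelet-plugin")
    return components
```

A node that carries neither label returns an empty list, which is the "unaffected" case for deployments without the DRA stack.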

Functionally, this PR depends on companion GPU Operator changes that set those labels and add the matching nodeSelector to the kubelet-plugin and validator DaemonSets.

Testing

I tested the following scenarios manually on a 2-node cluster (A100 worker + T4 control-plane) running the GPU Operator DRA stack:

  • Deploy the DRA stack and verify that each GPU node is labeled with nvidia.com/gpu.deploy.dra-driver=true and nvidia.com/gpu.deploy.dra-validator=true and that the kubelet-plugin and dra-validator come up on it.
  • Trigger a driver upgrade by bumping the NVIDIADriver version. Verified that the driver-manager drains the validator and GFD first, then the kubelet-plugin, reloads the driver, and restores them. Both nodes completed the rolling upgrade with no pods stuck in Terminating and no manual intervention. Verified that the kubelet-plugin pods were recreated and re-enumerated against the new driver, and that a GPU claim succeeded on each node afterward.
  • Verify that the drain ordering is required. Draining the kubelet-plugin in the same batch as the claim holders left the validator and GFD stuck in Terminating (their DRA claims cannot be released once the plugin is gone) and deadlocked the upgrade; draining the plugin last, after the claim holders, avoids this.
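
The ordering constraint from the last scenario can be captured in a toy model. This is only a sketch of the reasoning, not the driver-manager's implementation: the pod names, `simulate_drain`, and the worst-case assumption that a plugin deleted in the same batch disappears before the claim holders release their claims are all illustrative.

```python
# Toy model of the drain ordering; names and batching semantics are assumptions.
CLAIM_HOLDERS = {"dra-validator", "gfd"}  # pods that hold DRA claims
PLUGIN = "dra-kubelet-plugin"             # needed to release those claims


def simulate_drain(batches):
    """Return the set of pods left stuck in Terminating.

    Pods in one batch are deleted together. A claim holder can only
    terminate cleanly if the plugin is still running and is not being
    deleted in the same batch (worst case: the plugin exits first).
    """
    plugin_running = True
    stuck = set()
    for batch in batches:
        plugin_in_batch = PLUGIN in batch
        for pod in batch:
            if pod in CLAIM_HOLDERS and (plugin_in_batch or not plugin_running):
                stuck.add(pod)
        if plugin_in_batch:
            plugin_running = False
    return stuck


# Claim holders first, plugin last: nothing is stuck.
ordered = simulate_drain([{"dra-validator", "gfd"}, {PLUGIN}])

# Everything in one batch: both claim holders are stuck.
single_batch = simulate_drain([{"dra-validator", "gfd", PLUGIN}])
```

In this model the two-phase order leaves no pods stuck, while the single-batch drain strands both claim holders, matching the deadlock observed during testing.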

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
karthikvetrivel self-assigned this Jun 24, 2026