Add DRA support for GPU pod eviction during driver upgrades #204

Open
karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:kv-dra-eviction-direct

karthikvetrivel commented Jun 30, 2026

Member

The driver-manager init container evicts GPU pods from a node before reloading the NVIDIA GPU driver, but today it only detects pods that request GPUs via device-plugin resources (nvidia.com/gpu, nvidia.com/mig-*). This PR extends the pod filter to also detect pods using GPUs through Dynamic Resource Allocation (DRA): for each pod on the node, it resolves the referenced ResourceClaim(s) and checks whether any is allocated by the NVIDIA GPU DRA driver (gpu.nvidia.com).
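The detection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it works on plain dictionaries shaped like the Kubernetes Pod and ResourceClaim JSON objects, and the function names (`resolve_claim_names`, `claim_allocated_by`, `pod_uses_nvidia_gpu_dra`) and the `claims_by_name` lookup are hypothetical. It assumes templated claims are resolved through `status.resourceClaimStatuses` on the pod, and that a claim's allocating driver is read from `status.allocation.devices.results[].driver`.

```python
# Driver name taken from the PR description; everything else here is illustrative.
NVIDIA_GPU_DRA_DRIVER = "gpu.nvidia.com"


def resolve_claim_names(pod):
    """Return the ResourceClaim names a pod references, directly or via a template."""
    status = pod.get("status") or {}
    # For templated claims, the generated claim name is recorded in pod status.
    generated = {
        s["name"]: s.get("resourceClaimName")
        for s in status.get("resourceClaimStatuses") or []
    }
    names = []
    for ref in (pod.get("spec") or {}).get("resourceClaims") or []:
        name = ref.get("resourceClaimName") or generated.get(ref["name"])
        if name:
            names.append(name)
    return names


def claim_allocated_by(claim, driver):
    """True if any allocation result of the claim comes from the given driver."""
    allocation = (claim.get("status") or {}).get("allocation") or {}
    results = (allocation.get("devices") or {}).get("results") or []
    return any(r.get("driver") == driver for r in results)


def pod_uses_nvidia_gpu_dra(pod, claims_by_name):
    """True if any claim referenced by the pod is allocated by the NVIDIA GPU DRA driver.

    claims_by_name is a hypothetical stand-in for a ResourceClaim API lookup
    in the pod's namespace.
    """
    for name in resolve_claim_names(pod):
        claim = claims_by_name.get(name)
        if claim is None:
            continue  # claim not found: treat as not a GPU claim
        if claim_allocated_by(claim, NVIDIA_GPU_DRA_DRIVER):
            return True
    return False
```

In this sketch, unallocated claims, missing claims, and claims allocated by other drivers (for example compute-domain.nvidia.com) all fall through to `False`, so such pods would not be selected for eviction on DRA grounds.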

Testing

I tested the following scenarios on a two-node Kubernetes 1.34 cluster with two GPUs and the NVIDIA DRA driver installed:

  1. Create a pod that uses a GPU via a directly-named DRA ResourceClaim (allocated by gpu.nvidia.com) alongside a non-GPU pod on the same node. Verify the claim is allocated and reserved for the pod.
  2. Trigger the driver-manager eviction flow (uninstall_driver). Verify the DRA GPU pod is identified and evicted, while the non-GPU pod is skipped.
  3. Verify detection against the live ResourceClaim API across claim shapes: directly-named and templated claims allocated by gpu.nvidia.com are detected, while claims allocated by compute-domain.nvidia.com or another driver, unallocated claims, and claims that cannot be found are not.
  4. Verify traditional device-plugin pods are still detected via nvidia.com/gpu and nvidia.com/mig-* resource requests.
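
For comparison with scenario 4, the traditional device-plugin check might look like the sketch below. It is an assumption-laden illustration rather than the driver-manager's actual code: the function names are hypothetical, pods are plain dictionaries shaped like the Pod JSON object, and only regular containers are inspected (init containers are left out for brevity).

```python
# Resource names taken from the PR description.
GPU_RESOURCE = "nvidia.com/gpu"
MIG_RESOURCE_PREFIX = "nvidia.com/mig-"


def is_gpu_resource(name):
    """True for nvidia.com/gpu and any nvidia.com/mig-* resource name."""
    return name == GPU_RESOURCE or name.startswith(MIG_RESOURCE_PREFIX)


def pod_requests_device_plugin_gpu(pod):
    """True if any container lists a GPU or MIG resource in its limits or requests."""
    containers = (pod.get("spec") or {}).get("containers") or []
    for container in containers:
        resources = container.get("resources") or {}
        for section in ("limits", "requests"):
            for name in resources.get(section) or {}:
                if is_gpu_resource(name):
                    return True
    return False
```

A combined filter would then treat a pod as a GPU pod when either this check or the DRA claim check matches.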

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
