add support for resetting third party gpu client pods by tariq1890 · Pull Request #177 · NVIDIA/k8s-driver-manager

tariq1890 · 2026-04-27T22:15:38Z

Problem

This commit introduces support for bouncing GPU client pods that are not directly managed by the gpu-operator. Some users deploy GPU client applications such as the DRA driver and NVSentinel alongside the gpu-operator. Because these applications are not part of the static list of operands that the k8s-driver-manager resets during driver upgrades, they are not bounced at the same time as the gpu-operator operands, and users run into issues with driver upgrades.

Solution

We introduce a new annotation (nvidia.com/gpu.client: "true") to designate a pod that may need to be reset by the k8s-driver-manager during driver upgrades. Cluster admins can apply the annotation to any pod that acts as a GPU client and needs to be reset when a driver upgrade happens.
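
To illustrate the opt-in, here is a minimal sketch of how annotated pods on a node could be selected. The `Pod` struct and the `isGPUClient` and `gpuClientsOnNode` helpers are hypothetical stand-ins written for this example; they are not the types or functions used in this PR, which talks to the Kubernetes API instead.

```go
package main

import "fmt"

// gpuClientAnnotation is the annotation key named in the PR description.
const gpuClientAnnotation = "nvidia.com/gpu.client"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type Pod struct {
	Name        string
	NodeName    string
	Annotations map[string]string
}

// isGPUClient reports whether the pod opted in with the annotation set to "true".
// Reading from a nil map is safe in Go and yields the empty string.
func isGPUClient(p Pod) bool {
	return p.Annotations[gpuClientAnnotation] == "true"
}

// gpuClientsOnNode returns the names of annotated pods scheduled on nodeName.
func gpuClientsOnNode(pods []Pod, nodeName string) []string {
	var names []string
	for _, p := range pods {
		if p.NodeName == nodeName && isGPUClient(p) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "dra-driver", NodeName: "node-a", Annotations: map[string]string{gpuClientAnnotation: "true"}},
		{Name: "web", NodeName: "node-a"},
		{Name: "nvsentinel", NodeName: "node-b", Annotations: map[string]string{gpuClientAnnotation: "true"}},
	}
	// Only the annotated pod on node-a is selected.
	fmt.Println(gpuClientsOnNode(pods, "node-a")) // prints [dra-driver]
}
```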

1. The driver-manager looks for pods on the node it's running on that carry the nvidia.com/gpu.client: "true" annotation.
2. If pods are found, the driver-manager applies a nvidia.com/gpu:NoSchedule taint. Because the taint is in place, the deleted pods won't be rescheduled on the node.
3. The pods from step 1 are then deleted. The driver-manager waits until these pods are terminated.
4. The driver-manager unloads the kernel modules and unbinds the /run/nvidia/driver dir (existing behaviour).
5. The driver-manager removes the taint from step 2.
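
The ordering of the steps above can be sketched as follows. This is only an illustration under stated assumptions: the `nodeOps` interface, `resetGPUClients`, and the recording `fakeOps` are hypothetical names invented for this example, and removing the taint in a deferred call (so it is also removed on failure) is an assumption about the cleanup path, not a copy of the PR's code.

```go
package main

import "fmt"

// nodeOps abstracts the cluster and host operations named in the steps.
// It is a hypothetical interface for illustration only.
type nodeOps interface {
	ListGPUClientPods() ([]string, error)
	AddTaint() error
	DeletePodsAndWait(pods []string) error
	UnloadDriver() error
	RemoveTaint() error
}

// resetGPUClients runs the steps in order: list, taint, delete and wait,
// unload the driver, then remove the taint.
func resetGPUClients(ops nodeOps) (err error) {
	pods, err := ops.ListGPUClientPods()
	if err != nil {
		return err
	}
	if len(pods) > 0 {
		if err := ops.AddTaint(); err != nil {
			return err
		}
		// Assumption: once applied, the taint is removed on every exit path.
		defer func() {
			if rmErr := ops.RemoveTaint(); err == nil {
				err = rmErr
			}
		}()
		if err := ops.DeletePodsAndWait(pods); err != nil {
			return err
		}
	}
	// Existing behaviour: the driver is unloaded whether or not pods were found.
	return ops.UnloadDriver()
}

// fakeOps records the order of calls so the sequence can be inspected.
type fakeOps struct {
	pods  []string
	calls []string
}

func (f *fakeOps) record(name string) { f.calls = append(f.calls, name) }

func (f *fakeOps) ListGPUClientPods() ([]string, error) {
	f.record("list")
	return f.pods, nil
}
func (f *fakeOps) AddTaint() error                  { f.record("taint"); return nil }
func (f *fakeOps) DeletePodsAndWait([]string) error { f.record("delete"); return nil }
func (f *fakeOps) UnloadDriver() error              { f.record("unload"); return nil }
func (f *fakeOps) RemoveTaint() error               { f.record("untaint"); return nil }

func main() {
	f := &fakeOps{pods: []string{"dra-driver"}}
	err := resetGPUClients(f)
	fmt.Println(f.calls, err) // prints [list taint delete unload untaint] <nil>
}
```

With no annotated pods, the same function skips the taint and delete steps and goes straight to the driver unload.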

coveralls · 2026-04-27T22:17:34Z

Coverage Report for CI Build 25175704806

Coverage decreased (-0.5%) to 5.494%

Details

Coverage decreased (-0.5%) from the base build.
Patch coverage: 134 uncovered changes across 2 files (0 of 134 lines covered, 0.0%).
1 coverage regression across 1 file.

Uncovered Changes

File	Changed	Covered	%
internal/kubernetes/client.go	107	0	0.0%
cmd/driver-manager/main.go	27	0	0.0%

Coverage Regressions

1 previously-covered line in 1 file lost coverage.

File	Lines Losing Coverage	Coverage
cmd/driver-manager/main.go	1	0.0%

Coverage Stats


Relevant Lines:	1529
Covered Lines:	84
Line Coverage:	5.49%
Coverage Strength:	0.06 hits per line

tariq1890 · 2026-04-28T20:47:26Z

Review comments addressed. Thanks @cdesiniotis!

rajathagasthya

LGTM! 🚀

tariq1890 · 2026-04-30T00:33:52Z

Requesting review on this PR as well. I'd like to merge them both at the same time.

rahulait · 2026-04-30T04:39:16Z

+
+	nvidiaGPUClientAnnotation = nvidiaDomainPrefix + "/" + "gpu.client"
+	nvidiaTaintKey            = nvidiaDomainPrefix + "/" + "gpu-driver-update"
+)


Just a note: I am not sure why we went with "gpu.deploy.*" keys in the past. I don't know whether the intent was to keep operations after the dot following gpu, or whether it was arbitrary. We could also have used "gpu.update.driver" here. I am fine with either approach; I just wanted to call this out in case there was some history behind it.

I am definitely open to better names for the taint here. It was the best I could come up with, so I'd certainly appreciate more suggestions. At the same time, I don't think the taint key naming has to adhere to the label naming conventions.

rahulait

A few very minor nitpicks, but otherwise LGTM.

rahulait · 2026-04-30T05:08:43Z

One thing that is still not clear to me is how the system will behave if eviction fails after 5 minutes and keeps failing for a long time (say, a few days, because a PDB prevents more pods from being evicted, or for some other reason). The node will remain tainted for those 5 minutes, so no pod (GPU or non-GPU) will be scheduled onto that node unless it tolerates the specific taint. On cleanup, we remove the taint and then retry, so there will be a brief window where other pods can be scheduled onto the node, and this cycle will continue until the underlying problem is fixed. Maybe that's fine and nobody will be concerned that other workloads are not scheduled onto those nodes immediately.
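
For readers unfamiliar with why a tainted node rejects pods without a matching toleration, here is a simplified sketch of the matching rule. The `Taint` and `Toleration` structs are hypothetical stand-ins for the Kubernetes API types, the sketch considers only the NoSchedule effect, and the key `nvidia.com/gpu-driver-update` is inferred from the constants in the diff above on the assumption that `nvidiaDomainPrefix` is `nvidia.com`.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes API types.
type Taint struct{ Key, Value, Effect string }

type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates applies the usual matching rules in simplified form: an empty
// effect or key on the toleration matches any effect or key, "Exists" ignores
// the value, and "Equal" (the default operator) compares values.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// schedulable reports whether every NoSchedule taint is tolerated.
func schedulable(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	// Assumed taint key, inferred from the diff; not confirmed by the source.
	taints := []Taint{{Key: "nvidia.com/gpu-driver-update", Effect: "NoSchedule"}}
	operand := []Toleration{{Key: "nvidia.com/gpu-driver-update", Operator: "Exists", Effect: "NoSchedule"}}

	fmt.Println(schedulable(operand, taints)) // prints true
	fmt.Println(schedulable(nil, taints))     // prints false
}
```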

tariq1890 · 2026-04-30T15:49:05Z

> One thing that is still not clear to me is how the system will behave if eviction fails after 5 minutes and keeps failing for a long time (say, a few days, because a PDB prevents more pods from being evicted, or for some other reason). The node will remain tainted for those 5 minutes, so no pod (GPU or non-GPU) will be scheduled onto that node unless it tolerates the specific taint.

I think it'd be better for the node to stay in this tainted state if a GPU client is stuck in the Terminating state than to have the upgrade proceed. After all, we want driver upgrades to block on these GPU clients from here on out.

> so there will be a brief window where other pods can be scheduled onto the node, and this cycle will continue until the underlying problem is fixed.

Yes, but what we would expect to happen here is for those pods to go into CrashLoopBackOff, or to block on the Toolkit being ready if they have an init container similar to that of our operands.
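
The timeout-and-cleanup behaviour discussed in this thread can be sketched as below. It is a simulation under stated assumptions, not the PR's implementation: `waitForTermination` and `evictWithTaint` are hypothetical names, the taint is modelled as a boolean, and the pod count comes from a callback rather than the Kubernetes API. The 5-minute timeout from the discussion is shortened to milliseconds so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForTermination polls remaining() until it reports zero pods or ctx expires.
func waitForTermination(ctx context.Context, interval time.Duration, remaining func() int) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if remaining() == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gpu client pods still terminating: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// evictWithTaint keeps the (simulated) taint in place while waiting and
// removes it on the way out, whether the wait succeeded or timed out.
func evictWithTaint(timeout, interval time.Duration, remaining func() int, tainted *bool) error {
	*tainted = true
	defer func() { *tainted = false }()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitForTermination(ctx, interval, remaining)
}

func main() {
	tainted := false
	// Simulate a pod that never terminates, for example one protected by a PDB.
	stuck := func() int { return 1 }

	err := evictWithTaint(50*time.Millisecond, 10*time.Millisecond, stuck, &tainted)
	// The wait times out and the taint is gone again, which is the brief
	// scheduling window raised in the review before the next retry.
	fmt.Println(errors.Is(err, context.DeadlineExceeded), tainted) // prints true false
}
```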

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>

tariq1890 requested review from cdesiniotis, karthikvetrivel and rahulait April 27, 2026 22:15

tariq1890 force-pushed the extra-gpu-client-delete branch 7 times, most recently from 5357bc8 to 545ffef April 28, 2026 02:09

tariq1890 marked this pull request as ready for review April 28, 2026 05:11

tariq1890 force-pushed the extra-gpu-client-delete branch from 545ffef to d6b9f12 April 28, 2026 19:54

cdesiniotis reviewed Apr 28, 2026

tariq1890 force-pushed the extra-gpu-client-delete branch from d6b9f12 to 75dd683 April 28, 2026 20:47

cdesiniotis approved these changes Apr 28, 2026

tariq1890 force-pushed the extra-gpu-client-delete branch 3 times, most recently from e831e93 to 6df1212 April 28, 2026 21:46

rajathagasthya reviewed Apr 28, 2026

tariq1890 force-pushed the extra-gpu-client-delete branch 2 times, most recently from 79f77a7 to 38ba15b April 29, 2026 02:30

rajathagasthya reviewed Apr 29, 2026

tariq1890 force-pushed the extra-gpu-client-delete branch from 38ba15b to 602ef29 April 29, 2026 22:03

rajathagasthya reviewed Apr 29, 2026

rajathagasthya approved these changes Apr 29, 2026

tariq1890 mentioned this pull request Apr 30, 2026

[state-driver] add new toleration to handle driver upgrades NVIDIA/gpu-operator#2408

Closed