add support for resetting third party gpu client pods #177

Open
tariq1890 wants to merge 1 commit into main from extra-gpu-client-delete

@tariq1890 tariq1890 commented Apr 27, 2026

Contributor

Problem

This commit introduces support for bouncing GPU client pods that are not directly managed by the gpu-operator. Some users deploy GPU client applications such as the DRA driver and NVSentinel alongside the gpu-operator. Because these applications are not part of the static list of operands that the k8s-driver-manager resets during driver upgrades, they are not bounced at the same time as the gpu-operator operands, and users run into issues with driver upgrades.

Solution

We introduce a new annotation (nvidia.com/gpu.client: "true") to designate a pod that may need to be reset by the k8s-driver-manager during driver upgrades. Cluster admins can apply the annotation to any pod that acts as a GPU client and needs to be reset when a driver upgrade happens.
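As a rough illustration of the selection rule, the sketch below filters pods by node and by the annotation. Only the annotation key and value come from this PR; the `Pod` struct and both function names are hypothetical stand-ins for the real Kubernetes types and are not taken from the actual change.

```go
package main

import "fmt"

// gpuClientAnnotation is the annotation key described in this PR.
const gpuClientAnnotation = "nvidia.com/gpu.client"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod object.
type Pod struct {
	Name        string
	NodeName    string
	Annotations map[string]string
}

// isGPUClient reports whether the pod carries nvidia.com/gpu.client: "true".
// Reading from a nil map is safe in Go and yields the empty string.
func isGPUClient(p Pod) bool {
	return p.Annotations[gpuClientAnnotation] == "true"
}

// gpuClientsOnNode returns the annotated pods scheduled on the given node.
func gpuClientsOnNode(pods []Pod, node string) []Pod {
	var out []Pod
	for _, p := range pods {
		if p.NodeName == node && isGPUClient(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []Pod{
		{Name: "dra-driver", NodeName: "node-a", Annotations: map[string]string{gpuClientAnnotation: "true"}},
		{Name: "web", NodeName: "node-a"},
		{Name: "nvsentinel", NodeName: "node-b", Annotations: map[string]string{gpuClientAnnotation: "true"}},
	}
	for _, p := range gpuClientsOnNode(pods, "node-a") {
		fmt.Println(p.Name)
	}
}
```

The comparison is against the exact string "true", so a pod annotated with any other value is treated as not opted in; whether the real implementation is equally strict is an assumption.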

  1. The driver-manager looks for pods on the node it's running on that carry the nvidia.com/gpu.client: "true" annotation.
  2. If pods are found, the driver-manager applies a nvidia.com/gpu:NoSchedule taint to the node.
  3. Because the taint is in place, the pods deleted in the next step won't be rescheduled on the node.
  4. The pods from step 1 are then deleted. The driver-manager waits until these pods have terminated.
  5. The driver-manager performs the unloading of kernel modules and the unbinding of the /run/nvidia/driver dir (existing behaviour).
  6. The driver-manager removes the taint from step 2.
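The ordering of the steps above can be sketched as follows. This is a minimal model and not the code in this PR: the `NodeOps` interface, `resetGPUClients`, and the `recorder` fake are all hypothetical names, and the choice to remove the taint in a deferred call (so that it is also removed when a step fails) is an assumption based on the cleanup behaviour mentioned later in the review discussion.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// NodeOps is a hypothetical abstraction over the operations the
// driver-manager performs; none of these names come from the PR.
type NodeOps interface {
	ListGPUClientPods() ([]string, error)   // step 1
	AddTaint() error                        // step 2
	DeletePods(pods []string) error         // step 4
	WaitForTermination(pods []string) error // step 4 (wait)
	UnloadDriver() error                    // step 5
	RemoveTaint() error                     // step 6
}

// resetGPUClients runs the steps in order. The taint is added only when
// annotated pods exist and is removed on every exit path once added.
func resetGPUClients(ops NodeOps) (err error) {
	pods, err := ops.ListGPUClientPods()
	if err != nil {
		return fmt.Errorf("list gpu client pods: %w", err)
	}
	if len(pods) > 0 {
		if err := ops.AddTaint(); err != nil {
			return fmt.Errorf("add taint: %w", err)
		}
		defer func() {
			// Keep the first error; report a taint removal failure otherwise.
			if rmErr := ops.RemoveTaint(); rmErr != nil && err == nil {
				err = fmt.Errorf("remove taint: %w", rmErr)
			}
		}()
		if err := ops.DeletePods(pods); err != nil {
			return fmt.Errorf("delete pods: %w", err)
		}
		if err := ops.WaitForTermination(pods); err != nil {
			return fmt.Errorf("wait for termination: %w", err)
		}
	}
	if err := ops.UnloadDriver(); err != nil {
		return fmt.Errorf("unload driver: %w", err)
	}
	return nil
}

// recorder is a fake NodeOps that only records the order of calls.
type recorder struct {
	pods     []string
	failWait bool
	calls    []string
}

func (r *recorder) log(step string) { r.calls = append(r.calls, step) }

func (r *recorder) ListGPUClientPods() ([]string, error) { r.log("list"); return r.pods, nil }
func (r *recorder) AddTaint() error                      { r.log("taint"); return nil }
func (r *recorder) DeletePods([]string) error            { r.log("delete"); return nil }
func (r *recorder) UnloadDriver() error                  { r.log("unload"); return nil }
func (r *recorder) RemoveTaint() error                   { r.log("untaint"); return nil }
func (r *recorder) WaitForTermination([]string) error {
	r.log("wait")
	if r.failWait {
		return errors.New("pods still terminating")
	}
	return nil
}

func main() {
	r := &recorder{pods: []string{"dra-driver", "nvsentinel"}}
	if err := resetGPUClients(r); err != nil {
		fmt.Println("reset failed:", err)
	}
	fmt.Println(strings.Join(r.calls, " -> "))
}
```

In this sketch a failed wait returns before the driver is unloaded, which matches the stated intent that driver upgrades should block on GPU clients that have not terminated.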

coveralls commented Apr 27, 2026

Coverage Report for CI Build 25175704806

Coverage decreased (-0.5%) to 5.494%

Details

  • Coverage decreased (-0.5%) from the base build.
  • Patch coverage: 134 uncovered changes across 2 files (0 of 134 lines covered, 0.0%).
  • 1 coverage regression across 1 file.

Uncovered Changes

| File | Changed | Covered | % |
|---|---|---|---|
| internal/kubernetes/client.go | 107 | 0 | 0.0% |
| cmd/driver-manager/main.go | 27 | 0 | 0.0% |

Coverage Regressions

1 previously-covered line in 1 file lost coverage.

| File | Lines Losing Coverage | Coverage |
|---|---|---|
| cmd/driver-manager/main.go | 1 | 0.0% |

Coverage Stats

Coverage Status
Relevant Lines: 1529
Covered Lines: 84
Line Coverage: 5.49%
Coverage Strength: 0.06 hits per line

💛 - Coveralls

@tariq1890 tariq1890 force-pushed the extra-gpu-client-delete branch 7 times, most recently from 5357bc8 to 545ffef Compare April 28, 2026 02:09
@tariq1890 tariq1890 marked this pull request as ready for review April 28, 2026 05:11
@tariq1890 tariq1890 force-pushed the extra-gpu-client-delete branch from 545ffef to d6b9f12 Compare April 28, 2026 19:54
Comment thread cmd/driver-manager/main.go Outdated
Comment thread internal/kubernetes/client.go
Comment thread internal/kubernetes/client.go Outdated
@tariq1890 tariq1890 force-pushed the extra-gpu-client-delete branch from d6b9f12 to 75dd683 Compare April 28, 2026 20:47
@tariq1890

Contributor Author

Review comments addressed. Thanks @cdesiniotis !

@tariq1890 tariq1890 force-pushed the extra-gpu-client-delete branch 3 times, most recently from e831e93 to 6df1212 Compare April 28, 2026 21:46
Comment thread cmd/driver-manager/main.go Outdated
Comment thread cmd/driver-manager/main.go Outdated
Comment thread internal/kubernetes/client.go
Comment thread cmd/driver-manager/main.go Outdated
Comment thread internal/kubernetes/client.go
@tariq1890 tariq1890 force-pushed the extra-gpu-client-delete branch 2 times, most recently from 79f77a7 to 38ba15b Compare April 29, 2026 02:30
Comment thread cmd/driver-manager/main.go Outdated
Comment thread cmd/driver-manager/main.go Outdated
@tariq1890 tariq1890 force-pushed the extra-gpu-client-delete branch from 38ba15b to 602ef29 Compare April 29, 2026 22:03
Comment thread internal/kubernetes/client.go

@rajathagasthya rajathagasthya left a comment

Contributor

LGTM! 🚀

@tariq1890

Contributor Author

Requesting review on this PR as well. I'd like to merge them both at the same time

Comment thread internal/kubernetes/client.go
Comment thread internal/kubernetes/client.go Outdated
Comment thread internal/kubernetes/client.go Outdated
Comment thread internal/kubernetes/client.go Outdated
Comment thread cmd/driver-manager/main.go Outdated
Comment thread cmd/driver-manager/main.go Outdated
Comment thread cmd/driver-manager/main.go Outdated

nvidiaGPUClientAnnotation = nvidiaDomainPrefix + "/" + "gpu.client"
nvidiaTaintKey = nvidiaDomainPrefix + "/" + "gpu-driver-update"
)

Contributor

Just a note: I am not sure why we went with "gpu.deploy.*" keys in the past. I don't know whether the intent was to keep operations after the dot following gpu, or whether it was arbitrary. We could also have used "gpu.update.driver" here. I am fine with either approach; I just wanted to call this out in case there is some history behind it.

@tariq1890 tariq1890 Apr 30, 2026

Contributor Author

I am definitely open to better names for the taint here. It was the best I could come up with, so I'd certainly appreciate more suggestions. At the same time, I don't think the taint key naming has to adhere to the label naming conventions

@rahulait rahulait left a comment

Contributor

A few very minor nitpicks, but otherwise LGTM.

Comment thread cmd/driver-manager/main.go Outdated
rahulait commented Apr 30, 2026

Contributor

One thing that is still not clear to me is how the system will behave if eviction fails after 5 minutes and keeps failing for a long time (say a few days, because a PDB prevents more pods from being evicted, or for some other reason). The node will remain tainted for those 5 minutes, so no pod (GPU or non-GPU) will get scheduled onto that node unless it tolerates that specific taint. On cleanup, we'll remove the taint and then retry, so there will be a brief window where other pods can get scheduled onto the node and this will go on until fixed. Maybe that's fine and nobody will be concerned that other workloads are not getting scheduled on those nodes immediately.
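To make the toleration exception mentioned above concrete, here is a small sketch of the standard Kubernetes matching rules between a toleration and a taint, written with hypothetical simplified structs rather than the real API types. The taint key and effect are taken from the PR description; the final key may differ, since the naming is still under discussion in this review.

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical stand-ins for the
// corresponding Kubernetes API types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates applies the usual matching rules: an empty effect or key on the
// toleration acts as a wildcard, "Exists" ignores the value, and "Equal"
// (the default when the operator is empty) requires equal values.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	default:
		return false
	}
}

// canSchedule reports whether every hard taint (NoSchedule or NoExecute) on
// the node is tolerated. PreferNoSchedule is only a soft preference.
func canSchedule(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	nodeTaints := []Taint{{Key: "nvidia.com/gpu", Effect: "NoSchedule"}}
	plain := []Toleration{}
	tolerant := []Toleration{{Key: "nvidia.com/gpu", Operator: "Exists", Effect: "NoSchedule"}}
	fmt.Println("pod without toleration schedulable:", canSchedule(plain, nodeTaints))
	fmt.Println("pod with toleration schedulable:", canSchedule(tolerant, nodeTaints))
}
```

Under these rules, a pod that tolerates every taint (an empty key with the Exists operator) would still be scheduled onto the node while the driver upgrade is in progress.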

@tariq1890

Contributor Author

One thing that is still not clear to me is how the system will behave if eviction fails after 5 minutes and keeps failing for a long time (say a few days, because a PDB prevents more pods from being evicted, or for some other reason). The node will remain tainted for those 5 minutes, so no pod (GPU or non-GPU) will get scheduled onto that node unless it tolerates that specific taint.

I think it'd be better for the node to stay in this tainted state if a GPU client is stuck in the Terminating state than to have the upgrade proceed. After all, we want driver upgrades to block on these GPU clients from here on out.

so there will be a brief window where other pods can get scheduled onto the node and this will go on until fixed.

Yes, but what we would expect to happen here is for those pods to go into CrashLoopBackOff, or to block on the Toolkit being ready if they have an init container similar to that of our operands.

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
@tariq1890 tariq1890 force-pushed the extra-gpu-client-delete branch from 602ef29 to e25d906 Compare April 30, 2026 15:59

5 participants