add support for resetting third party gpu client pods#177
Conversation
Coverage Report for CI Build 25175704806Coverage decreased (-0.5%) to 5.494%Details
Uncovered Changes
Coverage Regressions1 previously-covered line in 1 file lost coverage.
Coverage Stats
💛 - Coveralls |
5357bc8 to
545ffef
Compare
545ffef to
d6b9f12
Compare
d6b9f12 to
75dd683
Compare
|
Review comments addressed. Thanks @cdesiniotis ! |
e831e93 to
6df1212
Compare
79f77a7 to
38ba15b
Compare
38ba15b to
602ef29
Compare
|
Requesting review on this PR as well. I'd like to merge them both at the same time |
|
|
||
| nvidiaGPUClientAnnotation = nvidiaDomainPrefix + "/" + "gpu.client" | ||
| nvidiaTaintKey = nvidiaDomainPrefix + "/" + "gpu-driver-update" | ||
| ) |
There was a problem hiding this comment.
Just a note: I am not sure why in the past we went with "gpu.deploy.*" keys. I don't know if the vision was to keep operations within dot after gpu or it was just random. We could have also used "gpu.update.driver" here. I am fine with either approach, just wanted to call out in case there was some history behind this.
There was a problem hiding this comment.
I am definitely open to better names for the taint here. It was the best I could come up with, so I'd certainly appreciate more suggestions. At the same time, I don't think the taint key naming has to adhere to the label naming conventions
rahulait
left a comment
There was a problem hiding this comment.
few very minor nit picks, but otherwise LGTM.
|
One thing that is still not clear to me is how the system will perform if eviction fails after 5 mins and keeps failing for a long time (say few days bcoz of a PDB preventing more pods to be evicted or something else). The node will remain tainted for those 5 mins, so no pod (gpu or without gpu) will get scheduled onto that node (unless it has the toleration for specific taint). On cleanup, we'll remove the taint and then retry, so there will be a brief window where other pods can get scheduled onto the node and this will go on until fixed. Maybe thats fine and nobody will have concerns that other workloads are not getting scheduled on those nodes immediately. |
I think it'd be better for the node to be in this tainted state if a gpu client is stuck in Terminating state rather than have the an upgrade to proceed. We want to driver upgrades to block on these gpu clients from here on out after all
Yes, but what we would expect to happen here is for those pods to go into CrashLoopBackoff or to block on Toolkit being ready if they have an init-container similar to that of our operands. |
Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
602ef29 to
e25d906
Compare
Problem
This commit introduces support for bouncing gpu client pods that are not directly managed by the gpu-operator. Currently, we have users who deploy gpu client applications like the DRA driver, NVSentinel etc along with the gpu-operator. Since these applications are not a part of the static list of operands that the k8s-driver-manager resets during driver upgrades, users have to run into issues with driver upgrades as unmanaged gpu client applications like the DRA driver and NVSentinel aren't bounced at the same time as the gpu-operator operands.
Solution
We introduce a new annotation (
nvidia.com/gpu.client: "true")to designate a pod that may need to be reset by the k8s-driver-manager during driver upgrades. This way, cluster admins can apply the annotation on any of the pods that act as gpu clients and need to be reset when a driver upgrade needs to happen.nvidia.com/gpu.client: "true"nvidia.com/gpu:NoScheduletaint./run/nvidia/driverdir (existing behaviour)