Do not remove driver when gpu.deploy.operands label is set to false#2575
Draft
cdesiniotis wants to merge 1 commit into
Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
Previously, setting the nvidia.com/gpu.deploy.operands label to 'false' removed all GPU Operator related pods from a node. With this commit, the driver is no longer removed when the gpu.deploy.operands label is set to false. To manually remove the driver pod from a node, a user has to explicitly label the node with nvidia.com/gpu.deploy.driver=false.
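The before/after behavior described above can be sketched as a small decision function. This is only an illustrative model, not the operator's actual code: the function names are hypothetical, and it assumes a GPU node on which the driver would otherwise be deployed, ignoring every other condition the operator may check.

```python
# Label keys taken from the PR description.
OPERANDS_LABEL = "nvidia.com/gpu.deploy.operands"
DRIVER_LABEL = "nvidia.com/gpu.deploy.driver"


def driver_kept_before(labels: dict) -> bool:
    """Hypothetical model of the old behavior: operands=false removes the driver too."""
    return labels.get(OPERANDS_LABEL) != "false" and labels.get(DRIVER_LABEL) != "false"


def driver_kept_after(labels: dict) -> bool:
    """Hypothetical model of the new behavior: only the driver label removes the driver."""
    return labels.get(DRIVER_LABEL) != "false"


def other_operands_kept(labels: dict) -> bool:
    """Assumed unchanged by this PR: operands=false still removes the other operands."""
    return labels.get(OPERANDS_LABEL) != "false"


if __name__ == "__main__":
    node = {OPERANDS_LABEL: "false"}
    print("driver kept (before):", driver_kept_before(node))
    print("driver kept (after): ", driver_kept_after(node))
    print("other operands kept: ", other_operands_kept(node))
```

Under this model, a node labeled only with nvidia.com/gpu.deploy.operands=false keeps its driver pod after the change, while explicitly adding nvidia.com/gpu.deploy.driver=false still removes it.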
This change provides an extra guardrail for the driver pod, whose removal from a node is highly disruptive. Additionally, this change is motivated by our future plans to integrate the NVIDIA DRA Driver for GPUs with the GPU Operator. In particular, it helps provide a possible migration story from the k8s-device-plugin software stack to the DRA driver software stack, one that leverages the nvidia.com/gpu.deploy.operands label to switch between the respective software components.
Code changes in this PR were drafted with the assistance of Claude Code.
Note: the following section of our documentation would have to be updated: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#preventing-installation-of-operands-on-some-nodes