Update daemonset for when CONFIG_MEMORY_HOTPLUG is not Present #2517
JunAr7112 wants to merge 1 commit into
Conversation
Signed-off-by: Arjun <agadiyar@nvidia.com>
@JunAr7112 Have you tested this on a system where |
Yes, I set up a StarlingX system without CONFIG_MEMORY_HOTPLUG=y and verified the result:
sysadmin@localhost:~$ POD=$(kubectl get pod -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
sysadmin@localhost:~$ kubectl describe pod -n gpu-operator "$POD"
Description
This PR is created in response to this bug. On systems that don't have CONFIG_MEMORY_HOTPLUG=y, the /sys/devices/system/memory/auto_online_blocks file is not present, so it cannot be mounted as a hostPath volume. auto_online_blocks is a Linux sysfs knob for memory hotplug.
Solution:
In manifests/state-driver/0500_daemonset.yaml and assets/state-driver/0500_daemonset.yaml, mount the wider parent directory /sys/devices/system instead of mounting /sys/devices/system/memory/auto_online_blocks directly.
Checklist
make lint
make validate-generated-assets
make validate-modules
Testing
Added TestDriverSysfsMemoryOnlineVolumeUsesStableParentDirectory, which verifies the rendered driver DaemonSet has the /sys/devices/system mount. It finds the volume named sysfs-memory-online and checks:
HostPath.Path == "/sys/devices/system"
HostPath.Type == corev1.HostPathDirectory
Finds the nvidia-driver-ctr container.
Finds that container’s sysfs-memory-online volume mount and checks:
MountPath == "/sys/devices/system"
SubPath == ""
So the test protects the exact behavior we want: the operator should mount the stable parent directory, not the optional /sys/devices/system/memory/auto_online_blocks file.
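The checks listed above can be sketched as a standalone Go program. This is not the actual TestDriverSysfsMemoryOnlineVolumeUsesStableParentDirectory; the struct types below are simplified stand-ins for the k8s.io/api/core/v1 types so that the example compiles without dependencies, and checkSysfsMount is a hypothetical name.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types the test inspects.
// The real test would import k8s.io/api/core/v1 instead.
type HostPathType string

const HostPathDirectory HostPathType = "Directory"

type HostPathVolumeSource struct {
	Path string
	Type *HostPathType
}

type Volume struct {
	Name     string
	HostPath *HostPathVolumeSource
}

type VolumeMount struct {
	Name, MountPath, SubPath string
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

type PodSpec struct {
	Volumes    []Volume
	Containers []Container
}

const (
	volumeName    = "sysfs-memory-online"
	containerName = "nvidia-driver-ctr"
	parentDir     = "/sys/devices/system"
)

// checkSysfsMount (hypothetical) returns nil when the spec mounts the parent
// directory in the way the PR description lists, and an error otherwise.
func checkSysfsMount(spec PodSpec) error {
	var vol *Volume
	for i := range spec.Volumes {
		if spec.Volumes[i].Name == volumeName {
			vol = &spec.Volumes[i]
			break
		}
	}
	if vol == nil || vol.HostPath == nil {
		return fmt.Errorf("hostPath volume %q not found", volumeName)
	}
	if vol.HostPath.Path != parentDir {
		return fmt.Errorf("volume path is %q, want %q", vol.HostPath.Path, parentDir)
	}
	if vol.HostPath.Type == nil || *vol.HostPath.Type != HostPathDirectory {
		return fmt.Errorf("volume type is not %q", HostPathDirectory)
	}
	for _, c := range spec.Containers {
		if c.Name != containerName {
			continue
		}
		for _, m := range c.VolumeMounts {
			if m.Name != volumeName {
				continue
			}
			if m.MountPath != parentDir || m.SubPath != "" {
				return fmt.Errorf("mount is %q (subPath %q), want %q with no subPath",
					m.MountPath, m.SubPath, parentDir)
			}
			return nil
		}
		return fmt.Errorf("container %q has no %q mount", containerName, volumeName)
	}
	return fmt.Errorf("container %q not found", containerName)
}

func main() {
	dirType := HostPathDirectory
	spec := PodSpec{
		Volumes: []Volume{{
			Name:     volumeName,
			HostPath: &HostPathVolumeSource{Path: parentDir, Type: &dirType},
		}},
		Containers: []Container{{
			Name:         containerName,
			VolumeMounts: []VolumeMount{{Name: volumeName, MountPath: parentDir}},
		}},
	}
	if err := checkSysfsMount(spec); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("parent directory mount verified")
}
```

Checking that SubPath is empty matters here: a subPath pointing back at memory/auto_online_blocks would reintroduce the dependency on the optional file.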