
Update daemonset for when CONFIG_MEMORY_HOTPLUG is not Present #2517

Open
JunAr7112 wants to merge 1 commit into NVIDIA:main from JunAr7112:config_host

JunAr7112 commented Jun 4, 2026

Contributor

Description

This PR was created in response to this bug. On systems whose kernel is built without CONFIG_MEMORY_HOTPLUG=y, the file /sys/devices/system/memory/auto_online_blocks is not present, so it cannot be mounted as a hostPath volume. auto_online_blocks is a Linux sysfs knob for memory hotplug.

Solution:

In manifests/state-driver/0500_daemonset.yaml and assets/state-driver/0500_daemonset.yaml, mount the parent directory /sys/devices/system instead of mounting the file /sys/devices/system/memory/auto_online_blocks directly.

Checklist

  • [x] No secrets, sensitive information, or unrelated changes
  • [x] Lint checks passing (make lint)
  • [x] Generated assets in-sync (make validate-generated-assets)
  • [x] Go mod artifacts in-sync (make validate-modules)
  • [x] Test cases are added for new code paths

Testing

Added TestDriverSysfsMemoryOnlineVolumeUsesStableParentDirectory, which verifies that the rendered driver DaemonSet has the /sys/devices/system mount. It finds the volume named sysfs-memory-online and checks:

HostPath.Path == "/sys/devices/system"
HostPath.Type == corev1.HostPathDirectory

It then finds the nvidia-driver-ctr container, finds that container's sysfs-memory-online volume mount, and checks:

MountPath == "/sys/devices/system"
SubPath == ""

So the test protects the exact behavior we want: the operator should mount the stable parent directory, not the optional /sys/devices/system/memory/auto_online_blocks file.
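
The checks above can be sketched roughly as follows. This is not the test added in the PR: to stay self-contained it uses small stand-in structs instead of the real k8s.io/api/core/v1 types, and the helper name checkSysfsMount is hypothetical. Only the volume name, container name, paths, and the "Directory" value of corev1.HostPathDirectory come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the corev1 fields the test inspects (assumption: the real
// test uses k8s.io/api/core/v1 types rather than these structs).
type hostPath struct{ Path, Type string }
type volume struct {
	Name     string
	HostPath *hostPath
}
type volumeMount struct{ Name, MountPath, SubPath string }
type container struct {
	Name         string
	VolumeMounts []volumeMount
}
type podSpec struct {
	Volumes    []volume
	Containers []container
}

const (
	volumeName    = "sysfs-memory-online"
	containerName = "nvidia-driver-ctr"
	systemDir     = "/sys/devices/system"
	hostPathDir   = "Directory" // string value of corev1.HostPathDirectory
)

// checkSysfsMount returns nil only if the volume and the driver container's
// mount both point at the parent directory with no SubPath.
func checkSysfsMount(spec podSpec) error {
	var vol *volume
	for i := range spec.Volumes {
		if spec.Volumes[i].Name == volumeName {
			vol = &spec.Volumes[i]
			break
		}
	}
	if vol == nil || vol.HostPath == nil {
		return errors.New("hostPath volume " + volumeName + " not found")
	}
	if vol.HostPath.Path != systemDir || vol.HostPath.Type != hostPathDir {
		return fmt.Errorf("unexpected hostPath: %+v", *vol.HostPath)
	}
	for _, c := range spec.Containers {
		if c.Name != containerName {
			continue
		}
		for _, m := range c.VolumeMounts {
			if m.Name != volumeName {
				continue
			}
			if m.MountPath != systemDir || m.SubPath != "" {
				return fmt.Errorf("unexpected mount: %+v", m)
			}
			return nil
		}
		return errors.New("mount " + volumeName + " not found in " + containerName)
	}
	return errors.New("container " + containerName + " not found")
}

func main() {
	spec := podSpec{
		Volumes: []volume{{Name: volumeName, HostPath: &hostPath{Path: systemDir, Type: hostPathDir}}},
		Containers: []container{{
			Name:         containerName,
			VolumeMounts: []volumeMount{{Name: volumeName, MountPath: systemDir}},
		}},
	}
	fmt.Println("check error:", checkSysfsMount(spec))
}
```

A spec that still mounts the auto_online_blocks file directly would fail the hostPath comparison, which is the regression the test is meant to catch.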

@JunAr7112 JunAr7112 marked this pull request as ready for review June 4, 2026 19:59
Signed-off-by: Arjun <agadiyar@nvidia.com>
@tariq1890

Contributor

@JunAr7112 Have you tested this on a system where CONFIG_MEMORY_HOTPLUG is unset?

JunAr7112 commented Jun 24, 2026

Contributor Author

> @JunAr7112 Have you tested this on a system where CONFIG_MEMORY_HOTPLUG is unset?

Yes, I set up a StarlingX system without CONFIG_MEMORY_HOTPLUG=y and verified the result:

sysadmin@localhost:~$ POD=$(kubectl get pod -n gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
sysadmin@localhost:~$ kubectl describe pod -n gpu-operator "$POD" \
    | awk '/^ sysfs-memory-online:/,/^ nv-firmware:/'
sysfs-memory-online:
Type: HostPath (bare host directory volume)
Path: /sys/devices/system
HostPathType: Directory

sysadmin@localhost:~$ kubectl describe pod -n gpu-operator "$POD" \
    | awk '/^Events:/,0' \
    | grep -nE 'Created container: nvidia-driver-ctr|Started container nvidia-driver-ctr|failed to mkdir|CreateContainerError|auto_online_blocks'
9: Normal Created 79s (x6 over 5m57s) kubelet Created container: nvidia-driver-ctr
10: Normal Started 79s (x6 over 5m57s) kubelet Started container nvidia-driver-ctr
