The NVIDIA DRA Driver for GPUs (k8s-dra-driver-gpu) is adding support for allocating GPUs on NVSwitch-based HGX systems via Fabric Manager partitions (e.g. for GPU passthrough / VFIO and multi-tenant NVLink isolation). This relies on nv-fabricmanager running on the host in Shared NVSwitch fabric mode, where partitions are queried and activated on demand through the FM SDK rather than activated automatically at boot.
Two things are needed from the GPU Operator's driver daemonset:
- Start nv-fabricmanager in Shared NVSwitch mode (FABRIC_MODE=1) instead of the default bare-metal mode (FABRIC_MODE=0).
- Expose the FM command/SDK socket at a known, shared location so the DRA driver can connect to it, i.e. via a configurable FM_CMD_UNIX_SOCKET_PATH.
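To make the two requested settings concrete, here is a minimal sketch of how a startup step might apply them to a `fabricmanager.cfg`-style `KEY=VALUE` file before launching nv-fabricmanager. The helper name `applyOverrides` and the socket path are hypothetical, and this is not how the driver container currently works; it only illustrates the intended end state of the config.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// applyOverrides returns cfg with every key in overrides set to its value.
// Existing KEY=VALUE lines are rewritten in place, keys that are absent are
// appended in sorted order, and comments and blank lines are preserved.
// Hypothetical helper for illustration only.
func applyOverrides(cfg string, overrides map[string]string) string {
	var lines []string
	if body := strings.TrimRight(cfg, "\n"); body != "" {
		lines = strings.Split(body, "\n")
	}
	seen := make(map[string]bool, len(overrides))
	for i, line := range lines {
		trimmed := strings.TrimSpace(line)
		if trimmed == "" || strings.HasPrefix(trimmed, "#") {
			continue // keep comments and blank lines untouched
		}
		key, _, ok := strings.Cut(trimmed, "=")
		if !ok {
			continue // not a KEY=VALUE line
		}
		key = strings.TrimSpace(key)
		if val, wanted := overrides[key]; wanted {
			lines[i] = key + "=" + val
			seen[key] = true
		}
	}
	var missing []string
	for key := range overrides {
		if !seen[key] {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing) // deterministic output for appended keys
	for _, key := range missing {
		lines = append(lines, key+"="+overrides[key])
	}
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	cfg := "# fabricmanager.cfg (excerpt)\nFABRIC_MODE=0\n"
	fmt.Print(applyOverrides(cfg, map[string]string{
		"FABRIC_MODE": "1", // Shared NVSwitch mode
		// Hypothetical path; the real location is what this request asks to make configurable.
		"FM_CMD_UNIX_SOCKET_PATH": "/run/nvidia-fabricmanager/fm-cmd.sock",
	}))
}
```

The socket path would also have to live on a volume that both the driver container and the DRA driver pod mount; that wiring is outside this sketch.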
Today, the GPU Operator's driver daemonset runs nv-fabricmanager in its default bare-metal mode and exposes its command interface only on the default loopback TCP port inside the driver container's network/mount namespace, so there is no supported way for the DRA driver to reach it or to drive partition activation.