[Feature]: Support running nv-fabricmanager in Shared NVSwitch (fabric partition) mode with a configurable command socket in the driver daemonset

The NVIDIA DRA Driver for GPUs (k8s-dra-driver-gpu) is adding support for allocating GPUs on NVSwitch-based HGX systems via Fabric Manager partitions (e.g. for GPU passthrough / VFIO and multi-tenant NVLink isolation). This relies on nv-fabricmanager running on the host in Shared NVSwitch fabric mode, where partitions are queried and activated on demand through the FM SDK rather than activated automatically at boot.

Two things are needed from the GPU Operator's driver daemonset:

1. Start nv-fabricmanager in Shared NVSwitch mode (FABRIC_MODE=1) instead of the default bare-metal mode (FABRIC_MODE=0).
2. Expose the FM command/SDK socket at a known, shared location so the DRA driver can connect to it, i.e. make FM_CMD_UNIX_SOCKET_PATH configurable.
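
To make the two requested settings concrete, here is a rough sketch of the kind of rewrite the driver container could apply to the Fabric Manager config before starting nv-fabricmanager. It assumes the usual KEY=VALUE layout of fabricmanager.cfg. The helper name `set_fm_options`, the sample config text, and the socket path `/run/nvidia-fabricmanager/fm-cmd.sock` are illustrative assumptions, not anything the GPU Operator ships today.

```python
def set_fm_options(cfg_text: str, overrides: dict[str, str]) -> str:
    """Return cfg_text with every key in `overrides` set to its new value.

    Hypothetical helper: assumes one KEY=VALUE pair per line and '#' comments.
    """
    seen = set()
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in overrides:
                # Rewrite the first occurrence, drop any later duplicates.
                if key not in seen:
                    out.append(f"{key}={overrides[key]}")
                    seen.add(key)
                continue
        out.append(line)
    # Append keys that were not present in the original text.
    for key, value in overrides.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# Illustrative input only; a real fabricmanager.cfg has many more options.
SAMPLE_CFG = "# sample config\nFABRIC_MODE=0\nFM_CMD_PORT_NUMBER=6666\n"

patched = set_fm_options(
    SAMPLE_CFG,
    {
        "FABRIC_MODE": "1",  # Shared NVSwitch mode
        # Assumed shared location; the actual path would be configurable.
        "FM_CMD_UNIX_SOCKET_PATH": "/run/nvidia-fabricmanager/fm-cmd.sock",
    },
)
print(patched)
```

In the operator, the two values would presumably come from user-facing configuration rather than being hard-coded as they are in this sketch.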

Today, the GPU Operator's driver daemonset runs nv-fabricmanager in its default bare-metal mode. Its command interface is exposed only on the default loopback TCP port inside the driver container's network/mount namespace, so there is no supported way for the DRA driver to reach it or to drive partition activation.
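
For illustration, a consumer such as the DRA driver could check reachability of a shared command socket roughly as follows. This is only a sketch: it assumes the socket is a stream-type Unix domain socket mounted into the consumer's container, and `fm_socket_reachable` is a hypothetical helper, not part of the FM SDK. It only tests that something is listening at the path and does not speak the FM protocol.

```python
import socket


def fm_socket_reachable(path: str, timeout: float = 1.0) -> bool:
    """Return True if a stream Unix socket at `path` accepts a connection."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except OSError:
            # Covers a missing path, a refused connection, and a timeout.
            return False
    return True
```

With the current loopback-only TCP setup, a check like this has nothing to connect to from outside the driver container, which is the gap this request describes.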

Issue #2552