[Feature]: Support running nv-fabricmanager in Shared NVSwitch (fabric partition) mode with a configurable command socket in the driver daemonset #2552

Description

@shengnuo

The NVIDIA DRA Driver for GPUs (k8s-dra-driver-gpu) is adding support for allocating GPUs on NVSwitch-based HGX systems via Fabric Manager partitions (e.g. for GPU passthrough / VFIO and multi-tenant NVLink isolation). This relies on nv-fabricmanager running on the host in Shared NVSwitch fabric mode, where partitions are queried and activated on demand through the FM SDK rather than activated automatically at boot.

Two things are needed from the GPU Operator's driver daemonset:

  1. Start nv-fabricmanager in Shared NVSwitch mode (FABRIC_MODE=1) instead of the default bare-metal mode (FABRIC_MODE=0).
  2. Expose the FM command/SDK socket at a known, shared location so the DRA driver can connect to it, i.e. make FM_CMD_UNIX_SOCKET_PATH configurable.
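As a rough illustration of the two settings above, the sketch below rewrites a key=value config text so that FABRIC_MODE and FM_CMD_UNIX_SOCKET_PATH carry the requested values. The option names come from this issue. The helper name `set_cfg_options`, the sample config excerpt, and the socket path `/run/nvidia-fabricmanager/fm-cmd.sock` are assumptions made up for the example, not the GPU Operator's actual implementation.

```python
def set_cfg_options(cfg_text: str, overrides: dict[str, str]) -> str:
    """Return cfg_text with the given KEY=VALUE overrides applied."""
    seen = set()
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        # Leave comments and blank lines untouched.
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in overrides:
                line = f"{key}={overrides[key]}"
                seen.add(key)
        out.append(line)
    # Append any option that was not already present in the file.
    out.extend(f"{k}={v}" for k, v in overrides.items() if k not in seen)
    return "\n".join(out) + "\n"


# Illustrative excerpt only; a real Fabric Manager config has more options.
SAMPLE_CFG = """\
# Fabric Manager operating mode
FABRIC_MODE=0
LOG_LEVEL=4
"""

# Hypothetical shared location, chosen only for this example.
SHARED_SOCKET = "/run/nvidia-fabricmanager/fm-cmd.sock"

updated = set_cfg_options(
    SAMPLE_CFG,
    {"FABRIC_MODE": "1", "FM_CMD_UNIX_SOCKET_PATH": SHARED_SOCKET},
)
print(updated)
```

In a daemonset, the two values would presumably come from environment variables or Helm values rather than constants. The directory holding the socket would also need to be a volume that both the driver container and the DRA driver can mount.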

Today, the GPU Operator's driver daemonset runs nv-fabricmanager in its default bare-metal mode and exposes its command interface only on the default loopback TCP port inside the driver container's network/mount namespace, so there is no supported way for the DRA driver to reach it or to drive partition activation.
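To make the reachability problem concrete, here is a minimal sketch of how a consumer in another container could check that a Unix domain socket is present and accepting connections at a shared path. It only tests connectivity and does not speak the Fabric Manager protocol; a real client would go through the FM SDK. The function name `fm_socket_reachable` is an assumption for illustration.

```python
import os
import socket
import stat


def fm_socket_reachable(path: str, timeout: float = 1.0) -> bool:
    """Return True if a Unix stream socket at `path` accepts a connection."""
    try:
        # The path must exist and be a socket file, not a regular file.
        if not stat.S_ISSOCK(os.stat(path).st_mode):
            return False
    except OSError:
        return False
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except OSError:
            # Covers refused connections, timeouts, and permission errors.
            return False
    return True
```

A loopback TCP port bound inside the driver container's network namespace cannot be probed this way from another pod, which is why a socket file on a shared mount is the more convenient interface here.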

Metadata

    Labels

    feature (issue/PR that proposes a new feature or functionality), lifecycle/frozen, needs-triage (issue or PR has not been assigned a priority-px label)
