ParseMigProfile returns IDs from wrong GPU on heterogeneous hosts #91

Description

@rajathagasthya

Summary

ParseMigProfile returns NVML profile IDs from the wrong GPU on heterogeneous hosts because VisitMigProfiles dedupes profiles by name across all GPUs.

What breaks

  • On hosts where the same MIG profile name (e.g. "1g.10gb") maps to different NVML GIProfileIDs on different GPUs, only the first-walked variant survives.
  • Callers using (*devicelib).ParseMigProfile to drive NVML calls on a specific GPU may issue requests with a profile ID that GPU does not support.
  • Symptom: GetGpuInstanceProfileInfo returns ERROR_NOT_SUPPORTED, and operations like SetMigConfig fail.

Root cause

(*devicelib).VisitMigProfiles builds a map[string]struct{} keyed by p.String(), so cross-device duplicates are silently dropped (device.go#L537-L562):

visited := make(map[string]struct{})
err := d.VisitDevices(func(i int, d Device) error {
    return d.VisitMigProfiles(func(p MigProfile) error {
        if _, ok := visited[p.String()]; ok { // keyed by name — the bug
            return nil
        }
        visited[p.String()] = struct{}{}
        return visit(p)
    })
})

GetMigProfiles caches that deduped list on devicelib.migProfiles (device.go#L590-L610), and ParseMigProfile searches it by name (mig_profile.go#L152-L166). The per-device (*device).VisitMigProfiles does not dedupe (device.go#L352-L431).

All line references are against main at commit 8ff29bb.
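
The effect of the name-keyed dedup can be reproduced in isolation. The following is a minimal sketch, not the library's code: the `profile` type and `dedupeByName` function are hypothetical stand-ins, and the numeric IDs are illustrative placeholders rather than values read from NVML.

```go
package main

import "fmt"

// profile is a hypothetical stand-in for a MIG profile: a display name plus
// the NVML GPU instance profile ID that backs it on one particular GPU.
type profile struct {
	Name        string
	GIProfileID int
}

// dedupeByName mirrors the name-keyed dedup described above: the first
// profile seen for each name wins and later variants are dropped.
func dedupeByName(perGPU [][]profile) []profile {
	visited := make(map[string]struct{})
	var out []profile
	for _, profiles := range perGPU {
		for _, p := range profiles {
			if _, ok := visited[p.Name]; ok {
				continue // same name already seen on an earlier GPU
			}
			visited[p.Name] = struct{}{}
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Illustrative IDs only; real values come from NVML.
	perGPU := [][]profile{
		{{Name: "1g.10gb", GIProfileID: 9}}, // GPU 0
		{{Name: "1g.10gb", GIProfileID: 0}}, // GPU 1
	}
	for _, p := range dedupeByName(perGPU) {
		fmt.Printf("%s -> GI profile ID %d\n", p.Name, p.GIProfileID)
	}
}
```

Only the GPU 0 entry is printed; the GPU 1 variant never reaches the cached list, so a later lookup by name cannot return it.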

Failure example

Suppose GPU 0 exposes "1g.10gb" via GPU_INSTANCE_PROFILE_1_SLICE_REV2 and GPU 1 exposes it via GPU_INSTANCE_PROFILE_1_SLICE. Calling SetMigConfig(gpu=1, config={"1g.10gb": 1}) then proceeds as follows:

  1. ParseMigProfile("1g.10gb") returns the GPU 0 variant (REV2), because GPU 0 was walked first.
  2. On GPU 1, GetGpuInstanceProfileInfo(REV2) returns ERROR_NOT_SUPPORTED.
  3. SetMigConfig fails and leaves GPU 1 cleared.

Reproduction / workaround

Reproduced and worked around in mig-parted PR #372, which adds a per-device resolver plus a test (TestSetMigConfigResolvesMigProfileOnTargetGPU in pkg/mig/config/config_test.go) that mocks the heterogeneous case. Originating report: mig-parted#157.

Consumer-side workaround: resolve profile names against the per-device Device.GetMigProfiles() rather than the devicelib cache.
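
A rough sketch of that workaround follows. It is not the code from the mig-parted PR: `MigProfile` and `Device` here are trimmed-down local stand-ins declaring only the methods the sketch needs, and `resolveOnDevice`, `ErrProfileNotFound`, `fakeProfile`, and `fakeDevice` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// MigProfile and Device are hypothetical, trimmed-down stand-ins for the
// library interfaces; only the methods this sketch needs are declared.
type MigProfile interface {
	String() string
}

type Device interface {
	GetMigProfiles() ([]MigProfile, error)
}

// ErrProfileNotFound is a hypothetical sentinel error for this sketch.
var ErrProfileNotFound = errors.New("MIG profile not supported on this device")

// resolveOnDevice looks a profile name up in the target device's own profile
// list, so the result carries the IDs that this device reports.
func resolveOnDevice(d Device, name string) (MigProfile, error) {
	profiles, err := d.GetMigProfiles()
	if err != nil {
		return nil, fmt.Errorf("listing MIG profiles: %w", err)
	}
	for _, p := range profiles {
		if p.String() == name {
			return p, nil
		}
	}
	return nil, fmt.Errorf("%q: %w", name, ErrProfileNotFound)
}

// fakeProfile and fakeDevice exist only to exercise the sketch.
type fakeProfile struct {
	name string
	giID int
}

func (p fakeProfile) String() string { return p.name }

type fakeDevice []MigProfile

func (d fakeDevice) GetMigProfiles() ([]MigProfile, error) { return d, nil }

func main() {
	// GPU 1 exposes "1g.10gb" under its own (illustrative) GI profile ID.
	gpu1 := fakeDevice{fakeProfile{name: "1g.10gb", giID: 0}}
	p, err := resolveOnDevice(gpu1, "1g.10gb")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("resolved %s with GI profile ID %d\n", p, p.(fakeProfile).giID)
}
```

Because the lookup never touches the devicelib-wide cache, the order in which GPUs are walked no longer affects which IDs are returned.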

Possible directions

  • A. Drop the dedup in (*devicelib).VisitMigProfiles; return one entry per (GPU, name). Changes the observable shape of GetMigProfiles/ParseMigProfile.
  • B. Keep the dedup, but have ParseMigProfile return a typed ErrAmbiguous when devices disagree on GI/CI IDs for a name.
  • C. Document the current behavior as "device-agnostic, first-walked wins" and steer NVML-bound callers to the per-device Device.GetMigProfiles().
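
To make direction B concrete, here is one possible shape for the ambiguity check. This is a sketch under stated assumptions: only the name ErrAmbiguous comes from the proposal above, while its definition as a sentinel error, the `profileIDs` type, and the `parseAcrossDevices` function are hypothetical. The IDs are illustrative placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// profileIDs is a hypothetical record of the NVML IDs behind a profile name
// on one GPU.
type profileIDs struct {
	GI int // GPU instance profile ID
	CI int // compute instance profile ID
}

// ErrAmbiguous is the error proposed in direction B; the name comes from the
// issue, the definition here is an assumption.
var ErrAmbiguous = errors.New("MIG profile name maps to different IDs on different GPUs")

// parseAcrossDevices returns the IDs for name if every GPU exposing it agrees,
// and ErrAmbiguous otherwise. perGPU[i] maps profile names to IDs on GPU i.
func parseAcrossDevices(perGPU []map[string]profileIDs, name string) (profileIDs, error) {
	var found bool
	var first profileIDs
	for _, profiles := range perGPU {
		ids, ok := profiles[name]
		if !ok {
			continue // this GPU does not expose the profile at all
		}
		if !found {
			first, found = ids, true
			continue
		}
		if ids != first {
			return profileIDs{}, fmt.Errorf("%q: %w", name, ErrAmbiguous)
		}
	}
	if !found {
		return profileIDs{}, fmt.Errorf("%q: no such MIG profile", name)
	}
	return first, nil
}

func main() {
	perGPU := []map[string]profileIDs{
		{"1g.10gb": {GI: 9, CI: 0}}, // GPU 0 (illustrative IDs)
		{"1g.10gb": {GI: 0, CI: 0}}, // GPU 1
	}
	_, err := parseAcrossDevices(perGPU, "1g.10gb")
	fmt.Println(errors.Is(err, ErrAmbiguous)) // true
}
```

Callers on homogeneous hosts would see no change, while callers on heterogeneous hosts would get an explicit error they can match with errors.Is instead of a silently wrong ID.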
