Summary
ParseMigProfile returns NVML profile IDs from the wrong GPU on heterogeneous hosts because VisitMigProfiles dedupes profiles by name across all GPUs.
What breaks
- On hosts where the same MIG profile name (e.g. "1g.10gb") maps to different NVML GIProfileIDs on different GPUs, only the first-walked variant survives.
- Callers using (*devicelib).ParseMigProfile to drive NVML calls on a specific GPU may issue requests with a profile ID that GPU does not support.
- Symptom: GetGpuInstanceProfileInfo returns ERROR_NOT_SUPPORTED, and operations like SetMigConfig fail.
Root cause
(*devicelib).VisitMigProfiles builds a map[string]struct{} keyed by p.String(), so cross-device duplicates are silently dropped (device.go#L537-L562):
    visited := make(map[string]struct{})
    err := d.VisitDevices(func(i int, d Device) error {
        return d.VisitMigProfiles(func(p MigProfile) error {
            if _, ok := visited[p.String()]; ok { // keyed by name only: this is the bug
                return nil
            }
            visited[p.String()] = struct{}{}
            return visit(p)
        })
    })
GetMigProfiles caches that deduped list on devicelib.migProfiles (device.go#L590-L610), and ParseMigProfile searches it by name (mig_profile.go#L152-L166). The per-device (*device).VisitMigProfiles does not dedupe (device.go#L352-L431).
All refs against main at commit 8ff29bb.
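To make the effect of the name-keyed dedup concrete, here is a minimal standalone sketch. It does not use the library: the profile type, the dedupeByName helper, and the numeric IDs are hypothetical stand-ins chosen for illustration, and the IDs are not real NVML constants.

```go
package main

import "fmt"

// profile is a hypothetical stand-in for MigProfile: a name plus the
// GPU instance profile ID that backs it on one particular GPU.
type profile struct {
	Name        string
	GIProfileID int
}

// dedupeByName mimics the library-level walk described above: GPUs are
// visited in index order and a profile is kept only if its name has not
// been seen on an earlier GPU.
func dedupeByName(perGPU [][]profile) []profile {
	visited := make(map[string]struct{})
	var out []profile
	for _, profiles := range perGPU {
		for _, p := range profiles {
			if _, ok := visited[p.Name]; ok {
				continue // same name on a later GPU: silently dropped
			}
			visited[p.Name] = struct{}{}
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Placeholder IDs: 2 stands for the "REV2" variant, 1 for the base one.
	perGPU := [][]profile{
		{{Name: "1g.10gb", GIProfileID: 2}}, // GPU 0
		{{Name: "1g.10gb", GIProfileID: 1}}, // GPU 1
	}
	for _, p := range dedupeByName(perGPU) {
		fmt.Printf("%s -> GI profile ID %d\n", p.Name, p.GIProfileID)
	}
}
```

Only the GPU 0 entry survives the walk, so any later lookup by name can never return the ID that GPU 1 actually supports.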
Failure example
GPU 0 exposes "1g.10gb" via GPU_INSTANCE_PROFILE_1_SLICE_REV2; GPU 1 exposes it via GPU_INSTANCE_PROFILE_1_SLICE. Then SetMigConfig(gpu=1, config={"1g.10gb": 1}):
- ParseMigProfile("1g.10gb") returns the GPU 0 variant (REV2), because GPU 0 was walked first.
- On GPU 1, GetGpuInstanceProfileInfo(REV2) returns ERROR_NOT_SUPPORTED.
- SetMigConfig fails and leaves GPU 1 cleared.
Reproduction / workaround
Reproduced and worked around in mig-parted PR #372, which adds a per-device resolver plus a test (TestSetMigConfigResolvesMigProfileOnTargetGPU in pkg/mig/config/config_test.go) that mocks the heterogeneous case. Originating report: mig-parted#157.
Consumer-side workaround: resolve profile names against the per-device Device.GetMigProfiles() rather than the devicelib cache.
Possible directions
- A. Drop the dedup in (*devicelib).VisitMigProfiles; return one entry per (GPU, name). Changes the observable shape of GetMigProfiles/ParseMigProfile.
- B. Keep the dedup, but have ParseMigProfile return a typed ErrAmbiguous when devices disagree on GI/CI IDs for a name.
- C. Document the current behavior as "device-agnostic, first-walked wins" and steer NVML-bound callers to the per-device Device.GetMigProfiles().