Fix uvm_to_cpu/uvm_to_device/uvm_get_guard_index for host-mapped tensors (#5816) by JasonLC506 · Pull Request #5816 · pytorch/FBGEMM

JasonLC506 · 2026-06-02T20:57:33Z

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2742

Three functions in fbgemm_gpu's memory_utils.cu unconditionally cast the storage context to `CUDAManagedIndirectContext`. That cast returns nullptr for tensors created via `new_host_mapped_tensor` / `new_unified_tensor(is_host_mapped=True)`, and the nullptr then trips `TORCH_CHECK(tcontext != nullptr)` with the error message "Expected tcontext != nullptr". The failure surfaces during DCP/Pyper checkpoint loading of TBE shards whenever the destination tensor is host-mapped.

Inference TBE already enables `uvm_host_mapped=True` (see `torchrec/distributed/quant_embedding_kernel.py:350,594`, originally landed in D42272528 for a NUMA OOM fix), so this crash is reachable in production today on any path that calls `fbgemm.uvm_to_cpu` / `uvm_to_device` / `cuda_mem_advise` on a host-mapped tensor (e.g. `aiplatform/modelstore/checkpointing/pyper/tensor_utils.py:74`).

The fix mirrors the existing managed-tensor pattern: a new `CUDAHostMappedIndirectContext` struct holds a refcount on the original `Storage`, and each of the three functions gains a `deleter == &CUDAHostMappedContext::release` branch that constructs the appropriate CPU/device view backed by the host-mapped pointer.
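To illustrate the dispatch described above, here is a minimal standalone sketch. It does not use the real FBGEMM or c10 types: `MockDataPtr`, `ManagedIndirectCtx`, `HostMappedCtx`, `resolve_before_fix`, and `resolve_after_fix` are all hypothetical names invented for this example. It only assumes that a context cast yields nullptr when the stored deleter does not match the expected one, which is the failure mode the summary describes.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins; none of these are the real FBGEMM or c10 types.
using DeleterFn = void (*)(void*);

struct ManagedIndirectCtx {
  bool released = false;
  static void release(void* ctx) {
    static_cast<ManagedIndirectCtx*>(ctx)->released = true;
  }
};

struct HostMappedCtx {
  bool released = false;
  static void release(void* ctx) {
    static_cast<HostMappedCtx*>(ctx)->released = true;
  }
};

// Mock of a data pointer: an opaque context plus the deleter that
// identifies what kind of context it is.
struct MockDataPtr {
  void* ctx;
  DeleterFn deleter;

  // Returns the typed context only when the deleter matches; else nullptr.
  template <typename T>
  T* cast_context(DeleterFn expected) const {
    return deleter == expected ? static_cast<T*>(ctx) : nullptr;
  }
};

enum class Kind { Managed, HostMapped };

// Pre-fix shape: always assumes a managed context, so a host-mapped
// pointer yields nullptr and the check fails.
Kind resolve_before_fix(const MockDataPtr& p) {
  assert(p.deleter != nullptr);
  auto* tcontext =
      p.cast_context<ManagedIndirectCtx>(&ManagedIndirectCtx::release);
  if (tcontext == nullptr) {
    throw std::runtime_error("Expected tcontext != nullptr");
  }
  return Kind::Managed;
}

// Post-fix shape: branch on the deleter first, then fall through to the
// unchanged managed path.
Kind resolve_after_fix(const MockDataPtr& p) {
  assert(p.deleter != nullptr);
  if (p.deleter == &HostMappedCtx::release) {
    return Kind::HostMapped;  // the real code would build a CPU/device view here
  }
  return resolve_before_fix(p);
}
```

In this sketch the managed path is reached only after the host-mapped check, which matches the claim that existing managed-tensor callers see no change in behavior.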

The original `CUDAManagedIndirectContext` code path is byte-for-byte unchanged, except for a semicolon added after one `TORCH_CHECK` where it was missing (the compiler had been absorbing the omission). All existing managed-tensor callers continue to work.

Motivating use case: OneFlow preranker publish jobs intermittently time out (~6% failure rate) because `torch.zeros(out=72GB_UVM_tensor)` triggers THP/UVM compact_stalls on fragmented hosts. A/B results show that the host-mapped fix eliminates the variance (max 85s vs. 1330s baseline, 0 torch.zeros compact_stall across 20 nodes): P2349246996. Full root cause / fix design: P2343573171.

Reviewed By: q10

Differential Revision: D106211462

meta-codesync · 2026-06-02T20:57:42Z

@JasonLC506 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106211462.


meta-cla Bot added the cla signed label Jun 2, 2026

meta-codesync Bot added fb-exported meta-exported labels Jun 2, 2026

meta-codesync Bot changed the title ~~Fix uvm_to_cpu/uvm_to_device/uvm_get_guard_index for host-mapped tensors~~ Fix uvm_to_cpu/uvm_to_device/uvm_get_guard_index for host-mapped tensors (#5816) Jun 3, 2026

JasonLC506 force-pushed the export-D106211462 branch from 5f5686a to ed3d3aa Compare June 3, 2026 00:10

JasonLC506 force-pushed the export-D106211462 branch from ed3d3aa to 8b78870 Compare June 3, 2026 00:53

JasonLC506 force-pushed the export-D106211462 branch from 8b78870 to 41bff5f Compare June 3, 2026 00:58

JasonLC506 force-pushed the export-D106211462 branch 2 times, most recently from 6859944 to 96cb05e Compare June 4, 2026 01:09

JasonLC506 force-pushed the export-D106211462 branch from 96cb05e to 811346c Compare June 4, 2026 22:41

JasonLC506 force-pushed the export-D106211462 branch from 811346c to a4aae71 Compare June 5, 2026 18:22

JasonLC506 force-pushed the export-D106211462 branch from a4aae71 to 2727bc9 Compare June 16, 2026 22:03

JasonLC506 force-pushed the export-D106211462 branch from 2727bc9 to 0ef1db6 Compare June 16, 2026 22:04

JasonLC506 force-pushed the export-D106211462 branch from 0ef1db6 to c3fbcea Compare June 16, 2026 22:23

JasonLC506 wants to merge 1 commit into pytorch:main from JasonLC506:export-D106211462
