Fix uvm_to_cpu/uvm_to_device/uvm_get_guard_index for host-mapped tensors (#5816) #5816
Open
JasonLC506 wants to merge 1 commit into
@JasonLC506 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106211462.
5f5686a to ed3d3aa
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
…ors (pytorch#5816) Summary: Pull Request resolved: pytorch#5816 X-link: https://github.com/facebookresearch/FBGEMM/pull/2742 Three functions in fbgemm_gpu's memory_utils.cu unconditionally cast the storage context to `CUDAManagedIndirectContext`, which returns nullptr for tensors created via `new_host_mapped_tensor` / `new_unified_tensor(is_host_mapped=True)`. The nullptr triggers `TORCH_CHECK(tcontext != nullptr)` with the error message "Expected tcontext != nullptr", surfacing during DCP/Pyper checkpoint loading of TBE shards whenever the destination tensor is host-mapped. Inference TBE already enables `uvm_host_mapped=True` (see `torchrec/distributed/quant_embedding_kernel.py:350,594`, originally landed in D42272528 for a NUMA OOM fix), so this crash is reachable in production today on any path that calls `fbgemm.uvm_to_cpu` / `uvm_to_device` / `cuda_mem_advise` on a host-mapped tensor (e.g. `aiplatform/modelstore/checkpointing/pyper/tensor_utils.py:74`). The fix mirrors the existing managed-tensor pattern: a new `CUDAHostMappedIndirectContext` struct holds a refcount on the original `Storage`, and each of the three functions gains a `deleter == &CUDAHostMappedContext::release` branch that constructs the appropriate CPU/device view backed by the host-mapped pointer. The original `CUDAManagedIndirectContext` code path is byte-for-byte unchanged (except a missing semicolon on one `TORCH_CHECK` that the compiler had been absorbing). All existing managed-tensor callers continue to work. Motivating use case: OneFlow preranker publish jobs intermittently timeout (~6% failure rate) due to `torch.zeros(out=72GB_UVM_tensor)` triggering THP/UVM compact_stalls on fragmented hosts. A/B results showing the host-mapped fix eliminates the variance (max 85s vs 1330s baseline, 0 torch.zeros compact_stall across 20 nodes): P2349246996. Full root cause / fix design: P2343573171. Reviewed By: q10 Differential Revision: D106211462
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
ed3d3aa to 8b78870
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
8b78870 to 41bff5f
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
6859944 to 96cb05e
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 4, 2026
96cb05e to 811346c
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 4, 2026
811346c to a4aae71
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 5, 2026
a4aae71 to 2727bc9
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 16, 2026
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 16, 2026
2727bc9 to 0ef1db6
0ef1db6 to c3fbcea
Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2742

Three functions in fbgemm_gpu's memory_utils.cu unconditionally cast the storage context to `CUDAManagedIndirectContext`, which returns nullptr for tensors created via `new_host_mapped_tensor` / `new_unified_tensor(is_host_mapped=True)`. The nullptr triggers `TORCH_CHECK(tcontext != nullptr)` with the error message "Expected tcontext != nullptr", surfacing during DCP/Pyper checkpoint loading of TBE shards whenever the destination tensor is host-mapped.

Inference TBE already enables `uvm_host_mapped=True` (see `torchrec/distributed/quant_embedding_kernel.py:350,594`, originally landed in D42272528 for a NUMA OOM fix), so this crash is reachable in production today on any path that calls `fbgemm.uvm_to_cpu` / `uvm_to_device` / `cuda_mem_advise` on a host-mapped tensor (e.g. `aiplatform/modelstore/checkpointing/pyper/tensor_utils.py:74`).

The fix mirrors the existing managed-tensor pattern: a new `CUDAHostMappedIndirectContext` struct holds a refcount on the original `Storage`, and each of the three functions gains a `deleter == &CUDAHostMappedContext::release` branch that constructs the appropriate CPU/device view backed by the host-mapped pointer.

The original `CUDAManagedIndirectContext` code path is byte-for-byte unchanged (except a missing semicolon on one `TORCH_CHECK` that the compiler had been absorbing). All existing managed-tensor callers continue to work.

Motivating use case: OneFlow preranker publish jobs intermittently timeout (~6% failure rate) due to `torch.zeros(out=72GB_UVM_tensor)` triggering THP/UVM compact_stalls on fragmented hosts. A/B results showing the host-mapped fix eliminates the variance (max 85s vs 1330s baseline, 0 torch.zeros compact_stall across 20 nodes): P2349246996. Full root cause / fix design: P2343573171.

Reviewed By: q10

Differential Revision: D106211462
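
For readers who want the shape of the change without opening the diff, here is a toy C++ sketch of the deleter-based dispatch described in the summary. It is not the FBGEMM code and uses no PyTorch or CUDA types: `Storage`, `View`, `ManagedIndirectCtx`, `HostMappedCtx`, `to_cpu_before` and `to_cpu_after` are hypothetical stand-ins invented for illustration. The only ideas taken from the description are that a context cast yields nullptr when the deleter does not match, that the fix adds a branch keyed on the host-mapped deleter, and that the resulting view holds a reference to the original storage.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

using Deleter = void (*)(void*);

// Hypothetical stand-ins for the two context kinds named in the PR.
struct ManagedIndirectCtx {
  int device = 0;
  static void release(void* p) { delete static_cast<ManagedIndirectCtx*>(p); }
};
struct HostMappedCtx {
  void* host_ptr = nullptr;
  static void release(void* p) { delete static_cast<HostMappedCtx*>(p); }
};

// Toy storage: an opaque context pointer tagged by its deleter.
struct Storage {
  void* ctx;
  Deleter deleter;
  Storage(void* c, Deleter d) : ctx(c), deleter(d) {}
  Storage(const Storage&) = delete;
  Storage& operator=(const Storage&) = delete;
  ~Storage() { deleter(ctx); }

  // Returns the context only when the deleter matches; otherwise nullptr.
  template <typename T>
  T* cast_context(Deleter expected) const {
    return deleter == expected ? static_cast<T*>(ctx) : nullptr;
  }
};

// A view that keeps the original storage alive by holding a reference to it.
struct View {
  std::shared_ptr<Storage> owner;
  std::string kind;
};

// Pre-fix shape: assumes every storage carries a managed context.
View to_cpu_before(const std::shared_ptr<Storage>& s) {
  auto* m = s->cast_context<ManagedIndirectCtx>(&ManagedIndirectCtx::release);
  if (m == nullptr) {
    throw std::runtime_error("Expected tcontext != nullptr");
  }
  return View{s, "managed"};
}

// Post-fix shape: handle the host-mapped deleter first, then fall back to
// the unchanged managed path.
View to_cpu_after(const std::shared_ptr<Storage>& s) {
  if (s->deleter == &HostMappedCtx::release) {
    return View{s, "host-mapped"};
  }
  return to_cpu_before(s);
}
```

In this model a storage built with `&HostMappedCtx::release` makes `to_cpu_before` throw, while `to_cpu_after` returns a view whose `owner` keeps the storage alive; a storage built with `&ManagedIndirectCtx::release` behaves identically under both functions, which is the compatibility property the description claims for existing managed-tensor callers.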