Fix uvm_to_cpu/uvm_to_device/uvm_get_guard_index for host-mapped tensors (#5816) #5816

Open
JasonLC506 wants to merge 1 commit into pytorch:main from JasonLC506:export-D106211462

Conversation

JasonLC506 commented Jun 2, 2026

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2742

Three functions in fbgemm_gpu's `memory_utils.cu` unconditionally cast the storage context to `CUDAManagedIndirectContext`, which returns nullptr for tensors created via `new_host_mapped_tensor` / `new_unified_tensor(is_host_mapped=True)`. The nullptr triggers `TORCH_CHECK(tcontext != nullptr)` with the error message "Expected tcontext != nullptr", surfacing during DCP/Pyper checkpoint loading of TBE shards whenever the destination tensor is host-mapped.
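
To make the failure mode concrete, here is a minimal, self-contained sketch of a deleter-keyed context cast. It is a toy model only: every `Toy*` name is hypothetical and none of these are the c10 or FBGEMM types. The assumption is that the cast hands back the context only when the stored deleter matches the expected one, so a storage allocated through a different path yields nullptr.

```cpp
#include <cassert>

// Function-pointer type used to identify how a buffer was allocated.
using DeleterFn = void (*)(void*);

// Hypothetical stand-in for the managed-tensor context.
struct ToyManagedContext {
  static void release(void* p) { delete static_cast<ToyManagedContext*>(p); }
};

// Hypothetical stand-in for the host-mapped context.
struct ToyHostMappedContext {
  void* host_ptr = nullptr;  // placeholder payload, unused in this sketch
  static void release(void* p) { delete static_cast<ToyHostMappedContext*>(p); }
};

// Toy owning pointer: a type-erased context plus the deleter that frees it.
class ToyDataPtr {
 public:
  ToyDataPtr(void* ctx, DeleterFn deleter) : ctx_(ctx), deleter_(deleter) {}
  ~ToyDataPtr() { deleter_(ctx_); }
  ToyDataPtr(const ToyDataPtr&) = delete;
  ToyDataPtr& operator=(const ToyDataPtr&) = delete;

  // Returns the context as T* only if the stored deleter is the expected one;
  // otherwise returns nullptr.
  template <typename T>
  T* cast_context(DeleterFn expected) const {
    return deleter_ == expected ? static_cast<T*>(ctx_) : nullptr;
  }

 private:
  void* ctx_;
  DeleterFn deleter_;
};

// Models the pre-fix logic: assume the storage is managed and check for null.
// Returns false exactly where the real code would hit the TORCH_CHECK failure.
inline bool passes_managed_only_check(const ToyDataPtr& ptr) {
  auto* tcontext =
      ptr.cast_context<ToyManagedContext>(&ToyManagedContext::release);
  return tcontext != nullptr;
}
```

In this model, a `ToyDataPtr` built with `&ToyManagedContext::release` passes the check, while one built with `&ToyHostMappedContext::release` does not, which corresponds to the "Expected tcontext != nullptr" error described above.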

Inference TBE already enables `uvm_host_mapped=True` (see `torchrec/distributed/quant_embedding_kernel.py:350,594`, originally landed in D42272528 for a NUMA OOM fix), so this crash is reachable in production today on any path that calls `fbgemm.uvm_to_cpu` / `uvm_to_device` / `cuda_mem_advise` on a host-mapped tensor (e.g. `aiplatform/modelstore/checkpointing/pyper/tensor_utils.py:74`).

The fix mirrors the existing managed-tensor pattern: a new `CUDAHostMappedIndirectContext` struct holds a refcount on the original `Storage`, and each of the three functions gains a `deleter == &CUDAHostMappedContext::release` branch that constructs the appropriate CPU/device view backed by the host-mapped pointer.
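
The shape of that branch can be sketched without any CUDA or PyTorch dependency. The following is a simplified model under stated assumptions, not the code in this PR: `SketchStorage`, `SketchView`, `to_cpu_view`, and the two `*_release` functions are hypothetical names, plain heap allocations stand in for managed and host-mapped memory, and `std::shared_ptr` stands in for the refcounted `Storage`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <stdexcept>

using DeleterFn = void (*)(void*);

// Hypothetical deleters; each one identifies an allocation path.
void managed_release(void* p) { delete[] static_cast<float*>(p); }  // models managed memory
void host_mapped_release(void* p) { std::free(p); }                 // models host-mapped memory

// Stand-in for the original storage: owns the buffer and remembers its deleter.
struct SketchStorage {
  float* data;
  DeleterFn deleter;
  SketchStorage(float* d, DeleterFn del) : data(d), deleter(del) {}
  ~SketchStorage() { deleter(data); }
  SketchStorage(const SketchStorage&) = delete;
  SketchStorage& operator=(const SketchStorage&) = delete;
};
using StorageRef = std::shared_ptr<SketchStorage>;

enum class ViewKind { ManagedIndirect, HostMappedIndirect };

// A view aliases the original pointer and holds a reference on the original
// storage, so the buffer outlives every other handle to it.
struct SketchView {
  float* data;
  StorageRef keep_alive;
  ViewKind kind;
};

// Dispatch on the deleter instead of assuming the storage is managed.
SketchView to_cpu_view(const StorageRef& storage) {
  if (storage->deleter == &managed_release) {
    return {storage->data, storage, ViewKind::ManagedIndirect};
  }
  if (storage->deleter == &host_mapped_release) {
    return {storage->data, storage, ViewKind::HostMappedIndirect};
  }
  throw std::invalid_argument("storage is neither managed nor host-mapped");
}

StorageRef make_managed(std::size_t n) {
  return std::make_shared<SketchStorage>(new float[n](), &managed_release);
}

StorageRef make_host_mapped(std::size_t n) {
  auto* p = static_cast<float*>(std::calloc(n, sizeof(float)));
  if (p == nullptr) throw std::bad_alloc();
  return std::make_shared<SketchStorage>(p, &host_mapped_release);
}
```

The point of the extra reference is lifetime: in this model, resetting the original `StorageRef` after calling `to_cpu_view` leaves the view's pointer valid, because `keep_alive` still pins the buffer until the view itself is destroyed.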

The original `CUDAManagedIndirectContext` code path is byte-for-byte unchanged, except for a missing semicolon restored on one `TORCH_CHECK` that the compiler had been absorbing. All existing managed-tensor callers continue to work.

Motivating use case: OneFlow preranker publish jobs intermittently time out (~6% failure rate) because `torch.zeros(out=72GB_UVM_tensor)` triggers THP/UVM compact_stalls on fragmented hosts. A/B results show that the host-mapped fix eliminates the variance (max 85s vs. 1330s baseline, 0 `torch.zeros` compact_stalls across 20 nodes): P2349246996. Full root cause / fix design: P2343573171.

Reviewed By: q10

Differential Revision: D106211462

meta-cla Bot added the `cla signed` label Jun 2, 2026
meta-codesync Bot (Contributor) commented Jun 2, 2026

@JasonLC506 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106211462.

meta-codesync Bot changed the title from "Fix uvm_to_cpu/uvm_to_device/uvm_get_guard_index for host-mapped tensors" to "Fix uvm_to_cpu/uvm_to_device/uvm_get_guard_index for host-mapped tensors (#5816)" on Jun 3, 2026
JasonLC506 force-pushed the export-D106211462 branch from 5f5686a to ed3d3aa on June 3, 2026 00:10
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
…ors (pytorch#5816)

JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
…ors (pytorch#5816)

JasonLC506 force-pushed the export-D106211462 branch from ed3d3aa to 8b78870 on June 3, 2026 00:53
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
…ors (pytorch#5816)

JasonLC506 force-pushed the export-D106211462 branch from 8b78870 to 41bff5f on June 3, 2026 00:58
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 3, 2026
…ors (pytorch#5816)

JasonLC506 force-pushed the export-D106211462 branch 2 times, most recently from 6859944 to 96cb05e on June 4, 2026 01:09
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 4, 2026
…ors (pytorch#5816)

JasonLC506 force-pushed the export-D106211462 branch from 96cb05e to 811346c on June 4, 2026 22:41
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 4, 2026
…ors (pytorch#5816)

JasonLC506 force-pushed the export-D106211462 branch from 811346c to a4aae71 on June 5, 2026 18:22
JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 5, 2026
…ors (pytorch#5816)

JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 16, 2026
…ors (pytorch#5816)

JasonLC506 added a commit to JasonLC506/FBGEMM-1 that referenced this pull request Jun 16, 2026
…ors (pytorch#5816)
