dense_to_jagged_forward: realize total_L SymInt before empty #5873

Open
haoyuz wants to merge 1 commit into pytorch:main from haoyuz:export-D108236923

haoyuz (Contributor) commented Jun 11, 2026

Summary:
X-link: facebookresearch/FBGEMM#2793

CONTEXT: On AMD MI350X (HIP), MAST job `fire-fandw06-f1096341099` (Stories LSR
train_eval) crashed inside `fbgemm::dense_to_jagged` during the forward pass
of `UhmEventTokenizer.get_position_encoding`. Two related symptoms appeared
across ranks:
  - `RuntimeError: ...RegisterCUDA_0.cpp:7563: SymIntArrayRef expected to
    contain only concrete integers` (asIntArrayRefSlow check fired)
  - `RuntimeError: Trying to create tensor with negative dimension
    -1409625905161306112: [-1409625905161306112, 8]` (heap SymNode pointer
    reinterpreted as int64 in `at::detail::empty_generic`)

`dense_to_jagged_forward.cu` and the CPU variant forward an
`std::optional<at::SymInt> total_L` straight into
`at::empty_symint({total_L, D}, ...)`. The aten `empty.memory_format` CUDA/HIP
wrapper at `RegisterCUDA_0.cpp` calls `C10_AS_INTARRAYREF_SLOW` on the size
array, which `TORCH_CHECK`s that no `SymInt` in the array is
`is_heap_allocated()`. Any heap-allocated `SymInt` arriving here (e.g. an
unbacked SymInt produced inside a `torch.compile` region in production)
either trips that check or, depending on how the dispatcher walked the array,
leaks the `SymNode` pointer through as a raw `int64_t` dimension.
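
To make the two symptoms concrete, here is a toy model of a size slot that can hold either a plain integer or an encoded pointer. It is only a sketch: `TaggedSize`, its sign-bit tag, `as_concrete_sizes`, and `raw_dimension` are hypothetical names invented for illustration, and the real `c10::SymInt` bit layout is different. The point is that the same slot either fails a concreteness check or, if read without the check, shows up as a huge negative dimension.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stand-in for a symbolic size: one int64_t slot that holds either a
// plain value or a tagged "heap" payload. The tagging scheme is hypothetical.
struct TaggedSize {
  int64_t data;

  static constexpr uint64_t kHeapTag = 1ULL << 63;  // sign bit marks "heap"

  static TaggedSize concrete(int64_t v) {
    assert(v >= 0);  // the toy model only stores non-negative sizes inline
    return TaggedSize{v};
  }
  static TaggedSize heap(uint64_t payload) {
    // The payload stands in for a pointer to a symbolic node.
    return TaggedSize{static_cast<int64_t>(payload | kHeapTag)};
  }
  bool is_heap_allocated() const { return data < 0; }
};

// Symptom 1: a checked conversion refuses any heap-allocated entry.
std::vector<int64_t> as_concrete_sizes(const std::vector<TaggedSize>& sizes) {
  std::vector<int64_t> out;
  for (const TaggedSize& s : sizes) {
    if (s.is_heap_allocated()) {
      throw std::runtime_error("expected to contain only concrete integers");
    }
    out.push_back(s.data);
  }
  return out;
}

// Symptom 2: an unchecked read hands the tagged payload back as a dimension.
int64_t raw_dimension(const TaggedSize& s) { return s.data; }
```

In this model, a list containing `TaggedSize::heap(...)` makes `as_concrete_sizes` throw, while `raw_dimension` on the same entry returns a negative number that an allocator would reject as a negative dimension.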

WHAT: Realize `total_L` to a concrete `int64_t` via `guard_int(__FILE__,
__LINE__)` before constructing the output tensor, and switch the allocation
from `at::empty_symint` / `at::zeros_symint` (SymInt-shape) to `at::empty` /
`at::zeros` (`int64_t` shape). For heap SymInts with a hint or a runtime
guard, `guard_int` resolves cleanly to the concrete value; for truly unbacked
SymInts with no value, the kernel now fails with a clean
`"Could not extract specialized integer from data-dependent expression"`
error instead of the low-level memory crash.
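
The "realize first, then allocate" pattern can be sketched without any PyTorch dependency. This is not the code in this diff: `MaybeKnownSize` and `allocate_jagged_values` are hypothetical stand-ins, and the `guard_int` method below only imitates the behaviour described above (return the concrete value, or raise a readable error naming the call site).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a symbolic size: it either knows its concrete
// value (a hint or runtime guard exists) or it does not (truly unbacked).
struct MaybeKnownSize {
  bool has_value;
  int64_t value;

  // Imitates guard_int(file, line): return the concrete value, or fail with
  // a readable error that names the call site.
  int64_t guard_int(const char* file, int64_t line) const {
    if (!has_value) {
      throw std::runtime_error(
          std::string("Could not extract specialized integer from "
                      "data-dependent expression (") +
          file + ":" + std::to_string(line) + ")");
    }
    return value;
  }
};

// Realize total_L first, then allocate with plain integer dimensions, so the
// allocation never sees anything but concrete integers.
std::vector<float> allocate_jagged_values(const MaybeKnownSize& total_L,
                                          int64_t D) {
  const int64_t L = total_L.guard_int(__FILE__, __LINE__);
  assert(L >= 0 && D >= 0);
  return std::vector<float>(static_cast<std::size_t>(L * D), 0.0f);
}
```

With a known size of 5 and `D = 8` the sketch returns 40 zero-initialized values; with an unknown size it throws before any allocation is attempted, which is the trade the WHAT paragraph describes.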

Same fix applied to both the CUDA/HIP kernel
(`src/jagged_tensor_ops/dense_to_jagged_forward.cu`) and the CPU kernel
(`src/jagged_tensor_ops/jagged_tensor_ops_cpu.cpp`).

Adds a regression test `test_dense_to_jagged_heap_symint_total_L` that
constructs an unbacked, heap-allocated SymInt via
`ShapeEnv.create_unbacked_symint()` and calls
`torch.ops.fbgemm.dense_to_jagged` directly. Pre-fix, the test fails with the
`SymIntArrayRef` crash; post-fix, it passes by asserting the clean `guard_int`
error path.

Differential Revision: D108236923

meta-cla Bot added the `cla signed` label Jun 11, 2026

meta-codesync Bot (Contributor) commented Jun 11, 2026

@haoyuz has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108236923.
