Fix the grid problem with sparse permute 2d (V2: threshold-guarded) by q10 · Pull Request #5898 · pytorch/FBGEMM

q10 · 2026-06-14T01:04:43Z

Summary:
Migrates the four host-side cap sites in `sparse_permute_2d.cu` from
the unconditional ROCm cap pattern (V1) to the new
`fbgemm_gpu::utils::cuda::determine_grid_blocks` helper (introduced in
D106267802) with the default `BlockCapPolicy::OverflowOnly`.

Net behaviour change vs V1:

- ROCm small/medium grid (`blocks * threads_per_block <= UINT32_MAX`):
  uncapped grid restored. Was: capped to
  `MAX_THREAD_BLOCKS_FACTOR * #SMs` unconditionally. Now: passes
  through unchanged (matches pre-D104903707 behaviour).
- ROCm large grid (`blocks * threads_per_block > UINT32_MAX`):
  cap still applied. The kernel grid-strides over `b_t` so capping is
  correctness-preserving, and this is the regime where the HIP
  `2^32` thread-per-launch limit would otherwise fire.
- NVIDIA: bit-identical to V1 (the threshold check lives entirely
  under `#ifdef USE_ROCM` inside the helper).
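
The three regimes above can be sketched on the host as follows. This is an illustrative approximation only: the real helper's signature is not shown in this PR, so `sketch_determine_grid_blocks`, the `is_rocm` flag (standing in for `#ifdef USE_ROCM`), the `num_sms` parameter, and the value 64 for the cap factor are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for illustration; not the real FBGEMM declarations.
enum class BlockCapPolicy { OverflowOnly, Always };

constexpr std::uint64_t kMaxThreadBlocksFactor = 64;  // assumed value
constexpr std::uint64_t kMaxGridDimX = 2147483647;    // 2^31 - 1 x-dim cap

inline std::uint64_t sketch_determine_grid_blocks(
    std::uint64_t blocks_uncapped,
    std::uint64_t threads_per_block,
    std::uint64_t num_sms,
    bool is_rocm,  // stands in for #ifdef USE_ROCM
    BlockCapPolicy policy = BlockCapPolicy::OverflowOnly) {
  assert(threads_per_block > 0 && num_sms > 0);
  // The x-dim limit applies on every platform.
  const std::uint64_t blocks = std::min(blocks_uncapped, kMaxGridDimX);
  if (!is_rocm) {
    return blocks;  // NVIDIA path: no threshold check at all
  }
  const std::uint64_t u32_max = std::numeric_limits<std::uint32_t>::max();
  // Equivalent to blocks * threads_per_block > UINT32_MAX, without a multiply.
  const bool overflows = blocks > u32_max / threads_per_block;
  if (policy == BlockCapPolicy::Always || overflows) {
    return std::min(blocks, kMaxThreadBlocksFactor * num_sms);
  }
  return blocks;  // small/medium ROCm grid passes through unchanged
}
```

Under these assumptions, a ROCm launch of 1000 blocks of 256 threads is left alone, while 20000000 blocks of 256 threads on a 100-SM device is reduced to 6400 blocks.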

The four sites:

- `permute_2D_sparse_data_cuda` `blocks_1` (line ~223), `threads_1=256`.
- `permute_2D_sparse_data_cuda` `blocks_2` (line ~262), block size
  `dim3(32, BT_blocks=32)` so threads_per_block = `32 * BT_blocks = 1024`.
- `permute_sparse_features_cuda` `blocks_1` (line ~439), `threads_1=256`.
- `permute_sparse_features_cuda` `blocks_2` (line ~486), block size
  same as above (1024).
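
For the two block sizes listed above, the largest block count that still satisfies `blocks * threads_per_block <= UINT32_MAX` can be worked out directly. The helper name `max_uncapped_blocks` below is hypothetical and exists only for this arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: largest block count for which
// blocks * threads_per_block <= UINT32_MAX still holds.
inline std::uint64_t max_uncapped_blocks(std::uint64_t threads_per_block) {
  assert(threads_per_block > 0);
  return std::numeric_limits<std::uint32_t>::max() / threads_per_block;
}

// threads_1 = 256            -> 4294967295 / 256  = 16777215 blocks
// dim3(32, 32), 1024 threads -> 4294967295 / 1024 = 4194303 blocks
```

So, assuming the threshold is evaluated exactly as written in the summary, the `blocks_2` sites would reach the capped regime at roughly a quarter of the block count of the `blocks_1` sites.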

Drive-by: switches `cuda_calc_block_count` (y/z-dim 65535 cap) to
`determine_grid_blocks` (which uses `cuda_calc_xblock_count` internally
with the 2^31-1 x-dim cap). These launches are 1-D x-dim launches so
`cuda_calc_xblock_count` is the correct primitive; the kernels
grid-stride so any change in grid size is correctness-preserving.
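
The claim that a grid-stride kernel tolerates any grid size can be checked with a small host-side emulation. This is a simplified 1-D sketch, not the actual kernels in `sparse_permute_2d.cu`; `simulate_grid_stride` and `covers_each_once` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a 1-D grid-stride loop: every (block, thread) pair
// starts at its global index and advances by the total number of threads.
inline std::vector<int> simulate_grid_stride(
    std::size_t n, std::size_t grid_blocks, std::size_t threads_per_block) {
  assert(grid_blocks > 0 && threads_per_block > 0);
  std::vector<int> visits(n, 0);
  const std::size_t stride = grid_blocks * threads_per_block;
  for (std::size_t block = 0; block < grid_blocks; ++block) {
    for (std::size_t thread = 0; thread < threads_per_block; ++thread) {
      for (std::size_t i = block * threads_per_block + thread; i < n;
           i += stride) {
        ++visits[i];  // element i is handled by exactly this thread
      }
    }
  }
  return visits;
}

// True when every element was processed exactly once.
inline bool covers_each_once(const std::vector<int>& visits) {
  for (int v : visits) {
    if (v != 1) {
      return false;
    }
  }
  return true;
}
```

Shrinking the grid from 40 blocks to 4 blocks (or to a single block) over the same 10000 elements changes only how many iterations each thread performs, not which elements get processed.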

Reviewed By: spcyppt

Differential Revision: D104937969

…max_thread_blocks helpers (pytorch#5853)

Summary:
X-link: facebookresearch/FBGEMM#2775

Final diff in the threshold-guard helper introduction stack. Migrates the two host-side cap sites in `codegen/training/backward/embedding_backward_split_template.cu` from the legacy `std::min(blocks_uncapped, get_max_thread_blocks_(...))` form to the new threshold-guarded helper `fbgemm_gpu::utils::cuda::determine_grid_blocks_from_blocks(..., BlockCapPolicy::Always)`.

With this migration the last legacy callers are gone, so this diff also cleans up:

- Removes `fbgemm_gpu::utils::cuda::get_max_thread_blocks(stream)` from `include/fbgemm_gpu/utils/cuda_utilities.cuh`.
- Removes the file-local `get_max_thread_blocks_()` and `MAX_THREAD_BLOCKS_FACTOR` from `include/fbgemm_gpu/embedding_backward_template_helpers.cuh`.
- Adds `#include "fbgemm_gpu/utils/cuda_utilities.cuh"` to `embedding_backward_template_helpers.cuh` for the new helper.

Behavior-preserving on the TBE backward variants: the policy `Always` matches the prior unconditional ROCm cap exactly.

Reviewed By: spcyppt

Differential Revision: D106453408

…ch#5897)

Summary:
X-link: facebookresearch/FBGEMM#2816

Now that the TBE backward template (the last caller) has migrated to `cap_grid_dim_x`, remove the legacy `fbgemm_gpu::utils::cuda::get_max_thread_blocks(stream)` helper from `include/fbgemm_gpu/utils/cuda_utilities.cuh` and inline its sole remaining use inside `cap_grid_dim_x`.

Behavior-preserving: the inlined body computes `MAX_THREAD_BLOCKS_FACTOR * #SMs` exactly as before.

Reviewed By: spcyppt

Differential Revision: D107317501

meta-codesync · 2026-06-14T01:04:53Z

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104937969.

meta-codesync · 2026-06-14T20:25:01Z

This pull request has been merged in 0f5414f.

q10 added 3 commits June 13, 2026 18:04

meta-cla Bot added the cla signed label Jun 14, 2026

meta-codesync Bot added the meta-exported label Jun 14, 2026

meta-codesync Bot closed this in 0f5414f Jun 14, 2026

meta-codesync Bot added the Merged label Jun 14, 2026

q10 wants to merge 3 commits into pytorch:main from q10:export-D104937969
