Fix the grid problem with sparse permute 2d (V2: threshold-guarded) #5898

Closed
q10 wants to merge 3 commits into pytorch:main from q10:export-D104937969

q10 commented Jun 14, 2026

Summary:
Migrates the four host-side cap sites in sparse_permute_2d.cu from
the unconditional ROCm cap pattern (V1) to the new
fbgemm_gpu::utils::cuda::determine_grid_blocks helper (introduced in
D106267802) with the default BlockCapPolicy::OverflowOnly.

Net behaviour change vs V1:

  • ROCm small/medium grid (blocks * threads_per_block <= UINT32_MAX):
    uncapped grid restored. Was: capped to
    MAX_THREAD_BLOCKS_FACTOR * #SMs unconditionally. Now: passes
    through unchanged (matches pre-D104903707 behaviour).
  • ROCm large grid (blocks * threads_per_block > UINT32_MAX):
    cap still applied. The kernel grid-strides over b_t so capping is
    correctness-preserving, and this is the regime where the HIP
    2^32 thread-per-launch limit would otherwise fire.
  • NVIDIA: bit-identical to V1 (the threshold check lives entirely
    under #ifdef USE_ROCM inside the helper).
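
The ROCm-side decision described above can be sketched as a small host function. This is an illustration only, not the FBGEMM helper: the name `overflow_only_blocks_sketch`, its signature, and the `max_blocks` parameter (standing in for `MAX_THREAD_BLOCKS_FACTOR * #SMs`) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical sketch of the ROCm branch under an "overflow only" policy.
//   blocks_uncapped   : blocks requested before any cap
//   threads_per_block : total threads in one block
//   max_blocks        : assumed cap value (stands in for FACTOR * #SMs)
inline int64_t overflow_only_blocks_sketch(
    int64_t blocks_uncapped,
    int64_t threads_per_block,
    int64_t max_blocks) {
  assert(blocks_uncapped >= 0 && threads_per_block > 0 && max_blocks > 0);
  const int64_t total_threads = blocks_uncapped * threads_per_block;
  const int64_t limit =
      static_cast<int64_t>(std::numeric_limits<uint32_t>::max());
  if (total_threads > limit) {
    // Large grid: cap, relying on the kernel's grid-stride loop.
    return std::min(blocks_uncapped, max_blocks);
  }
  // Small/medium grid: pass through unchanged.
  return blocks_uncapped;
}
```

With an assumed cap of 512 blocks, a request for 1000 blocks of 256 threads passes through as 1000, while a request for 16777216 blocks of 256 threads (exactly 2^32 threads) is reduced to 512.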

The four sites:

  • permute_2D_sparse_data_cuda blocks_1 (line ~223), threads_1=256.
  • permute_2D_sparse_data_cuda blocks_2 (line ~262), block size
    dim3(32, BT_blocks=32) so threads_per_block = 32 * BT_blocks = 1024.
  • permute_sparse_features_cuda blocks_1 (line ~439), threads_1=256.
  • permute_sparse_features_cuda blocks_2 (line ~486), block size
    same as above (1024).
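
For the two block sizes listed above, the block count at which the `UINT32_MAX` threshold is crossed can be worked out directly. The sketch below is only arithmetic on the numbers in this description; the function names are hypothetical and do not appear in FBGEMM.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

constexpr int64_t kUint32Max =
    static_cast<int64_t>(std::numeric_limits<uint32_t>::max());

// Largest block count for which blocks * threads_per_block <= UINT32_MAX.
constexpr int64_t max_uncapped_blocks(int64_t threads_per_block) {
  return kUint32Max / threads_per_block;
}

// threads_1 = 256 sites: up to 2^24 - 1 blocks stay uncapped.
static_assert(max_uncapped_blocks(256) == 16777215, "256-thread threshold");
// dim3(32, 32) = 1024-thread sites: up to 2^22 - 1 blocks stay uncapped.
static_assert(max_uncapped_blocks(1024) == 4194303, "1024-thread threshold");

// True when a launch of this shape would exceed the threshold.
inline bool exceeds_launch_limit(int64_t blocks, int64_t threads_per_block) {
  assert(blocks >= 0 && threads_per_block > 0);
  return blocks > max_uncapped_blocks(threads_per_block);
}
```

So the cap only comes into play from 16777216 blocks at the 256-thread sites and from 4194304 blocks at the 1024-thread sites.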

Drive-by: switches cuda_calc_block_count (y/z-dim 65535 cap) to
determine_grid_blocks (which uses cuda_calc_xblock_count internally
with the 2^31-1 x-dim cap). These launches are 1-D x-dim launches so
cuda_calc_xblock_count is the correct primitive; the kernels
grid-stride so any change in grid size is correctness-preserving.
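
The claim that a grid-stride kernel tolerates any grid size can be checked with a CPU emulation. This is a generic 1-D grid-stride loop written for illustration; it is not the `sparse_permute_2d.cu` kernel, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates a 1-D grid-stride loop on the CPU: each (block, thread) pair
// starts at its global index and advances by the total thread count.
// Returns how many times each of the n work items was visited.
inline std::vector<int> visit_counts(
    int64_t n, int64_t blocks, int64_t threads_per_block) {
  assert(n >= 0 && blocks > 0 && threads_per_block > 0);
  std::vector<int> visits(static_cast<std::size_t>(n), 0);
  const int64_t stride = blocks * threads_per_block;
  for (int64_t b = 0; b < blocks; ++b) {
    for (int64_t t = 0; t < threads_per_block; ++t) {
      for (int64_t i = b * threads_per_block + t; i < n; i += stride) {
        ++visits[static_cast<std::size_t>(i)];
      }
    }
  }
  return visits;
}

// True when every work item is processed exactly once.
inline bool covers_exactly_once(
    int64_t n, int64_t blocks, int64_t threads_per_block) {
  for (int v : visit_counts(n, blocks, threads_per_block)) {
    if (v != 1) {
      return false;
    }
  }
  return true;
}
```

For 1000 work items and 256 threads per block, both the full grid of 4 blocks and a grid capped to a single block visit every item exactly once; a smaller grid only means each thread performs more loop iterations.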

Reviewed By: spcyppt

Differential Revision: D104937969

q10 added 3 commits June 13, 2026 18:04
…max_thread_blocks helpers (pytorch#5853)

Summary:
X-link: facebookresearch/FBGEMM#2775


Final diff in the threshold-guard helper introduction stack.

Migrates the two host-side cap sites in
`codegen/training/backward/embedding_backward_split_template.cu`
from the legacy `std::min(blocks_uncapped,
get_max_thread_blocks_(...))` form to the new threshold-guarded
helper `fbgemm_gpu::utils::cuda::determine_grid_blocks_from_blocks(...,
BlockCapPolicy::Always)`.

With this migration the last legacy callers are gone, so this diff
also cleans up:
- Removes `fbgemm_gpu::utils::cuda::get_max_thread_blocks(stream)`
  from `include/fbgemm_gpu/utils/cuda_utilities.cuh`.
- Removes the file-local `get_max_thread_blocks_()` and
  `MAX_THREAD_BLOCKS_FACTOR` from
  `include/fbgemm_gpu/embedding_backward_template_helpers.cuh`.
- Adds `#include "fbgemm_gpu/utils/cuda_utilities.cuh"` to
  `embedding_backward_template_helpers.cuh` for the new helper.

Behavior-preserving on the TBE backward variants: the policy
`Always` matches the prior unconditional ROCm cap exactly.

Reviewed By: spcyppt

Differential Revision: D106453408
…ch#5897)

Summary:

X-link: facebookresearch/FBGEMM#2816

Now that the TBE backward template (the last caller) has migrated to
`cap_grid_dim_x`, remove the legacy
`fbgemm_gpu::utils::cuda::get_max_thread_blocks(stream)` helper from
`include/fbgemm_gpu/utils/cuda_utilities.cuh` and inline its sole
remaining use inside `cap_grid_dim_x`.

Behavior-preserving: the inlined body computes
`MAX_THREAD_BLOCKS_FACTOR * #SMs` exactly as before.

Reviewed By: spcyppt

Differential Revision: D107317501
@meta-cla meta-cla Bot added the cla signed label Jun 14, 2026
meta-codesync Bot commented Jun 14, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104937969.

meta-codesync Bot commented Jun 14, 2026

This pull request has been merged in 0f5414f.
