Fix ROCm __syncthreads deadlock in compute_amax_and_quantize_kernel by q10 · Pull Request #5894 · pytorch/FBGEMM

q10 · 2026-06-12T19:58:21Z

Summary:
Fixes a ROCm/AMD GPU deadlock in the FP4 fused amax+quantize kernel, found by
auditing fbgemm_gpu for the same barrier-divergence pattern fixed in D107554507.

The bug

In compute_amax_and_quantize_kernel (quantize.cu), threads whose idx >= n hit an
early return (quantize.cu:1958-1959) before reaching a block-wide reduction.
The launch fp4_fused_amax_quantize uses block = dim3(blocksize, blocks_per_cta=4)
and blocks = ceil_div(numel, blocksize*4), so when numel is not a multiple of
blocksize*4 the tail block has some threads that return early while the rest
enter the block reduction. The surviving threads stall forever on a block-wide
barrier -> deadlock on ROCm (latent UB on NVIDIA).
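
The tail-block arithmetic can be checked on the host. This is a minimal sketch, not code from quantize.cu: the helper name tail_inactive_threads and the example numel = 70 are illustrative assumptions, and blocksize = 16 is taken from the <__nv_bfloat16, 16, 4> instantiation mentioned below.

```cuda
#include <cassert>

// Integer ceiling division, mirroring ceil_div(numel, blocksize*4) in the launch.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Hypothetical helper: number of launched threads whose idx >= numel.
// They all sit in the tail block. `rows` is the block's y-dimension (4 here).
constexpr int tail_inactive_threads(int numel, int blocksize, int rows) {
  const int threads_per_block = blocksize * rows;
  const int blocks = ceil_div(numel, threads_per_block);
  return blocks * threads_per_block - numel;
}

// numel = 70 with 64-thread blocks: 2 blocks are launched, and 58 threads of
// the second block would take the early return.
static_assert(ceil_div(70, 16 * 4) == 2, "two blocks cover 70 elements");
static_assert(tail_inactive_threads(70, 16, 4) == 58, "58 threads have idx >= n");
// When numel is a multiple of blocksize*4, no thread returns early.
static_assert(tail_inactive_threads(128, 16, 4) == 0, "no inactive threads");
```

Any nonzero result means the tail block mixes returning and non-returning threads, which is the precondition for the hang.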

Where the hanging barrier actually is (not visible in the diff)

The blocking barriers are NOT in the kernel body -- they are reached transitively
through a device helper, which is why the early return was easy to miss. Call
chain:

1. Kernel calls compute_max<THREAD_X, THREAD_Y>() (quantize.cu:1964)
2. compute_max() branches on THREAD_X (quantize.cu:1864-1872):
   - THREAD_X == 32 -> compute_max_warp() (warp-only, no block barrier)
   - else -> compute_max_block() (quantize.cu:1870)
   The sole instantiation is <__nv_bfloat16, 16, 4> (quantize.cu:1993), so
   THREAD_X == 16 and it ALWAYS takes the compute_max_block() path.
3. compute_max_block() (quantize.cu:1825-1846) has two block-wide barriers:
   - implicit __syncthreads() inside cub::BlockReduce::Reduce() (quantize.cu:1837)
   - explicit __syncthreads() (quantize.cu:1843)

Both gate the whole physical 64-thread block (dim3(blocksize, 4)), even though
cub is declared BlockReduce<float, 16> with per-row temp_storage[threadIdx.y].
So a thread that returns early in ANY row stalls the entire block.

The fix

Remove the early return and mask the work with active = idx < n:

- Inactive lanes load a neutral 0.0f. The reduction is a fabsf-max, and
  fabsf(0.0f) = 0 <= any real |x|, so inactive lanes cannot perturb block_amax.
- compute_max() is called by ALL threads -> full block participation, both
  __syncthreads() are reached by every thread.
- Only active lanes (idx < n) store y[idx]; no OOB read or write.

Behavior is bit-identical for active lanes. blocks = ceil_div(...) guarantees the
tail block always has >= 1 active lane, so block_amax is never degenerate.
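
The masking pattern can be sketched as follows. This is an illustration under stated assumptions, not the FBGEMM code: row_amax, masked_amax_scale_kernel, and run_masked_amax_scale are hypothetical names. The reduction is a hand-written shared-memory max rather than cub::BlockReduce, the data type is float rather than __nv_bfloat16, and the "quantize" step is a placeholder division by the row amax instead of the FP4 conversion.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the block-path reduction: a per-row max of |v|.
// It contains __syncthreads(), so EVERY thread of the block must call it.
template <int THREAD_X, int THREAD_Y>
__device__ float row_amax(float v) {
  static_assert((THREAD_X & (THREAD_X - 1)) == 0, "THREAD_X must be a power of two");
  __shared__ float scratch[THREAD_Y][THREAD_X];
  scratch[threadIdx.y][threadIdx.x] = fabsf(v);
  __syncthreads();
  for (int s = THREAD_X / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) {
      scratch[threadIdx.y][threadIdx.x] = fmaxf(
          scratch[threadIdx.y][threadIdx.x], scratch[threadIdx.y][threadIdx.x + s]);
    }
    __syncthreads();  // block-wide barrier, reached by all rows
  }
  return scratch[threadIdx.y][0];
}

template <int THREAD_X, int THREAD_Y>
__global__ void masked_amax_scale_kernel(const float* x, float* y, int n) {
  const int idx = (blockIdx.x * THREAD_Y + threadIdx.y) * THREAD_X + threadIdx.x;
  const bool active = idx < n;
  // The buggy form would be `if (idx >= n) return;` here, skipping the barriers.
  const float v = active ? x[idx] : 0.0f;              // neutral element for a |.|-max
  const float amax = row_amax<THREAD_X, THREAD_Y>(v);  // called by all threads
  if (active) {
    // Placeholder for the real quantization step: scale by the row amax.
    y[idx] = amax > 0.0f ? v / amax : 0.0f;
  }
}

// Hypothetical host wrapper; error checking is kept minimal for brevity.
std::vector<float> run_masked_amax_scale(const std::vector<float>& x) {
  constexpr int TX = 16, TY = 4;
  const int n = static_cast<int>(x.size());
  std::vector<float> y(n, 0.0f);
  if (n == 0) return y;
  float* dx = nullptr;
  float* dy = nullptr;
  cudaMalloc(&dx, n * sizeof(float));
  cudaMalloc(&dy, n * sizeof(float));
  cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  const int blocks = (n + TX * TY - 1) / (TX * TY);
  masked_amax_scale_kernel<TX, TY><<<blocks, dim3(TX, TY)>>>(dx, dy, n);
  // cudaMemcpy on the default stream waits for the kernel to finish.
  const cudaError_t err = cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
  assert(err == cudaSuccess);
  (void)err;
  cudaFree(dx);
  cudaFree(dy);
  return y;
}
```

Calling run_masked_amax_scale with an input whose length is not a multiple of 64 (for example 70 elements) exercises the tail block: inactive lanes still reach every barrier, and only lanes with idx < n write to y.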

Caveat

This deadlock exists only because the sole launch uses THREAD_X = 16. With
THREAD_X = 32, compute_max() would take the compute_max_warp() path (no block
__syncthreads()) and the early return would have been safe. The fix is correct
for the code as it exists today and remains correct if a non-32 THREAD_X is added.

Reviewed By: henrylhtsang

Differential Revision: D107946896


meta-codesync · 2026-06-12T19:58:38Z

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107946896.

pytorch-bot Bot added ciflow/rocm module: rocm labels Jun 12, 2026

meta-cla Bot added the cla signed label Jun 12, 2026

meta-codesync Bot added the meta-exported label Jun 12, 2026

q10 wants to merge 1 commit into pytorch:main from q10:export-D107946896
