Fix backward regression in PR #5712 by avbokovoy · Pull Request #5825 · pytorch/FBGEMM

avbokovoy · 2026-06-03T11:54:18Z

Added a guard so that the optimized kernel is launched only for short-segment cases; all other cases skip it. New warp_per_row kernel dispatch logic (FBGEMM_TBE_ROCM_HIP_BACKWARD_KERNEL=1):

if (vbe || mixed_D) && num_unique_prev > 0 && total_L <= 2*num_unique_prev
  → launch hip_mixed_d_warp (introduced in the PR)
else if !mixed_D && D in [64, 128, 160, 192, 256, 320] && weights_dtype == output_dtype && weights_on_HBM
  → launch hip_warp (introduced previously - no change for this PR)
else
  → regular split_warp kernel
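
The rules above can be read as a small decision function. The following is a minimal Python sketch of that reading, not the actual FBGEMM host code: the `BackwardArgs` container, the `select_backward_kernel` name, and the use of strings for dtypes are all hypothetical and exist only for illustration. The branch order and conditions are copied from the description.

```python
from dataclasses import dataclass

# Embedding dimensions listed in the PR description for the hip_warp path.
HIP_WARP_SUPPORTED_D = frozenset({64, 128, 160, 192, 256, 320})


@dataclass(frozen=True)
class BackwardArgs:
    """Hypothetical container for the inputs named in the dispatch rules."""
    vbe: bool
    mixed_D: bool
    D: int
    total_L: int
    num_unique_prev: int
    weights_dtype: str
    output_dtype: str
    weights_on_HBM: bool


def select_backward_kernel(a: BackwardArgs) -> str:
    """Return the name of the kernel the described rules would pick."""
    # Short-segment guard: requires a non-zero unique count from the
    # previous iteration and total_L <= 2 * num_unique_prev.
    short_segments = a.num_unique_prev > 0 and a.total_L <= 2 * a.num_unique_prev
    if (a.vbe or a.mixed_D) and short_segments:
        return "hip_mixed_d_warp"
    if (
        not a.mixed_D
        and a.D in HIP_WARP_SUPPORTED_D
        and a.weights_dtype == a.output_dtype
        and a.weights_on_HBM
    ):
        return "hip_warp"
    return "split_warp"


if __name__ == "__main__":
    args = BackwardArgs(
        vbe=True, mixed_D=True, D=128, total_L=100, num_unique_prev=80,
        weights_dtype="fp16", output_dtype="fp16", weights_on_HBM=True,
    )
    print(select_backward_kernel(args))  # prints "hip_mixed_d_warp"
```

Read literally, a VBE input with uniform D that fails the short-segment guard can still take the hip_warp branch if the remaining conditions hold, because the second branch only excludes mixed_D.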

spcyppt · 2026-06-03T17:42:14Z

Could you help clarify which cases will use this hip_mixed_d kernel? Please also put it in the PR summary.

From previous discussion with Bernard, this is my understanding:

if vbe
  → launch hip_mixed_d_warp kernel (introduced in the PR)
else
  if mixed D
    → launch hip_mixed_d_warp (introduced in the PR)
  else
    if D in [64, 128, 160, 192, 256, 320] && weights_dtype == output_dtype && weights_on_HBM
      → launch hip_warp kernel (introduced previously - no change for this PR)
    else
      → regular split_warp kernel

Could you modify the above based on the new push?

cc: @avbokovoy @liligwu

avbokovoy · 2026-06-04T08:19:03Z

@spcyppt The dispatch logic after this PR should look like this:

if (vbe || mixed_D) && num_unique_prev > 0 && total_L <= 2*num_unique_prev
  → launch hip_mixed_d_warp (introduced in the PR)
else if !mixed_D && D in [64, 128, 160, 192, 256, 320] && weights_dtype == output_dtype && weights_on_HBM
  → launch hip_warp (introduced previously - no change for this PR)
else
  → regular split_warp kernel
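
One way to read the guard `num_unique_prev > 0 && total_L <= 2*num_unique_prev` is as a cap on the average segment length, which matches the commit title "Gate on avg-SL threshold": `total_L / num_unique_prev <= 2`, written as a multiplication so that no division is needed and a zero count is rejected explicitly. The sketch below illustrates only that arithmetic. The function name and the `max_avg_sl` parameter are hypothetical, and treating `num_unique_prev` as the unique-row count from the previous iteration is an assumption based on the variable name.

```python
def is_short_segment_batch(total_L: int, num_unique_prev: int, max_avg_sl: int = 2) -> bool:
    """Return True when the average segment length is at most max_avg_sl.

    total_L         -- total number of indices in the batch
    num_unique_prev -- unique-row count, assumed to come from the previous iteration
    max_avg_sl      -- hypothetical knob; the PR hard-codes the factor 2
    """
    # No usable count (for example on the very first iteration): fail the guard.
    if num_unique_prev <= 0:
        return False
    # Equivalent to total_L / num_unique_prev <= max_avg_sl, without dividing.
    return total_L <= max_avg_sl * num_unique_prev


if __name__ == "__main__":
    print(is_short_segment_batch(total_L=160, num_unique_prev=80))  # average 2.0
    print(is_short_segment_batch(total_L=161, num_unique_prev=80))  # average just above 2
    print(is_short_segment_batch(total_L=10, num_unique_prev=0))    # no previous count
```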

Let me know if you have other questions.

cc: @liligwu

spcyppt · 2026-06-07T01:32:12Z

@avbokovoy could you rebase this? There's a conflict and I could not import the PR. I don't have permission to resolve conflicts.

Summary: CTA kernel optimizations from D86108064 split out:
- Enable subwarp shuffle for CTA kernel on ROCm
- Adaptive work group sizing based on total_L/total_B ratio
- Small-D template parameter override (max_D <= 128) with kFixedMaxVecsPerThread=1, kThreadGroupSize=32, kUseVecBlocking=false for tighter compiler codegen
- PROCESS_BLOCK macro for grad accumulation loop unrolling in compute_grad_sum (shared by CTA and warp kernels)
- subwarp_reduce_add for ROCm GROUP_REDUCE_ALL_SUM (shared by CTA and warp kernels)

This diff should not impact any performance on NVIDIA GPUs.

Differential Revision: D102946299
Pulled By: spcyppt
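
For readers unfamiliar with subwarp reductions, the following Python simulation sketches the general idea behind an all-reduce sum over fixed-size lane groups. It assumes the standard XOR butterfly exchange pattern. The summary above does not say how `subwarp_reduce_add` is implemented, so this illustrates the technique in general rather than the FBGEMM code, and the function name `subwarp_all_reduce_sum` is hypothetical.

```python
from typing import List, Sequence


def subwarp_all_reduce_sum(lane_values: Sequence[float], group_size: int) -> List[float]:
    """Simulate an XOR-butterfly all-reduce within groups of group_size lanes.

    Every lane ends up holding the sum of its own group, which is what a
    "reduce all sum" over a subwarp is expected to produce.
    """
    n = len(lane_values)
    if group_size <= 0 or group_size & (group_size - 1):
        raise ValueError("group_size must be a positive power of two")
    if n % group_size:
        raise ValueError("lane count must be a multiple of group_size")

    vals = list(lane_values)
    offset = group_size // 2
    while offset >= 1:
        # Each lane adds the value held by lane (i XOR offset). Because
        # offset < group_size and groups are aligned, the partner lane is
        # always inside the same group.
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals


if __name__ == "__main__":
    # Eight lanes split into two groups of four.
    print(subwarp_all_reduce_sum([1, 2, 3, 4, 5, 6, 7, 8], group_size=4))
    # prints [10, 10, 10, 10, 26, 26, 26, 26]
```

The butterfly needs log2(group_size) exchange steps, so a smaller group finishes in fewer steps than a reduction across the full warp.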

Summary: Warp kernel optimizations from D86108064 split out:
- New hip_mixed_d_warp kernel for mixed-D and VBE backward on ROCm
- Batch preloading of per-row metadata via warp shuffle
- Momentum value preloading (split_precomputation_preload) to eliminate separate global memory reads
- Overloaded table_update_kernel accepting pre-resolved placement/offset/optimizer state
- AMD __builtin_amdgcn_readlane for efficient broadcasting
- Small-D template override (max_D <= 128) for warp kernel

This diff should not impact any performance on NVIDIA GPUs.

Differential Revision: D102946325
Pulled By: spcyppt

avbokovoy · 2026-06-09T10:27:51Z

> @avbokovoy could you rebase this? There's a conflict and I could not import the PR. I don't have permission to resolve conflicts.

@spcyppt Rebase is done. Apologies for the delay.

meta-codesync · 2026-06-09T19:55:58Z

@spcyppt has imported this pull request. If you are a Meta employee, you can view this in D107407381.

meta-cla Bot added the cla signed label Jun 3, 2026

Bernard-Liu and others added 4 commits June 9, 2026 09:22

Guard optimized kernel to process short segments only

a7a487a

Gate on avg-SL threshold via zero-stall prev-iter caching

5ff55c7

avbokovoy force-pushed the abokovoi/pr_5712_backward_regression branch from 1445a35 to 5ff55c7 Compare June 9, 2026 10:25

avbokovoy wants to merge 4 commits into pytorch:main from ROCm:abokovoi/pr_5712_backward_regression