
Use 256-thread blocks for jagged dense-output kernel (#5848)

Open: MericGit wants to merge 1 commit into pytorch:main from MericGit:export-D107571746


@MericGit MericGit commented Jun 8, 2026

Summary:

X-link: https://github.com/facebookresearch/FBGEMM/pull/2766

Initial ROCm profiler thread tracing shows poor utilization for jagged_1d_to_dense.
Previously, the launch used only 16 threads when D=1 (the 1D case), which fills just 25% of a 64-lane wavefront.
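The 25% figure follows from simple lane arithmetic. Below is a minimal sketch of that calculation, assuming a 64-lane wavefront (implied by 16 threads being 25%); the helper names `wavefronts_per_block` and `lane_utilization_percent` are hypothetical and are not part of FBGEMM.

```cpp
#include <cassert>

// Assumption: 64 lanes per wavefront, as implied by "16 threads = 25%".
constexpr int kWavefrontSize = 64;

// Number of wavefronts the hardware must allocate for a block of `threads`.
inline int wavefronts_per_block(int threads, int wave = kWavefrontSize) {
  assert(threads > 0 && wave > 0);
  return (threads + wave - 1) / wave;  // round up to whole wavefronts
}

// Percentage of the allocated lanes that actually hold a thread.
inline int lane_utilization_percent(int threads, int wave = kWavefrontSize) {
  return 100 * threads / (wavefronts_per_block(threads, wave) * wave);
}
```

With these helpers, a 16-thread block occupies one wavefront at 25% lane utilization, while a 256-thread block fills four wavefronts completely.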

For the 2D case (large D), consistently using 256 threads seems fine. An alternative would be to add a special path for D=1 only (the 1D case); open to suggestions. In general, this change moves these kernels from 1024-thread blocks to 256-thread blocks.
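To make the trade-off concrete, here is a hedged sketch of one way a fixed 256-thread budget could be split between the inner dimension D and the rows. It is an illustration only, not the code in this diff: `BlockShape`, `pick_block_shape`, and the 64-thread cap on the x dimension are all assumptions.

```cpp
#include <cassert>

// Hypothetical 2D block shape: x covers the inner dimension D, y covers rows.
struct BlockShape {
  int x;
  int y;
};

constexpr int kBlockThreads = 256;  // fixed block size, from the PR summary
constexpr int kMaxThreadsX = 64;    // assumption: at most one wavefront wide

// Pick the smallest power of two >= D for x (capped at kMaxThreadsX) and give
// the remaining threads to y, so x * y is always kBlockThreads.
inline BlockShape pick_block_shape(int inner_dense_size) {
  assert(inner_dense_size > 0);
  int x = 1;
  while (x < inner_dense_size && x < kMaxThreadsX) {
    x *= 2;
  }
  return BlockShape{x, kBlockThreads / x};
}
```

Under these assumptions, D=1 yields a 1x256 block and large D yields a 64x4 block, so every block fills whole wavefronts without needing a separate D=1 code path.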

These changes affect jagged_1d_to_dense, jagged_2d_to_dense, and jagged_to_padded_dense_forward, with the main focus on jagged_1d_to_dense.

Differential Revision: D107571746

@meta-cla meta-cla Bot added the cla signed label Jun 8, 2026
meta-codesync Bot commented Jun 8, 2026

@MericGit has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107571746.

@meta-codesync meta-codesync Bot changed the title Use 256-thread blocks for jagged dense-output kernel Use 256-thread blocks for jagged dense-output kernel (#5848) Jun 12, 2026
@MericGit MericGit force-pushed the export-D107571746 branch from 97c857b to 2256ee2 Compare June 12, 2026 22:06