JIT: Improve codegen for xarch vector byte multiply #126348

Merged: tannergooding merged 7 commits into dotnet:main from saucecontrol:bytemul on Jun 8, 2026

@saucecontrol saucecontrol commented Mar 31, 2026

Member

Resolves #109775

In cases where we can widen to the next vector size up and multiply once, the current codegen is already good. When that's not possible, the current codegen falls back to a version that splits the inputs into two half-size vectors and runs the same widen/multiply/narrow algorithm on each half, which is not optimal.

This implements the suggestion made by @MineCake147E on #109775. It still requires two multiplications, but it avoids the double widening and narrowing. The result is a ~2x perf improvement.

Benchmarks

Typical diff:

        vmovups  zmm0, zmmword ptr [rdx]
-       vmovaps  zmm1, zmm0
-       vpmovzxbw zmm1, ymm1
-       vmovups  zmm2, zmmword ptr [r8]
-       vmovaps  zmm3, zmm2
-       vpmovzxbw zmm3, ymm3
-       vpmullw  zmm1, zmm3, zmm1
-       vpmovwb  ymm1, zmm1
-       vextracti32x8 ymm0, zmm0, 1
-       vpmovzxbw zmm0, ymm0
-       vextracti32x8 ymm2, zmm2, 1
-       vpmovzxbw zmm2, ymm2
-       vpmullw  zmm0, zmm2, zmm0
-       vpmovwb  ymm0, zmm0
-       vinserti32x8 zmm0, zmm1, ymm0, 1
-       vmovups  zmmword ptr [rcx], zmm0
+       vmovups  zmm1, zmmword ptr [r8]
+       vpmullw  zmm2, zmm1, zmm0
+       vpsrlw   zmm0, zmm0, 8
+       vpandd   zmm1, zmm1, dword ptr [reloc @RWD00] {1to16}
+       vpmullw  zmm0, zmm1, zmm0
+       vpternlogd zmm2, zmm0, dword ptr [reloc @RWD04] {1to16}, -20
+       vmovups  zmmword ptr [rcx], zmm2
        mov      rax, rcx
        vzeroupper 
        ret      

+RWD00  	dd	FF00FF00h
+RWD04  	dd	00FF00FFh
 
-; Total bytes of code 106
+; Total bytes of code 65
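
For readers who want to check the new sequence by hand, here is a minimal scalar sketch of what the added instructions appear to compute per 16-bit lane. It is a Python model, not the JIT's implementation. The function names are hypothetical, and the `ternlog` operand order (destination as the high index bit, memory operand as the low bit) is an assumption based on the usual reading of the `vpternlogd` truth table.

```python
def ternlog(a: int, b: int, c: int, imm8: int, width: int = 16) -> int:
    """Bitwise three-input truth table, modelled on vpternlogd (assumed bit order)."""
    result = 0
    for bit in range(width):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        result |= ((imm8 >> index) & 1) << bit
    return result


def mul_bytes_odd_even(a: bytes, b: bytes) -> bytes:
    """Element-wise byte multiply using two 16-bit multiplies per lane."""
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("inputs must have equal, even length")
    out = bytearray(len(a))
    for i in range(0, len(a), 2):
        # Treat each pair of bytes as one little-endian 16-bit lane.
        wa = a[i] | (a[i + 1] << 8)
        wb = b[i] | (b[i + 1] << 8)
        even = (wa * wb) & 0xFFFF        # vpmullw: low byte is a_even * b_even
        a_odd = wa >> 8                  # vpsrlw ..., 8: odd byte of a moved down
        b_odd = wb & 0xFF00              # vpandd with FF00FF00h: odd byte of b in place
        odd = (a_odd * b_odd) & 0xFFFF   # vpmullw: high byte is a_odd * b_odd
        # vpternlogd with 00FF00FFh and imm8 -20 (0xEC): odd | (even & mask)
        word = ternlog(even, odd, 0x00FF, -20 & 0xFF)
        out[i] = word & 0xFF
        out[i + 1] = word >> 8
    return bytes(out)


def mul_bytes_reference(a: bytes, b: bytes) -> bytes:
    """Plain truncating byte multiply, used only as a cross-check."""
    return bytes((x * y) & 0xFF for x, y in zip(a, b))
```

The trick works because the low 8 bits of a 16-bit product depend only on the low 8 bits of the operands, so the first multiply needs no masking on its inputs, only on its result.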

Full diffs

NB: The codegen could be better still. The JIT currently imports (and morphs) AND_NOT(x, y) as AND(x, NOT(y)), so when y is a constant that is reused, two constants are created where one would have sufficed.
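
As a small illustration of that note, the two data constants in the diff (`FF00FF00h` and `00FF00FFh`) are bitwise complements of each other within a 32-bit lane. The sketch below uses hypothetical helper names and models only the arithmetic, not the JIT's importer or morph logic.

```python
LANE_MASK = 0xFFFFFFFF   # model one 32-bit broadcast element
EVEN_BYTES = 0x00FF00FF  # RWD04 in the diff
ODD_BYTES = 0xFF00FF00   # RWD00 in the diff


def and_not(x: int, y: int) -> int:
    """AND_NOT(x, y) in the sense used above: x & ~y."""
    return x & ~y & LANE_MASK


def and_with_folded_not(x: int, y: int) -> int:
    """AND(x, NOT(y)) with NOT(y) folded into a second, separate constant."""
    folded = ~y & LANE_MASK
    return x & folded
```

Both forms give the same result, but the second one materializes `~y` as its own constant, which is how a single reused mask can end up as two entries in the read-only data.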

Copilot AI review requested due to automatic review settings March 31, 2026 06:25
@github-actions github-actions Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 31, 2026
@dotnet-policy-service dotnet-policy-service Bot added the community-contribution Indicates that the PR has been added by a community member label Mar 31, 2026
@dotnet-policy-service

Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copilot AI left a comment

Contributor

Pull request overview

Improves xarch JIT codegen for SIMD byte multiplication by using a more efficient two-multiply “odd/even byte” strategy when widening to the next vector size isn’t possible, reducing unnecessary widen/narrow work compared to the prior fallback.

Changes:

  • Adds a fast-path that widens to the next vector size (AVX2 for SIMD16, AVX512 for SIMD32) to perform a single multiply and then narrow.
  • Replaces the previous fallback (split/widen/mul/narrow twice) with an odd/even byte approach that uses two 16-bit multiplies and recombines bytes with masks/shifts.
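
The widen-based paths mentioned in the two bullets can be sketched in scalar form as follows. This is a rough Python model with hypothetical function names. The instruction mapping in the comments is taken from the removed lines of the diff in the PR description, not from the JIT source.

```python
def widen_mul_narrow(a: bytes, b: bytes) -> bytes:
    """Single-multiply path: zero-extend to 16 bits, multiply, truncate."""
    wide_a = [int(x) for x in a]                              # vpmovzxbw
    wide_b = [int(y) for y in b]                              # vpmovzxbw
    products = [(x * y) & 0xFFFF for x, y in zip(wide_a, wide_b)]  # vpmullw
    return bytes(p & 0xFF for p in products)                  # vpmovwb


def split_widen_mul_narrow(a: bytes, b: bytes) -> bytes:
    """Previous fallback: run the same algorithm on each half, then rejoin."""
    half = len(a) // 2
    lower = widen_mul_narrow(a[:half], b[:half])
    upper = widen_mul_narrow(a[half:], b[half:])              # vextracti32x8 first
    return lower + upper                                      # vinserti32x8
```

Both functions return the same bytes. The difference is only in how much widening and narrowing work is done, which is the cost the odd/even approach is meant to avoid.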

Copilot AI review requested due to automatic review settings March 31, 2026 08:41

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

@xtqqczze

xtqqczze commented Mar 31, 2026

Contributor

vpmovzxbw zmm1, zmm1

the current codegen has an invalid operand for the instruction?

should be VPMOVZXBW zmm1 {k1}{z}, ymm2/m256?

and a similar issue for vpmovwb?

@saucecontrol

Member Author

> the current codegen has an invalid operand for the instruction?

Ha, I didn't notice that. Looks like that's just a bug in the JIT disasm. Running the code bytes through another disassembler shows it correctly.

@saucecontrol

Member Author

cc @dotnet/jit-contrib

Diffs

Copilot AI review requested due to automatic review settings April 17, 2026 17:05

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

@saucecontrol

Member Author

> @saucecontrol Could you update the diff in the description now #126371 is merged?

Done

@tannergooding

Member

CC. @dotnet/jit-contrib, @kg, @EgorBo for secondary review

Copilot AI review requested due to automatic review settings June 8, 2026 19:53

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

@tannergooding tannergooding enabled auto-merge (squash) June 8, 2026 20:52
@tannergooding tannergooding merged commit aa035b9 into dotnet:main Jun 8, 2026
137 of 139 checks passed
@saucecontrol saucecontrol deleted the bytemul branch June 8, 2026 23:10
Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member

Development

Successfully merging this pull request may close these issues.

SSE/AVX byte multiplication could be improved

5 participants