<__msvc_int128.hpp>: use __umulh on ARM64/ARM64EC (#6184) #6281
Adesh4477 wants to merge 1 commit into
Conversation
Adds an ARM64/ARM64EC fast path to _Base128::_UMul128 that uses the __umulh intrinsic for the high 64 bits and a plain 64-bit multiply for the low 64 bits, in place of the Knuth base-2^32 fallback.

Microbench on Snapdragon X Elite (5M random uint64 pairs * 5 reps):
Knuth fallback : ~82 ms (~3.27 ns/op)
__umulh path   : ~27 ms (~1.08 ns/op)
Speedup        : ~3.03x

Disassembly collapses from ~30 ops (incl. /GS cookie push) to 4 ops (umulh / mul / str / ret).

_STL_128_INTRINSICS is intentionally not enabled for ARM64; that macro also gates _addcarry_u64, _subborrow_u64, __shiftleft128, __shiftright128, and _udiv128/_div128, which have no direct single-instruction ARM64 equivalents and are out of scope for this change.

Per the issue author, x64 is intentionally not modified -- _umul128 remains preferable there.
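For context, here is a simplified sketch of a Knuth-style base-2^32 full multiply, i.e. the kind of ~30-op sequence the __umulh path replaces outside constant evaluation. This is illustrative only; the function name and layout are assumptions, not the STL's actual code.

#include <cstdint>

// Illustrative Knuth-style base-2^32 64x64->128 multiply (assumed name, not the
// STL's identifier): split each operand into 32-bit halves, form four partial
// products, and propagate the carry out of the middle column.
constexpr std::uint64_t Umul128Fallback(std::uint64_t a, std::uint64_t b, std::uint64_t& high) {
    constexpr std::uint64_t mask = 0xFFFF'FFFFull;
    const std::uint64_t a_lo = a & mask, a_hi = a >> 32;
    const std::uint64_t b_lo = b & mask, b_hi = b >> 32;

    const std::uint64_t ll = a_lo * b_lo; // contributes at bit 0
    const std::uint64_t lh = a_lo * b_hi; // contributes at bit 32
    const std::uint64_t hl = a_hi * b_lo; // contributes at bit 32
    const std::uint64_t hh = a_hi * b_hi; // contributes at bit 64

    const std::uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask); // middle column plus carry
    high = hh + (lh >> 32) + (hl >> 32) + (mid >> 32); // high 64 bits of the product
    return (mid << 32) | (ll & mask); // low 64 bits of the product
}

On ARM64 the same 128-bit product is a single umulh for the high half plus a plain mul for the low half, which is what produces the 4-op disassembly quoted above.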
Pull request overview
Adds an ARM64/ARM64EC runtime fast-path for _Base128::_UMul128() in <__msvc_int128.hpp> by computing the high 64 bits via __umulh() and the low 64 bits via a normal 64-bit multiply, avoiding the existing Knuth base-2^32 fallback in non-constant-evaluated code.
Changes:
- Add an ARM64/ARM64EC __umulh-based implementation for the high half of the 128-bit product.
- Keep the existing constexpr/Knuth fallback for constant evaluation and other targets.
#elif (defined(_M_ARM64) || defined(_M_ARM64EC)) && !defined(_M_CEE_PURE)
        if (!_Is_constant_evaluated()) {
            _High_result = __umulh(_Left, _Right);
            return _Left * _Right;
        }
This appears to be the correct conclusion from the comment in stl/inc/xcharconv_ryu.h:57-62.
However, the comment itself seems to be obsolete: the intrinsics are in <intrin0.h> already.
So I'm not suggesting any changes here.
Actually, after looking into this further, I think that VSO-1918426 may be about something else entirely, not what Copilot led me to believe it means.
I'm still not suggesting any changes here.
VSO-1918426 "ifdef guards for Arm64EC/CHPE are missing or not done correctly in C/C++ headers" was resolved as fixed on 2023-12-01.
            _High_result = __umulh(_Left, _Right);
            return _Left * _Right;
        }
#endif // _STL_128_INTRINSICS / ARM64 __umulh
According to the latest practice for preprocessor comments, we mention just the last block, with arrows:
#endif // ^^^ (defined(_M_ARM64) || defined(_M_ARM64EC)) && !defined(_M_CEE_PURE) ^^^
On ARM64, _UMul128 falls through to the Knuth fallback because _STL_128_INTRINSICS is x64-only. ARM64 has umulh as a single instruction, so we can do this in two ops
instead of ~thirty. The patch adds the obvious #elif branch, using __umulh for the high half and a regular 64-bit multiply for the low half.
Tested locally with this microbench on a Snapdragon X Elite (5M random uint64 pairs * 5 reps):
Knuth fallback : ~82 ms (~3.27 ns/op)
__umulh path : ~27 ms (~1.08 ns/op)
Speedup : ~3.03x
Per the issue author, x64 is intentionally not modified -- _umul128 remains preferable there.
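For reference, here is a minimal sketch of the kind of microbench described above. This is a hypothetical harness, not the author's actual code; the reported numbers came from 5M pairs times 5 reps on a Snapdragon X Elite.

// Hypothetical harness, not the author's actual benchmark: times a 64x64->128
// multiply (__umulh for the high half on ARM64/ARM64EC, a plain multiply for
// the low half) over 5M random uint64 pairs and reports ns per operation.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

#if defined(_M_ARM64) || defined(_M_ARM64EC)
#include <intrin.h> // __umulh
#endif

int main() {
    constexpr std::size_t N = 5'000'000; // the original run repeated this 5 times
    std::mt19937_64 rng{42};
    std::vector<std::uint64_t> lhs(N), rhs(N);
    for (std::size_t i = 0; i < N; ++i) {
        lhs[i] = rng();
        rhs[i] = rng();
    }

    std::uint64_t sink = 0; // keeps the multiplies from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i) {
#if defined(_M_ARM64) || defined(_M_ARM64EC)
        sink ^= __umulh(lhs[i], rhs[i]); // high 64 bits via the intrinsic
#endif
        sink ^= lhs[i] * rhs[i]; // low 64 bits via a plain multiply
    }
    const auto stop = std::chrono::steady_clock::now();

    const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("%.2f ns/op (checksum %llu)\n", ns / N, static_cast<unsigned long long>(sink));
}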