[wasm] Enable Vector128 fast paths on Wasm via PackedSimd: hex/Guid, UTF-8, SearchValues, Teddy, Adler32/XXH3 #129838
lewing wants to merge 11 commits into
These internal helpers are used by HexConverter.AsciiToHexVector128 and other byte-interleaving code paths. They previously dispatched only to Sse2.UnpackLow/UnpackHigh or AdvSimd.Arm64.ZipLow/ZipHigh and threw NotSupportedException on platforms without either ISA. With the recent change that enables HexConverter and Guid formatting on Wasm via PackedSimd, the helpers became reachable on browser-wasm and started throwing at runtime in the libraries tests. Lower to PackedSimd.Shuffle with a constant 16-byte index vector (i8x16.shuffle) when PackedSimd is supported. Validated via System.Runtime.Extensions.Tests on browser-wasm (8224 passing, 0 failed) after the previous run failed 132 tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
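As a reference point, here is a scalar Python model of the two-vector i8x16.shuffle. The names are hypothetical and this is not the CoreLib code.

The instruction's 16 lane indices address the 32 bytes of a followed by b. In the Wasm encoding they are immediate operands, so they must be known at compile time. The UNPACK_LOW/UNPACK_HIGH tables show one assumed way to express the interleave as such constants.

```python
def i8x16_shuffle(a, b, indices):
    """Scalar model of Wasm i8x16.shuffle over the 32 bytes of a ++ b."""
    assert len(a) == len(b) == len(indices) == 16
    both = list(a) + list(b)
    # In real Wasm the indices are immediates; a value >= 32 fails validation.
    if any(not 0 <= i < 32 for i in indices):
        raise ValueError("i8x16.shuffle lane index out of range")
    return [both[i] for i in indices]

# Hypothetical constant index tables: 0-15 select from a, 16-31 select from b.
UNPACK_LOW = [i // 2 + (16 if i % 2 else 0) for i in range(16)]       # a0 b0 a1 b1 ... a7 b7
UNPACK_HIGH = [8 + i // 2 + (16 if i % 2 else 0) for i in range(16)]  # a8 b8 ... a15 b15

def unpack_low(a, b):
    return i8x16_shuffle(a, b, UNPACK_LOW)

def unpack_high(a, b):
    return i8x16_shuffle(a, b, UNPACK_HIGH)
```

Because the tables are fixed, a compiler can emit them as immediates. A runtime-built index vector has no such encoding.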
HexConverter.AsciiToHexVector128, HexConverter.EncodeTo_Vector128 and Guid.FormatGuidVector128Utf8 already use only portable Vector128 ops (Vector128.ShuffleNative, Vector128.UnpackLow/High, Vector128.Shuffle with constant indices) plus an optional AdvSimd.Arm64-specific branch. The gates at Convert.ToHexString, EncodeToUtf8/Utf16, and Guid.ToString required Ssse3 or AdvSimd.Arm64, so Wasm fell back to scalar even with PackedSimd. Add PackedSimd.IsSupported to the gates and the [CompExactlyDependsOn] attributes on the helpers. The bodies are unchanged; on Wasm the existing else branch (portable Vector128.Shuffle) is selected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
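Portable ops are enough here because table-lookup hex encoding needs only two things. It needs a 16-entry byte lookup (ShuffleNative, which lowers to a swizzle on Wasm) and an interleave (UnpackLow/UnpackHigh).

The Python sketch below models one 16-byte block under those assumptions. swizzle, interleave and hex_encode16 are hypothetical helpers, not HexConverter's actual code.

```python
HEX_DIGITS = b"0123456789ABCDEF"  # 16-entry lookup table (one 128-bit vector)

def swizzle(table, indices):
    # Models i8x16.swizzle: table[i] for i < 16, otherwise 0.
    return [table[i] if i < 16 else 0 for i in indices]

def interleave(x, y):
    # Models UnpackLow/UnpackHigh on 8-lane halves: x0 y0 x1 y1 ...
    return [v for pair in zip(x, y) for v in pair]

def hex_encode16(src):
    """Hypothetical sketch: hex-encode one 16-byte block with two lookups + interleave."""
    assert len(src) == 16
    hi = swizzle(HEX_DIGITS, [b >> 4 for b in src])    # high nibble -> ASCII digit
    lo = swizzle(HEX_DIGITS, [b & 0x0F for b in src])  # low nibble -> ASCII digit
    # Each input byte becomes "hi lo": UnpackLow covers bytes 0-7, UnpackHigh bytes 8-15.
    return bytes(interleave(hi[:8], lo[:8]) + interleave(hi[8:], lo[8:]))
```

Nibbles are always below 16, so the lookup never hits the swizzle's out-of-range case. That is why the same body can serve Ssse3, AdvSimd and Wasm.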
GetPointerToFirstInvalidChar's inner ASCII-scan loop dispatched on AdvSimd.Arm64 (with bitmask128) or Sse2 (with MoveMask), falling back to a scalar 4-DWORD-at-a-time path otherwise. On Wasm with PackedSimd, neither SIMD branch was taken, so UTF-8 validation took the scalar path. Add a PackedSimd.IsSupported branch that uses portable Vector128.LoadUnsafe + ExtractMostSignificantBits to compute the same per-byte non-ASCII bitmask used by the Sse2 path. Update the post-loop Debug.Assert to include PackedSimd. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
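The per-byte bitmask can be modeled in scalar Python. Take the top bit of each of the 16 lanes. If the mask is nonzero, its trailing-zero count is the offset of the first non-ASCII byte in the block.

This is a sketch of the idea, assuming little-endian lane order. first_non_ascii_index is a hypothetical helper, not GetPointerToFirstInvalidChar itself.

```python
def extract_most_significant_bits(lanes):
    # Models Vector128<byte>.ExtractMostSignificantBits: bit i = top bit of lane i.
    mask = 0
    for i, v in enumerate(lanes):
        mask |= ((v >> 7) & 1) << i
    return mask

def first_non_ascii_index(data):
    """Index of the first byte >= 0x80, or len(data) if the input is all ASCII."""
    i = 0
    while i + 16 <= len(data):
        mask = extract_most_significant_bits(data[i:i + 16])
        if mask:
            # Trailing-zero count of the mask = first non-ASCII lane in this block.
            return i + (mask & -mask).bit_length() - 1
        i += 16
    while i < len(data):  # scalar tail for the final < 16 bytes
        if data[i] >= 0x80:
            return i
        i += 1
    return i
```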
Tagging subscribers to this area: @dotnet/area-system-numerics
Pull request overview
This PR enables existing Vector128-based fast paths to run on browser-wasm by widening the feature gates to include System.Runtime.Intrinsics.Wasm.PackedSimd and adding a PackedSimd.Shuffle implementation for Vector128.UnpackLow/UnpackHigh so dependent encode/format paths don’t fall into NotSupportedException on Wasm SIMD-enabled runs.
Changes:
- Add PackedSimd.IsSupported implementations for Vector128.UnpackLow/UnpackHigh using PackedSimd.Shuffle (two-vector shuffle).
- Widen existing hex + Guid formatting SIMD gates / [CompExactlyDependsOn] to include PackedSimd.
- Add a PackedSimd vectorized ASCII-scan path in Utf8Utility.Validation using Vector128.LoadUnsafe + ExtractMostSignificantBits.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs | Adds PackedSimd support to UnpackLow/UnpackHigh to avoid Wasm falling into the unsupported path. |
| src/libraries/Common/src/System/HexConverter.cs | Expands encode-side Vector128 gate/attributes to include PackedSimd for Wasm SIMD. |
| src/libraries/System.Private.CoreLib/src/System/Guid.cs | Expands Guid vectorized formatting gate/attributes to include PackedSimd on little-endian. |
| src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs | Adds a PackedSimd ASCII-validation fast path using MSB extraction. |
TranscodeToUtf8's 8-char ASCII fast loop used a Vector128<short> read, a mask-and-compare to detect non-ASCII, and a narrow-and-store of 8 bytes using Sse2.PackUnsignedSaturate / AdvSimd.ExtractNarrowingSaturateUnsignedLower. Two follow-on 4-char sites narrowed 4 bytes the same way. All four sites required Sse41.X64 or AdvSimd.Arm64 + LE, so Wasm took the 4-DWORD-at-a-time scalar fallback. Add PackedSimd branches at every dispatch site:
- Outer entry gate (declaration + entry condition)
- 8-char narrow-store: use the existing portable AND-compare for the non-ASCII test (the same code Sse41 already uses) and PackedSimd.ConvertNarrowingSaturateUnsigned + scalar extract for the store
- 4-char narrow-stores: PackedSimd.ConvertNarrowingSaturateUnsigned + AsUInt32().ToScalar() unaligned write
The Sse2.X64.ConvertToUInt64 sub-branch already had an else path that calls AsUInt64().ToScalar(), which works on Wasm without changes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
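The narrow-and-store step can be modeled in scalar Python. Here i8x16.narrow_i16x8_u is assumed to be the instruction behind PackedSimd.ConvertNarrowingSaturateUnsigned. It treats its inputs as signed 16-bit lanes and clamps each one to [0, 255].

The AND-compare rejects non-ASCII lanes first, so saturation never alters an accepted char. transcode_ascii8 and to_int16 are hypothetical helpers.

```python
def to_int16(u):
    # Reinterpret a UTF-16 code unit (uint16) as a signed int16 lane, like Vector128<short>.
    return u - 0x10000 if u >= 0x8000 else u

def narrow_saturate_unsigned(lo, hi):
    """Models i8x16.narrow_i16x8_u: 8 + 8 signed int16 lanes -> 16 uint8 lanes in [0, 255]."""
    return [min(max(v, 0), 255) for v in list(lo) + list(hi)]

def transcode_ascii8(chars):
    """Hypothetical helper: narrow 8 UTF-16 code units to 8 bytes, or None if any is non-ASCII."""
    assert len(chars) == 8
    # Portable AND-compare: a lane is non-ASCII iff any bit under 0xFF80 is set.
    if any(c & 0xFF80 for c in chars):
        return None
    narrowed = narrow_saturate_unsigned([to_int16(c) for c in chars], [0] * 8)
    # Only the low 8 bytes are stored (a 64-bit scalar extract in the C# path).
    return bytes(narrowed[:8])
```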
SearchValues<string> with 2+ values previously selected the Aho-Corasick implementation on Wasm because the Teddy entry gate in StringSearchValues.cs required Ssse3 or AdvSimd.Arm64. Teddy's core Vector128 primitives in TeddyHelper.cs (LoadAndPack16AsciiChars, the nibble GetNibbles helper, the two-table Shuffle, and RightShift1/2) similarly excluded PackedSimd. Add PackedSimd branches throughout:
- LoadAndPack16AsciiChars: PackedSimd.ConvertNarrowingSaturateUnsigned
- GetNibbles: PackedSimd needs the explicit '& 0xF' on the low half because Swizzle returns 0 for indices >= 16 (unlike Ssse3's implicit AND of the low 4 bits)
- Shuffle: already uses portable Vector128.ShuffleNative, which maps to PackedSimd.Swizzle; just widen the [CompExactlyDependsOn]
- RightShift1/RightShift2: compose two Vector128.ShuffleNative calls with constant index vectors and OR the halves. PackedSimd.Shuffle (two-vector i8x16.shuffle) is impractical because it requires constant lane indices; Swizzle clamps out-of-range indices to 0, which makes the OR safe.
Widen the entry gate in CreateFromNormalizedValues (StringSearchValues.cs) and the null-char filter in TryGetTeddyAcceleratedValues. PackedSimd shares Ssse3's PackUnsignedSaturate behavior, where signed negative inputs become 0, so null-containing needles produce more false positives on both. Widen [CompExactlyDependsOn] on the IndexOfAnyN2/N3 + Vector128 helpers in AsciiStringSearchValuesTeddyBase.cs.
Validated: System.Memory.Tests on browser-wasm 52249/52249 passing (covers SearchValues<string> Teddy paths via StringSearchValues tests), host 52905/52906 unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
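The GetNibbles and RightShift reasoning can be checked with a scalar Python model of the two lookup primitives:
- swizzle models i8x16.swizzle: indices >= 16 give 0.
- pshufb models the Ssse3 instruction: a set top bit gives 0; otherwise only the low 4 bits are used.

The RightShift1/RightShift2 lane layout is an assumption: the last one or two bytes of left, followed by the leading bytes of right. All names are hypothetical.

```python
ZERO_LANE = 0xFF  # any index >= 16 zeroes the lane under i8x16.swizzle

def swizzle(table, indices):
    # Models i8x16.swizzle: indices >= 16 select 0.
    return [table[i] if i < 16 else 0 for i in indices]

def pshufb(table, indices):
    # Models Ssse3 pshufb: top bit set selects 0, otherwise only the low 4 bits are used.
    return [0 if i & 0x80 else table[i & 0x0F] for i in indices]

def low_nibble_lookup_wasm(table, data):
    # The explicit '& 0xF' that the Wasm GetNibbles path needs.
    return swizzle(table, [b & 0x0F for b in data])

def right_shift1(left, right):
    # Assumed layout: left[15], then right[0..14].
    from_left = swizzle(left, [15] + [ZERO_LANE] * 15)
    from_right = swizzle(right, [ZERO_LANE] + list(range(15)))
    # In every lane one partial result is 0, so OR acts as a select.
    return [x | y for x, y in zip(from_left, from_right)]

def right_shift2(left, right):
    # Assumed layout: left[14], left[15], then right[0..13].
    from_left = swizzle(left, [14, 15] + [ZERO_LANE] * 14)
    from_right = swizzle(right, [ZERO_LANE] * 2 + list(range(14)))
    return [x | y for x, y in zip(from_left, from_right)]
```

Feeding raw ASCII bytes straight into swizzle zeroes nearly every lane, because printable ASCII is >= 16. This is the behavior difference the explicit mask compensates for.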
SearchValues<char> with values that span more than the ASCII range selects a ProbabilisticMap-based search. The vectorized IndexOfAny / LastIndexOfAny path (using ContainsMask16Chars + IsCharBitNotSet) was previously gated on Sse41 || AdvSimd.Arm64 only, so on Wasm the search fell back to the scalar SimpleLoop even when PackedSimd was available. This change is subtler than the other enablement changes because the *layout* of the ProbabilisticMap bitmap also branches on the same gate (SetCharBit/IsCharBitSet at the top of the file). The [BypassReadyToRun] comment there warns that the construction and lookup branches must agree at all times during program execution. Widen all three gates (SetCharBit/IsCharBitSet, ContainsMask16Chars, and the IndexOfAny/LastIndexOfAny entry dispatcher), along with the [CompExactlyDependsOn] attributes on the Vector128 worker methods, to include PackedSimd consistently. ContainsMask16Chars gets a PackedSimd branch that mirrors the Sse2 algorithm, using PackedSimd.ConvertNarrowingSaturateUnsigned for the two-vector narrowing step. IsCharBitNotSet already had a PackedSimd dependency for the table lookup via Vector128.ShuffleNative. ProbabilisticWithAsciiCharSearchValues already had PackedSimd dispatch. Validated: System.Memory.Tests on browser-wasm 52249/52249 passing, host arm64 52905/52906 unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
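The layout must be chosen by the same gate everywhere. If a bitmap is built with one bit layout and probed with another, lookups can silently return wrong answers.

The Python toy below uses two made-up layouts, layout_a and layout_b. They are not ProbabilisticMap's real layouts and only illustrate the failure mode that the [BypassReadyToRun] comment guards against.

```python
# Two hypothetical layouts for a 256-bit map stored as 32 bytes.
# Both are valid bijections value -> (byte index, bit index); they simply disagree.
def layout_a(value):
    return value >> 3, value & 7

def layout_b(value):
    return value & 31, value >> 5

def build(values, layout):
    bitmap = bytearray(32)
    for v in values:
        byte, bit = layout(v)
        bitmap[byte] |= 1 << bit
    return bitmap

def contains(bitmap, value, layout):
    byte, bit = layout(value)
    return ((bitmap[byte] >> bit) & 1) == 1
```

Each layout is self-consistent. Mixing them, for example building with layout_a and probing with layout_b, misses values that were inserted.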
Both algorithms were already vectorized on Wasm via the portable
Vector128 else branch (Vector128.Widen + multiply + add), but the
result was 3-5 portable ops per iteration where PackedSimd has a
direct one-instruction equivalent.
Adler32.UpdateVector128: add a PackedSimd branch alongside Sse2 and
AdvSimd that uses PackedSimd.AddPairwiseWidening
(i16x8.extadd_pairwise_i8x16_u and i32x4.extadd_pairwise_i16x8_u) for
the s1 sum and PackedSimd.MultiplyWideningLower/Upper +
AddPairwiseWidening for the weighted s2 sum.
XxHashShared.MultiplyWideningLower: add a PackedSimd branch that
computes { source[0]*source[1], source[2]*source[3] } via two
shuffles + i64x2.extmul_low_i32x4_u, replacing the portable
mask + 64-bit multiply pair.
Validated: System.IO.Hashing.Tests 4196/4196 passing on both host
arm64 and browser-wasm (the XxHash lane order is checked end-to-end
via the algorithm output bytes — a swap would corrupt every hash).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
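The lane arithmetic in both changes can be checked with a scalar Python model. The names are hypothetical and this is not the System.IO.Hashing code.

The model computes Adler-32 16 bytes at a time:
- s1 uses pairwise-widening adds.
- s2 uses widening multiplies by the weights 16..1.
- Per block, b += 16*a + s2 and a += s1.

The result is compared with zlib.adler32. The block also shows one way that two shuffles plus a low-half widening multiply reproduce XXH3's lo32 * hi32 per 64-bit lane. The specific shuffle indices are an assumption.

```python
import zlib

MOD = 65521
WEIGHTS = list(range(16, 0, -1))  # 16, 15, ..., 1 for bytes 0..15 of a block
MASK32 = 0xFFFFFFFF

def add_pairwise_widening(lanes):
    # Models extadd_pairwise: adjacent lanes summed into one lane of twice the width.
    return [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]

def multiply_widening(x, y):
    # Models MultiplyWideningLower/Upper on u8 lanes: products widened to u16.
    return [a * b for a, b in zip(x, y)]

def adler32_sketch(data, a=1, b=0):
    """Hypothetical 16-byte-block Adler-32; the tail is handled byte by byte."""
    n = len(data) - len(data) % 16
    for off in range(0, n, 16):
        block = list(data[off:off + 16])
        # s1: u8 -> u16 -> u32 pairwise widening adds, then a horizontal sum.
        s1 = sum(add_pairwise_widening(add_pairwise_widening(block)))
        # s2: weighted sum via widening multiplies, then u16 -> u32 pairwise adds.
        products = (multiply_widening(block[:8], WEIGHTS[:8])
                    + multiply_widening(block[8:], WEIGHTS[8:]))
        s2 = sum(add_pairwise_widening(products))
        b = (b + 16 * a + s2) % MOD
        a = (a + s1) % MOD
    for byte in data[n:]:
        a = (a + byte) % MOD
        b = (b + a) % MOD
    return (b << 16) | a

def mul_lo_hi_portable(lanes64):
    # Portable form: per 64-bit lane, (x & 0xFFFFFFFF) * (x >> 32).
    return [(x & MASK32) * (x >> 32) for x in lanes64]

def mul_lo_hi_packed(lanes64):
    # View two u64 lanes as four u32 lanes: s = [lo0, hi0, lo1, hi1].
    s = [part for x in lanes64 for part in (x & MASK32, x >> 32)]
    evens = [s[0], s[2], s[0], s[2]]  # one possible shuffle: lo0, lo1 into the low half
    odds = [s[1], s[3], s[1], s[3]]   # one possible shuffle: hi0, hi1 into the low half
    # i64x2.extmul_low_i32x4_u: widen-multiply the two low u32 lanes pairwise.
    return [evens[0] * odds[0], evens[1] * odds[1]]

assert adler32_sketch(b"hello wasm") == zlib.adler32(b"hello wasm")
```

As with the real test suite, a swapped lane order in mul_lo_hi_packed would show up immediately as a mismatch against the portable form.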
Added 4 more commits extending the Wasm Vector128 fast-path enablement:
Additional test coverage on browser-wasm (V8 v15)
The XxHash lane-swap correctness is checked end-to-end: a wrong shuffle order would corrupt every produced hash.
Remaining gap (not in this PR)
Note
The new commits in this PR were drafted with AI/Copilot assistance.
…ow/UnpackHigh
PackedSimd.Shuffle wraps i8x16.shuffle, which requires its 16 lane indices to be compile-time constants. The Mono interpreter accepted a Vector128.Create() constant operand at runtime, but Mono AOT cannot fold it and throws PlatformNotSupportedException at runtime. The same limitation was already known and avoided in TeddyHelper.RightShift1/RightShift2 (see preceding commit on this branch): use two Vector128.ShuffleNative calls (lowering to PackedSimd.Swizzle, which clamps out-of-range indices to 0) and OR the partial results together. Apply the same pattern in Vector128.UnpackLow/UnpackHigh. This was caught by CI as 50 GuidTests failures plus cascaded reflection-invoke failures under the WasmTestOnChrome-MONO-ST (AOT) leg on PR dotnet#129838. On the Mono interpreter, all callers (HexConverter, Guid.FormatGuid) had already been validated end-to-end. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
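Here is a scalar Python model of the replacement pattern. Each source gets its own single-vector swizzle whose index vector zeroes the lanes owned by the other source. OR then merges the two partial results.

i8x16.swizzle takes its indices as an ordinary vector operand, so nothing has to be folded at compile time. The helper names and the 0xFF "zero lane" value are illustrative assumptions.

```python
ZERO_LANE = 0xFF  # i8x16.swizzle maps any index >= 16 to 0

def swizzle(table, indices):
    # Models i8x16.swizzle: table[i] for i < 16, else 0. Indices are a runtime vector.
    return [table[i] if i < 16 else 0 for i in indices]

def _interleave_indices(base, take_odd):
    # Lane k pulls byte base + k // 2 when its parity matches; otherwise it is zeroed.
    return [base + k // 2 if (k % 2 == 1) == take_odd else ZERO_LANE for k in range(16)]

def unpack_low(a, b):
    # a0 b0 a1 b1 ... a7 b7, built from two single-source swizzles + OR.
    part_a = swizzle(a, _interleave_indices(0, take_odd=False))
    part_b = swizzle(b, _interleave_indices(0, take_odd=True))
    return [x | y for x, y in zip(part_a, part_b)]

def unpack_high(a, b):
    # a8 b8 a9 b9 ... a15 b15.
    part_a = swizzle(a, _interleave_indices(8, take_odd=False))
    part_b = swizzle(b, _interleave_indices(8, take_odd=True))
    return [x | y for x, y in zip(part_a, part_b)]
```

The cost is two swizzles and an OR instead of one shuffle. In exchange, the pattern avoids the constant-immediate requirement that Mono AOT could not satisfy.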
CI feedback addressed
…RightShift
Both helpers previously dispatched to PackedSimd via Vector128.ShuffleNative, which itself has an Ssse3 -> AdvSimd.Arm64 -> PackedSimd if/else chain. The Mono SIMD intrinsic recognizer does not always lower that chain cleanly for less-traveled paths, surfacing as NIY interpreter assertions and runtime startup failures. Call PackedSimd.Swizzle (i8x16.swizzle) directly under the PackedSimd.IsSupported branch. The semantics are identical to ShuffleNative on Wasm (clamps indices >= 16 to 0) but the lowering goes through a single recognized intrinsic, avoiding the dispatcher chain. Validated: System.Memory.Tests on browser-wasm V8 interpreter 52249/52249 (covers TeddyHelper.RightShift1/2). The original NIY OutOfMemoryException:.ctor failure seen in System.Runtime.Tests with the prior ShuffleNative version is gone with this change. AOT behaviour will be re-validated by CI on push. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reword the LoopTerminatedEarlyDueToNonAsciiData label comment to mention that Wasm is also a little-endian-only platform reaching this point through the PackedSimd branch added earlier in this PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wasm is always little-endian by spec, so the BitConverter.IsLittleEndian check on the PackedSimd.IsSupported branch is a no-op. Keep the check on the AdvSimd.Arm64 branch, where it actually matters (NEON can be big- or little-endian on some configurations). This mirrors the existing Sse41.X64 branch in the same gate, which has no LE check (x86-64 is also always little-endian). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enable several Vector128 fast paths in CoreLib and System.IO.Hashing on browser-wasm by adding a PackedSimd.IsSupported branch alongside the existing Sse2/Ssse3/AdvSimd.Arm64 gates. Before these changes, the SIMD code paths were unreachable on Wasm even though the wasm runtime supports PackedSimd (the test pipeline default sets WasmEnableSIMD=true).
Scope and commits
| Commit(s) | Change |
|---|---|
| c69998277eb, 0e99f4a194c | PackedSimd path to Vector128.UnpackLow/UnpackHigh (composed from two PackedSimd.Swizzle calls + OR; the two-vector PackedSimd.Shuffle requires constant lane indices and is impractical from generic code). |
| 3dc956b6465 | HexConverter.AsciiToHexVector128, HexConverter.EncodeToUtf8/Utf16, HexConverter.EncodeTo_Vector128, and Guid.FormatGuidVector128Utf8. Bodies were already portable (Vector128.ShuffleNative, Vector128.UnpackLow/High, constant-index Vector128.Shuffle). |
| 061b22bffcf2543702633 | Utf8Utility.Validation ASCII fast path on Wasm: add a PackedSimd branch alongside AdvSimd.Arm64/Sse2 in the inner ASCII-scan loop of GetPointerToFirstInvalidChar, using portable Vector128.LoadUnsafe + ExtractMostSignificantBits. |
| da08d905c6e | Utf8Utility.Transcoding ASCII fast path on Wasm: the 8-char narrow-store loop and the two 4-char tail stores, using PackedSimd.ConvertNarrowingSaturateUnsigned. |
| 2caa5e25997, 0e99f4a194c | TeddyHelper.LoadAndPack16AsciiChars, GetNibbles, two-table Shuffle, RightShift1/RightShift2 (composed from two PackedSimd.Swizzle calls + OR), plus the StringSearchValues.CreateFromNormalizedValues entry gate and AsciiStringSearchValuesTeddyBase.IndexOfAnyN2/N3 [CompExactlyDependsOn]. |
| 819c9fe3960 | ProbabilisticMap vectorized SearchValues<char> on Wasm. Subtle because the bitmap layout itself branches on the gate (SetCharBit/IsCharBitSet, marked [BypassReadyToRun]). Widens the layout choice, ContainsMask16Chars (new PackedSimd narrowing branch), the IndexOfAny/LastIndexOfAny entry dispatcher, and the worker [CompExactlyDependsOn] attributes consistently. |
| 80522ca629b | Adler32.UpdateVector128 and XxHashShared.MultiplyWideningLower: replace the portable Vector128.Widen + multiply + add sequences with PackedSimd.AddPairwiseWidening + MultiplyWideningLower/Upper and PackedSimd.MultiplyWideningLower (i64x2.extmul_low_i32x4_u). Already vectorized on Wasm via the generic else branch; this turns 3–5 ops/iter into 1. |
Test results
Validated end-to-end with WasmEnableSIMD=true (the browser-wasm test pipeline default):
- System.Runtime.Tests
- System.Runtime.Extensions.Tests: Convert.ToHexString/FromHexString
- System.Memory.Tests: SearchValues<string> (Teddy), SearchValues<char> (ProbabilisticMap), SpanHelpers
- System.IO.Hashing.Tests
The original push exposed two AOT-only failure modes in browser-wasm Helix legs:
- PackedSimd.Shuffle (two-vector i8x16.shuffle) requires compile-time-constant lane indices; Mono AOT can't fold a Vector128.Create(...) operand and throws PlatformNotSupportedException. Fixed in c69998277eb → 0e99f4a194c by composing with single-vector PackedSimd.Swizzle + OR.
- TeddyHelper.RightShift1/RightShift2 via Vector128.ShuffleNative (a dispatcher with an Ssse3 → AdvSimd.Arm64 → PackedSimd if/else chain). The Mono SIMD intrinsic recognizer doesn't always lower that chain cleanly for less-traveled paths; the safer pattern is to call PackedSimd.Swizzle directly under the PackedSimd.IsSupported branch. Applied consistently in 0e99f4a194c.
Files changed
- src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
- src/libraries/Common/src/System/HexConverter.cs
- src/libraries/System.Private.CoreLib/src/System/Guid.cs
- src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Validation.cs
- src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Transcoding.cs
- src/libraries/System.Private.CoreLib/src/System/SearchValues/Strings/Helpers/TeddyHelper.cs
- src/libraries/System.Private.CoreLib/src/System/SearchValues/Strings/StringSearchValues.cs
- src/libraries/System.Private.CoreLib/src/System/SearchValues/Strings/AsciiStringSearchValuesTeddyBase.cs
- src/libraries/System.Private.CoreLib/src/System/SearchValues/ProbabilisticMap.cs
- src/libraries/System.IO.Hashing/src/System/IO/Hashing/Adler32.cs
- src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHashShared.cs
Not in this PR
- Base64DecoderHelper/Base64EncoderHelper: need pmaddubsw and pmulhuw analogs composed from MultiplyWideningLower/Upper + AddPairwiseWidening or Dot, with careful 8-short lane preservation. Worth a dedicated follow-up with benchmarks.
Note
This PR description and the commits in this branch were drafted with AI/Copilot assistance.