refactor(simd): unify cross-ISA implementations via traits + kernel templates#2156
Hidden character warning
refactor(simd): unify cross-ISA implementations via traits + kernel templates#2156LHT129 wants to merge 1 commit into
Conversation
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Require kind labelWonderful, this rule succeeded.
🟢 Require version labelWonderful, this rule succeeded.
|
There was a problem hiding this comment.
Code Review
This pull request refactors the SIMD codebase by introducing a unified, template-based kernel architecture across all instruction set architectures (ISAs), replacing duplicate ISA-specific implementations with common templates parameterized by SIMD traits. While this significantly improves code maintainability, several critical issues were identified during the review. Specifically, the AVX trait implementation of load_u8_as_float incorrectly uses _mm_set_epi8 to load from a pointer, which will corrupt data. Additionally, multiple fallback functions in avx.cpp and avx2.cpp bypass intermediate SIMD tiers (e.g., falling back directly to generic or sse instead of avx), breaking the optimization hierarchy and risking performance regressions on tail data. Furthermore, RotateOpImpl in butterfly.h assumes the step size is a multiple of the vector width, potentially leaving tail elements unprocessed, and the NEON trait implementation of load_nibbles_2x_as_float relies on inefficient scalar operations that should be vectorized.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
wxyucs
left a comment
There was a problem hiding this comment.
Thanks for the large SIMD cleanup. The traits/kernel direction is good, but I found two correctness issues in the new NEON path that should be fixed before merge. I’m leaving the fallback-tier/perf-only items out of the blocking list; these two can produce wrong results or unsafe reads.
64b4e79 to
f8fad58
Compare
f8fad58 to
6f681b4
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors the SIMD layer to eliminate cross-ISA duplication by introducing per-ISA trait specializations and shared, trait-parameterized kernel templates. ISA-specific .cpp files become thin wrappers that select the appropriate traits and preserve the existing fallback cascade (e.g., AVX512 → AVX2 → AVX → SSE → generic; NEON → generic).
Changes:
- Added ISA traits headers (
src/simd/traits/simd_traits_*.h) to provide a uniform, static interface over each ISA’s vector types and operations. - Added shared kernel templates (
src/simd/kernels/*.h) implementing core FP32, INT8, SQ4/SQ8, BF16/FP16, bit-ops, normalization, and RaBitQ routines once in terms of traits. - Updated SIMD backends (
generic.cpp,sse.cpp,avx*.cpp,neon.cpp) to call the shared kernels with the correct trait specialization and fallback.
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/simd/generic.cpp | Switches generic implementations to trait-based kernels (no-ISA baseline + fallback root). |
| src/simd/sse.cpp | Refactors SSE backend to shared kernels via SimdTraits<SSE_Tag> and related trait families. |
| src/simd/avx.cpp | Refactors AVX backend to shared kernels via SimdTraits<AVX_Tag> and AVX-specific traits. |
| src/simd/avx2.cpp | Refactors AVX2 backend to shared kernels via SimdTraits<AVX2_Tag> and AVX2-specific traits. |
| src/simd/avx512.cpp | Refactors AVX512 backend to shared kernels via SimdTraits<AVX512_Tag> and AVX512-specific traits. |
| src/simd/neon.cpp | Refactors NEON backend to shared kernels via SimdTraits<NEON_Tag> and NEON-specific traits. |
| src/simd/traits/simd_traits_generic.h | Defines generic (scalar) traits and primary templates for all trait families. |
| src/simd/traits/simd_traits_sse.h | Adds SSE traits for FP32/INT8/bit-ops/SQ4/SQ8/BF16/uniform-code ops. |
| src/simd/traits/simd_traits_avx.h | Adds AVX traits for FP32/bit-ops/SQ4/SQ8/BF16/FP16/RaBitQ ops. |
| src/simd/traits/simd_traits_avx2.h | Adds AVX2 trait overrides and AVX2-specific INT8/bit/SQ4/SQ8/BF16/FP16/uniform/RaBitQ traits. |
| src/simd/traits/simd_traits_avx512.h | Adds AVX512 traits for FP32/INT8/bit/SQ4/SQ8/BF16/FP16/uniform/RaBitQ ops. |
| src/simd/traits/simd_traits_neon.h | Adds NEON traits for FP32/INT8/bit/SQ4/SQ8/BF16/FP16 ops. |
| src/simd/kernels/kernels.h | Aggregator header for all kernel templates. |
| src/simd/kernels/compute_ip.h | Trait-parameterized FP32 inner-product kernel with optional unroll and fallback. |
| src/simd/kernels/compute_l2.h | Trait-parameterized FP32 L2-squared kernel with optional unroll and fallback. |
| src/simd/kernels/compute_batch4.h | Batch-of-4 IP/L2 kernels to reuse query loads across 4 code vectors. |
| src/simd/kernels/binary_op.h | Shared FP32 elementwise binary ops (add/sub/mul/div). |
| src/simd/kernels/reduce_add.h | Shared FP32 reduce-add kernel with fallback for sub-vector tails. |
| src/simd/kernels/pq_distance.h | Shared PQ distance-table update kernel (256 centers). |
| src/simd/kernels/int8_compute.h | Shared INT8 IP/L2 kernels parameterized on Int8Traits. |
| src/simd/kernels/half_compute.h | Shared BF16/FP16 IP/L2 kernels parameterized on half-traits. |
| src/simd/kernels/sq8_compute.h | Shared SQ8 quantized compute kernels (IP/L2/codes-IP/codes-L2). |
| src/simd/kernels/sq4_compute.h | Shared SQ4 quantized compute kernels (IP/L2/codes-IP/codes-L2). |
| src/simd/kernels/sq8_uniform_compute.h | Shared SQ8 “uniform code” integer-IP kernel via UniformCodeTraits. |
| src/simd/kernels/sq4_uniform_compute.h | Shared SQ4 “uniform code” integer-IP kernel via UniformCodeTraits. |
| src/simd/kernels/bit_op.h | Shared bitwise AND/OR/XOR/NOT kernels via BitTraits. |
| src/simd/kernels/scalar_op.h | Shared scalar-broadcast kernels (DivScalar, VecRescale) via SimdTraits. |
| src/simd/kernels/butterfly.h | Shared butterfly kernels for FHT (RotateOp, KacsWalk) via SimdTraits. |
| src/simd/kernels/normalize.h | Shared NormalizeWithCentroid / inverse normalization kernels via SimdTraits. |
| src/simd/kernels/rabitq_compute.h | Shared RaBitQ float binary/split-code kernels via RaBitQTraits. |
…emplates Introduce a type-traits abstraction layer and shared kernel templates to eliminate copy-paste SIMD implementations across ISAs (SSE/AVX/AVX2/AVX512/NEON). Changes: - Add per-ISA traits headers (traits/simd_traits_*.h) exposing a uniform static interface (load/store/fmadd/reduce_add/etc.) over each ISA's native vector types. - Add 17 kernel templates (kernels/*.h) that implement the algorithm once in terms of traits, replacing 5-way duplicated hand-written intrinsics. - Covered function families: FP32 compute/batch/binary-ops/reduce, SQ8/SQ4 quantized compute, SQ4/SQ8 uniform code IP, INT8 compute, BF16/FP16 half-precision compute, bit operations, normalize, RaBitQ binary/batch/split-code IP, butterfly/rotate, and scalar ops. - SVE and AMX are explicitly out of scope (different vector model). Net effect: ISA implementation files shrink from ~11,049 to ~4,628 lines (-58%). New SIMD function cost drops from ~7 files x 30 lines to 1 kernel template + 0-2 trait extensions. All 51 SIMD test cases pass (30,591,891 assertions). Signed-off-by: LHT129 <tianlan.lht@antgroup.com>
6f681b4 to
5c5540e
Compare
Summary
Resolves #2131
Introduce a type-traits abstraction layer and shared kernel templates to eliminate
copy-paste SIMD implementations across ISAs (SSE/AVX/AVX2/AVX512/NEON).
Key changes:
src/simd/traits/simd_traits_*.h) exposing a uniformstatic interface over each ISA's native vector types
src/simd/kernels/*.h) that implement algorithms oncein terms of traits, replacing 5-way duplicated hand-written intrinsics
compute, SQ4/SQ8 uniform code IP, INT8 compute, BF16/FP16 half-precision compute,
bit operations, normalize, RaBitQ binary/batch/split-code IP, butterfly/rotate,
and scalar ops
Net effect:
always_inline, verified via CPU-pinned benchmarksOut of scope: SVE (scalable vectors), AMX (tile/matrix model),
simd_dispatch.h(unchanged)Test plan
taskset -c 0) benchmarks show no regression on AVX2/AVX512 paths