refactor(simd): unify cross-ISA implementations via traits + kernel templates#2156

Open
LHT129 wants to merge 1 commit into
antgroup:mainfrom
LHT129:2026-06-01-重构-simd-层用模板消除跨isa重复实现

@LHT129 LHT129 commented Jun 8, 2026

Summary

Resolves #2131

Introduce a type-traits abstraction layer and shared kernel templates to eliminate
copy-paste SIMD implementations across ISAs (SSE/AVX/AVX2/AVX512/NEON).

Key changes:

  • Add per-ISA traits headers (src/simd/traits/simd_traits_*.h) exposing a uniform
    static interface over each ISA's native vector types
  • Add 17 kernel templates (src/simd/kernels/*.h) that implement algorithms once
    in terms of traits, replacing 5-way duplicated hand-written intrinsics
  • Covered function families: FP32 compute/batch/binary-ops/reduce, SQ8/SQ4 quantized
    compute, SQ4/SQ8 uniform code IP, INT8 compute, BF16/FP16 half-precision compute,
    bit operations, normalize, RaBitQ binary/batch/split-code IP, butterfly/rotate,
    and scalar ops
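
The traits-plus-kernel split described above can be sketched in portable C++. This is a minimal illustration, not the PR's code: the tag names (`ScalarTag`, `Emu4Tag`), the member names (`kWidth`, `zero`, `load`, `fmadd`, `reduce_add`), and the `InnerProduct` kernel are assumptions, and the 4-lane "ISA" is emulated with `std::array` instead of real intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tags; a real build would have one per ISA (SSE, AVX, NEON, ...).
struct ScalarTag {};
struct Emu4Tag {};

template <typename Tag>
struct SimdTraits;

// Scalar baseline: one float per "vector".
template <>
struct SimdTraits<ScalarTag> {
    using Vec = float;
    static constexpr std::size_t kWidth = 1;
    static inline Vec zero() { return 0.0f; }
    static inline Vec load(const float* p) { return *p; }
    static inline Vec fmadd(Vec a, Vec b, Vec acc) { return a * b + acc; }
    static inline float reduce_add(Vec v) { return v; }
};

// Emulated 4-lane vector standing in for a 128-bit register type.
template <>
struct SimdTraits<Emu4Tag> {
    using Vec = std::array<float, 4>;
    static constexpr std::size_t kWidth = 4;
    static inline Vec zero() { return Vec{0.0f, 0.0f, 0.0f, 0.0f}; }
    static inline Vec load(const float* p) { return Vec{p[0], p[1], p[2], p[3]}; }
    static inline Vec fmadd(const Vec& a, const Vec& b, const Vec& acc) {
        Vec r{};
        for (std::size_t k = 0; k < 4; ++k) r[k] = a[k] * b[k] + acc[k];
        return r;
    }
    static inline float reduce_add(const Vec& v) { return (v[0] + v[1]) + (v[2] + v[3]); }
};

// Kernel written once against the traits interface.
template <typename Traits>
float InnerProduct(const float* a, const float* b, std::size_t dim) {
    assert(a != nullptr && b != nullptr);
    typename Traits::Vec acc = Traits::zero();
    std::size_t i = 0;
    // Main loop: whole vectors of Traits::kWidth lanes.
    for (; i + Traits::kWidth <= dim; i += Traits::kWidth) {
        acc = Traits::fmadd(Traits::load(a + i), Traits::load(b + i), acc);
    }
    float sum = Traits::reduce_add(acc);
    // Scalar tail for the elements that do not fill a vector.
    for (; i < dim; ++i) sum += a[i] * b[i];
    return sum;
}
```

Under these assumptions, supporting another ISA means adding one more `SimdTraits` specialization; the kernel body stays unchanged.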

Net effect:

  • ISA implementation files shrink from ~11,049 to ~4,628 lines (-58%)
  • Adding a new SIMD function now takes 1 kernel template plus 0-2 trait extensions, down from ~7 files × 30 lines
  • No measured runtime overhead: all trait functions are always_inline, and CPU-pinned benchmarks show no regression

Out of scope: SVE (scalable vectors), AMX (tile/matrix model), simd_dispatch.h (unchanged)

Test plan

  • All 51 SIMD unit test cases pass (30,591,891 assertions)
  • CPU-pinned (taskset -c 0) benchmarks show no regression on AVX2/AVX512 paths
  • Release build compiles cleanly with all ISA flags

@LHT129 LHT129 self-assigned this Jun 8, 2026
@LHT129 LHT129 added the kind/improvement and version/1.0 labels Jun 8, 2026
@mergify mergify Bot added the module/simd label Jun 8, 2026
mergify Bot commented Jun 8, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Require kind label

Wonderful, this rule succeeded.
  • label~=^kind/

🟢 Require version label

Wonderful, this rule succeeded.
  • label~=^version/

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the SIMD codebase by introducing a unified, template-based kernel architecture across all instruction set architectures (ISAs), replacing duplicate ISA-specific implementations with common templates parameterized by SIMD traits. While this significantly improves code maintainability, several critical issues were identified during the review:

  • The AVX trait implementation of load_u8_as_float incorrectly uses _mm_set_epi8 to load from a pointer, which will corrupt data.
  • Multiple fallback functions in avx.cpp and avx2.cpp bypass intermediate SIMD tiers (e.g., falling back directly to generic or sse instead of avx), breaking the optimization hierarchy and risking performance regressions on tail data.
  • RotateOpImpl in butterfly.h assumes the step size is a multiple of the vector width, potentially leaving tail elements unprocessed.
  • The NEON trait implementation of load_nibbles_2x_as_float relies on inefficient scalar operations that should be vectorized.
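
The tail concern raised for the butterfly kernel can be illustrated with a small portable sketch. The `ButterflyStep` function below is hypothetical and is not the PR's RotateOpImpl; it assumes a simple add/subtract butterfly over two halves separated by `step`, with `kWidth` standing in for the vector width. The point is the second loop, which covers the elements left over when `step` is not a multiple of `kWidth`.

```cpp
#include <cstddef>

// Hypothetical butterfly step: for each i in [0, step),
//   (x[i], x[i + step]) -> (x[i] + x[i + step], x[i] - x[i + step]).
template <std::size_t kWidth>
void ButterflyStep(float* x, std::size_t step) {
    float* lo = x;
    float* hi = x + step;
    std::size_t i = 0;
    // Whole chunks of kWidth elements; a real kernel would use vector
    // loads, add/sub, and stores here.
    for (; i + kWidth <= step; i += kWidth) {
        for (std::size_t k = 0; k < kWidth; ++k) {
            const float a = lo[i + k];
            const float b = hi[i + k];
            lo[i + k] = a + b;
            hi[i + k] = a - b;
        }
    }
    // Scalar tail: without this loop, the last step % kWidth pairs would be
    // left untouched whenever step is not a multiple of kWidth.
    for (; i < step; ++i) {
        const float a = lo[i];
        const float b = hi[i];
        lo[i] = a + b;
        hi[i] = a - b;
    }
}
```

With `kWidth = 2` and `step = 3`, the first loop handles indices 0 and 1 and the tail loop handles index 2.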

Comment thread src/simd/traits/simd_traits_avx.h
Comment thread src/simd/avx.cpp Outdated
Comment thread src/simd/avx.cpp Outdated
Comment thread src/simd/avx2.cpp
Comment thread src/simd/avx2.cpp
Comment thread src/simd/avx2.cpp
Comment thread src/simd/avx2.cpp
Comment thread src/simd/avx2.cpp
Comment thread src/simd/kernels/butterfly.h Outdated
Comment thread src/simd/traits/simd_traits_neon.h
@wxyucs wxyucs left a comment

Thanks for the large SIMD cleanup. The traits/kernel direction is good, but I found two correctness issues in the new NEON path that should be fixed before merge. I’m leaving the fallback-tier/perf-only items out of the blocking list; these two can produce wrong results or unsafe reads.

Comment thread src/simd/kernels/butterfly.h Outdated
Comment thread src/simd/traits/simd_traits_neon.h Outdated
@LHT129 LHT129 force-pushed the 2026-06-01-重构-simd-层用模板消除跨isa重复实现 branch from 64b4e79 to f8fad58 Compare June 8, 2026 05:01
Copilot AI review requested due to automatic review settings June 8, 2026 07:02
@LHT129 LHT129 force-pushed the 2026-06-01-重构-simd-层用模板消除跨isa重复实现 branch from f8fad58 to 6f681b4 Compare June 8, 2026 07:02
Copilot AI left a comment

Pull request overview

This PR refactors the SIMD layer to eliminate cross-ISA duplication by introducing per-ISA trait specializations and shared, trait-parameterized kernel templates. ISA-specific .cpp files become thin wrappers that select the appropriate traits and preserve the existing fallback cascade (e.g., AVX512 → AVX2 → AVX → SSE → generic; NEON → generic).
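
One way to picture the fallback cascade is a kernel template that takes the next lower tier as a parameter and hands it whatever it cannot process in whole vectors. The sketch below is an assumption-laden illustration in portable C++: the names `SumTier`, `SumGeneric`, `SumSse`, `SumAvx`, and `SumAvx512` are hypothetical, the widths 4/8/16 merely mimic 128/256/512-bit float vectors, and the inner lane loop stands in for real intrinsics.

```cpp
#include <cstddef>

using SumFn = float (*)(const float*, std::size_t);

// Generic tier: root of the cascade, handles any length.
inline float SumGeneric(const float* x, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// Shared kernel: consume whole kWidth-sized chunks, then delegate the
// remaining tail to the next lower tier.
template <std::size_t kWidth, SumFn Fallback>
float SumTier(const float* x, std::size_t n) {
    if (n < kWidth) return Fallback(x, n);
    float lanes[kWidth] = {};
    std::size_t i = 0;
    for (; i + kWidth <= n; i += kWidth) {
        for (std::size_t k = 0; k < kWidth; ++k) lanes[k] += x[i + k];
    }
    float s = 0.0f;
    for (std::size_t k = 0; k < kWidth; ++k) s += lanes[k];
    if (i < n) s += Fallback(x + i, n - i);
    return s;
}

// Each tier names the tier directly below it, so a tail never skips a level.
inline float SumSse(const float* x, std::size_t n) { return SumTier<4, SumGeneric>(x, n); }
inline float SumAvx(const float* x, std::size_t n) { return SumTier<8, SumSse>(x, n); }
inline float SumAvx512(const float* x, std::size_t n) { return SumTier<16, SumAvx>(x, n); }
```

For a 31-element input, `SumAvx512` processes 16 elements, `SumAvx` the next 8, `SumSse` the next 4, and `SumGeneric` the final 3. Wiring a tier directly to the generic fallback would still be correct, but the whole tail would run scalar.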

Changes:

  • Added ISA traits headers (src/simd/traits/simd_traits_*.h) to provide a uniform, static interface over each ISA’s vector types and operations.
  • Added shared kernel templates (src/simd/kernels/*.h) implementing core FP32, INT8, SQ4/SQ8, BF16/FP16, bit-ops, normalization, and RaBitQ routines once in terms of traits.
  • Updated SIMD backends (generic.cpp, sse.cpp, avx*.cpp, neon.cpp) to call the shared kernels with the correct trait specialization and fallback.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/simd/generic.cpp | Switches generic implementations to trait-based kernels (no-ISA baseline + fallback root). |
| src/simd/sse.cpp | Refactors SSE backend to shared kernels via SimdTraits<SSE_Tag> and related trait families. |
| src/simd/avx.cpp | Refactors AVX backend to shared kernels via SimdTraits<AVX_Tag> and AVX-specific traits. |
| src/simd/avx2.cpp | Refactors AVX2 backend to shared kernels via SimdTraits<AVX2_Tag> and AVX2-specific traits. |
| src/simd/avx512.cpp | Refactors AVX512 backend to shared kernels via SimdTraits<AVX512_Tag> and AVX512-specific traits. |
| src/simd/neon.cpp | Refactors NEON backend to shared kernels via SimdTraits<NEON_Tag> and NEON-specific traits. |
| src/simd/traits/simd_traits_generic.h | Defines generic (scalar) traits and primary templates for all trait families. |
| src/simd/traits/simd_traits_sse.h | Adds SSE traits for FP32/INT8/bit-ops/SQ4/SQ8/BF16/uniform-code ops. |
| src/simd/traits/simd_traits_avx.h | Adds AVX traits for FP32/bit-ops/SQ4/SQ8/BF16/FP16/RaBitQ ops. |
| src/simd/traits/simd_traits_avx2.h | Adds AVX2 trait overrides and AVX2-specific INT8/bit/SQ4/SQ8/BF16/FP16/uniform/RaBitQ traits. |
| src/simd/traits/simd_traits_avx512.h | Adds AVX512 traits for FP32/INT8/bit/SQ4/SQ8/BF16/FP16/uniform/RaBitQ ops. |
| src/simd/traits/simd_traits_neon.h | Adds NEON traits for FP32/INT8/bit/SQ4/SQ8/BF16/FP16 ops. |
| src/simd/kernels/kernels.h | Aggregator header for all kernel templates. |
| src/simd/kernels/compute_ip.h | Trait-parameterized FP32 inner-product kernel with optional unroll and fallback. |
| src/simd/kernels/compute_l2.h | Trait-parameterized FP32 L2-squared kernel with optional unroll and fallback. |
| src/simd/kernels/compute_batch4.h | Batch-of-4 IP/L2 kernels to reuse query loads across 4 code vectors. |
| src/simd/kernels/binary_op.h | Shared FP32 elementwise binary ops (add/sub/mul/div). |
| src/simd/kernels/reduce_add.h | Shared FP32 reduce-add kernel with fallback for sub-vector tails. |
| src/simd/kernels/pq_distance.h | Shared PQ distance-table update kernel (256 centers). |
| src/simd/kernels/int8_compute.h | Shared INT8 IP/L2 kernels parameterized on Int8Traits. |
| src/simd/kernels/half_compute.h | Shared BF16/FP16 IP/L2 kernels parameterized on half-traits. |
| src/simd/kernels/sq8_compute.h | Shared SQ8 quantized compute kernels (IP/L2/codes-IP/codes-L2). |
| src/simd/kernels/sq4_compute.h | Shared SQ4 quantized compute kernels (IP/L2/codes-IP/codes-L2). |
| src/simd/kernels/sq8_uniform_compute.h | Shared SQ8 “uniform code” integer-IP kernel via UniformCodeTraits. |
| src/simd/kernels/sq4_uniform_compute.h | Shared SQ4 “uniform code” integer-IP kernel via UniformCodeTraits. |
| src/simd/kernels/bit_op.h | Shared bitwise AND/OR/XOR/NOT kernels via BitTraits. |
| src/simd/kernels/scalar_op.h | Shared scalar-broadcast kernels (DivScalar, VecRescale) via SimdTraits. |
| src/simd/kernels/butterfly.h | Shared butterfly kernels for FHT (RotateOp, KacsWalk) via SimdTraits. |
| src/simd/kernels/normalize.h | Shared NormalizeWithCentroid / inverse normalization kernels via SimdTraits. |
| src/simd/kernels/rabitq_compute.h | Shared RaBitQ float binary/split-code kernels via RaBitQTraits. |
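
The batch-of-4 idea listed for compute_batch4.h (reusing each query load across four code vectors) can be sketched as follows. This is a scalar illustration under assumptions: the function name `InnerProductBatch4` and its signature are hypothetical, and in a vectorized build the single `query[i]` read would be one vector load feeding four fused multiply-adds.

```cpp
#include <cstddef>

// Hypothetical batch-of-4 inner product: each query element is read once
// and multiplied into four independent accumulators, one per code vector.
inline void InnerProductBatch4(const float* query,
                               const float* c0,
                               const float* c1,
                               const float* c2,
                               const float* c3,
                               std::size_t dim,
                               float out[4]) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (std::size_t i = 0; i < dim; ++i) {
        const float q = query[i];  // loaded once, reused four times
        acc0 += q * c0[i];
        acc1 += q * c1[i];
        acc2 += q * c2[i];
        acc3 += q * c3[i];
    }
    out[0] = acc0;
    out[1] = acc1;
    out[2] = acc2;
    out[3] = acc3;
}
```

Compared with four separate single-vector calls, this form reads the query once instead of four times, and the four accumulators are independent of each other.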

Comment thread src/simd/kernels/int8_compute.h
Comment thread src/simd/kernels/int8_compute.h
Comment thread src/simd/traits/simd_traits_avx.h
Comment thread src/simd/kernels/rabitq_compute.h
Comment thread src/simd/traits/simd_traits_avx.h
@LHT129 LHT129 requested a review from wxyucs June 8, 2026 07:17
…emplates

Introduce a type-traits abstraction layer and shared kernel templates to
eliminate copy-paste SIMD implementations across ISAs (SSE/AVX/AVX2/AVX512/NEON).

Changes:
- Add per-ISA traits headers (traits/simd_traits_*.h) exposing a uniform
  static interface (load/store/fmadd/reduce_add/etc.) over each ISA's
  native vector types.
- Add 17 kernel templates (kernels/*.h) that implement the algorithm once
  in terms of traits, replacing 5-way duplicated hand-written intrinsics.
- Covered function families: FP32 compute/batch/binary-ops/reduce,
  SQ8/SQ4 quantized compute, SQ4/SQ8 uniform code IP, INT8 compute,
  BF16/FP16 half-precision compute, bit operations, normalize,
  RaBitQ binary/batch/split-code IP, butterfly/rotate, and scalar ops.
- SVE and AMX are explicitly out of scope (different vector model).

Net effect: ISA implementation files shrink from ~11,049 to ~4,628 lines
(-58%). New SIMD function cost drops from ~7 files x 30 lines to 1 kernel
template + 0-2 trait extensions.

All 51 SIMD test cases pass (30,591,891 assertions).

Signed-off-by: LHT129 <tianlan.lht@antgroup.com>
@LHT129 LHT129 force-pushed the 2026-06-01-重构-simd-层用模板消除跨isa重复实现 branch from 6f681b4 to 5c5540e Compare June 8, 2026 07:27
Labels

kind/improvement, module/simd, size/XXL, version/1.0

Development

Successfully merging this pull request may close these issues.

refactor(simd): eliminate cross-ISA duplication via traits + kernel templates

3 participants