Panhaolin2001/HybridSIMD

HybridSIMD

HybridSIMD provides one C++ SIMD API over multiple backend libraries. Write user code once against the AUTO_t placeholder tag; it either compiles and runs as-is with a default backend, or is tuned into a generated source file in which the placeholders are replaced by concrete choices.

The tuner currently searches library-level choices:

  • a SIMD backend for each AUTO_t call site
  • an optional global NumLane_PLACEHOLDER for Vectors<T, N> kernels

The lightweight tuner does not require LLVM tools, Google Benchmark, opt passes, or benchmark source directories.

Quick Start

Fetch the SIMD backend libraries:

git submodule update --init --recursive

Build and run the smoke test:

cmake -S . -B build -DHYBRIDSIMD_BUILD_TESTS=ON -DHYBRIDSIMD_BUILD_EXAMPLES=OFF
cmake --build build
ctest --test-dir build --output-on-failure

Write Code

Include the unified API:

#include "hybridsimd/all.hpp"

For native-width vectors, use Vector<T> and AUTO_t:

Vector<float> x;
Vector<float> y;
Vector<float> out;

LoadUnaligned<float, AUTO_t>(x, x_ptr);
LoadUnaligned<float, AUTO_t>(y, y_ptr);
out = Add<float, AUTO_t>(x, y);
StoreUnaligned<float, AUTO_t>(out, out_ptr);

This code compiles and runs directly. In a normal build, AUTO_t resolves to a conservative default backend.
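One way to picture how a placeholder tag can resolve to a default backend at compile time is a small type trait plus tag dispatch. The sketch below is illustrative only: DemoAuto_t, DemoScalar_t, DemoUnrolled_t, Resolve, and AddArrays are hypothetical names, not HybridSIMD types, and the library's real mechanism may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical tags for illustration only; they are not HybridSIMD's types.
struct DemoScalar_t {};    // plain loop
struct DemoUnrolled_t {};  // 4-way unrolled loop, standing in for a real backend
struct DemoAuto_t {};      // placeholder tag, analogous in spirit to AUTO_t

// Every tag resolves to itself, except the placeholder, which resolves to a
// conservative default.
template <class Tag> struct Resolve { using type = Tag; };
template <> struct Resolve<DemoAuto_t> { using type = DemoScalar_t; };
template <class Tag> using Resolve_t = typename Resolve<Tag>::type;

template <class T>
void AddImpl(const T* a, const T* b, T* out, std::size_t n, DemoScalar_t) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

template <class T>
void AddImpl(const T* a, const T* b, T* out, std::size_t n, DemoUnrolled_t) {
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    out[i] = a[i] + b[i];
    out[i + 1] = a[i + 1] + b[i + 1];
    out[i + 2] = a[i + 2] + b[i + 2];
    out[i + 3] = a[i + 3] + b[i + 3];
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // remaining tail elements
}

// Public entry point: the tag is a template argument, as in Add<float, AUTO_t>.
template <class T, class Tag>
void AddArrays(const T* a, const T* b, T* out, std::size_t n) {
  AddImpl(a, b, out, n, Resolve_t<Tag>{});
}
```

With this shape, AddArrays<float, DemoAuto_t> and AddArrays<float, DemoScalar_t> compile to the same code, and swapping the tag at a call site is a one-token change.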

For lane tuning, use Vectors<T, NumLane_PLACEHOLDER>:

Vectors<float, NumLane_PLACEHOLDER> x;
Vectors<float, NumLane_PLACEHOLDER> y;
Vectors<float, NumLane_PLACEHOLDER> out;

LoadUnaligned<float, NumLane_PLACEHOLDER, AUTO_t>(x, x_ptr);
LoadUnaligned<float, NumLane_PLACEHOLDER, AUTO_t>(y, y_ptr);
out = Add<float, NumLane_PLACEHOLDER, AUTO_t>(x, y);
StoreUnaligned<float, NumLane_PLACEHOLDER, AUTO_t>(out, out_ptr);

The source passed to hybridsimd-tune should be a complete executable program with main(), because each candidate is compiled and run directly.

Tune Code

Use the repository-local tuner:

./hybridsimd-tune path/to/kernel_N.cpp \
  --lanes 4,8,16,32 \
  -o path/to/kernel_N.optimized.cpp \
  --manifest path/to/kernel_N.manifest.json

The tuner:

  • scans the source for AUTO_t call sites
  • scans for NumLane_PLACEHOLDER if present
  • filters unsupported backend/type/op/lane combinations
  • generates candidate C++ files
  • compiles and runs each candidate with clang++ when available, otherwise c++
  • records compile/run failures and continues searching
  • writes the fastest candidate to *.optimized.cpp
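The last two steps above (record failures, keep going, pick the fastest) can be sketched as a selection over trial records. The Trial struct and FastestTrial function are hypothetical and only illustrate the selection rule; they are not taken from the tuner's source.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one tuning trial; field names are illustrative.
struct Trial {
  std::string name;  // e.g. a candidate file name
  bool compiled;     // did the compiler succeed?
  bool ran;          // did the binary exit successfully?
  double ms;         // measured run time, meaningful only if compiled && ran
};

// Return the index of the fastest successful trial, or -1 if every trial
// failed to compile or run.
int FastestTrial(const std::vector<Trial>& trials) {
  int best = -1;
  for (std::size_t i = 0; i < trials.size(); ++i) {
    const Trial& t = trials[i];
    if (!t.compiled || !t.ran) continue;  // failure is recorded, search goes on
    if (best < 0 || t.ms < trials[static_cast<std::size_t>(best)].ms) {
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

A failed candidate never aborts the search; it simply cannot win.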

The tuner can choose different backends for different operations. A manifest can contain a solution like:

{
  "NumLane_PLACEHOLDER": 4,
  "Add_0": "HIGHWAY_t",
  "LoadUnaligned_0": "HIGHWAY_t",
  "LoadUnaligned_1": "VCL_t",
  "StoreUnaligned_0": "VCL_t"
}

The manifest is a reproducibility report. It is not required to compile or run the generated C++ file.
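The keys in the example manifest look like an operation name followed by a per-operation counter. Assuming that the counter follows source order (an assumption inferred from the example, not confirmed by the tuner's code), the naming can be sketched with a hypothetical CallSiteKeys helper:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Assumption: a call site is named "<Op>_<k>", where k counts the earlier
// occurrences of the same operation in source order.
std::vector<std::string> CallSiteKeys(const std::vector<std::string>& ops) {
  std::map<std::string, int> seen;  // how many times each op has appeared so far
  std::vector<std::string> keys;
  keys.reserve(ops.size());
  for (const std::string& op : ops) {
    keys.push_back(op + "_" + std::to_string(seen[op]++));
  }
  return keys;
}
```

For the native-width snippet earlier (two loads, one add, one store), this yields LoadUnaligned_0, LoadUnaligned_1, Add_0, and StoreUnaligned_0, which are the four backend keys in the manifest above.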

Useful options:

--algorithm auto              # default; constraint seeds first, then random for large spaces
--algorithm exhaustive        # enumerate all candidates up to the safety limit
--algorithm random            # sample legal candidates
--algorithm genetic           # genetic search; tunes N automatically when NumLane_PLACEHOLDER is present
--generations 16              # genetic only; number of generations
--population-size 4           # genetic only; individuals per generation
--labels HIGHWAY_t,VCL_t,XSIMD_t,TSIMD_t,EVE_t
--lanes 4,8,16                # restrict lane candidates
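To see why exhaustive search needs a safety limit: if every AUTO_t call site can pick its backend independently and no combination is filtered out, a source with S call sites, B labels, and L lane values has at most B^S × L candidates. For example, 4 call sites, 2 labels, and 4 lane values give 2^4 × 4 = 64. The sketch below computes that bound and decodes a candidate number into per-site choices; CandidateCount, Candidate, and DecodeCandidate are hypothetical names, not part of the tuner.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Upper bound on the number of candidates: labels^sites * lanes, saturating
// at UINT64_MAX instead of overflowing. Pass lanes = 1 when the source has no
// NumLane_PLACEHOLDER.
std::uint64_t CandidateCount(unsigned sites, std::uint64_t labels,
                             std::uint64_t lanes) {
  const std::uint64_t kMax = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t total = lanes;
  for (unsigned i = 0; i < sites; ++i) {
    if (labels != 0 && total > kMax / labels) return kMax;
    total *= labels;
  }
  return total;
}

struct Candidate {
  std::vector<unsigned> label_per_site;  // one label index per call site
  unsigned lane_index;                   // index into the lane list
};

// Decode candidate number `index` as mixed-radix digits: one base-`labels`
// digit per call site (least significant first), then the lane index.
// Requires labels > 0 and lanes > 0.
Candidate DecodeCandidate(std::uint64_t index, unsigned sites, unsigned labels,
                          unsigned lanes) {
  Candidate c;
  for (unsigned i = 0; i < sites; ++i) {
    c.label_per_site.push_back(static_cast<unsigned>(index % labels));
    index /= labels;
  }
  c.lane_index = static_cast<unsigned>(index % lanes);
  return c;
}
```

The count grows exponentially in the number of call sites, which is why sampling modes such as random or genetic become attractive for larger kernels.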

--algorithm genetic is the only publicly exposed genetic mode. If the source contains NumLane_PLACEHOLDER, it tunes both N and the backend choices. If the source contains only Vector<T> and AUTO_t, it tunes the backend choices alone.
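For readers unfamiliar with genetic search, here is a minimal generic sketch of the idea: a genome holds one lane index plus one backend index per call site, and each generation keeps the better half and refills the rest by crossover and mutation. This is a textbook-style illustration under assumed names (Genome, GaConfig, GeneticSearch); it is not the tuner's actual algorithm, and the cost callback stands in for "compile, run, and time this candidate".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Genome: gene 0 is a lane index, genes 1..S are backend indices per site.
using Genome = std::vector<unsigned>;

struct GaConfig {
  std::vector<unsigned> radix;  // number of legal values for each gene (> 0)
  std::size_t population_size;  // must be at least 1
  std::size_t generations;
  unsigned seed;
};

unsigned Pick(std::mt19937& rng, unsigned n) {
  return static_cast<unsigned>(rng() % n);
}

Genome RandomGenome(const std::vector<unsigned>& radix, std::mt19937& rng) {
  Genome g(radix.size());
  for (std::size_t i = 0; i < g.size(); ++i) g[i] = Pick(rng, radix[i]);
  return g;
}

// Minimise `cost`. A real tuner would cache costs; this sketch re-evaluates.
Genome GeneticSearch(const GaConfig& cfg,
                     const std::function<double(const Genome&)>& cost) {
  std::mt19937 rng(cfg.seed);
  std::vector<Genome> pop;
  for (std::size_t i = 0; i < cfg.population_size; ++i) {
    pop.push_back(RandomGenome(cfg.radix, rng));
  }
  auto by_cost = [&](const Genome& a, const Genome& b) {
    return cost(a) < cost(b);
  };
  for (std::size_t gen = 0; gen < cfg.generations; ++gen) {
    std::sort(pop.begin(), pop.end(), by_cost);
    // Elitism: the better half survives unchanged.
    const std::size_t keep = std::max<std::size_t>(1, pop.size() / 2);
    for (std::size_t i = keep; i < pop.size(); ++i) {
      const Genome& p1 = pop[rng() % keep];
      const Genome& p2 = pop[rng() % keep];
      Genome child(p1.size());
      for (std::size_t g = 0; g < child.size(); ++g) {
        child[g] = (rng() & 1u) ? p1[g] : p2[g];  // uniform crossover
      }
      const std::size_t m = rng() % child.size();  // mutate one random gene
      child[m] = Pick(rng, cfg.radix[m]);
      pop[i] = child;
    }
  }
  return *std::min_element(pop.begin(), pop.end(), by_cost);
}
```

Because the better half always survives, the best cost never gets worse from one generation to the next; whether it reaches the global optimum depends on the budget.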

Run Generated Code

Compile the generated file like a normal C++ program:

clang++ -std=c++20 -O3 \
  -I. -Ihybridsimd/include -I3rdparty \
  -I3rdparty/eve/include -I3rdparty/highway \
  -I3rdparty/libcxx_simd/include -I3rdparty/mipp \
  -I3rdparty/tsimd -I3rdparty/vcl -I3rdparty/xsimd/include \
  path/to/kernel_N.optimized.cpp -o kernel_N

./kernel_N

Complete Examples

Complete tunable programs are provided at:

  • example/tune_axpy_N.cpp
  • example/tune_blackscholes_N.cpp
  • example/tune_mandelbrot_N.cpp

Tune AXPY:

./hybridsimd-tune example/tune_axpy_N.cpp \
  --lanes 4 \
  --labels HIGHWAY_t,VCL_t \
  --algorithm exhaustive \
  -o /tmp/tune_axpy_N.demo.cpp \
  --manifest /tmp/tune_axpy_N.demo.json

Compile and run the generated result:

clang++ -std=c++20 -O3 \
  -I. -Ihybridsimd/include -I3rdparty \
  -I3rdparty/eve/include -I3rdparty/highway \
  -I3rdparty/libcxx_simd/include -I3rdparty/mipp \
  -I3rdparty/tsimd -I3rdparty/vcl -I3rdparty/xsimd/include \
  /tmp/tune_axpy_N.demo.cpp -o /tmp/tune_axpy_N.demo

/tmp/tune_axpy_N.demo

The example prints 0 when its assertions pass.

Tune Black-Scholes with genetic search over backend choices and N:

./hybridsimd-tune example/tune_blackscholes_N.cpp \
  --algorithm genetic \
  --labels HIGHWAY_t,VCL_t,XSIMD_t,TSIMD_t,EVE_t \
  --lanes 16,32,64,128 \
  --generations 8 \
  --population-size 64 \
  --jobs 16 \
  --max-trials 512 \
  --runs 1 \
  --cxxflags=-march=native \
  -o example/reports/tune_blackscholes_N.optimized.cpp \
  --manifest example/reports/tune_blackscholes_N.manifest.json

Example result on one AVX-capable test machine:

scalar_ms=37.913
tuned_ms=3.577
speedup=10.598x
checksum=99.961

The selected variant used NumLane_PLACEHOLDER=64 and mixed backends across individual operations. The candidate had to compile, run, and pass the example's verify() before it could be selected.
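The speedup line appears to be scalar_ms divided by tuned_ms. A minimal timing sketch for producing such numbers follows; BestOfMs and Speedup are hypothetical helpers written for this README, not functions from the examples, and taking the best of several runs is one common choice rather than a statement about what --runs does.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Run `fn` `runs` times and return the fastest wall-clock time in
// milliseconds. Returns +infinity if runs <= 0.
template <class Fn>
double BestOfMs(Fn&& fn, int runs) {
  double best = std::numeric_limits<double>::infinity();
  for (int i = 0; i < runs; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    best = std::min(best, ms);
  }
  return best;
}

// speedup = scalar time / tuned time; values above 1 mean the tuned kernel
// is faster than the scalar baseline.
double Speedup(double scalar_ms, double tuned_ms) {
  return scalar_ms / tuned_ms;
}
```

steady_clock is used because it is monotonic, so a system clock adjustment during a run cannot produce a negative duration.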

Tune Mandelbrot with the same genetic search settings, apart from a wider lane list:

./hybridsimd-tune example/tune_mandelbrot_N.cpp \
  --algorithm genetic \
  --labels HIGHWAY_t,VCL_t,XSIMD_t,TSIMD_t,EVE_t \
  --lanes 4,8,16,32,64,128 \
  --generations 8 \
  --population-size 64 \
  --jobs 16 \
  --max-trials 512 \
  --runs 1 \
  --cxxflags=-march=native \
  -o example/reports/killer_mandelbrot.g8p64.cpp \
  --manifest example/reports/killer_mandelbrot.g8p64.manifest.json

Example result on the same AVX-capable test machine:

verify=passed max_abs_error=12 avg_abs_error=0.000541687 mismatch_count=35 mismatch_rate=0.000267029
scalar_ms=72.924
tuned_ms=8.814
speedup=8.274x
checksum=151566138

The selected Mandelbrot variant used NumLane_PLACEHOLDER=32 and mixed TSIMD_t, XSIMD_t, HIGHWAY_t, and EVE_t across individual operations.
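The verify line above reports four error statistics. Its numbers are consistent with 35 mismatches among 131072 outputs (35 / 131072 ≈ 0.000267029) and a total absolute error of 71 (71 / 131072 ≈ 0.000541687). The sketch below shows how such statistics can be computed against a scalar reference. It assumes the outputs are integer iteration counts, and ErrorStats and CompareIterations are hypothetical names; the example's own verify() and its pass threshold may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct ErrorStats {
  long max_abs_error;          // largest |reference - tuned|
  double avg_abs_error;        // mean |reference - tuned|
  std::size_t mismatch_count;  // number of elements that differ
  double mismatch_rate;        // mismatch_count / element count
};

// Compare tuned results against a scalar reference, element by element.
// Assumes both vectors have the same, non-zero length.
ErrorStats CompareIterations(const std::vector<int>& reference,
                             const std::vector<int>& tuned) {
  ErrorStats s = {0, 0.0, 0, 0.0};
  long long total = 0;
  for (std::size_t i = 0; i < reference.size(); ++i) {
    const long diff = std::labs(static_cast<long>(reference[i]) -
                                static_cast<long>(tuned[i]));
    if (diff != 0) ++s.mismatch_count;
    if (diff > s.max_abs_error) s.max_abs_error = diff;
    total += diff;
  }
  const double n = static_cast<double>(reference.size());
  s.avg_abs_error = static_cast<double>(total) / n;
  s.mismatch_rate = static_cast<double>(s.mismatch_count) / n;
  return s;
}
```

Small mismatches are plausible for Mandelbrot because different backends may round floating-point operations differently, and points near the set boundary are sensitive to that.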

Example Benchmark Report

Run the example auto-detection report:

./example/run.sh

This does not require Google Benchmark. It:

  • compiles the original complete examples with a small compatible benchmark shim
  • runs the existing benchmark matrices for Vector<T>, Vectors<T, N>, pure Highway, xsimd, EVE, VCL, tsimd, stdsimd, and MIPP paths where the source provides them
  • compiles and runs a smoke probe for each backend
  • compiles and runs Vectors<T, N> probes for each detected backend/lane pair
  • tunes example/tune_axpy_N.cpp using only runnable backend/lane candidates
  • writes a full JSON report, readable Markdown summary, and per-case timing CSV under example/reports/

Useful filters:

./example/run.sh --backends HIGHWAY_t,VCL_t --lanes 4,8,16 --max-trials 8

The compatible runner exists for repeatable smoke and performance comparisons within this repository, and it is the recommended entry point for generating example reports.

Generate Without Tuning

To replace AUTO_t with a preferred backend without measuring candidates, use the lower-level codegen module:

python3 -m hybridsimd.algorithm.codegen path/to/kernel.cpp \
  --mode generate \
  --backend HIGHWAY_t \
  --lane 16 \
  -o path/to/kernel.optimized.cpp \
  --manifest path/to/kernel.manifest.json
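Conceptually, generating without tuning amounts to substituting concrete values for the placeholders. One plausible way to picture that is a whole-identifier text replacement, sketched below with a hypothetical ReplaceIdentifier helper. This is not a description of how hybridsimd.algorithm.codegen works internally, and the sketch does not skip comments or string literals.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

bool IsIdentChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Replace every whole-identifier occurrence of `from` with `to`. Matches that
// are part of a longer identifier (e.g. MY_AUTO_t) are left alone.
std::string ReplaceIdentifier(const std::string& src, const std::string& from,
                              const std::string& to) {
  if (from.empty()) return src;
  std::string out;
  std::size_t pos = 0;
  while (pos < src.size()) {
    const std::size_t hit = src.find(from, pos);
    if (hit == std::string::npos) break;
    const std::size_t end = hit + from.size();
    const bool left_ok = hit == 0 || !IsIdentChar(src[hit - 1]);
    const bool right_ok = end == src.size() || !IsIdentChar(src[end]);
    if (left_ok && right_ok) {
      out.append(src, pos, hit - pos);  // text before the match
      out += to;
      pos = end;
    } else {
      out.append(src, pos, hit + 1 - pos);  // not a whole word, keep scanning
      pos = hit + 1;
    }
  }
  out.append(src, pos, std::string::npos);  // remaining text
  return out;
}
```

Applying it twice, once for AUTO_t with HIGHWAY_t and once for NumLane_PLACEHOLDER with 16, turns Add<float, NumLane_PLACEHOLDER, AUTO_t>(x, y) into Add<float, 16, HIGHWAY_t>(x, y).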

Supported SIMD libraries

The public autotuning labels are:

  • HIGHWAY_t
  • VCL_t
  • XSIMD_t
  • TSIMD_t
  • EVE_t

The complete example matrix also includes native source paths for libraries such as MIPP and stdsimd where those paths exist in the examples.

About

Blending the strengths of multiple SIMD libraries for synergistic performance optimization.
