HybridSIMD provides one C++ SIMD API over multiple backend libraries. User code
can be written once with AUTO_t; it can either run directly with a default
backend or be tuned into a concrete generated source file.
The tuner currently searches library-level choices:
- a SIMD backend for each AUTO_t call site
- an optional global NumLane_PLACEHOLDER for Vectors<T, N> kernels
The lightweight tuner does not require LLVM tools, Google Benchmark, opt passes, or benchmark source directories.
Fetch the SIMD backend libraries:
git submodule update --init --recursive
Build and run the smoke test:
cmake -S . -B build -DHYBRIDSIMD_BUILD_TESTS=ON -DHYBRIDSIMD_BUILD_EXAMPLES=OFF
cmake --build build
ctest --test-dir build --output-on-failure
Include the unified API:
#include "hybridsimd/all.hpp"
For native-width vectors, use Vector<T> and AUTO_t:
Vector<float> x;
Vector<float> y;
Vector<float> out;
LoadUnaligned<float, AUTO_t>(x, x_ptr);
LoadUnaligned<float, AUTO_t>(y, y_ptr);
out = Add<float, AUTO_t>(x, y);
StoreUnaligned<float, AUTO_t>(out, out_ptr);
This code compiles and runs directly. In a normal build, AUTO_t resolves to a
conservative default backend.
For lane tuning, use Vectors<T, NumLane_PLACEHOLDER>:
Vectors<float, NumLane_PLACEHOLDER> x;
Vectors<float, NumLane_PLACEHOLDER> y;
Vectors<float, NumLane_PLACEHOLDER> out;
LoadUnaligned<float, NumLane_PLACEHOLDER, AUTO_t>(x, x_ptr);
LoadUnaligned<float, NumLane_PLACEHOLDER, AUTO_t>(y, y_ptr);
out = Add<float, NumLane_PLACEHOLDER, AUTO_t>(x, y);
StoreUnaligned<float, NumLane_PLACEHOLDER, AUTO_t>(out, out_ptr);
The source passed to hybridsimd-tune should be a complete executable program
with main(), because each candidate is compiled and run directly.
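To make the role of the lane count concrete, here is a plain C++ sketch with no HybridSIMD calls. It shows how a kernel typically walks an array in blocks of N elements and finishes with a scalar tail. AddBlocked is a hypothetical name, and the sketch assumes N counts the elements held by one Vectors<T, N>. In a tunable source the inner block would be the LoadUnaligned/Add/StoreUnaligned group shown above, with N written as NumLane_PLACEHOLDER.

```cpp
#include <cstddef>

// Hypothetical stand-in for a lane-tuned kernel: handles N elements per step,
// then finishes the remainder one element at a time.
template <std::size_t N>
void AddBlocked(const float* x, const float* y, float* out, std::size_t n) {
  std::size_t i = 0;
  for (; i + N <= n; i += N) {
    // In HybridSIMD code this inner loop would be one Load/Add/Store group.
    for (std::size_t lane = 0; lane < N; ++lane) {
      out[i + lane] = x[i + lane] + y[i + lane];
    }
  }
  for (; i < n; ++i) {  // scalar tail when n is not a multiple of N
    out[i] = x[i] + y[i];
  }
}
```

A larger N means fewer block iterations but a longer tail, which is one reason the best lane count depends on the kernel and has to be measured.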
Use the repository-local tuner:
./hybridsimd-tune path/to/kernel_N.cpp \
--lanes 4,8,16,32 \
-o path/to/kernel_N.optimized.cpp \
--manifest path/to/kernel_N.manifest.json
The tuner:
- scans the source for AUTO_t call sites
- scans for NumLane_PLACEHOLDER if present
- filters unsupported backend/type/op/lane combinations
- generates candidate C++ files
- compiles and runs each candidate with clang++ when available, otherwise c++
- records compile/run failures and continues searching
- writes the fastest candidate to *.optimized.cpp
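The last two steps amount to picking the fastest trial among those that compiled and ran. The sketch below illustrates that selection rule only. The Trial struct and PickFastest are hypothetical names, not part of the tuner, whose internals may differ.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record of one candidate trial.
struct Trial {
  std::string name;  // e.g. a candidate file name
  bool ok;           // false if the candidate failed to compile or run
  double ms;         // measured run time, meaningful only when ok is true
};

// Returns the index of the fastest successful trial, or nullopt if all failed.
std::optional<std::size_t> PickFastest(const std::vector<Trial>& trials) {
  std::optional<std::size_t> best;
  for (std::size_t i = 0; i < trials.size(); ++i) {
    if (!trials[i].ok) continue;  // failure stays recorded; the search goes on
    if (!best || trials[i].ms < trials[*best].ms) best = i;
  }
  return best;
}
```

A failed candidate never wins, even if it reports a small time, because its timing is ignored.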
The tuner can choose different backends for different operations. A manifest can contain a solution like:
{
"NumLane_PLACEHOLDER": 4,
"Add_0": "HIGHWAY_t",
"LoadUnaligned_0": "HIGHWAY_t",
"LoadUnaligned_1": "VCL_t",
"StoreUnaligned_0": "VCL_t"
}
The manifest is a reproducibility report. It is not required to compile or run the generated C++ file.
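The keys in this solution suggest that call sites are named by operation plus a per-operation occurrence index, for example LoadUnaligned_1 for the second LoadUnaligned. The sketch below applies such a solution to source text with a regular expression. It rests on that inferred naming and on the assumption that substitution is textual. ApplySolution is a hypothetical helper, and the real tuner may scan and rewrite sources differently.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: rewrites each `Op<..., AUTO_t>` call site with the label
// stored under "Op_k", where k counts earlier occurrences of the same Op.
// Call sites without an entry are left unchanged.
std::string ApplySolution(const std::string& source,
                          const std::map<std::string, std::string>& solution) {
  static const std::regex kCallSite(R"((\w+)(<[^<>;]*)\bAUTO_t(\s*>))");
  std::map<std::string, int> next_index;
  std::string out;
  std::string::const_iterator last = source.begin();
  for (std::sregex_iterator it(source.begin(), source.end(), kCallSite), end;
       it != end; ++it) {
    const std::smatch& m = *it;
    const std::string op = m[1].str();
    const std::string key = op + "_" + std::to_string(next_index[op]++);
    out.append(last, m[0].first);  // copy text before this call site
    const auto found = solution.find(key);
    if (found != solution.end()) {
      out += op + m[2].str() + found->second + m[3].str();
    } else {
      out += m[0].str();
    }
    last = m[0].second;
  }
  out.append(last, source.end());

  // Optional global lane choice, substituted textually (an assumption).
  const auto lanes = solution.find("NumLane_PLACEHOLDER");
  if (lanes != solution.end()) {
    static const std::regex kLanes(R"(\bNumLane_PLACEHOLDER\b)");
    out = std::regex_replace(out, kLanes, lanes->second);
  }
  return out;
}
```

Under these assumptions, applying the solution above to the lane-tuned snippet would turn the second LoadUnaligned call into LoadUnaligned<float, 4, VCL_t>(y, y_ptr);.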
Useful options:
--algorithm auto # default; constraint seeds first, then random for large spaces
--algorithm exhaustive # enumerate all candidates up to the safety limit
--algorithm random # sample legal candidates
--algorithm genetic # genetic search; tunes N automatically when NumLane_PLACEHOLDER is present
--generations 16 # genetic only; number of generations
--population-size 4 # genetic only; individuals per generation
--labels HIGHWAY_t,VCL_t,XSIMD_t,TSIMD_t,EVE_t
--lanes 4,8,16 # restrict lane candidates
--algorithm genetic is the only public genetic mode. If the source contains
NumLane_PLACEHOLDER, it tunes both N and backend choices. If the source only
contains Vector<T> and AUTO_t, it tunes backend choices only.
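The choice of algorithm matters because the search space grows quickly: each AUTO_t call site can take any label independently, and the lane count multiplies that again. The sketch below computes an upper bound before unsupported combinations are filtered out. CandidateUpperBound is a hypothetical helper, and the tuner's actual safety limit is not modelled.

```cpp
#include <cstdint>

// Upper bound on candidates before filtering: one backend label per AUTO_t
// call site, times one global lane choice. Passing lane_choices == 0 means
// the source has no NumLane_PLACEHOLDER. Overflow is not checked.
constexpr std::uint64_t CandidateUpperBound(unsigned call_sites,
                                            unsigned labels,
                                            unsigned lane_choices) {
  std::uint64_t total = lane_choices == 0 ? 1 : lane_choices;
  for (unsigned i = 0; i < call_sites; ++i) total *= labels;
  return total;
}
```

For a source with four call sites, like the one behind the manifest example, five labels and three lane values give at most 3 × 5^4 = 1875 candidates. Each additional call site multiplies the bound by five.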
Compile the generated file like a normal C++ program:
clang++ -std=c++20 -O3 \
-I. -Ihybridsimd/include -I3rdparty \
-I3rdparty/eve/include -I3rdparty/highway \
-I3rdparty/libcxx_simd/include -I3rdparty/mipp \
-I3rdparty/tsimd -I3rdparty/vcl -I3rdparty/xsimd/include \
path/to/kernel_N.optimized.cpp -o kernel_N
./kernel_N
Complete tunable programs are provided at:
- example/tune_axpy_N.cpp
- example/tune_blackscholes_N.cpp
- example/tune_mandelbrot_N.cpp
Tune AXPY:
./hybridsimd-tune example/tune_axpy_N.cpp \
--lanes 4 \
--labels HIGHWAY_t,VCL_t \
--algorithm exhaustive \
-o /tmp/tune_axpy_N.demo.cpp \
--manifest /tmp/tune_axpy_N.demo.json
Compile and run the generated result:
clang++ -std=c++20 -O3 \
-I. -Ihybridsimd/include -I3rdparty \
-I3rdparty/eve/include -I3rdparty/highway \
-I3rdparty/libcxx_simd/include -I3rdparty/mipp \
-I3rdparty/tsimd -I3rdparty/vcl -I3rdparty/xsimd/include \
/tmp/tune_axpy_N.demo.cpp -o /tmp/tune_axpy_N.demo
/tmp/tune_axpy_N.demo
The example prints 0 when its assertions pass.
Tune Black-Scholes with genetic search over backend choices and N:
./hybridsimd-tune example/tune_blackscholes_N.cpp \
--algorithm genetic \
--labels HIGHWAY_t,VCL_t,XSIMD_t,TSIMD_t,EVE_t \
--lanes 16,32,64,128 \
--generations 8 \
--population-size 64 \
--jobs 16 \
--max-trials 512 \
--runs 1 \
--cxxflags=-march=native \
-o example/reports/tune_blackscholes_N.optimized.cpp \
--manifest example/reports/tune_blackscholes_N.manifest.json
Example result on one AVX-capable test machine:
scalar_ms=37.913
tuned_ms=3.577
speedup=10.598x
checksum=99.961
The selected variant used NumLane_PLACEHOLDER=64 and mixed backends across
individual operations. The candidate had to compile, run, and pass the example's
verify() before it could be selected.
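The scalar_ms, tuned_ms, and speedup lines can be produced by a small timing harness around the two code paths. The sketch below assumes speedup is scalar_ms divided by tuned_ms, which matches the printed numbers up to rounding of the displayed times. TimeMs and Speedup are hypothetical names, not taken from the example sources.

```cpp
#include <chrono>

// Hypothetical helper: wall-clock time of one call to fn, in milliseconds.
template <typename Fn>
double TimeMs(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Ratio of baseline time to tuned time; values above 1 mean the tuned path won.
inline double Speedup(double scalar_ms, double tuned_ms) {
  return scalar_ms / tuned_ms;
}
```

With --runs 1 each candidate is timed once, so small differences between candidates can be measurement noise. Raising the run count is the usual remedy when results look unstable.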
Tune Mandelbrot with the same genetic search shape:
./hybridsimd-tune example/tune_mandelbrot_N.cpp \
--algorithm genetic \
--labels HIGHWAY_t,VCL_t,XSIMD_t,TSIMD_t,EVE_t \
--lanes 4,8,16,32,64,128 \
--generations 8 \
--population-size 64 \
--jobs 16 \
--max-trials 512 \
--runs 1 \
--cxxflags=-march=native \
-o example/reports/killer_mandelbrot.g8p64.cpp \
--manifest example/reports/killer_mandelbrot.g8p64.manifest.json
Example result on the same AVX-capable test machine:
verify=passed max_abs_error=12 avg_abs_error=0.000541687 mismatch_count=35 mismatch_rate=0.000267029
scalar_ms=72.924
tuned_ms=8.814
speedup=8.274x
checksum=151566138
The selected Mandelbrot variant used NumLane_PLACEHOLDER=32 and mixed
TSIMD_t, XSIMD_t, HIGHWAY_t, and EVE_t across individual operations.
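The verify line shows that verification passed even though 35 values differed, so the check evidently tolerates a small mismatch rate. The reported rate and average are consistent with 131072 compared values (35 / 131072 ≈ 0.000267029). The sketch below shows one way to compute such statistics. It assumes the compared values are integer iteration counts. MismatchStats and CompareCounts are hypothetical names, and the pass threshold used by the example is not reproduced here.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

struct MismatchStats {
  int max_abs_error = 0;
  double avg_abs_error = 0.0;
  std::size_t mismatch_count = 0;
  double mismatch_rate = 0.0;
};

// Compares per-element integer results from a reference run and a tuned run.
// Assumes both vectors have the same length; empty input yields all zeros.
MismatchStats CompareCounts(const std::vector<int>& expected,
                            const std::vector<int>& actual) {
  MismatchStats s;
  if (expected.empty()) return s;
  long long total_abs_error = 0;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    const int err = std::abs(expected[i] - actual[i]);
    if (err != 0) ++s.mismatch_count;
    if (err > s.max_abs_error) s.max_abs_error = err;
    total_abs_error += err;
  }
  const double n = static_cast<double>(expected.size());
  s.avg_abs_error = static_cast<double>(total_abs_error) / n;
  s.mismatch_rate = static_cast<double>(s.mismatch_count) / n;
  return s;
}
```

Small differences of this kind are common for Mandelbrot because backends may round floating-point operations differently, which can shift the escape iteration for points near the set boundary.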
Run the example auto-detection report:
./example/run.sh
This does not require Google Benchmark. It:
- compiles the original complete examples with a small compatible benchmark shim
- runs the existing benchmark matrices for Vector<T>, Vectors<T, N>, pure Highway, xsimd, EVE, VCL, tsimd, stdsimd, and MIPP paths where the source provides them
- compiles and runs a smoke probe for each backend
- compiles and runs Vectors<T, N> probes for each detected backend/lane pair
- tunes example/tune_axpy_N.cpp using only runnable backend/lane candidates
- writes a full JSON report, readable Markdown summary, and per-case timing CSV under example/reports/
Useful filters:
./example/run.sh --backends HIGHWAY_t,VCL_t --lanes 4,8,16 --max-trials 8
The compatible runner is for repeatable smoke/performance comparison in this repository and is the recommended example-report entry point.
To replace AUTO_t with a preferred backend without measuring candidates, use
the lower-level codegen module:
python3 -m hybridsimd.algorithm.codegen path/to/kernel.cpp \
--mode generate \
--backend HIGHWAY_t \
--lane 16 \
-o path/to/kernel.optimized.cpp \
--manifest path/to/kernel.manifest.json
The public autotuning labels are:
- HIGHWAY_t
- VCL_t
- XSIMD_t
- TSIMD_t
- EVE_t
The complete example matrix also includes native source paths for libraries such as MIPP and stdsimd where those paths exist in the examples.