Four Eisenstein norm computations in five NEON instructions.
ARM NEON benchmarks for the eisenstein crate's integer arithmetic. Vectors of four int32x4_t compute a² − ab + b² in parallel using fused multiply-add and narrowing shifts. The measured throughput is 3.3× over scalar on a Cortex-A72. The theoretical maximum is 4× (limited by register pressure on the accumulator).
// Four norm computations in five instructions:
smull v0.4s, v0.4s, v0.4s // a²
smull v1.4s, v1.4s, v1.4s // b²
mla v0.4s, v1.4s, v2.4s // a² + b²
mls v0.4s, v3.4s, v4.4s // a² - ab + b²Five instructions compute four norms. No branch. No load-hit-store. No SVE or SME required — just the NEON unit every ARMv8 chip has.
- 3.3× throughput on Cortex-A72 (measured, 5-run median)
- 4× theoretical maximum (limited by register pressure)
- Zero drift — same integer arithmetic as the Rust crate
- Zero unsafe — NEON intrinsics, no inline assembly in the Rust wrapper
cargo build --release
./target/release/arm-neon-eisenstein-benchRequires aarch64 or ARMv8 with NEON.
MIT OR Apache-2.0