perf: ARM64 asm for sumOfProducts — fused 2-product Montgomery reduction by MatteoMer · Pull Request #80 · MatteoMer/zolt

MatteoMer · 2026-04-14T12:41:46Z

Summary

Add arm64SumOfProducts256, an ARM64 inline-assembly routine that computes (a0*b0 + a1*b1) * R^{-1} mod p with 4 shared Montgomery reductions instead of the 9 needed for separate mul+mul+add
Wire it into both MontgomeryField implementations (field/mod.zig and curves/montgomery_field.zig), matching the existing arm64 dispatch pattern
sumOfProducts benchmarks at 28.6 ns vs ~38.1 ns for 2×mul+add (~25% faster per call), directly accelerating Fp2.mul, which accounts for 29% of prover CPU time
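
For reviewers who want to check the arithmetic, here is a minimal Python reference model of the fused reduction. It is a sketch, not the code in this PR: the function names, the limb loop, and the choice of the BN254 scalar field modulus as a concrete 254-bit example are all assumptions made for illustration. It folds one limb of a0 and one limb of a1 into a shared accumulator per iteration, so only four reduction steps are needed, and it relies on 2p < R so that a single conditional subtraction finishes the job.

```python
# Reference model only (requires Python 3.8+ for pow(x, -1, m)).
LIMB_BITS = 64
N_LIMBS = 4
MASK = (1 << LIMB_BITS) - 1
R = 1 << (LIMB_BITS * N_LIMBS)  # Montgomery radix 2^256

# Assumption: BN254 scalar field modulus, used here only as an example prime.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
INV = (-pow(P, -1, 1 << LIMB_BITS)) & MASK  # -p^{-1} mod 2^64


def limb(x, i):
    """Return the i-th 64-bit limb of x (little-endian)."""
    return (x >> (LIMB_BITS * i)) & MASK


def mont_mul(a, b):
    """Plain Montgomery product a * b * R^{-1} mod P, one reduction per limb."""
    t = 0
    for i in range(N_LIMBS):
        t += limb(a, i) * b
        m = ((t & MASK) * INV) & MASK   # makes the low limb of t + m*P zero
        t = (t + m * P) >> LIMB_BITS
    return t - P if t >= P else t


def sum_of_products(a0, b0, a1, b1):
    """Fused (a0*b0 + a1*b1) * R^{-1} mod P with four shared reductions."""
    assert 2 * P < R, "single final subtraction needs 2p < R"
    t = 0
    for i in range(N_LIMBS):
        # Accumulate both partial products before reducing once.
        t += limb(a0, i) * b0 + limb(a1, i) * b1
        m = ((t & MASK) * INV) & MASK
        t = (t + m * P) >> LIMB_BITS
    # Here t < 2p, so one conditional subtraction is enough.
    return t - P if t >= P else t


# Small usage check against ordinary modular arithmetic.
a0, b0, a1, b1 = 3, 5, 7, 11
expected = (a0 * b0 + a1 * b1) * pow(R, -1, P) % P
assert sum_of_products(a0, b0, a1, b1) == expected
```

The unfused path in this model, (mont_mul(a0, b0) + mont_mul(a1, b1)) % P, runs the reduction step eight times plus extra conditional subtractions, which is the work the fused routine avoids.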

Design

5-limb product pattern: computes each a[i]*b[0..3] as a 5-limb number using only 7 scratch registers, freeing x19-x22 for preloaded b1 values
28 of 29 usable ARM64 GPRs: accumulator (x0-x3,x14), b0 (x4-x7), mod (x8-x11), inv (x12), a0 ptr (x13), b1 (x19-x22), a1 ptr (x25), carry1 (x28), scratch (x15-x17,x23-x24,x26-x27)
Fused final reduction: subs/sbcs + csel inside asm block — no post-asm lessThanModulus check needed
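
The two building blocks named above can be modelled limb by limb. The following Python sketch is illustrative only: the helper names are hypothetical, and the instruction mapping in the comments (mul/umulh for the partial products, subs/sbcs plus csel for the final step) is an assumption about how such a routine is typically written, not a transcription of the asm in this PR.

```python
MASK = (1 << 64) - 1


def to_limbs(x, n):
    """Split x into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into an integer."""
    return sum(l << (64 * i) for i, l in enumerate(limbs))


def mul_lo(x, y):
    return (x * y) & MASK   # models the low half of a 64x64 multiply (mul)


def mul_hi(x, y):
    return (x * y) >> 64    # models the high half (umulh)


def limb_times_4(a_i, b):
    """Return a_i * b as a 5-limb number, where b is 4 little-endian limbs."""
    out = [0] * 5
    carry = 0
    for j in range(4):
        s = mul_lo(a_i, b[j]) + carry
        out[j] = s & MASK
        # High half plus carry-out never exceeds 2^64 - 1, so it fits one limb.
        carry = mul_hi(a_i, b[j]) + (s >> 64)
    out[4] = carry
    return out


def cond_sub(t, p):
    """Return t - p if t >= p else t, as 4 limbs.

    t is 5 limbs with value < 2p, p is 4 limbs. The subtraction is always
    computed and the result is then selected, mirroring a borrow chain
    followed by a conditional select.
    """
    diff = []
    borrow = 0
    for j in range(4):
        d = t[j] - p[j] - borrow
        diff.append(d & MASK)
        borrow = 1 if d < 0 else 0
    # t >= p exactly when the fifth limb can absorb the final borrow.
    take_diff = t[4] >= borrow
    return diff if take_diff else t[:4]
```

Computing the difference unconditionally and then selecting keeps the final step branch-free, which is the property the subs/sbcs + csel sequence provides.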

Test plan

All 539 existing tests pass (zig build test)
End-to-end proofs verify (fibonacci, sha256_128, sha256 all VERIFIED via Jolt verifier)
Field microbenchmark confirms 28.6 ns/op (1.08x faster than Arkworks sum_of_products)

🤖 Generated with Claude Code

Add arm64SumOfProducts256 in ARM64 inline assembly, computing (a0*b0 + a1*b1) * R^{-1} mod p with 4 shared Montgomery reductions instead of 9 for separate mul+mul+add. Uses 28 GPRs with a 5-limb product pattern to fit b0, b1, and mod all preloaded in registers. Final conditional modulus subtraction is fused inside the asm block.

Wired into both MontgomeryField implementations (field/mod.zig and curves/montgomery_field.zig) matching the existing arm64 dispatch pattern.

sumOfProducts benchmarks at 28.6ns vs 38.1ns for 2×mul+add (~25% faster per call). Fp2.mul calls sumOfProducts, so this speeds up the 29%-of-prover-time hot path automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatteoMer merged commit a494fff into main Apr 14, 2026
17 checks passed

MatteoMer merged 1 commit into main from perf/arm64-sum-of-products
