perf: ARM64 asm for sumOfProducts — fused 2-product Montgomery reduction#80

Merged
MatteoMer merged 1 commit into main from perf/arm64-sum-of-products on Apr 14, 2026

@MatteoMer

Owner

Summary

  • Add arm64SumOfProducts256 in ARM64 inline assembly computing (a0*b0 + a1*b1) * R^{-1} mod p with 4 shared Montgomery reductions instead of 9 for separate mul+mul+add
  • Wire into both MontgomeryField implementations (field/mod.zig and curves/montgomery_field.zig) matching the existing arm64 dispatch pattern
  • sumOfProducts benchmarks at 28.6 ns vs ~38.1 ns for 2×mul+add (~25% faster per call). This directly accelerates Fp2.mul, which accounts for 29% of prover CPU time
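
For readers who want to check the arithmetic, here is a minimal Python reference model of the fused computation. It is a sketch, not the Zig or asm code from this PR: the names `mont_mul`, `sum_of_products`, `to_mont`, and `from_mont` are hypothetical, the BN254 scalar field modulus is used only as a concrete example, and the model assumes the modulus has at least one spare bit (p < 2^255) so that a single conditional subtraction suffices at the end.

```python
MASK64 = (1 << 64) - 1
LIMBS = 4
R = 1 << (64 * LIMBS)  # Montgomery radix 2^256

# Assumption: BN254 scalar field modulus, chosen only as a concrete example.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
INV = pow(-P, -1, 1 << 64)  # -P^{-1} mod 2^64 (needs Python 3.8+)


def mont_mul(a, b, p=P, inv=INV):
    """Baseline: a * b * R^{-1} mod p, one reduction step per limb of a."""
    t = 0
    for i in range(LIMBS):
        a_i = (a >> (64 * i)) & MASK64
        t += a_i * b
        m = (t * inv) & MASK64   # makes t + m*p divisible by 2^64
        t = (t + m * p) >> 64
    return t - p if t >= p else t


def sum_of_products(a0, b0, a1, b1, p=P, inv=INV):
    """Fused: (a0*b0 + a1*b1) * R^{-1} mod p with one reduction step per limb."""
    assert p < R // 2, "model assumes a spare bit in the modulus"
    t = 0
    for i in range(LIMBS):
        a0_i = (a0 >> (64 * i)) & MASK64
        a1_i = (a1 >> (64 * i)) & MASK64
        t += a0_i * b0 + a1_i * b1   # accumulate both partial products
        m = (t * inv) & MASK64       # single shared reduction for this limb
        t = (t + m * p) >> 64
    return t - p if t >= p else t    # t < 2p here, so one subtraction is enough


def to_mont(x, p=P):
    return (x * R) % p


def from_mont(x, p=P):
    return (x * pow(R, -1, p)) % p
```

The loop body runs four times, once per limb, which is where the four shared reduction steps come from; the separate path runs the same loop twice and then reduces the sum once more.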

Design

  • 5-limb product pattern: computes each a[i]*b[0..3] as a 5-limb number using only 7 scratch registers, freeing x19-x22 for preloaded b1 values
  • 28 of 29 usable ARM64 GPRs: accumulator (x0-x3,x14), b0 (x4-x7), mod (x8-x11), inv (x12), a0 ptr (x13), b1 (x19-x22), a1 ptr (x25), carry1 (x28), scratch (x15-x17,x23-x24,x26-x27)
  • Fused final reduction: the conditional modulus subtraction (subs/sbcs + csel) runs inside the asm block, so no post-asm lessThanModulus check is needed
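
The two limb-level ideas above can be modelled in Python as a rough sketch. `mul_1x4` mirrors the 5-limb product (one 64-bit limb of a times the four limbs of b), and `final_reduce` mirrors a borrow-chain subtraction followed by a select, which is what the subs/sbcs + csel sequence does. Both function names and the limb helpers are hypothetical, and register allocation is not modelled at all.

```python
MASK64 = (1 << 64) - 1


def to_limbs(x, n=4):
    """Split an integer into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into an integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def mul_1x4(a_i, b):
    """One 64-bit limb times four limbs, returned as a 5-limb number."""
    out, carry = [], 0
    for b_j in b:
        prod = a_i * b_j + carry   # fits in 128 bits
        out.append(prod & MASK64)  # low half of the product
        carry = prod >> 64         # high half feeds the next limb
    out.append(carry)              # fifth limb
    return out


def final_reduce(t, top, p):
    """Return (top*2^256 + t) mod p as 4 limbs, assuming the input is < 2p.

    Always computes t - p with a borrow chain, then selects between the
    difference and the original limbs based on the final borrow.
    """
    diff, borrow = [], 0
    for t_i, p_i in zip(t, p):
        d = t_i - p_i - borrow
        diff.append(d & MASK64)
        borrow = 1 if d < 0 else 0
    borrow = 1 if top - borrow < 0 else 0  # propagate into the overflow limb
    use_diff = borrow == 0                 # no borrow means value >= p
    return [d if use_diff else t_i for d, t_i in zip(diff, t)]
```

Because the selection happens on the final borrow, the caller never needs a separate comparison against the modulus after the call.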

Test plan

  • All 539 existing tests pass (zig build test)
  • End-to-end proofs verify (fibonacci, sha256_128, sha256 all VERIFIED via Jolt verifier)
  • Field microbenchmark confirms 28.6 ns/op (1.08x faster than Arkworks sum_of_products)

🤖 Generated with Claude Code

Add arm64SumOfProducts256 in ARM64 inline assembly, computing
(a0*b0 + a1*b1) * R^{-1} mod p with 4 shared Montgomery reductions
instead of 9 for separate mul+mul+add. Uses 28 GPRs with a 5-limb
product pattern to fit b0, b1, and mod all preloaded in registers.
Final conditional modulus subtraction is fused inside the asm block.

Wired into both MontgomeryField implementations (field/mod.zig and
curves/montgomery_field.zig) matching the existing arm64 dispatch
pattern. sumOfProducts benchmarks at 28.6ns vs 38.1ns for 2×mul+add
(~25% faster per call). Fp2.mul calls sumOfProducts, so this speeds
up the 29%-of-prover-time hot path automatically.
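
To illustrate why Fp2.mul benefits, here is a small Python sketch of a quadratic-extension multiplication written as two sum-of-products calls. The commit only states that Fp2.mul calls sumOfProducts; the exact call shape, the choice u^2 = -1, and the BN254 base field modulus below are assumptions for illustration, and `sum_of_products` here is a plain modular stand-in rather than the Montgomery routine.

```python
# Assumption: BN254 base field modulus, with Fp2 = Fp[u] / (u^2 + 1).
Q = 21888242871839275222246405745257275088696311157297823662689037894645226208583


def sum_of_products(a0, b0, a1, b1, p=Q):
    """Plain modular stand-in for the fused routine: a0*b0 + a1*b1 mod p."""
    return (a0 * b0 + a1 * b1) % p


def fp2_mul(a, b, p=Q):
    """(a0 + a1*u) * (b0 + b1*u) with u^2 = -1, as two sum-of-products calls."""
    a0, a1 = a
    b0, b1 = b
    neg_b1 = (p - b1) % p
    c0 = sum_of_products(a0, b0, a1, neg_b1, p)  # a0*b0 - a1*b1
    c1 = sum_of_products(a0, b1, a1, b0, p)      # a0*b1 + a1*b0
    return (c0, c1)
```

Under these assumptions each Fp2 multiplication is exactly two calls to the fused routine, so any per-call saving applies twice.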

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatteoMer MatteoMer merged commit a494fff into main Apr 14, 2026
17 checks passed