perf: ARM64 asm for sumOfProducts — fused 2-product Montgomery reduction #80
Merged
Add arm64SumOfProducts256 in ARM64 inline assembly, computing
(a0*b0 + a1*b1) * R^{-1} mod p with 4 shared Montgomery reductions
instead of 9 for separate mul+mul+add. Uses 28 GPRs with a 5-limb
product pattern to fit b0, b1, and mod all preloaded in registers.
Final conditional modulus subtraction is fused inside the asm block.
Wired into both MontgomeryField implementations (field/mod.zig and
curves/montgomery_field.zig) matching the existing arm64 dispatch
pattern. sumOfProducts benchmarks at 28.6 ns vs 38.1 ns for 2×mul+add
(~25% faster per call). Fp2.mul calls sumOfProducts, so the hot path
that accounts for 29% of prover time speeds up with no caller changes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
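
For readers who want to check the arithmetic, the fused reduction described above can be modelled with Python big integers. This is a minimal reference sketch, not the assembly from this PR: the function name `sum_of_products_mont`, the 4×64-bit limb layout, and the stand-in modulus `2**255 - 19` are assumptions for illustration. It also assumes the modulus is odd and below 2^255, so that one conditional subtraction at the end is enough.

```python
W = 64                # limb width in bits (assumed)
N = 4                 # 4 limbs, i.e. 256-bit operands (assumed)
MASK = (1 << W) - 1
R = 1 << (W * N)      # Montgomery radix R = 2^256


def sum_of_products_mont(a0, b0, a1, b1, p):
    """Return (a0*b0 + a1*b1) * R^-1 mod p using N shared reduction steps.

    Hypothetical reference model. Inputs must be < p, and p must be odd
    and < R/2 so the accumulator stays below 2p after the last step.
    """
    assert p % 2 == 1 and p < R // 2
    assert all(0 <= x < p for x in (a0, b0, a1, b1))
    n0 = (-pow(p, -1, 1 << W)) & MASK      # -p^-1 mod 2^64 (Python 3.8+)
    t = 0
    for i in range(N):
        a0_i = (a0 >> (W * i)) & MASK      # i-th limb of a0
        a1_i = (a1 >> (W * i)) & MASK      # i-th limb of a1
        t += a0_i * b0 + a1_i * b1         # accumulate both partial products
        m = ((t & MASK) * n0) & MASK       # multiplier that zeroes the low limb
        t = (t + m * p) >> W               # one reduction step shared by both products
    return t - p if t >= p else t          # final conditional subtraction


# Usage: compare against the definition for a stand-in modulus.
p = 2**255 - 19                            # illustrative only, not the project's field
a0, b0, a1, b1 = 3, 5, 7, 11
expected = (a0 * b0 + a1 * b1) * pow(R, -1, p) % p
assert sum_of_products_mont(a0, b0, a1, b1, p) == expected
```

The loop runs one reduction step per limb of the multiplier, which is where the count of 4 shared reductions comes from; computing the two products separately would run that loop twice.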
Summary
- Add arm64SumOfProducts256 in ARM64 inline assembly computing (a0*b0 + a1*b1) * R^{-1} mod p with 4 shared Montgomery reductions instead of 9 for separate mul+mul+add
- Wired into both MontgomeryField implementations (field/mod.zig and curves/montgomery_field.zig) matching the existing arm64 dispatch pattern
- sumOfProducts benchmarks at 28.6 ns vs ~38.1 ns for 2×mul+add (~25% faster per call), directly accelerating Fp2.mul which is 29% of prover CPU time

Design
- a[i]*b[0..3] as a 5-limb number using only 7 scratch registers, freeing x19-x22 for preloaded b1 values
- subs/sbcs + csel inside asm block; no post-asm lessThanModulus check needed

Test plan
- zig build test
- fibonacci, sha256_128, sha256 all VERIFIED via Jolt verifier
- sum_of_products

🤖 Generated with Claude Code
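
The subs/sbcs + csel step listed under Design can be illustrated with a limb-wise model. This is a hedged sketch rather than the code in the asm block: `cond_sub_limbs`, `to_limbs`, and `from_limbs` are hypothetical helper names, and the input is assumed to be below 2p and to fit in four 64-bit limbs. It subtracts the modulus with a borrow chain and then selects between the original value and the difference by mask instead of by branch.

```python
W = 64
MASK = (1 << W) - 1


def to_limbs(x, n=4):
    """Split x into n little-endian 64-bit limbs (hypothetical helper)."""
    return [(x >> (W * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 64-bit limbs into an integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))


def cond_sub_limbs(t, p):
    """Return t - p if t >= p, else t, without branching on the comparison.

    t and p are little-endian limb lists of equal length; value(t) < 2*value(p).
    """
    diff = []
    borrow = 0
    for t_i, p_i in zip(t, p):
        d = t_i - p_i - borrow             # subs for the first limb, sbcs after it
        borrow = 1 if d < 0 else 0         # borrow out of this limb
        diff.append(d & MASK)
    keep = (-borrow) & MASK                # all ones if t < p, else zero
    # Mask-based select, playing the role of csel on each limb.
    return [(t_i & keep) | (d_i & ~keep & MASK) for t_i, d_i in zip(t, diff)]


# Usage with a stand-in modulus (illustrative only).
p = 2**255 - 19
t = p + 5
assert from_limbs(cond_sub_limbs(to_limbs(t), to_limbs(p))) == 5
```

On AArch64 the carry flag is clear after a subtraction that borrowed, so the select condition in real assembly reads the flag in the opposite sense to the `borrow` variable used here.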