ed25519: tune AVX2 field multiplication codegen by efagerho · Pull Request #26 · anza-xyz/cryptography

efagerho · 2026-06-04T08:27:17Z

Balance the ten-term AVX2 field multiplication accumulations and recompute cheap doubled odd limbs at their use sites. This shortens dependency chains and reduces some vector stack traffic without changing the multiplication formula.
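As a rough illustration of the two ideas above, here is a scalar model in Rust. It is not the AVX2 code from this PR: the function names, the `u64` element type, and the explicit pairing order are assumptions made for the sketch. It only shows why a balanced sum of ten terms has a shorter dependency chain than a serial one, and what recomputing a doubled limb at its use site looks like.

```rust
/// Serial accumulation: every add waits on the previous one,
/// so ten terms form a dependency chain of 9 adds.
fn sum_serial(t: &[u64; 10]) -> u64 {
    let mut acc = t[0];
    for &x in &t[1..] {
        acc = acc.wrapping_add(x);
    }
    acc
}

/// Balanced accumulation: independent pairs are added first,
/// so the longest dependency chain is 4 adds instead of 9.
fn sum_balanced(t: &[u64; 10]) -> u64 {
    // Level 1: five independent adds.
    let a = t[0].wrapping_add(t[1]);
    let b = t[2].wrapping_add(t[3]);
    let c = t[4].wrapping_add(t[5]);
    let d = t[6].wrapping_add(t[7]);
    let e = t[8].wrapping_add(t[9]);
    // Level 2: two independent adds.
    let ab = a.wrapping_add(b);
    let cd = c.wrapping_add(d);
    // Levels 3 and 4: combine the partial sums.
    ab.wrapping_add(cd).wrapping_add(e)
}

/// Hypothetical helper: the doubled limb is recomputed where it is used
/// (one cheap add) instead of being kept live in a register or stack slot.
fn doubled_at_use(x_odd: u32, y: u32) -> u64 {
    u64::from(x_odd.wrapping_add(x_odd)) * u64::from(y)
}
```

Both sums return the same value because wrapping addition is associative and commutative; only the order of operations, and therefore the schedule the compiler can produce, differs.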

This change is intended to pair with the AVX2 B * 2^128 lookup-table branch. On its own, this source-level scheduling tradeoff is not a win in this repository's single-verification benchmark. Once the lookup-table branch removes dynamic B * 2^128 table construction from the verifier hot path, the field-mul rewrite becomes profitable: the remaining work is more dominated by field arithmetic, and trading cheap recomputation for shorter live ranges and less stack traffic improves the combined path.

Benchmark notes:

- Ran this repository's Criterion benchmark program, benches/bench.rs, filtering to Single Verification, pinned to CPU 4 with 1s warmup, 2s measurement, and sample size 10.
- Standalone local_verify_zebra estimate was 21.104 us, with 95% CI 20.998..21.211 us.
- master measured 20.051 us, with 95% CI 19.938..20.134 us, so this standalone branch was about 5.25% slower in that run.
- Combined with avx2-basepoint-128-table, local_verify_zebra measured 18.703 us, with 95% CI 18.611..18.817 us.
- The table-only branch measured 19.382 us, with 95% CI 19.327..19.426 us, so adding this field-mul rewrite on top of the lookup table was about 3.5% faster in that run.
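The percentages quoted in the notes can be reproduced from the point estimates alone. The helper below is a small sketch (its name and sign convention are assumptions, not part of the benchmark program): it returns the relative change of a candidate time against a baseline, positive when the candidate is slower.

```rust
/// Relative change of `candidate_us` versus `baseline_us`, in percent.
/// Positive means the candidate is slower; negative means it is faster.
fn relative_change_percent(candidate_us: f64, baseline_us: f64) -> f64 {
    (candidate_us / baseline_us - 1.0) * 100.0
}
```

With the numbers above, `relative_change_percent(21.104, 20.051)` is roughly 5.25 (standalone branch versus master) and `relative_change_percent(18.703, 19.382)` is roughly -3.50 (combined branch versus table-only). These use the point estimates only and do not account for the reported confidence intervals.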

efagerho force-pushed the avx2-field-mul-codegen branch from f222410 to c8962a3 on June 4, 2026 08:37

efagerho force-pushed the avx2-field-mul-codegen branch from c8962a3 to 2908d3f on June 4, 2026 08:40

efagerho wants to merge 1 commit into anza-xyz:master from efagerho:avx2-field-mul-codegen
