ed25519: tune AVX2 field multiplication codegen #26

Open
efagerho wants to merge 1 commit into anza-xyz:master from efagerho:avx2-field-mul-codegen

@efagerho efagerho commented Jun 4, 2026

Contributor

Balance the ten-term AVX2 field multiplication accumulations and recompute cheap doubled odd limbs at their use sites. This shortens dependency chains and reduces some vector stack traffic without changing the multiplication formula.
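
As a rough illustration of what "balancing" a ten-term accumulation means, the sketch below contrasts a serial sum with a tree-shaped one. It is a scalar stand-in under stated assumptions, not the AVX2 code from this branch: `sum_serial` and `sum_balanced` are hypothetical names, and plain `u64` values take the place of vector lanes. Both versions perform nine additions; only the length of the longest dependency chain differs.

```rust
/// Serial accumulation: every add waits on the previous one,
/// so the dependency chain is 9 adds deep.
/// (Hypothetical helper, scalar stand-in for a vector accumulation.)
fn sum_serial(t: [u64; 10]) -> u64 {
    let mut acc = t[0];
    for &x in &t[1..] {
        acc = acc.wrapping_add(x);
    }
    acc
}

/// Balanced accumulation: independent pairs are summed first,
/// so the longest dependency chain is 4 adds deep.
/// (Hypothetical helper, same result as `sum_serial`.)
fn sum_balanced(t: [u64; 10]) -> u64 {
    // Level 1: five independent adds.
    let a = t[0].wrapping_add(t[1]);
    let b = t[2].wrapping_add(t[3]);
    let c = t[4].wrapping_add(t[5]);
    let d = t[6].wrapping_add(t[7]);
    let e = t[8].wrapping_add(t[9]);
    // Level 2: two independent adds.
    let ab = a.wrapping_add(b);
    let cd = c.wrapping_add(d);
    // Levels 3 and 4: fold the remaining partial sums.
    let abcd = ab.wrapping_add(cd);
    abcd.wrapping_add(e)
}
```

Because wrapping addition is associative and commutative, the two functions return the same value for any input, which is the sense in which regrouping of this kind leaves the multiplication formula unchanged.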

This change is intended to pair with the AVX2 B * 2^128 lookup-table branch. On its own, this source-level scheduling tradeoff is not a win in this repository's single-verification benchmark. Once the lookup-table branch removes dynamic B * 2^128 table construction from the verifier hot path, the field-mul rewrite becomes profitable: the remaining work is more dominated by field arithmetic, and trading cheap recomputation for shorter live ranges and less stack traffic improves the combined path.

Benchmark notes:

  • Ran this repository's Criterion benchmark program, benches/bench.rs, filtering to Single Verification, pinned to CPU 4 with 1s warmup, 2s measurement, and sample size 10.

  • Standalone local_verify_zebra estimate was 21.104 us, with 95% CI 20.998..21.211 us.

  • master measured 20.051 us, with 95% CI 19.938..20.134 us, so this standalone branch was about 5.25% slower in that run.

  • Combined with avx2-basepoint-128-table, local_verify_zebra measured 18.703 us, with 95% CI 18.611..18.817 us.

  • The table-only branch measured 19.382 us, with 95% CI 19.327..19.426 us, so adding this field-mul rewrite on top of the lookup table was about 3.5% faster in that run.
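
The percentages quoted above can be reproduced from the reported point estimates. The helper below is a minimal sketch of that arithmetic; `relative_change_pct` is a hypothetical name, and it uses only the point estimates, not the confidence intervals.

```rust
/// Relative change of `candidate_us` against `baseline_us`, in percent.
/// Positive means the candidate is slower, negative means it is faster.
/// (Hypothetical helper for checking the figures in the notes.)
fn relative_change_pct(candidate_us: f64, baseline_us: f64) -> f64 {
    (candidate_us / baseline_us - 1.0) * 100.0
}
```

With the numbers from the notes, `relative_change_pct(21.104, 20.051)` is roughly +5.25 (standalone branch against master) and `relative_change_pct(18.703, 19.382)` is roughly -3.5 (combined branch against table-only).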

@efagerho efagerho force-pushed the avx2-field-mul-codegen branch from f222410 to c8962a3 Compare June 4, 2026 08:37
@efagerho efagerho force-pushed the avx2-field-mul-codegen branch from c8962a3 to 2908d3f Compare June 4, 2026 08:40