ed25519: tune AVX2 field multiplication codegen#26
Open
efagerho wants to merge 1 commit into
Balance the ten-term AVX2 field multiplication accumulations and recompute cheap doubled odd limbs at their use sites. This shortens dependency chains and reduces some vector stack traffic without changing the multiplication formula.
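As a rough illustration of the two transformations described above, here is a scalar sketch of the lowest output limb of a radix-2^25.5 field multiplication. It is not the code in this PR: the real implementation works on AVX2 vectors, and the names `Limbs`, `z0_serial`, and `z0_balanced` are hypothetical. The first function precomputes the doubled odd limbs and sums the ten terms in one serial chain; the second recomputes each doubling where it is used and sums the same ten terms as a tree.

```rust
/// Ten limbs of a field element in the radix-2^25.5 layout.
/// Assumption for this sketch: every limb is below 2^26, so the
/// ten-term sum stays well inside a u64.
pub type Limbs = [u64; 10];

/// Reference shape: doubled odd limbs are computed up front and kept
/// live, and the ten terms are added in one serial chain (nine
/// dependent additions).
pub fn z0_serial(x: &Limbs, y: &Limbs) -> u64 {
    let x2 = [2 * x[1], 2 * x[3], 2 * x[5], 2 * x[7], 2 * x[9]];
    let mut acc = x[0] * y[0];
    acc += 19 * (x2[0] * y[9]);
    acc += 19 * (x[2] * y[8]);
    acc += 19 * (x2[1] * y[7]);
    acc += 19 * (x[4] * y[6]);
    acc += 19 * (x2[2] * y[5]);
    acc += 19 * (x[6] * y[4]);
    acc += 19 * (x2[3] * y[3]);
    acc += 19 * (x[8] * y[2]);
    acc += 19 * (x2[4] * y[1]);
    acc
}

/// Rewritten shape: each doubled odd limb is recomputed at its use
/// site (one cheap add), and the ten terms are summed pairwise so the
/// longest chain is four dependent additions.
pub fn z0_balanced(x: &Limbs, y: &Limbs) -> u64 {
    let t0 = x[0] * y[0];
    let t1 = 19 * ((x[1] + x[1]) * y[9]);
    let t2 = 19 * (x[2] * y[8]);
    let t3 = 19 * ((x[3] + x[3]) * y[7]);
    let t4 = 19 * (x[4] * y[6]);
    let t5 = 19 * ((x[5] + x[5]) * y[5]);
    let t6 = 19 * (x[6] * y[4]);
    let t7 = 19 * ((x[7] + x[7]) * y[3]);
    let t8 = 19 * (x[8] * y[2]);
    let t9 = 19 * ((x[9] + x[9]) * y[1]);

    // Five independent pair sums, then a short reduction tree.
    let a = t0 + t1;
    let b = t2 + t3;
    let c = t4 + t5;
    let d = t6 + t7;
    let e = t8 + t9;
    (a + b) + (c + d) + e
}
```

Both functions compute the same value, which matches the claim that the multiplication formula is unchanged. In a full multiply the doubled limbs feed several output limbs, which is where keeping them live can cost registers or stack slots. A compiler is free to reassociate integer additions, so whether either shape survives into the generated code has to be checked in the disassembly.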
This change is intended to pair with the AVX2 B * 2^128 lookup-table branch. On its own, this source-level scheduling tradeoff is not a win in this repository's single-verification benchmark. Once the lookup-table branch removes dynamic B * 2^128 table construction from the verifier hot path, the field-mul rewrite becomes profitable: the remaining work is more dominated by field arithmetic, and trading cheap recomputation for shorter live ranges and less stack traffic improves the combined path.
Benchmark notes:

- Ran this repository's Criterion benchmark program, benches/bench.rs, filtering to Single Verification, pinned to CPU 4 with 1s warmup, 2s measurement, and sample size 10.
- Standalone local_verify_zebra estimate was 21.104 us, with 95% CI 20.998..21.211 us.
- master measured 20.051 us, with 95% CI 19.938..20.134 us, so this standalone branch was about 5.25% slower in that run.
- Combined with avx2-basepoint-128-table, local_verify_zebra measured 18.703 us, with 95% CI 18.611..18.817 us.
- The table-only branch measured 19.382 us, with 95% CI 19.327..19.426 us, so adding this field-mul rewrite on top of the lookup table was about 3.5% faster in that run.
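
The percentages above can be reproduced from the reported point estimates with a small helper. This is only a sketch of the arithmetic, and the function names `percent_change` and `intervals_overlap` are hypothetical rather than part of the benchmark program.

```rust
/// Relative change of `new_us` against `base_us`, in percent.
/// Positive means slower than the baseline, negative means faster.
pub fn percent_change(base_us: f64, new_us: f64) -> f64 {
    (new_us - base_us) / base_us * 100.0
}

/// True if two closed intervals (lo, hi) share at least one point.
pub fn intervals_overlap(a: (f64, f64), b: (f64, f64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

With the figures quoted above, `percent_change(20.051, 21.104)` comes out near +5.25 (standalone vs master) and `percent_change(19.382, 18.703)` near -3.5 (combined vs table-only). The reported 95% intervals do not overlap in either comparison, although with a sample size of 10 that is suggestive rather than conclusive.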