[aesgcm] improve portable gf128 performance by robinhundt · Pull Request #1340 · cryspen/libcrux

robinhundt · 2026-02-19T17:59:42Z

Hi there,
while looking at the aes-gcm implementation, I noticed that the performance of the portable gf128 implementation could be improved. This PR replaces the previous gf128 multiplication with an optimized carry-less 128 x 128 -> 256 bit multiplication, followed by the usual reduction by the irreducible polynomial (the same reduction as in platform/x64/gf128_core.rs). This improves the throughput of the portable aes-gcm implementation by roughly 45-60% (see benchmarks).
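
To make the shape of this computation concrete, here is a minimal sketch of a carry-less 128 x 128 -> 256 bit multiplication assembled from three 64 x 64 -> 128 bit carry-less multiplications (Karatsuba). It is not the code from this PR: `clmul64_ref` is a deliberately slow bit-by-bit reference, both function names are hypothetical, and the reduction step is left out.

```rust
/// Reference carry-less 64 x 64 -> 128 bit multiplication, one bit of `y` at a time.
/// Slow and for illustration only; no constant-time guarantee is claimed.
fn clmul64_ref(x: u64, y: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        // All-ones mask if bit i of y is set, all-zeros otherwise.
        let mask = 0u128.wrapping_sub(((y >> i) & 1) as u128);
        acc ^= ((x as u128) << i) & mask;
    }
    acc
}

/// Carry-less 128 x 128 -> 256 bit multiplication, returned as (high, low).
/// Karatsuba needs three 64-bit carry-less multiplications instead of four.
fn clmul128(x: u128, y: u128) -> (u128, u128) {
    let (x0, x1) = (x as u64, (x >> 64) as u64);
    let (y0, y1) = (y as u64, (y >> 64) as u64);
    let lo = clmul64_ref(x0, y0);
    let hi = clmul64_ref(x1, y1);
    // (x0 ^ x1)(y0 ^ y1) ^ lo ^ hi equals the cross term x0*y1 ^ x1*y0.
    let mid = clmul64_ref(x0 ^ x1, y0 ^ y1) ^ lo ^ hi;
    // The cross term sits at bit offset 64 of the 256-bit result.
    (hi ^ (mid >> 64), lo ^ (mid << 64))
}
```

A GF(2^128) multiplication would then reduce the (high, low) pair modulo the field polynomial.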

The implementation of the carry-less multiplication is based on the algorithm used in bearssl and adapted from the RustCrypto implementation. Unlike these implementations, I'm making use of Rust's u128 support, which allows me to skip the Rev trick described in the blog post. Originally, I wrote this code for my cryprot-core library.
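
For readers unfamiliar with the Rev trick, the following is a rough sketch of the idea for T = 32, written from my understanding of the bearssl approach rather than copied from either library. A truncating T x T -> T carry-less multiplication only yields the low half of the product; the high half is recovered by multiplying the bit-reversed operands and reversing the result. The function names are hypothetical.

```rust
/// Low 32 bits of the carry-less product, using integer multiplications
/// on operands split into four classes with 3-bit "holes" between bits.
fn bmul32(x: u32, y: u32) -> u32 {
    let (x0, x1, x2, x3) = (x & 0x1111_1111, x & 0x2222_2222, x & 0x4444_4444, x & 0x8888_8888);
    let (y0, y1, y2, y3) = (y & 0x1111_1111, y & 0x2222_2222, y & 0x4444_4444, y & 0x8888_8888);
    // Each class has at most 8 set bits, so per-position sums stay below 16
    // and never carry into the next position of the same class.
    let z0 = x0.wrapping_mul(y0) ^ x1.wrapping_mul(y3) ^ x2.wrapping_mul(y2) ^ x3.wrapping_mul(y1);
    let z1 = x0.wrapping_mul(y1) ^ x1.wrapping_mul(y0) ^ x2.wrapping_mul(y3) ^ x3.wrapping_mul(y2);
    let z2 = x0.wrapping_mul(y2) ^ x1.wrapping_mul(y1) ^ x2.wrapping_mul(y0) ^ x3.wrapping_mul(y3);
    let z3 = x0.wrapping_mul(y3) ^ x1.wrapping_mul(y2) ^ x2.wrapping_mul(y1) ^ x3.wrapping_mul(y0);
    (z0 & 0x1111_1111) | (z1 & 0x2222_2222) | (z2 & 0x4444_4444) | (z3 & 0x8888_8888)
}

/// Full 64-bit carry-less product from two truncating multiplications.
fn clmul32_rev(x: u32, y: u32) -> u64 {
    let lo = bmul32(x, y);
    // Reversing both operands mirrors the product, so its low half holds
    // the original bits 31..=62; reverse again and drop bit 31.
    let hi = bmul32(x.reverse_bits(), y.reverse_bits()).reverse_bits() >> 1;
    ((hi as u64) << 32) | (lo as u64)
}

/// Bit-by-bit reference used only to check the result.
fn clmul32_ref(x: u32, y: u32) -> u64 {
    let mut acc = 0u64;
    for i in 0..32 {
        if (y >> i) & 1 == 1 {
            acc ^= (x as u64) << i;
        }
    }
    acc
}
```

With a widening multiplication available, the second call to `bmul32` and the three bit reversals are unnecessary, which is the saving referred to above.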

Limitations

Constant-time: As the clmul64 implementation uses u128 multiplication, it will not be constant-time on targets where this multiplication is not constant-time. Since it requires constant-time 64 x 64 -> 128 bit multiplications (e.g. MULX on x86), I'm doubtful whether it would provide a performance benefit in practice, as targets that offer them will often have PCLMULQDQ (or an equivalent) available as well. There might be some ARM targets which have constant-time MUL/UMULH instructions but have neither the crypto extensions nor PMULL support. More information about the problem of constant-time multiplications is available here.

32-bit targets: I'm unsure how this implementation would fare on 32-bit targets compared to e.g. the optimized one in RustCrypto/universal-hashes. Some quick napkin math and godbolt experimentation suggest that my implementation should compile to fewer MUL instructions, but I haven't investigated this properly.

Comparison to bearssl/RustCrypto implementation: The prior work uses T x T -> T carry-less multiplications (for T = 32 or T = 64). I believe my version is faster on at least some targets, but I have not benchmarked this properly.

Verification of this code: As far as I can tell, the current gf128 implementation is not formally verified. I'm not sure whether this optimized version would be harder to formally verify.

From my current understanding of libcrux, I would not recommend merging this PR as is. The limitations around constant-time guarantees in particular warrant a closer look. But it shows that there is significant potential for performance improvement in the portable aes-gcm implementation.

Benchmarks

Baseline:

aes gcm 128/libcrux/128 
                        time:   [8.2022 µs 8.2176 µs 8.2385 µs]
                        thrpt:  [14.817 MiB/s 14.855 MiB/s 14.883 MiB/s]
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
aes gcm 128/libcrux/1 KB
                        time:   [49.087 µs 49.178 µs 49.299 µs]
                        thrpt:  [19.809 MiB/s 19.858 MiB/s 19.895 MiB/s]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Benchmarking aes gcm 128/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 49.3s, or reduce sample count to 10.
aes gcm 128/libcrux/10 MB
                        time:   [484.10 ms 484.49 ms 484.96 ms]
                        thrpt:  [20.620 MiB/s 20.640 MiB/s 20.657 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

aes gcm 256/libcrux/128 
                        time:   [9.8674 µs 9.8712 µs 9.8758 µs]
                        thrpt:  [12.361 MiB/s 12.366 MiB/s 12.371 MiB/s]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
aes gcm 256/libcrux/1 KB
                        time:   [60.014 µs 60.038 µs 60.066 µs]
                        thrpt:  [16.258 MiB/s 16.266 MiB/s 16.272 MiB/s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking aes gcm 256/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 60.1s, or reduce sample count to 10.
aes gcm 256/libcrux/10 MB
                        time:   [594.81 ms 595.70 ms 596.76 ms]
                        thrpt:  [16.757 MiB/s 16.787 MiB/s 16.812 MiB/s]
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

This PR:

aes gcm 128/libcrux/128 
                        time:   [5.1399 µs 5.1433 µs 5.1476 µs]
                        thrpt:  [23.714 MiB/s 23.734 MiB/s 23.750 MiB/s]
                 change:
                        time:   [−38.278% −37.510% −36.890%] (p = 0.00 < 0.05)
                        thrpt:  [+58.453% +60.025% +62.017%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
aes gcm 128/libcrux/1 KB
                        time:   [30.780 µs 30.796 µs 30.814 µs]
                        thrpt:  [31.693 MiB/s 31.711 MiB/s 31.727 MiB/s]
                 change:
                        time:   [−37.294% −36.720% −35.852%] (p = 0.00 < 0.05)
                        thrpt:  [+55.890% +58.029% +59.474%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe
Benchmarking aes gcm 128/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 31.5s, or reduce sample count to 10.
aes gcm 128/libcrux/10 MB
                        time:   [307.24 ms 307.56 ms 307.95 ms]
                        thrpt:  [32.473 MiB/s 32.514 MiB/s 32.548 MiB/s]
                 change:
                        time:   [−36.611% −36.519% −36.431%] (p = 0.00 < 0.05)
                        thrpt:  [+57.310% +57.527% +57.757%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

aes gcm 256/libcrux/128 
                        time:   [6.7777 µs 6.8141 µs 6.8582 µs]
                        thrpt:  [17.799 MiB/s 17.914 MiB/s 18.011 MiB/s]
                 change:
                        time:   [−31.512% −31.242% −30.940%] (p = 0.00 < 0.05)
                        thrpt:  [+44.802% +45.437% +46.012%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  8 (8.00%) high mild
  6 (6.00%) high severe
aes gcm 256/libcrux/1 KB
                        time:   [40.781 µs 40.835 µs 40.930 µs]
                        thrpt:  [23.860 MiB/s 23.915 MiB/s 23.946 MiB/s]
                 change:
                        time:   [−32.025% −31.895% −31.745%] (p = 0.00 < 0.05)
                        thrpt:  [+46.509% +46.833% +47.113%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe
Benchmarking aes gcm 256/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 41.3s, or reduce sample count to 10.
aes gcm 256/libcrux/10 MB
                        time:   [405.76 ms 406.16 ms 406.64 ms]
                        thrpt:  [24.592 MiB/s 24.621 MiB/s 24.645 MiB/s]
                 change:
                        time:   [−31.954% −31.819% −31.686%] (p = 0.00 < 0.05)
                        thrpt:  [+46.383% +46.669% +46.960%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

jschneider-bensch · 2026-02-23T16:20:07Z

Thank you, that looks very interesting!
I'll make sure to check it out as soon as time allows.

robinhundt · 2026-03-24T11:22:20Z

Ohh, I just had another look at this and think my adaptation to u128 of the bearSSL code for the carry-less multiplication is not actually correct. The holes to mask out the carries are not large enough in cases like x = y = u64::MAX.
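
The failure mode can be reproduced with a small sketch. The function below is a hypothetical reconstruction of what a naive widening of the bearssl scheme to 64-bit operands might look like; it is not the code from this PR, and the actual adaptation may differ in its details. With four classes and 3-bit holes, each class of a 64-bit operand can have up to 16 set bits, so a per-position sum can reach 16 and carry into the next position of the same class.

```rust
/// Bit-by-bit reference carry-less 64 x 64 -> 128 bit multiplication.
fn clmul64_ref(x: u64, y: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (y >> i) & 1 == 1 {
            acc ^= (x as u128) << i;
        }
    }
    acc
}

/// INCORRECT on purpose: the 32-bit hole layout applied to 64-bit operands.
/// Hypothetical reconstruction for illustration, not the PR's code.
fn clmul64_holes_broken(x: u64, y: u64) -> u128 {
    const M: u64 = 0x1111_1111_1111_1111;
    const W: u128 = 0x1111_1111_1111_1111_1111_1111_1111_1111;
    let xs = [x & M, x & (M << 1), x & (M << 2), x & (M << 3)];
    let ys = [y & M, y & (M << 1), y & (M << 2), y & (M << 3)];
    // Widening integer multiplication; the product always fits in a u128.
    let m = |a: u64, b: u64| (a as u128) * (b as u128);
    let z0 = m(xs[0], ys[0]) ^ m(xs[1], ys[3]) ^ m(xs[2], ys[2]) ^ m(xs[3], ys[1]);
    let z1 = m(xs[0], ys[1]) ^ m(xs[1], ys[0]) ^ m(xs[2], ys[3]) ^ m(xs[3], ys[2]);
    let z2 = m(xs[0], ys[2]) ^ m(xs[1], ys[1]) ^ m(xs[2], ys[0]) ^ m(xs[3], ys[3]);
    let z3 = m(xs[0], ys[3]) ^ m(xs[1], ys[2]) ^ m(xs[2], ys[1]) ^ m(xs[3], ys[0]);
    (z0 & W) | (z1 & (W << 1)) | (z2 & (W << 2)) | (z3 & (W << 3))
}
```

For operands that fit in 32 bits the two functions agree, since no class has more than 8 set bits. For x = y = u64::MAX they disagree, because 16 partial products land on the same position.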

The classic approach used in bearSSL and RustCrypto for portable carry-less multiplication for 32 x 32 -> 64 bits works and should still provide a meaningful performance improvement.
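
As a sketch of that classic approach, written from my understanding of the bearssl scheme and not taken from either code base: the operands are split into four classes with 3-bit holes, multiplied with ordinary 32 x 32 -> 64 bit integer multiplications, and the carries are masked away. Because a 32-bit operand has at most 8 set bits per class, a per-position sum never exceeds 8 and fits into its 4-bit slot. The function names are hypothetical.

```rust
/// Carry-less 32 x 32 -> 64 bit multiplication via integer multiplications.
fn clmul32(x: u32, y: u32) -> u64 {
    const M: u32 = 0x1111_1111;
    const W: u64 = 0x1111_1111_1111_1111;
    let xs = [x & M, x & (M << 1), x & (M << 2), x & (M << 3)];
    let ys = [y & M, y & (M << 1), y & (M << 2), y & (M << 3)];
    // Widening multiplication: both factors are below 2^32, so no overflow.
    let m = |a: u32, b: u32| (a as u64) * (b as u64);
    // z_k collects the class pairs whose bit positions sum to k mod 4.
    let z0 = m(xs[0], ys[0]) ^ m(xs[1], ys[3]) ^ m(xs[2], ys[2]) ^ m(xs[3], ys[1]);
    let z1 = m(xs[0], ys[1]) ^ m(xs[1], ys[0]) ^ m(xs[2], ys[3]) ^ m(xs[3], ys[2]);
    let z2 = m(xs[0], ys[2]) ^ m(xs[1], ys[1]) ^ m(xs[2], ys[0]) ^ m(xs[3], ys[3]);
    let z3 = m(xs[0], ys[3]) ^ m(xs[1], ys[2]) ^ m(xs[2], ys[1]) ^ m(xs[3], ys[0]);
    // Keep only the parity bit of each 4-bit slot; the rest is carry noise.
    (z0 & W) | (z1 & (W << 1)) | (z2 & (W << 2)) | (z3 & (W << 3))
}

/// Bit-by-bit reference used only to check the result.
fn clmul32_ref(x: u32, y: u32) -> u64 {
    let mut acc = 0u64;
    for i in 0..32 {
        if (y >> i) & 1 == 1 {
            acc ^= (x as u64) << i;
        }
    }
    acc
}
```

Whether the integer multiplication itself runs in constant time still depends on the target.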

Recently I came across the paper Efficient GHASH and POLYVAL Implementation Using Polynomial Multiplication: Optimized 64-bit Decomposition with Bit-Reversal Elimination, which seems to achieve the same thing I tried to do, albeit hopefully correctly. Apparently, this has recently been implemented in the RustCrypto/universal-hashes crate.

I'll convert this to a draft for now so as not to waste review capacity.

franziskuskiefer · 2026-03-24T11:53:34Z

Just a few more thoughts on this.

I think we can generally expect that 64-bit CPUs have AES-NI and therefore won't actually use the portable implementation. So doing the 32-bit version may be more valuable than the 64-bit variant.

The hole punching is indeed pretty tricky to get right. I did a version a while back that you can look up here. There's more background on that on Tim's blog as well.

[aesgcm] improve portable gf128 performance

bf9f436

robinhundt requested a review from a team as a code owner February 19, 2026 17:59

robinhundt requested a review from jschneider-bensch February 19, 2026 17:59

jschneider-bensch added the waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 24, 2026

jschneider-bensch requested a review from franziskuskiefer March 24, 2026 09:33

robinhundt marked this pull request as draft March 24, 2026 11:22

franziskuskiefer removed the waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 24, 2026

franziskuskiefer removed request for franziskuskiefer and jschneider-bensch March 24, 2026 11:53

robinhundt mentioned this pull request Apr 2, 2026

[core] Scalar clmul64 is wrong robinhundt/CryProt#86

Closed

jschneider-bensch assigned jschneider-bensch and robinhundt and unassigned jschneider-bensch Apr 22, 2026

robinhundt wants to merge 1 commit into cryspen:main from robinhundt:robin/aes-gcm-portable-perf-improvement
