[aesgcm] improve portable gf128 performance #1340
Thank you, that looks very interesting!
Ohh, I just had another look at this and think my adaptation of the bearSSL carry-less multiplication code to u128 is not actually correct. The holes to mask out the carries are not large enough in cases like [...]. The classic approach used in bearSSL and RustCrypto for portable carry-less multiplication of 32 x 32 -> 64 bits works and should still provide a meaningful performance improvement. Recently I came upon the paper "Efficient GHASH and POLYVAL Implementation Using Polynomial Multiplication: Optimized 64-bit Decomposition with Bit-Reversal Elimination", which seems to achieve the same as what I tried to do, albeit hopefully correctly. Apparently, this has recently been implemented in the RustCrypto/universal-hashes crate. I'll convert this to a draft for now so as not to waste review capacity.
Just a few more thoughts on this. I think we can generally expect that 64-bit CPUs have AES-NI and therefore won't actually use the portable implementation. So doing the 32-bit version may be more valuable than the 64-bit variant. The hole punching is indeed pretty tricky to get right. I did a version a while back that you can look up here. There's more background on that on Tim's blog as well.
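The classic 32 x 32 -> 64 bit approach discussed in this thread can be sketched roughly as follows. This is a minimal illustration of the "holes" idea, not the code from this PR, bearSSL, or RustCrypto; the function names clmul32_holes and clmul32_reference are made up for this example. Each operand is split into four parts that keep only every fourth bit, so an ordinary integer multiplication of two parts sums at most 8 partial products per bit position, and that sum fits into the 4-bit hole without carrying into the next useful bit. Whether the integer multiplication itself runs in constant time depends on the target.

```rust
/// Bit-by-bit reference, 32 x 32 -> 64 bits. Branches on the bits of `y`,
/// so it is only meant for checking the faster version.
fn clmul32_reference(x: u32, y: u32) -> u64 {
    let mut acc = 0u64;
    for i in 0..32 {
        if (y >> i) & 1 == 1 {
            acc ^= (x as u64) << i;
        }
    }
    acc
}

/// Carry-less 32 x 32 -> 64 bit multiplication using integer multiplications
/// on operands with "holes" (only every fourth bit may be set).
fn clmul32_holes(x: u32, y: u32) -> u64 {
    // Part k keeps the bits at positions congruent to k modulo 4.
    let x0 = (x & 0x1111_1111) as u64;
    let x1 = (x & 0x2222_2222) as u64;
    let x2 = (x & 0x4444_4444) as u64;
    let x3 = (x & 0x8888_8888) as u64;
    let y0 = (y & 0x1111_1111) as u64;
    let y1 = (y & 0x2222_2222) as u64;
    let y2 = (y & 0x4444_4444) as u64;
    let y3 = (y & 0x8888_8888) as u64;

    // z_k collects all products x_i * y_j with i + j congruent to k modulo 4.
    // Each part has at most 8 set bits, so every per-position sum is at most 8
    // and stays inside its 4-bit hole. Both factors are below 2^32, so the
    // u64 products cannot overflow.
    let z0 = (x0 * y0) ^ (x1 * y3) ^ (x2 * y2) ^ (x3 * y1);
    let z1 = (x0 * y1) ^ (x1 * y0) ^ (x2 * y3) ^ (x3 * y2);
    let z2 = (x0 * y2) ^ (x1 * y1) ^ (x2 * y0) ^ (x3 * y3);
    let z3 = (x0 * y3) ^ (x1 * y2) ^ (x2 * y1) ^ (x3 * y0);

    // The lowest bit of each per-position sum is the XOR we want; the mask
    // discards the carry bits that accumulated in the holes.
    (z0 & 0x1111_1111_1111_1111)
        | (z1 & 0x2222_2222_2222_2222)
        | (z2 & 0x4444_4444_4444_4444)
        | (z3 & 0x8888_8888_8888_8888)
}
```

With 64-bit operands and a full 128-bit result, each part would have up to 16 set bits, so a per-position sum can reach 16 and no longer fits into a 4-bit hole, which matches the problem described above.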
Hi there,
while looking at the aes-gcm implementation I noticed that the performance of the portable gf128 implementation could be improved. This PR replaces the previous gf128 multiplication with an optimized carry-less 128 x 128 -> 256 bit multiplication followed by the usual reduction by the irreducible polynomial (same reduce as in platform/x64/gf128_core.rs). This improves the throughput of the portable aes-gcm implementation by 50-60% (see benchmarks).
The implementation of the carry-less multiplication is based on the algorithm used in bearssl and adapted from the RustCrypto implementation. Contrary to these implementations, I'm making use of Rust's u128 support, allowing me to skip the Rev trick described in the blog post. Originally, I wrote this code for my cryprot-core library.
Limitations
Constant-time: As the clmul64 implementation uses u128 multiplication, it will not be constant-time on targets where this multiplication is not constant-time. Since this implementation requires constant-time 64 x 64 -> 128 bit multiplications (e.g. MULX on x86), I'm doubtful whether it would provide a performance benefit in practice, as those targets will often have PCLMULQDQ (or equivalent) available. There might be some ARM targets which have constant-time MUL/UMULH instructions but don't have the crypto extensions and don't support PMULL. More information about the problem of constant-time multiplications is available here.
32-bit targets: I'm unsure how this implementation would fare on 32-bit targets compared to e.g. the optimized one in RustCrypto/universal-hashes. Some quick napkin math and godbolt experimentation suggest that my implementation should compile to fewer MUL instructions, but I haven't investigated this properly.
Comparison to bearssl/RustCrypto implementation: The prior work uses T x T -> T carry-less multiplications (for T = 32 or T = 64). I believe my version is faster on at least some targets, but I have not suitably benchmarked this.
Verification of this code: As far as I can tell, the current gf128 implementation is not formally verified. I'm not sure whether this optimized version would be harder to formally verify.
From my current understanding of libcrux, I would not recommend merging this PR as is. Especially the limitations around constant-time guarantees would definitely warrant a closer look. But it shows that there is a significant potential for performance improvement in the portable aes-gcm implementation.
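For readers unfamiliar with the multiply-then-reduce structure described above, here is a rough sketch of a 128 x 128 -> 256 bit carry-less multiplication followed by reduction modulo x^128 + x^7 + x^2 + x + 1. It is not the code from this PR or from platform/x64/gf128_core.rs: the names clmul128, reduce and gf128_mul are assumptions for illustration, the multiplication is a plain shift-and-XOR loop written for clarity rather than speed, and bit i of a u128 is taken to be the coefficient of x^i, which differs from the reflected bit order that GCM uses.

```rust
/// Carry-less 128 x 128 -> 256 bit multiplication, returned as (high, low).
/// Simple shift-and-XOR loop; illustrative only.
fn clmul128(a: u128, b: u128) -> (u128, u128) {
    let (mut hi, mut lo) = (0u128, 0u128);
    for i in 0..128 {
        // All-ones when bit i of `b` is set, all-zeros otherwise.
        let mask = 0u128.wrapping_sub((b >> i) & 1);
        lo ^= (a << i) & mask;
        if i > 0 {
            // Bits of `a << i` that fall above position 127.
            hi ^= (a >> (128 - i)) & mask;
        }
    }
    (hi, lo)
}

/// Reduce a 256-bit polynomial modulo x^128 + x^7 + x^2 + x + 1,
/// using x^128 = x^7 + x^2 + x + 1 in the quotient ring.
fn reduce(hi: u128, lo: u128) -> u128 {
    // hi * (x^7 + x^2 + x + 1), keeping the part below x^128 ...
    let folded = hi ^ (hi << 1) ^ (hi << 2) ^ (hi << 7);
    // ... and the part that spilled above x^128 (degree at most 6).
    let overflow = (hi >> 127) ^ (hi >> 126) ^ (hi >> 121);
    // Folding the overflow once more cannot spill again (degree at most 13).
    let folded_again = overflow ^ (overflow << 1) ^ (overflow << 2) ^ (overflow << 7);
    lo ^ folded ^ folded_again
}

/// Multiplication in GF(2^128) under the bit-order assumption stated above.
fn gf128_mul(a: u128, b: u128) -> u128 {
    let (hi, lo) = clmul128(a, b);
    reduce(hi, lo)
}
```

As a quick sanity check, gf128_mul(1 << 127, 2) multiplies x^127 by x, and the resulting x^128 reduces to x^7 + x^2 + x + 1, i.e. 0x87.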
Benchmarks
Baseline:
This PR: