This is part of the source code accompanying the paper "Efficient isochronous fixed-weight sampling with applications to NTRU", specifically for ARMv8-A cores. This is a list of folders and their contents:
- amx, reference, PQC_NEON, rng_opt, speed, vector-polymul-ntru-ntrup: Mostly imported from the repository for the paper "Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs", with changes of our own to implement the proposed shuffling algorithm of our paper. We note reference, PQC_NEON and vector-polymul-ntru-ntrup were themselves imported by the referenced paper from repositories of other papers and documents: the Round 3 submission package of NTRU to the NIST PQC standardization effort, "Optimized Software Implementations of CRYSTALS-Kyber, NTRU, and Saber Using NEON-Based Special Instructions of ARMv8" and "Algorithmic Views of Vectorized Polynomial Multipliers -- NTRU", respectively;
- googletest: a copy of the Google Test library;
- jupyter: contains a Jupyter notebook to support some of our claims from the "Implementation aspects" section;
- KAT: includes both the original KATs from NTRU's submission to the NIST PQC standardization effort (in the subfolder sorting), as well as new KATs generated by us for our proposed sampling by shuffling approach (in the subfolder shuffling);
- speed_results_A53, speed_results_A57, speed_results_A72, speed_results_M1, speed_results_M3: benchmark results in the Cortex-A53, Cortex-A57, Cortex-A72, Apple M1 and Apple M3 cores (for the latter two, the DIT bit is set for data-independent timing, while this bit is not supported by the former three cores);
- test: tests (using the Google Test library) to validate our implementation.
The root folder also includes a helper script to run benchmarks (see instructions below), called run_benchmarks.sh.
We use CMake as our build system. It can be installed using Homebrew with the command brew install cmake. A typical sequence of commands to build the code, starting from the root folder of the repository, is:
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
NOTE: for the tested compilers, there is a register allocation issue when the optimized randombytes routine is compiled in Debug mode (i.e. passing -DCMAKE_BUILD_TYPE=Debug to CMake), and the build fails. The issue does not occur in RelWithDebInfo or Release mode.
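As a minimal sketch of the build steps and the note above, the wrapper below configures and builds the project while refusing build types other than Release and RelWithDebInfo. The function name build_repo is hypothetical and not part of the repository, and the -S/-B form of the cmake command line assumes CMake 3.13 or newer.

```shell
# Hypothetical helper (not part of the repository): configure and build
# from the repository root, rejecting build types known to be a problem.
build_repo() {
  build_repo_type="${1:-Release}"
  case "$build_repo_type" in
    Release|RelWithDebInfo) ;;
    *)
      # Debug is rejected because of the register allocation issue noted above.
      echo "unsupported build type: $build_repo_type" >&2
      return 1
      ;;
  esac
  # Equivalent to: mkdir build; cd build; cmake -DCMAKE_BUILD_TYPE=... ..; make
  cmake -S . -B build -DCMAKE_BUILD_TYPE="$build_repo_type" &&
    cmake --build build
}
```

Calling build_repo with no argument builds in Release mode, and build_repo RelWithDebInfo keeps debug symbols while avoiding the Debug-mode failure.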
Compilation produces many test binaries in the build folder (build/test_* if using the directions in Building the code above). While it is possible to run each binary directly, we recommend using the ctest utility from CMake to run all available tests with a single invocation. ctest also runs additional tests that automate the process of comparing KATs using the PQCgenKAT_kem_* binaries.
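The single ctest invocation mentioned above could look like the following sketch. The function name run_tests is hypothetical, and the default build directory assumes the build folder created in the build steps; --output-on-failure is a standard ctest option that prints the output of failing tests only.

```shell
# Hypothetical helper (not part of the repository): run every registered
# test, including the KAT comparisons, from the build directory.
run_tests() {
  run_tests_dir="${1:-build}"
  if [ ! -d "$run_tests_dir" ]; then
    echo "build directory '$run_tests_dir' not found; build the code first" >&2
    return 1
  fi
  # The subshell keeps the caller's working directory unchanged.
  (cd "$run_tests_dir" && ctest --output-on-failure)
}
```

Running ctest -N inside the build folder lists the registered tests without executing them, which is a quick way to see what a full run covers.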
Compilation produces many benchmarking binaries in the build folder (build/speed_* if using the directions in Building the code above). Each binary may be run directly, or a full benchmark set can be automatically run using the helper scripts described in Benchmarking helper scripts below.
On Linux platforms, a kernel module to enable userspace access to ARM performance counters (including the cycle counters) is required. On macOS platforms, it is necessary to run the code with root privileges (e.g. using sudo) to allow access to the cycle counters.
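The platform difference above can be captured in a small sketch that picks the command prefix for a benchmark binary. The function name cycle_counter_prefix is hypothetical, and on Linux it assumes the required kernel module is already loaded.

```shell
# Hypothetical helper (not part of the repository): print the command
# prefix needed to read the cycle counters on the given platform.
# The optional argument overrides the detected kernel name (for testing).
cycle_counter_prefix() {
  case "${1:-$(uname -s)}" in
    Darwin) echo "sudo" ;;  # macOS: root privileges are required
    *)      echo "" ;;      # Linux: relies on the kernel module instead
  esac
}

# Usage, with <speed_binary> standing for one of the build/speed_* files:
#   $(cycle_counter_prefix) <speed_binary>
```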
We provide the helper script run_benchmarks.sh to automatically run all available benchmarks (except for those related to the RNG). It supports ARMv8-A and ARMv9-A systems under both Linux and macOS, but please note the prerequisites for enabling the cycle counter listed in Running benchmarks above. The script must be run from the root folder of the repository, and places its results in a folder called speed_results_XXX, where XXX is replaced by the name of the CPU in the machine where the script is run. The script provides a core detection feature, which only works for the cores we benchmarked in our paper, but should be easily extensible to other cores. Each executable file that is run creates an associated text file containing the benchmark results, with a self-explanatory naming scheme.
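Because the script must be started from the repository root, a guard like the one sketched below can catch a wrong working directory early. The function name run_all_benchmarks is hypothetical, and the sketch assumes that the script takes no arguments and runs under bash, neither of which is stated here.

```shell
# Hypothetical helper (not part of the repository): run the benchmark
# script only when it is present in the current directory.
run_all_benchmarks() {
  if [ ! -f ./run_benchmarks.sh ]; then
    echo "run_benchmarks.sh not found; change to the repository root first" >&2
    return 1
  fi
  # Assumption: no arguments are needed and bash is a suitable interpreter.
  bash ./run_benchmarks.sh &&
    ls -d speed_results_*   # list the result folders, including the new one
}
```

On macOS the cycle counter prerequisite still applies, so the script may have to be started with root privileges.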
Our work builds upon many other libraries and implementations, with different licenses for each. Any modifications that we make to an existing work are released under the same license as that work. As for our original code, we release it under the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.