spec: jolt-crypto performance optimizations#1453
Conversation
Capture nine targeted BN254 hot-path optimizations identified during post-#1368 review: field_to_fr specialization, MSM batch-normalize, GT sliding-window exp, wNAF Shamir, 4D precomputed Shamir table, parallelized batch_addition post-inversion, cached GLV 2D coeffs, native i128/u128 decomposition, cached Frobenius coefficients. Includes four new jolt-eval invariants and four new perf objectives to mechanically gate correctness and measure impact.
|
Spec Analysis: jolt-crypto Performance Optimizations
Status: Approved — spec is clear enough for one-shot implementation. Summary of what will be built:
Key invariants preserved:
Critical evaluation criteria:
Minor advisory (not gating):
Next step: Run
|
| 2. **MSM batch-normalization**: Replace per-point `b.0.into_affine()` in the `impl_jolt_group_wrapper!` `msm` path with `<$projective>::normalize_batch(...)` so a single inversion amortizes across all input points, matching the pattern already used in `multi_pairing`. | ||
| 3. **GT MSM sliding-window exponentiation with shared squarings**: Replace the serial `for` loop in `Bn254GT::msm` with per-base windowed exponentiation that amortizes squarings across scalar bit positions (e.g., simultaneous multi-exponentiation à la Straus for small batches, or windowed per-base with a shared accumulator for large batches). | ||
| 4. **wNAF signed-digit in Shamir's trick**: Replace naive bit-by-bit double-and-add in `shamir_glv_mul_2d` and `shamir_glv_mul_4d` with wNAF (width-4 for 2D, width-5 for 4D) including sign-aware precomputed odd-multiple tables per base. | ||
| 5. **Precomputed 16-entry Shamir table for 4D GLV online path**: Extend the 2D fixed-base precomputation pattern (`PrecomputedShamir2Table`, 16 entries) to 4D with `PrecomputedShamir4Table` (256 entries = 4 points × 2 sign bits = 8 bits), invoked from `glv_four_scalar_mul_online` and both `dory_g2` vector ops. |
There was a problem hiding this comment.
The optimization title says "Precomputed 16-entry Shamir table for 4D GLV" but the description states it will have 256 entries (PrecomputedShamir4Table with 256 entries = 2^8 from 4 decomposed scalars × 2 sign bits each). The title should say "Precomputed 256-entry Shamir table for 4D GLV online path" to match the actual implementation specification. The 16 entries refers to the existing 2D table being extended, not the new 4D table size.
5. **Precomputed 256-entry Shamir table for 4D GLV online path**: Extend the 2D fixed-base precomputation pattern (`PrecomputedShamir2Table`, 16 entries) to 4D with `PrecomputedShamir4Table` (256 entries = 4 points × 2 sign bits = 8 bits), invoked from `glv_four_scalar_mul_online` and both `dory_g2` vector ops.| 5. **Precomputed 16-entry Shamir table for 4D GLV online path**: Extend the 2D fixed-base precomputation pattern (`PrecomputedShamir2Table`, 16 entries) to 4D with `PrecomputedShamir4Table` (256 entries = 4 points × 2 sign bits = 8 bits), invoked from `glv_four_scalar_mul_online` and both `dory_g2` vector ops. | |
| 5. **Precomputed 256-entry Shamir table for 4D GLV online path**: Extend the 2D fixed-base precomputation pattern (`PrecomputedShamir2Table`, 16 entries) to 4D with `PrecomputedShamir4Table` (256 entries = 4 points × 2 sign bits = 8 bits), invoked from `glv_four_scalar_mul_online` and both `dory_g2` vector ops. | |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
|
Claude spec review session started: https://claude.ai/code/session_013hzc5MRUjJr6fnHgP1zS4k |
|
Spec Analysis: jolt-crypto Performance Optimizations (fresh pass)
Status: Approved — spec is clear enough for one-shot implementation. PR already carries Independent codebase verification (every file/symbol referenced in the spec exists):
Minor advisories — not gating:
None of the above block implementation. Summary of what will be built (unchanged from prior analysis): nine BN254 backend optimizations, four new Generated by Claude Code |
|
Claude implement-spec session started: https://claude.ai/code/session_01TJmR9rwHyvWNByDR5RFsxa |
|
Implementation plan for: jolt-crypto performance optimizations Scaffolding (first):
Optimizations (dependency-respecting order, low risk → high risk):
Order: scaffolding → opt 9 → 7 → 8 → 2 → 1 → 6 → 4 → 5 → 3. Each optimization is committed as its own logical unit once the Parallel tasks: invariants 1–4 (independent files), objectives 1–4 (independent files) — within scaffolding only. Optimizations are sequential because each touches the same hot-path files and needs its own Estimated scope: ~10 modified files in Note on scope: the spec's "Alternatives Considered" §1 explicitly allows splitting into at most three PRs (A: scaffolding + invariants, B: opts 1–5, C: opts 6–9). If time pressure emerges, I will land the infrastructure + optimizations 1, 2, 6, 7, 8, 9 first (the safe, mechanical wins) and leave wNAF + 4D Shamir table + GT sliding-window (opts 3, 4, 5) as a follow-up because they are the most algorithmically involved and merit independent benchmarking. Generated by Claude Code |
Adds four new invariants (MSM vs naive, GLV-vector vs naive, batch-addition vs naive, scalar-decomp reconstruction) and four new performance objectives (g1_msm_1024, g1_scalar_mul, gt_scalar_mul, pedersen_commit_1024) targeting the jolt-crypto BN254 backend. Each invariant implements `#[invariant(Test, Fuzz)]`; each objective is paired with a Criterion bench harness. Also exposes `decomp_2d`/`decomp_4d` as `pub mod` (the enclosing `glv` module is already `#[doc(hidden)]`) so future tests can reference the decomposition helpers directly. https://claude.ai/code/session_01TJmR9rwHyvWNByDR5RFsxa
Implements five of the nine optimizations in the jolt-crypto-perf-optimizations spec (PR #1453): - Opt 1: `field_to_fr` specialization — TypeId-based fast path transmutes `jolt_field::Fr` directly to `ark_bn254::Fr` via `#[repr(transparent)]` layout compatibility, bypassing the byte-serialization roundtrip. The generic byte path is unchanged for other `Field` implementations. - Opt 2: MSM batch-normalize — `impl_jolt_group_wrapper!`'s `msm` now calls `<$projective as CurveGroup>::normalize_batch` on a transmuted `&[$projective]` slice, amortizing a single field inversion across all points instead of inverting z per-point via `into_affine`. The macro's unused `$affine` parameter is dropped. - Opt 6: Parallel post-inversion loop in `batch_g1_additions_multi_affine_inner`. Replaces the serial `for ((set_idx, pair_idx), inv) in pair_info.iter().zip(...)` loop with a `par_iter().enumerate()` pass over `working_sets` that writes into per-set buffers without cross-set contention, using a pre-computed `offsets` array to slice the shared `inverses` vector. - Opt 7: Cache GLV 2D decomposition constants — introduces a `static DECOMP_CONSTANTS: LazyLock<DecompConstants>` holding the `BigInt` form of `SCALAR_DECOMP_COEFFS`, `-n12`, and the subgroup order `r`. Replaces the per-call `.map(BigInt::from_bytes_be)` reconstruction that allocated 5 `BigInt`s on every `decompose_scalar_2d` invocation. - Opt 9: Cache Frobenius coefficients — replaces the `const fn get_frobenius_coefficients()` that rebuilt `Fq2` elements from `MontFp!` literals on each call with a `const FROBENIUS_COEFFICIENTS: FrobeniusCoefficients` evaluated at compile time. `frobenius_psi_power_projective` reads directly from the const value. Optimizations 3 (GT sliding-window MSM), 4 (wNAF signed-digit in Shamir's trick), 5 (precomputed 16-entry 4D Shamir table), and 8 (native i128/u128 arithmetic in `decompose_scalar_2d`) are deferred to a follow-up — they are the most algorithmically involved and merit independent benchmarking. Spec status: partially-implemented. muldiv e2e passes in both `--features host` and `--features host,zk`; `cargo clippy` passes in both modes with `-D warnings`; all jolt-crypto integration tests pass unchanged. https://claude.ai/code/session_01TJmR9rwHyvWNByDR5RFsxa
|
Implementation complete (partial) for: jolt-crypto performance optimizations Landed five of the nine optimizations and all four new Commits:
Changes made:
Evaluation results:
Deferred (to a follow-up PR):
The four new Baseline benchmarks: The four new performance objectives are wired and ready — per-bench baseline capture via Generated by Claude Code |
Benchmark comparison (crates) |
Summary
Spec capturing nine targeted performance optimizations for the
jolt-cryptocrate (merged in #1368), identified during post-merge review. All optimizations preserve the public API and correctness invariants; only wall-clock time and allocator pressure change.Optimizations covered:
field_to_frspecialization — skip byte-serialization roundtrip whenF == jolt_field::Frinto_affinewithnormalize_batchshamir_glv_mul_2d/shamir_glv_mul_4dbatch_g1_additions_multi_affine_innerpost-inversion loopSCALAR_DECOMP_COEFFSinLazyLocki128/u128arithmetic indecompose_scalar_2d(dropnum_bigint)FrobeniusCoefficientsasconstorLazyLockIncludes four new
jolt-evalinvariants (MSM vs naive, GLV vector vs naive, batch-addition vs naive, scalar-decomp reconstruction) and four new performance objectives (jolt_crypto_g1_msm_1024,jolt_crypto_gt_scalar_mul,jolt_crypto_g1_scalar_mul,jolt_crypto_pedersen_commit_1024) to mechanically gate correctness and measure impact.Primary correctness gate is the existing
muldive2e test in both--features hostand--features host,zk.Test plan
/analyze-specto score ambiguity and surface gapsspeclabel (GitHub Action does this automatically)claude-spec-review-requestfor external analysis