perf(bigint): speed up large BigInt x small scalar multiplication #3620
mizchi wants to merge 8 commits into
Add a mul_single_limb fast path that runs a single carry-propagating loop in O(self.len). The dispatch in Mul::mul checks both operands so the fast path fires regardless of operand order.

For factorial(800)-style chains where one operand is always 1 limb:
  wasm    : 255.1 -> 80.2 ms (-68.6%)
  wasm-gc :  93.5 -> 25.2 ms (-73.0%)
  native  :  48.9 -> 20.0 ms (-59.1%)

The n*n path (Karatsuba) is untouched; bigint_square is within noise.
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
This PR optimizes BigInt multiplication for the common “n-limb × 1-limb” case (e.g., factorial-style multiplication chains) by adding a specialized fast path and helper routine.
Changes:
- Adds a len == 1 fast path in Mul::mul to avoid the general grade-school loop overhead.
- Introduces BigInt::mul_single_limb to efficiently multiply by a single radix limb.
    limbs[n] = carry.to_uint()
    n + 1
    }
    { limbs, sign: Positive, len }
Documented in ada88f8. The magnitude-only contract here mirrors grade_school_mul (line 447 in this file returns sign: Positive the same way) and karatsuba_mul — Mul::mul always overwrites sign with the combined sign of the operands at the dispatch site. Renaming to mul_single_limb_abs would break the naming symmetry with those other helpers, so I kept the name and added a doc comment that spells out the contract and warns against direct callers needing a signed product.
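The magnitude-only contract described in this reply can be modeled in a few lines. The following is a minimal Rust sketch, not the PR's MoonBit code: the `BigInt` and `Sign` types, the `general_mul` stand-in for the existing routines, and the free function `mul` are all hypothetical, and the sketch assumes little-endian 32-bit limbs with an empty limb vector meaning zero. It shows the helper always reporting a positive sign while the dispatch site overwrites it.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Sign {
    Positive,
    Negative,
}

// Hypothetical model: little-endian 32-bit limbs, empty vector = zero.
#[derive(Clone, Debug, PartialEq)]
struct BigInt {
    limbs: Vec<u32>,
    sign: Sign,
}

/// Magnitude-only helper: ignores `a.sign` and always reports Positive.
fn mul_single_limb(a: &BigInt, x: u32) -> BigInt {
    if x == 0 {
        return BigInt { limbs: Vec::new(), sign: Sign::Positive };
    }
    let mut limbs = Vec::with_capacity(a.limbs.len() + 1);
    let mut carry = 0u64;
    for &limb in &a.limbs {
        let product = limb as u64 * x as u64 + carry;
        limbs.push(product as u32); // truncation keeps the low 32 bits
        carry = product >> 32;
    }
    if carry != 0 {
        limbs.push(carry as u32);
    }
    BigInt { limbs, sign: Sign::Positive }
}

/// Stand-in for the general routines (plain schoolbook), also magnitude-only.
fn general_mul(a: &BigInt, b: &BigInt) -> BigInt {
    let mut limbs = vec![0u32; a.limbs.len() + b.limbs.len()];
    for (i, &x) in a.limbs.iter().enumerate() {
        let mut carry = 0u64;
        for (j, &y) in b.limbs.iter().enumerate() {
            let t = x as u64 * y as u64 + limbs[i + j] as u64 + carry;
            limbs[i + j] = t as u32;
            carry = t >> 32;
        }
        limbs[i + b.limbs.len()] = carry as u32;
    }
    while limbs.last() == Some(&0) {
        limbs.pop();
    }
    BigInt { limbs, sign: Sign::Positive }
}

/// Dispatch site: picks a routine, then overwrites the sign.
fn mul(a: &BigInt, b: &BigInt) -> BigInt {
    let mut result = if b.limbs.len() == 1 {
        mul_single_limb(a, b.limbs[0])
    } else if a.limbs.len() == 1 {
        mul_single_limb(b, a.limbs[0])
    } else {
        general_mul(a, b)
    };
    // The helpers only compute magnitudes; the combined sign is applied here.
    result.sign = if result.limbs.is_empty() || a.sign == b.sign {
        Sign::Positive
    } else {
        Sign::Negative
    };
    result
}
```

In this model, calling `mul_single_limb` directly on a negative value yields a positive result, which is the pitfall the doc comment warns about; only `mul` produces a correctly signed product.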
    let mut carry = 0UL
    for i in 0..<n {
    let product = self.limbs[i].to_uint64() * xq + carry
    limbs[i] = (product & radix_mask).to_uint()
    carry = product >> radix_bit_len
radix_mask and radix_bit_len are already in the right types: let radix_mask : UInt64 = radix - 1 (line 89) and let radix_bit_len = 32 (line 81, Int, which is the expected shift-width type for UInt64 >> Int in MoonBit). So product & radix_mask is UInt64 & UInt64 and product >> radix_bit_len is UInt64 >> Int, with no implicit conversion. The existing grade_school_mul, karatsuba_mul, and div helpers in this same file all use these constants the same way (see e.g. lines 595, 597, 670–688), so the new code follows established convention and stays in sync if those constants are ever retyped.
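For readers who want to trace the mask-and-shift arithmetic, here is a small Rust model of the single carry-propagating pass. It is a sketch under assumptions, not the MoonBit source: the constants `RADIX_BIT_LEN` and `RADIX_MASK` mirror the names quoted above, and the limb layout (little-endian `u32` limbs, empty slice meaning zero) is assumed for illustration.

```rust
// Assumed model constants, named after radix_bit_len and radix_mask.
const RADIX_BIT_LEN: u32 = 32;
const RADIX_MASK: u64 = (1u64 << RADIX_BIT_LEN) - 1;

/// Multiply a little-endian magnitude by one 32-bit limb in a single pass.
fn mul_single_limb(limbs: &[u32], x: u32) -> Vec<u32> {
    if x == 0 || limbs.is_empty() {
        return Vec::new(); // zero is modeled as an empty limb vector
    }
    let xq = x as u64;
    let mut out = Vec::with_capacity(limbs.len() + 1);
    let mut carry: u64 = 0;
    for &limb in limbs {
        // limb * xq + carry <= (2^32 - 1)^2 + (2^32 - 1) < 2^64,
        // so the 64-bit product cannot overflow.
        let product = (limb as u64) * xq + carry;
        out.push((product & RADIX_MASK) as u32); // low 32 bits become the limb
        carry = product >> RADIX_BIT_LEN; // high 32 bits carry into the next limb
    }
    if carry != 0 {
        out.push(carry as u32); // at most one extra limb is needed
    }
    out
}
```

The loop visits each limb of the large operand exactly once, which is where the O(self.len) bound comes from.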
Address review comment: mul_single_limb always returns sign: Positive, which is the same convention as grade_school_mul and karatsuba_mul -- Mul::mul overwrites sign with the combined sign of the operands. Spell out that contract in the doc so the helper is not mistaken for a general signed scalar multiply.
Summary

BigInt::mul currently dispatches into two general-purpose multiplication routines:

- grade_school_mul - the O(n * m) classical algorithm: a nested loop over both operands' limbs with per-iteration carry propagation and a j < other_len || carry != 0 guard. Used when at least one operand has fewer than karatsuba_threshold (= 50) limbs.
- karatsuba_mul - the recursive O(n^log2(3)) algorithm: splits both operands in half, computes three half-size products, and recombines via the Karatsuba identity. Used when both operands cross the threshold.

For asymmetric cases where one operand fits in a single radix limb, neither routine is ideal. grade_school_mul still pays the general nested-loop overhead, and karatsuba_mul is never reached because the 1-limb operand is below karatsuba_threshold.

This PR adds a mul_single_limb fast path that runs a single carry-propagating loop in O(self.len). The dispatch in Mul::mul checks both operands, so the fast path fires regardless of operand order.

Real-world impact

This is not specific to factorial. It optimizes the common shape where a large BigInt is repeatedly multiplied by a machine-word-sized scalar.

One concrete core-library path is arbitrary-radix parsing. For non-decimal / non-power-of-two radices, the value is naturally built by repeatedly multiplying a growing accumulator acc by the radix base. Here base <= 36, so it is a 1-limb BigInt, while acc grows with the input length. Long base-3/base-5/base-36 inputs therefore repeatedly hit the n-limb * 1-limb case optimized by this PR.

The same shape appears in user code for exact combinatorics and product accumulation, for example factorials, permutations, binomial/product formulas, and loops that repeatedly multiply an accumulator acc by an integer k, where acc grows beyond one limb but k remains small.

The profiler found this because, in the factorial-style workload, grade_school_mul dominated wasm-gc self time. The new fast path removes the general nested-loop overhead for the asymmetric case and turns it into a single carry-propagating pass over the large operand. Workloads with this scalar-growth profile therefore see whole-program speedups on wasm, wasm-gc, and native backends, while balanced n * n multiplication and the JS backend remain effectively unchanged.

Benchmarks

moonbit 0.1.20260522 + this patch, Linux x86_64, wall time (3-run median, no GuestProfiler).

factorial(800) (100 iterations), which repeatedly multiplies a growing accumulator by a 1-limb integer. Scenario: bench/cmd/bigint_ops/main.mbt.

JS is unchanged because the JS backend transpiles BigInt to V8's native BigInt rather than going through grade_school_mul.

Balanced-multiplication probe (repeated squaring of a 30-digit seed x 11 iterations; both operands grow together, so the existing karatsuba_mul path is exercised) confirms the n * n path is not regressed. Scenario: bench/cmd/bigint_square/main.mbt.

Test results

moon test against this branch on all four targets (full core suite):

Notes on the helper

mul_single_limb returns sign: Positive regardless of self.sign. This matches the existing convention of grade_school_mul and karatsuba_mul: they return magnitude-only results, and Mul::mul overwrites sign with the combined sign of the operands at the dispatch site. The doc comment on mul_single_limb spells this out.
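To make the arbitrary-radix parsing shape from the Real-world impact section concrete, the Rust sketch below models it as Horner-style accumulation: each digit multiplies the accumulator by the base and adds the digit. This is an illustration under assumptions, not the core library's parser: `mul_add_small` and `parse_radix` are hypothetical names, limbs are assumed to be little-endian `u32` values with an empty vector meaning zero, and folding the digit into the initial carry is a simplification made here rather than something the PR does.

```rust
/// In place: acc = acc * base + digit, in one pass over the limbs.
/// The digit is seeded as the initial carry.
fn mul_add_small(acc: &mut Vec<u32>, base: u32, digit: u32) {
    let mut carry = digit as u64;
    for limb in acc.iter_mut() {
        // (2^32 - 1)^2 + (2^32 - 1) < 2^64, so this cannot overflow.
        let product = (*limb as u64) * (base as u64) + carry;
        *limb = product as u32; // truncation keeps the low 32 bits
        carry = product >> 32;
    }
    if carry != 0 {
        acc.push(carry as u32);
    }
}

/// Parse an unsigned number in the given base (2..=36) into limbs.
/// Returns None for an unsupported base or an invalid digit.
fn parse_radix(text: &str, base: u32) -> Option<Vec<u32>> {
    if !(2..=36).contains(&base) {
        return None;
    }
    let mut acc = Vec::new();
    for ch in text.chars() {
        let digit = ch.to_digit(base)?;
        // The accumulator grows with the input; the base stays one limb wide.
        mul_add_small(&mut acc, base, digit);
    }
    Some(acc)
}
```

Every input character triggers one n-limb by 1-limb multiplication, so the total cost of parsing is driven almost entirely by the asymmetric case that the fast path targets.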