perf(bigint): speed up large BigInt x small scalar multiplication #3620

Open
mizchi wants to merge 8 commits into moonbitlang:main from mizchi:pr-bigint-mul-single-limb

@mizchi mizchi commented May 23, 2026

Summary

BigInt::mul currently dispatches into two general-purpose multiplication routines:

  • grade_school_mul - the O(n * m) classical algorithm: a nested loop over both operands' limbs with per-iteration carry propagation and a j < other_len || carry != 0 guard. Used when at least one operand has fewer than karatsuba_threshold (= 50) limbs.
  • karatsuba_mul - the recursive O(n^log2(3)) algorithm: splits both operands in half, computes three half-size products, and recombines via the Karatsuba identity. Used when both operands cross the threshold.
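For readers unfamiliar with the recombination step, here is a one-level sketch of the Karatsuba identity in Python, using plain integers instead of limb arrays. It is one common formulation and not necessarily how karatsuba_mul is written; the function name and the half_bits parameter are illustrative.

```python
def karatsuba_one_level(x: int, y: int, half_bits: int) -> int:
    """Multiply two non-negative integers using three half-size products."""
    mask = (1 << half_bits) - 1
    x1, x0 = x >> half_bits, x & mask  # high and low halves of x
    y1, y0 = y >> half_bits, y & mask  # high and low halves of y
    z2 = x1 * y1                       # high product
    z0 = x0 * y0                       # low product
    # Middle term recovered from one extra product instead of two.
    z1 = (x1 + x0) * (y1 + y0) - z2 - z0
    return (z2 << (2 * half_bits)) + (z1 << half_bits) + z0
```

The saving comes from computing z1 with a single multiplication, which is what gives the sub-quadratic bound when the split is applied recursively.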

For asymmetric cases where one operand fits in a single radix limb, neither routine is ideal. grade_school_mul still pays the general nested-loop overhead, and karatsuba_mul is never reached because the 1-limb operand is below karatsuba_threshold.

This PR adds a mul_single_limb fast path that runs a single carry-propagating loop in O(self.len). The dispatch in Mul::mul checks both operands, so the fast path fires regardless of operand order.
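A minimal sketch of the single-pass idea, written in Python rather than MoonBit so it can be run standalone. It assumes little-endian 32-bit limbs; to_limbs, from_limbs, mul_magnitude, and the general_mul parameter are illustrative names and not taken from the patch.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(value: int) -> list:
    """Split a non-negative integer into little-endian 32-bit limbs."""
    limbs = []
    while True:
        limbs.append(value & LIMB_MASK)
        value >>= LIMB_BITS
        if value == 0:
            return limbs

def from_limbs(limbs: list) -> int:
    """Inverse of to_limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def mul_single_limb(limbs: list, x: int) -> list:
    """One carry-propagating pass: O(len(limbs)) limb multiplications."""
    carry = 0
    out = []
    for limb in limbs:
        product = limb * x + carry
        out.append(product & LIMB_MASK)  # low 32 bits stay in this limb
        carry = product >> LIMB_BITS     # high bits move to the next limb
    if carry:
        out.append(carry)
    return out

def mul_magnitude(a: list, b: list, general_mul) -> list:
    """Dispatch sketch: check both operands so operand order does not matter."""
    if len(b) == 1:
        return mul_single_limb(a, b[0])
    if len(a) == 1:
        return mul_single_limb(b, a[0])
    return general_mul(a, b)  # placeholder for the existing general routines
```

This sketch does not normalize a zero result; it assumes zero operands are handled before the dispatch.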

Real-world impact

This is not specific to factorial. It optimizes the common shape where a large BigInt is repeatedly multiplied by a machine-word-sized scalar.

One concrete core-library path is arbitrary-radix parsing. For non-decimal / non-power-of-two radices, the value is naturally built as:

acc = acc * base + digit

Here base <= 36, so it is a 1-limb BigInt, while acc grows with the input length. Long base-3/base-5/base-36 inputs therefore repeatedly hit the n-limb * 1-limb case optimized by this PR.
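To make the operand shape concrete, the following Python sketch runs the same recurrence and records how many 32-bit limbs each operand of the multiplication would occupy at every step. It is an illustration only: the helper names are hypothetical, and the 32-bit limb width is taken from the radix_bit_len = 32 constant quoted in the review thread.

```python
LIMB_BITS = 32  # assumption: 32-bit limbs

def limb_count(value: int) -> int:
    """Limbs needed for a non-negative integer (zero still takes one limb)."""
    return max(1, (value.bit_length() + LIMB_BITS - 1) // LIMB_BITS)

def parse_with_shape_trace(text: str, base: int):
    """Parse text via acc = acc * base + digit, recording operand sizes."""
    acc = 0
    shapes = []
    for ch in text:
        digit = int(ch, base)  # int() accepts bases 2..36
        shapes.append((limb_count(acc), limb_count(base)))
        acc = acc * base + digit
    return acc, shapes
```

For a long base-36 input the second element of every recorded pair stays at 1 while the first grows, which is the n-limb * 1-limb shape the fast path targets.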

The same shape appears in user code for exact combinatorics and product accumulation, for example factorials, permutations, binomial/product formulas, and loops of the form:

acc = acc * k

where acc grows beyond one limb but k remains a small integer.

This case surfaced in profiling: in the factorial-style workload, grade_school_mul dominated wasm-gc self time. The new fast path removes the general nested-loop overhead for the asymmetric case and turns it into a single carry-propagating pass over the large operand. Workloads with this scalar-growth profile therefore see whole-program speedups on the wasm, wasm-gc, and native backends, while balanced n * n multiplication and the JS backend remain effectively unchanged.

Benchmarks

moonbit 0.1.20260522 + this patch, Linux x86_64, wall time (3-run median, no GuestProfiler).

factorial(800) (100 iterations), which repeatedly multiplies a growing accumulator by a 1-limb integer. Scenario: bench/cmd/bigint_ops/main.mbt.

| backend | baseline | patched | delta  |
|---------|----------|---------|--------|
| wasm    | 255.1 ms | 80.2 ms | -68.6% |
| wasm-gc | 93.5 ms  | 25.2 ms | -73.0% |
| native  | 48.9 ms  | 20.0 ms | -59.1% |
| js      | 21.2 ms  | 22.3 ms | noise  |

JS is unchanged because the JS backend transpiles BigInt to V8's native BigInt rather than going through grade_school_mul.

Balanced-multiplication probe (repeated squaring of a 30-digit seed x 11 iterations, both operands grow together so the existing karatsuba_mul path is exercised) confirms the n * n path is not regressed. Scenario: bench/cmd/bigint_square/main.mbt.

| backend | baseline | patched  | delta |
|---------|----------|----------|-------|
| wasm    | 257.1 ms | 271.9 ms | noise |
| wasm-gc | 84.8 ms  | 78.2 ms  | -7.8% |
| native  | 57.0 ms  | 58.0 ms  | noise |

Test results

moon test against this branch on all four targets (full core suite):

| target  | result           |
|---------|------------------|
| wasm    | 6500 / 6500 pass |
| wasm-gc | 6500 / 6500 pass |
| js      | 6459 / 6459 pass |
| native  | 6411 / 6411 pass |

Notes on the helper

mul_single_limb returns sign: Positive regardless of self.sign. This matches the existing convention of grade_school_mul and karatsuba_mul: they return magnitude-only results, and Mul::mul overwrites sign with the combined sign of the operands at the dispatch site. The doc comment on mul_single_limb spells this out.
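The contract can be modeled in a few lines of Python. This is a sketch of the convention described above, not the MoonBit code: the Big class and both function names are hypothetical, a plain integer stands in for the limb array, and storing zero as positive is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Big:
    negative: bool
    magnitude: int  # non-negative; stands in for the limb array

def mul_magnitude_only(a: Big, b: Big) -> Big:
    """Helper contract: ignores input signs and always reports a positive result."""
    return Big(negative=False, magnitude=a.magnitude * b.magnitude)

def mul(a: Big, b: Big) -> Big:
    """Dispatch site: take the helper's magnitude, then overwrite the sign."""
    raw = mul_magnitude_only(a, b)
    # Assumption: zero is stored as positive, so it never gets a negative sign.
    negative = (a.negative != b.negative) and raw.magnitude != 0
    return Big(negative=negative, magnitude=raw.magnitude)
```

Calling the helper directly on a negative operand returns a positive value, which is why it should not be treated as a general signed scalar multiply.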

Add a mul_single_limb fast path that runs a single carry-propagating
loop in O(self.len). The dispatch in Mul::mul checks both operands so
the fast path fires regardless of operand order.

For factorial(800)-style chains where one operand is always 1 limb:
  wasm    : 255.1 -> 80.2 ms  (-68.6%)
  wasm-gc :  93.5 -> 25.2 ms  (-73.0%)
  native  :  48.9 -> 20.0 ms  (-59.1%)

The n*n path (Karatsuba) is untouched; bigint_square is within noise.
Copilot AI review requested due to automatic review settings May 23, 2026 09:20

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR optimizes BigInt multiplication for the common “n-limb × 1-limb” case (e.g., factorial-style multiplication chains) by adding a specialized fast path and helper routine.

Changes:

  • Adds a len == 1 fast path in Mul::mul to avoid the general grade-school loop overhead.
  • Introduces BigInt::mul_single_limb to efficiently multiply by a single radix limb.

Comment thread bigint/bigint_nonjs.mbt
limbs[n] = carry.to_uint()
n + 1
}
{ limbs, sign: Positive, len }

Contributor Author

Documented in ada88f8. The magnitude-only contract here mirrors grade_school_mul (line 447 in this file returns sign: Positive the same way) and karatsuba_mul; Mul::mul always overwrites sign with the combined sign of the operands at the dispatch site. Renaming to mul_single_limb_abs would break the naming symmetry with those other helpers, so I kept the name and added a doc comment that spells out the contract and warns against calling it directly when a signed product is needed.

Comment thread bigint/bigint_nonjs.mbt
Comment on lines +410 to +414
let mut carry = 0UL
for i in 0..<n {
let product = self.limbs[i].to_uint64() * xq + carry
limbs[i] = (product & radix_mask).to_uint()
carry = product >> radix_bit_len

Contributor Author

radix_mask and radix_bit_len are already in the right types: let radix_mask : UInt64 = radix - 1 (line 89) and let radix_bit_len = 32 (line 81; an Int, which is the expected shift-width type for UInt64 >> Int in MoonBit). So product & radix_mask is UInt64 & UInt64 and product >> radix_bit_len is UInt64 >> Int, with no implicit conversion. The existing grade_school_mul, karatsuba_mul, and div helpers in this same file all use these constants the same way (see e.g. lines 595, 597, 670–688), so the new code follows established convention and stays in sync if those constants are ever retyped.
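A related sanity check is that the 64-bit intermediate cannot overflow. The Python sketch below models one loop step with the constants quoted above (32-bit radix, mask = radix - 1) and evaluates the worst case where the limb, the scalar, and the incoming carry are all at their maximum. The step function and constant names are illustrative, and the sketch assumes both the scalar and the carry are below the radix.

```python
RADIX_BIT_LEN = 32
RADIX = 1 << RADIX_BIT_LEN
RADIX_MASK = RADIX - 1
U64_MAX = (1 << 64) - 1

def step(limb: int, scalar: int, carry: int):
    """One loop iteration: returns (stored limb, outgoing carry)."""
    product = limb * scalar + carry
    assert product <= U64_MAX  # must fit in an unsigned 64-bit value
    return product & RADIX_MASK, product >> RADIX_BIT_LEN

# Worst case: (2^32 - 1)^2 + (2^32 - 1) = 2^64 - 2^32, which is below 2^64.
worst_low, worst_carry = step(RADIX_MASK, RADIX_MASK, RADIX_MASK)
```

In the worst case the outgoing carry equals 2^32 - 1, so it is again below the radix and the same bound holds for the next iteration.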

mizchi added 5 commits May 23, 2026 18:47
Address review comment: mul_single_limb always returns sign: Positive,
which is the same convention as grade_school_mul and karatsuba_mul --
Mul::mul overwrites sign with the combined sign of the operands. Spell
out that contract in the doc so the helper is not mistaken for a
general signed scalar multiply.
@mizchi mizchi changed the title from "perf(bigint): specialize (n-limb) x (1-limb) multiplication" to "perf(bigint): speed up large BigInt x small scalar multiplication" May 27, 2026