perf(codegen): MSB/LSB → clz/ctz peephole family + if [Br N] [] → br_if#6117
Draft
ggreif wants to merge 8 commits into
Draft
perf(codegen): MSB/LSB → clz/ctz peephole family + if [Br N] [] → br_if#6117ggreif wants to merge 8 commits into
if [Br N] [] → br_if#6117ggreif wants to merge 8 commits into
Conversation
5265cd4 to
b0b0ff1
Compare
Mirrors the LSB-and-1 → ctz family (PRs #3873 / #6103) for the symmetric directions the existing ruleset didn't cover. **MSB-extraction → `clz`** (parallel to `and 1`/`shr_u 31` rules): - Or-pattern `(shr_u | shr_s) (N-1)` so `Int*` signed shifts collapse the same way `Nat*` unsigned ones already do. - `br_if` variants where an upstream `eqz` cancels with the implicit flip (no leg swap; eqz consumed). - i64 (EOP) variants for the boolean reaching `if`/`br_if` via either `i32.wrap_i64` (swap arms — wrap preserves the truth value) or `i64.eqz` (no swap — eqz inverts). `wrap; eqz` is omitted for the `and (1<<63)` shape since wrap of an MSB-set value reads 0. **LSB-via-`shl (N-1)` → `ctz`** (parallel to `and 1`): - `shl 31` / `shl 63` puts the LSB at the top of the word; nonzero iff LSB was set, same boolean as `and 1` but emitted from `compile_(classical|enhanced).ml`'s `pow` body for the square-and-multiply iteration. The `if` variants reach the optimizer post-`eqz; if t e` → `if e t` swap, so we match the swapped shape. `wrap` variants are omitted (wrap of `shl (N-1)` reads 0). **Dropped**: the dead bare `[i64.and 1; if]` rule. It was ill-typed wasm (`if` wants i32 cond), unreachable in practice; PR #6103's commit message claimed to drop it but the diff didn't. Removed for real here. Every rule carries an above-line comment naming the input shape, the target shape, and the swap convention. Where a rule fires from a specific moc emission path (the `pow` exponent test) that's named in the comment too — discovered via `failwith` probes that fired on the first `make quick` pass and pointed at the real source. **Tests**: `test/run/top-bit.mo` extended to cover i32 `shr_s 31`, `and 0x80000000` + let-else `br_if` shapes; new `test/run/top-bit-i64.mo` covers i64 (EOP) Nat64/Int64 with both `wrap` and `i64.eqz` routes. All phases green; `make -j8 quick` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
moc never emits `BrIf` directly — the prior LSB `eqz; br_if` peephole
rules were correct-by-symmetry but had no input source. This adds the
single rule that materialises `BrIf` from moc's preferred `if (br N)
end` shape, fixing the source-side gap.
Fires today on internal codegen with no-result blocks (every
`trans_state{N}` state-equality check goes from `if-br-end` to a
single `br_if`). Does NOT yet fire on user-level break/continue/
let-else because those push a unit value before the br
(`if [Const 0; Br N] []`), needing a local to rewrite — out of scope.
DeBruijn: removing the if's implicit label shifts every reachable br
target down by one, so `Br N` (N ≥ 1) becomes `BrIf (N-1)`. `Br 0`
(targets the if's own end) is a no-op fall-through and collapses to
`Drop` of the cond.
Also drop the `br_if` `CHECK` lines from `top-bit{,-i64}.mo`: the
let-else shape goes through Int-boxing, never producing the
`shr; eqz; br_if` shape my MSB rules expect. Kept the let-else as a
runtime sanity check and dropped the misleading FileCheck claims.
The LSB-via-shl rules had only indirect coverage via `aardvark.mo` /
`assign-ops.mo` (where the original `failwith` probes fired). Add
direct tests that trigger the wrapping-pow body and verify the
peephole output in `$wpow_nat<Nat{32,64}>`:
shl-lsb.mo — classical (i32): `i32.shl 31; eqz; if` → `i32.ctz; if`
shl-lsb-i64.mo — EOP (i64): `i64.shl 63; i64.eqz; if`
→ `i64.ctz; wrap; if`
Both `Nat32 **%` and `Int32 **%` route through the same
`wpow_nat<Nat32>` body (the `Int` form delegates after a negative
check), so one CHECK suffices. The i64 wasm-run goldens match the
expected `running_gc` link error (same shape as `top-bit-i64.mo`'s
goldens).
Both gates (`parse_directives` line 283 and `filecheck` line 620) used
`line.starts_with("//CHECK")` with no space, but 24 of the 27
FileCheck-annotated tests in `test/run/` use the conventional
`// CHECK:` form. Result: FileCheck was silently skipped on the
majority of would-be FileCheck tests — the `[FileCheck]` phase never
appeared in their output, and CHECK lines were unenforced.
Trim leading `/` and whitespace before matching `CHECK`, so both
`// CHECK` and `//CHECK` activate. Manually running FileCheck on the
19 silenced classical-mode tests (the 5 EOP-only ones have no wasm in
default mode) shows all 19 already pass — fix is risk-free.
b0b0ff1 to
2894549
Compare
Contributor
Wasm is little-endian, so `i64.load addr` reads bytes [addr, addr+8)
with the low 32 bits at [addr, addr+4) — exactly what
`i32.load addr` would load. Three cases:
* `i64.load` (sz=None) → `i32.load` (sz=None), align clamped to ≤ 2.
* `i64.load32_{u,s}` (sz=Some(Pack32,_)) → `i32.load` (sz=None) — Pack32 + i32 is degenerate.
* `i64.load{8,16}_{u,s}` (sz=Some(Pack8|Pack16,_)) → `i32.load{8,16}_{u,s}` (sz preserved).
Packed variants are fully spec-preserving (same byte count read either
way; wrap just discards the i64 extension). The full `i64.load → i32.load`
narrows the read from 8 to 4 bytes — same OOM caveat as the existing
`Load; Drop → drop` rule (`but should be fine for the code that we create`).
**Impact** — EOP corpus scan over `test/run/_out/` (266 wasm files):
* Before: 201 `i64.load; i32.wrap_i64` instances in 12 files,
concentrated in Candid/IDL deserialization paths:
`candid.wasm` 53, `idl-opt-tests.wasm` 31, `idl-decoding-bug.wasm` 23,
`idl-extra-variant-bug.wasm` 22, `issue-5606.wasm` / `idl-ops.wasm` /
`idl.wasm` 16 each, …
* After: **0**. Every full-load+wrap shape eliminated.
The constant ~3 packed-load+wrap per file are RTS-contributed (Rust
toolchain output, linked verbatim) and out of scope for the OCaml
peephole.
….eqz`
The high-mask preserves bits [n, 63] and clears [0, n); eqz then tests
whether those high bits are all zero. `shrU n` brings the same bits to
positions [0, 63-n] and eqz asks the identical question. Same predicate,
fewer bytes — sleb128 of small `n` is 1 byte, sleb128 of `-1<<n` is up to 9.
`high_mask_shift c` uses the user's compact characterisation:
`ctz(c) + clz(~c) = 64` (with `c ∈ {0, -1}` ruled out explicitly).
`Wasm.I64.{ctz,clz}` from the upstream interpreter library do the heavy
lifting.
Placement matters: the rule comes AFTER the existing
`i64.and (1<<63); i64.eqz; (if|br_if)` → `i64.clz; wrap; (if|br_if)`
rules so for n=63 in a branch context the MSB-extraction rule fires
first (more specific, produces clz directly). Otherwise this rule
catches all high-mask-and-eqz shapes.
**Impact** — EOP scan post-rule (selected files):
* `candid.wasm` 18 hits
* `words-enhanced.wasm` 7 hits
* `issue-5606.wasm` 5 hits
* `assign-ops.wasm` 4 hits
Each hit saves ~7 bytes of code size and one mask materialisation.
Fires on every `enforce_unsigned_bits env n` in `compile_enhanced.ml`'s
small-word arithmetic kernels (Nat8 / Int8 / Nat16 / Int16 / Nat32 /
Int32 wrapping arithmetic + overflow check).
`i32.and (-1<<n); <bool-context>` → `i32.shrU n; <bool-context>`,
where `bool-context` is one of `i32.eqz`, `if`, or `br_if`. All three
collapse into a single or-pattern by matching the bool op after it
has been consumed into l'; the rule preserves it untouched.
`high_mask_shift_i32` mirrors the i64 helper: `c ≠ 0 ∧ c ≠ -1 ∧
ctz(c) + clz(~c) = 32` via `Wasm.I32.{ctz, clz}`.
Targets classical's `enforce_unsigned_bits env n` — the mask fed
directly into `if (trap)` form (no intervening `eqz`, unlike the
EOP shape). Must come after the i32 MSB-`and 0x80000000` rules so
for n=31 those produce `clz; if e t` (with arm swap) first.
**Impact** — classical-build hits per file (selected):
* `aardvark.wasm` 6
* `assign-ops.wasm` 6
* `candid.wasm` 7
Each hit saves ~2 bytes (sleb128 of `-1<<n` for n=16 is 3 bytes vs
1 byte for `16`).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Extends the existing LSB-and-1 →
ctzpeephole family (PRs #3873 / #6103) with the symmetric MSB-extraction →clzdirection, the LSB-via-shl (N-1)route, and a newif [Br N] []→br_if (N-1)rule that finally lets moc emitBrIfat all. Also unsilences the FileCheck gate (commit 5265cd4) so the new// CHECKdirectives are actually validated end-to-end — that fix is also shipping standalone as #6116.MSB-extraction →
Clzfamily (cf729f046)i32.(shr_u|shr_s) 31; if t e→i32.clz; if e t(or-pattern on the shift opcode)i32.and 0x80000000; if t e→i32.clz; if e ti32.(shr_u|shr_s) 31; eqz; br_if l→i32.clz; br_if li32.and 0x80000000; eqz; br_if l→i32.clz; br_if lwrap_i64(EOP path) and thei64.eqz-route variantsi64.and 1; ifrule (fix(codegen): see throughi32.wrap_i64in LSB-and-1 →ctzpeephole #6103's commit message claimed removal; diff didn't)LSB-via-
shl (N-1)family (cf729f046)i32.shl 31; if t e→i32.ctz; if e t— fires fromcompile_classical.ml'spowbodyi32.shl 31; eqz; br_if l→i32.ctz; br_if li64.shl 63; i64.eqz; if t e→i64.ctz; wrap; if t e— fires fromcompile_enhanced.ml'spowbody (Word64 + TaggedSmallWord variants)i64.shl 63; i64.eqz; br_if l→i64.ctz; wrap; br_if lOriginally landed with
failwithprobes; the probes fired immediately onaardvark.mo(i64) andassign-ops.mo(i32), pinning the source sites to each backend'spowsquare-and-multiply LSB-test. Comments now name those exact emit sites.if [Br N] []→br_if (N-1)(09bd1b652)moc never emitted
BrIfdirectly, so the pre-existing LSBeqz; br_ifrules + the new MSBbr_ifrules were correct-by-symmetry but had no input source. This adds the single rule that materialisesBrIffrom moc's preferredif (br N) endshape. DeBruijn: removing the if's implicit label decrements every reachable br target by 1;Br 0(targets the if's own end) is a no-op fall-through and collapses toDrop.Fires today on internal codegen with no-result blocks — every
trans_state{N}state-equality check goes fromif-br-endto a singlebr_if. Does not yet fire on user-level break/continue/let-else because those push a unit value before the br (if [Const 0; Br N] []); rewriting that needs a local, out of scope here.i64.load*; i32.wrap_i64→i32.load*(c60e99159)Wasm is little-endian, so
i64.load addrreads bytes[addr, addr+8)with the low 32 bits at[addr, addr+4)— exactly whati32.load addrwould load. Three cases:i64.load(sz=None) →i32.load,alignclamped to≤ 2.i64.load32_{u,s}→i32.load(Pack32 + i32 is degenerate).i64.load{8,16}_{u,s}→i32.load{8,16}_{u,s}(szpreserved).Packed variants are fully spec-preserving (same byte count read either way;
wrapjust discards the i64 extension). The fulli64.load → i32.loadnarrows the read from 8 to 4 bytes — same OOM caveat as the existingLoad; Drop → droprule.Impact — EOP corpus scan over
test/run/_out/(268 wasm files):i64.load; i32.wrap_i64instancescandid.wasm53,idl-opt-tests.wasm31,idl-decoding-bug.wasm23,idl-extra-variant-bug.wasm22,idl.wasm/idl-ops.wasm/issue-5606.wasm16 each, …The ~3 packed-load+wrap per file that remain are RTS-side (Rust toolchain output, linked verbatim) and outside the OCaml peephole's reach. CI's auto-bench captured a small cycle win on
typtbl.drun-run.ok(−2 cycles uniformly across 5 measurements) attributable to this rule.i64.and (-1<<n); i64.eqz→i64.shrU n; i64.eqz(30426ab0a)When the constant is a "high mask" (
-1 << n, n ∈ [1, 63]),eqztests "are bits[n, 63]all zero?".i64.shrU nbrings those same bits down to positions[0, 63−n]andeqzasks the identical question. Same predicate, fewer bytes — sleb128 of smallnis 1 byte; sleb128 of-1 << nis up to 9.The detector
c ≠ 0 ∧ c ≠ -1 ∧ ctz(c) + clz(~c) = 64(viaWasm.I64.{ctz, clz}) is the compact characterisation of a contiguous high mask. Thec ∈ {0, -1}edge cases both pass the bare formula (Wasm spec:ctz(0) = clz(0) = 64) but yield meaninglessn; they're ruled out explicitly.Placement matters: must come after the existing MSB
and (1<<63); eqz; (if|br_if) → clz; wrap; (if|br_if)rules. Forn = 63in a branch context the MSB rule is strictly better (producesclzdirectly, no shift). After all the MSB rules fail to match, this rule catches the remaining high-mask-and-eqz shapes.Impact (EOP build, selected files):
candid.wasmwords-enhanced.wasmissue-5606.wasmassign-ops.wasmEach hit saves ~7 bytes of code and one mask materialisation. Fires on every
enforce_unsigned_bits env nincompile_enhanced.ml's small-word arithmetic kernels (Nat8/Int8/Nat16/Int16/Nat32/Int32 wrapping arithmetic + overflow check).Tests (
cf729f046,e8d8b80e2,5265cd41e)test/run/top-bit.mo— classical MSB i32 (Nat32/Int32) →clz; iftest/run/top-bit-i64.mo— EOP MSB i64 (Nat64/Int64) →clz; wrap; iftest/run/shl-lsb.mo— classical LSB-via-shl i32 inwpow_nat<Nat32>→ctz; iftest/run/shl-lsb-i64.mo— EOP LSB-via-shl i64 inwpow_nat<Nat64>→ctz; wrap; ifThe FileCheck-gate fix in
5265cd41e(= #6116) is what makes those// CHECKdirectives actually enforce.Test plan
make -C test/run -j8 quickgreen.EXTRA_MOC_ARGS="--enhanced-orthogonal-persistence" make -C test/run top-bit-i64.only shl-lsb-i64.onlygreen.~~>rewrite in disassembled wat (wasm2wat) before adding it.// CHECKlines pass FileCheck after the gate fix lands.