Skip to content

perf(codegen): MSB/LSB → clz/ctz peephole family + if [Br N] []br_if#6117

Draft
ggreif wants to merge 8 commits into
masterfrom
gabor/clz-msb-peephole
Draft

perf(codegen): MSB/LSB → clz/ctz peephole family + if [Br N] []br_if#6117
ggreif wants to merge 8 commits into
masterfrom
gabor/clz-msb-peephole

Conversation

@ggreif
Copy link
Copy Markdown
Contributor

@ggreif ggreif commented May 13, 2026

Summary

Extends the existing LSB-and-1 → ctz peephole family (PRs #3873 / #6103) with the symmetric MSB-extraction → clz direction, the LSB-via-shl (N-1) route, and a new if [Br N] []br_if (N-1) rule that finally lets moc emit BrIf at all. Also unsilences the FileCheck gate (commit 5265cd4) so the new // CHECK directives are actually validated end-to-end — that fix is also shipping standalone as #6116.

MSB-extraction → Clz family (cf729f046)

  • i32.(shr_u|shr_s) 31; if t ei32.clz; if e t (or-pattern on the shift opcode)
  • i32.and 0x80000000; if t ei32.clz; if e t
  • i32.(shr_u|shr_s) 31; eqz; br_if li32.clz; br_if l
  • i32.and 0x80000000; eqz; br_if li32.clz; br_if l
  • i64 mirrors with wrap_i64 (EOP path) and the i64.eqz-route variants
  • Drops the dead i64.and 1; if rule (fix(codegen): see through i32.wrap_i64 in LSB-and-1 → ctz peephole #6103's commit message claimed removal; diff didn't)

LSB-via-shl (N-1) family (cf729f046)

  • i32.shl 31; if t ei32.ctz; if e t — fires from compile_classical.ml's pow body
  • i32.shl 31; eqz; br_if li32.ctz; br_if l
  • i64.shl 63; i64.eqz; if t ei64.ctz; wrap; if t e — fires from compile_enhanced.ml's pow body (Word64 + TaggedSmallWord variants)
  • i64.shl 63; i64.eqz; br_if li64.ctz; wrap; br_if l

Originally landed with failwith probes; the probes fired immediately on aardvark.mo (i64) and assign-ops.mo (i32), pinning the source sites to each backend's pow square-and-multiply LSB-test. Comments now name those exact emit sites.

if [Br N] []br_if (N-1) (09bd1b652)

moc never emitted BrIf directly, so the pre-existing LSB eqz; br_if rules + the new MSB br_if rules were correct-by-symmetry but had no input source. This adds the single rule that materialises BrIf from moc's preferred if (br N) end shape. DeBruijn: removing the if's implicit label decrements every reachable br target by 1; Br 0 (targets the if's own end) is a no-op fall-through and collapses to Drop.

Fires today on internal codegen with no-result blocks — every trans_state{N} state-equality check goes from if-br-end to a single br_if. Does not yet fire on user-level break/continue/let-else because those push a unit value before the br (if [Const 0; Br N] []); rewriting that needs a local, out of scope here.

i64.load*; i32.wrap_i64i32.load* (c60e99159)

Wasm is little-endian, so i64.load addr reads bytes [addr, addr+8) with the low 32 bits at [addr, addr+4) — exactly what i32.load addr would load. Three cases:

  • i64.load (sz=None) → i32.load, align clamped to ≤ 2.
  • i64.load32_{u,s}i32.load (Pack32 + i32 is degenerate).
  • i64.load{8,16}_{u,s}i32.load{8,16}_{u,s} (sz preserved).

Packed variants are fully spec-preserving (same byte count read either way; wrap just discards the i64 extension). The full i64.load → i32.load narrows the read from 8 to 4 bytes — same OOM caveat as the existing Load; Drop → drop rule.

Impact — EOP corpus scan over test/run/_out/ (268 wasm files):

Stat Before After
Full i64.load; i32.wrap_i64 instances 201 (in 12 files) 0
Top offenders candid.wasm 53, idl-opt-tests.wasm 31, idl-decoding-bug.wasm 23, idl-extra-variant-bug.wasm 22, idl.wasm / idl-ops.wasm / issue-5606.wasm 16 each, …

The ~3 packed-load+wrap per file that remain are RTS-side (Rust toolchain output, linked verbatim) and outside the OCaml peephole's reach. CI's auto-bench captured a small cycle win on typtbl.drun-run.ok (−2 cycles uniformly across 5 measurements) attributable to this rule.

i64.and (-1<<n); i64.eqzi64.shrU n; i64.eqz (30426ab0a)

When the constant is a "high mask" (-1 << n, n ∈ [1, 63]), eqz tests "are bits [n, 63] all zero?". i64.shrU n brings those same bits down to positions [0, 63−n] and eqz asks the identical question. Same predicate, fewer bytes — sleb128 of small n is 1 byte; sleb128 of -1 << n is up to 9.

The detector c ≠ 0 ∧ c ≠ -1 ∧ ctz(c) + clz(~c) = 64 (via Wasm.I64.{ctz, clz}) is the compact characterisation of a contiguous high mask. The c ∈ {0, -1} edge cases both pass the bare formula (Wasm spec: ctz(0) = clz(0) = 64) but yield meaningless n; they're ruled out explicitly.

Placement matters: must come after the existing MSB and (1<<63); eqz; (if|br_if) → clz; wrap; (if|br_if) rules. For n = 63 in a branch context the MSB rule is strictly better (produces clz directly, no shift). After all the MSB rules fail to match, this rule catches the remaining high-mask-and-eqz shapes.

Impact (EOP build, selected files):

File hits
candid.wasm 18
words-enhanced.wasm 7
issue-5606.wasm 5
assign-ops.wasm 4

Each hit saves ~7 bytes of code and one mask materialisation. Fires on every enforce_unsigned_bits env n in compile_enhanced.ml's small-word arithmetic kernels (Nat8/Int8/Nat16/Int16/Nat32/Int32 wrapping arithmetic + overflow check).

Tests (cf729f046, e8d8b80e2, 5265cd41e)

  • test/run/top-bit.mo — classical MSB i32 (Nat32/Int32) → clz; if
  • test/run/top-bit-i64.mo — EOP MSB i64 (Nat64/Int64) → clz; wrap; if
  • test/run/shl-lsb.mo — classical LSB-via-shl i32 in wpow_nat<Nat32>ctz; if
  • test/run/shl-lsb-i64.mo — EOP LSB-via-shl i64 in wpow_nat<Nat64>ctz; wrap; if

The FileCheck-gate fix in 5265cd41e (= #6116) is what makes those // CHECK directives actually enforce.

Test plan

  • make -C test/run -j8 quick green.
  • EXTRA_MOC_ARGS="--enhanced-orthogonal-persistence" make -C test/run top-bit-i64.only shl-lsb-i64.only green.
  • Manually verified each ~~> rewrite in disassembled wat (wasm2wat) before adding it.
  • All four new tests' // CHECK lines pass FileCheck after the gate fix lands.
  • Run-eop / drun / mo-idl on CI.

@ggreif ggreif requested a review from a team as a code owner May 13, 2026 16:51
@ggreif ggreif force-pushed the gabor/clz-msb-peephole branch from 5265cd4 to b0b0ff1 Compare May 13, 2026 17:00
ggreif and others added 4 commits May 13, 2026 19:18
Mirrors the LSB-and-1 → ctz family (PRs #3873 / #6103) for the
symmetric directions the existing ruleset didn't cover.

**MSB-extraction → `clz`** (parallel to `and 1`/`shr_u 31` rules):

- Or-pattern `(shr_u | shr_s) (N-1)` so `Int*` signed shifts collapse
  the same way `Nat*` unsigned ones already do.
- `br_if` variants where an upstream `eqz` cancels with the implicit
  flip (no leg swap; eqz consumed).
- i64 (EOP) variants for the boolean reaching `if`/`br_if` via either
  `i32.wrap_i64` (swap arms — wrap preserves the truth value) or
  `i64.eqz` (no swap — eqz inverts). `wrap; eqz` is omitted for the
  `and (1<<63)` shape since wrap of an MSB-set value reads 0.

**LSB-via-`shl (N-1)` → `ctz`** (parallel to `and 1`):

- `shl 31` / `shl 63` puts the LSB at the top of the word; nonzero
  iff LSB was set, same boolean as `and 1` but emitted from
  `compile_(classical|enhanced).ml`'s `pow` body for the
  square-and-multiply iteration. The `if` variants reach the
  optimizer post-`eqz; if t e` → `if e t` swap, so we match the
  swapped shape. `wrap` variants are omitted (wrap of `shl (N-1)`
  reads 0).

**Dropped**: the dead bare `[i64.and 1; if]` rule. It was ill-typed
wasm (`if` wants i32 cond), unreachable in practice; PR #6103's
commit message claimed to drop it but the diff didn't. Removed
for real here.

Every rule carries an above-line comment naming the input shape, the
target shape, and the swap convention. Where a rule fires from a
specific moc emission path (the `pow` exponent test) that's named in
the comment too — discovered via `failwith` probes that fired on the
first `make quick` pass and pointed at the real source.

**Tests**: `test/run/top-bit.mo` extended to cover i32 `shr_s 31`,
`and 0x80000000` + let-else `br_if` shapes; new `test/run/top-bit-i64.mo`
covers i64 (EOP) Nat64/Int64 with both `wrap` and `i64.eqz` routes.
All phases green; `make -j8 quick` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
moc never emits `BrIf` directly — the prior LSB `eqz; br_if` peephole
rules were correct-by-symmetry but had no input source. This adds the
single rule that materialises `BrIf` from moc's preferred `if (br N)
end` shape, fixing the source-side gap.

Fires today on internal codegen with no-result blocks (every
`trans_state{N}` state-equality check goes from `if-br-end` to a
single `br_if`). Does NOT yet fire on user-level break/continue/
let-else because those push a unit value before the br
(`if [Const 0; Br N] []`), needing a local to rewrite — out of scope.

DeBruijn: removing the if's implicit label shifts every reachable br
target down by one, so `Br N` (N ≥ 1) becomes `BrIf (N-1)`. `Br 0`
(targets the if's own end) is a no-op fall-through and collapses to
`Drop` of the cond.

Also drop the `br_if` `CHECK` lines from `top-bit{,-i64}.mo`: the
let-else shape goes through Int-boxing, never producing the
`shr; eqz; br_if` shape my MSB rules expect. Kept the let-else as a
runtime sanity check and dropped the misleading FileCheck claims.
The LSB-via-shl rules had only indirect coverage via `aardvark.mo` /
`assign-ops.mo` (where the original `failwith` probes fired). Add
direct tests that trigger the wrapping-pow body and verify the
peephole output in `$wpow_nat<Nat{32,64}>`:

  shl-lsb.mo      — classical (i32): `i32.shl 31; eqz; if` → `i32.ctz; if`
  shl-lsb-i64.mo  — EOP (i64):       `i64.shl 63; i64.eqz; if`
                                     → `i64.ctz; wrap; if`

Both `Nat32 **%` and `Int32 **%` route through the same
`wpow_nat<Nat32>` body (the `Int` form delegates after a negative
check), so one CHECK suffices. The i64 wasm-run goldens match the
expected `running_gc` link error (same shape as `top-bit-i64.mo`'s
goldens).
Both gates (`parse_directives` line 283 and `filecheck` line 620) used
`line.starts_with("//CHECK")` with no space, but 24 of the 27
FileCheck-annotated tests in `test/run/` use the conventional
`// CHECK:` form. Result: FileCheck was silently skipped on the
majority of would-be FileCheck tests — the `[FileCheck]` phase never
appeared in their output, and CHECK lines were unenforced.

Trim leading `/` and whitespace before matching `CHECK`, so both
`// CHECK` and `//CHECK` activate. Manually running FileCheck on the
19 silenced classical-mode tests (the 5 EOP-only ones have no wasm in
default mode) shows all 19 already pass — fix is risk-free.
@ggreif ggreif force-pushed the gabor/clz-msb-peephole branch from b0b0ff1 to 2894549 Compare May 13, 2026 17:19
@ggreif ggreif marked this pull request as draft May 13, 2026 17:36
@ggreif ggreif self-assigned this May 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Comparing from 5d3c2f3 to 8ad7469:
In terms of gas, 5 tests improved and the mean change is -0.0%.
In terms of size, 5 tests improved and the mean change is -0.0%.

ggreif and others added 4 commits May 13, 2026 22:14
Wasm is little-endian, so `i64.load addr` reads bytes [addr, addr+8)
with the low 32 bits at [addr, addr+4) — exactly what
`i32.load addr` would load. Three cases:

* `i64.load`           (sz=None)                  → `i32.load` (sz=None), align clamped to ≤ 2.
* `i64.load32_{u,s}`   (sz=Some(Pack32,_))        → `i32.load` (sz=None) — Pack32 + i32 is degenerate.
* `i64.load{8,16}_{u,s}` (sz=Some(Pack8|Pack16,_)) → `i32.load{8,16}_{u,s}` (sz preserved).

Packed variants are fully spec-preserving (same byte count read either
way; wrap just discards the i64 extension). The full `i64.load → i32.load`
narrows the read from 8 to 4 bytes — same OOM caveat as the existing
`Load; Drop → drop` rule (`but should be fine for the code that we create`).

**Impact** — EOP corpus scan over `test/run/_out/` (266 wasm files):

* Before: 201 `i64.load; i32.wrap_i64` instances in 12 files,
  concentrated in Candid/IDL deserialization paths:
  `candid.wasm` 53, `idl-opt-tests.wasm` 31, `idl-decoding-bug.wasm` 23,
  `idl-extra-variant-bug.wasm` 22, `issue-5606.wasm` / `idl-ops.wasm` /
  `idl.wasm` 16 each, …
* After: **0**. Every full-load+wrap shape eliminated.

The constant ~3 packed-load+wrap per file are RTS-contributed (Rust
toolchain output, linked verbatim) and out of scope for the OCaml
peephole.
….eqz`

The high-mask preserves bits [n, 63] and clears [0, n); eqz then tests
whether those high bits are all zero. `shrU n` brings the same bits to
positions [0, 63-n] and eqz asks the identical question. Same predicate,
fewer bytes — sleb128 of small `n` is 1 byte, sleb128 of `-1<<n` is up to 9.

`high_mask_shift c` uses the user's compact characterisation:
`ctz(c) + clz(~c) = 64` (with `c ∈ {0, -1}` ruled out explicitly).
`Wasm.I64.{ctz,clz}` from the upstream interpreter library do the heavy
lifting.

Placement matters: the rule comes AFTER the existing
`i64.and (1<<63); i64.eqz; (if|br_if)` → `i64.clz; wrap; (if|br_if)`
rules so for n=63 in a branch context the MSB-extraction rule fires
first (more specific, produces clz directly). Otherwise this rule
catches all high-mask-and-eqz shapes.

**Impact** — EOP scan post-rule (selected files):

* `candid.wasm`     18 hits
* `words-enhanced.wasm` 7 hits
* `issue-5606.wasm` 5 hits
* `assign-ops.wasm` 4 hits

Each hit saves ~7 bytes of code size and one mask materialisation.
Fires on every `enforce_unsigned_bits env n` in `compile_enhanced.ml`'s
small-word arithmetic kernels (Nat8 / Int8 / Nat16 / Int16 / Nat32 /
Int32 wrapping arithmetic + overflow check).
`i32.and (-1<<n); <bool-context>` → `i32.shrU n; <bool-context>`,
where `bool-context` is one of `i32.eqz`, `if`, or `br_if`. All three
collapse into a single or-pattern by matching the bool op after it
has been consumed into l'; the rule preserves it untouched.

`high_mask_shift_i32` mirrors the i64 helper: `c ≠ 0 ∧ c ≠ -1 ∧
ctz(c) + clz(~c) = 32` via `Wasm.I32.{ctz, clz}`.

Targets classical's `enforce_unsigned_bits env n` — the mask fed
directly into `if (trap)` form (no intervening `eqz`, unlike the
EOP shape). Must come after the i32 MSB-`and 0x80000000` rules so
for n=31 those produce `clz; if e t` (with arm swap) first.

**Impact** — classical-build hits per file (selected):

* `aardvark.wasm`   6
* `assign-ops.wasm` 6
* `candid.wasm`     7

Each hit saves ~2 bytes (sleb128 of `-1<<n` for n=16 is 3 bytes vs
1 byte for `16`).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant