Tracking: close the WAMR↔Wasmtime AOT runtime gap on the SpiderMonkey/TCGC workload (0.13× → ≥0.50×) #798

## Goal

Close the WAMR ↔ Wasmtime **AOT runtime** gap on the SpiderMonkey/TCGC
workload (the keyvault TypeSpec codegen run, `tcgc.compile`) from the
currently-measured **0.13×** to **≥ 0.50×**, with a stretch to parity on the
memory-access-bound portion.

This is the sibling of #393, but for a **different workload class**. #393 tracks
CoreMark — a tight, register-resident **scalar loop** where the remaining lever is
register allocation (#392). This issue tracks a large, branchy, **call- and
memory-traffic-heavy real program** (a JS engine running a TypeScript compiler),
where a different set of levers dominates. The two roadmaps are complementary and
should not be merged.

## Measurement (2026-06-06, this `D16pds_v6` x86_64 VM, `origin/main` @ abd1a7fb)

Both runtimes were given **fully precompiled** artifacts (WAMR AOT sidecars /
`wasmtime compile` `.cwasm`), so the numbers below are **load + execute only**,
no compilation. Workload: `codegen/cli/scripts/run.sh` over
`keyvault/data-plane/Secrets`. Both emit the identical, correct
`got 58187 bytes back from tcgc.compile` and byte-identical generated output.

| Runtime (precompiled) | run 1 | run 2 | run 3 | mean | vs Wasmtime |
|---|---:|---:|---:|---:|---:|
| **WAMR** AOT | 21.43 s | 21.23 s | 21.69 s | **21.4 s** | **0.13×** |
| **Wasmtime** (`--allow-precompiled`) | 2.85 s | 2.77 s | 2.89 s | **2.8 s** | 1.0× |

(For context only — one-time precompile, excluded above: WAMR `compile-component`
≈ 49.8 s; `wasmtime compile` ≈ 2.4 s.)

WAMR's generated code runs this workload **~7.6× slower** than Wasmtime's
Cranelift output. That is a *bigger* gap than CoreMark's 0.34×, which points at
levers that barely move CoreMark but dominate a memory/dispatch-heavy engine.
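
The two headline ratios follow directly from the table. A minimal sketch of the arithmetic, using only the six run times reported above (the variable names are illustrative, not taken from any harness in the repo):

```python
from statistics import mean

# Wall-clock times in seconds, copied from the measurement table.
wamr_runs = [21.43, 21.23, 21.69]
wasmtime_runs = [2.85, 2.77, 2.89]

wamr_mean = mean(wamr_runs)
wasmtime_mean = mean(wasmtime_runs)

# "vs Wasmtime" column: WAMR speed as a fraction of Wasmtime speed.
relative_speed = wasmtime_mean / wamr_mean
# The same gap expressed as a slowdown factor.
slowdown = wamr_mean / wasmtime_mean

print(f"relative speed: {relative_speed:.2f}x, slowdown: {slowdown:.1f}x")
```

Rounded, this gives the 0.13× and ~7.6× figures quoted in this section.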

## Why the gap is bigger here than on CoreMark

CoreMark keeps ~5–10 values in registers and rarely touches linear memory.
SpiderMonkey is the opposite: a torrent of linear-memory loads/stores and
indirect calls. The two costs that are ~free on CoreMark but enormous here:

### Lever 1 (primary): explicit per-access bounds checks → adopt a guard-page memory model

Every wasm `.load`/`.store` currently emits an **inline bounds check**:
`emitMemBoundsCheck` (`src/compiler/codegen/x86_64/compile.zig:358`) computes
`end = addr + offset + size`, compares it against `VmCtx.memory_size`
(`[vmctx+8]`), and conditionally calls `trap_oob_fn` — roughly 5–7 extra
instructions **plus a branch on every single memory access**
(`compile.zig:3556` for `.load`, `:3601` for `.store`; bulk-mem variants at
`:2464`, `:394`). On a JS engine this is the dominant per-instruction tax.
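
As a behavioural model only (not the emitted x86_64 code), the check described above can be sketched in Python. `WasmTrap` and `checked_load` are hypothetical names standing in for `trap_oob_fn` and the inline sequence, and the exact comparison (`end > memory_size`) is an assumption about how `emitMemBoundsCheck` orders its operands:

```python
class WasmTrap(Exception):
    """Stands in for the call to trap_oob_fn."""


def checked_load(memory: bytearray, addr: int, offset: int, size: int) -> int:
    """Model of one wasm load with an explicit bounds check.

    addr   -- dynamic 32-bit index from the wasm operand stack
    offset -- static offset immediate encoded in the instruction
    size   -- access width in bytes
    """
    # end = addr + offset + size, compared against the current memory size.
    end = addr + offset + size
    if end > len(memory):
        raise WasmTrap("out of bounds memory access")
    # Only after the compare-and-branch does the actual load happen.
    return int.from_bytes(memory[addr + offset:end], "little")


memory = bytearray(65536)  # one 64 KiB wasm page
memory[65532:65536] = (0xDEADBEEF).to_bytes(4, "little")
print(hex(checked_load(memory, 65532, 0, 4)))
```

The add, compare, and branch run on every access, including those that are always in bounds, which is the cost the guard-page model removes.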

Wasmtime elides these entirely: it mmaps a 4 GiB reservation + guard region and
lets a hardware fault (SIGSEGV) become a wasm trap, so a 32-bit-wasm load is just
`mov dst, [base + addr + disp]`. The repro script even passes
`-W max-memory-size=4294967296` to Wasmtime specifically to enable this.

**The good news: WAMR already has most of the machinery.** `MemoryInstance.createReserved`
(`src/runtime/common/types.zig:552`) already pins linear memory to a stable
4 GiB virtual reservation (`platform.reserveAddressSpace`,
`src/platform/platform.zig:270`; `supports_reserved_memory = !is_windows`,
`platform.zig:265`) — added for the #752 stable-address fix. Codegen simply
**doesn't exploit it yet**. Proposed work (own child issue, likely multi-PR):
- Size the reservation as 4 GiB + a guard tail covering the max foldable static
  offset, leaving the tail `PROT_NONE`.
- Install a SIGSEGV/SIGBUS handler that longjmps to the trap path. The runtime
  already anticipates this: see the note at `src/runtime/aot/runtime.zig:628`
  ("A future change could thread the trap back through a setjmp/longjmp
  path"), and there is existing `sigaction` usage at `runtime.zig:208`.
- Gate `emitMemBoundsCheck` **off** when the guard model is active; keep the
  explicit-check path as the fallback (Windows, shared/64-bit memories, offsets
  the guard can't statically bound).
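
The sizing and gating rules in the list above come down to a little address arithmetic. The sketch below is a model under stated assumptions, not the planned implementation: the 4 KiB page size, the 16-byte maximum access width (a `v128` access), and every function name are illustrative.

```python
WASM32_INDEX_SPACE = 1 << 32  # bytes reachable by a 32-bit wasm index (4 GiB)
PAGE_SIZE = 4096              # assumed host page size


def guard_tail_bytes(max_foldable_offset: int, max_access_size: int = 16) -> int:
    """Page-aligned PROT_NONE tail that covers index_max + offset + access."""
    needed = max_foldable_offset + max_access_size
    return -(-needed // PAGE_SIZE) * PAGE_SIZE  # round up to a whole page


def reservation_bytes(max_foldable_offset: int) -> int:
    """Total virtual reservation: 4 GiB plus the guard tail."""
    return WASM32_INDEX_SPACE + guard_tail_bytes(max_foldable_offset)


def can_elide_bounds_check(static_offset: int, access_size: int, guard_bytes: int,
                           *, guard_model_active: bool = True,
                           is_memory64: bool = False,
                           is_shared: bool = False) -> bool:
    """True when a hardware fault is guaranteed to catch any OOB access."""
    if not guard_model_active or is_memory64 or is_shared:
        return False  # fallback: keep the explicit-check path
    # Conservative: the furthest byte touched must stay inside the reservation.
    return static_offset + access_size <= guard_bytes


guard = guard_tail_bytes(max_foldable_offset=1 << 20)
print(guard, can_elide_bounds_check(1 << 20, 16, guard))
```

The check is deliberately conservative: any access whose static offset could reach past the guard tail keeps the explicit bounds check. The model also assumes that pages inside the 4 GiB window but beyond the current memory size are inaccessible, so they fault in the same way as the tail.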

This single lever alone is plausibly worth a **2–4×** speedup on this workload.

### Lever 2: drop the per-access vmctx reload

Each `.load`/`.store` begins with `mov r10, rbx` (`compile.zig:3552`, `:3596`)
to stage `VmCtx*` for the bounds check — even though `rbx` is **pinned to vmctx
for the whole function** (#465). The check could read `[rbx+8]` directly. Once
Lever 1 lands this disappears for the common path; until then it is a free
one-instruction-per-access win and a small standalone PR.

### Lever 3: `call_indirect` fast path

SpiderMonkey dispatches through function tables constantly. Audit the
`call_indirect` lowering (`compile.zig:3279`, signature check at `:190`) against
Wasmtime's: cache the expected type id, do a single funcref/type-id load +
compare, and hoist table-bounds/type-id invariants out of hot dispatch loops.
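
For reference while auditing, the checks that any `call_indirect` lowering has to preserve can be modelled as follows. This is a semantic sketch in Python, not WAMR's or Wasmtime's lowering; `FuncRef`, `type_id`, and `WasmTrap` are hypothetical names, and comparing a single integer type id assumes signatures have been canonicalised to ids ahead of time.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class WasmTrap(Exception):
    """Raised where generated code would branch to a trap."""


@dataclass(frozen=True)
class FuncRef:
    type_id: int              # canonical signature id (assumed precomputed)
    code: Callable[..., int]  # stands in for the function's entry point


def call_indirect(table: Sequence[Optional[FuncRef]], index: int,
                  expected_type_id: int, *args: int) -> int:
    # 1. Table bounds check on the dynamic index.
    if not 0 <= index < len(table):
        raise WasmTrap("undefined element")
    # 2. Single load of the funcref, then a null check.
    ref = table[index]
    if ref is None:
        raise WasmTrap("uninitialized element")
    # 3. One compare of the entry's type id against the call site's expectation.
    if ref.type_id != expected_type_id:
        raise WasmTrap("indirect call type mismatch")
    # 4. The call itself.
    return ref.code(*args)


table = [FuncRef(type_id=7, code=lambda a, b: a + b), None]
print(call_indirect(table, 0, 7, 2, 3))
```

The fast path is steps 1 to 4 with nothing else in between. In a hot dispatch loop, the table length and the expected type id do not change between iterations, so they are the natural candidates for hoisting.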

### Lever 4: regalloc quality on the monster function (links #392, #780)

This workload's hot code concentrates in one giant function: core 4's
`local_func=11396` lowers to ~549k instructions, where WAMR's linear-scan
allocator spills heavily. The SSA-aware regalloc (#392) and the emit-phase
super-linearity (#780) both feed this. Unlike CoreMark, allocation quality here
is dominated by a single enormous function, so #392's payoff may be larger on
this workload than on CoreMark.

## Suggested order (dependency-aware)

```
Tier 0  ── profiling harness for THIS workload (perf-attribute the 21s)   ← do first
Tier 1  ┬─ Lever 1  guard-page memory model  (biggest; multi-PR)
        ├─ Lever 2  drop per-access vmctx reload  (tiny, do alongside L1)
        ├─ Lever 3  call_indirect fast path
        └─ Lever 4  regalloc on monster fns (depends on #392, #780)
```

**Tier 0 first.** Add a repeatable harness (sibling of #381's CoreMark harness)
that times the precompiled keyvault e2e for both runtimes and captures a
`perf record` of the WAMR run, so we can attribute the 21 s across bounds checks
vs. dispatch vs. regalloc/spills rather than guessing. The lever ordering above
is a hypothesis from code inspection; the profile should confirm it before large
work starts.
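
A minimal sketch of what the timing half of such a harness could look like, in the style of the existing `bench_*.py` scripts. The function names are hypothetical, and the real WAMR and Wasmtime command lines are deliberately left to the caller rather than guessed here; the demo at the bottom times a no-op Python process as a stand-in.

```python
import statistics
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(argv: Sequence[str], runs: int = 3) -> List[float]:
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failing run abort the benchmark instead of
        # silently contributing a bogus (usually very fast) timing.
        subprocess.run(list(argv), check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def relative_speed(candidate: Sequence[float], baseline: Sequence[float]) -> float:
    """Candidate speed as a fraction of baseline speed (the 'vs Wasmtime' column)."""
    return statistics.mean(baseline) / statistics.mean(candidate)


def with_perf_record(argv: Sequence[str], output: str = "perf.data") -> List[str]:
    """Wrap a command so a profiled run writes its samples to `output`."""
    return ["perf", "record", "-o", output, "--", *argv]


if __name__ == "__main__":
    stand_in = [sys.executable, "-c", "pass"]  # placeholder for the real repro command
    print(time_command(stand_in, runs=3))
```

Keeping the profiled run separate from the timed runs avoids folding `perf` overhead into the reported means.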

## Expected impact budget (rough, not additive)

| Lever | Plausible runtime gain | vs Wasmtime |
|---|---|---|
| (start) | — | 0.13× |
| L1 guard-page (elide bounds checks) | 2–4× | ~0.30–0.45× |
| L2 vmctx reload | small | — |
| L3 call_indirect | +10–25% | ~0.35–0.55× |
| L4 regalloc/#392 on monster fn | +10–20% | ~0.45–0.60× |

Target ≥ 0.50×; reaching parity on the memory-bound portion likely needs L1 plus
sustained codegen-quality work shared with #393/#392.
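
To make the target concrete, the sketch below converts the ≥ 0.50× goal into a wall-clock time and an overall speedup factor, using only the measured run times from the Measurement section. The function name is illustrative.

```python
def required_speedup(current_s: float, baseline_s: float, target_ratio: float):
    """Return (target time in seconds, speedup factor needed to reach it)."""
    target_s = baseline_s / target_ratio  # time at which speed == target_ratio x baseline
    return target_s, current_s / target_s


wamr_mean_s = (21.43 + 21.23 + 21.69) / 3
wasmtime_mean_s = (2.85 + 2.77 + 2.89) / 3

target_s, speedup = required_speedup(wamr_mean_s, wasmtime_mean_s, target_ratio=0.50)
print(f"target WAMR time: {target_s:.1f} s, required overall speedup: {speedup:.1f}x")
```

On these numbers, hitting 0.50× means bringing the WAMR run down to roughly 5.7 s, an overall speedup of about 3.8×.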

## Cross-cutting reminders

- Isolate Zig caches per worktree (see the nvme-worktree workflow / PR #374).
- Every PR must benchmark **both** runtimes (precompiled, load+execute only) on
  the keyvault repro and quote the delta, *and* run `bench_coremark.py` /
  `bench_simd.py` to prove no scalar/SIMD regression.
- Memory-model changes (L1) need new trap-semantics tests: OOB load/store at the
  guard boundary must trap deterministically, `memory.grow` past the committed
  window must still work, and the Windows / no-reservation fallback must keep the
  explicit-check path.

---

_Filed from a runtime A/B of the #743 keyvault repro. Related: #393 (CoreMark
gap), #392 (SSA regalloc), #780 (emit super-linear), #752 (the stable-address
reservation this can reuse), #743 (the workload)._

