Tracking: close the WAMR↔Wasmtime AOT runtime gap on the SpiderMonkey/TCGC workload (0.13× → ≥0.50×) #798

@cataggar

Goal

Close the WAMR ↔ Wasmtime AOT runtime gap on the SpiderMonkey/TCGC
workload (the keyvault TypeSpec codegen run, tcgc.compile) from the
currently-measured 0.13× to ≥ 0.50×, with a stretch to parity on the
memory-access-bound portion.

This is the sibling of #393, but for a different workload class. #393 tracks
CoreMark — a tight, register-resident scalar loop where the remaining lever is
register allocation (#392). This issue tracks a large, branchy, call- and
memory-traffic-heavy real program (a JS engine running a TypeScript compiler),
where a different set of levers dominates. The two roadmaps are complementary
and should not be merged.

Measurement (2026-06-06, this D16pds_v6 x86_64 VM, origin/main @ abd1a7f)

Both runtimes were given fully precompiled artifacts (WAMR AOT sidecars /
wasmtime compile .cwasm), so the numbers below are load + execute only,
no compilation. Workload: codegen/cli/scripts/run.sh over
keyvault/data-plane/Secrets. Both runs print the identical, correct line
"got 58187 bytes back from tcgc.compile" and produce byte-identical generated
output.

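| Runtime (precompiled)          | run 1   | run 2   | run 3   | mean   | vs Wasmtime |
|--------------------------------|---------|---------|---------|--------|-------------|
| WAMR AOT                       | 21.43 s | 21.23 s | 21.69 s | 21.4 s | 0.13×       |
| Wasmtime (--allow-precompiled) | 2.85 s  | 2.77 s  | 2.89 s  | 2.8 s  | 1.0×        |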

(For context only — one-time precompile, excluded above: WAMR compile-component
≈ 49.8 s; wasmtime compile ≈ 2.4 s.)

WAMR's generated code runs this workload ~7.6× slower than Wasmtime's
Cranelift output. That is a bigger gap than CoreMark's 0.34×, which points at
levers that barely move CoreMark but dominate a memory/dispatch-heavy engine.
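
The headline ratio and slowdown can be reproduced from the per-run times in the table. A minimal sketch; the function name relative_speed is illustrative and not part of any harness in the repo:

```python
from statistics import mean

# Wall-clock seconds copied from the measurement table above.
wamr_runs = [21.43, 21.23, 21.69]
wasmtime_runs = [2.85, 2.77, 2.89]


def relative_speed(candidate_runs, baseline_runs):
    """Baseline mean time divided by candidate mean time (1.0 means parity)."""
    return mean(baseline_runs) / mean(candidate_runs)


ratio = relative_speed(wamr_runs, wasmtime_runs)
slowdown = 1.0 / ratio
print(f"WAMR vs Wasmtime: {ratio:.2f}x, slowdown {slowdown:.1f}x")
# prints: WAMR vs Wasmtime: 0.13x, slowdown 7.6x
```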

Why the gap is bigger here than on CoreMark

CoreMark keeps ~5–10 values in registers and rarely touches linear memory.
SpiderMonkey is the opposite: a torrent of linear-memory loads/stores and
indirect calls. The two costs that are ~free on CoreMark but enormous here:

Lever 1 (primary): explicit per-access bounds checks → adopt a guard-page memory model

Every wasm .load/.store currently emits an inline bounds check:
emitMemBoundsCheck (src/compiler/codegen/x86_64/compile.zig:358) computes
end = addr + offset + size, compares it against VmCtx.memory_size
([vmctx+8]), and conditionally calls trap_oob_fn — roughly 5–7 extra
instructions plus a branch on every single memory access
(compile.zig:3556 for .load, :3601 for .store; bulk-mem variants at
:2464, :394). On a JS engine this is the dominant per-instruction tax.
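
For readers who have not looked at emitMemBoundsCheck, the check it emits can be modelled in a few lines. This is a sketch of the semantics only, not the emitted x86_64: the trap condition (end > memory_size) is the standard wasm rule and is assumed to match what the codegen compares, and the names below are illustrative.

```python
class OutOfBoundsTrap(Exception):
    """Stands in for the call to trap_oob_fn."""


def checked_effective_address(addr, static_offset, access_size, memory_size):
    """Return the effective address, or trap if the access leaves linear memory.

    addr          -- dynamic 32-bit index taken from the wasm operand stack
    static_offset -- the immediate offset encoded in the load/store
    access_size   -- width of the access in bytes (1, 2, 4, 8, 16)
    memory_size   -- current linear-memory size in bytes (VmCtx.memory_size)
    """
    # Python ints do not wrap, so this mirrors a check done in 64-bit
    # arithmetic, where u32 + u32 + size cannot overflow.
    end = addr + static_offset + access_size
    if end > memory_size:
        raise OutOfBoundsTrap(f"access ends at {end}, memory is {memory_size} bytes")
    return addr + static_offset


ONE_PAGE = 65536  # one wasm page
# The last in-bounds 4-byte access on a one-page memory starts at 65532.
assert checked_effective_address(65532, 0, 4, ONE_PAGE) == 65532
```

Every load and store pays for the add, the compare, and the branch even though the trap path is essentially never taken, which is the cost Lever 1 aims to remove.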

Wasmtime elides these entirely: it mmaps a 4 GiB reservation + guard region and
lets a hardware fault (SIGSEGV) become a wasm trap, so a 32-bit-wasm load is just
mov dst, [base + addr + disp]. The repro script even passes
-W max-memory-size=4294967296 to Wasmtime specifically to enable this.

The good news: WAMR already has most of the machinery. MemoryInstance.createReserved
(src/runtime/common/types.zig:552) already pins linear memory to a stable
4 GiB virtual reservation (platform.reserveAddressSpace,
src/platform/platform.zig:270; supports_reserved_memory = !is_windows,
platform.zig:265) — added for the #752 stable-address fix. Codegen simply
doesn't exploit it yet. Proposed work (own child issue, likely multi-PR):

  • Size the reservation as 4 GiB + a guard tail covering the max foldable static
    offset, leaving the tail PROT_NONE.
  • Install a SIGSEGV/SIGBUS handler that longjmps to the trap path. Infra already
    contemplated this — see the note at src/runtime/aot/runtime.zig:628 ("A
    future change could thread the trap back through a setjmp/longjmp path"), and
    there is existing sigaction usage at runtime.zig:208.
  • Gate emitMemBoundsCheck off when the guard model is active; keep the
    explicit-check path as the fallback (Windows, shared/64-bit memories, offsets
    the guard can't statically bound).

This single lever is plausibly 2–4× on this workload.
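
The sizing rule in the first bullet comes down to a little arithmetic: with a 32-bit index, the furthest byte an access can touch is (2^32 - 1) + static offset + access size, and the explicit check can be dropped only if that byte is still inside the reservation. A sketch under that assumption; the 2 GiB guard size is a hypothetical value for illustration, not a figure from the codebase:

```python
WASM32_INDEX_SPACE = 1 << 32  # 4 GiB addressable by a 32-bit wasm index


def reservation_bytes(guard_bytes):
    """Total virtual reservation: index space plus the PROT_NONE guard tail."""
    return WASM32_INDEX_SPACE + guard_bytes


def can_elide_bounds_check(static_offset, access_size, guard_bytes):
    """True if even the worst-case access stays inside the reservation.

    When this holds, an out-of-bounds access is guaranteed to land on an
    inaccessible page and fault, so no inline check is needed. Otherwise the
    codegen has to fall back to the explicit-check path.
    """
    worst_case_end = (WASM32_INDEX_SPACE - 1) + static_offset + access_size
    return worst_case_end <= reservation_bytes(guard_bytes)


GUARD = 2 << 30  # hypothetical 2 GiB guard tail
print(can_elide_bounds_check(static_offset=16, access_size=8, guard_bytes=GUARD))
print(can_elide_bounds_check(static_offset=GUARD, access_size=8, guard_bytes=GUARD))
```

The first call is the common case (small field offsets) and can skip the check; the second uses an offset the guard cannot cover and must keep it.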

Lever 2: drop the per-access vmctx reload

Each .load/.store begins with mov r10, rbx (compile.zig:3552, :3596)
to stage VmCtx* for the bounds check, even though rbx is pinned to vmctx
for the whole function (#465). The check could read [rbx+8] directly. Once
Lever 1 lands this disappears for the common path; until then it is a free
one-instruction-per-access win and a small standalone PR.

Lever 3: call_indirect fast path

SpiderMonkey dispatches through function tables constantly. Audit the
call_indirect lowering (compile.zig:3279, signature check at :190) against
Wasmtime's: cache the expected type id, do a single funcref/type-id load +
compare, and hoist table-bounds/type-id invariants out of hot dispatch loops.
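
As a reference point for the audit, these are the three checks wasm requires on every call_indirect, written as a small model. It follows the spec's trap conditions rather than WAMR's actual lowering, and the FuncRef layout is an assumption made for illustration:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class Trap(Exception):
    pass


@dataclass(frozen=True)
class FuncRef:
    type_id: int              # canonical signature id, comparable with ==
    code: Callable[..., int]  # stands in for the compiled function pointer


def call_indirect(table: List[Optional[FuncRef]], index: int,
                  expected_type_id: int, *args: int) -> int:
    if not 0 <= index < len(table):      # 1. table bounds check
        raise Trap("undefined element")
    ref = table[index]
    if ref is None:                      # 2. null funcref check
        raise Trap("uninitialized element")
    if ref.type_id != expected_type_id:  # 3. signature check
        raise Trap("indirect call type mismatch")
    return ref.code(*args)


table = [FuncRef(type_id=7, code=lambda a, b: a + b), None]
print(call_indirect(table, 0, 7, 2, 3))  # prints: 5
```

The fast path described above keeps all three checks but makes each one cheap: the expected type id is a compile-time constant, and anything that is loop-invariant (table base, table length) is loaded once outside the dispatch loop.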

Lever 4: regalloc quality on the monster function (links #392, #780)

This workload's hot code concentrates in one giant function: core 4's
local_func=11396 lowers to ~549k instructions, where WAMR's linear-scan
allocator spills heavily. The SSA-aware regalloc (#392) and the emit-phase
super-linearity (#780) both feed this. Unlike CoreMark, allocation quality here
is dominated by a single enormous function, so #392's payoff may be larger on
this workload than on CoreMark.

Suggested order (dependency-aware)

Tier 0  ── profiling harness for THIS workload (perf-attribute the 21s)   ← do first
Tier 1  ┬─ Lever 1  guard-page memory model  (biggest; multi-PR)
        ├─ Lever 2  drop per-access vmctx reload  (tiny, do alongside L1)
        ├─ Lever 3  call_indirect fast path
        └─ Lever 4  regalloc on monster fns (depends on #392, #780)

Tier 0 first. Add a repeatable harness (sibling of #381's CoreMark harness)
that times the precompiled keyvault e2e for both runtimes and captures a
perf record of the WAMR run, so we can attribute the 21 s across bounds checks
vs. dispatch vs. regalloc/spills rather than guessing. The lever ordering above
is a hypothesis from code inspection; the profile should confirm it before large
work starts.
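
A starting point for the timing half of that harness might look like the following. It is a sketch only: the command shown is a placeholder, and the real harness would substitute the precompiled keyvault invocation for each runtime (for the profile, the WAMR command could be wrapped as perf record -g -- <command>).

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failing run abort the benchmark instead of
        # silently contributing a bogus (usually very fast) sample.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    return {
        "runs": len(samples),
        "mean_s": statistics.mean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


if __name__ == "__main__":
    # Placeholder command; replace with the real WAMR / Wasmtime invocations.
    placeholder = [sys.executable, "-c", "pass"]
    print(summarize(time_command(placeholder)))
```

Reporting min and max next to the mean makes run-to-run noise visible, which matters once the levers start producing gains in the 10–25% range.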

Expected impact budget (rough, not additive)

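| Lever                               | Plausible runtime gain | vs Wasmtime  |
|-------------------------------------|------------------------|--------------|
| (start)                             |                        | 0.13×        |
| L1 guard-page (elide bounds checks) | 2–4×                   | ~0.30–0.45×  |
| L2 vmctx reload                     | small                  |              |
| L3 call_indirect                    | +10–25%                | ~0.35–0.55×  |
| L4 regalloc/#392 on monster fn      | +10–20%                | ~0.45–0.60×  |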

Target ≥ 0.50×; reaching parity on the memory-bound portion likely needs L1 plus
sustained codegen-quality work shared with #393/#392.
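
To make the target concrete, the ratio goal can be converted into an overall speedup and a wall-clock budget for the WAMR run. A small sketch using the rounded figures from the measurement section; the helper names are illustrative:

```python
def required_speedup(current_ratio, target_ratio):
    """Overall speedup WAMR needs to move from current_ratio to target_ratio."""
    return target_ratio / current_ratio


def time_budget_s(baseline_mean_s, target_ratio):
    """Longest WAMR run time that still meets target_ratio vs the baseline."""
    return baseline_mean_s / target_ratio


speedup = required_speedup(current_ratio=0.13, target_ratio=0.50)
budget = time_budget_s(baseline_mean_s=2.8, target_ratio=0.50)
print(f"needed overall speedup: {speedup:.1f}x")
print(f"WAMR time budget: {budget:.1f} s (currently about 21.4 s)")
# prints: needed overall speedup: 3.8x
# prints: WAMR time budget: 5.6 s (currently about 21.4 s)
```

Put differently, the combined effect of all levers has to take the run from roughly 21 s to under 6 s.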

Cross-cutting reminders

  • Isolate Zig caches per worktree (see the nvme-worktree workflow and PR #374,
    "ci: relocate Zig caches outside $GITHUB_WORKSPACE").
  • Every PR must benchmark both runtimes (precompiled, load+execute only) on
    the keyvault repro and quote the delta, and run bench_coremark.py /
    bench_simd.py to prove no scalar/SIMD regression.
  • Memory-model changes (L1) need new trap-semantics tests: OOB load/store at the
    guard boundary must trap deterministically, memory.grow past the committed
    window must still work, and the Windows / no-reservation fallback must keep the
    explicit-check path.

Filed from a runtime A/B of the #743 keyvault repro. Related: #393 (CoreMark
gap), #392 (SSA regalloc), #780 (emit super-linear), #752 (the stable-address
reservation this can reuse), #743 (the workload).

Labels: performance (Performance improvements and benchmarking)