Goal
Close the WAMR ↔ Wasmtime AOT runtime gap on the SpiderMonkey/TCGC
workload (the keyvault TypeSpec codegen run, tcgc.compile) from the
currently measured 0.13× to ≥ 0.50×, with a stretch goal of parity on the
memory-access-bound portion.
This is the sibling of #393, but for a different workload class. #393 tracks
CoreMark — a tight, register-resident scalar loop where the remaining lever is
register allocation (#392). This issue tracks a large, branchy, call- and
memory-traffic-heavy real program (a JS engine running a TypeScript compiler),
where a different set of levers dominates. The two roadmaps are complementary and
should not be merged.
Measurement (2026-06-06, this D16pds_v6 x86_64 VM, origin/main @ abd1a7f)
Both runtimes were given fully precompiled artifacts (WAMR AOT sidecars / wasmtime compile .cwasm output), so the numbers below are load + execute only,
with no compilation. Workload: codegen/cli/scripts/run.sh over keyvault/data-plane/Secrets. Both emit the identical, correct "got 58187 bytes back from tcgc.compile" line and byte-identical generated output.
| Runtime (precompiled) | run 1 | run 2 | run 3 | mean | vs Wasmtime |
|---|---|---|---|---|---|
| WAMR AOT | 21.43 s | 21.23 s | 21.69 s | 21.4 s | 0.13× |
| Wasmtime (--allow-precompiled) | 2.85 s | 2.77 s | 2.89 s | 2.8 s | 1.0× |
(For context only — one-time precompile, excluded above: WAMR compile-component
≈ 49.8 s; wasmtime compile ≈ 2.4 s.)
WAMR's generated code runs this workload ~7.6× slower than Wasmtime's
Cranelift output. That is a bigger gap than CoreMark's 0.34×, which points at
levers that barely move CoreMark but dominate a memory/dispatch-heavy engine.
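As a sanity check on the arithmetic, the ratio and slowdown can be recomputed from the three timed runs per runtime. This is a minimal sketch; the function and variable names are illustrative and not part of the repo.

```python
from statistics import mean

# Wall-clock seconds from the measurement table (load + execute only).
WAMR_AOT_RUNS = [21.43, 21.23, 21.69]
WASMTIME_RUNS = [2.85, 2.77, 2.89]


def relative_speed(candidate_runs, baseline_runs):
    """Return baseline_mean / candidate_mean (1.0 is parity, below 1.0 is slower)."""
    return mean(baseline_runs) / mean(candidate_runs)


speed = relative_speed(WAMR_AOT_RUNS, WASMTIME_RUNS)
print(f"WAMR vs Wasmtime: {speed:.2f}x")
print(f"slowdown: {1 / speed:.1f}x")

# Run time WAMR would need in order to hit the 0.50x target on the same baseline.
target_seconds = mean(WASMTIME_RUNS) / 0.50
print(f"0.50x target: {target_seconds:.1f} s")
```

Computed from the raw runs, the 0.50× goal corresponds to a WAMR run of roughly 5.7 s against the current Wasmtime baseline.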
Why the gap is bigger here than on CoreMark
CoreMark keeps ~5–10 values in registers and rarely touches linear memory.
SpiderMonkey is the opposite: a torrent of linear-memory loads/stores and
indirect calls. The two costs that are ~free on CoreMark but enormous here:
Lever 1 (primary): explicit per-access bounds checks → adopt a guard-page memory model
Every wasm .load/.store currently emits an inline bounds check: emitMemBoundsCheck (src/compiler/codegen/x86_64/compile.zig:358) computes end = addr + offset + size, compares it against VmCtx.memory_size
([vmctx+8]), and conditionally calls trap_oob_fn — roughly 5–7 extra
instructions plus a branch on every single memory access
(compile.zig:3556 for .load, :3601 for .store; bulk-mem variants at :2464, :394). On a JS engine this is the dominant per-instruction tax.
Wasmtime elides these entirely: it mmaps a 4 GiB reservation + guard region and
lets a hardware fault (SIGSEGV) become a wasm trap, so a 32-bit-wasm load is just mov dst, [base + addr + disp]. The repro script even passes -W max-memory-size=4294967296 to Wasmtime specifically to enable this.
The good news: WAMR already has most of the machinery. MemoryInstance.createReserved
(src/runtime/common/types.zig:552) already pins linear memory to a stable
4 GiB virtual reservation (platform.reserveAddressSpace, src/platform/platform.zig:270; supports_reserved_memory = !is_windows, platform.zig:265), added for the #752 stable-address fix. Codegen simply doesn't exploit it yet.
Proposed work (own child issue, likely multi-PR):
1. Size the reservation as 4 GiB + a guard tail covering the max foldable static
offset, leaving the tail PROT_NONE.
2. Install a SIGSEGV/SIGBUS handler that longjmps to the trap path. The infrastructure already
contemplates this: see the note at src/runtime/aot/runtime.zig:628 ("A
future change could thread the trap back through a setjmp/longjmp path"), and
there is existing sigaction usage at runtime.zig:208.
3. Gate emitMemBoundsCheck off when the guard model is active; keep the
explicit-check path as the fallback (Windows, shared/64-bit memories, offsets
the guard can't statically bound).
This single lever is plausibly worth a 2–4× speedup on this workload.
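The decision of when the explicit check can be dropped comes down to address arithmetic, which the following Python model sketches. It is not WAMR code: the function names and the 64 KiB guard size are assumptions chosen for illustration, and the model assumes every byte between the committed memory size and the end of the reservation is inaccessible, so a stray access faults.

```python
ADDR_SPACE_32 = 1 << 32  # a 32-bit wasm address is always below 4 GiB


def explicit_check_traps(addr, offset, size, memory_size):
    """Model of the inline check: trap when addr + offset + size > memory_size."""
    return addr + offset + size > memory_size


def guard_covers(offset, size, guard_bytes):
    """True if no 32-bit address with this static offset can reach past the
    4 GiB reservation plus its PROT_NONE guard tail."""
    worst_case_end = (ADDR_SPACE_32 - 1) + offset + size
    return worst_case_end <= ADDR_SPACE_32 + guard_bytes


def needs_explicit_check(offset, size, guard_bytes, *, reserved=True,
                         shared=False, memory64=False):
    """Decide per access whether the explicit bounds check must stay."""
    if not reserved or shared or memory64:
        return True  # fallback cases: no reservation, shared or 64-bit memory
    return not guard_covers(offset, size, guard_bytes)


GUARD_BYTES = 64 * 1024  # hypothetical guard-tail size, for this example only

# Small static offset: the guard tail absorbs the worst case, check can go.
print(needs_explicit_check(offset=16, size=8, guard_bytes=GUARD_BYTES))
# Static offset larger than the guard: keep the explicit check.
print(needs_explicit_check(offset=1 << 20, size=8, guard_bytes=GUARD_BYTES))
# No reserved address space (e.g. the Windows path): keep the explicit check.
print(needs_explicit_check(offset=16, size=8, guard_bytes=GUARD_BYTES, reserved=False))
```

The guard tail size is therefore a trade-off: a larger tail lets more static offsets be folded into the addressing mode without a check, at the cost of more reserved (but never committed) address space.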
Lever 2: drop the per-access vmctx reload
Each .load/.store begins with mov r10, rbx (compile.zig:3552, :3596)
to stage VmCtx* for the bounds check — even though rbx is pinned to vmctx
for the whole function (#465). The check could read [rbx+8] directly. Once
Lever 1 lands this disappears for the common path; until then it is a free
one-instruction-per-access win and a small standalone PR.
Lever 3: call_indirect fast path
SpiderMonkey dispatches through function tables constantly. Audit the call_indirect lowering (compile.zig:3279, signature check at :190) against
Wasmtime's: cache the expected type id, do a single funcref/type-id load +
compare, and hoist table-bounds/type-id invariants out of hot dispatch loops.
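For reference while auditing, the checks that any call_indirect fast path must preserve can be modeled in a few lines of Python. This is a semantic sketch only, not WAMR's or Wasmtime's lowering; FuncRef, WasmTrap, and the integer type_id representation are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class WasmTrap(Exception):
    """Raised where a real runtime would trap."""


@dataclass
class FuncRef:
    type_id: int                   # canonical signature id (hypothetical representation)
    target: Callable[..., object]  # the callee


def call_indirect(table: List[Optional[FuncRef]], index: int,
                  expected_type_id: int, *args):
    # 1. Table bounds check.
    if not 0 <= index < len(table):
        raise WasmTrap("undefined element")
    # 2. Null (uninitialized) entry check.
    ref = table[index]
    if ref is None:
        raise WasmTrap("uninitialized element")
    # 3. Signature check: expected_type_id is a per-call-site constant,
    #    so this is one load from the entry plus one compare.
    if ref.type_id != expected_type_id:
        raise WasmTrap("indirect call type mismatch")
    return ref.target(*args)


table = [FuncRef(type_id=7, target=lambda a, b: a + b), None]
print(call_indirect(table, 0, 7, 2, 3))
```

Because the expected type id is fixed per call site, only the table length and the entry load vary at run time, which is what makes hoisting invariants out of dispatch loops plausible.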
Lever 4: regalloc quality on the monster function (links #392, #780)
This workload's hot code concentrates in one giant function: core 4's local_func=11396 lowers to ~549k instructions, where WAMR's linear-scan
allocator spills heavily. The SSA-aware regalloc (#392) and the emit-phase
super-linearity (#780) both feed this. Unlike CoreMark, allocation quality here
is dominated by a single enormous function, so #392's payoff may be larger on
this workload than on CoreMark.
Suggested order (dependency-aware)
Tier 0 ── profiling harness for THIS workload (perf-attribute the 21s)   ← do first
Tier 1 ┬─ Lever 1  guard-page memory model       (biggest; multi-PR)
       ├─ Lever 2  drop per-access vmctx reload  (tiny, do alongside L1)
       ├─ Lever 3  call_indirect fast path
       └─ Lever 4  regalloc on monster fns       (depends on #392, #780)
Tier 0 first. Add a repeatable harness (sibling of #381's CoreMark harness)
that times the precompiled keyvault e2e for both runtimes and captures a perf record of the WAMR run, so we can attribute the 21 s across bounds checks
vs. dispatch vs. regalloc/spills rather than guessing. The lever ordering above
is a hypothesis from code inspection; the profile should confirm it before large
work starts.
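A possible shape for the Tier 0 harness, as a hedged Python sketch. The function names are hypothetical, and the actual WAMR and Wasmtime command lines are deliberately left to the caller because they are not spelled out here. The perf wrapper uses only the standard perf record options -g (call graphs) and -o (output file).

```python
import subprocess
import time
from statistics import mean


def time_runs(argv, runs=3):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def perf_record_argv(argv, out_path="perf.data"):
    """Wrap a command so that `perf record` samples it with call graphs."""
    return ["perf", "record", "-g", "-o", out_path, "--", *argv]


def report(wamr_argv, wasmtime_argv, runs=3):
    """Time both runtimes and return the means plus the WAMR-vs-Wasmtime ratio."""
    wamr = mean(time_runs(wamr_argv, runs))
    wasmtime = mean(time_runs(wasmtime_argv, runs))
    return {"wamr_s": wamr, "wasmtime_s": wasmtime, "vs_wasmtime": wasmtime / wamr}
```

A caller would pass the two precompiled-run command lines to report(), then execute perf_record_argv(wamr_argv) once to capture the profile used for attribution.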
Cross-cutting reminders
- Every PR must benchmark both runtimes (precompiled, load+execute only) on
the keyvault repro and quote the delta, and run bench_coremark.py / bench_simd.py to prove no scalar/SIMD regression.
- Memory-model changes (L1) need new trap-semantics tests: OOB load/store at the
guard boundary must trap deterministically, memory.grow past the committed
window must still work, and the Windows / no-reservation fallback must keep the
explicit-check path.
Expected impact budget (rough, not additive)
Target ≥ 0.50×; reaching parity on the memory-bound portion likely needs L1 plus
sustained codegen-quality work shared with #393/#392.
Filed from a runtime A/B of the #743 keyvault repro. Related: #393 (CoreMark
gap), #392 (SSA regalloc), #780 (emit super-linear), #752 (the stable-address
reservation this can reuse), #743 (the workload).