macOS 27 beta: MPSGraph scratch-heap overflow on Gemma 4 12B full-attention (head_dim 512) layers + macOS AOT .aimodelc load regression (AIModelError 3) #27

@john-rocky

Two related Core AI runtime bugs in the macOS 27.0 beta block 12B-class Gemma 4 on the Mac GPU. A clean one-layer bisection isolates the trigger for Bug 1. Numerics are verified correct and the graph runs at smaller sizes, so both blockers are on the runtime / MPSGraph side, not the model side. (Related to #5, another macOS-27-beta MPSGraph lowering bug.)

Environment

  • macOS 27.0 build 26A5353q (beta), Apple M4 Max (applegpu_g16s)
  • coreai-build 3600.67.5.8.1 (MetalToolchain-v27.1.5194)
  • coreai-models pipelined engine (llm-runner / llm-benchmark), COREAI_CHUNK_THRESHOLD=1
  • Model: a Gemma 4 12B dense decode-only pipelined bundle (in-graph embed+head, one growing KV pair, dual head_dim 256 sliding / 512 full, attention_k_eq_v full layers)

Bug 1 — MPSGraph scratch-heap overflow on full-attention (head_dim 512) layers

At the first decode token the engine aborts:

allocateMTLBufferFromMTLHeap: offset 198400 + size 16384 exceeds heap total 212992
.../MPSRuntime/Operations/GPUMemrefOps.mm:687: failed assertion
  'Failed to acquire the source buffer for the ViewOp'

Decisive bisection:

  • --num-layers 5 (all sliding, head_dim 256) → runs (~409 tok/s)
  • --num-layers 6 (adds the first full layer: head_dim 512, 16 query heads) → crashes
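The bisection above can be sketched as a binary search over the layer count. This is a minimal illustration, not the harness used for this report: `runs_ok` is a hypothetical stand-in for "build the bundle with `--num-layers n` and check whether the first decode token survives", the upper bound of 12 is an arbitrary assumption, and the search assumes that once a layer count fails, every larger count fails too.

```python
def first_failing(lo, hi, runs_ok):
    """Return the smallest n in (lo, hi] for which runs_ok(n) is False.

    Assumes runs_ok(lo) is True, runs_ok(hi) is False, and that failures
    are monotonic (every count above the first failing one also fails).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if runs_ok(mid):
            lo = mid      # mid still runs: the trigger is above mid
        else:
            hi = mid      # mid crashes: the trigger is at or below mid
    return hi


def reported_runs_ok(num_layers):
    # Hypothetical predicate that replays the result reported above:
    # 5 layers (all sliding) run, 6 layers (first full layer) crash.
    return num_layers <= 5


# Upper bound 12 is an assumption chosen only for illustration.
trigger = first_failing(1, 12, reported_runs_ok)
print(trigger)
```

In a real run the predicate would launch the engine and inspect its exit status; replaying the reported outcome keeps the sketch self-contained.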

The failing buffer is exactly [1, 16, 1, 512] fp16 = 16384 B, the full layer's q_proj output. Its size scales with the number of full layers (16 KB at 1 full layer, 32 KB at 2), and it overflows MPSGraph's 208 KB (212992 B) decode scratch heap: the requested range ends at 198400 + 16384 = 214784 B, so the heap is short by 1792 B (~2 KB). Sliding-layer Q ([1,16,1,256] = 8 KB) fits, and Gemma 4 E2B/E4B full layers (8 heads × 512 = 8 KB) also fit and run; only the 12B's 16-head × 512 Q tips the heap over. The crash is invariant to every graph-source change tried (KV cache pad↔replicate, uniform narrow, .contiguous() on Q and on K/V, vanilla vs HF SDPA): the heap total, offset, and size are identical each time.
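The sizes quoted above can be checked with a few lines of arithmetic. This sketch uses only the shapes and the heap numbers from the assertion message; the helper name `tensor_bytes` is made up for illustration, and the code says nothing about how MPSGraph actually sizes its heap.

```python
FP16_BYTES = 2  # fp16 stores each element in 2 bytes


def tensor_bytes(shape, elem_bytes=FP16_BYTES):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * elem_bytes


# Decode-step Q tensors described in the report.
full_q_12b = tensor_bytes([1, 16, 1, 512])   # 12B full layer, 16 heads x 512
sliding_q = tensor_bytes([1, 16, 1, 256])    # sliding layer, 16 heads x 256
full_q_small = tensor_bytes([1, 8, 1, 512])  # E2B/E4B full layer, 8 heads x 512

# Values copied from the allocateMTLBufferFromMTLHeap message.
heap_total = 212992
offset = 198400
size = 16384

overflow = offset + size - heap_total  # bytes past the end of the heap

print(f"12B full-layer Q: {full_q_12b} B")
print(f"sliding Q:        {sliding_q} B")
print(f"E2B/E4B full Q:   {full_q_small} B")
print(f"heap total:       {heap_total} B = {heap_total // 1024} KB")
print(f"overflow:         {overflow} B")
```

The requested size matches the 12B full-layer Q tensor exactly, which is what ties the failing allocation to q_proj rather than to a KV-cache buffer.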

Bug 2 — AOT .aimodelc fails to load on macOS (regression vs iOS)

Pre-compiling for the correct M4 Max arch succeeds:

xcrun coreai-build compile <bundle>.aimodel --platform macOS --architecture h16s --expect-frequent-reshapes -o /tmp/aot

…but loading the resulting .aimodelc fails:

CoreAIDelegates.AIModelError error 3      (raw AIModel.load)
invalidCompiledModel                      (llm-runner / LanguageBundle)

This is not specific to the Bug-1 graph: a model that JIT-runs perfectly (--num-layers 5, all sliding) also fails to AOT-load with the same AIModelError 3. So this macOS build cannot load any precompiled .aimodelc for a macOS target, while the same Core AI runtime loads AOT .aimodelc fine on iOS (h18p bundles run on iPhone 17 Pro). The result is the same with or without --expect-frequent-reshapes, and with the source .aimodel present alongside the .aimodelc.

Impact

Together these block all GPU paths for 12B-class Gemma 4 (and likely any model whose per-layer attention intermediate exceeds the scratch heap) on Mac: JIT crashes (Bug 1), AOT-load is rejected (Bug 2), and the CPU delegate also fails to load. If Bug 2 were fixed, AOT would work around Bug 1 exactly as iOS already does.
