Skip to content

fix(msl): declare workgroup vars at function-body scope in kernel entry points#77

Merged
kolkov merged 1 commit into
gogpu:mainfrom
georgebuilds:fix/msl-threadgroup-function-scope
Jun 15, 2026
Merged

fix(msl): declare workgroup vars at function-body scope in kernel entry points#77
kolkov merged 1 commit into
gogpu:mainfrom
georgebuilds:fix/msl-threadgroup-function-scope

Conversation

@georgebuilds

Copy link
Copy Markdown
Contributor

Summary

Workgroup (threadgroup) variables are emitted as threadgroup entry-point parameters, matching Rust naga (back/msl/writer.rs, where workgroup globals become threadgroup T& name kernel arguments). That approach is only correct when the host sizes each threadgroup argument by calling setThreadgroupMemoryLength:atIndex: before dispatch. Rust wgpu-hal does exactly this in its Metal command encoder.

A host that does not perform that setup leaves the threadgroup memory unsized. On Metal an unsized threadgroup parameter reads as zero and writes are dropped, with no validation error. The result is a silent miscompile: any compute shader using var<workgroup> (tiled matmul, reductions, prefix sums, etc.) produces wrong results. The pure-Go gogpu/wgpu Metal HAL has no setThreadgroupMemoryLength call, so every workgroup-memory kernel compiled through this backend is affected.

This change declares the workgroup variables at function-body scope inside the kernel (threadgroup T name;) instead of as parameters. Function-scope threadgroup declarations are legal MSL, are statically sized by the type, and need no host-side setup, so the output is correct regardless of how the HAL drives dispatch. Variable names are unchanged, so all body references and helper-function call sites resolve exactly as before.

A latent issue is fixed in passing: the zero-initialization prologue (and now the new declarations) are filtered by per-entry-point usage, matching Rust naga's !fun_info[handle].is_empty() filter, so an entry point only declares and zeroes the workgroup vars it actually uses rather than every workgroup global in the module.

Scope: MSL backend only. SPIR-V, HLSL, GLSL, and DXIL backends are untouched.

Divergence from Rust naga

This is a deliberate, documented divergence recorded in the snapshot reference allow-list (reason msl-threadgroup-function-scope, 6 affected fixtures). Rust naga can keep the parameter form because its companion HAL sizes the parameters; a Go shader compiler whose consumers do not is better served by the self-contained function-scope form. The generated code is otherwise identical to the parameter-based output (the body, the zero-init prologue, and all references are the same).

Example (snapshot/testdata/golden/msl/workgroup_memory.msl)

Before:

kernel void main_(
, threadgroup type_1& shared_data
, device type_3& output [[user(fake0)]]
...
) {

After:

kernel void main_(
, device type_3& output [[user(fake0)]]
...
) {
    threadgroup type_1 shared_data;
    if (metal::all(__local_invocation_id == metal::uint3(0u))) {
        shared_data = {};
    }

Test plan

  • go build ./... clean (ubuntu/windows/macos targets build)
  • go test ./... passes (9547 tests across 33 packages)
  • go test -race ./msl/... ./snapshot/... passes, no race findings
  • TestRustReference and TestSnapshots pass; goldens regenerated with UPDATE_GOLDEN=1 go test ./snapshot/... and verified byte-identical
  • gofmt -l . empty
  • golangci-lint run reports no new issues; the modified writeEntryPoint triggers no complexity warnings (funlen/gocyclo/gocognit/nestif)

Note: golangci-lint and scripts/pre-release-check.sh flag one pre-existing nolintlint issue in msl/internal/codegen/xcrun_helper_test_darwin.go that exists verbatim on main and is untouched by this PR. I left it out of scope rather than bundle an unrelated lint fix; happy to send it separately if you'd like.

Verified end to end against a pure-Go WebGPU stack on Apple M3: workgroup-memory passthrough returns correct neighbor values for all 64 lanes (was all-zero before), and a tiled matmul using var<workgroup> tiles is bit-exact against a CPU reference (was producing zeros before).

…ry points

Workgroup (threadgroup) variables were emitted as threadgroup entry-point
parameters, matching Rust naga. That approach requires the host to call
setThreadgroupMemoryLength:atIndex: for each threadgroup parameter (Rust
wgpu-hal does this in its Metal encoder). Hosts that do not perform this
setup (for example a pure-Go wgpu Metal HAL) leave the threadgroup memory
unsized, so reads and writes silently no-op and compute results are wrong.

This change declares the workgroup variables at function-body scope inside
the kernel ("threadgroup T name;") instead of as parameters. Function-scope
threadgroup declarations are legal MSL and need no host-side setup. Variable
names are unchanged, so all body references and helper-function call sites
resolve exactly as before. The zero-initialization prologue is now filtered
by per-entry-point usage (matching Rust naga, which filters by
!fun_info[handle].is_empty()), so only workgroup vars actually used by the
entry point are declared and zeroed.

Generated code is otherwise identical to the parameter-based output. The
snapshot reference allow-list records the six affected fixtures with reason
"msl-threadgroup-function-scope", and the MSL goldens are regenerated.
@georgebuilds georgebuilds requested a review from kolkov as a code owner June 14, 2026 21:14
@codecov

codecov Bot commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
msl/internal/codegen/functions.go 85.00% 1 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

@kolkov kolkov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thorough review — verified every claim against Rust source and our Metal HAL. Well-researched PR.

Verification Results

All claims confirmed:

  1. Rust naga patternwriter.rs:7039-7043 emits workgroup vars as threadgroup T& name kernel parameters, filtered by fun_info[handle].is_empty() in three places (parameter emission, need_workgroup_variables_initialization, write_workgroup_variables_initialization).

  2. Rust wgpu-hal Metalcommand.rs:1469 calls set_threadgroup_memory_length(i, size) at dispatch time; device.rs:250-258 populates wg_memory_sizes from naga module. Parameter-form threadgroup memory is correct only because the HAL does this.

  3. Our Metal HAL — confirmed no setThreadgroupMemoryLength call anywhere in wgpu/hal/metal/. There is an explicit TODO at device.go:799:

    // TODO(zero-init-workgroup): Pass desc.Compute.ZeroInitializeWorkgroupMemory
    // to naga MSL options.
  4. MSL spec compliancethreadgroup T name; inside a kernel is legal MSL (Section 4.2). Statically sized by the compiler, no host-side setup needed.

  5. types_with_comments.msl w_mem removal — this is a latent bug fix, not a regression. The WGSL source declares w_mem and w_mem2, but test_ep only uses w_mem2. Rust's compact pass removes unused w_mem entirely. Our code was incorrectly zero-initializing w_mem despite it not being used by the entry point. The per-EP filtering fixes this.

  6. Edge cases verified:

    • Helper functions receive workgroup vars as threadgroup T& parameters — works identically with function-body declarations (address space is inherent to the variable)
    • Multiple entry points (interface.wgsl has 4 EPs, only compute uses workgroup var) — per-EP filtering correct
    • Arrays/structs with atomics (atomicOps.msl, workgroup-var-init.msl) — declaration form semantically equivalent

One Design Question: Allow-List Granularity

The current allow-list is map[string]string — one reason per shader, applied to all backends:

"atomicOps": "workgroup-layout-free",  // covers SPIR-V divergence

After this PR, atomicOps diverges from Rust in two backends for different reasons:

Backend Divergence Reason
SPIR-V Layout-free workgroup types workgroup-layout-free
MSL Function-scope threadgroup vars msl-threadgroup-function-scope

The test passes because the shader is already allow-listed, but the reason string only documents the SPIR-V divergence. Same applies to workgroup-var-init.

Risk: if someone later fixes the SPIR-V workgroup-layout-free divergence and removes atomicOps from the allow-list, MSL tests will unexpectedly fail with no documented reason.

This is a pre-existing design limitation, not something this PR introduced. But since we're adding 6 new allow-list entries, it might be a good time to consider a per-backend allow-list structure, e.g.:

// Option A: per-backend map
type allowEntry struct {
    spv, msl, hlsl, glsl string // empty = must match exactly
}

// Option B: multi-reason string  
"atomicOps": "workgroup-layout-free, msl-threadgroup-function-scope",

What are your thoughts? Happy to keep the current structure if you prefer — it's not a blocker.

Verdict

Approve. Clean implementation, real bug fix, correct MSL, well-documented divergence. The per-EP filtering in writeWorkgroupZeroInit is a bonus fix that matches Rust naga behavior more closely.

@kolkov

kolkov commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

@lkmavi — this PR changes MSL workgroup variable emission (Metal backend). Since you're our macOS expert, could you take a look? Specifically interested in whether function-scope threadgroup T name; declarations work correctly on your M-series hardware vs the former entry-point parameter form.

@lkmavi

lkmavi commented Jun 15, 2026

Copy link
Copy Markdown

@kolkov

  • ✅ Function-scope threadgroup T name; is accepted by the Metal 3.1 runtime on M3 Max — legal MSL confirmed on real
    hardware.
  • ✅ All workgroup shaders in the PR's golden-file update list pass Metal runtime compilation.
  • The 13 pre-existing failures are in unrelated areas (int64 atomics, bounds-check DefaultConstructible, f64,
    clip--istance — all features with known Metal limitations). None are regressions from this change.
  • ⚠️ atomicCompareExchange and atomicOps-int64 contain workgroup vars that emit correctly as function-scope, but the
    shaders fail for reasons unrelated to this PR. Worth a separate tracking issue if these were previously expected to
    compile on Metal.

@kolkov kolkov merged commit 35d01f5 into gogpu:main Jun 15, 2026
11 checks passed
@kolkov

kolkov commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

@lkmavi Thanks for the M3 Max verification — great to have real hardware confirmation!

Could you file separate tracking issues for the two Metal compilation failures you found?

  1. atomicCompareExchange — workgroup vars emit correctly but shader fails on Metal (int64 atomics?)
  2. atomicOps-int64 — same pattern, function-scope threadgroup is fine but other Metal limitations

These are pre-existing, not caused by this PR, but worth tracking explicitly.


@georgebuilds — one follow-up on the allow-list design question from the review.

Currently the allow-list is map[string]string (one reason per shader, applies to all backends). After this PR, atomicOps and workgroup-var-init diverge from Rust in two backends for different reasons:

Shader SPIR-V reason MSL reason
atomicOps workgroup-layout-free msl-threadgroup-function-scope
workgroup-var-init workgroup-layout-free msl-threadgroup-function-scope

Only the SPIR-V reason is recorded. If someone later fixes the SPIR-V divergence and removes the shader from the allow-list, MSL tests will unexpectedly fail.

Would you be interested in a follow-up PR to make the allow-list per-backend? Something like:

// Option A: composite reason string
"atomicOps": "workgroup-layout-free, msl-threadgroup-function-scope",

// Option B: per-backend map
type allowEntry map[string]string // backend → reason
var referenceAllowList = map[string]allowEntry{
    "atomicOps": {"spv": "workgroup-layout-free", "msl": "msl-threadgroup-function-scope"},
}

No rush — just flagging for future robustness.

@kolkov kolkov mentioned this pull request Jun 15, 2026
2 tasks
georgebuilds added a commit to georgebuilds/anneal that referenced this pull request Jun 15, 2026
The MSL function-body threadgroup fix landed upstream and is tagged in
v0.17.15, so the build-only third_party/naga prune and the replace
directive are no longer needed. smem_opt_regression_test green against
the real module; full suite 0 failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants