fix(msl): declare workgroup vars at function-body scope in kernel entry points by georgebuilds · Pull Request #77 · gogpu/naga

georgebuilds · 2026-06-14T21:14:46Z

Summary

Workgroup (threadgroup) variables are emitted as threadgroup entry-point parameters, matching Rust naga (back/msl/writer.rs, where workgroup globals become threadgroup T& name kernel arguments). That approach is only correct when the host sizes each threadgroup argument by calling setThreadgroupMemoryLength:atIndex: before dispatch. Rust wgpu-hal does exactly this in its Metal command encoder.

A host that does not perform that setup leaves the threadgroup memory unsized. On Metal an unsized threadgroup parameter reads as zero and writes are dropped, with no validation error. The result is a silent miscompile: any compute shader using var<workgroup> (tiled matmul, reductions, prefix sums, etc.) produces wrong results. The pure-Go gogpu/wgpu Metal HAL has no setThreadgroupMemoryLength call, so every workgroup-memory kernel compiled through this backend is affected.

This change declares the workgroup variables at function-body scope inside the kernel (threadgroup T name;) instead of as parameters. Function-scope threadgroup declarations are legal MSL, are statically sized by the type, and need no host-side setup, so the output is correct regardless of how the HAL drives dispatch. Variable names are unchanged, so all body references and helper-function call sites resolve exactly as before.

A latent issue is fixed in passing: the zero-initialization prologue (and now the new declarations) are filtered by per-entry-point usage, matching Rust naga's !fun_info[handle].is_empty() filter, so an entry point only declares and zeroes the workgroup vars it actually uses rather than every workgroup global in the module.

Scope: MSL backend only. SPIR-V, HLSL, GLSL, and DXIL backends are untouched.

Divergence from Rust naga

This is a deliberate, documented divergence recorded in the snapshot reference allow-list (reason msl-threadgroup-function-scope, 6 affected fixtures). Rust naga can keep the parameter form because its companion HAL sizes the parameters; a Go shader compiler whose consumers do not is better served by the self-contained function-scope form. The generated code is otherwise identical to the parameter-based output (the body, the zero-init prologue, and all references are the same).

Example (snapshot/testdata/golden/msl/workgroup_memory.msl)

Before:

kernel void main_(
, threadgroup type_1& shared_data
, device type_3& output [[user(fake0)]]
...
) {

After:

kernel void main_(
, device type_3& output [[user(fake0)]]
...
) {
    threadgroup type_1 shared_data;
    if (metal::all(__local_invocation_id == metal::uint3(0u))) {
        shared_data = {};
    }

Test plan

go build ./... clean (ubuntu/windows/macos targets build)
go test ./... passes (9547 tests across 33 packages)
go test -race ./msl/... ./snapshot/... passes, no race findings
TestRustReference and TestSnapshots pass; goldens regenerated with UPDATE_GOLDEN=1 go test ./snapshot/... and verified byte-identical
gofmt -l . empty
golangci-lint run reports no new issues; the modified writeEntryPoint triggers no complexity warnings (funlen/gocyclo/gocognit/nestif)

Note: golangci-lint and scripts/pre-release-check.sh flag one pre-existing nolintlint issue in msl/internal/codegen/xcrun_helper_test_darwin.go that exists verbatim on main and is untouched by this PR. I left it out of scope rather than bundle an unrelated lint fix; happy to send it separately if you'd like.

Verified end to end against a pure-Go WebGPU stack on Apple M3: workgroup-memory passthrough returns correct neighbor values for all 64 lanes (was all-zero before), and a tiled matmul using var<workgroup> tiles is bit-exact against a CPU reference (was producing zeros before).

…ry points Workgroup (threadgroup) variables were emitted as threadgroup entry-point parameters, matching Rust naga. That approach requires the host to call setThreadgroupMemoryLength:atIndex: for each threadgroup parameter (Rust wgpu-hal does this in its Metal encoder). Hosts that do not perform this setup (for example a pure-Go wgpu Metal HAL) leave the threadgroup memory unsized, so reads and writes silently no-op and compute results are wrong. This change declares the workgroup variables at function-body scope inside the kernel ("threadgroup T name;") instead of as parameters. Function-scope threadgroup declarations are legal MSL and need no host-side setup. Variable names are unchanged, so all body references and helper-function call sites resolve exactly as before. The zero-initialization prologue is now filtered by per-entry-point usage (matching Rust naga, which filters by !fun_info[handle].is_empty()), so only workgroup vars actually used by the entry point are declared and zeroed. Generated code is otherwise identical to the parameter-based output. The snapshot reference allow-list records the six affected fixtures with reason "msl-threadgroup-function-scope", and the MSL goldens are regenerated.

codecov · 2026-06-14T21:37:02Z

Codecov Report

❌ Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
msl/internal/codegen/functions.go	85.00%	1 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

kolkov

Thorough review — verified every claim against Rust source and our Metal HAL. Well-researched PR.

Verification Results

All claims confirmed:

Rust naga pattern — writer.rs:7039-7043 emits workgroup vars as threadgroup T& name kernel parameters, filtered by fun_info[handle].is_empty() in three places (parameter emission, need_workgroup_variables_initialization, write_workgroup_variables_initialization).
Rust wgpu-hal Metal — command.rs:1469 calls set_threadgroup_memory_length(i, size) at dispatch time; device.rs:250-258 populates wg_memory_sizes from naga module. Parameter-form threadgroup memory is correct only because the HAL does this.
Our Metal HAL — confirmed no setThreadgroupMemoryLength call anywhere in wgpu/hal/metal/. There is an explicit TODO at device.go:799:
```
// TODO(zero-init-workgroup): Pass desc.Compute.ZeroInitializeWorkgroupMemory
// to naga MSL options.
```
MSL spec compliance — threadgroup T name; inside a kernel is legal MSL (Section 4.2). Statically sized by the compiler, no host-side setup needed.
types_with_comments.msl w_mem removal — this is a latent bug fix, not a regression. The WGSL source declares w_mem and w_mem2, but test_ep only uses w_mem2. Rust's compact pass removes unused w_mem entirely. Our code was incorrectly zero-initializing w_mem despite it not being used by the entry point. The per-EP filtering fixes this.
Edge cases verified:
- Helper functions receive workgroup vars as threadgroup T& parameters — works identically with function-body declarations (address space is inherent to the variable)
- Multiple entry points (interface.wgsl has 4 EPs, only compute uses workgroup var) — per-EP filtering correct
- Arrays/structs with atomics (atomicOps.msl, workgroup-var-init.msl) — declaration form semantically equivalent

One Design Question: Allow-List Granularity

The current allow-list is map[string]string — one reason per shader, applied to all backends:

"atomicOps": "workgroup-layout-free",  // covers SPIR-V divergence

After this PR, atomicOps diverges from Rust in two backends for different reasons:

Backend	Divergence	Reason
SPIR-V	Layout-free workgroup types	`workgroup-layout-free`
MSL	Function-scope threadgroup vars	`msl-threadgroup-function-scope`

The test passes because the shader is already allow-listed, but the reason string only documents the SPIR-V divergence. Same applies to workgroup-var-init.

Risk: if someone later fixes the SPIR-V workgroup-layout-free divergence and removes atomicOps from the allow-list, MSL tests will unexpectedly fail with no documented reason.

This is a pre-existing design limitation, not something this PR introduced. But since we're adding 6 new allow-list entries, it might be a good time to consider a per-backend allow-list structure, e.g.:

// Option A: per-backend map
type allowEntry struct {
    spv, msl, hlsl, glsl string // empty = must match exactly
}

// Option B: multi-reason string  
"atomicOps": "workgroup-layout-free, msl-threadgroup-function-scope",

What are your thoughts? Happy to keep the current structure if you prefer — it's not a blocker.

Verdict

Approve. Clean implementation, real bug fix, correct MSL, well-documented divergence. The per-EP filtering in writeWorkgroupZeroInit is a bonus fix that matches Rust naga behavior more closely.

kolkov · 2026-06-14T22:35:52Z

@lkmavi — this PR changes MSL workgroup variable emission (Metal backend). Since you're our macOS expert, could you take a look? Specifically interested in whether function-scope threadgroup T name; declarations work correctly on your M-series hardware vs the former entry-point parameter form.

lkmavi · 2026-06-15T06:42:26Z

@kolkov

✅ Function-scope threadgroup T name; is accepted by the Metal 3.1 runtime on M3 Max — legal MSL confirmed on real
hardware.
✅ All workgroup shaders in the PR's golden-file update list pass Metal runtime compilation.
The 13 pre-existing failures are in unrelated areas (int64 atomics, bounds-check DefaultConstructible, f64,
clip--istance — all features with known Metal limitations). None are regressions from this change.
⚠️ atomicCompareExchange and atomicOps-int64 contain workgroup vars that emit correctly as function-scope, but the
shaders fail for reasons unrelated to this PR. Worth a separate tracking issue if these were previously expected to
compile on Metal.

kolkov · 2026-06-15T08:20:39Z

@lkmavi Thanks for the M3 Max verification — great to have real hardware confirmation!

Could you file separate tracking issues for the two Metal compilation failures you found?

atomicCompareExchange — workgroup vars emit correctly but shader fails on Metal (int64 atomics?)
atomicOps-int64 — same pattern, function-scope threadgroup is fine but other Metal limitations

These are pre-existing, not caused by this PR, but worth tracking explicitly.

@georgebuilds — one follow-up on the allow-list design question from the review.

Currently the allow-list is map[string]string (one reason per shader, applies to all backends). After this PR, atomicOps and workgroup-var-init diverge from Rust in two backends for different reasons:

Shader	SPIR-V reason	MSL reason
`atomicOps`	`workgroup-layout-free`	`msl-threadgroup-function-scope`
`workgroup-var-init`	`workgroup-layout-free`	`msl-threadgroup-function-scope`

Only the SPIR-V reason is recorded. If someone later fixes the SPIR-V divergence and removes the shader from the allow-list, MSL tests will unexpectedly fail.

Would you be interested in a follow-up PR to make the allow-list per-backend? Something like:

// Option A: composite reason string
"atomicOps": "workgroup-layout-free, msl-threadgroup-function-scope",

// Option B: per-backend map
type allowEntry map[string]string // backend → reason
var referenceAllowList = map[string]allowEntry{
    "atomicOps": {"spv": "workgroup-layout-free", "msl": "msl-threadgroup-function-scope"},
}

No rush — just flagging for future robustness.

The MSL function-body threadgroup fix landed upstream and is tagged in v0.17.15, so the build-only third_party/naga prune and the replace directive are no longer needed. smem_opt_regression_test green against the real module; full suite 0 failures.

georgebuilds requested a review from kolkov as a code owner June 14, 2026 21:14

kolkov approved these changes Jun 14, 2026

View reviewed changes

kolkov merged commit 35d01f5 into gogpu:main Jun 15, 2026
11 checks passed

kolkov mentioned this pull request Jun 15, 2026

docs: v0.17.15 patch release #78

Merged

2 tasks

lkmavi mentioned this pull request Jun 15, 2026

MSL: shaders with 64-bit atomics fail Metal compilation #79

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(msl): declare workgroup vars at function-body scope in kernel entry points#77

fix(msl): declare workgroup vars at function-body scope in kernel entry points#77
kolkov merged 1 commit into
gogpu:mainfrom
georgebuilds:fix/msl-threadgroup-function-scope

georgebuilds commented Jun 14, 2026

Uh oh!

codecov Bot commented Jun 14, 2026

Uh oh!

kolkov left a comment

Uh oh!

kolkov commented Jun 14, 2026

Uh oh!

lkmavi commented Jun 15, 2026

Uh oh!

Uh oh!

kolkov commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

georgebuilds commented Jun 14, 2026

Summary

Divergence from Rust naga

Example (snapshot/testdata/golden/msl/workgroup_memory.msl)

Test plan

Uh oh!

codecov Bot commented Jun 14, 2026

Codecov Report

Uh oh!

kolkov left a comment

Choose a reason for hiding this comment

Verification Results

One Design Question: Allow-List Granularity

Verdict

Uh oh!

kolkov commented Jun 14, 2026

Uh oh!

lkmavi commented Jun 15, 2026

Uh oh!

Uh oh!

kolkov commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants