fix(msl): declare workgroup vars at function-body scope in kernel entry points#77
Conversation
…ry points
Workgroup (threadgroup) variables were emitted as threadgroup entry-point
parameters, matching Rust naga. That approach requires the host to call
setThreadgroupMemoryLength:atIndex: for each threadgroup parameter (Rust
wgpu-hal does this in its Metal encoder). Hosts that do not perform this
setup (for example a pure-Go wgpu Metal HAL) leave the threadgroup memory
unsized, so reads and writes silently no-op and compute results are wrong.
This change declares the workgroup variables at function-body scope inside
the kernel ("threadgroup T name;") instead of as parameters. Function-scope
threadgroup declarations are legal MSL and need no host-side setup. Variable
names are unchanged, so all body references and helper-function call sites
resolve exactly as before. The zero-initialization prologue is now filtered
by per-entry-point usage (matching Rust naga, which filters by
!fun_info[handle].is_empty()), so only workgroup vars actually used by the
entry point are declared and zeroed.
Generated code is otherwise identical to the parameter-based output. The
snapshot reference allow-list records the six affected fixtures with reason
"msl-threadgroup-function-scope", and the MSL goldens are regenerated.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
kolkov
left a comment
There was a problem hiding this comment.
Thorough review — verified every claim against Rust source and our Metal HAL. Well-researched PR.
Verification Results
All claims confirmed:
-
Rust naga pattern —
writer.rs:7039-7043emits workgroup vars asthreadgroup T& namekernel parameters, filtered byfun_info[handle].is_empty()in three places (parameter emission,need_workgroup_variables_initialization,write_workgroup_variables_initialization). -
Rust wgpu-hal Metal —
command.rs:1469callsset_threadgroup_memory_length(i, size)at dispatch time;device.rs:250-258populateswg_memory_sizesfrom naga module. Parameter-form threadgroup memory is correct only because the HAL does this. -
Our Metal HAL — confirmed no
setThreadgroupMemoryLengthcall anywhere inwgpu/hal/metal/. There is an explicit TODO atdevice.go:799:// TODO(zero-init-workgroup): Pass desc.Compute.ZeroInitializeWorkgroupMemory // to naga MSL options.
-
MSL spec compliance —
threadgroup T name;inside a kernel is legal MSL (Section 4.2). Statically sized by the compiler, no host-side setup needed. -
types_with_comments.mslw_memremoval — this is a latent bug fix, not a regression. The WGSL source declaresw_memandw_mem2, buttest_eponly usesw_mem2. Rust's compact pass removes unusedw_mementirely. Our code was incorrectly zero-initializingw_memdespite it not being used by the entry point. The per-EP filtering fixes this. -
Edge cases verified:
- Helper functions receive workgroup vars as
threadgroup T¶meters — works identically with function-body declarations (address space is inherent to the variable) - Multiple entry points (
interface.wgslhas 4 EPs, onlycomputeuses workgroup var) — per-EP filtering correct - Arrays/structs with atomics (
atomicOps.msl,workgroup-var-init.msl) — declaration form semantically equivalent
- Helper functions receive workgroup vars as
One Design Question: Allow-List Granularity
The current allow-list is map[string]string — one reason per shader, applied to all backends:
"atomicOps": "workgroup-layout-free", // covers SPIR-V divergenceAfter this PR, atomicOps diverges from Rust in two backends for different reasons:
| Backend | Divergence | Reason |
|---|---|---|
| SPIR-V | Layout-free workgroup types | workgroup-layout-free |
| MSL | Function-scope threadgroup vars | msl-threadgroup-function-scope |
The test passes because the shader is already allow-listed, but the reason string only documents the SPIR-V divergence. Same applies to workgroup-var-init.
Risk: if someone later fixes the SPIR-V workgroup-layout-free divergence and removes atomicOps from the allow-list, MSL tests will unexpectedly fail with no documented reason.
This is a pre-existing design limitation, not something this PR introduced. But since we're adding 6 new allow-list entries, it might be a good time to consider a per-backend allow-list structure, e.g.:
// Option A: per-backend map
type allowEntry struct {
spv, msl, hlsl, glsl string // empty = must match exactly
}
// Option B: multi-reason string
"atomicOps": "workgroup-layout-free, msl-threadgroup-function-scope",What are your thoughts? Happy to keep the current structure if you prefer — it's not a blocker.
Verdict
Approve. Clean implementation, real bug fix, correct MSL, well-documented divergence. The per-EP filtering in writeWorkgroupZeroInit is a bonus fix that matches Rust naga behavior more closely.
|
@lkmavi — this PR changes MSL workgroup variable emission (Metal backend). Since you're our macOS expert, could you take a look? Specifically interested in whether function-scope |
|
|
@lkmavi Thanks for the M3 Max verification — great to have real hardware confirmation! Could you file separate tracking issues for the two Metal compilation failures you found?
These are pre-existing, not caused by this PR, but worth tracking explicitly. @georgebuilds — one follow-up on the allow-list design question from the review. Currently the allow-list is
Only the SPIR-V reason is recorded. If someone later fixes the SPIR-V divergence and removes the shader from the allow-list, MSL tests will unexpectedly fail. Would you be interested in a follow-up PR to make the allow-list per-backend? Something like: // Option A: composite reason string
"atomicOps": "workgroup-layout-free, msl-threadgroup-function-scope",
// Option B: per-backend map
type allowEntry map[string]string // backend → reason
var referenceAllowList = map[string]allowEntry{
"atomicOps": {"spv": "workgroup-layout-free", "msl": "msl-threadgroup-function-scope"},
}No rush — just flagging for future robustness. |
The MSL function-body threadgroup fix landed upstream and is tagged in v0.17.15, so the build-only third_party/naga prune and the replace directive are no longer needed. smem_opt_regression_test green against the real module; full suite 0 failures.
Summary
Workgroup (threadgroup) variables are emitted as threadgroup entry-point parameters, matching Rust naga (
back/msl/writer.rs, where workgroup globals becomethreadgroup T& namekernel arguments). That approach is only correct when the host sizes each threadgroup argument by callingsetThreadgroupMemoryLength:atIndex:before dispatch. Rustwgpu-haldoes exactly this in its Metal command encoder.A host that does not perform that setup leaves the threadgroup memory unsized. On Metal an unsized threadgroup parameter reads as zero and writes are dropped, with no validation error. The result is a silent miscompile: any compute shader using
var<workgroup>(tiled matmul, reductions, prefix sums, etc.) produces wrong results. The pure-Go gogpu/wgpu Metal HAL has nosetThreadgroupMemoryLengthcall, so every workgroup-memory kernel compiled through this backend is affected.This change declares the workgroup variables at function-body scope inside the kernel (
threadgroup T name;) instead of as parameters. Function-scope threadgroup declarations are legal MSL, are statically sized by the type, and need no host-side setup, so the output is correct regardless of how the HAL drives dispatch. Variable names are unchanged, so all body references and helper-function call sites resolve exactly as before.A latent issue is fixed in passing: the zero-initialization prologue (and now the new declarations) are filtered by per-entry-point usage, matching Rust naga's
!fun_info[handle].is_empty()filter, so an entry point only declares and zeroes the workgroup vars it actually uses rather than every workgroup global in the module.Scope: MSL backend only. SPIR-V, HLSL, GLSL, and DXIL backends are untouched.
Divergence from Rust naga
This is a deliberate, documented divergence recorded in the snapshot reference allow-list (reason
msl-threadgroup-function-scope, 6 affected fixtures). Rust naga can keep the parameter form because its companion HAL sizes the parameters; a Go shader compiler whose consumers do not is better served by the self-contained function-scope form. The generated code is otherwise identical to the parameter-based output (the body, the zero-init prologue, and all references are the same).Example (snapshot/testdata/golden/msl/workgroup_memory.msl)
Before:
After:
Test plan
go build ./...clean (ubuntu/windows/macos targets build)go test ./...passes (9547 tests across 33 packages)go test -race ./msl/... ./snapshot/...passes, no race findingsTestRustReferenceandTestSnapshotspass; goldens regenerated withUPDATE_GOLDEN=1 go test ./snapshot/...and verified byte-identicalgofmt -l .emptygolangci-lint runreports no new issues; the modifiedwriteEntryPointtriggers no complexity warnings (funlen/gocyclo/gocognit/nestif)Note:
golangci-lintandscripts/pre-release-check.shflag one pre-existingnolintlintissue inmsl/internal/codegen/xcrun_helper_test_darwin.gothat exists verbatim onmainand is untouched by this PR. I left it out of scope rather than bundle an unrelated lint fix; happy to send it separately if you'd like.Verified end to end against a pure-Go WebGPU stack on Apple M3: workgroup-memory passthrough returns correct neighbor values for all 64 lanes (was all-zero before), and a tiled matmul using
var<workgroup>tiles is bit-exact against a CPU reference (was producing zeros before).