Add dejagnu tests for cooperative group GWS debugging#116
Conversation
There was a problem hiding this comment.
Pull request overview
Adds new ROCm dejagnu coverage to exercise ROCgdb debugging of cooperative-group HIP kernels that synchronize via GWS, covering both single-device this_grid().sync() and multi-device this_multi_grid().sync() scenarios.
Changes:
- Introduces a single-device cooperative-kernel test that breaks before/after
grid.sync()and validates waves/dispatch visibility. - Introduces a multi-device cooperative-kernel non-stop test that breaks inside a multi-grid kernel and runs through grid + multi-grid barriers to completion.
- Adds two HIP C++ test programs that implement the cooperative-group synchronization patterns and validate results on the host side.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp | DejaGnu test for single-device cooperative kernel debugging around this_grid().sync(). |
| gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp | HIP program implementing single-device cooperative grid.sync() and host-side validation. |
| gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp | DejaGnu non-stop test for multi-device cooperative kernel debugging through this_grid().sync() + this_multi_grid().sync(). |
| gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp | HIP program implementing multi-device cooperative launch with cross-device aggregation and validation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
af78fac to
33f1926
Compare
lancesix
left a comment
There was a problem hiding this comment.
Hi,
Thanks a lot for this, this is a great starting point.
My main concern for now is gfx110x. We do not support debugging cooperative group on those (documented limitation), the testcase should look for them to not FAIL. This is known that the test will not pass even if the arch do support GWS.
I have added a couple of small comments, I'll get back to a more detailed review after the gfx11 concern has been addressed.
Addressed. Added a supports_cooperative_groups helper in lib/rocm.exp that excludes gfx1100/1101/1102/1103, and both .exp files now require it, so on gfx110x the run reports UNSUPPORTED: …: require failed: supports_cooperative_groups instead of FAIL. Mirrors the existing hip_devices_support_debug_multi_process pattern in the same lib. |
33f1926 to
73b1b20
Compare
ae4f615 to
5aa9e19
Compare
5aa9e19 to
d541e7d
Compare
d541e7d to
28f6ae0
Compare
Add dejagnu coverage for debugging AMD GPU cooperative-group kernels
(hipLaunchCooperativeKernel / hipLaunchCooperativeKernelMultiDevice),
which synchronize at the grid / multi-grid level via Global Wave Sync
(GWS). Previously covered only by out-of-tree tests.
New tests:
* gdb.rocm/coop-group-grid-sync.{cpp,exp}
Single-device, this_grid ().sync ().
* gdb.rocm/coop-group-multi-grid-sync.{cpp,exp}
Multi-device, this_grid ().sync () + this_multi_grid ().sync ();
runs in non-stop mode.
Host-side post-conditions validate the cooperative semantics
numerically, so any regression in GWS behaviour under the debugger
surfaces as a test failure rather than a silent miscompare. The
tests pick a debugger-supported device at runtime and self-skip with
UNSUPPORTED when the configuration is insufficient.
Two helpers added in gdb/testsuite/lib/rocm.exp:
target_supports_cooperative_groups <target> (per-target gate, returns
false on gfx1100/1101/1102/1103 per amd-dbgapi.h) and
supports_cooperative_groups (require-gate wrapper used by both .exp
files). This is a debugger-side gate, distinct from the runtime's
cooperativeLaunch / cooperativeMultiDeviceLaunch flags.
28f6ae0 to
714c354
Compare
| # debugged across the barrier. | ||
| delete_breakpoints | ||
| gdb_breakpoint \ | ||
| [gdb_get_line_number "after-sync line"] allow-pending |
There was a problem hiding this comment.
We shouldn't need allow-pending anymore. We can optionally use temporary so that we can remove delete_breakpoints below. We can use temporary for the first breakpoint, too.
There was a problem hiding this comment.
Thanks — I applied the temporary part: both breakpoints are now temporary, which let me drop the redundant delete_breakpoints calls (and the same in test_threads_in_coop_kernel).
On allow-pending though - I tried removing it, but it turns out it's still needed here: these breakpoints are on lines inside the kernel (device code), which isn't loaded yet when we set them at main. Without allow-pending, gdb_breakpoint fails outright with "set breakpoint at NN" (gdb declines the unresolved location and defaults to "no" on the pending prompt). On a GPU run this turned into a hard FAIL. So I've kept allow-pending and combined it with temporary (gdb_breakpoint allow-pending temporary). The host-side marker breakpoint in the multi-device test, by contrast, resolves immediately, so temporary alone is fine there.
| # Verify that waves from multiple workgroups are stopped at the | ||
| # pre-sync breakpoint. Counting waves alone is wave-size | ||
| # dependent (1 wave per workgroup on wave64 vs 2 on wave32) and | ||
| # would let the test pass on wave32 even if all visible waves | ||
| # happened to come from a single workgroup. Instead, collect | ||
| # the distinct workgroup (block) coordinates from the AMDGPU | ||
| # Wave entries in "info threads" and require at least two | ||
| # distinct workgroups, which directly verifies that the | ||
| # cooperative dispatch's multi-workgroup property is exercised. |
There was a problem hiding this comment.
Because we stop here before sync, is there a guarantee that we would see 2 distinct workgroups? Wouldn't we have that guarantee rather after synch'ing?
There was a problem hiding this comment.
Good question. The guarantee here comes from the cooperative launch itself rather than from the barrier: hipLaunchCooperativeKernel requires the entire grid to be co-resident on the device for the lifetime of the dispatch (that's precisely what makes grid.sync() safe — a non-co-resident grid could deadlock at the barrier). So all workgroups are resident from dispatch start, including before the first grid.sync(). With just 2 workgroups of 64 threads on the target there's ample occupancy, so both are present. I kept the check pre-sync deliberately: verifying the debugger can see all co-resident waves before the barrier (parked at arbitrary points in the kernel) is a more representative debugging scenario than inspecting them lined up at the sync point. Happy to also add an after-sync check if you'd like the stricter guarantee asserted explicitly.
| set eligible 1 | ||
| pass $gdb_test_name | ||
| } | ||
| -re "\\\[Inferior 1 \[^\r\n\]* exited normally\\\]\[^\r\n\]*\r\n$::gdb_prompt " { |
There was a problem hiding this comment.
Can we use -wrap here, too? It's non-stop mode but only the main thread is supposed to hit the breakpoint. So, I expect we are able to use -wrap and simplify the case to -re "\\\[Inferior 1 \[^\r\n\]* exited normally.*". Please also consider using inferior_exited_re.
There was a problem hiding this comment.
Good suggestion — but this arm went away entirely with the restructure: the marker is now placed on a line reached on every run, so there is no early [Inferior 1 ... exited normally] case to match anymore. No -wrap/inferior_exited_re arm needed here as a result.
| return | ||
| } | ||
|
|
||
| set n_gpus [get_integer_valueof "n_gpus" 0] |
There was a problem hiding this comment.
We can do this check early by putting a breakpoint at line 123 and get rid of the "advance to n-gpus-final" check above.
There was a problem hiding this comment.
Done — went with this. Moved the n-gpus-final marker onto the if (n_gpus < N_USED_GPUS) line in the .cpp, which runs on every execution before the inferior's own skip-return. The .exp now does a single gdb_continue_to_breakpoint there, reads n_gpus, and reports unsupported if it's < 2. The dual-arm gdb_test_multiple and its [^\r\n]*\r\n[^\r\n]*\r\n pattern are gone. Validated on gfx942 (in-tree build + 7.14 nightly rocgdb).
| # In non-stop mode, hipLaunchCooperativeKernelMultiDevice | ||
| # produces one child breakpoint instance per participating GPU | ||
| # ("Breakpoint <id>.<inst>"). Collect distinct <inst> values | ||
| # until we have observed a stop on every GPU; only then is it | ||
| # safe to delete the breakpoint and let the dispatch run | ||
| # through both grid syncs. |
There was a problem hiding this comment.
What debugger behavior do we exactly test here? We could put a breakpoint after the sync and all participating blocks/grids would be there. It seems like we are rather testing the runtime, not the debugger.
There was a problem hiding this comment.
Good question — you're right that "do all grids reach the kernel" is a runtime property. The debugger behavior I'm after here is gdb's side: under a single hipLaunchCooperativeKernelMultiDevice dispatch, one source breakpoint resolves to multiple device-side locations, reported as a parent breakpoint with a child instance per GPU (Breakpoint .), and in non-stop mode each device-side stop is observed independently. The loop just confirms gdb reports a stop for every per-device location; the "did every grid arrive" part is left to the host-side result check in the .cpp. I've reworded the in-file comment to make this clearer — happy to switch to the simpler "one breakpoint after the sync" approach if you'd prefer.
| gdb_test "bt 1" \ | ||
| "#0\[^\r\n\]*coop_grid_sync_kernel\[^\r\n\]*" \ | ||
| "backtrace inside coop_grid_sync_kernel" |
There was a problem hiding this comment.
What debugger behavior are we testing here? Before the sync point, stopping waves would be inside the kernel. There is no other kernel. I'm not sure I understand the value of this test from the debugger perspective.
There was a problem hiding this comment.
Thanks - that's a fair point, the bt 1 check was close to tautological with a single kernel. I've adopted approach #1 + #2 to make the debugger value explicit: instead of just confirming each wave's backtrace names the kernel, test_threads_in_coop_kernel now
- switches to each co-resident wave and reads blockIdx.x, asserting we observe more than one distinct workgroup - i.e. gdb selects the correct per-wave register context; and
- within one wave, switches between lanes and asserts threadIdx.x differs across lanes - i.e. gdb reports correct per-lane SIMT state. Both are exercised specifically in the co-resident / GWS-barrier context, which is the cooperative-group angle the existing lane/builtin tests don't cover.
Refine the cooperative-group GWS tests for robustness and to make the
debugger behaviour under test more explicit:
* coop-group-grid-sync.exp: use temporary breakpoints for the
in-kernel locations (still pending, since the device code is
loaded at dispatch time) and drop the redundant delete_breakpoints
calls.
* coop-group-grid-sync.exp: report UNSUPPORTED instead of FAIL when
the inferior self-skips because no device advertises
cooperativeLaunch.
* coop-group-grid-sync.exp: have test_threads_in_coop_kernel check
distinct per-wave blockIdx.x (per-wave register context) and
per-lane threadIdx.x divergence (per-lane SIMT state), instead of
only confirming that the backtrace names the kernel.
* coop-group-multi-grid-sync.{cpp,exp}: read n_gpus from a marker
line that is reached on every execution (no early return before
it), so that when fewer than two cooperative-capable GPUs are
available -- including when a parallel test run restricts the
visible GPUs -- the test reports UNSUPPORTED rather than FAILing.
* coop-group-{grid,multi-grid}-sync.{cpp,exp}: minor comment and
GNU-style cleanups -- tab-align the in-kernel marker comment, keep
hipLaunchCooperativeKernelMultiDevice on a single line, and
clarify the Phase 2 data-dependency comment.
* coop-group-{grid,multi-grid}-sync.{cpp,exp}: give the per-wave and
per-lane value reads explicit, unique test names, and keep the
"n-gpus-final" marker string unique so gdb_get_line_number
resolves the intended line.
Tested on gfx942 with an in-tree build and the 7.14 nightly rocgdb,
both with all GPUs visible and with the visible set restricted to one.
d9f90d2 to
a00cfb3
Compare
Summary
Add dejagnu coverage for debugging AMD GPU cooperative-group kernels —
i.e. kernels launched via
hipLaunchCooperativeKernel/hipLaunchCooperativeKernelMultiDevicethat synchronize at the grid /multi-grid level. On AMD GPUs these synchronization primitives are
implemented in hardware via Global Wave Sync (GWS), and they have a
distinct wave/scheduling model that has historically only been covered by
out-of-tree tests. This PR brings that coverage into the dejagnu testsuite
so it runs as part of the regular ROCgdb regression suite.
Tests added
gdb.rocm/coop-group-grid-sync.{cpp,exp}cooperative_groups::this_grid().sync()(intra-device GWS), launched viahipLaunchCooperativeKernel.gdb.rocm/coop-group-multi-grid-sync.{cpp,exp}this_grid().sync()andcooperative_groups::this_multi_grid().sync()(intra + cross-device GWS), launched viahipLaunchCooperativeKernelMultiDevice.What gets verified
coop-group-grid-sync.exp— two sub-tests:test_break_around_grid_syncgrid.sync()inside a cooperative dispatch.AMDGPU Wavethreads are stopped (waves participatingin the GWS barrier).
info dispatcheslists the cooperative dispatch.grid.sync()and continue: it mustfire (proves GWS-protected code can be debugged across the barrier).
test_threads_in_coop_kernelAMDGPU Waveparked inside the kernel, switch to it andconfirm
bt 1reports a frame insidecoop_grid_sync_kernel.coop-group-multi-grid-sync.exp— runs in non-stop mode:continue -a &, confirm a kernel-side breakpoint fires insidecoop_multi_grid_sync_kernel. Per-GPU child breakpoint instances(
Breakpoint X.Y) are observed for every participating GPU.this_grid().sync()andthis_multi_grid().sync()release correctlyunder the debugger.
The host-side post-conditions in the
.cppprograms additionally validatethe cooperative semantics numerically (cross-workgroup data dependency for
the single-device case, cross-device sum aggregation for the multi-device
case), so any regression in GWS behavior under the debugger turns into a
test failure rather than a silent miscompare.
Skip / unsupported handling
The tests degrade cleanly on systems that cannot run them:
cooperativeLaunch; if unsupported theprogram prints a recognizable message and exits, and the
.expmarksthe test
UNSUPPORTED.>= 2GPUs andcooperativeMultiDeviceLaunchon every device. It is also gated bythe existing
hip_devices_support_debug_multi_processrequirement.Any of those missing →
UNSUPPORTED.No new dejagnu helpers are required; both
.expfiles use existinginfrastructure in
lib/rocm.exp.Out of scope / follow-ups
Intentionally left out of this PR; happy to extend if reviewers ask:
next/step/stepi) acrossgrid.sync()/mgrid.sync()boundaries.lane apply/ per-lane register inspection while waves are parked atthe GWS barrier.