Add dejagnu tests for cooperative group GWS debugging #116

Open: spatrang wants to merge 2 commits into amd-staging from users/spatrang/coop-group-gws-tests

@spatrang spatrang commented May 7, 2026

Summary

Add dejagnu coverage for debugging AMD GPU cooperative-group kernels —
i.e. kernels launched via hipLaunchCooperativeKernel /
hipLaunchCooperativeKernelMultiDevice that synchronize at the grid /
multi-grid level. On AMD GPUs these synchronization primitives are
implemented in hardware via Global Wave Sync (GWS), and they have a
distinct wave/scheduling model that has historically only been covered by
out-of-tree tests. This PR brings that coverage into the dejagnu testsuite
so it runs as part of the regular ROCgdb regression suite.

Tests added

| File | Scenario |
| --- | --- |
| gdb.rocm/coop-group-grid-sync.{cpp,exp} | Single-device cooperative kernel using cooperative_groups::this_grid().sync() (intra-device GWS), launched via hipLaunchCooperativeKernel. |
| gdb.rocm/coop-group-multi-grid-sync.{cpp,exp} | Multi-device cooperative kernel using both this_grid().sync() and cooperative_groups::this_multi_grid().sync() (intra + cross-device GWS), launched via hipLaunchCooperativeKernelMultiDevice. |

What gets verified

coop-group-grid-sync.exp — two sub-tests:

  • test_break_around_grid_sync
    • Hit a breakpoint before grid.sync() inside a cooperative dispatch.
    • Confirm multiple AMDGPU Wave threads are stopped (waves participating
      in the GWS barrier).
    • Confirm info dispatches lists the cooperative dispatch.
    • Move the breakpoint to after grid.sync() and continue: it must
      fire (proves GWS-protected code can be debugged across the barrier).
    • Continue to clean program exit.
  • test_threads_in_coop_kernel
    • For every AMDGPU Wave parked inside the kernel, switch to it and
      confirm bt 1 reports a frame inside coop_grid_sync_kernel.

coop-group-multi-grid-sync.exp — runs in non-stop mode:

  • After continue -a &, confirm a kernel-side breakpoint fires inside
    coop_multi_grid_sync_kernel. Per-GPU child breakpoint instances
    (Breakpoint X.Y) are observed for every participating GPU.
  • Continue all threads to program exit, which only succeeds if both
    this_grid().sync() and this_multi_grid().sync() release correctly
    under the debugger.

The host-side post-conditions in the .cpp programs additionally validate
the cooperative semantics numerically (cross-workgroup data dependency for
the single-device case, cross-device sum aggregation for the multi-device
case), so any regression in GWS behavior under the debugger turns into a
test failure rather than a silent miscompare.
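
As a rough illustration of what a host-side post-condition for a cross-workgroup data dependency can look like, here is a CPU-only C++ sketch. It is not the code from coop-group-grid-sync.cpp: the two-phase scheme, the names run_two_phase_model and validate_cross_workgroup, and the values written are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of a two-phase cooperative kernel.
// Phase 1: every workgroup publishes a value into a staging buffer.
// Phase 2: every workgroup reads the value published by its neighbour.
// In a real kernel a grid-wide barrier would separate the two phases.
std::vector<int> run_two_phase_model(std::size_t n_workgroups)
{
  std::vector<int> stage(n_workgroups), out(n_workgroups);

  for (std::size_t wg = 0; wg < n_workgroups; ++wg)
    stage[wg] = static_cast<int>(wg) + 1;       // Phase 1: publish.

  // (The grid-wide barrier would sit here in device code.)

  for (std::size_t wg = 0; wg < n_workgroups; ++wg)
    out[wg] = stage[(wg + 1) % n_workgroups];   // Phase 2: read neighbour.

  return out;
}

// Host-side post-condition: each slot must hold the neighbour's Phase 1
// value.  A workgroup that read before its neighbour had written would
// leave a stale value behind and make this check fail.
bool validate_cross_workgroup(const std::vector<int>& out)
{
  const std::size_t n = out.size();
  for (std::size_t wg = 0; wg < n; ++wg)
    if (out[wg] != static_cast<int>((wg + 1) % n) + 1)
      return false;
  return true;
}
```

The point of checking the neighbour's value rather than the workgroup's own is that the result is only correct if the barrier actually ordered Phase 1 before Phase 2 across workgroups.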

Skip / unsupported handling

The tests degrade cleanly on systems that cannot run them:

  • Single-device test: queries cooperativeLaunch; if unsupported the
    program prints a recognizable message and exits, and the .exp marks
    the test UNSUPPORTED.
  • Multi-device test: requires >= 2 GPUs and
    cooperativeMultiDeviceLaunch on every device. It is also gated by
    the existing hip_devices_support_debug_multi_process requirement.
    Any of those missing → UNSUPPORTED.

No new dejagnu helpers are required; both .exp files use existing
infrastructure in lib/rocm.exp.

Out of scope / follow-ups

Intentionally left out of this PR; happy to extend if reviewers ask:

  • Stepping (next / step / stepi) across grid.sync() /
    mgrid.sync() boundaries.
  • Conditional breakpoints inside cooperative kernels.
  • lane apply / per-lane register inspection while waves are parked at
    the GWS barrier.
  • Watchpoints on cooperative shared buffers.

@spatrang spatrang requested review from Copilot and lumachad May 7, 2026 13:08
@spatrang spatrang marked this pull request as ready for review May 7, 2026 13:14
@spatrang spatrang requested a review from a team as a code owner May 7, 2026 13:14

Copilot AI left a comment

Pull request overview

Adds new ROCm dejagnu coverage to exercise ROCgdb debugging of cooperative-group HIP kernels that synchronize via GWS, covering both single-device this_grid().sync() and multi-device this_multi_grid().sync() scenarios.

Changes:

  • Introduces a single-device cooperative-kernel test that breaks before/after grid.sync() and validates waves/dispatch visibility.
  • Introduces a multi-device cooperative-kernel non-stop test that breaks inside a multi-grid kernel and runs through grid + multi-grid barriers to completion.
  • Adds two HIP C++ test programs that implement the cooperative-group synchronization patterns and validate results on the host side.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp | DejaGnu test for single-device cooperative kernel debugging around this_grid().sync(). |
| gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp | HIP program implementing single-device cooperative grid.sync() and host-side validation. |
| gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp | DejaGnu non-stop test for multi-device cooperative kernel debugging through this_grid().sync() + this_multi_grid().sync(). |
| gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp | HIP program implementing multi-device cooperative launch with cross-device aggregation and validation. |

Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from af78fac to 33f1926 Compare May 7, 2026 13:35

@lancesix lancesix left a comment

Collaborator

Hi,
Thanks a lot for this, this is a great starting point.

My main concern for now is gfx110x. We do not support debugging cooperative groups on those (a documented limitation), so the testcase should detect them and not FAIL. It is known that the test will not pass there, even if the arch does support GWS.

I have added a couple of small comments, I'll get back to a more detailed review after the gfx11 concern has been addressed.

Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
@spatrang

Author

Hi, Thanks a lot for this, this is a great starting point.

My main concern for now is gfx110x. We do not support debugging cooperative groups on those (a documented limitation), so the testcase should detect them and not FAIL. It is known that the test will not pass there, even if the arch does support GWS.

I have added a couple of small comments, I'll get back to a more detailed review after the gfx11 concern has been addressed.

Addressed. Added a supports_cooperative_groups helper in lib/rocm.exp that excludes gfx1100/1101/1102/1103, and both .exp files now require it, so on gfx110x the run reports UNSUPPORTED: …: require failed: supports_cooperative_groups instead of FAIL. Mirrors the existing hip_devices_support_debug_multi_process pattern in the same lib.
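
The real helper lives in lib/rocm.exp and is written in Tcl; the C++ sketch below only models the per-target decision described above (reject gfx1100/1101/1102/1103, accept everything else). The function name is reused here for readability, and the assumption that targets are compared as plain architecture strings is mine, not something the PR states.

```cpp
#include <cassert>
#include <string>

// Model of the debugger-side gate: cooperative-group debugging is treated
// as unsupported on the gfx110x family, regardless of whether the runtime
// advertises cooperative launch on that device.
// Assumption: `target` is a bare architecture name such as "gfx942".
bool target_supports_cooperative_groups(const std::string& target)
{
  static const char* const unsupported[] = {
    "gfx1100", "gfx1101", "gfx1102", "gfx1103"
  };

  for (const char* arch : unsupported)
    if (target == arch)
      return false;

  return true;
}
```

Keeping this gate separate from the runtime capability query means a device can pass the runtime check and still be skipped because the debugger does not support it.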

@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from 33f1926 to 73b1b20 Compare May 11, 2026 06:54
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/lib/rocm.exp Outdated
Comment thread gdb/testsuite/lib/rocm.exp Outdated
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch 2 times, most recently from ae4f615 to 5aa9e19 Compare May 20, 2026 06:12
Comment thread gdb/testsuite/gdb.rocm/deep-stack.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/deref-scoped-pointer.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/instruction-stepping-commands.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/watchpoint-basic.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/deep-stack.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp Outdated
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from 5aa9e19 to d541e7d Compare May 21, 2026 10:50
@spatrang spatrang requested review from aktemur, lancesix and lumachad May 21, 2026 10:59
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from d541e7d to 28f6ae0 Compare May 21, 2026 12:45
Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
Add dejagnu coverage for debugging AMD GPU cooperative-group kernels
(hipLaunchCooperativeKernel / hipLaunchCooperativeKernelMultiDevice),
which synchronize at the grid / multi-grid level via Global Wave Sync
(GWS).  Previously covered only by out-of-tree tests.

New tests:
  * gdb.rocm/coop-group-grid-sync.{cpp,exp}
    Single-device, this_grid ().sync ().
  * gdb.rocm/coop-group-multi-grid-sync.{cpp,exp}
    Multi-device, this_grid ().sync () + this_multi_grid ().sync ();
    runs in non-stop mode.

Host-side post-conditions validate the cooperative semantics
numerically, so any regression in GWS behaviour under the debugger
surfaces as a test failure rather than a silent miscompare.  The
tests pick a debugger-supported device at runtime and self-skip with
UNSUPPORTED when the configuration is insufficient.

Two helpers added in gdb/testsuite/lib/rocm.exp:
target_supports_cooperative_groups <target> (per-target gate, returns
false on gfx1100/1101/1102/1103 per amd-dbgapi.h) and
supports_cooperative_groups (require-gate wrapper used by both .exp
files).  This is a debugger-side gate, distinct from the runtime's
cooperativeLaunch / cooperativeMultiDeviceLaunch flags.
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from 28f6ae0 to 714c354 Compare June 4, 2026 14:31
@spatrang spatrang requested a review from aktemur June 4, 2026 16:47
@spatrang spatrang assigned aktemur and unassigned spatrang Jun 4, 2026
# debugged across the barrier.
delete_breakpoints
gdb_breakpoint \
[gdb_get_line_number "after-sync line"] allow-pending

Contributor

We shouldn't need allow-pending anymore. We can optionally use temporary so that we can remove delete_breakpoints below. We can use temporary for the first breakpoint, too.

@spatrang spatrang Jun 8, 2026

Author

Thanks — I applied the temporary part: both breakpoints are now temporary, which let me drop the redundant delete_breakpoints calls (and the same in test_threads_in_coop_kernel).

On allow-pending though - I tried removing it, but it turns out it's still needed here: these breakpoints are on lines inside the kernel (device code), which isn't loaded yet when we set them at main. Without allow-pending, gdb_breakpoint fails outright with "set breakpoint at NN" (gdb declines the unresolved location and defaults to "no" on the pending prompt). On a GPU run this turned into a hard FAIL. So I've kept allow-pending and combined it with temporary (gdb_breakpoint allow-pending temporary). The host-side marker breakpoint in the multi-device test, by contrast, resolves immediately, so temporary alone is fine there.

Comment on lines +65 to +73
# Verify that waves from multiple workgroups are stopped at the
# pre-sync breakpoint. Counting waves alone is wave-size
# dependent (1 wave per workgroup on wave64 vs 2 on wave32) and
# would let the test pass on wave32 even if all visible waves
# happened to come from a single workgroup. Instead, collect
# the distinct workgroup (block) coordinates from the AMDGPU
# Wave entries in "info threads" and require at least two
# distinct workgroups, which directly verifies that the
# cooperative dispatch's multi-workgroup property is exercised.

Contributor

Because we stop here before the sync, is there a guarantee that we would see 2 distinct workgroups? Wouldn't we rather have that guarantee after syncing?

Author

Good question. The guarantee here comes from the cooperative launch itself rather than from the barrier: hipLaunchCooperativeKernel requires the entire grid to be co-resident on the device for the lifetime of the dispatch (that's precisely what makes grid.sync() safe — a non-co-resident grid could deadlock at the barrier). So all workgroups are resident from dispatch start, including before the first grid.sync(). With just 2 workgroups of 64 threads on the target there's ample occupancy, so both are present. I kept the check pre-sync deliberately: verifying the debugger can see all co-resident waves before the barrier (parked at arbitrary points in the kernel) is a more representative debugging scenario than inspecting them lined up at the sync point. Happy to also add an after-sync check if you'd like the stricter guarantee asserted explicitly.
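
The "collect distinct workgroup coordinates" idea discussed in this thread can be sketched outside of dejagnu as a small C++ function over captured "info threads" text. The actual check is done in Tcl in the .exp file; the function name count_distinct_workgroups and the assumed thread-name layout ("AMDGPU Wave <ids> (x,y,z)/<n>") are illustrative assumptions, not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <set>
#include <string>

// Count the distinct workgroup coordinates among the AMDGPU Wave entries
// in a block of "info threads" output.
// Assumption: each wave entry contains "AMDGPU Wave <ids> (x,y,z)/<n>",
// where (x,y,z) is the workgroup position and <n> the wave index in it.
std::size_t count_distinct_workgroups(const std::string& info_threads)
{
  static const std::regex wave_re(
    R"re(AMDGPU Wave [0-9:]+ \((\d+,\d+,\d+)\))re");

  std::set<std::string> workgroups;
  for (std::sregex_iterator it(info_threads.begin(), info_threads.end(),
                               wave_re), end;
       it != end; ++it)
    workgroups.insert((*it)[1].str());   // Keep only the "x,y,z" part.

  return workgroups.size();
}
```

Two waves of the same workgroup collapse into one set entry, which is why this check does not depend on whether the device runs wave32 or wave64.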

Comment thread gdb/testsuite/gdb.rocm/coop-group-grid-sync.exp
Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
set eligible 1
pass $gdb_test_name
}
-re "\\\[Inferior 1 \[^\r\n\]* exited normally\\\]\[^\r\n\]*\r\n$::gdb_prompt " {

Contributor

Can we use -wrap here, too? It's non-stop mode but only the main thread is supposed to hit the breakpoint. So, I expect we are able to use -wrap and simplify the case to -re "\\\[Inferior 1 \[^\r\n\]* exited normally.*". Please also consider using inferior_exited_re.

Author

Good suggestion — but this arm went away entirely with the restructure: the marker is now placed on a line reached on every run, so there is no early [Inferior 1 ... exited normally] case to match anymore. No -wrap/inferior_exited_re arm needed here as a result.

Comment thread gdb/testsuite/gdb.rocm/coop-group-multi-grid-sync.exp Outdated
return
}

set n_gpus [get_integer_valueof "n_gpus" 0]

Contributor

We can do this check early by putting a breakpoint at line 123 and get rid of the "advance to n-gpus-final" check above.

Author

Done — went with this. Moved the n-gpus-final marker onto the if (n_gpus < N_USED_GPUS) line in the .cpp, which runs on every execution before the inferior's own skip-return. The .exp now does a single gdb_continue_to_breakpoint there, reads n_gpus, and reports unsupported if it's < 2. The dual-arm gdb_test_multiple and its [^\r\n]*\r\n[^\r\n]*\r\n pattern are gone. Validated on gfx942 (in-tree build + 7.14 nightly rocgdb).

Comment on lines +110 to +115
# In non-stop mode, hipLaunchCooperativeKernelMultiDevice
# produces one child breakpoint instance per participating GPU
# ("Breakpoint <id>.<inst>"). Collect distinct <inst> values
# until we have observed a stop on every GPU; only then is it
# safe to delete the breakpoint and let the dispatch run
# through both grid syncs.

Contributor

What debugger behavior do we exactly test here? We could put a breakpoint after the sync and all participating blocks/grids would be there. It seems like we are rather testing the runtime, not the debugger.

Author

Good question. You're right that "do all grids reach the kernel" is a runtime property. The debugger behavior I'm after here is gdb's side: under a single hipLaunchCooperativeKernelMultiDevice dispatch, one source breakpoint resolves to multiple device-side locations, reported as a parent breakpoint with a child instance per GPU (Breakpoint <id>.<inst>), and in non-stop mode each device-side stop is observed independently. The loop just confirms gdb reports a stop for every per-device location; the "did every grid arrive" part is left to the host-side result check in the .cpp. I've reworded the in-file comment to make this clearer, and I'm happy to switch to the simpler "one breakpoint after the sync" approach if you'd prefer.
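
The loop described here amounts to collecting the distinct <inst> values seen for one parent breakpoint until there is one per GPU. A minimal C++ sketch of that bookkeeping over captured stop messages follows; the .exp file does this in Tcl, and the helper names observed_instances and all_gpus_stopped, as well as the exact wording of the stop lines used when calling them, are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <set>
#include <string>

// Collect the distinct child-instance numbers reported for one parent
// breakpoint, scanning for "Breakpoint <id>.<inst>" in the captured text.
std::set<int> observed_instances(const std::string& log, int parent_id)
{
  static const std::regex bp_re(R"re(Breakpoint (\d+)\.(\d+))re");

  std::set<int> instances;
  for (std::sregex_iterator it(log.begin(), log.end(), bp_re), end;
       it != end; ++it)
    if (std::stoi((*it)[1].str()) == parent_id)
      instances.insert(std::stoi((*it)[2].str()));

  return instances;
}

// True once a stop has been seen for at least n_gpus distinct instances,
// i.e. (under the one-instance-per-GPU assumption) for every GPU.
bool all_gpus_stopped(const std::string& log, int parent_id,
                      std::size_t n_gpus)
{
  return observed_instances(log, parent_id).size() >= n_gpus;
}
```

Repeated stops on the same instance do not advance the count, so several waves of one GPU hitting the breakpoint cannot be mistaken for stops on several GPUs.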

Comment on lines +139 to +141
gdb_test "bt 1" \
"#0\[^\r\n\]*coop_grid_sync_kernel\[^\r\n\]*" \
"backtrace inside coop_grid_sync_kernel"

Contributor

What debugger behavior are we testing here? Before the sync point, stopping waves would be inside the kernel. There is no other kernel. I'm not sure I understand the value of this test from the debugger perspective.

@spatrang spatrang Jun 9, 2026

Author

Thanks - that's a fair point, the bt 1 check was close to tautological with a single kernel. I've adopted approach #1 + #2 to make the debugger value explicit: instead of just confirming each wave's backtrace names the kernel, test_threads_in_coop_kernel now

  1. switches to each co-resident wave and reads blockIdx.x, asserting we observe more than one distinct workgroup - i.e. gdb selects the correct per-wave register context; and
  2. within one wave, switches between lanes and asserts threadIdx.x differs across lanes - i.e. gdb reports correct per-lane SIMT state.

Both are exercised specifically in the co-resident / GWS-barrier context, which is the cooperative-group angle the existing lane/builtin tests don't cover.

@aktemur aktemur assigned spatrang and unassigned aktemur Jun 8, 2026
@spatrang spatrang assigned aktemur and unassigned spatrang Jun 9, 2026
@spatrang spatrang requested a review from aktemur June 9, 2026 08:12
Refine the cooperative-group GWS tests for robustness and to make the
debugger behaviour under test more explicit:

  * coop-group-grid-sync.exp: use temporary breakpoints for the
    in-kernel locations (still pending, since the device code is
    loaded at dispatch time) and drop the redundant delete_breakpoints
    calls.

  * coop-group-grid-sync.exp: report UNSUPPORTED instead of FAIL when
    the inferior self-skips because no device advertises
    cooperativeLaunch.

  * coop-group-grid-sync.exp: have test_threads_in_coop_kernel check
    distinct per-wave blockIdx.x (per-wave register context) and
    per-lane threadIdx.x divergence (per-lane SIMT state), instead of
    only confirming that the backtrace names the kernel.

  * coop-group-multi-grid-sync.{cpp,exp}: read n_gpus from a marker
    line that is reached on every execution (no early return before
    it), so that when fewer than two cooperative-capable GPUs are
    available -- including when a parallel test run restricts the
    visible GPUs -- the test reports UNSUPPORTED rather than FAILing.

  * coop-group-{grid,multi-grid}-sync.{cpp,exp}: minor comment and
    GNU-style cleanups -- tab-align the in-kernel marker comment, keep
    hipLaunchCooperativeKernelMultiDevice on a single line, and
    clarify the Phase 2 data-dependency comment.

  * coop-group-{grid,multi-grid}-sync.{cpp,exp}: give the per-wave and
    per-lane value reads explicit, unique test names, and keep the
    "n-gpus-final" marker string unique so gdb_get_line_number
    resolves the intended line.

Tested on gfx942 with an in-tree build and the 7.14 nightly rocgdb,
both with all GPUs visible and with the visible set restricted to one.
@spatrang spatrang force-pushed the users/spatrang/coop-group-gws-tests branch from d9f90d2 to a00cfb3 Compare June 9, 2026 10:57
5 participants