Skip to content

Multi-inferior fixes#106

Open
simark wants to merge 4 commits into
ROCm:amd-stagingfrom
simark:multi-inferior-fixes
Open

Multi-inferior fixes#106
simark wants to merge 4 commits into
ROCm:amd-stagingfrom
simark:multi-inferior-fixes

Conversation

@simark

@simark simark commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

This is a small patch series that we had internally for a while, it needs to be rebased and rechecked.

@simark simark requested a review from a team as a code owner April 29, 2026 15:35
@simark simark marked this pull request as draft April 29, 2026 17:27
@lumachad

lumachad commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Looks like this PR has conflicts.

@simark simark force-pushed the multi-inferior-fixes branch from f70b879 to ee2ab7d Compare June 9, 2026 04:49
@simark

simark commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

I rebased on top of current amd-staging (it was based on a ~2 year old commit). I will give it a test tomorrow.

@simark simark force-pushed the multi-inferior-fixes branch from ee2ab7d to e90315b Compare June 9, 2026 20:46
@lumachad lumachad added the ci:skip Skip all pre-commit / CI jobs while the label is up label Jun 10, 2026
@lumachad

Copy link
Copy Markdown
Collaborator

Labelled ci:skip. Lift it when it is ready for CI.

simark added 4 commits June 10, 2026 15:10
The `target_ops::async_wait_fd()` method allows targets to provide one
file descriptor GDB core can listen to to receive async events.  The
amd-dbgapi target needs to be able to report two: its own fd and the
underlying linux-nat target's fd.

To accomodate this need, this patch changes
`target_ops::async_wait_fd()` to return a vector of file descriptors.
After this patch, wait_one is still calling async_wait_fds on the
process target, so this will not use amd-dbgapi target's file
descriptor just yet, this will be done in a follow-up patch.

Change-Id: I971be69c4737ea4e69d93d9608039223653d528c
This patch fixes a hang encountered when debugging a ROCm inferior and a
host-only inferior at the same time.  Using the provided test file, the
following command reproduces the problem, leading to a hang during the
`run` command:

    $ ./gdb -q -nx --data-directory=data-directory \
            -ex "set breakpoint pending on" \
            -ex "b kern" \
            -ex "target native" \
            -ex "add-inferior" \
            -ex "inferior 2" \
            -ex "file testsuite/outputs/gdb.rocm/multi-inferior-with-inactive-inferior/multi-inferior-with-inactive-inferior" \
            -ex run

The story is:

 - Inferior 1 is connected to the native Linux target but not executing
   anything (I believe it would be the same result if it was stopped
   somewhere while debugging a non-GPU program, but pushing the native
   target manually is just easier).

 - Inferior 2 runs to a breakpoint in GPU code (in my test, there are
   two waves, both hitting the breakpoint).

 - stop_all_threads gets called post-breakpoint hit.

 - A stop request gets sent for the wave that didn't hit the breakpoint.

 - `wait_one` starts by polling for events using inferior 1.  This hits
   the Linux native target, which replies TARGET_WAITKIND_NO_RESUMED.

 - `wait_one` disables async using `target_async (false)`, which has the
   effect of calling `target_ops::async (false)` on all inferiors
   sharing the native target, both inferiors in our case.

 - `wait_one` considers inferior 2, but skips over it since
   `target->is_async_p ()` returns false (where `target` is the Linux
   native target).

 - We are in an infinite loop inside `stop_all_threads`, because the
   amd-dbgapi target would like to return an event announcing the stop
   of the second wave, but it can't because `wait_one` will never
   poll it.

Fix it by making it possible for the Linux native target to be !async,
while the amd-dbgapi target, placed above in inferior 2's target stack,
is async.

The changes are:

 - Make `amd_dbgapi_target` implement `target_ops::is_async_p`.  It
   returns true if itself is async or the beneath target is async.

 - In `wait_one`, when getting TARGET_WAITKIND_NO_RESUMED, call
   `inf->top_target ()->async (false)` instead of `target_async
   (false)`.  In the reproducer, this has the effect of only calling
   `async (false)` on the first inferior.  This leaves inferior 2 with
   the amd-dbgapi target async, and the Linux native target !async.

   One thing that `target_async` does is enable/disable infrun's async
   (calls `infrun_async`).  I don't think it's needed in the context of
   `wait_one` / `stop_all_threads`.

 - Not really related to the bug fix, but complementary to the previous
   point: change `reenable_target_async` to call `inf->top_target
   ()->async (true)` instead of `target_async (1)`.  When re-enabling
   async for the purpose of `stop_all_threads`, we don't need to enable
   infrun's own async (we don't consume pending / stored waitstatuses at
   this point).  But also, in `reenable_target_async`, we loop over all
   inferiors.  And then, in `target_async`, we loop over all inferiors
   sharing the current inferior's process target.  It's not the end of
   the world, but we loop over inferiors more than we need to, and more
   calls to `is_async` than we need.

 - In `wait_one`, call `is_async_p` on the inferior's top target to know
   if we should skip over that inferior, instead of on the inferior's
   process target.  In our reproducer, it makes it so this call hits the
   amd-dbgapi target for inferior 2.  And that returns true if the
   amd-dbgapi target still has an event to report, even if the Linux
   native target beneath has been set to !async.

 - In `wait_one` again, call `async_wait_fds` on the inferior's top target,
   instead of on the inferior's process target.  This makes it so that the
   returned vector of FDs contains amd-dbgapi's FDs, and `wait_one` can poll
   them.

Add a new test that does:

 - Set up two stopped inferiors, one just connected to the native target
   and one also with the amd-dbgapi target.

 - Try to run another inferior with two waves hitting a breakpoint in
   GPU code.  After the first breakpoint hit, resume execution to see
   the breakpoint hit by the second wave.

The test is a bit more complex than the reproducer described above, for
instance we don't need 3 inferiors to reproduce the problem, but it
matches more closely the initial bug report we got, and it ends up
testing more scenarios, so I think it's worth it.  Since the outcome
depends on the order of the inferiors, the test tries all permutations
of which inferior gets which role.

Change-Id: I5f84904e695278e1d4bd085873ec058e4a639faf
…n amd_dbgapi_target::wait

I saw this while working on patch "gdb/amd-dbgapi: fix hang when
debugging multiple inferiors".  It isn't necessary for the fix, but the
current code looks wrong to me.

The current inferior is not relevant to determine which event
`target_ops::wait` should return.  When called with ptid ==
minus_one_ptid, the target may return an event for any inferior it
controls.

When there isn't an event ready to return (consume_one_event returns an
event ptid == minus_one_ptid), `amd_dbgapi_target::wait` calls
`process_event_queue` to pull events from amd-dbgapi, passing the
process id associated to the current inferior.  What if that process
doesn't have anything to report, but another process has?  We want the
amd-dbgapi target to return the event for that other process, if
`amd_dbgapi_target::wait` was called with `ptid == minus_one_ptid`.

My first idea was to make `process_event_queue` pass
`AMD_DBGAPI_PROCESS_NONE` to `amd_dbgapi_process_next_pending_event`.
This does not work, because it risks pulling a "runtime enabled" event,
and `process_one_event` doesn't know how to handle that (these events
are handled through another path).

Instead, call `process_event_queue` for all processes matching `ptid`
that have the runtime enabled.

I noticed that `process_event_queue` is always called with a specific
process id, never `AMD_DBGAPI_PROCESS_NONE`, so simplify the assertion
there a little bit.

Change-Id: I90a5ef819f4ec3d4597972a9eac6f61ddb219c26
…reads

I saw this while working on patch "gdb/amd-dbgapi: fix hang when
debugging multiple inferiors".  It isn't necessary for
the fix of that bug, but that looked like potential issue.

Let's say we have inferior 1 with the linux-nat target, and inferior 2
with the amd-dbgapi and linux-nat targets.  The loop using
`all_non_exited_process_targets` + `switch_to_target_no_thread` will
only call `update_thread_list` on inferior 1, and therefore amd-dbgapi's
`update_thread_list` method will never get called.

Iterate over all inferiors and call `update_thread_list` on each, to
ensure that we end up calling it on all targets.

Change-Id: Ib8d816fd42a72ce47c47d508808c6a0868cc395d
@simark simark force-pushed the multi-inferior-fixes branch from e90315b to d32cf14 Compare June 11, 2026 18:33
@simark simark marked this pull request as ready for review June 11, 2026 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:skip Skip all pre-commit / CI jobs while the label is up

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants