Fix AMD allgather #161

Draft
erieaton-amd wants to merge 6 commits into ByteDance-Seed:main from erieaton-amd:fix-ag

@erieaton-amd

Collaborator

The allgather kernels weren't updated to match some recent refactoring, probably because they aren't tested by the CI. This PR fixes the allgather tests and adds them to the CI.

This is currently based on the updated rocshmem PR (#156) but doesn't depend on it.

Signed-off-by: Eric Eaton <erieaton@amd.com>
Signed-off-by: Eric Eaton <erieaton@amd.com>
This file was not included in the CI and wasn't updated with some recent
changes.

Signed-off-by: Eric Eaton <erieaton@amd.com>
Signed-off-by: Eric Eaton <erieaton@amd.com>
@erieaton-amd erieaton-amd changed the title Fix amd allgather Fix AMD allgather Feb 18, 2026
There was a change in ROCm 7+ that makes it harder to match up the torch
and amdsmi devices. This change to use KFD makes operations like
sleep_async work again.

Signed-off-by: Eric Eaton <erieaton@amd.com>
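
For readers unfamiliar with KFD: the kernel driver exposes one topology node per device under `/sys/class/kfd/kfd/topology/nodes/<n>/properties`, as plain `key value` lines. The sketch below is not the code in this PR. It only illustrates how such a file could be parsed into a PCI address that does not depend on any library's device ordering. The helper names and the sample values are hypothetical, and the `location_id` bit layout (bus in bits 8-15, device and function in the low 8 bits) is an assumption based on the standard PCI bus/devfn encoding.

```python
def parse_kfd_properties(text):
    """Parse the 'key value' lines of a KFD topology properties file."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            # Values in the properties file are decimal integers.
            props[parts[0]] = int(parts[1])
    return props


def pci_bdf(props):
    """Decode domain and location_id into a 'dddd:bb:dd.f' PCI address.

    Assumes location_id = (bus << 8) | (device << 3) | function.
    """
    loc = props["location_id"]
    bus = (loc >> 8) & 0xFF
    device = (loc >> 3) & 0x1F
    function = loc & 0x7
    domain = props.get("domain", 0)
    return f"{domain:04x}:{bus:02x}:{device:02x}.{function:x}"


# Hypothetical file contents, for illustration only.
SAMPLE = """\
gpu_id 12345
location_id 768
domain 0
"""

print(pci_bdf(parse_kfd_properties(SAMPLE)))
```

On a real system the text would come from reading each node's `properties` file. The resulting address could then be compared with the PCI address reported by whichever other library needs to refer to the same GPU.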
@drprajap

Contributor

@wenlei-bao @houqi Can you guys review this PR when you get a chance?

Signed-off-by: Eric Eaton <erieaton@amd.com>
@erieaton-amd erieaton-amd marked this pull request as draft March 5, 2026 19:21

2 participants