Fix ROCm backend release target mapping #2423

Open
bong-water-water-bong wants to merge 1 commit into lemonade-sdk:main from bong-water-water-bong:fix/rocm-release-targets-2415

Conversation

@bong-water-water-bong


Summary

  • Normalize ROCm backend asset targets so specific dGPU ISAs like gfx1201 resolve to published family targets like gfx120X while APU targets like gfx1151 stay specific.
  • Update the vLLM ROCm pin to the current split-archive release and normalize vLLM base/per-target pins consistently for install and status paths.
  • Add CPU-runnable regression tests for ROCm release-target mapping and vLLM target-suffix handling.
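For readers skimming the thread, the target normalization described in the first bullet can be sketched roughly as follows. This is an illustration only, not the code in this PR: `release_target`, `APU_TARGETS`, and `FAMILY_PREFIXES` are hypothetical names, and the tables contain only the targets mentioned in this conversation.

```python
# Hypothetical sketch of ISA -> published release target normalization.
# The tables below are assumptions based only on targets named in this PR.
APU_TARGETS = {"gfx1151"}      # assumed: APU assets are published per ISA
FAMILY_PREFIXES = ("gfx120",)  # assumed: dGPU assets are published per family


def release_target(isa: str) -> str:
    """Map a detected ISA (e.g. gfx1201) to the asset target name."""
    isa = isa.strip().lower()
    if isa in APU_TARGETS:
        # APU targets stay specific.
        return isa
    for prefix in FAMILY_PREFIXES:
        # A family member is the prefix plus exactly one trailing character.
        if isa.startswith(prefix) and len(isa) == len(prefix) + 1:
            return prefix + "X"
    # Unknown ISAs pass through unchanged.
    return isa


print(release_target("gfx1201"))
print(release_target("gfx1151"))
```

Under these assumptions the function is idempotent, so passing an already-normalized target such as `gfx120X` returns it unchanged.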

Closes #2415.

Tests

  • python3 test/test_rocm_release_target.py
  • python3 test/test_device_family_matching.py
  • python3 test/test_cuda_arch_mapping.py
  • ninja -C build lemond lemonade
  • Live GitHub release URL smoke checks for fixed gfx120X, old broken gfx1201, and gfx1151 regression cases

github-actions bot added the labels engine::vllm (vLLM backend: experimental, ROCm Linux, Strix Halo), runtime::rocm (AMD ROCm runtime), and enhancement (New feature or request) on Jun 25, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b8c2a4ad2


Comment on lines +71 to +72
for isa in ("gfx1030", "gfx1031", "gfx1032", "gfx1033",
"gfx1034", "gfx1035", "gfx1036"):


P1: Blacken the new ROCm release-target test

AGENTS.md says Python must be formatted with Black and that this is enforced in CI; in the checked commit, python3 -m black --check test/test_rocm_release_target.py reports this new file would be reformatted, starting with this tuple and later literals. When the formatting check runs, the PR will fail even though the test logic itself passes.


// vllm-rocm user pins may be base versions or full per-target
// release tags; normalize either form to the base version so
// status matches the install path's current-target suffix.
expected_version = normalize_expected_version(user_pin);


P2: Require the current vLLM ROCm target in status

When vllm.rocm_bin is set to a full per-target tag and a same-base install for a different target is already in the cache (for example, a cache copied from a gfx1151 machine to gfx120X), this strips the expected version down to the base. The versions_match helper below then accepts any {expected}-... suffix, so /system-info reports the wrong-target vLLM bundle as installed instead of prompting a reinstall, even though the install path would construct the current machine's target suffix.
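The difference between the permissive check and a target-aware one can be illustrated with a small Python sketch. It mirrors the logic described in this comment rather than the C++ code in the PR: both function names are hypothetical, and the `{base}-{target}` shape of the installed version string is an assumption.

```python
# Hypothetical sketch; assumes installed versions look like "{base}-{target}".

def versions_match_any_suffix(expected_base: str, installed: str) -> bool:
    """Permissive check: accepts the base version with any target suffix."""
    return installed == expected_base or installed.startswith(expected_base + "-")


def versions_match_current_target(
    expected_base: str, installed: str, current_target: str
) -> bool:
    """Stricter check: the suffix must be the current machine's target."""
    return installed == f"{expected_base}-{current_target}"


base = "vllm0.22.1-rocm7.13.0"
cached = base + "-gfx1151"  # e.g. a cache copied from a gfx1151 machine

print(versions_match_any_suffix(base, cached))
print(versions_match_current_target(base, cached, "gfx120X"))
```

With the stricter form, a bundle built for another target would be reported as not installed, which would line up with what the install path constructs for the current machine.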


@superm1

superm1 commented Jun 25, 2026

Member

This diff is more comments than code. Please see b46f3b6 and adjust the PR.

@The-Monk


Verified on gfx1201 (R9700), built from source.

  • whispercpp:rocm → resolves whisper-v1.8.4-linux-rocm-**gfx120X**.tar.gz and installs cleanly (was a hard 404 on gfx1201 before this PR). ✅
  • vllm:rocm → resolves + downloads + reassembles the vllm0.22.1-rocm7.13.0-gfx120X split archive (part01/part02) and installs. ✅

One confirmation re the version bump: the older 0.20.1 split archive fails to reassemble ("Downloaded archive does not exist") — so bumping the pin to 0.22.1 is necessary, not cosmetic; 0.22.1 reassembles fine.
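For context on what "reassemble" means here, a split archive is typically rebuilt by concatenating its parts in order. The sketch below is a generic illustration, not the installer's implementation: `reassemble_split_archive` is a hypothetical helper, and it assumes the part file names sort into the correct order (as `part01`, `part02` do).

```python
import shutil
from pathlib import Path


def reassemble_split_archive(parts, dest):
    """Concatenate archive parts, in sorted name order, into dest."""
    ordered = sorted(Path(p) for p in parts)
    if not ordered:
        raise FileNotFoundError("no archive parts to reassemble")
    missing = [str(p) for p in ordered if not p.is_file()]
    if missing:
        # Fail early instead of writing a truncated archive.
        raise FileNotFoundError(f"missing archive parts: {missing}")
    dest = Path(dest)
    with dest.open("wb") as out:
        for part in ordered:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return dest
```

Checking that every part exists before writing avoids leaving a partial archive behind when a download silently failed.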


Heads-up, separate from this PR (it's the release artifact in lemonade-sdk/vllm-rocm, not the installer): once installed, the vllm0.22.1-rocm7.13.0-gfx120X tarball can't serve on gfx1201 due to two packaging defects:

  1. Bundled compiler missing clang-23 — Triton JIT fails at runtime:
    could not exec .../_rocm_sdk_core/lib/llvm/bin/clang-23: No such file or directory (only the 26 KB clang driver is shipped; the real clang-NN binary is absent). Worked around locally by symlinking clang-23 → ROCm 7.2.1's clang.
  2. vllm/_C.abi3.so ABI mismatch: ImportError: undefined symbol: _ZN3c103hip19getCurrentHIPStreamEa (c10::hip::getCurrentHIPStream(signed char)); the bundled libc10_hip.so doesn't export that symbol, i.e. _C was built against a different libtorch than the one shipped.

With (1) symlinked, vLLM gets all the way through V1 engine init (v0.22.1, enforce_eager, model-executor build) before (2) stops it — so this looks like a release-build issue, not an RDNA4/gfx1201 incompatibility. Filing separately on the vllm-rocm repo; cross-referencing here since #2415's target fix is what surfaces it.



Development

Successfully merging this pull request may close these issues.

Backend manifest 404s: whispercpp:rocm and vllm:rocm request gfx1201 assets published as gfx120X (RDNA4 / R9700)

3 participants