Skip to content

fix: compile error on RDNA4 (gfx1201) ROCm#815

Open
Kaihui-AMD wants to merge 1 commit into
turboderp-org:masterfrom
Kaihui-AMD:fix/rdna4-rocm-vla-shared-memory
Open

fix: compile error on RDNA4 (gfx1201) ROCm#815
Kaihui-AMD wants to merge 1 commit into
turboderp-org:masterfrom
Kaihui-AMD:fix/rdna4-rocm-vla-shared-memory

Conversation

@Kaihui-AMD

@Kaihui-AMD Kaihui-AMD commented May 15, 2026

Copy link
Copy Markdown

Summary — AMD consumer GPUs

ExLlamaV2 fails to compile on RDNA4 GPUs (gfx1201: RX 9070, RX 9060, R9700) with ROCm 7.x.
rms_norm.cu and layer_norm.cu define NUM_WARPS = 1024 / warpSize under USE_ROCM. On HIP, warpSize is a runtime __device__ variable (RDNA4 supports both wave32 and wave64), so NUM_WARPS is not a compile-time constant. This causes:

  1. __shared__ float sums[NUM_WARPS] — variable length array in shared memory
  2. Host functions referencing NUM_THREADS__device__ variable used in __host__ context
    error: variable length array declaration cannot have 'static' storage duration shared float sums[NUM_WARPS];
FAILED: rms_norm.cuda.o /opt/venv/lib/python3.12/site-packages/exllamav2/exllamav2_ext/hip/rms_norm.hip:103:22: error: variable length array declaration cannot have 'static' storage duration shared float sums[NUM_WARPS]; ^ ~~~~~~~~~

FAILED: layer_norm.cuda.o /opt/venv/lib/python3.12/site-packages/exllamav2/exllamav2_ext/hip/layer_norm.hip:80:22: error: variable length array declaration cannot have 'static' storage duration shared float sums[NUM_WARPS]; ^ ~~~~~~~~~

FAILED: rms_norm.cuda.o /workspace/exllamav2/exllamav2/exllamav2_ext/hip/rms_norm.hip:222:18: error: reference to device function 'operator int' in host function

FAILED: layer_norm.cuda.o /workspace/exllamav2/exllamav2/exllamav2_ext/hip/layer_norm.hip:205:18: error: reference to device function 'operator int' in host function

Fix

Add compile-time constants MAX_NUM_WARPS (32) and NUM_THREADS_CONST (1024):

  • __shared__ arrays use MAX_NUM_WARPS (upper bound: 1024 / 32 = 32 warps)
  • Host functions use NUM_THREADS_CONST (NUM_WARPS * WARP_SIZE = 1024 for all warp sizes)
  • CUDA path unchanged — same values defined for both branches

Test environment

  • GPU: AMD Radeon R9700 Pro (gfx1201, RDNA 4), 32 GB VRAM
  • ROCm: 7.2.3
  • PyTorch: 2.9.1+rocm7.2.3
  • Container: rocm/pytorch:rocm7.2.3_ubuntu24.04_py3.12_pytorch_release_2.9.1

Test results

Build: pip install --no-build-isolation -e .success (was failing before)
test_hgemm.py: 24/24 GEMM shapes passed

import torch
from exllamav2.ext import exllamav2_ext as ext_c

# Test RMS norm
dim = 4096
x = torch.randn(1, dim, dtype=torch.half, device='cuda:0')
w = torch.randn(dim, dtype=torch.half, device='cuda:0')
y = torch.empty_like(x)
ext_c.rms_norm(x, w, y, 1e-6)
print(f'rms_norm: input={x.shape}, output={y.shape}, has_nan={torch.isnan(y).any().item()}, has_inf={torch.isinf(y).any().item()}')

# Test Layer norm
x2 = torch.randn(1, dim, dtype=torch.half, device='cuda:0')
w2 = torch.randn(dim, dtype=torch.half, device='cuda:0')
b2 = torch.randn(dim, dtype=torch.half, device='cuda:0')
y2 = torch.empty_like(x2)
ext_c.layer_norm(x2, w2, b2, y2, 1e-6)
print(f'layer_norm: input={x2.shape}, output={y2.shape}, has_nan={torch.isnan(y2).any().item()}, has_inf={torch.isinf(y2).any().item()}')

print('ALL KERNEL TESTS PASSED')
"""output
rms_norm: input=torch.Size([1, 4096]), output=torch.Size([1, 4096]), has_nan=False, has_inf=False
layer_norm: input=torch.Size([1, 4096]), output=torch.Size([1, 4096]), has_nan=False, has_inf=False
ALL KERNEL TESTS PASSED
"""

Signed-off-by: Kaihui-AMD <Kaihui.Tang@amd.com>
@Kaihui-AMD

Copy link
Copy Markdown
Author

@turboderp @kingbri1 Would you mind taking a look? This fixes an RDNA4 (gfx1201) compile blocker in rms_norm.cu and layer_norm.cu — two small changes, CUDA path unchanged. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant