System Info
- GPU: AMD Radeon AI PRO R9700 (gfx1201, wavefront=32)
- ROCm: 7.2.3
- PyTorch: 2.9.1+rocm6.3
- OS: Ubuntu 24.04
Issues Found
1. Compile Error: texture_cache_iterator.hpp - memset not available on AMD platform
/opt/rocm-7.2.3/include/rocprim/iterator/texture_cache_iterator.hpp:178:
memset(&resourse_desc, 0, sizeof(hipResourceDesc));
The memset declared in amd_detail/amd_device_functions.h is __device__ only. The bind_texture method runs on the host, so the call does not resolve. Adding #include <cstring> before rocprim.hpp works around the memset error, but that is not a complete fix: the texture object creation path (hipCreateChannelDesc) exists only under nvidia_detail/ and is not available on AMD. The texture_cache_iterator should be guarded with #ifdef __HIP_PLATFORM_NVIDIA__.
2. Runtime Crash: Memory Access Fault at large data sizes
rocprim operations (exclusive_scan, radix_sort_pairs, select, run_length_encode) corrupt the heap when processing more than 500K elements on gfx1201. The corruption surfaces later as a crash in hipFree.
Reproducer (works standalone at N=800K but crashes when chained in a larger pipeline):
import torch
from cumesh._C import CuMesh

V = 800_000       # vertex count
F = int(V * 1.5)  # face count

v = torch.randn(V, 3, device="cuda")
f = torch.randint(0, V, (F, 3), device="cuda", dtype=torch.int32)

m = CuMesh()
m.init(v, f)
m.get_edges()
m.get_vertex_face_adjacency()
m.get_edge_face_adjacency()  # ← Memory access fault
Root cause hypothesis: gfx1201 has wavefront size 32 (not 64 as on GCN/CDNA targets). rocprim's internal temp storage calculations and/or thread block layouts may assume wavefront=64, leading to a buffer overflow in multi-stage pipelines.
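Before digging into the hypothesis, it may help to confirm what wavefront size the runtime actually reports for the device. The following is a minimal sketch, not part of the original reproducer. `describe_device` is a hypothetical helper, and it assumes the PyTorch build exposes `warp_size` and `gcnArchName` on the device-properties object (recent ROCm builds do, but missing fields fall back to `None`).

```python
from types import SimpleNamespace


def describe_device(props):
    """Collect the device fields relevant to the wavefront hypothesis.

    `props` is any object with device-property attributes, e.g. the result
    of torch.cuda.get_device_properties(0). Missing attributes become None.
    """
    return {
        "name": getattr(props, "name", None),
        "arch": getattr(props, "gcnArchName", None),    # assumption: ROCm builds only
        "warp_size": getattr(props, "warp_size", None), # assumption: recent PyTorch only
    }


# Stand-in object so the sketch runs without a GPU; the values mirror the
# System Info section above and are not measured output.
stub = SimpleNamespace(name="AMD Radeon AI PRO R9700",
                       gcnArchName="gfx1201", warp_size=32)
print(describe_device(stub))

# On the affected machine (requires torch with a visible GPU):
#   import torch
#   print(describe_device(torch.cuda.get_device_properties(0)))
```

If the reported `warp_size` is 32, the device side matches the hypothesis. Whether rocprim's sizing logic assumes 64 would still need to be checked in the rocprim source.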
Note: The same code works correctly on an NVIDIA RTX 3060, and on gfx1201 at smaller scales (<500K elements).
The standalone rocprim operations (exclusive_scan, radix_sort_pairs, select) at N=800K do NOT crash. The crash only occurs when multiple operations are chained in sequence (as in a real-world DAG processing pipeline like mesh simplification/connectivity analysis).
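Because the crash only appears when operations are chained, and GPU work is queued asynchronously, one way to narrow down the offending stage is to synchronize after each call so a fault is reported close to the stage that triggered it. This is a minimal sketch under that assumption. `run_stages` is a hypothetical helper, not part of cumesh or rocprim, and since heap corruption can stay silent until a later free, the last stage reported as ok is only a lower bound on where things went wrong.

```python
def run_stages(stages, sync):
    """Run (name, callable) stages in order, synchronizing after each one.

    `sync` should block until all queued GPU work has finished, e.g.
    torch.cuda.synchronize. Returns the names of the stages that completed.
    """
    completed = []
    for name, stage in stages:
        stage()
        sync()  # force pending kernels to finish before moving on
        completed.append(name)
        # flush so the line survives if the next stage aborts the process
        print(f"ok: {name}", flush=True)
    return completed


# Usage with the reproducer above (requires torch, cumesh and a GPU):
#   import torch
#   run_stages(
#       [
#           ("get_edges", m.get_edges),
#           ("get_vertex_face_adjacency", m.get_vertex_face_adjacency),
#           ("get_edge_face_adjacency", m.get_edge_face_adjacency),
#       ],
#       torch.cuda.synchronize,
#   )
```

The last "ok:" line printed before the memory access fault identifies the last stage that finished cleanly.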