rocprim ops crash on gfx1201 (RDNA4) at large data size (>500K elements)

## System Info
- **GPU**: AMD Radeon AI PRO R9700 (gfx1201, wavefront=32)
- **ROCm**: 7.2.3
- **PyTorch**: 2.9.1+rocm6.3
- **OS**: Ubuntu 24.04

## Issues Found

### 1. Compile Error: `texture_cache_iterator.hpp` - `memset` not available on AMD platform

`/opt/rocm-7.2.3/include/rocprim/iterator/texture_cache_iterator.hpp:178`:
```cpp
memset(&resourse_desc, 0, sizeof(hipResourceDesc));
```

`memset` in `amd_detail/amd_device_functions.h` is declared `__device__` only, but `bind_texture` runs on the host, so the call does not resolve. Adding `#include <cstring>` before `rocprim.hpp` works around the compile error. However, the texture object creation (`hipCreateChannelDesc`) is only in `nvidia_detail/` and is not available on AMD, so `texture_cache_iterator` should be guarded with `#ifdef __HIP_PLATFORM_NVIDIA__`.

### 2. Runtime Crash: Memory Access Fault at large data sizes

rocprim operations (`exclusive_scan`, `radix_sort_pairs`, `select`, `run_length_encode`) corrupt the heap when processing >500K elements on gfx1201. The crash is delayed and surfaces later in `hipFree`.

**Reproducer** (the individual rocprim operations pass at N=800K; the fault appears only when they are chained, as in this pipeline):

```python
import torch
from cumesh._C import CuMesh

V = 800_000        # vertex count; the fault appears above ~500K elements
F = int(V * 1.5)   # face count
v = torch.randn(V, 3, device="cuda")
f = torch.randint(0, V, (F, 3), device="cuda", dtype=torch.int32)

m = CuMesh()
m.init(v, f)
m.get_edges()
m.get_vertex_face_adjacency()
m.get_edge_face_adjacency()  # ← Memory access fault
```

**Root cause hypothesis**: gfx1201 runs with wavefront size 32, whereas GCN/CDNA targets use 64. rocprim's internal temporary-storage size calculations and/or thread block layouts may assume wavefront=64, leading to a buffer overflow in multi-stage pipelines.

**Note**: Same code works correctly on NVIDIA RTX 3060 and on gfx1201 at smaller scales (<500K elements).

The standalone rocprim operations (exclusive_scan, radix_sort_pairs, select) at N=800K do NOT crash. The crash only occurs when multiple operations are chained in sequence (as in a real-world DAG processing pipeline like mesh simplification/connectivity analysis).
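
To narrow down the ~500K threshold, the size can be bisected with each run in a fresh process, since a memory access fault kills the interpreter. The following is only a sketch: `crashes`, `smallest_crashing_n`, and `REPRO` are hypothetical names introduced here, and it assumes the failure is deterministic and monotonic in N, which may not hold for heap corruption.

```python
import subprocess
import sys

# Hypothetical child script: the reproducer above, with V taken from argv[1].
REPRO = """
import sys, torch
from cumesh._C import CuMesh
V = int(sys.argv[1]); F = int(V * 1.5)
v = torch.randn(V, 3, device="cuda")
f = torch.randint(0, V, (F, 3), device="cuda", dtype=torch.int32)
m = CuMesh(); m.init(v, f)
m.get_edges(); m.get_vertex_face_adjacency(); m.get_edge_face_adjacency()
torch.cuda.synchronize()
"""


def crashes(script: str, n: int, timeout: float = 600.0) -> bool:
    """Run `script` in a fresh interpreter with n as argv[1].

    Returns True if the child exits with a non-zero status (including
    being killed by a signal) or does not finish within `timeout` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script, str(n)],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return True
    return proc.returncode != 0


def smallest_crashing_n(script: str, lo: int, hi: int) -> int:
    """Binary-search the smallest n in (lo, hi] for which `script` crashes.

    Assumes n == lo passes, n == hi crashes, and crashing is monotonic in n.
    """
    if crashes(script, lo):
        raise ValueError("lower bound already crashes")
    if not crashes(script, hi):
        raise ValueError("upper bound does not crash")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(script, mid):
            hi = mid
        else:
            lo = mid
    return hi
```

On the affected machine this would be called as `smallest_crashing_n(REPRO, 100_000, 800_000)`. If the result varies between runs, the monotonicity assumption does not hold and the reported threshold is only approximate.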
