Description
Our current custom fused CUDA kernels sustain 890 GB/s of memory bandwidth. There is still room to improve register reuse and shared memory allocation so that the kernels match FlashAttention-2 performance.
Areas to Improve
- Optimize thread-block tiling for the H100 architecture.
- Minimize overhead from uncoalesced global memory accesses.
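To make the coalescing item concrete, here is a minimal sketch of a simplified cost model, not a measurement of our kernels. It counts how many memory sectors a single warp-wide load touches for a given access stride. The function names, the 32-byte sector size, the 32-thread warp, and the 4-byte element size are illustrative assumptions; real hardware behavior also depends on alignment and caching.

```python
# Simplified model (assumption): each warp-wide load is served in
# fixed-size sectors, and every touched sector is transferred in full.

def sectors_touched(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Count the distinct sectors one warp touches when lane i reads
    element i * stride_elems from a base address aligned to a sector."""
    sectors = set()
    for lane in range(warp_size):
        first_byte = lane * stride_elems * elem_bytes
        last_byte = first_byte + elem_bytes - 1
        # An element may straddle a sector boundary, so add every sector it covers.
        for sector in range(first_byte // sector_bytes, last_byte // sector_bytes + 1):
            sectors.add(sector)
    return len(sectors)


def load_efficiency(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Ratio of bytes the warp actually uses to bytes moved under this model."""
    useful_bytes = warp_size * elem_bytes
    moved_bytes = sectors_touched(stride_elems, elem_bytes, warp_size, sector_bytes) * sector_bytes
    return useful_bytes / moved_bytes


if __name__ == "__main__":
    for stride in (1, 2, 8):
        print(stride, sectors_touched(stride), load_efficiency(stride))
```

Under these assumptions, a unit-stride load uses every byte it moves, while a stride of 8 elements moves eight times more data than the warp consumes. That is the kind of overhead the tiling and layout work should remove.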