Benchmark and Tune Custom Fused CUDA Kernels for FlashAttention-2 parity #2

@Sarkar-AGI

Description

Our current custom fused CUDA kernels reach 890 GB/s of memory bandwidth. Tuning register reuse and shared memory allocation should let us match FlashAttention-2's performance profile.
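
To make the 890 GB/s figure reproducible, it may help to pin down how it is computed. The sketch below is one common convention (bytes read plus bytes written, divided by kernel time, with 1 GB = 1e9 bytes), not necessarily the one used for the number above. The function names are hypothetical, and the peak value assumes the H100 SXM variant with a vendor-quoted HBM3 peak of about 3.35 TB/s; other H100 variants have a different peak.

```python
# Assumption: vendor-quoted peak for H100 SXM (HBM3). Check the datasheet
# for the exact board being benchmarked before relying on this value.
H100_SXM_PEAK_GBS = 3350.0


def effective_bandwidth_gbs(bytes_read, bytes_written, elapsed_ms):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for one kernel launch."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    total_bytes = bytes_read + bytes_written
    return total_bytes / (elapsed_ms * 1e-3) / 1e9


def fraction_of_peak(achieved_gbs, peak_gbs=H100_SXM_PEAK_GBS):
    """Achieved bandwidth as a fraction of the assumed hardware peak."""
    return achieved_gbs / peak_gbs


if __name__ == "__main__":
    # Hypothetical run: 8.9e9 bytes moved in 10 ms, i.e. roughly 890 GB/s.
    bw = effective_bandwidth_gbs(bytes_read=6.0e9, bytes_written=2.9e9, elapsed_ms=10.0)
    print(f"{bw:.1f} GB/s, {fraction_of_peak(bw):.1%} of assumed peak")
```

Under that assumed peak, 890 GB/s is a bit over a quarter of the available bandwidth. Reporting the byte count and timing method alongside the headline number makes later comparisons against FlashAttention-2 easier to trust.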

Areas to Improve

  • Optimize thread-block tiling for the H100 architecture.
  • Minimize overhead from uncoalesced global memory accesses.
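
Both items above can be sanity-checked on paper before touching the kernels. The following is a rough model, not a description of the existing kernels: it assumes a FlashAttention-style layout that keeps one Q tile plus one K and one V tile in shared memory, a 227 KiB per-block shared memory limit on H100 (to be verified against the CUDA programming guide), and 32-byte global memory sectors. All function names and tile sizes are hypothetical.

```python
WARP_SIZE = 32                 # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32              # assumed global-memory transaction granularity
SMEM_LIMIT_BYTES = 227 * 1024  # assumed max shared memory per block on H100


def smem_bytes_per_block(block_m, block_n, head_dim, dtype_bytes=2):
    """Shared memory for one Q tile plus one K and one V tile.

    Simplified: ignores double buffering, padding, and scratch space.
    """
    q_tile = block_m * head_dim
    kv_tiles = 2 * block_n * head_dim
    return (q_tile + kv_tiles) * dtype_bytes


def sectors_touched(stride_elems, elem_bytes, base_addr=0):
    """Count distinct sectors hit when each lane loads one element.

    Lane i reads elem_bytes starting at base_addr + i * stride_elems * elem_bytes.
    """
    sectors = set()
    for lane in range(WARP_SIZE):
        start = base_addr + lane * stride_elems * elem_bytes
        first = start // SECTOR_BYTES
        last = (start + elem_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


def load_efficiency(stride_elems, elem_bytes):
    """Useful bytes divided by bytes actually fetched, for stride_elems >= 1."""
    useful = WARP_SIZE * elem_bytes
    fetched = sectors_touched(stride_elems, elem_bytes) * SECTOR_BYTES
    return useful / fetched


if __name__ == "__main__":
    tile = smem_bytes_per_block(block_m=128, block_n=64, head_dim=128)
    print(tile, "bytes per block, fits:", tile <= SMEM_LIMIT_BYTES)
    for stride in (1, 2, 8):
        print("stride", stride, "efficiency", load_efficiency(stride, 4))
```

In this model a unit-stride warp load of 4-byte elements uses every fetched byte, while a stride of 8 elements fetches eight times more data than it uses. Profiler counters on the real kernels remain the authoritative check.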
