Hand-tuned NVIDIA SASS kernels for the RTX 3070 Ti (GA104, sm_86): 31,910 GFLOPS HGEMM, 41,721 dense-equivalent GFLOPS with 2:4 sparsity, and 11,453 GFLOPS Flash Attention, with no cuBLAS, cuDNN, or PyTorch dependency. Includes cuasmR, a CRAN-ready R package for reading and writing cubin files and measuring GPU benchmarks. Also includes a 6-chapter tutorial and a Chladni-pattern memory layout study.
Updated Jun 5, 2026 - Cuda
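
The throughput figures above are quoted in GFLOPS. As a rough illustration of how such numbers are usually derived, here is a minimal Python sketch. It assumes the common convention of 2·M·N·K floating-point operations per GEMM, and it assumes that "dense-equivalent" means a 2:4 sparse kernel is credited with the full dense operation count even though it executes only half of the multiply-adds. The function names and the timing value are hypothetical and are not taken from the repository.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Dense GEMM throughput, counting one multiply and one add per MAC."""
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


def dense_equiv_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Assumed convention: a 2:4 sparse kernel is credited with the dense FLOP count."""
    return gemm_gflops(m, n, k, seconds)


def executed_gflops_2to4(m: int, n: int, k: int, seconds: float) -> float:
    """FLOPs actually executed under 2:4 sparsity: half of the K dimension is skipped."""
    return dense_equiv_gflops(m, n, k, seconds) / 2.0


if __name__ == "__main__":
    # Hypothetical problem size and timing, chosen only to keep the arithmetic simple.
    m = n = k = 1000
    elapsed = 0.5  # seconds, made up for illustration
    print(gemm_gflops(m, n, k, elapsed))
    print(executed_gflops_2to4(m, n, k, elapsed))
```

Under this convention a dense-equivalent figure can exceed the dense HGEMM figure on the same hardware, because the sparse kernel finishes the same nominal problem in less time.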