This repository provides a comprehensive benchmarking suite for evaluating the performance characteristics of Matrix Multiplication (
It is designed to study the impact of memory hierarchies, parallelization strategies, and hardware accelerators on computational bottlenecks in High-Performance Computing (HPC) and Machine Learning (ML) workloads.
- Algorithmic Breadth: Implements and compares multiple matrix multiplication strategies:
  - CPU Naive & Cache-Optimized: Demonstrates the impact of loop ordering (i-k-j) on L1/L2 cache locality.
  - GPU Naive (Global Memory): Establishes a baseline for GPU execution, demonstrating memory bandwidth limitations.
  - GPU Tiled (Shared Memory): Exploits fast on-chip shared memory to reduce global memory transactions, easing the global-memory bandwidth bottleneck.
  - Tensor Core (WMMA): Hardware-accelerated mixed-precision (FP16 inputs, FP32 accumulation) matrix multiplication using the Warp Matrix Multiply Accumulate (WMMA) APIs.
- Precision Flexibility: Templated C++ implementations support `int`, `float`, and `double`, handling precision casting (e.g., FP32 $\to$ FP16) for specialized hardware execution.
- High-Resolution Profiling: Uses `std::chrono` (CPU) and `cudaEvent_t` (GPU) for microsecond-level timing and GFLOPS (Giga Floating Point Operations Per Second) computation.
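
The loop-ordering and timing ideas above can be sketched in a few lines of C++. This is a minimal illustration, not the repository's actual code: the function names (`matmul_ijk`, `matmul_ikj`, `time_seconds`, `gflops`), the row-major square-matrix layout, and the use of `std::chrono::steady_clock` are all assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Naive i-j-k order: the inner loop walks B down a column (stride n),
// which tends to miss in cache for large n.
template <typename T>
void matmul_ijk(const std::vector<T>& A, const std::vector<T>& B,
                std::vector<T>& C, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            T sum = T{};
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// i-k-j order: the inner loop walks B and C along rows (stride 1).
// C must be zero-initialized by the caller because it is accumulated into.
template <typename T>
void matmul_ikj(const std::vector<T>& A, const std::vector<T>& B,
                std::vector<T>& C, std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const T a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}

// Runs f once and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Achieved GFLOPS for an n x n x n multiply, counting one multiply and
// one add per inner-loop iteration (2 * n^3 operations in total).
inline double gflops(std::size_t n, double seconds) {
    const double ops = 2.0 * static_cast<double>(n) * static_cast<double>(n) *
                       static_cast<double>(n);
    return ops / (seconds * 1e9);
}
```

A caller would wrap either kernel in a lambda, for example `time_seconds([&] { matmul_ikj(A, B, C, n); })`, and pass the result to `gflops(n, t)`. Both orderings produce the same product; only the memory access pattern differs, so any measured gap between them depends on the matrix size and the cache sizes of the machine.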
The core of this project investigates how different memory access patterns and specialized compute cores impact theoretical vs. achieved GFLOPS.
The GPU phase reveals critical insights into CUDA optimization, mapping directly to modern AI infrastructure challenges:
- Memory Bound (Naive): Redundant global memory accesses bottleneck the computation, keeping CUDA cores starved for data.
- Compute Bound (Tiled): By cooperatively loading data into Shared Memory (on-chip, software-managed storage that on many architectures shares hardware with the L1 cache) in block tiles (e.g., $32 \times 32$), the algorithm reuses each loaded element across the whole tile. This cuts global memory traffic and moves the kernel toward the compute limit of standard FP32 cores.
- Speed of Light (Tensor Cores): By dispatching $16 \times 16 \times 16$ fragments directly to Tensor Cores, throughput rises well beyond what the FP32 cores can deliver, reflecting the hardware paradigm driving the training and inference of modern Large Language Models (LLMs).
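
To make the memory-bound versus tiled distinction concrete, the following CPU-side sketch emulates the two access patterns and counts reads from the "global" arrays. It is a model under stated assumptions, not the repository's CUDA kernels: the local arrays `As` and `Bs` stand in for `__shared__` buffers, `TILE` is set to a hypothetical 4 (instead of 32) to keep the example small, and the function names are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 4;  // hypothetical tile width for this sketch

// Naive pattern: every multiply-add reads one element of A and one of B
// straight from "global" memory. Returns the number of global reads.
std::size_t matmul_naive_counted(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::vector<float>& C, std::size_t n) {
    std::size_t loads = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
                loads += 2;
            }
            C[i * n + j] = sum;
        }
    return loads;
}

// Tiled pattern: each TILE x TILE output block stages one tile of A and one
// tile of B per k-step, then computes only from the staged copies.
// Requires n to be a multiple of TILE. Returns the number of global reads.
std::size_t matmul_tiled_counted(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 std::vector<float>& C, std::size_t n) {
    assert(n % TILE == 0);
    std::size_t loads = 0;
    float As[TILE][TILE], Bs[TILE][TILE];  // stand-ins for shared memory
    for (std::size_t bi = 0; bi < n; bi += TILE)
        for (std::size_t bj = 0; bj < n; bj += TILE) {
            float acc[TILE][TILE] = {};
            for (std::size_t bk = 0; bk < n; bk += TILE) {
                // "Cooperative load": one global read per staged element.
                for (std::size_t ti = 0; ti < TILE; ++ti)
                    for (std::size_t tj = 0; tj < TILE; ++tj) {
                        As[ti][tj] = A[(bi + ti) * n + (bk + tj)];
                        Bs[ti][tj] = B[(bk + ti) * n + (bj + tj)];
                        loads += 2;
                    }
                // Compute phase touches only the staged tiles.
                for (std::size_t ti = 0; ti < TILE; ++ti)
                    for (std::size_t tj = 0; tj < TILE; ++tj)
                        for (std::size_t tk = 0; tk < TILE; ++tk)
                            acc[ti][tj] += As[ti][tk] * Bs[tk][tj];
            }
            for (std::size_t ti = 0; ti < TILE; ++ti)
                for (std::size_t tj = 0; tj < TILE; ++tj)
                    C[(bi + ti) * n + (bj + tj)] = acc[ti][tj];
        }
    return loads;
}
```

In this model the naive pattern issues $2n^3$ global reads and the tiled pattern issues $2n^3 / \text{TILE}$, so a $32 \times 32$ tile would reduce global traffic by a factor of 32 for the same arithmetic. Real kernels also depend on coalescing, bank conflicts, and occupancy, which this count does not capture.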
- `phase1-matmul-singlecore/`: Baseline sequential CPU implementations and loop-ordering optimizations.
- `phase2-matmul-multicore/`: OpenMP extensions for parallel CPU execution.
- `phase3-matmul-gpu/`: CUDA implementations (Naive, Shared Memory, Tensor Cores), Jupyter notebooks, and extensive architectural documentation.

cd phase1-matmul-singlecore
g++ -O3 -std=c++17 matmul-singlecore.cpp -o benchmark_singlecore
./benchmark_singlecore

cd phase2-matmul-multicore
g++ -O3 -fopenmp -std=c++17 matmul_multicore.cpp -o benchmark_multicore
./benchmark_multicore

Requires an NVIDIA GPU with Compute Capability >= 7.0 (Volta, Turing, Ampere, Ada, Hopper) for Tensor Core support.

cd phase3-matmul-gpu
nvcc -arch=sm_75 matmul_gpu.cu -o benchmark_gpu
./benchmark_gpu