TFLOPS Benchmark

A PyTorch-based benchmarking tool for measuring matrix multiplication performance (TFLOPS - Tera Floating Point Operations Per Second) across different devices and data types.

Overview

This benchmark measures the computational throughput of matrix multiplication operations on GPUs (CUDA and MPS) using various floating-point precisions. It helps evaluate hardware performance for deep learning workloads.

Features

Multi-device support: CUDA and MPS (Apple Silicon) GPUs
Multiple data types: float32, float16, and bfloat16 (when supported)
Configurable matrix sizes: Test with different matrix dimensions
Performance visualization: Generates charts showing performance vs matrix size
Automatic device detection: Selects available GPU automatically

Usage

Basic Usage

Run the default benchmark sweep across standard matrix sizes (256, 512, 1024, 2048, 4096, 8192):

uv run tflops_bench.py

This will:

Test all available GPU devices
Test all supported data types per device
Generate performance charts saved as duration_vs_size_combined.png

Single Size Test

Test with a specific matrix size:

uv run tflops_bench.py --size 4096

Specific Device/Dtype

Test a specific device and data type combination:

uv run tflops_bench.py --device cuda --dtype float16 --size 2048

Command Line Options

--size N: Matrix size for NxN multiplication (default: runs sweep)
--iterations N: Number of iterations for timing (default: 100)
--device [cuda|mps|cpu]: Specific device to test
--dtype [float32|float16|bfloat16]: Specific data type
--verbose: Show detailed per-test output

How It Works

Matrix Generation: Creates random NxN matrices on the target device
Warmup Phase: Performs 10 warmup iterations to stabilize GPU state
Timed Execution: Runs the specified number of matrix multiplications
TFLOPS Calculation: Computes performance as 2*N³*iterations / (time * 10¹²)
Device Synchronization: Ensures all operations complete before timing

Performance Metrics

The benchmark calculates TFLOPS based on the formula:

FLOPs per matmul = 2 × N³ (for NxN matrices)
TFLOPS = (Total FLOPs) / (Time in seconds × 10¹²)

Output

Console Output

Per-configuration performance results
Summary table showing TFLOPS and execution time

Charts (when running default sweep)

Generates a 4-panel chart showing:

Duration vs matrix size (linear scale)
Duration vs matrix size (log-log scale)
TFLOPS vs matrix size (linear scale)
TFLOPS vs matrix size (semi-log scale)

Device Support Notes

CUDA

Supports float32, float16
Supports bfloat16 on Ampere (SM80) and newer GPUs
Automatic capability detection

MPS (Apple Silicon)

Supports float32, float16
bfloat16 support depends on macOS version and hardware
Gracefully handles unsupported configurations

Requirements

Python e 3.10
PyTorch e 2.8.0
matplotlib e 3.10.5 (for charts)
numpy e 2.3.2

Example Results

Typical performance ranges (vary by hardware):

Consumer GPUs: 10-50 TFLOPS (float32), 20-100 TFLOPS (float16)
Data center GPUs: 50-200 TFLOPS (float32), 100-500 TFLOPS (float16/bfloat16)
Apple Silicon: 5-20 TFLOPS depending on chip and precision

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
tflops_bench.py		tflops_bench.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TFLOPS Benchmark

Overview

Features

Usage

Basic Usage

Single Size Test

Specific Device/Dtype

Command Line Options

How It Works

Performance Metrics

Output

Console Output

Charts (when running default sweep)

Device Support Notes

CUDA

MPS (Apple Silicon)

Requirements

Example Results

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TFLOPS Benchmark

Overview

Features

Usage

Basic Usage

Single Size Test

Specific Device/Dtype

Command Line Options

How It Works

Performance Metrics

Output

Console Output

Charts (when running default sweep)

Device Support Notes

CUDA

MPS (Apple Silicon)

Requirements

Example Results

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages