Merged
12 changes: 11 additions & 1 deletion src/metrix/README.md
@@ -17,6 +17,7 @@ Existing GPU profilers are **trash**:
- **Human-readable metrics** instead of raw counters
- **Unit tested** and reliable
- **12 Memory Metrics**: Bandwidth, cache, coalescing, LDS, atomic latency
- **7 Compute Metrics**: FLOPS, arithmetic intensity (HBM/L2/L1), compute throughput
- **Multi-Run Profiling**: Automatic aggregation with min/max/avg statistics
- **Kernel Filtering**: Efficient regex filtering at rocprofv3 level
- **Multiple Output Formats**: Text, JSON, CSV
@@ -68,6 +69,8 @@ for kernel in results.kernels:
- `memory.hbm_write_bandwidth` - HBM write bandwidth (GB/s)
- `memory.hbm_bandwidth_utilization` - % of peak HBM bandwidth
- `memory.bytes_transferred_hbm` - Total bytes through HBM
- `memory.bytes_transferred_l2` - Total bytes through L2 cache
- `memory.bytes_transferred_l1` - Total bytes through L1 cache
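
As a rough illustration of how the two new byte counts could be derived, the sketch below applies the formulas quoted in the gfx1201 backend docstrings of this PR (`TCC_REQ_sum * 128` for L2, `TCP_TOTAL_CACHE_ACCESSES_sum * cache_line_size` for L1). The function names and the 64-byte L1 line size in the usage lines are assumptions made for the example, not part of Metrix; the gfx1201 backend in this diff returns `0.0` for both metrics.

```python
# Hypothetical helpers; only the formulas come from the backend docstrings.
L2_CACHE_LINE_BYTES = 128  # "L2 cache line size is 128 bytes" per the docstring


def bytes_transferred_l2(tcc_req_sum: int) -> int:
    """Bytes through L2: one cache line per TCC request."""
    return tcc_req_sum * L2_CACHE_LINE_BYTES


def bytes_transferred_l1(tcp_total_cache_accesses_sum: int, cache_line_bytes: int) -> int:
    """Bytes through L1: the line size is architecture-dependent, so it is a parameter."""
    return tcp_total_cache_accesses_sum * cache_line_bytes


# Usage with made-up counter values (64-byte L1 lines are an assumption).
l2_bytes = bytes_transferred_l2(1_000)
l1_bytes = bytes_transferred_l1(1_000, cache_line_bytes=64)
print(l2_bytes, l1_bytes)
```

Passing the L1 line size explicitly keeps the architecture dependence visible at the call site instead of hiding it in a constant.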

### Cache Performance
- `memory.l1_hit_rate` - L1 cache hit rate (%)
@@ -85,13 +88,20 @@ for kernel in results.kernels:
### Atomic Operations
- `memory.atomic_latency` - Atomic operation latency (cycles)

### Compute Metrics
- `compute.total_flops` - Total floating-point operations performed
- `compute.hbm_gflops` - Compute throughput (GFLOPS)
- `compute.hbm_arithmetic_intensity` - Ratio of FLOPs to HBM bytes (FLOP/byte)
- `compute.l2_arithmetic_intensity` - Ratio of FLOPs to L2 bytes (FLOP/byte)
- `compute.l1_arithmetic_intensity` - Ratio of FLOPs to L1 bytes (FLOP/byte)
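
The compute metrics above are simple ratios, so a short sketch may help show how they relate. It follows the formulas given in the gfx1201 backend docstrings of this PR (`64 * (FP16 + FP32 + FP64) + 512 * MFMA`, `(total_flops / 1e9) / time_seconds`, and `total_flops / bytes`). The function names, the zero-division guards, and the sample numbers are assumptions for illustration, not Metrix's implementation.

```python
# Hypothetical helpers; only the formulas come from the backend docstrings.

def total_flops(fp16: int, fp32: int, fp64: int, mfma: int) -> int:
    """FLOP count from instruction counters: 64 per vector op, 512 per MFMA op."""
    return 64 * (fp16 + fp32 + fp64) + 512 * mfma


def gflops(flops: float, time_seconds: float) -> float:
    """Throughput in GFLOPS; returns 0.0 for a non-positive duration (assumed guard)."""
    if time_seconds <= 0:
        return 0.0
    return (flops / 1e9) / time_seconds


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOP/byte at one memory level (HBM, L2, or L1); 0.0 if no bytes moved (assumed guard)."""
    if bytes_moved <= 0:
        return 0.0
    return flops / bytes_moved


# Usage with made-up counter values.
flops = total_flops(fp16=1, fp32=2, fp64=3, mfma=4)
print(flops)
print(gflops(2e9, time_seconds=0.5))
print(arithmetic_intensity(flops, bytes_moved=1216))
```

The same `arithmetic_intensity` helper serves all three levels; only the byte count passed in changes (`memory.bytes_transferred_hbm`, `_l2`, or `_l1`).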

## CLI Options

```
metrix [options] <command>

Options:
-  --profile, -p Use pre-defined profile (quick, memory)
+  --profile, -p Use pre-defined profile (quick, memory, compute)
--metrics, -m Comma-separated list of metrics
--time-only Only collect timing
--kernel, -k Filter by kernel name substring
4 changes: 4 additions & 0 deletions src/metrix/src/metrix/backends/base.py
@@ -162,6 +162,10 @@ def _split_counters_into_passes(self, counters: List[str]) -> List[List[str]]:
        Returns:
            List of counter lists, one per profiling pass
        """
        # Handle empty counters (timing-only mode) - return single pass with no counters
        if not counters:
            return [[]]

        counter_groups = self._get_counter_groups()
        max_per_pass = 14  # Conservative limit for most AMD GPUs

64 changes: 64 additions & 0 deletions src/metrix/src/metrix/backends/gfx1201.py
@@ -83,6 +83,24 @@ def _bytes_transferred_hbm(self, GRBM_GUI_ACTIVE):
        """
        return 0.0

    @metric("memory.bytes_transferred_l2")
    def _bytes_transferred_l2(self):
        """
        Total bytes transferred through L2 cache

        Formula: TCC_REQ_sum * 128 (L2 cache line size is 128 bytes)
        """
        return 0.0

    @metric("memory.bytes_transferred_l1")
    def _bytes_transferred_l1(self):
        """
        Total bytes transferred through L1 cache

        Formula: TCP_TOTAL_CACHE_ACCESSES_sum * cache_line_size (architecture-dependent)
        """
        return 0.0

    # Cache metrics

    @metric("memory.l2_hit_rate")
@@ -173,3 +191,49 @@ def _atomic_latency(self):

        return 0.0

    # Compute metrics

    @metric("compute.total_flops")
    def _total_flops(self):
        """
        Total floating-point operations performed by the kernel

        Formula: 64 * (FP16 + FP32 + FP64) + 512 * MFMA
        """
        return 0.0

    @metric("compute.hbm_gflops")
    def _hbm_gflops(self):
        """
        Compute throughput (GFLOPS) normalized by kernel execution time

        Formula: (total_flops / 1e9) / time_seconds
        """
        return 0.0

    @metric("compute.hbm_arithmetic_intensity")
    def _hbm_arithmetic_intensity(self):
        """
        HBM Arithmetic Intensity: ratio of floating-point operations to HBM bytes transferred (FLOP/byte)

        Formula: total_flops / hbm_bytes
        """
        return 0.0

    @metric("compute.l2_arithmetic_intensity")
    def _l2_arithmetic_intensity(self):
        """
        L2 Arithmetic Intensity: ratio of floating-point operations to L2 cache bytes accessed (FLOP/byte)

        Formula: total_flops / l2_bytes
        """
        return 0.0

    @metric("compute.l1_arithmetic_intensity")
    def _l1_arithmetic_intensity(self):
        """
        L1 Arithmetic Intensity: ratio of floating-point operations to L1 cache bytes accessed (FLOP/byte)

        Formula: total_flops / l1_bytes
        """
        return 0.0