AMDResearch · mawad-amd · Dec 2, 2025
@@ -17,6 +17,7 @@ Existing GPU profilers are **trash**:
 - **Human-readable metrics** instead of raw counters
 - **Unit tested** and reliable
 - **12 Memory Metrics**: Bandwidth, cache, coalescing, LDS, atomic latency
+- **5 Compute Metrics**: FLOPS, arithmetic intensity (HBM/L2/L1), compute throughput
 - **Multi-Run Profiling**: Automatic aggregation with min/max/avg statistics
 - **Kernel Filtering**: Efficient regex filtering at rocprofv3 level
 - **Multiple Output Formats**: Text, JSON, CSV
@@ -68,6 +69,8 @@ for kernel in results.kernels:
 - `memory.hbm_write_bandwidth` - HBM write bandwidth (GB/s)
 - `memory.hbm_bandwidth_utilization` - % of peak HBM bandwidth
 - `memory.bytes_transferred_hbm` - Total bytes through HBM
+- `memory.bytes_transferred_l2` - Total bytes through L2 cache
+- `memory.bytes_transferred_l1` - Total bytes through L1 cache
 
 ### Cache Performance
 - `memory.l1_hit_rate` - L1 cache hit rate (%)
@@ -85,13 +88,20 @@ for kernel in results.kernels:
 ### Atomic Operations
 - `memory.atomic_latency` - Atomic operation latency (cycles)
 
+### Compute Metrics
+- `compute.total_flops` - Total floating-point operations performed
+- `compute.hbm_gflops` - Compute throughput: total FLOPs divided by kernel execution time (GFLOPS)
+- `compute.hbm_arithmetic_intensity` - Ratio of FLOPs to HBM bytes (FLOP/byte)
+- `compute.l2_arithmetic_intensity` - Ratio of FLOPs to L2 bytes (FLOP/byte)
+- `compute.l1_arithmetic_intensity` - Ratio of FLOPs to L1 bytes (FLOP/byte)
+
 ## CLI Options
 
 ```
 metrix [options] <command>
 
 Options:
-  --profile, -p      Use pre-defined profile (quick, memory)
+  --profile, -p      Use pre-defined profile (quick, memory, compute)
   --metrics, -m      Comma-separated list of metrics
   --time-only        Only collect timing
   --kernel, -k       Filter by kernel name substring

@@ -162,6 +162,10 @@ def _split_counters_into_passes(self, counters: List[str]) -> List[List[str]]:
         Returns:
             List of counter lists, one per profiling pass
         """
+        # Handle empty counters (timing-only mode) - return single pass with no counters
+        if not counters:
+            return [[]]
+
         counter_groups = self._get_counter_groups()
         max_per_pass = 14  # Conservative limit for most AMD GPUs
 

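The early return in the hunk above matters because a timing-only run requests no counters, yet the caller still needs one pass to execute. A minimal sketch of that behavior follows. It assumes plain fixed-size chunking in place of the real counter-group logic; the name `split_into_passes` and the chunking strategy are illustrative, not the project's implementation.

```python
from typing import List


def split_into_passes(counters: List[str], max_per_pass: int = 14) -> List[List[str]]:
    """Hypothetical stand-in for _split_counters_into_passes.

    Chunks counters into fixed-size groups. The real method also consults
    counter groups, which this sketch ignores.
    """
    # Timing-only mode: no counters requested, but one pass must still run.
    if not counters:
        return [[]]

    # Assumed strategy: slice the list into chunks of at most max_per_pass.
    return [
        counters[i:i + max_per_pass]
        for i in range(0, len(counters), max_per_pass)
    ]


if __name__ == "__main__":
    print(split_into_passes([]))                                # one empty pass
    print(len(split_into_passes([f"C{i}" for i in range(30)])))  # three passes
```

In this sketch, dropping the guard would yield an empty list for empty input, so a loop over the passes would never launch the application at all.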
@@ -83,6 +83,24 @@ def _bytes_transferred_hbm(self, GRBM_GUI_ACTIVE):
         """
         return 0.0
 
+    @metric("memory.bytes_transferred_l2")
+    def _bytes_transferred_l2(self):
+        """
+        Total bytes transferred through L2 cache
+
+        Formula: TCC_REQ_sum * 128 (L2 cache line size is 128 bytes)
+        """
+        return 0.0
+
+    @metric("memory.bytes_transferred_l1")
+    def _bytes_transferred_l1(self):
+        """
+        Total bytes transferred through L1 cache
+
+        Formula: TCP_TOTAL_CACHE_ACCESSES_sum * cache_line_size (architecture-dependent)
+        """
+        return 0.0
+
     # Cache metrics
 
     @metric("memory.l2_hit_rate")
@@ -173,3 +191,49 @@ def _atomic_latency(self):
 
         return 0.0
 
+    # Compute metrics
+
+    @metric("compute.total_flops")
+    def _total_flops(self):
+        """
+        Total floating-point operations performed by the kernel
+
+        Formula: 64 * (FP16 + FP32 + FP64) + 512 * MFMA
+        """
+        return 0.0
+
+    @metric("compute.hbm_gflops")
+    def _hbm_gflops(self):
+        """
+        Compute throughput (GFLOPS) normalized by kernel execution time
+
+        Formula: (total_flops / 1e9) / time_seconds
+        """
+        return 0.0
+
+    @metric("compute.hbm_arithmetic_intensity")
+    def _hbm_arithmetic_intensity(self):
+        """
+        HBM Arithmetic Intensity: ratio of floating-point operations to HBM bytes transferred (FLOP/byte)
+
+        Formula: total_flops / hbm_bytes
+        """
+        return 0.0
+
+    @metric("compute.l2_arithmetic_intensity")
+    def _l2_arithmetic_intensity(self):
+        """
+        L2 Arithmetic Intensity: ratio of floating-point operations to L2 cache bytes accessed (FLOP/byte)
+
+        Formula: total_flops / l2_bytes
+        """
+        return 0.0
+
+    @metric("compute.l1_arithmetic_intensity")
+    def _l1_arithmetic_intensity(self):
+        """
+        L1 Arithmetic Intensity: ratio of floating-point operations to L1 cache bytes accessed (FLOP/byte)
+
+        Formula: total_flops / l1_bytes
+        """
+        return 0.0
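
The formulas in the docstrings above can be checked with a small worked example. This is a sketch only: the function and parameter names are hypothetical, the inputs are assumed to be already-aggregated instruction and request counts, and returning `0.0` for a zero denominator is an assumption, not confirmed behavior of the backend implementations.

```python
def total_flops(fp16: int, fp32: int, fp64: int, mfma: int) -> int:
    # Docstring formula: 64 * (FP16 + FP32 + FP64) + 512 * MFMA
    return 64 * (fp16 + fp32 + fp64) + 512 * mfma


def gflops(flops: float, time_seconds: float) -> float:
    # Docstring formula: (total_flops / 1e9) / time_seconds
    # Assumption: report 0.0 when no time was measured.
    return (flops / 1e9) / time_seconds if time_seconds > 0 else 0.0


def l2_bytes(tcc_req_sum: int) -> int:
    # Docstring formula: TCC_REQ_sum * 128 (L2 cache line size is 128 bytes)
    return tcc_req_sum * 128


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    # Docstring formula: total_flops / bytes at the chosen memory level
    # Assumption: report 0.0 when no bytes moved.
    return flops / bytes_moved if bytes_moved > 0 else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration, not measured data.
    flops = total_flops(fp16=0, fp32=1_000_000, fp64=0, mfma=10_000)
    print(flops)  # 69120000
    print(gflops(flops, time_seconds=1e-3))
    print(arithmetic_intensity(flops, l2_bytes(100_000)))
```

The same `arithmetic_intensity` helper covers the HBM, L2, and L1 variants, since they differ only in which byte count is used as the denominator. The L1 byte count depends on an architecture-specific cache line size, so it is left out of the sketch.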