Align rocmlir-tuning-driver benchmarking with Triton do_bench by dhernandez0 · Pull Request #2407 · ROCm/rocMLIR

dhernandez0 · 2026-06-11T08:06:08Z

Motivation

Backport the benchmarking changes from rocmlirTriton PR https://github.com/ROCm/rocmlirTriton/pull/280 to align
the tuning driver's timing methodology with Triton's do_bench, giving
more stable and consistent performance numbers. (The last-level-cache
flushing part of that PR is intentionally excluded.)

Technical Details

--rep / --warmup are now millisecond time budgets instead of
iteration counts; the actual iteration counts are derived from these
budgets and an estimated per-launch runtime.
measureKernel pre-allocates one event pair per iteration, records
them back-to-back, and synchronizes only once at the end (matching
do_bench's low host-overhead pattern).
Removed the small-kernel CPU-timer special case in favor of a single
measurement path.
Updated perfRunner.py / tuningRunner.py to pass the new --rep /
--warmup flags.

Test Plan

PR CI

Test Result

All tests pass.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

rocmlir-pr-reviewer · 2026-06-11T08:16:06Z

+  constexpr unsigned estimateRuns = 5;
+  double estimateMs = 0.0;
+  {
+    hipEvent_t startEvent, stopEvent;


Unlike the main measureKernel loop (which guards its event vectors with llvm::make_scope_exit), this estimate block creates startEvent/stopEvent but only destroys them on the success path. If any intervening HIPCHECK fails (e.g. hipEventRecord or hipEventElapsedTime), the function returns failure() and both events leak. Consider wrapping their destruction in a llvm::make_scope_exit right after creation, mirroring the pattern already used a few lines below, so the cleanup is exception/early-return safe. (Minor: Major-section guidance on using MLIR/RAII allocation utilities rather than manual paired create/destroy.)

rocmlir-pr-reviewer

Verdict: COMMENT · Findings: 1 (0 Critical, 0 Major, 1 Minor)

Scope

Backports the tuning-driver benchmarking methodology from rocmlirTriton to match Triton's do_bench: --num-iterations/--warmup-iterations (iteration counts) become --rep/--warmup (millisecond budgets), iteration counts are derived from an estimated per-launch runtime, the small-kernel CPU-timer special case is removed in favor of a single event-based measureKernel, and the Python drivers (perfRunner.py, tuningRunner.py) are updated to pass the new flags. Touches rocmlir-tuning-driver.cpp and the two perf scripts only.

Findings

Minor: the per-launch estimate block in benchmarkKernels creates a startEvent/stopEvent pair without scope-based cleanup, so they leak on the HIP error paths (rocmlir-tuning-driver.cpp:463).

Notes

Spot-checked the BenchmarkParams aggregate-init reorder (rocmlir-tuning-driver.cpp:691): struct field order warmupMs, repMs matches the init list warmup, rep, so there is no field-swap bug.
Verified MLIR_N_REPEATS is still referenced after the rename — perfRunner.py:1841 and tuningRunner.py:1274 use it for --kernel-repeats — so the new comments are accurate and nothing is dead code.
The main measureKernel loop correctly uses llvm::make_scope_exit and initializes both event vectors to nullptr, so partial-failure cleanup there is sound.
Behavioral observation (not blocking, matches do_bench by design): removing the small-kernel CPU-timer path means a sub-microsecond kernel clamped to minMeasurableMs = 0.001 with repMs = 200 derives ~200k iterations and pre-allocates ~400k hipEvent_t. Realistic tiny kernels (5-50 µs) stay in the low-thousands, but it is worth keeping an eye on resource use for extremely fast kernels that previously took the CPU-timer fast path.
std::sort at line 535 is a pre-existing context line, not introduced here, so the checklist's llvm::sort guidance does not apply to this PR.

CI status

No non-self checks are in the fail/cancel buckets; remaining checks (premerge, Python perf tests, review) are still in progress at review time.

Copilot

Pull request overview

This PR backports benchmarking methodology updates to rocmlir-tuning-driver to align its timing approach with Triton’s do_bench, aiming for more stable and consistent kernel performance measurements during tuning/benchmark runs.

Changes:

Replace iteration-count flags (--num-iterations / --warmup-iterations) with millisecond time budgets (--rep / --warmup) and derive iteration counts from an estimated per-launch runtime.
Rework kernel timing to pre-allocate per-iteration HIP event pairs, record iterations back-to-back, and synchronize once at the end (lower host overhead).
Update perfRunner.py and tuningRunner.py to pass the new CLI flags and introduce corresponding time-budget constants.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
mlir/utils/performance/tuningRunner.py	Switch tuning-driver invocation to `--rep/--warmup` time-budget flags and add tuning defaults.
mlir/utils/performance/perfRunner.py	Switch benchmarking invocation to `--rep/--warmup` time-budget flags and add stricter benchmark defaults.
mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp	Implement do_bench-style iteration sizing and event-based measurement with single end synchronization; update CLI options and JSON output behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  for (unsigned iter = 0; iter < iterations; ++iter) {
+    float currentMilliseconds = 0.0;
+    HIPCHECK(hipEventElapsedTime(&currentMilliseconds, startEvents[iter],
+                                 stopEvents[iter]));
+    measurements.push_back(static_cast<double>(currentMilliseconds));
  }


+    hipEvent_t startEvent, stopEvent;
+    HIPCHECK(hipEventCreate(&startEvent));
+    HIPCHECK(hipEventCreate(&stopEvent));
+
+    HIPCHECK(hipEventRecord(startEvent, stream));


pabloantoniom

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

dhernandez0 · 2026-06-15T14:56:02Z

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

I think we can use --use-rocprof for that.

justinrosner · 2026-06-15T15:35:17Z

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

I think we can use --use-rocprof for that.

Given that rocprof and HIP measurements give very different results (see Pablo's ticket here: https://amd-hub.atlassian.net/browse/AIROCMLIR-945), I don't know if we can just rely on --use-rocprof to compare older versions of rocMLIR?

codecov · 2026-06-15T22:04:21Z

Codecov Report

❌ Patch coverage is 0% with 27 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp	0.00%	27 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2407      +/-   ##
===========================================
+ Coverage    82.57%   82.63%   +0.06%     
===========================================
  Files          120      120              
  Lines        42852    42879      +27     
  Branches      7110     7118       +8     
===========================================
+ Hits         35381    35429      +48     
- Misses        4815     4834      +19     
+ Partials      2656     2616      -40

Flag	Coverage Δ
gfx120x	`82.48% <0.00%> (-0.04%)`	⬇️
gfx950	`82.39% <0.00%> (+0.04%)`	⬆️
mfma	`82.33% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
mlir/include/mlir/Dialect/Rock/IR/AmdArchDb.h	`0.00% <ø> (ø)`
mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp	`69.00% <0.00%> (-4.36%)`	⬇️

... and 25 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

pabloantoniom · 2026-06-16T05:22:00Z

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

I think we can use --use-rocprof for that.

Given that rocprof and HIP measurements give very different results (see Pablo's ticket here: https://amd-hub.atlassian.net/browse/AIROCMLIR-945), I don't know if we can just rely on --use-rocprof to compare older versions of rocMLIR?

This is a good point, given that --use-rocprof may be doing weird things under the hood, I'm not convinced relying on --use-rocprof is a good idea (at least until we investigate https://amd-hub.atlassian.net/browse/AIROCMLIR-945)?

mirza-halilcevic

This also resolves AIROCMLIR-163. You can close it once this is merged.

mirza-halilcevic · 2026-06-17T16:16:58Z

  llvm_unreachable(msg.c_str());
 }

+int64_t mlir::rock::getLastLevelCacheSize(StringRef arch) {


We can possibly add the cache sizes to AmdArchInfo as parameters.

mirza-halilcevic · 2026-06-17T16:21:20Z

+    } else {
+      // Default: size the flush buffer to the L2 cache reported by the HIP
+      // runtime, plus a 20% margin.
+      size_t l2Size = static_cast<size_t>(deviceProps.l2CacheSize);
+      flushSize = l2Size + (l2Size / 5); // 20% margin


Maybe this L2-only branch is not necessary anymore. I think we would always want to flush the last level.

mirza-halilcevic · 2026-06-17T16:24:21Z

+  // Pre-allocate one event pair per iteration so we can record them all in a
+  // tight loop and synchronize only once at the end. This matches Triton's
+  // do_bench, which minimizes host-side overhead between launches (no
+  // per-iteration synchronization).


Nice. Good idea.

mirza-halilcevic · 2026-06-17T16:25:14Z

+  // per-iteration synchronization).
+  std::vector<hipEvent_t> startEvents(iterations, nullptr);
+  std::vector<hipEvent_t> stopEvents(iterations, nullptr);
+  auto eventCleanup = llvm::make_scope_exit([&]() {


llvm::make_scope_exit is deprecated. We should use the constructor instead. I was planning to fix this anyway.

mirza-halilcevic · 2026-06-17T21:31:15Z

+  // Estimate the per-launch runtime so we can size warmup/benchmark iteration
+  // counts from the requested time budgets (Triton do_bench style). We time a
+  // handful of launches (flushing caches between them) using a single event
+  // pair.


Estimating warmup iterations on a cold GPU seems like it could cause problems for the first measured config. Maybe we could just let the warmup run for the requested time budget, before everything else. Something like:

warmupDeadline = now() + params.warmupMs do: for kernel in kernels: launch(kernel, stream) synchronize(stream) while now() < warmupDeadline estimate and run benchmark...

Align rocmlir-tuning-driver benchmarking with Triton do_bench

5a492da

dhernandez0 self-assigned this Jun 11, 2026

dhernandez0 requested a review from causten as a code owner June 11, 2026 08:06

dhernandez0 requested a review from Copilot June 11, 2026 08:09

dhernandez0 added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 11, 2026

Copilot started reviewing on behalf of dhernandez0 June 11, 2026 08:09 View session

dhernandez0 requested review from dorde-antic, erizheng-amd, justinrosner, mirza-halilcevic, pabloantoniom and umangyadav June 11, 2026 08:10

rocmlir-pr-reviewer Bot reviewed Jun 11, 2026

View reviewed changes

rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 11, 2026

Copilot AI reviewed Jun 11, 2026

View reviewed changes

pabloantoniom reviewed Jun 15, 2026

View reviewed changes

Merge branch 'develop' into align-tuning-driver-dobench

640fe1c

dhernandez0 requested a review from pabloantoniom June 15, 2026 15:23

dhernandez0 and others added 2 commits June 16, 2026 14:08

port --flush-last-level-cache

37676a6

Merge branch 'develop' into align-tuning-driver-dobench

dc0cc50

mirza-halilcevic approved these changes Jun 17, 2026

View reviewed changes

mirza-halilcevic reviewed Jun 17, 2026

View reviewed changes

Conversation

dhernandez0 commented Jun 11, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

rocmlir-pr-reviewer Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

rocmlir-pr-reviewer Bot left a comment

Choose a reason for hiding this comment

Scope

Findings

Notes

CI status

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

pabloantoniom left a comment

Choose a reason for hiding this comment

Uh oh!

dhernandez0 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

justinrosner commented Jun 15, 2026

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

pabloantoniom commented Jun 16, 2026

Uh oh!

mirza-halilcevic left a comment

Choose a reason for hiding this comment

Uh oh!

mirza-halilcevic Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

mirza-halilcevic Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

mirza-halilcevic Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

mirza-halilcevic Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

mirza-halilcevic Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

dhernandez0 commented Jun 15, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading