Skip to content

Align rocmlir-tuning-driver benchmarking with Triton do_bench#2407

Open
dhernandez0 wants to merge 4 commits into
developfrom
align-tuning-driver-dobench
Open

Align rocmlir-tuning-driver benchmarking with Triton do_bench#2407
dhernandez0 wants to merge 4 commits into
developfrom
align-tuning-driver-dobench

Conversation

@dhernandez0

Copy link
Copy Markdown
Contributor

Motivation

Backport the benchmarking changes from rocmlirTriton PR https://github.com/ROCm/rocmlirTriton/pull/280 to align
the tuning driver's timing methodology with Triton's do_bench, giving
more stable and consistent performance numbers. (The last-level-cache
flushing part of that PR is intentionally excluded.)

Technical Details

  • --rep / --warmup are now millisecond time budgets instead of
    iteration counts; the actual iteration counts are derived from these
    budgets and an estimated per-launch runtime.
  • measureKernel pre-allocates one event pair per iteration, records
    them back-to-back, and synchronizes only once at the end (matching
    do_bench's low host-overhead pattern).
  • Removed the small-kernel CPU-timer special case in favor of a single
    measurement path.
  • Updated perfRunner.py / tuningRunner.py to pass the new --rep /
    --warmup flags.

Test Plan

PR CI

Test Result

All tests pass.

Submission Checklist

@dhernandez0 dhernandez0 self-assigned this Jun 11, 2026
@dhernandez0 dhernandez0 requested a review from causten as a code owner June 11, 2026 08:06
@dhernandez0 dhernandez0 requested a review from Copilot June 11, 2026 08:09
@dhernandez0 dhernandez0 added the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 11, 2026
constexpr unsigned estimateRuns = 5;
double estimateMs = 0.0;
{
hipEvent_t startEvent, stopEvent;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike the main measureKernel loop (which guards its event vectors with llvm::make_scope_exit), this estimate block creates startEvent/stopEvent but only destroys them on the success path. If any intervening HIPCHECK fails (e.g. hipEventRecord or hipEventElapsedTime), the function returns failure() and both events leak. Consider wrapping their destruction in a llvm::make_scope_exit right after creation, mirroring the pattern already used a few lines below, so the cleanup is exception/early-return safe. (Minor: Major-section guidance on using MLIR/RAII allocation utilities rather than manual paired create/destroy.)

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: COMMENT  ·  Findings: 1 (0 Critical, 0 Major, 1 Minor)


Scope

Backports the tuning-driver benchmarking methodology from rocmlirTriton to match Triton's do_bench: --num-iterations/--warmup-iterations (iteration counts) become --rep/--warmup (millisecond budgets), iteration counts are derived from an estimated per-launch runtime, the small-kernel CPU-timer special case is removed in favor of a single event-based measureKernel, and the Python drivers (perfRunner.py, tuningRunner.py) are updated to pass the new flags. Touches rocmlir-tuning-driver.cpp and the two perf scripts only.

Findings

  • Minor: the per-launch estimate block in benchmarkKernels creates a startEvent/stopEvent pair without scope-based cleanup, so they leak on the HIP error paths (rocmlir-tuning-driver.cpp:463).

Notes

  • Spot-checked the BenchmarkParams aggregate-init reorder (rocmlir-tuning-driver.cpp:691): struct field order warmupMs, repMs matches the init list warmup, rep, so there is no field-swap bug.
  • Verified MLIR_N_REPEATS is still referenced after the rename — perfRunner.py:1841 and tuningRunner.py:1274 use it for --kernel-repeats — so the new comments are accurate and nothing is dead code.
  • The main measureKernel loop correctly uses llvm::make_scope_exit and initializes both event vectors to nullptr, so partial-failure cleanup there is sound.
  • Behavioral observation (not blocking, matches do_bench by design): removing the small-kernel CPU-timer path means a sub-microsecond kernel clamped to minMeasurableMs = 0.001 with repMs = 200 derives ~200k iterations and pre-allocates ~400k hipEvent_t. Realistic tiny kernels (5-50 µs) stay in the low-thousands, but it is worth keeping an eye on resource use for extremely fast kernels that previously took the CPU-timer fast path.
  • std::sort at line 535 is a pre-existing context line, not introduced here, so the checklist's llvm::sort guidance does not apply to this PR.

CI status

No non-self checks are in the fail/cancel buckets; remaining checks (premerge, Python perf tests, review) are still in progress at review time.

@rocmlir-pr-reviewer rocmlir-pr-reviewer Bot removed the claude-review Trigger automated PR review by claude[bot]; auto-removed after the run. label Jun 11, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR backports benchmarking methodology updates to rocmlir-tuning-driver to align its timing approach with Triton’s do_bench, aiming for more stable and consistent kernel performance measurements during tuning/benchmark runs.

Changes:

  • Replace iteration-count flags (--num-iterations / --warmup-iterations) with millisecond time budgets (--rep / --warmup) and derive iteration counts from an estimated per-launch runtime.
  • Rework kernel timing to pre-allocate per-iteration HIP event pairs, record iterations back-to-back, and synchronize once at the end (lower host overhead).
  • Update perfRunner.py and tuningRunner.py to pass the new CLI flags and introduce corresponding time-budget constants.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
mlir/utils/performance/tuningRunner.py Switch tuning-driver invocation to --rep/--warmup time-budget flags and add tuning defaults.
mlir/utils/performance/perfRunner.py Switch benchmarking invocation to --rep/--warmup time-budget flags and add stricter benchmark defaults.
mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp Implement do_bench-style iteration sizing and event-based measurement with single end synchronization; update CLI options and JSON output behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +400 to 405
for (unsigned iter = 0; iter < iterations; ++iter) {
float currentMilliseconds = 0.0;
HIPCHECK(hipEventElapsedTime(&currentMilliseconds, startEvents[iter],
stopEvents[iter]));
measurements.push_back(static_cast<double>(currentMilliseconds));
}
Comment on lines +463 to +467
hipEvent_t startEvent, stopEvent;
HIPCHECK(hipEventCreate(&startEvent));
HIPCHECK(hipEventCreate(&stopEvent));

HIPCHECK(hipEventRecord(startEvent, stream));

@pabloantoniom pabloantoniom left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

@dhernandez0

dhernandez0 commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

I think we can use --use-rocprof for that.

@justinrosner

Copy link
Copy Markdown
Contributor

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

I think we can use --use-rocprof for that.

Given that rocprof and HIP measurements give very different results (see Pablo's ticket here: https://amd-hub.atlassian.net/browse/AIROCMLIR-945), I don't know if we can just rely on --use-rocprof to compare older versions of rocMLIR?

@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 27 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp 0.00% 27 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2407      +/-   ##
===========================================
+ Coverage    82.57%   82.63%   +0.06%     
===========================================
  Files          120      120              
  Lines        42852    42879      +27     
  Branches      7110     7118       +8     
===========================================
+ Hits         35381    35429      +48     
- Misses        4815     4834      +19     
+ Partials      2656     2616      -40     
Flag Coverage Δ
gfx120x 82.48% <0.00%> (-0.04%) ⬇️
gfx950 82.39% <0.00%> (+0.04%) ⬆️
mfma 82.33% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
mlir/include/mlir/Dialect/Rock/IR/AmdArchDb.h 0.00% <ø> (ø)
mlir/lib/Dialect/Rock/IR/AmdArchDb.cpp 69.00% <0.00%> (-4.36%) ⬇️

... and 25 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pabloantoniom

Copy link
Copy Markdown
Contributor

If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR

I think we can use --use-rocprof for that.

Given that rocprof and HIP measurements give very different results (see Pablo's ticket here: https://amd-hub.atlassian.net/browse/AIROCMLIR-945), I don't know if we can just rely on --use-rocprof to compare older versions of rocMLIR?

This is a good point, given that --use-rocprof may be doing weird things under the hood, I'm not convinced relying on --use-rocprof is a good idea (at least until we investigate https://amd-hub.atlassian.net/browse/AIROCMLIR-945)?

@mirza-halilcevic mirza-halilcevic left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also resolves AIROCMLIR-163. You can close it once this is merged.

llvm_unreachable(msg.c_str());
}

int64_t mlir::rock::getLastLevelCacheSize(StringRef arch) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can possibly add the cache sizes to AmdArchInfo as parameters.

Comment on lines +228 to +232
} else {
// Default: size the flush buffer to the L2 cache reported by the HIP
// runtime, plus a 20% margin.
size_t l2Size = static_cast<size_t>(deviceProps.l2CacheSize);
flushSize = l2Size + (l2Size / 5); // 20% margin

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this L2-only branch is not necessary anymore. I think we would always want to flush the last level.

Comment on lines +365 to +368
// Pre-allocate one event pair per iteration so we can record them all in a
// tight loop and synchronize only once at the end. This matches Triton's
// do_bench, which minimizes host-side overhead between launches (no
// per-iteration synchronization).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Good idea.

// per-iteration synchronization).
std::vector<hipEvent_t> startEvents(iterations, nullptr);
std::vector<hipEvent_t> stopEvents(iterations, nullptr);
auto eventCleanup = llvm::make_scope_exit([&]() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llvm::make_scope_exit is deprecated. We should use the constructor instead. I was planning to fix this anyway.

Comment on lines +466 to +469
// Estimate the per-launch runtime so we can size warmup/benchmark iteration
// counts from the requested time budgets (Triton do_bench style). We time a
// handful of launches (flushing caches between them) using a single event
// pair.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Estimating warmup iterations on a cold GPU seems like it could cause problems for the first measured config. Maybe we could just let the warmup run for the requested time budget, before everything else. Something like:

warmupDeadline = now() + params.warmupMs
do:
    for kernel in kernels:
        launch(kernel, stream)
    synchronize(stream)
while now() < warmupDeadline

estimate and run benchmark...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants