Align rocmlir-tuning-driver benchmarking with Triton do_bench #2407
dhernandez0 wants to merge 4 commits into
Conversation
| constexpr unsigned estimateRuns = 5; | ||
| double estimateMs = 0.0; | ||
| { | ||
| hipEvent_t startEvent, stopEvent; |
Unlike the main measureKernel loop (which guards its event vectors with llvm::make_scope_exit), this estimate block creates startEvent/stopEvent but only destroys them on the success path. If any intervening HIPCHECK fails (e.g. hipEventRecord or hipEventElapsedTime), the function returns failure() and both events leak. Consider wrapping their destruction in a llvm::make_scope_exit right after creation, mirroring the pattern already used a few lines below, so the cleanup is exception/early-return safe. (Minor: Major-section guidance on using MLIR/RAII allocation utilities rather than manual paired create/destroy.)
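A minimal sketch of the suggested cleanup, assuming stand-in types rather than the real HIP API so it builds without a GPU toolchain. `FakeEvent`, `fakeEventCreate`, `fakeEventDestroy`, `ScopeExit` and `estimateWithCleanup` are hypothetical names, not code from this PR, and the sketch needs C++17 for class template argument deduction:

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for hipEvent_t / hipEventCreate / hipEventDestroy.
using FakeEvent = int *;
static int liveEvents = 0; // number of events created but not yet destroyed

inline bool fakeEventCreate(FakeEvent *event) {
  *event = new int(0);
  ++liveEvents;
  return true;
}

inline void fakeEventDestroy(FakeEvent event) {
  delete event;
  --liveEvents;
}

// Minimal scope guard: runs the stored callable when it leaves scope.
template <typename Fn> class ScopeExit {
public:
  explicit ScopeExit(Fn fn) : fn(std::move(fn)) {}
  ScopeExit(const ScopeExit &) = delete;
  ScopeExit &operator=(const ScopeExit &) = delete;
  ~ScopeExit() { fn(); }

private:
  Fn fn;
};

// Returns false on a simulated failure. Both events are released on every
// return path because the guard is armed before the first create call.
inline bool estimateWithCleanup(bool failMidway) {
  FakeEvent startEvent = nullptr, stopEvent = nullptr;
  ScopeExit cleanup([&] {
    if (startEvent)
      fakeEventDestroy(startEvent);
    if (stopEvent)
      fakeEventDestroy(stopEvent);
  });
  if (!fakeEventCreate(&startEvent))
    return false;
  if (!fakeEventCreate(&stopEvent))
    return false;
  if (failMidway)
    return false; // stands in for a failing HIPCHECK
  return true;
}
```

Declaring the guard before the first create call means a failure of the second create still destroys the first event, which is the same reason the main loop initializes its event vectors to nullptr.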
Verdict: COMMENT · Findings: 1 (0 Critical, 0 Major, 1 Minor)
Scope
Backports the tuning-driver benchmarking methodology from rocmlirTriton to match Triton's do_bench: --num-iterations/--warmup-iterations (iteration counts) become --rep/--warmup (millisecond budgets), iteration counts are derived from an estimated per-launch runtime, the small-kernel CPU-timer special case is removed in favor of a single event-based measureKernel, and the Python drivers (perfRunner.py, tuningRunner.py) are updated to pass the new flags. Touches rocmlir-tuning-driver.cpp and the two perf scripts only.
Findings
- Minor: the per-launch estimate block in benchmarkKernels creates a startEvent/stopEvent pair without scope-based cleanup, so they leak on the HIP error paths (rocmlir-tuning-driver.cpp:463).
Notes
- Spot-checked the BenchmarkParams aggregate-init reorder (rocmlir-tuning-driver.cpp:691): struct field order warmupMs, repMs matches the init list warmup, rep, so there is no field-swap bug.
- Verified MLIR_N_REPEATS is still referenced after the rename (perfRunner.py:1841 and tuningRunner.py:1274 use it for --kernel-repeats), so the new comments are accurate and nothing is dead code.
- The main measureKernel loop correctly uses llvm::make_scope_exit and initializes both event vectors to nullptr, so partial-failure cleanup there is sound.
- Behavioral observation (not blocking, matches do_bench by design): removing the small-kernel CPU-timer path means a sub-microsecond kernel clamped to minMeasurableMs = 0.001 with repMs = 200 derives ~200k iterations and pre-allocates ~400k hipEvent_t. Realistic tiny kernels (5-50 µs) stay in the low thousands, but it is worth keeping an eye on resource use for extremely fast kernels that previously took the CPU-timer fast path.
- std::sort at line 535 is a pre-existing context line, not introduced here, so the checklist's llvm::sort guidance does not apply to this PR.
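The iteration sizing behind the behavioral observation can be sketched as follows. This assumes the driver follows Triton's do_bench rule of max(1, budget / estimate) with the estimate clamped to minMeasurableMs; `deriveIterations` is a hypothetical helper, and the driver's actual rounding may differ:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, not the driver's actual code: converts a millisecond
// time budget into an iteration count as max(1, budget / estimate).
inline unsigned deriveIterations(double budgetMs, double estimateMs,
                                 double minMeasurableMs = 0.001) {
  assert(budgetMs >= 0.0 && "time budget must be non-negative");
  // Clamp so a near-zero estimate cannot cause a division by zero.
  const double clampedMs = std::max(estimateMs, minMeasurableMs);
  // Truncate toward zero, then guarantee at least one iteration.
  const auto count = static_cast<unsigned>(budgetMs / clampedMs);
  return std::max(1u, count);
}
```

With a 200 ms budget and an estimate clamped to 0.001 ms this yields on the order of 200k iterations, hence roughly 400k events at one start/stop pair per iteration. A kernel slower than the budget still runs once.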
CI status
No non-self checks are in the fail/cancel buckets; remaining checks (premerge, Python perf tests, review) are still in progress at review time.
Pull request overview
This PR backports benchmarking methodology updates to rocmlir-tuning-driver to align its timing approach with Triton’s do_bench, aiming for more stable and consistent kernel performance measurements during tuning/benchmark runs.
Changes:
- Replace iteration-count flags (--num-iterations/--warmup-iterations) with millisecond time budgets (--rep/--warmup) and derive iteration counts from an estimated per-launch runtime.
- Rework kernel timing to pre-allocate per-iteration HIP event pairs, record iterations back-to-back, and synchronize once at the end (lower host overhead).
- Update perfRunner.py and tuningRunner.py to pass the new CLI flags and introduce corresponding time-budget constants.
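The control flow of the reworked timing loop can be illustrated with a simulated stream. Everything here (`SimStream`, `SimEvent`, `record`, `launch`, `synchronize`, `measure`) is a hypothetical stand-in for the HIP calls, meant only to show the ordering of pre-allocation, back-to-back recording, and the single synchronization, not the driver's implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simulated stream: a clock that advances on every launch.
struct SimStream {
  double nowMs = 0.0;
  unsigned syncCount = 0;
};

struct SimEvent {
  double recordedAtMs = 0.0;
};

inline void record(SimEvent &event, const SimStream &stream) {
  event.recordedAtMs = stream.nowMs;
}
inline void launch(SimStream &stream, double kernelMs) {
  stream.nowMs += kernelMs;
}
inline void synchronize(SimStream &stream) { ++stream.syncCount; }

inline std::vector<double> measure(SimStream &stream, double kernelMs,
                                   std::size_t iterations) {
  // Pre-allocate one start/stop pair per iteration before timing begins.
  std::vector<SimEvent> starts(iterations), stops(iterations);

  // Record and launch back-to-back with no per-iteration synchronization.
  for (std::size_t i = 0; i < iterations; ++i) {
    record(starts[i], stream);
    launch(stream, kernelMs);
    record(stops[i], stream);
  }

  // Synchronize once, then read all elapsed times.
  synchronize(stream);
  std::vector<double> elapsedMs;
  elapsedMs.reserve(iterations);
  for (std::size_t i = 0; i < iterations; ++i)
    elapsedMs.push_back(stops[i].recordedAtMs - starts[i].recordedAtMs);
  return elapsedMs;
}
```

The trade-off is memory for host overhead: the event vectors grow with the iteration count, but the host never stalls between launches.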
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mlir/utils/performance/tuningRunner.py | Switch tuning-driver invocation to --rep/--warmup time-budget flags and add tuning defaults. |
| mlir/utils/performance/perfRunner.py | Switch benchmarking invocation to --rep/--warmup time-budget flags and add stricter benchmark defaults. |
| mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp | Implement do_bench-style iteration sizing and event-based measurement with single end synchronization; update CLI options and JSON output behavior. |
| for (unsigned iter = 0; iter < iterations; ++iter) { | ||
| float currentMilliseconds = 0.0; | ||
| HIPCHECK(hipEventElapsedTime(&currentMilliseconds, startEvents[iter], | ||
| stopEvents[iter])); | ||
| measurements.push_back(static_cast<double>(currentMilliseconds)); | ||
| } |
| hipEvent_t startEvent, stopEvent; | ||
| HIPCHECK(hipEventCreate(&startEvent)); | ||
| HIPCHECK(hipEventCreate(&stopEvent)); | ||
|
|
||
| HIPCHECK(hipEventRecord(startEvent, stream)); |
pabloantoniom
left a comment
If I understand correctly we are dropping the old way of benchmarking rocMLIR. Would it make sense to keep it, leave it as the default, and just have the Triton-style live under a new option? I wonder if we would need the old way of benchmarking rocMLIR at some point, maybe to compare apples-to-apples between newer and older versions of rocMLIR
I think we can use --use-rocprof for that.
Given that rocprof and HIP measurements give very different results (see Pablo's ticket here: https://amd-hub.atlassian.net/browse/AIROCMLIR-945), I don't know if we can just rely on
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #2407 +/- ##
===========================================
+ Coverage 82.57% 82.63% +0.06%
===========================================
Files 120 120
Lines 42852 42879 +27
Branches 7110 7118 +8
===========================================
+ Hits 35381 35429 +48
- Misses 4815 4834 +19
+ Partials 2656 2616 -40
This is a good point. Given that --use-rocprof may be doing weird things under the hood, I'm not convinced relying on it is a good idea, at least until we investigate https://amd-hub.atlassian.net/browse/AIROCMLIR-945.
mirza-halilcevic
left a comment
This also resolves AIROCMLIR-163. You can close it once this is merged.
| llvm_unreachable(msg.c_str()); | ||
| } | ||
|
|
||
| int64_t mlir::rock::getLastLevelCacheSize(StringRef arch) { |
We can possibly add the cache sizes to AmdArchInfo as parameters.
| } else { | ||
| // Default: size the flush buffer to the L2 cache reported by the HIP | ||
| // runtime, plus a 20% margin. | ||
| size_t l2Size = static_cast<size_t>(deviceProps.l2CacheSize); | ||
| flushSize = l2Size + (l2Size / 5); // 20% margin |
Maybe this L2-only branch is not necessary anymore. I think we would always want to flush the last level.
| // Pre-allocate one event pair per iteration so we can record them all in a | ||
| // tight loop and synchronize only once at the end. This matches Triton's | ||
| // do_bench, which minimizes host-side overhead between launches (no | ||
| // per-iteration synchronization). |
| std::vector<hipEvent_t> startEvents(iterations, nullptr); | ||
| std::vector<hipEvent_t> stopEvents(iterations, nullptr); | ||
| auto eventCleanup = llvm::make_scope_exit([&]() { |
llvm::make_scope_exit is deprecated. We should use the constructor instead. I was planning to fix this anyway.
| // Estimate the per-launch runtime so we can size warmup/benchmark iteration | ||
| // counts from the requested time budgets (Triton do_bench style). We time a | ||
| // handful of launches (flushing caches between them) using a single event | ||
| // pair. |
Estimating warmup iterations on a cold GPU seems like it could cause problems for the first measured config. Maybe we could just let the warmup run for the requested time budget, before everything else. Something like:
warmupDeadline = now() + params.warmupMs
do:
for kernel in kernels:
launch(kernel, stream)
synchronize(stream)
while now() < warmupDeadline
estimate and run benchmark...
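One way to read the pseudocode above, as a host-side sketch using std::chrono. `warmUpForBudget` and its callable parameters are hypothetical, and placing the synchronize call after each full pass over the kernels (rather than after each launch) is an assumption about the intended nesting:

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: launches every kernel repeatedly until the warmup
// budget has elapsed, without needing a per-launch runtime estimate.
// Returns the number of full passes over the kernel list.
inline unsigned
warmUpForBudget(const std::vector<std::function<void()>> &kernels,
                const std::function<void()> &synchronize,
                std::chrono::milliseconds warmupBudget) {
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + warmupBudget;
  unsigned passes = 0;
  do {
    for (const auto &launch : kernels)
      launch();
    // Assumed placement: wait for the whole pass before checking the clock,
    // so queued-but-unfinished work does not count toward the budget.
    synchronize();
    ++passes;
  } while (Clock::now() < deadline);
  return passes;
}
```

The do/while shape guarantees at least one pass even with a zero budget, and the per-launch estimate can then be taken on an already warm device.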
Motivation
Backport the benchmarking changes from rocmlirTriton PR https://github.com/ROCm/rocmlirTriton/pull/280 to align the tuning driver's timing methodology with Triton's do_bench, giving more stable and consistent performance numbers. (The last-level-cache flushing part of that PR is intentionally excluded.)
Technical Details
- --rep/--warmup are now millisecond time budgets instead of iteration counts; the actual iteration counts are derived from these budgets and an estimated per-launch runtime.
- measureKernel pre-allocates one event pair per iteration, records them back-to-back, and synchronizes only once at the end (matching do_bench's low host-overhead pattern).
- The small-kernel CPU-timer special case is removed in favor of a single event-based measurement path.
- Update perfRunner.py/tuningRunner.py to pass the new --rep/--warmup flags.
Test Plan
PR CI
Test Result
All tests pass.
Submission Checklist