Add MicroBenchmark for Small Trip Count Loop vectorization #404
Stylie777 wants to merge 4 commits into
For targets where getMinTripCountTailFoldingThreshold returns a value greater than zero, llvm/llvm-project#195823 enables better vectorization of loops where applicable. This micro benchmark is intended to show the impact of that change on the relevant targets. For targets where getMinTripCountTailFoldingThreshold returns zero, no runtime difference is expected between the scalar and vector variants.
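As a rough sketch of the comparison described above (not the PR's actual code): the two kernels below share the same body and a fixed trip count of 5, and differ only in the loop hint. The names `kTripCount`, `addOneScalar`, and `addOneVector` are hypothetical, and the `#pragma clang loop` hints are Clang-specific; other compilers are expected to ignore them, possibly with a warning.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed trip count, matching the "tc5" cases in this PR.
constexpr uint64_t kTripCount = 5;

// Scalar baseline: ask Clang not to vectorize or interleave the loop.
template <typename Ty>
__attribute__((noinline)) void addOneScalar(const Ty *__restrict A,
                                            Ty *__restrict B) {
#pragma clang loop vectorize(disable) interleave(disable)
  for (uint64_t I = 0; I != kTripCount; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Vector variant: identical body, but vectorization is requested.
template <typename Ty>
__attribute__((noinline)) void addOneVector(const Ty *__restrict A,
                                            Ty *__restrict B) {
#pragma clang loop vectorize(enable)
  for (uint64_t I = 0; I != kTripCount; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}
```

Both kernels must produce identical results; only the generated code, and therefore the measured time, should differ.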
g_small_loop_trip_count_sum ^= checksum(B);
benchmark::DoNotOptimize(g_small_loop_trip_count_sum);
State.SetItemsProcessed(State.iterations() * 5);
Would be good to comment why this is needed
It's not. I missed this when first reviewing the Codex-generated benchmark. I've removed it.
B[I] = A[I] + static_cast<Ty>(1);
}

NOINLINE void loopTc5I64InterleaveCount2Vector(const uint64_t *__restrict A,
Is there a reason to not use the templated version for this one as well?
No, there isn't. I have made it consistent now.
BENCHMARK_TEMPLATE(benchTc5Scalar, uint16_t)->Name("tc5/i16/scalar");
BENCHMARK_TEMPLATE(benchTc5Vector, uint32_t)->Name("tc5/i32/vector");
BENCHMARK_TEMPLATE(benchTc5Scalar, uint32_t)->Name("tc5/i32/scalar");
BENCHMARK_TEMPLATE(benchTc5Vector, uint64_t)->Name("tc5/i64/vector");
I think the potential worst case would be i64 with TC = 3. Could you also cover this?
I have added cases for all data types for TC=3 for full coverage.
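One way to cover several trip counts without duplicating kernels is to make the trip count a template parameter. This is only a sketch under that assumption; the PR may well use separate functions per trip count, and `addOneFixedTc` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical kernel: element type and compile-time trip count are both
// template parameters, so TC = 3 and TC = 5 share one definition.
template <typename Ty, uint64_t TC>
__attribute__((noinline)) void addOneFixedTc(const Ty *__restrict A,
                                             Ty *__restrict B) {
#pragma clang loop vectorize(enable)
  for (uint64_t I = 0; I != TC; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}
```

For example, `addOneFixedTc<uint64_t, 3>(A, B)` would exercise the i64, TC = 3 case raised above, writing only the first three elements of `B`.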
NOINLINE void loopTc5Vector(const Ty *__restrict A, Ty *__restrict B) {
LOOP_VECTORIZE_ENABLE
for (uint64_t I = 0; I != 5; ++I)
B[I] = A[I] + static_cast<Ty>(1);
This is a case where there is basically no overhead for the vector code compared to the scalar code.
It would be good to also include cases where the vector code has some overhead compared to the scalar code, e.g. some scalarization.
I have added an example that has scalarization in the loop. If this is not what you meant, please let me know!
Gentle ping
Could you add some of the following cases and double-check performance? On the system I tried (Apple M4 with getMinTripCountTailFoldingThreshold updated to return 5 even without SVE), the scalar code is noticeably faster than the vector code.
With int8_t and int16_t:
for (int i = 0; i < 3; i++) { a[i] = a[i] < b[i] ? a[i] : b[i]; }
for (int i = 0; i < 5; i++) { a[i] ^= b[i]; }
With int32_t:
for (int i = 0; i < 5; i++) { a[i] %= (b[i] | 1); }
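The three loops above could be wrapped as standalone kernels roughly as follows. This is a sketch only: the function names (`minTc3`, `xorTc5`, `remTc5`) are hypothetical, and placing `#pragma clang loop vectorize(enable)` on each loop is an assumption about how the forced-vector variants would be written.

```cpp
#include <cassert>
#include <cstdint>

// Element-wise minimum, trip count 3 (suggested for int8_t and int16_t).
template <typename Ty>
__attribute__((noinline)) void minTc3(Ty *__restrict a,
                                      const Ty *__restrict b) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < 3; i++)
    a[i] = a[i] < b[i] ? a[i] : b[i];
}

// Element-wise XOR, trip count 5 (suggested for int8_t and int16_t).
template <typename Ty>
__attribute__((noinline)) void xorTc5(Ty *__restrict a,
                                      const Ty *__restrict b) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < 5; i++)
    a[i] ^= b[i];
}

// Remainder, trip count 5 (suggested for int32_t). The `| 1` keeps the
// divisor odd and therefore nonzero.
__attribute__((noinline)) void remTc5(int32_t *__restrict a,
                                      const int32_t *__restrict b) {
#pragma clang loop vectorize(enable)
  for (int i = 0; i < 5; i++)
    a[i] %= (b[i] | 1);
}
```

A scalar counterpart of each kernel would use the same body with vectorization disabled, so that the two timings can be compared directly.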
@fhahn I've tested those examples and agree that the vectorised code is slower than the scalar code here. I have added checks to the LV patch in LLVM so that vectorisation is not applied to loops where the scalar cost is lower than the vector cost. Your examples would now be excluded. I have also added them here as examples of where vectorising is not beneficial, since the pragmas still vectorise the loops.
Ping
Assisted-by: Codex