Add MicroBenchmark for Small Trip Count Loop vectorization by Stylie777 · Pull Request #404 · llvm/llvm-test-suite

Stylie777 · 2026-05-14T12:52:27Z

For targets where getMinTripCountTailFoldingThreshold returns a value greater than zero, llvm/llvm-project#195823 enables better vectorization of low trip count loops where applicable. This microbenchmark is intended to show the impact of those changes on the relevant targets.

For targets where getMinTripCountTailFoldingThreshold returns zero, no runtime difference is expected between the scalar and vector variants.

Assisted-by: Codex
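For readers without the diff open, here is a minimal sketch of what a scalar/vector pair for a trip-count-5 loop might look like. The loop body and the name `loopTc5Vector` follow the snippet quoted later in this thread; the macro definitions, `loopTc5Scalar`, and the helper `variantsAgree` are assumptions made for illustration and are not taken from the PR.

```cpp
#include <cstdint>

// Assumed definitions: the PR uses LOOP_VECTORIZE_ENABLE and NOINLINE, but
// their definitions are not shown in this thread. On non-Clang compilers the
// pragma macros expand to nothing.
#if defined(__clang__)
#define LOOP_VECTORIZE_ENABLE _Pragma("clang loop vectorize(enable)")
#define LOOP_VECTORIZE_DISABLE _Pragma("clang loop vectorize(disable) interleave(disable)")
#else
#define LOOP_VECTORIZE_ENABLE
#define LOOP_VECTORIZE_DISABLE
#endif

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Scalar baseline: the vectorizer is explicitly turned off for this loop.
template <typename Ty>
NOINLINE void loopTc5Scalar(const Ty *__restrict A, Ty *__restrict B) {
  LOOP_VECTORIZE_DISABLE
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Vector variant: same body, with vectorization requested via pragma.
template <typename Ty>
NOINLINE void loopTc5Vector(const Ty *__restrict A, Ty *__restrict B) {
  LOOP_VECTORIZE_ENABLE
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Hypothetical helper: checks that both variants compute the same values,
// so any timing difference comes from code generation only.
template <typename Ty> bool variantsAgree() {
  const Ty A[5] = {1, 2, 3, 4, 5};
  Ty S[5] = {};
  Ty V[5] = {};
  loopTc5Scalar(A, S);
  loopTc5Vector(A, V);
  for (int I = 0; I != 5; ++I)
    if (S[I] != V[I] || S[I] != static_cast<Ty>(A[I] + 1))
      return false;
  return true;
}
```

Keeping the two functions out of line and identical apart from the pragma is what makes the comparison meaningful: each benchmark would then time one of them in isolation.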


fhahn · 2026-05-19T08:39:38Z

+  g_small_loop_trip_count_sum ^= checksum(B);
+  benchmark::DoNotOptimize(g_small_loop_trip_count_sum);
+  State.SetItemsProcessed(State.iterations() * 5);


Would be good to comment why this is needed

It's not. I missed this when first reviewing the Codex-generated benchmark. I've removed it.

fhahn · 2026-05-19T08:41:30Z

+    B[I] = A[I] + static_cast<Ty>(1);
+}
+
+NOINLINE void loopTc5I64InterleaveCount2Vector(const uint64_t *__restrict A,


Is there a reason to not use the templated version for this one as well?

No, there isn't. I have made it consistent now.

fhahn · 2026-05-19T08:42:52Z

+BENCHMARK_TEMPLATE(benchTc5Scalar, uint16_t)->Name("tc5/i16/scalar");
+BENCHMARK_TEMPLATE(benchTc5Vector, uint32_t)->Name("tc5/i32/vector");
+BENCHMARK_TEMPLATE(benchTc5Scalar, uint32_t)->Name("tc5/i32/scalar");
+BENCHMARK_TEMPLATE(benchTc5Vector, uint64_t)->Name("tc5/i64/vector");


I think the potential worst case would be i64 with TC = 3. Could you also cover this?

I have added TC = 3 cases for all data types for full coverage.
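One way the TC = 3 and TC = 5 cases could share a single definition is to make the trip count a template parameter. This is only a sketch under that assumption; the names `loopAddOneScalar`, `loopAddOneVector`, and `tcVariantsAgree`, as well as the macro definitions, are hypothetical and may not match what the PR actually does.

```cpp
#include <cstdint>

// Assumed macro definitions (not shown in this thread); no-ops outside Clang.
#if defined(__clang__)
#define LOOP_VECTORIZE_ENABLE _Pragma("clang loop vectorize(enable)")
#define LOOP_VECTORIZE_DISABLE _Pragma("clang loop vectorize(disable) interleave(disable)")
#else
#define LOOP_VECTORIZE_ENABLE
#define LOOP_VECTORIZE_DISABLE
#endif

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// The trip count TC is a compile-time constant, as in the hand-written loops.
template <typename Ty, uint64_t TC>
NOINLINE void loopAddOneScalar(const Ty *__restrict A, Ty *__restrict B) {
  LOOP_VECTORIZE_DISABLE
  for (uint64_t I = 0; I != TC; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

template <typename Ty, uint64_t TC>
NOINLINE void loopAddOneVector(const Ty *__restrict A, Ty *__restrict B) {
  LOOP_VECTORIZE_ENABLE
  for (uint64_t I = 0; I != TC; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Hypothetical helper: runs both variants and compares their outputs.
template <typename Ty, uint64_t TC> bool tcVariantsAgree() {
  Ty A[TC], S[TC], V[TC];
  for (uint64_t I = 0; I != TC; ++I) {
    A[I] = static_cast<Ty>(I + 1);
    S[I] = V[I] = 0;
  }
  loopAddOneScalar<Ty, TC>(A, S);
  loopAddOneVector<Ty, TC>(A, V);
  for (uint64_t I = 0; I != TC; ++I)
    if (S[I] != V[I])
      return false;
  return true;
}

// A registration for the i64, TC = 3 case might then look like this
// (benchTc3Vector and the benchmark name are hypothetical):
//   BENCHMARK_TEMPLATE(benchTc3Vector, uint64_t)->Name("tc3/i64/vector");
```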

fhahn · 2026-05-19T08:44:14Z

+NOINLINE void loopTc5Vector(const Ty *__restrict A, Ty *__restrict B) {
+  LOOP_VECTORIZE_ENABLE
+  for (uint64_t I = 0; I != 5; ++I)
+    B[I] = A[I] + static_cast<Ty>(1);


This is a case where there is basically no overhead for the vector code compared to the scalar code.

It would be good to also include cases where the vector code has some overhead compared to the scalar code, e.g. some scalarization.

I have added an example that has scalarization in the loop. If this is not what you meant, please let me know!
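The loop added to the PR is not quoted in this thread. As an illustration only, integer division is one operation that lacks a vector instruction on many targets, so a vectorized loop containing it typically extracts each lane, divides in scalar registers, and reinserts the results. The sketch below assumes that kind of loop; `loopTc5DivVector` and the macro definitions are hypothetical.

```cpp
#include <cstdint>

// Assumed macro definitions (not shown in this thread); no-ops outside Clang.
#if defined(__clang__)
#define LOOP_VECTORIZE_ENABLE _Pragma("clang loop vectorize(enable)")
#else
#define LOOP_VECTORIZE_ENABLE
#endif

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Hypothetical loop with a likely-scalarized operation: the add can be done
// on whole vectors, while the division may be performed lane by lane.
// OR-ing the divisor with 1 keeps it non-zero.
NOINLINE void loopTc5DivVector(const uint32_t *__restrict A,
                               const uint32_t *__restrict D,
                               uint32_t *__restrict Out) {
  LOOP_VECTORIZE_ENABLE
  for (uint64_t I = 0; I != 5; ++I)
    Out[I] = A[I] / (D[I] | 1u) + 1u;
}
```

Whether the division is actually scalarized depends on the target and its enabled features, so it is worth confirming in the generated assembly before drawing conclusions from the timings.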

Stylie777 · 2026-05-28T10:04:51Z

Gentle Ping

fhahn

Could you add some of the following cases and double-check performance? On the system I tried (Apple M4 with getMinTripCountTailFoldingThreshold updated to return 5 even without SVE), the scalar code is noticeably faster than the vector code.

with int8_t and int16_t

  for (int i = 0; i < 3; i++) { a[i] = a[i] < b[i] ? a[i] : b[i]; }

  for (int i = 0; i < 5; i++) { a[i]^=b[i]; }

with int32_t

  for (int i = 0; i < 5; i++) { a[i] %= (b[i] | 1); }
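The three loops above could be wrapped as out-of-line functions along the following lines. This is a sketch, not the code that was added to the PR: the function names, the macro definitions, and the `__restrict` qualifiers (added to match the other loops quoted in this thread) are assumptions.

```cpp
#include <cstdint>

// Assumed macro definitions (not shown in this thread); no-ops outside Clang.
#if defined(__clang__)
#define LOOP_VECTORIZE_ENABLE _Pragma("clang loop vectorize(enable)")
#else
#define LOOP_VECTORIZE_ENABLE
#endif

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#else
#define NOINLINE
#endif

// Element-wise minimum, trip count 3 (suggested for int8_t and int16_t).
template <typename Ty>
NOINLINE void minTc3Vector(Ty *__restrict a, const Ty *__restrict b) {
  LOOP_VECTORIZE_ENABLE
  for (int i = 0; i < 3; i++)
    a[i] = a[i] < b[i] ? a[i] : b[i];
}

// Element-wise xor, trip count 5 (suggested for int8_t and int16_t).
template <typename Ty>
NOINLINE void xorTc5Vector(Ty *__restrict a, const Ty *__restrict b) {
  LOOP_VECTORIZE_ENABLE
  for (int i = 0; i < 5; i++)
    a[i] ^= b[i];
}

// Remainder, trip count 5 (suggested for int32_t). The "| 1" keeps the
// divisor non-zero.
NOINLINE void remTc5Vector(int32_t *__restrict a, const int32_t *__restrict b) {
  LOOP_VECTORIZE_ENABLE
  for (int i = 0; i < 5; i++)
    a[i] %= (b[i] | 1);
}
```

A scalar twin of each function would use the same body with vectorization disabled, so the two can be timed against each other.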

Stylie777 · 2026-05-29T13:22:52Z

@fhahn I've tested those examples and agree that the vectorised code performs worse than the scalar code here. I have added checks to the LV patch in LLVM so that vectorisation is not applied to loops where the scalar cost is lower than the vector cost. Your examples would now be excluded.

I have added them here as examples of where vectorising is not beneficial, since the pragmas still force the loops to be vectorised.

Stylie777 · 2026-06-10T09:39:54Z

Ping

Stylie777 requested a review from fhahn May 14, 2026 12:52

Stylie777 mentioned this pull request May 14, 2026: [LoopVectorize] Improve Vectorization of Low Trip Count Loops llvm/llvm-project#195823 (Open)

fhahn reviewed May 19, 2026

Stylie777 added 2 commits May 19, 2026 10:39

Respond to review comments

9b8a4ae

formatting

5918831

fhahn reviewed May 28, 2026

Add benchmarks for examples where vectorisation is not beneficial

2d40f92

Stylie777 requested a review from fhahn June 19, 2026 12:34


Stylie777 wants to merge 4 commits into llvm:main from Stylie777:users/Stylie777/Small-TC-Loop-Vectorization
