Add MicroBenchmark for Small Trip Count Loop vectorization #404

Open
Stylie777 wants to merge 4 commits into llvm:main from Stylie777:users/Stylie777/Small-TC-Loop-Vectorization
Stylie777 commented May 14, 2026

For targets where getMinTripCountTailFoldingThreshold returns a value greater than zero, llvm/llvm-project#195823 has enabled better vectorization of loops where applicable. This microbenchmark is intended to show the impact of those changes on the relevant targets.

For targets where getMinTripCountTailFoldingThreshold returns zero, there will be no difference in runtime between the scalar and vector variants.

Assisted-by: Codex
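
For readers without the diff at hand, the scalar-versus-vector comparison described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: the names `addOneScalar`, `addOneVector` and `SKETCH_NOINLINE` are hypothetical, and it assumes Clang's `#pragma clang loop` hints are used to suppress or request vectorization of the same five-iteration loop. Other compilers skip the pragma here because of the preprocessor guard.

```cpp
#include <cstdint>

// Hypothetical stand-in for the benchmark's NOINLINE macro.
#if defined(__clang__) || defined(__GNUC__)
#define SKETCH_NOINLINE __attribute__((noinline))
#else
#define SKETCH_NOINLINE
#endif

// Scalar baseline: vectorization and interleaving are disabled (Clang only).
template <typename Ty>
SKETCH_NOINLINE void addOneScalar(const Ty *__restrict A, Ty *__restrict B) {
#if defined(__clang__)
#pragma clang loop vectorize(disable) interleave(disable)
#endif
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Vector variant: the same loop with vectorization requested (Clang only).
template <typename Ty>
SKETCH_NOINLINE void addOneVector(const Ty *__restrict A, Ty *__restrict B) {
#if defined(__clang__)
#pragma clang loop vectorize(enable)
#endif
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}
```

Both functions must produce identical results; only the generated code, and therefore the measured runtime, is expected to differ.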

Comment on lines +76 to +78
g_small_loop_trip_count_sum ^= checksum(B);
benchmark::DoNotOptimize(g_small_loop_trip_count_sum);
State.SetItemsProcessed(State.iterations() * 5);

Contributor

Would be good to comment why this is needed

Author

It's not. I missed this when first reviewing the Codex-generated benchmark. I've removed it.

B[I] = A[I] + static_cast<Ty>(1);
}

NOINLINE void loopTc5I64InterleaveCount2Vector(const uint64_t *__restrict A,

Contributor

Is there a reason to not use the templated version for this one as well?

Author

No, there isn't. I have made it consistent now.

BENCHMARK_TEMPLATE(benchTc5Scalar, uint16_t)->Name("tc5/i16/scalar");
BENCHMARK_TEMPLATE(benchTc5Vector, uint32_t)->Name("tc5/i32/vector");
BENCHMARK_TEMPLATE(benchTc5Scalar, uint32_t)->Name("tc5/i32/scalar");
BENCHMARK_TEMPLATE(benchTc5Vector, uint64_t)->Name("tc5/i64/vector");

Contributor

I think the potential worst case would be i64 with TC = 3. Could you also cover this?

Author

I have added cases for all data types for TC=3 for full coverage.
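
One way to cover several element types and trip counts without duplicating the loop is to make the trip count a template parameter. The following is only a sketch of that idea, not the code added in this PR: `addOneFixedTripCount` and `checkFixedTripCount` are hypothetical names, and the element types are limited to the ones visible in the registrations quoted above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the trip count TC is a compile-time constant, so the
// compiler sees a known, small trip count for each instantiation.
template <typename Ty, std::size_t TC>
void addOneFixedTripCount(const Ty *__restrict A, Ty *__restrict B) {
  for (std::size_t I = 0; I != TC; ++I)
    B[I] = A[I] + static_cast<Ty>(1);
}

// Runs one (type, trip count) combination and reports whether every output
// element equals the corresponding input element plus one.
template <typename Ty, std::size_t TC>
bool checkFixedTripCount() {
  Ty A[TC];
  Ty B[TC];
  for (std::size_t I = 0; I != TC; ++I) {
    A[I] = static_cast<Ty>(I);
    B[I] = static_cast<Ty>(0);
  }
  addOneFixedTripCount<Ty, TC>(A, B);
  for (std::size_t I = 0; I != TC; ++I)
    if (B[I] != static_cast<Ty>(I + 1))
      return false;
  return true;
}
```

Instantiating `checkFixedTripCount<uint64_t, 3>()` exercises the i64, TC = 3 combination raised in the review, and the same template covers TC = 5.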

NOINLINE void loopTc5Vector(const Ty *__restrict A, Ty *__restrict B) {
  LOOP_VECTORIZE_ENABLE
  for (uint64_t I = 0; I != 5; ++I)
    B[I] = A[I] + static_cast<Ty>(1);

Contributor

This is a case where there is basically no overhead for the vector code compared to the scalar code.

Would be good to also include cases where the vector code has some overhead compared to the scalar code, e.g. some scalarization.

Author

I have added an example that has scalarization in the loop. If this is not what you meant, please let me know!

@Stylie777

Author

Gentle Ping

@fhahn fhahn left a comment

Contributor

Could you add some of the following cases and double-check performance? On the system I tried (Apple M4 with getMinTripCountTailFoldingThreshold updated to return 5 even without SVE), the scalar code is noticeably faster than the vector code:

with int8_t and int16_t

  for (int i = 0; i < 3; i++) { a[i] = a[i] < b[i] ? a[i] : b[i]; }
  for (int i = 0; i < 5; i++) { a[i]^=b[i]; }

with int32_t

  for (int i = 0; i < 5; i++) { a[i] %= (b[i] | 1); }
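
Wrapped as standalone functions, the three loops above might look like the sketch below. The function names are hypothetical and the pragmas and benchmark registration are omitted; this only pins down the semantics of each loop so that scalar and vector builds can be checked against the same expected results.

```cpp
#include <cstdint>

// Element-wise minimum with a trip count of 3 (suggested for int8_t, int16_t).
template <typename Ty> void minLoopTc3(Ty *a, const Ty *b) {
  for (int i = 0; i < 3; i++)
    a[i] = a[i] < b[i] ? a[i] : b[i];
}

// Element-wise XOR with a trip count of 5 (suggested for int8_t, int16_t).
template <typename Ty> void xorLoopTc5(Ty *a, const Ty *b) {
  for (int i = 0; i < 5; i++)
    a[i] ^= b[i];
}

// Remainder with a trip count of 5 (suggested for int32_t).
// OR-ing the divisor with 1 makes it odd, so it is never zero.
inline void remLoopTc5(int32_t *a, const int32_t *b) {
  for (int i = 0; i < 5; i++)
    a[i] %= (b[i] | 1);
}
```

For example, with `a = {10, 11, 12, 13, 14}` and `b = {2, 3, 4, 5, 6}`, the divisors become `{3, 3, 5, 5, 7}` and `remLoopTc5` leaves `a = {1, 2, 2, 3, 0}`.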

@Stylie777

Author

@fhahn I've tested those examples and agree that vectorising them hurts performance compared with the scalar code. I have added checks to the LV patch in LLVM so that vectorisation is not applied to loops where the scalar cost is lower than the vector cost. Your examples would now be excluded.

I have added them here as examples of where vectorisation is not beneficial, since the pragmas still vectorise the loops.

@Stylie777

Author

Ping

@Stylie777 Stylie777 requested a review from fhahn June 19, 2026 12:34