Skip to content

[LoopVectorize] Improve Vectorization of Low Trip Count Loops#195823

Open
Stylie777 wants to merge 14 commits into
llvm:mainfrom
Stylie777:vectorize-small-loop-epilogue
Open

[LoopVectorize] Improve Vectorization of Low Trip Count Loops#195823
Stylie777 wants to merge 14 commits into
llvm:mainfrom
Stylie777:vectorize-small-loop-epilogue

Conversation

@Stylie777

@Stylie777 Stylie777 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Currently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterproductive to performance, often failing to vectorize entirely. This is more prevalent with targets where getMinTripCountTailFoldingThreshold() returns a value greater than 0.

To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a vectorized iteration (or loop if interleaving is required) and a single scalar iteration. Later passes can then remove the loop's entirely.

Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes.

Assisted-by: Claude Sonnet 4.6/Codex

@llvmorg-github-actions

llvmorg-github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Jack Styles (Stylie777)

Changes

Currently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterproductive to performance, often failing to vectorize entirely. This is more prevalent with targets where getMinTripCountTailFoldingThreshold() returns a value greater than 0.

To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a single vectorized itneration and a single scalar iteration. Later passes can then remove the loop's entirely.

Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes.

Assisted-by: Claude Sonnet 4.6/Codex


Full diff: https://github.com/llvm/llvm-project/pull/195823.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+37-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll (+11-13)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll (+195)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3d7f7a1649b16..b0ae53a016dc9 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3012,6 +3012,15 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   }
 
   auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop);
+  auto ApplyVectorWidth = [](FixedScalableVFPair &MaxFactors,
+                             unsigned int FixedVF, unsigned int ScalableVF) {
+    MaxFactors.FixedVF = ElementCount::getFixed(FixedVF);
+    MaxFactors.ScalableVF = ElementCount::getScalable(ScalableVF);
+  };
+  unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
+  auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC, unsigned int VF)-> bool {
+    return ExactTC.getFixedValue() == ((VF * EffectiveIC) + 1);
+  };
   if (ExpectedTC && ExpectedTC->isFixed() &&
       ExpectedTC->getFixedValue() <=
           TTI.getMinTripCountTailFoldingThreshold()) {
@@ -3023,9 +3032,36 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
           NoScalarEpilogueNeeded(MaxFactors.FixedVF.getFixedValue())) {
         LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will "
                              "remain for any chosen VF.\n");
-        MaxFactors.ScalableVF = ElementCount::getScalable(0);
+        ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);
         return MaxFactors;
       }
+      // Allow cases where the ExactTC == VF + 1. VF can be any power of
+      // 2 between 2 and MaxVF.
+      //
+      // This produces 1 vector iteration, and 1 scalar iteration with
+      // no remainder. Later passes will eliminate the loop and leave
+      // straight-line code as the both iteration counts are statically known.
+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
+      if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
+          ExactTC && ExactTC.isFixed()) {
+        if (HasOneScalarIterationRemainder(ExactTC, MaxFactors.FixedVF.getFixedValue())) {
+          LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width with 1 scalar "
+                               "iteration remainder.\n");
+          ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);
+          return MaxFactors;
+        }
+        // If the maximum VF cannot produce 1 vector iteration + 1 scalar
+        // iteration, step down VF's to find one that can.
+        for (unsigned VF = MaxFactors.FixedVF.getFixedValue(); VF >= 2;
+             VF /= 2) {
+          if (HasOneScalarIterationRemainder(ExactTC, VF)) {
+            LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF
+                              << " with 1 scalar iteration remainder.\n");
+            ApplyVectorWidth(MaxFactors, VF, 0);
+            return MaxFactors;
+          }
+        }
+      }
     }
 
     reportVectorizationFailure(
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
index c36daa40f6193..76642c63dbd26 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
@@ -56,22 +56,20 @@ exit:
 define void @trip5_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture noundef readonly %src) #0 {
 ; CHECK-LABEL: define void @trip5_i8(
 ; CHECK-SAME: ptr noalias noundef captures(none) [[DST:%.*]], ptr noalias noundef readonly captures(none) [[SRC:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i64 [[IV]]
-; CHECK-NEXT:    [[TMP0:%.*]] = load i8, ptr [[GEP_SRC]], align 1
-; CHECK-NEXT:    [[MUL:%.*]] = shl i8 [[TMP0]], 1
-; CHECK-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[IV]]
-; CHECK-NEXT:    [[TMP1:%.*]] = load i8, ptr [[GEP_DST]], align 1
-; CHECK-NEXT:    [[ADD:%.*]] = add i8 [[MUL]], [[TMP1]]
-; CHECK-NEXT:    store i8 [[ADD]], ptr [[GEP_DST]], align 1
-; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 5
-; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    ret void
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[SRC]], align 1
+; CHECK-NEXT:    [[TMP0:%.*]] = shl <4 x i8> [[WIDE_LOAD]], splat (i8 1)
+; CHECK-NEXT:    [[WIDE_LOAD1:%.*]] = load <4 x i8>, ptr [[DST]], align 1
+; CHECK-NEXT:    [[TMP1:%.*]] = add <4 x i8> [[TMP0]], [[WIDE_LOAD1]]
+; CHECK-NEXT:    store <4 x i8> [[TMP1]], ptr [[DST]], align 1
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH:.*]]
+; CHECK:       [[SCALAR_PH]]:
 ;
 entry:
   br label %loop
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll
new file mode 100644
index 0000000000000..71157580334d1
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll
@@ -0,0 +1,195 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6
+;
+; Test that a loop with trip count == VF + 1 is allowed to vectorize
+; on AArch64+SVE where getMinTripCountTailFoldingThreshold() returns 5. This
+; produces one vector iteration and one scalar iteration.
+;
+; RUN: opt -S -p loop-vectorize %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; TC=5, VF=4: TC == MaxFixedVF + 1 (5 == 4 + 1).
+; The new code path should trigger: 1 vectorized iteration + 1 scalar iteration.
+define void @tc5_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc5_vf4_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    store <4 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH1:.*]]
+; CHECK:       [[SCALAR_PH1]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 4, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT:    [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT:    store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 5
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 5
+  br i1 %exitcond, label %exit, label %loop
+exit:
+  ret void
+}
+
+; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
+; The forced interleave count should be considered when choosing VF.
+define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc5_forced_ic2_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 2
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD1:%.*]] = load <2 x i32>, ptr [[TMP0]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = add nsw <2 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    [[TMP2:%.*]] = add nsw <2 x i32> [[WIDE_LOAD1]], splat (i32 1)
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 2
+; CHECK-NEXT:    store <2 x i32> [[TMP1]], ptr [[B]], align 4
+; CHECK-NEXT:    store <2 x i32> [[TMP2]], ptr [[TMP3]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH1:.*]]
+; CHECK:       [[SCALAR_PH1]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 4, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT:    [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT:    store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 5
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 5
+  br i1 %exitcond, label %exit, label %loop, !llvm.loop !0
+exit:
+  ret void
+}
+
+; TC=3, VF=2: TC == FixedVF + 1 (3 == 2 + 1).
+; Should vectorize: 1 vector iteration of width 2, then 1 scalar iteration.
+define void @tc3_vf2_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc3_vf2_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <2 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    store <2 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH1:.*]]
+; CHECK:       [[SCALAR_PH1]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 2, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT:    [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT:    store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 3
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 3
+  br i1 %exitcond, label %exit, label %loop
+exit:
+  ret void
+}
+
+; TC=4: exact multiple of VF=4. Vectorizes via the original
+; "no scalar epilogue needed" path -- NOT the new TC==VF+1 path.
+define void @tc4_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc4_vf4_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    store <4 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 4
+  br i1 %exitcond, label %exit, label %loop
+exit:
+  ret void
+}
+
+attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
+
+!0 = distinct !{!0, !1}
+!1 = !{!"llvm.loop.interleave.count", i32 2}

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

✅ With the latest revision this PR passed the C/C++ code formatter.

@david-arm david-arm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, seems like a useful improvement!

; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
; CHECK-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 5
; CHECK-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
; CHECK-NEXT: br label %[[EXIT:.*]]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The label looks wrong here. I think you need to regenerate the CHECK lines because it should be br label %[[VECTOR_BODY:.*]] or something like that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this test is now surplus to requirements on reflection, it was previously testing that vectors were not applied to this loop, but these changes allow vectorisation and a similar test is in see-small-trip-count-vf-plus-one.ll.

LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will "
"remain for any chosen VF.\n");
MaxFactors.ScalableVF = ElementCount::getScalable(0);
ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm not sure there is much value in this change to be honest, since it's now doing more work than before setting the fixed width VF unnecessarily. It's also inconsistent with the other places in this function that change the factors similarly, e.g.

  MaxFactors.FixedVF = ElementCount::getFixed(1);

on line 3094 below. At the very least, we should use a consistent style throughout the function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this and gone back to the original behaviour.

// straight-line code as the both iteration counts are statically known.
ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
ExactTC && ExactTC.isFixed()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for ExactTC to be scalable when ExpectedTC is known to be fixed? If so, it would be good to add a test for this. If not, then you should just be able to called ExactTC.isFixed() and it will assert if it's scalable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ExpectedTC is fixed, then ExactTC will also be fixed. I've changed this to just check ExactTC.isFixed()

unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC,
unsigned int VF) -> bool {
return ExactTC.getFixedValue() == ((VF * EffectiveIC) + 1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think you can just write this as 1 + (VF * EffectiveIC)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

};
unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC,
unsigned int VF) -> bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking the VF isn't known/hasn't been chosen yet. The name should probably be changed VF->MaxVF

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
ExactTC && ExactTC.isFixed()) {
if (HasOneScalarIterationRemainder(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's duplicating the first iteration of the for loop below. I think it can be deleted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (HasOneScalarIterationRemainder(ExactTC, VF)) {
LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF
<< " with 1 scalar iteration remainder.\n");
ApplyVectorWidth(MaxFactors, VF, 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add some comments explaining why you're completely discarding scalable VFs. I presume the reason for this is to ensure the loop gets deleted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we want to avoid any scalar iterations. For interleaving this may introduce loops, as seen in tc5_forced_ic2_vectorize, but the main aim is to avoid a scalar loop, so only allow a VF that fits TC + 1.


; TC=4: exact multiple of VF=4. Vectorizes via the original
; "no scalar epilogue needed" path -- NOT the new TC==VF+1 path.
define void @tc4_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need this test. I'm pretty sure this is already covered by other tests. You can confirm this by looking for the PR that added the code around line 3032.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I expect it is.


; TC=3, VF=2: TC == FixedVF + 1 (3 == 2 + 1).
; Should vectorize: 1 vector iteration of width 2, then 1 scalar iteration.
define void @tc3_vf2_vectorize(ptr noalias %a, ptr noalias %b) #0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this essentially a duplicate of tc5_vf4_vectorize? I think you can probably remove it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It was testing the other code path, but as its now been consolidated into just the for loop, it can be removed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps worth adding a RISCV test too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

🪟 Windows x64 Test Results

  • 136094 tests passed
  • 3464 tests skipped

✅ The build succeeded and all tests passed.

// This produces 1 vector iteration, and 1 scalar iteration with
// no remainder. Later passes will eliminate the loop and leave
// straight-line code as the both iteration counts are statically known.
ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is same TC calculated at the beginning of the function. I think you can use that one instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run the tests with ExpectedTC and I would agree it's using the same value, so I have changed it to use ExpectedTC throughout.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? From what I can tell getSmallConstantTripCount and getSmallBestKnownTC can be completely different. I can see getSmallBestKnownTC is invoked with the default parameters, which means that we could hit this part in the function:

  // Check if upper bound estimate is known.
  if (unsigned ExpectedTC = PSE.getSmallConstantMaxTripCount())
    return ElementCount::getFixed(ExpectedTC);

Therefore, the value returned may correspond to the upper bound without there being an exact trip count. So I think we still have to use the result from getSmallConstantTripCount here, otherwise we're potentially making decisions based on profile data or the max trip count.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can be different yes, apologies I missed this. I am happy to go back to the original implementation.

@fhahn fhahn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the title to clarify Small Loops == loops with low trip counts. Small loops can also mean small body (which I initially assumed)

@@ -0,0 +1,113 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is SVE relevant here? Looks like none of the tests use scalable vectors

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SVE is relevant as it is the trigger for this code path (On AArch64, SVE changes the return value of getMinTripCountTailFoldingThreshold). So in this case we want to enable the feature.


; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
; The forced interleave count should be considered when choosing VF.
define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also more extreme cases, like VF = 2 with i64 and possibly a smaller type. Are those also profitable, especially if the loop body is small? Would possibly be good to cover various combinations with micro benchmarks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have expanded the testing to have different types and VF=2 for i64.

In terms of microbenchmarks, I got Codex to generate me one which just made a small loop for that datatype, and this change shows significant speedup for vectorized vs Scalar for i8, i16 and i32. i64 does not how the same results. i64 with a TC of 5 is not vectorising as the Loop Vectoriser rejects it at a later stage, but at TC3 we are seeing a significant improvement as this is vectorised.

With forced interleaving the improvements are minimal, as the loop is still present even if it has less iterations, so the real gain here is to eliminate the loop completely, which we have seen in other cases with i8, i16, i32 at TC=3 and TC=5 and i64 with only TC=3.

$ /tmp/sve-small-trip-count-bench 100000000
iterations: 100000000
i8 tc5 vector         1.0755 ns/call
i8 tc5 scalar         2.8662 ns/call
i16 tc5 vector        1.0749 ns/call
i16 tc5 scalar        2.8663 ns/call
i32 tc5 vector        1.0748 ns/call
i32 tc5 scalar        2.8669 ns/call
i64 tc3 vector        1.0748 ns/call
i64 tc3 scalar        1.7915 ns/call
i64 tc5 ic2 vector    2.8663 ns/call
i64 tc5 ic2 scalar    2.9432 ns/call

vector/scalar ratios:
i8  0.3752
i16 0.3750
i32 0.3749
i64 0.6000
i64 ic2 0.9739
checksum: 0

For context, the ratio < 1 shows an improvement, > 1 shows regression.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add them to https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization so it is easier to evaluate on different platforms?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking at this more, I am curious how we can reach such loops in practice, at least for Clang. I'd expect those loops to get fully unrolled before the vectorizer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will get them upstreamed.

The example I used to find this issue was using Flang, and nested do loops with trip counts of 5. These all fell below the unrolling threshold so were not unrolled, and also under the vectorisation threshold for the reason being addressed here.

@Stylie777 Stylie777 May 13, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, for the llvm-test-suite, the only platforms which will see impact is AArch64 with SVE and RISC-V, as these are the only platforms where getMinTripCountTailFoldingThreshold will return a value greater than 0. Still happy to upstream a benchmark for this but this codepath is quite limited to how many platforms it will impact.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stylie777 Stylie777 changed the title [LoopVectorize] Improve Vectorization of Small Loops [LoopVectorize] Improve Vectorization of Low Trip Count Loops May 11, 2026

@Stylie777 Stylie777 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all, @fhahn I have also updated the title of the PR.

// This produces 1 vector iteration, and 1 scalar iteration with
// no remainder. Later passes will eliminate the loop and leave
// straight-line code as the both iteration counts are statically known.
ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can be different yes, apologies I missed this. I am happy to go back to the original implementation.

};
unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC,
unsigned int VF) -> bool {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
ExactTC && ExactTC.isFixed()) {
if (HasOneScalarIterationRemainder(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
; The forced interleave count should be considered when choosing VF.
define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have expanded the testing to have different types and VF=2 for i64.

In terms of microbenchmarks, I got Codex to generate me one which just made a small loop for that datatype, and this change shows significant speedup for vectorized vs Scalar for i8, i16 and i32. i64 does not how the same results. i64 with a TC of 5 is not vectorising as the Loop Vectoriser rejects it at a later stage, but at TC3 we are seeing a significant improvement as this is vectorised.

With forced interleaving the improvements are minimal, as the loop is still present even if it has less iterations, so the real gain here is to eliminate the loop completely, which we have seen in other cases with i8, i16, i32 at TC=3 and TC=5 and i64 with only TC=3.

$ /tmp/sve-small-trip-count-bench 100000000
iterations: 100000000
i8 tc5 vector         1.0755 ns/call
i8 tc5 scalar         2.8662 ns/call
i16 tc5 vector        1.0749 ns/call
i16 tc5 scalar        2.8663 ns/call
i32 tc5 vector        1.0748 ns/call
i32 tc5 scalar        2.8669 ns/call
i64 tc3 vector        1.0748 ns/call
i64 tc3 scalar        1.7915 ns/call
i64 tc5 ic2 vector    2.8663 ns/call
i64 tc5 ic2 scalar    2.9432 ns/call

vector/scalar ratios:
i8  0.3752
i16 0.3750
i32 0.3749
i64 0.6000
i64 ic2 0.9739
checksum: 0

For context, the ratio < 1 shows an improvement, > 1 shows regression.

@github-actions

github-actions Bot commented May 11, 2026

Copy link
Copy Markdown

🐧 Linux x64 Test Results

  • 197166 tests passed
  • 5407 tests skipped

✅ The build succeeded and all tests passed.

// straight-line code as the both iteration counts are statically known.
ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
ExactTC.isFixed()) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is safe. There are three scenarios to consider:

  1. getSmallBestKnownTC != 0, getSmallConstantTripCount != 0, but the values are different.
  2. getSmallBestKnownTC == getSmallConstantTripCount.
  3. getSmallBestKnownTC returns non-zero, getSmallConstantTripCount returns zero.

I think you're assuming that we're always seeing cases 1 or 2, however getSmallConstantTripCount can also return ElementCount::getFixed(0) (case 3). It will be fixed, but zero so we should avoid getting into this path. Can you fix this please and add a test for this case? You'll need something like a max trip count of 5 returned by getSmallBestKnownTC (either due to profile data or the limit of the type/expression), but the exact trip count is unknown.

Also, I left an earlier comment I left on the PR - can getSmallConstantTripCount return a scalable TC when getSmallBestKnownTC returns a fixed one? If so, then this PR should add a test that exposes this case. If not, then you can just assume ExactTC is fixed and no need to add the check isFixed().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies I thought I responded to that earlier comment. from my understanding if one is returning fixed, the other will return fixed values. I don't think these can be mixed. If so I will remove that case.

I will get that fixed and updated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the check here to be ExactTC.getFixedValue() != 0 to ensure we don't go down the path of TC == 0. This also assumes that the value is fixed, as this will not be scaleable here.

// also eliminate any loops.
//
// Forced interleaving is considered when seeing if
// OneScalarIterationRemainder is produced. It may prodiced more than

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo

Suggested change
// OneScalarIterationRemainder is produced. It may prodiced more than
// OneScalarIterationRemainder is produced. It may produced more than

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still pending

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@david-arm david-arm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits addressed! However, before merging this PR it's probably a good idea to land llvm/llvm-test-suite#404 first as requested.

;
entry:
br label %loop
loop:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you add spaces between the blocks?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


; TC=5, VF=4: TC == MaxFixedVF + 1 (5 == 4 + 1).
; The new code path should trigger: 1 vectorized iteration + 1 scalar iteration.
define void @tc5_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you rename this to tc5_vf4_vectorize_i32 to be consistent with tc5_vf4_vectorize_i16?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
; The forced interleave count should be considered when choosing VF.
define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you rename this to tc5_forced_ic2_vectorize_i32 to be consistent with tc5_forced_ic2_vectorize_i64?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@david-arm david-arm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my mistake. I just realised there is a comment that hasn't been addressed regarding the ExactTC.isFixed() code.

@Stylie777

Copy link
Copy Markdown
Contributor Author

Sorry, my mistake. I just realised there is a comment that hasn't been addressed regarding the ExactTC.isFixed() code.

I may have fixed this downstream, and not pushed the changes. I will check and get it up with the nit's noted above fixed.

; on AArch64 where under getMinTripCountTailFoldingThreshold(). This
; produces the required number of vector iteration and one scalar iteration.
;
; RUN: opt -S -p loop-vectorize %s | FileCheck %s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: usually the run line is at the top

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// also eliminate any loops.
//
// Forced interleaving is considered when seeing if
// OneScalarIterationRemainder is produced. It may prodiced more than

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still pending


; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
; The forced interleave count should be considered when choosing VF.
define void @tc5_forced_ic2_vectorize_i64(ptr noalias %a, ptr noalias %b) #0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also add a low-trip count loop where the exact trip count is not known?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as part of the test added in #195823 (comment)

}

auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop);
unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are cases where user IC may be ignored, would be good to also cover this in test, e.g. something like a dependence distance that prevents the user IC from being chosen

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think NoScalarEpilogueNeeded above checks for the user IC in a similar way. Would be good to share EffectiveIC between the two

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to use EffectiveIC now.

@Stylie777 Stylie777 force-pushed the vectorize-small-loop-epilogue branch from cc1e953 to a89e00f Compare May 19, 2026 10:31
@Stylie777 Stylie777 requested a review from david-arm May 26, 2026 07:31

@david-arm david-arm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits addressed!

// also eliminate any loops.
//
// Forced interleaving is considered when seeing if
// OneScalarIterationRemainder is produced. It may produced more than

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This wording seems a little odd. I think it should probably either say:

  Forced interleaving is considered when deciding if a single scalar iteration remains.

or

  OneScalarIterationRemainder takes account of any forced interleaving.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with the second suggestion, and moved it above the call site for HasOneScalarIterationRemainder as I think it will make more sense there.

MaxVF /= 2) {
if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) {
LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF
<< " with 1 scalar iteration remainder.\n");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Perhaps with 1 scalar iteration remaining.\n sounds a bit better?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

; TC=5, VF=4: TC == VF + 1 (5 == 4 + 1).
; The natural fixed-width VF for i16 is 8 on AArch64, so this also checks that

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This comment is perhaps a bit misleading because both VF=4 and VF=8 are natural for i16 on AArch64, since NEON supports both 64-bit and 128-bit register sizes. It's perhaps more precise to say "VF=8 is a natural fixed-width for i16 types on AArch64". What do you think?

The point here is that even after stepping down to VF=4, the <4 x i16> is still a legal type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

; TC=5, VF=4: TC == VF + 1 (5 == 4 + 1).
; The natural fixed-width VF for i8 is 16 on AArch64, so this checks that the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above. Also, the comment still refers to 16-bit types.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

; ExactTC is unknown here because the loop trip count is the runtime value %n,
; but the guard proves the maximum trip count is 5. it must not use the VF+1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the comment it must not use the VF+1 escape relies upon knowing the current way the code is written in computeMaxVF. It's probably better to say something like In this case the vectoriser shouldn't optimise for a single vector iteration + single scalar iteration, because there is no guarantee we will enter the vector loop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Stylie777 added 5 commits May 28, 2026 11:17
Currently, Small Loops with Trip Counts less than 16, and in
situations where the Trip Count (TC) is less than the Tail Folding
Threshold are harder to vectorize, its only possible where no epilogue
will be emitted. However, for loops with large bodies and small trip
counts this can be counterprodictive to performance, often failing to
vectorize entirely. This is more prevelant with targets where
`getMinTripCountTailFoldingThreshold()` returns a value greater than 0.

To address this, the Small Loops where the TC == VF + 1 can now vectorize,
leading to a single vectorized itneration and a single scalar iteration.
Later passes can then remove the loop's entirely.

Testing an with OpenSource Fortran HPC Benchmark which includes multiple
loops with small trip counts, but large loop bodies, has shown significant
improvement to runtime after these changes.

Assisted-by: Claude Sonnet 4.6/Codex
@Stylie777 Stylie777 force-pushed the vectorize-small-loop-epilogue branch from 0fa0ce6 to 9b476b2 Compare May 28, 2026 10:57

@fhahn fhahn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is always profitable without checking the cost of the vector loop (e.g. the benefit from vectorization may be relatively low compared to the scalar loop, e.g. because some of the vector IR instruction need multiple assembly instructions).

I added some suggestions for cases to check to llvm/llvm-test-suite#404, for those the scalar code seemed faster than the vector code on my test system (Apple M4 with getMinTripCountTailFoldingThreshold adjusted)

Comment on lines +3172 to +3187
ExactTC.getFixedValue() != 0) {
// If the maximum VF cannot produce 1 vector iteration + 1 scalar
// iteration, step down VF's to find one that can. The result should
// also eliminate any loops.
for (unsigned MaxVF = MaxFactors.FixedVF.getFixedValue(); MaxVF >= 2;
MaxVF /= 2) {
// OneScalarIterationRemainder takes account of any forced
// interleaving.
if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) {
LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF
<< " with 1 scalar iteration remaining.\n");
MaxFactors.FixedVF = ElementCount::getFixed(MaxVF);
MaxFactors.ScalableVF = ElementCount::getScalable(0);
return MaxFactors;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a loop here or could we directly compute the VF based on EffecticeTC + trip count?

Suggested change
ExactTC.getFixedValue() != 0) {
// If the maximum VF cannot produce 1 vector iteration + 1 scalar
// iteration, step down VF's to find one that can. The result should
// also eliminate any loops.
for (unsigned MaxVF = MaxFactors.FixedVF.getFixedValue(); MaxVF >= 2;
MaxVF /= 2) {
// OneScalarIterationRemainder takes account of any forced
// interleaving.
if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) {
LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF
<< " with 1 scalar iteration remaining.\n");
MaxFactors.FixedVF = ElementCount::getFixed(MaxVF);
MaxFactors.ScalableVF = ElementCount::getScalable(0);
return MaxFactors;
}
}
ExactTC.getFixedValue() > 1) {
unsigned TC = ExactTC.getFixedValue();
unsigned MaxFixedVF = MaxFactors.FixedVF.getFixedValue();
if ((TC - 1) % EffectiveIC == 0) {
unsigned VF = (TC - 1) / EffectiveIC;
if (VF >= 2 && VF <= MaxFixedVF && isPowerOf2_32(VF)) {
LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF
<< " with 1 scalar iteration remainder.\n");
MaxFactors.FixedVF = ElementCount::getFixed(VF);
MaxFactors.ScalableVF = ElementCount::getScalable(0);
return MaxFactors;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, and implemented.

@Stylie777

Copy link
Copy Markdown
Contributor Author

@fhahn As I mentioned in llvm/llvm-test-suite#404 I have added some cost modelling to this PR so that loops that are not beneficial to vectorise in this way will be left as scalar.

@Stylie777 Stylie777 requested a review from fhahn June 19, 2026 12:34
@Stylie777

Copy link
Copy Markdown
Contributor Author

Ping

if (TC != EstimatedWidth + 1)
return false;

InstructionCost VectorCost =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much of this code looks very similar to the code above on line 6138 and there is quite a lot of it to support a corner case. Can it be commonised? It would be good to simplify where possible.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've condensed this down into a single lambda function that gets reused.

// control flow and a scalar epilogue for a single element, so require
// the vectorized form to save at least one scalar iteration.
InstructionCost AdjustedVectorCost = VectorCost + ScalarCost;
if (!VectorCost.isValid() || !ScalarCostForTC.isValid() ||

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do !AdjustedVectorCost.isValid() because the Invalid state should be inherited from LHS and propagated from the RHS of the +. Similarly for the code above on line 6147.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Stylie777

Copy link
Copy Markdown
Contributor Author

Ping @fhahn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants