[LoopVectorize] Improve Vectorization of Low Trip Count Loops by Stylie777 · Pull Request #195823 · llvm/llvm-project

Stylie777 · 2026-05-05T10:28:19Z

Currently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterproductive to performance, often failing to vectorize entirely. This is more prevalent with targets where getMinTripCountTailFoldingThreshold() returns a value greater than 0.

To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a vectorized iteration (or loop if interleaving is required) and a single scalar iteration. Later passes can then remove the loop's entirely.

Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes.

Assisted-by: Claude Sonnet 4.6/Codex

llvmorg-github-actions · 2026-05-05T10:29:01Z

@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Jack Styles (Stylie777)

Changes

Currently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterproductive to performance, often failing to vectorize entirely. This is more prevalent with targets where getMinTripCountTailFoldingThreshold() returns a value greater than 0.

To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a single vectorized itneration and a single scalar iteration. Later passes can then remove the loop's entirely.

Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes.

Assisted-by: Claude Sonnet 4.6/Codex

Full diff: https://github.com/llvm/llvm-project/pull/195823.diff

3 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+37-1)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll (+11-13)
(added) llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll (+195)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3d7f7a1649b16..b0ae53a016dc9 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3012,6 +3012,15 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   }
 
   auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop);
+  auto ApplyVectorWidth = [](FixedScalableVFPair &MaxFactors,
+                             unsigned int FixedVF, unsigned int ScalableVF) {
+    MaxFactors.FixedVF = ElementCount::getFixed(FixedVF);
+    MaxFactors.ScalableVF = ElementCount::getScalable(ScalableVF);
+  };
+  unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
+  auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC, unsigned int VF)-> bool {
+    return ExactTC.getFixedValue() == ((VF * EffectiveIC) + 1);
+  };
   if (ExpectedTC && ExpectedTC->isFixed() &&
       ExpectedTC->getFixedValue() <=
           TTI.getMinTripCountTailFoldingThreshold()) {
@@ -3023,9 +3032,36 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
           NoScalarEpilogueNeeded(MaxFactors.FixedVF.getFixedValue())) {
         LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will "
                              "remain for any chosen VF.\n");
-        MaxFactors.ScalableVF = ElementCount::getScalable(0);
+        ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);
         return MaxFactors;
       }
+      // Allow cases where the ExactTC == VF + 1. VF can be any power of
+      // 2 between 2 and MaxVF.
+      //
+      // This produces 1 vector iteration, and 1 scalar iteration with
+      // no remainder. Later passes will eliminate the loop and leave
+      // straight-line code as the both iteration counts are statically known.
+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
+      if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
+          ExactTC && ExactTC.isFixed()) {
+        if (HasOneScalarIterationRemainder(ExactTC, MaxFactors.FixedVF.getFixedValue())) {
+          LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width with 1 scalar "
+                               "iteration remainder.\n");
+          ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);
+          return MaxFactors;
+        }
+        // If the maximum VF cannot produce 1 vector iteration + 1 scalar
+        // iteration, step down VF's to find one that can.
+        for (unsigned VF = MaxFactors.FixedVF.getFixedValue(); VF >= 2;
+             VF /= 2) {
+          if (HasOneScalarIterationRemainder(ExactTC, VF)) {
+            LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF
+                              << " with 1 scalar iteration remainder.\n");
+            ApplyVectorWidth(MaxFactors, VF, 0);
+            return MaxFactors;
+          }
+        }
+      }
     }
 
     reportVectorizationFailure(
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
index c36daa40f6193..76642c63dbd26 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
@@ -56,22 +56,20 @@ exit:
 define void @trip5_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture noundef readonly %src) #0 {
 ; CHECK-LABEL: define void @trip5_i8(
 ; CHECK-SAME: ptr noalias noundef captures(none) [[DST:%.*]], ptr noalias noundef readonly captures(none) [[SRC:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[GEP_SRC:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i64 [[IV]]
-; CHECK-NEXT:    [[TMP0:%.*]] = load i8, ptr [[GEP_SRC]], align 1
-; CHECK-NEXT:    [[MUL:%.*]] = shl i8 [[TMP0]], 1
-; CHECK-NEXT:    [[GEP_DST:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[IV]]
-; CHECK-NEXT:    [[TMP1:%.*]] = load i8, ptr [[GEP_DST]], align 1
-; CHECK-NEXT:    [[ADD:%.*]] = add i8 [[MUL]], [[TMP1]]
-; CHECK-NEXT:    store i8 [[ADD]], ptr [[GEP_DST]], align 1
-; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 5
-; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
-; CHECK-NEXT:    ret void
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[SRC]], align 1
+; CHECK-NEXT:    [[TMP0:%.*]] = shl <4 x i8> [[WIDE_LOAD]], splat (i8 1)
+; CHECK-NEXT:    [[WIDE_LOAD1:%.*]] = load <4 x i8>, ptr [[DST]], align 1
+; CHECK-NEXT:    [[TMP1:%.*]] = add <4 x i8> [[TMP0]], [[WIDE_LOAD1]]
+; CHECK-NEXT:    store <4 x i8> [[TMP1]], ptr [[DST]], align 1
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH:.*]]
+; CHECK:       [[SCALAR_PH]]:
 ;
 entry:
   br label %loop
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll
new file mode 100644
index 0000000000000..71157580334d1
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll
@@ -0,0 +1,195 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6
+;
+; Test that a loop with trip count == VF + 1 is allowed to vectorize
+; on AArch64+SVE where getMinTripCountTailFoldingThreshold() returns 5. This
+; produces one vector iteration and one scalar iteration.
+;
+; RUN: opt -S -p loop-vectorize %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; TC=5, VF=4: TC == MaxFixedVF + 1 (5 == 4 + 1).
+; The new code path should trigger: 1 vectorized iteration + 1 scalar iteration.
+define void @tc5_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc5_vf4_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    store <4 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH1:.*]]
+; CHECK:       [[SCALAR_PH1]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 4, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT:    [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT:    store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 5
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 5
+  br i1 %exitcond, label %exit, label %loop
+exit:
+  ret void
+}
+
+; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
+; The forced interleave count should be considered when choosing VF.
+define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc5_forced_ic2_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 2
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[WIDE_LOAD1:%.*]] = load <2 x i32>, ptr [[TMP0]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = add nsw <2 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    [[TMP2:%.*]] = add nsw <2 x i32> [[WIDE_LOAD1]], splat (i32 1)
+; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 2
+; CHECK-NEXT:    store <2 x i32> [[TMP1]], ptr [[B]], align 4
+; CHECK-NEXT:    store <2 x i32> [[TMP2]], ptr [[TMP3]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH1:.*]]
+; CHECK:       [[SCALAR_PH1]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 4, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT:    [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT:    store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 5
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 5
+  br i1 %exitcond, label %exit, label %loop, !llvm.loop !0
+exit:
+  ret void
+}
+
+; TC=3, VF=2: TC == FixedVF + 1 (3 == 2 + 1).
+; Should vectorize: 1 vector iteration of width 2, then 1 scalar iteration.
+define void @tc3_vf2_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc3_vf2_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[SCALAR_PH:.*:]]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <2 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    store <2 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[SCALAR_PH1:.*]]
+; CHECK:       [[SCALAR_PH1]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ 2, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT:    [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT:    [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT:    store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 3
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 3
+  br i1 %exitcond, label %exit, label %loop
+exit:
+  ret void
+}
+
+; TC=4: exact multiple of VF=4. Vectorizes via the original
+; "no scalar epilogue needed" path -- NOT the new TC==VF+1 path.
+define void @tc4_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc4_vf4_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    br label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[A]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT:    store <4 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  br label %loop
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+  %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+  %val = load i32, ptr %gep.a, align 4
+  %add = add nsw i32 %val, 1
+  store i32 %add, ptr %gep.b, align 4
+  %iv.next = add nuw nsw i64 %iv, 1
+  %exitcond = icmp eq i64 %iv.next, 4
+  br i1 %exitcond, label %exit, label %loop
+exit:
+  ret void
+}
+
+attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
+
+!0 = distinct !{!0, !1}
+!1 = !{!"llvm.loop.interleave.count", i32 2}

github-actions · 2026-05-05T10:29:57Z

✅ With the latest revision this PR passed the C/C++ code formatter.

david-arm

Thanks for this, seems like a useful improvement!

david-arm · 2026-05-05T10:32:35Z

-; CHECK-NEXT:    [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 5
-; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]


The label looks wrong here. I think you need to regenerate the CHECK lines because it should be br label %[[VECTOR_BODY:.*]] or something like that.

I actually think this test is now surplus to requirements on reflection, it was previously testing that vectors were not applied to this loop, but these changes allow vectorisation and a similar test is in see-small-trip-count-vf-plus-one.ll.

david-arm · 2026-05-05T10:36:26Z

        LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will "
                             "remain for any chosen VF.\n");
-        MaxFactors.ScalableVF = ElementCount::getScalable(0);
+        ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);


nit: I'm not sure there is much value in this change to be honest, since it's now doing more work than before setting the fixed width VF unnecessarily. It's also inconsistent with the other places in this function that change the factors similarly, e.g.

MaxFactors.FixedVF = ElementCount::getFixed(1);

on line 3094 below. At the very least, we should use a consistent style throughout the function.

I've removed this and gone back to the original behaviour.

david-arm · 2026-05-05T10:42:17Z

+      // straight-line code as the both iteration counts are statically known.
+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
+      if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
+          ExactTC && ExactTC.isFixed()) {


Is it possible for ExactTC to be scalable when ExpectedTC is known to be fixed? If so, it would be good to add a test for this. If not, then you should just be able to called ExactTC.isFixed() and it will assert if it's scalable.

If ExpectedTC is fixed, then ExactTC will also be fixed. I've changed this to just check ExactTC.isFixed()

david-arm · 2026-05-05T10:44:29Z

+  unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
+  auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC,
+                                                      unsigned int VF) -> bool {
+    return ExactTC.getFixedValue() == ((VF * EffectiveIC) + 1);


nit: I think you can just write this as 1 + (VF * EffectiveIC)

david-arm · 2026-05-05T10:45:44Z

+  };
+  unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
+  auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC,
+                                                      unsigned int VF) -> bool {


Strictly speaking the VF isn't known/hasn't been chosen yet. The name should probably be changed VF->MaxVF

david-arm · 2026-05-05T10:46:27Z

+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
+      if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
+          ExactTC && ExactTC.isFixed()) {
+        if (HasOneScalarIterationRemainder(


This looks like it's duplicating the first iteration of the for loop below. I think it can be deleted.

david-arm · 2026-05-05T10:47:39Z

+          if (HasOneScalarIterationRemainder(ExactTC, VF)) {
+            LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF
+                              << " with 1 scalar iteration remainder.\n");
+            ApplyVectorWidth(MaxFactors, VF, 0);


It would be good to add some comments explaining why you're completely discarding scalable VFs. I presume the reason for this is to ensure the loop gets deleted?

Yes, we want to avoid any scalar iterations. For interleaving this may introduce loops, as seen in tc5_forced_ic2_vectorize, but the main aim is to avoid a scalar loop, so only allow a VF that fits TC + 1.

david-arm · 2026-05-05T10:51:11Z

+
+; TC=4: exact multiple of VF=4. Vectorizes via the original
+; "no scalar epilogue needed" path -- NOT the new TC==VF+1 path.
+define void @tc4_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {


I'm not sure we need this test. I'm pretty sure this is already covered by other tests. You can confirm this by looking for the PR that added the code around line 3032.

Done. I expect it is.

david-arm · 2026-05-05T10:54:01Z

+
+; TC=3, VF=2: TC == FixedVF + 1 (3 == 2 + 1).
+; Should vectorize: 1 vector iteration of width 2, then 1 scalar iteration.
+define void @tc3_vf2_vectorize(ptr noalias %a, ptr noalias %b) #0 {


Isn't this essentially a duplicate of tc5_vf4_vectorize? I think you can probably remove it.

Done. It was testing the other code path, but as its now been consolidated into just the for loop, it can be removed.

david-arm · 2026-05-05T10:55:23Z

Perhaps worth adding a RISCV test too?

github-actions · 2026-05-05T11:16:44Z

🪟 Windows x64 Test Results

136094 tests passed
3464 tests skipped

✅ The build succeeded and all tests passed.

hassnaaHamdi · 2026-05-05T18:43:07Z

+      // This produces 1 vector iteration, and 1 scalar iteration with
+      // no remainder. Later passes will eliminate the loop and leave
+      // straight-line code as the both iteration counts are statically known.
+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);


I think this is same TC calculated at the beginning of the function. I think you can use that one instead.

I've run the tests with ExpectedTC and I would agree it's using the same value, so I have changed it to use ExpectedTC throughout.

Is this true? From what I can tell getSmallConstantTripCount and getSmallBestKnownTC can be completely different. I can see getSmallBestKnownTC is invoked with the default parameters, which means that we could hit this part in the function:

// Check if upper bound estimate is known. if (unsigned ExpectedTC = PSE.getSmallConstantMaxTripCount()) return ElementCount::getFixed(ExpectedTC);

Therefore, the value returned may correspond to the upper bound without there being an exact trip count. So I think we still have to use the result from getSmallConstantTripCount here, otherwise we're potentially making decisions based on profile data or the max trip count.

They can be different yes, apologies I missed this. I am happy to go back to the original implementation.

fhahn

Could you update the title to clarify Small Loops == loops with low trip counts. Small loops can also mean small body (which I initially assumed)

fhahn · 2026-05-10T20:22:03Z

@@ -0,0 +1,113 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6


Is SVE relevant here? Looks like none of the tests use scalable vectors

SVE is relevant as it is the trigger for this code path (On AArch64, SVE changes the return value of getMinTripCountTailFoldingThreshold). So in this case we want to enable the feature.

fhahn · 2026-05-10T21:12:44Z

+
+; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
+; The forced interleave count should be considered when choosing VF.
+define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {


could you also more extreme cases, like VF = 2 with i64 and possibly a smaller type. Are those also profitable, especially if the loop body is small? Would possibly be good to cover various combinations with micro benchmarks.

I have expanded the testing to have different types and VF=2 for i64.

In terms of microbenchmarks, I got Codex to generate me one which just made a small loop for that datatype, and this change shows significant speedup for vectorized vs Scalar for i8, i16 and i32. i64 does not how the same results. i64 with a TC of 5 is not vectorising as the Loop Vectoriser rejects it at a later stage, but at TC3 we are seeing a significant improvement as this is vectorised.

With forced interleaving the improvements are minimal, as the loop is still present even if it has less iterations, so the real gain here is to eliminate the loop completely, which we have seen in other cases with i8, i16, i32 at TC=3 and TC=5 and i64 with only TC=3.

$ /tmp/sve-small-trip-count-bench 100000000 iterations: 100000000 i8 tc5 vector 1.0755 ns/call i8 tc5 scalar 2.8662 ns/call i16 tc5 vector 1.0749 ns/call i16 tc5 scalar 2.8663 ns/call i32 tc5 vector 1.0748 ns/call i32 tc5 scalar 2.8669 ns/call i64 tc3 vector 1.0748 ns/call i64 tc3 scalar 1.7915 ns/call i64 tc5 ic2 vector 2.8663 ns/call i64 tc5 ic2 scalar 2.9432 ns/call vector/scalar ratios: i8 0.3752 i16 0.3750 i32 0.3749 i64 0.6000 i64 ic2 0.9739 checksum: 0

For context, the ratio < 1 shows an improvement, > 1 shows regression.

Could you add them to https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization so it is easier to evaluate on different platforms?

Actually, looking at this more, I am curious how we can reach such loops in practice, at least for Clang. I'd expect those loops to get fully unrolled before the vectorizer?

I will get them upstreamed.

The example I used to find this issue was using Flang, and nested do loops with trip counts of 5. These all fell below the unrolling threshold so were not unrolled, and also under the vectorisation threshold for the reason being addressed here.

In fact, for the llvm-test-suite, the only platforms which will see impact is AArch64 with SVE and RISC-V, as these are the only platforms where getMinTripCountTailFoldingThreshold will return a value greater than 0. Still happy to upstream a benchmark for this but this codepath is quite limited to how many platforms it will impact.

llvm/llvm-test-suite#404

Stylie777

Thanks all, @fhahn I have also updated the title of the PR.

Stylie777 · 2026-05-11T08:27:16Z

+      // This produces 1 vector iteration, and 1 scalar iteration with
+      // no remainder. Later passes will eliminate the loop and leave
+      // straight-line code as the both iteration counts are statically known.
+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);


They can be different yes, apologies I missed this. I am happy to go back to the original implementation.

Stylie777 · 2026-05-11T08:28:40Z

+  };
+  unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
+  auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC,
+                                                      unsigned int VF) -> bool {


Stylie777 · 2026-05-11T08:28:44Z

+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
+      if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
+          ExactTC && ExactTC.isFixed()) {
+        if (HasOneScalarIterationRemainder(


Stylie777 · 2026-05-11T09:44:39Z

+
+; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
+; The forced interleave count should be considered when choosing VF.
+define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {


I have expanded the testing to have different types and VF=2 for i64.

In terms of microbenchmarks, I got Codex to generate me one which just made a small loop for that datatype, and this change shows significant speedup for vectorized vs Scalar for i8, i16 and i32. i64 does not how the same results. i64 with a TC of 5 is not vectorising as the Loop Vectoriser rejects it at a later stage, but at TC3 we are seeing a significant improvement as this is vectorised.

With forced interleaving the improvements are minimal, as the loop is still present even if it has less iterations, so the real gain here is to eliminate the loop completely, which we have seen in other cases with i8, i16, i32 at TC=3 and TC=5 and i64 with only TC=3.

$ /tmp/sve-small-trip-count-bench 100000000 iterations: 100000000 i8 tc5 vector 1.0755 ns/call i8 tc5 scalar 2.8662 ns/call i16 tc5 vector 1.0749 ns/call i16 tc5 scalar 2.8663 ns/call i32 tc5 vector 1.0748 ns/call i32 tc5 scalar 2.8669 ns/call i64 tc3 vector 1.0748 ns/call i64 tc3 scalar 1.7915 ns/call i64 tc5 ic2 vector 2.8663 ns/call i64 tc5 ic2 scalar 2.9432 ns/call vector/scalar ratios: i8 0.3752 i16 0.3750 i32 0.3749 i64 0.6000 i64 ic2 0.9739 checksum: 0

For context, the ratio < 1 shows an improvement, > 1 shows regression.

github-actions · 2026-05-11T11:26:33Z

🐧 Linux x64 Test Results

197166 tests passed
5407 tests skipped

✅ The build succeeded and all tests passed.

david-arm · 2026-05-11T11:13:28Z

+      // straight-line code as the both iteration counts are statically known.
+      ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
+      if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
+          ExactTC.isFixed()) {


I don't think this is safe. There are three scenarios to consider:

getSmallBestKnownTC != 0, getSmallConstantTripCount != 0, but the values are different.

getSmallBestKnownTC == getSmallConstantTripCount.

getSmallBestKnownTC returns non-zero, getSmallConstantTripCount returns zero.

I think you're assuming that we're always seeing cases 1 or 2, however getSmallConstantTripCount can also return ElementCount::getFixed(0) (case 3). It will be fixed, but zero so we should avoid getting into this path. Can you fix this please and add a test for this case? You'll need something like a max trip count of 5 returned by getSmallBestKnownTC (either due to profile data or the limit of the type/expression), but the exact trip count is unknown.

Also, I left an earlier comment I left on the PR - can getSmallConstantTripCount return a scalable TC when getSmallBestKnownTC returns a fixed one? If so, then this PR should add a test that exposes this case. If not, then you can just assume ExactTC is fixed and no need to add the check isFixed().

Apologies I thought I responded to that earlier comment. from my understanding if one is returning fixed, the other will return fixed values. I don't think these can be mixed. If so I will remove that case.

I will get that fixed and updated.

I have changed the check here to be ExactTC.getFixedValue() != 0 to ensure we don't go down the path of TC == 0. This also assumes that the value is fixed, as this will not be scaleable here.

MacDue · 2026-05-12T13:04:38Z

+        // also eliminate any loops.
+        //
+        // Forced interleaving is considered when seeing if
+        // OneScalarIterationRemainder is produced. It may prodiced more than


nit: typo

Suggested change

// OneScalarIterationRemainder is produced. It may prodiced more than

// OneScalarIterationRemainder is produced. It may produced more than

still pending

david-arm

LGTM with nits addressed! However, before merging this PR it's probably a good idea to land llvm/llvm-test-suite#404 first as requested.

david-arm · 2026-05-19T08:29:01Z

+;
+entry:
+  br label %loop
+loop:


nit: Can you add spaces between the blocks?

david-arm · 2026-05-19T08:29:59Z

+
+; TC=5, VF=4: TC == MaxFixedVF + 1 (5 == 4 + 1).
+; The new code path should trigger: 1 vectorized iteration + 1 scalar iteration.
+define void @tc5_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {


nit: Can you rename this to tc5_vf4_vectorize_i32 to be consistent with tc5_vf4_vectorize_i16?

david-arm · 2026-05-19T08:31:39Z

+
+; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
+; The forced interleave count should be considered when choosing VF.
+define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {


nit: Can you rename this to tc5_forced_ic2_vectorize_i32 to be consistent with tc5_forced_ic2_vectorize_i64?

david-arm

Sorry, my mistake. I just realised there is a comment that hasn't been addressed regarding the ExactTC.isFixed() code.

Stylie777 · 2026-05-19T08:40:35Z

Sorry, my mistake. I just realised there is a comment that hasn't been addressed regarding the ExactTC.isFixed() code.

I may have fixed this downstream, and not pushed the changes. I will check and get it up with the nit's noted above fixed.

fhahn · 2026-05-19T08:47:52Z

+; on AArch64 where under getMinTripCountTailFoldingThreshold(). This
+; produces the required number of vector iteration and one scalar iteration.
+;
+; RUN: opt -S -p loop-vectorize %s | FileCheck %s


nit: usually the run line is at the top

fhahn · 2026-05-19T08:48:47Z

+        // also eliminate any loops.
+        //
+        // Forced interleaving is considered when seeing if
+        // OneScalarIterationRemainder is produced. It may prodiced more than


still pending

fhahn · 2026-05-19T08:49:12Z

+
+; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
+; The forced interleave count should be considered when choosing VF.
+define void @tc5_forced_ic2_vectorize_i64(ptr noalias %a, ptr noalias %b) #0 {


could you also add a low-trip count loop where the exact trip count is not known?

Done as part of the test added in #195823 (comment)

fhahn · 2026-05-19T08:49:50Z

  }

  auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop);
+  unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;


I think there are cases where user IC may be ignored, would be good to also cover this in test, e.g. something like a dependence distance that prevents the user IC from being chosen

Added a test for this

I think NoScalarEpilogueNeeded above checks for the user IC in a similar way. Would be good to share EffectiveIC between the two

I've changed it to use EffectiveIC now.

david-arm

LGTM with nits addressed!

david-arm · 2026-05-28T08:54:26Z

+        // also eliminate any loops.
+        //
+        // Forced interleaving is considered when seeing if
+        // OneScalarIterationRemainder is produced. It may produced more than


nit: This wording seems a little odd. I think it should probably either say:

Forced interleaving is considered when deciding if a single scalar iteration remains.

or

OneScalarIterationRemainder takes account of any forced interleaving.

Done with the second suggestion, and moved it above the call site for HasOneScalarIterationRemainder as I think it will make more sense there.

david-arm · 2026-05-28T08:55:39Z

+             MaxVF /= 2) {
+          if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) {
+            LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF
+                              << " with 1 scalar iteration remainder.\n");


nit: Perhaps with 1 scalar iteration remaining.\n sounds a bit better?

david-arm · 2026-05-28T08:57:08Z

+}
+
+; TC=5, VF=4: TC == VF + 1 (5 == 4 + 1).
+; The natural fixed-width VF for i16 is 8 on AArch64, so this also checks that


nit: This comment is perhaps a bit misleading because both VF=4 and VF=8 are natural for i16 on AArch64, since NEON supports both 64-bit and 128-bit register sizes. It's perhaps more precise to say "VF=8 is a natural fixed-width for i16 types on AArch64". What do you think?

The point here is that even after stepping down to VF=4, the <4 x i16> is still a legal type.

david-arm · 2026-05-28T08:58:26Z

+}
+
+; TC=5, VF=4: TC == VF + 1 (5 == 4 + 1).
+; The natural fixed-width VF for i8 is 16 on AArch64, so this checks that the


Same comment as above. Also, the comment still refers to 16-bit types.

david-arm · 2026-05-28T09:04:55Z

+}
+
+; ExactTC is unknown here because the loop trip count is the runtime value %n,
+; but the guard proves the maximum trip count is 5. it must not use the VF+1


nit: I think the comment it must not use the VF+1 escape relies upon knowing the current way the code is written in computeMaxVF. It's probably better to say something like In this case the vectoriser shouldn't optimise for a single vector iteration + single scalar iteration, because there is no guarantee we will enter the vector loop.

Currently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterprodictive to performance, often failing to vectorize entirely. This is more prevelant with targets where `getMinTripCountTailFoldingThreshold()` returns a value greater than 0. To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a single vectorized itneration and a single scalar iteration. Later passes can then remove the loop's entirely. Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes. Assisted-by: Claude Sonnet 4.6/Codex

The checks for matching to VF+1 == TC should protect against this, but its best to the explicit.

fhahn

I am not sure if this is always profitable without checking the cost of the vector loop (e.g. the benefit from vectorization may be relatively low compared to the scalar loop, e.g. because some of the vector IR instruction need multiple assembly instructions).

I added some suggestions for cases to check to llvm/llvm-test-suite#404, for those the scalar code seemed faster than the vector code on my test system (Apple M4 with getMinTripCountTailFoldingThreshold adjusted)

fhahn · 2026-05-28T12:49:20Z

+          ExactTC.getFixedValue() != 0) {
+        // If the maximum VF cannot produce 1 vector iteration + 1 scalar
+        // iteration, step down VF's to find one that can. The result should
+        // also eliminate any loops.
+        for (unsigned MaxVF = MaxFactors.FixedVF.getFixedValue(); MaxVF >= 2;
+             MaxVF /= 2) {
+          // OneScalarIterationRemainder takes account of any forced
+          // interleaving.
+          if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) {
+            LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF
+                              << " with 1 scalar iteration remaining.\n");
+            MaxFactors.FixedVF = ElementCount::getFixed(MaxVF);
+            MaxFactors.ScalableVF = ElementCount::getScalable(0);
+            return MaxFactors;
+          }
+        }


Do we need a loop here or could we directly compute the VF based on EffecticeTC + trip count?

Suggested change

ExactTC.getFixedValue() != 0) {

// If the maximum VF cannot produce 1 vector iteration + 1 scalar

// iteration, step down VF's to find one that can. The result should

// also eliminate any loops.

for (unsigned MaxVF = MaxFactors.FixedVF.getFixedValue(); MaxVF >= 2;

MaxVF /= 2) {

// OneScalarIterationRemainder takes account of any forced

// interleaving.

if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) {

LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF

<< " with 1 scalar iteration remaining.\n");

MaxFactors.FixedVF = ElementCount::getFixed(MaxVF);

MaxFactors.ScalableVF = ElementCount::getScalable(0);

return MaxFactors;

}

}

ExactTC.getFixedValue() > 1) {

unsigned TC = ExactTC.getFixedValue();

unsigned MaxFixedVF = MaxFactors.FixedVF.getFixedValue();

if ((TC - 1) % EffectiveIC == 0) {

unsigned VF = (TC - 1) / EffectiveIC;

if (VF >= 2 && VF <= MaxFixedVF && isPowerOf2_32(VF)) {

LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF

<< " with 1 scalar iteration remainder.\n");

MaxFactors.FixedVF = ElementCount::getFixed(VF);

MaxFactors.ScalableVF = ElementCount::getScalable(0);

return MaxFactors;

}

Agreed, and implemented.

Stylie777 · 2026-05-29T15:03:11Z

@fhahn As I mentioned in llvm/llvm-test-suite#404 I have added some cost modelling to this PR so that loops that are not beneficial to vectorise in this way will be left as scalar.

Stylie777 · 2026-06-19T12:34:16Z

Ping

david-arm · 2026-06-19T12:55:54Z

+        if (TC != EstimatedWidth + 1)
+          return false;
+
+        InstructionCost VectorCost =


Much of this code looks very similar to the code above on line 6138 and there is quite a lot of it to support a corner case. Can it be commonised? It would be good to simplify where possible.

I've condensed this down into a single lambda function that gets reused.

david-arm · 2026-06-19T14:25:57Z

+        // control flow and a scalar epilogue for a single element, so require
+        // the vectorized form to save at least one scalar iteration.
+        InstructionCost AdjustedVectorCost = VectorCost + ScalarCost;
+        if (!VectorCost.isValid() || !ScalarCostForTC.isValid() ||


I think you can just do !AdjustedVectorCost.isValid() because the Invalid state should be inherited from LHS and propagated from the RHS of the +. Similarly for the code above on line 6147.

Stylie777 · 2026-06-26T10:27:22Z

Ping @fhahn

Stylie777 requested review from david-arm and fhahn May 5, 2026 10:28

llvmorg-github-actions Bot added vectorizers llvm:transforms labels May 5, 2026

david-arm reviewed May 5, 2026

View reviewed changes

llvmorg-github-actions Bot added the backend:RISC-V label May 5, 2026

hassnaaHamdi reviewed May 5, 2026

View reviewed changes

fhahn reviewed May 10, 2026

View reviewed changes

Stylie777 changed the title ~~[LoopVectorize] Improve Vectorization of Small Loops~~ [LoopVectorize] Improve Vectorization of Low Trip Count Loops May 11, 2026

Stylie777 commented May 11, 2026

View reviewed changes

david-arm reviewed May 11, 2026

View reviewed changes

MacDue reviewed May 12, 2026

View reviewed changes

david-arm mentioned this pull request May 13, 2026

[LV] Allow runtime checks for low trip count loops without scalar epilogues #197391

Open

Stylie777 mentioned this pull request May 14, 2026

Add MicroBenchmark for Small Trip Count Loop vectorization llvm/llvm-test-suite#404

Open

david-arm approved these changes May 19, 2026

View reviewed changes

david-arm requested changes May 19, 2026

View reviewed changes

fhahn reviewed May 19, 2026

View reviewed changes

Stylie777 force-pushed the vectorize-small-loop-epilogue branch from cc1e953 to a89e00f Compare May 19, 2026 10:31

Stylie777 requested a review from david-arm May 26, 2026 07:31

david-arm approved these changes May 28, 2026

View reviewed changes

Stylie777 added 5 commits May 28, 2026 11:17

formatting

5ca8d7f

Responding to review comments

2df8b55

formatting

5ecf227

Use ExpectedTC

3b284f5

Stylie777 added 6 commits May 28, 2026 11:17

Update OPT Test and readd ExactTc

87a2439

Check for ExactTC!=0

1cb8bdd

The checks for matching to VF+1 == TC should protect against this, but its best to the explicit.

Respond to review comments

fb9102d

Reuse EffectiveIC in NoScalarEpilogueNeeded

e7aade8

format

c160004

Address nit comments

9b476b2

Stylie777 force-pushed the vectorize-small-loop-epilogue branch from 0fa0ce6 to 9b476b2 Compare May 28, 2026 10:57

fhahn reviewed May 28, 2026

View reviewed changes

Stylie777 added 2 commits May 29, 2026 14:06

Add cost modelling for where scalar loops are more profitable

cc5cdab

Remove for loop for calculating best VF

8b5a85f

Stylie777 requested a review from fhahn June 19, 2026 12:34

david-arm reviewed Jun 22, 2026

View reviewed changes

Refactor IsUnprofitableOneScalarTail Lambda Function

74e53f0

		@@ -0,0 +1,113 @@
		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6

	// OneScalarIterationRemainder is produced. It may prodiced more than
	// OneScalarIterationRemainder is produced. It may produced more than

Uh oh!

Conversation

Stylie777 commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmorg-github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

david-arm left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🪟 Windows x64 Test Results

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fhahn left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Stylie777 commented May 5, 2026 •

edited

Loading

llvmorg-github-actions Bot commented May 5, 2026 •

edited

Loading

github-actions Bot commented May 5, 2026 •

edited

Loading

github-actions Bot commented May 5, 2026 •

edited

Loading

Stylie777 May 13, 2026 •

edited

Loading

github-actions Bot commented May 11, 2026 •

edited

Loading