[LV] Allow runtime checks for low trip count loops without scalar epilogues by Mel-Chen · Pull Request #197391 · llvm/llvm-project

Mel-Chen · 2026-05-13T09:03:37Z

Previously, under CM_EpilogueNotAllowedLowTripLoop, any loop with a trip count below TinyTripCountVectorThreshold that required runtime checks was barred from vectorization. However, vectorization should still be considered profitable as long as the runtime check overhead does not exceed the actual vectorization gains.

This patch makes CM_EpilogueNotAllowedLowTripLoop rely directly on the minimum profitable trip count check. After this patch, vectorization is prohibited only if the trip count is below the minimum profitable trip count. CM_EpilogueNotAllowedLowTripLoop will focus on preventing scalar epilogue generation rather than restricting runtime checks.
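To make the "minimum profitable trip count" idea concrete, here is a minimal sketch of how such a threshold could be derived from a simple linear cost model. It is not the code in LoopVectorize.cpp: the struct `LoopCosts`, the functions `minProfitableTripCount` and `worthVectorizing`, and the cost units are all hypothetical, and the model treats the number of vector iterations as the fraction TC / VF rather than accounting for remainders.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified cost inputs; not the LoopVectorize data structures.
struct LoopCosts {
  uint64_t ScalarIterCost;   // cost of one scalar iteration
  uint64_t VectorIterCost;   // cost of one vector iteration covering VF elements
  uint64_t RuntimeCheckCost; // one-off cost of the runtime checks
  uint64_t VF;               // vectorization factor
};

// Smallest trip count TC for which the assumed model
//   TC * ScalarIterCost >= RuntimeCheckCost + (TC / VF) * VectorIterCost
// holds, or nullopt if the vector body never beats the scalar body.
std::optional<uint64_t> minProfitableTripCount(const LoopCosts &C) {
  uint64_t ScalarPerVF = C.ScalarIterCost * C.VF;
  if (ScalarPerVF <= C.VectorIterCost)
    return std::nullopt; // no per-iteration gain to amortize the checks
  uint64_t Gain = ScalarPerVF - C.VectorIterCost; // gain per VF scalar iterations
  // Rearranging the model gives TC >= RuntimeCheckCost * VF / Gain; round up.
  return (C.RuntimeCheckCost * C.VF + Gain - 1) / Gain;
}

// Decision in the spirit of the patch description: reject the loop only when
// the trip count is below the minimum profitable trip count.
bool worthVectorizing(uint64_t TripCount, const LoopCosts &C) {
  std::optional<uint64_t> MinTC = minProfitableTripCount(C);
  return MinTC.has_value() && TripCount >= *MinTC;
}
```

Under these assumed costs, a loop with cheap runtime checks can pass the check even at a trip count below 16, while the same loop with expensive checks is rejected. That is the distinction the blanket runtime-check ban could not make.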

TODO: Refine the cost calculation for scalar epilogues to eventually remove CM_EpilogueNotAllowedLowTripLoop.

This change was suggested by Alexey Bataev in #176754.

llvmorg-github-actions · 2026-05-13T09:04:29Z

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Mel Chen (Mel-Chen)

Patch is 20.64 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/197391.diff

6 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+12-20)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll (+2-2)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll (+32-3)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/X86/optsize.ll (+66-22)
(modified) llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll (+6-8)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 9fec3b74d9630..02993d0135c14 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -187,12 +187,11 @@ static cl::opt<unsigned> EpilogueVectorizationMinVF(
              "the specified value are considered for epilogue vectorization."));
 
 /// Loops with a known constant trip count below this number are vectorized only
-/// if no scalar iteration overheads are incurred.
+/// if no scalar epilogue is required.
 static cl::opt<unsigned> TinyTripCountVectorThreshold(
     "vectorizer-min-trip-count", cl::init(16), cl::Hidden,
     cl::desc("Loops with a constant trip count that is smaller than this "
-             "value are vectorized only if no scalar iteration overheads "
-             "are incurred."));
+             "value are vectorized only if no scalar epilogue is required"));
 
 static cl::opt<unsigned> VectorizeMemoryCheckThreshold(
     "vectorize-memory-check-threshold", cl::init(128), cl::Hidden,
@@ -795,10 +794,9 @@ enum EpilogueLowering {
   // Vectorization with OptForSize: don't allow epilogues.
   CM_EpilogueNotAllowedOptSize,
 
-  // A special case of vectorisation with OptForSize: loops with a very small
-  // trip count are considered for vectorization under OptForSize, thereby
-  // making sure the cost of their loop body is dominant, free of runtime
-  // guards and scalar iteration overheads.
+  // A special case for loops with a very small trip count: they are vectorized
+  // only if no scalar epilogue is required, ensuring the vectorized loop body
+  // is the dominant cost without scalar epilogue overheads.
   CM_EpilogueNotAllowedLowTripLoop,
 
   // Loop hint indicating an epilogue is undesired, apply tail folding.
@@ -2929,13 +2927,11 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
                       << "vector loop.\n");
     break;
   case CM_EpilogueNotAllowedLowTripLoop:
-    // fallthrough as a special case of OptForSize
+    LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to low trip "
+                      << "count.\n");
+    break;
   case CM_EpilogueNotAllowedOptSize:
-    if (EpilogueLoweringStatus == CM_EpilogueNotAllowedOptSize)
-      LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to -Os/-Oz.\n");
-    else
-      LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to low trip "
-                        << "count.\n");
+    LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to -Os/-Oz.\n");
 
     // Bail if runtime checks are required, which are not good when optimising
     // for size.
@@ -8095,17 +8091,13 @@ bool LoopVectorizePass::processLoop(Loop *L) {
       ExpectedTC->getFixedValue() < TinyTripCountVectorThreshold) {
     LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
                       << "This loop is worth vectorizing only if no scalar "
-                      << "iteration overheads are incurred.");
+                      << "epilogue is required.");
     if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
       LLVM_DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
     else {
       LLVM_DEBUG(dbgs() << "\n");
-      // Tail-folded loops are efficient even when the loop
-      // iteration count is low. However, setting the epilogue policy to
-      // `CM_EpilogueNotAllowedLowTripLoop` prevents vectorizing loops
-      // with runtime checks. It's more effective to let
-      // `isOutsideLoopWorkProfitable` determine if vectorization is
-      // beneficial for the loop.
+      // For loops with very small trip counts, require that no scalar epilogue
+      // is needed. Tail-folded loops are still allowed as they avoid epilogues.
       if (SEL != CM_EpilogueNotNeededFoldTail)
         SEL = CM_EpilogueNotAllowedLowTripLoop;
     }
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
index 3ea4e7fd33359..a7eb6aedbbc58 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
@@ -17,8 +17,8 @@ target triple = "aarch64-unknown-linux-gnu"
 ; DEBUG-VS2: Main Loop VF:vscale x 8, Main Loop UF:1, Epilogue Loop VF:8, Epilogue Loop UF:1
 
 ; DEBUG-LABEL: LV: Checking a loop in 'trip_count_too_small'
-; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
-; DEBUG: LV: Not vectorizing: Runtime SCEV check is required with -Os/-Oz.
+; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar epilogue is required.
+; DEBUG: LV: Not vectorizing: The trip count is below the minial threshold value..
 
 ; DEBUG-LABEL: LV: Checking a loop in 'too_many_runtime_checks'
 ; DEBUG: LV: Found trip count: 0
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll b/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll
index f4a1814efc038..91e211847b03d 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll
@@ -14,6 +14,7 @@ define void @_Z1dv() {
 ; CHECK-LABEL: @_Z1dv(
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @"_ZN3$_01aEv"(ptr nonnull @b)
+; CHECK-NEXT:    [[CALL1:%.*]] = ptrtoaddr ptr [[CALL]] to i64
 ; CHECK-NEXT:    br label [[FOR_COND:%.*]]
 ; CHECK:       for.cond:
 ; CHECK-NEXT:    [[F_0:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD5:%.*]], [[FOR_COND_CLEANUP:%.*]] ]
@@ -23,15 +24,43 @@ define void @_Z1dv() {
 ; CHECK-NEXT:    br i1 [[CMP12]], label [[FOR_BODY_LR_PH:%.*]], label [[FOR_COND_CLEANUP]]
 ; CHECK:       for.body.lr.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[G_0]] to i64
+; CHECK-NEXT:    [[TMP12:%.*]] = sub i64 4, [[TMP0]]
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       vector.memcheck:
+; CHECK-NEXT:    [[TMP13:%.*]] = add i64 [[CALL1]], [[TMP0]]
+; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[G_0]], [[CONV]]
+; CHECK-NEXT:    [[TMP4:%.*]] = zext nneg i32 [[TMP3]] to i64
+; CHECK-NEXT:    [[TMP5:%.*]] = add i64 [[TMP4]], ptrtoaddr (ptr @c to i64)
+; CHECK-NEXT:    [[TMP6:%.*]] = sub i64 [[TMP13]], [[TMP5]]
+; CHECK-NEXT:    [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], 4
+; CHECK-NEXT:    br i1 [[DIFF_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP12]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP12]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[TMP0]], [[N_VEC]]
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[TMP8:%.*]] = add i32 [[CONV]], [[G_0]]
+; CHECK-NEXT:    [[TMP9:%.*]] = zext i32 [[TMP8]] to i64
+; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds [6 x i8], ptr @c, i64 0, i64 [[TMP9]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP10]], align 1
+; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[CALL]], i64 [[TMP0]]
+; CHECK-NEXT:    store <4 x i8> [[WIDE_LOAD]], ptr [[TMP11]], align 1
+; CHECK-NEXT:    br label [[MIDDLE_BLOCK:%.*]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP12]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP7]], [[MIDDLE_BLOCK]] ], [ [[TMP0]], [[FOR_BODY]] ]
+; CHECK-NEXT:    br label [[FOR_BODY1:%.*]]
 ; CHECK:       for.cond.cleanup.loopexit:
 ; CHECK-NEXT:    br label [[FOR_COND_CLEANUP]]
 ; CHECK:       for.cond.cleanup:
-; CHECK-NEXT:    [[G_1_LCSSA]] = phi i32 [ [[G_0]], [[FOR_COND]] ], [ 4, [[FOR_COND_CLEANUP_LOOPEXIT:%.*]] ]
+; CHECK-NEXT:    [[G_1_LCSSA]] = phi i32 [ [[G_0]], [[FOR_COND]] ], [ 4, [[FOR_COND_CLEANUP_LOOPEXIT]] ]
 ; CHECK-NEXT:    [[ADD5]] = add nuw nsw i32 [[CONV]], 4
 ; CHECK-NEXT:    br label [[FOR_COND]]
 ; CHECK:       for.body:
-; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[TMP0]], [[FOR_BODY_LR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY1]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = trunc i64 [[INDVARS_IV]] to i32
 ; CHECK-NEXT:    [[ADD:%.*]] = add i32 [[CONV]], [[TMP1]]
 ; CHECK-NEXT:    [[IDXPROM:%.*]] = zext i32 [[ADD]] to i64
@@ -41,7 +70,7 @@ define void @_Z1dv() {
 ; CHECK-NEXT:    store i8 [[TMP2]], ptr [[ARRAYIDX3]], align 1
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 4
-; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]]
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY1]], !llvm.loop [[LOOP0:![0-9]+]]
 ;
 entry:
   %call = tail call ptr @"_ZN3$_01aEv"(ptr nonnull @b) #2
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll b/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll
index 85dd5898b26dd..11b244e388c4b 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll
@@ -57,7 +57,7 @@ for.end:
 define i32 @foo_mid_trip_count(ptr %a, ptr %b, ptr %c, i32 %bound) {
 ; CHECK-LABEL: @foo_mid_trip_count(
 ; PREDICATED: vector.body
-; SCALAR-NOT: vector.body
+; SCALAR: vector.body
 entry:
   br label %for.body
 
diff --git a/llvm/test/Transforms/LoopVectorize/X86/optsize.ll b/llvm/test/Transforms/LoopVectorize/X86/optsize.ll
index 8460af9e6aa25..2bb15492a29a6 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/optsize.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/optsize.ll
@@ -230,16 +230,62 @@ attributes #2 = { optsize }
 define void @scev_predicate_no_vec(i32 %start, ptr %dst) {
 ; CHECK-LABEL: define void @scev_predicate_no_vec(
 ; CHECK-SAME: i32 [[START:%.*]], ptr [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[TMP0:%.*]] = trunc i32 [[START]] to i16
+; CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[TMP0]] to i32
+; CHECK-NEXT:    [[UMAX1:%.*]] = call i32 @llvm.umax.i32(i32 [[TMP1]], i32 4)
+; CHECK-NEXT:    [[TMP2:%.*]] = add i32 [[UMAX1]], 1
+; CHECK-NEXT:    [[TMP3:%.*]] = sub i32 [[TMP2]], [[TMP1]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[START]], %[[ENTRY]] ], [ [[ADD:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP4:%.*]] = trunc i32 [[START]] to i16
+; CHECK-NEXT:    [[TMP5:%.*]] = zext i16 [[TMP4]] to i32
+; CHECK-NEXT:    [[UMAX:%.*]] = call i32 @llvm.umax.i32(i32 [[TMP5]], i32 4)
+; CHECK-NEXT:    [[TMP6:%.*]] = sub i32 [[UMAX]], [[TMP5]]
+; CHECK-NEXT:    [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
+; CHECK-NEXT:    [[TMP8:%.*]] = add i16 [[TMP4]], [[TMP7]]
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp ult i16 [[TMP8]], [[TMP4]]
+; CHECK-NEXT:    [[TMP10:%.*]] = icmp ugt i32 [[TMP6]], 65535
+; CHECK-NEXT:    [[TMP11:%.*]] = or i1 [[TMP9]], [[TMP10]]
+; CHECK-NEXT:    [[IDENT_CHECK:%.*]] = icmp ne i32 [[START]], [[TMP5]]
+; CHECK-NEXT:    [[TMP12:%.*]] = or i1 [[TMP11]], [[IDENT_CHECK]]
+; CHECK-NEXT:    br i1 [[TMP12]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i32 [[TMP3]], 63
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 64
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TRIP_COUNT_MINUS_1:%.*]] = sub i32 [[TMP3]], 1
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <64 x i32> poison, i32 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <64 x i32> [[BROADCAST_SPLATINSERT]], <64 x i32> poison, <64 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <64 x i32> poison, i32 [[START]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT3:%.*]] = shufflevector <64 x i32> [[BROADCAST_SPLATINSERT2]], <64 x i32> poison, <64 x i32> zeroinitializer
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nuw nsw <64 x i32> [[BROADCAST_SPLAT3]], <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <64 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND4:%.*]] = phi <64 x i32> [ <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[IV:%.*]] = add i32 [[START]], [[INDEX]]
+; CHECK-NEXT:    [[TMP14:%.*]] = icmp ule <64 x i32> [[VEC_IND4]], [[BROADCAST_SPLAT]]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV]]
-; CHECK-NEXT:    store i32 [[IV]], ptr [[GEP]], align 4
-; CHECK-NEXT:    [[CONV:%.*]] = and i32 [[IV]], 65535
+; CHECK-NEXT:    call void @llvm.masked.store.v64i32.p0(<64 x i32> [[VEC_IND]], ptr align 4 [[GEP]], <64 x i1> [[TMP14]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 64
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <64 x i32> [[VEC_IND]], splat (i32 64)
+; CHECK-NEXT:    [[VEC_IND_NEXT5]] = add nuw <64 x i32> [[VEC_IND4]], splat (i32 64)
+; CHECK-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i32 [ [[START]], %[[SCALAR_PH]] ], [ [[ADD:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV1]]
+; CHECK-NEXT:    store i32 [[IV1]], ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[CONV:%.*]] = and i32 [[IV1]], 65535
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[CONV]], 4
 ; CHECK-NEXT:    [[ADD]] = add nuw nsw i32 [[CONV]], 1
-; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP1]], label %[[EXIT]], !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:       [[EXIT]]:
 ; CHECK-NEXT:    ret void
 ;
@@ -277,32 +323,30 @@ exit:
 define void @can_prove_scev_predicate_is_always_true(ptr %dst) {
 ; CHECK-LABEL: define void @can_prove_scev_predicate_is_always_true(
 ; CHECK-SAME: ptr [[DST:%.*]]) #[[ATTR2]] {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[ADD:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV]]
-; CHECK-NEXT:    store i32 [[IV]], ptr [[GEP]], align 4
-; CHECK-NEXT:    [[CONV:%.*]] = and i32 [[IV]], 65535
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[CONV]], 4
-; CHECK-NEXT:    [[ADD]] = add nuw nsw i32 [[CONV]], 1
-; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    call void @llvm.masked.store.v64i32.p0(<64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, ptr align 4 [[DST]], <64 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>)
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT1:.*]]
+; CHECK:       [[EXIT1]]:
 ; CHECK-NEXT:    ret void
 ;
 ; AUTOVF-LABEL: define void @can_prove_scev_predicate_is_always_true(
 ; AUTOVF-SAME: ptr [[DST:%.*]]) #[[ATTR2]] {
-; AUTOVF-NEXT:  [[ENTRY:.*]]:
+; AUTOVF-NEXT:  [[ENTRY:.*:]]
 ; AUTOVF-NEXT:    br label %[[LOOP:.*]]
 ; AUTOVF:       [[LOOP]]:
-; AUTOVF-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[ADD:%.*]], %[[LOOP]] ]
-; AUTOVF-NEXT:    [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV]]
-; AUTOVF-NEXT:    store i32 [[IV]], ptr [[GEP]], align 4
-; AUTOVF-NEXT:    [[CONV:%.*]] = and i32 [[IV]], 65535
-; AUTOVF-NEXT:    [[CMP:%.*]] = icmp ult i32 [[CONV]], 4
-; AUTOVF-NEXT:    [[ADD]] = add nuw nsw i32 [[CONV]], 1
-; AUTOVF-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; AUTOVF-NEXT:    br label %[[EXIT:.*]]
 ; AUTOVF:       [[EXIT]]:
+; AUTOVF-NEXT:    call void @llvm.masked.store.v8i32.p0(<8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, ptr align 4 [[DST]], <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false>)
+; AUTOVF-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; AUTOVF:       [[MIDDLE_BLOCK]]:
+; AUTOVF-NEXT:    br label %[[EXIT1:.*]]
+; AUTOVF:       [[EXIT1]]:
 ; AUTOVF-NEXT:    ret void
 ;
 entry:
@@ -351,7 +395,7 @@ define void @tail_folded_store_avx512(ptr %start, ptr %end) #3 {
 ; CHECK-NEXT:    [[PTR_IND]] = getelementptr i8, ptr [[POINTER_PHI]], i32 -4608
 ; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw <64 x i32> [[VEC_IV]], splat (i32 64)
 ; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; CHECK-NEXT:    br i1 [[TMP7]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK-NEXT:    br i1 [[TMP7]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
 ; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
diff --git a/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll b/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll
index 7d4c1d35ffc9b..2e53a230f9de2 100644
--- a/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll
+++ b/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll
@@ -8,17 +8,15 @@ target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 ; trip count (which implies opt for size).
 define void @func_34() {
 ; CHECK-LABEL: define void @func_34() {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_...
[truncated]

david-arm · 2026-05-13T09:11:21Z

Does this conflict with another existing PR in the same area? See #195823

david-arm · 2026-05-13T09:16:19Z

 ; CHECK-LABEL: define void @scev_predicate_no_vec(
 ; CHECK-SAME: i32 [[START:%.*]], ptr [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]


Looks like this test is in the wrong file now, since it doesn't rely on optsize and will take a different code path?

david-arm · 2026-05-13T09:17:45Z

-; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    call void @llvm.masked.store.v64i32.p0(<64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, ptr align 4 [[DST]], <64 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>)


This is an unusual case. I would expect the performance to be worse with vectorisation. Do you know what's happening here?

david-arm · 2026-05-13T09:20:41Z

-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[SEXT:%.*]] = shl i32 [[IV]], 16
-; CHECK-NEXT:    [[STEP:%.*]] = ashr exact i32 [[SEXT]], 16
-; CHECK-NEXT:    [[IV_NEXT]] = add nsw i32 [[STEP]], 1


Hmm, the test didn't seem very robust before. The loop has been entirely dead-coded away. Perhaps worth changing the test to at least have a store in it?

fhahn

> This patch makes CM_EpilogueNotAllowedLowTripLoop rely directly on the minimum profitable trip count check.

It looks like none of the tests actually generate a minimum profitable trip count check. It would be good to at least add such a test (possibly one where the trip count is known to be low but not constant).
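As a source-level illustration of "low but not constant", the following hypothetical kernel has a trip count bounded above by 7 without being a compile-time constant, and its pointers may alias, so a vectorizer would need runtime overlap checks. The function name `addOneBounded` and the masking trick are assumptions for illustration only. An actual regression test would be written in LLVM IR, and whether the vectorizer's trip count estimate classifies this loop as "very small" is not confirmed here.

```cpp
#include <cstddef>

// Hypothetical kernel: the trip count is `n & 7`, so it is at most 7 but is
// only known at run time. `dst` and `src` are not marked restrict, so a
// vectorized version would have to guard against overlap with runtime checks.
void addOneBounded(int *dst, const int *src, unsigned n) {
  unsigned tc = n & 7u; // bounded, non-constant trip count
  for (unsigned i = 0; i < tc; ++i)
    dst[i] = src[i] + 1;
}
```

A loop of this shape is where a runtime minimum-trip-count guard would be expected to appear in the generated code, since the bound cannot be resolved at compile time.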

fhahn · 2026-05-13T09:25:16Z

-; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    call void @llvm.masked.store.v64i32.p0(<64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, ptr align 4 [[DST]], <64 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>)


It looks very odd to pick such a large VF in that case, given we are able to determine that there are only 4 iterations
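The concern about the large VF can be quantified with a small sketch. The mask in the quoted CHECK line has five active lanes out of 64, and the AUTOVF variant has five out of 8. The helpers below are hypothetical (`firstIterationMask`, `activeLanes`) and simply rebuild a tail-folding mask of the form "lane index <= trip count - 1", the same comparison style as the `icmp ule` against the broadcast trip-count-minus-one value in the optsize.ll CHECK lines.

```cpp
#include <cstdint>
#include <vector>

// Build the tail-folding mask for the first vector iteration: lane i is
// active when i <= TripCount - 1. A zero trip count yields an all-false mask.
std::vector<bool> firstIterationMask(uint64_t TripCount, uint64_t VF) {
  std::vector<bool> Mask(VF, false);
  for (uint64_t Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = TripCount != 0 && Lane <= TripCount - 1;
  return Mask;
}

// Count the lanes that actually perform a store.
uint64_t activeLanes(const std::vector<bool> &Mask) {
  uint64_t Count = 0;
  for (bool Active : Mask)
    Count += Active ? 1 : 0;
  return Count;
}
```

With five iterations, a VF of 64 leaves 59 lanes masked off, whereas a VF of 8 leaves only three idle, which is why the VF choice looks surprising.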

Stylie777 · 2026-05-14T09:26:00Z

> Does this conflict with another existing PR in the same area? See #195823

From taking a look, my understanding is that it does not. The changes here will still allow the cases I am supporting in #195823 to vectorise.

hassnaaHamdi · 2026-05-16T01:03:23Z

+      // For loops with very small trip counts, require that no scalar epilogue
+      // is needed. Tail-folded loops are still allowed as they avoid epilogues.
      if (SEL != CM_EpilogueNotNeededFoldTail)
        SEL = CM_EpilogueNotAllowedLowTripLoop;


Hi,
I think if the user is explicitly using dont-fold-tail, it doesn't make sense to override it and do the opposite.
I know this is old behaviour, but I only noticed it while looking at this patch.
I think the dont-fold-tail case should be treated as CM_EpilogueRequired, and that should be taken into account in the requiresScalarEpilogue(..) function; otherwise there is no difference between using the flag to prevent tail-folding and not using it. Right?

@Mel-Chen @david-arm @fhahn
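The behaviour being questioned can be sketched as a standalone function. This is a simplified model of the quoted hunk in processLoop, not the real code: `EpiloguePolicy`, its `Allowed` enumerator, and `selectPolicy` are hypothetical names. The diff does not show how a dont-fold-tail request is represented, so here it is assumed to arrive as some policy other than the fold-tail one.

```cpp
#include <cstdint>
#include <optional>

// Simplified stand-in for the vectorizer's epilogue-lowering enum. Only the
// two enumerators used in the quoted hunk are modelled, plus a hypothetical
// default called Allowed.
enum class EpiloguePolicy {
  Allowed,               // hypothetical default policy
  NotAllowedLowTripLoop, // models CM_EpilogueNotAllowedLowTripLoop
  NotNeededFoldTail,     // models CM_EpilogueNotNeededFoldTail
};

// Mirrors the shape of the quoted logic: for a loop whose expected trip count
// is below the threshold, and unless vectorization is forced, every policy
// except fold-tail is overwritten with NotAllowedLowTripLoop.
EpiloguePolicy selectPolicy(EpiloguePolicy Current,
                            std::optional<uint64_t> ExpectedTC,
                            uint64_t TinyThreshold, bool ForceVectorize) {
  bool IsLowTripCount = ExpectedTC.has_value() && *ExpectedTC < TinyThreshold;
  if (!IsLowTripCount || ForceVectorize)
    return Current;
  if (Current != EpiloguePolicy::NotNeededFoldTail)
    return EpiloguePolicy::NotAllowedLowTripLoop;
  return Current;
}
```

In this model, any incoming policy that is not the fold-tail one is replaced for low trip count loops, so a user preference that maps to such a policy would be lost at this point, which is the override described above.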

hassnaaHamdi · 2026-05-16T01:32:20Z

  ret i32 0
 }

 ; If trip-count is equal to 10, the function is vectorised when predicated tail folding is chosen


If we go ahead with the current behaviour, the comment needs updating. I also think the check prefix needs to change, as it is now testing tail-folding for a low-TC loop, right?

hassnaaHamdi · 2026-05-16T01:32:57Z

 ; CHECK-LABEL: @foo_mid_trip_count(
 ; PREDICATED: vector.body
-; SCALAR-NOT: vector.body
+; SCALAR: vector.body


Only checking vector.body doesn't show if the epilogue got folded or not.

Remove the hard limitation of low constant trip count for runtime check

83e3094

Mel-Chen requested review from alexey-bataev, davemgreen, fhahn and igogo-x86 May 13, 2026 09:03

llvmorg-github-actions Bot added vectorizers llvm:transforms labels May 13, 2026

david-arm requested a review from hassnaaHamdi May 13, 2026 09:09

david-arm requested a review from Stylie777 May 13, 2026 09:11

david-arm reviewed May 13, 2026

fhahn reviewed May 13, 2026

hassnaaHamdi reviewed May 16, 2026
