Skip to content

[LV] Allow runtime checks for low trip count loops without scalar epilogues#197391

Open
Mel-Chen wants to merge 1 commit into
llvm:mainfrom
Mel-Chen:remove-LCTC-RTC-limit
Open

[LV] Allow runtime checks for low trip count loops without scalar epilogues#197391
Mel-Chen wants to merge 1 commit into
llvm:mainfrom
Mel-Chen:remove-LCTC-RTC-limit

Conversation

@Mel-Chen

Copy link
Copy Markdown
Contributor

Previously, under CM_EpilogueNotAllowedLowTripLoop, any loop with a trip count below TinyTripCountVectorThreshold that required runtime checks was barred from vectorization. However, vectorization should still be considered profitable as long as the runtime check overhead does not exceed the actual vectorization gains.

This patch makes CM_EpilogueNotAllowedLowTripLoop rely directly on the minimum profitable trip count check. After this patch, vectorization is only prohibited if the trip count is below the minimum profitable trip count. CM_EpilogueNotAllowedLowTripLoop will focuse on preventing scalar epilogue generation rather than restricting runtime checks.

TODO: Refine the cost calculation for scalar epilogues to eventually remove CM_EpilogueNotAllowedLowTripLoop.

This change was suggested by Alexey Bataev in #176754 .

@llvmorg-github-actions

llvmorg-github-actions Bot commented May 13, 2026

Copy link
Copy Markdown

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Mel Chen (Mel-Chen)

Changes

Previously, under CM_EpilogueNotAllowedLowTripLoop, any loop with a trip count below TinyTripCountVectorThreshold that required runtime checks was barred from vectorization. However, vectorization should still be considered profitable as long as the runtime check overhead does not exceed the actual vectorization gains.

This patch makes CM_EpilogueNotAllowedLowTripLoop rely directly on the minimum profitable trip count check. After this patch, vectorization is only prohibited if the trip count is below the minimum profitable trip count. CM_EpilogueNotAllowedLowTripLoop will focuse on preventing scalar epilogue generation rather than restricting runtime checks.

TODO: Refine the cost calculation for scalar epilogues to eventually remove CM_EpilogueNotAllowedLowTripLoop.

This change was suggested by Alexey Bataev in #176754 .


Patch is 20.64 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/197391.diff

6 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+12-20)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll (+32-3)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/optsize.ll (+66-22)
  • (modified) llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll (+6-8)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 9fec3b74d9630..02993d0135c14 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -187,12 +187,11 @@ static cl::opt<unsigned> EpilogueVectorizationMinVF(
              "the specified value are considered for epilogue vectorization."));
 
 /// Loops with a known constant trip count below this number are vectorized only
-/// if no scalar iteration overheads are incurred.
+/// if no scalar epilogue is required.
 static cl::opt<unsigned> TinyTripCountVectorThreshold(
     "vectorizer-min-trip-count", cl::init(16), cl::Hidden,
     cl::desc("Loops with a constant trip count that is smaller than this "
-             "value are vectorized only if no scalar iteration overheads "
-             "are incurred."));
+             "value are vectorized only if no scalar epilogue is required"));
 
 static cl::opt<unsigned> VectorizeMemoryCheckThreshold(
     "vectorize-memory-check-threshold", cl::init(128), cl::Hidden,
@@ -795,10 +794,9 @@ enum EpilogueLowering {
   // Vectorization with OptForSize: don't allow epilogues.
   CM_EpilogueNotAllowedOptSize,
 
-  // A special case of vectorisation with OptForSize: loops with a very small
-  // trip count are considered for vectorization under OptForSize, thereby
-  // making sure the cost of their loop body is dominant, free of runtime
-  // guards and scalar iteration overheads.
+  // A special case for loops with a very small trip count: they are vectorized
+  // only if no scalar epilogue is required, ensuring the vectorized loop body
+  // is the dominant cost without scalar epilogue overheads.
   CM_EpilogueNotAllowedLowTripLoop,
 
   // Loop hint indicating an epilogue is undesired, apply tail folding.
@@ -2929,13 +2927,11 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
                       << "vector loop.\n");
     break;
   case CM_EpilogueNotAllowedLowTripLoop:
-    // fallthrough as a special case of OptForSize
+    LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to low trip "
+                      << "count.\n");
+    break;
   case CM_EpilogueNotAllowedOptSize:
-    if (EpilogueLoweringStatus == CM_EpilogueNotAllowedOptSize)
-      LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to -Os/-Oz.\n");
-    else
-      LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to low trip "
-                        << "count.\n");
+    LLVM_DEBUG(dbgs() << "LV: Not allowing epilogue due to -Os/-Oz.\n");
 
     // Bail if runtime checks are required, which are not good when optimising
     // for size.
@@ -8095,17 +8091,13 @@ bool LoopVectorizePass::processLoop(Loop *L) {
       ExpectedTC->getFixedValue() < TinyTripCountVectorThreshold) {
     LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
                       << "This loop is worth vectorizing only if no scalar "
-                      << "iteration overheads are incurred.");
+                      << "epilogue is required.");
     if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
       LLVM_DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
     else {
       LLVM_DEBUG(dbgs() << "\n");
-      // Tail-folded loops are efficient even when the loop
-      // iteration count is low. However, setting the epilogue policy to
-      // `CM_EpilogueNotAllowedLowTripLoop` prevents vectorizing loops
-      // with runtime checks. It's more effective to let
-      // `isOutsideLoopWorkProfitable` determine if vectorization is
-      // beneficial for the loop.
+      // For loops with very small trip counts, require that no scalar epilogue
+      // is needed. Tail-folded loops are still allowed as they avoid epilogues.
       if (SEL != CM_EpilogueNotNeededFoldTail)
         SEL = CM_EpilogueNotAllowedLowTripLoop;
     }
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
index 3ea4e7fd33359..a7eb6aedbbc58 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll
@@ -17,8 +17,8 @@ target triple = "aarch64-unknown-linux-gnu"
 ; DEBUG-VS2: Main Loop VF:vscale x 8, Main Loop UF:1, Epilogue Loop VF:8, Epilogue Loop UF:1
 
 ; DEBUG-LABEL: LV: Checking a loop in 'trip_count_too_small'
-; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
-; DEBUG: LV: Not vectorizing: Runtime SCEV check is required with -Os/-Oz.
+; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar epilogue is required.
+; DEBUG: LV: Not vectorizing: The trip count is below the minial threshold value..
 
 ; DEBUG-LABEL: LV: Checking a loop in 'too_many_runtime_checks'
 ; DEBUG: LV: Found trip count: 0
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll b/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll
index f4a1814efc038..91e211847b03d 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/pr36032.ll
@@ -14,6 +14,7 @@ define void @_Z1dv() {
 ; CHECK-LABEL: @_Z1dv(
 ; CHECK-NEXT:  entry:
 ; CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @"_ZN3$_01aEv"(ptr nonnull @b)
+; CHECK-NEXT:    [[CALL1:%.*]] = ptrtoaddr ptr [[CALL]] to i64
 ; CHECK-NEXT:    br label [[FOR_COND:%.*]]
 ; CHECK:       for.cond:
 ; CHECK-NEXT:    [[F_0:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD5:%.*]], [[FOR_COND_CLEANUP:%.*]] ]
@@ -23,15 +24,43 @@ define void @_Z1dv() {
 ; CHECK-NEXT:    br i1 [[CMP12]], label [[FOR_BODY_LR_PH:%.*]], label [[FOR_COND_CLEANUP]]
 ; CHECK:       for.body.lr.ph:
 ; CHECK-NEXT:    [[TMP0:%.*]] = zext i32 [[G_0]] to i64
+; CHECK-NEXT:    [[TMP12:%.*]] = sub i64 4, [[TMP0]]
 ; CHECK-NEXT:    br label [[FOR_BODY:%.*]]
+; CHECK:       vector.memcheck:
+; CHECK-NEXT:    [[TMP13:%.*]] = add i64 [[CALL1]], [[TMP0]]
+; CHECK-NEXT:    [[TMP3:%.*]] = add i32 [[G_0]], [[CONV]]
+; CHECK-NEXT:    [[TMP4:%.*]] = zext nneg i32 [[TMP3]] to i64
+; CHECK-NEXT:    [[TMP5:%.*]] = add i64 [[TMP4]], ptrtoaddr (ptr @c to i64)
+; CHECK-NEXT:    [[TMP6:%.*]] = sub i64 [[TMP13]], [[TMP5]]
+; CHECK-NEXT:    [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP6]], 4
+; CHECK-NEXT:    br i1 [[DIFF_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:       vector.ph:
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[TMP12]], 4
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[TMP12]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP7:%.*]] = add i64 [[TMP0]], [[N_VEC]]
+; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
+; CHECK:       vector.body:
+; CHECK-NEXT:    [[TMP8:%.*]] = add i32 [[CONV]], [[G_0]]
+; CHECK-NEXT:    [[TMP9:%.*]] = zext i32 [[TMP8]] to i64
+; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds [6 x i8], ptr @c, i64 0, i64 [[TMP9]]
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP10]], align 1
+; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i8, ptr [[CALL]], i64 [[TMP0]]
+; CHECK-NEXT:    store <4 x i8> [[WIDE_LOAD]], ptr [[TMP11]], align 1
+; CHECK-NEXT:    br label [[MIDDLE_BLOCK:%.*]]
+; CHECK:       middle.block:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[TMP12]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
+; CHECK:       scalar.ph:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP7]], [[MIDDLE_BLOCK]] ], [ [[TMP0]], [[FOR_BODY]] ]
+; CHECK-NEXT:    br label [[FOR_BODY1:%.*]]
 ; CHECK:       for.cond.cleanup.loopexit:
 ; CHECK-NEXT:    br label [[FOR_COND_CLEANUP]]
 ; CHECK:       for.cond.cleanup:
-; CHECK-NEXT:    [[G_1_LCSSA]] = phi i32 [ [[G_0]], [[FOR_COND]] ], [ 4, [[FOR_COND_CLEANUP_LOOPEXIT:%.*]] ]
+; CHECK-NEXT:    [[G_1_LCSSA]] = phi i32 [ [[G_0]], [[FOR_COND]] ], [ 4, [[FOR_COND_CLEANUP_LOOPEXIT]] ]
 ; CHECK-NEXT:    [[ADD5]] = add nuw nsw i32 [[CONV]], 4
 ; CHECK-NEXT:    br label [[FOR_COND]]
 ; CHECK:       for.body:
-; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[TMP0]], [[FOR_BODY_LR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY1]] ]
 ; CHECK-NEXT:    [[TMP1:%.*]] = trunc i64 [[INDVARS_IV]] to i32
 ; CHECK-NEXT:    [[ADD:%.*]] = add i32 [[CONV]], [[TMP1]]
 ; CHECK-NEXT:    [[IDXPROM:%.*]] = zext i32 [[ADD]] to i64
@@ -41,7 +70,7 @@ define void @_Z1dv() {
 ; CHECK-NEXT:    store i8 [[TMP2]], ptr [[ARRAYIDX3]], align 1
 ; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
 ; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 4
-; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]]
+; CHECK-NEXT:    br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY1]], !llvm.loop [[LOOP0:![0-9]+]]
 ;
 entry:
   %call = tail call ptr @"_ZN3$_01aEv"(ptr nonnull @b) #2
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll b/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll
index 85dd5898b26dd..11b244e388c4b 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-trip-count-decisions.ll
@@ -57,7 +57,7 @@ for.end:
 define i32 @foo_mid_trip_count(ptr %a, ptr %b, ptr %c, i32 %bound) {
 ; CHECK-LABEL: @foo_mid_trip_count(
 ; PREDICATED: vector.body
-; SCALAR-NOT: vector.body
+; SCALAR: vector.body
 entry:
   br label %for.body
 
diff --git a/llvm/test/Transforms/LoopVectorize/X86/optsize.ll b/llvm/test/Transforms/LoopVectorize/X86/optsize.ll
index 8460af9e6aa25..2bb15492a29a6 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/optsize.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/optsize.ll
@@ -230,16 +230,62 @@ attributes #2 = { optsize }
 define void @scev_predicate_no_vec(i32 %start, ptr %dst) {
 ; CHECK-LABEL: define void @scev_predicate_no_vec(
 ; CHECK-SAME: i32 [[START:%.*]], ptr [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[TMP0:%.*]] = trunc i32 [[START]] to i16
+; CHECK-NEXT:    [[TMP1:%.*]] = zext i16 [[TMP0]] to i32
+; CHECK-NEXT:    [[UMAX1:%.*]] = call i32 @llvm.umax.i32(i32 [[TMP1]], i32 4)
+; CHECK-NEXT:    [[TMP2:%.*]] = add i32 [[UMAX1]], 1
+; CHECK-NEXT:    [[TMP3:%.*]] = sub i32 [[TMP2]], [[TMP1]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ [[START]], %[[ENTRY]] ], [ [[ADD:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[TMP4:%.*]] = trunc i32 [[START]] to i16
+; CHECK-NEXT:    [[TMP5:%.*]] = zext i16 [[TMP4]] to i32
+; CHECK-NEXT:    [[UMAX:%.*]] = call i32 @llvm.umax.i32(i32 [[TMP5]], i32 4)
+; CHECK-NEXT:    [[TMP6:%.*]] = sub i32 [[UMAX]], [[TMP5]]
+; CHECK-NEXT:    [[TMP7:%.*]] = trunc i32 [[TMP6]] to i16
+; CHECK-NEXT:    [[TMP8:%.*]] = add i16 [[TMP4]], [[TMP7]]
+; CHECK-NEXT:    [[TMP9:%.*]] = icmp ult i16 [[TMP8]], [[TMP4]]
+; CHECK-NEXT:    [[TMP10:%.*]] = icmp ugt i32 [[TMP6]], 65535
+; CHECK-NEXT:    [[TMP11:%.*]] = or i1 [[TMP9]], [[TMP10]]
+; CHECK-NEXT:    [[IDENT_CHECK:%.*]] = icmp ne i32 [[START]], [[TMP5]]
+; CHECK-NEXT:    [[TMP12:%.*]] = or i1 [[TMP11]], [[IDENT_CHECK]]
+; CHECK-NEXT:    br i1 [[TMP12]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i32 [[TMP3]], 63
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 64
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TRIP_COUNT_MINUS_1:%.*]] = sub i32 [[TMP3]], 1
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <64 x i32> poison, i32 [[TRIP_COUNT_MINUS_1]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <64 x i32> [[BROADCAST_SPLATINSERT]], <64 x i32> poison, <64 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <64 x i32> poison, i32 [[START]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT3:%.*]] = shufflevector <64 x i32> [[BROADCAST_SPLATINSERT2]], <64 x i32> poison, <64 x i32> zeroinitializer
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add nuw nsw <64 x i32> [[BROADCAST_SPLAT3]], <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <64 x i32> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND4:%.*]] = phi <64 x i32> [ <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[IV:%.*]] = add i32 [[START]], [[INDEX]]
+; CHECK-NEXT:    [[TMP14:%.*]] = icmp ule <64 x i32> [[VEC_IND4]], [[BROADCAST_SPLAT]]
 ; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV]]
-; CHECK-NEXT:    store i32 [[IV]], ptr [[GEP]], align 4
-; CHECK-NEXT:    [[CONV:%.*]] = and i32 [[IV]], 65535
+; CHECK-NEXT:    call void @llvm.masked.store.v64i32.p0(<64 x i32> [[VEC_IND]], ptr align 4 [[GEP]], <64 x i1> [[TMP14]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 64
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw nsw <64 x i32> [[VEC_IND]], splat (i32 64)
+; CHECK-NEXT:    [[VEC_IND_NEXT5]] = add nuw <64 x i32> [[VEC_IND4]], splat (i32 64)
+; CHECK-NEXT:    [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP16]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT:.*]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    br label %[[LOOP1:.*]]
+; CHECK:       [[LOOP1]]:
+; CHECK-NEXT:    [[IV1:%.*]] = phi i32 [ [[START]], %[[SCALAR_PH]] ], [ [[ADD:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT:    [[GEP1:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV1]]
+; CHECK-NEXT:    store i32 [[IV1]], ptr [[GEP1]], align 4
+; CHECK-NEXT:    [[CONV:%.*]] = and i32 [[IV1]], 65535
 ; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[CONV]], 4
 ; CHECK-NEXT:    [[ADD]] = add nuw nsw i32 [[CONV]], 1
-; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP1]], label %[[EXIT]], !llvm.loop [[LOOP6:![0-9]+]]
 ; CHECK:       [[EXIT]]:
 ; CHECK-NEXT:    ret void
 ;
@@ -277,32 +323,30 @@ exit:
 define void @can_prove_scev_predicate_is_always_true(ptr %dst) {
 ; CHECK-LABEL: define void @can_prove_scev_predicate_is_always_true(
 ; CHECK-SAME: ptr [[DST:%.*]]) #[[ATTR2]] {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[ADD:%.*]], %[[LOOP]] ]
-; CHECK-NEXT:    [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV]]
-; CHECK-NEXT:    store i32 [[IV]], ptr [[GEP]], align 4
-; CHECK-NEXT:    [[CONV:%.*]] = and i32 [[IV]], 65535
-; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[CONV]], 4
-; CHECK-NEXT:    [[ADD]] = add nuw nsw i32 [[CONV]], 1
-; CHECK-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    call void @llvm.masked.store.v64i32.p0(<64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, ptr align 4 [[DST]], <64 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>)
+; CHECK-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    br label %[[EXIT1:.*]]
+; CHECK:       [[EXIT1]]:
 ; CHECK-NEXT:    ret void
 ;
 ; AUTOVF-LABEL: define void @can_prove_scev_predicate_is_always_true(
 ; AUTOVF-SAME: ptr [[DST:%.*]]) #[[ATTR2]] {
-; AUTOVF-NEXT:  [[ENTRY:.*]]:
+; AUTOVF-NEXT:  [[ENTRY:.*:]]
 ; AUTOVF-NEXT:    br label %[[LOOP:.*]]
 ; AUTOVF:       [[LOOP]]:
-; AUTOVF-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[ADD:%.*]], %[[LOOP]] ]
-; AUTOVF-NEXT:    [[GEP:%.*]] = getelementptr inbounds i32, ptr [[DST]], i32 [[IV]]
-; AUTOVF-NEXT:    store i32 [[IV]], ptr [[GEP]], align 4
-; AUTOVF-NEXT:    [[CONV:%.*]] = and i32 [[IV]], 65535
-; AUTOVF-NEXT:    [[CMP:%.*]] = icmp ult i32 [[CONV]], 4
-; AUTOVF-NEXT:    [[ADD]] = add nuw nsw i32 [[CONV]], 1
-; AUTOVF-NEXT:    br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
+; AUTOVF-NEXT:    br label %[[EXIT:.*]]
 ; AUTOVF:       [[EXIT]]:
+; AUTOVF-NEXT:    call void @llvm.masked.store.v8i32.p0(<8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, ptr align 4 [[DST]], <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false>)
+; AUTOVF-NEXT:    br label %[[MIDDLE_BLOCK:.*]]
+; AUTOVF:       [[MIDDLE_BLOCK]]:
+; AUTOVF-NEXT:    br label %[[EXIT1:.*]]
+; AUTOVF:       [[EXIT1]]:
 ; AUTOVF-NEXT:    ret void
 ;
 entry:
@@ -351,7 +395,7 @@ define void @tail_folded_store_avx512(ptr %start, ptr %end) #3 {
 ; CHECK-NEXT:    [[PTR_IND]] = getelementptr i8, ptr [[POINTER_PHI]], i32 -4608
 ; CHECK-NEXT:    [[VEC_IND_NEXT]] = add nuw <64 x i32> [[VEC_IV]], splat (i32 64)
 ; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; CHECK-NEXT:    br i1 [[TMP7]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
+; CHECK-NEXT:    br i1 [[TMP7]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
 ; CHECK:       [[MIDDLE_BLOCK]]:
 ; CHECK-NEXT:    br label %[[EXIT:.*]]
 ; CHECK:       [[EXIT]]:
diff --git a/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll b/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll
index 7d4c1d35ffc9b..2e53a230f9de2 100644
--- a/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll
+++ b/llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll
@@ -8,17 +8,15 @@ target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 ; trip count (which implies opt for size).
 define void @func_34() {
 ; CHECK-LABEL: define void @func_34() {
-; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:  [[ENTRY:.*:]]
 ; CHECK-NEXT:    br label %[[LOOP:.*]]
 ; CHECK:       [[LOOP]]:
-; CHECK-NEXT:    [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_...
[truncated]

@david-arm david-arm requested a review from hassnaaHamdi May 13, 2026 09:09
@david-arm

Copy link
Copy Markdown
Contributor

Does this conflict with another existing PR in the same area? See #195823

@david-arm david-arm requested a review from Stylie777 May 13, 2026 09:11
; CHECK-LABEL: define void @scev_predicate_no_vec(
; CHECK-SAME: i32 [[START:%.*]], ptr [[DST:%.*]]) #[[ATTR2:[0-9]+]] {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[ENTRY:.*:]]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this test is in the wrong file now, since it doesn't rely on optsize and will take a different code path?

; CHECK-NEXT: br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
; CHECK-NEXT: br label %[[EXIT:.*]]
; CHECK: [[EXIT]]:
; CHECK-NEXT: call void @llvm.masked.store.v64i32.p0(<64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, ptr align 4 [[DST]], <64 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unusual case. I would expect the performance to be worse with vectorisation. Do you know what's happening here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks very odd to pick such a large VF in that case, given we are able to determine that there are only 4 iterations

; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
; CHECK-NEXT: [[SEXT:%.*]] = shl i32 [[IV]], 16
; CHECK-NEXT: [[STEP:%.*]] = ashr exact i32 [[SEXT]], 16
; CHECK-NEXT: [[IV_NEXT]] = add nsw i32 [[STEP]], 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the test didn't seem very robust before. The loop has been entirely dead-coded away. Perhaps worth changing the test to at least have a store in it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fhahn fhahn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch makes CM_EpilogueNotAllowedLowTripLoop rely directly on the minimum profitable trip count check.

It looks like none of the tests actually generate a minimum profitable trip count check. would be good to at least add such a test (possibly where the trip count is known to be low but not constant)

; CHECK-NEXT: br i1 [[CMP]], label %[[LOOP]], label %[[EXIT:.*]]
; CHECK-NEXT: br label %[[EXIT:.*]]
; CHECK: [[EXIT]]:
; CHECK-NEXT: call void @llvm.masked.store.v64i32.p0(<64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>, ptr align 4 [[DST]], <64 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks very odd to pick such a large VF in that case, given we are able to determine that there are only 4 iterations

@Stylie777

Copy link
Copy Markdown
Contributor

Does this conflict with another existing PR in the same area? See #195823

From taking a look and from my understanding, no. The changes here will still allow for the cases to vectorise I am supporting in 195823.

// For loops with very small trip counts, require that no scalar epilogue
// is needed. Tail-folded loops are still allowed as they avoid epilogues.
if (SEL != CM_EpilogueNotNeededFoldTail)
SEL = CM_EpilogueNotAllowedLowTripLoop;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
I think if the user is explicitly using dont-fold-tail, it doesn't make sense to override it and do the opposite thing.
I know this is an old behaviour, but I just realised it while looking at this patch.
I think the case of dont-fold-tail should be considered as CM_EpilogueRequired and that should be taken into account in requiresScalarEpilogue(..) function, otherwise no difference between using the flag to prevent tail-folding or not using it. Right ?

@Mel-Chen @david-arm @fhahn

ret i32 0
}

; If trip-count is equal to 10, the function is vectorised when predicated tail folding is chosen

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we will go on with the current behaviour, the comment needs updates. I think also the check-prefix needs to be changed as it's now testing tail-folding for low TC loop, right ?

; CHECK-LABEL: @foo_mid_trip_count(
; PREDICATED: vector.body
; SCALAR-NOT: vector.body
; SCALAR: vector.body

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only checking vector.body doesn't show if the epilogue got folded or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants