[LoopVectorize] Improve Vectorization of Low Trip Count Loops#195823
[LoopVectorize] Improve Vectorization of Low Trip Count Loops#195823Stylie777 wants to merge 14 commits into
Conversation
|
@llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-llvm-transforms Author: Jack Styles (Stylie777) ChangesCurrently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterproductive to performance, often failing to vectorize entirely. This is more prevalent with targets where To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a single vectorized itneration and a single scalar iteration. Later passes can then remove the loop's entirely. Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes. Assisted-by: Claude Sonnet 4.6/Codex Full diff: https://github.com/llvm/llvm-project/pull/195823.diff 3 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3d7f7a1649b16..b0ae53a016dc9 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3012,6 +3012,15 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
}
auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop);
+ auto ApplyVectorWidth = [](FixedScalableVFPair &MaxFactors,
+ unsigned int FixedVF, unsigned int ScalableVF) {
+ MaxFactors.FixedVF = ElementCount::getFixed(FixedVF);
+ MaxFactors.ScalableVF = ElementCount::getScalable(ScalableVF);
+ };
+ unsigned EffectiveIC = UserIC > 0 ? UserIC : 1;
+ auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC, unsigned int VF)-> bool {
+ return ExactTC.getFixedValue() == ((VF * EffectiveIC) + 1);
+ };
if (ExpectedTC && ExpectedTC->isFixed() &&
ExpectedTC->getFixedValue() <=
TTI.getMinTripCountTailFoldingThreshold()) {
@@ -3023,9 +3032,36 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
NoScalarEpilogueNeeded(MaxFactors.FixedVF.getFixedValue())) {
LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will "
"remain for any chosen VF.\n");
- MaxFactors.ScalableVF = ElementCount::getScalable(0);
+ ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);
return MaxFactors;
}
+ // Allow cases where the ExactTC == VF + 1. VF can be any power of
+ // 2 between 2 and MaxVF.
+ //
+ // This produces 1 vector iteration, and 1 scalar iteration with
+ // no remainder. Later passes will eliminate the loop and leave
+ // straight-line code as the both iteration counts are statically known.
+ ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop);
+ if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop &&
+ ExactTC && ExactTC.isFixed()) {
+ if (HasOneScalarIterationRemainder(ExactTC, MaxFactors.FixedVF.getFixedValue())) {
+ LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width with 1 scalar "
+ "iteration remainder.\n");
+ ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0);
+ return MaxFactors;
+ }
+ // If the maximum VF cannot produce 1 vector iteration + 1 scalar
+ // iteration, step down VF's to find one that can.
+ for (unsigned VF = MaxFactors.FixedVF.getFixedValue(); VF >= 2;
+ VF /= 2) {
+ if (HasOneScalarIterationRemainder(ExactTC, VF)) {
+ LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF
+ << " with 1 scalar iteration remainder.\n");
+ ApplyVectorWidth(MaxFactors, VF, 0);
+ return MaxFactors;
+ }
+ }
+ }
}
reportVectorizationFailure(
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
index c36daa40f6193..76642c63dbd26 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
@@ -56,22 +56,20 @@ exit:
define void @trip5_i8(ptr noalias nocapture noundef %dst, ptr noalias nocapture noundef readonly %src) #0 {
; CHECK-LABEL: define void @trip5_i8(
; CHECK-SAME: ptr noalias noundef captures(none) [[DST:%.*]], ptr noalias noundef readonly captures(none) [[SRC:%.*]]) #[[ATTR0]] {
-; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[ENTRY:.*:]]
; CHECK-NEXT: br label %[[LOOP:.*]]
; CHECK: [[LOOP]]:
-; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
-; CHECK-NEXT: [[GEP_SRC:%.*]] = getelementptr inbounds i8, ptr [[SRC]], i64 [[IV]]
-; CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[GEP_SRC]], align 1
-; CHECK-NEXT: [[MUL:%.*]] = shl i8 [[TMP0]], 1
-; CHECK-NEXT: [[GEP_DST:%.*]] = getelementptr inbounds i8, ptr [[DST]], i64 [[IV]]
-; CHECK-NEXT: [[TMP1:%.*]] = load i8, ptr [[GEP_DST]], align 1
-; CHECK-NEXT: [[ADD:%.*]] = add i8 [[MUL]], [[TMP1]]
-; CHECK-NEXT: store i8 [[ADD]], ptr [[GEP_DST]], align 1
-; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
-; CHECK-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 5
-; CHECK-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]]
+; CHECK-NEXT: br label %[[EXIT:.*]]
; CHECK: [[EXIT]]:
-; CHECK-NEXT: ret void
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[SRC]], align 1
+; CHECK-NEXT: [[TMP0:%.*]] = shl <4 x i8> [[WIDE_LOAD]], splat (i8 1)
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i8>, ptr [[DST]], align 1
+; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i8> [[TMP0]], [[WIDE_LOAD1]]
+; CHECK-NEXT: store <4 x i8> [[TMP1]], ptr [[DST]], align 1
+; CHECK-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[SCALAR_PH:.*]]
+; CHECK: [[SCALAR_PH]]:
;
entry:
br label %loop
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll
new file mode 100644
index 0000000000000..71157580334d1
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-small-trip-count-vf-plus-one.ll
@@ -0,0 +1,195 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6
+;
+; Test that a loop with trip count == VF + 1 is allowed to vectorize
+; on AArch64+SVE where getMinTripCountTailFoldingThreshold() returns 5. This
+; produces one vector iteration and one scalar iteration.
+;
+; RUN: opt -S -p loop-vectorize %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+; TC=5, VF=4: TC == MaxFixedVF + 1 (5 == 4 + 1).
+; The new code path should trigger: 1 vectorized iteration + 1 scalar iteration.
+define void @tc5_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc5_vf4_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[SCALAR_PH:.*:]]
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[A]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT: store <4 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[SCALAR_PH1:.*]]
+; CHECK: [[SCALAR_PH1]]:
+; CHECK-NEXT: br label %[[LOOP1:.*]]
+; CHECK: [[LOOP1]]:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 4, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT: [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT: [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT: [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT: store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 5
+; CHECK-NEXT: br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+ %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+ %val = load i32, ptr %gep.a, align 4
+ %add = add nsw i32 %val, 1
+ store i32 %add, ptr %gep.b, align 4
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 5
+ br i1 %exitcond, label %exit, label %loop
+exit:
+ ret void
+}
+
+; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1).
+; The forced interleave count should be considered when choosing VF.
+define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc5_forced_ic2_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[SCALAR_PH:.*:]]
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 2
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[A]], align 4
+; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <2 x i32>, ptr [[TMP0]], align 4
+; CHECK-NEXT: [[TMP1:%.*]] = add nsw <2 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT: [[TMP2:%.*]] = add nsw <2 x i32> [[WIDE_LOAD1]], splat (i32 1)
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 2
+; CHECK-NEXT: store <2 x i32> [[TMP1]], ptr [[B]], align 4
+; CHECK-NEXT: store <2 x i32> [[TMP2]], ptr [[TMP3]], align 4
+; CHECK-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[SCALAR_PH1:.*]]
+; CHECK: [[SCALAR_PH1]]:
+; CHECK-NEXT: br label %[[LOOP1:.*]]
+; CHECK: [[LOOP1]]:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 4, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT: [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT: [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT: [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT: store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 5
+; CHECK-NEXT: br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+ %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+ %val = load i32, ptr %gep.a, align 4
+ %add = add nsw i32 %val, 1
+ store i32 %add, ptr %gep.b, align 4
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 5
+ br i1 %exitcond, label %exit, label %loop, !llvm.loop !0
+exit:
+ ret void
+}
+
+; TC=3, VF=2: TC == FixedVF + 1 (3 == 2 + 1).
+; Should vectorize: 1 vector iteration of width 2, then 1 scalar iteration.
+define void @tc3_vf2_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc3_vf2_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[SCALAR_PH:.*:]]
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[A]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = add nsw <2 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT: store <2 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[SCALAR_PH1:.*]]
+; CHECK: [[SCALAR_PH1]]:
+; CHECK-NEXT: br label %[[LOOP1:.*]]
+; CHECK: [[LOOP1]]:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 2, %[[SCALAR_PH1]] ], [ [[IV_NEXT:%.*]], %[[LOOP1]] ]
+; CHECK-NEXT: [[GEP_A:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT: [[GEP_B:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[IV]]
+; CHECK-NEXT: [[VAL:%.*]] = load i32, ptr [[GEP_A]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[VAL]], 1
+; CHECK-NEXT: store i32 [[ADD]], ptr [[GEP_B]], align 4
+; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
+; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 3
+; CHECK-NEXT: br i1 [[EXITCOND]], label %[[EXIT:.*]], label %[[LOOP1]], !llvm.loop [[LOOP4:![0-9]+]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+ %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+ %val = load i32, ptr %gep.a, align 4
+ %add = add nsw i32 %val, 1
+ store i32 %add, ptr %gep.b, align 4
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 3
+ br i1 %exitcond, label %exit, label %loop
+exit:
+ ret void
+}
+
+; TC=4: exact multiple of VF=4. Vectorizes via the original
+; "no scalar epilogue needed" path -- NOT the new TC==VF+1 path.
+define void @tc4_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 {
+; CHECK-LABEL: define void @tc4_vf4_vectorize(
+; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: br label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[A]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], splat (i32 1)
+; CHECK-NEXT: store <4 x i32> [[TMP0]], ptr [[B]], align 4
+; CHECK-NEXT: br label %[[MIDDLE_BLOCK:.*]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[EXIT:.*]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ br label %loop
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %gep.a = getelementptr inbounds i32, ptr %a, i64 %iv
+ %gep.b = getelementptr inbounds i32, ptr %b, i64 %iv
+ %val = load i32, ptr %gep.a, align 4
+ %add = add nsw i32 %val, 1
+ store i32 %add, ptr %gep.b, align 4
+ %iv.next = add nuw nsw i64 %iv, 1
+ %exitcond = icmp eq i64 %iv.next, 4
+ br i1 %exitcond, label %exit, label %loop
+exit:
+ ret void
+}
+
+attributes #0 = { vscale_range(1,16) "target-features"="+sve" }
+
+!0 = distinct !{!0, !1}
+!1 = !{!"llvm.loop.interleave.count", i32 2}
|
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
david-arm
left a comment
There was a problem hiding this comment.
Thanks for this, seems like a useful improvement!
| ; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1 | ||
| ; CHECK-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], 5 | ||
| ; CHECK-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP]] | ||
| ; CHECK-NEXT: br label %[[EXIT:.*]] |
There was a problem hiding this comment.
The label looks wrong here. I think you need to regenerate the CHECK lines because it should be br label %[[VECTOR_BODY:.*]] or something like that.
There was a problem hiding this comment.
I actually think this test is now surplus to requirements on reflection, it was previously testing that vectors were not applied to this loop, but these changes allow vectorisation and a similar test is in see-small-trip-count-vf-plus-one.ll.
| LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will " | ||
| "remain for any chosen VF.\n"); | ||
| MaxFactors.ScalableVF = ElementCount::getScalable(0); | ||
| ApplyVectorWidth(MaxFactors, MaxFactors.FixedVF.getFixedValue(), 0); |
There was a problem hiding this comment.
nit: I'm not sure there is much value in this change to be honest, since it's now doing more work than before setting the fixed width VF unnecessarily. It's also inconsistent with the other places in this function that change the factors similarly, e.g.
MaxFactors.FixedVF = ElementCount::getFixed(1);
on line 3094 below. At the very least, we should use a consistent style throughout the function.
There was a problem hiding this comment.
I've removed this and gone back to the original behaviour.
| // straight-line code as the both iteration counts are statically known. | ||
| ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop); | ||
| if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop && | ||
| ExactTC && ExactTC.isFixed()) { |
There was a problem hiding this comment.
Is it possible for ExactTC to be scalable when ExpectedTC is known to be fixed? If so, it would be good to add a test for this. If not, then you should just be able to called ExactTC.isFixed() and it will assert if it's scalable.
There was a problem hiding this comment.
If ExpectedTC is fixed, then ExactTC will also be fixed. I've changed this to just check ExactTC.isFixed()
| unsigned EffectiveIC = UserIC > 0 ? UserIC : 1; | ||
| auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC, | ||
| unsigned int VF) -> bool { | ||
| return ExactTC.getFixedValue() == ((VF * EffectiveIC) + 1); |
There was a problem hiding this comment.
nit: I think you can just write this as 1 + (VF * EffectiveIC)
| }; | ||
| unsigned EffectiveIC = UserIC > 0 ? UserIC : 1; | ||
| auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC, | ||
| unsigned int VF) -> bool { |
There was a problem hiding this comment.
Strictly speaking the VF isn't known/hasn't been chosen yet. The name should probably be changed VF->MaxVF
| ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop); | ||
| if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop && | ||
| ExactTC && ExactTC.isFixed()) { | ||
| if (HasOneScalarIterationRemainder( |
There was a problem hiding this comment.
This looks like it's duplicating the first iteration of the for loop below. I think it can be deleted.
| if (HasOneScalarIterationRemainder(ExactTC, VF)) { | ||
| LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF | ||
| << " with 1 scalar iteration remainder.\n"); | ||
| ApplyVectorWidth(MaxFactors, VF, 0); |
There was a problem hiding this comment.
It would be good to add some comments explaining why you're completely discarding scalable VFs. I presume the reason for this is to ensure the loop gets deleted?
There was a problem hiding this comment.
Yes, we want to avoid any scalar iterations. For interleaving this may introduce loops, as seen in tc5_forced_ic2_vectorize, but the main aim is to avoid a scalar loop, so only allow a VF that fits TC + 1.
|
|
||
| ; TC=4: exact multiple of VF=4. Vectorizes via the original | ||
| ; "no scalar epilogue needed" path -- NOT the new TC==VF+1 path. | ||
| define void @tc4_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 { |
There was a problem hiding this comment.
I'm not sure we need this test. I'm pretty sure this is already covered by other tests. You can confirm this by looking for the PR that added the code around line 3032.
There was a problem hiding this comment.
Done. I expect it is.
|
|
||
| ; TC=3, VF=2: TC == FixedVF + 1 (3 == 2 + 1). | ||
| ; Should vectorize: 1 vector iteration of width 2, then 1 scalar iteration. | ||
| define void @tc3_vf2_vectorize(ptr noalias %a, ptr noalias %b) #0 { |
There was a problem hiding this comment.
Isn't this essentially a duplicate of tc5_vf4_vectorize? I think you can probably remove it.
There was a problem hiding this comment.
Done. It was testing the other code path, but as its now been consolidated into just the for loop, it can be removed.
There was a problem hiding this comment.
Perhaps worth adding a RISCV test too?
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed. |
| // This produces 1 vector iteration, and 1 scalar iteration with | ||
| // no remainder. Later passes will eliminate the loop and leave | ||
| // straight-line code as the both iteration counts are statically known. | ||
| ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop); |
There was a problem hiding this comment.
I think this is same TC calculated at the beginning of the function. I think you can use that one instead.
There was a problem hiding this comment.
I've run the tests with ExpectedTC and I would agree it's using the same value, so I have changed it to use ExpectedTC throughout.
There was a problem hiding this comment.
Is this true? From what I can tell getSmallConstantTripCount and getSmallBestKnownTC can be completely different. I can see getSmallBestKnownTC is invoked with the default parameters, which means that we could hit this part in the function:
// Check if upper bound estimate is known.
if (unsigned ExpectedTC = PSE.getSmallConstantMaxTripCount())
return ElementCount::getFixed(ExpectedTC);
Therefore, the value returned may correspond to the upper bound without there being an exact trip count. So I think we still have to use the result from getSmallConstantTripCount here, otherwise we're potentially making decisions based on profile data or the max trip count.
There was a problem hiding this comment.
They can be different yes, apologies I missed this. I am happy to go back to the original implementation.
fhahn
left a comment
There was a problem hiding this comment.
Could you update the title to clarify Small Loops == loops with low trip counts. Small loops can also mean small body (which I initially assumed)
| @@ -0,0 +1,113 @@ | |||
| ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-globals none --version 6 | |||
There was a problem hiding this comment.
Is SVE relevant here? Looks like none of the tests use scalable vectors
There was a problem hiding this comment.
SVE is relevant as it is the trigger for this code path (On AArch64, SVE changes the return value of getMinTripCountTailFoldingThreshold). So in this case we want to enable the feature.
|
|
||
| ; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1). | ||
| ; The forced interleave count should be considered when choosing VF. | ||
| define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 { |
There was a problem hiding this comment.
could you also more extreme cases, like VF = 2 with i64 and possibly a smaller type. Are those also profitable, especially if the loop body is small? Would possibly be good to cover various combinations with micro benchmarks.
There was a problem hiding this comment.
I have expanded the testing to have different types and VF=2 for i64.
In terms of microbenchmarks, I got Codex to generate me one which just made a small loop for that datatype, and this change shows significant speedup for vectorized vs Scalar for i8, i16 and i32. i64 does not how the same results. i64 with a TC of 5 is not vectorising as the Loop Vectoriser rejects it at a later stage, but at TC3 we are seeing a significant improvement as this is vectorised.
With forced interleaving the improvements are minimal, as the loop is still present even if it has less iterations, so the real gain here is to eliminate the loop completely, which we have seen in other cases with i8, i16, i32 at TC=3 and TC=5 and i64 with only TC=3.
$ /tmp/sve-small-trip-count-bench 100000000
iterations: 100000000
i8 tc5 vector 1.0755 ns/call
i8 tc5 scalar 2.8662 ns/call
i16 tc5 vector 1.0749 ns/call
i16 tc5 scalar 2.8663 ns/call
i32 tc5 vector 1.0748 ns/call
i32 tc5 scalar 2.8669 ns/call
i64 tc3 vector 1.0748 ns/call
i64 tc3 scalar 1.7915 ns/call
i64 tc5 ic2 vector 2.8663 ns/call
i64 tc5 ic2 scalar 2.9432 ns/call
vector/scalar ratios:
i8 0.3752
i16 0.3750
i32 0.3749
i64 0.6000
i64 ic2 0.9739
checksum: 0
For context, the ratio < 1 shows an improvement, > 1 shows regression.
There was a problem hiding this comment.
Could you add them to https://github.com/llvm/llvm-test-suite/tree/main/MicroBenchmarks/LoopVectorization so it is easier to evaluate on different platforms?
There was a problem hiding this comment.
Actually, looking at this more, I am curious how we can reach such loops in practice, at least for Clang. I'd expect those loops to get fully unrolled before the vectorizer?
There was a problem hiding this comment.
I will get them upstreamed.
The example I used to find this issue was using Flang, and nested do loops with trip counts of 5. These all fell below the unrolling threshold so were not unrolled, and also under the vectorisation threshold for the reason being addressed here.
There was a problem hiding this comment.
In fact, for the llvm-test-suite, the only platforms which will see impact is AArch64 with SVE and RISC-V, as these are the only platforms where getMinTripCountTailFoldingThreshold will return a value greater than 0. Still happy to upstream a benchmark for this but this codepath is quite limited to how many platforms it will impact.
There was a problem hiding this comment.
| // This produces 1 vector iteration, and 1 scalar iteration with | ||
| // no remainder. Later passes will eliminate the loop and leave | ||
| // straight-line code as the both iteration counts are statically known. | ||
| ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop); |
There was a problem hiding this comment.
They can be different yes, apologies I missed this. I am happy to go back to the original implementation.
| }; | ||
| unsigned EffectiveIC = UserIC > 0 ? UserIC : 1; | ||
| auto HasOneScalarIterationRemainder = [EffectiveIC](ElementCount &ExactTC, | ||
| unsigned int VF) -> bool { |
| ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop); | ||
| if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop && | ||
| ExactTC && ExactTC.isFixed()) { | ||
| if (HasOneScalarIterationRemainder( |
|
|
||
| ; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1). | ||
| ; The forced interleave count should be considered when choosing VF. | ||
| define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 { |
There was a problem hiding this comment.
I have expanded the testing to have different types and VF=2 for i64.
In terms of microbenchmarks, I got Codex to generate me one which just made a small loop for that datatype, and this change shows significant speedup for vectorized vs Scalar for i8, i16 and i32. i64 does not how the same results. i64 with a TC of 5 is not vectorising as the Loop Vectoriser rejects it at a later stage, but at TC3 we are seeing a significant improvement as this is vectorised.
With forced interleaving the improvements are minimal, as the loop is still present even if it has less iterations, so the real gain here is to eliminate the loop completely, which we have seen in other cases with i8, i16, i32 at TC=3 and TC=5 and i64 with only TC=3.
$ /tmp/sve-small-trip-count-bench 100000000
iterations: 100000000
i8 tc5 vector 1.0755 ns/call
i8 tc5 scalar 2.8662 ns/call
i16 tc5 vector 1.0749 ns/call
i16 tc5 scalar 2.8663 ns/call
i32 tc5 vector 1.0748 ns/call
i32 tc5 scalar 2.8669 ns/call
i64 tc3 vector 1.0748 ns/call
i64 tc3 scalar 1.7915 ns/call
i64 tc5 ic2 vector 2.8663 ns/call
i64 tc5 ic2 scalar 2.9432 ns/call
vector/scalar ratios:
i8 0.3752
i16 0.3750
i32 0.3749
i64 0.6000
i64 ic2 0.9739
checksum: 0
For context, the ratio < 1 shows an improvement, > 1 shows regression.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed. |
| // straight-line code as the both iteration counts are statically known. | ||
| ElementCount ExactTC = getSmallConstantTripCount(PSE.getSE(), TheLoop); | ||
| if (EpilogueLoweringStatus == CM_EpilogueNotAllowedLowTripLoop && | ||
| ExactTC.isFixed()) { |
There was a problem hiding this comment.
I don't think this is safe. There are three scenarios to consider:
- getSmallBestKnownTC != 0, getSmallConstantTripCount != 0, but the values are different.
- getSmallBestKnownTC == getSmallConstantTripCount.
- getSmallBestKnownTC returns non-zero, getSmallConstantTripCount returns zero.
I think you're assuming that we're always seeing cases 1 or 2, however getSmallConstantTripCount can also return ElementCount::getFixed(0) (case 3). It will be fixed, but zero so we should avoid getting into this path. Can you fix this please and add a test for this case? You'll need something like a max trip count of 5 returned by getSmallBestKnownTC (either due to profile data or the limit of the type/expression), but the exact trip count is unknown.
Also, I left an earlier comment I left on the PR - can getSmallConstantTripCount return a scalable TC when getSmallBestKnownTC returns a fixed one? If so, then this PR should add a test that exposes this case. If not, then you can just assume ExactTC is fixed and no need to add the check isFixed().
There was a problem hiding this comment.
Apologies I thought I responded to that earlier comment. from my understanding if one is returning fixed, the other will return fixed values. I don't think these can be mixed. If so I will remove that case.
I will get that fixed and updated.
There was a problem hiding this comment.
I have changed the check here to be ExactTC.getFixedValue() != 0 to ensure we don't go down the path of TC == 0. This also assumes that the value is fixed, as this will not be scaleable here.
| // also eliminate any loops. | ||
| // | ||
| // Forced interleaving is considered when seeing if | ||
| // OneScalarIterationRemainder is produced. It may prodiced more than |
There was a problem hiding this comment.
nit: typo
| // OneScalarIterationRemainder is produced. It may prodiced more than | |
| // OneScalarIterationRemainder is produced. It may produced more than |
david-arm
left a comment
There was a problem hiding this comment.
LGTM with nits addressed! However, before merging this PR it's probably a good idea to land llvm/llvm-test-suite#404 first as requested.
| ; | ||
| entry: | ||
| br label %loop | ||
| loop: |
There was a problem hiding this comment.
nit: Can you add spaces between the blocks?
|
|
||
| ; TC=5, VF=4: TC == MaxFixedVF + 1 (5 == 4 + 1). | ||
| ; The new code path should trigger: 1 vectorized iteration + 1 scalar iteration. | ||
| define void @tc5_vf4_vectorize(ptr noalias %a, ptr noalias %b) #0 { |
There was a problem hiding this comment.
nit: Can you rename this to tc5_vf4_vectorize_i32 to be consistent with tc5_vf4_vectorize_i16?
|
|
||
| ; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1). | ||
| ; The forced interleave count should be considered when choosing VF. | ||
| define void @tc5_forced_ic2_vectorize(ptr noalias %a, ptr noalias %b) #0 { |
There was a problem hiding this comment.
nit: Can you rename this to tc5_forced_ic2_vectorize_i32 to be consistent with tc5_forced_ic2_vectorize_i64?
david-arm
left a comment
There was a problem hiding this comment.
Sorry, my mistake. I just realised there is a comment that hasn't been addressed regarding the ExactTC.isFixed() code.
I may have fixed this downstream, and not pushed the changes. I will check and get it up with the nit's noted above fixed. |
| ; on AArch64 where under getMinTripCountTailFoldingThreshold(). This | ||
| ; produces the required number of vector iteration and one scalar iteration. | ||
| ; | ||
| ; RUN: opt -S -p loop-vectorize %s | FileCheck %s |
There was a problem hiding this comment.
nit: usually the run line is at the top
| // also eliminate any loops. | ||
| // | ||
| // Forced interleaving is considered when seeing if | ||
| // OneScalarIterationRemainder is produced. It may prodiced more than |
|
|
||
| ; TC=5, VF=2, IC=2: TC == VF * IC + 1 (5 == 2 * 2 + 1). | ||
| ; The forced interleave count should be considered when choosing VF. | ||
| define void @tc5_forced_ic2_vectorize_i64(ptr noalias %a, ptr noalias %b) #0 { |
There was a problem hiding this comment.
could you also add a low-trip count loop where the exact trip count is not known?
There was a problem hiding this comment.
Done as part of the test added in #195823 (comment)
| } | ||
|
|
||
| auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop); | ||
| unsigned EffectiveIC = UserIC > 0 ? UserIC : 1; |
There was a problem hiding this comment.
I think there are cases where user IC may be ignored, would be good to also cover this in test, e.g. something like a dependence distance that prevents the user IC from being chosen
There was a problem hiding this comment.
Added a test for this
There was a problem hiding this comment.
I think NoScalarEpilogueNeeded above checks for the user IC in a similar way. Would be good to share EffectiveIC between the two
There was a problem hiding this comment.
I've changed it to use EffectiveIC now.
cc1e953 to
a89e00f
Compare
david-arm
left a comment
There was a problem hiding this comment.
LGTM with nits addressed!
| // also eliminate any loops. | ||
| // | ||
| // Forced interleaving is considered when seeing if | ||
| // OneScalarIterationRemainder is produced. It may produced more than |
There was a problem hiding this comment.
nit: This wording seems a little odd. I think it should probably either say:
Forced interleaving is considered when deciding if a single scalar iteration remains.
or
OneScalarIterationRemainder takes account of any forced interleaving.
There was a problem hiding this comment.
Done with the second suggestion, and moved it above the call site for HasOneScalarIterationRemainder as I think it will make more sense there.
| MaxVF /= 2) { | ||
| if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) { | ||
| LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF | ||
| << " with 1 scalar iteration remainder.\n"); |
There was a problem hiding this comment.
nit: Perhaps with 1 scalar iteration remaining.\n sounds a bit better?
| } | ||
|
|
||
| ; TC=5, VF=4: TC == VF + 1 (5 == 4 + 1). | ||
| ; The natural fixed-width VF for i16 is 8 on AArch64, so this also checks that |
There was a problem hiding this comment.
nit: This comment is perhaps a bit misleading because both VF=4 and VF=8 are natural for i16 on AArch64, since NEON supports both 64-bit and 128-bit register sizes. It's perhaps more precise to say "VF=8 is a natural fixed-width for i16 types on AArch64". What do you think?
The point here is that even after stepping down to VF=4, the <4 x i16> is still a legal type.
| } | ||
|
|
||
| ; TC=5, VF=4: TC == VF + 1 (5 == 4 + 1). | ||
| ; The natural fixed-width VF for i8 is 16 on AArch64, so this checks that the |
There was a problem hiding this comment.
Same comment as above. Also, the comment still refers to 16-bit types.
| } | ||
|
|
||
| ; ExactTC is unknown here because the loop trip count is the runtime value %n, | ||
| ; but the guard proves the maximum trip count is 5. it must not use the VF+1 |
There was a problem hiding this comment.
nit: I think the comment it must not use the VF+1 escape relies upon knowing the current way the code is written in computeMaxVF. It's probably better to say something like In this case the vectoriser shouldn't optimise for a single vector iteration + single scalar iteration, because there is no guarantee we will enter the vector loop.
Currently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterprodictive to performance, often failing to vectorize entirely. This is more prevelant with targets where `getMinTripCountTailFoldingThreshold()` returns a value greater than 0. To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a single vectorized itneration and a single scalar iteration. Later passes can then remove the loop's entirely. Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes. Assisted-by: Claude Sonnet 4.6/Codex
The checks for matching to VF+1 == TC should protect against this, but its best to the explicit.
0fa0ce6 to
9b476b2
Compare
fhahn
left a comment
There was a problem hiding this comment.
I am not sure if this is always profitable without checking the cost of the vector loop (e.g. the benefit from vectorization may be relatively low compared to the scalar loop, e.g. because some of the vector IR instruction need multiple assembly instructions).
I added some suggestions for cases to check to llvm/llvm-test-suite#404, for those the scalar code seemed faster than the vector code on my test system (Apple M4 with getMinTripCountTailFoldingThreshold adjusted)
| ExactTC.getFixedValue() != 0) { | ||
| // If the maximum VF cannot produce 1 vector iteration + 1 scalar | ||
| // iteration, step down VF's to find one that can. The result should | ||
| // also eliminate any loops. | ||
| for (unsigned MaxVF = MaxFactors.FixedVF.getFixedValue(); MaxVF >= 2; | ||
| MaxVF /= 2) { | ||
| // OneScalarIterationRemainder takes account of any forced | ||
| // interleaving. | ||
| if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) { | ||
| LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF | ||
| << " with 1 scalar iteration remaining.\n"); | ||
| MaxFactors.FixedVF = ElementCount::getFixed(MaxVF); | ||
| MaxFactors.ScalableVF = ElementCount::getScalable(0); | ||
| return MaxFactors; | ||
| } | ||
| } |
There was a problem hiding this comment.
Do we need a loop here or could we directly compute the VF based on EffecticeTC + trip count?
| ExactTC.getFixedValue() != 0) { | |
| // If the maximum VF cannot produce 1 vector iteration + 1 scalar | |
| // iteration, step down VF's to find one that can. The result should | |
| // also eliminate any loops. | |
| for (unsigned MaxVF = MaxFactors.FixedVF.getFixedValue(); MaxVF >= 2; | |
| MaxVF /= 2) { | |
| // OneScalarIterationRemainder takes account of any forced | |
| // interleaving. | |
| if (HasOneScalarIterationRemainder(ExactTC, MaxVF)) { | |
| LLVM_DEBUG(dbgs() << "LV: Picking VF=" << MaxVF | |
| << " with 1 scalar iteration remaining.\n"); | |
| MaxFactors.FixedVF = ElementCount::getFixed(MaxVF); | |
| MaxFactors.ScalableVF = ElementCount::getScalable(0); | |
| return MaxFactors; | |
| } | |
| } | |
| ExactTC.getFixedValue() > 1) { | |
| unsigned TC = ExactTC.getFixedValue(); | |
| unsigned MaxFixedVF = MaxFactors.FixedVF.getFixedValue(); | |
| if ((TC - 1) % EffectiveIC == 0) { | |
| unsigned VF = (TC - 1) / EffectiveIC; | |
| if (VF >= 2 && VF <= MaxFixedVF && isPowerOf2_32(VF)) { | |
| LLVM_DEBUG(dbgs() << "LV: Picking VF=" << VF | |
| << " with 1 scalar iteration remainder.\n"); | |
| MaxFactors.FixedVF = ElementCount::getFixed(VF); | |
| MaxFactors.ScalableVF = ElementCount::getScalable(0); | |
| return MaxFactors; | |
| } |
There was a problem hiding this comment.
Agreed, and implemented.
|
@fhahn As I mentioned in llvm/llvm-test-suite#404 I have added some cost modelling to this PR so that loops that are not beneficial to vectorise in this way will be left as scalar. |
|
Ping |
| if (TC != EstimatedWidth + 1) | ||
| return false; | ||
|
|
||
| InstructionCost VectorCost = |
There was a problem hiding this comment.
Much of this code looks very similar to the code above on line 6138 and there is quite a lot of it to support a corner case. Can it be commonised? It would be good to simplify where possible.
There was a problem hiding this comment.
I've condensed this down into a single lambda function that gets reused.
| // control flow and a scalar epilogue for a single element, so require | ||
| // the vectorized form to save at least one scalar iteration. | ||
| InstructionCost AdjustedVectorCost = VectorCost + ScalarCost; | ||
| if (!VectorCost.isValid() || !ScalarCostForTC.isValid() || |
There was a problem hiding this comment.
I think you can just do !AdjustedVectorCost.isValid() because the Invalid state should be inherited from LHS and propagated from the RHS of the +. Similarly for the code above on line 6147.
|
Ping @fhahn |
Currently, Small Loops with Trip Counts less than 16, and in situations where the Trip Count (TC) is less than the Tail Folding Threshold are harder to vectorize, its only possible where no epilogue will be emitted. However, for loops with large bodies and small trip counts this can be counterproductive to performance, often failing to vectorize entirely. This is more prevalent with targets where
getMinTripCountTailFoldingThreshold()returns a value greater than 0.To address this, the Small Loops where the TC == VF + 1 can now vectorize, leading to a vectorized iteration (or loop if interleaving is required) and a single scalar iteration. Later passes can then remove the loop's entirely.
Testing an with OpenSource Fortran HPC Benchmark which includes multiple loops with small trip counts, but large loop bodies, has shown significant improvement to runtime after these changes.
Assisted-by: Claude Sonnet 4.6/Codex