[VPlan] Move tail folding out of VPlanPredicator. NFC#176143
Conversation
Currently the logic for introducing a header mask and predicating the vector loop region is done inside introduceMasksAndLinearize.
This splits the tail folding part out into an individual VPlan transform so that VPlanPredicator.cpp doesn't need to worry about tail folding, which seemed to be a temporary measure according to a comment in VPlanTransforms.h.
To perform tail folding independently, this splits the "body" of the vector loop region between the phis in the header and the branch + iv increment in the latch:
Before:
+-------------------------------------------+
|%iv = ... |
|... |
|%iv.next = add %iv, vfxuf |
|branch-on-count %iv.next, vector-trip-count|
+-------------------------------------------+
After:
+-------------------------------------------+
|%iv = ... |
|%wide.iv = widen-canonical-iv ... |
|%header-mask = icmp ult %wide.iv, BTC |---+
|branch-on-cond %header-mask | |
+-------------------------------------------+ |
| |
v |
+-------------------------------------------+ |
|... | |
+-------------------------------------------+ |
| |
v |
+-------------------------------------------+ |
|%iv.next = add %iv, vfxuf |<--+
|branch-on-count %iv.next, vector-trip-count|
+-------------------------------------------+
The motivation for this is to align tail folding predication with early-exit predication in llvm#172454. This will also allow us to directly emit an EVL based header mask, instead of having to match + transform the existing header mask in addExplicitVectorLength.
|
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms Author: Luke Lau (lukel97) ChangesCurrently the logic for introducing a header mask and predicating the vector loop region is done inside introduceMasksAndLinearize. This splits the tail folding part out into an individual VPlan transform so that VPlanPredicator.cpp doesn't need to worry about tail folding, which seemed to be a temporary measure according to a comment in VPlanTransforms.h. To perform tail folding independently, this splits the "body" of the vector loop region between the phis in the header and the branch + iv increment in the latch: Before: After: The motivation for this is to align tail folding predication with early-exit predication in #172454. This will also allow us to directly emit an EVL based header mask, instead of having to match + transform the existing header mask in addExplicitVectorLength. Full diff: https://github.com/llvm/llvm-project/pull/176143.diff 5 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 3f1e12e5d1cd0..1dcd33828445a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8463,11 +8463,13 @@ VPlanPtr LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
InterleaveGroups.insert(IG);
}
+ if (CM.foldTailByMasking())
+ VPlanTransforms::foldTailByMasking(*Plan);
+
// ---------------------------------------------------------------------------
// Predicate and linearize the top-level loop region.
// ---------------------------------------------------------------------------
- auto BlockMaskCache = VPlanTransforms::introduceMasksAndLinearize(
- *Plan, CM.foldTailByMasking());
+ auto BlockMaskCache = VPlanTransforms::introduceMasksAndLinearize(*Plan);
// ---------------------------------------------------------------------------
// Construct wide recipes and apply predication for original scalar
@@ -8726,7 +8728,8 @@ void LoopVectorizationPlanner::addReductionResultComputation(
auto *RR = dyn_cast<VPReductionRecipe>(OrigExitingVPV->getDefiningRecipe());
if (!PhiR->isInLoop() && CM.foldTailByMasking() &&
(!RR || !RR->isPartialReduction())) {
- VPValue *Cond = RecipeBuilder.getBlockInMask(PhiR->getParent());
+ VPValue *Cond = RecipeBuilder.getBlockInMask(
+ cast<VPBasicBlock>(PhiR->getParent()->getSuccessors().front()));
std::optional<FastMathFlags> FMFs =
PhiTy->isFloatingPointTy()
? std::make_optional(RdxDesc.getFastMathFlags())
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index a6a46e36b397d..d8817e1d481a7 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -566,6 +566,11 @@ VPBasicBlock *VPBasicBlock::splitAt(iterator SplitAt) {
auto *SplitBlock = getPlan()->createVPBasicBlock(getName() + ".split");
VPBlockUtils::insertBlockAfter(SplitBlock, this);
+ // If this is the exiting block, make the split the new exiting block.
+ auto *ParentRegion = getParent();
+ if (ParentRegion && ParentRegion->getExiting() == this)
+ ParentRegion->setExiting(SplitBlock);
+
// Finally, move the recipes starting at SplitAt to new block.
for (VPRecipeBase &ToMove :
make_early_inc_range(make_range(SplitAt, this->end())))
diff --git a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
index 7ee133545eeb9..97ad127658320 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanConstruction.cpp
@@ -958,6 +958,61 @@ void VPlanTransforms::createLoopRegions(VPlan &Plan) {
TopRegion->getEntryBasicBlock()->setName("vector.body");
}
+void VPlanTransforms::foldTailByMasking(VPlan &Plan) {
+ assert(Plan.getExitBlocks().size() == 1 &&
+ "only a single-exit block is supported currently");
+ assert(Plan.getExitBlocks().front()->getSinglePredecessor() ==
+ Plan.getMiddleBlock() &&
+ "the exit block must have middle block as single predecessor");
+
+ VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
+ VPBasicBlock *Header = LoopRegion->getEntryBasicBlock();
+
+ Header->splitAt(Header->getFirstNonPhi());
+
+ // Create the header mask, insert it in the header and branch on it.
+ auto *IV =
+ new VPWidenCanonicalIVRecipe(Header->getParent()->getCanonicalIV());
+ VPBuilder Builder(Header, Header->getFirstNonPhi());
+ Builder.insert(IV);
+ VPValue *BTC = Plan.getOrCreateBackedgeTakenCount();
+ VPValue *HeaderMask = Builder.createICmp(CmpInst::ICMP_ULE, IV, BTC);
+ Builder.createNaryOp(VPInstruction::BranchOnCond, HeaderMask);
+
+ VPBasicBlock *Latch = LoopRegion->getExitingBasicBlock();
+ VPValue *IVInc;
+ [[maybe_unused]] bool TermBranchOnCount =
+ match(Latch->getTerminator(),
+ m_BranchOnCount(m_VPValue(IVInc),
+ m_Specific(&Plan.getVectorTripCount())));
+ assert(TermBranchOnCount &&
+ match(IVInc, m_Add(m_Specific(LoopRegion->getCanonicalIV()),
+ m_Specific(&Plan.getVFxUF()))) &&
+ "Unexpected terminator");
+
+ // Split the latch at the IV update, and branch to it from the header mask.
+ VPBasicBlock *LatchSplit =
+ Latch->splitAt(IVInc->getDefiningRecipe()->getIterator());
+ VPBlockUtils::connectBlocks(Header, LatchSplit);
+
+ // Any extract of the last element must be updated to extract from the last
+ // active lane of the header mask instead (i.e., the lane corresponding to the
+ // last active iteration).
+ Builder.setInsertPoint(Plan.getMiddleBlock()->getTerminator());
+ for (VPRecipeBase &R : *Plan.getMiddleBlock()) {
+ VPValue *Op;
+ if (!match(&R, m_ExtractLastLane(m_ExtractLastPart(m_VPValue(Op)))))
+ continue;
+
+ // Compute the index of the last active lane.
+ VPValue *LastActiveLane =
+ Builder.createNaryOp(VPInstruction::LastActiveLane, HeaderMask);
+ auto *Ext =
+ Builder.createNaryOp(VPInstruction::ExtractLane, {LastActiveLane, Op});
+ R.getVPSingleValue()->replaceAllUsesWith(Ext);
+ }
+}
+
// Likelyhood of bypassing the vectorized loop due to a runtime check block,
// including memory overlap checks block and wrapping/unit-stride checks block.
static constexpr uint32_t CheckBypassWeights[] = {1, 127};
diff --git a/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp b/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp
index f7e7fc29bc203..96bee083875af 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanPredicator.cpp
@@ -14,13 +14,11 @@
#include "VPRecipeBuilder.h"
#include "VPlan.h"
#include "VPlanCFG.h"
-#include "VPlanPatternMatch.h"
#include "VPlanTransforms.h"
#include "VPlanUtils.h"
#include "llvm/ADT/PostOrderIterator.h"
using namespace llvm;
-using namespace VPlanPatternMatch;
namespace {
class VPPredicator {
@@ -73,9 +71,6 @@ class VPPredicator {
return EdgeMaskCache.lookup({Src, Dst});
}
- /// Compute and return the mask for the vector loop header block.
- void createHeaderMask(VPBasicBlock *HeaderVPBB, bool FoldTail);
-
/// Compute and return the predicate of \p VPBB, assuming that the header
/// block of the loop is set to True, or to the loop mask when tail folding.
VPValue *createBlockInMask(VPBasicBlock *VPBB);
@@ -156,28 +151,6 @@ VPValue *VPPredicator::createBlockInMask(VPBasicBlock *VPBB) {
return BlockMask;
}
-void VPPredicator::createHeaderMask(VPBasicBlock *HeaderVPBB, bool FoldTail) {
- if (!FoldTail) {
- setBlockInMask(HeaderVPBB, nullptr);
- return;
- }
-
- // Introduce the early-exit compare IV <= BTC to form header block mask.
- // This is used instead of IV < TC because TC may wrap, unlike BTC. Start by
- // constructing the desired canonical IV in the header block as its first
- // non-phi instructions.
-
- auto &Plan = *HeaderVPBB->getPlan();
- auto *IV =
- new VPWidenCanonicalIVRecipe(HeaderVPBB->getParent()->getCanonicalIV());
- Builder.setInsertPoint(HeaderVPBB, HeaderVPBB->getFirstNonPhi());
- Builder.insert(IV);
-
- VPValue *BTC = Plan.getOrCreateBackedgeTakenCount();
- VPValue *BlockMask = Builder.createICmp(CmpInst::ICMP_ULE, IV, BTC);
- setBlockInMask(HeaderVPBB, BlockMask);
-}
-
void VPPredicator::createSwitchEdgeMasks(VPInstruction *SI) {
VPBasicBlock *Src = SI->getParent();
@@ -232,7 +205,8 @@ void VPPredicator::createSwitchEdgeMasks(VPInstruction *SI) {
void VPPredicator::convertPhisToBlends(VPBasicBlock *VPBB) {
SmallVector<VPPhi *> Phis;
for (VPRecipeBase &R : VPBB->phis())
- Phis.push_back(cast<VPPhi>(&R));
+ if (auto *PhiR = dyn_cast<VPPhi>(&R))
+ Phis.push_back(PhiR);
for (VPPhi *PhiR : Phis) {
// The non-header Phi is converted into a Blend recipe below,
// so we don't have to worry about the insertion order and we can just use
@@ -261,7 +235,7 @@ void VPPredicator::convertPhisToBlends(VPBasicBlock *VPBB) {
}
DenseMap<VPBasicBlock *, VPValue *>
-VPlanTransforms::introduceMasksAndLinearize(VPlan &Plan, bool FoldTail) {
+VPlanTransforms::introduceMasksAndLinearize(VPlan &Plan) {
VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
// Scan the body of the loop in a topological order to visit each basic block
// after having visited its predecessor basic blocks.
@@ -272,14 +246,6 @@ VPlanTransforms::introduceMasksAndLinearize(VPlan &Plan, bool FoldTail) {
for (VPBlockBase *VPB : RPOT) {
// Non-outer regions with VPBBs only are supported at the moment.
auto *VPBB = cast<VPBasicBlock>(VPB);
- // Introduce the mask for VPBB, which may introduce needed edge masks, and
- // convert all phi recipes of VPBB to blend recipes unless VPBB is the
- // header.
- if (VPBB == Header) {
- Predicator.createHeaderMask(Header, FoldTail);
- continue;
- }
-
Predicator.createBlockInMask(VPBB);
Predicator.convertPhisToBlends(VPBB);
}
@@ -300,32 +266,5 @@ VPlanTransforms::introduceMasksAndLinearize(VPlan &Plan, bool FoldTail) {
PrevVPBB = VPBB;
}
-
- // If we folded the tail and introduced a header mask, any extract of the
- // last element must be updated to extract from the last active lane of the
- // header mask instead (i.e., the lane corresponding to the last active
- // iteration).
- if (FoldTail) {
- assert(Plan.getExitBlocks().size() == 1 &&
- "only a single-exit block is supported currently");
- assert(Plan.getExitBlocks().front()->getSinglePredecessor() ==
- Plan.getMiddleBlock() &&
- "the exit block must have middle block as single predecessor");
-
- VPBuilder B(Plan.getMiddleBlock()->getTerminator());
- for (VPRecipeBase &R : *Plan.getMiddleBlock()) {
- VPValue *Op;
- if (!match(&R, m_ExtractLastLane(m_ExtractLastPart(m_VPValue(Op)))))
- continue;
-
- // Compute the index of the last active lane.
- VPValue *HeaderMask = Predicator.getBlockInMask(Header);
- VPValue *LastActiveLane =
- B.createNaryOp(VPInstruction::LastActiveLane, HeaderMask);
- auto *Ext =
- B.createNaryOp(VPInstruction::ExtractLane, {LastActiveLane, Op});
- R.getVPSingleValue()->replaceAllUsesWith(Ext);
- }
- }
return Predicator.getBlockMaskCache();
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index e0d09a099647a..e00c94558bf7a 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -413,14 +413,44 @@ struct VPlanTransforms {
static void narrowInterleaveGroups(VPlan &Plan, ElementCount VF,
TypeSize VectorRegWidth);
+ /// Adapts the vector loop region for tail folding by introducing a header
+ /// mask and predicating the region:
+ ///
+ /// Vector loop region before:
+ /// +-------------------------------------------+
+ /// |%iv = ... |
+ /// |... |
+ /// |%iv.next = add %iv, vfxuf |
+ /// |branch-on-count %iv.next, vector-trip-count|
+ /// +-------------------------------------------+
+ ///
+ /// Vector loop region after:
+ /// +-------------------------------------------+
+ /// |%iv = ... |
+ /// |%wide.iv = widen-canonical-iv ... |
+ /// |%header-mask = icmp ult %wide.iv, BTC |
+ /// |branch-on-cond %header-mask |---+
+ /// +-------------------------------------------+ |
+ /// | |
+ /// v |
+ /// +-------------------------------------------+ |
+ /// | ... | |
+ /// +-------------------------------------------+ |
+ /// | |
+ /// v |
+ /// +-------------------------------------------+ |
+ /// |%iv.next = add %iv, vfxuf |<--+
+ /// |branch-on-count %iv.next, vector-trip-count|
+ /// +-------------------------------------------+
+ ///
+ /// Any VPInstruction::ExtractLastLanes are also updated to extract from the
+ /// last active lane of the header mask.
+ static void foldTailByMasking(VPlan &Plan);
+
/// Predicate and linearize the control-flow in the only loop region of
- /// \p Plan. If \p FoldTail is true, create a mask guarding the loop
- /// header, otherwise use all-true for the header mask. Masks for blocks are
- /// added to a block-to-mask map which is returned in order to be used later
- /// for wide recipe construction. This argument is temporary and will be
- /// removed in the future.
+ /// \p Plan.
static DenseMap<VPBasicBlock *, VPValue *>
- introduceMasksAndLinearize(VPlan &Plan, bool FoldTail);
+ introduceMasksAndLinearize(VPlan &Plan);
/// Add branch weight metadata, if the \p Plan's middle block is terminated by
/// a BranchOnCond recipe.
|
| (!RR || !RR->isPartialReduction())) { | ||
| VPValue *Cond = RecipeBuilder.getBlockInMask(PhiR->getParent()); | ||
| VPValue *Cond = RecipeBuilder.getBlockInMask( | ||
| cast<VPBasicBlock>(PhiR->getParent()->getSuccessors().front())); |
There was a problem hiding this comment.
Cond here is supposed to be the header mask IIUC, and PhiR's parent is the header block. But the header block itself is no longer predicated, only the "body" block, so this gets the header mask from its successor instead.
There was a problem hiding this comment.
I think that's essentially my question about SSA form. I would expect that select to be created by predicator automatically if we maintain proper SSA form, i.e.,
; after tail folding
header:
%red-phi = phi [ init, %preheader ], [ %new-phi, %latch ]
...
br i1 %header-mask, label %body, label %latch
body:
%red-phi.next = ...
...
br label %latch
latch:
; Will be transformed into a VPBlend and subsequently into a select with existing predicator:
new-phi = phi [ %red-phi, %header ], [ %red-phi.next, %body ]
...
exit:
%use = phi [ %new-phi, %latch ]
There was a problem hiding this comment.
I gave this a try and I think it's possible if you explicitly handle the VPReductionPHIRecipes in foldTailByMasking, it would make for a nice tidy up. But it doesn't seem to be an NFC change. I can add a TODO to move this into VPlanTransforms::foldTailByMasking as a follow up?
There was a problem hiding this comment.
What special handling? Is it still necessary after you've optimized blends per your comment in #176143 (comment)?
There was a problem hiding this comment.
You need to check for reduction phis when creating the latch phi and set their incoming value to the reduction phi, as opposed to poison. It's still needed after optimizing the blends:
@@ -1014,7 +1014,12 @@ void VPlanTransforms::foldTailByMasking(VPlan &Plan) {
continue;
VPValue *Poison = Plan.getOrAddLiveIn(
PoisonValue::get(V->getUnderlyingValue()->getType()));
+ auto *RedIt = find_if(V->users(), IsaPred<VPReductionPHIRecipe>);
+ if (RedIt != V->user_end())
+ Poison = cast<VPRecipeBase>(*RedIt)->getVPSingleValue();
VPValue *Phi = Builder.createScalarPhi({V, Poison}, {});
+ if (RedIt != V->user_end() && PredicatedReductionSelect)
+ cast<VPRecipeBase>(*RedIt)->setOperand(1, Phi);
V->replaceUsesWithIf(Phi, NeedsPhi);
}
}There was a problem hiding this comment.
You're right. I now think that tail folding belongs to the predicator :) More specifically, handling of the divergent loop exit belongs to the predicator. Consider outer loop vectorization:
inner.header:
%inner.iv = phi
%inner.red = phi [ 0, %inner.preheader ], [ %inner.red.next, %inner.header ]
...
%ec = icmp %inner.iv.next, %n
br i1 %ec, %inner.exit, %inner.header
inner.exit:
%inner.red.lcssa = phi [ %inner.red.next ]If %n is inner-loop-invariant and all lanes exit at the same time, no special handling for the %inner.red is needed - it can simply be widened and all SIMD lanes would operate independently but exactly the same, in a sense. If, for example, %n == %outer.iv and some lanes exit inner loop earlier than the others, then exactly the same processing as you're talking about would be necessary for %inner.red. However, it has nothing to do with the outer loop we're vectorizing and/or its tail folding, thus that special handling belongs to the predicator.
I think the right way to communicate tail masking is to tell predicator that [outer|our] loop backedge isn't uniform any more, and let it do the processing necessary. We'll reuse the same functionality later when we start supporting proper outer loop vectorization.
Early exits, on the other hand, are different. We don't take backedge if any lane has exited, so the backedge branch remains uniform. As such, early exits from the loop we're vectorizing are unlikely to belong to the predicator, even if there might be some similarities.
There was a problem hiding this comment.
To clarify, the predicator is still ultimately doing performing the tail folding. The only difference is that we're now explicitly modelling the creation of the header mask and the predication of the blocks in VPlan in a separate, isolated step.
To the best of my understanding, this seems to match the original direction, see this comment here:
llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Lines 9277 to 9278 in dabc84c
// Create mask based on the IR BB corresponding to VPBB.
// TODO: Predicate directly based on VPlan.
There was a problem hiding this comment.
Hope to find time to track this patch more closely, but regarding original directions, one laid out back in '23 was to first (always) tail-fold, when strip-mining the scalar VPlan by a symbolic strip-mine factor SF (later to become VF*UF), then predicate, as part of VPlan-creation, then tail-unfold as a cost-based VF-dependent decision:
https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf#page=37
There could be multiple ways to get there from where we are now.
There was a problem hiding this comment.
Thanks for sharing that. I think this patch is somewhat aligned in that direction, since it models the CFG in the strip-mine step from that slide.
As a side note, if we want to tail-unfold then presumably there'll be a step where we'll want to remove the header mask by e.g. replacing it with all ones. I think expanding VPBlendRecipes to VPInstruction::Select would help with this since any select with an all-ones mask just gets folded away: #171851
eas
left a comment
There was a problem hiding this comment.
This breaks SSA form for the exiting values (outside uses might no longer be dominated by the defs) and headers phis' incoming values over the backedge. Subsequent predication would restore it, and it happens (at the moment of this patch) immediately, but I wonder what would happen if we make some future changes in the pipeline.
Do you think maintaining proper SSA form would be worthwhile here?
| // Split the latch at the IV update, and branch to it from the header mask. | ||
| VPBasicBlock *LatchSplit = | ||
| Latch->splitAt(IVInc->getDefiningRecipe()->getIterator()); |
There was a problem hiding this comment.
Can we assert that there are no unexpected instructions in the LatchSplit (e.g. if someone tries to change pass ordering in future)?
There was a problem hiding this comment.
I think that's what the asserts do above, that it only contains the canonical IV increment and the branch-on-count.
There was a problem hiding this comment.
that it ONLY contains
Sorry, I don't see that. I'm expecting these:
assert(std::distance(Latch->begin(), Latch->end() == 2);
assert(IVInc->getParent() == Latch);
// Ensured by the verifier:
assert(Latch->getTerminator->getParent() == Latch);There was a problem hiding this comment.
IIUC the asserts above check for that, but just stricter. The two asserts ensure that the latch is always of the form:
latch:
... arbitrary recipes
%ivinc = add %can-iv, vfxuf
%branch-on-count %ivinc, vector-trip-count
successors: middle.block, header
So when we split at %ivinc, the new structure should always be:
latch:
... arbitrary recipes
successors: latch.split
latch.split:
%ivinc = add %can-iv, vfxuf
%branch-on-count %ivinc, vector-trip-count
successors: middle.block, header
There was a problem hiding this comment.
Woops, I was only considering the use def chain. Added an assert in df47d1e
| // Any extract of the last element must be updated to extract from the last | ||
| // active lane of the header mask instead (i.e., the lane corresponding to the | ||
| // last active iteration). |
There was a problem hiding this comment.
Is this something that we're ok with, or do we need a TODO to move tail folding even earlier in the pipeline before last lane extracts are introduced (or anything else, really, that needs explicit handling here)?
There was a problem hiding this comment.
I think it's fine to handle it here with the rest of the tail folding transform, so it's all kept in one place. The header mask won't be available up until this point anyway.
In general I think it's best to do transforms atomically so you don't have an incorrect intermediate VPlan between transforms
There was a problem hiding this comment.
One way to make it work automatically is to allow a mask on the ExtractLastElement. Then any instances introduced before predication would be masked automatically. And then we can either modify its codegen or lower into a more verbose representation by the end of the pipeline. #142285 could make that simple enough, maybe.
There was a problem hiding this comment.
There's an invariant that the mask needs to be a prefix mask, i.e. all 1s before all 0s, see LastActiveLane, so we wouldn't be able to lower any arbitrary mask.
There was a problem hiding this comment.
So, header-mask will work and that's the one we're creating here, right? And that can be asserted for in the predicator, I think, although somewhat ugly: Mask->getParent() == Header && match(Mask, m_Cmp(LT/LE, IV, m_LoopInvariant()). Or, if everything else, just change how we lower LastActiveLane to support arbitrary mask. That could help in other places, e.g. with generalizing #141900 to skip blocks that have masks other than the header mask.
There was a problem hiding this comment.
Given that this is existing code, I think we should we defer this conversation to a separate PR if we want to change the approach
Yes, in practice the vector region's control flow is immediately linearized afterwards so everything is dominated. This patch has a slightly weird use of the CFG, because the branch-on-cond doesn't use a scalar i1 condition but a vector mask instead. So it's never actually "branching" to begin with and relies on being lineraized and removed. I think this is a good point though. In #172454 the original I approach I used was to insert custom VPInstructions to indicate to VPPredicator that a certain point needed to be masked, without touching the CFG itself and preserving SSA: da6eba9 We could try going back down this route. But this time I would maybe avoid the custom VPInstruction and instead just pass to VPPredicator a map of blocks that should be predicated down to the latch. Something like |
…cator This preserves SSA
|
I've tried out the explicit DenseMap approach in ed18c28 which should avoid the faux-control flow now. |
How is this a problem? Isn't it what predicator sees for the original control flow from VPlan construction (modulo "vector mask" -> "varying i1 condition" renaming, which are essentially the same anyway)? I think explicit control flow while maintaining SSA is a better choice:
|
fhahn
left a comment
There was a problem hiding this comment.
Do you know if the verifier would have caught that?
Given that we now need to pass the extra state between transforms, I am wondering if it still helps for #172454? That would then also require passing more state between transforms further appart, right?
I think #172454 would be a good candidate for using the masked VPInsruction capabilities from #142285; for early exits, we can mask the relevant VPInstructions early.
Yes but the vector mask condition is the problem since VPInstruction::BranchOnCond isn't designed as a masking construct but for regular control flow, i.e. see VPInstruction::usesFirstLaneOnly where it states the condition operand is scalar. In retrospect I think it was more a coincidence that VPPredicator happened to work with it because it was automatically broadcasting scalars when predicating.
Even if we do add PHIs in the latch to keep SSA, the BranchOnCond instructions will be invalid so it will fail the VPlanVerifier anyway. FWIW I'm not strongly opinionated on either approach
Yes it checks for domination, but at this stage during VPlan construction it doesn't pass anyway because e.g. unhandled VPInstruction opcodes that aren't yet widened to VPWidenRecipe.
I didn't notice that PR before, I think we could use it for both this PR and #172454. But we would need to teach VPlanPredicator to logically-and masks with each VPInstructions' existing mask. I'm not sure if we would then need to keep around a cache or something so that we don't end up with a logical-and for each VPInstruction. |
|
I tried reworking this PR and #172454 to do predication via VPInstruction mask operands in #142285, but it turned out to be very difficult since you need to merge the mask operands during both tail folding, early exits and linearization. Not only do you have to worry about where to insert logical-ands and deduplicating them, the big issue I found is that if you e.g. want to predicate everything past the exiting block here, you have to start dealing with control flow joins: flowchart
foo --> exiting --> bar
foo --> bar
And in that case you end up reimplementing what VPPredicator already does.
I've removed the state in 81332b3, and instead used a new VPInstruction to indicate to VPPredicator that the successors should be predicated. I was able to reuse this approach for #172454 too, and it works after rebasing on top of #142285. |
For header:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %latch ]
%mask = icmp slt i64 %iv, %n
br i1 %mask, label %body, label %latch
body:
%gep = getelementptr i64, ptr %ptr, i64 %iv
%ld = load i64, ptr %gep
%add = add i64 %ld, %iv
store i64 %add, ptr %gep
br label %latch
latch:
%iv.next = add nsw i64 %iv, 1
%exitcond = icmp slt i64 %iv.next, %n
br i1 %exitcond, label %header, label %exitRight before predicator (current trunk): <x1> vector loop: {
vector.body:
EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
ir<%iv> = WIDEN-INDUCTION nsw ir<0>, ir<1>, vp<%0>
EMIT ir<%mask> = icmp slt ir<%iv>, ir<%n>
EMIT branch-on-cond ir<%mask> ;; ! branch-on-cond too !
Successor(s): body, latch
body:
EMIT ir<%gep> = getelementptr ir<%ptr>, ir<%iv>
EMIT ir<%ld> = load ir<%gep>
EMIT ir<%add> = add ir<%ld>, ir<%iv>
EMIT store ir<%add>, ir<%gep>
Successor(s): latch
latch:
EMIT ir<%iv.next> = add nsw ir<%iv>, ir<1>
EMIT ir<%exitcond> = icmp slt ir<%iv.next>, ir<%n>
EMIT vp<%5> = not ir<%exitcond>
EMIT vp<%index.next> = add nuw vp<%4>, vp<%1>
EMIT branch-on-count vp<%index.next>, vp<%2>
No successors
}
Successor(s): middle.blockHow is that different from manually created |
TIL the header phis are widened before linearization. I had presumed that they were just widened alongside regular phis in the loop body after linearization. In that case I feel better about using BranchOnCond, let me give it a second try but preserving SSA. |
| auto NotPoison = [&](VPValue *V) { | ||
| // Don't remove poison from phis from the original loop. | ||
| return PhiR->getUnderlyingValue() || !isa<VPIRValue>(V) || | ||
| !isa<PoisonValue>(cast<VPIRValue>(V)->getValue()); | ||
| }; |
There was a problem hiding this comment.
@eas now that we have phis to keep SSA, the incoming value for the header block (i.e. the tail lanes) is poison. But if we don't remove the redundant phis a lot of stuff breaks around the reduction computation matching stuff, and we get other regressions with unnecessary blends. So this prevents the redundant blends from being created.
There was a problem hiding this comment.
There is a patch (by Florian, I think) that changes predicator to emit selects instead of blends. With that, IIUC, we'd just do (pseudocode)
auto *Result = nullptr;
for (auto [Val, Edge] : orig_phi->values()) {
if (is_poison(Val) || Result == Val)
continue;
Result = Result ? createSelect(EdgeCondition, Result, Val) : Val;
}eliminating this whole block of special casing. Is that accurate enough? If so, then I think whatever this patch needs to do temporarily isn't an issue. I'm not sure if the same can be emulated with the VPBlend, but probably the above can be rewritten in terms of filling in the blend's operands.
There was a problem hiding this comment.
Is it #171851, and the successor #150369? Those patches essentially do the code snippet above.
But for this patch at least we need this special casing, since we don't yet have a simplification for a select with a poison value. And we also don't run simplification this early on yet, so we would need to manually call it here. But all for another patch.
There was a problem hiding this comment.
Oh, It was yours :) Yes, #150369 is the one I had in mind.
It seems with the the fold from #177887, the first part does not seem too bad: d15e2db As for the second part, I need to think a bit more about it, but I think similarly with the explicit CFG we similarly need to make sure we insert phis to preserve SSA form? |
I have some strong doubts about this. My reasoning for this is that early exits from the loop we're vectorizing are vastly different from early exits from the inner loops (that the predicator would need to handle someday), so it's not "early exit" but an "early exit from that particular loop" and the early semantics lowering of the incoming source code (either implicit via legal or explicit if we implement more of OMP SIMD/other similar annotation). I also believe that such "import of semantics" should be done early on before more generic processing (like the linearization/predication). |
This silently just happened to work because these phis are just replaced with the same original value.
…ags. NFC (#181694) In order to be able to create selects for reduction phis through tail folding in foldTailByMasking (#176143), make VPReductionPHIRecipe an instance of VPIRFlags and plumb the FMFs from the original RdxDesc. This allows us to remove more uses of the RecurrenceDescriptor in addReductionResultComputation, which should help untie it from LoopVectorizationLegality.
|
Ping |
…ze/move-foldTailByMasking
…ze/move-foldTailByMasking
🐧 Linux x64 Test Results
Failed Tests(click on a test name to see its output) lldb-apilldb-api.functionalities/data-formatter/data-formatter-stl/generic/list/TestDataFormatterGenericList.pyIf these failures are unrelated to your changes (for example tests are broken or flaky at HEAD), please open an issue at https://github.com/llvm/llvm-project/issues and add the |
| // Collect any values defined in the loop that need a phi. Currently this | ||
| // includes header phi backedges and live-outs extracted in the middle block. | ||
| // TODO: Handle early exits via Plan.getExitBlocks() | ||
| MapVector<VPValue *, SmallVector<VPUser *>> NeedsPhi; |
There was a problem hiding this comment.
May be good to have a slightly more descriptive name, although I don't really have any good suggestions.
There was a problem hiding this comment.
Agreed but I'm also drawing a blank. I guess it's something like "live-out of the predicated section". I'll try and think of something in a follow up PR.
| /// +-------------------------------------------+ | ||
| /// |%iv = ... | | ||
| /// |%wide.iv = widen-canonical-iv ... | | ||
| /// |%header-mask = icmp ult %wide.iv, BTC | |
There was a problem hiding this comment.
| /// |%header-mask = icmp ult %wide.iv, BTC | | |
| /// |%header-mask = icmp ule %wide.iv, BTC | |
Oh, I was actually waiting for #173265 to land first to see if we would need to generalize it to the IDom. I'll land this first though then, and if we need to generalize it then I guess you can just do that in #173265 |
#173265 can handle non-tail folding case first. We can create another patch to handle tail folding case. :) |
…ze/move-foldTailByMasking
Currently the logic for introducing a header mask and predicating the vector loop region is done inside introduceMasksAndLinearize.
This splits the tail folding part out into an individual VPlan transform so that VPlanPredicator.cpp doesn't need to worry about tail folding, which seemed to be a temporary measure according to a comment in VPlanTransforms.h.
To perform tail folding independently, this splits the "body" of the vector loop region between the phis in the header and the branch + iv increment in the latch:
Before:
After:
Phis are then inserted in the latch for any value in the loop body that have outside uses, with poison as their incoming value from the header edge.
The motivation for this is to allow us to share the same "predicate all successor blocks" type of predication we do for tail folding, but for early-exit loops in #172454. This may also allow us to directly emit an EVL based header mask, instead of having to match + transform the existing header mask in addExplicitVectorLength.
This also allows us to eventually handle recurrences in the same transform, avoiding the need to special case tail folding in addReductionResultComputation.