[LV][VPlan] Introduce conditional block guards to skip inactive vector blocks (BOSCC) #197423
Add a new VPlan transformation, introduceConditionalBlockGuards, that
wraps conditionally-executed vector blocks in guard diamonds. When the
vector mask has no active lanes, the entire block is bypassed via a
scalar branch, avoiding unnecessary execution of an expensive vector
block.
This implements the Branch-On-Superword-Condition-Code (BOSCC) technique
at the VPlan level, replacing the earlier approach proposed in D139074
with a simpler design that reuses existing VPlan building blocks (AnyOf,
BranchOnCond, VPBlockUtils).
The transformation works in four steps:
1. Identify candidate blocks that are conditionally executed in the
original scalar loop and exceed a recipe-count threshold.
2. Discover the active-lane mask from masked recipes, splitting the
block if the mask definition is local.
3. Build a guard diamond: GuardBlock (any-of + branch-on-cond) and
JoinBlock around the candidate.
4. Insert ConditionalMerge PHIs in the JoinBlock for values that
escape the guarded block.
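
The effect of the guard diamond can be illustrated outside of VPlan with a scalar emulation. The following is a minimal sketch, not code from the patch: `VF`, `GuardStats`, and `runGuardedLoop` are hypothetical names, and the loop body (`sqrt(A[i]) * 2`) is an assumed stand-in for an expensive guarded block. It shows a mask computed in the predecessor, an any-of test used as a scalar branch, and a masked block that is bypassed when no lane is active.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical vectorization factor, chosen only for this illustration.
constexpr std::size_t VF = 4;

// Counts guarded-block executions so the effect of the guard is observable.
struct GuardStats {
  std::size_t Executed = 0;
  std::size_t Skipped = 0;
};

// Scalar emulation of a vectorized loop for:
//   if (A[i] > 0) B[i] = sqrt(A[i]) * 2;
// Each outer iteration stands for one vector iteration of VF lanes.
GuardStats runGuardedLoop(const std::vector<float> &A, std::vector<float> &B) {
  assert(A.size() == B.size() && A.size() % VF == 0 &&
         "sketch assumes equal sizes and no tail iterations");
  GuardStats Stats;
  for (std::size_t I = 0; I < A.size(); I += VF) {
    // Predecessor block: compute the active-lane mask.
    std::array<bool, VF> Mask{};
    bool AnyActive = false; // models any-of(Mask)
    for (std::size_t L = 0; L < VF; ++L) {
      Mask[L] = A[I + L] > 0.0f;
      AnyActive = AnyActive || Mask[L];
    }

    // Guard block: models branch-on-cond. With no active lane, control
    // goes straight to the join point and the guarded block is bypassed.
    if (!AnyActive) {
      ++Stats.Skipped;
      continue;
    }

    // Guarded vector block: masked computation and masked store.
    ++Stats.Executed;
    for (std::size_t L = 0; L < VF; ++L)
      if (Mask[L])
        B[I + L] = std::sqrt(A[I + L]) * 2.0f;
  }
  return Stats;
}
```

With an input in which one group of four elements is entirely non-positive, the sketch records one skipped block, and the corresponding outputs are left untouched.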
A new VPInstruction opcode, ConditionalMerge, is introduced to merge
guarded values with poison on the skip path. It lowers to an IR PHI
node.
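
One way to read the ConditionalMerge semantics is sketched below. This is an assumption-laden model rather than the patch's implementation: `std::nullopt` stands in for poison, and `conditionalMerge` and `maskedSelect` are hypothetical helpers. The presumed reasoning is that a lane only consumes the merged value when its mask bit is set, and a set mask bit implies the guarded path was taken, so the poison incoming value is never observed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

using Vec = std::array<int, 4>;
using Mask = std::array<bool, 4>;

// Models the PHI in the join block: the guarded value when control came
// from the guarded block, "poison" (std::nullopt here) on the skip path.
std::optional<Vec> conditionalMerge(bool TookGuardedPath,
                                    const Vec &GuardedValue) {
  if (TookGuardedPath)
    return GuardedValue;
  return std::nullopt;
}

// Models a masked user after the join: active lanes take the merged
// value, inactive lanes keep the fallback. An active lane implies the
// guarded path ran, so the merged value must be present.
Vec maskedSelect(const Mask &M, const std::optional<Vec> &Merged,
                 const Vec &Fallback) {
  Vec Out = Fallback;
  for (std::size_t L = 0; L < Out.size(); ++L) {
    if (M[L]) {
      assert(Merged.has_value() && "active lane would read poison");
      Out[L] = (*Merged)[L];
    }
  }
  return Out;
}
```

When every mask bit is clear, `maskedSelect` returns the fallback unchanged and never inspects the merged value, which mirrors why merging with poison on the skip path can be acceptable.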
The transformation is off by default
(-enable-loop-vectorization-with-conditions=false) and is disabled for
scalable vector targets and for blocks containing in-loop reductions.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers
Author: Ashutosh Nema (nema-ashutosh)
Patch is 26.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/197423.diff
11 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 61c2d3cd228ec..ce9927e0dd65a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6823,6 +6823,8 @@ void LoopVectorizationPlanner::buildVPlans(ElementCount MinVF,
RUN_VPLAN_PASS(VPlanTransforms::sinkPredicatedStores, *Plan, PSE, OrigLoop);
RUN_VPLAN_PASS(VPlanTransforms::truncateToMinimalBitwidths, *Plan,
Config.getMinimalBitwidths());
+ RUN_VPLAN_PASS(VPlanTransforms::introduceConditionalBlockGuards, *Plan,
+ OrigLoop, DT, &TTI);
RUN_VPLAN_PASS(VPlanTransforms::optimize, *Plan);
// TODO: try to put addExplicitVectorLength close to addActiveLaneMask
if (CM.foldTailWithEVL()) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 63436c79e9a98..1e538d9515b7d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1290,6 +1290,12 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
// operand as additional operands. AnyOf is poison-safe as all operands
// will be frozen.
AnyOf,
+ // Creates a PHI node at the merge point of an active-lane guard diamond.
+ // Merges the value produced by the guarded vector block (operand 0) with
+ // poison from the guard-skip path. The parent block must have exactly two
+ // CFG predecessors: predecessor 0 is the guard (skip) block and
+ // predecessor 1 is the guarded vector block.
+ ConditionalMerge,
// Calculates the first active lane index of the vector predicate operands.
// It produces the lane index across all unrolled iterations. Unrolling will
// add all copies of its original operand as additional operands.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index a42b631cd3304..643194b2a7925 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -118,6 +118,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::AnyOf:
+ case VPInstruction::ConditionalMerge:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildVector:
case VPInstruction::Unpack:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 11a91dcd46867..bec4f4a5878f4 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -535,6 +535,7 @@ unsigned VPInstruction::getNumOperandsForOpcode() const {
case Instruction::PHI:
case Instruction::Switch:
case VPInstruction::AnyOf:
+ case VPInstruction::ConditionalMerge:
case VPInstruction::BuildStructVector:
case VPInstruction::BuildVector:
case VPInstruction::CanonicalIVIncrementForPart:
@@ -859,6 +860,31 @@ Value *VPInstruction::generate(VPTransformState &State) {
Res = Builder.CreateOr(Res, Builder.CreateFreeze(State.get(Op)));
return State.VF.isScalar() ? Res : Builder.CreateOrReduce(Res);
}
+ case VPInstruction::ConditionalMerge: {
+ const VPBasicBlock *JoinVPBB = getParent();
+ if (JoinVPBB->getNumPredecessors() < 2)
+ return State.get(getOperand(0));
+ BasicBlock *GuardBB =
+ State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(0));
+ BasicBlock *VecBB =
+ State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(1));
+ Value *InVal;
+ {
+ // State.get() may insert broadcast instructions (insertelement +
+ // shufflevector) when the operand is scalar and needs widening. These
+ // must land in VecBB so that the PHI can reference them as incoming
+ // values from that block. Temporarily redirect the builder there.
+ IRBuilderBase::InsertPointGuard IPG(Builder);
+ Builder.SetInsertPoint(VecBB->getTerminator());
+ InVal = State.get(getOperand(0));
+ }
+ BasicBlock *BB = Builder.GetInsertBlock();
+ PHINode *Phi = PHINode::Create(InVal->getType(), 2, Name);
+ Phi->insertBefore(BB->getFirstNonPHIIt());
+ Phi->addIncoming(InVal, VecBB);
+ Phi->addIncoming(PoisonValue::get(InVal->getType()), GuardBB);
+ return Phi;
+ }
case VPInstruction::ExtractLane: {
assert(getNumOperands() != 2 && "ExtractLane from single source should be "
"simplified to ExtractElement.");
@@ -1200,6 +1226,8 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getArithmeticReductionCost(
Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
}
+ case VPInstruction::ConditionalMerge:
+ return 0;
case VPInstruction::FirstActiveLane: {
Type *Ty = Ctx.Types.inferScalarType(this);
Type *ScalarTy = Ctx.Types.inferScalarType(getOperand(0));
@@ -1381,6 +1409,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
case Instruction::Select:
case Instruction::PHI:
case VPInstruction::AnyOf:
+ case VPInstruction::ConditionalMerge:
case VPInstruction::BranchOnCond:
case VPInstruction::BranchOnTwoConds:
case VPInstruction::BranchOnCount:
@@ -1580,6 +1609,9 @@ void VPInstruction::printRecipe(raw_ostream &O, const Twine &Indent,
case VPInstruction::AnyOf:
O << "any-of";
break;
+ case VPInstruction::ConditionalMerge:
+ O << "conditional-merge";
+ break;
case VPInstruction::FirstActiveLane:
O << "first-active-lane";
break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 32d89a34105a4..da0732740b1b4 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1637,6 +1637,23 @@ static void simplifyRecipe(VPSingleDefRecipe *Def, VPTypeAnalysis &TypeInfo) {
!cast<VPInstruction>(Def)->isMasked())
return Def->replaceAllUsesWith(Def->getOperand(0));
+ // When removeBranchOnConst folds a constant-true guard condition, it removes
+ // the GuardBlock -> JoinBlock edge and collapses the diamond into a
+ // straight-line chain. JoinBlock then has a single predecessor (VecBlock),
+ // so the ConditionalMerge phi is redundant — replace it with its operand.
+ // The operand necessarily dominates JoinBlock because it was valid at its
+ // use site inside VecBlock before the collapse, and collapsing only
+ // strengthens dominance in the resulting straight-line chain.
+ if (auto *VPI = dyn_cast<VPInstruction>(Def)) {
+ if (VPI->getOpcode() == VPInstruction::ConditionalMerge) {
+ if (VPI->getParent()->getSinglePredecessor()) {
+ Def->replaceAllUsesWith(Def->getOperand(0));
+ Def->eraseFromParent();
+ return;
+ }
+ }
+ }
+
// Look through ExtractLastLane.
if (match(Def, m_ExtractLastLane(m_VPValue(A)))) {
if (match(A, m_BuildVector())) {
@@ -2524,10 +2541,28 @@ static void licm(VPlan &Plan) {
if (!SinkBB)
SinkBB = cast<VPBasicBlock>(LoopRegion->getSingleSuccessor());
- // TODO: This will need to be a check instead of a assert after
- // conditional branches in vectorized loops are supported.
- assert(VPDT.properlyDominates(VPBB, SinkBB) &&
- "Defining block must dominate sink block");
+ // With guard diamonds, a recipe defined inside a conditionally-executed
+ // VecBlock may not dominate exit blocks of the loop region because of
+ // the skip path (GuardBlock -> JoinBlock). Skip sinking in this case;
+ // the recipe stays inside the loop where it is correctly guarded.
+ // (Replaces the previous assert to support conditional block guards.)
+ if (!VPDT.properlyDominates(VPBB, SinkBB))
+ continue;
+
+ // Verify all operands defined inside the loop region will still dominate
+ // the recipe at the sink target. A recipe in the JoinBlock of a guard
+ // diamond may reference values from the conditionally-executed VecBlock;
+ // those values don't dominate the exit and sinking would break dominance.
+ if (any_of(R.operands(), [&VPDT, SinkBB](VPValue *Op) {
+ auto *OpR = Op->getDefiningRecipe();
+ if (!OpR)
+ return false;
+ VPBasicBlock *OpBB = OpR->getParent();
+ return OpBB != SinkBB &&
+ !VPDT.properlyDominates(OpBB, SinkBB);
+ }))
+ continue;
+
// TODO: Clone the recipe if users are on multiple exit paths, instead of
// just moving.
Def->moveBefore(*SinkBB, SinkBB->getFirstNonPhi());
@@ -3798,6 +3833,217 @@ static void expandVPDerivedIV(VPDerivedIVRecipe *R, VPTypeAnalysis &TypeInfo) {
llvm_unreachable("Unhandled induction kind");
}
+static cl::opt<bool> EnableConditionalBlockGuards(
+ "enable-loop-vectorization-with-conditions", cl::init(false), cl::Hidden,
+ cl::desc("Vectorize loop with branches"));
+
+static cl::opt<unsigned> ConditionalGuardThreshold(
+ "conditional-guard-threshold", cl::init(5), cl::Hidden,
+ cl::desc("The minimum instructions in a block required for guarding"));
+
+/// Return the original IR basic block corresponding to \p VPBB by scanning its
+/// recipes for one that has an underlying IR instruction, and returning that
+/// instruction's parent block. VPBasicBlock has no direct link to the original
+/// IR block, so this heuristic is the only way to recover that mapping.
+/// Returns \c nullptr if no recipe in \p VPBB has an underlying IR value.
+static BasicBlock *getOrigBBForVPBB(VPBasicBlock *VPBB) {
+ for (VPRecipeBase &R : *VPBB) {
+ auto *SD = dyn_cast<VPSingleDefRecipe>(&R);
+ if (!SD)
+ continue;
+ if (auto *UV = SD->getUnderlyingValue())
+ if (auto *I = dyn_cast<Instruction>(UV))
+ return I->getParent();
+ }
+ return nullptr;
+}
+
+/// Collect conditionally-executed VPlan blocks that are profitable to guard.
+/// Walks the loop region in reverse post-order and selects blocks that satisfy
+/// two criteria:
+/// 1. The block does not dominate \p IRLatch in the original IR (meaning it
+/// is conditionally executed in the scalar loop).
+/// 2. The block contains more recipes than \c ConditionalGuardThreshold
+/// (making the guard overhead worthwhile).
+static SmallVector<VPBasicBlock *>
+findGuardCandidates(VPBasicBlock *HeaderVPBB, BasicBlock *IRLatch,
+ DominatorTree *DT) {
+ SmallVector<VPBasicBlock *> Candidates;
+ ReversePostOrderTraversal<VPBlockShallowTraversalWrapper<VPBlockBase *>> RPOT(
+ HeaderVPBB);
+ for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+ BasicBlock *IRBB = getOrigBBForVPBB(VPBB);
+ if (!IRBB)
+ continue;
+ if (DT->dominates(IRBB, IRLatch))
+ continue;
+ if (VPBB->size() <= ConditionalGuardThreshold)
+ continue;
+ Candidates.push_back(VPBB);
+ }
+ return Candidates;
+}
+
+/// Extract the mask VPValue from a recipe, handling all recipe types that
+/// carry masks (VPInstruction, VPWidenMemoryRecipe, VPReplicateRecipe,
+/// VPInterleaveBase). Returns nullptr if the recipe has no mask.
+static VPValue *getMaskFromRecipe(VPRecipeBase &R) {
+ if (auto *VPI = dyn_cast<VPInstruction>(&R))
+ return VPI->getMask();
+ if (auto *WMR = dyn_cast<VPWidenMemoryRecipe>(&R))
+ return WMR->getMask();
+ if (auto *Rep = dyn_cast<VPReplicateRecipe>(&R))
+ return Rep->isPredicated() ? Rep->getMask() : nullptr;
+ if (auto *ILR = dyn_cast<VPInterleaveBase>(&R))
+ return ILR->getMask();
+ return nullptr;
+}
+
+/// Find the active-lane mask for \p VecBlock by scanning its recipes, and
+/// split the block if the mask definition lives inside it (so the mask and
+/// any prefix recipes stay in the predecessor). Returns the (possibly new)
+/// guarded block, or \c nullptr if the candidate should be skipped. \p Mask
+/// is set to the discovered mask value on success.
+static VPBasicBlock *findBlockMaskAndSplit(VPBasicBlock *VecBlock,
+ VPValue *&Mask) {
+ Mask = nullptr;
+ for (VPRecipeBase &R : *VecBlock) {
+ if (VPValue *M = getMaskFromRecipe(R)) {
+ Mask = M;
+ break;
+ }
+ }
+ if (!Mask)
+ return nullptr;
+
+ if (auto *MaskDef = Mask->getDefiningRecipe()) {
+ if (MaskDef->getParent() == VecBlock) {
+ auto SplitPt = std::next(MaskDef->getIterator());
+ if (SplitPt == VecBlock->end())
+ return nullptr;
+ VecBlock = VecBlock->splitAt(SplitPt);
+ }
+ }
+
+ if (!VecBlock->getSinglePredecessor() || !VecBlock->getSingleSuccessor())
+ return nullptr;
+ return VecBlock;
+}
+
+/// Build the guard diamond around \p VecBlock. Creates a guard block (with
+/// AnyOf + BranchOnCond) and a join block, then rewires the CFG into:
+/// pred -> guard -> {VecBlock, join}, VecBlock -> join -> succ.
+/// Returns the {GuardBlock, JoinBlock} pair.
+static std::pair<VPBasicBlock *, VPBasicBlock *>
+buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
+ VPBasicBlock *PredBlock =
+ cast<VPBasicBlock>(VecBlock->getSinglePredecessor());
+ VPBasicBlock *SuccBlock =
+ cast<VPBasicBlock>(VecBlock->getSingleSuccessor());
+
+ VPBasicBlock *GuardBlock =
+ Plan.createVPBasicBlock(VecBlock->getName() + ".cond.guard");
+ VPBasicBlock *JoinBlock =
+ Plan.createVPBasicBlock(VecBlock->getName() + ".cond.join");
+ GuardBlock->setParent(VecBlock->getParent());
+ JoinBlock->setParent(VecBlock->getParent());
+
+ VPBuilder GuardBuilder(GuardBlock);
+ VPValue *AnyActive =
+ GuardBuilder.createNaryOp(VPInstruction::AnyOf, {Mask});
+ GuardBuilder.createNaryOp(VPInstruction::BranchOnCond, {AnyActive});
+
+ VPBlockUtils::disconnectBlocks(PredBlock, VecBlock);
+ VPBlockUtils::disconnectBlocks(VecBlock, SuccBlock);
+ VPBlockUtils::connectBlocks(PredBlock, GuardBlock);
+ VPBlockUtils::insertTwoBlocksAfter(VecBlock, JoinBlock, GuardBlock);
+ VPBlockUtils::connectBlocks(VecBlock, JoinBlock);
+ VPBlockUtils::connectBlocks(JoinBlock, SuccBlock);
+
+ return {GuardBlock, JoinBlock};
+}
+
+/// For recipes in the guarded \p VecBlock whose values are used outside the
+/// guard diamond, create ConditionalMerge PHI nodes in \p JoinBlock. These
+/// merge the guarded value with poison on the skip path. All recipes stay
+/// inside \p VecBlock so that expensive operations (FP arithmetic, etc.)
+/// remain guarded and are skipped when no lanes are active.
+static void mergeGuardedValues(VPBasicBlock *VecBlock,
+ VPBasicBlock *GuardBlock,
+ VPBasicBlock *JoinBlock) {
+ VPBuilder JoinBuilder(JoinBlock, JoinBlock->begin());
+ for (VPRecipeBase &R : *VecBlock) {
+ for (VPValue *Def : R.definedValues()) {
+ bool HasExternalUse = false;
+ for (VPUser *U : Def->users()) {
+ auto *UserR = dyn_cast<VPRecipeBase>(U);
+ if (!UserR || UserR->getParent() != VecBlock) {
+ HasExternalUse = true;
+ break;
+ }
+ }
+ if (!HasExternalUse)
+ continue;
+
+ auto *Merge = JoinBuilder.createNaryOp(VPInstruction::ConditionalMerge,
+ {Def}, {}, {}, "cond.merge");
+ Def->replaceUsesWithIf(
+ Merge, [VecBlock, JoinBlock, GuardBlock](VPUser &U, unsigned) {
+ auto *UserR = dyn_cast<VPRecipeBase>(&U);
+ if (!UserR)
+ return false;
+ return UserR->getParent() != VecBlock &&
+ UserR->getParent() != JoinBlock &&
+ UserR->getParent() != GuardBlock;
+ });
+ }
+ }
+}
+
+void VPlanTransforms::introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoop,
+ DominatorTree *DT,
+ const TargetTransformInfo *TTI) {
+ if (!EnableConditionalBlockGuards)
+ return;
+ if (TTI->enableScalableVectorization())
+ return;
+
+ VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
+ if (!LoopRegion)
+ return;
+ VPBasicBlock *HeaderVPBB = LoopRegion->getEntryBasicBlock();
+ VPBasicBlock *VPLatch = dyn_cast<VPBasicBlock>(LoopRegion->getExiting());
+ if (!VPLatch || !HeaderVPBB)
+ return;
+ BasicBlock *IRLatch = OrigLoop->getLoopLatch();
+ if (!IRLatch)
+ return;
+
+ auto Candidates = findGuardCandidates(HeaderVPBB, IRLatch, DT);
+ if (Candidates.empty())
+ return;
+
+ for (VPBasicBlock *VecBlock : Candidates) {
+ // Skip blocks with in-loop reduction recipes: their merge-point blend
+ // was folded by createInLoopReductionRecipes, so ConditionalMerge with
+ // poison would corrupt the reduction chain.
+ if (any_of(*VecBlock, [](VPRecipeBase &R) {
+ return isa<VPReductionRecipe, VPReductionEVLRecipe>(R);
+ }))
+ continue;
+
+ VPValue *Mask = nullptr;
+ VecBlock = findBlockMaskAndSplit(VecBlock, Mask);
+ if (!VecBlock)
+ continue;
+
+ auto [GuardBlock, JoinBlock] = buildGuardDiamondCFG(Plan, VecBlock, Mask);
+
+ mergeGuardedValues(VecBlock, GuardBlock, JoinBlock);
+
+ }
+}
+
void VPlanTransforms::dissolveLoopRegions(VPlan &Plan) {
// Replace loop regions with explicity CFG.
SmallVector<VPRegionBlock *> LoopRegions;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 75fc549167e03..9d37119afdd2f 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -495,6 +495,15 @@ struct VPlanTransforms {
/// \p Plan.
static void introduceMasksAndLinearize(VPlan &Plan);
+ /// Introduce active-lane guards for conditionally-executed blocks that are
+ /// profitable to guard. Creates a diamond CFG
+ /// (guard -> {vec_block, join}, vec_block -> join) for each candidate block.
+ /// The guard checks whether any SIMD lane is active using AnyOf +
+ /// BranchOnCond; the join block contains ConditionalMerge phis.
+ static void introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoop,
+ DominatorTree *DT,
+ const TargetTransformInfo *TTI);
+
/// Replace a VPWidenCanonicalIVRecipe if it is present in \p Plan, with a
/// VPWidenIntOrFpInductionRecipe, provided it would not cause additional
/// spills for \p VF at unroll factor \p UF.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index 4b99829a21817..a1359bcf55911 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -273,6 +273,9 @@ bool VPlanVerifier::verifyVPBasicBlock(const VPBasicBlock *VPBB) {
if (RecipeNumbering[UI] >= RecipeNumbering[&R])
continue;
} else {
+ if (auto *VPI = dyn_cast<VPInstruction>(UI))
+ if (VPI->getOpcode() == VPInstruction::ConditionalMerge)
+ continue;
if (VPDT.dominates(VPBB, UI->getParent()))
continue;
}
diff --git a/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll
index 061588317eba7..d00b86687c17f 100644
--- a/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll
+++ b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll
@@ -30,6 +30,7 @@
; CHECK: VPlan for loop in 'foo' after VPlanTransforms::hoistPredicatedLoads
; CHECK: VPlan for loop in 'foo' after VPlanTransforms::sinkPredicatedStores
; CHECK: VPlan for loop in 'foo' after VPlanTransforms::truncateToMinimalBitwidths
+; CHECK: VPlan for loop in 'foo' after VPlanTransforms::introduceConditionalBlockGuards
; CHECK: VPlan for loop in 'foo' after removeRedundantInductionCasts
; CHECK: VPlan for loop in 'foo' after reassociateHeaderMask
; CHECK: VPlan for loop in 'foo' after simplifyRecipes
diff --git a/llvm/test/Transforms/LoopVectorize/boscc-basic-cond-store.ll b/llvm/test/Transforms/LoopVectorize/boscc-basic-cond-store.ll
new file mode 100644
index 0000000000000..9f013d74af2bd
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize...
[truncated]
You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions h,cpp -- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp llvm/lib/Transforms/Vectorize/VPlan.h llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp llvm/lib/Transforms/Vectorize/VPlanTransforms.h llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp --diff_from_common_commit
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index a68595de8..27780b8cf 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -874,8 +874,7 @@ Value *VPInstruction::generate(VPTransformState &State) {
return State.get(getOperand(0));
BasicBlock *GuardBB =
State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(0));
- BasicBlock *VecBB =
- State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(1));
+ BasicBlock *VecBB = State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(1));
Value *InVal;
{
// State.get() may insert broadcast instructions (insertelement +
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 3d93edef8..9b89f2c61 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2571,8 +2571,7 @@ static void licm(VPlan &Plan) {
if (!OpR)
return false;
VPBasicBlock *OpBB = OpR->getParent();
- return OpBB != SinkBB &&
- !VPDT.properlyDominates(OpBB, SinkBB);
+ return OpBB != SinkBB && !VPDT.properlyDominates(OpBB, SinkBB);
}))
continue;
@@ -3845,9 +3844,10 @@ static void expandVPDerivedIV(VPDerivedIVRecipe *R, VPTypeAnalysis &TypeInfo) {
llvm_unreachable("Unhandled induction kind");
}
-static cl::opt<bool> EnableConditionalBlockGuards(
- "enable-loop-vectorization-with-conditions", cl::init(false), cl::Hidden,
- cl::desc("Vectorize loop with branches"));
+static cl::opt<bool>
+ EnableConditionalBlockGuards("enable-loop-vectorization-with-conditions",
+ cl::init(false), cl::Hidden,
+ cl::desc("Vectorize loop with branches"));
static cl::opt<unsigned> ConditionalGuardThreshold(
"conditional-guard-threshold", cl::init(5), cl::Hidden,
@@ -3877,9 +3877,9 @@ static BasicBlock *getOrigBBForVPBB(VPBasicBlock *VPBB) {
/// is conditionally executed in the scalar loop).
/// 2. The block contains more recipes than \c ConditionalGuardThreshold
/// (making the guard overhead worthwhile).
-static SmallVector<VPBasicBlock *>
-findGuardCandidates(VPBasicBlock *HeaderVPBB, BasicBlock *IRLatch,
- DominatorTree *DT) {
+static SmallVector<VPBasicBlock *> findGuardCandidates(VPBasicBlock *HeaderVPBB,
+ BasicBlock *IRLatch,
+ DominatorTree *DT) {
SmallVector<VPBasicBlock *> Candidates;
ReversePostOrderTraversal<VPBlockShallowTraversalWrapper<VPBlockBase *>> RPOT(
HeaderVPBB);
@@ -3917,7 +3917,7 @@ static VPValue *getMaskFromRecipe(VPRecipeBase &R) {
/// guarded block, or \c nullptr if the candidate should be skipped. \p Mask
/// is set to the discovered mask value on success.
static VPBasicBlock *findBlockMaskAndSplit(VPBasicBlock *VecBlock,
- VPValue *&Mask) {
+ VPValue *&Mask) {
Mask = nullptr;
for (VPRecipeBase &R : *VecBlock) {
if (VPValue *M = getMaskFromRecipe(R)) {
@@ -3950,8 +3950,7 @@ static std::pair<VPBasicBlock *, VPBasicBlock *>
buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
VPBasicBlock *PredBlock =
cast<VPBasicBlock>(VecBlock->getSinglePredecessor());
- VPBasicBlock *SuccBlock =
- cast<VPBasicBlock>(VecBlock->getSingleSuccessor());
+ VPBasicBlock *SuccBlock = cast<VPBasicBlock>(VecBlock->getSingleSuccessor());
VPBasicBlock *GuardBlock =
Plan.createVPBasicBlock(VecBlock->getName() + ".cond.guard");
@@ -3961,8 +3960,7 @@ buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
JoinBlock->setParent(VecBlock->getParent());
VPBuilder GuardBuilder(GuardBlock);
- VPValue *AnyActive =
- GuardBuilder.createNaryOp(VPInstruction::AnyOf, {Mask});
+ VPValue *AnyActive = GuardBuilder.createNaryOp(VPInstruction::AnyOf, {Mask});
GuardBuilder.createNaryOp(VPInstruction::BranchOnCond, {AnyActive});
VPBlockUtils::disconnectBlocks(PredBlock, VecBlock);
@@ -3980,8 +3978,7 @@ buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
/// merge the guarded value with poison on the skip path. All recipes stay
/// inside \p VecBlock so that expensive operations (FP arithmetic, etc.)
/// remain guarded and are skipped when no lanes are active.
-static void mergeGuardedValues(VPBasicBlock *VecBlock,
- VPBasicBlock *GuardBlock,
+static void mergeGuardedValues(VPBasicBlock *VecBlock, VPBasicBlock *GuardBlock,
VPBasicBlock *JoinBlock) {
VPBuilder JoinBuilder(JoinBlock, JoinBlock->begin());
for (VPRecipeBase &R : *VecBlock) {
@@ -4012,9 +4009,9 @@ static void mergeGuardedValues(VPBasicBlock *VecBlock,
}
}
-void VPlanTransforms::introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoop,
- DominatorTree *DT,
- const TargetTransformInfo *TTI) {
+void VPlanTransforms::introduceConditionalBlockGuards(
+ VPlan &Plan, Loop *OrigLoop, DominatorTree *DT,
+ const TargetTransformInfo *TTI) {
if (!EnableConditionalBlockGuards)
return;
if (TTI->enableScalableVectorization())
@@ -4052,7 +4049,6 @@ void VPlanTransforms::introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoo
auto [GuardBlock, JoinBlock] = buildGuardDiamondCFG(Plan, VecBlock, Mask);
mergeGuardedValues(VecBlock, GuardBlock, JoinBlock);
-
}
}
There is a similar PR, #141900. Why do we need this one?
I’ve also created an llvm-test-suite patch to evaluate the performance of #141900. Would you like me to add any additional test cases, or would you be available to review it as is?
Thanks, I'll take a look.
I’ve been exploring a related approach in PR #197423 and would appreciate it if reviewers could give it a fair evaluation before it’s set aside. There are a few scenarios where the block-level approach handles things more naturally, and I’ve highlighted those differences for consideration.
const TargetTransformInfo *TTI) {
if (!EnableConditionalBlockGuards)
return;
if (TTI->enableScalableVectorization())
Please note that this check is a temporary restriction. I’d like to remove it eventually, but since I have limited experience with scalable-vector architectures, I would appreciate some guidance on how to do so correctly.
RFC Link - https://discourse.llvm.org/t/rfc-lv-active-lane-guard-diamonds-skipping-conditionally-executed-vplan-blocks-when-no-lane-is-active/90779