[LV][VPlan] Introduce conditional block guards to skip inactive vector blocks (BOSCC) #197423

Open
nema-ashutosh wants to merge 1 commit into llvm:main from nema-ashutosh:BOSCC

nema-ashutosh commented May 13, 2026

Contributor

Add a new VPlan transformation, introduceConditionalBlockGuards, that wraps conditionally-executed vector blocks in guard diamonds. When the vector mask has no active lanes, the entire block is bypassed via a scalar branch, avoiding unnecessary execution of an expensive vector block.

RFC Link - https://discourse.llvm.org/t/rfc-lv-active-lane-guard-diamonds-skipping-conditionally-executed-vplan-blocks-when-no-lane-is-active/90779

This implements the Branch-On-Superword-Condition-Code (BOSCC) technique at the VPlan level, replacing the earlier approach proposed in D139074 with a simpler design that reuses existing VPlan building blocks (AnyOf, BranchOnCond, VPBlockUtils).

The transformation works in four steps:

  1. Identify candidate blocks that are conditionally executed in the original scalar loop and exceed a recipe-count threshold.
  2. Discover the active-lane mask from masked recipes, splitting the block if the mask definition is local.
  3. Build a guard diamond: GuardBlock (any-of + branch-on-cond) and JoinBlock around the candidate.
  4. Insert ConditionalMerge PHIs in the JoinBlock for values that escape the guarded block.
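
As a rough illustration of steps 2 and 3, the sketch below emulates a guarded vector block in plain C++ rather than in VPlan. It assumes a fixed vector width of 4 and a hypothetical loop body (`out[i] = sqrt(in[i])` when `in[i] > 0`). The names `guardedSqrtStore`, `GuardStats`, and `kVF` are illustrative only and do not appear in the patch.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed fixed vector width for this illustration.
constexpr std::size_t kVF = 4;

// Hypothetical counters showing how often the guarded block ran.
struct GuardStats {
  std::size_t executed = 0;
  std::size_t skipped = 0;
};

// Emulates: for (i) if (in[i] > 0) out[i] = sqrt(in[i]);
// processed kVF elements at a time with an any-of guard.
// Requires out.size() >= in.size().
inline GuardStats guardedSqrtStore(const std::vector<float> &in,
                                   std::vector<float> &out) {
  GuardStats stats;
  std::size_t base = 0;
  for (; base + kVF <= in.size(); base += kVF) {
    // Compute the block mask and reduce it (the any-of in the guard block).
    std::array<bool, kVF> mask{};
    bool anyActive = false;
    for (std::size_t lane = 0; lane < kVF; ++lane) {
      mask[lane] = in[base + lane] > 0.0f;
      anyActive = anyActive || mask[lane];
    }
    // Guard: branch around the whole block when no lane is active.
    if (!anyActive) {
      ++stats.skipped;
      continue;
    }
    // Guarded "vector block": masked, comparatively expensive work.
    ++stats.executed;
    for (std::size_t lane = 0; lane < kVF; ++lane)
      if (mask[lane])
        out[base + lane] = std::sqrt(in[base + lane]);
  }
  // Scalar remainder for sizes that are not a multiple of kVF.
  for (; base < in.size(); ++base)
    if (in[base] > 0.0f)
      out[base] = std::sqrt(in[base]);
  return stats;
}
```

With an input whose first four elements are all non-positive, the first chunk is counted as skipped and its outputs are left untouched. That is the case the guard diamond is meant to make cheap.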

A new VPInstruction opcode, ConditionalMerge, is introduced to merge guarded values with poison on the skip path. It lowers to an IR PHI node.
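
One way to picture the merge is the small C++ model below, which represents a poison lane as an empty `std::optional`. It is a sketch under assumptions, not the lowering itself. In particular, `selectByMask` assumes that consumers of the merged value read a lane only where the block mask is set, which the PR description does not state explicitly. All names here are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

constexpr std::size_t kLanes = 4; // assumed vector width
using IntVec = std::array<int, kLanes>;
using MaskVec = std::array<bool, kLanes>;
// A lane holding std::nullopt stands in for poison in this model.
using MaybePoisonVec = std::array<std::optional<int>, kLanes>;

// Models the merge in the join block: the guarded value when control came
// from the vector block, poison in every lane when the guard skipped it.
inline MaybePoisonVec conditionalMerge(bool cameFromVecBlock,
                                       const IntVec &guarded) {
  MaybePoisonVec merged{}; // value-initialized: every lane is empty
  if (cameFromVecBlock)
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      merged[lane] = guarded[lane];
  return merged;
}

// Hypothetical masked consumer: reads a merged lane only where the mask is
// set. value() throws std::bad_optional_access if a poison lane is read.
inline IntVec selectByMask(const MaskVec &mask, const MaybePoisonVec &merged,
                           const IntVec &fallback) {
  IntVec result{};
  for (std::size_t lane = 0; lane < kLanes; ++lane)
    result[lane] = mask[lane] ? merged[lane].value() : fallback[lane];
  return result;
}
```

In this model the skip path is taken only when the mask is all-false, so a masked consumer never touches a poison lane.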

The transformation is off by default (-enable-loop-vectorization-with-conditions=false) and disabled for scalable vector targets and blocks containing in-loop reductions.
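
The gating conditions can be summarized as a single predicate. The sketch below is a simplified model: `GuardOptions`, `BlockInfo`, and `shouldGuard` are hypothetical names, and the default threshold of 5 is taken from the `conditional-guard-threshold` option in the patch. The real pass additionally requires a discoverable mask and a single predecessor and successor, which this model leaves out.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the two hidden options added by the patch (defaults as in the diff).
struct GuardOptions {
  bool enabled = false;      // -enable-loop-vectorization-with-conditions
  std::size_t threshold = 5; // -conditional-guard-threshold
};

// Hypothetical summary of one candidate block.
struct BlockInfo {
  std::size_t recipeCount;
  bool conditionallyExecuted; // does not dominate the latch in the scalar loop
  bool hasInLoopReduction;
};

// Returns true when a block passes the coarse filters described above.
inline bool shouldGuard(const GuardOptions &opts,
                        bool targetUsesScalableVectors,
                        const BlockInfo &block) {
  if (!opts.enabled || targetUsesScalableVectors)
    return false;
  if (!block.conditionallyExecuted || block.hasInLoopReduction)
    return false;
  // The patch skips blocks whose size is <= threshold.
  return block.recipeCount > opts.threshold;
}
```

Note that the comparison is strict: with the default threshold, a block needs at least six recipes to qualify.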

@llvmorg-github-actions

llvmorg-github-actions Bot commented May 13, 2026


@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Ashutosh Nema (nema-ashutosh)

Changes

Patch is 26.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/197423.diff

11 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+6)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+32)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+250-4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+9)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+3)
  • (modified) llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll (+1)
  • (added) llvm/test/Transforms/LoopVectorize/boscc-basic-cond-store.ll (+59)
  • (added) llvm/test/Transforms/LoopVectorize/boscc-disabled-flag.ll (+47)
  • (added) llvm/test/Transforms/LoopVectorize/boscc-expensive-fp-guard.ll (+60)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 61c2d3cd228ec..ce9927e0dd65a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6823,6 +6823,8 @@ void LoopVectorizationPlanner::buildVPlans(ElementCount MinVF,
     RUN_VPLAN_PASS(VPlanTransforms::sinkPredicatedStores, *Plan, PSE, OrigLoop);
     RUN_VPLAN_PASS(VPlanTransforms::truncateToMinimalBitwidths, *Plan,
                    Config.getMinimalBitwidths());
+    RUN_VPLAN_PASS(VPlanTransforms::introduceConditionalBlockGuards, *Plan,
+                   OrigLoop, DT, &TTI);
     RUN_VPLAN_PASS(VPlanTransforms::optimize, *Plan);
     // TODO: try to put addExplicitVectorLength close to addActiveLaneMask
     if (CM.foldTailWithEVL()) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 63436c79e9a98..1e538d9515b7d 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1290,6 +1290,12 @@ class LLVM_ABI_FOR_TEST VPInstruction : public VPRecipeWithIRFlags,
     // operand as additional operands. AnyOf is poison-safe as all operands
     // will be frozen.
     AnyOf,
+    // Creates a PHI node at the merge point of an active-lane guard diamond.
+    // Merges the value produced by the guarded vector block (operand 0) with
+    // poison from the guard-skip path. The parent block must have exactly two
+    // CFG predecessors: predecessor 0 is the guard (skip) block and
+    // predecessor 1 is the guarded vector block.
+    ConditionalMerge,
     // Calculates the first active lane index of the vector predicate operands.
     // It produces the lane index across all unrolled iterations. Unrolling will
     // add all copies of its original operand as additional operands.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
index a42b631cd3304..643194b2a7925 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp
@@ -118,6 +118,7 @@ Type *VPTypeAnalysis::inferScalarTypeForRecipe(const VPInstruction *R) {
   case VPInstruction::CalculateTripCountMinusVF:
   case VPInstruction::CanonicalIVIncrementForPart:
   case VPInstruction::AnyOf:
+  case VPInstruction::ConditionalMerge:
   case VPInstruction::BuildStructVector:
   case VPInstruction::BuildVector:
   case VPInstruction::Unpack:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 11a91dcd46867..bec4f4a5878f4 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -535,6 +535,7 @@ unsigned VPInstruction::getNumOperandsForOpcode() const {
   case Instruction::PHI:
   case Instruction::Switch:
   case VPInstruction::AnyOf:
+  case VPInstruction::ConditionalMerge:
   case VPInstruction::BuildStructVector:
   case VPInstruction::BuildVector:
   case VPInstruction::CanonicalIVIncrementForPart:
@@ -859,6 +860,31 @@ Value *VPInstruction::generate(VPTransformState &State) {
       Res = Builder.CreateOr(Res, Builder.CreateFreeze(State.get(Op)));
     return State.VF.isScalar() ? Res : Builder.CreateOrReduce(Res);
   }
+  case VPInstruction::ConditionalMerge: {
+    const VPBasicBlock *JoinVPBB = getParent();
+    if (JoinVPBB->getNumPredecessors() < 2)
+      return State.get(getOperand(0));
+    BasicBlock *GuardBB =
+        State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(0));
+    BasicBlock *VecBB =
+        State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(1));
+    Value *InVal;
+    {
+      // State.get() may insert broadcast instructions (insertelement +
+      // shufflevector) when the operand is scalar and needs widening. These
+      // must land in VecBB so that the PHI can reference them as incoming
+      // values from that block. Temporarily redirect the builder there.
+      IRBuilderBase::InsertPointGuard IPG(Builder);
+      Builder.SetInsertPoint(VecBB->getTerminator());
+      InVal = State.get(getOperand(0));
+    }
+    BasicBlock *BB = Builder.GetInsertBlock();
+    PHINode *Phi = PHINode::Create(InVal->getType(), 2, Name);
+    Phi->insertBefore(BB->getFirstNonPHIIt());
+    Phi->addIncoming(InVal, VecBB);
+    Phi->addIncoming(PoisonValue::get(InVal->getType()), GuardBB);
+    return Phi;
+  }
   case VPInstruction::ExtractLane: {
     assert(getNumOperands() != 2 && "ExtractLane from single source should be "
                                     "simplified to ExtractElement.");
@@ -1200,6 +1226,8 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
     return Ctx.TTI.getArithmeticReductionCost(
         Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
   }
+  case VPInstruction::ConditionalMerge:
+    return 0;
   case VPInstruction::FirstActiveLane: {
     Type *Ty = Ctx.Types.inferScalarType(this);
     Type *ScalarTy = Ctx.Types.inferScalarType(getOperand(0));
@@ -1381,6 +1409,7 @@ bool VPInstruction::opcodeMayReadOrWriteFromMemory() const {
   case Instruction::Select:
   case Instruction::PHI:
   case VPInstruction::AnyOf:
+  case VPInstruction::ConditionalMerge:
   case VPInstruction::BranchOnCond:
   case VPInstruction::BranchOnTwoConds:
   case VPInstruction::BranchOnCount:
@@ -1580,6 +1609,9 @@ void VPInstruction::printRecipe(raw_ostream &O, const Twine &Indent,
   case VPInstruction::AnyOf:
     O << "any-of";
     break;
+  case VPInstruction::ConditionalMerge:
+    O << "conditional-merge";
+    break;
   case VPInstruction::FirstActiveLane:
     O << "first-active-lane";
     break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 32d89a34105a4..da0732740b1b4 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1637,6 +1637,23 @@ static void simplifyRecipe(VPSingleDefRecipe *Def, VPTypeAnalysis &TypeInfo) {
       !cast<VPInstruction>(Def)->isMasked())
     return Def->replaceAllUsesWith(Def->getOperand(0));
 
+  // When removeBranchOnConst folds a constant-true guard condition, it removes
+  // the GuardBlock -> JoinBlock edge and collapses the diamond into a
+  // straight-line chain. JoinBlock then has a single predecessor (VecBlock),
+  // so the ConditionalMerge phi is redundant — replace it with its operand.
+  // The operand necessarily dominates JoinBlock because it was valid at its
+  // use site inside VecBlock before the collapse, and collapsing only
+  // strengthens dominance in the resulting straight-line chain.
+  if (auto *VPI = dyn_cast<VPInstruction>(Def)) {
+    if (VPI->getOpcode() == VPInstruction::ConditionalMerge) {
+      if (VPI->getParent()->getSinglePredecessor()) {
+        Def->replaceAllUsesWith(Def->getOperand(0));
+        Def->eraseFromParent();
+        return;
+      }
+    }
+  }
+
   // Look through ExtractLastLane.
   if (match(Def, m_ExtractLastLane(m_VPValue(A)))) {
     if (match(A, m_BuildVector())) {
@@ -2524,10 +2541,28 @@ static void licm(VPlan &Plan) {
       if (!SinkBB)
         SinkBB = cast<VPBasicBlock>(LoopRegion->getSingleSuccessor());
 
-      // TODO: This will need to be a check instead of a assert after
-      // conditional branches in vectorized loops are supported.
-      assert(VPDT.properlyDominates(VPBB, SinkBB) &&
-             "Defining block must dominate sink block");
+      // With guard diamonds, a recipe defined inside a conditionally-executed
+      // VecBlock may not dominate exit blocks of the loop region because of
+      // the skip path (GuardBlock -> JoinBlock). Skip sinking in this case;
+      // the recipe stays inside the loop where it is correctly guarded.
+      // (Replaces the previous assert to support conditional block guards.)
+      if (!VPDT.properlyDominates(VPBB, SinkBB))
+        continue;
+
+      // Verify all operands defined inside the loop region will still dominate
+      // the recipe at the sink target. A recipe in the JoinBlock of a guard
+      // diamond may reference values from the conditionally-executed VecBlock;
+      // those values don't dominate the exit and sinking would break dominance.
+      if (any_of(R.operands(), [&VPDT, SinkBB](VPValue *Op) {
+            auto *OpR = Op->getDefiningRecipe();
+            if (!OpR)
+              return false;
+            VPBasicBlock *OpBB = OpR->getParent();
+            return OpBB != SinkBB &&
+                   !VPDT.properlyDominates(OpBB, SinkBB);
+          }))
+        continue;
+
       // TODO: Clone the recipe if users are on multiple exit paths, instead of
       // just moving.
       Def->moveBefore(*SinkBB, SinkBB->getFirstNonPhi());
@@ -3798,6 +3833,217 @@ static void expandVPDerivedIV(VPDerivedIVRecipe *R, VPTypeAnalysis &TypeInfo) {
   llvm_unreachable("Unhandled induction kind");
 }
 
+static cl::opt<bool> EnableConditionalBlockGuards(
+    "enable-loop-vectorization-with-conditions", cl::init(false), cl::Hidden,
+    cl::desc("Vectorize loop with branches"));
+
+static cl::opt<unsigned> ConditionalGuardThreshold(
+    "conditional-guard-threshold", cl::init(5), cl::Hidden,
+    cl::desc("The minimum instructions in a block required for guarding"));
+
+/// Return the original IR basic block corresponding to \p VPBB by scanning its
+/// recipes for one that has an underlying IR instruction, and returning that
+/// instruction's parent block. VPBasicBlock has no direct link to the original
+/// IR block, so this heuristic is the only way to recover that mapping.
+/// Returns \c nullptr if no recipe in \p VPBB has an underlying IR value.
+static BasicBlock *getOrigBBForVPBB(VPBasicBlock *VPBB) {
+  for (VPRecipeBase &R : *VPBB) {
+    auto *SD = dyn_cast<VPSingleDefRecipe>(&R);
+    if (!SD)
+      continue;
+    if (auto *UV = SD->getUnderlyingValue())
+      if (auto *I = dyn_cast<Instruction>(UV))
+        return I->getParent();
+  }
+  return nullptr;
+}
+
+/// Collect conditionally-executed VPlan blocks that are profitable to guard.
+/// Walks the loop region in reverse post-order and selects blocks that satisfy
+/// two criteria:
+///   1. The block does not dominate \p IRLatch in the original IR (meaning it
+///      is conditionally executed in the scalar loop).
+///   2. The block contains more recipes than \c ConditionalGuardThreshold
+///      (making the guard overhead worthwhile).
+static SmallVector<VPBasicBlock *>
+findGuardCandidates(VPBasicBlock *HeaderVPBB, BasicBlock *IRLatch,
+                    DominatorTree *DT) {
+  SmallVector<VPBasicBlock *> Candidates;
+  ReversePostOrderTraversal<VPBlockShallowTraversalWrapper<VPBlockBase *>> RPOT(
+      HeaderVPBB);
+  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+    BasicBlock *IRBB = getOrigBBForVPBB(VPBB);
+    if (!IRBB)
+      continue;
+    if (DT->dominates(IRBB, IRLatch))
+      continue;
+    if (VPBB->size() <= ConditionalGuardThreshold)
+      continue;
+    Candidates.push_back(VPBB);
+  }
+  return Candidates;
+}
+
+/// Extract the mask VPValue from a recipe, handling all recipe types that
+/// carry masks (VPInstruction, VPWidenMemoryRecipe, VPReplicateRecipe,
+/// VPInterleaveBase). Returns nullptr if the recipe has no mask.
+static VPValue *getMaskFromRecipe(VPRecipeBase &R) {
+  if (auto *VPI = dyn_cast<VPInstruction>(&R))
+    return VPI->getMask();
+  if (auto *WMR = dyn_cast<VPWidenMemoryRecipe>(&R))
+    return WMR->getMask();
+  if (auto *Rep = dyn_cast<VPReplicateRecipe>(&R))
+    return Rep->isPredicated() ? Rep->getMask() : nullptr;
+  if (auto *ILR = dyn_cast<VPInterleaveBase>(&R))
+    return ILR->getMask();
+  return nullptr;
+}
+
+/// Find the active-lane mask for \p VecBlock by scanning its recipes, and
+/// split the block if the mask definition lives inside it (so the mask and
+/// any prefix recipes stay in the predecessor). Returns the (possibly new)
+/// guarded block, or \c nullptr if the candidate should be skipped. \p Mask
+/// is set to the discovered mask value on success.
+static VPBasicBlock *findBlockMaskAndSplit(VPBasicBlock *VecBlock,
+                                          VPValue *&Mask) {
+  Mask = nullptr;
+  for (VPRecipeBase &R : *VecBlock) {
+    if (VPValue *M = getMaskFromRecipe(R)) {
+      Mask = M;
+      break;
+    }
+  }
+  if (!Mask)
+    return nullptr;
+
+  if (auto *MaskDef = Mask->getDefiningRecipe()) {
+    if (MaskDef->getParent() == VecBlock) {
+      auto SplitPt = std::next(MaskDef->getIterator());
+      if (SplitPt == VecBlock->end())
+        return nullptr;
+      VecBlock = VecBlock->splitAt(SplitPt);
+    }
+  }
+
+  if (!VecBlock->getSinglePredecessor() || !VecBlock->getSingleSuccessor())
+    return nullptr;
+  return VecBlock;
+}
+
+/// Build the guard diamond around \p VecBlock. Creates a guard block (with
+/// AnyOf + BranchOnCond) and a join block, then rewires the CFG into:
+///   pred -> guard -> {VecBlock, join}, VecBlock -> join -> succ.
+/// Returns the {GuardBlock, JoinBlock} pair.
+static std::pair<VPBasicBlock *, VPBasicBlock *>
+buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
+  VPBasicBlock *PredBlock =
+      cast<VPBasicBlock>(VecBlock->getSinglePredecessor());
+  VPBasicBlock *SuccBlock =
+      cast<VPBasicBlock>(VecBlock->getSingleSuccessor());
+
+  VPBasicBlock *GuardBlock =
+      Plan.createVPBasicBlock(VecBlock->getName() + ".cond.guard");
+  VPBasicBlock *JoinBlock =
+      Plan.createVPBasicBlock(VecBlock->getName() + ".cond.join");
+  GuardBlock->setParent(VecBlock->getParent());
+  JoinBlock->setParent(VecBlock->getParent());
+
+  VPBuilder GuardBuilder(GuardBlock);
+  VPValue *AnyActive =
+      GuardBuilder.createNaryOp(VPInstruction::AnyOf, {Mask});
+  GuardBuilder.createNaryOp(VPInstruction::BranchOnCond, {AnyActive});
+
+  VPBlockUtils::disconnectBlocks(PredBlock, VecBlock);
+  VPBlockUtils::disconnectBlocks(VecBlock, SuccBlock);
+  VPBlockUtils::connectBlocks(PredBlock, GuardBlock);
+  VPBlockUtils::insertTwoBlocksAfter(VecBlock, JoinBlock, GuardBlock);
+  VPBlockUtils::connectBlocks(VecBlock, JoinBlock);
+  VPBlockUtils::connectBlocks(JoinBlock, SuccBlock);
+
+  return {GuardBlock, JoinBlock};
+}
+
+/// For recipes in the guarded \p VecBlock whose values are used outside the
+/// guard diamond, create ConditionalMerge PHI nodes in \p JoinBlock. These
+/// merge the guarded value with poison on the skip path. All recipes stay
+/// inside \p VecBlock so that expensive operations (FP arithmetic, etc.)
+/// remain guarded and are skipped when no lanes are active.
+static void mergeGuardedValues(VPBasicBlock *VecBlock,
+                               VPBasicBlock *GuardBlock,
+                               VPBasicBlock *JoinBlock) {
+  VPBuilder JoinBuilder(JoinBlock, JoinBlock->begin());
+  for (VPRecipeBase &R : *VecBlock) {
+    for (VPValue *Def : R.definedValues()) {
+      bool HasExternalUse = false;
+      for (VPUser *U : Def->users()) {
+        auto *UserR = dyn_cast<VPRecipeBase>(U);
+        if (!UserR || UserR->getParent() != VecBlock) {
+          HasExternalUse = true;
+          break;
+        }
+      }
+      if (!HasExternalUse)
+        continue;
+
+      auto *Merge = JoinBuilder.createNaryOp(VPInstruction::ConditionalMerge,
+                                             {Def}, {}, {}, "cond.merge");
+      Def->replaceUsesWithIf(
+          Merge, [VecBlock, JoinBlock, GuardBlock](VPUser &U, unsigned) {
+            auto *UserR = dyn_cast<VPRecipeBase>(&U);
+            if (!UserR)
+              return false;
+            return UserR->getParent() != VecBlock &&
+                   UserR->getParent() != JoinBlock &&
+                   UserR->getParent() != GuardBlock;
+          });
+    }
+  }
+}
+
+void VPlanTransforms::introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoop,
+                                           DominatorTree *DT,
+                                           const TargetTransformInfo *TTI) {
+  if (!EnableConditionalBlockGuards)
+    return;
+  if (TTI->enableScalableVectorization())
+    return;
+
+  VPRegionBlock *LoopRegion = Plan.getVectorLoopRegion();
+  if (!LoopRegion)
+    return;
+  VPBasicBlock *HeaderVPBB = LoopRegion->getEntryBasicBlock();
+  VPBasicBlock *VPLatch = dyn_cast<VPBasicBlock>(LoopRegion->getExiting());
+  if (!VPLatch || !HeaderVPBB)
+    return;
+  BasicBlock *IRLatch = OrigLoop->getLoopLatch();
+  if (!IRLatch)
+    return;
+
+  auto Candidates = findGuardCandidates(HeaderVPBB, IRLatch, DT);
+  if (Candidates.empty())
+    return;
+
+  for (VPBasicBlock *VecBlock : Candidates) {
+    // Skip blocks with in-loop reduction recipes: their merge-point blend
+    // was folded by createInLoopReductionRecipes, so ConditionalMerge with
+    // poison would corrupt the reduction chain.
+    if (any_of(*VecBlock, [](VPRecipeBase &R) {
+          return isa<VPReductionRecipe, VPReductionEVLRecipe>(R);
+        }))
+      continue;
+
+    VPValue *Mask = nullptr;
+    VecBlock = findBlockMaskAndSplit(VecBlock, Mask);
+    if (!VecBlock)
+      continue;
+
+    auto [GuardBlock, JoinBlock] = buildGuardDiamondCFG(Plan, VecBlock, Mask);
+
+    mergeGuardedValues(VecBlock, GuardBlock, JoinBlock);
+
+  }
+}
+
 void VPlanTransforms::dissolveLoopRegions(VPlan &Plan) {
   // Replace loop regions with explicity CFG.
   SmallVector<VPRegionBlock *> LoopRegions;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 75fc549167e03..9d37119afdd2f 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -495,6 +495,15 @@ struct VPlanTransforms {
   /// \p Plan.
   static void introduceMasksAndLinearize(VPlan &Plan);
 
+  /// Introduce active-lane guards for conditionally-executed blocks that are
+  /// profitable to guard. Creates a diamond CFG
+  /// (guard -> {vec_block, join}, vec_block -> join) for each candidate block.
+  /// The guard checks whether any SIMD lane is active using AnyOf +
+  /// BranchOnCond; the join block contains ConditionalMerge phis.
+  static void introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoop,
+                                              DominatorTree *DT,
+                                              const TargetTransformInfo *TTI);
+
   /// Replace a VPWidenCanonicalIVRecipe if it is present in \p Plan, with a
   /// VPWidenIntOrFpInductionRecipe, provided it would not cause additional
   /// spills for \p VF at unroll factor \p UF.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index 4b99829a21817..a1359bcf55911 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -273,6 +273,9 @@ bool VPlanVerifier::verifyVPBasicBlock(const VPBasicBlock *VPBB) {
           if (RecipeNumbering[UI] >= RecipeNumbering[&R])
             continue;
         } else {
+          if (auto *VPI = dyn_cast<VPInstruction>(UI))
+            if (VPI->getOpcode() == VPInstruction::ConditionalMerge)
+              continue;
           if (VPDT.dominates(VPBB, UI->getParent()))
             continue;
         }
diff --git a/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll
index 061588317eba7..d00b86687c17f 100644
--- a/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll
+++ b/llvm/test/Transforms/LoopVectorize/VPlan/vplan-print-after-all.ll
@@ -30,6 +30,7 @@
 ; CHECK: VPlan for loop in 'foo' after VPlanTransforms::hoistPredicatedLoads
 ; CHECK: VPlan for loop in 'foo' after VPlanTransforms::sinkPredicatedStores
 ; CHECK: VPlan for loop in 'foo' after VPlanTransforms::truncateToMinimalBitwidths
+; CHECK: VPlan for loop in 'foo' after VPlanTransforms::introduceConditionalBlockGuards
 ; CHECK: VPlan for loop in 'foo' after removeRedundantInductionCasts
 ; CHECK: VPlan for loop in 'foo' after reassociateHeaderMask
 ; CHECK: VPlan for loop in 'foo' after simplifyRecipes
diff --git a/llvm/test/Transforms/LoopVectorize/boscc-basic-cond-store.ll b/llvm/test/Transforms/LoopVectorize/boscc-basic-cond-store.ll
new file mode 100644
index 0000000000000..9f013d74af2bd
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize...
[truncated]
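
The mask-discovery step (`findBlockMaskAndSplit` in the patch) can be modeled on a flat list of recipes, as in the sketch below. This is a simplified stand-in and not the VPlan API: `Recipe`, `SplitResult`, and `findMaskAndSplitPoint` are hypothetical names, values are identified by plain integers, and the single-predecessor and single-successor check from the patch is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical recipe: defines value `id`, optionally predicated on `maskId`.
struct Recipe {
  int id;
  std::optional<int> maskId;
};

struct SplitResult {
  int maskId;               // mask used by the guard's any-of
  std::size_t guardedBegin; // index of the first recipe that stays guarded
};

// Picks the mask of the first masked recipe. If that mask is defined inside
// the block, everything up to and including its definition stays outside the
// guard. Returns std::nullopt when the block should be skipped.
inline std::optional<SplitResult>
findMaskAndSplitPoint(const std::vector<Recipe> &block) {
  std::optional<int> mask;
  for (const Recipe &r : block) {
    if (r.maskId) {
      mask = r.maskId;
      break;
    }
  }
  if (!mask)
    return std::nullopt; // no masked recipe, nothing to guard on

  for (std::size_t i = 0; i < block.size(); ++i) {
    if (block[i].id == *mask) {
      // Mirrors the patch's check: nothing is left to guard after the split.
      if (i + 1 == block.size())
        return std::nullopt;
      return SplitResult{*mask, i + 1};
    }
  }
  // Mask is defined outside the block: guard the whole block.
  return SplitResult{*mask, 0};
}
```

A locally defined mask yields a split point just after its definition, while an externally defined mask leaves the block intact.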

@github-actions


⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions h,cpp -- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp llvm/lib/Transforms/Vectorize/VPlan.h llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp llvm/lib/Transforms/Vectorize/VPlanTransforms.h llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp --diff_from_common_commit

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index a68595de8..27780b8cf 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -874,8 +874,7 @@ Value *VPInstruction::generate(VPTransformState &State) {
       return State.get(getOperand(0));
     BasicBlock *GuardBB =
         State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(0));
-    BasicBlock *VecBB =
-        State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(1));
+    BasicBlock *VecBB = State.CFG.VPBB2IRBB.at(JoinVPBB->getCFGPredecessor(1));
     Value *InVal;
     {
       // State.get() may insert broadcast instructions (insertelement +
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 3d93edef8..9b89f2c61 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -2571,8 +2571,7 @@ static void licm(VPlan &Plan) {
             if (!OpR)
               return false;
             VPBasicBlock *OpBB = OpR->getParent();
-            return OpBB != SinkBB &&
-                   !VPDT.properlyDominates(OpBB, SinkBB);
+            return OpBB != SinkBB && !VPDT.properlyDominates(OpBB, SinkBB);
           }))
         continue;
 
@@ -3845,9 +3844,10 @@ static void expandVPDerivedIV(VPDerivedIVRecipe *R, VPTypeAnalysis &TypeInfo) {
   llvm_unreachable("Unhandled induction kind");
 }
 
-static cl::opt<bool> EnableConditionalBlockGuards(
-    "enable-loop-vectorization-with-conditions", cl::init(false), cl::Hidden,
-    cl::desc("Vectorize loop with branches"));
+static cl::opt<bool>
+    EnableConditionalBlockGuards("enable-loop-vectorization-with-conditions",
+                                 cl::init(false), cl::Hidden,
+                                 cl::desc("Vectorize loop with branches"));
 
 static cl::opt<unsigned> ConditionalGuardThreshold(
     "conditional-guard-threshold", cl::init(5), cl::Hidden,
@@ -3877,9 +3877,9 @@ static BasicBlock *getOrigBBForVPBB(VPBasicBlock *VPBB) {
 ///      is conditionally executed in the scalar loop).
 ///   2. The block contains more recipes than \c ConditionalGuardThreshold
 ///      (making the guard overhead worthwhile).
-static SmallVector<VPBasicBlock *>
-findGuardCandidates(VPBasicBlock *HeaderVPBB, BasicBlock *IRLatch,
-                    DominatorTree *DT) {
+static SmallVector<VPBasicBlock *> findGuardCandidates(VPBasicBlock *HeaderVPBB,
+                                                       BasicBlock *IRLatch,
+                                                       DominatorTree *DT) {
   SmallVector<VPBasicBlock *> Candidates;
   ReversePostOrderTraversal<VPBlockShallowTraversalWrapper<VPBlockBase *>> RPOT(
       HeaderVPBB);
@@ -3917,7 +3917,7 @@ static VPValue *getMaskFromRecipe(VPRecipeBase &R) {
 /// guarded block, or \c nullptr if the candidate should be skipped. \p Mask
 /// is set to the discovered mask value on success.
 static VPBasicBlock *findBlockMaskAndSplit(VPBasicBlock *VecBlock,
-                                          VPValue *&Mask) {
+                                           VPValue *&Mask) {
   Mask = nullptr;
   for (VPRecipeBase &R : *VecBlock) {
     if (VPValue *M = getMaskFromRecipe(R)) {
@@ -3950,8 +3950,7 @@ static std::pair<VPBasicBlock *, VPBasicBlock *>
 buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
   VPBasicBlock *PredBlock =
       cast<VPBasicBlock>(VecBlock->getSinglePredecessor());
-  VPBasicBlock *SuccBlock =
-      cast<VPBasicBlock>(VecBlock->getSingleSuccessor());
+  VPBasicBlock *SuccBlock = cast<VPBasicBlock>(VecBlock->getSingleSuccessor());
 
   VPBasicBlock *GuardBlock =
       Plan.createVPBasicBlock(VecBlock->getName() + ".cond.guard");
@@ -3961,8 +3960,7 @@ buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
   JoinBlock->setParent(VecBlock->getParent());
 
   VPBuilder GuardBuilder(GuardBlock);
-  VPValue *AnyActive =
-      GuardBuilder.createNaryOp(VPInstruction::AnyOf, {Mask});
+  VPValue *AnyActive = GuardBuilder.createNaryOp(VPInstruction::AnyOf, {Mask});
   GuardBuilder.createNaryOp(VPInstruction::BranchOnCond, {AnyActive});
 
   VPBlockUtils::disconnectBlocks(PredBlock, VecBlock);
@@ -3980,8 +3978,7 @@ buildGuardDiamondCFG(VPlan &Plan, VPBasicBlock *VecBlock, VPValue *Mask) {
 /// merge the guarded value with poison on the skip path. All recipes stay
 /// inside \p VecBlock so that expensive operations (FP arithmetic, etc.)
 /// remain guarded and are skipped when no lanes are active.
-static void mergeGuardedValues(VPBasicBlock *VecBlock,
-                               VPBasicBlock *GuardBlock,
+static void mergeGuardedValues(VPBasicBlock *VecBlock, VPBasicBlock *GuardBlock,
                                VPBasicBlock *JoinBlock) {
   VPBuilder JoinBuilder(JoinBlock, JoinBlock->begin());
   for (VPRecipeBase &R : *VecBlock) {
@@ -4012,9 +4009,9 @@ static void mergeGuardedValues(VPBasicBlock *VecBlock,
   }
 }
 
-void VPlanTransforms::introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoop,
-                                           DominatorTree *DT,
-                                           const TargetTransformInfo *TTI) {
+void VPlanTransforms::introduceConditionalBlockGuards(
+    VPlan &Plan, Loop *OrigLoop, DominatorTree *DT,
+    const TargetTransformInfo *TTI) {
   if (!EnableConditionalBlockGuards)
     return;
   if (TTI->enableScalableVectorization())
@@ -4052,7 +4049,6 @@ void VPlanTransforms::introduceConditionalBlockGuards(VPlan &Plan, Loop *OrigLoo
     auto [GuardBlock, JoinBlock] = buildGuardDiamondCFG(Plan, VecBlock, Mask);
 
     mergeGuardedValues(VecBlock, GuardBlock, JoinBlock);
-
   }
 }
 

@nema-ashutosh

Contributor Author

Kindly review - @fhahn @preames @lukel97 @alexey-bataev @rengolin @ayalz

@alexey-bataev

Member

There is a similar PR, #141900. Why do we need this one?

@ElvisWang123

Contributor

I’ve also created an llvm-test-suite patch to evaluate the performance of #141900. Would you like me to add any additional test cases, or would you be available to review it as is?
llvm/llvm-test-suite#345

@nema-ashutosh

Contributor Author

> I’ve also created an llvm-test-suite patch to evaluate the performance of #141900. Would you like me to add any additional test cases, or would you be available to review it as is? llvm/llvm-test-suite#345

Thanks, I'll take a look.

@nema-ashutosh

nema-ashutosh commented May 25, 2026

Contributor Author

> There is a similar PR, #141900. Why do we need this one?

I’ve been exploring a related approach in PR #197423 and would appreciate it if reviewers could give it a fair evaluation before it’s set aside.

There are a few scenarios where the block-level approach handles things more naturally, and I’ve highlighted those differences for consideration.

  1. Store-Only Seeding
    PR 141900: Only seeds from VPWidenStoreRecipe. Conditional blocks with only expensive FP math, masked loads/gathers, replicated recipes, or interleave groups won't trigger the optimization.
    if (cond[i]) {
      temp += sqrt(A[i]) * log(B[C[i]]); // no store -> not guarded
    }
    PR 197423: Identifies candidates at the block level -- any block conditionally executed in the original scalar loop qualifies, regardless of recipe types. Discovers masks from stores, loads, replicated recipes, and interleave groups.

  2. No Support for Values Escaping the Guarded Block
    PR 141900: Recipes with any user outside the tree are pruned. If the expensive computation itself is shared, the tree can shrink to size 1 and no guard is created.
    PR 197423: Introduces ConditionalMerge PHIs in a join block that merge the guarded value with poison on the skip path. The expensive computation stays guarded; external users receive the merged value.

  3. Redundant Guards for Stores Sharing the Same Mask
    PR 141900: Each store is processed independently -- no grouping by mask. Two stores under the same condition produce two sequential any-of + branch pairs computing the identical value:
    if (cond[i]) {
      A[i] = ExpensiveOp1(data[i]);
      B[i] = ExpensiveOp2(data[i]);
    }
    PR 197423: Guards the entire block as a single unit.
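
To make point 3 concrete, here is a minimal scalar emulation of a single block-level guard around two stores that share one mask. It is only a sketch under stated assumptions, not code from either PR: the fixed `VF` of 4, the `GuardStats` counters, the function name `guardedLoop`, and the use of `std::sqrt` / `std::log1p` as stand-ins for `ExpensiveOp1` / `ExpensiveOp2` are all hypothetical. It also ignores the scalar remainder loop and values escaping the block.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical fixed vectorization factor, chosen only for illustration.
constexpr std::size_t VF = 4;

struct GuardStats {
  std::size_t GuardChecks = 0;    // any-of + branch evaluations
  std::size_t BlocksExecuted = 0; // times the guarded block actually ran
};

// Emulates one guard diamond per vector iteration: a single any-of over the
// block mask decides whether the whole block (both stores) is executed.
GuardStats guardedLoop(const std::vector<bool> &Cond,
                       const std::vector<double> &Data,
                       std::vector<double> &A, std::vector<double> &B) {
  assert(Cond.size() == Data.size() && A.size() == Data.size() &&
         B.size() == Data.size());
  assert(Data.size() % VF == 0 && "sketch ignores the scalar remainder loop");

  GuardStats Stats;
  for (std::size_t I = 0; I < Data.size(); I += VF) {
    // Guard block: any-of over the mask, followed by a scalar branch.
    bool AnyActive = false;
    for (std::size_t L = 0; L < VF; ++L)
      AnyActive = AnyActive || Cond[I + L];
    ++Stats.GuardChecks;
    if (!AnyActive)
      continue; // skip path: go straight to the join point

    // Guarded block: both masked stores sit behind the same guard.
    ++Stats.BlocksExecuted;
    for (std::size_t L = 0; L < VF; ++L) {
      if (!Cond[I + L])
        continue; // inactive lane: the masked store does nothing
      A[I + L] = std::sqrt(Data[I + L]);  // stand-in for ExpensiveOp1
      B[I + L] = std::log1p(Data[I + L]); // stand-in for ExpensiveOp2
    }
  }
  return Stats;
}
```

With eight elements where only the second group of four has active lanes, this performs two guard checks and executes the block once. A per-store scheme would evaluate the same any-of once for each store in every vector iteration.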

    const TargetTransformInfo *TTI) {
  if (!EnableConditionalBlockGuards)
    return;
  if (TTI->enableScalableVectorization())

Contributor Author

Please note that this check is a temporary bail-out. I’d like to remove it eventually, but since I have limited experience with scalable vector architectures, I would appreciate some guidance on how to do so correctly.
