JIT: Merge all RETURN/THROW blocks by BoyBaykiller · Pull Request #128515 · dotnet/runtime

BoyBaykiller · 2026-05-23T03:55:05Z

tailMergePreds(nullptr) was called once, but my understanding is that it needs to be called repeatedly, as it only processes one set at a time.
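
To illustrate the "one set per call" behavior, here is a minimal toy sketch. It is not the JIT's code: `ToyBlock`, `mergeOneSet`, and `mergeAllSets` are hypothetical names, and a block is reduced to just its last statement.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a return/throw block; only the last statement matters here.
struct ToyBlock
{
    std::string lastStmt;
    bool        merged; // true once the block has been folded into a surviving copy
};

// Merges the FIRST set of unmerged blocks that end with the same statement,
// then stops. Returns true if such a set (two or more blocks) was found.
bool mergeOneSet(std::vector<ToyBlock>& blocks)
{
    for (std::size_t i = 0; i < blocks.size(); i++)
    {
        if (blocks[i].merged)
        {
            continue;
        }

        bool foundMatch = false;
        for (std::size_t j = i + 1; j < blocks.size(); j++)
        {
            if (!blocks[j].merged && (blocks[j].lastStmt == blocks[i].lastStmt))
            {
                blocks[j].merged = true; // blocks[i] is the surviving copy
                foundMatch       = true;
            }
        }

        if (foundMatch)
        {
            return true;
        }
    }
    return false;
}

// Calls mergeOneSet until it reports nothing left to merge.
int mergeAllSets(std::vector<ToyBlock>& blocks)
{
    int numSets = 0;
    while (mergeOneSet(blocks))
    {
        numSets++;
    }
    return numSets;
}
```

With blocks ending in `return 10`, `throw`, `return 10`, `throw`, a single `mergeOneSet` call only handles the two `return 10` blocks; the `throw` pair is picked up by a second call.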

dotnet-policy-service · 2026-05-23T03:56:19Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

BoyBaykiller · 2026-05-23T15:04:01Z

@AndyAyersMS PTAL.

AndyAyersMS

Can we do this without repeatedly searching all blocks for returns and throws?

AndyAyersMS · 2026-05-26T14:51:26Z

I recommend keeping refactoring/renaming changes and functionality in separate PRs, otherwise reviews are more likely to miss important things.

Also, does tail merging returns lead to new tail merge opportunities like it does for other blocks (eg should we be populating "retry blocks")?

BoyBaykiller · 2026-05-27T00:43:12Z

> Also, does tail merging returns lead to new tail merge opportunities like it does for other blocks (eg should we be populating "retry blocks")?

Yes, deduplicating return blocks often does expose new opportunities to tail merge. We are already pushing merged blocks to the retryBlocks stack:

runtime/src/coreclr/jit/fgopt.cpp

Lines 5370 to 5372 in 5341a84

    // We should try tail merging the cross jump target.
    //
    retryBlocks.Push(crossJumpTarget);

Here is an example (mostly to solidify my own understanding):

static int Example(bool cond1, bool cond2, ref int x, ref int y)
{
    if (cond1)
    {
        y = 8;
        x = 9;
        return 10; 
    }
    if (cond2)
    {
        y = 8;
        x = 9;
        return 10;
    }
    return 2;
}

First we pull out the return 10; statement:

A set of 2 return/throw blocks end with the same tree
STMT00005 ( 0x017[E--] ... 0x019 )
               [000017] -----------                         *  RETURN    int   
               [000016] -----------                         \--*  CNS_INT   int    10
New Basic Block BB06 [0005] created.
setting likelihood of BB02 -> BB06 to 1
Will cross-jump to newly split off BB06

unlinking STMT00005 ( 0x017[E--] ... 0x019 )
               [000017] -----------                         *  RETURN    int   
               [000016] -----------                         \--*  CNS_INT   int    10
 from BB04
setting likelihood of BB04 -> BB06 to 1
Deduplicated 1 set of return/throw blocks

After that we look at the predecessors of the new return 10; block (more specifically, at their last statements) and discover that they are also the same. So that statement gets sunk into the return 10; block:

All 2 preds of BB06 end with the same tree, moving
STMT00004 ( 0x013[E--] ... 0x016 )
               [000015] -A-XG------                         *  STOREIND  int   
               [000013] -----------                         +--*  LCL_VAR   byref  V02 arg2         
               [000014] -----------                         \--*  CNS_INT   int    9

unlinking STMT00004 ( 0x013[E--] ... 0x016 )
               [000015] -A-XG------                         *  STOREIND  int   
               [000013] -----------                         +--*  LCL_VAR   byref  V02 arg2         
               [000014] -----------                         \--*  CNS_INT   int    9
 from BB04

unlinking STMT00007 ( 0x006[E--] ... 0x009 )
               [000023] -A-XG------                         *  STOREIND  int   
               [000021] -----------                         +--*  LCL_VAR   byref  V02 arg2         
               [000022] -----------                         \--*  CNS_INT   int    9
 from BB02
Merged 1 set of tails going into BB06

And so we work through the equivalent statements one by one, regathering predecessors at each step.
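
The net effect of those steps can be sketched with a toy model (not the JIT's implementation: `Stmts` and `sinkCommonTail` are hypothetical names, statements are plain strings, and things like EH-region checks are ignored). The real code moves one statement per step and regathers predecessors; the sketch just loops over the common suffix.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Stmts = std::vector<std::string>;

// Moves the longest run of identical trailing statements out of `a` and `b`
// and returns it, in original order, as the body of a shared successor block.
Stmts sinkCommonTail(Stmts& a, Stmts& b)
{
    Stmts shared;
    while (!a.empty() && !b.empty() && (a.back() == b.back()))
    {
        // Each iteration corresponds to one "all preds end with the same tree" step.
        shared.insert(shared.begin(), a.back());
        a.pop_back();
        b.pop_back();
    }
    return shared;
}
```

For two predecessors ending in `y = 8; x = 9; return 10;`, all three statements end up in the shared block and anything that differs earlier stays where it was.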

Note: In some cases we might be able to consider tails equivalent even though their exact statement order isn't the same (?), provided they can be reordered accordingly.

Update: I just moved de-duplication of return/throw blocks before tail merging and no longer push them to retryBlocks; that has no diffs.

…f using a BitVec to sparsely mark them as processed * move de-duplication before tail-merging and then no longer add them to the retry list as it isn't needed * use stl iterator tag to be able to call std::stable_partition * add assert to vector indexer

AndyAyersMS · 2026-06-02T23:40:15Z

How about something like this (diff is vs main)? We could also collect returns and throws separately to save a bit of time.

diff --git a/src/coreclr/jit/fgopt.cpp b/src/coreclr/jit/fgopt.cpp
index ffa3d88cba3..d43078f66b4 100644
--- a/src/coreclr/jit/fgopt.cpp
+++ b/src/coreclr/jit/fgopt.cpp
@@ -5492,13 +5492,25 @@ PhaseStatus Compiler::fgHeadTailMerge(bool early)
         }
     }

-    predInfo.Reset();
-    for (BasicBlock* const block : retOrThrowBlocks.BottomUpOrder())
+    if (retOrThrowBlocks.Height() > 1)
     {
-        predInfo.Push(PredInfo(block, block->lastStmt()));
-    }
+        JITDUMP("Trying tail merge of return and throw blocks\n");
+
+        for (int i = 0; i < retOrThrowBlocks.Height() - 1; i++)
+        {
+            predInfo.Reset();
+            for (int j = i; j < retOrThrowBlocks.Height(); j++)
+            {
+                BasicBlock* const block = retOrThrowBlocks.TopRef(j);
+                predInfo.Push(PredInfo(block, block->lastStmt()));
+            }

-    tailMergePreds(nullptr);
+            if (tailMergePreds(nullptr))
+            {
+                numOpts++;
+            }
+        }
+    }

     // Work through any retries
     //

plus a check in tailMergePreds that the blocks are the same kind:

diff --git a/src/coreclr/jit/fgopt.cpp b/src/coreclr/jit/fgopt.cpp
index ffa3d88cba3..ae34f2c7df7 100644
--- a/src/coreclr/jit/fgopt.cpp
+++ b/src/coreclr/jit/fgopt.cpp
@@ -5166,6 +5166,11 @@ PhaseStatus Compiler::fgHeadTailMerge(bool early)
             {
                 BasicBlock* const otherBlock = predInfo.TopRef(j).m_block;

+                if (baseBlock->GetKind() != otherBlock->GetKind())
+                {
+                    continue;
+                }
+
                 // Consider: bypass this for statements that can't cause exceptions.
                 //
                 if (!BasicBlock::sameEHRegion(baseBlock, otherBlock))

BoyBaykiller · 2026-06-03T00:29:45Z

Even with the if (baseBlock->GetKind() != otherBlock->GetKind()) check this doesn't skip already processed candidates. Let's see what happens if we run it on this again:

static int Example(bool cond1, bool cond2, ref int x, ref int y)
{
    if (cond1)
    {
        y = 8;
        x = 9;
        return 10; 
    }
    if (cond2)
    {
        y = 8;
        x = 9;
        return 10;
    }
    return 2;
}

On the first pass predInfo contains [return 2, return 10, return 10].
tailMergePreds(nullptr) then de-duplicates the return 10:

A set of 2 return blocks end with the same tree
STMT00005 ( 0x017[E--] ... 0x019 )
               [000017] -----------                         *  RETURN    int   
               [000016] -----------                         \--*  CNS_INT   int    10
New Basic Block BB06 [0005] created.
setting likelihood of BB04 -> BB06 to 1
Will cross-jump to newly split off BB06

unlinking STMT00008 ( 0x00A[E--] ... 0x00C )
               [000025] -----------                         *  RETURN    int   
               [000024] -----------                         \--*  CNS_INT   int    10
 from BB02
setting likelihood of BB02 -> BB06 to 1

On the second pass things start getting weird. predInfo now contains [store(9), store(9)]. The issue is the blocks in retOrThrowBlocks aren't BBJ_RETURN/BBJ_THROW anymore! And then we get:

A set of 2 return blocks end with the same tree
STMT00004 ( 0x013[E--] ... 0x016 )
               [000015] -A-XG------                         *  STOREIND  int   
               [000013] -----------                         +--*  LCL_VAR   byref  V02 arg2         
               [000014] -----------                         \--*  CNS_INT   int    9

That is unexpected because that's obviously not a return statement.
Ultimately I get Assertion failed '!"Invalid block preds"'.

AndyAyersMS · 2026-06-03T00:37:39Z

Yeah, just filter those out?

@@ -5492,13 +5497,40 @@ PhaseStatus Compiler::fgHeadTailMerge(bool early)
         }
     }

-    predInfo.Reset();
-    for (BasicBlock* const block : retOrThrowBlocks.BottomUpOrder())
+    if (retOrThrowBlocks.Height() > 1)
     {
-        predInfo.Push(PredInfo(block, block->lastStmt()));
-    }
+        JITDUMP("Trying tail merge of return and throw blocks\n");
+
+        for (int i = 0; i < retOrThrowBlocks.Height() - 1; i++)
+        {
+            BasicBlock* const block = retOrThrowBlocks.TopRef(i);
+
+            // If this block was already merged, skip it
+            //
+            if (!block->KindIs(BBJ_RETURN, BBJ_THROW))
+            {
+                continue;
+            }

-    tailMergePreds(nullptr);
+            predInfo.Reset();
+            for (int j = i; j < retOrThrowBlocks.Height(); j++)
+            {
+                BasicBlock* const otherBlock = retOrThrowBlocks.TopRef(j);
+
+                if (otherBlock->GetKind() != block->GetKind())
+                {
+                    continue;
+                }
+
+                predInfo.Push(PredInfo(otherBlock, otherBlock->lastStmt()));
+            }
+
+            if (tailMergePreds(nullptr))
+            {
+                numOpts++;
+            }
+        }
+    }

(then you don't need the other diff, as when tail merging throws or rets there are only throw or ret candidates, respectively)
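
The shape of that loop can be tried out on a toy model. This is a sketch under stated assumptions, not the JIT's code: `Kind`, `ToyBlock`, `toyTailMerge`, and `mergeReturnsAndThrows` are hypothetical, and "merging" is modeled only as a change of block kind.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Return, Throw, Always };

struct ToyBlock
{
    Kind        kind;
    std::string lastStmt;
};

// Toy stand-in for tailMergePreds: merges the first set of candidates that end
// with the same statement. Every merged block now jumps to a shared block, so
// its kind becomes Always.
bool toyTailMerge(const std::vector<ToyBlock*>& candidates)
{
    for (std::size_t i = 0; i < candidates.size(); i++)
    {
        std::vector<ToyBlock*> matches;
        for (std::size_t j = i; j < candidates.size(); j++)
        {
            if (candidates[j]->lastStmt == candidates[i]->lastStmt)
            {
                matches.push_back(candidates[j]);
            }
        }

        if (matches.size() > 1)
        {
            for (ToyBlock* b : matches)
            {
                b->kind = Kind::Always;
            }
            return true;
        }
    }
    return false;
}

int mergeReturnsAndThrows(std::vector<ToyBlock>& blocks)
{
    int numOpts = 0;
    for (std::size_t i = 0; i + 1 < blocks.size(); i++)
    {
        // Blocks merged by an earlier iteration are no longer returns/throws; skip them.
        if (blocks[i].kind == Kind::Always)
        {
            continue;
        }

        // Gather only the remaining blocks of the same kind as blocks[i].
        std::vector<ToyBlock*> candidates;
        for (std::size_t j = i; j < blocks.size(); j++)
        {
            if (blocks[j].kind == blocks[i].kind)
            {
                candidates.push_back(&blocks[j]);
            }
        }

        if (toyTailMerge(candidates))
        {
            numOpts++;
        }
    }
    return numOpts;
}
```

Because stale entries are skipped by kind, the second pass never sees the former `return 10` blocks, which is the situation that produced the bad dump above.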

BoyBaykiller · 2026-06-03T01:12:20Z

That makes sense and seems to work.

What I don't understand: Why do you prefer to add code on top, instead of improving the underlying tailMergePreds to handle all sets at once? Like the PR currently does.

AndyAyersMS · 2026-06-03T14:36:28Z

> Why do you prefer to add code on top, instead of improving the underlying tailMergePreds to handle all sets at once? Like the PR currently does.

We generally prefer to keep refactoring/efficiency improvements separate from extending functionality. The first kind of change can be zero diff, which helps assure us that nothing got broken, and we can get a clean read on the TP improvement it offers.

BoyBaykiller · 2026-06-03T23:02:25Z

@AndyAyersMS PTAL.
Should be in its least invasive form now. diffs

AndyAyersMS · 2026-06-03T23:14:58Z

> @AndyAyersMS PTAL. Should be in its least invasive form now. diffs

What led you to that last commit?

BoyBaykiller · 2026-06-03T23:30:00Z

> What led you to that last commit?

The previous code had "hidden diffs" from populating predInfo in a different order than what was previously done. This matters because the way we choose the crossJumpVictim is sensitive to the order of items in predInfo:

runtime/src/coreclr/jit/fgopt.cpp

Lines 5258 to 5311 in 3b0dd88

    for (PredInfo& info : matchedPredInfo.TopDownOrder())
    {
        Statement* const  stmt      = info.m_stmt;
        BasicBlock* const predBlock = info.m_block;

        // Never pick the init block as the victim as that would
        // cause us to add a predecessor to it, which is invalid.
        if (predBlock == fgFirstBB)
        {
            continue;
        }

        bool const isNoSplit     = stmt == predBlock->firstStmt();
        bool const isFallThrough = (predBlock->KindIs(BBJ_ALWAYS) && predBlock->JumpsToNext());

        // Is this block possibly better than what we have?
        //
        bool useBlock = false;

        if (crossJumpVictim == nullptr)
        {
            // Pick an initial candidate.
            useBlock = true;
        }
        else if (isNoSplit && isFallThrough)
        {
            // This is the ideal choice.
            //
            useBlock = true;
        }
        else if (!haveNoSplitVictim && isNoSplit)
        {
            useBlock = true;
        }
        else if (!haveNoSplitVictim && !haveFallThroughVictim && isFallThrough)
        {
            useBlock = true;
        }

        if (useBlock)
        {
            crossJumpVictim       = predBlock;
            crossJumpStmt         = stmt;
            haveNoSplitVictim     = isNoSplit;
            haveFallThroughVictim = isFallThrough;
        }

        // If we have the perfect victim, stop looking.
        //
        if (haveNoSplitVictim && haveFallThroughVictim)
        {
            break;
        }
    }

There may have been other things at play as well, but I am certain this is one factor.
I didn't look further into it because the way it's done in the new commit makes more sense to me overall.
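
The order sensitivity can be reproduced with a small standalone model of the selection loop quoted above. This is only a sketch: `Candidate` and `pickVictim` are hypothetical names, the two flags are supplied directly instead of being computed from blocks, and the `fgFirstBB` check is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of matchedPredInfo.
struct Candidate
{
    std::string name;
    bool        isNoSplit;
    bool        isFallThrough;
};

// Mirrors the if/else chain of the victim selection loop on plain data and
// returns the name of the chosen victim (empty if there are no candidates).
std::string pickVictim(const std::vector<Candidate>& candidates)
{
    const Candidate* victim                = nullptr;
    bool             haveNoSplitVictim     = false;
    bool             haveFallThroughVictim = false;

    for (const Candidate& c : candidates)
    {
        bool useBlock = false;

        if (victim == nullptr)
        {
            useBlock = true; // initial candidate
        }
        else if (c.isNoSplit && c.isFallThrough)
        {
            useBlock = true; // ideal choice
        }
        else if (!haveNoSplitVictim && c.isNoSplit)
        {
            useBlock = true;
        }
        else if (!haveNoSplitVictim && !haveFallThroughVictim && c.isFallThrough)
        {
            useBlock = true;
        }

        if (useBlock)
        {
            victim                = &c;
            haveNoSplitVictim     = c.isNoSplit;
            haveFallThroughVictim = c.isFallThrough;
        }

        if (haveNoSplitVictim && haveFallThroughVictim)
        {
            break; // perfect victim found
        }
    }

    return (victim != nullptr) ? victim->name : std::string();
}
```

When two candidates are equally good (for example, neither is no-split nor fall-through), whichever comes first wins, so reversing the input order changes the victim. When one candidate is clearly preferred, the order does not matter.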

BoyBaykiller · 2026-06-03T23:36:47Z

For the record, this order sensitivity is also the reason why instead of std::partition I was playing with std::stable_partition (to preserve order). c4e8252
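
As a small standalone illustration of the difference (plain standard-library containers rather than the JIT's own stack type; `Item` and `matchedFirst` are hypothetical names): `std::stable_partition` guarantees that elements keep their relative order within each group, while `std::partition` makes no such promise.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical candidate entry: a block name plus whether it is in the matched set.
struct Item
{
    std::string name;
    bool        matched;
};

// Moves matched items to the front while preserving relative order in both
// groups, and returns the resulting sequence of names.
std::vector<std::string> matchedFirst(std::vector<Item> items)
{
    std::stable_partition(items.begin(), items.end(), [](const Item& it) { return it.matched; });

    std::vector<std::string> names;
    for (const Item& it : items)
    {
        names.push_back(it.name);
    }
    return names;
}
```

For the input BB02 (unmatched), BB04 (matched), BB05 (unmatched), BB07 (matched), this yields BB04, BB07, BB02, BB05. With `std::partition` the matched items would still come first, but their order within each group is unspecified.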

Also, changing the order on purpose and observing the regressions (as well as improvements) is a great strategy for finding better heuristics for picking the crossJumpVictim.

AndyAyersMS · 2026-06-04T00:00:57Z

Ok, thanks... if there are better ways of figuring out which block to cross jump to, then we should investigate too.

If this is an area you're interested in working on, I think the idea of partial tree merges during tail merging is likely to yield a nice benefit, especially merging of similar looking calls with different (simple) args. Calls expand into a lot of code and also wreak havoc with LSRA (inevitably all the calls interfere with one another, so collapsing N into 1 would likely save quite a bit of code size and compile time, and might even produce faster code).

I couldn't find an easy way to do it (though I also didn't try very hard). When the tree compares fail you're left with no info about where the matches failed, and plumbing through partial match recording while allowing for commutative swaps seemed quite messy.

The rough idea would be to try to match, and then assess in some (as yet unspecified) fashion whether introducing temps for all the things that now need to cross block boundaries would be "worth it".
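
One possible shape for the "record where the match failed" part, as a toy sketch only. `Node` and `partialMatch` are hypothetical and unrelated to the JIT's tree types, trees are compared position by position, and commutative swaps are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression tree: an operator or leaf label plus child nodes.
struct Node
{
    std::string       op;
    std::vector<Node> kids;
};

using LeafDiffs = std::vector<std::pair<std::string, std::string>>;

// Returns false if the trees differ in shape or in an interior operator.
// Otherwise returns true and appends each pair of differing leaves to `diffs`;
// those are the values that would need temps if the trees were merged.
bool partialMatch(const Node& a, const Node& b, LeafDiffs& diffs)
{
    if (a.kids.empty() && b.kids.empty())
    {
        if (a.op != b.op)
        {
            diffs.emplace_back(a.op, b.op);
        }
        return true;
    }

    if ((a.op != b.op) || (a.kids.size() != b.kids.size()))
    {
        return false;
    }

    for (std::size_t i = 0; i < a.kids.size(); i++)
    {
        if (!partialMatch(a.kids[i], b.kids[i], diffs))
        {
            return false;
        }
    }
    return true;
}
```

The number of recorded leaf differences could then feed whatever "worth it" heuristic is chosen; for instance, two calls to the same method that differ in one constant argument would report a single difference.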

AndyAyersMS · 2026-06-04T19:29:54Z

Diffs -- interestingly enough this causes size increases on Wasm.

Wasm is a bit of a strange beast, we might actually be better off not tail merging returns there (since epilogs are basically empty). But ok as is for now.

AndyAyersMS · 2026-06-04T19:33:00Z

Also tail merging may not be a performance improvement. We'll have to keep an eye out for regressions (will see them next week).

@dotnet/jit-contrib FYI

AndyAyersMS

@EgorBo want to do a secondary review?

EgorBo · 2026-06-05T22:44:48Z

/ba-g timeouts

* call tailMergePreds repeatedly

f30defb

github-actions Bot added the area-CodeGen-coreclr label May 23, 2026

dotnet-policy-service Bot added the community-contribution label May 23, 2026

BoyBaykiller mentioned this pull request May 23, 2026

JIT: Switch to rangecheck when profitable #128524

Open

AndyAyersMS reviewed May 24, 2026

BoyBaykiller added 2 commits May 25, 2026 17:07

Merge branch 'main' of https://github.com/dotnet/runtime into dedupli…

a922217

…cate-all-return-throw-blocks

* process all sets of matchedCandidates at once in tailMerge instead …

77b2d48

…of reinvoking and re-gathering candidates every time * hack to suppress positive diffs

This was referenced May 26, 2026

"We stopped hearing from agent Azure Pipelines 32. Verify the agent machine is running and has a healthy network connection" dotnet/dnceng#1886

Open

XHarness package install failure on iOS due to devicectl NSPOSIXErrorDomain error 49 #123796

Open

BoyBaykiller marked this pull request as draft May 26, 2026 03:45

BoyBaykiller added 3 commits May 27, 2026 06:07

* switch to partition over stable_partition, this has some small diff…

c4e8252

…s in downstream phases because the way we choose the crossJumpVictim is order-dependent and non optimal (for example we'd want to avoid new BBF_NEEDS_GCPOLL) * also remove the std::reverse - same reason

* only attempt retries when tail-merging and do it immediately (zero-…

86b6a78

…diff)

BoyBaykiller mentioned this pull request May 29, 2026

JIT: Use STL iterator tags instead of custom ones #128786

Open

Merge branch 'main' of https://github.com/dotnet/runtime into dedupli…

f74d1b8

…cate-all-return-throw-blocks

BoyBaykiller added 4 commits June 3, 2026 22:52

* reset to main

c04e7a3

* call tailMergePreds until no sets are left

aa06689

* don't bail as it loses diffs in the current state

8a09dbc

* fix impl

ceba753

BoyBaykiller marked this pull request as ready for review June 3, 2026 22:50

build-analysis Bot mentioned this pull request Jun 4, 2026

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

3 tasks

AndyAyersMS approved these changes Jun 5, 2026

EgorBo approved these changes Jun 5, 2026

EgorBo merged commit 503cdd1 into dotnet:main Jun 5, 2026
136 of 139 checks passed

dotnet-maestro Bot mentioned this pull request Jun 6, 2026

[main] Source code updates from dotnet/runtime dotnet/dotnet#7100

Merged

dotnet-milestone-bot Bot added this to the 11.0-preview6 milestone Jun 6, 2026
