JIT: Merge all RETURN/THROW blocks#128515
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
|
@AndyAyersMS PTAL. |
AndyAyersMS
left a comment
There was a problem hiding this comment.
Can we do this without repeatedly searching all blocks for returns and throws?
…cate-all-return-throw-blocks
…of reinvoking and re-gathering candidates every timme * hack to suppress positive diffs
|
I recommend keeping refactoring/renaming changes and functionality in separate PRs, otherwise reviews are more likely to miss important things. Also, does tail merging returns lead to new tail merge opportunities like it does for other blocks (eg should we be populating "retry blocks")? |
Yes, deduplicating return blocks often does expose new opportunities to tail merge. We are already pushing merged blocks to the runtime/src/coreclr/jit/fgopt.cpp Lines 5370 to 5372 in 5341a84 Here is an example (for myself to harden understanding): static int Example(bool cond1, bool cond2, ref int x, ref int y)
{
if (cond1)
{
y = 8;
x = 9;
return 10;
}
if (cond2)
{
y = 8;
x = 9;
return 10;
}
return 2;
}First we pull out the A set of 2 return/throw blocks end with the same tree
STMT00005 ( 0x017[E--] ... 0x019 )
[000017] ----------- * RETURN int
[000016] ----------- \--* CNS_INT int 10
New Basic Block BB06 [0005] created.
setting likelihood of BB02 -> BB06 to 1
Will cross-jump to newly split off BB06
unlinking STMT00005 ( 0x017[E--] ... 0x019 )
[000017] ----------- * RETURN int
[000016] ----------- \--* CNS_INT int 10
from BB04
setting likelihood of BB04 -> BB06 to 1
Deduplicated 1 set of return/throw blocksAfter that we look at the predecessors of the new All 2 preds of BB06 end with the same tree, moving
STMT00004 ( 0x013[E--] ... 0x016 )
[000015] -A-XG------ * STOREIND int
[000013] ----------- +--* LCL_VAR byref V02 arg2
[000014] ----------- \--* CNS_INT int 9
unlinking STMT00004 ( 0x013[E--] ... 0x016 )
[000015] -A-XG------ * STOREIND int
[000013] ----------- +--* LCL_VAR byref V02 arg2
[000014] ----------- \--* CNS_INT int 9
from BB04
unlinking STMT00007 ( 0x006[E--] ... 0x009 )
[000023] -A-XG------ * STOREIND int
[000021] ----------- +--* LCL_VAR byref V02 arg2
[000022] ----------- \--* CNS_INT int 9
from BB02
Merged 1 set of tails going into BB06And so one-by-one we work ourselves through the equivalent statements. Regathering predecessors at each step. Note: For some cases we might be able to consider tails equivalent even though their exact stmt order isnt the same (?), granted they can be re-ordered accordingly. Update: I just moved de-duplicating return/throw blocks before tail merging and no longer pushing to |
…f using a BitVec to sparsely mark them as processed * move de-duplication before tail-merging and then no longer add them to the retry list as it isnt needed * use stl iterator tag to be able to call std::stable_partition * and assert to vector indexer
…s in downstream phases because the way we choose the crossJumpVictim is order-dependent and non optimal (for example we'd want to avoid new BBF_NEEDS_GCPOLL) * also remove the std::reverse - same reason
…cate-all-return-throw-blocks
|
How about something like this (diff is vs main). We could collect also return and throw separately to save a bit of time. diff --git a/src/coreclr/jit/fgopt.cpp b/src/coreclr/jit/fgopt.cpp
index ffa3d88cba3..d43078f66b4 100644
--- a/src/coreclr/jit/fgopt.cpp
+++ b/src/coreclr/jit/fgopt.cpp
@@ -5492,13 +5492,25 @@ PhaseStatus Compiler::fgHeadTailMerge(bool early)
}
}
- predInfo.Reset();
- for (BasicBlock* const block : retOrThrowBlocks.BottomUpOrder())
+ if (retOrThrowBlocks.Height() > 1)
{
- predInfo.Push(PredInfo(block, block->lastStmt()));
- }
+ JITDUMP("Trying tail merge of return and throw blocks\n");
+
+ for (int i = 0; i < retOrThrowBlocks.Height() - 1; i++)
+ {
+ predInfo.Reset();
+ for (int j = i; j < retOrThrowBlocks.Height(); j++)
+ {
+ BasicBlock* const block = retOrThrowBlocks.TopRef(j);
+ predInfo.Push(PredInfo(block, block->lastStmt()));
+ }
- tailMergePreds(nullptr);
+ if tailMergePreds(nullptr)
+ {
+ numOpts++;
+ }
+ }
+ }
// Work through any retries
//plus a check in diff --git a/src/coreclr/jit/fgopt.cpp b/src/coreclr/jit/fgopt.cpp
index ffa3d88cba3..ae34f2c7df7 100644
--- a/src/coreclr/jit/fgopt.cpp
+++ b/src/coreclr/jit/fgopt.cpp
@@ -5166,6 +5166,11 @@ PhaseStatus Compiler::fgHeadTailMerge(bool early)
{
BasicBlock* const otherBlock = predInfo.TopRef(j).m_block;
+ if (baseBlock->GetKind() != otherBlock->GetKind())
+ {
+ continue;
+ }
+
// Consider: bypass this for statements that can't cause exceptions.
//
if (!BasicBlock::sameEHRegion(baseBlock, otherBlock)) |
|
Even with the static int Example(bool cond1, bool cond2, ref int x, ref int y)
{
if (cond1)
{
y = 8;
x = 9;
return 10;
}
if (cond2)
{
y = 8;
x = 9;
return 10;
}
return 2;
}On the first pass On the second pass things start getting weird. Which is unexpected because thats not a return stmt of course. |
|
Yeah, just filter those out? @@ -5492,13 +5497,40 @@ PhaseStatus Compiler::fgHeadTailMerge(bool early)
}
}
- predInfo.Reset();
- for (BasicBlock* const block : retOrThrowBlocks.BottomUpOrder())
+ if (retOrThrowBlocks.Height() > 1)
{
- predInfo.Push(PredInfo(block, block->lastStmt()));
- }
+ JITDUMP("Trying tail merge of return and throw blocks\n");
+
+ for (int i = 0; i < retOrThrowBlocks.Height() - 1; i++)
+ {
+ BasicBlock* const block = retOrThrowBlocks.TopRef(i);
+
+ // If this block was already merged, skip it
+ //
+ if (!block->KindIs(BBJ_RETURN, BBJ_THROW))
+ {
+ continue;
+ }
- tailMergePreds(nullptr);
+ predInfo.Reset();
+ for (int j = i; j < retOrThrowBlocks.Height(); j++)
+ {
+ BasicBlock* const otherBlock = retOrThrowBlocks.TopRef(j);
+
+ if (otherBlock->GetKind() != block->GetKind())
+ {
+ continue;
+ }
+
+ predInfo.Push(PredInfo(otherBlock, otherBlock->lastStmt()));
+ }
+
+ if tailMergePreds(nullptr)
+ {
+ numOpts++;
+ }
+ }
+ }(then you don't need the other diff, as when tail merging throws or rets there are only throw or ret candidates, respectively) |
|
That makes sense and seems to work. What I don't understand: Why do you prefer to add code on top, instead of improving the underlying |
We generally prefer to keep refactoring/efficiency improvements separate from extending functionality. The first kind of change can be zero diff, which helps assure us that nothing got broken, and we can get a clean read on the TP improvement it offers. |
|
@AndyAyersMS PTAL. |
What lead you to that last commit? |
The previous code had "hidden diffs" from populating runtime/src/coreclr/jit/fgopt.cpp Lines 5258 to 5311 in 3b0dd88 There may have also been other things at play, but I am certain this is one factor. |
|
For the record, this order sensitivity is also the reason why instead of Also changing the order on purpose and observing the regressions (as well as improvements) is a great strategy for finding better heuristics of picking the |
|
Ok, thanks... if there are better ways of figuring out which block to cross jump to, then we should investigate too. If this is an area you're interested in working on, I think the idea of partial tree merges during tail merging is likely to yield a nice benefit, especially merging of similar looking calls with different (simple) args. Calls expand into a lot of code and also wreak havok with LSRA (inevitably all the calls interfere with one another so collapsing N into 1 would likely save quite a bit of code size, compile time, and might even produce faster code). I couldn't find an easy way to do it (though I also didn't try very hard). When the tree compares fail you're left with no info about where the matches failed, and plumbing through partial match recording while allowing for commutative swaps seemed quite messy. The rough idea would be to try and match and then assess in some (as of yet unspecified fashion) whether introducing temps for all the things that now need to cross block boundaries would be "worth it". |
|
Diffs -- interestingly enough this causes size increases on Wasm. Wasm is a bit of a strange beast, we might actually be better off not tail merging returns there (since epilogs are basically empty). But ok as is for now. |
|
Also tail merging may not be a performance improvement. We'll have to keep an eye out for regressions (will see them next week). @dotnet/jit-contrib FYI |
AndyAyersMS
left a comment
There was a problem hiding this comment.
@EgorBo want to do a secondary review?
|
/ba-g timeouts |
Fix #128514
tailMergePreds(nullptr)was called once, but my understanding is it needs to be called repeatedly as it only processes one set at at time.