
Arm64: Improve support for HW_Flag_ReturnsPerElementMask #128326

Open
snickolls-arm wants to merge 3 commits into dotnet:main from snickolls-arm:fix-conditionalselect-predicates


@snickolls-arm

Contributor

When wrapping an intrinsic node that has an embedded mask with a ConditionalSelect, ensure that the constant node in op3 has a mask type when the intrinsic has the HW_Flag_ReturnsPerElementMask flag.

Build out further support for ConditionalSelect_Predicates, and use this to wrap nodes with HW_Flag_ReturnsPerElementMask. Add GenTree::IsSelectZero and update various areas in HW intrinsic codegen to ensure this intrinsic assembles correctly.

Use a tree visitor for assigning TYP_MASK to intrinsics that have HW_Flag_ReturnsPerElementMask. The current version of impHWIntrinsic does not process child nodes of the tree it returns for mask types, only the root node.

@github-actions (bot) added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on May 18, 2026
@dotnet-policy-service (bot) added the community-contribution label (Indicates that the PR has been added by a community member) on May 18, 2026
@dotnet-policy-service

Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Comment thread src/coreclr/jit/hwintrinsic.cpp Outdated
{
GenTreeHWIntrinsic* intrin = (*use)->AsHWIntrinsic();

if (HWIntrinsicInfo::ReturnsPerElementMask(intrin->GetHWIntrinsicId()) && !intrin->TypeIs(TYP_MASK))

Member


This is not correct for all platforms and is going to regress xarch as well as be incorrect for hardware without TYP_MASK support. I imagine it may also regress optimization opportunities for AdvSimd on Arm64.

The consideration is that many node kinds return a per element mask and are explicitly not returning TYP_MASK. For example, Vector128.GreaterThan is an API which explicitly returns a vector, but where it is conceptually known to be a "per-element mask" (i.e. each element is either AllBitsSet or Zero).

Having this knowledge, even pre-SVE or pre-AVX512, where no dedicated TYP_MASK support exists, is beneficial as it allows unlocking a number of other optimization opportunities and folding operations that may not be otherwise valid. -- These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.

Rather, we only want to do such a transform if we have TYP_MASK support and it's going to emit an instruction that actually produces a TYP_MASK, not one that is just "conceptually" a mask. In the case of xarch we do so by marking the downlevel instructions as special-import and adjusting those as needed; there is explicitly no need to do traversals of the tree since we know that we are either producing a mask (and therefore need CvtMaskToVector) or we are expecting a mask (and therefore need CvtVectorToMask).

It's unclear then why Arm64 needs to do a tree traversal itself here as it should have the same general scenario. Any given intrinsic is one of three categories (does nothing with masks, produces a mask, or consumes a mask) and so it should be trivially handled without any consideration of tree traversal.
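
The three categories can be illustrated with a small standalone model. This is only a sketch: `MaskRole`, `Node`, `FixupResult` and `FixupOperand` are hypothetical names invented for illustration and are not the JIT's actual types or helpers. It shows that the fixup is a local decision per intrinsic, with no tree traversal involved.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified model; none of these names are the JIT's real API.
enum class MaskRole { None, ProducesMask, ConsumesMask };
enum class ValueType { Simd, Mask };

struct Node
{
    std::string name;
    ValueType   type;
};

// A mask-producing intrinsic is wrapped so that callers still see the vector
// type promised by the managed signature.
inline Node FixupResult(MaskRole role, const Node& intrinsic)
{
    if ((role == MaskRole::ProducesMask) && (intrinsic.type == ValueType::Mask))
    {
        return Node{"CvtMaskToVector(" + intrinsic.name + ")", ValueType::Simd};
    }
    return intrinsic;
}

// A mask-consuming intrinsic converts a vector operand into a mask.
inline Node FixupOperand(MaskRole role, const Node& operand)
{
    if ((role == MaskRole::ConsumesMask) && (operand.type == ValueType::Simd))
    {
        return Node{"CvtVectorToMask(" + operand.name + ")", ValueType::Mask};
    }
    return operand;
}
```

An intrinsic in the "does nothing with masks" category passes through both helpers unchanged.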

Member


-- The code here is only called from Arm64, but the ifdef doesn't cover that, nor does the summary or visitor make it clear that it's only valid for Arm64, so future readers or refactorings may miss the consideration.

But then it's very unclear to me why we need this setup and why it needs to deviate from what's already trivially working for other platforms with masking support.

Contributor Author


Sorry, I should've left a comment with more context.

I was having trouble implementing the MaxMagnitude intrinsic; see this import code:

https://github.com/dotnet/runtime/blob/758abdf6906992c73adcd2c5ad8a1f5ecd0d70c7/src/coreclr/jit/gentree.cpp#L26420-L26514

The intrinsics in this tree don't have mask types assigned and tend to cause assertions in Lowering when inserting implicit mask operands. I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64, I would add a pass based on HW_Flag_ReturnsPerElementMask to enforce that instead. This would allow you to write short algorithms in import with correct types per the CIL and have the visitor apply the types for a small runtime cost.

> -- The code here is only called from Arm64, but the ifdef doesn't cover that, nor does the summary or visitor make it clear that it's only valid for Arm64, so future readers or refactorings may miss the consideration.

This is my mistake, it was only intended for Arm64 and I need to fix the ifdef. I will clarify the documentation too.

> These optimizations are notably missing from Arm64, in part because the SVE predication feature has deviated a lot from the general support.

@a74nh has made a good amount of progress on this recently. We're building an abstraction of a constant in terms of {pattern, value} for this. For example, strength reduction of {any pattern, any value} & {repeated, 0} => {repeated, 0}. I think the current approach fits into this hierarchy as the case where the pattern is a 'single scalar'.
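
As a rough illustration of the {pattern, value} idea, here is a minimal standalone sketch. `Pattern`, `PatternConst` and `TryFoldAnd` are hypothetical names chosen for this example and do not reflect the in-progress implementation; handling the zero on either side of the `&` is also an assumption, based on AND being commutative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a constant described as {pattern, value}.
enum class Pattern { SingleScalar, Repeated, Other };

struct PatternConst
{
    Pattern       pattern;
    std::uint64_t value;
};

inline bool IsRepeatedZero(const PatternConst& c)
{
    return (c.pattern == Pattern::Repeated) && (c.value == 0);
}

// {any pattern, any value} & {repeated, 0} => {repeated, 0}
// Returns true and writes the folded constant when the rule applies;
// otherwise returns false and leaves *result untouched.
inline bool TryFoldAnd(const PatternConst& lhs, const PatternConst& rhs, PatternConst* result)
{
    if (IsRepeatedZero(lhs) || IsRepeatedZero(rhs))
    {
        *result = PatternConst{Pattern::Repeated, 0};
        return true;
    }
    return false;
}
```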

I can see how SVE has overloaded the meaning of FEATURE_MASKED_HW_INTRINSICS and HW_Flag_ReturnsPerElementMask; really, it's a different feature. I think we'll want to reconcile that in future.

Member


> I decided that rather than require the author of this sort of algorithm to maintain the TYP_MASK consistency for Arm64

We notably handle this scenario on xarch by marking the user-visible API as HW_Flag_InvalidNodeId and having a different internal-only intrinsic ID that is expected to have the mask. For example, we have NI_AVX512_CompareEqual, which matches the managed API surface returning a Vector512<T>, and NI_AVX512_CompareEqualMask, which is the internal API returning a TYP_MASK.

This helps ensure we never produce the vector returning ID in IR (as it triggers an assert when setting the ID here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/gentree.cpp#L30507-L30531), which helps force users to think about the correct shape.


However, beyond that we also have the GetHWIntrinsicIdFor* and GetLookupTypeFor* helpers, which Arm64 isn't participating in right now (part of the deviation that probably shouldn't exist).

Note for example how gtNewSimdCmpOpNode calls GetLookupTypeForCmpOp which forces the type to TYP_MASK if AVX512 is supported, helping to canonicalize the support.

Then GetHWIntrinsicIdForCmpOp handles this and knows to return say NI_AVX512_CompareEqualMask instead of NI_X86Base_CompareEqual for a 128-bit comparison, guaranteeing the IR is correct since we then know to insert CvtMaskToVectorNode since the lookup type (mask) and actual type (simd) mismatch.

If Arm64 just participates in these existing helpers, then there's no need to have a custom visitor or different logic for SVE, it "just works" with all the existing support in the JIT. It also avoids issues if something like gtNewSimdMinMaxNode is used outside of import, which is very possible for many of the other helpers (especially due to morph or other phases doing optimizations).

Member


Generally speaking, I'd expect the minimum changes here to be that Arm64 has GetLookupTypeFor* return TYP_MASK for any SVE intrinsic that is flagged HW_Flag_ReturnsPerElementMask and that it actually returns NI_Sve_* intrinsics from GetHWIntrinsicIdFor* when the simd size is "unknown".

This will cause almost all the existing helpers to light up and participate in all the general optimizations and to be correct by construction, rather than relying on developers to manually ensure any given instantiation is valid.

Contributor Author


The two properties I'm referring to at the moment are 'embedded masks', where the mask operand is not included as part of the C# intrinsic signature, and 'explicit masks' where the mask operand is actually part of the signature. In both cases, the underlying instruction will always require a predicate register as the second operand, and this will apply a mask to the result of the instruction.

I'm referring to the transform with embedded masks, where we wrap the intrinsic with a ConditionalSelect node and set the original node contained within it, during Lowering. The selection has all bits set, as the omission of a mask in the signature is treated as all true. Then codegen expects to see this pattern and generate a single instruction for the pair, using the all true operand from the ConditionalSelect node as the second instruction operand.
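
To make the shape of that wrapping concrete, here is a minimal string-based sketch. It is not the Lowering code: `Node`, `WrapEmbeddedMask`, and the operand spellings `AllTrue`, `ZeroMask` and `ZeroVector` are hypothetical, and the only behaviour it models is what the PR description states, namely that an intrinsic with HW_Flag_ReturnsPerElementMask is wrapped with ConditionalSelect_Predicates and gets a mask-typed constant in op3.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified model of the embedded-mask wrapping.
enum class ValueType { Simd, Mask };

struct Node
{
    std::string text;
    ValueType   type;
};

// returnsPerElementMask stands in for HW_Flag_ReturnsPerElementMask.
// op1 is the all-true selection, op2 the wrapped intrinsic, op3 the zero constant.
inline Node WrapEmbeddedMask(const Node& intrinsic, bool returnsPerElementMask)
{
    const std::string select = returnsPerElementMask ? "ConditionalSelect_Predicates" : "ConditionalSelect";
    const std::string zero   = returnsPerElementMask ? "ZeroMask" : "ZeroVector";
    const ValueType   type   = returnsPerElementMask ? ValueType::Mask : ValueType::Simd;

    return Node{select + "(AllTrue, " + intrinsic.text + ", " + zero + ")", type};
}
```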

Then we also have 'optionally embedded masks' which I think covers instructions that have both predicated and unpredicated forms. I think codegen can handle both the contained pair and just the intrinsic node on its own for these. It may help to simplify this concept and instead have an internal intrinsic for each one that can be switched to when the ConditionalSelect pattern appears, but also it could just be left as is. There are only 6 intrinsics with this flag.

Member


> Then we also have 'optionally embedded masks' which I think covers instructions that have both predicated and unpredicated forms. I think codegen can handle both the contained pair and just the intrinsic node on its own for these. It may help to simplify this concept and instead have an internal intrinsic for each one that can be switched to when the ConditionalSelect pattern appears, but also it could just be left as is. There are only 6 intrinsics with this flag.

So this is the simplest one. The Arm64 impl has diverged a lot here and it's unclear to me whether or not it's actually required to diverge. If it actually must diverge, then I don't have a strong preference on what is done.

For x64, we opted not to introduce several hundred additional IDs and handling just to support masks. Rather, due to how the x64 encoding works, we simply check if the intrinsic is ConditionalSelect (NI_AVX512_BlendVariableMask) and if op2 is an embedded mask operand. If so, we set a couple of instOptions flags and change the node to the underlying contained instruction.

This setup allows us to relegate all masking codegen support to these 64ish lines of code: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiccodegenxarch.cpp#L395-L459. We don't need a bunch of otherwise specialized handling for them.

I would imagine that most of Arm64 could be the same and we shouldn't need to do more than pass down the predicate register being used as part of instOptions.

> The two properties I'm referring to at the moment are 'embedded masks', where the mask operand is not included as part of the C# intrinsic signature, and 'explicit masks' where the mask operand is actually part of the signature. In both cases, the underlying instruction will always require a predicate register as the second operand, and this will apply a mask to the result of the instruction.

For the explicit masks there shouldn't need to be anything done. These should already be intrinsics that take in TYP_MASK operands. On x64 this is just handled on import, for example BlendVariableMask requires that op3 be a TYP_MASK and so we map the NI_AVX512_BlendVariable (managed id) to this internal ID and then insert the CvtVectorToMask node to ensure the IR is correct: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L5028-L5029. Phases like morph which transform Or(CvtMaskToVector(x), CvtMaskToVector(y)) to CvtMaskToVector(Or(x, y)) also ensure these fixups occur.
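
The morph-style rewrite mentioned above can be sketched on a toy expression tree. This is an illustrative model only: `Expr`, `HoistMaskConversion` and the other helpers are hypothetical and unrelated to the JIT's `GenTree` classes. It shows the single local pattern Or(CvtMaskToVector(x), CvtMaskToVector(y)) becoming CvtMaskToVector(Or(x, y)).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical toy IR; not the JIT's GenTree.
enum class Op { Leaf, Or, CvtMaskToVector };

struct Expr
{
    Op                    op;
    std::string           name; // only meaningful for leaves
    std::unique_ptr<Expr> lhs;
    std::unique_ptr<Expr> rhs;
};

using ExprPtr = std::unique_ptr<Expr>;

inline ExprPtr Leaf(std::string name)
{
    return ExprPtr(new Expr{Op::Leaf, std::move(name), nullptr, nullptr});
}

inline ExprPtr Unary(Op op, ExprPtr operand)
{
    return ExprPtr(new Expr{op, "", std::move(operand), nullptr});
}

inline ExprPtr Binary(Op op, ExprPtr lhs, ExprPtr rhs)
{
    return ExprPtr(new Expr{op, "", std::move(lhs), std::move(rhs)});
}

// Or(CvtMaskToVector(x), CvtMaskToVector(y)) => CvtMaskToVector(Or(x, y))
// Any other shape is returned unchanged.
inline ExprPtr HoistMaskConversion(ExprPtr e)
{
    if ((e->op == Op::Or) && (e->lhs->op == Op::CvtMaskToVector) && (e->rhs->op == Op::CvtMaskToVector))
    {
        ExprPtr maskOr = Binary(Op::Or, std::move(e->lhs->lhs), std::move(e->rhs->lhs));
        return Unary(Op::CvtMaskToVector, std::move(maskOr));
    }
    return e;
}

inline std::string ToString(const Expr& e)
{
    switch (e.op)
    {
        case Op::Leaf:
            return e.name;
        case Op::Or:
            return "Or(" + ToString(*e.lhs) + ", " + ToString(*e.rhs) + ")";
        case Op::CvtMaskToVector:
            return "CvtMaskToVector(" + ToString(*e.lhs) + ")";
    }
    return "";
}
```

After the rewrite the Or operates on masks directly, so only one conversion remains at the root.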

Given that, explicit masks really need no changes and they "just work".


For embedded masks, this is unique to Arm64, and it is where I was saying I think it's fine to start carrying that as part of the node, which should allow CSE/VN and other things to all work correctly. We can just attach one additional operand that is a constant true mask.

Contributor Author


> The Arm64 impl has diverged a lot here and it's unclear to me whether or not it's actually required to diverge.

I also don't see why we need to diverge in this case: either we see the ConditionalSelect pattern or we don't, and both should be handled by Lowering and codegen without needing a new flag. I can see that HW_Flag_OptionalEmbeddedMask is referenced mainly in register allocation and codegen, so it might be helping us get around some register encoding restrictions. I'll make a note to look into this further as a future improvement. We've had some changes related to the movprfx instruction recently, so this might be closely related to that work.

> For the explicit masks there shouldn't need to be anything done. These should already be intrinsics that take in TYP_MASK operands.

I think the consideration for SVE is that the predication operand is always the first source operand to a predicated instruction, and we need to know when to trim the CvtMaskToVector on that operand before registers are assigned, so we get a predicate register. With a flag we're able to do this generically. I'm not sure if this situation applies to AVX, or you found another way to get around that?

> For embedded masks, this is unique to Arm64 and where I was saying I think it's fine to start carrying that as part of the node, which should allow CSE/VN and other things to all work correctly. We can just attach one additional operand that is a constant true mask.

I'll also note this as a future improvement. It will likely improve code size, and simplify register allocation and the lowering transforms.

@tannergooding tannergooding May 28, 2026

Member


> I think the consideration for SVE is that the predication operand is always the first source operand to a predicated instruction, and we need to know when to trim the CvtMaskToVector on that operand before registers are assigned, so we get a predicate register. With a flag we're able to do this generically. I'm not sure if this situation applies to AVX, or you found another way to get around that?

This sounds like you're missing the separation between CndSel(vector, vectorWhenTrue, vectorWhenFalse) and CndSel(mask, vectorWhenTrue, vectorWhenFalse)?

On xarch we have NI_Vector128_ConditionalSelect for the former and NI_AVX512_BlendVariableMask for the latter (the parameter ordering differs, but that's largely irrelevant).

In the case that we have ConditionalSelect(CvtMaskToVector(mask), vectorWhenTrue, vectorWhenFalse), we transform it to BlendVariableMask(vectorWhenFalse, vectorWhenTrue, mask) and drop the CvtMaskToVector entirely, so it's never a concern.

We can also make the inverse decision: we can take BlendVariableMask(vectorWhenFalse, vectorWhenTrue, CvtVectorToMask(vector)) and transform it to ConditionalSelect(vector, vectorWhenTrue, vectorWhenFalse) for the cases when the input is truly a vector.

This allows us to select the most optimal between:

  • vpternlog - a single instruction, single cycle way to do (vector & vectorWhenTrue) | (~vector & vectorWhenFalse)
  • vpblendm - a single instruction way to do selection using a mask
  • embedded masking - i.e. marking the vpblendm as containing its vectorWhenTrue so codegen passes along the mask used in the instruction option flags.
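
Why the two forms are interchangeable can be checked with a small scalar model. This is only a sketch: `BitwiseSelect`, `MaskSelect` and `MaskToVector` are hypothetical stand-ins written for illustration, not the instructions or JIT helpers themselves, and the lane count and element type are arbitrary. When the selector is a true per-element mask (each lane AllBitsSet or Zero), the bitwise form and the mask form agree; with a partially set lane they no longer describe the same operation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;
using Vec  = std::array<std::uint32_t, kLanes>;
using Mask = std::array<bool, kLanes>;

// Scalar stand-in for converting a mask to a vector: each lane becomes
// AllBitsSet or Zero.
inline Vec MaskToVector(const Mask& mask)
{
    Vec result{};
    for (std::size_t i = 0; i < kLanes; i++)
    {
        result[i] = mask[i] ? 0xFFFFFFFFu : 0u;
    }
    return result;
}

// Bitwise form: (selector & whenTrue) | (~selector & whenFalse).
inline Vec BitwiseSelect(const Vec& selector, const Vec& whenTrue, const Vec& whenFalse)
{
    Vec result{};
    for (std::size_t i = 0; i < kLanes; i++)
    {
        result[i] = (selector[i] & whenTrue[i]) | (~selector[i] & whenFalse[i]);
    }
    return result;
}

// Mask form: one predicate bit per lane picks the whole element.
inline Vec MaskSelect(const Mask& mask, const Vec& whenTrue, const Vec& whenFalse)
{
    Vec result{};
    for (std::size_t i = 0; i < kLanes; i++)
    {
        result[i] = mask[i] ? whenTrue[i] : whenFalse[i];
    }
    return result;
}
```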

Contributor


I was already working on a task to reuse ptrue results. Please see #128844
