[Torch][LinalgExt] Lower flex_attention masks before online_attention by keshavvinayak01 · Pull Request #24426 · iree-org/iree

keshavvinayak01 · 2026-05-09T10:08:21Z

Lower flex_attention mask_mod by evaluating the mask function over broadcastable index tensors and passing the resulting tensor mask into online_attention.

This fixes an existing GPU compilation failure that I hit when testing flex_attention with causal masks.

The old lowering built Q/K/V indices with iree_linalg_ext.index -> tensor.from_elements. After decomposition, this left a mask-update linalg.generic in the QK path that did not vectorize cleanly, introduced private memory between QK and the reductions, and led to the source layout being null downstream in VectorLayoutAnalysis.
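The idea described above, evaluating mask_mod once over all index positions and handing the result to attention as a plain mask tensor, can be sketched in pure Python. This is only an illustration of the semantics, not the lowering itself: `materialize_mask` and `apply_mask` are hypothetical helper names, the nested loops stand in for what the real lowering does with broadcastable index tensors, and the causal `mask_mod` is the usual `q_idx >= kv_idx` example.

```python
import math


def causal_mask_mod(b, h, q_idx, kv_idx):
    # A typical flex_attention mask_mod: a query may attend to keys at or before it.
    return q_idx >= kv_idx


def materialize_mask(mask_mod, B, H, M, N):
    # Hypothetical helper: evaluate mask_mod at every (b, h, q, kv) position and
    # store the result as a dense [B][H][M][N] boolean mask. The actual lowering
    # gets the same values by running the callback on broadcastable index tensors
    # instead of looping.
    return [
        [
            [[bool(mask_mod(b, h, q, kv)) for kv in range(N)] for q in range(M)]
            for h in range(H)
        ]
        for b in range(B)
    ]


def apply_mask(scores, mask):
    # Hypothetical helper: the attention op consumes the mask as an operand;
    # masked-out scores become -inf so they vanish after softmax.
    return [
        [s if keep else -math.inf for s, keep in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


mask = materialize_mask(causal_mask_mod, B=1, H=1, M=3, N=3)
scores = [[0.5, 0.1, 0.2], [0.3, 0.9, 0.4], [0.7, 0.6, 0.8]]
masked_scores = apply_mask(scores, mask[0][0])
```

For the 3x3 causal case, `mask[0][0]` is lower-triangular, and `masked_scores` keeps only the entries on or below the diagonal. The point of the sketch is that the mask is computed outside the score computation, so the score region never has to call the mask callback per element.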

Lower flex_attention mask_mod callbacks with broadcasted tensor index operands and pass the resulting mask as the online_attention mask operand. This avoids emitting rank-0 tensor payloads inside the score region while leaving score_mod on the existing in-region path. Co-authored-by: Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

Use torch-mlir's shared broadcast shape helper when cloning mask_mod ops with tensor index operands. This keeps the flex_attention mask materialization path aligned with existing Torch broadcast handling. Co-authored-by: Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>
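For readers unfamiliar with what a broadcast shape computation does here, the following is a minimal pure-Python sketch of the standard NumPy/PyTorch broadcasting rule. It is not torch-mlir's helper; `broadcast_shape` is a hypothetical name, and the index tensor shapes in the usage line are assumed for illustration (one non-unit dimension each for batch, head, query, and key/value).

```python
def broadcast_shape(*shapes):
    # Right-align all shapes, pad missing leading dims with 1, then merge
    # dimension by dimension: sizes must match or one of them must be 1.
    rank = max(len(shape) for shape in shapes)
    result = [1] * rank
    for shape in shapes:
        padded = (1,) * (rank - len(shape)) + tuple(shape)
        for i, dim in enumerate(padded):
            if dim == 1:
                continue
            if result[i] == 1:
                result[i] = dim
            elif result[i] != dim:
                raise ValueError(f"shapes {shapes} are not broadcastable")
    return tuple(result)


# Assumed shapes for the four index operands of a mask_mod callback.
mask_shape = broadcast_shape((2, 1, 1, 1), (1, 4, 1, 1), (1, 1, 8, 1), (1, 1, 1, 16))
```

With these assumed shapes, `mask_shape` is `(2, 4, 8, 16)`: each index tensor contributes one dimension, and any elementwise op cloned from the callback produces a result of the full broadcast shape.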

Add a reduced generic-vectorization test for the broadcastable index tensor shape emitted by flex_attention mask_mod lowering. Co-authored-by: Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

sommerlukas

Implementation looks OK to me, just a few nits.

Would be good to get eyes on this from somebody with more experience in this area.

Co-authored-by: Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

sommerlukas

Implementation LGTM, please wait for another review from someone with experience in this area, e.g., @IanWood1 or @rsuderman.

rsuderman

You need more tests. So far the additions are extremely minimal despite a substantial change to how the lowering works.

Factor the single-block mask callback clone/remap logic into a helper, add match-failure diagnostics for mask lowering failures, and extend lit coverage for mask broadcasting and GQA mask lowering. Co-authored-by: GPT-5 <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

Run a module-level barrier after the flex_attention conversion pass so callback functions are not rewritten by later per-function Torch conversions before their users are processed. Add a full torch-to-iree test where mask_mod appears before the flex_attention user, covering the previous bool mask legalization failure. Co-authored-by: GPT-5 Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

Avoid making the module-level barrier comment specific to flex_attention; the barrier separates unstructured conversion from later per-function Torch conversions generally. Co-authored-by: GPT-5 Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

rsuderman

Look at how other passes handle failure. I see a lot of return failure(); statements, which can make it difficult for future developers to debug what went wrong.

Use notifyMatchFailure for flex_attention callback and preprocessing rejects so failed rewrites produce actionable diagnostics instead of silent pattern failure. Co-authored-by: Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

rsuderman · 2026-05-18T17:09:09Z

      TorchInput::createConvertTorchUnstructuredToLinalgExtPass());
+  // Keep this as a module-level barrier so unstructured conversions finish
+  // across the module before later per-function Torch conversions run.
+  pm.addPass(createCanonicalizerPass());


This is a red flag to me. Why do you need a canonicalization pass here? Typically we treat adding passes as concerning.

I added the canonicalizer as a module-level barrier because I was worried about the nested per-function pipeline converting a flex_attention callback function before the function containing the hop_flex_attention user had lowered and inlined it.

But in the actual lowering, which is emitted top-down from torch's invocation, we're safe for now.

Remove the canonicalizer barrier that was only used to force module ordering between flex attention callbacks and their users. Also drop the torch-to-iree test that covered that artificial ordering case. Co-authored-by: OpenAI Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

keshavvinayak01 force-pushed the users/keshavvinayak01/scalarize-online-attention-mask-region branch from c3dcb86 to f418178 Compare May 11, 2026 06:25

keshavvinayak01 force-pushed the users/keshavvinayak01/scalarize-online-attention-mask-region branch from f418178 to 0fcaf28 Compare May 11, 2026 07:02

keshavvinayak01 and others added 3 commits May 11, 2026 12:50

[Codegen] Test flex attention mask vectorization

2637c57

Add a reduced generic-vectorization test for the broadcastable index tensor shape emitted by flex_attention mask_mod lowering. Co-authored-by: Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

Comments fix

44e4e87

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

keshavvinayak01 marked this pull request as ready for review May 11, 2026 09:12

keshavvinayak01 requested review from MaheshRavishankar, Max191, hanhanW and qedawkins as code owners May 11, 2026 09:12

keshavvinayak01 requested review from Groverkss, IanWood1 and sommerlukas May 11, 2026 09:12

sommerlukas reviewed May 11, 2026

[Torch] Address flex_attention mask review nits

fc1a746

Co-authored-by: Codex <noreply@openai.com> Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

keshavvinayak01 requested a review from sommerlukas May 11, 2026 12:31

sommerlukas approved these changes May 11, 2026

rsuderman requested changes May 12, 2026

keshavvinayak01 requested a review from rsuderman May 13, 2026 05:50

keshavvinayak01 and others added 2 commits May 13, 2026 17:05

keshavvinayak01 mentioned this pull request May 13, 2026

[WIP] [fusilli] Emit flex_attention for SDPA iree-org/fusilli#413

Draft

hanhanW reviewed May 13, 2026

Comment thread compiler/src/iree/compiler/Codegen/Common/test/generic_vectorization_masked_configured.mlir

[Torch] Remove flex attention mask vectorization test

e23f612

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

keshavvinayak01 force-pushed the users/keshavvinayak01/scalarize-online-attention-mask-region branch from c0f2974 to e23f612 Compare May 13, 2026 17:30

rsuderman requested changes May 14, 2026

Comment thread compiler/plugins/input/Torch/InputConversion/ConvertTorchUnstructuredToLinalgExt.cpp Outdated

keshavvinayak01 requested a review from zjgarvey as a code owner May 14, 2026 17:09

Addressing comments: Better errmsgs

068b6ad

Signed-off-by: Keshav Vinayak Jha <keshavvinayakjha@gmail.com>

keshavvinayak01 requested a review from rsuderman May 14, 2026 21:39

rsuderman reviewed May 18, 2026

rsuderman requested changes May 18, 2026

keshavvinayak01 requested a review from rsuderman May 18, 2026 18:30

keshavvinayak01 wants to merge 12 commits into iree-org:main from keshavvinayak01:users/keshavvinayak01/scalarize-online-attention-mask-region

4 participants
