feat: flex-attention backend #1161
Draft
cathalobrien wants to merge 4 commits into
…sh attention v4 backend
HCookie
approved these changes
Jun 22, 2026
HCookie
left a comment
Member
Awesome work, looks good to me
# Try import flash attention v4
# if this is avilable it can be used as a backend for flex attention which gives approx 2x performance
# One reason to use flex attention with the flash attewntion v4 backend, ratehr then using flash attention v4 directly, is
Member
Suggested change
- # One reason to use flex attention with the flash attewntion v4 backend, ratehr then using flash attention v4 directly, is
+ # One reason to use flex attention with the flash attention v4 backend, rather then using flash attention v4 directly, is
# if this is avilable it can be used as a backend for flex attention which gives approx 2x performance
# One reason to use flex attention with the flash attewntion v4 backend, ratehr then using flash attention v4 directly, is
# flex attentions support for custom block masks.
# if flash attention is not available then the trion backend will be used for flex attention
Member
Suggested change
- # if flash attention is not available then the trion backend will be used for flex attention
+ # if flash attention is not available then the triton backend will be used for flex attention
This PR adds a flex attention backend to the attention layers.
Flex attention has been available in PyTorch since ~2.6, but it wasn't stable for our use case. It now seems more stable (I tested aifs-single v2 with torch 2.11 and it worked fine).
The specific use case here was a user running on CPU who had installed the Triton CPU build and wanted to use flex attention rather than SDPA.
I have also enabled the flash attention v4 backend of flex attention: the code checks whether flash attention v4 is installed and, if it is, passes the corresponding kernel options to flex attention so that it uses that backend.
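For reviewers, the availability check described above could look roughly like the sketch below. It is not the code in this PR: the module name `flash_attn.cute` and the helper names `module_available` and `select_flex_backend` are assumptions made for illustration, and the mapping from the returned backend name to the actual flex attention kernel options is deliberately left out.

```python
import importlib.util

# Assumption: the import path under which flash attention v4 is exposed.
# This name is illustrative and is not taken from the PR.
ASSUMED_FA4_MODULE = "flash_attn.cute"


def module_available(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current Python path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError (an ImportError) when a parent
        # package of a dotted name is missing.
        return False


def select_flex_backend(module_name: str = ASSUMED_FA4_MODULE) -> str:
    """Pick a backend label: 'flash' if the module is installed, else 'triton'."""
    return "flash" if module_available(module_name) else "triton"


if __name__ == "__main__":
    print(f"flex attention backend: {select_flex_backend()}")
```

Locating the module with `find_spec` keeps the check cheap, and the Triton fallback matches the behaviour described in the code comments under review.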
Throughput is slightly lower than flash attention v2 (0.34 it/s vs 0.38 it/s), but much higher than SDPA.
I added a correctness test that compares the output against SDPA, and a small benchmark test that compares all three attention kernels standalone.
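As a rough illustration of how a standalone throughput comparison can be structured, here is a minimal timing sketch. It is not the benchmark added in this PR: `iterations_per_second` and `relative_slowdown` are hypothetical helpers, and a real GPU benchmark would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time
from typing import Callable


def iterations_per_second(fn: Callable[[], object], iters: int = 50, warmup: int = 5) -> float:
    """Time `fn` over `iters` calls after `warmup` untimed calls; return it/s."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from an extremely fast callable.
    return iters / max(elapsed, 1e-12)


def relative_slowdown(candidate: float, baseline: float) -> float:
    """Fraction by which `candidate` throughput falls short of `baseline`."""
    return 1.0 - candidate / baseline


if __name__ == "__main__":
    # Stand-in workload; a real run would call each attention kernel here.
    rate = iterations_per_second(lambda: sum(range(1000)))
    print(f"{rate:.1f} it/s")
    # Using the figures quoted above: flex attention vs flash attention v2.
    print(f"{relative_slowdown(0.34, 0.38):.1%} slower")
```

With the figures quoted above, the flex attention path comes out a little over 10% slower than flash attention v2.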