perf(attn): MUSA decode-path fixes — mixed-batch prefill routing + batch-scaled KV splits by zichen-kuuga · Pull Request #85 · MooreThreads/vllm-musa

zichen-kuuga · 2026-06-30T07:34:24Z

Summary

Two FlashAttention decode-path fixes for MUSA, targeting Qwen3-class
dense models served under FULL_DECODE_ONLY CUDA graphs.

- Mixed-batch prefill routing. A continuous-batching step that mixes
  decode and no-prefix prefill used to send all attention through the slow
  paged FmhaFwd kernel, because the fast path required a pure-prefill step.
  The prefill portion now attends its contiguous new K/V and dispatches to
  mate's fast varlen_causal kernel.
- Batch-scaled decode KV splits. The graph fixed the KV-split count at
  32 for all batch sizes, which over-partitions at larger batch sizes. The
  count is now scaled per captured batch size.
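The batch-scaling idea in the second fix can be sketched as below. This is a minimal illustration, not the heuristic used in this PR: the function name `scaled_kv_splits`, the formula, and the example values (`num_kv_heads=8`, `sm_count=256`) are all assumptions chosen only to show how a fixed cap of 32 can shrink as the captured batch grows.

```python
import math


def scaled_kv_splits(batch_size: int, num_kv_heads: int, sm_count: int,
                     max_splits: int = 32) -> int:
    """Hypothetical heuristic: pick a KV-split count so that
    batch * heads * splits roughly covers the available SMs."""
    if batch_size < 1 or num_kv_heads < 1 or sm_count < 1:
        raise ValueError("batch_size, num_kv_heads and sm_count must be positive")
    # Independent work items that exist before splitting along the KV length.
    base_work = batch_size * num_kv_heads
    # Enough splits to give each SM at least one work item, capped at max_splits.
    splits = math.ceil(sm_count / base_work)
    return max(1, min(max_splits, splits))


# One split count per captured graph batch size (all values illustrative).
capture_sizes = [1, 2, 4, 8, 16]
splits_per_batch = {
    bs: scaled_kv_splits(bs, num_kv_heads=8, sm_count=256) for bs in capture_sizes
}
print(splits_per_batch)
```

With a per-batch table like this, each captured graph can bake in its own split count instead of sharing one value tuned for batch size 1.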

Measurements (Qwen3-8B, TP1, graph on, `vllm bench serve`, 2500-in)

| metric | before | after |
|---|---|---|
| decode TPOT, bs=8 | 31.6 ms | 22.9 ms (−27%) |
| output throughput, bs=8 | 235 tok/s | 315 tok/s (+34%) |
| decode TPOT, bs=1 | 16.98 ms | 16.93 ms (no regression) |
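The relative changes can be recomputed from the before/after columns with a few lines of Python. The numbers are copied from the table; the helper name `relative_change_pct` is only for illustration. The bs=8 TPOT change works out to about −27.5%, which the table reports as −27%.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# (metric, before, after) taken from the measurements table.
rows = [
    ("decode TPOT, bs=8 (ms)", 31.6, 22.9),
    ("output throughput, bs=8 (tok/s)", 235.0, 315.0),
    ("decode TPOT, bs=1 (ms)", 16.98, 16.93),
]

for name, before, after in rows:
    print(f"{name}: {relative_change_pct(before, after):+.1f}%")
```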

augmentcode · 2026-06-30T07:37:55Z

🤖 Augment PR Summary

Summary: Improves MUSA FlashAttention decode-path performance and routing for mixed decode/prefill workloads under FULL_DECODE_ONLY CUDA graphs.

Changes:

- Adds a mixed-batch path that splits a step into decode (paged KV-cache) and no-prefix prefill (contiguous K/V) so prefill can use the fast varlen_causal TCE kernel.
- Scales the captured KV-split count based on observed SM count and (intended) decode batch sizing to avoid over-partitioning at larger batches.
- Adds a device-property lookup for SM count with a fallback default.

Technical Notes: The new routing is gated on no-prefix prefill + causal + non-cascade + mubin-eligible layers; fallback continues to use the paged varlen path for other cases.
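The gating conditions listed above can be illustrated with a small pure-Python predicate. Everything here is a sketch under stated assumptions: `StepFlags`, `choose_prefill_path`, and the returned labels are hypothetical names, decode requests are assumed to be ordered first in the step, and "no-prefix prefill" is modeled as query length equal to total sequence length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepFlags:
    causal: bool
    use_cascade: bool
    mubin_eligible: bool  # hypothetical per-layer eligibility flag


def count_leading_decodes(query_lens: List[int]) -> int:
    """Number of leading requests with query length 1 (treated as decodes)."""
    n = 0
    for q in query_lens:
        if q != 1:
            break
        n += 1
    return n


def choose_prefill_path(query_lens: List[int], seq_lens: List[int],
                        flags: StepFlags) -> str:
    """Return a label for the kernel path the prefill portion would take."""
    num_decodes = count_leading_decodes(query_lens)
    prefill_q = query_lens[num_decodes:]
    prefill_s = seq_lens[num_decodes:]
    if not prefill_q:
        return "decode_only"
    # No prefix: every prefill request's new tokens are its whole sequence,
    # so its K/V is contiguous and no cached prefix has to be read.
    no_prefix = all(q == s for q, s in zip(prefill_q, prefill_s))
    if (no_prefix and flags.causal and not flags.use_cascade
            and flags.mubin_eligible):
        return "varlen_causal"
    # Any other combination falls back to the paged varlen path.
    return "paged_varlen"


flags = StepFlags(causal=True, use_cascade=False, mubin_eligible=True)
print(choose_prefill_path([1, 1, 5], [100, 200, 5], flags))
print(choose_prefill_path([1, 1, 5], [100, 200, 9], flags))
```

The first call has a prefill request with no cached prefix and takes the fast path label; the second has a 4-token prefix on the prefill request and falls back.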

augmentcode

Review completed. 3 suggestions posted.

route mixed-batch prefill and batch-scale the FULL_DECODE_ONLY KV-spl…

3a902f9

…it count

augmentcode Bot reviewed Jun 30, 2026

Comment thread vllm_musa/v1/attention/backends/flash_attn.py

Comment thread vllm_musa/v1/attention/backends/flash_attn.py Outdated

Comment thread vllm_musa/v1/attention/backends/flash_attn.py

uses token count, not decode batch

0243826

zichen-kuuga wants to merge 2 commits into MooreThreads:v0.22.0-dev from zichen-kuuga:fix/attn_perf
