perf(attn): MUSA decode-path fixes — mixed-batch prefill routing + batch-scaled KV splits #85

Open
zichen-kuuga wants to merge 2 commits into MooreThreads:v0.22.0-dev from zichen-kuuga:fix/attn_perf

@zichen-kuuga

Collaborator

Summary

Two FlashAttention decode-path fixes for MUSA, for Qwen3-class
dense models served under FULL_DECODE_ONLY CUDA graphs.

  1. Mixed-batch prefill routing. A continuous-batching step that mixes
    decode and no-prefix prefill requests previously sent all attention
    through the slow paged FmhaFwd kernel, because the fast path required a
    pure-prefill step. The prefill portion now attends over its contiguous
    new K/V and dispatches to mate's fast varlen_causal kernel.
  2. Batch-scaled decode KV splits. The captured graph fixed the KV-split
    count at 32 for all batch sizes, which over-partitions the KV sequence
    at larger batch sizes. The count is now scaled per captured batch size.
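To illustrate the idea behind fix 2, here is a minimal sketch of a batch-scaled split heuristic. It is not the code in this PR: the function names, the `waves` parameter, and the SM count and KV-head count used in the usage note are assumptions chosen for illustration. The sketch aims for a roughly constant number of thread blocks across batch sizes, so the split count shrinks as the batch grows and never exceeds the previous fixed value of 32.

```python
import math


def scaled_kv_splits(batch_size, num_kv_heads, sm_count, max_splits=32, waves=4):
    """Hypothetical heuristic: pick a KV-split count for one captured batch size.

    Assumes one thread block per (request, KV head, split), and targets about
    `waves` blocks per SM. None of these parameters come from the PR.
    """
    if batch_size < 1 or num_kv_heads < 1 or sm_count < 1:
        raise ValueError("batch_size, num_kv_heads and sm_count must be >= 1")
    target_blocks = waves * sm_count
    splits = math.ceil(target_blocks / (batch_size * num_kv_heads))
    # Clamp to [1, max_splits]; 32 was the fixed count before this change.
    return max(1, min(max_splits, splits))


def splits_per_captured_batch(capture_sizes, num_kv_heads, sm_count):
    """Build a lookup table with one split count per captured graph batch size."""
    return {
        bs: scaled_kv_splits(bs, num_kv_heads, sm_count) for bs in capture_sizes
    }
```

With an assumed 64 SMs and 8 KV heads, `splits_per_captured_batch([1, 2, 4, 8], 8, 64)` keeps 32 splits at bs=1 and drops to 4 at bs=8. Because the table is computed once per captured size, each graph can still bake in a constant split count.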

Measurements (Qwen3-8B, TP1, graph on, vllm bench serve, 2500-in)

| metric | before | after |
| --- | --- | --- |
| decode TPOT, bs=8 | 31.6 ms | 22.9 ms (−27%) |
| output throughput, bs=8 | 235 tok/s | 315 tok/s (+34%) |
| decode TPOT, bs=1 | 16.98 ms | 16.93 ms (no regression) |

@augmentcode

augmentcode Bot commented Jun 30, 2026

🤖 Augment PR Summary

Summary: Improves MUSA FlashAttention decode-path performance and routing for mixed decode/prefill workloads under FULL_DECODE_ONLY CUDA graphs.

Changes:

  • Adds a mixed-batch path that splits a step into decode (paged KV-cache) and no-prefix prefill (contiguous K/V) so prefill can use the fast varlen_causal TCE kernel.
  • Scales the captured KV-split count based on observed SM count and (intended) decode batch sizing to avoid over-partitioning at larger batches.
  • Adds a device-property lookup for SM count with a fallback default.

Technical Notes: The new routing is gated on no-prefix prefill + causal + non-cascade + mubin-eligible layers; fallback continues to use the paged varlen path for other cases.
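The gating described in the note above can be sketched roughly as follows. This is a simplified illustration, not the backend's actual code: `StepInfo`, its fields, and `split_mixed_step` are hypothetical names, and the per-request classification (a no-prefix prefill has query length equal to sequence length; a decode has query length 1) is an assumption about how such a step could be partitioned.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step metadata; field names are illustrative only."""
    query_lens: list[int]        # new tokens per request in this step
    seq_lens: list[int]          # total tokens per request (cached + new)
    causal: bool = True
    use_cascade: bool = False
    mubin_eligible: bool = True  # stand-in for the layer eligibility check


def split_mixed_step(step):
    """Return (decode_idx, prefill_idx) if the fast mixed path applies, else None.

    None means: fall back to the paged varlen path for the whole step.
    """
    if not (step.causal and not step.use_cascade and step.mubin_eligible):
        return None
    decode_idx, prefill_idx = [], []
    for i, (q, s) in enumerate(zip(step.query_lens, step.seq_lens)):
        if q == s:
            prefill_idx.append(i)   # no cached prefix: K/V are contiguous
        elif q == 1:
            decode_idx.append(i)    # single new token against the paged cache
        else:
            return None             # prefill with a cached prefix: not eligible
    if not prefill_idx:
        return None                 # pure decode: nothing to reroute
    return decode_idx, prefill_idx
```

For example, a step with query lengths `[1, 1, 5]` and sequence lengths `[10, 7, 5]` splits into decode requests `[0, 1]` and prefill request `[2]`, while any request with a cached prefix sends the whole step to the fallback.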

@augmentcode augmentcode Bot left a comment

Review completed. 3 suggestions posted.

Comment thread vllm_musa/v1/attention/backends/flash_attn.py
Comment thread vllm_musa/v1/attention/backends/flash_attn.py Outdated
Comment thread vllm_musa/v1/attention/backends/flash_attn.py