perf(qwen3): select decode attention path by batch, drop ctx>=1024 gate #437
Merged
The decode attention-path selector gated SplitKv on `max_seq_len >= 1024`, leaving bs=1 mid-context decode on the SM-starved NonPartition kernel (8 CTAs for Qwen3-4B's 8 kv-heads, idling ~170 SMs on a 5090). bs=1 tpot humped to a peak around ctx800 and dropped off a cliff exactly at ctx1024, which showed up as decode latency "jittering with the dataset".

SplitKv's value is filling the SMs, which depends on CTA count (bs x kv-heads) vs SM count, not context length -- at bs=1 the GPU is underfilled at any seq_len. Drop the context gate; keep the `padded_bs <= 32` cap as a coarse CTA-vs-SM proxy. `attention_path` no longer reads instance state and becomes an associated function (matching `graph_index`).

Measured (random no-prefix, vllm bench serve, output-len 64):
- 5090 bs=1: flat 5.78-5.87ms vs humped 5.95-6.95 (-16% @ctx800)
- 5070 Ti bs=1: 11.72->10.84 @ctx800 (-7.5%)
- 5090 bs>1: bs4 ctx800 -8.5%, bs16 even, bs32 +0.9% (saturation corner)

Verified: hf_golden_gate passes with small-context sequences now on SplitKv (head delta at bf16 noise level).

Details in docs/models/qwen3/decode-attention.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
bs=1 decode tpot "jittered with the dataset": different prompts landed on different points of a latency hump. Root cause: the decode attention-path selector gated SplitKv on `max_seq_len >= 1024`, leaving bs=1 mid-context decode on the SM-starved NonPartition kernel.

NonPartition issues one CTA per `(request × kv-head)`. Qwen3-4B has 8 kv-heads, so bs=1 launches only 8 CTAs, idling ~170 SMs on a 5090. tpot climbed with context to a peak near ctx800, then dropped off a cliff exactly at ctx1024, where the old gate flipped to SplitKv.

(vLLM's FlashInfer decode is a flat ~6.0ms here; this PR puts openinfer below it everywhere.)
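The occupancy arithmetic behind this can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the SM count of 170 for the 5090 is an assumed value passed in as a parameter, and the idle-SM figure is a simple lower bound (SMs that have no CTA to run because fewer CTAs than SMs were launched).

```rust
/// CTAs launched by a kernel that issues one CTA per (request, kv-head) pair.
/// Hypothetical helper, not a function from the codebase.
fn non_partition_ctas(batch_size: u32, kv_heads: u32) -> u32 {
    batch_size * kv_heads
}

/// Lower bound on SMs left with no CTA: when fewer CTAs than SMs are
/// launched, the difference cannot be scheduled onto anything.
fn idle_sms(ctas: u32, sm_count: u32) -> u32 {
    sm_count.saturating_sub(ctas)
}

fn main() {
    // Assumption: 170 SMs for the 5090. The 8 kv-heads are Qwen3-4B's count
    // as stated in the description.
    let sm_count = 170;
    let kv_heads = 8;
    for bs in [1, 4, 16, 32] {
        let ctas = non_partition_ctas(bs, kv_heads);
        println!("bs={bs:>2}: {ctas:>3} CTAs, {:>3} idle SMs", idle_sms(ctas, sm_count));
    }
}
```

Note that context length does not appear anywhere in this calculation, which is the point of the next section.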
Why context was the wrong axis
SplitKv's value is filling the SMs, which depends on launched CTA count (`bs × kv-heads`) vs SM count, not context length. At bs=1 the 8 CTAs underfill the GPU at any seq_len, so SplitKv should always be used there. The `>=1024` gate mistook "GPU not full" for "context not long enough"; those are different things.

Change
- Drop the `SPLIT_KV_MIN_SEQ_LEN = 1024` gate; select on `padded_bs <= 32` only (a coarse CTA-vs-SM proxy).
- `attention_path` no longer reads instance state, so it becomes an associated function (matches `graph_index`).

Measurements (real `vllm bench serve`, random no-prefix, output-len 64)

Two-card bs=1: both flatten, same shape.
5090 bs>1: the transition lands exactly at CTA-count ≈ SM-count.

bs≤8 wins big, bs=16 is even, and bs=32 is a <1% regression (~0.1ms): at 256 CTAs NonPartition already fills the SMs and SplitKv pays for a merge it doesn't need. It's at the noise floor, in the throughput-saturated regime where latency is least sensitive, and a saturated batch usually has long context (>1024), where the old gate already chose SplitKv. Keeping the `<= 32` cap rather than an SM-aware rule is deliberate: a runtime SM-count rule would make `attention_path` GPU-dependent and pollute the CUDA-graph cache key, for a 0.1ms corner.

Verification
- `hf_golden_gate` passes with small-context sequences now routed entirely through SplitKv (head delta at bf16 noise level).
- [...] (`padded_bs×64`) is context-independent, so capture/replay is safe across any context.

Full writeup: `docs/models/qwen3/decode-attention.md`.

🤖 Generated with Claude Code
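For readers skimming the description, the selection rule before and after this change might look roughly like the sketch below. It is a reading of the text, not the actual diff: `DecodeSelector`, `AttentionPath`, `attention_path_old`, and `SPLIT_KV_MAX_PADDED_BS` are hypothetical names, and the assumption that the old rule also applied the batch cap is inferred from "select on `padded_bs <= 32` only". Only `attention_path`, `SPLIT_KV_MIN_SEQ_LEN`, the two thresholds, and the kernel names come from this PR.

```rust
/// Decode attention kernels named in this PR (enum name is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AttentionPath {
    SplitKv,
    NonPartition,
}

/// Batch cap kept by the PR (hypothetical constant name).
const SPLIT_KV_MAX_PADDED_BS: usize = 32;
/// Context threshold removed by the PR.
const SPLIT_KV_MIN_SEQ_LEN: usize = 1024;

/// Hypothetical stand-in for the type that owns the selector.
struct DecodeSelector;

impl DecodeSelector {
    /// Old rule as read from the description: SplitKv also required long
    /// context. Whether it took exactly these arguments is an assumption.
    fn attention_path_old(padded_bs: usize, max_seq_len: usize) -> AttentionPath {
        if padded_bs <= SPLIT_KV_MAX_PADDED_BS && max_seq_len >= SPLIT_KV_MIN_SEQ_LEN {
            AttentionPath::SplitKv
        } else {
            AttentionPath::NonPartition
        }
    }

    /// New rule: batch size only. It takes no `self`, so it is an
    /// associated function rather than a method.
    fn attention_path(padded_bs: usize) -> AttentionPath {
        if padded_bs <= SPLIT_KV_MAX_PADDED_BS {
            AttentionPath::SplitKv
        } else {
            AttentionPath::NonPartition
        }
    }
}

fn main() {
    // bs=1 at ctx800: the case that sat on the latency hump.
    println!("old: {:?}", DecodeSelector::attention_path_old(1, 800));
    println!("new: {:?}", DecodeSelector::attention_path(1));
}
```

Because the new rule depends only on `padded_bs`, the chosen path is the same for every context length at a given padded batch size, which fits the capture/replay argument in the verification notes.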