perf(qwen3): select decode attention path by batch, drop ctx>=1024 gate#437

Merged: xiaguan merged 1 commit into main from fix/qwen3-decode-splitkv-threshold on Jun 22, 2026

xiaguan (Collaborator) commented Jun 22, 2026

Problem

bs=1 decode tpot "jittered with the dataset" — different prompts landed on different points of a latency hump. Root cause: the decode attention-path selector gated SplitKv on max_seq_len >= 1024, leaving bs=1 mid-context decode on the SM-starved NonPartition kernel.

NonPartition issues one CTA per (request × kv-head). Qwen3-4B has 8 kv-heads, so bs=1 launches only 8 CTAs, leaving roughly 160 of the 5090's 170 SMs idle. tpot climbed with context to a peak near ctx800, then fell off a cliff exactly at ctx1024, where the old gate flipped to SplitKv:

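| 5090 bs=1 tpot (ms) | ctx128 | ctx300 | ctx500 | ctx800 | ctx1000 | ctx1024 | ctx1100 |
|---|---|---|---|---|---|---|---|
| gate >=1024 | 5.95 | 6.15 | 6.44 | 6.95 | 6.43 | 5.98 | 5.99 |
| no gate (this PR) | 5.78 | 5.78 | 5.80 | 5.84 | 5.86 | 5.86 | 5.87 |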

(vLLM's FlashInfer decode is a flat ~6.0ms here — this PR puts openinfer below it everywhere.)

Why context was the wrong axis

SplitKv's value is filling the SMs, which depends on launched CTA count (bs × kv-heads) vs SM count — not context length. At bs=1 the 8 CTAs underfill the GPU at any seq_len, so SplitKv should always be used there. The >=1024 gate mistook "GPU not full" for "context not long enough"; those are different things.
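
The occupancy argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 170-SM figure for the 5090 is an assumption taken from the "~170 SMs" remark in this description, and it uses the simplifying assumption of one resident CTA per SM. Note that sequence length does not appear anywhere in the calculation.

```rust
/// CTAs launched by a kernel that issues one CTA per (request x kv-head),
/// as described for NonPartition. Hypothetical helper, not openinfer code.
fn non_partition_ctas(batch_size: u32, kv_heads: u32) -> u32 {
    batch_size * kv_heads
}

/// Fraction of SMs that receive at least one CTA, capped at 1.0.
/// Simplification (assumption): at most one resident CTA per SM.
fn sm_fill(ctas: u32, sm_count: u32) -> f64 {
    f64::from(ctas.min(sm_count)) / f64::from(sm_count)
}
```

With 8 kv-heads and 170 SMs, `non_partition_ctas(1, 8)` is 8 and `sm_fill(8, 170)` is under 5%, whereas bs=32 gives 256 CTAs and a fill of 1.0, which is where a split-KV merge stops paying for itself.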

Change

  • Drop the SPLIT_KV_MIN_SEQ_LEN = 1024 gate; select on padded_bs <= 32 only (a coarse CTA-vs-SM proxy).
  • attention_path no longer reads instance state → associated function (matches graph_index).
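
A minimal sketch of the selector before and after, under stated assumptions: `AttentionPath`, `DecodeRunner`, `SPLIT_KV_MAX_PADDED_BS`, and `attention_path_old` are hypothetical names, and the old rule is reconstructed from this description (batch cap AND context gate) rather than copied from the repository. Only `attention_path`, `padded_bs`, `max_seq_len`, `SPLIT_KV_MIN_SEQ_LEN`, and the two path names come from the PR text.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AttentionPath {
    NonPartition,
    SplitKv,
}

/// Context gate removed by this PR.
const SPLIT_KV_MIN_SEQ_LEN: usize = 1024;
/// Hypothetical name for the `padded_bs <= 32` cap.
const SPLIT_KV_MAX_PADDED_BS: usize = 32;

/// Hypothetical owner type; the real one is not shown in the PR.
struct DecodeRunner;

impl DecodeRunner {
    /// Assumed shape of the old rule: SplitKv only for small batches
    /// AND contexts of at least 1024 tokens.
    fn attention_path_old(padded_bs: usize, max_seq_len: usize) -> AttentionPath {
        if padded_bs <= SPLIT_KV_MAX_PADDED_BS && max_seq_len >= SPLIT_KV_MIN_SEQ_LEN {
            AttentionPath::SplitKv
        } else {
            AttentionPath::NonPartition
        }
    }

    /// New rule: batch size only. It takes no `self`, so it is an
    /// associated function and its result cannot depend on instance state.
    fn attention_path(padded_bs: usize) -> AttentionPath {
        if padded_bs <= SPLIT_KV_MAX_PADDED_BS {
            AttentionPath::SplitKv
        } else {
            AttentionPath::NonPartition
        }
    }
}
```

Under this sketch, `DecodeRunner::attention_path_old(1, 800)` returns `NonPartition` (the latency hump), while `DecodeRunner::attention_path(1)` returns `SplitKv` at any context length.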

Measurements (real vllm bench serve, random no-prefix, output-len 64)

Two-card bs=1 — both flatten, same shape:

  • 5090: −16% @ctx800
  • 5070 Ti: 11.72 → 10.84 @ctx800 (−7.5%; smaller card, fewer idle SMs)

5090 bs>1 — the transition lands exactly at CTA-count ≈ SM-count:

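| ctx | bs | NonPartition tpot (ms) | SplitKv tpot (ms) | Δ tpot |
|---|---|---|---|---|
| 800 | 4 | 8.14 | 7.45 | −8.5% |
| 800 | 8 | 8.69 | 8.36 | −3.8% |
| 800 | 16 | 10.81 | 10.77 | −0.4% |
| 800 | 32 | 15.51 | 15.65 | +0.9% |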

bs≤8 wins big, bs=16 even, bs=32 is a <1% regression (~0.1ms): at 256 CTAs NonPartition already fills the SMs and SplitKv pays a merge it doesn't need. It's at the noise floor, in the throughput-saturated regime where latency is least sensitive, and a saturated batch usually has long context (>1024) where the old gate already chose SplitKv. Keeping the <= 32 cap rather than an SM-aware rule is deliberate — a runtime SM-count rule would make attention_path GPU-dependent and pollute the CUDA-graph cache key, for a 0.1ms corner.

Verification

  • hf_golden_gate passes with small-context sequences now routed entirely through SplitKv (head delta at bf16 noise level).
  • Adversarial review confirmed: single-chunk (seq_len<64) SplitKv is numerically identical to NonPartition (the merge reads exactly one partial state); the CUDA-graph grid (padded_bs×64) is context-independent, so capture/replay is safe across any context.
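
To illustrate why a single-chunk merge can be an exact no-op, here is a sketch of one common split-KV merge formulation (log-sum-exp reweighting of per-chunk partial results). It is a CPU model under assumptions, not openinfer's kernel: the `Partial` layout, the field names, and the choice to store each chunk's output already normalized by its own `sum_exp` are all hypothetical.

```rust
/// Partial attention result for one KV chunk (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct Partial {
    max_logit: f32, // maximum attention score seen in the chunk
    sum_exp: f32,   // sum of exp(score - max_logit) over the chunk
    out: Vec<f32>,  // chunk output, normalized by the chunk's own sum_exp
}

/// Merge per-chunk partials into the final attention output.
fn merge(partials: &[Partial]) -> Vec<f32> {
    assert!(!partials.is_empty(), "need at least one partial state");
    let global_max = partials
        .iter()
        .map(|p| p.max_logit)
        .fold(f32::NEG_INFINITY, f32::max);
    // Each chunk's softmax mass, rescaled to the shared maximum.
    let weights: Vec<f32> = partials
        .iter()
        .map(|p| p.sum_exp * (p.max_logit - global_max).exp())
        .collect();
    let total: f32 = weights.iter().sum();
    let mut merged = vec![0.0f32; partials[0].out.len()];
    for (p, w) in partials.iter().zip(&weights) {
        // With one chunk this is sum_exp / sum_exp, i.e. exactly 1.0.
        let scale = w / total;
        for (m, o) in merged.iter_mut().zip(&p.out) {
            *m += scale * o;
        }
    }
    merged
}
```

With exactly one partial, the rescale factor is `exp(0) = 1`, the weight equals the total, and every output element is multiplied by 1.0 and added to 0.0, so the merged vector equals the partial's output exactly. Whether a real kernel is identical in the same way depends on how it orders these operations.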

Full writeup: docs/models/qwen3/decode-attention.md.

🤖 Generated with Claude Code

The decode attention-path selector gated SplitKv on `max_seq_len >= 1024`,
leaving bs=1 mid-context decode on the SM-starved NonPartition kernel
(8 CTAs for Qwen3-4B's 8 kv-heads, idling ~170 SMs on a 5090). bs=1 tpot
humped to a peak around ctx800 and dropped off a cliff exactly at ctx1024,
which showed up as decode latency "jittering with the dataset".

SplitKv's value is filling the SMs, which depends on CTA count (bs x
kv-heads) vs SM count, not context length -- at bs=1 the GPU is underfilled
at any seq_len. Drop the context gate; keep the `padded_bs <= 32` cap as a
coarse CTA-vs-SM proxy. `attention_path` no longer reads instance state and
becomes an associated function (matching `graph_index`).

Measured (random no-prefix, vllm bench serve, output-len 64):
- 5090 bs=1: flat 5.78-5.87ms vs humped 5.95-6.95 (-16% @ctx800)
- 5070 Ti bs=1: 11.72->10.84 @ctx800 (-7.5%)
- 5090 bs>1: bs4 ctx800 -8.5%, bs16 even, bs32 +0.9% (saturation corner)

Verified: hf_golden_gate passes with small-context sequences now on SplitKv
(head delta at bf16 noise level). Details in
docs/models/qwen3/decode-attention.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@xiaguan xiaguan merged commit 8dd7c1d into main Jun 22, 2026
1 check passed
@xiaguan xiaguan deleted the fix/qwen3-decode-splitkv-threshold branch June 22, 2026 09:28