perf(qwen3): select decode attention path by batch, drop ctx>=1024 gate by xiaguan · Pull Request #437 · openinfer-project/openinfer

xiaguan · 2026-06-22T09:21:17Z

Problem

bs=1 decode tpot "jittered with the dataset" — different prompts landed on different points of a latency hump. Root cause: the decode attention-path selector gated SplitKv on max_seq_len >= 1024, leaving bs=1 mid-context decode on the SM-starved NonPartition kernel.

NonPartition issues one CTA per (request × kv-head). Qwen3-4B has 8 kv-heads, so bs=1 launches only 8 CTAs and leaves nearly all of a 5090's ~170 SMs idle. tpot climbed with context to a peak near ctx800, then dropped off a cliff exactly at ctx1024, where the old gate flipped to SplitKv:

| 5090 bs=1 tpot (ms) | ctx128 | 300 | 500 | 800 | 1000 | 1024 | 1100 |
|---|---|---|---|---|---|---|---|
| gate `>=1024` | 5.95 | 6.15 | 6.44 | 6.95 | 6.43 | 5.98 | 5.99 |
| no gate (this PR) | 5.78 | 5.78 | 5.80 | 5.84 | 5.86 | 5.86 | 5.87 |

(vLLM's FlashInfer decode is a flat ~6.0ms here — this PR puts openinfer below it everywhere.)

Why context was the wrong axis

SplitKv's value is filling the SMs, which depends on launched CTA count (bs × kv-heads) vs SM count — not context length. At bs=1 the 8 CTAs underfill the GPU at any seq_len, so SplitKv should always be used there. The >=1024 gate mistook "GPU not full" for "context not long enough"; those are different things.
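The CTA-vs-SM argument can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the PR: the helper names are hypothetical, and the SM count of 170 is taken from the "~170 SMs" figure quoted above.

```python
# Hypothetical helpers illustrating the argument; not part of openinfer.

NUM_KV_HEADS = 8  # Qwen3-4B, per the PR description
NUM_SMS = 170     # assumed from the "~170 SMs on a 5090" figure above


def launched_ctas(batch_size: int, num_kv_heads: int) -> int:
    """NonPartition launches one CTA per (request x kv-head)."""
    return batch_size * num_kv_heads


def sm_fill_ratio(batch_size: int, num_kv_heads: int, num_sms: int) -> float:
    """CTAs launched per SM; values well below 1.0 mean SMs sit idle."""
    return launched_ctas(batch_size, num_kv_heads) / num_sms


if __name__ == "__main__":
    for bs in (1, 4, 8, 16, 32):
        ctas = launched_ctas(bs, NUM_KV_HEADS)
        ratio = sm_fill_ratio(bs, NUM_KV_HEADS, NUM_SMS)
        print(f"bs={bs:>2}  ctas={ctas:>3}  ctas/SM={ratio:.2f}")
```

Context length appears nowhere in these functions, which is the point: the fill ratio at bs=1 is the same at ctx128 and at ctx1024.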

Change

- Drop the `SPLIT_KV_MIN_SEQ_LEN = 1024` gate; select on `padded_bs <= 32` only (a coarse CTA-vs-SM proxy).
- `attention_path` no longer reads instance state → associated function (matches `graph_index`).
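The before/after selection logic can be sketched as follows. This is an illustration in Python rather than the project's actual source: the class, enum, and the `SPLIT_KV_MAX_PADDED_BS` constant name are hypothetical, and the old rule is assumed to have been the conjunction of the batch cap and the context gate, as the wording "select on `padded_bs <= 32` only" suggests.

```python
from enum import Enum


class AttentionPath(Enum):
    SPLIT_KV = "split_kv"
    NON_PARTITION = "non_partition"


SPLIT_KV_MIN_SEQ_LEN = 1024   # the gate removed by this PR
SPLIT_KV_MAX_PADDED_BS = 32   # the cap that stays (constant name is hypothetical)


def old_attention_path(padded_bs: int, max_seq_len: int) -> AttentionPath:
    """Assumed shape of the old rule: batch cap AND context gate."""
    if padded_bs <= SPLIT_KV_MAX_PADDED_BS and max_seq_len >= SPLIT_KV_MIN_SEQ_LEN:
        return AttentionPath.SPLIT_KV
    return AttentionPath.NON_PARTITION


class DecodeRunner:  # hypothetical stand-in for the owning type
    @staticmethod
    def attention_path(padded_bs: int) -> AttentionPath:
        """New rule: depends only on the padded batch size, no instance state."""
        if padded_bs <= SPLIT_KV_MAX_PADDED_BS:
            return AttentionPath.SPLIT_KV
        return AttentionPath.NON_PARTITION


if __name__ == "__main__":
    # bs=1 at ctx800 is the case that used to land on NonPartition.
    print(old_attention_path(1, 800), DecodeRunner.attention_path(1))
```

Because the new rule takes no `self` and no context length, its result is a pure function of `padded_bs`, which is what lets it be an associated function.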

Measurements (real `vllm bench serve`, random no-prefix, output-len 64)

Two-card bs=1 — both flatten, same shape:

- 5090: −16% @ctx800
- 5070 Ti: 11.72 → 10.84 @ctx800 (−7.5%; smaller card, fewer idle SMs)

5090 bs>1: the crossover lands where CTA count ≈ SM count (between bs=16 at 128 CTAs and bs=32 at 256 CTAs):

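| ctx | bs | NonPartition tpot (ms) | SplitKv tpot (ms) | Δ tpot |
|---|---|---|---|---|
| 800 | 4 | 8.14 | 7.45 | −8.5% |
| 800 | 8 | 8.69 | 8.36 | −3.8% |
| 800 | 16 | 10.81 | 10.77 | −0.4% |
| 800 | 32 | 15.51 | 15.65 | +0.9% |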

bs≤8 wins big, bs=16 even, bs=32 is a <1% regression (~0.1ms): at 256 CTAs NonPartition already fills the SMs and SplitKv pays a merge it doesn't need. It's at the noise floor, in the throughput-saturated regime where latency is least sensitive, and a saturated batch usually has long context (>1024) where the old gate already chose SplitKv. Keeping the <= 32 cap rather than an SM-aware rule is deliberate — a runtime SM-count rule would make attention_path GPU-dependent and pollute the CUDA-graph cache key, for a 0.1ms corner.
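The Δ tpot figures can be reproduced from the measured latencies quoted above. A small sketch; the function and variable names are illustrative only, and the numbers are copied from the PR's own measurements.

```python
def delta_pct(before_ms: float, after_ms: float) -> float:
    """Relative change in tpot, in percent (negative = faster)."""
    return (after_ms - before_ms) / before_ms * 100.0


# (bs, NonPartition tpot ms, SplitKv tpot ms) at ctx800 on the 5090.
BS_ROWS = [
    (4, 8.14, 7.45),
    (8, 8.69, 8.36),
    (16, 10.81, 10.77),
    (32, 15.51, 15.65),
]

if __name__ == "__main__":
    for bs, non_partition_ms, split_kv_ms in BS_ROWS:
        change = delta_pct(non_partition_ms, split_kv_ms)
        print(f"bs={bs:>2}  {non_partition_ms:.2f} -> {split_kv_ms:.2f} ms  ({change:+.1f}%)")
    # bs=1 at ctx800: old gate vs this PR, on each card.
    print(f"5090    bs=1: {delta_pct(6.95, 5.84):+.1f}%")
    print(f"5070 Ti bs=1: {delta_pct(11.72, 10.84):+.1f}%")
```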

Verification

- `hf_golden_gate` passes with small-context sequences now routed entirely through SplitKv (head delta at bf16 noise level).
- Adversarial review confirmed: single-chunk (seq_len<64) SplitKv is numerically identical to NonPartition (the merge reads exactly one partial state); the CUDA-graph grid (padded_bs×64) is context-independent, so capture/replay is safe across any context.
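The single-chunk claim can be illustrated with a scalar model of split-KV attention. This is a sketch of the general partial-softmax merge technique, not the project's kernel: values are scalars instead of head-dim vectors, the chunk size of 64 is inferred from the "seq_len<64" remark above, and all function names are hypothetical.

```python
import math


def attend(scores, values):
    """Softmax-weighted sum over one KV span.

    Returns the partial state (output, running max, sum of exponentials).
    """
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m, total


def merge(partials):
    """Combine per-chunk partial states into the full softmax output."""
    m_all = max(m for _, m, _ in partials)
    raw = [l * math.exp(m - m_all) for _, m, l in partials]
    denom = sum(raw)
    # Normalised chunk weights; with a single chunk the weight is exactly 1.0.
    return sum(out * (w / denom) for (out, _, _), w in zip(partials, raw))


def split_kv(scores, values, chunk=64):
    """Attend over fixed-size KV chunks, then merge the partial states."""
    partials = [
        attend(scores[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(scores), chunk)
    ]
    return merge(partials)


if __name__ == "__main__":
    scores = [0.1 * i for i in range(40)]       # seq_len = 40 < 64
    values = [float(i % 7) for i in range(40)]
    reference = attend(scores, values)[0]       # non-partitioned result
    print(split_kv(scores, values, 64) == reference)   # one chunk
    print(abs(split_kv(scores, values, 16) - reference))  # three chunks
```

With one chunk the merge multiplies the lone partial output by a weight of exactly 1.0, so the result matches the non-partitioned path bit for bit in this model; with several chunks the results agree only up to floating-point rounding.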

Full writeup: docs/models/qwen3/decode-attention.md.

🤖 Generated with Claude Code

The decode attention-path selector gated SplitKv on `max_seq_len >= 1024`, leaving bs=1 mid-context decode on the SM-starved NonPartition kernel (8 CTAs for Qwen3-4B's 8 kv-heads, idling ~170 SMs on a 5090). bs=1 tpot humped to a peak around ctx800 and dropped off a cliff exactly at ctx1024, which showed up as decode latency "jittering with the dataset".

SplitKv's value is filling the SMs, which depends on CTA count (bs x kv-heads) vs SM count, not context length -- at bs=1 the GPU is underfilled at any seq_len.

Drop the context gate; keep the `padded_bs <= 32` cap as a coarse CTA-vs-SM proxy. `attention_path` no longer reads instance state and becomes an associated function (matching `graph_index`).

Measured (random no-prefix, vllm bench serve, output-len 64):
- 5090 bs=1: flat 5.78-5.87ms vs humped 5.95-6.95 (-16% @ctx800)
- 5070 Ti bs=1: 11.72->10.84 @ctx800 (-7.5%)
- 5090 bs>1: bs4 ctx800 -8.5%, bs16 even, bs32 +0.9% (saturation corner)

Verified: hf_golden_gate passes with small-context sequences now on SplitKv (head delta at bf16 noise level).

Details in docs/models/qwen3/decode-attention.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

xiaguan merged commit 8dd7c1d into main Jun 22, 2026
1 check passed

xiaguan deleted the fix/qwen3-decode-splitkv-threshold branch June 22, 2026 09:28
