You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Fixed ~512-token prompts, max_output_tokens swept across [64, 256, 1024, 4096]. Separates prefill-bound from decode-bound behaviour. Values show mean ± 95% CI across iterations.
Llama-3.1-8B-Instruct
Throughput (tok/s)
max_tokens
vLLM tok/s
SGLang tok/s
Winner
64
189.2 ± 0.5
191.9 ± 0.3
Tie
256
189.4 ± 0.1
190.3 ± 0.2
Tie
1024
185.1 ± 0.6
186.5 ± 1.0
Tie
4096
158.5 ± 1.3
157.6 ± 6.1
Tie
TTFT P50 (ms)
max_tokens
vLLM TTFT
SGLang TTFT
64
96.2 ± 6.7 ms
69.1 ± 0.8 ms
256
93.3 ± 0.9 ms
69.7 ± 1.3 ms
1024
96.3 ± 2.3 ms
69.1 ± 0.7 ms
4096
113.6 ± 10.5 ms
106.4 ± 1.2 ms
TPOT P50 (ms/token)
max_tokens
vLLM TPOT
SGLang TPOT
64
35.0 ± 0.0 ms
34.8 ± 0.1 ms
256
35.2 ± 0.0 ms
35.0 ± 0.1 ms
1024
36.1 ± 0.0 ms
35.9 ± 0.1 ms
4096
39.3 ± 0.2 ms
38.9 ± 0.0 ms
Findings
Throughput crossover: sglang → vllm at max_tokens=4096
TTFT crossover: no crossover — vllm leads throughout (gap 6.3% at max_tokens=4096)
TTFT at max_tokens=4096: SGLang still leads by 6.3% — TTFT advantage is preserved at long decode.
Phi-4-mini-instruct
Throughput (tok/s)
max_tokens
vLLM tok/s
SGLang tok/s
Winner
64
354.4 ± 0.2
340.1 ± 1.9
vLLM +4%
256
346.2 ± 0.9
333.4 ± 0.1
vLLM +4%
1024
304.7 ± 9.7
322.6 ± 1.9
SGLang +6%
4096
287.2 ± 34.4
293.5 ± 35.1
SGLang +2%
TTFT P50 (ms)
max_tokens
vLLM TTFT
SGLang TTFT
64
55.1 ± 1.0 ms
49.2 ± 2.2 ms
256
56.0 ± 1.7 ms
46.8 ± 0.7 ms
1024
56.4 ± 1.0 ms
47.6 ± 3.7 ms
4096
56.8 ± 2.0 ms
48.3 ± 23.7 ms
TPOT P50 (ms/token)
max_tokens
vLLM TPOT
SGLang TPOT
64
19.0 ± 0.0 ms
19.9 ± 0.0 ms
256
19.4 ± 0.0 ms
20.1 ± 0.0 ms
1024
20.7 ± 0.2 ms
20.9 ± 0.0 ms
4096
20.8 ± 0.5 ms
23.0 ± 4.8 ms
Findings
Throughput crossover: vllm → sglang at max_tokens=1024
TTFT crossover: no crossover — vllm leads throughout (gap 15.1% at max_tokens=4096)
TTFT at max_tokens=4096: SGLang still leads by 15.1% — TTFT advantage is preserved at long decode.
gemma-2-2b-it
Throughput (tok/s)
max_tokens
vLLM tok/s
SGLang tok/s
Winner
64
523.0 ± 34.6
519.1 ± 14.0
Tie
256
493.8 ± 17.5
484.3 ± 1.8
Tie
1024
458.0 ± 11.5
469.7 ± 20.3
SGLang +2%
4096
459.2 ± 26.2
467.0 ± 9.3
Tie
TTFT P50 (ms)
max_tokens
vLLM TTFT
SGLang TTFT
64
42.1 ± 2.1 ms
39.4 ± 3.0 ms
256
36.5 ± 5.2 ms
41.9 ± 6.2 ms
1024
37.3 ± 0.2 ms
37.9 ± 0.2 ms
4096
37.5 ± 0.3 ms
37.9 ± 0.5 ms
TPOT P50 (ms/token)
max_tokens
vLLM TPOT
SGLang TPOT
64
12.9 ± 0.0 ms
13.0 ± 0.0 ms
256
13.2 ± 0.0 ms
13.4 ± 0.2 ms
1024
13.8 ± 0.0 ms
13.7 ± 0.0 ms
4096
13.9 ± 0.1 ms
13.7 ± 0.0 ms
Findings
Throughput crossover: vllm → sglang at max_tokens=1024
TTFT crossover: vllm → sglang at max_tokens=256
TTFT at max_tokens=4096: converged (gap 1.0%) — TTFT advantage does not persist at long decode.
gemma-3-4b-it
Throughput (tok/s)
max_tokens
vLLM tok/s
SGLang tok/s
Winner
64
146.3 ± 23.1
280.8 ± 2.4
SGLang +48%
256
156.7 ± 5.2
289.0 ± 2.3
SGLang +46%
1024
152.7 ± 5.0
274.9 ± 3.3
SGLang +44%
4096
149.4 ± 9.5
269.3 ± 16.5
SGLang +45%
TTFT P50 (ms)
max_tokens
vLLM TTFT
SGLang TTFT
64
128.2 ± 1.6 ms
128.8 ± 11.6 ms
256
126.8 ± 1.8 ms
126.3 ± 0.6 ms
1024
122.6 ± 0.2 ms
100.1 ± 2.6 ms
4096
123.5 ± 2.9 ms
100.2 ± 1.2 ms
TPOT P50 (ms/token)
max_tokens
vLLM TPOT
SGLang TPOT
64
42.1 ± 0.2 ms
22.4 ± 0.0 ms
256
42.6 ± 1.0 ms
22.7 ± 0.1 ms
1024
42.5 ± 0.3 ms
23.4 ± 0.1 ms
4096
42.8 ± 1.6 ms
23.6 ± 0.1 ms
Findings
Throughput crossover: no crossover — sglang leads throughout (gap 44.5% at max_tokens=4096)
TTFT crossover: sglang → vllm at max_tokens=256
TTFT at max_tokens=4096: SGLang still leads by 18.9% — TTFT advantage is preserved at long decode.
gemma-4-E2B-it
Throughput (tok/s)
max_tokens
vLLM tok/s
SGLang tok/s
Winner
64
—
—
—
256
—
—
—
1024
—
—
—
4096
—
—
—
TTFT P50 (ms)
max_tokens
vLLM TTFT
SGLang TTFT
64
95.5 ± 5.6 ms
109.8 ± 0.6 ms
256
95.3 ± 3.1 ms
109.3 ± 1.6 ms
1024
95.2 ± 4.5 ms
110.5 ± 6.6 ms
4096
95.0 ± 6.7 ms
109.2 ± 0.4 ms
TPOT P50 (ms/token)
max_tokens
vLLM TPOT
SGLang TPOT
64
—
—
256
—
—
1024
—
—
4096
—
—
Findings
Throughput crossover: insufficient data
TTFT crossover: no crossover — sglang leads throughout (gap 13.0% at max_tokens=4096)
TTFT at max_tokens=4096: vLLM still leads by 13.0% — TTFT advantage is preserved at long decode.
gemma-4-E4B-it
Throughput (tok/s)
max_tokens
vLLM tok/s
SGLang tok/s
Winner
64
139.4 ± 48.6
132.7 ± 7.3
vLLM +5%
256
157.9 ± 1.3
145.7 ± 6.0
vLLM +8%
1024
159.4 ± 0.5
150.9 ± 0.6
vLLM +5%
4096
159.4 ± 0.1
143.4 ± 1.2
vLLM +10%
TTFT P50 (ms)
max_tokens
vLLM TTFT
SGLang TTFT
64
114.9 ± 1.8 ms
130.3 ± 2.7 ms
256
115.0 ± 1.2 ms
129.0 ± 6.6 ms
1024
115.1 ± 2.5 ms
129.5 ± 2.1 ms
4096
114.6 ± 1.3 ms
130.9 ± 4.4 ms
TPOT P50 (ms/token)
max_tokens
vLLM TPOT
SGLang TPOT
64
40.4 ± 0.6 ms
46.1 ± 1.0 ms
256
40.2 ± 0.6 ms
43.6 ± 1.4 ms
1024
40.3 ± 0.2 ms
42.5 ± 0.4 ms
4096
40.3 ± 0.2 ms
42.3 ± 0.8 ms
Findings
Throughput crossover: no crossover — vllm leads throughout (gap 10.1% at max_tokens=4096)
TTFT crossover: no crossover — sglang leads throughout (gap 12.4% at max_tokens=4096)
TTFT at max_tokens=4096: vLLM still leads by 12.4% — TTFT advantage is preserved at long decode.
Summary: TTFT Advantage at Long Decode (max_tokens=4096)
Does vLLM's TTFT advantage (seen at concurrency=1) survive high output-token budgets? TTFT is determined at the prefill stage, so it should be independent of max_tokens — any divergence here indicates system-level effects (KV memory pressure, scheduler backpressure).