Commit 9cafa6a

add v1.0.0 release notes
1 parent 0527d28 commit 9cafa6a

1 file changed

Lines changed: 96 additions & 0 deletions

RELEASE_NOTES_v1.0.0.md

@@ -0,0 +1,96 @@
1+
# v1.0.0 — Production Release
2+
3+
First production release of the vLLM vs SGLang benchmark harness. The v0.1.0-beta results set (5 models, single GPU class) is now superseded by a fully validated matrix: **16 models**, **10 scenarios**, **4 dedicated extended phases**, and **~530 result files at 100 % success rate** on AWS A10G 24 GB.
4+
5+
## Highlights
6+
7+
- **Benchmark matrix is complete.** 14-model core baseline + 2 Gemma 4 models, each run across 5 core scenarios on both engines, plus speculative decoding (Ngram + Eagle3), variance, concurrency-64, and a decode-length sweep.
8+
- **Four new figures** regenerate from saved result files — no hand-drawn charts.
9+
- **CI is green and enforced.** `ruff check`, `ruff format --check`, `mypy` (24 source files), `pytest` (89 tests), `python -m build`, `twine check`.
10+
- **README reflects reality.** No broken script references, no stale dates, no out-of-date module lists, no phantom follow-ups.
11+
- **Dashboard hardened.** Version bumped to 1.0.0; in-memory job registry now bounded (`DASHBOARD_MAX_JOBS`, default 100) with oldest-terminal eviction instead of unbounded growth.
12+
13+
## Validated benchmark snapshot (2026-04-21)
14+
15+
Environment:
16+
- AWS `g5.2xlarge` (NVIDIA A10G 24 GB)
17+
- vLLM `v0.18.0-cu130`, SGLang `nightly-dev-cu13-20260321`
18+
- 16 models from 2B to 9B, `bfloat16`
19+
- Sequential single-GPU execution
20+
- Source of truth: [`reports/final_benchmark_report_2026-03-31.md`](reports/final_benchmark_report_2026-03-31.md)
21+
22+
### Headline findings
23+
24+
- **vLLM wins TTFT on 13 / 14 core-baseline models** (20–60 % lower than SGLang at concurrency 1). Only Gemma 3 4B flips (SGLang faster by 9 ms) because vLLM needs `--enforce-eager`.
25+
- **vLLM wins small-model throughput** (≤ 4B): +3–12 % on SmolLM3, Phi-3 mini, Phi-4 mini, Gemma 2 2B.
26+
- **Engines converge at 7–9B.** Differences < 3 % on Qwen, Mistral, Llama, Granite, DeepSeek-R1 variants.
27+
- **Gemma 3 4B is SGLang's strongest case:** +77 % peak throughput vs vLLM (149 vs 84 tok/s).
28+
- **Structured generation:** vLLM wins 12 / 14 models; SGLang wins 2.
29+
- **Prefix-sharing TTFT:** SGLang wins 10 / 14 — RadixAttention pays off when prefixes are genuinely shared.
30+
- **Speculative decoding on A10G:** Ngram works on Llama 3.1 8B, Qwen3 8B, and Gemma 4 E4B across both engines. Eagle3 works on Llama 3.1 8B with vLLM only (SGLang + Eagle3 exceeds 24 GB VRAM). Net: spec-dec **hurts** throughput on A10G; expect a reversal on ≥ 40 GB hardware.
31+
- **Goodput (TTFT ≤ 100 ms, TPOT ≤ 35 ms):** vLLM leads on small models (SmolLM3 1.06 rps, Gemma 2 2B 1.37 rps). SGLang leads on 7–9B under concurrent load when TTFT dominates.
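The goodput figures above depend on a joint SLO: a request only counts if it meets both the TTFT and the TPOT limit. The following is a minimal sketch of that idea, not the code in `analysis/goodput.py`. The `RequestSample` record, its field names, and the definition "SLO-compliant requests divided by wall-clock seconds" are assumptions made for illustration; only the default thresholds (100 ms, 35 ms) come from these notes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestSample:
    """Hypothetical per-request record; the field names are assumptions."""

    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token after the first


def goodput_rps(
    samples: Iterable[RequestSample],
    wall_seconds: float,
    ttft_slo_ms: float = 100.0,
    tpot_slo_ms: float = 35.0,
) -> float:
    """Requests per second that satisfy BOTH SLOs (assumed definition)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    good = sum(
        1
        for s in samples
        if s.ttft_ms <= ttft_slo_ms and s.tpot_ms <= tpot_slo_ms
    )
    return good / wall_seconds


# Made-up samples, not benchmark results.
demo = [
    RequestSample(ttft_ms=80, tpot_ms=30),   # meets both limits
    RequestSample(ttft_ms=120, tpot_ms=30),  # misses the TTFT limit
    RequestSample(ttft_ms=90, tpot_ms=40),   # misses the TPOT limit
    RequestSample(ttft_ms=100, tpot_ms=35),  # exactly on both limits, counts
]
demo_goodput = goodput_rps(demo, wall_seconds=4.0)
```

Because the two conditions are combined with `and`, an engine with excellent TPOT but slow first tokens can still score low, which is consistent with the note that TTFT dominates under concurrent load.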
32+
33+
### Extended phases shipped in this release
34+
35+
| Phase | Cells | Iterations | Purpose |
36+
|---|---|---|---|
37+
| Variance subset | 4 models × 5 scenarios × 2 engines | n=5 | 95 % CIs and CV% for reproducibility |
38+
| Concurrency-64 ramp | 4 × 7–9B models × 2 engines | 150 req/level × 6 levels | Saturation + tail-latency behaviour |
39+
| Decode-length sweep | 6 models × 4 output lengths × 2 engines | n=3 | Crossover analysis at max_output_tokens ∈ {64, 256, 1024, 4096} |
40+
| Gemma 4 (E2B + E4B) | 2 models × 5 scenarios × 2 engines + Ngram | n=1 | First published Gemma 4 numbers |
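For the variance subset (n=5), the CV% and 95 % confidence interval can be computed roughly as sketched below. This is an illustration under stated assumptions, not the contents of `analysis/variance_analysis.py`: the function name, the returned keys, and the hard-coded table of Student-t critical values are all hypothetical choices made to keep the example dependency-free.

```python
import math
import statistics

# Two-sided 95 % Student-t critical values keyed by degrees of freedom.
# A small lookup table stands in for a stats library (assumption).
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
}


def summarize_iterations(samples, cv_threshold_pct=5.0):
    """Return mean, CV%, 95 % CI half-width, and a flag for noisy cells."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two iterations")
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError("extend T_CRIT_95 for this sample size")
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    cv_pct = 100.0 * stdev / mean if mean else math.inf
    half_width = T_CRIT_95[df] * stdev / math.sqrt(n)
    return {
        "mean": mean,
        "cv_pct": cv_pct,
        "ci95": (mean - half_width, mean + half_width),
        "flagged": cv_pct > cv_threshold_pct,
    }


# Made-up throughput values for five iterations, not benchmark results.
demo_summary = summarize_iterations([100.0, 102.0, 98.0, 101.0, 99.0])
```

With only five iterations the t critical value (2.776) is noticeably larger than the normal-approximation 1.96, so using the t-distribution widens the interval rather than overstating certainty.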
41+
42+
## What's new since v0.1.0-beta
43+
44+
### Benchmark surface
45+
- **+11 models** to the validated matrix: SmolLM3 3B, Llama 3.2 3B, Phi-4 mini, Gemma 3 4B, DeepSeek-R1-Distill Qwen 7B + Llama 8B, Qwen 2.5 7B, Llama 3.1 8B, Qwen3 8B, Granite 3.3 8B, and both Gemma 4 sizes (E2B, E4B).
46+
- **+5 scenarios** registered: `throughput_ramp_extended`, `decode_length_sweep_{64,256,1024,4096}`.
47+
- **Speculative decoding:** 6 engine variants with one-command Docker Compose profiles (`vllm`, `vllm-eagle3`, `vllm-ngram`, `sglang`, `sglang-eagle3`, `sglang-ngram`).
48+
49+
### Visualizations
50+
- **Four new auto-regenerating figures** under `analysis/generate_*_figure.py`:
51+
- `speculative_decoding.svg` — Llama 3.1 8B + Qwen3 8B + Gemma 4 E4B, baseline vs Ngram vs Eagle3
52+
- `decode_length_sweep.svg` — tokens/sec vs max_output_tokens (6 models, 95 % CI error bars)
53+
- `variance_cv.svg` — CV% per (model × engine × scenario × metric) with 5 % threshold line
54+
- `goodput.svg` — joint TTFT/TPOT SLO goodput per model
55+
- Shared dark-theme style in `analysis/_figure_style.py`.
56+
- Existing core baseline figures remain regenerable via `python -m analysis.generate_final_benchmark_report`.
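The decode-length figure is meant to support a crossover question: at which output length, if any, does the faster engine change? One simple way to detect that is a sign change in the throughput difference between adjacent sweep points. The sketch below assumes that approach; the function name and input shape are hypothetical, and it is not taken from `analysis/decode_length_analysis.py`.

```python
def find_crossovers(lengths, tok_s_a, tok_s_b):
    """Return (left, right) length pairs between which the leader flips.

    A crossover is reported when the throughput difference (a - b) changes
    sign between two adjacent sweep points. Exact ties are ignored here,
    which is a simplifying assumption.
    """
    if not (len(lengths) == len(tok_s_a) == len(tok_s_b)):
        raise ValueError("all inputs must have the same length")
    diffs = [a - b for a, b in zip(tok_s_a, tok_s_b)]
    crossovers = []
    for i in range(1, len(diffs)):
        if diffs[i - 1] * diffs[i] < 0:  # strictly opposite signs
            crossovers.append((lengths[i - 1], lengths[i]))
    return crossovers


# Sweep points from the release notes; throughput values are made up.
sweep = [64, 256, 1024, 4096]
engine_a = [100.0, 98.0, 95.0, 90.0]
engine_b = [96.0, 97.0, 96.0, 93.0]
demo_crossovers = find_crossovers(sweep, engine_a, engine_b)
```

With n=3 per cell, a sign change is only meaningful if the confidence intervals at the two points do not overlap, so a fuller version would compare intervals rather than point estimates.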
57+
58+
### Analysis tooling
59+
- `analysis/variance_analysis.py` — CV% + t-distribution 95 % CIs across iterations; flags claims above the 5 % threshold.
60+
- `analysis/tpot_analysis.py` — per-request TPOT P50/P95/P99 (`(total_ms − ttft_ms) / max(output_tokens − 1, 1)`).
61+
- `analysis/decode_length_analysis.py` — crossover detection across the decode-length sweep.
62+
- `analysis/goodput.py` — configurable joint SLO goodput (defaults: TTFT ≤ 100 ms, TPOT ≤ 35 ms).
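The TPOT formula listed above can be sketched as follows. The formula itself is the one given in these notes; the function names and the nearest-rank percentile method are assumptions, since the notes do not say how `analysis/tpot_analysis.py` computes P50/P95/P99.

```python
import math


def tpot_ms(total_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token, excluding the first token.

    The max(..., 1) guard avoids division by zero for one-token outputs.
    """
    return (total_ms - ttft_ms) / max(output_tokens - 1, 1)


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile (assumed method, chosen for simplicity)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up requests as (total_ms, ttft_ms, output_tokens), not benchmark data.
requests = [(1000.0, 100.0, 10), (250.0, 50.0, 1), (560.0, 60.0, 21), (900.0, 100.0, 41)]
tpots = [tpot_ms(*r) for r in requests]
p50, p95, p99 = (percentile(tpots, p) for p in (50, 95, 99))
```

Subtracting TTFT before dividing keeps prefill cost out of the per-token figure, which is why TPOT and TTFT can be reported as separate SLOs.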
63+
64+
### Release-readiness & quality
65+
- **CI strengthened** to cover `ruff format --check` + full `mypy` pass on `engines/`, `benchmarks/`, `analysis/`, `dashboard/` (24 source files, 0 errors), on top of the existing `ruff check`, `pytest`, `python -m build`, `twine check`.
66+
- **Scenario registry test** aligned with the actual 10 registered scenarios.
67+
- **Dashboard** version synced to package version; `_jobs` dict bounded with oldest-terminal eviction (configurable via `DASHBOARD_MAX_JOBS`).
68+
- **README accuracy pass:** removed a broken reference to a pruned helper script; updated Project Structure to list all 7 `analysis/` modules; added the 5 extended scenarios to the scenario catalog; linked every published report and figure; removed rot-prone `Last updated` stamps.
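The bounded job registry with oldest-terminal eviction could look roughly like the sketch below. Only the `DASHBOARD_MAX_JOBS` variable, its default of 100, and the `_jobs` name come from these notes. The class, the status names, and the choice to let the registry exceed the bound when every job is still live are assumptions, not the dashboard's actual behaviour.

```python
import os

# Assumed names for finished states; the real dashboard may differ.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


class JobRegistry:
    """In-memory job map that evicts the oldest finished job when full."""

    def __init__(self, max_jobs=None):
        if max_jobs is None:
            max_jobs = int(os.environ.get("DASHBOARD_MAX_JOBS", "100"))
        self.max_jobs = max_jobs
        self._jobs = {}  # dicts keep insertion order, so first = oldest

    def add(self, job_id, status="running"):
        self._jobs[job_id] = status
        self._evict()

    def set_status(self, job_id, status):
        if job_id not in self._jobs:
            raise KeyError(job_id)
        self._jobs[job_id] = status  # updating a key keeps its position
        self._evict()

    def _evict(self):
        while len(self._jobs) > self.max_jobs:
            victim = next(
                (j for j, s in self._jobs.items() if s in TERMINAL_STATES),
                None,
            )
            if victim is None:
                break  # only live jobs remain; never drop running work
            del self._jobs[victim]


registry = JobRegistry(max_jobs=2)
registry.add("a")
registry.add("b")
registry.set_status("a", "completed")
registry.add("c")  # over the bound: "a" is the oldest terminal job
after_first_eviction = list(registry._jobs)
registry.add("d")  # over the bound again, but nothing is terminal yet
registry.set_status("b", "failed")  # now "b" can be evicted
after_second_eviction = list(registry._jobs)
```

Evicting only terminal jobs means a running job's state is never lost to make room, at the cost of the bound being soft while all tracked jobs are still live.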
69+
70+
## Known open items (tracked, non-blocking)
71+
72+
- **Llama 3.1 8B SGLang-Eagle3** — blocked on the retired `lmsysorg/sglang:nightly-dev-cu13-20260321-94194537` image; needs a fresh nightly pin before retry.
73+
- **Gemma 4 E2B rerun:** the `single_request_latency` and `throughput_ramp` result files report `total_tokens_generated = 0` for E2B; the other three scenarios are clean. Decode-throughput numbers for E2B Ngram are therefore not published in this release.
74+
- **Dashboard has no auth** — fine for `localhost` and documented in `SECURITY.md`, but a reverse proxy / basic-auth shim is required before exposing the EC2 deployment path to the public internet.
75+
- **SGLang Docker image pin** — same retired nightly in `.env.example` and `docker-compose.yml`. Update to a currently-pullable tag before running a fresh clone.
76+
77+
## Upgrade notes
78+
79+
```bash
80+
git pull
81+
pip install -e ".[dev]"
82+
```
83+
84+
To regenerate every figure from the committed results set:
85+
86+
```bash
87+
python -m analysis.generate_final_benchmark_report # core baseline (5 SVGs)
88+
python -m analysis.generate_spec_decoding_figure
89+
python -m analysis.generate_decode_length_figure
90+
python -m analysis.generate_variance_figure
91+
python -m analysis.generate_goodput_figure
92+
```
93+
94+
## Tag
95+
96+
`v1.0.0`
