| 1 | +# v1.0.0 — Production Release |
| 2 | + |
| 3 | +First production release of the vLLM vs SGLang benchmark harness. The v0.1.0-beta results set (5 models, single GPU class) is now superseded by a fully validated matrix: **16 models**, **10 scenarios**, **4 dedicated extended phases**, and **~530 result files at 100 % success rate** on AWS A10G 24 GB. |
| 4 | + |
| 5 | +## Highlights |
| 6 | + |
| 7 | +- **Benchmark matrix is complete.** 14-model core baseline + 2 Gemma 4 models, each run across 5 core scenarios on both engines, plus speculative decoding (Ngram + Eagle3), variance, concurrency-64, and a decode-length sweep. |
| 8 | +- **Four new figures** regenerate from saved result files — no hand-drawn charts. |
| 9 | +- **CI is green and enforced.** `ruff check`, `ruff format --check`, `mypy` (24 source files), `pytest` (89 tests), `python -m build`, `twine check`. |
| 10 | +- **README reflects reality.** No broken script references, no stale dates, no out-of-date module lists, no phantom follow-ups. |
| 11 | +- **Dashboard hardened.** Version bumped to 1.0.0; in-memory job registry now bounded (`DASHBOARD_MAX_JOBS`, default 100) with oldest-terminal eviction instead of unbounded growth. |
| 12 | + |
| 13 | +## Validated benchmark snapshot (2026-04-21) |
| 14 | + |
| 15 | +Environment: |
| 16 | +- AWS `g5.2xlarge` (NVIDIA A10G 24 GB) |
| 17 | +- vLLM `v0.18.0-cu130`, SGLang `nightly-dev-cu13-20260321` |
| 18 | +- 16 models from 2B to 9B, `bfloat16` |
| 19 | +- Sequential single-GPU execution |
| 20 | +- Source of truth: [`reports/final_benchmark_report_2026-03-31.md`](reports/final_benchmark_report_2026-03-31.md) |
| 21 | + |
| 22 | +### Headline findings |
| 23 | + |
| 24 | +- **vLLM wins TTFT on 13 / 14 core-baseline models** (20–60 % lower than SGLang at concurrency 1). Only Gemma 3 4B flips (SGLang faster by 9 ms) because vLLM needs `--enforce-eager`. |
| 25 | +- **vLLM wins small-model throughput** (≤ 4B): +3–12 % on SmolLM3, Phi-3 mini, Phi-4 mini, Gemma 2 2B. |
| 26 | +- **Engines converge at 7–9B.** Differences < 3 % on Qwen, Mistral, Llama, Granite, DeepSeek-R1 variants. |
| 27 | +- **Gemma 3 4B is SGLang's strongest case:** +77 % peak throughput vs vLLM (149 vs 84 tok/s). |
| 28 | +- **Structured generation:** vLLM wins 12 / 14 models; SGLang wins 2. |
| 29 | +- **Prefix-sharing TTFT:** SGLang wins 10 / 14 — RadixAttention pays off when prefixes are genuinely shared. |
| 30 | +- **Speculative decoding on A10G:** Ngram works on Llama 3.1 8B, Qwen3 8B, and Gemma 4 E4B across both engines. Eagle3 works on Llama 3.1 8B with vLLM only (SGLang + Eagle3 exceeds 24 GB VRAM). Net: speculative decoding **hurts** throughput on A10G; a reversal is expected on ≥ 40 GB hardware but has not been measured in this release. |
| 31 | +- **Goodput (TTFT ≤ 100 ms, TPOT ≤ 35 ms):** vLLM leads on small models (SmolLM3 1.06 rps, Gemma 2 2B 1.37 rps). SGLang leads on 7–9B under concurrent load when TTFT dominates. |
| 32 | + |
| 33 | +### Extended phases shipped in this release |
| 34 | + |
| 35 | +| Phase | Cells | Iterations | Purpose | |
| 36 | +|---|---|---|---| |
| 37 | +| Variance subset | 4 models × 5 scenarios × 2 engines | n=5 | 95 % CIs and CV% for reproducibility | |
| 38 | +| Concurrency-64 ramp | 4 × 7–9B models × 2 engines | 150 req/level × 6 levels | Saturation + tail-latency behaviour | |
| 39 | +| Decode-length sweep | 6 models × 4 output lengths × 2 engines | n=3 | Crossover analysis at max_output_tokens ∈ {64, 256, 1024, 4096} | |
| 40 | +| Gemma 4 (E2B + E4B) | 2 models × 5 scenarios × 2 engines + Ngram | n=1 | First published Gemma 4 numbers | |
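For readers unfamiliar with the variance metrics named in the table, the following is a minimal sketch of how CV% and a t-distribution 95 % CI can be computed for one metric across iterations. It is not the harness's own code: the function name `summarize`, the lookup table `T_CRIT_95`, and the sample readings are all hypothetical, and only the n=3 and n=5 cases from the table are covered.

```python
import math
import statistics

# Two-sided 95 % critical values of Student's t, keyed by degrees of freedom.
# Only the cases needed for n=3 (df=2) and n=5 (df=4) are listed here.
T_CRIT_95 = {2: 4.303, 4: 2.776}


def summarize(samples: list[float]) -> dict[str, float]:
    """Return the mean, CV% and 95 % CI bounds for one metric."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n - 1] * stdev / math.sqrt(n)
    return {
        "mean": mean,
        "cv_pct": 100.0 * stdev / mean,
        "ci95_low": mean - half_width,
        "ci95_high": mean + half_width,
    }


# Hypothetical tokens/sec readings from five iterations of a single cell.
stats = summarize([100.0, 102.0, 98.0, 101.0, 99.0])
print(f"mean={stats['mean']:.1f} cv={stats['cv_pct']:.2f}% "
      f"ci95=[{stats['ci95_low']:.2f}, {stats['ci95_high']:.2f}]")
```

With only five iterations the t critical value (2.776) is noticeably larger than the normal-approximation 1.96, which is why small-sample intervals come out wider.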
| 41 | + |
| 42 | +## What's new since v0.1.0-beta |
| 43 | + |
| 44 | +### Benchmark surface |
| 45 | +- **+11 models** to the validated matrix: SmolLM3 3B, Llama 3.2 3B, Phi-4 mini, Gemma 3 4B, DeepSeek-R1-Distill Qwen 7B + Llama 8B, Qwen 2.5 7B, Llama 3.1 8B, Qwen3 8B, Granite 3.3 8B, and both Gemma 4 sizes (E2B, E4B). |
| 46 | +- **+5 scenarios** registered: `throughput_ramp_extended`, `decode_length_sweep_{64,256,1024,4096}`. |
| 47 | +- **Speculative decoding:** 6 engine variants with one-command Docker Compose profiles (`vllm`, `vllm-eagle3`, `vllm-ngram`, `sglang`, `sglang-eagle3`, `sglang-ngram`). |
| 48 | + |
| 49 | +### Visualizations |
| 50 | +- **Four new auto-regenerating figures** under `analysis/generate_*_figure.py`: |
| 51 | + - `speculative_decoding.svg` — Llama 3.1 8B + Qwen3 8B + Gemma 4 E4B, baseline vs Ngram vs Eagle3 |
| 52 | + - `decode_length_sweep.svg` — tokens/sec vs max_output_tokens (6 models, 95 % CI error bars) |
| 53 | + - `variance_cv.svg` — CV% per (model × engine × scenario × metric) with 5 % threshold line |
| 54 | + - `goodput.svg` — joint TTFT/TPOT SLO goodput per model |
| 55 | +- Shared dark-theme style in `analysis/_figure_style.py`. |
| 56 | +- Existing core baseline figures remain regenerable via `python -m analysis.generate_final_benchmark_report`. |
| 57 | + |
| 58 | +### Analysis tooling |
| 59 | +- `analysis/variance_analysis.py` — CV% + t-distribution 95 % CIs across iterations; flags claims above the 5 % threshold. |
| 60 | +- `analysis/tpot_analysis.py` — per-request TPOT P50/P95/P99 (`(total_ms − ttft_ms) / max(output_tokens − 1, 1)`). |
| 61 | +- `analysis/decode_length_analysis.py` — crossover detection across the decode-length sweep. |
| 62 | +- `analysis/goodput.py` — configurable joint SLO goodput (defaults: TTFT ≤ 100 ms, TPOT ≤ 35 ms). |
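As an illustration of the TPOT formula and the joint SLO defaults listed above, here is a minimal sketch. The `RequestResult` record, its field names, and the definition of goodput as "requests per second that meet both SLOs" are assumptions made for this example; the actual modules may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    # Hypothetical per-request record; the field names are assumptions.
    total_ms: float
    ttft_ms: float
    output_tokens: int


def tpot_ms(r: RequestResult) -> float:
    """Time per output token after the first one, in milliseconds."""
    # max(..., 1) guards against division by zero for one-token outputs.
    return (r.total_ms - r.ttft_ms) / max(r.output_tokens - 1, 1)


def goodput_rps(results: list[RequestResult], wall_clock_s: float,
                ttft_slo_ms: float = 100.0, tpot_slo_ms: float = 35.0) -> float:
    """Requests per second that satisfy the TTFT and TPOT SLOs jointly."""
    good = sum(
        1 for r in results
        if r.ttft_ms <= ttft_slo_ms and tpot_ms(r) <= tpot_slo_ms
    )
    return good / wall_clock_s


# Hypothetical sample: two requests pass, one misses TTFT, one misses TPOT.
results = [
    RequestResult(total_ms=1080.0, ttft_ms=80.0, output_tokens=51),   # passes
    RequestResult(total_ms=2150.0, ttft_ms=150.0, output_tokens=51),  # TTFT too high
    RequestResult(total_ms=2090.0, ttft_ms=90.0, output_tokens=51),   # TPOT too high
    RequestResult(total_ms=60.0, ttft_ms=60.0, output_tokens=1),      # passes
]
print(goodput_rps(results, wall_clock_s=4.0))
```

A request counts toward goodput only if it meets both thresholds, so a run with excellent TTFT but slow decoding can still score poorly.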
| 63 | + |
| 64 | +### Release-readiness & quality |
| 65 | +- **CI strengthened** to cover `ruff format --check` + full `mypy` pass on `engines/`, `benchmarks/`, `analysis/`, `dashboard/` (24 source files, 0 errors), on top of the existing `ruff check`, `pytest`, `python -m build`, `twine check`. |
| 66 | +- **Scenario registry test** aligned with the actual 10 registered scenarios. |
| 67 | +- **Dashboard** version synced to package version; `_jobs` dict bounded with oldest-terminal eviction (configurable via `DASHBOARD_MAX_JOBS`). |
| 68 | +- **README accuracy pass:** removed a broken reference to a pruned helper script; updated Project Structure to list all 7 `analysis/` modules; added the 5 extended scenarios to the scenario catalog; linked every published report and figure; removed rot-prone `Last updated` stamps. |
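The bounded job registry described above can be sketched roughly as follows. Only `DASHBOARD_MAX_JOBS` and its default of 100 come from these notes; the class name `JobRegistry`, the status strings, and the method names are hypothetical and are not taken from the dashboard source.

```python
import os

# DASHBOARD_MAX_JOBS and the default of 100 are from the release notes;
# the status names and the class below are illustrative assumptions.
MAX_JOBS = int(os.environ.get("DASHBOARD_MAX_JOBS", "100"))
TERMINAL = {"completed", "failed", "cancelled"}


class JobRegistry:
    def __init__(self, max_jobs: int = MAX_JOBS) -> None:
        self.max_jobs = max_jobs
        # Plain dicts preserve insertion order, so iteration goes oldest first.
        self._jobs: dict[str, dict] = {}

    def add(self, job_id: str, status: str = "running") -> None:
        self._jobs[job_id] = {"status": status}
        self._evict()

    def set_status(self, job_id: str, status: str) -> None:
        self._jobs[job_id]["status"] = status

    def _evict(self) -> None:
        """Drop the oldest finished jobs until the registry fits the bound."""
        while len(self._jobs) > self.max_jobs:
            victim = next(
                (jid for jid, job in self._jobs.items()
                 if job["status"] in TERMINAL),
                None,
            )
            if victim is None:
                break  # only active jobs remain; those are never evicted
            del self._jobs[victim]


reg = JobRegistry(max_jobs=2)
reg.add("a")
reg.add("b")
reg.set_status("a", "completed")
reg.add("c")  # exceeds the bound, so the oldest terminal job ("a") is evicted
print(list(reg._jobs))
```

In this sketch the bound is soft: if every tracked job is still active, the registry temporarily exceeds the limit instead of discarding a running job.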
| 69 | + |
| 70 | +## Known open items (tracked, non-blocking) |
| 71 | + |
| 72 | +- **Llama 3.1 8B SGLang-Eagle3** — blocked on the retired `lmsysorg/sglang:nightly-dev-cu13-20260321-94194537` image; needs a fresh nightly pin before retry. |
| 73 | +- **Gemma 4 E2B rerun** — `single_request_latency` and `throughput_ramp` result files report `total_tokens_generated = 0` for E2B; other three scenarios are clean. Decode-throughput numbers for E2B ngram are therefore not published in this release. |
| 74 | +- **Dashboard has no auth** — fine for `localhost` and documented in `SECURITY.md`, but a reverse proxy / basic-auth shim is required before exposing the EC2 deployment path to the public internet. |
| 75 | +- **SGLang Docker image pin** — same retired nightly in `.env.example` and `docker-compose.yml`. Update to a currently-pullable tag before running a fresh clone. |
| 76 | + |
| 77 | +## Upgrade notes |
| 78 | + |
| 79 | +```bash |
| 80 | +git pull |
| 81 | +pip install -e ".[dev]" |
| 82 | +``` |
| 83 | + |
| 84 | +To regenerate every figure from the committed results set: |
| 85 | + |
| 86 | +```bash |
| 87 | +python -m analysis.generate_final_benchmark_report # core baseline (5 SVGs) |
| 88 | +python -m analysis.generate_spec_decoding_figure |
| 89 | +python -m analysis.generate_decode_length_figure |
| 90 | +python -m analysis.generate_variance_figure |
| 91 | +python -m analysis.generate_goodput_figure |
| 92 | +``` |
| 93 | + |
| 94 | +## Tag |
| 95 | + |
| 96 | +`v1.0.0` |