Releases: varad-more/inference-engine-benchmark-system
v1.0.0 — Production Release
First production release of the vLLM vs SGLang benchmark harness. The v0.1.0-beta results set (5 models, single GPU class) is now superseded by a fully validated matrix: 16 models, 10 scenarios, 4 dedicated extended phases, and ~530 result files at 100 % success rate on AWS A10G 24 GB.
Highlights
- Benchmark matrix is complete. 14-model core baseline + 2 Gemma 4 models, each run across 5 core scenarios on both engines, plus speculative decoding (Ngram + Eagle3), variance, concurrency-64, and a decode-length sweep.
- Four new figures regenerate from saved result files — no hand-drawn charts.
- CI is green and enforced: `ruff check`, `ruff format --check`, `mypy` (24 source files), `pytest` (89 tests), `python -m build`, `twine check`.
- README reflects reality. No broken script references, no stale dates, no out-of-date module lists, no phantom follow-ups.
- Dashboard hardened. Version bumped to 1.0.0; in-memory job registry now bounded (`DASHBOARD_MAX_JOBS`, default 100) with oldest-terminal eviction instead of unbounded growth.
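The bounded registry with oldest-terminal eviction can be pictured roughly as follows. This is a minimal sketch, not the dashboard's actual code: the class name `BoundedJobRegistry`, its methods, and the status strings in `TERMINAL_STATES` are all hypothetical, and it assumes that jobs still running are never evicted.

```python
# Hypothetical sketch of a bounded in-memory job registry.
# Status names below are assumptions, not taken from the dashboard.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


class BoundedJobRegistry:
    """Job store that evicts the oldest finished job once it is over the bound."""

    def __init__(self, max_jobs: int = 100) -> None:
        self.max_jobs = max_jobs
        # Plain dicts preserve insertion order, so iteration goes oldest-first.
        self._jobs: dict[str, dict] = {}

    def add(self, job_id: str, status: str = "running") -> None:
        self._jobs[job_id] = {"status": status}
        self._evict()

    def set_status(self, job_id: str, status: str) -> None:
        self._jobs[job_id]["status"] = status
        self._evict()

    def _evict(self) -> None:
        # Remove oldest terminal jobs until the registry fits the bound.
        # If every stored job is still active, nothing is removed and the
        # bound is exceeded temporarily.
        while len(self._jobs) > self.max_jobs:
            victim = next(
                (jid for jid, job in self._jobs.items()
                 if job["status"] in TERMINAL_STATES),
                None,
            )
            if victim is None:
                break
            del self._jobs[victim]

    def ids(self) -> list[str]:
        return list(self._jobs)
```

Evicting only terminal jobs keeps live jobs addressable, at the cost of a soft rather than hard limit on registry size.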
Validated benchmark snapshot (2026-04-21)
Environment:
- AWS `g5.2xlarge` (NVIDIA A10G 24 GB)
- vLLM `v0.18.0-cu130`, SGLang `nightly-dev-cu13-20260321`
- 16 models from 2B to 9B, `bfloat16`
- Sequential single-GPU execution
- Source of truth: `reports/final_benchmark_report_2026-03-31.md`
Headline findings
- vLLM wins TTFT on 13 / 14 core-baseline models (20–60 % lower than SGLang at concurrency 1). Only Gemma 3 4B flips (SGLang faster by 9 ms) because vLLM needs `--enforce-eager`.
- vLLM wins small-model throughput (≤ 4B): +3–12 % on SmolLM3, Phi-3 mini, Phi-4 mini, Gemma 2 2B.
- Engines converge at 7–9B. Differences < 3 % on Qwen, Mistral, Llama, Granite, DeepSeek-R1 variants.
- Gemma 3 4B is SGLang's strongest case: +77 % peak throughput vs vLLM (149 vs 84 tok/s).
- Structured generation: vLLM wins 12 / 14 models; SGLang wins 2.
- Prefix-sharing TTFT: SGLang wins 10 / 14 — RadixAttention pays off when prefixes are genuinely shared.
- Speculative decoding on A10G: Ngram works on Llama 3.1 8B, Qwen3 8B, and Gemma 4 E4B across both engines. Eagle3 works on Llama 3.1 8B with vLLM only (SGLang + Eagle3 exceeds 24 GB VRAM). Net: spec-dec hurts throughput on A10G; expect a reversal on ≥ 40 GB hardware.
- Goodput (TTFT ≤ 100 ms, TPOT ≤ 35 ms): vLLM leads on small models (SmolLM3 1.06 rps, Gemma 2 2B 1.37 rps). SGLang leads on 7–9B under concurrent load when TTFT dominates.
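As a rough illustration of the goodput figure above, the sketch below counts requests that meet both SLOs and divides by wall-clock time. It assumes goodput is defined as SLO-satisfying requests per second; the `RequestResult` type and `goodput_rps` function are hypothetical names, not the project's API.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """Per-request latency metrics (hypothetical record shape)."""
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token


def goodput_rps(
    results: list[RequestResult],
    duration_s: float,
    ttft_slo_ms: float = 100.0,  # default thresholds quoted in these notes
    tpot_slo_ms: float = 35.0,
) -> float:
    """Requests per second that satisfy BOTH the TTFT and TPOT SLOs.

    Assumption: goodput = SLO-satisfying requests / wall-clock duration.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1 for r in results
        if r.ttft_ms <= ttft_slo_ms and r.tpot_ms <= tpot_slo_ms
    )
    return good / duration_s


if __name__ == "__main__":
    sample = [RequestResult(50, 20), RequestResult(150, 20), RequestResult(80, 40)]
    print(goodput_rps(sample, duration_s=1.0))
```

Because the SLO is joint, a request with excellent TTFT but slow decoding contributes nothing, which is why the ranking can differ from the raw throughput ranking.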
Extended phases shipped in this release
| Phase | Cells | Iterations | Purpose |
|---|---|---|---|
| Variance subset | 4 models × 5 scenarios × 2 engines | n=5 | 95 % CIs and CV% for reproducibility |
| Concurrency-64 ramp | 4 × 7–9B models × 2 engines | 150 req/level × 6 levels | Saturation + tail-latency behaviour |
| Decode-length sweep | 6 models × 4 output lengths × 2 engines | n=3 | Crossover analysis at max_output_tokens ∈ {64, 256, 1024, 4096} |
| Gemma 4 (E2B + E4B) | 2 models × 5 scenarios × 2 engines + Ngram | n=1 | First published Gemma 4 numbers |
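For the variance subset, CV% and a 95 % confidence interval over a handful of iterations can be computed along these lines. This is a sketch under stated assumptions: the `summarize` function is hypothetical, the interval uses a two-sided Student's t critical value, and the critical values are hard-coded for small sample sizes to avoid a SciPy dependency.

```python
import math
import statistics

# Two-sided 95 % critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize(samples: list[float], cv_threshold_pct: float = 5.0) -> dict:
    """Mean, CV%, and t-based 95 % CI for one metric across iterations."""
    n = len(samples)
    if n < 2 or (n - 1) not in T_CRIT_95:
        raise ValueError("need between 2 and 10 samples")
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n - 1] * sd / math.sqrt(n)
    cv_pct = 100.0 * sd / mean
    return {
        "mean": mean,
        "cv_pct": cv_pct,
        "ci95": (mean - half_width, mean + half_width),
        "above_threshold": cv_pct > cv_threshold_pct,
    }


if __name__ == "__main__":
    # Illustrative throughput samples (made-up numbers, n=5).
    print(summarize([100.0, 102.0, 98.0, 101.0, 99.0]))
```

With n=5 the t critical value (2.776) is noticeably larger than the normal-approximation 1.96, so the interval is wider than a z-based one would suggest.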
What's new since v0.1.0-beta
Benchmark surface
- +11 models to the validated matrix: SmolLM3 3B, Llama 3.2 3B, Phi-4 mini, Gemma 3 4B, DeepSeek-R1-Distill Qwen 7B + Llama 8B, Qwen 2.5 7B, Llama 3.1 8B, Qwen3 8B, Granite 3.3 8B, and both Gemma 4 sizes (E2B, E4B).
- +5 scenarios registered: `throughput_ramp_extended`, `decode_length_sweep_{64,256,1024,4096}`.
- Speculative decoding: 6 engine variants with one-command Docker Compose profiles (`vllm`, `vllm-eagle3`, `vllm-ngram`, `sglang`, `sglang-eagle3`, `sglang-ngram`).
Visualizations
- Four new auto-regenerating figures under `analysis/generate_*_figure.py`:
  - `speculative_decoding.svg`: Llama 3.1 8B + Qwen3 8B + Gemma 4 E4B, baseline vs Ngram vs Eagle3
  - `decode_length_sweep.svg`: tokens/sec vs max_output_tokens (6 models, 95 % CI error bars)
  - `variance_cv.svg`: CV% per (model × engine × scenario × metric) with 5 % threshold line
  - `goodput.svg`: joint TTFT/TPOT SLO goodput per model
- Shared dark-theme style in `analysis/_figure_style.py`.
- Existing core baseline figures remain regenerable via `python -m analysis.generate_final_benchmark_report`.
Analysis tooling
- `analysis/variance_analysis.py`: CV% + t-distribution 95 % CIs across iterations; flags claims whose CV exceeds the 5 % threshold.
- `analysis/tpot_analysis.py`: per-request TPOT P50/P95/P99, computed as `(total_ms - ttft_ms) / max(output_tokens - 1, 1)`.
- `analysis/decode_length_analysis.py`: crossover detection across the decode-length sweep.
- `analysis/goodput.py`: configurable joint SLO goodput (defaults: TTFT ≤ 100 ms, TPOT ≤ 35 ms).
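The per-request TPOT formula and its percentiles can be sketched as below. The formula is the one quoted above; the function names and the nearest-rank percentile method are assumptions and may differ from what `analysis/tpot_analysis.py` actually does.

```python
import math


def tpot_ms(total_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Time per output token: decode time spread over tokens after the first.

    max(..., 1) guards against division by zero for 0- or 1-token outputs.
    """
    return (total_ms - ttft_ms) / max(output_tokens - 1, 1)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile (an assumed method, chosen for simplicity)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tpot_summary(requests: list[tuple[float, float, int]]) -> dict[str, float]:
    """P50/P95/P99 TPOT over (total_ms, ttft_ms, output_tokens) tuples."""
    tpots = [tpot_ms(total, ttft, tokens) for total, ttft, tokens in requests]
    return {f"p{p}": percentile(tpots, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Made-up requests: (total_ms, ttft_ms, output_tokens)
    print(tpot_summary([(1020.0, 20.0, 101), (2030.0, 30.0, 101), (50.0, 20.0, 1)]))
```

Subtracting TTFT before dividing keeps prefill cost out of the decode metric, and dividing by `output_tokens - 1` reflects that the first token is already accounted for by TTFT.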
Release-readiness & quality
- CI strengthened to cover `ruff format --check` + full `mypy` pass on `engines/`, `benchmarks/`, `analysis/`, `dashboard/` (24 source files, 0 errors), on top of the existing `ruff check`, `pytest`, `python -m build`, `twine check`.
- Scenario registry test aligned with the actual 10 registered scenarios.
- Dashboard version synced to package version; `_jobs` dict bounded with oldest-terminal eviction (configurable via `DASHBOARD_MAX_JOBS`).
- README accuracy pass: removed a broken reference to a pruned helper script; updated Project Structure to list all 7 `analysis/` modules; added the 5 extended scenarios to the scenario catalog; linked every published report and figure; removed rot-prone `Last updated` stamps.
Known open items (tracked, non-blocking)
- Llama 3.1 8B SGLang-Eagle3: blocked on the retired `lmsysorg/sglang:nightly-dev-cu13-20260321-94194537` image; needs a fresh nightly pin before retry.
- Gemma 4 E2B rerun: `single_request_latency` and `throughput_ramp` result files report `total_tokens_generated = 0` for E2B; the other three scenarios are clean. Decode-throughput numbers for E2B ngram are therefore not published in this release.
- Dashboard has no auth: fine for `localhost` and documented in `SECURITY.md`, but a reverse proxy / basic-auth shim is required before exposing the EC2 deployment path to the public internet.
- SGLang Docker image pin: same retired nightly in `.env.example` and `docker-compose.yml`. Update to a currently-pullable tag before running a fresh clone.
Upgrade notes
    git pull
    pip install -e ".[dev]"

To regenerate every figure from the committed results set:

    python -m analysis.generate_final_benchmark_report  # core baseline (5 SVGs)
    python -m analysis.generate_spec_decoding_figure
    python -m analysis.generate_decode_length_figure
    python -m analysis.generate_variance_figure
    python -m analysis.generate_goodput_figure

Tag
v1.0.0
v0.1.0-beta — Public Beta
Highlights
This beta ships a production-style benchmark harness for comparing vLLM and SGLang under realistic inference workloads, with Dockerized setup, CLI runner, dashboard, and AWS deployment options.
What’s included
- Comparative benchmark framework for vLLM vs SGLang
- Scenario runner with latency/throughput/cache-oriented workloads
- FastAPI dashboard for viewing and running benchmarks
- Docker Compose setup for local/remote benchmarking
- AWS deployment paths:
  - `deploy/ec2_deploy.sh` (quick end-to-end script)
  - `deploy/terraform/` (repeatable infra workflow)
- Result artifact persistence in `results/*.json`
- README benchmark report from validated A10G run
Validated benchmark snapshot (2026-03-22)
Environment:
- AWS `g5.2xlarge` (NVIDIA A10G 24 GB)
- Model: `Qwen/Qwen2.5-1.5B-Instruct`
- Execution mode: sequential single-GPU runs
single_request_latency (50 requests)
- vLLM: TTFT p50 15.9 ms, p95 16.4 ms, total latency p95 1047.4 ms, 4982.1 tok/s
- SGLang: TTFT p50 27.5 ms, p95 28.1 ms, total latency p95 1008.2 ms, 6253.8 tok/s
throughput_ramp (700 requests)
- vLLM: TTFT p50 31.5 ms, p95 178.1 ms, total latency p95 3202.4 ms, 55287.6 tok/s
- SGLang: TTFT p50 49.4 ms, p95 155.7 ms, total latency p95 4480.8 ms, 39840.1 tok/s
All above runs completed at 100% success rate.
Release-readiness improvements in this beta
- Fixed packaging metadata for Hatch build targets
- Added CI workflow (Python 3.11/3.12):
  - `ruff check .`
  - `pytest -q`
- Added release-readiness + benchmark report sections to README
- Cleaned Docker Compose config by removing obsolete `version` field
- Fixed SGLang startup arg mismatch in Compose configuration
Known limitations
- Public results currently reflect one model (`Qwen 1.5B`) and one GPU class (A10G)
- For stronger production claims, run a wider matrix:
  - multiple model families/sizes (e.g., Llama 8B, Qwen 7B/14B)
  - multiple reruns for variance bands
  - longer-context and structured-generation heavy workloads
- Single-GPU environments should run engines sequentially to avoid VRAM contention
Upgrade notes
- If you pull the latest changes, re-run `pip install -e ".[dev]"`.
- CI is now enforced through GitHub Actions on push/PR.
Tag
v0.1.0-beta