22 Apr 00:37

9cafa6a

Latest

v1.0.0 — Production Release

First production release of the vLLM vs SGLang benchmark harness. The v0.1.0-beta results set (5 models, single GPU class) is now superseded by a fully validated matrix: 16 models, 10 scenarios, 4 dedicated extended phases, and ~530 result files at 100 % success rate on AWS A10G 24 GB.

Highlights

Benchmark matrix is complete. 14-model core baseline + 2 Gemma 4 models, each run across 5 core scenarios on both engines, plus speculative decoding (Ngram + Eagle3), variance, concurrency-64, and a decode-length sweep.
Four new figures regenerate from saved result files — no hand-drawn charts.
CI is green and enforced. ruff check, ruff format --check, mypy (24 source files), pytest (89 tests), python -m build, twine check.
README reflects reality. No broken script references, no stale dates, no out-of-date module lists, no phantom follow-ups.
Dashboard hardened. Version bumped to 1.0.0; in-memory job registry now bounded (DASHBOARD_MAX_JOBS, default 100) with oldest-terminal eviction instead of unbounded growth.

Validated benchmark snapshot (2026-04-21)

Environment:

AWS g5.2xlarge (NVIDIA A10G 24 GB)
vLLM v0.18.0-cu130, SGLang nightly-dev-cu13-20260321
16 models from 2B to 9B, bfloat16
Sequential single-GPU execution
Source of truth: reports/final_benchmark_report_2026-03-31.md

Headline findings

vLLM wins TTFT on 13 / 14 core-baseline models (20–60 % lower than SGLang at concurrency 1). Only Gemma 3 4B flips (SGLang faster by 9 ms) because vLLM has to run it with --enforce-eager, which disables CUDA graph capture.
vLLM wins small-model throughput (≤ 4B): +3–12 % on SmolLM3, Phi-3 mini, Phi-4 mini, Gemma 2 2B.
Engines converge at 7–9B. Differences < 3 % on Qwen, Mistral, Llama, Granite, DeepSeek-R1 variants.
Gemma 3 4B is SGLang's strongest case: +77 % peak throughput vs vLLM (149 vs 84 tok/s).
Structured generation: vLLM wins 12 / 14 models; SGLang wins 2.
Prefix-sharing TTFT: SGLang wins 10 / 14 — RadixAttention pays off when prefixes are genuinely shared.
Speculative decoding on A10G: Ngram works on Llama 3.1 8B, Qwen3 8B, and Gemma 4 E4B across both engines. Eagle3 works on Llama 3.1 8B with vLLM only (SGLang + Eagle3 exceeds 24 GB VRAM). Net: spec-dec hurts throughput on A10G; a reversal is expected on ≥ 40 GB hardware but has not been measured in this release.
Goodput (TTFT ≤ 100 ms, TPOT ≤ 35 ms): vLLM leads on small models (SmolLM3 1.06 rps, Gemma 2 2B 1.37 rps). SGLang leads on 7–9B under concurrent load when TTFT dominates.
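As a rough illustration of how a joint-SLO goodput figure like the ones above can be computed, here is a minimal sketch. It assumes a hypothetical per-request record (RequestResult and its field names are illustrative, not the harness's actual schema) and uses the release's stated defaults of TTFT ≤ 100 ms and TPOT ≤ 35 ms, with TPOT taken as (total_ms − ttft_ms) / max(output_tokens − 1, 1). The sample values are made up for the example and are not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """Hypothetical per-request record; field names are illustrative only."""
    ttft_ms: float
    total_ms: float
    output_tokens: int


def tpot_ms(r: RequestResult) -> float:
    # Decode time spread over the tokens generated after the first one.
    return (r.total_ms - r.ttft_ms) / max(r.output_tokens - 1, 1)


def goodput_rps(results, wall_clock_s, ttft_slo_ms=100.0, tpot_slo_ms=35.0):
    """Requests per second that satisfy BOTH the TTFT and the TPOT SLO."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    good = sum(
        1
        for r in results
        if r.ttft_ms <= ttft_slo_ms and tpot_ms(r) <= tpot_slo_ms
    )
    return good / wall_clock_s


# Made-up sample data, not taken from the published results.
results = [
    RequestResult(ttft_ms=40.0, total_ms=1040.0, output_tokens=51),   # TPOT 20 ms: passes
    RequestResult(ttft_ms=150.0, total_ms=1150.0, output_tokens=51),  # TTFT over 100 ms: fails
    RequestResult(ttft_ms=40.0, total_ms=2540.0, output_tokens=51),   # TPOT 50 ms: fails
    RequestResult(ttft_ms=90.0, total_ms=790.0, output_tokens=21),    # TPOT exactly 35 ms: passes
]
goodput = goodput_rps(results, wall_clock_s=4.0)
print(f"goodput: {goodput:.2f} rps")
```

A request that is fast to first token but slow to decode (or the reverse) counts as zero here, which is why goodput can rank engines differently from raw throughput.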

Extended phases shipped in this release

| Phase | Cells | Iterations | Purpose |
| --- | --- | --- | --- |
| Variance subset | 4 models × 5 scenarios × 2 engines | n=5 | 95 % CIs and CV% for reproducibility |
| Concurrency-64 ramp | 4 × 7–9B models × 2 engines | 150 req/level × 6 levels | Saturation + tail-latency behaviour |
| Decode-length sweep | 6 models × 4 output lengths × 2 engines | n=3 | Crossover analysis at max_output_tokens ∈ {64, 256, 1024, 4096} |
| Gemma 4 (E2B + E4B) | 2 models × 5 scenarios × 2 engines + Ngram | n=1 | First published Gemma 4 numbers |
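To make the variance phase's statistics concrete, the sketch below computes CV% and a t-distribution 95 % confidence interval for one metric across n iterations, and flags it when CV exceeds 5 %. It is an illustration under stated assumptions, not the code in analysis/variance_analysis.py: the function name, the return shape, and the hard-coded table of Student-t critical values (used to avoid a SciPy dependency) are all choices made for this example, and the sample values are invented.

```python
import math
import statistics

# Two-sided 95 % Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize(samples, cv_threshold_pct=5.0):
    """Return mean, CV%, 95 % CI and a flag for one metric across iterations."""
    n = len(samples)
    if n < 2 or (n - 1) not in T_CRIT_95:
        raise ValueError("this sketch supports between 2 and 10 samples")
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("CV% is undefined for a zero mean")
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95[n - 1] * stdev / math.sqrt(n)
    cv_pct = 100.0 * stdev / abs(mean)
    return {
        "mean": mean,
        "cv_pct": cv_pct,
        "ci95": (mean - half_width, mean + half_width),
        "flagged": cv_pct > cv_threshold_pct,
    }


# Invented tokens/sec readings for five iterations (n=5, so 4 degrees of freedom).
summary = summarize([100.0, 102.0, 98.0, 101.0, 99.0])
print(summary)
```

With only five iterations the t critical value (2.776) is noticeably wider than the normal-approximation 1.96, so using the t-distribution avoids overstating how tight the interval is.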

What's new since v0.1.0-beta

Benchmark surface

+11 models to the validated matrix: SmolLM3 3B, Llama 3.2 3B, Phi-4 mini, Gemma 3 4B, DeepSeek-R1-Distill Qwen 7B + Llama 8B, Qwen 2.5 7B, Llama 3.1 8B, Qwen3 8B, Granite 3.3 8B, and both Gemma 4 sizes (E2B, E4B).
+5 scenarios registered: throughput_ramp_extended, decode_length_sweep_{64,256,1024,4096}.
Speculative decoding: 6 engine variants with one-command Docker Compose profiles (vllm, vllm-eagle3, vllm-ngram, sglang, sglang-eagle3, sglang-ngram).

Visualizations

Four new auto-regenerating figures under analysis/generate_*_figure.py:
- speculative_decoding.svg — Llama 3.1 8B + Qwen3 8B + Gemma 4 E4B, baseline vs Ngram vs Eagle3
- decode_length_sweep.svg — tokens/sec vs max_output_tokens (6 models, 95 % CI error bars)
- variance_cv.svg — CV% per (model × engine × scenario × metric) with 5 % threshold line
- goodput.svg — joint TTFT/TPOT SLO goodput per model
Shared dark-theme style in analysis/_figure_style.py.
Existing core baseline figures remain regenerable via python -m analysis.generate_final_benchmark_report.

Analysis tooling

analysis/variance_analysis.py: CV% and t-distribution 95 % CIs across iterations; flags any metric whose CV exceeds the 5 % threshold.
analysis/tpot_analysis.py: per-request TPOT P50/P95/P99, where TPOT = (total_ms − ttft_ms) / max(output_tokens − 1, 1).
analysis/decode_length_analysis.py: crossover detection across the decode-length sweep.
analysis/goodput.py: configurable joint SLO goodput (defaults: TTFT ≤ 100 ms, TPOT ≤ 35 ms).
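One plausible way to detect a crossover in the decode-length sweep is to look for a sign change in the throughput difference between the two engines as max_output_tokens grows. The sketch below shows that idea only; the function name, the tie handling, and the sample numbers are assumptions for illustration and do not describe what analysis/decode_length_analysis.py actually does.

```python
def find_crossovers(lengths, engine_a_tps, engine_b_tps):
    """Return (left_len, right_len) pairs between which the faster engine changes.

    Exact ties are skipped, so a crossover is reported between the last and the
    next output length at which one engine was strictly faster.
    """
    if not (len(lengths) == len(engine_a_tps) == len(engine_b_tps)):
        raise ValueError("all inputs must have the same length")
    crossovers = []
    prev_len, prev_sign = None, 0
    for length, a, b in zip(lengths, engine_a_tps, engine_b_tps):
        diff = a - b
        sign = (diff > 0) - (diff < 0)  # +1 if A faster, -1 if B faster, 0 on a tie
        if sign == 0:
            continue
        if prev_sign != 0 and sign != prev_sign:
            crossovers.append((prev_len, length))
        prev_len, prev_sign = length, sign
    return crossovers


# Invented tokens/sec values at the four sweep lengths, not published results.
lengths = [64, 256, 1024, 4096]
engine_a = [120.0, 118.0, 110.0, 95.0]
engine_b = [110.0, 115.0, 112.0, 101.0]
crossings = find_crossovers(lengths, engine_a, engine_b)
print(crossings)
```

In practice a sign change inside overlapping confidence intervals is not a meaningful crossover, so a stricter version would compare interval bounds rather than point estimates.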

Release-readiness & quality

CI strengthened to cover ruff format --check + full mypy pass on engines/, benchmarks/, analysis/, dashboard/ (24 source files, 0 errors), on top of the existing ruff check, pytest, python -m build, twine check.
Scenario registry test aligned with the actual 10 registered scenarios.
Dashboard version synced to package version; _jobs dict bounded with oldest-terminal eviction (configurable via DASHBOARD_MAX_JOBS).
README accuracy pass: removed a broken reference to a pruned helper script; updated Project Structure to list all 7 analysis/ modules; added the 5 extended scenarios to the scenario catalog; linked every published report and figure; removed rot-prone Last updated stamps.
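The bounded job registry described above can be pictured with the following minimal sketch of oldest-terminal eviction. Everything except the DASHBOARD_MAX_JOBS variable and its default of 100 is an assumption made for this example: the class name, the status strings, and the decision to let the registry temporarily exceed the bound when every job is still active are illustrative, not the dashboard's actual implementation.

```python
import os

# Assumed names for finished states; the real dashboard may use different ones.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


class BoundedJobRegistry:
    """In-memory job registry that evicts the oldest finished job when full."""

    def __init__(self, max_jobs=None):
        if max_jobs is None:
            max_jobs = int(os.environ.get("DASHBOARD_MAX_JOBS", "100"))
        self.max_jobs = max_jobs
        self._jobs = {}  # dicts preserve insertion order, so oldest comes first

    def add(self, job_id, status="running"):
        self._jobs[job_id] = {"status": status}
        self._evict()

    def set_status(self, job_id, status):
        self._jobs[job_id]["status"] = status
        self._evict()

    def ids(self):
        return list(self._jobs)

    def _evict(self):
        # Remove oldest terminal jobs until within the bound; never drop active jobs.
        while len(self._jobs) > self.max_jobs:
            victim = next(
                (jid for jid, job in self._jobs.items()
                 if job["status"] in TERMINAL_STATES),
                None,
            )
            if victim is None:
                break  # everything is still active; allow a temporary overshoot
            del self._jobs[victim]


registry = BoundedJobRegistry(max_jobs=2)
registry.add("a")
registry.add("b")
registry.set_status("a", "completed")
registry.add("c")  # over the bound: "a" is the oldest terminal job and is evicted
print(registry.ids())
```

Evicting only terminal jobs keeps the status of running benchmarks queryable, at the cost of a soft rather than hard upper bound.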

Known open items (tracked, non-blocking)

Llama 3.1 8B SGLang-Eagle3 — blocked on the retired lmsysorg/sglang:nightly-dev-cu13-20260321-94194537 image; needs a fresh nightly pin before retry.
Gemma 4 E2B rerun — single_request_latency and throughput_ramp result files report total_tokens_generated = 0 for E2B; other three scenarios are clean. Decode-throughput numbers for E2B ngram are therefore not published in this release.
Dashboard has no auth — fine for localhost and documented in SECURITY.md, but a reverse proxy / basic-auth shim is required before exposing the EC2 deployment path to the public internet.
SGLang Docker image pin — same retired nightly in .env.example and docker-compose.yml. Update to a currently-pullable tag before running a fresh clone.

Upgrade notes

git pull
pip install -e ".[dev]"

To regenerate every figure from the committed results set:

python -m analysis.generate_final_benchmark_report   # core baseline (5 SVGs)
python -m analysis.generate_spec_decoding_figure
python -m analysis.generate_decode_length_figure
python -m analysis.generate_variance_figure
python -m analysis.generate_goodput_figure

Tag

v1.0.0

22 Mar 07:49

varad-more

v0.1.0-beta

9ec1fc0

v0.1.0-beta — Public Beta

Highlights

This beta ships a production-style benchmark harness for comparing vLLM and SGLang under realistic inference workloads, with Dockerized setup, CLI runner, dashboard, and AWS deployment options.

What’s included

Comparative benchmark framework for vLLM vs SGLang
Scenario runner with latency/throughput/cache-oriented workloads
FastAPI dashboard for viewing and running benchmarks
Docker Compose setup for local/remote benchmarking
AWS deployment paths:
- deploy/ec2_deploy.sh (quick end-to-end script)
- deploy/terraform/ (repeatable infra workflow)
Result artifact persistence in results/*.json
README benchmark report from validated A10G run

Validated benchmark snapshot (2026-03-22)

Environment:

AWS g5.2xlarge (NVIDIA A10G 24GB)
Model: Qwen/Qwen2.5-1.5B-Instruct
Execution mode: sequential single-GPU runs

single_request_latency (50 requests)

vLLM: TTFT p50 15.9 ms, p95 16.4 ms, total latency p95 1047.4 ms, 4982.1 tok/s
SGLang: TTFT p50 27.5 ms, p95 28.1 ms, total latency p95 1008.2 ms, 6253.8 tok/s

throughput_ramp (700 requests)

vLLM: TTFT p50 31.5 ms, p95 178.1 ms, total latency p95 3202.4 ms, 55287.6 tok/s
SGLang: TTFT p50 49.4 ms, p95 155.7 ms, total latency p95 4480.8 ms, 39840.1 tok/s

All above runs completed at 100% success rate.

Release-readiness improvements in this beta

Fixed packaging metadata for Hatch build targets
Added CI workflow (Python 3.11/3.12):
- ruff check .
- pytest -q
Added release-readiness + benchmark report sections to README
Cleaned Docker Compose config by removing obsolete version field
Fixed SGLang startup arg mismatch in Compose configuration

Known limitations

Public results currently reflect one model (Qwen 1.5B) and one GPU class (A10G)
For stronger production claims, run a wider matrix:
- multiple model families/sizes (e.g., Llama 8B, Qwen 7B/14B)
- multiple reruns for variance bands
- longer-context and structured-generation heavy workloads
Single-GPU environments should run engines sequentially to avoid VRAM contention

Upgrade notes

After pulling the latest changes, re-run:

pip install -e ".[dev]"

CI is now enforced through GitHub Actions on push/PR.

Tag

v0.1.0-beta

Releases: varad-more/inference-engine-benchmark-system