Skip to content

Releases: varad-more/inference-engine-benchmark-system

v1.0.0 — Production Release

22 Apr 00:37

Choose a tag to compare

v1.0.0 — Production Release

First production release of the vLLM vs SGLang benchmark harness. The v0.1.0-beta results set (5 models, single GPU class) is now superseded by a fully validated matrix: 16 models, 10 scenarios, 4 dedicated extended phases, and ~530 result files at 100 % success rate on AWS A10G 24 GB.

Highlights

  • Benchmark matrix is complete. 14-model core baseline + 2 Gemma 4 models, each run across 5 core scenarios on both engines, plus speculative decoding (Ngram + Eagle3), variance, concurrency-64, and a decode-length sweep.
  • Four new figures regenerate from saved result files — no hand-drawn charts.
  • CI is green and enforced. ruff check, ruff format --check, mypy (24 source files), pytest (89 tests), python -m build, twine check.
  • README reflects reality. No broken script references, no stale dates, no out-of-date module lists, no phantom follow-ups.
  • Dashboard hardened. Version bumped to 1.0.0; in-memory job registry now bounded (DASHBOARD_MAX_JOBS, default 100) with oldest-terminal eviction instead of unbounded growth.

Validated benchmark snapshot (2026-04-21)

Environment:

Headline findings

  • vLLM wins TTFT on 13 / 14 core-baseline models (20–60 % lower than SGLang at concurrency 1). Only Gemma 3 4B flips (SGLang faster by 9 ms) because vLLM needs --enforce-eager.
  • vLLM wins small-model throughput (≤ 4B): +3–12 % on SmolLM3, Phi-3 mini, Phi-4 mini, Gemma 2 2B.
  • Engines converge at 7–9B. Differences < 3 % on Qwen, Mistral, Llama, Granite, DeepSeek-R1 variants.
  • Gemma 3 4B is SGLang's strongest case: +77 % peak throughput vs vLLM (149 vs 84 tok/s).
  • Structured generation: vLLM wins 12 / 14 models; SGLang wins 2.
  • Prefix-sharing TTFT: SGLang wins 10 / 14 — RadixAttention pays off when prefixes are genuinely shared.
  • Speculative decoding on A10G: Ngram works on Llama 3.1 8B, Qwen3 8B, and Gemma 4 E4B across both engines. Eagle3 works on Llama 3.1 8B with vLLM only (SGLang + Eagle3 exceeds 24 GB VRAM). Net: spec-dec hurts throughput on A10G; expect a reversal on ≥ 40 GB hardware.
  • Goodput (TTFT ≤ 100 ms, TPOT ≤ 35 ms): vLLM leads on small models (SmolLM3 1.06 rps, Gemma 2 2B 1.37 rps). SGLang leads on 7–9B under concurrent load when TTFT dominates.

Extended phases shipped in this release

Phase Cells Iterations Purpose
Variance subset 4 models × 5 scenarios × 2 engines n=5 95 % CIs and CV% for reproducibility
Concurrency-64 ramp 4 × 7–9B models × 2 engines 150 req/level × 6 levels Saturation + tail-latency behaviour
Decode-length sweep 6 models × 4 output lengths × 2 engines n=3 Crossover analysis at max_output_tokens ∈ {64, 256, 1024, 4096}
Gemma 4 (E2B + E4B) 2 models × 5 scenarios × 2 engines + Ngram n=1 First published Gemma 4 numbers

What's new since v0.1.0-beta

Benchmark surface

  • +11 models to the validated matrix: SmolLM3 3B, Llama 3.2 3B, Phi-4 mini, Gemma 3 4B, DeepSeek-R1-Distill Qwen 7B + Llama 8B, Qwen 2.5 7B, Llama 3.1 8B, Qwen3 8B, Granite 3.3 8B, and both Gemma 4 sizes (E2B, E4B).
  • +5 scenarios registered: throughput_ramp_extended, decode_length_sweep_{64,256,1024,4096}.
  • Speculative decoding: 6 engine variants with one-command Docker Compose profiles (vllm, vllm-eagle3, vllm-ngram, sglang, sglang-eagle3, sglang-ngram).

Visualizations

  • Four new auto-regenerating figures under analysis/generate_*_figure.py:
    • speculative_decoding.svg — Llama 3.1 8B + Qwen3 8B + Gemma 4 E4B, baseline vs Ngram vs Eagle3
    • decode_length_sweep.svg — tokens/sec vs max_output_tokens (6 models, 95 % CI error bars)
    • variance_cv.svg — CV% per (model × engine × scenario × metric) with 5 % threshold line
    • goodput.svg — joint TTFT/TPOT SLO goodput per model
  • Shared dark-theme style in analysis/_figure_style.py.
  • Existing core baseline figures remain regenerable via python -m analysis.generate_final_benchmark_report.

Analysis tooling

  • analysis/variance_analysis.py — CV% + t-distribution 95 % CIs across iterations; flags claims above the 5 % threshold.
  • analysis/tpot_analysis.py — per-request TPOT P50/P95/P99 ((total_ms − ttft_ms) / max(output_tokens − 1, 1)).
  • analysis/decode_length_analysis.py — crossover detection across the decode-length sweep.
  • analysis/goodput.py — configurable joint SLO goodput (defaults: TTFT ≤ 100 ms, TPOT ≤ 35 ms).

Release-readiness & quality

  • CI strengthened to cover ruff format --check + full mypy pass on engines/, benchmarks/, analysis/, dashboard/ (24 source files, 0 errors), on top of the existing ruff check, pytest, python -m build, twine check.
  • Scenario registry test aligned with the actual 10 registered scenarios.
  • Dashboard version synced to package version; _jobs dict bounded with oldest-terminal eviction (configurable via DASHBOARD_MAX_JOBS).
  • README accuracy pass: removed a broken reference to a pruned helper script; updated Project Structure to list all 7 analysis/ modules; added the 5 extended scenarios to the scenario catalog; linked every published report and figure; removed rot-prone Last updated stamps.

Known open items (tracked, non-blocking)

  • Llama 3.1 8B SGLang-Eagle3 — blocked on the retired lmsysorg/sglang:nightly-dev-cu13-20260321-94194537 image; needs a fresh nightly pin before retry.
  • Gemma 4 E2B rerunsingle_request_latency and throughput_ramp result files report total_tokens_generated = 0 for E2B; other three scenarios are clean. Decode-throughput numbers for E2B ngram are therefore not published in this release.
  • Dashboard has no auth — fine for localhost and documented in SECURITY.md, but a reverse proxy / basic-auth shim is required before exposing the EC2 deployment path to the public internet.
  • SGLang Docker image pin — same retired nightly in .env.example and docker-compose.yml. Update to a currently-pullable tag before running a fresh clone.

Upgrade notes

git pull
pip install -e ".[dev]"

To regenerate every figure from the committed results set:

python -m analysis.generate_final_benchmark_report   # core baseline (5 SVGs)
python -m analysis.generate_spec_decoding_figure
python -m analysis.generate_decode_length_figure
python -m analysis.generate_variance_figure
python -m analysis.generate_goodput_figure

Tag

v1.0.0

v0.1.0-beta — Public Beta

22 Mar 07:49

Choose a tag to compare

v0.1.0-beta — Public Beta

Highlights

This beta ships a production-style benchmark harness for comparing vLLM and SGLang under realistic inference workloads, with Dockerized setup, CLI runner, dashboard, and AWS deployment options.

What’s included

  • Comparative benchmark framework for vLLM vs SGLang
  • Scenario runner with latency/throughput/cache-oriented workloads
  • FastAPI dashboard for viewing and running benchmarks
  • Docker Compose setup for local/remote benchmarking
  • AWS deployment paths:
    • deploy/ec2_deploy.sh (quick end-to-end script)
    • deploy/terraform/ (repeatable infra workflow)
  • Result artifact persistence in results/*.json
  • README benchmark report from validated A10G run

Validated benchmark snapshot (2026-03-22)

Environment:

  • AWS g5.2xlarge (NVIDIA A10G 24GB)
  • Model: Qwen/Qwen2.5-1.5B-Instruct
  • Execution mode: sequential single-GPU runs

single_request_latency (50 requests)

  • vLLM: TTFT p50 15.9 ms, p95 16.4 ms, total latency p95 1047.4 ms, 4982.1 tok/s
  • SGLang: TTFT p50 27.5 ms, p95 28.1 ms, total latency p95 1008.2 ms, 6253.8 tok/s

throughput_ramp (700 requests)

  • vLLM: TTFT p50 31.5 ms, p95 178.1 ms, total latency p95 3202.4 ms, 55287.6 tok/s
  • SGLang: TTFT p50 49.4 ms, p95 155.7 ms, total latency p95 4480.8 ms, 39840.1 tok/s

All above runs completed at 100% success rate.

Release-readiness improvements in this beta

  • Fixed packaging metadata for Hatch build targets
  • Added CI workflow (Python 3.11/3.12):
    • ruff check .
    • pytest -q
  • Added release-readiness + benchmark report sections to README
  • Cleaned Docker Compose config by removing obsolete version field
  • Fixed SGLang startup arg mismatch in Compose configuration

Known limitations

  • Public results currently reflect one model (Qwen 1.5B) and one GPU class (A10G)
  • For stronger production claims, run a wider matrix:
    • multiple model families/sizes (e.g., Llama 8B, Qwen 7B/14B)
    • multiple reruns for variance bands
    • longer-context and structured-generation heavy workloads
  • Single-GPU environments should run engines sequentially to avoid VRAM contention

Upgrade notes

  • If you pull latest changes, re-run:
pip install -e ".[dev]"
  • CI is now enforced through GitHub Actions on push/PR.

Tag

v0.1.0-beta