Speculative Decoding Benchmarks

Speculative decoding reduces time-to-first-token (TTFT) and end-to-end latency by letting a small "draft" model predict several tokens ahead, which the main model then verifies in a single forward pass. When the draft is correct, you get multiple tokens for the cost of one verification step.

This guide covers how to run speculative decoding benchmarks using the two supported methods — Eagle3 and Ngram — on both vLLM and SGLang.

Supported Methods

Method	How it works	Draft model needed	VRAM overhead	Best for
Eagle3	Trained speculative head predicts next tokens	Yes (~1-2 GB)	~3-5 GB extra	Greedy/low-temp generation, chat
Ngram	Matches token sequences from the prompt	No	Minimal	Documents with repeated phrases, RAG

Expected speedup (single request, A10G 24GB):

Eagle3 on Llama 3.1 8B: ~1.8-2.4x TTFT reduction
Ngram on Llama 3.1 8B: ~1.2-1.5x TTFT reduction
Speedup degrades at high concurrency (>16 concurrent requests)

Draft Model Reference

Target Model	Eagle3 Draft (vLLM)	Eagle3 Draft (SGLang)
`meta-llama/Llama-3.1-8B-Instruct`	`RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3`	`jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B`
`Qwen/Qwen3-8B`	`RedHatAI/Qwen3-8B-speculator.eagle3`	Not yet available
`google/gemma-3-4b-it`	Not yet available	Not yet available
`meta-llama/Llama-3.3-70B-Instruct`	`RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3`	`yuhuili/EAGLE3-LLaMA3.3-Instruct-70B`

Note: Eagle3 speculators must share the same tokenizer/vocabulary as the target model. Do not mix model families (e.g., a Llama draft with a Qwen target).

2025-2026 Models — Hardware Requirements

A10G 24GB (current benchmark hardware)

Model	HF ID	Est. VRAM	Spec-Dec Compatible
Qwen 3 8B	`Qwen/Qwen3-8B`	~16 GB	Eagle3 (vLLM), Ngram both engines
Gemma 3 4B	`google/gemma-3-4b-it`	~8 GB	Ngram only (no Eagle3 draft yet)
Llama 3.1 8B	`meta-llama/Llama-3.1-8B-Instruct`	~16 GB	Eagle3 + Ngram, both engines
DeepSeek-R1 Distill 7B	`deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`	~14 GB	Ngram only

For Eagle3 on 8B models, set gpu-memory-utilization=0.80 (vLLM) or mem-fraction-static=0.70 (SGLang) to leave room for the ~1-2 GB draft model.

A100/H100 (larger hardware)

Model	HF ID	Est. VRAM	Notes
Mistral Small 3.2 24B	`mistralai/Mistral-Small-3.2-24B-Instruct-2506`	~48 GB	A100 80GB or 2× A100 40GB
Qwen 3 32B	`Qwen/Qwen3-32B`	~64 GB	A100 80GB
Llama 3.3 70B	`meta-llama/Llama-3.3-70B-Instruct`	~140 GB	2× A100 80GB or H100 SXM

Running Speculative Decoding Benchmarks

Prerequisites

Edit .env with your HuggingFace token (needed for gated models like Llama):

HUGGING_FACE_HUB_TOKEN=hf_YOUR_TOKEN
EAGLE3_VLLM_DRAFT=RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3
EAGLE3_SGLANG_DRAFT=jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B

Set target model (Eagle3 works best with Llama 3.1 8B — has full draft model support):
```
export MODEL=meta-llama/Llama-3.1-8B-Instruct
```

Step-by-step: Baseline vs Eagle3 vs Ngram (vLLM)

# 1. Run baseline
docker compose --profile vllm up -d vllm
sleep 120  # wait for model to load
python run_experiment.py run -s single_request_latency -e vllm --model meta-llama/Llama-3.1-8B-Instruct
python run_experiment.py run -s throughput_ramp -e vllm --model meta-llama/Llama-3.1-8B-Instruct
docker compose --profile vllm down

# 2. Run Eagle3
docker compose --profile vllm-eagle3 up -d vllm-eagle3
sleep 180  # two models load — takes longer
python run_experiment.py run -s single_request_latency -e vllm-eagle3 --model meta-llama/Llama-3.1-8B-Instruct
python run_experiment.py run -s throughput_ramp -e vllm-eagle3 --model meta-llama/Llama-3.1-8B-Instruct
docker compose --profile vllm-eagle3 down

# 3. Run Ngram
docker compose --profile vllm-ngram up -d vllm-ngram
sleep 120
python run_experiment.py run -s single_request_latency -e vllm-ngram --model meta-llama/Llama-3.1-8B-Instruct
python run_experiment.py run -s throughput_ramp -e vllm-ngram --model meta-llama/Llama-3.1-8B-Instruct
docker compose --profile vllm-ngram down

# 4. Generate comparison report
python run_experiment.py report --output spec_dec_report.html

Step-by-step: Baseline vs Eagle3 vs Ngram (SGLang)

# 1. Baseline
docker compose --profile sglang up -d sglang && sleep 120
python run_experiment.py run -s single_request_latency -e sglang --model meta-llama/Llama-3.1-8B-Instruct
docker compose --profile sglang down

# 2. Eagle3
docker compose --profile sglang-eagle3 up -d sglang-eagle3 && sleep 180
python run_experiment.py run -s single_request_latency -e sglang-eagle3 --model meta-llama/Llama-3.1-8B-Instruct
docker compose --profile sglang-eagle3 down

# 3. Ngram
docker compose --profile sglang-ngram up -d sglang-ngram && sleep 120
python run_experiment.py run -s single_request_latency -e sglang-ngram --model meta-llama/Llama-3.1-8B-Instruct
docker compose --profile sglang-ngram down

Using the matrix command (recommended for full run)

MODEL=meta-llama/Llama-3.1-8B-Instruct

# Baseline vLLM
docker compose --profile vllm up -d vllm && sleep 120
python run_experiment.py matrix \
  --scenarios single_request_latency,throughput_ramp,long_context_stress,prefix_sharing_benefit,structured_generation_speed \
  --engines vllm --model $MODEL --iterations 2 --cooldown-seconds 120
docker compose --profile vllm down

# vLLM Eagle3
docker compose --profile vllm-eagle3 up -d vllm-eagle3 && sleep 180
python run_experiment.py matrix \
  --scenarios single_request_latency,throughput_ramp \
  --engines vllm-eagle3 --model $MODEL --iterations 2 --cooldown-seconds 120
docker compose --profile vllm-eagle3 down

# vLLM Ngram
docker compose --profile vllm-ngram up -d vllm-ngram && sleep 120
python run_experiment.py matrix \
  --scenarios single_request_latency,throughput_ramp \
  --engines vllm-ngram --model $MODEL --iterations 2 --cooldown-seconds 120
docker compose --profile vllm-ngram down

Direct head-to-head comparison

# Compare baseline vs Eagle3 on the same scenario
# (requires both services to be running on different hosts, or run sequentially and use saved results)
python run_experiment.py compare \
  --scenario single_request_latency \
  --engines vllm,vllm-eagle3 \
  --model meta-llama/Llama-3.1-8B-Instruct

Troubleshooting

Symptom	Cause	Fix
OOM on Eagle3 startup	Draft model + main model exceed VRAM	Reduce `gpu-memory-utilization` to 0.75 or `mem-fraction-static` to 0.65
Eagle3 much slower than baseline	High concurrency saturates draft verification	Reduce concurrency or test at lower request rates
SGLang Eagle3 crashes on startup	Context length too large for draft tree	Add `--speculative-num-draft-tokens 8` to reduce tree size
Ngram shows no speedup	Prompts have no repeated token patterns	Ngram benefits only repetitive/document-style prompts; try `long_context` scenario
`EAGLE3_VLLM_DRAFT` not found on HF Hub	Draft model not available for this target	Use Ngram instead, or check RedHatAI HF page for latest speculators

Known Limitations

Eagle3 speculative decoding has a context length cap of 2048 tokens for the speculative portion in some vLLM versions. Long-context benchmarks may fall back to normal decoding.
Speculative decoding speedup is highest at low concurrency (1-4 concurrent requests). At concurrency ≥ 32, benefits shrink significantly as the GPU is already saturated.
The structured_generation_speed scenario uses constrained decoding, which may interact with speculative decoding in engine-specific ways. Treat those results with care.
Eagle3 draft models are specific to a model family and version. Qwen3-8B speculators may not yet be available for SGLang.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speculative Decoding Benchmarks

Supported Methods

Draft Model Reference

2025-2026 Models — Hardware Requirements

A10G 24GB (current benchmark hardware)

A100/H100 (larger hardware)

Running Speculative Decoding Benchmarks

Prerequisites

Step-by-step: Baseline vs Eagle3 vs Ngram (vLLM)

Step-by-step: Baseline vs Eagle3 vs Ngram (SGLang)

Using the matrix command (recommended for full run)

Direct head-to-head comparison

Troubleshooting

Known Limitations

FilesExpand file tree

SPECULATIVE_DECODING.md

Latest commit

History

SPECULATIVE_DECODING.md

File metadata and controls

Speculative Decoding Benchmarks

Supported Methods

Draft Model Reference

2025-2026 Models — Hardware Requirements

A10G 24GB (current benchmark hardware)

A100/H100 (larger hardware)

Running Speculative Decoding Benchmarks

Prerequisites

Step-by-step: Baseline vs Eagle3 vs Ngram (vLLM)

Step-by-step: Baseline vs Eagle3 vs Ngram (SGLang)

Using the matrix command (recommended for full run)

Direct head-to-head comparison

Troubleshooting

Known Limitations