vLLM Beam Search Plugin

MRV2 beam-search scheduler and sampler plugin for vLLM V1.

This package provides:

vllm_beam_search.scheduler.BeamSearchScheduler, the scheduler class passed to --scheduler-cls
an MRV2 custom sampler wrapper, installed through a plugin-local ModelState hook
plugin-local runtime hooks for MRV2 worker history rewrites

The current production path targets MRV2 generate models with async scheduling. The sampler hook is model-state generic; BART-family models still need the companion vllm-bart-plugin for encoder-decoder model support.

For BART-family encoder-decoder serving, see BART_BEAM_SEARCH.md.

Install

uv pip install -e .

For stress tooling:

uv pip install -e '.[stress]'

Server

MODEL=${MODEL:-meta-llama/Meta-Llama-3-8B-Instruct}
SERVED_MODEL=${SERVED_MODEL:-llama3-8b}

CUDA_VISIBLE_DEVICES=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
python -m vllm.entrypoints.openai.api_server \
  --model "${MODEL}" \
  --served-model-name "${SERVED_MODEL}" \
  --dtype bfloat16 \
  --port 8005 \
  --scheduler-cls vllm_beam_search.scheduler.BeamSearchScheduler

Request Shape

{
  "model": "llama3-8b",
  "prompt": "Write a concise summary of why beam search is useful:",
  "max_tokens": 128,
  "temperature": 0,
  "add_special_tokens": false,
  "vllm_xargs": {
    "beam_width": 4,
    "no_repeat_ngram_size": 3
  }
}
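
A minimal sketch of sending this request from Python using only the standard library. It assumes the server from the Server section is listening on localhost:8005 and that the request goes to the OpenAI-compatible /v1/completions route. The helper names build_payload and send_completion are illustrative and not part of the plugin.

```python
import json
import urllib.request

# Assumption: the server from the Server section is reachable at this address.
BASE_URL = "http://localhost:8005"


def build_payload(prompt, beam_width=4, no_repeat_ngram_size=3, max_tokens=128):
    """Build a completion request body in the shape shown above."""
    return {
        "model": "llama3-8b",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
        "add_special_tokens": False,
        # Beam-search settings travel in vllm_xargs, as in the JSON example.
        "vllm_xargs": {
            "beam_width": beam_width,
            "no_repeat_ngram_size": no_repeat_ngram_size,
        },
    }


def send_completion(payload, base_url=BASE_URL, timeout=120):
    """POST the payload to /v1/completions and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a server running, send_completion(build_payload("Write a concise summary of why beam search is useful:")) should return an OpenAI-style completion object whose generated text sits under choices[0]["text"].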

Validation

Run unit tests:

python -m pytest tests -q

Run sustained stress plus memory sampling against a running server:

vllm-beam-stress \
  --base-url http://localhost:8005 \
  --model llama3-8b \
  --rounds 100 \
  --requests-per-round 32 \
  --concurrency 64 \
  --abort-rounds 3

The stress tool writes CSV samples with request count, RSS, and GPU memory.
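
One way to read these samples is to compare memory at the start and end of a run. The sketch below is a hedged example: the column names rss_bytes and gpu_memory_bytes are assumptions, so check them against the header of the CSV the tool actually writes.

```python
import csv
import io


def memory_growth(csv_text, columns=("rss_bytes", "gpu_memory_bytes")):
    """Return the last-minus-first value for each memory column.

    Assumption: the column names are hypothetical placeholders; replace
    them with the header names found in the real stress CSV.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV contains no samples")
    return {
        name: float(rows[-1][name]) - float(rows[0][name])
        for name in columns
    }
```

Growth that keeps rising across rounds, rather than flattening after warm-up, is the pattern worth a closer look.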

Runtime Knobs

VLLM_BEAM_GROUP_STATE_CAPACITY controls GPU beam-state pool capacity.
VLLM_BEAM_TRANSITION_BUFFER_SLOTS controls async transition buffer slots.
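
Both knobs are environment variables, so they can be set in the environment of the server process. A minimal sketch, assuming both accept integer values; the numbers 64 and 8 are placeholders for illustration, not recommended or default settings, and server_env is a hypothetical helper.

```python
import os


def server_env(group_state_capacity, transition_buffer_slots, base_env=None):
    """Return a copy of the environment with both beam-search knobs set.

    Assumption: both knobs take integer values; the plugin's defaults and
    valid ranges are not stated in this README.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_BEAM_GROUP_STATE_CAPACITY"] = str(group_state_capacity)
    env["VLLM_BEAM_TRANSITION_BUFFER_SLOTS"] = str(transition_buffer_slots)
    return env


# Placeholder values for illustration only.
example_env = server_env(64, 8, base_env={})
```

The resulting mapping can be passed as the env argument of subprocess.Popen when launching the server command from the Server section.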
