MRV2 beam-search scheduler and sampler plugin for vLLM V1.
This package provides:

- vllm_beam_search.scheduler.BeamSearchScheduler
- an MRV2 custom sampler wrapper installed through a plugin-local ModelState hook
- plugin-local runtime hooks for MRV2 worker history rewrites

The current production path targets MRV2 generate models with async scheduling.
The sampler hook is model-state generic; BART-family models still need the
companion vllm-bart-plugin for encoder-decoder model support.
For BART-family encoder-decoder serving, see
BART_BEAM_SEARCH.md.

uv pip install -e .

For stress tooling:

uv pip install -e '.[stress]'

Start the server with the beam-search scheduler:

MODEL=${MODEL:-meta-llama/Meta-Llama-3-8B-Instruct}
SERVED_MODEL=${SERVED_MODEL:-llama3-8b}
CUDA_VISIBLE_DEVICES=0 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
python -m vllm.entrypoints.openai.api_server \
--model "${MODEL}" \
--served-model-name "${SERVED_MODEL}" \
--dtype bfloat16 \
--port 8005 \
--scheduler-cls vllm_beam_search.scheduler.BeamSearchScheduler

Example request body:

{
"model": "llama3-8b",
"prompt": "Write a concise summary of why beam search is useful:",
"max_tokens": 128,
"temperature": 0,
"add_special_tokens": false,
"vllm_xargs": {
"beam_width": 4,
"no_repeat_ngram_size": 3
}
}

Run unit tests:

python -m pytest tests -q

Run sustained stress plus memory sampling against a running server:

vllm-beam-stress \
--base-url http://localhost:8005 \
--model llama3-8b \
--rounds 100 \
--requests-per-round 32 \
--concurrency 64 \
--abort-rounds 3

The stress tool writes CSV samples with request count, RSS, and GPU memory.
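
To check for memory growth across a run, the CSV samples can be summarized per column. The following is a minimal sketch, not part of the stress tool: the column names in the example (`requests`, `rss_mb`, `gpu_mb`) are hypothetical, so the helper avoids depending on them and simply reports the minimum, maximum, and last value of every numeric column it finds.

```python
import csv
import io
from collections import defaultdict


def summarize_samples(csv_text: str) -> dict[str, dict[str, float]]:
    """Return min, max, and last value for every numeric column."""
    columns: dict[str, list[float]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, raw in row.items():
            try:
                columns[name].append(float(raw))
            except (TypeError, ValueError):
                continue  # skip non-numeric or missing cells
    return {
        name: {"min": min(vals), "max": max(vals), "last": vals[-1]}
        for name, vals in columns.items()
    }


# Hypothetical sample data; the real header names may differ.
example = """requests,rss_mb,gpu_mb
32,1000.0,5000.0
64,1010.5,5000.0
96,1012.0,5001.0
"""
summary = summarize_samples(example)
growth = summary["rss_mb"]["last"] - summary["rss_mb"]["min"]
print(f"RSS growth over run: {growth:.1f} MB")
```

A last value that keeps climbing well above the minimum over many rounds is the pattern worth investigating; a flat profile suggests the pools are being reused as intended.
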

- VLLM_BEAM_GROUP_STATE_CAPACITY controls GPU beam-state pool capacity.
- VLLM_BEAM_TRANSITION_BUFFER_SLOTS controls async transition buffer slots.
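
For a quick end-to-end check, the example request body shown earlier can be sent from Python using only the standard library. This is a sketch under assumptions: the helper names `build_beam_request` and `send_completion` are hypothetical, and the `/v1/completions` path is assumed from the `prompt`-style body and the OpenAI-compatible server entrypoint; the port and served model name match the launch command above.

```python
import json
import urllib.request


def build_beam_request(prompt: str, model: str = "llama3-8b",
                       beam_width: int = 4, no_repeat_ngram_size: int = 3,
                       max_tokens: int = 128) -> dict:
    """Build a completions payload mirroring the example request body."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
        "add_special_tokens": False,
        # Beam-search options are passed through vllm_xargs.
        "vllm_xargs": {
            "beam_width": beam_width,
            "no_repeat_ngram_size": no_repeat_ngram_size,
        },
    }


def send_completion(payload: dict, base_url: str = "http://localhost:8005",
                    timeout: float = 120.0) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_beam_request("Write a concise summary of why beam search is useful:")
print(json.dumps(payload, indent=2))
# With the server running:
# result = send_completion(payload)
# print(result["choices"][0]["text"])
```

Keeping payload construction separate from the network call makes it easy to vary `beam_width` or `no_repeat_ngram_size` per request without touching the transport code.
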