Add vLLM offline backend with micro-batching support by maryamtahhan · Pull Request #736 · vllm-project/guidellm

maryamtahhan · 2026-05-20T15:33:44Z

Add vLLM Offline Backend with Shared Base Class

This PR implements offline/batch inference support for vLLM using a clean, extensible architecture that eliminates code duplication between vLLM backends.

Summary

Adds VLLMOfflineBackend for batch processing and refactors existing vLLM code into a shared VLLMBackendBase class. This reduces code duplication by ~360 lines while adding new offline inference capabilities optimized for benchmarking scenarios.

New Components

VLLMBackendBase (base.py)

Shared base class for all vLLM backends containing ~400 lines of common functionality:

Chat template resolution (plain, default-template, custom Jinja2)
Multimodal data handling (image/audio columns)
Request formatting and prompt resolution
Sampling parameter creation
Abstract _get_tokenizer() method for subclass implementation
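To illustrate the shared-base pattern described above, here is a minimal sketch; it is not the code in this PR. Only the `_get_tokenizer()` name comes from the description. The class names, the `count_prompt_tokens` helper, and the fake tokenizer are hypothetical and exist only to show how common logic can live in the base while each backend supplies its own tokenizer.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SketchBackendBase(ABC):
    """Hypothetical stand-in for a shared base class (not the PR's code)."""

    @abstractmethod
    def _get_tokenizer(self) -> Any:
        """Each subclass returns the tokenizer owned by its own engine."""

    def count_prompt_tokens(self, prompt: str) -> int:
        # Shared logic lives here once and works for every subclass,
        # because it only depends on the abstract hook above.
        return len(self._get_tokenizer().encode(prompt))


class FakeTokenizer:
    """Whitespace tokenizer used only to keep the sketch self-contained."""

    def encode(self, text: str) -> List[str]:
        return text.split()


class SketchOfflineBackend(SketchBackendBase):
    def _get_tokenizer(self) -> Any:
        # A real backend would return the tokenizer from its engine instead.
        return FakeTokenizer()


backend = SketchOfflineBackend()
print(backend.count_prompt_tokens("hello offline world"))
```

Instantiating the base class directly raises `TypeError`, which is what forces each new backend to implement the hook.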

VLLMOfflineBackend (offline.py)

New backend for offline batch processing using vLLM's LLM class:

Micro-batching with configurable batch_size (default: 32)
Buffers requests until batch is full, then processes with LLM.generate()
Auto-flushes remaining requests on shutdown
Single-process execution for batch coordination
Ideal for offline benchmarking and dataset evaluation
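The buffering behaviour listed above can be sketched roughly as follows. This is an assumption-laden illustration, not the backend's implementation: `MicroBatcher`, `submit`, `flush`, and `fake_generate` are hypothetical names, and `generate_fn` stands in for a call such as `LLM.generate()`.

```python
from typing import Callable, List


class MicroBatcher:
    """Hypothetical buffer that groups prompts into fixed-size batches."""

    def __init__(
        self,
        generate_fn: Callable[[List[str]], List[str]],
        batch_size: int = 32,
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.generate_fn = generate_fn
        self.batch_size = batch_size
        self._buffer: List[str] = []
        self.results: List[str] = []

    def submit(self, prompt: str) -> None:
        # Buffer the request; process only once the batch is full.
        self._buffer.append(prompt)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered as one batch, then clear the buffer.
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self.results.extend(self.generate_fn(batch))

    def shutdown(self) -> None:
        # Auto-flush any partial batch so no request is dropped.
        self.flush()


batch_sizes: List[int] = []


def fake_generate(prompts: List[str]) -> List[str]:
    # Stand-in for the model call; records how large each batch was.
    batch_sizes.append(len(prompts))
    return [p.upper() for p in prompts]


batcher = MicroBatcher(fake_generate, batch_size=4)
for i in range(10):
    batcher.submit(f"prompt {i}")
batcher.shutdown()
print(batch_sizes)  # [4, 4, 2]
```

Ten requests with a batch size of 4 produce two full batches plus one partial batch that is flushed at shutdown.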

Refactored VLLMPythonBackend (vllm.py)

Now extends VLLMBackendBase instead of Backend directly
Removed ~360 lines of duplicate code
Implements _get_tokenizer() for AsyncLLMEngine
No breaking changes to public API

Key Benefits

✅ Code Reuse: ~400 lines shared between backends
✅ Reduced Duplication: ~360 lines eliminated from VLLMPythonBackend
✅ Extensibility: Easy to add new vLLM-based backends (e.g., vLLM server)
✅ No Breaking Changes: VLLMPythonBackend API unchanged
✅ Clean Architecture: Clear separation of concerns with shared base

Documentation

New guide: docs/guides/vllm-offline-backend.md
- Usage examples and configuration options
- Performance tuning (batch size, vLLM EngineArgs)
- Comparison with other backends
- Troubleshooting guide
Updated: docs/guides/backends.md with offline backend section

Usage Example

guidellm benchmark run \
  --backend vllm_offline \
  --model "Qwen/Qwen3-0.6B" \
  --backend-kwargs '{"batch_size": 64, "vllm_config": {"tensor_parallel_size": 2}}' \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-requests 1000
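For readers unfamiliar with the `--backend-kwargs` JSON shape, the sketch below shows one way such a string could be split into a batch size and a pass-through vLLM config. `OfflineArgsSketch` and `parse_backend_kwargs` are hypothetical; the actual `VLLMOfflineBackendArgs` in this PR may validate and name things differently.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class OfflineArgsSketch:
    """Hypothetical container mirroring the two keys in the example above."""

    batch_size: int = 32  # default taken from the PR description
    vllm_config: Dict[str, Any] = field(default_factory=dict)


def parse_backend_kwargs(raw: str) -> OfflineArgsSketch:
    # Decode the JSON string given on the command line.
    data = json.loads(raw)
    args = OfflineArgsSketch(**data)
    if args.batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return args


args = parse_backend_kwargs(
    '{"batch_size": 64, "vllm_config": {"tensor_parallel_size": 2}}'
)
print(args.batch_size, args.vllm_config)
```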

Test Plan

Unit Tests (✅ Passing)

2296 unit tests passing (all existing + new tests)
New test coverage:
- VLLMOfflineBackend lifecycle (startup, shutdown, validate)
- Batch processing logic and request buffering
- VLLMBackendBase request resolution and formatting
- Chat template handling (plain, default, custom)
- Multimodal data processing (audio/image)
- Sampling parameter creation
- Backend registration and creation

Integration Tests (✅ Verified)

Backend registration in Backend registry
Args creation and validation (VLLMOfflineBackendArgs)
Backend creation via Backend.create()
Request resolution with chat templates
Batch size configuration (8-128+)
vLLM config passthrough (tensor_parallel_size, gpu_memory_utilization, etc.)
Backend info property exposure

Manual Testing

Validated functionality in a local environment

Details

Add VLLMBackendBase shared base class in src/guidellm/backends/vllm_python/base.py
Add VLLMOfflineBackend and VLLMOfflineBackendArgs in src/guidellm/backends/vllm_python/offline.py
Refactor VLLMPythonBackend to extend VLLMBackendBase (eliminate duplication)
Re-export test helpers (_ResolvedRequest, _has_jinja2_markers) from base for backward compatibility
Add optional dependency handling for audio/vision extras (catch RuntimeError from torchcodec/PIL)
Add comprehensive test coverage in tests/unit/backends/vllm_python/test_vllm.py
Add new guide docs/guides/vllm-offline-backend.md
Update docs/guides/backends.md with offline backend documentation
Register vllm_offline backend type in Backend registry
Update test_backend.py with offline backend registration test
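The optional dependency handling mentioned in the list above might look roughly like this. It is a sketch under assumptions: `optional_import` is a hypothetical helper, and the only detail taken from this PR is that a `RuntimeError` from the audio/vision extras is caught alongside a missing package.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it is missing or unusable."""
    try:
        return importlib.import_module(name)
    except (ImportError, RuntimeError):
        # ImportError: the extra is not installed.
        # RuntimeError: the package is present but fails while loading,
        # which the PR notes can happen with torchcodec/PIL.
        return None


json_mod = optional_import("json")
missing = optional_import("no_such_module_for_this_sketch")
print(json_mod is not None, missing is None)
```

Callers can then check for `None` and raise a clear error only when an audio or image column is actually used.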

"I certify that all code in this PR is my own, except as noted below."

Use of AI

Includes code generated or substantially modified by an AI agent
Includes tests generated or substantially modified by an AI agent

All commits include appropriate Co-Authored-By trailers as described in DEVELOPING.md.

git log

commit 5d2304d
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 11:35:31 2026 +0100

Add vLLM Offline Backend for batch processing

Implements standalone offline backend using vLLM's LLM class for micro-batching.
Adapted to main's architecture without VLLMBackendBase, using main's import patterns
(lazy loading via guidellm.extras, utils.audio/vision).

Features:
- Batch processing with configurable batch_size (default: 32)
- Chat template support (plain, default-template, custom Jinja2)
- Multimodal data handling (image/audio)
- Single-process execution for batch coordination
- Compatible with vLLM 0.21.0+

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit 251eb67
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 12:35:17 2026 +0100

Fix __all__ ordering in vllm_python __init__

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit bfbcad1
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 13:41:36 2026 +0100

Refactor vLLM backends to use shared common.py module

Extract duplicated helper methods (_build_multi_modal_data_from_columns,
_resolve_chat_template, _extract_prompt_chat_tokenizer, _create_sampling_params)
into common.py to follow DRY principles.

This addresses maintainer feedback about code reuse and abstraction.
Both vllm_python and vllm_offline backends now share the same implementation
for these helpers, reducing code duplication from ~400 lines to a single
shared module.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit 1b673a2
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 14:11:19 2026 +0100

Extract all duplicated helpers to common.py for maximum code reuse

Moved 5 additional helper methods to common.py that were duplicated between
vllm_python and vllm_offline backends:
- extract_text_from_content
- build_placeholder_prefix
- format_column_blocks
- inject_placeholders_into_messages
- extract_prompt_chat_plain

Total duplication eliminated: ~450 lines across both backends.

All helper logic is now centralized in common.py with both backends using thin
wrapper methods that delegate to the shared implementation.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit f27b076
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 14:18:13 2026 +0100

Fix mypy type errors for lazy-loaded vllm module

Add type: ignore comments for vllm.EngineArgs and vllm.LLM runtime usage
since these are lazy-loaded and mypy can't resolve them at static analysis time.
Use Any type for vllm.LLM annotations with inline comments documenting the
actual type.

Fixes CI type-check failures.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

mergify · 2026-06-25T11:36:03Z

Hi @maryamtahhan, the DCO check has failed. Please click on DCO in the Checks section for instructions on how to resolve this.

maryamtahhan · 2026-06-25T13:24:41Z

@sjmonson @jaredoconnell this PR has been rebased and is green again

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch 4 times, most recently from fc01371 to bbe2874 May 25, 2026 09:14

maryamtahhan marked this pull request as ready for review May 25, 2026 10:25

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from efa1d9e to 942fa2e May 25, 2026 13:43

sjmonson self-requested a review May 27, 2026 15:25

sjmonson added the internal (filed by core contributor or associate) label May 27, 2026

sjmonson added this to the v0.8.0 milestone May 27, 2026

sjmonson added the priority-low label Jun 1, 2026

sjmonson requested a review from jaredoconnell June 1, 2026 15:01

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from 942fa2e to 5d2304d June 25, 2026 11:32

maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from 664810c to 251eb67 June 25, 2026 11:37

maryamtahhan added 3 commits June 25, 2026 13:41

maryamtahhan wants to merge 5 commits into vllm-project:main from maryamtahhan:feat/vllm-offline-batching-backend
