Add vLLM offline backend with micro-batching support #736

Open
maryamtahhan wants to merge 5 commits into vllm-project:main from maryamtahhan:feat/vllm-offline-batching-backend

@maryamtahhan (Contributor) commented May 20, 2026

Add vLLM Offline Backend with Shared Base Class

This PR implements offline/batch inference support for vLLM using a clean, extensible architecture that eliminates code duplication between vLLM backends.

Summary

Adds VLLMOfflineBackend for batch processing and refactors existing vLLM code into a shared VLLMBackendBase class. This reduces code duplication by ~360 lines while adding new offline inference capabilities optimized for benchmarking scenarios.

New Components

VLLMBackendBase (base.py)

Shared base class for all vLLM backends containing ~400 lines of common functionality:

  • Chat template resolution (plain, default-template, custom Jinja2)
  • Multimodal data handling (image/audio columns)
  • Request formatting and prompt resolution
  • Sampling parameter creation
  • Abstract _get_tokenizer() method for subclass implementation
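
The shared-base idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `BackendBaseSketch`, `OfflineSketch`, `FakeTokenizer`, and `render_prompt` are hypothetical names, and only the abstract `_get_tokenizer()` hook is taken from the description.

```python
from abc import ABC, abstractmethod
from typing import Any


class FakeTokenizer:
    """Toy tokenizer (hypothetical) so the sketch runs without vLLM installed."""

    def apply_chat_template(self, messages: list[dict[str, str]]) -> str:
        # Join messages as "role: content" lines; a real tokenizer would
        # apply the model's chat template instead.
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


class BackendBaseSketch(ABC):
    """Hypothetical stand-in for VLLMBackendBase."""

    @abstractmethod
    def _get_tokenizer(self) -> Any:
        """Return the tokenizer of whichever engine the subclass wraps."""

    def render_prompt(self, messages: list[dict[str, str]]) -> str:
        # Shared logic lives here and only asks the subclass for a tokenizer.
        return self._get_tokenizer().apply_chat_template(messages)


class OfflineSketch(BackendBaseSketch):
    """Hypothetical subclass; a real one would return the engine's tokenizer."""

    def _get_tokenizer(self) -> Any:
        return FakeTokenizer()
```

Each concrete backend then supplies only the engine-specific lookup, while formatting code is written once in the base class.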

VLLMOfflineBackend (offline.py)

New backend for offline batch processing using vLLM's LLM class:

  • Micro-batching with configurable batch_size (default: 32)
  • Buffers requests until batch is full, then processes with LLM.generate()
  • Auto-flushes remaining requests on shutdown
  • Single-process execution for batch coordination
  • Ideal for offline benchmarking and dataset evaluation
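
The buffering behaviour listed above (collect requests until the batch is full, process them in one call, flush leftovers on shutdown) might look something like this sketch. `MicroBatcher` and its method names are hypothetical, and the `generate` callable stands in for a batched call such as `LLM.generate()`; the PR's actual implementation may differ.

```python
from collections.abc import Callable


class MicroBatcher:
    """Hypothetical request buffer illustrating micro-batching."""

    def __init__(
        self,
        generate: Callable[[list[str]], list[str]],
        batch_size: int = 32,
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._generate = generate
        self._batch_size = batch_size
        self._pending: list[str] = []
        self.results: list[str] = []

    def submit(self, prompt: str) -> None:
        # Buffer the request; process the batch once it is full.
        self._pending.append(prompt)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Process whatever is buffered in a single batched call.
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self.results.extend(self._generate(batch))

    def shutdown(self) -> None:
        # Remaining requests are flushed so nothing is dropped.
        self.flush()
```

With `batch_size=2`, submitting three prompts triggers one batched call for the first two, and `shutdown()` processes the third.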

Refactored VLLMPythonBackend (vllm.py)

  • Now extends VLLMBackendBase instead of Backend directly
  • Removed ~360 lines of duplicate code
  • Implements _get_tokenizer() for AsyncLLMEngine
  • No breaking changes to public API

Key Benefits

  • Code Reuse: ~400 lines shared between backends
  • Reduced Duplication: ~360 lines eliminated from VLLMPythonBackend
  • Extensibility: Easy to add new vLLM-based backends (e.g., vLLM server)
  • No Breaking Changes: VLLMPythonBackend API unchanged
  • Clean Architecture: Clear separation of concerns with shared base

Documentation

  • New guide: docs/guides/vllm-offline-backend.md
    • Usage examples and configuration options
    • Performance tuning (batch size, vLLM EngineArgs)
    • Comparison with other backends
    • Troubleshooting guide
  • Updated: docs/guides/backends.md with offline backend section

Usage Example

guidellm benchmark run \
  --backend vllm_offline \
  --model "Qwen/Qwen3-0.6B" \
  --backend-kwargs '{"batch_size": 64, "vllm_config": {"tensor_parallel_size": 2}}' \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-requests 1000
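
To make the shape of the `--backend-kwargs` value concrete, the sketch below decodes the same JSON string and separates the batching option from the engine options. `split_backend_kwargs` is a hypothetical helper written for illustration; how guidellm actually consumes these kwargs is not shown in this PR description. The default of 32 mirrors the documented `batch_size` default.

```python
import json
from typing import Any


def split_backend_kwargs(raw: str) -> tuple[int, dict[str, Any]]:
    """Hypothetical helper: split batch_size from the vLLM engine config."""
    kwargs = json.loads(raw)
    # batch_size controls micro-batching in the backend itself.
    batch_size = int(kwargs.get("batch_size", 32))
    # vllm_config is described as being passed through to vLLM's engine args.
    vllm_config = dict(kwargs.get("vllm_config", {}))
    return batch_size, vllm_config


batch_size, vllm_config = split_backend_kwargs(
    '{"batch_size": 64, "vllm_config": {"tensor_parallel_size": 2}}'
)
```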

Test Plan

Unit Tests (✅ Passing)

  • 2296 unit tests passing (all existing + new tests)
  • New test coverage:
    • VLLMOfflineBackend lifecycle (startup, shutdown, validate)
    • Batch processing logic and request buffering
    • VLLMBackendBase request resolution and formatting
    • Chat template handling (plain, default, custom)
    • Multimodal data processing (audio/image)
    • Sampling parameter creation
    • Backend registration and creation

Integration Tests (✅ Verified)

  • Backend registration in Backend registry
  • Args creation and validation (VLLMOfflineBackendArgs)
  • Backend creation via Backend.create()
  • Request resolution with chat templates
  • Batch size configuration (8-128+)
  • vLLM config passthrough (tensor_parallel_size, gpu_memory_utilization, etc.)
  • Backend info property exposure

Manual Testing

  • Validated functionality in a local environment

Details

  • Add VLLMBackendBase shared base class in src/guidellm/backends/vllm_python/base.py
  • Add VLLMOfflineBackend and VLLMOfflineBackendArgs in src/guidellm/backends/vllm_python/offline.py
  • Refactor VLLMPythonBackend to extend VLLMBackendBase (eliminate duplication)
  • Re-export test helpers (_ResolvedRequest, _has_jinja2_markers) from base for backward compatibility
  • Add optional dependency handling for audio/vision extras (catch RuntimeError from torchcodec/PIL)
  • Add comprehensive test coverage in tests/unit/backends/vllm_python/test_vllm.py
  • Add new guide docs/guides/vllm-offline-backend.md
  • Update docs/guides/backends.md with offline backend documentation
  • Register vllm_offline backend type in Backend registry
  • Update test_backend.py with offline backend registration test
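
The optional-dependency item above mentions catching `RuntimeError` in addition to the usual import failure. A minimal sketch of that pattern, assuming a hypothetical helper named `optional_import` (the PR's own handling goes through `guidellm.extras` and is not reproduced here):

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Hypothetical helper: return the module, or None if it is unusable."""
    try:
        return importlib.import_module(name)
    except (ImportError, RuntimeError):
        # ImportError: the extra is not installed.
        # RuntimeError: the package is installed but fails while loading,
        # for example when a native library it needs is missing.
        return None
```

Callers can then check for `None` and raise a clear error only when an audio or image column is actually used.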

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes code generated or substantially modified by an AI agent
  • Includes tests generated or substantially modified by an AI agent

All commits include appropriate Co-Authored-By trailers as described in DEVELOPING.md.


git log

commit 5d2304d
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 11:35:31 2026 +0100

Add vLLM Offline Backend for batch processing

Implements standalone offline backend using vLLM's LLM class for micro-batching.
Adapted to main's architecture without VLLMBackendBase, using main's import patterns
(lazy loading via guidellm.extras, utils.audio/vision).

Features:
- Batch processing with configurable batch_size (default: 32)
- Chat template support (plain, default-template, custom Jinja2)
- Multimodal data handling (image/audio)
- Single-process execution for batch coordination
- Compatible with vLLM 0.21.0+

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit 251eb67
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 12:35:17 2026 +0100

Fix __all__ ordering in vllm_python __init__

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit bfbcad1
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 13:41:36 2026 +0100

Refactor vLLM backends to use shared common.py module

Extract duplicated helper methods (_build_multi_modal_data_from_columns,
_resolve_chat_template, _extract_prompt_chat_tokenizer, _create_sampling_params)
into common.py to follow DRY principles.

This addresses maintainer feedback about code reuse and abstraction.
Both vllm_python and vllm_offline backends now share the same implementation
for these helpers, reducing code duplication from ~400 lines to a single
shared module.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit 1b673a2
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 14:11:19 2026 +0100

Extract all duplicated helpers to common.py for maximum code reuse

Moved 5 additional helper methods to common.py that were duplicated between
vllm_python and vllm_offline backends:
- extract_text_from_content
- build_placeholder_prefix
- format_column_blocks
- inject_placeholders_into_messages
- extract_prompt_chat_plain

Total duplication eliminated: ~450 lines across both backends.

All helper logic is now centralized in common.py with both backends using thin
wrapper methods that delegate to the shared implementation.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

commit f27b076
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 14:18:13 2026 +0100

Fix mypy type errors for lazy-loaded vllm module

Add type: ignore comments for vllm.EngineArgs and vllm.LLM runtime usage
since these are lazy-loaded and mypy can't resolve them at static analysis time.
Use Any type for vllm.LLM annotations with inline comments documenting the
actual type.

Fixes CI type-check failures.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

@maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch 4 times, most recently from fc01371 to bbe2874 on May 25, 2026 09:14
@maryamtahhan marked this pull request as ready for review on May 25, 2026 10:25
@maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from efa1d9e to 942fa2e on May 25, 2026 13:43
@sjmonson self-requested a review on May 27, 2026 15:25
@sjmonson added the "internal" label (filed by core contributor or associate) on May 27, 2026
@sjmonson added this to the v0.8.0 milestone on May 27, 2026
@sjmonson requested a review from jaredoconnell on June 1, 2026 15:01
@maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from 942fa2e to 5d2304d on June 25, 2026 11:32

mergify Bot (Contributor) commented Jun 25, 2026

Hi @maryamtahhan, the DCO check has failed. Please click on DCO in the Checks section for instructions on how to resolve this.

@maryamtahhan force-pushed the feat/vllm-offline-batching-backend branch from 664810c to 251eb67 on June 25, 2026 11:37

@maryamtahhan (Contributor, Author) commented

@sjmonson @jaredoconnell this PR has been rebased and is green again

Labels

internal (filed by core contributor or associate), priority-low
