Add vLLM offline backend with micro-batching support #736
Open
maryamtahhan wants to merge 5 commits into
fc01371 to bbe2874
efa1d9e to 942fa2e
Implements a standalone offline backend using vLLM's LLM class for micro-batching. Adapted to main's architecture without VLLMBackendBase, using main's import patterns (lazy loading via guidellm.extras, utils.audio/vision).

Features:
- Batch processing with configurable batch_size (default: 32)
- Chat template support (plain, default-template, custom Jinja2)
- Multimodal data handling (image/audio)
- Single-process execution for batch coordination
- Compatible with vLLM 0.21.0+

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
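The micro-batching described in this commit can be pictured as chunking pending requests into groups of at most `batch_size` and issuing one batched generate call per group. This is a minimal sketch under stated assumptions, not the PR's implementation: `micro_batches`, `run_batched`, and `generate_fn` are hypothetical names, and `generate_fn` stands in for a batched call such as vLLM's `LLM.generate()`.

```python
from collections.abc import Callable, Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


def micro_batches(items: Sequence[T], batch_size: int = 32) -> Iterator[list[T]]:
    """Yield consecutive chunks of at most `batch_size` items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


def run_batched(
    prompts: Sequence[T],
    generate_fn: Callable[[list[T]], list[R]],
    batch_size: int = 32,
) -> list[R]:
    """Call `generate_fn` once per micro-batch, keeping outputs in request order.

    `generate_fn` is an assumed stand-in for a batched call such as
    `LLM.generate()`: it takes a list of prompts and returns one output each.
    """
    outputs: list[R] = []
    for batch in micro_batches(prompts, batch_size):
        batch_outputs = generate_fn(batch)
        if len(batch_outputs) != len(batch):
            raise RuntimeError("generate_fn must return one output per prompt")
        outputs.extend(batch_outputs)
    return outputs


# Usage with a fake generator that records the size of each micro-batch.
batch_sizes: list[int] = []


def fake_generate(batch: list[str]) -> list[str]:
    batch_sizes.append(len(batch))
    return [prompt.upper() for prompt in batch]


results = run_batched([f"p{i}" for i in range(70)], fake_generate, batch_size=32)
print(batch_sizes)  # [32, 32, 6]
```

In this sketch `batch_size` only bounds how many prompts are handed to a single generate call; scheduling inside each call is left to the engine.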
942fa2e to 5d2304d
Hi @maryamtahhan, the DCO check has failed. Please click on DCO in the Checks section for instructions on how to resolve this.
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
664810c to 251eb67
Extract duplicated helper methods (_build_multi_modal_data_from_columns, _resolve_chat_template, _extract_prompt_chat_tokenizer, _create_sampling_params) into common.py to follow DRY principles.

This addresses maintainer feedback about code reuse and abstraction. Both the vllm_python and vllm_offline backends now share the same implementation of these helpers, replacing ~400 lines of duplicated code with a single shared module.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
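`_resolve_chat_template`, one of the helpers moved into `common.py` here, has to choose between the three chat-template modes listed in the first commit: plain, the tokenizer's default template, or a custom Jinja2 template. The sketch below shows one way such a decision could be made. The function and class names, the `"plain"`/`"default"` sentinel strings, and the marker-based detection (loosely modeled on the `_has_jinja2_markers` helper named in the PR description) are assumptions, not the PR's code.

```python
import re
from dataclasses import dataclass
from typing import Literal, Optional

# Jinja2 expression ({{ }}), statement ({% %}) and comment ({# #}) openers.
_JINJA2_OPENERS = re.compile(r"\{\{|\{%|\{#")


def has_jinja2_markers(text: str) -> bool:
    """Return True if `text` looks like an inline Jinja2 template."""
    return _JINJA2_OPENERS.search(text) is not None


@dataclass(frozen=True)
class ChatTemplateChoice:
    mode: Literal["plain", "default", "custom"]
    template: Optional[str] = None  # only set when mode == "custom"


def resolve_chat_template(value: Optional[str]) -> ChatTemplateChoice:
    """Map a user-supplied setting to one of the three chat-template modes.

    Conventions assumed for this sketch only:
      None or "default" -> use the tokenizer's built-in chat template
      "plain"           -> no template; message text is used as-is
      inline Jinja2     -> custom template string
    """
    if value is None or value == "default":
        return ChatTemplateChoice(mode="default")
    if value == "plain":
        return ChatTemplateChoice(mode="plain")
    if has_jinja2_markers(value):
        return ChatTemplateChoice(mode="custom", template=value)
    # This sketch does not handle template files; anything else is rejected.
    raise ValueError(f"unrecognized chat template setting: {value!r}")
```

Returning a small immutable value instead of a bare string keeps the mode and the template text together, so callers cannot pair a custom mode with a missing template.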
Moved 5 additional helper methods that were duplicated between the vllm_python and vllm_offline backends to common.py:
- extract_text_from_content
- build_placeholder_prefix
- format_column_blocks
- inject_placeholders_into_messages
- extract_prompt_chat_plain

Total duplication eliminated: ~450 lines across both backends. All helper logic is now centralized in common.py, and both backends use thin wrapper methods that delegate to the shared implementation.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
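The "thin wrapper" arrangement this commit describes can be sketched as one module-level helper plus per-backend methods that only delegate to it. This is an illustration under assumptions: the wrapper classes are hypothetical, and the body of `extract_text_from_content` assumes OpenAI-style message content (a string, or a list of `{"type": "text", "text": ...}` parts) joined with newlines, which may differ from the real helper.

```python
from typing import Any, Union

Content = Union[str, list[dict[str, Any]], None]


# --- would live in a shared module such as common.py -----------------------
def extract_text_from_content(content: Content) -> str:
    """Return the text portion of an OpenAI-style message `content` field.

    Assumption for this sketch: `content` is a string, None, or a list of parts
    like {"type": "text", "text": "..."}; non-text parts are skipped and text
    parts are joined with newlines.
    """
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    return "\n".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )


# --- each backend keeps a thin wrapper under its existing method name --------
class PythonBackendSketch:
    def _extract_text_from_content(self, content: Content) -> str:
        return extract_text_from_content(content)


class OfflineBackendSketch:
    def _extract_text_from_content(self, content: Content) -> str:
        return extract_text_from_content(content)
```

Keeping the wrappers preserves the method names that existing call sites and tests use, while the logic itself exists in exactly one place.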
Add type: ignore comments for vllm.EngineArgs and vllm.LLM runtime usage, since these are lazy-loaded and mypy cannot resolve them during static analysis. Use the Any type for vllm.LLM annotations, with inline comments documenting the actual type. Fixes CI type-check failures.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
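The lazy-loading pattern behind this commit can be sketched as follows: the optional dependency is imported only when the backend starts, and the attribute holding the engine is annotated as `Any` with a comment naming the real type. This is an illustration under assumptions: it uses plain `importlib` rather than guidellm's `guidellm.extras` mechanism, and `load_optional_module` and `OfflineBackendSketch` are hypothetical names.

```python
import importlib
from types import ModuleType
from typing import Any


def load_optional_module(name: str) -> ModuleType:
    """Import an optional dependency on first use, with a clearer error message."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"optional dependency {name!r} is required for this backend"
        ) from exc


class OfflineBackendSketch:
    def __init__(self, module_name: str = "vllm") -> None:
        self._module_name = module_name
        # Annotated as Any because the real type (vllm.LLM) is not resolvable
        # at static-analysis time; this comment documents the actual type.
        self._llm: Any = None  # vllm.LLM after startup()

    def startup(self, **engine_kwargs: Any) -> None:
        # The engine module is imported here, not at module import time, so
        # this file can be imported without vLLM installed. With a loader that
        # mypy cannot see through, this is the line a "# type: ignore" targets.
        module = load_optional_module(self._module_name)
        self._llm = module.LLM(**engine_kwargs)

    def shutdown(self) -> None:
        self._llm = None
```

Deferring the import to `startup()` keeps environments without vLLM able to import and list the backend, at the cost of surfacing a missing dependency only when the backend is actually used.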
maryamtahhan (Author):
@sjmonson @jaredoconnell this PR has been rebased and CI is green again.
Add vLLM Offline Backend with Shared Base Class
This PR implements offline/batch inference support for vLLM using a clean, extensible architecture that eliminates code duplication between vLLM backends.
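The shared-base-class structure can be sketched with an abstract `_get_tokenizer()` hook: common request-handling code lives once in the base, and each subclass only says how to reach its engine's tokenizer (the description below names `_get_tokenizer()` as the method left for subclass implementation). Everything else here (`VLLMBackendBaseSketch`, `count_prompt_tokens`, `WhitespaceTokenizer`, `OfflineBackendSketch`) is hypothetical and stands in for the real vLLM objects.

```python
from abc import ABC, abstractmethod
from typing import Any


class VLLMBackendBaseSketch(ABC):
    """Hypothetical shared base: common logic once, engine access per subclass."""

    @abstractmethod
    def _get_tokenizer(self) -> Any:
        """Return the tokenizer of the concrete engine (e.g. LLM or AsyncLLMEngine)."""

    def count_prompt_tokens(self, prompt: str) -> int:
        # Shared logic written once; it depends only on the abstract hook.
        return len(self._get_tokenizer().encode(prompt))


class WhitespaceTokenizer:
    """Stand-in tokenizer for this example: one token per whitespace-separated word."""

    def encode(self, text: str) -> list[int]:
        return list(range(len(text.split())))


class OfflineBackendSketch(VLLMBackendBaseSketch):
    """Subclass standing in for an offline backend; it only supplies the hook."""

    def __init__(self, tokenizer: Any) -> None:
        self._tokenizer = tokenizer

    def _get_tokenizer(self) -> Any:
        return self._tokenizer
```

Because `_get_tokenizer()` is abstract, the base cannot be instantiated on its own, and a new backend gets all shared helpers by implementing that single method.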
Summary
Adds `VLLMOfflineBackend` for batch processing and refactors existing vLLM code into a shared `VLLMBackendBase` class. This reduces code duplication by ~360 lines while adding new offline inference capabilities optimized for benchmarking scenarios.

New Components
VLLMBackendBase (base.py)
Shared base class for all vLLM backends, containing ~400 lines of common functionality:
- `_get_tokenizer()` method for subclass implementation

VLLMOfflineBackend (offline.py)
New backend for offline batch processing using vLLM's `LLM` class:
- `batch_size` (default: 32)
- `LLM.generate()`

Refactored VLLMPythonBackend (vllm.py)
- Extends `VLLMBackendBase` instead of `Backend` directly
- Implements `_get_tokenizer()` for `AsyncLLMEngine`

Key Benefits
Documentation
- `docs/guides/vllm-offline-backend.md`
- `docs/guides/backends.md` with offline backend section

Usage Example
Test Plan
Unit Tests (✅ Passing)
- `VLLMOfflineBackend` lifecycle (startup, shutdown, validate)
- `VLLMBackendBase` request resolution and formatting

Integration Tests (✅ Verified)
- `Backend.create()`

Manual Testing
Details
- `VLLMBackendBase` shared base class in `src/guidellm/backends/vllm_python/base.py`
- `VLLMOfflineBackend` and `VLLMOfflineBackendArgs` in `src/guidellm/backends/vllm_python/offline.py`
- `VLLMPythonBackend` to extend `VLLMBackendBase` (eliminate duplication)
- `_ResolvedRequest`, `_has_jinja2_markers`) from base for backward compatibility
- `RuntimeError` from torchcodec/PIL)
- `tests/unit/backends/vllm_python/test_vllm.py`
- `docs/guides/vllm-offline-backend.md`
- `docs/guides/backends.md` with offline backend documentation
- `vllm_offline` backend type in Backend registry
- `test_backend.py` with offline backend registration test

Use of AI
git log
commit 5d2304d
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 11:35:31 2026 +0100
commit 251eb67
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 12:35:17 2026 +0100
commit bfbcad1
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 13:41:36 2026 +0100
commit 1b673a2
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 14:11:19 2026 +0100
commit f27b076
Author: Maryam Tahhan <mtahhan@redhat.com>
Date: Thu Jun 25 14:18:13 2026 +0100
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>