[MM][Perf][CG] Support dual-path ViT full CUDA graph for DeepSeek-OCR#43586
shen-shanshan wants to merge 15 commits into
Documentation preview: https://vllm--43586.org.readthedocs.build/en/43586/ |
Code Review
This pull request implements the SupportsEncoderCudaGraph protocol for the DeepseekOCRForCausalLM model, enabling CUDA graph support for its vision encoder. The implementation includes methods for token calculation, input preparation, and post-processing. Critical feedback was provided regarding a potential TypeError in _get_num_input_output_tokens due to a missing null check, and incorrect configuration in EncoderCudaGraphConfig where input_key_by_modality should be used instead of input_keys. Furthermore, the images_crop tensor must be explicitly registered in buffer_keys and included in both capture and replay buffers to ensure correct graph execution. Finally, the get_max_frames_per_video method should be added to fully comply with the protocol.
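The TypeError flagged in the review is the kind that arises when an optional value is used in arithmetic without a None check. The sketch below is only an illustration: `count_encoder_tokens` and its parameters are hypothetical names, not the PR's `_get_num_input_output_tokens`, and the per-item token counts (272 and 100) are taken from the PR description.

```python
from typing import Optional

GLOBAL_TOKENS = 272  # tokens per global image, as stated in the PR description
LOCAL_TOKENS = 100   # tokens per local patch, as stated in the PR description


def count_encoder_tokens(num_images: int, num_patches: Optional[int]) -> int:
    """Hypothetical helper: total encoder tokens for one request."""
    # Without this guard, `None * LOCAL_TOKENS` raises TypeError.
    if num_patches is None:
        num_patches = 0
    return num_images * GLOBAL_TOKENS + num_patches * LOCAL_TOKENS


untiled = count_encoder_tokens(num_images=2, num_patches=None)
tiled = count_encoder_tokens(num_images=1, num_patches=4)
```

Treating a missing patch count as zero is an assumption of this sketch; the real method may need to handle the missing value differently.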
|
This pull request has merge conflicts that must be resolved before it can be |
|
CC @Isotr0py At first, I tried to include the
Thus, I decided to only include
Future Plan (in following PRs): I want to explore a dual-path ViT CUDA graph budget selection mechanism for DeepSeek-OCR and Step3-VL to decouple the |
|
Hi @shen-shanshan, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits, |
Signed-off-by: shen-shanshan <467638484@qq.com>
|
It seems the multimodal CI failures are related: https://buildkite.com/vllm/ci/builds/71879#019ebeee-1634-4a5e-b0a4-3280eadb8c8b |
Sorry, there are still some issues, so please don't merge yet. I will fix them soon. |
Purpose
This PR implements the SupportsEncoderCudaGraph protocol for DeepseekOCRForCausalLM, enabling full CUDA graph capture of the vision encoder with a dual-path graph architecture. DeepSeek-OCR uses a two-tower ViT (SAM + CLIP) with a dynamic tiling mechanism: global images at 1024×1024 and optional local patches at 640×640.
Rather than capturing a single monolithic graph for both paths, this PR introduces a dual-path design (enable_dual_path_graph=True): two independent graph sets are captured, one for global images (constant 272 tokens each) and one for local patches (100 tokens each). During inference, the manager independently selects the smallest fitting budget per path. This enables partial graph fallback (one path hits while the other falls back to eager) and skips local-path graphs entirely when no patches are present. It avoids wasted compute on zero-padded patch buffers for untiled images and avoids graphs that would otherwise be invalidated by the variable crop_shape per image.
Note
Find more background details in the DeepSeek-OCR technical report.
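The per-path budget selection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's manager: `pick_budget`, `plan_paths`, and the example budget lists are hypothetical, budgets are assumed to be token counts sorted in ascending order, and the per-item token counts come from the description.

```python
from bisect import bisect_left
from typing import Dict, Optional, Sequence, Tuple

GLOBAL_TOKENS = 272  # tokens per global image (from the PR description)
LOCAL_TOKENS = 100   # tokens per local patch (from the PR description)


def pick_budget(budgets: Sequence[int], needed: int) -> Optional[int]:
    """Smallest captured budget >= needed, or None if nothing fits."""
    i = bisect_left(budgets, needed)  # budgets must be sorted ascending
    return budgets[i] if i < len(budgets) else None


def plan_paths(
    num_images: int,
    num_patches: int,
    global_budgets: Sequence[int],
    local_budgets: Sequence[int],
) -> Dict[str, Tuple[str, Optional[int]]]:
    """Decide independently, per path, between graph replay, eager, or skip."""
    plan: Dict[str, Tuple[str, Optional[int]]] = {}
    for path, count, per_item, budgets in (
        ("global", num_images, GLOBAL_TOKENS, global_budgets),
        ("local", num_patches, LOCAL_TOKENS, local_budgets),
    ):
        if count == 0:
            plan[path] = ("skip", None)  # no work on this path at all
            continue
        budget = pick_budget(budgets, count * per_item)
        plan[path] = ("graph", budget) if budget is not None else ("eager", None)
    return plan


# Hypothetical budgets: the global path fits, the local path falls back to eager.
mixed = plan_paths(2, 30, [272, 544, 1088], [400, 800, 1600])
# An untiled image: the local path is skipped entirely.
untiled = plan_paths(1, 0, [272, 544, 1088], [400, 800, 1600])
```

Because each path is resolved on its own, an oversized patch batch does not force the global path out of its graph, which is the partial fallback behaviour the description refers to.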
TODO:
--mm-encoder-tp-mode data.
Test Plan
E2E functional test:
python examples/generate/multimodal/vision_language_offline.py -m deepseek_ocr --modality "image" --enable-vit-cuda-graph
Benchmark:
Test Result
E2E functional test results:
Benchmark results (old version: only global images are captured in the CUDA graph):
Benchmark results (new version: dual-path graph selection for global images and local patches, respectively):
Note
When image_width <= 640 and image_height <= 640, the multimodal inputs contain only the global image, and no local patches are generated.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
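The note in Test Result about untiled images amounts to a simple size check. The sketch below assumes a hypothetical helper name and a `tile_size` parameter defaulting to the 640 threshold quoted in the note; it is not taken from the model's preprocessing code.

```python
def needs_local_patches(image_width: int, image_height: int, tile_size: int = 640) -> bool:
    """Hypothetical check: local patches are generated only for images
    that exceed the tile size in at least one dimension."""
    return image_width > tile_size or image_height > tile_size


small = needs_local_patches(640, 640)   # global image only
wide = needs_local_patches(1280, 480)   # exceeds 640 in width
```

When this check is false, only the global-image path has work to do, so the local-path graphs can be skipped.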