[MM][Perf][CG] Support dual-path ViT full CUDA graph for DeepSeek-OCR by shen-shanshan · Pull Request #43586 · vllm-project/vllm

shen-shanshan · 2026-05-25T09:24:29Z

Purpose

This PR implements the SupportsEncoderCudaGraph protocol for DeepseekOCRForCausalLM, enabling full CUDA graph capture of the vision encoder with a dual-path graph architecture. DeepSeek-OCR uses a two-tower ViT (SAM + CLIP) with a dynamic tiling mechanism — global images at 1024×1024 and optional local patches at 640×640.

Rather than capturing a single monolithic graph for both paths, this PR introduces a dual-path design (enable_dual_path_graph=True) that captures two independent graph sets: one for global images (a constant 272 tokens each) and one for local patches (100 tokens each). During inference, the manager selects the smallest fitting budget for each path independently. This allows partial graph fallback (one path hits a graph while the other falls back to eager) and lets the local-path graphs be skipped entirely when no patches are present. The design avoids wasted compute on zero-padded patch buffers for untiled images, and avoids graphs that would otherwise be invalidated by the variable crop_shape of each image.
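The per-path selection described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names, the budget lists, and the tuple-based return format are all hypothetical, and budgets are assumed to be token counts sorted in ascending order. Only the 272 and 100 tokens-per-item figures come from the description.

```python
from bisect import bisect_left
from typing import Dict, Optional, Sequence, Tuple

GLOBAL_TOKENS_PER_IMAGE = 272  # from the PR description
LOCAL_TOKENS_PER_PATCH = 100   # from the PR description

Decision = Tuple[str, Optional[int]]  # ("graph", budget) | ("eager", None) | ("skip", None)


def pick_budget(budgets: Sequence[int], needed_tokens: int) -> Optional[int]:
    """Return the smallest captured budget >= needed_tokens, or None if none fits.

    `budgets` is assumed to be sorted in ascending order (hypothetical layout).
    """
    idx = bisect_left(budgets, needed_tokens)
    return budgets[idx] if idx < len(budgets) else None


def select_paths(
    num_images: int,
    num_patches: int,
    global_budgets: Sequence[int],
    local_budgets: Sequence[int],
) -> Dict[str, Decision]:
    """Choose graph replay, eager fallback, or skip for each path independently."""
    plan: Dict[str, Decision] = {}

    # Global path: every image contributes a constant number of tokens.
    g = pick_budget(global_budgets, num_images * GLOBAL_TOKENS_PER_IMAGE)
    plan["global"] = ("graph", g) if g is not None else ("eager", None)

    # Local path: skipped entirely when the batch has no patches.
    if num_patches == 0:
        plan["local"] = ("skip", None)
    else:
        b = pick_budget(local_budgets, num_patches * LOCAL_TOKENS_PER_PATCH)
        plan["local"] = ("graph", b) if b is not None else ("eager", None)
    return plan


# Hypothetical budgets, chosen only for the example.
GLOBAL_BUDGETS = [272, 544, 1088]
LOCAL_BUDGETS = [200, 400, 800]

untiled = select_paths(1, 0, GLOBAL_BUDGETS, LOCAL_BUDGETS)
partial = select_paths(1, 9, GLOBAL_BUDGETS, LOCAL_BUDGETS)
```

In this sketch, `untiled` skips the local path, while `partial` replays the global graph and runs the nine patches eagerly because 900 tokens exceed the largest hypothetical local budget.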

Note

Find more background details at DeepSeek-OCR technical report.

TODO:

Wait for [Multimodal] Simplify ViT CUDA graph interfaces #41234 and Adjust design around encoder_cudagraph_forward #42288 to be merged.
E2E functional test.
Benchmark:
- No DP VIT: eager vs cuda graph.
- DP VIT: This model does not support --mm-encoder-tp-mode data.
Update "Vision Encoder (ViT) CUDA Graphs" docs.
Add CI test.

Test Plan

E2E functional test:

python examples/generate/multimodal/vision_language_offline.py -m deepseek_ocr --modality "image" --enable-vit-cuda-graph

Benchmark:

vllm bench mm-processor \
--model /shared/models/modelscope/models/deepseek-ai/DeepSeek-OCR \
--max-model-len 8192 \
--dataset-name random-mm \
--random-mm-base-items-per-request 1 \
--random-mm-num-mm-items-range-ratio 0.0 \
--random-mm-bucket-config '{(896, 896, 1): 1.0}' \
--random-mm-limit-mm-per-prompt '{"image": 1, "video": 0}' \
--num-prompts 100 \
--compilation-config '{"cudagraph_mm_encoder": true, "encoder_cudagraph_max_vision_items_per_batch": 8}'

Test Result

E2E functional test results:

--------------------------------------------------
The image captures the majestic Tokyo Skytree, the tallest tower in Japan, standing tall against the backdrop of a clear blue sky. The tower, painted in a pristine white, is adorned with a distinctive blue and white striped pattern on its upper section, adding a touch of color to the otherwise monochrome structure. The perspective of the photo is from a low angle, giving the viewer a sense of the tower's impressive height. In the foreground, cherry blossom trees in full bloom add a splash of pink to the scene, their delicate petals contrasting beautifully with the stark white of the tower. The image beautifully encapsulates the blend of urban architecture and natural beauty that characterizes Tokyo.
--------------------------------------------------
The image captures the majestic Tokyo Skytree, the tallest tower in Japan, standing tall against the backdrop of a clear blue sky. The tower, painted in a pristine white, is adorned with a distinctive blue and white striped pattern on its upper section, adding a touch of color to the otherwise monochrome structure. The perspective of the photo is from a low angle, giving the viewer a sense of the tower's impressive height. In the foreground, cherry blossom trees in full bloom add a splash of pink to the scene, their branches reaching out towards the tower. The image beautifully juxtaposes the modernity of the Tokyo Skytree with the natural beauty of the cherry blossoms.
--------------------------------------------------
The image captures the majestic Tokyo Skytree, the tallest tower in Japan, standing tall against the backdrop of a clear blue sky. The tower, painted in a pristine white, is adorned with a distinctive blue and white striped pattern on its upper section, adding a touch of color to the otherwise monochrome structure. The perspective of the photo is from a low angle, giving the viewer a sense of the tower's impressive height. In the foreground, cherry blossom trees in full bloom add a splash of pink to the scene, their branches reaching out towards the tower. The image beautifully juxtaposes the modernity of the Tokyo Skytree with the natural beauty of the cherry blossoms.
--------------------------------------------------
The image captures the majestic Tokyo Skytree, the tallest tower in Japan, standing tall against the backdrop of a clear blue sky. The tower, painted in a pristine white, is adorned with a distinctive blue and white striped pattern on its upper section, adding a touch of color to the otherwise monochrome structure. The perspective of the photo is from a low angle, giving the viewer a sense of the tower's impressive height. In the foreground, cherry blossom trees in full bloom add a splash of pink to the scene, their branches reaching out towards the tower. The image beautifully juxtaposes the modernity of the Tokyo Skytree with the natural beauty of the cherry blossoms.
--------------------------------------------------

Benchmark results (old version: only global images are captured in the CUDA graph):

Input Size	Tiled	Mean Latency	P99 Latency
(224, 224)	❌	-12.76% (20.22ms -> 17.64ms) ↓	-15.49% (22.59ms -> 19.09ms) ↓
(448, 448)	❌	-17.10% (21.34ms -> 17.69ms) ↓	-29.15% (28.47ms -> 20.17ms) ↓
(896, 896)	✅	-2.95% (42.08ms -> 40.84ms) ↓	-6.60% (44.07ms -> 41.16ms) ↓
(1024, 1024)	✅	-4.75% (42.96ms -> 40.92ms) ↓	-6.96% (45.00ms -> 41.87ms) ↓

Benchmark results (new version: dual-path graph selection, with separate graphs for global images and local patches):

Input Size	Tiled	Mean Latency	P99 Latency
(224, 224)	❌	-13.93% (20.74ms -> 17.85ms) ↓	-15.49% (22.59ms -> 19.09ms) ↓
(448, 448)	❌	-20.96% (23.24ms -> 18.37ms) ↓	-27.10% (29.11ms -> 21.22ms) ↓
(896, 896)	✅	-4.59% (42.08ms -> 40.15ms) ↓	-8.12% (44.07ms -> 40.49ms) ↓
(1024, 1024)	✅	-6.56% (42.96ms -> 40.14ms) ↓	-10.31% (45.00ms -> 40.36ms) ↓
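The percentages in both tables appear to be the relative change from the eager latency to the CUDA graph latency. A minimal sketch that reproduces a few of the reported figures (the helper name is hypothetical):

```python
def relative_change_pct(before_ms: float, after_ms: float) -> float:
    """Relative latency change in percent; negative means the new value is faster."""
    return round((after_ms - before_ms) / before_ms * 100, 2)


# Mean latency, (224, 224), old version: 20.22ms -> 17.64ms
old_224 = relative_change_pct(20.22, 17.64)    # -12.76
# Mean latency, (448, 448), new version: 23.24ms -> 18.37ms
new_448 = relative_change_pct(23.24, 18.37)    # -20.96
# Mean latency, (1024, 1024), new version: 42.96ms -> 40.14ms
new_1024 = relative_change_pct(42.96, 40.14)   # -6.56
```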

Note

When image_width <= 640 and image_height <= 640, the multimodal inputs contain only the global image; no local patches are generated.
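As a minimal sketch of that rule (the function name and constant are hypothetical; the 640 threshold is taken from the note above), the check matches the "Tiled" column of the benchmark tables:

```python
TILE_THRESHOLD = 640  # from the note above


def has_local_patches(image_width: int, image_height: int) -> bool:
    """Return True when the image is tiled, i.e. local patches are generated."""
    return image_width > TILE_THRESHOLD or image_height > TILE_THRESHOLD


# Input sizes used in the benchmark tables.
tiled_by_size = {
    size: has_local_patches(*size)
    for size in [(224, 224), (448, 448), (896, 896), (1024, 1024)]
}
```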

mergify · 2026-05-25T09:25:15Z

Documentation preview: https://vllm--43586.org.readthedocs.build/en/43586/

gemini-code-assist

Code Review

This pull request implements the SupportsEncoderCudaGraph protocol for the DeepseekOCRForCausalLM model, enabling CUDA graph support for its vision encoder. The implementation includes methods for token calculation, input preparation, and post-processing. Critical feedback was provided regarding a potential TypeError in _get_num_input_output_tokens due to a missing null check, and incorrect configuration in EncoderCudaGraphConfig where input_key_by_modality should be used instead of input_keys. Furthermore, the images_crop tensor must be explicitly registered in buffer_keys and included in both capture and replay buffers to ensure correct graph execution. Finally, the get_max_frames_per_video method should be added to fully comply with the protocol.

mergify · 2026-05-29T03:36:36Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shen-shanshan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-06-04T02:35:58Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shen-shanshan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

shen-shanshan · 2026-06-04T03:50:14Z

CC @Isotr0py

At first, I tried to include the local_patch encoding in the ViT CUDA graph, but I found that this approach has some drawbacks:

The grid layout and the number of local_patch tiles vary with the input image, and a newline token must be appended to the end of each row in the grid, which makes the tensor shape dynamic and unpredictable. As a result, only the encoding of the raw local_patch tiles can go into the CUDA graph, and the newline tokens have to be added afterwards in postprocess_encoder_output().
To make sure images with the maximum number of local_patch tiles can be replayed correctly, we have to capture max_crops buffers for every input, even for images smaller than (640, 640), which are not tiled and have no local patches. In that case performance is worse than eager execution, because the graph replays additional, redundant dummy local_patch inputs.

Thus, I decided to put only the global_image encoding into the ViT CUDA graph, execute the local_patch encoding eagerly, and assemble the two in postprocess_encoder_output(). This is a performance tradeoff between images smaller than (640, 640) and larger images.
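The shape problem in the first drawback can be illustrated with a simplified sketch. It uses plain Python lists and a placeholder newline value instead of embedding tensors, and the helper name is hypothetical; the real postprocess_encoder_output() may lay out tokens differently. The point is only that the same number of encoded tokens yields different output lengths depending on the grid.

```python
from typing import Any, List


def append_row_newlines(
    tokens: List[Any], grid_rows: int, grid_cols: int, newline: Any
) -> List[Any]:
    """Append one newline entry after each row of a row-major token grid."""
    if len(tokens) != grid_rows * grid_cols:
        raise ValueError("token count does not match the grid shape")
    out: List[Any] = []
    for r in range(grid_rows):
        out.extend(tokens[r * grid_cols:(r + 1) * grid_cols])
        out.append(newline)
    return out


tokens = list(range(6))
wide = append_row_newlines(tokens, 2, 3, "\n")  # 2 rows -> 6 + 2 = 8 entries
tall = append_row_newlines(tokens, 3, 2, "\n")  # 3 rows -> 6 + 3 = 9 entries
```

Because the output length is grid_rows * (grid_cols + 1), it cannot be fixed at capture time, which is consistent with moving this step out of the graph.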

Future plan (in follow-up PRs): I want to explore a dual-path ViT CUDA graph budget selection mechanism for DeepSeek-OCR and Step3-VL, to decouple global_image replay from local_patch replay.

mergify · 2026-06-09T03:16:38Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shen-shanshan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Isotr0py

Overall LGTM, just left a nit.

mergify · 2026-06-12T04:51:26Z

Hi @shen-shanshan, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

mergify · 2026-06-12T05:23:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shen-shanshan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Isotr0py · 2026-06-13T07:02:28Z

Seems multimodal CI failures are related: https://buildkite.com/vllm/ci/builds/71879#019ebeee-1634-4a5e-b0a4-3280eadb8c8b

shen-shanshan · 2026-06-13T17:18:22Z

> Seems multimodal CI failures are related: https://buildkite.com/vllm/ci/builds/71879#019ebeee-1634-4a5e-b0a4-3280eadb8c8b

Sorry, there are still some issues, so please don't merge yet. I will fix them soon.

shen-shanshan marked this pull request as draft May 25, 2026 09:24

mergify Bot added the documentation, deepseek, and nvidia labels May 25, 2026

github-project-automation Bot added this to NVIDIA May 25, 2026

shen-shanshan mentioned this pull request May 25, 2026

[RFC]: Support ViT Full CUDA Graph (Tracker) #38175 (Open, 34 tasks)

gemini-code-assist Bot reviewed May 25, 2026

shen-shanshan mentioned this pull request May 27, 2026

Adjust design around encoder_cudagraph_forward #42288 (Merged, 4 tasks)

mergify Bot added the v1 label May 28, 2026

mergify Bot added the needs-rebase label May 29, 2026

shen-shanshan force-pushed the vit-cg branch from 7c8f5f1 to 0e37a63 Compare May 29, 2026 07:52

mergify Bot removed the needs-rebase label May 29, 2026

shen-shanshan force-pushed the vit-cg branch from d3769da to 611e28b Compare June 2, 2026 08:16

shen-shanshan marked this pull request as ready for review June 3, 2026 04:01

shen-shanshan requested review from AndreasKaratzas, DarkLight1337 and ywang96 as code owners June 3, 2026 04:01

mergify Bot added the multi-modality label Jun 3, 2026

mergify Bot added the needs-rebase label Jun 4, 2026

shen-shanshan force-pushed the vit-cg branch from eadcd16 to 11efbd0 Compare June 4, 2026 02:40

mergify Bot removed the needs-rebase label Jun 4, 2026

Isotr0py reviewed Jun 5, 2026

Comment thread docs/design/cuda_graphs_multimodal.md Outdated

Comment thread vllm/model_executor/models/deepseek_ocr.py Outdated

Comment thread vllm/model_executor/models/deepseek_ocr.py

mergify Bot added the needs-rebase label Jun 9, 2026

shen-shanshan force-pushed the vit-cg branch from 7fc6509 to 61d675c Compare June 9, 2026 10:33

shen-shanshan requested a review from njhill as a code owner June 9, 2026 10:33

Isotr0py self-assigned this Jun 12, 2026

Isotr0py approved these changes Jun 12, 2026

Comment thread vllm/v1/worker/encoder_cudagraph.py Outdated

github-project-automation Bot moved this to Ready in NVIDIA Jun 12, 2026

Isotr0py added the ready label Jun 12, 2026

mergify Bot added the needs-rebase label Jun 12, 2026

shen-shanshan added 11 commits June 13, 2026 02:07

support vit cg for deepseek-ocr

74769d7

Signed-off-by: shen-shanshan <467638484@qq.com>

update interface

77350d6

Signed-off-by: shen-shanshan <467638484@qq.com>

bugfix

921a1b6

Signed-off-by: shen-shanshan <467638484@qq.com>

optimize perf

fe5f108

Signed-off-by: shen-shanshan <467638484@qq.com>

add ci test

c31d6ee

Signed-off-by: shen-shanshan <467638484@qq.com>

update

9413803

Signed-off-by: shen-shanshan <467638484@qq.com>

support dual-path graph select

f145903

Signed-off-by: shen-shanshan <467638484@qq.com>

update

cb2bb7e

Signed-off-by: shen-shanshan <467638484@qq.com>

update doc

64480fd

Signed-off-by: shen-shanshan <467638484@qq.com>

update

d7269cb

Signed-off-by: shen-shanshan <467638484@qq.com>

use dict for all graphs

c249601

Signed-off-by: shen-shanshan <467638484@qq.com>

shen-shanshan force-pushed the vit-cg branch from 46a6dc9 to c249601 Compare June 13, 2026 02:37

mergify Bot removed the needs-rebase label Jun 13, 2026

add dual-path mode into doc

070c56f

Signed-off-by: shen-shanshan <467638484@qq.com>

fix ci

73c924b

Signed-off-by: shen-shanshan <467638484@qq.com>

shen-shanshan requested review from sighingnow and vadiklyutiy as code owners June 13, 2026 09:04

mergify Bot added the llama and qwen labels Jun 13, 2026

shen-shanshan added 2 commits June 13, 2026 09:12

fix ci

3b29dd4

Signed-off-by: shen-shanshan <467638484@qq.com>

use dummy weight

3915b6e

Signed-off-by: shen-shanshan <467638484@qq.com>
