[Feature][Frontend] Report multimodal token counts in usage.prompt_tokens_details by Sunt-ing · Pull Request #45458 · vllm-project/vllm

Sunt-ing · 2026-06-12T21:45:19Z

Purpose

usage.prompt_tokens already includes the image / audio / video placeholder tokens, but clients cannot tell how many tokens each modality contributed. This adds image_tokens, audio_tokens and video_tokens to usage.prompt_tokens_details, following sgl-project/sglang#27122. audio_tokens mirrors the OpenAI field of the same name; image_tokens and video_tokens are multimodal extensions beyond the OpenAI schema.

The per-modality counts come from the request's multimodal placeholder ranges (PlaceholderRange.length), which matches the placeholder tokens already counted in usage.prompt_tokens (not get_num_embeds(), the embedding-mask count). Both streaming and non-streaming final usage are covered and gated by --enable-prompt-tokens-details. The per-modality counts ride alongside cached_tokens, which keeps its existing behavior (0 is still reported), so a multimodal request reports both.
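
The counting rule above can be sketched as follows. This is a minimal illustration, not the PR's code: `Range` is a stand-in that models only the `offset` and `length` of a placeholder range, and `count_mm_tokens` is a hypothetical helper name. The offsets and lengths are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Stand-in for a placeholder range: start position and token span."""
    offset: int
    length: int


def count_mm_tokens(mm_placeholders: dict[str, list[Range]]) -> dict[str, int]:
    """Sum placeholder lengths per modality, skipping modalities with no ranges."""
    return {
        modality: sum(r.length for r in ranges)
        for modality, ranges in mm_placeholders.items()
        if ranges
    }


# Two images and one audio clip in a single prompt (illustrative values only).
placeholders = {
    "image": [Range(offset=5, length=81), Range(offset=90, length=64)],
    "audio": [Range(offset=160, length=30)],
    "video": [],
}
counts = count_mm_tokens(placeholders)
print(counts)  # {'image': 145, 'audio': 30}
```

Summing `length` over every range is what makes a request with several images report one total per modality rather than one entry per item.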

Scope is the Chat Completions endpoint. The legacy Completions endpoint shares PromptTokenUsageInfo but is left unchanged, since it does not carry image / audio / video content.

Test Plan

A real /v1/chat/completions before/after comparison on Qwen2.5-VL-3B-Instruct (below), plus pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py.

Test Result

The same image was sent to a real server running with --enable-prompt-tokens-details. prompt_tokens is 108 on both sides; the PR exposes that 81 of them are image placeholder tokens. cached_tokens (including 0) is unchanged, and text-only requests are unaffected.

| request | current main | this PR |
| --- | --- | --- |
| image, non-stream | `{cached_tokens: 0}` | `{cached_tokens: 0, image_tokens: 81}` |
| image, stream final usage | `{cached_tokens: 96}` | `{cached_tokens: 96, image_tokens: 81}` |
| text-only, non-stream | `{cached_tokens: 0}` | `{cached_tokens: 0}` |

One unit test in test_serving_chat.py covers the parts the E2E run cannot: per-modality multi-range summation, the feature gate, and that zero cached_tokens is still reported while multimodal counts ride alongside. ruff check and ruff format --check are clean.

repro (serve command + client + full output)

# A800-80GB. VLLM_USE_FLASHINFER_SAMPLER=0 + FLASH_ATTN are A800 JIT
# workarounds, unrelated to this feature.
VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ATTENTION_BACKEND=FLASH_ATTN \
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --served-model-name qwen-vl \
  --enable-prompt-tokens-details --max-model-len 4096 \
  --gpu-memory-utilization 0.80 --enforce-eager

Client (image + text; non-stream and streaming):

import base64, io, json, requests
from PIL import Image

# Build a solid-color 256x256 PNG and embed it as a base64 data URL.
buf = io.BytesIO()
Image.new("RGB", (256, 256), (123, 200, 50)).save(buf, "PNG")
url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
messages = [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": url}},
    {"type": "text", "text": "What color dominates this image?"}]}]

# non-stream
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json={
    "model": "qwen-vl", "messages": messages, "max_tokens": 8, "temperature": 0})
print(r.json()["usage"])

# stream
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json={
    "model": "qwen-vl", "messages": messages, "max_tokens": 8, "temperature": 0,
    "stream": True, "stream_options": {"include_usage": True}}, stream=True)
# Accumulate the final chunk's "usage": keep the last non-null value
# seen in the server-sent "data: ..." lines.
usage = None
for line in r.iter_lines():
    if line.startswith(b"data: ") and line != b"data: [DONE]":
        usage = json.loads(line[len(b"data: "):]).get("usage") or usage
print(usage)

Output, current main vs this PR:

# current main, non-stream: no per-modality breakdown
{"prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
 "prompt_tokens_details": {"cached_tokens": 0}}

# this PR, non-stream
{"prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
 "prompt_tokens_details": {"cached_tokens": 0, "image_tokens": 81,
                           "audio_tokens": null, "video_tokens": null}}

# this PR, stream final usage (cached hit + image both reported)
{"prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
 "prompt_tokens_details": {"cached_tokens": 96, "image_tokens": 81}}

# this PR, text-only: cached_tokens unchanged, per-modality counts all null
{"prompt_tokens": 25, "total_tokens": 28, "completion_tokens": 3,
 "prompt_tokens_details": {"cached_tokens": 0, "image_tokens": null,
                           "audio_tokens": null, "video_tokens": null}}
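
On the client side, the breakdown can be turned into a split between multimodal and other prompt tokens. A minimal sketch, assuming the field names shown in the output above (`image_tokens`, `audio_tokens`, `video_tokens`) and treating `null` as zero; `split_prompt_tokens` is a hypothetical helper, and the `"other"` bucket is simply whatever is not a multimodal placeholder (presumably text plus chat-template tokens).

```python
def split_prompt_tokens(usage: dict) -> dict:
    """Split prompt_tokens into per-modality counts plus the remainder."""
    details = usage.get("prompt_tokens_details") or {}
    per_modality = {
        name: details.get(f"{name}_tokens") or 0  # null or missing -> 0
        for name in ("image", "audio", "video")
    }
    other = usage["prompt_tokens"] - sum(per_modality.values())
    return {"other": other, **per_modality}


# The non-stream usage payload reported for this PR.
usage = {
    "prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
    "prompt_tokens_details": {"cached_tokens": 0, "image_tokens": 81,
                              "audio_tokens": None, "video_tokens": None},
}
split = split_prompt_tokens(usage)
print(split)  # {'other': 27, 'image': 81, 'audio': 0, 'video': 0}
```

A reader for the later `multimodal_tokens` dict form would iterate over that dict's keys instead of a fixed list of modality names.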

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

cc @DarkLight1337

…kens_details

Add image_tokens, audio_tokens and video_tokens to PromptTokenUsageInfo so OpenAI chat completion clients can see how many prompt tokens each modality contributed. Counts are aggregated from the request's multimodal placeholder ranges (PlaceholderRange.length), which matches the placeholder tokens already included in usage.prompt_tokens. Both streaming and non-streaming final usage are covered, gated by --enable-prompt-tokens-details. The existing cached_tokens reporting (including zero) is preserved; multimodal counts are added alongside it.

Signed-off-by: Ting Sun <suntcrick@gmail.com>

Address review: instead of hardcoded image_tokens/audio_tokens/video_tokens fields, report a multimodal_tokens dict keyed by modality name so new modalities are surfaced without protocol changes. The helper now iterates the request's multimodal placeholders directly with no hardcoded modality list.

Signed-off-by: Ting Sun <suntcrick@gmail.com>

DarkLight1337 · 2026-06-13T10:09:10Z

+        for modality, ranges in mm_placeholders.items()
+        if ranges
+    }
+    return counts or None


I think we can just return the dictionary directly

Done. _get_mm_token_counts now returns the dict directly (empty when there are no placeholders); _make_prompt_tokens_details already treats an empty map as nothing to report.

Address review: drop the None-coalescing in _get_mm_token_counts and return the per-modality dict directly (empty when there are no placeholders). _make_prompt_tokens_details already treats an empty map as "nothing to report".

Signed-off-by: Ting Sun <suntcrick@gmail.com>

Address review: add a docstring to PromptTokenUsageInfo.multimodal_tokens explaining it is a per-modality breakdown (keyed by modality name) of the multimodal placeholder tokens already counted in prompt_tokens, and None when the request has no multimodal input.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
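
The shape that results from the review rounds above might be assembled roughly like this. It is a sketch under simplifying assumptions, not the merged code: `PromptTokenUsageInfo` here is a plain dataclass standing in for the protocol model, and `make_prompt_tokens_details` with its `enabled` parameter is a hypothetical simplification of the `--enable-prompt-tokens-details` gate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PromptTokenUsageInfo:
    """Simplified stand-in for the usage-details model."""
    cached_tokens: int | None = None
    # Per-modality breakdown (keyed by modality name) of placeholder tokens
    # already counted in prompt_tokens; None when there is no multimodal input.
    multimodal_tokens: dict[str, int] | None = None


def make_prompt_tokens_details(
    enabled: bool,
    num_cached_tokens: int | None,
    mm_token_counts: dict[str, int],
) -> PromptTokenUsageInfo | None:
    """Build the details object, or None when the feature gate is off."""
    if not enabled:
        return None
    return PromptTokenUsageInfo(
        cached_tokens=num_cached_tokens,  # a zero value is kept as-is
        # An empty map means "nothing to report", so it collapses to None.
        multimodal_tokens=mm_token_counts or None,
    )


with_image = make_prompt_tokens_details(True, 0, {"image": 81})
text_only = make_prompt_tokens_details(True, 0, {})
gated_off = make_prompt_tokens_details(False, 0, {"image": 81})
print(with_image, text_only, gated_off, sep="\n")
```

Letting the counting helper return an empty dict and collapsing it to `None` in one place keeps the "no multimodal input" decision out of the helper, which is the simplification the reviewer asked for.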

…string

The docs render with MkDocs (Markdown), so inline code spans use single backticks rather than RST double backticks.

Signed-off-by: Ting Sun <suntcrick@gmail.com>

DarkLight1337

LGTM now, thanks for your patience!

Sunt-ing · 2026-06-13T16:21:09Z

Thanks for your careful guidance. Learned a lot.

Sunt-ing requested review from AndreasKaratzas, DarkLight1337, NickLucche, aarnphm, chaunceyjiang, robertgshaw2-redhat and russellb as code owners June 12, 2026 21:45

mergify Bot added the frontend label Jun 12, 2026

DarkLight1337 reviewed Jun 13, 2026

Comment thread vllm/entrypoints/openai/engine/protocol.py Outdated

DarkLight1337 reviewed Jun 13, 2026

Comment thread vllm/entrypoints/openai/engine/protocol.py

DarkLight1337 reviewed Jun 13, 2026

Comment thread vllm/entrypoints/openai/engine/protocol.py Outdated

[Feature][Frontend] Use single backticks in the multimodal_tokens doc…

0e02106

DarkLight1337 approved these changes Jun 13, 2026

DarkLight1337 enabled auto-merge (squash) June 13, 2026 16:20

github-actions Bot added the ready label Jun 13, 2026

Sunt-ing wants to merge 5 commits into vllm-project:main from Sunt-ing:feat/27122-mm-usage-token-details
