[Feature][Frontend] Report multimodal token counts in usage.prompt_tokens_details #45458

Open
Sunt-ing wants to merge 5 commits into vllm-project:main from Sunt-ing:feat/27122-mm-usage-token-details

@Sunt-ing

Contributor

Purpose

usage.prompt_tokens already includes the image / audio / video placeholder tokens, but clients cannot tell how many tokens each modality contributed. This adds image_tokens, audio_tokens and video_tokens to usage.prompt_tokens_details, following sgl-project/sglang#27122. audio_tokens mirrors the OpenAI field of the same name; image_tokens and video_tokens are multimodal extensions beyond the OpenAI schema.

The per-modality counts come from the request's multimodal placeholder ranges (PlaceholderRange.length), which matches the placeholder tokens already counted in usage.prompt_tokens (not get_num_embeds(), the embedding-mask count). Both streaming and non-streaming final usage are covered and gated by --enable-prompt-tokens-details. The per-modality counts ride alongside cached_tokens, which keeps its existing behavior (0 is still reported), so a multimodal request reports both.
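
As a rough illustration of the aggregation described above, the sketch below sums placeholder-range lengths for one modality. The `PlaceholderRange` class here is a minimal stand-in with only `offset` and `length` (not vLLM's actual class), and `count_modality_tokens` and the sample offset are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class PlaceholderRange:
    """Stand-in for vLLM's PlaceholderRange; only the fields used here."""

    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens the item occupies


def count_modality_tokens(
    mm_placeholders: Mapping[str, Sequence[PlaceholderRange]],
    modality: str,
) -> Optional[int]:
    """Sum the placeholder lengths for one modality, or None if it is absent."""
    ranges = mm_placeholders.get(modality)
    if not ranges:
        return None
    return sum(r.length for r in ranges)


# Illustrative request: one image occupying 81 placeholder tokens.
# The offset value is made up; only the length feeds the count.
placeholders = {"image": [PlaceholderRange(offset=15, length=81)]}
image_tokens = count_modality_tokens(placeholders, "image")
audio_tokens = count_modality_tokens(placeholders, "audio")
print(image_tokens, audio_tokens)
```

Because only `length` is summed, a request with several items of the same modality simply adds their ranges together.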

Scope is the Chat Completions endpoint. The legacy Completions endpoint shares PromptTokenUsageInfo but is left unchanged, since it does not carry image / audio / video content.

Test Plan

Real /v1/chat/completions before/after on Qwen2.5-VL-3B-Instruct (below), plus pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py.

Test Result

Same image sent to a real server with --enable-prompt-tokens-details. prompt_tokens is 108 on both sides; the PR exposes that 81 of them are the image placeholder. cached_tokens (including 0) is unchanged, and text-only requests are unaffected.

| request | current main | this PR |
| --- | --- | --- |
| image, non-stream | {cached_tokens: 0} | {cached_tokens: 0, image_tokens: 81} |
| image, stream final usage | {cached_tokens: 96} | {cached_tokens: 96, image_tokens: 81} |
| text-only, non-stream | {cached_tokens: 0} | {cached_tokens: 0} |

One unit test in test_serving_chat.py covers the parts the E2E run cannot: per-modality multi-range summation, the feature gate, and that zero cached_tokens is still reported while multimodal counts ride alongside. ruff check and ruff format --check clean.
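
The three behaviors that unit test targets could be sketched roughly as follows. This is a hypothetical, self-contained stand-in rather than the test in test_serving_chat.py: `build_details` and its parameters are invented for illustration, and placeholder ranges are reduced to plain lengths.

```python
from typing import Dict, List, Optional


def build_details(
    enabled: bool,
    cached_tokens: int,
    range_lengths: Dict[str, List[int]],
) -> Optional[Dict[str, Optional[int]]]:
    """Hypothetical model of the gated prompt_tokens_details payload."""
    if not enabled:
        # Feature gate off: no prompt_tokens_details at all.
        return None
    # cached_tokens is always included when the gate is on, even when it is 0.
    details: Dict[str, Optional[int]] = {"cached_tokens": cached_tokens}
    for modality in ("image", "audio", "video"):
        lengths = range_lengths.get(modality)
        details[f"{modality}_tokens"] = sum(lengths) if lengths else None
    return details


def test_multi_range_summation() -> None:
    details = build_details(True, 0, {"image": [81, 81]})
    assert details is not None
    assert details["image_tokens"] == 162


def test_feature_gate() -> None:
    assert build_details(False, 96, {"image": [81]}) is None


def test_zero_cached_tokens_is_reported() -> None:
    details = build_details(True, 0, {"image": [81]})
    assert details is not None
    assert details["cached_tokens"] == 0
    assert details["image_tokens"] == 81
```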

repro (serve command + client + full output)
# A800-80GB. VLLM_USE_FLASHINFER_SAMPLER=0 + FLASH_ATTN are A800 JIT
# workarounds, unrelated to this feature.
VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ATTENTION_BACKEND=FLASH_ATTN \
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --served-model-name qwen-vl \
  --enable-prompt-tokens-details --max-model-len 4096 \
  --gpu-memory-utilization 0.80 --enforce-eager

Client (image + text; non-stream and streaming):

import base64, io, requests
from PIL import Image

buf = io.BytesIO()
Image.new("RGB", (256, 256), (123, 200, 50)).save(buf, "PNG")
url = "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()
messages = [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": url}},
    {"type": "text", "text": "What color dominates this image?"}]}]

# non-stream
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json={
    "model": "qwen-vl", "messages": messages, "max_tokens": 8, "temperature": 0})
print(r.json()["usage"])

# stream
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json={
    "model": "qwen-vl", "messages": messages, "max_tokens": 8, "temperature": 0,
    "stream": True, "stream_options": {"include_usage": True}}, stream=True)
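# Accumulate the final chunk's "usage". One possible way, assuming the usual
# SSE framing ("data: {...}" lines terminated by "data: [DONE]"):
import json  # better placed with the imports at the top of the script

usage = None
for line in r.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip the blank separator lines between events
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    usage = json.loads(payload).get("usage") or usage
print(usage)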

Output, current main vs this PR:

# current main, non-stream: no per-modality breakdown
{"prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
 "prompt_tokens_details": {"cached_tokens": 0}}

# this PR, non-stream
{"prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
 "prompt_tokens_details": {"cached_tokens": 0, "image_tokens": 81,
                           "audio_tokens": null, "video_tokens": null}}

# this PR, stream final usage (cached hit + image both reported)
{"prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
 "prompt_tokens_details": {"cached_tokens": 96, "image_tokens": 81}}

# this PR, text-only: cached_tokens unchanged, per-modality counts all null
{"prompt_tokens": 25, "total_tokens": 28, "completion_tokens": 3,
 "prompt_tokens_details": {"cached_tokens": 0, "image_tokens": null,
                           "audio_tokens": null, "video_tokens": null}}
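
On the client side, the payloads above can be read defensively, since the per-modality fields may be null or missing entirely (as on current main). A minimal sketch, assuming the field names shown in these outputs; `modality_breakdown` is a hypothetical helper, not part of any client library.

```python
from typing import Any, Dict

MODALITY_FIELDS = ("image_tokens", "audio_tokens", "video_tokens")


def modality_breakdown(usage: Dict[str, Any]) -> Dict[str, int]:
    """Return only the per-modality counts that are present and non-null."""
    details = usage.get("prompt_tokens_details") or {}
    return {
        field: details[field]
        for field in MODALITY_FIELDS
        if details.get(field) is not None
    }


# Payloads copied from the outputs above (JSON null becomes Python None).
with_image = {
    "prompt_tokens": 108, "total_tokens": 116, "completion_tokens": 8,
    "prompt_tokens_details": {"cached_tokens": 0, "image_tokens": 81,
                              "audio_tokens": None, "video_tokens": None},
}
text_only = {
    "prompt_tokens": 25, "total_tokens": 28, "completion_tokens": 3,
    "prompt_tokens_details": {"cached_tokens": 0, "image_tokens": None,
                              "audio_tokens": None, "video_tokens": None},
}

counts = modality_breakdown(with_image)
# Prompt tokens that are not multimodal placeholders: 108 - 81.
remainder = with_image["prompt_tokens"] - sum(counts.values())
print(counts, remainder)
print(modality_breakdown(text_only))
```

Filtering on `is not None` rather than truthiness keeps a legitimate count of 0 distinct from an unreported field.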

AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.

cc @DarkLight1337

…kens_details

Add image_tokens, audio_tokens and video_tokens to PromptTokenUsageInfo so
OpenAI chat completion clients can see how many prompt tokens each modality
contributed. Counts are aggregated from the request's multimodal placeholder
ranges (PlaceholderRange.length), which matches the placeholder tokens already
included in usage.prompt_tokens.

Both streaming and non-streaming final usage are covered, gated by
--enable-prompt-tokens-details. The existing cached_tokens reporting (including
zero) is preserved; multimodal counts are added alongside it.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Comment thread vllm/entrypoints/openai/engine/protocol.py Outdated
Address review: instead of hardcoded image_tokens/audio_tokens/video_tokens
fields, report a multimodal_tokens dict keyed by modality name so new
modalities are surfaced without protocol changes. The helper now iterates the
request's multimodal placeholders directly with no hardcoded modality list.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
        for modality, ranges in mm_placeholders.items()
        if ranges
    }
    return counts or None

Member

I think we can just return the dictionary directly

Contributor Author

Done. _get_mm_token_counts now returns the dict directly (empty when there are no placeholders); _make_prompt_tokens_details already treats an empty map as nothing to report.

Address review: drop the None-coalescing in _get_mm_token_counts and return
the per-modality dict directly (empty when there are no placeholders).
_make_prompt_tokens_details already treats an empty map as "nothing to report".

Signed-off-by: Ting Sun <suntcrick@gmail.com>
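
A minimal sketch of the behavior this commit message describes. The function names echo `_get_mm_token_counts` and `_make_prompt_tokens_details` from the thread, but the signatures, the `Range` stand-in, and the plain-dict return value are assumptions made for illustration; the PR's actual code is not reproduced here.

```python
from typing import Any, Dict, List, NamedTuple


class Range(NamedTuple):
    """Stand-in for a multimodal placeholder range (hypothetical)."""

    offset: int
    length: int


def get_mm_token_counts(mm_placeholders: Dict[str, List[Range]]) -> Dict[str, int]:
    """Per-modality placeholder totals; an empty dict when there are none."""
    return {
        modality: sum(r.length for r in ranges)
        for modality, ranges in mm_placeholders.items()
        if ranges
    }


def make_prompt_tokens_details(
    cached_tokens: int, mm_counts: Dict[str, int]
) -> Dict[str, Any]:
    """Treat an empty map as nothing to report (multimodal_tokens is None)."""
    return {
        "cached_tokens": cached_tokens,
        "multimodal_tokens": mm_counts or None,
    }


# No hardcoded modality list: whatever keys the request carries are surfaced.
mm = {"image": [Range(15, 81)], "audio": []}
print(make_prompt_tokens_details(0, get_mm_token_counts(mm)))
print(make_prompt_tokens_details(0, get_mm_token_counts({})))
```

Keeping the `or None` coalescing in one place, at the point where the response is built, lets the counting helper stay a plain dict-returning function.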
Comment thread vllm/entrypoints/openai/engine/protocol.py
Address review: add a docstring to PromptTokenUsageInfo.multimodal_tokens
explaining it is a per-modality breakdown (keyed by modality name) of the
multimodal placeholder tokens already counted in prompt_tokens, and None
when the request has no multimodal input.

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Comment thread vllm/entrypoints/openai/engine/protocol.py Outdated
…string

The docs render with MkDocs (Markdown), so inline code spans use single
backticks rather than RST double backticks.

Signed-off-by: Ting Sun <suntcrick@gmail.com>

@DarkLight1337 left a comment

Member

LGTM now, thanks for your patience!

@DarkLight1337 enabled auto-merge (squash) June 13, 2026 16:20
@github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Jun 13, 2026
@Sunt-ing

Contributor Author

Thanks for your careful guidance. Learned a lot.

Labels

frontend, ready (ONLY add when PR is ready to merge/full CI is needed)
