[Feature][Frontend] Report multimodal token counts in usage.prompt_tokens_details#45458
Open
Sunt-ing wants to merge 5 commits into
…kens_details Add image_tokens, audio_tokens and video_tokens to PromptTokenUsageInfo so OpenAI chat completion clients can see how many prompt tokens each modality contributed. Counts are aggregated from the request's multimodal placeholder ranges (PlaceholderRange.length), which matches the placeholder tokens already included in usage.prompt_tokens. Both streaming and non-streaming final usage are covered, gated by --enable-prompt-tokens-details. The existing cached_tokens reporting (including zero) is preserved; multimodal counts are added alongside it. Signed-off-by: Ting Sun <suntcrick@gmail.com>
Address review: instead of hardcoded image_tokens/audio_tokens/video_tokens fields, report a multimodal_tokens dict keyed by modality name so new modalities are surfaced without protocol changes. The helper now iterates the request's multimodal placeholders directly with no hardcoded modality list. Signed-off-by: Ting Sun <suntcrick@gmail.com>
    for modality, ranges in mm_placeholders.items()
    if ranges
}
return counts or None
Member
I think we can just return the dictionary directly
Contributor
Author
Done. _get_mm_token_counts now returns the dict directly (empty when there are no placeholders); _make_prompt_tokens_details already treats an empty map as nothing to report.
Address review: drop the None-coalescing in _get_mm_token_counts and return the per-modality dict directly (empty when there are no placeholders). _make_prompt_tokens_details already treats an empty map as "nothing to report". Signed-off-by: Ting Sun <suntcrick@gmail.com>
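For readers following the review thread, here is a minimal sketch of the helper behavior the commits above describe: sum the placeholder lengths per modality and return the dict directly, empty when there are no placeholders. It is not the PR's actual code. The `PlaceholderRange` class is a simplified stand-in that carries only the fields used here, and `get_mm_token_counts` is a hypothetical name for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class PlaceholderRange:
    """Simplified stand-in (assumption): only the fields this sketch needs."""

    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens in this range


def get_mm_token_counts(
    mm_placeholders: Mapping[str, Sequence[PlaceholderRange]],
) -> dict[str, int]:
    """Sum placeholder lengths per modality.

    Modalities with no ranges are skipped, so the result is an empty
    dict when the request has no multimodal input.
    """
    return {
        modality: sum(r.length for r in ranges)
        for modality, ranges in mm_placeholders.items()
        if ranges
    }


# Two image ranges are summed; the empty audio entry is dropped.
counts = get_mm_token_counts(
    {
        "image": [PlaceholderRange(offset=5, length=40),
                  PlaceholderRange(offset=60, length=41)],
        "audio": [],
    }
)
print(counts)
```

Because nothing in the loop names a modality, a new modality key would appear in the result without any change to the helper, which is the point of the dict-based design requested in review.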
Address review: add a docstring to PromptTokenUsageInfo.multimodal_tokens explaining it is a per-modality breakdown (keyed by modality name) of the multimodal placeholder tokens already counted in prompt_tokens, and None when the request has no multimodal input. Signed-off-by: Ting Sun <suntcrick@gmail.com>
…string The docs render with MkDocs (Markdown), so inline code spans use single backticks rather than RST double backticks. Signed-off-by: Ting Sun <suntcrick@gmail.com>
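The commits above say that the details object is gated by --enable-prompt-tokens-details, that a zero cached_tokens is still reported, and that multimodal_tokens is None when the request has no multimodal input. The sketch below illustrates those three rules under stated assumptions: the dataclass is a stand-in for the real PromptTokenUsageInfo model, `make_prompt_tokens_details` and its parameters are hypothetical, and the choice to return None when nothing at all is known is an assumption of this sketch rather than something the PR states.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class PromptTokenUsageInfo:
    """Stand-in (assumption) for the protocol model of the same name."""

    cached_tokens: Optional[int] = None
    # Per-modality breakdown of placeholder tokens already counted in
    # prompt_tokens; None when the request has no multimodal input.
    multimodal_tokens: Optional[dict[str, int]] = None


def make_prompt_tokens_details(
    enabled: bool,
    num_cached_tokens: Optional[int],
    mm_token_counts: Mapping[str, int],
) -> Optional[PromptTokenUsageInfo]:
    """Build the details object, or None when there is nothing to report."""
    if not enabled:
        # Feature gate is off: report no details at all.
        return None
    if num_cached_tokens is None and not mm_token_counts:
        # Assumption of this sketch: nothing known, so nothing reported.
        return None
    return PromptTokenUsageInfo(
        # Compare against None elsewhere, not truthiness, so 0 survives.
        cached_tokens=num_cached_tokens,
        # An empty map is treated as "nothing to report".
        multimodal_tokens=dict(mm_token_counts) if mm_token_counts else None,
    )


details = make_prompt_tokens_details(True, 0, {"image": 81})
print(details)
```

Testing `num_cached_tokens is None` rather than its truthiness is what keeps a zero cache hit visible alongside the multimodal counts.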
DarkLight1337
approved these changes
Jun 13, 2026
DarkLight1337
left a comment
Member
LGTM now, thanks for your patience!
Contributor
Author
Thanks for your careful guidance. Learned a lot.
Purpose
usage.prompt_tokens already includes the image / audio / video placeholder tokens, but clients cannot tell how many tokens each modality contributed. This adds image_tokens, audio_tokens and video_tokens to usage.prompt_tokens_details, following sgl-project/sglang#27122. audio_tokens mirrors the OpenAI field of the same name; image_tokens and video_tokens are multimodal extensions beyond the OpenAI schema.

The per-modality counts come from the request's multimodal placeholder ranges (PlaceholderRange.length), which matches the placeholder tokens already counted in usage.prompt_tokens (not get_num_embeds(), the embedding-mask count). Both streaming and non-streaming final usage are covered and gated by --enable-prompt-tokens-details. The per-modality counts ride alongside cached_tokens, which keeps its existing behavior (0 is still reported), so a multimodal request reports both.

Scope is the Chat Completions endpoint. The legacy Completions endpoint shares PromptTokenUsageInfo but is left unchanged, since it does not carry image / audio / video content.

Test Plan
Real /v1/chat/completions before/after on Qwen2.5-VL-3B-Instruct (below), plus pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py.

Test Result
Same image sent to a real server with --enable-prompt-tokens-details. prompt_tokens is 108 on both sides; the PR exposes that 81 of them are the image placeholder. cached_tokens (including 0) is unchanged, and text-only requests are unaffected.

| prompt_tokens_details on current main | prompt_tokens_details with this PR |
| --- | --- |
| {cached_tokens: 0} | {cached_tokens: 0, image_tokens: 81} |
| {cached_tokens: 96} | {cached_tokens: 96, image_tokens: 81} |
| {cached_tokens: 0} | {cached_tokens: 0} |

One unit test in test_serving_chat.py covers the parts the E2E run cannot: per-modality multi-range summation, the feature gate, and that zero cached_tokens is still reported while multimodal counts ride alongside. ruff check and ruff format --check clean.

repro (serve command + client + full output)
Client (image + text; non-stream and streaming):
Output, current main vs this PR:
AI assistance was used to investigate, reproduce, and draft this change; the author reviewed the diff and validation output.
cc @DarkLight1337