[Bugfix][Rollout] Record prefix_cache_hit_rate from vLLM usage by aoshen02 · Pull Request #241 · vllm-project/vime

aoshen02 · 2026-06-12T00:09:28Z

Problem

rollout/prefix_cache_hit_rate always logs 0 on the vLLM path (slime/sglang reports it fine). The metric pipeline (Sample.PrefixCacheInfo.add in vime/utils/types.py, _compute_prefix_cache_metrics in vime/ray/rollout.py) is byte-identical to slime — the sglang→vllm divergence broke the wiring on two ends:

Parser — vllm_rollout.py / vllm_streaming_rollout.py built meta from usage.prompt_tokens + usage.completion_tokens but never read the cached count, so PrefixCacheInfo.add saw cached_tokens=0 every time. The hit-rate numerator was structurally pinned to 0 even though the denominator (prompt_tokens) was populated. vLLM reports the count nested as usage.prompt_tokens_details.cached_tokens; surface it.
Server — vLLM only emits prompt_tokens_details when the OpenAI frontend is started with --enable-prompt-tokens-details (default off). This flag lives on FrontendArgs, not AsyncEngineArgs, so it is not reachable via the --vllm-* auto-forwarder and must be passed explicitly when assembling the vllm serve command.
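The server-side half of the fix can be sketched as follows. This is a minimal illustration, not the code in vllm_engine.py: the helper name `build_vllm_serve_cmd` and its parameters are hypothetical, and only the `vllm serve` entry point and the `--enable-prompt-tokens-details` flag come from the description above.

```python
from typing import List

# Flag named in the PR description; it belongs to the frontend, not the engine args.
PROMPT_DETAILS_FLAG = "--enable-prompt-tokens-details"


def build_vllm_serve_cmd(model: str, port: int, forwarded_args: List[str]) -> List[str]:
    """Assemble a `vllm serve` command line (hypothetical helper).

    `forwarded_args` stands in for whatever an auto-forwarder collects.
    The frontend flag is appended explicitly, and only once.
    """
    cmd = ["vllm", "serve", model, "--port", str(port)]
    cmd.extend(forwarded_args)
    if PROMPT_DETAILS_FLAG not in cmd:
        cmd.append(PROMPT_DETAILS_FLAG)
    return cmd
```

Appending the flag unconditionally matches the "always-on" choice described under Changes; the duplicate check is an extra safeguard assumed here for the case where a caller already passes it.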

vLLM source confirmation

/inference/v1/generate → ServingTokens.serve_tokens sets usage.prompt_tokens_details.cached_tokens = final_res.num_cached_tokens only if enable_prompt_tokens_details (which is wired from args.enable_prompt_tokens_details, CLI default False).
Prefix caching is ON by default (enable_prefix_caching=True), so the count is meaningful for grouped RL rollouts (shared prompt across n samples).

History

This is a re-application of #83 (b502fed), which was reverted by #123 (92c3014) only to "split the change out cleanly" during the slime→vime sync, then never re-landed. Adapted to current main (inlined meta build; old _vllm_meta_from_generate_choice is gone) and extended to the streaming rollout path, which postdates #83 and had the identical gap.

This is the legitimate sglang→vllm translation point; the metric/types layer stays byte-for-byte slime.

Changes (16 insertions, 3 files)

vime/backends/vllm_utils/vllm_engine.py — add --enable-prompt-tokens-details to the launch command (always-on, mirroring slime's always-available cached_tokens; negligible cost).
vime/rollout/vllm_rollout.py — meta["cached_tokens"] = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0).
vime/rollout/vllm_streaming_rollout.py — same extraction (parity).
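To show why the extraction uses `or {}` rather than a `.get` default, here is a small sketch. The function names are hypothetical, and the guards for a missing `usage` and a null `cached_tokens` are assumptions added for illustration; the PR's own expression is the one quoted above.

```python
from typing import Any, Dict, Optional


def cached_tokens_from_usage(usage: Optional[Dict[str, Any]]) -> int:
    """Return the prefix-cache hit count from a usage dict, or 0 if absent."""
    # `or {}` covers both a missing key and an explicit null value.
    details = (usage or {}).get("prompt_tokens_details") or {}
    # A null cached_tokens is also treated as 0 (extra guard, not in the PR).
    return details.get("cached_tokens") or 0


def naive_cached_tokens(usage: Dict[str, Any]) -> int:
    """Counter-example: the {} default is ignored when the key holds None."""
    return usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
```

With `{"prompt_tokens_details": None}`, the naive version raises AttributeError because the default only applies when the key is missing, while the `or {}` form falls back to 0.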

Cosmetic metric only — does not affect training or the train_rollout_logprob_abs_diff consistency metric.

Test

py_compile on all three files: OK.
Standalone logic check: usage with prompt_tokens_details.cached_tokens → hit_rate=0.500 (cached 150 / prompt 300); missing key and explicit null both → cached=0, no crash.
tests/test_sample.py exercises update_from_meta_info with cached_tokens already in meta (accumulation logic, unchanged) — not affected.
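The standalone logic check can be approximated with the sketch below. `PrefixCacheTally` is a hypothetical stand-in for the accumulation done by Sample.PrefixCacheInfo.add, not the real class, and the per-sample numbers are illustrative values chosen to sum to the 150 / 300 case mentioned above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class PrefixCacheTally:
    """Hypothetical accumulator: sums cached and prompt tokens across samples."""

    cached_tokens: int = 0
    prompt_tokens: int = 0

    def add(self, meta: Dict[str, Any]) -> None:
        self.cached_tokens += meta.get("cached_tokens", 0)
        self.prompt_tokens += meta.get("prompt_tokens", 0)

    @property
    def hit_rate(self) -> float:
        # Guard the empty case so an unused tally reports 0.0 instead of dividing by zero.
        if self.prompt_tokens == 0:
            return 0.0
        return self.cached_tokens / self.prompt_tokens


# Three samples sharing one 100-token prompt: the first misses, the others hit 75 tokens.
tally = PrefixCacheTally()
for cached in (0, 75, 75):
    tally.add({"prompt_tokens": 100, "cached_tokens": cached})

# Before the fix, cached_tokens was never put in meta, so the numerator stayed 0.
broken = PrefixCacheTally()
for _ in range(3):
    broken.add({"prompt_tokens": 100})
```

Here `tally.hit_rate` is 150 / 300, while `broken.hit_rate` stays at 0 even though its denominator is populated, which is the symptom described under Problem.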

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request refactors and cleans up arguments in arguments.py, adds validation checks for prefill_num_servers and vllm_config, enables prefix-cache accounting by passing --enable-prompt-tokens-details to the vLLM command, extracts cached_tokens in rollout generation, and removes the monkey-patching of dist.gather_object. Feedback is provided to replace the assert statements in validate_args with standard if statements raising ValueError to ensure robust validation even when Python is run with optimization flags.

gemini-code-assist · 2026-06-12T00:10:20Z

+    assert not (
+        getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False)
+    ), "prefill_num_servers cannot be set with --rollout-external-engine-addrs."
+
+    assert not (
+        getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False)
+    ), "vllm_config cannot be set with --rollout-external-engine-addrs."
+
+    assert not (
+        getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None
+    ), "vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead."


Using assert statements for validating command-line arguments or configuration is an anti-pattern in Python. Assertions can be globally disabled in production environments when Python is run with optimization flags (e.g., python -O or PYTHONOPTIMIZE=1), which would silently bypass these validation checks.

To ensure robust validation, these checks should be implemented using standard if statements that raise a ValueError.

Suggested change

-    assert not (
-        getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False)
-    ), "prefill_num_servers cannot be set with --rollout-external-engine-addrs."
-
-    assert not (
-        getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False)
-    ), "vllm_config cannot be set with --rollout-external-engine-addrs."
-
-    assert not (
-        getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None
-    ), "vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead."
+    if getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False):
+        raise ValueError("prefill_num_servers cannot be set with --rollout-external-engine-addrs.")
+
+    if getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False):
+        raise ValueError("vllm_config cannot be set with --rollout-external-engine-addrs.")
+
+    if getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None:
+        raise ValueError("vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead.")
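The reviewer's point about optimization flags can be reproduced in-process. The sketch below uses the built-in `compile()` with its `optimize` argument, where level 1 corresponds to `python -O`; the function names and the simplified two-argument signature are hypothetical and stand in for the real validate_args.

```python
SOURCE = '''
def validate_with_assert(prefill_num_servers, rollout_external):
    assert not (prefill_num_servers is not None and rollout_external), "conflict"
    return "accepted"
'''


def load_validator(optimize: int):
    """Compile SOURCE at the given optimization level and return the function."""
    namespace = {}
    exec(compile(SOURCE, "<validate>", "exec", optimize=optimize), namespace)
    return namespace["validate_with_assert"]


def validate_with_raise(prefill_num_servers, rollout_external):
    """Same check as an explicit raise, which no optimization level removes."""
    if prefill_num_servers is not None and rollout_external:
        raise ValueError("prefill_num_servers cannot be set with --rollout-external-engine-addrs.")
    return "accepted"
```

`load_validator(0)(2, True)` raises AssertionError, while `load_validator(1)(2, True)` returns "accepted" because the assert was compiled away. `validate_with_raise(2, True)` raises ValueError in both cases.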

read-the-docs-community · 2026-06-12T00:13:19Z

Documentation build overview

📚 vime | 🛠️ Build #33103784 | 📁 Comparing 18c103b against latest (fa0b6e9)

🔍 Preview build

26 files changed · ± 26 modified


rollout/prefix_cache_hit_rate always reported 0 on the vLLM path due to two compounding gaps (re-applies #83, reverted by #123 only to "split out cleanly"; never re-landed):

1. Parser: vllm_rollout.py / vllm_streaming_rollout.py built `meta` from usage.prompt_tokens + usage.completion_tokens but never read the cached count, so Sample.PrefixCacheInfo.add saw cached_tokens=0 every time and the hit-rate numerator was structurally pinned to 0. vLLM reports the count nested as usage.prompt_tokens_details.cached_tokens; surface it.
2. Server: vLLM only emits prompt_tokens_details when the OpenAI frontend is started with --enable-prompt-tokens-details (default off). This flag lives on FrontendArgs, not AsyncEngineArgs, so it is not reachable via the --vllm-* auto-forwarder and must be passed explicitly when assembling the `vllm serve` command.

Streaming path: besides the same extraction, request the terminal usage chunk via stream_options.include_usage. vLLM gates that chunk on should_include_usage(stream_options); without it the streaming loop never observes `usage`, so prompt_tokens AND cached_tokens went unrecorded (the loop already handles the choices=[] usage-only chunk; the request just never asked for it). Verified against the deployed vllm v0.22.0 tag.

Cosmetic metric only; does not affect training.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
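The streaming behaviour described in the commit message can be sketched against simulated chunks. Only `stream` and `stream_options.include_usage` are taken from the text above; the function names, the other payload fields, and the chunk shapes are assumptions for illustration and may not match the actual endpoint schema.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_stream_payload(prompt_token_ids: List[int], max_tokens: int) -> Dict[str, Any]:
    """Build a streaming request body (field names other than stream_options are assumed)."""
    return {
        "prompt": prompt_token_ids,
        "max_tokens": max_tokens,
        "stream": True,
        # Without this, the server never sends the terminal usage-only chunk.
        "stream_options": {"include_usage": True},
    }


def consume_stream(chunks: Iterable[Dict[str, Any]]) -> Tuple[str, Dict[str, int]]:
    """Collect generated text and usage counters from already-parsed chunks."""
    parts: List[str] = []
    meta = {"prompt_tokens": 0, "completion_tokens": 0, "cached_tokens": 0}
    for chunk in chunks:
        # The usage-only chunk arrives with choices == [], so this loop is skipped for it.
        for choice in chunk.get("choices") or []:
            parts.append(choice.get("text", ""))
        usage = chunk.get("usage")
        if usage:
            meta["prompt_tokens"] = usage.get("prompt_tokens", 0)
            meta["completion_tokens"] = usage.get("completion_tokens", 0)
            details = usage.get("prompt_tokens_details") or {}
            meta["cached_tokens"] = details.get("cached_tokens", 0)
    return "".join(parts), meta


# Simulated stream: two text chunks, then the usage-only terminal chunk.
SIMULATED = [
    {"choices": [{"text": "Hello"}]},
    {"choices": [{"text": " world"}]},
    {
        "choices": [],
        "usage": {
            "prompt_tokens": 300,
            "completion_tokens": 2,
            "prompt_tokens_details": {"cached_tokens": 150},
        },
    },
]
```

Calling `consume_stream(SIMULATED[:2])` models a request that omitted `include_usage`: the text is still assembled, but every counter in `meta` stays at 0.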

gemini-code-assist Bot reviewed Jun 12, 2026


aoshen02 force-pushed the enable-prefix-cache-hit-rate branch from cdac171 to abdcc12 Compare June 12, 2026 00:12

aoshen02 force-pushed the enable-prefix-cache-hit-rate branch from abdcc12 to 18c103b Compare June 12, 2026 00:25

aoshen02 wants to merge 1 commit into main from enable-prefix-cache-hit-rate
