Skip to content

[Bugfix][Rollout] Record prefix_cache_hit_rate from vLLM usage#241

Open
aoshen02 wants to merge 1 commit into
mainfrom
enable-prefix-cache-hit-rate
Open

[Bugfix][Rollout] Record prefix_cache_hit_rate from vLLM usage#241
aoshen02 wants to merge 1 commit into
mainfrom
enable-prefix-cache-hit-rate

Conversation

@aoshen02

Copy link
Copy Markdown
Collaborator

Problem

rollout/prefix_cache_hit_rate always logs 0 on the vLLM path (slime/sglang reports it fine). The metric pipeline (Sample.PrefixCacheInfo.add in vime/utils/types.py, _compute_prefix_cache_metrics in vime/ray/rollout.py) is byte-identical to slime — the sglang→vllm divergence broke the wiring on two ends:

  1. Parservllm_rollout.py / vllm_streaming_rollout.py built meta from usage.prompt_tokens + usage.completion_tokens but never read the cached count, so PrefixCacheInfo.add saw cached_tokens=0 every time. The hit-rate numerator was structurally pinned to 0 even though the denominator (prompt_tokens) was populated. vLLM reports the count nested as usage.prompt_tokens_details.cached_tokens; surface it.

  2. Server — vLLM only emits prompt_tokens_details when the OpenAI frontend is started with --enable-prompt-tokens-details (default off). This flag lives on FrontendArgs, not AsyncEngineArgs, so it is not reachable via the --vllm-* auto-forwarder and must be passed explicitly when assembling the vllm serve command.

vLLM source confirmation

  • /inference/v1/generateServingTokens.serve_tokens sets usage.prompt_tokens_details.cached_tokens = final_res.num_cached_tokens only if enable_prompt_tokens_details (which is wired from args.enable_prompt_tokens_details, CLI default False).
  • Prefix caching is ON by default (enable_prefix_caching=True), so the count is meaningful for grouped RL rollouts (shared prompt across n samples).

History

This is a re-application of #83 (b502fed), which was reverted by #123 (92c3014) only to "split the change out cleanly" during the slime→vime sync, then never re-landed. Adapted to current main (inlined meta build; old _vllm_meta_from_generate_choice is gone) and extended to the streaming rollout path, which postdates #83 and had the identical gap.

This is the legitimate sglang→vllm translation point; the metric/types layer stays byte-for-byte slime.

Changes (16 insertions, 3 files)

  • vime/backends/vllm_utils/vllm_engine.py — add --enable-prompt-tokens-details to the launch command (always-on, mirroring slime's always-available cached_tokens; negligible cost).
  • vime/rollout/vllm_rollout.pymeta["cached_tokens"] = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0).
  • vime/rollout/vllm_streaming_rollout.py — same extraction (parity).

Cosmetic metric only — does not affect training or the train_rollout_logprob_abs_diff consistency metric.

Test

  • py_compile on all three files: OK.
  • Standalone logic check: usage with prompt_tokens_details.cached_tokenshit_rate=0.500 (cached 150 / prompt 300); missing key and explicit null both → cached=0, no crash.
  • tests/test_sample.py exercises update_from_meta_info with cached_tokens already in meta (accumulation logic, unchanged) — not affected.

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors and cleans up arguments in arguments.py, adds validation checks for prefill_num_servers and vllm_config, enables prefix-cache accounting by passing --enable-prompt-tokens-details to the vLLM command, extracts cached_tokens in rollout generation, and removes the monkey-patching of dist.gather_object. Feedback is provided to replace the assert statements in validate_args with standard if statements raising ValueError to ensure robust validation even when Python is run with optimization flags.

Comment on lines +268 to +278
assert not (
getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False)
), "prefill_num_servers cannot be set with --rollout-external-engine-addrs."

assert not (
getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False)
), "vllm_config cannot be set with --rollout-external-engine-addrs."

assert not (
getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None
), "vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using assert statements for validating command-line arguments or configuration is an anti-pattern in Python. Assertions can be globally disabled in production environments when Python is run with optimization flags (e.g., python -O or PYTHONOPTIMIZE=1), which would silently bypass these validation checks.

To ensure robust validation, these checks should be implemented using standard if statements that raise a ValueError.

Suggested change
assert not (
getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False)
), "prefill_num_servers cannot be set with --rollout-external-engine-addrs."
assert not (
getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False)
), "vllm_config cannot be set with --rollout-external-engine-addrs."
assert not (
getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None
), "vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead."
if getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False):
raise ValueError("prefill_num_servers cannot be set with --rollout-external-engine-addrs.")
if getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False):
raise ValueError("vllm_config cannot be set with --rollout-external-engine-addrs.")
if getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None:
raise ValueError("vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead.")

@aoshen02 aoshen02 force-pushed the enable-prefix-cache-hit-rate branch from cdac171 to abdcc12 Compare June 12, 2026 00:12
@read-the-docs-community

read-the-docs-community Bot commented Jun 12, 2026

Copy link
Copy Markdown

rollout/prefix_cache_hit_rate always reported 0 on the vLLM path due to
two compounding gaps (re-applies #83, reverted by #123 only to "split out
cleanly"; never re-landed):

1. Parser: vllm_rollout.py / vllm_streaming_rollout.py built `meta` from
   usage.prompt_tokens + usage.completion_tokens but never read the cached
   count, so Sample.PrefixCacheInfo.add saw cached_tokens=0 every time —
   the hit-rate numerator was structurally pinned to 0. vLLM reports the
   count nested as usage.prompt_tokens_details.cached_tokens; surface it.

2. Server: vLLM only emits prompt_tokens_details when the OpenAI frontend
   is started with --enable-prompt-tokens-details (default off). This flag
   lives on FrontendArgs, not AsyncEngineArgs, so it is not reachable via
   the --vllm-* auto-forwarder and must be passed explicitly when
   assembling the `vllm serve` command.

Streaming path: besides the same extraction, request the terminal usage
chunk via stream_options.include_usage. vLLM gates that chunk on
should_include_usage(stream_options); without it the streaming loop never
observes `usage`, so prompt_tokens AND cached_tokens went unrecorded (the
loop already handles the choices=[] usage-only chunk — the request just
never asked for it). Verified against the deployed vllm v0.22.0 tag.

Cosmetic metric only — does not affect training.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aoshen02 aoshen02 force-pushed the enable-prefix-cache-hit-rate branch from abdcc12 to 18c103b Compare June 12, 2026 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant