[Bugfix][Rollout] Record prefix_cache_hit_rate from vLLM usage#241
[Bugfix][Rollout] Record prefix_cache_hit_rate from vLLM usage#241aoshen02 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors and cleans up arguments in arguments.py, adds validation checks for prefill_num_servers and vllm_config, enables prefix-cache accounting by passing --enable-prompt-tokens-details to the vLLM command, extracts cached_tokens in rollout generation, and removes the monkey-patching of dist.gather_object. Feedback is provided to replace the assert statements in validate_args with standard if statements raising ValueError to ensure robust validation even when Python is run with optimization flags.
| assert not ( | ||
| getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False) | ||
| ), "prefill_num_servers cannot be set with --rollout-external-engine-addrs." | ||
|
|
||
| assert not ( | ||
| getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False) | ||
| ), "vllm_config cannot be set with --rollout-external-engine-addrs." | ||
|
|
||
| assert not ( | ||
| getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None | ||
| ), "vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead." |
There was a problem hiding this comment.
Using assert statements for validating command-line arguments or configuration is an anti-pattern in Python. Assertions can be globally disabled in production environments when Python is run with optimization flags (e.g., python -O or PYTHONOPTIMIZE=1), which would silently bypass these validation checks.
To ensure robust validation, these checks should be implemented using standard if statements that raise a ValueError.
| assert not ( | |
| getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False) | |
| ), "prefill_num_servers cannot be set with --rollout-external-engine-addrs." | |
| assert not ( | |
| getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False) | |
| ), "vllm_config cannot be set with --rollout-external-engine-addrs." | |
| assert not ( | |
| getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None | |
| ), "vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead." | |
| if getattr(args, "prefill_num_servers", None) is not None and getattr(args, "rollout_external", False): | |
| raise ValueError("prefill_num_servers cannot be set with --rollout-external-engine-addrs.") | |
| if getattr(args, "vllm_config", None) is not None and getattr(args, "rollout_external", False): | |
| raise ValueError("vllm_config cannot be set with --rollout-external-engine-addrs.") | |
| if getattr(args, "vllm_config", None) is not None and getattr(args, "prefill_num_servers", None) is not None: | |
| raise ValueError("vllm_config and prefill_num_servers are mutually exclusive. Use server_groups in the YAML config instead.") |
cdac171 to
abdcc12
Compare
rollout/prefix_cache_hit_rate always reported 0 on the vLLM path due to two compounding gaps (re-applies #83, reverted by #123 only to "split out cleanly"; never re-landed): 1. Parser: vllm_rollout.py / vllm_streaming_rollout.py built `meta` from usage.prompt_tokens + usage.completion_tokens but never read the cached count, so Sample.PrefixCacheInfo.add saw cached_tokens=0 every time — the hit-rate numerator was structurally pinned to 0. vLLM reports the count nested as usage.prompt_tokens_details.cached_tokens; surface it. 2. Server: vLLM only emits prompt_tokens_details when the OpenAI frontend is started with --enable-prompt-tokens-details (default off). This flag lives on FrontendArgs, not AsyncEngineArgs, so it is not reachable via the --vllm-* auto-forwarder and must be passed explicitly when assembling the `vllm serve` command. Streaming path: besides the same extraction, request the terminal usage chunk via stream_options.include_usage. vLLM gates that chunk on should_include_usage(stream_options); without it the streaming loop never observes `usage`, so prompt_tokens AND cached_tokens went unrecorded (the loop already handles the choices=[] usage-only chunk — the request just never asked for it). Verified against the deployed vllm v0.22.0 tag. Cosmetic metric only — does not affect training. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
abdcc12 to
18c103b
Compare
Problem
rollout/prefix_cache_hit_ratealways logs0on the vLLM path (slime/sglang reports it fine). The metric pipeline (Sample.PrefixCacheInfo.addinvime/utils/types.py,_compute_prefix_cache_metricsinvime/ray/rollout.py) is byte-identical to slime — the sglang→vllm divergence broke the wiring on two ends:Parser —
vllm_rollout.py/vllm_streaming_rollout.pybuiltmetafromusage.prompt_tokens+usage.completion_tokensbut never read the cached count, soPrefixCacheInfo.addsawcached_tokens=0every time. The hit-rate numerator was structurally pinned to 0 even though the denominator (prompt_tokens) was populated. vLLM reports the count nested asusage.prompt_tokens_details.cached_tokens; surface it.Server — vLLM only emits
prompt_tokens_detailswhen the OpenAI frontend is started with--enable-prompt-tokens-details(default off). This flag lives onFrontendArgs, notAsyncEngineArgs, so it is not reachable via the--vllm-*auto-forwarder and must be passed explicitly when assembling thevllm servecommand.vLLM source confirmation
/inference/v1/generate→ServingTokens.serve_tokenssetsusage.prompt_tokens_details.cached_tokens = final_res.num_cached_tokensonly ifenable_prompt_tokens_details(which is wired fromargs.enable_prompt_tokens_details, CLI defaultFalse).enable_prefix_caching=True), so the count is meaningful for grouped RL rollouts (shared prompt acrossnsamples).History
This is a re-application of #83 (
b502fed), which was reverted by #123 (92c3014) only to "split the change out cleanly" during the slime→vime sync, then never re-landed. Adapted to currentmain(inlinedmetabuild; old_vllm_meta_from_generate_choiceis gone) and extended to the streaming rollout path, which postdates #83 and had the identical gap.This is the legitimate sglang→vllm translation point; the metric/types layer stays byte-for-byte slime.
Changes (16 insertions, 3 files)
vime/backends/vllm_utils/vllm_engine.py— add--enable-prompt-tokens-detailsto the launch command (always-on, mirroring slime's always-availablecached_tokens; negligible cost).vime/rollout/vllm_rollout.py—meta["cached_tokens"] = (usage.get("prompt_tokens_details") or {}).get("cached_tokens", 0).vime/rollout/vllm_streaming_rollout.py— same extraction (parity).Cosmetic metric only — does not affect training or the
train_rollout_logprob_abs_diffconsistency metric.Test
py_compileon all three files: OK.prompt_tokens_details.cached_tokens→hit_rate=0.500(cached 150 / prompt 300); missing key and explicitnullboth →cached=0, no crash.tests/test_sample.pyexercisesupdate_from_meta_infowithcached_tokensalready inmeta(accumulation logic, unchanged) — not affected.🤖 Generated with Claude Code