fix(rollout): compact_logprobs + logprobs:0 + server-side text #247

Draft
aoshen02 wants to merge 1 commit into main from fix/rollout-compact-logprobs

Conversation

@aoshen02

Summary

  • logprobs:0: Change logprobs: 1 → logprobs: 0 in _build_inference_sampling_params. The previous value caused vLLM to compute and serialize per-token log probabilities that no downstream consumer reads, adding ~13.6% serialization overhead (measured via bare-serve A/B bench).
  • compact_logprobs: Parse the compact_logprobs response field (list of [logprob, token_id] pairs) when present, avoiding the deeply nested logprobs.content[i].logprob object traversal.
  • Server-side text: Prefer choice.text returned by the engine over client-side tokenizer.decode(). Falls back to asyncio.to_thread(decode) to avoid blocking the event loop when the server doesn't return text.
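
The server-side text preference described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `resolve_text` and its `decode` parameter are hypothetical names, and treating any string in `choice["text"]` (including an empty one) as authoritative is a guess at the intended behavior.

```python
import asyncio
from typing import Any, Callable, Sequence


async def resolve_text(
    choice: dict[str, Any],
    token_ids: Sequence[int],
    decode: Callable[[Sequence[int]], str],
) -> str:
    """Return server-provided text if present, else decode off the event loop."""
    text = choice.get("text")
    if isinstance(text, str):
        # Assumption: the engine already detokenized, so no client work is needed.
        return text
    # Tokenizer decoding is CPU-bound; run it in a worker thread so other
    # coroutines on the event loop keep making progress.
    return await asyncio.to_thread(decode, token_ids)


def fake_decode(ids: Sequence[int]) -> str:
    """Stand-in for tokenizer.decode, used only for this example."""
    return " ".join(str(i) for i in ids)
```

Calling `asyncio.run(resolve_text({"text": "hello"}, [1, 2], fake_decode))` takes the fast path and never invokes the decoder; with an empty `choice` dict the same call falls back to `fake_decode` in a worker thread.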

Test plan

  • FP8-30B-A3B rollout smoke (GB300): 0 NaN, throughput ≥ 160 tok/s/gpu
  • bf16 30B-A3B rollout: serialization gap closed vs bare-serve baseline

🤖 Generated with Claude Code

Three rollout performance/correctness fixes:

1. Request logprobs:0 instead of logprobs:1 — the per-token logprob
   over-fetch added ~13.6% serialization overhead with no consumer.

2. Parse compact_logprobs response field when present, avoiding the
   nested logprobs.content[].logprob object traversal.

3. Prefer server-returned text over client-side tokenizer.decode(),
   offloading decode work to the engine and using asyncio.to_thread
   as fallback to avoid blocking the event loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates vime/rollout/vllm_rollout.py to support parsing compact_logprobs and text directly from the vLLM response, falling back to standard token_ids and logprobs parsing and asynchronous token decoding if they are not present. Additionally, the default logprobs in sampling parameters is changed from 1 to 0. The reviewer suggests wrapping the compact_logprobs parsing logic in a try-except block to prevent potential crashes from malformed server responses.

Comment on lines +349 to +363
    clp = choice.get("compact_logprobs")
    if isinstance(clp, list) and clp:
        new_response_tokens = [int(pair[1]) for pair in clp]
        new_response_log_probs = [float(pair[0]) for pair in clp]
    else:
        new_response_tokens = choice.get("token_ids") or []
        new_response_log_probs: list[float] = []
        lp = choice.get("logprobs")
        if isinstance(lp, dict):
            content_items = lp.get("content") or []
            new_response_log_probs = [
                float(item.get("logprob", 0.0)) if isinstance(item, dict) else 0.0 for item in content_items
            ]
        if not new_response_log_probs:
            new_response_log_probs = [0.0] * len(new_response_tokens)

medium

Parsing compact_logprobs directly without a try-except block is risky. If the server returns a malformed list, elements that are not pairs, or values that cannot be cast to int or float, it will raise an exception (such as IndexError, TypeError, or ValueError) and crash the entire rollout process. Wrapping the parsing logic in a try-except block and falling back to the standard token_ids and logprobs parsing ensures robustness.

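    clp = choice.get("compact_logprobs")
    parsed_clp = False
    if isinstance(clp, list) and clp:
        try:
            # Each entry is expected to be a [logprob, token_id] pair.
            new_response_tokens = [int(pair[1]) for pair in clp]
            new_response_log_probs = [float(pair[0]) for pair in clp]
            parsed_clp = True
        except (IndexError, KeyError, TypeError, ValueError):
            # KeyError covers entries that arrive as dicts rather than pairs.
            # Malformed payload: fall through to the standard fields below.
            pass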

    if not parsed_clp:
        new_response_tokens = choice.get("token_ids") or []
        new_response_log_probs: list[float] = []
        lp = choice.get("logprobs")
        if isinstance(lp, dict):
            content_items = lp.get("content") or []
            new_response_log_probs = [
                float(item.get("logprob", 0.0)) if isinstance(item, dict) else 0.0 for item in content_items
            ]
        if not new_response_log_probs:
            new_response_log_probs = [0.0] * len(new_response_tokens)
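
To check how the suggested fallback behaves on malformed input, the same logic can be wrapped in a standalone function. This is a sketch only: `parse_choice` is a hypothetical helper, not a function in this PR, and the sample payloads below are made up for illustration.

```python
from typing import Any


def parse_choice(choice: dict[str, Any]) -> tuple[list[int], list[float]]:
    """Return (token_ids, logprobs), preferring compact_logprobs when well-formed."""
    clp = choice.get("compact_logprobs")
    if isinstance(clp, list) and clp:
        try:
            # Each entry is assumed to be a [logprob, token_id] pair.
            return [int(p[1]) for p in clp], [float(p[0]) for p in clp]
        except (IndexError, KeyError, TypeError, ValueError):
            pass  # Malformed payload: fall through to the standard fields.

    tokens = list(choice.get("token_ids") or [])
    logprobs: list[float] = []
    lp = choice.get("logprobs")
    if isinstance(lp, dict):
        logprobs = [
            float(item.get("logprob", 0.0)) if isinstance(item, dict) else 0.0
            for item in lp.get("content") or []
        ]
    if not logprobs:
        # No usable logprobs: pad with zeros so both lists stay aligned.
        logprobs = [0.0] * len(tokens)
    return tokens, logprobs


# Well-formed compact payload is used directly.
good = parse_choice({"compact_logprobs": [[-0.5, 11], [-1.25, 42]]})

# A one-element "pair" raises IndexError internally and triggers the fallback.
bad = parse_choice({"compact_logprobs": [[-0.5]], "token_ids": [7, 8]})
```

Returning early from the `try` block removes the need for a `parsed_clp` flag; whether that reads better than the flag-based version is a matter of taste.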

@aoshen02 aoshen02 marked this pull request as draft June 15, 2026 02:59