fix(qwen3): chunk echo prefill to bound prompt-logprob activation scratch#416
fix(qwen3): chunk echo prefill to bound prompt-logprob activation scratch#416wjinxu wants to merge 4 commits into
Conversation
…atch Echo with logprobs returns a logprob for every prompt token, which the naive path produced by materializing all-position logits (vocab × prompt_len) in one forward. A long echo prompt's scratch was never budgeted, so a well-formed request deterministically OOMed after admission instead of being served or rejected up front. Chunk echo prefill like any other prompt: each chunk computes all-position logits only for its own slice (vocab × chunk) and the executor stitches the per-chunk prompt logprobs back together, so the scratch is bounded by the prefill budget regardless of prompt length. - scheduler: drop the echo exemption in take_prefill_chunks; echo chunks under the shared prefill budget. - executor: build a per-chunk partial of prompt logprobs and fold it into a per-request accumulator, surfacing the full vector only on the final chunk; clear the accumulator on drop to avoid leaks. - plan: route echo+logprobs to a dedicated Prefill step, since the unified forward can't produce all-position logits; gate the all-position-logits scratch on echo+logprobs so echo without logprobs no longer pays for it. Tests: unit coverage for the chunk-fold seam and the plan gating, plus a GPU self-consistency gate asserting chunked echo prompt logprobs match the single-pass result within bf16 tolerance. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cad1a58253
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if !pending.is_empty() && have_active && !needs_dedicated_prefill { | ||
| Some(ExecutionPlan::Unified { pending }) | ||
| } else if !pending.is_empty() { | ||
| Some(ExecutionPlan::Prefill { pending }) |
There was a problem hiding this comment.
Preserve active decodes on prefill errors
When an echo+logprobs chunk is pending while decodes are active, this branch now returns a Prefill plan with have_active == true. If that prefill step errors (for example, the bounded all-position logits still OOM or a kernel returns an error), failure_targets_for treats Prefill as pending-only while fail_touched_requests still calls active.clear(), so the active streams are silently removed without an Error/Finished event and their executor KV is not dropped. Either keep active state out of the prefill failure path or notify/drop active requests before allowing Prefill with active requests.
Useful? React with 👍 / 👎.
echo+logprobs now routes to a Prefill step even with active decodes, so a Prefill failure can occur while decodes are running. failure_targets_for only collected the pending requests, but fail_touched_requests clears all active state unconditionally — the active decodes were dropped without an Error event or a drop_request, hanging the client and leaking KV. Fail the active requests too, matching the Unified path. Also trim the chunk-stitching comments to one authoritative seam explanation. Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve scheduler/plan.rs conflict with openinfer-project#413's batch sampling refactor: keep the echo+logprobs gating (echo flag = batch_needs_prompt_logprobs) on top of the new plan-level sample_seed. Adapt to upstream API changes: PrefillStepItem::new dropped random_val and PrefillPlan gained sample_seed (echo_chunked_prefill test), and token_tx is now a TokenSink (failure test). Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. More of your lovely PRs please. Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Thanks for digging into this — the chunk-fold seam logic and the failure-target fix are both solid work. Before we land it I want to make sure we're fixing the reported bug. The #358 repro is a plain Also — and this one is on me, the issue was stale — per-request prefill chunking already landed in #363, so the "prefill scratch is never budgeted" half of #358 is done. What's actually left is admission accounting for single-request activation headroom, which this PR doesn't touch. Could you confirm whether your repro used I'll tighten #358 to spell out the repro params and the remaining admission gap so the next person isn't sent down the same path. |
|
Thanks for the careful read — you're right, and I appreciate you catching this before it landed. Confirming the repro: it was a plain /v1/completions request, no echo. So this PR is targeting a path that isn't reachable from the API today (echo hardcoded false + prompt_logprobs rejected at the guard), and what I measured was the normal last-token-logits prefill, not the vocab × prompt_len spike I assumed. Combined with #363 already covering per-request prefill chunking, this PR doesn't move #358 forward. Closing this out and unlinking from #358. The chunk-fold seam logic and the partial-accumulator design should still be useful whenever echo gets wired through the bridge — happy to revisit then. I'll wait for your updated #358 to see where the real admission-accounting gap lives before picking up anything new here. |
Summary
Echo with
logprobsreturns a logprob for every prompt token, which the old path produced by materializing all-position logits (vocab × prompt_len) in a single forward. Echo was exempt from prefill chunking, so a long echo prompt's activation scratch was never budgeted — a well-formed request (e.g. ~12k-token prompt) deterministically OOMed after admission instead of being served or rejected up front.This chunks echo prefill like any other prompt: each chunk computes all-position logits only for its own slice (
vocab × chunk) and the executor stitches the per-chunk prompt logprobs back together, so the scratch is bounded by the prefill budget regardless of prompt length.Changes
take_prefill_chunks): drop the echo exemption; echo chunks under the shared prefill budget like everything else.Some) and fold it into a per-request accumulator, surfacing the full vector only on the final chunk. The accumulator is cleared ondrop_requestso a mid-prefill disconnect can't leak.echo + logprobsto a dedicated Prefill step, since the unified forward can't produce all-position logits (would silently drop the echoed prompt logprobs). Also gate the all-position-logits scratch onecho + logprobs, so echo without logprobs (which only echoes prompt ids back) no longer pays for that scratch.Correctness
The seam between chunks is the delicate part: prompt token
k's logprob comes from the distribution at positionk-1, so a boundary token's logprob is produced by the previous chunk's last position. Each prompt index is produced by exactly one chunk, so folding partials never drops or double-writes a slot.Tests
echo+logprobs→ dedicated prefill; all-position-logits fire only forecho+logprobs).tests/echo_chunked_prefill.rs): chunked echo prompt logprobs must match the single-pass result within bf16 tolerance (no HF golden for within-prompt distributions, so this asserts a hardware-independent invariant instead). Skips cleanly without the model.Tradeoff
While an
echo + logprobsrequest is mid-prefill it forces dedicated Prefill steps, so active decodes wait those ticks (no interleaving).echo + logprobsis rare and this trades a small latency hit for correctness; the previous unified path silently dropped the prompt logprobs.