Skip to content

fix(qwen3): chunk echo prefill to bound prompt-logprob activation scratch#416

Closed
wjinxu wants to merge 4 commits into
openinfer-project:mainfrom
wjinxu:fix/qwen3-echo-logprobs-unified
Closed

fix(qwen3): chunk echo prefill to bound prompt-logprob activation scratch#416
wjinxu wants to merge 4 commits into
openinfer-project:mainfrom
wjinxu:fix/qwen3-echo-logprobs-unified

Conversation

@wjinxu

@wjinxu wjinxu commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Echo with logprobs returns a logprob for every prompt token, which the old path produced by materializing all-position logits (vocab × prompt_len) in a single forward. Echo was exempt from prefill chunking, so a long echo prompt's activation scratch was never budgeted — a well-formed request (e.g. ~12k-token prompt) deterministically OOMed after admission instead of being served or rejected up front.

This chunks echo prefill like any other prompt: each chunk computes all-position logits only for its own slice (vocab × chunk) and the executor stitches the per-chunk prompt logprobs back together, so the scratch is bounded by the prefill budget regardless of prompt length.

Changes

  • scheduler (take_prefill_chunks): drop the echo exemption; echo chunks under the shared prefill budget like everything else.
  • executor: build a per-chunk partial of prompt logprobs (only the positions this chunk forwarded are Some) and fold it into a per-request accumulator, surfacing the full vector only on the final chunk. The accumulator is cleared on drop_request so a mid-prefill disconnect can't leak.
  • plan: route echo + logprobs to a dedicated Prefill step, since the unified forward can't produce all-position logits (would silently drop the echoed prompt logprobs). Also gate the all-position-logits scratch on echo + logprobs, so echo without logprobs (which only echoes prompt ids back) no longer pays for that scratch.

Correctness

The seam between chunks is the delicate part: prompt token k's logprob comes from the distribution at position k-1, so a boundary token's logprob is produced by the previous chunk's last position. Each prompt index is produced by exactly one chunk, so folding partials never drops or double-writes a slot.

Tests

  • Unit: chunk-fold seam reconstruction, echo-chunks-under-budget scheduling, and the plan gating (echo+logprobs → dedicated prefill; all-position-logits fire only for echo+logprobs).
  • GPU self-consistency gate (tests/echo_chunked_prefill.rs): chunked echo prompt logprobs must match the single-pass result within bf16 tolerance (no HF golden for within-prompt distributions, so this asserts a hardware-independent invariant instead). Skips cleanly without the model.

Tradeoff

While an echo + logprobs request is mid-prefill it forces dedicated Prefill steps, so active decodes wait those ticks (no interleaving). echo + logprobs is rare and this trades a small latency hit for correctness; the previous unified path silently dropped the prompt logprobs.

wjinxu and others added 2 commits June 17, 2026 14:02
…atch

Echo with logprobs returns a logprob for every prompt token, which the
naive path produced by materializing all-position logits (vocab ×
prompt_len) in one forward. A long echo prompt's scratch was never
budgeted, so a well-formed request deterministically OOMed after
admission instead of being served or rejected up front.

Chunk echo prefill like any other prompt: each chunk computes
all-position logits only for its own slice (vocab × chunk) and the
executor stitches the per-chunk prompt logprobs back together, so the
scratch is bounded by the prefill budget regardless of prompt length.

- scheduler: drop the echo exemption in take_prefill_chunks; echo chunks
  under the shared prefill budget.
- executor: build a per-chunk partial of prompt logprobs and fold it into
  a per-request accumulator, surfacing the full vector only on the final
  chunk; clear the accumulator on drop to avoid leaks.
- plan: route echo+logprobs to a dedicated Prefill step, since the
  unified forward can't produce all-position logits; gate the
  all-position-logits scratch on echo+logprobs so echo without logprobs
  no longer pays for it.

Tests: unit coverage for the chunk-fold seam and the plan gating, plus a
GPU self-consistency gate asserting chunked echo prompt logprobs match
the single-pass result within bf16 tolerance.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cad1a58253

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +49 to 52
if !pending.is_empty() && have_active && !needs_dedicated_prefill {
Some(ExecutionPlan::Unified { pending })
} else if !pending.is_empty() {
Some(ExecutionPlan::Prefill { pending })

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve active decodes on prefill errors

When an echo+logprobs chunk is pending while decodes are active, this branch now returns a Prefill plan with have_active == true. If that prefill step errors (for example, the bounded all-position logits still OOM or a kernel returns an error), failure_targets_for treats Prefill as pending-only while fail_touched_requests still calls active.clear(), so the active streams are silently removed without an Error/Finished event and their executor KV is not dropped. Either keep active state out of the prefill failure path or notify/drop active requests before allowing Prefill with active requests.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

wjinxu and others added 2 commits June 17, 2026 14:27
echo+logprobs now routes to a Prefill step even with active decodes, so a
Prefill failure can occur while decodes are running. failure_targets_for
only collected the pending requests, but fail_touched_requests clears all
active state unconditionally — the active decodes were dropped without an
Error event or a drop_request, hanging the client and leaking KV. Fail
the active requests too, matching the Unified path.

Also trim the chunk-stitching comments to one authoritative seam
explanation.

Co-authored-by: Cursor <cursoragent@cursor.com>
Resolve scheduler/plan.rs conflict with openinfer-project#413's batch sampling refactor:
keep the echo+logprobs gating (echo flag = batch_needs_prompt_logprobs)
on top of the new plan-level sample_seed. Adapt to upstream API changes:
PrefillStepItem::new dropped random_val and PrefillPlan gained sample_seed
(echo_chunked_prefill test), and token_tx is now a TokenSink (failure
test).

Co-authored-by: Cursor <cursoragent@cursor.com>
@xiaguan

xiaguan commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. More of your lovely PRs please.

Reviewed commit: 15fcc418ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@xiaguan

xiaguan commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Thanks for digging into this — the chunk-fold seam logic and the failure-target fix are both solid work.

Before we land it I want to make sure we're fixing the reported bug. The #358 repro is a plain /v1/completions request — and on that path echo is hardcoded false (bridge.rs:240) and prompt_logprobs is rejected outright (sampling_guard.rs:188), so the echo + all-position-logits path this PR chunks is never reached by an API request. The ~22 KB/token I measured is the normal last-token-logits prefill, not the vocab × prompt_len spike.

Also — and this one is on me, the issue was stale — per-request prefill chunking already landed in #363, so the "prefill scratch is never budgeted" half of #358 is done. What's actually left is admission accounting for single-request activation headroom, which this PR doesn't touch.

Could you confirm whether your repro used echo? If it was a plain prompt, this PR doesn't address #358 and we should unlink it. The echo hardening itself is real, but it's currently unreachable from the API (echo off + prompt_logprobs rejected), so I'd rather not carry the extra per-request accumulator + dedicated-prefill routing until echo is actually wired.

I'll tighten #358 to spell out the repro params and the remaining admission gap so the next person isn't sent down the same path.

@wjinxu

wjinxu commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the careful read — you're right, and I appreciate you catching this before it landed.

Confirming the repro: it was a plain /v1/completions request, no echo. So this PR is targeting a path that isn't reachable from the API today (echo hardcoded false + prompt_logprobs rejected at the guard), and what I measured was the normal last-token-logits prefill, not the vocab × prompt_len spike I assumed. Combined with #363 already covering per-request prefill chunking, this PR doesn't move #358 forward.

Closing this out and unlinking from #358. The chunk-fold seam logic and the partial-accumulator design should still be useful whenever echo gets wired through the bridge — happy to revisit then. I'll wait for your updated #358 to see where the real admission-accounting gap lives before picking up anything new here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants