fix(qwen3): chunk echo prefill to bound prompt-logprob activation scratch by wjinxu · Pull Request #416 · openinfer-project/openinfer

wjinxu · 2026-06-17T06:06:23Z

Summary

Echo with logprobs returns a logprob for every prompt token, which the old path produced by materializing all-position logits (vocab × prompt_len) in a single forward. Echo was exempt from prefill chunking, so a long echo prompt's activation scratch was never budgeted — a well-formed request (e.g. ~12k-token prompt) deterministically OOMed after admission instead of being served or rejected up front.

This chunks echo prefill like any other prompt: each chunk computes all-position logits only for its own slice (vocab × chunk) and the executor stitches the per-chunk prompt logprobs back together, so the scratch is bounded by the prefill budget regardless of prompt length.

Changes

scheduler (take_prefill_chunks): drop the echo exemption; echo chunks under the shared prefill budget like everything else.
executor: build a per-chunk partial of prompt logprobs (only the positions this chunk forwarded are Some) and fold it into a per-request accumulator, surfacing the full vector only on the final chunk. The accumulator is cleared on drop_request so a mid-prefill disconnect can't leak.
plan: route echo + logprobs to a dedicated Prefill step, since the unified forward can't produce all-position logits (would silently drop the echoed prompt logprobs). Also gate the all-position-logits scratch on echo + logprobs, so echo without logprobs (which only echoes prompt ids back) no longer pays for that scratch.

Correctness

The seam between chunks is the delicate part: prompt token k's logprob comes from the distribution at position k-1, so a boundary token's logprob is produced by the previous chunk's last position. Each prompt index is produced by exactly one chunk, so folding partials never drops or double-writes a slot.

Tests

Unit: chunk-fold seam reconstruction, echo-chunks-under-budget scheduling, and the plan gating (echo+logprobs → dedicated prefill; all-position-logits fire only for echo+logprobs).
GPU self-consistency gate (tests/echo_chunked_prefill.rs): chunked echo prompt logprobs must match the single-pass result within bf16 tolerance (no HF golden for within-prompt distributions, so this asserts a hardware-independent invariant instead). Skips cleanly without the model.

Tradeoff

While an echo + logprobs request is mid-prefill it forces dedicated Prefill steps, so active decodes wait those ticks (no interleaving). echo + logprobs is rare and this trades a small latency hit for correctness; the previous unified path silently dropped the prompt logprobs.

…atch Echo with logprobs returns a logprob for every prompt token, which the naive path produced by materializing all-position logits (vocab × prompt_len) in one forward. A long echo prompt's scratch was never budgeted, so a well-formed request deterministically OOMed after admission instead of being served or rejected up front. Chunk echo prefill like any other prompt: each chunk computes all-position logits only for its own slice (vocab × chunk) and the executor stitches the per-chunk prompt logprobs back together, so the scratch is bounded by the prefill budget regardless of prompt length. - scheduler: drop the echo exemption in take_prefill_chunks; echo chunks under the shared prefill budget. - executor: build a per-chunk partial of prompt logprobs and fold it into a per-request accumulator, surfacing the full vector only on the final chunk; clear the accumulator on drop to avoid leaks. - plan: route echo+logprobs to a dedicated Prefill step, since the unified forward can't produce all-position logits; gate the all-position-logits scratch on echo+logprobs so echo without logprobs no longer pays for it. Tests: unit coverage for the chunk-fold seam and the plan gating, plus a GPU self-consistency gate asserting chunked echo prompt logprobs match the single-pass result within bf16 tolerance. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cad1a58253

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-17T06:11:48Z

+    if !pending.is_empty() && have_active && !needs_dedicated_prefill {
        Some(ExecutionPlan::Unified { pending })
    } else if !pending.is_empty() {
        Some(ExecutionPlan::Prefill { pending })


Preserve active decodes on prefill errors

When an echo+logprobs chunk is pending while decodes are active, this branch now returns a Prefill plan with have_active == true. If that prefill step errors (for example, the bounded all-position logits still OOM or a kernel returns an error), failure_targets_for treats Prefill as pending-only while fail_touched_requests still calls active.clear(), so the active streams are silently removed without an Error/Finished event and their executor KV is not dropped. Either keep active state out of the prefill failure path or notify/drop active requests before allowing Prefill with active requests.

Useful? React with 👍 / 👎.

echo+logprobs now routes to a Prefill step even with active decodes, so a Prefill failure can occur while decodes are running. failure_targets_for only collected the pending requests, but fail_touched_requests clears all active state unconditionally — the active decodes were dropped without an Error event or a drop_request, hanging the client and leaking KV. Fail the active requests too, matching the Unified path. Also trim the chunk-stitching comments to one authoritative seam explanation. Co-authored-by: Cursor <cursoragent@cursor.com>

Resolve scheduler/plan.rs conflict with openinfer-project#413's batch sampling refactor: keep the echo+logprobs gating (echo flag = batch_needs_prompt_logprobs) on top of the new plan-level sample_seed. Adapt to upstream API changes: PrefillStepItem::new dropped random_val and PrefillPlan gained sample_seed (echo_chunked_prefill test), and token_tx is now a TokenSink (failure test). Co-authored-by: Cursor <cursoragent@cursor.com>

xiaguan · 2026-06-18T02:38:40Z

@codex review

chatgpt-codex-connector · 2026-06-18T02:45:24Z

Codex Review: Didn't find any major issues. More of your lovely PRs please.

Reviewed commit: 15fcc418ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

xiaguan · 2026-06-18T06:06:22Z

Thanks for digging into this — the chunk-fold seam logic and the failure-target fix are both solid work.

Before we land it I want to make sure we're fixing the reported bug. The #358 repro is a plain /v1/completions request — and on that path echo is hardcoded false (bridge.rs:240) and prompt_logprobs is rejected outright (sampling_guard.rs:188), so the echo + all-position-logits path this PR chunks is never reached by an API request. The ~22 KB/token I measured is the normal last-token-logits prefill, not the vocab × prompt_len spike.

Also — and this one is on me, the issue was stale — per-request prefill chunking already landed in #363, so the "prefill scratch is never budgeted" half of #358 is done. What's actually left is admission accounting for single-request activation headroom, which this PR doesn't touch.

Could you confirm whether your repro used echo? If it was a plain prompt, this PR doesn't address #358 and we should unlink it. The echo hardening itself is real, but it's currently unreachable from the API (echo off + prompt_logprobs rejected), so I'd rather not carry the extra per-request accumulator + dedicated-prefill routing until echo is actually wired.

I'll tighten #358 to spell out the repro params and the remaining admission gap so the next person isn't sent down the same path.

wjinxu · 2026-06-18T06:26:58Z

Thanks for the careful read — you're right, and I appreciate you catching this before it landed.

Confirming the repro: it was a plain /v1/completions request, no echo. So this PR is targeting a path that isn't reachable from the API today (echo hardcoded false + prompt_logprobs rejected at the guard), and what I measured was the normal last-token-logits prefill, not the vocab × prompt_len spike I assumed. Combined with #363 already covering per-request prefill chunking, this PR doesn't move #358 forward.

Closing this out and unlinking from #358. The chunk-fold seam logic and the partial-accumulator design should still be useful whenever echo gets wired through the bridge — happy to revisit then. I'll wait for your updated #358 to see where the real admission-accounting gap lives before picking up anything new here.

wjinxu and others added 2 commits June 17, 2026 14:02

style(qwen3): rustfmt echo chunked-prefill changes

7f9bf32

Co-authored-by: Cursor <cursoragent@cursor.com>

chatgpt-codex-connector Bot reviewed Jun 17, 2026

View reviewed changes

wjinxu and others added 2 commits June 17, 2026 14:27

xiaguan mentioned this pull request Jun 18, 2026

qwen3: admission accepts single prompts whose prefill activation OOMs #358

Closed

wjinxu closed this Jun 18, 2026

xiaguan mentioned this pull request Jun 18, 2026

RFC: profile-based KV-cache sizing to replace the flat 0.85 activation headroom #422

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(qwen3): chunk echo prefill to bound prompt-logprob activation scratch#416

fix(qwen3): chunk echo prefill to bound prompt-logprob activation scratch#416
wjinxu wants to merge 4 commits into
openinfer-project:mainfrom
wjinxu:fix/qwen3-echo-logprobs-unified

wjinxu commented Jun 17, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Uh oh!

wjinxu Jun 17, 2026

Uh oh!

xiaguan commented Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 18, 2026

Uh oh!

xiaguan commented Jun 18, 2026

Uh oh!

wjinxu commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wjinxu commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Correctness

Tests

Tradeoff

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

wjinxu Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

xiaguan commented Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot commented Jun 18, 2026

Uh oh!

xiaguan commented Jun 18, 2026

Uh oh!

wjinxu commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wjinxu commented Jun 17, 2026 •

edited

Loading