feat(qwen35): scheduler-level chunked prefill (#375) #431

Open
scatyf3 wants to merge 8 commits into openinfer-project:main from scatyf3:feat/qwen35-chunked-prefill

@scatyf3 scatyf3 commented Jun 21, 2026

Contributor

Description

Fixes #375

Qwen3.5 already chunked prefill at the KV level (PREFILL_CHUNK_LEN), but the whole chunk loop ran inside a single scheduler step. A long prompt was therefore atomic to the scheduler: one Unified step packed the entire prompt plus every active decode row into one forward pass, stalling all decoding requests for one inter-token gap — the same ITL tail #368 profiled on Qwen3.

This PR moves chunking up to the scheduler. State now lives across steps instead of inside one prefill call, and each step prefills at most a per-step token budget off the FIFO front, servicing the decode batch between chunks.
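The step structure described above can be sketched with a toy loop. This is an illustration under assumptions, not the scheduler's code: `StepLog` and `run_steps` are hypothetical names, and each queue entry is reduced to a count of prompt tokens still to prefill.

```rust
// Toy model only: each queue entry is just "prompt tokens still to prefill".
#[derive(Debug, PartialEq)]
struct StepLog {
    prefilled: usize,    // prompt tokens prefilled this step (never above the budget)
    decoded_rows: usize, // decode rows serviced this step
}

/// Hypothetical step loop: prefill at most `budget` tokens off the FIFO front,
/// then decode the active batch, then promote any prompt that finished.
fn run_steps(
    mut prefilling: std::collections::VecDeque<usize>,
    mut active: usize,
    budget: usize,
) -> Vec<StepLog> {
    assert!(budget > 0, "prefill budget must be positive");
    let mut log = Vec::new();
    while !prefilling.is_empty() {
        let mut left = budget;
        let mut prefilled = 0;
        let mut finished = 0;
        while left > 0 && !prefilling.is_empty() {
            let rem = prefilling[0];
            let take = rem.min(left);
            left -= take;
            prefilled += take;
            if take == rem {
                // The front prompt is fully prefilled; it leaves the queue.
                prefilling.pop_front();
                finished += 1;
            } else {
                // Partial chunk: the prompt stays at the front for the next step.
                prefilling[0] = rem - take;
            }
        }
        // Decode runs in every step, between prefill chunks.
        log.push(StepLog { prefilled, decoded_rows: active });
        active += finished;
    }
    log
}
```

With a queue of `[10, 3]`, two active rows, and a budget of 4, this toy loop takes four steps, and the active rows decode in each of them rather than waiting for all 13 prompt tokens.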

No new kernel: prefill_forward already reads its base from kv.seq_len() and advances KV + recurrent state in place, so calling it with successive prompt slices evolves state bit-identically to a whole-prompt call. This is pure scheduler bookkeeping.
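A minimal sketch of why slice-by-slice calls can match a whole-prompt call, assuming a toy state rather than the real KvState / RecurrentState: `ToyState`, the hash-style recurrent update, and the signature of `prefill_forward` here are all hypothetical. The only property borrowed from the PR is that the base position is read from the state's own `seq_len()`.

```rust
/// Toy stand-in for per-request state (not the real KvState / RecurrentState).
#[derive(Debug, Default, Clone, PartialEq)]
struct ToyState {
    kv: Vec<(usize, u32)>, // one (position, token) entry per prefilled token
    rec: u64,              // order-sensitive running summary
}

impl ToyState {
    fn seq_len(&self) -> usize {
        self.kv.len()
    }
}

/// Hypothetical prefill: the base position comes from the state itself,
/// so the caller never passes an offset.
fn prefill_forward(state: &mut ToyState, tokens: &[u32]) {
    let base = state.seq_len();
    for (i, &tok) in tokens.iter().enumerate() {
        let pos = base + i;
        state.kv.push((pos, tok));
        state.rec = state
            .rec
            .wrapping_mul(1_000_003)
            .wrapping_add(u64::from(tok) ^ pos as u64);
    }
}

/// One call over the whole prompt.
fn prefill_whole(prompt: &[u32]) -> ToyState {
    let mut state = ToyState::default();
    prefill_forward(&mut state, prompt);
    state
}

/// Successive calls over slices of at most `budget` tokens.
fn prefill_chunked(prompt: &[u32], budget: usize) -> ToyState {
    assert!(budget > 0, "budget must be positive");
    let mut state = ToyState::default();
    for chunk in prompt.chunks(budget) {
        prefill_forward(&mut state, chunk);
    }
    state
}
```

In this toy, `prefill_chunked(&prompt, 4)` and `prefill_whole(&prompt)` compare equal because each token's update depends only on the prior state and its absolute position, not on where the slice boundaries fall.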

KV / recurrent-state management

Before, prefill allocated the full KV/rec state in a temporary, and on completion did a KV-move + rec-copy and dropped the temp. Now state advances in place across chunks:

  • In flight: a new PrefillingRequest35 owns the growing KvState + RecurrentState + a cursor (prompt tokens prefilled so far) and lives in the prefilling FIFO across steps — not as a function local.
  • On completion: the same ownership handoff as before — KV/rec state moves from PrefillingRequest35 to ActiveRequest35.
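The ownership flow in the two bullets above could look roughly like this. The struct names follow the PR, but every field, method, and the placeholder `KvState` / `RecurrentState` bodies are assumptions made for illustration.

```rust
// Placeholder state types: the real ones hold GPU-side buffers.
#[derive(Debug)]
struct KvState {
    seq_len: usize,
}

#[derive(Debug)]
struct RecurrentState {
    steps: usize,
}

/// In-flight prefill: owns its state across scheduler steps (fields are assumed).
#[derive(Debug)]
struct PrefillingRequest35 {
    prompt_tokens: Vec<u32>,
    kv: KvState,
    rec: RecurrentState,
    cursor: usize, // prompt tokens prefilled so far
}

/// Decoding request (fields are assumed).
#[derive(Debug)]
struct ActiveRequest35 {
    kv: KvState,
    rec: RecurrentState,
    prompt_len: usize,
}

impl PrefillingRequest35 {
    fn new(prompt_tokens: Vec<u32>) -> Self {
        Self {
            prompt_tokens,
            kv: KvState { seq_len: 0 },
            rec: RecurrentState { steps: 0 },
            cursor: 0,
        }
    }

    /// Advance in place by up to `n` tokens; no temporary state is allocated.
    fn advance(&mut self, n: usize) {
        let n = n.min(self.prompt_tokens.len() - self.cursor);
        self.cursor += n;
        self.kv.seq_len += n;
        self.rec.steps += n;
    }

    /// Consume the request: a finished prompt hands its state to the decode
    /// side by move; an unfinished one is returned unchanged for re-queueing.
    fn promote(self) -> Result<ActiveRequest35, PrefillingRequest35> {
        if self.cursor == self.prompt_tokens.len() {
            Ok(ActiveRequest35 {
                prompt_len: self.prompt_tokens.len(),
                kv: self.kv,
                rec: self.rec,
            })
        } else {
            Err(self)
        }
    }
}
```

Taking `self` by value in `promote` means the compiler rules out any use of the prefilling entry after the handoff, which is one way to keep the move symmetric with the pre-PR behaviour.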

Scheduler (scheduler.rs, scheduler/plan.rs)

  • prefilling queue — FIFO that owns in-flight KV/rec state across steps.
  • Per-step budget helpers (pure, unit-tested):
    • plan_prefill_chunks(remaining, budget) packs front requests up to budget tokens; a request that doesn't fit takes a partial chunk and stays at the front, so one long prompt is sliced across steps while short prompts behind it still get serviced.
    • Admission shrinks the budgets passed to admit_pending_requests under two constraints: KV pages and CUDA graph slots.
  • run_step mirrors the Qwen3 scheduler:
    • admitted prompts enter the prefilling queue → ExecutionPlan now threads the prefilling queue → advance scheduled prefill chunks → decode_step the active batch → promote_or_requeue each scheduled chunk (a finished prompt promotes to decode, an unfinished one re-queues at the front).
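As an illustration of the budget helper described in the list above, here is one possible shape for a pure planning function. The return type (queue index, chunk length) and the handling of zero-length entries are assumptions; the helper in the PR may differ.

```rust
/// Sketch of a pure planner: `remaining[i]` is the number of prompt tokens the
/// i-th FIFO entry still needs. Returns (queue index, chunk length) pairs whose
/// lengths sum to at most `budget`.
fn plan_prefill_chunks(remaining: &[usize], budget: usize) -> Vec<(usize, usize)> {
    let mut left = budget;
    let mut plan = Vec::new();
    for (idx, &rem) in remaining.iter().enumerate() {
        if left == 0 {
            break;
        }
        if rem == 0 {
            // Assumption: skip an entry with nothing left instead of stalling
            // the entries queued behind it.
            continue;
        }
        // A request that does not fit takes a partial chunk.
        let take = rem.min(left);
        plan.push((idx, take));
        left -= take;
    }
    plan
}
```

For example, `plan_prefill_chunks(&[2, 3, 5], 4)` schedules the whole first prompt and a 2-token slice of the second, while `plan_prefill_chunks(&[10, 3, 5], 4)` spends the entire budget on the long prompt at the front.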

Usage

Set OPENINFER_QWEN35_PREFILL_BUDGET to control the per-step prefill token budget, i.e. the chunk size (default 1024).
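A hedged sketch of how such a variable might be read. The variable name and the 1024 default come from this PR; the helper names and the choice to fall back to the default on unset, non-numeric, or zero values are assumptions.

```rust
const DEFAULT_PREFILL_BUDGET: usize = 1024;

/// Parse a raw setting; anything unset, non-numeric, or zero falls back to the
/// default (this fallback policy is an assumption, not taken from the PR).
fn parse_prefill_budget(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_PREFILL_BUDGET)
}

/// Read the budget from the process environment.
fn prefill_budget_from_env() -> usize {
    parse_prefill_budget(
        std::env::var("OPENINFER_QWEN35_PREFILL_BUDGET")
            .ok()
            .as_deref(),
    )
}
```

Keeping the parsing separate from the environment lookup lets the policy be unit-tested without mutating process-wide state.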

Evaluation

  • Latency: improves p99 ITL; slightly raises p50 in a minority of cases (the expected budget trade-off).
  • Correctness: Qwen3.5 e2e output accuracy unchanged. cargo fmt, the 24 scheduler unit tests, and e2e_scheduler (at the 1024 default and at budget 4 for heavy slicing / concurrent interleave) all green.

Type of Change

New feature (non-breaking change which adds functionality)

scatyf3 and others added 4 commits June 19, 2026 16:40
Move chunked-prefill control from the KV level into the Qwen3.5 scheduler. A
per-step prefill token budget (OPENINFER_QWEN35_PREFILL_BUDGET, default 1024,
matching Qwen3) slices a long prompt across steps and services the decode batch
between chunks, killing the unified-step ITL stall measured on Qwen3 (openinfer-project#368).

New prompts enter a `prefilling` FIFO that owns their growing KvState +
RecurrentState + cursor; each step feeds the next budgeted chunk into the
existing `batch_prefill_logits` / `unified_step` exactly as a whole-prompt
prefill would. The executor reads its base position from `kv.seq_len()`, so
successive chunks continue in place -- no new kernel and no executor change
(unified_forward.rs is untouched). Finished prompts promote into the decode
batch; the unfinished tail re-queues at the front. Admission reserves graph
slots and future KV pages for in-flight prefills. Structure mirrors Qwen3:
take_prefill_chunks -> build_next_plan -> prefill_batch / unified_step_sched /
decode_step; the decode path (decode_step / process_decode_logits /
dispatch_decode_tokens / compact_slot) is unchanged from main.

Validated: 30 lib unit tests green on A100 (incl. plan chunked-prefill cases at
the 1024 default and budget 4); hf_golden_gate + sampling_behavior compile.
e2e_scheduler needs Qwen3.5-4B weights to run (the local 0.8B has incompatible
GDN head dims for the baked AOT kernels).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Condense over-long doc comments in the chunked-prefill scheduler and
move the `deferred` reassignment next to where it is consumed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Condense the module header and struct/const doc comments now that the
chunked-prefill design is settled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove the dedicated chunked-prefill design doc and its index/scheduler
references; the scheduler-level chunked prefill change stands on its own
in code + unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29ee05bc3c

Comment on lines +166 to +168
if left == 0 || rem == 0 {
    break;
}

[P1] Handle zero-token prompts before chunk planning

When an admitted request has an empty prompt, its remaining length is 0, so this branch returns no scheduled chunks while leaving the request in prefilling; the scheduler then skips blocking because prefilling is non-empty and never drains later prompts behind it. Before this change the prefill path would error the empty request via prefill_last_hidden, but now a single empty /v1/completions prompt can permanently wedge the Qwen3.5 scheduler, so reject/drop zero-token prompts or emit a terminal event before enqueueing them.
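One way to apply the suggested fix is to reject empty prompts at admission, before they reach the prefilling queue. The sketch below uses hypothetical types (`Pending`, `Admission`, `admit`) and leaves the terminal-event emission as a comment, since the real event API is not shown here.

```rust
#[derive(Debug, PartialEq)]
enum Admission {
    Enqueued,
    RejectedEmptyPrompt,
}

/// Hypothetical pending request; the real type carries more fields.
struct Pending {
    id: u64,
    prompt_tokens: Vec<u32>,
}

/// Guard at the queue boundary: an empty prompt never enters `prefilling`,
/// so the planner never sees a request with zero remaining tokens.
fn admit(prefilling: &mut std::collections::VecDeque<Pending>, req: Pending) -> Admission {
    if req.prompt_tokens.is_empty() {
        // A real scheduler would send a terminal error event to the client here.
        return Admission::RejectedEmptyPrompt;
    }
    prefilling.push_back(req);
    Admission::Enqueued
}
```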

Comment on lines +841 to +842
if end < req.prompt_tokens.len() {
    still_prefilling.push(PrefillingRequest35 {

[P2] Drop cancelled partial prefills before requeueing

For prompts that span multiple chunks, no TokenEvent is sent on non-final chunks, so a client disconnect is only observable through req.token_tx.is_closed(). This branch requeues the request unconditionally, which keeps a cancelled long prompt at the FIFO front and continues spending prefill chunks and reserved capacity until the full prompt is processed; check the sink before requeueing and drop the request state when it is closed.
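A sketch of the check this comment asks for, using stand-in types: `TokenSink` models only the `is_closed()` query on `req.token_tx`, and `Prefilling`, `Outcome`, and this `promote_or_requeue` signature are hypothetical.

```rust
/// Stand-in for the request's token sender; only the closed query is modelled.
struct TokenSink {
    closed: bool,
}

impl TokenSink {
    fn is_closed(&self) -> bool {
        self.closed
    }
}

struct Prefilling {
    id: u64,
    prompt_len: usize,
    cursor: usize,
    token_tx: TokenSink,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Promoted(u64),
    Requeued(u64),
    Dropped(u64),
}

/// Check the sink first: a cancelled request is dropped (releasing whatever
/// state it owns) instead of being re-queued at the FIFO front.
fn promote_or_requeue(req: Prefilling, still_prefilling: &mut Vec<Prefilling>) -> Outcome {
    let id = req.id;
    if req.token_tx.is_closed() {
        return Outcome::Dropped(id);
    }
    if req.cursor < req.prompt_len {
        still_prefilling.push(req);
        Outcome::Requeued(id)
    } else {
        Outcome::Promoted(id)
    }
}
```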

@xiaguan xiaguan self-assigned this Jun 21, 2026

@xiaguan xiaguan left a comment

Collaborator

Verified locally (RTX 5070 Ti, Qwen3.5-4B): 30 lib tests + e2e_scheduler green at the 1024 default and at budget 4. Greedy output at budget 4 (273-tok prompt sliced into ~68 chunks) is byte-identical to the whole-prompt run, so the cross-step KV/recurrent advance is exact even with the chunkwise GDN kernel. Correctness LGTM.

Two changes before merge:

  1. Drop the OPENINFER_QWEN35_PREFILL_BUDGET env var. Wire the existing --max-prefill-tokens server flag through to Qwen3.5 instead — it's already an arg (used by Qwen3) but currently ignored on this path. A hidden env var for a serving knob isn't the pattern we want, and it contradicts the 'mirrors Qwen3' framing.

  2. #375 asks for an ITL A/B report and the PR doesn't include one. Please add the numbers: mixed-load p50/p99 ITL, chunked vs not, same profile as #368. The ITL win is structurally expected, but it should be measured, not asserted.

scatyf3 and others added 4 commits June 21, 2026 18:55
…env vars

Qwen3.5's scheduler read a hidden OPENINFER_QWEN35_PREFILL_BUDGET env var for
the per-step chunked-prefill budget while the server's --max-prefill-tokens flag
(already used by Qwen3) was ignored on this path. Thread the flag through
launch -> start_with_capacity -> scheduler_loop instead, mirroring Qwen3, and
assert the budget is positive.

Also replace the bench-only OPENINFER_QWEN3{,5}_PREFILL_BUDGET env vars with a
bench_serving --max-prefill-tokens flag, so both serving and benching use the
same knob on both model lines (no env var anywhere).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Record the chunking-off mixed-load ITL characterization for both lines
(qps x prompt x prefix): Qwen3's unified-step freeze blows up p99 with prompt
(8k 1161, 12k 3270ms), while Qwen3.5's same freeze lands just under the 1% p99
knee (shows in max, not p99 -- a measurement artifact of the 1024-token
background, not architectural immunity).

Regenerate the Qwen3.5 bench snapshot so it carries the mixed_itl profile
(canonical default-chunked cell) on par with Qwen3, refreshing the stale March
prefill/decode numbers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Bring back the severity/frequency two-knob explanation (the rigorous model whose
1%-knee framing is exactly why Qwen3.5's freeze stays out of p99), and rename the
section to 'mixed-load stall, chunking off' — the tables are all chunking-off, so
the prior 'chunked-prefill A/B' title overclaimed. The chunked-on result lives in
the mixed_itl snapshot cell, not a table here.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
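
The 1% knee mentioned in the commit messages above can be illustrated with synthetic numbers. This is a sketch, not measured data: the sample counts and latencies are made up, and `percentile` uses the nearest-rank definition, which may differ from what the benchmark tool computes.

```rust
/// Nearest-rank percentile over integer millisecond samples (`p` in 1..=100).
fn percentile(samples: &[u64], p: usize) -> u64 {
    assert!(!samples.is_empty() && (1..=100).contains(&p));
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let rank = (p * sorted.len()).div_ceil(100); // 1-based rank
    sorted[rank - 1]
}

/// Synthetic ITL trace: `stalled` of the `total` gaps take `stall_ms`
/// (severity); the rest take `base_ms`. `stalled / total` is the frequency.
fn mixed_itl(total: usize, stalled: usize, base_ms: u64, stall_ms: u64) -> Vec<u64> {
    (0..total)
        .map(|i| if i < stalled { stall_ms } else { base_ms })
        .collect()
}
```

With 1000 gaps, a stall hitting 5 of them (0.5%) leaves p99 at the base latency and shows only in the max, while the same stall hitting 20 of them (2%) moves p99 to the stall latency. Severity sets how large the spike is; frequency decides whether p99 sees it at all.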
A stray 'max_prefill_tokens' token had been pasted into PromptInputArgs (editor
autocomplete), breaking the parse and CI's cargo fmt --all --check; remove it.
Run cargo fmt (collapses the Qwen35 launch match arm in main.rs). Revert the
--max-prefill-tokens help text to the original Qwen3-specific wording per
maintainer preference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@scatyf3 scatyf3 requested a review from xiaguan June 22, 2026 14:43

@xiaguan xiaguan left a comment

Collaborator

LGTM. Clean scheduler-level chunked prefill that mirrors the proven Qwen3 design — the admission accounting (slot_budget = max_batch - prefilling.len() - active.len(), plus prefilling_future_pages reserving each in-flight prefill's KV growth) keeps active.len() + prefilling.len() <= max_batch invariant, so promotion never runs out of a graph slot, and the state-ownership handoff across steps is symmetric. No new kernel, recurrent/KV state advances in place off kv.seq_len().
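
The admission accounting described above can be sketched as plain arithmetic. The `slot_budget` formula follows the review comment; the page-based helpers, the `page_size` parameter, and the (prompt length, cursor) representation are assumptions for illustration.

```rust
/// Graph slots still free for new admissions.
fn slot_budget(max_batch: usize, prefilling: usize, active: usize) -> usize {
    max_batch.saturating_sub(prefilling + active)
}

/// KV pages one in-flight prefill will still claim: pages for the full prompt
/// minus pages already covered by the tokens prefilled so far.
fn future_pages(prompt_len: usize, cursor: usize, page_size: usize) -> usize {
    assert!(page_size > 0 && cursor <= prompt_len);
    prompt_len.div_ceil(page_size) - cursor.div_ceil(page_size)
}

/// Total reservation across the prefilling queue, given (prompt_len, cursor) pairs.
fn prefilling_future_pages(in_flight: &[(usize, usize)], page_size: usize) -> usize {
    in_flight
        .iter()
        .map(|&(len, cur)| future_pages(len, cur, page_size))
        .sum()
}

/// Admit no more requests than there are free slots, which preserves
/// active + prefilling <= max_batch after admission.
fn admit_count(pending: usize, max_batch: usize, prefilling: usize, active: usize) -> usize {
    pending.min(slot_budget(max_batch, prefilling, active))
}
```

Because every admitted prompt already counts against `max_batch` while it is still prefilling, a later promotion only moves it from one side of the sum to the other and never needs a new slot.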

One non-blocking nit: the --max-prefill-tokens help text in openinfer-server/src/config.rs still scopes the flag to Qwen3 ("one Qwen3 scheduler step", "Echo requests are never split"), but after this PR it also drives the Qwen3.5 chunked-prefill budget. Worth a one-line tweak so the flag's scope reads correctly for both model lines.

Approving.

Development

Successfully merging this pull request may close these issues.

Implement scheduler-level chunked prefill for Qwen3.5

2 participants