feat(qwen35): scheduler-level chunked prefill (#375) by scatyf3 · Pull Request #431 · openinfer-project/openinfer

scatyf3 · 2026-06-21T15:23:17Z

Description

Fixes #375

Qwen3.5 already chunked prefill at the KV level (PREFILL_CHUNK_LEN), but the whole chunk loop ran inside a single scheduler step. A long prompt was therefore atomic to the scheduler: one Unified step packed the entire prompt plus every active decode row into one forward pass, stalling all decoding requests for one inter-token gap — the same ITL tail #368 profiled on Qwen3.

This PR moves chunking up to the scheduler. State now lives across steps instead of inside one prefill call, and each step prefills at most a per-step token budget off the FIFO front, servicing the decode batch between chunks.

No new kernel: prefill_forward already reads its base from kv.seq_len() and advances KV + recurrent state in place, so calling it with successive prompt slices evolves state bit-identically to a whole-prompt call. This is pure scheduler bookkeeping.
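
The in-place property claimed above can be illustrated with a toy model. This is a minimal sketch, not the OpenInfer executor: `ToyState`, its fields, and `prefill_chunked` are hypothetical stand-ins that only mimic "read the base from `seq_len()`, then advance KV and recurrent state in place".

```rust
/// Hypothetical stand-in for KV + recurrent state; not the real OpenInfer types.
#[derive(Debug, Clone, PartialEq, Default)]
struct ToyState {
    /// One entry per prefilled token, playing the role of the KV cache.
    kv: Vec<u64>,
    /// A running value playing the role of the recurrent state.
    rec: u64,
}

impl ToyState {
    /// Number of tokens already prefilled: the base position of the next call.
    fn seq_len(&self) -> usize {
        self.kv.len()
    }

    /// Toy prefill: reads its base from `seq_len()` and advances state in place.
    fn prefill_forward(&mut self, tokens: &[u32]) {
        let base = self.seq_len();
        for (i, &tok) in tokens.iter().enumerate() {
            let pos = (base + i) as u64;
            // Position-dependent update, so a wrong base would change the result.
            self.rec = self.rec.wrapping_mul(31).wrapping_add(u64::from(tok) ^ pos);
            self.kv.push(self.rec);
        }
    }
}

/// Prefill `prompt` in successive slices of at most `chunk_len` tokens.
/// `chunk_len` must be non-zero (`slice::chunks` panics on 0).
fn prefill_chunked(prompt: &[u32], chunk_len: usize) -> ToyState {
    let mut state = ToyState::default();
    for chunk in prompt.chunks(chunk_len) {
        state.prefill_forward(chunk);
    }
    state
}
```

Because each call starts from the state the previous call left behind, slicing the prompt at any boundary yields the same final state as a single whole-prompt call in this toy model.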

KV / recurrent-state management

Before, prefill allocated the full KV/rec state in a temporary, and on completion did a KV-move + rec-copy and dropped the temp. Now state advances in place across chunks:

In flight: a new PrefillingRequest35 owns the growing KvState + RecurrentState + a cursor (prompt tokens prefilled so far) and lives in the prefilling FIFO across steps — not as a function local.
On completion: the same ownership handoff as before — KV/rec state moves from PrefillingRequest35 to ActiveRequest35.
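
The ownership handoff described above might look roughly like the following. This is a sketch under assumptions: only the names `PrefillingRequest35`, `ActiveRequest35`, `KvState`, `RecurrentState`, and `prompt_tokens` come from the PR; every other field and method (`cursor`, `advance`, `into_active`, the placeholder state contents) is hypothetical.

```rust
/// Placeholder state types; the real KvState / RecurrentState hold model state.
#[derive(Debug, Default)]
struct KvState {
    tokens: usize,
}

#[derive(Debug, Default)]
struct RecurrentState {
    steps: usize,
}

/// In-flight prefill: owns the growing state plus a cursor into the prompt.
#[derive(Debug)]
struct PrefillingRequest35 {
    prompt_tokens: Vec<u32>,
    cursor: usize, // prompt tokens prefilled so far
    kv: KvState,
    rec: RecurrentState,
}

/// Decoding request: receives the state by move once prefill completes.
#[derive(Debug)]
struct ActiveRequest35 {
    kv: KvState,
    rec: RecurrentState,
    generated: Vec<u32>,
}

impl PrefillingRequest35 {
    fn new(prompt_tokens: Vec<u32>) -> Self {
        Self { prompt_tokens, cursor: 0, kv: KvState::default(), rec: RecurrentState::default() }
    }

    fn remaining(&self) -> usize {
        self.prompt_tokens.len() - self.cursor
    }

    /// Advance state in place by up to `n` tokens; returns the tokens consumed.
    fn advance(&mut self, n: usize) -> usize {
        let take = n.min(self.remaining());
        self.kv.tokens += take;
        self.rec.steps += take;
        self.cursor += take;
        take
    }

    /// Consume the prefilling request, moving (not copying) its state.
    fn into_active(self) -> ActiveRequest35 {
        ActiveRequest35 { kv: self.kv, rec: self.rec, generated: Vec::new() }
    }
}
```

Taking `self` by value in `into_active` lets the compiler enforce that the prefilling entry can no longer be used after promotion, which matches the "moves from PrefillingRequest35 to ActiveRequest35" wording.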

Scheduler (`scheduler.rs`, `scheduler/plan.rs`)

prefilling queue: a FIFO that owns in-flight KV/rec state across steps.
Per-step budget helpers (pure, unit-tested):
- plan_prefill_chunks(remaining, budget) packs front requests up to budget tokens; a request that doesn't fit takes a partial chunk and stays at the front, so one long prompt is sliced across steps while short prompts behind it still get serviced.
- Admission shrinks the budgets passed to admit_pending_requests under two constraints: KV pages and CUDA graph slots.
run_step mirrors the Qwen3 scheduler:
admitted prompts enter the prefilling queue → ExecutionPlan now threads the prefilling queue → advance scheduled prefill chunks → decode_step the active batch → promote_or_requeue each scheduled chunk (a finished prompt promotes to decode, an unfinished one re-queues at the front).
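
A per-step planner of the kind described above could be sketched as follows. The name and argument order follow the PR text; the slice argument and the `Vec<usize>` return shape (one chunk length per scheduled request, front first) are assumptions, not the actual signature in `scheduler/plan.rs`.

```rust
/// Sketch of a per-step chunk planner.
///
/// `remaining[i]` is the number of prompt tokens that request `i` (in FIFO
/// order) still has to prefill. The result holds one chunk length per
/// scheduled request, starting at the FIFO front.
fn plan_prefill_chunks(remaining: &[usize], budget: usize) -> Vec<usize> {
    let mut left = budget;
    let mut chunks = Vec::new();
    for &rem in remaining {
        // Stop when the budget is spent, or at a request with nothing left.
        if left == 0 || rem == 0 {
            break;
        }
        // A request that does not fit takes a partial chunk.
        let take = rem.min(left);
        chunks.push(take);
        left -= take;
    }
    chunks
}
```

With a budget of 4 and remaining lengths `[10, 3]`, the first step schedules `[4]`; once the front request is down to 2 tokens, the step schedules `[2, 2]`, so the request behind it starts to make progress. Note that in this sketch a front request with `rem == 0` stops planning for the whole queue.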

Usage

Set OPENINFER_QWEN35_PREFILL_BUDGET to control the per-step prefill token budget (default 1024).

Evaluation

Latency: improves p99 ITL; slightly raises p50 in a minority of cases (the expected budget trade-off).
Correctness: Qwen3.5 e2e output accuracy unchanged. cargo fmt, the 24 scheduler unit tests, and e2e_scheduler (at the 1024 default and at budget 4 for heavy slicing / concurrent interleave) all green.

Type of Change

New feature (non-breaking change which adds functionality)

Move chunked-prefill control from the KV level into the Qwen3.5 scheduler. A per-step prefill token budget (OPENINFER_QWEN35_PREFILL_BUDGET, default 1024, matching Qwen3) slices a long prompt across steps and services the decode batch between chunks, killing the unified-step ITL stall measured on Qwen3 (openinfer-project#368).

New prompts enter a `prefilling` FIFO that owns their growing KvState + RecurrentState + cursor; each step feeds the next budgeted chunk into the existing `batch_prefill_logits` / `unified_step` exactly as a whole-prompt prefill would. The executor reads its base position from `kv.seq_len()`, so successive chunks continue in place -- no new kernel and no executor change (unified_forward.rs is untouched). Finished prompts promote into the decode batch; the unfinished tail re-queues at the front. Admission reserves graph slots and future KV pages for in-flight prefills.

Structure mirrors Qwen3: take_prefill_chunks -> build_next_plan -> prefill_batch / unified_step_sched / decode_step; the decode path (decode_step / process_decode_logits / dispatch_decode_tokens / compact_slot) is unchanged from main.

Validated: 30 lib unit tests green on A100 (incl. plan chunked-prefill cases at the 1024 default and budget 4); hf_golden_gate + sampling_behavior compile. e2e_scheduler needs Qwen3.5-4B weights to run (the local 0.8B has incompatible GDN head dims for the baked AOT kernels).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Condense over-long doc comments in the chunked-prefill scheduler and move the `deferred` reassignment next to where it is consumed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Condense the module header and struct/const doc comments now that the chunked-prefill design is settled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove the dedicated chunked-prefill design doc and its index/scheduler references; the scheduler-level chunked prefill change stands on its own in code + unit tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29ee05bc3c

chatgpt-codex-connector · 2026-06-21T15:27:08Z

+        if left == 0 || rem == 0 {
+            break;
+        }


Handle zero-token prompts before chunk planning

When an admitted request has an empty prompt, its remaining length is 0, so this branch returns no scheduled chunks while leaving the request in prefilling; the scheduler then skips blocking because prefilling is non-empty and never drains later prompts behind it. Before this change the prefill path would error the empty request via prefill_last_hidden, but now a single empty /v1/completions prompt can permanently wedge the Qwen3.5 scheduler, so reject/drop zero-token prompts or emit a terminal event before enqueueing them.
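
One way to act on this suggestion is to reject empty prompts at admission, before they can reach the prefilling FIFO. The following is only a sketch: `PendingPrompt`, `Admission`, and `admit` are hypothetical names, and the real scheduler would also need to emit a terminal event to the client.

```rust
use std::collections::VecDeque;

/// Hypothetical admitted-request type; only `prompt_tokens` mirrors the PR.
struct PendingPrompt {
    prompt_tokens: Vec<u32>,
}

/// Hypothetical admission result.
#[derive(Debug, PartialEq)]
enum Admission {
    Enqueued,
    RejectedEmptyPrompt,
}

/// Refuse zero-token prompts so they never sit at the FIFO front with
/// `remaining == 0`, where the chunk planner would schedule nothing.
fn admit(prefilling: &mut VecDeque<PendingPrompt>, req: PendingPrompt) -> Admission {
    if req.prompt_tokens.is_empty() {
        // The caller is expected to report an error to the client here.
        return Admission::RejectedEmptyPrompt;
    }
    prefilling.push_back(req);
    Admission::Enqueued
}
```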


chatgpt-codex-connector · 2026-06-21T15:27:08Z

+        if end < req.prompt_tokens.len() {
+            still_prefilling.push(PrefillingRequest35 {


Drop cancelled partial prefills before requeueing

For prompts that span multiple chunks, no TokenEvent is sent on non-final chunks, so a client disconnect is only observable through req.token_tx.is_closed(). This branch requeues the request unconditionally, which keeps a cancelled long prompt at the FIFO front and continues spending prefill chunks and reserved capacity until the full prompt is processed; check the sink before requeueing and drop the request state when it is closed.
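
The check suggested here could be sketched as below. `ToySink` is a hypothetical stand-in for the real token channel (the review refers to `req.token_tx.is_closed()`), and `PartialPrefill` and `Outcome` are illustrative types, not the PR's code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the token sink; the real code checks `req.token_tx.is_closed()`.
#[derive(Clone, Default)]
struct ToySink {
    closed: Arc<AtomicBool>,
}

impl ToySink {
    fn close(&self) {
        self.closed.store(true, Ordering::SeqCst);
    }
    fn is_closed(&self) -> bool {
        self.closed.load(Ordering::SeqCst)
    }
}

/// Hypothetical in-flight prefill with a cursor into the prompt.
struct PartialPrefill {
    prompt_len: usize,
    cursor: usize,
    token_tx: ToySink,
}

enum Outcome {
    Promote(PartialPrefill),
    Requeue(PartialPrefill),
    Dropped,
}

/// After a chunk of `chunk_len` tokens has run, decide what happens next.
fn promote_or_requeue(mut req: PartialPrefill, chunk_len: usize) -> Outcome {
    req.cursor = (req.cursor + chunk_len).min(req.prompt_len);
    if req.token_tx.is_closed() {
        // Client went away: `req` is dropped here, releasing its state.
        return Outcome::Dropped;
    }
    if req.cursor < req.prompt_len {
        Outcome::Requeue(req)
    } else {
        Outcome::Promote(req)
    }
}
```

Checking the sink before the requeue branch means a cancelled long prompt stops consuming prefill budget after at most one more chunk.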


xiaguan

Verified locally (RTX 5070 Ti, Qwen3.5-4B): 30 lib tests + e2e_scheduler green at the 1024 default and at budget 4. Greedy output at budget 4 (273-tok prompt sliced into ~68 chunks) is byte-identical to the whole-prompt run, so the cross-step KV/recurrent advance is exact even with the chunkwise GDN kernel. Correctness LGTM.

Two changes before merge:

Drop the OPENINFER_QWEN35_PREFILL_BUDGET env var. Wire the existing --max-prefill-tokens server flag through to Qwen3.5 instead — it's already an arg (used by Qwen3) but currently ignored on this path. A hidden env var for a serving knob isn't the pattern we want, and it contradicts the 'mirrors Qwen3' framing.
#375 asks for an ITL A/B report and the PR doesn't include one. Please add the numbers: mixed-load p50/p99 ITL, chunked vs not, same profile as #368. The ITL win is structurally expected, but it should be measured, not asserted.

…env vars Qwen3.5's scheduler read a hidden OPENINFER_QWEN35_PREFILL_BUDGET env var for the per-step chunked-prefill budget while the server's --max-prefill-tokens flag (already used by Qwen3) was ignored on this path. Thread the flag through launch -> start_with_capacity -> scheduler_loop instead, mirroring Qwen3, and assert the budget is positive. Also replace the bench-only OPENINFER_QWEN3{,5}_PREFILL_BUDGET env vars with a bench_serving --max-prefill-tokens flag, so both serving and benching use the same knob on both model lines (no env var anywhere). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Record the chunking-off mixed-load ITL characterization for both lines (qps x prompt x prefix): Qwen3's unified-step freeze blows up p99 with prompt (8k 1161, 12k 3270ms), while Qwen3.5's same freeze lands just under the 1% p99 knee (shows in max, not p99 -- a measurement artifact of the 1024-token background, not architectural immunity). Regenerate the Qwen3.5 bench snapshot so it carries the mixed_itl profile (canonical default-chunked cell) on par with Qwen3, refreshing the stale March prefill/decode numbers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Bring back the severity/frequency two-knob explanation (the rigorous model whose 1%-knee framing is exactly why Qwen3.5's freeze stays out of p99), and rename the section to 'mixed-load stall, chunking off' — the tables are all chunking-off, so the prior 'chunked-prefill A/B' title overclaimed. The chunked-on result lives in the mixed_itl snapshot cell, not a table here. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A stray 'max_prefill_tokens' token had been pasted into PromptInputArgs (editor autocomplete), breaking the parse and CI's cargo fmt --all --check; remove it. Run cargo fmt (collapses the Qwen35 launch match arm in main.rs). Revert the --max-prefill-tokens help text to the original Qwen3-specific wording per maintainer preference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

xiaguan

LGTM. Clean scheduler-level chunked prefill that mirrors the proven Qwen3 design — the admission accounting (slot_budget = max_batch - prefilling.len() - active.len(), plus prefilling_future_pages reserving each in-flight prefill's KV growth) keeps active.len() + prefilling.len() <= max_batch invariant, so promotion never runs out of a graph slot, and the state-ownership handoff across steps is symmetric. No new kernel, recurrent/KV state advances in place off kv.seq_len().
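
The admission accounting mentioned above can be sketched as plain arithmetic. Only the `slot_budget` formula and the name `prefilling_future_pages` come from the review comment; the page computation, the `pages_for` and `admit_count` helpers, and all parameters are assumptions made for illustration.

```rust
/// Graph slots still available: max_batch - prefilling.len() - active.len().
fn slot_budget(max_batch: usize, prefilling: usize, active: usize) -> usize {
    max_batch.saturating_sub(prefilling + active)
}

/// Pages needed to hold `tokens` tokens (ceiling division; `page_size` > 0).
fn pages_for(tokens: usize, page_size: usize) -> usize {
    (tokens + page_size - 1) / page_size
}

/// Assumed definition: KV pages an in-flight prefill will still need as it
/// grows from `cursor` tokens to its full `prompt_len`.
fn prefilling_future_pages(prompt_len: usize, cursor: usize, page_size: usize) -> usize {
    pages_for(prompt_len, page_size).saturating_sub(pages_for(cursor, page_size))
}

/// How many pending requests (FIFO order, each needing `pending_pages[i]`
/// pages) fit under both the slot limit and the unreserved page limit.
fn admit_count(
    pending_pages: &[usize],
    slots: usize,
    free_pages: usize,
    reserved_pages: usize,
) -> usize {
    let mut pages_left = free_pages.saturating_sub(reserved_pages);
    let mut admitted = 0;
    for &need in pending_pages.iter().take(slots) {
        if need > pages_left {
            break;
        }
        pages_left -= need;
        admitted += 1;
    }
    admitted
}
```

Because admission never hands out more than `slot_budget` slots, `active + prefilling` stays at or below `max_batch` in this sketch, which is the invariant the review points to.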

One non-blocking nit: the --max-prefill-tokens help text in openinfer-server/src/config.rs still scopes the flag to Qwen3 ("one Qwen3 scheduler step", "Echo requests are never split"), but after this PR it also drives the Qwen3.5 chunked-prefill budget. Worth a one-line tweak so the flag's scope reads correctly for both model lines.

Approving.

scatyf3 and others added 4 commits June 19, 2026 16:40

docs(qwen35): trim verbose scheduler comments

49f57d0


docs(qwen35): trim chunked-prefill scheduler comments

462d7f2


chatgpt-codex-connector Bot reviewed Jun 21, 2026


xiaguan self-assigned this Jun 21, 2026

xiaguan requested changes Jun 21, 2026


scatyf3 and others added 4 commits June 21, 2026 18:55

scatyf3 requested a review from xiaguan June 22, 2026 14:43

xiaguan approved these changes Jun 23, 2026

scatyf3 wants to merge 8 commits into openinfer-project:main from scatyf3:feat/qwen35-chunked-prefill
