feat(qwen3): add DFlash speculative decoding by xiaguan · Pull Request #380 · openinfer-project/openinfer

xiaguan · 2026-06-14T04:10:16Z

Summary

Add opt-in Qwen3-4B DFlash speculative decoding via --dflash-draft-model-path.
Add native DFlash draft model loading/forward, verifier-span target execution, speculative KV lifecycle, and greedy acceptance logging with committed_tokens.
Enable DFlash for all eligible active greedy requests when configured; unsupported combinations fail closed or use baseline execution.
Harden multi-active DFlash with byte-budgeted side-state admission, transactional speculative KV rollback, active+pending prefill failure cleanup, and per-request prefill hidden capture.
Split Qwen3 executor responsibilities into lifecycle / worker / DFlash lane / speculative execution modules, and document local/5090/vLLM comparison results.
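
For readers new to the technique, the greedy acceptance step mentioned above can be sketched as follows. This is a minimal illustration of standard greedy speculative verification, not the code in this PR: the `Acceptance` struct and `greedy_accept` function are hypothetical names, and only the `committed_tokens` field name is borrowed from the logging described in the summary. It assumes the target model returns one argmax per verifier-span position, i.e. one more entry than there are draft tokens.

```rust
/// Outcome of verifying one draft span under greedy decoding.
/// Hypothetical type for illustration; not taken from the PR.
#[derive(Debug, PartialEq, Eq)]
pub struct Acceptance {
    /// Number of draft tokens that matched the target argmax.
    pub accepted_draft: usize,
    /// Tokens to commit: the accepted draft prefix plus one target token
    /// (a correction on mismatch, or a bonus token if every draft matched).
    pub committed_tokens: Vec<u32>,
}

/// `draft` holds k proposed tokens. `target_argmax` must hold k + 1 entries,
/// where entry i is the target argmax after the context plus `draft[..i]`.
/// Returns `None` if the lengths are inconsistent.
pub fn greedy_accept(draft: &[u32], target_argmax: &[u32]) -> Option<Acceptance> {
    if target_argmax.len() != draft.len() + 1 {
        return None;
    }
    // Longest prefix on which the draft agrees with the target.
    let accepted_draft = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // The target token at the first disagreement (or past the end of the
    // draft) is always committed, so every step yields at least one token.
    let committed_tokens = target_argmax[..=accepted_draft].to_vec();
    Some(Acceptance { accepted_draft, committed_tokens })
}
```

Under this scheme a rejected suffix is where speculative KV rollback would apply: KV entries written for draft positions past `accepted_draft` have to be discarded before the next step.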

Validation

Commit hooks passed on pushed commits; targeted prek run --files passed for this hardening diff, including fmt and clippy.
cargo fmt --all --check
git diff --check
cargo test --release -p openinfer-kv-cache --test lifecycle speculative -- --nocapture
cargo test --release -p openinfer-qwen3-4b dflash_prefill --lib -- --nocapture
cargo test --release -p openinfer-qwen3-4b admission_ --lib -- --nocapture
cargo test --release -p openinfer-qwen3-4b dflash_ --lib -- --nocapture
cargo test --release -p openinfer-qwen3-4b --lib -- --nocapture
cargo build --release -p openinfer-server
OPENINFER_TEST_MODEL_PATH=/data/models/Qwen3-4B OPENINFER_DFLASH_TEST_MODEL_PATH=/data/models/Qwen3-4B-DFlash-b16 cargo test --release -p openinfer-qwen3-4b --test hf_golden_gate dflash_speculative_verify_matches_hf_argmax_regret_gate -- --nocapture
Toxic review re-check after the multi-active hardening returned Ready for review with no P0/P1/P2 findings.
Local 5070 Ti PR-head vllm bench serve, greedy /v1/completions:
- Spec-Bench c1: 89.72 -> 149.32 tok/s (1.66x)
- Spec-Bench c4: 303.92 -> 330.42 tok/s (1.09x)
- Random 1024/128 c1: 86.61 -> 136.09 tok/s (1.57x)
- Random 1024/128 c4: 270.93 -> 349.50 tok/s (1.29x)
Post-hardening local Spec-Bench c4 smoke completed 12/12 at 368.71 tok/s and logged four concurrent DFlash request IDs in one wave.
5090 real-weight gate passed with /data/Qwen3-4B and /data/Qwen3-4B-DFlash-b16.
5090 reference measurements are documented: OpenInfer Spec-Bench c1 167.34 -> 251.48 tok/s; upstream vLLM 0.22.1 DFlash reaches 289.57 tok/s on the same Spec-Bench with native acceptance metrics.
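
The speedup factors quoted above are plain throughput ratios (DFlash tok/s divided by baseline tok/s, rounded to two decimals). A small sketch for rechecking them, with hypothetical helper names `speedup` and `round2`:

```rust
/// Ratio of DFlash throughput to baseline throughput (both in tok/s).
/// Hypothetical helper for rechecking the reported numbers.
pub fn speedup(baseline_tok_s: f64, dflash_tok_s: f64) -> f64 {
    dflash_tok_s / baseline_tok_s
}

/// Round to two decimal places, matching the precision used in the report.
pub fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For example, `round2(speedup(89.72, 149.32))` reproduces the 1.66x reported for Spec-Bench c1. Applying the same formula to the 5090 Spec-Bench c1 numbers (167.34 -> 251.48 tok/s) gives roughly 1.50x.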

Draft State

Draft PR opened early to avoid branch collision.
Latest code removes the single-active-request DFlash restriction; multi-active greedy requests are now eligible together.
Known performance shape: target verify is batched, but DFlash draft is still per-request serial and not CUDA-graph captured.
The main remaining performance work before this can be called polished is bs=1 draft-path optimization against the 5090 vLLM DFlash reference, followed by replacing the serial draft with graph-captured or batched/fused draft execution.

wjinxu · 2026-06-15T09:47:11Z

Heads-up + I'd like to build on this rather than ship a parallel stack: I've been prototyping n-gram (prompt-lookup) speculative decoding in #349. n-gram is just a CPU draft source that produces the same kind of verify span, so it should be able to reuse your verify + greedy-acceptance + schedule/apply/revert_speculative KV path as a zero-cost second method (no draft model, no draft KV, no hidden capture).

I'll hold the detailed integration until your PR settles so I'm adapting to the final shape, not a moving target.
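
To make the n-gram idea concrete, here is a minimal sketch of a prompt-lookup draft source. It is an illustration of the general technique under stated assumptions, not the prototype in #349: the function name `ngram_draft` and its parameters are hypothetical, and it uses a simple backwards linear scan rather than any indexed lookup.

```rust
/// Propose up to `max_draft` tokens by finding the most recent earlier
/// occurrence of the last `ngram` tokens and copying what followed it.
/// Returns an empty draft when there is no earlier match.
/// Hypothetical helper for illustration only.
pub fn ngram_draft(tokens: &[u32], ngram: usize, max_draft: usize) -> Vec<u32> {
    if ngram == 0 || max_draft == 0 || tokens.len() <= ngram {
        return Vec::new();
    }
    let suffix_start = tokens.len() - ngram;
    let suffix = &tokens[suffix_start..];
    // Scan candidate start positions from most recent to oldest,
    // excluding the suffix's own position.
    for start in (0..suffix_start).rev() {
        if &tokens[start..start + ngram] == suffix {
            let cont = start + ngram;
            let end = (cont + max_draft).min(tokens.len());
            return tokens[cont..end].to_vec();
        }
    }
    Vec::new()
}
```

The returned tokens play the same role as a draft model's proposals: they form a verify span that the target model checks, so an empty or wrong draft costs only the CPU scan.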

xiaguan added 3 commits June 14, 2026 12:09

- 639791c feat(qwen3): add dflash speculative decoding
- 68d5ac4 docs(qwen3): record dflash multi-active benchmarks
- 2ed2d32 fix(qwen3): harden dflash multi-active admission

xiaguan mentioned this pull request Jun 22, 2026

feat(qwen3): DFlash speculative decoding #436

Open

xiaguan wants to merge 3 commits into main from feat/qwen3-dflash-mtp
