feat(qwen3): add DFlash speculative decoding #380

Draft
xiaguan wants to merge 3 commits into main from feat/qwen3-dflash-mtp

xiaguan commented Jun 14, 2026

Collaborator

Summary

  • Add opt-in Qwen3-4B DFlash speculative decoding via --dflash-draft-model-path.
  • Add native DFlash draft model loading/forward, verifier-span target execution, speculative KV lifecycle, and greedy acceptance logging with committed_tokens.
  • Enable DFlash for all eligible active greedy requests when configured; unsupported combinations fail closed or use baseline execution.
  • Harden multi-active DFlash with byte-budgeted side-state admission, transactional speculative KV rollback, active+pending prefill failure cleanup, and per-request prefill hidden capture.
  • Split Qwen3 executor responsibilities into lifecycle / worker / DFlash lane / speculative execution modules, and document local/5090/vLLM comparison results.
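As a rough illustration of the greedy acceptance step mentioned above, the sketch below assumes the common scheme in which the target model scores every position of the verifier span and the longest prefix of draft tokens matching the target's argmax is kept, followed by one token from the target itself. The `Acceptance` type and `greedy_accept` function are hypothetical names for this example, not the PR's actual API; only the `committed_tokens` field name is borrowed from the summary.

```rust
/// Result of verifying one draft span against the target model
/// (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct Acceptance {
    /// Number of draft tokens the target agreed with.
    accepted: usize,
    /// Tokens to append: the accepted draft prefix plus one target token.
    committed_tokens: Vec<u32>,
}

/// `draft` holds k proposed tokens. `target_argmax` holds k + 1 greedy target
/// predictions, where `target_argmax[i]` is the target's token given the
/// committed prefix followed by `draft[..i]`.
fn greedy_accept(draft: &[u32], target_argmax: &[u32]) -> Acceptance {
    assert_eq!(
        target_argmax.len(),
        draft.len() + 1,
        "verify span must cover k + 1 positions"
    );
    // Longest prefix on which draft and target agree.
    let accepted = draft
        .iter()
        .zip(target_argmax)
        .take_while(|(d, t)| d == t)
        .count();
    // Keep the agreed prefix, then the target's own token at the first
    // disagreement (or the bonus token if every draft token matched).
    let mut committed_tokens = draft[..accepted].to_vec();
    committed_tokens.push(target_argmax[accepted]);
    Acceptance { accepted, committed_tokens }
}

fn main() {
    // The draft diverges from the target at index 2.
    let a = greedy_accept(&[5, 9, 2, 7], &[5, 9, 4, 1, 8]);
    println!("{a:?}");
}
```

Under this assumption every verify step commits at least one token, so output matches plain greedy decoding; KV entries written for the rejected tail are what a speculative rollback would need to discard.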

Validation

  • Commit hooks passed on pushed commits; targeted prek run --files passed for this hardening diff, including fmt and clippy.
  • cargo fmt --all --check
  • git diff --check
  • cargo test --release -p openinfer-kv-cache --test lifecycle speculative -- --nocapture
  • cargo test --release -p openinfer-qwen3-4b dflash_prefill --lib -- --nocapture
  • cargo test --release -p openinfer-qwen3-4b admission_ --lib -- --nocapture
  • cargo test --release -p openinfer-qwen3-4b dflash_ --lib -- --nocapture
  • cargo test --release -p openinfer-qwen3-4b --lib -- --nocapture
  • cargo build --release -p openinfer-server
  • OPENINFER_TEST_MODEL_PATH=/data/models/Qwen3-4B OPENINFER_DFLASH_TEST_MODEL_PATH=/data/models/Qwen3-4B-DFlash-b16 cargo test --release -p openinfer-qwen3-4b --test hf_golden_gate dflash_speculative_verify_matches_hf_argmax_regret_gate -- --nocapture
  • Toxic review re-check after the multi-active hardening returned Ready for review with no P0/P1/P2 findings.
  • Local 5070 Ti PR-head vllm bench serve, greedy /v1/completions:
    • Spec-Bench c1: 89.72 -> 149.32 tok/s (1.66x)
    • Spec-Bench c4: 303.92 -> 330.42 tok/s (1.09x)
    • Random 1024/128 c1: 86.61 -> 136.09 tok/s (1.57x)
    • Random 1024/128 c4: 270.93 -> 349.50 tok/s (1.29x)
  • Post-hardening local Spec-Bench c4 smoke completed 12/12 at 368.71 tok/s and logged four concurrent DFlash request IDs in one wave.
  • 5090 real-weight gate passed with /data/Qwen3-4B and /data/Qwen3-4B-DFlash-b16.
  • 5090 reference measurements are documented: OpenInfer Spec-Bench c1 167.34 -> 251.48 tok/s; upstream vLLM 0.22.1 DFlash reaches 289.57 tok/s on the same Spec-Bench with native acceptance metrics.
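The speedup factors quoted above are plain throughput ratios (DFlash tok/s divided by baseline tok/s). A minimal sketch that recomputes them from the listed measurements follows; the `speedup` helper is a hypothetical name for this example. The 5090 Spec-Bench c1 pair has no ratio in the list above; by the same arithmetic it comes out to roughly 1.50x.

```rust
/// Ratio of speculative to baseline throughput (hypothetical helper).
fn speedup(baseline_tok_s: f64, dflash_tok_s: f64) -> f64 {
    dflash_tok_s / baseline_tok_s
}

fn main() {
    // (label, baseline tok/s, DFlash tok/s), copied from the measurements above.
    let runs = [
        ("5070 Ti Spec-Bench c1", 89.72, 149.32),
        ("5070 Ti Spec-Bench c4", 303.92, 330.42),
        ("5070 Ti Random 1024/128 c1", 86.61, 136.09),
        ("5070 Ti Random 1024/128 c4", 270.93, 349.50),
        ("5090 Spec-Bench c1", 167.34, 251.48),
    ];
    for (label, base, spec) in runs {
        println!("{label}: {:.2}x", speedup(base, spec));
    }
}
```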

Draft State

  • Draft PR opened early to avoid branch collision.
  • Latest code removes the single-active-request DFlash restriction; multi-active greedy requests are now eligible together.
  • Known performance shape: target verify is batched, but the DFlash draft still runs serially per request and is not CUDA-graph captured.
  • Remaining performance work before this can be called polished: first optimize the bs=1 draft path against the 5090 vLLM DFlash reference, then replace the serial draft with graph-captured or batched/fused draft execution.

wjinxu commented Jun 15, 2026

Contributor

Heads-up + I'd like to build on this rather than ship a parallel stack: I've been prototyping n-gram (prompt-lookup) speculative decoding in #349. n-gram is just a CPU draft source that produces the same kind of verify span, so it should be able to reuse your verify + greedy-acceptance + schedule/apply/revert_speculative KV path as a zero-cost second method (no draft model, no draft KV, no hidden capture).

I'll hold the detailed integration until your PR settles so I'm adapting to the final shape, not a moving target.
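For readers unfamiliar with prompt-lookup drafting, here is a minimal sketch of the general idea: find the most recent earlier occurrence of the sequence's trailing n-gram and propose the tokens that followed it as the draft span. This is not the code from #349; the function name `ngram_draft` and its parameters are assumptions made for this illustration.

```rust
/// Propose up to `max_draft` tokens by locating the most recent earlier
/// occurrence of the trailing `ngram` tokens and copying what followed it.
/// Returns an empty draft when no earlier match exists.
/// (Hypothetical helper, not the #349 implementation.)
fn ngram_draft(tokens: &[u32], ngram: usize, max_draft: usize) -> Vec<u32> {
    if ngram == 0 || tokens.len() <= ngram {
        return Vec::new();
    }
    let suffix_start = tokens.len() - ngram;
    let suffix = &tokens[suffix_start..];
    // Scan candidate start positions from newest to oldest; the range
    // excludes the suffix's own position, so a match always has at least
    // one token after it.
    for start in (0..suffix_start).rev() {
        if &tokens[start..start + ngram] == suffix {
            let follow = start + ngram;
            let end = (follow + max_draft).min(tokens.len());
            return tokens[follow..end].to_vec();
        }
    }
    Vec::new()
}

fn main() {
    // The trailing bigram [1, 2] appeared at the start; copy what followed it.
    let draft = ngram_draft(&[1, 2, 3, 4, 1, 2], 2, 3);
    println!("{draft:?}");
}
```

Because the draft is a plain token slice, it could in principle be fed to the same verify and greedy-acceptance path as a model-generated draft, with an empty result falling back to baseline decoding for that step.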
