feat(dflash): config-driven drafter converter + spec-decode fixes + rope_parameters Q4 converter#447
Open
dusterbloom wants to merge 3 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
8 issues found across 20 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/scripts/quantize_draft_q8.py">
<violation number="1" location="server/scripts/quantize_draft_q8.py:30">
P3: `is_norm_tensor` is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:191">
P1: Early drafter config sync can copy uninitialized capture layer IDs when `n_target_layers` exists but `target_layer_ids` is missing/invalid.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| std::string dp(cfg_.draft_path); | ||
| if (dp.size() >= 5 && dp.substr(dp.size() - 5) == ".gguf") { | ||
| int n_cap = 0; | ||
| int cap_ids[DFLASH_MAX_CAPTURE_LAYERS]; |
Contributor
There was a problem hiding this comment.
P1: Early drafter config sync can copy uninitialized capture layer IDs when n_target_layers exists but target_layer_ids is missing/invalid.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 191:
<comment>Early drafter config sync can copy uninitialized capture layer IDs when `n_target_layers` exists but `target_layer_ids` is missing/invalid.</comment>
<file context>
@@ -180,6 +180,24 @@ bool Qwen35Backend::init() {
+ std::string dp(cfg_.draft_path);
+ if (dp.size() >= 5 && dp.substr(dp.size() - 5) == ".gguf") {
+ int n_cap = 0;
+ int cap_ids[DFLASH_MAX_CAPTURE_LAYERS];
+ if (read_draft_capture_config(cfg_.draft_path, n_cap, cap_ids,
+ DFLASH_MAX_CAPTURE_LAYERS)) {
</file context>
| # Share arch resolution, tensor-name mapping, and safetensors I/O with | ||
| # the F16 converter so both produce structurally identical GGUF metadata. | ||
| from convert_dflash_to_gguf import ( | ||
| ARCH, load_arch, map_name, is_norm_tensor, |
Contributor
There was a problem hiding this comment.
P3: is_norm_tensor is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/scripts/quantize_draft_q8.py, line 30:
<comment>`is_norm_tensor` is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.</comment>
<file context>
@@ -16,84 +18,20 @@
+# Share arch resolution, tensor-name mapping, and safetensors I/O with
+# the F16 converter so both produce structurally identical GGUF metadata.
+from convert_dflash_to_gguf import (
+ ARCH, load_arch, map_name, is_norm_tensor,
+ load_safetensors_header,
+)
</file context>
Suggested change
| ARCH, load_arch, map_name, is_norm_tensor, | |
| ARCH, load_arch, map_name, |
549ae92 to
166a409
Compare
…ectness fixes DRAFTER CONVERTER (config-driven): - convert_dflash_to_gguf.py reads all architecture params from config.json (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for SWA, sliding_window). No hardcoded constants. - quantize_draft_q8.py shares load_arch with the converter. - GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[], dflash.block_size, attention.sliding_window + pattern. - draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF metadata, block_size override, SWA pattern from metadata. - draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from config.json. - gguf_target_loader.cpp: respect drafter-specified capture layers instead of overwriting with evenly-spaced heuristic. - qwen35_backend.cpp: early-read capture sync + mask token propagation. - internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16. - dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16. SPEC-DECODE PERFORMANCE: - graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx alive + n_tokens matches (centralized guard; was per-call-site). - qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for graph persistence; kFastRollbackThreshold env-tunable (DFLASH_FAST_ROLLBACK_MIN, default 5). - dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB). - graph_builders.cpp: 4MB ctx alloc (was 64MB). - step_graph.h: graph_ctx_len + graph_used_view tracking fields. SPEC-DECODE CORRECTNESS: - qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded 4096 feature ring cap. Default 4096 causes acceptance collapse from 85% to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features). - qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP. - qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax at long context (NaN/Inf → clamp to 0, verify rejects gracefully). MOE EXPERIMENTAL (behind flags): - qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid storage even with 0 cold experts to enable pipelined spec-decode verify. - Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback). - Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards). VALIDATED: - 27B dense + reconverted drafter: 57% accept on code gen, 85% on short prompts. block=16 gives 252 tok/s (2.2x AR) on code generation. - 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR). - Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K. - Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls valid), pFlash cuts 34K prefill from 475s to 208s (2.3x). - repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.
…al_dflash_to_gguf.py) resolve_rope_theta reads rope_parameters.rope_theta -> top-level rope_theta -> sys.exit(1); drops the silent 1M default that baked wrong RoPE for Qwen3.6 drafters whose true 10M lives nested under rope_parameters (missed by the old top-level-only pick(), default-baked 1M -> long-ctx spec-decode acceptance collapse). Q4_K_M output mirrors quantize_draft_q8.py (256 superblock, norm tensors stay F32). 12 pure-python tests cover the 35B-a3b nested case, 27B legacy top-level, gemma4 genuine-1M, and missing-theta fail-loud. Original convert_dflash_to_gguf.py untouched.
…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)
166a409 to
02f2eb6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Wave-2 base. The config-driven drafter converter + spec-decode perf/correctness fixes, plus a new corrected converter.
New in this PR (commit 166a409): server/scripts/convert_modal_dflash_to_gguf.py — resolves rope_theta from the nested rope_parameters scope then top-level (legacy), and FAILS LOUD (sys.exit) on a missing theta instead of silently baking a 1M default. This fixes the root cause of the drafter RoPE mismatch: Qwen3.6 drafters publish their true 10M under rope_parameters, which the old top-level-only pick() missed, baking 1M and collapsing long-ctx spec-decode acceptance (gemma4 is genuinely 1M so the bug was invisible there). Q4_K_M output, 12 pure-python tests. Original convert_dflash_to_gguf.py untouched.
Depends on the pooled-prefill core (#430).