feat(dflash): config-driven drafter converter + spec-decode fixes + rope_parameters Q4 converter by dusterbloom · Pull Request #447 · Luce-Org/lucebox-hub

dusterbloom · 2026-06-24T15:37:08Z

Wave-2 base. The config-driven drafter converter + spec-decode perf/correctness fixes, plus a new corrected converter.

New in this PR (commit 166a409): server/scripts/convert_modal_dflash_to_gguf.py — resolves rope_theta from the nested rope_parameters scope then top-level (legacy), and FAILS LOUD (sys.exit) on a missing theta instead of silently baking a 1M default. This fixes the root cause of the drafter RoPE mismatch: Qwen3.6 drafters publish their true 10M under rope_parameters, which the old top-level-only pick() missed, baking 1M and collapsing long-ctx spec-decode acceptance (gemma4 is genuinely 1M so the bug was invisible there). Q4_K_M output, 12 pure-python tests. Original convert_dflash_to_gguf.py untouched.

Depends on the pooled-prefill core (#430).

cubic-dev-ai

8 issues found across 20 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/scripts/quantize_draft_q8.py">

<violation number="1" location="server/scripts/quantize_draft_q8.py:30">
P3: `is_norm_tensor` is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:191">
P1: Early drafter config sync can copy uninitialized capture layer IDs when `n_target_layers` exists but `target_layer_ids` is missing/invalid.</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

cubic-dev-ai · 2026-06-24T15:49:49Z

+        std::string dp(cfg_.draft_path);
+        if (dp.size() >= 5 && dp.substr(dp.size() - 5) == ".gguf") {
+            int n_cap = 0;
+            int cap_ids[DFLASH_MAX_CAPTURE_LAYERS];


P1: Early drafter config sync can copy uninitialized capture layer IDs when n_target_layers exists but target_layer_ids is missing/invalid.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 191: <comment>Early drafter config sync can copy uninitialized capture layer IDs when `n_target_layers` exists but `target_layer_ids` is missing/invalid.</comment> <file context> @@ -180,6 +180,24 @@ bool Qwen35Backend::init() { + std::string dp(cfg_.draft_path); + if (dp.size() >= 5 && dp.substr(dp.size() - 5) == ".gguf") { + int n_cap = 0; + int cap_ids[DFLASH_MAX_CAPTURE_LAYERS]; + if (read_draft_capture_config(cfg_.draft_path, n_cap, cap_ids, + DFLASH_MAX_CAPTURE_LAYERS)) { </file context>

cubic-dev-ai · 2026-06-24T15:49:49Z

+# Share arch resolution, tensor-name mapping, and safetensors I/O with
+# the F16 converter so both produce structurally identical GGUF metadata.
+from convert_dflash_to_gguf import (
+    ARCH, load_arch, map_name, is_norm_tensor,


P3: is_norm_tensor is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/scripts/quantize_draft_q8.py, line 30: <comment>`is_norm_tensor` is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.</comment> <file context> @@ -16,84 +18,20 @@ +# Share arch resolution, tensor-name mapping, and safetensors I/O with +# the F16 converter so both produce structurally identical GGUF metadata. +from convert_dflash_to_gguf import ( + ARCH, load_arch, map_name, is_norm_tensor, + load_safetensors_header, +) </file context>

Suggested change

ARCH, load_arch, map_name, is_norm_tensor,

ARCH, load_arch, map_name,

…ectness fixes DRAFTER CONVERTER (config-driven): - convert_dflash_to_gguf.py reads all architecture params from config.json (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for SWA, sliding_window). No hardcoded constants. - quantize_draft_q8.py shares load_arch with the converter. - GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[], dflash.block_size, attention.sliding_window + pattern. - draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF metadata, block_size override, SWA pattern from metadata. - draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from config.json. - gguf_target_loader.cpp: respect drafter-specified capture layers instead of overwriting with evenly-spaced heuristic. - qwen35_backend.cpp: early-read capture sync + mask token propagation. - internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16. - dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16. SPEC-DECODE PERFORMANCE: - graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx alive + n_tokens matches (centralized guard; was per-call-site). - qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for graph persistence; kFastRollbackThreshold env-tunable (DFLASH_FAST_ROLLBACK_MIN, default 5). - dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB). - graph_builders.cpp: 4MB ctx alloc (was 64MB). - step_graph.h: graph_ctx_len + graph_used_view tracking fields. SPEC-DECODE CORRECTNESS: - qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded 4096 feature ring cap. Default 4096 causes acceptance collapse from 85% to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features). - qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP. - qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax at long context (NaN/Inf → clamp to 0, verify rejects gracefully). MOE EXPERIMENTAL (behind flags): - qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid storage even with 0 cold experts to enable pipelined spec-decode verify. - Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback). - Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards). VALIDATED: - 27B dense + reconverted drafter: 57% accept on code gen, 85% on short prompts. block=16 gives 252 tok/s (2.2x AR) on code generation. - 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR). - Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K. - Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls valid), pFlash cuts 34K prefill from 475s to 208s (2.3x). - repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.

…al_dflash_to_gguf.py) resolve_rope_theta reads rope_parameters.rope_theta -> top-level rope_theta -> sys.exit(1); drops the silent 1M default that baked wrong RoPE for Qwen3.6 drafters whose true 10M lives nested under rope_parameters (missed by the old top-level-only pick(), default-baked 1M -> long-ctx spec-decode acceptance collapse). Q4_K_M output mirrors quantize_draft_q8.py (256 superblock, norm tensors stay F32). 12 pure-python tests cover the 35B-a3b nested case, 27B legacy top-level, gemma4 genuine-1M, and missing-theta fail-loud. Original convert_dflash_to_gguf.py untouched.

…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)

dusterbloom mentioned this pull request Jun 24, 2026

fix(dflash): drafter rope_theta align + ring=max_ctx long-ctx uncripple dusterbloom/lucebox-hub#4

Open

cubic-dev-ai Bot reviewed Jun 24, 2026

View reviewed changes

dusterbloom force-pushed the pr/dflash-converter-and-fixes branch 2 times, most recently from 549ae92 to 166a409 Compare June 24, 2026 17:14

dusterbloom added 3 commits June 24, 2026 22:24

fix(cubic): address production P1/P2 review findings (bounded loop, O…

02f2eb6

…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)

dusterbloom force-pushed the pr/dflash-converter-and-fixes branch from 166a409 to 02f2eb6 Compare June 24, 2026 20:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(dflash): config-driven drafter converter + spec-decode fixes + rope_parameters Q4 converter#447

feat(dflash): config-driven drafter converter + spec-decode fixes + rope_parameters Q4 converter#447
dusterbloom wants to merge 3 commits into
Luce-Org:mainfrom
dusterbloom:pr/dflash-converter-and-fixes

dusterbloom commented Jun 24, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

cubic-dev-ai Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot Jun 24, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	ARCH, load_arch, map_name, is_norm_tensor,
	ARCH, load_arch, map_name,

Uh oh!

Conversation

dusterbloom commented Jun 24, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dusterbloom commented Jun 24, 2026 •

edited by cubic-dev-ai Bot

Loading

cubic-dev-ai Bot left a comment •

edited

Loading