Skip to content

feat(dflash): config-driven drafter converter + spec-decode fixes + rope_parameters Q4 converter#447

Open
dusterbloom wants to merge 3 commits into
Luce-Org:mainfrom
dusterbloom:pr/dflash-converter-and-fixes
Open

feat(dflash): config-driven drafter converter + spec-decode fixes + rope_parameters Q4 converter#447
dusterbloom wants to merge 3 commits into
Luce-Org:mainfrom
dusterbloom:pr/dflash-converter-and-fixes

Conversation

@dusterbloom

@dusterbloom dusterbloom commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Wave-2 base. The config-driven drafter converter + spec-decode perf/correctness fixes, plus a new corrected converter.

New in this PR (commit 166a409): server/scripts/convert_modal_dflash_to_gguf.py — resolves rope_theta from the nested rope_parameters scope then top-level (legacy), and FAILS LOUD (sys.exit) on a missing theta instead of silently baking a 1M default. This fixes the root cause of the drafter RoPE mismatch: Qwen3.6 drafters publish their true 10M under rope_parameters, which the old top-level-only pick() missed, baking 1M and collapsing long-ctx spec-decode acceptance (gemma4 is genuinely 1M so the bug was invisible there). Q4_K_M output, 12 pure-python tests. Original convert_dflash_to_gguf.py untouched.

Depends on the pooled-prefill core (#430).

Review in cubic

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 issues found across 20 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/scripts/quantize_draft_q8.py">

<violation number="1" location="server/scripts/quantize_draft_q8.py:30">
P3: `is_norm_tensor` is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:191">
P1: Early drafter config sync can copy uninitialized capture layer IDs when `n_target_layers` exists but `target_layer_ids` is missing/invalid.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/qwen35/qwen35_target_graph.cpp Outdated
Comment thread server/src/qwen35moe/qwen35moe_backend.cpp
std::string dp(cfg_.draft_path);
if (dp.size() >= 5 && dp.substr(dp.size() - 5) == ".gguf") {
int n_cap = 0;
int cap_ids[DFLASH_MAX_CAPTURE_LAYERS];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Early drafter config sync can copy uninitialized capture layer IDs when n_target_layers exists but target_layer_ids is missing/invalid.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 191:

<comment>Early drafter config sync can copy uninitialized capture layer IDs when `n_target_layers` exists but `target_layer_ids` is missing/invalid.</comment>

<file context>
@@ -180,6 +180,24 @@ bool Qwen35Backend::init() {
+        std::string dp(cfg_.draft_path);
+        if (dp.size() >= 5 && dp.substr(dp.size() - 5) == ".gguf") {
+            int n_cap = 0;
+            int cap_ids[DFLASH_MAX_CAPTURE_LAYERS];
+            if (read_draft_capture_config(cfg_.draft_path, n_cap, cap_ids,
+                                          DFLASH_MAX_CAPTURE_LAYERS)) {
</file context>

Comment thread server/src/draft/draft_gguf_loader.cpp Outdated
Comment thread server/scripts/convert_modal_dflash_to_gguf.py Outdated
Comment thread server/scripts/convert_modal_dflash_to_gguf.py
Comment thread server/scripts/convert_modal_dflash_to_gguf.py Outdated
# Share arch resolution, tensor-name mapping, and safetensors I/O with
# the F16 converter so both produce structurally identical GGUF metadata.
from convert_dflash_to_gguf import (
ARCH, load_arch, map_name, is_norm_tensor,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: is_norm_tensor is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/scripts/quantize_draft_q8.py, line 30:

<comment>`is_norm_tensor` is imported and then redefined locally, so the import is dead/shadowed and adds avoidable duplication.</comment>

<file context>
@@ -16,84 +18,20 @@
+# Share arch resolution, tensor-name mapping, and safetensors I/O with
+# the F16 converter so both produce structurally identical GGUF metadata.
+from convert_dflash_to_gguf import (
+    ARCH, load_arch, map_name, is_norm_tensor,
+    load_safetensors_header,
+)
</file context>
Suggested change
ARCH, load_arch, map_name, is_norm_tensor,
ARCH, load_arch, map_name,

@dusterbloom dusterbloom force-pushed the pr/dflash-converter-and-fixes branch 2 times, most recently from 549ae92 to 166a409 Compare June 24, 2026 17:14
…ectness fixes

DRAFTER CONVERTER (config-driven):
- convert_dflash_to_gguf.py reads all architecture params from config.json
  (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for
  SWA, sliding_window). No hardcoded constants.
- quantize_draft_q8.py shares load_arch with the converter.
- GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[],
  dflash.block_size, attention.sliding_window + pattern.
- draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF
  metadata, block_size override, SWA pattern from metadata.
- draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from
  config.json.
- gguf_target_loader.cpp: respect drafter-specified capture layers instead
  of overwriting with evenly-spaced heuristic.
- qwen35_backend.cpp: early-read capture sync + mask token propagation.
- internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16.
- dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16.

SPEC-DECODE PERFORMANCE:
- graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx
  alive + n_tokens matches (centralized guard; was per-call-site).
- qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for
  graph persistence; kFastRollbackThreshold env-tunable
  (DFLASH_FAST_ROLLBACK_MIN, default 5).
- dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard
  (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB).
- graph_builders.cpp: 4MB ctx alloc (was 64MB).
- step_graph.h: graph_ctx_len + graph_used_view tracking fields.

SPEC-DECODE CORRECTNESS:
- qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded
  4096 feature ring cap. Default 4096 causes acceptance collapse from 85%
  to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features).
- qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP.
- qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax
  at long context (NaN/Inf → clamp to 0, verify rejects gracefully).

MOE EXPERIMENTAL (behind flags):
- qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid
  storage even with 0 cold experts to enable pipelined spec-decode verify.
- Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in
  hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback).
- Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards).

VALIDATED:
- 27B dense + reconverted drafter: 57% accept on code gen, 85% on short
  prompts. block=16 gives 252 tok/s (2.2x AR) on code generation.
- 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR).
- Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K.
- Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls
  valid), pFlash cuts 34K prefill from 475s to 208s (2.3x).
- repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.
…al_dflash_to_gguf.py)

resolve_rope_theta reads rope_parameters.rope_theta -> top-level rope_theta -> sys.exit(1);
drops the silent 1M default that baked wrong RoPE for Qwen3.6 drafters whose true 10M lives
nested under rope_parameters (missed by the old top-level-only pick(), default-baked 1M ->
long-ctx spec-decode acceptance collapse). Q4_K_M output mirrors quantize_draft_q8.py
(256 superblock, norm tensors stay F32). 12 pure-python tests cover the 35B-a3b nested case,
27B legacy top-level, gemma4 genuine-1M, and missing-theta fail-loud. Original
convert_dflash_to_gguf.py untouched.
…OB, UAF, atoi-UB, re-prefill fallback, pin_range, gguf type guards)
@dusterbloom dusterbloom force-pushed the pr/dflash-converter-and-fixes branch from 166a409 to 02f2eb6 Compare June 24, 2026 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant