fix(qwen35moe): size KV reservation by n_full + cache-type, not n_layer×f16 by dusterbloom · Pull Request #454 · Luce-Org/lucebox-hub

dusterbloom · 2026-06-26T11:17:38Z

Problem

The qwen35moe expert-placement KV reservation (qwen35moe_backend.cpp) counted all n_layer (40) layers at a hardcoded f16 element size. Only the n_full = n_layer / full_attention_interval (10) full-attention layers carry a KV cache; the rest are O(1)-state SSM/DeltaNet layers. The resolved cache element type (e.g. q4_0) is also far smaller than f16. Together these over-reserved KV by ~14× (25 GiB vs 1.76 GiB at 131K ctx), which shrank the expert budget and forced experts cold at deep context. Decode then dropped onto the slow hybrid path, and the cold cliff fired as early as 24–32K.
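For reviewers who want to sanity-check the ~14× figure, here is a rough back-of-the-envelope sketch. It assumes the standard ggml block layouts (f16 at 2 bytes per element, q4_0 at 18 bytes per 32-element block) and uses only the layer counts quoted above. The function names are illustrative and are not taken from the PR.

```cpp
// Layer counts quoted in the PR description.
constexpr int n_layer = 40;
constexpr int n_full  = 10;  // n_layer / full_attention_interval

// Bytes per cached element.
// f16 stores 2 bytes per element; ggml's q4_0 packs 32 elements into 18 bytes.
constexpr double f16_bytes_per_elem  = 2.0;
constexpr double q4_0_bytes_per_elem = 18.0 / 32.0;

// Ratio of the old reservation (n_layer x f16) to the corrected one
// (n_full x q4_0). The per-layer KV width cancels out of the ratio.
constexpr double over_reservation_factor() {
    return (n_layer * f16_bytes_per_elem) / (n_full * q4_0_bytes_per_elem);
}

// Scale an old reservation (in GiB) down by that ratio.
constexpr double corrected_gib(double old_gib) {
    return old_gib / over_reservation_factor();
}
```

The layer count contributes 4× and the element size about 3.56×, so `over_reservation_factor()` comes out near 14.2 and `corrected_gib(25.0)` near 1.76, in line with the numbers reported here.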

Fix

Extract a shared kv_reservation_bytes_per_token() helper (kv_quant.h) that sizes the reservation from n_full and ggml_row_size() of the resolved cache type. Call it from both the qwen35moe placement path (where the bug was) and the qwen35 dense budget (previously a correct but duplicated copy). With one source of truth, this class of bug can't recur.
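The shape of such a helper might look roughly like the sketch below. This is not the code from kv_quant.h: `KvShape`, `CacheType`, `row_size_bytes()` and `kv_bytes_per_token_sketch()` are hypothetical names, and `row_size_bytes()` is a local stand-in for what `ggml_row_size()` computes so the example compiles without ggml. The K and V widths in the usage note are made up for illustration.

```cpp
#include <cstddef>

// Stand-in for the ggml type table; only the two types discussed in this PR.
enum class CacheType { F16, Q4_0 };

struct TypeTraits {
    std::size_t block_elems;  // elements per quantization block
    std::size_t block_bytes;  // bytes per quantization block
};

constexpr TypeTraits traits(CacheType t) {
    return t == CacheType::F16 ? TypeTraits{1, 2} : TypeTraits{32, 18};
}

// Bytes needed for one row of n_elems elements of the given type
// (n_elems is assumed to be a multiple of the block size).
constexpr std::size_t row_size_bytes(CacheType t, std::size_t n_elems) {
    return traits(t).block_bytes * n_elems / traits(t).block_elems;
}

// Hypothetical description of the model's KV layout.
struct KvShape {
    int n_layer;
    int full_attention_interval;
    std::size_t k_width;  // K elements per token per full-attention layer
    std::size_t v_width;  // V elements per token per full-attention layer
};

// Corrected form: only full-attention layers, sized by the resolved cache types.
constexpr std::size_t kv_bytes_per_token_sketch(const KvShape& s,
                                                CacheType k_type,
                                                CacheType v_type) {
    return static_cast<std::size_t>(s.n_layer / s.full_attention_interval) *
           (row_size_bytes(k_type, s.k_width) + row_size_bytes(v_type, s.v_width));
}

// Old form, kept only for comparison: every layer, hardcoded f16.
constexpr std::size_t old_bytes_per_token(const KvShape& s) {
    return static_cast<std::size_t>(s.n_layer) * (s.k_width + s.v_width) * 2;
}
```

With an illustrative `KvShape{40, 4, 1024, 1024}` and q4_0 for both K and V, the corrected form gives 11520 bytes per token against 163840 for the old form. Keeping both callers on one function is the design point: the dense and MoE budgets cannot drift apart again.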

Verification

Unit test (test_kv_quant.cpp T5): pins n_full + cache-type, guards against the old n_layer × f16 form.
E2E on RTX 3090: Qwen3.6-35B-A3B-UD-Q3_K_XL at --max-ctx 131072 with no kvflash now reports kv_cache=0.70 GiB and 10240 hot experts, 0 cold experts (all-hot), where the old reservation forced the cold cliff at 24–32K.

…er×f16

Only n_full = n_layer/full_attention_interval layers carry a KV cache (the rest are O(1)-state SSM/DeltaNet); honoring that plus the resolved q4_0 cache type cuts the placement reservation ~14x (25 -> 1.76 GiB @131k), keeping experts all-hot at deep context instead of forcing the slow hybrid spec path.

Extract a shared kv_reservation_bytes_per_token() helper (one source of truth for qwen35 + qwen35moe) and add a unit test pinning n_full + cache-type vs the old form.

cubic-dev-ai

No issues found across 4 files

cubic-dev-ai Bot reviewed Jun 26, 2026

dusterbloom wants to merge 1 commit into Luce-Org:main from dusterbloom:ship/qwen35moe-kv-reservation
