fix(qwen35moe): size KV reservation by n_full + cache-type, not n_layer×f16 #454

Open
dusterbloom wants to merge 1 commit into Luce-Org:main from dusterbloom:ship/qwen35moe-kv-reservation

@dusterbloom dusterbloom commented Jun 26, 2026

Collaborator

Problem

The qwen35moe expert-placement KV reservation (qwen35moe_backend.cpp) counted all n_layer (40) layers at a hardcoded f16 element size. Only the n_full = n_layer / full_attention_interval (10) full-attention layers carry a KV cache; the rest are O(1)-state SSM/DeltaNet layers. The cache element type (e.g. q4_0) is also far smaller than f16. Together this over-reserved KV by ~14× (25 GiB vs 1.76 GiB at 131K ctx), which shrank the expert budget and forced experts cold at deep context. Decode then dropped onto the slow hybrid path, and the cold cliff fired as early as 24–32K.
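As a rough cross-check of the ~14× figure, the over-reservation can be split into a layer-count factor and an element-size factor. This is a back-of-envelope sketch, not taken from the patch: it assumes both the K and V caches resolve to q4_0, that ggml stores q4_0 as 18-byte blocks of 32 elements (f16 is 2 bytes per element), and that the per-layer K/V widths are the same in the old and new forms so they cancel.

$$
\frac{\text{old}}{\text{new}}
= \frac{n_{\text{layer}}}{n_{\text{full}}} \times \frac{b_{\text{f16}}}{b_{\text{q4\_0}}}
= \frac{40}{10} \times \frac{2}{18/32}
\approx 4 \times 3.56
\approx 14.2
$$

That agrees with the reported sizes: 25 GiB / 1.76 GiB ≈ 14.2.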

Fix

Extract a shared kv_reservation_bytes_per_token() helper (kv_quant.h) that sizes the reservation from n_full and ggml_row_size() of the resolved cache type. Use it in both the qwen35moe placement path (the bug) and the qwen35 dense budget (previously a correct but duplicated copy), giving one source of truth so this bug class can't recur.
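For readers without the diff open, the shape of such a helper might look roughly like the sketch below. It is not the code from kv_quant.h: the `*_sketch` names, the parameter list, and the 1024-element K/V width are hypothetical, and `row_size_sketch()` is a self-contained stand-in that mirrors the arithmetic of `ggml_row_size()` so the example compiles without ggml.

```cpp
#include <cstddef>

// Hypothetical stand-in for a ggml tensor type (not the real ggml_type enum).
struct CacheTypeSketch {
    std::size_t type_size;  // bytes per block
    std::size_t blck_size;  // elements per block
};

constexpr CacheTypeSketch kF16  {2, 1};    // f16: one 2-byte element per block
constexpr CacheTypeSketch kQ4_0 {18, 32};  // q4_0: 32 elements per 18-byte block

// Same arithmetic as ggml_row_size(): bytes needed for a row of ne elements.
constexpr std::size_t row_size_sketch(CacheTypeSketch t, std::size_t ne) {
    return t.type_size * ne / t.blck_size;
}

// New form: only the full-attention layers, at the resolved cache type.
constexpr std::size_t kv_bytes_per_token_sketch(
        int n_layer, int full_attention_interval,
        std::size_t n_embd_k, std::size_t n_embd_v,
        CacheTypeSketch type_k, CacheTypeSketch type_v) {
    const std::size_t n_full =
        static_cast<std::size_t>(n_layer / full_attention_interval);
    return n_full * (row_size_sketch(type_k, n_embd_k) +
                     row_size_sketch(type_v, n_embd_v));
}

// Old form, kept only for comparison: every layer, hardcoded f16.
constexpr std::size_t kv_bytes_per_token_old_sketch(
        int n_layer, std::size_t n_embd_k, std::size_t n_embd_v) {
    return static_cast<std::size_t>(n_layer) *
           (row_size_sketch(kF16, n_embd_k) + row_size_sketch(kF16, n_embd_v));
}

// Illustrative K/V width per layer; an assumption, not a value from the PR.
constexpr std::size_t kWidth = 1024;

static_assert(kv_bytes_per_token_sketch(40, 4, kWidth, kWidth, kQ4_0, kQ4_0) == 11520,
              "10 full-attention layers at q4_0");
static_assert(kv_bytes_per_token_old_sketch(40, kWidth, kWidth) == 163840,
              "40 layers at f16");
```

With these illustrative numbers the old form reserves 163840 bytes per token against 11520 for the new one. The ratio does not depend on the assumed width as long as it is a multiple of the q4_0 block size.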

Verification

  • Unit test (test_kv_quant.cpp T5): pins n_full + cache-type, guards against the old n_layer × f16 form.
  • E2E on RTX 3090: Qwen3.6-35B-A3B-UD-Q3_K_XL at --max-ctx 131072 with no kvflash now reports kv_cache=0.70 GiB and 10240 hot experts, 0 cold experts (all-hot), where the old reservation forced the cold cliff at 24–32K.

…er×f16

Only n_full = n_layer/full_attention_interval layers carry a KV cache (the rest
are O(1)-state SSM/DeltaNet); honoring that plus the resolved q4_0 cache type
cuts the placement reservation ~14x (25 -> 1.76 GiB @131k), keeping experts
all-hot at deep context instead of forcing the slow hybrid spec path. Extract a
shared kv_reservation_bytes_per_token() helper (one source of truth for qwen35 +
qwen35moe) and add a unit test pinning n_full + cache-type vs the old form.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

No issues found across 4 files
