fix(qwen35moe): size KV reservation by n_full + cache-type, not n_layer×f16 #454
Open
dusterbloom wants to merge 1 commit into
…er×f16 Only n_full = n_layer/full_attention_interval layers carry a KV cache (the rest are O(1)-state SSM/DeltaNet); honoring that plus the resolved q4_0 cache type cuts the placement reservation ~14x (25 -> 1.76 GiB @131k), keeping experts all-hot at deep context instead of forcing the slow hybrid spec path. Extract a shared kv_reservation_bytes_per_token() helper (one source of truth for qwen35 + qwen35moe) and add a unit test pinning n_full + cache-type vs the old form.
Problem
The qwen35moe expert-placement KV reservation (qwen35moe_backend.cpp) counted all n_layer (40) layers with a hardcoded f16, but only the n_full = n_layer / full_attention_interval (10) full-attention layers carry a KV cache. The rest are O(1)-state SSM/DeltaNet layers, and the cache element type (e.g. q4_0) is far smaller than f16. That over-reserved KV by ~14× (25 GiB vs 1.76 GiB at 131K ctx), shrinking the expert budget and forcing experts cold at deep context (dropping decode onto the slow hybrid path; the cold cliff fired as early as 24–32K).

Fix
Extract a shared kv_reservation_bytes_per_token() helper (kv_quant.h) using n_full + ggml_row_size(resolved cache type), and use it in both the qwen35moe placement path (the bug) and the qwen35 dense budget (previously a correct but duplicated copy). This gives one source of truth, so the bug class can't recur.

Verification
- test_kv_quant.cpp (T5): pins n_full + cache-type, guards against the old n_layer × f16 form.
- --max-ctx 131072 with no kvflash now reports kv_cache=0.70 GiB and 10240 hot experts, 0 cold experts (all-hot), where the old reservation forced the cold cliff at 24–32K.
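
The reservation arithmetic described in Problem and Fix can be sketched roughly as follows. This is a standalone illustration, not the helper from the PR: the names `CacheType`, `row_size`, `kv_bytes_per_token_sketch` and `kv_bytes_per_token_old` are hypothetical, and the real `kv_reservation_bytes_per_token()` signature is not shown in this description. The per-layer K and V widths (1280 elements each) are an assumption, chosen only because they reproduce the quoted 25 GiB and ~1.76 GiB totals at 131072 tokens. `row_size` mimics the `ggml_row_size` arithmetic (type size × elements / block size) for the two types mentioned here.

```cpp
#include <cstdint>

// Hypothetical stand-in for ggml's type traits; only the two types discussed here.
enum class CacheType { F16, Q4_0 };

// Same arithmetic as ggml_row_size(): type_size * n_elems / block_size.
constexpr uint64_t row_size(CacheType t, uint64_t n_elems) {
    switch (t) {
        case CacheType::F16:  return 2 * n_elems;        // 2 bytes per element
        case CacheType::Q4_0: return 18 * n_elems / 32;  // 18-byte block holds 32 elements
    }
    return 0;
}

// New form (sketch): only the full-attention layers hold a KV cache,
// and rows are sized by the resolved cache type.
constexpr uint64_t kv_bytes_per_token_sketch(uint64_t n_layer, uint64_t full_attention_interval,
                                             uint64_t n_embd_k, uint64_t n_embd_v,
                                             CacheType type_k, CacheType type_v) {
    const uint64_t n_full = n_layer / full_attention_interval;
    return n_full * (row_size(type_k, n_embd_k) + row_size(type_v, n_embd_v));
}

// Old form (sketch): every layer counted, element type hardcoded to f16.
constexpr uint64_t kv_bytes_per_token_old(uint64_t n_layer, uint64_t n_embd_k, uint64_t n_embd_v) {
    return n_layer * (row_size(CacheType::F16, n_embd_k) + row_size(CacheType::F16, n_embd_v));
}

// Assumed dimensions: 40 layers, interval 4 (so n_full = 10), 1280-wide K and V rows.
constexpr uint64_t kOldPerToken = kv_bytes_per_token_old(40, 1280, 1280);
constexpr uint64_t kNewPerToken =
    kv_bytes_per_token_sketch(40, 4, 1280, 1280, CacheType::Q4_0, CacheType::Q4_0);

static_assert(kOldPerToken == 204800, "40 layers * 2560 elements * 2 bytes");
static_assert(kNewPerToken == 14400, "10 layers * 2 rows * 720 bytes");
static_assert(kOldPerToken * 131072 == (25ull << 30), "old form: 25 GiB at 131072 tokens");
```

Under these assumed widths the new form reserves 14400 × 131072 bytes, about 1.76 GiB, and the old-to-new ratio is 204800 / 14400 ≈ 14.2. That ratio factors into 4× from counting only n_full layers and about 3.56× from q4_0 versus f16.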