Fix CUDA MoE router hardcoded to 256 experts #466
Open
slackarea wants to merge 1 commit into
The CUDA MoE router rejects any model whose routed-expert count isn't 256: ds4_gpu_router_select_tensor / _batch_tensor guard on n_expert==256 && scale==1.5, and the three router_select kernels hardcode 256 (logits/probs stride, top-k bound) and the 1.5 scale.

Fix (minimal, zero-regression for the 256 fast path):
- Parametrize the serial router_select_kernel with n_expert and scale.
- Relax both host-wrapper guards to accept 256 or 384; size the bias/logits/probs checks by n_expert instead of 256.
- Dispatch n_expert != 256 to the parametrized serial kernel; 256 stays on the existing fast warp/parallel kernels (unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 28, 2026
The CUDA MoE router rejects any model whose routed-expert count isn't 256: `ds4_gpu_router_select_tensor` / `_batch_tensor` do `if (n_expert != 256u || n_expert_used != 6u || fabsf(scale-1.5f)>1e-6f) return 0;` and the three router_select kernels hardcode 256 (logits/probs stride, top-k bound) and the 1.5f routed-weight scale. On a DeepSeek-V4-Pro GGUF (384 experts) prefill fails with `gpu layer N ffn batch encode failed` (the router-select call returns 0).

Fix (minimal, zero-regression for Flash):
- Parametrize the serial `router_select_kernel` with `n_expert` and `scale`.
- Relax both host-wrapper guards to accept 256 or 384, and replace the `256u*sizeof(float)` bias/logits/probs checks with ones sized by `n_expert`.
- Dispatch `n_expert != 256` to the parametrized serial kernel; 256 stays on the existing fast warp/parallel kernels (unchanged).
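
For reviewers, the relaxed guard and the dispatch rule can be sketched as a small host-side C++ helper. This is an illustration rather than the code in the patch: `RouterPath`, `choose_router_path`, and the two byte-size parameters are hypothetical names, and sending a 256-expert call whose scale is not 1.5 to the serial kernel is an assumption (the description only says the fast kernels are unchanged).

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Which kernel family a router-select call would use. Hypothetical names.
enum class RouterPath { Rejected, Fast256, Serial };

// Hypothetical host-side check modelled on the guard quoted in the description.
// bias_bytes and logits_bytes are the per-token sizes of the bound buffers.
inline RouterPath choose_router_path(std::uint32_t n_expert,
                                     std::uint32_t n_expert_used,
                                     float scale,
                                     std::size_t bias_bytes,
                                     std::size_t logits_bytes) {
    // Only the two expert counts named in the patch are accepted.
    if (n_expert != 256u && n_expert != 384u) return RouterPath::Rejected;
    // Top-k is still fixed at 6.
    if (n_expert_used != 6u) return RouterPath::Rejected;
    // Buffer checks are sized by n_expert rather than a literal 256.
    const std::size_t need = static_cast<std::size_t>(n_expert) * sizeof(float);
    if (bias_bytes < need || logits_bytes < need) return RouterPath::Rejected;
    // Assumption: the fast kernels keep their literal 256 and 1.5f, so
    // everything else goes to the parametrized serial kernel.
    const bool fast_ok = n_expert == 256u && std::fabs(scale - 1.5f) <= 1e-6f;
    return fast_ok ? RouterPath::Fast256 : RouterPath::Serial;
}
```

Under these assumptions a 384-expert call with top-6 routing and 1536-byte buffers resolves to `RouterPath::Serial`, while the same call with 1024-byte buffers is rejected instead of reading past the end of the bias or logits tensor.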
Regression (Linux, CUDA, H200, DeepSeek-V4-Flash q2-imatrix; clean main + this patch only):
- `ds4_test --logprob-vectors`: PASS

Caveat / follow-up: this still assumes `n_expert_used == 6` (the `for j<6` loops + guard), which is fine for Flash/Pro (top-6) but not for other top-k values (e.g. GLM-5.2 top-8). The serial path is correctness-first (1 thread/token); the warp/parallel kernels could be parametrized for speed (shared `sprob[256]` and the 32x8 unroll). Happy to extend if useful.