On Linux + 2x H200 (no NVLink), after fixing the CUDA router 256-expert guard (#466),
DeepSeek-V4-Pro q2 runs via --ssd-streaming (loads ~114GB, all 61 layers, ~2 t/s)
but output is GARBAGE. Two further Pro-specific hardcodes look unexercised:
-
CUDA DSA-indexer top-k hardcoded for 512. The indexer top-k dispatch in ds4_cuda.cu is
a chain of if (top_k == 512u && n_comp <= N); plus __shared__ uint32_t comp_rows[512]
and if (comp_count > 512u) comp_count = 512u truncation in the attention kernels,
indexed_topk_sort_512_asc_kernel, and cub::BlockRadixSort<uint64_t, 512, 16>. Pro
uses top_k=1024 -> matches no branch and its compressed-row selection is truncated to
512 -> corrupted attention. (Flash=512 is fine.)
-
CPU backend rejects Pro's Q8_0 attention layout: make cpu + Pro fails at
prefill layer 1/61 with grouped Q8_0 tensor has an unexpected layout (Q8_0 attn
projections tied to output_group_count).
Not urgent / Pro's documented target is unified-memory Metal. Filing so it's recorded.
Repro on request.
On Linux + 2x H200 (no NVLink), after fixing the CUDA router 256-expert guard (#466),
DeepSeek-V4-Pro q2 runs via --ssd-streaming (loads ~114GB, all 61 layers, ~2 t/s)
but output is GARBAGE. Two further Pro-specific hardcodes look unexercised:
CUDA DSA-indexer top-k hardcoded for 512. The indexer top-k dispatch in ds4_cuda.cu is
a chain of
if (top_k == 512u && n_comp <= N); plus__shared__ uint32_t comp_rows[512]and
if (comp_count > 512u) comp_count = 512utruncation in the attention kernels,indexed_topk_sort_512_asc_kernel, andcub::BlockRadixSort<uint64_t, 512, 16>. Prouses top_k=1024 -> matches no branch and its compressed-row selection is truncated to
512 -> corrupted attention. (Flash=512 is fine.)
CPU backend rejects Pro's Q8_0 attention layout:
make cpu+ Pro fails atprefill layer 1/61withgrouped Q8_0 tensor has an unexpected layout(Q8_0 attnprojections tied to output_group_count).
Not urgent / Pro's documented target is unified-memory Metal. Filing so it's recorded.
Repro on request.