DeepSeek-V4-Pro: CUDA-streaming + CPU paths produce wrong output / errors

On Linux with 2x H200 (no NVLink), after fixing the CUDA router 256-expert guard (#466),
DeepSeek-V4-Pro q2 *runs* via --ssd-streaming (loads ~114GB, all 61 layers, ~2 t/s),
but the output is garbage. Two further Pro-specific hardcoded assumptions look unexercised:

1. CUDA DSA-indexer top-k is hardcoded to 512. The indexer top-k dispatch in ds4_cuda.cu is
   a chain of `if (top_k == 512u && n_comp <= N)` branches. The attention kernels also
   declare `__shared__ uint32_t comp_rows[512]` and truncate with
   `if (comp_count > 512u) comp_count = 512u`, and the same constant is baked into
   `indexed_topk_sort_512_asc_kernel` and `cub::BlockRadixSort<uint64_t, 512, 16>`. Pro
   uses top_k=1024, so it matches no dispatch branch and its compressed-row selection is
   truncated to 512, resulting in corrupted attention. (Flash uses top_k=512, so it is fine.)

2. CPU backend rejects Pro's Q8_0 attention layout: building with `make cpu` and running
   Pro fails at `prefill layer 1/61` with `grouped Q8_0 tensor has an unexpected layout`
   (the Q8_0 attn projections are tied to output_group_count).
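
To make the failure mode in item 1 concrete, here is a minimal host-side sketch of how I read the dispatch and the clamp. It is not the code from ds4_cuda.cu: `kCompiledTopK`, `kLargestN`, `dispatch_matches` and `clamp_comp_count` are hypothetical names, and the value of `kLargestN` is an assumption standing in for the largest `N` in the real branch chain.

```cpp
#include <cstdint>

// Hypothetical model of the behaviour described in item 1; not the real kernels.
constexpr uint32_t kCompiledTopK = 512u;   // the constant baked into the CUDA path
constexpr uint32_t kLargestN     = 8192u;  // ASSUMED largest N in the branch chain

// Models the `if (top_k == 512u && n_comp <= N)` chain: some branch is taken
// only when top_k is exactly 512 and n_comp fits the largest compiled N.
constexpr bool dispatch_matches(uint32_t top_k, uint32_t n_comp) {
    return top_k == kCompiledTopK && n_comp <= kLargestN;
}

// Models `if (comp_count > 512u) comp_count = 512u` in the attention kernels.
constexpr uint32_t clamp_comp_count(uint32_t comp_count) {
    return comp_count > kCompiledTopK ? kCompiledTopK : comp_count;
}

// Flash (top_k = 512): a branch matches and no selected rows are dropped.
static_assert(dispatch_matches(512u, 4096u), "Flash should hit a branch");
static_assert(clamp_comp_count(512u) == 512u, "Flash keeps all 512 rows");

// Pro (top_k = 1024): no branch matches, and half of the rows are dropped.
static_assert(!dispatch_matches(1024u, 4096u), "Pro matches no branch");
static_assert(clamp_comp_count(1024u) == 512u, "Pro is truncated to 512 rows");
```

If this reading is right, a fix would need the 512 constant to become a parameter (or a second 1024 instantiation) in all of the places listed in item 1, not only in the dispatch chain.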

Not urgent, since Pro's documented target is unified-memory Metal. Filing so it's recorded.
Repro available on request.
