DeepSeek-V4-Pro: CUDA-streaming + CPU paths produce wrong output / errors #471

Description

@slackarea

On Linux with 2x H200 (no NVLink), after fixing the CUDA router 256-expert guard (#466),
DeepSeek-V4-Pro q2 runs via --ssd-streaming (loads ~114GB, all 61 layers, ~2 t/s),
but the output is garbage. Two further Pro-specific hardcoded assumptions look like they were never exercised:

  1. CUDA DSA-indexer top-k is hardcoded for 512. The indexer top-k dispatch in ds4_cuda.cu is
    a chain of if (top_k == 512u && n_comp <= N) branches. The same 512 limit appears as
    __shared__ uint32_t comp_rows[512] and the if (comp_count > 512u) comp_count = 512u
    truncation in the attention kernels, in indexed_topk_sort_512_asc_kernel, and in
    cub::BlockRadixSort<uint64_t, 512, 16>. Pro uses top_k=1024, so it matches no dispatch
    branch and its compressed-row selection is truncated to 512, which corrupts attention.
    (Flash uses top_k=512 and is fine.)

  2. CPU backend rejects Pro's Q8_0 attention layout. A make cpu build with Pro fails at
    prefill layer 1/61 with "grouped Q8_0 tensor has an unexpected layout" (the Q8_0 attn
    projections are tied to output_group_count).
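
To make item 1 concrete, here is a minimal host-side C++ sketch of the failure mode, not the actual ds4_cuda.cu code. The function names, the kIndexerCapacity constant, and the n_comp bucket sizes are all hypothetical stand-ins. Only the top_k == 512u condition and the clamp to 512 come from the description above. It shows why top_k=1024 falls through every branch, and how a guard could turn the silent truncation into a hard error.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical name for the fixed capacity described in item 1
// (comp_rows[512], the sort_512 kernel, the 512-thread BlockRadixSort).
constexpr std::uint32_t kIndexerCapacity = 512u;

// Models the dispatch chain "if (top_k == 512u && n_comp <= N)".
// The bucket sizes are ASSUMED for illustration; the real values of N
// are not stated in the issue. Returns the index of the first matching
// branch, or -1 when no branch matches.
inline int select_indexer_variant(std::uint32_t top_k, std::uint32_t n_comp) {
    static constexpr std::uint32_t kBuckets[] = {1024u, 4096u, 16384u};
    for (std::size_t i = 0; i < sizeof(kBuckets) / sizeof(kBuckets[0]); ++i) {
        if (top_k == 512u && n_comp <= kBuckets[i]) {
            return static_cast<int>(i);
        }
    }
    return -1;  // top_k != 512 (e.g. Pro's 1024) always lands here
}

// Models the reported truncation in the attention kernels:
// rows beyond the fixed capacity are silently dropped.
inline std::uint32_t clamp_comp_count(std::uint32_t comp_count) {
    return comp_count > kIndexerCapacity ? kIndexerCapacity : comp_count;
}

// A possible guard: fail loudly instead of running with no matching
// variant. This is a suggestion, not existing project behavior.
inline int require_indexer_variant(std::uint32_t top_k, std::uint32_t n_comp) {
    const int variant = select_indexer_variant(top_k, n_comp);
    if (variant < 0) {
        throw std::runtime_error(
            "DSA indexer: unsupported top_k=" + std::to_string(top_k) +
            " n_comp=" + std::to_string(n_comp));
    }
    return variant;
}
```

With a guard like this, a Pro run (top_k=1024) would stop at dispatch with a clear message rather than producing garbage, while Flash (top_k=512) would be unaffected.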

Not urgent: Pro's documented target is unified-memory Metal. Filing so it's recorded.
Repro available on request.
