Feasibility: a GLM-5.2 (GlmMoeDsa) backend for DS4 — runs the real 744B with correct logits #470

@slackarea

GLM-5.2 (GlmMoeDsaForCausalLM, Zhipu, MIT, ~744B/40B-active) is architecturally in the
DeepSeek-V4 family: MLA + a DSA sparse indexer + MoE with a shared expert + an MTP layer.
So most of DS4's hard machinery applies and a backend is largely re-parametrization +
absorbed-MLA conversion.
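
For readers unfamiliar with the absorbed-MLA idea, here is a minimal NumPy sketch of why the fold is exact. It is not the converter: the tensor names (`w_k`, `w_v`, `w_o`), the toy sizes, and the single-query setup are assumptions for illustration, and it does not model how the folded weights are split across DS4's attn_q_b/output_a/b. The key-side half of kv_b_proj is folded into the query, the value-side half into o_proj, and attention then runs directly on the compressed latent plus the rope tail.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def explicit_mla(q_nope, q_rope, c_kv, k_rope, w_k, w_v, w_o):
    # q_nope: (H, d_nope)  q_rope: (H, d_rope)   one query token, H heads
    # c_kv:   (T, r)       k_rope: (T, d_rope)   compressed KV cache, T positions
    # w_k: (H, r, d_nope)  w_v: (H, r, d_v)      the two halves of kv_b_proj
    # w_o: (H, d_v, hidden)                      o_proj, split per head
    scale = 1.0 / np.sqrt(q_nope.shape[1] + q_rope.shape[1])
    k_nope = np.einsum("tr,hrd->htd", c_kv, w_k)   # decompress keys
    v = np.einsum("tr,hrd->htd", c_kv, w_v)        # decompress values
    scores = np.einsum("hd,htd->ht", q_nope, k_nope) + q_rope @ k_rope.T
    p = softmax(scores * scale)
    heads = np.einsum("ht,htd->hd", p, v)
    return np.einsum("hd,hdo->o", heads, w_o)

def absorbed_mla(q_nope, q_rope, c_kv, k_rope, w_k, w_v, w_o):
    # The softmax scale still uses the original q/k head width.
    scale = 1.0 / np.sqrt(q_nope.shape[1] + q_rope.shape[1])
    q_abs = np.einsum("hd,hrd->hr", q_nope, w_k)   # fold w_k into the query
    w_vo = np.einsum("hrd,hdo->hro", w_v, w_o)     # fold w_v into o_proj
    q_full = np.concatenate([q_abs, q_rope], axis=1)  # (H, r + d_rope)
    k_full = np.concatenate([c_kv, k_rope], axis=1)   # (T, r + d_rope)
    p = softmax((q_full @ k_full.T) * scale)
    latent = p @ c_kv                                  # attend over the latent itself
    return np.einsum("hr,hro->o", latent, w_vo)

# Toy sizes (assumed, not GLM's real dimensions).
H, T, r, d_nope, d_rope, d_v, hidden = 4, 6, 16, 8, 4, 8, 12
rng = np.random.default_rng(0)
args = (
    rng.standard_normal((H, d_nope)), rng.standard_normal((H, d_rope)),
    rng.standard_normal((T, r)), rng.standard_normal((T, d_rope)),
    rng.standard_normal((H, r, d_nope)), rng.standard_normal((H, r, d_v)),
    rng.standard_normal((H, d_v, hidden)),
)
max_diff = np.max(np.abs(explicit_mla(*args) - absorbed_mla(*args)))
print(f"max |explicit - absorbed| = {max_diff:.2e}")
```

In the absorbed form the effective per-head width of the attention dot product is the latent width plus the rope width, which is why the absorbed head dimension differs from the explicit one.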

Status — it works end-to-end on the real model:

  • A standalone converter maps GlmMoeDsa safetensors -> DS4-layout GGUF (absorbed-MLA: folds
    GLM's explicit kv_b_proj/o_proj into DS4's attn_q_b/output_a/b; proven equal to the
    explicit GLM attention to 2e-08).
  • With a GLM-5.2 shape variant + variant-gated engine changes, DS4 reproduces the HF
    transformers GlmMoeDsa logits to corr 0.9994 / max|Δ| 0.035 (quant noise) on a small
    real-architecture model, via --dump-logits vs a transformers reference (a numpy
    DS4-style full forward also matches HF to 7e-06).
  • The real zai-org/GLM-5.2 (1.5 TB) converts to a q2 DS4 GGUF (~252 GB / 775 B params),
    --inspect loads it, the forward produces sane peaked logits, and it generates via the
    decode path. Ran 1-GPU (SSD-streaming) and 2-GPU distributed on 2x H200.
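
The corr / max|Δ| figures above can be reproduced with a comparison along these lines. This is a sketch under assumptions: it takes "corr" to mean the Pearson correlation over one position's logit vector, `compare_logits` is a hypothetical helper name, and the inputs here are synthetic arrays rather than a real --dump-logits file (whose on-disk format is not modelled).

```python
import numpy as np

def compare_logits(ref, test):
    """Return (pearson_corr, max_abs_diff, top1_match) for two logit vectors."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    if ref.shape != test.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {test.shape}")
    corr = float(np.corrcoef(ref, test)[0, 1])
    max_abs = float(np.max(np.abs(ref - test)))
    top1_match = int(ref.argmax()) == int(test.argmax())
    return corr, max_abs, top1_match

# Synthetic stand-ins: a reference vector and a slightly perturbed copy.
rng = np.random.default_rng(0)
ref_logits = rng.standard_normal(1000)
test_logits = ref_logits + 0.01 * rng.standard_normal(1000)

corr, max_abs, top1_match = compare_logits(ref_logits, test_logits)
print(f"corr={corr:.5f}  max|d|={max_abs:.4f}  top1_match={top1_match}")
```

Reporting the top-1 agreement alongside the two scalar metrics is a cheap extra check, since a high correlation alone does not guarantee the same argmax token.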

All engine changes are gated on the GLM variant; the DeepSeek Flash/Pro path is
byte-identical (verified: Flash decode 40 t/s + ds4_test --logprob-vectors OK on the same
build). The changes:

  • n_hc==1 identity bypass (GLM has no mHC; the hc split/sinkhorn/weighted_sum primitives are
    hardcoded for n_hc==4);
  • router: top-k param (GLM top-8) + sigmoid scoring (GLM scoring_func=sigmoid vs DeepSeek
    sqrt-softplus); extends the merged router fix (#466, "Fix CUDA MoE router hardcoded to
    256 experts");
  • MLA: skip the per-head q RMSNorm GLM lacks; kv-norm over kv_lora only (k_rope tail raw);
    absorbed-MLA dims (head_dim = kv_lora+rope = 576, not 512);
  • a Q2_K MoE expert dispatch + 2 kernels (so a q2 GLM fits 2x H200; DS4 previously only had
    Q4_K and IQ2_XXS+Q2_K expert combos);
  • bumping several DS4_MAX_* (LAYER 61->96, VOCAB, HEAD_DIM, OUT_GROUP, LORA_Q,
    EXPERT_USED, FF_EXP) — these Flash-sized maxes overflowed on GLM (heap corruption ->
    double-free at model_close).
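
Two of the changes above are easy to state in NumPy. The sketch below is illustrative only, not the engine code: the function names are hypothetical, the renormalisation of the selected weights and the `eps` value are assumptions, and any selection bias or routed scaling factor the real router may apply is left out.

```python
import numpy as np

def route_topk(logits, top_k=8, scoring="sigmoid"):
    """Score experts, keep the top_k, and renormalise their weights (assumed)."""
    logits = np.asarray(logits, dtype=np.float64)
    if scoring == "sigmoid":
        scores = 1.0 / (1.0 + np.exp(-logits))
    elif scoring == "sqrt_softplus":
        scores = np.sqrt(np.logaddexp(0.0, logits))  # softplus(x) = log(1 + e^x)
    else:
        raise ValueError(f"unknown scoring function: {scoring}")
    idx = np.argsort(-scores)[:top_k]                # highest scores first
    weights = scores[idx] / scores[idx].sum()
    return idx, weights

def kv_norm_lora_only(kv, weight, kv_lora, eps=1e-6):
    """RMSNorm the first kv_lora channels; pass the rope tail through unchanged."""
    lora, rope = kv[..., :kv_lora], kv[..., kv_lora:]
    rms = np.sqrt(np.mean(lora ** 2, axis=-1, keepdims=True) + eps)
    return np.concatenate([lora / rms * weight, rope], axis=-1)

rng = np.random.default_rng(0)
router_logits = rng.standard_normal(64)              # toy expert count
idx_sig, w_sig = route_topk(router_logits, top_k=8, scoring="sigmoid")
idx_ssp, w_ssp = route_topk(router_logits, top_k=8, scoring="sqrt_softplus")

kv = rng.standard_normal((3, 20))                    # toy: kv_lora=16, rope=4
normed = kv_norm_lora_only(kv, np.ones(16), kv_lora=16)
```

Both scoring functions are strictly increasing, so without a selection bias they pick the same experts and differ only in the mixing weights. In the norm, the last `kv.shape[-1] - kv_lora` channels come out exactly as they went in.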

Recurring theme (also seen with Pro): the CUDA/loader/distributed paths bake in Flash
constants. Two further finds while running GLM on 2 GPUs: (1) the default single-GPU
startup weight-cache budget is 96 GiB, so DS4_CUDA_WEIGHT_CACHE_LIMIT_GB has to be set for
the weights to go resident; (2) the distributed coordinator reports the Flash default
variant after loading a non-Flash model (g_ds4_shape reverts to Flash between load and the
handshake -> model id mismatch), which blocks fast 2-GPU full residency. Happy to write
these up separately.

Would a GLM-5.2 backend be of interest? If so I can clean up the converter + scaffold and the
remaining bits (native GLM tokenizer/template; the coordinator fix). A full writeup, the numpy
equivalence/reference harness, and repro steps are available.
