Feasibility: a GLM-5.2 (GlmMoeDsa) backend for DS4 — runs the real 744B with correct logits #470

GLM-5.2 (GlmMoeDsaForCausalLM, Zhipu, MIT, ~744B/40B-active) is architecturally in the
DeepSeek-V4 family: MLA + a DSA sparse indexer + MoE with a shared expert + an MTP layer.
So most of DS4's hard machinery applies, and a backend is largely a re-parametrization plus
an absorbed-MLA weight conversion.

**Status — it works end-to-end on the real model:**
- A standalone converter maps GlmMoeDsa safetensors -> DS4-layout GGUF (absorbed-MLA: folds
  GLM's explicit kv_b_proj/o_proj into DS4's attn_q_b/output_a/b; numerically checked against
  the explicit GLM attention, agreeing to within 2e-08).
- With a GLM-5.2 shape variant + variant-gated engine changes, **DS4 reproduces the HF
  transformers GlmMoeDsa logits to corr 0.9994 / max|Δ| 0.035** on a small real-architecture
  model, comparing `--dump-logits` output against a transformers reference. The residual is
  attributed to quantization noise; a numpy DS4-style full forward also matches HF to 7e-06.
- The **real zai-org/GLM-5.2 (1.5 TB)** converts to a q2 DS4 GGUF (~252 GB / 775 B params),
  `--inspect` loads it, the forward produces sane peaked logits, and it **generates** via the
  decode path. Ran 1-GPU (SSD-streaming) and 2-GPU distributed on 2x H200.
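
For readers unfamiliar with the absorbed-MLA fold mentioned in the first bullet, here is a
minimal numpy sketch of the identity it relies on. It is not the converter's code: all
names (`w_uk`, `w_uv`, `w_o`, `explicit_mla`, `absorbed_mla`) and the toy dimensions are
assumptions for illustration, RoPE is assumed to be already applied to `q_rope`/`k_rope`,
and how the folded matrices map onto DS4's attn_q_b/output_a/b tensors is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def explicit_mla(q_nope, q_rope, c, k_rope, w_uk, w_uv, w_o):
    """Explicit form: expand the latent c into per-head K and V first.
    q_nope (H, nope), q_rope (H, rope), c (T, L), k_rope (T, rope),
    w_uk (H, nope, L), w_uv (H, v, L), w_o (d, H*v)."""
    scale = 1.0 / np.sqrt(q_nope.shape[1] + q_rope.shape[1])
    k_nope = np.einsum("hnl,tl->thn", w_uk, c)            # (T, H, nope)
    v = np.einsum("hvl,tl->thv", w_uv, c)                 # (T, H, v)
    scores = np.einsum("hn,thn->ht", q_nope, k_nope) + q_rope @ k_rope.T
    p = softmax(scores * scale, axis=-1)                  # (H, T)
    heads = np.einsum("ht,thv->hv", p, v)                 # (H, v)
    return w_o @ heads.reshape(-1)                        # (d,)

def absorbed_mla(q_nope, q_rope, c, k_rope, w_uk, w_uv, w_o):
    """Absorbed form: attend over the latent directly.
    The K up-projection is folded into the query, the V up-projection
    into the output projection; head_dim becomes L + rope."""
    H, v_dim = q_nope.shape[0], w_uv.shape[1]
    # The softmax scale still uses the original query width (nope + rope).
    scale = 1.0 / np.sqrt(q_nope.shape[1] + q_rope.shape[1])
    q_abs = np.einsum("hn,hnl->hl", q_nope, w_uk)         # (H, L)
    q_full = np.concatenate([q_abs, q_rope], axis=1)      # (H, L + rope)
    k_full = np.concatenate([c, k_rope], axis=1)          # (T, L + rope)
    p = softmax(q_full @ k_full.T * scale, axis=-1)       # (H, T)
    ctx = p @ c                                           # (H, L)
    w_o_h = w_o.reshape(w_o.shape[0], H, v_dim)           # (d, H, v)
    w_out = np.einsum("dhv,hvl->hdl", w_o_h, w_uv)        # folded output, (H, d, L)
    return np.einsum("hdl,hl->d", w_out, ctx)             # (d,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, nope, rope, L, v, d, T = 4, 8, 4, 16, 8, 32, 5     # toy sizes, not GLM's
    args = (rng.standard_normal((H, nope)), rng.standard_normal((H, rope)),
            rng.standard_normal((T, L)), rng.standard_normal((T, rope)),
            rng.standard_normal((H, nope, L)), rng.standard_normal((H, v, L)),
            rng.standard_normal((d, H * v)))
    diff = np.max(np.abs(explicit_mla(*args) - absorbed_mla(*args)))
    print("max abs difference:", diff)
```

In the absorbed form the keys are just the latent concatenated with the rope tail, which is
why the effective head_dim is kv_lora + rope rather than kv_lora alone.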

All engine changes are gated on the GLM variant; the DeepSeek Flash/Pro path is
byte-identical (verified: Flash decode 40 t/s + `ds4_test --logprob-vectors` OK on the same
build). The changes:
- n_hc==1 identity bypass (GLM has no mHC; the hc split/sinkhorn/weighted_sum primitives are
  hardcoded for n_hc==4);
- router: top-k param (GLM top-8) + sigmoid scoring (GLM scoring_func=sigmoid vs DeepSeek
  sqrt-softplus) — extends the merged router fix (#466);
- MLA: skip the per-head q RMSNorm GLM lacks; kv-norm over kv_lora only (k_rope tail raw);
  absorbed-MLA dims (head_dim = kv_lora+rope = 576, not 512);
- a Q2_K MoE expert dispatch + 2 kernels (so a q2 GLM fits 2x H200; DS4 previously only had
  Q4_K and IQ2_XXS+Q2_K expert combos);
- bumping several `DS4_MAX_*` (LAYER 61->96, VOCAB, HEAD_DIM, OUT_GROUP, LORA_Q,
  EXPERT_USED, FF_EXP) — these Flash-sized maxes overflowed on GLM (heap corruption ->
  double-free at model_close).
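
To make the router bullet concrete, here is a simplified numpy sketch of the two scoring
functions with a configurable top-k. It is an illustration under stated assumptions, not
DS4's kernel: the function name `route` is hypothetical, the selected weights are simply
renormalized to sum to 1, and any selection bias, expert grouping, or routed scaling factor
the real models may apply is deliberately left out.

```python
import numpy as np

def route(logits, top_k, scoring):
    """Pick top_k experts from router logits and return (indices, weights).

    scoring="sigmoid"        -> sigmoid(logit)          (GLM-style, per the issue)
    scoring="sqrt_softplus"  -> sqrt(softplus(logit))   (DeepSeek-style, per the issue)
    Weight renormalization is an assumption of this sketch.
    """
    logits = np.asarray(logits, dtype=np.float64)
    if scoring == "sigmoid":
        scores = 1.0 / (1.0 + np.exp(-logits))
    elif scoring == "sqrt_softplus":
        # logaddexp(0, x) is a numerically stable softplus.
        scores = np.sqrt(np.logaddexp(0.0, logits))
    else:
        raise ValueError(f"unknown scoring function: {scoring}")
    idx = np.argsort(-scores, kind="stable")[:top_k]   # highest scores first
    weights = scores[idx]
    return idx, weights / weights.sum()

if __name__ == "__main__":
    logits = np.linspace(-3.0, 3.0, 16)                # 16 toy experts
    for name in ("sigmoid", "sqrt_softplus"):
        idx, w = route(logits, top_k=4, scoring=name)
        print(name, idx, np.round(w, 4))
```

Both scoring functions are strictly increasing in the logit, so in this bias-free sketch they
select the same experts and differ only in the mixing weights. That is the part a
variant-gated scoring switch has to get right.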

Recurring theme (also seen with Pro): the CUDA/loader/distributed paths bake in Flash
constants. Two further findings while running GLM on 2 GPUs:
- the default single-GPU startup weight-cache budget is 96 GiB, so going resident requires
  setting `DS4_CUDA_WEIGHT_CACHE_LIMIT_GB`;
- the distributed coordinator reports the default Flash variant after loading a non-Flash
  model (`g_ds4_shape` reverts to Flash between load and the handshake -> `model id
  mismatch`), which blocks fast 2-GPU full residency.

Happy to write these up separately.

Would a GLM-5.2 backend be of interest? If so, I can clean up the converter + scaffold and the
remaining bits (native GLM tokenizer/template; the coordinator fix). A full writeup, the numpy
equivalence/reference harness, and repro steps are available.
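
For context on the agreement numbers quoted in the status (corr and max|Δ|), here is a
minimal numpy sketch of how two logit vectors can be compared. It is not the actual harness:
`compare_logits` is a hypothetical helper, and reading "corr" as the Pearson correlation is
an assumption.

```python
import numpy as np

def compare_logits(a, b):
    """Return (pearson_corr, max_abs_diff, same_top1) for two logit vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    corr = float(np.corrcoef(a, b)[0, 1])          # Pearson correlation (assumed metric)
    max_abs = float(np.max(np.abs(a - b)))         # max|Δ|
    same_top1 = bool(np.argmax(a) == np.argmax(b)) # do both agree on the greedy token?
    return corr, max_abs, same_top1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.standard_normal(1000)                       # stand-in for reference logits
    candidate = reference + 0.01 * rng.standard_normal(1000)    # stand-in for engine logits
    print(compare_logits(reference, candidate))
```

Correlation alone can hide a constant offset or scale error, so reporting it together with
max|Δ| (and optionally top-1 agreement) gives a more complete picture.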
