GLM-5.2 (GlmMoeDsaForCausalLM, Zhipu, MIT, ~744B/40B-active) is architecturally in the
DeepSeek-V4 family: MLA + a DSA sparse indexer + MoE with a shared expert + an MTP layer.
So most of DS4's hard machinery applies and a backend is largely re-parametrization +
absorbed-MLA conversion.
Status — it works end-to-end on the real model:
- A standalone converter maps GlmMoeDsa safetensors -> DS4-layout GGUF (absorbed-MLA: folds
  GLM's explicit kv_b_proj/o_proj into DS4's attn_q_b/output_a/b; proven equal to the
  explicit GLM attention to within 2e-08).
- With a GLM-5.2 shape variant + variant-gated engine changes, DS4 reproduces the HF
  transformers GlmMoeDsa logits to corr 0.9994 / max|Δ| 0.035 (quant noise) on a small
  real-architecture model, via `--dump-logits` vs a transformers reference (a numpy
  DS4-style full forward also matches HF to within 7e-06).
- The real zai-org/GLM-5.2 (1.5 TB) converts to a q2 DS4 GGUF (~252 GB / 775B params),
  `--inspect` loads it, the forward produces sane peaked logits, and it generates via the
  decode path. Ran 1-GPU (SSD-streaming) and 2-GPU distributed on 2x H200.
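For readers unfamiliar with absorbed MLA, the equivalence claimed above can be illustrated with a minimal numpy sketch. It is not the converter or the equivalence harness: the names `w_uk`, `w_uv`, `w_o`, `fold_output`, and `compare` are hypothetical, the split of kv_b_proj into a K part and a V part is the usual MLA assumption, and the toy dimensions are arbitrary. It does not reproduce DS4's attn_q_b/output_a/b tensor layout; it only shows that folding the K part into the query and the V part into the output projection leaves the attention output unchanged, measured with the same kind of corr / max|Δ| metrics.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def explicit_mla(q_nope, q_rope, c_kv, k_rope, w_uk, w_uv, w_o, scale):
    """Reference path: expand the latent into per-head K and V, then attend."""
    k_nope = np.einsum("tr,hnr->htn", c_kv, w_uk)      # (H, T, d_nope)
    v = np.einsum("tr,hvr->htv", c_kv, w_uv)           # (H, T, d_v)
    scores = np.einsum("hn,htn->ht", q_nope, k_nope) + q_rope @ k_rope.T
    p = softmax(scores * scale)                        # (H, T)
    heads = np.einsum("ht,htv->hv", p, v)              # (H, d_v)
    return w_o @ heads.reshape(-1)                     # (d_model,)

def fold_output(w_uv, w_o):
    """Fold the V up-projection into the output projection, per head."""
    n_heads, d_v, _ = w_uv.shape
    w_o_heads = w_o.reshape(w_o.shape[0], n_heads, d_v)   # (d_model, H, d_v)
    return np.einsum("mhv,hvr->hmr", w_o_heads, w_uv)     # (H, d_model, r)

def absorbed_mla(q_nope, q_rope, c_kv, k_rope, w_uk, w_ov, scale):
    """Absorbed path: attend directly over the shared latent + rope cache."""
    q_lat = np.einsum("hn,hnr->hr", q_nope, w_uk)      # K up-projection folded into q
    q_full = np.concatenate([q_lat, q_rope], axis=1)   # (H, r + d_rope)
    keys = np.concatenate([c_kv, k_rope], axis=1)      # (T, r + d_rope), shared by heads
    p = softmax(q_full @ keys.T * scale)
    ctx = p @ c_kv                                     # (H, r)
    return np.einsum("hr,hmr->m", ctx, w_ov)           # (d_model,)

def compare(a, b):
    """Pearson correlation and max absolute difference of two tensors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(np.corrcoef(a, b)[0, 1]), float(np.max(np.abs(a - b)))

# Toy sizes (assumptions for illustration only, not GLM-5.2's real dimensions).
rng = np.random.default_rng(0)
H, d_nope, d_rope, d_v, r, d_model, T = 4, 16, 8, 16, 32, 64, 10
q_nope = rng.standard_normal((H, d_nope))
q_rope = rng.standard_normal((H, d_rope))
c_kv = rng.standard_normal((T, r))
k_rope = rng.standard_normal((T, d_rope))
w_uk = 0.1 * rng.standard_normal((H, d_nope, r))
w_uv = 0.1 * rng.standard_normal((H, d_v, r))
w_o = 0.1 * rng.standard_normal((d_model, H * d_v))
scale = 1.0 / np.sqrt(d_nope + d_rope)

w_ov = fold_output(w_uv, w_o)
y_ref = explicit_mla(q_nope, q_rope, c_kv, k_rope, w_uk, w_uv, w_o, scale)
y_abs = absorbed_mla(q_nope, q_rope, c_kv, k_rope, w_uk, w_ov, scale)
corr, max_abs = compare(y_ref, y_abs)
print(f"corr={corr:.6f}  max|delta|={max_abs:.2e}")
```

In float64 the two paths agree to rounding error. In the absorbed form every head attends over a single shared cache row of width r + d_rope, which is why the absorbed head_dim is kv_lora + rope rather than the explicit per-head width.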
All engine changes are gated on the GLM variant; the DeepSeek Flash/Pro path is
byte-identical (verified: Flash decode 40 t/s + `ds4_test --logprob-vectors` OK on the same
build). The changes:
- n_hc==1 identity bypass (GLM has no mHC; the hc split/sinkhorn/weighted_sum primitives are
  hardcoded for n_hc==4);
- router (... sqrt-softplus): extends the merged router fix (Fix CUDA MoE router hardcoded
  to 256 experts, #466);
- MLA: skip the per-head q RMSNorm GLM lacks; kv-norm over kv_lora only (k_rope tail raw);
  absorbed-MLA dims (head_dim = kv_lora+rope = 576, not 512);
- a Q2_K MoE expert dispatch + 2 kernels (so a q2 GLM fits 2x H200; DS4 previously only had
  Q4_K and IQ2_XXS+Q2_K expert combos);
- bumping several `DS4_MAX_*` limits (LAYER 61->96, VOCAB, HEAD_DIM, OUT_GROUP, LORA_Q,
  EXPERT_USED, FF_EXP): these Flash-sized maxima overflowed on GLM (heap corruption ->
  double-free at model_close).
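The MLA item above ("kv-norm over kv_lora only, k_rope tail raw") can be sketched in numpy as follows. This is an illustration under stated assumptions, not the engine code: the 512/64 split is assumed from head_dim = kv_lora+rope = 576, the epsilon and the function names `rms_norm` and `kv_norm_lora_only` are hypothetical, and the norm weight is set to ones.

```python
import numpy as np

KV_LORA = 512   # assumed latent width (kv_lora)
ROPE = 64       # assumed rope width; KV_LORA + ROPE = 576 as stated in the list above

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over the last axis. eps is an assumed value."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def kv_norm_lora_only(kv, weight, eps=1e-6):
    """Normalize only the first KV_LORA channels; pass the rope tail through raw."""
    latent, k_rope = kv[..., :KV_LORA], kv[..., KV_LORA:]
    return np.concatenate([rms_norm(latent, weight, eps), k_rope], axis=-1)

rng = np.random.default_rng(0)
kv = 3.0 * rng.standard_normal((5, KV_LORA + ROPE))   # 5 tokens, 576 channels each
out = kv_norm_lora_only(kv, np.ones(KV_LORA))

latent_rms = np.sqrt(np.mean(out[:, :KV_LORA] ** 2, axis=-1))
tail_untouched = np.array_equal(out[:, KV_LORA:], kv[:, KV_LORA:])
print(out.shape, tail_untouched)
```

Normalizing all 576 channels instead would rescale the rope tail and change every attention score, so a mismatch here shows up immediately in a logit comparison against the reference.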
Recurring theme (also seen with Pro): the CUDA/loader/distributed paths bake in Flash
constants. Two further findings while running GLM on 2 GPUs: the default single-GPU startup
weight-cache budget is 96 GiB (`DS4_CUDA_WEIGHT_CACHE_LIMIT_GB` must be set for the weights
to go resident), and the distributed coordinator reports the Flash default variant after
loading a non-Flash model (`g_ds4_shape` reverts to Flash between load and the handshake ->
`model id mismatch`), which blocks fast 2-GPU full residency. Happy to write these up
separately.
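A rough back-of-the-envelope check of the memory figures, as a sketch rather than a measurement. Assumptions: "~252 GB" is read as decimal gigabytes, each H200 is taken at its nominal 141 GB of HBM with no runtime overhead subtracted, and the weights are split evenly across the two GPUs. None of these are confirmed by the engine.

```python
GB = 10**9     # decimal gigabyte (assumed unit for the "~252 GB" figure)
GIB = 2**30    # binary gibibyte (unit of the 96 GiB default budget)

model_bytes = 252 * GB        # q2 DS4 GGUF size from the status list
params = 775e9                # parameter count from the status list
default_budget = 96 * GIB     # default single-GPU startup weight-cache budget
h200_bytes = 141 * GB         # nominal HBM per H200 (assumption: no overhead)

bits_per_param = model_bytes * 8 / params
per_gpu_share = model_bytes / 2           # assumption: even 2-way split

fits_default_budget = model_bytes <= default_budget
fits_one_gpu = model_bytes <= h200_bytes
fits_two_gpus = model_bytes <= 2 * h200_bytes
share_exceeds_default = per_gpu_share > default_budget
share_fits_gpu = per_gpu_share <= h200_bytes

print(f"effective bits/param: {bits_per_param:.2f}")
print(f"model size: {model_bytes / GIB:.1f} GiB, per-GPU share: {per_gpu_share / GIB:.1f} GiB")
print(f"fits 96 GiB budget: {fits_default_budget}, one H200: {fits_one_gpu}, two H200: {fits_two_gpus}")
print(f"per-GPU share exceeds default budget: {share_exceeds_default}, fits one H200: {share_fits_gpu}")
```

Under these assumptions the per-GPU share is larger than the 96 GiB default but smaller than one H200, which is consistent with the weight-cache limit having to be changed before the weights can go fully resident on 2 GPUs.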
Would a GLM-5.2 backend be of interest? If so I can clean up the converter + scaffold and the
remaining bits (native GLM tokenizer/template; the coordinator fix). Full writeup, the numpy
equivalence/reference harness, and repro steps available.