GLM-5.2 (GlmMoeDsa, ~744B) runs on DS4 across 2x H200 — report + reusable findings #473

@slackarea

GLM-5.2 (Zhipu, MIT — GlmMoeDsaForCausalLM, DeepSeek-V3.2-family: MLA + DSA indexer + MoE(+shared) + MTP) runs on DS4: loads, computes correct logits (corr 0.9994 vs HF transformers GlmMoeDsa on a small real-arch model), generates, runs fully resident on 2× H200 NVL (q2 ~252 GB, no SSD streaming, ~16 t/s), and chats with a native GLM tokenizer in the GGUF.
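
For anyone repeating the logit check, here is a minimal sketch of the comparison. It assumes the 0.9994 figure is a Pearson correlation over flattened logit vectors (the report doesn't say which measure was used). `pearson` and the sample values are illustrative, not part of DS4 or transformers.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences of floats.

    Assumption: both inputs are logits for the same prompt positions,
    flattened in the same order (one from DS4, one from the HF reference).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)

# Illustrative values only; real inputs would be vocab-sized logit rows.
engine_logits = [1.02, -0.48, 3.11, 0.27]
reference_logits = [1.00, -0.50, 3.10, 0.25]
print(f"corr = {pearson(engine_logits, reference_logits):.4f}")
```

Correlation is insensitive to a constant offset or scale, so a max-absolute-difference or top-1 agreement check alongside it would catch errors that correlation alone can miss.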

All engine changes are gated on the GLM variant (n_hc==1) / on tokenizer-token presence, so the DeepSeek Flash/Pro paths stay byte-identical — verified: Flash 40 t/s + --logprob-vectors OK on the same build. DS4's joyai-llm byte-level BPE reproduces GLM's tokenization exactly (--dump-tokens == HF AutoTokenizer, 6/6 diverse strings).

Substance (full diff + converter + scripts + repro, experimental reference branch on my fork): https://github.com/vcnngr/ds4/tree/glm-5.2-backend — see GLM52_MAINTAINER_REPORT.md / MAINTAINER_NOTES.md.

Two model-independent findings worth upstreaming on their own:

  1. The q8→f16 dequant-cache reserve is a flat 512 MiB on ≥112 GiB cards, so the cache fills HBM and starves the session/prefill graph → OOM after a successful load (the weights themselves fit). DS4_CUDA_WEIGHT_CACHE_LIMIT_GB doesn't bound it; an MTP model disables the cache and hides it. → PR #472, "CUDA: scale q8->f16 cache reserve on >=112 GiB cards (fixes session OOM on large models)", the CUDA twin of #446, "Fix ROCm Q8->F16 cache reserve starving session tensors on large models (q4q2)".
  2. Distributed model id mismatch: a coordinator started without -m silently loads the default ds4flash.gguf, then rejects workers with model id mismatch. Detection is correct; the message just doesn't hint at the cause — a one-line message fix saves a long debug.
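
The arithmetic behind finding 1 can be shown with a toy model. This is not DS4's actual allocation logic: the function names, the 14 GiB free figure, the 4 GiB session requirement, and the 50% scaled reserve are all hypothetical. Only the flat 512 MiB reserve comes from the report.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def cache_budget(free_after_weights: int, reserve: int) -> int:
    """Bytes a dequant cache may claim: free memory minus the reserve."""
    return max(0, free_after_weights - reserve)

def headroom_after_cache(free_after_weights: int, reserve: int) -> int:
    """Bytes left for session/prefill tensors once the cache is full."""
    return free_after_weights - cache_budget(free_after_weights, reserve)

free = 14 * GIB          # hypothetical: free HBM after weights are resident
session_need = 4 * GIB   # hypothetical: session/prefill graph requirement

flat = headroom_after_cache(free, 512 * MIB)   # flat reserve from the report
scaled = headroom_after_cache(free, free // 2) # hypothetical scaled reserve

print(f"flat reserve:   {flat / GIB:.1f} GiB left, fits={flat >= session_need}")
print(f"scaled reserve: {scaled / GIB:.1f} GiB left, fits={scaled >= session_need}")
```

With a flat reserve, the headroom stays at 512 MiB no matter how much memory is free, which is why the failure only shows up once the session graph needs more than that.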

Router: GLM needed #466 (router expert-count) plus n_used (top-8) + sigmoid scoring; #435 (draft) generalizes the same router for Pro — worth converging so one router serves Flash/Pro/GLM.
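
For readers unfamiliar with this routing style, here is a minimal sketch of sigmoid-scored top-k expert selection. It is not the DS4 router code. Renormalizing the selected weights to sum to 1 is an assumption borrowed from the DeepSeek-V3 convention, and any selection bias or scaling factor the real model applies is omitted. `route` and `n_used` as a parameter name are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def route(router_logits, n_used=8):
    """Pick the n_used highest-scoring experts for one token.

    Each expert is scored independently with a sigmoid (not a softmax
    across experts). Returns (expert_index, weight) pairs whose weights
    are renormalized over the selected experts (assumption, see lead-in).
    """
    if not 0 < n_used <= len(router_logits):
        raise ValueError("n_used must be in 1..number of experts")
    scores = [sigmoid(x) for x in router_logits]
    # sorted() is stable, so ties resolve to the lower expert index.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = order[:n_used]
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]

# Illustrative logits for a 16-expert layer; real layers have far more experts.
logits = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 0.9, 3.0,
          -2.0, 0.4, 1.8, -0.1, 0.6, 2.2, -1.5, 0.2]
for expert, weight in route(logits, n_used=8):
    print(expert, round(weight, 4))
```

Because sigmoid scores are independent per expert, the expert count and `n_used` only affect selection and renormalization, which is part of why a single router parameterized on both could plausibly serve all three model families.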

Happy to split any of this into focused PRs if useful. Machine: 2× H200 NVL (143 GB, no NVLink), CUDA 13.2.
