GLM-5.2 (GlmMoeDsa, ~744B) runs on DS4 across 2x H200 — report + reusable findings

**GLM-5.2** (Zhipu, MIT — `GlmMoeDsaForCausalLM`, DeepSeek-V3.2-family: MLA + DSA indexer + MoE(+shared) + MTP) runs on DS4: loads, computes correct logits (corr **0.9994** vs HF `transformers` GlmMoeDsa on a small real-arch model), generates, runs **fully resident on 2× H200 NVL** (q2 ~252 GB, no SSD streaming, ~16 t/s), and chats with a **native GLM tokenizer** in the GGUF.
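
For readers who want to repeat the logits check, here is a minimal sketch of the comparison, assuming "corr" means the Pearson correlation between the two engines' logit vectors for the same prompt position. The `pearson` helper and the toy values are illustrative only and are not taken from the branch.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences of floats."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)

if __name__ == "__main__":
    # Hypothetical toy logits standing in for one position's vocabulary row.
    reference = [2.0, -1.0, 0.5, 3.0]   # e.g. from the HF reference model
    candidate = [2.1, -0.9, 0.4, 3.2]   # e.g. from the engine under test
    print(f"corr = {pearson(reference, candidate):.4f}")
```

A high correlation alone does not prove identical sampling behaviour, so it is worth also checking that the argmax token agrees on the same positions.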

All engine changes are gated on the GLM variant (`n_hc==1`) / on tokenizer-token presence, so the DeepSeek **Flash/Pro paths stay byte-identical** — verified: Flash 40 t/s + `--logprob-vectors` OK on the same build. DS4's `joyai-llm` byte-level BPE reproduces GLM's tokenization exactly (`--dump-tokens` == HF `AutoTokenizer`, 6/6 diverse strings).
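
The tokenizer parity check boils down to comparing two token-id sequences per test string. A minimal sketch follows, assuming the ids from `--dump-tokens` and from HF `AutoTokenizer` have already been parsed into Python lists; the parsing step is omitted because the dump format is not described here, and the sample ids are made up.

```python
def first_divergence(ids_a, ids_b):
    """Return the index of the first differing token id, or None if equal."""
    for i, (x, y) in enumerate(zip(ids_a, ids_b)):
        if x != y:
            return i
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None

def failing_cases(cases):
    """cases maps text -> (engine_ids, reference_ids); returns mismatches."""
    failures = []
    for text, (engine_ids, reference_ids) in cases.items():
        where = first_divergence(engine_ids, reference_ids)
        if where is not None:
            failures.append((text, where))
    return failures

if __name__ == "__main__":
    # Hypothetical ids; real ones come from the two tokenizers.
    cases = {
        "hello world": ([101, 202], [101, 202]),
        "naive cafe": ([7, 8, 9], [7, 8, 10]),
    }
    for text, where in failing_cases(cases):
        print(f"mismatch in {text!r} at token index {where}")
```

Reporting the first diverging index rather than a plain pass/fail makes it easier to spot whether a mismatch comes from pre-tokenization, merges, or special tokens.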

**Substance** (full diff + converter + scripts + repro, experimental reference branch on my fork): https://github.com/vcnngr/ds4/tree/glm-5.2-backend — see `GLM52_MAINTAINER_REPORT.md` / `MAINTAINER_NOTES.md`.

Two **model-independent** findings worth upstreaming on their own:

1. **`q8→f16` dequant-cache reserve is a flat 512 MiB on ≥112 GiB cards** → the cache fills HBM and starves the session/prefill graph → OOM *after* a successful load (the weights themselves fit). `DS4_CUDA_WEIGHT_CACHE_LIMIT_GB` doesn't bound it, and an MTP model disables the cache, which hides the problem. **→ PR #472** (CUDA twin of #446 on ROCm).
2. **Distributed `model id mismatch`**: a coordinator started **without `-m`** silently loads the default `ds4flash.gguf`, then rejects workers with `model id mismatch`. Detection is correct; the message just doesn't hint at the cause, so a one-line message fix would save a long debugging session.
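
To make the first finding concrete, here is a rough sketch of the budget arithmetic, assuming the reserve is the headroom the dequant cache leaves free after the weights are loaded. The `cache_budget` function, the `reserve_percent` knob, and the memory figures are hypothetical illustrations; this is not necessarily the policy that PR #472 implements.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def cache_budget(free_bytes, total_bytes, flat_reserve=512 * MIB, reserve_percent=0):
    """Bytes a cache may claim while leaving a reserve free.

    The reserve is the larger of a flat floor and a percentage of the
    card's total memory (integer arithmetic, so results are exact).
    """
    scaled_reserve = total_bytes * reserve_percent // 100
    reserve = max(flat_reserve, scaled_reserve)
    return max(0, free_bytes - reserve)

if __name__ == "__main__":
    # Hypothetical card: 140 GiB total, 14 GiB still free after the weights.
    total, free = 140 * GIB, 14 * GIB
    flat = cache_budget(free, total)
    scaled = cache_budget(free, total, reserve_percent=5)
    print(f"flat reserve:   cache may take {flat / GIB:.2f} GiB")
    print(f"scaled reserve: cache may take {scaled / GIB:.2f} GiB")
```

With a flat floor, the memory left for the session/prefill graph stays constant however large the card is, which matches the symptom of an OOM after a successful load.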

**Router**: GLM needed #466 (router expert-count) plus `n_used` (top-8) + sigmoid scoring; #435 (draft) generalizes the same router for Pro — worth converging so one router serves Flash/Pro/GLM.
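
As an illustration of the router behaviour mentioned above, here is a minimal sketch of sigmoid scoring followed by top-`n_used` selection. Renormalizing the selected weights to sum to 1 is an assumption for the sketch, as is the absence of any bias or scaling term; the actual engine code may differ on both.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def route(logits, n_used=8, normalize=True):
    """Pick the n_used experts with the highest sigmoid score.

    Returns a list of (expert_index, weight) pairs, best first.
    """
    scores = [sigmoid(z) for z in logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:n_used]
    weights = [scores[i] for i in top]
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return list(zip(top, weights))

if __name__ == "__main__":
    # Hypothetical router logits for a 6-expert toy layer, top-2 routing.
    for expert, weight in route([0.3, 2.0, -1.5, 1.0, 0.0, -0.2], n_used=2):
        print(f"expert {expert}: weight {weight:.3f}")
```

Unlike softmax scoring, each sigmoid score is independent of the other experts, so the expert count and `n_used` become the only parameters a shared router would need to vary across Flash, Pro, and GLM.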

Happy to split any of this into focused PRs if useful. Machine: 2× H200 NVL (143 GB, no NVLink), CUDA 13.2.
