You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
GLM-5.2 (Zhipu, MIT — GlmMoeDsaForCausalLM, DeepSeek-V3.2-family: MLA + DSA indexer + MoE(+shared) + MTP) runs on DS4: loads, computes correct logits (corr 0.9994 vs HF transformers GlmMoeDsa on a small real-arch model), generates, runs fully resident on 2× H200 NVL (q2 ~252 GB, no SSD streaming, ~16 t/s), and chats with a native GLM tokenizer in the GGUF.
All engine changes are gated on the GLM variant (n_hc==1) / on tokenizer-token presence, so the DeepSeek Flash/Pro paths stay byte-identical — verified: Flash 40 t/s + --logprob-vectors OK on the same build. DS4's joyai-llm byte-level BPE reproduces GLM's tokenization exactly (--dump-tokens == HF AutoTokenizer, 6/6 diverse strings).
Substance (full diff + converter + scripts + repro, experimental reference branch on my fork): https://github.com/vcnngr/ds4/tree/glm-5.2-backend — see GLM52_MAINTAINER_REPORT.md / MAINTAINER_NOTES.md.
Two model-independent findings worth upstreaming on their own:
Distributed model id mismatch: a coordinator started without -m silently loads the default ds4flash.gguf, then rejects workers with model id mismatch. Detection is correct; the message just doesn't hint at the cause — a one-line message fix saves a long debug.
Router: GLM needed #466 (router expert-count) plus n_used (top-8) + sigmoid scoring; #435 (draft) generalizes the same router for Pro — worth converging so one router serves Flash/Pro/GLM.
Happy to split any of this into focused PRs if useful. Machine: 2× H200 NVL (143 GB, no NVLink), CUDA 13.2.
GLM-5.2 (Zhipu, MIT —
GlmMoeDsaForCausalLM, DeepSeek-V3.2-family: MLA + DSA indexer + MoE(+shared) + MTP) runs on DS4: loads, computes correct logits (corr 0.9994 vs HFtransformersGlmMoeDsa on a small real-arch model), generates, runs fully resident on 2× H200 NVL (q2 ~252 GB, no SSD streaming, ~16 t/s), and chats with a native GLM tokenizer in the GGUF.All engine changes are gated on the GLM variant (
n_hc==1) / on tokenizer-token presence, so the DeepSeek Flash/Pro paths stay byte-identical — verified: Flash 40 t/s +--logprob-vectorsOK on the same build. DS4'sjoyai-llmbyte-level BPE reproduces GLM's tokenization exactly (--dump-tokens== HFAutoTokenizer, 6/6 diverse strings).Substance (full diff + converter + scripts + repro, experimental reference branch on my fork): https://github.com/vcnngr/ds4/tree/glm-5.2-backend — see
GLM52_MAINTAINER_REPORT.md/MAINTAINER_NOTES.md.Two model-independent findings worth upstreaming on their own:
q8→f16dequant-cache reserve flat 512 MiB on ≥112 GiB cards → fills HBM and starves the session/prefill graph → OOM after a successful load (weights themselves fit).DS4_CUDA_WEIGHT_CACHE_LIMIT_GBdoesn't bound it; an MTP model disables the cache and hides it. → PR CUDA: scale q8->f16 cache reserve on >=112 GiB cards (fixes session OOM on large models) #472 (CUDA twin of Fix ROCm Q8->F16 cache reserve starving session tensors on large models (q4q2) #446 on ROCm).model id mismatch: a coordinator started without-msilently loads the defaultds4flash.gguf, then rejects workers withmodel id mismatch. Detection is correct; the message just doesn't hint at the cause — a one-line message fix saves a long debug.Router: GLM needed #466 (router expert-count) plus
n_used(top-8) + sigmoid scoring; #435 (draft) generalizes the same router for Pro — worth converging so one router serves Flash/Pro/GLM.Happy to split any of this into focused PRs if useful. Machine: 2× H200 NVL (143 GB, no NVLink), CUDA 13.2.