Fully trained Mixture-of-Experts (MoE) models are expensive to serve. Dynamic MoE variants reduce computation by adjusting the set of activated experts in an input-dependent manner, but most existing dynamic MoE methods rely on pre-training from scratch or task-specific adaptation.
In this paper, we introduce ZEDA, a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones, eliminating over 50% of expert FLOPs at marginal accuracy loss.
- [2026-05-19] We introduce Zero-Expert Self-Distillation Adaptation (ZEDA).
We introduce Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones without substantially sacrificing their established capabilities. ZEDA targets the practical deployment scenario where MoE models have already undergone expensive pre-training and post-training, and further inference-cost reduction is desired after the main training pipeline is finalized.
To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20× end-to-end inference speedup.
ZEDA first injects zero experts into a post-trained MoE, architecturally converting it into a dynamic one, and then adapts it through two-stage self-distillation with the original MoE as a fixed teacher.
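The effect of zero-expert injection on routing can be illustrated with a toy sketch. It is not the ZEDA implementation: the function names, the toy scaling experts, and the choice to normalize gate weights with a softmax over the selected logits are all assumptions made for illustration. It only shows that, with top-k unchanged, every slot won by a zero expert is a normal expert that does not run.

```python
import math

def route_with_zero_experts(router_logits, num_normal, top_k):
    """Select the top_k experts by logit; indices >= num_normal are zero experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    selected = order[:top_k]
    # Assumption: gate weights are a softmax over the selected logits only.
    peak = max(router_logits[i] for i in selected)
    exps = {i: math.exp(router_logits[i] - peak) for i in selected}
    total = sum(exps.values())
    gates = {i: e / total for i, e in exps.items()}
    return selected, gates

def moe_forward(x, router_logits, normal_experts, top_k):
    """Combine the outputs of selected normal experts; zero experts add nothing."""
    num_normal = len(normal_experts)
    selected, gates = route_with_zero_experts(router_logits, num_normal, top_k)
    output = [0.0] * len(x)
    executed = 0
    for idx in selected:
        if idx >= num_normal:
            continue  # zero expert: output is identically zero, so no compute
        expert_out = normal_experts[idx](x)
        output = [o + gates[idx] * v for o, v in zip(output, expert_out)]
        executed += 1
    return output, executed

def make_scaling_expert(scale):
    """Hypothetical stand-in for an expert FFN: multiplies the input by a scalar."""
    return lambda x: [scale * v for v in x]

# Toy pool: 4 normal experts plus 2 zero experts (router indices 4 and 5).
normal_experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 0.3, 0.2, 3.0, -1.0]
output, executed = moe_forward([1.0, -2.0], router_logits, normal_experts, top_k=2)
print(executed)  # 1: one of the two routing slots went to a zero expert
```

In this toy case the token keeps two routing slots but runs only one expert, so the expert compute for that token is halved.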
ZEDA introduces parameterless zero experts, whose outputs are identically zero, into the existing expert pool of a post-trained MoE model. This expands the router candidate pool with zero-computation experts while the activation number remains unchanged, naturally reducing active normal experts. The augmented model is then adapted through a two-stage self-distillation process:
- SFT Stage: Trains the student on responses sampled from the teacher (original MoE).
- OPD Stage: Shifts to on-policy learning, where responses are sampled from the current student and the teacher supplies token-level targets via reverse KL.
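For reference, the reverse KL used in the OPD stage is KL(student || teacher), taken at each token position of a response sampled from the student. The sketch below computes it over a full toy vocabulary in plain Python. The function names are hypothetical, and whether ZEDA uses the full-vocabulary form or a sampled-token estimate is not stated here, so treat this as a definition-level illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - lse for z in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position."""
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt)
               for ls, lt in zip(log_p_student, log_p_teacher))

def sequence_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL over a student-sampled response."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy example: two token positions, vocabulary of size 3.
student_steps = [[2.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
teacher_steps = [[0.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
loss = sequence_reverse_kl(student_steps, teacher_steps)
```

The second position contributes zero because student and teacher agree there. The expectation is taken under the student distribution, so the student is penalized for putting mass where the teacher puts little.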
ZEDA incorporates the Group Auxiliary Loss
To run ZEDA, follow these steps:
ZEDA is built upon large-scale MoE training and serving codebases, including slime, SGLang, and Megatron. Please use the Docker image slimerl/slime:20251113-v1 released by slime:
# Pull the image
docker pull slimerl/slime:20251113-v1
# Start the container
docker run --rm --gpus all --ipc=host --shm-size=16g \
--ulimit memlock=-1 --ulimit stack=67108864 \
-it slimerl/slime:20251113-v1 /bin/bash
After pulling the image and starting the container, install our modified versions of SGLang and slime:
cd zeda/sglang/python
pip install -e . --no-deps
cd ..
patch -p1 < ../slime/docker/patch/latest/sglang.patch
cd ../transformers
pip install -e . --no-deps
cd ../slime
pip install -e . --no-deps
ZEDA uses 60k prompts covering math, code, and chat data, together with the corresponding self-distillation rollouts.
- Prompts: The prompts are used for rollout and OPD. The prompts are chosen from AceReason-1.1-SFT and Llama-Nemotron-Post-Training-Dataset, and we release them in ZEDA-prompts-60k.
- Rollouts: The rollouts are used for SFT. You need to use the specific post-trained MoE model intended for adaptation to perform the rollout. You can also directly utilize our released rollout results ZEDA-Qwen3-30B-A3B-rollout-60k and ZEDA-GLM-4.7-Flash-rollout-60k.
After downloading the data, please put them in the data folder.
After downloading the post-trained MoE model you want to adapt from Hugging Face, first convert it into a dynamic one through zero-expert injection:
python scripts/convert-hf-to-ZCE.py --input-dir path_Qwen3-30B-A3B --output-dir path_Qwen3-30B-A3B-dynamic --new-num-experts 192 # for Qwen3-30B-A3B
python scripts/convert-hf-to-ZCE.py --input-dir path_GLM-4.7-Flash --output-dir path_GLM-4.7-Flash-dynamic --new-num-experts 96 # for GLM-4.7-Flash
Then edit config.json in the output dynamic MoE directory: change "architectures" and add "use_zce_mask", "zce_nums", and "zce_types". For Qwen3-30B-A3B:
"architectures": ["Qwen3MoePlusPlusForCausalLM"],
"use_zce_mask": false,
"zce_nums": [64],
"zce_types": ["zero"]
For GLM-4.7-Flash:
"architectures": ["Glm4MoeLitePlusPlusForCausalLM"],
"use_zce_mask": false,
"zce_nums": [32],
"zce_types": ["zero"]
Finally, convert the dynamic MoE into the format compatible with Megatron:
bash scripts/convert_hf_to_torch_dist.sh
ZEDA consists of zero-expert injection, SFT, and OPD. You can run the following scripts to start the adaptation pipeline:
# For Qwen3-30B-A3B
bash scripts/train_zeda_qwen_sft.sh # SFT
bash scripts/convert_torch_dist_to_hf.sh # Convert Model
bash scripts/run_teacher_server.sh # Start Teacher Server
bash scripts/train_zeda_qwen_opd.sh # After the teacher server starts, run OPD
# For GLM-4.7-Flash
bash scripts/train_zeda_glm_sft.sh # SFT
bash scripts/convert_torch_dist_to_hf.sh # Convert Model
bash scripts/run_teacher_server.sh # Start Teacher Server
bash scripts/train_zeda_glm_opd.sh # After the teacher server starts, run OPD
We provide a unified evaluation pipeline covering the released math, code, instruction-following, and science benchmarks. To reproduce the reported results, first download the benchmark data from Hugging Face to evaluation/benchmark:
cd ZEDA
hf download TsinghuaC3I/ZEDA-Evaluation \
--repo-type dataset \
--local-dir evaluation/benchmark
Then install the required evaluation dependencies:
pip install -r requirements.txt
Next, configure evaluation/run_sglang_server.sh by specifying:
- MODEL_PATH: path to the evaluated model
- TP_SIZE: tensor parallel size
- PORT: server port
Launch the SGLang server with:
bash evaluation/run_sglang_server.sh
Once the server is ready, configure evaluation/run_evaluation.sh by setting:
- MODEL_NAME: model name used in output filenames
- MODEL_PATH: path to the evaluated model
- SERVER_URL: server endpoint, for example http://0.0.0.0:$PORT/generate
Then start the evaluation pipeline:
bash evaluation/run_evaluation.sh
The script iterates over all released benchmarks, stores raw generations in evaluation/raw_output/, and finally invokes compute_reward.py to aggregate the evaluation metrics.
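Before launching the full pipeline, it can help to confirm that the server responds. The sketch below builds a single request for the SGLang /generate endpoint using only the standard library. The port 30000 and the prompt are placeholders, and the helper names are hypothetical; the payload shape ("text" plus "sampling_params") follows SGLang's native generate API as we understand it, so adjust it if your server version differs.

```python
import json
import urllib.request

def build_generate_request(server_url, prompt, max_new_tokens=64, temperature=0.0):
    """Build a POST request for an SGLang-style /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return urllib.request.Request(
        server_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(server_url, prompt, timeout=60, **sampling_kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(server_url, prompt, **sampling_kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Placeholder URL: replace 30000 with the PORT set in run_sglang_server.sh.
request = build_generate_request("http://0.0.0.0:30000/generate", "What is 2 + 2?")
# With a running server: result = generate("http://0.0.0.0:30000/generate", "What is 2 + 2?")
```

Only the request construction runs at import time; the commented `generate` call is the part that needs a live server.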
We release our adapted dynamic MoE models and rollout data on Hugging Face:
| Model | Huggingface | Base Model |
|---|---|---|
| ZEDA-Qwen3-30B-A3B-Dynamic | TsinghuaC3I/ZEDA-Qwen3-30B-A3B-Dynamic | Qwen3-30B-A3B |
| ZEDA-GLM-4.7-Flash-Dynamic | TsinghuaC3I/ZEDA-GLM-4.7-Flash-Dynamic | GLM-4.7-Flash |

| Rollout Data | Huggingface |
|---|---|
| ZEDA-Qwen3-30B-A3B-rollout-60k | TsinghuaC3I/ZEDA |
| ZEDA-GLM-4.7-Flash-rollout-60k | TsinghuaC3I/ZEDA |
ZEDA demonstrates consistent improvements across multiple models and benchmarks:
Our project mainly builds upon slime, SGLang, and Megatron. We leverage the datasets of AceReason and Llama-Nemotron-Post-Training-Dataset, and backbone models of Qwen3-30B-A3B and GLM-4.7-Flash. We are grateful for these significant open-source contributions.
For questions about this work, please contact:
- Xingtai Lv: lvxt24@mails.tsinghua.edu.cn
If you find this work helpful, please cite our paper:
@misc{lv2026posttrainedmoeskiphalf,
title={Post-Trained MoE Can Skip Half Experts via Self-Distillation},
author={Xingtai Lv and Li Sheng and Kaiyan Zhang and Yichen You and Siyan Gao and Xueheng Luo and Yuxin Zuo and Yuchen Fan and Junlin Yang and Ganqu Cui and Bingning Wang and Fan Yang and Youbang Sun and Ning Ding and Bowen Zhou},
year={2026},
eprint={2605.18643},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2605.18643},
}


