Post-Trained MoE Can Skip Half Experts via Self-Distillation

🚀 Getting Started • 📊 Main Results • 💖 Acknowledgements • 📨 Contact • 🎈 Citation

Fully trained Mixture-of-Experts (MoE) models are expensive to serve. Dynamic MoE variants reduce computation by adjusting the number of activated experts in an input-dependent manner, but most existing dynamic MoE methods rely on pre-training from scratch or task-specific adaptation.

In this paper, we introduce ZEDA, a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones, eliminating over 50% of expert FLOPs at marginal accuracy loss.

🎉News

[2026-05-19] We introduce Zero-Expert Self-Distillation Adaptation (ZEDA).

📖Introduction

We introduce Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones without substantially sacrificing their established capabilities. ZEDA targets the practical deployment scenario where MoE models have already undergone expensive pre-training and post-training, and further inference-cost reduction is desired after the main training pipeline is finalized.

To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20× end-to-end inference speedup.

✨ZEDA

ZEDA first injects zero experts into a post-trained MoE, architecturally converting it into a dynamic one, and then adapts it through two-stage self-distillation with the original MoE as a fixed teacher.
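As a rough illustration of the injection step, the sketch below routes tokens over a pool of normal experts plus zero experts and skips computation whenever a zero expert is selected. It is a minimal NumPy sketch and not the repository's implementation: the function name `moe_forward_with_zero_experts`, the softmax top-K router, and the use of unnormalized router probabilities as gate weights are all assumptions made for illustration.

```python
import numpy as np


def moe_forward_with_zero_experts(x, router_weight, expert_fns, top_k):
    """Toy MoE layer with zero experts (illustrative assumption, not ZEDA's code).

    x:             [T, d] token activations
    router_weight: [d, N + N_Z] router projection; the last N_Z columns
                   correspond to zero experts
    expert_fns:    list of N callables, one per normal expert
    top_k:         number of experts selected per token (unchanged by injection)
    """
    n_normal = len(expert_fns)

    # Softmax router over the enlarged candidate pool.
    logits = x @ router_weight
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Top-K selection per token.
    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]

    out = np.zeros_like(x)
    normal_calls = 0
    for t in range(x.shape[0]):
        for e in topk_idx[t]:
            if e >= n_normal:
                # Zero expert: output is identically zero, so nothing is computed.
                continue
            out[t] += probs[t, e] * expert_fns[e](x[t])
            normal_calls += 1
    return out, normal_calls
```

In this toy setting, `normal_calls` counts how many normal-expert evaluations were actually performed, which is the quantity that shrinks as more routing slots go to zero experts.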

ZEDA introduces parameter-free zero experts, whose outputs are identically zero, into the existing expert pool of a post-trained MoE model. This expands the router's candidate pool with zero-computation experts while the number of experts activated per token remains unchanged, so every routing slot assigned to a zero expert is one fewer normal expert to compute. The augmented model is then adapted through a two-stage self-distillation process:

SFT Stage: Trains the student on responses sampled from the teacher (original MoE).
OPD Stage: Shifts to on-policy learning, where responses are sampled from the current student and the teacher supplies token-level targets via reverse KL.
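To make the OPD objective concrete, the sketch below computes a token-level reverse KL, KL(student || teacher), from student and teacher logits over the full vocabulary. It is a minimal NumPy sketch under assumptions: the function names are hypothetical, and the training scripts may instead use a sampled-token estimator or a different reduction.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def reverse_kl_per_token(student_logits, teacher_logits):
    """Reverse KL per token position: sum_v p_s(v) * (log p_s(v) - log p_t(v)).

    Both inputs have shape [T, V]. The expectation is taken under the
    student distribution, which is what makes the KL "reverse".
    """
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return (np.exp(log_p_s) * (log_p_s - log_p_t)).sum(axis=-1)
```

Because the expectation is taken under the student, positions where the student places mass on tokens the teacher considers unlikely are penalized most heavily.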

ZEDA incorporates the Group Auxiliary Loss $\mathcal{L}_{GA}$ to regulate the relative activation frequency between normal experts and zero experts, while preserving the learned routing structures among normal experts. The loss is defined as:

$$\mathcal{L}_{GA} = \alpha \cdot \frac{N + N_Z \cdot w}{K} \cdot \left( \frac{f_{\mathcal{E}} \cdot P_{\mathcal{E}}}{N} + \frac{f_{\mathcal{Z}} \cdot P_{\mathcal{Z}}}{N_Z \cdot w} \right)$$
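The symbols in this formula are not defined in this README, so the sketch below evaluates it under assumed meanings: $N$ and $N_Z$ are the numbers of normal and zero experts, $K$ is the number of experts activated per token, $\alpha$ is the loss coefficient, $w$ is a weight on zero experts, $f_{\mathcal{E}}$ and $f_{\mathcal{Z}}$ are the fractions of routing slots assigned to each group, and $P_{\mathcal{E}}$ and $P_{\mathcal{Z}}$ are the mean router probability mass on each group. The function names are hypothetical; please check the paper for the exact definitions.

```python
import numpy as np


def group_stats(router_logits, n_normal, top_k):
    """Return (f_E, P_E, f_Z, P_Z) under the ASSUMED definitions above.

    router_logits: [T, N + N_Z]; the last N_Z columns are zero experts.
    """
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]   # [T, K]
    f_z = float((topk_idx >= n_normal).mean())          # share of slots on zero experts
    f_e = 1.0 - f_z                                     # share of slots on normal experts
    p_e = float(probs[:, :n_normal].sum(axis=-1).mean())
    p_z = float(probs[:, n_normal:].sum(axis=-1).mean())
    return f_e, p_e, f_z, p_z


def group_aux_loss(f_e, p_e, f_z, p_z, n_normal, n_zero, top_k, alpha, w):
    """Literal evaluation of the L_GA formula as written above."""
    scale = alpha * (n_normal + n_zero * w) / top_k
    return scale * (f_e * p_e / n_normal + f_z * p_z / (n_zero * w))
```

Because the loss only depends on group-level statistics, it does not distinguish between individual normal experts, which is consistent with the stated goal of leaving routing among normal experts intact.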

🚀Getting Started

To run ZEDA, follow these steps:

Env Setup

ZEDA is built upon large-scale MoE training and serving codebases, including slime, SGLang, and Megatron. Please use the Docker image slimerl/slime:20251113-v1 released by slime:

# Pull the image
docker pull slimerl/slime:20251113-v1

# Start the container
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it slimerl/slime:20251113-v1 /bin/bash

After pulling the image and starting the container, install our modified versions of SGLang, transformers, and slime:

cd zeda/sglang/python
pip install -e . --no-deps
cd ..
patch -p1 < ../slime/docker/patch/latest/sglang.patch

cd ../transformers
pip install -e . --no-deps

cd ../slime
pip install -e . --no-deps

Data Preparation

ZEDA uses 60k prompts covering math, code, and chat data, together with the corresponding self-distillation rollouts.

Prompts: The prompts are used for rollout and OPD. The prompts are chosen from AceReason-1.1-SFT and Llama-Nemotron-Post-Training-Dataset, and we release them in ZEDA-prompts-60k.
Rollouts: The rollouts are used for SFT. You need to use the specific post-trained MoE model intended for adaptation to perform the rollout. You can also directly utilize our released rollout results ZEDA-Qwen3-30B-A3B-rollout-60k and ZEDA-GLM-4.7-Flash-rollout-60k.

After downloading the data, please put them in the data folder.

Model Preparation

After downloading the post-trained MoE model you want to adapt from Hugging Face, first convert it into a dynamic model through zero-expert injection:

python scripts/convert-hf-to-ZCE.py --input-dir path_Qwen3-30B-A3B --output-dir path_Qwen3-30B-A3B-dynamic --new-num-experts 192 # for Qwen3-30B-A3B

python scripts/convert-hf-to-ZCE.py --input-dir path_GLM-4.7-Flash --output-dir path_GLM-4.7-Flash-dynamic --new-num-experts 96 # for GLM-4.7-Flash

Then edit config.json in the output directory of the dynamic MoE: set "architectures" to the dynamic model class and add "use_zce_mask", "zce_nums", and "zce_types". For Qwen3-30B-A3B:

"architectures": ["Qwen3MoePlusPlusForCausalLM"],
"use_zce_mask": false,
"zce_nums": [64],
"zce_types": ["zero"]

For GLM-4.7-Flash:

"architectures": ["Glm4MoeLitePlusPlusForCausalLM"],
"use_zce_mask": false,
"zce_nums": [32],
"zce_types": ["zero"]

Finally, convert the dynamic MoE into the format compatible with Megatron:

bash scripts/convert_hf_to_torch_dist.sh
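The config.json edit described earlier in this section can also be scripted, as long as it runs before the Megatron conversion. The sketch below is a hypothetical helper (`patch_config` is not part of the repository) that writes exactly the fields listed above using Python's standard `json` module. The values appear consistent with `--new-num-experts` counting normal plus zero experts (for Qwen3-30B-A3B, 128 original experts plus 64 zero experts gives 192), though that reading is an assumption here.

```python
import json
from pathlib import Path


def patch_config(config_path, architecture, num_zero_experts):
    """Set the dynamic-MoE fields in a Hugging Face config.json (hypothetical helper)."""
    path = Path(config_path)
    config = json.loads(path.read_text())

    # Fields and values taken from the README's config snippets.
    config["architectures"] = [architecture]
    config["use_zce_mask"] = False
    config["zce_nums"] = [num_zero_experts]
    config["zce_types"] = ["zero"]

    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `patch_config("path_Qwen3-30B-A3B-dynamic/config.json", "Qwen3MoePlusPlusForCausalLM", 64)` would apply the Qwen3-30B-A3B settings, and `patch_config("path_GLM-4.7-Flash-dynamic/config.json", "Glm4MoeLitePlusPlusForCausalLM", 32)` the GLM-4.7-Flash settings.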

Training

ZEDA consists of zero-expert injection, SFT, and OPD. You can run the following scripts to start the adaptation pipeline:

# For Qwen3-30B-A3B
bash scripts/train_zeda_qwen_sft.sh # SFT
bash scripts/convert_torch_dist_to_hf.sh # Convert Model
bash scripts/run_teacher_server.sh # Start Teacher Server
bash scripts/train_zeda_qwen_opd.sh # After the teacher server starts, run OPD

# For GLM-4.7-Flash
bash scripts/train_zeda_glm_sft.sh # SFT
bash scripts/convert_torch_dist_to_hf.sh # Convert Model
bash scripts/run_teacher_server.sh # Start Teacher Server
bash scripts/train_zeda_glm_opd.sh # After the teacher server starts, run OPD

Evaluation

We provide a unified evaluation pipeline covering the released math, code, instruction-following, and science benchmarks. To reproduce the reported results, first download the benchmark data from Hugging Face to evaluation/benchmark:

cd ZEDA
hf download TsinghuaC3I/ZEDA-Evaluation \
  --repo-type dataset \
  --local-dir evaluation/benchmark

Then install the required evaluation dependencies:

pip install -r requirements.txt

Next, configure evaluation/run_sglang_server.sh by specifying:

MODEL_PATH: path to the evaluated model
TP_SIZE: tensor parallel size
PORT: server port

Launch the SGLang server with:

bash evaluation/run_sglang_server.sh
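Before moving on, it can help to confirm that the server answers requests. The sketch below posts a short prompt to SGLang's native `/generate` endpoint using only the Python standard library. The helper names, the prompt, and the sampling values are illustrative assumptions, and the response is returned as parsed JSON without assuming more about its layout.

```python
import json
import urllib.request


def build_generate_request(server_url, prompt, max_new_tokens=16):
    """Build a POST request for SGLang's native /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0.0},
    }
    return urllib.request.Request(
        server_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(server_url, prompt="1 + 1 ="):
    """Send one request and return the parsed JSON response."""
    request = build_generate_request(server_url, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `smoke_test("http://0.0.0.0:<PORT>/generate")`, with `<PORT>` replaced by the port configured in evaluation/run_sglang_server.sh, should return a JSON object once the model has finished loading.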

Once the server is ready, configure evaluation/run_evaluation.sh by setting:

MODEL_NAME: model name used in output filenames
MODEL_PATH: path to the evaluated model
SERVER_URL: server endpoint, for example http://0.0.0.0:$PORT/generate

Then start the evaluation pipeline:

bash evaluation/run_evaluation.sh

The script iterates over all released benchmarks, stores raw generations in evaluation/raw_output/, and finally invokes compute_reward.py to aggregate the evaluation metrics.

Models and Datasets

We release our adapted dynamic MoE models and rollout data on Hugging Face:

Model	Huggingface	Base Model
ZEDA-Qwen3-30B-A3B-Dynamic	TsinghuaC3I/ZEDA-Qwen3-30B-A3B-Dynamic	Qwen3-30B-A3B
ZEDA-GLM-4.7-Flash-Dynamic	TsinghuaC3I/ZEDA-GLM-4.7-Flash-Dynamic	GLM-4.7-Flash

Rollout Data	Huggingface
ZEDA-Qwen3-30B-A3B-rollout-60k	TsinghuaC3I/ZEDA
ZEDA-GLM-4.7-Flash-rollout-60k	TsinghuaC3I/ZEDA

📊Main Results

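ZEDA removes over 50% of expert FLOPs on both Qwen3-30B-A3B and GLM-4.7-Flash at marginal accuracy loss, and outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, respectively.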

💖Acknowledgements

Our project mainly builds upon slime, SGLang, and Megatron. We leverage the datasets of AceReason and Llama-Nemotron-Post-Training-Dataset, and backbone models of Qwen3-30B-A3B and GLM-4.7-Flash. We are grateful for these significant open-source contributions.

📨Contact

For questions about this work, please contact:

Xingtai Lv: lvxt24@mails.tsinghua.edu.cn

🎈Citation

If you find this work helpful, please cite our paper:

@misc{lv2026posttrainedmoeskiphalf,
      title={Post-Trained MoE Can Skip Half Experts via Self-Distillation}, 
      author={Xingtai Lv and Li Sheng and Kaiyan Zhang and Yichen You and Siyan Gao and Xueheng Luo and Yuxin Zuo and Yuchen Fan and Junlin Yang and Ganqu Cui and Bingning Wang and Fan Yang and Youbang Sun and Ning Ding and Bowen Zhou},
      year={2026},
      eprint={2605.18643},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.18643}, 
}
