moe-upcycle

Turn a small dense LLM into a Mixture-of-Experts model — then specialize the experts by distilling from a stronger teacher. A point-and-run toolkit + a demo built on Qwen2.5-0.5B.

This is a method toolkit, not a model release. It does the full sparse-upcycling pipeline — surgery → distillation → two flavors of expert specialization → evaluation → a scaling-law sweep → an honest head-to-head against the base model — on a single GPU, with reproducible scripts.

🤗 Demo model (Mode B, named domain experts): outlast85/qwen2.5-0.5b-moe-domain-experts — read its card for the honest limitations (it's a mechanism demo, not a better small model).

TL;DR — what it does, and the honest headline

You take a trained dense model, clone each feed-forward block into N experts + a router (sparse upcycling), then train so the experts specialize. Two ways to specialize are included:

Mode A — emergent: a learned router + load-balancing; experts drift apart on their own (how Mixtral / DeepSeek / Qwen-MoE actually work).
Mode B — supervised "named experts": because our data is domain-tagged, we teach the router "code → expert 0, chat → expert 1, …". You get real, individually-loadable field experts + an always-on shared "generalist" — the "GP sends you to a specialist" design.
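The two router objectives can be illustrated in a few lines. This is a minimal sketch, not the code in train.py or mode_b_domain_experts.py: the load-balancing term below uses the Switch-Transformer-style auxiliary loss as an assumed stand-in (the exact form used in this repo is not stated here), and both function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def load_balance_loss(router_logits):
    """Mode A (assumed form): N * sum_i f_i * P_i over a batch of tokens.
    f_i = fraction of tokens whose top-1 expert is i,
    P_i = mean router probability assigned to expert i.
    The value is 1.0 for perfectly even routing and approaches N on collapse."""
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    f = [0.0] * n_experts
    for p in probs:
        f[p.index(max(p))] += 1.0 / n_tokens
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))

def domain_routing_loss(router_logits, domain_ids):
    """Mode B: cross-entropy between router probabilities and each token's domain tag."""
    total = 0.0
    for row, d in zip(router_logits, domain_ids):
        total -= math.log(softmax(row)[d])
    return total / len(router_logits)
```

With four experts, an untrained (uniform) router scores ln 4 ≈ 1.39 on the Mode B term, which gives a rough reference point for reading routing cross-entropy values.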

The honest headline, measured (compare_base.py): at this scale the upcycled model is bigger, slower, and slightly worse than the base it came from. That's expected and the repo shows it rather than hiding it — see Results. The value here is the method and the specialization behavior, not model quality. Upcycling only beats the base at training scales far beyond one GPU; making a better small model is not what a few-million-token run can do.

The pipeline

dense model (Qwen2.5-0.5B)
   │
   ▼
1. SURGERY  (upcycle.py)      each block's FFN → [ N routed experts + 1 shared expert + a router ].
                             • Drop-Upcycling: re-randomize ~30% of each expert so clones diverge.
                             • shared always-on expert (the "generalist"); fp32 router; top-k gating.
                             • with drop=0, the upcycled model is BEHAVIORALLY IDENTICAL at init.
   │
   ▼
2. TEACHER  (distill_precompute.py)   run a stronger same-family teacher (Qwen2.5-3B) over the
                             text and cache its top-20 logits = the "dark knowledge" to distill.
   │
   ▼
3. TRAIN    (train.py = Mode A | mode_b_domain_experts.py = Mode B)
                             3 losses: next-token LM + forward-KL to the teacher + a router term
                             (load-balance for A, domain-supervision for B). LR warmup, noisy-top-k.
   │
   ▼
4. MEASURE  (evaluate.py, probe_layers.py, sweep_train.py, compare_base.py)
                             generation sanity • per-layer routing/specialization • a data-scaling
                             law • and an honest base-vs-MoE benchmark.
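The distillation term in step 3 can be sketched for a single token position as follows. This assumes the cached teacher top-k logits are compared with the student's logits at the same token ids, with both renormalized over just those k entries; how the repo treats probability mass outside the top-k is not stated here, and `topk_forward_kl` is a hypothetical name.

```python
import math

def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def topk_forward_kl(teacher_topk_ids, teacher_topk_logits, student_logits):
    """Forward KL, KL(teacher || student), restricted to the teacher's top-k token ids.

    teacher_topk_ids:    token ids cached for this position (k of them)
    teacher_topk_logits: the teacher's logits at those ids
    student_logits:      the student's full-vocabulary logits for this position
    """
    t_logp = log_softmax(teacher_topk_logits)
    # Gather the student's logits at the same ids, then renormalize over them.
    s_logp = log_softmax([student_logits[i] for i in teacher_topk_ids])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
```

The gather by token id is why the teacher has to share the student's tokenizer: the cached ids index directly into the student's vocabulary.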

Results

All numbers are from the included scripts on Qwen2.5-0.5B, teacher Qwen2.5-3B, trained on ~30k license-clean real examples (wikitext • codeparrot • OpenAssistant/oasst1 • Dolly-15k), on an RTX 5090 (+ a 3060 running the teacher in parallel).

Surgery is lossless at init — 494M → 1,749M params (4 routed + 1 shared, top-2); with drop=0 the MoE reproduces the dense model to a max logit diff of 3e-5 ("grow without forgetting"), and drop=0.3 makes the experts diverge as intended.
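Why drop=0 is lossless can be seen on a toy layer. The sketch below is not upcycle.py: it uses a tiny ReLU FFN in place of the real MLP, re-randomizes hidden units with a fixed Gaussian as a simplified stand-in for Drop-Upcycling, and assumes the shared and routed outputs are mixed as a convex combination (the hypothetical `shared_weight`). Under that assumption, identical clones reproduce the dense output for any router.

```python
import copy
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(params, x):
    """Toy two-layer FFN standing in for a transformer block's MLP."""
    W1, W2 = params
    hidden = [max(0.0, v) for v in matvec(W1, x)]
    return matvec(W2, hidden)

def upcycle(dense, n_experts, drop=0.0, rng=None):
    """Clone the dense FFN into one shared expert plus n_experts routed experts.
    With drop > 0, roughly that fraction of hidden units in each routed clone
    is re-randomized so the clones start out different."""
    rng = rng or random.Random(0)
    d_hidden, d_model = len(dense[0]), len(dense[0][0])
    experts = [copy.deepcopy(dense) for _ in range(n_experts)]
    for W1, W2 in experts:
        for j in range(d_hidden):
            if rng.random() < drop:
                W1[j] = [rng.gauss(0, 0.1) for _ in range(d_model)]
                for row in W2:
                    row[j] = rng.gauss(0, 0.1)
    router = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_experts)]
    return {"shared": copy.deepcopy(dense), "experts": experts, "router": router}

def moe_forward(moe, x, top_k=2, shared_weight=0.5):
    """Top-k gating with gates renormalized over the selected experts."""
    logits = matvec(moe["router"], x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    m = max(logits[i] for i in top)
    gates = [math.exp(logits[i] - m) for i in top]
    z = sum(gates)
    out = [shared_weight * v for v in ffn(moe["shared"], x)]
    for i, g in zip(top, gates):
        y = ffn(moe["experts"][i], x)
        out = [o + (1.0 - shared_weight) * (g / z) * v for o, v in zip(out, y)]
    return out
```

Because the mixing weights sum to one, the identity holds before any training; any nonzero drop breaks it on purpose.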

Mode A — emergent specialization (probe_layers.py): routing separates domains in the deep layers (early layers stay general — as the literature predicts). Peak per-layer domain-separation 0.21, with a 2.9× routing contrast (e.g. expert E1 takes 58% of code tokens but only 20% of web). Code claims its own experts; the language domains overlap more.
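The routing contrast quoted above is a ratio of per-domain expert shares. A minimal sketch of that arithmetic, assuming you have already collected (domain, expert) pairs for routed tokens at one layer; the function names are hypothetical and this is not the code in probe_layers.py:

```python
from collections import Counter

def expert_share_by_domain(assignments):
    """assignments: iterable of (domain, expert_id) pairs, one per routed token.
    Returns {domain: {expert_id: fraction of that domain's tokens}}."""
    counts = {}
    for domain, expert in assignments:
        counts.setdefault(domain, Counter())[expert] += 1
    return {
        d: {e: n / sum(c.values()) for e, n in c.items()}
        for d, c in counts.items()
    }

def routing_contrast(shares, expert, domain_a, domain_b):
    """How many times more of domain_a's tokens than domain_b's reach `expert`."""
    denom = shares[domain_b].get(expert, 0.0)
    if denom == 0.0:
        return float("inf")
    return shares[domain_a].get(expert, 0.0) / denom

# Toy data shaped like the example in the text: E1 gets 58% of code, 20% of web.
toy = [("code", 1)] * 58 + [("code", 0)] * 42 + [("web", 1)] * 20 + [("web", 3)] * 80
shares = expert_share_by_domain(toy)
contrast = routing_contrast(shares, 1, "code", "web")
```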

Mode B — supervised named experts (mode_b_domain_experts.py): the router learns the dispatch (routing cross-entropy 1.55 → 0.11). Held-out routing accuracy:

domain         → expert   accuracy
code           0          97.7%
conversation   1          64.5%
explanation    2          88.0%
web            3          97.8%
mean                      87.0%

A built-in "secretary" reads a prompt and reports the one specialist to load (+ the shared GP) — the basis for selective expert offloading (cf. llama.cpp -ncmoe). The blurriest domain (conversation, 64.5%) is honestly the one whose tokens overlap everything.

Each domain claims its expert (code→0, explanation→2, web→3); conversation is the exception, with part of its traffic bleeding into the explanation expert.
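One plausible way for such a "secretary" to decide is a majority vote over the router's per-token choices for the prompt. The sketch below makes that assumption; the actual decision rule in mode_b_domain_experts.py is not described here, and `secretary` and its return fields are hypothetical. The expert-to-domain mapping is the one from the table.

```python
from collections import Counter

DOMAIN_OF_EXPERT = {0: "code", 1: "conversation", 2: "explanation", 3: "web"}

def secretary(per_token_router_logits):
    """Pick the single routed expert to load for a prompt.

    per_token_router_logits: one list of expert logits per prompt token
    (assumed to come from one chosen layer's router).
    """
    votes = Counter(
        max(range(len(row)), key=row.__getitem__)  # top-1 expert for this token
        for row in per_token_router_logits
    )
    expert, n_votes = votes.most_common(1)[0]
    return {
        "expert": expert,
        "domain": DOMAIN_OF_EXPERT[expert],
        "vote_share": n_votes / len(per_token_router_logits),
        "always_loaded": "shared",  # the generalist stays resident
    }
```

A low vote share is a hint that the prompt straddles domains, which is where loading a single specialist is most likely to hurt.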

Data-scaling law (sweep_train.py) — held-out perplexity vs training data:

data   tokens   val ppl
1×     0.52M    13.98
2×     1.04M    12.92
4×     2.08M    11.97
8×     4.17M    11.15

Fit: every 10× more data ≈ 22% lower perplexity. Extrapolated: ~2.6× data for 10% lower ppl, ~14× for 25%, and ~569× (billions of tokens) for 50% — the diminishing-returns wall, measured. (The 50% figure is a long extrapolation past the measured range — directionally certain, not exact.)
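The fit and its extrapolations can be re-derived from the four table rows with a log-log least-squares line. This is a sketch of that calculation, not sweep_train.py itself; because it starts from the rounded table values, its outputs land close to the quoted figures rather than matching them exactly, and the long extrapolation is the most sensitive.

```python
import math

# (tokens in millions, held-out perplexity), copied from the table above
runs = [(0.52, 13.98), (1.04, 12.92), (2.08, 11.97), (4.17, 11.15)]

def fit_power_law(points):
    """Least-squares line through (ln tokens, ln ppl).
    Returns (a, b) for the model ppl ~ a * tokens**(-b)."""
    xs = [math.log(t) for t, _ in points]
    ys = [math.log(p) for _, p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return math.exp(my - slope * mx), -slope

def data_multiplier(b, ppl_reduction):
    """Factor of extra data the fit implies for a fractional ppl reduction."""
    return (1.0 - ppl_reduction) ** (-1.0 / b)

a, b = fit_power_law(runs)
drop_per_decade = 1.0 - 10.0 ** (-b)  # fractional ppl drop per 10x data
print(f"each 10x of data lowers ppl by ~{drop_per_decade:.0%}")
for target in (0.10, 0.25, 0.50):
    print(f"{target:.0%} lower ppl needs ~{data_multiplier(b, target):.1f}x data")
```

Since the required data grows as the target ratio raised to the power 1/b, a small change in the fitted exponent moves the 50% figure by tens of multiples, which is why it should be read as an order of magnitude.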

Honest head-to-head vs base (compare_base.py, same held-out set):

model               val ppl   total params   active/token   gen tok/s   weights
base Qwen2.5-0.5B   9.27      494M           494M           50.6        0.99 GB
our upcycled MoE    13.46     1,749M         1,122M         20.2        3.50 GB

The base wins on quality and efficiency. Upcycling + a 4M-token budget can't beat a model pretrained on ~18T tokens; Drop-Upcycling also deliberately perturbs pretrained weights, and recovering them needs scale we don't have here. Upcycling wins with large training budgets, and when it is compared against a dense model of equal total size or compute rather than the tiny 0.5B you started from. This repo measures the regime where it loses, honestly.
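The parameter columns in the table follow from simple arithmetic on the FFN size. The sketch below assumes the public Qwen2.5-0.5B dimensions (hidden size 896, FFN intermediate size 4864, 24 layers, three FFN projections per layer), a bias-free linear router per layer, and 2 bytes per weight; those are assumptions for illustration, not values read from this repo.

```python
HIDDEN, INTERMEDIATE, LAYERS = 896, 4864, 24  # assumed Qwen2.5-0.5B dimensions
DENSE_TOTAL = 494_000_000                     # base parameter count from the table

# One FFN = gate, up and down projections.
ffn_per_layer = 3 * HIDDEN * INTERMEDIATE
ffn_total = ffn_per_layer * LAYERS

def moe_params(n_routed=4, n_shared=1, top_k=2):
    """Total and per-token active parameters after upcycling every FFN."""
    router = HIDDEN * n_routed * LAYERS  # assumed: one linear router per layer
    # The dense model already contains one FFN copy per layer.
    total = DENSE_TOTAL + (n_routed + n_shared - 1) * ffn_total + router
    active = DENSE_TOTAL + (top_k + n_shared - 1) * ffn_total + router
    return total, active

total, active = moe_params()
print(f"total  ~{total / 1e6:,.0f}M")
print(f"active ~{active / 1e6:,.0f}M")
print(f"weights at 2 bytes/param ~{total * 2 / 1e9:.2f} GB")
```

Each token runs the shared expert plus two routed experts, so it pays for three FFNs instead of one; that extra compute is consistent with the slower generation speed in the table.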

Quickstart

python -m venv .venv && . .venv/bin/activate
pip install torch transformers datasets numpy accelerate

python upcycle.py                 # surgery + identity/divergence self-check
python fetch_data.py              # pull license-clean real data into data/  ...OR...
OPENAI_BASE_URL=... python teacher.py   # generate data from any OpenAI-compatible endpoint
PRECOMPUTE_DEVICE=cuda:0 python distill_precompute.py   # cache teacher top-k logits
python train.py                   # Mode A: distill + emergent specialization
python mode_b_domain_experts.py   # Mode B: supervised named domain experts
python probe_layers.py            # where/how much do experts specialize?
python sweep_train.py             # the data-scaling law
python compare_base.py            # honest head-to-head vs the base model

Repo layout

upcycle.py                  the surgery: dense FFN -> MoE (Drop-Upcycling, shared expert, smart router)
fetch_data.py               stream license-clean real data (wiki/code/oasst/dolly), robust + resumable
teacher.py                  generate training text from any OpenAI-compatible endpoint (reproducible teacher)
distill_precompute.py       cache a teacher's top-k logits for logit distillation
train.py                    Mode A — distillation fine-tune + load-balancing (emergent experts)
mode_b_domain_experts.py    Mode B — supervised domain routing -> named, loadable experts + a "secretary"
evaluate.py / probe_layers.py   generation sanity + per-layer routing/specialization analysis
sweep_train.py              data-scaling law: train on 1x/2x/4x/8x, fit, extrapolate
compare_base.py             honest quality + efficiency benchmark vs the base model
docs/router-notes.md        researched router upgrades and why we chose each

Notes

Two ways to get the teacher's signal. teacher.py generates training text from any OpenAI-compatible endpoint (portable, sequence-level distillation). Full logit distillation needs the teacher's per-token logits aligned to the student vocab, so distill_precompute.py runs the teacher (Qwen2.5-3B) locally, or you can use a same-tokenizer endpoint that exposes top_logprobs.
The expert loop in upcycle.py is intentionally a readable Python loop, not a fused grouped-GEMM kernel. That gives up roughly 2–3× of speed in exchange for code that is auditable and portable (it runs on new GPUs like Blackwell sm_120). Swap in a fused kernel when scaling up.
