Turn a small dense LLM into a Mixture-of-Experts model — then specialize the experts by distilling from a stronger teacher. A point-and-run toolkit + a demo built on Qwen2.5-0.5B.
This is a method toolkit, not a model release. It does the full sparse-upcycling pipeline — surgery → distillation → two flavors of expert specialization → evaluation → a scaling-law sweep → an honest head-to-head against the base model — on a single GPU, with reproducible scripts.
🤗 Demo model (Mode B, named domain experts): outlast85/qwen2.5-0.5b-moe-domain-experts — read its card for the honest limitations (it's a mechanism demo, not a better small model).
You take a trained dense model, clone each feed-forward block into N experts + a router (sparse upcycling), then train so the experts specialize. Two ways to specialize are included:
- Mode A — emergent: a learned router + load-balancing; experts drift apart on their own (how Mixtral / DeepSeek / Qwen-MoE actually work).
- Mode B — supervised "named experts": because our data is domain-tagged, we teach the router "code → expert 0, chat → expert 1, …". You get real, individually-loadable field experts + an always-on shared "generalist" — the "GP sends you to a specialist" design.
The honest headline, measured (compare_base.py): at this scale the upcycled model is
bigger, slower, and slightly worse than the base it came from. That's expected and the repo
shows it rather than hiding it — see Results. The value here is the method and the
specialization behavior, not model quality. Upcycling only beats the base at training scales far
beyond one GPU; making a better small model is not what a few-million-token run can do.
dense model (Qwen2.5-0.5B)
│
▼
1. SURGERY (upcycle.py) each block's FFN → [ N routed experts + 1 shared expert + a router ].
• Drop-Upcycling: re-randomize ~30% of each expert so clones diverge.
• shared always-on expert (the "generalist"); fp32 router; top-k gating.
• with drop=0, the upcycled model is BEHAVIORALLY IDENTICAL at init.
│
▼
2. TEACHER (distill_precompute.py) run a stronger same-family teacher (Qwen2.5-3B) over the
text and cache its top-20 logits = the "dark knowledge" to distill.
│
▼
3. TRAIN (train.py = Mode A | mode_b_domain_experts.py = Mode B)
3 losses: next-token LM + forward-KL to the teacher + a router term
(load-balance for A, domain-supervision for B). LR warmup, noisy-top-k.
│
▼
4. MEASURE (evaluate.py, probe_layers.py, sweep_train.py, compare_base.py)
generation sanity • per-layer routing/specialization • a data-scaling
law • and an honest base-vs-MoE benchmark.
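The distillation term in step 3 can be sketched as a forward KL against the teacher's cached top-k logits. This is a minimal pure-Python illustration, not the code in train.py: the function names, the toy vocabulary, and the choice to renormalize both distributions over the teacher's top-k support are all assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def topk_forward_kl(teacher_ids, teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) restricted to the teacher's cached top-k token ids.

    Assumption: both distributions are renormalized over the top-k support.
    """
    t_logp = log_softmax([v / temperature for v in teacher_logits])
    s_logp = log_softmax([student_logits[i] / temperature for i in teacher_ids])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))

# Toy vocabulary of 6 tokens; the cache keeps only the teacher's top-3.
teacher_ids = [4, 1, 0]
teacher_logits = [3.0, 1.5, 0.2]
student_logits = [0.2, 1.5, -1.0, 0.0, 3.0, -2.0]  # agrees with the teacher on the support
off_student = [2.0, 0.0, 0.0, 0.0, 0.5, 0.0]        # prefers a token the teacher ranks last

kl_match = topk_forward_kl(teacher_ids, teacher_logits, student_logits)
kl_off = topk_forward_kl(teacher_ids, teacher_logits, off_student)
print(f"matching student KL = {kl_match:.6f}, mismatched student KL = {kl_off:.4f}")
```

Caching only the top-k logits keeps the teacher pass cheap to store; the cost is that the student gets no signal about tokens outside that support.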
All numbers are from the included scripts on Qwen2.5-0.5B, teacher Qwen2.5-3B, trained on ~30k license-clean real examples (wikitext • codeparrot • OpenAssistant/oasst1 • Dolly-15k), on an RTX 5090 (+ a 3060 running the teacher in parallel).
Surgery is lossless at init — 494M → 1,749M params (4 routed + 1 shared, top-2); with
drop=0 the MoE reproduces the dense model to a max logit diff of 3e-5 ("grow without
forgetting"), and drop=0.3 makes the experts diverge as intended.
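Why cloning is lossless at init can be shown with a toy example. The sketch below assumes the routed gates are renormalized to sum to 1 over the selected experts and leaves out the shared expert, whose weighting in upcycle.py is not described here. The 1-D "FFN", the function names, and the Gaussian re-draw are illustrative stand-ins, not the repo's Drop-Upcycling implementation.

```python
import math
import random

def ffn(weights, x):
    """Toy 1-D feed-forward block: sum_j down_j * relu(up_j * x)."""
    return sum(down * max(0.0, up * x) for up, down in weights)

def upcycle(dense, n_experts, drop, rng):
    """Clone the dense FFN into n_experts copies; re-draw a `drop` fraction of units."""
    experts = []
    for _ in range(n_experts):
        clone = list(dense)
        for j in range(len(clone)):
            if rng.random() < drop:
                clone[j] = (rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5))
        experts.append(clone)
    return experts

def moe_forward(experts, router_logits, x, top_k=2):
    """Top-k routing with gates renormalized over the selected experts."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)
    unnorm = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(unnorm)
    return sum((g / total) * ffn(experts[i], x) for g, i in zip(unnorm, top))

dense = [(0.8, -0.3), (-0.5, 1.2), (1.1, 0.4), (0.2, -0.9)]
router_logits = [0.1, 1.3, -0.7, 0.6]
x = 1.7

identical = upcycle(dense, n_experts=4, drop=0.0, rng=random.Random(0))
# drop=1.0 is used here only so the toy divergence is deterministic.
perturbed = upcycle(dense, n_experts=4, drop=1.0, rng=random.Random(0))

gap_identical = abs(moe_forward(identical, router_logits, x) - ffn(dense, x))
weights_changed = all(expert != dense for expert in perturbed)
print(f"drop=0 gap: {gap_identical:.2e}; drop=1 changed every clone: {weights_changed}")
```

With identical experts, any convex combination of their outputs equals the dense output, so the router's initial values do not matter. Only floating-point rounding remains, which is the kind of gap the 3e-5 max logit diff reflects.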
Mode A — emergent specialization (probe_layers.py): routing separates domains in the
deep layers (early layers stay general — as the literature predicts). Peak per-layer
domain-separation 0.21, with a 2.9× routing contrast (e.g. expert E1 takes 58% of code
tokens but only 20% of web). Code claims its own experts; the language domains overlap more.
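The load-balancing term that Mode A relies on can be illustrated with the common Switch-Transformer-style auxiliary loss. This is a sketch of that standard form, not necessarily the exact term in train.py; the function name and the toy router outputs are assumptions.

```python
def load_balance_loss(router_probs, top_k=2):
    """Auxiliary loss N * sum_i f_i * P_i.

    router_probs: one list of softmax probabilities (over N experts) per token.
    f_i = share of routing slots dispatched to expert i,
    P_i = mean router probability assigned to expert i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    dispatch = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for probs in router_probs:
        top = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in top:
            dispatch[i] += 1.0 / (n_tokens * top_k)
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(dispatch, mean_prob))

# Balanced: every expert is picked equally often and gets equal average probability.
balanced = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.3],
    [0.3, 0.2, 0.1, 0.4],
]
# Collapsed: every token goes to experts 0 and 1.
collapsed = [[0.7, 0.2, 0.05, 0.05]] * 4

loss_balanced = load_balance_loss(balanced)
loss_collapsed = load_balance_loss(collapsed)
print(f"balanced = {loss_balanced:.3f}, collapsed = {loss_collapsed:.3f}")
```

The loss bottoms out at 1.0 under perfect balance and grows as routing concentrates, so it pushes against collapse while still leaving room for the per-domain contrast reported above.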
Mode B — supervised named experts (mode_b_domain_experts.py): the router learns the
dispatch (routing cross-entropy 1.55 → 0.11). Held-out routing accuracy:
| domain | → expert | accuracy |
|---|---|---|
| code | 0 | 97.7% |
| conversation | 1 | 64.5% |
| explanation | 2 | 88.0% |
| web | 3 | 97.8% |
| mean | | 87.0% |
A built-in "secretary" reads a prompt and reports the one specialist to load (plus the shared GP), which is the basis for selective expert offloading (cf. llama.cpp -ncmoe).
Each domain claims its expert (code→0, explanation→2, web→3). Conversation is the blurriest domain (64.5%): its tokens overlap everything, and it bleeds into the explanation expert.
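One plausible way a "secretary" could pick the specialist is to average the router's per-token probabilities over the prompt and take the top expert. The sketch below is hypothetical: `pick_specialist`, the averaging rule, and the toy probabilities are assumptions, and only the expert-to-domain mapping comes from the table above.

```python
# Expert ids and domain names as listed in the routing-accuracy table.
DOMAIN_EXPERTS = {0: "code", 1: "conversation", 2: "explanation", 3: "web"}

def pick_specialist(per_token_router_probs):
    """Average per-token router probabilities over a prompt; return the top expert."""
    n_experts = len(per_token_router_probs[0])
    totals = [0.0] * n_experts
    for probs in per_token_router_probs:
        for i, p in enumerate(probs):
            totals[i] += p
    mean = [t / len(per_token_router_probs) for t in totals]
    best = max(range(n_experts), key=lambda i: mean[i])
    return best, DOMAIN_EXPERTS[best], mean

# Hypothetical router outputs for a 4-token prompt.
prompt_probs = [
    [0.70, 0.10, 0.15, 0.05],
    [0.55, 0.05, 0.30, 0.10],
    [0.20, 0.15, 0.60, 0.05],  # one token leans "explanation"
    [0.80, 0.05, 0.10, 0.05],
]
expert_id, name, mean = pick_specialist(prompt_probs)
print(f"load expert {expert_id} ({name}) + the shared expert")
```

Averaging over the whole prompt means a single ambiguous token does not flip the decision, which matters most for overlapping domains such as conversation.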
Data-scaling law (sweep_train.py) — held-out perplexity vs training data:
| data | tokens | val ppl |
|---|---|---|
| 1× | 0.52M | 13.98 |
| 2× | 1.04M | 12.92 |
| 4× | 2.08M | 11.97 |
| 8× | 4.17M | 11.15 |
Fit: every 10× more data ≈ 22% lower perplexity. Extrapolated: ~2.6× data for 10% lower ppl, ~14× for 25%, and ~569× (billions of tokens) for 50% — the diminishing-returns wall, measured. (The 50% figure is a long extrapolation past the measured range — directionally certain, not exact.)
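The fit can be reproduced from the four table rows with an ordinary least-squares line in log-log space. This is a sketch assuming a pure power law, ppl ∝ tokens^slope; sweep_train.py may fit a different functional form, and the variable names here are illustrative.

```python
import math

# (tokens in millions, held-out perplexity) from the table above.
tokens_m = [0.52, 1.04, 2.08, 4.17]
val_ppl = [13.98, 12.92, 11.97, 11.15]

# Least-squares slope of log10(ppl) against log10(tokens).
xs = [math.log10(t) for t in tokens_m]
ys = [math.log10(p) for p in val_ppl]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)

# Fractional perplexity drop for every 10x more data.
drop_per_decade = 1.0 - 10.0 ** slope

def data_multiplier(ppl_reduction):
    """Data multiple the fitted power law needs for a given fractional ppl drop."""
    return (1.0 - ppl_reduction) ** (1.0 / slope)

print(f"exponent = {slope:.4f}, ppl drop per 10x data = {drop_per_decade:.1%}")
for target in (0.10, 0.25, 0.50):
    print(f"{target:.0%} lower ppl needs ~{data_multiplier(target):.1f}x data")
```

The 10% and 25% multipliers are stable under small changes to the inputs. The 50% multiplier is not: it raises a number to a power of roughly 1/slope, so rounding in the table values moves it noticeably.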
Honest head-to-head vs base (compare_base.py, same held-out set):
| model | val ppl | total params | active/token | gen tok/s | weights |
|---|---|---|---|---|---|
| base Qwen2.5-0.5B | 9.27 | 494M | 494M | 50.6 | 0.99 GB |
| our upcycled MoE | 13.46 | 1,749M | 1,122M | 20.2 | 3.50 GB |
The base wins on quality and efficiency. Upcycling + a 4M-token budget can't beat a model pretrained on ~18T tokens; Drop-Upcycling also deliberately perturbs pretrained weights, and recovering them needs scale we don't have here. Upcycling wins with large training budgets, and when it is compared against a dense model of equal total size / compute rather than the tiny 0.5B you started from. This repo measures the regime where it loses, honestly.
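The active/token and weights columns can be cross-checked from the two parameter totals. The arithmetic below assumes that every added parameter is an FFN copy (router parameters treated as negligible) and that weights are stored at 2 bytes per parameter (bf16/fp16); both are assumptions, not statements from the repo.

```python
dense_total_m = 494.0   # base model parameters, in millions
moe_total_m = 1749.0    # upcycled model parameters, in millions
n_routed, n_shared, top_k = 4, 1, 2

# The dense model already holds one FFN copy per block,
# so surgery adds (n_routed + n_shared - 1) more copies.
extra_copies = n_routed + n_shared - 1
ffn_m = (moe_total_m - dense_total_m) / extra_copies

# Everything that is not FFN (attention, embeddings, norms) stays always-on.
non_ffn_m = dense_total_m - ffn_m

# Per token: the shared expert plus the top-k routed experts are active.
active_m = non_ffn_m + (top_k + n_shared) * ffn_m

bytes_per_param = 2  # assumption: bf16/fp16 storage
weights_gb = moe_total_m * 1e6 * bytes_per_param / 1e9

print(f"FFN copies: {ffn_m:.2f}M params each")
print(f"active per token: {active_m:.1f}M")
print(f"weights: {weights_gb:.2f} GB")
```

Under these assumptions the FFN stack accounts for roughly two thirds of the dense model, which is why keeping three of five expert copies active per token more than doubles the compute per token.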
python -m venv .venv && . .venv/bin/activate
pip install torch transformers datasets numpy accelerate
python upcycle.py # surgery + identity/divergence self-check
python fetch_data.py # pull license-clean real data into data/ ...OR...
OPENAI_BASE_URL=... python teacher.py # generate data from any OpenAI-compatible endpoint
PRECOMPUTE_DEVICE=cuda:0 python distill_precompute.py # cache teacher top-k logits
python train.py # Mode A: distill + emergent specialization
python mode_b_domain_experts.py # Mode B: supervised named domain experts
python probe_layers.py # where/how much do experts specialize?
python sweep_train.py # the data-scaling law
python compare_base.py # honest head-to-head vs the base model

upcycle.py the surgery: dense FFN -> MoE (Drop-Upcycling, shared expert, smart router)
fetch_data.py stream license-clean real data (wiki/code/oasst/dolly), robust + resumable
teacher.py generate training text from any OpenAI-compatible endpoint (reproducible teacher)
distill_precompute.py cache a teacher's top-k logits for logit distillation
train.py Mode A — distillation fine-tune + load-balancing (emergent experts)
mode_b_domain_experts.py Mode B — supervised domain routing -> named, loadable experts + a "secretary"
evaluate.py / probe_layers.py generation sanity + per-layer routing/specialization analysis
sweep_train.py data-scaling law: train on 1x/2x/4x/8x, fit, extrapolate
compare_base.py honest quality + efficiency benchmark vs the base model
docs/router-notes.md researched router upgrades and why we chose each
- Two ways to get the teacher's signal. teacher.py generates training text from any OpenAI-compatible endpoint (portable, sequence-level distillation). Full logit distillation needs the teacher's per-token logits aligned to the student vocab, so it runs the teacher locally (distill_precompute.py, Qwen2.5-3B), or against a same-tokenizer endpoint exposing top_logprobs.
- The expert loop in upcycle.py is intentionally a readable Python loop, not a fused grouped-GEMM kernel. It is auditable and portable (runs on new GPUs like Blackwell sm_120); a fused kernel would be ~2–3× faster but opaque. Swap in a fused kernel when scaling up.
MIT © 2026 Gal Cohen

