moe-upcycle

Turn a small dense LLM into a Mixture-of-Experts model — then specialize the experts by distilling from a stronger teacher. A point-and-run toolkit + a demo built on Qwen2.5-0.5B.

This is a method toolkit, not a model release. It does the full sparse-upcycling pipeline — surgery → distillation → two flavors of expert specialization → evaluation → a scaling-law sweep → an honest head-to-head against the base model — on a single GPU, with reproducible scripts.

🤗 Demo model (Mode B, named domain experts): outlast85/qwen2.5-0.5b-moe-domain-experts — read its card for the honest limitations (it's a mechanism demo, not a better small model).


TL;DR — what it does, and the honest headline

You take a trained dense model, clone each feed-forward block into N experts + a router (sparse upcycling), then train so the experts specialize. Two ways to specialize are included:

  • Mode A — emergent: a learned router + load-balancing; experts drift apart on their own (how Mixtral / DeepSeek / Qwen-MoE actually work).
  • Mode B — supervised "named experts": because our data is domain-tagged, we teach the router "code → expert 0, chat → expert 1, …". You get real, individually-loadable field experts + an always-on shared "generalist" — the "GP sends you to a specialist" design.
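The two modes differ mainly in the router term of the loss. The following is a minimal sketch of what each term could look like, not the code in train.py or mode_b_domain_experts.py. The function names are hypothetical, and the load-balance form (n_experts · Σ f_i · P_i, generalized to top-k) follows the common Switch-Transformer-style formulation, which is an assumption about this repo.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits, top_k=2):
    """Mode A style term (assumed form): n_experts * sum_i f_i * P_i.

    f_i = fraction of routing slots that went to expert i,
    P_i = mean router probability assigned to expert i.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    frac_routed = [0.0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in router_logits:
        probs = softmax(logits)
        chosen = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in chosen:
            frac_routed[i] += 1.0 / (n_tokens * top_k)
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac_routed, mean_prob))

def domain_supervision_loss(router_logits, domain_expert):
    """Mode B style term (assumed form): cross-entropy between the router's
    distribution and the expert index that the token's domain tag maps to."""
    losses = [-math.log(softmax(logits)[target])
              for logits, target in zip(router_logits, domain_expert)]
    return sum(losses) / len(losses)
```

With four experts, a router that has learned nothing gives a supervision loss of ln 4 per token, and the load-balance term grows as routing concentrates on a few experts.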

The honest headline, measured (compare_base.py): at this scale the upcycled model is bigger, slower, and slightly worse than the base it came from. That's expected and the repo shows it rather than hiding it — see Results. The value here is the method and the specialization behavior, not model quality. Upcycling only beats the base at training scales far beyond one GPU; making a better small model is not what a few-million-token run can do.


The pipeline

dense model (Qwen2.5-0.5B)
   │
   ▼
1. SURGERY  (upcycle.py)      each block's FFN → [ N routed experts + 1 shared expert + a router ].
                             • Drop-Upcycling: re-randomize ~30% of each expert so clones diverge.
                             • shared always-on expert (the "generalist"); fp32 router; top-k gating.
                             • with drop=0, the upcycled model is BEHAVIORALLY IDENTICAL at init.
   │
   ▼
2. TEACHER  (distill_precompute.py)   run a stronger same-family teacher (Qwen2.5-3B) over the
                             text and cache its top-20 logits = the "dark knowledge" to distill.
   │
   ▼
3. TRAIN    (train.py = Mode A | mode_b_domain_experts.py = Mode B)
                             3 losses: next-token LM + forward-KL to the teacher + a router term
                             (load-balance for A, domain-supervision for B). LR warmup, noisy-top-k.
   │
   ▼
4. MEASURE  (evaluate.py, probe_layers.py, sweep_train.py, compare_base.py)
                             generation sanity • per-layer routing/specialization • a data-scaling
                             law • and an honest base-vs-MoE benchmark.
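The forward-KL term in step 3 works from the cached top-20 teacher logits rather than a full-vocabulary distribution. Below is a rough sketch for a single token position. It assumes the teacher distribution is renormalized over its cached entries while the student keeps its full-vocabulary softmax; the repo may handle the truncated tail differently. `topk_forward_kl` and its parameters are illustrative names.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def topk_forward_kl(student_logits, teacher_top_ids, teacher_top_logits, temperature=1.0):
    """KL(teacher || student) at one position, summed over the teacher's cached top-k.

    student_logits:     full-vocabulary logits from the student
    teacher_top_ids:    vocabulary ids of the cached teacher entries
    teacher_top_logits: the matching cached teacher logits
    """
    # Teacher probabilities, renormalized over the cached entries only (assumption).
    t_logp = log_softmax([v / temperature for v in teacher_top_logits])
    # Student log-probabilities over the whole vocabulary.
    s_logp = log_softmax([v / temperature for v in student_logits])
    kl = 0.0
    for token_id, lt in zip(teacher_top_ids, t_logp):
        kl += math.exp(lt) * (lt - s_logp[token_id])
    return kl
```

Because the student's mass on the cached ids can only sum to at most 1, this quantity is non-negative, and it reaches zero only when the student matches the teacher on those ids and puts no mass elsewhere.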

Results

All numbers are from the included scripts on Qwen2.5-0.5B, teacher Qwen2.5-3B, trained on ~30k license-clean real examples (wikitext • codeparrot • OpenAssistant/oasst1 • Dolly-15k), on an RTX 5090 (+ a 3060 running the teacher in parallel).

Surgery is lossless at init: 494M → 1,749M params (4 routed + 1 shared, top-2); with drop=0 the MoE reproduces the dense model to a max logit diff of 3e-5 ("grow without forgetting"), and drop=0.3 makes the experts diverge as intended.
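Why drop=0 is lossless can be seen in a toy version: if every expert is an exact copy of the dense FFN and the selected gate weights sum to 1, the weighted mix equals the dense output. The sketch below is a simplified stand-in, not upcycle.py. It uses a tiny SiLU MLP, omits the shared expert (how the repo folds it in is not described here), and re-randomizes hidden units with a fixed standard deviation instead of whatever statistics Drop-Upcycling uses in the repo. All names are hypothetical.

```python
import math
import random

def silu(z):
    return z / (1.0 + math.exp(-z))

def ffn(x, w_up, w_down):
    """Tiny MLP standing in for one transformer FFN block."""
    hidden = [silu(sum(w * xi for w, xi in zip(row, x))) for row in w_up]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_down]

def make_expert(w_up, w_down, drop, rng):
    """Copy the dense FFN, then re-randomize a `drop` fraction of hidden units."""
    up = [row[:] for row in w_up]
    down = [row[:] for row in w_down]
    d_ff = len(up)
    for j in rng.sample(range(d_ff), int(round(drop * d_ff))):
        up[j] = [rng.gauss(0.0, 0.5) for _ in up[j]]   # incoming weights of unit j
        for row in down:
            row[j] = rng.gauss(0.0, 0.5)               # outgoing weights of unit j
    return up, down

def moe_forward(x, experts, router, top_k=2):
    """Top-k routing with gate weights renormalized over the chosen experts."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    m = max(scores[i] for i in chosen)
    gates = [math.exp(scores[i] - m) for i in chosen]
    z = sum(gates)
    out = [0.0] * len(x)
    for i, g in zip(chosen, gates):
        y = ffn(x, *experts[i])
        out = [o + (g / z) * yi for o, yi in zip(out, y)]
    return out

rng = random.Random(0)
d, d_ff, n_experts = 4, 8, 4
w_up = [[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(d_ff)]
w_down = [[rng.gauss(0, 0.5) for _ in range(d_ff)] for _ in range(d)]
router = [[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(n_experts)]
x = [rng.gauss(0, 1) for _ in range(d)]

def max_diff(drop):
    """Largest output difference between the dense FFN and its upcycled MoE."""
    experts = [make_expert(w_up, w_down, drop, rng) for _ in range(n_experts)]
    dense, sparse = ffn(x, w_up, w_down), moe_forward(x, experts, router)
    return max(abs(a - b) for a, b in zip(dense, sparse))
```

Calling `max_diff(0.0)` gives a difference at floating-point noise level, while `max_diff(0.3)` gives a clearly non-zero one, which is the same identity/divergence contrast the self-check reports at full scale.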

Mode A — emergent specialization (probe_layers.py): routing separates domains in the deep layers (early layers stay general — as the literature predicts). Peak per-layer domain-separation 0.21, with a 2.9× routing contrast (e.g. expert E1 takes 58% of code tokens but only 20% of web). Code claims its own experts; the language domains overlap more.

Mode B — supervised named experts (mode_b_domain_experts.py): the router learns the dispatch (routing cross-entropy 1.55 → 0.11). Held-out routing accuracy:

domain         → expert   accuracy
code           0          97.7%
conversation   1          64.5%
explanation    2          88.0%
web            3          97.8%
mean                      87.0%

A built-in "secretary" reads a prompt and reports the one specialist to load (+ the shared GP) — the basis for selective expert offloading (cf. llama.cpp -ncmoe). The blurriest domain (conversation, 64.5%) is honestly the one whose tokens overlap everything.
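One plausible way such a "secretary" could pick a specialist is a majority vote over the router's per-token choices for the prompt. The sketch below assumes that aggregation rule; the actual logic in mode_b_domain_experts.py may differ. The domain-to-expert mapping is taken from the table above, and the function and key names are hypothetical.

```python
from collections import Counter

# Domain assignment from the routing-accuracy table.
DOMAIN_OF_EXPERT = {0: "code", 1: "conversation", 2: "explanation", 3: "web"}

def secretary(per_token_router_logits):
    """Return the single specialist to load for a prompt, plus the shared expert.

    per_token_router_logits: one list of expert logits per prompt token.
    Aggregation by majority vote is an assumption of this sketch.
    """
    votes = Counter(
        max(range(len(logits)), key=logits.__getitem__)
        for logits in per_token_router_logits
    )
    expert, _ = votes.most_common(1)[0]
    return {"load": [f"expert_{expert}", "shared"], "domain": DOMAIN_OF_EXPERT[expert]}
```

Only the returned experts would need to sit in fast memory; the rest can stay offloaded.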

[Figure: Mode B routing heatmap]

Each domain claims its expert (code→0, explanation→2, web→3); conversation honestly bleeds into the explanation expert — its tokens overlap everything.

Data-scaling law (sweep_train.py) — held-out perplexity vs training data:

data tokens   val ppl
0.52M         13.98
1.04M         12.92
2.08M         11.97
4.17M         11.15

Fit: every 10× more data ≈ 22% lower perplexity. Extrapolated: ~2.6× data for 10% lower ppl, ~14× for 25%, and ~569× (billions of tokens) for 50% — the diminishing-returns wall, measured. (The 50% figure is a long extrapolation past the measured range — directionally certain, not exact.)
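The fit can be reproduced approximately from the four rows above. This sketch assumes a pure power law, ppl = a · D^(-b), fitted by least squares in log-log space; sweep_train.py may use a different functional form. Because the table values are rounded, the long extrapolations will not match the quoted figures digit for digit.

```python
import math

# (training tokens in millions, held-out perplexity), copied from the table.
runs = [(0.52, 13.98), (1.04, 12.92), (2.08, 11.97), (4.17, 11.15)]

def fit_power_law(points):
    """Least-squares line through (log D, log ppl); returns exponent b and prefactor a."""
    xs = [math.log(d) for d, _ in points]
    ys = [math.log(p) for _, p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope, math.exp(my - slope * mx)

def data_factor_for(ppl_drop, b):
    """How many times more data the fit asks for to cut perplexity by `ppl_drop`."""
    return (1.0 - ppl_drop) ** (-1.0 / b)

b, a = fit_power_law(runs)
drop_per_decade = 1.0 - 10.0 ** (-b)   # fractional ppl reduction per 10x data

for target in (0.10, 0.25, 0.50):
    print(f"{target:.0%} lower ppl needs about {data_factor_for(target, b):.1f}x data")
```

The 10% and 25% targets land close to the quoted factors. The 50% target lands in the same several-hundred-fold range but is very sensitive to the fitted exponent, which is why it should be read as an order of magnitude.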

[Figure: Scaling law]

Honest head-to-head vs base (compare_base.py, same held-out set):

model               val ppl   total params   active/token   gen tok/s   weights
base Qwen2.5-0.5B   9.27      494M           494M           50.6        0.99 GB
our upcycled MoE    13.46     1,749M         1,122M         20.2        3.50 GB

The base wins on quality and efficiency. Upcycling + a 4M-token budget can't beat a model pretrained on ~18T tokens; Drop-Upcycling also deliberately perturbs pretrained weights, and recovering them needs scale we don't have here. When upcycling wins: large training budgets, and when compared against a dense model of equal total size / compute — not against the tiny 0.5B you started from. This repo measures the regime where it loses, honestly.
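The parameter and weight-size columns can be checked with back-of-envelope arithmetic. The sketch assumes the published Qwen2.5-0.5B dimensions (hidden size 896, FFN intermediate size 4864, 24 layers, three FFN projections per layer), a bias-free linear router per layer, and 2 bytes per parameter. None of these are stated in this README, so treat them as assumptions.

```python
# Assumed Qwen2.5-0.5B dimensions (from its published config, not from this README).
HIDDEN, INTERMEDIATE, LAYERS = 896, 4864, 24
BASE_PARAMS = 494e6        # dense total, from the comparison table
N_ROUTED, TOP_K = 4, 2     # plus one always-on shared expert

ffn_per_layer = 3 * HIDDEN * INTERMEDIATE     # gate, up and down projections
ffn_all_layers = ffn_per_layer * LAYERS
router_params = HIDDEN * N_ROUTED * LAYERS    # assumed: one linear router per layer

# Accounting assumption: the shared expert takes the place of the original FFN,
# so it is already counted in BASE_PARAMS; routed experts are added on top.
total = BASE_PARAMS + N_ROUTED * ffn_all_layers + router_params
# Per token: all dense parts + shared expert + the top-k routed experts.
active = BASE_PARAMS + TOP_K * ffn_all_layers + router_params

def weights_gb(params, bytes_per_param=2):
    """Checkpoint size in GB at 2 bytes per parameter (bf16/fp16, assumed)."""
    return params * bytes_per_param / 1e9

print(f"total  ~ {total / 1e6:,.0f}M params, {weights_gb(total):.2f} GB")
print(f"active ~ {active / 1e6:,.0f}M params per token")
```

Under these assumptions each token still passes through more than twice the base model's parameters, which is consistent with the lower generation speed in the table.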


Quickstart

python -m venv .venv && . .venv/bin/activate
pip install torch transformers datasets numpy accelerate

python upcycle.py                 # surgery + identity/divergence self-check
python fetch_data.py              # pull license-clean real data into data/  ...OR...
OPENAI_BASE_URL=... python teacher.py   # generate data from any OpenAI-compatible endpoint
PRECOMPUTE_DEVICE=cuda:0 python distill_precompute.py   # cache teacher top-k logits
python train.py                   # Mode A: distill + emergent specialization
python mode_b_domain_experts.py   # Mode B: supervised named domain experts
python probe_layers.py            # where/how much do experts specialize?
python sweep_train.py             # the data-scaling law
python compare_base.py            # honest head-to-head vs the base model

Repo layout

upcycle.py                  the surgery: dense FFN -> MoE (Drop-Upcycling, shared expert, smart router)
fetch_data.py               stream license-clean real data (wiki/code/oasst/dolly), robust + resumable
teacher.py                  generate training text from any OpenAI-compatible endpoint (reproducible teacher)
distill_precompute.py       cache a teacher's top-k logits for logit distillation
train.py                    Mode A — distillation fine-tune + load-balancing (emergent experts)
mode_b_domain_experts.py    Mode B — supervised domain routing -> named, loadable experts + a "secretary"
evaluate.py / probe_layers.py   generation sanity + per-layer routing/specialization analysis
sweep_train.py              data-scaling law: train on 1x/2x/4x/8x, fit, extrapolate
compare_base.py             honest quality + efficiency benchmark vs the base model
docs/router-notes.md        researched router upgrades and why we chose each

Notes

  • Two ways to get the teacher's signal. teacher.py generates training text from any OpenAI-compatible endpoint (portable, sequence-level distillation). Full logit distillation needs the teacher's per-token logits aligned to the student vocab, so it runs the teacher locally (distill_precompute.py, Qwen2.5-3B) — or a same-tokenizer endpoint exposing top_logprobs.
  • The expert loop in upcycle.py is intentionally a readable Python loop, not a fused grouped-GEMM kernel. We chose auditable and portable (it runs on new GPUs like Blackwell sm_120) over a kernel that is ~2–3× faster but opaque. Swap in a fused kernel when scaling up.
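A readable expert loop of the kind described above usually has a gather / run / scatter-add shape. The following is a framework-free sketch of that pattern, not the code in upcycle.py. `moe_expert_loop`, the `routes` structure, and the toy experts are all hypothetical.

```python
def moe_expert_loop(tokens, experts, routes):
    """Dispatch tokens to experts with a plain loop over experts.

    tokens:  list of input vectors, one per token
    experts: list of callables, each mapping a batch of vectors to a batch of vectors
    routes:  routes[t] = [(expert_id, gate_weight), ...] for token t (its top-k choices)
    """
    out = [[0.0] * len(tok) for tok in tokens]
    for e, expert in enumerate(experts):
        # Gather: which tokens picked expert e, and with what gate weight.
        picked = [(t, gate) for t, r in enumerate(routes) for (eid, gate) in r if eid == e]
        if not picked:
            continue  # no token selected this expert in this batch
        # Run the expert once on its mini-batch.
        batch_out = expert([tokens[t] for t, _ in picked])
        # Scatter-add: weight each result by its gate and add it to the token's output.
        for (t, gate), y in zip(picked, batch_out):
            out[t] = [o + gate * yi for o, yi in zip(out[t], y)]
    return out

# Toy experts that just scale their input, to make the dispatch easy to follow.
toy_experts = [lambda batch, s=s: [[s * v for v in x] for x in batch]
               for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [[1.0, 2.0], [3.0, 4.0]]
routes = [[(0, 0.25), (2, 0.75)], [(1, 1.0)]]
mixed = moe_expert_loop(tokens, toy_experts, routes)
```

A fused grouped-GEMM kernel does the same gather, compute and scatter in one launch; the loop trades that speed for code that can be stepped through line by line.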

License

MIT © 2026 Gal Cohen
