A from-scratch discrete diffusion language model — and a systematic scaling study.
This repository contains two related projects:
- Original project: a 10.7M-parameter character-level discrete diffusion LM trained on Tiny Shakespeare, built from scratch to understand how discrete diffusion works and how it compares to GPT. Complete, published on Substack.
- Scaling study (in progress): a systematic comparison of 6 forward noise schedules and 2 masking strategies for absorbing-state discrete diffusion at the 124M-parameter scale on FineWeb-Edu, running on Northeastern's Explorer HPC cluster.
Same architecture, same data — diffusion finishes in 39 steps vs GPT's 225 steps
Most language models generate text left-to-right, one token at a time. Discrete diffusion models generate text all at once — starting from a fully masked sequence and iteratively revealing tokens in parallel, like developing a photograph.
This project builds both from scratch with an identical architecture (same params, same data, same transformer) so you can see exactly what changes — and it turns out to be surprisingly little:
| # | What Changes | GPT | Diffusion |
|---|---|---|---|
| 1 | Vocabulary | Standard chars | + 1 MASK token (_) |
| 2 | Attention | Causal (sees only left ←) | Bidirectional (sees everything ↔) |
| 3 | Training objective | Predict next token | Denoise masked tokens |
| 4 | Loss scope | All positions | Masked positions only |
| 5 | Generation | Sequential, left-to-right | Parallel, by confidence |
That's it. Same transformer, same RoPE, same RMSNorm, same ReluSquared MLP. ~80% of the code is shared.
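
Row 2 of the table is the only change inside the transformer itself. As a minimal sketch in plain Python (not the repository's code; `attention_mask` is a hypothetical helper), the two attention patterns differ only in which key positions each query may see:

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[bool]]:
    """Return mask[i][j] == True where query position i may attend to key position j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

gpt_mask = attention_mask(4, causal=True)         # lower-triangular: left context only
diffusion_mask = attention_mask(4, causal=False)  # all True: full context

for row in gpt_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

In a PyTorch model the same choice is usually a single flag or mask tensor passed to the attention call, which is why so much of the code can be shared.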
All models trained on Tiny Shakespeare (~1.1M characters, 65 unique + 1 MASK = 66 vocab).
| Model | Architecture | Params | Train Loss | Val Loss | Iters | Time |
|---|---|---|---|---|---|---|
| MLP Denoiser (step 1) | 2-layer FF | 0.37M | 3.31 | 3.31 | 5,000 | ~6 min |
| Small Transformer (step 2) | 4L / 4H / 128E | 1.6M | 2.16 | 2.27 | 10,000 | ~7 min |
| Diffusion (final) | 6L / 6H / 384E | 10.7M | 1.93 | 2.09 | 10,000 | ~47 min |
| GPT (final) | 6L / 6H / 384E | 10.7M | 0.13 | 4.09 | 5,000 | ~24 min |
GPT produces better text than diffusion at this scale. This is expected and well-documented in the literature — diffusion LMs typically need 3-5× more training to match autoregressive quality at small scale. The quality gap narrows significantly at larger scale (see MDLM, SEDD).
What diffusion demonstrates:
- ⚡ ~6× fewer forward passes for generation (39 vs 225 steps)
- 🔀 Parallel decoding — tokens appear everywhere simultaneously
- 🧩 A fundamentally different approach to language modeling
Diffusion: Parallel Unmasking
Step 0: ·················································· (all masked)
Step 5: Be····d him to····with·a·························
Step 15: Be hold him to him with a milling his············
Step 30: Be hold him to him with a milling his cold, As he
GPT: Sequential Typing
Token 0: First Citizen:\nB
Token 10: First Citizen:\nBefore we p
Token 20: First Citizen:\nBefore we proceed any
Token 48: First Citizen:\nBefore we proceed any further, hear me speak.
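
The diffusion trace above reveals several positions per step. The sketch below shows one way "parallel, by confidence" decoding can work. It assumes a confidence threshold with a fallback to the single most confident position; the README does not state the exact rule used in `diffusion.py`. `toy_denoiser` is a hypothetical stand-in for the model.

```python
import random

MASK = "_"  # the MASK symbol from the comparison table

def toy_denoiser(seq, target, rng):
    """Stand-in for the model: map each masked position to (token, confidence).

    Hypothetical: a real model would return softmax probabilities. Here the
    prediction is simply the target character with a random confidence.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(seq) if tok == MASK}

def generate(target, threshold=0.7, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(target)  # start fully masked
    steps = 0
    while MASK in seq:
        preds = toy_denoiser(seq, target, rng)
        # Reveal every masked position whose confidence clears the threshold.
        chosen = [i for i, (_, conf) in preds.items() if conf >= threshold]
        # Always reveal at least the most confident one so the loop terminates.
        if not chosen:
            chosen = [max(preds, key=lambda i: preds[i][1])]
        for i in chosen:
            seq[i] = preds[i][0]
        steps += 1
    return "".join(seq), steps

text, steps = generate("Be hold him to him")
print(text, steps)
```

Because each step can reveal many positions, the number of forward passes is at most the sequence length and usually much smaller, which is where the step savings over sequential generation come from.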
git clone https://github.com/BrutalCaeser/microDLM.git
cd microDLM
pip install torch
# Download data
mkdir -p data
wget -O data/shakespeare.txt \
https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
# Train diffusion LM (~47 min on T4 GPU)
python diffusion.py --train
# Generate text
python diffusion.py --generate
# Same for GPT (~24 min on T4 GPU)
python gpt.py --train --generate
# Animated race visualization
python visualize.py

| File | Concept | Loss |
|---|---|---|
| `steps/step0_masking.py` | Forward process only: no neural net, just math | n/a |
| `steps/step1_denoise_mlp.py` | MLP denoiser: proves the training loop works | 3.31 |
| `steps/step2_transformer.py` | Bidirectional transformer: the quality jump | 2.16 |
| `diffusion.py` | Full model, 10.7M params | 2.09 val |
Input tokens (B, T)
↓
Token Embedding → (B, T, 384)
↓
RMSNorm
↓
Transformer Block × 6
├─ RMSNorm
├─ Multi-Head Attention (6 heads, RoPE, QK-Norm)
│ ← Bidirectional (diffusion) or Causal (GPT)
├─ Residual
├─ RMSNorm
├─ MLP: Linear → ReluSquared → Linear (4× expansion)
└─ Residual
↓
RMSNorm
↓
Linear Head → (B, T, 66) ← logits over vocabulary
Every discrete diffusion language model makes a choice about its forward process: how quickly and in what pattern should tokens be masked during training? MDLM uses cosine. SEDD uses log-linear. LLaDA uses linear. These choices are made once, justified briefly, and never compared head-to-head on the same model, data, and scale.
This study fills that gap. We train identical 124M parameter diffusion LMs on FineWeb-Edu under 6 different noise schedules and 2 masking strategies (uniform vs. entropy-weighted), producing a 12-run controlled experiment.
Research questions:
- Which noise schedule minimizes perplexity for absorbing-state discrete diffusion?
- Does per-token entropy-weighted masking improve over uniform masking?
- How do schedule choices affect generation quality, not just likelihood?
- Do the answers change between small (10M) and medium (124M) scale?
The 6 noise schedules (α(t) = survival probability at noise level t):
| Schedule | α(t) | α(1) | Used by |
|---|---|---|---|
| Linear | 1 − t | 0.0 | LLaDA |
| Cosine | cos(πt/2) | 0.0 | MDLM |
| Log-linear | 1 − (1−10⁻³)·t | 0.001 | SEDD (corrected) |
| Square root | 1 − √t | 0.0 | — |
| Clipped cosine | 0.05 + 0.90·cos(πt/2) | 0.05 | Block Diffusion |
| Sigmoid | 1/(1+exp(10·(t−0.5))) | ~0.0 | — |
Note on log-linear: SEDD Appendix C.1 defines this as a linear schedule in log-space. The formula `exp(−t·ln 2)` that appears in some implementations gives α(1) = 0.5, i.e. only 50% masking at maximum noise, which is incorrect. The formula above, `1 − (1−10⁻³)·t`, is the correct implementation. This was caught and fixed during Phase 4 pilots (see CHANGELOG).
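
The schedule table translates directly into code. This is a minimal sketch using only the formulas listed above; the dictionary layout and the name `buggy_loglinear` are illustrative and not taken from `train_experiment.py`.

```python
import math

# alpha(t): probability that a token is still unmasked at noise level t in [0, 1].
SCHEDULES = {
    "linear":         lambda t: 1.0 - t,
    "cosine":         lambda t: math.cos(math.pi * t / 2),
    "loglinear":      lambda t: 1.0 - (1.0 - 1e-3) * t,
    "sqrt":           lambda t: 1.0 - math.sqrt(t),
    "clipped_cosine": lambda t: 0.05 + 0.90 * math.cos(math.pi * t / 2),
    "sigmoid":        lambda t: 1.0 / (1.0 + math.exp(10.0 * (t - 0.5))),
}

def buggy_loglinear(t):
    """The incorrect variant from the note above: exp(-t * ln 2)."""
    return math.exp(-t * math.log(2))

for name, alpha in SCHEDULES.items():
    print(f"{name:15s} alpha(0)={alpha(0.0):.4f}  alpha(1)={alpha(1.0):.4f}")
print(f"{'buggy_loglinear':15s} alpha(1)={buggy_loglinear(1.0):.4f}")
```

Printing the endpoints is a cheap sanity check: the buggy formula stands out immediately with α(1) = 0.5. It also shows that, as written, clipped cosine and sigmoid start slightly below α(0) = 1 (0.95 and about 0.993).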
The 2 masking strategies:
- Uniform: every token is masked with equal probability 1−α(t).
- Entropy-weighted: tokens with high GPT-2 entropy (hard to predict) are masked less often, giving the model more context at difficult positions. Mask probability is `(1−α(t)) × clamp(2.0 − ent_norm, 0.5, 1.5)`, where `ent_norm` is normalized entropy (mean 1.0).
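
A small sketch of the entropy-weighted rule, written in plain Python for a single token. The final cap at 1.0 is an assumption added here so the result stays a valid probability at high noise levels; the README does not say how the training script handles that case.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def mask_probability(alpha_t, ent_norm):
    """Per-token mask probability: (1 - alpha_t) * clamp(2.0 - ent_norm, 0.5, 1.5)."""
    weight = clamp(2.0 - ent_norm, 0.5, 1.5)
    # Assumption: cap at 1.0, since (1 - alpha_t) * 1.5 can exceed 1.
    return min(1.0, (1.0 - alpha_t) * weight)

alpha_t = 0.5  # half of the tokens survive under uniform masking
for ent_norm in (0.0, 1.0, 2.0):
    print(ent_norm, mask_probability(alpha_t, ent_norm))
# 0.0 0.75   (easy token: masked more often)
# 1.0 0.5    (average token: same as uniform)
# 2.0 0.25   (hard token: masked less often)
```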
The 12-run matrix:
| Schedule | Uniform | Entropy-weighted |
|---|---|---|
| Linear | Run 1 | Run 7 |
| Cosine | Run 2 | Run 8 |
| Log-linear | Run 3 | Run 9 |
| Square root | Run 4 | Run 10 |
| Clipped cosine | Run 5 | Run 11 |
| Sigmoid | Run 6 | Run 12 |
Every run: identical 124M model, identical 1B tokens of FineWeb-Edu, identical hyperparameters. The only variable is the forward process.
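
The run numbering in the matrix can be reproduced with a two-level product. The identifiers follow the run names used in the results table (for example `cosine_uniform`); how `launch_all_experiments.sh` actually enumerates runs is not shown in this README, so treat this as an illustration.

```python
from itertools import product

SCHEDULES = ["linear", "cosine", "loglinear", "sqrt", "clipped_cosine", "sigmoid"]
MASKINGS = ["uniform", "entropy_weighted"]

# Masking varies slowest, so runs 1-6 are uniform and runs 7-12 are entropy-weighted.
runs = {
    run_id: f"{schedule}_{masking}"
    for run_id, (masking, schedule) in enumerate(product(MASKINGS, SCHEDULES), start=1)
}

for run_id, name in runs.items():
    print(f"Run {run_id:2d}: {name}")
```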
Fixed hyperparameters across all runs:
| Hyperparameter | Value |
|---|---|
| Architecture | 12L / 12H / 768E, SwiGLU, RoPE, RMSNorm, weight tying |
| Parameters | ~124M |
| Tokenizer | GPT-2 BPE (50257 vocab + 1 MASK = 50258) |
| Dataset | FineWeb-Edu sample-10BT, first 1B tokens |
| Context length | 1024 tokens |
| Batch size | 32 (effective) |
| Optimizer | AdamW, lr=6e-4, cosine decay to 6e-5, 2000-step warmup |
| Precision | bfloat16 (A100) |
| Training steps | 100,000 |
| Hardware | 1× A100 per run (Explorer HPC gpu partition: 1 GPU/job limit) |
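
The optimizer row describes a warmup followed by cosine decay. A minimal sketch of that learning-rate curve follows, assuming linear warmup and a decay that spans the remaining steps. Both are common choices, but neither is confirmed by the table.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 100_000

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from near 0 up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # Assumption: cosine decay over the remaining steps, ending at MIN_LR.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1_999, 2_000, 51_000, 100_000):
    print(step, f"{learning_rate(step):.2e}")
```
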
| Phase | Description | Status |
|---|---|---|
| Phase 0 | Verify HPC environment | Complete — A100-SXM4-80GB confirmed, CUDA 12.1, PyTorch 2.5.1 |
| Phase 1 | Checkpoint-resume infrastructure | Complete — SIGUSR1 graceful shutdown + auto-resubmit verified |
| Phase 2 | FineWeb-Edu data preparation | Complete — 1B tokens tokenized; entropy array computed (96.2% vocab coverage) |
| Phase 3 | 124M model + unified training script | Complete — 200-step FineWeb smoke test passed |
| Phase 4 | Pilot experiments (Shakespeare, 10K steps) | Complete — 4 pilots passed; loglinear schedule formula bug found and fixed |
| Phase 5 | Multi-GPU verification | Complete — DDP verified on 2×A100; downgraded to 1×A100/job due to QOS limits |
| Phase 6 | Full 12-run experimental matrix | 8 of 12 complete — clipped_cosine and sigmoid stalled; excluded from analysis |
| Phase 7 | Evaluation (NELBO PPL, generative PPL, MAUVE) | Complete for 8 runs — results in results/comparison_table.md |
| Phase 8 | Writeup | In progress |
| Run | FineWeb PPL ↓ | WikiText-2 PPL ↓ | Gen PPL ↓ | MAUVE ↑ | Val Loss |
|---|---|---|---|---|---|
| cosine_uniform | 69.56 | 94.32 | 228.51 | 0.575 | 4.379 |
| cosine_entropy_weighted | 64.39 | 186.69 | 100.52 | 0.521 | 3.668 |
| loglinear_uniform | 79.67 | 161.81 | 208.53 | 0.651 | 4.517 |
| loglinear_entropy_weighted | 98.66 | 261.69 | 95.51 | 0.611 | 3.853 |
| linear_uniform | 83.56 | 151.26 | 220.59 | 0.624 | 4.535 |
| linear_entropy_weighted | 102.67 | 225.22 | 114.41 | 0.453 | 3.849 |
| sqrt_uniform | 149.15 | 299.84 | 212.97 | 0.641 | 5.087 |
| sqrt_entropy_weighted | 174.99 | 473.75 | 104.22 | 0.561 | 4.313 |
Three findings that held across the four completed schedule pairs:
- Cosine wins on denoising. It has the best FineWeb NELBO perplexity among uniform runs (69.56, and 64.39 with entropy weighting) and the best WikiText-2 perplexity (94.32), which suggests that its intermediate-noise-spending profile matters most for learning to denoise.
- Entropy-weighted masking improves generative fluency ~2× across every schedule (gen PPL 95–114 vs 208–229), but degrades out-of-distribution generalization by 1.5–2× (WikiText-2 PPL rises for every pair and nearly doubles for cosine). The entropy prior is FineWeb-specific, so the model specializes rather than generalizing.
- The training val_loss advantage of entropy-weighted masking (~0.7 nats) is largely measurement bias. When both variants are evaluated under the same uniform masking distribution (a fair comparison), the gap shrinks to ~0.08 nats for cosine. The training loss measured performance on an easier exam.
Full table: results/comparison_table.md
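
The per-pair ratios behind the second finding can be recomputed from the results table. The numbers below are copied from that table; only the dictionary layout is new.

```python
# (WikiText-2 PPL, Gen PPL) for each run, copied from the results table.
results = {
    "cosine":    {"uniform": (94.32, 228.51),  "entropy_weighted": (186.69, 100.52)},
    "loglinear": {"uniform": (161.81, 208.53), "entropy_weighted": (261.69, 95.51)},
    "linear":    {"uniform": (151.26, 220.59), "entropy_weighted": (225.22, 114.41)},
    "sqrt":      {"uniform": (299.84, 212.97), "entropy_weighted": (473.75, 104.22)},
}

ratios = {}
for schedule, r in results.items():
    wiki_u, gen_u = r["uniform"]
    wiki_e, gen_e = r["entropy_weighted"]
    ratios[schedule] = (gen_u / gen_e, wiki_e / wiki_u)
    print(f"{schedule:10s} gen PPL improves {gen_u / gen_e:.2f}x, "
          f"WikiText-2 PPL worsens {wiki_e / wiki_u:.2f}x")
```

Generative perplexity improves by roughly 1.9–2.3× in every pair, while WikiText-2 perplexity worsens by roughly 1.5–2.0×.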
These are small-scale pilots on Tiny Shakespeare (10.7M model, 10K steps, 1×A100). They validate the training loop for each schedule before the full FineWeb runs.
| Run | Schedule | Masking | Val Loss (10K steps) | Notes |
|---|---|---|---|---|
| P1 | Cosine | Uniform | 1.93 | Baseline |
| P2 | Linear | Uniform | 1.92 | Matches cosine closely |
| P3 | Log-linear | Uniform | 1.91 | After bug fix; was 1.03 with incorrect formula |
| P4 | Cosine | Entropy-weighted | 1.90 | Slight improvement over uniform |
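Full FineWeb-Edu results (100K steps, 124M model) are available for the 8 completed Phase 6 runs in results/comparison_table.md; the clipped-cosine and sigmoid runs stalled and are excluded.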
- HPC: Northeastern Explorer cluster, SLURM
- Target GPU: A100 (1 per job; the `gpu` partition enforces `MaxTRES=gres/gpu=1`), V100 fallback
- Auto-resubmit: Every training job catches SIGUSR1 (sent 5 min before wall time), saves a checkpoint, and resubmits itself. This is required because the `gpu` partition has a 7.5-hour effective limit and each run takes ~45 hours (~6 chained jobs).
- QOS limits: 4 GPUs in use simultaneously, 8 jobs submitted at once. Runs 9–12 queue automatically as earlier runs complete.
- Logging: JSONL files (one per run), one entry per 100 training steps
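
The auto-resubmit behaviour described above can be sketched with Python's standard `signal` module. This is an illustration of the pattern, not the code in `train_experiment.py`: `train`, `save_checkpoint`, and `resubmit` are hypothetical names, and the signal is simulated by the process sending SIGUSR1 to itself (Unix only).

```python
import os
import signal

stop_requested = False

def request_stop(signum, frame):
    """Signal handler: only set a flag; the training loop does the real work."""
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGUSR1, request_stop)

def train(total_steps, save_checkpoint, resubmit, start_step=0):
    """Run until done or until SIGUSR1 arrives; return the last completed step."""
    step = start_step
    while step < total_steps:
        step += 1  # one optimizer step would go here
        if stop_requested:
            save_checkpoint(step)
            resubmit()  # e.g. call `sbatch` on the same job script
            break
    return step

# Simulate the scheduler's warning by sending SIGUSR1 to this process.
events = []
os.kill(os.getpid(), signal.SIGUSR1)
last_step = train(
    total_steps=100_000,
    save_checkpoint=lambda s: events.append(("checkpoint", s)),
    resubmit=lambda: events.append(("resubmit",)),
)
print(last_step, events)
```

Keeping the handler to a single flag assignment avoids saving a checkpoint in the middle of an optimizer step. On the SLURM side, a directive such as `#SBATCH --signal=B:USR1@300` asks the scheduler to send the signal 300 seconds before the time limit; whether the repository's job scripts use exactly this form is an assumption.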
microDLM/
├── diffusion.py ← Original 10.7M diffusion LM (Shakespeare)
├── gpt.py ← Original 10.7M GPT baseline (Shakespeare)
├── visualize.py ← Terminal animation: diffusion vs GPT race
│
├── train_experiment.py ← Unified training script (all 12 scaling runs)
├── config.py ← ModelConfig, TrainConfig, ExperimentConfig
│
├── scripts/
│ ├── prepare_fineweb.py ← Tokenize FineWeb-Edu into binary shards
│ ├── prepare_fineweb.sh ← SLURM job (short partition, 24hr, CPU)
│ ├── compute_token_entropy.py ← Precompute per-token entropy via GPT-2
│ ├── compute_token_entropy.sh ← SLURM job (gpu-short, 2hr, 1×A100)
│ ├── test_checkpoint_resume.sh ← Phase 1 verification job
│ ├── verify_model.sh ← Phase 3 smoke test (200 steps on FineWeb)
│ ├── pilot_schedules.sh ← Phase 4: 4 sequential pilots on Shakespeare
│ ├── test_ddp.sh ← Phase 5: DDP verification (2×A100, 500 steps)
│ ├── train_template.sh ← SLURM template for production runs (1×A100)
│ ├── launch_all_experiments.sh ← Phase 6: generate + submit all 12 run scripts
│ ├── requeue_stalled.sh ← Re-submit runs whose jobs fell off the queue
│ └── plot_pilot_curves.py ← 2-panel loss curve figure for pilot runs
│
├── data/
│ ├── shakespeare.txt ← Tiny Shakespeare (~1.1MB)
│ └── fineweb/ ← Tokenized FineWeb-Edu shards (HPC only)
│ ├── shard_0000.bin ← 100M tokens, uint16 (~200MB each)
│ ├── ...
│ ├── metadata.json
│ └── token_entropy.npy ← (50257,) float32, per-token GPT-2 entropy
│
├── checkpoints/ ← Model checkpoints (HPC only, not in git)
├── logs/ ← SLURM output + JSONL training logs
├── results/ ← Evaluation outputs (populated after Phase 7)
│
├── steps/ ← Progressive educational build
│ ├── step0_masking.py
│ ├── step1_denoise_mlp.py
│ └── step2_transformer.py
│
├── web/ ← Static web demo (GitHub Pages)
│ ├── index.html
│ ├── style.css
│ ├── race.js
│ └── frames.json
│
├── weights/ ← Trained Shakespeare weights (not in git)
├── CLAUDE.md ← Agent guidelines and project ground truth
├── CHANGELOG.md ← Detailed log of all phases and bugs fixed
├── scaling_plan_hpc.md ← Full research plan and HPC execution guide
└── HPC_REFERENCE.md ← HPC quick reference (commands, partitions)
Each token independently survives with probability α(t) or becomes MASK with probability (1 − α(t)). MASK is absorbing: once masked, a token stays masked in the forward direction.
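
A minimal sketch of this forward process in plain Python. `MASK_ID = 65` assumes MASK is the last id in the 66-token Shakespeare vocabulary, which the README does not specify.

```python
import random

MASK_ID = 65  # assumption: MASK is the last id in the 66-token vocabulary

def forward_mask(tokens, alpha_t, rng):
    """Independently keep each token with probability alpha_t, else replace it with MASK."""
    noisy, is_masked = [], []
    for tok in tokens:
        # Absorbing state: a token that is already MASK stays MASK.
        masked = tok == MASK_ID or rng.random() >= alpha_t
        noisy.append(MASK_ID if masked else tok)
        is_masked.append(masked)
    return noisy, is_masked

rng = random.Random(0)
tokens = [10, 11, 12, 13, 14, 15, 16, 17]
noisy, is_masked = forward_mask(tokens, alpha_t=0.5, rng=rng)
print(noisy)
print(is_masked)
```

The `is_masked` list is what the loss later uses to decide which positions count.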
Cross-entropy at masked positions only, averaged over random noise levels. One t sampled per training step (Monte Carlo).
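
The loss for a single sequence can be sketched without any tensor library. The function names are illustrative; a PyTorch implementation would typically compute the same quantity with a boolean mask over a batched cross-entropy.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def masked_cross_entropy(logits, targets, is_masked):
    """Mean negative log-likelihood of the true token over masked positions only."""
    losses = [
        -log_softmax(pos_logits)[target]
        for pos_logits, target, masked in zip(logits, targets, is_masked)
        if masked
    ]
    return sum(losses) / len(losses) if losses else 0.0

vocab_size = 4
logits = [
    [0.0] * vocab_size,    # masked position, uniform guess
    [9.0, 0.0, 0.0, 0.0],  # unmasked position, confidently wrong, but ignored
    [0.0] * vocab_size,    # masked position, uniform guess
]
targets = [2, 1, 3]
is_masked = [True, False, True]
loss = masked_cross_entropy(logits, targets, is_masked)
print(loss)  # equals log(4): a uniform guess over 4 tokens
```

In a training step, one noise level t is drawn, the batch is masked at that level, and this loss is computed on the result; averaging over many steps gives the Monte Carlo estimate over noise levels.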
Two components must both be present:
- Carry-over: loss computed only at masked positions (above)
- Zero-masking: set `logits[:, :, mask_token_id] = -inf` before softmax
The model never predicts MASK as an output token — only real vocabulary tokens.
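
Both rules can be illustrated for a single position. In this sketch carry-over is written as copying a revealed token straight through, which is how MDLM formulates it; restricting the loss to masked positions, as described above, has the same effect during training. `subs_probs` is a hypothetical helper, not a function from this repository.

```python
import math

def subs_probs(logits, input_token, mask_id):
    """Output distribution for one position under carry-over and zero-masking."""
    vocab = len(logits)
    if input_token != mask_id:
        # Carry-over: an already-revealed token is copied through unchanged.
        return [1.0 if v == input_token else 0.0 for v in range(vocab)]
    # Zero-masking: MASK gets logit -inf, hence probability exactly 0.
    masked_logits = [-math.inf if v == mask_id else x for v, x in enumerate(logits)]
    m = max(masked_logits)
    exps = [math.exp(x - m) for x in masked_logits]
    total = sum(exps)
    return [e / total for e in exps]

MASK_ID = 3  # toy vocabulary of 4 tokens, with MASK as the last id
# The raw logits favour MASK, but zero-masking removes it from the distribution.
p_masked = subs_probs([0.0, 0.0, 0.0, 5.0], input_token=MASK_ID, mask_id=MASK_ID)
p_revealed = subs_probs([9.0, 9.0, 9.0, 9.0], input_token=1, mask_id=MASK_ID)
print(p_masked)
print(p_revealed)  # [0.0, 1.0, 0.0, 0.0]
```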
| Paper | Citation |
|---|---|
| D3PM | Austin et al., NeurIPS 2021, arXiv:2107.03006 |
| MDLM | Sahoo et al., NeurIPS 2024, arXiv:2406.07524 |
| SEDD | Lou, Meng & Ermon, ICML 2024 Best Paper, arXiv:2310.16834 |
| Block Diffusion | Arriola et al., ICLR 2025 Oral, arXiv:2503.09573 |
| LLaDA | Nie et al., 2025, arXiv:2502.09992 |
| nanoGPT | Karpathy, github.com/karpathy/nanoGPT |
MIT.
