microDLM

A from-scratch discrete diffusion language model — and a systematic scaling study.


This repository contains two related projects:

  1. Original project — A 10.7M parameter character-level discrete diffusion LM trained on Tiny Shakespeare, built from scratch to understand how discrete diffusion works and how it compares to GPT. Complete, published on Substack.

  2. Scaling study (in progress) — A systematic comparison of 6 forward noise schedules and 2 masking strategies for absorbing-state discrete diffusion at the 124M parameter scale on FineWeb-Edu, running on Northeastern's Explorer HPC cluster.


Part 1: Educational Project (Complete)

Diffusion vs GPT generation race — diffusion finishes ~6× faster

Same architecture, same data — diffusion finishes in 39 steps vs GPT's 225 steps

🌐 Live Web Demo →

What is it?

Most language models generate text left-to-right, one token at a time. Discrete diffusion models instead work on the whole sequence at once: they start from a fully masked sequence and iteratively reveal tokens in parallel, like developing a photograph.

This project builds both from scratch with an identical architecture (same params, same data, same transformer) so you can see exactly what changes — and it turns out to be surprisingly little:

| # | What Changes | GPT | Diffusion |
|---|---|---|---|
| 1 | Vocabulary | Standard chars | + 1 MASK token (_) |
| 2 | Attention | Causal (sees only left ←) | Bidirectional (sees everything ↔) |
| 3 | Training objective | Predict next token | Denoise masked tokens |
| 4 | Loss scope | All positions | Masked positions only |
| 5 | Generation | Sequential, left-to-right | Parallel, by confidence |

That's it. Same transformer, same RoPE, same RMSNorm, same ReluSquared MLP. ~80% of the code is shared.
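Rows 2 and 4 of the table can be illustrated in a few lines of plain Python. This is a minimal sketch rather than the code in diffusion.py or gpt.py: the function names and `MASK_ID = 65` (the last index of a 66-token vocabulary) are assumptions made for the example.

```python
MASK_ID = 65  # assumed index of the MASK token (65 characters + 1 MASK)

def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix; True means query i may attend to key j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def loss_positions(tokens, diffusion):
    """Indices that contribute to the training loss."""
    if diffusion:
        # Diffusion: only positions that were replaced by MASK are scored.
        return [i for i, tok in enumerate(tokens) if tok == MASK_ID]
    # GPT: every position is scored.
    return list(range(len(tokens)))

print(attention_mask(3, causal=True))    # lower-triangular pattern
print(attention_mask(3, causal=False))   # all True
print(loss_positions([12, MASK_ID, 7, MASK_ID], diffusion=True))
```

Everything else (embeddings, blocks, output head) is the same for both models in this picture, which is why so much of the code can be shared.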

Training Results

All models trained on Tiny Shakespeare (~1.1M characters, 65 unique + 1 MASK = 66 vocab).

| Model | Architecture | Params | Train Loss | Val Loss | Iters | Time |
|---|---|---|---|---|---|---|
| MLP Denoiser (step 1) | 2-layer FF | 0.37M | 3.31 | 3.31 | 5,000 | ~6 min |
| Small Transformer (step 2) | 4L / 4H / 128E | 1.6M | 2.16 | 2.27 | 10,000 | ~7 min |
| Diffusion (final) | 6L / 6H / 384E | 10.7M | 1.93 | 2.09 | 10,000 | ~47 min |
| GPT (final) | 6L / 6H / 384E | 10.7M | 0.13 | 4.09 | 5,000 | ~24 min |

GPT produces better text than diffusion at this scale. This is expected and well-documented in the literature — diffusion LMs typically need 3-5× more training to match autoregressive quality at small scale. The quality gap narrows significantly at larger scale (see MDLM, SEDD).

What diffusion demonstrates:

  • ⚡ ~6× fewer forward passes for generation (39 vs 225 steps)
  • 🔀 Parallel decoding — tokens appear everywhere simultaneously
  • 🧩 A fundamentally different approach to language modeling

How Generation Works

Diffusion: Parallel Unmasking

Step  0:  ··················································  (all masked)
Step  5:  Be····d him to····with·a·························
Step 15:  Be hold him to him with a milling his············
Step 30:  Be hold him to him with a milling his cold, As he
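
The unmasking loop illustrated above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_denoiser` simply returns the target character with a random confidence, and revealing a fixed number of positions per step is an assumption (the real sampler in diffusion.py may use a different rule, such as a confidence threshold).

```python
import random

MASK = None  # sentinel for a masked position in this sketch

def toy_denoiser(seq, target, rng):
    """Stand-in for the transformer: a (token, confidence) guess for every position."""
    return [(target[i], rng.random()) for i in range(len(seq))]

def generate(target, tokens_per_step=4, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(target)          # step 0: everything is masked
    steps = 0
    while MASK in seq:
        proposals = toy_denoiser(seq, target, rng)
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Reveal the masked positions the model is most confident about.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            seq[i] = proposals[i][0]
        steps += 1
    return "".join(seq), steps

text, steps = generate("Be hold him to him with a milling his cold")
print(steps, repr(text))
```

Because several positions are filled per forward pass, the number of passes is smaller than the sequence length, which is where the speedup over token-by-token generation comes from.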

GPT: Sequential Typing

Token  0:  First Citizen:\nB
Token 10:  First Citizen:\nBefore we p
Token 20:  First Citizen:\nBefore we proceed any
Token 48:  First Citizen:\nBefore we proceed any further, hear me speak.

Quick Start

git clone https://github.com/BrutalCaeser/microDLM.git
cd microDLM
pip install torch

# Download data
mkdir -p data
wget -O data/shakespeare.txt \
  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# Train diffusion LM (~47 min on T4 GPU)
python diffusion.py --train

# Generate text
python diffusion.py --generate

# Same for GPT (~24 min on T4 GPU)
python gpt.py --train --generate

# Animated race visualization
python visualize.py

Progressive Build (4 steps)

| File | Concept | Loss |
|---|---|---|
| steps/step0_masking.py | Forward process only: no neural net, just math | |
| steps/step1_denoise_mlp.py | MLP denoiser: proves training loop works | 3.31 |
| steps/step2_transformer.py | Bidirectional transformer: the quality jump | 2.16 |
| diffusion.py | Full model, 10.7M params | 2.09 val |

Architecture (shared between diffusion and GPT)

Input tokens (B, T)
    ↓
Token Embedding → (B, T, 384)
    ↓
RMSNorm
    ↓
Transformer Block × 6
  ├─ RMSNorm
  ├─ Multi-Head Attention (6 heads, RoPE, QK-Norm)
  │    ← Bidirectional (diffusion) or Causal (GPT)
  ├─ Residual
  ├─ RMSNorm
  ├─ MLP: Linear → ReluSquared → Linear (4× expansion)
  └─ Residual
    ↓
RMSNorm
    ↓
Linear Head → (B, T, 66)  ← logits over vocabulary

Part 2: Scaling Study (In Progress)

Research Thesis

Every discrete diffusion language model makes a choice about its forward process: how quickly and in what pattern should tokens be masked during training? MDLM uses cosine. SEDD uses log-linear. LLaDA uses linear. These choices are made once, justified briefly, and never compared head-to-head on the same model, data, and scale.

This study fills that gap. We train identical 124M parameter diffusion LMs on FineWeb-Edu under 6 different noise schedules and 2 masking strategies (uniform vs. entropy-weighted), producing a 12-run controlled experiment.

Research questions:

  1. Which noise schedule minimizes perplexity for absorbing-state discrete diffusion?
  2. Does per-token entropy-weighted masking improve over uniform masking?
  3. How do schedule choices affect generation quality, not just likelihood?
  4. Do the answers change between small (10M) and medium (124M) scale?

Experimental Design

The 6 noise schedules (α(t) = survival probability at noise level t):

| Schedule | α(t) | α(1) | Used by |
|---|---|---|---|
| Linear | 1 − t | 0.0 | LLaDA |
| Cosine | cos(πt/2) | 0.0 | MDLM |
| Log-linear | 1 − (1−10⁻³)·t | 0.001 | SEDD (corrected) |
| Square root | 1 − √t | 0.0 | |
| Clipped cosine | 0.05 + 0.90·cos(πt/2) | 0.05 | Block Diffusion |
| Sigmoid | 1/(1+exp(10·(t−0.5))) | ~0.0 | |

Note on log-linear: SEDD Appendix C.1 defines this as a linear schedule in log-space. The formula exp(−t·ln 2) that appears in some implementations gives α(1)=0.5 — only 50% masking at maximum noise, which is incorrect. The formula above (1 − (1−10⁻³)·t) is the correct implementation. This was caught and fixed during Phase 4 pilots (see CHANGELOG).
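
The six formulas, plus the incorrect log-linear variant, are easy to check numerically. The sketch below is a plain-Python transcription of the table, not the repository's implementation; the `SCHEDULES` dictionary and `buggy_loglinear` are names invented for this example.

```python
import math

EPS = 1e-3  # residual survival probability at t = 1 for the log-linear schedule

# alpha(t): probability that a token is still unmasked at noise level t in [0, 1].
SCHEDULES = {
    "linear": lambda t: 1.0 - t,
    "cosine": lambda t: math.cos(math.pi * t / 2),
    "loglinear": lambda t: 1.0 - (1.0 - EPS) * t,
    "sqrt": lambda t: 1.0 - math.sqrt(t),
    "clipped_cosine": lambda t: 0.05 + 0.90 * math.cos(math.pi * t / 2),
    "sigmoid": lambda t: 1.0 / (1.0 + math.exp(10.0 * (t - 0.5))),
}

def buggy_loglinear(t):
    """The incorrect variant described in the note: only half the tokens are masked at t = 1."""
    return math.exp(-t * math.log(2))

for name, alpha in SCHEDULES.items():
    print(f"{name:15s} alpha(0)={alpha(0.0):.4f}  alpha(1)={alpha(1.0):.4f}")
print(f"{'buggy_loglinear':15s} alpha(1)={buggy_loglinear(1.0):.4f}")
```

Printing α(0) alongside α(1) also shows that the clipped cosine and sigmoid schedules do not start at exactly 1 (0.95 and about 0.993), so a small fraction of tokens is masked even at t = 0.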

The 2 masking strategies:

  • Uniform — every token masked with equal probability 1−α(t)
  • Entropy-weighted — tokens with high GPT-2 entropy (hard to predict) are masked less often, giving the model more context at difficult positions. Mask probability is (1−α(t)) × clamp(2.0 − ent_norm, 0.5, 1.5) where ent_norm is normalized entropy (mean 1.0).
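
As a rough illustration of the two strategies, the per-token mask probability can be written as a small function. This is a sketch under assumptions: the function name is made up, and the final cap at 1.0 is added here because the weighted product can exceed 1 when 1−α(t) is large; how train_experiment.py handles that case is not stated above.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def mask_probability(alpha_t, ent_norm=None):
    """Probability of masking one token at survival probability alpha_t.

    ent_norm=None gives uniform masking; otherwise ent_norm is the token's
    normalized entropy (mean 1.0 across the vocabulary).
    """
    base = 1.0 - alpha_t
    if ent_norm is None:
        return base
    weight = clamp(2.0 - ent_norm, 0.5, 1.5)   # high entropy -> smaller weight
    return min(1.0, base * weight)             # cap is an assumption of this sketch

print(mask_probability(0.5))          # uniform
print(mask_probability(0.5, 2.0))     # hard-to-predict token, masked less often
print(mask_probability(0.5, 0.2))     # easy token, masked more often
```

A token with average entropy (`ent_norm = 1.0`) gets weight 1.0, so the two strategies agree for it.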

The 12-run matrix:

| Schedule | Uniform | Entropy-weighted |
|---|---|---|
| Linear | Run 1 | Run 7 |
| Cosine | Run 2 | Run 8 |
| Log-linear | Run 3 | Run 9 |
| Square root | Run 4 | Run 10 |
| Clipped cosine | Run 5 | Run 11 |
| Sigmoid | Run 6 | Run 12 |

Every run: identical 124M model, identical 1B tokens of FineWeb-Edu, identical hyperparameters. The only variable is the forward process.

Fixed hyperparameters across all runs:

| Hyperparameter | Value |
|---|---|
| Architecture | 12L / 12H / 768E, SwiGLU, RoPE, RMSNorm, weight tying |
| Parameters | ~124M |
| Tokenizer | GPT-2 BPE (50257 vocab + 1 MASK = 50258) |
| Dataset | FineWeb-Edu sample-10BT, first 1B tokens |
| Context length | 1024 tokens |
| Batch size | 32 (effective) |
| Optimizer | AdamW, lr=6e-4, cosine decay to 6e-5, 2000-step warmup |
| Precision | bfloat16 (A100) |
| Training steps | 100,000 |
| Hardware | 1× A100 per run (Explorer HPC gpu partition: 1 GPU/job limit) |
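
The optimizer row describes a warmup followed by cosine decay. A minimal sketch of such a schedule is shown below; the linear shape of the warmup and the decay ending exactly at step 100,000 are assumptions, since the table gives only the endpoints and the warmup length.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 100_000

def learning_rate(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup from near zero up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    # Cosine decay from MAX_LR (progress = 0) to MIN_LR (progress = 1).
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1_999, 2_000, 51_000, 100_000):
    print(step, f"{learning_rate(step):.2e}")
```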

Current Status

| Phase | Description | Status |
|---|---|---|
| Phase 0 | Verify HPC environment | Complete: A100-SXM4-80GB confirmed, CUDA 12.1, PyTorch 2.5.1 |
| Phase 1 | Checkpoint-resume infrastructure | Complete: SIGUSR1 graceful shutdown + auto-resubmit verified |
| Phase 2 | FineWeb-Edu data preparation | Complete: 1B tokens tokenized; entropy array computed (96.2% vocab coverage) |
| Phase 3 | 124M model + unified training script | Complete: 200-step FineWeb smoke test passed |
| Phase 4 | Pilot experiments (Shakespeare, 10K steps) | Complete: 4 pilots passed; loglinear schedule formula bug found and fixed |
| Phase 5 | Multi-GPU verification | Complete: DDP verified on 2×A100; downgraded to 1×A100/job due to QOS limits |
| Phase 6 | Full 12-run experimental matrix | 8 of 12 complete: clipped_cosine and sigmoid stalled; excluded from analysis |
| Phase 7 | Evaluation (NELBO PPL, generative PPL, MAUVE) | Complete for 8 runs: results in results/comparison_table.md |
| Phase 8 | Writeup | In progress |

Phase 7 Results (8 runs evaluated)

| Run | FineWeb PPL ↓ | WikiText-2 PPL ↓ | Gen PPL ↓ | MAUVE ↑ | Val Loss |
|---|---|---|---|---|---|
| cosine_uniform | 69.56 | 94.32 | 228.51 | 0.575 | 4.379 |
| cosine_entropy_weighted | 64.39 | 186.69 | 100.52 | 0.521 | 3.668 |
| loglinear_uniform | 79.67 | 161.81 | 208.53 | 0.651 | 4.517 |
| loglinear_entropy_weighted | 98.66 | 261.69 | 95.51 | 0.611 | 3.853 |
| linear_uniform | 83.56 | 151.26 | 220.59 | 0.624 | 4.535 |
| linear_entropy_weighted | 102.67 | 225.22 | 114.41 | 0.453 | 3.849 |
| sqrt_uniform | 149.15 | 299.84 | 212.97 | 0.641 | 5.087 |
| sqrt_entropy_weighted | 174.99 | 473.75 | 104.22 | 0.561 | 4.313 |

Three findings that held consistently across all schedule pairs:

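  1. Cosine wins on denoising. It has the best FineWeb PPL under both masking strategies (69.56 uniform, 64.39 entropy-weighted) and the best WikiText-2 PPL overall (94.32, uniform). The intermediate-noise-spending profile matters most for learning to denoise.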

  2. Entropy-weighted masking improves generative fluency ~2× across every schedule (gen PPL 95–114 vs 208–228), but degrades out-of-distribution generalization: WikiText-2 PPL rises 1.5–2.0× for every pair. The entropy prior is FineWeb-specific, so the model specializes rather than generalizing.

  3. The training val_loss advantage of entropy-weighted (~0.7 nats) is largely measurement bias. When evaluated under the same uniform masking distribution (fair comparison), the gap shrinks to ~0.08 nats for cosine. The training loss measured performance on an easier exam.
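
The sizes of these gaps can be recomputed from the table. The sketch below assumes that FineWeb PPL is the exponential of a per-token loss in nats measured under uniform masking, so the difference between two losses is the log of the perplexity ratio; that reading is an inference from the numbers, not something the evaluation code is shown to do.

```python
import math

# (uniform, entropy-weighted) pairs copied from the Phase 7 table.
val_loss_cosine = (4.379, 3.668)
fineweb_ppl_cosine = (69.56, 64.39)
wikitext2_ppl = {
    "cosine": (94.32, 186.69),
    "loglinear": (161.81, 261.69),
    "linear": (151.26, 225.22),
    "sqrt": (299.84, 473.75),
}

def nats_gap_from_ppl(ppl_a, ppl_b):
    """Loss gap in nats implied by two perplexities, assuming PPL = exp(loss)."""
    return math.log(ppl_a / ppl_b)

train_gap = val_loss_cosine[0] - val_loss_cosine[1]    # each run scored on its own masking
fair_gap = nats_gap_from_ppl(*fineweb_ppl_cosine)      # both runs scored the same way
wikitext_ratio = {name: ew / uni for name, (uni, ew) in wikitext2_ppl.items()}

print(f"training val_loss gap: {train_gap:.3f} nats")
print(f"same-masking gap:      {fair_gap:.3f} nats")
for name, ratio in wikitext_ratio.items():
    print(f"WikiText-2 PPL ratio ({name}): {ratio:.2f}x")
```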

Full table: results/comparison_table.md

Pilot Results (Phase 4 — Shakespeare, 10K steps)

These are small-scale pilots on Tiny Shakespeare (10.7M model, 10K steps, 1×A100). They validate the training loop for each schedule before the full FineWeb runs.

| Run | Schedule | Masking | Val Loss (10K steps) | Notes |
|---|---|---|---|---|
| P1 | Cosine | Uniform | 1.93 | Baseline |
| P2 | Linear | Uniform | 1.92 | Matches cosine closely |
| P3 | Log-linear | Uniform | 1.91 | After bug fix; was 1.03 with incorrect formula |
| P4 | Cosine | Entropy-weighted | 1.90 | Slight improvement over uniform |

Full FineWeb-Edu results (100K steps, 124M model) for the 8 completed runs are reported under Phase 7 Results above.

Infrastructure

  • HPC: Northeastern Explorer cluster, SLURM
  • Target GPU: A100 (1 per job — gpu partition enforces MaxTRES=gres/gpu=1), V100 fallback
  • Auto-resubmit: Every training job catches SIGUSR1 (5 min before wall time), saves checkpoint, resubmits itself. Required because the gpu partition has a 7.5-hour effective limit and each run takes ~45 hours (~6 chained jobs).
  • QOS limits: 4 GPUs in use simultaneously, 8 jobs submitted at once. Runs 9–12 queue automatically as earlier runs complete.
  • Logging: JSONL files (one per run), one entry per 100 training steps
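
The auto-resubmit pattern can be sketched with Python's standard `signal` module. This is an illustration only: `GracefulShutdown`, `train`, and the two callbacks are hypothetical names, and the actual checkpointing and resubmission logic in train_experiment.py and the SLURM scripts is not reproduced here.

```python
import os
import signal

class GracefulShutdown:
    """Records that a shutdown signal arrived so the training loop can exit cleanly."""

    def __init__(self, signum=signal.SIGUSR1):
        self.requested = False
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: set a flag and let the loop do the real work.
        self.requested = True

def train(num_steps, shutdown, save_checkpoint, resubmit):
    for step in range(num_steps):
        # ... one optimizer step would run here ...
        if shutdown.requested:
            save_checkpoint(step)   # persist state before the wall-time limit
            resubmit()              # e.g. queue a follow-up job that resumes from it
            return step
    return num_steps

if __name__ == "__main__":
    shutdown = GracefulShutdown()
    os.kill(os.getpid(), signal.SIGUSR1)   # simulate the scheduler's warning signal
    stopped_at = train(1_000, shutdown,
                       save_checkpoint=lambda s: print("checkpoint at step", s),
                       resubmit=lambda: print("resubmitting"))
    print("stopped at", stopped_at)
```

Checking the flag once per step, rather than saving inside the handler, keeps the checkpoint consistent because it is never written in the middle of an optimizer update.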

Repository Structure

microDLM/
├── diffusion.py                    ← Original 10.7M diffusion LM (Shakespeare)
├── gpt.py                          ← Original 10.7M GPT baseline (Shakespeare)
├── visualize.py                    ← Terminal animation: diffusion vs GPT race
│
├── train_experiment.py             ← Unified training script (all 12 scaling runs)
├── config.py                       ← ModelConfig, TrainConfig, ExperimentConfig
│
├── scripts/
│   ├── prepare_fineweb.py          ← Tokenize FineWeb-Edu into binary shards
│   ├── prepare_fineweb.sh          ← SLURM job (short partition, 24hr, CPU)
│   ├── compute_token_entropy.py    ← Precompute per-token entropy via GPT-2
│   ├── compute_token_entropy.sh    ← SLURM job (gpu-short, 2hr, 1×A100)
│   ├── test_checkpoint_resume.sh  ← Phase 1 verification job
│   ├── verify_model.sh             ← Phase 3 smoke test (200 steps on FineWeb)
│   ├── pilot_schedules.sh          ← Phase 4: 4 sequential pilots on Shakespeare
│   ├── test_ddp.sh                 ← Phase 5: DDP verification (2×A100, 500 steps)
│   ├── train_template.sh           ← SLURM template for production runs (1×A100)
│   ├── launch_all_experiments.sh   ← Phase 6: generate + submit all 12 run scripts
│   ├── requeue_stalled.sh          ← Re-submit runs whose jobs fell off the queue
│   └── plot_pilot_curves.py        ← 2-panel loss curve figure for pilot runs
│
├── data/
│   ├── shakespeare.txt             ← Tiny Shakespeare (~1.1MB)
│   └── fineweb/                    ← Tokenized FineWeb-Edu shards (HPC only)
│       ├── shard_0000.bin          ← 100M tokens, uint16 (~200MB each)
│       ├── ...
│       ├── metadata.json
│       └── token_entropy.npy       ← (50257,) float32, per-token GPT-2 entropy
│
├── checkpoints/                    ← Model checkpoints (HPC only, not in git)
├── logs/                           ← SLURM output + JSONL training logs
├── results/                        ← Evaluation outputs (populated after Phase 7)
│
├── steps/                          ← Progressive educational build
│   ├── step0_masking.py
│   ├── step1_denoise_mlp.py
│   └── step2_transformer.py
│
├── web/                            ← Static web demo (GitHub Pages)
│   ├── index.html
│   ├── style.css
│   ├── race.js
│   └── frames.json
│
├── weights/                        ← Trained Shakespeare weights (not in git)
├── CLAUDE.md                       ← Agent guidelines and project ground truth
├── CHANGELOG.md                    ← Detailed log of all phases and bugs fixed
├── scaling_plan_hpc.md             ← Full research plan and HPC execution guide
└── HPC_REFERENCE.md                ← HPC quick reference (commands, partitions)

The Math

Forward Process

Each token independently survives with probability α(t) or becomes MASK with probability (1 − α(t)). MASK is absorbing: once masked, a token stays masked in the forward direction.

$$P(x_t^i = \texttt{MASK} \mid x_0^i) = 1 - \alpha(t)$$
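
A direct, unoptimized rendering of this forward process is shown below as a sketch. `MASK_ID = 65` and the function name are assumptions for the example; steps/step0_masking.py may be organized differently.

```python
import random

MASK_ID = 65  # assumed index of the MASK token

def forward_mask(tokens, alpha_t, rng):
    """Independently keep each token with probability alpha_t, else replace it with MASK."""
    noisy, is_masked = [], []
    for tok in tokens:
        survives = rng.random() < alpha_t
        noisy.append(tok if survives else MASK_ID)
        is_masked.append(not survives)
    return noisy, is_masked

rng = random.Random(0)
tokens = list(range(20))
noisy, is_masked = forward_mask(tokens, alpha_t=0.7, rng=rng)
print(noisy)
print(sum(is_masked) / len(tokens))   # close to 1 - alpha_t on long sequences
```

The returned `is_masked` flags are what the training loss uses to decide which positions are scored.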

Training Loss (continuous-time NELBO)

$$\mathcal{L} = \mathbb{E}_{t} \left[ \frac{1}{\sum_i m_i} \sum_i m_i \cdot \text{CE}(\text{logits}_i, x_i) \right]$$

Cross-entropy at masked positions only, averaged over random noise levels. One t sampled per training step (Monte Carlo).
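
For a single sequence and a single sampled t, the bracketed term is just the mean cross-entropy over masked positions. The pure-Python sketch below shows that computation; the function names are invented here, and a real training loop would do the same thing on batched tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def masked_cross_entropy(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    logits: one list of vocabulary scores per position
    targets: the clean token id x_i at each position
    is_masked: the indicator m_i at each position
    """
    total, count = 0.0, 0
    for row, target, masked in zip(logits, targets, is_masked):
        if masked:
            total += -log_softmax(row)[target]
            count += 1
    return total / max(count, 1)   # guard against a draw with no masked tokens

logits = [[0.0, 0.0, 0.0, 0.0], [9.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(masked_cross_entropy(logits, targets=[2, 3, 1], is_masked=[True, False, True]))
```

In the example the second position has a confidently wrong prediction, but it does not affect the loss because it is not masked; the result is ln 4 for the two uniform predictions.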

SUBS Parameterization

Two components must both be present:

  1. Carry-over: loss computed only at masked positions (above)
  2. Zero-masking: set logits[:, :, mask_token_id] = -inf before softmax

The model never predicts MASK as an output token — only real vocabulary tokens.
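
The effect of zero-masking on one position's output distribution can be seen in a short sketch. The helper names are made up for this example, and it operates on a single list of logits rather than a (B, T, V) tensor.

```python
import math

def zero_mask_logits(logits, mask_token_id):
    """Return a copy of the logits with the MASK entry set to -inf."""
    out = list(logits)
    out[mask_token_id] = -math.inf
    return out

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # exp(-inf) evaluates to 0.0
    z = sum(exps)
    return [e / z for e in exps]

logits = [1.0, 2.0, 5.0]          # toy 3-token vocabulary, MASK at index 2
probs = softmax(zero_mask_logits(logits, mask_token_id=2))
print(probs)                      # the MASK probability is exactly 0.0
```

Even though MASK had the largest raw logit here, it receives zero probability after the adjustment, and the remaining mass is renormalized over the real tokens.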


References

| Paper | Citation |
|---|---|
| D3PM | Austin et al., NeurIPS 2021, arXiv:2107.03006 |
| MDLM | Sahoo et al., NeurIPS 2024, arXiv:2406.07524 |
| SEDD | Lou, Meng & Ermon, ICML 2024 Best Paper, arXiv:2310.16834 |
| Block Diffusion | Arriola et al., ICLR 2025 Oral, arXiv:2503.09573 |
| LLaDA | Nie et al., 2025, arXiv:2502.09992 |
| nanoGPT | Karpathy, github.com/karpathy/nanoGPT |

License

MIT.
