microDLM

A from-scratch discrete diffusion language model — and a systematic scaling study.

This repository contains two related projects:

Original project — A 10.7M parameter character-level discrete diffusion LM trained on Tiny Shakespeare, built from scratch to understand how discrete diffusion works and how it compares to GPT. Complete, published on Substack.
Scaling study (in progress) — A systematic comparison of 6 forward noise schedules and 2 masking strategies for absorbing-state discrete diffusion at the 124M parameter scale on FineWeb-Edu, running on Northeastern's Explorer HPC cluster.

Part 1: Educational Project (Complete)

Same architecture, same data — diffusion finishes in 39 steps vs GPT's 225 steps

What is it?

Most language models generate text left-to-right, one token at a time. Discrete diffusion models generate text all at once — starting from a fully masked sequence and iteratively revealing tokens in parallel, like developing a photograph.

This project builds both from scratch with an identical architecture (same params, same data, same transformer) so you can see exactly what changes — and it turns out to be surprisingly little:

#	What Changes	GPT	Diffusion
1	Vocabulary	Standard chars	+ 1 MASK token (`_`)
2	Attention	Causal (sees only left ←)	Bidirectional (sees everything ↔)
3	Training objective	Predict next token	Denoise masked tokens
4	Loss scope	All positions	Masked positions only
5	Generation	Sequential, left-to-right	Parallel, by confidence

That's it. Same transformer, same RoPE, same RMSNorm, same ReluSquared MLP. ~80% of the code is shared.
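As a rough illustration of rows 2 and 4 of the table, the sketch below builds both attention masks and both sets of loss positions in plain Python. The function names are hypothetical and are not taken from `diffusion.py` or `gpt.py`; the real models presumably do this with tensors.

```python
MASK = "_"  # the MASK character from row 1 of the table

def attention_mask(seq_len, causal):
    """Return mask[i][j] = True where position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def loss_positions(tokens, diffusion):
    """Indices that contribute to the loss for a given (possibly noised) input."""
    if diffusion:
        # Diffusion: only positions that are currently masked are scored.
        return [i for i, tok in enumerate(tokens) if tok == MASK]
    # GPT: every position is scored.
    return list(range(len(tokens)))

causal = attention_mask(4, causal=True)          # lower-triangular
bidirectional = attention_mask(4, causal=False)  # all True
noisy = list("he__o")
print(loss_positions(noisy, diffusion=True))     # masked indices only
print(loss_positions(noisy, diffusion=False))    # all indices
```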

Training Results

All models trained on Tiny Shakespeare (~1.1M characters, 65 unique + 1 MASK = 66 vocab).

Model	Architecture	Params	Train Loss	Val Loss	Iters	Time
MLP Denoiser (step 1)	2-layer FF	0.37M	3.31	3.31	5,000	~6 min
Small Transformer (step 2)	4L / 4H / 128E	1.6M	2.16	2.27	10,000	~7 min
Diffusion (final)	6L / 6H / 384E	10.7M	1.93	2.09	10,000	~47 min
GPT (final)	6L / 6H / 384E	10.7M	0.13	4.09	5,000	~24 min

GPT produces better text than diffusion at this scale. This is expected and well-documented in the literature — diffusion LMs typically need 3-5× more training to match autoregressive quality at small scale. The quality gap narrows significantly at larger scale (see MDLM, SEDD).

What diffusion demonstrates:

⚡ ~6× fewer forward passes for generation (39 vs 225 steps)
🔀 Parallel decoding — tokens appear everywhere simultaneously
🧩 A fundamentally different approach to language modeling

How Generation Works

Diffusion: Parallel Unmasking

Step  0:  ··················································  (all masked)
Step  5:  Be····d him to····with·a·························
Step 15:  Be hold him to him with a milling his············
Step 30:  Be hold him to him with a milling his cold, As he
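One possible reading of "parallel, by confidence" is sketched below: on each forward pass, every masked position whose prediction clears a confidence threshold is revealed at once, and at least one position is always revealed so the loop terminates. This is an assumption about the decoding rule, not a copy of `diffusion.py`; `predict`, `toy_predict`, and the threshold value are hypothetical stand-ins for the model.

```python
MASK = "_"

def unmask_by_confidence(length, predict, threshold=90):
    """Iteratively reveal tokens; returns (text, number_of_forward_passes).

    `predict(tokens)` stands in for one model forward pass and returns a
    (token, confidence_percent) pair for every position.
    """
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        guesses = predict(tokens)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Reveal every masked position at or above the threshold ...
        chosen = [i for i in masked if guesses[i][1] >= threshold]
        # ... and always at least the most confident one, so the loop ends.
        if not chosen:
            chosen = [max(masked, key=lambda i: guesses[i][1])]
        for i in chosen:
            tokens[i] = guesses[i][0]
        steps += 1
    return "".join(tokens), steps

def toy_predict(tokens, target="Be hold him"):
    """Toy model: confidence rises as more of the sequence is revealed."""
    revealed = sum(tok != MASK for tok in tokens)
    out = []
    for i, ch in enumerate(target):
        base = (i * 37) % 10 * 10            # fixed per-position score, 0..90
        out.append((ch, min(100, base + 20 * revealed)))
    return out

text, steps = unmask_by_confidence(len("Be hold him"), toy_predict)
print(text, steps)
```

With this toy predictor the 11-character target is recovered in 4 passes instead of 11, which is the same effect, at miniature scale, as the 39-versus-225 comparison above.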

GPT: Sequential Typing

Token  0:  First Citizen:\nB
Token 10:  First Citizen:\nBefore we p
Token 20:  First Citizen:\nBefore we proceed any
Token 48:  First Citizen:\nBefore we proceed any further, hear me speak.

Quick Start

git clone https://github.com/BrutalCaeser/microDLM.git
cd microDLM
pip install torch

# Download data
mkdir -p data
wget -O data/shakespeare.txt \
  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

# Train diffusion LM (~47 min on T4 GPU)
python diffusion.py --train

# Generate text
python diffusion.py --generate

# Same for GPT (~24 min on T4 GPU)
python gpt.py --train --generate

# Animated race visualization
python visualize.py

Progressive Build (4 steps)

File	Concept	Loss
`steps/step0_masking.py`	Forward process only — no neural net, just math	—
`steps/step1_denoise_mlp.py`	MLP denoiser — proves training loop works	3.31
`steps/step2_transformer.py`	Bidirectional transformer — the quality jump	2.16
`diffusion.py`	Full model, 10.7M params	2.09 val

Architecture (shared between diffusion and GPT)

Input tokens (B, T)
    ↓
Token Embedding → (B, T, 384)
    ↓
RMSNorm
    ↓
Transformer Block × 6
  ├─ RMSNorm
  ├─ Multi-Head Attention (6 heads, RoPE, QK-Norm)
  │    ← Bidirectional (diffusion) or Causal (GPT)
  ├─ Residual
  ├─ RMSNorm
  ├─ MLP: Linear → ReluSquared → Linear (4× expansion)
  └─ Residual
    ↓
RMSNorm
    ↓
Linear Head → (B, T, 66)  ← logits over vocabulary
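The 10.7M figure can be roughly reproduced from the diagram with a back-of-the-envelope count. The sketch below assumes bias-free linear layers, an untied output head, and ignores the small RMSNorm and QK-Norm gain vectors; none of these details are stated above, so treat the result as an estimate rather than the exact count reported by `diffusion.py`.

```python
def transformer_params(n_layer, d_model, vocab_size, mlp_ratio=4, tied_head=False):
    """Approximate parameter count for the block diagram above."""
    attn = 4 * d_model * d_model              # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model   # Linear up (4x) and Linear down
    embed = vocab_size * d_model              # token embedding table
    head = 0 if tied_head else vocab_size * d_model
    return n_layer * (attn + mlp) + embed + head

total = transformer_params(n_layer=6, d_model=384, vocab_size=66)
print(f"{total / 1e6:.2f}M parameters")
```

Nearly all of the parameters sit in the six transformer blocks; with a 66-entry vocabulary the embedding and head are negligible.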

Part 2: Scaling Study (In Progress)

Research Thesis

Every discrete diffusion language model makes a choice about its forward process: how quickly and in what pattern should tokens be masked during training? MDLM uses cosine. SEDD uses log-linear. LLaDA uses linear. These choices are made once, justified briefly, and never compared head-to-head on the same model, data, and scale.

This study fills that gap. We train identical 124M parameter diffusion LMs on FineWeb-Edu under 6 different noise schedules and 2 masking strategies (uniform vs. entropy-weighted), producing a 12-run controlled experiment.

Research questions:

Which noise schedule minimizes perplexity for absorbing-state discrete diffusion?
Does per-token entropy-weighted masking improve over uniform masking?
How do schedule choices affect generation quality, not just likelihood?
Do the answers change between small (10M) and medium (124M) scale?

Experimental Design

The 6 noise schedules (α(t) = survival probability at noise level t):

Schedule	α(t)	α(1)	Used by
Linear	1 − t	0.0	LLaDA
Cosine	cos(πt/2)	0.0	MDLM
Log-linear	1 − (1−10⁻³)·t	0.001	SEDD (corrected)
Square root	1 − √t	0.0	—
Clipped cosine	0.05 + 0.90·cos(πt/2)	0.05	Block Diffusion
Sigmoid	1/(1+exp(10·(t−0.5)))	≈0.007	—

Note on log-linear: SEDD Appendix C.1 defines this as a linear schedule in log-space. The formula exp(−t·ln 2) that appears in some implementations gives α(1)=0.5 — only 50% masking at maximum noise, which is incorrect. The formula above (1 − (1−10⁻³)·t) is the correct implementation. This was caught and fixed during Phase 4 pilots (see CHANGELOG).
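The table can be written down directly as functions of t. The sketch below is a minimal, assumed layout (the `SCHEDULES` dictionary and `buggy_loglinear` are illustrative names, not identifiers from `train_experiment.py`); it prints α at t = 0, 0.5, and 1 so the endpoint column can be checked, and it includes the exp(−t·ln 2) variant for comparison.

```python
import math

EPS = 1e-3  # survival floor in the log-linear row of the table

SCHEDULES = {
    "linear":         lambda t: 1.0 - t,
    "cosine":         lambda t: math.cos(math.pi * t / 2),
    "loglinear":      lambda t: 1.0 - (1.0 - EPS) * t,
    "sqrt":           lambda t: 1.0 - math.sqrt(t),
    "clipped_cosine": lambda t: 0.05 + 0.90 * math.cos(math.pi * t / 2),
    "sigmoid":        lambda t: 1.0 / (1.0 + math.exp(10.0 * (t - 0.5))),
}

def buggy_loglinear(t):
    """The exp(-t * ln 2) form from the note: only half the tokens are masked at t = 1."""
    return math.exp(-t * math.log(2))

for name, alpha in SCHEDULES.items():
    print(f"{name:15s} alpha(0)={alpha(0.0):.4f}  "
          f"alpha(0.5)={alpha(0.5):.4f}  alpha(1)={alpha(1.0):.4f}")
print(f"{'buggy_loglinear':15s} alpha(1)={buggy_loglinear(1.0):.4f}")
```

Note that the clipped cosine and sigmoid schedules also start slightly below 1 at t = 0, so a small fraction of tokens is masked even at the lowest noise level.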

The 2 masking strategies:

Uniform — every token masked with equal probability 1−α(t)
Entropy-weighted — tokens with high GPT-2 entropy (hard to predict) are masked less often, giving the model more context at difficult positions. Mask probability is (1−α(t)) × clamp(2.0 − ent_norm, 0.5, 1.5) where ent_norm is normalized entropy (mean 1.0).
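A minimal sketch of the two per-token mask probabilities follows. The function name is hypothetical. The final cap at 1.0 is an added assumption: the product above can exceed 1 when 1−α(t) is large and the weight is 1.5, and the text does not say how that case is handled.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def mask_probability(alpha_t, ent_norm=None):
    """Per-token mask probability at survival level alpha_t.

    ent_norm is the token's normalized GPT-2 entropy (mean 1.0);
    None selects the uniform strategy.
    """
    base = 1.0 - alpha_t
    if ent_norm is None:
        return base                               # uniform masking
    weight = clamp(2.0 - ent_norm, 0.5, 1.5)      # high entropy -> masked less
    return min(1.0, base * weight)                # cap at 1.0: assumption, see above

print(mask_probability(0.5))        # uniform
print(mask_probability(0.5, 1.8))   # hard-to-predict token, masked less often
print(mask_probability(0.5, 0.2))   # easy token, masked more often
```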

The 12-run matrix:

	Uniform	Entropy-weighted
Linear	Run 1	Run 7
Cosine	Run 2	Run 8
Log-linear	Run 3	Run 9
Square root	Run 4	Run 10
Clipped cosine	Run 5	Run 11
Sigmoid	Run 6	Run 12

Every run: identical 124M model, identical 1B tokens of FineWeb-Edu, identical hyperparameters. The only variable is the forward process.

Fixed hyperparameters across all runs:

Hyperparameter	Value
Architecture	12L / 12H / 768E, SwiGLU, RoPE, RMSNorm, weight tying
Parameters	~124M
Tokenizer	GPT-2 BPE (50257 vocab + 1 MASK = 50258)
Dataset	FineWeb-Edu sample-10BT, first 1B tokens
Context length	1024 tokens
Batch size	32 (effective)
Optimizer	AdamW, lr=6e-4, cosine decay to 6e-5, 2000-step warmup
Precision	bfloat16 (A100)
Training steps	100,000
Hardware	1× A100 per run (Explorer HPC `gpu` partition: 1 GPU/job limit)
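The optimizer row can be read as the following learning-rate curve. This is a sketch under two assumptions that the table does not state: the warmup is linear, and the cosine decay spans the remaining 98,000 steps. The function name is illustrative.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS, TOTAL_STEPS = 2_000, 100_000

def learning_rate(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return MIN_LR + (MAX_LR - MIN_LR) * cosine

for step in (0, 1_999, 2_000, 51_000, 100_000):
    print(step, f"{learning_rate(step):.2e}")
```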

Current Status

Phase	Description	Status
Phase 0	Verify HPC environment	Complete — A100-SXM4-80GB confirmed, CUDA 12.1, PyTorch 2.5.1
Phase 1	Checkpoint-resume infrastructure	Complete — SIGUSR1 graceful shutdown + auto-resubmit verified
Phase 2	FineWeb-Edu data preparation	Complete — 1B tokens tokenized; entropy array computed (96.2% vocab coverage)
Phase 3	124M model + unified training script	Complete — 200-step FineWeb smoke test passed
Phase 4	Pilot experiments (Shakespeare, 10K steps)	Complete — 4 pilots passed; loglinear schedule formula bug found and fixed
Phase 5	Multi-GPU verification	Complete — DDP verified on 2×A100; downgraded to 1×A100/job due to QOS limits
Phase 6	Full 12-run experimental matrix	8 of 12 complete — clipped_cosine and sigmoid stalled; excluded from analysis
Phase 7	Evaluation (NELBO PPL, generative PPL, MAUVE)	Complete for 8 runs — results in `results/comparison_table.md`
Phase 8	Writeup	In progress

Phase 7 Results (8 runs evaluated)

Run	FineWeb PPL ↓	WikiText-2 PPL ↓	Gen PPL ↓	MAUVE ↑	Val Loss
cosine_uniform	69.56	94.32	228.51	0.575	4.379
cosine_entropy_weighted	64.39	186.69	100.52	0.521	3.668
loglinear_uniform	79.67	161.81	208.53	0.651	4.517
loglinear_entropy_weighted	98.66	261.69	95.51	0.611	3.853
linear_uniform	83.56	151.26	220.59	0.624	4.535
linear_entropy_weighted	102.67	225.22	114.41	0.453	3.849
sqrt_uniform	149.15	299.84	212.97	0.641	5.087
sqrt_entropy_weighted	174.99	473.75	104.22	0.561	4.313

Three findings that held consistently across all schedule pairs:

Cosine wins on denoising. It has the best FineWeb PPL under both masking strategies (69.56 uniform, 64.39 entropy-weighted) and the best WikiText-2 PPL overall (94.32, uniform). The intermediate-noise-spending profile matters most for learning to denoise.
Entropy-weighted masking improves generative fluency ~2× across every schedule (gen PPL 95–114 vs 208–228), but degrades out-of-distribution generalization: WikiText-2 PPL rises 1.5–2× for every pair. The entropy prior is FineWeb-specific, so the model specializes rather than generalizing.
The training val_loss advantage of entropy-weighted (~0.7 nats) is largely measurement bias. When evaluated under the same uniform masking distribution (fair comparison), the gap shrinks to ~0.08 nats for cosine. The training loss measured performance on an easier exam.

Full table: results/comparison_table.md

Pilot Results (Phase 4 — Shakespeare, 10K steps)

These are small-scale pilots on Tiny Shakespeare (10.7M model, 10K steps, 1×A100). They validate the training loop for each schedule before the full FineWeb runs.

Run	Schedule	Masking	Val Loss (10K steps)	Notes
P1	Cosine	Uniform	1.93	Baseline
P2	Linear	Uniform	1.92	Matches cosine closely
P3	Log-linear	Uniform	1.91	After bug fix; was 1.03 with incorrect formula
P4	Cosine	Entropy-weighted	1.90	Slight improvement over uniform

Full FineWeb-Edu results (100K steps, 124M model) for the 8 completed runs are listed under Phase 7 Results above.

Infrastructure

HPC: Northeastern Explorer cluster, SLURM
Target GPU: A100 (1 per job — gpu partition enforces MaxTRES=gres/gpu=1), V100 fallback
Auto-resubmit: Every training job catches SIGUSR1 (5 min before wall time), saves checkpoint, resubmits itself. Required because the gpu partition has a 7.5-hour effective limit and each run takes ~45 hours (~6 chained jobs).
QOS limits: 4 GPUs in use simultaneously, 8 jobs submitted at once. Runs 9–12 queue automatically as earlier runs complete.
Logging: JSONL files (one per run), one entry per 100 training steps
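The auto-resubmit behaviour can be sketched as a flag set by a signal handler and checked once per training step. All names below are hypothetical and the loop body is a placeholder; this is not the code in `train_experiment.py`. It also assumes the SLURM job requests the signal ahead of the wall-time limit, for example with a directive along the lines of `#SBATCH --signal=USR1@300`.

```python
import signal

class ShutdownFlag:
    """Remembers that a shutdown signal arrived; checked between steps."""
    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler tiny: just record the request.
        self.requested = True

def install(flag):
    """Register the handler for SIGUSR1 (POSIX only, main thread only)."""
    signal.signal(signal.SIGUSR1, flag.handle)

def train(total_steps, flag, save_checkpoint, resubmit, start_step=0):
    """Run until finished or until a shutdown is requested; return the last step."""
    step = start_step
    while step < total_steps:
        step += 1                      # placeholder for one optimizer step
        if flag.requested:
            save_checkpoint(step)      # persist state before the job is killed
            resubmit()                 # e.g. invoke `sbatch` on the same job script
            break
    return step
```

Checking the flag between steps, rather than saving inside the handler, avoids writing a checkpoint while an optimizer step is only half applied.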

Repository Structure

microDLM/
├── diffusion.py                    ← Original 10.7M diffusion LM (Shakespeare)
├── gpt.py                          ← Original 10.7M GPT baseline (Shakespeare)
├── visualize.py                    ← Terminal animation: diffusion vs GPT race
│
├── train_experiment.py             ← Unified training script (all 12 scaling runs)
├── config.py                       ← ModelConfig, TrainConfig, ExperimentConfig
│
├── scripts/
│   ├── prepare_fineweb.py          ← Tokenize FineWeb-Edu into binary shards
│   ├── prepare_fineweb.sh          ← SLURM job (short partition, 24hr, CPU)
│   ├── compute_token_entropy.py    ← Precompute per-token entropy via GPT-2
│   ├── compute_token_entropy.sh    ← SLURM job (gpu-short, 2hr, 1×A100)
│   ├── test_checkpoint_resume.sh  ← Phase 1 verification job
│   ├── verify_model.sh             ← Phase 3 smoke test (200 steps on FineWeb)
│   ├── pilot_schedules.sh          ← Phase 4: 4 sequential pilots on Shakespeare
│   ├── test_ddp.sh                 ← Phase 5: DDP verification (2×A100, 500 steps)
│   ├── train_template.sh           ← SLURM template for production runs (1×A100)
│   ├── launch_all_experiments.sh   ← Phase 6: generate + submit all 12 run scripts
│   ├── requeue_stalled.sh          ← Re-submit runs whose jobs fell off the queue
│   └── plot_pilot_curves.py        ← 2-panel loss curve figure for pilot runs
│
├── data/
│   ├── shakespeare.txt             ← Tiny Shakespeare (~1.1MB)
│   └── fineweb/                    ← Tokenized FineWeb-Edu shards (HPC only)
│       ├── shard_0000.bin          ← 100M tokens, uint16 (~200MB each)
│       ├── ...
│       ├── metadata.json
│       └── token_entropy.npy       ← (50257,) float32, per-token GPT-2 entropy
│
├── checkpoints/                    ← Model checkpoints (HPC only, not in git)
├── logs/                           ← SLURM output + JSONL training logs
├── results/                        ← Evaluation outputs (populated after Phase 7)
│
├── steps/                          ← Progressive educational build
│   ├── step0_masking.py
│   ├── step1_denoise_mlp.py
│   └── step2_transformer.py
│
├── web/                            ← Static web demo (GitHub Pages)
│   ├── index.html
│   ├── style.css
│   ├── race.js
│   └── frames.json
│
├── weights/                        ← Trained Shakespeare weights (not in git)
├── CLAUDE.md                       ← Agent guidelines and project ground truth
├── CHANGELOG.md                    ← Detailed log of all phases and bugs fixed
├── scaling_plan_hpc.md             ← Full research plan and HPC execution guide
└── HPC_REFERENCE.md                ← HPC quick reference (commands, partitions)

The Math

Forward Process

Each token independently survives with probability α(t) or becomes MASK with probability (1 − α(t)). MASK is absorbing: once masked, a token stays masked in the forward direction.

$$P(x_t^i = \texttt{MASK} \mid x_0^i) = 1 - \alpha(t)$$
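A character-level sketch of this forward process, using `_` as MASK as in Part 1, is shown below. The function name and the `rng` argument are illustrative choices; the repository presumably does the same thing on token-id tensors.

```python
import random

MASK = "_"

def forward_mask(tokens, alpha_t, rng=random):
    """Keep each token independently with probability alpha_t, else replace it with MASK."""
    return [tok if rng.random() < alpha_t else MASK for tok in tokens]

rng = random.Random(0)
clean = list("To be, or not to be")
for alpha_t in (1.0, 0.5, 0.0):
    print(alpha_t, "".join(forward_mask(clean, alpha_t, rng)))
```

At α(t) = 1 the text is returned unchanged, and at α(t) = 0 every position is MASK, matching the two endpoints of the schedules.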

Training Loss (continuous-time NELBO)

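$$\mathcal{L} = \mathbb{E}_{t} \left[ \frac{1}{\sum_i m_i} \sum_i m_i \cdot \text{CE}(\text{logits}_i, x_0^i) \right]$$

Here m_i = 1 if position i is masked at noise level t (x_t^i = MASK) and 0 otherwise, so this is cross-entropy against the clean token x_0^i at masked positions only, averaged over random noise levels. One t is sampled per training step (Monte Carlo).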
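For a single sampled t, the bracketed term can be computed as in the sketch below, where `is_masked` plays the role of the indicator m_i (1 at masked positions, 0 elsewhere). It uses plain Python lists for clarity; the function name is hypothetical, and the real training script presumably uses a tensor cross-entropy instead.

```python
import math

def masked_cross_entropy(logits, targets, is_masked):
    """Mean cross-entropy over masked positions only.

    logits:    one list of vocabulary scores per position
    targets:   clean token id per position
    is_masked: truthy where the input token was MASK
    """
    total, count = 0.0, 0
    for scores, target, m in zip(logits, targets, is_masked):
        if not m:
            continue                      # unmasked positions contribute nothing
        peak = max(scores)                # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]   # -log softmax(scores)[target]
        count += 1
    return total / max(count, 1)          # avoid dividing by zero if nothing is masked

logits = [[0.0, 0.0, 0.0, 0.0], [5.0, -5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = masked_cross_entropy(logits, targets=[1, 0, 2], is_masked=[1, 0, 1])
print(loss)
```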

SUBS Parameterization

Two components must both be present:

Carry-over: loss computed only at masked positions (above)
Zero-masking: set logits[:, :, mask_token_id] = -inf before softmax

The model never predicts MASK as an output token — only real vocabulary tokens.
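The effect of zero-masking on one position's output distribution is sketched below in plain Python. The function name is illustrative; on tensors the same step is a single indexed assignment before the softmax.

```python
import math

def softmax_without_mask(scores, mask_token_id):
    """Softmax over the vocabulary with the MASK logit forced to -inf."""
    scores = list(scores)
    scores[mask_token_id] = -math.inf        # zero-masking
    peak = max(scores)                       # finite as long as one real token remains
    exps = [math.exp(s - peak) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Even if the raw MASK logit (last entry here) is the largest, it gets probability 0.
probs = softmax_without_mask([1.0, 2.0, 0.5, 9.0], mask_token_id=3)
print(probs)
```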

References

Paper	Citation
D3PM	Austin et al., NeurIPS 2021, arXiv:2107.03006
MDLM	Sahoo et al., NeurIPS 2024, arXiv:2406.07524
SEDD	Lou, Meng & Ermon, ICML 2024 Best Paper, arXiv:2310.16834
Block Diffusion	Arriola et al., ICLR 2025 Oral, arXiv:2503.09573
LLaDA	Nie et al., 2025, arXiv:2502.09992
nanoGPT	Karpathy, github.com/karpathy/nanoGPT

License

MIT.
