A small language model (SLM) built from scratch in PyTorch, inspired by the paper: "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?".
This project implements a GPT-style transformer model and a complete training/inference pipeline, demonstrating how a model with a small parameter count can be trained to generate coherent text.
- Built from Scratch: The transformer model (GPT) and its components (Block, CausalSelfAttention, MLP) are written from the ground up in pure PyTorch.
- Modern Architecture: Implements a standard GPT-2 style, decoder-only transformer with pre-layer normalization.
- Robust Training Harness: Includes a modern training loop featuring:
  - Mixed-precision training (bfloat16 or float16) with torch.amp.GradScaler.
  - AdamW optimizer with weight decay.
  - Learning rate scheduler with linear warmup followed by cosine decay.
  - Gradient accumulation to simulate larger batch sizes.
  - Periodic evaluation against a validation set and saving the best model.
- Weight Tying: Implements weight-tying between the token embedding layer (wte) and the final language model head (lm_head) to save parameters.
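As a rough illustration of the weight-tying point above, the sketch below shares one parameter tensor between an embedding layer and an output projection. The class name TinyLM and the sizes are hypothetical stand-ins, not the project's actual GPT class or GPTConfig values, and the transformer blocks are omitted entirely.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for a GPT-style model; transformer blocks omitted."""

    def __init__(self, vocab_size: int, n_embd: int):
        super().__init__()
        # Token embedding: weight has shape (vocab_size, n_embd).
        self.wte = nn.Embedding(vocab_size, n_embd)
        # Output head: nn.Linear stores its weight as (out_features, in_features),
        # which is also (vocab_size, n_embd), so the two can share storage.
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # Weight tying: both modules now reference the same Parameter object.
        self.wte.weight = self.lm_head.weight

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        x = self.wte(idx)        # (batch, time, n_embd)
        return self.lm_head(x)   # (batch, time, vocab_size)


# Illustrative sizes only: 50257 is the GPT-2 vocabulary size, 64 is arbitrary.
model = TinyLM(vocab_size=50257, n_embd=64)
logits = model(torch.tensor([[1, 2, 3]]))

# parameters() yields a shared tensor once, so the tied matrix is counted once.
n_params = sum(p.numel() for p in model.parameters())
```

With these assumed sizes the tied model holds one 50257 x 64 matrix instead of two, which is where the parameter saving comes from.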
This project was built and trained using Python 3.13 on an NVIDIA H100 NVL GPU.
conda create -n scratchspeak python=3.13
conda activate scratchspeak
Then, install the required packages:
# Install PyTorch (choose the wheel index that matches your CUDA version)
# The example below uses the CUDA 12.1 index; for CUDA 12.4, replace cu121 with cu124
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install other dependencies
pip install tiktoken datasets numpy tqdm
The code expects the training data to be in binary files (train.bin, validation.bin). The provided process function handles this. Assuming your data is loaded via the datasets library (e.g., from Hugging Face):
- The process function uses the tiktoken GPT-2 tokenizer to encode the text.
- It then memory-maps a large binary file for the 'train' and 'validation' splits.
- It iterates through the dataset in shards, tokenizes them, and writes the token IDs directly into the .bin files.
This step only needs to be run once. The script will automatically detect the .bin files on subsequent runs.
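The sharded write described above might look roughly like the following. This is a minimal sketch, not the project's process function: write_token_shards is a hypothetical helper, the token IDs are made-up placeholders rather than real tokenizer output, and storing them as uint16 is an assumption that works because every GPT-2 token ID is below 65,536.

```python
import os
import tempfile

import numpy as np


def write_token_shards(shards, path, dtype=np.uint16):
    """Write an iterable of token-ID lists into one flat binary file.

    Hypothetical helper: the real pipeline would tokenize each dataset
    shard first; here the shards are already lists of integer IDs.
    """
    arrays = [np.asarray(s, dtype=dtype) for s in shards]
    total = sum(len(a) for a in arrays)
    # Memory-map the output file so shards are written without
    # holding the whole dataset in RAM at once.
    out = np.memmap(path, dtype=dtype, mode="w+", shape=(total,))
    idx = 0
    for a in arrays:
        out[idx: idx + len(a)] = a
        idx += len(a)
    out.flush()
    return total


# Placeholder token IDs standing in for tokenized documents.
toy_shards = [[15496, 995], [50256], [464, 3290, 50256]]
path = os.path.join(tempfile.mkdtemp(), "train.bin")
n_tokens = write_token_shards(toy_shards, path)

# Reading the file back later only needs the dtype, not the shard layout.
tokens = np.memmap(path, dtype=np.uint16, mode="r")
```

Because the file is a flat array of fixed-width integers, a training loop can memory-map it and slice out random windows without parsing anything.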
Start the training. The run performs the following steps:
- Initialize the GPT model with the specified GPTConfig.
- Initialize the AdamW optimizer and learning rate schedulers.
- Load training and validation data using the get_batch function.
- Run the training loop for max_iters, performing gradient accumulation and scaling.
- Periodically evaluate the model and save the best-performing checkpoint to best_model_params.pt.
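To make the warmup-then-cosine learning rate schedule from the steps above concrete, here is a closed-form sketch. It is written as a plain function rather than with PyTorch's scheduler classes, so it is not the project's scheduler code, and the values of max_lr, min_lr, and warmup_iters are illustrative assumptions; only the 40,000-iteration length comes from the run described in this README.

```python
import math


def get_lr(it, max_lr, min_lr, warmup_iters, max_iters):
    """Learning rate at iteration `it`: linear warmup, then cosine decay."""
    # Phase 1: ramp linearly up to max_lr over the warmup iterations.
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    # Past the end of the schedule: hold at the floor.
    if it >= max_iters:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)


# Assumed hyperparameters, chosen only for illustration.
cfg = dict(max_lr=1e-4, min_lr=1e-5, warmup_iters=1000, max_iters=40000)

lr_start = get_lr(0, **cfg)        # small value at the first step
lr_peak = get_lr(1000, **cfg)      # warmup has just finished
lr_middle = get_lr(20500, **cfg)   # halfway through the cosine phase
lr_end = get_lr(40000, **cfg)      # floor
```

Halfway through the cosine phase the rate sits at the average of max_lr and min_lr, which is a quick sanity check for any implementation of this schedule.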
This model was trained on an NVIDIA H100 NVL GPU with CUDA 12.4.
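The model was trained for 40,000 iterations. Training and validation loss both fell steadily, from about 8.75 at iteration 500 to about 2.07 by iteration 39,500, and the two stayed close to each other at every logged step.
Here is a snapshot of the training logs, showing the start and end of the run (the "Epoch" label in the logs counts training iterations):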
Start of Training:
Epoch 500: train loss 8.7482, val loss 8.7557
The current learning rate: 0.00007
...
Epoch 1000: train loss 7.8323, val loss 7.8322
The current learning rate: 0.00010
...
Epoch 1500: train loss 7.0012, val loss 7.0022
The current learning rate: 0.00010
...
End of Training:
Epoch 38000: train loss 2.0887, val loss 2.0897
Epoch 38500: train loss 2.0681, val loss 2.0844
Epoch 39000: train loss 2.0672, val loss 2.0771
Epoch 39500: train loss 2.0633, val loss 2.0698
100%|██████████| 40000/40000 [32:57<00:00, 20.22it/s]
The final validation loss reached approximately 2.07.
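One way to read that number is to convert it to perplexity and compare it with a uniform-guess baseline. The sketch below assumes the logged loss is mean cross-entropy per token in nats (the default for PyTorch's cross-entropy) and that the vocabulary is the 50,257-token GPT-2 encoding; neither detail is stated explicitly in the logs.

```python
import math

final_val_loss = 2.07   # approximate final validation loss from the logs
vocab_size = 50257      # assumed: GPT-2 vocabulary size

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = math.exp(final_val_loss)

# A model that guesses uniformly over the vocabulary scores ln(vocab_size).
uniform_loss = math.log(vocab_size)

print(f"perplexity at loss {final_val_loss}: {perplexity:.2f}")
print(f"loss of a uniform guess: {uniform_loss:.2f}")
```

Under these assumptions, a loss of 2.07 corresponds to a perplexity of roughly 7.9, compared with a uniform-guess loss of about 10.8.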
Below is the training loss curve:
