ScratchSpeakSLM

A small language model (SLM) built from scratch in PyTorch, inspired by the paper: "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?".

This project implements a GPT-style transformer model and a complete training/inference pipeline, demonstrating how a model with a small parameter count can be trained to generate coherent text.

🚀 Features

Built from Scratch: The transformer model (GPT) and its components (Block, CausalSelfAttention, MLP) are written from the ground up in pure PyTorch.
Modern Architecture: Implements a standard GPT-2 style, decoder-only transformer with pre-layer normalization.
Robust Training Harness: Includes a modern training loop featuring:
- Mixed-precision training (bfloat16 or float16), with torch.amp.GradScaler for loss scaling.
- AdamW optimizer with weight decay.
- Learning rate scheduler with linear warmup followed by cosine decay.
- Gradient accumulation to simulate larger batch sizes.
- Periodic evaluation against a validation set and saving the best model.
Weight Tying: Implements weight-tying between the token embedding layer (wte) and the final language model head (lm_head) to save parameters.
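
As a rough illustration of the weight-tying idea, here is a minimal sketch. The class name TinyLM and the sizes are hypothetical and chosen only for this example; the attribute names wte and lm_head follow the description above, and the transformer blocks are left out entirely.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for the GPT class; transformer blocks omitted."""

    def __init__(self, vocab_size: int, n_embd: int):
        super().__init__()
        # Embedding weight has shape (vocab_size, n_embd).
        self.wte = nn.Embedding(vocab_size, n_embd)
        # Linear(n_embd -> vocab_size) weight also has shape (vocab_size, n_embd).
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)
        # Tie the two layers: both now point at the same Parameter object.
        self.lm_head.weight = self.wte.weight

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        x = self.wte(idx)          # (batch, time, n_embd)
        return self.lm_head(x)     # (batch, time, vocab_size)


model = TinyLM(vocab_size=50257, n_embd=64)  # illustrative sizes only
# parameters() yields each shared Parameter once, so the tied matrix is counted once.
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.zeros(1, 4, dtype=torch.long))
```

Because the embedding matrix and the output projection have the same shape, sharing them removes one vocab_size x n_embd matrix from the parameter count.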

🛠️ Setup and Training

This project was built and trained using Python 3.13 on an NVIDIA H100 NVL GPU.

1. Environment Setup

conda create -n scratchspeak python=3.13
conda activate scratchspeak

Then, install the required packages:

# Install PyTorch. The example below uses the CUDA 12.1 wheel index;
# adjust the index URL to match your CUDA version (this project used CUDA 12.4).
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install tiktoken datasets numpy tqdm

2. Data Preparation

The code expects the training data as binary files (train.bin, validation.bin), which the provided process function produces. Assuming your data is loaded via the datasets library (e.g., from Hugging Face), the preparation works as follows:

- The process function uses the tiktoken GPT-2 tokenizer to encode the text.
- A large binary file is memory-mapped for each of the 'train' and 'validation' splits.
- The dataset is iterated in shards; each shard is tokenized and its token IDs are written directly into the corresponding .bin file.

This step only needs to be run once. The script will automatically detect the .bin files on subsequent runs.
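
The memory-mapped write can be sketched as follows. This is not the project's process function: write_tokens_to_bin is a hypothetical helper, tokenization and sharding are left out (the input is assumed to be already-encoded lists of token IDs), and the uint16 dtype is an assumption that works because the GPT-2 vocabulary (50,257 tokens) fits below 65,536.

```python
import os
import tempfile

import numpy as np


def write_tokens_to_bin(token_lists, path, dtype=np.uint16):
    """Write already-tokenized documents into one flat binary file.

    Hypothetical helper for illustration; returns the number of tokens written.
    """
    token_lists = list(token_lists)
    total = sum(len(ids) for ids in token_lists)
    # Create a file-backed array of exactly the required length.
    arr = np.memmap(path, dtype=dtype, mode="w+", shape=(total,))
    pos = 0
    for ids in token_lists:
        arr[pos:pos + len(ids)] = ids  # copy this document's IDs into place
        pos += len(ids)
    arr.flush()  # make sure everything reaches disk
    return total


# Demo with arbitrary token IDs (all below 50,257) written to a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
n_written = write_tokens_to_bin([[15496, 995], [50256]], demo_path)

# Reading back: the file is a flat sequence of uint16 token IDs.
tokens = np.memmap(demo_path, dtype=np.uint16, mode="r")
```

Reading the file back with mode="r" gives array-like access to the tokens without loading the whole file into memory, which is what makes this layout convenient for sampling training batches.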

3. Training

Starting the training run performs the following steps:

- Initialize the GPT model with the specified GPTConfig.
- Initialize the AdamW optimizer and learning rate schedulers.
- Load training and validation data using the get_batch function.
- Run the training loop for max_iters, performing gradient accumulation and gradient scaling.
- Periodically evaluate the model and save the best-performing checkpoint to best_model_params.pt.
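
A linear-warmup-then-cosine-decay schedule can be written as a single function of the iteration number. The sketch below is one common formulation, not the project's scheduler code; the function name get_lr and the values of max_lr, min_lr, and warmup_iters are placeholder assumptions (only max_iters = 40,000 comes from this README).

```python
import math


def get_lr(it, max_lr=1e-4, min_lr=1e-5, warmup_iters=1000, max_iters=40000):
    """Learning rate at iteration `it` (hypothetical values, for illustration)."""
    # Phase 1: linear warmup from near zero up to max_lr.
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    # After the schedule ends, hold at the floor.
    if it >= max_iters:
        return min_lr
    # Phase 2: cosine decay from max_lr down to min_lr.
    progress = (it - warmup_iters) / (max_iters - warmup_iters)  # in [0, 1)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))           # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)


# Sample the schedule at a few points.
schedule = [get_lr(i) for i in (0, 999, 1000, 20500, 40000)]
```

In a PyTorch training loop, a function like this can be applied by assigning its value to each optimizer param group's "lr" entry before every optimizer step.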

This model was trained on an NVIDIA H100 NVL GPU with CUDA 12.4.
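
The gradient accumulation used in the training loop can be illustrated with a toy example. This is a minimal sketch under stated assumptions: a small linear model and random data stand in for the GPT model and get_batch, the hyperparameters are arbitrary, and mixed precision and gradient scaling are omitted so the example runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # toy stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
loss_fn = nn.MSELoss()

accum_steps = 4             # number of micro-batches per optimizer step
x = torch.randn(16, 8)      # one "large" batch of 16 examples
y = torch.randn(16, 1)

optimizer.zero_grad(set_to_none=True)
for micro_x, micro_y in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    # Divide by accum_steps so the summed gradients match the mean over the full batch.
    loss = loss_fn(model(micro_x), micro_y) / accum_steps
    loss.backward()         # gradients accumulate in .grad across micro-batches

accum_grad = model.weight.grad.clone()  # kept only to inspect the result
optimizer.step()                        # one update for the whole effective batch
optimizer.zero_grad(set_to_none=True)
```

With equally sized micro-batches, the accumulated gradient equals the gradient of a single pass over all 16 examples, while only 4 examples need to be held in memory at a time.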

📊 Training Performance

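The model was trained for 40,000 iterations. Training and validation loss decreased steadily, from about 8.75 at iteration 500 to about 2.07 at iteration 39,500, and the two stayed close to each other in the logged portions of the run.

Here is a snapshot of the training logs, showing the start and end of the run (the "Epoch" label in the logs appears to denote the training iteration, given the 40,000-iteration progress bar):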

Start of Training:

Epoch 500: train loss 8.7482, val loss 8.7557
The current learning rate: 0.00007
...
Epoch 1000: train loss 7.8323, val loss 7.8322
The current learning rate: 0.00010
...
Epoch 1500: train loss 7.0012, val loss 7.0022
The current learning rate: 0.00010
...

End of Training:

Epoch 38000: train loss 2.0887, val loss 2.0897
Epoch 38500: train loss 2.0681, val loss 2.0844
Epoch 39000: train loss 2.0672, val loss 2.0771
Epoch 39500: train loss 2.0633, val loss 2.0698
100%|██████████| 40000/40000 [32:57<00:00, 20.22it/s]

The final validation loss reached approximately 2.07.
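
To put this number in context, a loss can be converted to perplexity by exponentiating it. The snippet below assumes the logged value is the mean per-token cross-entropy in nats (the default for PyTorch's cross-entropy loss); the README does not state this explicitly.

```python
import math

val_loss = 2.0698                 # last logged validation loss (iteration 39,500)
perplexity = math.exp(val_loss)   # valid only if the loss is mean cross-entropy in nats
print(f"validation perplexity: {perplexity:.2f}")
```

Under that assumption, the final validation loss corresponds to a perplexity of roughly 7.9.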

📈 Loss Curve

Below is the training loss curve:
