Skip to content

Aashutoshh01/ScratchSpeakSLM

Repository files navigation

ScratchSpeakSLM

Python Version Framework Inspiration

A small language model (SLM) built from scratch in PyTorch, inspired by the paper: "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?".

This project implements a GPT-style transformer model and a complete training/inference pipeline, demonstrating how a model with a small parameter count can be trained to generate coherent text.

🚀 Features

  • Built from Scratch: The transformer model (GPT) and its components (Block, CausalSelfAttention, MLP) are written from the ground up in pure PyTorch.
  • Modern Architecture: Implements a standard GPT-2 style, decoder-only transformer with pre-layer normalization.
  • Robust Training Harness: Includes a modern training loop featuring:
    • Mixed-precision training (bfloat16 or float16) with torch.amp.GradScaler.
    • AdamW optimizer with weight decay.
    • Learning rate scheduler with linear warmup followed by cosine decay.
    • Gradient accumulation to simulate larger batch sizes.
    • Periodic evaluation against a validation set and saving the best model.
  • Weight Tying: Implements weight-tying between the token embedding layer (wte) and the final language model head (lm_head) to save parameters.

🛠️ Setup and Training

This project was built and trained using Python 3.13 on an NVIDIA H100 NVL GPU.

1. Environment Setup

conda create -n scratchspeak python=3.13
conda activate scratchspeak

Then, install the required packages:

# Install PyTorch (select the command for your CUDA 12.4 setup)
# Example for CUDA 12.1, adjust as needed for 12.4
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install other dependencies
pip install tiktoken datasets numpy tqdm

2. Data Preparation

The code expects the training data to be in binary files (train.bin, validation.bin). The provided process function handles this. Assuming your data is loaded via the datasets library (e.g., from Hugging Face):

  • The process function uses the tiktoken GPT-2 tokenizer to encode the text.
  • It then memory-maps a large binary file for the 'train' and 'validation' splits.
  • It iterates through the dataset in shards, tokenizes them, and writes the token IDs directly into the .bin files.

This step only needs to be run once. The script will automatically detect the .bin files on subsequent runs.

3. Training

Start the training:

  • Initialize the GPT model with the specified GPTConfig.
  • Initialize the AdamW optimizer and learning rate schedulers.
  • Load training and validation data using the get_batch function.
  • Run the training loop for max_iters, performing gradient accumulation and scaling.
  • Periodically evaluate the model and save the best-performing checkpoint to best_model_params.pt.

This model was trained on an NVIDIA H100 NVL GPU with CUDA 12.4.

📊 Training Performance

The model was trained for 40,000 iterations. The training and validation loss steadily decreased, demonstrating successful learning.

Here is a snapshot of the training logs, showing the start and end of the run:

Start of Training:

Epoch 500: train loss 8.7482, val loss 8.7557
The current learning rate: 0.00007
...
Epoch 1000: train loss 7.8323, val loss 7.8322
The current learning rate: 0.00010
...
Epoch 1500: train loss 7.0012, val loss 7.0022
The current learning rate: 0.00010
...

End of Training:

Epoch 38000: train loss 2.0887, val loss 2.0897
Epoch 38500: train loss 2.0681, val loss 2.0844
Epoch 39000: train loss 2.0672, val loss 2.0771
Epoch 39500: train loss 2.0633, val loss 2.0698
100%|██████████| 40000/40000 [32:57<00:00, 20.22it/s]

The final validation loss reached approximately 2.07.

📈 Loss Curve

Below is the training loss curve:

Training Loss Curve

About

A tiny GPT-style language model built completely from scratch in PyTorch, inspired by TinyStories, demonstrating how small transformer models can learn to generate coherent text.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors