This repository contains the code for ArXivGPT, a 124M-parameter GPT-2 model pre-trained from scratch on a subset of the ArXiv dataset.
This project is a practical implementation of the concepts from Sebastian Raschka's book, "Build a Large Language Model (From Scratch)", demonstrating how to build and train all key components of a Transformer-based LLM.
All essential components of the GPT-2 architecture were built from scratch using PyTorch:
- LayerNorm: Custom Layer Normalization module.
- GELU: Custom Gaussian Error Linear Unit activation.
- FeedForward: Position-wise feed-forward network.
- MultiHeadAttention: Causal multi-head self-attention.
- TransformerBlock: A single Transformer decoder block.
- GPTModel: The complete GPT model assembling the blocks.
- GPTDatasetV1: A custom PyTorch Dataset for tokenizing, chunking, and serving the text data.
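To make the "124M" label concrete, the following sketch counts the parameters these components would contribute. It is not taken from the repository: `count_gpt_params` is a hypothetical helper, and the layer layout (bias on the attention output projection, a 4x feed-forward expansion, scale and shift in every LayerNorm, an untied output head) is assumed from the book's reference design. The configuration values are the ones used in this project.

```python
# Hypothetical helper, not part of the repository.
# Assumed layout (based on the book's reference design):
#   - attention: Q, K, V projections (bias only if qkv_bias) plus an output projection with bias
#   - feed-forward: emb_dim -> 4 * emb_dim -> emb_dim, both layers with bias
#   - LayerNorm: learnable scale and shift
#   - output head: bias-free linear layer, not tied to the token embedding

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def count_gpt_params(cfg):
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d

    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))
    out_proj = d * d + d
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    norms = 2 * (2 * d)  # two LayerNorms per block
    block = qkv + out_proj + feed_forward + norms

    final_norm = 2 * d
    out_head = d * cfg["vocab_size"]

    total = tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head
    return {
        "per_block": block,
        "total": total,
        "total_without_out_head": total - out_head,
    }


counts = count_gpt_params(GPT_CONFIG_124M)
print(counts)
```

Under these assumptions the model holds roughly 162.6M parameters in total and roughly 124.0M once the output head is excluded, which matches the usual convention of counting GPT-2 small as if the output head shared its weights with the token embedding.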
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(512, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trs_blocks): Sequential(
    (0): TransformerBlock(...)
    ...
    (11): TransformerBlock(...)
  )
  (final_norm): LayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)

The model was trained on a single text file (arxiv.txt) containing a processed subset of ArXiv abstracts.
- Tokenizer: tiktoken ("gpt2" encoding)
- Tokenizer Version: 0.12.0
- Train/Validation Split: 90% / 10%
- Total Tokens: 2,883,970
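As an illustration of the chunking step, here is a minimal sketch of how a token sequence can be sliced into input/target pairs for next-token prediction. `make_chunks` is a hypothetical stand-in for what GPTDatasetV1 does internally, and the choice of a stride equal to the context length (non-overlapping windows) is an assumption, not something confirmed by the repository.

```python
def make_chunks(token_ids, context_length, stride):
    """Slice a token sequence into (input, target) pairs.

    Each target is the input window shifted right by one token,
    so position i of the target is the token that follows position i
    of the input.
    """
    inputs, targets = [], []
    for i in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[i : i + context_length])
        targets.append(token_ids[i + 1 : i + context_length + 1])
    return inputs, targets


# Toy stand-in for tokenizer output. Real IDs would come from something like
# tiktoken.get_encoding("gpt2").encode(text).
toy_ids = list(range(10))
inputs, targets = make_chunks(toy_ids, context_length=4, stride=4)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

With the project's settings the same function would be called with `context_length=512`; tokens at the end of the file that do not fill a complete window are simply left out.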
| Metric | Value |
|---|---|
| Characters Processed | 10,737,418 |
| Train Tokens | 2,592,768 |
| Validation Tokens | 289,792 |
| Train Dataloader Length | 633 batches |
| Validation Dataloader Length | 71 batches |
| Batch Shape (inputs, targets) | [8, 512], [8, 512] |
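The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes that the train and validation token counts are the tokens actually served by the dataloaders, that the training loader drops its last incomplete batch, and that the validation loader keeps it. These are inferences from the numbers, not statements about the code.

```python
import math

batch_size, context_length = 8, 512
tokens_per_batch = batch_size * context_length  # tokens in one full batch

total_tokens = 2_883_970
train_tokens, val_tokens = 2_592_768, 289_792
train_batches, val_batches = 633, 71

# Training side: the token count equals a whole number of full batches.
train_tokens_from_batches = train_batches * tokens_per_batch

# Validation side: a whole number of 512-token sequences,
# with the last batch only partially filled.
val_sequences = val_tokens // context_length
val_batches_needed = math.ceil(val_sequences / batch_size)

# Tokens that never reach the model (window remainders, dropped batch).
unused_tokens = total_tokens - (train_tokens + val_tokens)

print(train_tokens_from_batches, val_sequences, val_batches_needed, unused_tokens)
```

Read this way, the table is internally consistent: 633 full training batches account for every training token, and 566 validation sequences need 71 batches of 8.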
git clone https://github.com/your-username/ArXivGPT.git
cd ArXivGPT

conda create -n arxivgpt python=3.13
conda activate arxivgpt

Trained on a single NVIDIA H100 NVL GPU.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA H100 NVL                Off |   00000000:B5:00.0 Off |                    0 |
| N/A   61C    P0            242W /  350W |   40786MiB /  95830MiB |    100%      Default |
+-----------------------------------------------------------------------------------------+
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning Rate | 0.0004 |
| Weight Decay | 0.1 |
| Batch Size | 8 |
| Number of Epochs | 3 |
| Context Length | 512 |
| Device | cuda |
| Eval Frequency | Every 200 steps |
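These settings determine how long training runs and when the losses are logged. The sketch below works that out, assuming one optimizer step per batch (no gradient accumulation), a step counter that starts at 0, and the 633 training batches per epoch reported for the dataloader.

```python
batches_per_epoch = 633  # length of the training dataloader
num_epochs = 3
eval_freq = 200

# One optimizer step per batch is assumed here.
total_steps = batches_per_epoch * num_epochs

# Steps at which train/validation loss would be evaluated.
eval_steps = [step for step in range(total_steps) if step % eval_freq == 0]

print(total_steps)
print(eval_steps)

# In PyTorch, the optimizer row of the table would typically correspond to:
#   torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
```

Under these assumptions training takes 1,899 steps, with evaluations at steps 0 through 1,800 in increments of 200.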
| Step | Train Loss | Val Loss |
|---|---|---|
| 0 | 10.9975 | 10.9994 |
| 200 | 5.102 | 4.989 |
| 400 | 4.493 | 4.617 |
| 600 | 4.023 | 4.476 |
| 800 | 3.869 | 4.309 |
| 1000 | 3.633 | 4.269 |
| 1200 | 3.707 | 4.182 |
| 1400 | 3.445 | 4.151 |
| 1600 | 3.533 | 4.127 |
| 1800 | 3.495 | 4.048 |
Final Results:
- Total Training Time: 11.58 minutes
- Final Training Loss: 3.495
- Final Validation Loss: 4.048 (lowest validation loss recorded during training)
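To put these losses in context, the sketch below converts them to perplexity and compares the step-0 loss with the loss of a uniform guess over the vocabulary. It assumes the reported values are mean cross-entropy in nats (natural logarithm), which is the PyTorch default but is not stated explicitly here.

```python
import math

vocab_size = 50257

# Cross-entropy of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

initial_val_loss = 10.9994  # step 0
final_train_loss = 3.495    # step 1800
final_val_loss = 4.048      # step 1800

# Perplexity is the exponential of the mean cross-entropy.
train_perplexity = math.exp(final_train_loss)
val_perplexity = math.exp(final_val_loss)

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"train perplexity:   {train_perplexity:.1f}")
print(f"val perplexity:     {val_perplexity:.1f}")
```

The step-0 loss sits close to the uniform-guess value, as expected for a randomly initialized model, while the final validation loss corresponds to a perplexity in the high 50s.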
This project is inspired by Sebastian Raschka’s work in “Build a Large Language Model (From Scratch)”.
The dataset used is derived from publicly available ArXiv abstracts.
This project is released under the MIT License.