Aashutoshh01/ArXivGPT

ArXivGPT

This repository contains the code for ArXivGPT, a 124M-parameter GPT-2 model pre-trained from scratch on a subset of the ArXiv dataset.

This project is a practical implementation of the concepts from Sebastian Raschka's book, "Build a Large Language Model (From Scratch)", and demonstrates how to build and train the key components of a Transformer-based LLM.


🚀 Key Features

All essential components of the GPT-2 architecture were built from scratch using PyTorch:

  • LayerNorm: Custom Layer Normalization module.
  • GELU: Custom Gaussian Error Linear Unit activation.
  • FeedForward: Position-wise feed-forward network.
  • MultiHeadAttention: Causal multi-head self-attention.
  • TransformerBlock: A single Transformer decoder block.
  • GPTModel: The complete GPT model assembling the blocks.
  • GPTDatasetV1: A custom PyTorch Dataset for tokenizing, chunking, and serving the text data.
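
To make the smaller components listed above concrete, here is a minimal sketch of LayerNorm, GELU, and FeedForward. It is not copied from this repository: the tanh approximation of GELU, the `eps` value, and the 4x hidden expansion in the feed-forward network are assumptions based on the usual GPT-2 layout.

```python
import math

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Normalizes the last dimension, then applies a learned scale and shift."""

    def __init__(self, emb_dim, eps=1e-5):  # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift


class GELU(nn.Module):
    """Tanh approximation of GELU (assumed; the exact erf form also exists)."""

    def forward(self, x):
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
        return 0.5 * x * (1.0 + torch.tanh(inner))


class FeedForward(nn.Module):
    """Position-wise MLP: expand to 4 * emb_dim (assumed), activate, project back."""

    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        return self.layers(x)


# Usage: a batch of 2 sequences, 4 tokens each, with 768-dim embeddings.
x = torch.randn(2, 4, 768)
out = FeedForward(768)(LayerNorm(768)(x))
print(out.shape)  # same shape as the input: (2, 4, 768)
```

Each module keeps the input shape, which is what lets a TransformerBlock wrap them in residual connections.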

🛠️ Model & Data

Model Configuration

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}
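
As a rough check on the "124M" label, the parameter count can be derived from this configuration alone. The sketch below is an estimate under assumptions that the README does not state: a 4x feed-forward expansion, a biased attention output projection, two LayerNorms per block, and the option of sharing (tying) the token embedding with the output head. The helper `count_parameters` is hypothetical, not part of the repository.

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def count_parameters(cfg, ff_mult=4, tie_weights=False):
    """Estimate trainable parameters for a GPT-2 style model (assumed layout)."""
    d = cfg["emb_dim"]
    v = cfg["vocab_size"]
    qkv_bias = d if cfg["qkv_bias"] else 0

    tok_emb = v * d
    pos_emb = cfg["context_length"] * d
    # Query, key, value projections plus a biased output projection.
    attn = 3 * (d * d + qkv_bias) + (d * d + d)
    # Two linear layers with biases: d -> ff_mult*d -> d.
    ff = (d * ff_mult * d + ff_mult * d) + (ff_mult * d * d + d)
    norms = 2 * (2 * d)  # two LayerNorms per block, scale + shift each
    block = attn + ff + norms
    final_norm = 2 * d
    out_head = 0 if tie_weights else d * v  # bias=False in the printed model

    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head


untied = count_parameters(GPT_CONFIG_124M)
tied = count_parameters(GPT_CONFIG_124M, tie_weights=True)
print(f"{untied:,}")  # 162,616,320
print(f"{tied:,}")    # 124,018,944
```

Under these assumptions, the 124M figure corresponds to counting the token embedding and output head once; a model with a separate, untied output head has about 163M trainable parameters.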

Model Architecture

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(512, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trs_blocks): Sequential(
    (0): TransformerBlock(...)
    ...
    (11): TransformerBlock(...)
  )
  (final_norm): LayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)

Dataset

The model was trained on a single text file (arxiv.txt) containing a processed subset of ArXiv abstracts.

  • Tokenizer: tiktoken ("gpt2" encoding)
  • Tokenizer Version: 0.12.0
  • Train/Validation Split: 90% / 10%
  • Total Tokens: 2,883,970

| Metric                       | Value              |
|------------------------------|--------------------|
| Characters Processed         | 10,737,418         |
| Train Tokens                 | 2,592,768          |
| Validation Tokens            | 289,792            |
| Train Dataloader Length      | 633 batches        |
| Validation Dataloader Length | 71 batches         |
| Dataloader Shape             | [8, 512], [8, 512] |
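
The chunking that GPTDatasetV1 performs can be illustrated with a small sliding-window sketch in plain Python. It is an illustration rather than the repository's code: the function `sliding_windows` is hypothetical, and the arithmetic at the end assumes non-overlapping windows (stride equal to the context length) and that the final, partial validation batch is kept.

```python
import math


def sliding_windows(token_ids, max_length, stride):
    """Split token IDs into (input, target) pairs; targets are shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy example: 10 token IDs, windows of 4 tokens, no overlap.
pairs = sliding_windows(list(range(10)), max_length=4, stride=4)
print(pairs[0])  # ([0, 1, 2, 3], [1, 2, 3, 4])
print(pairs[1])  # ([4, 5, 6, 7], [5, 6, 7, 8])

# Consistency check against the reported figures (batch size 8, context 512).
batch_size, context_length = 8, 512
train_batches = 2_592_768 // (batch_size * context_length)
val_sequences = 289_792 // context_length
val_batches = math.ceil(val_sequences / batch_size)
print(train_batches, val_sequences, val_batches)  # 633 566 71
```

Read this way, the train and validation token counts are the tokens actually served per epoch, which is why they sum to slightly less than the total token count.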

⚙️ Setup & Installation

1. Clone the Repository

git clone https://github.com/Aashutoshh01/ArXivGPT.git
cd ArXivGPT

2. Create a Conda Environment

conda create -n arxivgpt python=3.13
conda activate arxivgpt

📈 Training & Results

Hardware

Trained on a single NVIDIA H100 NVL GPU.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA H100 NVL                Off |   00000000:B5:00.0 Off |                    0 |
| N/A   61C    P0            242W /  350W |   40786MiB /  95830MiB |    100%      Default |
+-----------------------------------------------------------------------------------------+

Training Hyperparameters

| Parameter        | Value           |
|------------------|-----------------|
| Optimizer        | AdamW           |
| Learning Rate    | 0.0004          |
| Weight Decay     | 0.1             |
| Batch Size       | 8               |
| Number of Epochs | 3               |
| Context Length   | 512             |
| Device           | cuda            |
| Eval Frequency   | Every 200 steps |
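
The optimizer settings above map directly onto `torch.optim.AdamW`. The sketch below shows one training step with those values. The tiny stand-in model, the toy vocabulary size, and the `train_step` helper are hypothetical and only keep the example small; the actual run used the repository's GPTModel on a CUDA device.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical stand-in for GPTModel: embedding followed by a linear head.
vocab_size, emb_dim = 100, 16
model = nn.Sequential(
    nn.Embedding(vocab_size, emb_dim),
    nn.Linear(emb_dim, vocab_size),
)

# Learning rate and weight decay taken from the hyperparameter table.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)


def train_step(input_batch, target_batch):
    """One optimizer update using next-token cross-entropy loss."""
    model.train()
    optimizer.zero_grad()
    logits = model(input_batch)  # (batch, seq_len, vocab_size)
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1),    # (batch * seq_len, vocab_size)
        target_batch.flatten(),  # (batch * seq_len,)
    )
    loss.backward()
    optimizer.step()
    return loss.item()


# Random token IDs with batch size 8 (sequence length shortened for the demo).
inputs = torch.randint(0, vocab_size, (8, 32))
targets = torch.randint(0, vocab_size, (8, 32))
loss = train_step(inputs, targets)
print(f"loss after one step: {loss:.3f}")
```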

Training Log

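| Step | Train Loss | Val Loss |
|------|------------|----------|
| 0    | 10.9975    | 10.9994  |
| 200  | 5.102      | 4.989    |
| 400  | 4.493      | 4.617    |
| 600  | 4.023      | 4.476    |
| 800  | 3.869      | 4.309    |
| 1000 | 3.633      | 4.269    |
| 1200 | 3.707      | 4.182    |
| 1400 | 3.445      | 4.151    |
| 1600 | 3.533      | 4.127    |
| 1800 | 3.495      | 4.048    |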
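
The step numbers in this log follow from the dataloader length and the evaluation frequency. The short sketch below reproduces them, assuming one optimizer step per batch (no gradient accumulation) and an evaluation at every step that is a multiple of 200, starting at step 0.

```python
steps_per_epoch = 633  # train dataloader length reported above
num_epochs = 3
eval_freq = 200
batch_size, context_length = 8, 512

total_steps = steps_per_epoch * num_epochs
eval_steps = list(range(0, total_steps, eval_freq))
tokens_seen = total_steps * batch_size * context_length

print(total_steps)   # 1899
print(eval_steps)    # [0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800]
print(tokens_seen)   # 7778304
```

Under these assumptions, the last logged step (1800) falls 99 steps before the end of the third epoch.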

Final Results:

  • Total Training Time: 11.58 minutes
  • Final Training Loss: 3.495
  • Final Validation Loss: 4.048 (the lowest validation loss logged during the run)
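
Cross-entropy losses are often easier to interpret as perplexity. The sketch below converts the reported losses, assuming they are mean per-token cross-entropy values in nats (the PyTorch default). It also computes the loss of a uniform guess over the vocabulary as a reference point for the step-0 values.

```python
import math

final_train_loss = 3.495
final_val_loss = 4.048
vocab_size = 50257


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss)


# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)

print(f"train perplexity: {perplexity(final_train_loss):.1f}")
print(f"val perplexity:   {perplexity(final_val_loss):.1f}")
print(f"uniform-guess loss: {uniform_loss:.3f}")
```

The uniform-guess loss is about 10.8, close to the step-0 losses near 11.0, which is what one would expect from a randomly initialized model.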

📚 Acknowledgements

This project is inspired by Sebastian Raschka’s book “Build a Large Language Model (From Scratch)”.
The dataset used is derived from publicly available ArXiv abstracts.


🧠 License

This project is released under the MIT License.
