ArXivGPT

This repository contains the code for ArXivGPT, a 124M-parameter GPT-2 model pre-trained from scratch on a subset of the ArXiv dataset.

This project is a practical implementation of the concepts from Sebastian Raschka's book "Build a Large Language Model (From Scratch)". It demonstrates how to build and train the key components of a Transformer-based LLM.

🚀 Key Features

All essential components of the GPT-2 architecture were built from scratch using PyTorch:

LayerNorm: Custom Layer Normalization module.
GELU: Custom Gaussian Error Linear Unit activation.
FeedForward: Position-wise feed-forward network.
MultiHeadAttention: Causal multi-head self-attention.
TransformerBlock: A single Transformer decoder block.
GPTModel: The complete GPT model assembling the blocks.
GPTDatasetV1: A custom PyTorch Dataset for tokenizing, chunking, and serving the text data.
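
To illustrate what "built from scratch" means for the smaller modules in the list above, here is a minimal sketch of LayerNorm, GELU, and FeedForward. It assumes the conventions used in the book (learnable scale and shift, the tanh approximation of GELU, and a 4x hidden expansion in the feed-forward network); the code in notebook.ipynb may differ in detail.

```python
import math

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Normalizes the last dimension, then applies a learnable scale and shift."""

    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift


class GELU(nn.Module):
    """tanh approximation of the Gaussian Error Linear Unit."""

    def forward(self, x):
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
        return 0.5 * x * (1.0 + torch.tanh(inner))


class FeedForward(nn.Module):
    """Position-wise network: expand to 4 * emb_dim (assumed), activate, project back."""

    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        return self.layers(x)


torch.manual_seed(123)
x = torch.randn(2, 4, 768)          # (batch, tokens, emb_dim)
normed = LayerNorm(768)(x)
out = FeedForward(768)(normed)
print(out.shape)
```

The feed-forward block keeps the input shape, which is what lets Transformer blocks be stacked with residual connections.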

🛠️ Model & Data

Model Configuration

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

Model Architecture

GPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(512, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trs_blocks): Sequential(
    (0): TransformerBlock(...)
    ...
    (11): TransformerBlock(...)
  )
  (final_norm): LayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)
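
The "124M" label can be checked by hand from the configuration and the layer shapes printed above. The sketch below is an estimate under stated assumptions, not output from the repository: each LayerNorm has a scale and a shift vector, the attention output projection and both feed-forward layers have biases, and the feed-forward hidden size is 4 * emb_dim.

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}


def count_parameters(cfg, ff_mult=4):
    """Estimate parameter counts from the config (ff_mult is an assumed value)."""
    d = cfg["emb_dim"]
    v = cfg["vocab_size"]
    ctx = cfg["context_length"]

    # Query, key, and value projections; biases only if qkv_bias is set.
    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))
    out_proj = d * d + d                        # attention output projection with bias
    hidden = ff_mult * d
    feed_forward = (d * hidden + hidden) + (hidden * d + d)
    layer_norms = 2 * (2 * d)                   # two LayerNorms, scale + shift each
    per_block = qkv + out_proj + feed_forward + layer_norms

    embeddings = v * d + ctx * d                # tok_emb + pos_emb
    final_norm = 2 * d
    out_head = d * v                            # bias=False in the printed architecture
    total = embeddings + cfg["n_layers"] * per_block + final_norm + out_head
    return {"per_block": per_block, "total": total, "tied": total - out_head}


counts = count_parameters(GPT_CONFIG_124M)
print(counts)
```

Under these assumptions the total is 162,616,320 parameters, or 124,018,944 if the output head is counted as sharing its weights with the token embedding, which is consistent with the 124M label.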

Dataset

The model was trained on a single text file (arxiv.txt) containing a processed subset of ArXiv abstracts.

Tokenizer: tiktoken ("gpt2" encoding)
tiktoken Version: 0.12.0
Train/Validation Split: 90% / 10%
Total Tokens: 2,883,970
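
The split and chunking steps can be sketched in plain Python. This is an illustration only: it assumes the 90/10 split is applied to the raw text before tokenization and that windows are cut with a stride equal to the context length, as in the book. The helper names train_val_split and make_windows are hypothetical and do not come from the repository.

```python
def train_val_split(text, train_ratio=0.9):
    """Split raw text by character position (assumed to happen before tokenization)."""
    split_idx = int(train_ratio * len(text))
    return text[:split_idx], text[split_idx:]


def make_windows(token_ids, context_length, stride):
    """Cut token IDs into (input, target) pairs; targets are inputs shifted by one."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[i : i + context_length])
        targets.append(token_ids[i + 1 : i + context_length + 1])
    return inputs, targets


train_text, val_text = train_val_split("abcdefghij")

# Toy token IDs stand in for the tiktoken output.
token_ids = list(range(10))
inputs, targets = make_windows(token_ids, context_length=4, stride=4)
for inp, tgt in zip(inputs, targets):
    print(inp, "->", tgt)
```

Each target window is the input window shifted one token to the right, so every position in a window supplies a next-token prediction example.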

Metric	Value
Characters Processed	10,737,418
Train Tokens	2,592,768
Validation Tokens	289,792
Train Dataloader Length	633 batches
Validation Dataloader Length	71 batches
Batch Shape (inputs, targets)	[8, 512], [8, 512]
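
The figures in the table are consistent with one another, as the short check below shows. It assumes the training dataloader drops its last incomplete batch while the validation dataloader keeps it; that is an inference from the numbers, not something the README states.

```python
import math

CONTEXT_LENGTH = 512
BATCH_SIZE = 8


def num_batches(n_windows, batch_size, drop_last):
    """Batches a dataloader yields for n_windows samples."""
    if drop_last:
        return n_windows // batch_size
    return math.ceil(n_windows / batch_size)


# Token counts from the table, expressed as 512-token windows.
train_windows = 2_592_768 // CONTEXT_LENGTH
val_windows = 289_792 // CONTEXT_LENGTH

train_batches = num_batches(train_windows, BATCH_SIZE, drop_last=True)   # assumed
val_batches = num_batches(val_windows, BATCH_SIZE, drop_last=False)      # assumed
print(train_windows, train_batches)
print(val_windows, val_batches)
```

The training tokens fill exactly 633 batches of 8 windows, while the 566 validation windows give 70 full batches plus one partial batch, for 71 in total.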

⚙️ Setup & Installation

1. Clone the Repository

git clone https://github.com/your-username/ArXivGPT.git
cd ArXivGPT

2. Create a Conda Environment

conda create -n arxivgpt python=3.13
conda activate arxivgpt
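
The steps above do not list the Python packages to install. As a hedged convenience, the snippet below reports which versions of the libraries named elsewhere in this README are present in the active environment. The package names torch and tiktoken are assumptions about what the notebook imports.

```python
from importlib import metadata


def installed_versions(packages=("torch", "tiktoken")):
    """Return {package: version or None} for the active environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None   # not installed in this environment
    return versions


print(installed_versions())
```

A value of None means the package still needs to be installed before running notebook.ipynb.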

📈 Training & Results

Hardware

Trained on a single NVIDIA H100 NVL GPU.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
|-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA H100 NVL                Off |   00000000:B5:00.0 Off |                    0 |
| N/A   61C    P0            242W /  350W |   40786MiB /  95830MiB |    100%      Default |
+-----------------------------------------------------------------------------------------+

Training Hyperparameters

Parameter	Value
Optimizer	AdamW
Learning Rate	0.0004
Weight Decay	0.1
Batch Size	8
Number of Epochs	3
Context Length	512
Device	cuda
Eval Frequency	Every 200 steps
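
The optimizer settings in the table translate into PyTorch roughly as follows. This is a minimal sketch, not the repository's training loop: a tiny stand-in model (hypothetical) replaces GPTModel so the snippet runs in seconds on a CPU, and cross-entropy over flattened logits is assumed as the next-token loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical stand-in model; the repository trains GPTModel instead.
vocab_size, emb_dim = 100, 16
model = nn.Sequential(nn.Embedding(vocab_size, emb_dim), nn.Linear(emb_dim, vocab_size))

# Values from the hyperparameter table.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)


def train_step(input_batch, target_batch):
    """One optimizer step on a single batch; returns the scalar loss."""
    model.train()
    optimizer.zero_grad()
    logits = model(input_batch)                       # (batch, tokens, vocab)
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1),                         # (batch * tokens, vocab)
        target_batch.flatten(),                       # (batch * tokens,)
    )
    loss.backward()
    optimizer.step()
    return loss.item()


# Random token IDs stand in for a real batch; shape is (batch_size, tokens).
inputs = torch.randint(0, vocab_size, (8, 32))
targets = torch.randint(0, vocab_size, (8, 32))
loss_value = train_step(inputs, targets)
print(loss_value)
```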

Training Log

Step	Train Loss	Val Loss
0	10.9975	10.9994
200	5.102	4.989
400	4.493	4.617
600	4.023	4.476
800	3.869	4.309
1000	3.633	4.269
1200	3.707	4.182
1400	3.445	4.151
1600	3.533	4.127
1800	3.495	4.048
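
The step numbers in the log follow from the dataloader length and the evaluation frequency. The check below assumes one optimizer step per batch (no gradient accumulation) and a step counter that starts at 0.

```python
batches_per_epoch = 633   # train dataloader length
num_epochs = 3
eval_freq = 200

total_steps = batches_per_epoch * num_epochs
eval_steps = list(range(0, total_steps, eval_freq))

print(total_steps)
print(eval_steps)
```

This gives 1,899 steps in total and ten evaluation points from step 0 to step 1800, matching the ten rows of the log.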

Final Results:

Total Training Time: 11.58 minutes
Final Training Loss: 3.495
Final Validation Loss: 4.048 (the lowest value in the log)
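
Cross-entropy losses are easier to interpret as perplexities. The sketch below assumes the reported losses are mean per-token cross-entropy in nats, which is what PyTorch's cross-entropy returns by default. It also computes the loss of a uniform guess over the vocabulary as a reference point for step 0.

```python
import math


def perplexity(loss):
    """Convert a mean per-token cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


train_ppl = perplexity(3.495)
val_ppl = perplexity(4.048)
uniform_loss = math.log(50257)   # loss of a uniform guess over the vocabulary

print(round(train_ppl, 1), round(val_ppl, 1), round(uniform_loss, 3))
```

The final validation perplexity is roughly 57, and the step-0 loss of about 11.0 sits close to the uniform baseline of about 10.8, as expected for a randomly initialized model.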

📚 Acknowledgements

This project is inspired by Sebastian Raschka's book "Build a Large Language Model (From Scratch)".
The dataset used is derived from publicly available ArXiv abstracts.

🧠 License

This project is released under the MIT License.
