FANZ3R/Vision_Transformer_From_scratch

Vision Transformer (ViT) from Scratch — MNIST

A Vision Transformer built entirely from scratch using PyTorch, achieving 96.3% accuracy on the MNIST handwritten digit dataset in just 5 epochs.


Architecture

Input Image (28×28, grayscale)
        ↓
Patch Embedding — Conv2d splits image into 16 patches (4×4 grid, patch size 7×7)
        ↓
CLS Token prepended — learnable classification token
        ↓
Position Embeddings added — so the model knows patch order
        ↓
Transformer Encoder × 4
    ├── LayerNorm
    ├── Multi-Head Self-Attention (4 heads, batch_first=True)
    ├── Residual Connection
    ├── LayerNorm
    ├── MLP Block (16 → 64 → 16, GELU activation)
    └── Residual Connection
        ↓
MLP Head — extracts CLS token → LayerNorm → Linear(16, 10)
        ↓
Class Logits (10 digits)
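
The first three stages of the diagram can be sketched as a single module. This is a minimal illustration rather than the notebook's actual code: the class name `PatchTokens`, its argument names, and the initialisation of the CLS token and position embeddings are assumptions. Only the shapes (28×28 input, 7×7 patches, embedding dim 16) come from the diagram above.

```python
import torch
import torch.nn as nn


class PatchTokens(nn.Module):
    """Hypothetical helper: image -> sequence of (CLS + patch) tokens."""

    def __init__(self, img_size=28, patch_size=7, in_ch=1, dim=16):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # (28 // 7) ** 2 = 16
        # kernel_size == stride, so each 7x7 patch is projected independently
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable CLS token and position embeddings (init scheme is an assumption)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches + 1, dim) * 0.02)

    def forward(self, x):
        b = x.shape[0]
        x = self.proj(x)                        # (B, 16, 4, 4)
        x = x.flatten(2).transpose(1, 2)        # (B, 16 patches, 16 dims)
        cls = self.cls_token.expand(b, -1, -1)  # (B, 1, 16)
        x = torch.cat([cls, x], dim=1)          # (B, 17, 16)
        return x + self.pos_embed               # broadcast over the batch


tokens = PatchTokens()(torch.randn(8, 1, 28, 28))
print(tokens.shape)  # torch.Size([8, 17, 16])
```

The sequence length is 17 rather than 16 because the CLS token is prepended before the position embeddings are added.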

Results

Epoch   Loss     Accuracy
1       654.84   77.97%
2       198.41   93.67%
3       153.87   94.93%
4       129.35   95.74%
5       112.05   96.33%
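
The README does not say how the Loss column is computed. Values in the hundreds would be consistent with summing the per-batch loss over an epoch, but that is an assumption, as is whether the accuracy is measured on training or test data. The sketch below shows one way such per-epoch numbers could be accumulated. The helper `train_one_epoch`, the random stand-in data, and the linear stand-in model are all hypothetical, chosen so the example runs without downloading MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, criterion):
    """Hypothetical helper: returns (summed batch loss, accuracy in percent)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()  # summed over batches, not averaged
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss, 100.0 * correct / seen


# Stand-ins: random "images" and labels, and a linear model in place of the ViT
torch.manual_seed(0)
data = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=64, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

for epoch in range(1, 6):
    loss, acc = train_one_epoch(model, loader, optimizer, criterion)
    print(f"Epoch {epoch}: loss={loss:.2f}, accuracy={acc:.2f}%")
```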

Key Concepts Implemented

  • Patch Embedding — images split into patches via Conv2d, flattened into token sequences
  • CLS Token — learnable token prepended to sequence; accumulates global image information for classification
  • Position Embeddings — learnable vectors added to each token so the model understands spatial order
  • Multi-Head Self-Attention — each patch attends to every other patch, learning which regions matter
  • Residual Connections — skip connections around attention and MLP blocks to stabilise training
  • MLP with 4× expansion — each encoder block expands features (16→64) then contracts back (64→16)
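
The attention, residual, and MLP concepts above fit together in a pre-norm encoder block. The following is a sketch under assumptions, not the notebook's code: the class name `EncoderBlock` is hypothetical, and the use of `nn.MultiheadAttention` is inferred from the `batch_first=True` note in the architecture diagram. The dimensions (16, 4 heads, 64 hidden units, 4 blocks, 10 classes) are taken from the README.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Hypothetical pre-norm block: LayerNorm -> attention -> residual,
    then LayerNorm -> MLP -> residual."""

    def __init__(self, dim=16, heads=4, mlp_dim=64):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # 4x expansion: 16 -> 64 -> 16 with GELU in between
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        # Self-attention: queries, keys, and values are all the same tokens
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # residual around attention
        x = x + self.mlp(self.norm2(x))  # residual around MLP
        return x


torch.manual_seed(0)
blocks = nn.Sequential(*[EncoderBlock() for _ in range(4)])
head = nn.Sequential(nn.LayerNorm(16), nn.Linear(16, 10))

tokens = torch.randn(8, 17, 16)      # (batch, CLS + 16 patches, dim)
logits = head(blocks(tokens)[:, 0])  # classify from the CLS token only
print(logits.shape)  # torch.Size([8, 10])
```

Each block maps a `(B, 17, 16)` tensor to the same shape, which is what allows four of them to be stacked and the residual additions to work without any projection.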

Setup

git clone https://github.com/YOUR_USERNAME/vision-transformer-mnist.git
cd vision-transformer-mnist
pip install -r requirements.txt

Then open ViT_MNIST.ipynb in Jupyter or Google Colab.

Tip: Enable GPU in Colab — Runtime → Change runtime type → T4 GPU


Hyperparameters

Parameter            Value
Image size           28×28
Patch size           7×7
Number of patches    16
Embedding dim        16
Attention heads      4
Transformer blocks   4
MLP hidden dim       64
Batch size           64
Learning rate        0.001
Optimizer            Adam
Epochs               5
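
Several of these values depend on one another, and a small config object makes the relationships explicit. This is an illustrative sketch: the `ViTConfig` class and its field names are hypothetical, and the batches-per-epoch figure assumes training on the standard 60,000-image MNIST training split, which the README does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    """Hypothetical container for the values in the table above."""
    image_size: int = 28
    patch_size: int = 7
    embed_dim: int = 16
    num_heads: int = 4
    num_blocks: int = 4
    mlp_dim: int = 64
    batch_size: int = 64
    learning_rate: float = 0.001
    epochs: int = 5

    @property
    def num_patches(self) -> int:
        # A 4x4 grid of 7x7 patches covers the 28x28 image exactly
        return (self.image_size // self.patch_size) ** 2

    @property
    def seq_len(self) -> int:
        # Patches plus the prepended CLS token
        return self.num_patches + 1

    @property
    def head_dim(self) -> int:
        # The embedding dim must divide evenly across attention heads
        assert self.embed_dim % self.num_heads == 0
        return self.embed_dim // self.num_heads


cfg = ViTConfig()
print(cfg.num_patches, cfg.seq_len, cfg.head_dim)  # 16 17 4
print(cfg.mlp_dim // cfg.embed_dim)                # 4

# Assumption: the standard 60,000-image MNIST training split
print(math.ceil(60_000 / cfg.batch_size))          # 938
```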
