Vision Transformer (ViT) from Scratch — MNIST

A Vision Transformer implemented from scratch in PyTorch. It reaches 96.33% accuracy on the MNIST handwritten digit dataset after 5 epochs of training.

Architecture

Input Image (28×28, grayscale)
        ↓
Patch Embedding — Conv2d splits image into 16 patches (4×4 grid, patch size 7×7)
        ↓
CLS Token prepended — learnable classification token
        ↓
Position Embeddings added — so the model knows patch order
        ↓
Transformer Encoder × 4
    ├── LayerNorm
    ├── Multi-Head Self-Attention (4 heads, batch_first=True)
    ├── Residual Connection
    ├── LayerNorm
    ├── MLP Block (16 → 64 → 16, GELU activation)
    └── Residual Connection
        ↓
MLP Head — extracts CLS token → LayerNorm → Linear(16, 10)
        ↓
Class Logits (10 digits)
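The first three stages of the diagram above (patch embedding, CLS token, position embeddings) can be sketched as follows. This is a minimal illustration, not the notebook's code: the class name `PatchEmbedding`, the zero initialisation of the CLS token, and the 0.02 scale on the position embeddings are assumptions. Only the shapes (28×28 input, 7×7 patches, embedding dim 16) come from this README.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Hypothetical module: (B, 1, 28, 28) images -> (B, 17, 16) tokens."""

    def __init__(self, img_size=28, patch_size=7, in_channels=1, embed_dim=16):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # (28 // 7) ** 2 = 16
        # kernel_size == stride == patch_size, so each output position
        # is a linear projection of one non-overlapping 7x7 patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable classification token (initialisation is an assumption).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # One learnable position vector per token: 16 patches + 1 CLS token.
        self.pos_embed = nn.Parameter(
            torch.randn(1, self.num_patches + 1, embed_dim) * 0.02)

    def forward(self, x):
        x = self.proj(x)                      # (B, 16, 4, 4)
        x = x.flatten(2).transpose(1, 2)      # (B, 16 patches, 16 dims)
        cls = self.cls_token.expand(x.size(0), -1, -1)  # (B, 1, 16)
        x = torch.cat([cls, x], dim=1)        # (B, 17, 16)
        return x + self.pos_embed             # broadcast over the batch


tokens = PatchEmbedding()(torch.randn(8, 1, 28, 28))
print(tokens.shape)  # torch.Size([8, 17, 16])
```

The sequence length is 17 rather than 16 because the CLS token occupies position 0, ahead of the 16 patch tokens.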

Results

Epoch	Loss	Accuracy
1	654.84	77.97%
2	198.41	93.67%
3	153.87	94.93%
4	129.35	95.74%
5	112.05	96.33%

Key Concepts Implemented

Patch Embedding — images split into patches via Conv2d, flattened into token sequences
CLS Token — learnable token prepended to sequence; accumulates global image information for classification
Position Embeddings — learnable vectors added to each token so the model understands spatial order
Multi-Head Self-Attention — each patch attends to every other patch, learning which regions matter
Residual Connections — skip connections around attention and MLP blocks to stabilise training
MLP with 4× expansion — each encoder block expands features (16→64) then contracts back (64→16)
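
The attention, residual, and MLP concepts listed above fit together roughly as in the sketch below. It assumes a pre-norm layout (LayerNorm before each sub-layer), which matches the order shown in the Architecture section. The class names `EncoderBlock` and `ClassifierHead` are hypothetical and may differ from the notebook.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Hypothetical pre-norm Transformer encoder block."""

    def __init__(self, embed_dim=16, num_heads=4, mlp_dim=64):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        # 4x expansion: 16 -> 64 -> 16 with GELU in between.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        # Self-attention: queries, keys, and values are all the same tokens.
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                  # residual around attention
        x = x + self.mlp(self.norm2(x))   # residual around the MLP
        return x


class ClassifierHead(nn.Module):
    """Hypothetical head: CLS token -> LayerNorm -> Linear(16, 10)."""

    def __init__(self, embed_dim=16, num_classes=10):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        return self.fc(self.norm(x[:, 0]))  # position 0 holds the CLS token


encoder = nn.Sequential(*[EncoderBlock() for _ in range(4)])
head = ClassifierHead()

tokens = torch.randn(8, 17, 16)   # (batch, 1 CLS + 16 patches, embed_dim)
logits = head(encoder(tokens))
print(logits.shape)  # torch.Size([8, 10])
```

Each block preserves the (batch, 17, 16) token shape, which is what allows four of them to be stacked and lets the residual additions line up.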

Setup

git clone https://github.com/YOUR_USERNAME/vision-transformer-mnist.git
cd vision-transformer-mnist
pip install -r Requirements.txt

Then open Coding_a_Vision_Transformer_Architecture_using_pytorch(2).ipynb in Jupyter or Google Colab.

Tip: Enable GPU in Colab — Runtime → Change runtime type → T4 GPU
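
To check that the runtime actually sees the GPU, a short snippet like the one below can be run in the first cell. It is a generic PyTorch idiom rather than code taken from the notebook, and it falls back to the CPU when no GPU is available.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise use the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Tensors (and models, via model.to(device)) must live on the same device.
x = torch.zeros(1, 1, 28, 28, device=device)
print(x.device.type)
```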

Hyperparameters

Parameter	Value
Image size	28×28
Patch size	7×7
Number of patches	16
Embedding dim	16
Attention heads	4
Transformer blocks	4
MLP hidden dim	64
Batch size	64
Learning rate	0.001
Optimizer	Adam
Epochs	5
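
The training-related values in the table (batch size 64, learning rate 0.001, Adam, 5 epochs) could be wired up as in the sketch below. Several parts are assumptions made to keep the example self-contained: the cross-entropy loss, the random tensors standing in for MNIST, the placeholder linear model standing in for the ViT, and the way loss and accuracy are accumulated per epoch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for MNIST: 256 random 28x28 grayscale images.
images = torch.randn(256, 1, 28, 28)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

# Placeholder classifier (assumption); substitute the ViT model here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

criterion = nn.CrossEntropyLoss()  # assumed loss; not stated in this README
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)


def train_one_epoch(model, loader):
    """Run one pass over the data; return (summed loss, accuracy)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for x, y in loader:
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()                       # sum of batch losses
        correct += (logits.argmax(dim=1) == y).sum().item()
        seen += y.size(0)
    return total_loss, correct / seen


for epoch in range(1, 6):  # 5 epochs
    epoch_loss, epoch_acc = train_one_epoch(model, loader)
    print(f"Epoch {epoch}: loss={epoch_loss:.2f}, accuracy={epoch_acc:.2%}")
```

On random labels the accuracy is not meaningful; the point is only to show where each hyperparameter enters the loop.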
