A Vision Transformer built entirely from scratch using PyTorch, achieving 96.3% accuracy on the MNIST handwritten digit dataset in just 5 epochs.
Input Image (28×28, grayscale)
↓
Patch Embedding — Conv2d splits image into 16 patches (4×4 grid, patch size 7×7)
↓
CLS Token prepended — learnable classification token
↓
Position Embeddings added — so the model knows patch order
↓
Transformer Encoder × 4
├── LayerNorm
├── Multi-Head Self-Attention (4 heads, batch_first=True)
├── Residual Connection
├── LayerNorm
├── MLP Block (16 → 64 → 16, GELU activation)
└── Residual Connection
↓
MLP Head — extracts CLS token → LayerNorm → Linear(16, 10)
↓
Class Logits (10 digits)
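
The diagram above can be sketched in PyTorch roughly as follows. This is a minimal illustration and not the notebook's actual code. The class names `EncoderBlock` and `TinyViT`, the use of `nn.MultiheadAttention` (suggested by the `batch_first=True` note but not confirmed), and the parameter initialisation are all assumptions.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """Pre-norm block: LayerNorm -> attention -> residual, LayerNorm -> MLP -> residual."""

    def __init__(self, dim=16, heads=4, mlp_dim=64):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                    # first residual connection
        x = x + self.mlp(self.norm2(x))     # second residual connection
        return x


class TinyViT(nn.Module):
    """Hypothetical model name; layer sizes follow the diagram above."""

    def __init__(self, img_size=28, patch=7, dim=16, depth=4,
                 heads=4, mlp_dim=64, classes=10):
        super().__init__()
        num_patches = (img_size // patch) ** 2                    # 4 x 4 = 16
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Initialisation below is an assumption, not taken from the notebook.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)
        self.blocks = nn.Sequential(
            *[EncoderBlock(dim, heads, mlp_dim) for _ in range(depth)]
        )
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, classes))

    def forward(self, x):
        x = self.patch_embed(x)                  # (B, 16, 4, 4)
        x = x.flatten(2).transpose(1, 2)         # (B, 16 patches, 16 dims)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)           # (B, 17, 16)
        x = x + self.pos_embed
        x = self.blocks(x)
        return self.head(x[:, 0])                # classify from the CLS token only
```

Calling `TinyViT()(torch.randn(8, 1, 28, 28))` should return a tensor of shape `(8, 10)`, one row of class logits per image.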
| Epoch | Loss | Accuracy |
|---|---|---|
| 1 | 654.84 | 77.97% |
| 2 | 198.41 | 93.67% |
| 3 | 153.87 | 94.93% |
| 4 | 129.35 | 95.74% |
| 5 | 112.05 | 96.33% |
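
Per-epoch numbers like those above could come from a loop along these lines. The sketch assumes the reported loss is the sum of per-batch cross-entropy losses and that accuracy is measured on the training batches; the table does not state either. The function name `train_one_epoch`, the stand-in linear model, and the synthetic data are hypothetical and only keep the example runnable without downloading MNIST. The optimizer, learning rate, batch size, and epoch count follow this README's hyperparameters.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, loss_fn, device="cpu"):
    """Run one pass over `loader`; return (summed batch loss, accuracy in percent)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()                                  # sum, not mean
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss, 100.0 * correct / seen


# Stand-in model and random data (assumptions), so the sketch runs offline.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
data = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=64, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1, 6):
    loss, acc = train_one_epoch(model, loader, optimizer, loss_fn)
    print(f"epoch {epoch}: loss={loss:.2f} accuracy={acc:.2f}%")
```

With random labels the printed accuracy stays near chance; swapping in the real model and an MNIST `DataLoader` is what would make the numbers meaningful.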
- Patch Embedding — images split into patches via Conv2d, flattened into token sequences
- CLS Token — learnable token prepended to sequence; accumulates global image information for classification
- Position Embeddings — learnable vectors added to each token so the model understands spatial order
- Multi-Head Self-Attention — each patch attends to every other patch, learning which regions matter
- Residual Connections — skip connections around attention and MLP blocks to stabilise training
- MLP with 4× expansion — each encoder block expands features (16→64) then contracts back (64→16)
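
To make the patch-embedding idea concrete, the sketch below checks that a `Conv2d` whose kernel size equals its stride gives the same tokens as cutting the image into patches and applying one shared linear projection. The variable names and random inputs are illustrative assumptions, not code from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
patch, dim = 7, 16
conv = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
images = torch.randn(2, 1, 28, 28)

# Route 1: strided convolution, then flatten the 4x4 grid into 16 tokens.
tokens_conv = conv(images).flatten(2).transpose(1, 2)            # (2, 16, 16)

# Route 2: extract the 7x7 patches explicitly, then reuse the conv weights
# as a plain linear map over each flattened 49-pixel patch.
patches = F.unfold(images, kernel_size=patch, stride=patch)      # (2, 49, 16)
patches = patches.transpose(1, 2)                                 # (2, 16, 49)
tokens_linear = patches @ conv.weight.view(dim, -1).T + conv.bias

same = torch.allclose(tokens_conv, tokens_linear, atol=1e-5)
print(tokens_conv.shape, same)
```

The convolution is therefore just a compact way of writing "split into non-overlapping patches, then project each patch to the embedding dimension".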
git clone https://github.com/YOUR_USERNAME/vision-transformer-mnist.git
cd vision-transformer-mnist
pip install -r requirements.txt

Then open ViT_MNIST.ipynb in Jupyter or Google Colab.
Tip: Enable GPU in Colab — Runtime → Change runtime type → T4 GPU
| Parameter | Value |
|---|---|
| Image size | 28×28 |
| Patch size | 7×7 |
| Number of patches | 16 |
| Embedding dim | 16 |
| Attention heads | 4 |
| Transformer blocks | 4 |
| MLP hidden dim | 64 |
| Batch size | 64 |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Epochs | 5 |
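
Several quantities follow directly from the table and can be checked with a few lines of arithmetic. The variable names are illustrative. The steps-per-epoch figure assumes the full 60,000-image MNIST training split is used and that the last partial batch is kept; neither is stated in this README.

```python
import math

image_size, patch_size = 28, 7
embed_dim, num_heads, mlp_hidden = 16, 4, 64
batch_size = 64
train_images = 60_000  # standard MNIST training split (assumed to be used in full)

patches_per_side = image_size // patch_size              # 4
num_patches = patches_per_side ** 2                      # 16
seq_len = num_patches + 1                                # 17, including the CLS token
pixels_per_patch = patch_size * patch_size               # 49 inputs per patch
head_dim = embed_dim // num_heads                        # 4 dims per attention head
mlp_ratio = mlp_hidden // embed_dim                      # 4x expansion
steps_per_epoch = math.ceil(train_images / batch_size)   # 938 optimizer steps

print(num_patches, seq_len, pixels_per_patch, head_dim, mlp_ratio, steps_per_epoch)
```

Note that the embedding dimension must be divisible by the number of heads, which holds here (16 / 4 = 4).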