Module: ECS7026P — Neural Networks and Deep Learning, Queen Mary University of London
Group AZ: Hanad Ali · Muhammad Husaam Ateeq · Blazej Olszta
This repository contains two deep learning projects submitted as the coursework for ECS7026P.
Part 1 implements a custom convolutional neural network architecture (CIFARNet) for image classification on CIFAR-10, achieving 96.20% test accuracy.
Part 2 investigates long-tail image recognition on the real-world iNaturalist 2018 dataset (8,142 species, 461,939 images, imbalance ratio ≈ 200:1), comparing five methods from the long-tail learning literature under a controlled experimental protocol.
Repository structure:
├── CIFAR-10_image_classification.ipynb # Part 1: CIFARNet on CIFAR-10
├── Beyond_CIFAR_10_image_classification.ipynb # Part 2: Long-tail recognition on iNaturalist 2018
└── README.md
In Part 1, CIFARNet implements a prescribed architecture: a stack of intermediate blocks followed by an output block.
Intermediate block design:
Each block receives an input x (the image for the first block, a feature map thereafter) and outputs a weighted combination of L = 4 independent convolutional branches:
x' = a₁C₁(x) + a₂C₂(x) + a₃C₃(x) + a₄C₄(x)
The weighting vector a = (a₁, a₂, a₃, a₄) is produced by a fully connected layer applied to the channel-wise global average pool of x, so the network weights its own convolutional branches dynamically from the input's channel statistics.
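The branch weighting can be illustrated with a minimal pure-Python sketch. It assumes the softmax normalisation listed under the architectural decisions, treats each branch output as a flattened list, and uses hypothetical function names (`branch_coefficients`, `combine`). It is not the notebook's PyTorch implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def branch_coefficients(channel_means, weight, bias):
    """a = softmax(W m + b), where m is the channel-wise global average pool.

    weight has one row per branch and one column per input channel.
    """
    scores = [
        sum(w * m for w, m in zip(row, channel_means)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(scores)


def combine(branch_outputs, coeffs):
    """x' = sum over branches of a_l * C_l(x), element by element."""
    length = len(branch_outputs[0])
    return [
        sum(a * out[i] for a, out in zip(coeffs, branch_outputs))
        for i in range(length)
    ]


# Toy example: 3 input channels, 4 branches, zero FC weights (assumed values).
channel_means = [0.5, -1.0, 2.0]
weight = [[0.0, 0.0, 0.0] for _ in range(4)]
bias = [0.0] * 4

a = branch_coefficients(channel_means, weight, bias)  # [0.25, 0.25, 0.25, 0.25]
branch_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x_next = combine(branch_outputs, a)                   # [4.0, 5.0]
```

With zero weights the coefficients are uniform, so the block output is the plain average of the branches. Non-zero weights let the input's channel statistics shift mass between the 3×3 and 5×5 branches.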
Key architectural decisions:
| Component | Detail |
|---|---|
| Intermediate blocks | 6 blocks (B1–B6) in sequence |
| Branches per block | 4 independent conv branches (kernel sizes: 3, 3, 5, 5) |
| Branch coefficients | Softmax-normalised (prevents magnitude explosion under mixed precision) |
| Inter-block downsampling | Strided conv layers (stride 2) between block groups; spatial resolution reduced 32×32 → 16×16 → 8×8 |
| Channel progression | 64 → 128 → 256 across block groups |
| Output block | Global average pool → MLP (256 hidden, ReLU, dropout p=0.3) → 10-way classifier |
| Total parameters | 11.88 million |
Block group structure:
| Block group | Kernel sizes | Output channels | Feature map |
|---|---|---|---|
| B1, B2 | [3, 3, 5, 5] | 64 | 32 × 32 |
| B3, B4 | [3, 3, 5, 5] | 128 | 16 × 16 |
| B5, B6 | [3, 3, 5, 5] | 256 | 8 × 8 |
| Output | MLP (256 hidden, dropout 0.3) | 10 | 1 × 1 |

Training configuration:
| Component | Setting |
|---|---|
| Optimiser | SGD with Nesterov momentum (μ = 0.9) |
| Weight decay | 5×10⁻⁴ (conv and linear weights only; excludes BN params and biases) |
| Base learning rate | 0.1 |
| LR schedule | Linear warmup (5 epochs) → cosine annealing to 1×10⁻⁵ over 195 epochs |
| Batch size | 128 |
| Epochs | 200 |
| Loss | Cross-entropy with label smoothing (ε = 0.1) |
| MixUp | α = 0.2 |
| Augmentation | RandomCrop (32×32, pad 4, reflect) · RandomHorizontalFlip · RandAugment (n=2, m=9) · RandomErasing (p=0.25) |
| Precision | Automatic mixed precision (fp16 forward, fp32 master weights) |
| Initialisation | Kaiming normal (fan-out, ReLU) |
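The learning-rate schedule in the table can be written as a single function of the epoch. This is a sketch under assumptions: the schedule is stepped once per epoch (the notebook may step per batch), and warmup rises linearly from `BASE_LR / 5` to `BASE_LR`. The name `lr_at_epoch` is hypothetical.

```python
import math

BASE_LR = 0.1
MIN_LR = 1e-5
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 200


def lr_at_epoch(epoch):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < WARMUP_EPOCHS:
        # Assumed warmup shape: 0.02, 0.04, ..., 0.1 over the first 5 epochs.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    cosine_epochs = TOTAL_EPOCHS - WARMUP_EPOCHS  # 195
    progress = (epoch - WARMUP_EPOCHS) / cosine_epochs
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(TOTAL_EPOCHS)]
```

Under these assumptions the rate peaks at 0.1 at the end of warmup and decays monotonically towards 1×10⁻⁵ over the remaining 195 epochs.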

Results:
| Metric | Value |
|---|---|
| Best test accuracy | 96.20% (epoch 199) |
| Marking bracket | ≥ 92% (top bracket) |
The training loss curve shows steady descent over ~78,200 batches, with high per-batch variance driven by MixUp's U-shaped mixing coefficients and RandomErasing. Both training and test accuracy curves rise together throughout, with a ~4 percentage point generalisation gap at epoch 199 — consistent with a well-regularised model trained from scratch.
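Two figures in the paragraph above can be checked with a short sketch. It assumes the standard 50,000-image CIFAR-10 training set with the final partial batch kept, and the standard MixUp formulation in which the mixing coefficient is drawn from Beta(α, α). The seed and sample count are arbitrary.

```python
import math
import random

TRAIN_IMAGES = 50_000  # CIFAR-10 training set size
BATCH_SIZE = 128
EPOCHS = 200
ALPHA = 0.2            # MixUp alpha from the training configuration

# Batch count: the last, smaller batch of each epoch is assumed to be kept.
batches_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)  # 391
total_batches = batches_per_epoch * EPOCHS                # 78200

# Beta(0.2, 0.2) is U-shaped: most draws fall near 0 or 1.
rng = random.Random(0)
lambdas = [rng.betavariate(ALPHA, ALPHA) for _ in range(10_000)]
near_edges = sum(lam < 0.1 or lam > 0.9 for lam in lambdas) / len(lambdas)
near_middle = sum(0.4 <= lam <= 0.6 for lam in lambdas) / len(lambdas)
```

Most draws leave one image dominant, but the occasional near-even mix produces a much harder batch. That is one way to read the high per-batch loss variance.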
Part 2 studies long-tail image recognition, a recognised challenge in computer vision in which a small number of head classes dominate the distribution while most classes have very few training examples.
We use iNaturalist 2018 (Van Horn et al., 2018), a real biodiversity dataset with naturally occurring class imbalance, rather than a synthetic benchmark. The task is species classification across 8,142 classes.
| Dataset property | Value |
|---|---|
| Total images | 461,939 |
| Number of classes | 8,142 |
| Min samples per class | 3 |
| Max samples per class | 602 |
| Imbalance ratio | 200.67:1 |
| Mean samples per class | 34 |
| Median samples per class | 15 |
Head classes (top 20%, 1,628 classes) average 115 training samples; tail classes (bottom 20%, 1,628 classes) average only 8.77.
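The derived figures in the table follow from simple arithmetic, sketched below. One observation rather than a documented fact: the reported mean of 34 matches the 277,163-image training split divided by 8,142 classes, which suggests the per-class statistics describe the training split rather than all 461,939 images.

```python
max_per_class = 602
min_per_class = 3
num_classes = 8_142
train_images = 277_163  # training split size from the 60/20/20 split

imbalance_ratio = max_per_class / min_per_class  # about 200.67
mean_per_class = train_images / num_classes      # about 34.04
group_size = int(0.20 * num_classes)             # 1628 classes in each 20% group
```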
Backbone: ImageNet-pretrained ResNet-50 (IMAGENET1K_V2 weights) with the classifier replaced by an 8,142-way linear head (40.19M parameters, all updated end-to-end except during cRT Stage 2).
Data split: 60/20/20 stratified train/val/test split (277,163 / 92,388 / 92,388 images). Stratification is essential — without it, many tail classes would be absent from one or more splits.
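A stratified split can be sketched in pure Python as follows. This is an illustration only: the notebook may rely on a library routine instead, and the function name, rounding rule and default seed here are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_train = round(fractions[0] * n)
        n_val = round(fractions[1] * n)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy example: a 10-image head class and a 5-image tail class.
labels = [0] * 10 + [1] * 5
train_idx, val_idx, test_idx = stratified_split(labels)
```

In this toy case the 5-image class contributes 3, 1 and 1 images to the three splits, so it stays represented everywhere. A purely random split gives no such guarantee.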
Evaluation metric: Balanced accuracy (primary), not overall accuracy — overall accuracy is dominated by head classes and masks poor tail performance.
Each method was repeated with seeds 42, 123, and 456; all reported numbers are means and standard deviations across seeds.
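Balanced accuracy is the unweighted mean of per-class recall, so every class counts equally regardless of its size. The sketch below uses made-up labels to show how the two metrics diverge. The function names are hypothetical and this is not the notebook's evaluation code.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# A classifier that always predicts the head class (toy data).
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100

overall = overall_accuracy(y_true, y_pred)    # 0.98
balanced = balanced_accuracy(y_true, y_pred)  # 0.5
```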
Five methods spanning the principal solution categories in the long-tail literature:
| Method | Description |
|---|---|
| Baseline CE | Standard cross-entropy on the natural distribution — no rebalancing |
| Re-weighting | Loss weighted by inverse class frequency, 1/n_y, where n_y is the number of training samples in class y |
| Resampling | WeightedRandomSampler oversampling rare classes with per-sample weight 1/n_y at the data level |
| Logit Adjustment | Baseline training; at inference, logits are adjusted by subtracting τ·log(π_y), with τ = 1.0 and π_y the class prior |
| Two-Stage cRT | Stage 1: full network on natural distribution (80 epochs); Stage 2: backbone frozen, classifier retrained with balanced sampling (20 epochs) |
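Two of the methods in the table reduce to a few lines of arithmetic, sketched here in pure Python. The sketch assumes the class prior π_y is estimated as the training frequency n_y / N and that the 1/n_y weights are left unnormalised. The notebook may differ on both points, and the function names are hypothetical.

```python
import math


def inverse_frequency_weights(class_counts):
    """Per-class weight 1/n_y, usable as a loss weight or a sampling weight."""
    return [1.0 / n for n in class_counts]


def logit_adjust(logits, class_counts, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(pi_y) from each logit."""
    total = sum(class_counts)
    return [z - tau * math.log(n / total) for z, n in zip(logits, class_counts)]


# Toy example: a head class with 602 samples and a tail class with 3.
class_counts = [602, 3]
logits = [2.0, 1.0]                             # raw model output favours the head class
adjusted = logit_adjust(logits, class_counts)   # about [2.005, 6.307]

weights = inverse_frequency_weights(class_counts)
```

Because log(π_y) is strongly negative for rare classes, the adjustment lifts tail logits at inference without any retraining. In the toy example the predicted class flips from head to tail.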

Results:
| Method | Overall | Balanced | Macro-F1 | Head | Medium | Tail |
|---|---|---|---|---|---|---|
| Baseline CE | 65.39 ± 0.19 | 54.33 ± 0.36 | 61.82 ± 0.24 | 64.30 | 53.86 | 45.74 |
| Re-weighting | 57.15 ± 0.16 | 51.03 ± 0.17 | 57.18 ± 0.16 | 58.44 | 50.70 | 44.60 |
| Resampling | 59.34 ± 0.19 | 53.01 ± 0.16 | 58.71 ± 0.18 | 60.17 | 52.54 | 47.26 |
| Logit Adjustment | 61.66 ± 0.09 | **58.72 ± 0.18** | 57.13 ± 0.03 | 60.51 | 58.97 | 56.19 |
| Two-Stage cRT | 61.67 ± 0.20 | 57.32 ± 0.08 | 57.62 ± 0.19 | 61.50 | 57.56 | 52.43 |
Values are test-set percentages (mean ± std across 3 seeds). Bold marks the best balanced accuracy.

Key findings:
- Overall accuracy is a misleading metric on long-tailed data. The baseline tops overall accuracy (65.39%) but achieves only 45.74% on tail classes, a gap of about 19.6 pp that is invisible if only the headline number is reported.
- Inference-time correction outperforms training-time rebalancing. Logit adjustment achieves the best balanced accuracy (58.72%) and tail accuracy (56.19%) without any change to training. Re-weighting and resampling both underperform the baseline on balanced accuracy (51.03% and 53.01% vs 54.33%) at this scale of imbalance (~200:1), consistent with Buda et al. (2018).
- Re-weighting and resampling behave similarly, as theory predicts. Their tail accuracies differ by 2.66 pp (47.26% for resampling vs 44.60% for re-weighting), consistent with the near-equivalence of applying gradient pressure via loss weights vs sampling frequency (Buda et al., 2018).
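The theoretical link between re-weighting and resampling can be made explicit with a short derivation. This is the standard expectation argument, not something taken from the notebook. Here n_y is the number of training samples in class y, N is the total number of samples, K is the number of classes, and the barred loss is the mean loss over the examples of class y.

```latex
% n_y: training samples in class y;  N = \sum_y n_y;  K: number of classes
% \bar{\ell}_y = \frac{1}{n_y} \sum_{i : y_i = y} \ell_i  (mean loss of class y)
\begin{aligned}
\text{Re-weighting:}\quad
  \frac{1}{N}\sum_{i=1}^{N} \frac{1}{n_{y_i}}\,\ell_i
  &= \frac{1}{N}\sum_{y=1}^{K} \frac{1}{n_y}\sum_{i:\,y_i=y} \ell_i
   = \frac{1}{N}\sum_{y=1}^{K} \bar{\ell}_y \\[4pt]
\text{Resampling:}\quad
  p_i = \frac{1/n_{y_i}}{\sum_{j=1}^{N} 1/n_{y_j}} = \frac{1}{K\,n_{y_i}}
  \;\Rightarrow\;
  \mathbb{E}_{i\sim p}\left[\ell_i\right]
  &= \sum_{i=1}^{N} p_i\,\ell_i
   = \frac{1}{K}\sum_{y=1}^{K} \bar{\ell}_y
\end{aligned}
```

Both objectives are proportional to the sum of per-class mean losses and differ only by the constant factor K/N, which acts like a rescaling of the learning rate. The two methods still differ in mini-batch composition and gradient variance, which is one plausible source of the remaining gap between them.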
Both notebooks are designed to run on Google Colab. Set the runtime to GPU before starting: Runtime > Change runtime type > T4 GPU.
Part 1 (CIFAR-10_image_classification.ipynb): run all cells in order. The CIFAR-10 dataset is downloaded automatically via torchvision.
Part 2 (Beyond_CIFAR_10_image_classification.ipynb): the notebook is split into three sections:
| Section | Purpose | Run on Colab? |
|---|---|---|
| Section 0 — Shared Setup | Imports, device detection, hyperparameters, model architecture, shared utilities | ✅ Run first |
| Section 1 — Training | Full local training pipeline that produced the saved checkpoints and logs | ⏭ Skip on Colab |
| Section 2 — Colab Evaluation | Downloads checkpoints and logs from HuggingFace, recreates all plots and tables, validates hypotheses | ✅ Run after Section 0 |
To reproduce all results on Colab (no dataset download required):
- Run all Section 0 cells
- Skip Section 1 entirely
- Run Section 2 cells top to bottom — checkpoints and training logs are downloaded automatically from HuggingFace (~9 GB)
Model weights: checkpoints.zip (~9 GB) hosted at HuggingFace:
https://huggingface.co/datasets/husaam7/inat2018-checkpoints
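For fetching the archive outside the notebook, a standard-library sketch is given below. It assumes `checkpoints.zip` sits at the root of the dataset repository on the `main` branch and uses Hugging Face's `resolve` URL pattern. The notebook's own download code may differ (for example, it may use the `huggingface_hub` client), and the function names here are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlretrieve

REPO_ID = "husaam7/inat2018-checkpoints"
FILENAME = "checkpoints.zip"  # assumed to be at the repository root


def checkpoint_url(repo_id=REPO_ID, filename=FILENAME, revision="main"):
    """Build the direct-download URL for a file in a Hugging Face dataset repo."""
    return (
        f"https://huggingface.co/datasets/{repo_id}"
        f"/resolve/{revision}/{quote(filename)}"
    )


def download_checkpoints(dest="checkpoints.zip"):
    """Download the ~9 GB archive to dest; slow, so call it explicitly."""
    path, _headers = urlretrieve(checkpoint_url(), dest)
    return path
```

Calling `download_checkpoints()` starts the transfer. Importing or defining the functions does not touch the network.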

Tools and libraries:
- PyTorch: model implementation and training
- torchvision: CIFAR-10 DataLoaders, ResNet-50 pretrained weights, data augmentation
- Hugging Face / iNaturalist 2018: dataset for Part 2
- Matplotlib: training curves and result plots
- Google Colab: training environment (GPU)

References:
- Buda, M., Maki, A. and Mazurowski, M.A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106, pp. 249–259.
- Cubuk, E.D., Zoph, B., Shlens, J. and Le, Q.V. (2020). RandAugment: Practical automated data augmentation with a reduced search space. NeurIPS, 33, pp. 18613–18624.
- Cui, Y. et al. (2019). Class-balanced loss based on effective number of samples. CVPR, pp. 9268–9277.
- Kang, B. et al. (2020). Decoupling representation and classifier for long-tailed recognition. ICLR.
- Loshchilov, I. and Hutter, F. (2017). SGDR: Stochastic gradient descent with warm restarts. ICLR.
- Menon, A.K. et al. (2021). Long-tail learning via logit adjustment. ICLR.
- Szegedy, C. et al. (2016). Rethinking the Inception architecture for computer vision. CVPR, pp. 2818–2826.
- Van Horn, G. et al. (2018). The iNaturalist species classification and detection dataset. CVPR, pp. 8769–8778.
- Zhang, H. et al. (2018). mixup: Beyond empirical risk minimization. ICLR.