Single-Stage, Two-Stage, and Transformer-Based Detection on the BIRDSAI Dataset
Course: COL780 -- Computer Vision, IIT Delhi
Author: Rishit Jakharia (2022CS11621)
This repository implements and benchmarks three major object detection paradigms on the BIRDSAI thermal infrared aerial surveillance dataset, which contains Long-Wave Infrared (LWIR) imagery captured from UAVs for wildlife and human monitoring.
The project evaluates:
| Task | Model | Paradigm |
|---|---|---|
| Task 1 | YOLOv8 (Nano / Small) | Single-stage detection |
| Task 2 | Faster R-CNN + FPN | Two-stage detection with CE and Focal Loss ablation |
| Task 3 | Deformable DETR | Transformer-based detection with three fine-tuning strategies |
All models are evaluated in a Real-to-Real setting (trained on TrainReal, tested on TestReal) using a scale-aware, class-wise mAP metric with 11-point interpolation at an IoU threshold of 0.5.
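For readers unfamiliar with the metric, here is a minimal sketch of 11-point interpolated average precision for a single class. It is not the repository's evaluator: the function name and input layout are hypothetical, and it assumes each detection has already been marked as a true or false positive by matching against ground truth at IoU 0.5.

```python
def average_precision_11pt(detections, num_gt):
    """11-point interpolated AP for one class (illustrative sketch).

    detections: list of (score, is_true_positive) pairs; the TP/FP flag is
                assumed to come from an earlier IoU >= 0.5 matching step.
    num_gt:     number of ground-truth boxes for this class.
    """
    if num_gt == 0:
        return 0.0

    # Rank detections by confidence, highest first.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    tp = fp = 0
    curve = []  # (recall, precision) after each ranked detection
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        curve.append((tp / num_gt, tp / (tp + fp)))

    # At each recall threshold 0.0, 0.1, ..., 1.0 take the best precision
    # achieved at that recall or higher, then average the 11 values.
    total = 0.0
    for i in range(11):
        threshold = i / 10
        total += max((p for r, p in curve if r >= threshold), default=0.0)
    return total / 11


ap = average_precision_11pt([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

A class-wise, scale-aware score would then average values like `ap` over classes and scale categories; how the repository combines them is not shown here.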
| Model | Overall mAP@0.5 | Animals | Humans |
|---|---|---|---|
| YOLOv8 Nano | 0.338 | 0.585 | 0.091 |
| YOLOv8 Small | 0.306 | 0.521 | 0.091 |
| Faster R-CNN (CE) | 0.325 | 0.507 | 0.143 |
| Faster R-CNN (Focal) | 0.237 | 0.369 | 0.106 |
| Deformable DETR -- Exp 2 (Decoder FT) | 0.358 | 0.625 | 0.091 |
Deformable DETR with decoder-only fine-tuning achieved the highest overall mAP (0.358), while YOLOv8 offered the fastest inference, making it the better fit for real-time edge deployment.
code/
├── task1.sh # Entry point for YOLO training and evaluation
├── task2.sh # Entry point for Faster R-CNN
├── task3.sh # Entry point for Deformable DETR
│
├── scripts/
│   ├── task1_prep.py # BIRDSAI to Ultralytics YOLO format converter
│   ├── task2_train.py # Faster R-CNN training and evaluation driver
│   └── task3_train.py # Deformable DETR training and evaluation driver
│
├── src/
│   ├── data/
│   │   ├── birdsai.py # BIRDSAI PyTorch Dataset with scale-aware indexing
│   │   ├── mot_parser.py # MOT annotation parser and scale prior computation
│   │   └── transforms.py # Data augmentation transforms
│   │
│   ├── models/
│   │   ├── yolo.py # YOLOv8 training and evaluation wrapper
│   │   ├── yolo_adapter.py # Adapter bridging Ultralytics output to custom evaluator
│   │   ├── frcnn.py # Faster R-CNN: full pipeline from scratch
│   │   ├── frcnn_backbone.py # ResNet-18 + Feature Pyramid Network backbone
│   │   ├── frcnn_heads.py # RPN head and Fast R-CNN classification head
│   │   └── detr.py # Deformable DETR wrapper with fine-tune mode selection
│   │
│   ├── engine/
│   │   ├── trainer.py # Generic training loop with mixed precision and early stopping
│   │   └── evaluator.py # Scale-aware, class-wise mAP evaluator (11-point interpolation)
│   │
│   ├── losses/
│   │   └── focal_loss.py # Focal Loss implementation for class imbalance
│   │
│   └── utils/
│       ├── logger.py # Experiment logger: JSONL metrics and checkpoint management
│       ├── utils.py # Miscellaneous helpers
│       └── visualization.py # Detection visualization and comparison utilities
│
└── notebook/
    └── vizualization.ipynb # Qualitative analysis and figure generation
Task 1 uses Ultralytics YOLOv8 as a single-stage, anchor-free detector. The BIRDSAI MOT annotations are first converted to the Ultralytics label format by a parallel conversion pipeline (YOLOFormatBuilder). The Nano and Small variants are compared to study how model capacity acts as a regularizer on low-texture thermal imagery.
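The core of the label conversion can be sketched as follows. This is not the YOLOFormatBuilder code: the function name is hypothetical, and it assumes each MOT box is given as top-left corner plus width and height in pixels. The Ultralytics label format expects one line per object with the class index followed by the box center, width, and height, all normalized by the image size.

```python
def mot_box_to_yolo(cls_id, left, top, width, height, img_w, img_h):
    """Convert one pixel-space box to an Ultralytics label line (sketch).

    Assumes (left, top, width, height) in pixels; this column layout is an
    assumption about the MOT-style annotations, not taken from the repo.
    Returns None when the box lies entirely outside the image.
    """
    # Clip the box to the image so normalized values stay within [0, 1].
    x1 = max(0.0, float(left))
    y1 = max(0.0, float(top))
    x2 = min(float(img_w), float(left) + float(width))
    y2 = min(float(img_h), float(top) + float(height))
    if x2 <= x1 or y2 <= y1:
        return None

    # Normalized center coordinates and size.
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Example with a hypothetical 640x480 frame.
line = mot_box_to_yolo(0, 100, 50, 40, 20, img_w=640, img_h=480)
```

Each returned line would be written to a `.txt` file that shares its stem with the corresponding image.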
Task 2 is a complete, from-scratch Faster R-CNN implementation comprising:
- Backbone: ResNet-18 pretrained on ImageNet, extended with a 4-level Feature Pyramid Network for multi-scale feature fusion.
- Region Proposal Network: Generates object proposals using multi-scale anchors (base sizes 16--128, aspect ratios 0.5, 1.0, 2.0) with IoU-based positive/negative sampling.
- Fast R-CNN Head: Two fully-connected layers (1024-d each) followed by per-class bounding box regression and classification.
- Loss Ablation: Standard Cross-Entropy vs Focal Loss (alpha=0.25, gamma=2.0) to study class imbalance handling.
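To make the loss ablation concrete, here is a minimal sketch of the binary focal loss with the alpha and gamma values listed above. It is not the code in `focal_loss.py`; the function name is hypothetical, and how the repository extends the loss to the multi-class detection head is not shown here.

```python
import math


def binary_focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one prediction (illustrative sketch).

    p:      predicted probability of the positive class, in [0, 1].
    target: 1 for a positive example, 0 for a negative one.
    """
    # p_t is the probability assigned to the true class.
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)

    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct: heavily down-weighted
hard = binary_focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is the baseline side of the ablation.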
Task 3 uses the SenseTime pretrained Deformable DETR checkpoint via Hugging Face Transformers. Three fine-tuning strategies are compared:
| Experiment | Trainable Components |
|---|---|
| Exp 1 (Full) | Entire network -- backbone, encoder, decoder, and heads |
| Exp 2 (Decoder) | Decoder + classification/bbox heads only |
| Exp 3 (Encoder) | Encoder + classification/bbox heads only |
Decoder-only fine-tuning (Exp 2) proved most effective -- it preserves robust feature extraction learned from COCO while adapting the object queries to the thermal domain.
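One way to express the three strategies is a rule that decides, from a parameter's dotted name, whether it stays trainable. The sketch below is not the logic in `detr.py`; the mode names and the name components (`encoder`, `decoder`, `class_embed`, `bbox_embed`) are assumptions about how the checkpoint names its parameters and should be checked against the actual model.

```python
# Name components treated as detection heads (assumed naming).
HEAD_KEYS = ("class_embed", "bbox_embed")

# Hypothetical mode names mapped to the components that remain trainable.
MODE_KEYS = {
    "full": None,  # None means train everything
    "decoder": ("decoder",) + HEAD_KEYS,
    "encoder": ("encoder",) + HEAD_KEYS,
}


def is_trainable(param_name, mode):
    """Return True if the parameter should receive gradients in this mode."""
    keys = MODE_KEYS[mode]
    if keys is None:
        return True
    # Compare whole dotted components rather than substrings, so a decoder
    # module named e.g. "encoder_attn" is not mistaken for the encoder.
    parts = param_name.split(".")
    return any(key in parts for key in keys)


names = [
    "model.backbone.conv_encoder.model.layer1.0.conv1.weight",
    "model.encoder.layers.0.fc1.weight",
    "model.decoder.layers.0.encoder_attn.value_proj.weight",
    "class_embed.0.weight",
]
decoder_only = [n for n in names if is_trainable(n, "decoder")]
```

With a PyTorch model, the rule could be applied as `for name, p in model.named_parameters(): p.requires_grad = is_trainable(name, mode)`.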
The BIRDSAI dataset provides aerial LWIR thermal imagery for wildlife conservation and anti-poaching surveillance.
- Classes: Animals (0), Humans (1)
- Splits: TrainReal, TestReal
- Scale categories (video-level, based on average bounding box area):
  - Small: < 200 px
  - Medium: 200 -- 2000 px
  - Large: > 2000 px
- Annotations: MOT-style CSV files with bounding boxes, class labels, and object IDs
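The video-level scale categories described above could be assigned roughly as follows. This is a sketch rather than the logic in `mot_parser.py`: the function name is hypothetical, and it assumes a video's boxes are available as (width, height) pairs in pixels, with the boundary values 200 and 2000 counted as Medium.

```python
def scale_category(boxes):
    """Classify a video as small, medium, or large (illustrative sketch).

    boxes: iterable of (width, height) pairs in pixels, one per annotated
           object across all frames of the video.
    """
    areas = [w * h for w, h in boxes]
    if not areas:
        raise ValueError("video has no annotated boxes")

    avg_area = sum(areas) / len(areas)
    if avg_area < 200:
        return "small"
    if avg_area <= 2000:
        return "medium"
    return "large"


category = scale_category([(20, 20), (40, 30)])  # average area is 800
```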
- Bondi et al., "BIRDSAI: A Dataset for Detection and Tracking in Aerial Thermal Infrared Videos," WACV 2020.
- Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016.
- Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE TPAMI, 2017.
- Zhu et al., "Deformable DETR: Deformable Transformers for End-to-End Object Detection," ICLR 2021.