Comparative Analysis of Object Detection Frameworks on Aerial Thermal Imagery

Single-Stage, Two-Stage, and Transformer-Based Detection on the BIRDSAI Dataset

Course: COL780 -- Computer Vision, IIT Delhi
Author: Rishit Jakharia (2022CS11621)

Overview

This repository implements and benchmarks three major object detection paradigms on the BIRDSAI thermal infrared aerial surveillance dataset, which contains Long-Wave Infrared (LWIR) imagery captured from UAVs for wildlife and human monitoring.

The project evaluates:

Task	Model	Paradigm
Task 1	YOLOv8 (Nano / Small)	Single-stage detection
Task 2	Faster R-CNN + FPN	Two-stage detection with CE and Focal Loss ablation
Task 3	Deformable DETR	Transformer-based detection with three fine-tuning strategies

All models are evaluated under a Real-to-Real setting (trained on TrainReal, evaluated on TestReal) using the scale-aware, class-wise mAP metric with 11-point interpolation at an IoU threshold of 0.5.
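
For readers unfamiliar with the metric, 11-point interpolation averages the best precision reachable at recall levels 0.0, 0.1, ..., 1.0. The sketch below is a minimal illustration of that averaging step only. It is not the code in src/engine/evaluator.py; the function name and the list-based inputs are assumptions made for the example.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated average precision.

    `recalls` and `precisions` are parallel lists produced by sweeping
    detections in descending confidence order at a fixed IoU threshold.
    (Hypothetical helper, written for illustration.)
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have equal length")

    total = 0.0
    for i in range(11):
        threshold = i / 10.0
        # Interpolated precision: the best precision at any recall >= threshold.
        candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(candidates) if candidates else 0.0
    return total / 11.0


# Three detections against two ground-truth boxes: TP, FP, TP.
recalls = [0.5, 0.5, 1.0]
precisions = [1.0, 0.5, 2.0 / 3.0]
ap = eleven_point_ap(recalls, precisions)
print(f"AP = {ap:.4f}")
```

In this toy case the six recall levels up to 0.5 see a best precision of 1.0 and the remaining five see 2/3, so the result is (6 + 10/3) / 11.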

Key Results

Overall mAP Comparison (IoU @ 0.5)

Model	Overall	Animals	Humans
YOLOv8 Nano	0.338	0.585	0.091
YOLOv8 Small	0.306	0.521	0.091
Faster R-CNN (CE)	0.325	0.507	0.143
Faster R-CNN (Focal)	0.237	0.369	0.106
Deformable DETR -- Exp 2 (Decoder FT)	0.358	0.625	0.091

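Deformable DETR with decoder-only fine-tuning achieved the highest overall mAP (0.358), driven by the Animals class (0.625), while Faster R-CNN with Cross-Entropy loss scored best on Humans (0.143). YOLO maintained the fastest inference throughput, which makes it the most suitable of the three for real-time edge deployment.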
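
As a consistency check on the table above, the Overall column agrees with the unweighted mean of the Animals and Humans columns to within rounding. Whether the evaluator also averages over scale categories before this step is not stated here, so the snippet below should be read as an observation about the reported numbers, not as a description of the evaluator.

```python
# Rows copied from the results table: (overall, animals, humans).
results = {
    "YOLOv8 Nano": (0.338, 0.585, 0.091),
    "YOLOv8 Small": (0.306, 0.521, 0.091),
    "Faster R-CNN (CE)": (0.325, 0.507, 0.143),
    "Faster R-CNN (Focal)": (0.237, 0.369, 0.106),
    "Deformable DETR -- Exp 2 (Decoder FT)": (0.358, 0.625, 0.091),
}


def class_mean(animals, humans):
    """Unweighted mean of the two class-wise scores."""
    return (animals + humans) / 2.0


gaps = {}
for name, (overall, animals, humans) in results.items():
    gaps[name] = abs(class_mean(animals, humans) - overall)
    print(f"{name}: class mean = {class_mean(animals, humans):.4f}, "
          f"reported = {overall:.3f}")

max_gap = max(gaps.values())
print(f"largest gap = {max_gap:.4f}")
```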

Project Structure

code/
├── task1.sh                     # Entry point for YOLO training and evaluation
├── task2.sh                     # Entry point for Faster R-CNN
├── task3.sh                     # Entry point for Deformable DETR
│
├── scripts/
│   ├── task1_prep.py            # BIRDSAI to Ultralytics YOLO format converter
│   ├── task2_train.py           # Faster R-CNN training and evaluation driver
│   └── task3_train.py           # Deformable DETR training and evaluation driver
│
├── src/
│   ├── data/
│   │   ├── birdsai.py           # BIRDSAI PyTorch Dataset with scale-aware indexing
│   │   ├── mot_parser.py        # MOT annotation parser and scale prior computation
│   │   └── transforms.py        # Data augmentation transforms
│   │
│   ├── models/
│   │   ├── yolo.py              # YOLOv8 training and evaluation wrapper
│   │   ├── yolo_adapter.py      # Adapter bridging Ultralytics output to custom evaluator
│   │   ├── frcnn.py             # Faster R-CNN: full pipeline from scratch
│   │   ├── frcnn_backbone.py    # ResNet-18 + Feature Pyramid Network backbone
│   │   ├── frcnn_heads.py       # RPN head and Fast R-CNN classification head
│   │   └── detr.py              # Deformable DETR wrapper with fine-tune mode selection
│   │
│   ├── engine/
│   │   ├── trainer.py           # Generic training loop with mixed precision and early stopping
│   │   └── evaluator.py         # Scale-aware, class-wise mAP evaluator (11-point interpolation)
│   │
│   ├── losses/
│   │   └── focal_loss.py        # Focal Loss implementation for class imbalance
│   │
│   └── utils/
│       ├── logger.py            # Experiment logger: JSONL metrics and checkpoint management
│       ├── utils.py             # Miscellaneous helpers
│       └── visualization.py     # Detection visualization and comparison utilities
│
└── notebook/
    └── vizualization.ipynb      # Qualitative analysis and figure generation

Architecture Details

Task 1 -- YOLO

Leverages Ultralytics YOLOv8 as a single-stage, anchor-free detector. The BIRDSAI MOT annotations are first converted to the Ultralytics label format via a parallel conversion pipeline (YOLOFormatBuilder). Both the Nano and Small backbone variants were compared to study the regularization effect of model capacity on low-texture thermal imagery.
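
The core of such a conversion is a per-box coordinate change: MOT-style boxes are given as a top-left corner plus width and height in pixels, whereas Ultralytics label files expect one line per object of the form `class cx cy w h`, with all four values normalized to the image size. The sketch below shows one plausible version of that step. The function names, the clipping behaviour, and the 640x480 frame size in the example are assumptions, not details taken from YOLOFormatBuilder.

```python
def mot_box_to_yolo(x, y, w, h, img_w, img_h):
    """Convert a top-left pixel box to normalized (cx, cy, w, h).

    Returns None when the box lies entirely outside the image.
    (Hypothetical helper, written for illustration.)
    """
    # Clip the box to the image so normalized values stay in [0, 1].
    x1 = max(0.0, float(x))
    y1 = max(0.0, float(y))
    x2 = min(float(img_w), float(x) + float(w))
    y2 = min(float(img_h), float(y) + float(h))
    if x2 <= x1 or y2 <= y1:
        return None
    return (
        (x1 + x2) / 2.0 / img_w,
        (y1 + y2) / 2.0 / img_h,
        (x2 - x1) / img_w,
        (y2 - y1) / img_h,
    )


def yolo_label_line(class_id, box):
    """Format one object as a line of an Ultralytics label file."""
    cx, cy, w, h = box
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Example: an animal (class 0) in an assumed 640x480 frame.
box = mot_box_to_yolo(100, 50, 40, 20, img_w=640, img_h=480)
line = yolo_label_line(0, box)
print(line)
```

Boxes that fall completely outside the frame return None and can simply be skipped when writing the label file.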

Task 2 -- Faster R-CNN + FPN

A complete, from-scratch implementation comprising:

Backbone: ResNet-18 pretrained on ImageNet, extended with a 4-level Feature Pyramid Network for multi-scale feature fusion.
Region Proposal Network: Generates object proposals using multi-scale anchors (base sizes 16--128, aspect ratios 0.5, 1.0, 2.0) with IoU-based positive/negative sampling.
Fast R-CNN Head: Two fully-connected layers (1024-d each) followed by per-class bounding box regression and classification.
Loss Ablation: Standard Cross-Entropy vs Focal Loss (alpha=0.25, gamma=2.0) to study class imbalance handling.
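
To make the loss ablation concrete, the sketch below contrasts Cross-Entropy with Focal Loss for a single sample, using the alpha and gamma values listed above. It is a dependency-free illustration, not the code in src/losses/focal_loss.py. Applying one scalar alpha to every class is an assumption made here; the actual implementation may weight classes differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one sample."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Three classes, e.g. background / animal / human; true class is index 0.
easy = [4.0, 0.0, 0.0]  # confident and correct
hard = [0.0, 4.0, 0.0]  # confident and wrong

for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: CE = {ce:.5f}, focal = {fl:.5f}, ratio = {fl / ce:.5f}")
```

The modulating factor (1 - p_t)^gamma shrinks the contribution of well-classified samples far more than that of misclassified ones, which is the mechanism the ablation is testing against class imbalance.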

Task 3 -- Deformable DETR

Uses the SenseTime pretrained Deformable DETR via HuggingFace Transformers. Three fine-tuning strategies are compared:

Experiment	Trainable Components
Exp 1 (Full)	Entire network -- backbone, encoder, decoder, and heads
Exp 2 (Decoder)	Decoder + classification/bbox heads only
Exp 3 (Encoder)	Encoder + classification/bbox heads only

Decoder-only fine-tuning (Exp 2) proved the most effective of the three strategies: it keeps the COCO-pretrained backbone and encoder frozen, preserving their feature extraction, while adapting the decoder and its object queries to the thermal domain.
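
One common way to implement this kind of mode selection is to decide trainability from parameter-name prefixes. The sketch below shows the idea on plain strings so it runs without any deep-learning dependency. The prefixes and the example parameter names are assumptions modelled on HuggingFace-style naming and should be checked against the real model; this is not the logic in src/models/detr.py.

```python
# Assumed name prefixes; verify against the actual model's parameter names.
HEAD_PREFIXES = ("class_embed", "bbox_embed")

MODE_PREFIXES = {
    "full": None,  # None means every parameter is trainable
    "decoder": ("model.decoder",) + HEAD_PREFIXES,
    "encoder": ("model.encoder",) + HEAD_PREFIXES,
}


def is_trainable(param_name, mode):
    """Return True if `param_name` should receive gradients in `mode`."""
    prefixes = MODE_PREFIXES[mode]
    if prefixes is None:
        return True
    return param_name.startswith(prefixes)


def trainable_names(param_names, mode):
    """Filter parameter names down to the trainable subset for `mode`."""
    return [name for name in param_names if is_trainable(name, mode)]


# Hypothetical parameter names, for illustration only.
names = [
    "model.backbone.conv1.weight",
    "model.encoder.layers.0.fc1.weight",
    "model.decoder.layers.0.fc1.weight",
    "class_embed.0.weight",
    "bbox_embed.0.layers.0.weight",
]

for mode in ("full", "decoder", "encoder"):
    print(mode, "->", trainable_names(names, mode))
```

With a PyTorch model, the same predicate could be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, mode)`. Whether the learned query embeddings belong to the decoder group is not stated in this README; if they should be trained in Exp 2, their prefix would need to be added to the "decoder" entry.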

Dataset

The BIRDSAI dataset provides aerial LWIR thermal imagery for wildlife conservation and anti-poaching surveillance.

Classes: Animals (0), Humans (1)
Splits: TrainReal, TestReal
Scale categories (assigned per video, based on average bounding box area):
- Small: < 200 px²
- Medium: 200 -- 2000 px²
- Large: > 2000 px²
Annotations: MOT-style CSV files with bounding boxes, class labels, and object IDs
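
The scale rule above can be sketched as a small function that averages box areas over a whole video and then applies the two thresholds. Function names and the (width, height) input format are assumptions for this example; the scale prior computation in src/data/mot_parser.py may differ in detail.

```python
SMALL_MAX_AREA = 200.0    # areas below this are "small"
MEDIUM_MAX_AREA = 2000.0  # areas above this are "large"


def scale_category(avg_area):
    """Map an average box area (in square pixels) to a scale category."""
    if avg_area < SMALL_MAX_AREA:
        return "small"
    if avg_area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"


def video_scale(boxes):
    """Assign one scale category to a video.

    `boxes` is a list of (width, height) pairs in pixels, one per
    annotated box across all frames of the video.
    (Hypothetical helper, written for illustration.)
    """
    areas = [w * h for w, h in boxes]
    if not areas:
        raise ValueError("video has no annotated boxes")
    return scale_category(sum(areas) / len(areas))


print(video_scale([(10, 10), (12, 15)]))   # average area 140
print(video_scale([(20, 20), (40, 30)]))   # average area 800
print(video_scale([(60, 50), (80, 40)]))   # average area 3100
```

Because the category is computed once per video, every frame of a video inherits the same label regardless of how individual boxes vary.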

References

Bondi et al., "BIRDSAI: A Dataset for Detection and Tracking in Aerial Thermal Infrared Videos," WACV 2020.
Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," CVPR 2016.
Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE TPAMI, 2017.
Zhu et al., "Deformable DETR: Deformable Transformers for End-to-End Object Detection," ICLR 2021.
