IEEE ICIP 2025 Grand Challenge — Cityscape Aerial image Dataset for Object Detection
Academic project exploring data augmentation strategies to improve YOLOv11 detection of rare object classes in dense urban aerial imagery.
Team members: Romain Sebire · Pauline Rougeot · Rémy Plastre
Course: Computer Vision — M2 2024/2025
This project was developed as part of the CADOT Challenge (IEEE ICIP 2025), which tasks participants with detecting 14 categories of urban objects in high-resolution aerial images of the Paris region, captured by the French National Institute of Geographic and Forest Information (IGN).
The core challenge is the extreme class imbalance: common classes like small vehicle (46.7%) and building (40.9%) dominate, while rare classes like basketball field, football field, train, or swimming pool each represent less than 1% of annotations. Standard training leads to models that excel on common classes but fail on rare ones.
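The imbalance can be checked by counting annotations per category in the COCO file. The sketch below is a minimal illustration on a hand-made dictionary, not the repository's `count_objects.py`. The function name `class_shares` and the toy data are hypothetical; the field names (`categories`, `annotations`, `category_id`) follow the standard COCO layout.

```python
from collections import Counter


def class_shares(coco: dict) -> dict[str, float]:
    """Return each category's share of all annotations, as a fraction."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(a["category_id"] for a in coco["annotations"])
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in id_to_name.values()}
    # Categories with no annotations are reported as 0.0 rather than dropped.
    return {name: counts.get(cid, 0) / total for cid, name in id_to_name.items()}


# Toy data for illustration only; these are not CADOT statistics.
toy = {
    "categories": [
        {"id": 1, "name": "small vehicle"},
        {"id": 2, "name": "building"},
        {"id": 3, "name": "basketball field"},
    ],
    "annotations": [{"category_id": 1}] * 5
    + [{"category_id": 2}] * 4
    + [{"category_id": 3}] * 1,
}

shares = class_shares(toy)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {share:6.1%}")
```

On the real dataset, the same function could be fed the parsed training annotation file (for example via `json.load`).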
Our approach combines three complementary strategies:
- Targeted Albumentations — class-aware offline data augmentation with physics-based pipelines tailored to each object type
- Generative AI Inpainting — synthetic data generation for the rarest class (basketball fields) using Google Gemini's built-in image generator (Imagen 3 / Nano Banana)
- YOLOv11 fine-tuning — systematic comparison of model sizes and augmentation strategies
| Experiment | Model | Epochs | mAP50 | mAP50-95 | Precision | Recall |
|---|---|---|---|---|---|---|
| Baseline (no augmentation) | YOLOv11m | 81 | 0.529 | 0.364 | 0.524 | 0.526 |
| Classic Albumentations | YOLOv11m | 61 | 0.540 | 0.359 | 0.584 | 0.525 |
| Multi-Pipeline Albumentations | YOLOv11m | 85 | 0.558 | 0.373 | 0.560 | 0.568 |
| Baseline + GenAI Basketball | YOLOv11m | 70 | 0.530 | 0.363 | 0.522 | 0.562 |
| Baseline (smaller model) | YOLOv11n | 206 | 0.526 | 0.345 | 0.535 | 0.511 |
Best model: Multi-Pipeline Albumentations, with mAP50 = 0.558 (+0.029 absolute, +5.5% relative to the YOLOv11m baseline)
For reference, the official challenge baselines on the validation set, reported as mAP50 in percent, are: YOLOv11 = 62, YOLOv12 = 56, Faster R-CNN = 33.88, DiffusionDet = 52.76 (see baseline comparison).
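The gains quoted for our experiments are relative to the YOLOv11m baseline. As a small worked example, the snippet below recomputes the absolute and relative gain from the mAP50 values in the results table; the helper names are hypothetical.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in metric points."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Difference expressed as a fraction of the baseline value."""
    return (new - baseline) / baseline


# mAP50 values copied from the results table.
baseline_map50 = 0.529
multi_pipeline_map50 = 0.558

print(f"absolute: {absolute_gain(baseline_map50, multi_pipeline_map50):+.3f}")
print(f"relative: {relative_gain(baseline_map50, multi_pipeline_map50):+.1%}")
```

The relative figure is the one that rounds to 5.5%; the absolute difference is 0.029 mAP50 points.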
- Source: CADOT Challenge Page
- Download: CADOT_Dataset.zip
- Format: COCO (JSON annotations) — converted to YOLO format via our script
- Images: 4,628 high-resolution (500×500 px) aerial images from the Paris region
- Annotations: 106,691 bounding boxes across 14 categories
- Split: Train / Validation / Test (test labels withheld by organizers)
14 Object Categories: basketball field, building, crosswalk, football field, graveyard, large vehicle, medium vehicle, playground, roundabout, ship, small vehicle, swimming pool, tennis court, train
git clone https://github.com/romainsebire/CADOTProject.git
cd CADOTProject
pip install -r requirements.txt
- Download the dataset from the link above and extract CADOT_Dataset/ into the project root.
- Convert COCO annotations to YOLO format:
python convert_coco_to_yolo.py
This creates a Dataset_YOLO/ folder with the proper directory structure (train/images, train/labels, val/images, val/labels).
- Update cadot.yaml with the absolute path to your Dataset_YOLO/ folder.
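The core of a COCO to YOLO conversion is a change of box representation: COCO stores `[x_min, y_min, width, height]` in pixels, while YOLO label files store the box center and size normalized by the image dimensions. The sketch below illustrates that arithmetic only; it is not the code of `convert_coco_to_yolo.py`, and the function names are hypothetical. It also assumes the class index has already been mapped to a 0-based YOLO index.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert a COCO box [x_min, y_min, width, height] in pixels to
    YOLO (x_center, y_center, width, height) normalized to [0, 1]."""
    x_min, y_min, w, h = bbox
    return (
        (x_min + w / 2) / img_w,
        (y_min + h / 2) / img_h,
        w / img_w,
        h / img_h,
    )


def yolo_label_line(class_index, bbox, img_w=500, img_h=500):
    """Format one line of a YOLO .txt label file (500x500 is the CADOT image size)."""
    xc, yc, w, h = coco_to_yolo(bbox, img_w, img_h)
    return f"{class_index} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


print(yolo_label_line(0, [100, 200, 50, 100]))
# prints: 0 0.250000 0.500000 0.100000 0.200000
```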
The classic augmentation applies the same transformations (rotation, flip, blur, brightness, CLAHE) to every image containing a rare class and multiplies those images by a fixed factor (default: ×5).
python augmentation_classic.py
Edit RARE_CLASSES and AUGMENT_FACTOR in the script to adjust.
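The selection logic behind this step can be sketched as follows. This is an illustration under assumptions, not the contents of `augmentation_classic.py`: the class indices in `RARE_CLASSES` are hypothetical, the function `plan_augmentation` is invented for the example, and the factor is interpreted as the number of augmented copies generated per selected image.

```python
RARE_CLASSES = {0, 3}  # hypothetical class indices, for illustration only
AUGMENT_FACTOR = 5     # default factor stated in this README


def plan_augmentation(labels_by_image, rare_classes=RARE_CLASSES, factor=AUGMENT_FACTOR):
    """Map each image containing at least one rare class to the number
    of augmented copies to generate for it."""
    return {
        image: factor
        for image, class_ids in labels_by_image.items()
        if rare_classes.intersection(class_ids)
    }


# Toy label index: image name -> class indices found in its label file.
labels = {
    "a.png": [10, 10, 1],  # only common classes, skipped
    "b.png": [0, 10],      # contains rare class 0
    "c.png": [3],          # contains rare class 3
}

plan = plan_augmentation(labels)
print(plan)
```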
The multi-pipeline augmentation applies class-specific pipelines chosen according to the physical properties of each object type:
| Pipeline | Target Classes | Strategy |
|---|---|---|
| Sport | Basketball, Football, Tennis | Preserves court lines and colors — only geometric transforms + brightness |
| Texture | Swimming Pool, Graveyard | Enhances fine details (waves, stones) — sharpening + CLAHE, no blur |
| Shape | Roundabout, Playground, Train, Ship | Allows geometric distortion and mild blur for shape-flexible objects |
Each class has an independent augmentation factor proportional to its rarity (e.g., basketball ×35, tennis ×5).
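The routing described in the table can be sketched as a class-to-pipeline mapping plus one Albumentations `Compose` per pipeline. This is a hedged illustration, not the code of `augmentation_multi_pipeline.py`: the specific transforms, probabilities, and the helper names `pipelines_for_image` and `build_pipeline` are assumptions, and only the two factors quoted above are listed.

```python
# Class -> pipeline routing, taken from the table above.
PIPELINE_FOR_CLASS = {
    "basketball field": "sport",
    "football field": "sport",
    "tennis court": "sport",
    "swimming pool": "texture",
    "graveyard": "texture",
    "roundabout": "shape",
    "playground": "shape",
    "train": "shape",
    "ship": "shape",
}

# Only these two factors are stated in this README; the others are not guessed here.
AUGMENT_FACTORS = {"basketball field": 35, "tennis court": 5}


def pipelines_for_image(class_names):
    """Return the set of pipelines triggered by the rare classes in one image."""
    return {PIPELINE_FOR_CLASS[c] for c in class_names if c in PIPELINE_FOR_CLASS}


def build_pipeline(name):
    """Build one Albumentations pipeline (transform choices are assumptions)."""
    import albumentations as A  # imported lazily so the routing works without it

    # Flips and 90-degree rotations keep lines and textures intact.
    base = [A.HorizontalFlip(p=0.5), A.VerticalFlip(p=0.5), A.RandomRotate90(p=0.5)]
    if name == "sport":
        extra = [A.RandomBrightnessContrast(p=0.5)]  # no blur, no distortion
    elif name == "texture":
        extra = [A.Sharpen(p=0.5), A.CLAHE(p=0.5)]   # enhance fine detail
    elif name == "shape":
        extra = [A.Affine(shear=(-10, 10), p=0.5), A.Blur(blur_limit=3, p=0.3)]
    else:
        raise ValueError(f"unknown pipeline: {name}")
    return A.Compose(
        base + extra,
        bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
    )


print(pipelines_for_image(["building", "train", "ship"]))
```

A pipeline built this way would be called as `build_pipeline("sport")(image=img, bboxes=boxes, class_labels=labels)`, which returns the transformed image together with the adjusted boxes.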
python augmentation_multi_pipeline.py
This pipeline generates synthetic basketball courts in empty image regions using Generative AI. We used Google Gemini's image generator (Imagen 3 / Nano Banana) to produce high-quality synthetic aerial views.
Why only basketball fields? Our training infrastructure (Docker with NVIDIA T4) could only run Stable Diffusion 1.5 locally, which produced low-quality aerial imagery. Google Gemini's Imagen 3 delivered far superior results, but the semi-automated workflow (prepare masks locally → generate via Gemini → download → resize → verify) limited us to a single class demonstration.
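Preparing masks for empty regions amounts to finding a window that overlaps no existing annotation and marking it in a binary mask. The sketch below shows one simple way to do this with a grid scan. It is not the code of `prepare_inpainting.py`: the function names, the scan step, and the mask convention (255 marks the area to generate) are assumptions.

```python
def overlaps(a, b):
    """True if two (x_min, y_min, x_max, y_max) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_empty_window(boxes, win_w, win_h, img_w=500, img_h=500, step=20):
    """Scan the image row by row and return the first window that
    overlaps no annotated box, or None if there is none."""
    for y in range(0, img_h - win_h + 1, step):
        for x in range(0, img_w - win_w + 1, step):
            window = (x, y, x + win_w, y + win_h)
            if not any(overlaps(window, b) for b in boxes):
                return window
    return None


def window_to_mask(window, img_w=500, img_h=500):
    """Binary mask as nested lists: 255 inside the window, 0 elsewhere."""
    x0, y0, x1, y1 = window
    return [
        [255 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(img_w)]
        for y in range(img_h)
    ]


# One existing annotation covering the top-left of a 500x500 image.
existing = [(0, 0, 300, 300)]
window = find_empty_window(existing, win_w=100, win_h=100)
print(window)  # prints: (300, 0, 400, 100)
```

The returned window coordinates double as the bounding box of the generated object, which is what makes automatic labeling possible when the results are merged.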
# 1. Prepare masks and coordinates for empty regions
python prepare_inpainting.py
# 2. Upload INPAINTING_STAGING/images/ and masks/ to your generation service
# Prompt: "Satellite view of a basketball court, distinct white lines,
# asphalt surface, top-down orthographic view, high resolution"
# 3. Place generated images in GENERATED_RESULTS/
# Ensure filenames match the originals from step 1
# 4. Resize generated images to 500x500
python resize_image.py
# 5. Merge into dataset with automatic labeling
python merge_results.py
# 6. Visual verification of bounding box alignment
python visualize_bbox.py

python train.py
Key parameters (edit in script):
- model: yolo11n.pt (Nano, 2.6M params) or yolo11m.pt (Medium, 20.1M params)
- epochs: 300 (with early stopping, patience=20)
- batch: 16
- imgsz: 512 (the 500×500 source size rounded up to the nearest multiple of 32)
- name: output folder name under runs/
- device: 0 for CUDA GPU, mps for Apple Silicon
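With the Ultralytics Python API, the parameters listed above map directly onto keyword arguments of `YOLO.train`. The following is a minimal sketch of what such a training script might look like, not the actual `train.py`; the `TRAIN_ARGS` dictionary and the `train` wrapper are hypothetical, and the run name is taken from the `runs/` folder listed in this README.

```python
# Training arguments mirroring the parameter list above.
TRAIN_ARGS = dict(
    data="cadot.yaml",     # dataset configuration file
    epochs=300,            # upper bound; early stopping usually ends sooner
    patience=20,           # stop after 20 epochs without improvement
    batch=16,
    imgsz=512,             # multiple of 32, just above the 500x500 source size
    name="finetune_v11m",  # results are written under runs/
    device=0,              # use "mps" on Apple Silicon
)


def train(weights="yolo11m.pt", **overrides):
    """Fine-tune a pretrained YOLOv11 checkpoint on the CADOT dataset."""
    from ultralytics import YOLO  # imported lazily; requires the ultralytics package

    model = YOLO(weights)
    return model.train(**{**TRAIN_ARGS, **overrides})
```

Under these assumptions, `train(weights="yolo11n.pt", name="finetune_v11n")` would launch the Nano run with the same settings.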
Our models were trained on a Docker container with an NVIDIA T4 GPU (16 GB VRAM).
Training results (metrics, curves, confusion matrices) are automatically saved under runs/<name>/. Key files:
- results.csv: per-epoch metrics
- results.png: training curves
- confusion_matrix.png: class-level predictions
- BoxF1_curve.png, BoxPR_curve.png: F1 and Precision-Recall curves
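The per-epoch metrics in `results.csv` can be inspected with the standard library. The sketch below finds the epoch with the highest mAP50. It assumes the column names `epoch` and `metrics/mAP50(B)` used by Ultralytics detection runs, strips the whitespace padding that some versions add to headers, and runs on inline sample values that are illustrative only.

```python
import csv
import io


def best_epoch(csv_text, metric="metrics/mAP50(B)"):
    """Return (epoch, value) for the row with the highest value of `metric`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Some Ultralytics versions pad headers and values with spaces.
    rows = [{k.strip(): v.strip() for k, v in row.items()} for row in reader]
    best = max(rows, key=lambda r: float(r[metric]))
    return int(float(best["epoch"])), float(best[metric])


# Illustrative sample, not real training output.
sample = (
    "   epoch,   metrics/mAP50(B)\n"
    "       1,   0.40\n"
    "       2,   0.55\n"
    "       3,   0.52\n"
)

print(best_epoch(sample))  # prints: (2, 0.55)
```

For a real run, the text would come from `open("runs/<name>/results.csv").read()`.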
CADOTProject/
├── train.py # YOLOv11 training script
├── convert_coco_to_yolo.py # COCO → YOLO format conversion
├── augmentation_classic.py # Uniform augmentation for rare classes
├── augmentation_multi_pipeline.py # Class-specific augmentation pipelines
├── prepare_inpainting.py # Mask generation for synthetic data
├── resize_image.py # Resize generated images to 500x500
├── merge_results.py # Merge synthetic images into dataset
├── visualize_bbox.py # Visual debug of bounding boxes
├── count_objects.py # Dataset class distribution analysis
├── cadot.yaml # YOLO dataset configuration
├── requirements.txt # Python dependencies
├── docs/
│ ├── cadot_challenge_baseline_performance.png
│ └── references/ # YOLO research papers
├── samples/ # Sample images (10 per folder)
│ ├── yolo/ # YOLO-format dataset samples
│ │ ├── train/images/ & labels/
│ │ └── val/images/ & labels/
│ └── augmented/ # Augmentation output samples
│ ├── images/ & labels/
│ └── visual_debug/ # Bounding box verification
├── runs/ # Training results (metrics, curves, plots)
│ ├── finetune_v11n/
│ ├── finetune_v11m/
│ ├── finetune_v11m_albumentations_classique/
│ ├── finetune_v11m_albumentations_multipipelines/ ← Best model (weights included)
│ └── finetune_v11m_iagenbasketball/
└── INPAINTING_STAGING/ # Generated masks and coordinates (15 examples)
Note: The full dataset is not included in this repository. Download it from the CADOT Challenge page. The samples/ folder contains 10 representative images per split for reference.
The multi-pipeline approach yielded the best overall improvement (+5.5% relative mAP50). In our experiments, class-aware augmentation outperformed both no augmentation and uniform augmentation. The key insight is that different object types have different invariance properties:
- Sports courts must preserve their line markings and colors (rotation is fine, blur is not)
- Textured objects like pools and graveyards benefit from contrast enhancement but not geometric distortion
- Shape-flexible objects like roundabouts tolerate elastic transforms that would destroy court lines
The inpainting experiment specifically targeted the basketball field class, which has 0 mAP50 on both val and test in the Faster R-CNN baseline and only 52% with YOLOv11. While the overall mAP50 improvement was negligible (0.529 to 0.530, about +0.2% relative), recall increased from 0.526 to 0.562 (+6.8% relative), suggesting the model became better at finding objects it previously missed. A larger-scale generation effort covering all rare classes might yield stronger gains, but was out of reach given our infrastructure constraints (see the note above on why only basketball fields were generated).
YOLOv11n (Nano) trained for 206 epochs before early stopping and reached mAP50 = 0.526, comparable to the YOLOv11m baseline (0.529) with roughly eight times fewer parameters (2.6M versus 20.1M). The Medium model benefited more from augmentation, suggesting that larger models can better leverage additional training data.
To keep the repository size manageable, model weights (.pt) are only included for the best performing model:
runs/finetune_v11m_albumentations_multipipelines/weights/best.pt
Use this file for inference or further fine-tuning.
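Loading the weights for inference with the Ultralytics API can look like the sketch below. The `detect` helper and its defaults (confidence threshold 0.25, image size 512) are assumptions chosen for illustration; only the weights path comes from this repository.

```python
BEST_WEIGHTS = "runs/finetune_v11m_albumentations_multipipelines/weights/best.pt"


def detect(image_path, weights=BEST_WEIGHTS, conf=0.25):
    """Run the fine-tuned model on one image and return a list of
    (class_name, confidence, [x_min, y_min, x_max, y_max]) tuples."""
    from ultralytics import YOLO  # imported lazily; requires the ultralytics package

    model = YOLO(weights)
    results = model.predict(source=image_path, imgsz=512, conf=conf)
    detections = []
    for result in results:
        for box in result.boxes:
            class_id = int(box.cls[0])
            detections.append(
                (model.names[class_id], float(box.conf[0]), box.xyxy[0].tolist())
            )
    return detections
```

For further fine-tuning, the same path can be passed to `YOLO(...)` in place of a pretrained checkpoint such as `yolo11m.pt`.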
- CADOT Challenge (IEEE ICIP 2025)
- Ultralytics YOLOv11 Documentation
- Albumentations Library
- Research papers in docs/references/
This project was developed for academic purposes as part of the CADOT Challenge. The dataset is provided by LabCom IRISER (ANR-21-LCV3-0004).