This repository contains an implementation framework for an image captioning system using a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The system takes images as input and generates natural language descriptions of their content.
This implementation follows the encoder-decoder architecture:
- An encoder (CNN) extracts visual features from input images
- A decoder (RNN/LSTM/GRU) generates captions word-by-word based on these features
image_captioning_assignment/
├── data/
│ └── download_flickr.py # Script to download and prepare Flickr8k dataset
├── models/
│ ├── encoder.py # CNN encoder implementations
│ ├── decoder.py # RNN decoder implementations
│ └── caption_model.py # Combined encoder-decoder model
├── utils/
│ ├── dataset.py # Dataset and data loader utilities
│ ├── vocabulary.py # Vocabulary building and text processing
│ ├── trainer.py # Training loop and optimization
│ └── metrics.py # Evaluation metrics (BLEU, etc.)
├── notebooks/
│ ├── 1_Data_Exploration.ipynb # Dataset exploration
│ ├── 2_Feature_Extraction.ipynb # CNN feature extraction
│ ├── 3_Model_Training.ipynb # Model training
│ └── 4_Evaluation_Visualization.ipynb # Results analysis
├── requirements.txt # Project dependencies
└── README.md # Project documentation
- Data Processing (download_flickr.py):
  - Processes captions from the Flickr8k dataset
  - Creates train/val/test splits
- Encoder (encoder.py):
  - Initializes CNN backbones (ResNet, MobileNet)
  - Creates projection layers for feature vectors
- Decoder (decoder.py):
  - Implements the RNN/LSTM/GRU decoder
  - Creates word embedding layers
  - Implements the caption generation logic with teacher forcing
  - Implements greedy decoding for inference
- Caption Model (caption_model.py):
  - Integrates encoder and decoder
  - Implements the forward pass
  - Implements caption generation
- Data Utilities:
  - Builds vocabulary and tokenization functions (vocabulary.py)
  - Creates dataset loaders and transformations (dataset.py)
  - Implements evaluation metrics (metrics.py)
  - Creates training and validation loops (trainer.py)
- Clone this repository:
git clone https://github.com/AmirAAZ818/image-captioning.git
cd image-captioning
- Create a virtual environment and install dependencies:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
- Download the Flickr8k dataset:
python data/download_flickr.py --data_dir ./data
This project uses the Flickr8k dataset, which contains:
- Approximately 8,000 images
- 5 different captions for each image (roughly 40,000 captions total)
- A diverse range of scenes, objects, and actions
The download_flickr.py script handles:
- Downloading the images and captions
- Preprocessing captions (cleaning, normalization)
- Creating train/validation/test splits
- Organizing files in the expected directory structure
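The caption preprocessing and splitting steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in download_flickr.py: the function names `clean_caption` and `split_image_ids`, the cleaning rules, and the 80/10/10 ratio are all assumptions. The split is done per image rather than per caption so that all captions of one image stay in the same subset.

```python
import random
import re


def clean_caption(text):
    """Lowercase, replace non-letter characters with spaces, collapse whitespace.

    Hypothetical cleaning rule; the actual script may normalize differently.
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())


def split_image_ids(image_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle image ids deterministically and cut them into three disjoint lists."""
    ids = sorted(image_ids)               # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test


if __name__ == "__main__":
    print(clean_caption("A dog runs, FAST!  "))   # a dog runs fast
    train, val, test = split_image_ids([f"img_{i}.jpg" for i in range(100)])
    print(len(train), len(val), len(test))        # 80 10 10
```

Fixing the seed keeps the split reproducible across runs, which matters when evaluation results from different notebooks are compared.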
The project is organized as a sequence of notebooks, each focused on a different stage of the pipeline:
jupyter notebook notebooks/1_Data_Exploration.ipynb
This notebook:
- Explores the image and caption distributions
- Analyzes caption lengths and vocabulary
- Visualizes sample images with their captions
- Builds and saves the vocabulary
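As an illustration of the vocabulary-building step, here is a minimal sketch of a frequency-thresholded vocabulary. The class layout, the special token strings, and the `min_freq` parameter are assumptions made for this example and may not match utils/vocabulary.py.

```python
from collections import Counter

# Special tokens (names are assumptions for this sketch)
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


class Vocabulary:
    """Maps words to integer ids, keeping only words seen at least min_freq times."""

    def __init__(self, captions, min_freq=2):
        counts = Counter(word for c in captions for word in c.lower().split())
        self.itos = [PAD, START, END, UNK]
        self.itos += sorted(w for w, n in counts.items() if n >= min_freq)
        self.stoi = {w: i for i, w in enumerate(self.itos)}

    def __len__(self):
        return len(self.itos)

    def encode(self, caption):
        """Convert a caption to ids, wrapped in start/end tokens; rare words map to <unk>."""
        unk = self.stoi[UNK]
        ids = [self.stoi.get(w, unk) for w in caption.lower().split()]
        return [self.stoi[START]] + ids + [self.stoi[END]]

    def decode(self, ids):
        """Convert ids back to text, dropping pad/start/end tokens."""
        special = {PAD, START, END}
        return " ".join(self.itos[i] for i in ids if self.itos[i] not in special)


if __name__ == "__main__":
    vocab = Vocabulary(["a dog runs", "a dog sleeps", "a cat runs"])
    ids = vocab.encode("a dog flies")
    print(ids)                 # [1, 4, 5, 3, 2]
    print(vocab.decode(ids))   # a dog <unk>
```

Dropping rare words shrinks the output layer of the decoder, at the cost of mapping those words to `<unk>`.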
jupyter notebook notebooks/2_Feature_Extraction.ipynb
This notebook:
- Implements feature extraction using different CNN backbones (ResNet18, ResNet50, MobileNetV2)
- Compares models in terms of feature dimensions, extraction speed, and memory requirements
- Analyzes feature distributions and properties
- Saves extracted features to disk for efficient training
jupyter notebook notebooks/3_Model_Training.ipynb
This notebook:
- Implements the encoder-decoder architecture
- Sets up the training pipeline with teacher forcing
- Trains the model with appropriate hyperparameters
- Monitors training progress and validation performance
- Saves model checkpoints for later evaluation
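Teacher forcing means the decoder receives the ground-truth previous token at each step instead of its own prediction. In terms of data preparation this amounts to shifting each encoded caption by one position, as the sketch below shows. The helper names and the id convention (0 = pad, 1 = start, 2 = end) are assumptions for illustration only.

```python
def teacher_forcing_pair(token_ids):
    """Split one encoded caption into decoder inputs and prediction targets.

    At step t the decoder sees inputs[t] and is trained to predict targets[t],
    which is simply the next ground-truth token.
    """
    return token_ids[:-1], token_ids[1:]


def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length sequences to the longest one in the batch."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


if __name__ == "__main__":
    # Hypothetical encoded captions: <start> ... <end>
    captions = [[1, 4, 5, 6, 2], [1, 7, 2]]
    pairs = [teacher_forcing_pair(c) for c in captions]
    inputs = pad_batch([p[0] for p in pairs])
    targets = pad_batch([p[1] for p in pairs])
    print(inputs)    # [[1, 4, 5, 6], [1, 7, 0, 0]]
    print(targets)   # [[4, 5, 6, 2], [7, 2, 0, 0]]
```

In PyTorch, padded target positions are usually excluded from the loss, for example by passing the pad id as `ignore_index` to `torch.nn.CrossEntropyLoss`.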
jupyter notebook notebooks/4_Evaluation_Visualization.ipynb
This notebook:
- Generates captions for test images
- Calculates BLEU scores and other metrics
- Analyzes model performance across different image types
- Provides an interactive demo for generating captions on new images
- Compares different decoding strategies (greedy vs. beam search)
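To make the BLEU metric concrete, the following is a small, unsmoothed sentence-level BLEU written from the standard definition (clipped n-gram precision combined with a brevity penalty). It is a sketch for understanding only; the repository's metrics.py may instead rely on a library such as nltk, and the function names here are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform n-gram weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # For each n-gram, the largest count found in any single reference
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty uses the reference length closest to the candidate length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    refs = ["a dog runs on the grass".split(), "a dog is running outside".split()]
    print(bleu("a dog runs on the grass".split(), refs))   # 1.0
    print(bleu("two cats sleep indoors".split(), refs))     # 0.0
```

Because short captions often have no matching 4-grams, practical evaluations usually report BLEU-1 through BLEU-4 separately or apply smoothing.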
The encoder module (models/encoder.py) provides several CNN options:
- ResNet18: A lightweight model with 512-dimensional features
- ResNet50: A deeper model with 2048-dimensional features
- MobileNetV2: An efficient model with 1280-dimensional features
Each model is pre-trained on ImageNet and modified to output feature vectors.
The decoder module (models/decoder.py) implements:
- LSTM and GRU variants
- Word embedding layer for caption tokens
- Linear projection to vocabulary size
- Optional beam search decoding for improved caption quality
The combined model (models/caption_model.py) connects the encoder and decoder:
- Uses the encoder to extract image features
- Feeds these features to the decoder as initial state
- Implements caption generation using teacher forcing during training
- Provides both greedy and beam search decoding during inference
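The difference between greedy and beam search decoding can be seen on a toy example. The sketch below replaces the trained decoder with a hand-written table of next-token probabilities (`TOY_MODEL`, entirely hypothetical) so that it runs without any model; the function names are likewise assumptions and the scores are plain summed log-probabilities with no length normalization.

```python
import math

START, END = "<start>", "<end>"

# Hypothetical next-token probabilities, keyed by the last generated token.
TOY_MODEL = {
    "<start>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<end>": 1.0},
    "cat": {"<end>": 1.0},
}


def next_log_probs(prefix):
    """Stand-in for one decoder step: log-probabilities of the next token."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix[-1]].items()}


def greedy_decode(max_len=10):
    """Pick the single most likely token at every step."""
    seq = [START]
    while seq[-1] != END and len(seq) < max_len:
        scores = next_log_probs(seq)
        seq.append(max(scores, key=scores.get))
    return seq


def beam_search(beam_width=2, max_len=10):
    """Keep the beam_width best partial captions at every step."""
    beams = [(0.0, [START])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            (finished if seq[-1] == END else beams).append((score, seq))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])[1]


if __name__ == "__main__":
    print(greedy_decode())             # ['<start>', 'a', 'dog', '<end>']
    print(beam_search(beam_width=2))   # ['<start>', 'the', 'dog', '<end>']
```

Greedy decoding commits to "a" because it is the most likely first word, yielding a caption with probability 0.33, while a beam of width 2 also keeps "the" alive and finds the caption with probability 0.36.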
- Attention Mechanism: Implementing visual attention to focus on relevant image regions
- Transformer Architecture: Replacing the RNN decoder with a Transformer
- Larger Datasets: Using MS COCO or Flickr30k for more training data
- Different Metrics: Implementing CIDEr or METEOR for evaluation
- Fine-tuning: Updating the CNN encoder's weights during training alongside the RNN decoder
Main dependencies include:
- PyTorch (1.7.0+)
- torchvision
- numpy
- matplotlib
- nltk
- h5py
- tqdm
See requirements.txt for the complete list.
- The implementation is inspired by the "Show and Tell" paper by Vinyals et al.
- Pre-trained models are provided by torchvision
- Flickr8k dataset from the University of Illinois
This repository was initially provided as a framework by the TA of the Deep Learning course at University of Kerman, where students were tasked with completing the implementation by addressing TODO sections across the modules and notebooks. My contribution was developing the core functionality, including the data processing, model training, and evaluation pipelines, which turned the initial structure into a fully functional image captioning system. This work represents an effort in applying the "Show and Tell" approach.