A deep learning project that automatically generates descriptive captions for images using a CNN-RNN architecture. This project uses the COCO dataset to train a model that combines computer vision and natural language processing.
This project implements an end-to-end image captioning system using:
- Encoder (CNN): ResNet-18 for extracting visual features from images
- Decoder (RNN): LSTM network for generating captions based on visual features
The model learns to predict the next word in a caption given previous words and the image features, enabling automatic generation of natural language descriptions for any image.
```
image-captioning/
├── src/
│   ├── model.py           # CNN-RNN model architecture
│   ├── data_loader.py     # COCO dataset loading and preprocessing
│   └── vocabulary.py      # Vocabulary management for captions
├── preliminaries.ipynb    # Data loading and exploration
├── training.ipynb         # Model training pipeline
├── inference.ipynb        # Caption generation on new images
└── README.md              # This file
```
### EncoderCNN
- Uses pre-trained ResNet-18 to extract visual features from images
- Final fully connected layer projects features to embedding space
- Frozen convolutional layers preserve learned visual representations
### DecoderRNN
- LSTM-based decoder that generates captions word-by-word
- Embeds vocabulary words and processes them sequentially
- Projects LSTM hidden states to vocabulary predictions
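A sketch of such a decoder, assuming the image feature is fed in as the first LSTM input (the demo shapes are arbitrary):

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    """Sketch: embed caption tokens, prepend the image feature, run an LSTM."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the last caption token; the image feature acts as the
        # first input, so output length still matches the caption length.
        emb = self.embed(captions[:, :-1])                        # (B, T-1, E)
        inputs = torch.cat([features.unsqueeze(1), emb], dim=1)   # (B, T, E)
        hiddens, _ = self.lstm(inputs)
        return self.fc(hiddens)                                   # (B, T, V)

# Shape check with random inputs
dec = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=1000)
feats = torch.randn(2, 256)
caps = torch.randint(0, 1000, (2, 12))
print(dec(feats, caps).shape)  # torch.Size([2, 12, 1000])
```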
### CoCoDataset
- Loads images and captions from COCO dataset
- Applies transformations: resizing, cropping, normalization
- Tokenizes captions and converts to vocabulary indices
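The caption-to-indices step can be sketched as follows; the project itself tokenizes with NLTK's `word_tokenize`, so the simple split-based tokenizer and toy `word2idx` table here are stand-ins:

```python
# Toy vocabulary table; the real mapping comes from the Vocabulary class
word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2, "a": 3, "dog": 4, "runs": 5}

def caption_to_ids(caption):
    # Simplified stand-in for NLTK word_tokenize + vocabulary lookup
    tokens = caption.lower().rstrip(".").split()
    ids = [word2idx.get(t, word2idx["<unk>"]) for t in tokens]
    # Wrap with the special start/end tokens
    return [word2idx["<start>"]] + ids + [word2idx["<end>"]]

print(caption_to_ids("A dog runs fast."))  # [0, 3, 4, 5, 2, 1]
```

Note that `fast` falls back to `<unk>` because it is absent from the toy table, mirroring how sub-threshold words are handled.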
### Vocabulary
- Builds vocabulary from training captions with frequency thresholds
- Handles special tokens: `<start>`, `<end>`, `<unk>`
- Supports pickle serialization for efficient reuse
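A minimal sketch of such a vocabulary class (the method names and toy captions are illustrative):

```python
import pickle
from collections import Counter

class Vocabulary:
    """Sketch: frequency-thresholded word-to-index mapping."""
    def __init__(self, captions, threshold=5):
        counts = Counter(w for c in captions for w in c.lower().split())
        # Special tokens always come first
        self.word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2}
        for word, n in counts.items():
            if n >= threshold:                  # drop rare words
                self.word2idx[word] = len(self.word2idx)
        self.idx2word = {i: w for w, i in self.word2idx.items()}

    def __call__(self, word):
        # Unknown words map to <unk>
        return self.word2idx.get(word, self.word2idx["<unk>"])

    def __len__(self):
        return len(self.word2idx)

vocab = Vocabulary(["a dog runs", "a dog sleeps"], threshold=2)
print(vocab("dog"), vocab("zebra"))  # 4 2

# Pickle once, reload cheaply on later runs
blob = pickle.dumps(vocab)
assert pickle.loads(blob)("dog") == vocab("dog")
```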
- Python 3.7+
- PyTorch
- torchvision
- NLTK
- pycocotools
- Jupyter Notebook
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/image-captioning.git
  cd image-captioning
  ```

- Install dependencies:

  ```bash
  pip install torch torchvision nltk pycocotools
  python -m nltk.downloader punkt
  ```

- Download the COCO dataset (skip if already available) by following the instructions at http://cocodataset.org/#download

### preliminaries.ipynb
- Loads and explores the COCO dataset
- Preprocesses images (resize, crop, normalize)
- Builds vocabulary from training captions
- Configurable parameters:
  - `vocab_threshold`: Minimum word frequency (default: 5)
  - `batch_size`: Training batch size (default: 10)
### training.ipynb
- Trains the CNN-RNN model on the COCO dataset
- Monitors training loss and validation metrics
- Saves best model checkpoints
### inference.ipynb
- Generates captions for new images
- Uses greedy decoding or sampling strategies
- Visualizes results with images and predictions
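Greedy decoding can be sketched as below. This assumes the decoder exposes its `embed`, `lstm`, and `fc` submodules; the tiny random modules in the smoke run are stand-ins for trained ones:

```python
import torch
import torch.nn as nn

def greedy_caption(encoder, decoder, image, vocab, max_len=20):
    """Sketch of greedy decoding: feed the image feature in as the first
    input, then repeatedly pick the argmax word and feed it back."""
    with torch.no_grad():
        inputs = encoder(image.unsqueeze(0)).unsqueeze(1)   # (1, 1, E)
        states, words = None, []
        for _ in range(max_len):
            hiddens, states = decoder.lstm(inputs, states)
            idx = decoder.fc(hiddens.squeeze(1)).argmax(dim=1)  # greedy pick
            word = vocab.idx2word[idx.item()]
            if word == "<end>":                             # stop token
                break
            words.append(word)
            inputs = decoder.embed(idx).unsqueeze(1)        # feed back in
    return " ".join(words)

# Smoke run with tiny random stand-ins for the trained modules
class _TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(5, 8)
        self.lstm = nn.LSTM(8, 8, batch_first=True)
        self.fc = nn.Linear(8, 5)

class _TinyVocab:
    idx2word = {0: "<start>", 1: "<end>", 2: "<unk>", 3: "a", 4: "dog"}

torch.manual_seed(0)
caption = greedy_caption(nn.Linear(3, 8), _TinyDecoder(),
                         torch.randn(3), _TinyVocab())
assert isinstance(caption, str)
```

Sampling strategies replace the `argmax` with a draw from the softmax distribution; beam search (listed under future improvements) would instead track the top-k partial captions at each step.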
- Pre-trained CNN: Leverages ImageNet-trained ResNet-18 for robust feature extraction
- Flexible Vocabulary: Customizable word frequency thresholds
- Data Augmentation: Random cropping and horizontal flipping during training
- Batch Processing: Efficient GPU utilization with configurable batch sizes
- Word Embeddings: Learned embeddings map vocabulary to semantic space
- Encoder: Frozen ResNet-18 (transfer learning)
- Decoder: 1-layer LSTM with embedding layer and linear output
- Loss Function: Cross-entropy loss for word prediction
- Optimizer: Adam optimizer for gradient updates
- Input Features: 224×224 normalized images
- Feature Dimension: Configurable embedding size (typically 256-512)
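The training step implied by these choices can be sketched with stand-in tensors (the shapes are hypothetical; in the real loop `outputs` comes from the decoder and the optimizer updates the decoder and encoder-projection parameters):

```python
import torch
import torch.nn as nn

vocab_size = 1000
# Stand-ins for decoder outputs (B=10, T=12 logits) and target captions
outputs = torch.randn(10, 12, vocab_size, requires_grad=True)
captions = torch.randint(0, vocab_size, (10, 12))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam([outputs], lr=1e-3)

# CrossEntropyLoss expects (N, C) logits, so flatten batch and time steps
loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```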
```python
import torch
from torchvision import transforms

from src.data_loader import get_loader
from src.model import EncoderCNN, DecoderRNN

# Standard COCO preprocessing with training-time augmentation
transform_train = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406),   # ImageNet statistics
                         (0.229, 0.224, 0.225)),
])

# Load data
data_loader = get_loader(
    transform=transform_train,
    mode='train',
    batch_size=10,
    vocab_threshold=5,
    vocab_from_file=False,
)

# Initialize models
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512,
                     vocab_size=len(data_loader.dataset.vocab))

# Forward pass on a batch
for images, captions in data_loader:
    features = encoder(images)
    outputs = decoder(features, captions)
```

The trained model learns to generate meaningful captions that:
- Identify objects and people in images
- Describe spatial relationships
- Capture scene context and atmosphere
- Use diverse and natural language
- Attention mechanism for focusing on image regions during caption generation
- Beam search decoding for better quality captions
- Extended training on full COCO dataset
- Fine-tuning encoder with stronger backbones (ResNet-50, EfficientNet)
- Multi-head attention and transformer-based decoder
- COCO Dataset
- Show and Tell: A Neural Image Caption Generator
- Knowing When to Look: Adaptive Attention in Image Captioning
This project was completed as part of the Udacity Computer Vision Nanodegree Program.
Tabish Punjani
Contributions are welcome! Please feel free to submit a Pull Request.