This project implements an image captioning pipeline using a pre-trained EfficientNet-B0 backbone combined with custom Transformer encoder–decoder blocks. It generates natural-language descriptions for images in the Flickr8k dataset.
- EfficientNet-B0 (frozen) for robust image feature extraction
- Custom `TransformerEncoderBlock` with LayerNorm → Dense(ReLU) → MultiHeadAttention → residual connections
- Custom `TransformerDecoderBlock` with positional + token embeddings, causal self-attention, cross-attention, and feed-forward residuals
- `ImageCaptioningModel` orchestrating multi-caption training and evaluation
- Text standardization and vectorization via Keras `TextVectorization`
- Data augmentation (random flip, rotation, contrast)
- BLEU score evaluation with smoothing
- Early stopping and warmup learning-rate scheduler
- Plotting utilities for dataset distribution and training metrics
- Python 3.8 or higher
- TensorFlow 2.x (includes Keras)
- NumPy
- Matplotlib
- NLTK (for BLEU score)
- wget, unzip (or similar download tools)

Install the required Python packages with:

```bash
pip install tensorflow numpy matplotlib nltk
```

The Flickr8k dataset consists of images and captions, arranged as follows:

```
├── Flicker8k_Dataset/     # Flickr8k images
└── Flickr8k.token.txt     # Captions file
```
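
Each line of `Flickr8k.token.txt` pairs an image name and a caption index with the caption text. As a rough sketch of how it might be loaded (the helper name, paths, and start/end tokens here are illustrative, not taken from the notebook):

```python
import os

# Hypothetical paths; adjust to wherever the dataset was extracted.
IMAGES_DIR = "Flicker8k_Dataset"
CAPTIONS_FILE = "Flickr8k.token.txt"

def load_captions(captions_file, images_dir):
    """Map each image path to its list of reference captions.

    Each line looks like:
    1000268201_693b08cb0e.jpg#0<TAB>A child in a pink dress is climbing ...
    """
    caption_map = {}
    with open(captions_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            img_tag, caption = line.split("\t")
            img_name = img_tag.split("#")[0]  # drop the "#0".."#4" suffix
            img_path = os.path.join(images_dir, img_name)
            # Wrap captions in start/end tokens for the decoder.
            caption_map.setdefault(img_path, []).append(f"<start> {caption} <end>")
    return caption_map

caption_map = load_captions(CAPTIONS_FILE, IMAGES_DIR)
```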
- `EfficientNetB0` (frozen) extracts spatial feature maps from input images.
- `TransformerEncoderBlock` (a minimal sketch follows this list):
  - LayerNormalization → Dense(ReLU)
  - MultiHeadAttention
  - Residual connection
- `TransformerDecoderBlock`:
  - Token + positional embeddings
  - Causal self-attention → cross-attention → feed-forward → residuals
- `ImageCaptioningModel`:
  - Integrates CNN features, encoder, and decoder
  - Implements custom training (`train_step`) and testing (`test_step`) loops to handle multiple captions per image
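
A minimal Keras sketch of the encoder block, following the layer ordering listed above (the head count, dimensions, and the placement of the second normalization are illustrative assumptions, not copied from the notebook):

```python
import tensorflow as tf
from tensorflow.keras import layers

class TransformerEncoderBlock(layers.Layer):
    """Minimal sketch of the encoder block described above (sizes are illustrative)."""

    def __init__(self, embed_dim=512, num_heads=2, **kwargs):
        super().__init__(**kwargs)
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.dense_proj = layers.Dense(embed_dim, activation="relu")
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)

    def call(self, inputs, training=False):
        # LayerNorm -> Dense(ReLU) projection of the flattened CNN feature map.
        x = self.layernorm_1(inputs)
        x = self.dense_proj(x)
        # Self-attention over the spatial positions, then a residual connection.
        attn_out = self.attention(query=x, value=x, key=x, training=training)
        return self.layernorm_2(x + attn_out)
```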
This model leverages two key attention types within the Transformer blocks (a short usage sketch follows this list):
- Self-attention:
  - Allows each position in the input sequence (tokens or spatial features) to attend to all other positions.
  - Computes attention weights by projecting inputs into queries, keys, and values, then applying scaled dot-product attention across multiple heads.
  - Heads learn diverse representation subspaces, enhancing model capacity.
- Cross-attention:
  - Enables the decoder to focus on relevant parts of the image representation when generating each output token.
  - The decoder's queries come from the previous layer's output, while keys and values come from the encoder's output.
  - Guides token generation using visual context.
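
As a brief standalone illustration (the shapes and head counts here are arbitrary, not the notebook's), both attention types can be expressed directly with Keras `MultiHeadAttention`:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative shapes: batch of 2, 25 caption tokens, 64 image patches, 512-dim embeddings.
token_embeddings = tf.random.normal((2, 25, 512))   # decoder input
image_features = tf.random.normal((2, 64, 512))     # encoder output

self_attention = layers.MultiHeadAttention(num_heads=8, key_dim=64)
cross_attention = layers.MultiHeadAttention(num_heads=8, key_dim=64)

# Causal self-attention: each token may only attend to itself and earlier tokens
# (use_causal_mask requires TF >= 2.10).
x = self_attention(
    query=token_embeddings,
    value=token_embeddings,
    key=token_embeddings,
    use_causal_mask=True,
)

# Cross-attention: queries come from the decoder, keys/values from the image encoder.
y = cross_attention(query=x, value=image_features, key=image_features)
print(y.shape)  # (2, 25, 512)
```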
The Jupyter notebook (`notebooks/train.ipynb`) demonstrates the end-to-end process:
- Set `KERAS_BACKEND` to `tensorflow`
- Download and extract the Flickr8k dataset
- Load and preprocess captions and images
- Split data into train/val/test sets (70%/15%/15%)
- Build `tf.data.Dataset` pipelines with augmentation and batching (a pipeline sketch follows this list)
- Define the CNN, Transformer blocks, and the captioning model
- Compile with a warmup learning-rate schedule and early stopping
- Train the model and plot loss/accuracy per epoch
- Evaluate performance with BLEU scores
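
A rough sketch of how such a pipeline might be assembled (the function names, augmentation strengths, and the pairing of images with caption lists are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

IMAGE_SIZE = (299, 299)
BATCH_SIZE = 64
VOCAB_SIZE = 10000
SEQ_LENGTH = 25

# Maps caption strings to fixed-length integer sequences; call
# vectorization.adapt(all_caption_strings) before building the datasets.
vectorization = layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=SEQ_LENGTH,
)

# Augmentation as listed above: random flip, rotation, and contrast
# (typically applied to image batches inside the model's train_step).
image_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.2),
        layers.RandomContrast(0.3),
    ]
)

def decode_and_resize(img_path):
    # Read, decode, and resize one image to the model's input size.
    img = tf.io.read_file(img_path)
    img = tf.image.decode_jpeg(img, channels=3)
    return tf.image.resize(img, IMAGE_SIZE)

def process_input(img_path, captions):
    # `captions` is the list of reference captions paired with this image.
    return decode_and_resize(img_path), vectorization(captions)

def make_dataset(image_paths, caption_lists):
    ds = tf.data.Dataset.from_tensor_slices((list(image_paths), list(caption_lists)))
    ds = ds.shuffle(len(image_paths))
    ds = ds.map(process_input, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
```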

Run the training notebook `PixelPhrase.ipynb`.

Key hyperparameters:
- Image size: 299 × 299
- Vocabulary size: 10,000 tokens
- Sequence length: 25 tokens
- Embedding dimension: 512
- Feed-forward dimension: 512
- Batch size: 64
- Epochs: 30
- Warmup steps: total_training_steps / 15
- Optimizer: Adam with a custom `LearningRateSchedule` (a warmup sketch follows this list)
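
A sketch of what the custom schedule might look like, assuming a linear warmup followed by a constant rate (the post-warmup learning rate and step counts below are illustrative, not the notebook's values):

```python
import tensorflow as tf
from tensorflow import keras

class WarmUpSchedule(keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to `post_warmup_lr`, then held constant (illustrative)."""

    def __init__(self, post_warmup_lr, warmup_steps):
        super().__init__()
        self.post_warmup_lr = post_warmup_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_steps = tf.cast(self.warmup_steps, tf.float32)
        warmup_lr = self.post_warmup_lr * (step / warmup_steps)
        return tf.cond(
            step < warmup_steps,
            lambda: warmup_lr,
            lambda: self.post_warmup_lr,
        )

# Example: warmup over total_training_steps / 15, as listed above.
total_training_steps = 30 * (6000 // 64)   # epochs * steps_per_epoch (illustrative)
lr_schedule = WarmUpSchedule(post_warmup_lr=1e-4, warmup_steps=total_training_steps // 15)
optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)
```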
Callbacks:
- `EarlyStopping(patience=3, monitor='val_loss')`
- Custom `PlotCallback` for live metric visualization (a minimal sketch follows this list)
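
The `EarlyStopping` configuration above is standard Keras; the `PlotCallback` below is a hypothetical stand-in for the project's live-plotting callback, not its actual implementation:

```python
import matplotlib.pyplot as plt
from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)

class PlotCallback(keras.callbacks.Callback):
    """Hypothetical sketch: collects metrics per epoch and redraws a loss curve."""

    def on_train_begin(self, logs=None):
        self.history = {"loss": [], "val_loss": []}

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        for key in self.history:
            if key in logs:
                self.history[key].append(logs[key])
        plt.clf()
        for key, values in self.history.items():
            plt.plot(values, label=key)
        plt.xlabel("epoch")
        plt.ylabel("loss")
        plt.legend()
        plt.pause(0.01)  # refresh the figure without blocking training

# model.fit(train_ds, validation_data=val_ds, epochs=30,
#           callbacks=[early_stopping, PlotCallback()])
```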
- Compute BLEU scores (with smoothing) on validation and test splits, as sketched after this list
- Generate stochastic captions and average results across multiple runs
- Plot line charts and histograms of BLEU score distributions
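
A small sketch of smoothed sentence-level BLEU with NLTK (the helper and the dummy captions are illustrative; NLTK's default weights give BLEU-4):

```python
import numpy as np
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

smoothing = SmoothingFunction().method1

def bleu_for_image(reference_captions, generated_caption):
    # NLTK expects tokenized references and a tokenized hypothesis.
    references = [caption.lower().split() for caption in reference_captions]
    hypothesis = generated_caption.lower().split()
    return sentence_bleu(references, hypothesis, smoothing_function=smoothing)

# Dummy captions; in the notebook the hypothesis comes from the trained model.
refs = ["a dog runs through the grass", "a brown dog is running outside"]
hyp = "a dog is running through the grass"
scores = [bleu_for_image(refs, hyp) for _ in range(5)]  # repeat for stochastic decoding
print(np.mean(scores))
```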
After training, you should obtain:
- Loss and accuracy curves for training vs. validation per epoch
- BLEU scores across multiple caption generations for sample images
- Distribution statistics (mean, min, max, std) of BLEU scores over 100 validation images
- Saved plots and metrics (optionally stored under `outputs/`); a short plotting sketch follows this list
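
A sketch of the summary statistics and histogram, using dummy scores in place of real model output (the file name under `outputs/` is illustrative):

```python
import os

import matplotlib.pyplot as plt
import numpy as np

os.makedirs("outputs", exist_ok=True)

# `bleu_scores` would hold one BLEU score per validation image (dummy data here).
bleu_scores = np.random.default_rng(0).uniform(0.2, 0.9, size=100)

print(f"mean={bleu_scores.mean():.3f}  min={bleu_scores.min():.3f}  "
      f"max={bleu_scores.max():.3f}  std={bleu_scores.std():.3f}")

plt.hist(bleu_scores, bins=20)
plt.xlabel("BLEU score")
plt.ylabel("number of images")
plt.title("BLEU score distribution over 100 validation images")
plt.savefig("outputs/bleu_histogram.png", dpi=150)
```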
This project designed and implemented an end-to-end image-captioning system on the Flickr8k dataset, pairing a pre-trained EfficientNet-B0 backbone for feature extraction with custom Transformer encoder–decoder blocks that use self- and cross-attention for sequence modeling. Images and captions were preprocessed, the network was trained end-to-end, and performance was evaluated with BLEU-4 scoring, ultimately reaching a score of 0.67. Along the way, the work sharpened skills in deep learning, NLP, and Python, and it demonstrates how to assemble and train complex multimodal architectures for real-world applications.