A PyTorch implementation for training vision-language models from scratch using a Q-Former architecture and small language models. This project demonstrates how to build a complete VLM pipeline that can understand and describe images.
This project implements a vision-language model training pipeline that:
- Trains a Q-Former to align visual and text embeddings using contrastive learning (CLIP-style)
- Adapts a pretrained language model (SmolLM or Qwen) to understand visual inputs through the trained Q-Former
- Enables image captioning and visual question answering with lightweight models suitable for consumer hardware
The architecture uses:
- Vision Transformer (ViT) for encoding images
- Q-Former (Query Transformer) to compress and align visual features with text
- Small Language Models (SmolLM-135M or Qwen3-0.6B) with LoRA fine-tuning
- Conceptual Captions dataset for training
The diagram above shows the complete end-to-end architecture:
- Image Processing: Input images are encoded by ViT into patch embeddings
- Q-Former: Compresses ViT embeddings into a fixed number of query tokens
- MLP Adapter: Projects Q-Former outputs into the language model's embedding space
- Text Input: User prompt (e.g., "Describe this image") is tokenized
- Concatenation: Image tokens and text tokens are combined into a single sequence
- Language Model: Decoder layers (with LoRA adapters) generate the caption autoregressively
This 2-stage training approach first aligns vision and language representations (Q-Former), then teaches the LLM to understand visual inputs (full VLM).
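In code, the flow above reduces to roughly the following (a minimal sketch; the module and argument names are placeholders, not the actual classes in vlm_train/networks/):

```python
import torch

# Illustrative stage-2 forward pass; names are placeholders, not the
# actual classes in vlm_train/networks/.
def vlm_forward(vit, q_former, adapter, llm, tokenizer, pixel_values, prompt):
    # 1. Encode the image into patch embeddings (the ViT stays frozen).
    with torch.no_grad():
        patch_embeds = vit(pixel_values).last_hidden_state   # (1, 197, 768)

    # 2. Compress the patches into a fixed number of query tokens.
    query_tokens = q_former(patch_embeds)                     # (1, 32, 768)

    # 3. Project the query tokens into the LLM's embedding space.
    image_embeds = adapter(query_tokens)                      # (1, 32, d_model)

    # 4. Embed the text prompt and prepend the image tokens.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(input_ids)       # (1, T, d_model)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

    # 5. The decoder (with LoRA adapters) predicts the caption
    #    autoregressively over the combined sequence.
    return llm(inputs_embeds=inputs_embeds)
```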
```
vlm/
├── vlm_train/                    # Main training directory
│   ├── datasets/                 # Data loading modules
│   │   ├── cc_dataloader.py      # Conceptual Captions dataloader
│   │   └── lm_dataloader.py      # Language model dataloader
│   ├── networks/                 # Neural network architectures
│   │   ├── q_former.py           # Q-Former implementation
│   │   └── lm_to_vlm.py          # VLM wrapper with LoRA
│   ├── utils/                    # Utility functions
│   │   ├── calculate_recall.py   # Retrieval metrics (I2T, T2I)
│   │   ├── utils.py              # Visualization utilities
│   │   └── filter_dataset.py     # Dataset preparation script
│   ├── q_former_train.py         # Q-Former training script
│   ├── lm_train.py               # VLM fine-tuning script
│   ├── test_generation.py        # Generation evaluation script
│   └── basic_inference.py        # Inference and metrics visualization
├── inference_results/            # Output directory for results
│   └── similarity_grid.jpg       # Image-text retrieval visualization
└── pyproject.toml                # Project dependencies
```
- Python >= 3.12
- CUDA-capable GPU recommended (can also run on MPS/CPU)
- ~10GB disk space for dataset
- Clone the repository
```bash
git clone https://github.com/avbiswas/vlm.git
cd vlm
```
- Install dependencies
Using uv (recommended):
```bash
uv pip install -e .
```
Or using pip:
```bash
pip install -e .
```
- Download the dataset
Run the filter script to download Conceptual Captions:
```bash
uv run vlm_train/utils/filter_dataset.py
```
This will download ~200k image-caption pairs to dataset/conceptual-captions-200k.parquet.
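To sanity-check the download, the parquet can be inspected with pandas; the column names below are the ones the img2dataset command in the next step expects:

```python
import pandas as pd

# Inspect the downloaded metadata; "image_url" and "caption" are the
# columns passed to img2dataset below.
df = pd.read_parquet("dataset/conceptual-captions-200k.parquet")
print(len(df))                              # expect roughly 200k rows
print(df[["image_url", "caption"]].head())
```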
- Download images using img2dataset
```bash
img2dataset --url_list dataset/conceptual-captions-200k.parquet \
    --input_format "parquet" \
    --url_col "image_url" \
    --caption_col "caption" \
    --output_folder dataset/cc_images \
    --processes_count 16 \
    --thread_count 64 \
    --image_size 224 \
    --resize_mode center_crop
```
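img2dataset skips unreachable URLs, so it is worth counting what actually landed on disk. A small sketch, assuming the default "files" output layout (shard folders of numbered .jpg files):

```python
from pathlib import Path

# Count downloaded images; adjust the glob if you used a different
# img2dataset output_format (e.g. webdataset tar shards).
images = sorted(Path("dataset/cc_images").rglob("*.jpg"))
print(f"downloaded {len(images)} images")
```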
The Q-Former learns to align visual features from ViT with text embeddings from DistilBERT using contrastive learning:
```bash
uv run vlm_train/q_former_train.py
```
Key hyperparameters:
- Learning rate: 1e-4 (default), 1e-3 (cross-attention & queries)
- Batch size: 8
- Loss: CLIP-style contrastive loss (sketched below)
- Device: Auto-detected (CUDA/MPS/CPU)
The trained model will be saved to models/trained_qformer/best/.
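At its core, a CLIP-style contrastive loss is symmetric cross-entropy over the batch's image-text similarity matrix. A minimal sketch (the temperature value is illustrative, not necessarily what q_former_train.py uses):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.

    image_embeds, text_embeds: (B, D) pooled embeddings of B matched
    pairs. The i-th image and i-th caption are positives; every other
    pairing in the batch serves as a negative.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    logits = image_embeds @ text_embeds.t() / temperature  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```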
Fine-tune a language model to understand visual inputs through the trained Q-Former:
```bash
uv run vlm_train/lm_train.py
```
Key hyperparameters:
- Model: SmolLM-135M-Instruct (default) or Qwen3-0.6B
- Learning rate: 1e-4 (Q-Former), 5e-4 (adapter + LLM LoRA)
- Batch size: 8 with gradient accumulation (4 steps)
- LoRA config: r=64, alpha=128 (see the peft sketch below)
- Mixed precision: bfloat16
Models are saved to models/vlm_peft/best/ and models/vlm_peft/latest/.
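For reference, the r=64, alpha=128 setting corresponds to a peft LoraConfig along these lines (a sketch; the "all-linear" shorthand and the exact base checkpoint id are assumptions, so check lm_to_vlm.py for the real setup):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

# LoRA on all linear layers with the r/alpha values listed above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",  # peft shorthand for every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```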
Test the model on evaluation samples:
```bash
uv run vlm_train/test_generation.py
```
This generates captions for evaluation samples 130-160 and saves the results to test_generation_results.csv.
Run inference to compute image-to-text and text-to-image retrieval metrics:
```bash
uv run vlm_train/basic_inference.py
```
This generates:
- Recall@K metrics (K=1, 5, 10) for both I2T and T2I retrieval (see the sketch below)
- Similarity grid visualization saved to inference_results/similarity_grid.jpg
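Recall@K asks: for each query, does its ground-truth match appear among the K most similar candidates? A minimal sketch of the metric (calculate_recall.py may differ in details):

```python
import torch

def recall_at_k(similarity, k):
    """similarity: (N, N) matrix where entry (i, j) is the cosine
    similarity of image i and caption j; the diagonal holds the
    ground-truth pairs."""
    n = similarity.size(0)
    targets = torch.arange(n, device=similarity.device)

    # Image-to-text: for each row (image), is the true caption in the top-k?
    topk_i2t = similarity.topk(k, dim=1).indices           # (N, k)
    i2t = (topk_i2t == targets[:, None]).any(dim=1).float().mean()

    # Text-to-image: the same check down each column (caption).
    topk_t2i = similarity.topk(k, dim=0).indices           # (k, N)
    t2i = (topk_t2i == targets[None, :]).any(dim=0).float().mean()
    return i2t.item(), t2i.item()
```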
Below is an example similarity grid showing image-text retrieval performance on 8 test samples:
The grid shows:
- Rows: Text captions
- Columns: Images
- Cell values: Cosine similarity between image and text embeddings
Retrieval Performance:
- Image-to-Text: High recall indicates the model can find correct captions for images
- Text-to-Image: High recall indicates the model can find correct images for captions
Below are examples of captions generated by the trained Vision-Language Model on test images:
The model demonstrates the ability to:
- Describe visual content: Objects, scenes, and settings in images
- Generate coherent captions: Natural language descriptions that relate to the image content
- Handle diverse images: From landscapes and architecture to products and people
Each image is paired with a generated caption (shown in green), demonstrating the model's vision-language understanding capabilities after the 2-stage training process.
Note: This project was created for a YouTube tutorial (linked below), focusing on educational clarity over scale. The model was trained on just 50,000 image-caption pairs (a small subset of Conceptual Captions) and the entire 2-stage training process completed in approximately 4 hours on a single GPU. Despite the limited data and training time, the model demonstrates meaningful vision-language understanding, making this an accessible starting point for learning VLM development.
Q-Former (Stage 1):
- Based on the DistilBERT architecture
- 32 learnable query tokens to compress visual information
- Cross-attention layers to attend to ViT features
- Outputs aligned visual and text embeddings for contrastive learning
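Conceptually, this reduces to a bank of learnable query embeddings cross-attending into the frozen ViT features. A stripped-down illustration (the real q_former.py builds on full DistilBERT blocks):

```python
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """Illustrative only: 32 learnable queries cross-attend into ViT
    patch features to produce a fixed-size visual summary."""

    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vit_features):  # (B, num_patches, dim)
        q = self.queries.expand(vit_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, vit_features, vit_features)
        return self.norm(out)         # (B, 32, dim)
```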
Full VLM (Stage 2):
- Vision Encoder: ViT-base-patch16-224 (frozen)
- Q-Former: Trained from Stage 1 (fine-tuned)
- Adapter: 2-layer MLP to project Q-Former outputs to LLM space (sketched below)
- Language Model: SmolLM-135M with LoRA (r=64) on all linear layers
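The adapter itself is small. A sketch of such a 2-layer projection (the hidden width here is an arbitrary choice; 768 is ViT-base's output size and 576 is SmolLM-135M's embedding size):

```python
import torch.nn as nn

# Project Q-Former outputs (ViT width, 768) into the LLM embedding
# space (576 for SmolLM-135M; adjust for Qwen3-0.6B).
adapter = nn.Sequential(
    nn.Linear(768, 1024),
    nn.GELU(),
    nn.Linear(1024, 576),
)
```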
Future improvements:
- Config-driven development: Most hyperparameters, model paths, and training settings are currently hardcoded in the training scripts. Extracting these into YAML/JSON config files would make experimentation much easier and reduce code duplication
- Experiment tracking: Integrate with Weights & Biases or MLflow for better experiment tracking and visualization
- Data augmentation: Add image augmentations to improve model robustness
- Larger training runs: Scale up to the full Conceptual Captions dataset or use other datasets like LAION
- Better evaluation metrics: Add BLEU, CIDEr, and other standard image captioning metrics
- Model checkpointing: Implement better checkpoint management and resume-from-checkpoint functionality
If you find this project helpful, please:
- ⭐ Star the repository: https://github.com/avbiswas/vlm
- 📺 Watch the tutorial: https://youtu.be/Oj27kALfvr0
- ☕ Support on Patreon: https://www.patreon.com/NeuralBreakdownwithAVB
- BLIP-2 paper for Q-Former architecture inspiration
- Hugging Face for transformers library and model hosting
- Conceptual Captions dataset creators


