A PyTorch implementation for training vision-language models from scratch using a Q-Former architecture and small language models. This project demonstrates how to build a complete VLM pipeline that can understand and describe images.
This project implements a vision-language model training pipeline that:
- Trains a Q-Former to align visual and text embeddings using contrastive learning (CLIP-style)
- Adapts a pretrained language model (SmolLM or Qwen) to understand visual inputs through the trained Q-Former
- Enables image captioning and visual question answering with lightweight models suitable for consumer hardware
The architecture uses:
- Vision Transformer (ViT) for encoding images
- Q-Former (Query Transformer) to compress and align visual features with text
- Small Language Models (SmolLM-135M or Qwen3-0.6B) with LoRA fine-tuning
- Conceptual Captions dataset for training
The diagram above shows the complete end-to-end architecture:
- Image Processing: Input images are encoded by ViT into patch embeddings
- Q-Former: Compresses ViT embeddings into a fixed number of query tokens
- MLP Adapter: Projects Q-Former outputs into the language model's embedding space
- Text Input: User prompt (e.g., "Describe this image") is tokenized
- Concatenation: Image tokens and text tokens are combined into a single sequence
- Language Model: Decoder layers (with LoRA adapters) generate the caption autoregressively
This 2-stage training approach first aligns vision and language representations (Q-Former), then teaches the LLM to understand visual inputs (full VLM).
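In code, the flow above reduces to roughly the following (a minimal sketch; the module and argument names are placeholders, not the actual classes in vlm_train/networks/):

```python
import torch

# Illustrative stage-2 forward pass; names are placeholders, not the
# actual classes in vlm_train/networks/.
def vlm_forward(vit, q_former, adapter, llm, tokenizer, pixel_values, prompt):
    # 1. Encode the image into patch embeddings (the ViT stays frozen).
    with torch.no_grad():
        patch_embeds = vit(pixel_values).last_hidden_state   # (1, 197, 768)

    # 2. Compress the patches into a fixed number of query tokens.
    query_tokens = q_former(patch_embeds)                     # (1, 32, 768)

    # 3. Project the query tokens into the LLM's embedding space.
    image_embeds = adapter(query_tokens)                      # (1, 32, d_model)

    # 4. Embed the text prompt and prepend the image tokens.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(input_ids)       # (1, T, d_model)
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)

    # 5. The decoder (with LoRA adapters) predicts the caption
    #    autoregressively over the combined sequence.
    return llm(inputs_embeds=inputs_embeds)
```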
```
vlm/
├── vlm_train/                    # Main training directory
│   ├── datasets/                 # Data loading modules
│   │   ├── cc_dataloader.py      # Conceptual Captions dataloader
│   │   └── lm_dataloader.py      # Language model dataloader
│   ├── networks/                 # Neural network architectures
│   │   ├── q_former.py           # Q-Former implementation
│   │   └── lm_to_vlm.py          # VLM wrapper with LoRA
│   ├── utils/                    # Utility functions
│   │   ├── calculate_recall.py   # Retrieval metrics (I2T, T2I)
│   │   ├── utils.py              # Visualization utilities
│   │   └── filter_dataset.py     # Dataset preparation script
│   ├── q_former_train.py         # Q-Former training script
│   ├── lm_train.py               # VLM fine-tuning script
│   ├── test_generation.py        # Generation evaluation script
│   └── basic_inference.py        # Inference and metrics visualization
├── inference_results/            # Output directory for results
│   └── similarity_grid.jpg       # Image-text retrieval visualization
└── pyproject.toml                # Project dependencies
```
- Python >= 3.12
- CUDA-capable GPU recommended (can also run on MPS/CPU)
- ~10GB disk space for dataset
- Clone the repository
```bash
git clone https://github.com/avbiswas/vlm.git
cd vlm
```
- Install dependencies
Using uv (recommended):
```bash
uv pip install -e .
```
Or using pip:
```bash
pip install -e .
```
- Download the dataset
Run the filter script to download Conceptual Captions:
```bash
uv run vlm_train/utils/filter_dataset.py
```
This will download ~200k image-caption pairs to dataset/conceptual-captions-200k.parquet.
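To sanity-check the download, the parquet can be inspected with pandas; the column names below are the ones the img2dataset command in the next step expects:

```python
import pandas as pd

# Inspect the downloaded metadata; "image_url" and "caption" are the
# columns passed to img2dataset below.
df = pd.read_parquet("dataset/conceptual-captions-200k.parquet")
print(len(df))                              # expect roughly 200k rows
print(df[["image_url", "caption"]].head())
```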
- Download images using img2dataset
```bash
img2dataset --url_list dataset/conceptual-captions-200k.parquet \
    --input_format "parquet" \
    --url_col "image_url" \
    --caption_col "caption" \
    --output_folder dataset/cc_images \
    --processes_count 16 \
    --thread_count 64 \
    --image_size 224 \
    --resize_mode center_crop
```
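img2dataset skips unreachable URLs, so it is worth counting what actually landed on disk. A small sketch, assuming the default "files" output layout (shard folders of numbered .jpg files):

```python
from pathlib import Path

# Count downloaded images; adjust the glob if you used a different
# img2dataset output_format (e.g. webdataset tar shards).
images = sorted(Path("dataset/cc_images").rglob("*.jpg"))
print(f"downloaded {len(images)} images")
```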
The Q-Former learns to align visual features from ViT with text embeddings from DistilBERT using contrastive learning:
```bash
uv run vlm_train/q_former_train.py
```
Key hyperparameters:
- Learning rate: 1e-4 (default), 1e-3 (cross-attention & queries)
- Batch size: 8
- Loss: CLIP-style contrastive loss (sketched below)
- Device: Auto-detected (CUDA/MPS/CPU)
The trained model will be saved to models/trained_qformer/best/.
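At its core, a CLIP-style contrastive loss is symmetric cross-entropy over the batch's image-text similarity matrix. A minimal sketch (the temperature value is illustrative, not necessarily what q_former_train.py uses):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.

    image_embeds, text_embeds: (B, D) pooled embeddings of B matched
    pairs. The i-th image and i-th caption are positives; every other
    pairing in the batch serves as a negative.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    logits = image_embeds @ text_embeds.t() / temperature  # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return (loss_i2t + loss_t2i) / 2
```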
Fine-tune a language model to understand visual inputs through the trained Q-Former:
```bash
uv run vlm_train/lm_train.py
```
Key hyperparameters:
- Model: SmolLM-135M-Instruct (default) or Qwen3-0.6B
- Learning rate: 1e-4 (Q-Former), 5e-4 (adapter + LLM LoRA)
- Batch size: 8 with gradient accumulation (4 steps)
- LoRA config: r=64, alpha=128 (see the peft sketch below)
- Mixed precision: bfloat16
Models are saved to models/vlm_peft/best/ and models/vlm_peft/latest/.
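For reference, the r=64, alpha=128 setting corresponds to a peft LoraConfig along these lines (a sketch; the "all-linear" shorthand and the exact base checkpoint id are assumptions, so check lm_to_vlm.py for the real setup):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M-Instruct")

# LoRA on all linear layers with the r/alpha values listed above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules="all-linear",  # peft shorthand for every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```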
Test the model on evaluation samples:
```bash
uv run vlm_train/test_generation.py
```
This generates captions for evaluation samples 130-160 and saves the results to test_generation_results.csv.
Run inference to compute image-to-text and text-to-image retrieval metrics:
```bash
uv run vlm_train/basic_inference.py
```
This generates:
- Recall@K metrics (K=1, 5, 10) for both I2T and T2I retrieval (see the sketch below)
- Similarity grid visualization saved to inference_results/similarity_grid.jpg
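Recall@K asks: for each query, does its ground-truth match appear among the K most similar candidates? A minimal sketch of the metric (calculate_recall.py may differ in details):

```python
import torch

def recall_at_k(similarity, k):
    """similarity: (N, N) matrix where entry (i, j) is the cosine
    similarity of image i and caption j; the diagonal holds the
    ground-truth pairs."""
    n = similarity.size(0)
    targets = torch.arange(n, device=similarity.device)

    # Image-to-text: for each row (image), is the true caption in the top-k?
    topk_i2t = similarity.topk(k, dim=1).indices           # (N, k)
    i2t = (topk_i2t == targets[:, None]).any(dim=1).float().mean()

    # Text-to-image: the same check down each column (caption).
    topk_t2i = similarity.topk(k, dim=0).indices           # (k, N)
    t2i = (topk_t2i == targets[None, :]).any(dim=0).float().mean()
    return i2t.item(), t2i.item()
```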
Below is an example similarity grid showing image-text retrieval performance on 8 test samples:
The grid shows:
- Rows: Text captions
- Columns: Images
- Cell values: Cosine similarity between image and text embeddings
Retrieval Performance:
- Image-to-Text: High recall indicates the model can find correct captions for images
- Text-to-Image: High recall indicates the model can find correct images for captions
Below are examples of captions generated by the trained Vision-Language Model on test images:
The model demonstrates the ability to:
- Describe visual content: Objects, scenes, and settings in images
- Generate coherent captions: Natural language descriptions that relate to the image content
- Handle diverse images: From landscapes and architecture to products and people
Each image is paired with a generated caption (shown in green), demonstrating the model's vision-language understanding capabilities after the 2-stage training process.
Note: This project was created for a YouTube tutorial (linked below), focusing on educational clarity over scale. The model was trained on just 50,000 image-caption pairs (a small subset of Conceptual Captions) and the entire 2-stage training process completed in approximately 4 hours on a single GPU. Despite the limited data and training time, the model demonstrates meaningful vision-language understanding, making this an accessible starting point for learning VLM development.
Q-Former (Stage 1):
- Based on the DistilBERT architecture
- 32 learnable query tokens to compress visual information
- Cross-attention layers to attend to ViT features
- Outputs aligned visual and text embeddings for contrastive learning
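Conceptually, this reduces to a bank of learnable query embeddings cross-attending into the frozen ViT features. A stripped-down illustration (the real q_former.py builds on full DistilBERT blocks):

```python
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """Illustrative only: 32 learnable queries cross-attend into ViT
    patch features to produce a fixed-size visual summary."""

    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vit_features):  # (B, num_patches, dim)
        q = self.queries.expand(vit_features.size(0), -1, -1)
        out, _ = self.cross_attn(q, vit_features, vit_features)
        return self.norm(out)         # (B, 32, dim)
```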
Full VLM (Stage 2):
- Vision Encoder: ViT-base-patch16-224 (frozen)
- Q-Former: Trained from Stage 1 (fine-tuned)
- Adapter: 2-layer MLP to project Q-Former outputs to LLM space (sketched below)
- Language Model: SmolLM-135M with LoRA (r=64) on all linear layers
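The adapter itself is small. A sketch of such a 2-layer projection (the hidden width here is an arbitrary choice; 768 is ViT-base's output size and 576 is SmolLM-135M's embedding size):

```python
import torch.nn as nn

# Project Q-Former outputs (ViT width, 768) into the LLM embedding
# space (576 for SmolLM-135M; adjust for Qwen3-0.6B).
adapter = nn.Sequential(
    nn.Linear(768, 1024),
    nn.GELU(),
    nn.Linear(1024, 576),
)
```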
Future improvements:
- Config-driven development: Most hyperparameters, model paths, and training settings are currently hardcoded in the training scripts. Extracting these into YAML/JSON config files would make experimentation much easier and reduce code duplication
- Experiment tracking: Integrate with Weights & Biases or MLflow for better experiment tracking and visualization
- Data augmentation: Add image augmentations to improve model robustness
- Larger training runs: Scale up to the full Conceptual Captions dataset or use other datasets like LAION
- Better evaluation metrics: Add BLEU, CIDEr, and other standard image captioning metrics
- Model checkpointing: Implement better checkpoint management and resume-from-checkpoint functionality
If you find this project helpful, please:
- ⭐ Star the repository: https://github.com/avbiswas/vlm
- 📺 Watch the tutorial: https://youtu.be/Oj27kALfvr0
- ☕ Support on Patreon: https://www.patreon.com/NeuralBreakdownwithAVB
- BLIP-2 paper for Q-Former architecture inspiration
- Hugging Face for transformers library and model hosting
- Conceptual Captions dataset creators


