47 changes: 47 additions & 0 deletions examples/multimodal_ai/nvidia-video-rag/README.md
@@ -0,0 +1,47 @@
# 🎥 Video Q&A Pipeline (LangChain + Transformers)

A lightweight, modular pipeline that enables question-answering from video content using frame extraction, image captioning, semantic retrieval, and LLM-based response generation.

## 🚀 Features

* ✅ Frame extraction from video (OpenCV)
* 🧠 Image captioning using ViT-GPT2 (Hugging Face)
* 🔍 Semantic retrieval with ChromaDB + LangChain
* 🤖 Q&A using `flan-t5-small` (Text2Text pipeline)
* 💻 Works in CPU/GPU environments

## 📦 Dependencies

* `torch`, `transformers`, `opencv-python-headless`
* `langchain`, `langchain-community`, `langchain-huggingface`
* `sentence-transformers`, `chromadb`, `Pillow`


## 🧩 Pipeline Overview

```text
Video → Frames → Captions → Embeddings → ChromaDB → Retriever + LLM → Answer
```
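
For orientation, the stages chain together roughly as sketched below. The helpers (`get_video_from_url`, `extract_frames`, `generate_lightweight_captions`, `ask_video_question`) are defined in the notebook, and the embedding/ChromaDB step is done inline there, so treat this as an outline of the flow rather than a standalone script.

```python
# Outline of the notebook's flow (all helpers are defined in the notebook)
video_url = "https://opencv.org/wp-content/uploads/2025/02/Example-Video.mp4"
video_path = get_video_from_url(video_url)                    # download to a temp file
extract_frames(video_path, frame_interval=30)                 # save every 30th frame as a JPEG
captions = generate_lightweight_captions("/content/frames")   # ViT-GPT2 caption per frame

# The captions are embedded with all-MiniLM-L6-v2 and stored in ChromaDB (inline in
# the notebook); the retriever + flan-t5-small then answer questions over them:
ask_video_question("What is happening in the video?")
```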

## 🛠️ Usage

1. **Start Jupyter** (Notebook or JupyterLab)

2. **Open the notebook** `nvidia_video_rag.ipynb` and follow the steps:

* 📥 Download video
* 🖼️ Extract frames
* 🧾 Generate captions
* 💾 Store in ChromaDB
* ❓ Ask questions via LLM

## 🧠 Example Questions

* What is happening in the video?
* What objects or people appear?
* Describe the main activity.
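
In the notebook, these map directly onto calls to the `ask_video_question` helper, for example:

```python
ask_video_question("What is happening in the video?")
ask_video_question("What objects or people appear in the video?")
```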

## ✅ Conclusion

This template provides a clean foundation for building **video understanding** applications using modern AI tooling. Extend it with your own videos, models, or use cases.

330 changes: 330 additions & 0 deletions examples/multimodal_ai/nvidia-video-rag/nvidia_video_rag.ipynb
@@ -0,0 +1,330 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "de31e710-406c-4ed8-b65b-337578a20382",
"metadata": {},
"source": [
"# 🎬 Video Q&A Pipeline\n",
"\n",
"\n",
"This template is an intelligent video analysis system that extracts frames from videos, generates captions using vision transformers, and enables natural language Q&A through Retrieval-Augmented Generation (RAG). Ask questions about video content and get AI-powered answers based on visual understanding.\n",
"\n",
"On [Saturn Cloud](https://saturncloud.io), scale your Video Q&A pipeline from single-GPU prototyping to [distributed multi-GPU inference](https://saturncloud.io/docs/user-guide/how-to/llms/parallel_training/). Process multiple videos simultaneously, parallelize frame captioning across GPUs, accelerate vector embedding generation, and handle high-volume Q&A requests — all in a managed environment with CUDA-optimized OpenCV, PyTorch, and Transformers.\n"
]
},
{
"cell_type": "markdown",
"id": "6a3e9225-76a0-4f5b-a8c1-4789cb25e08a",
"metadata": {},
"source": [
"Install dependencies for the template"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a62505f-59d2-4ce4-b1bb-113b2e93635d",
"metadata": {},
"outputs": [],
"source": [
"!pip install -q --no-deps transformers\n",
"!pip install -q --no-deps tokenizers\n",
"!pip install -q pillow\n",
"!pip install -q --no-deps langchain\n",
"!pip install -q --no-deps langchain-community\n",
"!pip install -q --no-deps langchain-core\n",
"!pip install -q --no-deps langchain-huggingface\n",
"!pip install -q chromadb\n",
"!pip install -q sentence-transformers\n",
"!pip install -q huggingface-hub"
]
},
{
"cell_type": "markdown",
"id": "c5c61b07-cf57-4ec5-8aa2-5f013dd25494",
"metadata": {},
"source": [
"This is to programmatically download a video from a specified URL and save it as a temporary file on the local file system, this is to build the the video Q&A pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c028137b-ad16-40db-b79e-771d0336bfef",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import tempfile\n",
"import os\n",
"\n",
"# Disable SSL warnings (optional, for insecure URLs)\n",
"import urllib3\n",
"urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
"\n",
"def get_video_from_url(url):\n",
" response = requests.get(url, stream=True, verify=False)\n",
" if response.status_code == 200:\n",
" temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=\".mp4\", dir=\"/content\")\n",
" temp_file.write(response.content)\n",
" temp_file.flush()\n",
" return temp_file.name\n",
" else:\n",
" raise Exception(f\"Failed to download video: {response.status_code}\")\n",
"\n",
"# Example video URL\n",
"video_url = \"https://opencv.org/wp-content/uploads/2025/02/Example-Video.mp4\"\n",
"video_path = get_video_from_url(video_url)\n",
"print(\"✅ Video downloaded and available at:\", video_path)"
]
},
{
"cell_type": "markdown",
"id": "452e25ee-cc8f-4d7c-adaa-8bb8fe4c1753",
"metadata": {},
"source": [
"This extract frames from a video file at specified intervals and saving them as individual image files. It uses the cv2 library (OpenCV) to handle video reading and image writing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d85fb9a6-bdac-44b6-bfd8-58e52a6cb028",
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"\n",
"def extract_frames(video_path, output_dir=\"/content/frames\", frame_interval=30):\n",
" \"\"\"Extract frames from video at specified intervals\"\"\"\n",
" os.makedirs(output_dir, exist_ok=True)\n",
"\n",
" cap = cv2.VideoCapture(video_path)\n",
" frame_count = 0\n",
" saved_count = 0\n",
"\n",
" while True:\n",
" ret, frame = cap.read()\n",
" if not ret:\n",
" break\n",
"\n",
" if frame_count % frame_interval == 0:\n",
" frame_path = os.path.join(output_dir, f\"frame_{saved_count:04d}.jpg\")\n",
" cv2.imwrite(frame_path, frame)\n",
" saved_count += 1\n",
"\n",
" frame_count += 1\n",
"\n",
" cap.release()\n",
" print(f\"✅ Extracted {saved_count} frames to '{output_dir}/'\")\n",
" return saved_count\n",
"\n",
"# Extract frames from the downloaded video\n",
"extract_frames(video_path, frame_interval=30)"
]
},
{
"cell_type": "markdown",
"id": "3ac647f5-5542-42bd-bd0c-a9c01c0c4e6a",
"metadata": {},
"source": [
"Performs image captioning using a pre-trained VisionEncoderDecoder model (ViT-GPT2) from Hugging Face Transformers to generate textual descriptions for a series of image frames."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e75225a9-34e3-479c-ba84-3df1378552d3",
"metadata": {},
"outputs": [],
"source": [
"from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer\n",
"from PIL import Image\n",
"import torch\n",
"\n",
"# Device config\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"print(\"Using device:\", device)\n",
"\n",
"# Load model & processor\n",
"print(\"🔄 Loading ViT-GPT2 image captioning model...\")\n",
"processor = ViTImageProcessor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n",
"tokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n",
"model = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\").to(device)\n",
"print(\"✅ Model loaded.\")\n",
"\n",
"# Generate captions for frames\n",
"def generate_lightweight_captions(frame_dir=\"/content/frames\", max_frames=11):\n",
" captions = []\n",
" files = sorted(os.listdir(frame_dir))[:max_frames]\n",
"\n",
" for idx, fname in enumerate(files):\n",
" try:\n",
" path = os.path.join(frame_dir, fname)\n",
" image = Image.open(path).convert(\"RGB\")\n",
" pixel_values = processor(images=image, return_tensors=\"pt\").pixel_values.to(device)\n",
" output_ids = model.generate(pixel_values, max_length=16, num_beams=4)\n",
" caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)\n",
" captions.append(caption)\n",
" print(f\"[{idx+1}/{len(files)}] {fname} → {caption}\")\n",
" except Exception as e:\n",
" print(f\"⚠️ Error processing {fname}: {e}\")\n",
" captions.append(\"Error generating caption\")\n",
"\n",
" return captions\n",
"\n",
"# Run captioning\n",
"captions = generate_lightweight_captions(\"/content/frames\", max_frames=11)"
]
},
{
"cell_type": "markdown",
"id": "4a953d87-0582-42a5-9474-ddfa8401365d",
"metadata": {},
"source": [
"This sets up a vector database (ChromaDB) and populates it with text embeddings generated from image captions. It uses the LangChain framework—specifically the langchain_community module—for handling embeddings, vector storage, and document retrieval."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63c6919d-9247-40ca-80f1-2dccc49493af",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import HuggingFaceEmbeddings\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_core.documents import Document\n",
"\n",
"# Convert captions into Document objects\n",
"docs = [Document(page_content=cap) for cap in captions]\n",
"\n",
"# Load the embedding model\n",
"print(\"🔄 Loading embedding model...\")\n",
"embedding_model = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n",
"\n",
"# Store in ChromaDB\n",
"vectorstore = Chroma.from_documents(\n",
" documents=docs,\n",
" embedding=embedding_model,\n",
" persist_directory=\"/content/chroma_db\"\n",
")\n",
"\n",
"# Create retriever\n",
"retriever = vectorstore.as_retriever(search_kwargs={\"k\": 3})\n",
"\n",
"print(\"✅ Vector store created and retriever ready!\")\n",
"print(f\"📊 Indexed {len(captions)} frame captions\\n\")"
]
},
{
"cell_type": "markdown",
"id": "09953226-de80-4c44-8c86-85968875c167",
"metadata": {},
"source": [
"This code implements the final step of a video-based question answering (Q&A) pipeline using a lightweight language model from Hugging Face Transformers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d1da8fd-5085-4d52-a456-ae3af6680162",
"metadata": {},
"outputs": [],
"source": [
"from transformers import pipeline\n",
"\n",
"# Load a small, efficient model for Q&A (fits in Colab free tier)\n",
"print(\"🔄 Loading lightweight Q&A model...\")\n",
"qa_pipeline = pipeline(\n",
" \"text2text-generation\",\n",
" model=\"google/flan-t5-small\", # Small model (~300MB)\n",
" device=0 if torch.cuda.is_available() else -1,\n",
" max_length=512\n",
")\n",
"print(\"✅ Q&A model loaded!\")\n",
"\n",
"def ask_video_question(query):\n",
" \"\"\"Ask a question about the video and get an AI-generated answer\"\"\"\n",
"\n",
" # Retrieve relevant frame captions\n",
" docs = retriever.invoke(query)\n",
" context = \"\\n\".join([f\"Frame {i+1}: {doc.page_content}\" for i, doc in enumerate(docs)])\n",
"\n",
" # Create prompt\n",
" prompt = f\"\"\"Answer the question based on these video frame descriptions:\n",
"\n",
"{context}\n",
"\n",
"Question: {query}\n",
"Answer:\"\"\"\n",
"\n",
" # Get response from model\n",
" print(f\"\\n🔍 Question: {query}\")\n",
" print(\"🤔 Thinking...\")\n",
"\n",
" try:\n",
" response = qa_pipeline(prompt, max_length=256, do_sample=False)[0]['generated_text']\n",
" print(f\"✅ Answer: {response}\\n\")\n",
" except Exception as e:\n",
" # Fallback: just show the relevant captions\n",
" print(f\"⚠️ Error generating answer: {e}\")\n",
" print(f\"📋 Relevant frames found:\")\n",
" for i, doc in enumerate(docs, 1):\n",
" print(f\" Frame {i}: {doc.page_content}\")\n",
" response = \" \".join([doc.page_content for doc in docs])\n",
"\n",
" return response\n",
"\n",
"# Ask questions about your video\n",
"print(\"=\"*60)\n",
"print(\"VIDEO Q&A SESSION\")\n",
"print(\"=\"*60)\n",
"\n",
"ask_video_question(\"What is happening in the video?\")\n",
"ask_video_question(\"What objects or people appear in the video?\")\n",
"ask_video_question(\"Describe the main activity shown in the video?\")\n",
"\n",
"# Optional: Interactive Q&A\n",
"print(\"\\n\" + \"=\"*60)\n",
"print(\"💡 TIP: You can ask custom questions by calling:\")\n",
"print('ask_video_question(\"your question here\")')\n",
"print(\"=\"*60)"
]
},
{
"cell_type": "markdown",
"id": "73d9e8a2-26d2-4dad-81f6-3210d92c4069",
"metadata": {},
"source": [
"This template provides all the tools that is needed to develop advanced video understanding applications—whether you're building video summarization tools, educational video analysis, or AI-driven surveillance systems. \n",
"\n",
"If you plan to scale up to longer videos or batch-processing video datasets, you can distribute workloads using [Saturn Cloud](https://saturncloud.io).\n",
"\n",
"You can also enhance the performance and specificity of the system by fine-tuning captioning or embedding models on domain-specific video datasets."
]
}
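,
{
"cell_type": "markdown",
"id": "8f3a1c2e-5b6d-4e7f-9a0b-1c2d3e4f5a6b",
"metadata": {},
"source": [
"As a starting point for batch-processing longer videos, the cell below is a minimal sketch of batched frame captioning: it sends several frames through the model per forward pass instead of captioning one image at a time. It assumes the `processor`, `tokenizer`, `model`, and `device` objects from the captioning cell above are still loaded; `caption_frames_batched` is a hypothetical helper added here for illustration, not part of the original pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e4b2d3f-6c7e-4f80-ab1c-2d3e4f5a6b7c",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: batched captioning (assumes processor, tokenizer, model, and device\n",
"# from the captioning cell above are already loaded in this session)\n",
"from PIL import Image\n",
"import os\n",
"\n",
"def caption_frames_batched(frame_dir=\"/content/frames\", batch_size=4, max_frames=None):\n",
"    files = sorted(os.listdir(frame_dir))[:max_frames]\n",
"    batched_captions = []\n",
"    for start in range(0, len(files), batch_size):\n",
"        batch = files[start:start + batch_size]\n",
"        images = [Image.open(os.path.join(frame_dir, f)).convert(\"RGB\") for f in batch]\n",
"        pixel_values = processor(images=images, return_tensors=\"pt\").pixel_values.to(device)\n",
"        output_ids = model.generate(pixel_values, max_length=16, num_beams=4)\n",
"        batched_captions.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))\n",
"    return batched_captions\n",
"\n",
"# Example usage:\n",
"# batched_captions = caption_frames_batched(batch_size=4)"
]
}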
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.19"
}
},
"nbformat": 4,
"nbformat_minor": 5
}