47 changes: 47 additions & 0 deletions examples/multimodal_ai/nvidia-video-rag/README.md
@@ -0,0 +1,47 @@
# 🎥 Video Q&A Pipeline (LangChain + Transformers)

A lightweight, modular pipeline that enables question-answering from video content using frame extraction, image captioning, semantic retrieval, and LLM-based response generation.

## 🚀 Features

* ✅ Frame extraction from video (OpenCV)
* 🧠 Image captioning using ViT-GPT2 (Hugging Face)
* 🔍 Semantic retrieval with ChromaDB + LangChain
* 🤖 Q&A using `flan-t5-small` (Text2Text pipeline)
* 💻 Works in CPU/GPU environments

## 📦 Dependencies

* `torch`, `transformers`, `opencv-python-headless`
* `langchain`, `langchain-community`, `langchain-huggingface`
* `sentence-transformers`, `chromadb`, `Pillow`


## 🧩 Pipeline Overview

```text
Video → Frames → Captions → Embeddings → ChromaDB → Retriever + LLM → Answer
```
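
For orientation, the stages chain together roughly as sketched below. The helpers (`get_video_from_url`, `extract_frames`, `generate_lightweight_captions`, `ask_video_question`) are defined in the notebook, and the embedding/ChromaDB step is done inline there, so treat this as an outline of the flow rather than a standalone script.

```python
# Outline of the notebook's flow (all helpers are defined in the notebook)
video_url = "https://opencv.org/wp-content/uploads/2025/02/Example-Video.mp4"
video_path = get_video_from_url(video_url)                    # download to a temp file
extract_frames(video_path, frame_interval=30)                 # save every 30th frame as a JPEG
captions = generate_lightweight_captions("/content/frames")   # ViT-GPT2 caption per frame

# The captions are embedded with all-MiniLM-L6-v2 and stored in ChromaDB (inline in
# the notebook); the retriever + flan-t5-small then answer questions over them:
ask_video_question("What is happening in the video?")
```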

## 🛠️ Usage

1. **Start Jupyter** (Notebook or JupyterLab)

2. **Open the notebook** `nvidia_video_rag.ipynb` and follow the steps:

* 📥 Download video
* 🖼️ Extract frames
* 🧾 Generate captions
* 💾 Store in ChromaDB
* ❓ Ask questions via LLM

## 🧠 Example Questions

* What is happening in the video?
* What objects or people appear?
* Describe the main activity.
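
In the notebook, these map directly onto calls to the `ask_video_question` helper, for example:

```python
ask_video_question("What is happening in the video?")
ask_video_question("What objects or people appear in the video?")
```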

## ✅ Conclusion

This template provides a clean foundation for building **video understanding** applications using modern AI tooling. Extend it with your own videos, models, or use cases.

330 changes: 330 additions & 0 deletions examples/multimodal_ai/nvidia-video-rag/nvidia_video_rag.ipynb
@@ -0,0 +1,330 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "de31e710-406c-4ed8-b65b-337578a20382",
"metadata": {},
"source": [
"# 🎬 Video Q&A Pipeline\n",
"\n",
"\n",
"This template is an intelligent video analysis system that extracts frames from videos, generates captions using vision transformers, and enables natural language Q&A through Retrieval-Augmented Generation (RAG). Ask questions about video content and get AI-powered answers based on visual understanding.\n",
"\n",
"On [Saturn Cloud](https://saturncloud.io), scale your Video Q&A pipeline from single-GPU prototyping to [distributed multi-GPU inference](https://saturncloud.io/docs/user-guide/how-to/llms/parallel_training/). Process multiple videos simultaneously, parallelize frame captioning across GPUs, accelerate vector embedding generation, and handle high-volume Q&A requests — all in a managed environment with CUDA-optimized OpenCV, PyTorch, and Transformers.\n"
]
},
{
"cell_type": "markdown",
"id": "6a3e9225-76a0-4f5b-a8c1-4789cb25e08a",
"metadata": {},
"source": [
"Install dependencies for the template"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2a62505f-59d2-4ce4-b1bb-113b2e93635d",
"metadata": {},
"outputs": [],
"source": [
"!pip install -q --no-deps transformers\n",
"!pip install -q --no-deps tokenizers\n",
"!pip install -q pillow\n",
"!pip install -q --no-deps langchain\n",
"!pip install -q --no-deps langchain-community\n",
"!pip install -q --no-deps langchain-core\n",
"!pip install -q --no-deps langchain-huggingface\n",
"!pip install -q chromadb\n",
"!pip install -q sentence-transformers\n",
"!pip install -q huggingface-hub"
]
},
{
"cell_type": "markdown",
"id": "c5c61b07-cf57-4ec5-8aa2-5f013dd25494",
"metadata": {},
"source": [
"This is to programmatically download a video from a specified URL and save it as a temporary file on the local file system, this is to build the the video Q&A pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c028137b-ad16-40db-b79e-771d0336bfef",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import tempfile\n",
"import os\n",
"\n",
"# Disable SSL warnings (optional, for insecure URLs)\n",
"import urllib3\n",
"urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
"\n",
"def get_video_from_url(url):\n",
" response = requests.get(url, stream=True, verify=False)\n",
" if response.status_code == 200:\n",
" temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=\".mp4\", dir=\"/content\")\n",
" temp_file.write(response.content)\n",
" temp_file.flush()\n",
" return temp_file.name\n",
" else:\n",
" raise Exception(f\"Failed to download video: {response.status_code}\")\n",
"\n",
"# Example video URL\n",
"video_url = \"https://opencv.org/wp-content/uploads/2025/02/Example-Video.mp4\"\n",
"video_path = get_video_from_url(video_url)\n",
"print(\"✅ Video downloaded and available at:\", video_path)"
]
},
{
"cell_type": "markdown",
"id": "452e25ee-cc8f-4d7c-adaa-8bb8fe4c1753",
"metadata": {},
"source": [
"This extract frames from a video file at specified intervals and saving them as individual image files. It uses the cv2 library (OpenCV) to handle video reading and image writing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d85fb9a6-bdac-44b6-bfd8-58e52a6cb028",
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"\n",
"def extract_frames(video_path, output_dir=\"/content/frames\", frame_interval=30):\n",
" \"\"\"Extract frames from video at specified intervals\"\"\"\n",
" os.makedirs(output_dir, exist_ok=True)\n",
"\n",
" cap = cv2.VideoCapture(video_path)\n",
" frame_count = 0\n",
" saved_count = 0\n",
"\n",
" while True:\n",
" ret, frame = cap.read()\n",
" if not ret:\n",
" break\n",
"\n",
" if frame_count % frame_interval == 0:\n",
" frame_path = os.path.join(output_dir, f\"frame_{saved_count:04d}.jpg\")\n",
" cv2.imwrite(frame_path, frame)\n",
" saved_count += 1\n",
"\n",
" frame_count += 1\n",
"\n",
" cap.release()\n",
" print(f\"✅ Extracted {saved_count} frames to '{output_dir}/'\")\n",
" return saved_count\n",
"\n",
"# Extract frames from the downloaded video\n",
"extract_frames(video_path, frame_interval=30)"
]
},
{
"cell_type": "markdown",
"id": "3ac647f5-5542-42bd-bd0c-a9c01c0c4e6a",
"metadata": {},
"source": [
"Performs image captioning using a pre-trained VisionEncoderDecoder model (ViT-GPT2) from Hugging Face Transformers to generate textual descriptions for a series of image frames."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e75225a9-34e3-479c-ba84-3df1378552d3",
"metadata": {},
"outputs": [],
"source": [
"from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer\n",
"from PIL import Image\n",
"import torch\n",
"\n",
"# Device config\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"print(\"Using device:\", device)\n",
"\n",
"# Load model & processor\n",
"print(\"🔄 Loading ViT-GPT2 image captioning model...\")\n",
"processor = ViTImageProcessor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n",
"tokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n",
"model = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\").to(device)\n",
"print(\"✅ Model loaded.\")\n",
"\n",
"# Generate captions for frames\n",
"def generate_lightweight_captions(frame_dir=\"/content/frames\", max_frames=11):\n",
" captions = []\n",
" files = sorted(os.listdir(frame_dir))[:max_frames]\n",
"\n",
" for idx, fname in enumerate(files):\n",
" try:\n",
" path = os.path.join(frame_dir, fname)\n",
" image = Image.open(path).convert(\"RGB\")\n",
" pixel_values = processor(images=image, return_tensors=\"pt\").pixel_values.to(device)\n",
" output_ids = model.generate(pixel_values, max_length=16, num_beams=4)\n",
" caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)\n",
" captions.append(caption)\n",
" print(f\"[{idx+1}/{len(files)}] {fname} → {caption}\")\n",
" except Exception as e:\n",
" print(f\"⚠️ Error processing {fname}: {e}\")\n",
" captions.append(\"Error generating caption\")\n",
"\n",
" return captions\n",
"\n",
"# Run captioning\n",
"captions = generate_lightweight_captions(\"/content/frames\", max_frames=11)"
]
},
{
"cell_type": "markdown",
"id": "4a953d87-0582-42a5-9474-ddfa8401365d",
"metadata": {},
"source": [
"This sets up a vector database (ChromaDB) and populates it with text embeddings generated from image captions. It uses the LangChain framework—specifically the langchain_community module—for handling embeddings, vector storage, and document retrieval."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63c6919d-9247-40ca-80f1-2dccc49493af",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.embeddings import HuggingFaceEmbeddings\n",
"from langchain_community.vectorstores import Chroma\n",
"from langchain_core.documents import Document\n",
"\n",
"# Convert captions into Document objects\n",
"docs = [Document(page_content=cap) for cap in captions]\n",
"\n",
"# Load the embedding model\n",
"print(\"🔄 Loading embedding model...\")\n",
"embedding_model = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n",
"\n",
"# Store in ChromaDB\n",
"vectorstore = Chroma.from_documents(\n",
" documents=docs,\n",
" embedding=embedding_model,\n",
" persist_directory=\"/content/chroma_db\"\n",
")\n",
"\n",
"# Create retriever\n",
"retriever = vectorstore.as_retriever(search_kwargs={\"k\": 3})\n",
"\n",
"print(\"✅ Vector store created and retriever ready!\")\n",
"print(f\"📊 Indexed {len(captions)} frame captions\\n\")"
]
},
{
"cell_type": "markdown",
"id": "09953226-de80-4c44-8c86-85968875c167",
"metadata": {},
"source": [
"This code implements the final step of a video-based question answering (Q&A) pipeline using a lightweight language model from Hugging Face Transformers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d1da8fd-5085-4d52-a456-ae3af6680162",
"metadata": {},
"outputs": [],
"source": [
"from transformers import pipeline\n",
"\n",
"# Load a small, efficient model for Q&A (fits in Colab free tier)\n",
"print(\"🔄 Loading lightweight Q&A model...\")\n",
"qa_pipeline = pipeline(\n",
" \"text2text-generation\",\n",
" model=\"google/flan-t5-small\", # Small model (~300MB)\n",
" device=0 if torch.cuda.is_available() else -1,\n",
" max_length=512\n",
")\n",
"print(\"✅ Q&A model loaded!\")\n",
"\n",
"def ask_video_question(query):\n",
" \"\"\"Ask a question about the video and get an AI-generated answer\"\"\"\n",
"\n",
" # Retrieve relevant frame captions\n",
" docs = retriever.invoke(query)\n",
" context = \"\\n\".join([f\"Frame {i+1}: {doc.page_content}\" for i, doc in enumerate(docs)])\n",
"\n",
" # Create prompt\n",
" prompt = f\"\"\"Answer the question based on these video frame descriptions:\n",
"\n",
"{context}\n",
"\n",
"Question: {query}\n",
"Answer:\"\"\"\n",
"\n",
" # Get response from model\n",
" print(f\"\\n🔍 Question: {query}\")\n",
" print(\"🤔 Thinking...\")\n",
"\n",
" try:\n",
" response = qa_pipeline(prompt, max_length=256, do_sample=False)[0]['generated_text']\n",
" print(f\"✅ Answer: {response}\\n\")\n",
" except Exception as e:\n",
" # Fallback: just show the relevant captions\n",
" print(f\"⚠️ Error generating answer: {e}\")\n",
" print(f\"📋 Relevant frames found:\")\n",
" for i, doc in enumerate(docs, 1):\n",
" print(f\" Frame {i}: {doc.page_content}\")\n",
" response = \" \".join([doc.page_content for doc in docs])\n",
"\n",
" return response\n",
"\n",
"# Ask questions about your video\n",
"print(\"=\"*60)\n",
"print(\"VIDEO Q&A SESSION\")\n",
"print(\"=\"*60)\n",
"\n",
"ask_video_question(\"What is happening in the video?\")\n",
"ask_video_question(\"What objects or people appear in the video?\")\n",
"ask_video_question(\"Describe the main activity shown in the video?\")\n",
"\n",
"# Optional: Interactive Q&A\n",
"print(\"\\n\" + \"=\"*60)\n",
"print(\"💡 TIP: You can ask custom questions by calling:\")\n",
"print('ask_video_question(\"your question here\")')\n",
"print(\"=\"*60)"
]
},
{
"cell_type": "markdown",
"id": "73d9e8a2-26d2-4dad-81f6-3210d92c4069",
"metadata": {},
"source": [
"This template provides all the tools that is needed to develop advanced video understanding applications—whether you're building video summarization tools, educational video analysis, or AI-driven surveillance systems. \n",
"\n",
"If you plan to scale up to longer videos or batch-processing video datasets, you can distribute workloads using [Saturn Cloud](https://saturncloud.io).\n",
"\n",
"You can also enhance the performance and specificity of the system by fine-tuning captioning or embedding models on domain-specific video datasets."
]
}
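,
{
"cell_type": "markdown",
"id": "8f3a1c2e-5b6d-4e7f-9a0b-1c2d3e4f5a6b",
"metadata": {},
"source": [
"As a starting point for batch-processing longer videos, the cell below is a minimal sketch of batched frame captioning: it sends several frames through the model per forward pass instead of captioning one image at a time. It assumes the `processor`, `tokenizer`, `model`, and `device` objects from the captioning cell above are still loaded; `caption_frames_batched` is a hypothetical helper added here for illustration, not part of the original pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e4b2d3f-6c7e-4f80-ab1c-2d3e4f5a6b7c",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: batched captioning (assumes processor, tokenizer, model, and device\n",
"# from the captioning cell above are already loaded in this session)\n",
"from PIL import Image\n",
"import os\n",
"\n",
"def caption_frames_batched(frame_dir=\"/content/frames\", batch_size=4, max_frames=None):\n",
"    files = sorted(os.listdir(frame_dir))[:max_frames]\n",
"    batched_captions = []\n",
"    for start in range(0, len(files), batch_size):\n",
"        batch = files[start:start + batch_size]\n",
"        images = [Image.open(os.path.join(frame_dir, f)).convert(\"RGB\") for f in batch]\n",
"        pixel_values = processor(images=images, return_tensors=\"pt\").pixel_values.to(device)\n",
"        output_ids = model.generate(pixel_values, max_length=16, num_beams=4)\n",
"        batched_captions.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))\n",
"    return batched_captions\n",
"\n",
"# Example usage:\n",
"# batched_captions = caption_frames_batched(batch_size=4)"
]
}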
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.19"
}
},
"nbformat": 4,
"nbformat_minor": 5
}