diff --git a/examples/multimodal_ai/nvidia-video-rag/README.md b/examples/multimodal_ai/nvidia-video-rag/README.md new file mode 100644 index 00000000..49875e14 --- /dev/null +++ b/examples/multimodal_ai/nvidia-video-rag/README.md @@ -0,0 +1,47 @@ +# πŸŽ₯ Video Q&A Pipeline (LangChain + Transformers) + +A lightweight, modular pipeline that enables question-answering from video content using frame extraction, image captioning, semantic retrieval, and LLM-based response generation. + +## πŸš€ Features + +* βœ… Frame extraction from video (OpenCV) +* 🧠 Image captioning using ViT-GPT2 (Hugging Face) +* πŸ” Semantic retrieval with ChromaDB + LangChain +* πŸ€– Q&A using `flan-t5-small` (Text2Text pipeline) +* πŸ’» Works in CPU/GPU environments + +## πŸ“¦ Dependencies + +* `torch`, `transformers`, `opencv-python-headless` +* `langchain`, `langchain-community`, `langchain-huggingface` +* `sentence-transformers`, `chromadb`, `Pillow` + + +## 🧩 Pipeline Overview + +```text +Video β†’ Frames β†’ Captions β†’ Embeddings β†’ ChromaDB β†’ Retriever + LLM β†’ Answer +``` + +## πŸ› οΈ Usage + +1. **Run in Jupyter** + +2. **Open the notebook** and follow steps: + + * πŸ“₯ Download video + * πŸ–ΌοΈ Extract frames + * 🧾 Generate captions + * πŸ’Ύ Store in ChromaDB + * ❓ Ask questions via LLM + +## 🧠 Example Questions + +* What is happening in the video? +* What objects or people appear? +* Describe the main activity. + +## βœ… Conclusion + +This template provides a clean foundation for building **video understanding** applications using modern AI tooling. Extend it with your own videos, models, or use cases. + diff --git a/examples/multimodal_ai/nvidia-video-rag/nvidia_video_rag.ipynb b/examples/multimodal_ai/nvidia-video-rag/nvidia_video_rag.ipynb new file mode 100644 index 00000000..f0a920b1 --- /dev/null +++ b/examples/multimodal_ai/nvidia-video-rag/nvidia_video_rag.ipynb @@ -0,0 +1,330 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "de31e710-406c-4ed8-b65b-337578a20382", + "metadata": {}, + "source": [ + "# 🎬 Video Q&A Pipeline\n", + "\n", + "\n", + "This template is an intelligent video analysis system that extracts frames from videos, generates captions using vision transformers, and enables natural language Q&A through Retrieval-Augmented Generation (RAG). Ask questions about video content and get AI-powered answers based on visual understanding.\n", + "\n", + "On [Saturn Cloud](https://saturncloud.io), scale your Video Q&A pipeline from single-GPU prototyping to [distributed multi-GPU inference](https://saturncloud.io/docs/user-guide/how-to/llms/parallel_training/). 
Process multiple videos simultaneously, parallelize frame captioning across GPUs, accelerate vector embedding generation, and handle high-volume Q&A requests β€” all in a managed environment with CUDA-optimized OpenCV, PyTorch, and Transformers.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6a3e9225-76a0-4f5b-a8c1-4789cb25e08a",
+ "metadata": {},
+ "source": [
+ "Install dependencies for the template"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2a62505f-59d2-4ce4-b1bb-113b2e93635d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install -q --no-deps transformers\n",
+ "!pip install -q --no-deps tokenizers\n",
+ "!pip install -q pillow\n",
+ "!pip install -q --no-deps langchain\n",
+ "!pip install -q --no-deps langchain-community\n",
+ "!pip install -q --no-deps langchain-core\n",
+ "!pip install -q --no-deps langchain-huggingface\n",
+ "!pip install -q chromadb\n",
+ "!pip install -q sentence-transformers\n",
+ "!pip install -q huggingface-hub"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c5c61b07-cf57-4ec5-8aa2-5f013dd25494",
+ "metadata": {},
+ "source": [
+ "This cell programmatically downloads a video from a specified URL and saves it as a temporary file on the local file system; this downloaded video is the input for the video Q&A pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c028137b-ad16-40db-b79e-771d0336bfef",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import tempfile\n",
+ "import os\n",
+ "\n",
+ "# Disable SSL warnings (optional, for insecure URLs)\n",
+ "import urllib3\n",
+ "urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
+ "\n",
+ "def get_video_from_url(url):\n",
+ "    response = requests.get(url, stream=True, verify=False)\n",
+ "    if response.status_code == 200:\n",
+ "        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=\".mp4\", dir=\"/content\")\n",
+ "        temp_file.write(response.content)\n",
+ "        temp_file.flush()\n",
+ "        return temp_file.name\n",
+ "    else:\n",
+ "        raise Exception(f\"Failed to download video: {response.status_code}\")\n",
+ "\n",
+ "# Example video URL\n",
+ "video_url = \"https://opencv.org/wp-content/uploads/2025/02/Example-Video.mp4\"\n",
+ "video_path = get_video_from_url(video_url)\n",
+ "print(\"βœ… Video downloaded and available at:\", video_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "452e25ee-cc8f-4d7c-adaa-8bb8fe4c1753",
+ "metadata": {},
+ "source": [
+ "This extracts frames from a video file at specified intervals and saves them as individual image files. It uses the cv2 library (OpenCV) to handle video reading and image writing."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d85fb9a6-bdac-44b6-bfd8-58e52a6cb028", + "metadata": {}, + "outputs": [], + "source": [ + "import cv2\n", + "\n", + "def extract_frames(video_path, output_dir=\"/content/frames\", frame_interval=30):\n", + " \"\"\"Extract frames from video at specified intervals\"\"\"\n", + " os.makedirs(output_dir, exist_ok=True)\n", + "\n", + " cap = cv2.VideoCapture(video_path)\n", + " frame_count = 0\n", + " saved_count = 0\n", + "\n", + " while True:\n", + " ret, frame = cap.read()\n", + " if not ret:\n", + " break\n", + "\n", + " if frame_count % frame_interval == 0:\n", + " frame_path = os.path.join(output_dir, f\"frame_{saved_count:04d}.jpg\")\n", + " cv2.imwrite(frame_path, frame)\n", + " saved_count += 1\n", + "\n", + " frame_count += 1\n", + "\n", + " cap.release()\n", + " print(f\"βœ… Extracted {saved_count} frames to '{output_dir}/'\")\n", + " return saved_count\n", + "\n", + "# Extract frames from the downloaded video\n", + "extract_frames(video_path, frame_interval=30)" + ] + }, + { + "cell_type": "markdown", + "id": "3ac647f5-5542-42bd-bd0c-a9c01c0c4e6a", + "metadata": {}, + "source": [ + "Performs image captioning using a pre-trained VisionEncoderDecoder model (ViT-GPT2) from Hugging Face Transformers to generate textual descriptions for a series of image frames." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e75225a9-34e3-479c-ba84-3df1378552d3", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer\n", + "from PIL import Image\n", + "import torch\n", + "\n", + "# Device config\n", + "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "print(\"Using device:\", device)\n", + "\n", + "# Load model & processor\n", + "print(\"πŸ”„ Loading ViT-GPT2 image captioning model...\")\n", + "processor = ViTImageProcessor.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n", + "tokenizer = AutoTokenizer.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\")\n", + "model = VisionEncoderDecoderModel.from_pretrained(\"nlpconnect/vit-gpt2-image-captioning\").to(device)\n", + "print(\"βœ… Model loaded.\")\n", + "\n", + "# Generate captions for frames\n", + "def generate_lightweight_captions(frame_dir=\"/content/frames\", max_frames=11):\n", + " captions = []\n", + " files = sorted(os.listdir(frame_dir))[:max_frames]\n", + "\n", + " for idx, fname in enumerate(files):\n", + " try:\n", + " path = os.path.join(frame_dir, fname)\n", + " image = Image.open(path).convert(\"RGB\")\n", + " pixel_values = processor(images=image, return_tensors=\"pt\").pixel_values.to(device)\n", + " output_ids = model.generate(pixel_values, max_length=16, num_beams=4)\n", + " caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)\n", + " captions.append(caption)\n", + " print(f\"[{idx+1}/{len(files)}] {fname} β†’ {caption}\")\n", + " except Exception as e:\n", + " print(f\"⚠️ Error processing {fname}: {e}\")\n", + " captions.append(\"Error generating caption\")\n", + "\n", + " return captions\n", + "\n", + "# Run captioning\n", + "captions = generate_lightweight_captions(\"/content/frames\", max_frames=11)" + ] + }, + { + "cell_type": "markdown", + "id": "4a953d87-0582-42a5-9474-ddfa8401365d", + "metadata": {}, + "source": [ + "This sets up a vector database (ChromaDB) and populates it with text embeddings generated from image captions. 
It uses the LangChain frameworkβ€”specifically the langchain_community moduleβ€”for handling embeddings, vector storage, and document retrieval." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "63c6919d-9247-40ca-80f1-2dccc49493af", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "from langchain_community.vectorstores import Chroma\n", + "from langchain_core.documents import Document\n", + "\n", + "# Convert captions into Document objects\n", + "docs = [Document(page_content=cap) for cap in captions]\n", + "\n", + "# Load the embedding model\n", + "print(\"πŸ”„ Loading embedding model...\")\n", + "embedding_model = HuggingFaceEmbeddings(model_name=\"sentence-transformers/all-MiniLM-L6-v2\")\n", + "\n", + "# Store in ChromaDB\n", + "vectorstore = Chroma.from_documents(\n", + " documents=docs,\n", + " embedding=embedding_model,\n", + " persist_directory=\"/content/chroma_db\"\n", + ")\n", + "\n", + "# Create retriever\n", + "retriever = vectorstore.as_retriever(search_kwargs={\"k\": 3})\n", + "\n", + "print(\"βœ… Vector store created and retriever ready!\")\n", + "print(f\"πŸ“Š Indexed {len(captions)} frame captions\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "09953226-de80-4c44-8c86-85968875c167", + "metadata": {}, + "source": [ + "This code implements the final step of a video-based question answering (Q&A) pipeline using a lightweight language model from Hugging Face Transformers." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d1da8fd-5085-4d52-a456-ae3af6680162", + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import pipeline\n", + "\n", + "# Load a small, efficient model for Q&A (fits in Colab free tier)\n", + "print(\"πŸ”„ Loading lightweight Q&A model...\")\n", + "qa_pipeline = pipeline(\n", + " \"text2text-generation\",\n", + " model=\"google/flan-t5-small\", # Small model (~300MB)\n", + " device=0 if torch.cuda.is_available() else -1,\n", + " max_length=512\n", + ")\n", + "print(\"βœ… Q&A model loaded!\")\n", + "\n", + "def ask_video_question(query):\n", + " \"\"\"Ask a question about the video and get an AI-generated answer\"\"\"\n", + "\n", + " # Retrieve relevant frame captions\n", + " docs = retriever.invoke(query)\n", + " context = \"\\n\".join([f\"Frame {i+1}: {doc.page_content}\" for i, doc in enumerate(docs)])\n", + "\n", + " # Create prompt\n", + " prompt = f\"\"\"Answer the question based on these video frame descriptions:\n", + "\n", + "{context}\n", + "\n", + "Question: {query}\n", + "Answer:\"\"\"\n", + "\n", + " # Get response from model\n", + " print(f\"\\nπŸ” Question: {query}\")\n", + " print(\"πŸ€” Thinking...\")\n", + "\n", + " try:\n", + " response = qa_pipeline(prompt, max_length=256, do_sample=False)[0]['generated_text']\n", + " print(f\"βœ… Answer: {response}\\n\")\n", + " except Exception as e:\n", + " # Fallback: just show the relevant captions\n", + " print(f\"⚠️ Error generating answer: {e}\")\n", + " print(f\"πŸ“‹ Relevant frames found:\")\n", + " for i, doc in enumerate(docs, 1):\n", + " print(f\" Frame {i}: {doc.page_content}\")\n", + " response = \" \".join([doc.page_content for doc in docs])\n", + "\n", + " return response\n", + "\n", + "# Ask questions about your video\n", + "print(\"=\"*60)\n", + "print(\"VIDEO Q&A SESSION\")\n", + "print(\"=\"*60)\n", + "\n", + "ask_video_question(\"What is happening in the video?\")\n", + "ask_video_question(\"What objects or people appear in the 
video?\")\n", + "ask_video_question(\"Describe the main activity shown in the video?\")\n", + "\n", + "# Optional: Interactive Q&A\n", + "print(\"\\n\" + \"=\"*60)\n", + "print(\"πŸ’‘ TIP: You can ask custom questions by calling:\")\n", + "print('ask_video_question(\"your question here\")')\n", + "print(\"=\"*60)" + ] + }, + { + "cell_type": "markdown", + "id": "73d9e8a2-26d2-4dad-81f6-3210d92c4069", + "metadata": {}, + "source": [ + "This template provides all the tools that is needed to develop advanced video understanding applicationsβ€”whether you're building video summarization tools, educational video analysis, or AI-driven surveillance systems. \n", + "\n", + "If you plan to scale up to longer videos or batch-processing video datasets, you can distribute workloads using [Saturn Cloud](https://saturncloud.io).\n", + "\n", + "You can also enhance the performance and specificity of the system by fine-tuning captioning or embedding models on domain-specific video datasets." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.19" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/multimodal_ai/nvidia_clip-search/README.md b/examples/multimodal_ai/nvidia_clip-search/README.md new file mode 100644 index 00000000..ed510cab --- /dev/null +++ b/examples/multimodal_ai/nvidia_clip-search/README.md @@ -0,0 +1,54 @@ +# πŸ” Clip Search (CLIP Retrieval Demo) + +This template demonstrates how to build a **text-to-image retrieval system** using **CLIP embeddings**. Users can input natural language queries (like _β€œbrain coral”_ or _β€œa man riding a horse”_), and the system will return the most visually relevant images from a dataset. + +Powered by: +- 🧠 **Sentence-Transformers (CLIP-ViT-B-32)** for unified text and image embeddings +- πŸ–ΌοΈ **Towhee Image Search Dataset** (1,000+ labeled images) +- πŸ§ͺ **Gradio UI** for interactive image querying +- ☁️ **Runs seamlessly on Saturn Cloud** + +--- + +## πŸ’‘ Use Case + +This is ideal for: +- Reverse image search +- Visual semantic search systems +- Image tagging and organization +- Educational demos for computer vision and AI retrieval +--- + +- GPU-powered Jupyter environments +- Persistent storage and shared workspaces +- Scalable cloud infrastructure for deep learning + +--- + +## πŸ“¦ Dependencies + +```bash +pip install sentence-transformers gradio scikit-learn pandas pillow tqdm +```` + +--- + +## πŸ“ Dataset Info + +We use a curated subset of **ImageNet-style data** available here: +[Reverse Image Search Dataset (Towhee)](https://github.com/towhee-io/examples) + +Images are labeled for supervised retrieval, and we use the `label` field as pseudo-captions for prompt-to-image similarity. + +--- + +## πŸ“œ License + +This project is for **educational and research** purposes only. Data sourced from Towhee and ImageNet subsets. 
+ +--- + +## πŸ“š Learn More + +* [Saturn Cloud Docs](https://saturncloud.io/docs/?utm_source=github&utm_medium=template&utm_campaign=prompt-image-search) – GPU scaling, environments, and deployment +* [More Templates](https://saturncloud.io/resources/templates/?utm_source=github&utm_medium=template&utm_campaign=prompt-image-search) – NLP, vision, LLM, and ML pipelines diff --git a/examples/multimodal_ai/nvidia_clip-search/nvidia_clip-search.ipynb b/examples/multimodal_ai/nvidia_clip-search/nvidia_clip-search.ipynb new file mode 100644 index 00000000..6dc619da --- /dev/null +++ b/examples/multimodal_ai/nvidia_clip-search/nvidia_clip-search.ipynb @@ -0,0 +1 @@ +{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"authorship_tag":"ABX9TyPiKpBthSzWpi7SEMwlZfhB"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["## CLIP Image-Text Search\n","\n","This project demonstrates how to build a **prompt-to-image retrieval system** using a pre-built image-caption dataset inspired by ImageNet and Towhee. Instead of training from scratch, we leverage the powerful [Sentence-Transformers](https://www.sbert.net/) CLIP model (`clip-ViT-B-32`) to embed both **textual prompts** and **image features**, and retrieve the most relevant images based on **cosine similarity**.\n","\n","Running this project on **[Saturn Cloud](https://saturncloud.io/)** gives you:\n","\n","* πŸš€ GPU-accelerated inference for large embedding models\n","* 🧩 Reproducibility across experiments and environments\n","* πŸ‘₯ Collaborative Jupyter notebooks with persistent storage\n","* πŸ’Ό Enterprise-ready tools with easy deployment to production\n","\n","This is an ideal template for reverse image search, image tagging, and multimodal AI applications."],"metadata":{"id":"0r95hT6yTuXs"}},{"cell_type":"markdown","source":["Installing dependency for the project."],"metadata":{"id":"TBKS3vIFUtJs"}},{"cell_type":"code","source":["!pip install -q \\\n"," sentence-transformers \\\n"," scikit-learn \\\n"," gradio \\\n"," pandas \\\n"," pillow \\\n"," tqdm\n"],"metadata":{"id":"BAkeM_paUsRK","executionInfo":{"status":"ok","timestamp":1761832298897,"user_tz":-60,"elapsed":17241,"user":{"displayName":"Durojaye Olusegun","userId":"09188621512197003284"}}},"execution_count":8,"outputs":[]},{"cell_type":"markdown","source":["This block downloads a lightweight **reverse image search dataset** from Towhee’s GitHub releases. It includes training and test image directories and a CSV mapping file that pairs image paths with labels, which we'll treat as \"pseudo-captions\"."],"metadata":{"id":"H6risuQVUpKm"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"tovy4jUqGns4"},"outputs":[],"source":["# πŸ”½ Download and unzip the Towhee reverse image search dataset\n","!wget https://github.com/towhee-io/examples/releases/download/data/reverse_image_search.zip\n","!unzip -q reverse_image_search.zip -d images_folder"]},{"cell_type":"markdown","source":["This block downloads a lightweight **reverse image search dataset** from Towhee’s GitHub releases. 
It includes training and test image directories and a CSV mapping file that pairs image paths with labels, which we'll treat as \"pseudo-captions\".\n","\n","we load the `reverse_image_search.csv` metadata file to extract:\n","\n","* βœ… Full image paths\n","* βœ… Corresponding class labels (used as text prompts)\n","\n","These are the foundation for generating our dual embeddings."],"metadata":{"id":"3dKu4_06UgCx"}},{"cell_type":"code","source":["import pandas as pd\n","from pathlib import Path\n","\n","# Paths\n","csv_path = Path(\"images_folder/reverse_image_search.csv\")\n","root_dir = Path(\"images_folder\") # Base directory for relative paths in CSV\n","\n","# Load metadata\n","df = pd.read_csv(csv_path)\n","\n","# Construct full image paths and labels\n","image_paths = [root_dir / Path(p) for p in df[\"path\"]]\n","captions = df[\"label\"].tolist() # Use labels as pseudo-captions\n","print(f\"βœ… Loaded {len(image_paths)} images with labels\")\n"],"metadata":{"id":"SJQz01fBGu5t"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We load **`clip-ViT-B-32`**, a dual encoder from `sentence-transformers` capable of embedding both text and images in the same latent space. This enables natural language to be matched directly with visual content."],"metadata":{"id":"9XRe0ziDVoAG"}},{"cell_type":"code","source":["from sentence_transformers import SentenceTransformer\n","import torch\n","\n","device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n","print(f\"πŸ–₯️ Using device: {device}\")\n","\n","# Load multi-modal embedding model\n","model = SentenceTransformer(\"clip-ViT-B-32\").to(device)"],"metadata":{"id":"Ow_p-TITGsx0"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["This section encodes:\n","\n","* **Image features** via `model.encode(image)`\n","* **Text prompts (labels)** via `model.encode(label)`\n","\n","Both embeddings are saved as numpy arrays for efficient similarity comparisons."],"metadata":{"id":"Y4JNFaD3V2vM"}},{"cell_type":"code","source":["from PIL import Image\n","import numpy as np\n","from tqdm import tqdm\n","\n","text_embeddings = []\n","image_embeddings = []\n","valid_paths = []\n","valid_labels = []\n","\n","print(\"πŸ” Generating embeddings...\")\n","for path, label in tqdm(zip(image_paths, captions), total=len(image_paths)):\n"," try:\n"," img = Image.open(path).convert(\"RGB\")\n","\n"," # Generate image embedding\n"," with torch.no_grad():\n"," img_emb = model.encode(img, convert_to_tensor=True).cpu().numpy()\n","\n"," # Generate text embedding\n"," with torch.no_grad():\n"," txt_emb = model.encode(label, convert_to_tensor=True).cpu().numpy()\n","\n"," image_embeddings.append(img_emb)\n"," text_embeddings.append(txt_emb)\n"," valid_paths.append(path)\n"," valid_labels.append(label)\n","\n"," except Exception as e:\n"," print(f\"❌ Skipped {path.name}: {e}\")\n"," continue\n","\n","print(f\"βœ… Embedded {len(valid_paths)} images\")"],"metadata":{"id":"B4tAjuP_Gsal"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["All preprocessed data is saved for later use:\n","\n","* πŸ” `image_embeddings.npy`\n","* πŸ” `text_embeddings.npy`\n","* πŸ” `metadata.json` (image paths + labels)"],"metadata":{"id":"7IIZFKXAWEjL"}},{"cell_type":"code","source":["import json\n","import numpy as np\n","\n","save_dir = Path(\"data/processed\")\n","save_dir.mkdir(parents=True, exist_ok=True)\n","\n","# Save embeddings\n","np.save(save_dir / \"image_embeddings.npy\", 
np.vstack(image_embeddings))\n","np.save(save_dir / \"text_embeddings.npy\", np.vstack(text_embeddings))\n","\n","# Save paths and labels as metadata\n","metadata = {\n"," \"paths\": [str(p) for p in valid_paths],\n"," \"labels\": valid_labels\n","}\n","with open(save_dir / \"metadata.json\", \"w\") as f:\n"," json.dump(metadata, f)\n","\n","print(\"πŸ’Ύ Embeddings and metadata saved.\")"],"metadata":{"id":"BOKIS4eAGrZk"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Using **cosine similarity**, this function:\n","\n","* Embeds the input query\n","* Compares it to all text embeddings\n","* Returns the top-k matches based on similarity"],"metadata":{"id":"vLtwrMjkWeCe"}},{"cell_type":"code","source":["from sklearn.metrics.pairwise import cosine_similarity\n","\n","def search_images(prompt, k=3):\n"," # Load data\n"," image_embeddings = np.load(\"data/processed/image_embeddings.npy\")\n"," text_embeddings = np.load(\"data/processed/text_embeddings.npy\")\n","\n"," with open(\"data/processed/metadata.json\", \"r\") as f:\n"," metadata = json.load(f)\n","\n"," paths = metadata[\"paths\"]\n"," labels = metadata[\"labels\"]\n","\n"," # Encode query\n"," query_emb = model.encode(prompt, convert_to_tensor=True).cpu().numpy().reshape(1, -1)\n","\n"," # Compare to text embeddings\n"," similarities = cosine_similarity(query_emb, text_embeddings)[0]\n"," top_indices = similarities.argsort()[::-1][:k]\n","\n"," results = [{\"path\": paths[i], \"label\": labels[i], \"score\": float(similarities[i])} for i in top_indices]\n"," return results\n"],"metadata":{"id":"nN1DqfXbGrH6"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["This section builds an interactive **Gradio app** so users can enter a natural language prompt and view the top matching images.\n","\n","Perfect for demos, prototyping, or deploying as a user-facing search tool."],"metadata":{"id":"XlopEFsoWral"}},{"cell_type":"code","source":["import gradio as gr\n","from PIL import Image\n","\n","def search_and_display(query, k):\n"," results = search_images(query, k)\n"," return [(r[\"path\"], f\"{r['label']} ({r['score']:.2f})\") for r in results]\n","\n","gr.Interface(\n"," fn=search_and_display,\n"," inputs=[\n"," gr.Textbox(label=\"Enter a prompt (e.g. 
'brain coral', 'a hound dog')\"),\n"," gr.Slider(1, 5, value=3, step=1, label=\"Number of results\")\n"," ],\n"," outputs=gr.Gallery(label=\"Matching Images\"),\n"," title=\"πŸ” Prompt β†’ Image Search\",\n"," description=\"Type a label or prompt to search for similar images in the ImageNet-style dataset\"\n",").launch(share=True)\n"],"metadata":{"id":"YT1-sRVFGq_L"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## βœ… Conclusion (Tailored for Saturn Cloud)\n","\n","In this project, you built a **prompt-to-image retrieval system** using a CLIP-based model from `sentence-transformers`, applied to a lightweight ImageNet-style dataset.\n","\n","By running this on **[Saturn Cloud](https://saturncloud.io/)**, you benefit from:\n","\n","* πŸš€ Fast, GPU-accelerated model inference for embeddings\n","* πŸ’Ύ Persistent storage across Jupyter sessions for data, models, and results\n","* πŸ” Easy reproducibility of workflows in collaborative teams\n","* πŸ“¦ Seamless deployment to production-ready environments\n","\n","Whether you're prototyping AI search tools or deploying image classification pipelines, Saturn Cloud’s infrastructure makes it easy to scale from notebooks to production.\n","\n","---\n","\n","### 🧭 Want to Learn More?\n","\n","* πŸ“š [Saturn Cloud Documentation](https://saturncloud.io/docs/)\n","* πŸ”¨ [Saturn Cloud Templates](https://saturncloud.io/resources/templates/)\n","* 🌐 [Saturn Cloud Home](https://saturncloud.io/)"],"metadata":{"id":"EH_vWN6eXEjI"}}]} \ No newline at end of file diff --git a/examples/multimodal_ai/nvidia_text2video/README.md b/examples/multimodal_ai/nvidia_text2video/README.md new file mode 100644 index 00000000..3d259f1a --- /dev/null +++ b/examples/multimodal_ai/nvidia_text2video/README.md @@ -0,0 +1,74 @@ +# 🎬 Gradio Textβ†’Video Demo + +Generate 5-second AI-generated video clips from text prompts using state-of-the-art diffusion models β€” all inside a GPU-powered Jupyter Notebook. + +![Video Demo](https://cdn-icons-png.flaticon.com/512/3447/3447651.png) + +--- + +## 🧠 Model Used + +- [`cerspense/zeroscope_v2_576w`](https://huggingface.co/cerspense/zeroscope_v2_576w) + A powerful text-to-video diffusion model capable of producing smooth and realistic short clips at **576Γ—320 resolution**. + +--- + +## βš™οΈ Tech Stack + +- **Diffusers** – For loading and running the diffusion pipeline +- **Gradio** – To build an interactive UI for prompt-based video generation +- **PyTorch** – For GPU acceleration +- **Saturn Cloud** – To run everything in a high-performance GPU notebook environment + +--- + +## πŸš€ Features + +- Text prompt β†’ 5-second video (~16 frames @ 3 fps) +- Clean and responsive Gradio interface +- Adjustable settings for inference steps, guidance scale, and frame count +- Inline preview inside Jupyter Notebook (with fallback display using `IPython.display.Video`) +- Works seamlessly on GPU inside [Saturn Cloud](https://saturncloud.io/) + +--- + +## πŸ§ͺ Example Prompts + +- `"Waves crashing on a beach at sunset"` +- `"A coffee cup steaming on a table, camera zoom in"` +- `"Fireworks exploding in the night sky"` +- `"A cat playing with a ball of yarn"` + +--- + +## πŸ’» How to Run + +1. Launch the notebook in [Saturn Cloud](https://saturncloud.io/). +2. Make sure your environment includes a **GPU** and **`diffusers`, `gradio`, `transformers`, `torch`, and `imageio`**. +3. Run all cells in sequence. +4. Type your prompt, hit **Generate**, and preview your video instantly! 
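+ 
+ The environment setup in step 2 boils down to a single install command. A minimal sketch mirroring the notebook's install cell (adjust or pin versions to match your image):
+ 
+ ```bash
+ pip install "gradio>=4.0.0" "diffusers>=0.25.0" "transformers>=4.36.0" "accelerate>=0.25.0" \
+   "torch>=2.0.0" "torchvision>=0.15.0" "opencv-python>=4.8.0" \
+   "imageio>=2.33.0" "imageio-ffmpeg>=0.4.9" "pillow>=10.0.0"
+ ```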
+ +--- + +## πŸ“Ž Useful Links + +- [Saturn Cloud Templates](https://saturncloud.io/resources/templates/) +- [Diffusers Library (Hugging Face)](https://huggingface.co/docs/diffusers/index) +- [Zeroscope Model Card](https://huggingface.co/cerspense/zeroscope_v2_576w) +- [Gradio Documentation](https://www.gradio.app/) + +--- + +## πŸ“’ Note + +This demo generates short, low-frame-rate videos. For longer or higher-quality clips, you may explore batch rendering or multi-GPU scaling in Saturn Cloud. + +--- + +## πŸ›°οΈ Powered By + +[![Saturn Cloud](https://saturncloud.io/images/logo/logo-light-mode.svg)](https://saturncloud.io/) + +A fully managed data science platform with GPU notebooks, scalable compute, and collaborative workflows. + +--- diff --git a/examples/multimodal_ai/nvidia_text2video/nvidia_text2video.ipynb b/examples/multimodal_ai/nvidia_text2video/nvidia_text2video.ipynb new file mode 100644 index 00000000..61b65677 --- /dev/null +++ b/examples/multimodal_ai/nvidia_text2video/nvidia_text2video.ipynb @@ -0,0 +1,401 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "933sDEm1ZiGk" + }, + "source": [ + "\n", + "# Gradio Textβ†’Video Demo\n", + "\n", + "\"prompt\"\n", + "\n", + "\n", + "\"prompt\"\n", + "\n", + "\n", + "This template demonstrates how to generate short video clips from text prompts using a powerful diffusion-based video generation model.\n", + "\n", + "The generated videos are rendered and immediately previewed inside the notebook, making it ideal for experimentation or creative prototyping.\n", + "\n", + "On [Saturn Cloud](https://saturncloud.io), you can scale from a single NVIDIA GPU to multi-GPU clusters, enabling distributed inference for larger models or higher throughput workloads β€” all within a managed, GPU-ready environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PUKz5Z1ubbTU" + }, + "source": [ + "Let's install the dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9j4AnRLDcWhB" + }, + "outputs": [], + "source": [ + "!pip install -q gradio>=4.0.0 diffusers>=0.25.0 transformers>=4.36.0 accelerate>=0.25.0 torch>=2.0.0 torchvision>=0.15.0 opencv-python>=4.8.0 imageio>=2.33.0 imageio-ffmpeg>=0.4.9 pillow>=10.0.0" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EXmjh4sJdzIQ" + }, + "source": [ + "This section initializes the core components of the text-to-video generation pipeline. We use the cerspense/zeroscope_v2_576w model β€” a powerful diffusion model fine-tuned for higher-resolution video generation at 576Γ—320 pixels. It's designed to turn natural language prompts into short, animated video clips." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "NEpkvHlWcaNq"
+ },
+ "outputs": [],
+ "source": [
+ "import gradio as gr\n",
+ "import torch\n",
+ "from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler\n",
+ "from diffusers.utils import export_to_video\n",
+ "import numpy as np\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')\n",
+ "\n",
+ "# Model configuration\n",
+ "MODEL_ID = \"cerspense/zeroscope_v2_576w\"  # Higher resolution: 576x320\n",
+ "# Alternative: \"damo-vilab/text-to-video-ms-1.7b\" for 256x256\n",
+ "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "\n",
+ "print(f\"πŸ–₯️ Using device: {DEVICE}\")\n",
+ "print(f\"πŸ“¦ Model: {MODEL_ID}\")\n",
+ "print(f\"πŸ“Ί Output resolution: 576x320 pixels\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "N5o0pWpvfBFT"
+ },
+ "source": [
+ "This section handles loading the text-to-video diffusion model using Hugging Face's DiffusionPipeline. This ensures the model is initialized only once, using a global pipeline object to avoid repeated downloads or memory spikes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ulgckOMwcaKv"
+ },
+ "outputs": [],
+ "source": [
+ "# Global pipeline variable\n",
+ "pipe = None\n",
+ "\n",
+ "def initialize_model():\n",
+ "    \"\"\"Initialize the text-to-video diffusion model\"\"\"\n",
+ "    global pipe\n",
+ "\n",
+ "    if pipe is None:\n",
+ "        print(\"πŸ“₯ Loading text-to-video model...\")\n",
+ "        pipe = DiffusionPipeline.from_pretrained(\n",
+ "            MODEL_ID,\n",
+ "            torch_dtype=torch.float16 if DEVICE == \"cuda\" else torch.float32,\n",
+ "            # Note: Not using variant parameter as this model doesn't have fp16 files\n",
+ "        )\n",
+ "\n",
+ "        # Optimize with DPM Solver for faster generation\n",
+ "        pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)\n",
+ "        pipe = pipe.to(DEVICE)\n",
+ "\n",
+ "        # Enable memory optimizations\n",
+ "        if DEVICE == \"cuda\":\n",
+ "            pipe.enable_model_cpu_offload()\n",
+ "            pipe.enable_vae_slicing()\n",
+ "\n",
+ "        print(\"βœ… Model loaded successfully!\")\n",
+ "\n",
+ "    return pipe\n",
+ "\n",
+ "# Initialize the model\n",
+ "model = initialize_model()\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "41NhffuIf34F"
+ },
+ "source": [
+ "This function transforms a text description into a short video clip using the loaded diffusion pipeline. It handles every part of the process β€” from inference to frame formatting and final export."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Om4fOPY9VlFg"
+ },
+ "outputs": [],
+ "source": [
+ "def generate_video(prompt, num_inference_steps=25, guidance_scale=9.0, num_frames=16):\n",
+ "    \"\"\"\n",
+ "    Generate a video from a text prompt\n",
+ "\n",
+ "    Args:\n",
+ "        prompt: Text description of the video\n",
+ "        num_inference_steps: Number of denoising steps (higher = better quality, slower)\n",
+ "        guidance_scale: How closely to follow the prompt (7-12 recommended)\n",
+ "        num_frames: Number of frames to generate (16 frames β‰ˆ 5 seconds at 3fps)\n",
+ "\n",
+ "    Returns:\n",
+ "        Path to generated video file\n",
+ "    \"\"\"\n",
+ "    try:\n",
+ "        # Generate video frames\n",
+ "        print(f\"🎬 Generating video for: '{prompt}'\")\n",
+ "        video_output = pipe(\n",
+ "            prompt,\n",
+ "            num_inference_steps=num_inference_steps,\n",
+ "            guidance_scale=guidance_scale,\n",
+ "            num_frames=num_frames,\n",
+ "            height=320,  # Zeroscope native resolution\n",
+ "            width=576,  # Zeroscope native resolution\n",
+ "        )\n",
+ "\n",
+ "        # Extract frames properly - the output is [batch, frames, channels, height, width]\n",
+ "        video_frames = video_output.frames[0]  # Get first batch\n",
+ "\n",
+ "        # Convert to proper format for export_to_video\n",
+ "        # Frames should be a list of numpy arrays with shape (height, width, 3)\n",
+ "        formatted_frames = []\n",
+ "        for frame in video_frames:\n",
+ "            # Convert from tensor/array to numpy if needed\n",
+ "            if torch.is_tensor(frame):\n",
+ "                frame = frame.cpu().numpy()\n",
+ "\n",
+ "            # Ensure frame is in correct format (H, W, C)\n",
+ "            if frame.ndim == 3:\n",
+ "                # If channels first (C, H, W), transpose to (H, W, C)\n",
+ "                if frame.shape[0] == 3 or frame.shape[0] == 1:\n",
+ "                    frame = np.transpose(frame, (1, 2, 0))\n",
+ "\n",
+ "            # Ensure values are in 0-255 range for uint8\n",
+ "            if frame.dtype == np.float32 or frame.dtype == np.float64:\n",
+ "                frame = (frame * 255).astype(np.uint8)\n",
+ "\n",
+ "            formatted_frames.append(frame)\n",
+ "\n",
+ "        # Export to video file\n",
+ "        output_path = \"/tmp/generated_video.mp4\"\n",
+ "        video_path = export_to_video(formatted_frames, output_path, fps=3)\n",
+ "\n",
+ "        print(f\"βœ… Video generated: {video_path}\")\n",
+ "        return video_path\n",
+ "\n",
+ "    except Exception as e:\n",
+ "        print(f\"❌ Error generating video: {str(e)}\")\n",
+ "        import traceback\n",
+ "        traceback.print_exc()\n",
+ "        raise gr.Error(f\"Failed to generate video: {str(e)}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "4233l3RZgPrR"
+ },
+ "source": [
+ "This section builds a user-friendly interface to generate video clips directly from natural language prompts. The app is powered by Gradio, which lets you interact with the model in real time β€” all within the notebook or as a shareable web app."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "dAG0L91UVlCn"
+ },
+ "outputs": [],
+ "source": [
+ "# Create Gradio interface\n",
+ "with gr.Blocks(title=\"Text-to-Video Generator\", theme=gr.themes.Soft()) as demo:\n",
+ "    gr.Markdown(\n",
+ "        \"\"\"\n",
+ "        # 🎬 Textβ†’Video Generator\n",
+ "\n",
+ "        Generate 5-second video clips from text prompts using diffusion models.\n",
+ "        Simply describe what you want to see, and AI will create it!\n",
+ "        \"\"\"\n",
+ "    )\n",
+ "\n",
+ "    with gr.Row():\n",
+ "        with gr.Column():\n",
+ "            # Input controls\n",
+ "            prompt_input = gr.Textbox(\n",
+ "                label=\"Video Prompt\",\n",
+ "                placeholder=\"A cat playing with a ball of yarn...\",\n",
+ "                lines=3\n",
+ "            )\n",
+ "\n",
+ "            with gr.Accordion(\"Advanced Settings\", open=False):\n",
+ "                steps_slider = gr.Slider(\n",
+ "                    minimum=10,\n",
+ "                    maximum=50,\n",
+ "                    value=25,\n",
+ "                    step=5,\n",
+ "                    label=\"Inference Steps (higher = better quality)\",\n",
+ "                )\n",
+ "\n",
+ "                guidance_slider = gr.Slider(\n",
+ "                    minimum=1.0,\n",
+ "                    maximum=15.0,\n",
+ "                    value=9.0,\n",
+ "                    step=0.5,\n",
+ "                    label=\"Guidance Scale (how closely to follow prompt)\",\n",
+ "                )\n",
+ "\n",
+ "                frames_slider = gr.Slider(\n",
+ "                    minimum=8,\n",
+ "                    maximum=24,\n",
+ "                    value=16,\n",
+ "                    step=8,\n",
+ "                    label=\"Number of Frames\",\n",
+ "                )\n",
+ "\n",
+ "            generate_btn = gr.Button(\"πŸŽ₯ Generate Video\", variant=\"primary\", size=\"lg\")\n",
+ "\n",
+ "            # Example prompts\n",
+ "            gr.Examples(\n",
+ "                examples=[\n",
+ "                    [\"A cat playing with a ball of yarn on a wooden floor\"],\n",
+ "                    [\"Waves crashing on a beach at sunset\"],\n",
+ "                    [\"A coffee cup steaming on a table, camera zoom in\"],\n",
+ "                    [\"Fireworks exploding in the night sky\"],\n",
+ "                    [\"A flower blooming in timelapse\"],\n",
+ "                ],\n",
+ "                inputs=prompt_input,\n",
+ "            )\n",
+ "\n",
+ "        with gr.Column():\n",
+ "            # Output video\n",
+ "            video_output = gr.Video(\n",
+ "                label=\"Generated Video\",\n",
+ "                height=400,\n",
+ "            )\n",
+ "\n",
+ "            gr.Markdown(\n",
+ "                \"\"\"\n",
+ "                ### Tips for better results:\n",
+ "                - Be specific and descriptive\n",
+ "                - Include camera movements (zoom, pan, etc.)\n",
+ "                - Mention lighting and atmosphere\n",
+ "                - Keep prompts clear and concise\n",
+ "                \"\"\"\n",
+ "            )\n",
+ "\n",
+ "    # Connect generate button to function\n",
+ "    generate_btn.click(\n",
+ "        fn=generate_video,\n",
+ "        inputs=[prompt_input, steps_slider, guidance_slider, frames_slider],\n",
+ "        outputs=video_output,\n",
+ "    )\n",
+ "\n",
+ "    gr.Markdown(\n",
+ "        \"\"\"\n",
+ "        ---\n",
+ "        **Note:** Video generation may take 1-3 minutes depending on your hardware.\n",
+ "        \"\"\"\n",
+ "    )\n",
+ "\n",
+ "# Launch the interface\n",
+ "# Display inline within the Jupyter notebook\n",
+ "demo.launch(\n",
+ "    inline=True,  # Display interface directly in the notebook (inline)\n",
+ "    share=False,  # Set to True to create a public gradio.live link\n",
+ "    debug=True,\n",
+ "    height=800,  # Height of the inline iframe for better visibility\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "i9lIlKzjgV-M"
+ },
+ "source": [
+ "**Note**: After generating the video within the Gradio section above, stop the running cell and run the code block below to see the generated video. 
This should be repeated after every generated video.\n", + "\n", + "Once the video has been successfully generated and saved (typically to a temporary path like /tmp/generated_video.mp4), you can preview it directly within the notebook using IPython’s built-in video display tools.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "woea4ZboVk_S" + }, + "outputs": [], + "source": [ + "# After generating the video, run this in a new cell:\n", + "from IPython.display import Video\n", + "\n", + "# Display the generated video\n", + "Video(\"/tmp/generated_video.mp4\", embed=True, width=512, height=512)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vFIqpMJiiT8U" + }, + "source": [ + "## βœ… Conclusion\n", + "\n", + "In this template, we built a complete **text-to-video generation workflow** using the **Zeroscope Diffusion model** and a custom **Gradio interface** β€” all running on a GPU-powered environment.\n", + "\n", + "Running this pipeline on **[Saturn Cloud](https://saturncloud.io/)** makes the experience both **powerful and efficient**. With Saturn Cloud, you can:\n", + "\n", + "* Launch a **GPU-accelerated Jupyter Notebook** in seconds β€” ideal for models like Zeroscope.\n", + "* Seamlessly integrate **interactive Gradio apps** into your workflow for real-time experimentation.\n", + "* Scale up to **multi-GPU compute clusters** for faster video generation and larger batch processing.\n", + "* Save and share your notebooks or deploy the app as a **production-grade API**.\n", + "\n", + "By using this template, you've unlocked a modern, reproducible, and GPU-optimized approach to **text-to-video generation** β€” perfect for prototyping creative AI applications or deploying custom media generation tools." + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "authorship_tag": "ABX9TyNpcycZiSLkmERTuTNTQ2Nv", + "gpuType": "T4", + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.19" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}