+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Airbyte PGVector RAG Demo\n",
+        "\n",
+        "This tutorial demonstrates how to perform Retrieval-Augmented Generation (RAG) with data stored in Airbyte's PGVector destination. Use this destination when you intend to use pgvector for LLM-specific vector operations such as RAG.\n",
+        "\n",
+        "As a practical example, we'll build an assistant: an AI chatbot capable of answering questions about Simpsons episodes, using data loaded through Airbyte.\n",
+        "\n",
+        "#### Prerequisites:\n",
+        "* Vector data stored in Postgres with vector columns via the PGVector destination. In our case we are using a Simpsons dataset from Kaggle.\n",
+        "* A PostgreSQL database with the pgvector extension enabled\n",
+        "* An OpenAI API key\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### a. Install dependencies and import secrets\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "7R0-uD7R3Uki"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install sqlalchemy openai rich psycopg2-binary python-dotenv langchain-openai"
+      ],
+      "metadata": {
+        "collapsed": true,
+        "id": "HbR-po_Z3VFV"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import openai\n",
+        "import rich\n",
+        "from langchain_openai import OpenAIEmbeddings\n",
+        "from sqlalchemy import create_engine\n",
+        "from sqlalchemy.engine import URL\n",
+        "from google.colab import userdata\n",
+        "\n",
+        "# Read credentials from Colab's secrets manager\n",
+        "OPENAI_API_KEY = userdata.get(\"openai_api_key\")\n",
+        "HOST = userdata.get(\"db_host\")\n",
+        "USERNAME = userdata.get(\"db_username\")\n",
+        "PASSWORD = userdata.get(\"db_password\")\n",
+        "DATABASE = userdata.get(\"db_name\")"
+      ],
+      "metadata": {
+        "id": "LP0QfMFQ6Flz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### b. Initialize the OpenAI client and DB engine\n"
+      ],
+      "metadata": {
+        "id": "mmHB_MId7zwo"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "openai.api_key = OPENAI_API_KEY\n",
+        "\n",
+        "# Build the PostgreSQL connection URL and create the SQLAlchemy engine\n",
+        "url = URL.create(\n",
+        "    \"postgresql\",\n",
+        "    host=HOST,\n",
+        "    username=USERNAME,\n",
+        "    password=PASSWORD,\n",
+        "    database=DATABASE,\n",
+        ")\n",
+        "\n",
+        "engine = create_engine(url)"
+      ],
+      "metadata": {
+        "id": "7Y3iCe6e7-Ra"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
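+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before querying, it's worth confirming that the connection works and that the pgvector extension is actually installed. The cell below is a minimal check: it assumes your database user can read the `pg_extension` catalog, and it prints `None` if the extension is missing.\n"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from sqlalchemy import text\n",
+        "\n",
+        "# Sanity check: returns the installed pgvector version, or None if absent\n",
+        "with engine.connect() as connection:\n",
+        "    pgvector_version = connection.execute(\n",
+        "        text(\"SELECT extversion FROM pg_extension WHERE extname = 'vector'\")\n",
+        "    ).scalar()\n",
+        "print(f\"pgvector version: {pgvector_version}\")"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },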
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### c. Explore data stored in PostgreSQL\n",
+        "\n",
+        "We need a few functions to embed user questions and search the database:\n",
+        "\n",
+        "- A helper to embed the user question so we can search for it in the database.\n",
+        "- A function to get the context from the database, using the user question as input.\n",
+        "- A function to get the response from the chat assistant, using the context retrieved in the previous step."
+      ],
+      "metadata": {
+        "id": "ZQ0PzOfb9rtK"
+      }
+    },
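+    {
+      "cell_type": "markdown",
+      "source": [
+        "First, let's preview a few rows to get a feel for what the PGVector destination wrote. This assumes the destination created a table named `episode_spoken_words` with `document_content`, `metadata`, and `embedding` columns, matching the query we use below.\n"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from sqlalchemy import text\n",
+        "\n",
+        "# Preview a few rows: the document text plus one field from the JSON metadata\n",
+        "with engine.connect() as connection:\n",
+        "    rows = connection.execute(\n",
+        "        text(\"SELECT document_content, metadata->>'title' AS episode_title FROM episode_spoken_words LIMIT 3\")\n",
+        "    )\n",
+        "    for row in rows:\n",
+        "        rich.print(row.episode_title, row.document_content[:80])"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },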
+    {
+      "cell_type": "code",
+      "source": [
+        "from sqlalchemy import text\n",
+        "from openai import OpenAI\n",
+        "\n",
+        "client = OpenAI(\n",
+        "    api_key=OPENAI_API_KEY,\n",
+        ")\n",
+        "\n",
+        "def get_embedding_from_open_ai(question):\n",
+        "    print(f\"Embedding user's query: {question}\")\n",
+        "    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n",
+        "    return embeddings.embed_query(question)\n",
+        "\n",
+        "# ->> extracts JSON fields as text; <-> is pgvector's L2 distance operator\n",
+        "QUERY_TEMPLATE = \"\"\"\n",
+        "SELECT document_content,\n",
+        "       metadata->>'title' AS episode_title,\n",
+        "       metadata->>'script_line_number' AS script_line_number,\n",
+        "       metadata->>'name' AS character_name,\n",
+        "       metadata->>'spoken_words' AS spoken_words\n",
+        "FROM episode_spoken_words\n",
+        "ORDER BY embedding <-> :question_vector\n",
+        "LIMIT 5\n",
+        "\"\"\"\n",
+        "\n",
+        "def get_context(question) -> str:\n",
+        "    # Get the embedding from OpenAI\n",
+        "    question_vector = get_embedding_from_open_ai(question)\n",
+        "\n",
+        "    # Format the vector as the string representation pgvector expects\n",
+        "    question_vector_str = '[' + ','.join(map(str, question_vector)) + ']'\n",
+        "\n",
+        "    # Use the text() function for raw queries with SQLAlchemy\n",
+        "    query = text(QUERY_TEMPLATE)\n",
+        "\n",
+        "    # Execute the query, passing the vector as a bind parameter;\n",
+        "    # fetch all rows before the connection closes\n",
+        "    with engine.connect() as connection:\n",
+        "        rows = connection.execute(query, {'question_vector': question_vector_str}).fetchall()\n",
+        "\n",
+        "    # Format and return the result\n",
+        "    return (\"\\n\\n\" + \"-\" * 8 + \"\\n\\n\").join(\n",
+        "        [\n",
+        "            f\"Episode: {row.episode_title} | Line number: {row.script_line_number} | \"\n",
+        "            f\"Spoken words: {row.spoken_words} | Character: {row.character_name}\"\n",
+        "            for row in rows\n",
+        "        ]\n",
+        "    )\n",
+        "\n",
+        "\n",
+        "def get_response(question):\n",
+        "    # Retrieve the context first, then ground the model on it in the system prompt\n",
+        "    context = get_context(question)\n",
+        "    system_prompt = (\n",
+        "        \"You are a Simpsons expert talking about Simpsons episodes. \"\n",
+        "        \"Use only the following context to answer; do not rely on outside sources.\\n\\n\"\n",
+        "        f\"Context:\\n{context}\"\n",
+        "    )\n",
+        "    response = client.chat.completions.create(\n",
+        "        model=\"gpt-3.5-turbo\",\n",
+        "        messages=[\n",
+        "            {\"role\": \"system\", \"content\": system_prompt},\n",
+        "            {\"role\": \"user\", \"content\": question},\n",
+        "        ],\n",
+        "    )\n",
+        "    return response.choices[0].message.content"
+      ],
+      "metadata": {
+        "id": "_PJ6eb5-A419"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
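+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before wiring the retrieval into the chat model, you can inspect what `get_context` returns on its own. The question below is just a hypothetical example; any string works.\n"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Inspect the raw retrieved context (hypothetical sample question)\n",
+        "rich.print(get_context(\"Who works at the nuclear power plant?\"))"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },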
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### d. Ask questions\n",
+        "\n",
+        "Finally, let's put it all together and get a response from our assistant using the Simpsons database."
+      ],
+      "metadata": {
+        "id": "spQPCVe9AZKh"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "question = \"Talking about food\"\n",
+        "response = get_response(question)\n",
+        "rich.print(response)"
+      ],
+      "metadata": {
+        "id": "pwegM_02AgOU"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}