+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Airbyte PGVector RAG Demo\n",
+        "\n",
+        "This tutorial demonstrates how to perform Retrieval-Augmented Generation (RAG) with data stored in Airbyte's PGVector destination. Use this destination when you intend to use pgvector for LLM-specific vector operations such as RAG.\n",
+        "\n",
+        "As a practical example, we'll build an assistant: an AI chatbot capable of answering questions about Simpsons episodes, using data loaded through Airbyte.\n",
+        "\n",
+        "#### Prerequisites:\n",
+        "* Vector data stored in Postgres with vector columns via the PGVector destination. In our case we are using a Simpsons dataset from Kaggle.\n",
+        "* A PostgreSQL database with the pgvector extension enabled\n",
+        "* An OpenAI API key\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### a. Install dependencies and import secrets\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "7R0-uD7R3Uki"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install sqlalchemy openai rich psycopg2-binary python-dotenv langchain-openai"
+      ],
+      "metadata": {
+        "collapsed": true,
+        "id": "HbR-po_Z3VFV"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import openai\n",
+        "import rich\n",
+        "from langchain_openai import OpenAIEmbeddings\n",
+        "from sqlalchemy import create_engine\n",
+        "from sqlalchemy.engine import URL\n",
+        "from google.colab import userdata\n",
+        "\n",
+        "# Read credentials from Colab's secrets manager\n",
+        "OPENAI_API_KEY = userdata.get(\"openai_api_key\")\n",
+        "HOST = userdata.get(\"db_host\")\n",
+        "USERNAME = userdata.get(\"db_username\")\n",
+        "PASSWORD = userdata.get(\"db_password\")\n",
+        "DATABASE = userdata.get(\"db_name\")"
+      ],
+      "metadata": {
+        "id": "LP0QfMFQ6Flz"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### b. Initialize the OpenAI client and DB engine\n"
+      ],
+      "metadata": {
+        "id": "mmHB_MId7zwo"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "openai.api_key = OPENAI_API_KEY\n",
+        "\n",
+        "# Build the PostgreSQL connection URL and create the SQLAlchemy engine\n",
+        "url = URL.create(\n",
+        "    \"postgresql\",\n",
+        "    host=HOST,\n",
+        "    username=USERNAME,\n",
+        "    password=PASSWORD,\n",
+        "    database=DATABASE,\n",
+        ")\n",
+        "\n",
+        "engine = create_engine(url)"
+      ],
+      "metadata": {
+        "id": "7Y3iCe6e7-Ra"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
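+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before querying, it's worth confirming that the connection works and that the pgvector extension is actually installed. The cell below is a minimal check: it assumes your database user can read the `pg_extension` catalog, and it prints `None` if the extension is missing.\n"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from sqlalchemy import text\n",
+        "\n",
+        "# Sanity check: returns the installed pgvector version, or None if absent\n",
+        "with engine.connect() as connection:\n",
+        "    pgvector_version = connection.execute(\n",
+        "        text(\"SELECT extversion FROM pg_extension WHERE extname = 'vector'\")\n",
+        "    ).scalar()\n",
+        "print(f\"pgvector version: {pgvector_version}\")"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },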
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### c. Explore data stored in PostgreSQL\n",
+        "\n",
+        "We need a few functions to embed user questions and search the database:\n",
+        "\n",
+        "- A helper to embed the user question so we can search for it in the database.\n",
+        "- A function to get the context from the database, using the user question as input.\n",
+        "- A function to get the response from the chat assistant, using the context retrieved in the previous step."
+      ],
+      "metadata": {
+        "id": "ZQ0PzOfb9rtK"
+      }
+    },
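+    {
+      "cell_type": "markdown",
+      "source": [
+        "First, let's preview a few rows to get a feel for what the PGVector destination wrote. This assumes the destination created a table named `episode_spoken_words` with `document_content`, `metadata`, and `embedding` columns, matching the query we use below.\n"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from sqlalchemy import text\n",
+        "\n",
+        "# Preview a few rows: the document text plus one field from the JSON metadata\n",
+        "with engine.connect() as connection:\n",
+        "    rows = connection.execute(\n",
+        "        text(\"SELECT document_content, metadata->>'title' AS episode_title FROM episode_spoken_words LIMIT 3\")\n",
+        "    )\n",
+        "    for row in rows:\n",
+        "        rich.print(row.episode_title, row.document_content[:80])"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },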
+    {
+      "cell_type": "code",
+      "source": [
+        "from sqlalchemy import text\n",
+        "from openai import OpenAI\n",
+        "\n",
+        "client = OpenAI(\n",
+        "    api_key=OPENAI_API_KEY,\n",
+        ")\n",
+        "\n",
+        "def get_embedding_from_open_ai(question):\n",
+        "    print(f\"Embedding user's query: {question}\")\n",
+        "    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n",
+        "    return embeddings.embed_query(question)\n",
+        "\n",
+        "# ->> extracts JSON fields as text; <-> is pgvector's L2 distance operator\n",
+        "QUERY_TEMPLATE = \"\"\"\n",
+        "SELECT document_content,\n",
+        "       metadata->>'title' AS episode_title,\n",
+        "       metadata->>'script_line_number' AS script_line_number,\n",
+        "       metadata->>'name' AS character_name,\n",
+        "       metadata->>'spoken_words' AS spoken_words\n",
+        "FROM episode_spoken_words\n",
+        "ORDER BY embedding <-> :question_vector\n",
+        "LIMIT 5\n",
+        "\"\"\"\n",
+        "\n",
+        "def get_context(question) -> str:\n",
+        "    # Get the embedding from OpenAI\n",
+        "    question_vector = get_embedding_from_open_ai(question)\n",
+        "\n",
+        "    # Format the vector as the string representation pgvector expects\n",
+        "    question_vector_str = '[' + ','.join(map(str, question_vector)) + ']'\n",
+        "\n",
+        "    # Use the text() function for raw queries with SQLAlchemy\n",
+        "    query = text(QUERY_TEMPLATE)\n",
+        "\n",
+        "    # Execute the query, passing the vector as a bind parameter;\n",
+        "    # fetch all rows before the connection closes\n",
+        "    with engine.connect() as connection:\n",
+        "        rows = connection.execute(query, {'question_vector': question_vector_str}).fetchall()\n",
+        "\n",
+        "    # Format and return the result\n",
+        "    return (\"\\n\\n\" + \"-\" * 8 + \"\\n\\n\").join(\n",
+        "        [\n",
+        "            f\"Episode: {row.episode_title} | Line number: {row.script_line_number} | \"\n",
+        "            f\"Spoken words: {row.spoken_words} | Character: {row.character_name}\"\n",
+        "            for row in rows\n",
+        "        ]\n",
+        "    )\n",
+        "\n",
+        "\n",
+        "def get_response(question):\n",
+        "    # Retrieve the context first, then ground the model on it in the system prompt\n",
+        "    context = get_context(question)\n",
+        "    system_prompt = (\n",
+        "        \"You are a Simpsons expert talking about Simpsons episodes. \"\n",
+        "        \"Use only the following context to answer; do not rely on outside sources.\\n\\n\"\n",
+        "        f\"Context:\\n{context}\"\n",
+        "    )\n",
+        "    response = client.chat.completions.create(\n",
+        "        model=\"gpt-3.5-turbo\",\n",
+        "        messages=[\n",
+        "            {\"role\": \"system\", \"content\": system_prompt},\n",
+        "            {\"role\": \"user\", \"content\": question},\n",
+        "        ],\n",
+        "    )\n",
+        "    return response.choices[0].message.content"
+      ],
+      "metadata": {
+        "id": "_PJ6eb5-A419"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
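+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before wiring the retrieval into the chat model, you can inspect what `get_context` returns on its own. The question below is just a hypothetical example; any string works.\n"
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Inspect the raw retrieved context (hypothetical sample question)\n",
+        "rich.print(get_context(\"Who works at the nuclear power plant?\"))"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },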
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### d. Ask questions\n",
+        "\n",
+        "Finally, let's put it all together and get a response from our assistant using the Simpsons database."
+      ],
+      "metadata": {
+        "id": "spQPCVe9AZKh"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "question = \"Talking about food\"\n",
+        "response = get_response(question)\n",
+        "rich.print(response)"
+      ],
+      "metadata": {
+        "id": "pwegM_02AgOU"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}