Commit 23198e8

quickstarts: new rag using pgvector (#136)
1 parent 3b3d9ac commit 23198e8

File tree

1 file changed: +220 -0 lines changed

Diff for: vector_store_integration/RAG_using_PGVector.ipynb

@@ -0,0 +1,220 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Airbyte PGVector RAG Demo\n",
"\n",
"This tutorial demonstrates how to use data stored in Airbyte's PGVector destination to perform Retrieval-Augmented Generation (RAG). Use this destination when you intend to use PGVector for LLM-specific vector operations such as RAG.\n",
"\n",
"As a practical example, we'll build an assistant: an AI chatbot capable of answering questions about Simpsons episodes, using data loaded through Airbyte.\n",
"\n",
"#### Prerequisites:\n",
"* Vector data stored in Postgres with vector columns via the PGVector destination. In our case we are using data from Kaggle.\n",
"* A PostgreSQL database with the PGVector extension enabled\n",
"* An OpenAI API key\n"
]
},
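{
"cell_type": "markdown",
"source": [
"The queries in this notebook assume the PGVector destination wrote each record as a row with a `document_content` text column, a `metadata` JSON column, and an `embedding` vector column, in a table named `episode_spoken_words`. If your sync used a different stream name, adjust the table name accordingly.\n"
],
"metadata": {}
},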
{
"cell_type": "markdown",
"source": [
"### a. Install dependencies and import secrets\n",
"\n"
],
"metadata": {
"id": "7R0-uD7R3Uki"
}
},
{
"cell_type": "code",
"source": [
"!pip install sqlalchemy openai rich psycopg2 python-dotenv langchain-openai"
],
"metadata": {
"collapsed": true,
"id": "HbR-po_Z3VFV"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"import openai\n",
"import json\n",
"import rich\n",
"from langchain_openai import OpenAIEmbeddings\n",
"from sqlalchemy import create_engine\n",
"from sqlalchemy.engine import URL\n",
"from google.colab import userdata\n",
"\n",
"\n",
"OPENAI_API_KEY = userdata.get('openai_api_key')\n",
"HOST = userdata.get(\"db_host\")\n",
"USERNAME = userdata.get(\"db_username\")\n",
"PASSWORD = userdata.get(\"db_password\")\n",
"DATABASE = userdata.get(\"db_name\")"
],
"metadata": {
"id": "LP0QfMFQ6Flz"
},
"execution_count": null,
"outputs": []
},
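{
"cell_type": "markdown",
"source": [
"If you are running outside Colab, `google.colab.userdata` is unavailable. A minimal alternative sketch, assuming a `.env` file that defines the same five secrets, uses `python-dotenv` (already in the pip install above):\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch for non-Colab environments: load the same secrets from a .env file.\n",
"# Assumes entries named OPENAI_API_KEY, DB_HOST, DB_USERNAME, DB_PASSWORD, DB_NAME.\n",
"import os\n",
"from dotenv import load_dotenv\n",
"\n",
"if load_dotenv():  # True only if a .env file was found and loaded\n",
"    OPENAI_API_KEY = os.environ[\"OPENAI_API_KEY\"]\n",
"    HOST = os.environ[\"DB_HOST\"]\n",
"    USERNAME = os.environ[\"DB_USERNAME\"]\n",
"    PASSWORD = os.environ[\"DB_PASSWORD\"]\n",
"    DATABASE = os.environ[\"DB_NAME\"]"
],
"metadata": {},
"execution_count": null,
"outputs": []
},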
{
"cell_type": "markdown",
"source": [
"### b. Initialize OpenAI client and DB engine\n"
],
"metadata": {
"id": "mmHB_MId7zwo"
}
},
{
"cell_type": "code",
"source": [
"openai.api_key = OPENAI_API_KEY\n",
"\n",
"url = URL.create(\n",
"    \"postgresql\",\n",
"    host=HOST,\n",
"    username=USERNAME,\n",
"    password=PASSWORD,\n",
"    database=DATABASE,\n",
")\n",
"\n",
"engine = create_engine(url)"
],
"metadata": {
"id": "7Y3iCe6e7-Ra"
},
"execution_count": null,
"outputs": []
},
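{
"cell_type": "markdown",
"source": [
"As a quick sanity check, we can confirm that the connection works, that the `vector` extension is installed, and list the columns of the `episode_spoken_words` table queried below (the table name is an assumption based on this notebook's queries):\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"from sqlalchemy import text\n",
"\n",
"with engine.connect() as connection:\n",
"    # pg_extension is a standard PostgreSQL catalog table\n",
"    version = connection.execute(\n",
"        text(\"SELECT extversion FROM pg_extension WHERE extname = 'vector'\")\n",
"    ).scalar()\n",
"    print(f\"pgvector version: {version}\")\n",
"\n",
"    # List the columns the destination created for the table used later\n",
"    columns = connection.execute(\n",
"        text(\n",
"            \"SELECT column_name, data_type FROM information_schema.columns \"\n",
"            \"WHERE table_name = 'episode_spoken_words'\"\n",
"        )\n",
"    ).fetchall()\n",
"    for name, dtype in columns:\n",
"        print(name, dtype)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},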
{
"cell_type": "markdown",
"source": [
"### c. Explore data stored in PostgreSQL\n",
"\n",
"We need a few helper functions to embed user questions and search the database:\n",
"\n",
"- A helper to embed the user question so we can search for it in the DB.\n",
"- A function to get the context from the database using a user question as input.\n",
"- A function to get the response from the chat assistant, grounding it in the context returned by the previous step."
],
"metadata": {
"id": "ZQ0PzOfb9rtK"
}
},
{
"cell_type": "code",
"source": [
"from sqlalchemy import text\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI(\n",
"    api_key=OPENAI_API_KEY,\n",
")\n",
"\n",
"def get_embedding_from_open_ai(question):\n",
"    print(f\"Embedding user's query: {question}\")\n",
"    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n",
"    embedding_response = embeddings.embed_query(question)\n",
"    return embedding_response\n",
"\n",
"QUERY_TEMPLATE = \"\"\"\n",
"SELECT document_content,\n",
"       metadata->'title' as episode_title,\n",
"       metadata->'script_line_number' as script_line_number,\n",
"       metadata->'name' as character_name,\n",
"       metadata->'spoken_words' as spoken_words\n",
"FROM episode_spoken_words\n",
"ORDER BY embedding <-> :question_vector\n",
"LIMIT 5\n",
"\"\"\"\n",
"\n",
"def get_context(question) -> str:\n",
"    # Get the embedding from OpenAI\n",
"    question_vector = get_embedding_from_open_ai(question)\n",
"\n",
"    # Format the question vector as a string in the format expected by PostgreSQL\n",
"    question_vector_str = '[' + ','.join(map(str, question_vector)) + ']'\n",
"\n",
"    # Use the text() function for raw queries with SQLAlchemy\n",
"    query = text(QUERY_TEMPLATE)\n",
"\n",
"    # Execute the query, passing the vector as a bind parameter;\n",
"    # consume the rows while the connection is still open\n",
"    with engine.connect() as connection:\n",
"        result = connection.execute(query, {'question_vector': question_vector_str})\n",
"\n",
"        # Format and return the result\n",
"        return (\"\\n\\n\" + \"-\" * 8 + \"\\n\\n\").join(\n",
"            [\n",
"                f\"Episode {row.episode_title} | Line number: {row.script_line_number} | \"\n",
"                f\"Spoken Words: {row.spoken_words} | Character: {row.character_name}\"\n",
"                for row in result\n",
"            ]\n",
"        )\n",
"\n",
"\n",
"def get_response(question):\n",
"    response = client.chat.completions.create(\n",
"        model=\"gpt-3.5-turbo\",\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": \"You are a Simpsons expert talking about Simpsons episodes.\"},\n",
"            {\"role\": \"user\", \"content\": question},\n",
"            {\"role\": \"assistant\", \"content\": f\"Use only this information to answer the question: {get_context(question)}. Do not search on the internet.\"}\n",
"        ]\n",
"    )\n",
"    return response.choices[0].message.content\n",
"\n"
],
"metadata": {
"id": "_PJ6eb5-A419"
},
"execution_count": null,
"outputs": []
},
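{
"cell_type": "markdown",
"source": [
"A note on the query above: `<->` is pgvector's Euclidean (L2) distance operator. pgvector also provides `<=>` for cosine distance, the conventional choice for text embeddings (OpenAI embeddings are unit-normalized, so both orderings agree in practice). Below is a sketch of the same lookup using cosine distance, with the distance returned as a column so you can inspect match quality:\n"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Variant of QUERY_TEMPLATE (illustrative sketch): cosine distance via <=>,\n",
"# returned as a column to show how close each match is.\n",
"COSINE_QUERY_TEMPLATE = \"\"\"\n",
"SELECT document_content,\n",
"       metadata->'title' as episode_title,\n",
"       embedding <=> :question_vector as cosine_distance\n",
"FROM episode_spoken_words\n",
"ORDER BY embedding <=> :question_vector\n",
"LIMIT 5\n",
"\"\"\"\n",
"\n",
"question_vector = get_embedding_from_open_ai(\"Talking about food\")\n",
"question_vector_str = '[' + ','.join(map(str, question_vector)) + ']'\n",
"\n",
"with engine.connect() as connection:\n",
"    rows = connection.execute(\n",
"        text(COSINE_QUERY_TEMPLATE), {'question_vector': question_vector_str}\n",
"    )\n",
"    for row in rows:\n",
"        print(row.episode_title, row.cosine_distance)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},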
{
"cell_type": "markdown",
"source": [
"### d. Ask questions\n",
"\n",
"Finally, let's put it all together and get a response from our assistant using the Simpsons database."
],
"metadata": {
"id": "spQPCVe9AZKh"
}
},
{
"cell_type": "code",
"source": [
"question = \"Talking about food\"\n",
"response = get_response(question)\n",
"rich.print(response)"
],
"metadata": {
"id": "pwegM_02AgOU"
},
"execution_count": null,
"outputs": []
}
]
}
