diff --git a/README.md b/README.md
index fb786e3b7..51832b553 100644
--- a/README.md
+++ b/README.md
@@ -26,5 +26,9 @@ from abacusai import ApiClient
 client = ApiClient('YOUR_API_KEY')
 ```
+[API KEY](https://abacus.ai/app/profile/apikey)
+[Abacus.AI CheatSheet](https://github.com/abacusai/api-python/blob/main/examples/CheatSheet.md)
+[Abacus.AI API Examples](https://github.com/abacusai/api-python/blob/main/examples)
+
 ## License
 [MIT](https://github.com/abacusai/api-python/blob/main/LICENSE)
diff --git a/examples/CheatSheet.md b/examples/CheatSheet.md
new file mode 100644
index 000000000..0684e0cda
--- /dev/null
+++ b/examples/CheatSheet.md
@@ -0,0 +1,49 @@
+## Python SDK Documentation Examples
+The full documentation of the Abacus.AI Python SDK can be found [here](https://abacusai.github.io/api-python/autoapi/abacusai/index.html)
+
+Please note that from within the platform's UI, you will have access to template/example code for all of the cases below:
+- [Python Feature Group](https://abacus.ai/app/python_functions_list)
+- [Pipelines](https://abacus.ai/app/pipelines)
+- [Custom Loss Functions](https://abacus.ai/app/custom_loss_functions_list)
+- Custom Models & [Algorithms](https://abacus.ai/app/algorithm_list)
+- [Python Modules](https://abacus.ai/app/modules_list)
+- And others...
+
+Furthermore, the platform's `AI Engineer`, a chatbot located on the bottom right of the screen inside any project, also has access to our APIs and can provide you with examples.
+
+#### Abacus.AI - API Command Cheatsheet
+Here is a list of important methods you should keep handy. You can find examples for all of these methods inside the notebooks, or you can refer to the official documentation.
+
+```python
+from abacusai import ApiClient
+client = ApiClient('YOUR_API_KEY')
+#client.whatever_name_of_method_below()
+```
+
+| Method | Explanation |
+|----------------------------------------------------|--------------------------------------------------------------------------------------------------|
+| suggest_abacus_apis | Describe what you need, and we will return the methods that will help you achieve it. |
+| describe_project | Describes the project |
+| create_dataset_from_upload | Creates a dataset object from local data |
+| describe_feature_group_by_table_name | Describes the feature group using the table name |
+| describe_feature_group_version | Describes the feature group using the feature group version |
+| list_models | Lists the models of a project |
+| extract_data_using_llm | Extracts data from a document. Allows you to create a JSON output and extract specific information from a document |
+| execute_data_query_using_llm | Runs SQL on top of feature groups based on natural language input. Can return both SQL and the result of SQL execution. |
+| get_chat_response | Uses a ChatLLM deployment. Can be used to add filters, change the LLM, and handle advanced use cases using an agent on top of a ChatLLM deployment. |
+| get_chat_response_with_binary_data | Same as above, but you can also send a binary dataset |
+| get_conversation_response | Uses a ChatLLM deployment with conversation history. Useful when you need to use the API. You create a conversation ID and send it, or you use the one created by Abacus. |
+| get_conversation_response_with_binary_data | Same as above, but you can also send a binary dataset |
+| evaluate_prompt | LLM call for a user query. Can get JSON output using additional arguments |
+| get_matching_documents | Gets the search results for a user query using a document retriever directly. Can be used along with evaluate_prompt to create a customized ChatLLM-like agent |
+| get_relevant_snippets | Creates a document retriever on the fly for retrieving search results |
+| extract_document_data | Extracts data from a PDF, Word document, etc. using OCR or the embedded digital text. |
+| get_docstore_document | Downloads a document from the docstore using its doc_id. |
+| get_docstore_document_data | Gets extracted or embedded text from a document using its doc_id. |
+| stream_message | Streams a message to the UI for agents |
+| update_feature_group_sql_definition | Updates the SQL definition of a feature group |
+| query_database_connector | Executes a SQL query on top of a database connector. Will only work for connectors that support it. |
+| export_feature_group_version_to_file_connector | Exports a feature group to a file connector |
+| export_feature_group_version_to_database_connector | Exports a feature group to a database connector |
+| create_dataset_version_from_file_connector | Refreshes a dataset from its connected file connector. |
+| create_dataset_version_from_database_connector | Refreshes a dataset from its connected database connector. |
\ No newline at end of file
diff --git a/examples/basics/basics.ipynb b/examples/basics/basics.ipynb
new file mode 100644
index 000000000..b32999877
--- /dev/null
+++ b/examples/basics/basics.ipynb
@@ -0,0 +1,291 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Connect to Abacus\n",
+    "You can find your API key here: [API KEY](https://abacus.ai/app/profile/apikey)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import abacusai\n",
+    "client = abacusai.ApiClient(\"YOUR API KEY\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Finding APIs easily\n",
+    "There are three ways to find APIs easily in the Abacus platform:\n",
+    "1. Try auto-completion using tab. Most APIs follow expressive naming, so you can search for them using the autocomplete feature.\n",
+    "2. Use the `suggest_abacus_apis` method. This method calls a large language model that has access to our full documentation. It can suggest which API works for what you are trying to do.\n",
+    "3. Use the official [Python SDK documentation](https://abacusai.github.io/api-python/autoapi/abacusai/index.html) page, which lists all the available methods and attributes of classes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "apis = client.suggest_abacus_apis(\"list feature groups in a project\", verbosity=2, limit=3)\n",
+    "for api in apis:\n",
+    "    print(f\"Method: {api.method}\")\n",
+    "    print(f\"Docstring: {api.docstring}\")\n",
+    "    print(\"---\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Project Level APIs\n",
+    "You can find the project ID easily by looking at the URL in your browser. 
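If you would rather not copy the ID out of the URL by hand, you can also look it up by project name. This is a minimal sketch, assuming `list_projects` returns project objects exposing `name` and `project_id` attributes (worth verifying against the SDK documentation):

```python
# Look up a project ID by name instead of copying it from the browser URL.
# Assumption: each item returned by list_projects() has `name` and `project_id` attributes.
projects = client.list_projects()
project_id = next(p.project_id for p in projects if p.name == "MY_PROJECT_NAME")
print(project_id)
```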
For example, if your URL looks like this: `https://abacus.ai/app/projects/fsdfasg33?doUpload=true`, the project id is \"fsdfasg33\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Gets information about the project based on the ID.\n", + "project = client.describe_project(project_id=\"YOUR_PROJECT_ID\")\n", + "\n", + "# A list of all models trained under the project\n", + "models = client.list_models(project_id=\"YOUR_PROJECT_ID\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load a Feature Group" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Loads the specific version of a FeatureGroup class object \n", + "fg = client.describe_feature_group_version(\"FEATURE_GROUP_VERSION\")\n", + "\n", + "# Loads the latest version of a FeatureGroup class object based on a name\n", + "fg = client.describe_feature_group_by_table_name(\"FEATURE_GROUP_NAME\")\n", + "\n", + "# Loads the FeatureGroup as a pandas dataframe\n", + "df = fg.load_as_pandas()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Add a Feature Group to the Project" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# First we connect our docstore to our project\n", + "client.add_feature_group_to_project(\n", + " feature_group_id='FEATURE_GROUP_ID',\n", + " project_id='PROJECT_ID',\n", + " feature_group_type='CUSTOM_TABLE' # You can set to DOCUMENTS if this is a document set\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Update the feature group SQL definition" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.update_feature_group_sql_definition('YOUR_FG_ID', 'SQL')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Creating a Dataset from local\n", + "For every dataset created, a feature group with the same name will also be generated. When you need to update the source data, just update the dataset directly and the feature group will also reflect those changes." 
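Before the ZIP example in the next cell, here is a minimal sketch of uploading a single local CSV file. The table name and `my_data.csv` are placeholders, and the accepted `file_format` values are worth confirming in the SDK docs:

```python
# Create a dataset (and its matching feature group) from one local CSV file.
upload = client.create_dataset_from_upload(table_name='MY_CSV_TABLE', file_format='CSV')

# Stream the file into the upload, then wait for ingestion to finish.
with open('my_data.csv', 'rb') as f:
    dataset = upload.upload_file(f)
dataset.wait_for_import()
```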
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import io\n", + "zip_filename= 'sample_data_folder.zip'\n", + "\n", + "with open(zip_filename, 'rb') as f:\n", + " zip_file_content = f.read()\n", + "\n", + "zip_file_io = io.BytesIO(zip_file_content)\n", + "\n", + "# If the ZIP folder contains unstructured text documents (PDF, Word, etc.), then set `is_documentset` == True\n", + "upload = client.create_dataset_from_upload(table_name='MY_SAMPLE_DATA', file_format='ZIP', is_documentset=False)\n", + "upload.upload_file(zip_file_io)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Updating a Dataset from local" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "upload = client.create_dataset_version_from_upload(dataset_id='YOUR_DATASET_ID', file_format='ZIP')\n", + "upload.upload_file(zip_file_io)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Executing SQL using a connector" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "connector_id = \"YOUR_CONNECTOR_ID\"\n", + "sql_query = \"SELECT * FROM TABLE LIMIT 5\"\n", + "\n", + "result = client.query_database_connector(connector_id, sql_query)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Uploading a Dataset using a connector\n", + "\n", + "`doc_processing_config` is optional depending on if you want to load a document set or no. use the code below and change based on your application. \n", + "\n", + "Similar to `create_dataset_from_file_connector` you can use `create_dataset_from_database_connector`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# doc_processing_config = abacusai.DatasetDocumentProcessingConfig(\n", + "# extract_bounding_boxes=True,\n", + "# use_full_ocr=False,\n", + "# remove_header_footer=False,\n", + "# remove_watermarks=True,\n", + "# convert_to_markdown=False,\n", + "# )\n", + "\n", + "dataset = client.create_dataset_from_file_connector(\n", + " table_name=\"MY_TABLE_NAME\",\n", + " location=\"azure://my-location:share/whatever/*\",\n", + " # refresh_schedule=\"0 0 * * *\", # Daily refresh at midnight UTC\n", + " # is_documentset=True, #Only if this is an actual documentset (Meaning word documents, PDF files, etc)\n", + " # extract_bounding_boxes=True,\n", + " # document_processing_config=doc_processing_config,\n", + " # reference_only_documentset=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Updating a Dataset using a connector" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.create_dataset_version_from_file_connector('DATASET_ID') # For file connector\n", + "client.create_dataset_version_from_database_connector('DATASET_ID')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Export A feature group to a connector\n", + "Below code will also work for non-SQL connectors like blob storages. 
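For a plain file connector (blob storage) destination, a minimal hedged sketch is shown below; the `location` path and the `export_file_format` argument are assumptions to verify against the SDK docs:

```python
# Export the latest materialized version of a feature group to a file connector location.
fg = client.describe_feature_group_by_table_name('FEATURE_GROUP_NAME')
fg_version = fg.latest_feature_group_version.feature_group_version

client.export_feature_group_version_to_file_connector(
    fg_version,
    location='s3://my-bucket/exports/my_table',  # assumed connector path
    export_file_format='CSV',                    # assumed format value
)
```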
The `database_feature_mapping` would be optional in those cases.\n", + "\n", + "You can find the `connector_id` [here](https://abacus.ai/app/profile/connected_services)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "WRITEBACK = 'Anonymized_Store_Week_Result'\n", + "MAPPING = {\n", + " 'COLUMN_1': 'COLUMN_1', \n", + " 'COLUMN_2': 'COLUMN_2', \n", + "}\n", + "\n", + "feature_group = client.describe_feature_group_by_table_name(f\"FEATURE_GROUP_NAME\")\n", + "feature_group.materialize() # To make sure we have latest version\n", + "feature_group_version = feature_group.latest_feature_group_version.feature_group_version\n", + "client.export_feature_group_version_to_database_connector(\n", + " feature_group_version, \n", + " database_connector_id='connector_id',\n", + " object_name=WRITEBACK,\n", + " database_feature_mapping=MAPPING, \n", + " write_mode='insert'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### " + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/example_gen_ai_commands.ipynb b/examples/example_gen_ai_commands.ipynb deleted file mode 100644 index 8fffc6c3c..000000000 --- a/examples/example_gen_ai_commands.ipynb +++ /dev/null @@ -1,1317 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Tutorial notebook\n", - "\n", - "This notebook shares base snippets to use Abacus.AI API with chat and AI agent functionality\n", - "API comes in handy to ease up some recurring tasks" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Loading Documents\n", - "When documents are uploaded into the platform, they are uploaded as a special Class type \"BlobInput\"." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Testing BlobInput Locally" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-22T08:57:33.707610Z", - "iopub.status.busy": "2024-07-22T08:57:33.707215Z", - "iopub.status.idle": "2024-07-22T08:57:33.823496Z", - "shell.execute_reply": "2024-07-22T08:57:33.822805Z", - "shell.execute_reply.started": "2024-07-22T08:57:33.707580Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "from abacusai import ApiClient\n", - "client = ApiClient()\n", - "\n", - "# Your project id consists of numbers and letters id, \n", - "# Can be found as a part of the browser URL or the project's main page. \n", - "# Needed for some API calls. 
\n", - "# For this example, this should be the AI Agent Project\n", - "project_id = 'your_project_id' \n", - "try:\n", - " client.describe_project(project_id)\n", - "except:\n", - " raise Exception('Provide your current project ID')\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Here, we upload training file from the current location of the notebook\n", - "# You can add files to Jupyter Notebook by drag and drop\n", - "from abacusai.client import BlobInput\n", - "document = BlobInput.from_local_file(\"test.docx\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#document.contents is a bytes string of the document\n", - "extracted_doc_data = client.extract_document_data(document.contents)\n", - "\n", - "print(extracted_doc_data.pages[0]) # Text from page 0\n", - "print(extracted_doc_data.embedded_text) # All text from the document" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.status.busy": "2024-07-19T11:57:40.640115Z", - "iopub.status.idle": "2024-07-19T11:57:40.640754Z", - "shell.execute_reply": "2024-07-19T11:57:40.640509Z", - "shell.execute_reply.started": "2024-07-19T11:57:40.640484Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "len(extracted_doc_data.pages)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.status.busy": "2024-07-19T11:57:40.642525Z", - "iopub.status.idle": "2024-07-19T11:57:40.642957Z", - "shell.execute_reply": "2024-07-19T11:57:40.642750Z", - "shell.execute_reply.started": "2024-07-19T11:57:40.642728Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Here, we will create the upload for the dataset from the notebook as an Abacus dataset that we will be able to use later.\n", - "# Docstore is special table format for document storage\n", - "\n", - "upload = client.create_dataset_from_upload(\n", - " table_name='my_documents_'+client.describe_user().email.split('@')[0], #name should be unique inside the organisation\n", - " file_format='DOCX',\n", - " is_documentset=True\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-18T16:23:49.025773Z", - "iopub.status.busy": "2024-07-18T16:23:49.025348Z", - "iopub.status.idle": "2024-07-18T16:23:49.776343Z", - "shell.execute_reply": "2024-07-18T16:23:49.775579Z", - "shell.execute_reply.started": "2024-07-18T16:23:49.025744Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Dataset(dataset_id='93c789714',\n", - " source_type='UPLOAD',\n", - " data_source='re://datasets/93c789714',\n", - " created_at='2024-07-18T16:23:47+00:00',\n", - " ephemeral=False,\n", - " feature_group_table_name='my_documents_bogdan',\n", - " incremental=False,\n", - " is_documentset=True,\n", - " extract_bounding_boxes=False,\n", - " merge_file_schemas=True,\n", - " reference_only_documentset=False,\n", - " latest_dataset_version=DatasetVersion(dataset_version='18b974e6a',\n", - " status='CONVERTING',\n", - " dataset_id='93c789714',\n", - " size=136504,\n", - " created_at='2024-07-18T16:23:47+00:00',\n", - " merge_file_schemas=True),\n", - " document_processing_config=DocumentProcessingConfig(extract_bounding_boxes=False, ocr_mode='DEFAULT', use_full_ocr=None, remove_header_footer=False, remove_watermarks=True, convert_to_markdown=False))" - ] - }, - 
"execution_count": 68, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "with open(\"test.docx\", \"rb\") as file:\n", - "\n", - " file_uploaded = upload.upload_file(file)\n", - " file_uploaded.wait_for_import()\n", - "\n", - "file_uploaded" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T12:01:07.897492Z", - "iopub.status.busy": "2024-07-19T12:01:07.897085Z", - "iopub.status.idle": "2024-07-19T12:01:08.014131Z", - "shell.execute_reply": "2024-07-19T12:01:08.013392Z", - "shell.execute_reply.started": "2024-07-19T12:01:07.897463Z" - }, - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " feature group found\n" - ] - } - ], - "source": [ - "# Here we verify our upload and check the structure of file we created in docstore\n", - "\n", - "try:\n", - " feature_group = client.describe_feature_group_by_table_name('my_documents_'+client.describe_user().email.split('@')[0])\n", - " print(' feature group found')\n", - "except:\n", - " feature_group = file_uploaded.describe_feature_group()\n", - " if not feature_group.list_versions():\n", - " print(\"creating first version\")\n", - " feature_group.create_version()\n", - " feature_group.wait_for_materialization()" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T12:01:09.179593Z", - "iopub.status.busy": "2024-07-19T12:01:09.179205Z", - "iopub.status.idle": "2024-07-19T12:01:10.773613Z", - "shell.execute_reply": "2024-07-19T12:01:10.772818Z", - "shell.execute_reply.started": "2024-07-19T12:01:09.179565Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
doc_idpage_infosfile_pathfile_size_bytesfile_checksumfile_descriptionmime_typepage_counttoken_count
018b974e6a-000000000-5c536d98323d4a361309f5516c...{'first_page': 0, 'last_page': 57}uploaded_data.docx136504SHA512_256:5f4100c619746dc53c62b706540e759ac86...Microsoft Word 2007+application/vnd.openxmlformats-officedocument....5817598
\n", - "
" - ], - "text/plain": [ - " doc_id \\\n", - "0 18b974e6a-000000000-5c536d98323d4a361309f5516c... \n", - "\n", - " page_infos file_path file_size_bytes \\\n", - "0 {'first_page': 0, 'last_page': 57} uploaded_data.docx 136504 \n", - "\n", - " file_checksum file_description \\\n", - "0 SHA512_256:5f4100c619746dc53c62b706540e759ac86... Microsoft Word 2007+ \n", - "\n", - " mime_type page_count token_count \n", - "0 application/vnd.openxmlformats-officedocument.... 58 17598 " - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = feature_group.load_as_pandas()\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T12:01:10.775611Z", - "iopub.status.busy": "2024-07-19T12:01:10.774972Z", - "iopub.status.idle": "2024-07-19T12:01:10.780482Z", - "shell.execute_reply": "2024-07-19T12:01:10.779846Z", - "shell.execute_reply.started": "2024-07-19T12:01:10.775579Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "'18b974e6a-000000000-5c536d98323d4a361309f5516c6282e003a385e11822616975ed720f8d473ba4'" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df['doc_id'][0]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Extracting the documents text" - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-18T16:27:06.403338Z", - "iopub.status.busy": "2024-07-18T16:27:06.403055Z", - "iopub.status.idle": "2024-07-18T16:27:12.595804Z", - "shell.execute_reply": "2024-07-18T16:27:12.595177Z", - "shell.execute_reply.started": "2024-07-18T16:27:06.403312Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "\n", - "data_from_docstore = client.get_docstore_document_data(df['doc_id'][0]) # Get data from a document stored in the docstore" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Getting document text from uploaded data" - ] - }, - { - "cell_type": "code", - "execution_count": 74, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-18T16:41:18.951733Z", - "iopub.status.busy": "2024-07-18T16:41:18.951323Z", - "iopub.status.idle": "2024-07-18T16:41:19.890519Z", - "shell.execute_reply": "2024-07-18T16:41:19.889614Z", - "shell.execute_reply.started": "2024-07-18T16:41:18.951705Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "dict_keys(['metadata', 'tokens', 'pages', 'doc_id', 'embedded_text', 'extracted_text'])" - ] - }, - "execution_count": 74, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# To access docstore later, or when it was created outside of this notebook, we may use the name or id of it by functions describe_feature_group_by_table_name or describe_feature_group, respectively\n", - "\n", - "df = client.describe_feature_group_by_table_name(feature_group.name).load_as_pandas_documents(doc_id_column = 'doc_id',document_column = 'page_infos')\n", - "df['page_infos'][0].keys()\n", - "# dict_keys(['pages', 'tokens', 'metadata', 'extracted_text'])\n", - "\n", - "#pages: This is the embedded text from the document on a per page level\n", - "#extracted_text: This is the OCR extracted text from the document" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Creating RAG systems on the fly\n", - "\n", - "How to create RAG system \"on the fly\" with an 
uploaded document" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Returns chunks of documents that are relevant to the query and can be used to feed an LLM\n", - "# Example for blob in memory of notebook\n", - "\n", - "relevant_snippets = client.get_relevant_snippets(\n", - " blobs={\"document\": document.contents},\n", - " query=\"What are the key terms\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Returns chunks of documents that are relevant to the query and can be used to feed an LLM\n", - "# Example for document in the docstore\n", - "\n", - "relevant_snippets = client.get_relevant_snippets(\n", - " doc_ids = [df['doc_id'][0]],\n", - " # blobs={\"document\": document.contents},\n", - " query=\"What are the key terms\")\n", - "\n", - "relevant_snippets" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Using A document Retriever as a standalone deployment\n", - "You can also use a documen retriever, even if a ChatLLM model is not trained!" - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-18T16:41:41.891390Z", - "iopub.status.busy": "2024-07-18T16:41:41.890693Z", - "iopub.status.idle": "2024-07-18T16:41:41.954327Z", - "shell.execute_reply": "2024-07-18T16:41:41.953701Z", - "shell.execute_reply.started": "2024-07-18T16:41:41.891362Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# First we connect our docstore to our project\n", - "\n", - "client.add_feature_group_to_project(\n", - " feature_group_id=feature_group.id,\n", - " project_id=project_id,\n", - " feature_group_type='DOCUMENTS' # Optional, defaults to 'CUSTOM_TABLE'. 
But important to set 'DOCUMENTS' as it will enable Document retriver to work properly with it\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 80, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-18T16:41:41.955884Z", - "iopub.status.busy": "2024-07-18T16:41:41.955576Z", - "iopub.status.idle": "2024-07-18T16:41:41.960320Z", - "shell.execute_reply": "2024-07-18T16:41:41.959690Z", - "shell.execute_reply.started": "2024-07-18T16:41:41.955858Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "'1246377b2c'" - ] - }, - "execution_count": 80, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "feature_group.id" - ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-18T16:41:41.961462Z", - "iopub.status.busy": "2024-07-18T16:41:41.961181Z", - "iopub.status.idle": "2024-07-18T16:41:47.887608Z", - "shell.execute_reply": "2024-07-18T16:41:47.886850Z", - "shell.execute_reply.started": "2024-07-18T16:41:41.961437Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "InferredFeatureMappings(error='',\n", - " feature_mappings=[FeatureMapping(feature_mapping='DOCUMENT_ID',\n", - " feature_name='doc_id'), FeatureMapping(feature_mapping='DOCUMENT',\n", - " feature_name='file_description')])" - ] - }, - "execution_count": 81, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ifm = client.infer_feature_mappings(project_id=project_id,feature_group_id=feature_group.id)\n", - "\n", - "# ifm = client.infer_feature_mappings(project_id='15ed76a6a8',feature_group_id='98a8d9cce')\n", - "ifm" - ] - }, - { - "cell_type": "code", - "execution_count": 82, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-18T16:41:47.888874Z", - "iopub.status.busy": "2024-07-18T16:41:47.888586Z", - "iopub.status.idle": "2024-07-18T16:41:47.892105Z", - "shell.execute_reply": "2024-07-18T16:41:47.891417Z", - "shell.execute_reply.started": "2024-07-18T16:41:47.888847Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# This blocs of code might be useful to fix featuregroup for docstore usage by document retrievers\n", - "\n", - "# client.set_feature_group_type(project_id='15ed76a6a8', feature_group_id='98a8d9cce', feature_group_type='DOCUMENTS')\n", - "# client.set_feature_mapping(project_id,feature_group.id,feature_name='doc_id',feature_mapping='DOCUMENT_ID')\n", - "# client.set_feature_mapping(project_id,feature_group.id,feature_name='page_infos',feature_mapping='DOCUMENT')\n" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T12:01:48.962128Z", - "iopub.status.busy": "2024-07-19T12:01:48.961697Z", - "iopub.status.idle": "2024-07-19T12:01:49.195026Z", - "shell.execute_reply": "2024-07-19T12:01:49.194395Z", - "shell.execute_reply.started": "2024-07-19T12:01:48.962086Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Creating a document retriever\n", - "\n", - "document_retriever = client.create_document_retriever(\n", - " project_id=project_id,\n", - " name='demo_document_retriever__'+client.describe_user().email.split('@')[0],\n", - " feature_group_id=feature_group.id\n", - ")\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Accessing document retriever that is already crreated\n", - "\n", - "# dr = 
client.describe_document_retriever_by_name('demo_document_retriever_'+client.describe_user().email.split('@')[0])\n", - "# dr" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T13:02:32.750744Z", - "iopub.status.busy": "2024-07-19T13:02:32.750341Z", - "iopub.status.idle": "2024-07-19T13:02:32.877158Z", - "shell.execute_reply": "2024-07-19T13:02:32.876301Z", - "shell.execute_reply.started": "2024-07-19T13:02:32.750715Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[]" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "r = client.describe_document_retriever(document_retriever.id)\n", - "# Filters allow you to filter the documents that the doc retriever can use on the fly, using some columns of the training feature group that was used as input to the doc retriever.\n", - "# Filters are also available when using .get_chat_reponse\n", - "\n", - "r.get_matching_documents(query = \"WHATEVER_YOU_WANT_TO_ASK\", filters = {\"document_identification\":['AXIP-4440']})" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T14:58:12.356794Z", - "iopub.status.busy": "2024-07-19T14:58:12.356373Z", - "iopub.status.idle": "2024-07-19T14:58:12.896285Z", - "shell.execute_reply": "2024-07-19T14:58:12.895452Z", - "shell.execute_reply.started": "2024-07-19T14:58:12.356765Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "10" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Examples of document retriever usage\n", - "\n", - "res = document_retriever.get_matching_documents(\"Agreement of the Parties\")\n", - "len(res)" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T15:05:15.506942Z", - "iopub.status.busy": "2024-07-19T15:05:15.506540Z", - "iopub.status.idle": "2024-07-19T15:05:16.143817Z", - "shell.execute_reply": "2024-07-19T15:05:16.143043Z", - "shell.execute_reply.started": "2024-07-19T15:05:15.506910Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[]" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Example of getting no results\n", - "\n", - "res2 = document_retriever.get_matching_documents(\"planting potatoes on a mars\", required_phrases=['mars'])\n", - "res2" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Calling a Large Language Model\n", - "You can use the `evalute_prompt` method to call the LLM of your choice:\n", - "- prompt: This is the actual message that the model receives from the user\n", - "- system_message: These are the instructions that the model will follow" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T13:33:27.110785Z", - "iopub.status.busy": "2024-07-19T13:33:27.110381Z", - "iopub.status.idle": "2024-07-19T13:33:31.256828Z", - "shell.execute_reply": "2024-07-19T13:33:31.255945Z", - "shell.execute_reply.started": "2024-07-19T13:33:27.110755Z" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Athens\n" - ] - } - ], - "source": [ - "r = client.evaluate_prompt(prompt = \"What is the capital of Greece?\", system_message = \"You 
should answer all questions with a single word.\", llm_name = \"OPENAI_GPT4O\")\n", - "\n", - "# Response:\n", - "print(r.content)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Calling a Large Language Model and specifying some output schema\n", - "You can also use the `json_response_schema` to specify the output of the model in a pre-defined manner" - ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T15:08:40.269078Z", - "iopub.status.busy": "2024-07-19T15:08:40.268645Z", - "iopub.status.idle": "2024-07-19T15:08:44.355169Z", - "shell.execute_reply": "2024-07-19T15:08:44.354149Z", - "shell.execute_reply.started": "2024-07-19T15:08:40.269046Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{'learning_objectives': ['Understand the components and functions of car batteries',\n", - " 'Learn how to maintain and troubleshoot car batteries',\n", - " 'Gain knowledge about the different types of car doors and their mechanisms',\n", - " 'Learn how to repair and replace car doors',\n", - " 'Understand the principles and components of car suspension systems',\n", - " 'Learn how to diagnose and fix common issues in car suspension systems']}" - ] - }, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import json\n", - "\n", - "r = client.evaluate_prompt(prompt = \"In this course, you will learn about car batteries, car doors, and car suspension system\",\n", - " # system_message = \"OPTIONAL, but good to have\", \n", - " llm_name = 'OPENAI_GPT4O',\n", - " json_response_schema = {\"learning_objectives\": {\"type\": \"list\", \"description\": \"A list of learning objectives\", \"is_required\": True}}\n", - ")\n", - "learning_objectives = json.loads(r.content)\n", - "learning_objectives" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Creating a simple AI Agent with workflows" - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T15:17:10.645602Z", - "iopub.status.busy": "2024-07-19T15:17:10.645200Z", - "iopub.status.idle": "2024-07-19T15:17:10.649548Z", - "shell.execute_reply": "2024-07-19T15:17:10.648754Z", - "shell.execute_reply.started": "2024-07-19T15:17:10.645573Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "from abacusai import (\n", - " AgentInterface,\n", - " WorkflowGraph,\n", - " WorkflowGraphEdge,\n", - " WorkflowGraphNode,\n", - " WorkflowNodeInputMapping,\n", - " WorkflowNodeInputSchema,\n", - " WorkflowNodeInputType,\n", - " WorkflowNodeOutputMapping,\n", - " WorkflowNodeOutputSchema,\n", - " WorkflowNodeOutputType,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### For this agent you can select one of preselected charecters to answer questions" - ] - }, - { - "cell_type": "code", - "execution_count": 95, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:33:28.364370Z", - "iopub.status.busy": "2024-07-19T16:33:28.363964Z", - "iopub.status.idle": "2024-07-19T16:33:28.368666Z", - "shell.execute_reply": "2024-07-19T16:33:28.367948Z", - "shell.execute_reply.started": "2024-07-19T16:33:28.364340Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "def agent_function(nlp_query, character):\n", - " \"\"\"\n", - " Args:\n", - " nlp_query (Any): Data row to predict on/with or to pass to the agent for execution\n", - " 
Returns:\n", - " The result which can be any json serializable python type\n", - " \"\"\"\n", - " from abacusai import ApiClient\n", - "\n", - " # Let agent respond like your favorite character.\n", - " char = character or 'Sherlock Holmes'\n", - " response = ApiClient().evaluate_prompt(prompt=nlp_query, system_message=f'Respond like {char}. Prepend your name.')\n", - " return str(response.content)" - ] - }, - { - "cell_type": "code", - "execution_count": 96, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:33:28.718690Z", - "iopub.status.busy": "2024-07-19T16:33:28.718280Z", - "iopub.status.idle": "2024-07-19T16:33:30.911606Z", - "shell.execute_reply": "2024-07-19T16:33:30.910908Z", - "shell.execute_reply.started": "2024-07-19T16:33:28.718661Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "'Homer Simpson: Mmm, donuts...'" - ] - }, - "execution_count": 96, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "agent_function('what is your favorite food','Homer Simpson')" - ] - }, - { - "cell_type": "code", - "execution_count": 97, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:33:38.887253Z", - "iopub.status.busy": "2024-07-19T16:33:38.886850Z", - "iopub.status.idle": "2024-07-19T16:33:38.891021Z", - "shell.execute_reply": "2024-07-19T16:33:38.890220Z", - "shell.execute_reply.started": "2024-07-19T16:33:38.887219Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "package_requirements = [] # e.g. ['numpy==1.2.3', 'pandas>=1.4.0']\n", - "description = None\n", - "memory = 16\n", - "enable_binary_input = True" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "WorkflowGraphNode is one block of creation of AI Agent" - ] - }, - { - "cell_type": "code", - "execution_count": 99, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:34:33.287084Z", - "iopub.status.busy": "2024-07-19T16:34:33.286482Z", - "iopub.status.idle": "2024-07-19T16:34:33.294142Z", - "shell.execute_reply": "2024-07-19T16:34:33.293351Z", - "shell.execute_reply.started": "2024-07-19T16:34:33.287054Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "workflow_graph_node = WorkflowGraphNode(\n", - " name=\"input_text\",\n", - " function=agent_function,\n", - " input_mappings=[\n", - " WorkflowNodeInputMapping(\n", - " name=\"nlp_query\",\n", - " variable_type=WorkflowNodeInputType.USER_INPUT,\n", - " # variable_source=\"obi van\"\n", - " ),\n", - " WorkflowNodeInputMapping(\n", - " name=\"character\",\n", - " variable_type=WorkflowNodeInputType.USER_INPUT,\n", - " # variable_source=\"obi van\"\n", - " ),\n", - " ],\n", - " input_schema = WorkflowNodeInputSchema(\n", - " json_schema={\n", - " \"type\": \"object\",\n", - " \"title\": \"Get character responce\",\n", - " \"required\": [\"nlp_query\", \"character\"],\n", - " \"properties\": {\n", - " \"nlp_query\": {\"type\": \"string\", \"title\": \"Your question\"},\n", - " \"character\": {\n", - " \"type\": \"string\",\n", - " \"title\": \"Characters\",\n", - " \"enum\": [\"Sherlock Holmes\", \"Elon Musk\", \"Homer Simpson\"],\n", - " \"default\": \"Homer Simpson\"\n", - " # \"enumNames\": [\"Sherlock Holmes\", \"Elon Musk\", \"Homer Simpson\"]\n", - " }\n", - " # \"table_name\": {\"type\": \"string\", \"title\": \"Table Name\"},\n", - " # \"document_column_name\": {\"type\": \"string\", \"title\": \"Document Column Name\"},\n", - " # \"chunk_size\": {\"type\": \"integer\", \"title\": \"Chunk Size\"},\n", - " # 
\"text_encoder\": {\"type\": \"string\", \"title\": \"Text Encoder\", \"enum\": [e.value for e in VectorStoreTextEncoder]},\n", - " },\n", - " },\n", - " # ui_schema={\n", - " # \"text_encoder\": {\"ui:widget\": \"select\"},\n", - " # }\n", - " ),\n", - " output_mappings=[\n", - " WorkflowNodeOutputMapping(\n", - " name=\"str_out\",\n", - " variable_type=WorkflowNodeOutputType.STRING\n", - " ),\n", - " ],\n", - " output_schema=WorkflowNodeOutputSchema({\n", - " \"type\": \"object\",\n", - " \"properties\": {\n", - " \"str_out\": {\"type\": \"string\", \"title\": \"Response\"},\n", - " },\n", - " })\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "WorkflowGraph is final graph of all nodes and edges that create an AI Agent Logic" - ] - }, - { - "cell_type": "code", - "execution_count": 100, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:34:35.190444Z", - "iopub.status.busy": "2024-07-19T16:34:35.190040Z", - "iopub.status.idle": "2024-07-19T16:34:35.194012Z", - "shell.execute_reply": "2024-07-19T16:34:35.193341Z", - "shell.execute_reply.started": "2024-07-19T16:34:35.190412Z" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "workflow_graph = WorkflowGraph(\n", - " nodes=[\n", - " workflow_graph_node,\n", - " ],\n", - " edges=[],\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 101, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:34:35.886585Z", - "iopub.status.busy": "2024-07-19T16:34:35.886176Z", - "iopub.status.idle": "2024-07-19T16:34:35.949178Z", - "shell.execute_reply": "2024-07-19T16:34:35.948341Z", - "shell.execute_reply.started": "2024-07-19T16:34:35.886553Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "[Agent(name='example_agent',\n", - " agent_id='3c981ea4e',\n", - " created_at='2024-06-24T10:47:22+00:00',\n", - " project_id={'projectId': '45d76db9c', 'problemType': 'ai_agent'},\n", - " agent_config={'ENABLE_BINARY_INPUT': True},\n", - " agent_execution_config={'character': ['Elon Musk', 'Joe Biden']},\n", - " latest_agent_version=AgentVersion(agent_version='325916946',\n", - " status='COMPLETE',\n", - " publishing_completed_at='2024-06-24T10:48:09+00:00')),\n", - " Agent(name='example_agent',\n", - " agent_id='d741a5db6',\n", - " created_at='2024-06-25T16:24:16+00:00',\n", - " project_id={'projectId': '45d76db9c', 'problemType': 'ai_agent'},\n", - " agent_config={},\n", - " latest_agent_version=AgentVersion(agent_version='75cdd1928',\n", - " status='COMPLETE',\n", - " publishing_completed_at='2024-07-01T11:20:44+00:00'),\n", - " workflow_graph=WorkflowGraph(nodes=[WorkflowGraphNode()], edges=[], primary_start_node='input_text')),\n", - " Agent(name='Example_Character_Agent',\n", - " agent_id='1000067952',\n", - " created_at='2024-07-19T15:48:24+00:00',\n", - " project_id={'projectId': '45d76db9c', 'problemType': 'ai_agent'},\n", - " agent_config={},\n", - " latest_agent_version=AgentVersion(agent_version='f9cc62cc0',\n", - " status='COMPLETE',\n", - " publishing_completed_at='2024-07-19T16:31:30+00:00'),\n", - " workflow_graph=WorkflowGraph(nodes=[WorkflowGraphNode()], edges=[], primary_start_node='input_text'))]" - ] - }, - "execution_count": 101, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "client.list_agents(project_id)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "There are 2 main types of AI Agents AgentInterface.DEFAULT and AgentInterface.CHAT\n", - "\n", - 
"- AgentInterface.DEFAULT is an AI agent that uses forms to fill in and work like an app\n", - "- AgentInterface.CHAT reproduces experiense of chat with LLM with logic that you may create for it" - ] - }, - { - "cell_type": "code", - "execution_count": 103, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:34:41.086643Z", - "iopub.status.busy": "2024-07-19T16:34:41.086228Z", - "iopub.status.idle": "2024-07-19T16:35:42.676266Z", - "shell.execute_reply": "2024-07-19T16:35:42.675477Z", - "shell.execute_reply.started": "2024-07-19T16:34:41.086610Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Agent(name='Example_Character_Agent',\n", - " agent_id='1000067952',\n", - " created_at='2024-07-19T15:48:24+00:00',\n", - " project_id={'problemType': 'ai_agent', 'allProjectModels': None, 'projectId': '45d76db9c', 'ingressName': None, 'starred': 0, 'tags': None, 'useCase': 'ai_agent', 'name': 'AI_agent_bogdan', 'ingressType': None, 'createdAt': '2024-06-19T12:30:10+00:00', 'deployments': None, 'systemCreated': False, 'info': None, 'updatedAt': '2024-06-19T12:30:10+00:00'},\n", - " notebook_id='5441509c0',\n", - " agent_config={},\n", - " code_source=CodeSource(source_type='TEXT',\n", - " source_code='def agent_function(nlp_query, character):\\n \"\"\"\\n Args:\\n nlp_query (Any): Data row to predict on/with or to pass to the agent for execution\\n Returns:\\n The result which can be any json serializable python type\\n \"\"\"\\n from abacusai import ApiClient\\n\\n # Let agent respond like your favorite character.\\n char = character or \\'Sherlock Holmes\\'\\n response = ApiClient().evaluate_prompt(prompt=nlp_query, system_message=f\\'Respond like {char}. Prepend your name.\\')\\n return str(response.content)\\n\\n',\n", - " package_requirements=[],\n", - " status='PENDING',\n", - " module_dependencies=[]),\n", - " latest_agent_version=AgentVersion(agent_version='117e736ee8',\n", - " status='COMPLETE',\n", - " agent_id='1000067952',\n", - " agent_config={},\n", - " publishing_started_at='2024-07-19T16:35:00+00:00',\n", - " publishing_completed_at='2024-07-19T16:35:13+00:00',\n", - " pending_deployment_ids=['69d0022c6'],\n", - " failed_deployment_ids=[],\n", - " code_source=CodeSource(source_type='TEXT',\n", - " source_code='def agent_function(nlp_query, character):\\n \"\"\"\\n Args:\\n nlp_query (Any): Data row to predict on/with or to pass to the agent for execution\\n Returns:\\n The result which can be any json serializable python type\\n \"\"\"\\n from abacusai import ApiClient\\n\\n # Let agent respond like your favorite character.\\n char = character or \\'Sherlock Holmes\\'\\n response = ApiClient().evaluate_prompt(prompt=nlp_query, system_message=f\\'Respond like {char}. 
Prepend your name.\\')\\n return str(response.content)\\n\\n',\n", - " package_requirements=[],\n", - " status='COMPLETE',\n", - " publishing_msg={'warningInfo': ''},\n", - " module_dependencies=[]),\n", - " workflow_graph=WorkflowGraph(nodes=[WorkflowGraphNode()], edges=[], primary_start_node='input_text')),\n", - " workflow_graph=WorkflowGraph(nodes=[WorkflowGraphNode()], edges=[], primary_start_node='input_text'))" - ] - }, - "execution_count": 103, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from abacusai import ApiClient\n", - "client = ApiClient()\n", - "agent_interface: AgentInterface = AgentInterface.DEFAULT\n", - "if 'agent_function' not in vars():\n", - " raise Exception('Please define agent function with name - agent_function')\n", - "\n", - "if not [x for x in client.list_agents(project_id) if 'Example_Character_Agent'==x.name]:\n", - " agent = client.create_agent(project_id=project_id,\n", - " # function_source_code=agent_function, agent_function_name='agent_function',\n", - " name='Example_Character_Agent', \n", - " package_requirements=package_requirements,\n", - " description=description,\n", - " # enable_binary_input=enable_binary_input, memory=memory,\n", - " workflow_graph=workflow_graph, agent_interface=agent_interface)\n", - " agent.wait_for_publish()\n", - " deployment = client.create_deployment(model_id=agent.agent_id)\n", - " deployment.wait_for_deployment()\n", - "\n", - "else:\n", - " agent = client.update_agent(model_id=agent.id,\n", - " # function_source_code=agent_function, agent_function_name='agent_function',\n", - " # name='example_agent',\n", - " # package_requirements=package_requirements,\n", - " # description=description,\n", - " # enable_binary_input=enable_binary_input, memory=memory,\n", - " workflow_graph=workflow_graph, agent_interface=agent_interface)\n", - " agent.wait_for_publish()\n", - "\n", - "agent" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Deployment of AI Agent\n", - "\n", - "Additionally to creation of model it should be deployed\n", - "Ai agent later may be reached through Deployments > Predictions Dash inside this project\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "client.list_deployment_tokens(project_id)\n", - "# use client.create_deployment_token() if you have no tokens" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": { - "execution": { - "iopub.execute_input": "2024-07-19T16:01:11.274557Z", - "iopub.status.busy": "2024-07-19T16:01:11.273771Z", - "iopub.status.idle": "2024-07-19T16:01:56.721141Z", - "shell.execute_reply": "2024-07-19T16:01:56.720353Z", - "shell.execute_reply.started": "2024-07-19T16:01:11.274520Z" - }, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Deployment(deployment_id='69d0022c6',\n", - " name='Example_Character_Agent Deployment',\n", - " status='ACTIVE',\n", - " description='',\n", - " deployed_at='2024-07-19T16:01:54+00:00',\n", - " created_at='2024-07-19T16:01:11+00:00',\n", - " project_id='45d76db9c',\n", - " model_id='1000067952',\n", - " model_version='3da4b4f46',\n", - " calls_per_second=5,\n", - " auto_deploy=True,\n", - " skip_metrics_check=False,\n", - " algo_name='AI Agent',\n", - " regions=[{'name': 'Us East 1', 'value': 'us-east-1'}],\n", - " batch_streaming_updates=False,\n", - " algorithm='2efe1d48f',\n", - " model_deployment_config={'otherModelsForDataClusterTypes': {}, 'streamingFeatureGroupDetails': [], 
'modelTrainingType': None})" - ] - }, - "execution_count": 69, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "deployment = client.create_deployment(model_id=agent.agent_id)\n", - "deployment.wait_for_deployment()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# You can use below command, to get the response from a deployed ChatLLM model / This ChatLLM model might be using RAG under the hood.\n", - "r = client.get_chat_response(deployment_token=client.list_deployment_tokens(project_id)[0], deployment_id='fddsfff', messages=[{\"is_user\":True,\"text\":\"What is the meaning of life?\"}])\n", - "print(r.keys())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Agent JSON Schema\n", - "Below instructions are only relevant for when you are creating an \"AI Agent\". The \"json_schema\" variable allows you to create a custom UX that the user can use to interact with the agent. You can find a playground for the json schema here: https://rjsf-team.github.io/react-jsonschema-form/.\n", - "\n", - "Below is an example of a json_schema that allows user to:\n", - "1. Upload a document\n", - "2. Select an option" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Additional parameters for the JSON Schema allows you to create the UX\n", - "json_schema ={\n", - "\"type\": \"object\",\n", - "\"title\": \"Upload Document and select task\",\n", - "\"required\": [\"document\", \"options\"],\n", - "\"properties\": {\n", - " \"document\": {\n", - " \"type\": \"string\",\n", - " \"title\": \"Upload Document\",\n", - " \"format\": \"data-url\"},\n", - " \"options\": {\n", - " \"type\": \"string\",\n", - " \"title\": \"Options\",\n", - " \"enum\": [\"extract_rfp_questions\", \"complete_rfp_questions\"],\n", - " \"enumNames\": [\"Extract RFP Questions\", \"Complete RFP Questions\"]\n", - " }\n", - "}\n", - "}\n", - "\n", - "# The WorkflowNodeOutputSchema allows you to setup the output to be of data-url so that the user can download.\n", - "output_schema=WorkflowNodeOutputSchema(\n", - " json_schema={\n", - " \"type\": \"object\",\n", - " \"properties\": {\n", - " \"processed_doc\": {\n", - " \"type\": \"string\",\n", - " \"title\": \"Responses\",\n", - " \"format\": \"data-url\",\n", - " }\n", - " },\n", - " }\n", - ")\n", - "\n", - "# Here is how the Agent Response should look like:\n", - "return AgentResponse(processed_document_doc=Blob(doc_bytes,\"application/vnd.openxmlformats-officedocument.wordprocessingml.document\",filename=f\"result.docx\",))" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/examples/fullscript.py b/examples/fullscript.py deleted file mode 100644 index 398eb5756..000000000 --- a/examples/fullscript.py +++ /dev/null @@ -1,50 +0,0 @@ -from sys import stderr -import pandas as pd -import numpy as np -from sklearn.preprocessing import QuantileTransformer -from sklearn.linear_model import 
LinearRegression - - -def transform_concrete(concrete_dataset): - feature_df = concrete_dataset.drop(['flyash'], axis=1) - no_flyash = feature_df[concrete_dataset.flyash == 0.0] - flyash = feature_df[concrete_dataset.flyash > 0.0] - mean_df = no_flyash.mean() - print(mean_df) - return pd.concat([no_flyash - no_flyash.assign(age=0).mean(), flyash - flyash.assign(age=0).mean()]) - - -def to_quantiles(X): - qt = QuantileTransformer(n_quantiles=20) - X_q = qt.fit_transform(X.values) - print(qt.quantiles_) - return qt, X_q - -def train_model(training_dataset): - np.random.seed(5) - - X = training_dataset.drop(['csMPa'], axis=1) - print(X.mean()) - y = training_dataset.csMPa - qt, X_q = to_quantiles(X) - - recent_model = LinearRegression() - fit_result = recent_model.fit(X_q, y) - print(fit_result) - model_r2 = recent_model.score(qt.transform(X.values), y) - - print(f'Linear model R^2 = {model_r2}') - if model_r2 < 0.50: - raise RuntimeError('Could not get a model with sufficient accuracy') - if model_r2 < 0.85: - print(f'Got a low R^2 {model_r2}', file=stderr) - - return (X.columns, qt, recent_model) - - -def predict(model, query): - columns, qt, recent_model = model - import pandas as pd - X = pd.DataFrame({c: [query.get(c, 0.0)] for c in columns}) - y = recent_model.predict(qt.transform(X.values))[0] - return {'csMPa': y} \ No newline at end of file diff --git a/examples/language/0.calling_large_language_models.ipynb b/examples/language/0.calling_large_language_models.ipynb new file mode 100644 index 000000000..a8e0ca4e6 --- /dev/null +++ b/examples/language/0.calling_large_language_models.ipynb @@ -0,0 +1,81 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Calling a Large Language Model\n", + "You can use the `evalute_prompt` method to call the LLM of your choice:\n", + "- prompt: This is the actual message that the model receives from the user\n", + "- system_message: These are the instructions that the model will follow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Athens\n" + ] + } + ], + "source": [ + "r = client.evaluate_prompt(prompt = \"What is the capital of Greece?\", system_message = \"You should answer all questions with a single word.\", llm_name = \"OPENAI_GPT4O\")\n", + "\n", + "# Response:\n", + "print(r.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Responding with JSON\n", + "You can also use the `json_response_schema` to specify the output of the model in a pre-defined manner" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'learning_objectives': ['Understand the components and functions of car batteries',\n", + " 'Learn how to maintain and troubleshoot car batteries',\n", + " 'Gain knowledge about the different types of car doors and their mechanisms',\n", + " 'Learn how to repair and replace car doors',\n", + " 'Understand the principles and components of car suspension systems',\n", + " 'Learn how to diagnose and fix common issues in car suspension systems']}" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import json\n", + "\n", + "r = client.evaluate_prompt(prompt = \"In this course, you will learn about car batteries, car doors, and car suspension system\",\n", + " # system_message = \"OPTIONAL, but good to have\", \n", + " llm_name = 
'OPENAI_GPT4O',\n", + "                       json_response_schema = {\"learning_objectives\": {\"type\": \"list\", \"description\": \"A list of learning objectives\", \"is_required\": True}}\n", + ")\n", + "learning_objectives = json.loads(r.content)\n", + "learning_objectives" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/language/1.extracting_text.ipynb b/examples/language/1.extracting_text.ipynb new file mode 100644 index 000000000..0621463a9 --- /dev/null +++ b/examples/language/1.extracting_text.ipynb @@ -0,0 +1,203 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Loading Documents\n", + "When documents are uploaded into the platform, they are uploaded as a special class type, `BlobInput`. \n", + "\n", + "To test Abacus functionality locally in your notebook, transform them to the `BlobInput` class" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Here, we upload a training file from the current location of the notebook\n", + "# You can add files to Jupyter Notebook by drag and drop\n", + "from abacusai.client import BlobInput\n", + "import abacusai\n", + "client = abacusai.ApiClient('YOUR_API_KEY')\n", + "document = BlobInput.from_local_file(\"YOUR_DOCUMENT.pdf/word/etc\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Extract text from a local document\n", + "You can extract text using two methods:\n", + "1. Embedded Text Extraction --> That means extracting the text that is already in the document. It's fast and works well for modern documents.\n", + "2. OCR ---> Extracts the text as seen by the end user. Works very well for scanned documents, when there are tables involved, etc.\n", + "\n", + "First, let's take a look at **Embedded Text Extraction**" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "UNITED STATES\n", + "SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 20549\n", + "____________________________\n", + "\n", + "UNITED STATES\n", + "SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 20549\n", + "____________________________\n" + ] + } + ], + "source": [ + "extracted_doc_data = client.extract_document_data(document.contents)\n", + "\n", + "# print first 100 characters of page 0\n", + "print(extracted_doc_data.pages[0][0:100])\n", + "print()\n", + "# print first 100 characters of all embedded text\n", + "print(extracted_doc_data.embedded_text[0:100]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's extract data using **OCR**. Please note that there are multiple `ocr_mode` values and multiple settings. Refer to the official Python SDK documentation for all of them." + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "UNITED STATES\n", + "SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 
20549\n", + "(Mark One)\n", + "ANNUAL REPORT PUR\n" + ] + } + ], + "source": [ + "extracted_doc_data = client.extract_document_data(document.contents, \n", + " document_processing_config={'extract_bounding_boxes': True,'ocr_mode': 'DEFAULT', 'use_full_ocr':True})\n", + "\n", + "# Print first 100 characters of extracted_page_text\n", + "print(extracted_doc_data.extracted_text[0:100])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Extract Text from a document that has already been uploaded into the platform\n", + "\n", + "When you upload documents directly into the platform, depending on the settings you choose, you will already have access to `embedded_text` or `extracted_text`, etc. Here is how you can load the text of a file that has already been uploaded into the file:\n", + "\n", + "1. Find the `doc_id`. You can find that in the feature group where the documents where uploaded under `doc_id` column.\n", + "2. Use `get_docstore_document_data` to get document's data.\n", + "\n", + "If OCR is not used when ingesting the document, then `extracted_text` won't exist" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "------------------------------\n", + "Embedded Text:\n", + "\n", + "UNITED STATES\n", + "SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 20549\n", + "FORM 10-K\n", + "(Mark One)\n", + "☒\n", + "ANNUA\n", + "------------------------------\n", + "Extracted (OCR) Text:\n", + "\n", + "UNITED STATES\n", + "SECURITIES AND EXCHANGE COMMISSION\n", + "Washington, D.C. 20549\n", + "(Mark One)\n", + "ANNUAL REPORT PUR\n" + ] + } + ], + "source": [ + "doc_data = client.get_docstore_document_data('115fd750d0-000000000-bde8f7f6ce6065337e599fcac194739685fb3d3060650f6d7ef95bac914c72bc')\n", + "# print first 100 chracters from embedded text\n", + "print('------------------------------')\n", + "print('Embedded Text:\\n')\n", + "print(doc_data.embedded_text[0:100])\n", + "print('------------------------------')\n", + "# print first 100 chracters from OCR detected text\n", + "print('Extracted (OCR) Text:\\n')\n", + "print(extracted_doc_data.extracted_text[0:100]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load a feature group with documents locally as a pandas dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# To access docstore later, or when it was created outside of this notebook, we may use the name or id of it by functions describe_feature_group_by_table_name or describe_feature_group, respectively\n", + "\n", + "df = client.describe_feature_group_by_table_name('YOUR_FEATURE_GROUP_NAME').load_as_pandas_documents(doc_id_column = 'doc_id',document_column = 'page_infos')\n", + "df['page_infos'][0].keys()\n", + "# dict_keys(['pages', 'tokens', 'metadata', 'extracted_text'])\n", + "\n", + "#pages: This is the embedded text from the document on a per page level\n", + "#extracted_text: This is the OCR extracted text from the document" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.6" + } + }, + "nbformat": 4, + 
"nbformat_minor": 2 +} diff --git a/examples/language/2.document_retriever.ipynb b/examples/language/2.document_retriever.ipynb new file mode 100644 index 000000000..f6b17bf6f --- /dev/null +++ b/examples/language/2.document_retriever.ipynb @@ -0,0 +1,230 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Document Retriever\n", + "\n", + "A document retriever is Abacus own vector database. It can be used to:\n", + "1. Create embeddings of documents\n", + "2. Retrieve document chunks that are semantically close to a phrase that the user passes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### How to create RAG on the fly with a local document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Here, we upload training file from the current location of the notebook\n", + "# You can add files to Jupyter Notebook by drag and drop\n", + "from abacusai.client import BlobInput\n", + "import abacusai\n", + "client = abacusai.ApiClient('YOUR_API_KEY')\n", + "document = BlobInput.from_local_file(\"YOUR_DOCUMENT.pdf/word\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Returns chunks of documents that are relevant to the query and can be used to feed an LLM\n", + "# Example for blob in memory of notebook\n", + "\n", + "relevant_snippets = client.get_relevant_snippets(\n", + " blobs={\"document\": document.contents},\n", + " query=\"What are the key terms\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Returns chunks of documents that are relevant to the query and can be used to feed an LLM\n", + "# Example for document in the docstore\n", + "\n", + "relevant_snippets = client.get_relevant_snippets(\n", + " doc_ids = ['YOUR_DOC_ID_1','YOUR_DOC_ID_2'],\n", + " query=\"What are the key terms\")\n", + "\n", + "relevant_snippets" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Using A document Retriever as a standalone deployment\n", + "You can also use a documen retriever, even if a ChatLLM model is not trained!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# First we connect our docstore to our project\n", + "\n", + "client.add_feature_group_to_project(\n", + " feature_group_id='YOUR_FEATURE_GROUP_ID_WITH_DOCUMENTS'\n", + " project_id='YOUR_PROJECT_ID',\n", + " feature_group_type='DOCUMENTS' # Optional, defaults to 'CUSTOM_TABLE'. 
But it is important to set 'DOCUMENTS', as it enables the document retriever to work properly with it\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "InferredFeatureMappings(error='',\n", + "  feature_mappings=[FeatureMapping(feature_mapping='DOCUMENT_ID',\n", + "  feature_name='doc_id'), FeatureMapping(feature_mapping='DOCUMENT',\n", + "  feature_name='file_description')])" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "ifm = client.infer_feature_mappings(project_id='PROJECT_ID',feature_group_id='FEATURE_GROUP_ID')\n", + "\n", + "# ifm = client.infer_feature_mappings(project_id='15ed76a6a8',feature_group_id='98a8d9cce')\n", + "ifm" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# This block of code might be useful to fix the feature group for docstore usage by document retrievers\n", + "\n", + "# client.set_feature_group_type(project_id='YOUR_PROJECT_ID', feature_group_id='98a8d9cce', feature_group_type='DOCUMENTS')\n", + "# client.set_feature_mapping(project_id='YOUR_PROJECT_ID',feature_group_id = 'YOUR_FEATURE_GROUP_ID',feature_name='doc_id',feature_mapping='DOCUMENT_ID')\n", + "# client.set_feature_mapping(project_id='YOUR_PROJECT_ID',feature_group_id = 'YOUR_FEATURE_GROUP_ID',feature_name='page_infos',feature_mapping='DOCUMENT')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Creating a document retriever\n", + "\n", + "document_retriever = client.create_document_retriever(\n", + "    project_id='YOUR_PROJECT_ID',\n", + "    name='NAME_OF_YOUR_DOCUMENT_RETRIEVER',\n", + "    feature_group_id='YOUR_FEATURE_GROUP_ID'\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Accessing a document retriever that has already been created\n", + "\n", + "# dr = client.describe_document_retriever_by_name('DOCUMENT_RETRIEVER_NAME')\n", + "# dr" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "r = client.describe_document_retriever(document_retriever.id)\n", + "# Filters allow you to filter the documents that the doc retriever can use on the fly, using some columns of the training feature group that was used as input to the doc retriever.\n", + "# Filters are also available when using .get_chat_response\n", + "\n", + "client.get_matching_documents(document_retriever_id='DOCUMENT_RETRIEVER_ID', \n", + "             query='WHATEVER_YOU_NEED_TO_ASK',limit= 10,\n", + "             filters={\"state\": [\"MICHIGAN\",\"NATIONAL\"]})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Examples of document retriever usage\n", + "\n", + "res = document_retriever.get_matching_documents(\"Agreement of the Parties\")\n", + "len(res)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Example of getting no results\n", + "\n", + "res2 = document_retriever.get_matching_documents(\"planting potatoes on Mars\", 
required_phrases=['mars'])\n", + "res2" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/language/helpers/metadata_extractor.ipynb b/examples/language/helpers/metadata_extractor.ipynb new file mode 100644 index 000000000..077377098 --- /dev/null +++ b/examples/language/helpers/metadata_extractor.ipynb @@ -0,0 +1,196 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "bfbb9b2d", + "metadata": {}, + "source": [ + "#### Purpose of the script\n", + "\n", + "Metadata can be very useful in ChatLLM projects. One of the most common ways to extract metadata is to look at the `file_path` column in the feature group section and try to find metadata that can be extracted directly from there.\n", + "\n", + "This script utilises LLMs to do just that. The process is as follows:\n", + "1. Loads an FG (feature group) that is of type \"documentset\"\n", + "2. Finds a diverse set of example file paths\n", + "3. Provides sample SQL code that can be used to extract useful metadata from the file_path.\n", + "\n", + "Please note that the model might not do a perfect job of extracting useful metadata. Human review can help you iterate and improve the response, or you might choose to directly alter the generated SQL. Regardless, it provides a good starting point for extracting metadata from file_paths." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8b427da7", + "metadata": {}, + "outputs": [], + "source": [ + "FG_TABLE = 'YOUR_DOCUMENTS_FEATURE_GROUP'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7c849eef-4ec9-4cd6-b201-7078c8af1816", + "metadata": { + "execution": { + "iopub.execute_input": "2024-11-20T18:08:28.593869Z", + "iopub.status.busy": "2024-11-20T18:08:28.593102Z", + "iopub.status.idle": "2024-11-20T18:10:05.111986Z", + "shell.execute_reply": "2024-11-20T18:10:05.111313Z", + "shell.execute_reply.started": "2024-11-20T18:08:28.593835Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import abacusai\n", + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn.feature_extraction.text import TfidfVectorizer\n", + "from sklearn.cluster import KMeans\n", + "client = abacusai.ApiClient()\n", + "\n", + "df = client.describe_feature_group_by_table_name(FG_TABLE).load_as_pandas()\n", + "\n", + "def get_diverse_samples(df, n_samples=100):\n", + "    # Convert file paths to a more normalized form and remove the filename\n", + "    paths = df['file_path'].apply(lambda x: x.lower().replace('\\\\', '/'))\n", + "    # Get just the directory paths by removing the last component\n", + "    dir_paths = paths.apply(lambda x: '/'.join(x.split('/')[:-1]))\n", + "    \n", + "    # Create TF-IDF vectors from the directory paths\n", + "    vectorizer = TfidfVectorizer(\n", + "        analyzer='char',\n", + "        ngram_range=(3, 3),  # Use character trigrams\n", + "        max_features=2000\n", + "    )\n", + "    \n", + "    path_vectors = vectorizer.fit_transform(dir_paths)\n", + "    \n", + "    # Determine number of clusters\n", + "    n_clusters = min(n_samples, len(df))\n", + "    \n", + "    # Perform clustering\n", + "    kmeans = KMeans(n_clusters=n_clusters, random_state=42)\n", + "    clusters = kmeans.fit_predict(path_vectors)\n", + "    \n", + "    # Create a list to store samples from each cluster\n", + "    samples = []\n", + "    \n", + "    # For each cluster, select one random sample\n", + "    for cluster_id in range(n_clusters):\n", + "        cluster_mask = clusters == cluster_id\n", + "        cluster_indices = 
np.where(cluster_mask)[0]\n", + " \n", + " if len(cluster_indices) > 0:\n", + " # Randomly select one sample from this cluster\n", + " selected_idx = np.random.choice(cluster_indices)\n", + " samples.append(df.iloc[selected_idx])\n", + " \n", + " # Convert to DataFrame\n", + " result_df = pd.DataFrame(samples)\n", + " \n", + " # If we need more samples to reach n_samples, add random ones\n", + " if len(result_df) < n_samples:\n", + " remaining = n_samples - len(result_df)\n", + " additional = df.sample(n=remaining)\n", + " \n", + " # Instead of using drop_duplicates, we'll use index-based deduplication\n", + " combined_indices = pd.Index(result_df.index).union(additional.index)\n", + " result_df = df.loc[combined_indices]\n", + " \n", + " # If we still need more samples after deduplication\n", + " if len(result_df) < n_samples:\n", + " more_samples = df.drop(result_df.index).sample(n=n_samples - len(result_df))\n", + " result_df = pd.concat([result_df, more_samples])\n", + " \n", + " return result_df\n", + "\n", + "# Example usage:\n", + "diverse_samples = get_diverse_samples(df, n_samples=100)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d65316d-95b8-467f-a69f-99ef9c165ae4", + "metadata": { + "execution": { + "iopub.execute_input": "2024-11-20T18:25:58.645002Z", + "iopub.status.busy": "2024-11-20T18:25:58.644413Z", + "iopub.status.idle": "2024-11-20T18:25:58.649060Z", + "shell.execute_reply": "2024-11-20T18:25:58.648350Z", + "shell.execute_reply.started": "2024-11-20T18:25:58.644974Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "system_message = \"\"\"\n", + "Here are the revised behavior instructions:\n", + "\n", + "---\n", + "\n", + "**Objective:** \n", + "Your task is to assist in crafting a SQL query to extract metadata from file paths. This metadata will support a RAG-based application by identifying patterns that enhance our approach. We aim to extract metadata that is meaningful but not overly detailed, focusing primarily on the document type.\n", + "\n", + "**Instructions:** \n", + "1. **Analyze File Paths:** \n", + " Review the provided example file paths to determine the most relevant metadata to extract. The goal is to identify patterns that can inform the RAG application.\n", + "\n", + "2. **Create SQL Query:** \n", + " Develop a SQL query that effectively extracts the necessary metadata. Use the example below as a guide, but adapt it based on the file paths you receive. The query should be logical and straightforward, avoiding unnecessary complexity.\n", + "\n", + "3. **Example SQL Query:** \n", + "\n", + "```\n", + " SELECT *, \n", + " CASE WHEN UPPER(split(file_path, '/')[4]) LIKE 'GENERAL SERVICES' THEN UPPER(split(file_path, '/')[5])\n", + " WHEN UPPER(split(file_path, '/')[4]) LIKE 'SPECIFIC SERVICES' THEN UPPER(split(file_path, '/')[5])\n", + " ELSE UPPER(split(file_path, '/')[4])\n", + " END AS services_type,\n", + " UPPER(split(file_path, '/')[3]) AS document_type\n", + " FROM TABLE\n", + "```\n", + "4. **Adaptation:** \n", + " In the example, the query extracts metadata based on specific conditions. You may find that a simpler approach is sufficient. Consider creating 3-4 metadata columns if necessary, ensuring they provide valuable insights without being overly granular.\n", + "\n", + "5. **Be Sensible:**\n", + " Use your judgment to determine the most effective way to extract metadata. The complexity of the query should match the complexity of the file paths and the needs of the application. 
\n", + " If a metadata column has the same values across the board, then we don't care about it. Furthermore, avoid creating metadata columns that would extract the final documents name as metadata. We don't want that at all.\n", + "\"\"\"\n", + "r = client.evaluate_prompt(prompt = str(diverse_samples['file_path'].unique()), system_message = system_message)\n", + "print(r.content)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1fecc230-c09a-4229-8b8f-90a41961b831", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/prompting_iteration.ipynb b/examples/language/helpers/prompting_iteration.ipynb similarity index 54% rename from examples/prompting_iteration.ipynb rename to examples/language/helpers/prompting_iteration.ipynb index 20f0c82e6..9cbae42de 100644 --- a/examples/prompting_iteration.ipynb +++ b/examples/language/helpers/prompting_iteration.ipynb @@ -7,14 +7,12 @@ "source": [ "### Data Agent Prompting Template\n", "\n", - "Normally, to overwrite the behavior or data prompt context instructions of a ChatLLM model, you would need to go to the Abacus.AI user interface and change it. That is a cubersome process since you need to wait for re-deployment before doing your testing.\n", + "Normally, you have three ways to iterate on the behavior instructions of the model:\n", + "1. Retrain model (This is a slow process since you need to wait for the model to update)\n", + "2. Use the model's \"Playground\" on the deployment page.\n", + "3. Use the API\n", "\n", - "Using this script, you will be able to:\n", - "1. Overwrite the instructions of the model on the fly - without changing them permanently\n", - "2. Make an API request to the model\n", - "3. Iterate faster\n", - "\n", - "If you comment out the `chat_config`, then the model will just use the instructions that it has been trained on. If you leave it, then we will overwrite instructions." + "The script shows you how you can use the API to actually call a ChatLLM deployment and change the behavior settings as you see fit. Using the API has some advantages over the \"Playground\" depending on the user's expertise level." 
] }, { @@ -34,7 +32,7 @@ }, "outputs": [], "source": [ - "from abacusai import PredictionClient\n", + "from abacusai import PredictionClient, ApiClient\n", "client = PredictionClient()\n", "\n", "deployment_token = \"\"\n", @@ -46,12 +44,27 @@ "\n", "data_prompt = \"\"\"\n", "FILL IN YOUR DATA PROMPT INSTRUCTIONS\n", + "\"\"\"\n", + "\n", + "response_instructions = \"\"\"\n", + "FILL IN YOUR RESPONSE INSTRUCTIONS\n", "\"\"\"" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, + "id": "e1d03576", + "metadata": {}, + "outputs": [], + "source": [ + "chat_config = {'data_prompt_context': data_prompt, 'behavior_instructions':behavior, 'response_instructions':response_instructions}\n", + "#chat_config={\"data_prompt_table_context\": {\"concrete_strength\": \"This table has information about cement\"}} -- How to add table-level information" + ] + }, + { + "cell_type": "code", + "execution_count": null, "id": "4e872cb7-6e04-42c5-b657-ff20aa6da716", "metadata": { "editable": true, @@ -67,24 +80,47 @@ "outputs": [], "source": [ "question = 'What is the .....'\n", - "chat_config = {'data_prompt_context': data_prompt, 'behavior_instructions':behavior}\n", "\n", "r = client.get_chat_response(deployment_token=deployment_token,\n", "              deployment_id=deployment_id,\n", "              messages=[{\"is_user\":True,\"text\":question}],\n", - "              chat_config=chat_config\n", + "              chat_config=chat_config,\n", + "              #llm_name = \"\" # In case you want to change the LLM that is used to provide the answer\n", "              )\n", "\n", "print(r['messages'][1]['text'])" ] }, + { + "cell_type": "markdown", + "id": "a1e47e7c", + "metadata": {}, + "source": [ + "#### How to get current model config -- if required" + ] + }, { "cell_type": "code", "execution_count": null, "id": "99dc02ea-968e-414f-b41f-f91c9e607218", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "client = ApiClient()\n", + "project_id = \"YOUR_PROJECT_ID\"\n", + "models = client.list_models(project_id)\n", + "\n", + "for model in models:\n", + "    print(f\"Model Name: {model.name}\")\n", + "    print(f\"Model ID: {model.id}\")\n", + "    print(\"Latest Model Version:\")\n", + "    latest_version = client.describe_model_version(model.latest_model_version.model_version)\n", + "    print(f\"  Version: {latest_version.model_version}\")\n", + "    print(f\"  Status: {latest_version.status}\")\n", + "    print(\"  Training Config:\")\n", + "    print(latest_version.model_config.to_dict())\n", + "    print(\"\\n\" + \"=\"*50 + \"\\n\")" + ] } ], "metadata": {