Merged
18 changes: 16 additions & 2 deletions .github/workflows/ml-llamaindex.yml
Original file line number Diff line number Diff line change
@@ -40,10 +40,15 @@ jobs:
'ubuntu-latest',
]
python-version: [
'3.8',
'3.10',
'3.13',
]
cratedb-version: [ 'nightly' ]
cratedb-version: [
'nightly',
]
cratedb-mcp-version: [
'pr-50',
]
@amotl (Member Author), Jul 20, 2025:

This needs to be adjusted after the next release of cratedb-mcp.

Resolved with a43ee4e.
services:
cratedb:
@@ -53,6 +58,15 @@ jobs:
- 5432:5432
env:
CRATE_HEAP_SIZE: 4g
cratedb-mcp:
image: ghcr.io/crate/cratedb-mcp:${{ matrix.cratedb-mcp-version }}
ports:
- 8000:8000
env:
CRATEDB_MCP_TRANSPORT: streamable-http
CRATEDB_MCP_HOST: 0.0.0.0
CRATEDB_MCP_PORT: 8000
CRATEDB_CLUSTER_URL: http://crate:crate@cratedb:4200/

env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
59 changes: 46 additions & 13 deletions topic/machine-learning/llama-index/README.md
@@ -1,12 +1,15 @@
# Connecting CrateDB Data to an LLM with LlamaIndex and Azure OpenAI
# NL2SQL with LlamaIndex: Querying CrateDB using natural language

This folder contains the codebase for [this tutorial](https://community.cratedb.com/t/how-to-connect-your-cratedb-data-to-llm-with-llamaindex-and-azure-openai/1612) on the CrateDB community forum. You should read the tutorial for instructions on how to set up the components that you need on Azure, and use this README for setting up CrateDB and the Python code.
Connecting CrateDB to an LLM with LlamaIndex and Azure OpenAI,
optionally using MCP. See also the [LlamaIndex Text-to-SQL Guide].

This has been tested using:
This folder contains the codebase for the tutorial
[How to connect your CrateDB data to LLM with LlamaIndex and Azure OpenAI]
on the CrateDB community forum.

* Python 3.12
* macOS
* CrateDB 5.8 and higher
You should read the tutorial for instructions on how to set up the components
that you need on Azure, and use this README for setting up CrateDB and the
Python code.

## Database Setup

@@ -57,7 +60,7 @@ VALUES

Create and activate a virtual environment:

```
```shell
python3 -m venv .venv
source .venv/bin/activate
```
@@ -81,23 +84,25 @@ OPENAI_AZURE_ENDPOINT=https://<Your endpoint from Azure e.g. myendpoint.openai.a
OPENAI_AZURE_API_VERSION=2024-08-01-preview
LLM_INSTANCE=<The name of your Chat GPT 3.5 turbo instance from Azure>
EMBEDDING_MODEL_INSTANCE=<The name of your Text Embedding Ada 2.0 instance from Azure>
CRATEDB_SQLALCHEMY_URL="crate://<Database user name>:<Database password>@<Database host>:4200/?ssl=true"
CRATEDB_SQLALCHEMY_URL=crate://<Database user name>:<Database password>@<Database host>:4200/?ssl=true
CRATEDB_TABLE_NAME=time_series_data
```
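
As a quick sanity check that is not part of the tutorial itself, the connection string can be split into its parts with Python's standard library before handing it to SQLAlchemy. The host and credentials below are placeholders, not real values:

```python
from urllib.parse import urlsplit

def inspect_cratedb_url(url: str) -> dict:
    """Break a crate:// SQLAlchemy URL into its parts for a quick sanity check."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "ssl": "ssl=true" in parts.query,
    }

# Placeholder credentials and host, mirroring the .env example above.
info = inspect_cratedb_url("crate://user:secret@my-cluster.example:4200/?ssl=true")
print(info)
# → {'scheme': 'crate', 'host': 'my-cluster.example', 'port': 4200, 'ssl': True}
```

If any part comes back empty, the URL in `.env` is malformed and SQLAlchemy will fail later with a less obvious error.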

Save your changes.

## Run the Code

Run the code like so:
### NLSQL

[LlamaIndex's NLSQLTableQueryEngine] is a natural language SQL table query engine.

Run the code like so:
```bash
python main.py
python demo_nlsql.py
```

Here's the expected output:

```
```text
Creating SQLAlchemy engine...
Connecting to CrateDB...
Creating SQLDatabase instance...
@@ -124,4 +129,32 @@ Answer was: The average value for sensor 1 is 17.033333333333335.
'avg(value)'
]
}
```
```

### MCP

Spin up the [CrateDB MCP server], connecting it to CrateDB on localhost.
```bash
export CRATEDB_CLUSTER_URL=http://crate:crate@localhost:4200/
export CRATEDB_MCP_TRANSPORT=streamable-http
uvx cratedb-mcp serve
```

Run the code using the OpenAI API:
```bash
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
python demo_mcp.py
```
Expected output:
```text
Running query
Inquiring MCP server
Query was: What is the average value for sensor 1?
Answer was: The average value for sensor 1 is approximately 17.03.
```


[CrateDB MCP server]: https://cratedb.com/docs/guide/integrate/mcp/cratedb-mcp.html
[How to connect your CrateDB data to LLM with LlamaIndex and Azure OpenAI]: https://community.cratedb.com/t/how-to-connect-your-cratedb-data-to-llm-with-llamaindex-and-azure-openai/1612
[LlamaIndex's NLSQLTableQueryEngine]: https://docs.llamaindex.ai/en/stable/api_reference/query_engine/NL_SQL_table/
[LlamaIndex Text-to-SQL Guide]: https://docs.llamaindex.ai/en/stable/examples/index_structs/struct_indices/SQLIndexDemo/
57 changes: 57 additions & 0 deletions topic/machine-learning/llama-index/boot.py
@@ -0,0 +1,57 @@
import os
from typing import Tuple

import openai
from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings
from llama_index.core.base.embeddings.base import BaseEmbedding
from llama_index.core.llms import LLM
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.langchain import LangchainEmbedding


MODEL_NAME = "gpt-4o"


def configure_llm() -> Tuple[LLM, BaseEmbedding]:
"""
Configure the LLM. Use either vanilla OpenAI or Azure OpenAI.
"""

openai.api_type = os.getenv("OPENAI_API_TYPE")
openai.azure_endpoint = os.getenv("OPENAI_AZURE_ENDPOINT")
openai.api_version = os.getenv("OPENAI_AZURE_API_VERSION")
openai.api_key = os.getenv("OPENAI_API_KEY")

if openai.api_type == "openai":
llm = OpenAI(
model=MODEL_NAME,
temperature=0.0,
api_key=os.getenv("OPENAI_API_KEY"),
)
elif openai.api_type == "azure":
llm = AzureOpenAI(
model=MODEL_NAME,
temperature=0.0,
engine=os.getenv("LLM_INSTANCE"),
azure_endpoint=os.getenv("OPENAI_AZURE_ENDPOINT"),
api_key=os.getenv("OPENAI_API_KEY"),
api_version=os.getenv("OPENAI_AZURE_API_VERSION"),
)
else:
raise ValueError(f"OpenAI API type not defined or invalid: {openai.api_type}")

if openai.api_type == "openai":
embed_model = LangchainEmbedding(OpenAIEmbeddings(model=MODEL_NAME))
elif openai.api_type == "azure":
embed_model = LangchainEmbedding(
AzureOpenAIEmbeddings(
azure_endpoint=os.getenv("OPENAI_AZURE_ENDPOINT"),
model=os.getenv("EMBEDDING_MODEL_INSTANCE")
)
)
else:
embed_model = None

return llm, embed_model
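
The branching in `configure_llm()` boils down to a small dispatch on `OPENAI_API_TYPE`. A minimal, dependency-free sketch of just that dispatch; the backend names returned here are illustrative, not part of the tutorial:

```python
import os

def select_backend() -> str:
    # Mirrors the dispatch in boot.configure_llm(): OPENAI_API_TYPE selects
    # between vanilla OpenAI and Azure OpenAI; anything else is an error.
    api_type = os.getenv("OPENAI_API_TYPE")
    if api_type == "openai":
        return "vanilla-openai"
    if api_type == "azure":
        return "azure-openai"
    raise ValueError(f"OpenAI API type not defined or invalid: {api_type}")

os.environ["OPENAI_API_TYPE"] = "azure"
print(select_backend())  # → azure-openai
```

Failing fast on an unknown or missing `OPENAI_API_TYPE` surfaces configuration mistakes at startup instead of at the first LLM call.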
89 changes: 89 additions & 0 deletions topic/machine-learning/llama-index/demo_mcp.py
@@ -0,0 +1,89 @@
"""
Use an LLM to query a database in natural language via MCP.
Example code using LlamaIndex with vanilla OpenAI and Azure OpenAI.

https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/tools/llama-index-tools-mcp

## Start CrateDB MCP Server
```
export CRATEDB_CLUSTER_URL="http://localhost:4200/"
cratedb-mcp serve --transport=streamable-http
```

## Usage
```
source env.standalone
export OPENAI_API_KEY=sk-XJZ7pfog5Gp8Kus8D--invalid--0CJ5lyAKSefZLaV1Y9S1
python demo_mcp.py
```
"""
import asyncio
import os

from cratedb_about.instruction import Instructions

from dotenv import load_dotenv
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.llms import LLM
from llama_index.tools.mcp import BasicMCPClient, McpToolSpec

from boot import configure_llm


class Agent:

def __init__(self, llm: LLM):
self.llm = llm

async def get_tools(self):
# Connect to the CrateDB MCP server using `streamable-http` transport.
mcp_url = os.getenv("CRATEDB_MCP_URL", "http://127.0.0.1:8000/mcp/")
mcp_client = BasicMCPClient(mcp_url)
mcp_tool_spec = McpToolSpec(
client=mcp_client,
# Optional: Filter the tools by name
# allowed_tools=["tool1", "tool2"],
# Optional: Include resources in the tool list
# include_resources=True,
)
return await mcp_tool_spec.to_tool_list_async()

async def get_agent(self):
return FunctionAgent(
name="Agent",
description="CrateDB text-to-SQL agent",
llm=self.llm,
tools=await self.get_tools(),
system_prompt=Instructions.full(),
)

async def aquery(self, query):
return await (await self.get_agent()).run(query)

def query(self, query):
print("Inquiring MCP server")
return asyncio.run(self.aquery(query))


def main():
"""
Use an LLM to query a database in natural language.
"""

# Configure application.
load_dotenv()
llm, embed_model = configure_llm()

# Use an agent that uses the CrateDB MCP server.
agent = Agent(llm)

# Invoke an inquiry.
print("Running query")
QUERY_STR = "What is the average value for sensor 1?"
answer = agent.query(QUERY_STR)
print("Query was:", QUERY_STR)
print("Answer was:", answer)


if __name__ == "__main__":
main()
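
`Agent.query()` above bridges from synchronous code into the async LlamaIndex workflow with `asyncio.run()`. The pattern in isolation, with a stand-in coroutine instead of the real MCP round trip:

```python
import asyncio

async def aquery(question: str) -> str:
    # Stand-in for Agent.aquery(): awaits the (here simulated) MCP round trip.
    await asyncio.sleep(0)
    return f"answered: {question}"

def query(question: str) -> str:
    # Sync-over-async bridge, like Agent.query(): drive the coroutine to completion.
    return asyncio.run(aquery(question))

print(query("What is the average value for sensor 1?"))
# → answered: What is the average value for sensor 1?
```

Note that `asyncio.run()` creates a fresh event loop per call, so this bridge only works from code that is not already running inside an event loop.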
50 changes: 50 additions & 0 deletions topic/machine-learning/llama-index/demo_nlsql.py
@@ -0,0 +1,50 @@
"""
Use an LLM to query a database in natural language via NLSQLTableQueryEngine.
Example code using LlamaIndex with vanilla OpenAI and Azure OpenAI.
"""

import os
import sqlalchemy as sa

from dotenv import load_dotenv
from llama_index.core.utilities.sql_wrapper import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

from boot import configure_llm


def main():
"""
Use an LLM to query a database in natural language.
"""

# Configure application.
load_dotenv()
llm, embed_model = configure_llm()

# Configure database connection and query engine.
print("Connecting to CrateDB")
engine_crate = sa.create_engine(os.getenv("CRATEDB_SQLALCHEMY_URL"))
engine_crate.connect()

print("Creating LlamaIndex QueryEngine")
sql_database = SQLDatabase(engine_crate, include_tables=[os.getenv("CRATEDB_TABLE_NAME")])
query_engine = NLSQLTableQueryEngine(
sql_database=sql_database,
tables=[os.getenv("CRATEDB_TABLE_NAME")],
llm=llm,
embed_model=embed_model,
)

# Invoke an inquiry.
print("Running query")
QUERY_STR = "What is the average value for sensor 1?"
answer = query_engine.query(QUERY_STR)
print(answer.get_formatted_sources())
print("Query was:", QUERY_STR)
print("Answer was:", answer)
print(answer.metadata)


if __name__ == "__main__":
main()
16 changes: 8 additions & 8 deletions topic/machine-learning/llama-index/env.azure
@@ -1,8 +1,8 @@
OPENAI_API_KEY=TODO
OPENAI_API_TYPE=azure
OPENAI_AZURE_ENDPOINT=https://TODO.openai.azure.com
OPENAI_AZURE_API_VERSION=2024-08-01-preview
LLM_INSTANCE=TODO
EMBEDDING_MODEL_INSTANCE=TODO
CRATEDB_SQLALCHEMY_URL="crate://USER:PASSWORD@HOST:4200/?ssl=true"
CRATEDB_TABLE_NAME=time_series_data
export OPENAI_API_KEY=TODO
export OPENAI_API_TYPE=azure
export OPENAI_AZURE_ENDPOINT=https://TODO.openai.azure.com
export OPENAI_AZURE_API_VERSION=2024-08-01-preview
export LLM_INSTANCE=TODO
export EMBEDDING_MODEL_INSTANCE=TODO
export CRATEDB_SQLALCHEMY_URL="crate://USER:PASSWORD@HOST:4200/?ssl=true"
export CRATEDB_TABLE_NAME=time_series_data
6 changes: 3 additions & 3 deletions topic/machine-learning/llama-index/env.standalone
@@ -1,4 +1,4 @@
# OPENAI_API_KEY=sk-XJZ7pfog5Gp8Kus8D--invalid--0CJ5lyAKSefZLaV1Y9S1
OPENAI_API_TYPE=openai
CRATEDB_SQLALCHEMY_URL="crate://crate@localhost:4200/"
CRATEDB_TABLE_NAME=time_series_data
export OPENAI_API_TYPE=openai
export CRATEDB_SQLALCHEMY_URL="crate://crate@localhost:4200/"
export CRATEDB_TABLE_NAME=time_series_data