---
icon: lucide/database
---
A RAG (Retrieval-Augmented Generation) proxy server that lets you chat with your documents.
```bash
agent-cli rag-proxy [OPTIONS]
```

Enables "Chat with your Data" by running a local proxy server:
- Start the server, pointing to your documents folder and LLM
- The server watches the folder and indexes documents into a ChromaDB vector store
- Point any OpenAI-compatible client to this server's URL
- When you ask a question, the server retrieves relevant chunks and adds them to the prompt (see the sketch below)
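Conceptually, each request triggers a vector search before the LLM call. The sketch below illustrates that retrieval step using the `chromadb` library directly; the store path, collection name, and prompt template are illustrative assumptions, not the proxy's actual internals:

```python
# Illustrative sketch of the retrieval step; not the proxy's actual implementation.
# The store path, collection name, and prompt template are assumptions.
import chromadb

client = chromadb.PersistentClient(path="./rag_db")   # --chroma-path
collection = client.get_or_create_collection("docs")  # assumed collection name

question = "What do my notes say about X?"
results = collection.query(query_texts=[question], n_results=3)  # --limit

# Prepend the retrieved chunks to the question before forwarding to the LLM backend.
context = "\n\n".join(results["documents"][0])
augmented = f"Use this context to answer:\n{context}\n\nQuestion: {question}"
```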
Requires the `rag` extra:

```bash
pip install "agent-cli[rag]"
# or from repo
uv sync --extra rag
```

```bash
# With local LLM (Ollama)
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-base-url http://localhost:11434/v1 \
  --port 8000

# With OpenAI
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-api-key sk-... \
  --port 8000

# Use with agent-cli chat
agent-cli chat --openai-base-url http://localhost:8000/v1 --llm-provider openai
```

RAG options:

| Option | Default | Description |
|---|---|---|
| `--docs-folder` | `./rag_docs` | Folder to watch for documents. |
| `--chroma-path` | `./rag_db` | Path to ChromaDB persistence directory. |
| `--limit` | `3` | Number of document chunks to retrieve per query. |
| `--rag-tools/--no-rag-tools` | `true` | Allow agent to fetch full documents when snippets are insufficient. |
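For example, to index a notes folder, keep the vector store in a custom location, and retrieve five chunks per query (paths and values are illustrative):

```bash
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --chroma-path ~/.cache/agent-cli/rag_db \
  --limit 5
```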
OpenAI-compatible backend options:

| Option | Default | Description |
|---|---|---|
| `--openai-base-url` | - | Custom base URL for an OpenAI-compatible API (e.g., for llama-server: `http://localhost:8080/v1`). |
| `--openai-api-key` | - | Your OpenAI API key. Can also be set with the `OPENAI_API_KEY` environment variable. |
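The key can also come from the environment instead of the flag:

```bash
export OPENAI_API_KEY="sk-..."
agent-cli rag-proxy --docs-folder ~/Documents/Notes
```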
Embedding options:

| Option | Default | Description |
|---|---|---|
| `--embedding-model` | `text-embedding-3-small` | Embedding model to use for vectorization. |
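If your backend serves an OpenAI-compatible embeddings endpoint, you can swap in a different model. The model name below is a hypothetical example, and whether embedding requests go to the same `--openai-base-url` backend depends on your setup:

```bash
agent-cli rag-proxy \
  --docs-folder ~/Documents/Notes \
  --openai-base-url http://localhost:11434/v1 \
  --embedding-model nomic-embed-text  # hypothetical local embedding model
```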
Server options:

| Option | Default | Description |
|---|---|---|
| `--host` | `0.0.0.0` | Host/IP to bind API servers to. |
| `--port` | `8000` | Port to bind to. |
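The default `0.0.0.0` binds on all interfaces, which exposes the proxy to your local network; to keep it reachable from this machine only:

```bash
agent-cli rag-proxy --host 127.0.0.1 --port 8000
```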
General options:

| Option | Default | Description |
|---|---|---|
| `--log-level` | `INFO` | Set logging level. |
| `--config` | - | Path to a TOML configuration file. |
| `--print-args` | `false` | Print the command-line arguments, including values taken from the configuration file. |
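To verify what the server will actually run with (flags merged with config-file values), combine `--config` and `--print-args`; the config path here is illustrative:

```bash
agent-cli rag-proxy --config ~/.config/agent-cli/config.toml --print-args
```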
Text files (loaded directly):
`.txt`, `.md`, `.json`, `.py`, `.js`, `.ts`, `.yaml`, `.yml`, `.rs`, `.go`, `.c`, `.cpp`, `.h`, `.sh`, `.toml`, `.rst`, `.ini`, `.cfg`

Rich documents (converted via MarkItDown):
`.pdf`, `.docx`, `.pptx`, `.xlsx`, `.html`, `.htm`, `.csv`, `.xml`
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Your Client   │────▶│    RAG Proxy    │────▶│   LLM Backend   │
│  (chat, curl)   │◀────│      :8000      │◀────│ (Ollama/OpenAI) │
└─────────────────┘     └────────┬────────┘     └─────────────────┘
                                 │
                        ┌────────▼────────┐
                        │    ChromaDB     │
                        │  (Vector Store) │
                        └────────┬────────┘
                                 │
                        ┌────────▼────────┐
                        │   docs-folder   │
                        │  (Your Files)   │
                        └─────────────────┘
```
- RAG System Architecture - Detailed design and data flow
- memory - Long-term memory proxy (different retrieval and storage model)
- Memory System Architecture - How memory storage works
- Configuration - Config file keys and defaults
Any OpenAI-compatible client can use the RAG proxy:

```bash
# curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<your-model>", "messages": [{"role": "user", "content": "What do my notes say about X?"}]}'
```

```python
# Python (openai library)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="<your-model>",
    messages=[{"role": "user", "content": "Summarize my project notes"}],
)
```

- The server automatically re-indexes when files change.
- Use `--limit` to control how many document chunks are retrieved (see the example below).
- `--rag-tools` is on by default, letting the agent request full documents when snippets aren't enough.
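For example, to widen retrieval while turning off full-document fetching (values illustrative):

```bash
agent-cli rag-proxy --docs-folder ~/Documents/Notes --limit 5 --no-rag-tools
```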