πŸ€– Agentic RAG Chat

Self-hosted AI chat platform with RAG, guardrails, and streaming



Deploy your own AI assistant in minutes. Connect any OpenAI-compatible LLM (vLLM, Ollama, OpenAI, Together, etc.), upload documents for RAG, and get a production-ready chat API with guardrails, session management, and a beautiful widget.

✨ Features

  • πŸ”Œ Any LLM Backend β€” Works with vLLM, Ollama, OpenAI, Together, or any OpenAI-compatible API
  • πŸ“„ RAG Pipeline β€” Upload PDFs, DOCX, CSV, TXT, MD β†’ auto-chunked, embedded, and searchable via pgvector
  • πŸ›‘οΈ 3-Layer Guardrails β€” Input filtering, streaming sanitisation, output validation
  • ⚑ SSE Streaming β€” Real-time token streaming to the client
  • πŸ’¬ Session Management β€” Redis-backed conversation history with automatic summarisation
  • πŸ”‘ API Key Auth β€” Simple bearer token authentication
  • 🚦 Rate Limiting β€” Per-IP and per-session rate limits
  • πŸ“Š Token Management β€” Automatic history trimming with LLM-powered summarisation
  • πŸ“ File Upload API β€” Upload and index documents via REST API
  • 🎨 Chat Widget β€” Beautiful, configurable HTML widget (dark mode, markdown, file upload)
  • 🐳 Docker Ready β€” docker compose up and you're running
  • πŸ”’ Self-Hosted β€” Everything runs on your infrastructure. No data leaves your network.

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Chat Widget (HTML)                     β”‚
β”‚              or any HTTP client / frontend                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚ HTTPS / SSE
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Agentic RAG Chat API                     β”‚
β”‚                     (FastAPI + Python)                    β”‚
β”‚                                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚   Auth   β”‚  β”‚ Guardrailsβ”‚  β”‚  Tokens  β”‚  β”‚  Rate   β”‚β”‚
β”‚  β”‚  (API    β”‚  β”‚ (3-layer) β”‚  β”‚ (tiktokenβ”‚  β”‚ Limiter β”‚β”‚
β”‚  β”‚   keys)  β”‚  β”‚           β”‚  β”‚  + trim) β”‚  β”‚         β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β”‚                                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   RAG Engine     β”‚  β”‚   Session Manager (Redis)    β”‚  β”‚
β”‚  β”‚ (embed + search) β”‚  β”‚   (history + rate limits)    β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                           β”‚
            β–Ό                           β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  PostgreSQL +     β”‚        β”‚     Redis        β”‚
β”‚  pgvector         β”‚        β”‚                  β”‚
β”‚  (embeddings)     β”‚        β”‚  (sessions)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β”‚ SSE Stream
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Any OpenAI-Compatible LLM          β”‚
β”‚  (vLLM, Ollama, OpenAI, Together...)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
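The request path through the API box above can be sketched as a chain of stages. This is an illustrative sketch only, not the project's actual code: every name here (ChatRequest, handle, the stub stages) is hypothetical, and the real retrieval step is a pgvector similarity search rather than a stub.

```python
from dataclasses import dataclass, field

@dataclass
class ChatRequest:
    api_key: str
    session_id: str
    message: str
    context: list[str] = field(default_factory=list)

VALID_KEYS = {"your-api-key"}                 # stands in for CLIENT_API_KEYS
BLOCKED = ("ignore previous instructions",)   # stands in for guardrails.py patterns

def authenticate(req: ChatRequest) -> ChatRequest:
    if req.api_key not in VALID_KEYS:
        raise PermissionError("invalid API key")
    return req

def input_guardrails(req: ChatRequest) -> ChatRequest:
    if any(p in req.message.lower() for p in BLOCKED):
        raise ValueError("blocked by input guardrails")
    return req

def retrieve_context(req: ChatRequest) -> ChatRequest:
    # In the real system this is a pgvector similarity search over embeddings.
    req.context = [f"chunk relevant to: {req.message!r}"]
    return req

def handle(req: ChatRequest) -> ChatRequest:
    for stage in (authenticate, input_guardrails, retrieve_context):
        req = stage(req)
    return req

req = handle(ChatRequest("your-api-key", "s1", "What is in the knowledge base?"))
print(len(req.context))  # 1
```

Only after all stages pass does the request reach the LLM backend, and the response streams back through the same path (where the streaming sanitisation layer applies).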

πŸš€ Quick Start

1. Clone and configure

git clone https://github.com/sotastack/agent.git
cd agent
cp .env.example .env
# Edit .env with your LLM endpoint and API key

2. Start services

docker compose up -d

3. Ingest sample documents and chat

# Ingest the sample docs
docker compose exec agent python ingest.py --path docs/ --source "Sample Docs"

# Test the API
curl http://localhost:8083/api/v1/health

# Open the widget
open widget/index.html

That's it. You're running a self-hosted AI assistant with RAG.

πŸ“– Configuration

All configuration is via environment variables. See .env.example for the full list.

| Variable | Default | Description |
|---|---|---|
| LLM_URL | http://localhost:8000/v1 | OpenAI-compatible API endpoint |
| LLM_API_KEY | - | API key for your LLM backend |
| LLM_MODEL | default | Model name to use |
| CLIENT_API_KEYS | - | Comma-separated API keys for client auth |
| REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| RAG_DB_HOST | localhost | PostgreSQL host |
| RAG_ENABLED | true | Enable/disable RAG |
| RAG_TOP_K | 5 | Number of RAG results to inject |
| RAG_MIN_SIMILARITY | 0.3 | Minimum cosine similarity threshold |
| MAX_TOKENS_CONTEXT | 28000 | Max tokens in context window |
| RATE_LIMIT_PER_MIN | 20 | Per-IP rate limit |
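In code, these variables map onto a settings object. A minimal stdlib-only sketch of how the defaults above might be read (the actual project may load settings differently, e.g. via pydantic):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # Defaults mirror the table above; environment variables override them.
    llm_url: str = os.environ.get("LLM_URL", "http://localhost:8000/v1")
    llm_model: str = os.environ.get("LLM_MODEL", "default")
    rag_enabled: bool = os.environ.get("RAG_ENABLED", "true").lower() == "true"
    rag_top_k: int = int(os.environ.get("RAG_TOP_K", "5"))
    rag_min_similarity: float = float(os.environ.get("RAG_MIN_SIMILARITY", "0.3"))
    rate_limit_per_min: int = int(os.environ.get("RATE_LIMIT_PER_MIN", "20"))

settings = Settings()
```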

πŸ”Œ LLM Backend Examples

vLLM (local GPU):

LLM_URL=http://localhost:8000/v1
LLM_API_KEY=token-abc123
LLM_MODEL=meta-llama/Llama-3-8B-Instruct

Ollama:

LLM_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3

OpenAI:

LLM_URL=https://api.openai.com/v1
LLM_API_KEY=sk-...
LLM_MODEL=gpt-4o
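Whichever backend you choose, the wire format is the same: the service talks to {LLM_URL}/chat/completions using the OpenAI chat-completions schema. A sketch of the request body it would send (the helper function here is hypothetical, not part of the project):

```python
import json
import os

def chat_completion_payload(messages, model=None, stream=True):
    """Build a request body for POST {LLM_URL}/chat/completions."""
    return {
        "model": model or os.environ.get("LLM_MODEL", "default"),
        "messages": messages,
        "stream": stream,  # streaming enables the SSE token flow
    }

body = chat_completion_payload([{"role": "user", "content": "Hello"}])
print(json.dumps(body))
```

Because all four backends accept this schema, switching providers is purely a matter of changing LLM_URL, LLM_API_KEY, and LLM_MODEL.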

πŸ“‘ API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/health | Health check |
| POST | /api/v1/chat | Send a message (SSE streaming response) |
| POST | /api/v1/upload | Upload a document for RAG indexing |
| GET | /api/v1/files | List indexed documents |
| GET | /api/v1/session/{id} | Get session info |
| DELETE | /api/v1/session/{id} | Delete a session |

Chat Request

curl -X POST http://localhost:8083/api/v1/chat \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"message": "What is in the knowledge base?"}' \
  --no-buffer
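The response arrives as Server-Sent Events: lines prefixed with `data:`. A minimal sketch of parsing such a stream in Python. The exact frame contents and the `[DONE]` terminator are assumptions here (the `[DONE]` sentinel follows the common OpenAI streaming convention; this project's framing may differ):

```python
def iter_sse_data(lines):
    """Yield the payload of each `data:` frame from an SSE line stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # assumed end-of-stream sentinel
                return
            yield payload

frames = ["data: Hello", "data: world", "data: [DONE]"]
print(list(iter_sse_data(frames)))  # ['Hello', 'world']
```

In a real client you would feed this generator from the open HTTP response body instead of a list.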

Upload a Document

curl -X POST http://localhost:8083/api/v1/upload \
  -H "Authorization: Bearer your-api-key" \
  -F "file=@document.pdf" \
  -F "source=Product Manual"

πŸ›‘οΈ Guardrails

Agentic RAG Chat includes three layers of protection:

  1. Input Guardrails β€” Blocks prompt injection, jailbreak attempts, and model probing
  2. Streaming Sanitisation β€” Strips unwanted characters (e.g., CJK from English-only models) in real-time
  3. Output Validation β€” Checks completed responses for system prompt leaks

Customise blocked patterns in guardrails.py.

🎨 Widget

The included chat widget (widget/index.html) is a single HTML file with zero dependencies. Configure it via URL parameters:

widget/index.html?api=http://localhost:8083/api/v1&key=your-key&title=My+Assistant
| Param | Description |
|---|---|
| api | Agent API base URL |
| key | API key for authentication |
| title | Custom title in the header |
| subtitle | Custom subtitle |
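Since the widget is configured with ordinary query-string parameters, the URL can be built programmatically. A small sketch using the standard library, which also handles escaping for you:

```python
from urllib.parse import urlencode

params = {
    "api": "http://localhost:8083/api/v1",
    "key": "your-key",
    "title": "My Assistant",
}
url = "widget/index.html?" + urlencode(params)
print(url)
```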

πŸ› οΈ Development

# Install dependencies
pip install -r requirements.txt

# Run in dev mode
make dev

# Ingest sample documents
make ingest

# Health check
make test

πŸ“ Customisation

  • System Prompt: Edit prompts/default.txt or add client-specific prompts as prompts/{client}.txt
  • Guardrails: Modify guardrails.py to add/remove blocked patterns
  • RAG Settings: Adjust RAG_TOP_K, RAG_MIN_SIMILARITY in .env
  • Widget: The widget is a single HTML file β€” fork and customise freely

πŸ“¦ Tech Stack

  • FastAPI β€” async Python web framework
  • httpx β€” async HTTP client for LLM streaming
  • Redis β€” session storage and rate limiting
  • PostgreSQL + pgvector β€” vector similarity search for RAG
  • sentence-transformers β€” CPU-based embedding (all-MiniLM-L6-v2)
  • tiktoken β€” token counting for context management
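The token-management idea above (count with tiktoken, trim when the history exceeds MAX_TOKENS_CONTEXT) can be sketched as follows. Everything here is illustrative: the whitespace tokeniser stands in for tiktoken, and the real service summarises dropped turns with the LLM rather than discarding them:

```python
def count_tokens(text: str) -> int:
    # Stand-in for tiktoken; the real service counts model tokens.
    return len(text.split())

def trim_history(messages, max_tokens):
    """Drop oldest messages until the history fits the token budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m["content"]) for m in kept) > max_tokens:
        kept.pop(0)  # drop the oldest turn first
    return kept

history = [{"content": "a " * 50}, {"content": "b " * 50}, {"content": "c c"}]
print(len(trim_history(history, 60)))  # 2
```

Summarising instead of dropping preserves long-range context while keeping the prompt under the window limit.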

πŸ“„ License

MIT β€” see LICENSE.

Built by SOTAStack Β· Melbourne, Australia πŸ‡¦πŸ‡Ί
