Self-hosted AI chat platform with RAG, guardrails, and streaming
Deploy your own AI assistant in minutes. Connect any OpenAI-compatible LLM (vLLM, Ollama, OpenAI, Together, etc.), upload documents for RAG, and get a production-ready chat API with guardrails, session management, and a beautiful widget.
- **Any LLM Backend** – Works with vLLM, Ollama, OpenAI, Together, or any OpenAI-compatible API
- **RAG Pipeline** – Upload PDFs, DOCX, CSV, TXT, MD; auto-chunked, embedded, and searchable via pgvector
- **3-Layer Guardrails** – Input filtering, streaming sanitisation, output validation
- **SSE Streaming** – Real-time token streaming to the client
- **Session Management** – Redis-backed conversation history with automatic summarisation
- **API Key Auth** – Simple bearer token authentication
- **Rate Limiting** – Per-IP and per-session rate limits
- **Token Management** – Automatic history trimming with LLM-powered summarisation
- **File Upload API** – Upload and index documents via REST API
- **Chat Widget** – Beautiful, configurable HTML widget (dark mode, markdown, file upload)
- **Docker Ready** – `docker compose up` and you're running
- **Self-Hosted** – Everything runs on your infrastructure. No data leaves your network.
```
┌─────────────────────────────────────────────────────────┐
│                   Chat Widget (HTML)                    │
│              or any HTTP client / frontend              │
└────────────────────────────┬────────────────────────────┘
                             │ HTTPS / SSE
                             ▼
┌─────────────────────────────────────────────────────────┐
│                  Agentic RAG Chat API                   │
│                   (FastAPI + Python)                    │
│                                                         │
│  ┌─────────┐  ┌────────────┐  ┌───────────┐  ┌─────────┐│
│  │  Auth   │  │ Guardrails │  │  Tokens   │  │  Rate   ││
│  │  (API   │  │ (3-layer)  │  │ (tiktoken │  │ Limiter ││
│  │  keys)  │  │            │  │  + trim)  │  │         ││
│  └─────────┘  └────────────┘  └───────────┘  └─────────┘│
│                                                         │
│  ┌───────────────────┐  ┌─────────────────────────────┐ │
│  │    RAG Engine     │  │  Session Manager (Redis)    │ │
│  │ (embed + search)  │  │  (history + rate limits)    │ │
│  └─────────┬─────────┘  └──────────────┬──────────────┘ │
└────────────┼───────────────────────────┼────────────────┘
             │                           │
             ▼                           ▼
  ┌─────────────────────┐     ┌─────────────────────┐
  │    PostgreSQL +     │     │        Redis        │
  │      pgvector       │     │                     │
  │    (embeddings)     │     │     (sessions)      │
  └─────────────────────┘     └─────────────────────┘

                    │
                    │ SSE Stream
                    ▼
┌───────────────────────────────────────┐
│       Any OpenAI-Compatible LLM       │
│  (vLLM, Ollama, OpenAI, Together...)  │
└───────────────────────────────────────┘
```
```bash
git clone https://github.com/sotastack/agent.git
cd agent
cp .env.example .env
# Edit .env with your LLM endpoint and API key

docker compose up -d

# Ingest the sample docs
docker compose exec agent python ingest.py --path docs/ --source "Sample Docs"

# Test the API
curl http://localhost:8083/api/v1/health

# Open the widget
open widget/index.html
```

That's it. You're running a self-hosted AI assistant with RAG.
All configuration is via environment variables. See `.env.example` for the full list.

| Variable | Default | Description |
|---|---|---|
| `LLM_URL` | `http://localhost:8000/v1` | OpenAI-compatible API endpoint |
| `LLM_API_KEY` | – | API key for your LLM backend |
| `LLM_MODEL` | `default` | Model name to use |
| `CLIENT_API_KEYS` | – | Comma-separated API keys for client auth |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection URL |
| `RAG_DB_HOST` | `localhost` | PostgreSQL host |
| `RAG_ENABLED` | `true` | Enable/disable RAG |
| `RAG_TOP_K` | `5` | Number of RAG results to inject |
| `RAG_MIN_SIMILARITY` | `0.3` | Minimum cosine similarity threshold |
| `MAX_TOKENS_CONTEXT` | `28000` | Max tokens in context window |
| `RATE_LIMIT_PER_MIN` | `20` | Per-IP rate limit |
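The two RAG knobs interact in a simple way: retrieved chunks below `RAG_MIN_SIMILARITY` are dropped, and at most `RAG_TOP_K` of the remainder are injected into the prompt. A minimal sketch of that filtering step (the function name and chunk shape here are illustrative, not the repo's actual code):

```python
import os

def select_chunks(hits, top_k=None, min_similarity=None):
    """Keep the top_k highest-similarity chunks above the threshold.

    `hits` is a list of (chunk_text, cosine_similarity) pairs, as a
    pgvector query might return them. Names here are illustrative.
    """
    top_k = top_k or int(os.getenv("RAG_TOP_K", "5"))
    min_similarity = min_similarity or float(os.getenv("RAG_MIN_SIMILARITY", "0.3"))
    kept = [(text, sim) for text, sim in hits if sim >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

hits = [("intro", 0.82), ("pricing", 0.41), ("footer", 0.12), ("faq", 0.55)]
print(select_chunks(hits, top_k=2, min_similarity=0.3))
# [('intro', 0.82), ('faq', 0.55)]
```

Raising `RAG_MIN_SIMILARITY` trades recall for precision; raising `RAG_TOP_K` spends more of the context budget on retrieved text.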
vLLM (local GPU):

```bash
LLM_URL=http://localhost:8000/v1
LLM_API_KEY=token-abc123
LLM_MODEL=meta-llama/Llama-3-8B-Instruct
```

Ollama:

```bash
LLM_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3
```

OpenAI:

```bash
LLM_URL=https://api.openai.com/v1
LLM_API_KEY=sk-...
LLM_MODEL=gpt-4o
```

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/v1/health` | Health check |
| `POST` | `/api/v1/chat` | Send a message (SSE streaming response) |
| `POST` | `/api/v1/upload` | Upload a document for RAG indexing |
| `GET` | `/api/v1/files` | List indexed documents |
| `GET` | `/api/v1/session/{id}` | Get session info |
| `DELETE` | `/api/v1/session/{id}` | Delete a session |
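The `/api/v1/chat` response arrives as SSE `data:` lines. A minimal client-side parsing sketch, assuming the server emits JSON events with a `delta` text field and a `[DONE]` sentinel (that event shape is an assumption; check the API's actual output before relying on it):

```python
import json

def parse_sse_chunks(raw: str) -> str:
    """Extract and concatenate text deltas from a raw SSE stream.

    Assumes each event is a `data: {...}` line whose JSON carries a
    `delta` field, terminated by `data: [DONE]`. This shape is an
    assumption, not a documented contract of the API.
    """
    parts = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        parts.append(json.loads(payload)["delta"])
    return "".join(parts)

stream = 'data: {"delta": "Hel"}\ndata: {"delta": "lo"}\ndata: [DONE]\n'
print(parse_sse_chunks(stream))  # Hello
```

In a real client you would feed lines from an `httpx` or `requests` streaming response into the same loop instead of a pre-built string.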
```bash
curl -X POST http://localhost:8083/api/v1/chat \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"message": "What is in the knowledge base?"}' \
  --no-buffer
```

```bash
curl -X POST http://localhost:8083/api/v1/upload \
  -H "Authorization: Bearer your-api-key" \
  -F "file=@document.pdf" \
  -F "source=Product Manual"
```

Agentic RAG Chat includes three layers of protection:
- **Input Guardrails** – Blocks prompt injection, jailbreak attempts, and model probing
- **Streaming Sanitisation** – Strips unwanted characters (e.g., CJK from English-only models) in real time
- **Output Validation** – Checks completed responses for system prompt leaks

Customise blocked patterns in `guardrails.py`.
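As an illustration of what the input layer might look like, here is a minimal regex-blocklist check in the spirit of `guardrails.py` (the patterns and function name are hypothetical, not the repo's actual list):

```python
import re

# Hypothetical examples of injection/jailbreak patterns; the real
# list lives in guardrails.py and will differ.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal your system prompt", re.IGNORECASE),
]

def passes_input_guardrails(message: str) -> bool:
    """Layer 1: reject any message matching a blocked pattern."""
    return not any(p.search(message) for p in BLOCKED_PATTERNS)

print(passes_input_guardrails("What does the manual say about setup?"))  # True
print(passes_input_guardrails("Ignore previous instructions."))          # False
```

Layers 2 and 3 apply analogous checks to the token stream and the completed response, respectively.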
The included chat widget (`widget/index.html`) is a single HTML file with zero dependencies. Configure it via URL parameters:

```
widget/index.html?api=http://localhost:8083/api/v1&key=your-key&title=My+Assistant
```

| Param | Description |
|---|---|
| `api` | Agent API base URL |
| `key` | API key for authentication |
| `title` | Custom title in the header |
| `subtitle` | Custom subtitle |
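Since the widget is configured purely through its query string, the URL can be assembled with `urllib.parse` rather than escaped by hand:

```python
from urllib.parse import urlencode

# Build the widget URL from the parameters in the table above.
params = {
    "api": "http://localhost:8083/api/v1",
    "key": "your-key",
    "title": "My Assistant",
}
url = "widget/index.html?" + urlencode(params)
print(url)
# widget/index.html?api=http%3A%2F%2Flocalhost%3A8083%2Fapi%2Fv1&key=your-key&title=My+Assistant
```

`urlencode` percent-encodes the API URL and turns spaces into `+`, matching the example above.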
```bash
# Install dependencies
pip install -r requirements.txt

# Run in dev mode
make dev

# Ingest sample documents
make ingest

# Health check
make test
```

- **System Prompt**: Edit `prompts/default.txt` or add client-specific prompts as `prompts/{client}.txt`
- **Guardrails**: Modify `guardrails.py` to add/remove blocked patterns
- **RAG Settings**: Adjust `RAG_TOP_K`, `RAG_MIN_SIMILARITY` in `.env`
- **Widget**: The widget is a single HTML file – fork and customise freely
- **FastAPI** – async Python web framework
- **httpx** – async HTTP client for LLM streaming
- **Redis** – session storage and rate limiting
- **PostgreSQL + pgvector** – vector similarity search for RAG
- **sentence-transformers** – CPU-based embedding (`all-MiniLM-L6-v2`)
- **tiktoken** – token counting for context management
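The token-management layer can be pictured as dropping the oldest turns until the conversation fits under `MAX_TOKENS_CONTEXT`. A sketch with a stand-in word-count tokenizer (the real service counts with tiktoken and also summarises the dropped history, which this omits):

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word.
    The real implementation uses tiktoken."""
    return len(text.split())

def trim_history(messages, max_tokens):
    """Drop the oldest messages until the total fits the budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # oldest first; the real service summarises these
    return kept

history = ["hello there", "tell me about pricing", "sure here is the pricing page"]
print(trim_history(history, max_tokens=8))
# ['sure here is the pricing page']
```

The summarisation step replaces the dropped turns with a short LLM-generated recap, so long sessions keep their gist without blowing the context window.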
MIT – see `LICENSE`.

- Website: sotastack.com.au
- Issues: GitHub Issues

Built by SOTAStack · Melbourne, Australia 🇦🇺