Adaptive RAG evaluation with a Council-of-LLMs — up to 70x cost reduction by routing each test case to the optimal strategy based on risk.
Live Demo — zero setup, no API keys needed. Click "Run Demo" and watch adaptive orchestration in real time.
RAG systems fail silently. The retriever fetches wrong documents, the generator hallucinates, stale data gets served as truth — and nothing breaks. No error, no alert, just bad information reaching users.
Evaluating every query with a full multi-judge panel costs ~$0.0035/case. At scale this becomes prohibitive — but a single cheap judge misses subtle failures on high-stakes queries.
Quorum scores each test case for risk and routes it to the optimal evaluation strategy, spending budget only where it matters.
               ┌──────────────────┐
               │   Risk Scorer    │
               │ (real analysis)  │
               └────────┬─────────┘
                        │
         ┌──────────────┼──────────────┐
         │              │              │
    risk ≥ 0.8  0.4 ≤ risk < 0.8  risk < 0.4
         │              │              │
┌────────┴────────┐ ┌───┴───┐     ┌────┴────┐
│     Council     │ │Hybrid │     │ Single  │
│ 3 judges + agg. │ │ det + │     │ Gemini  │
│  ~$0.0035/case  │ │1 judge│     │~$0.00005│
└─────────────────┘ └───────┘     └─────────┘
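The thresholds in the diagram can be read as a small routing function. The following is a minimal sketch, not Quorum's actual router: the function and constant names are hypothetical, and it assumes the risk score is a number in [0, 1].

```javascript
// Thresholds taken from the routing diagram; all names here are hypothetical.
const COUNCIL_THRESHOLD = 0.8;
const HYBRID_THRESHOLD = 0.4;

// Map a risk score to one of the three evaluation strategies.
// Assumption: risk is a number in [0, 1].
function selectStrategy(risk) {
  if (typeof risk !== 'number' || Number.isNaN(risk) || risk < 0 || risk > 1) {
    throw new RangeError('risk must be a number in [0, 1]');
  }
  if (risk >= COUNCIL_THRESHOLD) return 'council'; // 3 judges + aggregator
  if (risk >= HYBRID_THRESHOLD) return 'hybrid';   // deterministic checks + 1 judge
  return 'single';                                 // one Gemini call
}

console.log(selectStrategy(0.92)); // council
console.log(selectStrategy(0.55)); // hybrid
console.log(selectStrategy(0.1));  // single
```

Both boundaries are inclusive on the upper tier, so a score of exactly 0.8 goes to the council and exactly 0.4 goes to hybrid.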
Council (risk ≥ 0.8): medical dosages, legal requirements, safety procedures. Queries where errors have real consequences get the full treatment: OpenAI (faithfulness) + Anthropic (groundedness) + Gemini (context relevancy), synthesized by Claude Sonnet 4.
Hybrid (0.4 ≤ risk < 0.8): technical explanations, financial advice. Zero-cost deterministic checks (Jaccard similarity, entity matching, freshness, completeness) run first, then a single LLM judge validates. The verdict is computed locally, so no aggregator is needed.
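The deterministic checks cost nothing because they are plain string operations. As a rough illustration, Jaccard similarity between an answer and its retrieved context could be computed as below. The tokenizer (lowercase ASCII words) and the function names are assumptions for this sketch, not Quorum's implementation.

```javascript
// Hypothetical helper: split text into a set of lowercase ASCII word tokens.
function tokenize(text) {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity = |A ∩ B| / |A ∪ B|, a value in [0, 1].
function jaccardSimilarity(a, b) {
  const setA = tokenize(a);
  const setB = tokenize(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty texts are treated as identical
  let intersection = 0;
  for (const token of setA) {
    if (setB.has(token)) intersection += 1;
  }
  const union = setA.size + setB.size - intersection;
  return intersection / union;
}

const answer = 'The capital of France is Paris.';
const context = 'Paris is the capital and largest city of France.';
console.log(jaccardSimilarity(answer, context).toFixed(2)); // prints 0.67
```

Here all six answer tokens appear among the nine context tokens, giving 6/9. A low score would flag an answer that shares little vocabulary with its context, which is cheap evidence worth passing to the single judge.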
Single (risk < 0.4): "What is the capital of Japan?" One Gemini call, done. No wasted spend on trivial factoid queries.
17 SSE event types stream the entire evaluation lifecycle:
risk_scored → strategy_selected → judge_start → judge_complete → aggregator_start → ...
The frontend renders judges appearing staggered with live score animations — council shows 3 judge cards, hybrid shows deterministic checks + 1 judge, single shows 1 judge. All driven by SSE events, not hardcoded layouts.
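To make the event-driven rendering concrete, here is a minimal sketch of parsing raw SSE text into typed events. It follows the standard `event:` / `data:` wire format; the assumption that Quorum sends JSON payloads, and the payload fields shown (`risk`, `strategy`), are hypothetical. Only the event names come from the list above.

```javascript
// Parse raw SSE text into { event, data } objects.
// Frames are separated by a blank line; each frame carries "event:" and "data:" fields.
// Assumption: every data payload is JSON.
function parseSseFrames(raw) {
  const events = [];
  for (const frame of raw.split(/\r?\n\r?\n/)) {
    let event = 'message'; // SSE default when no "event:" field is present
    const dataLines = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trimStart());
    }
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}

// Hypothetical payloads for two of the documented event types.
const sample =
  'event: risk_scored\ndata: {"risk":0.91}\n\n' +
  'event: strategy_selected\ndata: {"strategy":"council"}\n\n';

const parsed = parseSseFrames(sample);
console.log(parsed.map((e) => e.event).join(' -> ')); // risk_scored -> strategy_selected
```

A UI built this way only needs a handler per event name, which is what lets the same stream drive three different layouts.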
┌─────────────────────────────────────────────────────────────────┐
│                     Frontend (React + Vite)                     │
│   Upload → Strategy Select → Live Streaming → History → Costs   │
└────────────────────────────────┬────────────────────────────────┘
                                 │ SSE + REST
┌────────────────────────────────┴────────────────────────────────┐
│                   Backend (Express + MongoDB)                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                     Adaptive Router                      │   │
│  │      Risk Scorer → Strategy Selector → Cost Tracker      │   │
│  │           ↓                ↓                 ↓           │   │
│  │      ┌─────────┐     ┌──────────────────────────┐        │   │
│  │      │Determin.│     │       Orchestrator       │        │   │
│  │      │ Checks  │     │ OpenAI │ Anthropic │ Gem │        │   │
│  │      │(0-cost) │     │            ↓             │        │   │
│  │      └─────────┘     │        Aggregator        │        │   │
│  │                      └──────────────────────────┘        │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                ↓                                │
│                 Webhook Service → Slack / HTTP                  │
└─────────────────────────────────────────────────────────────────┘
See ARCHITECTURE.md for detailed system diagrams, data flow, and SSE protocol documentation.
git clone https://github.com/AlexLopezGomez/Quorum---Council-LLMs.git && cd Quorum---Council-LLMs
cd frontend && npm ci && npm run build && cd ..
cd backend && npm ci
DEMO_MODE=true node src/index.js
Open http://localhost:3000 and click "Run Demo" to launch a 10-case adaptive evaluation.
No MongoDB, no API keys, no Docker. The demo runs the real orchestration engine with mocked judge I/O at the boundary.
cd backend && cp .env.example .env # Add OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY
docker run -d -p 27017:27017 mongo:7
npm run dev &
cd ../frontend && npm run dev &
Open http://localhost:5173.
docker-compose up --build
# Open http://localhost:8080
Zero-dependency ESM module for capturing RAG interactions:
import { Quorum } from '@quorum/sdk';
const quorum = new Quorum({ endpoint: 'http://localhost:3000' });
quorum.capture({
input: 'What is the capital of France?',
actualOutput: 'The capital of France is Paris.',
retrievalContext: ['Paris is the capital and largest city of France.'],
});
await quorum.close();
Batched transport with exponential backoff, PII sanitization on by default. See sdk/README.md.
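For readers unfamiliar with the retry pattern mentioned above, exponential backoff can be sketched as follows. This is a generic illustration, not the SDK's code: the function names, the 250 ms base delay, the 8 s cap, and the attempt count are all assumptions.

```javascript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ... capped at maxMs.
// The default values are hypothetical; the SDK's real settings may differ.
function backoffDelayMs(attempt, baseMs = 250, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Call `send` until it succeeds or maxAttempts is reached, sleeping between failures.
// `sleep` is injectable so the loop can be exercised without real timers.
async function sendWithRetry(
  send,
  maxAttempts = 4,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await send();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

console.log([0, 1, 2, 3].map((n) => backoffDelayMs(n)).join(', ')); // 250, 500, 1000, 2000
```

Doubling the delay keeps a struggling endpoint from being hammered by retries, and the cap bounds how long a batch can sit in memory.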
npx quorum test --file test-cases.json # Run evaluation
npx quorum init # Scaffold config
npx quorum validate # Validate test case format

| Method | Path | Purpose |
|---|---|---|
| POST | /api/evaluate | Start evaluation (accepts strategy + riskOverride) |
| GET | /api/stream/:jobId | SSE stream (replays + live) |
| GET | /api/results/:jobId | Poll for results |
| GET | /api/history | Cursor-paginated history |
| GET | /api/history/:jobId/cost | Cost breakdown with savings estimate |
| GET | /api/stats | Aggregated statistics |
| GET | /api/docs | Swagger UI (interactive docs) |
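As a sketch of how these routes fit together, the helpers below build the request for starting an evaluation and the follow-up URLs for a job. The paths and the `strategy` / `riskOverride` options come from the table; the `testCases` body field, the helper names, and the reuse of the SDK's test-case shape are assumptions, so check /api/docs for the real schema.

```javascript
// Build the fetch() arguments for POST /api/evaluate.
// Assumption: the body carries test cases under a `testCases` key (hypothetical name).
function buildEvaluateRequest(baseUrl, testCases, options = {}) {
  const body = { testCases };
  if (options.strategy !== undefined) body.strategy = options.strategy;
  if (options.riskOverride !== undefined) body.riskOverride = options.riskOverride;
  return {
    url: new URL('/api/evaluate', baseUrl).toString(),
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    },
  };
}

// Derive the per-job URLs listed in the table from a job id.
function jobUrls(baseUrl, jobId) {
  const id = encodeURIComponent(jobId);
  return {
    stream: new URL(`/api/stream/${id}`, baseUrl).toString(),
    results: new URL(`/api/results/${id}`, baseUrl).toString(),
    cost: new URL(`/api/history/${id}/cost`, baseUrl).toString(),
  };
}

const request = buildEvaluateRequest(
  'http://localhost:3000',
  [{
    input: 'What is the capital of France?',
    actualOutput: 'The capital of France is Paris.',
    retrievalContext: ['Paris is the capital and largest city of France.'],
  }],
  { strategy: 'council' },
);
console.log(request.url); // http://localhost:3000/api/evaluate
```

A caller would pass `request.url` and `request.init` to `fetch`, read the job id from the response (its field name is not documented here), and then either stream or poll using the URLs from `jobUrls`.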
Backend: Node.js 20+, Express, MongoDB/Mongoose, Zod, SSE
Frontend: React 18, Vite 6, TailwindCSS 3, Lucide React
SDK: Zero-dependency ESM with native fetch
LLM Providers: OpenAI (gpt-4o-mini), Anthropic (claude-3-haiku, claude-sonnet-4), Google (gemini-2.5-flash)
- ARCHITECTURE.md — System diagrams, data flow, SSE protocol
- DECISIONS.md — Architectural decision records
- DESIGN_SYSTEM.md — Frontend component patterns
- sdk/README.md — SDK integration guide
- Public benchmark results: /benchmarks
- Research paper: forthcoming
Community contributions are welcome. Start with CONTRIBUTING.md for local setup, architecture notes, and pull request expectations.
Monitor live RAG traffic with the council. After any inference call, submit the sample fire-and-forget — it never blocks your production path:
// After your RAG pipeline response
fetch('https://your-quorum.app/api/sample', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': 'Bearer ' + process.env.QUORUM_SERVICE_KEY,
},
body: JSON.stringify({
query, // the user's question
response, // your model's answer
contexts, // retrieved context passages (array of strings)
}),
}).catch(() => {}); // never await, never throw
Quorum samples at 5% by default (configurable via SAMPLE_RATE env var). View the Monitoring dashboard for score trends, baseline comparison, and drift alerts.
Rate limit note: The /api/sample endpoint shares the global 30 RPM limit across all /api routes. At 5% sample rate, you need fewer than 600 RPM of production traffic to stay under the limit.
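The 600 RPM figure follows from dividing the limit by the sample rate. The sketch below reproduces that arithmetic and also shows how the ceiling drops once other /api calls use part of the shared budget. The function name and the `otherApiRpm` parameter are hypothetical, and the 6 RPM of other traffic is an invented example value.

```javascript
// Highest production request rate (per minute) whose sampled share still fits the limit.
// limitRpm:    shared rate limit across all /api routes (30 in the note above)
// sampleRate:  fraction of production requests that are sampled (0.05 by default)
// otherApiRpm: hypothetical RPM already consumed by other /api calls
function maxProductionRpm(limitRpm, sampleRate, otherApiRpm = 0) {
  if (!(sampleRate > 0 && sampleRate <= 1)) {
    throw new RangeError('sampleRate must be in (0, 1]');
  }
  const available = Math.max(0, limitRpm - otherApiRpm);
  return available / sampleRate;
}

console.log(Math.round(maxProductionRpm(30, 0.05)));    // 600
console.log(Math.round(maxProductionRpm(30, 0.05, 6))); // 480
```

Because dashboards and history queries draw on the same 30 RPM, the practical ceiling for production traffic is usually below 600 RPM.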