A production-ready Retrieval-Augmented Generation (RAG) application that enables intelligent conversations with PDF documents. Built for sub-200ms retrieval latency, streaming responses, and scalable inference.
Live Demo | Backend API
ContextAI solves the problem of interacting with document content at scale:
- Upload any PDF → instantly searchable
- Ask questions → get answers from your documents (not training data)
- Follow-up questions → maintains conversation context
- Streaming responses → real-time typewriter effect (like ChatGPT)
Perfect for: Recruiters, researchers, legal teams, and anyone working with large document datasets.
%%{init: {'theme':'dark'}}%%
graph TB
subgraph Frontend["Frontend (React + TypeScript)"]
UI["Landing Page & Chat UI"]
Upload["Drag-Drop Upload"]
Chat["Real-Time Chat"]
end
subgraph Backend["Backend (FastAPI - Render)"]
API["REST API Endpoints"]
Upload_EP["/upload"]
Chat_EP["/chat & /chat/stream"]
Docs_EP["/documents"]
end
subgraph AI["AI & Vector Search"]
HF["HuggingFace Embeddings"]
FAISS["FAISS Vector DB"]
Groq["Groq LLM<br/>llama-3.1-8b"]
end
UI --> Upload
UI --> Chat
Upload --> Upload_EP
Chat --> Chat_EP
Upload_EP --> HF
HF --> FAISS
Chat_EP --> FAISS
Chat_EP --> Groq
style Frontend fill:#1f1f2e,stroke:#00d4ff,color:#fff
style Backend fill:#1f1f2e,stroke:#00ff88,color:#fff
style AI fill:#1f1f2e,stroke:#ff006e,color:#fff
%%{init: {'theme':'dark'}}%%
graph LR
A["PDF Upload"] --> B["PyPDFLoader<br/>Extract Text"]
B --> C["Split Text<br/>500 char chunks<br/>50 overlap"]
C --> D["Embeddings<br/>all-MiniLM-L6-v2<br/>384-dim"]
D --> E["FAISS<br/>Vector Store"]
E --> F["In-Memory<br/>Storage"]
style A fill:#00ff88,stroke:#00ff88,color:#1a1a1a
style B fill:#00d4ff,stroke:#00d4ff,color:#1a1a1a
style C fill:#ffa700,stroke:#ffa700,color:#1a1a1a
style D fill:#ff006e,stroke:#ff006e,color:#f0f0f0
style E fill:#8338ec,stroke:#8338ec,color:#f0f0f0
style F fill:#ffbe0b,stroke:#ffbe0b,color:#1a1a1a
%%{init: {'theme':'dark'}}%%
graph TD
A["User Question<br/>+ Chat History"] --> B["History-Aware<br/>Reformulation"]
B --> C["FAISS Retrieval<br/>Top 3 Chunks"]
C --> D["LangChain Prompt<br/>Assembly"]
D --> E["System Prompt<br/>+ History<br/>+ Context<br/>+ Question"]
E --> F["Groq API<br/>llama-3.1-8b-instant"]
F --> G["Streaming Response<br/>Token by Token"]
G --> H["Real-Time Display<br/>Typewriter Effect"]
style A fill:#00ff88,stroke:#00ff88,color:#1a1a1a
style B fill:#00d4ff,stroke:#00d4ff,color:#1a1a1a
style C fill:#ffa700,stroke:#ffa700,color:#1a1a1a
style D fill:#ff006e,stroke:#ff006e,color:#f0f0f0
style E fill:#8338ec,stroke:#8338ec,color:#f0f0f0
style F fill:#ffbe0b,stroke:#ffbe0b,color:#1a1a1a
style G fill:#00d4ff,stroke:#00d4ff,color:#1a1a1a
style H fill:#00ff88,stroke:#00ff88,color:#1a1a1a
- Semantic document search using FAISS vector database
- Dual retrieval: context + conversational history
- Sub-200ms retrieval latency (FAISS + CPU optimization)
- Maintains context across 10+ conversation turns
- Automatic question reformulation for pronouns ("What about that?" → "What about the third paragraph?")
- History-aware prompting with LangChain
- Server-Sent Events (SSE) for real-time token streaming
- Dual endpoints: `/chat` (instant) vs `/chat/stream` (progressive)
- 50% perceived latency reduction compared to batch responses
- Upload multiple PDFs simultaneously
- Switch between documents instantly
- Per-document vector stores (isolated context)
- Environment-based LLM selection (Groq API or HuggingFace)
- Async background processing for PDFs
- CORS-configured for cross-origin requests
- Error handling with meaningful messages
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 19 + TypeScript | Modern UI with real-time updates |
| | Vite | Fast HMR development, optimized builds |
| | Tailwind CSS | Responsive styling |
| | Axios + SSE | API communication & streaming |
| Backend | FastAPI | High-performance async web framework |
| | LangChain 1.x | LLM orchestration & RAG pipeline |
| | Groq API | Fast token-per-second inference |
| | FAISS | Approximate nearest neighbor search |
| ML/Embeddings | HuggingFace | all-MiniLM-L6-v2 (384-dim, fast) |
| | LangChain Splitters | Intelligent document chunking |
| Infrastructure | Render | Backend hosting (Python runtime) |
| | Vercel / GitHub Pages | Frontend hosting (auto-deploys) |
| Metric | Value | Notes |
|---|---|---|
| Retrieval Latency | <200ms | FAISS on CPU, top-3 chunks |
| LLM Response Time | ~3-5s | Groq llama-3.1-8b (streaming) |
| First Token Latency | <500ms | SSE streaming begins immediately |
| Upload Processing | ~10-30s | PDF → embeddings → FAISS index |
| Embedding Model Size | ~100MB | all-MiniLM-L6-v2 |
| Typical Token Throughput | 200-300 tok/s | Groq inference speed |
| Cold Start Time | 30-60s | Render free tier wake-up |
| Warm Response Time | 3-5s | Typical end-to-end latency |
POST /upload # Upload PDF, returns document_id
GET /documents # List all uploaded documents with status
POST /chat # Single-turn Q&A with history
POST /chat/stream # Streaming Q&A with Server-Sent Events
GET /health # Health check (for keep-alive pings)
GET /debug/model # Show active LLM & test inference
GET /debug/cors # CORS configuration info
GET /docs # Interactive API docs (Swagger UI)

Upload a document:
curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf"

Chat (single response):
curl -X POST "http://localhost:8000/chat" \
-H "Content-Type: application/json" \
-d '{
"question": "What is this document about?",
"document_id": "b2b98275-d72a-47e2-b303-4e1ca237f964",
"chat_history": [
{"sender": "user", "text": "Tell me about X"},
{"sender": "ai", "text": "X is..."}
]
}'

Chat (streaming):
curl -X POST "http://localhost:8000/chat/stream" \
-H "Content-Type: application/json" \
-d '{"question": "Explain Y", "document_id": "...", "chat_history": []}'
# Responses arrive as Server-Sent Events

- Python 3.9+
- Node.js 18+
- Groq API key (free tier available)
cd backend
python -m venv venv
source venv/bin/activate # Windows: .\venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Add your GROQ_API_KEY to .env
export GROQ_API_KEY="gsk_..."
python main.py
# Server runs on http://localhost:8000

cd frontend
npm install
npm run dev
# Frontend runs on http://localhost:5173

- Open http://localhost:5173
- Upload a PDF (drag-drop or click)
- Wait for processing (progress bar shows status)
- Ask questions about the document
- Watch responses stream in real-time
- User uploads PDF via frontend
- Backend returns `document_id` immediately (non-blocking)
- Background task processes:
  - Extract text using PyPDF
  - Split into 500-char chunks with 50-char overlap
  - Generate embeddings (384-dim vectors)
  - Store in FAISS vector index
  - Save to in-memory `vector_stores` dict
Why async? Users get instant feedback while processing continues in the background.
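
The chunking step can be illustrated with a minimal sketch. It uses a plain fixed-size sliding window in pure Python, which is an assumption made for illustration: the LangChain splitters listed in the tech stack also try to break on separators such as paragraphs, so real chunk boundaries will likely differ. The function name `split_text` is hypothetical and not taken from the backend code.

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # with the defaults, a new chunk starts every 450 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    pieces = split_text("x" * 1200)
    print([len(p) for p in pieces])  # prints [500, 500, 300]
```

The 50-character overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, as long as it is shorter than the overlap.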
- User asks question about document
- If chat history exists: reformulate the question into a standalone query using that history
- Retrieve top 3 semantically similar chunks from FAISS
- Format LangChain prompt:
  [SYSTEM]: You are a Q&A assistant. Use context to answer.
  [HISTORY]: [Previous messages for context]
  [CONTEXT]: [Top 3 relevant chunks from document]
  [QUESTION]: [User's actual question]
- Send to Groq API (llama-3.1-8b-instant)
- Stream tokens back via SSE
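
The retrieval and prompt-assembly steps can be sketched in plain Python. This is not the project's code: a brute-force cosine ranking stands in for FAISS, tiny hand-written vectors stand in for the 384-dim embeddings, and the names `cosine`, `top_k`, and `build_prompt` are hypothetical. The prompt layout follows the four sections of the template above, and the history entries use the `sender`/`text` shape from the API examples.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed_chunks, k=3):
    """Return the k chunk texts most similar to the query (brute force, not FAISS)."""
    ranked = sorted(indexed_chunks, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(question, chunks, chat_history):
    """Assemble the four prompt sections: system, history, context, question."""
    history = "\n".join(f"{m['sender']}: {m['text']}" for m in chat_history) or "(none)"
    context = "\n---\n".join(chunks)
    return (
        "[SYSTEM]: You are a Q&A assistant. Use context to answer.\n"
        f"[HISTORY]:\n{history}\n"
        f"[CONTEXT]:\n{context}\n"
        f"[QUESTION]: {question}"
    )


if __name__ == "__main__":
    # Toy index of (vector, chunk text) pairs; real vectors would come from the embedding model.
    index = [
        ([1.0, 0.0, 0.0], "Chunk about pricing"),
        ([0.0, 1.0, 0.0], "Chunk about refunds"),
        ([0.9, 0.1, 0.0], "Chunk about discounts"),
        ([0.0, 0.0, 1.0], "Chunk about shipping"),
    ]
    chunks = top_k([1.0, 0.0, 0.0], index, k=3)
    print(build_prompt("How much does it cost?", chunks, [{"sender": "user", "text": "Hi"}]))
```

Keeping the context to three chunks bounds the prompt size regardless of document length, which is one reason a small `k` is a common default.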
- LangChain's `.astream()` yields tokens as the LLM generates them
- Each token sent as JSON: `{"chunk": "word"}`
- Frontend appends to visible message in real-time
- Users see "typewriter effect" like ChatGPT
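
As a rough illustration of the wire format, the sketch below wraps tokens in SSE `data:` frames carrying the `{"chunk": ...}` payload described above, then parses them back the way a client might. The helper names `sse_events` and `read_chunks` are hypothetical, and the token list stands in for what the LLM stream would yield; the actual backend may frame its events differently.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in an SSE `data:` frame; a blank line terminates each event."""
    for token in tokens:
        payload = json.dumps({"chunk": token})
        yield f"data: {payload}\n\n"


def read_chunks(stream: Iterable[str]) -> str:
    """Client side: parse `data:` lines and join the chunks into the full message."""
    message = []
    for frame in stream:
        for line in frame.splitlines():
            if line.startswith("data:"):
                payload = json.loads(line[len("data:"):].strip())
                message.append(payload["chunk"])
    return "".join(message)


if __name__ == "__main__":
    frames = list(sse_events(["Hel", "lo", " world"]))
    print(frames[0], end="")
    print(read_chunks(frames))  # prints Hello world
```

In a FastAPI app, a generator like this would typically be returned through `StreamingResponse` with `media_type="text/event-stream"`; whether this project does so is an assumption.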
# Required
GROQ_API_KEY=gsk_your_key_here
# Optional (defaults shown)
PORT=8000
HOST=0.0.0.0
ENVIRONMENT=production

The app automatically detects environment:
- If `GROQ_API_KEY` is set: Uses Groq API (recommended for production)
- If missing: Falls back to local HuggingFace `flan-t5-small` (slower, no rate limits)
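
The detection rule can be written as a small function. This is a sketch of the rule as described, not the backend's implementation: the function name `select_llm_backend` and the returned labels are hypothetical, and treating a blank or whitespace-only key as missing is an assumption.

```python
import os
from typing import Mapping, Optional


def select_llm_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "groq" when GROQ_API_KEY is non-empty, otherwise "huggingface"."""
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY", "").strip()  # assumption: blank counts as unset
    return "groq" if api_key else "huggingface"


if __name__ == "__main__":
    print(select_llm_backend({"GROQ_API_KEY": "gsk_your_key_here"}))  # prints groq
    print(select_llm_backend({}))  # prints huggingface
```

Passing the environment as a parameter keeps the rule easy to test without modifying `os.environ`.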
- Fork this repository on GitHub
- Backend Deployment:
  - Go to render.com
  - New Web Service → Connect GitHub repo
  - Environment: Python
  - Build command: `pip install -r requirements.txt`
  - Start command: `gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app`
  - Add environment variable: `GROQ_API_KEY=...`
- Frontend Deployment:
  - New Static Site → Connect same repo
  - Build command: `cd frontend && npm install && npm run build`
  - Publish directory: `frontend/dist`
- Keep Server Warm:
  - Use cron-job.org (free)
  - Ping `/health` endpoint every 30 minutes
  - Prevents Render free tier cold starts
| Issue | Cause | Solution |
|---|---|---|
| Cold start (30-60s) | Render free tier spins down | Keep-alive ping every 30 min |
| Large PDFs slow | Processing time scales with size | Limit to <20MB PDFs |
| In-memory storage | No persistence after restart | Consider Redis/PostgreSQL for prod |
| Rate limiting | Groq free tier has limits | Implement request queuing if needed |
# Test PDF upload endpoint
curl -X POST http://localhost:8000/upload -F "file=@test.pdf"
# Test chat endpoint
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question":"What is this?","document_id":"xyz"}'
# Check health
curl http://localhost:8000/health

- Test document upload: Should show progress bar → success message
- Test streaming chat: Should see tokens appearing in real-time
- Test multi-document: Upload 2+ docs, switch between them
- RAG Concept: What is RAG?
- LangChain Docs: docs.langchain.com
- FAISS Guide: FAISS Fundamentals
- Groq API: console.groq.com
Feel free to submit issues and enhancement requests!
git clone https://github.com/farazmirzax/ContextAI.git
cd ContextAI
# Create your feature branch, make changes, and submit PR

MIT License - See LICENSE file for details.
Q: Why does the first upload take 30+ seconds?
A: Render free tier cold starts plus model loading time. Once the server is warm, later requests skip that delay. Use keep-alive pings to prevent cold starts.

Q: Can I use my own LLM?
A: Yes! Swap the Groq integration for any LangChain-compatible LLM (OpenAI, Anthropic, local models).

Q: What's the max document size?
A: Tested up to 50MB. Larger files will require chunking or database persistence.

Q: Does it work offline?
A: Yes! Set GROQ_API_KEY="" to use local HuggingFace models (no internet needed for inference once the models have been downloaded).

Q: How much does this cost?
A: $0 with free tiers (Render, Groq, Vercel, GitHub).
Built with ❤️ by Faraz Mirza
Showcasing modern RAG implementation, conversational AI, and production-grade full-stack development.