Skip to content

farazmirzax/ContextAI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

58 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ’¬ ContextAI - Intelligent Document Assistant

FastAPI React LangChain FAISS Groq

A production-ready Retrieval-Augmented Generation (RAG) application that enables intelligent conversations with PDF documents. Built for sub-200ms retrieval latency, streaming responses, and scalable inference.

πŸš€ Live Demo | πŸ“š Backend API


🎯 Quick Overview

ContextAI solves the problem of interacting with document content at scale:

  • Upload any PDF β†’ instantly searchable
  • Ask questions β†’ get answers from your documents (not training data)
  • Follow-up questions β†’ maintains conversation context
  • Streaming responses β†’ real-time typewriter effect (like ChatGPT)

Perfect for: Recruiters, researchers, legal teams, and anyone working with large document datasets.


πŸ—οΈ System Architecture

High-Level Flow

%%{init: {'theme':'dark'}}%%
graph TB
    subgraph Frontend["🎨 Frontend (React + TypeScript)"]
        UI["Landing Page & Chat UI"]
        Upload["Drag-Drop Upload"]
        Chat["Real-Time Chat"]
    end
    
    subgraph Backend["βš™οΈ Backend (FastAPI - Render)"]
        API["REST API Endpoints"]
        Upload_EP["/upload"]
        Chat_EP["/chat & /chat/stream"]
        Docs_EP["/documents"]
    end
    
    subgraph AI["🧠 AI & Vector Search"]
        HF["HuggingFace Embeddings"]
        FAISS["FAISS Vector DB"]
        Groq["Groq LLM<br/>llama-3.1-8b"]
    end
    
    UI --> Upload
    UI --> Chat
    Upload --> Upload_EP
    Chat --> Chat_EP
    Upload_EP --> HF
    HF --> FAISS
    Chat_EP --> FAISS
    Chat_EP --> Groq
    
    style Frontend fill:#1f1f2e,stroke:#00d4ff,color:#fff
    style Backend fill:#1f1f2e,stroke:#00ff88,color:#fff
    style AI fill:#1f1f2e,stroke:#ff006e,color:#fff
Loading

Document Processing Pipeline

%%{init: {'theme':'dark'}}%%
graph LR
    A["πŸ“„ PDF Upload"] --> B["πŸ“– PyPDFLoader<br/>Extract Text"]
    B --> C["βœ‚οΈ Split Text<br/>500 char chunks<br/>50 overlap"]
    C --> D["πŸ”’ Embeddings<br/>all-MiniLM-L6-v2<br/>384-dim"]
    D --> E["πŸ” FAISS<br/>Vector Store"]
    E --> F["πŸ’Ύ In-Memory<br/>Storage"]
    
    style A fill:#00ff88,stroke:#00ff88,color:#1a1a1a
    style B fill:#00d4ff,stroke:#00d4ff,color:#1a1a1a
    style C fill:#ffa700,stroke:#ffa700,color:#1a1a1a
    style D fill:#ff006e,stroke:#ff006e,color:#f0f0f0
    style E fill:#8338ec,stroke:#8338ec,color:#f0f0f0
    style F fill:#ffbe0b,stroke:#ffbe0b,color:#1a1a1a
Loading

Query-to-Answer Flow

%%{init: {'theme':'dark'}}%%
graph TD
    A["❓ User Question<br/>+ Chat History"] --> B["πŸ”„ History-Aware<br/>Reformulation"]
    B --> C["πŸ” FAISS Retrieval<br/>Top 3 Chunks"]
    C --> D["πŸ“ LangChain Prompt<br/>Assembly"]
    D --> E["🎯 System Prompt<br/>+ History<br/>+ Context<br/>+ Question"]
    E --> F["⚑ Groq API<br/>llama-3.1-8b-instant"]
    F --> G["πŸ“‘ Streaming Response<br/>Token by Token"]
    G --> H["✨ Real-Time Display<br/>Typewriter Effect"]
    
    style A fill:#00ff88,stroke:#00ff88,color:#1a1a1a
    style B fill:#00d4ff,stroke:#00d4ff,color:#1a1a1a
    style C fill:#ffa700,stroke:#ffa700,color:#1a1a1a
    style D fill:#ff006e,stroke:#ff006e,color:#f0f0f0
    style E fill:#8338ec,stroke:#8338ec,color:#f0f0f0
    style F fill:#ffbe0b,stroke:#ffbe0b,color:#1a1a1a
    style G fill:#00d4ff,stroke:#00d4ff,color:#1a1a1a
    style H fill:#00ff88,stroke:#00ff88,color:#1a1a1a
Loading

⚑ Key Features

1. RAG Implementation 🎯

  • Semantic document search using FAISS vector database
  • Dual retrieval: context + conversational history
  • Sub-200ms retrieval latency (FAISS + CPU optimization)

2. Conversational Intelligence 🧠

  • Maintains context across 10+ conversation turns
  • Automatic question reformulation for pronouns ("What about that?" β†’ "What about the third paragraph?")
  • History-aware prompting with LangChain

3. Streaming Architecture 🌊

  • Server-Sent Events (SSE) for real-time token streaming
  • Dual endpoints: /chat (instant) vs /chat/stream (progressive)
  • 50% perceived latency reduction compared to batch responses

4. Multi-Document Support πŸ“š

  • Upload multiple PDFs simultaneously
  • Switch between documents instantly
  • Per-document vector stores (isolated context)

5. Production-Ready πŸš€

  • Environment-based LLM selection (Groq API or HuggingFace)
  • Async background processing for PDFs
  • CORS-configured for cross-origin requests
  • Error handling with meaningful messages

πŸ› οΈ Tech Stack

Layer Technology Purpose
Frontend React 19 + TypeScript Modern UI with real-time updates
Vite Fast HMR development, optimized builds
Tailwind CSS Responsive styling
Axios + SSE API communication & streaming
Backend FastAPI High-performance async web framework
LangChain 1.x LLM orchestration & RAG pipeline
Groq API Fast token-per-second inference
FAISS Approximate nearest neighbor search
ML/Embeddings HuggingFace all-MiniLM-L6-v2 (384-dim, fast)
LangChain Splitters Intelligent document chunking
Infrastructure Render Backend hosting (Python runtime)
Vercel / GitHub Pages Frontend hosting (auto-deploys)

πŸ“Š Performance Metrics

Metric Value Notes
Retrieval Latency <200ms FAISS on CPU, top-3 chunks
LLM Response Time ~3-5s Groq llama-3.1-8b (streaming)
First Token Latency <500ms SSE streaming begins immediately
Upload Processing ~10-30s PDF β†’ embeddings β†’ FAISS index
Embedding Model Size ~100MB all-MiniLM-L6-v2
Typical Token Throughput 200-300 tok/s Groq inference speed
Cold Start Time 30-60s Render free tier wake-up
Warm Response Time 3-5s Typical end-to-end latency

πŸ”Œ API Endpoints

Document Management

POST   /upload              # Upload PDF, returns document_id
GET    /documents           # List all uploaded documents with status

Chat Endpoints

POST   /chat                # Single-turn Q&A with history
POST   /chat/stream         # Streaming Q&A with Server-Sent Events

Utilities

GET    /health              # Health check (for keep-alive pings)
GET    /debug/model         # Show active LLM & test inference
GET    /debug/cors          # CORS configuration info
GET    /docs                # Interactive API docs (Swagger UI)

Example Usage

Upload a document:

curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf"

Chat (single response):

curl -X POST "http://localhost:8000/chat" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is this document about?",
    "document_id": "b2b98275-d72a-47e2-b303-4e1ca237f964",
    "chat_history": [
      {"sender": "user", "text": "Tell me about X"},
      {"sender": "ai", "text": "X is..."}
    ]
  }'

Chat (streaming):

curl -X POST "http://localhost:8000/chat/stream" \
  -H "Content-Type: application/json" \
  -d '{"question": "Explain Y", "document_id": "...", "chat_history": []}'
  # Responses arrive as Server-Sent Events

πŸš€ Quick Start

Prerequisites

  • Python 3.9+
  • Node.js 18+
  • Groq API key (free tier available)

Backend Setup

cd backend
python -m venv venv
source venv/bin/activate  # Windows: .\venv\Scripts\activate

pip install -r requirements.txt
cp .env.example .env

# Add your GROQ_API_KEY to .env
export GROQ_API_KEY="gsk_..."

python main.py
# Server runs on http://localhost:8000

Frontend Setup

cd frontend
npm install
npm run dev
# Frontend runs on http://localhost:5173

Test the Application

  1. Open http://localhost:5173
  2. Upload a PDF (drag-drop or click)
  3. Wait for processing (progress bar shows status)
  4. Ask questions about the document
  5. Watch responses stream in real-time

🏭 How It Works

1. PDF Upload & Processing

  • User uploads PDF via frontend
  • Backend returns document_id immediately (non-blocking)
  • Background task processes:
    1. Extract text using PyPDF
    2. Split into 500-char chunks with 50-char overlap
    3. Generate embeddings (384-dim vectors)
    4. Store in FAISS vector index
    5. Save to in-memory vector_stores dict

Why async? Users get instant feedback, processing happens silently.

2. Question Answering

  • User asks question about document
  • If chat history exists: reformulate question context
  • Retrieve top 3 semantically similar chunks from FAISS
  • Format LangChain prompt:
    [SYSTEM]: You are a Q&A assistant. Use context to answer.
    [HISTORY]: [Previous messages for context]
    [CONTEXT]: [Top 3 relevant chunks from document]
    [QUESTION]: [User's actual question]
    
  • Send to Groq API (llama-3.1-8b-instant)
  • Stream tokens back via SSE

3. Streaming Response

  • LangChain's .astream() yields tokens as LLM generates them
  • Each token sent as JSON: {"chunk": "word"}
  • Frontend appends to visible message in real-time
  • Users see "typewriter effect" like ChatGPT

πŸ”§ Configuration

Environment Variables (.env)

# Required
GROQ_API_KEY=gsk_your_key_here

# Optional (defaults shown)
PORT=8000
HOST=0.0.0.0
ENVIRONMENT=production

Feature Toggles

The app automatically detects environment:

  • If GROQ_API_KEY is set: Uses Groq API (recommended for production)
  • If missing: Falls back to local HuggingFace flan-t5-small (slower, no rate limits)

πŸ“¦ Deployment

Deploy to Render (Free Tier)

  1. Fork this repository on GitHub

  2. Backend Deployment:

    • Go to render.com
    • New Web Service β†’ Connect GitHub repo
    • Environment: Python
    • Build command: pip install -r requirements.txt
    • Start command: gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app
    • Add environment variable: GROQ_API_KEY=...
  3. Frontend Deployment:

    • New Static Site β†’ Connect same repo
    • Build command: cd frontend && npm install && npm run build
    • Publish directory: frontend/dist
  4. Keep Server Warm:

    • Use cron-job.org (free)
    • Ping /health endpoint every 30 minutes
    • Prevents Render free tier cold starts

⚠️ Known Limitations & Workarounds

Issue Cause Solution
Cold start (30-60s) Render free tier spins down Keep-alive ping every 30 min
Large PDFs slow Processing time scales with size Limit to <20MB PDFs
In-memory storage No persistence after restart Consider Redis/PostgreSQL for prod
Rate limiting Groq free tier has limits Implement request queuing if needed

πŸ§ͺ Testing

Local Testing

# Test PDF upload endpoint
curl -X POST http://localhost:8000/upload -F "file=@test.pdf"

# Test chat endpoint
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"question":"What is this?","document_id":"xyz"}'

# Check health
curl http://localhost:8000/health

Frontend Testing

  • Test document upload: Should show progress bar β†’ success message
  • Test streaming chat: Should see tokens appearing in real-time
  • Test multi-document: Upload 2+ docs, switch between them

πŸŽ“ Learning Resources


πŸ‘₯ Contributing

Feel free to submit issues and enhancement requests!

git clone https://github.com/farazmirzax/ContextAI.git
cd ContextAI
# Create your feature branch, make changes, and submit PR

πŸ“„ License

MIT License - See LICENSE file for details.


πŸ™‹ FAQ

Q: Why does the first upload take 30+ seconds? A: Render free tier cold starts + model loading time. Subsequent requests are instant. Use keep-alive pings to prevent cold starts.

Q: Can I use my own LLM? A: Yes! Swap the Groq integration for any LangChain-compatible LLM (OpenAI, Anthropic, local models).

Q: What's the max document size? A: Tested up to 50MB. Larger files will require chunking or database persistence.

Q: Does it work offline? A: Yes! Set GROQ_API_KEY="" to use local HuggingFace models (no internet needed for inference).

Q: How much does this cost? A: $0 with free tiers (Render, Groq, Vercel, GitHub).


Built with ❀️ by Faraz Mirza

Showcasing modern RAG implementation, conversational AI, and production-grade full-stack development.

About

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors