DocuMentor - AI-Powered Study Assistant

🎯 Project Overview

DocuMentor is an offline AI tutor that helps students learn from their study materials by providing intelligent summarization, quiz generation, and interactive Q&A capabilities. Built entirely with open-source models and tools, it runs locally without requiring any API keys or cloud services.

🏗️ Architecture

Tech Stack

  • Frontend: React/Next.js (existing mock UI)
  • Backend: FastAPI (Python)
  • ML Models:
    • microsoft/Phi-3-mini-4k-instruct (3.8B params) - Text summarization and Q&A
    • google/flan-t5-xl - MCQ generation
    • bge-small-en or e5-small - Text embeddings
  • Vector Database: FAISS
  • PDF Processing: pdfplumber / PyMuPDF
  • Environment: Conda (environment name: documentor)
  • GPU: RTX 4060 (CUDA support)

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Frontend (React)                        │
│  - File Upload UI                                           │
│  - Chat Interface                                           │
│  - Summary Display                                          │
│  - Quiz Generator (future)                                  │
└───────────────────────────┬─────────────────────────────────┘
                            │ HTTP/REST API
┌───────────────────────────▼─────────────────────────────────┐
│                    Backend (FastAPI)                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  API Endpoints                                       │   │
│  │  - POST /upload_pdf                                 │   │
│  │  - POST /summarize                                  │   │
│  │  - POST /ask_question                               │   │
│  │  - POST /generate_quiz (future)                     │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Processing Pipeline                                 │   │
│  │  1. PDF Ingestion → Text Extraction                │   │
│  │  2. Chunking (150-300 words with overlap)          │   │
│  │  3. Embedding Generation                            │   │
│  │  4. FAISS Indexing                                  │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  ML Models (Local GPU)                              │   │
│  │  - Phi-3 (Summarization)                           │   │
│  │  - Flan-T5-XL (MCQ Generation)                     │   │
│  │  - BGE/E5 (Embeddings)                             │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

📁 Project Structure

DocuMentor/
├── website/
│   ├── client/              # React frontend
│   │   ├── components/
│   │   │   ├── app/         # Main app components
│   │   │   └── ui/          # UI components
│   │   ├── app/             # Next.js app directory
│   │   └── ...
│   └── server/              # (if needed for Next.js SSR)
├── backend/
│   ├── main.py              # FastAPI app entry point
│   ├── models/
│   │   ├── phi3_summarizer.py      # Phi-3 model wrapper
│   │   ├── t5_quiz_generator.py    # Flan-T5 wrapper
│   │   └── embeddings.py           # Embedding model
│   ├── services/
│   │   ├── pdf_processor.py        # PDF reading & chunking
│   │   ├── vector_store.py         # FAISS operations
│   │   ├── rag_pipeline.py         # RAG for Q&A
│   │   └── summarizer.py           # Summarization logic
│   ├── api/
│   │   ├── routes.py               # API endpoints
│   │   └── schemas.py              # Pydantic models
│   ├── utils/
│   │   ├── chunker.py              # Text chunking utilities
│   │   └── config.py               # Configuration
│   ├── requirements.txt
│   └── environment.yml
├── data/                    # Uploaded PDFs & processed data
│   ├── uploads/
│   ├── vectors/            # FAISS indices
│   └── processed/          # Chunked documents
├── models/                  # Downloaded model weights (cached)
└── claude.md               # This file
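As a rough illustration of what utils/config.py could hold, the sketch below collects the paths and model names from this document in one place. The Settings class, its field names, and the BAAI/ prefix on the embedding model ID are assumptions made for the example, not part of the existing code.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative only."""

    data_dir: Path = Path("data")
    # Model identifiers from the Tech Stack section (the BAAI/ prefix is assumed).
    summarizer_model: str = "microsoft/Phi-3-mini-4k-instruct"
    quiz_model: str = "google/flan-t5-xl"
    embedding_model: str = "BAAI/bge-small-en"
    # Chunk size bounds from the processing pipeline, in words.
    chunk_min_words: int = 150
    chunk_max_words: int = 300

    @property
    def uploads_dir(self) -> Path:
        return self.data_dir / "uploads"

    @property
    def vectors_dir(self) -> Path:
        return self.data_dir / "vectors"

    @property
    def processed_dir(self) -> Path:
        return self.data_dir / "processed"


settings = Settings()
```

Freezing the dataclass keeps the configuration read-only at runtime, so services can share one instance without defensive copies.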

🚀 3-Month Development Roadmap

📅 MONTH 1: Foundation – Build the Brain

Goal: Working pipeline from PDF → Q&A using local LLM

Week 1: PDF Ingestion + Chunking ⏳

  • Read PDFs using pdfplumber / PyMuPDF
  • Extract titles, headings, bullets (basic structure detection)
  • Split into clean 150–300 word chunks (with slight overlap)
  • Store as JSON / pickled dict
  • Checkpoint: Upload any textbook and see a clean list of all chunks
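The chunking step can be sketched as a sliding word window. This is a minimal sketch, not the project's chunker: the function name chunk_words, the 30-word overlap, and the output keys are assumptions, and the final chunk may fall below 150 words (a real implementation might merge a short tail into the previous chunk).

```python
import json


def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> list[dict]:
    """Split text into word windows of up to max_words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append({"id": len(chunks), "start_word": start, "text": " ".join(window)})
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def chunks_to_json(chunks: list[dict]) -> str:
    """Serialize chunks so they can be stored under data/processed/."""
    return json.dumps(chunks, ensure_ascii=False, indent=2)
```

Splitting on whitespace ignores headings and sentence boundaries; structure-aware splitting would build on the title and heading detection listed above.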

Week 2: Embeddings + Vector Index (FAISS)

  • Load open-source embedding model: bge-small-en, e5-small, or Instructor
  • Embed all chunks
  • Store embeddings in FAISS
  • Write a retriever: query → embedding → top-k chunks
  • Checkpoint: Enter "What is overfitting?" → get relevant chunks from doc
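The retrieval step (query → embedding → top-k chunks) boils down to a nearest-neighbour search. The sketch below does it by brute force with cosine similarity in plain Python, which is the same ranking a flat inner-product FAISS index returns over normalized vectors. It is a stand-in for illustration only: the function names are hypothetical and real embeddings would come from the embedding model rather than hand-written vectors.

```python
import math
from typing import Sequence


def normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int = 3) -> list[tuple[int, float]]:
    """Return (chunk_index, cosine_similarity) for the k most similar chunks."""
    q = normalize(query_vec)
    scored = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings"; bge-small-en vectors would have 384 dimensions.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], toy_chunks, k=2)
```

Brute force is fine for a single textbook; an index such as FAISS matters once the number of chunks grows large.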

Week 3–4: Local LLM + RAG Pipeline

  • Load Phi-3 with transformers (4-bit quantization for efficiency)
  • Pass top-k chunks + query into prompt
  • Get generated answer
  • Create FastAPI endpoints for Q&A
  • Checkpoint: Local chatbot answers based only on uploaded study material
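Passing the top-k chunks and the query into a prompt might look like the sketch below. The wording of the instruction, the function name build_rag_prompt, and the 1200-word context budget are assumptions chosen to leave headroom in Phi-3's 4K-token window; in practice the result would likely be wrapped with the model's chat template before generation.

```python
def build_rag_prompt(question: str, chunks: list[str], max_context_words: int = 1200) -> str:
    """Assemble a grounded prompt from retrieved chunks, most relevant first."""
    context_parts = []
    used_words = 0
    for number, chunk in enumerate(chunks, start=1):
        chunk_words = len(chunk.split())
        if used_words + chunk_words > max_context_words:
            break  # stop before the context budget is exceeded
        context_parts.append(f"[{number}] {chunk}")
        used_words += chunk_words
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks lets the answer cite its sources, and the explicit fallback instruction supports the checkpoint that answers come only from the uploaded material.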

📅 MONTH 2: Intelligence – Summary & Quizzes

Goal: Add summarization and question generation

Week 5–6: Summarizer

  • Chunk-level summaries (2–3 sentences per section)
  • Full-doc summary
  • Prompt tuning: "bullet points", "exam-style", "in your own words"
  • Connect to frontend summary display
  • Checkpoint: App shows a quick summary of any chapter or section
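One way to get from chunk-level summaries to a full-document summary is to summarize the summaries in groups until a single text remains. The sketch below assumes a summarize callable that wraps the model (here any function from text to text); summarize_document and group_size are hypothetical names for this example.

```python
from typing import Callable


def summarize_document(sections: list[str],
                       summarize: Callable[[str], str],
                       group_size: int = 8) -> dict:
    """Summarize each section, then repeatedly condense groups of summaries."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each pass shrinks the list")
    section_summaries = [summarize(section) for section in sections]
    level = section_summaries
    while len(level) > 1:
        level = [
            summarize("\n".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
    return {"sections": section_summaries, "full": level[0] if level else ""}


def first_two_words(text: str) -> str:
    """Toy stand-in for the model call, used only to exercise the control flow."""
    return " ".join(text.split()[:2])


result = summarize_document(["a b c", "d e f", "g h i"], first_two_words)
```

Grouping keeps every model call within the context window regardless of document length, at the cost of extra passes for very long documents.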

Week 7–8: Quiz Generator

  • Integrate Flan-T5-XL for MCQ generation
  • Prompt LLM to generate 3–5 MCQs per section
  • Generate open-ended questions
  • Add answer options + explanations
  • Save to JSON for frontend consumption
  • Checkpoint: Auto-generated quiz for each topic
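Model output for MCQs tends to be messy, so it helps to validate each question before saving it. The sketch below shows one possible JSON shape for the frontend; the MCQ class, its field names, and quiz_to_json are assumptions for illustration, not a fixed format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MCQ:
    question: str
    options: list[str]
    answer_index: int  # position of the correct option in `options`
    explanation: str = ""

    def __post_init__(self) -> None:
        if len(self.options) < 2:
            raise ValueError("an MCQ needs at least two options")
        if len(set(self.options)) != len(self.options):
            raise ValueError("options must be unique")
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index must point at one of the options")


def quiz_to_json(section: str, questions: list[MCQ]) -> str:
    """Serialize validated questions for frontend consumption."""
    payload = {"section": section, "questions": [asdict(q) for q in questions]}
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Rejecting malformed questions at construction time means a bad generation can be retried or dropped instead of reaching the quiz view.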

Week 9: Flashcard / Revision View

  • Turn Q&A pairs into flashcards
  • Optionally use Anki deck format
  • Add toggle: "Mark as Learned / Unseen"
  • Frontend component for flashcard display
  • Checkpoint: Revision tool built from your own notes
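Flashcards can reuse the Q&A pairs directly. The sketch below is a minimal take with a learned flag and a tab-separated export, which Anki's text import can read; a packaged deck file would need a dedicated library. The Flashcard class and helper names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Flashcard:
    front: str
    back: str
    learned: bool = False  # False corresponds to "Unseen"


def to_flashcards(qa_pairs: list[tuple[str, str]]) -> list[Flashcard]:
    """Turn (question, answer) pairs into unlearned flashcards."""
    return [Flashcard(front=q, back=a) for q, a in qa_pairs]


def toggle_learned(card: Flashcard) -> None:
    """Flip the card between learned and unseen."""
    card.learned = not card.learned


def export_tsv(cards: list[Flashcard]) -> str:
    """Write one 'front<TAB>back' line per card."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for card in cards:
        writer.writerow([card.front, card.back])
    return buffer.getvalue()
```

Using the csv module rather than manual joins means answers containing tabs or line breaks are quoted instead of corrupting the file.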

📅 MONTH 3: User Interface + Extensions

Goal: Make it usable, beautiful, and extensible

Week 10–11: Polish UI

  • Connect all frontend components to backend APIs
  • File upload with progress indicator
  • Chat interface for Q&A
  • Tabs for: Summary, Quiz, Flashcards
  • Option to download generated content
  • Checkpoint: Working web UI that looks good and runs locally

Week 12: Extensions / Final Touch

  • Multi-document support
  • Search by topic, not just by text
  • Save/load sessions
  • Export quizzes and summaries (PDF/Markdown)
  • Performance optimization
  • Final Checkpoint: Polished, useful AI tutor built from scratch

🔧 Current Phase: Month 1, Week 1 - Backend Setup

Immediate Tasks (Current Session)

  1. ✅ Create project documentation
  2. ⏳ Set up backend directory structure
  3. ⏳ Configure conda environment with dependencies
  4. ⏳ Implement PDF processor module
  5. ⏳ Create FastAPI skeleton with basic endpoints
  6. ⏳ Test PDF upload and chunking workflow

📦 Dependencies

Backend Requirements

# Core Framework
fastapi==0.109.0
uvicorn[standard]==0.27.0
python-multipart==0.0.6

# ML & NLP
torch>=2.1.0
transformers>=4.36.0
sentence-transformers>=2.3.1
accelerate>=0.25.0
bitsandbytes>=0.41.0  # For 4-bit quantization

# Vector Store
faiss-cpu==1.7.4  # CPU build; for GPU support, install faiss-gpu from conda instead

# PDF Processing
pdfplumber>=0.10.3
PyMuPDF>=1.23.8

# Utilities
numpy>=1.24.0
pandas>=2.1.0
pydantic>=2.5.0
python-dotenv>=1.0.0

Frontend Dependencies (Already set up)

  • React
  • Next.js
  • TailwindCSS
  • TypeScript

🎯 Core ML/NLP Concepts Implemented

  1. Embedding Generation - Convert text chunks to dense vectors
  2. Vector Similarity Search - Retrieve relevant chunks using FAISS
  3. Retrieval-Augmented Generation (RAG) - Combine retrieval with generation
  4. Prompt Engineering - Craft effective prompts for summarization and quizzes
  5. Local LLM Inference - Run models efficiently on consumer GPU
  6. Smart Chunking - Segment documents intelligently with context preservation
  7. Quantization - Use 4-bit models to reduce memory usage

🔐 Security & Privacy

  • 100% Offline: All processing happens locally once the model weights have been downloaded
  • No Data Leakage: Documents never leave your machine
  • No API Keys: Zero dependency on external services at runtime
  • Private Learning: Your study materials remain completely private

🚦 Getting Started

Prerequisites

  • Python 3.10+
  • CUDA-capable GPU (RTX 4060 in this case)
  • Conda installed
  • Node.js 18+ (for frontend)

Quick Start

# 1. Activate conda environment (create it first; see Environment Setup)
conda activate documentor

# 2. Install backend dependencies
cd backend
pip install -r requirements.txt

# 3. Start FastAPI server (run from backend/, where main.py lives)
uvicorn main:app --reload --host 0.0.0.0 --port 8000

# 4. In another terminal, start frontend from the repository root (already set up)
cd website/client
npm run dev

Environment Setup (To be created)

# Create conda environment
conda create -n documentor python=3.10
conda activate documentor

# Install PyTorch with CUDA support
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia

# Install remaining dependencies (run from backend/)
pip install -r requirements.txt

📊 Model Details

Phi-3-mini-4k-instruct

  • Size: 3.8B parameters
  • Context: 4K tokens
  • Use Case: Text summarization, Q&A
  • Quantization: 4-bit (fits in ~3GB VRAM)
  • License: MIT
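The VRAM figure can be sanity-checked with simple arithmetic: weight memory is the parameter count times bits per parameter, divided by 8 to get bytes. The sketch below covers weights only; quantization metadata, activations, and the KV cache add overhead on top, which is presumably why the estimate above is closer to 3GB. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


phi3_4bit = weight_memory_gb(3.8e9, 4)    # 3.8B parameters at 4 bits each
phi3_fp16 = weight_memory_gb(3.8e9, 16)   # the same model in half precision
```

For Phi-3-mini this gives about 1.9 GB of weights at 4-bit versus about 7.6 GB at 16-bit, which is the difference that makes the model practical on a consumer GPU.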

Flan-T5-XL

  • Size: 3B parameters
  • Use Case: MCQ generation, instruction following
  • Quantization: 4-bit recommended
  • License: Apache 2.0

BGE-Small-EN

  • Size: 33M parameters
  • Use Case: Text embeddings for retrieval
  • Embedding Dim: 384
  • License: MIT

🎨 API Endpoints (To be implemented)

PDF Management

  • POST /api/upload - Upload PDF file
  • GET /api/documents - List uploaded documents
  • DELETE /api/documents/{doc_id} - Delete document

Summarization

  • POST /api/summarize - Summarize document or section
    • Body: { "doc_id": "...", "mode": "full|chunked" }

Q&A

  • POST /api/ask - Ask question about document
    • Body: { "doc_id": "...", "question": "..." }

Quiz Generation (Future)

  • POST /api/generate_quiz - Generate MCQs
    • Body: { "doc_id": "...", "num_questions": 5 }
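The request bodies above map naturally onto small validated types. The sketch below uses plain dataclasses as a dependency-free stand-in for the Pydantic models planned for api/schemas.py; the class names, the "full" default for mode, and the default of 5 questions are assumptions for illustration.

```python
from dataclasses import dataclass

VALID_MODES = ("full", "chunked")


@dataclass(frozen=True)
class SummarizeRequest:
    doc_id: str
    mode: str = "full"

    def __post_init__(self) -> None:
        if not self.doc_id:
            raise ValueError("doc_id must not be empty")
        if self.mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}")


@dataclass(frozen=True)
class AskRequest:
    doc_id: str
    question: str

    def __post_init__(self) -> None:
        if not self.doc_id or not self.question.strip():
            raise ValueError("doc_id and question must not be empty")


@dataclass(frozen=True)
class QuizRequest:
    doc_id: str
    num_questions: int = 5

    def __post_init__(self) -> None:
        if not self.doc_id or self.num_questions < 1:
            raise ValueError("doc_id is required and num_questions must be >= 1")


def parse_body(cls, body: dict):
    """Build a request object from a decoded JSON body, rejecting bad fields."""
    try:
        return cls(**body)
    except TypeError as exc:  # missing or unexpected keys
        raise ValueError(f"invalid request body: {exc}") from exc
```

With Pydantic the same constraints would be declared on the model fields and enforced by FastAPI automatically, returning a validation error to the client.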

🧪 Testing Strategy

  1. Unit Tests: Test each module independently
  2. Integration Tests: Test PDF → Chunks → Embeddings → Retrieval
  3. End-to-End Tests: Test full user workflows
  4. Model Quality Tests: Evaluate summarization and MCQ quality

📈 Performance Considerations

  • Model Loading: Cache models in memory to avoid reloading
  • Batch Processing: Process multiple chunks together
  • Quantization: Use 4-bit models to reduce VRAM usage
  • FAISS: Use GPU-accelerated FAISS if available
  • Async Processing: Use FastAPI's async capabilities for I/O operations
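Two of these points, model caching and batch processing, can be sketched without any ML dependencies. In the example below, _load_model is a hypothetical placeholder for an expensive call such as loading weights onto the GPU, and batched is a small helper for feeding chunks to the embedding model in groups.

```python
from functools import lru_cache
from typing import Iterator, Sequence


def _load_model(name: str) -> dict:
    """Hypothetical placeholder for an expensive model load."""
    return {"name": name}


@lru_cache(maxsize=None)
def get_model(name: str) -> dict:
    """Load each model once per process; later calls return the cached object."""
    return _load_model(name)


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Calling get_model inside request handlers then costs a dictionary lookup after the first request, and the batch size can be tuned to whatever fits in VRAM alongside the loaded models.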

🔄 Future Enhancements

  • Multi-document chat
  • Knowledge graph visualization
  • Spaced repetition system
  • Audio lecture transcription and summarization
  • Handwritten notes OCR
  • Collaborative study features
  • Mobile app (React Native)

📝 Notes

  • All models run locally on RTX 4060 GPU
  • Conda environment name: documentor
  • Frontend is already built as a mock - needs backend connection
  • Start with PDF summarization, then add MCQ generation later
  • Chunked summarization approach for large documents

🤝 Contributing

This is a personal learning project. Feel free to use this architecture for your own implementations!

📄 License

Open source - Educational purposes


Last Updated: 2025-11-06
Status: Phase 1 - Backend Development in Progress