Enterprise Document Q&A System

A lightweight RAG (Retrieval-Augmented Generation) system built from scratch without LangChain. Enables semantic search and intelligent question-answering over document collections using Claude AI and ChromaDB.

🚀 Live Demo

Coming soon: a deploy link will be added after the initial deployment.

Screenshots

  • Document Upload Interface
  • File Processing
  • Query & AI-Powered Answers
  • Source Citations with Relevance Scores

Why I Built This

I built this while exploring production RAG patterns for enterprise applications. Every company is racing to unlock knowledge trapped in documents, and I wanted to understand the full stack, from chunking strategies to deployment, without relying on heavy frameworks like LangChain.

Key learnings:

  • Chunking strategies significantly impact retrieval quality
  • Source citation is critical for enterprise trust
  • Direct API integration gives better control than abstraction layers
  • Proper error handling matters more than perfect embeddings

What this demonstrates:

  • Production-ready RAG from scratch (no LangChain)
  • Custom chunking and retrieval pipeline
  • Direct Claude API integration
  • Clean, maintainable code patterns
  • End-to-end deployment with Docker

Features

  • 📄 PDF Document Processing: Upload and index PDF documents
  • 🔍 Semantic Search: Find relevant information using natural language
  • 🤖 AI-Powered Answers: Get accurate responses backed by your documents
  • 📎 Source Citations: See exactly which documents and pages informed each answer
  • 🎯 Relevance Scoring: Understand confidence levels for retrieved information
  • 🚀 Fast Retrieval: Optimized vector search with ChromaDB

Architecture

Custom RAG Pipeline (No LangChain)

┌─────────────┐
│   User      │
│   Query     │
└──────┬──────┘
       │
       ▼
┌──────────────────────────────────────┐
│      Streamlit UI Layer              │
└──────┬───────────────────────────────┘
       │
       ▼
┌──────────────────────────────────────┐
│   Custom RAG Pipeline                │
│                                      │
│  ┌──────────┐    ┌──────────────┐    │
│  │ PyPDF    │───▶│ Custom       │    │
│  │ Loader   │    │ Chunker      │    │
│  └──────────┘    └──────┬───────┘    │
│                         │            │
│                         ▼            │
│              ┌──────────────────┐    │
│              │  ChromaDB        │    │
│              │  (embeddings +   │    │
│              │   vector store)  │    │
│              └──────────┬───────┘    │
│                         │            │
│         ┌───────────────┴──────┐     │
│         ▼                      │     │
│  ┌──────────────┐              │     │
│  │  Semantic    │              │     │
│  │  Search      │              │     │
│  │  (Top-K)     │              │     │
│  └──────┬───────┘              │     │
│         │                      │     │
│         ▼                      │     │
│  ┌──────────────┐              │     │
│  │  Direct      │              │     │
│  │  Claude API  │◀─────────────┘     │
│  └──────┬───────┘                    │
└─────────┼────────────────────────────┘
          │
          ▼
     ┌─────────┐
     │ Answer  │
     │   +     │
     │ Sources │
     └─────────┘

Tech Stack

Component             Technology          Purpose
LLM                   Claude 3.5 Sonnet   Response generation
Vector Store          ChromaDB            Semantic search & embeddings
Document Processing   PyPDF               PDF text extraction
UI                    Streamlit           Web interface
Language              Python 3.11+        Core implementation

Note: Built without LangChain - direct API integration for full control and minimal dependencies.

Quick Start

Prerequisites

  • Python 3.11 or higher
  • Anthropic API key (available from the Anthropic Console)
  • (Optional) Voyage AI or OpenAI API key for embeddings

Installation

  1. Clone the repository

    git clone https://github.com/sandeepuppalapati/enterprise-doc-qa.git
    cd enterprise-doc-qa
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment variables

    cp .env.example .env
    # Edit .env and add your API keys
  5. Run the application

    streamlit run src/ui/app.py
  6. Open your browser and navigate to http://localhost:8501

Docker Deployment

# Build the image
docker build -t doc-qa-system .

# Run the container
docker run -p 8501:8501 --env-file .env doc-qa-system

Usage

Sample Questions to Try

Once you've uploaded documents, try questions like:

  • "What are the key terms of the contract?"
  • "Summarize the main findings from the research report"
  • "What security measures are mentioned?"
  • "Compare the pricing models discussed"
  • "What are the project timelines?"

Best Practices

For best results:

  • Upload well-structured PDFs (avoid scanned images without OCR)
  • Keep documents focused on a specific domain
  • Ask specific questions rather than broad queries
  • Review source citations to verify accuracy

Project Structure

enterprise-doc-qa/
├── src/
│   ├── components/
│   │   ├── document_loader.py     # PDF processing
│   │   ├── chunking.py            # Text splitting logic
│   │   ├── embeddings.py          # Vector generation
│   │   └── retrieval.py           # RAG chain implementation
│   └── ui/
│       └── app.py                 # Streamlit interface
├── tests/
│   ├── test_chunking.py
│   └── test_retrieval.py
├── data/                          # Sample documents (gitignored)
├── docs/                          # Additional documentation
├── .env.example                   # Environment template
├── .gitignore
├── requirements.txt
├── Dockerfile
└── README.md
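
As a rough illustration of the document_loader.py step in the layout above, per-page extraction with pypdf can look like the following; the function name and metadata fields are assumptions, not the repository's actual interface:

from pypdf import PdfReader

def load_pdf(path: str) -> list[dict]:
    # Extract text page by page, keeping page numbers for source citations.
    reader = PdfReader(path)
    pages = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if text.strip():  # skip blank or image-only pages (no OCR)
            pages.append({"text": text, "source": path, "page": page_number})
    return pages

Keeping the page number with each extracted block is what makes per-page citations possible later in the pipeline.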

Configuration

Key environment variables in .env:

# Required
ANTHROPIC_API_KEY=your_claude_api_key

# Optional (for embeddings)
VOYAGE_API_KEY=your_voyage_key
OPENAI_API_KEY=your_openai_key

# Tuning parameters
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
TOP_K_RESULTS=4
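
As a sketch of how CHUNK_SIZE and CHUNK_OVERLAP might drive a simple sliding-window chunker (illustrative only, not the exact logic in chunking.py):

import os

CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", 1000))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", 200))

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    # Overlapping windows so context is not lost at chunk boundaries.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

Larger chunks keep more context per retrieval hit; more overlap reduces the chance that an answer spans a chunk boundary and is lost.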

Limitations & Future Improvements

Current Limitations

  • PDF only (no DOCX, TXT, HTML yet)
  • No multi-document comparison
  • Chat history not persisted across sessions
  • English language only

Roadmap

  • Add support for DOCX, TXT, Markdown files
  • Implement conversation memory
  • Add authentication and multi-user support
  • Hybrid search (keyword + semantic)
  • Export Q&A history
  • Advanced chunking strategies (semantic splitting)
  • Custom embedding fine-tuning

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src tests/

# Run specific test file
pytest tests/test_chunking.py
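
For example, tests/test_chunking.py might contain a check like this one, assuming a chunk_text helper along the lines of the sketch in the Configuration section (the import path is hypothetical):

from src.components.chunking import chunk_text  # hypothetical import path

def test_chunks_overlap():
    text = "".join(str(i % 10) for i in range(2500))
    chunks = chunk_text(text, size=1000, overlap=200)
    # Consecutive chunks should share their 200-character overlap region.
    assert chunks[0][-200:] == chunks[1][:200]
    assert all(len(c) <= 1000 for c in chunks)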

Performance

Benchmarks (on M1 Mac, 100-page PDF):

  • Document processing: ~15 seconds
  • Query response time: ~2-3 seconds
  • Embedding generation: ~5 seconds (cached afterward)

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

MIT License - see LICENSE file for details

Contact

Sandeep Uppalapati


Note: This is a demonstration project. For production use, add proper authentication, rate limiting, and security measures.
