Document Processor

This Streamlit application lets users upload or select a PDF document, extract its text with Google Gemini's OCR capabilities, summarize the content, and ask questions about it using Retrieval Augmented Generation (RAG).

How to Run the Application

Step 1: Clone the Repository

Open your terminal or command prompt and run the following command:

```bash
git clone https://github.com/sahilgupta3023/Document-Processor.git
```

Step 2: Install Dependencies using requirements.txt

Move into the cloned directory, then install the required packages:

```bash
cd Document-Processor
pip install -r requirements.txt
```

Step 3: Running the Application

Navigate to the project directory in your terminal or command prompt (if you're not already there) and run the Streamlit application:

```bash
streamlit run web_app.py
```

This will start the Streamlit server, and your application will open in your default web browser.

Screen recording: https://drive.google.com/file/d/1EGpfBz8NLP8x-cJ0_nmp3BUHfs2D2aUX/view?usp=drive_link

After entering your Google Gemini API key, you can upload your own PDF or use the default 'The Gift of Magi' PDF to generate a summary and ask questions about the document.

Dependencies

  • Streamlit
  • LangChain
  • Hugging Face Transformers
  • ChromaDB
  • Google Generative AI
  • python-dotenv
  • Werkzeug
  • editdistance
  • pypdf

Libraries/APIs

  • Streamlit: Chosen for its simplicity in creating web applications with Python, allowing for rapid prototyping and deployment.
  • LangChain: Used for its powerful framework for developing applications powered by language models, specifically for text splitting and vector store interactions.
  • Hugging Face Transformers: Employed for embedding generation, leveraging the all-MiniLM-L6-v2 model for efficient and effective text embeddings.
  • ChromaDB: Selected for its ease of use and persistence capabilities in storing and retrieving vector embeddings, enabling efficient similarity searches.
  • Google Generative AI (Gemini): Utilized for its state-of-the-art capabilities in OCR, summarization, and question answering, providing high-quality results.
  • python-dotenv: Used to manage environment variables, keeping sensitive information like API keys out of the source code.
  • Werkzeug: Employed for secure file name handling during PDF uploads (see the sketch below).
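
For illustration, here is a minimal sketch of how these two utilities are typically wired together; the GOOGLE_API_KEY variable name and the sample filename are assumptions, not code from this repository:

```python
import os

from dotenv import load_dotenv
from werkzeug.utils import secure_filename

# Load variables from a local .env file into the process environment.
load_dotenv()
api_key = os.getenv("GOOGLE_API_KEY")  # assumed variable name

# Sanitize a user-supplied filename before writing the upload to disk.
safe_name = secure_filename("../../etc/passwd my upload.pdf")
print(safe_name)  # -> etc_passwd_my_upload.pdf
```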

Design Choices

  • RAG Implementation: The application implements a Retrieval Augmented Generation (RAG) approach to answer questions, ensuring that the responses are grounded in the provided document's content.
  • Modular Design: The code is structured into separate modules (rag.py, pdf_to_text_extraction.py, summarization.py) for better organization and maintainability.
  • Error Handling: Basic error handling provides informative messages to the user for issues like missing files or API key errors (an illustrative sketch follows this list).
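
As an illustration of the missing-API-key case (a sketch, not the repository's exact code), a Streamlit guard might look like this:

```python
import streamlit as st

# Fail early with a readable message instead of erroring deep inside an API call.
api_key = st.text_input("Google Gemini API key", type="password")
if not api_key:
    st.error("Please enter your Gemini API key to continue.")
    st.stop()
```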

How Text Is Extracted from the PDFs

The PDF is first converted into a base64-encoded string, since the Gemini API accepts base64-encoded data for PDF processing. The Google Gemini API is then used to perform OCR on the encoded PDF data. To evaluate the extraction, I created a separate .txt file containing the story and computed the accuracy using Character Error Rate (CER). The CER score was 0.029, which indicates the OCR performed well.
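
A minimal sketch of this flow with the google-generativeai SDK and the editdistance package (the SDK takes raw PDF bytes and base64-encodes them for the underlying REST call; the model name, prompt, and file paths are assumptions):

```python
import editdistance
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

# Send the PDF to Gemini and ask for a verbatim transcription.
with open("the_gift_of_the_magi.pdf", "rb") as f:
    pdf_bytes = f.read()
response = model.generate_content([
    {"mime_type": "application/pdf", "data": pdf_bytes},
    "Extract all text from this PDF exactly as it appears.",
])
ocr_text = response.text

# Character Error Rate: edit (Levenshtein) distance divided by reference length.
with open("reference_story.txt", encoding="utf-8") as f:
    reference = f.read()
cer = editdistance.eval(ocr_text, reference) / len(reference)
print(f"CER: {cer:.3f}")  # 0.029 was measured on the bundled story
```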

How Summarization Is Performed

Summarization of the extracted text is performed with Google Gemini through the summarization.py module. The prompt instructs Gemini to generate a summary according to specific requirements.
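
The exact prompt lives in summarization.py; the following is only a sketch of the call pattern, with illustrative prompt wording (extracted_text is the OCR output from the previous step):

```python
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

# Illustrative prompt; the repository's actual instructions may differ.
prompt = (
    "Summarize the following story in a few concise paragraphs, covering "
    "the main characters, plot, and theme:\n\n" + extracted_text
)
summary = model.generate_content(prompt).text
print(summary)
```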

Question Answering from PDFs with Gemini, Sentence Transformers, ChromaDB, and LangChain

  1. Text Chunking: The extracted text is segmented into chunks using LangChain's RecursiveCharacterTextSplitter, which handles code, lists, and paragraphs more effectively than simple character splitting and yields semantically coherent chunks for RAG.
  2. Storing Embeddings in ChromaDB: Embeddings generated by Hugging Face's all-MiniLM-L6-v2 model are stored in ChromaDB, enabling efficient semantic similarity searches.
  3. Real-Time Query Retrieval: When the user asks a question, the query is converted into an embedding and used to query the vector store.
  4. Extracting Relevant Chunks: Chunks are ranked by cosine similarity to the query embedding, and the top matches are retrieved.
  5. Response Generation: The retrieved context and the user's query are combined into a prompt and sent to the Google Gemini API, which generates a real-time response (see the sketch after this list).
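
Putting the five steps together, here is a minimal end-to-end sketch; chunk sizes, model names, the persistence path, and the sample question are assumptions rather than the repository's exact values:

```python
import google.generativeai as genai
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# extracted_text is the OCR output produced earlier.
# 1. Split it into overlapping, semantically coherent chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(extracted_text)

# 2. Embed each chunk with all-MiniLM-L6-v2 and persist the vectors in ChromaDB,
#    configured for cosine distance to match step 4.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma.from_texts(
    chunks,
    embeddings,
    persist_directory="chroma_db",
    collection_metadata={"hnsw:space": "cosine"},
)

# 3-4. Embed the question and retrieve the most similar chunks.
question = "Why did Della sell her hair?"
relevant_docs = store.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# 5. Ground Gemini's answer in the retrieved context.
model = genai.GenerativeModel("gemini-1.5-flash")
answer = model.generate_content(
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
).text
print(answer)
```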
