Document Processor

This Streamlit application lets users upload or select a PDF document, extract its text with Google Gemini's OCR capabilities, summarize the content, and ask questions about it using Retrieval Augmented Generation (RAG).

How to Run the Application

Step 1: Clone the Repository

Open your terminal or command prompt and run the following command:

```bash
git clone https://github.com/sahilgupta3023/Document-Processor.git
```

Step 2: Install Dependencies using requirements.txt

Move into the cloned directory, then install the required packages:

```bash
cd Document-Processor
pip install -r requirements.txt
```

Step 3: Running the Application

Navigate to the project directory in your terminal or command prompt (if you're not already there) and run the Streamlit application:

```bash
streamlit run web_app.py
```

This will start the Streamlit server, and your application will open in your default web browser.

Screen recording: https://drive.google.com/file/d/1EGpfBz8NLP8x-cJ0_nmp3BUHfs2D2aUX/view?usp=drive_link

After entering your Google Gemini API key, you can upload your own PDF or use the default 'The Gift of Magi' PDF to generate a summary and ask questions about the document.

Dependencies

  • Streamlit
  • LangChain
  • Hugging Face Transformers
  • ChromaDB
  • Google Generative AI
  • python-dotenv
  • Werkzeug
  • editdistance
  • pypdf

Libraries/APIs

  • Streamlit: Chosen for its simplicity in creating web applications with Python, allowing for rapid prototyping and deployment.
  • LangChain: Used for its powerful framework for developing applications powered by language models, specifically for text splitting and vector store interactions.
  • Hugging Face Transformers: Employed for embedding generation, leveraging the all-MiniLM-L6-v2 model for efficient and effective text embeddings.
  • ChromaDB: Selected for its ease of use and persistence capabilities in storing and retrieving vector embeddings, enabling efficient similarity searches.
  • Google Generative AI (Gemini): Utilized for its state-of-the-art capabilities in OCR, summarization, and question answering, providing high-quality results.
  • python-dotenv: Used to manage environment variables, keeping sensitive information like API keys out of the source code.
  • Werkzeug: Employed for secure file name handling during PDF uploads (see the sketch below).
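
For illustration, here is a minimal sketch of how these two utilities are typically wired together; the GOOGLE_API_KEY variable name and the sample filename are assumptions, not code from this repository:

```python
import os

from dotenv import load_dotenv
from werkzeug.utils import secure_filename

# Load variables from a local .env file into the process environment.
load_dotenv()
api_key = os.getenv("GOOGLE_API_KEY")  # assumed variable name

# Sanitize a user-supplied filename before writing the upload to disk.
safe_name = secure_filename("../../etc/passwd my upload.pdf")
print(safe_name)  # -> etc_passwd_my_upload.pdf
```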

Design Choices

  • RAG Implementation: The application implements a Retrieval Augmented Generation (RAG) approach to answer questions, ensuring that the responses are grounded in the provided document's content.
  • Modular Design: The code is structured into separate modules (rag.py, pdf_to_text_extraction.py, summarization.py) for better organization and maintainability.
  • Error Handling: Basic error handling provides informative messages to the user for issues like missing files or API key errors (an illustrative sketch follows this list).
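
As an illustration of the missing-API-key case (a sketch, not the repository's exact code), a Streamlit guard might look like this:

```python
import streamlit as st

# Fail early with a readable message instead of erroring deep inside an API call.
api_key = st.text_input("Google Gemini API key", type="password")
if not api_key:
    st.error("Please enter your Gemini API key to continue.")
    st.stop()
```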

How Text Is Extracted from the PDFs

The PDF is first converted into a base64-encoded string, since the Gemini API accepts base64-encoded data for PDF processing. The Google Gemini API is then used to perform OCR on the encoded PDF data. To evaluate the extraction, I created a separate .txt file containing the story and computed the accuracy using Character Error Rate (CER). The CER score was 0.029, which indicates the OCR performed well.
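
A minimal sketch of this flow with the google-generativeai SDK and the editdistance package (the SDK takes raw PDF bytes and base64-encodes them for the underlying REST call; the model name, prompt, and file paths are assumptions):

```python
import editdistance
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

# Send the PDF to Gemini and ask for a verbatim transcription.
with open("the_gift_of_the_magi.pdf", "rb") as f:
    pdf_bytes = f.read()
response = model.generate_content([
    {"mime_type": "application/pdf", "data": pdf_bytes},
    "Extract all text from this PDF exactly as it appears.",
])
ocr_text = response.text

# Character Error Rate: edit (Levenshtein) distance divided by reference length.
with open("reference_story.txt", encoding="utf-8") as f:
    reference = f.read()
cer = editdistance.eval(ocr_text, reference) / len(reference)
print(f"CER: {cer:.3f}")  # 0.029 was measured on the bundled story
```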

How Summarization Is Performed

Summarization of the extracted text is performed with Google Gemini through the summarization.py module. The prompt instructs Gemini to generate a summary according to specific requirements.
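
The exact prompt lives in summarization.py; the following is only a sketch of the call pattern, with illustrative prompt wording (extracted_text is the OCR output from the previous step):

```python
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

# Illustrative prompt; the repository's actual instructions may differ.
prompt = (
    "Summarize the following story in a few concise paragraphs, covering "
    "the main characters, plot, and theme:\n\n" + extracted_text
)
summary = model.generate_content(prompt).text
print(summary)
```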

Question Answering from PDFs with Gemini, Sentence Transformers, ChromaDB, and LangChain

  1. Text Chunking: The extracted text is segmented into chunks using LangChain's RecursiveCharacterTextSplitter, which handles code, lists, and paragraphs more effectively than simple character splitting and yields semantically coherent chunks for RAG.
  2. Storing Embeddings in ChromaDB: Embeddings generated by Hugging Face's all-MiniLM-L6-v2 model are stored in ChromaDB, enabling efficient semantic similarity searches.
  3. Real-Time Query Retrieval: When the user asks a question, the query is converted into an embedding and used to query the vector store.
  4. Extracting Relevant Chunks: Chunks are ranked by cosine similarity to the query embedding, and the top matches are retrieved.
  5. Response Generation: The retrieved context and the user's query are combined into a prompt and sent to the Google Gemini API, which generates a real-time response (see the sketch after this list).
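
Putting the five steps together, here is a minimal end-to-end sketch; chunk sizes, model names, the persistence path, and the sample question are assumptions rather than the repository's exact values:

```python
import google.generativeai as genai
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# extracted_text is the OCR output produced earlier.
# 1. Split it into overlapping, semantically coherent chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(extracted_text)

# 2. Embed each chunk with all-MiniLM-L6-v2 and persist the vectors in ChromaDB,
#    configured for cosine distance to match step 4.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma.from_texts(
    chunks,
    embeddings,
    persist_directory="chroma_db",
    collection_metadata={"hnsw:space": "cosine"},
)

# 3-4. Embed the question and retrieve the most similar chunks.
question = "Why did Della sell her hair?"
relevant_docs = store.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# 5. Ground Gemini's answer in the retrieved context.
model = genai.GenerativeModel("gemini-1.5-flash")
answer = model.generate_content(
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
).text
print(answer)
```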
