Build a system that answers questions on NLP topics using the Retrieval-Augmented Generation (RAG) approach, based on the Speech and Language Processing (Third Edition) book.
The NLP Q&A system is an AI-powered application that streamlines searching within the Speech and Language Processing (Third Edition) book. It enables students, teachers, and researchers to quickly and easily retrieve relevant information from the book.
The NLP Q&A system is an AI-powered application that simplifies and enhances information retrieval using a Retrieval-Augmented Generation (RAG) approach. The system processes the main chapters of the Speech and Language Processing (Third Edition) book by breaking them into manageable chunks, generating embeddings, and storing them in a vector database to enable efficient similarity-based search. It then uses the RAG pipeline to generate accurate answers to user queries.
We follow a systematic process that begins with extracting text from the documents. The extracted text is then cleaned and preprocessed to prepare it for chunking. These chunks are converted into embeddings, which are then stored in a vector database for efficient retrieval during the question-answering process.
In the extraction stage, we use the built-in PDF extractor from the LangChain library. After extracting the text, we apply a series of cleaning steps, including:
- Removing page headers
- Removing copyright notices
- Removing page numbers and repeated line breaks
- Removing figure numbers
- Normalizing whitespace
These steps ensure that the extracted text is clean and consistent before it is passed to the chunking stage; a minimal sketch of the extraction and cleaning steps is shown below.
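The following sketch assumes LangChain's PyPDFLoader as the built-in PDF extractor; the file name and the regular expressions are illustrative, not the project's verbatim code.

```python
import re
from langchain_community.document_loaders import PyPDFLoader

# Load the book PDF page by page (the file name is an assumption).
loader = PyPDFLoader("slp3.pdf")
pages = loader.load()

def clean_text(text: str) -> str:
    """Illustrative cleaning steps; the exact patterns used in the project may differ."""
    text = re.sub(r"Copyright.*", "", text)               # copyright notices
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.M)   # bare page numbers
    text = re.sub(r"Figure\s+\d+(\.\d+)*", "", text)      # figure numbers
    text = re.sub(r"\n{2,}", "\n", text)                  # repeated line breaks
    text = re.sub(r"[ \t]+", " ", text)                   # normalize whitespace
    return text.strip()

for page in pages:
    page.page_content = clean_text(page.page_content)
```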
In the chunking stage, we use the built-in token-based chunking function from the LangChain library. We experiment with different chunk sizes (100, 200, 300, and 384 tokens) to evaluate which configuration yields the best performance and accuracy.
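A sketch of this step using LangChain's TokenTextSplitter, continuing from the cleaned `pages` above; the overlap value is an assumption:

```python
from langchain_text_splitters import TokenTextSplitter

# Token-based splitting over the cleaned pages. chunk_size is one of the
# values we experimented with (100, 200, 300, 384); the overlap is an assumption.
splitter = TokenTextSplitter(chunk_size=384, chunk_overlap=32)
chunks = splitter.split_documents(pages)
print(f"Created {len(chunks)} chunks")
```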
In the embeddings stage, we use the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face to generate dense vector representations of the text chunks. These embeddings are later stored in a vector database for similarity-based retrieval.
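For example, the model can be loaded through LangChain's Hugging Face integration (a sketch; the exact wrapper used in the project may differ):

```python
from langchain_huggingface import HuggingFaceEmbeddings

# all-MiniLM-L6-v2 maps each text chunk to a 384-dimensional dense vector.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("What is a language model?")
print(len(vector))  # 384
```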
In the vectorstore stage, we use the FAISS vector database for efficient similarity search and clustering of dense embeddings. This allows the system to quickly retrieve the most relevant chunks based on the user’s query.
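A sketch of building and querying the FAISS index from the chunks and embeddings above; the save path and the value of k are assumptions:

```python
from langchain_community.vectorstores import FAISS

# Embed the chunks, build the FAISS index, and persist it locally.
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index")

# Retrieve the top-k most similar chunks for a user query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke("Explain the attention mechanism.")
```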
In the generation stage, we test the Mistral and Gemini models to identify the best-performing option, using the built-in RAG pipeline from the LangChain library to integrate the retrieval and generation components.
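As a sketch, the two models can be wired to the FAISS retriever with one of LangChain's built-in retrieval chains (RetrievalQA here); the model identifiers are assumptions, and the API keys are expected in the environment via the .env file described in the installation steps:

```python
from langchain.chains import RetrievalQA
from langchain_mistralai import ChatMistralAI
from langchain_google_genai import ChatGoogleGenerativeAI

# Model names are assumptions; MISTRAL_API_KEY and GOOGLE_API_KEY are read
# from the environment.
mistral_llm = ChatMistralAI(model="mistral-large-latest")
gemini_llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

# Wire an LLM and the FAISS retriever into a retrieval-QA chain.
rag_chain = RetrievalQA.from_chain_type(llm=mistral_llm, retriever=retriever)
answer = rag_chain.invoke({"query": "What is byte-pair encoding?"})
print(answer["result"])
```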
We evaluated the models by testing them with a set of questions. Our observations indicate that Mistral provides stronger and more accurate answers compared to Gemini. Additionally, when we adjusted the chunk size, we noticed that larger chunks allowed the models to access more context, resulting in better answers. However, even as the chunk size increased, the Mistral model consistently outperformed Gemini.
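A hypothetical comparison harness along these lines (the questions are illustrative, not our actual evaluation set):

```python
questions = [
    "What is the difference between stemming and lemmatization?",
    "How does beam search differ from greedy decoding?",
]

# Run the same questions through both models over the same retriever and
# compare the answers side by side.
for question in questions:
    for name, llm in [("Mistral", mistral_llm), ("Gemini", gemini_llm)]:
        chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
        result = chain.invoke({"query": question})
        print(f"[{name}] {question}\n{result['result']}\n")
```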
- Python
- LangChain
- Gradio
- Hugging Face embeddings model: sentence-transformers/all-MiniLM-L6-v2
- Clone the repository:

```bash
git clone https://github.com/amjadAwad95/nlp-rag-system.git
cd nlp-rag-system
```

- Create a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
```

- Create a .env file in the root folder:

```env
GOOGLE_API_KEY=<GOOGLE_API_KEY>
MISTRAL_API_KEY=<MISTRAL_API_KEY>
MONGO_URI=<MONGO_URI>
```

- Install the dependencies:

```bash
pip install -r requirements.txt
```

- Run the app:

To run the Gradio app:

```bash
python app.py
```

To run the FastAPI server:

```bash
uvicorn main:app --reload
```

or

```bash
docker-compose -f docker-compose.yml up --build -d
```

The NLP RAG system demonstrates how Retrieval-Augmented Generation (RAG) can enhance question answering by effectively combining a large language model (LLM) with relevant retrieved information.