This project showcases a straightforward approach to one of mankind's oldest challenges: "Finding the right answer in the right document for a given question." Powered by the Huggingface transformers
library, this repository addresses the challenge in two distinct phases:
- Identifying the appropriate document for a given question.
- Extracting an answer from that document.
The process is currently tailored for documentation in Markdown format but can easily be adapted to other formats based on user requirements. Please feel free to contact me with any questions or suggestions regarding this repository; suggestions are especially welcome!
The problem tackled by this repo is divided into two stages:
- Text Search: This feature enhances document search efficiency by using embeddings to represent text semantically within a high-dimensional space. Coupled with FAISS (Facebook AI Similarity Search) indexing, the system quickly identifies the documents most relevant to any query. Embeddings capture the core meaning of text, while FAISS enables fast, scalable retrieval by comparing these embeddings to find the closest matches. This combination ensures fast and precise search results even in large document databases; see the sketch below this list.
- Question Answering: This system leverages a pre-trained question answering (QA) model: upon receiving a user's question, the model scans the documents selected during the semantic search step and identifies the passages that potentially contain the answer. The process is powered by Huggingface's transformers library. Additionally, the project allows fine-tuning the model for better performance, subject to the construction of an appropriate training dataset for the task.
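A minimal sketch of the text search stage, assuming mean-pooled transformer embeddings and an exact (flat) L2 FAISS index; the model name and the `embed()` helper below are illustrative, not this repository's actual API:

```python
# Illustrative sketch of embedding-based search with FAISS; model name and
# helper are assumptions, not the repository's exact implementation.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Mean-pool the last hidden state to get one vector per text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

documents = [
    "SageMaker Projects help organizations set up developer environments.",
    "SageMaker pipelines are standalone entities like training jobs.",
]

# Build the index once, then query it with the embedded question.
doc_vectors = embed(documents)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

distances, ids = index.search(embed(["What is SageMaker?"]), 1)
print(documents[ids[0][0]])
```

A flat index performs exact nearest-neighbour search, which is usually fast enough for documentation-sized corpora; approximate indexes only pay off at much larger scales.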
The script `src/etl/data_processing.py` encompasses all the processes responsible for converting raw documentation files into a format suitable for analysis by pre-trained models. Specifically, this set of functions performs the following tasks:
- Transformation of raw files from Markdown to text files for model processing.
- Construction of a dataset with metadata about the documents.
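A rough sketch of both tasks, assuming the conversion simply writes each document's contents to a `.txt` file; the function name and metadata columns are assumptions for illustration, not the script's exact code:

```python
# Illustrative sketch: convert markdown files to text and build a metadata table.
from pathlib import Path

import pandas as pd

def markdown_to_text(md_dir: str, txt_dir: str) -> pd.DataFrame:
    """Convert every .md file to .txt and collect per-document metadata."""
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    records = []
    for md_file in sorted(Path(md_dir).glob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        txt_path = out / f"{md_file.stem}.txt"
        txt_path.write_text(text, encoding="utf-8")
        records.append({
            "title": md_file.stem,
            "source_path": str(md_file),
            "text_path": str(txt_path),
            "num_chars": len(text),
        })
    return pd.DataFrame(records)  # metadata table about the documents
```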
The process of fine-tuning a pre-trained model with new information requires that the training data adhere to a specific structure. The functions contained in `src/etl/data_training.py` process the input data and output both a `Dataset` object and a JSON file formatted to meet the requirements of various QA models. It is important to note that both of these functions expect input structured such that each question is paired with a corresponding context and an answer located within that context. Additionally, the creation of these context-question-answer triplets is typically labor-intensive. The expected input format is illustrated below:
```json
{
  "documents": [
    {
      "title": "sagemaker-projects",
      "contexts": [
        "SageMaker Projects help organizations set up and standardize developer environments for data scientists and CI/CD systems for MLOps engineers.",
        "No. SageMaker pipelines are standalone entities just like training jobs, processing jobs, and other SageMaker jobs. You can create, update, and run pipelines directly within a notebook by using the SageMaker Python SDK without using a SageMaker project."
      ],
      "questions_answers": [
        {
          "context_index": 0,
          "question": "What is SageMaker?",
          "answer": "help organizations set up and standardize developer environments for data scientists"
        },
        {
          "context_index": 1,
          "question": "Is it necessary to create a project in order to run a SageMaker pipeline?",
          "answer": "No. SageMaker pipelines are standalone entities just like training jobs, processing jobs, and other SageMaker jobs."
        }
      ]
    }
  ]
}
```
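As a hedged sketch of how `data_training.py` might consume this file, the triplets can be flattened into SQuAD-style rows in which each answer carries its character offset within the context; the function below is an assumption for illustration:

```python
# Flatten context/question/answer triplets into SQuAD-style rows; this is an
# illustrative sketch, not the exact logic in data_training.py.
import json

from datasets import Dataset

def build_training_dataset(contexts_json: str) -> Dataset:
    with open(contexts_json) as f:
        raw = json.load(f)
    rows = []
    for doc in raw["documents"]:
        for qa in doc["questions_answers"]:
            context = doc["contexts"][qa["context_index"]]
            start = context.find(qa["answer"])  # the answer must occur verbatim in the context
            rows.append({
                "title": doc["title"],
                "context": context,
                "question": qa["question"],
                "answers": {"text": [qa["answer"]], "answer_start": [start]},
            })
    return Dataset.from_list(rows)
```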
The DocumentAssistant class, defined in the `src/features/qa_class.py` module, is designed for document processing and question answering functionalities. This class utilizes the transformers library from Huggingface to encode text and generate embeddings that support semantic search over, and interaction with, textual data.
Key Features
- Tokenization and Modeling: Utilizes AutoTokenizer for text encoding and AutoModel for generating text embeddings.
- Document Processing and Search: Implements methods to convert texts into embeddings and search these embeddings to find relevant texts based on a given question.
- Question Answering: Capable of answering questions using either the base model or a fine-tuned model, depending on the use case requirements. This is enhanced through a pipeline that integrates text searching with question answering.
The primary functionality of this project is encapsulated in the script `src/features/qa_pipe.py`. Executing this script initiates the main function, which creates an instance of the DocumentAssistant class. This class is responsible for processing an input question and returning both the answer and the relevant documents associated with it. This setup ensures a seamless integration of text processing and retrieval functionalities to address user queries effectively.
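For illustration, the extractive QA step itself can be reproduced with Huggingface's question-answering pipeline; the checkpoint below is an assumed default, and how DocumentAssistant wires retrieval and QA together may differ:

```python
# Minimal, self-contained extractive QA example with transformers.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="Is it necessary to create a project in order to run a SageMaker pipeline?",
    context=(
        "No. SageMaker pipelines are standalone entities just like training "
        "jobs, processing jobs, and other SageMaker jobs."
    ),
)
print(result)  # dict with 'answer', 'score', 'start', and 'end'
```

The `score`, `start`, and `end` fields returned by the pipeline correspond to the values logged by the tracking step described below.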
Sample Execution
Upon execution of the question answering pipeline, the function `track_execution()`, part of `src/model_tracking/tracking.py`, will save the input question as well as the outputs associated with the model, namely the answer along with:
- Path to the relevant document
- Score
- Start index of the answer
- End index of the answer
💡 This could be beneficial for identifying areas where efforts should be focused when developing future training datasets.
Sample Log
```json
[
  {
    "question": "What are all AWS regions where SageMaker is available?",
    "answer": "East/West",
    "document_path": [
      "sagemaker-compliance.md"
    ],
    "score": 0.0060913050547242165,
    "start": 273,
    "end": 282
  }
]
```
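A sketch of what such a tracking function could look like, appending each record to the JSON log shown above; the signature is an assumption, not the exact one in `src/model_tracking/tracking.py`:

```python
# Append one QA result (question, answer, document path, score, answer span)
# to a JSON log file; illustrative only.
import json
import os

def track_execution(record: dict, tracking_json_path: str) -> None:
    log = []
    if os.path.exists(tracking_json_path):
        with open(tracking_json_path) as f:
            log = json.load(f)
    log.append(record)
    with open(tracking_json_path, "w") as f:
        json.dump(log, f, indent=2)
```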
The execution of the project requires declaring all the relevant paths and parameters in a YAML file located at `conf/local.yml`, as follows:
```yaml
# config.yaml
data_processing_params:
  documents_path: /data/01_raw/md_files                                  # Path to the markdown files
  save_csv: False                                                        # Save the csv file
  csvdf_path: /data/02_intermediate/tables/master_table.csv              # Path to save the csv file with metadata of the markdown files
  text_files_path: /data/02_intermediate/txt_files/                      # Path to save the converted markdown files as text files
  arrowdf_path: /data/02_intermediate/tables/text_df                     # Path to save the arrow file with the text data

training_params:
  context_qa: /data/01_raw/training_objects/training_contexts.json       # Path to the training contexts
  training_json: /data/02_intermediate/training_data/training_data.json  # Path to save the training data in the specified format
  training_dataset: /data/02_intermediate/training_dataset               # Path to save the training dataset

model_tracking_params:
  tracking_json_path: /data/04_model_output/tracking.json                # Path to save the tracking json file
```
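A minimal sketch of how such a config could be consumed, assuming plain `yaml.safe_load`; the actual loading code in this repository may differ:

```python
# Read the project configuration; key names follow the conf/local.yml above.
import yaml

with open("conf/local.yml") as f:
    config = yaml.safe_load(f)

documents_path = config["data_processing_params"]["documents_path"]
tracking_json_path = config["model_tracking_params"]["tracking_json_path"]
```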
```bash
git clone https://github.com/Germanifold91/loka_qa
cd loka_qa
```
📝 Note: As of 2024/04/25, FAISS must be installed through conda for this method to function appropriately.
To set up the project environment, follow these steps:
```bash
make create-env
```
This command will create a Conda environment with all the necessary dependencies installed.
To run the data processing script, execute the following command:
```bash
make data
```
To run the question answering pipeline with a question as input, you can use:
```bash
make process-question QUESTION="Your question here"
```
In order to generate the appropriate data for fine-tuning a model, run:
```bash
make data-training
```
Fine-tuning of the model can be performed through:
```bash
make tune-model
```