MIRIAD/quality_control at main · kennethSty/MIRIAD

Name	Name	Last commit message	Last commit date
parent directory ..
relevance	relevance
streamlit_app	streamlit_app
README.md	README.md
keyword_filter.py	keyword_filter.py
selected_passages.csv	selected_passages.csv

Name

Last commit message

Last commit date

selected_passages.csv

MIRIAD Dataset Quality Control

After data generation, we applied multiple rounds of quality control to ensure the high-quality of MIRIAD.

Step 6: Quality Control

Step 6.1: Keyword Filtering

quality_control/keyword_filter.py

Filters generated data at path data_generation/generated_data by removing QA with answers with the prefix "the passage" or the study to make the QA pairs self contained and not referencing studies

Step 6.2: Relevance Filtering

Step 6.2A: Generating labels using GPT-4

python quality_control/relevance/generate_labels.py

Files (including the train/test split) are stored in:

quality_control/relevance/labels.json
quality_control/relevance/train_qa_ids.json.json
quality_control/relevance/test_qa_ids.json.json

Step 6.2B: Finetuning `mistralai/Mistral-7B-Instruct-v0.2` as a relevance classifier

python quality_control/relevance/finetune.py

Step 6.2C: Filtering using the relevance classifier

python quality_control/relevance/relevance_inference.py
python quality_control/relevance/filter_relevance.py

Applies the relevance classifier to retain medically relevant QA pairs.

Step 6.3: Human Annotation App

Launch the Streamlit app for manual QA inspection and labeling:

streamlit run quality_control/streamlit_app/app.py

This allows human annotators to review generated QA pairs alongside their source passages and assign labels for Relevance, Factuality, and Groundedness in Passage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

MIRIAD Dataset Quality Control

Step 6: Quality Control

Step 6.1: Keyword Filtering

Filters generated data at path data_generation/generated_data by removing QA with answers with the prefix "the passage" or the study to make the QA pairs self contained and not referencing studies

Step 6.2: Relevance Filtering

Step 6.2A: Generating labels using GPT-4

Files (including the train/test split) are stored in:

Step 6.2B: Finetuning `mistralai/Mistral-7B-Instruct-v0.2` as a relevance classifier

Step 6.2C: Filtering using the relevance classifier

Applies the relevance classifier to retain medically relevant QA pairs.

Step 6.3: Human Annotation App

This allows human annotators to review generated QA pairs alongside their source passages and assign labels for Relevance, Factuality, and Groundedness in Passage.

FilesExpand file tree

quality_control

Directory actions

More options

Directory actions

More options

Latest commit

History

quality_control

Folders and files

parent directory

README.md

MIRIAD Dataset Quality Control

Step 6: Quality Control

Step 6.1: Keyword Filtering

Filters generated data at path data_generation/generated_data by removing QA with answers with the prefix "the passage" or the study to make the QA pairs self contained and not referencing studies

Step 6.2: Relevance Filtering

Step 6.2A: Generating labels using GPT-4

Files (including the train/test split) are stored in:

Step 6.2B: Finetuning mistralai/Mistral-7B-Instruct-v0.2 as a relevance classifier

Step 6.2C: Filtering using the relevance classifier

Applies the relevance classifier to retain medically relevant QA pairs.

Step 6.3: Human Annotation App

This allows human annotators to review generated QA pairs alongside their source passages and assign labels for Relevance, Factuality, and Groundedness in Passage.

Step 6.2B: Finetuning `mistralai/Mistral-7B-Instruct-v0.2` as a relevance classifier