After data generation, we applied multiple rounds of quality control to ensure the high quality of MIRIAD.
quality_control/keyword_filter.py
Filters the generated data at data_generation/generated_data by removing QA pairs whose answers start with the prefix "the passage" or "the study", so that the remaining QA pairs are self-contained and do not reference the source studies.
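The prefix filter described above can be illustrated with a minimal sketch. This is not the code in quality_control/keyword_filter.py. The record layout (dicts with "question" and "answer" keys), the function names, and the case-insensitive match are all assumptions made for illustration.

```python
from typing import Iterable

# Prefixes named in the description above; matching is case-insensitive
# here, which is an assumption rather than documented behavior.
BANNED_PREFIXES = ("the passage", "the study")


def is_self_contained(answer: str,
                      banned_prefixes: Iterable[str] = BANNED_PREFIXES) -> bool:
    """Return False if the answer starts with one of the banned prefixes."""
    normalized = answer.strip().lower()
    return not normalized.startswith(tuple(banned_prefixes))


def filter_qa_pairs(qa_pairs: list[dict]) -> list[dict]:
    """Keep only QA pairs whose answer does not start with a banned prefix.

    Each item is assumed (hypothetically) to be a dict with an "answer" key.
    """
    return [qa for qa in qa_pairs if is_self_contained(qa["answer"])]


if __name__ == "__main__":
    sample = [
        {"question": "What is the first-line treatment for X?",
         "answer": "The passage states that drug A is used."},
        {"question": "How does drug A act?",
         "answer": "Drug A inhibits enzyme B."},
    ]
    for qa in filter_qa_pairs(sample):
        print(qa["question"])
```

Running the sketch on the two sample records keeps only the second one, because the first answer refers back to "the passage" instead of standing on its own.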
python quality_control/relevance/generate_labels.py

quality_control/relevance/labels.json
quality_control/relevance/train_qa_ids.json.json
quality_control/relevance/test_qa_ids.json.json

python quality_control/relevance/finetune.py
python quality_control/relevance/relevance_inference.py
python quality_control/relevance/filter_relevance.py

Launch the Streamlit app for manual QA inspection and labeling:
streamlit run quality_control/streamlit_app/app.py