These are the steps we used to generate the MIRIAD dataset. Semantic Scholar has since changed its dataset download settings, so downloading the raw S2ORC dataset may require slightly different steps from those below. We encourage you to check Semantic Scholar for the latest API access method.

source preprocessing/download_full.sh

Files are stored in:
preprocessing/20200705v1/full/metadata/
preprocessing/20200705v1/full/pdf_parses/

source preprocessing/broad_filter.sh

This script filters the raw data from step 1 to retain papers in medical, biological, or adjacent disciplines that overlap with or contribute to biomedical research. Files are stored in:
preprocessing/20200705v1/selected/metadata/
preprocessing/20200705v1/selected/pdf_parses/

source preprocessing/filter_medicine.sh

Files are stored in:
preprocessing/papers

source preprocessing/create_passages.sh

Chunks full-text medical papers into smaller (1K-token) passages that can be used for question generation. Files are stored in:
preprocessing/passages

export OPENAI_API_KEY="sk-..."
source data_generation/launch_generate.sh

Launches parallel screen sessions to generate QA pairs using the model gpt-3.5-turbo-0125. Files are stored in:
data_generation/generated_data
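
For readers who want a feel for the passage-creation step (create_passages.sh) described above, here is a minimal sketch of fixed-size chunking. It is not the code from this repository: the function names `chunk_tokens` and `create_passages` are hypothetical, and whitespace splitting is assumed as a stand-in for whatever tokenizer the actual script uses, so real passage boundaries will differ.

```python
from typing import Iterator, List


def chunk_tokens(tokens: List[str], max_tokens: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping windows of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    for start in range(0, len(tokens), max_tokens):
        yield tokens[start:start + max_tokens]


def create_passages(text: str, max_tokens: int = 1000) -> List[str]:
    """Split one paper's full text into passages of roughly max_tokens tokens."""
    # Assumption: whitespace-separated words approximate tokens here.
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, max_tokens)]


if __name__ == "__main__":
    # A synthetic 2,500-word "paper" yields two full passages and one remainder.
    paper = " ".join(f"w{i}" for i in range(2500))
    passages = create_passages(paper)
    print([len(p.split()) for p in passages])  # prints [1000, 1000, 500]
```

Non-overlapping windows keep every token in exactly one passage, which is the simplest choice. Whether the actual pipeline overlaps passages or respects paragraph boundaries is not stated here.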