MIRIAD/data_generation at main · kennethSty/MIRIAD

Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
generate_dataset.py	generate_dataset.py
launch_generate.sh	launch_generate.sh

Name

Last commit message

Last commit date

MIRIAD Dataset Generation

Here we provide the steps we've used for MIRIAD dataset generation. We've noticed that since Semantic Scholar has modified their dataset download settings, downloading raw S2ORC dataset might take slightly different steps from the steps provided below. We encourage you to check out Semantic Scholar for the latest API access method.

Step 1: Download Raw S2ORC Data

source preprocessing/download_full.sh

Downloads metadata_.jsonl.gz and pdf_parses_.jsonl.gz from the S2ORC corpus. Files are stored in:

preprocessing/20200705v1/full/metadata/
preprocessing/20200705v1/full/pdf_parses/

Step 2: Broad Filtering by Field of Study

source preprocessing/broad_filter.sh

This script filters the raw data from step 1 to retain papers in medical, biological, or adjacent disciplines that overlap with or contribute to biomedical research. Files are stored in:

preprocessing/20200705v1/selected/metadata/
preprocessing/20200705v1/selected/pdf_parses/

Step 3: Filter Only Medical Documents

source preprocessing/filter_medicine.sh

Further narrows to papers strictly in "Medicine" category. Files are stored in:

preprocessing/papers

Step 4: Create Passages

source preprocessing/create_passages.sh

Chunks full-text medical papers into smaller (1K token) passages that can be used for question generation. Files are stored in:

preprocessing/passages

Step 5: Generate QA Pairs

export OPENAI_API_KEY="sk-..."
source data_generation/launch_generate.sh

Launches parallel screen sessions to generate QA pairs using the model `gpt-3.5-turbo-0125`. Files are stored in:

data_generation/generated_data

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

MIRIAD Dataset Generation

Step 1: Download Raw S2ORC Data

Downloads metadata_.jsonl.gz and pdf_parses_.jsonl.gz from the S2ORC corpus. Files are stored in:

Step 2: Broad Filtering by Field of Study

This script filters the raw data from step 1 to retain papers in medical, biological, or adjacent disciplines that overlap with or contribute to biomedical research. Files are stored in:

Step 3: Filter Only Medical Documents

Further narrows to papers strictly in "Medicine" category. Files are stored in:

Step 4: Create Passages

Chunks full-text medical papers into smaller (1K token) passages that can be used for question generation. Files are stored in:

Step 5: Generate QA Pairs

Launches parallel screen sessions to generate QA pairs using the model `gpt-3.5-turbo-0125`. Files are stored in:

FilesExpand file tree

data_generation

Directory actions

More options

Directory actions

More options

Latest commit

History

data_generation

Folders and files

parent directory

README.md

MIRIAD Dataset Generation

Step 1: Download Raw S2ORC Data

Downloads metadata_.jsonl.gz and pdf_parses_.jsonl.gz from the S2ORC corpus. Files are stored in:

Step 2: Broad Filtering by Field of Study

This script filters the raw data from step 1 to retain papers in medical, biological, or adjacent disciplines that overlap with or contribute to biomedical research. Files are stored in:

Step 3: Filter Only Medical Documents

Further narrows to papers strictly in "Medicine" category. Files are stored in:

Step 4: Create Passages

Chunks full-text medical papers into smaller (1K token) passages that can be used for question generation. Files are stored in:

Step 5: Generate QA Pairs

Launches parallel screen sessions to generate QA pairs using the model gpt-3.5-turbo-0125. Files are stored in:

Launches parallel screen sessions to generate QA pairs using the model `gpt-3.5-turbo-0125`. Files are stored in: