This repository contains code for our paper: Comparison of pipeline, sequence-to-sequence, and generative language models for end-to-end relation extraction: experiments with the rare disease use-case.
The full modified dataset is available at this link.
All experiments were run on Google Colab Pro+ using an A100 GPU.
Please follow the original seq2rel repo for installation and environment preparation guidelines here.
Alternatively, run:
pip install git+https://github.com/JohnGiorgi/seq2rel.git
We follow the same linearization schema as provided by the authors.
Datasets are tab-separated files where each example is contained on its own line. The first column contains the text, and the second column contains the relations. Relations themselves must be serialized to strings.
SCAN1 has been identified in a single Saudi Arabian family. It has not been identified in other ataxic individuals. The diagnosis of SCAN1 is made on history and clinical signs as listed above. DNA testing for mutations in TDP1 is only available on a research basis. SCAN1 @RAREDISEASE@ It @ANAPHOR@ @Anaphora@
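For illustration, here is a minimal Python sketch (not part of the repo) that writes one such tab-separated example; the text and relation string are taken from the example above.
# Minimal sketch: append one (text, linearized relations) pair as a tab-separated line.
text = (
    "SCAN1 has been identified in a single Saudi Arabian family. It has not been "
    "identified in other ataxic individuals. The diagnosis of SCAN1 is made on history "
    "and clinical signs as listed above. DNA testing for mutations in TDP1 is only "
    "available on a research basis."
)
relation = "SCAN1 @RAREDISEASE@ It @ANAPHOR@ @Anaphora@"

with open("train.txt", "a", encoding="utf-8") as f:
    f.write(f"{text}\t{relation}\n")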
Seq2rel/data_prep_REL.py will generate files in the format required by seq2rel. The pre-processed input files are available in the Seq2rel/preprocees_data folder.
We trained our model on Google Colab Pro+ using an A100 GPU.
Git clone John Giorgi's seq2rel GitHub repo to the desired location on your drive: Seq2rel repo
To train the model, use the allennlp train command with one of our configs (or write your own!).
For example, to train a model on Raredis, first preprocess the data as described in the previous step, or directly use the already pre-processed data from the Seq2rel/preprocees_data folder.
Then, call allennlp train with the Raredis config we have provided:
train_data_path="path/to/preprocessed/raredis/train.txt" \
valid_data_path="path/to/preprocessed/raredis/valid.txt" \
dataset_size=600 \
allennlp train "training_config/raredis.jsonnet" \
--serialization-dir "output" \
--include-package "seq2rel"
The best model checkpoint (selected by micro-F1 score on the validation set), vocabulary, configuration, and log files will be saved to --serialization-dir, which can be changed to any directory you like. You can also follow our model-training Google Colab file here: link
For overall and per-relation scores, run Seq2rel/eval_rel_type.py. Make sure you change the paths to the trained model and the gold test file.
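If you want to spot-check predictions before running the eval script, a hedged sketch along the lines of the upstream seq2rel README (assuming the Seq2Rel wrapper class; the archive path is a placeholder for whatever --serialization-dir produced) is:
# Sketch only: load the trained archive written by `allennlp train` and predict.
from seq2rel import Seq2Rel

model = Seq2Rel("output/model.tar.gz")  # placeholder path to the trained model archive
prediction = model("The diagnosis of SCAN1 is made on history and clinical signs.")
print(prediction)  # serialized relation string(s) in the linearization schema above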
All experiments were run on Google Colab Pro+ using an A100 GPU.
- First, run BioGPT/scripts/data_preparation/rawToJSON.py to convert the original files into JSON format. This script adds/removes the instruction in the input sequence and adds/removes the entity type in the target sequence.
- Run BioGPT/scripts/data_preparation/rel_is_preprocess.py to pre-process the JSON data into the rel-is input format. This will output .pmid, .x, and .y files for each split.
split.pmid: contains the document name
split.x: contains the input string
split.y: contains the target string
For example, given the original text:
the incidence and prevalence of tarsal tunnel syndrome is unknown. the disorder is believed to affect males and females in equal numbers.
Running rel_is_preprocess.py with the copy-instruction and entity-type options enabled will generate:
split.pmid that contains
Tarsal-Tunnel-Syndrome
split.x that contains
consider the abstract: $ the incidence and prevalence of tarsal tunnel syndrome is unknown. the disorder is believed to affect males and females in equal numbers. $ from the given abstract, find all the entities and relations among them. do not generate any token outside the abstract.
split.y that contains
the relationship between raredisease tarsal tunnel syndrome and anaphor "the disorder" is antecedent.
Sample pre-processed data can be found here.
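To make the file layout concrete, here is a small hypothetical Python sketch of how the three parallel files for one split could be written; the document name, abstract, instruction, and relation template come from the example above, while the variable names and file handling are purely illustrative (the actual rel_is_preprocess.py may differ).
# Hypothetical sketch: write parallel .pmid/.x/.y lines for one document of a split.
doc_id = "Tarsal-Tunnel-Syndrome"
abstract = ("the incidence and prevalence of tarsal tunnel syndrome is unknown. "
            "the disorder is believed to affect males and females in equal numbers.")
# (head type, head text, tail type, tail text, relation)
relations = [("raredisease", "tarsal tunnel syndrome", "anaphor", '"the disorder"', "antecedent")]

with open("train.pmid", "a") as f_pmid, open("train.x", "a") as f_x, open("train.y", "a") as f_y:
    f_pmid.write(doc_id + "\n")
    f_x.write("consider the abstract: $ " + abstract + " $ from the given abstract, "
              "find all the entities and relations among them. do not generate any "
              "token outside the abstract.\n")
    f_y.write(" ".join(
        f"the relationship between {ht} {h} and {tt} {t} is {r}."
        for ht, h, tt, t, r in relations
    ) + "\n")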
Git clone the BioGPT repo
!git clone https://github.com/microsoft/BioGPT.git
and then follow the original GitHub repo to install the libraries required to work with BioGPT here, or run the following cells.
!git clone https://github.com/pytorch/fairseq
import os
os.chdir("/content/fairseq")
!git checkout v0.12.0
!pip install .
!python setup.py build_ext --inplace
Moses
os.chdir("/content/BioGPT")
!git clone https://github.com/moses-smt/mosesdecoder.git
!export MOSES=${PWD}/mosesdecoder
FastBPE
!git clone https://github.com/glample/fastBPE.git
!export FASTBPE=${PWD}/fastBPE
os.chdir("fastBPE")
!g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
Sacremoses
!pip install sacremoses
!pip install tensorboardX
You can also refer to our Google Colab notebook for the installation steps here.
- The links to the pre-trained BioGPT and BioGPT-Large models are provided on the original GitHub repo here. We observed that the URL sometimes doesn't work; alternatively, you can use this link to download BioGPT medium (4GB) or this link to download BioGPT-Large (18GB) from our Google Drive and save it to your local machine or Google Drive.
os.chdir("/content/BioGPT/")
os.mkdir("checkpoints")
os.chdir("checkpoints")
!wget https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/Pre-trained-BioGPT.tgz
!tar -zxvf Pre-trained-BioGPT.tgz
If the above URL doesn't work (sometimes there is a public-access error), run the code below to copy the BioGPT model from your Google Drive to Google Colab:
os.chdir("/content/BioGPT/")
os.mkdir("checkpoints")
os.chdir("checkpoints")
os.mkdir("Pre-trained-BioGPT")
# copy the model checkpoint from google drive
%cp -av "/content/drive/MyDrive/BioGPT/pre_trained_model_med/checkpoint.pt" "/content/BioGPT/checkpoints/Pre-trained-BioGPT"
The model path hierarchy should look like this:
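/content/BioGPT/
└── checkpoints/
    └── Pre-trained-BioGPT/
        └── checkpoint.pt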
- Create a folder named "Raredis" under the data subfolder in the BioGPT path and paste the raw folder (BioGPT/data/raw) inside it; alternatively, you can choose different raw files from the pre-processed directory.
os.chdir("/content/BioGPT/data")
os.mkdir("Raredis")
# copy the files created by rel_is_preprocess.py (.pmid, .x, and .y files)
%cp -av "/content/drive/MyDrive/raw" "/content/BioGPT/data/Raredis/"
The file tree should look like this:

- Copy the RE-Raredis folder under the subfolder "examples" in the BioGPT path. This folder contains the bash files to pre-process, train, and run inference.
%cp -av "content/drive/mydrive/RE-Raredis" "/content/BioGPT/examples/"
The file structure should look like this:

- Run preprocess.sh
os.chdir("/content/BioGPT/examples/RE-Raredis")
!bash preprocess.sh
The above command will create an additional folder named "relis-bin" alongside the raw folder, as shown below:

- Run train.sh to begin training the model. This will create a folder named "RE-Raredis-BioGPT" under the checkpoints folder. You can change the configuration in the train.sh bash file.
!bash train.sh
- After training, run infer.sh. This script runs inference on test.txt and generates a .detok file.
!bash infer.sh

- Post-processing: After inference, run BioGPT/scripts/postprocess to convert the inference output into the desired JSON format.
- Evaluation: Run BioGPT/scripts/eval/eval_per_rel_type.py to get the overall and per-relation-type scores.
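For reference, the post-processing step essentially has to map the generated rel-is sentences back to relation tuples. A minimal hypothetical sketch of that parsing (assuming the template shown earlier; the real BioGPT/scripts/postprocess script may handle additional cases such as malformed generations) is:
import re

# Hypothetical sketch: recover (head type, head, tail type, tail, relation) tuples from
# generations that follow the rel-is template shown above.
TEMPLATE = re.compile(r"the relationship between (\w+) (.+?) and (\w+) (.+?) is (\w+)\.")

def parse_relations(generated: str):
    return TEMPLATE.findall(generated)

print(parse_relations(
    'the relationship between raredisease tarsal tunnel syndrome and anaphor "the disorder" is antecedent.'
))
# [('raredisease', 'tarsal tunnel syndrome', 'anaphor', '"the disorder"', 'antecedent')]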
We use Lambda Labs to train Stanford's BioMedLM on a single H100 80GB GPU.
We follow the data-preparation and model-training guidelines provided in the BioMedLM authors' GitHub repo for the NLG (seq2seq) task.
We use the same JSON files we created earlier using BioGPT/scripts/data_preparation/rawToJSON.py to build the data required for BioMedLM input.
Run BioMedLM/scripts/databuilder to build the files required to train BioMedLM. Note that this Python script is similar to BioGPT/scripts/data_preparation and generates the same files as the BioGPT pipeline, but with different extensions. It will generate split.pmid, split.source, and split.target for the train, dev, and test splits respectively, as described in the original GitHub repo.
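Because the .source/.target files are line-aligned in the same way as the BioGPT .x/.y files, a rough illustrative shortcut (our assumption, not the actual databuilder logic; split names and paths are placeholders) is simply:
import shutil

# Rough sketch: reuse the BioGPT-style line-aligned files under the extensions BioMedLM expects.
for split in ("train", "valid", "test"):  # adjust split names to match your files
    shutil.copyfile(f"raw/{split}.x", f"textgen/data/{split}.source")
    shutil.copyfile(f"raw/{split}.y", f"textgen/data/{split}.target")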
Git clone the repo
!git clone https://github.com/stanford-crfm/BioMedLM.git
After cloning the BioMedLM repo, copy the train_contol file and put it under the gpt2 folder.
Make sure the task dataset is in ./textgen/data. The dataset folder should have .source and .target files. The .source file should contain the original text in a one example per line format and the .target file should contain the desired output in a one example per line format. See example here.
Go to ./textgen/gpt2. To finetune, run:
python finetune_for_summarization.py --output_dir /home/ubuntu/BioMedLM/output_dir \
--model_name_or_path stanford-crfm/BioMedLM \
--tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--save_strategy steps \
--do_eval \
--train_data_file /home/ubuntu/BioMedLM/finetune/textgen/data/train.source \
--eval_data_file /home/ubuntu/BioMedLM/finetune/textgen/data/valid.source \
--max_source_length 510 \
--train_max_target_length 500 \
--save_total_limit 25 \
--overwrite_output_dir \
--gradient_accumulation_steps 16 \
--learning_rate 1e-5 \
--warmup_ratio 0.1 \
--weight_decay 0.01 \
--seed 7 \
--evaluation_strategy steps \
--eval_steps 50 \
--num_train_epochs 30 \
--logging_steps 50 \
--save_steps 50 \
--logging_first_step \
--load_best_model_at_end True \
--metric_for_best_model eval_loss \
--greater_is_better True \
--adam_beta2 0.98
Make sure you add the correct validation.source path in the run_generation_batch.py file.
After finetuning, run generation on validation set for each checkpoint saved by running:
python -u run_generation_batch.py --max_source_length -1 --length 510 --model_name_or_path=finetune_checkpoint_path --num_return_sequences 1 --stop_token [SEP] --tokenizer_name=finetune_checkpoint_path --task_mode=raredis --control_mode=no --tuning_mode finetune --gen_dir user/output_dir --batch_size 1 --temperature 1.0
We save each training checkpoint and select the best one based on the validation-set F1 score, using the eval script here. We stop when the validation F1 score has not improved for 5 consecutive checkpoints (patience = 5).
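The selection rule can be summarized with a short sketch; the checkpoint steps and F1 values below are placeholders, and the real scores come from running the eval script on each checkpoint's validation generations.
# Sketch of best-checkpoint selection with patience = 5 over validation F1 scores.
f1_scores = {50: 0.31, 100: 0.38, 150: 0.42, 200: 0.41, 250: 0.42,
             300: 0.40, 350: 0.41, 400: 0.39, 450: 0.40}  # placeholder values

patience = 5
best_step, best_f1, since_best = None, float("-inf"), 0
for step in sorted(f1_scores):
    if f1_scores[step] > best_f1:
        best_step, best_f1, since_best = step, f1_scores[step], 0
    else:
        since_best += 1
        if since_best >= patience:  # no improvement for 5 checkpoints -> stop
            break

print(f"best checkpoint: step {best_step} (validation F1 = {best_f1})")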
Once you have selected the best checkpoint based on validation F1 score, change the validation.source path in run_generation_batch.py to your test.source path and run the command below again with the best saved checkpoint:
python -u run_generation_batch.py --max_source_length -1 --length 510 --model_name_or_path=best_validation_checkpoint_path --num_return_sequences 1 --stop_token [SEP] --tokenizer_name=best_validation_checkpoint_path --task_mode=raredis --control_mode=no --tuning_mode finetune --gen_dir user/output_dir --batch_size 1 --temperature 1.0
Run BioMedLM/scripts/eval/test_eval/ to run evaluation on the predicted sequences.
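At its core, this evaluation compares the set of predicted relation tuples against the gold set. A minimal micro precision/recall/F1 sketch over such sets (illustrative only; the repo script is the reference implementation) is:
# Minimal sketch: micro precision/recall/F1 over per-document sets of relation tuples.
def micro_prf(pred_docs, gold_docs):
    tp = fp = fn = 0
    for pred, gold in zip(pred_docs, gold_docs):
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = micro_prf(
    [{("tarsal tunnel syndrome", "the disorder", "antecedent")}],  # predicted
    [{("tarsal tunnel syndrome", "the disorder", "antecedent")}],  # gold
)
print(p, r, f)  # 1.0 1.0 1.0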
For the pipeline approach, we use the truncated documents (up to 512 tokens, the BERT input limit) from link
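If you need to re-create this truncation yourself, a rough sketch with a Hugging Face tokenizer (bert-base-uncased is an assumption; the linked data is already truncated) is:
from transformers import AutoTokenizer

# Sketch: truncate a document to BERT's 512-token input limit (including special tokens).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def truncate_to_bert_limit(text: str, max_length: int = 512) -> str:
    ids = tokenizer.encode(text, truncation=True, max_length=max_length)
    # Note: decoding returns lowercased, detokenized text for an uncased tokenizer.
    return tokenizer.decode(ids, skip_special_tokens=True)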
Data for the copy/no-copy instruction variants of the natural-language and rel-is templates can be found here: link
The training script for both formats can be found here: link
Command:
python PATH_TO_TRAINING_PYTHON_FILE \
--output_dir PATH_TO_OUTPUT_DIR \
--model_name_or_path t5-3b \
--tokenizer_name t5-3b \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--save_strategy steps \
--do_eval \
--overwrite_output_dir \
--learning_rate 3e-4 \
--warmup_ratio 0.1 \
--weight_decay 1e-5 \
--seed 7 \
--evaluation_strategy steps \
--eval_steps 200 \
--num_train_epochs 100 \
--logging_steps 200 \
--save_steps 200 \
--logging_first_step \
--load_best_model_at_end True \
--metric_for_best_model eval_f1 \
--save_total_limit=1 \
--greater_is_better True \
--adam_beta2 0.98 \
--predict_with_generate True \
--generation_num_beams 4 \
--prediction_loss_only False \
--generation_max_length 1024
After training, you can run inference scripts from link depending on the template you chose.
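To sanity-check a fine-tuned checkpoint directly, a minimal generation sketch with the same decoding settings as the config above (beam size 4, maximum generation length 1024; the checkpoint path and input text are placeholders) is:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

checkpoint = "PATH_TO_OUTPUT_DIR"  # directory saved by the training script
# If the tokenizer was not saved alongside the checkpoint, load it from "t5-3b" instead.
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).eval()

text = "the incidence and prevalence of tarsal tunnel syndrome is unknown. ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs, num_beams=4, max_length=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))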
After inference, you can run evaluation scripts from link depending on the template you chose.
Please click here for more information about the paper, including predicate-specific analysis, error analysis, and model configurations.