added biobert
plkmo committed Jul 2, 2020
1 parent 47517ca commit 7e291ae
Showing 10 changed files with 159 additions and 26 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,5 +7,7 @@
*.pth
*.pyc
*.pkl
*.bin
/data/*
/additional_models/biobert_v1.1_pubmed/*
/__pycache__/*
27 changes: 19 additions & 8 deletions README.md
@@ -4,16 +4,23 @@
A PyTorch implementation of the models for the paper ["Matching the Blanks: Distributional Similarity for Relation Learning"](https://arxiv.org/pdf/1906.03158.pdf) published in ACL 2019.
Note: This is not an official repo for the paper.
Additional models for relation extraction, implemented here based on the paper's methodology:
- ALBERT (https://arxiv.org/abs/1909.11942)
- BioBERT (https://arxiv.org/abs/1901.08746)

## Requirements
Requirements: Python (3.6+), PyTorch (1.2.0+), Spacy (2.1.8+)

Pre-trained BERT models (ALBERT, BERT) courtesy of HuggingFace.co (https://huggingface.co)
Pre-trained BioBERT model courtesy of https://github.com/dmis-lab/biobert

To use BioBERT (biobert_v1.1_pubmed), download & unzip the [contents](https://drive.google.com/file/d/1zKTBqqrCGlclb3zgBGGpq_70Fx-qFpiU/view?usp=sharing) into the ./additional_models folder.
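After unzipping, you can quickly confirm the files are where the repo's code expects them. A minimal sketch (the file names are the ones referenced by the conversion script and loading code in this commit):

```python
# Check that the BioBERT config and vocab sit under ./additional_models/biobert_v1.1_pubmed/
from pathlib import Path

biobert_dir = Path("./additional_models/biobert_v1.1_pubmed")
for fname in ["bert_config.json", "vocab.txt"]:
    status = "found" if (biobert_dir / fname).is_file() else "MISSING"
    print(f"{fname}: {status}")
```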

## Training by matching the blanks (BERT<sub>EM</sub> + MTB)
Run main_pretraining.py with the arguments below. The pre-training data can be any continuous-text .txt file.
We use spaCy to extract pairwise entities (within a window of 40 tokens) from the text to form relation statements for pre-training. Entity recognition is based on NER and dependency-tree parsing of subjects/objects (see the sketch below).

The pre-training data that I've used, taken from the CNN dataset (cnn.txt), can be downloaded [here.](https://drive.google.com/file/d/1aMiIZXLpO7JF-z_Zte3uH7OCo4Uk_0do/view?usp=sharing)
Note, however, that the paper uses Wikipedia dump data for MTB pre-training, which is much larger than the CNN dataset.
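A minimal sketch of this pairwise-entity extraction (not the repo's exact pipeline, which additionally uses dependency parsing of subjects/objects; assumes the `en_core_web_sm` spaCy model is installed):

```python
# Pair named entities that fall within a 40-token window of each other,
# keeping the surrounding span as a candidate relation statement.
import spacy
from itertools import combinations

nlp = spacy.load("en_core_web_sm")  # assumed model; the repo may use a larger one

def relation_statements(text, window=40):
    doc = nlp(text)
    pairs = []
    for e1, e2 in combinations(doc.ents, 2):
        if abs(e1.start - e2.start) <= window:
            span = doc[min(e1.start, e2.start):max(e1.end, e2.end)]
            pairs.append((e1.text, e2.text, span.text))
    return pairs

print(relation_statements("Barack Obama was born in Hawaii and moved to Chicago in 1985."))
```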

Note: Pre-training can take a long time, depending on the available GPU. It is possible to fine-tune directly on the relation-extraction task and still get reasonable results; see the section below.

@@ -27,8 +34,10 @@ main_pretraining.py [-h]
[--fp16 FP_16]
[--num_epochs NUM_EPOCHS]
[--lr LR]
[--model_no MODEL_NO (0: BERT ; 1: ALBERT ; 2: BioBERT)]
[--model_size MODEL_SIZE (BERT: 'bert-base-uncased', 'bert-large-uncased';
ALBERT: 'albert-base-v2', 'albert-large-v2';
BioBERT: 'bert-base-uncased' (biobert_v1.1_pubmed))]
```

## Fine-tuning on SemEval2010 Task 8 (BERT<sub>EM</sub>/BERT<sub>EM</sub> + MTB)
@@ -46,8 +55,10 @@ main_task.py [-h]
[--fp16 FP_16]
[--num_epochs NUM_EPOCHS]
[--lr LR]
[--model_no MODEL_NO (0: BERT ; 1: ALBERT ; 2: BioBERT)]
[--model_size MODEL_SIZE (BERT: 'bert-base-uncased', 'bert-large-uncased';
ALBERT: 'albert-base-v2', 'albert-large-v2';
BioBERT: 'bert-base-uncased' (biobert_v1.1_pubmed))]
[--train TRAIN]
[--infer INFER]
```
5 changes: 5 additions & 0 deletions additional_models/convert_to_pytorch_biobert_v1.1_pubmed.sh
@@ -0,0 +1,5 @@
export BERT_BASE_DIR=biobert_v1.1_pubmed
transformers-cli convert --model_type bert \
--tf_checkpoint $BERT_BASE_DIR/model.ckpt-1000000 \
--config $BERT_BASE_DIR/bert_config.json \
--pytorch_dump_output $BERT_BASE_DIR/biobert_v1.1_pubmed.bin
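The script uses a relative BERT_BASE_DIR, so it is presumably meant to be run from inside ./additional_models, where the unzipped folder lives. Once the .bin file exists, a quick sanity check of the conversion (a sketch using the stock transformers/PyTorch APIs rather than this repo's modified BertModel; exact state-dict key names can differ between transformers versions, hence strict=False):

```python
# Load the converted BioBERT checkpoint into a stock BertForPreTraining and
# report any state-dict keys that did not line up.
import torch
from transformers import BertConfig, BertForPreTraining, BertTokenizer

base = "./additional_models/biobert_v1.1_pubmed"
config = BertConfig.from_json_file(f"{base}/bert_config.json")
model = BertForPreTraining(config)

state_dict = torch.load(f"{base}/biobert_v1.1_pubmed.bin", map_location="cpu")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

tokenizer = BertTokenizer(f"{base}/vocab.txt", do_lower_case=False)
print(tokenizer.tokenize("Aspirin inhibits cyclooxygenase."))
```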
23 changes: 20 additions & 3 deletions main_pretraining.py
@@ -32,12 +32,29 @@
parser.add_argument("--num_epochs", type=int, default=18, help="No of epochs")
parser.add_argument("--lr", type=float, default=0.0001, help="learning rate")
parser.add_argument("--model_no", type=int, default=0, help='''Model ID: 0 - BERT\n
1 - ALBERT\n
2 - BioBERT''')
parser.add_argument("--model_size", type=str, default='bert-base-uncased', help="For BERT: 'bert-base-uncased', \
'bert-large-uncased',\
For ALBERT: 'albert-base-v2',\
'albert-large-v2',\
For BioBERT: 'bert-base-uncased' (biobert_v1.1_pubmed)")

args = parser.parse_args()

output = train_and_fit(args)

'''
# For testing additional models
from src.model.BERT.modeling_bert import BertModel, BertConfig
from src.model.BERT.tokenization_bert import BertTokenizer as Tokenizer
config = BertConfig.from_pretrained('./additional_models/biobert_v1.1_pubmed/bert_config.json')
model = BertModel.from_pretrained(pretrained_model_name_or_path='./additional_models/biobert_v1.1_pubmed/biobert_v1.1_pubmed.bin',
config=config,
force_download=False, \
model_size='bert-base-uncased',
task='classification',\
n_classes_=12)
tokenizer = Tokenizer(vocab_file='./additional_models/biobert_v1.1_pubmed/vocab.txt',
do_lower_case=False)
'''
8 changes: 5 additions & 3 deletions main_task.py
@@ -11,7 +11,7 @@
from argparse import ArgumentParser

'''
This fine-tunes the BERT model on the SemEval and FewRel tasks
'''

logging.basicConfig(format='%(asctime)s [%(levelname)s]: %(message)s', \
@@ -34,11 +34,13 @@
parser.add_argument("--num_epochs", type=int, default=11, help="No of epochs")
parser.add_argument("--lr", type=float, default=0.00007, help="learning rate")
parser.add_argument("--model_no", type=int, default=0, help='''Model ID: 0 - BERT\n
1 - ALBERT\n
2 - BioBERT''')
parser.add_argument("--model_size", type=str, default='bert-base-uncased', help="For BERT: 'bert-base-uncased', \
'bert-large-uncased',\
For ALBERT: 'albert-base-v2',\
'albert-large-v2'\
For BioBERT: 'bert-base-uncased' (biobert_v1.1_pubmed)")
parser.add_argument("--train", type=int, default=1, help="0: Don't train, 1: train")
parser.add_argument("--infer", type=int, default=1, help="0: Don't infer, 1: Infer")

Expand Down
12 changes: 11 additions & 1 deletion src/preprocessing_funcs.py
@@ -186,13 +186,22 @@ def __init__(self, args, D, batch_size=None):
model = args.model_size #'albert-base-v2'
lower_case = False
model_name = 'ALBERT'
elif args.model_no == 2:
from .model.BERT.tokenization_bert import BertTokenizer as Tokenizer
model = 'bert-base-uncased'
lower_case = False
model_name = 'BioBERT'

tokenizer_path = './data/%s_tokenizer.pkl' % (model_name)
if os.path.isfile(tokenizer_path):
self.tokenizer = load_pickle('%s_tokenizer.pkl' % (model_name))
logger.info("Loaded tokenizer from saved path.")
else:
if args.model_no == 2:
self.tokenizer = Tokenizer(vocab_file='./additional_models/biobert_v1.1_pubmed/vocab.txt',
do_lower_case=False)
else:
self.tokenizer = Tokenizer.from_pretrained(model, do_lower_case=False)
self.tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])
save_as_pickle("%s_tokenizer.pkl" % (model_name), self.tokenizer)
logger.info("Saved %s tokenizer at ./data/%s_tokenizer.pkl" % (model_name, model_name))
@@ -289,6 +298,7 @@ def __getitem__(self, idx):
### get positive samples
r, e1, e2 = self.df.iloc[idx] # positive sample
pool = self.df[((self.df['e1'] == e1) & (self.df['e2'] == e2))].index
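# also treat statements where the two entities appear in swapped order as positives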
pool = pool.append(self.df[((self.df['e1'] == e2) & (self.df['e2'] == e1))].index)
pos_idxs = np.random.choice(pool, \
size=min(int(self.batch_size//2), len(pool)), replace=False)
### get negative samples
59 changes: 53 additions & 6 deletions src/tasks/infer.py
@@ -15,6 +15,7 @@
from itertools import permutations
from tqdm import tqdm
from .preprocessing_funcs import load_dataloaders
from ..misc import save_as_pickle

import logging

@@ -54,15 +55,30 @@ def __init__(self, args=None, detect_entities=False):
model = args.model_size #'bert-base-uncased'
lower_case = True
model_name = 'BERT'
self.net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size,\
task='classification', n_classes_=self.args.num_classes)
elif self.args.model_no == 1:
from ..model.ALBERT.modeling_albert import AlbertModel as Model
model = args.model_size #'albert-base-v2'
lower_case = False
model_name = 'ALBERT'

self.net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size,\
task='classification', n_classes_=self.args.num_classes)
elif args.model_no == 2: # BioBert
from ..model.BERT.modeling_bert import BertModel, BertConfig
model = 'bert-base-uncased'
lower_case = False
model_name = 'BioBERT'
config = BertConfig.from_pretrained('./additional_models/biobert_v1.1_pubmed/bert_config.json')
self.net = BertModel.from_pretrained(pretrained_model_name_or_path='./additional_models/biobert_v1.1_pubmed/biobert_v1.1_pubmed.bin',
config=config,
force_download=False, \
model_size='bert-base-uncased',
task='classification',\
n_classes_=self.args.num_classes)

self.tokenizer = load_pickle("%s_tokenizer.pkl" % model_name)
self.net.resize_token_embeddings(len(self.tokenizer))
if self.cuda:
@@ -215,19 +231,50 @@ def __init__(self, args=None):

if self.args.model_no == 0:
from ..model.BERT.modeling_bert import BertModel as Model
from ..model.BERT.tokenization_bert import BertTokenizer as Tokenizer
model = args.model_size #'bert-large-uncased' 'bert-base-uncased'
lower_case = True
model_name = 'BERT'
self.net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size,\
task='fewrel')
elif self.args.model_no == 1:
from ..model.ALBERT.modeling_albert import AlbertModel as Model
from ..model.ALBERT.tokenization_albert import AlbertTokenizer as Tokenizer
model = args.model_size #'albert-base-v2'
lower_case = False
model_name = 'ALBERT'
self.net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size,\
task='fewrel')
elif args.model_no == 2: # BioBert
from ..model.BERT.modeling_bert import BertModel, BertConfig
from ..model.BERT.tokenization_bert import BertTokenizer as Tokenizer
model = 'bert-base-uncased'
lower_case = False
model_name = 'BioBERT'
config = BertConfig.from_pretrained('./additional_models/biobert_v1.1_pubmed/bert_config.json')
self.net = BertModel.from_pretrained(pretrained_model_name_or_path='./additional_models/biobert_v1.1_pubmed/biobert_v1.1_pubmed.bin',
config=config,
force_download=False, \
model_size='bert-base-uncased',
task='fewrel')

if os.path.isfile('./data/%s_tokenizer.pkl' % model_name):
self.tokenizer = load_pickle("%s_tokenizer.pkl" % model_name)
logger.info("Loaded tokenizer from saved file.")
else:
logger.info("Saved tokenizer not found, initializing new tokenizer...")
if args.model_no == 2:
self.tokenizer = Tokenizer(vocab_file='./additional_models/biobert_v1.1_pubmed/vocab.txt',
do_lower_case=False)
else:
self.tokenizer = Tokenizer.from_pretrained(model, do_lower_case=False)
self.tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])
save_as_pickle("%s_tokenizer.pkl" % model_name, self.tokenizer)
logger.info("Saved %s tokenizer at ./data/%s_tokenizer.pkl" %(model_name, model_name))


self.net.resize_token_embeddings(len(self.tokenizer))
self.pad_id = self.tokenizer.pad_token_id

11 changes: 10 additions & 1 deletion src/tasks/preprocessing_funcs.py
@@ -298,13 +298,22 @@ def load_dataloaders(args):
model = args.model_size #'albert-base-v2'
lower_case = True
model_name = 'ALBERT'
elif args.model_no == 2:
from ..model.BERT.tokenization_bert import BertTokenizer as Tokenizer
model = 'bert-base-uncased'
lower_case = False
model_name = 'BioBERT'

if os.path.isfile("./data/%s_tokenizer.pkl" % model_name):
tokenizer = load_pickle("%s_tokenizer.pkl" % model_name)
logger.info("Loaded tokenizer from pre-trained blanks model")
else:
logger.info("Pre-trained blanks tokenizer not found, initializing new tokenizer...")
if args.model_no == 2:
tokenizer = Tokenizer(vocab_file='./additional_models/biobert_v1.1_pubmed/vocab.txt',
do_lower_case=False)
else:
tokenizer = Tokenizer.from_pretrained(model, do_lower_case=False)
tokenizer.add_tokens(['[E1]', '[/E1]', '[E2]', '[/E2]', '[BLANK]'])

save_as_pickle("%s_tokenizer.pkl" % model_name, tokenizer)
22 changes: 20 additions & 2 deletions src/tasks/trainer.py
@@ -38,16 +38,31 @@ def train_and_fit(args):
model = args.model_size #'bert-base-uncased'
lower_case = True
model_name = 'BERT'
net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size,
task='classification' if args.task != 'fewrel' else 'fewrel',\
n_classes_=args.num_classes)
elif args.model_no == 1:
from ..model.ALBERT.modeling_albert import AlbertModel as Model
model = args.model_size #'albert-base-v2'
lower_case = True
model_name = 'ALBERT'

net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size,
task='classification' if args.task != 'fewrel' else 'fewrel',\
n_classes_=args.num_classes)
elif args.model_no == 2: # BioBert
from ..model.BERT.modeling_bert import BertModel, BertConfig
model = 'bert-base-uncased'
lower_case = False
model_name = 'BioBERT'
config = BertConfig.from_pretrained('./additional_models/biobert_v1.1_pubmed/bert_config.json')
net = BertModel.from_pretrained(pretrained_model_name_or_path='./additional_models/biobert_v1.1_pubmed/biobert_v1.1_pubmed.bin',
config=config,
force_download=False, \
model_size='bert-base-uncased',
task='classification' if args.task != 'fewrel' else 'fewrel',\
n_classes_=args.num_classes)

tokenizer = load_pickle("%s_tokenizer.pkl" % model_name)
net.resize_token_embeddings(len(tokenizer))
@@ -66,6 +81,9 @@
unfrozen_layers = ["classifier", "pooler", "classification_layer",\
"blanks_linear", "lm_linear", "cls",\
"albert_layer_groups.0.albert_layers.0.ffn"]
elif args.model_no == 2:
unfrozen_layers = ["classifier", "pooler", "encoder.layer.11", \
"classification_layer", "blanks_linear", "lm_linear", "cls"]

for name, param in net.named_parameters():
if not any([layer in name for layer in unfrozen_layers]):
16 changes: 14 additions & 2 deletions src/trainer.py
@@ -39,14 +39,26 @@ def train_and_fit(args):
model = args.model_size #'bert-base-uncased'
lower_case = True
model_name = 'BERT'
net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size)
elif args.model_no == 1:
from .model.ALBERT.modeling_albert import AlbertModel as Model
model = args.model_size #'albert-base-v2'
lower_case = False
model_name = 'ALBERT'

net = Model.from_pretrained(model, force_download=False, \
model_size=args.model_size)
elif args.model_no == 2: # BioBert
from .model.BERT.modeling_bert import BertModel, BertConfig
model = 'bert-base-uncased'
lower_case = False
model_name = 'BioBERT'
config = BertConfig.from_pretrained('./additional_models/biobert_v1.1_pubmed/bert_config.json')
net = BertModel.from_pretrained(pretrained_model_name_or_path='./additional_models/biobert_v1.1_pubmed/biobert_v1.1_pubmed.bin',
config=config,
force_download=False, \
model_size='bert-base-uncased')

tokenizer = load_pickle("%s_tokenizer.pkl" % model_name)
net.resize_token_embeddings(len(tokenizer))
e1_id = tokenizer.convert_tokens_to_ids('[E1]')
