
Commit

updated biobert links, python3.8+ req
plkmo committed Sep 24, 2023
1 parent 79f44fe commit 65b7944
Showing 3 changed files with 53 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@
/data/*
/additional_models/biobert_v1.1_pubmed/*
/__pycache__/*
venv
.DS_Store
11 changes: 9 additions & 2 deletions README.md
@@ -9,19 +9,26 @@ Additional models for relation extraction, implemented here based on the paper's

For more conceptual details on the implementation, please see https://towardsdatascience.com/bert-s-for-relation-extraction-in-nlp-2c7c3ab487c4

If you like my work, please consider sponsoring by clicking the sponsor button at the top.

## Requirements
Requirements: Python (3.6+), PyTorch (1.2.0+), Spacy (2.1.8+)
Requirements: Python (3.8+)
```bash
python3 -m pip install -r requirements.txt
python3 -m spacy download en_core_web_lg
```
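
A quick way to confirm the pinned spaCy model installed correctly (a minimal sanity-check sketch, not part of the repo):

```python
# Minimal sanity check (not part of the repo): confirm en_core_web_lg loads and NER works,
# since the pre-training pipeline relies on spaCy entities.
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("BERT was developed by Google in 2018.")
print([(ent.text, ent.label_) for ent in doc.ents])
```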

Pre-trained BERT models (ALBERT, BERT) courtesy of HuggingFace.co (https://huggingface.co)
Pre-trained BioBERT model courtesy of https://github.com/dmis-lab/biobert

To use BioBERT(biobert_v1.1_pubmed), download & unzip the [contents](https://drive.google.com/file/d/1zKTBqqrCGlclb3zgBGGpq_70Fx-qFpiU/view?usp=sharing) to ./additional_models folder.
To use BioBERT (biobert_v1.1_pubmed), download & unzip the model from [here](https://github.com/dmis-lab/biobert) into the ./additional_models folder.
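
A minimal sketch for unpacking the downloaded archive, assuming it was saved locally as biobert_v1.1_pubmed.tar.gz (the actual file name depends on the release you pick):

```python
# Sketch only: the archive file name is an assumption -- use whatever the BioBERT release provides.
import tarfile
from pathlib import Path

archive = Path("biobert_v1.1_pubmed.tar.gz")   # hypothetical local file name
target = Path("additional_models")
target.mkdir(exist_ok=True)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(target)   # should yield ./additional_models/biobert_v1.1_pubmed/
```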

## Training by matching the blanks (BERT<sub>EM</sub> + MTB)
Run main_pretraining.py with the arguments below. The pre-training data can be any continuous-text .txt file.
We use Spacy NLP to grab pairwise entities (within a window of 40 tokens) from the text to form relation statements for pre-training. Entity recognition is based on NER and dependency-tree parsing of subjects/objects.
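
To illustrate the windowing idea, here is a minimal sketch (not the repository's actual preprocessing code; it covers only the NER part, not the dependency-tree filtering, and the relation_statements helper name is illustrative):

```python
# Illustrative sketch of forming relation statements from entity pairs inside a token window.
# Not the repo's actual preprocessing -- NER-based pairing only, no dependency-tree filtering.
from itertools import combinations

import spacy

nlp = spacy.load("en_core_web_lg")
WINDOW = 40  # window size (in tokens) mentioned above

def relation_statements(text):
    doc = nlp(text)
    statements = []
    for e1, e2 in combinations(doc.ents, 2):      # entity pairs in order of appearance
        if e2.end - e1.start <= WINDOW:           # keep pairs whose span fits in the window
            statements.append((e1.text, e2.text, doc[e1.start:e2.end].text))
    return statements

print(relation_statements("Barack Obama was born in Hawaii and later moved to Chicago."))
```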

The CNN-dataset pre-training data (cnn.txt) that I've used can be downloaded [here.](https://drive.google.com/file/d/1aMiIZXLpO7JF-z_Zte3uH7OCo4Uk_0do/view?usp=sharing)
Download and save it as ./data/cnn.txt.
Note, however, that the paper uses Wikipedia dump data for MTB pre-training, which is much larger than the CNN dataset.

Note: Pre-training can take a long time, depending on the available GPU. It is possible to fine-tune directly on the relation-extraction task and still get reasonable results by following the section below.
42 changes: 42 additions & 0 deletions requirements.txt
@@ -0,0 +1,42 @@
blis==0.4.1
boto3==1.28.53
botocore==1.31.53
certifi==2023.7.22
charset-normalizer==3.2.0
contourpy==1.1.1
cycler==0.11.0
cymem==2.0.8
en-core-web-lg==2.2.5
fonttools==4.42.1
idna==3.4
importlib-resources==6.1.0
jmespath==1.0.1
joblib==1.3.2
kiwisolver==1.4.5
matplotlib==3.7.3
murmurhash==1.0.10
numpy==1.24.4
packaging==23.1
pandas==2.0.3
Pillow==10.0.1
plac==1.1.3
preshed==3.0.9
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
requests==2.31.0
s3transfer==0.6.2
scikit-learn==1.3.1
scipy==1.10.1
seqeval==1.2.2
six==1.16.0
spacy==2.2.2
srsly==1.0.7
thinc==7.3.1
threadpoolctl==3.2.0
torch==1.4.0
tqdm==4.66.1
tzdata==2023.3
urllib3==1.26.16
wasabi==0.10.1
zipp==3.17.0
