
Commit

updated biobert links, python3.8+ req
plkmo committed Sep 24, 2023
1 parent 79f44fe commit 65b7944
Showing 3 changed files with 53 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -11,3 +11,5 @@
/data/*
/additional_models/biobert_v1.1_pubmed/*
/__pycache__/*
venv
.DS_Store
11 changes: 9 additions & 2 deletions README.md
@@ -9,19 +9,26 @@ Additional models for relation extraction, implemented here based on the paper's

For more conceptual details on the implementation, please see https://towardsdatascience.com/bert-s-for-relation-extraction-in-nlp-2c7c3ab487c4

If you like my work, please consider sponsoring by clicking the sponsor button at the top.

## Requirements
Requirements: Python (3.6+), PyTorch (1.2.0+), Spacy (2.1.8+)
Requirements: Python (3.8+)
```bash
python3 -m pip install -r requirements.txt
python3 -m spacy download en_core_web_lg
```
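
A quick way to confirm the pinned spaCy model installed correctly (a minimal sanity-check sketch, not part of the repo):

```python
# Minimal sanity check (not part of the repo): confirm en_core_web_lg loads and NER works,
# since the pre-training pipeline relies on spaCy entities.
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("BERT was developed by Google in 2018.")
print([(ent.text, ent.label_) for ent in doc.ents])
```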

Pre-trained BERT models (ALBERT, BERT) courtesy of HuggingFace.co (https://huggingface.co)
Pre-trained BioBERT model courtesy of https://github.com/dmis-lab/biobert

To use BioBERT(biobert_v1.1_pubmed), download & unzip the [contents](https://drive.google.com/file/d/1zKTBqqrCGlclb3zgBGGpq_70Fx-qFpiU/view?usp=sharing) to ./additional_models folder.
To use BioBERT (biobert_v1.1_pubmed), download & unzip the model from [here](https://github.com/dmis-lab/biobert) into the ./additional_models folder.
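
A minimal sketch for unpacking the downloaded archive, assuming it was saved locally as biobert_v1.1_pubmed.tar.gz (the actual file name depends on the release you pick):

```python
# Sketch only: the archive file name is an assumption -- use whatever the BioBERT release provides.
import tarfile
from pathlib import Path

archive = Path("biobert_v1.1_pubmed.tar.gz")   # hypothetical local file name
target = Path("additional_models")
target.mkdir(exist_ok=True)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(target)   # should yield ./additional_models/biobert_v1.1_pubmed/
```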

## Training by matching the blanks (BERT<sub>EM</sub> + MTB)
Run main_pretraining.py with the arguments below. The pre-training data can be any continuous-text .txt file.
We use Spacy NLP to grab pairwise entities (within a window of 40 tokens) from the text to form relation statements for pre-training. Entity recognition is based on NER and dependency-tree parsing of subjects/objects.
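
To illustrate the windowing idea, here is a minimal sketch (not the repository's actual preprocessing code; it covers only the NER part, not the dependency-tree filtering, and the relation_statements helper name is illustrative):

```python
# Illustrative sketch of forming relation statements from entity pairs inside a token window.
# Not the repo's actual preprocessing -- NER-based pairing only, no dependency-tree filtering.
from itertools import combinations

import spacy

nlp = spacy.load("en_core_web_lg")
WINDOW = 40  # window size (in tokens) mentioned above

def relation_statements(text):
    doc = nlp(text)
    statements = []
    for e1, e2 in combinations(doc.ents, 2):      # entity pairs in order of appearance
        if e2.end - e1.start <= WINDOW:           # keep pairs whose span fits in the window
            statements.append((e1.text, e2.text, doc[e1.start:e2.end].text))
    return statements

print(relation_statements("Barack Obama was born in Hawaii and later moved to Chicago."))
```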

The CNN-dataset pre-training data (cnn.txt) that I've used can be downloaded [here.](https://drive.google.com/file/d/1aMiIZXLpO7JF-z_Zte3uH7OCo4Uk_0do/view?usp=sharing)
Download and save it as ./data/cnn.txt.
Note, however, that the paper uses Wikipedia dump data for MTB pre-training, which is much larger than the CNN dataset.

Note: Pre-training can take a long time, depending on the available GPU. It is possible to fine-tune directly on the relation-extraction task and still get reasonable results by following the section below.
42 changes: 42 additions & 0 deletions requirements.txt
@@ -0,0 +1,42 @@
blis==0.4.1
boto3==1.28.53
botocore==1.31.53
certifi==2023.7.22
charset-normalizer==3.2.0
contourpy==1.1.1
cycler==0.11.0
cymem==2.0.8
en-core-web-lg==2.2.5
fonttools==4.42.1
idna==3.4
importlib-resources==6.1.0
jmespath==1.0.1
joblib==1.3.2
kiwisolver==1.4.5
matplotlib==3.7.3
murmurhash==1.0.10
numpy==1.24.4
packaging==23.1
pandas==2.0.3
Pillow==10.0.1
plac==1.1.3
preshed==3.0.9
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3.post1
requests==2.31.0
s3transfer==0.6.2
scikit-learn==1.3.1
scipy==1.10.1
seqeval==1.2.2
six==1.16.0
spacy==2.2.2
srsly==1.0.7
thinc==7.3.1
threadpoolctl==3.2.0
torch==1.4.0
tqdm==4.66.1
tzdata==2023.3
urllib3==1.26.16
wasabi==0.10.1
zipp==3.17.0
