diff --git a/.gitignore b/.gitignore
index ca3792c..6d39ef0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,5 @@
 /data/*
 /additional_models/biobert_v1.1_pubmed/*
 /__pycache__/*
+venv
+.DS_Store
\ No newline at end of file
diff --git a/README.md b/README.md
index 73b3d2b..cac6825 100644
--- a/README.md
+++ b/README.md
@@ -9,19 +9,26 @@ Additional models for relation extraction, implemented here based on the paper's
 
 For more conceptual details on the implementation, please see https://towardsdatascience.com/bert-s-for-relation-extraction-in-nlp-2c7c3ab487c4
 
+If you like my work, please consider sponsoring by clicking the sponsor button at the top.
+
 ## Requirements
-Requirements: Python (3.6+), PyTorch (1.2.0+), Spacy (2.1.8+)
+Requirements: Python (3.8+)
+```bash
+python3 -m pip install -r requirements.txt
+python3 -m spacy download en_core_web_lg
+```
 
 Pre-trained BERT models (ALBERT, BERT) courtesy of HuggingFace.co (https://huggingface.co)
 Pre-trained BioBERT model courtesy of https://github.com/dmis-lab/biobert
 
-To use BioBERT(biobert_v1.1_pubmed), download & unzip the [contents](https://drive.google.com/file/d/1zKTBqqrCGlclb3zgBGGpq_70Fx-qFpiU/view?usp=sharing) to ./additional_models folder.
+To use BioBERT (biobert_v1.1_pubmed), download & unzip the model from [here](https://github.com/dmis-lab/biobert) into the ./additional_models folder.
 
 ## Training by matching the blanks (BERTEM + MTB)
 Run main_pretraining.py with arguments below. Pre-training data can be any .txt continuous text file.
 We use Spacy NLP to grab pairwise entities (within a window size of 40 tokens length) from the text to form relation statements for pre-training. Entities recognition are based on NER and dependency tree parsing of objects/subjects.
 
 The pre-training data taken from CNN dataset (cnn.txt) that I've used can be downloaded [here.](https://drive.google.com/file/d/1aMiIZXLpO7JF-z_Zte3uH7OCo4Uk_0do/view?usp=sharing)
+Download it and save it as ./data/cnn.txt
 However, do note that the paper uses wiki dumps data for MTB pre-training which is much larger than the CNN dataset.
 
 Note: Pre-training can take a long time, depending on available GPU. It is possible to directly fine-tune on the relation-extraction task and still get reasonable results, following the section below.
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..8037e5f
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,42 @@
+blis==0.4.1
+boto3==1.28.53
+botocore==1.31.53
+certifi==2023.7.22
+charset-normalizer==3.2.0
+contourpy==1.1.1
+cycler==0.11.0
+cymem==2.0.8
+en-core-web-lg==2.2.5
+fonttools==4.42.1
+idna==3.4
+importlib-resources==6.1.0
+jmespath==1.0.1
+joblib==1.3.2
+kiwisolver==1.4.5
+matplotlib==3.7.3
+murmurhash==1.0.10
+numpy==1.24.4
+packaging==23.1
+pandas==2.0.3
+Pillow==10.0.1
+plac==1.1.3
+preshed==3.0.9
+pyparsing==3.1.1
+python-dateutil==2.8.2
+pytz==2023.3.post1
+requests==2.31.0
+s3transfer==0.6.2
+scikit-learn==1.3.1
+scipy==1.10.1
+seqeval==1.2.2
+six==1.16.0
+spacy==2.2.2
+srsly==1.0.7
+thinc==7.3.1
+threadpoolctl==3.2.0
+torch==1.4.0
+tqdm==4.66.1
+tzdata==2023.3
+urllib3==1.26.16
+wasabi==0.10.1
+zipp==3.17.0
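
A minimal sketch of the pairwise-entity extraction the README hunk above describes (hypothetical, not this repo's code), assuming the spacy==2.2.2 and en-core-web-lg pins from requirements.txt; the function name `pairwise_entities` is illustrative only:

```python
# Hypothetical illustration, not this repo's implementation:
# pair NER entities that fall within a 40-token window, as described
# in the README's MTB pre-training section.
from itertools import combinations

import spacy

# Model installed via `python3 -m spacy download en_core_web_lg`
nlp = spacy.load("en_core_web_lg")

def pairwise_entities(text, window=40):
    """Yield (head, tail, span_text) for entity pairs within `window` tokens."""
    doc = nlp(text)
    for e1, e2 in combinations(doc.ents, 2):  # doc.ents is in document order
        if e2.end - e1.start <= window:       # both entities fit in the window
            yield e1.text, e2.text, doc[e1.start:e2.end].text

for head, tail, span in pairwise_entities("Barack Obama met Angela Merkel in Berlin in 2013."):
    print(head, "|", tail, "|", span)
```

Per the README, the actual pre-training code additionally selects entities via dependency-tree parsing of subjects/objects, which this sketch omits.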