New multi-label synthetic train data biomedBERT #81

Merged: 7 commits, Dec 17, 2024
2 changes: 2 additions & 0 deletions docs/_changelog.d/multilabel_bert.feature.rst
@@ -0,0 +1,2 @@
Release a new multilabel biomedBERT model trained on synthetic NER data generated by an LLM (Gemini). The model was trained on over 7000 LLM-annotated documents with a total of 295822 samples.
The model was trained for 21 epochs and achieved an F1 score of 95.6% on a held-out test set.
1 change: 1 addition & 0 deletions docs/_changelog.d/pytorch_memory_issue.bugfix.rst
@@ -0,0 +1 @@
Fix issue with TransformersModelForTokenClassificationNerStep when processing large numbers of documents. The fix offloads tensors onto the CPU before performing the torch.cat operation, which previously led to a zero tensor.
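
A minimal sketch of the pattern described in this fix (illustrative only, not the actual KAZU code): per-batch output tensors are moved to the CPU before concatenation, rather than concatenating on the accelerator device.

.. code-block:: python

    # Illustrative only: move per-batch tensors to CPU before torch.cat,
    # so the concatenated result is materialised in CPU memory.
    import torch

    # stand-ins for per-batch model outputs accumulated over many documents
    batch_logits = [torch.randn(4, 128, 19) for _ in range(3)]
    all_logits = torch.cat([t.detach().cpu() for t in batch_logits], dim=0)
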
94 changes: 93 additions & 1 deletion docs/training_multilabel_ner.rst
@@ -1,4 +1,4 @@
Build an amazing NER model from LLM annotated data!
====================================================

Intro
@@ -41,3 +41,95 @@ then run the script with
multilabel_ner_training.label_studio_manager.headers.Authorisation="Token <your ls token>"

More options are available via :class:`kazu.training.config.TrainingConfig`\.
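
If you want to inspect the full set of training options programmatically, one possibility (a sketch, assuming ``TrainingConfig`` is a plain dataclass, as is typical for Hydra-style structured configs; consult the API docs if it is not) is:

.. code-block:: python

    # Illustrative sketch: list the configurable fields of TrainingConfig.
    import dataclasses

    from kazu.training.config import TrainingConfig

    for field in dataclasses.fields(TrainingConfig):
        print(field.name, field.type)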

Note: if running the script on MPS, you will need to add the following to the top of the file:

.. code-block:: python

import os

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

as one of the functions used in the Transformer NER step is not supported on MPS.

Our results with this approach
-------------------------------

We trained the `microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext` model (416 MB) on this task using over 7000 full-text KAZU documents, which
yielded 295822 training samples in total. The model was trained for 21 epochs, using early stopping on a held-out validation set.

The model was evaluated on a held-out test set of 365 KAZU documents and achieved a mean F1 score of 95.6% across all the classes. All the documents
in the train/validation/test sets were annotated with an LLM (gemini-1.5-flash-002), so the results should be taken with a pinch of salt.

The detailed metrics per class are shown below:

.. code-block:: json

{
"gene or gene product_precision": 0.9563058589870904,
"gene or gene product_recall": 0.9534653465346534,
"gene or gene product_support": 9090,
"method_precision": 0.9581589958158996,
"method_recall": 0.9631112237142133,
"method_support": 39470,
"protein domain or region_precision": 0.9587628865979382,
"protein domain or region_recall": 0.9765287214329833,
"protein domain or region_support": 1619,
"biological process_precision": 0.9621024062489657,
"biological process_recall": 0.9553043249638491,
"biological process_support": 60856,
"measurement_precision": 0.9576365663322185,
"measurement_recall": 0.954338406843684,
"measurement_support": 36004,
"cell type_precision": 0.9479236812570145,
"cell type_recall": 0.9873743277998597,
"cell type_support": 4277,
"chemical_precision": 0.9438978994948152,
"chemical_recall": 0.9814763616256567,
"chemical_support": 3617,
"species_precision": 0.9475158012641012,
"species_recall": 0.9615166030689292,
"species_support": 12317,
"cellular component_precision": 0.9379452999310504,
"cellular component_recall": 0.9702805515929624,
"cellular component_support": 4206,
"diagnostic_precision": 0.8901098901098901,
"diagnostic_recall": 0.9585798816568047,
"diagnostic_support": 338,
"disease, disorder, phenotype or trait_precision": 0.9441439004598323,
"disease, disorder, phenotype or trait_recall": 0.9495375408052231,
"disease, disorder, phenotype or trait_support": 7352,
"drug_precision": 0.9435426958362738,
"drug_recall": 0.9681390296886314,
"drug_support": 1381,
"treatment_precision": 0.9329966983880366,
"treatment_recall": 0.9550695825049702,
"treatment_support": 5030,
"instrument_precision": 0.9301778242677824,
"instrument_recall": 0.9766611751784734,
"instrument_support": 3642,
"organization_precision": 0.9359301055697125,
"organization_recall": 0.9694570135746606,
"organization_support": 2652,
"mutation_precision": 0.9478108581436077,
"mutation_recall": 0.9815016322089227,
"mutation_support": 2757,
"anatomical part or tissue_precision": 0.9636795933426252,
"anatomical part or tissue_recall": 0.9730132450331126,
"anatomical part or tissue_support": 12080,
"place_precision": 0.952116935483871,
"place_recall": 0.9799412069859934,
"place_support": 5783,
"mean_f1": 0.9560492521601017
}
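
For reference, a minimal sketch of how the per-class F1 scores could be recomputed from the precision/recall figures above and averaged into a single number (assuming an unweighted mean over classes; the exact aggregation used to produce ``mean_f1`` is defined by the training code, not shown here):

.. code-block:: python

    # Illustrative only: recompute per-class F1 from precision/recall and average.
    metrics = {
        "gene or gene product_precision": 0.9563058589870904,
        "gene or gene product_recall": 0.9534653465346534,
        # ... remaining classes as in the JSON above
    }

    classes = sorted({k.rsplit("_", 1)[0] for k in metrics if k.endswith("_precision")})
    f1_scores = {
        cls: 2 * metrics[f"{cls}_precision"] * metrics[f"{cls}_recall"]
        / (metrics[f"{cls}_precision"] + metrics[f"{cls}_recall"])
        for cls in classes
    }
    mean_f1 = sum(f1_scores.values()) / len(f1_scores)
    print(f"mean F1: {mean_f1:.4f}")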

Experiments were also performed with DistilBERT (268 MB) and TinyBERT (60 MB) models for comparison, which achieved mean F1 scores of 93.1% and 77.9%
respectively.

Future work
--------------

For future work we need to investigate the quality of the LLM-annotated data further, perhaps obtaining human corrections at least on the test set,
to ensure that we have a good understanding of its performance. The trained model is quite large in comparison to the previous TinyBern model (56 MB),
so we should also investigate knowledge distillation or other techniques to reduce the model size whilst keeping most of the
performance.
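
As a rough illustration of the kind of distillation that could be explored (a sketch only, not part of this PR), a multilabel token-classification student could be trained against the larger model's soft targets, assuming one sigmoid per label:

.. code-block:: python

    # Illustrative only: a simple multilabel distillation loss, where the
    # teacher's per-label sigmoid probabilities act as soft targets.
    import torch
    import torch.nn.functional as F


    def multilabel_distillation_loss(
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        temperature: float = 2.0,
    ) -> torch.Tensor:
        teacher_probs = torch.sigmoid(teacher_logits / temperature)
        return F.binary_cross_entropy_with_logits(
            student_logits / temperature, teacher_probs
        )
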
@@ -0,0 +1,44 @@
_target_: kazu.steps.ner.hf_token_classification.TransformersModelForTokenClassificationNerStep
path: ${oc.env:KAZU_MODEL_PACK}/multilabel_biomedBERT
batch_size: 2
stride: 64
max_sequence_length: 512
keys_to_use: #distilbert for token classification doesn't use token_type_ids
- input_ids
- attention_mask
- token_type_ids
entity_splitter:
_target_: kazu.steps.ner.entity_post_processing.NonContiguousEntitySplitter
entity_conditions:
gene:
- _target_: kazu.steps.ner.entity_post_processing.SplitOnNumericalListPatternWithPrefix
- _target_: kazu.steps.ner.entity_post_processing.SplitOnConjunctionPattern
path: ${SciSpacyPipeline.path}
disease:
- _target_: kazu.steps.ner.entity_post_processing.SplitOnConjunctionPattern
path: ${SciSpacyPipeline.path}
tokenized_word_processor:
_target_: kazu.steps.ner.tokenized_word_processor.TokenizedWordProcessor
labels:
- "O"
- "anatomical part or tissue"
- "biological process"
- "cell type"
- "cellular component"
- "chemical"
- "diagnostic"
- "disease, disorder, phenotype or trait"
- "drug"
- "gene or gene product"
- "instrument"
- "measurement"
- "method"
- "mutation"
- "organization"
- "place"
- "protein domain or region"
- "species"
- "treatment"
strip_re:
gene: "( (gene|protein)s?)+$"
use_multilabel: true
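
As a hedged illustration (not part of this PR), a Hydra-style step config like the one above could be instantiated into a runnable step with Hydra's ``instantiate`` utility, provided it is loaded as part of the full pipeline config so that interpolations such as ``${SciSpacyPipeline.path}`` resolve (the file path below is hypothetical):

.. code-block:: python

    # Illustrative only: instantiate a step from a Hydra-style YAML config.
    # Assumes KAZU_MODEL_PACK is set in the environment and that the config
    # is loaded in a context where its interpolations can be resolved.
    from hydra.utils import instantiate
    from omegaconf import OmegaConf

    cfg = OmegaConf.load("conf/multilabel_ner_step.yaml")  # hypothetical path
    step = instantiate(cfg)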
Git LFS files not shown (6 files).