New multi-label synthetic train data biomedBERT #81

Merged: 7 commits, Dec 17, 2024
2 changes: 2 additions & 0 deletions docs/_changelog.d/multilabel_bert.feature.rst
@@ -0,0 +1,2 @@
Release a new multilabel biomedBERT model trained on synthetic NER data generated by an LLM (Gemini). The model was trained on over 7000 LLM-annotated documents with a total of 295822 samples.
The model was trained for 21 epochs and achieved an F1 score of 95.6% on a held-out test set.
1 change: 1 addition & 0 deletions docs/_changelog.d/pytorch_memory_issue.bugfix.rst
@@ -0,0 +1 @@
Fix issue with TransformersModelForTokenClassificationNerStep when processing large numbers of documents. The fix offloads tensors onto the CPU before performing the torch.cat operation, which previously led to a zero tensor.
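
A minimal sketch of the pattern described in this fix (illustrative only, not the actual KAZU code): per-batch output tensors are moved to the CPU before concatenation, rather than concatenating on the accelerator device.

.. code-block:: python

    # Illustrative only: move per-batch tensors to CPU before torch.cat,
    # so the concatenated result is materialised in CPU memory.
    import torch

    # stand-ins for per-batch model outputs accumulated over many documents
    batch_logits = [torch.randn(4, 128, 19) for _ in range(3)]
    all_logits = torch.cat([t.detach().cpu() for t in batch_logits], dim=0)
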
94 changes: 93 additions & 1 deletion docs/training_multilabel_ner.rst
@@ -1,4 +1,4 @@
Build an amazing NER model from LLM annotated data!
====================================================

Intro
@@ -41,3 +41,95 @@ then run the script with
multilabel_ner_training.label_studio_manager.headers.Authorisation="Token <your ls token>"

More options are available via :class:`kazu.training.config.TrainingConfig`\.
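
If you want to inspect the full set of training options programmatically, one possibility (a sketch, assuming ``TrainingConfig`` is a plain dataclass, as is typical for Hydra-style structured configs; consult the API docs if it is not) is:

.. code-block:: python

    # Illustrative sketch: list the configurable fields of TrainingConfig.
    import dataclasses

    from kazu.training.config import TrainingConfig

    for field in dataclasses.fields(TrainingConfig):
        print(field.name, field.type)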

Note: if running the script on MPS, you will need to add the following to the top of the file:

.. code-block:: python

import os

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

as one of the functions used in the Transformer NER step is not supported on MPS.

Our results with this approach
-------------------------------

We trained the `microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext` model (416 MB) on this task using over 7000 full-text KAZU documents, which
yielded 295822 training samples in total. The model was trained for 21 epochs, using early stopping on a held-out validation set.

The model was evaluated on a held-out test set of 365 KAZU documents and achieved a mean F1 score of 95.6% across all the classes. All the documents
in the train/validation/test sets were annotated with an LLM (gemini-1.5-flash-002), so the results should be taken with a pinch of salt.

The detailed metrics per class are shown below:

.. code-block:: json

{
"gene or gene product_precision": 0.9563058589870904,
"gene or gene product_recall": 0.9534653465346534,
"gene or gene product_support": 9090,
"method_precision": 0.9581589958158996,
"method_recall": 0.9631112237142133,
"method_support": 39470,
"protein domain or region_precision": 0.9587628865979382,
"protein domain or region_recall": 0.9765287214329833,
"protein domain or region_support": 1619,
"biological process_precision": 0.9621024062489657,
"biological process_recall": 0.9553043249638491,
"biological process_support": 60856,
"measurement_precision": 0.9576365663322185,
"measurement_recall": 0.954338406843684,
"measurement_support": 36004,
"cell type_precision": 0.9479236812570145,
"cell type_recall": 0.9873743277998597,
"cell type_support": 4277,
"chemical_precision": 0.9438978994948152,
"chemical_recall": 0.9814763616256567,
"chemical_support": 3617,
"species_precision": 0.9475158012641012,
"species_recall": 0.9615166030689292,
"species_support": 12317,
"cellular component_precision": 0.9379452999310504,
"cellular component_recall": 0.9702805515929624,
"cellular component_support": 4206,
"diagnostic_precision": 0.8901098901098901,
"diagnostic_recall": 0.9585798816568047,
"diagnostic_support": 338,
"disease, disorder, phenotype or trait_precision": 0.9441439004598323,
"disease, disorder, phenotype or trait_recall": 0.9495375408052231,
"disease, disorder, phenotype or trait_support": 7352,
"drug_precision": 0.9435426958362738,
"drug_recall": 0.9681390296886314,
"drug_support": 1381,
"treatment_precision": 0.9329966983880366,
"treatment_recall": 0.9550695825049702,
"treatment_support": 5030,
"instrument_precision": 0.9301778242677824,
"instrument_recall": 0.9766611751784734,
"instrument_support": 3642,
"organization_precision": 0.9359301055697125,
"organization_recall": 0.9694570135746606,
"organization_support": 2652,
"mutation_precision": 0.9478108581436077,
"mutation_recall": 0.9815016322089227,
"mutation_support": 2757,
"anatomical part or tissue_precision": 0.9636795933426252,
"anatomical part or tissue_recall": 0.9730132450331126,
"anatomical part or tissue_support": 12080,
"place_precision": 0.952116935483871,
"place_recall": 0.9799412069859934,
"place_support": 5783,
"mean_f1": 0.9560492521601017
}
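
For reference, a minimal sketch of how the per-class F1 scores could be recomputed from the precision/recall figures above and averaged into a single number (assuming an unweighted mean over classes; the exact aggregation used to produce ``mean_f1`` is defined by the training code, not shown here):

.. code-block:: python

    # Illustrative only: recompute per-class F1 from precision/recall and average.
    metrics = {
        "gene or gene product_precision": 0.9563058589870904,
        "gene or gene product_recall": 0.9534653465346534,
        # ... remaining classes as in the JSON above
    }

    classes = sorted({k.rsplit("_", 1)[0] for k in metrics if k.endswith("_precision")})
    f1_scores = {
        cls: 2 * metrics[f"{cls}_precision"] * metrics[f"{cls}_recall"]
        / (metrics[f"{cls}_precision"] + metrics[f"{cls}_recall"])
        for cls in classes
    }
    mean_f1 = sum(f1_scores.values()) / len(f1_scores)
    print(f"mean F1: {mean_f1:.4f}")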

Experiments were also performed with DistilBERT (268 MB) and TinyBERT (60 MB) models for comparison, which achieved mean F1 scores of 93.1% and 77.9%
respectively.

Future work
--------------

For future work we need to investigate the quality of the LLM-annotated data further, perhaps obtaining human corrections at least on the test set,
to ensure that we have a good understanding of its performance. The trained model is quite large in comparison to the previous TinyBern model (56 MB),
so we should also investigate knowledge distillation or other techniques to reduce the model size whilst keeping most of the
performance.
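
As a rough illustration of the kind of distillation that could be explored (a sketch only, not part of this PR), a multilabel token-classification student could be trained against the larger model's soft targets, assuming one sigmoid per label:

.. code-block:: python

    # Illustrative only: a simple multilabel distillation loss, where the
    # teacher's per-label sigmoid probabilities act as soft targets.
    import torch
    import torch.nn.functional as F


    def multilabel_distillation_loss(
        student_logits: torch.Tensor,
        teacher_logits: torch.Tensor,
        temperature: float = 2.0,
    ) -> torch.Tensor:
        teacher_probs = torch.sigmoid(teacher_logits / temperature)
        return F.binary_cross_entropy_with_logits(
            student_logits / temperature, teacher_probs
        )
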
@@ -0,0 +1,44 @@
_target_: kazu.steps.ner.hf_token_classification.TransformersModelForTokenClassificationNerStep
path: ${oc.env:KAZU_MODEL_PACK}/multilabel_biomedBERT
batch_size: 2
stride: 64
max_sequence_length: 512
keys_to_use: #distilbert for token classification doesn't use token_type_ids
- input_ids
- attention_mask
- token_type_ids
entity_splitter:
_target_: kazu.steps.ner.entity_post_processing.NonContiguousEntitySplitter
entity_conditions:
gene:
- _target_: kazu.steps.ner.entity_post_processing.SplitOnNumericalListPatternWithPrefix
- _target_: kazu.steps.ner.entity_post_processing.SplitOnConjunctionPattern
path: ${SciSpacyPipeline.path}
disease:
- _target_: kazu.steps.ner.entity_post_processing.SplitOnConjunctionPattern
path: ${SciSpacyPipeline.path}
tokenized_word_processor:
_target_: kazu.steps.ner.tokenized_word_processor.TokenizedWordProcessor
labels:
- "O"
- "anatomical part or tissue"
- "biological process"
- "cell type"
- "cellular component"
- "chemical"
- "diagnostic"
- "disease, disorder, phenotype or trait"
- "drug"
- "gene or gene product"
- "instrument"
- "measurement"
- "method"
- "mutation"
- "organization"
- "place"
- "protein domain or region"
- "species"
- "treatment"
strip_re:
gene: "( (gene|protein)s?)+$"
use_multilabel: true
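
As a hedged illustration (not part of this PR), a Hydra-style step config like the one above could be instantiated into a runnable step with Hydra's ``instantiate`` utility, provided it is loaded as part of the full pipeline config so that interpolations such as ``${SciSpacyPipeline.path}`` resolve (the file path below is hypothetical):

.. code-block:: python

    # Illustrative only: instantiate a step from a Hydra-style YAML config.
    # Assumes KAZU_MODEL_PACK is set in the environment and that the config
    # is loaded in a context where its interpolations can be resolved.
    from hydra.utils import instantiate
    from omegaconf import OmegaConf

    cfg = OmegaConf.load("conf/multilabel_ner_step.yaml")  # hypothetical path
    step = instantiate(cfg)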
Git LFS files not shown (6 files).