For more information about training deep learning models on Gaudi, visit developer.habana.ai.
Please visit this page for performance information.
- Model Overview
- Setup
- ALBERT Fine-Tuning
- Downloading the datasets
- Training the Model
- Examples
- Advanced
- Changelog
ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm by Google. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation.
This release supports ALBERT fine-tuning on 1 and 8 cards.
Our implementation is a fork of Google Research ALBERT.
Please follow the instructions given in the following link for setting up the
environment including the $PYTHON environment variable: Gaudi Installation
Guide.
This guide will walk you through the process of setting up your system to run
the model on Gaudi.
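Once the setup is complete, a quick sanity check that the $PYTHON environment variable is set and points at a working interpreter can be helpful (a minimal sketch, not an official setup step):

```bash
# Verify that $PYTHON is defined and resolves to a working Python interpreter.
echo "PYTHON=${PYTHON:?PYTHON is not set}"
$PYTHON --version
```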
- Suited for tasks:
  - mrpc: Microsoft Research Paraphrase Corpus (MRPC) is a paraphrase identification dataset, where systems aim to identify if two sentences are paraphrases of each other.
  - squad: Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
- Default hyperparameters:
- dataset: squad
- predict_batch_size: 8
- max_seq_length: 384
- doc_stride: 128
- max_query_length: 64
- learning_rate: 5e-5
- num_train_epochs: 2.0
- warmup_proportion: 0.1
- save_checkpoints_steps: 5000
- do_lower_case: true
- do_train: true
- do_predict: true
- use_einsum: false
- n_best_size: 20
- max_answer_length: 30
- The output will be saved in $HOME/tmp by default.
The fine-tuning task uses the same datasets as BERT. Please follow the steps in the "Download and preprocess the datasets for Pretraining and Finetuning" section of Model-References/TensorFlow/nlp/bert/README.md.
In the docker container, clone this repository and switch to the branch that matches your SynapseAI version. (Run the `hl-smi` utility to determine the SynapseAI version.)
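For example, one quick way to check the installed release before cloning (a sketch; the exact `hl-smi` output format may vary between releases):

```bash
# Print the version lines reported by the Habana System Management Interface tool.
hl-smi | grep -i version
```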
```bash
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References
```
Go to the ALBERT directory:
```bash
cd Model-References/TensorFlow/nlp/albert
```
Install the required Python packages:
```bash
pip install -r requirements.txt
```
If the Model-References repository path is not in the PYTHONPATH, make sure you update it:
```bash
export PYTHONPATH=$PYTHONPATH:/path/to/Model-References
```
Download the pretrained model to the current folder:
For ALBERT Base:
```bash
$PYTHON download/download_pretrained_model.py https://storage.googleapis.com/albert_models/ albert_base_v1
```
For ALBERT Large:
```bash
$PYTHON download/download_pretrained_model.py https://storage.googleapis.com/albert_models/ albert_large_v1
```
The training can be run with run_pretraining.py for pretraining, run_classifier.py for fine-tuning with the MRPC dataset, and run_squad_v1.py with the SQuAD dataset:
The following examples assume that the datasets are located in the directory /data/tensorflow/:
- Single Gaudi card fine-tuning of ALBERT Large using the SQuAD dataset in bfloat16 precision:

  ```bash
  TF_BF16_CONVERSION=/path/to/Model-References/TensorFlow/nlp/albert/bf16_config/albert.json \
  $PYTHON run_squad_v1.py \
      --train_feature_file=/root/tmp/albert_large/train_feature_file.tf_record \
      --predict_feature_file=/root/tmp/albert_large/predict_feature_file.tf_record \
      --predict_feature_left_file=/root/tmp/albert_large/predict_feature_left_file.tf_record \
      --spm_model_file=albert_large_v1/30k-clean.model \
      --vocab_file=albert_large_v1/30k-clean.vocab \
      --albert_config_file=albert_large_v1/albert_config.json \
      --init_checkpoint=albert_large_v1/model.ckpt-best \
      --do_train=True \
      --train_file=/data/tensorflow/squad/train-v1.1.json \
      --do_predict=True \
      --predict_file=/data/tensorflow/squad/dev-v1.1.json \
      --train_batch_size=32 \
      --learning_rate=3e-05 \
      --num_train_epochs=2 \
      --max_seq_length=384 \
      --doc_stride=128 \
      --output_dir=/root/tmp/albert_large \
      --use_horovod=false \
      --enable_scoped_allocator=False \
      --save_checkpoints_steps=5000
  ```
- NOTE: The mpirun map-by PE attribute value may vary on your setup. Please refer to the instructions on mpirun Configuration for the calculation; one rough way to estimate it is sketched after these examples.
- 8 Gaudi cards finetuning of ALBERT Large in bfloat16 precision using SQuAD dataset on a single box (8 cards):

  ```bash
  cd /path/to/Model-References/TensorFlow/nlp/albert/
  mpirun --allow-run-as-root \
      -x TF_BF16_CONVERSION=/path/to/Model-References/TensorFlow/nlp/albert/bf16_config/albert.json \
      --tag-output \
      --merge-stderr-to-stdout \
      --output-filename /root/tmp/albert_log \
      --bind-to core \
      --map-by socket:PE=7 \
      -np 8 \
      $PYTHON run_squad_v1.py \
      --train_feature_file=/root/tmp/albert_large/train_feature_file.tf_record \
      --predict_feature_file=/root/tmp/albert_large/predict_feature_file.tf_record \
      --predict_feature_left_file=/root/tmp/albert_large/predict_feature_left_file.tf_record \
      --spm_model_file=albert_large_v1/30k-clean.model \
      --vocab_file=albert_large_v1/30k-clean.vocab \
      --albert_config_file=albert_large_v1/albert_config.json \
      --init_checkpoint=albert_large_v1/model.ckpt-best \
      --do_train=True \
      --train_file=/data/tensorflow/squad/train-v1.1.json \
      --do_predict=True \
      --predict_file=/data/tensorflow/squad/dev-v1.1.json \
      --train_batch_size=32 \
      --learning_rate=3e-05 \
      --num_train_epochs=2 \
      --max_seq_length=384 \
      --doc_stride=128 \
      --output_dir=/root/tmp/albert_large \
      --use_horovod=true \
      --enable_scoped_allocator=False \
      --save_checkpoints_steps=5000 \
      2>&1 | tee ~/hlogs/albert_large_finetuning_bf16_squad_8_cards.txt
  ```
- 8 Gaudi cards finetuning of ALBERT Large in bfloat16 precision using SQuAD dataset on a K8s single box (8 cards):

  Make sure to download the pretrained model for ALBERT Large or Base to a location that can be accessed by all the workers, such as the dataset folder.

  Make sure the Python packages in requirements.txt are installed on all the workers.

  ```bash
  mpirun --allow-run-as-root \
      --bind-to core \
      --map-by socket:PE=6 \
      -np 8 \
      --tag-output \
      --merge-stderr-to-stdout \
      bash -c "cd /root/Model-References/TensorFlow/nlp/albert;\
      $PYTHON run_squad_v1.py \
      --train_feature_file=/root/tmp/albert_large/train_feature_file.tf_record \
      --predict_feature_file=/root/tmp/albert_large/predict_feature_file.tf_record \
      --predict_feature_left_file=/root/tmp/albert_large/predict_feature_left_file.tf_record \
      --spm_model_file=/data/tensorflow/albert_large_v1/30k-clean.model \
      --vocab_file=/data/tensorflow/albert_large_v1/30k-clean.vocab \
      --albert_config_file=/data/tensorflow/albert_large_v1/albert_config.json \
      --init_checkpoint=/data/tensorflow/albert_large_v1/model.ckpt-best \
      --do_train=True \
      --train_file=/data/tensorflow/squad/train-v1.1.json \
      --do_predict=True \
      --predict_file=/data/tensorflow/squad/dev-v1.1.json \
      --train_batch_size=32 \
      --learning_rate=3e-05 \
      --num_train_epochs=2 \
      --max_seq_length=384 \
      --doc_stride=128 \
      --output_dir=/root/tmp/albert_large \
      --use_horovod=true \
      --enable_scoped_allocator=False \
      --save_checkpoints_steps=5000" \
      2>&1 | tee ~/hlogs/albert_large_ft_squad_8cards.txt
  ```
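As referenced in the NOTE above, the map-by PE value depends on the host CPU topology. The sketch below is one rough way to estimate it, assuming ranks are spread evenly across sockets and PE is approximately the physical cores per socket divided by the ranks per socket; the authoritative procedure is in the mpirun Configuration instructions.

```bash
# Rough estimate of the mpirun --map-by socket:PE value for an 8-rank run.
# Assumption: ranks are distributed evenly across sockets.
NP=8
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/{print $2}' | tr -d ' ')
CORES_PER_SOCKET=$(lscpu | awk -F: '/^Core\(s\) per socket/{print $2}' | tr -d ' ')
RANKS_PER_SOCKET=$(( NP / SOCKETS ))
echo "Suggested PE: $(( CORES_PER_SOCKET / RANKS_PER_SOCKET ))"
```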
- run_classifier.py: Script implementing the MRPC task.
- run_squad_v1.py: Script implementing the SQuAD task.
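The examples above use run_squad_v1.py; for the MRPC task, an invocation of run_classifier.py would look roughly like the sketch below. This is a hypothetical illustration only: the flag names mirror the SQuAD example above and the upstream Google Research ALBERT classifier script, and the dataset/model paths are placeholders, so verify them against `$PYTHON run_classifier.py --help` before use.

```bash
# Hypothetical MRPC fine-tuning sketch -- flag names and paths are assumptions,
# check run_classifier.py --help for the exact interface of this fork.
$PYTHON run_classifier.py \
    --task_name=MRPC \
    --data_dir=/data/tensorflow/glue/MRPC \
    --vocab_file=albert_large_v1/30k-clean.vocab \
    --spm_model_file=albert_large_v1/30k-clean.model \
    --albert_config_file=albert_large_v1/albert_config.json \
    --init_checkpoint=albert_large_v1/model.ckpt-best \
    --do_train=True \
    --do_eval=True \
    --max_seq_length=128 \
    --train_batch_size=32 \
    --learning_rate=2e-5 \
    --output_dir=/root/tmp/albert_large_mrpc \
    --save_checkpoints_steps=5000
```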
| Model | xHPU | Device | SynapseAI Version | TensorFlow Version(s) |
|---|---|---|---|---|
| Albert-Large FT (SQUAD) | 1 HPU | Gaudi | 1.3.0 | 2.8.0, 2.7.1 |
| Albert-Large FT (SQUAD) | 8 HPU | Gaudi | 1.3.0 | 2.8.0, 2.7.1 |
- Imported the horovod-fork package directly instead of using Model-References' TensorFlow.common.horovod_helpers; wrapped the horovod import in a try-except block so that the user is not required to install this library when the model is run on a single card.
- Replaced references to the custom demo script with community entry points in the README.
- Switched from the deprecated flag TF_ENABLE_BF16_CONVERSION to TF_BF16_CONVERSION.
- Added handling of the save_checkpoints_steps parameter and changed its default to 5000.
- Removed the obsolete demo_albert bash script.
- Moved the BF16 config JSON file from TensorFlow/common/ to the model's directory.
- Updated requirements.txt.
- Removed redundant imports.
- Changed `python` or `python3` to `$PYTHON` to execute the correct version based on the environment setup.
- Cleaned up the script by removing the deprecated habana_model_runner.