This repository provides a script and recipe to train the ResNeXt101 model to achieve state-of-the-art accuracy, and is tested and maintained by Habana. Please visit this page for performance information.
For more information about training deep learning models on Gaudi, visit developer.habana.ai.
- Model-References
- Model Overview
- Setup
- Media Loading Acceleration
- Training
- Profile
- Changelog
- Supported Configuration
- Known Issues
ResNeXt is a modified version of the original ResNet v1 model. This implementation defines ResNeXt101, which features 101 layers.
Originally, the scripts were taken from the TensorFlow GitHub repository, tag v1.13.0.
Files used:
- imagenet_main.py
- imagenet_preprocessing.py
- resnet_model.py
- resnet_run_loop.py
All of the above files were converted to TF2 using the tf_upgrade_v2 tool. Additional changes were made to specific files.
The following are the changes specific to Gaudi that were made to the original scripts:
- Added Habana Gaudi support
- Added Horovod support for multinode
- Added mini_imagenet support
- Changed the signature of input_fn with new parameters added, and num_parallel_batches removed
- Changed parallel dataset deserialization to use dataset.interleave instead of dataset.apply
- Added Resnext support
- Added parameter to ImagenetModel::init for selecting between resnet and resnext
- Redefined learning_rate_fn to use warmup_epochs and use_cosine_lr from params
- Added flags to specify the weight_decay and momentum
- Added flag to enable horovod support
- Added calls to tf.compat.v1.disable_eager_execution() and tf.compat.v1.enable_resource_variables() in main code section
- Added flag to specify maximum number of cpus to be used
- Changed the image decode, crop and flip function to take a seed to propagate to tf.image.sample_distorted_bounding_box
- Changed the use of tf.expand_dims to tf.broadcast_to for better performance
- Added tf.bfloat16 to CASTABLE_TYPES
- Changed calls to tf. to tf.compat.v1. for backward compatibility when running training scripts on TensorFlow v2
- Deleted the call to pad the input along the spatial dimensions independently of input size for stride greater than 1
- Added functionality for strided 2-D convolution with groups and explicit padding
- Added functionality for a single block for ResNext, with a bottleneck
- Added Resnext support
- Changes for enabling Horovod, e.g. data sharding in multi-node, usage of Horovod's TF DistributedOptimizer, etc.
- Added options to enable the use of tf.data's 'experimental_slack' and 'experimental.prefetch_to_device' options during the input processing
- Added support for specific size thread pool for tf.data operations
- TensorFlow v2 support: Changed dataset.apply(tf.contrib.data.map_and_batch(..)) to dataset.map(..., num_parallel_calls, ...) followed by dataset.batch()
- Changed calls to tf. to tf.compat.v1. for backward compatibility when running training scripts on TensorFlow v2
- Other TF v2 replacements for tf.contrib usages
- Redefined learning_rate_with_decay to use warmup_epochs and use_cosine_lr
- Added functionality for label smoothing
- Commented out writing images to summary, for performance reasons
- Added check for non-tf.bfloat16 input images having the same data type as the dtype that training is run with
- Added functionality to define the cross-entropy function depending on whether label smoothing is enabled
- Added support for loss scaling of gradients
- Added flag for experimental_preloading that invokes the HabanaEstimator, besides other optimizations such as tf.data.experimental.prefetch_to_device
- Added 'TF_DISABLE_SCOPED_ALLOCATOR' environment variable flag to disable Scoped Allocator Optimization (enabled by default) for Horovod runs
- Added a flag to configure the save_checkpoint_steps
- If the flag "use_train_and_evaluate" is set, or in multi-worker training scenarios, there is a one-shot call to tf.estimator.train_and_evaluate
- resnet_main() returns a dictionary with keys 'eval_results' and 'train_hooks'
- Added flags in 'define_resnet_flags' for flags_core.define_base, flags_core.define_performance, flags_core.define_distribution, flags_core.define_experimental, and many others (please refer to this function for all the flags that are available)
- Changed order of ops creating summaries to log them in TensorBoard with proper name. Added saving HParams to TensorBoard and exposed a flag for specifying frequency of summary updates.
- Changed the name of the directory in which workers save logs and checkpoints from "rank_N" to "worker_N".
- Added support for TF profiling
ResNeXt:
- Momentum (0.875).
- Learning rate (LR) = 0.256 for batch size 256; for other batch sizes, the learning rate is scaled linearly.
- Cosine learning rate schedule.
- Linear warmup of the learning rate during the first 8 epochs.
- Weight decay: 6.103515625e-05
- Label Smoothing: 0.1.
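The learning rate settings listed above can be illustrated with a minimal Python sketch. It assumes that warmup starts from zero and that the cosine schedule decays to zero at the last epoch; the script's own learning_rate_with_decay may differ in these details, and the function and constant names here are hypothetical.

```python
import math

BASE_LR = 0.256          # learning rate quoted for batch size 256
BASE_BATCH_SIZE = 256
WARMUP_EPOCHS = 8        # linear warmup length from the list above
TOTAL_EPOCHS = 90


def learning_rate(epoch, batch_size, total_epochs=TOTAL_EPOCHS):
    """Return the learning rate at a (possibly fractional) epoch."""
    # Linear scaling rule: the peak LR grows proportionally with batch size.
    peak_lr = BASE_LR * batch_size / BASE_BATCH_SIZE
    if epoch < WARMUP_EPOCHS:
        # Assumption: warmup ramps linearly from 0 up to peak_lr.
        return peak_lr * epoch / WARMUP_EPOCHS
    # Assumption: cosine decay from peak_lr to 0 over the remaining epochs.
    progress = (epoch - WARMUP_EPOCHS) / (total_epochs - WARMUP_EPOCHS)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, learning_rate(8, 128) gives a peak of 0.128, half the value used for batch size 256.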
We do not apply weight decay to batch norm trainable parameters (gamma/bias).
We train for:
- 90 epochs: the standard for ResNet-family networks.
- 250 epochs: best possible accuracy. For 250-epoch training we also use MixUp regularization.
This model uses the following data augmentation:
- For training:
  - Normalization.
  - Random resized crop to 224x224.
    - Scale from 8% to 100%.
    - Aspect ratio from 3/4 to 4/3.
  - Random horizontal flip.
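To make the crop parameters above concrete, here is a plain-Python sketch of how a random crop box with an area of 8% to 100% of the image and an aspect ratio between 3/4 and 4/3 could be sampled. It is only an illustration: the training scripts rely on tf.image.sample_distorted_bounding_box, and the function name, the retry count, and the log-uniform ratio sampling are assumptions of this sketch.

```python
import math
import random


def sample_crop(height, width, rng, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), attempts=10):
    """Pick a (top, left, crop_h, crop_w) box; fall back to the full image."""
    area = height * width
    for _ in range(attempts):
        # Target area is a random fraction (8% to 100%) of the image area.
        target_area = area * rng.uniform(*scale)
        # Assumption: sample the aspect ratio log-uniformly in [3/4, 4/3].
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    # No valid box found within the attempt budget: use the whole image.
    return 0, 0, height, width


if __name__ == "__main__":
    print(sample_crop(375, 500, random.Random(0)))
```

The selected box would then be resized to 224x224 and flipped horizontally with probability 0.5.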
Please follow the instructions given in the following link for setting up the
environment including the $PYTHON and $MPI_ROOT environment variables: Gaudi Installation
Guide.
This guide will walk you through the process of setting up your system to run
the model on Gaudi.
The script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge. In order to obtain the dataset, follow these steps:
- Sign up with http://image-net.org/download-images and acquire the rights to download original images.
- Follow the link to the 2012 ILSVRC and download ILSVRC2012_img_val.tar and ILSVRC2012_img_train.tar.
- Use the commands below to prepare the dataset under /data/tensorflow/imagenet/tf_records, which is the default data_dir for the training script. The original JPEG files remain in the /data/tensorflow/imagenet/train and /data/tensorflow/imagenet/val directories and can be used for Media Loading Acceleration on Gaudi2. See the examples with the --data_dir and --jpeg_data_dir parameters for details.
export IMAGENET_HOME=/data/tensorflow/imagenet
mkdir -p $IMAGENET_HOME/validation
mkdir -p $IMAGENET_HOME/train
tar xf ILSVRC2012_img_val.tar -C $IMAGENET_HOME/validation
tar xf ILSVRC2012_img_train.tar -C $IMAGENET_HOME/train
cd $IMAGENET_HOME/train
for f in *.tar; do
  d=$(basename "$f" .tar)
  mkdir "$d"
  tar xf "$f" -C "$d"
done
cd $IMAGENET_HOME
rm $IMAGENET_HOME/train/*.tar # optional
wget -O synset_labels.txt https://raw.githubusercontent.com/tensorflow/models/master/research/slim/datasets/imagenet_2012_validation_synset_labels.txt
cd Model-References/TensorFlow/computer_vision/Resnets
$PYTHON preprocess_imagenet.py \
--raw_data_dir=$IMAGENET_HOME \
--local_scratch_dir=$IMAGENET_HOME/tf_records
mv $IMAGENET_HOME/validation $IMAGENET_HOME/val
cd $IMAGENET_HOME/val
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
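After the commands above finish, a quick sanity check of the JPEG directories can catch an incomplete extraction. The following Python sketch is a hypothetical helper (not part of the repository) that counts class directories and JPEG files under train and val. For the full ILSVRC2012 dataset one would expect 1000 class directories in each split, with 1,281,167 training images and 50,000 validation images.

```python
from pathlib import Path

JPEG_SUFFIXES = (".jpeg", ".jpg")


def summarize_imagenet(root):
    """Count class directories and JPEG files under <root>/train and <root>/val."""
    summary = {}
    for split in ("train", "val"):
        split_dir = Path(root) / split
        # A missing split directory is reported as zero classes and images.
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()] if split_dir.is_dir() else []
        num_images = sum(
            1
            for class_dir in class_dirs
            for f in class_dir.iterdir()
            if f.suffix.lower() in JPEG_SUFFIXES
        )
        summary[split] = {"classes": len(class_dirs), "images": num_images}
    return summary


if __name__ == "__main__":
    # Path taken from the IMAGENET_HOME value used in the commands above.
    print(summarize_imagenet("/data/tensorflow/imagenet"))
```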
In the docker container, clone this repository and switch to the branch that matches your SynapseAI version. (Run the hl-smi utility to determine the SynapseAI version.)
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References
Note: If the repository is not in the PYTHONPATH, make sure you update it.
export PYTHONPATH=/path/to/Model-References:$PYTHONPATH
In the docker container, go to the ResNeXt directory:
cd /root/Model-References/TensorFlow/computer_vision/Resnets/ResNeXt
Install the required packages using pip:
$PYTHON -m pip install -r requirements.txt
In some cases, performance improves if jemalloc is set up before running the examples below, so it is generally recommended to set up jemalloc before training the ResNeXt model. To set up jemalloc, export LD_PRELOAD with the path to libjemalloc.so. Only one of libjemalloc.so.1 or libjemalloc.so.2 will be present. To locate the file, search inside the following directories:
- /usr/lib/x86_64-linux-gnu/
- /usr/lib64/
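The search over these two directories can be sketched in Python as follows. The helper name is hypothetical, and the sketch only prints the export command; it does not modify the environment of the calling shell.

```python
import glob
import os

# The two directories named in the list above.
SEARCH_DIRS = ("/usr/lib/x86_64-linux-gnu", "/usr/lib64")


def find_jemalloc(search_dirs=SEARCH_DIRS):
    """Return the first libjemalloc.so.1 or libjemalloc.so.2 found, or None."""
    for directory in search_dirs:
        # The character class [12] matches either version suffix.
        matches = sorted(glob.glob(os.path.join(directory, "libjemalloc.so.[12]")))
        if matches:
            return matches[0]
    return None


if __name__ == "__main__":
    path = find_jemalloc()
    print(f"export LD_PRELOAD={path}" if path else "libjemalloc not found")
```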
Once either version is found, export it using the command export LD_PRELOAD=/path/to/libjemalloc.so
Example:
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
Gaudi2 offers a dedicated hardware engine for media loading operations, such as JPEG decoding and data augmentation. This can be leveraged in the ResNeXt model to decrease CPU usage and overcome the performance limit when data processing on the CPU is a bottleneck. Currently, the only supported file format is JPEG (TFRecord support is planned).
ResNext automatically uses hardware Media Loading Acceleration unless:
- Training is done on First-gen Gaudi Processors (they don't have dedicated hardware)
- User doesn't have hpu_media_loader Python package installed
- User has set FORCE_HABANA_IMAGENET_LOADER_FALLBACK environment variable
- User hasn't provided location of ImageNet dataset containing JPEGs (--jpeg_data_dir parameter)
In the above cases media processing will be done on CPU.
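The four fallback conditions above can be summarized in a small Python sketch. This is not the script's actual logic: the function name and parameters are hypothetical, and the sketch assumes that the mere presence of FORCE_HABANA_IMAGENET_LOADER_FALLBACK in the environment counts as "set".

```python
import importlib.util
import os


def uses_media_acceleration(is_gaudi2, jpeg_data_dir, environ=None, package_installed=None):
    """Return True only when none of the four fallback conditions applies."""
    environ = os.environ if environ is None else environ
    if package_installed is None:
        # Detect the hpu_media_loader package without importing it.
        package_installed = importlib.util.find_spec("hpu_media_loader") is not None
    if not is_gaudi2:
        return False  # first-gen Gaudi has no dedicated media hardware
    if not package_installed:
        return False  # hpu_media_loader is not installed
    if "FORCE_HABANA_IMAGENET_LOADER_FALLBACK" in environ:
        return False  # assumption: any value of the variable forces the fallback
    if not jpeg_data_dir:
        return False  # --jpeg_data_dir was not provided
    return True
```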
There is a known performance issue for which disabling bf16 data loading for ResNeXt is a workaround that improves 1-card performance. The best performance is obtained when dtype == bf16 and data_loader_image_type == fp32. Note that the examples below do not take this into account and use -dlit bf16.
Run training on 1 Gaudi
- ResNeXt101, bf16, batch size 128, 90 epochs
$PYTHON imagenet_main.py -dt bf16 -dlit bf16 -bs 128 -te 90 -ebe 90 --data_dir /data/tensorflow/imagenet/tf_records/
Run training on 1 Gaudi2
- ResNeXt101, bf16, batch size 256, 90 epochs, Gaudi2 with media acceleration
$PYTHON imagenet_main.py -dt bf16 -dlit bf16 -bs 256 -te 90 -ebe 90 --jpeg_data_dir /data/tensorflow/imagenet
Run training on 8 Gaudi - Horovod
NOTE: mpirun map-by PE attribute value may vary on your setup. Please refer to the instructions on mpirun Configuration for calculation.
- ResNeXt101, bf16, batch size 128, 90 epochs
mpirun --allow-run-as-root --tag-output --merge-stderr-to-stdout --output-filename /root/tmp/resnext_log --bind-to core --map-by socket:PE=7 -np 8 \
$PYTHON imagenet_main.py --use_horovod -dt bf16 -dlit bf16 -bs 128 -te 90 -ebe 90 --data_dir /data/tensorflow/imagenet/tf_records/
Run training on 8 Gaudi2 - Horovod
NOTE: mpirun map-by PE attribute value may vary on your setup. Please refer to the instructions on mpirun Configuration for calculation.
- ResNeXt101, bf16, batch size 256, 90 epochs, Gaudi2 with media acceleration
mpirun --allow-run-as-root --tag-output --merge-stderr-to-stdout --output-filename /root/tmp/resnext_log --bind-to core --map-by socket:PE=7 -np 8 \
$PYTHON imagenet_main.py --use_horovod -dt bf16 -dlit bf16 -bs 256 -te 90 -ebe 90 --jpeg_data_dir /data/tensorflow/imagenet
Run training on 64 Gaudi, multiple boxes - Horovod
Multi-server training is configured with the following mpirun options and environment variables:
- NOTE: MODIFY IP ADDRESS BELOW TO MATCH YOUR SYSTEM.
- -H: set this to a comma-separated list of host IP addresses.
- --mca btl_tcp_if_include: provide the network interface associated with the IP address. More details: Open MPI documentation. If you get mpirun btl_tcp_if_include errors, try un-setting this environment variable and let the training script automatically detect the network interface associated with the host IP address.
- HCCL_SOCKET_IFNAME: defines the prefix of the network interface name that is used for HCCL sideband TCP communication. If not set, the first network interface with a name that does not start with lo or docker will be used.
- NOTE: mpirun map-by PE attribute value may vary on your setup. Please refer to the instructions on mpirun Configuration for calculation.
mpirun \
--allow-run-as-root --mca plm_rsh_args -p3022 \
--bind-to core \
--map-by socket:PE=7 -np 64 \
--mca btl_tcp_if_include 192.10.100.174/24 \
--tag-output --merge-stderr-to-stdout \
--output-filename /root/tmp/resnext_log --prefix $MPI_ROOT \
-H 192.10.100.174:8,10.10.100.101:8,10.10.100.102:8,10.10.100.203:8,10.10.100.104:8,10.10.100.205:8,10.10.100.106:8,10.10.100.207:8 \
-x GC_KERNEL_PATH -x HABANA_LOGS \
-x PYTHONPATH -x HCCL_SOCKET_IFNAME=<interface_name> \
$PYTHON imagenet_main.py \
--use_horovod -dt bf16 \
-dlit bf16 \
-bs 128 \
-te 90 \
-ebe 90 \
--weight_decay 7.93E-05 \
--data_dir /data/tensorflow/imagenet/tf_records/
Note: On a large scale, saving checkpoints has a significant impact on scalability. To disable checkpoints and improve performance during training, use --disable_checkpoints.
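The -H value and the -np count in the multi-server command are related: each host entry carries the number of cards on that host, and -np is the total across hosts. A minimal Python sketch of that bookkeeping, using a hypothetical helper and placeholder IP addresses, might look like this:

```python
def build_mpirun_hosts(hosts, cards_per_host=8):
    """Return the (-H value, -np value) pair for a list of host IP addresses."""
    if not hosts:
        raise ValueError("at least one host is required")
    # Each entry has the form <ip>:<slots>, joined by commas for mpirun -H.
    host_arg = ",".join(f"{host}:{cards_per_host}" for host in hosts)
    return host_arg, len(hosts) * cards_per_host


if __name__ == "__main__":
    # Placeholder addresses; replace with the hosts of your own system.
    example_hosts = [f"10.10.100.{i}" for i in range(101, 109)]
    host_arg, num_procs = build_mpirun_hosts(example_hosts)
    print(f"-H {host_arg} -np {num_procs}")
```

With eight hosts of eight cards each, the helper yields 64 processes, matching the -np 64 used in the example.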
To see all possible arguments to imagenet_main.py, run:
$PYTHON imagenet_main.py --helpfull
Run training on 1 Gaudi with profiler
$PYTHON imagenet_main.py -dt bf16 -dlit bf16 -bs 128 -te 90 -ebe 90 --data_dir /data/tensorflow/imagenet/tf_records/ \
--hooks ProfilerHook --profile_steps 5,8 --max_train_steps 10
The above example will produce a profile trace for 4 steps (5, 6, 7, 8).
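The --profile_steps value is an inclusive start,end pair. The following Python sketch, a hypothetical helper rather than the script's own parser, shows how such a value expands into the list of profiled steps:

```python
def parse_profile_steps(value):
    """Expand a 'start,end' string into the inclusive list of profiled steps."""
    start, end = (int(part) for part in value.split(","))
    if start > end:
        raise ValueError("start step must not exceed end step")
    # Both endpoints are profiled, hence end + 1.
    return list(range(start, end + 1))


if __name__ == "__main__":
    print(parse_profile_steps("5,8"))  # [5, 6, 7, 8]
```

Note that --max_train_steps should be at least as large as the end step, as in the example where it is 10.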
| Device | SynapseAI Version | TensorFlow Version(s) |
|---|---|---|
| Gaudi | 1.5.0 | 2.9.1 |
| Gaudi2 | 1.5.0 | 2.9.1 |
- Add examples to run training with batch size 256 on Gaudi2
- Add support for image processing acceleration on Gaudi2 (JPEG format only)
- Import horovod-fork package directly instead of using Model-References' TensorFlow.common.horovod_helpers; wrapped horovod import with a try-catch block so that the user is not required to install this library when the model is being run on a single card
- Replace references to custom demo script by community entry points in README
- Add profile steps range support to ResNeXt model
- Remove setup_jemalloc() from demo_resnext
- Reduce frequency of logging
- Switch from deprecated flag TF_ENABLE_BF16_CONVERSION to TF_BF16_CONVERSION
- Move files from TensorFlow/common/ and TensorFlow/utils/ to model dir, align imports
- Setup BF16 conversion pass using a config from habana-tensorflow package instead of recipe json file
- Remove usage of HBN_TF_REGISTER_DATASETOPS as prerequisite for experimental preloading
- Update imagenet and align to new naming convention (img_train->train, img_val->validation)
- Add flag to specify maximum number of cpus to be used
- Postpone image transfer to device in order to leave DMA free for more critical data
- Add support for TF profiling
- Update requirements.txt
- Remove support for LARS optimizer
- Move distribution_utils from main dir /TensorFlow/utils/ to ResNeXt script dir
Final training accuracy is significantly lower than validation accuracy when training is run for 90 epochs with just one evaluation at the end.
This is the case for the example commands given in this README, where -te 90 (or --train_epochs 90) and -ebe 90 (or --epochs_between_evals 90) are passed to the training script.
Note that the validation accuracy is still state of the art.
Due to a difference in the cropping mechanism used in HW image processing, accuracy is 0.5-1% lower than with the SW data loader. This will be fixed in upcoming releases.
HW image processing on Gaudi2 can be used only when epochs between evals equals the total number of epochs, so that evaluation is done only once after training, i.e. --epochs_between_evals 90.
Any other value may result in a crash after evaluation. This will be fixed in upcoming releases.