This repository was archived by the owner on Oct 28, 2022. It is now read-only.

Commit ff7c843

Updated documentation, fixed typos etc.
1 parent 2b59b78 commit ff7c843

9 files changed: +59 -43 lines changed

Diff for: README.md

+6-4
@@ -25,6 +25,8 @@ __*Please star/fork my repository if you find this tutorial helpful!*__
 To run this project please install
 [NVIDIA-Docker](https://github.com/NVIDIA/nvidia-docker) first.

+Unfortunately for Windows users, NVIDIA-Docker is only available for Linux as of the time of writing.
+
 NVIDIA-Docker has many dependencies, such as the NVIDIA driver and Docker.

 These are all necessary for this project.
@@ -33,7 +35,7 @@ I am using Docker because I have found that local installation often fails.

 This is likely due to complicated dependency issues.

-Also, catastrophic errors are easier to handle in a Docker container than on a local computer.
+Also, catastrophic errors are easier to handle in a Docker container than on a local machine.

 Please view basic [Docker](https://docs.docker.com) concepts for this project.

@@ -42,14 +44,14 @@ Please view basic [Docker](https://docs.docker.com) concepts for this project.
 ### Environment

 The Docker container generated by the Dockerfile will create a
-Ubuntu 18.04 LTS Container with CUDA10.0, CUDNN version 7.6.0.64-1,
-NCCL version2.4.7-1, OPENMPI version 4.0.2.
+Ubuntu 18.04 LTS image with CUDA 10.0, CuDNN 7.6.0.64-1,
+NCCL 2.4.7-1, and OpenMPI 4.0.2.

 Python version is 3.6.7, Pytorch is 1.4.0, and Torchvision is 0.5.0.

 The settings were modified from the currently available official horovod image.

-The current official image has an issue with pillow 7 incompatibility with Torchvision 0.4.2.
+The current official horovod Docker image has an issue with pillow 7 incompatibility with Torchvision 0.4.2.

 ### Task

Diff for: docker_files/Dockerfile

+12-8
@@ -4,13 +4,16 @@ FROM nvidia/cuda:10.0-devel-ubuntu18.04
 ENV CUDNN_VERSION=7.6.0.64-1+cuda10.0
 ENV NCCL_VERSION=2.4.7-1+cuda10.0
 ENV OPENMPI_VERSION=4.0.2
+ENV TORCH_WHEEL=https://download.pytorch.org/whl/cu100/torch-1.4.0%2Bcu100-cp36-cp36m-linux_x86_64.whl
+ENV TORCHVISION_WHEEL=https://download.pytorch.org/whl/cu100/torchvision-0.5.0%2Bcu100-cp36-cp36m-linux_x86_64.whl

-# Python 3.6 is supported by Ubuntu Bionic out of the box
+# Python 3.6 is supported by Ubuntu Bionic out of the box.
 ENV PYTHON_VERSION=3.6

-# Set default shell to /bin/bash
+# Set default shell to /bin/bash.
 SHELL ["/bin/bash", "-cu"]

+# Get dependencies on Ubuntu.
 RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
         build-essential \
         cmake \
@@ -32,17 +35,18 @@ RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-
         libibverbs1 \
         ibverbs-providers

+# Create symbolic link.
 RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python

+# Get pip.
 RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
     python get-pip.py && \
     rm get-pip.py

 # Install Pytorch, torchvivison and other required libraries. Tensorboard depends on the future library.
-RUN pip install typing numpy future tensorboard==2.0.0 \
-        https://download.pytorch.org/whl/cu100/torch-1.4.0%2Bcu100-cp36-cp36m-linux_x86_64.whl \
-        https://download.pytorch.org/whl/cu100/torchvision-0.5.0%2Bcu100-cp36-cp36m-linux_x86_64.whl
+RUN pip install typing numpy future tensorboard==2.0.0 ${TORCH_WHEEL} ${TORCHVISION_WHEEL}

+# Get OpenMPI.
 RUN mkdir /tmp/openmpi && \
     cd /tmp/openmpi && \
     wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-${OPENMPI_VERSION}.tar.gz && \
@@ -54,17 +58,17 @@ RUN mkdir /tmp/openmpi && \
     ldconfig && \
     rm -rf /tmp/openmpi

-# Install Horovod, temporarily using CUDA stubs
+# Install Horovod, temporarily using CUDA stubs.
 RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
     HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL HOROVOD_WITH_PYTORCH=1 \
         pip install --no-cache-dir horovod && \
     ldconfig

-# Install OpenSSH for MPI to communicate between containers
+# Install OpenSSH for MPI to communicate between containers.
 RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
     mkdir -p /var/run/sshd

-# Allow OpenSSH to talk to containers without asking for confirmation
+# Allow OpenSSH to talk to containers without asking for confirmation.
 RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
     echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
     mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
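
The two wheel URLs added as `TORCH_WHEEL` and `TORCHVISION_WHEEL` follow a regular naming pattern, with the `+` of the local version label percent-encoded as `%2B`. The sketch below shows how such a URL could be assembled when changing versions. The helper name `wheel_url` and its parameters are hypothetical and not part of the repository, and the pattern has only been checked against the two URLs in this diff, so any other combination should be confirmed against the wheel index before use.

```python
from urllib.parse import quote

# Hypothetical helper for illustration; not part of the repository.
BASE_URL = "https://download.pytorch.org/whl"


def wheel_url(package: str, version: str, cuda_tag: str = "cu100",
              python_tag: str = "cp36", platform: str = "linux_x86_64") -> str:
    """Build a wheel URL following the naming pattern seen in the Dockerfile."""
    # The local version label is "<version>+<cuda_tag>"; "+" becomes "%2B" in a URL.
    local_version = quote(f"{version}+{cuda_tag}", safe="")
    # Assumption: the ABI tag is the Python tag plus "m", as in the two URLs above.
    filename = f"{package}-{local_version}-{python_tag}-{python_tag}m-{platform}.whl"
    return f"{BASE_URL}/{cuda_tag}/{filename}"


print(wheel_url("torch", "1.4.0"))
print(wheel_url("torchvision", "0.5.0"))
```

With the default arguments, the two printed URLs match the values assigned to `TORCH_WHEEL` and `TORCHVISION_WHEEL` in the Dockerfile.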

Diff for: docker_files/README.md

+5-5
@@ -2,11 +2,11 @@

 The Dockerfile here is used to create an environment where Horovod can be used with Pytorch.

-Software versions have been fixed for my convenience.
+Software versions have been fixed for Python, Pytorch, etc. for my convenience.

 However, they can be changed manually.

-The current installation uses pip instead of Anaconda.
+Also, the current installation uses pip instead of Anaconda.

 This is keeping with the original Dockerfile in Horovod.

@@ -15,7 +15,7 @@ for the original.

 ### Dependencies and installation

-The section before `pip install ...` is boilerplate for dependencies on Ubuntu.
+The section before `pip install [...]` is boilerplate for dependencies on Ubuntu.

 The current project only uses Pytorch, Torchvision, Tensorboard, Numpy, and typing.

@@ -25,7 +25,7 @@ Pytorch and Torchvision are installed with their wheel directories in PyPI for f

 See [here](https://download.pytorch.org/whl/cu100/torch_stable.html) for Pytorch wheels for CUDA10.0.

-Other project requirements should be installed here.
+Other project requirements should be installed by including them in the `pip install [...]` line.

 ### Horovod installation

@@ -36,7 +36,7 @@ HOROVOD_GPU_ALLREDUCE=NCCL,
 HOROVOD_GPU_BROADCAST=NCCL, and
 HOROVOD_WITH_PYTORCH=1.

-The first two indicate that GPU operations should use NCCL (pronounced "nickel").
+The first two indicate that GPU operations should use [NCCL](https://github.com/NVIDIA/nccl) (pronounced "nickel").

 This setting is crucial for performance.

Diff for: scripts/README.md

+7-5
@@ -12,6 +12,8 @@ Each script and its contents are explained here.

 Build the Docker image with the docker_build_script.

+Please read the README file attached with the Dockerfile for an explanation of its contents.
+
 The -t indicates the tag/name. The format is (repository):(tag).

 The full path to the Dockerfile must be specified.
@@ -98,15 +100,15 @@ This is specified by the `num_workers` variable.

 However, each Horovod process will launch these workers independently.

-This may cause an excessive number of workers to be launched.
+The true number of pre-processing workers will therefore be `np x num_workers`.

 **2. Host**

 The `-H` flag specifies the host type.

 The number of GPUs to be used is specified on the right.

-N must be the same or lesser than the number of GPUs.
+N must be the same or lesser than the number of available GPUs.

 For a local machine where N GPUs are to be used, use "localhost:N".

@@ -116,13 +118,13 @@ For servers, a different scheme is used.

 For server with index I with N GPUs, use "serverI:N".

-For large servers, use a hostfile.
+For large servers, use a hostfile with the server information.

 **3. Autotune**

-Use the `--autotune` flag to autotune parameters for best performance.
+Use the `--autotune` flag to autotune parameters for best speed performance.

-Autotuning uses Bayesian optimization for finding the best parameters.
+Autotuning uses Bayesian optimization for finding the best parameters for speed.

 This will cause early runs to be slower, but later runs will be faster.
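
The `np x num_workers` rule added in this file can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, and the oversubscription check is a rough heuristic (one extra core per Horovod process for its own training loop), not something the repository or Horovod prescribes.

```python
import os
from typing import Optional


def total_loader_workers(num_processes: int, workers_per_process: int) -> int:
    """Each Horovod process starts its own DataLoader workers, so the counts multiply."""
    return num_processes * workers_per_process


def oversubscribed(num_processes: int, workers_per_process: int,
                   cpu_count: Optional[int] = None) -> bool:
    """Rough heuristic (an assumption): loader workers plus one core per training process."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    needed = total_loader_workers(num_processes, workers_per_process) + num_processes
    return needed > cpus


# Values used in this repository: `-np 2` and `num_workers_per_process = 2`.
print(total_loader_workers(2, 2))
print(oversubscribed(2, 2, cpu_count=4))
```

For the settings in this repository, two processes with two workers each give four pre-processing workers in total.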

Diff for: scripts/docker_build_script.sh

+2-1
@@ -1,2 +1,3 @@
 set -ex
-docker build -t horovod:py-3.6-torch-1.4.0 $HOME/PycharmProjects/Horovod-Pytorch-Tutorial/docker_files/
+docker build -t horovod:py-3.6-torch-1.4.0 \
+    $HOME/PycharmProjects/Horovod-Pytorch-Tutorial/docker_files

Diff for: scripts/docker_run_script.sh

+1-1
@@ -1,4 +1,4 @@
 set -ex
 docker run -v $HOME/PycharmProjects/Horovod-Pytorch-Tutorial:/opt/project \
-    -it -w /opt/project --runtime nvidia --gpus all --name horovod_torch --rm \
+    -it -w /opt/project --runtime=nvidia --gpus=all --name=horovod_torch --rm \
     horovod:py-3.6-torch-1.4.0

Diff for: scripts/horovod_run_script.sh

+2-1
@@ -1,2 +1,3 @@
 set -ex
-horovodrun -np 2 -H localhost:2 python train/train_model.py
+N=2
+horovodrun -np $N -H localhost:$N python train/train_model.py
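
The script now derives both `-np` and the `localhost:N` slot count from a single `N`. The same idea can be sketched in Python with the "N must not exceed the number of available GPUs" rule from the scripts README enforced up front. The function name and the `available_gpus` parameter are hypothetical; here the GPU count is passed in explicitly rather than detected.

```python
import shlex
from typing import List


def horovodrun_command(num_gpus: int, available_gpus: int,
                       script: str = "train/train_model.py",
                       host: str = "localhost") -> List[str]:
    """Build the argument list for a single-host horovodrun launch."""
    if not 1 <= num_gpus <= available_gpus:
        raise ValueError(
            f"num_gpus must be between 1 and {available_gpus}, got {num_gpus}"
        )
    # One process per GPU: -np and the slot count after the host name are the same N.
    return ["horovodrun", "-np", str(num_gpus), "-H", f"{host}:{num_gpus}",
            "python", script]


cmd = horovodrun_command(2, available_gpus=2)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` without shell quoting issues.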

Diff for: train/README.md

+16-7
@@ -1,9 +1,8 @@
 # Training with multiple processes.

----
 Concepts used in Horovod and MPI are outlined [here](https://horovod.readthedocs.io/en/latest/concepts_include.html).

----
+### How does it work?

 When `horovodrun` is used, multiple Python processes are launched simultaneously.

@@ -13,24 +12,34 @@ Each process is given an identifier, the *"local rank"*.

 Each process can access its identifier via the hvd.local_rank() function.

-This is similar to how CUDA threads operate.
+This is similar to how CUDA threads operate within a CUDA kernel.

 Imagine launching the `python ...` command simultaneously, in parallel,
 but each process knowing its identifying number.

-Note that the DataLoaders launch new workers in each process.
+Each process does its own thing but synchronizes with the others via the Ring-AllReduce.
+
+### How many processes?
+
+As mentioned previously, note that the DataLoaders launch new workers in each process.

 This means that the number of pre-processing processes
 is multiplied by the number of Horovod processes.

 This may cause memory issues or performance drops.

-The mini-batch size is not affected by the number of Horovod processes because of the DistributedSampler.
+However, **mini-batch size is not affected by the number of Horovod processes.**
+
+DistributedSampler handles this very well.
+
+### How to write logs and save checkpoints.

 The Horovod documentation recommends that only model checkpoints and logs from
-"hvd.local_rank() == 0" should be saved.
+`"hvd.local_rank() == 0"` should be saved.
+
+The Ring-AllReduce ensures that the different versions will not diverge very much.

-The Ring-AllReduce will ensure that the values will converge eventually.
+### How to manage devices

 Within each `horovodrun` process, the device assigned to that process is set as the default device.
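
The statement that the mini-batch size is unaffected by the number of processes is worth qualifying. Assuming the usual behaviour of PyTorch's `DistributedSampler` (each process receives a strided, equally sized share of the dataset) and gradient averaging across processes, the batch size seen by each process stays at `batch_size`, while the number of samples contributing to one synchronized step grows with the number of processes. The sketch below is a simplified, non-shuffling imitation of that split written in plain Python; the function names are hypothetical and it is not the library implementation.

```python
import math
from typing import List


def shard_indices(dataset_size: int, num_replicas: int, rank: int) -> List[int]:
    """Simplified strided split in the style of a distributed sampler (no shuffling)."""
    per_replica = math.ceil(dataset_size / num_replicas)
    total = per_replica * num_replicas
    indices = list(range(dataset_size))
    # Pad by repeating the first samples so every process gets the same count.
    indices += indices[: total - dataset_size]
    return indices[rank:total:num_replicas]


def samples_per_synchronized_step(batch_size: int, num_processes: int) -> int:
    """Each process draws its own batch, and the gradients are then combined."""
    return batch_size * num_processes


print(shard_indices(10, 2, 0), shard_indices(10, 2, 1))
print(samples_per_synchronized_step(1024, 2))
```

Under these assumptions, `batch_size = 1024` with two processes means 2048 samples per optimizer step, which matters when choosing the learning rate and warm-up schedule.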

Diff for: train/train_model.py

+8-11
@@ -15,9 +15,8 @@
 def main():
     model: nn.Module = resnet34(num_classes=10).cuda()

-    # print('Number of threads: ', torch.get_num_threads(), torch.get_num_interop_threads())
-
-    batch_size = 1024
+    # Set variables here. These are just for demonstration so no need for argparse.
+    batch_size = 1024  # This is the true min-batch size, thanks to DistributedSampler.
     num_workers_per_process = 2  # Workers launched by each process started by horovodrun command.
     lr = 0.1
     momentum = 0.9
@@ -42,9 +41,7 @@ def main():

     loss_func = nn.CrossEntropyLoss()

-    # print('Thread number: ', torch.get_num_threads(), torch.get_num_interop_threads())
-
-    # Writing separate log files for each process. Verified that models are different.
+    # Writing separate log files for each process for comparison. Verified that models are different.
     writer = SummaryWriter(log_dir=f'./logs/{hvd.local_rank()}', comment='Summary writer for run.')

     # Optimizer must be distributed for the Ring-AllReduce.
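
For readers unfamiliar with the Ring-AllReduce mentioned in the comment above, the following is a toy, single-process simulation of the general algorithm: a reduce-scatter phase followed by an all-gather phase around a ring. It is a teaching sketch only and makes no claim about how Horovod or NCCL implement it. Each "worker" is a Python list holding one value per chunk, with as many chunks as workers.

```python
from typing import List


def ring_allreduce(worker_values: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over a ring; worker_values[i] is worker i's chunks."""
    n = len(worker_values)
    data = [list(values) for values in worker_values]
    if any(len(values) != n for values in data):
        raise ValueError("this toy version needs exactly one chunk per worker")

    # Reduce-scatter: at each step, worker i sends chunk (i - step) % n to its
    # right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Worker i now holds the complete sum for chunk (i + 1) % n.
    # All-gather: pass the completed chunks around the ring, overwriting stale copies.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data


summed = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
averaged = [[value / 3 for value in row] for row in summed]
print(summed[0], averaged[0])
```

After both phases every worker holds the same summed values, and dividing by the number of workers gives the averaged gradient that each process then applies in its own optimizer step.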
@@ -68,7 +65,7 @@ def warm_up(epoch: int): # Learning rate warm-up.

     for epoch in range(num_epochs):
         print(epoch)
-        torch.autograd.set_grad_enabled = True
+        torch.autograd.set_grad_enabled = True  # Training mode.
         train_sampler.set_epoch(epoch)  # Set epoch to sampler for proper shuffling of training set.
         for inputs, targets in train_loader:
             inputs: Tensor = inputs.cuda(non_blocking=True)
@@ -81,15 +78,15 @@ def warm_up(epoch: int): # Learning rate warm-up.
             loss.backward()
             optimizer.step()

-        torch.autograd.set_grad_enabled = False
+        torch.autograd.set_grad_enabled = False  # Evaluation mode.
         for step, (inputs, targets) in enumerate(test_loader):
             inputs: Tensor = inputs.cuda(non_blocking=True)
             targets: Tensor = targets.cuda(non_blocking=True)
             outputs = model(inputs)
             loss = loss_func(outputs, targets)
             writer.add_scalar(tag='val_loss', scalar_value=loss.item(), global_step=step)

-        scheduler.step()
+        scheduler.step()  # Scheduler works fine on DistributedOptimizer.


 if __name__ == '__main__':
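
One thing a reviewer may want to flag in the two hunks above: `torch.autograd.set_grad_enabled = True` and `... = False` are assignments, not calls. An assignment only rebinds the attribute name and never runs the function, so gradient tracking is presumably left unchanged during the evaluation loop. The sketch below reproduces the pitfall with a small stand-in object so that it runs without PyTorch; `make_fake_autograd` is hypothetical and only mimics the shape of the real API.

```python
import types


def make_fake_autograd():
    """Return a stand-in namespace with a set_grad_enabled function, plus its state."""
    state = {"grad_enabled": True}

    def set_grad_enabled(mode: bool) -> None:
        state["grad_enabled"] = bool(mode)

    return types.SimpleNamespace(set_grad_enabled=set_grad_enabled), state


# Pattern from the diff: assignment replaces the function with a bool.
autograd, state = make_fake_autograd()
autograd.set_grad_enabled = False
print(state["grad_enabled"])  # True: the switch was never touched

# Calling the function is what actually changes the setting.
autograd, state = make_fake_autograd()
autograd.set_grad_enabled(False)
print(state["grad_enabled"])  # False
```

In PyTorch itself the setting is changed by calling `torch.set_grad_enabled(False)` or by wrapping the evaluation loop in `with torch.no_grad():`. Separately, `model.train()` and `model.eval()` switch layers such as batch normalization between training and evaluation behaviour, which the "Training mode" and "Evaluation mode" comments seem to have in mind.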
@@ -105,5 +102,5 @@ def warm_up(epoch: int): # Learning rate warm-up.
     print(f'Local Rank: {hvd.local_rank()}')

     # Set default to each GPU device. Local rank is different for each process launched by horovodrun.
-    with torch.cuda.device(f'cuda:{hvd.local_rank()}'):
-        main()
+    with torch.cuda.device(f'cuda:{hvd.local_rank()}'):  # Not sure if this is absolutely necessary.
+        main()  # Run main function for each process on different devices, specified by "local_rank".
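
The code uses `hvd.local_rank()` both to pick the GPU and to name the log directory. On a single machine that is unambiguous, but on several hosts the local rank restarts at 0 on each host, whereas the global rank (`hvd.rank()` in Horovod) is unique across the whole job. The sketch below illustrates the difference in plain Python. It assumes processes fill hosts in the order listed, which is a common default but still an assumption here, and `assign_ranks` is a hypothetical name.

```python
from typing import List, Tuple


def assign_ranks(slots_per_host: List[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (host, rank, local_rank) per process, filling hosts in the order given."""
    placements = []
    rank = 0
    for host, slots in slots_per_host:
        for local_rank in range(slots):
            placements.append((host, rank, local_rank))
            rank += 1
    return placements


placements = assign_ranks([("server1", 2), ("server2", 2)])
for host, rank, local_rank in placements:
    print(host, rank, local_rank)

# local_rank == 0 is true once per host; rank == 0 is true once per job.
print(sum(1 for _, _, local_rank in placements if local_rank == 0))
print(sum(1 for _, rank, _ in placements if rank == 0))
```

This suggests using the local rank for device selection and the global rank for deciding which single process writes checkpoints, so that a multi-host run does not produce one checkpoint writer per host.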
