add training scripts for multi-machine training
whwu committed May 18, 2021
1 parent ceb25ba commit ed33622
Showing 4 changed files with 41 additions and 0 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -78,6 +78,16 @@ This implementation supports multi-gpu, `DistributedDataParallel` training, which
bash scripts/dist_train_recognizer.sh configs/MVFNet/K400/mvf_kinetics400_2d_rgb_r50_dense.py 8
```

- We also provide scripts to train MVFNet on Kinetics400 with multiple machines (e.g., 2 machines and 16 GPUs in total); a small connectivity-check sketch follows the two commands below.
```sh
# For the first machine, --master_addr is the IP of your first machine
bash scripts/dist_train_multinode_1.sh configs/MVFNet/K400/mvf_kinetics400_2d_rgb_r50_dense.py 8
```
```sh
# For the second machine, --master_addr is still the IP of your first machine
bash scripts/dist_train_multinode_2.sh configs/MVFNet/K400/mvf_kinetics400_2d_rgb_r50_dense.py 8
```
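
Before kicking off a long multi-machine run, it can help to confirm that the two nodes can reach each other over NCCL. Below is a minimal sanity-check sketch (the file `check_dist.py` is hypothetical and not part of this repository); launch it on each node with the same `torch.distributed.launch` arguments as the scripts above, and every rank should print a value equal to the world size.

```python
# check_dist.py -- minimal multi-node connectivity check (hypothetical helper, not part of the repo)
import torch
import torch.distributed as dist


def main():
    # torch.distributed.launch sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so the default env:// initialization picks them up automatically.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    # Same device mapping as _init_dist_pytorch in codes/core/dist_utils.py.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.ones(1, device="cuda")
    dist.all_reduce(tensor)  # default SUM op: the result equals the world size on every rank
    print(f"rank {rank}/{dist.get_world_size()}: all_reduce -> {tensor.item()}")


if __name__ == "__main__":
    main()
```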

## Acknowledgements
We especially thank the contributors of the [mmaction](https://github.com/open-mmlab/mmaction) codebase for providing helpful code.

2 changes: 2 additions & 0 deletions codes/core/dist_utils.py
@@ -85,9 +85,11 @@ def _init_dist_pytorch(backend, **kwargs):
"""init dist pytorch"""
# TODO: use local_rank instead of rank % num_gpus
rank = int(os.environ['RANK'])
local_rank = int(os.environ["LOCAL_RANK"])
num_gpus = torch.cuda.device_count()
torch.cuda.set_device(rank % num_gpus)
dist.init_process_group(backend=backend, **kwargs)
print(f"[init] == local rank: {local_rank}, global rank: {rank} ==")


def _init_dist_mpi(backend, **kwargs):
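
For reference (not part of the commit): with 2 nodes and 8 GPUs per node as in the README, `rank % num_gpus` reduces each global rank to its per-node GPU index, which is the same value as the `LOCAL_RANK` reported by the new print statement. A tiny sketch of that mapping:

```python
# Sketch (not in the repository) of the device mapping used by _init_dist_pytorch,
# assuming 2 nodes with 8 GPUs each, i.e. 16 global ranks in total.
num_gpus = 8  # GPUs per node
for rank in range(16):  # global ranks 0..15
    node, device = rank // num_gpus, rank % num_gpus
    print(f"global rank {rank:2d} -> node {node}, cuda device {device}")
```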
14 changes: 14 additions & 0 deletions scripts/dist_train_multinode_1.sh
@@ -0,0 +1,14 @@
#!/usr/bin/env bash

PYTHON=${PYTHON:-"python"}
PORT=${PORT:-28890}

# run on node 1
$PYTHON -m torch.distributed.launch \
    --nproc_per_node=$2 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.127.20.17" \
    --master_port=$PORT \
    train_recognizer.py $1 \
    --launcher pytorch --validate ${@:3}
15 changes: 15 additions & 0 deletions scripts/dist_train_multinode_2.sh
@@ -0,0 +1,15 @@
#!/usr/bin/env bash

PYTHON=${PYTHON:-"python"}
PORT=${PORT:-28890}

# run on node 2
$PYTHON -m torch.distributed.launch \
    --nproc_per_node=$2 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.127.20.17" \
    --master_port=$PORT \
    train_recognizer.py $1 \
    --launcher pytorch --validate ${@:3}
