
Fundamentals of Accelerated Neural Network Training with Multi-GPUs on HiPerGator-AI

  • Yunchao Yang
  • AI support team
  • UF Research Computing

slides

GitHub repo

The code is adapted from the DDP tutorial series at https://pytorch.org/tutorials/beginner/ddp_series_intro.html

Each code file builds on the previous one: the series starts with a non-distributed script that runs on a single GPU and incrementally works up to multinode training on a Slurm cluster.

How to Use the Files

Step 0: Navigate to ood.rc.ufl.edu and request an Open OnDemand console terminal with 1 node and 2 GPUs:

  • Cluster partition = "hpg-ai"
  • Generic Resource Request = "gpu:2"

Step 1: Single-GPU code

  • single_gpu.py: Non-distributed training script on a single GPU
  • [How to run]: python single_gpu.py
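
For orientation, here is a minimal sketch of a non-distributed training loop; the toy model, dataset, and hyperparameters are illustrative placeholders, not the exact contents of single_gpu.py.

```python
# Minimal single-GPU training loop (illustrative sketch only; the repo's
# single_gpu.py follows the upstream PyTorch DDP tutorial series).
import torch
from torch.utils.data import DataLoader, TensorDataset

def main(epochs: int = 5):
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    # Toy dataset and model stand in for the tutorial's training objects.
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    model = torch.nn.Linear(20, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        for source, targets in loader:
            source, targets = source.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(source), targets)
            loss.backward()
            optimizer.step()
        print(f"[epoch {epoch}] last-batch loss: {loss.item():.4f}")

if __name__ == "__main__":
    main()
```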

Step 2.1: Single-node parallel training with the mp.spawn utility
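
The key change from the single-GPU script is that mp.spawn launches one process per GPU, and each process initializes its own process group. A minimal sketch of the pattern (the address, port, and model are illustrative assumptions):

```python
# Single-node DDP launched with mp.spawn (illustrative sketch only).
import os
import torch
import torch.multiprocessing as mp
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    # Each spawned process joins the process group and owns one GPU.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"  # any free port
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(20, 1).to(rank), device_ids=[rank])
    # ... training loop as in Step 1; DDP averages gradients across ranks ...
    destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. the 2 GPUs from Step 0
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Note that in real training each rank should also wrap its DataLoader with a DistributedSampler, so the dataset is sharded rather than duplicated across processes.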

Step 2.2: Single-node parallel training with the torchrun utility
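
torchrun replaces mp.spawn: it launches the worker processes itself and exports RANK, LOCAL_RANK, and WORLD_SIZE, so the script no longer hard-codes the master address or passes the rank and world size by hand. A minimal sketch:

```python
# Single-node DDP launched with torchrun (illustrative sketch only).
import os
import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    init_process_group(backend="nccl")  # reads the env vars set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(20, 1).to(local_rank), device_ids=[local_rank])
    # ... training loop ...
    destroy_process_group()

if __name__ == "__main__":
    main()
```

  • [How to run]: torchrun --standalone --nproc_per_node=2 YOUR_SCRIPT.py (the script name for this step is not listed above, so it is left generic here)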

Step 3: Multi-node parallel training with the torchrun utility

  • multinode.py: DDP on multiple nodes using Torchrun (and optionally Slurm)
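
With the c10d rendezvous backend, every node runs the same torchrun command; only the rendezvous endpoint (a host:port reachable from all nodes) must agree across nodes. The values below are illustrative:

  • [How to run, on each of 2 nodes]: torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=<host_of_first_node>:29500 multinode.py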

Step 4: Run SLURM jobs on HPG

  • slurm/multigpu_torchrun.py: training script for multi-GPU
  • slurm/datautils.py: dataset helper functions
  • slurm/launch_ddp_2N4G.sh: sample Slurm script to launch the training script with torchrun on 2 nodes with 2 GPUs on each node (see the sketch below)
  • slurm/launch_ddp_4N4G.sh: sample Slurm script to launch the training script with torchrun on 4 nodes with 1 GPU on each node
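
As a rough guide to what such a launch script contains, here is a hypothetical sketch of a 2-node, 2-GPUs-per-node job; the job name, time limit, and rendezvous port are assumptions, and the repo's slurm/launch_ddp_2N4G.sh may differ:

```bash
#!/bin/bash
#SBATCH --job-name=ddp_2N4G     # hypothetical sketch, not the repo's script
#SBATCH --partition=hpg-ai
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1     # one torchrun launcher per node
#SBATCH --gpus-per-node=2
#SBATCH --time=00:30:00

# Use the first allocated node as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    slurm/multigpu_torchrun.py
```

Submit with: sbatch slurm/launch_ddp_2N4G.sh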

Code walkthrough

To learn more, please follow the Distributed Data Parallel in PyTorch tutorial series by PyTorch.
