
Fundamentals of Accelerated Neural Network Training with Multi-GPUs on HiPerGator-AI

  • Yunchao Yang
  • AI support team
  • UF Research Computing

slides

GitHub repo

The code is adapted from the DDP tutorial series at https://pytorch.org/tutorials/beginner/ddp_series_intro.html

Each code file builds on the previous one: the series starts with a non-distributed script that runs on a single GPU and incrementally works up to multinode training on a Slurm cluster.

How to Use the Files

Step 0: Navigate to ood.rc.ufl.edu and request an Open OnDemand console terminal with 1 node and 2 GPUs:

  • Cluster partition = "hpg-ai"
  • Generic Resource Request = "gpu:2"

Step 1: Single-GPU code

  • single_gpu.py: Non-distributed training script on a single GPU
  • [How to run]: python single_gpu.py
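
For orientation, here is a minimal sketch of a non-distributed training loop; the toy model, dataset, and hyperparameters are illustrative placeholders, not the exact contents of single_gpu.py.

```python
# Minimal single-GPU training loop (illustrative sketch only; the repo's
# single_gpu.py follows the upstream PyTorch DDP tutorial series).
import torch
from torch.utils.data import DataLoader, TensorDataset

def main(epochs: int = 5):
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    # Toy dataset and model stand in for the tutorial's training objects.
    dataset = TensorDataset(torch.randn(2048, 20), torch.randn(2048, 1))
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    model = torch.nn.Linear(20, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        for source, targets in loader:
            source, targets = source.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(source), targets)
            loss.backward()
            optimizer.step()
        print(f"[epoch {epoch}] last-batch loss: {loss.item():.4f}")

if __name__ == "__main__":
    main()
```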

Step 2.1: Single-node parallel training with the mp.spawn utility
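
The key change from the single-GPU script is that mp.spawn launches one process per GPU, and each process initializes its own process group. A minimal sketch of the pattern (the address, port, and model are illustrative assumptions):

```python
# Single-node DDP launched with mp.spawn (illustrative sketch only).
import os
import torch
import torch.multiprocessing as mp
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank: int, world_size: int):
    # Each spawned process joins the process group and owns one GPU.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"  # any free port
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(20, 1).to(rank), device_ids=[rank])
    # ... training loop as in Step 1; DDP averages gradients across ranks ...
    destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # e.g. the 2 GPUs from Step 0
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

Note that in real training each rank should also wrap its DataLoader with a DistributedSampler, so the dataset is sharded rather than duplicated across processes.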

Step 2.2: Single-node parallel training with the torchrun utility
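
torchrun replaces mp.spawn: it launches the worker processes itself and exports RANK, LOCAL_RANK, and WORLD_SIZE, so the script no longer hard-codes the master address or passes the rank and world size by hand. A minimal sketch:

```python
# Single-node DDP launched with torchrun (illustrative sketch only).
import os
import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    init_process_group(backend="nccl")  # reads the env vars set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(20, 1).to(local_rank), device_ids=[local_rank])
    # ... training loop ...
    destroy_process_group()

if __name__ == "__main__":
    main()
```

  • [How to run]: torchrun --standalone --nproc_per_node=2 YOUR_SCRIPT.py (the script name for this step is not listed above, so it is left generic here)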

Step 3: Multi-node parallel training with the torchrun utility

  • multinode.py: DDP on multiple nodes using Torchrun (and optionally Slurm)
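
With the c10d rendezvous backend, every node runs the same torchrun command; only the rendezvous endpoint (a host:port reachable from all nodes) must agree across nodes. The values below are illustrative:

  • [How to run, on each of 2 nodes]: torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=<host_of_first_node>:29500 multinode.py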

Step 4: Run SLURM jobs on HPG

  • slurm/multigpu_torchrun.py: training script for multi-GPU
  • slurm/datautils.py: dataset helper functions
  • slurm/launch_ddp_2N4G.sh: sample Slurm script to launch the training script with torchrun on 2 nodes with 2 GPUs on each node (see the sketch below)
  • slurm/launch_ddp_4N4G.sh: sample Slurm script to launch the training script with torchrun on 4 nodes with 1 GPU on each node
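
As a rough guide to what such a launch script contains, here is a hypothetical sketch of a 2-node, 2-GPUs-per-node job; the job name, time limit, and rendezvous port are assumptions, and the repo's slurm/launch_ddp_2N4G.sh may differ:

```bash
#!/bin/bash
#SBATCH --job-name=ddp_2N4G     # hypothetical sketch, not the repo's script
#SBATCH --partition=hpg-ai
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1     # one torchrun launcher per node
#SBATCH --gpus-per-node=2
#SBATCH --time=00:30:00

# Use the first allocated node as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes=2 \
    --nproc_per_node=2 \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    slurm/multigpu_torchrun.py
```

Submit with: sbatch slurm/launch_ddp_2N4G.sh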

Code walkthrough

To learn more, please follow the Distributed Data Parallel in PyTorch tutorial series by PyTorch.
