Detectron2 multinode run with slurm #4376

unrue · 2022-07-04T06:59:35Z


Dear detectron users,

I'm using detectron2 to train a network on a custom dataset in COCO format. As long as the run stays on a single node, it works properly. But I don't understand how to configure and launch it on a multinode HPC cluster with the Slurm scheduler, in particular the dist_url argument of the launch call:

launch(main,
       num_machines = 2,
       num_gpus_per_machine = 4,
       dist_url = ???)

The "auto" option is not accepted when num_machines is greater than 1, and if dist_url is omitted, I get the following error:

if num_machines > 1 and dist_url.startswith("file://"):
AttributeError: 'NoneType' object has no attribute 'startswith'

I also tried passing the IP of one of the two allocated nodes and a random port number, which gives the following error:

RuntimeError: Address already in use

The following is a part of my code, in order to get the node ip:


if __name__ == "__main__":

  import os
  import hostlist
  import dns
  import dns.resolver

  # get SLURM variables
  rank = int(os.environ['SLURM_PROCID'])
  local_rank = int(os.environ['SLURM_LOCALID'])
  size = int(os.environ['SLURM_NTASKS'])
  cpus_per_task = int(os.environ['SLURM_CPUS_PER_TASK'])

  # get node list from slurm
  hostnames = hostlist.expand_hostlist(os.environ['SLURM_JOB_NODELIST'])

  # get IDs of reserved GPU
  gpu_ids = os.environ['SLURM_STEP_GPUS'].split(",")

  # define MASTER_ADDR and MASTER_PORT
  os.environ['MASTER_ADDR'] = hostnames[0]
  os.environ['MASTER_PORT'] = str(61000 + int(min(gpu_ids))) # to avoid port conflict on the same node

  print("MASTER ADDRESS: ", os.environ['MASTER_ADDR'])
  print("MASTER PORT: ", os.environ['MASTER_PORT'])

  suffix_name = ".m100.cineca.it"
  result = dns.resolver.resolve(hostnames[0]+suffix_name, 'A')

  master_ip = ''
  for ipval in result:
     master_ip = ipval.to_text()
     print('IP', master_ip)

  master_port = os.environ['MASTER_PORT']

  print('IP', master_ip)

  launch(main,
         num_machines = 2,
         num_gpus_per_machine = 4,
         machine_rank = 0,
         dist_url = 'tcp://' + master_ip + ':' + master_port)

And the following is my Slurm script:


#!/bin/bash

#SBATCH --time=00:10:00
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --exclusive
#SBATCH -N 2
#SBATCH --gres=gpu:4

srun python3.8 ./train_multinode.py

So, what is the right way to configure and launch detectron2 across multiple nodes under the Slurm scheduler? I think Slurm itself should pass the node addresses to detectron2 in some manner, as it does for an MPI application.

I'm also not sure about ntasks-per-node and cpus-per-task. Are they set properly? (The machine has 32 Power9 CPUs, 128 with SMT enabled.) It is not clear how many cores detectron2 uses during a training session. Thanks.
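
A possible starting point, offered as a sketch rather than a confirmed answer: launch() itself spawns num_gpus_per_machine worker processes on each machine, so one arrangement is to start a single Slurm task per node (for example --ntasks-per-node=1) and give each task its own machine_rank. With --ntasks-per-node=4 and machine_rank = 0 hardcoded, several processes would each try to act as the rank-0 machine and bind the same port, which could plausibly explain the "Address already in use" error. The helper below is hypothetical: its name, the default port base and the port formula are assumptions for illustration, not part of detectron2. It reads SLURM_NNODES, SLURM_NODEID and SLURM_JOB_ID, which Slurm sets for tasks started with srun, and it assumes master_host is a name or IP that every node can resolve.

```python
def slurm_launch_kwargs(env, master_host, gpus_per_node=4, base_port=29500):
    """Build keyword arguments for detectron2's launch() from Slurm variables.

    Assumes ONE Slurm task per node, because launch() starts
    num_gpus_per_machine worker processes on each node by itself.
    """
    num_machines = int(env["SLURM_NNODES"])  # number of nodes in the allocation
    machine_rank = int(env["SLURM_NODEID"])  # 0-based index of this node
    # Assumed port scheme: identical on every node of one job,
    # different between jobs that might share a node.
    port = base_port + int(env["SLURM_JOB_ID"]) % 1000
    return {
        "num_gpus_per_machine": gpus_per_node,
        "num_machines": num_machines,
        "machine_rank": machine_rank,
        "dist_url": f"tcp://{master_host}:{port}",
    }


# Usage inside the training script (not executed here):
#   import os, hostlist
#   from detectron2.engine import launch
#   hostnames = hostlist.expand_hostlist(os.environ["SLURM_JOB_NODELIST"])
#   launch(main, **slurm_launch_kwargs(os.environ, hostnames[0]))
```

With this layout every node computes the same dist_url but a different machine_rank, so only the first node's workers host the rendezvous. Under that assumption, cpus-per-task would cover all four workers on a node (plus their data-loader processes) rather than a single GPU process.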
