Dear detectron users,
I'm using detectron2 to train a network on a custom dataset in COCO format. As long as the run stays on a single node, it works properly. But I don't understand how to configure and launch it on a multi-node HPC cluster with the Slurm scheduler, in particular how to set the dist_url flag in the launch command:
The "auto" option is not accepted when num_machines is greater than 1, and if dist_url is missing, I get the following error:
I also tried passing the IP of one of the two allocated nodes and a random port number, which produced the following error:
RuntimeError: Address already in use
The following is part of my code, used to get the node IP:
And the following is my Slurm script:
So, what is the right way to configure and launch detectron2 across multiple nodes under the Slurm scheduler? I think Slurm itself should pass the addresses of the nodes to detectron2 in some manner, as it does for an MPI application.
I'm also not sure about ntasks-per-node and cpus-per-task. Are they set properly? (The machine has 32 Power9 CPUs, 128 with SMT enabled.) It is not clear how many cores detectron2 uses during a training session. Thanks.
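To make the question concrete, here is a minimal sketch of how the values that `detectron2.engine.launch` expects (`num_machines`, `machine_rank`, `dist_url`) might be derived from the environment variables Slurm sets for each task. The helper names `first_hostname` and `slurm_launch_args`, the `base_port` value, and the idea of deriving the port from the job ID are all assumptions made for illustration; this is not a confirmed or official recipe.

```python
import subprocess


def first_hostname(nodelist):
    """Expand a Slurm nodelist (e.g. "node[01-02]") and return its first host.

    Hypothetical helper: it shells out to `scontrol show hostnames`, so it
    only works on a machine where the Slurm client tools are installed.
    """
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()[0]


def slurm_launch_args(env, master_host, base_port=20000):
    """Build multi-node launch parameters from Slurm environment variables.

    Assumption: the port is derived from the job ID so that every node in
    the job computes the same value, and two jobs are unlikely to collide.
    """
    port = base_port + int(env["SLURM_JOB_ID"]) % 10000
    return {
        "num_machines": int(env["SLURM_JOB_NUM_NODES"]),
        "machine_rank": int(env["SLURM_NODEID"]),
        "dist_url": f"tcp://{master_host}:{port}",
    }


# Example with made-up values; inside a real job one would pass os.environ
# and first_hostname(os.environ["SLURM_JOB_NODELIST"]) instead.
example_env = {"SLURM_JOB_ID": "123456", "SLURM_JOB_NUM_NODES": "2", "SLURM_NODEID": "1"}
example_args = slurm_launch_args(example_env, "node01")
```

The resulting dictionary could then be passed on as `launch(main, num_gpus_per_machine, **example_args, args=(...))`. The sketch assumes the training script is started once per node (for example with one Slurm task per node via `srun`), since `launch` itself spawns one worker process per GPU; starting several tasks per node that all bind the same port may be one source of "Address already in use".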