torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #26

@mohammad21saif

$ sh ./single_run.sh >> "error.txt"

./single_run.sh: 4: source: not found
/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
        main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Traceback (most recent call last):
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
    main(args)
  File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
    dist.barrier()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3317586 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3317587 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3317584) of binary: /home/dse316/miniconda3/envs/grp_007/bin/python
Traceback (most recent call last):
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train_semi.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-19_17:20:25
  host      : pragyan
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3317585)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-19_17:20:25
  host      : pragyan
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3317584)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
  • Followed the steps described in issue #23 ("Can't Run At All")
  • Downloaded the gtFine dataset into ./data/cityscapes
  • Set the paths in augseg/exps/zrun_citys/citys_semi744/config_semi.yaml
  • Downloaded resnet50.pth into ./pretrained
  • Ran ./single_run.sh and got the error above. Please help 😢
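
The FutureWarning in the log recommends reading the local rank from `os.environ['LOCAL_RANK']` rather than from a `--local_rank` argument. A frequently reported cause of `ncclInvalidUsage` at the first `dist.barrier()` is two ranks ending up on the same GPU, for example when more processes are launched than there are visible GPUs; that this is what happened here is an assumption, not something the log confirms. A minimal sketch of both checks follows. The helper names `resolve_local_rank` and `check_rank_fits` are hypothetical and are not part of train_semi.py or PyTorch.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank, preferring LOCAL_RANK from the environment.

    Falls back to the legacy --local_rank flag, then to 0.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    # Legacy flag passed by torch.distributed.launch; used only as a fallback.
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores any other arguments the script accepts.
    known, _unused = parser.parse_known_args(argv)
    return int(environ.get("LOCAL_RANK", known.local_rank))


def check_rank_fits(local_rank, visible_gpus):
    """Raise early if this rank would not get a GPU of its own."""
    if not 0 <= local_rank < visible_gpus:
        raise RuntimeError(
            f"local_rank {local_rank} has no matching GPU "
            f"({visible_gpus} visible); launch fewer processes per node"
        )
    return local_rank
```

In a training script this could be used before the process group is created, for instance `torch.cuda.set_device(check_rank_fits(resolve_local_rank(), torch.cuda.device_count()))`, so that a mismatch between process count and GPU count fails with a readable message instead of an NCCL error.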
