$ sh ./single_run.sh >> "error.txt"
./single_run.sh: 4: source: not found
/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
main(args)
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
dist.barrier()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
Traceback (most recent call last):
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
main(args)
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
dist.barrier()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
main(args)
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
dist.barrier()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
Traceback (most recent call last):
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 591, in <module>
main(args)
File "/DATA2/dse316/grp_007/augseg/./train_semi.py", line 71, in main
dist.barrier()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3317586 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3317587 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3317584) of binary: /home/dse316/miniconda3/envs/grp_007/bin/python
Traceback (most recent call last):
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/dse316/miniconda3/envs/grp_007/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_semi.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-03-19_17:20:25
host : pragyan
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3317585)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-19_17:20:25
host : pragyan
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3317584)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
augseg/exps/zrun_citys/citys_semi744/config_semi.yaml

Got this error. Please help 😢
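
For reference, the FutureWarning in the log says a script that expects `--local_rank` should read `os.environ['LOCAL_RANK']` instead. Below is a minimal sketch of what that change might look like. The helper name `resolve_local_rank` is hypothetical and is not taken from `train_semi.py`; I am only assuming the script parses a `--local_rank` argument the way the warning describes.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank, preferring LOCAL_RANK from the environment.

    Hypothetical helper: torchrun exports LOCAL_RANK, while the older
    torch.distributed.launch passed --local_rank on the command line.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # Accept both spellings, since launchers have used each form.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    # parse_known_args ignores the script's other options (e.g. --config).
    args, _unknown = parser.parse_known_args(argv)

    # The environment variable wins when present; otherwise use the CLI value.
    return int(environ.get("LOCAL_RANK", args.local_rank))


if __name__ == "__main__":
    print(f"local rank: {resolve_local_rank()}")
```

The returned value would typically be passed to `torch.cuda.set_device(...)` before the first collective call such as `dist.barrier()`. Whether that is related to the NCCL errors above is not confirmed.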