First of all, apologies: I'm a systems admin/programmer, not an AI expert. I've been trying to help some of our researchers run nnUNet on a test system we have from AMD, and when run on a single GPU it reliably crashes almost immediately with errors like the following.
2025-02-20 14:22:53.111255: unpacking dataset...
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in resul
ts_loop
raise e
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 108, in resul
ts_loop
item = in_queue.get()
^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
fd = df.detach()
^^^^^^^^^^^
File "/usr/lib64/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/multiprocessing/connection.py", line 525, in Client
answer_challenge(c, authkey)
File "/usr/lib64/python3.12/multiprocessing/connection.py", line 953, in answer_challenge
message = connection.recv_bytes(256) # reject large message
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/lib64/python3.12/multiprocessing/connection.py", line 395, in _recv
chunk = read(handle, remaining)
^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
2025-02-20 14:22:59.812495: unpacking done...
2025-02-20 14:22:59.821651: Unable to plot network architecture: nnUNet_compile is enabled!
2025-02-20 14:22:59.832494:
2025-02-20 14:22:59.832671: Epoch 0
2025-02-20 14:22:59.832855: Current learning rate: 0.01
Traceback (most recent call last):
File "/data1/uccaoke/Virtualenvs/nnUNet/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
^^^^^^^^^^^^^^^^^^^^
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
item = self.__get_next_item()
^^^^^^^^^^^^^^^^^^^^^^
File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
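For what it's worth, the ConnectionResetError above comes out of torch's file-descriptor sharing: when the results_loop thread takes an item off the queue, unpickling it has to connect back to the worker process that sent it, and that fails if the worker has already died. The sketch below (hypothetical, not nnUNet code, assuming Linux with torch's default file_descriptor sharing strategy) exercises the same path in isolation:

import torch
import torch.multiprocessing as mp

def producer(q, done):
    # The tensor's storage is shared via a file descriptor, not copied.
    q.put(torch.randn(4, 64, 64))
    # Keep the worker alive until the parent has detached the fd; if the
    # worker exits first, the parent's q.get() raises ConnectionResetError,
    # which is exactly the symptom in the traceback above.
    done.wait()

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    done = ctx.Event()
    p = ctx.Process(target=producer, args=(q, done))
    p.start()
    item = q.get()  # rebuild_storage_fd runs here, in the parent
    done.set()
    p.join()
    print("received tensor of shape", tuple(item.shape))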
If nnUNet is run with 2 GPUs (-num_gpus 2), it runs fine.
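For illustration, the commands look roughly like this (the dataset ID, configuration, and fold are placeholders, not the researchers' actual values):

nnUNetv2_train 1 3d_fullres 0               # crashes as above on a single GPU
nnUNetv2_train 1 3d_fullres 0 -num_gpus 2   # runs fine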
On a hunch, I downgraded batchgenerators to 0.25 (the researchers reported this was not a problem on other systems they have access to, and I guessed they might have installed before the 0.25.1 release), and that also resolved the crash, implying that something that changed in the 0.25.1 release is causing this issue.
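So the workaround for now is simply pinning the older release in the virtualenv:

pip install batchgenerators==0.25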