Workers die when using only one GPU since version 0.25.1 #133

Open
owainkenwayucl opened this issue Feb 20, 2025 · 1 comment

Comments

owainkenwayucl commented Feb 20, 2025

First of all, apologies: I'm a systems admin/programmer, not an AI expert. I've been trying to help some of our researchers run nnUNet on a test system we have from AMD, and when it is run on a single GPU it reliably crashes almost immediately with errors like the following.
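
For reference, training is launched on a single GPU with a command along these lines (the dataset ID, configuration and fold here are placeholders, not the exact values our researchers use; -num_gpus defaults to 1 anyway):

    nnUNetv2_train DATASET_ID 3d_fullres FOLD -num_gpus 1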

2025-02-20 14:22:53.111255: unpacking dataset...
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):   
  File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in resul
ts_loop
    raise e
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 108, in resul
ts_loop
    item = in_queue.get()
           ^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 525, in Client
    answer_challenge(c, authkey)
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 953, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 395, in _recv
    chunk = read(handle, remaining)  
            ^^^^^^^^^^^^^^^^^^^^^^^  
ConnectionResetError: [Errno 104] Connection reset by peer
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):   
  File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
2025-02-20 14:22:59.812495: unpacking done...
2025-02-20 14:22:59.821651: Unable to plot network architecture: nnUNet_compile is enabled!
2025-02-20 14:22:59.832494:
2025-02-20 14:22:59.832671: Epoch 0  
2025-02-20 14:22:59.832855: Current learning rate: 0.01
Traceback (most recent call last):   
  File "/data1/uccaoke/Virtualenvs/nnUNet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())   
             ^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

If nnUNet is run with 2 GPUs (-num_gpus 2), it runs fine.
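
For comparison, the equivalent two-GPU run (same placeholder dataset ID, configuration and fold as above) completes without any worker crashes:

    nnUNetv2_train DATASET_ID 3d_fullres FOLD -num_gpus 2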

On a hunch, I downgraded batchgenerators to 0.25 (the researchers reported this was not a problem on other systems they have access to, and I guessed they might have installed it before the 0.25.1 release), and that also resolved the crash, implying that something that changed in the 0.25.1 release is causing this issue.
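
In case it helps anyone reproducing this, the downgrade was just a pip pin inside the same virtualenv (a sketch; the installed version can be confirmed with pip show):

    pip install batchgenerators==0.25
    pip show batchgenerators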

owainkenwayucl commented Feb 20, 2025

System details, in case they matter:

Operating System: AlmaLinux 9.5 (effectively Red Hat Enterprise Linux 9.5)
Python: 3.12 (from system packages)
nnUNetv2: 2.5.2 from PyPI

I don't think this is relevant, but just in case:
ROCm: 6.3.2
PyTorch: 2.6 for ROCm from pytorch.org
