Workers die when using only one GPU since version 0.25.1 #133

Open
owainkenwayucl opened this issue Feb 20, 2025 · 1 comment

Comments

owainkenwayucl commented Feb 20, 2025

First of all, apologies: I'm a systems admin/programmer, not an AI expert. I've been trying to help some of our researchers run nnUNet on a test system we have from AMD, and when it is run on a single GPU it reliably crashes almost immediately with errors like the following.
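
For reference, training is launched on a single GPU with a command along these lines (the dataset ID, configuration and fold here are placeholders, not the exact values our researchers use; -num_gpus defaults to 1 anyway):

    nnUNetv2_train DATASET_ID 3d_fullres FOLD -num_gpus 1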

2025-02-20 14:22:53.111255: unpacking dataset...
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):   
  File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in resul
ts_loop
    raise e
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 108, in resul
ts_loop
    item = in_queue.get()
           ^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
         ^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 525, in Client
    answer_challenge(c, authkey)
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 953, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/multiprocessing/connection.py", line 395, in _recv
    chunk = read(handle, remaining)  
            ^^^^^^^^^^^^^^^^^^^^^^^  
ConnectionResetError: [Errno 104] Connection reset by peer
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):   
  File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
2025-02-20 14:22:59.812495: unpacking done...
2025-02-20 14:22:59.821651: Unable to plot network architecture: nnUNet_compile is enabled!
2025-02-20 14:22:59.832494:
2025-02-20 14:22:59.832671: Epoch 0  
2025-02-20 14:22:59.832855: Current learning rate: 0.01
Traceback (most recent call last):   
  File "/data1/uccaoke/Virtualenvs/nnUNet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())   
             ^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1370, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/uccaoke/Virtualenvs/nnUNet/lib64/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

If nnUNet is run with 2 GPUs (-num_gpus 2), it runs fine.
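
For comparison, the equivalent two-GPU run (same placeholder dataset ID, configuration and fold as above) completes without any worker crashes:

    nnUNetv2_train DATASET_ID 3d_fullres FOLD -num_gpus 2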

On a hunch, I downgraded batchgenerators to 0.25 (the researchers reported this was not a problem on other systems they have access to, and I guessed they might have installed it before the 0.25.1 release), and that also resolved the crash, implying that something that changed in the 0.25.1 release is causing this issue.
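
In case it helps anyone reproducing this, the downgrade was just a pip pin inside the same virtualenv (a sketch; the installed version can be confirmed with pip show):

    pip install batchgenerators==0.25
    pip show batchgenerators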

owainkenwayucl commented Feb 20, 2025

System details, in case they matter:

Operating System: AlmaLinux 9.5 (effectively Red Hat Enterprise Linux 9.5)
Python: 3.12 (from system packages)
nnUNetv2: 2.5.2 from PyPI

I don't think this is relevant, but just in case:
ROCm: 6.3.2
PyTorch: 2.6 for ROCm from pytorch.org
