[BUG] DataLoader worker exited unexpectedly after fresh installation and default example #149
Open
Labels
Installation (Issue with installation), bug (Something isn't working)
Description
Describe the bug
I downloaded OpenFold, installed it with uv sync, and downloaded the model weights. When I run the default example, it fails with an error saying the DataLoader worker exited unexpectedly. I want to use it on an HPC cluster; I'm currently using an NVIDIA A30.
Apart from the cluster, the same thing happened with a local installation on my machine.
To Reproduce
Run command on cluster
srun --partition=paula --time=01:00:00 --gres=gpu:1 --mem=20G \
run_openfold predict \
--query-json examples/example_inference_inputs/query_ubiquitin.json \
--inference-ckpt-path weights/of3-p2-155k.pt
Run command locally
run_openfold predict \
--query_json=examples/example_inference_inputs/query_ubiquitin.json \
--inference-ckpt-path weights/of3-p2-155k.pt
Expected behavior
The default output structure for the example.
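A possible isolation step (not tried in this report): since the crash is a segmentation fault inside a DataLoader worker subprocess, running with num_workers=0 keeps data loading in the main process and would show whether the fault is tied to worker forking. This is a generic PyTorch sketch with a stand-in dataset, not OpenFold's actual pipeline:

```python
# Sketch only: ToyDataset is a stand-in, NOT OpenFold's dataset.
# num_workers=0 disables worker subprocesses entirely, so a crash
# that still occurs here cannot be a fork/worker issue.
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return torch.tensor([idx], dtype=torch.float32)


# All loading happens in the main process with num_workers=0.
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=0)
batches = list(loader)
print(len(batches))  # 4 items / batch_size 2 -> 2 batches
```

If OpenFold exposes its DataLoader worker count through a config, forcing it to 0 there would be the equivalent experiment.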
Stack trace
Stack trace cluster
(openfold3) (base) ↑1 (login02.sc.uni-leipzig.de) mb97hape-software/openfold-3 git:(main) ▶ srun --partition=paula --time=01:00:00 --gres=gpu:1 --mem=20G run_openfold predict --query-json examples/example_inference_inputs/query_ubiquitin.json --inference-ckpt-path cache/weights/of3-p2-155k.pt
srun: job 21044665 queued and waiting for resources
srun: job 21044665 has been allocated resources
WARNING:openfold3.entry_points.experiment_runner:No version_tensor is found for this checkpoint.Assuming the user knows checkpoints are parameters are compatible, continuing...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
💡 Tip: For seamless cloud logging and experiment tracking, try installing [litlogger](https://pypi.org/project/litlogger/) to enable LitLogger, which logs metrics and artifacts automatically to the Lightning Experiments platform.
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Using output directory: /tmp/of3_colabfold_msas for ColabFold MSAs.
Submitting 1 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:10 remaining: 00:00]
/work2/mb97hape-software/openfold-3/openfold3/core/data/tools/colabfold_msa_server.py:335: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
tar_gz.extractall(path)
No complexes found for paired MSA generation. Skipping...
/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=48523) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
Preprocessing templates: 100%|██████████| 1/1 [00:18<00:00, 18.13s/it]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py:424: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
self.check_worker_number_rationality()
/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/utilities/_pytree.py:21: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1310, in _try_get_data
data = self._data_queue.get(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
^^^^^^^^^^^^^^^^^^^
File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
^^^^^^^^^^^^^^^^^^^
File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/connection.py", line 440, in _poll
r = wait([self], timeout)
^^^^^^^^^^^^^^^^^^^^^
File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/connection.py", line 1136, in wait
ready = selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 49238) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/work2/mb97hape-software/openfold-3/.venv/bin/run_openfold", line 10, in <module>
sys.exit(cli())
^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1485, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1406, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1269, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 824, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/openfold3/run_openfold.py", line 195, in predict
expt_runner.run(query_set)
File "/work2/mb97hape-software/openfold-3/openfold3/entry_points/experiment_runner.py", line 696, in run
self.trainer.predict(
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 941, in predict
return call._call_and_handle_interrupt(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path, weights_only=weights_only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
results = self._run_stage()
^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1118, in _run_stage
return self.predict_loop.run()
^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
return loop_run(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/prediction_loop.py", line 122, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
batch = super().__next__()
^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
batch = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
out = next(self.iterators[0])
^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 741, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1524, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1483, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1323, in _try_get_data
raise RuntimeError(
RuntimeError: DataLoader worker (pid(s) 49238) exited unexpectedly
Predicting: | | 0/? [00:07<?, ?it/s]srun: error: paula08: task 0: Exited with exit code 1
Stack trace local
(openfold3) (base) Ξ software/openfold-3 git:(main) ▶ run_openfold predict --query_json=examples/example_inference_inputs/query_ubiquitin.json --inference-ckpt-path weights/of3-p2-155k.pt
WARNING:openfold3.entry_points.experiment_runner:No version_tensor is found for this checkpoint.Assuming the user knows checkpoints are parameters are compatible, continuing...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
💡 Tip: For seamless cloud logging and experiment tracking, try installing [litlogger](https://pypi.org/project/litlogger/) to enable LitLogger, which logs metrics and artifacts automatically to the Lightning Experiments platform.
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Using output directory: /tmp/of3_colabfold_msas for ColabFold MSAs.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Mapping file /tmp/of3_colabfold_msas/mappings/seq_to_rep_id.json already exists. Appending new sequences.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Mapping file /tmp/of3_colabfold_msas/mappings/rep_id_to_seq.json already exists. Appending new sequences.
Submitting 1 sequences to the Colabfold MSA server for main MSAs...
No complexes found for paired MSA generation. Skipping...
/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/popen_fork.py:67: DeprecationWarning: This process (pid=2591200) is multi-threaded, use of fork() may lead to deadlocks in the child.
self.pid = os.fork()
Preprocessing templates: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 181.53it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting: | | 0/? [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/queues.py", line 111, in get
if not self._poll(timeout):
~~~~~~~~~~^^^^^^^^^
File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
~~~~~~~~~~^^^^^^^^^
File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/connection.py", line 440, in _poll
r = wait([self], timeout)
File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/connection.py", line 1148, in wait
ready = selector.select(timeout)
File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/selectors.py", line 398, in select
fd_event_list = self._selector.poll(timeout)
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
_error_if_any_worker_fails()
~~~~~~~~~~~~~~~~~~~~~~~~~~^^
RuntimeError: DataLoader worker (pid 2591499) is killed by signal: Segmentation fault.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/media/data/software/openfold-3/.venv/bin/run_openfold", line 10, in <module>
sys.exit(cli())
~~~^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1485, in __call__
return self.main(*args, **kwargs)
~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1406, in main
rv = self.invoke(ctx)
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1873, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1269, in invoke
return ctx.invoke(self.callback, **ctx.params)
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 824, in invoke
return callback(*args, **kwargs)
File "/media/data/software/openfold-3/openfold3/run_openfold.py", line 195, in predict
expt_runner.run(query_set)
~~~~~~~~~~~~~~~^^^^^^^^^^^
File "/media/data/software/openfold-3/openfold3/entry_points/experiment_runner.py", line 696, in run
self.trainer.predict(
~~~~~~~~~~~~~~~~~~~~^
model=self.lightning_module,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
datamodule=self.lightning_data_module,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
return_predictions=False,
^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 941, in predict
return call._call_and_handle_interrupt(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self,
^^^^^
...<6 lines>...
weights_only,
^^^^^^^^^^^^^
)
^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _predict_impl
results = self._run(model, ckpt_path=ckpt_path, weights_only=weights_only)
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
results = self._run_stage()
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 1118, in _run_stage
return self.predict_loop.run()
~~~~~~~~~~~~~~~~~~~~~^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
return loop_run(self, *args, **kwargs)
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/prediction_loop.py", line 122, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
~~~~^^^^^^^^^^^^^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
batch = super().__next__()
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
batch = next(self.iterator)
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
out = next(self._iterator)
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
out = next(self.iterators[0])
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
data = self._next_data()
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
idx, data = self._get_data()
~~~~~~~~~~~~~~^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1444, in _get_data
success, data = self._try_get_data()
~~~~~~~~~~~~~~~~~~^^
File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1288, in _try_get_data
raise RuntimeError(
f"DataLoader worker (pid(s) {pids_str}) exited unexpectedly"
) from e
RuntimeError: DataLoader worker (pid(s) 25914
Configuration (please complete the following information):
Installation
uv sync
Set up env variables and load modules
module load CUDA/12.8.0
export TRITON_CACHE_DIR=/work2/mb97hape-software/openfold-3/cache/triton
export OPENFOLD_CACHE=/work2/mb97hape-software/openfold-3/cache
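For context on the warning in the cluster trace ("suggested max number of worker in current system is 2" while 10 workers are created): my assumption is that PyTorch derives that suggestion from the CPUs actually visible to the process, which under SLURM is capped by the job's cgroup rather than the node's full core count. A stdlib-only check:

```python
# Sketch (assumption): compare the CPUs this process is allowed to
# use against the DataLoader worker count. On SLURM, the cgroup
# allocation (e.g. 2 CPUs) is typically what the process sees, even
# on a many-core node.
import os

if hasattr(os, "sched_getaffinity"):  # Linux
    visible_cpus = len(os.sched_getaffinity(0))
else:  # fallback for other platforms
    visible_cpus = os.cpu_count()
print(visible_cpus)
```

If that prints 2 inside the srun allocation, 10 workers heavily oversubscribe the CPUs, which could explain slowness or freezes, though not necessarily the segfault itself.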
OS
(openfold3) (base) Ξ (login02.sc.uni-leipzig.de) mb97hape-software/openfold-3 git:(main) ▶ head /etc/os-release
NAME="Rocky Linux"
VERSION="9.7 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.7"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.7 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"