[BUG] DataLoader worker exited unexpectedly after fresh installation and default example #149

@Bloeci

Description

Describe the bug
I downloaded OpenFold, installed it with `uv sync`, and downloaded the model weights. When I run the default example, I get an error saying the DataLoader worker exited unexpectedly. I want to use it on an HPC cluster; I'm currently using an NVIDIA A30.

The same thing happens with a plain local installation on my machine, not just on the cluster.

To Reproduce
Run command on cluster

srun --partition=paula --time=01:00:00 --gres=gpu:1 --mem=20G \
    run_openfold predict \
        --query-json examples/example_inference_inputs/query_ubiquitin.json \
        --inference-ckpt-path weights/of3-p2-155k.pt

Run command locally

run_openfold predict \
    --query_json=examples/example_inference_inputs/query_ubiquitin.json \
    --inference-ckpt-path weights/of3-p2-155k.pt

Expected behavior
The default output structure for the example.

Stack trace
Stack trace cluster

(openfold3) (base) ↑1 (login02.sc.uni-leipzig.de) mb97hape-software/openfold-3 git:(main) ▶ srun --partition=paula --time=01:00:00 --gres=gpu:1 --mem=20G run_openfold predict --query-json examples/example_inference_inputs/query_ubiquitin.json --inference-ckpt-path cache/weights/of3-p2-155k.pt
srun: job 21044665 queued and waiting for resources
srun: job 21044665 has been allocated resources
WARNING:openfold3.entry_points.experiment_runner:No version_tensor is found for this checkpoint.Assuming the user knows checkpoints are parameters are compatible, continuing...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
💡 Tip: For seamless cloud logging and experiment tracking, try installing [litlogger](https://pypi.org/project/litlogger/) to enable LitLogger, which logs metrics and artifacts automatically to the Lightning Experiments platform.
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Using output directory: /tmp/of3_colabfold_msas for ColabFold MSAs.
Submitting 1 sequences to the Colabfold MSA server for main MSAs...
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:10 remaining: 00:00]
/work2/mb97hape-software/openfold-3/openfold3/core/data/tools/colabfold_msa_server.py:335: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
  tar_gz.extractall(path)
No complexes found for paired MSA generation. Skipping...
/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=48523) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
Preprocessing templates: 100%|██████████| 1/1 [00:18<00:00, 18.13s/it]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py:424: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  self.check_worker_number_rationality()
/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/utilities/_pytree.py:21: `isinstance(treespec, LeafSpec)` is deprecated, use `isinstance(treespec, TreeSpec) and treespec.is_leaf()` instead.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1310, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
           ^^^^^^^^^^^^^^^^^^^
  File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/connection.py", line 440, in _poll
    r = wait([self], timeout)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/connection.py", line 1136, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sc.uni-leipzig.de/mb97hape/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 49238) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/work2/mb97hape-software/openfold-3/.venv/bin/run_openfold", line 10, in <module>
    sys.exit(cli())
             ^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/openfold3/run_openfold.py", line 195, in predict
    expt_runner.run(query_set)
  File "/work2/mb97hape-software/openfold-3/openfold3/entry_points/experiment_runner.py", line 696, in run
    self.trainer.predict(
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 941, in predict
    return call._call_and_handle_interrupt(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path, weights_only=weights_only)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1118, in _run_stage
    return self.predict_loop.run()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/prediction_loop.py", line 122, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
                                       ^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
    batch = super().__next__()
            ^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
    batch = next(self.iterator)
            ^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
          ^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 741, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1524, in _next_data
    idx, data = self._get_data()
                ^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1483, in _get_data
    success, data = self._try_get_data()
                    ^^^^^^^^^^^^^^^^^^^^
  File "/work2/mb97hape-software/openfold-3/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1323, in _try_get_data
    raise RuntimeError(
RuntimeError: DataLoader worker (pid(s) 49238) exited unexpectedly
Predicting: |          | 0/? [00:07<?, ?it/s]srun: error: paula08: task 0: Exited with exit code 1

Stack trace local

(openfold3) (base) Ξ software/openfold-3 git:(main) ▶ run_openfold predict --query_json=examples/example_inference_inputs/query_ubiquitin.json --inference-ckpt-path weights/of3-p2-155k.pt 
WARNING:openfold3.entry_points.experiment_runner:No version_tensor is found for this checkpoint.Assuming the user knows checkpoints are parameters are compatible, continuing...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
💡 Tip: For seamless cloud logging and experiment tracking, try installing [litlogger](https://pypi.org/project/litlogger/) to enable LitLogger, which logs metrics and artifacts automatically to the Lightning Experiments platform.
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Using output directory: /tmp/of3_colabfold_msas for ColabFold MSAs.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Mapping file /tmp/of3_colabfold_msas/mappings/seq_to_rep_id.json already exists. Appending new sequences.
WARNING:openfold3.core.data.tools.colabfold_msa_server:Mapping file /tmp/of3_colabfold_msas/mappings/rep_id_to_seq.json already exists. Appending new sequences.
Submitting 1 sequences to the Colabfold MSA server for main MSAs...
No complexes found for paired MSA generation. Skipping...
/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/popen_fork.py:67: DeprecationWarning: This process (pid=2591200) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
Preprocessing templates: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 181.53it/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting: |                                                                                                             | 0/? [00:00<?, ?it/s]ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/queues.py", line 111, in get
    if not self._poll(timeout):
           ~~~~~~~~~~^^^^^^^^^
  File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
           ~~~~~~~~~~^^^^^^^^^
  File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/connection.py", line 440, in _poll
    r = wait([self], timeout)
  File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/multiprocessing/connection.py", line 1148, in wait
    ready = selector.select(timeout)
  File "/home/iwe34/.local/share/uv/python/cpython-3.13.6-linux-x86_64-gnu/lib/python3.13/selectors.py", line 398, in select
    fd_event_list = self._selector.poll(timeout)
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/_utils/signal_handling.py", line 73, in handler
    _error_if_any_worker_fails()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
RuntimeError: DataLoader worker (pid 2591499) is killed by signal: Segmentation fault. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/data/software/openfold-3/.venv/bin/run_openfold", line 10, in <module>
    sys.exit(cli())
             ~~~^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
  File "/media/data/software/openfold-3/openfold3/run_openfold.py", line 195, in predict
    expt_runner.run(query_set)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/media/data/software/openfold-3/openfold3/entry_points/experiment_runner.py", line 696, in run
    self.trainer.predict(
    ~~~~~~~~~~~~~~~~~~~~^
        model=self.lightning_module,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        datamodule=self.lightning_data_module,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        return_predictions=False,
        ^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 941, in predict
    return call._call_and_handle_interrupt(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self,
        ^^^^^
    ...<6 lines>...
        weights_only,
        ^^^^^^^^^^^^^
    )
    ^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/call.py", line 49, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path, weights_only=weights_only)
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 1079, in _run
    results = self._run_stage()
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/trainer/trainer.py", line 1118, in _run_stage
    return self.predict_loop.run()
           ~~~~~~~~~~~~~~~~~~~~~^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/utilities.py", line 179, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/prediction_loop.py", line 122, in run
    batch, batch_idx, dataloader_idx = next(data_fetcher)
                                       ~~~~^^^^^^^^^^^^^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/fetchers.py", line 134, in __next__
    batch = super().__next__()
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/loops/fetchers.py", line 61, in __next__
    batch = next(self.iterator)
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/utilities/combined_loader.py", line 341, in __next__
    out = next(self._iterator)
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/pytorch_lightning/utilities/combined_loader.py", line 142, in __next__
    out = next(self.iterators[0])
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
    idx, data = self._get_data()
                ~~~~~~~~~~~~~~^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1444, in _get_data
    success, data = self._try_get_data()
                    ~~~~~~~~~~~~~~~~~~^^
  File "/media/data/software/openfold-3/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 1288, in _try_get_data
    raise RuntimeError(
        f"DataLoader worker (pid(s) {pids_str}) exited unexpectedly"
    ) from e
RuntimeError: DataLoader worker (pid(s) 25914
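
One thing I noticed in both traces: the DeprecationWarning about os.fork() in a multi-threaded process suggests the DataLoader workers are started with the fork start method. This is just my guess at where to look, but a quick stdlib check (no OpenFold involved) confirms what the default is on my setup:

```python
import multiprocessing as mp

# On Linux, Python 3.12/3.13 still default to the "fork" start method,
# which is unsafe in an already multi-threaded parent process (hence the
# DeprecationWarning in the traces above) and a known source of worker
# crashes with CUDA in the parent.
print(mp.get_start_method())
```

If there is a supported way to force spawned workers or to run inference with num_workers=0, I'm happy to test it.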

Configuration (please complete the following information):
Installation

uv sync

Set up env variables and load modules

module load CUDA/12.8.0
export TRITON_CACHE_DIR=/work2/mb97hape-software/openfold-3/cache/triton
export OPENFOLD_CACHE=/work2/mb97hape-software/openfold-3/cache

OS

(openfold3) (base) Ξ (login02.sc.uni-leipzig.de) mb97hape-software/openfold-3 git:(main) ▶ head /etc/os-release 
NAME="Rocky Linux"
VERSION="9.7 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.7"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.7 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"

Labels: Installation (Issue with installation), bug (Something isn't working)