Add support to `saving.py` for loading GPU-trained models on CPU-only machines #19024

base: master
Conversation
@amorehead Have you tried reporting this on PyTorch? You would expect that…
@carmocca, great point. I'll open an issue for PyTorch as well, linked to this one for Lightning. However, since it may take a while for PyTorch to fix the issue on their end, I think this PR for Lightning should still be useful in the meantime, in case other users run into the same issue I am facing.
Yes, we can merge this, but I would like to hear from their team first before moving forward. Then we could have this:

```python
if not _TORCH_GREATER_EQUAL_2_2:
    # your patch
    ...
```
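For context, Lightning defines version guards like this centrally. A minimal sketch of how a flag such as `_TORCH_GREATER_EQUAL_2_2` is typically built, assuming the `compare_version` helper from `lightning_utilities` that the existing guards use (an illustration, not the exact source):

```python
import operator

from lightning_utilities.core.imports import compare_version

# True when the installed torch is at least 2.2.0; the patch above would
# then only run on older versions that still exhibit the bug.
_TORCH_GREATER_EQUAL_2_2 = compare_version("torch", operator.ge, "2.2.0")
```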
@amorehead Great find. If you still have it, could you provide the full stack trace of the error?
@awaelchli, yes, the stack trace is as follows.

```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1552, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/lightning/pytorch/core/saving.py", line 97, in _load_from_checkpoint
    return model.to(device)
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 54, in to
    return super().to(*args, **kwargs)
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/torchmetrics/metric.py", line 808, in _apply
    self._device = fn(torch.zeros(1, device=self.device)).device
  File "/home/acmwhb/mambaforge/envs/GCPNet/lib/python3.10/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```

This is triggered by calling:

```python
my_lightning_model.__class__.load_from_checkpoint(
    checkpoint_path=ckpt_path,
    map_location="cpu",
    strict=True,
)
```

The issue happens with both Lightning 2.1.0 and 2.1.2 (note the `map_location="cpu"` argument above).
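Worth noting from the trace: the checkpoint itself deserializes fine, since `map_location` is honored by `torch.load` and remaps every tensor storage to CPU. The crash comes from the subsequent `model.to(device)` in `saving.py`, where a pickled torchmetrics object re-creates a tensor on its remembered CUDA device. A minimal sketch of that distinction (the checkpoint path is a placeholder):

```python
import torch

# Deserialization succeeds on a CPU-only build: map_location remaps all
# tensor storages in the checkpoint to CPU at load time.
ckpt = torch.load("model.ckpt", map_location="cpu")  # placeholder path

# The failure happens later, when .to() walks the child modules and a
# pickled metric allocates a tensor on its remembered CUDA device.
```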
Given the stack trace, we see that it goes through torchmetrics and fails at this line:

```python
self._device = fn(torch.zeros(1, device=self.device)).device
```

This seems to be a torchmetrics issue.
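For readers reproducing this: the pickled metric restores its remembered `self.device` as a CUDA device, and constructing any tensor on a CUDA device in a CPU-only build trips CUDA's lazy initialization. A minimal standalone reproduction of the same error:

```python
import torch

# On a CPU-only build of PyTorch, this raises:
#   AssertionError: Torch not compiled with CUDA enabled
t = torch.zeros(1, device="cuda")
```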
@amorehead Did you actually end up with the entire torchmetrics object pickled in the checkpoint, as described by this user in Lightning-AI/torchmetrics#2223, or was it a proper checkpoint with the state dict of the metric? The former would indeed explain your issue, but then the fix should be to not pickle the metric in the first place.
You have described it perfectly. The checkpoints I am trying to load on a CPU-only machine unintentionally contain full TorchMetrics objects. This is clearly not best practice. Are you aware of any workarounds, given that the metrics are fully pickled into my checkpoint files, or is the only solution to save just the metric `state_dict`s in the first place?
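For anyone in the same situation, one workaround that may help until a fix lands (a sketch, not an official recipe: `MyLightningModule` is a stand-in for your own class, and it assumes your weights live under the usual `state_dict` key):

```python
import torch

# Deserialize with all tensor storages remapped to CPU; this succeeds even
# though the checkpoint contains pickled metric objects.
ckpt = torch.load("model.ckpt", map_location="cpu")  # placeholder path

# Build the model from scratch so its metrics start on CPU, then restore
# only the weights, bypassing the pickled metric objects entirely.
model = MyLightningModule()  # hypothetical: replace with your LightningModule
model.load_state_dict(ckpt["state_dict"], strict=False)
model.eval()
```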
Hi, I was having the same issue, and this commit fixed it for me! I would be very happy if this gets merged.
Seems reasonable. Could you please add a test that trains a model on GPU and then mocks `CUDA_VISIBLE_DEVICES` to load it with CPU only...
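A rough sketch of what such a test could look like, using Lightning's `BoringModel` test helper; here, monkeypatching `torch.cuda.is_available` stands in for the `CUDA_VISIBLE_DEVICES` mock suggested above, and this is an illustration rather than the PR's actual test:

```python
import torch
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel


def test_load_gpu_checkpoint_on_cpu_only(tmp_path, monkeypatch):
    # Train briefly on GPU and save a checkpoint containing CUDA tensors.
    model = BoringModel()
    trainer = Trainer(accelerator="gpu", devices=1, max_steps=1, default_root_dir=tmp_path)
    trainer.fit(model)
    ckpt_path = tmp_path / "model.ckpt"
    trainer.save_checkpoint(ckpt_path)

    # Simulate a CPU-only machine for the load path.
    monkeypatch.setattr(torch.cuda, "is_available", lambda: False)
    loaded = BoringModel.load_from_checkpoint(ckpt_path, map_location="cpu")
    assert all(p.device.type == "cpu" for p in loaded.parameters())
```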
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master   #19024      +/-   ##
==========================================
- Coverage      83%      47%       -36%
==========================================
  Files         445      437         -8
  Lines       37289    37140       -149
==========================================
- Hits        31119    17586     -13533
- Misses       6170    19554     +13384
```
What does this PR do?

Adds support to `saving.py` for loading GPU-trained models on CPU-only machines. Without this change, the `.to()` call in the context of CPU-only inference may lead to `AssertionError: Torch not compiled with CUDA enabled`.
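To make the description concrete, here is a minimal sketch of the kind of guard being described, written as a hypothetical helper (it illustrates the idea; it is not the PR's actual diff to `_load_from_checkpoint`):

```python
import torch


def _move_to_device(model: torch.nn.Module, device: torch.device) -> torch.nn.Module:
    # Hypothetical helper: if the requested device is CUDA but the current
    # install cannot use it, fall back to CPU instead of raising
    # "Torch not compiled with CUDA enabled".
    if device.type == "cuda" and not torch.cuda.is_available():
        device = torch.device("cpu")
    return model.to(device)
```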
PR review

Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines.
📚 Documentation preview 📚: https://pytorch-lightning--19024.org.readthedocs.build/en/19024/