I just noticed that resuming from a checkpoint with the current code assumes that the process is running on the same GPU (number) as before. E.g., if the process started on GPU #0, after resume it will assume it is still running on GPU #0. However, if you work on a cluster and your resume command goes through the queuing system, the process may well run on GPU #1, #2, #3, etc. after resume.
Fix: Provide the GPU number as an additional argument to the resume function.
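To illustrate the proposed fix, here is a minimal sketch in plain Python. It is not the project's actual code: `save_checkpoint`, `resume`, the `gpu_id` parameter, and the pickle-based checkpoint layout are all hypothetical names chosen for illustration. The idea is only that the GPU index stored in the checkpoint is used as a fallback, and an explicitly passed index takes precedence.

```python
import pickle


def save_checkpoint(path, state, gpu_id):
    """Store the training state together with the GPU index it ran on.

    Hypothetical helper: the real checkpoint format may differ.
    """
    with open(path, "wb") as f:
        pickle.dump({"state": state, "gpu_id": gpu_id}, f)


def resume(path, gpu_id=None):
    """Load a checkpoint and decide which GPU index to continue on.

    If gpu_id is given, it overrides the index saved in the checkpoint,
    so a job rescheduled by the queuing system can use whatever GPU it
    was assigned. If gpu_id is None, the saved index is kept.
    """
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    if gpu_id is not None:
        checkpoint["gpu_id"] = gpu_id
    return checkpoint
```

For example, a run saved on GPU #0 could be resumed with `resume("ckpt.pkl", gpu_id=2)` when the scheduler assigns GPU #2. If the code happens to be built on PyTorch (an assumption, the issue does not say), the same effect is usually achieved by passing `map_location` to `torch.load`.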