checkpointing: make code aware of changed GPU device on resume #71

mwibral · 2021-07-16T13:27:05Z

I just noticed that resuming from a checkpoint with the current code assumes that the process is running on the same GPU (number) as before. E.g., if the process started on GPU #0, after resume it will assume it is still running on GPU #0. However, if you work on a cluster and your resume command goes through the queuing system, the process may well run on GPU #1, #2, #3, etc. after resume.

Fix: Provide the GPU number as an additional argument to the resume function.
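
A rough sketch of what the fix could look like, not the project's actual code. The names `save_checkpoint`, `resume_from_checkpoint`, the `gpu_id` settings key, and the JSON checkpoint format are all hypothetical and only serve to illustrate the idea: the GPU number stored in the checkpoint is used unless the caller passes a new one on resume.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the analysis state (hypothetical layout) to a JSON checkpoint."""
    with open(path, "w") as f:
        json.dump(state, f)


def resume_from_checkpoint(path, gpu_id=None):
    """Load a checkpoint and optionally override the stored GPU number.

    gpu_id=None keeps the GPU number saved in the checkpoint; any other
    value replaces it, e.g. when the queuing system assigned a different
    device for the resumed job.
    """
    with open(path) as f:
        state = json.load(f)
    if gpu_id is not None:
        # Hypothetical settings key; the real code may store this elsewhere.
        state["settings"]["gpu_id"] = gpu_id
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    # The original run started on GPU #0.
    save_checkpoint(ckpt, {"settings": {"gpu_id": 0}, "step": 42})
    # The resumed job was scheduled on GPU #1.
    resumed = resume_from_checkpoint(ckpt, gpu_id=1)
    print(resumed["settings"]["gpu_id"], resumed["step"])
```

Making the argument optional would keep existing resume calls working unchanged on single-GPU machines.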
