
Incompatibility in model checkpoints between multi-GPU/single-GPU #1087

Open
jeswan opened this issue Sep 17, 2020 · 4 comments
Labels
bug Something isn't working high-priority Fix this before addressing any other major issue.

Comments

@jeswan
Contributor

jeswan commented Sep 17, 2020

Issue by zphang
Friday May 15, 2020 at 06:16 GMT
Originally opened as nyu-mll/jiant#1087


Summary: If you train a model with multi-GPU and then try to load it with single-GPU, the weights will not be loaded successfully. The run prints warnings but does not fail.

Background:

  • When a PyTorch module is wrapped in DataParallel, the original module will be stored as multi_gpu_model.module.
  • This means that calling state_dict on multi_gpu_model will add a "module." prefix to every key in the state_dict.
  • In other words, state_dicts are not compatible between single-GPU and multi-GPU models (see the sketch below).
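
A minimal sketch of the key prefix (runs on CPU; the layer and shapes are arbitrary):

from torch import nn

model = nn.Linear(4, 2)
wrapped = nn.DataParallel(model)

print(list(model.state_dict().keys()))    # ['weight', 'bias']
print(list(wrapped.state_dict().keys()))  # ['module.weight', 'module.bias']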

What this affects:

  • Two-phase training runs that are launched as two separate commands (via load_target_train_checkpoint) will incorrectly load the model weights, reverting to the pretrained weights, if one phase uses multi-GPU and the other uses single-GPU.
  • E.g. if intermediate task training is on multi-GPU and target task training is on single-GPU, the target task run will fail to load the intermediate-trained weights and revert to pretrained weights. The logs will be filled with "parameter missed" warnings, but the run will not fail (see the sketch below).
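
A sketch of why the run only warns instead of failing, assuming the loading path passes strict=False (which the "parameter missed" warnings suggest); the variable names are illustrative, not jiant's actual code:

from torch import nn

single_gpu_model = nn.Linear(4, 2)
multi_gpu_model = nn.DataParallel(nn.Linear(4, 2))

ckpt = multi_gpu_model.state_dict()  # keys: 'module.weight', 'module.bias'
result = single_gpu_model.load_state_dict(ckpt, strict=False)
print(result.missing_keys)     # ['weight', 'bias'] -- nothing was actually loaded
print(result.unexpected_keys)  # ['module.weight', 'module.bias']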

Proposed solution:

  • Don't call state_dict on a DataParallel model. Write a function that determines whether the model is wrapped in DataParallel, and call model.state_dict or model.module.state_dict accordingly:
import torch
from torch import nn

def get_state_dict_for_saving(model: nn.Module) -> "dict[str, torch.Tensor]":
    # Unwrap DataParallel so the saved keys do not carry the "module." prefix.
    if isinstance(model, nn.DataParallel):
        return model.module.state_dict()
    else:
        return model.state_dict()
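
A hypothetical call site at save time (checkpoint_path is illustrative, not jiant's actual save path):

torch.save(get_state_dict_for_saving(model), checkpoint_path)
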
@jeswan jeswan added bug Something isn't working high-priority Fix this before addressing any other major issue. labels Sep 17, 2020
@jeswan
Contributor Author

jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Friday May 15, 2020 at 15:50 GMT


Eek—does anyone have bandwidth to put together a fix?

@jeswan
Contributor Author

jeswan commented Sep 17, 2020

Comment by pruksmhc
Friday May 15, 2020 at 19:24 GMT


I can take a look at this this weekend

@jeswan
Contributor Author

jeswan commented Sep 17, 2020

Comment by pruksmhc
Sunday May 17, 2020 at 16:32 GMT


Update: It's actually not quite as simple as applying the above function when saving, since fixing saving alone will not work if we want to restart a job on multi-GPU: the checkpoint will then always hold the keys of model.state_dict(), while the DataParallel-wrapped model expects keys with the "module." prefix when loading.
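
A sketch of the restart problem described above (shapes are arbitrary): a checkpoint saved without the "module." prefix cannot be loaded directly into the DataParallel wrapper, which uses strict=True by default.

from torch import nn

ckpt = nn.Linear(4, 2).state_dict()         # plain keys: 'weight', 'bias'
wrapped = nn.DataParallel(nn.Linear(4, 2))  # expects 'module.weight', 'module.bias'

try:
    wrapped.load_state_dict(ckpt)  # strict=True by default
except RuntimeError as err:
    print(err)  # missing 'module.*' keys, unexpected 'weight'/'bias'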

@jeswan
Contributor Author

jeswan commented Sep 17, 2020

Comment by zphang
Monday May 18, 2020 at 04:45 GMT


Could you instead call model.module.load_state_dict?
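
A sketch of what this suggestion could look like as a loading counterpart to get_state_dict_for_saving (the function name here is hypothetical, not jiant's API):

from torch import nn

def load_state_dict_into_model(model: nn.Module, state_dict) -> None:
    # Load into the underlying module whether or not the model is wrapped,
    # so checkpoints saved without the "module." prefix work in both modes.
    if isinstance(model, nn.DataParallel):
        model.module.load_state_dict(state_dict)
    else:
        model.load_state_dict(state_dict)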
