
Incompatibility in model checkpoints between multi-GPU/single-GPU #1087

Closed
zphang opened this issue May 15, 2020 · 5 comments
Assignees: pruksmhc
Labels: bug (Something isn't working), high-priority (Fix this before addressing any other major issue), jiant-v1-legacy (Relevant to versions <= v1.3.2)

Comments

@zphang
Collaborator

zphang commented May 15, 2020

Summary: If you train a model with multi-GPU and then try to load the checkpoint with single-GPU, the weights will not be loaded successfully. The run prints warnings but does not fail.

Background:

  • When a PyTorch module is wrapped in DataParallel, the original module will be stored as multi_gpu_model.module.
  • This means that calling state_dict on multi_gpu_model will add a "module." prefix to every key in the state_dict.
  • In other words, state_dicts are not compatible between single-GPU and multi-GPU models (see the short illustration below).
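
To make the prefix concrete, here is a minimal standalone sketch (not jiant code):

import torch.nn as nn

plain = nn.Linear(4, 2)
wrapped = nn.DataParallel(nn.Linear(4, 2))

# DataParallel stores the original module under the attribute "module",
# so its state_dict keys gain a "module." prefix.
print(list(plain.state_dict().keys()))    # ['weight', 'bias']
print(list(wrapped.state_dict().keys()))  # ['module.weight', 'module.bias']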

What this affects:

  • Two-phase training runs launched as two separate commands (via load_target_train_checkpoint) will incorrectly load the model weights, silently reverting to the pretrained weights, if one phase uses multi-GPU and the other uses single-GPU.
  • E.g. if intermediate-task training runs on multi-GPU and target-task training runs on single-GPU, the target-task run will fail to load the intermediate-trained weights and fall back to the pretrained weights. The logs fill with "parameter missed" warnings, but the run itself does not fail (see the sketch below).
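
A minimal sketch (again standalone, not jiant's actual checkpoint-loading code) of why this surfaces only as warnings: if the checkpoint is loaded non-strictly, as the "parameter missed" warnings suggest, mismatched keys are skipped instead of raising an error.

import torch.nn as nn

single_gpu_model = nn.Linear(4, 2)
multi_gpu_ckpt = nn.DataParallel(nn.Linear(4, 2)).state_dict()  # keys carry the "module." prefix

result = single_gpu_model.load_state_dict(multi_gpu_ckpt, strict=False)
print(result.missing_keys)     # ['weight', 'bias'] -> these parameters keep their existing values
print(result.unexpected_keys)  # ['module.weight', 'module.bias']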

Proposed solution:

  • Don't call state_dict directly on a DataParallel model. Write a function that checks whether the model is wrapped in DataParallel and calls model.module.state_dict or model.state_dict accordingly:

import torch.nn as nn

def get_state_dict_for_saving(model: nn.Module) -> dict:
    # Always save the unwrapped module's state_dict so checkpoint keys
    # never carry the DataParallel "module." prefix.
    if isinstance(model, nn.DataParallel):
        return model.module.state_dict()
    return model.state_dict()
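
A hypothetical call site (checkpoint_path is just a placeholder) would then be:

torch.save(get_state_dict_for_saving(model), checkpoint_path)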
@zphang added the high-priority and bug labels May 15, 2020
@zphang changed the title from "Inconsistency in state_dict from saving model checkpoints with multi-GPU/single-GPU" to "Incompatibility in model checkpoints between multi-GPU/single-GPU" May 15, 2020
@sleepinyourhat
Contributor

Eek—does anyone have bandwidth to put together a fix?

@pruksmhc
Contributor

I can take a look at this over the weekend.

@pruksmhc pruksmhc self-assigned this May 16, 2020
@pruksmhc
Contributor

Update: It's actually not quite as simple as using the function above when saving, because that breaks restarting a job with multi-GPU: the saved checkpoint would then always contain the unwrapped model.state_dict() keys, while the DataParallel-wrapped model expects keys with the "module." prefix when loading.

@zphang
Collaborator Author

zphang commented May 18, 2020

Could you instead call model.module.load_state_dict?
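
A minimal sketch of that suggestion (the helper name is hypothetical, mirroring get_state_dict_for_saving above): checkpoints are always saved without the prefix, and loading is routed through the inner module when the live model is wrapped.

import torch.nn as nn

def load_state_dict_for_restoring(model: nn.Module, state_dict) -> None:
    # Checkpoints saved by get_state_dict_for_saving have no "module." prefix,
    # so for a DataParallel model we load into the wrapped module directly.
    if isinstance(model, nn.DataParallel):
        model.module.load_state_dict(state_dict)
    else:
        model.load_state_dict(state_dict)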

@zphang
Collaborator Author

zphang commented Oct 16, 2020

This is an automatically generated comment.

As we update jiant to v2.x, jiant v1.x has been migrated to https://github.com/nyu-mll/jiant-v1-legacy. As such, we are closing all issues relating to jiant v1.x in this repository.

If this issue is still affecting you in jiant v1.x, please follow up at nyu-mll/jiant-v1-legacy#1087.

If this issue is still affecting you in jiant v2.x, reopen this issue or create a new one.

@zphang zphang closed this as completed Oct 16, 2020