Reshape ZeroStage=0 FP16 Checkpoint

What is the best way to reshape a checkpoint trained with ZeRO stage 0 and fp16?

I see two options:
a) Continue training with ZeRO stage 1 for one step and adapt [this PR](https://github.com/microsoft/DeepSpeed/pull/1953) to work with fp16
b) Adapt [the script here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/239/files#diff-ecfdbe3a107133adc51e3f7c5d73ef006d6b06bd18bfc16ecb4708834e6fdc97) so that it no longer needs ZeRO checkpoints; the difficult part will be reshaping the optimizer states in the `mp_rank` files
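
For b), the core operation would presumably be a merge-then-split of each partitioned optimizer state across the old and new tensor-parallel degrees. Below is a minimal sketch of that idea only. It uses plain nested lists so it runs without any dependencies; real `mp_rank` files hold `torch` tensors, where `torch.cat` and `torch.chunk` would do the same job. The function names (`merge_shards`, `split_shards`, `reshape_state`) are hypothetical, and which states are partitioned, and along which dimension, is an assumption that has to be checked against the model's parallel layers.

```python
from typing import List

Matrix = List[List[float]]


def merge_shards(shards: List[Matrix], dim: int) -> Matrix:
    """Concatenate per-rank 2-D shards along dim (0 = rows, 1 = columns)."""
    if dim == 0:
        return [row[:] for shard in shards for row in shard]
    # dim == 1: glue the matching rows of every shard side by side
    num_rows = len(shards[0])
    return [sum((shard[i] for shard in shards), []) for i in range(num_rows)]


def split_shards(full: Matrix, num_ranks: int, dim: int) -> List[Matrix]:
    """Split a full 2-D state into num_ranks equal shards along dim."""
    size = len(full) if dim == 0 else len(full[0])
    assert size % num_ranks == 0, "state is not evenly divisible"
    step = size // num_ranks
    if dim == 0:
        return [[row[:] for row in full[r * step:(r + 1) * step]]
                for r in range(num_ranks)]
    return [[row[r * step:(r + 1) * step] for row in full]
            for r in range(num_ranks)]


def reshape_state(shards: List[Matrix], new_tp: int, dim: int) -> List[Matrix]:
    """Reshape one optimizer state (e.g. a momentum buffer) to a new TP degree."""
    return split_shards(merge_shards(shards, dim), new_tp, dim)


# Hypothetical example: a state stored across 2 ranks, reshaped to 4 ranks.
old_shards = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
new_shards = reshape_state(old_shards, new_tp=4, dim=0)
```

The same merge and split would have to be applied with the same `dim` to the parameter and to every optimizer state tied to it, so that each new shard of a state lines up with the matching shard of its parameter.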


Maybe @tjruwase could give me a quick hint on whether a) or b) makes more sense before I waste my time? Thanks!

Reshape ZeroStage=0 FP16 Checkpoint #2031