---
title: "Converting a Hugging Face checkpoint to Universal Checkpointing format"
tags: checkpointing, training, deepspeed, huggingface
---

## Introduction to Universal Checkpointing

Universal Checkpointing in DeepSpeed abstracts away the complexity of saving and loading model states, optimizer states, and training scheduler states. The feature is designed to work out of the box with minimal configuration, and it supports a wide range of model sizes and types, from small-scale models to large distributed models trained with different parallelism topologies across multiple GPUs and other accelerators.

See more: https://www.deepspeed.ai/tutorials/universal-checkpointing/

## Converting a pretrained Hugging Face checkpoint to Universal Checkpointing format

### Step 1: Download a pretrained Hugging Face checkpoint

Download a pretrained Hugging Face checkpoint from the Hugging Face Hub using [snapshot_download](https://huggingface.co/docs/huggingface_hub/en/guides/download).

A Hugging Face checkpoint consists of one or more files in the `pytorch_model.bin` or `safetensors` format.
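
For example, a minimal download sketch using the `huggingface-cli` command, a CLI equivalent of `snapshot_download` that the linked guide also covers; the repo id `gpt2` and the target directory are illustrative placeholders:

```bash
# Fetch all files of a pretrained checkpoint from the Hugging Face Hub.
# "gpt2" and the local directory below are illustrative placeholders.
huggingface-cli download gpt2 --local-dir /path/to/huggingface/checkpoint
```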

### Step 2: Convert Hugging Face checkpoint to Universal Checkpointing format

To convert a Hugging Face checkpoint to Universal Checkpointing format, you can use the `hf_to_universal.py` script provided in the DeepSpeed repository. The script takes a Hugging Face checkpoint of any model and converts it to Universal Checkpointing format.

```bash
python deepspeed/checkpoint/hf_to_universal.py --hf_checkpoint_dir /path/to/huggingface/checkpoint --save_dir /path/to/universal/checkpoint
```

This script processes the Hugging Face checkpoint and generates a new checkpoint in the Universal Checkpointing format. Note that `hf_to_universal.py` supports both the `.safetensors` and `pytorch_model.bin` checkpoint formats; use the `--safe_serialization` flag to convert from the `.safetensors` format, as shown below.
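
For instance, converting a checkpoint stored as `.safetensors` files might look like this, reusing the placeholder paths from above:

```bash
# Convert a .safetensors checkpoint to Universal Checkpointing format.
# Paths are the same illustrative placeholders used earlier.
python deepspeed/checkpoint/hf_to_universal.py \
    --hf_checkpoint_dir /path/to/huggingface/checkpoint \
    --save_dir /path/to/universal/checkpoint \
    --safe_serialization
```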

See `hf_to_universal.py` for more flags and options.

### Step 3: Resume Training with Universal Checkpoint

With the Universal Checkpoint ready, you can now resume training, potentially with a different parallelism topology or training configuration. To do this, pass the `--universal-checkpoint` flag to your training run alongside your DeepSpeed config JSON file, as sketched below.
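
A minimal sketch of what such a resume might look like in a Megatron-DeepSpeed-style setup; `pretrain_gpt.py`, `ds_config.json`, and the paths are placeholders, and the model, data, and parallelism arguments a real run requires are omitted:

```bash
# Hypothetical resume command; script name, config file, and paths are
# placeholders. Required model/data/parallelism arguments are omitted.
deepspeed pretrain_gpt.py \
    --deepspeed \
    --deepspeed_config ds_config.json \
    --load /path/to/universal/checkpoint \
    --universal-checkpoint
```
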
See [Megatron-DeepSpeed examples](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing) for more details on how to use Universal Checkpointing.

## Conclusion

DeepSpeed Universal Checkpointing simplifies the management of model states, making it easier to save, load, and transfer them across training sessions and parallelism techniques. By converting a Hugging Face checkpoint to Universal Checkpointing format, you can load the pretrained weights of any model on the Hugging Face Hub and resume training with DeepSpeed under any parallelism topology.

For more detailed examples and advanced configurations, please refer to the [Megatron-DeepSpeed examples](https://github.com/deepspeedai/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing).
