Commit 8ed846a: update MindSpore Transformers LLM Course
# MindSpore Transformers Pretraining and Fine-Tuning Practice

## Environment and Directories

* Recommended working directory: `/home/mindspore/work/demo`
* Python ≥ 3.9 recommended, with MindSpore + MindFormers installed
* Optional: configure a domestic mirror

```bash
mkdir -p /home/mindspore/work/demo && cd /home/mindspore/work/demo
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com  # comment out or enable as needed
```
## Preparing the Weights, Datasets, and MindSpore Transformers

Qwen3-0.6B: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)

Qwen3-1.7B: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)

Pretraining dataset: [wikitext-2-v1](https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/wikitext/wikitext-2-v1.zip)

> See the [community issue](https://gitee.com/mindspore/mindformers/issues/IBV35D) for how to obtain the pretraining dataset.

Fine-tuning dataset: [llm-wizard/alpaca-gpt4-data-zh](https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh)

```bash
# Download the weights
mkdir -p /home/mindspore/work/demo
cd /home/mindspore/work/demo
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen3-0.6B --local-dir Qwen3-0.6B
huggingface-cli download --resume-download Qwen/Qwen3-1.7B --local-dir Qwen3-1.7B

# Download the pretraining dataset
wget https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/wikitext/wikitext-2-v1.zip
unzip wikitext-2-v1.zip

# Download the fine-tuning dataset
huggingface-cli download --repo-type dataset --resume-download llm-wizard/alpaca-gpt4-data-zh --local-dir alpaca-gpt4-data-zh

# Download MindSpore Transformers
git clone https://gitee.com/mindspore/mindformers.git
cd mindformers
git checkout 5a12973fb38bfd5b504240334492f4fb7ff7f7a6
pip install -r requirements.txt

# Upgrade MindSpore
pip install https://repo.mindspore.cn/mindspore/mindspore/version/202509/20250917/master_20250917220006_52c46b3bfd9e9d50b2334d764afc80a6d7b56e90_newest/unified/aarch64/mindspore-2.7.1-cp39-cp39-linux_aarch64.whl
```
> All required weights and datasets should be mounted under `/home/mindspore/work/demo`.

The final file layout should look like this (note that `wikitext-2-v1.zip` unpacks into a `wikitext-2/` directory, which is the path the preprocessing commands below use):

```plaintext
/home/mindspore/work/demo
├── Qwen3-0.6B/
├── Qwen3-1.7B/
├── wikitext-2/
├── alpaca-gpt4-data-zh/
└── mindformers/
```
## Data Preprocessing

### wiki => json

Script source: [community issue #ICOKGY](https://gitee.com/mindspore/mindformers/issues/ICOKGY)

Copy the script to `/home/mindspore/work/demo/mindformers/gen_wiki_json.py`, then run:

```bash
cd /home/mindspore/work/demo/mindformers
python gen_wiki_json.py --input /home/mindspore/work/demo/wikitext-2/wiki.train.tokens --output /home/mindspore/work/demo/wikitext-2/wiki.jsonl
```
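A quick way to sanity-check the conversion is to peek at the first record. Assuming the script emits one JSON object per line with a `text` field (the layout the Megatron preprocessor below expects), the output should look roughly like this:

```bash
# Inspect the first converted record; one JSON object per line is expected.
head -n 1 /home/mindspore/work/demo/wikitext-2/wiki.jsonl
# Expected shape (assumed): {"text": " = Valkyria Chronicles III = ..."}
```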
### json => megatron

First take the first 1000 records:

```bash
head -n 1000 /home/mindspore/work/demo/wikitext-2/wiki.jsonl > /home/mindspore/work/demo/wikitext-2/wiki.less.jsonl
```

```bash
# cd /home/mindspore/work/demo/mindformers
mkdir -p /home/mindspore/work/demo/megatron_data
python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /home/mindspore/work/demo/wikitext-2/wiki.less.jsonl \
  --output-prefix /home/mindspore/work/demo/megatron_data/wikitext-2-v1-qwen3_text_document \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-dir /home/mindspore/work/demo/Qwen3-0.6B/ \
  --workers 64
```
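Megatron-style indexed datasets are stored as a `.bin`/`.idx` file pair (the same pair the command reference at the end of this commit downloads directly). A quick listing confirms that preprocessing produced both; the exact file names are an assumption based on the `--output-prefix` above:

```bash
ls -lh /home/mindspore/work/demo/megatron_data/
# Expected (assuming the tool writes <output-prefix>.bin and <output-prefix>.idx):
#   wikitext-2-v1-qwen3_text_document.bin   # token data
#   wikitext-2-v1-qwen3_text_document.idx   # index into the .bin file
```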
## Preparing the YAML for Qwen3-0.6B

The MindSpore Transformers repository provides a pretraining [YAML configuration file](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/pretrain_qwen3_32b_4k.yaml) for Qwen3-32B, which we can use as a template for a Qwen3-0.6B configuration.

First, copy the 32B configuration and rename it for 0.6B:

```bash
cp configs/qwen3/pretrain_qwen3_32b_4k.yaml configs/qwen3/pretrain_qwen3_0_6b_4k.yaml
```

Then make the following changes in `configs/qwen3/pretrain_qwen3_0_6b_4k.yaml`:

```yaml
# Model configuration
model:
  model_config:
    # Configurations from Hugging Face
    ...
    hidden_size: 1024
    intermediate_size: 3072
    num_hidden_layers: 28
    num_attention_heads: 16
    num_key_value_heads: 8
    head_dim: 128
    ...
    # Configurations from MindFormers
    offset: 0
```
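These architecture values mirror the model's Hugging Face `config.json`, so you can read them straight out of the downloaded weights directory rather than copying them by hand:

```bash
# Print the architecture hyperparameters from the Hugging Face config.
grep -E '"(hidden_size|intermediate_size|num_hidden_layers|num_attention_heads|num_key_value_heads|head_dim)"' \
  /home/mindspore/work/demo/Qwen3-0.6B/config.json
```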
## Pretraining

### Modifying the Configuration to Launch a Pretraining Task

Set the number of training epochs:

```yaml
# runner config
runner_config:
  epochs: 1
```
Modify the dataset configuration:

```yaml
# dataset
train_dataset: &train_dataset
  data_loader:
    ...
    sizes:
      - 400  # number of training samples
      - 0    # number of test samples (not currently configurable)
      - 0    # number of evaluation samples (not currently configurable)
    config:  # GPTDataset options
      ...
      data_path:  # Megatron dataset sampling ratio and path
        - '1'
        - "/home/mindspore/work/demo/megatron_data/wikitext-2-v1-qwen3_text_document"
```
Add the following parameters to enable TensorBoard monitoring:

```yaml
monitor_config:
  monitor_on: True
  dump_path: './dump'
  target: ['layers.0', 'layers.1']  # monitor only the parameters of the first two layers
  invert: False
  step_interval: 1
  local_loss_format: ['tensorboard']
  device_local_loss_format: ['tensorboard']
  local_norm_format: ['tensorboard']
  device_local_norm_format: ['tensorboard']
  optimizer_state_format: null
  weight_state_format: null
  throughput_baseline: null
  print_struct: False
  check_for_global_norm: False
  global_norm_spike_threshold: 1.0
  global_norm_spike_count_threshold: 10

tensorboard:
  tensorboard_dir: 'worker/tensorboard'
  tensorboard_queue_size: 10
  log_loss_scale_to_tensorboard: True
  log_timers_to_tensorboard: True
```
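Once training is running, the logged metrics can be viewed with the standard TensorBoard CLI, pointed at the `tensorboard_dir` configured above:

```bash
# Serve the training curves on port 6006 (any free port works).
tensorboard --logdir worker/tensorboard --port 6006
```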
Add the following configuration to enable checkpoint de-redundancy (saving weights without the copies duplicated across parallel ranks):

```yaml
callbacks:
  - type: MFLossMonitor       # Prints training progress information
  - type: CheckpointMonitor   # Saves model weights during training
    ...
    save_checkpoint_steps: 50 # Interval in steps for saving model weights
    keep_checkpoint_max: 2    # Maximum number of saved model weight files
    ...
    remove_redundancy: True
```
Modify the following configuration to change the parallelism settings:

```yaml
parallel_config:
  data_parallel: &dp 2
  model_parallel: 2
  pipeline_stage: 2
  micro_batch_num: &micro_batch_num 2
```

Since the required device count is the product `data_parallel × model_parallel × pipeline_stage`, this configuration expects 2 × 2 × 2 = 8 devices.
### Launching the Pretraining Task

```bash
rm -rf output/checkpoint
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/qwen3/pretrain_qwen3_0_6b_4k.yaml \
 --auto_trans_ckpt False \
 --use_parallel True \
 --run_mode train \
 --recompute_config.recompute False"
```
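`msrun` writes per-rank logs under `output/msrun_log` (the directory backed up in the resume section below). Tailing the rank-0 log is a convenient way to watch training progress; the `worker_0.log` file name is an assumption, so check the directory listing if yours differs:

```bash
# Follow the rank-0 log and surface loss lines (log file name assumed).
tail -f output/msrun_log/worker_0.log | grep -i loss
```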
### Resuming a Pretraining Task from a Checkpoint

Modify the following configuration to enable resumable training:

```yaml
resume_training: True
```

Back up the logs of the previous run, so you can check later whether resuming succeeded:

```bash
mv output/msrun_log output/msrun_log_bak
```

```bash
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/qwen3/pretrain_qwen3_0_6b_4k.yaml \
 --auto_trans_ckpt False \
 --use_parallel True \
 --run_mode train \
 --load_checkpoint output/checkpoint_50_step \
 --recompute_config.recompute False"
```
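If resuming worked, the new log should pick up from the restored global step rather than starting at step 1; comparing against the backed-up log makes this easy to verify (the `worker_0.log` name is assumed, as above):

```bash
# The first training steps in the new log should continue from the checkpoint.
grep -i -m 5 "step" output/msrun_log/worker_0.log
grep -i -m 5 "step" output/msrun_log_bak/worker_0.log
```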
## Full-Parameter Fine-Tuning

### Modifying the Configuration to Launch a Full-Parameter Fine-Tuning Task

The MindSpore Transformers repository provides a fine-tuning [YAML configuration file](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/finetune_qwen3.yaml) shared across the Qwen3 model sizes; the concrete model is selected via `--pretrained_model_dir /path/to/Qwen3-0.6B/`. We can use this file as the basis for full-parameter fine-tuning of Qwen3-0.6B.

Make the following changes in `configs/qwen3/finetune_qwen3.yaml`.

Modify these options to set the dataset size:

```yaml
# Dataset configuration
train_dataset: &train_dataset
  ...
  data_loader:
    type: HFDataLoader
    path: "llm-wizard/alpaca-gpt4-data-zh"
    ...
    # dataset process arguments
    handler:
      - type: take
        n: 1000
```
Modify the following configuration to change the parallelism settings (pure data parallelism across 8 devices):

```yaml
parallel_config:
  data_parallel: &dp 8
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: &micro_batch_num 1
  ...
  use_seq_parallel: False
```
### Launching the Full-Parameter Fine-Tuning Task

Launch full-parameter fine-tuning of Qwen3-0.6B:

```bash
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/qwen3/finetune_qwen3.yaml \
 --auto_trans_ckpt True \
 --use_parallel True \
 --run_mode train \
 --pretrained_model_dir /home/mindspore/work/demo/Qwen3-0.6B/ \
 --recompute_config.recompute False"
```
## LoRA Fine-Tuning

The same fine-tuning [YAML configuration file](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/finetune_qwen3.yaml) serves the different Qwen3 model sizes, with the concrete model selected via `--pretrained_model_dir /path/to/Qwen3-1.7B/`. We can use it as the basis for a LoRA fine-tuning configuration for Qwen3-1.7B.

First, copy the fine-tuning configuration and rename it for LoRA:

```bash
cp configs/qwen3/finetune_qwen3.yaml configs/qwen3/finetune_qwen3_lora.yaml
```
### Modifying the Configuration to Launch a LoRA Fine-Tuning Task

Add the following parameters to enable LoRA fine-tuning (`freeze_include: ['*']` freezes all weights, while `freeze_exclude: ['*lora*']` keeps the injected LoRA parameters trainable):

```yaml
# Model configuration
model:
  model_config:
    # Configurations from Hugging Face
    ...
    pet_config:
      pet_type: lora
      lora_rank: 8
      lora_alpha: 16
      lora_dropout: 0.1
      lora_a_init: 'normal'
      lora_b_init: 'zeros'
      target_modules: '.*word_embeddings|.*linear_qkv|.*linear_proj|.*linear_fc1|.*linear_fc2'
      freeze_include: ['*']
      freeze_exclude: ['*lora*']
```
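For reference, LoRA leaves each targeted weight matrix $W$ frozen and learns a low-rank update, so the effective weight becomes

$$W' = W + \frac{\alpha}{r} B A, \qquad A \in \mathbb{R}^{r \times k},\; B \in \mathbb{R}^{d \times r},$$

with $r$ = `lora_rank` = 8 and $\alpha$ = `lora_alpha` = 16 from the configuration above. Initializing $A$ from a normal distribution (`lora_a_init`) and $B$ to zeros (`lora_b_init`) makes the initial update zero, so training starts from the base model's behavior.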
Modify the following configuration to change the parallelism settings:

```yaml
parallel_config:
  data_parallel: &dp 8
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: &micro_batch_num 1
```
### Launching the LoRA Fine-Tuning Task

```bash
bash scripts/msrun_launcher.sh "run_mindformer.py \
 --config configs/qwen3/finetune_qwen3_lora.yaml \
 --auto_trans_ckpt True \
 --use_parallel True \
 --run_mode train \
 --pretrained_model_dir /home/mindspore/work/demo/Qwen3-1.7B/ \
 --recompute_config.recompute False"
```
---
# Command Reference

## Fetching Hugging Face Weights
```bash
git clone https://hf-mirror.com/Qwen/Qwen3-0.6B
```

## Fetching the Dataset
```bash
mkdir qwen-datasets
cd qwen-datasets

wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/mmap_qwen3_datasets_text_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/mmap_qwen3_datasets_text_document.idx
```

## Running a Task
```bash
bash scripts/msrun_launcher.sh "run_mindformer.py --config xx.yaml"
```

## Merging Weights
```bash
python toolkit/safetensors/unified_safetensors.py \
  --src_strategy_dirs ./output/strategy \
  --mindspore_ckpt_dir ./output/checkpoint \
  --output_dir ./unified_trained_qwen3_60_sf \
  --file_suffix "60_1" \
  --has_redundancy True \
  --filter_out_param_prefix "adam_" \
  --max_process_num 16
```

## Reverse-Converting Qwen3-0.6B Weights to Hugging Face Format
```bash
python toolkit/weight_convert/qwen3/reverse_mcore_qwen3_weight_to_hf.py \
  --mindspore_ckpt_path ./unified_trained_qwen3_60_sf/60_1_ckpt_convert/unified_safe \
  --huggingface_ckpt_path ./hf_sf \
  --num_layers 28 \
  --num_attention_heads 16 \
  --num_query_groups 8 \
  --kv_channels 128 \
  --ffn_hidden_size 3072 \
  --dtype 'bf16'
```
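After the reverse conversion, the target directory should contain Hugging Face-style safetensors shards. A listing is a cheap sanity check; whether the script also emits `config.json` and tokenizer files is an assumption, so copy them over from the original `Qwen3-0.6B/` directory if they are missing:

```bash
ls -lh ./hf_sf
# Expect safetensors shard(s); copy config/tokenizer files from Qwen3-0.6B/ if absent.
```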
