# MindSpore Transformers Pretraining and Fine-Tuning in Practice

## Environment and Directory Layout

* Suggested working directory: `/home/mindspore/work/demo`
* Python ≥ 3.9 is recommended, with MindSpore and MindFormers installed
* Optional: configure a region-local mirror

```bash
mkdir -p /home/mindspore/work/demo && cd /home/mindspore/work/demo
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com  # comment out or keep as needed
```
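
Before downloading anything, it can help to sanity-check that MindSpore imports cleanly in the current environment (a minimal check; the printed version depends on your install):

```bash
# Verify the MindSpore installation and report its version
python -c "import mindspore; print(mindspore.__version__)"
```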

## Preparing the Weights, Datasets, and MindSpore Transformers

Qwen3-0.6B: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
Qwen3-1.7B: [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)

Pretraining dataset: [wikitext-2-v1](https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/wikitext/wikitext-2-v1.zip)

> See the [community issue](https://gitee.com/mindspore/mindformers/issues/IBV35D) for how to obtain the pretraining dataset.

Fine-tuning dataset: [llm-wizard/alpaca-gpt4-data-zh](https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh)

```bash
# Download the weights
mkdir -p /home/mindspore/work/demo
cd /home/mindspore/work/demo
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen3-0.6B --local-dir Qwen3-0.6B
huggingface-cli download --resume-download Qwen/Qwen3-1.7B --local-dir Qwen3-1.7B

# Download the pretraining dataset
wget https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/wikitext/wikitext-2-v1.zip
unzip wikitext-2-v1.zip

# Download the fine-tuning dataset
huggingface-cli download --repo-type dataset --resume-download llm-wizard/alpaca-gpt4-data-zh --local-dir alpaca-gpt4-data-zh

# Download MindSpore Transformers
git clone https://gitee.com/mindspore/mindformers.git
cd mindformers
git checkout 5a12973fb38bfd5b504240334492f4fb7ff7f7a6
pip install -r requirements.txt

# Upgrade MindSpore
pip install https://repo.mindspore.cn/mindspore/mindspore/version/202509/20250917/master_20250917220006_52c46b3bfd9e9d50b2334d764afc80a6d7b56e90_newest/unified/aarch64/mindspore-2.7.1-cp39-cp39-linux_aarch64.whl
```

> All required weights and datasets should be mounted under `/home/mindspore/work/demo`.

The final file layout should be as follows (note that `wikitext-2-v1.zip` unpacks into a `wikitext-2/` directory, which is the path the preprocessing commands below use):

```plaintext
/home/mindspore/work/demo
├── Qwen3-0.6B/
├── Qwen3-1.7B/
├── wikitext-2/
├── alpaca-gpt4-data-zh/
└── mindformers/
```

## Data Preprocessing

### wiki => json

Script source: [community issue #ICOKGY](https://gitee.com/mindspore/mindformers/issues/ICOKGY)

Copy the script to `/home/mindspore/work/demo/mindformers/gen_wiki_json.py`, then run:

```bash
cd /home/mindspore/work/demo/mindformers
python gen_wiki_json.py --input /home/mindspore/work/demo/wikitext-2/wiki.train.tokens --output /home/mindspore/work/demo/wikitext-2/wiki.jsonl
```
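
Since the script comes from the community issue above, its exact output may vary; a quick check that the result is valid JSONL (one JSON object per line, the format the Megatron preprocessor expects) looks like this:

```bash
# The first line should parse as standalone JSON; wc -l reports the document count
head -n 1 /home/mindspore/work/demo/wikitext-2/wiki.jsonl | python -m json.tool
wc -l /home/mindspore/work/demo/wikitext-2/wiki.jsonl
```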

### json => megatron

First take the first 1000 records:

```bash
head -n 1000 /home/mindspore/work/demo/wikitext-2/wiki.jsonl > /home/mindspore/work/demo/wikitext-2/wiki.less.jsonl
```

```shell
# cd /home/mindspore/work/demo/mindformers
mkdir -p /home/mindspore/work/demo/megatron_data
python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
  --input /home/mindspore/work/demo/wikitext-2/wiki.less.jsonl \
  --output-prefix /home/mindspore/work/demo/megatron_data/wikitext-2-v1-qwen3_text_document \
  --tokenizer-type HuggingFaceTokenizer \
  --tokenizer-dir /home/mindspore/work/demo/Qwen3-0.6B/ \
  --workers 64
```
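
If preprocessing succeeds, a Megatron-style indexed dataset is normally written as a `.bin`/`.idx` file pair under the output prefix (an assumption based on the usual Megatron indexed-dataset format; adjust if your version differs):

```bash
# Expect files like wikitext-2-v1-qwen3_text_document.bin / .idx
ls -lh /home/mindspore/work/demo/megatron_data/
```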

## Preparing the Qwen3-0.6B YAML

The MindSpore Transformers repository provides a pretraining [YAML configuration file](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/pretrain_qwen3_32b_4k.yaml) for Qwen3-32B, which we can use as a template for a Qwen3-0.6B YAML.

First copy the 32B configuration and rename it as the 0.6B configuration:

```bash
cp configs/qwen3/pretrain_qwen3_32b_4k.yaml configs/qwen3/pretrain_qwen3_0_6b_4k.yaml
```

Then make the following changes to `configs/qwen3/pretrain_qwen3_0_6b_4k.yaml`:

```yaml
# Model configuration
model:
  model_config:
    # Configurations from Hugging Face
    ...
    hidden_size: 1024
    intermediate_size: 3072
    num_hidden_layers: 28
    num_attention_heads: 16
    num_key_value_heads: 8
    head_dim: 128
    ...
    # Configurations from MindFormers
    offset: 0
```
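
These architecture values should mirror the Hugging Face `config.json` shipped with the 0.6B checkpoint, so they can be cross-checked directly against the downloaded weights:

```bash
# Print the config.json fields the YAML above must match
python -c "
import json
cfg = json.load(open('/home/mindspore/work/demo/Qwen3-0.6B/config.json'))
for k in ['hidden_size', 'intermediate_size', 'num_hidden_layers',
          'num_attention_heads', 'num_key_value_heads', 'head_dim']:
    print(k, '=', cfg.get(k))
"
```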

## Pretraining

### Modifying the Configuration for the Pretraining Task

Change the number of training epochs:

```yaml
# runner config
runner_config:
  epochs: 1
```

Change the dataset configuration:

```yaml
# dataset
train_dataset: &train_dataset
  data_loader:
    ...
    sizes:
      - 400 # number of training samples
      - 0   # number of test samples; not currently configurable
      - 0   # number of evaluation samples; not currently configurable
    config: # GPTDataset options
      ...
      data_path: # Megatron dataset sampling ratio and path
        - '1'
        - "/home/mindspore/work/demo/megatron_data/wikitext-2-v1-qwen3_text_document"
```

Add the following parameters to enable TensorBoard monitoring:

```yaml
monitor_config:
  monitor_on: True
  dump_path: './dump'
  target: ['layers.0', 'layers.1'] # only monitor the parameters of the first two layers
  invert: False
  step_interval: 1
  local_loss_format: ['tensorboard']
  device_local_loss_format: ['tensorboard']
  local_norm_format: ['tensorboard']
  device_local_norm_format: ['tensorboard']
  optimizer_state_format: null
  weight_state_format: null
  throughput_baseline: null
  print_struct: False
  check_for_global_norm: False
  global_norm_spike_threshold: 1.0
  global_norm_spike_count_threshold: 10

tensorboard:
  tensorboard_dir: 'worker/tensorboard'
  tensorboard_queue_size: 10
  log_loss_scale_to_tensorboard: True
  log_timers_to_tensorboard: True
```
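
Once training is running, the event files can be viewed with a standard TensorBoard launch (assuming `tensorboard` is installed; `worker/tensorboard` is relative to the launch directory):

```bash
pip install tensorboard  # if not already present
tensorboard --logdir worker/tensorboard --host 0.0.0.0 --port 6006
```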

Add the following configuration to enable checkpoint de-redundancy:

```yaml
callbacks:
  - type: MFLossMonitor # Prints training progress information
  - type: CheckpointMonitor # Saves model weights during training
    ...
    save_checkpoint_steps: 50 # Interval steps for saving model weights
    keep_checkpoint_max: 2 # Maximum number of saved model weight files
    ...
    remove_redundancy: True
```
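
With `save_checkpoint_steps: 50` and `keep_checkpoint_max: 2`, only the two most recent checkpoints are kept. After a run, the saved weights can be listed per rank (the `output/checkpoint/rank_*` layout is MindSpore Transformers' usual output structure; weights saved with `remove_redundancy: True` generally need the matching setting when loaded):

```bash
# One subdirectory per rank; each holds that rank's de-redundant shard
ls -lh output/checkpoint/rank_0/
```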

Modify the following configuration to switch the parallelism settings. With `dp=2`, `mp=2`, `pp=2`, the run needs dp × mp × pp = 8 devices in total (the default worker count of `msrun_launcher.sh`):

```yaml
parallel_config:
  data_parallel: &dp 2
  model_parallel: 2
  pipeline_stage: 2
  micro_batch_num: &micro_batch_num 2
```

### Launching the Pretraining Task

```bash
# Clear stale checkpoints from previous runs
rm -rf output/checkpoint
bash scripts/msrun_launcher.sh "run_mindformer.py \
--config configs/qwen3/pretrain_qwen3_0_6b_4k.yaml \
--auto_trans_ckpt False \
--use_parallel True \
--run_mode train \
--recompute_config.recompute False"
```
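
The launcher writes per-worker logs under `output/msrun_log/`; training progress (loss, step time) can be followed from worker 0 (check the directory for the exact file names on your version):

```bash
tail -f output/msrun_log/worker_0.log
```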

### Resuming the Pretraining Task from a Checkpoint

Modify the following configuration to enable resumable training:

```yaml
resume_training: True
```

Back up the logs of the previous run, so you can later verify that resumption worked:

```bash
mv output/msrun_log output/msrun_log_bak
```

```bash
bash scripts/msrun_launcher.sh "run_mindformer.py \
--config configs/qwen3/pretrain_qwen3_0_6b_4k.yaml \
--auto_trans_ckpt False \
--use_parallel True \
--run_mode train \
--load_checkpoint output/checkpoint_50_step \
--recompute_config.recompute False"
```
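
To confirm the run actually resumed rather than starting from scratch, compare the new worker log against the backup: the first logged steps should continue from the saved step and epoch instead of step 1 (the exact log wording may vary between versions):

```bash
grep -in "resume" output/msrun_log/worker_0.log | head
```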

## Full-Parameter Fine-Tuning

### Modifying the Configuration for Full-Parameter Fine-Tuning

The MindSpore Transformers repository provides a fine-tuning [YAML configuration file](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/finetune_qwen3.yaml) shared across Qwen3 model sizes; the specific model is selected via `--pretrained_model_dir /path/to/Qwen3-0.6B/`. We can use this file as the basis for full-parameter fine-tuning of Qwen3-0.6B.

Make the following changes to `configs/qwen3/finetune_qwen3.yaml`.

Modify the following options to set the dataset size (see the inspection snippet after this block):

```yaml
# Dataset configuration
train_dataset: &train_dataset
  ...
  data_loader:
    type: HFDataLoader
    path: "llm-wizard/alpaca-gpt4-data-zh"
    ...
    # dataset process arguments
    handler:
      - type: take
        n: 1000
```
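
The `take` handler keeps only the first 1000 records. To see what those records look like (Alpaca-style instruction/input/output fields), a quick peek with the `datasets` library works, assuming it is installed and the downloaded directory contains the data files:

```bash
python -c "
from datasets import load_dataset
ds = load_dataset('/home/mindspore/work/demo/alpaca-gpt4-data-zh', split='train')
print(len(ds))
print(ds[0])
"
```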

Modify the following configuration to switch the parallelism settings:

```yaml
parallel_config:
  data_parallel: &dp 8
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: &micro_batch_num 1
  ...
  use_seq_parallel: False
```

### Launching the Full-Parameter Fine-Tuning Task

Launch full-parameter fine-tuning of Qwen3-0.6B:

```bash
bash scripts/msrun_launcher.sh "run_mindformer.py \
--config configs/qwen3/finetune_qwen3.yaml \
--auto_trans_ckpt True \
--use_parallel True \
--run_mode train \
--pretrained_model_dir /home/mindspore/work/demo/Qwen3-0.6B/ \
--recompute_config.recompute False"
```

## LoRA Fine-Tuning

The MindSpore Transformers repository provides a fine-tuning [YAML configuration file](https://gitee.com/mindspore/mindformers/blob/master/configs/qwen3/finetune_qwen3.yaml) shared across Qwen3 model sizes; the specific model is selected via `--pretrained_model_dir /path/to/Qwen3-1.7B/`. We can use this file as the basis for a LoRA fine-tuning YAML for Qwen3-1.7B.

First copy the fine-tuning configuration and rename it as the LoRA configuration:

```bash
cp configs/qwen3/finetune_qwen3.yaml configs/qwen3/finetune_qwen3_lora.yaml
```

### Modifying the Configuration for LoRA Fine-Tuning

Add the following parameters to enable LoRA fine-tuning:

```yaml
# Model configuration
model:
  model_config:
    # Configurations from Hugging Face
    ...
    pet_config:
      pet_type: lora
      lora_rank: 8
      lora_alpha: 16
      lora_dropout: 0.1
      lora_a_init: 'normal'
      lora_b_init: 'zeros'
      target_modules: '.*word_embeddings|.*linear_qkv|.*linear_proj|.*linear_fc1|.*linear_fc2'
      freeze_include: ['*']
      freeze_exclude: ['*lora*']
```
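
For reference, LoRA replaces each targeted weight update with a trainable low-rank pair. With `lora_rank: 8` and `lora_alpha: 16` as configured above, the forward pass of a targeted layer follows the standard LoRA formulation (generic notation, not MindFormers-specific):

```latex
% Standard LoRA forward pass: W_0 is the frozen pretrained weight,
% A and B are the trainable low-rank factors (r = 8, alpha = 16)
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad A \in \mathbb{R}^{r \times d_{\text{in}}},
\quad B \in \mathbb{R}^{d_{\text{out}} \times r}
```

Initializing `lora_a_init: 'normal'` and `lora_b_init: 'zeros'` makes the update `BA` start at zero, so training begins from the pretrained model; the `freeze_include: ['*']` / `freeze_exclude: ['*lora*']` pair expresses "freeze everything, then exempt the LoRA parameters".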

Modify the following configuration to switch the parallelism settings:

```yaml
parallel_config:
  data_parallel: &dp 8
  model_parallel: 1
  pipeline_stage: 1
  micro_batch_num: &micro_batch_num 1
```

### Launching the LoRA Fine-Tuning Task

```bash
bash scripts/msrun_launcher.sh "run_mindformer.py \
--config configs/qwen3/finetune_qwen3_lora.yaml \
--auto_trans_ckpt True \
--use_parallel True \
--run_mode train \
--pretrained_model_dir /home/mindspore/work/demo/Qwen3-1.7B/ \
--recompute_config.recompute False"
```