diff --git a/llm/README.md b/llm/README.md
index d13b624..3fe2bb7 100644
--- a/llm/README.md
+++ b/llm/README.md
@@ -11,9 +11,10 @@ The following notebooks are actively maintained in sync with MindSpore and MindS
 | No. | Model | Description |
 | :-- | :---- | :----------------------- |
 | 1 | [t5](./t5/) | Includes notebooks for T5 finetuning and inference on tasks such as email summarization |
-| 2 | [distilgpt2](./distilgpt2/) | Includes notebooks for DistilGPT-2 finetuning and inference on causal language modeling (text generation) tasks. |
-| 3 | [bert](./bert/) | Includes notebooks for finetuning BERT on SWAG dataset for Multiple Choice tasks using MindSpore NLP |
-| 4 | [esm](./esmforproteinfolding/) | Includes notebooks for EsmForProteinFolding finetuning and inference tasks |
+| 2 | [helsinki-nlp/t5](./t5/finetune_seq2seq_translation.ipynb) | Includes notebooks for Helsinki-NLP/T5 finetuning and inference on tasks such as translation |
+| 3 | [distilgpt2](./distilgpt2/) | Includes notebooks for DistilGPT-2 finetuning and inference on causal language modeling (text generation) tasks. |
+| 4 | [bert](./bert/) | Includes notebooks for finetuning BERT on SWAG dataset for Multiple Choice tasks using MindSpore NLP |
+| 5 | [esm](./esmforproteinfolding/) | Includes notebooks for EsmForProteinFolding finetuning and inference tasks |
 
 ### Community-Driven / Legacy Applications
 
diff --git a/llm/t5/finetune_seq2seq_translation.ipynb b/llm/t5/finetune_seq2seq_translation.ipynb
new file mode 100644
index 0000000..c6e2334
--- /dev/null
+++ b/llm/t5/finetune_seq2seq_translation.ipynb
@@ -0,0 +1,398 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0",
+   "metadata": {},
+   "source": [
+    "### MindSpore NLP Machine Translation (Helsinki-NLP / T5)\n",
+    "\n",
+    "This example shows how to fine-tune a machine translation model with MindSpore NLP on Ascend. We use the WMT16 dataset to translate English into Romanian.\n",
+    "\n",
+    "This example was run in the following environment:\n",
+    "\n",
+    "| Python | MindSpore | MindSpore NLP |\n",
+    "| :----- | :-------- | :------------ |\n",
+    "| 3.10 | 2.7.0 | 0.5.1 |"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# If the NPU cannot be used from a notebook when developing on https://internstudio-ascend.intern-ai.org.cn/console/instance, try the following steps:\n",
+    "# Open a terminal on the dev machine\n",
+    "# 1. Activate your environment\n",
+    "# conda activate mind_py310\n",
+    "# pip install ipykernel\n",
+    "# python -m ipykernel install --user --name=mind_py310 --display-name=\"Python (mind_py310)\"\n",
+    "# 2. Load the base system driver configuration\n",
+    "# source /usr/local/Ascend/ascend-toolkit/set_env.sh\n",
+    "# 3. [Key step] Manually add the deeper driver library paths (fixes the libascend_hal.so error)\n",
+    "# export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64:$LD_LIBRARY_PATH\n",
+    "# 4. Launch Jupyter Lab\n",
+    "# jupyter lab --allow-root"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install dependencies\n",
+    "# !pip install mindnlp==0.5.1\n",
+    "# !pip install sacrebleu"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3",
+   "metadata": {},
+   "source": [
+    "#### Step 1: Compatibility patch and environment setup\n",
+    "First, apply a patch that works around a mindtorch version compatibility issue, then configure the MindSpore runtime environment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# -------- Compatibility patch ---------\n",
+    "import mindtorch.autograd.function\n",
+    "if not hasattr(mindtorch.autograd.function, 'FunctionCtx'):\n",
+    "    class FunctionCtx:\n",
+    "        def __init__(self):\n",
+    "            self.saved_tensors = ()\n",
+    "        def save_for_backward(self, *tensors):\n",
+    "            self.saved_tensors = tensors\n",
+    "    mindtorch.autograd.function.FunctionCtx = FunctionCtx\n",
+    "    print(\"Applied mindtorch compatibility patch\")\n",
+    "\n",
+    "# ---- Basic configuration and environment ----\n",
+    "import mindnlp\n",
+    "import mindspore\n",
+    "from mindspore import context\n",
+    "from datasets import load_dataset\n",
+    "import evaluate\n",
+    "import numpy as np\n",
+    "import os\n",
+    "\n",
+    "# 1. Set the token (make sure this is a valid Hugging Face token)\n",
+    "os.environ[\"HF_TOKEN\"] = \"hf_***\"\n",
+    "\n",
+    "# 2. Clear the offline environment variables so downloads can go online (if needed)\n",
+    "os.environ.pop(\"HF_DATASETS_OFFLINE\", None)\n",
+    "os.environ.pop(\"TRANSFORMERS_OFFLINE\", None)\n",
+    "mindspore.set_seed(42)\n",
+    "print(\"Environment configured\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5",
+   "metadata": {},
+   "source": [
+    "#### Step 2: Model selection\n",
+    "T5-family models are supported (with the appropriate task prefix added), as are the dedicated Helsinki-NLP translation models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model_checkpoint = \"Helsinki-NLP/opus-mt-en-ro\"\n",
+    "# model_checkpoint = \"t5-small\"\n",
+    "print(f\"Selected model: {model_checkpoint}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7",
+   "metadata": {},
+   "source": [
+    "#### Step 3: Data loading and preprocessing\n",
+    "This example uses the WMT dataset, a machine translation dataset that combines several sources, including news commentary and parliamentary proceedings.\n",
+    "We show how to load the dataset for this task and how to fine-tune the model on Ascend NPUs with the Trainer interface provided by MindSpore NLP (which feels similar to Hugging Face's).\n",
+    "Next, we initialize the tokenizer and apply model-specific settings:\n",
+    "1. **T5 models**: T5 is a multi-task (text-to-text) model, so the input text needs a task prefix (such as `\"translate English to Romanian: \"`) that tells the model to translate.\n",
+    "2. **mBART models**: for mBART, the source and target language codes must be set explicitly.\n",
+    "\n",
+    "Finally, we define `preprocess_function` to process the data in batches:\n",
+    "- Tokenize the input text and the target text (labels) separately.\n",
+    "- Apply the preprocessing to the whole dataset with `map`.\n",
+    "- **Note**: to keep the demo fast, only a small sample is used (2,000 training examples). For a real training run, remove the `.select()` calls to use the full dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "raw_datasets = load_dataset(\"wmt16\", \"ro-en\")\n",
+    "metric = evaluate.load(\"sacrebleu\")\n",
+    "print(\"Dataset loaded\")\n",
+    "\n",
+    "from mindnlp.transformers import AutoTokenizer\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)\n",
+    "\n",
+    "# --- Special handling for mBART ---\n",
+    "if \"mbart\" in model_checkpoint:\n",
+    "    print(\"Detected an mBART model; setting source and target language codes...\")\n",
+    "    tokenizer.src_lang = \"en-XX\"\n",
+    "    tokenizer.tgt_lang = \"ro-RO\"\n",
+    "\n",
+    "# --- Prefix check for T5 ---\n",
+    "if model_checkpoint in [\"t5-small\", \"t5-base\", \"t5-large\", \"t5-3b\", \"t5-11b\"]:\n",
+    "    prefix = \"translate English to Romanian: \"\n",
+    "    print(f\"Detected a T5 model; using prefix: '{prefix}'\")\n",
+    "else:\n",
+    "    prefix = \"\"\n",
+    "    print(\"Not a T5 model; no prefix added.\")\n",
+    "\n",
+    "\n",
+    "# --- Preprocessing function ---\n",
+    "max_input_length = 128\n",
+    "max_target_length = 128\n",
+    "source_lang = \"en\"\n",
+    "target_lang = \"ro\"\n",
+    "\n",
+    "def preprocess_function(examples):\n",
+    "    # inputs: prefix + source text\n",
+    "    inputs = [prefix + ex[source_lang] for ex in examples[\"translation\"]]\n",
+    "    targets = [ex[target_lang] for ex in examples[\"translation\"]]\n",
+    "\n",
+    "    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)\n",
+    "\n",
+    "    # Targets are tokenized separately to produce the labels\n",
+    "    with tokenizer.as_target_tokenizer():\n",
+    "        labels = tokenizer(targets, max_length=max_target_length, truncation=True)\n",
+    "\n",
+    "    model_inputs[\"labels\"] = labels[\"input_ids\"]\n",
+    "    return model_inputs\n",
+    "\n",
+    "# Sample a small subset for a quick demo (remove .select() for real training)\n",
+    "train_dataset = raw_datasets[\"train\"].shuffle(seed=42).select(range(2000))\n",
+    "val_dataset = raw_datasets[\"validation\"].shuffle(seed=42).select(range(500))\n",
+    "test_dataset = raw_datasets[\"test\"].shuffle(seed=42).select(range(200))\n",
+    "\n",
+    "# Apply preprocessing in batches\n",
+    "print(\"Tokenizing data...\")\n",
+    "tokenized_datasets = {\n",
+    "    \"train\": train_dataset.map(preprocess_function, batched=True, remove_columns=raw_datasets[\"train\"].column_names),\n",
+    "    \"validation\": val_dataset.map(preprocess_function, batched=True, remove_columns=raw_datasets[\"validation\"].column_names),\n",
+    "    \"test\": test_dataset.map(preprocess_function, batched=True, remove_columns=raw_datasets[\"test\"].column_names)\n",
+    "}\n",
+    "print(\"Preprocessing complete\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9",
+   "metadata": {},
+   "source": [
+    "#### Step 4: Load the model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from mindnlp.transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq\n",
+    "\n",
+    "model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)\n",
+    "data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)\n",
+    "print(\"Model loaded\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11",
+   "metadata": {},
+   "source": [
+    "#### Step 5: Define the evaluation metric (BLEU)\n",
+    "\n",
+    "We define a `compute_metrics` function to evaluate translation quality. It decodes the model's output token IDs into text, replaces the special mask value (-100) in the labels, and computes the **BLEU** score and the average length of the generated sequences."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "12",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ---- Evaluation functions ----\n",
+    "def postprocess_text(preds, labels):\n",
+    "    preds = [pred.strip() for pred in preds]\n",
+    "    labels = [[label.strip()] for label in labels]\n",
+    "    return preds, labels\n",
+    "\n",
+    "def compute_metrics(eval_preds):\n",
+    "    preds, labels = eval_preds\n",
+    "    if isinstance(preds, tuple):\n",
+    "        preds = preds[0]\n",
+    "\n",
+    "    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)\n",
+    "    # Replace -100 with the pad token ID so the labels can be decoded\n",
+    "    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)\n",
+    "    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)\n",
+    "\n",
+    "    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)\n",
+    "\n",
+    "    result = metric.compute(predictions=decoded_preds, references=decoded_labels)\n",
+    "    result = {\"bleu\": result[\"score\"]}\n",
+    "\n",
+    "    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]\n",
+    "    result[\"gen_len\"] = np.mean(prediction_lens)\n",
+    "    return {k: round(v, 4) for k, v in result.items()}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13",
+   "metadata": {},
+   "source": [
+    "#### Step 6: Configure the Trainer and start training\n",
+    "\n",
+    "To fine-tune the model, we use the `Seq2SeqTrainer` interface provided by MindSpore NLP. First, `Seq2SeqTrainingArguments` defines the training configuration:\n",
+    "\n",
+    "- **output_dir**: where model checkpoints and logs are saved.\n",
+    "- **learning_rate**: set to 2e-5; fine-tuning typically uses a small learning rate.\n",
+    "- **per_device_train_batch_size**: set to 16; adjust it to fit device memory.\n",
+    "- **predict_with_generate**: set to `True` so that evaluation generates translations and computes metrics on them.\n",
+    "- **metric_for_best_model**: uses \"bleu\" to compare checkpoints; with `load_best_model_at_end=True`, the best checkpoint is loaded when training ends.\n",
+    "\n",
+    "With the arguments defined, we build the `Seq2SeqTrainer` and call `train()` to start training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "14",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from mindnlp.transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments\n",
+    "\n",
+    "batch_size = 16\n",
+    "# Derive the output directory name from the checkpoint name\n",
+    "model_name = model_checkpoint.split(\"/\")[-1]\n",
+    "output_dir = f\"{model_name}-finetuned-{source_lang}-to-{target_lang}\"\n",
+    "\n",
+    "args = Seq2SeqTrainingArguments(\n",
+    "    output_dir=output_dir,\n",
+    "    eval_strategy=\"epoch\",\n",
+    "    learning_rate=2e-5,\n",
+    "    per_device_train_batch_size=batch_size,\n",
+    "    per_device_eval_batch_size=batch_size,\n",
+    "    weight_decay=0.01,\n",
+    "    save_total_limit=3,\n",
+    "    num_train_epochs=1,\n",
+    "    predict_with_generate=True,\n",
+    "    logging_steps=50,\n",
+    "    save_strategy=\"epoch\",\n",
+    "    load_best_model_at_end=True,\n",
+    "    metric_for_best_model=\"bleu\",\n",
+    ")\n",
+    "\n",
+    "trainer = Seq2SeqTrainer(\n",
+    "    model=model,\n",
+    "    args=args,\n",
+    "    train_dataset=tokenized_datasets[\"train\"],\n",
+    "    eval_dataset=tokenized_datasets[\"validation\"],\n",
+    "    data_collator=data_collator,\n",
+    "    tokenizer=tokenizer,\n",
+    "    compute_metrics=compute_metrics\n",
+    ")\n",
+    "\n",
+    "print(\"Starting training...\")\n",
+    "try:\n",
+    "    mindspore.hal.empty_cache()\n",
+    "except Exception:\n",
+    "    pass\n",
+    "trainer.train()\n",
+    "print(\"Training finished\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15",
+   "metadata": {},
+   "source": [
+    "#### Step 7: Inference and test-set evaluation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "16",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# --- Inference demo ---\n",
+    "\n",
+    "print(\"\\n=== Single-sentence inference test ===\")\n",
+    "src_text = \"Machine learning is fascinating.\"\n",
+    "# Apply the same prefix logic so the inference format matches training\n",
+    "input_text = prefix + src_text\n",
+    "\n",
+    "inputs = tokenizer(input_text, return_tensors=\"ms\", max_length=128, truncation=True)\n",
+    "inputs = {k: v.to(model.device) for k, v in inputs.items()}\n",
+    "\n",
+    "outputs = model.generate(\n",
+    "    input_ids=inputs[\"input_ids\"],\n",
+    "    attention_mask=inputs[\"attention_mask\"],\n",
+    "    max_new_tokens=40,\n",
+    "    num_beams=4\n",
+    ")\n",
+    "print(f\"Input: {input_text}\")\n",
+    "print(f\"Translation: {tokenizer.decode(outputs[0], skip_special_tokens=True)}\")\n",
+    "\n",
+    "\n",
+    "# --- Evaluation on the sampled test set (compute BLEU) ---\n",
+    "print(\"\\n=== Computing BLEU on the test set ===\")\n",
+    "# trainer.predict handles batched inference, device placement, and metric computation\n",
+    "test_results = trainer.predict(tokenized_datasets[\"test\"])\n",
+    "print(\"Final test-set metrics:\", test_results.metrics)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "MindSpore (mind_py310)",
+   "language": "python",
+   "name": "mind_py310"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.19"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}