From d20798185e02c2abb486d97637c39c0021a99621 Mon Sep 17 00:00:00 2001 From: Fiona-Waters Date: Fri, 15 May 2026 16:39:34 +0100 Subject: [PATCH 1/2] Adding GRPO/ART example Signed-off-by: Fiona-Waters --- examples/fine-tuning/README.md | 2 + examples/fine-tuning/grpo/README.md | 149 +++++ .../grpo/grpo_lora-kubeflow-trainjob.ipynb | 516 ++++++++++++++++++ 3 files changed, 667 insertions(+) create mode 100644 examples/fine-tuning/grpo/README.md create mode 100644 examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb diff --git a/examples/fine-tuning/README.md b/examples/fine-tuning/README.md index 5caa5ddf..24f142e0 100644 --- a/examples/fine-tuning/README.md +++ b/examples/fine-tuning/README.md @@ -7,6 +7,7 @@ All examples are built primarily on top of **Training Hub** algorithms running o - **SFT (Supervised Fine-Tuning)** - **OSFT (Orthogonal Subspace Fine-Tuning)** - **LoRA + SFT (Low-Rank Adaptation)** +- **GRPO (Group Relative Policy Optimization)** For detailed algorithm documentation and configuration options, see the upstream [Training Hub documentation](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/tree/main). @@ -93,6 +94,7 @@ Training is offloaded to **dedicated training pods** managed by **Kubeflow Train - [SFT fine-tuning example](sft/README.md) - [OSFT fine-tuning example](osft/README.md) - [LoRA fine-tuning example](lora/README.md) +- [GRPO fine-tuning example](grpo/README.md) (single-GPU TrainJob only) --- diff --git a/examples/fine-tuning/grpo/README.md b/examples/fine-tuning/grpo/README.md new file mode 100644 index 00000000..bd2dcc54 --- /dev/null +++ b/examples/fine-tuning/grpo/README.md @@ -0,0 +1,149 @@ +# GRPO Fine-Tuning with Training Hub + +This example provides an overview of Training Hub's [GRPO (Group Relative Policy Optimization)](https://github.com/Red-Hat-AI-Innovation-Team/training_hub?tab=readme-ov-file#grpo) capabilities and demonstrates how to use them with Red Hat OpenShift AI. + +## What is GRPO? 
+ +GRPO is a reinforcement learning from verifiable rewards (RLVR) algorithm that improves a model's outputs by comparing groups of responses and reinforcing the better ones: + +- Generates multiple candidate responses per prompt +- Scores them with a reward function (e.g. tool-call correctness) +- Uses the group's relative ranking to compute advantage signals +- Updates LoRA adapter weights via policy gradient with group normalization + +Each training iteration has two phases: + +1. **Rollout phase** — vLLM generates candidate responses and a reward function scores them +2. **Train phase** — Unsloth updates the LoRA adapter weights using the advantage signals + +The ART backend time-shares a single GPU between vLLM (inference) and Unsloth (training) via `gpu_memory_utilization`. + +### Training Task: Tool-Call Verification + +The example uses the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments. + +## Execution mode + +GRPO runs as a **single-GPU TrainJob** submitted via the Kubeflow SDK. ART is single-GPU by design and manages its own vLLM subprocess internally. + +The notebook submits a `TrainJob` from a lightweight workbench, and the training runs on a dedicated GPU pod managed by Kubeflow Trainer. + +To learn more about execution modes for other algorithms, see the [fine-tuning execution modes overview](../README.md#execution-modes). + +## RHOAI compatibility + +This example is compatible with RHOAI version 3.5. + +## Requirements + +- An OpenShift cluster with OpenShift AI (RHOAI 3.5) installed: + - The `dashboard` and `workbenches` components enabled + - The `trainer` component enabled +- A worker node with an NVIDIA GPU (Ampere-based or newer, 40GB+ VRAM). +- A dynamic storage provisioner supporting RWX PVC provisioning. 
Talk to your cluster administrator about RWX storage options. + +## Hardware requirements + +For the workbench image, the example was run on `Training | Jupyter | PyTorch | CUDA | Python` and `Training | Jupyter | PyTorch | CPU Python`. +Each is a single image that serves as both the training runtime and the Jupyter notebook environment, and comes with the pre-installed dependencies required +to seamlessly run fine-tuning jobs. + +### Workbench Requirements + +| Image Type | Use Case | GPU | CPU | Memory | +|------------|----------|-----|-----|--------| +| Training \| Jupyter \| PyTorch \| CPU Python | Job submission and monitoring | None | 2 cores | 8Gi | +| Training \| Jupyter \| PyTorch \| CUDA \| Python | Job submission + model evaluation | 1× GPU | 2 cores | 8Gi | + +> [!NOTE] +> +> - The workbench does not run the training itself — it submits a TrainJob and monitors progress. +> - A GPU on the workbench is only needed if you want to load and test the fine-tuned LoRA adapter after training completes. + +### Training Pod Requirements + +| Component | GPU | GPU Type | CPU | Memory | +|-----------|-----|----------|-----|--------| +| Training Pod | 1× GPU | NVIDIA A100, H100, or L40S (40GB+ VRAM) | 8 cores | 64Gi | + +> [!NOTE] +> +> - GRPO requires a single GPU with at least 40GB VRAM. The `gpu_memory_utilization` parameter (default `0.45`) controls how much GPU memory is reserved for vLLM inference, with the remainder available for Unsloth training. +> - CPU and memory requirements scale with model size and group size. The above values suit the example configuration (Qwen3-4B, group_size=4). +> - The training pod is configured from the `client.train()` call within the notebook. 
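
The `gpu_memory_utilization` note above can be made concrete with a little arithmetic. The sketch below is illustrative only: `split_gpu_memory` is a hypothetical helper (not part of Training Hub or ART), and it ignores CUDA context and fragmentation overhead, so the real headroom for training will be somewhat lower.

```python
def split_gpu_memory(total_gb: float, gpu_memory_utilization: float = 0.45) -> tuple[float, float]:
    """Return (vllm_gb, remainder_gb) for a single GPU.

    Hypothetical helper for illustration; it only applies the fraction
    described in the note above and is not a Training Hub API.
    """
    if not 0.0 < gpu_memory_utilization < 1.0:
        raise ValueError("gpu_memory_utilization must be between 0 and 1")
    vllm_gb = total_gb * gpu_memory_utilization
    return vllm_gb, total_gb - vllm_gb


# Compare a few common VRAM sizes at the default fraction.
for total in (40, 48, 80):
    vllm_gb, train_gb = split_gpu_memory(total)
    print(f"{total} GB GPU: ~{vllm_gb:.1f} GB for vLLM, ~{train_gb:.1f} GB left for Unsloth")
```

On a 40 GB GPU with the default `0.45`, this works out to roughly 18 GB for vLLM and 22 GB for training. Raising the fraction gives vLLM more room for rollouts at the expense of the training phase.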
+ +### Storage Requirements + +| Purpose | Size | Access Mode | Storage Class | Notes | +|---------|------|-------------|---------------|-------| +| Shared Storage (PVC) total | 50Gi (Example Default) | RWX | Dynamic provisioner required | Shared between workbench and training pod | + +> [!NOTE] +> +> - Storage can be created in the `Create Workbench` view of the RHOAI platform; however, a dynamic RWX provisioner must be configured before you create shared file storage in RHOAI. +> - Shared storage is required: the training pod writes checkpoints and metrics to the PVC, and the workbench reads them for inspection and plotting. + +## GRPO-specific considerations + +- **`/dev/shm` volume**: vLLM requires a memory-backed `/dev/shm` for inter-process communication. The notebook configures this automatically via a `PodSpecOverride` that mounts an `emptyDir` with `medium: Memory`. +- **`gpu_memory_utilization`**: Controls the vLLM/Unsloth memory split on the single GPU. The default `0.45` reserves 45% for vLLM inference and leaves the rest for Unsloth training. Adjust based on your model size and available VRAM. +- **Hugging Face token**: Not strictly required for public models (e.g. Qwen3-4B) but recommended to avoid rate limits. Set `HF_TOKEN` in the environment variables if needed. + +## Setup + +### Setup Workbench + +**Step 1.** Access the OpenShift AI dashboard, for example from the top navigation bar menu: + +![](../images/01.png) + +**Step 2.** Log in, then go to **_Data Science Projects_** and create a project: + +![](../images/02.png) + +**Step 3.** Once the project is created, click on **_Create a workbench_**: + +![](../images/03.png) + +**Step 4.** Select the appropriate Workbench image. 
See options above: + +![](../images/04a.png) + +**Step 5.** You may want to create a **Hardware Profile** with GPU support, similar to the one below: + +![](../images/04b.png) + +**Step 6.** Select the Hardware profile you want to use: + +![](../images/04c.png) + +> [!NOTE] +> A GPU on the workbench is only needed if you want to test the fine-tuned model after training. The workbench itself only submits and monitors the TrainJob. + +**Step 7.** Create **shared storage** that will be shared between the workbench and the training pod. Make sure it uses a storage class with RWX capability: + +![](../images/04d.png) + +> [!NOTE] +> You can attach an existing shared storage if you already have one instead. + +**Step 8.** Review the storage configuration and click "Create workbench": + +![](../images/04e.png) + +**Step 9.** From "Workbenches" page, click on **_Open_** when the workbench you've just created becomes ready: + +![](../images/05.png) + +### Running the example notebook + +- From the workbench, clone this repository: `https://github.com/red-hat-data-services/red-hat-ai-examples.git` +- Navigate to the `examples/fine-tuning/grpo` directory and open the [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) notebook. + +> [!NOTE] +> +> - You will need a Hugging Face token if using gated models (e.g., Llama models). +> Set the `HF_TOKEN` environment variable in your job configuration. +> You can skip the token if switching to non-gated models like Qwen3-4B. + +You can now proceed with the instructions from the notebook. Enjoy! 
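
If you do need a token, one way to supply it is to add `HF_TOKEN` to the `env` dictionary that the TrainJob notebook passes to `TrainingHubTrainer`. The sketch below rests on assumptions: `build_trainer_env` is a hypothetical helper, and it assumes the token is either passed in or already exported in the workbench environment. The `HF_HOME` and `TRANSFORMERS_ATTN_BACKEND` entries mirror the ones used in the notebook.

```python
import os
from typing import Optional


def build_trainer_env(cache_root: str, token: Optional[str] = None) -> dict[str, str]:
    """Build the env mapping for the training pod (hypothetical helper)."""
    env = {
        "HF_HOME": cache_root,
        "TRANSFORMERS_ATTN_BACKEND": "sdpa",
    }
    # Fall back to the workbench's own HF_TOKEN if no token is passed in.
    token = token or os.environ.get("HF_TOKEN")
    if token:
        env["HF_TOKEN"] = token
    return env


env = build_trainer_env("/mnt/shared/.cache/huggingface")
print(sorted(env))  # list the keys only; avoid printing the token value
```

The resulting dictionary can then be passed as the `env` argument in place of the literal dictionary in the notebook. For public models such as Qwen3-4B the token entry is simply omitted when none is set.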
diff --git a/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb new file mode 100644 index 00000000..16d2d973 --- /dev/null +++ b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb @@ -0,0 +1,516 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LoRA GRPO Fine-Tuning with Kubeflow Trainer and Training Hub on OpenShift AI\n", + "\n", + "This notebook demonstrates how to run **GRPO (Group Relative Policy Optimization)** training on OpenShift AI using a Kubeflow `TrainJob`. GRPO is a reinforcement learning technique that teaches a model to improve its outputs by comparing groups of responses and reinforcing the better ones.\n", + "\n", + "## What is GRPO?\n", + "\n", + "GRPO is a reinforcement learning from verifiable rewards (RLVR) algorithm that:\n", + "- Generates multiple candidate responses per prompt\n", + "- Scores them with a reward function (e.g. tool-call correctness)\n", + "- Uses the group's relative ranking to compute advantage signals\n", + "- Updates LoRA adapter weights via policy gradient with group normalization\n", + "\n", + "Each training iteration has two phases:\n", + "\n", + "1. **Rollout phase** — vLLM generates candidate responses, a reward function scores them\n", + "2. **Train phase** — Unsloth updates the LoRA adapter weights using the advantage signals\n", + "\n", + "ART time-shares a single GPU between vLLM (inference) and Unsloth (training).\n", + "\n", + "## Training Task: Tool-Call Verification\n", + "\n", + "We use the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. 
The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments.\n", + "\n", + "## Hardware Requirements\n", + "\n", + "- **1x GPU with 40GB+ VRAM** (A100, H100, or L40S recommended)\n", + "- ART manages GPU memory sharing between vLLM and Unsloth via `gpu_memory_utilization`\n", + "- The default `gpu_memory_utilization=0.45` reserves 45% for vLLM, leaving the rest for Unsloth" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "First, import the required dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# LORA_GRPO support is not yet released in the workbench image.\n", + "# Install from midstream until the SDK is pre-installed in the image.\n", + "# TODO: replace with midstream branch once the PR is merged, then remove\n", + "# this line entirely when the image includes the updated SDK.\n", + "%pip install \"kubeflow @ git+https://github.com/Fiona-Waters/kubeflow-sdk.git@add-rl-algo\" --force-reinstall --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from kubeflow.common.types import KubernetesBackendConfig\n", + "from kubeflow.trainer import TrainerClient\n", + "from kubeflow.trainer.rhai import TrainingHubAlgorithms, TrainingHubTrainer\n", + "from kubernetes import client as k8s" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Authenticate to your OpenShift Cluster" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "api_server = \"\"\n", + "token = \"\"\n", + "PVC_NAME = \"shared\" # Replace if the shared RWX storage name is different than in the example provided\n", + "PVC_PATH = \"shared\" # Replace if the shared RWX storage path is different than in the example provided\n", + "configuration = 
k8s.Configuration()\n", + "configuration.host = api_server\n", + "# Un-comment if your cluster API server uses a self-signed certificate or an un-trusted CA\n", + "# configuration.verify_ssl = False\n", + "configuration.api_key = {\"authorization\": f\"Bearer {token}\"}\n", + "api_client = k8s.ApiClient(configuration)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Configure Training Parameters\n", + "\n", + "Key parameters:\n", + "\n", + "### GRPO Parameters\n", + "- **num_iterations**: Number of GRPO iterations (each = rollout + train phase)\n", + "- **group_size**: Responses generated per prompt for comparison\n", + "- **prompt_batch_size**: Unique prompts sampled per iteration\n", + "\n", + "### LoRA Parameters\n", + "- **lora_r**: Rank of the low-rank matrices (higher = more capacity, more memory)\n", + "- **lora_alpha**: Scaling factor\n", + "\n", + "### vLLM Parameters\n", + "- **gpu_memory_utilization**: Fraction of GPU memory reserved for vLLM inference (rest for Unsloth training)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Model and dataset\n", + "MODEL_PATH = \"Qwen/Qwen3-4B\"\n", + "DATA_PATH = \"Agent-Ark/Toucan-1.5M\"\n", + "DATA_CONFIG = \"Qwen3\"\n", + "\n", + "# GRPO hyperparameters\n", + "NUM_ITERATIONS = 5 # Number of GRPO iterations (each = rollout + train)\n", + "GROUP_SIZE = 4 # Responses generated per prompt for comparison\n", + "PROMPT_BATCH_SIZE = 50 # Unique prompts per iteration\n", + "N_TRAIN = 200 # Total training examples from dataset\n", + "LEARNING_RATE = 1e-5\n", + "\n", + "# LoRA configuration\n", + "LORA_R = 16\n", + "LORA_ALPHA = 8\n", + "\n", + "# vLLM configuration\n", + "GPU_MEMORY_UTILIZATION = 0.45 # Fraction of GPU memory for vLLM (rest for Unsloth)\n", + "\n", + "params = {\n", + " \"model_path\": MODEL_PATH,\n", + " \"data_path\": DATA_PATH,\n", + " \"data_config\": DATA_CONFIG,\n", + " \"ckpt_output_dir\": 
f\"/mnt/{PVC_PATH}/grpo_output\",\n", + " \"backend\": \"art\",\n", + " # GRPO hyperparameters\n", + " \"num_iterations\": NUM_ITERATIONS,\n", + " \"group_size\": GROUP_SIZE,\n", + " \"prompt_batch_size\": PROMPT_BATCH_SIZE,\n", + " \"n_train\": N_TRAIN,\n", + " \"learning_rate\": LEARNING_RATE,\n", + " # LoRA\n", + " \"lora_r\": LORA_R,\n", + " \"lora_alpha\": LORA_ALPHA,\n", + " # vLLM\n", + " \"gpu_memory_utilization\": GPU_MEMORY_UTILIZATION,\n", + "}\n", + "\n", + "print(\"Training Configuration:\")\n", + "print(f\" Model: {MODEL_PATH}\")\n", + "print(f\" Dataset: {DATA_PATH} ({DATA_CONFIG})\")\n", + "print(f\" Iterations: {NUM_ITERATIONS}, Group size: {GROUP_SIZE}\")\n", + "print(\n", + " f\" Prompts/iter: {PROMPT_BATCH_SIZE}, Rollouts/iter: {PROMPT_BATCH_SIZE * GROUP_SIZE}\"\n", + ")\n", + "print(f\" Training samples: {N_TRAIN}\")\n", + "print(f\" LoRA Rank: {LORA_R}, Alpha: {LORA_ALPHA}\")\n", + "print(f\" GPU Memory Utilization: {GPU_MEMORY_UTILIZATION}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Training with LORA GRPO and Kubeflow Trainer\n", + "\n", + "Launch a training job via Kubeflow Trainer with configured hyperparameters.\n", + "\n", + "`LORA_GRPO` uses `python` (not `torchrun`) as the entrypoint because ART manages\n", + "its own subprocess via `multiprocessing.spawn`. The SDK also wraps the training call\n", + "with a `__main__` guard to prevent re-execution in the spawned process." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "backend_cfg = KubernetesBackendConfig(client_configuration=api_client.configuration)\n", + "client = TrainerClient(backend_cfg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Find the Training Hub ClusterTrainingRuntime" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for runtime in client.list_runtimes():\n", + " if runtime.name == \"training-hub\":\n", + " th_runtime = runtime\n", + " print(\"Found runtime: \" + str(th_runtime))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Submit the TrainJob" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from kubeflow.trainer.options.kubernetes import (\n", + " ContainerOverride,\n", + " PodSpecOverride,\n", + " PodTemplateOverride,\n", + " PodTemplateOverrides,\n", + ")\n", + "\n", + "cache_root = f\"/mnt/{PVC_PATH}/.cache/huggingface\"\n", + "\n", + "job_name = client.train(\n", + " trainer=TrainingHubTrainer(\n", + " algorithm=TrainingHubAlgorithms.LORA_GRPO,\n", + " func_args=params,\n", + " env={\n", + " \"HF_HOME\": cache_root,\n", + " \"TRANSFORMERS_ATTN_BACKEND\": \"sdpa\",\n", + " },\n", + " resources_per_node={\n", + " \"cpu\": 8,\n", + " \"memory\": \"64Gi\",\n", + " \"nvidia.com/gpu\": 1,\n", + " },\n", + " ),\n", + " options=[\n", + " PodTemplateOverrides(\n", + " PodTemplateOverride(\n", + " target_jobs=[\"node\"],\n", + " spec=PodSpecOverride(\n", + " volumes=[\n", + " {\n", + " \"name\": \"work\",\n", + " \"persistentVolumeClaim\": {\"claimName\": PVC_NAME},\n", + " },\n", + " {\n", + " \"name\": \"dshm\",\n", + " \"emptyDir\": {\"medium\": \"Memory\"},\n", + " },\n", + " ],\n", + " containers=[\n", + " ContainerOverride(\n", + " name=\"node\",\n", + " volume_mounts=[\n", + " {\n", + " \"name\": 
\"work\",\n", + " \"mountPath\": f\"/mnt/{PVC_PATH}\",\n", + " \"readOnly\": False,\n", + " },\n", + " {\n", + " \"name\": \"dshm\",\n", + " \"mountPath\": \"/dev/shm\",\n", + " },\n", + " ],\n", + " ),\n", + " ],\n", + " ),\n", + " )\n", + " )\n", + " ],\n", + " runtime=th_runtime,\n", + ")\n", + "\n", + "print(job_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Follow job logs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Follow job logs\n", + "logs = client.get_job_logs(job_name, follow=True)\n", + "for line in logs:\n", + " print(line)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Inspect Results\n", + "\n", + "After training completes, check the metrics file and training results on the PVC.\n", + "\n", + "GRPO writes two lines per iteration to `training_metrics.jsonl`:\n", + "- **Rollout phase**: `mean_reward`, `full_match_rate`\n", + "- **Train phase**: `loss`, `grad_norm`, `entropy`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "\n", + "OUTPUT_DIR = f\"/opt/app-root/src/{PVC_PATH}/grpo_output\"\n", + "\n", + "# Read training metrics (two lines per iteration: rollout + train)\n", + "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n", + "if os.path.exists(metrics_file):\n", + " print(\"Training metrics:\")\n", + " print(\"=\" * 80)\n", + " with open(metrics_file) as f:\n", + " for line in f:\n", + " entry = json.loads(line)\n", + " phase = entry.get(\"phase\", \"unknown\")\n", + " step = entry.get(\"step\", \"?\")\n", + " if phase == \"rollout\":\n", + " print(\n", + " f\"Step {step} [rollout] mean_reward={entry.get('mean_reward', 'N/A'):.4f} \"\n", + " f\"full_match_rate={entry.get('full_match_rate', 'N/A'):.4f}\"\n", + " )\n", + " elif phase == \"train\":\n", + " print(\n", + " f\"Step {step} 
[train] loss={entry.get('loss', 'N/A')} \"\n", + " f\"grad_norm={entry.get('grad_norm', 'N/A')} \"\n", + " f\"entropy={entry.get('entropy', 'N/A')}\"\n", + " )\n", + "else:\n", + " print(\"No metrics file yet -- training may not have started.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Read training results summary\n", + "results_file = os.path.join(OUTPUT_DIR, \"training_results.json\")\n", + "if os.path.exists(results_file):\n", + " with open(results_file) as f:\n", + " results = json.load(f)\n", + " print(\"Training results:\")\n", + " print(json.dumps(results, indent=2, default=str))\n", + "else:\n", + " print(\"No results file yet -- training may not have completed.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Plot Reward Curve" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import os\n", + "\n", + "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n", + "if not os.path.exists(metrics_file):\n", + " print(\"No metrics file found.\")\n", + "else:\n", + " steps, rewards, match_rates = [], [], []\n", + " with open(metrics_file) as f:\n", + " for line in f:\n", + " entry = json.loads(line)\n", + " if entry.get(\"phase\") == \"rollout\":\n", + " steps.append(entry[\"step\"])\n", + " rewards.append(entry[\"mean_reward\"])\n", + " match_rates.append(entry[\"full_match_rate\"])\n", + "\n", + " try:\n", + " import matplotlib.pyplot as plt\n", + "\n", + " fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))\n", + "\n", + " ax1.plot(steps, rewards, marker=\"o\")\n", + " ax1.set_xlabel(\"Iteration\")\n", + " ax1.set_ylabel(\"Mean Reward\")\n", + " ax1.set_title(\"GRPO Training: Mean Reward\")\n", + " ax1.grid(True, alpha=0.3)\n", + "\n", + " ax2.plot(steps, match_rates, marker=\"o\", color=\"tab:orange\")\n", + " ax2.set_xlabel(\"Iteration\")\n", + " 
ax2.set_ylabel(\"Full Match Rate\")\n", + " ax2.set_title(\"GRPO Training: Full Match Rate\")\n", + " ax2.grid(True, alpha=0.3)\n", + "\n", + " plt.tight_layout()\n", + " plt.savefig(os.path.join(OUTPUT_DIR, \"reward_curve.png\"), dpi=150)\n", + " plt.show()\n", + " print(f\"Plot saved to {OUTPUT_DIR}/reward_curve.png\")\n", + " except ImportError:\n", + " print(\"matplotlib not installed -- printing raw values instead.\")\n", + " for s, r, m in zip(steps, rewards, match_rates, strict=True):\n", + " print(f\" Step {s}: reward={r:.4f}, match_rate={m:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Cleanup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.delete_job(job_name)\n", + "print(f\"TrainJob '{job_name}' deleted.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Appendix: Dataset Format\n", + "\n", + "### Built-in tool-call mode (default)\n", + "\n", + "When using `data_path` with a HuggingFace dataset like `Agent-Ark/Toucan-1.5M`, ART's built-in reward function handles tool-call verification automatically. 
The dataset should contain multi-turn conversations with tool calls.\n", + "\n", + "### Custom reward function\n", + "\n", + "For custom tasks, pass `rollout_fn`, `reward_fn`, and `tasks` instead of `data_path`:\n", + "\n", + "```python\n", + "params = {\n", + " \"model_path\": \"Qwen/Qwen3-4B\",\n", + " \"ckpt_output_dir\": \"/mnt/shared/grpo_custom\",\n", + " \"backend\": \"art\",\n", + " # Custom rollout and reward\n", + " \"rollout_fn\": my_rollout_function,\n", + " \"reward_fn\": my_reward_function,\n", + " \"tasks\": my_task_list,\n", + " # Hyperparameters\n", + " \"num_iterations\": 10,\n", + " \"group_size\": 8,\n", + "}\n", + "```\n", + "\n", + "Note: Custom functions passed through `func_args` must be picklable (module-level, not closures) because ART uses `multiprocessing.spawn`.\n", + "\n", + "## Appendix: Key Parameters\n", + "\n", + "| Parameter | Description | Default |\n", + "|-----------|-------------|----------|\n", + "| `model_path` | HuggingFace model ID or local path | (required) |\n", + "| `ckpt_output_dir` | Where to save checkpoints and metrics | (required) |\n", + "| `backend` | `\"art\"` (single-GPU) or `\"verl\"` (multi-GPU) | `\"verl\"` |\n", + "| `data_path` | HuggingFace dataset for built-in tool-call mode | `None` |\n", + "| `data_config` | HuggingFace dataset config (conversation format) | `\"Qwen3\"` |\n", + "| `num_iterations` | Number of GRPO iterations | `15` |\n", + "| `group_size` | Responses per prompt for comparison | `8` |\n", + "| `prompt_batch_size` | Unique prompts per iteration | `100` |\n", + "| `n_train` | Training examples from dataset | `5000` |\n", + "| `learning_rate` | LoRA learning rate | `1e-5` |\n", + "| `lora_r` | LoRA rank | `16` |\n", + "| `lora_alpha` | LoRA alpha | `8` |\n", + "| `gpu_memory_utilization` | vLLM GPU memory fraction | `0.45` |\n", + "| `temperature` | Sampling temperature for rollouts | `0.7` |\n", + "| `max_tokens` | Max tokens per generated response | `512` |" + ] + } + ], + "metadata": { 
+ "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.12.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file From b24c8aedf61e5047354d4d8468772299cbc26e99 Mon Sep 17 00:00:00 2001 From: Fiona-Waters Date: Thu, 21 May 2026 14:38:06 +0100 Subject: [PATCH 2/2] Add interactive GRPO notebook and update README - Add grpo_lora-interactive-notebook.ipynb for single-GPU GRPO training directly in the workbench - Include "Test the Trained Model" section with dynamic checkpoint loading - Update README to document both interactive and distributed execution modes - Update workbench requirements for interactive mode (8 CPU, 64Gi memory) - Remove custom reward function appendix from both notebooks (out of scope) Signed-off-by: Fiona-Waters Co-authored-by: Cursor --- examples/fine-tuning/grpo/README.md | 41 +- .../grpo/grpo_lora-interactive-notebook.ipynb | 530 ++++++++++++++++++ .../grpo/grpo_lora-kubeflow-trainjob.ipynb | 257 +++++++-- 3 files changed, 766 insertions(+), 62 deletions(-) create mode 100644 examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb diff --git a/examples/fine-tuning/grpo/README.md b/examples/fine-tuning/grpo/README.md index bd2dcc54..1a7023d2 100644 --- a/examples/fine-tuning/grpo/README.md +++ b/examples/fine-tuning/grpo/README.md @@ -22,11 +22,16 @@ The ART backend time-shares a single GPU between vLLM (inference) and Unsloth (t The example uses the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments. -## Execution mode +## Execution Modes -GRPO runs as a **single-GPU TrainJob** submitted via the Kubeflow SDK. ART is single-GPU by design and manages its own vLLM subprocess internally. 
+This example provides two notebooks: -The notebook submits a `TrainJob` from a lightweight workbench, and the training runs on a dedicated GPU pod managed by Kubeflow Trainer. +| Mode | Notebook | Description | +| --------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | +| **Interactive** | [`grpo_lora-interactive-notebook.ipynb`](./grpo_lora-interactive-notebook.ipynb) | Runs GRPO training directly on the workbench GPU. Best for exploration, prototyping, and quick iteration. | +| **Distributed** | [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) | Submits a Kubeflow TrainJob from a lightweight workbench. Training runs on a dedicated GPU pod. Best for production workloads. | + +ART is single-GPU by design and manages its own vLLM subprocess internally. To learn more about execution modes for other algorithms, see the [fine-tuning execution modes overview](../README.md#execution-modes). @@ -50,21 +55,21 @@ to seamlessly run fine-tuning jobs. 
### Workbench Requirements -| Image Type | Use Case | GPU | CPU | Memory | -|------------|----------|-----|-----|--------| -| Training \| Jupyter \| PyTorch \| CPU Python | Job submission and monitoring | None | 2 cores | 8Gi | -| Training \| Jupyter \| PyTorch \| CUDA \| Python | Job submission + model evaluation | 1× GPU | 2 cores | 8Gi | +| Image Type | Use Case | GPU | CPU | Memory | +| ------------------------------------------------ | --------------------------------------------------- | -------------------- | ------- | ------ | +| Training \| Jupyter \| PyTorch \| CPU Python | Distributed mode: job submission and monitoring | None | 2 cores | 8Gi | +| Training \| Jupyter \| PyTorch \| CUDA \| Python | Interactive mode, or distributed + model evaluation | 1× GPU (40GB+ VRAM) | 8 cores | 64Gi | > [!NOTE] > -> - The workbench does not run the training itself — it submits a TrainJob and monitors progress. -> - A GPU on the workbench is only needed if you want to load and test the fine-tuned LoRA adapter after training completes. +> - **Distributed mode**: The workbench submits a TrainJob and monitors progress. A GPU on the workbench is only needed to test the fine-tuned LoRA adapter after training completes. +> - **Interactive mode**: Training runs directly on the workbench GPU. The workbench needs an A100, H100, or L40S (40GB+ VRAM) with sufficient CPU and memory. ### Training Pod Requirements -| Component | GPU | GPU Type | CPU | Memory | -|-----------|-----|----------|-----|--------| -| Training Pod | 1× GPU | NVIDIA A100, H100, or L40S (40GB+ VRAM) | 8 cores | 64Gi | +| Component | GPU | GPU Type | CPU | Memory | +| ------------ | ------ | ---------------------------------------- | ------- | ------ | +| Training Pod | 1× GPU | NVIDIA A100, H100, or L40S (40GB+ VRAM) | 8 cores | 64Gi | > [!NOTE] > @@ -74,9 +79,9 @@ to seamlessly run fine-tuning jobs. 
### Storage Requirements -| Purpose | Size | Access Mode | Storage Class | Notes | -|---------|------|-------------|---------------|-------| -| Shared Storage (PVC) total | 50Gi (Example Default) | RWX | Dynamic provisioner required | Shared between workbench and training pod | +| Purpose | Size | Access Mode | Storage Class | Notes | +| -------------------------- | ---------------------- | ----------- | ---------------------------- | ----------------------------------------- | +| Shared Storage (PVC) total | 50Gi (Example Default) | RWX | Dynamic provisioner required | Shared between workbench and training pod | > [!NOTE] > @@ -135,10 +140,12 @@ to seamlessly run fine-tuning jobs. ![](../images/05.png) -### Running the example notebook +### Running the example notebooks - From the workbench, clone this repository: `https://github.com/red-hat-data-services/red-hat-ai-examples.git` -- Navigate to the `examples/fine-tuning/grpo` directory and open the [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) notebook. 
+- Navigate to the `examples/fine-tuning/grpo` directory and open the notebook for your preferred execution mode: + - **Interactive**: [`grpo_lora-interactive-notebook.ipynb`](./grpo_lora-interactive-notebook.ipynb) — runs training directly on the workbench GPU + - **Distributed**: [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) — submits a TrainJob via Kubeflow Trainer > [!NOTE] > diff --git a/examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb b/examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb new file mode 100644 index 00000000..4fc0479f --- /dev/null +++ b/examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb @@ -0,0 +1,530 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LoRA GRPO Fine-Tuning with Training Hub (Interactive)\n", + "\n", + "This notebook demonstrates how to run **GRPO (Group Relative Policy Optimization)** training directly inside a GPU-enabled workbench using Training Hub's ART backend. GRPO is a reinforcement learning technique that teaches a model to improve its outputs by comparing groups of responses and reinforcing the better ones.\n", + "\n", + "## What is GRPO?\n", + "\n", + "GRPO is a reinforcement learning from verifiable rewards (RLVR) algorithm that:\n", + "- Generates multiple candidate responses per prompt\n", + "- Scores them with a reward function (e.g. tool-call correctness)\n", + "- Uses the group's relative ranking to compute advantage signals\n", + "- Updates LoRA adapter weights via policy gradient with group normalization\n", + "\n", + "Each training iteration has two phases:\n", + "\n", + "1. **Rollout phase** — vLLM generates candidate responses and a reward function scores them\n", + "2. 
**Train phase** — Unsloth updates the LoRA adapter weights using the advantage signals\n", + "\n", + "ART time-shares a single GPU between vLLM (inference) and Unsloth (training) via `gpu_memory_utilization`.\n", + "\n", + "## Training Task: Tool-Call Verification\n", + "\n", + "We use the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments.\n", + "\n", + "## Hardware Requirements\n", + "\n", + "| Resource | Minimum |\n", + "|----------|---------|\n", + "| **GPU** | 1× NVIDIA A100, H100, or L40S (40GB+ VRAM) |\n", + "| **CPU** | 8 cores |\n", + "| **Memory** | 64Gi |\n", + "\n", + "- ART manages GPU memory sharing between vLLM and Unsloth via `gpu_memory_utilization`\n", + "- The default `gpu_memory_utilization=0.45` reserves 45% for vLLM, leaving the rest for Unsloth\n", + "- CPU and memory requirements are high because ART spawns multiple processes (vLLM engine + Unsloth training) alongside the Jupyter server\n", + "\n", + "> **Note:** This notebook runs training directly inside the workbench. For job submission via Kubeflow Trainer, see [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "First, import the required dependencies." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "from pathlib import Path\n", + "\n", + "import torch" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "if torch.cuda.is_available():\n", + " gpu_name = torch.cuda.get_device_name(0)\n", + " gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)\n", + " print(f\"GPU: {gpu_name}\")\n", + " print(f\"Memory: {gpu_memory:.1f} GB\")\n", + "else:\n", + " print(\"WARNING: No GPU detected. GRPO requires a GPU with 40GB+ VRAM.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Explore the Dataset\n", + "\n", + "We use the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset with the `Qwen3` configuration. This dataset contains multi-turn tool-calling conversations where the model must produce syntactically correct function calls.\n", + "\n", + "ART's built-in reward function handles scoring automatically — it checks that the model output is a valid tool call with the correct function name and arguments." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"Agent-Ark/Toucan-1.5M\", \"Qwen3\", split=\"train\", streaming=True)\n", + "\n", + "print(\"Dataset: Agent-Ark/Toucan-1.5M (Qwen3 config)\")\n", + "print(\"=\" * 60)\n", + "\n", + "sample = next(iter(dataset))\n", + "print(f\"\\nColumns: {list(sample.keys())}\")\n", + "\n", + "messages = sample[\"messages\"]\n", + "if isinstance(messages, str):\n", + " messages = json.loads(messages)\n", + "\n", + "print(f\"\\nSample conversation ({len(messages)} turns):\")\n", + "print(\"-\" * 60)\n", + "for msg in messages[:4]:\n", + " role = msg[\"role\"]\n", + " content = (\n", + " msg[\"content\"][:200] + \"...\" if len(msg[\"content\"]) > 200 else msg[\"content\"]\n", + " )\n", + " print(f\"\\n[{role}]\\n{content}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Configure Training Parameters\n", + "\n", + "Key parameters:\n", + "\n", + "### GRPO Parameters\n", + "- **num_iterations**: Number of GRPO iterations (each = rollout + train phase)\n", + "- **group_size**: Responses generated per prompt for comparison\n", + "- **prompt_batch_size**: Unique prompts sampled per iteration\n", + "\n", + "### LoRA Parameters\n", + "- **lora_r**: Rank of the low-rank matrices (higher = more capacity, more memory)\n", + "- **lora_alpha**: Scaling factor\n", + "\n", + "### vLLM Parameters\n", + "- **gpu_memory_utilization**: Fraction of GPU memory reserved for vLLM inference (rest for Unsloth training)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Model and dataset\n", + "MODEL_PATH = \"Qwen/Qwen3-4B\"\n", + "DATA_PATH = \"Agent-Ark/Toucan-1.5M\"\n", + "DATA_CONFIG = \"Qwen3\"\n", + "\n", + "# Output directory (local to workbench)\n", + "OUTPUT_DIR = Path(\"./grpo_output\")\n", + 
"OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n", + "\n", + "# GRPO hyperparameters\n", + "NUM_ITERATIONS = 5 # Number of GRPO iterations (each = rollout + train)\n", + "GROUP_SIZE = 4 # Responses generated per prompt for comparison\n", + "PROMPT_BATCH_SIZE = 50 # Unique prompts per iteration\n", + "N_TRAIN = 200 # Total training examples from dataset\n", + "LEARNING_RATE = 1e-5\n", + "\n", + "# LoRA configuration\n", + "LORA_R = 16\n", + "LORA_ALPHA = 8\n", + "\n", + "# vLLM configuration — controls GPU memory split between vLLM and Unsloth\n", + "GPU_MEMORY_UTILIZATION = 0.45\n", + "\n", + "print(\"Training Configuration:\")\n", + "print(f\" Model: {MODEL_PATH}\")\n", + "print(f\" Dataset: {DATA_PATH} ({DATA_CONFIG})\")\n", + "print(f\" Output: {OUTPUT_DIR.resolve()}\")\n", + "print(f\" Iterations: {NUM_ITERATIONS}\")\n", + "print(f\" Group size: {GROUP_SIZE}\")\n", + "print(f\" Prompts/iteration: {PROMPT_BATCH_SIZE}\")\n", + "print(f\" Training examples: {N_TRAIN}\")\n", + "print(f\" LoRA rank: {LORA_R}, alpha: {LORA_ALPHA}\")\n", + "print(f\" GPU memory for vLLM: {GPU_MEMORY_UTILIZATION * 100:.0f}%\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Run GRPO Training\n", + "\n", + "Call `training_hub.lora_grpo()` directly. ART manages the single-GPU workflow internally:\n", + "1. Loads the base model with LoRA adapters via Unsloth\n", + "2. Starts a vLLM inference process for rollout generation\n", + "3. Alternates between rollout (generate + score) and train (update weights) phases\n", + "\n", + "> **Note:** Training may take 10-30 minutes depending on configuration." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from training_hub import lora_grpo\n", + "\n", + "result = lora_grpo(\n", + " model_path=MODEL_PATH,\n", + " data_path=DATA_PATH,\n", + " data_config=DATA_CONFIG,\n", + " ckpt_output_dir=str(OUTPUT_DIR),\n", + " backend=\"art\",\n", + " # GRPO hyperparameters\n", + " num_iterations=NUM_ITERATIONS,\n", + " group_size=GROUP_SIZE,\n", + " prompt_batch_size=PROMPT_BATCH_SIZE,\n", + " n_train=N_TRAIN,\n", + " learning_rate=LEARNING_RATE,\n", + " # LoRA\n", + " lora_r=LORA_R,\n", + " lora_alpha=LORA_ALPHA,\n", + " # vLLM\n", + " gpu_memory_utilization=GPU_MEMORY_UTILIZATION,\n", + ")\n", + "\n", + "print(f\"\\nTraining completed — status: {result.get('status')}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Inspect Training Metrics\n", + "\n", + "GRPO writes two lines per iteration to `training_metrics.jsonl`:\n", + "- **Rollout phase**: `mean_reward`, `full_match_rate`\n", + "- **Train phase**: `loss`, `grad_norm`, `entropy`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "metrics_file = OUTPUT_DIR / \"training_metrics.jsonl\"\n", + "\n", + "if metrics_file.exists():\n", + " print(\"Training metrics:\")\n", + " print(\"=\" * 80)\n", + " with open(metrics_file) as f:\n", + " for line in f:\n", + " entry = json.loads(line)\n", + " phase = entry.get(\"phase\", \"unknown\")\n", + " step = entry.get(\"step\", \"?\")\n", + " if phase == \"rollout\":\n", + " print(\n", + " f\"Step {step} [rollout] mean_reward={entry.get('mean_reward', float('nan')):.4f} \"\n", + " f\"full_match_rate={entry.get('full_match_rate', float('nan')):.4f}\"\n", + " )\n", + " elif phase == \"train\":\n", + " print(\n", + " f\"Step {step} [train] loss={entry.get('loss', 'N/A')} \"\n", + " f\"grad_norm={entry.get('grad_norm', 'N/A')} \"\n", + " f\"entropy={entry.get('entropy', 'N/A')}\"\n", + " )\n", +
"else:\n", + " print(\"No metrics file found — training may not have completed.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results_file = OUTPUT_DIR / \"training_results.json\"\n", + "\n", + "if results_file.exists():\n", + " with open(results_file) as f:\n", + " results = json.load(f)\n", + " print(\"Training results:\")\n", + " print(json.dumps(results, indent=2, default=str))\n", + "else:\n", + " print(\"No results file yet — training may not have completed.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Plot Reward Curve\n", + "\n", + "Visualize how the model's reward improves over GRPO iterations." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "metrics_file = OUTPUT_DIR / \"training_metrics.jsonl\"\n", + "\n", + "if not metrics_file.exists():\n", + " print(\"No metrics file found.\")\n", + "else:\n", + " steps, rewards, match_rates = [], [], []\n", + " with open(metrics_file) as f:\n", + " for line in f:\n", + " entry = json.loads(line)\n", + " if entry.get(\"phase\") == \"rollout\":\n", + " steps.append(entry[\"step\"])\n", + " rewards.append(entry[\"mean_reward\"])\n", + " match_rates.append(entry[\"full_match_rate\"])\n", + "\n", + " import matplotlib.pyplot as plt\n", + "\n", + " fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))\n", + "\n", + " ax1.plot(steps, rewards, marker=\"o\")\n", + " ax1.set_xlabel(\"Iteration\")\n", + " ax1.set_ylabel(\"Mean Reward\")\n", + " ax1.set_title(\"GRPO Training: Mean Reward\")\n", + " ax1.grid(True, alpha=0.3)\n", + "\n", + " ax2.plot(steps, match_rates, marker=\"o\", color=\"tab:orange\")\n", + " ax2.set_xlabel(\"Iteration\")\n", + " ax2.set_ylabel(\"Full Match Rate\")\n", + " ax2.set_title(\"GRPO Training: Full Match Rate\")\n", + " ax2.grid(True, alpha=0.3)\n", + "\n", + " plt.tight_layout()\n", + " plt.savefig(OUTPUT_DIR / 
\"reward_curve.png\", dpi=150)\n", + " plt.show()\n", + " print(f\"Plot saved to {OUTPUT_DIR}/reward_curve.png\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Test the Trained Model\n", + "\n", + "Load the fine-tuned LoRA adapter and test whether the model generates better tool calls after GRPO training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from unsloth import FastLanguageModel\n", + "\n", + "# ART saves checkpoints under .art//models//checkpoints//\n", + "# Find the latest checkpoint dynamically.\n", + "art_dir = OUTPUT_DIR / \".art\"\n", + "checkpoints_dirs = sorted(art_dir.rglob(\"checkpoints\"))\n", + "if not checkpoints_dirs:\n", + " raise FileNotFoundError(f\"No checkpoints found under {art_dir}\")\n", + "latest_ckpt = sorted(checkpoints_dirs[0].iterdir())[-1]\n", + "print(f\"Loading checkpoint: {latest_ckpt}\")\n", + "\n", + "model, tokenizer = FastLanguageModel.from_pretrained(\n", + " model_name=str(latest_ckpt),\n", + " max_seq_length=2048,\n", + " load_in_4bit=False,\n", + ")\n", + "FastLanguageModel.for_inference(model)\n", + "\n", + "print(\"Fine-tuned model loaded and ready for inference.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def generate_response(messages, max_tokens=512):\n", + " \"\"\"\n", + " Generate a response from the fine-tuned model.\n", + " \"\"\"\n", + " prompt = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True\n", + " )\n", + " inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n", + "\n", + " with torch.no_grad():\n", + " outputs = model.generate(\n", + " **inputs,\n", + " max_new_tokens=max_tokens,\n", + " temperature=0.7,\n", + " do_sample=True,\n", + " )\n", + "\n", + " response = tokenizer.decode(\n", + " outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True\n", + " 
)\n", + " return response" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_prompts = [\n", + " {\n", + " \"description\": \"Weather lookup\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are a helpful assistant with access to tools. \"\n", + " \"Available tools: get_weather(location: str, unit: str = 'celsius') -> dict\"\n", + " ),\n", + " },\n", + " {\"role\": \"user\", \"content\": \"What's the weather like in Dublin?\"},\n", + " ],\n", + " },\n", + " {\n", + " \"description\": \"Calculator\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are a helpful assistant with access to tools. \"\n", + " \"Available tools: calculate(expression: str) -> float\"\n", + " ),\n", + " },\n", + " {\"role\": \"user\", \"content\": \"What is 1547 * 23 + 89?\"},\n", + " ],\n", + " },\n", + " {\n", + " \"description\": \"File search\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are a helpful assistant with access to tools. \"\n", + " \"Available tools: search_files(query: str, directory: str = '.', file_type: str = None) -> list\"\n", + " ),\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Find all Python files that contain 'import torch'\",\n", + " },\n", + " ],\n", + " },\n", + "]\n", + "\n", + "print(\"Testing fine-tuned model on tool-call generation:\")\n", + "print(\"=\" * 60)\n", + "\n", + "for test in test_prompts:\n", + " print(f\"\\nTest: {test['description']}\")\n", + " print(f\"User: {test['messages'][-1]['content']}\")\n", + " response = generate_response(test[\"messages\"])\n", + " print(f\"Model: {response}\")\n", + " print(\"-\" * 60)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Save the Model (Optional)\n", + "\n", + "The LoRA checkpoint is already saved in the output directory. You can inspect the saved files or reload the model later." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Training artifacts:\")\n", + "for file in sorted(OUTPUT_DIR.rglob(\"*\")):\n", + " if file.is_file() and \".ipynb_checkpoints\" not in str(file):\n", + " size_mb = file.stat().st_size / (1024 * 1024)\n", + " rel = file.relative_to(OUTPUT_DIR)\n", + " print(f\" {rel}: {size_mb:.2f} MB\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# To reload the model later, find the latest checkpoint:\n", + "#\n", + "# from pathlib import Path\n", + "# from unsloth import FastLanguageModel\n", + "#\n", + "# art_dir = Path(\"./grpo_output/.art\")\n", + "# latest_ckpt = sorted(sorted(art_dir.rglob(\"checkpoints\"))[0].iterdir())[-1]\n", + "# model, tokenizer = FastLanguageModel.from_pretrained(\n", + "# model_name=str(latest_ckpt),\n", + "# max_seq_length=2048,\n", + "# load_in_4bit=False,\n", + "# )\n", + "# FastLanguageModel.for_inference(model)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary\n\nIn this notebook, we:\n\n1. **Explored** the Agent-Ark/Toucan-1.5M tool-calling dataset\n2. **Configured** GRPO hyperparameters and LoRA settings\n3. **Trained** a LoRA adapter using GRPO with Training Hub's ART backend\n4. **Inspected** training metrics (reward curve, match rates)\n5. 
**Tested** the fine-tuned model on tool-call generation examples\n\n### Key Takeaways\n\n- **GRPO** improves model outputs through reinforcement learning from verifiable rewards\n- **ART** efficiently time-shares a single GPU between vLLM inference and Unsloth training\n- The `gpu_memory_utilization` parameter controls the memory split between inference and training\n- Training Hub handles the full GRPO loop: rollout generation, reward scoring, and weight updates\n\n### Next Steps\n\n- Increase `num_iterations` and `n_train` for better results (more compute time)\n- Experiment with `group_size` (larger groups give better advantage estimates but use more memory)\n- For distributed or long-running training, use the [TrainJob notebook](./grpo_lora-kubeflow-trainjob.ipynb)\n- Add W&B logging for experiment tracking" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb index 16d2d973..e5b1b932 100644 --- a/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb +++ b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb @@ -50,9 +50,13 @@ "outputs": [], "source": [ "# LORA_GRPO support is not yet released in the workbench image.\n", + "\n", "# Install from midstream until the SDK is pre-installed in the image.\n", + "\n", "# TODO: replace with midstream branch once the PR is merged, then remove\n", + "\n", "# this line entirely when the image includes the updated SDK.\n", + "\n", "%pip install \"kubeflow @ git+https://github.com/Fiona-Waters/kubeflow-sdk.git@add-rl-algo\" --force-reinstall --quiet" ] }, @@ -82,14 +86,23 @@ "outputs": [], "source": [ "api_server = \"\"\n", + "\n", "token = \"\"\n", + "\n", 
"PVC_NAME = \"shared\" # Replace if the shared RWX storage name is different than in the example provided\n", + "\n", "PVC_PATH = \"shared\" # Replace if the shared RWX storage path is different than in the example provided\n", + "\n", "configuration = k8s.Configuration()\n", + "\n", "configuration.host = api_server\n", + "\n", "# Un-comment if your cluster API server uses a self-signed certificate or an un-trusted CA\n", + "\n", "# configuration.verify_ssl = False\n", + "\n", "configuration.api_key = {\"authorization\": f\"Bearer {token}\"}\n", + "\n", "api_client = k8s.ApiClient(configuration)" ] }, @@ -121,24 +134,39 @@ "outputs": [], "source": [ "# Model and dataset\n", + "\n", "MODEL_PATH = \"Qwen/Qwen3-4B\"\n", + "\n", "DATA_PATH = \"Agent-Ark/Toucan-1.5M\"\n", + "\n", "DATA_CONFIG = \"Qwen3\"\n", "\n", + "\n", "# GRPO hyperparameters\n", + "\n", "NUM_ITERATIONS = 5 # Number of GRPO iterations (each = rollout + train)\n", + "\n", "GROUP_SIZE = 4 # Responses generated per prompt for comparison\n", + "\n", "PROMPT_BATCH_SIZE = 50 # Unique prompts per iteration\n", + "\n", "N_TRAIN = 200 # Total training examples from dataset\n", + "\n", "LEARNING_RATE = 1e-5\n", "\n", + "\n", "# LoRA configuration\n", + "\n", "LORA_R = 16\n", + "\n", "LORA_ALPHA = 8\n", "\n", + "\n", "# vLLM configuration\n", + "\n", "GPU_MEMORY_UTILIZATION = 0.45 # Fraction of GPU memory for vLLM (rest for Unsloth)\n", "\n", + "\n", "params = {\n", " \"model_path\": MODEL_PATH,\n", " \"data_path\": DATA_PATH,\n", @@ -158,15 +186,23 @@ " \"gpu_memory_utilization\": GPU_MEMORY_UTILIZATION,\n", "}\n", "\n", + "\n", "print(\"Training Configuration:\")\n", + "\n", "print(f\" Model: {MODEL_PATH}\")\n", + "\n", "print(f\" Dataset: {DATA_PATH} ({DATA_CONFIG})\")\n", + "\n", "print(f\" Iterations: {NUM_ITERATIONS}, Group size: {GROUP_SIZE}\")\n", + "\n", "print(\n", " f\" Prompts/iter: {PROMPT_BATCH_SIZE}, Rollouts/iter: {PROMPT_BATCH_SIZE * GROUP_SIZE}\"\n", ")\n", + "\n", "print(f\" Training samples: 
{N_TRAIN}\")\n", + "\n", "print(f\" LoRA Rank: {LORA_R}, Alpha: {LORA_ALPHA}\")\n", + "\n", "print(f\" GPU Memory Utilization: {GPU_MEMORY_UTILIZATION}\")" ] }, @@ -190,6 +226,7 @@ "outputs": [], "source": [ "backend_cfg = KubernetesBackendConfig(client_configuration=api_client.configuration)\n", + "\n", "client = TrainerClient(backend_cfg)" ] }, @@ -209,6 +246,7 @@ "for runtime in client.list_runtimes():\n", " if runtime.name == \"training-hub\":\n", " th_runtime = runtime\n", + "\n", " print(\"Found runtime: \" + str(th_runtime))" ] }, @@ -234,6 +272,7 @@ "\n", "cache_root = f\"/mnt/{PVC_PATH}/.cache/huggingface\"\n", "\n", + "\n", "job_name = client.train(\n", " trainer=TrainingHubTrainer(\n", " algorithm=TrainingHubAlgorithms.LORA_GRPO,\n", @@ -286,6 +325,7 @@ " runtime=th_runtime,\n", ")\n", "\n", + "\n", "print(job_name)" ] }, @@ -303,7 +343,9 @@ "outputs": [], "source": [ "# Follow job logs\n", + "\n", "logs = client.get_job_logs(job_name, follow=True)\n", + "\n", "for line in logs:\n", " print(line)" ] @@ -332,27 +374,37 @@ "\n", "OUTPUT_DIR = f\"/opt/app-root/src/{PVC_PATH}/grpo_output\"\n", "\n", + "\n", "# Read training metrics (two lines per iteration: rollout + train)\n", + "\n", "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n", + "\n", "if os.path.exists(metrics_file):\n", " print(\"Training metrics:\")\n", + "\n", " print(\"=\" * 80)\n", + "\n", " with open(metrics_file) as f:\n", " for line in f:\n", " entry = json.loads(line)\n", + "\n", " phase = entry.get(\"phase\", \"unknown\")\n", + "\n", " step = entry.get(\"step\", \"?\")\n", + "\n", " if phase == \"rollout\":\n", " print(\n", " f\"Step {step} [rollout] mean_reward={entry.get('mean_reward', 'N/A'):.4f} \"\n", " f\"full_match_rate={entry.get('full_match_rate', 'N/A'):.4f}\"\n", " )\n", + "\n", " elif phase == \"train\":\n", " print(\n", " f\"Step {step} [train] loss={entry.get('loss', 'N/A')} \"\n", " f\"grad_norm={entry.get('grad_norm', 'N/A')} \"\n", " 
f\"entropy={entry.get('entropy', 'N/A')}\"\n", " )\n", + "\n", "else:\n", " print(\"No metrics file yet -- training may not have started.\")" ] @@ -364,12 +416,17 @@ "outputs": [], "source": [ "# Read training results summary\n", + "\n", "results_file = os.path.join(OUTPUT_DIR, \"training_results.json\")\n", + "\n", "if os.path.exists(results_file):\n", " with open(results_file) as f:\n", " results = json.load(f)\n", + "\n", " print(\"Training results:\")\n", + "\n", " print(json.dumps(results, indent=2, default=str))\n", + "\n", "else:\n", " print(\"No results file yet -- training may not have completed.\")" ] @@ -391,16 +448,22 @@ "import os\n", "\n", "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n", + "\n", "if not os.path.exists(metrics_file):\n", " print(\"No metrics file found.\")\n", + "\n", "else:\n", " steps, rewards, match_rates = [], [], []\n", + "\n", " with open(metrics_file) as f:\n", " for line in f:\n", " entry = json.loads(line)\n", + "\n", " if entry.get(\"phase\") == \"rollout\":\n", " steps.append(entry[\"step\"])\n", + "\n", " rewards.append(entry[\"mean_reward\"])\n", + "\n", " match_rates.append(entry[\"full_match_rate\"])\n", "\n", " try:\n", @@ -409,23 +472,36 @@ " fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", " ax1.plot(steps, rewards, marker=\"o\")\n", + "\n", " ax1.set_xlabel(\"Iteration\")\n", + "\n", " ax1.set_ylabel(\"Mean Reward\")\n", + "\n", " ax1.set_title(\"GRPO Training: Mean Reward\")\n", + "\n", " ax1.grid(True, alpha=0.3)\n", "\n", " ax2.plot(steps, match_rates, marker=\"o\", color=\"tab:orange\")\n", + "\n", " ax2.set_xlabel(\"Iteration\")\n", + "\n", " ax2.set_ylabel(\"Full Match Rate\")\n", + "\n", " ax2.set_title(\"GRPO Training: Full Match Rate\")\n", + "\n", " ax2.grid(True, alpha=0.3)\n", "\n", " plt.tight_layout()\n", + "\n", " plt.savefig(os.path.join(OUTPUT_DIR, \"reward_curve.png\"), dpi=150)\n", + "\n", " plt.show()\n", + "\n", " print(f\"Plot saved to 
{OUTPUT_DIR}/reward_curve.png\")\n", + "\n", " except ImportError:\n", " print(\"matplotlib not installed -- printing raw values instead.\")\n", + "\n", " for s, r, m in zip(steps, rewards, match_rates, strict=True):\n", " print(f\" Step {s}: reward={r:.4f}, match_rate={m:.4f}\")" ] @@ -434,7 +510,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 5. Cleanup" + "## 5. Test the Trained Model\n", + "\n", + "Load the fine-tuned LoRA adapter from the shared PVC and test whether the model generates better tool calls after GRPO training.\n", + "\n", + "> **Note:** This section requires a GPU on the workbench to run inference." ] }, { @@ -443,60 +523,147 @@ "metadata": {}, "outputs": [], "source": [ - "client.delete_job(job_name)\n", - "print(f\"TrainJob '{job_name}' deleted.\")" + "from pathlib import Path\n", + "\n", + "from unsloth import FastLanguageModel\n", + "\n", + "# Find the latest checkpoint on the shared PVC\n", + "\n", + "art_dir = Path(OUTPUT_DIR) / \".art\"\n", + "\n", + "checkpoints_dirs = sorted(art_dir.rglob(\"checkpoints\"))\n", + "\n", + "if not checkpoints_dirs:\n", + " raise FileNotFoundError(f\"No checkpoints found under {art_dir}\")\n", + "\n", + "latest_ckpt = sorted(checkpoints_dirs[0].iterdir())[-1]\n", + "\n", + "print(f\"Loading checkpoint: {latest_ckpt}\")\n", + "\n", + "\n", + "model, tokenizer = FastLanguageModel.from_pretrained(\n", + " model_name=str(latest_ckpt),\n", + " max_seq_length=2048,\n", + " load_in_4bit=False,\n", + ")\n", + "\n", + "FastLanguageModel.for_inference(model)\n", + "\n", + "\n", + "print(\"Fine-tuned model loaded and ready for inference.\")" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "## Appendix: Dataset Format\n", + "import torch\n", "\n", - "### Built-in tool-call mode (default)\n", "\n", - "When using `data_path` with a HuggingFace dataset like `Agent-Ark/Toucan-1.5M`, ART's built-in reward function handles 
tool-call verification automatically. The dataset should contain multi-turn conversations with tool calls.\n", + "def generate_response(messages, max_tokens=512):\n", + " \"\"\"Generate a response from the fine-tuned model.\"\"\"\n", "\n", - "### Custom reward function\n", + " prompt = tokenizer.apply_chat_template(\n", + " messages, tokenize=False, add_generation_prompt=True\n", + " )\n", "\n", - "For custom tasks, pass `rollout_fn`, `reward_fn`, and `tasks` instead of `data_path`:\n", + " inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n", "\n", - "```python\n", - "params = {\n", - " \"model_path\": \"Qwen/Qwen3-4B\",\n", - " \"ckpt_output_dir\": \"/mnt/shared/grpo_custom\",\n", - " \"backend\": \"art\",\n", - " # Custom rollout and reward\n", - " \"rollout_fn\": my_rollout_function,\n", - " \"reward_fn\": my_reward_function,\n", - " \"tasks\": my_task_list,\n", - " # Hyperparameters\n", - " \"num_iterations\": 10,\n", - " \"group_size\": 8,\n", - "}\n", - "```\n", - "\n", - "Note: Custom functions passed through `func_args` must be picklable (module-level, not closures) because ART uses `multiprocessing.spawn`.\n", - "\n", - "## Appendix: Key Parameters\n", - "\n", - "| Parameter | Description | Default |\n", - "|-----------|-------------|----------|\n", - "| `model_path` | HuggingFace model ID or local path | (required) |\n", - "| `ckpt_output_dir` | Where to save checkpoints and metrics | (required) |\n", - "| `backend` | `\"art\"` (single-GPU) or `\"verl\"` (multi-GPU) | `\"verl\"` |\n", - "| `data_path` | HuggingFace dataset for built-in tool-call mode | `None` |\n", - "| `data_config` | HuggingFace dataset config (conversation format) | `\"Qwen3\"` |\n", - "| `num_iterations` | Number of GRPO iterations | `15` |\n", - "| `group_size` | Responses per prompt for comparison | `8` |\n", - "| `prompt_batch_size` | Unique prompts per iteration | `100` |\n", - "| `n_train` | Training examples from dataset | `5000` |\n", - "| `learning_rate` | 
LoRA learning rate | `1e-5` |\n", - "| `lora_r` | LoRA rank | `16` |\n", - "| `lora_alpha` | LoRA alpha | `8` |\n", - "| `gpu_memory_utilization` | vLLM GPU memory fraction | `0.45` |\n", - "| `temperature` | Sampling temperature for rollouts | `0.7` |\n", - "| `max_tokens` | Max tokens per generated response | `512` |" + " with torch.no_grad():\n", + " outputs = model.generate(\n", + " **inputs,\n", + " max_new_tokens=max_tokens,\n", + " temperature=0.7,\n", + " do_sample=True,\n", + " )\n", + "\n", + " response = tokenizer.decode(\n", + " outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True\n", + " )\n", + "\n", + " return response" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test_prompts = [\n", + " {\n", + " \"description\": \"Weather lookup\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are a helpful assistant with access to tools. \"\n", + " \"Available tools: get_weather(location: str, unit: str = 'celsius') -> dict\"\n", + " ),\n", + " },\n", + " {\"role\": \"user\", \"content\": \"What's the weather like in Dublin?\"},\n", + " ],\n", + " },\n", + " {\n", + " \"description\": \"Calculator\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are a helpful assistant with access to tools. \"\n", + " \"Available tools: calculate(expression: str) -> float\"\n", + " ),\n", + " },\n", + " {\"role\": \"user\", \"content\": \"What is 1547 * 23 + 89?\"},\n", + " ],\n", + " },\n", + " {\n", + " \"description\": \"File search\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": (\n", + " \"You are a helpful assistant with access to tools. 
\"\n", + " \"Available tools: search_files(query: str, directory: str = '.', file_type: str = None) -> list\"\n", + " ),\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Find all Python files that contain 'import torch'\",\n", + " },\n", + " ],\n", + " },\n", + "]\n", + "\n", + "print(\"Testing fine-tuned model on tool-call generation:\")\n", + "print(\"=\" * 60)\n", + "\n", + "for test in test_prompts:\n", + " print(f\"\\nTest: {test['description']}\")\n", + " print(f\"User: {test['messages'][-1]['content']}\")\n", + " response = generate_response(test[\"messages\"])\n", + " print(f\"Model: {response}\")\n", + " print(\"-\" * 60)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Cleanup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client.delete_job(job_name)\n", + "\n", + "print(f\"TrainJob '{job_name}' deleted.\")" ] } ],