From d20798185e02c2abb486d97637c39c0021a99621 Mon Sep 17 00:00:00 2001
From: Fiona-Waters <fiwaters6@gmail.com>
Date: Fri, 15 May 2026 16:39:34 +0100
Subject: [PATCH 1/2] Adding GRPO/ART example

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>
---
 examples/fine-tuning/README.md                |   2 +
 examples/fine-tuning/grpo/README.md           | 149 +++++
 .../grpo/grpo_lora-kubeflow-trainjob.ipynb    | 516 ++++++++++++++++++
 3 files changed, 667 insertions(+)
 create mode 100644 examples/fine-tuning/grpo/README.md
 create mode 100644 examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb

diff --git a/examples/fine-tuning/README.md b/examples/fine-tuning/README.md
index 5caa5ddf..24f142e0 100644
--- a/examples/fine-tuning/README.md
+++ b/examples/fine-tuning/README.md
@@ -7,6 +7,7 @@ All examples are built primarily on top of **Training Hub** algorithms running o
 - **SFT (Supervised Fine-Tuning)**
 - **OSFT (Orthogonal Subspace Fine-Tuning)**
 - **LoRA + SFT (Low-Rank Adaptation)**
+- **GRPO (Group Relative Policy Optimization)**
 
 For detailed algorithm documentation and configuration options, see the upstream [Training Hub documentation](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/tree/main).
 
@@ -93,6 +94,7 @@ Training is offloaded to **dedicated training pods** managed by **Kubeflow Train
 - [SFT fine-tuning example](sft/README.md)
 - [OSFT fine-tuning example](osft/README.md)
 - [LoRA fine-tuning example](lora/README.md)
+- [GRPO fine-tuning example](grpo/README.md) (single-GPU TrainJob only)
 
 ---
 
diff --git a/examples/fine-tuning/grpo/README.md b/examples/fine-tuning/grpo/README.md
new file mode 100644
index 00000000..bd2dcc54
--- /dev/null
+++ b/examples/fine-tuning/grpo/README.md
@@ -0,0 +1,149 @@
+# GRPO Fine-Tuning with Training Hub
+
+This example provides an overview of Training Hub's [GRPO (Group Relative Policy Optimization)](https://github.com/Red-Hat-AI-Innovation-Team/training_hub?tab=readme-ov-file#grpo) capabilities and demonstrates how to use them with Red Hat OpenShift AI.
+
+## What is GRPO?
+
+GRPO is a reinforcement learning from verifiable rewards (RLVR) algorithm that improves a model's outputs by comparing groups of responses and reinforcing the better ones:
+
+- Generates multiple candidate responses per prompt
+- Scores them with a reward function (e.g. tool-call correctness)
+- Compares each response's reward with the rest of its group to compute advantage signals
+- Updates LoRA adapter weights via policy gradient with group normalization
+
+Each training iteration has two phases:
+
+1. **Rollout phase** — vLLM generates candidate responses and a reward function scores them
+2. **Train phase** — Unsloth updates the LoRA adapter weights using the advantage signals
+
+The ART backend time-shares a single GPU between vLLM (inference) and Unsloth (training) via `gpu_memory_utilization`.
+
+### Training Task: Tool-Call Verification
+
+The example uses the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments.
+
+## Execution mode
+
+GRPO runs as a **single-GPU TrainJob** submitted via the Kubeflow SDK. ART is single-GPU by design and manages its own vLLM subprocess internally.
+
+The notebook submits a `TrainJob` from a lightweight workbench, and the training runs on a dedicated GPU pod managed by Kubeflow Trainer.
+
+To learn more about execution modes for other algorithms, see the [fine-tuning execution modes overview](../README.md#execution-modes).
+
+## RHOAI compatibility
+
+This example is compatible with RHOAI version 3.5.
+
+## Requirements
+
+- An OpenShift cluster with OpenShift AI (RHOAI 3.5) installed:
+  - The `dashboard` and `workbenches` components enabled
+  - The `trainer` component enabled
+- A worker node with an NVIDIA GPU (Ampere-based or newer, 40GB+ VRAM).
+- A dynamic storage provisioner supporting RWX PVC provisioning. Talk to your cluster administrator about RWX storage options.
+
+## Hardware requirements
+
+For the workbench image, the example was run with `Training | Jupyter | PyTorch | CUDA | Python` and `Training | Jupyter | PyTorch | CPU Python`.
+Each is a single image that serves as both the training runtime and the Jupyter notebook environment, and it comes with the dependencies
+needed to run fine-tuning jobs pre-installed.
+
+### Workbench Requirements
+
+| Image Type | Use Case | GPU | CPU | Memory |
+|------------|----------|-----|-----|--------|
+| Training \| Jupyter \| PyTorch \| CPU Python | Job submission and monitoring | None | 2 cores | 8Gi |
+| Training \| Jupyter \| PyTorch \| CUDA \| Python | Job submission + model evaluation | 1× GPU | 2 cores | 8Gi |
+
+> [!NOTE]
+>
+> - The workbench does not run the training itself — it submits a TrainJob and monitors progress.
+> - A GPU on the workbench is only needed if you want to load and test the fine-tuned LoRA adapter after training completes.
+
+### Training Pod Requirements
+
+| Component | GPU | GPU Type | CPU | Memory |
+|-----------|-----|----------|-----|--------|
+| Training Pod | 1× GPU | NVIDIA A100, H100, or L40S (40GB+ VRAM) | 8 cores | 64Gi |
+
+> [!NOTE]
+>
+> - GRPO requires a single GPU with at least 40GB VRAM. The `gpu_memory_utilization` parameter (default `0.45`) controls how much GPU memory is reserved for vLLM inference, with the remainder available for Unsloth training.
+> - CPU and memory requirements scale with model size and group size. The above values suit the example configuration (Qwen3-4B, group_size=4).
+> - The training pod is configured from the `client.train()` call within the notebook.
+
+### Storage Requirements
+
+| Purpose | Size | Access Mode | Storage Class | Notes |
+|---------|------|-------------|---------------|-------|
+| Shared Storage (PVC) total | 50Gi (Example Default) | RWX | Dynamic provisioner required | Shared between workbench and training pod |
+
+> [!NOTE]
+>
+> - Storage can be created in the `Create Workbench` view on the RHOAI platform; however, a dynamic RWX provisioner must be configured before you create shared file storage in RHOAI.
+> - Shared storage is required — the training pod writes checkpoints and metrics to the PVC, and the workbench reads them for inspection and plotting.
+
+## GRPO-specific considerations
+
+- **`/dev/shm` volume**: vLLM requires a memory-backed `/dev/shm` for inter-process communication. The notebook configures this automatically via a `PodSpecOverride` that mounts an `emptyDir` with `medium: Memory`.
+- **`gpu_memory_utilization`**: Controls the vLLM/Unsloth memory split on the single GPU. The default `0.45` reserves 45% for vLLM inference and leaves the rest for Unsloth training. Adjust based on your model size and available VRAM.
+- **Hugging Face token**: Not strictly required for public models (e.g. Qwen3-4B) but recommended to avoid rate limits. Set `HF_TOKEN` in the job's environment variables if needed.
+
+## Setup
+
+### Setup Workbench
+
+**Step 1.** Access the OpenShift AI dashboard, for example from the top navigation bar menu:
+
+![](../images/01.png)
+
+**Step 2.** Log in, then go to **_Data Science Projects_** and create a project:
+
+![](../images/02.png)
+
+**Step 3.** Once the project is created, click on **_Create a workbench_**:
+
+![](../images/03.png)
+
+**Step 4.** Select the appropriate workbench image (see the options above):
+
+![](../images/04a.png)
+
+**Step 5.** You may want to create a **Hardware Profile** with GPU support, similar to the one below:
+
+![](../images/04b.png)
+
+**Step 6.** Select the Hardware profile you want to use:
+
+![](../images/04c.png)
+
+> [!NOTE]
+> A GPU on the workbench is only needed if you want to test the fine-tuned model after training. The workbench itself only submits and monitors the TrainJob.
+
+**Step 7.** Create **shared storage** that will be shared between the workbench and the training pod. Make sure it uses a storage class with RWX capability:
+
+![](../images/04d.png)
+
+> [!NOTE]
+> Alternatively, you can attach existing shared storage if you already have it.
+
+**Step 8.** Review the storage configuration and click "Create workbench":
+
+![](../images/04e.png)
+
+**Step 9.** From the "Workbenches" page, click on **_Open_** when the workbench you've just created becomes ready:
+
+![](../images/05.png)
+
+### Running the example notebook
+
+- From the workbench, clone this repository: `https://github.com/red-hat-data-services/red-hat-ai-examples.git`
+- Navigate to the `examples/fine-tuning/grpo` directory and open the [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) notebook.
+
+> [!NOTE]
+>
+> - You will need a Hugging Face token if you use gated models (e.g., Llama models).
+>   Set the `HF_TOKEN` environment variable in your job configuration.
+>   You can skip the token for non-gated models such as Qwen3-4B, which this example uses by default.
+
+You can now proceed with the instructions from the notebook. Enjoy!
diff --git a/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb
new file mode 100644
index 00000000..16d2d973
--- /dev/null
+++ b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb
@@ -0,0 +1,516 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LoRA GRPO Fine-Tuning with Kubeflow Trainer and Training Hub on OpenShift AI\n",
+    "\n",
+    "This notebook demonstrates how to run **GRPO (Group Relative Policy Optimization)** training on OpenShift AI using a Kubeflow `TrainJob`. GRPO is a reinforcement learning technique that teaches a model to improve its outputs by comparing groups of responses and reinforcing the better ones.\n",
+    "\n",
+    "## What is GRPO?\n",
+    "\n",
+    "GRPO is a reinforcement learning from verifiable rewards (RLVR) algorithm that:\n",
+    "- Generates multiple candidate responses per prompt\n",
+    "- Scores them with a reward function (e.g. tool-call correctness)\n",
+    "- Compares each response's reward with the rest of its group to compute advantage signals\n",
+    "- Updates LoRA adapter weights via policy gradient with group normalization\n",
+    "\n",
+    "Each training iteration has two phases:\n",
+    "\n",
+    "1. **Rollout phase**: vLLM generates candidate responses and a reward function scores them\n",
+    "2. **Train phase**: Unsloth updates the LoRA adapter weights using the advantage signals\n",
+    "\n",
+    "ART time-shares a single GPU between vLLM (inference) and Unsloth (training).\n",
+    "\n",
+    "## Training Task: Tool-Call Verification\n",
+    "\n",
+    "We use the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments.\n",
+    "\n",
+    "## Hardware Requirements\n",
+    "\n",
+    "- **1x GPU with 40GB+ VRAM** (A100, H100, or L40S recommended)\n",
+    "- ART manages GPU memory sharing between vLLM and Unsloth via `gpu_memory_utilization`\n",
+    "- The default `gpu_memory_utilization=0.45` reserves 45% for vLLM, leaving the rest for Unsloth"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "First, import the required dependencies."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# LORA_GRPO support is not yet released in the workbench image.\n",
+    "# Install the SDK from a development fork until it is pre-installed in the image.\n",
+    "# TODO: switch to the midstream branch once the PR is merged, then remove\n",
+    "# this line entirely when the image includes the updated SDK.\n",
+    "%pip install \"kubeflow @ git+https://github.com/Fiona-Waters/kubeflow-sdk.git@add-rl-algo\" --force-reinstall --quiet"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from kubeflow.common.types import KubernetesBackendConfig\n",
+    "from kubeflow.trainer import TrainerClient\n",
+    "from kubeflow.trainer.rhai import TrainingHubAlgorithms, TrainingHubTrainer\n",
+    "from kubernetes import client as k8s"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Authenticate to your OpenShift Cluster"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "api_server = \"<REPLACE WITH OPENSHIFT SERVER>\"\n",
+    "token = \"<REPLACE WITH OPENSHIFT TOKEN>\"\n",
+    "PVC_NAME = \"shared\"  # Replace if the shared RWX storage name is different than in the example provided\n",
+    "PVC_PATH = \"shared\"  # Replace if the shared RWX storage path is different than in the example provided\n",
+    "configuration = k8s.Configuration()\n",
+    "configuration.host = api_server\n",
+    "# Un-comment if your cluster API server uses a self-signed certificate or an un-trusted CA\n",
+    "# configuration.verify_ssl = False\n",
+    "configuration.api_key = {\"authorization\": f\"Bearer {token}\"}\n",
+    "api_client = k8s.ApiClient(configuration)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Configure Training Parameters\n",
+    "\n",
+    "Key parameters:\n",
+    "\n",
+    "### GRPO Parameters\n",
+    "- **num_iterations**: Number of GRPO iterations (each = rollout + train phase)\n",
+    "- **group_size**: Responses generated per prompt for comparison\n",
+    "- **prompt_batch_size**: Unique prompts sampled per iteration\n",
+    "\n",
+    "### LoRA Parameters\n",
+    "- **lora_r**: Rank of the low-rank matrices (higher = more capacity, more memory)\n",
+    "- **lora_alpha**: Scaling factor\n",
+    "\n",
+    "### vLLM Parameters\n",
+    "- **gpu_memory_utilization**: Fraction of GPU memory reserved for vLLM inference (rest for Unsloth training)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Model and dataset\n",
+    "MODEL_PATH = \"Qwen/Qwen3-4B\"\n",
+    "DATA_PATH = \"Agent-Ark/Toucan-1.5M\"\n",
+    "DATA_CONFIG = \"Qwen3\"\n",
+    "\n",
+    "# GRPO hyperparameters\n",
+    "NUM_ITERATIONS = 5  # Number of GRPO iterations (each = rollout + train)\n",
+    "GROUP_SIZE = 4  # Responses generated per prompt for comparison\n",
+    "PROMPT_BATCH_SIZE = 50  # Unique prompts per iteration\n",
+    "N_TRAIN = 200  # Total training examples from dataset\n",
+    "LEARNING_RATE = 1e-5\n",
+    "\n",
+    "# LoRA configuration\n",
+    "LORA_R = 16\n",
+    "LORA_ALPHA = 8\n",
+    "\n",
+    "# vLLM configuration\n",
+    "GPU_MEMORY_UTILIZATION = 0.45  # Fraction of GPU memory for vLLM (rest for Unsloth)\n",
+    "\n",
+    "params = {\n",
+    "    \"model_path\": MODEL_PATH,\n",
+    "    \"data_path\": DATA_PATH,\n",
+    "    \"data_config\": DATA_CONFIG,\n",
+    "    \"ckpt_output_dir\": f\"/mnt/{PVC_PATH}/grpo_output\",\n",
+    "    \"backend\": \"art\",\n",
+    "    # GRPO hyperparameters\n",
+    "    \"num_iterations\": NUM_ITERATIONS,\n",
+    "    \"group_size\": GROUP_SIZE,\n",
+    "    \"prompt_batch_size\": PROMPT_BATCH_SIZE,\n",
+    "    \"n_train\": N_TRAIN,\n",
+    "    \"learning_rate\": LEARNING_RATE,\n",
+    "    # LoRA\n",
+    "    \"lora_r\": LORA_R,\n",
+    "    \"lora_alpha\": LORA_ALPHA,\n",
+    "    # vLLM\n",
+    "    \"gpu_memory_utilization\": GPU_MEMORY_UTILIZATION,\n",
+    "}\n",
+    "\n",
+    "print(\"Training Configuration:\")\n",
+    "print(f\"  Model: {MODEL_PATH}\")\n",
+    "print(f\"  Dataset: {DATA_PATH} ({DATA_CONFIG})\")\n",
+    "print(f\"  Iterations: {NUM_ITERATIONS}, Group size: {GROUP_SIZE}\")\n",
+    "print(\n",
+    "    f\"  Prompts/iter: {PROMPT_BATCH_SIZE}, Rollouts/iter: {PROMPT_BATCH_SIZE * GROUP_SIZE}\"\n",
+    ")\n",
+    "print(f\"  Training samples: {N_TRAIN}\")\n",
+    "print(f\"  LoRA Rank: {LORA_R}, Alpha: {LORA_ALPHA}\")\n",
+    "print(f\"  GPU Memory Utilization: {GPU_MEMORY_UTILIZATION}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Training with LORA GRPO and Kubeflow Trainer\n",
+    "\n",
+    "Launch a training job via Kubeflow Trainer with configured hyperparameters.\n",
+    "\n",
+    "`LORA_GRPO` uses `python` (not `torchrun`) as the entrypoint because ART manages\n",
+    "its own subprocess via `multiprocessing.spawn`. The SDK also wraps the training call\n",
+    "with a `__main__` guard to prevent re-execution in the spawned process."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "backend_cfg = KubernetesBackendConfig(client_configuration=api_client.configuration)\n",
+    "client = TrainerClient(backend_cfg)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Find the Training Hub ClusterTrainingRuntime"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "th_runtime = next((r for r in client.list_runtimes() if r.name == \"training-hub\"), None)\n",
+    "if th_runtime is None:\n",
+    "    raise RuntimeError(\"ClusterTrainingRuntime 'training-hub' not found\")\n",
+    "print(\"Found runtime: \" + str(th_runtime))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Submit the TrainJob"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from kubeflow.trainer.options.kubernetes import (\n",
+    "    ContainerOverride,\n",
+    "    PodSpecOverride,\n",
+    "    PodTemplateOverride,\n",
+    "    PodTemplateOverrides,\n",
+    ")\n",
+    "\n",
+    "cache_root = f\"/mnt/{PVC_PATH}/.cache/huggingface\"\n",
+    "\n",
+    "job_name = client.train(\n",
+    "    trainer=TrainingHubTrainer(\n",
+    "        algorithm=TrainingHubAlgorithms.LORA_GRPO,\n",
+    "        func_args=params,\n",
+    "        env={\n",
+    "            \"HF_HOME\": cache_root,\n",
+    "            \"TRANSFORMERS_ATTN_BACKEND\": \"sdpa\",\n",
+    "        },\n",
+    "        resources_per_node={\n",
+    "            \"cpu\": 8,\n",
+    "            \"memory\": \"64Gi\",\n",
+    "            \"nvidia.com/gpu\": 1,\n",
+    "        },\n",
+    "    ),\n",
+    "    options=[\n",
+    "        PodTemplateOverrides(\n",
+    "            PodTemplateOverride(\n",
+    "                target_jobs=[\"node\"],\n",
+    "                spec=PodSpecOverride(\n",
+    "                    volumes=[\n",
+    "                        {\n",
+    "                            \"name\": \"work\",\n",
+    "                            \"persistentVolumeClaim\": {\"claimName\": PVC_NAME},\n",
+    "                        },\n",
+    "                        {\n",
+    "                            \"name\": \"dshm\",\n",
+    "                            \"emptyDir\": {\"medium\": \"Memory\"},\n",
+    "                        },\n",
+    "                    ],\n",
+    "                    containers=[\n",
+    "                        ContainerOverride(\n",
+    "                            name=\"node\",\n",
+    "                            volume_mounts=[\n",
+    "                                {\n",
+    "                                    \"name\": \"work\",\n",
+    "                                    \"mountPath\": f\"/mnt/{PVC_PATH}\",\n",
+    "                                    \"readOnly\": False,\n",
+    "                                },\n",
+    "                                {\n",
+    "                                    \"name\": \"dshm\",\n",
+    "                                    \"mountPath\": \"/dev/shm\",\n",
+    "                                },\n",
+    "                            ],\n",
+    "                        ),\n",
+    "                    ],\n",
+    "                ),\n",
+    "            )\n",
+    "        )\n",
+    "    ],\n",
+    "    runtime=th_runtime,\n",
+    ")\n",
+    "\n",
+    "print(job_name)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Follow job logs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Follow job logs\n",
+    "logs = client.get_job_logs(job_name, follow=True)\n",
+    "for line in logs:\n",
+    "    print(line)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Inspect Results\n",
+    "\n",
+    "After training completes, check the metrics file and training results on the PVC.\n",
+    "\n",
+    "GRPO writes two lines per iteration to `training_metrics.jsonl`:\n",
+    "- **Rollout phase**: `mean_reward`, `full_match_rate`\n",
+    "- **Train phase**: `loss`, `grad_norm`, `entropy`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import os\n",
+    "\n",
+    "OUTPUT_DIR = f\"/opt/app-root/src/{PVC_PATH}/grpo_output\"\n",
+    "\n",
+    "# Read training metrics (two lines per iteration: rollout + train)\n",
+    "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n",
+    "if os.path.exists(metrics_file):\n",
+    "    print(\"Training metrics:\")\n",
+    "    print(\"=\" * 80)\n",
+    "    with open(metrics_file) as f:\n",
+    "        for line in f:\n",
+    "            entry = json.loads(line)\n",
+    "            phase = entry.get(\"phase\", \"unknown\")\n",
+    "            step = entry.get(\"step\", \"?\")\n",
+    "            if phase == \"rollout\":\n",
+    "                print(\n",
+    "                    f\"Step {step} [rollout]  mean_reward={entry.get('mean_reward', float('nan')):.4f}  \"\n",
+    "                    f\"full_match_rate={entry.get('full_match_rate', float('nan')):.4f}\"\n",
+    "                )\n",
+    "            elif phase == \"train\":\n",
+    "                print(\n",
+    "                    f\"Step {step} [train]    loss={entry.get('loss', 'N/A')}  \"\n",
+    "                    f\"grad_norm={entry.get('grad_norm', 'N/A')}  \"\n",
+    "                    f\"entropy={entry.get('entropy', 'N/A')}\"\n",
+    "                )\n",
+    "else:\n",
+    "    print(\"No metrics file yet -- training may not have started.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Read training results summary\n",
+    "results_file = os.path.join(OUTPUT_DIR, \"training_results.json\")\n",
+    "if os.path.exists(results_file):\n",
+    "    with open(results_file) as f:\n",
+    "        results = json.load(f)\n",
+    "    print(\"Training results:\")\n",
+    "    print(json.dumps(results, indent=2, default=str))\n",
+    "else:\n",
+    "    print(\"No results file yet -- training may not have completed.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Plot Reward Curve"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import os\n",
+    "\n",
+    "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n",
+    "if not os.path.exists(metrics_file):\n",
+    "    print(\"No metrics file found.\")\n",
+    "else:\n",
+    "    steps, rewards, match_rates = [], [], []\n",
+    "    with open(metrics_file) as f:\n",
+    "        for line in f:\n",
+    "            entry = json.loads(line)\n",
+    "            if entry.get(\"phase\") == \"rollout\":\n",
+    "                steps.append(entry[\"step\"])\n",
+    "                rewards.append(entry[\"mean_reward\"])\n",
+    "                match_rates.append(entry[\"full_match_rate\"])\n",
+    "\n",
+    "    try:\n",
+    "        import matplotlib.pyplot as plt\n",
+    "\n",
+    "        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))\n",
+    "\n",
+    "        ax1.plot(steps, rewards, marker=\"o\")\n",
+    "        ax1.set_xlabel(\"Iteration\")\n",
+    "        ax1.set_ylabel(\"Mean Reward\")\n",
+    "        ax1.set_title(\"GRPO Training: Mean Reward\")\n",
+    "        ax1.grid(True, alpha=0.3)\n",
+    "\n",
+    "        ax2.plot(steps, match_rates, marker=\"o\", color=\"tab:orange\")\n",
+    "        ax2.set_xlabel(\"Iteration\")\n",
+    "        ax2.set_ylabel(\"Full Match Rate\")\n",
+    "        ax2.set_title(\"GRPO Training: Full Match Rate\")\n",
+    "        ax2.grid(True, alpha=0.3)\n",
+    "\n",
+    "        plt.tight_layout()\n",
+    "        plt.savefig(os.path.join(OUTPUT_DIR, \"reward_curve.png\"), dpi=150)\n",
+    "        plt.show()\n",
+    "        print(f\"Plot saved to {OUTPUT_DIR}/reward_curve.png\")\n",
+    "    except ImportError:\n",
+    "        print(\"matplotlib not installed -- printing raw values instead.\")\n",
+    "        for s, r, m in zip(steps, rewards, match_rates, strict=True):\n",
+    "            print(f\"  Step {s}: reward={r:.4f}, match_rate={m:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Cleanup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client.delete_job(job_name)\n",
+    "print(f\"TrainJob '{job_name}' deleted.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Appendix: Dataset Format\n",
+    "\n",
+    "### Built-in tool-call mode (default)\n",
+    "\n",
+    "When using `data_path` with a HuggingFace dataset like `Agent-Ark/Toucan-1.5M`, ART's built-in reward function handles tool-call verification automatically. The dataset should contain multi-turn conversations with tool calls.\n",
+    "\n",
+    "### Custom reward function\n",
+    "\n",
+    "For custom tasks, pass `rollout_fn`, `reward_fn`, and `tasks` instead of `data_path`:\n",
+    "\n",
+    "```python\n",
+    "params = {\n",
+    "    \"model_path\": \"Qwen/Qwen3-4B\",\n",
+    "    \"ckpt_output_dir\": \"/mnt/shared/grpo_custom\",\n",
+    "    \"backend\": \"art\",\n",
+    "    # Custom rollout and reward\n",
+    "    \"rollout_fn\": my_rollout_function,\n",
+    "    \"reward_fn\": my_reward_function,\n",
+    "    \"tasks\": my_task_list,\n",
+    "    # Hyperparameters\n",
+    "    \"num_iterations\": 10,\n",
+    "    \"group_size\": 8,\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "Note: Custom functions passed through `func_args` must be picklable (module-level, not closures) because ART uses `multiprocessing.spawn`.\n",
+    "\n",
+    "## Appendix: Key Parameters\n",
+    "\n",
+    "| Parameter | Description | Default |\n",
+    "|-----------|-------------|----------|\n",
+    "| `model_path` | HuggingFace model ID or local path | (required) |\n",
+    "| `ckpt_output_dir` | Where to save checkpoints and metrics | (required) |\n",
+    "| `backend` | `\"art\"` (single-GPU) or `\"verl\"` (multi-GPU) | `\"verl\"` |\n",
+    "| `data_path` | HuggingFace dataset for built-in tool-call mode | `None` |\n",
+    "| `data_config` | HuggingFace dataset config (conversation format) | `\"Qwen3\"` |\n",
+    "| `num_iterations` | Number of GRPO iterations | `15` |\n",
+    "| `group_size` | Responses per prompt for comparison | `8` |\n",
+    "| `prompt_batch_size` | Unique prompts per iteration | `100` |\n",
+    "| `n_train` | Training examples from dataset | `5000` |\n",
+    "| `learning_rate` | LoRA learning rate | `1e-5` |\n",
+    "| `lora_r` | LoRA rank | `16` |\n",
+    "| `lora_alpha` | LoRA alpha | `8` |\n",
+    "| `gpu_memory_utilization` | vLLM GPU memory fraction | `0.45` |\n",
+    "| `temperature` | Sampling temperature for rollouts | `0.7` |\n",
+    "| `max_tokens` | Max tokens per generated response | `512` |"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file

From b24c8aedf61e5047354d4d8468772299cbc26e99 Mon Sep 17 00:00:00 2001
From: Fiona-Waters <fiwaters6@gmail.com>
Date: Thu, 21 May 2026 14:38:06 +0100
Subject: [PATCH 2/2] Add interactive GRPO notebook and update README

- Add grpo_lora-interactive-notebook.ipynb for single-GPU GRPO training
  directly in the workbench
- Include "Test the Trained Model" section with dynamic checkpoint loading
- Update README to document both interactive and distributed execution modes
- Update workbench requirements for interactive mode (8 CPU, 64Gi memory)
- Remove custom reward function appendix from both notebooks (out of scope)

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
---
 examples/fine-tuning/grpo/README.md           |  41 +-
 .../grpo/grpo_lora-interactive-notebook.ipynb | 530 ++++++++++++++++++
 .../grpo/grpo_lora-kubeflow-trainjob.ipynb    | 257 +++++++--
 3 files changed, 766 insertions(+), 62 deletions(-)
 create mode 100644 examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb

diff --git a/examples/fine-tuning/grpo/README.md b/examples/fine-tuning/grpo/README.md
index bd2dcc54..1a7023d2 100644
--- a/examples/fine-tuning/grpo/README.md
+++ b/examples/fine-tuning/grpo/README.md
@@ -22,11 +22,16 @@ The ART backend time-shares a single GPU between vLLM (inference) and Unsloth (t
 
 The example uses the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments.
 
-## Execution mode
+## Execution Modes
 
-GRPO runs as a **single-GPU TrainJob** submitted via the Kubeflow SDK. ART is single-GPU by design and manages its own vLLM subprocess internally.
+This example provides two notebooks:
 
-The notebook submits a `TrainJob` from a lightweight workbench, and the training runs on a dedicated GPU pod managed by Kubeflow Trainer.
+| Mode            | Notebook                                                                        | Description                                                                                                                    |
+| --------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| **Interactive** | [`grpo_lora-interactive-notebook.ipynb`](./grpo_lora-interactive-notebook.ipynb) | Runs GRPO training directly on the workbench GPU. Best for exploration, prototyping, and quick iteration.                      |
+| **Distributed** | [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb)      | Submits a Kubeflow TrainJob from a lightweight workbench. Training runs on a dedicated GPU pod. Best for production workloads. |
+
+ART is single-GPU by design and manages its own vLLM subprocess internally.
 
 To learn more about execution modes for other algorithms, see the [fine-tuning execution modes overview](../README.md#execution-modes).
 
@@ -50,21 +55,21 @@ to seamlessly run fine-tuning jobs.
 
 ### Workbench Requirements
 
-| Image Type | Use Case | GPU | CPU | Memory |
-|------------|----------|-----|-----|--------|
-| Training \| Jupyter \| PyTorch \| CPU Python | Job submission and monitoring | None | 2 cores | 8Gi |
-| Training \| Jupyter \| PyTorch \| CUDA \| Python | Job submission + model evaluation | 1× GPU | 2 cores | 8Gi |
+| Image Type                                       | Use Case                                            | GPU                  | CPU     | Memory |
+| ------------------------------------------------ | --------------------------------------------------- | -------------------- | ------- | ------ |
+| Training \| Jupyter \| PyTorch \| CPU Python     | Distributed mode: job submission and monitoring     | None                 | 2 cores | 8Gi    |
+| Training \| Jupyter \| PyTorch \| CUDA \| Python | Interactive mode, or distributed + model evaluation | 1× GPU (40GB+ VRAM) | 8 cores | 64Gi   |
 
 > [!NOTE]
 >
-> - The workbench does not run the training itself — it submits a TrainJob and monitors progress.
-> - A GPU on the workbench is only needed if you want to load and test the fine-tuned LoRA adapter after training completes.
+> - **Distributed mode**: The workbench submits a TrainJob and monitors progress. A GPU on the workbench is only needed to test the fine-tuned LoRA adapter after training completes.
+> - **Interactive mode**: Training runs directly on the workbench GPU. The workbench needs an A100, H100, or L40S (40GB+ VRAM) with sufficient CPU and memory.
 
 ### Training Pod Requirements
 
-| Component | GPU | GPU Type | CPU | Memory |
-|-----------|-----|----------|-----|--------|
-| Training Pod | 1× GPU | NVIDIA A100, H100, or L40S (40GB+ VRAM) | 8 cores | 64Gi |
+| Component    | GPU    | GPU Type                                 | CPU     | Memory |
+| ------------ | ------ | ---------------------------------------- | ------- | ------ |
+| Training Pod | 1× GPU | NVIDIA A100, H100, or L40S (40GB+ VRAM) | 8 cores | 64Gi   |
 
 > [!NOTE]
 >
@@ -74,9 +79,9 @@ to seamlessly run fine-tuning jobs.
 
 ### Storage Requirements
 
-| Purpose | Size | Access Mode | Storage Class | Notes |
-|---------|------|-------------|---------------|-------|
-| Shared Storage (PVC) total | 50Gi (Example Default) | RWX | Dynamic provisioner required | Shared between workbench and training pod |
+| Purpose                    | Size                   | Access Mode | Storage Class                | Notes                                     |
+| -------------------------- | ---------------------- | ----------- | ---------------------------- | ----------------------------------------- |
+| Shared Storage (PVC) total | 50Gi (Example Default) | RWX         | Dynamic provisioner required | Shared between workbench and training pod |
 
 > [!NOTE]
 >
@@ -135,10 +140,12 @@ to seamlessly run fine-tuning jobs.
 
 ![](../images/05.png)
 
-### Running the example notebook
+### Running the example notebooks
 
 - From the workbench, clone this repository: `https://github.com/red-hat-data-services/red-hat-ai-examples.git`
-- Navigate to the `examples/fine-tuning/grpo` directory and open the [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) notebook.
+- Navigate to the `examples/fine-tuning/grpo` directory and open the notebook for your preferred execution mode:
+  - **Interactive**: [`grpo_lora-interactive-notebook.ipynb`](./grpo_lora-interactive-notebook.ipynb) — runs training directly on the workbench GPU
+  - **Distributed**: [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb) — submits a TrainJob via Kubeflow Trainer
 
 > [!NOTE]
 >
diff --git a/examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb b/examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb
new file mode 100644
index 00000000..4fc0479f
--- /dev/null
+++ b/examples/fine-tuning/grpo/grpo_lora-interactive-notebook.ipynb
@@ -0,0 +1,530 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LoRA GRPO Fine-Tuning with Training Hub (Interactive)\n",
+    "\n",
+    "This notebook demonstrates how to run **GRPO (Group Relative Policy Optimization)** training directly inside a GPU-enabled workbench using Training Hub's ART backend. GRPO is a reinforcement learning technique that teaches a model to improve its outputs by comparing groups of responses and reinforcing the better ones.\n",
+    "\n",
+    "## What is GRPO?\n",
+    "\n",
+    "GRPO is a reinforcement learning from verifiable rewards (RLVR) algorithm that:\n",
+    "- Generates multiple candidate responses per prompt\n",
+    "- Scores them with a reward function (e.g. tool-call correctness)\n",
+    "- Uses each response's reward relative to the group mean to compute advantage signals\n",
+    "- Updates LoRA adapter weights via policy gradient with group normalization\n",
+    "\n",
+    "Each training iteration has two phases:\n",
+    "\n",
+    "1. **Rollout phase** — vLLM generates candidate responses and a reward function scores them\n",
+    "2. **Train phase** — Unsloth updates the LoRA adapter weights using the advantage signals\n",
+    "\n",
+    "ART time-shares a single GPU between vLLM (inference) and Unsloth (training) via `gpu_memory_utilization`.\n",
+    "\n",
+    "## Training Task: Tool-Call Verification\n",
+    "\n",
+    "We use the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset, which contains tool-calling conversations. The reward function verifies that the model produces syntactically correct tool calls with the expected function name and arguments.\n",
+    "\n",
+    "## Hardware Requirements\n",
+    "\n",
+    "| Resource | Minimum |\n",
+    "|----------|---------|\n",
+    "| **GPU** | 1× NVIDIA A100, H100, or L40S (40GB+ VRAM) |\n",
+    "| **CPU** | 8 cores |\n",
+    "| **Memory** | 64Gi |\n",
+    "\n",
+    "- ART manages GPU memory sharing between vLLM and Unsloth via `gpu_memory_utilization`\n",
+    "- The default `gpu_memory_utilization=0.45` reserves 45% of GPU memory for vLLM, leaving the rest for Unsloth\n",
+    "- CPU and memory requirements are high because ART spawns multiple processes (vLLM engine + Unsloth training) alongside the Jupyter server\n",
+    "\n",
+    "> **Note:** This notebook runs training directly inside the workbench. For job submission via Kubeflow Trainer, see [`grpo_lora-kubeflow-trainjob.ipynb`](./grpo_lora-kubeflow-trainjob.ipynb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "First, import the required dependencies."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import torch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if torch.cuda.is_available():\n",
+    "    gpu_name = torch.cuda.get_device_name(0)\n",
+    "    gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)\n",
+    "    print(f\"GPU: {gpu_name}\")\n",
+    "    print(f\"Memory: {gpu_memory:.1f} GB\")\n",
+    "else:\n",
+    "    print(\"WARNING: No GPU detected. GRPO requires a GPU with 40GB+ VRAM.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Explore the Dataset\n",
+    "\n",
+    "We use the [Agent-Ark/Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset with the `Qwen3` configuration. This dataset contains multi-turn tool-calling conversations where the model must produce syntactically correct function calls.\n",
+    "\n",
+    "ART's built-in reward function handles scoring automatically — it checks that the model output is a valid tool call with the correct function name and arguments."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset\n",
+    "\n",
+    "dataset = load_dataset(\"Agent-Ark/Toucan-1.5M\", \"Qwen3\", split=\"train\", streaming=True)\n",
+    "\n",
+    "print(\"Dataset: Agent-Ark/Toucan-1.5M (Qwen3 config)\")\n",
+    "print(\"=\" * 60)\n",
+    "\n",
+    "sample = next(iter(dataset))\n",
+    "print(f\"\\nColumns: {list(sample.keys())}\")\n",
+    "\n",
+    "messages = sample[\"messages\"]\n",
+    "if isinstance(messages, str):\n",
+    "    messages = json.loads(messages)\n",
+    "\n",
+    "print(f\"\\nSample conversation ({len(messages)} turns):\")\n",
+    "print(\"-\" * 60)\n",
+    "for msg in messages[:4]:\n",
+    "    role = msg[\"role\"]\n",
+    "    content = (\n",
+    "        msg[\"content\"][:200] + \"...\" if len(msg[\"content\"]) > 200 else msg[\"content\"]\n",
+    "    )\n",
+    "    print(f\"\\n[{role}]\\n{content}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Configure Training Parameters\n",
+    "\n",
+    "Key parameters:\n",
+    "\n",
+    "### GRPO Parameters\n",
+    "- **num_iterations**: Number of GRPO iterations (each = rollout + train phase)\n",
+    "- **group_size**: Responses generated per prompt for comparison\n",
+    "- **prompt_batch_size**: Unique prompts sampled per iteration\n",
+    "\n",
+    "### LoRA Parameters\n",
+    "- **lora_r**: Rank of the low-rank matrices (higher = more capacity, more memory)\n",
+    "- **lora_alpha**: Scaling factor\n",
+    "\n",
+    "### vLLM Parameters\n",
+    "- **gpu_memory_utilization**: Fraction of GPU memory reserved for vLLM inference (rest for Unsloth training)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Model and dataset\n",
+    "MODEL_PATH = \"Qwen/Qwen3-4B\"\n",
+    "DATA_PATH = \"Agent-Ark/Toucan-1.5M\"\n",
+    "DATA_CONFIG = \"Qwen3\"\n",
+    "\n",
+    "# Output directory (local to workbench)\n",
+    "OUTPUT_DIR = Path(\"./grpo_output\")\n",
+    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "# GRPO hyperparameters\n",
+    "NUM_ITERATIONS = 5  # Number of GRPO iterations (each = rollout + train)\n",
+    "GROUP_SIZE = 4  # Responses generated per prompt for comparison\n",
+    "PROMPT_BATCH_SIZE = 50  # Unique prompts per iteration\n",
+    "N_TRAIN = 200  # Total training examples from dataset\n",
+    "LEARNING_RATE = 1e-5\n",
+    "\n",
+    "# LoRA configuration\n",
+    "LORA_R = 16\n",
+    "LORA_ALPHA = 8\n",
+    "\n",
+    "# vLLM configuration — controls GPU memory split between vLLM and Unsloth\n",
+    "GPU_MEMORY_UTILIZATION = 0.45\n",
+    "\n",
+    "print(\"Training Configuration:\")\n",
+    "print(f\"  Model: {MODEL_PATH}\")\n",
+    "print(f\"  Dataset: {DATA_PATH} ({DATA_CONFIG})\")\n",
+    "print(f\"  Output: {OUTPUT_DIR.resolve()}\")\n",
+    "print(f\"  Iterations: {NUM_ITERATIONS}\")\n",
+    "print(f\"  Group size: {GROUP_SIZE}\")\n",
+    "print(f\"  Prompts/iteration: {PROMPT_BATCH_SIZE}\")\n",
+    "print(f\"  Training examples: {N_TRAIN}\")\n",
+    "print(f\"  LoRA rank: {LORA_R}, alpha: {LORA_ALPHA}\")\n",
+    "print(f\"  GPU memory for vLLM: {GPU_MEMORY_UTILIZATION * 100:.0f}%\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Run GRPO Training\n",
+    "\n",
+    "Call `training_hub.lora_grpo()` directly. ART manages the single-GPU workflow internally:\n",
+    "1. Loads the base model with LoRA adapters via Unsloth\n",
+    "2. Starts a vLLM inference process for rollout generation\n",
+    "3. Alternates between rollout (generate + score) and train (update weights) phases\n",
+    "\n",
+    "> **Note:** Training may take 10-30 minutes depending on configuration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from training_hub import lora_grpo\n",
+    "\n",
+    "result = lora_grpo(\n",
+    "    model_path=MODEL_PATH,\n",
+    "    data_path=DATA_PATH,\n",
+    "    data_config=DATA_CONFIG,\n",
+    "    ckpt_output_dir=str(OUTPUT_DIR),\n",
+    "    backend=\"art\",\n",
+    "    # GRPO hyperparameters\n",
+    "    num_iterations=NUM_ITERATIONS,\n",
+    "    group_size=GROUP_SIZE,\n",
+    "    prompt_batch_size=PROMPT_BATCH_SIZE,\n",
+    "    n_train=N_TRAIN,\n",
+    "    learning_rate=LEARNING_RATE,\n",
+    "    # LoRA\n",
+    "    lora_r=LORA_R,\n",
+    "    lora_alpha=LORA_ALPHA,\n",
+    "    # vLLM\n",
+    "    gpu_memory_utilization=GPU_MEMORY_UTILIZATION,\n",
+    ")\n",
+    "\n",
+    "print(f\"\\nTraining completed — status: {result.get('status')}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Inspect Training Metrics\n",
+    "\n",
+    "GRPO writes two lines per iteration to `training_metrics.jsonl`:\n",
+    "- **Rollout phase**: `mean_reward`, `full_match_rate`\n",
+    "- **Train phase**: `loss`, `grad_norm`, `entropy`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "metrics_file = OUTPUT_DIR / \"training_metrics.jsonl\"\n",
+    "\n",
+    "if metrics_file.exists():\n",
+    "    print(\"Training metrics:\")\n",
+    "    print(\"=\" * 80)\n",
+    "    with open(metrics_file) as f:\n",
+    "        for line in f:\n",
+    "            entry = json.loads(line)\n",
+    "            phase = entry.get(\"phase\", \"unknown\")\n",
+    "            step = entry.get(\"step\", \"?\")\n",
+    "            if phase == \"rollout\":\n",
+    "                print(\n",
+    "                    f\"Step {step} [rollout]  mean_reward={entry.get('mean_reward', float('nan')):.4f}  \"\n",
+    "                    f\"full_match_rate={entry.get('full_match_rate', float('nan')):.4f}\"\n",
+    "                )\n",
+    "            elif phase == \"train\":\n",
+    "                print(\n",
+    "                    f\"Step {step} [train]    loss={entry.get('loss', 'N/A')}  \"\n",
+    "                    f\"grad_norm={entry.get('grad_norm', 'N/A')}  \"\n",
+    "                    f\"entropy={entry.get('entropy', 'N/A')}\"\n",
+    "                )\n",
+    "else:\n",
+    "    print(\"No metrics file found — training may not have completed.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "results_file = OUTPUT_DIR / \"training_results.json\"\n",
+    "\n",
+    "if results_file.exists():\n",
+    "    with open(results_file) as f:\n",
+    "        results = json.load(f)\n",
+    "    print(\"Training results:\")\n",
+    "    print(json.dumps(results, indent=2, default=str))\n",
+    "else:\n",
+    "    print(\"No results file yet — training may not have completed.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Plot Reward Curve\n",
+    "\n",
+    "Visualize how the mean reward and full match rate change over GRPO iterations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "metrics_file = OUTPUT_DIR / \"training_metrics.jsonl\"\n",
+    "\n",
+    "if not metrics_file.exists():\n",
+    "    print(\"No metrics file found.\")\n",
+    "else:\n",
+    "    steps, rewards, match_rates = [], [], []\n",
+    "    with open(metrics_file) as f:\n",
+    "        for line in f:\n",
+    "            entry = json.loads(line)\n",
+    "            if entry.get(\"phase\") == \"rollout\":\n",
+    "                steps.append(entry[\"step\"])\n",
+    "                rewards.append(entry[\"mean_reward\"])\n",
+    "                match_rates.append(entry[\"full_match_rate\"])\n",
+    "\n",
+    "    import matplotlib.pyplot as plt\n",
+    "\n",
+    "    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))\n",
+    "\n",
+    "    ax1.plot(steps, rewards, marker=\"o\")\n",
+    "    ax1.set_xlabel(\"Iteration\")\n",
+    "    ax1.set_ylabel(\"Mean Reward\")\n",
+    "    ax1.set_title(\"GRPO Training: Mean Reward\")\n",
+    "    ax1.grid(True, alpha=0.3)\n",
+    "\n",
+    "    ax2.plot(steps, match_rates, marker=\"o\", color=\"tab:orange\")\n",
+    "    ax2.set_xlabel(\"Iteration\")\n",
+    "    ax2.set_ylabel(\"Full Match Rate\")\n",
+    "    ax2.set_title(\"GRPO Training: Full Match Rate\")\n",
+    "    ax2.grid(True, alpha=0.3)\n",
+    "\n",
+    "    plt.tight_layout()\n",
+    "    plt.savefig(OUTPUT_DIR / \"reward_curve.png\", dpi=150)\n",
+    "    plt.show()\n",
+    "    print(f\"Plot saved to {OUTPUT_DIR}/reward_curve.png\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Test the Trained Model\n",
+    "\n",
+    "Load the fine-tuned LoRA adapter and test whether the model generates better tool calls after GRPO training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from unsloth import FastLanguageModel\n",
+    "\n",
+    "# ART saves checkpoints under .art/<project>/models/<name>/checkpoints/<step>/\n",
+    "# Find the latest checkpoint dynamically.\n",
+    "art_dir = OUTPUT_DIR / \".art\"\n",
+    "checkpoints_dirs = sorted(art_dir.rglob(\"checkpoints\"))\n",
+    "if not checkpoints_dirs:\n",
+    "    raise FileNotFoundError(f\"No checkpoints found under {art_dir}\")\n",
+    "latest_ckpt = sorted(checkpoints_dirs[0].iterdir())[-1]\n",
+    "print(f\"Loading checkpoint: {latest_ckpt}\")\n",
+    "\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    model_name=str(latest_ckpt),\n",
+    "    max_seq_length=2048,\n",
+    "    load_in_4bit=False,\n",
+    ")\n",
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "print(\"Fine-tuned model loaded and ready for inference.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def generate_response(messages, max_tokens=512):\n",
+    "    \"\"\"\n",
+    "    Generate a response from the fine-tuned model.\n",
+    "    \"\"\"\n",
+    "    prompt = tokenizer.apply_chat_template(\n",
+    "        messages, tokenize=False, add_generation_prompt=True\n",
+    "    )\n",
+    "    inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
+    "\n",
+    "    with torch.no_grad():\n",
+    "        outputs = model.generate(\n",
+    "            **inputs,\n",
+    "            max_new_tokens=max_tokens,\n",
+    "            temperature=0.7,\n",
+    "            do_sample=True,\n",
+    "        )\n",
+    "\n",
+    "    response = tokenizer.decode(\n",
+    "        outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True\n",
+    "    )\n",
+    "    return response"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "test_prompts = [\n",
+    "    {\n",
+    "        \"description\": \"Weather lookup\",\n",
+    "        \"messages\": [\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": (\n",
+    "                    \"You are a helpful assistant with access to tools. \"\n",
+    "                    \"Available tools: get_weather(location: str, unit: str = 'celsius') -> dict\"\n",
+    "                ),\n",
+    "            },\n",
+    "            {\"role\": \"user\", \"content\": \"What's the weather like in Dublin?\"},\n",
+    "        ],\n",
+    "    },\n",
+    "    {\n",
+    "        \"description\": \"Calculator\",\n",
+    "        \"messages\": [\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": (\n",
+    "                    \"You are a helpful assistant with access to tools. \"\n",
+    "                    \"Available tools: calculate(expression: str) -> float\"\n",
+    "                ),\n",
+    "            },\n",
+    "            {\"role\": \"user\", \"content\": \"What is 1547 * 23 + 89?\"},\n",
+    "        ],\n",
+    "    },\n",
+    "    {\n",
+    "        \"description\": \"File search\",\n",
+    "        \"messages\": [\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": (\n",
+    "                    \"You are a helpful assistant with access to tools. \"\n",
+    "                    \"Available tools: search_files(query: str, directory: str = '.', file_type: str = None) -> list\"\n",
+    "                ),\n",
+    "            },\n",
+    "            {\n",
+    "                \"role\": \"user\",\n",
+    "                \"content\": \"Find all Python files that contain 'import torch'\",\n",
+    "            },\n",
+    "        ],\n",
+    "    },\n",
+    "]\n",
+    "\n",
+    "print(\"Testing fine-tuned model on tool-call generation:\")\n",
+    "print(\"=\" * 60)\n",
+    "\n",
+    "for test in test_prompts:\n",
+    "    print(f\"\\nTest: {test['description']}\")\n",
+    "    print(f\"User: {test['messages'][-1]['content']}\")\n",
+    "    response = generate_response(test[\"messages\"])\n",
+    "    print(f\"Model: {response}\")\n",
+    "    print(\"-\" * 60)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Save the Model (Optional)\n",
+    "\n",
+    "The LoRA checkpoint is already saved in the output directory. You can inspect the saved files or reload the model later."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Training artifacts:\")\n",
+    "for file in sorted(OUTPUT_DIR.rglob(\"*\")):\n",
+    "    if file.is_file() and \".ipynb_checkpoints\" not in str(file):\n",
+    "        size_mb = file.stat().st_size / (1024 * 1024)\n",
+    "        rel = file.relative_to(OUTPUT_DIR)\n",
+    "        print(f\"  {rel}: {size_mb:.2f} MB\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# To reload the model later, find the latest checkpoint:\n",
+    "#\n",
+    "# from pathlib import Path\n",
+    "# from unsloth import FastLanguageModel\n",
+    "#\n",
+    "# art_dir = Path(\"./grpo_output/.art\")\n",
+    "# latest_ckpt = sorted(sorted(art_dir.rglob(\"checkpoints\"))[0].iterdir())[-1]\n",
+    "# model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "#     model_name=str(latest_ckpt),\n",
+    "#     max_seq_length=2048,\n",
+    "#     load_in_4bit=False,\n",
+    "# )\n",
+    "# FastLanguageModel.for_inference(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n\nIn this notebook, we:\n\n1. **Explored** the Agent-Ark/Toucan-1.5M tool-calling dataset\n2. **Configured** GRPO hyperparameters and LoRA settings\n3. **Trained** a LoRA adapter using GRPO with Training Hub's ART backend\n4. **Inspected** training metrics (reward curve, match rates)\n5. **Tested** the fine-tuned model on tool-call generation examples\n\n### Key Takeaways\n\n- **GRPO** improves model outputs through reinforcement learning from verifiable rewards\n- **ART** efficiently time-shares a single GPU between vLLM inference and Unsloth training\n- The `gpu_memory_utilization` parameter controls the memory split between inference and training\n- Training Hub handles the full GRPO loop: rollout generation, reward scoring, and weight updates\n\n### Next Steps\n\n- Increase `num_iterations` and `n_train` for better results (more compute time)\n- Experiment with `group_size` (larger groups give better advantage estimates but use more memory)\n- For distributed or long-running training, use the [TrainJob notebook](./grpo_lora-kubeflow-trainjob.ipynb)\n- Add W&B logging for experiment tracking"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb
index 16d2d973..e5b1b932 100644
--- a/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb
+++ b/examples/fine-tuning/grpo/grpo_lora-kubeflow-trainjob.ipynb
@@ -50,9 +50,13 @@
    "outputs": [],
    "source": [
     "# LORA_GRPO support is not yet released in the workbench image.\n",
+    "\n",
     "# Install from midstream until the SDK is pre-installed in the image.\n",
+    "\n",
     "# TODO: replace with midstream branch once the PR is merged, then remove\n",
+    "\n",
     "# this line entirely when the image includes the updated SDK.\n",
+    "\n",
     "%pip install \"kubeflow @ git+https://github.com/Fiona-Waters/kubeflow-sdk.git@add-rl-algo\" --force-reinstall --quiet"
    ]
   },
@@ -82,14 +86,23 @@
    "outputs": [],
    "source": [
     "api_server = \"<REPLACE WITH OPENSHIFT SERVER>\"\n",
+    "\n",
     "token = \"<REPLACE WITH OPENSHIFT TOKEN>\"\n",
+    "\n",
     "PVC_NAME = \"shared\"  # Replace if the shared RWX storage name is different than in the example provided\n",
+    "\n",
     "PVC_PATH = \"shared\"  # Replace if the shared RWX storage path is different than in the example provided\n",
+    "\n",
     "configuration = k8s.Configuration()\n",
+    "\n",
     "configuration.host = api_server\n",
+    "\n",
     "# Un-comment if your cluster API server uses a self-signed certificate or an un-trusted CA\n",
+    "\n",
     "# configuration.verify_ssl = False\n",
+    "\n",
     "configuration.api_key = {\"authorization\": f\"Bearer {token}\"}\n",
+    "\n",
     "api_client = k8s.ApiClient(configuration)"
    ]
   },
@@ -121,24 +134,39 @@
    "outputs": [],
    "source": [
     "# Model and dataset\n",
+    "\n",
     "MODEL_PATH = \"Qwen/Qwen3-4B\"\n",
+    "\n",
     "DATA_PATH = \"Agent-Ark/Toucan-1.5M\"\n",
+    "\n",
     "DATA_CONFIG = \"Qwen3\"\n",
     "\n",
+    "\n",
     "# GRPO hyperparameters\n",
+    "\n",
     "NUM_ITERATIONS = 5  # Number of GRPO iterations (each = rollout + train)\n",
+    "\n",
     "GROUP_SIZE = 4  # Responses generated per prompt for comparison\n",
+    "\n",
     "PROMPT_BATCH_SIZE = 50  # Unique prompts per iteration\n",
+    "\n",
     "N_TRAIN = 200  # Total training examples from dataset\n",
+    "\n",
     "LEARNING_RATE = 1e-5\n",
     "\n",
+    "\n",
     "# LoRA configuration\n",
+    "\n",
     "LORA_R = 16\n",
+    "\n",
     "LORA_ALPHA = 8\n",
     "\n",
+    "\n",
     "# vLLM configuration\n",
+    "\n",
     "GPU_MEMORY_UTILIZATION = 0.45  # Fraction of GPU memory for vLLM (rest for Unsloth)\n",
     "\n",
+    "\n",
     "params = {\n",
     "    \"model_path\": MODEL_PATH,\n",
     "    \"data_path\": DATA_PATH,\n",
@@ -158,15 +186,23 @@
     "    \"gpu_memory_utilization\": GPU_MEMORY_UTILIZATION,\n",
     "}\n",
     "\n",
+    "\n",
     "print(\"Training Configuration:\")\n",
+    "\n",
     "print(f\"  Model: {MODEL_PATH}\")\n",
+    "\n",
     "print(f\"  Dataset: {DATA_PATH} ({DATA_CONFIG})\")\n",
+    "\n",
     "print(f\"  Iterations: {NUM_ITERATIONS}, Group size: {GROUP_SIZE}\")\n",
+    "\n",
     "print(\n",
     "    f\"  Prompts/iter: {PROMPT_BATCH_SIZE}, Rollouts/iter: {PROMPT_BATCH_SIZE * GROUP_SIZE}\"\n",
     ")\n",
+    "\n",
     "print(f\"  Training samples: {N_TRAIN}\")\n",
+    "\n",
     "print(f\"  LoRA Rank: {LORA_R}, Alpha: {LORA_ALPHA}\")\n",
+    "\n",
     "print(f\"  GPU Memory Utilization: {GPU_MEMORY_UTILIZATION}\")"
    ]
   },
@@ -190,6 +226,7 @@
    "outputs": [],
    "source": [
     "backend_cfg = KubernetesBackendConfig(client_configuration=api_client.configuration)\n",
+    "\n",
     "client = TrainerClient(backend_cfg)"
    ]
   },
@@ -209,6 +246,7 @@
     "for runtime in client.list_runtimes():\n",
     "    if runtime.name == \"training-hub\":\n",
     "        th_runtime = runtime\n",
+    "\n",
     "        print(\"Found runtime: \" + str(th_runtime))"
    ]
   },
@@ -234,6 +272,7 @@
     "\n",
     "cache_root = f\"/mnt/{PVC_PATH}/.cache/huggingface\"\n",
     "\n",
+    "\n",
     "job_name = client.train(\n",
     "    trainer=TrainingHubTrainer(\n",
     "        algorithm=TrainingHubAlgorithms.LORA_GRPO,\n",
@@ -286,6 +325,7 @@
     "    runtime=th_runtime,\n",
     ")\n",
     "\n",
+    "\n",
     "print(job_name)"
    ]
   },
@@ -303,7 +343,9 @@
    "outputs": [],
    "source": [
     "# Follow job logs\n",
+    "\n",
     "logs = client.get_job_logs(job_name, follow=True)\n",
+    "\n",
     "for line in logs:\n",
     "    print(line)"
    ]
@@ -332,27 +374,37 @@
     "\n",
     "OUTPUT_DIR = f\"/opt/app-root/src/{PVC_PATH}/grpo_output\"\n",
     "\n",
+    "\n",
     "# Read training metrics (two lines per iteration: rollout + train)\n",
+    "\n",
     "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n",
+    "\n",
     "if os.path.exists(metrics_file):\n",
     "    print(\"Training metrics:\")\n",
+    "\n",
     "    print(\"=\" * 80)\n",
+    "\n",
     "    with open(metrics_file) as f:\n",
     "        for line in f:\n",
     "            entry = json.loads(line)\n",
+    "\n",
     "            phase = entry.get(\"phase\", \"unknown\")\n",
+    "\n",
     "            step = entry.get(\"step\", \"?\")\n",
+    "\n",
     "            if phase == \"rollout\":\n",
     "                print(\n",
     "                    f\"Step {step} [rollout]  mean_reward={entry.get('mean_reward', 'N/A'):.4f}  \"\n",
     "                    f\"full_match_rate={entry.get('full_match_rate', 'N/A'):.4f}\"\n",
     "                )\n",
+    "\n",
     "            elif phase == \"train\":\n",
     "                print(\n",
     "                    f\"Step {step} [train]    loss={entry.get('loss', 'N/A')}  \"\n",
     "                    f\"grad_norm={entry.get('grad_norm', 'N/A')}  \"\n",
     "                    f\"entropy={entry.get('entropy', 'N/A')}\"\n",
     "                )\n",
+    "\n",
     "else:\n",
     "    print(\"No metrics file yet -- training may not have started.\")"
    ]
@@ -364,12 +416,17 @@
    "outputs": [],
    "source": [
     "# Read training results summary\n",
+    "\n",
     "results_file = os.path.join(OUTPUT_DIR, \"training_results.json\")\n",
+    "\n",
     "if os.path.exists(results_file):\n",
     "    with open(results_file) as f:\n",
     "        results = json.load(f)\n",
+    "\n",
     "    print(\"Training results:\")\n",
+    "\n",
     "    print(json.dumps(results, indent=2, default=str))\n",
+    "\n",
     "else:\n",
     "    print(\"No results file yet -- training may not have completed.\")"
    ]
@@ -391,16 +448,22 @@
     "import os\n",
     "\n",
     "metrics_file = os.path.join(OUTPUT_DIR, \"training_metrics.jsonl\")\n",
+    "\n",
     "if not os.path.exists(metrics_file):\n",
     "    print(\"No metrics file found.\")\n",
+    "\n",
     "else:\n",
     "    steps, rewards, match_rates = [], [], []\n",
+    "\n",
     "    with open(metrics_file) as f:\n",
     "        for line in f:\n",
     "            entry = json.loads(line)\n",
+    "\n",
     "            if entry.get(\"phase\") == \"rollout\":\n",
     "                steps.append(entry[\"step\"])\n",
+    "\n",
     "                rewards.append(entry[\"mean_reward\"])\n",
+    "\n",
     "                match_rates.append(entry[\"full_match_rate\"])\n",
     "\n",
     "    try:\n",
@@ -409,23 +472,36 @@
     "        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))\n",
     "\n",
     "        ax1.plot(steps, rewards, marker=\"o\")\n",
+    "\n",
     "        ax1.set_xlabel(\"Iteration\")\n",
+    "\n",
     "        ax1.set_ylabel(\"Mean Reward\")\n",
+    "\n",
     "        ax1.set_title(\"GRPO Training: Mean Reward\")\n",
+    "\n",
     "        ax1.grid(True, alpha=0.3)\n",
     "\n",
     "        ax2.plot(steps, match_rates, marker=\"o\", color=\"tab:orange\")\n",
+    "\n",
     "        ax2.set_xlabel(\"Iteration\")\n",
+    "\n",
     "        ax2.set_ylabel(\"Full Match Rate\")\n",
+    "\n",
     "        ax2.set_title(\"GRPO Training: Full Match Rate\")\n",
+    "\n",
     "        ax2.grid(True, alpha=0.3)\n",
     "\n",
     "        plt.tight_layout()\n",
+    "\n",
     "        plt.savefig(os.path.join(OUTPUT_DIR, \"reward_curve.png\"), dpi=150)\n",
+    "\n",
     "        plt.show()\n",
+    "\n",
     "        print(f\"Plot saved to {OUTPUT_DIR}/reward_curve.png\")\n",
+    "\n",
     "    except ImportError:\n",
     "        print(\"matplotlib not installed -- printing raw values instead.\")\n",
+    "\n",
     "        for s, r, m in zip(steps, rewards, match_rates, strict=True):\n",
     "            print(f\"  Step {s}: reward={r:.4f}, match_rate={m:.4f}\")"
    ]
@@ -434,7 +510,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## 5. Cleanup"
+    "## 5. Test the Trained Model\n",
+    "\n",
+    "Load the fine-tuned LoRA adapter from the shared PVC and inspect the tool calls the model generates after GRPO training.\n",
+    "\n",
+    "> **Note:** This section requires a GPU on the workbench to run inference."
    ]
   },
   {
@@ -443,60 +523,147 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "client.delete_job(job_name)\n",
-    "print(f\"TrainJob '{job_name}' deleted.\")"
+    "from pathlib import Path\n",
+    "\n",
+    "from unsloth import FastLanguageModel\n",
+    "\n",
+    "# Find the latest checkpoint on the shared PVC\n",
+    "\n",
+    "art_dir = Path(OUTPUT_DIR) / \".art\"\n",
+    "\n",
+    "checkpoints_dirs = sorted(art_dir.rglob(\"checkpoints\"))\n",
+    "\n",
+    "if not checkpoints_dirs:\n",
+    "    raise FileNotFoundError(f\"No checkpoints found under {art_dir}\")\n",
+    "\n",
+    "latest_ckpt = sorted(checkpoints_dirs[0].iterdir())[-1]\n",
+    "\n",
+    "print(f\"Loading checkpoint: {latest_ckpt}\")\n",
+    "\n",
+    "\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    model_name=str(latest_ckpt),\n",
+    "    max_seq_length=2048,\n",
+    "    load_in_4bit=False,\n",
+    ")\n",
+    "\n",
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "\n",
+    "print(\"Fine-tuned model loaded and ready for inference.\")"
    ]
   },
   {
-   "cell_type": "markdown",
+   "cell_type": "code",
+   "execution_count": null,
    "metadata": {},
+   "outputs": [],
    "source": [
-    "## Appendix: Dataset Format\n",
+    "import torch\n",
     "\n",
-    "### Built-in tool-call mode (default)\n",
     "\n",
-    "When using `data_path` with a HuggingFace dataset like `Agent-Ark/Toucan-1.5M`, ART's built-in reward function handles tool-call verification automatically. The dataset should contain multi-turn conversations with tool calls.\n",
+    "def generate_response(messages, max_tokens=512):\n",
+    "    \"\"\"Generate a response from the fine-tuned model.\"\"\"\n",
     "\n",
-    "### Custom reward function\n",
+    "    prompt = tokenizer.apply_chat_template(\n",
+    "        messages, tokenize=False, add_generation_prompt=True\n",
+    "    )\n",
     "\n",
-    "For custom tasks, pass `rollout_fn`, `reward_fn`, and `tasks` instead of `data_path`:\n",
+    "    inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
     "\n",
-    "```python\n",
-    "params = {\n",
-    "    \"model_path\": \"Qwen/Qwen3-4B\",\n",
-    "    \"ckpt_output_dir\": \"/mnt/shared/grpo_custom\",\n",
-    "    \"backend\": \"art\",\n",
-    "    # Custom rollout and reward\n",
-    "    \"rollout_fn\": my_rollout_function,\n",
-    "    \"reward_fn\": my_reward_function,\n",
-    "    \"tasks\": my_task_list,\n",
-    "    # Hyperparameters\n",
-    "    \"num_iterations\": 10,\n",
-    "    \"group_size\": 8,\n",
-    "}\n",
-    "```\n",
-    "\n",
-    "Note: Custom functions passed through `func_args` must be picklable (module-level, not closures) because ART uses `multiprocessing.spawn`.\n",
-    "\n",
-    "## Appendix: Key Parameters\n",
-    "\n",
-    "| Parameter | Description | Default |\n",
-    "|-----------|-------------|----------|\n",
-    "| `model_path` | HuggingFace model ID or local path | (required) |\n",
-    "| `ckpt_output_dir` | Where to save checkpoints and metrics | (required) |\n",
-    "| `backend` | `\"art\"` (single-GPU) or `\"verl\"` (multi-GPU) | `\"verl\"` |\n",
-    "| `data_path` | HuggingFace dataset for built-in tool-call mode | `None` |\n",
-    "| `data_config` | HuggingFace dataset config (conversation format) | `\"Qwen3\"` |\n",
-    "| `num_iterations` | Number of GRPO iterations | `15` |\n",
-    "| `group_size` | Responses per prompt for comparison | `8` |\n",
-    "| `prompt_batch_size` | Unique prompts per iteration | `100` |\n",
-    "| `n_train` | Training examples from dataset | `5000` |\n",
-    "| `learning_rate` | LoRA learning rate | `1e-5` |\n",
-    "| `lora_r` | LoRA rank | `16` |\n",
-    "| `lora_alpha` | LoRA alpha | `8` |\n",
-    "| `gpu_memory_utilization` | vLLM GPU memory fraction | `0.45` |\n",
-    "| `temperature` | Sampling temperature for rollouts | `0.7` |\n",
-    "| `max_tokens` | Max tokens per generated response | `512` |"
+    "    with torch.no_grad():\n",
+    "        outputs = model.generate(\n",
+    "            **inputs,\n",
+    "            max_new_tokens=max_tokens,\n",
+    "            temperature=0.7,\n",
+    "            do_sample=True,\n",
+    "        )\n",
+    "\n",
+    "    response = tokenizer.decode(\n",
+    "        outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True\n",
+    "    )\n",
+    "\n",
+    "    return response"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "test_prompts = [\n",
+    "    {\n",
+    "        \"description\": \"Weather lookup\",\n",
+    "        \"messages\": [\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": (\n",
+    "                    \"You are a helpful assistant with access to tools. \"\n",
+    "                    \"Available tools: get_weather(location: str, unit: str = 'celsius') -> dict\"\n",
+    "                ),\n",
+    "            },\n",
+    "            {\"role\": \"user\", \"content\": \"What's the weather like in Dublin?\"},\n",
+    "        ],\n",
+    "    },\n",
+    "    {\n",
+    "        \"description\": \"Calculator\",\n",
+    "        \"messages\": [\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": (\n",
+    "                    \"You are a helpful assistant with access to tools. \"\n",
+    "                    \"Available tools: calculate(expression: str) -> float\"\n",
+    "                ),\n",
+    "            },\n",
+    "            {\"role\": \"user\", \"content\": \"What is 1547 * 23 + 89?\"},\n",
+    "        ],\n",
+    "    },\n",
+    "    {\n",
+    "        \"description\": \"File search\",\n",
+    "        \"messages\": [\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": (\n",
+    "                    \"You are a helpful assistant with access to tools. \"\n",
+    "                    \"Available tools: search_files(query: str, directory: str = '.', file_type: str | None = None) -> list\"\n",
+    "                ),\n",
+    "            },\n",
+    "            {\n",
+    "                \"role\": \"user\",\n",
+    "                \"content\": \"Find all Python files that contain 'import torch'\",\n",
+    "            },\n",
+    "        ],\n",
+    "    },\n",
+    "]\n",
+    "\n",
+    "print(\"Testing fine-tuned model on tool-call generation:\")\n",
+    "print(\"=\" * 60)\n",
+    "\n",
+    "for test in test_prompts:\n",
+    "    print(f\"\\nTest: {test['description']}\")\n",
+    "    print(f\"User: {test['messages'][-1]['content']}\")\n",
+    "    response = generate_response(test[\"messages\"])\n",
+    "    print(f\"Model: {response}\")\n",
+    "    print(\"-\" * 60)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Cleanup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client.delete_job(job_name)\n",
+    "\n",
+    "print(f\"TrainJob '{job_name}' deleted.\")"
    ]
   }
  ],