From af375949bea083bfbad7c8d4e86ebf370b9d0618 Mon Sep 17 00:00:00 2001
From: Mustafa Eyceoz <meyceoz@redhat.com>
Date: Wed, 27 Aug 2025 16:04:55 -0400
Subject: [PATCH 1/3] Adding baseline hardware values for lab multiphase
 configs

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
---
 .../lab_multiphase_configs.ipynb              | 324 ++++++++++++++++++
 1 file changed, 324 insertions(+)
 create mode 100644 examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb

diff --git a/examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb b/examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb
new file mode 100644
index 00000000..ee8714a0
--- /dev/null
+++ b/examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb
@@ -0,0 +1,324 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "# InstructLab Multi-Phase Training Hardware Configurations\n\nThis notebook contains hardware-specific parameter configurations for use with the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb).\n\n**Model**: These configurations are optimized for `granite-3.1-8b-starter-v2.1`\n\nEach configuration below specifies the optimal parameters for different GPU setups, including:\n- `max_tokens_per_gpu`: Memory limit per GPU to prevent OOM errors\n- `nproc_per_node`: Number of GPUs per node for distributed training\n- `cpu_offload_params`: FSDP CPU offloading configuration for memory optimization\n\n**Usage**: Copy the appropriate configuration parameters from the sections below into your training script based on your available hardware."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## FSDP Configuration Reference\n",
+    "\n",
+    "The `cpu_offload_params` parameter controls whether FSDP parameters are offloaded to CPU memory to reduce GPU memory usage:\n",
+    "\n",
+    "```python\n",
+    "from instructlab.training import FSDPOptions\n",
+    "\n",
+    "fsdp_options = FSDPOptions(\n",
+    "    cpu_offload_params=True  # Boolean: True to enable CPU offloading, False to disable\n",
+    ")\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": "## Usage Example\n\nHere's how to use these configurations in the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb):\n\n1. **Choose your hardware configuration** from the sections below based on your available GPUs\n2. **Copy the configuration values** into your training script\n3. **Apply them to your training configuration**\n\n```python\n# Example: Using H100 4x GPU configuration\n\n# Step 1: Set hardware-specific parameters from this notebook\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000\nnproc_per_node = 4\ncpu_offload_params = False\n\n# Step 2: Use with training_hub's sft function\nfrom training_hub import sft\nfrom instructlab.training import FSDPOptions\n\n# Configure FSDP options\nfsdp_options = FSDPOptions(\n    cpu_offload_params=cpu_offload_params\n)\n\n# Step 3: Apply to your LAB training phases\n# Phase 1: Knowledge Tuning (Phase07)\nphase07_result = sft(\n    model_path=\"/path/to/base/model\",\n    data_path=\"/path/to/knowledge_data.jsonl\",\n    ckpt_output_dir=\"/path/to/phase07_checkpoints\",\n    max_tokens_per_gpu=max_tokens_per_gpu,\n    max_seq_len=max_seq_len,\n    fsdp_options=fsdp_options,\n    num_epochs=7,\n    # Use torchrun with nproc_per_node for distributed training\n)\n\n# Phase 2: Skills + Replay Training (Phase10)\nphase10_result = sft(\n    model_path=phase07_result.checkpoint_path,  # Use Phase07 output\n    data_path=\"/path/to/skills_plus_replay_data.jsonl\",\n    ckpt_output_dir=\"/path/to/phase10_checkpoints\", \n    max_tokens_per_gpu=max_tokens_per_gpu,\n    max_seq_len=max_seq_len,\n    fsdp_options=fsdp_options,\n    num_epochs=7,\n    # Use torchrun with nproc_per_node for distributed training\n)\n```\n\n**Note**: These values have been tested and optimized specifically for Granite starter models and LAB multiphase datasets. You may need to adjust these parameters for different student models and datasets.",
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## H200 Configurations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### H200 8x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# H200 8x GPU Configuration\nmax_tokens_per_gpu = 85000\nmax_seq_len = 80000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### H200 4x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# H200 4x GPU Configuration\nmax_tokens_per_gpu = 75000\nmax_seq_len = 70000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### H200 2x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# H200 2x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### H200 1x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# H200 1x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 1\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## H100 Configurations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### H100 8x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# H100 8x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### H100 4x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# H100 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### H100 2x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# H100 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# H100 1x GPU Configuration\n",
+    "max_tokens_per_gpu = 0  # TO BE FILLED\n",
+    "nproc_per_node = 1\n",
+    "\n",
+    "# FSDP CPU offloading configuration\n",
+    "cpu_offload_params = False  # TO BE FILLED - True to enable CPU offloading, False to disable"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### A100 80GB 8x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# A100 80GB 8x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### A100 80GB 4x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# A100 80GB 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### A100 80GB 2x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# A100 80GB 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## A100 40GB Configurations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### A100 40GB 8x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# A100 40GB 8x GPU Configuration\nmax_tokens_per_gpu = 15000\nmax_seq_len = 13000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### A100 40GB 4x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# A100 40GB 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### A100 40GB 2x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# A100 40GB 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## L40S Configurations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### L40S 8x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# L40S 8x GPU Configuration\nmax_tokens_per_gpu = 10000\nmax_seq_len = 8000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### L40S 4x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# L40S 4x GPU Configuration\nmax_tokens_per_gpu = 8000\nmax_seq_len = 6000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## L4 Configurations"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### L4 8x GPU Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "# L4 8x GPU Configuration\nmax_tokens_per_gpu = 8000\nmax_seq_len = 6000  # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file

From 30b3c2bb4f5cc6bb30e6f8aec5ea83e4632a1c01 Mon Sep 17 00:00:00 2001
From: Mustafa Eyceoz <meyceoz@redhat.com>
Date: Thu, 28 Aug 2025 14:26:27 -0400
Subject: [PATCH 2/3] Add README for lab configs

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
---
 .../instructlab-multiphase-configs/README.md     | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
 create mode 100644 examples/instructlab-multiphase-configs/README.md

diff --git a/examples/instructlab-multiphase-configs/README.md b/examples/instructlab-multiphase-configs/README.md
new file mode 100644
index 00000000..02ade9e7
--- /dev/null
+++ b/examples/instructlab-multiphase-configs/README.md
@@ -0,0 +1,16 @@
+# InstructLab Multi-Phase Training Configurations
+
+This directory contains hardware-specific configurations for running the LAB multi-phase training pipeline with different GPU setups.
+
+## Quick Start
+
+See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimized training parameters for various hardware configurations including:
+
+- **H200**: 1x, 2x, 4x, 8x GPU configurations
+- **H100**: 2x, 4x, 8x GPU configurations  
+- **A100 80GB**: 2x, 4x, 8x GPU configurations
+- **A100 40GB**: 2x, 4x, 8x GPU configurations
+- **L40S**: 4x, 8x GPU configurations
+- **L4**: 8x GPU configurations
+
+Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters.
\ No newline at end of file

From 52debb45e6a788920c58764a10ef5889d4f82a3a Mon Sep 17 00:00:00 2001
From: Mustafa Eyceoz <meyceoz@redhat.com>
Date: Tue, 16 Sep 2025 16:42:15 -0400
Subject: [PATCH 3/3] Add disclaimer for single-node, multi-node configs

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
---
 examples/instructlab-multiphase-configs/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/examples/instructlab-multiphase-configs/README.md b/examples/instructlab-multiphase-configs/README.md
index 02ade9e7..86755639 100644
--- a/examples/instructlab-multiphase-configs/README.md
+++ b/examples/instructlab-multiphase-configs/README.md
@@ -13,4 +13,6 @@ See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimiz
 - **L40S**: 4x, 8x GPU configurations
 - **L4**: 8x GPU configurations
 
-Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters.
\ No newline at end of file
+Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters.
+
+Note: The values are all set assuming a single node with the above GPU resources. For multi-node, note that the default sharding strategy is FSDP [HYBRID_SHARD](https://docs.pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy). Begin with the cooresponding settings that align with one of your given nodes, or switch to `FULL_SHARD` if required due to memory constraints.