From af375949bea083bfbad7c8d4e86ebf370b9d0618 Mon Sep 17 00:00:00 2001 From: Mustafa Eyceoz Date: Wed, 27 Aug 2025 16:04:55 -0400 Subject: [PATCH 1/3] Adding baseline hardware values for lab multiphase configs Signed-off-by: Mustafa Eyceoz --- .../lab_multiphase_configs.ipynb | 324 ++++++++++++++++++ 1 file changed, 324 insertions(+) create mode 100644 examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb diff --git a/examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb b/examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb new file mode 100644 index 00000000..ee8714a0 --- /dev/null +++ b/examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb @@ -0,0 +1,324 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": "# InstructLab Multi-Phase Training Hardware Configurations\n\nThis notebook contains hardware-specific parameter configurations for use with the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb).\n\n**Model**: These configurations are optimized for `granite-3.1-8b-starter-v2.1`\n\nEach configuration below specifies the optimal parameters for different GPU setups, including:\n- `max_tokens_per_gpu`: Maximum number of tokens per GPU in a mini-batch (a hard cap on memory use to prevent OOM errors)\n- `max_seq_len`: Maximum sequence length; keep it below `max_tokens_per_gpu`\n- `nproc_per_node`: Number of GPUs per node for distributed training\n- `cpu_offload_params`: FSDP CPU offloading configuration for memory optimization\n\n**Usage**: Copy the appropriate configuration parameters from the sections below into your training script based on your available hardware." 
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## FSDP Configuration Reference\n", + "\n", + "The `cpu_offload_params` parameter controls whether FSDP parameters are offloaded to CPU memory to reduce GPU memory usage:\n", + "\n", + "```python\n", + "from instructlab.training import FSDPOptions\n", + "\n", + "fsdp_options = FSDPOptions(\n", + " cpu_offload_params=True # Boolean: True to enable CPU offloading, False to disable\n", + ")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "source": "## Usage Example\n\nHere's how to use these configurations in the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb):\n\n1. **Choose your hardware configuration** from the sections below based on your available GPUs\n2. **Copy the configuration values** into your training script\n3. **Apply them to your training configuration**\n\n```python\n# Example: Using H100 4x GPU configuration\n\n# Step 1: Set hardware-specific parameters from this notebook\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000\nnproc_per_node = 4\ncpu_offload_params = False\n\n# Step 2: Use with training_hub's sft function\nfrom training_hub import sft\nfrom instructlab.training import FSDPOptions\n\n# Configure FSDP options\nfsdp_options = FSDPOptions(\n cpu_offload_params=cpu_offload_params\n)\n\n# Step 3: Apply to your LAB training phases\n# Phase 1: Knowledge Tuning (Phase07)\nphase07_result = sft(\n model_path=\"/path/to/base/model\",\n data_path=\"/path/to/knowledge_data.jsonl\",\n ckpt_output_dir=\"/path/to/phase07_checkpoints\",\n max_tokens_per_gpu=max_tokens_per_gpu,\n max_seq_len=max_seq_len,\n fsdp_options=fsdp_options,\n num_epochs=7,\n # Use torchrun with nproc_per_node for distributed training\n)\n\n# Phase 2: Skills + Replay Training (Phase10)\nphase10_result = sft(\n model_path=phase07_result.checkpoint_path, # Use Phase07 output\n 
data_path=\"/path/to/skills_plus_replay_data.jsonl\",\n ckpt_output_dir=\"/path/to/phase10_checkpoints\", \n max_tokens_per_gpu=max_tokens_per_gpu,\n max_seq_len=max_seq_len,\n fsdp_options=fsdp_options,\n num_epochs=7,\n # Use torchrun with nproc_per_node for distributed training\n)\n```\n\n**Note**: These values have been tested and optimized specifically for Granite starter models and LAB multiphase datasets. You may need to adjust these parameters for different student models and datasets.", + "metadata": {} + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## H200 Configurations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### H200 8x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# H200 8x GPU Configuration\nmax_tokens_per_gpu = 85000\nmax_seq_len = 80000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### H200 4x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# H200 4x GPU Configuration\nmax_tokens_per_gpu = 75000\nmax_seq_len = 70000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### H200 2x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# H200 2x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" 
+ }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### H200 1x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# H200 1x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 1\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## H100 Configurations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### H100 8x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# H100 8x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### H100 4x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# H100 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### H100 2x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# H100 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" + }, + { + "cell_type": "code", + "execution_count": null, 
+ "metadata": {}, + "outputs": [], + "source": [ + "# H100 1x GPU Configuration\n", + "max_tokens_per_gpu = 0 # TO BE FILLED\n", + "nproc_per_node = 1\n", + "\n", + "# FSDP CPU offloading configuration\n", + "cpu_offload_params = False # TO BE FILLED - True to enable CPU offloading, False to disable" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A100 80GB 8x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# A100 80GB 8x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A100 80GB 4x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# A100 80GB 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A100 80GB 2x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# A100 80GB 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A100 40GB Configurations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A100 40GB 8x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"metadata": {}, + "outputs": [], + "source": "# A100 40GB 8x GPU Configuration\nmax_tokens_per_gpu = 15000\nmax_seq_len = 13000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A100 40GB 4x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# A100 40GB 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A100 40GB 2x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# A100 40GB 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## L40S Configurations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### L40S 8x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# L40S 8x GPU Configuration\nmax_tokens_per_gpu = 10000\nmax_seq_len = 8000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### L40S 4x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + 
"source": "# L40S 4x GPU Configuration\nmax_tokens_per_gpu = 8000\nmax_seq_len = 6000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## L4 Configurations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### L4 8x GPU Configuration" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# L4 8x GPU Configuration\nmax_tokens_per_gpu = 8000\nmax_seq_len = 6000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file From 30b3c2bb4f5cc6bb30e6f8aec5ea83e4632a1c01 Mon Sep 17 00:00:00 2001 From: Mustafa Eyceoz Date: Thu, 28 Aug 2025 14:26:27 -0400 Subject: [PATCH 2/3] Add README for lab configs Signed-off-by: Mustafa Eyceoz --- .../instructlab-multiphase-configs/README.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) create mode 100644 examples/instructlab-multiphase-configs/README.md diff --git a/examples/instructlab-multiphase-configs/README.md b/examples/instructlab-multiphase-configs/README.md new file mode 100644 index 00000000..02ade9e7 --- /dev/null +++ b/examples/instructlab-multiphase-configs/README.md @@ -0,0 +1,16 @@ +# InstructLab Multi-Phase Training Configurations + +This directory contains hardware-specific configurations for running the LAB multi-phase 
training pipeline with different GPU setups. + +## Quick Start + +See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimized training parameters for various hardware configurations including: + +- **H200**: 1x, 2x, 4x, 8x GPU configurations +- **H100**: 2x, 4x, 8x GPU configurations +- **A100 80GB**: 2x, 4x, 8x GPU configurations +- **A100 40GB**: 2x, 4x, 8x GPU configurations +- **L40S**: 4x, 8x GPU configurations +- **L4**: 8x GPU configurations + +Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters. \ No newline at end of file From 52debb45e6a788920c58764a10ef5889d4f82a3a Mon Sep 17 00:00:00 2001 From: Mustafa Eyceoz Date: Tue, 16 Sep 2025 16:42:15 -0400 Subject: [PATCH 3/3] Add disclaimer for single-node, multi-node configs Signed-off-by: Mustafa Eyceoz --- examples/instructlab-multiphase-configs/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/examples/instructlab-multiphase-configs/README.md b/examples/instructlab-multiphase-configs/README.md index 02ade9e7..86755639 100644 --- a/examples/instructlab-multiphase-configs/README.md +++ b/examples/instructlab-multiphase-configs/README.md @@ -13,4 +13,6 @@ See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimiz - **L40S**: 4x, 8x GPU configurations - **L4**: 8x GPU configurations -Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters. \ No newline at end of file +Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters. + +Note: The values are all set assuming a single node with the above GPU resources. 
For multi-node, note that the default sharding strategy is FSDP [HYBRID_SHARD](https://docs.pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy). Begin with the settings that correspond to one of your nodes, or switch to `FULL_SHARD` if memory constraints require it.
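
As a rough sketch of how the single-node baselines from the notebook might be consumed in a script, the snippet below keeps a few of them in a lookup table and checks the `max_seq_len < max_tokens_per_gpu` constraint stated in the config comments. The `HardwareConfig` class, the `BASELINES` table, and the `get_config` helper are hypothetical names introduced here for illustration; they are not part of `training_hub` or `instructlab.training`. The numeric values are copied from the notebook cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareConfig:
    """Hypothetical container for the values of one notebook config cell."""

    max_tokens_per_gpu: int
    max_seq_len: int
    nproc_per_node: int
    cpu_offload_params: bool


# A subset of the notebook's baselines, keyed by (GPU type, GPUs per node).
# The key format is an assumption made for this sketch.
BASELINES = {
    ("H200", 8): HardwareConfig(85000, 80000, 8, False),
    ("H200", 1): HardwareConfig(45000, 40000, 1, True),
    ("H100", 4): HardwareConfig(30000, 27000, 4, False),
    ("A100-40GB", 8): HardwareConfig(15000, 13000, 8, False),
    ("L4", 8): HardwareConfig(8000, 6000, 8, True),
}


def get_config(gpu: str, count: int) -> HardwareConfig:
    """Look up a baseline and verify the constraint from the config comments."""
    cfg = BASELINES[(gpu, count)]
    # The notebook asks that max_seq_len stay below max_tokens_per_gpu.
    if cfg.max_seq_len >= cfg.max_tokens_per_gpu:
        raise ValueError(
            f"max_seq_len ({cfg.max_seq_len}) must be below "
            f"max_tokens_per_gpu ({cfg.max_tokens_per_gpu})"
        )
    return cfg


if __name__ == "__main__":
    cfg = get_config("H100", 4)
    print(cfg)
```

The returned fields can then be passed to the training call in the same way as the plain variables in the notebook's usage example. Entries for other hardware can be added from the remaining notebook cells.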