Adding Classic InstructLab Multi-Phase Training Hardware Configs to Examples #3
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| # InstructLab Multi-Phase Training Configurations | ||
| This directory contains hardware-specific configurations for running the LAB multi-phase training pipeline with different GPU setups. | ||
| ## Quick Start | ||
| See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimized training parameters for various hardware configurations including: | ||
| - **H200**: 1x, 2x, 4x, 8x GPU configurations | ||
|
Reviewer: Is this always single-node, or is multi-node supported? It would be useful to mention that explicitly.
Author: Good point, added a disclaimer to the README. |
||
| - **H100**: 2x, 4x, 8x GPU configurations | ||
| - **A100 80GB**: 2x, 4x, 8x GPU configurations | ||
| - **A100 40GB**: 2x, 4x, 8x GPU configurations | ||
| - **L40S**: 4x, 8x GPU configurations | ||
| - **L4**: 8x GPU configurations | ||
| Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters. | ||
| Note: All values assume a single node with the GPU resources listed above. For multi-node training, the default FSDP sharding strategy is [HYBRID_SHARD](https://docs.pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy). Start from the settings that match the GPUs in one of your nodes, or switch to `FULL_SHARD` if memory constraints require it. | ||
|
Reviewer: nit: "cooresponding" -> "corresponding" |
||
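The multi-node note above relies on how `HYBRID_SHARD` partitions model state. The sketch below is a rough, illustrative estimate and not part of this PR: `sharded_state_gib` is a hypothetical helper, and it assumes about 16 bytes of sharded state per parameter (fp32 weights, gradients, and two Adam moments) while ignoring activations entirely. It shows why the single-node values are a reasonable starting point: with `HYBRID_SHARD`, state is sharded only within a node and replicated across nodes, so each GPU holds the same share as on a single node, whereas `FULL_SHARD` spreads state over every GPU in the job.

```python
"""Rough per-GPU memory for FSDP-sharded model state (illustrative only).

Assumptions (not from the PR): ~16 bytes of sharded state per parameter
(fp32 weights, fp32 gradients, and two fp32 Adam moments); activations
and communication buffers are ignored.
"""

BYTES_PER_PARAM = 16  # assumption: fp32 weights + grads + Adam m and v


def sharded_state_gib(num_params: float, gpus_per_node: int, nnodes: int, strategy: str) -> float:
    """Return the GiB of sharded model/optimizer state held by each GPU."""
    if strategy == "FULL_SHARD":
        # State is split across every GPU in the job.
        shard_group = gpus_per_node * nnodes
    elif strategy == "HYBRID_SHARD":
        # State is split within a node and replicated across nodes.
        shard_group = gpus_per_node
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return num_params * BYTES_PER_PARAM / shard_group / 2**30


if __name__ == "__main__":
    params = 8e9  # roughly an 8B-parameter model such as granite-3.1-8b
    for strategy in ("HYBRID_SHARD", "FULL_SHARD"):
        per_gpu = sharded_state_gib(params, gpus_per_node=4, nnodes=2, strategy=strategy)
        print(f"2 nodes x 4 GPUs, {strategy}: {per_gpu:.1f} GiB per GPU")
```

Under these assumptions, adding nodes with `HYBRID_SHARD` leaves per-GPU state unchanged, which is why the per-node settings carry over; `FULL_SHARD` halves it in the two-node example at the cost of more inter-node communication.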
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,324 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": "# InstructLab Multi-Phase Training Hardware Configurations\n\nThis notebook contains hardware-specific parameter configurations for use with the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb).\n\n**Model**: These configurations were tuned for `granite-3.1-8b-starter-v2.1`.\n\nEach configuration below specifies tested parameters for a given GPU setup, including:\n- `max_tokens_per_gpu`: Maximum number of tokens in each per-GPU micro-batch; caps GPU memory use to prevent OOM errors\n- `max_seq_len`: Maximum sample length in tokens; keep it below `max_tokens_per_gpu`\n- `nproc_per_node`: Number of GPUs per node for distributed training\n- `cpu_offload_params`: Whether FSDP offloads parameters to CPU memory to reduce GPU memory usage\n\n**Usage**: Copy the configuration that matches your available hardware into your training script." | ||
|
Reviewer: It would be better to have
Reviewer: It might make sense to have the end-to-end InstructLab pipeline example here instead of just the training example.
Author: Also related to the comment above: #3 (comment) |
||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## FSDP Configuration Reference\n", | ||
| "\n", | ||
| "The `cpu_offload_params` parameter controls whether FSDP parameters are offloaded to CPU memory to reduce GPU memory usage:\n", | ||
| "\n", | ||
| "```python\n", | ||
| "from instructlab.training import FSDPOptions\n", | ||
| "\n", | ||
| "fsdp_options = FSDPOptions(\n", | ||
| " cpu_offload_params=True # Boolean: True to enable CPU offloading, False to disable\n", | ||
| ")\n", | ||
| "```" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "source": "## Usage Example\n\nHere's how to use these configurations in the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb):\n\n1. **Choose your hardware configuration** from the sections below based on your available GPUs\n2. **Copy the configuration values** into your training script\n3. **Apply them to your training configuration**\n\n```python\n# Example: Using H100 4x GPU configuration\n\n# Step 1: Set hardware-specific parameters from this notebook\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000\nnproc_per_node = 4\ncpu_offload_params = False\n\n# Step 2: Use with training_hub's sft function\nfrom training_hub import sft\nfrom instructlab.training import FSDPOptions\n\n# Configure FSDP options\nfsdp_options = FSDPOptions(\n cpu_offload_params=cpu_offload_params\n)\n\n# Step 3: Apply to your LAB training phases\n# Phase 1: Knowledge Tuning (Phase07)\nphase07_result = sft(\n model_path=\"/path/to/base/model\",\n data_path=\"/path/to/knowledge_data.jsonl\",\n ckpt_output_dir=\"/path/to/phase07_checkpoints\",\n max_tokens_per_gpu=max_tokens_per_gpu,\n max_seq_len=max_seq_len,\n fsdp_options=fsdp_options,\n num_epochs=7,\n # Use torchrun with nproc_per_node for distributed training\n)\n\n# Phase 2: Skills + Replay Training (Phase10)\nphase10_result = sft(\n model_path=phase07_result.checkpoint_path, # Use Phase07 output\n data_path=\"/path/to/skills_plus_replay_data.jsonl\",\n ckpt_output_dir=\"/path/to/phase10_checkpoints\", \n max_tokens_per_gpu=max_tokens_per_gpu,\n max_seq_len=max_seq_len,\n fsdp_options=fsdp_options,\n num_epochs=7,\n # Use torchrun with nproc_per_node for distributed training\n)\n```\n\n**Note**: These values have been tested and optimized specifically for Granite starter models and LAB multiphase datasets. You may need to adjust these parameters for different student models and datasets.", | ||
| "metadata": {} | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## H200 Configurations" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### H200 8x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# H200 8x GPU Configuration\nmax_tokens_per_gpu = 85000\nmax_seq_len = 80000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### H200 4x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# H200 4x GPU Configuration\nmax_tokens_per_gpu = 75000\nmax_seq_len = 70000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### H200 2x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# H200 2x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### H200 1x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# H200 1x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 1\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## H100 Configurations" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### H100 8x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# H100 8x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### H100 4x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# H100 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### H100 2x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# H100 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "# H100 1x GPU Configuration\n", | ||
| "max_tokens_per_gpu = 0 # TO BE FILLED\n", | ||
| "nproc_per_node = 1\n", | ||
| "\n", | ||
| "# FSDP CPU offloading configuration\n", | ||
| "cpu_offload_params = False # TO BE FILLED - True to enable CPU offloading, False to disable" | ||
|
Reviewer (comment on lines +147 to +153): Remove or complete the unfinished H100 1x config before merging. Line 149 publishes |
||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## A100 80GB Configurations\n", | ||
| "\n", | ||
| "### A100 80GB 8x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# A100 80GB 8x GPU Configuration\nmax_tokens_per_gpu = 45000\nmax_seq_len = 40000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### A100 80GB 4x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# A100 80GB 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### A100 80GB 2x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# A100 80GB 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## A100 40GB Configurations" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### A100 40GB 8x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# A100 40GB 8x GPU Configuration\nmax_tokens_per_gpu = 15000\nmax_seq_len = 13000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### A100 40GB 4x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# A100 40GB 4x GPU Configuration\nmax_tokens_per_gpu = 30000\nmax_seq_len = 27000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### A100 40GB 2x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# A100 40GB 2x GPU Configuration\nmax_tokens_per_gpu = 25000\nmax_seq_len = 22000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 2\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## L40S Configurations" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### L40S 8x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# L40S 8x GPU Configuration\nmax_tokens_per_gpu = 10000\nmax_seq_len = 8000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### L40S 4x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# L40S 4x GPU Configuration\nmax_tokens_per_gpu = 8000\nmax_seq_len = 6000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 4\n\n# FSDP CPU offloading configuration\ncpu_offload_params = False" | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## L4 Configurations" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "### L4 8x GPU Configuration" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": "# L4 8x GPU Configuration\nmax_tokens_per_gpu = 8000\nmax_seq_len = 6000 # adjust if needed based on actual data lengths, but keep below max_tokens_per_gpu\nnproc_per_node = 8\n\n# FSDP CPU offloading configuration\ncpu_offload_params = True" | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.8.5" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 4 | ||
| } | ||
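For scripts that need to select a configuration programmatically rather than copying cells by hand, the notebook's values can be collected into one lookup table. This is a hypothetical helper, not part of `training_hub` or `instructlab.training`: `HardwareConfig`, `CONFIGS`, `get_config`, and GPU labels such as `"A100-80GB"` are illustrative names, and the numbers are copied from the cells above. The H100 1x entry is omitted because its cell is still marked TO BE FILLED.

```python
"""Hypothetical lookup table of the hardware configurations in this notebook."""

from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareConfig:
    max_tokens_per_gpu: int
    max_seq_len: int
    nproc_per_node: int
    cpu_offload_params: bool


# Values copied from the notebook cells (H100 1x omitted: still TO BE FILLED).
CONFIGS = {
    ("H200", 8): HardwareConfig(85000, 80000, 8, False),
    ("H200", 4): HardwareConfig(75000, 70000, 4, False),
    ("H200", 2): HardwareConfig(45000, 40000, 2, False),
    ("H200", 1): HardwareConfig(45000, 40000, 1, True),
    ("H100", 8): HardwareConfig(45000, 40000, 8, False),
    ("H100", 4): HardwareConfig(30000, 27000, 4, False),
    ("H100", 2): HardwareConfig(25000, 22000, 2, True),
    ("A100-80GB", 8): HardwareConfig(45000, 40000, 8, False),
    ("A100-80GB", 4): HardwareConfig(30000, 27000, 4, False),
    ("A100-80GB", 2): HardwareConfig(25000, 22000, 2, True),
    ("A100-40GB", 8): HardwareConfig(15000, 13000, 8, False),
    ("A100-40GB", 4): HardwareConfig(30000, 27000, 4, True),
    ("A100-40GB", 2): HardwareConfig(25000, 22000, 2, True),
    ("L40S", 8): HardwareConfig(10000, 8000, 8, False),
    ("L40S", 4): HardwareConfig(8000, 6000, 4, False),
    ("L4", 8): HardwareConfig(8000, 6000, 8, True),
}


def get_config(gpu: str, num_gpus: int) -> HardwareConfig:
    """Return the tested configuration, failing loudly for untested setups."""
    try:
        cfg = CONFIGS[(gpu, num_gpus)]
    except KeyError:
        raise KeyError(f"no tested configuration for {num_gpus}x {gpu}") from None
    # The notebook comments ask to keep max_seq_len below max_tokens_per_gpu.
    if cfg.max_seq_len >= cfg.max_tokens_per_gpu:
        raise ValueError("max_seq_len must stay below max_tokens_per_gpu")
    return cfg


if __name__ == "__main__":
    cfg = get_config("H100", 4)
    print(cfg)
```

The returned fields can then replace the hand-copied variables in the usage example, for instance `max_tokens_per_gpu=cfg.max_tokens_per_gpu`. Raising `KeyError` for an untested setup is meant to keep an unsupported GPU count from being silently mapped to a neighbouring configuration.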
I'd suggest we start grouping the examples under `examples/training_hub`, for example, to better organize the examples from different parts of the product.

Yeah, I think that makes sense. One question is whether we will have other components of the InstructLab pipeline / notebooks in this repo. @aditisaluja5, was a decision made as to whether the full end-to-end pipeline example will live here? And if so, will that be a series of notebooks (including the training notebook) or a single all-encompassing notebook?
If it is just the training configs living here, then I think @astefanutti's suggestions are best: we can have a `training_hub` sub-directory and include the training notebook itself there as well. If we are including the full pipeline here, however, then we should probably create an `instructlab` sub-directory, add all of the step notebooks there (data prep, SDG, training, eval, etc.), and have this config notebook live alongside them. Let me know what the plan is and I can proceed accordingly.

@aditisaluja5

Sorry for the late response. The end-to-end examples will also live in this repo.

It looks like we are leaning toward the latter option. Would `examples/ai_hub` be more appropriate as the sub-directory instead of `instructlab`?