
Commit a55e18a

Merge pull request #2278 from AI-Hypercomputer:collabs-examples-sft
PiperOrigin-RevId: 807495969
2 parents 9c13728 + 63fb47b commit a55e18a

File tree

1 file changed: +288 −0

Lines changed: 288 additions & 0 deletions
@@ -0,0 +1,288 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Run SFT on Llama3.1-8B-Instruct model\n",
    "\n",
    "This notebook demonstrates how to perform Supervised Fine-Tuning (SFT) on Llama3.1-8B-Instruct using the Hugging Face ultrachat_200k dataset with Tunix integration for efficient training.\n",
    "\n",
    "## Dataset Overview\n",
    "https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k\n",
    "\n",
    "**Dataset Information:**\n",
    "- **Name**: HuggingFaceH4/ultrachat_200k\n",
    "- **Type**: Supervised Fine-Tuning dataset\n",
    "- **Size**: ~200k conversations\n",
    "- **Format**: Chat conversations with human-AI pairs\n",
    "- **Splits**: train_sft, test_sft\n",
    "- **Data columns**: ['messages']\n",
    "\n",
    "**Dataset Structure:**\n",
    "Each example contains a 'messages' field with:\n",
    "- role: 'user' or 'assistant'\n",
    "- content: The actual message text\n",
    "\n",
    "**Example data format:**\n",
    "```json\n",
    "{\n",
    "  \"messages\": [\n",
    "    {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n",
    "    {\"role\": \"assistant\", \"content\": \"The capital of France is Paris.\"}\n",
    "  ]\n",
    "}\n",
    "```\n",
    "\n",
    "## Key Features\n",
    "- **MaxText Llama3.1-8B-Instruct model**\n",
    "- **Tunix integration** for optimized training\n",
    "- **UltraChat-200k dataset** from HuggingFace\n",
    "- Tokenizes with meta-llama/Llama-3.1-8B-Instruct\n",
    "\n",
    "## Prerequisites\n",
    "- MaxText environment with all dependencies\n",
    "- Tunix installation\n",
    "- HuggingFace access token for dataset download\n",
    "- Sufficient compute resources (TPU/GPU)\n"
   ]
  },
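  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell is an optional, minimal sketch for peeking at the dataset before training. It assumes the `datasets` library is installed and that HuggingFaceH4/ultrachat_200k is reachable from this environment; it is not required for the MaxText/Tunix flow below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: inspect one UltraChat-200k example (sketch; assumes `datasets` is installed)\n",
    "from datasets import load_dataset\n",
    "\n",
    "# Stream a single example from the train_sft split to see the 'messages' structure\n",
    "ds = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\", streaming=True)\n",
    "example = next(iter(ds))\n",
    "for turn in example[\"messages\"]:\n",
    "    print(f\"{turn['role']}: {turn['content'][:80]}\")\n"
   ]
  },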
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### (Optional) Run this only if you have this file and nothing else\n",
    "\n",
    "# 1. Clone the MaxText repository (from AI-Hypercomputer)\n",
    "!git clone https://github.com/AI-Hypercomputer/maxtext.git\n",
    "\n",
    "# 2. Navigate into the cloned directory\n",
    "%cd maxtext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### (Optional) Skip this if you have already installed the dependencies\n",
    "\n",
    "# 3. Ensure setup.sh is executable\n",
    "!chmod +x setup.sh\n",
    "\n",
    "# 4. Execute the setup script\n",
    "!./setup.sh\n",
    "\n",
    "# Pin the numpy version\n",
    "!pip install --force-reinstall numpy==2.1.2\n",
    "# Install nest_asyncio\n",
    "!pip install nest_asyncio\n",
    "\n",
    "# nest_asyncio works around the \"This event loop is already running\" error in Colab\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "\n",
    "# Set the home directory. Change this to the directory where maxtext is cloned.\n",
    "MAXTEXT_HOME = os.path.expanduser(\"~\") + \"/maxtext\"\n",
    "print(f\"Home directory (from Python): {MAXTEXT_HOME}\")\n",
    "\n",
    "# Set the path to the Llama3.1-8B-Instruct checkpoint to load; gs://<bucket> paths are supported\n",
    "MODEL_CHECKPOINT_PATH = \"path/to/scanned/checkpoint\""
   ]
  },
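  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, the cell below is a minimal sketch for sanity-checking `MODEL_CHECKPOINT_PATH` before training. It assumes `gsutil` is available and authenticated for gs:// paths; for local paths it simply checks that the path exists.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check on the checkpoint path (sketch)\n",
    "if MODEL_CHECKPOINT_PATH.startswith(\"gs://\"):\n",
    "    # List the checkpoint directory in GCS (assumes gsutil is installed and authenticated)\n",
    "    !gsutil ls -d {MODEL_CHECKPOINT_PATH}\n",
    "else:\n",
    "    print(f\"Checkpoint path exists: {os.path.exists(MODEL_CHECKPOINT_PATH)}\")\n"
   ]
  },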
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Find the MaxText directory and change the working directory to it\n",
    "current_dir = Path.cwd()\n",
    "if current_dir.name == 'examples':\n",
    "    # We're in the examples folder; go up two levels\n",
    "    maxtext_path = current_dir.parent.parent\n",
    "else:\n",
    "    # We're in the repository root; use the MaxText package directory\n",
    "    maxtext_path = Path(MAXTEXT_HOME) / 'src' / 'MaxText'\n",
    "\n",
    "# Change the working directory to the MaxText directory and make it importable\n",
    "os.chdir(maxtext_path)\n",
    "sys.path.insert(0, str(maxtext_path))\n",
    "\n",
    "print(f\"✓ Changed working directory to: {os.getcwd()}\")\n",
    "print(f\"✓ MaxText path: {maxtext_path}\")\n",
    "print(f\"✓ Added to Python path: {maxtext_path}\")\n",
    "\n",
    "import jax\n",
    "if not jax.distributed.is_initialized():\n",
    "    jax.distributed.initialize()\n",
    "print(f\"JAX version: {jax.__version__}\")\n",
    "print(f\"JAX devices: {jax.devices()}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hugging Face Authentication Setup\n",
    "\n",
    "If you encounter 401 unauthorized errors when loading datasets, you need to authenticate with Hugging Face. Set your token below:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hugging Face Authentication Setup\n",
    "from huggingface_hub import login\n",
    "\n",
    "# Set your Hugging Face token here\n",
    "HF_TOKEN = \"hf_your_token_here\"  # Replace with your actual token\n",
    "login(token=HF_TOKEN)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# MaxText imports\n",
    "try:\n",
    "    from MaxText import pyconfig\n",
    "    from MaxText.sft.sft_trainer import train as sft_train\n",
    "\n",
    "    MAXTEXT_AVAILABLE = True\n",
    "    print(\"✓ MaxText imports successful\")\n",
    "except ImportError as e:\n",
    "    print(f\"⚠️ MaxText not available: {e}\")\n",
    "    MAXTEXT_AVAILABLE = False\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configuration Setup\n",
    "\n",
    "### Notes\n",
    "- By default, SFT trains on completions only (sft_train_on_completion_only=True).\n",
    "- Set sft_train_on_completion_only=False to train on both prompts and completions (see the optional sketch after the configuration cell below)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration setup\n",
    "if MAXTEXT_AVAILABLE:\n",
    "    # Build the argument list for MaxText's config system\n",
    "    config_argv = [\n",
    "        \"\",\n",
    "        f\"{MAXTEXT_HOME}/src/MaxText/configs/sft.yml\",  # SFT config\n",
    "        f\"load_parameters_path={MODEL_CHECKPOINT_PATH}\",\n",
    "        \"model_name=llama3.1-8b\",\n",
    "        \"steps=100\",\n",
    "        \"per_device_batch_size=1\",\n",
    "        \"max_target_length=1024\",\n",
    "        \"learning_rate=2.0e-5\",\n",
    "        \"eval_steps=5\",\n",
    "        \"weight_dtype=bfloat16\",\n",
    "        \"dtype=bfloat16\",\n",
    "        \"hf_path=HuggingFaceH4/ultrachat_200k\",\n",
    "        f\"hf_access_token={HF_TOKEN}\",\n",
    "        \"base_output_directory=/tmp/maxtext_output\",\n",
    "        \"run_name=sft_llama3_demo\",\n",
    "        \"tokenizer_path=meta-llama/Llama-3.1-8B-Instruct\",\n",
    "        \"eval_interval=10\",\n",
    "        \"profiler=xplane\",\n",
    "    ]\n",
    "\n",
    "    # Initialize the configuration using MaxText's pyconfig\n",
    "    config = pyconfig.initialize(config_argv)\n",
    "\n",
    "    print(\"✓ Configuration loaded:\")\n",
    "    print(f\"  - Model: {config.model_name}\")\n",
    "    print(f\"  - Dataset: {config.hf_path}\")\n",
    "    print(f\"  - Steps: {config.steps}\")\n",
    "    print(f\"  - Use SFT: {config.use_sft}\")\n",
    "    print(f\"  - Learning Rate: {config.learning_rate}\")\n",
    "else:\n",
    "    print(\"MaxText not available - cannot load configuration\")\n"
   ]
  },
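  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell below is an optional, minimal sketch of the override mentioned in the Notes above: re-initializing the configuration with sft_train_on_completion_only=False so training covers both prompts and completions. It assumes the config_argv list from the previous cell and that the flag is exposed on the resulting config object.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sketch: train on both prompts and completions\n",
    "# (assumes config_argv from the previous cell; sft_train_on_completion_only is the\n",
    "# flag described in the Notes section above)\n",
    "if MAXTEXT_AVAILABLE:\n",
    "    config_argv_full = config_argv + [\"sft_train_on_completion_only=False\"]\n",
    "    config_full = pyconfig.initialize(config_argv_full)\n",
    "    print(f\"Train on completion only: {config_full.sft_train_on_completion_only}\")\n"
   ]
  },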
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Execute Training\n",
    "\n",
    "Run the training using the MaxText SFT trainer's `train()` function.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Execute the training using the MaxText SFT trainer's train() function\n",
    "if MAXTEXT_AVAILABLE:\n",
    "    print(\"=\" * 60)\n",
    "    print(\"EXECUTING TRAINING\")\n",
    "    print(\"=\" * 60)\n",
    "\n",
    "    sft_train(config)\n",
    "\n",
    "    print(\"\\n✅ Training completed successfully!\")\n",
    "else:\n",
    "    print(\"MaxText not available - cannot execute training\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "This notebook demonstrated the complete MaxText & Tunix integration for SFT training.\n",
    "\n",
    "The integration provides the best of both worlds: MaxText's high-performance LLM training and Tunix's optimized training infrastructure, making it ideal for production SFT training on large datasets like UltraChat-200k.\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
