From 67f79860cda5c4d73839d0164ffbb76b49dad4ca Mon Sep 17 00:00:00 2001 From: Connor Treacy Date: Fri, 17 Oct 2025 16:14:35 +0100 Subject: [PATCH 1/5] Generating codebase docs tutorial --- .../generating-codebase-docs.ipynb | 1348 +++++++++++++++++ 1 file changed, 1348 insertions(+) create mode 100644 end-to-end-use-cases/generating-codebase-docs/generating-codebase-docs.ipynb diff --git a/end-to-end-use-cases/generating-codebase-docs/generating-codebase-docs.ipynb b/end-to-end-use-cases/generating-codebase-docs/generating-codebase-docs.ipynb new file mode 100644 index 000000000..906511256 --- /dev/null +++ b/end-to-end-use-cases/generating-codebase-docs/generating-codebase-docs.ipynb @@ -0,0 +1,1348 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "42a6fd1b", + "metadata": {}, + "source": [ + "# Generating documentation for an entire codebase" + ] + }, + { + "cell_type": "markdown", + "id": "72c37e61", + "metadata": {}, + "source": [ + "*Copyright (c) Meta Platforms, Inc. and affiliates.\n", + "This software may be used and distributed according to the terms of the Llama Community License Agreement.*" + ] + }, + { + "cell_type": "markdown", + "id": "352a1d17", + "metadata": {}, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "id": "36f56eac-5824-4b4d-8231-ca1d9a792cfc", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "This tutorial shows you how to build an automated documentation generator for source code repositories. Using Llama 4 Scout, you'll create a \"Repo2Docs\" system that analyzes an entire codebase and produces a comprehensive README with architectural diagrams and component summaries.\n", + "\n", + "While traditional documentation tools require manual annotation or simple extraction, this approach uses Llama 4's large context window and code understanding capabilities to generate meaningful, contextual documentation that explains not just what the code does, but how components work together.\n", + "\n", + "## What you will learn\n", + "\n", + "- **Build a multi-stage AI pipeline** that performs progressive analysis, from individual files to the complete architecture.\n", + "- **Leverage Llama 4 Scout's large context window** to analyze entire source files and repositories without complex chunking strategies.\n", + "- **Use the Meta Llama API** to access Llama 4 models.\n", + "- **Generate production-ready documentation**, including Mermaid diagrams that visualize your repository's architecture.\n", + "\n", + "| Component | Choice | Why |\n", + "|:----------|:-------|:----|\n", + "| **Model** | Llama 4 Scout | Large context window (up to 10M tokens) and Mixture-of-Experts (MoE) architecture for efficient, high-quality analysis. |\n", + "| **Infrastructure** | Meta Llama API | Provides serverless, production-ready access to Llama 4 models using the `llama_api_client` SDK. |\n", + "| **Architecture** | Progressive Pipeline | Deconstructs the complex task of repository analysis into manageable, sequential stages for scalability and efficiency. |\n", + "---\n", + "\n", + "**Note on Inference Providers:** This tutorial uses the Llama API for demonstration purposes. However, you can run Llama 4 models with any preferred inference provider. Common examples include [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html) and [Together AI](https://together.ai/llama). The core logic of this tutorial can be adapted to any of these providers." 
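+ "\n",
+ "For instance, a provider that exposes an OpenAI-compatible endpoint can often be swapped in by changing only the client setup. The sketch below is illustrative rather than a drop-in replacement: it assumes the `openai` package is installed, uses Together AI's public endpoint as an example `base_url`, and leaves the model identifier as a placeholder for your provider's documented name for Llama 4 Scout.\n",
+ "\n",
+ "```python\n",
+ "# Hedged sketch: pointing an OpenAI-compatible client at another provider.\n",
+ "# The base URL and model name are examples/placeholders, not verified values.\n",
+ "import os\n",
+ "from openai import OpenAI\n",
+ "\n",
+ "client = OpenAI(\n",
+ "    base_url=\"https://api.together.xyz/v1\",  # example endpoint; check your provider's docs\n",
+ "    api_key=os.environ[\"TOGETHER_API_KEY\"],\n",
+ ")\n",
+ "\n",
+ "resp = client.chat.completions.create(\n",
+ "    model=\"<provider-model-id-for-llama-4-scout>\",  # placeholder\n",
+ "    messages=[{\"role\": \"user\", \"content\": \"Hello, Llama!\"}],\n",
+ ")\n",
+ "print(resp.choices[0].message.content)\n",
+ "```\n",
+ "\n",
+ "Note that response objects differ between SDKs: this sketch reads `resp.choices[0].message.content`, whereas the Llama API client used throughout this notebook returns `resp.completion_message.content.text`, so later call sites would need the same small adjustment.\n"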
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebe11503-7483-4bd4-a0fa-ed6d75e70c59",
+ "metadata": {},
+ "source": [
+ "## Problem: Documentation debt\n",
+ "\n",
+ "Documentation debt is a persistent challenge in software development. As codebases evolve, manual documentation efforts often fall behind, leading to outdated, inconsistent, or missing information. This slows down developer onboarding and makes maintenance more difficult.\n",
+ "\n",
+ "## Solution: An automated documentation pipeline\n",
+ "\n",
+ "This tutorial's solution is a multi-stage pipeline that systematically analyzes a repository to produce a comprehensive `README.md` file. The system works by progressively analyzing your repository in multiple stages:\n",
+ "\n",
+ "```mermaid\n",
+ "flowchart LR\n",
+ "    A[GitHub Repo] --> B[Step 1: File Analysis]\n",
+ "    B --> C[Step 2: Repository Overview]\n",
+ "    C --> D[Step 3: Architecture Analysis]\n",
+ "    D --> E[Step 4: Final README]\n",
+ "```\n",
+ "\n",
+ "By breaking down the complex task of repository analysis into manageable stages, you can process repositories of any size efficiently. The large context window of Llama 4 Scout is sufficient to analyze entire source files without complex chunking strategies, resulting in high-quality documentation that captures both fine-grained details and architectural patterns."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "36a4c1eb-5b32-4328-9f23-566e07c5abc7",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "Before you begin, ensure you have a Llama API key. If you do not have a Llama API key, please get one from [Meta Llama API](https://llama.developer.meta.com/).\n",
+ "\n",
+ "Remember, we use the Llama API for this tutorial, but you can adapt this section to use your preferred inference provider.\n",
+ "\n",
+ "## Install dependencies\n",
+ "\n",
+ "You will need a few libraries for this project: `tiktoken` for accurate token counting, `tqdm` for progress bars, and the official `llama-api-client`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "76934968-7e45-4604-bc05-8f6cda19f20f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install dependencies\n",
+ "!pip install --quiet tiktoken llama-api-client tqdm"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ab850db6-9f0c-4633-b874-edd7b86fe5d1",
+ "metadata": {},
+ "source": [
+ "## Imports & Llama API client setup\n",
+ "\n",
+ "Import the necessary modules and initialize the `LlamaAPIClient`. This requires a Llama API key to be available as an environment variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "93135cfe-ead0-4390-8631-259834c9b988",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os, sys, re\n",
+ "import tempfile\n",
+ "import textwrap\n",
+ "import urllib.request\n",
+ "import zipfile\n",
+ "from pathlib import Path\n",
+ "from typing import Dict, List, Tuple\n",
+ "from urllib.parse import urlparse\n",
+ "import json\n",
+ "import pprint\n",
+ "from tqdm import tqdm\n",
+ "import tiktoken\n",
+ "from llama_api_client import LlamaAPIClient\n",
+ "\n",
+ "# --- Llama client ---\n",
+ "API_KEY = os.getenv(\"LLAMA_API_KEY\")\n",
+ "if not API_KEY:\n",
+ "    sys.exit(\"❌ Please set the LLAMA_API_KEY environment variable.\")\n",
+ "\n",
+ "client = LlamaAPIClient(api_key=API_KEY)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fae9842e-10c8-4fc0-a118-429c786fb63a",
+ "metadata": {
+ "vscode": {
+ "languageId": "raw"
+ }
+ },
+ "source": [
+ "### Model Selection\n",
+ "\n",
+ "For this tutorial, you'll use **Llama 4 Scout**. Its large context window is well-suited for ingesting and analyzing entire source code files, which is a key requirement for this use case. While Llama 4 Scout supports up to 10M tokens, the Llama API currently supports 128k tokens."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "5d363e30-94c8-4420-a438-9c98f45585d0", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Constants & Configuration ---\n", + "LLM_MODEL = \"Llama-4-Scout-17B-16E-Instruct-FP8\"\n", + "CTX_WINDOW = 128000 # Context window for Llama API" + ] + }, + { + "cell_type": "markdown", + "id": "73eb5cfb-585c-4acb-b8b5-f6a5bb8eb271", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## Step 1: Download the repository\n", + "\n", + "First, you'll download the target repository. This tutorial analyzes the official [Meta Llama repository](https://github.com/facebookresearch/llama), but you can adapt it to any public GitHub repository.\n", + "\n", + "The code downloads the repository as ZIP archive (faster than git clone, avoids .git metadata) and extracts to a temporary directory for isolated processing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f99c109e-d084-4efd-bdd0-a12ee02a3d9d", + "metadata": {}, + "outputs": [], + "source": [ + "REPO_URL = \"https://github.com/facebookresearch/llama\"\n", + "BRANCH_NAME = \"main\" # The default branch to download" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2e99d1b2-251b-47c3-81d5-3b02a4863684", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "πŸ“₯ Downloading repository from https://github.com/facebookresearch/llama/archive/refs/heads/main.zip...\n", + "πŸ“¦ Extracting files...\n", + "βœ… Extracted to: /var/folders/sz/kf8w7j1x1v790jxs8k2gl72c0000gn/T/tmptwo_kdt5/llama-main\n" + ] + } + ], + "source": [ + "base_url = REPO_URL.rstrip(\"/\").removesuffix(\".git\")\n", + "repo_zip_url = f\"{base_url}/archive/refs/heads/{BRANCH_NAME}.zip\"\n", + "\n", + "# Create a temporary directory to work in\n", + "tmpdir_obj = tempfile.TemporaryDirectory()\n", + "tmpdir = Path(tmpdir_obj.name)\n", + "\n", + "# Download the repository ZIP file\n", + "zip_path = tmpdir / \"repo.zip\"\n", + "print(f\"πŸ“₯ Downloading repository from {repo_zip_url}...\")\n", + "urllib.request.urlretrieve(repo_zip_url, zip_path)\n", + "\n", + "# Extract the archive\n", + "print(\"πŸ“¦ Extracting files...\")\n", + "with zipfile.ZipFile(zip_path, 'r') as zf:\n", + " zf.extractall(tmpdir)\n", + "extracted_root = next(p for p in tmpdir.iterdir() if p.is_dir())\n", + "print(f\"βœ… Extracted to: {extracted_root}\")" + ] + }, + { + "cell_type": "markdown", + "id": "01474e16-f15c-4deb-bee9-1506b56153fe", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## Step 2: Analyze individual files\n", + "\n", + "In this step, you'll generate a concise summary for each relevant file in the repository. This is the first step in the progressive analysis pipeline.\n", + "\n", + "**File selection strategy**: To ensure the analysis is both comprehensive and efficient, you'll selectively process files based on their extension and name (`should_include_file`). This avoids summarizing binary files, build artifacts, or other content that is not relevant to documentation.\n", + "\n", + "The list below provides a general-purpose starting point, but you should customize it for your target repository. For a large project, consider what file types contain the most meaningful source code and configuration, and start with those." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "304bb4f3-aea8-4b19-83b9-9f8329db74e4", + "metadata": {}, + "outputs": [], + "source": [ + "# Allowlist of file extensions to summarize\n", + "INCLUDE_EXTENSIONS = {\n", + " \".py\", # Python\n", + " \".js\", \".jsx\", \".ts\", \".tsx\", # JS/Typescript\n", + " \".md\", \".txt\", # Text\n", + " \".json\", \".yaml\", \".yml\", \".toml\", # Config\n", + " \".sh\", \".css\", \".html\",\n", + "}\n", + "INCLUDE_FILENAMES = {\"Dockerfile\", \"Makefile\"} # Common files without extension\n", + "\n", + "def should_include_file(file_path: Path, extracted_root: Path) -> bool:\n", + " \"\"\"Checks if a file should be included for documentation based on its path and type.\"\"\"\n", + " \n", + " if not file_path.is_file(): # Must be a file.\n", + " return False\n", + "\n", + " rel_path = file_path.relative_to(extracted_root)\n", + " if any(part.startswith('.') for part in rel_path.parts): # Exclude hidden files/folders.\n", + " return False\n", + "\n", + " if ( # Must be in our allow-list of extensions or filenames.\n", + " file_path.suffix.lower() in INCLUDE_EXTENSIONS\n", + " or file_path.name in INCLUDE_FILENAMES\n", + " ):\n", + " return True\n", + "\n", + " return False" + ] + }, + { + "cell_type": "markdown", + "id": "2b1fbfc8-0347-41bc-af26-af80e22a3de7", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "**Prompt strategy for file summaries**: The prompt for this phase instructs Llama 4 to elicit summaries that focus on a file's purpose and its role within the project, rather than a line-by-line description of its implementation. This is a critical step for generating a high-level, conceptual understanding of the codebase." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe341e12-6e4d-4e7f-9a6e-a0fb57d6c795", + "metadata": {}, + "outputs": [], + "source": [ + "MAX_COMPLETION_TOKENS_FILE = 400 # Max tokens for file summary\n", + "# To keep this tutorial straightforward, we'll skip files larger than 1MB.\n", + "# For a production system, you might implement a chunking strategy for large files.\n", + "MAX_FILE_SIZE = 1_000_000\n", + "\n", + "def summarize_file_content(file_path: str, file_content: str) -> str:\n", + " \"\"\"Summarizes the content of a single file.\"\"\"\n", + " sys_prompt = (\n", + " \"You are a senior software engineer creating a concise summary of a \"\n", + " \"source file for a project's README.md.\"\n", + " )\n", + " user_prompt = textwrap.dedent(\n", + " f\"\"\"\\\n", + " Please summarize the following file: `{file_path}`.\n", + "\n", + " The summary should be a **concise paragraph** (around 40-60 words) that \n", + " explains the file's primary purpose, its main functions or classes, and how \n", + " it fits into the broader project. 
Focus on the *what* and *why*, not a \n", + " line-by-line explanation of the *how*.\n", + "\n", + " ```\n", + " {file_content}\n", + " ```\n", + " \"\"\"\n", + " )\n", + " try:\n", + " resp = client.chat.completions.create(\n", + " model=LLM_MODEL,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": sys_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt},\n", + " ],\n", + " temperature=0.1, # Low temperature for deterministic summaries\n", + " max_tokens=MAX_COMPLETION_TOKENS_FILE,\n", + " )\n", + " return resp.completion_message.content.text\n", + " except Exception as e:\n", + " print(f\" Error summarizing file: {e}\")\n", + " return \"\" # Return empty string on failure" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f420b225-5abf-46ed-ad3a-a333ae59e3ad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- Summarizing individual files ---\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "πŸ” Summarising files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 22/22 [00:28<00:00, 1.29s/file]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "βœ… Summarized 15 files.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "# --- Summarize relevant files ---\n", + "print(\"\\n--- Summarizing individual files ---\")\n", + "file_summaries: Dict[str, str] = {}\n", + "files_to_process = list(extracted_root.rglob(\"*\"))\n", + "\n", + "for file_path in tqdm(files_to_process, desc=\"πŸ” Summarizing files\", unit=\"file\"):\n", + " # First, check if the file type is one we want to process.\n", + " if (\n", + " not should_include_file(file_path, extracted_root) # valid file for summarization\n", + " or file_path.stat().st_size > MAX_FILE_SIZE\n", + " or file_path.stat().st_size == 0\n", + " ):\n", + " continue\n", + "\n", + " rel_name = str(file_path.relative_to(extracted_root))\n", + " try:\n", + " text = file_path.read_text(encoding=\"utf-8\")\n", + " except UnicodeDecodeError:\n", + " continue\n", + " \n", + " if not text.strip():\n", + " continue\n", + " \n", + " # With a large context window, we can summarize the whole file at once.\n", + " summary = summarize_file_content(rel_name, text)\n", + " if summary:\n", + " file_summaries[rel_name] = summary\n", + "\n", + "print(f\"βœ… Summarized {len(file_summaries)} files.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "5e777b23-7271-4768-bf31-5807528b7151", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'CODE_OF_CONDUCT.md': 'The `CODE_OF_CONDUCT.md` file outlines the expected '\n", + " 'behavior and standards for contributors and '\n", + " 'maintainers of the project, aiming to create a '\n", + " 'harassment-free and welcoming environment. 
It defines '\n", + " 'acceptable and unacceptable behavior, roles and '\n", + " 'responsibilities, and procedures for reporting and '\n", + " 'addressing incidents, promoting a positive and '\n", + " 'inclusive community.',\n", + " 'CONTRIBUTING.md': 'Here is a concise summary of the `CONTRIBUTING.md` file:\\n'\n", + " '\\n'\n", + " 'The `CONTRIBUTING.md` file outlines the guidelines and '\n", + " 'processes for contributing to the Llama project. It '\n", + " 'provides instructions for submitting pull requests, '\n", + " 'including bug fixes, improvements, and new features, as '\n", + " 'well as information on the Contributor License Agreement, '\n", + " 'issue tracking, and licensing terms, to ensure a smooth '\n", + " 'and transparent contribution experience.',\n", + " 'MODEL_CARD.md': 'The `MODEL_CARD.md` file provides detailed information '\n", + " 'about the Llama 2 family of large language models (LLMs), '\n", + " 'including model architecture, training data, performance '\n", + " 'evaluations, and intended use cases. It serves as a '\n", + " \"comprehensive model card, outlining the model's \"\n", + " 'capabilities, limitations, and responsible use guidelines '\n", + " 'for developers and researchers.',\n", + " 'README.md': 'This `README.md` file serves as a deprecated repository for '\n", + " 'Llama 2, a large language model, providing minimal examples for '\n", + " 'loading models and running inference. It directs users to new, '\n", + " 'consolidated repositories for Llama 3.1 and offers guidance on '\n", + " 'downloading models, quick start instructions, and responsible '\n", + " 'use guidelines.',\n", + " 'UPDATES.md': 'Here is a concise summary of the `UPDATES.md` file:\\n'\n", + " '\\n'\n", + " 'The `UPDATES.md` file documents recent updates to the project, '\n", + " 'specifically addressing issues with system prompts and token '\n", + " 'sanitization. Updates aim to reduce false refusal rates and '\n", + " 'prevent prompt injection attacks, enhancing model safety and '\n", + " 'security. Changes include removing default system prompts and '\n", + " 'sanitizing user-provided prompts to mitigate abuse.',\n", + " 'USE_POLICY.md': 'Here is a concise summary of the `USE_POLICY.md` file:\\n'\n", + " '\\n'\n", + " 'The Llama 2 Acceptable Use Policy outlines the guidelines '\n", + " 'for safe and responsible use of the Llama 2 tool. It '\n", + " 'prohibits uses that violate laws, harm individuals or '\n", + " 'groups, or facilitate malicious activities, and requires '\n", + " 'users to report any policy violations, bugs, or concerns to '\n", + " 'designated channels.',\n", + " 'download.sh': 'The `download.sh` script downloads Llama 2 models and '\n", + " 'associated files from a provided presigned URL. It prompts '\n", + " 'for a URL and optional model sizes, then downloads the '\n", + " 'models, tokenizer, LICENSE, and usage policy to a target '\n", + " 'folder, verifying checksums for integrity.',\n", + " 'example_chat_completion.py': 'This file, `example_chat_completion.py`, '\n", + " 'demonstrates how to use a pretrained Llama '\n", + " 'model for generating text in a conversational '\n", + " 'setting. It defines a `main` function that '\n", + " 'takes in model checkpoints, tokenizer paths, '\n", + " 'and generation parameters, and uses them to '\n", + " 'generate responses to a set of predefined '\n", + " 'dialogs. 
The file serves as an example for '\n", + " 'chat completion tasks in the broader project.',\n", + " 'example_text_completion.py': 'This file, `example_text_completion.py`, '\n", + " 'demonstrates text generation using a '\n", + " 'pretrained Llama model. The `main` function '\n", + " 'initializes the model, generates text '\n", + " 'completions for a set of prompts, and prints '\n", + " \"the results. It showcases the model's \"\n", + " 'capabilities in natural language continuation '\n", + " 'and translation tasks, serving as an example '\n", + " 'for integrating Llama into broader projects.',\n", + " 'llama/__init__.py': 'The `llama/__init__.py` file serves as the entry point '\n", + " 'for the Llama project, exposing key classes and '\n", + " 'modules. It imports and makes available the main '\n", + " '`Llama` and `Dialog` generation classes, `ModelArgs` '\n", + " 'and `Transformer` model components, and the `Tokenizer` '\n", + " \"class, providing a foundation for the project's \"\n", + " 'functionality.',\n", + " 'llama/generation.py': 'The `llama/generation.py` file contains the core '\n", + " 'logic for text generation using the Llama model. It '\n", + " 'defines the `Llama` class, which provides methods for '\n", + " 'building a model instance, generating text '\n", + " 'completions, and handling conversational dialogs. The '\n", + " 'class supports features like nucleus sampling, log '\n", + " 'probability computation, and special token handling.',\n", + " 'llama/model.py': 'The `llama/model.py` file defines a Transformer-based '\n", + " 'model architecture, specifically the Llama model. It '\n", + " 'includes key components such as RMSNorm, attention '\n", + " 'mechanisms, feedforward layers, and a Transformer block, '\n", + " 'which are combined to form the overall model. The model is '\n", + " 'designed for efficient and scalable training and '\n", + " 'inference.',\n", + " 'llama/tokenizer.py': 'The `llama/tokenizer.py` file implements a tokenizer '\n", + " 'class using SentencePiece, enabling text tokenization '\n", + " 'and encoding/decoding. The `Tokenizer` class loads a '\n", + " 'SentencePiece model, providing `encode` and `decode` '\n", + " 'methods for converting text to token IDs and vice '\n", + " 'versa, with optional BOS and EOS tokens.',\n", + " 'requirements.txt': 'Here is a concise summary of the `requirements.txt` '\n", + " 'file:\\n'\n", + " '\\n'\n", + " 'The `requirements.txt` file specifies the dependencies '\n", + " 'required to run the project. It lists essential '\n", + " 'libraries, including PyTorch, Fairscale, Fire, and '\n", + " 'SentencePiece, which provide core functionality for the '\n", + " 'project. This file ensures that all necessary packages '\n", + " \"are installed, enabling the project's features and \"\n", + " 'functionality to work as intended.',\n", + " 'setup.py': 'The `setup.py` file is a build script that packages and '\n", + " 'distributes the project. Its primary purpose is to define '\n", + " 'project metadata and dependencies. 
It uses `setuptools` to find '\n", + " 'and include packages, and loads required libraries from '\n", + " '`requirements.txt`, enabling easy installation and setup of the '\n", + " 'project.'}\n" + ] + } + ], + "source": [ + "pprint.pprint(file_summaries)" + ] + }, + { + "cell_type": "markdown", + "id": "0796a5c9-0f08-465e-9ec3-cb12f3426248", + "metadata": {}, + "source": [ + "## Step 3: Create repository overview\n", + "\n", + "After summarizing each file, the next step is to synthesize this information into a high-level repository overview. This overview provides a starting point for a user to understand the project's purpose and structure.\n", + "\n", + "You'll prompt Llama 4 to generate three key sections based on the file summaries from the previous step:\n", + "1. **Project Overview**: A short, descriptive paragraph that explains the repository's main purpose.\n", + "2. **Key Components**: A bulleted list of the most important files, providing a quick look at the core logic.\n", + "3. **Getting Started**: A brief instruction on how to install dependencies and run the project.\n", + "\n", + "This prompt leverages the previously generated file summaries as context, enabling the model to create an accurate and cohesive overview without re-analyzing the raw source code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b57e502-0bb4-41f3-bc90-1986a2118c4a", + "metadata": {}, + "outputs": [], + "source": [ + "MAX_COMPLETION_TOKENS_REPO = 600 # Max tokens for repo overview\n", + "\n", + "def build_repo_overview(file_summaries: Dict[str, str]) -> str:\n", + " \"\"\"Creates the high-level Overview and Key Components sections.\"\"\"\n", + " bullets = \"\\n\".join(f\"- **{n}**: {s}\" for n, s in file_summaries.items())\n", + " sys_prompt = (\n", + " \"You are an expert technical writer. Draft a high-level overview \"\n", + " \"for the root of a README.md.\"\n", + " )\n", + " user_prompt = textwrap.dedent(\n", + " f\"\"\"\\\n", + " Below is a list of source files with their summaries.\n", + "\n", + " 1. Write an **'Overview'** section (β‰ˆ3-4 sentences) explaining the purpose of the repository.\n", + " 2. Follow it with a **'Key Components'** bullet list (max 6 bullets) referencing the files.\n", + " 3. 
Close with a short 'Getting Started' hint: `pip install -r requirements.txt` etc.\n", + "\n", + " ---\n", + " FILE SUMMARIES\n", + " {bullets}\n", + " \"\"\"\n", + " )\n", + " try:\n", + " resp = client.chat.completions.create(\n", + " model=LLM_MODEL,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": sys_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt},\n", + " ],\n", + " temperature=0.1,\n", + " max_tokens=MAX_COMPLETION_TOKENS_REPO,\n", + " )\n", + " return resp.completion_message.content.text\n", + " except Exception as e:\n", + " print(f\" Error creating repo overview: {e}\")\n", + " return \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "c92ac3d4-dbc7-4577-bf75-2b56072daba3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- Building high-level repository overview ---\n", + "βœ… Overview created.\n" + ] + } + ], + "source": [ + "# --- Create High-Level Repo Overview ---\n", + "print(\"\\n--- Building high-level repository overview ---\")\n", + "repo_overview = build_repo_overview(file_summaries)\n", + "print(\"βœ… Overview created.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "a15aaa25-440c-4901-a677-ac5452d8775d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Here is a high-level overview for the root of a README.md:\n", + "\n", + "## Overview\n", + "\n", + "This repository provides a comprehensive framework for utilizing the Llama large language model, including model architecture, training data, and example usage. The project aims to facilitate the development of natural language processing applications, while promoting responsible use and community engagement. By providing a range of tools and resources, this repository enables developers and researchers to explore the capabilities and limitations of the Llama model. The repository is structured to support easy integration, modification, and extension of the model.\n", + "\n", + "## Key Components\n", + "\n", + "* **llama/generation.py**: Core logic for text generation using the Llama model\n", + "* **llama/model.py**: Transformer-based model architecture definition\n", + "* **llama/tokenizer.py**: Tokenizer class using SentencePiece for text encoding and decoding\n", + "* **example_text_completion.py**: Example usage of the Llama model for text completion tasks\n", + "* **example_chat_completion.py**: Example usage of the Llama model for conversational tasks\n", + "* **requirements.txt**: Dependency specifications for project setup and installation\n", + "\n", + "## Getting Started\n", + "\n", + "To get started with this project, run `pip install -r requirements.txt` to install the required dependencies. You can then explore the example usage files, such as `example_text_completion.py` and `example_chat_completion.py`, to learn more about integrating the Llama model into your projects.\n" + ] + } + ], + "source": [ + "print(repo_overview)" + ] + }, + { + "cell_type": "markdown", + "id": "824f4439-b1d5-4c44-833a-573cfbc03ee7", + "metadata": {}, + "source": [ + "## Step 4: Analyze repository architecture\n", + "\n", + "A high-level overview is useful, but a deep architectural understanding requires analyzing how components interact. This phase generates that deeper analysis.\n", + "\n", + "### Two-step approach to architecture analysis\n", + "\n", + "Analyzing an entire codebase for architectural patterns is complex. 
Instead of passing all the code to the model at once, you'll use a more strategic, two-step approach that mirrors how a human architect would work:\n", + "\n", + "1. **AI-driven file selection**: First, you use Llama 4 to identify the most architecturally significant files. The model is prompted to select files that represent the core logic, primary entry points, or key data structures, based on the summaries generated earlier. This step efficiently filters the codebase down to its most critical components.\n", + "2. **Deep-dive analysis**: With the key files identified, you perform a much deeper analysis. While only the full source code of these selected files is provided, the model also receives the summaries of *all* files generated in the first step. This ensures it has broad, high-level context on the entire repository when it performs its deep analysis.\n", + "\n", + "This two-step process is highly effective because it focuses the model's analytical power on the most important parts of the code, enabling it to generate high-quality architectural insights that are difficult to achieve with a less focused approach." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d107a46a-8cc3-4cfd-bfc3-3847076bf523", + "metadata": {}, + "outputs": [], + "source": [ + "def select_important_files(file_summaries: Dict[str, str]) -> List[str]:\n", + " \"\"\"Uses an LLM to select the most architecturally significant files.\"\"\"\n", + " bullets = \"\\n\".join(f\"- **{n}**: {s}\" for n, s in file_summaries.items())\n", + " sys_prompt = (\n", + " \"You are a senior software architect. Your task is to identify the \"\n", + " \"most critical files for understanding a repository's architecture.\"\n", + " )\n", + " user_prompt = textwrap.dedent(\n", + " f\"\"\"\\\n", + " Based on the following file summaries, identify the most architecturally\n", + " significant files. These files should represent the core logic,\n", + " primary entry points, or key data structures of the project.\n", + "\n", + " Your response MUST be a comma-separated list of file paths, ordered from\n", + " most to least architecturally significant. 
Do not add any other text.\n", + " Please ensure that the file paths exactly match the file summaries \n", + " below.\n", + " Example: `README.md`,`src/main.py,src/utils.py,src/models.py`\n", + "\n", + " ---\n", + " FILE SUMMARIES\n", + " {bullets}\n", + " \"\"\"\n", + " )\n", + " \n", + " try:\n", + " resp = client.chat.completions.create(\n", + " model=LLM_MODEL,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": sys_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt},\n", + " ],\n", + " temperature=0.1,\n", + " )\n", + " response = resp.completion_message.content.text\n", + " \n", + " # Parse the comma-separated list.\n", + " if response:\n", + " # Clean up the response to handle potential markdown code blocks\n", + " cleaned_response = (response.strip()\n", + " .removeprefix(\"```\")\n", + " .removesuffix(\"```\")\n", + " .strip())\n", + " return [f.strip() for f in cleaned_response.split(',') if f.strip()]\n", + " except Exception as e:\n", + " print(f\" Error selecting important files: {e}\")\n", + " return []" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "22ad831e-0b3d-405c-9749-b9df65c2f4c3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- Selecting important files for deep analysis ---\n", + "βœ… LLM selected 6 files for analysis: ['llama/generation.py', 'llama/model.py', 'llama/__init__.py', 'llama/tokenizer.py', 'example_text_completion.py', 'example_chat_completion.py']\n" + ] + } + ], + "source": [ + "print(\"\\n--- Selecting important files for deep analysis ---\")\n", + "important_files = select_important_files(file_summaries)\n", + "if important_files:\n", + " print(f\"βœ… LLM selected {len(important_files)} files for analysis: \"\n", + " f\"{important_files}\")\n", + "else:\n", + " print(\"ℹ️ No files were selected for architectural analysis.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "efa951e9-8584-4ac5-9d41-639faef9c5a4", + "metadata": {}, + "outputs": [], + "source": [ + "def token_estimate(text: str) -> int:\n", + " \"\"\"Estimates the token count of a text string using tiktoken.\"\"\"\n", + " enc = tiktoken.get_encoding(\"o200k_base\")\n", + " return len(enc.encode(text))" + ] + }, + { + "cell_type": "markdown", + "id": "68e91354", + "metadata": {}, + "source": [ + "**Managing context for large repositories**\n", + "\n", + "In large repositories, the combined size of important files can still exceed the model's context window. The code below uses a simple budgeting strategy: it collects file contents until a token limit is reached, ensuring the request doesn't fail.\n", + "\n", + "For a production-grade system, a more sophisticated approach is recommended. For example, you could include the full content of the most critical files that fit, and supplement this with summaries of other important files to stay within the context limit." 
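+ "\n",
+ "As a rough illustration (this sketch is not used by the pipeline in this notebook), a hybrid strategy could reuse the `token_estimate` helper and the `file_summaries` dictionary defined earlier; the `read_file` argument below is a hypothetical stand-in for however you load source text:\n",
+ "\n",
+ "```python\n",
+ "# Illustrative sketch only: include the full source of the most important\n",
+ "# files while the token budget allows, then fall back to their summaries.\n",
+ "def pack_context(important_files, file_summaries, read_file, budget):\n",
+ "    sections, used = [], 0\n",
+ "    for name in important_files:\n",
+ "        code = read_file(name)  # hypothetical loader, e.g. a Path.read_text() wrapper\n",
+ "        cost = token_estimate(code)\n",
+ "        if used + cost <= budget:\n",
+ "            sections.append(f\"### {name} (full source)\\n{code}\")\n",
+ "            used += cost\n",
+ "        else:\n",
+ "            summary = file_summaries.get(name, \"\")\n",
+ "            sections.append(f\"### {name} (summary)\\n{summary}\")\n",
+ "            used += token_estimate(summary)\n",
+ "    return \"\\n\\n\".join(sections)\n",
+ "```\n"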
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "324a3a67-7f06-437e-a917-38eba6fba738", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- Step 5: Retrieving code for 6 selected files ---\n", + "βœ… Retrieved content of 6 files for deep analysis.\n" + ] + } + ], + "source": [ + "# --- Get code for selected files ---\n", + "# The files are processed in order of importance as determined by the LLM, so\n", + "# that the most critical files are most likely to be included if we hit the\n", + "# context window budget.\n", + "snippets: List[Tuple[str, str]] = []\n", + "if important_files:\n", + " print(f\"\\n--- Step 5: Retrieving code for {len(important_files)} \"\n", + " f\"selected files ---\")\n", + " tokens_used = 0\n", + " for file_name in important_files:\n", + " # It's possible the model returns paths with leading/trailing whitespace\n", + " file_name = file_name.strip()\n", + "\n", + " fp = extracted_root / file_name\n", + " if not fp.is_file():\n", + " print(f\"⚠️ Selected path '{file_name}' is not a file, skipping.\")\n", + " continue\n", + "\n", + " try:\n", + " # Limit file size to avoid huge token counts for single files\n", + " code = fp.read_text(encoding=\"utf-8\")[:20_000]\n", + " except UnicodeDecodeError:\n", + " continue\n", + "\n", + " token_count = token_estimate(code)\n", + "\n", + " # Reserve half of the context window for summaries and other prompt text\n", + " if tokens_used + token_count > (CTX_WINDOW // 2):\n", + " print(f\"⚠️ Context window budget reached. Stopping at \"\n", + " f\"{len(snippets)} files.\")\n", + " break\n", + "\n", + " snippets.append((file_name, code))\n", + " tokens_used += token_count\n", + "\n", + " print(f\"βœ… Retrieved content of {len(snippets)} files for deep analysis.\")" + ] + }, + { + "cell_type": "markdown", + "id": "84837f2e-46e8-4735-8ab3-fcf1ba3ee49e", + "metadata": {}, + "source": [ + "**Deep Analysis Process**: Include full source code of selected files in context to generate:\n", + "- Mermaid class diagrams\n", + "- Component relationships \n", + "- Architectural patterns\n", + "- README-ready documentation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "69f5b5d1-e95c-4c9d-bf87-d68f4c3cf63a", + "metadata": {}, + "outputs": [], + "source": [ + "# --- Cross-File Architectural Reasoning Function ---\n", + "MAX_COMPLETION_TOKENS_ARCH = 900 # Max tokens for architecture overview\n", + "\n", + "def build_architecture(\n", + " file_summaries: Dict[str, str], \n", + " code_snippets: List[Tuple[str, str]], \n", + " ctx_budget: int\n", + ") -> str:\n", + " \"\"\"Produces an Architecture & Key Concepts section using the large model.\"\"\"\n", + " summary_lines = \"\\n\".join(f\"- **{n}**: {s}\" for n, s in file_summaries.items())\n", + " prompt_sections = [\n", + " \"[[FILE_SUMMARIES]]\",\n", + " summary_lines,\n", + " \"[[/FILE_SUMMARIES]]\",\n", + " ]\n", + " tokens_used = token_estimate(\"\\n\".join(prompt_sections))\n", + "\n", + " if code_snippets:\n", + " code_block_lines = []\n", + " for fname, code in code_snippets:\n", + " added = \"\\n### \" + fname + \"\\n```code\\n\" + code + \"\\n```\\n\"\n", + " t = token_estimate(added)\n", + " if tokens_used + t > (ctx_budget // 2):\n", + " break\n", + " code_block_lines.append(added)\n", + " tokens_used += t\n", + " if code_block_lines:\n", + " prompt_sections.extend(\n", + " [\"[[RAW_CODE_SNIPPETS]]\"] + code_block_lines + \n", + " [\"[[/RAW_CODE_SNIPPETS]]\"]\n", + " )\n", + "\n", + " 
user_prompt = textwrap.dedent(\"\\n\".join(prompt_sections) + \"\"\"\n", + " ---\n", + " **Your tasks**\n", + " 1. Identify the major abstractions (classes, services, data models) \n", + " across the entire codebase.\n", + " 2. Explain how they interact – include dependencies, data flow, and any \n", + " cross-cutting concerns.\n", + " 3. Output a concise *Architecture & Key Concepts* section suitable for a \n", + " README, consisting of:\n", + " β€’ short Overview (≀ 3 sentences)\n", + " β€’ Mermaid diagram (`classDiagram` or `flowchart`) of components\n", + " β€’ bullet list of abstractions with brief descriptions.\n", + " \"\"\")\n", + "\n", + " sys_prompt = (\n", + " \"You are a principal software architect. Use the provided file \"\n", + " \"summaries (and raw code if present) to infer high-level design. \"\n", + " \"Be precise and avoid guesswork.\"\n", + " )\n", + " \n", + " try:\n", + " resp = client.chat.completions.create(\n", + " model=LLM_MODEL,\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": sys_prompt},\n", + " {\"role\": \"user\", \"content\": user_prompt},\n", + " ],\n", + " temperature=0.2,\n", + " max_tokens=MAX_COMPLETION_TOKENS_ARCH,\n", + " )\n", + " return resp.completion_message.content.text\n", + " except Exception as e:\n", + " print(f\" Error creating architecture analysis: {e}\")\n", + " return \"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "09d61720-62b8-4dc7-a2c0-18c927382432", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- Performing cross-file architectural reasoning ---\n", + "βœ… Architectural analysis complete.\n" + ] + } + ], + "source": [ + "print(\"\\n--- Performing cross-file architectural reasoning ---\")\n", + "architecture_section = build_architecture(\n", + " file_summaries, snippets, CTX_WINDOW\n", + ")\n", + "print(\"βœ… Architectural analysis complete.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "5d9036ca-9aef-47f9-8c02-fe86f48ba427", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "## Architecture & Key Concepts\n", + "\n", + "### Overview\n", + "\n", + "The Llama project is a large language model implementation that provides a simple and efficient way to generate text based on given prompts. The project consists of several key components, including a Transformer-based model, a tokenizer, and a generation module. 
These components work together to enable text completion and chat completion tasks.\n", + "\n", + "### Mermaid Diagram\n", + "\n", + "```mermaid\n", + "classDiagram\n", + " class Llama {\n", + " +build(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size)\n", + " +text_completion(prompts, temperature, top_p, max_gen_len, logprobs, echo)\n", + " +chat_completion(dialogs, temperature, top_p, max_gen_len, logprobs)\n", + " }\n", + " class Transformer {\n", + " +forward(tokens, start_pos)\n", + " }\n", + " class Tokenizer {\n", + " +encode(s, bos, eos)\n", + " +decode(t)\n", + " }\n", + " class ModelArgs {\n", + " +dim\n", + " +n_layers\n", + " +n_heads\n", + " +n_kv_heads\n", + " +vocab_size\n", + " +multiple_of\n", + " +ffn_dim_multiplier\n", + " +norm_eps\n", + " +max_batch_size\n", + " +max_seq_len\n", + " }\n", + " Llama --> Transformer\n", + " Llama --> Tokenizer\n", + " Transformer --> ModelArgs\n", + "```\n", + "\n", + "### Abstractions and Descriptions\n", + "\n", + "* **Llama**: The main class that provides a simple interface for text completion and chat completion tasks. It uses a Transformer-based model and a tokenizer to generate text.\n", + "* **Transformer**: A Transformer-based model that takes in token IDs and outputs logits. It consists of multiple layers, each with an attention mechanism and a feedforward network.\n", + "* **Tokenizer**: A class that tokenizes and encodes/decodes text using SentencePiece.\n", + "* **ModelArgs**: A dataclass that stores the model configuration parameters, such as the dimension, number of layers, and vocabulary size.\n", + "* **Dialog**: A list of messages, where each message is a dictionary with a role and content.\n", + "* **Message**: A dictionary with a role and content.\n", + "\n", + "## Interaction and Dependencies\n", + "\n", + "The Llama class depends on the Transformer and Tokenizer classes. The Transformer class depends on the ModelArgs dataclass. The Llama class uses the Transformer and Tokenizer classes to generate text.\n", + "\n", + "The data flow is as follows:\n", + "\n", + "1. The Llama class takes in a prompt or a dialog and tokenizes it using the Tokenizer class.\n", + "2. The tokenized prompt or dialog is then passed to the Transformer class, which outputs logits.\n", + "3. The logits are then used to generate text, which is returned by the Llama class.\n", + "\n", + "Cross-cutting concerns include:\n", + "\n", + "* **Model parallelism**: The Transformer class uses model parallelism to speed up computation.\n", + "* **Caching**: The Transformer class caches the keys and values for attention to reduce computation.\n", + "* **Error handling**: The Llama class and Transformer class handle errors, such as invalid input or out-of-range values.\n", + "\n", + "## Key Components and Their Responsibilities\n", + "\n", + "* **Llama**: Provides a simple interface for text completion and chat completion tasks.\n", + "* **Transformer**: Implements the Transformer-based model for generating text.\n", + "* **Tokenizer**: Tokenizes and encodes/decodes text using SentencePiece.\n", + "* **ModelArgs**: Stores the model configuration parameters.\n", + "\n", + "## Generation Module\n", + "\n", + "The generation module is responsible for generating text based on given prompts. 
It uses the Transformer class and the Tokenizer class to generate text.\n", + "\n", + "The generation module provides two main functions:\n", + "\n", + "* **text_completion**: Generates text completions for a list of prompts.\n", + "* **chat_completion**: Generates assistant responses for a list of conversational dialogs.\n", + "\n", + "These functions take in parameters such as temperature, top-p, and maximum generation length to control the generation process.\n", + "\n", + "## Conclusion\n", + "\n", + "The Llama project provides a simple and efficient way to generate text based on given prompts. The project consists of several key components, including a Transformer-based model, a tokenizer, and a generation module. These components work together to enable text completion and chat completion tasks.\n" + ] + } + ], + "source": [ + "print(architecture_section)" + ] + }, + { + "cell_type": "markdown", + "id": "68646af1-e362-4d1d-8a8d-ed311e54145b", + "metadata": {}, + "source": [ + "## Step 5: Assemble final documentation\n", + "\n", + "The final phase assembles all the AI-generated content into a single, comprehensive `README.md` file. The goal is to create a document that is not only informative but also easy for developers to navigate and use.\n", + "\n", + "### Documentation structure\n", + "\n", + "The generated README follows a layered approach that enables readers to consume information at their preferred level of detail.\n", + "\n", + "1. **Repository Summary**: A high-level overview gives developers an immediate understanding of the project's purpose.\n", + "2. **Architecture and Key Concepts**: A deeper technical analysis, including a Mermaid diagram, helps developers understand how the system is designed.\n", + "3. **File Summaries**: A detailed breakdown of each component provides granular information for those who need it.\n", + "4. **Attribution**: A concluding note clarifies that the document was generated by AI, which provides transparency about its origin.\n", + "\n", + "> **🎯** The combination of Llama 4's code intelligence and large context window enables the automated generation of thorough, high-quality documentation that rivals manually-created content, requiring minimal human intervention." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "a0594f6f-4dcb-4510-9f4b-aea1b892f6be", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "✍️ Writing final README to /Users/saip/Documents/GitHub/meta-documentation-shared/notebooks/Generated_README_llama-main.md...\n", + "\n", + "\n", + "πŸŽ‰ Success! 
Documentation generated at: /Users/saip/Documents/GitHub/meta-documentation-shared/notebooks/Generated_README_llama-main.md\n" + ] + } + ], + "source": [ + "OUTPUT_DIR = Path.cwd()\n", + "readme_path = OUTPUT_DIR / f\"Generated_README_{extracted_root.name}.md\"\n", + "print(f\"\\n✍️ Writing final README to {readme_path.resolve()}...\")\n", + "with readme_path.open(\"w\", encoding=\"utf-8\") as fh:\n", + " fh.write(f\"# Repository Summary for `{extracted_root.name}`\\n\\n\"\n", + " f\"{repo_overview}\\n\\n\")\n", + " fh.write(\"## Architecture & Key Concepts\\n\\n\")\n", + " fh.write(architecture_section.strip() + \"\\n\\n\")\n", + " fh.write(\"## File Summaries\\n\\n\")\n", + " for n, s in sorted(file_summaries.items()):\n", + " fh.write(f\"- **{n}** – {s}\\n\")\n", + " fh.write(\n", + " \"\\n---\\n*This README was generated automatically using \"\n", + " \"Meta's **Llama 4** models.*\"\n", + " )\n", + "\n", + "print(f\"\\n\\nπŸŽ‰ Success! Documentation generated at: \"\n", + " f\"{readme_path.resolve()}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "82caa4b0-3f9f-4749-802a-348c941ed386", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "# Repository Summary for `llama-main`\n", + "\n", + "Here is a high-level overview for the root of a README.md:\n", + "\n", + "## Overview\n", + "\n", + "This repository provides a comprehensive framework for utilizing the Llama large language model, including model architecture, training data, and example usage. The project aims to facilitate the development of natural language processing applications, while promoting responsible use and community engagement. By providing a range of tools and resources, this repository enables developers and researchers to explore the capabilities and limitations of the Llama model. The repository is structured to support easy integration, modification, and extension of the model.\n", + "\n", + "## Key Components\n", + "\n", + "* **llama/generation.py**: Core logic for text generation using the Llama model\n", + "* **llama/model.py**: Transformer-based model architecture definition\n", + "* **llama/tokenizer.py**: Tokenizer class using SentencePiece for text encoding and decoding\n", + "* **example_text_completion.py**: Example usage of the Llama model for text completion tasks\n", + "* **example_chat_completion.py**: Example usage of the Llama model for conversational tasks\n", + "* **requirements.txt**: Dependency specifications for project setup and installation\n", + "\n", + "## Getting Started\n", + "\n", + "To get started with this project, run `pip install -r requirements.txt` to install the required dependencies. You can then explore the example usage files, such as `example_text_completion.py` and `example_chat_completion.py`, to learn more about integrating the Llama model into your projects.\n", + "\n", + "## Architecture & Key Concepts\n", + "\n", + "## Architecture & Key Concepts\n", + "\n", + "### Overview\n", + "\n", + "The Llama project is a large language model implementation that provides a simple and efficient way to generate text based on given prompts. The project consists of several key components, including a Transformer-based model, a tokenizer, and a generation module. 
These components work together to enable text completion and chat completion tasks.\n", + "\n", + "### Mermaid Diagram\n", + "\n", + "```mermaid\n", + "classDiagram\n", + " class Llama {\n", + " +build(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size)\n", + " +text_completion(prompts, temperature, top_p, max_gen_len, logprobs, echo)\n", + " +chat_completion(dialogs, temperature, top_p, max_gen_len, logprobs)\n", + " }\n", + " class Transformer {\n", + " +forward(tokens, start_pos)\n", + " }\n", + " class Tokenizer {\n", + " +encode(s, bos, eos)\n", + " +decode(t)\n", + " }\n", + " class ModelArgs {\n", + " +dim\n", + " +n_layers\n", + " +n_heads\n", + " +n_kv_heads\n", + " +vocab_size\n", + " +multiple_of\n", + " +ffn_dim_multiplier\n", + " +norm_eps\n", + " +max_batch_size\n", + " +max_seq_len\n", + " }\n", + " Llama --> Transformer\n", + " Llama --> Tokenizer\n", + " Transformer --> ModelArgs\n", + "```\n", + "\n", + "### Abstractions and Descriptions\n", + "\n", + "* **Llama**: The main class that provides a simple interface for text completion and chat completion tasks. It uses a Transformer-based model and a tokenizer to generate text.\n", + "* **Transformer**: A Transformer-based model that takes in token IDs and outputs logits. It consists of multiple layers, each with an attention mechanism and a feedforward network.\n", + "* **Tokenizer**: A class that tokenizes and encodes/decodes text using SentencePiece.\n", + "* **ModelArgs**: A dataclass that stores the model configuration parameters, such as the dimension, number of layers, and vocabulary size.\n", + "* **Dialog**: A list of messages, where each message is a dictionary with a role and content.\n", + "* **Message**: A dictionary with a role and content.\n", + "\n", + "## Interaction and Dependencies\n", + "\n", + "The Llama class depends on the Transformer and Tokenizer classes. The Transformer class depends on the ModelArgs dataclass. The Llama class uses the Transformer and Tokenizer classes to generate text.\n", + "\n", + "The data flow is as follows:\n", + "\n", + "1. The Llama class takes in a prompt or a dialog and tokenizes it using the Tokenizer class.\n", + "2. The tokenized prompt or dialog is then passed to the Transformer class, which outputs logits.\n", + "3. The logits are then used to generate text, which is returned by the Llama class.\n", + "\n", + "Cross-cutting concerns include:\n", + "\n", + "* **Model parallelism**: The Transformer class uses model parallelism to speed up computation.\n", + "* **Caching**: The Transformer class caches the keys and values for attention to reduce computation.\n", + "* **Error handling**: The Llama class and Transformer class handle errors, such as invalid input or out-of-range values.\n", + "\n", + "## Key Components and Their Responsibilities\n", + "\n", + "* **Llama**: Provides a simple interface for text completion and chat completion tasks.\n", + "* **Transformer**: Implements the Transformer-based model for generating text.\n", + "* **Tokenizer**: Tokenizes and encodes/decodes text using SentencePiece.\n", + "* **ModelArgs**: Stores the model configuration parameters.\n", + "\n", + "## Generation Module\n", + "\n", + "The generation module is responsible for generating text based on given prompts. 
It uses the Transformer class and the Tokenizer class to generate text.\n", + "\n", + "The generation module provides two main functions:\n", + "\n", + "* **text_completion**: Generates text completions for a list of prompts.\n", + "* **chat_completion**: Generates assistant responses for a list of conversational dialogs.\n", + "\n", + "These functions take in parameters such as temperature, top-p, and maximum generation length to control the generation process.\n", + "\n", + "## Conclusion\n", + "\n", + "The Llama project provides a simple and efficient way to generate text based on given prompts. The project consists of several key components, including a Transformer-based model, a tokenizer, and a generation module. These components work together to enable text completion and chat completion tasks.\n", + "\n", + "## File Summaries\n", + "\n", + "- **CODE_OF_CONDUCT.md** – The `CODE_OF_CONDUCT.md` file outlines the expected behavior and standards for contributors and maintainers of the project, aiming to create a harassment-free and welcoming environment. It defines acceptable and unacceptable behavior, roles and responsibilities, and procedures for reporting and addressing incidents, promoting a positive and inclusive community.\n", + "- **CONTRIBUTING.md** – Here is a concise summary of the `CONTRIBUTING.md` file:\n", + "\n", + "The `CONTRIBUTING.md` file outlines the guidelines and processes for contributing to the Llama project. It provides instructions for submitting pull requests, including bug fixes, improvements, and new features, as well as information on the Contributor License Agreement, issue tracking, and licensing terms, to ensure a smooth and transparent contribution experience.\n", + "- **MODEL_CARD.md** – The `MODEL_CARD.md` file provides detailed information about the Llama 2 family of large language models (LLMs), including model architecture, training data, performance evaluations, and intended use cases. It serves as a comprehensive model card, outlining the model's capabilities, limitations, and responsible use guidelines for developers and researchers.\n", + "- **README.md** – This `README.md` file serves as a deprecated repository for Llama 2, a large language model, providing minimal examples for loading models and running inference. It directs users to new, consolidated repositories for Llama 3.1 and offers guidance on downloading models, quick start instructions, and responsible use guidelines.\n", + "- **UPDATES.md** – Here is a concise summary of the `UPDATES.md` file:\n", + "\n", + "The `UPDATES.md` file documents recent updates to the project, specifically addressing issues with system prompts and token sanitization. Updates aim to reduce false refusal rates and prevent prompt injection attacks, enhancing model safety and security. Changes include removing default system prompts and sanitizing user-provided prompts to mitigate abuse.\n", + "- **USE_POLICY.md** – Here is a concise summary of the `USE_POLICY.md` file:\n", + "\n", + "The Llama 2 Acceptable Use Policy outlines the guidelines for safe and responsible use of the Llama 2 tool. It prohibits uses that violate laws, harm individuals or groups, or facilitate malicious activities, and requires users to report any policy violations, bugs, or concerns to designated channels.\n", + "- **download.sh** – The `download.sh` script downloads Llama 2 models and associated files from a provided presigned URL. 
It prompts for a URL and optional model sizes, then downloads the models, tokenizer, LICENSE, and usage policy to a target folder, verifying checksums for integrity.\n", + "- **example_chat_completion.py** – This file, `example_chat_completion.py`, demonstrates how to use a pretrained Llama model for generating text in a conversational setting. It defines a `main` function that takes in model checkpoints, tokenizer paths, and generation parameters, and uses them to generate responses to a set of predefined dialogs. The file serves as an example for chat completion tasks in the broader project.\n", + "- **example_text_completion.py** – This file, `example_text_completion.py`, demonstrates text generation using a pretrained Llama model. The `main` function initializes the model, generates text completions for a set of prompts, and prints the results. It showcases the model's capabilities in natural language continuation and translation tasks, serving as an example for integrating Llama into broader projects.\n", + "- **llama/__init__.py** – The `llama/__init__.py` file serves as the entry point for the Llama project, exposing key classes and modules. It imports and makes available the main `Llama` and `Dialog` generation classes, `ModelArgs` and `Transformer` model components, and the `Tokenizer` class, providing a foundation for the project's functionality.\n", + "- **llama/generation.py** – The `llama/generation.py` file contains the core logic for text generation using the Llama model. It defines the `Llama` class, which provides methods for building a model instance, generating text completions, and handling conversational dialogs. The class supports features like nucleus sampling, log probability computation, and special token handling.\n", + "- **llama/model.py** – The `llama/model.py` file defines a Transformer-based model architecture, specifically the Llama model. It includes key components such as RMSNorm, attention mechanisms, feedforward layers, and a Transformer block, which are combined to form the overall model. The model is designed for efficient and scalable training and inference.\n", + "- **llama/tokenizer.py** – The `llama/tokenizer.py` file implements a tokenizer class using SentencePiece, enabling text tokenization and encoding/decoding. The `Tokenizer` class loads a SentencePiece model, providing `encode` and `decode` methods for converting text to token IDs and vice versa, with optional BOS and EOS tokens.\n", + "- **requirements.txt** – Here is a concise summary of the `requirements.txt` file:\n", + "\n", + "The `requirements.txt` file specifies the dependencies required to run the project. It lists essential libraries, including PyTorch, Fairscale, Fire, and SentencePiece, which provide core functionality for the project. This file ensures that all necessary packages are installed, enabling the project's features and functionality to work as intended.\n", + "- **setup.py** – The `setup.py` file is a build script that packages and distributes the project. Its primary purpose is to define project metadata and dependencies. 
It uses `setuptools` to find and include packages, and loads required libraries from `requirements.txt`, enabling easy installation and setup of the project.\n", + "\n", + "---\n", + "*This README was generated automatically using Meta's **Llama 4** models.*" + ] + } + ], + "source": [ + "!cat $readme_path" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "6a39ab8e-b3c3-455f-8dfd-9df80d8acf3f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "--- Cleaning up temporary directory /var/folders/sz/kf8w7j1x1v790jxs8k2gl72c0000gn/T/tmptwo_kdt5 ---\n", + "βœ… Cleanup complete.\n" + ] + } + ], + "source": [ + "print(f\"\\n--- Cleaning up temporary directory {tmpdir} ---\")\n", + "try:\n", + " tmpdir_obj.cleanup()\n", + " print(\"βœ… Cleanup complete.\")\n", + "except Exception as e:\n", + " print(f\"⚠️ Error during cleanup: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "bce59302-6b03-403c-8399-22d17b824d97", + "metadata": {}, + "source": [ + "## Next steps and upgrade paths\n", + "\n", + "This tutorial provides a solid foundation for automated documentation generation. You can extend it in several ways for a production-grade application.\n", + "\n", + "| Need | Recommended approach |\n", + "| :----------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", + "| **Private repositories** | For private GitHub repos, use authenticated requests with a personal access token. For GitLab or Bitbucket, adapt the download logic to their respective APIs. |\n", + "| **Multiple languages** | Extend the `INCLUDE_EXTENSIONS` list and adjust prompts to handle language-specific documentation patterns. Consider using language-specific parsers for better code understanding. |\n", + "| **Incremental updates** | Implement caching of file summaries with timestamps. Only reprocess files that have changed since the last run, significantly reducing API costs for large repositories. |\n", + "| **Custom documentation formats** | Adapt the final assembly phase to generate different formats such as API documentation, developer guides, or architecture decision records (ADRs). |\n", + "| **CI/CD integration** | Run the documentation generator as part of your continuous integration pipeline to keep documentation automatically synchronized with code changes. |\n", + "| **Multi-repository analysis** | Extend the pipeline to analyze dependencies and generate documentation for entire microservice architectures or monorepos. 
|\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "My Project (uv)", + "language": "python", + "name": "my-uv-project" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 89bd54979c1d24579ec2340e2da242f701e2b9b9 Mon Sep 17 00:00:00 2001 From: Connor Treacy Date: Fri, 17 Oct 2025 16:17:29 +0100 Subject: [PATCH 2/5] Summarization tutorial --- .../summarization/summarization.ipynb | 749 ++++++++++++++++++ 1 file changed, 749 insertions(+) create mode 100644 end-to-end-use-cases/summarization/summarization.ipynb diff --git a/end-to-end-use-cases/summarization/summarization.ipynb b/end-to-end-use-cases/summarization/summarization.ipynb new file mode 100644 index 000000000..803e43b0b --- /dev/null +++ b/end-to-end-use-cases/summarization/summarization.ipynb @@ -0,0 +1,749 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d0ef8fea", + "metadata": {}, + "source": [ + "# Summarization pipeline with chunking" + ] + }, + { + "cell_type": "markdown", + "id": "a2240877", + "metadata": {}, + "source": [ + "*Copyright (c) Meta Platforms, Inc. and affiliates.\n", + "This software may be used and distributed according to the terms of the Llama Community License Agreement.*" + ] + }, + { + "cell_type": "markdown", + "id": "8869a933", + "metadata": {}, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "id": "4eb01d0c", + "metadata": {}, + "source": [ + "This tutorial shows you how to build a robust summarization pipeline for long documents. We will create an \"Intelligent Summarization Assistant\" that uses Llama 4 to summarize a document that is too long to be processed in a single pass.\n", + "\n", + "While models like Llama 4 have massive context windows, summarizing extremely long texts can sometimes cause details to be \"lost in the middle.\" To solve this, we will implement the **Map-Reduce** pattern: first, we'll \"map\" a summarization task over smaller, coherent chunks of the text, and then \"reduce\" those individual summaries into a final, high-fidelity overview.\n", + "\n", + "| Component | Choice | Why |\n", + "| :----------------- | :----------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------- |\n", + "| **Model** | `Llama-4-Maverick-17B-128E-Instruct-FP8` | A powerful model ideal for high-quality summarization at both the chunk and final summary stages. |\n", + "| **Pattern** | Map-Reduce Summarization | A fundamental pattern for processing long documents. We \"map\" a summarization function over each chunk, then \"reduce\" the resulting summaries into a final one. |\n", + "| **Infrastructure** | Llama API | Provides access to Llama 4 models using the `llama_api_client` SDK. |\n", + "---\n", + "\n", + "**Note on Inference Providers:** This tutorial uses the Llama API for demonstration purposes. However, you can run Llama 4 models with any preferred inference provider. Common examples include [Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-meta.html) and [Together AI](https://together.ai/llama). 
The core logic of this tutorial can be adapted to any of these providers.\n", + "\n", + "## What you will learn\n", + "\n", + "- **How to implement a robust pipeline** for summarizing documents of any length.\n", + "- **The foundational \"Map-Reduce\" pattern** for document processing.\n", + "- **Techniques for \"semantic chunking\"** to split a document logically while preserving context.\n", + "- **How to craft effective, stage-specific prompts** for a multi-step LLM pipeline.\n", + "- **How to chain LLM calls** to perform complex, multi-stage tasks.\n", + "\n", + "## Install dependencies\n", + "\n", + "You will need two libraries for this project: `tiktoken` for accurate token counting, and the official `llama-api-client`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "175c12af-fa25-4035-afdd-bcfe482b2c5a", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install --quiet tiktoken llama-api-client" + ] + }, + { + "cell_type": "markdown", + "id": "a0376ad7-8391-4bc3-b207-c0dd356b0410", + "metadata": {}, + "source": [ + "## Imports & Llama API client setup\n", + "\n", + "Import the necessary modules and initialize the `LlamaAPIClient`. This requires a Llama API key to be available as an environment variable. If you do not have a Llama API key, please get one from [Meta Llama API](https://llama.developer.meta.com/). \n", + "\n", + "Remember, we use the Llama API for this tutorial, but you can adapt this section to use your preferred inference provider." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "a5ac6a27-a662-4445-8e15-0a89aa30587d", + "metadata": {}, + "outputs": [], + "source": [ + "import os, sys, re\n", + "from typing import List\n", + "import tiktoken\n", + "from llama_api_client import LlamaAPIClient\n", + "\n", + "# --- Llama client ---\n", + "API_KEY = os.getenv(\"LLAMA_API_KEY\")\n", + "if not API_KEY:\n", + " sys.exit(\"❌ Please set the LLAMA_API_KEY environment variable.\")\n", + "\n", + "client = LlamaAPIClient(api_key=API_KEY)" + ] + }, + { + "cell_type": "markdown", + "id": "537de7f1-869f-4868-b107-87b4e1fc8c8b", + "metadata": {}, + "source": [ + "## Step 1: Get the data\n", + "\n", + "This tutorial uses a markdown version of the Meta research paper, [ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context](https://ai.meta.com/research/publications/astro-teaching-language-models-to-reason-by-reflecting-and-backtracking-in-context/). The file, `ASTRO-Teaching_Language_Models_to_Reason.md`, is included in the `data` sub-directory of the repository, making it easy to follow along.\n", + "\n", + "> We are using a markdown file for this tutorial because it preserves the document's structure with headers, which is useful for semantic chunking. If you are working with other formats like PDFs, you can use parsing services like [LlamaParse](https://www.llamaindex.ai/llamaparse) to convert them to markdown." 
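+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "pdf-to-text-sketch",
+   "metadata": {},
+   "source": [
+    "If you are working from a PDF, the note above suggests a parsing service such as LlamaParse to obtain markdown. As a rough, purely local alternative (a sketch only, not used elsewhere in this tutorial), you could populate `document_text` from a PDF with the `pypdf` library instead of the markdown file used below. Keep in mind that plain-text extraction loses the header structure the chunking step relies on, so you would adapt the splitting logic in Step 2 to your own structural markers:\n",
+    "\n",
+    "```python\n",
+    "# Hypothetical sketch: extract plain text from a PDF with pypdf.\n",
+    "# Requires: pip install pypdf\n",
+    "from pypdf import PdfReader\n",
+    "\n",
+    "reader = PdfReader(\"data/my_paper.pdf\")  # hypothetical path\n",
+    "document_text = \"\\n\\n\".join(\n",
+    "    page.extract_text() or \"\" for page in reader.pages\n",
+    ")\n",
+    "print(f\"Extracted {len(document_text):,} characters.\")\n",
+    "```"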
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "ce0721a0-5624-415f-bab2-b0ba1bedab97", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "βœ… Successfully loaded document: 142,921 characters.\n" + ] + } + ], + "source": [ + "file_path = \"data/ASTRO-Teaching_Language_Models_to_Reason.md\"\n", + "\n", + "try:\n", + " with open(file_path, 'r', encoding='utf-8') as f:\n", + " document_text = f.read()\n", + "except FileNotFoundError:\n", + " raise FileNotFoundError(\n", + " f\"Error: The file was not found at {file_path}\"\n", + " )\n", + "\n", + "if document_text:\n", + " print(f\"βœ… Successfully loaded document: {len(document_text):,} characters.\")" + ] + }, + { + "cell_type": "markdown", + "id": "4a0224bf-b718-4915-ad98-7bf0e737bd27", + "metadata": {}, + "source": [ + "## Step 2: The logic of chunking\n", + "\n", + "### Why Chunk?\n", + "\n", + "For long documents, even with a large context window, summarizing in a single pass can lead to context degradation, where the model may under-weigh details from the middle of the text.\n", + "\n", + "To ensure all parts of the document are processed with equal focus, we use a **map-reduce** approach. Breaking the document into smaller, coherent chunks for individual summarization guarantees a more detailed and high-quality final result.\n", + "\n", + "### How to chunk?\n", + "\n", + "An effective chunking strategy is critical. Simply splitting the text by a fixed token count can break sentences or separate related ideas. A better approach is **semantic chunking**. Our strategy has two levels:\n", + "\n", + "1. **Header-based splitting:** First, the document is split into large sections based on its markdown headers (`#`, `##`, `###`). This preserves the document's logical structure.\n", + "2. **Paragraph-based Chunking:** Each large section is then divided into the final, smaller chunks. This process respects paragraph boundaries and a specified token limit, ensuring the chunks are both semantically coherent and sized appropriately for the LLM.\n", + "\n", + "> **Note on Generalization:** This tutorial's header-based splitting is optimized for markdown documents. For other formats (like plain text or PDFs), you can generalize this header-based splitting approach by identifying similar structural elements. For instance, you could split by chapter titles, numbered sections, or use regular expressions to find custom patterns that define logical breaks in your document. The principle of multi-level semantic chunking remains the same.\n", + "\n", + "### Choosing the Right Chunk Size\n", + "\n", + "While our chunking strategy prioritizes semantic boundaries (headers and paragraphs) over fixed token counts, we still need to set a maximum size for our chunks. This ensures that even the largest semantic chunk fits comfortably within the model's context window.\n", + "\n", + "The `CHUNK_SIZE_TOKENS` constant serves as this upper limit. Finding the right value is a trade-off:\n", + "\n", + "* **Set Too High:** The limit might still be larger than the model's context window (once the prompt is included), causing API calls to fail.\n", + "* **Set Too Low:** This could force the chunking logic to split paragraphs or other logical units too aggressively, reducing the quality of the summaries. 
It also increases the number of API calls, leading to higher cost and latency.\n", + "\n", + "The `16000` token limit in this tutorial is a conservative size for models with large context windows (usually 128k for models available on the Llama API). It leaves ample room for the prompt while ensuring each chunk is large enough to provide meaningful context for summarization.\n", + "\n", + "> **Note on Local Processing:** All processing up to this point, including loading the data and chunking the text, happens locally. We have not yet made any calls to the Llama API. The token counting is done with a local library to ensure our chunks are the right size for the API calls in the next steps." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "f825f541-c63b-4747-8ba9-7e675da427ab", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total chunks created: 54\n", + "Average token count per chunk: 661.94\n", + "Max token count in a chunk: 6357\n", + "Min token count in a chunk: 3\n", + "--------------------------------------------------\n", + "Top 5 Chunks:\n", + "Chunk 0:\n", + "# ASTRO: Teaching Language Models to Reason by Reflecting and Backtracking In-Context\n", + "\n", + "Joongwon Kim1,2, Anirudh Goyal1, Liang Tan1, Hannaneh Hajishirzi2, Srini Iyer1, Tianlu Wang1\n", + "\n", + "1AI at Meta, 2University of Washington\n", + "\n", + "We introduce Astro, the \"Autoregressive Search-Taught Reasoner\", a framework for training language models to reason like search algorithms, explicitly leveraging self-reflection, backtracking, and exploration in their outputs. Recently, training large language models (LLMs) via reinforcement learning (RL) has led to the advent of reasoning models with greatly enhanced reasoning capabilities. Open-source replications of reasoning models, while successful, build upon models that already exhibit strong reasoning capabilities along with search behavior observed even before RL. As a result, it is yet unclear how to boost the reasoning capabilities of other non-reasoner models including Llama 3. Astro teaches such models to internalize structured search behavior through a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. By converting search traces into natural language chain-of-thoughts that capture both successes and recoveries from failure, Astro bootstraps models with a rich prior for exploration during RL. We finetune our models on these search-derived traces and further improve performance via RL with verifiable rewards. We apply Astro to the Llama 3 family of models and achieve absolute performance gains of 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024, especially improving upon challenging problems that require iterative correction. Our results demonstrate that search-inspired training offers a principled way to instill robust reasoning capabilities into open LLMs.\n", + "\n", + "Date: June 23, 2025\n", + "Correspondence: Joongwon Kim at jwonkim@meta.com\n", + "\n", + "| **stepwise solutions** ## Step 1: Define the problem and identify what we need to find. We need to find the time it takes for Aya to complete her walk and stop at the coffee shop when walking at a speed of $s + \\frac{1}{2}$ kilometers per hour, including the time $t$ spent in the coffee shop. ## Step 2: Set up the equations based on the information given. 
Let's denote the total time for the walk and coffee shop at speed $s$ as 4 hours or 240 minutes, and at speed $s+2$ as 2 hours and 24 minutes, or 144 minutes ... The final answer is \\boxed{398}. Llama-3.1-70B-Instruct | **long CoT solutions** Procedure Cloning SFT RL ASTRO Let's begin by finding the time that it takes for Aya to complete her walk and stop at the coffee shop ... But wait, are we solving the problem correctly so far? Hmm... Our solution may not be correct so far. Let's go back to where we set up the equations ... Therefore Aya spent a total of 204 minutes. But wait, are we solving the problem correctly so far? Hmm... Our solution seems to be correct so far. The final answer is \\boxed{204}. Llama-3.1-70B-ASTRO-RL | \tMATH-500\tAMC 2023\tAIME 2024Llama-3.1-70B-Instruct Llama-3.1-70B-ASTRO-SFT\t+16.0%\t\t Llama-3.1-70B-ASTRO-RL\t\t+26.9%\t+20.0% |\n", + "| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------- |\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Figure 1 Astro teaches Llama-3.1-70B-Instruct to perform self-reflection and backtracking in-context and improves its mathematical reasoning, achieving 81.8% on MATH-500, 64.4% on AMC 2023 and 30.0% on AIME 2024 (pass@1).\n", + "\n", + "\n", + "--------------------------------------------------\n", + "Chunk 1:\n", + "## 1 Introduction\n", + "\n", + "Training large language models (LLMs) via reinforcement learning (RL) has greatly improved their reasoning capabilities, leading to the advent of reasoning models such as OpenAI o1 (OpenAI, 2024), DeepSeek-R1 (DeepSeek-AI, 2025) or Gemini 2.5 (Google, 2025). A prominent feature of reasoning models is their ability to iteratively refine their outputs with a behavior similar to search – a process which involves reflecting on their own outputs and backtracking to a previous state (Xiang et al., 2025). While open-source replications of reasoning models achieve notable performance improvements, they rely on distillation from existing reasoning\n", + "\n", + "---\n", + "\n", + "\n", + "\n", + "Diagram showing three stages: Monte Carlo Tree Search, Procedure Cloning, and Reinforcement Learning\n", + "\n", + "Figure 2 An overview of Astro. 
Given a math reasoning problem, we first perform Monte Carlo Tree Search (MCTS) in a stepwise manner with verifiable rewards and obtain a search tree where each node contains a discrete reasoning step with its associated Q-value. We then linearize the visited sequence of nodes, including intermediate nodes with incorrect answers, into a solution that integrates backtracking and self-reflection in natural language. Then, we perform supervised fine-tuning (SFT) on the search-integrated solutions and bootstrap our policy to perform autoregressive search. Finally, we further improve the policy's search and reasoning capabilities with reinforcement learning (RL).\n", + "\n", + "models (Li et al., 2025; Muennighoff et al., 2025) or direct RL (Hu et al., 2025; Yu et al., 2025) from LLMs that (1) already contain reflective behavior and strong reasoning capabilities (Chang et al., 2025; Liu et al., 2025), and (2) exhibit spurious performance gains from incorrect or noisy reward signals during RL (Lv et al., 2025; Shao et al., 2025). Hence it is unclear from a scientific perspective how reasoning models can be built from other LLMs that do not exhibit the aforementioned behavior, such as Llama 3 (AI at Meta, 2024).\n", + "\n", + "We introduce ASTRO, the \"Autoregressive Search-Taught Reasoner\", a framework that systematically infuses search-like behavior into language models ab initio to improve their reasoning capabilities. The fundamental principle guiding Astro is search, where our policy explores the solution space by selecting actions, reflecting on its own solution, and backtracking to a previous step if needed. Astro trains language models to perform autoregressive search – instead of using external search scaffolds such as beam search to solve reasoning problems, Astro internalizes the search procedure and generates entire search trajectories, including reflections and backtracks, in a single inference pass. Models trained using Astro exhibit improved reasoning abilities by frequently re-evaluating their solutions and backtracking until they reach a final answer of high confidence. Moreover, such models generate structured reasoning traces that can be mapped to a directed graph with each vertex representing a discrete reasoning step, allowing for a richer understanding of their reasoning processes.\n", + "\n", + "Astro operates in three stages: (1) search trajectory generation, (2) supervised fine-tuning and (3) reinforcement learning. We initially bootstrap our models with search behavior by generating search trajectories to be used for training data via procedure cloning (Yang et al., 2022; Laskin et al., 2022) – we perform search with custom scaffolding over our language model policy to explore over different solution trajectories for each math problem, and we train our policy without using scaffolds at test time to predict the entire sequence of actions, including intermediate actions that lead to incorrect answers, that ultimately end with a successful terminal state. Then, we further optimize our policy via RL to improve their reasoning and search capabilities. Astro provides beneficial priors for RL during its data generation stage by systematically injecting self-reflection and backtracking patterns to the search traces via procedure cloning.\n", + "\n", + "First, we generate synthetic data, also called the cold-start data (DeepSeek-AI, 2025; Qwen, 2025), to instill autoregressive search priors to our models. 
To this end, we use Monte Carlo Tree Search (MCTS) to explore the solution space of challenging math problems and build search trees with diverse reasoning traces. We linearize each search tree into a sequence of nodes visited during the search process, which we then convert into a natural language chain-of-thought (CoT, Wei et al. (2022)) that integrates self-reflection and backtracking. Then, we sample about 36K high-quality CoT solutions across three open-source math datasets.\n", + "\n", + "2\n", + "\n", + "---\n", + "\n", + "\n", + "\n", + "| Monte Carlo Tree Search (MCTS) β‡’ Linearization β‡’ Procedure Cloning in Language | | |\n", + "| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", + "| Problem: A box $2$ centimeters high, $3$ centimeters wide, and $5$ centimeters long can hold $40$ grams of clay. A second box with twice the height, three times the width, and the same length as the first box can hold $n$ grams of clay. What is $n$? | Node 3 | Let's determine the dimensions of the second box by calculating its height, which is twice the height of the first box, so its height is $2 \\times 2 = 4$ centimeters, and its width, which is three times the width of the first box, so its width is $3 \\times 3 = 9$ centimeters, while noting that the length of the second box is the same as the first box, which ... (solution omitted) ... |\n", + "| ## Step 1: Determine the dimensions of the second box (Node 3, Q=0.962) The second box has twice the height of the first box, so its height is $2 \\times 2 = 4$ centimeters. It has three times the width of the first box, so its width is $3 \\times 3 = 9$ centimeters. The length of the second box is the same as the first box, which is $5$ centimeters. | Node 16 | Let's use this information to calculate n, the amount of clay that the second box can hold, using the formula n = 40 \\* (volume of second box / volume of first box) = 40 \\* (4 \\* 9 \\* 5) / (2 \\* 3 \\* 5) = 40 \\* (12) = 40 \\* 6 = 480 grams. But wait, are we correctly solving the problem so far? Hmm... I'm not sure if our solution is correct so far. Let's go back to where we calculated the volume of the second box to be 180 cubic centimeters. Given the established dimensions of the boxes, let's calculate the volume of the first box, where its volume is ... 
|\n", + "| ## Step 2: Calculate the volume of the second box (Node 16, Q=0.934) The second box has twice the height (2 \\* 2 = 4 cm), three times the width (3 \\* 3 = 9 cm), and the same length (5 cm) as the first box. So, the volume of the second box = height \\* width \\* length = 4 \\* 9 \\* 5 = 180 cubic centimeters. | Node 43 | (solution omitted) ... the second box is 6 times larger than the first box, it follows that the second box can hold 6 times as much clay as the first box. Therefore, to find n, the amount of clay the second box can hold, we calculate n = 6 \\* 40 = 240 grams. But wait, are we correctly solving the problem so far? Hmm... Our solution seems to be correct so far. |\n", + "| ## Step 3: Calculate the volume of the first box (Node 43, Q=0.768) The volume of the first box = height \\* width \\* length = 2 \\* 3 \\* 5 = 30 cubic centimeters. ## Step 4: Calculate n, the amount of clay that the second box can hold (Node 68, Q=0.0) The ratio of the volumes of the two boxes will equal the ratio of the amounts of clay that they can hold. ... (solution omitted) ... Therefore, n = 40 \\* (volume of second box / volume of first box) = 40 \\* (4 \\* 9 \\* 5) / (2 \\* 3 \\* 5) = 40 \\* (12) = 40 \\* 6 = 480 grams. | Node 68 | |\n", + "| ## Step 3: Calculate the volume of the first box (Node 44, Q=0.874) The volume of the first box = height \\* width \\* length = 2 \\* 3 \\* 5 = 30 cubic centimeters. ## Step 4: Calculate the ratio of the volume of the second box to the first box (Node 85, Q=1.0) ... (solution omitted) ... The ratio = volume of the second box / volume of the first box = 180 / 30 = 6. | Node 44 | |\n", + "| | Node 85 | |\n", + "| ## Step 5: Calculate the amount of clay the second box holds (Node 96, Q=1.0) Since the second box is 6 times larger than the first box, it can hold 6 times as much clay as the first box. Therefore, n = 6 \\* 40 = 240 grams. | Node 96 | The final answer is: $\\boxed{240}$ |\n", + "\n", + "\n", + "Figure 3 Example of search trajectory generation via procedure cloning. We use the policy to search through the solution space via MCTS and keep track of terminal nodes with incorrect answers, as well as terminal nodes with correct answers. Then, we linearize the search tree such that it incorporates backtracking from the incorrect terminal node (Node 68) to its greatest common ancestor (Node 16) with the correct terminal node (Node 96). Finally, we rewrite the node sequence into a long chain-of-thought, injecting self-reflection and backtracking phrases into the CoTs.\n", + "\n", + "We then perform supervised fine-tuning (SFT) to infuse autoregressive search behavior into the Llama 3 family of models (AI at Meta, 2024). After fine-tuning for just one epoch, our SFT checkpoint based on llama-3.1-70b-instruct achieves 69.6% on MATH-500, 55.0% on AMC 2023 and 13.3% on AIME 2024, and outperforms its counterpart trained on the same set of problems but without search priors. Our qualitative analyses show that even simply performing SFT with high-quality search traces can infuse search capabilities, including backtracking and self-reflection behavior, into a language model.\n", + "\n", + "Finally, we perform reinforcement learning (RL) on our models to further improve their reasoning capabilities. Our training prompts are derived from open-source math problems of moderate to high difficulties for our policies. We use a modified form of Group Relative Policy Optimization (GRPO, Shao et al. (2024)) that is very similar to that of Dr. 
GRPO (Liu et al., 2025) to update our policies. After RL, our policy based on llama-3.1-70b-instruct achieves 81.8% in MATH-500, 64.4% in AMC 2023 and 30.0% in AIME 2024 (pass@1). We show that our model trained end-to-end using Astro outperforms its counterpart similarly optimized with RL but initialized from a SFT checkpoint trained without search priors – this demonstrates the importance of leveraging self-reflection and backtracking as priors for improving reasoning via RL. Our work provides a clear recipe for improving the reasoning capabilities of language models by instilling autoregressive search priors with SFT and leveraging such priors to further improve the models via RL.\n", + "\n", + "\n", + "--------------------------------------------------\n", + "Chunk 2:\n", + "## 2 Search Trajectory Generation\n", + "\n", + "Astro begins by generating a dataset of search traces, expressed as long chain-of-thoughts (Wei et al., 2022) that encode self-reflection and backtracking in natural language, via procedure cloning. To this end, we first obtain search trees that explore a wide solution space for each math problem using Monte Carlo Tree Search (MCTS) in a stepwise manner, strategically balancing exploration and exploitation with verifier-based rewards to obtain diverse and high-quality solutions exploring different reasoning traces (Section 2.2).\n", + "\n", + "We then linearize the search trees into sequences of nodes that explore various states, including intermediate nodes with incorrect answers, until arriving at a high-quality solution leading to the correct answer (Section 2.3). Finally, we translate each node sequence into a chain-of-thought that integrates self-reflection and backtracking in natural language, and we add each long chain-of-thought to our final dataset (Section 2.4). The resulting dataset encodes beneficial self-reflection and backtracking priors for training language models to perform autoregressive search for solving challenging math problems via supervised fine-tuning and reinforcement learning (Section 3). Refer to Figure 3 for a visual example of our search trajectory generation pipeline.\n", + "\n", + "---\n", + "\n", + "\n", + "\n", + "\n", + "--------------------------------------------------\n", + "Chunk 3:\n", + "## 2.1 Problem Formulation and Overview\n", + "\n", + "**Problem formulation.** Our data generation setup is a Markov Decision Process (MDP) (Puterman, 1994), where the language model functions as the policy Ξ LM and explores the solution space to the input x, while obtaining rewards in terminal states from a verifier V based on the correct answer. Here we assume that Ξ LM solves math problems in a stepwise manner, where each step st represents a sequence of tokens y1 Β· Β· Β· y|st| encapsulating a minimal unit of reasoning required to solve x. Then, each state St represents a combination of the input prompt and the sequence of steps generated by the policy, i.e. St = (x, s0, Β· Β· Β· , st). Meanwhile, the action at+1 represents the next step st+1 taken by Ξ LM to address x. 
Refer to Figure 3 for examples of the steps defined in our setup.\n", + "\n", + "Given this setup, we teach a language model to predict a sequence of states (S0 Β· Β· Β· Send) in response to x such that the states explore reasoning steps leading to correct and incorrect answers, until the LM arrives at Send and terminates its search by accepting the correct answer as its final answer.\n", + "\n", + "**Overview.** We generate training data for Astro in three main stages outlined below:\n", + "\n", + "1. For each x we generate a search tree T, where each node ni represents the state Si and each edge (ni, nj) represents the action aj, i.e. the next step sj taken from Si to Sj, using Monte Carlo Tree Search (MCTS) to explore the solution space based on verifier-based rewards from rollouts (Section 2.2).\n", + "\n", + "2. We linearize T into a sequence of nodes L = (n0, Β· Β· Β· , nend), a subsequence of the entire history of nodes visited by Ξ LM until arriving at nend, the terminal node with the correct answer. Some adjacent pairs of nodes (nt, nt+1) in L are such that nt+1 is an ancestor of nt in T, which corresponds to self-reflection and backtracking during the search procedure (Section 2.3).\n", + "\n", + "3. We translate L into a chain-of-thought solution y = (y0, Β· Β· Β· , yend) that integrates self-reflection and backtracking in natural language, and we add (x, y) to our final dataset (Section 2.4).\n", + "\n", + "\n", + "--------------------------------------------------\n", + "Chunk 4:\n", + "## 2.2 Monte Carlo Tree Search\n", + "\n", + "We use our language model policy Ξ LM to obtain a search tree with diverse solution traces to each input x by running Monte Carlo Tree Search (MCTS). By using MCTS, we explore a diverse solution space while balancing exploration and exploitation with reliable guidance from reward signals obtained from full rollouts. Here, we prompt x to elicit stepwise solutions from Ξ LM, and assign reward scores with our verifier V to compare the predicted answer with the correct answer.\n", + "\n", + "Monte Carlo Tree Search employs three main stages – selection, expansion and backpropagation – to select promising next steps, expand the search tree, and update the quality metric of each reasoning step.\n", + "\n", + "**Selection.** At state St with k actions generated by Ξ LM from St, we balance exploration and exploitation to select the most promising node from which to further perform tree search. We use the Predictor+Upper Confidence bounds applied to Trees (PUCT, Silver et al. (2016)) for selection to balance exploration and exploitation during tree search. From any state St, given the action index i ∈ [1...k], the quality score of taking action ai from state St – Q(St, ai), the total visit count of St – N(St), and the visit count of taking action ai from St – N(St, ai), we perform selection as:\n", + "\n", + "$$S^*_{t+1} = \\underset{(S_{t+1}=S_t \\rightarrow a_i)}{\\text{argmax}} \\left[Q(S_t, a_i) + c_{\\text{puct}} \\cdot \\Pi_{\\text{LM}}(a_i|S_t)\\sqrt{\\frac{N(S_t)}{1 + N(S_t, a_i)}}\\right]$$\n", + "\n", + "**Expansion.** From state St, Ξ LM takes x and the sequence of steps (s0, Β· Β· Β· , st) as the input, and first samples k actions which each correspond to the next step for solving x. For each action, we sample M rollouts and score the full solution using V to match the predicted answer with the reference answer. Then, we average the scores across the rollouts for each new action ai (i ∈ [1...k]) to compute the reward scores for the new states. 
We add a new node nt+1, associated with each new state St+1, to T.\n", + "\n", + "$$R(S_{t+1}) = \\frac{1}{M} \\sum_{j\\in[1...M]} V(\\Pi_{\\text{LM},j}(S_{t+1}))$$\n", + "\n", + "---\n", + "\n", + "\n", + "\n", + "Backpropagation. We backpropagate the reward scores obtained during expansion from the leaf node to the root node to recursively update their Q-values. The updates consist of (1) incrementing the visit count of each state (Eq. 3), and (2) updating the Q-values of each (state, action) pair using the Q-values and visit counts of the children nodes of St+1 = (St, a), along with the rollout-based reward score R(St+1) (Eq. 4).\n", + "\n", + "$$N(s_t) = N(s_t) + 1$$\n", + "\n", + "$$Q(S_t, a) = \\frac{\\sum_{i=1}^K Q(S_{t+1}, a_i) \\cdot N(S_{t+1}, a_i) + R(S_{t+1})}{\\sum_{i=1}^K N(S_{t+1}, a_i) + 1}$$\n", + "\n", + "We repeat the procedure above for multiple iterations to explore the solution space for each math problem and build the search trees. We use llama-3.3-70b-instruct as our policy Ξ LM and generate k = 8 actions during each expansion step with M = 16 rollouts, cpuct = 1.0, 32 iterations and maximum tree depth of 50.\n", + "\n", + "\n", + "--------------------------------------------------\n" + ] + } + ], + "source": [ + "# --- Constants & Configuration ---\n", + "ENCODING_MODEL = \"o200k_base\"\n", + "CHUNK_SIZE_TOKENS = 16000 # A practical chunk size\n", + "\n", + "def count_tokens(text: str, encoding: tiktoken.Encoding) -> int:\n", + " \"\"\"Helper function to count tokens in a string.\"\"\"\n", + " return len(encoding.encode(text))\n", + "\n", + "def chunk_document(\n", + " markdown_text: str,\n", + " chunk_size: int = CHUNK_SIZE_TOKENS,\n", + " headers_to_split_on: List[str] = [\"#\", \"##\", \"###\"]\n", + ") -> List[str]:\n", + " \"\"\"\n", + " Chunks a markdown document, preserving header context for each chunk.\n", + " \"\"\"\n", + " # 1. Split the document by headers to get sections\n", + " header_pattern = \"|\".join(f\"^{h}\\\\s\" for h in headers_to_split_on)\n", + " sections = re.split(f\"({header_pattern})\", markdown_text, flags=re.MULTILINE)\n", + " if sections and not sections[0].strip():\n", + " sections.pop(0)\n", + "\n", + " if len(sections) > 1:\n", + " sections = list(zip(sections[0::2], sections[1::2]))\n", + " else:\n", + " sections = []\n", + "\n", + " encoding = tiktoken.get_encoding(ENCODING_MODEL)\n", + " final_chunks = []\n", + "\n", + " # 2. 
Process each section\n", + " for header, content in sections:\n", + " header_token_count = count_tokens(header, encoding)\n", + " \n", + " if header_token_count + count_tokens(content, encoding) <= chunk_size:\n", + " final_chunks.append(header + content)\n", + " continue\n", + "\n", + " # Split the content by paragraphs\n", + " paragraphs = content.split('\\n\\n')\n", + " current_chunk_paragraphs = []\n", + " current_chunk_tokens = header_token_count\n", + "\n", + " for para in paragraphs:\n", + " para_tokens = count_tokens(para, encoding)\n", + "\n", + " # If a paragraph is too large to fit with the header, it must be truncated.\n", + " if header_token_count + para_tokens > chunk_size:\n", + " available_tokens = chunk_size - header_token_count\n", + " para_token_ids = encoding.encode(para)\n", + " truncated_ids = para_token_ids[:available_tokens]\n", + " para = encoding.decode(truncated_ids, errors='ignore')\n", + " para_tokens = len(truncated_ids)\n", + " print(f\"Warning: Truncating a paragraph to {para_tokens} \"\n", + " f\"tokens to fit the chunk size.\")\n", + "\n", + " # If the current chunk is not empty and the new paragraph doesn't fit,\n", + " # finalize the current chunk before starting a new one.\n", + " if (current_chunk_paragraphs and \n", + " (current_chunk_tokens + para_tokens > chunk_size)):\n", + " final_chunks.append(header + \"\\n\\n\".join(current_chunk_paragraphs))\n", + " current_chunk_paragraphs = []\n", + " current_chunk_tokens = header_token_count\n", + "\n", + " current_chunk_paragraphs.append(para)\n", + " current_chunk_tokens += para_tokens\n", + "\n", + " # Add the last remaining chunk\n", + " if current_chunk_paragraphs:\n", + " final_chunks.append(header + \"\\n\\n\".join(current_chunk_paragraphs))\n", + " \n", + " return final_chunks\n", + "\n", + "# Now, let's chunk our document\n", + "chunks = chunk_document(document_text)\n", + "\n", + "# --- Print Statistics and a Sample Chunk ---\n", + "if chunks:\n", + " print(f\"Total chunks created: {len(chunks)}\")\n", + " encoding = tiktoken.get_encoding(ENCODING_MODEL)\n", + " token_counts = [count_tokens(chunk, encoding) for chunk in chunks]\n", + " avg_tokens = sum(token_counts) / len(token_counts)\n", + " print(f\"Average token count per chunk: {avg_tokens:.2f}\")\n", + " print(f\"Max token count in a chunk: {max(token_counts)}\")\n", + " print(f\"Min token count in a chunk: {min(token_counts)}\")\n", + " print(\"-\" * 50)\n", + " print(\"Top 5 Chunks:\")\n", + " for i, chunk in enumerate(chunks[:5]):\n", + " print(f\"Chunk {i}:\")\n", + " print(chunk)\n", + " print(\"-\" * 50)" + ] + }, + { + "cell_type": "markdown", + "id": "843c4267-b959-4e37-9d6f-bc30ae628574", + "metadata": {}, + "source": [ + "## Step 3: The \"map\" stage - summarizing each chunk\n", + "\n", + "With the document split into manageable, semantically coherent chunks, we can begin the \"Map\" stage. This means we apply the same operationβ€”in this case, summarizationβ€”to each chunk independently.\n", + "\n", + "### Prompt engineering\n", + "\n", + "The quality of the summaries depends heavily on the quality of the prompts. For this stage, the prompt must instruct the model to create a summary of a small piece of a larger document. It is crucial to tell the model to focus *only* on the provided text and not to add outside information." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "678a1afc-3a17-419d-a424-549a0788b8a8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Summary of chunk 0:\n", + "- ASTRO is a framework for training language models to reason like search algorithms.\n", + "- ASTRO leverages self-reflection, backtracking, and exploration in language model outputs.\n", + "- ASTRO uses a synthetic dataset derived from Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories.\n", + "- The framework finetunes models on search-derived traces and improves performance via reinforcement learning (RL) with verifiable rewards.\n", + "- ASTRO is applied to the Llama 3 family of models.\n", + "- Absolute performance gains achieved: 16.0% on MATH-500, 26.9% on AMC 2023, and 20.0% on AIME 2024.\n", + "- Llama-3.1-70B-ASTRO-RL achieves 81.8% on MATH-500, 64.4% on AMC 2023, and 30.0% on AIME 2024 (pass@1).\n", + "--------------------------------------------------\n", + "Summary of chunk 1:\n", + "- ASTRO is a framework that infuses search-like behavior into language models to improve their reasoning capabilities.\n", + "- ASTRO operates in three stages: search trajectory generation, supervised fine-tuning, and reinforcement learning.\n", + "- Search trajectory generation uses Monte Carlo Tree Search (MCTS) to explore the solution space of math problems and builds search trees with diverse reasoning traces.\n", + "- About 36K high-quality chain-of-thought (CoT) solutions are sampled across three open-source math datasets.\n", + "- Supervised fine-tuning (SFT) is performed on the search-integrated solutions to infuse autoregressive search behavior into the models.\n", + "- The SFT checkpoint based on llama-3.1-70b-instruct achieves 69.6% on MATH-500, 55.0% on AMC 2023, and 13.3% on AIME 2024 after fine-tuning for one epoch.\n", + "- Reinforcement learning (RL) is performed using a modified form of Group Relative Policy Optimization (GRPO) to further improve the models' reasoning capabilities.\n", + "- After RL, the policy based on llama-3.1-70b-instruct achieves 81.8% in MATH-500, 64.4% in AMC 2023, and 30.0% in AIME 2024 (pass@1).\n", + "--------------------------------------------------\n", + "Summary of chunk 2:\n", + "* Astro generates a dataset of search traces via procedure cloning.\n", + "* Search trees are obtained using Monte Carlo Tree Search (MCTS) with verifier-based rewards.\n", + "* Search trees are linearized into sequences of nodes exploring various states.\n", + "* Node sequences are translated into chains-of-thought integrating self-reflection and backtracking in natural language.\n", + "* The resulting dataset encodes self-reflection and backtracking priors for training language models.\n", + "* The dataset is used for supervised fine-tuning and reinforcement learning to solve math problems.\n", + "--------------------------------------------------\n", + "Summary of chunk 3:\n", + "* The data generation setup is a Markov Decision Process (MDP).\n", + "* The language model functions as the policy Ξ LM and explores the solution space to the input x.\n", + "* Each state St represents a combination of the input prompt and the sequence of steps generated by the policy.\n", + "* The goal is to teach a language model to predict a sequence of states (S0 Β· Β· Β· Send) in response to x.\n", + "* Training data for Astro is generated in three main stages: \n", + " 1. 
Generating a search tree T using Monte Carlo Tree Search (MCTS).\n", + " 2. Linearizing T into a sequence of nodes L.\n", + " 3. Translating L into a chain-of-thought solution y that integrates self-reflection and backtracking in natural language.\n", + "--------------------------------------------------\n", + "Summary of chunk 4:\n", + "- Monte Carlo Tree Search (MCTS) is used with language model policy Ξ LM to obtain a search tree with diverse solution traces.\n", + "- MCTS involves three stages: selection, expansion, and backpropagation.\n", + "- Selection uses Predictor+Upper Confidence bounds applied to Trees (PUCT) to balance exploration and exploitation.\n", + "- The selection formula is: $$S^*_{t+1} = \\underset{(S_{t+1}=S_t \\rightarrow a_i)}{\\text{argmax}} \\left[Q(S_t, a_i) + c_{\\text{puct}} \\cdot \\Pi_{\\text{LM}}(a_i|S_t)\\sqrt{\\frac{N(S_t)}{1 + N(S_t, a_i)}}\\right]$$\n", + "- Expansion involves sampling k actions, scoring full solutions using verifier V, and averaging scores across M rollouts.\n", + "- The reward score formula is: $$R(S_{t+1}) = \\frac{1}{M} \\sum_{j\\in[1...M]} V(\\Pi_{\\text{LM},j}(S_{t+1}))$$\n", + "- Backpropagation updates Q-values and visit counts using equations: \n", + " $$N(s_t) = N(s_t) + 1$$\n", + " $$Q(S_t, a) = \\frac{\\sum_{i=1}^K Q(S_{t+1}, a_i) \\cdot N(S_{t+1}, a_i) + R(S_{t+1})}{\\sum_{i=1}^K N(S_{t+1}, a_i) + 1}$$\n", + "- The policy Ξ LM used is llama-3.3-70b-instruct.\n", + "- Parameters used are: k = 8, M = 16, cpuct = 1.0, 32 iterations, and maximum tree depth of 50.\n", + "--------------------------------------------------\n" + ] + } + ], + "source": [ + "LLM_MODEL = \"Llama-4-Maverick-17B-128E-Instruct-FP8\"\n", + "DOC_TITLE = (\"ASTRO: Teaching Language Models to Reason by Reflecting and \"\n", + " \"Backtracking In-Context\")\n", + "\n", + "MAP_PROMPT = \"\"\"\n", + "Your role is to create a concise, factual summary of a text chunk from the \n", + "research paper titled \"{document_title}\".\n", + "- Extract only key facts, figures, and statements from the chunk text itself.\n", + "- Omit any conversational introductions or conclusions. 
Do not explain what you \n", + " are doing.\n", + "- If a chunk contains no substantive information (e.g., only headers, formatting, \n", + " or boilerplate), output the exact phrase: \"No substantive information.\"\n", + "\n", + "**Text Chunk:**\n", + "{chunk_text}\n", + "\"\"\"\n", + "\n", + "def map_summarize_chunk(chunk_text: str, document_title: str) -> str:\n", + " \"\"\"\n", + " Summarizes a single chunk of text using the 'map' prompt.\n", + " \"\"\"\n", + " try:\n", + " resp = client.chat.completions.create(\n", + " model=LLM_MODEL,\n", + " messages=[\n", + " {\"role\": \"user\", \"content\": MAP_PROMPT.format(\n", + " document_title=document_title, chunk_text=chunk_text)},\n", + " ],\n", + " temperature=0.1, # Low temperature for deterministic summaries\n", + " )\n", + " return resp.completion_message.content.text\n", + " except Exception as e:\n", + " print(f\" Error summarizing chunk: {e}\")\n", + " return \"\" # Return empty string on failure\n", + "\n", + "# Let's test the map function on the first few chunks\n", + "if chunks:\n", + " for i, chunk in enumerate(chunks[:5]):\n", + " summary = map_summarize_chunk(chunk, DOC_TITLE)\n", + " print(f\"Summary of chunk {i}:\")\n", + " print(summary)\n", + " print(\"-\" * 50)" + ] + }, + { + "cell_type": "markdown", + "id": "f8c221de-0c8f-435f-a872-d65acb622b2f", + "metadata": {}, + "source": [ + "## Step 4: The \"reduce\" stage: creating the final summary\n", + "\n", + "With the \"map\" stage complete, we now have a list of individual summaries for each chunk. The \"reduce\" stage combines these into a single, coherent executive summary.\n", + "\n", + "### Prompt engineering for synthesis\n", + "\n", + "The prompt for this stage is different. We are no longer just summarizing; we are *synthesizing*. The prompt instructs the model to weave the individual points from the chunk summaries into a flowing, well-written narrative." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "65ef4d59-6ec6-4de4-8214-0a5632bdacf7", + "metadata": {}, + "outputs": [], + "source": [ + "REDUCE_PROMPT = \"\"\"\n", + "You are a research assistant tasked with creating an executive summary.\n", + "You have been given a series of concise summaries from different sections of a \n", + "research paper.\n", + "Your goal is to synthesize these individual summaries into a single, well-written, \n", + "and coherent executive summary.\n", + "The final summary should read like a standalone document, flowing logically from \n", + "one topic to the next.\n", + "\n", + "**Summaries of Report Sections:**\n", + "{chunk_summaries}\n", + "\"\"\"\n", + "\n", + "MAX_CONTEXT_WINDOW = 100000\n", + "\n", + "def reduce_create_final_summary(chunk_summaries: List[str]) -> str:\n", + " \"\"\"\n", + " Combines chunk summaries into a final executive summary using the 'reduce' prompt.\n", + " \"\"\"\n", + " summaries_text = \"\\\\n\\\\n---\\\\n\\\\n\".join(chunk_summaries)\n", + " \n", + " encoding = tiktoken.get_encoding(ENCODING_MODEL)\n", + " if count_tokens(summaries_text, encoding) > MAX_CONTEXT_WINDOW:\n", + " # For this tutorial, we'll truncate to fit. 
A more advanced implementation\n", + " # might run another map-reduce pass (recursive reduction).\n", + " print(\"Warning: Combined summaries are too large; will be truncated for \"\n", + " \"final summary.\")\n", + " tokens = encoding.encode(summaries_text)\n", + " truncated_tokens = tokens[:MAX_CONTEXT_WINDOW]\n", + " summaries_text = encoding.decode(truncated_tokens, errors='ignore')\n", + "\n", + " try:\n", + " resp = client.chat.completions.create(\n", + " model=LLM_MODEL,\n", + " messages=[\n", + " {\"role\": \"user\", \"content\": REDUCE_PROMPT.format(\n", + " chunk_summaries=summaries_text)},\n", + " ],\n", + " temperature=0.3, # Slightly higher for more fluid, natural writing\n", + " )\n", + " return resp.completion_message.content.text\n", + " except Exception as e:\n", + " print(f\" Error creating final summary: {e}\")\n", + " return \"\"" + ] + }, + { + "cell_type": "markdown", + "id": "9654471a-d764-4b25-9d03-23a2242cbd4f", + "metadata": {}, + "source": [ + "## Step 5: Bringing it all together\n", + "\n", + "The following code runs the full pipeline:\n", + "1. **Map:** Iterate through a subset of our chunks and generate a summary for each one.\n", + "2. **Reduce:** Take all the generated chunk summaries and synthesize them into our final executive summary.\n", + "\n", + "To keep this tutorial fast and interactive, we'll only process the first 25 chunks. In a production scenario, you would process all chunks." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "dc7c344a-fa6d-4422-a6c8-72fb7c58dd61", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--- MAP: Summarizing 25 individual chunks ---\n", + "\\nSuccessfully summarized 25 chunks.\n", + "\\nOriginal token count: 19,127\n", + "Summarized token count: 4,163\n", + "Compression rate: 78.23%\n", + "\\n--- REDUCE: Creating final summary ---\n", + "\\n==================================================\n", + " FINAL EXECUTIVE SUMMARY\n", + "==================================================\n", + "Here is a synthesized executive summary based on the provided summaries:\n", + "\n", + "**Executive Summary**\n", + "\n", + "This report introduces ASTRO, a novel framework designed to enhance the reasoning capabilities of language models by infusing search-like behavior into their outputs. ASTRO operates in three stages: data generation using Monte Carlo Tree Search (MCTS), supervised fine-tuning (SFT), and reinforcement learning (RL). The framework leverages self-reflection, backtracking, and exploration in language model outputs to improve their performance on mathematical problem-solving tasks.\n", + "\n", + "The data generation stage utilizes MCTS to build search trees, which are then linearized into node sequences and translated into long Chain-of-Thoughts (CoTs) that integrate self-reflection and backtracking in natural language. The resulting dataset is used for SFT and RL to fine-tune the Llama 3 family of models.\n", + "\n", + "The ASTRO-trained models demonstrate significant performance gains on various mathematical benchmarks, including MATH-500, AMC 2023, and AIME 2024. Specifically, Llama-3.1-70B-ASTRO-RL achieves 81.8% on MATH-500, 64.4% on AMC 2023, and 30.0% on AIME 2024 (pass@1). 
The models also exhibit improved self-reflection and backtracking capabilities, generating longer CoTs and achieving better training efficacy and upper bound during RL.\n", + "\n", + "The report highlights the importance of search priors in improving the model's reasoning capabilities and demonstrates that ASTRO-trained models outperform those trained without explicit search priors across all benchmarks. The results suggest that ASTRO is a promising framework for enhancing the mathematical reasoning abilities of language models.\n", + "\n", + "Overall, this research contributes to the development of more advanced language models that can reason like search algorithms, with potential applications in various domains that require complex problem-solving capabilities.\n" + ] + } + ], + "source": [ + "# For this demonstration, we'll process a subset of chunks.\n", + "# In a real application, you would process all of them.\n", + "CHUNKS_TO_PROCESS = 25\n", + "chunks_to_summarize = chunks[:CHUNKS_TO_PROCESS]\n", + "\n", + "print(f\"--- MAP: Summarizing {len(chunks_to_summarize)} individual chunks ---\")\n", + "chunk_summaries = [map_summarize_chunk(chunk, DOC_TITLE) \n", + " for chunk in chunks_to_summarize]\n", + "chunk_summaries = [summary for summary in chunk_summaries \n", + " if summary.strip()] # Filter out errors\n", + "print(f\"\\\\nSuccessfully summarized {len(chunk_summaries)} chunks.\")\n", + "\n", + "# --- Calculate compression rate ---\n", + "encoding = tiktoken.get_encoding(ENCODING_MODEL)\n", + "original_tokens = sum(count_tokens(chunk, encoding) \n", + " for chunk in chunks_to_summarize)\n", + "summarized_tokens = sum(count_tokens(summary, encoding) \n", + " for summary in chunk_summaries)\n", + "if original_tokens > 0:\n", + " compression_rate = (1 - (summarized_tokens / original_tokens)) * 100\n", + " print(f\"\\\\nOriginal token count: {original_tokens:,}\")\n", + " print(f\"Summarized token count: {summarized_tokens:,}\")\n", + " print(f\"Compression rate: {compression_rate:.2f}%\")\n", + "\n", + "print(\"\\\\n--- REDUCE: Creating final summary ---\")\n", + "final_summary = reduce_create_final_summary(chunk_summaries)\n", + "\n", + "# --- Display Final Result ---\n", + "print(\"\\\\n\" + \"=\" * 50)\n", + "print(\" FINAL EXECUTIVE SUMMARY\")\n", + "print(\"=\" * 50)\n", + "print(final_summary)" + ] + }, + { + "cell_type": "markdown", + "id": "ed7d0941-d51a-409c-922e-95f7c6f1bca8", + "metadata": {}, + "source": [ + "## Future enhancement: Handling extremely long documents with recursive reduction\n", + "\n", + "If you are summarizing an entire book, the combined text of your *chunk summaries* might still be too long for the model's context window. The solution is **recursive reduction**.\n", + "\n", + "You run the same map-reduce process again on the chunk summaries themselves:\n", + "1. Generate 500 chunk summaries from the original document.\n", + "2. Group these 500 summaries into batches of 50.\n", + "3. Run your `reduce_create_final_summary` function on each batch, producing 10 \"super summaries\".\n", + "4. Finally, run the reduce function one last time on the 10 \"super summaries\" to get your final executive summary.\n", + "\n", + "This approach enables you to scale this summarization technique to documents of virtually any length." 
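+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "recursive-reduction-sketch",
+   "metadata": {},
+   "source": [
+    "One possible sketch of this recursive reduction, reusing the `reduce_create_final_summary` function from Step 4, is shown below. Treat it as an illustrative outline rather than a definitive implementation: the batch size of 50 is an arbitrary value you would tune to your model's context window and the typical length of your chunk summaries.\n",
+    "\n",
+    "```python\n",
+    "BATCH_SIZE = 50  # illustrative value; tune to your context window\n",
+    "\n",
+    "def recursive_reduce(summaries: List[str]) -> str:\n",
+    "    # Base case: few enough summaries to combine in a single call.\n",
+    "    if len(summaries) <= BATCH_SIZE:\n",
+    "        return reduce_create_final_summary(summaries)\n",
+    "    # Otherwise, reduce each batch into an intermediate 'super summary'...\n",
+    "    super_summaries = [\n",
+    "        reduce_create_final_summary(summaries[i:i + BATCH_SIZE])\n",
+    "        for i in range(0, len(summaries), BATCH_SIZE)\n",
+    "    ]\n",
+    "    # ...and recurse on the shorter list of intermediate summaries.\n",
+    "    return recursive_reduce(super_summaries)\n",
+    "\n",
+    "# Example usage:\n",
+    "# final_summary = recursive_reduce(chunk_summaries)\n",
+    "```"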
+ ] + }, + { + "cell_type": "markdown", + "id": "00f0e52c-2432-49df-9263-790e7add7ae4", + "metadata": {}, + "source": [ + "## Next steps and upgrade paths\n", + "\n", + "This tutorial provides a solid foundation for a powerful summarization pipeline. You can extend it in several ways for a production-grade application.\n", + "\n", + "| Need | Where to look |\n", + "| :----------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", + "| **More advanced chunking** | For more robust document splitting, explore libraries such as LangChain or LlamaIndex, which offer \"Recursive Character Text Splitters\" that can handle complex documents and code. These can split based on code syntax, markdown structure, and more. |\n", + "| **Alternative patterns** | The \"Map-Reduce\" pattern is not the only option. Learn about the **\"Refine\" pattern**, where the model iteratively builds upon and refines a summary by processing one chunk at a time. This can be better for creating a single, highly coherent narrative. |\n", + "| **Question & Answering** | If your goal is to ask questions of a long document instead of summarizing it, the best approach is **Retrieval-Augmented Generation (RAG)**. This involves storing chunks in a vector database and retrieving only the most relevant ones to answer a user's question. See our [Contextual chunking RAG recipe](https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/Contextual-Chunking-RAG). |\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "My Project (uv)", + "language": "python", + "name": "my-uv-project" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 8a5c9b638ecc4f85a2da3c6cb93c2a1d84da3737 Mon Sep 17 00:00:00 2001 From: Connor Treacy Date: Thu, 23 Oct 2025 17:28:15 +0100 Subject: [PATCH 3/5] Update test result file name in GitHub Actions --- .github/workflows/pytest_cpu_gha_runner.yaml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/pytest_cpu_gha_runner.yaml b/.github/workflows/pytest_cpu_gha_runner.yaml index a3fa2004b..211f0a881 100644 --- a/.github/workflows/pytest_cpu_gha_runner.yaml +++ b/.github/workflows/pytest_cpu_gha_runner.yaml @@ -64,12 +64,12 @@ jobs: id: pytest run: | echo "Running PyTest tests at 'GITHUB_WORKSPACE' path: ${GITHUB_WORKSPACE}" - cd $GITHUB_WORKSPACE && python3 -m pytest --junitxml="$GITHUB_WORKSPACE/result.xml" + cd $GITHUB_WORKSPACE && python3 -m pytest --junitxml="$GITHUB_WORKSPACE/pytest_result.xml" - name: Publish Test Summary id: test_summary uses: test-summary/action@v2 with: paths: | - **/*.xml + **/pytest_result.xml if: always() From 290a2686b9e0d8ba91ca68a265badfbfb6328897 Mon Sep 17 00:00:00 2001 From: Riandy Date: Wed, 22 Oct 2025 12:57:39 +0800 Subject: [PATCH 4/5] Add building with coding assistant readme --- .../coding_assistant/README.md | 367 ++++++++++++++++++ 1 file changed, 367 insertions(+) create mode 100644 end-to-end-use-cases/coding_assistant/README.md diff --git a/end-to-end-use-cases/coding_assistant/README.md b/end-to-end-use-cases/coding_assistant/README.md new 
file mode 100644
index 000000000..59700865b
--- /dev/null
+++ b/end-to-end-use-cases/coding_assistant/README.md
@@ -0,0 +1,367 @@
+# Building with Coding Assistant
+
+Coding assistants have transformed application development by streamlining workflows and increasing productivity. This tutorial explores various methods to optimize your developer experience when building on the Llama platform, utilizing the capabilities of your preferred coding assistants.
+
+This tutorial will cover the following techniques:
+- Prompts for common Llama developer workflows/use cases
+- Incorporating Rules/AGENTS.md into your coding assistant
+
+## Prompts for common Llama developer workflows/use cases
+
+Example prompts for the following tasks are provided, which you can copy and paste into your coding assistant:
+- Task#1: Migrate OpenAI API to Llama API Python in the application
+- Task#2: Finetuning Llama models on a single GPU
+- Task#3: Building a RAG chatbot with Llama
+- Task#4: Llama model upgrade from Llama 3 to use Llama 4
+
+### Task#1: Migrate OpenAI API to Llama API Python in the application
+```
+You are a coding assistant specialized in Llama - a series of LLMs developed and open-sourced by Meta. Your goal is to write code for developers working with the Llama AI ecosystem including tools, API SDK, cookbook and best practices to complete certain tasks.
+
+Here is the task:
+Convert all my OpenAI API usage in this application to use the Llama API Python SDK instead.
+Make sure you follow the correct syntax from the resources provided below.
+Provide instructions on how to acquire a Llama API key.
+Analyze the use cases of this application and choose appropriate Llama models supported by the Llama API based on performance and cost.
+Add a clear README on what you have changed and how to properly test the changes.
+Convert the files in place. Do not create unnecessary files, scripts, or READMEs.
+
+Mainly use this reference: Llama API Python SDK (https://github.com/meta-llama/llama-api-python) - An official repository containing a client Python library to access the Llama API Client REST API.
+
+Here are the other resources you might need to work with. Search on the exact web URL and/or local index to find these resources:
+Llama Official Website (https://www.llama.com/docs/overview/) - Providing comprehensive documentation such as prompt formats for Llama models and various how-to guides.
+Llama Cookbooks (https://github.com/meta-llama/llama-cookbook) - An official repository containing Llama best practices to help you get started with inference, fine-tuning and end-to-end use-cases.
+Llama Stack (https://github.com/llamastack/llama-stack) - An official repository containing a framework that standardizes the core building blocks of simplified AI application development and codifies best practices across the Llama ecosystem.
+```
+
+### Task#2: Finetuning Llama models on a single GPU
+```
+You are a coding assistant specialized in Llama - a series of LLMs developed and open-sourced by Meta. Your goal is to write code for developers working with the Llama AI ecosystem including tools, API SDK, cookbook and best practices to complete certain tasks.
+
+Here is the task:
+Create a script that can help me finetune Llama models on a single consumer GPU such as an A10.
+Use PEFT for finetuning.
+Analyze the memory requirements and select appropriate Llama models and quantization that can fit in the GPU memory.
+Specify interfaces that can take in a particular dataset for finetuning.
This should be defined by the user later based on the use cases. Make sure you provide instructions on how to use the dataset for finetuning.
+Write a separate script for evaluating the finetuning result.
+
+Mainly use this reference: https://github.com/meta-llama/llama-cookbook/blob/main/src/docs/single_gpu.md
+
+Here are the other resources you might need to work with. Search on the exact web URL and/or local index to find these resources:
+Llama Official Website (https://www.llama.com/docs/overview/) - Providing comprehensive documentation such as prompt formats for Llama models and various how-to guides.
+Llama Cookbooks (https://github.com/meta-llama/llama-cookbook) - An official repository containing Llama best practices to help you get started with inference, fine-tuning and end-to-end use-cases.
+Llama Stack (https://github.com/llamastack/llama-stack) - An official repository containing a framework that standardizes the core building blocks of simplified AI application development and codifies best practices across the Llama ecosystem.
+Llama API Python SDK (https://github.com/meta-llama/llama-api-python) - An official repository containing a client Python library to access the Llama API Client REST API.
+
+
+To accomplish this, follow these steps:
+1. Analyze the task and break it down into corresponding subtasks.
+2. For each of the subtasks, reference the available resources and find exact examples to base your solution on.
+3. Validate your solution by writing automated tests where possible.
+4. Iterate on step#2 until you are satisfied.
+
+Your output must contain these artifacts:
+- Exact code files to accomplish the task
+- A comprehensive README with a step-by-step guide
+- Scripts for easy deployment
+- Dependencies that can be easily installed
+```
+
+### Task#3: Building a RAG chatbot with Llama
+```
+You are a coding assistant specialized in Llama - a series of LLMs developed and open-sourced by Meta. Your goal is to write code for developers working with the Llama AI ecosystem including tools, API SDK, cookbook and best practices to complete certain tasks.
+
+Here is the task:
+Build a RAG chatbot using Llama models.
+Specify interfaces that can take in user-defined files such as PDFs. Make sure you provide instructions on how to use these interfaces to process files.
+Use a popular text embedding model with the necessary conversion to create a vector database to store the user-defined files.
+Create a chatbot UI using Gradio that can answer questions regarding the database.
+
+
+Mainly use this reference: https://github.com/meta-llama/llama-cookbook/blob/main/end-to-end-use-cases/customerservice_chatbots/RAG_chatbot/RAG_Chatbot_Example.ipynb
+
+Here are the resources you'll work with. Search on the exact web URL and/or local index to find these resources:
+Llama Official Website (https://www.llama.com/docs/overview/) - Providing comprehensive documentation such as prompt formats for Llama models and various how-to guides.
+Llama Cookbooks (https://github.com/meta-llama/llama-cookbook) - An official repository containing Llama best practices to help you get started with inference, fine-tuning and end-to-end use-cases.
+Llama Stack (https://github.com/llamastack/llama-stack) - An official repository containing a framework that standardizes the core building blocks of simplified AI application development and codifies best practices across the Llama ecosystem.
+Llama API Python SDK (https://github.com/meta-llama/llama-api-python) - An official repository containing a client Python library to access the Llama API Client REST API.
+
+
+To accomplish this, follow these steps:
+1. Analyze the task and break it down into corresponding subtasks.
+2. For each of the subtasks, reference the available resources and find exact examples to base your solution on.
+3. Validate your solution by writing automated tests where possible.
+4. Iterate on step#2 until you are satisfied.
+
+Your output must contain these artifacts:
+- Exact code files to accomplish the task
+- A comprehensive README with a step-by-step guide
+- Scripts for easy deployment
+- Dependencies that can be easily installed
+```
+
+### Task#4: Llama model upgrade from Llama 3 to use Llama 4
+```
+You are a coding assistant specialized in Llama - a series of LLMs developed and open-sourced by Meta. Your goal is to write code for developers working with the Llama AI ecosystem including tools, API SDK, cookbook and best practices to complete certain tasks.
+
+Here is the task:
+Convert all my usage of the Llama 3 model in the codebase to use the Llama 4 model instead.
+Do not change the original interface method. Use the same API provided if applicable.
+Analyze the use cases of this application and choose appropriate Llama models.
+Add a clear README on what you have changed and how to properly test the changes.
+Convert the files in place. Do not create unnecessary files, scripts, or READMEs.
+
+Here are the resources you'll work with. Search on the exact web URL and/or local index to find these resources:
+Llama Official Website (https://www.llama.com/docs/overview/) - Providing comprehensive documentation such as prompt formats for Llama models and various how-to guides.
+Llama Cookbooks (https://github.com/meta-llama/llama-cookbook) - An official repository containing Llama best practices to help you get started with inference, fine-tuning and end-to-end use-cases.
+Llama Stack (https://github.com/llamastack/llama-stack) - An official repository containing a framework that standardizes the core building blocks of simplified AI application development and codifies best practices across the Llama ecosystem.
+Llama API Python SDK (https://github.com/meta-llama/llama-api-python) - An official repository containing a client Python library to access the Llama API Client REST API.
+```
+
+## Incorporating Rules/AGENTS.md into your coding assistant
+
+An effective way to enhance your coding assistant is to provide rules and instructions for the coding agent. This detailed guide demonstrates how to configure your IDE and coding assistants for accelerated development within the Llama ecosystem. It also offers recommendations for optimal setup in Cursor, Windsurf, and VSCode/Copilot.
+
+### Cursor
+
+#### Index documentation
+Cursor can pre-index documentation to provide additional Llama context during development. This enables you to ask questions about Llama more easily and provides additional references during AI-assisted coding.
+- Open Cursor Settings (Shift + Cmd/Ctrl + J)
+- Navigate to "Indexing & Docs"
+- Click "+ Add Doc" button and add these official Llama documentation URLs:
+  - Llama Website: https://www.llama.com/docs/overview/
+  - Llama Models Github: https://github.com/meta-llama/llama-models
+  - Llama Cookbook Github: https://github.com/meta-llama/llama-cookbook
+  - Llama Stack Github: https://github.com/llamastack/llama-stack
+
+#### Optimize with Custom Rules
+
+To further improve the experience building with Llama, you can add custom instructions to Cursor's AI system such as best practices and preferred coding patterns. This helps the AI agent understand the Llama ecosystem better.
+
+- Open Cursor Settings (Shift + Cmd/Ctrl + J)
+- Go to "Rules & Memories"
+- Click "+ Add Rule" for Project rules and append the prompt below
+- After pasting, adjust the button on the top to select "Apply Intelligently"
+
+#### Project Rule for Llama (`.cursor/rules/llama.mdc`)
+```
+---
+description: Full-featured developer rulebook for leveraging Meta Llama in this codebase.
+globs:
+  - "**/*.py"
+  - "**/*.js"
+  - "**/*.md"
+alwaysApply: true
+---
+
+# Meta Llama Developer Rulebook
+
+## Model Usage & Integration
+- Use Meta Llama models (Llama 3, Llama 4 or later) as the default LLM for all tasks requiring a language model.
+- Reference the latest official developer resources for installation, basic usage, and authentication:
+  - Meta Llama Dev Docs: https://www.llama.com/docs/overview/
+  - Official GitHub: https://github.com/meta-llama/llama
+  - Llama models on Hugging Face: https://huggingface.co/meta-llama
+
+## Model Deployment
+- Provide clear suggestions on different types of deployment (API Cloud based, On-prem, On-device) based on the task and model requirements.
+  - Deploying on-prem: https://github.com/meta-llama/llama-cookbook/blob/main/3p-integrations/llama_on_prem.md
+  - Deploying with Llama Stack: https://llama-stack.readthedocs.io/en/latest/getting_started/index.html
+  - Deploying on-device with ExecuTorch: https://docs.pytorch.org/executorch/stable/llm/llama-demo-android.html
+
+## API Interaction & Migration
+- If migrating from other APIs, automatically convert parameter names and prompt formats to Llama's interface. See:
+  - Llama Migration Guide: https://www.llama.com/docs/llama-everywhere/migration/
+  - Llama API Python SDK: https://github.com/meta-llama/llama-api-python
+- Point out differences in authentication, token limits, and endpoint structure.
+
+## Fine-Tuning & Customization
+- Always provide sample code for fine-tuning Llama models using the latest Meta and Hugging Face recipes:
+  - Llama fine-tuning (Meta How-to): https://www.llama.com/docs/how-to-guides/fine-tuning/
+  - Recipe of Fine-tuning and get-started guide: https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/finetuning
+  - PEFT: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/quickstart_peft_finetuning.ipynb
+  - MultiGPU finetuning: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/multigpu_finetuning.md
+  - Finetuning Llama 4: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/finetune_llama4.md
+  - Finetuning vision Llama: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/finetune_vision_model.md
+
+## Prompt Engineering Practices
+- Use and exemplify the Meta/official Llama chat prompt format (system/user/assistant roles).
+- Reference:
+  - Llama 4 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/
+  - Llama 3.3 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/
+  - Llama 3.2 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/
+  - Llama 3.1 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/
+  - Hugging Face transformer prompting Llama 3: https://huggingface.co/docs/transformers/en/model_doc/llama3
+
+## Quantization
+- Based on task requirements, suggest appropriate Quantization schema and necessary frameworks to minimize memory footprint without sacrificing model quality.
+- Official Llama Quantization Guide: https://www.llama.com/docs/how-to-guides/quantization/
+
+
+## Developer Best Practices
+- Ensure all generated code is idiomatic, covered by comments, follows error-handling and input-validation best practices.
+- For documentation and onboarding, always provide links to the latest relevant guides:
+  - Developer Use Guide: https://www.llama.com/developer-use-guide/
+- Recommend including model info and links (e.g., license, documentation) in project-level README files.
+
+
+## Other Common Use-cases
+Text2SQL: https://www.llama.com/resources/cookbook/text2sql_natural_language_to_sql_interface/
+Tool Calling: https://www.llama.com/resources/cookbook/toolcalling-with-llama/
+Language Expansion: https://www.llama.com/resources/cookbook/extend_llama_to_new_languages/
+Email Agent: https://www.llama.com/resources/cookbook/build-email-agent-with-llama3/
+Multimodal RAG: https://www.llama.com/resources/cookbook/build-multimodal-rag/
+Browser Agent: https://www.llama.com/resources/cookbook/build-an-intelligent-browser-agent/
+Customer Service Chatbot: https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/customerservice_chatbots
+
+---
+```
+
+### Windsurf
+
+Similar to Cursor, you can index documents and add custom rules in Windsurf with a slightly different approach.
+
+#### Index Documentation
+Windsurf doesn't locally index a resource just from a URL; it automatically indexes the codebase that is open in your workspace. That means you will need to clone the GitHub repos mentioned above to your local machine.
+Remote indexing is only available on the Enterprise plan.
+
+#### Optimize with Custom Rules
+- Create a .windsurfrules file and add it to your project's root directory to set project-specific rules for the AI, such as enforcing coding standards or focusing suggestions on particular frameworks.
+- For workspace-wide or global rules, use a global_rules.md file. These can be edited via the Windsurf "Customizations" or "Cascade Memories" settings panel, available in the app.
+- Each rule should be concise (under 6,000 characters) and in Markdown format. Rules can be activated manually, always-on, by model decision, or automatically by file path/glob pattern.
+
+```
+# Windsurf Llama Model Global Rules
+
+## Model Usage & Integration
+- Use Meta Llama models (Llama 3, Llama 4 or later) as the default LLM for all tasks requiring a language model.
+- Reference the latest official developer resources for installation, basic usage, and authentication: + - Meta Llama Dev Docs: https://www.llama.com/docs/overview/ + - Official GitHub: https://github.com/meta-llama/llama + - Llama models on Hugging Face: https://huggingface.co/meta-llama + +## Model Deployment +- Provide clear suggestions on different types of deployment (API Cloud based, On-prem, On-device) based on the task and model requirements. + - Deploying on-prem: https://github.com/meta-llama/llama-cookbook/blob/main/3p-integrations/llama_on_prem.md + - Deploying with Llama Stack: https://llama-stack.readthedocs.io/en/latest/getting_started/index.html + - Deploying on-device with ExecuTorch: https://docs.pytorch.org/executorch/stable/llm/llama-demo-android.html + +## API Interaction & Migration +- If migrating from other APIs, automatically convert parameter names and prompt formats to Llama's interface. See: + - Llama Migration Guide: https://www.llama.com/docs/llama-everywhere/migration/ni + - Llama API Python SDK: https://github.com/meta-llama/llama-api-python +- Point out differences in authentication, token limits, and endpoint structure. + +## Fine-Tuning & Customization +- Always provide sample code for fine-tuning Llama models using the latest Meta and Hugging Face recipes: + - Llama fine-tuning (Meta How-to): https://www.llama.com/docs/how-to-guides/fine-tuning/ + - Recipe of Fine-tuning and get-started guide: https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/finetuning + - PEFT: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/quickstart_peft_finetuning.ipynb + - MultiGPU finetuning: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/multigpu_finetuning.md + - Finetuning Llama 4: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/finetune_llama4.md + - Finetuning vision Llama: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/finetune_vision_model.md + +## Prompt Engineering Practices +- Use and exemplify the Meta/official Llama chat prompt format (system/user/assistant roles). +- Reference: + - Llama 4 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/ + - Llama 3.3 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/ + - Llama 3.2 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/ + - Llama 3.1 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/ + - Hugging Face transformer prompting Llama 3: https://huggingface.co/docs/transformers/en/model_doc/llama3 + +## Quantization +- Based on task requirements, suggest appropriate Quantization schema and necessary frameworks to minimize memory footprint without sacrificing model quality. +- Official Llama Quantization Guide: https://www.llama.com/docs/how-to-guides/quantization/ + + +## Developer Best Practices +- Ensure all generated code is idiomatic, covered by comments, follows error-handling and input-validation best practices. +- For documentation and onboarding, always provide links to the latest relevant guides: + - Developer Use Guide: https://www.llama.com/developer-use-guide/ +- Recommend including model info and links (e.g., license, documentation) in project-level README files. 
+ + +## Other Common Use-cases +Text2SQL: https://www.llama.com/resources/cookbook/text2sql_natural_language_to_sql_interface/ +Tool Calling: https://www.llama.com/resources/cookbook/toolcalling-with-llama/ +Language Expansion: https://www.llama.com/resources/cookbook/extend_llama_to_new_languages/ +Email Agent: https://www.llama.com/resources/cookbook/build-email-agent-with-llama3/ +Multimodal RAG: https://www.llama.com/resources/cookbook/build-multimodal-rag/ +Browser Agent: https://www.llama.com/resources/cookbook/build-an-intelligent-browser-agent/ +Customer Service Chatbot: https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/customerservice_chatbots +``` + +### VSCode/Copilot + +#### Index Documentation +GitHub Copilot leverages context from open files, the workspace, or specific instructions to provide relevant code suggestions + +#### Optimize with Custom Rules +- Create a .github/copilot-instructions.md file in your workspace to define general coding standards and guidelines for all Copilot chat requests. +- For file- or task-specific rules, use .instructions.md files with the applyTo frontmatter to specify file targeting. +- You can also add settings in settings.json for review, commit message generation, and pull request description instructions, either directly or by referencing instruction files + +For more information, refer to https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions + +``` +Llama Model Global Rules + +## Model Usage & Integration +- Use Meta Llama models (Llama 3, Llama 4 or later) as the default LLM for all tasks requiring a language model. +- Reference the latest official developer resources for installation, basic usage, and authentication: + - Meta Llama Dev Docs: https://www.llama.com/docs/overview/ + - Official GitHub: https://github.com/meta-llama/llama + - Llama models on Hugging Face: https://huggingface.co/meta-llama + +## Model Deployment +- Provide clear suggestions on different types of deployment (API Cloud based, On-prem, On-device) based on the task and model requirements. + - Deploying on-prem: https://github.com/meta-llama/llama-cookbook/blob/main/3p-integrations/llama_on_prem.md + - Deploying with Llama Stack: https://llama-stack.readthedocs.io/en/latest/getting_started/index.html + - Deploying on-device with ExecuTorch: https://docs.pytorch.org/executorch/stable/llm/llama-demo-android.html + +## API Interaction & Migration +- If migrating from other APIs, automatically convert parameter names and prompt formats to Llama's interface. See: + - Llama Migration Guide: https://www.llama.com/docs/llama-everywhere/migration/ni + - Llama API Python SDK: https://github.com/meta-llama/llama-api-python +- Point out differences in authentication, token limits, and endpoint structure. 
+ +## Fine-Tuning & Customization +- Always provide sample code for fine-tuning Llama models using the latest Meta and Hugging Face recipes: + - Llama fine-tuning (Meta How-to): https://www.llama.com/docs/how-to-guides/fine-tuning/ + - Recipe of Fine-tuning and get-started guide: https://github.com/meta-llama/llama-cookbook/tree/main/getting-started/finetuning + - PEFT: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/quickstart_peft_finetuning.ipynb + - MultiGPU finetuning: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/multigpu_finetuning.md + - Finetuning Llama 4: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/finetune_llama4.md + - Finetuning vision Llama: https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/finetuning/finetune_vision_model.md + +## Prompt Engineering Practices +- Use and exemplify the Meta/official Llama chat prompt format (system/user/assistant roles). +- Reference: + - Llama 4 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/ + - Llama 3.3 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/ + - Llama 3.2 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/ + - Llama 3.1 Prompt Template and Guide: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/ + - Hugging Face transformer prompting Llama 3: https://huggingface.co/docs/transformers/en/model_doc/llama3 + +## Quantization +- Based on task requirements, suggest appropriate Quantization schema and necessary frameworks to minimize memory footprint without sacrificing model quality. +- Official Llama Quantization Guide: https://www.llama.com/docs/how-to-guides/quantization/ + + +## Developer Best Practices +- Ensure all generated code is idiomatic, covered by comments, follows error-handling and input-validation best practices. +- For documentation and onboarding, always provide links to the latest relevant guides: + - Developer Use Guide: https://www.llama.com/developer-use-guide/ +- Recommend including model info and links (e.g., license, documentation) in project-level README files. 
+ + +## Other Common Use-cases +Text2SQL: https://www.llama.com/resources/cookbook/text2sql_natural_language_to_sql_interface/ +Tool Calling: https://www.llama.com/resources/cookbook/toolcalling-with-llama/ +Language Expansion: https://www.llama.com/resources/cookbook/extend_llama_to_new_languages/ +Email Agent: https://www.llama.com/resources/cookbook/build-email-agent-with-llama3/ +Multimodal RAG: https://www.llama.com/resources/cookbook/build-multimodal-rag/ +Browser Agent: https://www.llama.com/resources/cookbook/build-an-intelligent-browser-agent/ +Customer Service Chatbot: https://github.com/meta-llama/llama-cookbook/tree/main/end-to-end-use-cases/customerservice_chatbots +``` From 2fd32c97ef18679cb260db6ef3d36f725cdb5ad2 Mon Sep 17 00:00:00 2001 From: Riandy Date: Wed, 22 Oct 2025 13:07:15 +0800 Subject: [PATCH 5/5] Fix typo and spell check --- .github/scripts/spellcheck_conf/wordlist.txt | 5 +++++ end-to-end-use-cases/coding_assistant/README.md | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/.github/scripts/spellcheck_conf/wordlist.txt b/.github/scripts/spellcheck_conf/wordlist.txt index 436db4de3..da9cb9052 100644 --- a/.github/scripts/spellcheck_conf/wordlist.txt +++ b/.github/scripts/spellcheck_conf/wordlist.txt @@ -1556,3 +1556,8 @@ RequestBuilder VectorIndexManager csvs programmatically +Customizations +VSCode +applyTo +mdc +windsurfrules diff --git a/end-to-end-use-cases/coding_assistant/README.md b/end-to-end-use-cases/coding_assistant/README.md index 59700865b..d7ad6b5c7 100644 --- a/end-to-end-use-cases/coding_assistant/README.md +++ b/end-to-end-use-cases/coding_assistant/README.md @@ -299,7 +299,7 @@ GitHub Copilot leverages context from open files, the workspace, or specific ins #### Optimize with Custom Rules - Create a .github/copilot-instructions.md file in your workspace to define general coding standards and guidelines for all Copilot chat requests. -- For file- or task-specific rules, use .instructions.md files with the applyTo frontmatter to specify file targeting. +- For file- or task-specific rules, use .instructions.md files with the applyTo formatter to specify file targeting. - You can also add settings in settings.json for review, commit message generation, and pull request description instructions, either directly or by referencing instruction files For more information, refer to https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions