AMD-AGI · mattwill-amd · Jun 23, 2026 · Jun 26, 2026
@@ -72,7 +72,7 @@ python -m Magpie.mcp
 | **Compare** | Multi-kernel comparison and ranking | ✅ |
 | **Benchmark** | Framework-level benchmarking (vLLM/SGLang/Atom) with trace analysis | ✅ |
 
-> 📖 See [Benchmark mode](docs/how-to/benchmark.md) for vLLM/SGLang/Atom usage.  
+> 📖 See [Benchmark mode](docs/how-to/benchmarking/benchmark.md) for vLLM/SGLang/Atom usage.  
 > 📖 See [Analyze vs Compare](docs/how-to/analyze-compare.md) for kernel evaluation modes.
 
 ## Configuration

@@ -32,7 +32,7 @@ python -m sphinx -T -b html docs docs/_build/html
 | `reference/compatibility-matrix.md` | Compatibility Matrix | Verified hardware/software versions. Contains `TODO (verify)` markers. |
 | `reference/api-reference.md` | API Reference | CLI commands and options, configuration schema, and MCP tools. |
 | `how-to/analyze-compare.md` | How-to | Analyze vs compare kernel modes. |
-| `how-to/benchmark.md` | How-to | vLLM/SGLang/Atom benchmarking, TraceLens, gap analysis. |
+| `how-to/benchmarking/benchmark.md` | How-to | vLLM/SGLang/Atom benchmarking, TraceLens, gap analysis. |
 | `how-to/ray.md` | How-to | Remote execution on a Ray cluster. |
 | `how-to/mcp-and-skills.md` | How-to | MCP server and agent skill installation. |
 | `how-to/kernel-source-finder.md` | How-to | Locating kernel sources from traces. |

@@ -1,10 +1,12 @@
-# License
+---
+myst:
+    html_meta:
+        "description": "The full MIT License text for Magpie, an open-source GPU kernel evaluation framework developed by AMD-AGI."
+        "keywords": "Magpie, MIT license, open source, AMD-AGI, license text"
+---
 
-Magpie is released under the MIT License. The full license text below matches
-the [`LICENSE`](https://github.com/AMD-AGI/Magpie/blob/main/LICENSE) file in the
-Magpie GitHub repository.
+# License
 
-```text
 MIT License
 
 Copyright (c) 2026 AMD-AGI
@@ -26,4 +28,3 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
-```
@@ -0,0 +1,84 @@
+---
+myst:
+    html_meta:
+        "description": "Learn how Magpie's benchmark mode pipeline is structured, including components, execution flow, and integration with TraceLens and gap analysis."
+        "keywords": "Magpie, benchmark architecture, BenchmarkMode, TraceLens, gap analysis, vLLM, SGLang, ROCm, GPU, LLM inference"
+---
+
+# Magpie benchmarking mode architecture
+
+Magpie's benchmark mode drives end-to-end performance evaluation of LLM inference frameworks—vLLM, SGLang, and Atom—by launching a server, running a client workload, and collecting throughput and latency metrics into a structured JSON report. Benchmarks can run inside a Docker container (the default), directly on the host, or on a remote Ray cluster, and they optionally capture torch profiler traces for downstream analysis with TraceLens and gap analysis. This page describes the components that make up the benchmark pipeline, the execution flow from configuration to report generation, and how the pieces connect.
+
+## Architecture
+
+Magpie benchmark mode comprises several components that work together to run, profile, and analyze inference framework benchmarks.
+
+### Components
+
+Benchmark mode consists of the following Python modules.
+
+| Component | File | Description |
+|-----------|------|-------------|
+| `BenchmarkMode` | `benchmarker.py` | Main orchestrator |
+| `BenchmarkConfig` | `config.py` | Configuration dataclasses |
+| `TraceLensAnalyzer` | `tracelens.py` | TraceLens CLI integration |
+| `GapAnalyzer` | `gap_analysis.py` | Kernel bottleneck analysis |
+| `BenchmarkResult` | `result.py` | Result data structures |
+
+### Execution flow
+
+Each benchmark run proceeds through the following stages.
+
+1. **Configuration Loading**: Parse the YAML config into `BenchmarkConfig`
+2. **Runtime Setup**: For `run_mode: docker`, prepare a container with InferenceX; for `local`, use the host environment
+3. **Server Launch**: Start the vLLM/SGLang server (in the container or on the host, per `run_mode`)
+4. **Client Execution**: Run the benchmark client with profiling enabled
+5. **Trace Collection**: Save torch profiler traces to the workspace
+6. **TraceLens Analysis**: Run TraceLens CLI commands inside the runtime image
+   for Docker inference mode, or on the host for local/classic mode (if enabled)
+7. **Gap Analysis**: Analyze kernel bottlenecks within a time window (if enabled)
+8. **Result Generation**: Aggregate metrics and generate reports
+
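The stage list above can be sketched as an ordered plan in which the two analysis stages are optional. This is an illustrative sketch only, not Magpie's implementation: `RunOptions`, its field names, and `planned_stages` are hypothetical stand-ins for whatever `BenchmarkConfig` and `BenchmarkMode` actually expose.

```python
from dataclasses import dataclass


@dataclass
class RunOptions:
    """Hypothetical stand-in for the relevant BenchmarkConfig fields."""
    run_mode: str = "docker"             # "docker", "local", or "ray"
    tracelens_enabled: bool = False      # assumed flag name
    gap_analysis_enabled: bool = False   # assumed flag name


def planned_stages(opts: RunOptions) -> list[str]:
    """Return the ordered stage names for one benchmark run."""
    stages = [
        "configuration_loading",
        "runtime_setup",
        "server_launch",
        "client_execution",
        "trace_collection",
    ]
    # The two analysis stages run only when enabled.
    if opts.tracelens_enabled:
        stages.append("tracelens_analysis")
    if opts.gap_analysis_enabled:
        stages.append("gap_analysis")
    stages.append("result_generation")
    return stages


if __name__ == "__main__":
    print(planned_stages(RunOptions(gap_analysis_enabled=True)))
```

Modeling the optional stages as flags keeps the ordering in one place, so a run with both analyses disabled still ends with result generation.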
+### Architecture diagram
+
+The following diagram shows how Magpie orchestrates the benchmark pipeline.
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        Benchmark Mode                               │
+├─────────────────────────────────────────────────────────────────────┤
+│  ┌───────────────┐    ┌───────────────┐    ┌────────────────────┐   │
+│  │BenchmarkConfig│  → │ BenchmarkMode │ →  │  BenchmarkResult   │   │
+│  │  (YAML)       │    │               │    │  (JSON + CSV)      │   │
+│  └───────────────┘    └───────────────┘    └────────────────────┘   │
+│                               │                                     │
+│                               ▼                                     │
+│  ┌──────────────────────────────────────────────────────────────┐   │
+│  │  Runtime: docker │ local │ ray                               │   │
+│  │  ┌─────────────┐        ┌─────────────────────────────────┐  │   │
+│  │  │ InferenceX  │  →     │ vLLM / SGLang Server + Client   │  │   │
+│  │  │ scripts     │        │ + Torch Profiler                │  │   │
+│  │  └─────────────┘        └─────────────────────────────────┘  │   │
+│  │  Ray: Magpie driver → RayJobExecutor → GPU worker runs the   │   │
+│  │        same stack (local/docker on worker; NFS for cache/    │   │
+│  │        results). See ray.md                                  │   │
+│  └──────────────────────────────────────────────────────────────┘   │
+│                               │                                     │
+│                      ┌────────┴────────┐                            │
+│                      ▼                 ▼                            │
+│  ┌────────────────────────┐  ┌─────────────────────────────────┐    │
+│  │  Gap Analysis          │  │  TraceLens Analysis             │    │
+│  │  • Time window filter  │  │  • Perf report (per-rank)       │    │
+│  │  • Category filter     │  │  • Multi-rank collective report │    │
+│  │  • Kernel stats CSV    │  │                                 │    │
+│  └────────────────────────┘  └─────────────────────────────────┘    │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
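The gap analysis box in the diagram (time window filter, category filter, kernel stats) can be illustrated with a small aggregation over trace events. This is a sketch under assumptions, not the `GapAnalyzer` code: it assumes events in Chrome trace event form with `ph`, `cat`, `name`, `ts`, and `dur` fields, and `top_kernels` and the sample data are made up for illustration.

```python
from collections import defaultdict


def top_kernels(events, start_us, end_us, category="kernel", limit=3):
    """Sum durations per kernel name for events that start inside the window."""
    totals = defaultdict(float)
    for ev in events:
        # Category filter: keep only complete events of the requested category.
        if ev.get("ph") != "X" or ev.get("cat") != category:
            continue
        # Time window filter: keep events whose start falls in [start_us, end_us).
        if start_us <= ev["ts"] < end_us:
            totals[ev["name"]] += ev["dur"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]


# Tiny hand-made sample (not real profiler output).
sample = [
    {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 10, "dur": 40},
    {"ph": "X", "cat": "kernel", "name": "softmax", "ts": 60, "dur": 5},
    {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 70, "dur": 35},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 5, "dur": 90},
    {"ph": "X", "cat": "kernel", "name": "gemm", "ts": 500, "dur": 40},
]

if __name__ == "__main__":
    for name, total in top_kernels(sample, start_us=0, end_us=100):
        print(f"{name},{total}")
```

In the sample, the `cpu_op` event is dropped by the category filter and the `gemm` event at `ts=500` is dropped by the time window, so only three kernel events contribute to the totals.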
+## More info
+
+- [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md) — how-to guide covering configuration, run modes, TraceLens, gap analysis, and examples
+- [Magpie benchmark mode configuration](../reference/benchmark-config.md) — full YAML schema with all available options and defaults
+- [Run Magpie on a Ray cluster](../how-to/ray.md) — running benchmarks on remote GPU nodes using Ray
+- [Find kernel sources with Magpie](../how-to/kernel-source-finder.md) — mapping kernel names from gap analysis output to source files
+- [Magpie troubleshooting](../reference/troubleshooting.md) — solutions for common benchmark errors
@@ -0,0 +1,60 @@
+---
+myst:
+    html_meta:
+        "description": "Understand Magpie's Ray integration driver-worker model, executor selection, and end-to-end task flow for running GPU workloads on remote Ray clusters."
+        "keywords": "Magpie, Ray architecture, RayJobExecutor, driver worker, remote GPU, distributed benchmark, ROCm, CUDA"
+---
+
+# Magpie on Ray architecture
+
+Magpie's Ray integration offloads analyze, compare, and benchmark workloads from the machine running the CLI or MCP server onto GPU-capable worker nodes in a Ray cluster, without changing the evaluation logic itself. The integration is built around a driver-worker split: the driver process submits a remote function via `RayJobExecutor`, and the worker node executes the same `AnalyzeMode`, `CompareMode`, or `BenchmarkMode` code it would run locally. This page describes how executor selection works, how the task flows end-to-end, and where to find the relevant source files.
+
+Magpie's Ray integration follows a driver-worker model where the driver submits tasks and workers execute them on GPU-capable nodes.
+
+## Driver vs worker
+
+Magpie's Ray integration uses two roles: the driver process that submits work, and the worker nodes that execute it.
+
+- **Driver**: The process running `python -m Magpie …`, the MCP server, or your own script. It calls `Scheduler` or `BenchmarkMode`, connects with `ray.init(address=…)`, and submits a remote function.
+- **Worker**: Ray executes `Magpie.remote.tasks.run_task` on a GPU-capable node. That function dispatches to `_run_analyze`, `_run_compare`, or `_run_benchmark`.
+
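The worker-side dispatch described above can be sketched as a lookup from mode name to handler. The handler names come from the text, but the `(mode, payload)` signature, the handler bodies, and the returned dictionaries are assumptions made for illustration; the real `Magpie.remote.tasks.run_task` may look quite different.

```python
def _run_analyze(payload):
    # Placeholder body: the real handler would run AnalyzeMode.
    return {"mode": "analyze", "payload": payload}


def _run_compare(payload):
    # Placeholder body: the real handler would run CompareMode.
    return {"mode": "compare", "payload": payload}


def _run_benchmark(payload):
    # Placeholder body: the real handler would run BenchmarkMode.
    return {"mode": "benchmark", "payload": payload}


_HANDLERS = {
    "analyze": _run_analyze,
    "compare": _run_compare,
    "benchmark": _run_benchmark,
}


def run_task(mode, payload):
    """Dispatch one task to the handler for its mode (hypothetical signature)."""
    try:
        handler = _HANDLERS[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return handler(payload)


if __name__ == "__main__":
    print(run_task("compare", {"kernels": ["a", "b"]}))
```

On a cluster, a driver would typically wrap such a function with `ray.remote(run_task)`, call `.remote(...)` on the result, and fetch the outcome with `ray.get(...)`; here the function is called directly so the sketch runs without Ray installed.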
+## Executor selection
+
+The executor is chosen based on `SchedulerConfig.environment_type`.
+
+| `SchedulerConfig.environment_type` | Executor | Execution |
+|-----------------------------------|----------|-----------|
+| `local` | `LocalExecutor` | Subprocesses on the driver machine (`Magpie/core/executor.py`). |
+| `container` | Container executor | Isolated environment on the driver (kernel flows). |
+| `ray` | `RayJobExecutor` | `ray.remote(run_task)` on a cluster node (`Magpie/core/ray_executor.py`). |
+
+Benchmark mode additionally uses `BenchmarkConfig.run_mode`: `docker`, `local`, or `ray`. When `run_mode` is `ray`, `BenchmarkMode` builds a `Task` and uses `RayJobExecutor` internally (`Magpie/modes/benchmark/benchmarker.py`).
+
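The selection table above amounts to a mapping from `environment_type` to an executor. The sketch below returns executor names as strings rather than instantiating anything; `EnvironmentType` and `executor_name` are hypothetical, and the container executor's class name is not given in the source, so it appears only as a descriptive label.

```python
from enum import Enum


class EnvironmentType(str, Enum):
    """The three values listed for SchedulerConfig.environment_type."""
    LOCAL = "local"
    CONTAINER = "container"
    RAY = "ray"


_EXECUTOR_NAMES = {
    EnvironmentType.LOCAL: "LocalExecutor",
    EnvironmentType.CONTAINER: "container executor",  # class name not stated in the source
    EnvironmentType.RAY: "RayJobExecutor",
}


def executor_name(environment_type: str) -> str:
    """Map an environment_type string to the executor named in the table."""
    # EnvironmentType(...) raises ValueError for values outside the table.
    return _EXECUTOR_NAMES[EnvironmentType(environment_type)]


if __name__ == "__main__":
    for value in ("local", "container", "ray"):
        print(value, "->", executor_name(value))
```

Validating the string through an enum rejects unknown values early instead of letting them fall through to a default executor.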
+## End-to-end flow
+
+```mermaid
+flowchart LR
+  subgraph Driver
+    CLI[MCP / CLI]
+    SCH[Scheduler or BenchmarkMode]
+    RJE[RayJobExecutor]
+    CLI --> SCH --> RJE
+  end
+  subgraph Cluster
+    RT[run_task]
+    A[AnalyzeMode]
+    C[CompareMode]
+    B[BenchmarkMode]
+    RJE -->|ray.remote| RT
+    RT --> A
+    RT --> C
+    RT --> B
+  end
+```
+
+## More info
+
+- [Magpie on Ray](../how-to/ray.md) — how-to guide covering cluster setup, configuration, shared storage, and troubleshooting
+- [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md) — benchmark run modes including `run_mode: ray`
+- [Magpie benchmarking mode architecture](benchmarking-architecture.md) — how the benchmark pipeline is designed and how components interact
+- [Ray documentation](https://docs.ray.io/) — cluster setup, job submission, and runtime environments
@@ -1,37 +1,63 @@
-# Configuration file for the Sphinx documentation builder.
-#
-# Magpie documentation is built with rocm-docs-core, which configures the
-# theme, navigation, MyST Markdown support, and shared ROCm options. Both
-# Markdown (.md, via MyST) and reStructuredText (.rst) source files build out
-# of the box.
-#
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-# https://rocm.docs.amd.com/projects/rocm-docs-core/en/latest/
-
-# -- Project information ------------------------------------------------------
+"""
+html_theme is usually left unchanged (rocm_docs_theme).
+flavor controls the site header display; select the flavor that matches the target portal.
+flavor options: rocm, rocm-docs-home, rocm-blogs, rocm-ds, instinct, ai-developer-hub, local, generic
+"""
 
+version_number = "0.1.0"
+
+html_theme = "rocm_docs_theme"
+html_theme_options = {
+    "flavor": "generic",
+    "header_title": f"Magpie {version_number}",
+    "header_link": False,
+    "version_list_link": False,
+    "nav_secondary_items": {
+        "GitHub": False,
+        "Community": False,
+        "Blogs": "https://rocm.blogs.amd.com/",
+        "ROCm Developer Hub": "https://www.amd.com/en/developer/resources/rocm-hub.html",
+        "Instinct™ Docs": "https://instinct.docs.amd.com/",
+        "Infinity Hub": "https://www.amd.com/en/developer/resources/infinity-hub.html",
+        "Support": False,
+    },
+    "link_main_doc": False,
+}
+
+# This section turns on/off article info
+setting_all_article_info = True
+all_article_info_os = ["linux"]
+all_article_info_author = ""
+
+# for PDF output on Read the Docs
 project = "Magpie"
 author = "Advanced Micro Devices, Inc."
-copyright = "2026, Advanced Micro Devices, Inc."
-
-# Single-sourced version. Update alongside pyproject.toml / package version.
-version = "0.1.0"
-release = version
-
-# -- General configuration ----------------------------------------------------
+copyright = "Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved."
+version = version_number
+release = version_number
+
+external_toc_path = "./sphinx/_toc.yml"  # Path to the table of contents definition
+
+"""
+Doxygen Settings
+Ensure Doxyfile is located at docs/doxygen.
+If the component does not need doxygen, delete this section for optimal build time
+"""
+# doxygen_root = "doxygen"
+# doxysphinx_enabled = True
+# doxygen_project = {
+#    "name": "doxygen",
+#    "path": "doxygen/xml",
+# }
+
+# Add additional extensions as needed
+extensions = [
+    "rocm_docs",
+    "sphinxcontrib.mermaid"
+]
 
-extensions = ["rocm_docs", "sphinxcontrib.mermaid"]
-
-# Render fenced ```mermaid code blocks in Markdown as diagrams.
 myst_fence_as_directive = ["mermaid"]
 
-external_toc_path = "./sphinx/_toc.yml"
-
-# docs/README.md documents the build process for contributors and is not a
-# published page; keep it out of the source build so it is not treated as an
-# orphan document.
-exclude_patterns = ["README.md"]
+html_title = f"{project} {version_number} documentation"
 
-# rocm-docs-core options.
-html_theme = "rocm_docs_theme"
-html_theme_options = {"flavor": "rocm-docs-home"}
+external_projects_current_project = "Magpie"
@@ -1,14 +1,21 @@
-# Examples
+---
+myst:
+    html_meta:
+        "description": "Step-by-step Magpie examples for analyzing HIP kernels, comparing implementations, benchmarking vLLM with TraceLens, and running standalone gap analysis on GPU traces."
+        "keywords": "Magpie, examples, HIP kernel, compare kernels, vLLM benchmark, TraceLens, gap analysis, ROCm, CUDA, GPU"
+---
 
-This page provides end-to-end, step-by-step examples for common Magpie use
+# Magpie examples
+
+This topic provides end-to-end, step-by-step examples for common Magpie use
 cases. Each example lists the prerequisites, the exact commands to run, and the
 expected output. All example configuration files referenced here live in the
 [`examples/`](https://github.com/AMD-AGI/Magpie/tree/main/examples) directory of
 the Magpie repository.
 
 Run every command from the Magpie repository root unless noted otherwise.
 
-## Example 1: Analyze a simple HIP kernel
+## Analyze a simple HIP kernel
 
 This example analyzes a minimal HIP `vector_add` kernel for correctness using a
 testcase command.
@@ -55,7 +62,7 @@ Magpie reports a passing correctness state and writes a JSON report to
 with an overall `score` of `1.0` when correctness succeeds and profiling is
 skipped.
 
-## Example 2: Compare two kernel implementations
+## Compare two kernel implementations
 
 This example compares BF16 and FP16 grouped GEMM kernels from Composable Kernel
 and ranks them by performance.
@@ -108,7 +115,7 @@ Magpie evaluates both kernels, prints a ranked comparison against the baseline
 implementation. See [Analyze and compare kernels](../how-to/analyze-compare.md)
 for how scores and rankings are computed.
 
-## Example 3: Benchmark vLLM with TraceLens analysis
+## Benchmark vLLM with TraceLens analysis
 
 This example runs a framework-level benchmark of vLLM and analyzes the resulting
 traces.
@@ -139,10 +146,10 @@ traces.
 
 Magpie launches the benchmark, collects throughput and latency metrics, and (for
 the TraceLens config) produces a trace analysis report under the benchmark
-workspace in `./results`. See [Benchmark frameworks](../how-to/benchmark.md) for
+workspace in `./results`. See [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md) for
 the full result layout and metric descriptions.
 
-## Example 4: Standalone gap analysis on existing traces
+## Standalone gap analysis on existing traces
 
 If you already have torch profiler traces, you can run gap analysis without
 launching a benchmark to find the kernels that dominate runtime.
@@ -161,7 +168,7 @@ Magpie writes a `gap_analysis/gap_analysis.csv` file (plus optional per-rank
 CSVs) under the trace directory, listing the top bottleneck kernels by
 aggregated duration. Add `--find-kernel-sources` to also locate kernel source
 files and test commands for AMD kernels; see
-[Find kernel sources](../how-to/kernel-source-finder.md).
+[Find kernel sources with Magpie](../how-to/kernel-source-finder.md).
 
 ## More examples