diff --git a/README.md b/README.md
index f389079..b4b6d54 100644
--- a/README.md
+++ b/README.md
@@ -72,7 +72,7 @@ python -m Magpie.mcp
 | **Compare** | Multi-kernel comparison and ranking | ✅ |
 | **Benchmark** | Framework-level benchmarking (vLLM/SGLang/Atom) with trace analysis | ✅ |
 
-> 📖 See [Benchmark mode](docs/how-to/benchmark.md) for vLLM/SGLang/Atom usage.  
+> 📖 See [Benchmark mode](docs/how-to/benchmarking/benchmark.md) for vLLM/SGLang/Atom usage.  
 > 📖 See [Analyze vs Compare](docs/how-to/analyze-compare.md) for kernel evaluation modes.
 
 ## Configuration
diff --git a/docs/README.md b/docs/README.md
index 62ba4ed..19dd1d0 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -32,7 +32,7 @@ python -m sphinx -T -b html docs docs/_build/html
 | `reference/compatibility-matrix.md` | Compatibility Matrix | Verified hardware/software versions. Contains `TODO (verify)` markers. |
 | `reference/api-reference.md` | API Reference | CLI commands and options, configuration schema, and MCP tools. |
 | `how-to/analyze-compare.md` | How-to | Analyze vs compare kernel modes. |
-| `how-to/benchmark.md` | How-to | vLLM/SGLang/Atom benchmarking, TraceLens, gap analysis. |
+| `how-to/benchmarking/benchmark.md` | How-to | vLLM/SGLang/Atom benchmarking, TraceLens, gap analysis. |
 | `how-to/ray.md` | How-to | Remote execution on a Ray cluster. |
 | `how-to/mcp-and-skills.md` | How-to | MCP server and agent skill installation. |
 | `how-to/kernel-source-finder.md` | How-to | Locating kernel sources from traces. |
diff --git a/docs/about/license.md b/docs/about/license.md
index bba2a91..f716abe 100644
--- a/docs/about/license.md
+++ b/docs/about/license.md
@@ -1,10 +1,12 @@
-# License
+---
+myst:
+    html_meta:
+        "description": "The full MIT License text for Magpie, an open-source GPU kernel evaluation framework developed by AMD-AGI."
+        "keywords": "Magpie, MIT license, open source, AMD-AGI, license text"
+---
 
-Magpie is released under the MIT License. The full license text below matches
-the [`LICENSE`](https://github.com/AMD-AGI/Magpie/blob/main/LICENSE) file in the
-Magpie GitHub repository.
+# License
 
-```text
 MIT License
 
 Copyright (c) 2026 AMD-AGI
@@ -26,4 +28,3 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
-```
diff --git a/docs/conceptual/benchmarking-architecture.md b/docs/conceptual/benchmarking-architecture.md
new file mode 100644
index 0000000..4ff5167
--- /dev/null
+++ b/docs/conceptual/benchmarking-architecture.md
@@ -0,0 +1,84 @@
+---
+myst:
+    html_meta:
+        "description": "Learn how Magpie's benchmark mode pipeline is structured, including components, execution flow, and integration with TraceLens and gap analysis."
+        "keywords": "Magpie, benchmark architecture, BenchmarkMode, TraceLens, gap analysis, vLLM, SGLang, ROCm, GPU, LLM inference"
+---
+
+# Magpie benchmarking mode architecture
+
+Magpie's benchmark mode drives end-to-end performance evaluation of LLM inference frameworks—vLLM, SGLang, and Atom—by launching a server, running a client workload, and collecting throughput and latency metrics into a structured JSON report. Benchmarks can run inside a Docker container (the default), directly on the host, or on a remote Ray cluster, and they optionally capture torch profiler traces for downstream analysis with TraceLens and gap analysis. This page describes the components that make up the benchmark pipeline, the execution flow from configuration to report generation, and how the pieces connect.
+
+## Architecture
+
+Magpie benchmark mode is composed of the following key components that work together to run, profile, and analyze inference framework benchmarks.
+
+### Components
+
+Benchmark mode consists of the following Python modules.
+
+| Component | File | Description |
+|-----------|------|-------------|
+| `BenchmarkMode` | `benchmarker.py` | Main orchestrator |
+| `BenchmarkConfig` | `config.py` | Configuration dataclasses |
+| `TraceLensAnalyzer` | `tracelens.py` | TraceLens CLI integration |
+| `GapAnalyzer` | `gap_analysis.py` | Kernel bottleneck analysis |
+| `BenchmarkResult` | `result.py` | Result data structures |
+
+### Execution flow
+
+Each benchmark run proceeds through the following stages.
+
+1. **Configuration Loading**: Parse YAML config into `BenchmarkConfig`
+2. **Runtime Setup**: For `run_mode: docker`, prepare a container with InferenceX; for `local`, use the host environment
+3. **Server Launch**: Start vLLM/SGLang server (in container or on host per `run_mode`)
+4. **Client Execution**: Run benchmark client with profiling enabled
+5. **Trace Collection**: Save torch profiler traces to the workspace
+6. **TraceLens Analysis**: Run TraceLens CLI commands inside the runtime image
+   for Docker inference mode, or on host for local/classic mode (if enabled)
+7. **Gap Analysis**: Analyze kernel bottlenecks within the configured time window (if enabled)
+8. **Result Generation**: Aggregate metrics and generate reports
+
+### Architecture diagram
+
+The following diagram shows how Magpie orchestrates the benchmark pipeline.
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        Benchmark Mode                               │
+├─────────────────────────────────────────────────────────────────────┤
+│  ┌───────────────┐    ┌───────────────┐    ┌────────────────────┐   │
+│  │BenchmarkConfig│  → │ BenchmarkMode │ →  │  BenchmarkResult   │   │
+│  │  (YAML)       │    │               │    │  (JSON + CSV)      │   │
+│  └───────────────┘    └───────────────┘    └────────────────────┘   │
+│                               │                                     │
+│                               ▼                                     │
+│  ┌──────────────────────────────────────────────────────────────┐   │
+│  │  Runtime: docker │ local │ ray                               │   │
+│  │  ┌─────────────┐        ┌─────────────────────────────────┐  │   │
+│  │  │ InferenceX  │  →     │ vLLM / SGLang Server + Client   │  │   │
+│  │  │ scripts     │        │ + Torch Profiler                │  │   │
+│  │  └─────────────┘        └─────────────────────────────────┘  │   │
+│  │  Ray: Magpie driver → RayJobExecutor → GPU worker runs the   │   │
+│  │        same stack (local/docker on worker; NFS for cache/    │   │
+│  │        results). See ray.md                                  │   │
+│  └──────────────────────────────────────────────────────────────┘   │
+│                               │                                     │
+│                      ┌────────┴────────┐                            │
+│                      ▼                 ▼                            │
+│  ┌────────────────────────┐  ┌─────────────────────────────────┐    │
+│  │  Gap Analysis          │  │  TraceLens Analysis             │    │
+│  │  • Time window filter  │  │  • Perf report (per-rank)       │    │
+│  │  • Category filter     │  │  • Multi-rank collective report │    │
+│  │  • Kernel stats CSV    │  │                                 │    │
+│  └────────────────────────┘  └─────────────────────────────────┘    │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+## More info
+
+- [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md) — how-to guide covering configuration, run modes, TraceLens, gap analysis, and examples
+- [Magpie benchmark mode configuration](../reference/benchmark-config.md) — full YAML schema with all available options and defaults
+- [Run Magpie on a Ray cluster](../how-to/ray.md) — running benchmarks on remote GPU nodes using Ray
+- [Find kernel sources with Magpie](../how-to/kernel-source-finder.md) — mapping kernel names from gap analysis output to source files
+- [Magpie troubleshooting](../reference/troubleshooting.md) — solutions for common benchmark errors
\ No newline at end of file
diff --git a/docs/conceptual/ray-architecture.md b/docs/conceptual/ray-architecture.md
new file mode 100644
index 0000000..a81cf1f
--- /dev/null
+++ b/docs/conceptual/ray-architecture.md
@@ -0,0 +1,60 @@
+---
+myst:
+    html_meta:
+        "description": "Understand Magpie's Ray integration driver-worker model, executor selection, and end-to-end task flow for running GPU workloads on remote Ray clusters."
+        "keywords": "Magpie, Ray architecture, RayJobExecutor, driver worker, remote GPU, distributed benchmark, ROCm, CUDA"
+---
+
+# Magpie on Ray architecture
+
+Magpie's Ray integration offloads analyze, compare, and benchmark workloads from the machine running the CLI or MCP server onto GPU-capable worker nodes in a Ray cluster, without changing the evaluation logic itself. The integration is built around a driver-worker split: the driver process submits a remote function via `RayJobExecutor`, and the worker node executes the same `AnalyzeMode`, `CompareMode`, or `BenchmarkMode` code it would run locally. This page describes how executor selection works, how the task flows end-to-end, and where to find the relevant source files.
+
+Magpie's Ray integration follows a driver-worker model where the driver submits tasks and workers execute them on GPU-capable nodes.
+
+## Driver vs worker
+
+Magpie's Ray integration uses two roles: the driver process that submits work, and the worker nodes that execute it.
+
+- **Driver**: process running `python -m Magpie …`, MCP, or your script. It calls `Scheduler` or `BenchmarkMode`, connects with `ray.init(address=…)`, and submits a remote function.
+- **Worker**: Ray executes `Magpie.remote.tasks.run_task` on a GPU-capable node. That function dispatches to `_run_analyze`, `_run_compare`, or `_run_benchmark`.
+
+## Executor selection
+
+The executor is chosen based on `SchedulerConfig.environment_type`.
+
+| `SchedulerConfig.environment_type` | Executor | Execution |
+|-----------------------------------|----------|-----------|
+| `local` | `LocalExecutor` | Subprocesses on the driver machine (`Magpie/core/executor.py`). |
+| `container` | Container executor | Isolated environment on the driver (kernel flows). |
+| `ray` | `RayJobExecutor` | `ray.remote(run_task)` on a cluster node (`Magpie/core/ray_executor.py`). |
+
+Benchmark mode additionally uses `BenchmarkConfig.run_mode`: `docker`, `local`, or `ray`. When `run_mode` is `ray`, `BenchmarkMode` builds a `Task` and uses `RayJobExecutor` internally (`Magpie/modes/benchmark/benchmarker.py`).
+
+## End-to-end flow
+
+```mermaid
+flowchart LR
+  subgraph Driver
+    CLI[MCP / CLI]
+    SCH[Scheduler or BenchmarkMode]
+    RJE[RayJobExecutor]
+    CLI --> SCH --> RJE
+  end
+  subgraph Cluster
+    RT[run_task]
+    A[AnalyzeMode]
+    C[CompareMode]
+    B[BenchmarkMode]
+    RJE -->|ray.remote| RT
+    RT --> A
+    RT --> C
+    RT --> B
+  end
+```
+
+## More info
+
+- [Magpie on Ray](../how-to/ray.md) — how-to guide covering cluster setup, configuration, shared storage, and troubleshooting
+- [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md) — benchmark run modes including `run_mode: ray`
+- [Magpie benchmarking mode architecture](benchmarking-architecture.md) — how the benchmark pipeline is designed and how components interact
+- [Ray documentation](https://docs.ray.io/) — cluster setup, job submission, and runtime environments
\ No newline at end of file
diff --git a/docs/conf.py b/docs/conf.py
index e0a1332..775aa0a 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -1,37 +1,63 @@
-# Configuration file for the Sphinx documentation builder.
-#
-# Magpie documentation is built with rocm-docs-core, which configures the
-# theme, navigation, MyST Markdown support, and shared ROCm options. Both
-# Markdown (.md, via MyST) and reStructuredText (.rst) source files build out
-# of the box.
-#
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-# https://rocm.docs.amd.com/projects/rocm-docs-core/en/latest/
-
-# -- Project information ------------------------------------------------------
+"""
+html_theme is usually unchanged (rocm_docs_theme).
+flavor defines the site header display; select the flavor that matches the target portal.
+flavor options: rocm, rocm-docs-home, rocm-blogs, rocm-ds, instinct, ai-developer-hub, local, generic
+"""
 
+version_number = "0.1.0"
+
+html_theme = "rocm_docs_theme"
+html_theme_options = {
+    "flavor": "generic",
+    "header_title": f"Magpie {version_number}",
+    "header_link": False,
+    "version_list_link": False,
+    "nav_secondary_items": {
+        "GitHub": False,
+        "Community": False,
+        "Blogs": "https://rocm.blogs.amd.com/",
+        "ROCm Developer Hub": "https://www.amd.com/en/developer/resources/rocm-hub.html",
+        "Instinct™ Docs": "https://instinct.docs.amd.com/",
+        "Infinity Hub": "https://www.amd.com/en/developer/resources/infinity-hub.html",
+        "Support": False,
+    },
+    "link_main_doc": False,
+}
+
+# This section turns on/off article info
+setting_all_article_info = True
+all_article_info_os = ["linux"]
+all_article_info_author = ""
+
+# for PDF output on Read the Docs
 project = "Magpie"
 author = "Advanced Micro Devices, Inc."
-copyright = "2026, Advanced Micro Devices, Inc."
-
-# Single-sourced version. Update alongside pyproject.toml / package version.
-version = "0.1.0"
-release = version
-
-# -- General configuration ----------------------------------------------------
+copyright = "Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved."
+version = version_number
+release = version_number
+
+external_toc_path = "./sphinx/_toc.yml"  # Path to the table of contents definition
+
+"""
+Doxygen Settings
+Ensure Doxyfile is located at docs/doxygen.
+If the component does not need doxygen, delete this section for optimal build time
+"""
+# doxygen_root = "doxygen"
+# doxysphinx_enabled = True
+# doxygen_project = {
+#    "name": "doxygen",
+#    "path": "doxygen/xml",
+# }
+
+# Add additional packages as needed
+extensions = [
+    "rocm_docs",
+    "sphinxcontrib.mermaid"
+]
 
-extensions = ["rocm_docs", "sphinxcontrib.mermaid"]
-
-# Render fenced ```mermaid code blocks in Markdown as diagrams.
 myst_fence_as_directive = ["mermaid"]
 
-external_toc_path = "./sphinx/_toc.yml"
-
-# docs/README.md documents the build process for contributors and is not a
-# published page; keep it out of the source build so it is not treated as an
-# orphan document.
-exclude_patterns = ["README.md"]
+html_title = f"{project} {version_number} documentation"
 
-# rocm-docs-core options.
-html_theme = "rocm_docs_theme"
-html_theme_options = {"flavor": "rocm-docs-home"}
+external_projects_current_project = "Magpie"
\ No newline at end of file
diff --git a/docs/examples/examples.md b/docs/examples/examples.md
index 5396814..850878b 100644
--- a/docs/examples/examples.md
+++ b/docs/examples/examples.md
@@ -1,6 +1,13 @@
-# Examples
+---
+myst:
+    html_meta:
+        "description": "Step-by-step Magpie examples for analyzing HIP kernels, comparing implementations, benchmarking vLLM with TraceLens, and running standalone gap analysis on GPU traces."
+        "keywords": "Magpie, examples, HIP kernel, compare kernels, vLLM benchmark, TraceLens, gap analysis, ROCm, CUDA, GPU"
+---
 
-This page provides end-to-end, step-by-step examples for common Magpie use
+# Magpie examples
+
+This topic provides end-to-end, step-by-step examples for common Magpie use
 cases. Each example lists the prerequisites, the exact commands to run, and the
 expected output. All example configuration files referenced here live in the
 [`examples/`](https://github.com/AMD-AGI/Magpie/tree/main/examples) directory of
@@ -8,7 +15,7 @@ the Magpie repository.
 
 Run every command from the Magpie repository root unless noted otherwise.
 
-## Example 1: Analyze a simple HIP kernel
+## Analyze a simple HIP kernel
 
 This example analyzes a minimal HIP `vector_add` kernel for correctness using a
 testcase command.
@@ -55,7 +62,7 @@ Magpie reports a passing correctness state and writes a JSON report to
 with an overall `score` of `1.0` when correctness succeeds and profiling is
 skipped.
 
-## Example 2: Compare two kernel implementations
+## Compare two kernel implementations
 
 This example compares BF16 and FP16 grouped GEMM kernels from Composable Kernel
 and ranks them by performance.
@@ -108,7 +115,7 @@ Magpie evaluates both kernels, prints a ranked comparison against the baseline
 implementation. See [Analyze and compare kernels](../how-to/analyze-compare.md)
 for how scores and rankings are computed.
 
-## Example 3: Benchmark vLLM with TraceLens analysis
+## Benchmark vLLM with TraceLens analysis
 
 This example runs a framework-level benchmark of vLLM and analyzes the resulting
 traces.
@@ -139,10 +146,10 @@ traces.
 
 Magpie launches the benchmark, collects throughput and latency metrics, and (for
 the TraceLens config) produces a trace analysis report under the benchmark
-workspace in `./results`. See [Benchmark frameworks](../how-to/benchmark.md) for
+workspace in `./results`. See [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md) for
 the full result layout and metric descriptions.
 
-## Example 4: Standalone gap analysis on existing traces
+## Standalone gap analysis on existing traces
 
 If you already have torch profiler traces, you can run gap analysis without
 launching a benchmark to find the kernels that dominate runtime.
@@ -161,7 +168,7 @@ Magpie writes a `gap_analysis/gap_analysis.csv` file (plus optional per-rank
 CSVs) under the trace directory, listing the top bottleneck kernels by
 aggregated duration. Add `--find-kernel-sources` to also locate kernel source
 files and test commands for AMD kernels; see
-[Find kernel sources](../how-to/kernel-source-finder.md).
+[Find kernel sources with Magpie](../how-to/kernel-source-finder.md).
 
 ## More examples
 
diff --git a/docs/how-to/analyze-compare.md b/docs/how-to/analyze-compare.md
index 9d9d980..35151d3 100644
--- a/docs/how-to/analyze-compare.md
+++ b/docs/how-to/analyze-compare.md
@@ -1,40 +1,57 @@
-# Analyze vs Compare
+---
+myst:
+    html_meta:
+        "description": "Learn when to use Magpie's Analyze and Compare modes to evaluate GPU kernel correctness and rank implementations by performance on AMD and NVIDIA hardware."
+        "keywords": "Magpie, analyze, compare, GPU kernel, HIP, CUDA, PyTorch, Triton, correctness, performance, ranking"
+---
 
-Magpie’s **Analyze** and **Compare** modes both evaluate GPU kernels (HIP, CUDA, PyTorch, Triton) through the same underlying pipeline—compile (optional), correctness, and optional performance profiling—but they differ in how many kernels you evaluate and how a “winner” is chosen.
+# Analyze and compare kernels with Magpie
 
-## At a glance
+Magpie’s Analyze and Compare modes both evaluate GPU kernels—HIP, CUDA, PyTorch, and Triton—through the same underlying pipeline: optional compilation, correctness validation against a testcase, and optional performance profiling.
+
+Analyze targets a single kernel and produces a detailed per-stage evaluation report, making it the right choice when you need to confirm that one implementation is correct before promoting it.
+
+Compare targets two or more kernel variants, runs the same evaluation pipeline on each, and produces a ranked result with a declared winner, making it the right choice when you want to find the fastest correct implementation from a set of candidates.
+
+## Mode comparison
+
+This table summarizes the key differences between the two modes:
 
 | | **Analyze** | **Compare** |
 |---|-------------|-------------|
 | **Goal** | Validate one implementation end-to-end | Rank two or more implementations |
 | **Kernels** | One (or multiple independent runs from one config) | At least two |
-| **Testcase** | **Required** (CLI or YAML `testcase_command`) | Optional per kernel; if omitted, PyTorch can use **result comparison** between variants |
+| **Testcase** | **Required** (CLI or YAML `testcase_command`) | Optional per kernel; if omitted, PyTorch can use **result comparison** mode between variants |
 | **Outcome** | Per-kernel `EvaluationState` | `ComparisonResult`: correctness vector, perf scores, rankings, `winner` |
 | **CLI** | `magpie analyze …` | `magpie compare …` |
 | **Report file** | `analyze_report.json` | `compare_report.json` |
 
 For architecture and diagrams, see the [README](https://github.com/AMD-AGI/Magpie#readme) (Analyze & Compare pipeline image).
 
-## When to use which
+### When to use which
 
 - **Analyze** when you have a single kernel (or a small set you want to evaluate independently) and a clear test command (build + run test, or script that exits non-zero on failure).
-- **Compare** when you have multiple source variants (e.g. v1 vs v2 HIP, or several PyTorch implementations) and want Magpie to run them in sequence, check correctness, optionally profile each, and produce a **ranking** and **winner** using configured scoring rules.
+- **Compare** when you have multiple source variants (for example, v1 vs v2 HIP, or several PyTorch implementations) and want Magpie to run them in sequence, check correctness, optionally profile each, and produce a ranking and winner using configured scoring rules.
 
 ## Correctness behavior
 
+The two modes differ in how they validate kernel output.
+
 ### Analyze
 
-- `AnalyzeMode` **requires** `testcase_command` in the effective `KernelEvalConfig`. Without it, analysis stops with an error.
+- `AnalyzeMode` requires `testcase_command` in the effective `KernelEvalConfig`. Without it, analysis stops with an error.
 - Use this mode when your validation story is “run this command and trust exit status / Accordo backend output.”
 
 ### Compare
 
-- If a kernel has `testcase_command`, correctness uses the **testcase** path (same idea as analyze).
-- If **no** testcase is provided, compare can use **result comparison** mode for PyTorch-style workflows (outputs compared across variants).
-- You need **at least two** kernel entries (from CLI paths or YAML `kernels:` list).
+- If a kernel has `testcase_command`, correctness uses the `testcase` path (same idea as analyze).
+- If *no* testcase is provided, compare can use **result comparison** mode for PyTorch-style workflows (outputs compared across variants).
+- You need *at least two* kernel entries (from CLI paths or YAML `kernels:` list).
 
 ## CLI quick reference
 
+The following examples show the most common analyze and compare invocations:
+
 ```bash
 # Analyze: kernel file(s) + testcase (required without --kernel-config)
 magpie analyze path/to/kernel.hip -t "./run_test.sh"
@@ -92,11 +109,13 @@ Tune these when your comparison should emphasize different hardware metrics or w
 
 Analyze and compare runs create timestamped workspaces under `--output-dir` (default `./results`):
 
-- **Analyze:** `analyze_report.json` plus profiler output under `performance/` when profiling is enabled; config snapshot and correctness artifacts as configured.
-- **Compare:** `compare_report.json` with `kernel_results`, `comparison_metrics` (including `correctness`, `perf_scores`, `all_correct`), `rankings`, `winner`, and `summary`.
+- **Analyze**: `analyze_report.json` plus profiler output under `performance/` when profiling is enabled; config snapshot and correctness artifacts as configured.
+- **Compare**: `compare_report.json` with `kernel_results`, `comparison_metrics` (including `correctness`, `perf_scores`, `all_correct`), `rankings`, `winner`, and `summary`.
+
+## More info
 
-## Related documentation
+See the following pages for related topics.
 
-- [Benchmark mode](benchmark.md) — vLLM/SGLang framework benchmarks (separate from kernel analyze/compare).
-- [Skills install](mcp-and-skills.md) — using Magpie without MCP.
-- [Ray scheduling (EN)](ray.md) — remote execution when `scheduler.environment: ray`.
+- [Benchmark frameworks with Magpie](benchmarking/benchmark.md) — vLLM/SGLang framework benchmarks (separate from kernel analyze/compare).
+- [Run MCP server and agent skills with Magpie](mcp-and-skills.md) — using Magpie without MCP.
+- [Run Magpie on a Ray cluster](ray.md) — remote execution when `scheduler.environment: ray`.
diff --git a/docs/how-to/benchmark.md b/docs/how-to/benchmark.md
deleted file mode 100644
index 6f322e9..0000000
--- a/docs/how-to/benchmark.md
+++ /dev/null
@@ -1,642 +0,0 @@
-# Benchmark Mode
-
-Benchmark mode enables framework-level performance benchmarking for LLM inference engines (vLLM, SGLang, Atom) with integrated trace analysis capabilities. Atom support is single-node only in v1 (no Ray multi-node TP, no torch-profiler wiring).
-
-**Execution:** Benchmarks use `run_mode`: **`docker`** (default), **`local`** (host / in-pod, via YAML or `--run-mode local`), or **`ray`** (driver submits `RayJobExecutor`; a **GPU worker** runs the same InferenceX → vLLM/SGLang flow—see [Magpie + Ray](ray.md)). InferenceX is cloned automatically when `inferencex_path` is empty (see `Magpie/config.yaml` `benchmark.inferencex_path`).
-
-## Overview
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│                        Benchmark Mode                               │
-├─────────────────────────────────────────────────────────────────────┤
-│  ┌───────────────┐    ┌───────────────┐    ┌────────────────────┐   │
-│  │BenchmarkConfig│  → │ BenchmarkMode │ →  │  BenchmarkResult   │   │
-│  │  (YAML)       │    │               │    │  (JSON + CSV)      │   │
-│  └───────────────┘    └───────────────┘    └────────────────────┘   │
-│                               │                                     │
-│                               ▼                                     │
-│  ┌──────────────────────────────────────────────────────────────┐   │
-│  │  Runtime: docker │ local │ ray                                │   │
-│  │  ┌─────────────┐        ┌─────────────────────────────────┐  │   │
-│  │  │ InferenceX  │  →    │ vLLM / SGLang Server + Client   │  │   │
-│  │  │ scripts     │        │ + Torch Profiler                │  │   │
-│  │  └─────────────┘        └─────────────────────────────────┘  │   │
-│  │  Ray: Magpie driver → RayJobExecutor → GPU worker runs the   │   │
-│  │        same stack (local/docker on worker; NFS for cache/     │   │
-│  │        results). See ray.md                                   │   │
-│  └──────────────────────────────────────────────────────────────┘   │
-│                               │                                     │
-│                      ┌────────┴────────┐                            │
-│                      ▼                 ▼                            │
-│  ┌────────────────────────┐  ┌─────────────────────────────────┐   │
-│  │  Gap Analysis          │  │  TraceLens Analysis             │   │
-│  │  • Time window filter  │  │  • Perf report (per-rank)       │   │
-│  │  • Category filter     │  │  • Multi-rank collective report │   │
-│  │  • Kernel stats CSV    │  │                                 │   │
-│  └────────────────────────┘  └─────────────────────────────────┘   │
-└─────────────────────────────────────────────────────────────────────┘
-```
-
-## Quick Start
-
-```bash
-# Basic vLLM benchmark (paths are under examples/benchmarks/)
-python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_dsr1.yaml
-
-# vLLM with TraceLens analysis
-python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_tracelens.yaml
-
-# vLLM with gap analysis (kernel bottleneck report)
-python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_kimi_k2.yaml
-
-# Standalone gap analysis on existing traces
-python -m Magpie benchmark gap-analysis --trace-dir results/benchmark_vllm_<timestamp>/
-
-# SGLang benchmark
-python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_sglang_dsr1.yaml
-
-# Ad-hoc CLI without a YAML file (framework + model; optional torch profiler)
-python -m Magpie benchmark vllm --model deepseek-ai/DeepSeek-R1-0528 --torch-profiler
-```
-
-## Configuration
-
-### Minimal Example
-
-```yaml
-benchmark:
-  framework: vllm              # "vllm", "sglang", or "atom"
-  model: deepseek-ai/DeepSeek-R1-0528
-  precision: fp8               # "fp8", "fp16", "bf16"
-  
-  envs:
-    TP: 8                      # Tensor parallelism
-    CONC: 32                   # Concurrency (num_prompts = CONC * 10)
-    ISL: 1024                  # Input sequence length
-    OSL: 1024                  # Output sequence length
-    
-  profiler:
-    torch_profiler:
-      enabled: true            # Generate torch profiling traces
-      
-  timeout_seconds: 3600
-```
-
-### Full Configuration Reference
-
-```yaml
-benchmark:
-  # Framework selection
-  framework: vllm              # Required: "vllm", "sglang", or "atom"
-  model: <model_name>          # Required: HuggingFace model name/path
-  precision: fp8               # Optional: "fp8" (default), "fp16", "bf16"
-  
-  # Benchmark parameters
-  envs:
-    TP: 8                      # Tensor parallelism (GPU count)
-    CONC: 32                   # Request concurrency
-    ISL: 1024                  # Input sequence length
-    OSL: 1024                  # Output sequence length
-    RANDOM_RANGE_RATIO: 1      # Length randomization (0-1)
-    MAX_MODEL_LEN: 131072      # Max model context length
-    GPU_MEM_UTIL: 0.95         # GPU memory utilization (0-1)
-    ENABLE_PROFILE: "true"     # Enable profiling in benchmark script
-    
-  # Profiler configuration
-  profiler:
-    # PyTorch profiler (generates JSON traces)
-    torch_profiler:
-      enabled: true            # Sets VLLM_TORCH_PROFILER_DIR
-      
-    # System profiler (rocprof-compute / ncu)
-    system_profiler:
-      enabled: false
-      profile_args: []         # Additional profiler arguments
-      
-    # TraceLens trace analysis
-    tracelens:
-      enabled: true                 # Enable TraceLens analysis
-      analysis_mode: inference      # Optional, default: inference
-      analysis_stages: all          # Optional, default: all
-      auto_patch_runtime: true      # Optional, default: true for Docker runs
-      tracelens_repo_path: null     # Optional public TraceLens source checkout
-      cli_timeout_seconds: 2400     # TraceLens postprocess timeout per command
-      export_format: csv            # "csv" or "excel"
-      perf_report_enabled: true           # Single-rank performance report
-      multi_rank_report_enabled: true     # Multi-rank collective report
-      gpu_arch_config: null         # Optional: GPU arch config for roofline
-
-  # Gap analysis (kernel bottleneck report)
-  gap_analysis:
-    enabled: true              # Enable gap analysis after benchmark
-    trace_start_pct: 50        # Start of analysis window (0-100)
-    trace_end_pct: 80          # End of analysis window (0-100)
-    top_k: 20                  # Number of top kernels in report
-    min_duration_us: 0.0       # Filter out events shorter than this (us)
-    categories:                # Event category whitelist (default: [kernel, gpu])
-      - kernel
-      - gpu
-    ignore_categories:         # Event category blacklist (default: [gpu_user_annotation])
-      - gpu_user_annotation
-      
-  # Auto-pick idle GPU(s) before launching (enabled by default).
-  # See "Automatic GPU Selection" below for details.
-  gpu_selection:
-    auto: true                 # Default: true. Set false to disable.
-    min_free_memory_gb: 8.0    # Reject GPUs with less free VRAM
-    count: null                # Number of GPUs; null -> use envs.TP
-    candidates: null           # Optional whitelist of physical GPU ids
-
-  # Execution settings
-  run_mode: docker             # "docker" (default) or "local" (host / in-container)
-  docker_image: null           # Optional: override auto-selected image
-  gpu_arch: null               # Optional: force GPU architecture
-  timeout_seconds: 3600        # Benchmark timeout
-  
-  # Paths
-  inferencex_path: /path/to/InferenceX  # InferenceX installation
-  hf_cache_path: null          # HuggingFace cache directory
-  
-  # InferenceX specific
-  runner_type: mi300x          # Hardware runner type
-  benchmark_script: null       # Override benchmark script
-```
-
-## Environment Variables
-
-| Variable | Description | Default |
-|----------|-------------|---------|
-| `TP` | Tensor parallelism (number of GPUs) | 1 |
-| `CONC` | Request concurrency | 32 |
-| `ISL` | Input sequence length | 1024 |
-| `OSL` | Output sequence length | 512 |
-| `RANDOM_RANGE_RATIO` | Length randomization ratio | 0.5 |
-| `MAX_MODEL_LEN` | Maximum model context length | - |
-| `GPU_MEM_UTIL` | GPU memory utilization | 0.95 |
-| `ENABLE_PROFILE` | Enable torch profiler | "false" |
-| `EXTRA_VLLM_ARGS` | Additional arguments passed to `vllm serve` | "" |
-
-## Automatic GPU Selection
-
-Before launching the benchmark, Magpie scans the host (`rocm-smi` / `nvidia-smi`),
-picks the least-busy GPU(s) with enough free VRAM, and pins the run via
-vendor-specific environment variables:
-
-- **AMD**: `ROCR_VISIBLE_DEVICES=<ids>` (the launcher script remaps
-  `HIP_VISIBLE_DEVICES` to the post-filter logical range `0..N-1`).
-- **NVIDIA**: `CUDA_VISIBLE_DEVICES=<ids>` + `CUDA_DEVICE_ORDER=PCI_BUS_ID`.
-
-GPU ids use the same index space as `rocm-smi` / `nvidia-smi`. By default the
-selector asks for `envs.TP` idle GPUs; override with `gpu_selection.count`.
-
-**Config knobs** (all optional; `gpu_selection` block is enabled by default):
-
-| Field | Default | Description |
-|-------|---------|-------------|
-| `auto` | `true` | Set `false` to disable and let the framework see every GPU |
-| `min_free_memory_gb` | `8.0` | Reject GPUs with less free VRAM than this |
-| `count` | `null` | Number of GPUs to pin; `null` → `envs.TP` |
-| `candidates` | `null` | Optional whitelist of physical GPU ids to consider |
-
-**Manual override**: setting `HIP_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` /
-`ROCR_VISIBLE_DEVICES` in `envs:` pins to specific cards and skips auto-selection:
-
-```yaml
-  envs:
-    TP: 1
-    ROCR_VISIBLE_DEVICES: "3"    # AMD: pin to rocm-smi GPU[3]
-    # CUDA_VISIBLE_DEVICES: "3"  # NVIDIA alternative
-    # CUDA_DEVICE_ORDER: PCI_BUS_ID
-```
-
-**Ray mode** (`run_mode: ray`): `gpu_selection` is ignored — Ray schedules
-devices itself via `num_gpus`. To restrict the cluster to specific cards,
-export `ROCR_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` in the shell before
-starting `ray start`.
-
-**Local server lifecycle reuse** (`server_lifecycle.enabled: true`): When the run
-would attach to an existing healthy server matching reuse metadata (`force_reuse`
-skips mismatch checks), **`gpu_selection.auto` is skipped** entirely for that invocation
-so pinned devices do not churn between chain runs — see **Persistent server reuse** below.
-
-## Persistent server reuse (local)
-
-Use `server_lifecycle.enabled: true` with **`run_mode: local`** to keep one
-detached inference server alive across successive `python -m Magpie benchmark`
-runs on the **same PORT**.
-
-- **`timeout_seconds`**: applies to the **client** subprocess (`benchmark_serving.py`)
-  only — it does not stop the shared HTTP server afterward.
-- **`server_lifecycle.cleanup`**: Magpie terminates the persisted process group
-  (writes `SIGTERM`, then kills stragglers) **only when** `cleanup: true`; it also
-  removes the associated `*.pid` / `*.json` artifacts under ``~/.cache/magpie/server/``
-  (or `server_lifecycle.pid_dir`).
-- **Compatibility gate**: reuse checks JSON metadata versus `MODEL`, `TP`,
-  `EXTRA_VLLM_ARGS`, `EXTRA_SGLANG_ARGS`, `MAX_MODEL_LEN`, InferenceX resolved path,
-  framework, and `PORT`. Set `force_reuse: true` to bypass the mismatch errors.
-- **Scripts**: Requires **Magpie built-in InferenceX wrappers** that implement
-  `MAGPIE_RUN_PHASE=server|client` (e.g. `vllm_mi355x.sh`). Native InferenceX
-  `gptoss_*` / `dsr1_*` scripts reject this flag path by design until they are
-  updated upstream — point `benchmark_script` at one of the Magpie `*.sh` files.
-- **Profiling**: Torch profiler + `cleanup: false` is rejected (profiler state
-  is tied to surviving workers). Configure `profiler.torch_profiler.enabled: false`
-  for warmed servers, or set `cleanup: true`.
-- **GPU pins vs reuse**: Before each run Magpie probes `http://127.0.0.1:$PORT/health`
-  and compares reuse metadata against the chosen config (`force_reuse: true`
-  skips the comparison). If that probe says the **existing** server should be reused
-  (eligible client-only path), **`gpu_selection.auto` is skipped** (`find_idle_gpus`
-  is not run), so visible-device env vars are **not** re-randomised relative to
-  the server's physical GPUs on the reuse chain. When the probe fails (cold start /
-  stale server after crash), idle-GPU selection runs as usual and Magpie launches a
-  new server phase. For `profiler.gpu_monitor` while reusing without auto-selection,
-  pin GPUs in `envs` or set `gpu_monitor.device_id` if you care which card is sampled.
-
-Example YAML: **`examples/benchmarks/benchmark_vllm_reuse.yaml`**.
-
-## Profiling Options
-
-### Torch Profiler
-
-When `torch_profiler.enabled: true`:
-- Sets `VLLM_TORCH_PROFILER_DIR` automatically
-- Generates JSON trace files for each GPU rank
-- Traces saved to: `results/benchmark_<framework>_<timestamp>/torch_trace/`
-
-### TraceLens Analysis
-
-TraceLens provides automated analysis of torch profiler traces:
-
-| Command | Description | Output |
-|---------|-------------|--------|
-| `TraceLens_split_inference_trace` | Split vLLM/SGLang inference traces into phase windows | `torch_trace/trace_split/` |
-| `TraceLens_generate_perf_report_pytorch_inference` | Inference-aware prefill/decode reports | `tracelens/` |
-| `TraceLens_generate_perf_report_pytorch` | Single-rank performance report | `tracelens_rank0_csvs/` |
-| `TraceLens_generate_multi_rank_collective_report_pytorch` | Multi-rank collective analysis | `tracelens_collective_csvs/` |
-
-`analysis_mode` defaults to `inference`, which is the recommended mode for
-vLLM/SGLang benchmarks. It automatically enables the torch profiler, patches the
-needed InferenceX profiling helpers for the run, splits the rank-0 trace, and
-runs reports for all inference stages. Use `analysis_mode: pytorch` to keep the
-legacy direct PyTorch report flow.
-
-For Docker benchmarks, `auto_patch_runtime` defaults to `true`. When TraceLens
-inference mode is enabled and the selected runtime image is not already
-TraceLens-ready, Magpie builds a derived image from supported official
-vLLM/SGLang tags using the public TraceLens workflow scripts. The derived image
-is tagged locally as `magpie-tracelens-<framework>:...` and reused on later runs.
-Set `profiler.tracelens.tracelens_repo_path` or `TRACELENS_REPO_PATH` to a public
-TraceLens source checkout if Magpie cannot auto-locate it.
-
-For `run_mode: docker`, TraceLens inference post-processing also runs inside the
-resolved runtime image after the benchmark container exits. The post-processing
-container is CPU-only, mounts the benchmark workspace at `/workspace`, and writes
-CSV outputs under `tracelens/`. Host Python only needs Docker; it does not need
-the TraceLens CLI on `PATH`.
-
-`analysis_stages` defaults to `all`:
-
-```yaml
-profiler:
-  tracelens:
-    enabled: true
-    analysis_stages: all
-```
-
-To run only selected stages:
-
-```yaml
-profiler:
-  tracelens:
-    enabled: true
-    analysis_stages: [prefill, decode]
-```
-
-Supported stage names are `prefilldecode` (alias: `mixed`), `prefill`, and
-`decode`. GPU architecture is detected through Magpie's existing runner/GPU
-mapping and passed to TraceLens as `--gpu_arch_platform`.
-
-For SGLang, TraceLens inference mode automatically adds
-`--enable-profile-cuda-graph`. It also adds
-`--enable-shape-discovery-for-cuda-graph-profile` when the configured Docker
-image name looks like a TraceLens-patched SGLang image, such as
-`tracelens-sglang:*` or `magpie-tracelens-sglang:*`. For local runs, Magpie also
-detects whether the installed SGLang exposes the patched server argument. For
-other SGLang builds, keep patched-runtime-only flags explicit in
-`EXTRA_SGLANG_ARGS`.
-
-Each TraceLens inference postprocess command uses `cli_timeout_seconds`, which
-defaults to `1800`. Increase it for long-output runs where splitting the full
-decode trace can take longer:
-
-```yaml
-profiler:
-  tracelens:
-    enabled: true
-    cli_timeout_seconds: 2400
-```
-
-To enable an internal TraceLens extension, set `TL_EXTENSION` either in the
-shell environment or under benchmark envs. Magpie does not interpret the value;
-it only passes the variable through to the benchmark and TraceLens
-post-processing commands:
-
-```yaml
-benchmark:
-  envs:
-    TL_EXTENSION: "TraceLens_NDA"
-```
-
-#### TraceLens Output Files
-
-**Inference reports (`tracelens/`):**
-- `prefilldecode/` - Mixed prefill+decode phase report
-- `decode_only/` - Pure decode phase report
-- `prefill_only/` - Pure prefill phase report
-
-**Single-rank report (`tracelens_rank0_csvs/`):**
-- `gpu_timeline.csv` - GPU kernel timeline
-- `ops_summary.csv` - Operation summary
-- `ops_summary_by_category.csv` - Operations by category
-- `coll_analysis.csv` - Collective communication analysis
-- `kernel_summary.csv` - Kernel summary statistics
-
-**Multi-rank collective report (`tracelens_collective_csvs/`):**
-- Aggregated statistics across all GPU ranks
-- Communication pattern analysis
-- Load balancing metrics
-
-### Gap Analysis
-
-Gap analysis identifies GPU kernel bottlenecks from torch profiler traces. It applies a configurable time window to focus on the steady-state portion of the trace, then aggregates kernel durations by category.
-
-**Pipeline:**
-1. Apply time window (`trace_start_pct` – `trace_end_pct`) to isolate steady-state events
-2. Filter by category (case-insensitive substring matching on the event `cat` field)
-3. Aggregate stats per kernel name, rank by total duration
-
-**CSV output columns:** `Name, Calls, Self CUDA total (us), Avg time (us), % Total`
-
-**Defaults (no YAML needed):**
-- `categories`: `["kernel", "gpu"]`
-- `ignore_categories`: `["gpu_user_annotation"]`
-
-**Minimal config:**
-```yaml
-  gap_analysis:
-    enabled: true
-    trace_start_pct: 50
-    trace_end_pct: 80
-```
-
-#### Standalone CLI
-
-Run gap analysis on existing trace directories without re-running the benchmark:
-
-```bash
-# Basic usage (CLI defaults: --start-pct 0 --end-pct 100 unless you override)
-python -m Magpie benchmark gap-analysis \
-    --trace-dir results/benchmark_vllm_<timestamp>/
-
-# With custom window and categories (align with YAML gap_analysis window if desired)
-python -m Magpie benchmark gap-analysis \
-    --trace-dir results/benchmark_vllm_<timestamp>/torch_trace \
-    --start-pct 50 --end-pct 80 \
-    --top-k 15 \
-    --categories kernel gpu \
-    --ignore-categories gpu_user_annotation
-```
-
-The `--trace-dir` argument accepts either a benchmark workspace directory (auto-detects `torch_trace/` inside) or a direct path to the trace directory.
-
-Output is written to a `gap_analysis/` subfolder under the trace directory's parent.
-
-## Output Structure
-
-```
-results/benchmark_vllm_<timestamp>/
-├── benchmark_report.json      # Main benchmark results
-├── summary.txt                # Human-readable summary
-├── config.yaml                # Snapshot of benchmark configuration
-├── container_stdout.log       # Container stdout
-├── container_stderr.log       # Container stderr
-├── inferencex_result.json   # Raw InferenceX output
-├── torch_trace/               # Raw torch profiler traces
-│   ├── *-rank-0.*.pt.trace.json.gz
-│   ├── *-rank-1.*.pt.trace.json.gz
-│   └── ...
-├── gap_analysis/              # Gap analysis output (if enabled)
-│   ├── gap_analysis.csv       # Merged kernel stats across all ranks
-│   ├── gap_analysis_rank0.csv # Per-rank kernel stats
-│   ├── gap_analysis_rank1.csv
-│   └── ...
-├── tracelens_rank0_csvs/      # Single-rank TraceLens analysis
-│   ├── gpu_timeline.csv
-│   ├── ops_summary.csv
-│   └── ...
-└── tracelens_collective_csvs/ # Multi-rank TraceLens analysis
-    └── ...
-```
-
-## Benchmark report
-
-The primary summary file is **`benchmark_report.json`** in the run workspace (see `WorkspaceManager.save_report`). It aggregates throughput, latency, and optional `gap_analysis` / `tracelens_analysis` sections. A typical shape (abbreviated, with `...` marking elided values):
-
-```text
-{
-  "success": true,
-  "framework": "vllm",
-  "model": "amd/Kimi-K2-Thinking-MXFP4",
-  "throughput": {
-    "request_throughput": 0.16,
-    "output_throughput": 1.13,
-    "total_token_throughput": 1192.76,
-    "completed_requests": 40
-  },
-  "latency": {
-    "ttft": { "mean_ms": 1185.44, "p99_ms": 1969.59 },
-    "tpot": { "mean_ms": 131.09, "p99_ms": 282.21 }
-  },
-  "gap_analysis": {
-    "config": { "trace_start_pct": 50, "trace_end_pct": 80, "categories": ["kernel", "gpu"] },
-    "csv_path": "results/.../gap_analysis/gap_analysis.csv",
-    "top_kernels": [
-      { "name": "rcclGenericKernel<...>", "calls": 19620, "self_cuda_total_us": 28999961.95, "pct_total": 44.0 },
-      { "name": "kernel_moe_mxgemm_2lds<...>", "calls": 9360, "self_cuda_total_us": 12495324.68, "pct_total": 18.9 }
-    ]
-  },
-  "tracelens_analysis": { "output_files": [...] }
-}
-```
-
-## Examples
-
-### Quick Profiling Run
-
-Minimal configuration for fast trace collection:
-
-```yaml
-benchmark:
-  framework: vllm
-  model: deepseek-ai/DeepSeek-R1-0528
-  precision: fp8
-  
-  envs:
-    TP: 8
-    CONC: 4                    # Small concurrency for quick run
-    ISL: 128
-    OSL: 64
-    GPU_MEM_UTIL: 0.85
-    
-  profiler:
-    torch_profiler:
-      enabled: true
-    tracelens:
-      enabled: true
-      # analysis_mode defaults to inference
-      # analysis_stages defaults to all (prefilldecode, decode, prefill)
-      # auto_patch_runtime defaults to true for Docker runs
-      # tracelens_repo_path can point to a public TraceLens checkout
-      # cli_timeout_seconds defaults to 1800
-      export_format: csv
-      multi_rank_report_enabled: false  # Skip multi-rank for speed
-      
-  timeout_seconds: 1200
-```
-
-### Full Production Benchmark
-
-```yaml
-benchmark:
-  framework: vllm
-  model: deepseek-ai/DeepSeek-R1-0528
-  precision: fp8
-  
-  envs:
-    TP: 8
-    CONC: 64
-    ISL: 2048
-    OSL: 2048
-    MAX_MODEL_LEN: 131072
-    
-  profiler:
-    torch_profiler:
-      enabled: true
-    tracelens:
-      enabled: true
-      analysis_mode: inference
-      analysis_stages: all
-      auto_patch_runtime: true
-      # tracelens_repo_path: /path/to/TraceLens
-      cli_timeout_seconds: 2400
-      export_format: csv
-      perf_report_enabled: true
-      multi_rank_report_enabled: true
-      
-  timeout_seconds: 7200
-```
-
-### SGLang Benchmark
-
-```yaml
-benchmark:
-  framework: sglang
-  model: meta-llama/Llama-3.1-70B-Instruct
-  precision: fp16
-  
-  envs:
-    TP: 4
-    CONC: 32
-    ISL: 1024
-    OSL: 512
-    
-  profiler:
-    torch_profiler:
-      enabled: true
-      
-  timeout_seconds: 3600
-```
-
-## Troubleshooting
-
-### Common Issues
-
-**1. GPU Memory Error**
-```
-ValueError: Free memory on device (...) is less than desired GPU memory utilization
-```
-Solution: Reduce `GPU_MEM_UTIL` in config (e.g., 0.85)
-
-**2. Docker Permission Error**
-```
-docker: permission denied
-```
-Solution: Add user to docker group or run with sudo
-
-**3. TraceLens Not Found**
-```
-Required TraceLens inference CLI command(s) not found on PATH
-```
-Solution: This applies to `run_mode: local` or classic host post-processing.
-TraceLens will be auto-installed. If issues persist:
-```bash
-pip install git+https://github.com/AMD-AIG-AIMA/TraceLens.git
-```
-
-If `TL_EXTENSION=TraceLens_NDA` is set, install the matching internal extension
-package wherever the TraceLens runtime/post-processing commands need it. For
-`run_mode: docker`, those commands are resolved from the runtime image.
-
-**4. Timeout During Model Loading**
-
-Large models (e.g., DeepSeek-R1) may need longer timeouts:
-```yaml
-timeout_seconds: 7200  # 2 hours
-```
-
-**5. `gpu_selection.auto failed: ...`**
-
-Not enough idle GPUs on the host. Either free a GPU, lower
-`gpu_selection.min_free_memory_gb`, narrow `gpu_selection.candidates`, or pin
-manually via `envs.ROCR_VISIBLE_DEVICES` (AMD) / `envs.CUDA_VISIBLE_DEVICES`
-(NVIDIA). See [Automatic GPU Selection](#automatic-gpu-selection).
-
-### Debug Mode
-
-Enable verbose logging:
-```bash
-python -m Magpie benchmark --benchmark-config config.yaml --log-level DEBUG
-```
-
-## Architecture
-
-### Components
-
-| Component | File | Description |
-|-----------|------|-------------|
-| `BenchmarkMode` | `benchmarker.py` | Main orchestrator |
-| `BenchmarkConfig` | `config.py` | Configuration dataclasses |
-| `TraceLensAnalyzer` | `tracelens.py` | TraceLens CLI integration |
-| `GapAnalyzer` | `gap_analysis.py` | Kernel bottleneck analysis |
-| `BenchmarkResult` | `result.py` | Result data structures |
-
-### Execution Flow
-
-1. **Configuration Loading**: Parse YAML config into `BenchmarkConfig`
-2. **Runtime Setup**: For `run_mode: docker`, prepare a container with InferenceX; for `local`, use the host environment
-3. **Server Launch**: Start vLLM/SGLang server (in container or on host per `run_mode`)
-4. **Client Execution**: Run benchmark client with profiling enabled
-5. **Trace Collection**: Torch profiler traces saved to workspace
-6. **TraceLens Analysis**: Run TraceLens CLI commands inside the runtime image
-   for Docker inference mode, or on host for local/classic mode (if enabled)
-7. **Gap Analysis**: Analyze kernel bottlenecks within time window (if enabled)
-8. **Result Generation**: Aggregate metrics and generate reports
-
-## Related
-
-- [Analyze vs Compare](analyze-compare.md) — kernel evaluation modes (orthogonal to Benchmark)
-- [TraceLens](https://github.com/AMD-AIG-AIMA/TraceLens) — Trace analysis library
-- [InferenceX](https://github.com/SemiAnalysisAI/InferenceX) — Benchmark scripts (auto-clone target in default config)
-- [vLLM](https://github.com/vllm-project/vllm) — LLM inference engine
-- [SGLang](https://github.com/sgl-project/sglang) — LLM serving framework
-- [Ray + Magpie](ray.md) — optional remote benchmark scheduling
diff --git a/docs/how-to/benchmarking/automatic-gpu.md b/docs/how-to/benchmarking/automatic-gpu.md
new file mode 100644
index 0000000..4d30b37
--- /dev/null
+++ b/docs/how-to/benchmarking/automatic-gpu.md
@@ -0,0 +1,62 @@
+---
+myst:
+    html_meta:
+        "description": "Learn how Magpie automatically selects idle GPUs before launching a benchmark, and how to override GPU selection manually or disable it for Ray and server-reuse runs."
+        "keywords": "Magpie, GPU selection, ROCR_VISIBLE_DEVICES, CUDA_VISIBLE_DEVICES, benchmark, ROCm, AMD Instinct, idle GPU, gpu_selection"
+---
+
+# Automatic GPU selection in Magpie's benchmark mode
+
+Before launching the benchmark, Magpie scans the host (`rocm-smi` / `nvidia-smi`),
+picks the least-busy GPU(s) with enough free VRAM, and pins the run via
+vendor-specific environment variables.
+
+```{note}
+For a full overview of benchmark mode and how GPU selection fits into the broader pipeline, see [Benchmark frameworks with Magpie](benchmark.md).
+```
+
+- **AMD**: `ROCR_VISIBLE_DEVICES=<ids>` (the launcher script remaps
+  `HIP_VISIBLE_DEVICES` to the post-filter logical range `0..N-1`).
+- **NVIDIA**: `CUDA_VISIBLE_DEVICES=<ids>` + `CUDA_DEVICE_ORDER=PCI_BUS_ID`.
+
+GPU IDs use the same index space as `rocm-smi` / `nvidia-smi`. By default the
+selector asks for `envs.TP` idle GPUs; override with `gpu_selection.count`.
+
+## Configuration
+
+All `gpu_selection` fields are optional; the block is enabled by default.
+
+| Field | Default | Description |
+|-------|---------|-------------|
+| `auto` | `true` | Set `false` to disable and let the framework see every GPU |
+| `min_free_memory_gb` | `8.0` | Reject GPUs with less free VRAM than this |
+| `count` | `null` | Number of GPUs to pin; `null` → `envs.TP` |
+| `candidates` | `null` | Optional allowlist of physical GPU IDs to consider |
+
+## Manual override
+
+Setting `HIP_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` /
+`ROCR_VISIBLE_DEVICES` in `envs:` pins the run to specific cards and skips auto-selection:
+
+```yaml
+  envs:
+    TP: 1
+    ROCR_VISIBLE_DEVICES: "3"    # AMD: pin to rocm-smi GPU[3]
+    # CUDA_VISIBLE_DEVICES: "3"  # NVIDIA alternative
+    # CUDA_DEVICE_ORDER: PCI_BUS_ID
+```
+
+## Ray mode
+
+In Ray mode (`run_mode: ray`), `gpu_selection` is ignored — Ray schedules
+devices itself via `num_gpus`. To restrict the cluster to specific cards,
+export `ROCR_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` in the shell before
+running `ray start`.
+
+## Server lifecycle reuse
+
+When `server_lifecycle.enabled: true` and the run attaches to an existing
+healthy server matching reuse metadata (`force_reuse` skips mismatch checks),
+`gpu_selection.auto` is skipped entirely for that invocation so pinned devices
+do not churn between chain runs. See
+[Persistent server reuse](persistent-server-reuse.md) for details.
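Reviewer note on `automatic-gpu.md`: the selection rules this page describes (restrict to `candidates`, reject GPUs below `min_free_memory_gb`, take the `count` least-busy cards, then export the vendor-specific variables) can be sketched as follows. This is only an illustration under stated assumptions, not Magpie's implementation: `GpuInfo`, `pick_idle_gpus`, and `visible_device_env` are hypothetical names, "least busy" is assumed to mean lowest utilization with free memory as the tie-breaker, and the GPU stats are passed in rather than read from `rocm-smi` / `nvidia-smi`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GpuInfo:
    """Hypothetical per-GPU snapshot; Magpie's real data model may differ."""
    index: int               # physical id, same index space as rocm-smi / nvidia-smi
    free_memory_gb: float
    utilization_pct: float


def pick_idle_gpus(
    gpus: List[GpuInfo],
    count: int,
    min_free_memory_gb: float = 8.0,
    candidates: Optional[List[int]] = None,
) -> List[int]:
    """Return `count` physical GPU ids, preferring the least-busy eligible cards."""
    eligible = [
        g for g in gpus
        if (candidates is None or g.index in candidates)
        and g.free_memory_gb >= min_free_memory_gb
    ]
    if len(eligible) < count:
        raise RuntimeError(
            f"need {count} idle GPU(s), only {len(eligible)} eligible"
        )
    # Assumption: rank by utilization first, then by most free memory.
    eligible.sort(key=lambda g: (g.utilization_pct, -g.free_memory_gb))
    return sorted(g.index for g in eligible[:count])


def visible_device_env(vendor: str, ids: List[int]) -> Dict[str, str]:
    """Build the vendor-specific pinning variables named in the page."""
    joined = ",".join(str(i) for i in ids)
    if vendor == "amd":
        return {"ROCR_VISIBLE_DEVICES": joined}
    if vendor == "nvidia":
        return {"CUDA_VISIBLE_DEVICES": joined, "CUDA_DEVICE_ORDER": "PCI_BUS_ID"}
    raise ValueError(f"unknown vendor: {vendor}")
```

With `count` left as `null` in the YAML, the caller would pass `envs.TP` as `count`, which matches the default described in the configuration table.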
diff --git a/docs/how-to/benchmarking/benchmark.md b/docs/how-to/benchmarking/benchmark.md
new file mode 100644
index 0000000..17bdbb4
--- /dev/null
+++ b/docs/how-to/benchmarking/benchmark.md
@@ -0,0 +1,126 @@
+---
+myst:
+    html_meta:
+        "description": "Run framework-level benchmarks for vLLM, SGLang, and Atom with Magpie. Covers Docker and local run modes, TraceLens analysis, gap analysis, and GPU selection."
+        "keywords": "Magpie, benchmark, vLLM, SGLang, Atom, TraceLens, gap analysis, ROCm, AMD Instinct, GPU, LLM inference"
+---
+
+# Benchmark frameworks with Magpie
+
+Magpie's benchmark mode runs end-to-end performance tests against LLM inference frameworks—vLLM, SGLang, and Atom—and collects throughput and latency metrics in a structured JSON report. Benchmarks can run inside a Docker container, directly on the host, or on a remote Ray cluster, and optionally capture torch profiler traces for deeper analysis with TraceLens and gap analysis. Use this mode to measure inference performance on AMD Instinct™ GPUs and identify the GPU kernels that dominate runtime.
+
+Review these topics for more information:
+
+- [Magpie benchmarking mode architecture](../../conceptual/benchmarking-architecture.md) — how the benchmark pipeline is designed and how the components interact
+- [Magpie benchmark mode configuration](../../reference/benchmark-config.md) — full YAML schema with all available options and defaults
+- [Magpie troubleshooting](../../reference/troubleshooting.md) — solutions for common benchmark errors
+
+```{toctree}
+:maxdepth: 1
+:hidden:
+
+automatic-gpu
+persistent-server-reuse
+profiling-options
+```
+
+## Quick start
+
+The following commands cover the most common benchmark invocations.
+
+```bash
+# Basic vLLM benchmark (paths are under examples/benchmarks/)
+python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_dsr1.yaml
+
+# vLLM with TraceLens analysis
+python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_tracelens.yaml
+
+# vLLM with gap analysis (kernel bottleneck report)
+python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_kimi_k2.yaml
+
+# Standalone gap analysis on existing traces
+python -m Magpie benchmark gap-analysis --trace-dir results/benchmark_vllm_<timestamp>/
+
+# SGLang benchmark
+python -m Magpie benchmark --benchmark-config examples/benchmarks/benchmark_sglang_dsr1.yaml
+
+# Ad-hoc CLI without a YAML file (framework + model; optional torch profiler)
+python -m Magpie benchmark vllm --model deepseek-ai/DeepSeek-R1-0528 --torch-profiler
+```
+
+## Output structure
+
+A successful benchmark run creates a timestamped workspace directory with the following layout.
+
+```
+results/benchmark_vllm_<timestamp>/
+├── benchmark_report.json      # Main benchmark results
+├── summary.txt                # Human-readable summary
+├── config.yaml                # Snapshot of benchmark configuration
+├── container_stdout.log       # Container stdout
+├── container_stderr.log       # Container stderr
+├── inferencex_result.json     # Raw InferenceX output
+├── torch_trace/               # Raw torch profiler traces
+│   ├── *-rank-0.*.pt.trace.json.gz
+│   ├── *-rank-1.*.pt.trace.json.gz
+│   └── ...
+├── gap_analysis/              # Gap analysis output (if enabled)
+│   ├── gap_analysis.csv       # Merged kernel stats across all ranks
+│   ├── gap_analysis_rank0.csv # Per-rank kernel stats
+│   ├── gap_analysis_rank1.csv
+│   └── ...
+├── tracelens_rank0_csvs/      # Single-rank TraceLens analysis
+│   ├── gpu_timeline.csv
+│   ├── ops_summary.csv
+│   └── ...
+└── tracelens_collective_csvs/ # Multi-rank TraceLens analysis
+    └── ...
+```
+
+## Benchmark report
+
+The primary summary file is **`benchmark_report.json`** in the run workspace (see `WorkspaceManager.save_report`). It aggregates throughput, latency, and optional `gap_analysis` / `tracelens_analysis` sections. A typical shape (abbreviated, with `...` marking elided values):
+
+```text
+{
+  "success": true,
+  "framework": "vllm",
+  "model": "amd/Kimi-K2-Thinking-MXFP4",
+  "throughput": {
+    "request_throughput": 0.16,
+    "output_throughput": 1.13,
+    "total_token_throughput": 1192.76,
+    "completed_requests": 40
+  },
+  "latency": {
+    "ttft": { "mean_ms": 1185.44, "p99_ms": 1969.59 },
+    "tpot": { "mean_ms": 131.09, "p99_ms": 282.21 }
+  },
+  "gap_analysis": {
+    "config": { "trace_start_pct": 50, "trace_end_pct": 80, "categories": ["kernel", "gpu"] },
+    "csv_path": "results/.../gap_analysis/gap_analysis.csv",
+    "top_kernels": [
+      { "name": "rcclGenericKernel<...>", "calls": 19620, "self_cuda_total_us": 28999961.95, "pct_total": 44.0 },
+      { "name": "kernel_moe_mxgemm_2lds<...>", "calls": 9360, "self_cuda_total_us": 12495324.68, "pct_total": 18.9 }
+    ]
+  },
+  "tracelens_analysis": { "output_files": [...] }
+}
+```
+
+## More info
+
+See the following pages for related concepts, configuration, and reference material.
+
+- [Automatic GPU selection in Magpie's benchmark mode](automatic-gpu.md) — how Magpie picks idle GPUs before launching and how to override or disable selection
+- [Persistent server reuse (local) in Magpie's benchmark mode](persistent-server-reuse.md) — keep a server alive across runs to avoid model reload overhead
+- [Profiling options in Magpie's benchmark mode](profiling-options.md) — configure torch profiler, TraceLens, and gap analysis
+- [Analyze and compare kernels with Magpie](../analyze-compare.md) — kernel evaluation modes (orthogonal to Benchmark)
+- [Run Magpie on a Ray cluster](../ray.md) — optional remote benchmark scheduling
+
+## Related resources
+
+- [TraceLens](https://github.com/AMD-AIG-AIMA/TraceLens) — Trace analysis library
+- [InferenceX](https://github.com/SemiAnalysisAI/InferenceX) — Benchmark scripts (auto-clone target in default config)
+- [vLLM](https://github.com/vllm-project/vllm) — LLM inference engine
+- [SGLang](https://github.com/sgl-project/sglang) — LLM serving framework
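Reviewer note on `benchmark.md`: the abbreviated `benchmark_report.json` shape shown in this page could be consumed with a few lines of Python, which may be a useful companion to the JSON listing. The sketch below relies only on field names that appear in that abbreviated example; `load_report` and `summarize_report` are hypothetical helpers, not part of Magpie, and a real report may carry additional or differently shaped fields.

```python
import json
from typing import Any, Dict, List


def load_report(path: str) -> Dict[str, Any]:
    """Read a benchmark_report.json file from a run workspace."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def summarize_report(report: Dict[str, Any], top_n: int = 3) -> List[str]:
    """Flatten the headline metrics into printable lines."""
    lines = [
        f"{report.get('framework')} / {report.get('model')}: "
        f"success={report.get('success')}"
    ]
    throughput = report.get("throughput") or {}
    for key in ("request_throughput", "output_throughput", "total_token_throughput"):
        if key in throughput:
            lines.append(f"{key}: {throughput[key]}")
    latency = report.get("latency") or {}
    for metric in ("ttft", "tpot"):
        stats = latency.get(metric)
        if stats:
            lines.append(
                f"{metric}: mean {stats.get('mean_ms')} ms, "
                f"p99 {stats.get('p99_ms')} ms"
            )
    # gap_analysis is optional, so tolerate its absence.
    gap = report.get("gap_analysis") or {}
    for kernel in (gap.get("top_kernels") or [])[:top_n]:
        lines.append(f"kernel {kernel.get('name')}: {kernel.get('pct_total')}% of total")
    return lines
```

A caller would typically run `print("\n".join(summarize_report(load_report(path))))` against the report in a `results/benchmark_<framework>_<timestamp>/` workspace.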
diff --git a/docs/how-to/benchmarking/persistent-server-reuse.md b/docs/how-to/benchmarking/persistent-server-reuse.md
new file mode 100644
index 0000000..cd5b413
--- /dev/null
+++ b/docs/how-to/benchmarking/persistent-server-reuse.md
@@ -0,0 +1,59 @@
+---
+myst:
+    html_meta:
+        "description": "Keep a Magpie inference server alive across successive benchmark runs using server_lifecycle.enabled to avoid model reload overhead between client invocations."
+        "keywords": "Magpie, persistent server, server reuse, server_lifecycle, vLLM, benchmark, local run mode, ROCm"
+---
+
+# Persistent server reuse (local) in Magpie's benchmark mode
+
+Setting `server_lifecycle.enabled: true` with `run_mode: local` keeps one
+detached inference server alive across successive `python -m Magpie benchmark`
+runs on the same port, avoiding the model reload overhead on each invocation.
+This is useful when running many client-only benchmark sweeps against the same
+model and configuration.
+
+```{note}
+Persistent server reuse is a `run_mode: local` feature of Magpie's benchmark mode.
+For a full overview of benchmark mode, including run modes, configuration, and output
+structure, see [Benchmark frameworks with Magpie](benchmark.md).
+```
+
+## How it works
+
+- `timeout_seconds` applies to the client subprocess (`benchmark_serving.py`)
+  only — it does not stop the shared HTTP server afterward.
+- `server_lifecycle.cleanup`: Magpie terminates the persisted process group
+  (sends `SIGTERM`, then kills stragglers) only when `cleanup: true`; it also
+  removes the associated `*.pid` / `*.json` artifacts under
+  `~/.cache/magpie/server/` (or `server_lifecycle.pid_dir`).
+- Compatibility gate: reuse checks JSON metadata versus `MODEL`, `TP`,
+  `EXTRA_VLLM_ARGS`, `EXTRA_SGLANG_ARGS`, `MAX_MODEL_LEN`, InferenceX resolved
+  path, framework, and `PORT`. Set `force_reuse: true` to bypass the mismatch
+  errors.
+- Scripts: requires Magpie built-in InferenceX wrappers that implement
+  `MAGPIE_RUN_PHASE=server|client` (for example, `vllm_mi355x.sh`). Native
+  InferenceX `gptoss_*` / `dsr1_*` scripts reject this flag path by design until
+  they are updated upstream — point `benchmark_script` at one of the Magpie
+  `*.sh` files.
+- Profiling: torch profiler + `cleanup: false` is rejected (profiler state is
+  tied to surviving workers). Configure `profiler.torch_profiler.enabled: false`
+  for warmed servers, or set `cleanup: true`.
+
+## GPU selection and server reuse
+
+Before each run Magpie probes `http://127.0.0.1:$PORT/health` and compares
+reuse metadata against the chosen config (`force_reuse: true` skips the
+comparison). When the probe indicates the existing server should be reused
+(eligible client-only path), `gpu_selection.auto` is skipped (`find_idle_gpus`
+is not run), so visible-device env vars are not re-randomized relative to the
+server's physical GPUs on the reuse chain. When the probe fails (cold start or
+stale server after crash), idle-GPU selection runs as usual and Magpie launches a
+new server phase. For `profiler.gpu_monitor` while reusing without auto-selection,
+pin GPUs in `envs` or set `gpu_monitor.device_id` if you care which card is
+sampled.
+
+## Example
+
+See `examples/benchmarks/benchmark_vllm_reuse.yaml` for a complete working
+configuration.
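Reviewer note on `persistent-server-reuse.md`: the reuse decision this page describes (health probe, then metadata comparison, with `force_reuse` bypassing mismatch errors) can be sketched as below. This is an illustration under assumptions, not Magpie's code: `REUSE_KEYS`, `reuse_mismatches`, and `should_reuse` are hypothetical names, the `inferencex_path` and `framework` key spellings are guesses at how the JSON metadata is stored, and the result of the `/health` probe is passed in as a boolean rather than performed here.

```python
from typing import Any, Dict, List

# Fields the page lists for the compatibility gate (key spellings are assumed).
REUSE_KEYS = (
    "MODEL", "TP", "EXTRA_VLLM_ARGS", "EXTRA_SGLANG_ARGS",
    "MAX_MODEL_LEN", "inferencex_path", "framework", "PORT",
)


def reuse_mismatches(saved: Dict[str, Any], requested: Dict[str, Any]) -> List[str]:
    """Return the gate fields whose saved and requested values differ."""
    return [k for k in REUSE_KEYS if saved.get(k) != requested.get(k)]


def should_reuse(
    healthy: bool,
    saved: Dict[str, Any],
    requested: Dict[str, Any],
    force_reuse: bool = False,
) -> bool:
    """Decide between the client-only reuse path and a cold start."""
    if not healthy:
        # Probe failed: cold start, so idle-GPU selection runs as usual.
        return False
    if force_reuse:
        # force_reuse skips the metadata comparison entirely.
        return True
    mismatches = reuse_mismatches(saved, requested)
    if mismatches:
        raise ValueError(f"server metadata mismatch on: {', '.join(mismatches)}")
    # Reuse path: gpu_selection.auto would be skipped for this invocation.
    return True
```

Raising on a mismatch, instead of silently starting a second server, mirrors the page's wording that `force_reuse: true` exists to bypass mismatch errors.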
diff --git a/docs/how-to/benchmarking/profiling-options.md b/docs/how-to/benchmarking/profiling-options.md
new file mode 100644
index 0000000..3691531
--- /dev/null
+++ b/docs/how-to/benchmarking/profiling-options.md
@@ -0,0 +1,178 @@
+---
+myst:
+    html_meta:
+        "description": "Configure Magpie's profiling backends for benchmark mode: torch profiler for trace capture, TraceLens for inference analysis, and gap analysis for kernel bottleneck reporting."
+        "keywords": "Magpie, profiling, torch profiler, TraceLens, gap analysis, benchmark, ROCm, vLLM, SGLang, kernel bottleneck"
+---
+
+# Profiling options in Magpie's benchmark mode
+
+Magpie supports three profiling backends that can be combined in a single benchmark run. The torch profiler captures per-rank JSON traces, TraceLens runs inference-aware analysis on those traces to produce prefill/decode performance reports, and gap analysis aggregates kernel durations from the traces to identify the kernels that dominate runtime. The torch profiler and TraceLens are configured under the `profiler:` key in your benchmark YAML, while gap analysis has its own `gap_analysis:` block; each backend writes its output to a dedicated subdirectory in the benchmark workspace. Enable only the backends you need: the torch profiler is a prerequisite for both TraceLens and gap analysis, but TraceLens and gap analysis are independent of each other.
+
+```{note}
+Profiling is an optional feature of Magpie's benchmark mode. For a full overview
+of benchmark mode, including run modes, configuration, and output structure, see
+[Benchmark frameworks with Magpie](benchmark.md).
+```
+
+## Torch profiler
+
+When `torch_profiler.enabled: true`, Magpie takes the following actions.
+
+- Sets `VLLM_TORCH_PROFILER_DIR` automatically
+- Generates JSON trace files for each GPU rank
+- Saves traces to `results/benchmark_<framework>_<timestamp>/torch_trace/`
+
+## TraceLens analysis
+
+TraceLens provides automated analysis of torch profiler traces:
+
+| Command | Description | Output |
+|---------|-------------|--------|
+| `TraceLens_split_inference_trace` | Split vLLM/SGLang inference traces into phase windows | `torch_trace/trace_split/` |
+| `TraceLens_generate_perf_report_pytorch_inference` | Inference-aware prefill/decode reports | `tracelens/` |
+| `TraceLens_generate_perf_report_pytorch` | Single-rank performance report | `tracelens_rank0_csvs/` |
+| `TraceLens_generate_multi_rank_collective_report_pytorch` | Multi-rank collective analysis | `tracelens_collective_csvs/` |
+
+`analysis_mode` defaults to `inference`, which is the recommended mode for
+vLLM/SGLang benchmarks. It automatically enables the torch profiler, patches the
+needed InferenceX profiling helpers for the run, splits the rank-0 trace, and
+runs reports for all inference stages. Use `analysis_mode: pytorch` to keep the
+legacy direct PyTorch report flow.
+
+For Docker benchmarks, `auto_patch_runtime` defaults to `true`. When TraceLens
+inference mode is enabled and the selected runtime image is not already
+TraceLens-ready, Magpie builds a derived image from supported official
+vLLM/SGLang tags using the public TraceLens workflow scripts. The derived image
+is tagged locally as `magpie-tracelens-<framework>:...` and reused on later runs.
+Set `profiler.tracelens.tracelens_repo_path` or `TRACELENS_REPO_PATH` to a public
+TraceLens source checkout if Magpie cannot auto-locate it.
+
+For `run_mode: docker`, TraceLens inference post-processing also runs inside the
+resolved runtime image after the benchmark container exits. The post-processing
+container is CPU-only, mounts the benchmark workspace at `/workspace`, and writes
+CSV outputs under `tracelens/`. Host Python only needs Docker; it does not need
+the TraceLens CLI on `PATH`.
+
+`analysis_stages` defaults to `all`:
+
+```yaml
+profiler:
+  tracelens:
+    enabled: true
+    analysis_stages: all
+```
+
+To run only selected stages:
+
+```yaml
+profiler:
+  tracelens:
+    enabled: true
+    analysis_stages: [prefill, decode]
+```
+
+Supported stage names are `prefilldecode` (alias: `mixed`), `prefill`, and
+`decode`. GPU architecture is detected through Magpie's existing runner/GPU
+mapping and passed to TraceLens as `--gpu_arch_platform`.
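+
+As a rough illustration of how an `analysis_stages` value could be normalized into canonical stage names, consider the sketch below. Only the stage names, the `mixed` alias, and the `all` default come from the text above; the helper name `normalize_stages` and its error handling are hypothetical and are not taken from Magpie's source.
+
+```python
+# Hypothetical helper for illustration; not part of Magpie's API.
+SUPPORTED_STAGES = ("prefilldecode", "prefill", "decode")
+STAGE_ALIASES = {"mixed": "prefilldecode"}
+
+
+def normalize_stages(analysis_stages):
+    """Return canonical stage names for an `analysis_stages` value."""
+    if analysis_stages == "all":
+        return list(SUPPORTED_STAGES)
+    if isinstance(analysis_stages, str):
+        analysis_stages = [analysis_stages]
+    result = []
+    for stage in analysis_stages:
+        # Resolve the documented alias, then validate the canonical name.
+        name = STAGE_ALIASES.get(stage.lower(), stage.lower())
+        if name not in SUPPORTED_STAGES:
+            raise ValueError(f"unsupported TraceLens stage: {stage!r}")
+        if name not in result:
+            result.append(name)
+    return result
+```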
+
+For SGLang, TraceLens inference mode automatically adds
+`--enable-profile-cuda-graph`. It also adds
+`--enable-shape-discovery-for-cuda-graph-profile` when the configured Docker
+image name looks like a TraceLens-patched SGLang image, such as
+`tracelens-sglang:*` or `magpie-tracelens-sglang:*`. For local runs, Magpie also
+detects whether the installed SGLang exposes the patched server argument. For
+other SGLang builds, keep patched-runtime-only flags explicit in
+`EXTRA_SGLANG_ARGS`.
+
+Each TraceLens inference post-processing command uses `cli_timeout_seconds`, which
+defaults to `1800`. Increase it for long-output runs where splitting the full
+decode trace can take longer:
+
+```yaml
+profiler:
+  tracelens:
+    enabled: true
+    cli_timeout_seconds: 2400
+```
+
+To enable an internal TraceLens extension, set `TL_EXTENSION` either in the
+shell environment or under benchmark envs. Magpie does not interpret the value;
+it only passes the variable through to the benchmark and TraceLens
+post-processing commands:
+
+```yaml
+benchmark:
+  envs:
+    TL_EXTENSION: "TraceLens_NDA"
+```
+
+### TraceLens output files
+
+TraceLens writes results into the following directories under the benchmark workspace.
+
+**Inference reports (`tracelens/`):**
+- `prefilldecode/` - Mixed prefill+decode phase report
+- `decode_only/` - Pure decode phase report
+- `prefill_only/` - Pure prefill phase report
+
+**Single-rank report (`tracelens_rank0_csvs/`):**
+- `gpu_timeline.csv` - GPU kernel timeline
+- `ops_summary.csv` - Operation summary
+- `ops_summary_by_category.csv` - Operations by category
+- `coll_analysis.csv` - Collective communication analysis
+- `kernel_summary.csv` - Kernel summary statistics
+
+**Multi-rank collective report (`tracelens_collective_csvs/`):**
+- Aggregated statistics across all GPU ranks
+- Communication pattern analysis
+- Load balancing metrics
+
+## Gap analysis
+
+Gap analysis identifies GPU kernel bottlenecks from torch profiler traces. It applies a configurable time window to focus on the steady-state portion of the trace, then aggregates kernel durations by category.
+
+The analysis pipeline runs the following steps.
+
+1. Apply time window (`trace_start_pct` – `trace_end_pct`) to isolate steady-state events
+2. Filter by category (case-insensitive substring matching on the event `cat` field)
+3. Aggregate stats per kernel name, rank by total duration
+
+**CSV output columns:** `Name, Calls, Self CUDA total (us), Avg time (us), % Total`
+
+**Defaults (no YAML needed):**
+- `categories`: `["kernel", "gpu"]`
+- `ignore_categories`: `["gpu_user_annotation"]`
+
+**Minimal config:**
+```yaml
+profiler:
+  gap_analysis:
+    enabled: true
+    trace_start_pct: 50
+    trace_end_pct: 80
+```
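+
+The three pipeline steps above can be sketched in a few lines of Python. This is a minimal illustration and not Magpie's implementation: it assumes Chrome-trace-style events with `name`, `cat`, `ts`, and `dur` fields in microseconds, assumes the window is measured from the earliest event start to the latest event end, and keeps an event when its start falls inside the window. The function name `gap_analysis` is hypothetical.
+
+```python
+from collections import defaultdict
+
+
+def gap_analysis(events, start_pct=0, end_pct=100,
+                 categories=("kernel", "gpu"),
+                 ignore_categories=("gpu_user_annotation",)):
+    """Rank kernels by total duration inside a percentage window."""
+    timed = [e for e in events if "ts" in e and "dur" in e]
+    if not timed:
+        return []
+    t0 = min(e["ts"] for e in timed)
+    t1 = max(e["ts"] + e["dur"] for e in timed)
+    lo = t0 + (t1 - t0) * start_pct / 100
+    hi = t0 + (t1 - t0) * end_pct / 100
+
+    stats = defaultdict(lambda: [0, 0.0])  # name -> [calls, total_us]
+    for e in timed:
+        # Step 1: keep events that start inside the time window.
+        if not (lo <= e["ts"] <= hi):
+            continue
+        # Step 2: case-insensitive substring match on `cat`; ignored
+        # categories are checked first because they may contain "gpu".
+        cat = e.get("cat", "").lower()
+        if any(ig.lower() in cat for ig in ignore_categories):
+            continue
+        if not any(c.lower() in cat for c in categories):
+            continue
+        stats[e["name"]][0] += 1
+        stats[e["name"]][1] += e["dur"]
+
+    # Step 3: aggregate per kernel name and rank by total duration.
+    total = sum(t for _, t in stats.values()) or 1.0
+    rows = [(name, calls, tot, tot / calls, 100 * tot / total)
+            for name, (calls, tot) in stats.items()]
+    return sorted(rows, key=lambda r: r[2], reverse=True)
+```
+
+Each returned tuple follows the CSV column order: name, calls, total time, average time, and percentage of total.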
+
+### Standalone CLI
+
+Run gap analysis on existing trace directories without re-running the benchmark.
+
+
+
+```bash
+# Basic usage (CLI defaults: --start-pct 0 --end-pct 100 unless you override)
+python -m Magpie benchmark gap-analysis \
+    --trace-dir results/benchmark_vllm_<timestamp>/
+
+# With custom window and categories (align with YAML gap_analysis window if desired)
+python -m Magpie benchmark gap-analysis \
+    --trace-dir results/benchmark_vllm_<timestamp>/torch_trace \
+    --start-pct 50 --end-pct 80 \
+    --top-k 15 \
+    --categories kernel gpu \
+    --ignore-categories gpu_user_annotation
+```
+
+The `--trace-dir` argument accepts either a benchmark workspace directory (auto-detects `torch_trace/` inside) or a direct path to the trace directory.
+
+Output is written to a `gap_analysis/` subfolder under the trace directory's parent.
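+
+The path handling described above might look like the following sketch. The helper name `resolve_trace_dir` is hypothetical, and the real CLI may apply additional checks; only the `torch_trace/` auto-detection and the `gap_analysis/` output location come from the text.
+
+```python
+from pathlib import Path
+
+
+def resolve_trace_dir(trace_dir):
+    """Return (trace directory, gap-analysis output directory)."""
+    path = Path(trace_dir)
+    nested = path / "torch_trace"
+    # A benchmark workspace contains torch_trace/; otherwise treat the
+    # argument as the trace directory itself.
+    traces = nested if nested.is_dir() else path
+    return traces, traces.parent / "gap_analysis"
+```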
+
diff --git a/docs/how-to/kernel-source-finder.md b/docs/how-to/kernel-source-finder.md
index d43abe3..1dd0496 100644
--- a/docs/how-to/kernel-source-finder.md
+++ b/docs/how-to/kernel-source-finder.md
@@ -1,8 +1,17 @@
-# Kernel Source Finder
+---
+myst:
+    html_meta:
+        "description": "Automatically map GPU kernel names from profiler traces to their source code and test files using Magpie's kernel source finder for AMD and NVIDIA kernels."
+        "keywords": "Magpie, kernel source finder, GPU kernels, Triton, CK Tile, Tensile, gap analysis, ROCm, profiler traces"
+---
 
-Automatically maps GPU kernel names from profiler traces to their source code and test files.
+# Find kernel sources with Magpie
 
-## Overview
+When gap analysis identifies the GPU kernels dominating your benchmark runtime, the kernel source finder maps those mangled kernel names back to their human-readable source files and runnable test commands. It clones the relevant upstream repositories automatically, parses the kernel name to determine its type and origin, and writes source file paths, GitHub URLs, and test commands directly into the gap analysis CSV. Use this feature to quickly locate the code behind a bottleneck kernel and reproduce it in isolation.
+
+## Pipeline overview
+
+The kernel source finder follows a four-step pipeline.
 
 ```
 Profiler Trace → Kernel Name → Parser → Searcher → Source & Test Info
@@ -11,11 +20,36 @@ Profiler Trace → Kernel Name → Parser → Searcher → Source & Test Info
                             kernel type  cloned repos
 ```
 
-## Supported Kernel Types
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    KernelSourceFinder                       │
+│                    (finder.py)                              │
+├─────────────────────────────────────────────────────────────┤
+│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
+│  │ RepoManager  │  │ KernelName   │  │ KernelSource     │   │
+│  │              │  │ Parser       │  │ Searcher         │   │
+│  │ - auto clone │  │              │  │                  │   │
+│  │ - 5 repos    │  │ - classify   │  │ - ripgrep search │   │
+│  │              │  │ - parse info │  │ - static mapping │   │
+│  └──────────────┘  └──────────────┘  └──────────────────┘   │
+│         │                 │                   │             │
+│         ▼                 ▼                   ▼             │
+│  ┌──────────────────────────────────────────────────────┐   │
+│  │                   KernelSourceInfo                   │   │
+│  │  (kind, category, source_file, test_file, test_cmd)  │   │
+│  └──────────────────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Supported kernel types
+
+The kernel source finder recognizes the following kernel types.
 
 | Type | Pattern | Source Repository |
 |------|---------|-------------------|
-| **Triton JIT** | `*.kd` (e.g., `_matmul_ogs_NNT.kd`) | triton, triton_kernels |
+| **Triton JIT** | `*.kd` (for example, `_matmul_ogs_NNT.kd`) | triton, triton_kernels |
 | **CK Tile** | `_ZN7ck_tile*` | rocm-libraries/composablekernel |
 | **Tensile GEMM** | `Cijk_*` | rocm-libraries/rocblas |
 | **ATen Native** | `void at::native::*` | pytorch |
@@ -23,9 +57,11 @@ Profiler Trace → Kernel Name → Parser → Searcher → Source & Test Info
 | **AITER** | `_ZN5aiter*` | aiter |
 | **Inductor** | `triton_*_fused_*` | pytorch |
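+
+To make the pattern column concrete, the sketch below classifies a kernel name using only the patterns visible in the rows above. It is an illustration, not the contents of `parser.py`: the function name `classify_kernel`, the match order, and every label except `triton_jit` are assumptions.
+
+```python
+import re
+
+# Order matters in this sketch: Inductor names also start with "triton_",
+# and the generic ".kd" suffix is checked last.
+KERNEL_PATTERNS = [
+    ("inductor", re.compile(r"^triton_.*_fused_")),
+    ("ck_tile", re.compile(r"^_ZN7ck_tile")),
+    ("aiter", re.compile(r"^_ZN5aiter")),
+    ("tensile", re.compile(r"^Cijk_")),
+    ("aten", re.compile(r"^void at::native::")),
+    ("triton_jit", re.compile(r"\.kd$")),
+]
+
+
+def classify_kernel(name):
+    """Return the first matching kernel type label, or "unknown"."""
+    for kind, pattern in KERNEL_PATTERNS:
+        if pattern.search(name):
+            return kind
+    return "unknown"
+```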
 
-## How It Works
+## Workflow
 
-### Step 1: Auto-Clone Repositories
+The finder runs four sequential steps to map kernel names to source files.
+
+### Step 1: Auto-clone repositories
 
 When gap analysis runs, it automatically clones required repos to `~/.cache/magpie/repos/`:
 
@@ -38,7 +74,7 @@ When gap analysis runs, it automatically clones required repos to `~/.cache/magp
 └── aiter/             # AITER kernels
 ```
 
-### Step 2: Parse Kernel Name
+### Step 2: Parse kernel name
 
 The parser extracts structured info from kernel names:
 
@@ -53,14 +89,14 @@ ParsedKernelName(
 )
 ```
 
-### Step 3: Search Source & Test
+### Step 3: Search source and test
 
 The searcher looks up source files using:
 - **ripgrep**: Fast regex search across repos
 - **Static mappings**: Known paths for Tensile, CK Tile examples
 - **Kernel index**: Pre-built index for faster lookups
 
-### Step 4: Generate Output
+### Step 4: Generate output
 
 Results are written to `gap_analysis.csv`:
 
@@ -71,7 +107,9 @@ _matmul_ogs_NNT_bf16.kd,24552,5631747.87,...,triton_jit,gemm,triton_kernels,$TRI
 
 ## Usage
 
-### Run Gap Analysis with Kernel Source Finding
+### Run gap analysis with kernel source finding
+
+Pass `--find-kernel-sources` to enable source lookup during gap analysis.
 
 ```bash
 python3 -m Magpie benchmark \
@@ -80,7 +118,9 @@ python3 -m Magpie benchmark \
     --find-kernel-sources
 ```
 
-### Output Fields
+### Output fields
+
+The following fields are added to `gap_analysis.csv` when kernel source finding is enabled.
 
 | Field | Description |
 |-------|-------------|
@@ -93,7 +133,7 @@ python3 -m Magpie benchmark \
 | `test_cmd` | Command to run tests |
 | `notes` | Additional info (dtype, tile sizes, etc.) |
 
-### Path Variables
+### Path variables
 
 The CSV header includes path mappings:
 
@@ -106,7 +146,7 @@ The CSV header includes path mappings:
 
 Base directory: `~/.cache/magpie/repos/`
 
-## Example Output
+## Example output
 
 For a CK Tile RMSNorm kernel:
 
@@ -120,37 +160,16 @@ test_file: $ROCM_LIBRARIES_DIR/projects/composablekernel/example/ck_tile/10_rmsn
 test_cmd: cd $ROCM_LIBRARIES_DIR/projects/composablekernel/build && cmake --build . -j --target tile_example_rmsnorm2d_fwd
 ```
 
-## Architecture
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│                    KernelSourceFinder                       │
-│                    (finder.py)                              │
-├─────────────────────────────────────────────────────────────┤
-│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
-│  │ RepoManager  │  │ KernelName   │  │ KernelSource     │  │
-│  │              │  │ Parser       │  │ Searcher         │  │
-│  │ - auto clone │  │              │  │                  │  │
-│  │ - 5 repos    │  │ - classify   │  │ - ripgrep search │  │
-│  │              │  │ - parse info │  │ - static mapping │  │
-│  └──────────────┘  └──────────────┘  └──────────────────┘  │
-│         │                 │                   │             │
-│         ▼                 ▼                   ▼             │
-│  ┌──────────────────────────────────────────────────────┐  │
-│  │                   KernelSourceInfo                    │  │
-│  │  (kind, category, source_file, test_file, test_cmd)  │  │
-│  └──────────────────────────────────────────────────────┘  │
-└─────────────────────────────────────────────────────────────┘
-```
+## Add new kernel types
 
-## Adding New Kernel Types
+To add support for a new kernel type, update these three files:
 
-1. Add pattern to `parser.py`:
+- Add pattern to `parser.py`:
    ```python
    MY_PATTERN = re.compile(r'^my_kernel_prefix')
    ```
 
-2. Add search methods to `searcher.py`:
+- Add search methods to `searcher.py`:
    ```python
    def _search_my_source(self, parsed):
        # Search logic
@@ -159,7 +178,7 @@ test_cmd: cd $ROCM_LIBRARIES_DIR/projects/composablekernel/build && cmake --buil
        # Test search logic
    ```
 
-3. Add repo URL to `repo_manager.py`:
+- Add repo URL to `repo_manager.py`:
    ```python
    REPO_URLS = {
        "my-repo": "https://github.com/org/my-repo.git",
diff --git a/docs/how-to/mcp-and-skills.md b/docs/how-to/mcp-and-skills.md
index e1a0cac..e9beafc 100644
--- a/docs/how-to/mcp-and-skills.md
+++ b/docs/how-to/mcp-and-skills.md
@@ -1,7 +1,13 @@
-# MCP server and agent skills
+---
+myst:
+    html_meta:
+        "description": "Drive Magpie from AI agents using the MCP server or the Magpie agent skill for Cursor, Claude Code, and Codex. Covers installation, configuration, and verification."
+        "keywords": "Magpie, MCP server, agent skills, Claude Code, Cursor, Codex, Model Context Protocol, AI agents, GPU evaluation"
+---
 
-Magpie can be driven by AI agents in two ways: the Model Context Protocol (MCP)
-server, or the Magpie agent skill for editors that do not support MCP.
+# Run MCP server and agent skills with Magpie
+
+Magpie can be driven by AI agents in two ways: the Model Context Protocol (MCP) server, which exposes the full Magpie toolset to any MCP-compatible client, or the Magpie agent skill, which gives editors without MCP support access to the same capabilities through documented CLI patterns. The MCP server is the preferred integration for programmatic use, remote execution, and Ray-based workflows, while the skill is a drop-in option for Cursor, Claude Code, Codex, and similar editors. Both approaches let agents analyze kernels, run benchmarks, query GPU hardware, and trigger gap analysis without leaving the IDE.
 
 ## Run the MCP server
 
@@ -17,9 +23,9 @@ A sample client configuration is provided at `Magpie/mcp/config.json`. For the
 full list of tools and their parameters, see the
 [API reference](../reference/api-reference.md).
 
-## Installing the Magpie skill
+## Install the Magpie skill
 
-The Magpie skill lets AI agents (Cursor, Claude Code, Codex, etc.) drive Magpie through documented CLI patterns when MCP is not available. The skill lives in this repo under **`skills/magpie/`** (IDE-neutral). Install it into each editor’s skills directory with the script below, or follow the manual copy steps.
+The Magpie skill lets AI agents (Cursor, Claude Code, Codex, and similar editors) drive Magpie through documented CLI patterns when MCP is not available. The skill lives in this repo under **`skills/magpie/`** (IDE-neutral). Install it into each editor’s skills directory with the script below, or follow the manual copy steps.
 
 ## Install script (recommended)
 
@@ -48,9 +54,12 @@ chmod +x skills/install-skill.sh
 ./skills/install-skill.sh -h
 ```
 
-**Behavior:** The script removes any existing destination folder named `magpie` and copies `skills/magpie` there. No editor restart is usually required.
+The script removes any existing destination folder named `magpie` and copies `skills/magpie` there. No editor restart is usually required.
 
-**Related docs:** [Analyze vs Compare](analyze-compare.md), [Benchmark mode](benchmark.md), [README](https://github.com/AMD-AGI/Magpie#readme) (MCP vs skill).
+Review these topics for more information:
+- [Analyze and compare kernels with Magpie](analyze-compare.md)
+- [Benchmark frameworks with Magpie](benchmarking/benchmark.md)
+- [README](https://github.com/AMD-AGI/Magpie#readme) (MCP vs skill)
 
 ## Manual install
 
@@ -58,11 +67,11 @@ Source folder: **`skills/magpie/`** in this repo (contains `SKILL.md`, `referenc
 
 ### Cursor
 
-- **Global:** Copy the skill into your Cursor skills folder:
+- **Global**: Copy the skill into your Cursor skills folder:
   ```bash
   cp -r /path/to/Magpie/skills/magpie ~/.cursor/skills/magpie
   ```
-- **Project:** Copy into your project’s Cursor skills folder so only that project uses it:
+- **Project**: Copy into your project’s Cursor skills folder so only that project uses it:
   ```bash
   mkdir -p /path/to/your/project/.cursor/skills
   cp -r /path/to/Magpie/skills/magpie /path/to/your/project/.cursor/skills/magpie
@@ -86,16 +95,16 @@ Same `SKILL.md` format and layout.
 
 ### Codex and other IDEs
 
-- If the IDE has a **skills** or **custom instructions** directory (e.g. `~/.codex/skills/` or a project `.codex/skills/`), copy the **`magpie`** directory there:
+- If the IDE has a `skills` or `custom instructions` directory (for example, `~/.codex/skills/` or a project `.codex/skills/`), copy the **`magpie`** directory there:
   ```bash
   mkdir -p ~/.codex/skills
   cp -r /path/to/Magpie/skills/magpie ~/.codex/skills/magpie
   ```
   Use the path your IDE documents for custom skills.
-- If the IDE only has a **single “custom instructions”** or “rules” field, paste the body of [skills/magpie/SKILL.md](https://github.com/AMD-AGI/Magpie/blob/main/skills/magpie/SKILL.md) (the markdown after the YAML frontmatter) into that field, and add a note that these instructions apply when working with Magpie, GPU kernel analysis/compare, or vLLM/SGLang benchmarks.
+- If the IDE only has a single `custom instructions` or `rules` field, paste the body of [skills/magpie/SKILL.md](https://github.com/AMD-AGI/Magpie/blob/main/skills/magpie/SKILL.md) (the markdown after the YAML frontmatter) into that field, and add a note that these instructions apply when working with Magpie, GPU kernel analysis/compare, or vLLM/SGLang benchmarks.
 
-## Verifying the skill
+## Verify the skill
 
-1. **Discovery:** In the target IDE, ask: “What can you do with Magpie?” or “I want to analyze a HIP kernel with Magpie.” The agent should use the Magpie skill (reference its instructions or run Magpie commands).
-2. **Correctness:** Ask “Show GPU info using Magpie.” The agent should run `magpie --gpu-info` or `python -m Magpie --gpu-info` from the Magpie repo (or from a directory where Magpie is installed).
-3. **CLI check:** From the Magpie repo root, run `magpie --gpu-info` (or `python -m Magpie --gpu-info`) to confirm the CLI and environment work.
+1. In the target IDE, ask: “What can you do with Magpie?” or “I want to analyze a HIP kernel with Magpie.” The agent should use the Magpie skill (reference its instructions or run Magpie commands).
+2. Ask “Show GPU info using Magpie.” The agent should run `magpie --gpu-info` or `python -m Magpie --gpu-info` from the Magpie repo (or from a directory where Magpie is installed).
+3. From the Magpie repo root, run `magpie --gpu-info` (or `python -m Magpie --gpu-info`) to confirm the CLI and environment work.
diff --git a/docs/how-to/ray.md b/docs/how-to/ray.md
index 5da4733..f683a87 100644
--- a/docs/how-to/ray.md
+++ b/docs/how-to/ray.md
@@ -1,141 +1,107 @@
-# Magpie on Ray
+---
+myst:
+    html_meta:
+        "description": "Run Magpie's analyze, compare, and benchmark workloads on remote GPU nodes using Ray. Covers cluster setup, shared storage, configuration, and troubleshooting."
+        "keywords": "Magpie, Ray, remote GPU, cluster, RayJobExecutor, vLLM, SGLang, distributed benchmark, ROCm, CUDA"
+---
 
-This document describes how [Ray](https://www.ray.io/) is integrated into Magpie so that **analyze**, **compare**, and **benchmark** workloads can run on **remote GPU nodes** instead of (or in addition to) the machine where you invoke the CLI or MCP.
+# Run Magpie on a Ray cluster
 
-It is a **reference manual** for operators and contributors. For Benchmark-mode diagrams and YAML examples, see [benchmark.md](benchmark.md). For kernel analyze vs compare semantics, see [analyze-compare.md](analyze-compare.md).
+Ray integration lets you offload Magpie's analyze, compare, and benchmark workloads to remote GPU nodes without changing the evaluation logic. The same `AnalyzeMode`, `CompareMode`, and `BenchmarkMode` code runs on the worker as it would locally; the driver uses `RayJobExecutor` to submit `run_task` to the cluster. This is useful when the GPUs, software stack, or model weights you need are on a dedicated cluster node rather than the machine running the CLI or MCP server. This page covers how to configure the cluster connection, set up shared storage, run kernel and benchmark workloads on Ray, and troubleshoot common problems.
 
 ---
 
-## 1. Goals and scope
+## Goals and scope
 
 - **Offload** compilation, correctness checks, profilers, and LLM serving to machines that actually have the GPUs and software stack you need.
-- **Reuse the same mode code** (`AnalyzeMode`, `CompareMode`, `BenchmarkMode`) on the worker; Ray only changes **how** the task is launched (`RayJobExecutor` + `run_task` on a worker).
+- **Reuse the same mode code** (`AnalyzeMode`, `CompareMode`, `BenchmarkMode`) on the worker; Ray only changes how the task is launched (`RayJobExecutor` + `run_task` on a worker).
 - **Rely on shared storage** (typically NFS) so Hugging Face caches, InferenceX checkouts, and benchmark artifacts are visible to both driver and workers.
 
-Ray integration does **not** require the Ray Dashboard or Jobs API—only connectivity to the cluster (`ray.init` with `auto` or `ray://…`).
+Ray integration does not require the Ray Dashboard or Jobs API—only connectivity to the cluster (`ray.init` with `auto` or `ray://…`).
 
 ---
 
-## 2. Prerequisites
+## Prerequisites
 
 | Requirement | Notes |
 |-------------|--------|
-| Python `ray` package on the **driver** | `pip install ray`; same major version as the cluster is recommended. |
-| GPU **worker** nodes registered with Ray | Nodes expose `GPU` resources in `ray.nodes()`. |
-| Magpie importable on workers | Default `RayConfig.install_magpie` is `true`: workers may install Magpie + `requirements.txt` via Ray `runtime_env["pip"]`. Pre-install Magpie in the image and set `install_magpie: false` to skip. |
+| Python `ray` package on the driver | `pip install ray`; same major version as the cluster is recommended. |
+| GPU worker nodes registered with Ray | Nodes expose `GPU` resources in `ray.nodes()`. |
+| Magpie importable on workers | Default `RayConfig.install_magpie` is `true`: workers can install Magpie + `requirements.txt` via Ray `runtime_env["pip"]`. Pre-install Magpie in the image and set `install_magpie: false` to skip. |
 | Shared filesystem (strongly recommended) | Same mount path on driver and workers for model cache, InferenceX, and benchmark results. |
-| Kernel / project paths (analyze/compare) | Paths in YAML (e.g. `${CK_HOME}/…`) must exist **on the worker** (or on shared FS visible there). |
+| Kernel / project paths (analyze/compare) | Paths in YAML (for example, `${CK_HOME}/…`) must exist on the worker (or on shared FS visible there). |
 
 ---
 
-## 3. Architecture
-
-### 3.1 Driver vs worker
-
-- **Driver**: process running `python -m Magpie …`, MCP, or your script. It calls `Scheduler` or `BenchmarkMode`, connects with `ray.init(address=…)`, and submits a remote function.
-- **Worker**: Ray executes `Magpie.remote.tasks.run_task` on a **GPU-capable node**. That function dispatches to `_run_analyze`, `_run_compare`, or `_run_benchmark`.
-
-### 3.2 Executor selection
-
-| `SchedulerConfig.environment_type` | Executor | Execution |
-|-----------------------------------|----------|-----------|
-| `local` | `LocalExecutor` | Subprocesses on the driver machine (`Magpie/core/executor.py`). |
-| `container` | Container executor | Isolated environment on the driver (kernel flows). |
-| `ray` | `RayJobExecutor` | `ray.remote(run_task)` on a cluster node (`Magpie/core/ray_executor.py`). |
-
-Benchmark mode additionally uses `BenchmarkConfig.run_mode`: `docker`, `local`, or `ray`. When `run_mode` is `ray`, `BenchmarkMode` builds a `Task` and uses `RayJobExecutor` internally (`Magpie/modes/benchmark/benchmarker.py`).
-
-### 3.3 End-to-end flow (conceptual)
-
-```mermaid
-flowchart LR
-  subgraph Driver
-    CLI[MCP / CLI]
-    SCH[Scheduler or BenchmarkMode]
-    RJE[RayJobExecutor]
-    CLI --> SCH --> RJE
-  end
-  subgraph Cluster
-    RT[run_task]
-    A[AnalyzeMode]
-    C[CompareMode]
-    B[BenchmarkMode]
-    RJE -->|ray.remote| RT
-    RT --> A
-    RT --> C
-    RT --> B
-  end
-```
-
----
-
-## 4. Connecting to the cluster (`cluster_address`)
+## Connect to the cluster (`cluster_address`)
 
 `RayConfig.cluster_address` (and the analyze/compare path: values taken from kernel YAML `ray_config.cluster_address` into `SchedulerConfig.ray_cluster_address`) is passed to `ray.init(address=…)`.
 
 | Value | When to use |
 |-------|-------------|
-| `"auto"` | Driver runs **on the Ray head** or in the same Ray network namespace so the local GCS is discoverable. |
-| IP/hostname of head (e.g. `"192.168.1.10:6379"`) | Explicit GCS address when `auto` is ambiguous. |
-| `"ray://<host>:10001"` | Driver is **remote**; connect via **Ray Client** (typical client port `10001`). |
+| `"auto"` | Driver runs on the Ray head or in the same Ray network namespace so the local GCS is discoverable. |
+| IP/hostname of head (for example, `"192.168.1.10:6379"`) | Explicit GCS address when `auto` is ambiguous. |
+| `"ray://<host>:10001"` | Driver is remote; connect via Ray Client (typical client port `10001`). |
 
-Using `auto` from a laptop that is **not** attached to the cluster will not reach remote workers—use `ray://…` or run the driver on the head node.
+Using `auto` from a laptop that is not attached to the cluster will not reach remote workers—use `ray://…` or run the driver on the head node.
 
 ---
 
-## 5. Shared storage
+## Shared storage
 
 Default shared root in code: `DEFAULT_SHARED_STORAGE_PATH = "/shared_nfs/magpie"` (`Magpie/modes/benchmark/config.py`). Override with `shared_storage_path` in YAML.
 
-### 5.1 What uses it
+### What uses it
 
 - **Worker env** (`Magpie/remote/tasks.py` `_setup_env`): sets `HF_HOME` and `TRANSFORMERS_CACHE` under `{shared_storage_path}/hf_cache` unless already set.
 - **Benchmark on Ray** (`_run_benchmark`): if paths are empty, sets `inferencex_path` → `{shared_storage}/InferenceX`, `hf_cache_path` → `{shared_storage}/hf_cache`, and writes results under `{results_dir}/{task_id}` (defaults `results_dir` to `{shared_storage}/results`).
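+
+The default-path rules for benchmarks on Ray can be summarized in a short sketch. The function `fill_benchmark_defaults` and the plain-dict config are hypothetical stand-ins for what `_run_benchmark` does; only the path layout and the default shared root come from the text above.
+
+```python
+from pathlib import PurePosixPath
+
+DEFAULT_SHARED_STORAGE_PATH = "/shared_nfs/magpie"
+
+
+def fill_benchmark_defaults(config, task_id,
+                            shared_storage=DEFAULT_SHARED_STORAGE_PATH):
+    """Fill empty benchmark paths from the shared storage root."""
+    root = PurePosixPath(shared_storage)
+    filled = dict(config)
+    if not filled.get("inferencex_path"):
+        filled["inferencex_path"] = str(root / "InferenceX")
+    if not filled.get("hf_cache_path"):
+        filled["hf_cache_path"] = str(root / "hf_cache")
+    results_dir = filled.get("results_dir") or str(root / "results")
+    filled["results_dir"] = results_dir
+    # Results for each task land in their own subdirectory.
+    filled["output_dir"] = str(PurePosixPath(results_dir) / task_id)
+    return filled
+```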
 
-Analyze/compare do not rewrite kernel `working_dir` for you: **your YAML must point at directories the worker can access** (local disk or the same NFS mount).
+Analyze/compare do not rewrite kernel `working_dir` for you: your YAML must point at directories the worker can access (local disk or the same NFS mount).
 
-### 5.2 Driver `runtime_env` and cache dirs
+### Driver `runtime_env` and cache dirs
 
 `RayJobExecutor._build_runtime_env` sets `HF_HOME` / `TRANSFORMERS_CACHE` from `RayConfig.hf_cache_dir` (derived from `shared_storage_path`) for the remote task environment, and optionally adds `pip` entries when `install_magpie` or `pip_packages` is set.
 
 ---
 
-## 6. Analyze and compare on Ray
+## Analyze and compare on Ray
+
+The following sections describe how to configure and run kernel analyze and compare workloads on a Ray cluster.
 
-### 6.1 Enabling Ray for kernel modes
+### Enable Ray for kernel modes
 
-1. Add a top-level **`ray_config:`** block to the kernel YAML. `load_kernel_config` in `Magpie/main.py` sets `scheduler.environment` to **`ray`** when `ray_config` is present (unless overridden).
-2. Or set **`scheduler.environment: ray`** in framework `config.yaml` / kernel YAML `scheduler:` and supply connection details (see below).
+1. Add a top-level `ray_config:` block to the kernel YAML. `load_kernel_config` in `Magpie/main.py` sets `scheduler.environment` to `ray` when `ray_config` is present (unless overridden).
+2. Or set `scheduler.environment: ray` in framework `config.yaml` / kernel YAML `scheduler:` and supply connection details (see below).
 
-### 6.2 Fields read from kernel YAML into the scheduler
+### Fields read from kernel YAML into the scheduler
 
 `_get_scheduler_config` maps:
 
 - `ray_config.cluster_address` → `SchedulerConfig.ray_cluster_address`
 - `ray_config.shared_storage_path` → `SchedulerConfig.ray_shared_storage_path`
 
-The scheduler then constructs a **`RayConfig`** for `RayJobExecutor` with those fields (and **defaults** for all other `RayConfig` attributes such as `install_magpie`, `entrypoint_num_cpus`, `pip_packages`). Extra keys under kernel YAML `ray_config` are not merged into that executor `RayConfig` in the current implementation—treat comments in examples as **hints** for future alignment or for benchmark YAML where merging differs.
+The scheduler then constructs a `RayConfig` for `RayJobExecutor` with those fields (and defaults for all other `RayConfig` attributes such as `install_magpie`, `entrypoint_num_cpus`, `pip_packages`). Extra keys under kernel YAML `ray_config` are not merged into that executor `RayConfig` in the current implementation—treat comments in examples as hints for future alignment or for benchmark YAML where merging differs.
 
-### 6.3 Overriding environment on the CLI
+### Override environment on the CLI
 
-For any kernel config that implies Ray, you can force **local** execution:
+For any kernel config that implies Ray, you can force local execution:
 
 ```bash
 python -m Magpie analyze --kernel-config examples/ck_gemm_add_ray.yaml -e local
 ```
 
-Priority is **CLI `--environment` > kernel YAML > `Magpie/config.yaml` scheduler** (`Magpie/main.py` `_get_scheduler_config`).
+Priority is CLI `--environment` > kernel YAML > `Magpie/config.yaml` scheduler (`Magpie/main.py` `_get_scheduler_config`).
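+
+A minimal sketch of that precedence order follows. The function name `resolve_environment` and the `local` fallback are assumptions for illustration; the actual logic lives in `_get_scheduler_config`.
+
+```python
+def resolve_environment(cli_env=None, kernel_yaml_env=None,
+                        framework_env=None, default="local"):
+    """Pick the first configured environment in priority order."""
+    for value in (cli_env, kernel_yaml_env, framework_env):
+        if value:
+            return value
+    # Assumed fallback when nothing is configured.
+    return default
+```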
 
-### 6.4 Remote execution path
+### Remote execution path
 
 1. `run_analyze` / `run_compare` builds kernel configs and calls `scheduler.run_analyze` / `run_compare`.
 2. `Scheduler` creates a `Task` with `ModeType.ANALYZE` or `COMPARE` and `executor.execute(task)`.
 3. `RayJobExecutor._submit_ray_task` serializes `task.to_dict()` into `job_payload`, merges benchmark-only `ray_config` overrides if present (usually empty for kernel tasks), and calls `run_task.remote(job_payload)`.
 4. On the worker, `run_task` runs `_run_analyze` or `_run_compare` (`Magpie/remote/tasks.py`)—same classes as locally.
 
-**Result shape:** the driver receives a dict; `main.run_analyze` unwraps nested `results` when Ray returns `{"task_id", "results": [...]}`.
+Result shape: the driver receives a dict; `main.run_analyze` unwraps nested `results` when Ray returns `{"task_id", "results": [...]}`.
 
-### 6.5 Example
+### Example
 
 See `examples/ck_gemm_add_ray.yaml`: `ray_config.cluster_address`, `shared_storage_path`, plus kernel paths using `${CK_HOME}` on the worker.
 
@@ -145,9 +111,11 @@ python -m Magpie analyze --kernel-config examples/ck_gemm_add_ray.yaml --no-perf
 
 ---
 
-## 7. Benchmark on Ray
+## Benchmark on Ray
 
-### 7.1 Configuration
+The following sections describe how to configure and run framework-level benchmarks on a Ray cluster.
+
+### Configuration
 
 In benchmark YAML set:
 
@@ -161,61 +129,63 @@ benchmark:
     install_magpie: false              # common when image already has Magpie
 ```
 
-Full `RayConfig` fields are documented in `Magpie/modes/benchmark/config.py`. For `BenchmarkConfig`, `ray_config` is **required** when `run_mode` is `ray`.
+Full `RayConfig` fields are documented in `Magpie/modes/benchmark/config.py`. For `BenchmarkConfig`, `ray_config` is required when `run_mode` is `ray`.
 
-### 7.2 Worker-side behavior
+### Worker-side behavior
 
 `_run_benchmark` (`Magpie/remote/tasks.py`):
 
-1. Sets `run_mode` from `ray` to **`local`** so the worker runs `BenchmarkMode` **on that node** (Docker or local subprocess), not another Ray hop.
+1. Sets `run_mode` from `ray` to `local` so the worker runs `BenchmarkMode` on that node (Docker or local subprocess), not another Ray hop.
 2. Fills default `inferencex_path` / `hf_cache_path` from `shared_storage_path` when omitted.
 3. Sets `output_dir` to `{results_dir}/{task_id}` (defaults under shared storage).
 
-The **driver** does not run vLLM/SGLang; it waits on `ray.get` and maps the returned dict into `BenchmarkResult` (`benchmarker._populate_result_from_ray`).
+The driver does not run vLLM/SGLang; it waits on `ray.get` and maps the returned dict into `BenchmarkResult` (`benchmarker._populate_result_from_ray`).
 
-### 7.3 Tensor parallelism and multi-node hints
+### Tensor parallelism and multi-node hints
 
-For **benchmark** payloads only, `run_task` calls `_configure_tp_isolation`:
+For benchmark payloads only, `run_task` calls `_configure_tp_isolation`:
 
-- If `TP` ≤ GPUs on the current Ray node, it **clears `RAY_ADDRESS`** in the worker env and can append vLLM `--distributed-executor-backend mp` so the child uses local multiprocessing.
-- If `TP` exceeds local GPUs, it **keeps `RAY_ADDRESS`** and can append vLLM `--distributed-executor-backend ray` or SGLang `--use-ray --nnodes N`.
+- If `TP` ≤ GPUs on the current Ray node, it clears `RAY_ADDRESS` in the worker env and can append vLLM `--distributed-executor-backend mp` so the child uses local multiprocessing.
+- If `TP` exceeds local GPUs, it keeps `RAY_ADDRESS` and can append vLLM `--distributed-executor-backend ray` or SGLang `--use-ray --nnodes N`.
 
 Logic is in `Magpie/remote/tasks.py` (`_get_local_gpu_count`, `_configure_tp_isolation`). Tune `EXTRA_VLLM_ARGS` / `EXTRA_SGLANG_ARGS` in YAML if you need overrides.
 
-### 7.4 Entry points on the driver
+### Entry points on the driver
 
 | Entry | Location | Behavior |
 |-------|-----------|----------|
 | CLI / MCP `benchmark` with YAML `run_mode: ray` | `BenchmarkMode.run()` | Creates `RayJobExecutor`, `execute(task)`, fills `BenchmarkResult`. |
-| `Scheduler.run_benchmark_ray(...)` | `Magpie/core/scheduler.py` | Injects `run_mode: ray` and minimal `ray_config`, then **`BenchmarkMode.run()`** (same core path; helper for programmatic use). |
+| `Scheduler.run_benchmark_ray(...)` | `Magpie/core/scheduler.py` | Injects `run_mode: ray` and minimal `ray_config`, then `BenchmarkMode.run()` (same core path; helper for programmatic use). |
 
-### 7.5 Example YAML
+### Example YAML
 
 `examples/benchmarks/benchmark_vllm_dsr1_ray.yaml` shows a full vLLM benchmark targeting Ray with shared storage and `install_magpie: false`.
 
 ---
 
-## 8. Scheduling, GPUs, and the outer `num_gpus=0` task
+## Scheduling, GPUs, and the outer `num_gpus=0` task
 
-`RayJobExecutor._submit_ray_task` declares the remote function with **`num_gpus=0`** but pins it to a node that has GPUs using **`NodeAffinitySchedulingStrategy`** (`ray.util.scheduling_strategies`). `RayJobExecutor._find_gpu_node` prefers a **non-head** GPU worker when possible.
+`RayJobExecutor._submit_ray_task` declares the remote function with `num_gpus=0` but pins it to a node that has GPUs using `NodeAffinitySchedulingStrategy` (`ray.util.scheduling_strategies`). `RayJobExecutor._find_gpu_node` prefers a non-head GPU worker when possible.
 
-**Why `num_gpus=0`?** Frameworks such as vLLM may spawn their own Ray or multiprocessing workers and need to control `CUDA_VISIBLE_DEVICES` / `HIP_VISIBLE_DEVICES`. Reserving all GPUs on the outer task would conflict with that model.
+Why `num_gpus=0`? Frameworks such as vLLM might spawn their own Ray or multiprocessing workers and need to control `CUDA_VISIBLE_DEVICES` / `HIP_VISIBLE_DEVICES`. Reserving all GPUs on the outer task would conflict with that model.
 
-After the task starts, `_clear_hidden_gpus` removes Ray-imposed **empty** visibility env vars so **child processes** see the node GPUs again (`Magpie/remote/tasks.py`).
+After the task starts, `_clear_hidden_gpus` removes Ray-imposed empty visibility env vars so child processes see the node GPUs again (`Magpie/remote/tasks.py`).
 
-> **`gpu_selection` is disabled under Ray.** The benchmark YAML's
-> `gpu_selection` block (auto idle-GPU picker) is a no-op when
-> `run_mode: ray` — Ray schedules devices itself via `num_gpus`, and a
-> driver-side `rocm-smi` / `nvidia-smi` scan does not reflect worker nodes.
-> To restrict the cluster to specific cards, export
-> `ROCR_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` in the shell **before**
-> starting `ray start` on each node.
+```{note}
+`gpu_selection` is disabled under Ray. The benchmark YAML's
+`gpu_selection` block (auto idle-GPU picker) is a no-op when
+`run_mode: ray` — Ray schedules devices itself via `num_gpus`, and a
+driver-side `rocm-smi` / `nvidia-smi` scan does not reflect worker nodes.
+To restrict the cluster to specific cards, export
+`ROCR_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES` in the shell before
+starting `ray start` on each node.
+```
 
 ---
 
-## 9. Asynchronous benchmark submit (MCP / advanced)
+## Asynchronous benchmark submission (MCP and advanced use)
 
-`BenchmarkMode.submit_ray_benchmark(executor)` submits a Ray task **without** blocking; MCP and scripts can poll with the same `RayJobExecutor` instance. Related MCP tools (see `Magpie/mcp/server.py`):
+`BenchmarkMode.submit_ray_benchmark(executor)` submits a Ray task without blocking; MCP and scripts can poll with the same `RayJobExecutor` instance. Related MCP tools (see `Magpie/mcp/server.py`):
 
 - `ray_task_status` — running / succeeded / failed
 - `ray_task_result` — cached result dict after completion
@@ -224,7 +194,7 @@ After the task starts, `_clear_hidden_gpus` removes Ray-imposed **empty** visibi
 
 ---
 
-## 10. `RayConfig` reference
+## `RayConfig` reference
 
 Defined in `Magpie/modes/benchmark/config.py` (`@dataclass RayConfig`).
 
@@ -232,7 +202,7 @@ Defined in `Magpie/modes/benchmark/config.py` (`@dataclass RayConfig`).
 |-------|------|
 | `cluster_address` | `ray.init(address=…)` |
 | `shared_storage_path` | Root for HF cache, InferenceX, results layout on workers |
-| `entrypoint_num_cpus` | CPUs requested on the **outer** remote task |
+| `entrypoint_num_cpus` | CPUs requested on the outer remote task |
 | `entrypoint_num_gpus` | Declared on `RayConfig` but outer task uses `num_gpus=0` in code |
 | `multi_node`, `total_num_gpus`, `num_nodes`, `gpus_per_node` | Used when building `runtime_env` (`RAY_ADDRESS`, `MAGPIE_TOTAL_GPUS`) for multi-node scenarios |
 | `pip_packages` | Extra pip specs in `runtime_env` |
@@ -242,22 +212,8 @@ Defined in `Magpie/modes/benchmark/config.py` (`@dataclass RayConfig`).
 
 Derived helpers: `results_dir`, `hf_cache_dir`, `inferencex_dir`.
 
----
-
-## 11. Troubleshooting
-
-| Symptom | Things to check |
-|---------|------------------|
-| `ray.init` fails | Firewall, wrong address, Ray version mismatch; try `ray://host:10001` from remote drivers. |
-| `No GPU node found in the Ray cluster` | Workers not started with GPUs; head-only cluster; GPU resources zero in `ray.nodes()`. |
-| Analyze fails on worker: missing sources | `${CK_HOME}` or paths not on worker or NFS; build artifacts not present on worker. |
-| Worker import errors for Magpie | Set `install_magpie: true` or bake Magpie into the worker image; check `runtime_env` pip logs. |
-| Benchmark TP / Ray backend wrong | Inspect `_configure_tp_isolation` logs; set `EXTRA_VLLM_ARGS` / `EXTRA_SGLANG_ARGS` explicitly. |
-| Empty GPU visibility in child | Should be fixed by `_clear_hidden_gpus`; if not, inspect env in InferenceX subprocess. |
-
----
 
-## 12. Source map
+## Source map
 
 | Area | File |
 |------|------|
@@ -271,7 +227,9 @@ Derived helpers: `results_dir`, `hf_cache_dir`, `inferencex_dir`.
 
 ---
 
-## 13. Related material
+## More info
 
-- [benchmark.md](benchmark.md) — Benchmark mode, TraceLens, gap analysis, `run_mode` overview.
-- [analyze-compare.md](analyze-compare.md) — Kernel analyze vs compare.
+- [Benchmark frameworks with Magpie](benchmarking/benchmark.md) — run modes, TraceLens analysis, gap analysis, and GPU selection
+- [Analyze and compare kernels with Magpie](analyze-compare.md) — kernel evaluation modes that run on Ray via `scheduler.environment: ray`
+- [Ray documentation](https://docs.ray.io/) — cluster setup, job submission, and runtime environments
+- [Magpie API reference](../reference/api-reference.md) — MCP Ray task tools (`ray_task_status`, `ray_task_result`, `ray_task_cancel`)
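
Note on `docs/how-to/ray.md`: the rule in "Tensor parallelism and multi-node hints" can be sketched as a small pure function. This is only an illustrative sketch based on the prose in the doc, not the code in `Magpie/remote/tasks.py`. The names `TpPlan` and `plan_tp_isolation`, the `num_nodes` parameter, and the way extra arguments are returned are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TpPlan:
    """Hypothetical result type: what the worker should do before launching the framework."""
    keep_ray_address: bool                      # False means RAY_ADDRESS is cleared in the worker env
    extra_args: list[str] = field(default_factory=list)


def plan_tp_isolation(framework: str, tp: int, local_gpus: int, num_nodes: int = 1) -> TpPlan:
    """Sketch of the documented rule; the real logic lives in _configure_tp_isolation."""
    if tp <= local_gpus:
        # TP fits on this node: drop RAY_ADDRESS so the child uses local multiprocessing.
        extra = ["--distributed-executor-backend", "mp"] if framework == "vllm" else []
        return TpPlan(keep_ray_address=False, extra_args=extra)

    # TP exceeds local GPUs: keep RAY_ADDRESS so the framework can span nodes through Ray.
    if framework == "vllm":
        extra = ["--distributed-executor-backend", "ray"]
    elif framework == "sglang":
        # num_nodes is supplied by the caller here; how N is derived is not stated in the doc.
        extra = ["--use-ray", "--nnodes", str(num_nodes)]
    else:
        extra = []
    return TpPlan(keep_ray_address=True, extra_args=extra)
```

For example, `plan_tp_isolation("vllm", tp=8, local_gpus=8)` would clear `RAY_ADDRESS` and select the `mp` backend, while `tp=16` on the same node would keep `RAY_ADDRESS` and select the `ray` backend.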
diff --git a/docs/index.rst b/docs/index.rst
index fe01156..3eeea3a 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -16,49 +16,6 @@ Magpie is a component of the Hyperloom toolkit. The Magpie source code is
 hosted in the `AMD-AGI/Magpie <https://github.com/AMD-AGI/Magpie>`_ GitHub
 repository.
 
-What Magpie does
-================
-
-Magpie organizes kernel evaluation into three modes:
-
-* **Analyze** -- Evaluate a single kernel against a testcase for correctness,
-  then optionally profile its performance.
-* **Compare** -- Evaluate and rank multiple kernel implementations against a
-  baseline to find the fastest correct variant.
-* **Benchmark** -- Run framework-level benchmarks (vLLM, SGLang, Atom) with
-  optional torch and system profiling, including TraceLens trace analysis and
-  kernel-level gap analysis.
-
-Key features
-============
-
-* **Three evaluation modes**: analyze, compare, and benchmark.
-* **Heterogeneous hardware**: AMD (HIP) and NVIDIA (CUDA) GPUs.
-* **Multiple execution environments**: local host, sandboxed container, and
-  remote Ray cluster.
-* **Hardware-aware evaluation**: controlled execution with optional power and
-  frequency settings.
-* **Automatic GPU selection**: benchmark mode picks idle GPUs before launching.
-* **Trace analysis**: TraceLens integration for performance profiling and gap
-  analysis.
-* **MCP server**: Model Context Protocol integration for AI agents.
-* **Structured reports**: JSON output for pipeline integration.
-
-Use cases
-=========
-
-* Validate hand-written or AI-generated GPU kernels for correctness before
-  promoting them.
-* Rank competing kernel implementations to pick the fastest correct one.
-* Benchmark and profile LLM inference frameworks on AMD GPUs and locate the
-  kernels that dominate runtime.
-* Drive kernel evaluation from an AI agent through the MCP server.
-
-Documentation
-=============
-
-The Magpie documentation is organized into the following categories.
-
 .. grid:: 2
    :gutter: 3
 
@@ -66,27 +23,30 @@ The Magpie documentation is organized into the following categories.
 
       * :doc:`Install Magpie <install/install>`
 
-   .. grid-item-card:: Reference
-
-      * :doc:`Release notes <reference/release-notes>`
-      * :doc:`Compatibility matrix <reference/compatibility-matrix>`
-      * :doc:`API reference <reference/api-reference>`
-
    .. grid-item-card:: How to
 
       * :doc:`Analyze and compare kernels <how-to/analyze-compare>`
-      * :doc:`Benchmark frameworks <how-to/benchmark>`
+      * :doc:`Benchmark frameworks <how-to/benchmarking/benchmark>`
       * :doc:`Run on a Ray cluster <how-to/ray>`
       * :doc:`MCP server and agent skills <how-to/mcp-and-skills>`
       * :doc:`Find kernel sources <how-to/kernel-source-finder>`
+
+   .. grid-item-card:: Conceptual
+
+      * :doc:`Benchmarking architecture <conceptual/benchmarking-architecture>`
+      * :doc:`Ray architecture <conceptual/ray-architecture>`
 
    .. grid-item-card:: Examples
 
       * :doc:`Examples <examples/examples>`
 
-   .. grid-item-card:: About
+   .. grid-item-card:: Reference
 
-      * :doc:`License <about/license>`
+      * :doc:`Release notes <reference/release-notes>`
+      * :doc:`Compatibility matrix <reference/compatibility-matrix>`
+      * :doc:`API reference <reference/api-reference>`
+      * :doc:`Benchmark configuration <reference/benchmark-config>`
+      * :doc:`Troubleshooting <reference/troubleshooting>`
 
 To contribute to the documentation, see the
 `Magpie GitHub repository <https://github.com/AMD-AGI/Magpie>`_.
diff --git a/docs/install/install.md b/docs/install/install.md
index b6a22fd..558fde4 100644
--- a/docs/install/install.md
+++ b/docs/install/install.md
@@ -1,6 +1,13 @@
+---
+myst:
+    html_meta:
+        "description": "Install Magpie from GitHub, from source, or with make on AMD ROCm or NVIDIA CUDA systems. Includes prerequisites, optional dependencies, and verification steps."
+        "keywords": "Magpie, install, ROCm, HIP, CUDA, GPU, pip, Python, setup, benchmark, kernel evaluation"
+---
+
 # Install Magpie
 
-This page provides step-by-step instructions to install Magpie and verify the
+This topic provides step-by-step instructions to install Magpie and verify the
 installation on your system. Choose the method that best fits your workflow:
 install directly from GitHub (recommended for most users), install from source
 (recommended for development), or install into a managed virtual environment
@@ -11,7 +18,7 @@ with `make`.
 Before installing Magpie, make sure your system meets the following
 requirements:
 
-- Python 3.10 or newer
+- Python 3.10 or later
 - `pip` (and optionally `git` for source/GitHub installs)
 - An AMD ROCm (HIP) or NVIDIA CUDA toolchain, if you plan to compile or profile
   GPU kernels
@@ -23,7 +30,7 @@ For the full list of verified hardware and software versions, see the
 
 ## Install from GitHub (recommended)
 
-This installs the latest published Magpie package and its core dependencies.
+This installs the latest published Magpie package and its core dependencies:
 
 ```bash
 pip install git+https://github.com/AMD-AGI/Magpie.git
@@ -125,8 +132,10 @@ profiling features are unavailable, but the CLI still installs and runs.
 
 ## Next steps
 
-- Run your first evaluation with the [Analyze and compare](../how-to/analyze-compare.md)
+After installing Magpie, explore the following guides to get started.
+
+- Run your first evaluation with the [Analyze and compare kernels with Magpie](../how-to/analyze-compare.md)
   guide.
-- Benchmark an inference framework with the [Benchmark](../how-to/benchmark.md)
+- Benchmark an inference framework with the [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md)
   guide.
-- Browse the [Examples](../examples/examples.md) for end-to-end walkthroughs.
+- Browse the [Magpie examples](../examples/examples.md) for end-to-end walkthroughs.
diff --git a/docs/reference/api-reference.md b/docs/reference/api-reference.md
index 7829362..c3cf3cc 100644
--- a/docs/reference/api-reference.md
+++ b/docs/reference/api-reference.md
@@ -1,4 +1,11 @@
-# API reference
+---
+myst:
+    html_meta:
+        "description": "Reference documentation for the Magpie CLI, YAML configuration files, and MCP server tools, including command options, parameters, and usage examples."
+        "keywords": "Magpie, API reference, CLI, MCP server, YAML configuration, magpie analyze, magpie compare, magpie benchmark, GPU kernel"
+---
+
+# Magpie API reference
 
 Magpie exposes three public interfaces: a command-line interface (CLI), a set of
 configuration files, and a Model Context Protocol (MCP) server for AI agents.
@@ -14,16 +21,16 @@ You can invoke the CLI either as `magpie <command>` or as
 
 ### Global options
 
-These options apply to every command and are passed before the subcommand.
+The following options apply to every command and are passed before the subcommand.
 
 | Option | Type | Default | Description |
 | --- | --- | --- | --- |
-| `--config`, `-c` | path | `Magpie/config.yaml` | Framework configuration file. |
-| `--verbose`, `-v` | flag | off | Enable verbose (DEBUG) logging. |
-| `--gpu-info` | flag | off | Print detected GPU and toolchain info, then exit. |
-| `--environment`, `-e` | `local` \| `container` \| `ray` | `local` | Execution environment. |
-| `--workers`, `-w` | int | from config | Number of concurrent workers. |
-| `--docker-image` | string | from config | Docker image for the container environment. |
+| `--config`, `-c` | path | `Magpie/config.yaml` | Framework configuration file |
+| `--verbose`, `-v` | flag | off | Enable verbose (DEBUG) logging |
+| `--gpu-info` | flag | off | Print detected GPU and toolchain info, then exit |
+| `--environment`, `-e` | `local` \| `container` \| `ray` | `local` | Execution environment |
+| `--workers`, `-w` | int | from config | Number of concurrent workers |
+| `--docker-image` | string | from config | Docker image for the container environment |
 
 Example:
 
@@ -43,13 +50,13 @@ magpie analyze [kernels ...] [options]
 
 | Argument / option | Type | Default | Description |
 | --- | --- | --- | --- |
-| `kernels` | path(s) | none | Kernel source file(s) to analyze. |
-| `--kernel-config`, `-k` | path | none | Kernel configuration file (alternative to positional kernels). |
-| `--testcase`, `-t` | string | none | Testcase command used to verify correctness. |
-| `--type` | `hip` \| `cuda` \| `pytorch` \| `triton` | `hip` | Kernel type. |
-| `--compile-cmd` | string | auto | Custom compile command. |
-| `--no-perf` | flag | off | Skip performance profiling. |
-| `--output-dir`, `-o` | path | `./results` | Output directory for reports. |
+| `kernels` | path(s) | none | Kernel source file(s) to analyze |
+| `--kernel-config`, `-k` | path | none | Kernel configuration file (alternative to positional kernels) |
+| `--testcase`, `-t` | string | none | Testcase command used to verify correctness |
+| `--type` | `hip` \| `cuda` \| `pytorch` \| `triton` | `hip` | Kernel type |
+| `--compile-cmd` | string | auto | Custom compile command |
+| `--no-perf` | flag | off | Skip performance profiling |
+| `--output-dir`, `-o` | path | `./results` | Output directory for reports |
 
 Example:
 
@@ -67,13 +74,13 @@ magpie compare [kernels ...] [options]
 
 | Argument / option | Type | Default | Description |
 | --- | --- | --- | --- |
-| `kernels` | path(s) | none | Kernel files to compare. |
-| `--kernel-config`, `-k` | path | none | Kernel configuration file. |
-| `--testcase`, `-t` | string | none | Testcase command (optional). |
-| `--type` | `hip` \| `cuda` \| `pytorch` \| `triton` | `hip` | Kernel type. |
-| `--baseline` | int | `0` | Index of the baseline kernel. |
-| `--no-perf` | flag | off | Skip performance profiling. |
-| `--output-dir`, `-o` | path | `./results` | Output directory for reports. |
+| `kernels` | path(s) | none | Kernel files to compare |
+| `--kernel-config`, `-k` | path | none | Kernel configuration file |
+| `--testcase`, `-t` | string | none | Testcase command (optional) |
+| `--type` | `hip` \| `cuda` \| `pytorch` \| `triton` | `hip` | Kernel type |
+| `--baseline` | int | `0` | Index of the baseline kernel |
+| `--no-perf` | flag | off | Skip performance profiling |
+| `--output-dir`, `-o` | path | `./results` | Output directory for reports |
 
 Example:
 
@@ -92,22 +99,22 @@ magpie benchmark [framework] [options]
 
 | Argument / option | Type | Default | Description |
 | --- | --- | --- | --- |
-| `framework` | `vllm` \| `sglang` \| `atom` | none | Framework to benchmark. |
-| `--benchmark-config`, `-b` | path | none | Benchmark configuration file. |
-| `--model`, `-m` | string | none | Model name or path. |
-| `--precision`, `-p` | `fp8` \| `fp16` \| `bf16` \| `fp4` | `fp8` | Model precision. |
-| `--tp` | int | `1` | Tensor parallel size. |
-| `--concurrency` | int | `32` | Request concurrency. |
-| `--input-len` | int | `1024` | Input sequence length. |
-| `--output-len` | int | `512` | Output sequence length. |
-| `--torch-profiler` | flag | off | Enable the torch profiler. |
-| `--system-profiler` | flag | off | Enable the system profiler (rocprof/ncu). |
-| `--run-mode` | `docker` \| `local` | `docker` | Run in a container or directly on the host. |
-| `--docker-image` | string | from config | Override the Docker image. |
-| `--inferencex-path` | string | auto-clone | Path to an InferenceX installation. |
-| `--benchmark-script` | string | none | InferenceX benchmark script name. |
-| `--timeout` | int | `3600` | Benchmark timeout in seconds. |
-| `--output-dir`, `-o` | path | `./results` | Output directory for reports. |
+| `framework` | `vllm` \| `sglang` \| `atom` | none | Framework to benchmark |
+| `--benchmark-config`, `-b` | path | none | Benchmark configuration file |
+| `--model`, `-m` | string | none | Model name or path |
+| `--precision`, `-p` | `fp8` \| `fp16` \| `bf16` \| `fp4` | `fp8` | Model precision |
+| `--tp` | int | `1` | Tensor parallel size |
+| `--concurrency` | int | `32` | Request concurrency |
+| `--input-len` | int | `1024` | Input sequence length |
+| `--output-len` | int | `512` | Output sequence length |
+| `--torch-profiler` | flag | off | Enable the torch profiler |
+| `--system-profiler` | flag | off | Enable the system profiler (rocprof/ncu) |
+| `--run-mode` | `docker` \| `local` | `docker` | Run in a container or directly on the host |
+| `--docker-image` | string | from config | Override the Docker image |
+| `--inferencex-path` | string | auto-clone | Path to an InferenceX installation |
+| `--benchmark-script` | string | none | InferenceX benchmark script name |
+| `--timeout` | int | `3600` | Benchmark timeout in seconds |
+| `--output-dir`, `-o` | path | `./results` | Output directory for reports |
 
 #### Standalone gap analysis options
 
@@ -116,16 +123,16 @@ traces instead of launching a benchmark.
 
 | Option | Type | Default | Description |
 | --- | --- | --- | --- |
-| `--trace-dir` | path | none | Run gap analysis on this `torch_trace` directory (or a workspace containing one). |
-| `--top-k` | int | `20` | Number of top bottleneck kernels to report. |
-| `--start-pct` | float | `0.0` | Start of the analysis window (0-100). |
-| `--end-pct` | float | `100.0` | End of the analysis window (0-100). |
-| `--min-duration-us` | float | `0.0` | Minimum event duration to include, in microseconds. |
-| `--categories` | string(s) | all | Event categories to include (for example, `kernel gpu`). |
-| `--ignore-categories` | string(s) | none | Event categories to exclude. |
-| `--no-rank-csv` | flag | off | Skip per-rank CSV generation. |
-| `--find-kernel-sources` | flag | off | Find kernel source files and test commands (AMD kernels). |
-| `--kernel-source-repos` | string(s) | none | Repository paths to search for kernel sources. |
+| `--trace-dir` | path | none | Run gap analysis on this `torch_trace` directory (or a workspace containing one) |
+| `--top-k` | int | `20` | Number of top bottleneck kernels to report |
+| `--start-pct` | float | `0.0` | Start of the analysis window (0-100) |
+| `--end-pct` | float | `100.0` | End of the analysis window (0-100) |
+| `--min-duration-us` | float | `0.0` | Minimum event duration to include, in microseconds |
+| `--categories` | string(s) | all | Event categories to include (for example, `kernel gpu`) |
+| `--ignore-categories` | string(s) | none | Event categories to exclude |
+| `--no-rank-csv` | flag | off | Skip per-rank CSV generation |
+| `--find-kernel-sources` | flag | off | Find kernel source files and test commands (AMD kernels) |
+| `--kernel-source-repos` | string(s) | none | Repository paths to search for kernel sources |
 
 Example:
 
@@ -145,28 +152,28 @@ for complete, commented examples.
 
 | Section | Purpose |
 | --- | --- |
-| `gpu` | Device selection and hardware control (power/frequency). |
-| `scheduler` | Execution environment (local, container, or Ray) and worker settings. |
-| `compiling` | Default compile behavior. |
-| `correctness` | Correctness backend (testcase or Accordo) and tolerances. |
-| `performance` | Profiler backend (`rocprof-compute`, `ncu`, Metrix), timeouts, and metric blocks. |
-| `compare` | Performance metric weights and winner selection for compare mode. |
-| `benchmark` | InferenceX path, image mapping, and default profiler flags. |
-| `logging` | Log levels and optional file output. |
+| `gpu` | Device selection and hardware control (power/frequency) |
+| `scheduler` | Execution environment (local, container, or Ray) and worker settings |
+| `compiling` | Default compile behavior |
+| `correctness` | Correctness backend (testcase or Accordo) and tolerances |
+| `performance` | Profiler backend (`rocprof-compute`, `ncu`, Metrix), timeouts, and metric blocks |
+| `compare` | Performance metric weights and winner selection for compare mode |
+| `benchmark` | InferenceX path, image mapping, and default profiler flags |
+| `logging` | Log levels and optional file output |
 
 ### Kernel configuration
 
-A kernel configuration file may define a single `kernel:` or multiple
+A kernel configuration file can define a single `kernel:` or multiple
 `kernels:`, plus optional override sections:
 
 | Key | Purpose |
 | --- | --- |
-| `kernel` | A single kernel entry (path, type, testcase, compile command). |
-| `kernels` | A list of kernel entries for compare mode. |
-| `performance` | Overrides framework-level profiler settings. |
-| `correctness` | Overrides framework-level correctness settings. |
-| `ray_config` | Ray cluster settings (implies `environment: ray`). |
-| `scheduler` | Scheduler-level overrides (environment, workers, and so on). |
+| `kernel` | A single kernel entry (path, type, testcase, compile command) |
+| `kernels` | A list of kernel entries for compare mode |
+| `performance` | Overrides framework-level profiler settings |
+| `correctness` | Overrides framework-level correctness settings |
+| `ray_config` | Ray cluster settings (implies `environment: ray`) |
+| `scheduler` | Scheduler-level overrides (environment, workers, and so on) |
 
 ## MCP server
 
@@ -182,21 +189,23 @@ All tools return JSON-formatted strings.
 
 ### Available tools
 
+The MCP server exposes the following tools to AI agents.
+
 | Tool | Description |
 | --- | --- |
-| `analyze` | Analyze a kernel for correctness and performance. |
-| `compare` | Compare multiple kernel implementations. |
-| `hardware_spec` | Query GPU hardware specifications. |
-| `configure_gpu` | Configure GPU power and frequency. |
-| `discover_kernels` | Scan a project and suggest analyzable kernels/configs. |
-| `suggest_optimizations` | Suggest performance optimizations from analyze output. |
-| `create_kernel_config` | Generate a kernel config YAML for analyze. |
-| `benchmark` | Run a vLLM/SGLang/Atom benchmark with optional profiling. |
-| `gap_analysis` | Run gap analysis on existing torch profiler traces. |
-| `list_benchmark_images` | List available Docker images per framework/architecture. |
-| `list_benchmark_results` | List previous benchmark workspaces and summaries. |
-| `get_benchmark_result` | Read detailed results from a specific benchmark run. |
-| `compare_benchmark_reports` | Compare TraceLens reports across benchmark runs. |
+| `analyze` | Analyze a kernel for correctness and performance |
+| `compare` | Compare multiple kernel implementations |
+| `hardware_spec` | Query GPU hardware specifications |
+| `configure_gpu` | Configure GPU power and frequency |
+| `discover_kernels` | Scan a project and suggest analyzable kernels/configs |
+| `suggest_optimizations` | Suggest performance optimizations from analyze output |
+| `create_kernel_config` | Generate a kernel config YAML for analyze |
+| `benchmark` | Run a vLLM/SGLang/Atom benchmark with optional profiling |
+| `gap_analysis` | Run gap analysis on existing torch profiler traces |
+| `list_benchmark_images` | List available Docker images per framework/architecture |
+| `list_benchmark_results` | List previous benchmark workspaces and summaries |
+| `get_benchmark_result` | Read detailed results from a specific benchmark run |
+| `compare_benchmark_reports` | Compare TraceLens reports across benchmark runs |
 
 ### Selected tool signatures
 
@@ -207,8 +216,8 @@ defaults unless noted.
 
 | Parameter | Type | Default | Description |
 | --- | --- | --- | --- |
-| `device_id` | int | `0` | GPU device ID to query. |
-| `include_all` | bool | `false` | Return info for all available GPUs. |
+| `device_id` | int | `0` | GPU device ID to query |
+| `include_all` | bool | `false` | Return info for all available GPUs |
 
 Returns JSON with vendor, architecture, power, clocks, temperature, and memory.
 
@@ -216,16 +225,16 @@ Returns JSON with vendor, architecture, power, clocks, temperature, and memory.
 
 | Parameter | Type | Default | Description |
 | --- | --- | --- | --- |
-| `kernel_path` | string | required | Path to the kernel source file (`.hip`, `.cu`, `.py`). |
-| `testcase_command` | string | required | Command to run the testcase. |
-| `kernel_type` | string | `hip` | `hip`, `cuda`, `pytorch`, or `triton`. |
-| `working_dir` | string | kernel's dir | Working directory. |
-| `compile_command` | string | auto | Custom compile command. |
-| `check_performance` | bool | `true` | Run performance profiling. |
-| `environment` | string | `local` | `local` or `container`. |
-| `performance_backend` | string | auto | `metrix`, `rocprof_compute`, or `ncu`. |
-| `correctness_backend` | string | auto | `accordo` or `testcase`. |
-| `accordo_*` | various | see defaults | Accordo kernel name, binaries, tolerances, and timeout. |
+| `kernel_path` | string | required | Path to the kernel source file (`.hip`, `.cu`, `.py`) |
+| `testcase_command` | string | required | Command to run the testcase |
+| `kernel_type` | string | `hip` | `hip`, `cuda`, `pytorch`, or `triton` |
+| `working_dir` | string | kernel's dir | Working directory |
+| `compile_command` | string | auto | Custom compile command |
+| `check_performance` | bool | `true` | Run performance profiling |
+| `environment` | string | `local` | `local` or `container` |
+| `performance_backend` | string | auto | `metrix`, `rocprof_compute`, or `ncu` |
+| `correctness_backend` | string | auto | `accordo` or `testcase` |
+| `accordo_*` | various | see defaults | Accordo kernel name, binaries, tolerances, and timeout |
 
 Returns JSON with `compiling_state`, `correctness_state`, `performance_state`, a
 `score` (0.0-1.0), and detailed per-stage results.
diff --git a/docs/reference/benchmark-config.md b/docs/reference/benchmark-config.md
new file mode 100644
index 0000000..e666d4b
--- /dev/null
+++ b/docs/reference/benchmark-config.md
@@ -0,0 +1,239 @@
+---
+myst:
+    html_meta:
+        "description": "Full YAML configuration reference for Magpie benchmark mode, including profiler settings, gap analysis, GPU selection, run modes, and environment variable options."
+        "keywords": "Magpie, benchmark configuration, YAML, TraceLens, gap analysis, GPU selection, vLLM, SGLang, ROCm, profiler"
+---
+
+# Magpie benchmark mode configuration
+
+Magpie benchmark mode is configured through YAML files that control the inference framework, model, request shape, profiling backends, GPU selection, and execution environment. You can also set individual values directly on the command line for quick experimentation. This topic provides a minimal starting configuration, a full annotated reference with every available option, a table of environment variables for request shape and profiling, and ready-to-run examples for common scenarios.
+
+## Configuration
+
+The following sections show a minimal configuration and the full set of options available under the top-level `benchmark:` key.
+
+### Minimal example
+
+The following configuration runs a basic vLLM benchmark with torch profiling enabled.
+
+```yaml
+benchmark:
+  framework: vllm              # "vllm", "sglang", or "atom"
+  model: deepseek-ai/DeepSeek-R1-0528
+  precision: fp8               # "fp8", "fp16", "bf16"
+  
+  envs:
+    TP: 8                      # Tensor parallelism
+    CONC: 32                   # Concurrency (num_prompts = CONC * 10)
+    ISL: 1024                  # Input sequence length
+    OSL: 1024                  # Output sequence length
+    
+  profiler:
+    torch_profiler:
+      enabled: true            # Generate torch profiling traces
+      
+  timeout_seconds: 3600
+```
+
+### Full configuration reference
+
+The following configuration lists every available option, shown with either its default or a representative value.
+
+```yaml
+benchmark:
+  # Framework selection
+  framework: vllm              # Required: "vllm", "sglang", or "atom"
+  model: <model_name>          # Required: HuggingFace model name/path
+  precision: fp8               # Optional: "fp8" (default), "fp16", "bf16"
+  
+  # Benchmark parameters
+  envs:
+    TP: 8                      # Tensor parallelism (GPU count)
+    CONC: 32                   # Request concurrency
+    ISL: 1024                  # Input sequence length
+    OSL: 1024                  # Output sequence length
+    RANDOM_RANGE_RATIO: 1      # Length randomization (0-1)
+    MAX_MODEL_LEN: 131072      # Max model context length
+    GPU_MEM_UTIL: 0.95         # GPU memory utilization (0-1)
+    ENABLE_PROFILE: "true"     # Enable profiling in benchmark script
+    
+  # Profiler configuration
+  profiler:
+    # PyTorch profiler (generates JSON traces)
+    torch_profiler:
+      enabled: true            # Sets VLLM_TORCH_PROFILER_DIR
+      
+    # System profiler (rocprof-compute / ncu)
+    system_profiler:
+      enabled: false
+      profile_args: []         # Additional profiler arguments
+      
+    # TraceLens trace analysis
+    tracelens:
+      enabled: true                 # Enable TraceLens analysis
+      analysis_mode: inference      # Optional, default: inference
+      analysis_stages: all          # Optional, default: all
+      auto_patch_runtime: true      # Optional, default: true for Docker runs
+      tracelens_repo_path: null     # Optional public TraceLens source checkout
+      cli_timeout_seconds: 2400     # TraceLens postprocess timeout per command
+      export_format: csv            # "csv" or "excel"
+      perf_report_enabled: true           # Single-rank performance report
+      multi_rank_report_enabled: true     # Multi-rank collective report
+      gpu_arch_config: null         # Optional: GPU arch config for roofline
+
+  # Gap analysis (kernel bottleneck report)
+  gap_analysis:
+    enabled: true              # Enable gap analysis after benchmark
+    trace_start_pct: 50        # Start of analysis window (0-100)
+    trace_end_pct: 80          # End of analysis window (0-100)
+    top_k: 20                  # Number of top kernels in report
+    min_duration_us: 0.0       # Filter out events shorter than this (us)
+    categories:                # Event category whitelist (default: [kernel, gpu])
+      - kernel
+      - gpu
+    ignore_categories:         # Event category blacklist (default: [gpu_user_annotation])
+      - gpu_user_annotation
+      
+  # Auto-pick idle GPU(s) before launching (enabled by default).
+  # See "Automatic GPU selection in Magpie's benchmark mode" (../how-to/benchmarking/automatic-gpu.md) for details.
+  gpu_selection:
+    auto: true                 # Default: true. Set false to disable.
+    min_free_memory_gb: 8.0    # Reject GPUs with less free VRAM
+    count: null                # Number of GPUs; null -> use envs.TP
+    candidates: null           # Optional whitelist of physical GPU ids
+
+  # Execution settings
+  run_mode: docker             # "docker" (default) or "local" (host / in-container)
+  docker_image: null           # Optional: override auto-selected image
+  gpu_arch: null               # Optional: force GPU architecture
+  timeout_seconds: 3600        # Benchmark timeout
+  
+  # Paths
+  inferencex_path: /path/to/InferenceX  # InferenceX installation
+  hf_cache_path: null          # HuggingFace cache directory
+  
+  # InferenceX specific
+  runner_type: mi300x          # Hardware runner type
+  benchmark_script: null       # Override benchmark script
+```
+
+## Environment variables
+
+Pass these variables under `benchmark.envs:` to control request shape, concurrency, memory usage, and profiling behavior.
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `TP` | Tensor parallelism (number of GPUs) | 1 |
+| `CONC` | Request concurrency | 32 |
+| `ISL` | Input sequence length | 1024 |
+| `OSL` | Output sequence length | 512 |
+| `RANDOM_RANGE_RATIO` | Length randomization ratio | 0.5 |
+| `MAX_MODEL_LEN` | Maximum model context length | - |
+| `GPU_MEM_UTIL` | GPU memory utilization | 0.95 |
+| `ENABLE_PROFILE` | Enable torch profiler | "false" |
+| `EXTRA_VLLM_ARGS` | Additional arguments passed to `vllm serve` | "" |
+
+## Examples
+
+The following example configurations cover common benchmark scenarios.
+
+### Quick profiling run
+
+Minimal configuration for fast trace collection:
+
+```yaml
+benchmark:
+  framework: vllm
+  model: deepseek-ai/DeepSeek-R1-0528
+  precision: fp8
+  
+  envs:
+    TP: 8
+    CONC: 4                    # Small concurrency for quick run
+    ISL: 128
+    OSL: 64
+    GPU_MEM_UTIL: 0.85
+    
+  profiler:
+    torch_profiler:
+      enabled: true
+    tracelens:
+      enabled: true
+      # analysis_mode defaults to inference
+      # analysis_stages defaults to all (prefilldecode, decode, prefill)
+      # auto_patch_runtime defaults to true for Docker runs
+      # tracelens_repo_path can point to a public TraceLens checkout
+      # cli_timeout_seconds defaults to 1800
+      export_format: csv
+      multi_rank_report_enabled: false  # Skip multi-rank for speed
+      
+  timeout_seconds: 1200
+```
+
+### Full production benchmark
+
+Full configuration with TraceLens inference analysis enabled across all stages:
+
+```yaml
+benchmark:
+  framework: vllm
+  model: deepseek-ai/DeepSeek-R1-0528
+  precision: fp8
+  
+  envs:
+    TP: 8
+    CONC: 64
+    ISL: 2048
+    OSL: 2048
+    MAX_MODEL_LEN: 131072
+    
+  profiler:
+    torch_profiler:
+      enabled: true
+    tracelens:
+      enabled: true
+      analysis_mode: inference
+      analysis_stages: all
+      auto_patch_runtime: true
+      # tracelens_repo_path: /path/to/TraceLens
+      cli_timeout_seconds: 2400
+      export_format: csv
+      perf_report_enabled: true
+      multi_rank_report_enabled: true
+      
+  timeout_seconds: 7200
+```
+
+### SGLang benchmark
+
+Basic SGLang benchmark with torch profiler enabled:
+
+```yaml
+benchmark:
+  framework: sglang
+  model: meta-llama/Llama-3.1-70B-Instruct
+  precision: fp16
+  
+  envs:
+    TP: 4
+    CONC: 32
+    ISL: 1024
+    OSL: 512
+    
+  profiler:
+    torch_profiler:
+      enabled: true
+      
+  timeout_seconds: 3600
+```
+
+## More info
+
+See the following pages for related concepts, how-to guidance, and reference material.
+
+- [Benchmark frameworks with Magpie](../how-to/benchmarking/benchmark.md) — how-to guide covering run modes, TraceLens analysis, gap analysis, and automatic GPU selection
+- [Magpie benchmarking mode architecture](../conceptual/benchmarking-architecture.md) — how the benchmark pipeline components interact
+- [Run Magpie on a Ray cluster](../how-to/ray.md) — running benchmarks on remote GPU nodes using `run_mode: ray`
+- [Magpie API reference](api-reference.md) — CLI options for `magpie benchmark` and standalone gap analysis
+- [Magpie troubleshooting](troubleshooting.md) — solutions for common benchmark errors
\ No newline at end of file
diff --git a/docs/reference/compatibility-matrix.md b/docs/reference/compatibility-matrix.md
index e9115ea..ec0ee79 100644
--- a/docs/reference/compatibility-matrix.md
+++ b/docs/reference/compatibility-matrix.md
@@ -1,6 +1,13 @@
+---
+myst:
+    html_meta:
+        "description": "Verified hardware and software compatibility for Magpie, including Python versions, ROCm and CUDA toolchains, profilers, GPU hardware, and supported inference frameworks."
+        "keywords": "Magpie, compatibility matrix, ROCm, CUDA, AMD Instinct, Python, vLLM, SGLang, rocprof-compute, GPU requirements"
+---
+
 # Compatibility matrix
 
-This page lists the known version requirements for Magpie. It covers hardware
+This topic lists the known version requirements for Magpie. It covers hardware
 and software requirements and is intended to capture only versions that have
 been verified and tested.
 
@@ -12,6 +19,8 @@ remain in this table.
 
 ## Software requirements
 
+The following Python packages and operating systems are required or tested with Magpie.
+
 | Component | Supported / tested versions | Notes |
 | --- | --- | --- |
 | Python | 3.10, 3.11, 3.12, 3.13 | Declared in `pyproject.toml`. Minimum is 3.10. |
@@ -22,6 +31,8 @@ remain in this table.
 
 ## GPU toolchains
 
+The following GPU compute toolchains are supported for kernel compilation and profiling.
+
 | Toolchain | Supported / tested versions | Notes |
 | --- | --- | --- |
 | AMD ROCm (HIP) | TODO (verify) | Required for HIP kernel compilation and profiling on AMD GPUs. |
@@ -29,6 +40,8 @@ remain in this table.
 
 ## Profilers and optional tools
 
+The following profiling tools and optional packages extend Magpie's capabilities.
+
 | Tool | Supported / tested versions | Notes |
 | --- | --- | --- |
 | `rocprof-compute` | >= 3.40 | AMD performance profiling. See the [install guide](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/install/core-install.html). |
@@ -40,9 +53,11 @@ remain in this table.
 
 ## Hardware
 
+Magpie has been tested on the following GPU hardware.
+
 | Hardware | Supported / tested | Notes |
 | --- | --- | --- |
-| AMD Instinct GPUs | TODO (verify) | List verified architectures (for example, MI300 series). |
+| AMD Instinct™ GPUs | TODO (verify) | List verified architectures (for example, MI300 series). |
 | NVIDIA GPUs | TODO (verify) | List verified architectures. |
 
 ## Framework benchmark compatibility
diff --git a/docs/reference/release-notes.md b/docs/reference/release-notes.md
index 63cd1ad..73d1a78 100644
--- a/docs/reference/release-notes.md
+++ b/docs/reference/release-notes.md
@@ -1,6 +1,13 @@
+---
+myst:
+    html_meta:
+        "description": "Release notes for Magpie, a GPU kernel evaluation framework. Lists features added in each release, including evaluation modes, profiling backends, and MCP integration."
+        "keywords": "Magpie, release notes, changelog, GPU kernel, ROCm, HIP, CUDA, vLLM, SGLang, TraceLens, MCP server"
+---
+
 # Release notes
 
-This page summarizes the features included in each Magpie release. For the
+This topic summarizes the features included in each Magpie release. For the
 initial release, the notes provide an overview of all features available in the
 tool.
 
@@ -12,6 +19,8 @@ NVIDIA hardware.
 
 ### Evaluation modes
 
+Magpie 0.1.0 ships with three evaluation modes.
+
 - **Analyze**: Single-kernel evaluation against a testcase, with optional
   performance profiling.
 - **Compare**: Multi-kernel comparison and ranking against a configurable
@@ -21,6 +30,8 @@ NVIDIA hardware.
 
 ### Hardware and execution
 
+This release supports the following hardware and execution environments.
+
 - Support for AMD (HIP/ROCm) and NVIDIA (CUDA) GPUs.
 - Three execution environments: local host, sandboxed container, and remote
   Ray cluster.
@@ -30,10 +41,14 @@ NVIDIA hardware.
 
 ### Kernel types
 
+The following kernel types are supported for compilation and evaluation.
+
 - HIP, CUDA, PyTorch, and Triton kernels.
 
 ### Profiling and trace analysis
 
+The following profiling and trace analysis capabilities are included.
+
 - Pluggable performance profiler backends: `rocprof-compute` (AMD), `ncu`
   (NVIDIA), and IntelliKit Metrix.
 - Optional correctness validation via testcase or IntelliKit Accordo.
@@ -45,6 +60,8 @@ NVIDIA hardware.
 
 ### Integration
 
+Magpie integrates with the following external systems and workflows.
+
 - Model Context Protocol (MCP) server exposing Magpie capabilities to AI
   agents.
 - Agent skill packaging for environments without MCP.
@@ -52,6 +69,8 @@ NVIDIA hardware.
 
 ### Configuration
 
+Magpie provides the following configuration mechanisms.
+
 - Framework-level configuration via `config.yaml`.
 - Per-evaluation kernel configuration files for analyze and compare modes.
 - Benchmark configuration files for framework benchmarks.
diff --git a/docs/reference/troubleshooting.md b/docs/reference/troubleshooting.md
new file mode 100644
index 0000000..bbf9454
--- /dev/null
+++ b/docs/reference/troubleshooting.md
@@ -0,0 +1,43 @@
+---
+myst:
+    html_meta:
+        "description": "Troubleshoot common Magpie issues including GPU memory errors, Docker permission problems, TraceLens installation, and timeout configuration."
+        "keywords": "Magpie, troubleshooting, GPU memory error, Docker, TraceLens, timeout, ROCm, benchmark, debug"
+---
+
+# Magpie troubleshooting
+
+This topic covers common Magpie errors and debugging techniques. Each section lists errors and their solutions in a table so you can quickly find the issue you're seeing. For benchmark problems not listed here, enable verbose logging with `--log-level DEBUG` and check the output before filing a bug report.
+
+## Benchmarking mode
+
+### Common issues
+
+The following errors are frequently reported in benchmark mode.
+
+| Error | Solution |
+|-------|----------|
+| `ValueError: Free memory on device (...) is less than desired GPU memory utilization` | Reduce `GPU_MEM_UTIL` in config (for example, `0.85`). |
+| `docker: permission denied` | Add your user to the `docker` group, or run the command with `sudo`. |
+| `Required TraceLens inference CLI command(s) not found on PATH` | Applies to `run_mode: local` or classic host post-processing. TraceLens auto-installs on first run. If issues persist, run: `pip install git+https://github.com/AMD-AIG-AIMA/TraceLens.git`. If `TL_EXTENSION=TraceLens_NDA` is set, install the matching internal extension package. For `run_mode: docker`, commands are resolved from the runtime image. |
+| Timeout during model loading | Large models (for example, DeepSeek-R1) might need longer timeouts. Set `timeout_seconds: 7200` in your benchmark config. |
+| `gpu_selection.auto failed: ...` | Not enough idle GPUs on the host. Free a GPU, lower `gpu_selection.min_free_memory_gb`, narrow `gpu_selection.candidates`, or pin manually via `envs.ROCR_VISIBLE_DEVICES` (AMD) / `envs.CUDA_VISIBLE_DEVICES` (NVIDIA). See [Automatic GPU selection in Magpie's benchmark mode](../how-to/benchmarking/automatic-gpu.md). |
+
+### Debug mode
+
+Enable verbose logging to get more detailed output from the benchmark run.
+
+```bash
+python -m Magpie benchmark --benchmark-config config.yaml --log-level DEBUG
+```
+
+## Ray on Magpie
+
+| Error | Likely cause or solution |
+|---------|------------------|
+| `ray.init` fails | Check for a firewall block, a wrong address, or a Ray version mismatch. From remote drivers, try `ray://host:10001`. |
+| `No GPU node found in the Ray cluster` | Workers were not started with GPUs, the cluster is head-only, or `ray.nodes()` reports zero GPU resources. |
+| Analyze fails on worker: missing sources | `${CK_HOME}` or the referenced paths are not available on the worker or on NFS, or build artifacts are not present on the worker. |
+| Worker import errors for Magpie | Set `install_magpie: true` or bake Magpie into the worker image; check `runtime_env` pip logs. |
+| Benchmark TP / Ray backend wrong | Inspect `_configure_tp_isolation` logs; set `EXTRA_VLLM_ARGS` / `EXTRA_SGLANG_ARGS` explicitly. |
+| Empty GPU visibility in child | `_clear_hidden_gpus` should already fix this. If it still occurs, inspect the environment of the InferenceX subprocess. |
\ No newline at end of file
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index a786d0c..1b88942 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -8,39 +8,54 @@ defaults:
 root: index
 
 subtrees:
-  - caption: Install
-    entries:
-      - file: install/install
-        title: Install Magpie
-
-  - caption: Reference
-    entries:
-      - file: reference/release-notes
-        title: Release notes
-      - file: reference/compatibility-matrix
-        title: Compatibility matrix
-      - file: reference/api-reference
-        title: API reference
-
-  - caption: How to
-    entries:
-      - file: how-to/analyze-compare
-        title: Analyze and compare kernels
-      - file: how-to/benchmark
-        title: Benchmark frameworks
-      - file: how-to/ray
-        title: Run on a Ray cluster
-      - file: how-to/mcp-and-skills
-        title: MCP server and agent skills
-      - file: how-to/kernel-source-finder
-        title: Find kernel sources
-
-  - caption: Examples
-    entries:
-      - file: examples/examples
-        title: Examples
-
-  - caption: About
-    entries:
-      - file: about/license
-        title: License
+- entries:
+  - file: what-is-magpie
+    title: What is Magpie?
+  - file: reference/release-notes
+    title: Release notes
+  - file: reference/compatibility-matrix
+    title: Compatibility matrix
+
+- caption: Install
+  entries:
+    - file: install/install
+      title: Install Magpie
+
+- caption: How to
+  entries:
+    - file: how-to/analyze-compare
+      title: Analyze and compare kernels
+    - file: how-to/benchmarking/benchmark
+      title: Benchmark frameworks
+    - file: how-to/ray
+      title: Run on a Ray cluster
+    - file: how-to/mcp-and-skills
+      title: MCP server and agent skills
+    - file: how-to/kernel-source-finder
+      title: Find kernel sources
+
+- caption: Conceptual
+  entries:
+    - file: conceptual/benchmarking-architecture
+      title: Benchmarking architecture
+    - file: conceptual/ray-architecture
+      title: Ray architecture
+
+- caption: Examples
+  entries:
+    - file: examples/examples
+      title: Examples
+
+- caption: Reference
+  entries:
+    - file: reference/api-reference
+      title: API reference
+    - file: reference/benchmark-config
+      title: Benchmark configuration
+    - file: reference/troubleshooting
+      title: Troubleshooting
+
+- caption: About
+  entries:
+    - file: about/license
+      title: License
diff --git a/docs/what-is-magpie.rst b/docs/what-is-magpie.rst
new file mode 100644
index 0000000..b09f14a
--- /dev/null
+++ b/docs/what-is-magpie.rst
@@ -0,0 +1,66 @@
+.. meta::
+   :description: Magpie is a lightweight, general-purpose framework for evaluating GPU kernel correctness and performance on AMD and NVIDIA GPUs.
+   :keywords: Magpie, ROCm, GPU, kernel, evaluation, benchmark, HIP, CUDA, profiling, AMD
+
+***************
+What is Magpie?
+***************
+
+Magpie is a lightweight, general-purpose framework for evaluating GPU kernel
+correctness and performance. It provides a single, reproducible workflow for
+checking that a kernel is correct, comparing competing implementations, and
+benchmarking full inference frameworks, on both AMD (HIP/ROCm) and NVIDIA
+(CUDA) hardware.
+
+Magpie is a component of the Hyperloom toolkit. The Magpie source code is
+hosted in the `AMD-AGI/Magpie <https://github.com/AMD-AGI/Magpie>`_ GitHub
+repository.
+
+What Magpie does
+================
+
+Magpie organizes kernel evaluation into three modes:
+
+* **Analyze** -- Evaluate a single kernel against a testcase for correctness,
+  then optionally profile its performance.
+* **Compare** -- Evaluate and rank multiple kernel implementations against a
+  baseline to find the fastest correct variant.
+* **Benchmark** -- Run framework-level benchmarks (vLLM, SGLang, Atom) with
+  optional torch and system profiling, including TraceLens trace analysis and
+  kernel-level gap analysis.
+
+Key features
+============
+
+Magpie provides the following capabilities:
+
+* **Three evaluation modes**: analyze, compare, and benchmark.
+* **Heterogeneous hardware**: AMD (HIP) and NVIDIA (CUDA) GPUs.
+* **Multiple execution environments**: local host, sandboxed container, and
+  remote Ray cluster.
+* **Hardware-aware evaluation**: controlled execution with optional power and
+  frequency settings.
+* **Automatic GPU selection**: benchmark mode picks idle GPUs before launching.
+* **Trace analysis**: TraceLens integration for performance profiling and gap
+  analysis.
+* **MCP server**: Model Context Protocol integration for AI agents.
+* **Structured reports**: JSON output for pipeline integration.
+
+Use cases
+=========
+
+* Validate hand-written or AI-generated GPU kernels for correctness before
+  promoting them.
+* Rank competing kernel implementations to pick the fastest correct one.
+* Benchmark and profile LLM inference frameworks on AMD GPUs and locate the
+  kernels that dominate runtime.
+
+Next steps
+==========
+
+To get started with Magpie, see the following pages.
+
+* :doc:`Install Magpie <install/install>` — get up and running in minutes.
+* :doc:`Analyze and compare kernels <how-to/analyze-compare>` — validate a kernel and rank competing implementations.
+* :doc:`Benchmark frameworks <how-to/benchmarking/benchmark>` — run vLLM, SGLang, or Atom benchmarks with trace analysis.
+* :doc:`MCP server and agent skills <how-to/mcp-and-skills>` — drive Magpie from an AI agent.
\ No newline at end of file
diff --git a/examples/benchmarks/benchmark_atom_dsr1.yaml b/examples/benchmarks/benchmark_atom_dsr1.yaml
index eaf5933..4b2851c 100644
--- a/examples/benchmarks/benchmark_atom_dsr1.yaml
+++ b/examples/benchmarks/benchmark_atom_dsr1.yaml
@@ -54,7 +54,7 @@ benchmark:
     top_k: 20
 
   # Optional: auto-pick idle GPU(s) before launching. Enabled by default;
-  # uncomment to override (see docs/how-to/benchmark.md#automatic-gpu-selection).
+  # uncomment to override (see docs/how-to/benchmarking/benchmark.md#automatic-gpu-selection).
   # gpu_selection:
   #   auto: true
   #   min_free_memory_gb: 8.0
diff --git a/examples/benchmarks/benchmark_vllm_dsr1.yaml b/examples/benchmarks/benchmark_vllm_dsr1.yaml
index f7819b9..fcc2ba2 100644
--- a/examples/benchmarks/benchmark_vllm_dsr1.yaml
+++ b/examples/benchmarks/benchmark_vllm_dsr1.yaml
@@ -39,7 +39,7 @@ benchmark:
     top_k: 100
 
   # Optional: auto-pick idle GPU(s) before launching. Enabled by default;
-  # uncomment to override (see docs/how-to/benchmark.md#automatic-gpu-selection).
+  # uncomment to override (see docs/how-to/benchmarking/benchmark.md#automatic-gpu-selection).
   # gpu_selection:
   #   auto: true
   #   min_free_memory_gb: 8.0
diff --git a/skills/magpie/SKILL.md b/skills/magpie/SKILL.md
index 319bb34..74f5829 100644
--- a/skills/magpie/SKILL.md
+++ b/skills/magpie/SKILL.md
@@ -81,7 +81,7 @@ magpie benchmark --benchmark-config examples/benchmarks/benchmark_vllm_dsr1.yaml
 - `--run-mode`: `docker` (default) or `local`.
 - `--docker-image`, `--timeout`, `-o`: Override image, timeout (seconds), output dir.
 
-Example configs: [examples/benchmarks/benchmark_vllm_dsr1.yaml](examples/benchmarks/benchmark_vllm_dsr1.yaml), [docs/how-to/benchmark.md](docs/how-to/benchmark.md).
+Example configs: [examples/benchmarks/benchmark_vllm_dsr1.yaml](examples/benchmarks/benchmark_vllm_dsr1.yaml), [docs/how-to/benchmarking/benchmark.md](docs/how-to/benchmarking/benchmark.md).
 
 ## Gap analysis (standalone)
 
diff --git a/skills/magpie/reference.md b/skills/magpie/reference.md
index eb423f5..4ac81a4 100644
--- a/skills/magpie/reference.md
+++ b/skills/magpie/reference.md
@@ -100,4 +100,4 @@ kernels:
 
 ## Benchmark config YAML
 
-Top-level key `benchmark:` with `framework`, `model`, `precision`, `envs`, `profiler`, `gap_analysis`, `timeout_seconds`, etc. See `examples/benchmarks/benchmark_vllm_dsr1.yaml` and `docs/how-to/benchmark.md`.
+Top-level key `benchmark:` with `framework`, `model`, `precision`, `envs`, `profiler`, `gap_analysis`, `timeout_seconds`, etc. See `examples/benchmarks/benchmark_vllm_dsr1.yaml` and `docs/how-to/benchmarking/benchmark.md`.