From 9c3f790f4ee7ea9a9bbf854bfb91323ab3745717 Mon Sep 17 00:00:00 2001
From: learning-chip <jiawei.zhuang@outlook.com>
Date: Wed, 27 May 2026 15:01:00 +0000
Subject: [PATCH 1/3] Add Ascend950 pure-vector simulator examples for SiLU and
 SwiGLU.

Introduce a self-contained a5_sim harness with dav-c310-vec kernels, msprof
and cannsim runners, scale-ladder timing docs, and OMP thread sweep scripts.

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 examples/a5_sim/README.md                 | 125 +++++++
 examples/a5_sim/cannsim_vs_msprof.md      |  86 +++++
 examples/a5_sim/common/__init__.py        |   1 +
 examples/a5_sim/common/build.py           | 172 ++++++++++
 examples/a5_sim/common/host_info.py       |  46 +++
 examples/a5_sim/common/torch_runtime.py   |  56 ++++
 examples/a5_sim/configs/scale_ladder.json |   7 +
 examples/a5_sim/kernels/silu_a5.cpp       | 122 +++++++
 examples/a5_sim/kernels/swiglu_a5.cpp     | 375 +++++++++++++++++++++
 examples/a5_sim/outputs/.gitignore        |   2 +
 examples/a5_sim/run_cannsim.sh            |  60 ++++
 examples/a5_sim/run_cannsim_entry.sh      |   6 +
 examples/a5_sim/run_msprof.sh             |  30 ++
 examples/a5_sim/run_thread_sweep.sh       | 123 +++++++
 examples/a5_sim/vec_sim.py                | 389 ++++++++++++++++++++++
 15 files changed, 1600 insertions(+)
 create mode 100644 examples/a5_sim/README.md
 create mode 100644 examples/a5_sim/cannsim_vs_msprof.md
 create mode 100644 examples/a5_sim/common/__init__.py
 create mode 100644 examples/a5_sim/common/build.py
 create mode 100644 examples/a5_sim/common/host_info.py
 create mode 100644 examples/a5_sim/common/torch_runtime.py
 create mode 100644 examples/a5_sim/configs/scale_ladder.json
 create mode 100644 examples/a5_sim/kernels/silu_a5.cpp
 create mode 100644 examples/a5_sim/kernels/swiglu_a5.cpp
 create mode 100644 examples/a5_sim/outputs/.gitignore
 create mode 100755 examples/a5_sim/run_cannsim.sh
 create mode 100755 examples/a5_sim/run_cannsim_entry.sh
 create mode 100755 examples/a5_sim/run_msprof.sh
 create mode 100755 examples/a5_sim/run_thread_sweep.sh
 create mode 100644 examples/a5_sim/vec_sim.py

diff --git a/examples/a5_sim/README.md b/examples/a5_sim/README.md
new file mode 100644
index 00000000..2315ebbf
--- /dev/null
+++ b/examples/a5_sim/README.md
@@ -0,0 +1,125 @@
+# A5 Pure-Vector Simulator Examples (SiLU + SwiGLU)
+
+Self-contained **Ascend950PR** pure-vector PTO kernels with **msprof op simulator** and **cannsim record** harnesses. Use these to validate A5 simulator plumbing before tackling mix kernels (see [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)).
+
+Kernels compile with `--cce-aicore-arch=dav-c310-vec` and `-DREGISTER_BASE`. For the 910B `chunk_h` simulator benchmark (different arch), see [`megagdn-pto/benchmarks/simulator/README.md`](../../megagdn-pto/benchmarks/simulator/README.md).
+
+## Prerequisites
+
+```bash
+source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
+export PTO_LIB_PATH=/path/to/pto-isa   # or megagdn-pto/third_party/pto-isa
+pip install torch torch-npu
+```
+
+Build kernels:
+
+```bash
+cd pto-kernels/examples/a5_sim
+python3 -m common.build --all
+```
+
+## Quick start
+
+```bash
+# Correctness smoke (msprof)
+./run_msprof.sh --kernel silu --mode correctness --num-elements 128 --label smoke
+./run_msprof.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke
+
+# Same under cannsim
+./run_cannsim.sh --kernel silu --mode correctness --num-elements 128 --label smoke
+./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke
+
+# Scale ladder timing
+./run_msprof.sh --kernel silu --mode sweep --skip-correctness \
+  --output-json outputs/silu_sweep_msprof.json
+./run_thread_sweep.sh   # OMP sweep, T=512, both tools
+```
+
+## Host environment
+
+Measured on **Kunpeng-920** (HUAWEI Kunpeng 920 5250), **192 logical CPUs** (4 sockets × 48 cores, 1 thread/core), **aarch64**, CANN **9.0.0**, May 2026.
+
+## Simulator time cost summary
+
+Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/msprof or cannsim startup). **T** = output element count (same ladder labels as the 910B `chunk_h` benchmark). **Correctness PASS** at smoke shape on both tools (PyTorch CPU reference).
+
+### SiLU — msprof (`Ascend950PR_9599`)
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **52 s** | 406 ms |
+| tiny | 512 | **24 s** | 48 ms |
+| small | 1024 | **26 s** | 25 ms |
+| varlen_2x512 | 1024 | **26 s** | 26 ms |
+| medium | 4096 | **29 s** | 7.1 ms |
+
+### SiLU — cannsim (`Ascend950`)
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **42 s** | 331 ms |
+| tiny | 512 | **15 s** | 30 ms |
+| small | 1024 | **17 s** | 17 ms |
+| varlen_2x512 | 1024 | **16 s** | 16 ms |
+| medium | 4096 | **17 s** | 4.1 ms |
+
+### SwiGLU — msprof
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **75 s** | 588 ms |
+| tiny | 512 | **49 s** | 95 ms |
+| small | 1024 | **61 s** | 59 ms |
+| varlen_2x512 | 1024 | **47 s** | 46 ms |
+| medium | 4096 | **52 s** | 13 ms |
+
+### SwiGLU — cannsim
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **52 s** | 403 ms |
+| tiny | 512 | **27 s** | 52 ms |
+| small | 1024 | **29 s** | 28 ms |
+| varlen_2x512 | 1024 | **21 s** | 21 ms |
+| medium | 4096 | **22 s** | 5.4 ms |
+
+**Scaling law (approximate):**
+
+- Fixed overhead **~15–75 s** at T=128 dominates smoke; do not extrapolate from smoke alone.
+- After startup, cost scales **roughly linearly with T** at ~**0.005–0.06 s/element** on cannsim and ~**0.007–0.06 s/element** on msprof for T≥512.
+- **Varlen vs fixed length** at the same T: negligible (1024 tokens: SiLU msprof 26 s vs 26 s).
+- Pure-vector kernels finish in **minutes** on the default ladder; contrast with mix `chunk_h_mini` v1 (scalar matmul, 35+ min timeouts).
+
+### vs CPU thread count (OMP)
+
+Fixed workload **T=512** (SiLU), swept `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS` together:
+
+| OMP threads | msprof mean (s) | speedup vs 1 | cannsim mean (s) | speedup vs 1 |
+|-------------|-----------------|--------------|------------------|--------------|
+| 1 | 39.5 | 1.00× | 31.6 | 1.00× |
+| 2 | 44.0 | 0.90× | 35.3 | 0.90× |
+| 4 | 41.7 | 0.95× | 34.4 | 0.92× |
+| 8 | 41.4 | 0.95× | 35.1 | 0.90× |
+| 16 | 44.6 | 0.89× | 31.4 | 1.01× |
+| 32 | 42.2 | 0.93× | 32.1 | 0.99× |
+
+**Conclusion:** host OMP thread env vars change simulator wall time by at most **~±11%** (msprof) and **~±12%** (cannsim). Tuning `OMP_NUM_THREADS` is not an effective lever; PEM uses internal worker pools.
+
+## Layout
+
+```
+examples/a5_sim/
+├── kernels/silu_a5.cpp, swiglu_a5.cpp
+├── vec_sim.py                 # driver (--kernel silu|swiglu)
+├── common/build.py            # dav-c310-vec build
+├── run_msprof.sh / run_cannsim.sh / run_thread_sweep.sh
+├── configs/scale_ladder.json
+└── outputs/                   # gitignored results
+```
+
+## References
+
+- A5 PTO ST tests: `megagdn-pto/third_party/pto-isa/tests/npu/a5/src/st/testcase`
+- A2 originals: `examples/jit_cpp/silu_dynamic`, `csrc/kernel/kernel_swiglu.cpp`
+- Tool comparison: [`cannsim_vs_msprof.md`](cannsim_vs_msprof.md)
diff --git a/examples/a5_sim/cannsim_vs_msprof.md b/examples/a5_sim/cannsim_vs_msprof.md
new file mode 100644
index 00000000..393cc7d9
--- /dev/null
+++ b/examples/a5_sim/cannsim_vs_msprof.md
@@ -0,0 +1,86 @@
+# SiLU / SwiGLU — msprof vs cannsim (Ascend950 / dav-c310-vec)
+
+Pure-vector A5 examples for **`pto-kernels/examples/a5_sim`**. Recommended first step for Ascend950 simulator validation before mix kernels in [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim).
+
+## Executive summary
+
+| Aspect | msprof op simulator | cannsim record |
+|--------|---------------------|----------------|
+| SoC flag | `Ascend950PR_9599` | `Ascend950` |
+| AICore arch | `dav-c310-vec` | `dav-c310-vec` |
+| Correctness (smoke) | **PASS** (SiLU T=128, SwiGLU T=128) | **PASS** (same shapes) |
+| Invocation | Wraps `python3 vec_sim.py` directly | Executable `run_cannsim_entry.sh` + `-u "..."` |
+| Typical smoke wall | SiLU ~26 s, SwiGLU ~54 s | SiLU ~14 s, SwiGLU ~26 s |
+| Exit code | 0 on success | May return non-zero after **teardown segfault** even when JSON is valid |
+
+## Tool overview
+
+**msprof** preloads the CA model via `LD_PRELOAD` and runs Python + ctypes kernel launch (same pattern as [`ptoisa-a5-test/tests/torch_sim`](../../ptoisa-a5-test/tests/torch_sim/msprof_mechanism.md)).
+
+**cannsim** runs a standalone entry script under full SoC simulation. User args pass via `-u "--kernel silu --mode ..."`, not trailing argv.
+
+## Correctness
+
+| Kernel | Shape | msprof | cannsim | Reference |
+|--------|-------|--------|---------|-----------|
+| SiLU | T=128 | PASS | PASS | `x * sigmoid(x)` on CPU |
+| SwiGLU | batch=1, input_n=256 (T=128 out) | PASS | PASS | split + SiLU gate × value on CPU |
+
+Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (simulator rejects many dynamic NPU ops).
+
+## Speed comparison (scale ladder, timing-only sweep)
+
+**SiLU msprof vs cannsim** (seconds, wall clock):
+
+| label | T | msprof | cannsim | ratio msprof/cannsim |
+|-------|---|--------|---------|----------------------|
+| smoke | 128 | 52 | 42 | 1.2× |
+| tiny | 512 | 24 | 15 | 1.6× |
+| small | 1024 | 26 | 17 | 1.5× |
+| medium | 4096 | 29 | 17 | 1.7× |
+
+**SwiGLU msprof vs cannsim**:
+
+| label | T | msprof | cannsim | ratio |
+|-------|---|--------|---------|-------|
+| smoke | 128 | 75 | 52 | 1.4× |
+| tiny | 512 | 49 | 27 | 1.8× |
+| small | 1024 | 61 | 29 | 2.1× |
+| medium | 4096 | 52 | 22 | 2.4× |
+
+cannsim is generally **faster** on wall clock for these pure-vector kernels once T≥512; msprof carries heavier profiling/injection overhead.
+
+## Failure modes
+
+| Issue | Mitigation |
+|-------|------------|
+| `torch.randn` on NPU under sim | Create tensors on CPU, `.to("npu:0")` |
+| Reference ops on NPU fail | Compare `y.cpu()` vs CPU PyTorch ref |
+| cannsim segfault on exit | JSON is still written; `run_cannsim.sh` accepts valid `--output-json` |
+| A5 `pipe_barrier(PIPE_V)` compile error | Use `PIPE_ALL` in SwiGLU compute path |
+| `Stride` ambiguous on A5 | Qualify as `pto::Stride<...>` |
+
+## Invocation examples
+
+```bash
+cd pto-kernels/examples/a5_sim
+source $ASCEND_HOME_PATH/bin/setenv.bash
+export PTO_LIB_PATH=/path/to/pto-isa
+
+MSPROF_TIMEOUT=30 ./run_msprof.sh --kernel silu --mode sweep --skip-correctness \
+  --output-json outputs/silu_sweep_msprof.json
+
+./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 \
+  --output-json outputs/smoke_swiglu_cannsim.json
+```
+
+## Recommendations
+
+1. **Start with SiLU** (simplest 1D pipeline) under msprof smoke correctness.
+2. Use **cannsim** for faster scale sweeps once smoke passes.
+3. Use **mix chunk_h_mini** only after pure-vector path is green ([`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)).
+
+## References
+
+- Harness README: [`README.md`](README.md)
+- 910B chunk_h comparison: [`megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md`](../../megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md)
diff --git a/examples/a5_sim/common/__init__.py b/examples/a5_sim/common/__init__.py
new file mode 100644
index 00000000..afc5bf56
--- /dev/null
+++ b/examples/a5_sim/common/__init__.py
@@ -0,0 +1 @@
+"""Shared helpers for A5 pure-vector simulator examples."""
diff --git a/examples/a5_sim/common/build.py b/examples/a5_sim/common/build.py
new file mode 100644
index 00000000..075522a3
--- /dev/null
+++ b/examples/a5_sim/common/build.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env python3
+"""Bisheng build helper for A5 pure-vector kernels (dav-c310-vec, REGISTER_BASE)."""
+
+from __future__ import annotations
+
+import argparse
+import os
+import shutil
+import subprocess
+import sys
+from pathlib import Path
+
+A5_SIM_ROOT = Path(__file__).resolve().parent.parent
+KERNEL_DIR = A5_SIM_ROOT / "kernels"
+BUILD_DIR = A5_SIM_ROOT / "build"
+
+KERNELS = {
+    "silu": {
+        "source": "silu_a5.cpp",
+        "lib": "libsilu_a5.so",
+    },
+    "swiglu": {
+        "source": "swiglu_a5.cpp",
+        "lib": "libswiglu_a5.so",
+    },
+}
+
+
+def _pto_include_root() -> Path:
+    env = os.environ.get("PTO_LIB_PATH")
+    if env:
+        candidate = Path(env)
+        if (candidate / "include" / "pto" / "pto-inst.hpp").is_file():
+            return candidate / "include"
+        if (candidate / "pto" / "pto-inst.hpp").is_file():
+            return candidate
+    ascend = os.environ.get("ASCEND_HOME_PATH") or os.environ.get("ASCEND_TOOLKIT_HOME")
+    if ascend:
+        candidate = Path(ascend)
+        if (candidate / "include" / "pto" / "pto-inst.hpp").is_file():
+            return candidate / "include"
+    fallback = Path("/workdir/megagdn-pto/third_party/pto-isa/include")
+    if (fallback / "pto" / "pto-inst.hpp").is_file():
+        return fallback
+    raise EnvironmentError(
+        "PTO headers not found. Set PTO_LIB_PATH or source CANN setenv.bash."
+    )
+
+
+def _ascend_home() -> Path:
+    home = os.environ.get("ASCEND_HOME_PATH") or os.environ.get("ASCEND_TOOLKIT_HOME")
+    if not home:
+        raise EnvironmentError("ASCEND_HOME_PATH is not set. Source CANN setenv.bash first.")
+    return Path(home)
+
+
+def _bisheng() -> str:
+    ascend = _ascend_home()
+    candidate = ascend / "bin" / "bisheng"
+    if candidate.is_file():
+        return str(candidate)
+    found = shutil.which("bisheng")
+    if found:
+        return found
+    raise FileNotFoundError("bisheng compiler not found")
+
+
+def _common_includes() -> list[str]:
+    ascend = _ascend_home()
+    driver = os.environ.get("ASCEND_DRIVER_PATH", "/usr/local/Ascend/driver")
+    pto_root = _pto_include_root()
+    return [
+        f"-I{pto_root}",
+        f"-I{ascend}/include",
+        f"-I{driver}/kernel/inc",
+        f"-I{KERNEL_DIR}",
+    ]
+
+
+def _kernel_flags() -> list[str]:
+    ascend = _ascend_home()
+    return (
+        _common_includes()
+        + [
+            f"-I{ascend}/pkg_inc",
+            f"-I{ascend}/pkg_inc/profiling",
+            f"-I{ascend}/pkg_inc/runtime/runtime",
+            "-std=gnu++17",
+            "-O2",
+            "-Wno-macro-redefined",
+            "-Wno-ignored-attributes",
+            "-Wno-unknown-attributes",
+            "-fPIC",
+            "-xcce",
+            "-Xhost-start",
+            "-Xhost-end",
+            "-mllvm",
+            "-cce-aicore-stack-size=0x8000",
+            "-mllvm",
+            "-cce-aicore-function-stack-size=0x8000",
+            "-mllvm",
+            "-cce-aicore-record-overflow=true",
+            "-mllvm",
+            "-cce-aicore-addr-transform",
+            "-mllvm",
+            "-cce-aicore-dcci-insert-for-scalar=false",
+            "--cce-aicore-arch=dav-c310-vec",
+            "-DREGISTER_BASE",
+        ]
+    )
+
+
+def _run(cmd: list[str], cwd: Path) -> None:
+    print("==>", " ".join(cmd))
+    subprocess.run(cmd, cwd=cwd, check=True)
+
+
+def build_kernel(name: str, force: bool = False) -> Path:
+    if name not in KERNELS:
+        raise ValueError(f"unknown kernel: {name}")
+    spec = KERNELS[name]
+    BUILD_DIR.mkdir(parents=True, exist_ok=True)
+    out = BUILD_DIR / spec["lib"]
+    if out.is_file() and not force:
+        return out
+
+    src_path = KERNEL_DIR / spec["source"]
+    obj = BUILD_DIR / f"{src_path.stem}.o"
+    bisheng = _bisheng()
+    _run([bisheng, *_kernel_flags(), "-c", str(src_path), "-o", str(obj)], cwd=BUILD_DIR)
+    _run(
+        [
+            bisheng,
+            "-fPIC",
+            "-shared",
+            "--cce-fatobj-link",
+            "-Wl,-soname," + spec["lib"],
+            str(obj),
+            "-o",
+            str(out),
+        ],
+        cwd=BUILD_DIR,
+    )
+    print(f"Built {out}")
+    return out
+
+
+def build_all(force: bool = False) -> dict[str, Path]:
+    return {name: build_kernel(name, force=force) for name in KERNELS}
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Build A5 pure-vector example kernels")
+    parser.add_argument("--kernel", choices=tuple(KERNELS.keys()))
+    parser.add_argument("--all", action="store_true")
+    parser.add_argument("--force", action="store_true")
+    args = parser.parse_args()
+    try:
+        if args.all:
+            build_all(force=args.force)
+        elif args.kernel:
+            build_kernel(args.kernel, force=args.force)
+        else:
+            parser.print_help()
+            raise SystemExit(1)
+    except (EnvironmentError, FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc:
+        print(f"build failed: {exc}", file=sys.stderr)
+        raise SystemExit(1) from exc
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/a5_sim/common/host_info.py b/examples/a5_sim/common/host_info.py
new file mode 100644
index 00000000..f10a12b1
--- /dev/null
+++ b/examples/a5_sim/common/host_info.py
@@ -0,0 +1,46 @@
+"""Capture host CPU metadata for benchmark documentation."""
+
+from __future__ import annotations
+
+import json
+import subprocess
+from pathlib import Path
+
+
+def _run(cmd: list[str]) -> str:
+    try:
+        return subprocess.check_output(cmd, text=True, stderr=subprocess.STDOUT).strip()
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return ""
+
+
+def capture_host_cpu() -> dict:
+    lscpu = _run(["lscpu"])
+    info: dict[str, str | int] = {"lscpu_excerpt": "\n".join(lscpu.splitlines()[:12]) if lscpu else ""}
+    for line in lscpu.splitlines():
+        if ":" not in line:
+            continue
+        key, value = line.split(":", 1)
+        key = key.strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
+        info[key] = value.strip()
+    nproc = _run(["nproc"])
+    if nproc.isdigit():
+        info["logical_cpus"] = int(nproc)
+    return info
+
+
+def readme_cpu_snippet(info: dict) -> str:
+    model = info.get("model_name", "unknown")
+    cpus = info.get("cpu_s", info.get("cpus", "?"))
+    threads = info.get("thread_s_per_core", "?")
+    logical = info.get("logical_cpus", "?")
+    return (
+        f"Host CPU: **{model}**, {cpus} cores × {threads} threads/core, "
+        f"{logical} logical CPUs (from `lscpu`)."
+    )
+
+
+def write_host_info(path: Path) -> dict:
+    info = capture_host_cpu()
+    path.write_text(json.dumps(info, indent=2))
+    return info
diff --git a/examples/a5_sim/common/torch_runtime.py b/examples/a5_sim/common/torch_runtime.py
new file mode 100644
index 00000000..59ceb28e
--- /dev/null
+++ b/examples/a5_sim/common/torch_runtime.py
@@ -0,0 +1,56 @@
+"""torch_npu runtime helpers for Ascend950 simulator runs."""
+
+from __future__ import annotations
+
+import ctypes
+
+import numpy as np
+import torch
+import torch_npu  # noqa: F401
+
+_DEVICE = "npu:0"
+
+
+def init_torch_npu(device: str = _DEVICE) -> None:
+    global _DEVICE
+    _DEVICE = device
+    torch.npu.config.allow_internal_format = False
+    torch_npu.npu.set_compile_mode(jit_compile=False)
+    torch.npu.set_device(device)
+
+
+def npu_tensor(np_arr: np.ndarray) -> torch.Tensor:
+    return torch.from_numpy(np_arr).to(_DEVICE)
+
+
+def empty_npu(shape, dtype: torch.dtype) -> torch.Tensor:
+    return torch.empty(shape, dtype=dtype, device=_DEVICE)
+
+
+def zeros_npu(shape, dtype: torch.dtype) -> torch.Tensor:
+    """Allocate zero-filled NPU tensor via CPU NumPy (avoids ZerosLike op under simulators)."""
+    np_dtype = {
+        torch.float32: np.float32,
+        torch.float16: np.float16,
+        torch.int32: np.int32,
+        torch.int64: np.int64,
+    }.get(dtype)
+    if np_dtype is None:
+        raise TypeError(f"unsupported dtype for zeros_npu: {dtype}")
+    return npu_tensor(np.zeros(shape, dtype=np_dtype))
+
+
+def stream_ptr() -> int:
+    return torch.npu.current_stream()._as_parameter_  # noqa: SLF001
+
+
+def data_ptr(tensor: torch.Tensor) -> int:
+    return ctypes.c_void_p(tensor.data_ptr()).value or 0
+
+
+def sync() -> None:
+    torch.npu.synchronize()
+
+
+def to_numpy(tensor: torch.Tensor) -> np.ndarray:
+    return tensor.cpu().numpy()
diff --git a/examples/a5_sim/configs/scale_ladder.json b/examples/a5_sim/configs/scale_ladder.json
new file mode 100644
index 00000000..82df57ae
--- /dev/null
+++ b/examples/a5_sim/configs/scale_ladder.json
@@ -0,0 +1,7 @@
+[
+  {"label": "smoke", "n_seq": 1, "l_seg": 128},
+  {"label": "tiny", "n_seq": 1, "l_seg": 512},
+  {"label": "small", "n_seq": 1, "l_seg": 1024},
+  {"label": "varlen_2x512", "n_seq": 2, "l_seg": 512},
+  {"label": "medium", "n_seq": 4, "l_seg": 1024}
+]
diff --git a/examples/a5_sim/kernels/silu_a5.cpp b/examples/a5_sim/kernels/silu_a5.cpp
new file mode 100644
index 00000000..580af68c
--- /dev/null
+++ b/examples/a5_sim/kernels/silu_a5.cpp
@@ -0,0 +1,122 @@
+#include <pto/pto-inst.hpp>
+using namespace pto;
+
+constexpr uint32_t UB_ALLOC_BYTES = 48 * 1024;
+constexpr uint32_t ELEMENTS_PER_TILE = UB_ALLOC_BYTES / 2;
+
+constexpr unsigned X_PING = 0x00000;
+constexpr unsigned X_PONG = (X_PING + 0x8000 + 0x100);
+constexpr unsigned CAL_PING = 0x10000;
+constexpr unsigned CAL_PONG = (CAL_PING + 0x8000 + 0x100);
+
+// SiLU: y = x / (1 + exp(-x))
+template <typename T>
+AICORE void runTSilu(__gm__ T *y, __gm__ T *x, uint32_t num_elements) {
+#if defined(__DAV_VEC__)
+
+  const uint32_t num_cores = block_num;
+  const uint32_t elements_per_core =
+      (num_elements + num_cores - 1) / num_cores;
+  const uint32_t offset_this_core = elements_per_core * block_idx;
+
+  if (offset_this_core >= num_elements) return;
+
+  uint32_t elements_to_process = elements_per_core;
+  if (offset_this_core + elements_to_process > num_elements) {
+    elements_to_process = num_elements - offset_this_core;
+  }
+  if (elements_to_process == 0) return;
+
+  using ShapeDim5 = pto::Shape<1, 1, 1, 1, ELEMENTS_PER_TILE>;
+  using StridDim5 = pto::Stride<1, 1, 1, 1, 1>;
+  using GlobalData = pto::GlobalTensor<T, ShapeDim5, StridDim5>;
+
+  GlobalData xGlobal(x + offset_this_core);
+  GlobalData yGlobal(y + offset_this_core);
+
+  using TileData =
+      Tile<TileType::Vec, T, 1, ELEMENTS_PER_TILE, BLayout::RowMajor, -1, -1>;
+
+  set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
+  set_flag(PIPE_V, PIPE_MTE2, EVENT_ID1);
+  set_flag(PIPE_MTE3, PIPE_V, EVENT_ID0);
+  set_flag(PIPE_MTE3, PIPE_V, EVENT_ID1);
+
+  uint32_t x_offset = 0;
+  uint32_t y_offset = 0;
+
+  for (uint32_t num_processed = 0, ping = 1;
+       num_processed < elements_to_process;
+       num_processed += ELEMENTS_PER_TILE) {
+    const uint32_t remaining = elements_to_process - num_processed;
+    const uint32_t cur_cols =
+        (remaining >= ELEMENTS_PER_TILE) ? ELEMENTS_PER_TILE : remaining;
+
+    const int8_t buf = ping ? 0 : 1;
+    const event_t ev = ping ? (event_t)EVENT_ID0 : (event_t)EVENT_ID1;
+
+    TileData xTile(1, cur_cols);
+    TileData calTile(1, cur_cols);
+
+    if (buf == 0) {
+      TASSIGN(xTile, X_PING);
+      TASSIGN(calTile, CAL_PING);
+    } else {
+      TASSIGN(xTile, X_PONG);
+      TASSIGN(calTile, CAL_PONG);
+    }
+
+    TASSIGN(xGlobal, (x + offset_this_core + x_offset));
+    TASSIGN(yGlobal, (y + offset_this_core + y_offset));
+
+    wait_flag(PIPE_V, PIPE_MTE2, ev);
+    TLOAD(xTile, xGlobal);
+    pipe_barrier(PIPE_ALL);
+
+    set_flag(PIPE_MTE2, PIPE_V, ev);
+    wait_flag(PIPE_MTE2, PIPE_V, ev);
+
+    wait_flag(PIPE_MTE3, PIPE_V, ev);
+
+    TMULS(calTile, xTile, (T)-1);
+    pipe_barrier(PIPE_ALL);
+
+    TEXP(calTile, calTile);
+    pipe_barrier(PIPE_ALL);
+
+    TADDS(calTile, calTile, (T)1);
+    pipe_barrier(PIPE_ALL);
+
+    TDIV(calTile, xTile, calTile);
+    pipe_barrier(PIPE_ALL);
+
+    set_flag(PIPE_V, PIPE_MTE3, ev);
+    wait_flag(PIPE_V, PIPE_MTE3, ev);
+
+    TSTORE(yGlobal, calTile);
+    pipe_barrier(PIPE_ALL);
+
+    set_flag(PIPE_MTE3, PIPE_V, ev);
+    set_flag(PIPE_V, PIPE_MTE2, ev);
+
+    x_offset += cur_cols;
+    y_offset += cur_cols;
+    ping = 1 - ping;
+  }
+
+  wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
+  wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID1);
+  wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID0);
+  wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID1);
+#endif
+}
+
+__global__ AICORE void silu_custom(__gm__ void *x, __gm__ void *y,
+                                   uint32_t num_elements) {
+  runTSilu<half>((__gm__ half *)y, (__gm__ half *)x, num_elements);
+}
+
+extern "C" void call_kernel(uint32_t blockDim, void *stream, uint8_t *y,
+                            uint8_t *x, uint32_t num_elements) {
+  silu_custom<<<blockDim, nullptr, stream>>>(x, y, num_elements);
+}
diff --git a/examples/a5_sim/kernels/swiglu_a5.cpp b/examples/a5_sim/kernels/swiglu_a5.cpp
new file mode 100644
index 00000000..c23f7872
--- /dev/null
+++ b/examples/a5_sim/kernels/swiglu_a5.cpp
@@ -0,0 +1,375 @@
+/**
+Copyright (c) 2026 Huawei Technologies Co., Ltd.
+A5 pure-vector SwiGLU port (self-contained, dav-c310-vec).
+*/
+
+#include <pto/pto-inst.hpp>
+
+#ifndef GM_ADDR
+#define GM_ADDR __gm__ uint8_t*
+#endif
+
+using namespace pto;
+
+#define DIV_ROUNDUP(x, y) (((x) + (y) - 1) / (y))
+#define ALIGN_UP(x, y) (DIV_ROUNDUP((x), (y)) * (y))
+
+constexpr uint32_t UB_SLOT_BYTES = (192 * 1024) / 6;
+constexpr uint32_t X0_BUFFER_BYTES = UB_SLOT_BYTES;
+constexpr uint32_t X1_BUFFER_BYTES = UB_SLOT_BYTES;
+constexpr uint32_t Y_BUFFER_BYTES = UB_SLOT_BYTES;
+constexpr uint32_t ELEMENTS_PER_TILE = Y_BUFFER_BYTES / sizeof(half);
+constexpr uint32_t UB_USABLE_BYTES = 192 * 1024;
+constexpr uint32_t TILE_ALIGNMENT = 16;
+
+#define SWIGLU_FOR_EACH_COL_TILE(X) \
+  X(16)                             \
+  X(32)                             \
+  X(64)                             \
+  X(128)                            \
+  X(256)                            \
+  X(512)                            \
+  X(1024)                           \
+  X(2048)                           \
+  X(4096)                           \
+  X(8192)                           \
+  X(16384)
+
+#define SWIGLU_COL_VALUE(width) width,
+constexpr uint32_t COL_TILE_CANDIDATES[] = {
+    SWIGLU_FOR_EACH_COL_TILE(SWIGLU_COL_VALUE)};
+#undef SWIGLU_COL_VALUE
+constexpr uint32_t NUM_COL_TILE_CANDIDATES =
+    sizeof(COL_TILE_CANDIDATES) / sizeof(COL_TILE_CANDIDATES[0]);
+
+constexpr uint32_t TARGET_ACTIVE_TILES_DIVISOR = 2;
+
+constexpr unsigned X0_PING = 0x00000;
+constexpr unsigned X1_PING = X0_PING + X0_BUFFER_BYTES;
+constexpr unsigned Y_PING = X1_PING + X1_BUFFER_BYTES;
+constexpr unsigned X0_PONG = Y_PING + Y_BUFFER_BYTES;
+constexpr unsigned X1_PONG = X0_PONG + X0_BUFFER_BYTES;
+constexpr unsigned Y_PONG = X1_PONG + X1_BUFFER_BYTES;
+
+static_assert(UB_SLOT_BYTES * 6 == UB_USABLE_BYTES,
+              "SwiGLU UB slots must fully pack the usable UB budget.");
+static_assert(Y_PONG + Y_BUFFER_BYTES <= UB_USABLE_BYTES,
+              "SwiGLU UB layout exceeds usable UB.");
+
+#if defined(__DAV_VEC__)
+
+namespace {
+
+struct TileConfig {
+  uint32_t row_tile_len;
+  uint32_t col_tile_len;
+  uint32_t total_tiles;
+  uint32_t area;
+  bool meets_target;
+};
+
+struct TileWork {
+  uint32_t row_offset;
+  uint32_t col_offset;
+  uint32_t row_count;
+  uint32_t col_count;
+  uint32_t col_count_store;
+};
+
+AICORE inline uint32_t maxRowsForColTile(uint32_t col_tile_len) {
+  return ELEMENTS_PER_TILE / col_tile_len;
+}
+
+AICORE inline void initTilePipeFlags() {
+  set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
+  set_flag(PIPE_V, PIPE_MTE2, EVENT_ID1);
+  set_flag(PIPE_MTE3, PIPE_V, EVENT_ID0);
+  set_flag(PIPE_MTE3, PIPE_V, EVENT_ID1);
+}
+
+AICORE inline void drainTilePipeFlags() {
+  wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
+  wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID1);
+  wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID0);
+  wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID1);
+}
+
+AICORE inline TileConfig makeTileConfig(uint32_t batch, uint32_t output_n,
+                                        uint32_t num_cores,
+                                        uint32_t col_tile_len) {
+  const uint32_t max_rows = maxRowsForColTile(col_tile_len);
+  const uint32_t col_tiles = DIV_ROUNDUP(output_n, col_tile_len);
+  const uint32_t target_tiles =
+      max(1U, num_cores / TARGET_ACTIVE_TILES_DIVISOR);
+  const uint32_t min_row_tiles = max(1U, DIV_ROUNDUP(target_tiles, col_tiles));
+
+  uint32_t row_tile_len = min(batch, max_rows);
+  if (min_row_tiles > 1) {
+    const uint32_t capped = batch / min_row_tiles;
+    row_tile_len = min(row_tile_len, max(1U, capped));
+  }
+  if (row_tile_len == 0) {
+    row_tile_len = 1;
+  }
+
+  const uint32_t row_tiles = DIV_ROUNDUP(batch, row_tile_len);
+  const uint32_t total_tiles = row_tiles * col_tiles;
+  return TileConfig{
+      row_tile_len,
+      col_tile_len,
+      total_tiles,
+      row_tile_len * col_tile_len,
+      total_tiles >= target_tiles,
+  };
+}
+
+AICORE inline bool preferTileConfig(const TileConfig& cand,
+                                    const TileConfig& best) {
+  if (cand.meets_target != best.meets_target) {
+    return cand.meets_target;
+  }
+  if (cand.area != best.area) {
+    return cand.area > best.area;
+  }
+  if (cand.meets_target) {
+    if (cand.total_tiles != best.total_tiles) {
+      return cand.total_tiles < best.total_tiles;
+    }
+  } else if (cand.total_tiles != best.total_tiles) {
+    return cand.total_tiles > best.total_tiles;
+  }
+  if (cand.col_tile_len != best.col_tile_len) {
+    return cand.col_tile_len > best.col_tile_len;
+  }
+  return cand.row_tile_len > best.row_tile_len;
+}
+
+AICORE inline TileConfig chooseTileConfig(uint32_t batch, uint32_t output_n,
+                                          uint32_t num_cores) {
+  TileConfig best =
+      makeTileConfig(batch, output_n, num_cores, COL_TILE_CANDIDATES[0]);
+  for (uint32_t c = 1; c < NUM_COL_TILE_CANDIDATES; ++c) {
+    const TileConfig cand =
+        makeTileConfig(batch, output_n, num_cores, COL_TILE_CANDIDATES[c]);
+    if (preferTileConfig(cand, best)) {
+      best = cand;
+    }
+  }
+  return best;
+}
+
+AICORE inline TileWork makeTileWork(uint32_t global_tile_idx,
+                                    uint32_t row_tile_len,
+                                    uint32_t col_tile_len, uint32_t batch,
+                                    uint32_t output_n) {
+  const uint32_t col_tiles = DIV_ROUNDUP(output_n, col_tile_len);
+  const uint32_t row_tile_idx = global_tile_idx / col_tiles;
+  const uint32_t col_tile_idx = global_tile_idx % col_tiles;
+  const uint32_t row_offset = row_tile_idx * row_tile_len;
+  const uint32_t col_offset = col_tile_idx * col_tile_len;
+  const uint32_t actual_col = min(col_tile_len, output_n - col_offset);
+  return TileWork{
+      row_offset,
+      col_offset,
+      min(row_tile_len, batch - row_offset),
+      ALIGN_UP(actual_col, TILE_ALIGNMENT),
+      actual_col,
+  };
+}
+
+template <typename TileData, typename T>
+AICORE inline void computeSwiGLUTile(TileData& x0Tile, TileData& x1Tile,
+                                     TileData& yTile) {
+  TMULS(yTile, x0Tile, (T)-1);
+  pipe_barrier(PIPE_ALL);
+  TEXP(yTile, yTile);
+  pipe_barrier(PIPE_ALL);
+  TADDS(yTile, yTile, (T)1);
+  pipe_barrier(PIPE_ALL);
+  TDIV(yTile, x0Tile, yTile);
+  pipe_barrier(PIPE_ALL);
+  TMUL(yTile, yTile, x1Tile);
+  pipe_barrier(PIPE_ALL);
+}
+
+template <typename T, uint32_t kTileRows, uint32_t kTileCols>
+AICORE void issueTLoad(__gm__ T* x, uint32_t input_n, uint32_t output_n,
+                       const TileWork& tile, unsigned x0_base, unsigned x1_base,
+                       event_t ev) {
+  using TileShapeND = TileShape2D<T, DYNAMIC, DYNAMIC, Layout::ND>;
+  using DynStrideND = pto::Stride<1, 1, 1, DYNAMIC, 1>;
+  using GlobalData = GlobalTensor<T, TileShapeND, DynStrideND, Layout::ND>;
+  using TileData = Tile<TileType::Vec, T, kTileRows, kTileCols,
+                        BLayout::RowMajor, DYNAMIC, DYNAMIC>;
+
+  TileData x0Tile(tile.row_count, tile.col_count);
+  TileData x1Tile(tile.row_count, tile.col_count);
+  TASSIGN(x0Tile, x0_base);
+  TASSIGN(x1Tile, x1_base);
+
+  const uint32_t input_offset = tile.row_offset * input_n + tile.col_offset;
+  const TileShapeND shape(tile.row_count, tile.col_count_store);
+  const DynStrideND stride(input_n);
+
+  GlobalData x0Global(x + input_offset, shape, stride);
+  GlobalData x1Global(x + input_offset + output_n, shape, stride);
+
+  wait_flag(PIPE_V, PIPE_MTE2, ev);
+  wait_flag(PIPE_MTE3, PIPE_V, ev);
+  TLOAD(x0Tile, x0Global);
+  TLOAD(x1Tile, x1Global);
+  set_flag(PIPE_MTE2, PIPE_V, ev);
+}
+
+template <typename T, uint32_t kTileRows, uint32_t kTileCols>
+AICORE void issueTStore(__gm__ T* y, uint32_t output_n, const TileWork& tile,
+                        unsigned y_base, event_t ev) {
+  using TileShapeND = TileShape2D<T, DYNAMIC, DYNAMIC, Layout::ND>;
+  using DynStrideND = pto::Stride<1, 1, 1, DYNAMIC, 1>;
+  using GlobalData = GlobalTensor<T, TileShapeND, DynStrideND, Layout::ND>;
+  using TileData = Tile<TileType::Vec, T, kTileRows, kTileCols,
+                        BLayout::RowMajor, DYNAMIC, DYNAMIC>;
+
+  TileData yTile(tile.row_count, tile.col_count);
+  TASSIGN(yTile, y_base);
+
+  const uint32_t output_offset = tile.row_offset * output_n + tile.col_offset;
+  const TileShapeND shape(tile.row_count, tile.col_count_store);
+  const DynStrideND stride(output_n);
+  GlobalData yGlobal(y + output_offset, shape, stride);
+
+  set_flag(PIPE_V, PIPE_MTE3, ev);
+  wait_flag(PIPE_V, PIPE_MTE3, ev);
+  TSTORE(yGlobal, yTile);
+  set_flag(PIPE_MTE3, PIPE_V, ev);
+  set_flag(PIPE_V, PIPE_MTE2, ev);
+}
+
+template <uint32_t kTileCols, typename T>
+AICORE void runTSwiGLUTiled(__gm__ T* x, __gm__ T* y, uint32_t batch,
+                            uint32_t input_n, uint32_t num_cores, uint32_t vid,
+                            uint32_t row_tile_len) {
+  constexpr uint32_t kTileRows = ELEMENTS_PER_TILE / kTileCols;
+  static_assert(kTileRows * kTileCols == ELEMENTS_PER_TILE,
+                "2D tile shape must match the UB vector tile capacity.");
+
+  const uint32_t output_n = input_n >> 1;
+  const uint32_t col_tiles = DIV_ROUNDUP(output_n, kTileCols);
+  const uint32_t row_tiles = DIV_ROUNDUP(batch, row_tile_len);
+  const uint32_t total_tiles = row_tiles * col_tiles;
+  if (vid >= total_tiles) {
+    return;
+  }
+
+  using TileData = Tile<TileType::Vec, T, kTileRows, kTileCols,
+                        BLayout::RowMajor, DYNAMIC, DYNAMIC>;
+
+  initTilePipeFlags();
+
+  uint32_t current_tile_idx = vid;
+  TileWork current_tile =
+      makeTileWork(current_tile_idx, row_tile_len, kTileCols, batch, output_n);
+  bool ping = true;
+  issueTLoad<T, kTileRows, kTileCols>(x, input_n, output_n, current_tile,
+                                      X0_PING, X1_PING, (event_t)EVENT_ID0);
+
+  while (true) {
+    const event_t current_ev = ping ? (event_t)EVENT_ID0 : (event_t)EVENT_ID1;
+    const unsigned current_x0_base = ping ? X0_PING : X0_PONG;
+    const unsigned current_x1_base = ping ? X1_PING : X1_PONG;
+    const unsigned current_y_base = ping ? Y_PING : Y_PONG;
+
+    wait_flag(PIPE_MTE2, PIPE_V, current_ev);
+
+    TileWork next_tile{0, 0, 0, 0, 0};
+    const uint32_t next_tile_idx = current_tile_idx + num_cores;
+    const bool has_next = next_tile_idx < total_tiles;
+    if (has_next) {
+      next_tile =
+          makeTileWork(next_tile_idx, row_tile_len, kTileCols, batch, output_n);
+      const event_t next_ev = ping ? (event_t)EVENT_ID1 : (event_t)EVENT_ID0;
+      const unsigned next_x0_base = ping ? X0_PONG : X0_PING;
+      const unsigned next_x1_base = ping ? X1_PONG : X1_PING;
+      issueTLoad<T, kTileRows, kTileCols>(x, input_n, output_n, next_tile,
+                                          next_x0_base, next_x1_base, next_ev);
+    }
+
+    TileData x0Tile(current_tile.row_count, current_tile.col_count);
+    TileData x1Tile(current_tile.row_count, current_tile.col_count);
+    TileData yTile(current_tile.row_count, current_tile.col_count);
+    TASSIGN(x0Tile, current_x0_base);
+    TASSIGN(x1Tile, current_x1_base);
+    TASSIGN(yTile, current_y_base);
+
+    computeSwiGLUTile<TileData, T>(x0Tile, x1Tile, yTile);
+    issueTStore<T, kTileRows, kTileCols>(y, output_n, current_tile,
+                                         current_y_base, current_ev);
+
+    if (!has_next) {
+      break;
+    }
+
+    current_tile = next_tile;
+    current_tile_idx = next_tile_idx;
+    ping = !ping;
+  }
+
+  drainTilePipeFlags();
+}
+
+template <typename T>
+AICORE void runTSwiGLUMainTiled(__gm__ T* x, __gm__ T* y, uint32_t batch,
+                                uint32_t input_n, uint32_t num_cores,
+                                uint32_t vid) {
+  const uint32_t output_n = input_n >> 1;
+  const TileConfig cfg = chooseTileConfig(batch, output_n, num_cores);
+
+  switch (cfg.col_tile_len) {
+#define SWIGLU_TILE_CASE(width)                                     \
+  case width:                                                       \
+    runTSwiGLUTiled<width, T>(x, y, batch, input_n, num_cores, vid, \
+                              cfg.row_tile_len);                    \
+    break;
+    SWIGLU_FOR_EACH_COL_TILE(SWIGLU_TILE_CASE)
+#undef SWIGLU_TILE_CASE
+    default:
+      runTSwiGLUTiled<128, T>(x, y, batch, input_n, num_cores, vid,
+                              cfg.row_tile_len);
+      break;
+  }
+}
+
+template <typename T>
+AICORE void runTSwiGLU(__gm__ T* x, __gm__ T* y, uint32_t batch,
+                       uint32_t input_n, uint32_t num_cores, uint32_t vid) {
+  if (input_n == 0 || (input_n & 1U) != 0) {
+    return;
+  }
+  runTSwiGLUMainTiled(x, y, batch, input_n, num_cores, vid);
+}
+
+}  // namespace
+
+#endif
+
+extern "C" __global__ AICORE void swiglu_fp16(GM_ADDR x, GM_ADDR y,
+                                              uint32_t batch,
+                                              uint32_t input_n) {
+#if defined(__DAV_VEC__)
+  const uint32_t num_cores = get_block_num() * get_subblockdim();
+  const uint32_t vid = get_block_idx() * get_subblockdim() + get_subblockid();
+  runTSwiGLU<half>((__gm__ half*)x, (__gm__ half*)y, batch, input_n, num_cores,
+                   vid);
+#else
+  (void)x;
+  (void)y;
+  (void)batch;
+  (void)input_n;
+#endif
+}
+
+extern "C" void call_swiglu_kernel(uint32_t blockDim, void* stream, uint8_t* x,
+                                   uint8_t* y, uint32_t batch,
+                                   uint32_t input_n) {
+  swiglu_fp16<<<blockDim * 2, nullptr, stream>>>(x, y, batch, input_n);
+}
diff --git a/examples/a5_sim/outputs/.gitignore b/examples/a5_sim/outputs/.gitignore
new file mode 100644
index 00000000..d6b7ef32
--- /dev/null
+++ b/examples/a5_sim/outputs/.gitignore
@@ -0,0 +1,2 @@
+*
+!.gitignore
diff --git a/examples/a5_sim/run_cannsim.sh b/examples/a5_sim/run_cannsim.sh
new file mode 100755
index 00000000..b79d9570
--- /dev/null
+++ b/examples/a5_sim/run_cannsim.sh
@@ -0,0 +1,60 @@
+#!/usr/bin/env bash
+# Run A5 pure-vector kernels under CANN Simulator (cannsim record, Ascend950).
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+A5_DIR="${SCRIPT_DIR}"
+
+ASCEND_HOME_PATH="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}"
+# shellcheck disable=SC1091
+source "${ASCEND_HOME_PATH}/bin/setenv.bash"
+
+SOC="${CANNSIM_SOC:-Ascend950}"
+OUTPUT_DIR="${A5_DIR}/outputs/cannsim_vec"
+mkdir -p "${OUTPUT_DIR}"
+ulimit -n 65535
+
+cd "${A5_DIR}"
+
+USER_OPTS="${*}"
+if [[ -z "${USER_OPTS}" ]]; then
+  USER_OPTS="--kernel silu --mode bench --num-elements 128 --label smoke --skip-correctness"
+fi
+
+echo "==> cannsim record (${SOC})"
+echo "    ASCEND_HOME_PATH=${ASCEND_HOME_PATH}"
+echo "    user options: ${USER_OPTS}"
+
+cannsim record \
+  -s "${SOC}" \
+  -o "${OUTPUT_DIR}" \
+  "${A5_DIR}/run_cannsim_entry.sh" \
+  -u "${USER_OPTS}"
+CANNSIM_RC=$?
+
+# cannsim may segfault during teardown after Python wrote JSON; accept if output exists.
+if [[ "${CANNSIM_RC}" -ne 0 ]] && [[ "${USER_OPTS}" == *"--output-json"* ]]; then
+  OUT_JSON="$(python3 - "${USER_OPTS}" <<'PY'
+import shlex, sys
+args = shlex.split(sys.argv[1])
+for i, a in enumerate(args):
+    if a == "--output-json" and i + 1 < len(args):
+        print(args[i + 1])
+        break
+PY
+)"
+  if [[ -n "${OUT_JSON}" && -f "${OUT_JSON}" ]]; then
+    if python3 - "${OUT_JSON}" <<'PY'
+import json, sys
+from pathlib import Path
+data = json.loads(Path(sys.argv[1]).read_text())
+rows = data.get("results", [])
+sys.exit(0 if rows and rows[0].get("sim_wall_s") is not None else 1)
+PY
+    then
+      echo "==> cannsim exited ${CANNSIM_RC} but ${OUT_JSON} is valid; treating as success"
+      exit 0
+    fi
+  fi
+fi
+exit "${CANNSIM_RC}"
diff --git a/examples/a5_sim/run_cannsim_entry.sh b/examples/a5_sim/run_cannsim_entry.sh
new file mode 100755
index 00000000..1e1e9909
--- /dev/null
+++ b/examples/a5_sim/run_cannsim_entry.sh
@@ -0,0 +1,6 @@
+#!/usr/bin/env bash
+# Entry point for cannsim record (standalone executable script).
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "${SCRIPT_DIR}"
+exec python3 "${SCRIPT_DIR}/vec_sim.py" "$@"
diff --git a/examples/a5_sim/run_msprof.sh b/examples/a5_sim/run_msprof.sh
new file mode 100755
index 00000000..b13baee8
--- /dev/null
+++ b/examples/a5_sim/run_msprof.sh
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+# Run A5 pure-vector kernels under msprof op simulator (Ascend950PR_9599).
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+A5_DIR="${SCRIPT_DIR}"
+
+ASCEND_HOME_PATH="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}"
+# shellcheck disable=SC1091
+source "${ASCEND_HOME_PATH}/bin/setenv.bash"
+
+SIM_LIB="${ASCEND_HOME_PATH}/tools/simulator/Ascend950PR_9599/lib"
+export LD_LIBRARY_PATH="${SIM_LIB}:${LD_LIBRARY_PATH:-}"
+ulimit -n 65535
+
+TIMEOUT="${MSPROF_TIMEOUT:-30}"
+OUTPUT_DIR="${A5_DIR}/outputs/msprof_vec"
+mkdir -p "${OUTPUT_DIR}"
+
+cd "${A5_DIR}"
+
+echo "==> msprof op simulator (Ascend950PR_9599)"
+echo "    ASCEND_HOME_PATH=${ASCEND_HOME_PATH}"
+echo "    timeout=${TIMEOUT} min"
+
+msprof op simulator \
+  --soc-version=Ascend950PR_9599 \
+  --timeout="${TIMEOUT}" \
+  --output="${OUTPUT_DIR}" \
+  python3 "${A5_DIR}/vec_sim.py" "$@"
diff --git a/examples/a5_sim/run_thread_sweep.sh b/examples/a5_sim/run_thread_sweep.sh
new file mode 100755
index 00000000..d3dea406
--- /dev/null
+++ b/examples/a5_sim/run_thread_sweep.sh
@@ -0,0 +1,123 @@
+#!/usr/bin/env bash
+# Sweep OMP threads for A5 pure-vector kernels under msprof and cannsim (T=512).
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+OUTPUT_DIR="${SCRIPT_DIR}/outputs"
+mkdir -p "${OUTPUT_DIR}"
+
+KERNEL="${THREAD_SWEEP_KERNEL:-silu}"
+NPROC="$(nproc)"
+THREADS=(1 2 4 8 16)
+TIMESTAMP="$(date +%Y%m%d_%H%M%S)"
+RESULTS_JSON="${OUTPUT_DIR}/thread_sweep_${KERNEL}_${TIMESTAMP}.json"
+ROWS_JSON="${OUTPUT_DIR}/_thread_sweep_rows_${TIMESTAMP}.json"
+echo "[]" > "${ROWS_JSON}"
+
+T=512
+REPEAT="${THREAD_SWEEP_REPEAT:-1}"
+MSPROF_TIMEOUT=30
+MAX_THREADS="${THREAD_SWEEP_MAX:-32}"
+
+if [[ "${NPROC}" -gt 16 ]]; then
+  if [[ "${NPROC}" -le "${MAX_THREADS}" ]]; then
+    THREADS+=("${NPROC}")
+  else
+    THREADS+=("${MAX_THREADS}")
+  fi
+fi
+
+echo "==> thread sweep: kernel=${KERNEL} T=${T} repeat=${REPEAT}"
+echo "    threads: ${THREADS[*]}"
+
+_run_tool() {
+  local tool="$1"
+  local n="$2"
+  local log
+  log="$(mktemp)"
+  local script
+  if [[ "${tool}" == "msprof" ]]; then
+    script="${SCRIPT_DIR}/run_msprof.sh"
+  else
+    script="${SCRIPT_DIR}/run_cannsim.sh"
+  fi
+  local extra_args=(--kernel "${KERNEL}" --mode bench --label "threads_${tool}_${n}" --repeat "${REPEAT}" --skip-correctness)
+  if [[ "${KERNEL}" == "silu" ]]; then
+    extra_args+=(--num-elements "${T}")
+  else
+    extra_args+=(--batch 1 --input-n $((T * 2)))
+  fi
+  if MSPROF_TIMEOUT="${MSPROF_TIMEOUT}" "${script}" "${extra_args[@]}" --output-json "${log}" >/dev/null 2>&1; then
+    :
+  elif [[ ! -f "${log}" ]]; then
+    echo "FAILED"
+    rm -f "${log}"
+    return
+  fi
+  if python3 -c "import json; print(json.load(open('${log}'))['results'][0]['sim_wall_s'])" 2>/dev/null; then
+    :
+  else
+    echo "FAILED"
+  fi
+  rm -f "${log}"
+}
+
+for N in "${THREADS[@]}"; do
+  echo ""
+  echo "==> OMP_NUM_THREADS=${N}"
+  export OMP_NUM_THREADS="${N}"
+  export OPENBLAS_NUM_THREADS="${N}"
+  export MKL_NUM_THREADS="${N}"
+
+  for TOOL in msprof cannsim; do
+    RUN_TIMES=()
+    for R in $(seq 1 "${REPEAT}"); do
+      SIM_S="$(_run_tool "${TOOL}" "${N}")"
+      if [[ "${SIM_S}" != "FAILED" ]]; then
+        RUN_TIMES+=("${SIM_S}")
+        echo "    ${TOOL} run ${R}: ${SIM_S}s"
+      else
+        echo "    ${TOOL} run ${R}: FAILED" >&2
+      fi
+    done
+    if [[ ${#RUN_TIMES[@]} -gt 0 ]]; then
+      python3 - "${TOOL}" "${N}" "${T}" "${KERNEL}" "${ROWS_JSON}" "${RUN_TIMES[@]}" <<'PY'
+import json, statistics, sys
+from pathlib import Path
+
+tool = sys.argv[1]
+threads = int(sys.argv[2])
+t = int(sys.argv[3])
+kernel = sys.argv[4]
+rows_path = Path(sys.argv[5])
+times = [float(x) for x in sys.argv[6:]]
+mean_s = statistics.mean(times)
+rows = json.loads(rows_path.read_text())
+rows.append({
+    "tool": tool,
+    "kernel": kernel,
+    "threads": threads,
+    "T": t,
+    "runs_s": times,
+    "mean_s": mean_s,
+    "omp_threads": threads,
+})
+rows_path.write_text(json.dumps(rows, indent=2))
+PY
+    fi
+  done
+done
+
+python3 - "${ROWS_JSON}" "${RESULTS_JSON}" <<'PY'
+import json, sys
+from pathlib import Path
+
+rows = json.loads(Path(sys.argv[1]).read_text())
+out = {"thread_sweep": rows}
+Path(sys.argv[2]).write_text(json.dumps(out, indent=2))
+print(json.dumps(out, indent=2))
+PY
+
+rm -f "${ROWS_JSON}"
+echo ""
+echo "Wrote ${RESULTS_JSON}"
diff --git a/examples/a5_sim/vec_sim.py b/examples/a5_sim/vec_sim.py
new file mode 100644
index 00000000..a2aa2e7d
--- /dev/null
+++ b/examples/a5_sim/vec_sim.py
@@ -0,0 +1,389 @@
+#!/usr/bin/env python3
+"""Run A5 pure-vector kernels (SiLU, SwiGLU) under Ascend950 simulators.
+
+Usage::
+
+    ./run_msprof.sh --kernel silu --mode correctness --num-elements 128
+    ./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256
+    python3 vec_sim.py --kernel silu --mode sweep
+"""
+
+from __future__ import annotations
+
+import argparse
+import ctypes
+import json
+import os
+import subprocess
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+import torch
+
+_A5_DIR = Path(__file__).resolve().parent
+if str(_A5_DIR) not in sys.path:
+    sys.path.insert(0, str(_A5_DIR))
+
+from common.build import build_kernel  # noqa: E402
+from common.host_info import capture_host_cpu  # noqa: E402
+from common.torch_runtime import (  # noqa: E402
+    init_torch_npu,
+    stream_ptr,
+    sync,
+    zeros_npu,
+)
+
+DEFAULT_BLOCK_DIM = 8
+_DEFAULT_LADDER = _A5_DIR / "configs" / "scale_ladder.json"
+_LIBS: dict[str, ctypes.CDLL] = {}
+
+
+def _vp(t: torch.Tensor) -> ctypes.c_void_p:
+    return ctypes.c_void_p(t.data_ptr())
+
+
+def _load_lib(kernel: str) -> ctypes.CDLL:
+    if kernel in _LIBS:
+        return _LIBS[kernel]
+    lib_path = build_kernel(kernel)
+    lib = ctypes.CDLL(str(lib_path))
+    if kernel == "silu":
+        lib.call_kernel.argtypes = [
+            ctypes.c_uint32,
+            ctypes.c_void_p,
+            ctypes.c_void_p,
+            ctypes.c_void_p,
+            ctypes.c_uint32,
+        ]
+        lib.call_kernel.restype = None
+    elif kernel == "swiglu":
+        lib.call_swiglu_kernel.argtypes = [
+            ctypes.c_uint32,
+            ctypes.c_void_p,
+            ctypes.c_void_p,
+            ctypes.c_void_p,
+            ctypes.c_uint32,
+            ctypes.c_uint32,
+        ]
+        lib.call_swiglu_kernel.restype = None
+    else:
+        raise ValueError(f"unknown kernel: {kernel}")
+    _LIBS[kernel] = lib
+    return lib
+
+
+def _ladder_shape(entry: dict, kernel: str) -> tuple[str, int, dict]:
+    n_seq = int(entry["n_seq"])
+    l_seg = int(entry["l_seg"])
+    t = n_seq * l_seg
+    label = entry["label"]
+    if kernel == "silu":
+        return label, t, {"num_elements": t}
+    return label, t, {"batch": n_seq, "input_n": 2 * l_seg}
+
+
+def _make_silu_inputs(num_elements: int, seed: int, device: str) -> dict:
+    torch.manual_seed(seed)
+    x = torch.randn(num_elements, dtype=torch.float16).to(device)
+    y = zeros_npu((num_elements,), torch.float16)
+    return {"x": x, "y": y, "T": num_elements}
+
+
+def _make_swiglu_inputs(batch: int, input_n: int, seed: int, device: str) -> dict:
+    if input_n <= 0 or (input_n & 1):
+        raise ValueError("input_n must be a positive even integer")
+    torch.manual_seed(seed)
+    x = torch.randn(batch, input_n, dtype=torch.float16).to(device)
+    y = zeros_npu((batch, input_n // 2), torch.float16)
+    return {"x": x, "y": y, "T": batch * (input_n // 2), "batch": batch, "input_n": input_n}
+
+
+def _make_inputs(kernel: str, shape: dict, seed: int, device: str) -> dict:
+    if kernel == "silu":
+        return _make_silu_inputs(shape["num_elements"], seed, device)
+    return _make_swiglu_inputs(shape["batch"], shape["input_n"], seed, device)
+
+
+def _launch_silu(data: dict, block_dim: int) -> None:
+    lib = _load_lib("silu")
+    lib.call_kernel(
+        block_dim,
+        stream_ptr(),
+        _vp(data["y"]),
+        _vp(data["x"]),
+        data["T"],
+    )
+
+
+def _launch_swiglu(data: dict, block_dim: int) -> None:
+    lib = _load_lib("swiglu")
+    lib.call_swiglu_kernel(
+        block_dim,
+        stream_ptr(),
+        _vp(data["x"]),
+        _vp(data["y"]),
+        data["batch"],
+        data["input_n"],
+    )
+
+
+def _launch(kernel: str, data: dict, block_dim: int) -> None:
+    if kernel == "silu":
+        _launch_silu(data, block_dim)
+    else:
+        _launch_swiglu(data, block_dim)
+
+
+def check_correctness(kernel: str, data: dict) -> tuple[bool, str]:
+    _launch(kernel, data, data["block_dim"])
+    sync()
+    x_cpu = data["x"].cpu()
+    out_cpu = data["y"].cpu()
+    if kernel == "silu":
+        ref = x_cpu * torch.sigmoid(x_cpu.float()).to(torch.float16)
+        rtol, atol = 1e-1, 1e-5
+    else:
+        a, b = x_cpu.chunk(2, dim=-1)
+        ref = a * torch.sigmoid(a.float()).to(torch.float16) * b
+        rtol, atol = 1e-2, 1e-5
+    try:
+        torch.testing.assert_close(out_cpu, ref, rtol=rtol, atol=atol)
+        return True, "PASS"
+    except AssertionError as exc:
+        diff = (out_cpu.float() - ref.float()).abs().max().item()
+        return False, f"FAIL max_diff={diff:.4e}: {exc}"
+
+
+def bench_once(kernel: str, data: dict) -> float:
+    sync()
+    t0 = time.perf_counter()
+    _launch(kernel, data, data["block_dim"])
+    sync()
+    return time.perf_counter() - t0
+
+
+def _load_ladder(path: Path) -> list[dict]:
+    return json.loads(path.read_text())
+
+
+def _run_case(
+    kernel: str,
+    label: str,
+    t: int,
+    shape: dict,
+    *,
+    mode: str,
+    device: str,
+    seed: int,
+    repeat: int,
+    check: bool,
+    block_dim: int,
+) -> dict:
+    data = _make_inputs(kernel, shape, seed=seed, device=device)
+    data["block_dim"] = block_dim
+    row: dict = {
+        "label": label,
+        "kernel": kernel,
+        "T": t,
+        "block_dim": block_dim,
+        "omp_threads": int(os.environ.get("OMP_NUM_THREADS", "0") or 0),
+        **shape,
+    }
+
+    if check or mode == "correctness":
+        ok, msg = check_correctness(kernel, data)
+        row["correctness_pass"] = ok
+        row["correctness_msg"] = msg
+        if not ok:
+            print(f"  [{label}] CORRECTNESS FAIL: {msg}", file=sys.stderr)
+
+    if mode in ("bench", "sweep", "correctness", "compare-tools"):
+        times = [bench_once(kernel, data) for _ in range(max(1, repeat))]
+        row["sim_wall_s"] = sum(times) / len(times)
+        row["sim_wall_ms"] = row["sim_wall_s"] * 1000.0
+        row["ms_per_element"] = (row["sim_wall_s"] / t) * 1000.0 if t else None
+        if len(times) > 1:
+            mean = row["sim_wall_s"]
+            row["sim_wall_s_std"] = (
+                sum((x - mean) ** 2 for x in times) / len(times)
+            ) ** 0.5
+
+    return row
+
+
+def _print_table(rows: list[dict]) -> None:
+    print(f"\n{'label':<16} {'T':>8} {'sim_ms':>12} {'ms/elem':>10} {'ok':>5}")
+    print("-" * 56)
+    for r in sorted(rows, key=lambda x: x["T"]):
+        sim = r.get("sim_wall_ms")
+        mpe = r.get("ms_per_element")
+        ok = r.get("correctness_pass")
+        sim_s = f"{sim:.1f}" if sim is not None else "n/a"
+        mpe_s = f"{mpe:.2f}" if mpe is not None else "n/a"
+        ok_s = "yes" if ok else ("no" if ok is False else "-")
+        print(f"{r.get('label', ''):<16} {r['T']:>8} {sim_s:>12} {mpe_s:>10} {ok_s:>5}")
+
+
+def _spawn_tool(script: str, args: list[str]) -> dict:
+    cmd = [str(_A5_DIR / script), *args]
+    proc = subprocess.run(cmd, capture_output=True, text=True)
+    if proc.returncode != 0:
+        raise RuntimeError(
+            f"{script} failed (exit {proc.returncode}):\n{proc.stdout}\n{proc.stderr}"
+        )
+    for line in reversed(proc.stdout.splitlines()):
+        line = line.strip()
+        if line.startswith("{") and line.endswith("}"):
+            return json.loads(line)
+    raise RuntimeError(f"No JSON output from {script}:\n{proc.stdout}\n{proc.stderr}")
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="A5 pure-vector simulator driver")
+    parser.add_argument("--kernel", choices=("silu", "swiglu"), default="silu")
+    parser.add_argument(
+        "--mode",
+        choices=("correctness", "bench", "sweep", "compare-tools"),
+        default="correctness",
+    )
+    parser.add_argument("--num-elements", type=int, default=None, help="SiLU element count")
+    parser.add_argument("--batch", type=int, default=1)
+    parser.add_argument("--input-n", type=int, default=None, help="SwiGLU input width (even)")
+    parser.add_argument("--label", default="custom")
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument("--repeat", type=int, default=1)
+    parser.add_argument("--device", default="npu:0")
+    parser.add_argument("--block-dim", type=int, default=DEFAULT_BLOCK_DIM)
+    parser.add_argument("--ladder", type=Path, default=_DEFAULT_LADDER)
+    parser.add_argument("--skip-correctness", action="store_true")
+    parser.add_argument("--output-json", type=Path, default=None)
+    args = parser.parse_args()
+
+    init_torch_npu(args.device)
+    _load_lib(args.kernel)
+
+    rows: list[dict] = []
+    if args.mode == "sweep":
+        for entry in _load_ladder(args.ladder):
+            label, t, shape = _ladder_shape(entry, args.kernel)
+            rows.append(
+                _run_case(
+                    args.kernel,
+                    label,
+                    t,
+                    shape,
+                    mode="sweep",
+                    device=args.device,
+                    seed=args.seed,
+                    repeat=args.repeat,
+                    check=not args.skip_correctness,
+                    block_dim=args.block_dim,
+                )
+            )
+        _print_table(rows)
+    elif args.mode == "compare-tools":
+        if args.kernel == "silu":
+            if args.num_elements is None:
+                args.num_elements = 128
+            shape = {"num_elements": args.num_elements}
+            t = args.num_elements
+        else:
+            if args.input_n is None:
+                args.input_n = 256
+            shape = {"batch": args.batch, "input_n": args.input_n}
+            t = args.batch * (args.input_n // 2)
+        base_args = [
+            "--kernel",
+            args.kernel,
+            "--mode",
+            "bench",
+            "--label",
+            args.label,
+            "--repeat",
+            str(args.repeat),
+            "--skip-correctness",
+            "--block-dim",
+            str(args.block_dim),
+        ]
+        if args.kernel == "silu":
+            base_args += ["--num-elements", str(shape["num_elements"])]
+        else:
+            base_args += [
+                "--batch",
+                str(shape["batch"]),
+                "--input-n",
+                str(shape["input_n"]),
+            ]
+        msprof = _spawn_tool("run_msprof.sh", base_args + ["--output-json", "/dev/stdout"])
+        cannsim = _spawn_tool("run_cannsim.sh", base_args)
+        ms_row = msprof.get("results", [msprof])[0]
+        cn_row = cannsim.get("results", [cannsim])[0]
+        rows = [
+            {"tool": "msprof", **ms_row},
+            {"tool": "cannsim", **cn_row},
+        ]
+        if ms_row.get("sim_wall_s") and cn_row.get("sim_wall_s"):
+            rows.append(
+                {
+                    "tool": "ratio_msprof_over_cannsim",
+                    "sim_wall_s": ms_row["sim_wall_s"] / cn_row["sim_wall_s"],
+                }
+            )
+        print(json.dumps({"results": rows}, indent=2))
+    else:
+        if args.kernel == "silu":
+            num_elements = args.num_elements if args.num_elements is not None else 128
+            shape = {"num_elements": num_elements}
+            t = num_elements
+        else:
+            input_n = args.input_n if args.input_n is not None else 256
+            shape = {"batch": args.batch, "input_n": input_n}
+            t = args.batch * (input_n // 2)
+        rows.append(
+            _run_case(
+                args.kernel,
+                args.label,
+                t,
+                shape,
+                mode=args.mode,
+                device=args.device,
+                seed=args.seed,
+                repeat=args.repeat,
+                check=not args.skip_correctness
+                and args.mode in ("correctness", "bench"),
+                block_dim=args.block_dim,
+            )
+        )
+        if args.mode == "correctness":
+            ok = rows[0].get("correctness_pass", False)
+            print(rows[0].get("correctness_msg", "FAIL"))
+            if not ok:
+                raise SystemExit(1)
+
+    host_cpu = capture_host_cpu()
+    payload = {
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "mode": args.mode,
+        "kernel": args.kernel,
+        "soc_msprof": "Ascend950PR_9599",
+        "soc_cannsim": "Ascend950",
+        "arch": "dav-c310-vec",
+        "host_cpu": host_cpu,
+        "results": rows,
+    }
+    out = json.dumps(payload, indent=2)
+    if args.output_json:
+        if str(args.output_json) == "/dev/stdout":
+            print(out)
+        else:
+            args.output_json.parent.mkdir(parents=True, exist_ok=True)
+            args.output_json.write_text(out)
+            print(out)
+    elif args.mode != "compare-tools":
+        print(out)
+
+
+if __name__ == "__main__":
+    main()

From f047794f98ee528133996d3f348eb2ad8420c2d4 Mon Sep 17 00:00:00 2001
From: jiaweizhuang <zhuangjiawei@hisilicon.com>
Date: Wed, 27 May 2026 18:24:29 +0000
Subject: [PATCH 2/3] third_party: add pto-isa submodule (v9.0.0) for a5_sim
 builds

Pin pto-isa at v9.0.0 to match CANN 9.0.0 and teach the a5_sim build
helper to resolve headers from the in-repo third_party path first.
---
 .gitmodules                     | 3 +++
 examples/a5_sim/common/build.py | 4 ++++
 third_party/pto-isa             | 1 +
 3 files changed, 8 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 third_party/pto-isa

diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 00000000..0be6691c
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "third_party/pto-isa"]
+	path = third_party/pto-isa
+	url = https://gitcode.com/cann/pto-isa.git
diff --git a/examples/a5_sim/common/build.py b/examples/a5_sim/common/build.py
index 075522a3..b1be4a4d 100644
--- a/examples/a5_sim/common/build.py
+++ b/examples/a5_sim/common/build.py
@@ -39,6 +39,10 @@ def _pto_include_root() -> Path:
         candidate = Path(ascend)
         if (candidate / "include" / "pto" / "pto-inst.hpp").is_file():
             return candidate / "include"
+    repo_root = A5_SIM_ROOT.parent.parent
+    local = repo_root / "third_party" / "pto-isa" / "include"
+    if (local / "pto" / "pto-inst.hpp").is_file():
+        return local
     fallback = Path("/workdir/megagdn-pto/third_party/pto-isa/include")
     if (fallback / "pto" / "pto-inst.hpp").is_file():
         return fallback
diff --git a/third_party/pto-isa b/third_party/pto-isa
new file mode 160000
index 00000000..37752695
--- /dev/null
+++ b/third_party/pto-isa
@@ -0,0 +1 @@
+Subproject commit 377526952e95c39f9149cb94d261eba9e394f0d4

From 823d6bb8f3fa5167b33a8ac19bf1586de1bfd64e Mon Sep 17 00:00:00 2001
From: jiaweizhuang <zhuangjiawei@hisilicon.com>
Date: Wed, 27 May 2026 18:24:32 +0000
Subject: [PATCH 3/3] docs(a5_sim): record simulator timings on AMD EPYC; keep
 Kunpeng
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Reproduce SiLU/SwiGLU msprof and cannsim sweeps on x86_64 (AMD EPYC
9654). Document both hosts side by side: Kunpeng-920 baseline tables
unchanged, new EPYC numbers (~3–5× faster wall time). Correctness PASS
on smoke shapes for both tools.
---
 examples/a5_sim/README.md            | 74 +++++++++++++++++++++++++---
 examples/a5_sim/cannsim_vs_msprof.md | 29 +++++++++--
 2 files changed, 92 insertions(+), 11 deletions(-)

diff --git a/examples/a5_sim/README.md b/examples/a5_sim/README.md
index 2315ebbf..b98978fe 100644
--- a/examples/a5_sim/README.md
+++ b/examples/a5_sim/README.md
@@ -8,7 +8,7 @@ Kernels compile with `--cce-aicore-arch=dav-c310-vec` and `-DREGISTER_BASE`. For
 
 ```bash
 source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
-export PTO_LIB_PATH=/path/to/pto-isa   # or megagdn-pto/third_party/pto-isa
+export PTO_LIB_PATH=/path/to/pto-kernels/third_party/pto-isa
 pip install torch torch-npu
 ```
 
@@ -36,15 +36,22 @@ python3 -m common.build --all
 ./run_thread_sweep.sh   # OMP sweep, T=512, both tools
 ```
 
-## Host environment
+## Host environments
 
-Measured on **Kunpeng-920** (HUAWEI Kunpeng 920 5250), **192 logical CPUs** (4 sockets × 48 cores, 1 thread/core), **aarch64**, CANN **9.0.0**, May 2026.
+| Host | CPU | Logical CPUs | Arch | CANN |
+|------|-----|--------------|------|------|
+| Kunpeng server | HUAWEI Kunpeng 920 5250 | 192 (4×48 cores, 1 thread/core) | aarch64 | 9.0.0 |
+| x86 server | AMD EPYC 9654 96-Core | 192 (1×96 cores, 2 threads/core) | x86_64 | 9.0.0 |
+
+Both measured May 2026. On the same ladder shapes, the x86 host is roughly **3–5× faster** in simulator wall time (startup overhead still dominates smoke on both).
 
 ## Simulator time cost summary
 
 Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/msprof or cannsim startup). **T** = output element count (same ladder labels as the 910B `chunk_h` benchmark). **Correctness PASS** at smoke shape on both tools (PyTorch CPU reference).
 
-### SiLU — msprof (`Ascend950PR_9599`)
+### Kunpeng-920 (aarch64)
+
+#### SiLU — msprof (`Ascend950PR_9599`)
 
 | Label | T | Sim wall | ms/element |
 |-------|---|----------|------------|
@@ -54,7 +61,7 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr
 | varlen_2x512 | 1024 | **26 s** | 26 ms |
 | medium | 4096 | **29 s** | 7.1 ms |
 
-### SiLU — cannsim (`Ascend950`)
+#### SiLU — cannsim (`Ascend950`)
 
 | Label | T | Sim wall | ms/element |
 |-------|---|----------|------------|
@@ -64,7 +71,7 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr
 | varlen_2x512 | 1024 | **16 s** | 16 ms |
 | medium | 4096 | **17 s** | 4.1 ms |
 
-### SwiGLU — msprof
+#### SwiGLU — msprof
 
 | Label | T | Sim wall | ms/element |
 |-------|---|----------|------------|
@@ -74,7 +81,7 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr
 | varlen_2x512 | 1024 | **47 s** | 46 ms |
 | medium | 4096 | **52 s** | 13 ms |
 
-### SwiGLU — cannsim
+#### SwiGLU — cannsim
 
 | Label | T | Sim wall | ms/element |
 |-------|---|----------|------------|
@@ -84,15 +91,66 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr
 | varlen_2x512 | 1024 | **21 s** | 21 ms |
 | medium | 4096 | **22 s** | 5.4 ms |
 
-**Scaling law (approximate):**
+**Scaling law (Kunpeng, approximate):**
 
 - Fixed overhead **~15–75 s** at T=128 dominates smoke; do not extrapolate from smoke alone.
 - After startup, cost scales **roughly linearly with T** at ~**0.005–0.06 s/element** on cannsim and ~**0.007–0.06 s/element** on msprof for T≥512.
 - **Varlen vs fixed length** at the same T: negligible (1024 tokens: SiLU msprof 26 s vs 26 s).
 - Pure-vector kernels finish in **minutes** on the default ladder; contrast with mix `chunk_h_mini` v1 (scalar matmul, 35+ min timeouts).
 
+### AMD EPYC 9654 (x86_64)
+
+#### SiLU — msprof (`Ascend950PR_9599`)
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **12 s** | 91 ms |
+| tiny | 512 | **7 s** | 14 ms |
+| small | 1024 | **7 s** | 7.1 ms |
+| varlen_2x512 | 1024 | **7 s** | 6.9 ms |
+| medium | 4096 | **12 s** | 3.0 ms |
+
+#### SiLU — cannsim (`Ascend950`)
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **9 s** | 72 ms |
+| tiny | 512 | **4 s** | 7.2 ms |
+| small | 1024 | **4 s** | 3.9 ms |
+| varlen_2x512 | 1024 | **4 s** | 3.5 ms |
+| medium | 4096 | **5 s** | 1.3 ms |
+
+#### SwiGLU — msprof
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **18 s** | 137 ms |
+| tiny | 512 | **13 s** | 25 ms |
+| small | 1024 | **17 s** | 16 ms |
+| varlen_2x512 | 1024 | **15 s** | 14 ms |
+| medium | 4096 | **20 s** | 4.9 ms |
+
+#### SwiGLU — cannsim
+
+| Label | T | Sim wall | ms/element |
+|-------|---|----------|------------|
+| smoke | 128 | **11 s** | 87 ms |
+| tiny | 512 | **6 s** | 12 ms |
+| small | 1024 | **6 s** | 6.3 ms |
+| varlen_2x512 | 1024 | **6 s** | 5.8 ms |
+| medium | 4096 | **7 s** | 1.7 ms |
+
+**Scaling law (AMD EPYC, approximate):**
+
+- Fixed overhead **~9–18 s** at T=128 dominates smoke; do not extrapolate from smoke alone.
+- After startup, cost scales **roughly linearly with T** at ~**0.001–0.012 s/element** on cannsim and ~**0.003–0.014 s/element** on msprof for T≥512.
+- **Varlen vs fixed length** at the same T: negligible (1024 tokens: SiLU msprof 7 s vs 7 s).
+- Pure-vector kernels finish in **under ~2 min** for the full ladder on this host.
+
 ### vs CPU thread count (OMP)
 
+Measured on **Kunpeng-920** only (not re-run on x86).
+
 Fixed workload **T=512** (SiLU), swept `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS` together:
 
 | OMP threads | msprof mean (s) | speedup vs 1 | cannsim mean (s) | speedup vs 1 |
diff --git a/examples/a5_sim/cannsim_vs_msprof.md b/examples/a5_sim/cannsim_vs_msprof.md
index 393cc7d9..775c1104 100644
--- a/examples/a5_sim/cannsim_vs_msprof.md
+++ b/examples/a5_sim/cannsim_vs_msprof.md
@@ -10,7 +10,8 @@ Pure-vector A5 examples for **`pto-kernels/examples/a5_sim`**. Recommended first
 | AICore arch | `dav-c310-vec` | `dav-c310-vec` |
 | Correctness (smoke) | **PASS** (SiLU T=128, SwiGLU T=128) | **PASS** (same shapes) |
 | Invocation | Wraps `python3 vec_sim.py` directly | Executable `run_cannsim_entry.sh` + `-u "..."` |
-| Typical smoke wall | SiLU ~26 s, SwiGLU ~54 s | SiLU ~14 s, SwiGLU ~26 s |
+| Typical smoke wall (Kunpeng) | SiLU ~52 s, SwiGLU ~75 s | SiLU ~42 s, SwiGLU ~52 s |
+| Typical smoke wall (AMD EPYC) | SiLU ~12 s, SwiGLU ~18 s | SiLU ~9 s, SwiGLU ~11 s |
 | Exit code | 0 on success | May return non-zero after **teardown segfault** even when JSON is valid |
 
 ## Tool overview
@@ -30,6 +31,8 @@ Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (sim
 
 ## Speed comparison (scale ladder, timing-only sweep)
 
+### Kunpeng-920 (aarch64, May 2026)
+
 **SiLU msprof vs cannsim** (seconds, wall clock):
 
 | label | T | msprof | cannsim | ratio msprof/cannsim |
@@ -48,7 +51,27 @@ Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (sim
 | small | 1024 | 61 | 29 | 2.1× |
 | medium | 4096 | 52 | 22 | 2.4× |
 
-cannsim is generally **faster** on wall clock for these pure-vector kernels once T≥512; msprof carries heavier profiling/injection overhead.
+### AMD EPYC 9654 (x86_64, May 2026)
+
+**SiLU msprof vs cannsim** (seconds, wall clock):
+
+| label | T | msprof | cannsim | ratio msprof/cannsim |
+|-------|---|--------|---------|----------------------|
+| smoke | 128 | 12 | 9 | 1.3× |
+| tiny | 512 | 7 | 4 | 1.8× |
+| small | 1024 | 7 | 4 | 1.8× |
+| medium | 4096 | 12 | 5 | 2.4× |
+
+**SwiGLU msprof vs cannsim**:
+
+| label | T | msprof | cannsim | ratio |
+|-------|---|--------|---------|-------|
+| smoke | 128 | 18 | 11 | 1.6× |
+| tiny | 512 | 13 | 6 | 2.2× |
+| small | 1024 | 17 | 6 | 2.8× |
+| medium | 4096 | 20 | 7 | 2.9× |
+
+On both hosts, cannsim is generally **faster** on wall clock for these pure-vector kernels once T≥512; msprof carries heavier profiling/injection overhead. Tool ratios are similar; absolute wall time is ~3–5× lower on the AMD EPYC host.
 
 ## Failure modes
 
@@ -65,7 +88,7 @@ cannsim is generally **faster** on wall clock for these pure-vector kernels once
 ```bash
 cd pto-kernels/examples/a5_sim
 source $ASCEND_HOME_PATH/bin/setenv.bash
-export PTO_LIB_PATH=/path/to/pto-isa
+export PTO_LIB_PATH=/path/to/pto-kernels/third_party/pto-isa
 
 MSPROF_TIMEOUT=30 ./run_msprof.sh --kernel silu --mode sweep --skip-correctness \
   --output-json outputs/silu_sweep_msprof.json