From 9c3f790f4ee7ea9a9bbf854bfb91323ab3745717 Mon Sep 17 00:00:00 2001 From: learning-chip Date: Wed, 27 May 2026 15:01:00 +0000 Subject: [PATCH 1/3] Add Ascend950 pure-vector simulator examples for SiLU and SwiGLU. Introduce a self-contained a5_sim harness with dav-c310-vec kernels, msprof and cannsim runners, scale-ladder timing docs, and OMP thread sweep scripts. Co-authored-by: Cursor --- examples/a5_sim/README.md | 125 +++++++ examples/a5_sim/cannsim_vs_msprof.md | 86 +++++ examples/a5_sim/common/__init__.py | 1 + examples/a5_sim/common/build.py | 172 ++++++++++ examples/a5_sim/common/host_info.py | 46 +++ examples/a5_sim/common/torch_runtime.py | 56 ++++ examples/a5_sim/configs/scale_ladder.json | 7 + examples/a5_sim/kernels/silu_a5.cpp | 122 +++++++ examples/a5_sim/kernels/swiglu_a5.cpp | 375 +++++++++++++++++++++ examples/a5_sim/outputs/.gitignore | 2 + examples/a5_sim/run_cannsim.sh | 60 ++++ examples/a5_sim/run_cannsim_entry.sh | 6 + examples/a5_sim/run_msprof.sh | 30 ++ examples/a5_sim/run_thread_sweep.sh | 123 +++++++ examples/a5_sim/vec_sim.py | 389 ++++++++++++++++++++++ 15 files changed, 1600 insertions(+) create mode 100644 examples/a5_sim/README.md create mode 100644 examples/a5_sim/cannsim_vs_msprof.md create mode 100644 examples/a5_sim/common/__init__.py create mode 100644 examples/a5_sim/common/build.py create mode 100644 examples/a5_sim/common/host_info.py create mode 100644 examples/a5_sim/common/torch_runtime.py create mode 100644 examples/a5_sim/configs/scale_ladder.json create mode 100644 examples/a5_sim/kernels/silu_a5.cpp create mode 100644 examples/a5_sim/kernels/swiglu_a5.cpp create mode 100644 examples/a5_sim/outputs/.gitignore create mode 100755 examples/a5_sim/run_cannsim.sh create mode 100755 examples/a5_sim/run_cannsim_entry.sh create mode 100755 examples/a5_sim/run_msprof.sh create mode 100755 examples/a5_sim/run_thread_sweep.sh create mode 100644 examples/a5_sim/vec_sim.py diff --git a/examples/a5_sim/README.md b/examples/a5_sim/README.md new file mode 100644 index 00000000..2315ebbf --- /dev/null +++ b/examples/a5_sim/README.md @@ -0,0 +1,125 @@ +# A5 Pure-Vector Simulator Examples (SiLU + SwiGLU) + +Self-contained **Ascend950PR** pure-vector PTO kernels with **msprof op simulator** and **cannsim record** harnesses. Use these to validate A5 simulator plumbing before tackling mix kernels (see [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)). + +Kernels compile with `--cce-aicore-arch=dav-c310-vec` and `-DREGISTER_BASE`. For the 910B `chunk_h` simulator benchmark (different arch), see [`megagdn-pto/benchmarks/simulator/README.md`](../../megagdn-pto/benchmarks/simulator/README.md). + +## Prerequisites + +```bash +source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash +export PTO_LIB_PATH=/path/to/pto-isa # or megagdn-pto/third_party/pto-isa +pip install torch torch-npu +``` + +Build kernels: + +```bash +cd pto-kernels/examples/a5_sim +python3 -m common.build --all +``` + +## Quick start + +```bash +# Correctness smoke (msprof) +./run_msprof.sh --kernel silu --mode correctness --num-elements 128 --label smoke +./run_msprof.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke + +# Same under cannsim +./run_cannsim.sh --kernel silu --mode correctness --num-elements 128 --label smoke +./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke + +# Scale ladder timing +./run_msprof.sh --kernel silu --mode sweep --skip-correctness \ + --output-json outputs/silu_sweep_msprof.json +./run_thread_sweep.sh # OMP sweep, T=512, both tools +``` + +## Host environment + +Measured on **Kunpeng-920** (HUAWEI Kunpeng 920 5250), **192 logical CPUs** (4 sockets × 48 cores, 1 thread/core), **aarch64**, CANN **9.0.0**, May 2026. + +## Simulator time cost summary + +Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/msprof or cannsim startup). **T** = output element count (same ladder labels as the 910B `chunk_h` benchmark). **Correctness PASS** at smoke shape on both tools (PyTorch CPU reference). + +### SiLU — msprof (`Ascend950PR_9599`) + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **52 s** | 406 ms | +| tiny | 512 | **24 s** | 48 ms | +| small | 1024 | **26 s** | 25 ms | +| varlen_2x512 | 1024 | **26 s** | 26 ms | +| medium | 4096 | **29 s** | 7.1 ms | + +### SiLU — cannsim (`Ascend950`) + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **42 s** | 331 ms | +| tiny | 512 | **15 s** | 30 ms | +| small | 1024 | **17 s** | 17 ms | +| varlen_2x512 | 1024 | **16 s** | 16 ms | +| medium | 4096 | **17 s** | 4.1 ms | + +### SwiGLU — msprof + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **75 s** | 588 ms | +| tiny | 512 | **49 s** | 95 ms | +| small | 1024 | **61 s** | 59 ms | +| varlen_2x512 | 1024 | **47 s** | 46 ms | +| medium | 4096 | **52 s** | 13 ms | + +### SwiGLU — cannsim + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **52 s** | 403 ms | +| tiny | 512 | **27 s** | 52 ms | +| small | 1024 | **29 s** | 28 ms | +| varlen_2x512 | 1024 | **21 s** | 21 ms | +| medium | 4096 | **22 s** | 5.4 ms | + +**Scaling law (approximate):** + +- Fixed overhead **~15–75 s** at T=128 dominates smoke; do not extrapolate from smoke alone. +- After startup, cost scales **roughly linearly with T** at ~**0.005–0.06 s/element** on cannsim and ~**0.007–0.06 s/element** on msprof for T≥512. +- **Varlen vs fixed length** at the same T: negligible (1024 tokens: SiLU msprof 26 s vs 26 s). +- Pure-vector kernels finish in **minutes** on the default ladder; contrast with mix `chunk_h_mini` v1 (scalar matmul, 35+ min timeouts). + +### vs CPU thread count (OMP) + +Fixed workload **T=512** (SiLU), swept `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS` together: + +| OMP threads | msprof mean (s) | speedup vs 1 | cannsim mean (s) | speedup vs 1 | +|-------------|-----------------|--------------|------------------|--------------| +| 1 | 39.5 | 1.00× | 31.6 | 1.00× | +| 2 | 44.0 | 0.90× | 35.3 | 0.90× | +| 4 | 41.7 | 0.95× | 34.4 | 0.92× | +| 8 | 41.4 | 0.95× | 35.1 | 0.90× | +| 16 | 44.6 | 0.89× | 31.4 | 1.01× | +| 32 | 42.2 | 0.93× | 32.1 | 0.99× | + +**Conclusion:** host OMP thread env vars change simulator wall time by at most **~±11%** (msprof) and **~±12%** (cannsim). Tuning `OMP_NUM_THREADS` is not an effective lever; PEM uses internal worker pools. + +## Layout + +``` +examples/a5_sim/ +├── kernels/silu_a5.cpp, swiglu_a5.cpp +├── vec_sim.py # driver (--kernel silu|swiglu) +├── common/build.py # dav-c310-vec build +├── run_msprof.sh / run_cannsim.sh / run_thread_sweep.sh +├── configs/scale_ladder.json +└── outputs/ # gitignored results +``` + +## References + +- A5 PTO ST tests: `megagdn-pto/third_party/pto-isa/tests/npu/a5/src/st/testcase` +- A2 originals: `examples/jit_cpp/silu_dynamic`, `csrc/kernel/kernel_swiglu.cpp` +- Tool comparison: [`cannsim_vs_msprof.md`](cannsim_vs_msprof.md) diff --git a/examples/a5_sim/cannsim_vs_msprof.md b/examples/a5_sim/cannsim_vs_msprof.md new file mode 100644 index 00000000..393cc7d9 --- /dev/null +++ b/examples/a5_sim/cannsim_vs_msprof.md @@ -0,0 +1,86 @@ +# SiLU / SwiGLU — msprof vs cannsim (Ascend950 / dav-c310-vec) + +Pure-vector A5 examples for **`pto-kernels/examples/a5_sim`**. Recommended first step for Ascend950 simulator validation before mix kernels in [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim). + +## Executive summary + +| Aspect | msprof op simulator | cannsim record | +|--------|---------------------|----------------| +| SoC flag | `Ascend950PR_9599` | `Ascend950` | +| AICore arch | `dav-c310-vec` | `dav-c310-vec` | +| Correctness (smoke) | **PASS** (SiLU T=128, SwiGLU T=128) | **PASS** (same shapes) | +| Invocation | Wraps `python3 vec_sim.py` directly | Executable `run_cannsim_entry.sh` + `-u "..."` | +| Typical smoke wall | SiLU ~26 s, SwiGLU ~54 s | SiLU ~14 s, SwiGLU ~26 s | +| Exit code | 0 on success | May return non-zero after **teardown segfault** even when JSON is valid | + +## Tool overview + +**msprof** preloads the CA model via `LD_PRELOAD` and runs Python + ctypes kernel launch (same pattern as [`ptoisa-a5-test/tests/torch_sim`](../../ptoisa-a5-test/tests/torch_sim/msprof_mechanism.md)). + +**cannsim** runs a standalone entry script under full SoC simulation. User args pass via `-u "--kernel silu --mode ..."`, not trailing argv. + +## Correctness + +| Kernel | Shape | msprof | cannsim | Reference | +|--------|-------|--------|---------|-----------| +| SiLU | T=128 | PASS | PASS | `x * sigmoid(x)` on CPU | +| SwiGLU | batch=1, input_n=256 (T=128 out) | PASS | PASS | split + SiLU gate × value on CPU | + +Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (simulator rejects many dynamic NPU ops). + +## Speed comparison (scale ladder, timing-only sweep) + +**SiLU msprof vs cannsim** (seconds, wall clock): + +| label | T | msprof | cannsim | ratio msprof/cannsim | +|-------|---|--------|---------|----------------------| +| smoke | 128 | 52 | 42 | 1.2× | +| tiny | 512 | 24 | 15 | 1.6× | +| small | 1024 | 26 | 17 | 1.5× | +| medium | 4096 | 29 | 17 | 1.7× | + +**SwiGLU msprof vs cannsim**: + +| label | T | msprof | cannsim | ratio | +|-------|---|--------|---------|-------| +| smoke | 128 | 75 | 52 | 1.4× | +| tiny | 512 | 49 | 27 | 1.8× | +| small | 1024 | 61 | 29 | 2.1× | +| medium | 4096 | 52 | 22 | 2.4× | + +cannsim is generally **faster** on wall clock for these pure-vector kernels once T≥512; msprof carries heavier profiling/injection overhead. + +## Failure modes + +| Issue | Mitigation | +|-------|------------| +| `torch.randn` on NPU under sim | Create tensors on CPU, `.to("npu:0")` | +| Reference ops on NPU fail | Compare `y.cpu()` vs CPU PyTorch ref | +| cannsim segfault on exit | JSON is still written; `run_cannsim.sh` accepts valid `--output-json` | +| A5 `pipe_barrier(PIPE_V)` compile error | Use `PIPE_ALL` in SwiGLU compute path | +| `Stride` ambiguous on A5 | Qualify as `pto::Stride<...>` | + +## Invocation examples + +```bash +cd pto-kernels/examples/a5_sim +source $ASCEND_HOME_PATH/bin/setenv.bash +export PTO_LIB_PATH=/path/to/pto-isa + +MSPROF_TIMEOUT=30 ./run_msprof.sh --kernel silu --mode sweep --skip-correctness \ + --output-json outputs/silu_sweep_msprof.json + +./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 \ + --output-json outputs/smoke_swiglu_cannsim.json +``` + +## Recommendations + +1. **Start with SiLU** (simplest 1D pipeline) under msprof smoke correctness. +2. Use **cannsim** for faster scale sweeps once smoke passes. +3. Use **mix chunk_h_mini** only after pure-vector path is green ([`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)). + +## References + +- Harness README: [`README.md`](README.md) +- 910B chunk_h comparison: [`megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md`](../../megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md) diff --git a/examples/a5_sim/common/__init__.py b/examples/a5_sim/common/__init__.py new file mode 100644 index 00000000..afc5bf56 --- /dev/null +++ b/examples/a5_sim/common/__init__.py @@ -0,0 +1 @@ +"""Shared helpers for A5 pure-vector simulator examples.""" diff --git a/examples/a5_sim/common/build.py b/examples/a5_sim/common/build.py new file mode 100644 index 00000000..075522a3 --- /dev/null +++ b/examples/a5_sim/common/build.py @@ -0,0 +1,172 @@ +#!/usr/bin/env python3 +"""Bisheng build helper for A5 pure-vector kernels (dav-c310-vec, REGISTER_BASE).""" + +from __future__ import annotations + +import argparse +import os +import shutil +import subprocess +import sys +from pathlib import Path + +A5_SIM_ROOT = Path(__file__).resolve().parent.parent +KERNEL_DIR = A5_SIM_ROOT / "kernels" +BUILD_DIR = A5_SIM_ROOT / "build" + +KERNELS = { + "silu": { + "source": "silu_a5.cpp", + "lib": "libsilu_a5.so", + }, + "swiglu": { + "source": "swiglu_a5.cpp", + "lib": "libswiglu_a5.so", + }, +} + + +def _pto_include_root() -> Path: + env = os.environ.get("PTO_LIB_PATH") + if env: + candidate = Path(env) + if (candidate / "include" / "pto" / "pto-inst.hpp").is_file(): + return candidate / "include" + if (candidate / "pto" / "pto-inst.hpp").is_file(): + return candidate + ascend = os.environ.get("ASCEND_HOME_PATH") or os.environ.get("ASCEND_TOOLKIT_HOME") + if ascend: + candidate = Path(ascend) + if (candidate / "include" / "pto" / "pto-inst.hpp").is_file(): + return candidate / "include" + fallback = Path("/workdir/megagdn-pto/third_party/pto-isa/include") + if (fallback / "pto" / "pto-inst.hpp").is_file(): + return fallback + raise EnvironmentError( + "PTO headers not found. Set PTO_LIB_PATH or source CANN setenv.bash." + ) + + +def _ascend_home() -> Path: + home = os.environ.get("ASCEND_HOME_PATH") or os.environ.get("ASCEND_TOOLKIT_HOME") + if not home: + raise EnvironmentError("ASCEND_HOME_PATH is not set. Source CANN setenv.bash first.") + return Path(home) + + +def _bisheng() -> str: + ascend = _ascend_home() + candidate = ascend / "bin" / "bisheng" + if candidate.is_file(): + return str(candidate) + found = shutil.which("bisheng") + if found: + return found + raise FileNotFoundError("bisheng compiler not found") + + +def _common_includes() -> list[str]: + ascend = _ascend_home() + driver = os.environ.get("ASCEND_DRIVER_PATH", "/usr/local/Ascend/driver") + pto_root = _pto_include_root() + return [ + f"-I{pto_root}", + f"-I{ascend}/include", + f"-I{driver}/kernel/inc", + f"-I{KERNEL_DIR}", + ] + + +def _kernel_flags() -> list[str]: + ascend = _ascend_home() + return ( + _common_includes() + + [ + f"-I{ascend}/pkg_inc", + f"-I{ascend}/pkg_inc/profiling", + f"-I{ascend}/pkg_inc/runtime/runtime", + "-std=gnu++17", + "-O2", + "-Wno-macro-redefined", + "-Wno-ignored-attributes", + "-Wno-unknown-attributes", + "-fPIC", + "-xcce", + "-Xhost-start", + "-Xhost-end", + "-mllvm", + "-cce-aicore-stack-size=0x8000", + "-mllvm", + "-cce-aicore-function-stack-size=0x8000", + "-mllvm", + "-cce-aicore-record-overflow=true", + "-mllvm", + "-cce-aicore-addr-transform", + "-mllvm", + "-cce-aicore-dcci-insert-for-scalar=false", + "--cce-aicore-arch=dav-c310-vec", + "-DREGISTER_BASE", + ] + ) + + +def _run(cmd: list[str], cwd: Path) -> None: + print("==>", " ".join(cmd)) + subprocess.run(cmd, cwd=cwd, check=True) + + +def build_kernel(name: str, force: bool = False) -> Path: + if name not in KERNELS: + raise ValueError(f"unknown kernel: {name}") + spec = KERNELS[name] + BUILD_DIR.mkdir(parents=True, exist_ok=True) + out = BUILD_DIR / spec["lib"] + if out.is_file() and not force: + return out + + src_path = KERNEL_DIR / spec["source"] + obj = BUILD_DIR / f"{src_path.stem}.o" + bisheng = _bisheng() + _run([bisheng, *_kernel_flags(), "-c", str(src_path), "-o", str(obj)], cwd=BUILD_DIR) + _run( + [ + bisheng, + "-fPIC", + "-shared", + "--cce-fatobj-link", + "-Wl,-soname," + spec["lib"], + str(obj), + "-o", + str(out), + ], + cwd=BUILD_DIR, + ) + print(f"Built {out}") + return out + + +def build_all(force: bool = False) -> dict[str, Path]: + return {name: build_kernel(name, force=force) for name in KERNELS} + + +def main() -> None: + parser = argparse.ArgumentParser(description="Build A5 pure-vector example kernels") + parser.add_argument("--kernel", choices=tuple(KERNELS.keys())) + parser.add_argument("--all", action="store_true") + parser.add_argument("--force", action="store_true") + args = parser.parse_args() + try: + if args.all: + build_all(force=args.force) + elif args.kernel: + build_kernel(args.kernel, force=args.force) + else: + parser.print_help() + raise SystemExit(1) + except (EnvironmentError, FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc: + print(f"build failed: {exc}", file=sys.stderr) + raise SystemExit(1) from exc + + +if __name__ == "__main__": + main() diff --git a/examples/a5_sim/common/host_info.py b/examples/a5_sim/common/host_info.py new file mode 100644 index 00000000..f10a12b1 --- /dev/null +++ b/examples/a5_sim/common/host_info.py @@ -0,0 +1,46 @@ +"""Capture host CPU metadata for benchmark documentation.""" + +from __future__ import annotations + +import json +import subprocess +from pathlib import Path + + +def _run(cmd: list[str]) -> str: + try: + return subprocess.check_output(cmd, text=True, stderr=subprocess.STDOUT).strip() + except (subprocess.CalledProcessError, FileNotFoundError): + return "" + + +def capture_host_cpu() -> dict: + lscpu = _run(["lscpu"]) + info: dict[str, str | int] = {"lscpu_excerpt": "\n".join(lscpu.splitlines()[:12]) if lscpu else ""} + for line in lscpu.splitlines(): + if ":" not in line: + continue + key, value = line.split(":", 1) + key = key.strip().lower().replace(" ", "_").replace("(", "").replace(")", "") + info[key] = value.strip() + nproc = _run(["nproc"]) + if nproc.isdigit(): + info["logical_cpus"] = int(nproc) + return info + + +def readme_cpu_snippet(info: dict) -> str: + model = info.get("model_name", "unknown") + cpus = info.get("cpu_s", info.get("cpus", "?")) + threads = info.get("thread_s_per_core", "?") + logical = info.get("logical_cpus", "?") + return ( + f"Host CPU: **{model}**, {cpus} cores × {threads} threads/core, " + f"{logical} logical CPUs (from `lscpu`)." + ) + + +def write_host_info(path: Path) -> dict: + info = capture_host_cpu() + path.write_text(json.dumps(info, indent=2)) + return info diff --git a/examples/a5_sim/common/torch_runtime.py b/examples/a5_sim/common/torch_runtime.py new file mode 100644 index 00000000..59ceb28e --- /dev/null +++ b/examples/a5_sim/common/torch_runtime.py @@ -0,0 +1,56 @@ +"""torch_npu runtime helpers for Ascend950 simulator runs.""" + +from __future__ import annotations + +import ctypes + +import numpy as np +import torch +import torch_npu # noqa: F401 + +_DEVICE = "npu:0" + + +def init_torch_npu(device: str = _DEVICE) -> None: + global _DEVICE + _DEVICE = device + torch.npu.config.allow_internal_format = False + torch_npu.npu.set_compile_mode(jit_compile=False) + torch.npu.set_device(device) + + +def npu_tensor(np_arr: np.ndarray) -> torch.Tensor: + return torch.from_numpy(np_arr).to(_DEVICE) + + +def empty_npu(shape, dtype: torch.dtype) -> torch.Tensor: + return torch.empty(shape, dtype=dtype, device=_DEVICE) + + +def zeros_npu(shape, dtype: torch.dtype) -> torch.Tensor: + """Allocate zero-filled NPU tensor via CPU NumPy (avoids ZerosLike op under simulators).""" + np_dtype = { + torch.float32: np.float32, + torch.float16: np.float16, + torch.int32: np.int32, + torch.int64: np.int64, + }.get(dtype) + if np_dtype is None: + raise TypeError(f"unsupported dtype for zeros_npu: {dtype}") + return npu_tensor(np.zeros(shape, dtype=np_dtype)) + + +def stream_ptr() -> int: + return torch.npu.current_stream()._as_parameter_ # noqa: SLF001 + + +def data_ptr(tensor: torch.Tensor) -> int: + return ctypes.c_void_p(tensor.data_ptr()).value or 0 + + +def sync() -> None: + torch.npu.synchronize() + + +def to_numpy(tensor: torch.Tensor) -> np.ndarray: + return tensor.cpu().numpy() diff --git a/examples/a5_sim/configs/scale_ladder.json b/examples/a5_sim/configs/scale_ladder.json new file mode 100644 index 00000000..82df57ae --- /dev/null +++ b/examples/a5_sim/configs/scale_ladder.json @@ -0,0 +1,7 @@ +[ + {"label": "smoke", "n_seq": 1, "l_seg": 128}, + {"label": "tiny", "n_seq": 1, "l_seg": 512}, + {"label": "small", "n_seq": 1, "l_seg": 1024}, + {"label": "varlen_2x512", "n_seq": 2, "l_seg": 512}, + {"label": "medium", "n_seq": 4, "l_seg": 1024} +] diff --git a/examples/a5_sim/kernels/silu_a5.cpp b/examples/a5_sim/kernels/silu_a5.cpp new file mode 100644 index 00000000..580af68c --- /dev/null +++ b/examples/a5_sim/kernels/silu_a5.cpp @@ -0,0 +1,122 @@ +#include +using namespace pto; + +constexpr uint32_t UB_ALLOC_BYTES = 48 * 1024; +constexpr uint32_t ELEMENTS_PER_TILE = UB_ALLOC_BYTES / 2; + +constexpr unsigned X_PING = 0x00000; +constexpr unsigned X_PONG = (X_PING + 0x8000 + 0x100); +constexpr unsigned CAL_PING = 0x10000; +constexpr unsigned CAL_PONG = (CAL_PING + 0x8000 + 0x100); + +// SiLU: y = x / (1 + exp(-x)) +template +AICORE void runTSilu(__gm__ T *y, __gm__ T *x, uint32_t num_elements) { +#if defined(__DAV_VEC__) + + const uint32_t num_cores = block_num; + const uint32_t elements_per_core = + (num_elements + num_cores - 1) / num_cores; + const uint32_t offset_this_core = elements_per_core * block_idx; + + if (offset_this_core >= num_elements) return; + + uint32_t elements_to_process = elements_per_core; + if (offset_this_core + elements_to_process > num_elements) { + elements_to_process = num_elements - offset_this_core; + } + if (elements_to_process == 0) return; + + using ShapeDim5 = pto::Shape<1, 1, 1, 1, ELEMENTS_PER_TILE>; + using StridDim5 = pto::Stride<1, 1, 1, 1, 1>; + using GlobalData = pto::GlobalTensor; + + GlobalData xGlobal(x + offset_this_core); + GlobalData yGlobal(y + offset_this_core); + + using TileData = + Tile; + + set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0); + set_flag(PIPE_V, PIPE_MTE2, EVENT_ID1); + set_flag(PIPE_MTE3, PIPE_V, EVENT_ID0); + set_flag(PIPE_MTE3, PIPE_V, EVENT_ID1); + + uint32_t x_offset = 0; + uint32_t y_offset = 0; + + for (uint32_t num_processed = 0, ping = 1; + num_processed < elements_to_process; + num_processed += ELEMENTS_PER_TILE) { + const uint32_t remaining = elements_to_process - num_processed; + const uint32_t cur_cols = + (remaining >= ELEMENTS_PER_TILE) ? ELEMENTS_PER_TILE : remaining; + + const int8_t buf = ping ? 0 : 1; + const event_t ev = ping ? (event_t)EVENT_ID0 : (event_t)EVENT_ID1; + + TileData xTile(1, cur_cols); + TileData calTile(1, cur_cols); + + if (buf == 0) { + TASSIGN(xTile, X_PING); + TASSIGN(calTile, CAL_PING); + } else { + TASSIGN(xTile, X_PONG); + TASSIGN(calTile, CAL_PONG); + } + + TASSIGN(xGlobal, (x + offset_this_core + x_offset)); + TASSIGN(yGlobal, (y + offset_this_core + y_offset)); + + wait_flag(PIPE_V, PIPE_MTE2, ev); + TLOAD(xTile, xGlobal); + pipe_barrier(PIPE_ALL); + + set_flag(PIPE_MTE2, PIPE_V, ev); + wait_flag(PIPE_MTE2, PIPE_V, ev); + + wait_flag(PIPE_MTE3, PIPE_V, ev); + + TMULS(calTile, xTile, (T)-1); + pipe_barrier(PIPE_ALL); + + TEXP(calTile, calTile); + pipe_barrier(PIPE_ALL); + + TADDS(calTile, calTile, (T)1); + pipe_barrier(PIPE_ALL); + + TDIV(calTile, xTile, calTile); + pipe_barrier(PIPE_ALL); + + set_flag(PIPE_V, PIPE_MTE3, ev); + wait_flag(PIPE_V, PIPE_MTE3, ev); + + TSTORE(yGlobal, calTile); + pipe_barrier(PIPE_ALL); + + set_flag(PIPE_MTE3, PIPE_V, ev); + set_flag(PIPE_V, PIPE_MTE2, ev); + + x_offset += cur_cols; + y_offset += cur_cols; + ping = 1 - ping; + } + + wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0); + wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID1); + wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID0); + wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID1); +#endif +} + +__global__ AICORE void silu_custom(__gm__ void *x, __gm__ void *y, + uint32_t num_elements) { + runTSilu((__gm__ half *)y, (__gm__ half *)x, num_elements); +} + +extern "C" void call_kernel(uint32_t blockDim, void *stream, uint8_t *y, + uint8_t *x, uint32_t num_elements) { + silu_custom<<>>(x, y, num_elements); +} diff --git a/examples/a5_sim/kernels/swiglu_a5.cpp b/examples/a5_sim/kernels/swiglu_a5.cpp new file mode 100644 index 00000000..c23f7872 --- /dev/null +++ b/examples/a5_sim/kernels/swiglu_a5.cpp @@ -0,0 +1,375 @@ +/** +Copyright (c) 2026 Huawei Technologies Co., Ltd. +A5 pure-vector SwiGLU port (self-contained, dav-c310-vec). +*/ + +#include + +#ifndef GM_ADDR +#define GM_ADDR __gm__ uint8_t* +#endif + +using namespace pto; + +#define DIV_ROUNDUP(x, y) (((x) + (y) - 1) / (y)) +#define ALIGN_UP(x, y) (DIV_ROUNDUP((x), (y)) * (y)) + +constexpr uint32_t UB_SLOT_BYTES = (192 * 1024) / 6; +constexpr uint32_t X0_BUFFER_BYTES = UB_SLOT_BYTES; +constexpr uint32_t X1_BUFFER_BYTES = UB_SLOT_BYTES; +constexpr uint32_t Y_BUFFER_BYTES = UB_SLOT_BYTES; +constexpr uint32_t ELEMENTS_PER_TILE = Y_BUFFER_BYTES / sizeof(half); +constexpr uint32_t UB_USABLE_BYTES = 192 * 1024; +constexpr uint32_t TILE_ALIGNMENT = 16; + +#define SWIGLU_FOR_EACH_COL_TILE(X) \ + X(16) \ + X(32) \ + X(64) \ + X(128) \ + X(256) \ + X(512) \ + X(1024) \ + X(2048) \ + X(4096) \ + X(8192) \ + X(16384) + +#define SWIGLU_COL_VALUE(width) width, +constexpr uint32_t COL_TILE_CANDIDATES[] = { + SWIGLU_FOR_EACH_COL_TILE(SWIGLU_COL_VALUE)}; +#undef SWIGLU_COL_VALUE +constexpr uint32_t NUM_COL_TILE_CANDIDATES = + sizeof(COL_TILE_CANDIDATES) / sizeof(COL_TILE_CANDIDATES[0]); + +constexpr uint32_t TARGET_ACTIVE_TILES_DIVISOR = 2; + +constexpr unsigned X0_PING = 0x00000; +constexpr unsigned X1_PING = X0_PING + X0_BUFFER_BYTES; +constexpr unsigned Y_PING = X1_PING + X1_BUFFER_BYTES; +constexpr unsigned X0_PONG = Y_PING + Y_BUFFER_BYTES; +constexpr unsigned X1_PONG = X0_PONG + X0_BUFFER_BYTES; +constexpr unsigned Y_PONG = X1_PONG + X1_BUFFER_BYTES; + +static_assert(UB_SLOT_BYTES * 6 == UB_USABLE_BYTES, + "SwiGLU UB slots must fully pack the usable UB budget."); +static_assert(Y_PONG + Y_BUFFER_BYTES <= UB_USABLE_BYTES, + "SwiGLU UB layout exceeds usable UB."); + +#if defined(__DAV_VEC__) + +namespace { + +struct TileConfig { + uint32_t row_tile_len; + uint32_t col_tile_len; + uint32_t total_tiles; + uint32_t area; + bool meets_target; +}; + +struct TileWork { + uint32_t row_offset; + uint32_t col_offset; + uint32_t row_count; + uint32_t col_count; + uint32_t col_count_store; +}; + +AICORE inline uint32_t maxRowsForColTile(uint32_t col_tile_len) { + return ELEMENTS_PER_TILE / col_tile_len; +} + +AICORE inline void initTilePipeFlags() { + set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0); + set_flag(PIPE_V, PIPE_MTE2, EVENT_ID1); + set_flag(PIPE_MTE3, PIPE_V, EVENT_ID0); + set_flag(PIPE_MTE3, PIPE_V, EVENT_ID1); +} + +AICORE inline void drainTilePipeFlags() { + wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0); + wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID1); + wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID0); + wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID1); +} + +AICORE inline TileConfig makeTileConfig(uint32_t batch, uint32_t output_n, + uint32_t num_cores, + uint32_t col_tile_len) { + const uint32_t max_rows = maxRowsForColTile(col_tile_len); + const uint32_t col_tiles = DIV_ROUNDUP(output_n, col_tile_len); + const uint32_t target_tiles = + max(1U, num_cores / TARGET_ACTIVE_TILES_DIVISOR); + const uint32_t min_row_tiles = max(1U, DIV_ROUNDUP(target_tiles, col_tiles)); + + uint32_t row_tile_len = min(batch, max_rows); + if (min_row_tiles > 1) { + const uint32_t capped = batch / min_row_tiles; + row_tile_len = min(row_tile_len, max(1U, capped)); + } + if (row_tile_len == 0) { + row_tile_len = 1; + } + + const uint32_t row_tiles = DIV_ROUNDUP(batch, row_tile_len); + const uint32_t total_tiles = row_tiles * col_tiles; + return TileConfig{ + row_tile_len, + col_tile_len, + total_tiles, + row_tile_len * col_tile_len, + total_tiles >= target_tiles, + }; +} + +AICORE inline bool preferTileConfig(const TileConfig& cand, + const TileConfig& best) { + if (cand.meets_target != best.meets_target) { + return cand.meets_target; + } + if (cand.area != best.area) { + return cand.area > best.area; + } + if (cand.meets_target) { + if (cand.total_tiles != best.total_tiles) { + return cand.total_tiles < best.total_tiles; + } + } else if (cand.total_tiles != best.total_tiles) { + return cand.total_tiles > best.total_tiles; + } + if (cand.col_tile_len != best.col_tile_len) { + return cand.col_tile_len > best.col_tile_len; + } + return cand.row_tile_len > best.row_tile_len; +} + +AICORE inline TileConfig chooseTileConfig(uint32_t batch, uint32_t output_n, + uint32_t num_cores) { + TileConfig best = + makeTileConfig(batch, output_n, num_cores, COL_TILE_CANDIDATES[0]); + for (uint32_t c = 1; c < NUM_COL_TILE_CANDIDATES; ++c) { + const TileConfig cand = + makeTileConfig(batch, output_n, num_cores, COL_TILE_CANDIDATES[c]); + if (preferTileConfig(cand, best)) { + best = cand; + } + } + return best; +} + +AICORE inline TileWork makeTileWork(uint32_t global_tile_idx, + uint32_t row_tile_len, + uint32_t col_tile_len, uint32_t batch, + uint32_t output_n) { + const uint32_t col_tiles = DIV_ROUNDUP(output_n, col_tile_len); + const uint32_t row_tile_idx = global_tile_idx / col_tiles; + const uint32_t col_tile_idx = global_tile_idx % col_tiles; + const uint32_t row_offset = row_tile_idx * row_tile_len; + const uint32_t col_offset = col_tile_idx * col_tile_len; + const uint32_t actual_col = min(col_tile_len, output_n - col_offset); + return TileWork{ + row_offset, + col_offset, + min(row_tile_len, batch - row_offset), + ALIGN_UP(actual_col, TILE_ALIGNMENT), + actual_col, + }; +} + +template +AICORE inline void computeSwiGLUTile(TileData& x0Tile, TileData& x1Tile, + TileData& yTile) { + TMULS(yTile, x0Tile, (T)-1); + pipe_barrier(PIPE_ALL); + TEXP(yTile, yTile); + pipe_barrier(PIPE_ALL); + TADDS(yTile, yTile, (T)1); + pipe_barrier(PIPE_ALL); + TDIV(yTile, x0Tile, yTile); + pipe_barrier(PIPE_ALL); + TMUL(yTile, yTile, x1Tile); + pipe_barrier(PIPE_ALL); +} + +template +AICORE void issueTLoad(__gm__ T* x, uint32_t input_n, uint32_t output_n, + const TileWork& tile, unsigned x0_base, unsigned x1_base, + event_t ev) { + using TileShapeND = TileShape2D; + using DynStrideND = pto::Stride<1, 1, 1, DYNAMIC, 1>; + using GlobalData = GlobalTensor; + using TileData = Tile; + + TileData x0Tile(tile.row_count, tile.col_count); + TileData x1Tile(tile.row_count, tile.col_count); + TASSIGN(x0Tile, x0_base); + TASSIGN(x1Tile, x1_base); + + const uint32_t input_offset = tile.row_offset * input_n + tile.col_offset; + const TileShapeND shape(tile.row_count, tile.col_count_store); + const DynStrideND stride(input_n); + + GlobalData x0Global(x + input_offset, shape, stride); + GlobalData x1Global(x + input_offset + output_n, shape, stride); + + wait_flag(PIPE_V, PIPE_MTE2, ev); + wait_flag(PIPE_MTE3, PIPE_V, ev); + TLOAD(x0Tile, x0Global); + TLOAD(x1Tile, x1Global); + set_flag(PIPE_MTE2, PIPE_V, ev); +} + +template +AICORE void issueTStore(__gm__ T* y, uint32_t output_n, const TileWork& tile, + unsigned y_base, event_t ev) { + using TileShapeND = TileShape2D; + using DynStrideND = pto::Stride<1, 1, 1, DYNAMIC, 1>; + using GlobalData = GlobalTensor; + using TileData = Tile; + + TileData yTile(tile.row_count, tile.col_count); + TASSIGN(yTile, y_base); + + const uint32_t output_offset = tile.row_offset * output_n + tile.col_offset; + const TileShapeND shape(tile.row_count, tile.col_count_store); + const DynStrideND stride(output_n); + GlobalData yGlobal(y + output_offset, shape, stride); + + set_flag(PIPE_V, PIPE_MTE3, ev); + wait_flag(PIPE_V, PIPE_MTE3, ev); + TSTORE(yGlobal, yTile); + set_flag(PIPE_MTE3, PIPE_V, ev); + set_flag(PIPE_V, PIPE_MTE2, ev); +} + +template +AICORE void runTSwiGLUTiled(__gm__ T* x, __gm__ T* y, uint32_t batch, + uint32_t input_n, uint32_t num_cores, uint32_t vid, + uint32_t row_tile_len) { + constexpr uint32_t kTileRows = ELEMENTS_PER_TILE / kTileCols; + static_assert(kTileRows * kTileCols == ELEMENTS_PER_TILE, + "2D tile shape must match the UB vector tile capacity."); + + const uint32_t output_n = input_n >> 1; + const uint32_t col_tiles = DIV_ROUNDUP(output_n, kTileCols); + const uint32_t row_tiles = DIV_ROUNDUP(batch, row_tile_len); + const uint32_t total_tiles = row_tiles * col_tiles; + if (vid >= total_tiles) { + return; + } + + using TileData = Tile; + + initTilePipeFlags(); + + uint32_t current_tile_idx = vid; + TileWork current_tile = + makeTileWork(current_tile_idx, row_tile_len, kTileCols, batch, output_n); + bool ping = true; + issueTLoad(x, input_n, output_n, current_tile, + X0_PING, X1_PING, (event_t)EVENT_ID0); + + while (true) { + const event_t current_ev = ping ? (event_t)EVENT_ID0 : (event_t)EVENT_ID1; + const unsigned current_x0_base = ping ? X0_PING : X0_PONG; + const unsigned current_x1_base = ping ? X1_PING : X1_PONG; + const unsigned current_y_base = ping ? Y_PING : Y_PONG; + + wait_flag(PIPE_MTE2, PIPE_V, current_ev); + + TileWork next_tile{0, 0, 0, 0, 0}; + const uint32_t next_tile_idx = current_tile_idx + num_cores; + const bool has_next = next_tile_idx < total_tiles; + if (has_next) { + next_tile = + makeTileWork(next_tile_idx, row_tile_len, kTileCols, batch, output_n); + const event_t next_ev = ping ? (event_t)EVENT_ID1 : (event_t)EVENT_ID0; + const unsigned next_x0_base = ping ? X0_PONG : X0_PING; + const unsigned next_x1_base = ping ? X1_PONG : X1_PING; + issueTLoad(x, input_n, output_n, next_tile, + next_x0_base, next_x1_base, next_ev); + } + + TileData x0Tile(current_tile.row_count, current_tile.col_count); + TileData x1Tile(current_tile.row_count, current_tile.col_count); + TileData yTile(current_tile.row_count, current_tile.col_count); + TASSIGN(x0Tile, current_x0_base); + TASSIGN(x1Tile, current_x1_base); + TASSIGN(yTile, current_y_base); + + computeSwiGLUTile(x0Tile, x1Tile, yTile); + issueTStore(y, output_n, current_tile, + current_y_base, current_ev); + + if (!has_next) { + break; + } + + current_tile = next_tile; + current_tile_idx = next_tile_idx; + ping = !ping; + } + + drainTilePipeFlags(); +} + +template +AICORE void runTSwiGLUMainTiled(__gm__ T* x, __gm__ T* y, uint32_t batch, + uint32_t input_n, uint32_t num_cores, + uint32_t vid) { + const uint32_t output_n = input_n >> 1; + const TileConfig cfg = chooseTileConfig(batch, output_n, num_cores); + + switch (cfg.col_tile_len) { +#define SWIGLU_TILE_CASE(width) \ + case width: \ + runTSwiGLUTiled(x, y, batch, input_n, num_cores, vid, \ + cfg.row_tile_len); \ + break; + SWIGLU_FOR_EACH_COL_TILE(SWIGLU_TILE_CASE) +#undef SWIGLU_TILE_CASE + default: + runTSwiGLUTiled<128, T>(x, y, batch, input_n, num_cores, vid, + cfg.row_tile_len); + break; + } +} + +template +AICORE void runTSwiGLU(__gm__ T* x, __gm__ T* y, uint32_t batch, + uint32_t input_n, uint32_t num_cores, uint32_t vid) { + if (input_n == 0 || (input_n & 1U) != 0) { + return; + } + runTSwiGLUMainTiled(x, y, batch, input_n, num_cores, vid); +} + +} // namespace + +#endif + +extern "C" __global__ AICORE void swiglu_fp16(GM_ADDR x, GM_ADDR y, + uint32_t batch, + uint32_t input_n) { +#if defined(__DAV_VEC__) + const uint32_t num_cores = get_block_num() * get_subblockdim(); + const uint32_t vid = get_block_idx() * get_subblockdim() + get_subblockid(); + runTSwiGLU((__gm__ half*)x, (__gm__ half*)y, batch, input_n, num_cores, + vid); +#else + (void)x; + (void)y; + (void)batch; + (void)input_n; +#endif +} + +extern "C" void call_swiglu_kernel(uint32_t blockDim, void* stream, uint8_t* x, + uint8_t* y, uint32_t batch, + uint32_t input_n) { + swiglu_fp16<<>>(x, y, batch, input_n); +} diff --git a/examples/a5_sim/outputs/.gitignore b/examples/a5_sim/outputs/.gitignore new file mode 100644 index 00000000..d6b7ef32 --- /dev/null +++ b/examples/a5_sim/outputs/.gitignore @@ -0,0 +1,2 @@ +* +!.gitignore diff --git a/examples/a5_sim/run_cannsim.sh b/examples/a5_sim/run_cannsim.sh new file mode 100755 index 00000000..b79d9570 --- /dev/null +++ b/examples/a5_sim/run_cannsim.sh @@ -0,0 +1,60 @@ +#!/usr/bin/env bash +# Run A5 pure-vector kernels under CANN Simulator (cannsim record, Ascend950). +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +A5_DIR="${SCRIPT_DIR}" + +ASCEND_HOME_PATH="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}" +# shellcheck disable=SC1091 +source "${ASCEND_HOME_PATH}/bin/setenv.bash" + +SOC="${CANNSIM_SOC:-Ascend950}" +OUTPUT_DIR="${A5_DIR}/outputs/cannsim_vec" +mkdir -p "${OUTPUT_DIR}" +ulimit -n 65535 + +cd "${A5_DIR}" + +USER_OPTS="${*}" +if [[ -z "${USER_OPTS}" ]]; then + USER_OPTS="--kernel silu --mode bench --num-elements 128 --label smoke --skip-correctness" +fi + +echo "==> cannsim record (${SOC})" +echo " ASCEND_HOME_PATH=${ASCEND_HOME_PATH}" +echo " user options: ${USER_OPTS}" + +cannsim record \ + -s "${SOC}" \ + -o "${OUTPUT_DIR}" \ + "${A5_DIR}/run_cannsim_entry.sh" \ + -u "${USER_OPTS}" +CANNSIM_RC=$? + +# cannsim may segfault during teardown after Python wrote JSON; accept if output exists. +if [[ "${CANNSIM_RC}" -ne 0 ]] && [[ "${USER_OPTS}" == *"--output-json"* ]]; then + OUT_JSON="$(python3 - "${USER_OPTS}" <<'PY' +import shlex, sys +args = shlex.split(sys.argv[1]) +for i, a in enumerate(args): + if a == "--output-json" and i + 1 < len(args): + print(args[i + 1]) + break +PY +)" + if [[ -n "${OUT_JSON}" && -f "${OUT_JSON}" ]]; then + if python3 - "${OUT_JSON}" <<'PY' +import json, sys +from pathlib import Path +data = json.loads(Path(sys.argv[1]).read_text()) +rows = data.get("results", []) +sys.exit(0 if rows and rows[0].get("sim_wall_s") is not None else 1) +PY + then + echo "==> cannsim exited ${CANNSIM_RC} but ${OUT_JSON} is valid; treating as success" + exit 0 + fi + fi +fi +exit "${CANNSIM_RC}" diff --git a/examples/a5_sim/run_cannsim_entry.sh b/examples/a5_sim/run_cannsim_entry.sh new file mode 100755 index 00000000..1e1e9909 --- /dev/null +++ b/examples/a5_sim/run_cannsim_entry.sh @@ -0,0 +1,6 @@ +#!/usr/bin/env bash +# Entry point for cannsim record (standalone executable script). +set -euo pipefail +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "${SCRIPT_DIR}" +exec python3 "${SCRIPT_DIR}/vec_sim.py" "$@" diff --git a/examples/a5_sim/run_msprof.sh b/examples/a5_sim/run_msprof.sh new file mode 100755 index 00000000..b13baee8 --- /dev/null +++ b/examples/a5_sim/run_msprof.sh @@ -0,0 +1,30 @@ +#!/usr/bin/env bash +# Run A5 pure-vector kernels under msprof op simulator (Ascend950PR_9599). +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +A5_DIR="${SCRIPT_DIR}" + +ASCEND_HOME_PATH="${ASCEND_HOME_PATH:-/usr/local/Ascend/ascend-toolkit/latest}" +# shellcheck disable=SC1091 +source "${ASCEND_HOME_PATH}/bin/setenv.bash" + +SIM_LIB="${ASCEND_HOME_PATH}/tools/simulator/Ascend950PR_9599/lib" +export LD_LIBRARY_PATH="${SIM_LIB}:${LD_LIBRARY_PATH:-}" +ulimit -n 65535 + +TIMEOUT="${MSPROF_TIMEOUT:-30}" +OUTPUT_DIR="${A5_DIR}/outputs/msprof_vec" +mkdir -p "${OUTPUT_DIR}" + +cd "${A5_DIR}" + +echo "==> msprof op simulator (Ascend950PR_9599)" +echo " ASCEND_HOME_PATH=${ASCEND_HOME_PATH}" +echo " timeout=${TIMEOUT} min" + +msprof op simulator \ + --soc-version=Ascend950PR_9599 \ + --timeout="${TIMEOUT}" \ + --output="${OUTPUT_DIR}" \ + python3 "${A5_DIR}/vec_sim.py" "$@" diff --git a/examples/a5_sim/run_thread_sweep.sh b/examples/a5_sim/run_thread_sweep.sh new file mode 100755 index 00000000..d3dea406 --- /dev/null +++ b/examples/a5_sim/run_thread_sweep.sh @@ -0,0 +1,123 @@ +#!/usr/bin/env bash +# Sweep OMP threads for A5 pure-vector kernels under msprof and cannsim (T=512). +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +OUTPUT_DIR="${SCRIPT_DIR}/outputs" +mkdir -p "${OUTPUT_DIR}" + +KERNEL="${THREAD_SWEEP_KERNEL:-silu}" +NPROC="$(nproc)" +THREADS=(1 2 4 8 16) +TIMESTAMP="$(date +%Y%m%d_%H%M%S)" +RESULTS_JSON="${OUTPUT_DIR}/thread_sweep_${KERNEL}_${TIMESTAMP}.json" +ROWS_JSON="${OUTPUT_DIR}/_thread_sweep_rows_${TIMESTAMP}.json" +echo "[]" > "${ROWS_JSON}" + +T=512 +REPEAT="${THREAD_SWEEP_REPEAT:-1}" +MSPROF_TIMEOUT=30 +MAX_THREADS="${THREAD_SWEEP_MAX:-32}" + +if [[ "${NPROC}" -gt 16 ]]; then + if [[ "${NPROC}" -le "${MAX_THREADS}" ]]; then + THREADS+=("${NPROC}") + else + THREADS+=("${MAX_THREADS}") + fi +fi + +echo "==> thread sweep: kernel=${KERNEL} T=${T} repeat=${REPEAT}" +echo " threads: ${THREADS[*]}" + +_run_tool() { + local tool="$1" + local n="$2" + local log + log="$(mktemp)" + local script + if [[ "${tool}" == "msprof" ]]; then + script="${SCRIPT_DIR}/run_msprof.sh" + else + script="${SCRIPT_DIR}/run_cannsim.sh" + fi + local extra_args=(--kernel "${KERNEL}" --mode bench --label "threads_${tool}_${n}" --repeat "${REPEAT}" --skip-correctness) + if [[ "${KERNEL}" == "silu" ]]; then + extra_args+=(--num-elements "${T}") + else + extra_args+=(--batch 1 --input-n $((T * 2))) + fi + if MSPROF_TIMEOUT="${MSPROF_TIMEOUT}" "${script}" "${extra_args[@]}" --output-json "${log}" >/dev/null 2>&1; then + : + elif [[ ! -f "${log}" ]]; then + echo "FAILED" + rm -f "${log}" + return + fi + if python3 -c "import json; print(json.load(open('${log}'))['results'][0]['sim_wall_s'])" 2>/dev/null; then + : + else + echo "FAILED" + fi + rm -f "${log}" +} + +for N in "${THREADS[@]}"; do + echo "" + echo "==> OMP_NUM_THREADS=${N}" + export OMP_NUM_THREADS="${N}" + export OPENBLAS_NUM_THREADS="${N}" + export MKL_NUM_THREADS="${N}" + + for TOOL in msprof cannsim; do + RUN_TIMES=() + for R in $(seq 1 "${REPEAT}"); do + SIM_S="$(_run_tool "${TOOL}" "${N}")" + if [[ "${SIM_S}" != "FAILED" ]]; then + RUN_TIMES+=("${SIM_S}") + echo " ${TOOL} run ${R}: ${SIM_S}s" + else + echo " ${TOOL} run ${R}: FAILED" >&2 + fi + done + if [[ ${#RUN_TIMES[@]} -gt 0 ]]; then + python3 - "${TOOL}" "${N}" "${T}" "${KERNEL}" "${ROWS_JSON}" "${RUN_TIMES[@]}" <<'PY' +import json, statistics, sys +from pathlib import Path + +tool = sys.argv[1] +threads = int(sys.argv[2]) +t = int(sys.argv[3]) +kernel = sys.argv[4] +rows_path = Path(sys.argv[5]) +times = [float(x) for x in sys.argv[6:]] +mean_s = statistics.mean(times) +rows = json.loads(rows_path.read_text()) +rows.append({ + "tool": tool, + "kernel": kernel, + "threads": threads, + "T": t, + "runs_s": times, + "mean_s": mean_s, + "omp_threads": threads, +}) +rows_path.write_text(json.dumps(rows, indent=2)) +PY + fi + done +done + +python3 - "${ROWS_JSON}" "${RESULTS_JSON}" <<'PY' +import json, sys +from pathlib import Path + +rows = json.loads(Path(sys.argv[1]).read_text()) +out = {"thread_sweep": rows} +Path(sys.argv[2]).write_text(json.dumps(out, indent=2)) +print(json.dumps(out, indent=2)) +PY + +rm -f "${ROWS_JSON}" +echo "" +echo "Wrote ${RESULTS_JSON}" diff --git a/examples/a5_sim/vec_sim.py b/examples/a5_sim/vec_sim.py new file mode 100644 index 00000000..a2aa2e7d --- /dev/null +++ b/examples/a5_sim/vec_sim.py @@ -0,0 +1,389 @@ +#!/usr/bin/env python3 +"""Run A5 pure-vector kernels (SiLU, SwiGLU) under Ascend950 simulators. + +Usage:: + + ./run_msprof.sh --kernel silu --mode correctness --num-elements 128 + ./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 + python3 vec_sim.py --kernel silu --mode sweep +""" + +from __future__ import annotations + +import argparse +import ctypes +import json +import os +import subprocess +import sys +import time +from datetime import datetime, timezone +from pathlib import Path + +import torch + +_A5_DIR = Path(__file__).resolve().parent +if str(_A5_DIR) not in sys.path: + sys.path.insert(0, str(_A5_DIR)) + +from common.build import build_kernel # noqa: E402 +from common.host_info import capture_host_cpu # noqa: E402 +from common.torch_runtime import ( # noqa: E402 + init_torch_npu, + stream_ptr, + sync, + zeros_npu, +) + +DEFAULT_BLOCK_DIM = 8 +_DEFAULT_LADDER = _A5_DIR / "configs" / "scale_ladder.json" +_LIBS: dict[str, ctypes.CDLL] = {} + + +def _vp(t: torch.Tensor) -> ctypes.c_void_p: + return ctypes.c_void_p(t.data_ptr()) + + +def _load_lib(kernel: str) -> ctypes.CDLL: + if kernel in _LIBS: + return _LIBS[kernel] + lib_path = build_kernel(kernel) + lib = ctypes.CDLL(str(lib_path)) + if kernel == "silu": + lib.call_kernel.argtypes = [ + ctypes.c_uint32, + ctypes.c_void_p, + ctypes.c_void_p, + ctypes.c_void_p, + ctypes.c_uint32, + ] + lib.call_kernel.restype = None + elif kernel == "swiglu": + lib.call_swiglu_kernel.argtypes = [ + ctypes.c_uint32, + ctypes.c_void_p, + ctypes.c_void_p, + ctypes.c_void_p, + ctypes.c_uint32, + ctypes.c_uint32, + ] + lib.call_swiglu_kernel.restype = None + else: + raise ValueError(f"unknown kernel: {kernel}") + _LIBS[kernel] = lib + return lib + + +def _ladder_shape(entry: dict, kernel: str) -> tuple[str, int, dict]: + n_seq = int(entry["n_seq"]) + l_seg = int(entry["l_seg"]) + t = n_seq * l_seg + label = entry["label"] + if kernel == "silu": + return label, t, {"num_elements": t} + return label, t, {"batch": n_seq, "input_n": 2 * l_seg} + + +def _make_silu_inputs(num_elements: int, seed: int, device: str) -> dict: + torch.manual_seed(seed) + x = torch.randn(num_elements, dtype=torch.float16).to(device) + y = zeros_npu((num_elements,), torch.float16) + return {"x": x, "y": y, "T": num_elements} + + +def _make_swiglu_inputs(batch: int, input_n: int, seed: int, device: str) -> dict: + if input_n <= 0 or (input_n & 1): + raise ValueError("input_n must be a positive even integer") + torch.manual_seed(seed) + x = torch.randn(batch, input_n, dtype=torch.float16).to(device) + y = zeros_npu((batch, input_n // 2), torch.float16) + return {"x": x, "y": y, "T": batch * (input_n // 2), "batch": batch, "input_n": input_n} + + +def _make_inputs(kernel: str, shape: dict, seed: int, device: str) -> dict: + if kernel == "silu": + return _make_silu_inputs(shape["num_elements"], seed, device) + return _make_swiglu_inputs(shape["batch"], shape["input_n"], seed, device) + + +def _launch_silu(data: dict, block_dim: int) -> None: + lib = _load_lib("silu") + lib.call_kernel( + block_dim, + stream_ptr(), + _vp(data["y"]), + _vp(data["x"]), + data["T"], + ) + + +def _launch_swiglu(data: dict, block_dim: int) -> None: + lib = _load_lib("swiglu") + lib.call_swiglu_kernel( + block_dim, + stream_ptr(), + _vp(data["x"]), + _vp(data["y"]), + data["batch"], + data["input_n"], + ) + + +def _launch(kernel: str, data: dict, block_dim: int) -> None: + if kernel == "silu": + _launch_silu(data, block_dim) + else: + _launch_swiglu(data, block_dim) + + +def check_correctness(kernel: str, data: dict) -> tuple[bool, str]: + _launch(kernel, data, data["block_dim"]) + sync() + x_cpu = data["x"].cpu() + out_cpu = data["y"].cpu() + if kernel == "silu": + ref = x_cpu * torch.sigmoid(x_cpu.float()).to(torch.float16) + rtol, atol = 1e-1, 1e-5 + else: + a, b = x_cpu.chunk(2, dim=-1) + ref = a * torch.sigmoid(a.float()).to(torch.float16) * b + rtol, atol = 1e-2, 1e-5 + try: + torch.testing.assert_close(out_cpu, ref, rtol=rtol, atol=atol) + return True, "PASS" + except AssertionError as exc: + diff = (out_cpu.float() - ref.float()).abs().max().item() + return False, f"FAIL max_diff={diff:.4e}: {exc}" + + +def bench_once(kernel: str, data: dict) -> float: + sync() + t0 = time.perf_counter() + _launch(kernel, data, data["block_dim"]) + sync() + return time.perf_counter() - t0 + + +def _load_ladder(path: Path) -> list[dict]: + return json.loads(path.read_text()) + + +def _run_case( + kernel: str, + label: str, + t: int, + shape: dict, + *, + mode: str, + device: str, + seed: int, + repeat: int, + check: bool, + block_dim: int, +) -> dict: + data = _make_inputs(kernel, shape, seed=seed, device=device) + data["block_dim"] = block_dim + row: dict = { + "label": label, + "kernel": kernel, + "T": t, + "block_dim": block_dim, + "omp_threads": int(os.environ.get("OMP_NUM_THREADS", "0") or 0), + **shape, + } + + if check or mode == "correctness": + ok, msg = check_correctness(kernel, data) + row["correctness_pass"] = ok + row["correctness_msg"] = msg + if not ok: + print(f" [{label}] CORRECTNESS FAIL: {msg}", file=sys.stderr) + + if mode in ("bench", "sweep", "correctness", "compare-tools"): + times = [bench_once(kernel, data) for _ in range(max(1, repeat))] + row["sim_wall_s"] = sum(times) / len(times) + row["sim_wall_ms"] = row["sim_wall_s"] * 1000.0 + row["ms_per_element"] = (row["sim_wall_s"] / t) * 1000.0 if t else None + if len(times) > 1: + mean = row["sim_wall_s"] + row["sim_wall_s_std"] = ( + sum((x - mean) ** 2 for x in times) / len(times) + ) ** 0.5 + + return row + + +def _print_table(rows: list[dict]) -> None: + print(f"\n{'label':<16} {'T':>8} {'sim_ms':>12} {'ms/elem':>10} {'ok':>5}") + print("-" * 56) + for r in sorted(rows, key=lambda x: x["T"]): + sim = r.get("sim_wall_ms") + mpe = r.get("ms_per_element") + ok = r.get("correctness_pass") + sim_s = f"{sim:.1f}" if sim is not None else "n/a" + mpe_s = f"{mpe:.2f}" if mpe is not None else "n/a" + ok_s = "yes" if ok else ("no" if ok is False else "-") + print(f"{r.get('label', ''):<16} {r['T']:>8} {sim_s:>12} {mpe_s:>10} {ok_s:>5}") + + +def _spawn_tool(script: str, args: list[str]) -> dict: + cmd = [str(_A5_DIR / script), *args] + proc = subprocess.run(cmd, capture_output=True, text=True) + if proc.returncode != 0: + raise RuntimeError( + f"{script} failed (exit {proc.returncode}):\n{proc.stdout}\n{proc.stderr}" + ) + for line in reversed(proc.stdout.splitlines()): + line = line.strip() + if line.startswith("{") and line.endswith("}"): + return json.loads(line) + raise RuntimeError(f"No JSON output from {script}:\n{proc.stdout}\n{proc.stderr}") + + +def main() -> None: + parser = argparse.ArgumentParser(description="A5 pure-vector simulator driver") + parser.add_argument("--kernel", choices=("silu", "swiglu"), default="silu") + parser.add_argument( + "--mode", + choices=("correctness", "bench", "sweep", "compare-tools"), + default="correctness", + ) + parser.add_argument("--num-elements", type=int, default=None, help="SiLU element count") + parser.add_argument("--batch", type=int, default=1) + parser.add_argument("--input-n", type=int, default=None, help="SwiGLU input width (even)") + parser.add_argument("--label", default="custom") + parser.add_argument("--seed", type=int, default=42) + parser.add_argument("--repeat", type=int, default=1) + parser.add_argument("--device", default="npu:0") + parser.add_argument("--block-dim", type=int, default=DEFAULT_BLOCK_DIM) + parser.add_argument("--ladder", type=Path, default=_DEFAULT_LADDER) + parser.add_argument("--skip-correctness", action="store_true") + parser.add_argument("--output-json", type=Path, default=None) + args = parser.parse_args() + + init_torch_npu(args.device) + _load_lib(args.kernel) + + rows: list[dict] = [] + if args.mode == "sweep": + for entry in _load_ladder(args.ladder): + label, t, shape = _ladder_shape(entry, args.kernel) + rows.append( + _run_case( + args.kernel, + label, + t, + shape, + mode="sweep", + device=args.device, + seed=args.seed, + repeat=args.repeat, + check=not args.skip_correctness, + block_dim=args.block_dim, + ) + ) + _print_table(rows) + elif args.mode == "compare-tools": + if args.kernel == "silu": + if args.num_elements is None: + args.num_elements = 128 + shape = {"num_elements": args.num_elements} + t = args.num_elements + else: + if args.input_n is None: + args.input_n = 256 + shape = {"batch": args.batch, "input_n": args.input_n} + t = args.batch * (args.input_n // 2) + base_args = [ + "--kernel", + args.kernel, + "--mode", + "bench", + "--label", + args.label, + "--repeat", + str(args.repeat), + "--skip-correctness", + "--block-dim", + str(args.block_dim), + ] + if args.kernel == "silu": + base_args += ["--num-elements", str(shape["num_elements"])] + else: + base_args += [ + "--batch", + str(shape["batch"]), + "--input-n", + str(shape["input_n"]), + ] + msprof = _spawn_tool("run_msprof.sh", base_args + ["--output-json", "/dev/stdout"]) + cannsim = _spawn_tool("run_cannsim.sh", base_args) + ms_row = msprof.get("results", [msprof])[0] + cn_row = cannsim.get("results", [cannsim])[0] + rows = [ + {"tool": "msprof", **ms_row}, + {"tool": "cannsim", **cn_row}, + ] + if ms_row.get("sim_wall_s") and cn_row.get("sim_wall_s"): + rows.append( + { + "tool": "ratio_msprof_over_cannsim", + "sim_wall_s": ms_row["sim_wall_s"] / cn_row["sim_wall_s"], + } + ) + print(json.dumps({"results": rows}, indent=2)) + else: + if args.kernel == "silu": + num_elements = args.num_elements if args.num_elements is not None else 128 + shape = {"num_elements": num_elements} + t = num_elements + else: + input_n = args.input_n if args.input_n is not None else 256 + shape = {"batch": args.batch, "input_n": input_n} + t = args.batch * (input_n // 2) + rows.append( + _run_case( + args.kernel, + args.label, + t, + shape, + mode=args.mode, + device=args.device, + seed=args.seed, + repeat=args.repeat, + check=not args.skip_correctness + and args.mode in ("correctness", "bench"), + block_dim=args.block_dim, + ) + ) + if args.mode == "correctness": + ok = rows[0].get("correctness_pass", False) + print(rows[0].get("correctness_msg", "FAIL")) + if not ok: + raise SystemExit(1) + + host_cpu = capture_host_cpu() + payload = { + "timestamp": datetime.now(timezone.utc).isoformat(), + "mode": args.mode, + "kernel": args.kernel, + "soc_msprof": "Ascend950PR_9599", + "soc_cannsim": "Ascend950", + "arch": "dav-c310-vec", + "host_cpu": host_cpu, + "results": rows, + } + out = json.dumps(payload, indent=2) + if args.output_json: + if str(args.output_json) == "/dev/stdout": + print(out) + else: + args.output_json.parent.mkdir(parents=True, exist_ok=True) + args.output_json.write_text(out) + print(out) + elif args.mode != "compare-tools": + print(out) + + +if __name__ == "__main__": + main() From f047794f98ee528133996d3f348eb2ad8420c2d4 Mon Sep 17 00:00:00 2001 From: jiaweizhuang Date: Wed, 27 May 2026 18:24:29 +0000 Subject: [PATCH 2/3] third_party: add pto-isa submodule (v9.0.0) for a5_sim builds Pin pto-isa at v9.0.0 to match CANN 9.0.0 and teach the a5_sim build helper to resolve headers from the in-repo third_party path first. --- .gitmodules | 3 +++ examples/a5_sim/common/build.py | 4 ++++ third_party/pto-isa | 1 + 3 files changed, 8 insertions(+) create mode 100644 .gitmodules create mode 160000 third_party/pto-isa diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 00000000..0be6691c --- /dev/null +++ b/.gitmodules @@ -0,0 +1,3 @@ +[submodule "third_party/pto-isa"] + path = third_party/pto-isa + url = https://gitcode.com/cann/pto-isa.git diff --git a/examples/a5_sim/common/build.py b/examples/a5_sim/common/build.py index 075522a3..b1be4a4d 100644 --- a/examples/a5_sim/common/build.py +++ b/examples/a5_sim/common/build.py @@ -39,6 +39,10 @@ def _pto_include_root() -> Path: candidate = Path(ascend) if (candidate / "include" / "pto" / "pto-inst.hpp").is_file(): return candidate / "include" + repo_root = A5_SIM_ROOT.parent.parent + local = repo_root / "third_party" / "pto-isa" / "include" + if (local / "pto" / "pto-inst.hpp").is_file(): + return local fallback = Path("/workdir/megagdn-pto/third_party/pto-isa/include") if (fallback / "pto" / "pto-inst.hpp").is_file(): return fallback diff --git a/third_party/pto-isa b/third_party/pto-isa new file mode 160000 index 00000000..37752695 --- /dev/null +++ b/third_party/pto-isa @@ -0,0 +1 @@ +Subproject commit 377526952e95c39f9149cb94d261eba9e394f0d4 From 823d6bb8f3fa5167b33a8ac19bf1586de1bfd64e Mon Sep 17 00:00:00 2001 From: jiaweizhuang Date: Wed, 27 May 2026 18:24:32 +0000 Subject: [PATCH 3/3] docs(a5_sim): record simulator timings on AMD EPYC; keep Kunpeng MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reproduce SiLU/SwiGLU msprof and cannsim sweeps on x86_64 (AMD EPYC 9654). Document both hosts side by side: Kunpeng-920 baseline tables unchanged, new EPYC numbers (~3–5× faster wall time). Correctness PASS on smoke shapes for both tools. --- examples/a5_sim/README.md | 74 +++++++++++++++++++++++++--- examples/a5_sim/cannsim_vs_msprof.md | 29 +++++++++-- 2 files changed, 92 insertions(+), 11 deletions(-) diff --git a/examples/a5_sim/README.md b/examples/a5_sim/README.md index 2315ebbf..b98978fe 100644 --- a/examples/a5_sim/README.md +++ b/examples/a5_sim/README.md @@ -8,7 +8,7 @@ Kernels compile with `--cce-aicore-arch=dav-c310-vec` and `-DREGISTER_BASE`. For ```bash source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash -export PTO_LIB_PATH=/path/to/pto-isa # or megagdn-pto/third_party/pto-isa +export PTO_LIB_PATH=/path/to/pto-kernels/third_party/pto-isa pip install torch torch-npu ``` @@ -36,15 +36,22 @@ python3 -m common.build --all ./run_thread_sweep.sh # OMP sweep, T=512, both tools ``` -## Host environment +## Host environments -Measured on **Kunpeng-920** (HUAWEI Kunpeng 920 5250), **192 logical CPUs** (4 sockets × 48 cores, 1 thread/core), **aarch64**, CANN **9.0.0**, May 2026. +| Host | CPU | Logical CPUs | Arch | CANN | +|------|-----|--------------|------|------| +| Kunpeng server | HUAWEI Kunpeng 920 5250 | 192 (4×48 cores, 1 thread/core) | aarch64 | 9.0.0 | +| x86 server | AMD EPYC 9654 96-Core | 192 (1×96 cores, 2 threads/core) | x86_64 | 9.0.0 | + +Both measured May 2026. On the same ladder shapes, the x86 host is roughly **3–5× faster** in simulator wall time (startup overhead still dominates smoke on both). ## Simulator time cost summary Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/msprof or cannsim startup). **T** = output element count (same ladder labels as the 910B `chunk_h` benchmark). **Correctness PASS** at smoke shape on both tools (PyTorch CPU reference). -### SiLU — msprof (`Ascend950PR_9599`) +### Kunpeng-920 (aarch64) + +#### SiLU — msprof (`Ascend950PR_9599`) | Label | T | Sim wall | ms/element | |-------|---|----------|------------| @@ -54,7 +61,7 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr | varlen_2x512 | 1024 | **26 s** | 26 ms | | medium | 4096 | **29 s** | 7.1 ms | -### SiLU — cannsim (`Ascend950`) +#### SiLU — cannsim (`Ascend950`) | Label | T | Sim wall | ms/element | |-------|---|----------|------------| @@ -64,7 +71,7 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr | varlen_2x512 | 1024 | **16 s** | 16 ms | | medium | 4096 | **17 s** | 4.1 ms | -### SwiGLU — msprof +#### SwiGLU — msprof | Label | T | Sim wall | ms/element | |-------|---|----------|------------| @@ -74,7 +81,7 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr | varlen_2x512 | 1024 | **47 s** | 46 ms | | medium | 4096 | **52 s** | 13 ms | -### SwiGLU — cannsim +#### SwiGLU — cannsim | Label | T | Sim wall | ms/element | |-------|---|----------|------------| @@ -84,15 +91,66 @@ Wall time uses `time.perf_counter()` around one kernel launch (includes PEM/mspr | varlen_2x512 | 1024 | **21 s** | 21 ms | | medium | 4096 | **22 s** | 5.4 ms | -**Scaling law (approximate):** +**Scaling law (Kunpeng, approximate):** - Fixed overhead **~15–75 s** at T=128 dominates smoke; do not extrapolate from smoke alone. - After startup, cost scales **roughly linearly with T** at ~**0.005–0.06 s/element** on cannsim and ~**0.007–0.06 s/element** on msprof for T≥512. - **Varlen vs fixed length** at the same T: negligible (1024 tokens: SiLU msprof 26 s vs 26 s). - Pure-vector kernels finish in **minutes** on the default ladder; contrast with mix `chunk_h_mini` v1 (scalar matmul, 35+ min timeouts). +### AMD EPYC 9654 (x86_64) + +#### SiLU — msprof (`Ascend950PR_9599`) + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **12 s** | 91 ms | +| tiny | 512 | **7 s** | 14 ms | +| small | 1024 | **7 s** | 7.1 ms | +| varlen_2x512 | 1024 | **7 s** | 6.9 ms | +| medium | 4096 | **12 s** | 3.0 ms | + +#### SiLU — cannsim (`Ascend950`) + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **9 s** | 72 ms | +| tiny | 512 | **4 s** | 7.2 ms | +| small | 1024 | **4 s** | 3.9 ms | +| varlen_2x512 | 1024 | **4 s** | 3.5 ms | +| medium | 4096 | **5 s** | 1.3 ms | + +#### SwiGLU — msprof + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **18 s** | 137 ms | +| tiny | 512 | **13 s** | 25 ms | +| small | 1024 | **17 s** | 16 ms | +| varlen_2x512 | 1024 | **15 s** | 14 ms | +| medium | 4096 | **20 s** | 4.9 ms | + +#### SwiGLU — cannsim + +| Label | T | Sim wall | ms/element | +|-------|---|----------|------------| +| smoke | 128 | **11 s** | 87 ms | +| tiny | 512 | **6 s** | 12 ms | +| small | 1024 | **6 s** | 6.3 ms | +| varlen_2x512 | 1024 | **6 s** | 5.8 ms | +| medium | 4096 | **7 s** | 1.7 ms | + +**Scaling law (AMD EPYC, approximate):** + +- Fixed overhead **~9–18 s** at T=128 dominates smoke; do not extrapolate from smoke alone. +- After startup, cost scales **roughly linearly with T** at ~**0.001–0.012 s/element** on cannsim and ~**0.003–0.014 s/element** on msprof for T≥512. +- **Varlen vs fixed length** at the same T: negligible (1024 tokens: SiLU msprof 7 s vs 7 s). +- Pure-vector kernels finish in **under ~2 min** for the full ladder on this host. + ### vs CPU thread count (OMP) +Measured on **Kunpeng-920** only (not re-run on x86). + Fixed workload **T=512** (SiLU), swept `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS` together: | OMP threads | msprof mean (s) | speedup vs 1 | cannsim mean (s) | speedup vs 1 | diff --git a/examples/a5_sim/cannsim_vs_msprof.md b/examples/a5_sim/cannsim_vs_msprof.md index 393cc7d9..775c1104 100644 --- a/examples/a5_sim/cannsim_vs_msprof.md +++ b/examples/a5_sim/cannsim_vs_msprof.md @@ -10,7 +10,8 @@ Pure-vector A5 examples for **`pto-kernels/examples/a5_sim`**. Recommended first | AICore arch | `dav-c310-vec` | `dav-c310-vec` | | Correctness (smoke) | **PASS** (SiLU T=128, SwiGLU T=128) | **PASS** (same shapes) | | Invocation | Wraps `python3 vec_sim.py` directly | Executable `run_cannsim_entry.sh` + `-u "..."` | -| Typical smoke wall | SiLU ~26 s, SwiGLU ~54 s | SiLU ~14 s, SwiGLU ~26 s | +| Typical smoke wall (Kunpeng) | SiLU ~52 s, SwiGLU ~75 s | SiLU ~42 s, SwiGLU ~52 s | +| Typical smoke wall (AMD EPYC) | SiLU ~12 s, SwiGLU ~18 s | SiLU ~9 s, SwiGLU ~11 s | | Exit code | 0 on success | May return non-zero after **teardown segfault** even when JSON is valid | ## Tool overview @@ -30,6 +31,8 @@ Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (sim ## Speed comparison (scale ladder, timing-only sweep) +### Kunpeng-920 (aarch64, May 2026) + **SiLU msprof vs cannsim** (seconds, wall clock): | label | T | msprof | cannsim | ratio msprof/cannsim | @@ -48,7 +51,27 @@ Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (sim | small | 1024 | 61 | 29 | 2.1× | | medium | 4096 | 52 | 22 | 2.4× | -cannsim is generally **faster** on wall clock for these pure-vector kernels once T≥512; msprof carries heavier profiling/injection overhead. +### AMD EPYC 9654 (x86_64, May 2026) + +**SiLU msprof vs cannsim** (seconds, wall clock): + +| label | T | msprof | cannsim | ratio msprof/cannsim | +|-------|---|--------|---------|----------------------| +| smoke | 128 | 12 | 9 | 1.3× | +| tiny | 512 | 7 | 4 | 1.8× | +| small | 1024 | 7 | 4 | 1.8× | +| medium | 4096 | 12 | 5 | 2.4× | + +**SwiGLU msprof vs cannsim**: + +| label | T | msprof | cannsim | ratio | +|-------|---|--------|---------|-------| +| smoke | 128 | 18 | 11 | 1.6× | +| tiny | 512 | 13 | 6 | 2.2× | +| small | 1024 | 17 | 6 | 2.8× | +| medium | 4096 | 20 | 7 | 2.9× | + +On both hosts, cannsim is generally **faster** on wall clock for these pure-vector kernels once T≥512; msprof carries heavier profiling/injection overhead. Tool ratios are similar; absolute wall time is ~3–5× lower on the AMD EPYC host. ## Failure modes @@ -65,7 +88,7 @@ cannsim is generally **faster** on wall clock for these pure-vector kernels once ```bash cd pto-kernels/examples/a5_sim source $ASCEND_HOME_PATH/bin/setenv.bash -export PTO_LIB_PATH=/path/to/pto-isa +export PTO_LIB_PATH=/path/to/pto-kernels/third_party/pto-isa MSPROF_TIMEOUT=30 ./run_msprof.sh --kernel silu --mode sweep --skip-correctness \ --output-json outputs/silu_sweep_msprof.json