125 changes: 125 additions & 0 deletions examples/a5_sim/README.md
@@ -0,0 +1,125 @@
# A5 Pure-Vector Simulator Examples (SiLU + SwiGLU)

Self-contained **Ascend950PR** pure-vector PTO kernels with **msprof op simulator** and **cannsim record** harnesses. Use these to validate A5 simulator plumbing before tackling mix kernels (see [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)).

Kernels compile with `--cce-aicore-arch=dav-c310-vec` and `-DREGISTER_BASE`. For the 910B `chunk_h` simulator benchmark (different arch), see [`megagdn-pto/benchmarks/simulator/README.md`](../../megagdn-pto/benchmarks/simulator/README.md).

## Prerequisites

```bash
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export PTO_LIB_PATH=/path/to/pto-isa # or megagdn-pto/third_party/pto-isa
pip install torch torch-npu
```

Build kernels:

```bash
cd pto-kernels/examples/a5_sim
python3 -m common.build --all
```

## Quick start

```bash
# Correctness smoke (msprof)
./run_msprof.sh --kernel silu --mode correctness --num-elements 128 --label smoke
./run_msprof.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke

# Same under cannsim
./run_cannsim.sh --kernel silu --mode correctness --num-elements 128 --label smoke
./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke

# Scale ladder timing
./run_msprof.sh --kernel silu --mode sweep --skip-correctness \
  --output-json outputs/silu_sweep_msprof.json
./run_thread_sweep.sh # OMP sweep, T=512, both tools
```

## Host environment

Measured on **Kunpeng-920** (HUAWEI Kunpeng 920 5250), **192 logical CPUs** (4 sockets × 48 cores, 1 thread/core), **aarch64**, CANN **9.0.0**, May 2026.

## Simulator time cost summary

Wall time is measured with `time.perf_counter()` around a single kernel launch, so it includes PEM/msprof or cannsim startup. **T** is the output element count (ladder labels match the 910B `chunk_h` benchmark). Both tools **PASS** correctness at the smoke shape against a PyTorch CPU reference.
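
The measurement described above can be sketched as follows. This is a minimal illustration, not the code in `vec_sim.py`: `launch` is a hypothetical zero-argument callable standing in for one kernel launch, and the helper names are made up for this example.

```python
import time


def ms_per_element(wall_s, num_elements):
    """Amortized cost as reported in the tables: wall time divided by T."""
    return wall_s * 1000.0 / num_elements


def time_launch(launch, num_elements):
    """Time one call to `launch` (hypothetical: a single kernel launch).

    Simulator startup happens inside the call, so it is part of wall_s.
    """
    start = time.perf_counter()
    launch()
    wall_s = time.perf_counter() - start
    return {
        "wall_s": wall_s,
        "ms_per_element": ms_per_element(wall_s, num_elements),
    }
```

For example, `ms_per_element(52, 128)` reproduces the SiLU msprof smoke row (about 406 ms per element).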

### SiLU — msprof (`Ascend950PR_9599`)

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **52 s** | 406 ms |
| tiny | 512 | **24 s** | 48 ms |
| small | 1024 | **26 s** | 25 ms |
| varlen_2x512 | 1024 | **26 s** | 26 ms |
| medium | 4096 | **29 s** | 7.1 ms |

### SiLU — cannsim (`Ascend950`)

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **42 s** | 331 ms |
| tiny | 512 | **15 s** | 30 ms |
| small | 1024 | **17 s** | 17 ms |
| varlen_2x512 | 1024 | **16 s** | 16 ms |
| medium | 4096 | **17 s** | 4.1 ms |

### SwiGLU — msprof

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **75 s** | 588 ms |
| tiny | 512 | **49 s** | 95 ms |
| small | 1024 | **61 s** | 59 ms |
| varlen_2x512 | 1024 | **47 s** | 46 ms |
| medium | 4096 | **52 s** | 13 ms |

### SwiGLU — cannsim

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **52 s** | 403 ms |
| tiny | 512 | **27 s** | 52 ms |
| small | 1024 | **29 s** | 28 ms |
| varlen_2x512 | 1024 | **21 s** | 21 ms |
| medium | 4096 | **22 s** | 5.4 ms |

**Scaling law (approximate):**

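- Fixed startup overhead dominates smoke: T=128 takes **42–75 s**, more than any larger shape; do not extrapolate from smoke alone.
- For T≥512, wall time grows only weakly with T (SiLU msprof: 24 s at T=512 vs 29 s at T=4096), so the amortized cost falls from ~**52 ms/element** to ~**4 ms/element** on cannsim and from ~**95 ms/element** to ~**7 ms/element** on msprof across the ladder.
- **Varlen vs fixed length** at the same T (1024 tokens): negligible for SiLU (msprof 26 s vs 26 s, cannsim 16 s vs 17 s); SwiGLU varlen ran faster (msprof 47 s vs 61 s, cannsim 21 s vs 29 s), so the ladder shows no varlen penalty.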
- Pure-vector kernels finish in **minutes** on the default ladder; contrast with mix `chunk_h_mini` v1 (scalar matmul, 35+ min timeouts).
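
One way to separate fixed overhead from marginal per-element cost is an ordinary least-squares line through the ladder points. The sketch below fits only the three SiLU msprof rows with T≥512 from the table above (the fixed-length rows plus `medium`); with so few points it is a rough indication, not a calibrated model, and `fit_overhead` is a name invented for this example.

```python
def fit_overhead(points):
    """Least-squares fit of wall_s = overhead_s + per_element_s * T."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_w = sum(w for _, w in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (w - mean_w) for t, w in points)
    per_element_s = sxy / sxx
    overhead_s = mean_w - per_element_s * mean_t
    return overhead_s, per_element_s


# (T, wall seconds) for SiLU msprof: tiny, small, medium.
SILU_MSPROF = [(512, 24), (1024, 26), (4096, 29)]

overhead_s, per_element_s = fit_overhead(SILU_MSPROF)
print(f"overhead ~{overhead_s:.1f} s, marginal ~{per_element_s * 1000:.2f} ms/element")
```

On these rows the fit attributes roughly 24 s to fixed overhead and on the order of 1 ms to each additional element, which is why the amortized ms/element column drops so sharply as T grows.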

### vs CPU thread count (OMP)

Fixed workload **T=512** (SiLU), swept `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS` together:

| OMP threads | msprof mean (s) | speedup vs 1 | cannsim mean (s) | speedup vs 1 |
|-------------|-----------------|--------------|------------------|--------------|
| 1 | 39.5 | 1.00× | 31.6 | 1.00× |
| 2 | 44.0 | 0.90× | 35.3 | 0.90× |
| 4 | 41.7 | 0.95× | 34.4 | 0.92× |
| 8 | 41.4 | 0.95× | 35.1 | 0.90× |
| 16 | 44.6 | 0.89× | 31.4 | 1.01× |
| 32 | 42.2 | 0.93× | 32.1 | 0.99× |

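**Conclusion:** host OMP thread env vars do not speed up the simulator. No setting beats single-threaded by more than **~1%**, and the slowest settings are **~13%** slower on msprof (16 threads) and **~12%** slower on cannsim (2 threads). Tuning `OMP_NUM_THREADS` is not an effective lever; PEM uses internal worker pools.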
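
A sweep of this kind can be sketched in a few lines. This is an illustration under assumptions, not the contents of `run_thread_sweep.sh`: `cmd` is whatever command launches one simulator run, and the repeat count is arbitrary.

```python
import os
import statistics
import subprocess
import time

# The three variables the sweep sets together.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def sweep_threads(cmd, thread_counts, repeats=3):
    """Return {threads: mean wall seconds} for `cmd` under each thread setting."""
    means = {}
    for threads in thread_counts:
        env = dict(os.environ)
        env.update({var: str(threads) for var in THREAD_VARS})
        walls = []
        for _ in range(repeats):
            start = time.perf_counter()
            subprocess.run(cmd, env=env, check=True)
            walls.append(time.perf_counter() - start)
        means[threads] = statistics.mean(walls)
    return means


def speedups(means, baseline=1):
    """Speedup relative to the baseline thread count (values above 1 are faster)."""
    return {threads: means[baseline] / wall for threads, wall in means.items()}
```

Feeding the table's msprof means into `speedups`, for instance `speedups({1: 39.5, 16: 44.6})`, gives about 0.89 for 16 threads, matching the row above.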

## Layout

```
examples/a5_sim/
├── kernels/silu_a5.cpp, swiglu_a5.cpp
├── vec_sim.py # driver (--kernel silu|swiglu)
├── common/build.py # dav-c310-vec build
├── run_msprof.sh / run_cannsim.sh / run_thread_sweep.sh
├── configs/scale_ladder.json
└── outputs/ # gitignored results
```

## References

- A5 PTO ST tests: `megagdn-pto/third_party/pto-isa/tests/npu/a5/src/st/testcase`
- A2 originals: `examples/jit_cpp/silu_dynamic`, `csrc/kernel/kernel_swiglu.cpp`
- Tool comparison: [`cannsim_vs_msprof.md`](cannsim_vs_msprof.md)
86 changes: 86 additions & 0 deletions examples/a5_sim/cannsim_vs_msprof.md
@@ -0,0 +1,86 @@
# SiLU / SwiGLU — msprof vs cannsim (Ascend950 / dav-c310-vec)

Pure-vector A5 examples for **`pto-kernels/examples/a5_sim`**. Recommended first step for Ascend950 simulator validation before mix kernels in [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim).

## Executive summary

| Aspect | msprof op simulator | cannsim record |
|--------|---------------------|----------------|
| SoC flag | `Ascend950PR_9599` | `Ascend950` |
| AICore arch | `dav-c310-vec` | `dav-c310-vec` |
| Correctness (smoke) | **PASS** (SiLU T=128, SwiGLU T=128) | **PASS** (same shapes) |
| Invocation | Wraps `python3 vec_sim.py` directly | Executable `run_cannsim_entry.sh` + `-u "..."` |
| Typical smoke wall | SiLU ~26 s, SwiGLU ~54 s | SiLU ~14 s, SwiGLU ~26 s |
| Exit code | 0 on success | May return non-zero after **teardown segfault** even when JSON is valid |

## Tool overview

**msprof** preloads the CA model via `LD_PRELOAD` and runs Python + ctypes kernel launch (same pattern as [`ptoisa-a5-test/tests/torch_sim`](../../ptoisa-a5-test/tests/torch_sim/msprof_mechanism.md)).

**cannsim** runs a standalone entry script under full SoC simulation. User args pass via `-u "--kernel silu --mode ..."`, not trailing argv.
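
Because cannsim takes the driver arguments as one string, it can help to build that string from a list so quoting stays correct. The sketch below only constructs the value passed to `-u`; it does not assume anything else about the cannsim command line, and `cannsim_user_args` is a hypothetical helper.

```python
import shlex


def cannsim_user_args(kernel, mode, **options):
    """Build the single string handed to the -u option."""
    argv = ["--kernel", kernel, "--mode", mode]
    for name, value in options.items():
        # Python keyword names use underscores; the driver flags use hyphens.
        argv += [f"--{name.replace('_', '-')}", str(value)]
    return shlex.join(argv)


user_args = cannsim_user_args("swiglu", "correctness", batch=1, input_n=256)
print(user_args)  # --kernel swiglu --mode correctness --batch 1 --input-n 256
```

`shlex.join` quotes any argument containing spaces or shell metacharacters, so the entry script can recover the original list with `shlex.split`.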

## Correctness

| Kernel | Shape | msprof | cannsim | Reference |
|--------|-------|--------|---------|-----------|
| SiLU | T=128 | PASS | PASS | `x * sigmoid(x)` on CPU |
| SwiGLU | batch=1, input_n=256 (T=128 out) | PASS | PASS | split + SiLU gate × value on CPU |

Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (simulator rejects many dynamic NPU ops).
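
The reference math is small enough to state directly. The sketch below is dependency-free for illustration, whereas the harness uses PyTorch on CPU. Two things here are assumptions rather than facts from the harness: the gate occupying the first half of each SwiGLU row, and the tolerance values.

```python
import math


def silu_ref(x):
    """SiLU reference: x * sigmoid(x). Not guarded against overflow for very negative x."""
    return x / (1.0 + math.exp(-x))


def swiglu_ref(row):
    """SwiGLU reference for one row of length 2N, returning N outputs.

    Assumption: the first half is the gate and the second half is the value.
    """
    half = len(row) // 2
    gate, value = row[:half], row[half:]
    return [silu_ref(g) * v for g, v in zip(gate, value)]


def all_close(actual, expected, atol=1e-3, rtol=1e-3):
    """Elementwise tolerance check in the style of torch.allclose (tolerances assumed)."""
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))
```

With `input_n=256`, `swiglu_ref` returns 128 values per row, which is the T=128 output size listed in the table.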

## Speed comparison (scale ladder, timing-only sweep)

**SiLU msprof vs cannsim** (seconds, wall clock):

| label | T | msprof | cannsim | ratio msprof/cannsim |
|-------|---|--------|---------|----------------------|
| smoke | 128 | 52 | 42 | 1.2× |
| tiny | 512 | 24 | 15 | 1.6× |
| small | 1024 | 26 | 17 | 1.5× |
| medium | 4096 | 29 | 17 | 1.7× |

**SwiGLU msprof vs cannsim**:

| label | T | msprof | cannsim | ratio |
|-------|---|--------|---------|-------|
| smoke | 128 | 75 | 52 | 1.4× |
| tiny | 512 | 49 | 27 | 1.8× |
| small | 1024 | 61 | 29 | 2.1× |
| medium | 4096 | 52 | 22 | 2.4× |

cannsim is **faster** on wall clock at every ladder point for these pure-vector kernels (1.2–2.4×), and the gap widens once T≥512; msprof carries heavier profiling/injection overhead.

## Failure modes

| Issue | Mitigation |
|-------|------------|
| `torch.randn` on NPU under sim | Create tensors on CPU, `.to("npu:0")` |
| Reference ops on NPU fail | Compare `y.cpu()` vs CPU PyTorch ref |
| cannsim segfault on exit | JSON is still written; `run_cannsim.sh` accepts valid `--output-json` |
| A5 `pipe_barrier(PIPE_V)` compile error | Use `PIPE_ALL` in SwiGLU compute path |
| `Stride` ambiguous on A5 | Qualify as `pto::Stride<...>` |
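
The segfault-on-exit mitigation amounts to trusting the output file over the exit code. A minimal sketch of that policy follows; it is not the logic in `run_cannsim.sh`, and the function name is hypothetical.

```python
import json
import subprocess
from pathlib import Path


def run_tolerating_teardown_crash(cmd, output_json):
    """Run `cmd`; accept a non-zero exit only if valid JSON was written."""
    path = Path(output_json)
    path.unlink(missing_ok=True)  # never accept a stale result from an earlier run
    proc = subprocess.run(cmd)
    try:
        result = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        result = None
    if result is None:
        raise RuntimeError(
            f"no valid JSON at {path} (exit code {proc.returncode})"
        )
    return result
```

Deleting the output file before the run matters: without it, a crash that happens before the result is written could be masked by JSON left over from a previous run.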

## Invocation examples

```bash
cd pto-kernels/examples/a5_sim
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export PTO_LIB_PATH=/path/to/pto-isa

MSPROF_TIMEOUT=30 ./run_msprof.sh --kernel silu --mode sweep --skip-correctness \
  --output-json outputs/silu_sweep_msprof.json

./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 \
  --output-json outputs/smoke_swiglu_cannsim.json
```

## Recommendations

1. **Start with SiLU** (simplest 1D pipeline) under msprof smoke correctness.
2. Use **cannsim** for faster scale sweeps once smoke passes.
3. Use **mix chunk_h_mini** only after pure-vector path is green ([`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)).

## References

- Harness README: [`README.md`](README.md)
- 910B chunk_h comparison: [`megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md`](../../megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md)
1 change: 1 addition & 0 deletions examples/a5_sim/common/__init__.py
@@ -0,0 +1 @@
"""Shared helpers for A5 pure-vector simulator examples."""
172 changes: 172 additions & 0 deletions examples/a5_sim/common/build.py
@@ -0,0 +1,172 @@
#!/usr/bin/env python3
"""Bisheng build helper for A5 pure-vector kernels (dav-c310-vec, REGISTER_BASE)."""

from __future__ import annotations

import argparse
import os
import shutil
import subprocess
import sys
from pathlib import Path

A5_SIM_ROOT = Path(__file__).resolve().parent.parent
KERNEL_DIR = A5_SIM_ROOT / "kernels"
BUILD_DIR = A5_SIM_ROOT / "build"

KERNELS = {
    "silu": {
        "source": "silu_a5.cpp",
        "lib": "libsilu_a5.so",
    },
    "swiglu": {
        "source": "swiglu_a5.cpp",
        "lib": "libswiglu_a5.so",
    },
}


def _pto_include_root() -> Path:
    env = os.environ.get("PTO_LIB_PATH")
    if env:
        candidate = Path(env)
        if (candidate / "include" / "pto" / "pto-inst.hpp").is_file():
            return candidate / "include"
        if (candidate / "pto" / "pto-inst.hpp").is_file():
            return candidate
    ascend = os.environ.get("ASCEND_HOME_PATH") or os.environ.get("ASCEND_TOOLKIT_HOME")
    if ascend:
        candidate = Path(ascend)
        if (candidate / "include" / "pto" / "pto-inst.hpp").is_file():
            return candidate / "include"
    fallback = Path("/workdir/megagdn-pto/third_party/pto-isa/include")
    if (fallback / "pto" / "pto-inst.hpp").is_file():
        return fallback
    raise EnvironmentError(
        "PTO headers not found. Set PTO_LIB_PATH or source CANN setenv.bash."
    )


def _ascend_home() -> Path:
    home = os.environ.get("ASCEND_HOME_PATH") or os.environ.get("ASCEND_TOOLKIT_HOME")
    if not home:
        raise EnvironmentError("ASCEND_HOME_PATH is not set. Source CANN setenv.bash first.")
    return Path(home)


def _bisheng() -> str:
    ascend = _ascend_home()
    candidate = ascend / "bin" / "bisheng"
    if candidate.is_file():
        return str(candidate)
    found = shutil.which("bisheng")
    if found:
        return found
    raise FileNotFoundError("bisheng compiler not found")


def _common_includes() -> list[str]:
    ascend = _ascend_home()
    driver = os.environ.get("ASCEND_DRIVER_PATH", "/usr/local/Ascend/driver")
    pto_root = _pto_include_root()
    return [
        f"-I{pto_root}",
        f"-I{ascend}/include",
        f"-I{driver}/kernel/inc",
        f"-I{KERNEL_DIR}",
    ]


def _kernel_flags() -> list[str]:
    ascend = _ascend_home()
    return (
        _common_includes()
        + [
            f"-I{ascend}/pkg_inc",
            f"-I{ascend}/pkg_inc/profiling",
            f"-I{ascend}/pkg_inc/runtime/runtime",
            "-std=gnu++17",
            "-O2",
            "-Wno-macro-redefined",
            "-Wno-ignored-attributes",
            "-Wno-unknown-attributes",
            "-fPIC",
            "-xcce",
            "-Xhost-start",
            "-Xhost-end",
            "-mllvm",
            "-cce-aicore-stack-size=0x8000",
            "-mllvm",
            "-cce-aicore-function-stack-size=0x8000",
            "-mllvm",
            "-cce-aicore-record-overflow=true",
            "-mllvm",
            "-cce-aicore-addr-transform",
            "-mllvm",
            "-cce-aicore-dcci-insert-for-scalar=false",
            "--cce-aicore-arch=dav-c310-vec",
            "-DREGISTER_BASE",
        ]
    )


def _run(cmd: list[str], cwd: Path) -> None:
    print("==>", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)


def build_kernel(name: str, force: bool = False) -> Path:
    if name not in KERNELS:
        raise ValueError(f"unknown kernel: {name}")
    spec = KERNELS[name]
    BUILD_DIR.mkdir(parents=True, exist_ok=True)
    out = BUILD_DIR / spec["lib"]
    if out.is_file() and not force:
        return out

    src_path = KERNEL_DIR / spec["source"]
    obj = BUILD_DIR / f"{src_path.stem}.o"
    bisheng = _bisheng()
    _run([bisheng, *_kernel_flags(), "-c", str(src_path), "-o", str(obj)], cwd=BUILD_DIR)
    _run(
        [
            bisheng,
            "-fPIC",
            "-shared",
            "--cce-fatobj-link",
            "-Wl,-soname," + spec["lib"],
            str(obj),
            "-o",
            str(out),
        ],
        cwd=BUILD_DIR,
    )
    print(f"Built {out}")
    return out


def build_all(force: bool = False) -> dict[str, Path]:
    return {name: build_kernel(name, force=force) for name in KERNELS}


def main() -> None:
    parser = argparse.ArgumentParser(description="Build A5 pure-vector example kernels")
    parser.add_argument("--kernel", choices=tuple(KERNELS.keys()))
    parser.add_argument("--all", action="store_true")
    parser.add_argument("--force", action="store_true")
    args = parser.parse_args()
    try:
        if args.all:
            build_all(force=args.force)
        elif args.kernel:
            build_kernel(args.kernel, force=args.force)
        else:
            parser.print_help()
            raise SystemExit(1)
    except (EnvironmentError, FileNotFoundError, subprocess.CalledProcessError, ValueError) as exc:
        print(f"build failed: {exc}", file=sys.stderr)
        raise SystemExit(1) from exc


if __name__ == "__main__":
    main()