3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "third_party/pto-isa"]
path = third_party/pto-isa
url = https://gitcode.com/cann/pto-isa.git
183 changes: 183 additions & 0 deletions examples/a5_sim/README.md
@@ -0,0 +1,183 @@
# A5 Pure-Vector Simulator Examples (SiLU + SwiGLU)

Self-contained **Ascend950PR** pure-vector PTO kernels with **msprof op simulator** and **cannsim record** harnesses. Use these to validate A5 simulator plumbing before tackling mix kernels (see [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)).

Kernels compile with `--cce-aicore-arch=dav-c310-vec` and `-DREGISTER_BASE`. For the 910B `chunk_h` simulator benchmark (different arch), see [`megagdn-pto/benchmarks/simulator/README.md`](../../megagdn-pto/benchmarks/simulator/README.md).

## Prerequisites

```bash
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export PTO_LIB_PATH=/path/to/pto-kernels/third_party/pto-isa
pip install torch torch-npu
```

Build kernels:

```bash
cd pto-kernels/examples/a5_sim
python3 -m common.build --all
```

## Quick start

```bash
# Correctness smoke (msprof)
./run_msprof.sh --kernel silu --mode correctness --num-elements 128 --label smoke
./run_msprof.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke

# Same under cannsim
./run_cannsim.sh --kernel silu --mode correctness --num-elements 128 --label smoke
./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 --label smoke

# Scale ladder timing
./run_msprof.sh --kernel silu --mode sweep --skip-correctness \
--output-json outputs/silu_sweep_msprof.json
./run_thread_sweep.sh # OMP sweep, T=512, both tools
```

## Host environments

| Host | CPU | Logical CPUs | Arch | CANN |
|------|-----|--------------|------|------|
| Kunpeng server | HUAWEI Kunpeng 920 5250 | 192 (4×48 cores, 1 thread/core) | aarch64 | 9.0.0 |
| x86 server | AMD EPYC 9654 96-Core | 192 (1×96 cores, 2 threads/core) | x86_64 | 9.0.0 |

Both measured May 2026. On the same ladder shapes, the x86 host is roughly **3–5× faster** in simulator wall time (startup overhead still dominates smoke on both).
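
The host-to-host ratio can be recomputed from the SiLU tables in this README. The following is a minimal sketch; the variable and function names are illustrative, and because the tabulated wall times are rounded to whole seconds the ratios are approximate.

```python
# Wall times in seconds, copied from the SiLU tables in this README.
LABELS = ["smoke", "tiny", "small", "varlen_2x512", "medium"]
KUNPENG = {"msprof": [52, 24, 26, 26, 29], "cannsim": [42, 15, 17, 16, 17]}
EPYC = {"msprof": [12, 7, 7, 7, 12], "cannsim": [9, 4, 4, 4, 5]}


def host_ratios(slow, fast):
    """Return {tool: [slow_wall / fast_wall for each ladder label]}."""
    return {
        tool: [s / f for s, f in zip(slow[tool], fast[tool])]
        for tool in slow
    }


ratios = host_ratios(KUNPENG, EPYC)
for tool, values in ratios.items():
    for label, ratio in zip(LABELS, values):
        print(f"{tool:8s} {label:13s} {ratio:.1f}x")
```

The same helper works for the SwiGLU tables if their wall times are substituted.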

## Simulator time cost summary

Wall time is measured with `time.perf_counter()` around one kernel launch and includes PEM/msprof or cannsim startup. **T** is the output element count; the ladder labels match the 910B `chunk_h` benchmark. Both tools **PASS** correctness at the smoke shape against a PyTorch CPU reference.
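
As a rough illustration of how the "Sim wall" and "ms/element" columns relate, the sketch below times a single call and divides by the element count. `time_launch` and `fake_launch` are hypothetical names used only for this example; the actual driver is `vec_sim.py`, and its internals are not shown here.

```python
import time


def time_launch(launch, num_elements):
    """Time one call to `launch`; return (wall_s, ms_per_element)."""
    start = time.perf_counter()
    launch()
    wall_s = time.perf_counter() - start
    return wall_s, 1000.0 * wall_s / num_elements


def fake_launch():
    # Stand-in for a simulated kernel launch (assumption, not the real harness).
    time.sleep(0.01)


wall_s, ms_per_elem = time_launch(fake_launch, num_elements=128)
print(f"wall={wall_s:.3f} s, {ms_per_elem:.3f} ms/element")
```

Because the timer wraps the whole launch, any fixed startup cost is spread across the elements, which is why ms/element looks worst at the smallest T.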

### Kunpeng-920 (aarch64)

#### SiLU — msprof (`Ascend950PR_9599`)

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **52 s** | 406 ms |
| tiny | 512 | **24 s** | 48 ms |
| small | 1024 | **26 s** | 25 ms |
| varlen_2x512 | 1024 | **26 s** | 26 ms |
| medium | 4096 | **29 s** | 7.1 ms |

#### SiLU — cannsim (`Ascend950`)

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **42 s** | 331 ms |
| tiny | 512 | **15 s** | 30 ms |
| small | 1024 | **17 s** | 17 ms |
| varlen_2x512 | 1024 | **16 s** | 16 ms |
| medium | 4096 | **17 s** | 4.1 ms |

#### SwiGLU — msprof

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **75 s** | 588 ms |
| tiny | 512 | **49 s** | 95 ms |
| small | 1024 | **61 s** | 59 ms |
| varlen_2x512 | 1024 | **47 s** | 46 ms |
| medium | 4096 | **52 s** | 13 ms |

#### SwiGLU — cannsim

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **52 s** | 403 ms |
| tiny | 512 | **27 s** | 52 ms |
| small | 1024 | **29 s** | 28 ms |
| varlen_2x512 | 1024 | **21 s** | 21 ms |
| medium | 4096 | **22 s** | 5.4 ms |

**Scaling law (Kunpeng, approximate):**

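- Smoke (T=128) takes **42–75 s**, longer than any larger shape on the ladder, so startup overhead dominates it; do not extrapolate from smoke alone.
- For T≥512, wall time grows only weakly with T (SiLU msprof: 24 s at T=512 vs 29 s at T=4096), so amortized cost falls as T grows: ~**0.004–0.052 s/element** on cannsim and ~**0.007–0.095 s/element** on msprof.
- **Varlen vs fixed length** at the same T: negligible for SiLU (T=1024: msprof 26 s vs 26 s, cannsim 17 s vs 16 s); SwiGLU shows a larger gap (msprof 61 s vs 47 s, cannsim 29 s vs 21 s).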
- Pure-vector kernels finish in **minutes** on the default ladder; contrast with mix `chunk_h_mini` v1 (scalar matmul, 35+ min timeouts).
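
One way to separate fixed startup cost from per-element cost is to fit `wall = intercept + slope * T` to the T≥512 rows. The sketch below does this with ordinary least squares for the Kunpeng SiLU msprof table; it is an illustration only, since three rounded data points cannot support a precise estimate.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Kunpeng SiLU msprof rows with T >= 512 (tiny, small, medium), from the table above.
T = [512, 1024, 4096]
WALL_S = [24, 26, 29]

slope, intercept = fit_line(T, WALL_S)
print(f"fixed cost ~{intercept:.1f} s, marginal cost ~{1000 * slope:.2f} ms/element")
```

On these rows the fit gives an intercept of about 24 s and a marginal cost on the order of 1 ms per additional element, which suggests that startup, not T, sets the wall time on this ladder.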

### AMD EPYC 9654 (x86_64)

#### SiLU — msprof (`Ascend950PR_9599`)

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **12 s** | 91 ms |
| tiny | 512 | **7 s** | 14 ms |
| small | 1024 | **7 s** | 7.1 ms |
| varlen_2x512 | 1024 | **7 s** | 6.9 ms |
| medium | 4096 | **12 s** | 3.0 ms |

#### SiLU — cannsim (`Ascend950`)

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **9 s** | 72 ms |
| tiny | 512 | **4 s** | 7.2 ms |
| small | 1024 | **4 s** | 3.9 ms |
| varlen_2x512 | 1024 | **4 s** | 3.5 ms |
| medium | 4096 | **5 s** | 1.3 ms |

#### SwiGLU — msprof

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **18 s** | 137 ms |
| tiny | 512 | **13 s** | 25 ms |
| small | 1024 | **17 s** | 16 ms |
| varlen_2x512 | 1024 | **15 s** | 14 ms |
| medium | 4096 | **20 s** | 4.9 ms |

#### SwiGLU — cannsim

| Label | T | Sim wall | ms/element |
|-------|---|----------|------------|
| smoke | 128 | **11 s** | 87 ms |
| tiny | 512 | **6 s** | 12 ms |
| small | 1024 | **6 s** | 6.3 ms |
| varlen_2x512 | 1024 | **6 s** | 5.8 ms |
| medium | 4096 | **7 s** | 1.7 ms |

**Scaling law (AMD EPYC, approximate):**

- Fixed overhead **~9–18 s** at T=128 dominates smoke; do not extrapolate from smoke alone.
- For T≥512, wall time grows only weakly with T (SiLU cannsim: 4 s at T=512 vs 5 s at T=4096), so amortized cost falls as T grows: ~**0.001–0.012 s/element** on cannsim and ~**0.003–0.025 s/element** on msprof.
- **Varlen vs fixed length** at the same T: negligible (T=1024: SiLU msprof 7 s vs 7 s).
- Pure-vector kernels finish in **under ~2 min** for the full ladder on this host.

### vs CPU thread count (OMP)

Measured on **Kunpeng-920** only (not re-run on x86).

Fixed workload **T=512** (SiLU), swept `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS` together:

| OMP threads | msprof mean (s) | speedup vs 1 | cannsim mean (s) | speedup vs 1 |
|-------------|-----------------|--------------|------------------|--------------|
| 1 | 39.5 | 1.00× | 31.6 | 1.00× |
| 2 | 44.0 | 0.90× | 35.3 | 0.90× |
| 4 | 41.7 | 0.95× | 34.4 | 0.92× |
| 8 | 41.4 | 0.95× | 35.1 | 0.90× |
| 16 | 44.6 | 0.89× | 31.4 | 1.01× |
| 32 | 42.2 | 0.93× | 32.1 | 0.99× |

**Conclusion:** raising the host thread env vars above 1 produced no meaningful speedup. Relative to one thread, wall time moved by at most **~13%** (msprof) and **~12%** (cannsim), almost always in the slower direction. Tuning `OMP_NUM_THREADS` is not an effective lever; PEM uses internal worker pools.
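
A sweep of this kind can be sketched as below: the three thread variables are set to the same value and one command is timed per setting. This is an assumption-laden illustration, not the contents of `run_thread_sweep.sh`; the placeholder command just starts a Python child process and would be replaced by the real harness invocation.

```python
import os
import subprocess
import sys
import time

THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def thread_env(num_threads, base=None):
    """Copy the environment and set all three thread variables together."""
    env = dict(os.environ if base is None else base)
    for var in THREAD_VARS:
        env[var] = str(num_threads)
    return env


def timed_run(cmd, num_threads):
    """Run `cmd` once under the given thread setting; return wall seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=thread_env(num_threads), check=True,
                   stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


# Placeholder command (assumption): substitute the real simulator harness here.
cmd = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]

results = {n: timed_run(cmd, n) for n in (1, 2, 4)}
baseline = results[1]
for n, wall in results.items():
    print(f"threads={n:2d} wall={wall:.2f} s speedup={baseline / wall:.2f}x")
```

Note that speedups recomputed from the rounded means in the table above can differ from the printed column by about 0.01.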

## Layout

```
examples/a5_sim/
├── kernels/silu_a5.cpp, swiglu_a5.cpp
├── vec_sim.py # driver (--kernel silu|swiglu)
├── common/build.py # dav-c310-vec build
├── run_msprof.sh / run_cannsim.sh / run_thread_sweep.sh
├── configs/scale_ladder.json
└── outputs/ # gitignored results
```

## References

- A5 PTO ST tests: `megagdn-pto/third_party/pto-isa/tests/npu/a5/src/st/testcase`
- A2 originals: `examples/jit_cpp/silu_dynamic`, `csrc/kernel/kernel_swiglu.cpp`
- Tool comparison: [`cannsim_vs_msprof.md`](cannsim_vs_msprof.md)
109 changes: 109 additions & 0 deletions examples/a5_sim/cannsim_vs_msprof.md
@@ -0,0 +1,109 @@
# SiLU / SwiGLU — msprof vs cannsim (Ascend950 / dav-c310-vec)

Pure-vector A5 examples for **`pto-kernels/examples/a5_sim`**. Recommended first step for Ascend950 simulator validation before mix kernels in [`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim).

## Executive summary

| Aspect | msprof op simulator | cannsim record |
|--------|---------------------|----------------|
| SoC flag | `Ascend950PR_9599` | `Ascend950` |
| AICore arch | `dav-c310-vec` | `dav-c310-vec` |
| Correctness (smoke) | **PASS** (SiLU T=128, SwiGLU T=128) | **PASS** (same shapes) |
| Invocation | Wraps `python3 vec_sim.py` directly | Executable `run_cannsim_entry.sh` + `-u "..."` |
| Typical smoke wall (Kunpeng) | SiLU ~52 s, SwiGLU ~75 s | SiLU ~42 s, SwiGLU ~52 s |
| Typical smoke wall (AMD EPYC) | SiLU ~12 s, SwiGLU ~18 s | SiLU ~9 s, SwiGLU ~11 s |
| Exit code | 0 on success | May return non-zero after **teardown segfault** even when JSON is valid |

## Tool overview

**msprof** preloads the CA model via `LD_PRELOAD` and runs Python + ctypes kernel launch (same pattern as [`ptoisa-a5-test/tests/torch_sim`](../../ptoisa-a5-test/tests/torch_sim/msprof_mechanism.md)).

**cannsim** runs a standalone entry script under full SoC simulation. User args pass via `-u "--kernel silu --mode ..."`, not trailing argv.
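
Because the driver arguments travel as a single string, it helps to build that string from a list so quoting stays correct. The sketch below uses the standard-library `shlex` module to do so; it only constructs the `-u` payload and makes no claim about the rest of the cannsim command line.

```python
import shlex

# Driver arguments as they would appear after `python3 vec_sim.py`.
user_args = [
    "--kernel", "silu",
    "--mode", "correctness",
    "--num-elements", "128",
    "--label", "smoke",
]

# Join into one shell-safe string suitable for passing as the -u value.
payload = shlex.join(user_args)
print(payload)

# Splitting the payload recovers the original argument list.
round_trip = shlex.split(payload)
```

`shlex.join` requires Python 3.8 or newer.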

## Correctness

| Kernel | Shape | msprof | cannsim | Reference |
|--------|-------|--------|---------|-----------|
| SiLU | T=128 | PASS | PASS | `x * sigmoid(x)` on CPU |
| SwiGLU | batch=1, input_n=256 (T=128 out) | PASS | PASS | split + SiLU gate × value on CPU |

Inputs are allocated on CPU then copied to NPU; reference checks run on CPU (simulator rejects many dynamic NPU ops).
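
The two reference formulas can be written out in plain Python as a sanity check. This is a sketch, not the harness code: the harness uses PyTorch on CPU, and the choice of the first half of each row as the gate and the second half as the value is an assumption made here for illustration.

```python
import math


def silu(x):
    """SiLU: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu(row):
    """Split a row in half, apply SiLU to the gate, multiply by the value.

    Assumption: gate = first half, value = second half.
    """
    half = len(row) // 2
    gate, value = row[:half], row[half:]
    return [silu(g) * v for g, v in zip(gate, value)]


def allclose(actual, expected, rtol=1e-3, atol=1e-3):
    """Elementwise tolerance check for two equal-length sequences."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected)
    )


row = [0.01 * i for i in range(256)]   # input_n = 256
out = swiglu(row)                      # 128 output elements
print(len(out), allclose(out, swiglu(row)))
```

With `input_n=256` the output has 128 elements, which matches the smoke shape in the table. The tolerances in `allclose` are placeholders, not the values the harness uses.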

## Speed comparison (scale ladder, timing-only sweep)

### Kunpeng-920 (aarch64, May 2026)

**SiLU msprof vs cannsim** (seconds, wall clock):

| label | T | msprof | cannsim | ratio msprof/cannsim |
|-------|---|--------|---------|----------------------|
| smoke | 128 | 52 | 42 | 1.2× |
| tiny | 512 | 24 | 15 | 1.6× |
| small | 1024 | 26 | 17 | 1.5× |
| medium | 4096 | 29 | 17 | 1.7× |

**SwiGLU msprof vs cannsim**:

| label | T | msprof | cannsim | ratio |
|-------|---|--------|---------|-------|
| smoke | 128 | 75 | 52 | 1.4× |
| tiny | 512 | 49 | 27 | 1.8× |
| small | 1024 | 61 | 29 | 2.1× |
| medium | 4096 | 52 | 22 | 2.4× |

### AMD EPYC 9654 (x86_64, May 2026)

**SiLU msprof vs cannsim** (seconds, wall clock):

| label | T | msprof | cannsim | ratio msprof/cannsim |
|-------|---|--------|---------|----------------------|
| smoke | 128 | 12 | 9 | 1.3× |
| tiny | 512 | 7 | 4 | 1.8× |
| small | 1024 | 7 | 4 | 1.8× |
| medium | 4096 | 12 | 5 | 2.4× |

**SwiGLU msprof vs cannsim**:

| label | T | msprof | cannsim | ratio |
|-------|---|--------|---------|-------|
| smoke | 128 | 18 | 11 | 1.6× |
| tiny | 512 | 13 | 6 | 2.2× |
| small | 1024 | 17 | 6 | 2.8× |
| medium | 4096 | 20 | 7 | 2.9× |

On both hosts, cannsim is **faster** on wall clock at every ladder shape for these pure-vector kernels (1.2–2.9×), and the gap widens once T≥512; msprof carries heavier profiling/injection overhead. Tool ratios are similar across hosts; absolute wall time is ~3–5× lower on the AMD EPYC host.

## Failure modes

| Issue | Mitigation |
|-------|------------|
| `torch.randn` on NPU under sim | Create tensors on CPU, `.to("npu:0")` |
| Reference ops on NPU fail | Compare `y.cpu()` vs CPU PyTorch ref |
| cannsim segfault on exit | JSON is still written; `run_cannsim.sh` accepts valid `--output-json` |
| A5 `pipe_barrier(PIPE_V)` compile error | Use `PIPE_ALL` in SwiGLU compute path |
| `Stride` ambiguous on A5 | Qualify as `pto::Stride<...>` |
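
The teardown-segfault mitigation amounts to trusting the result file rather than the exit code. The sketch below shows one way to express that check; `load_result` and `accept_run` are hypothetical helpers written for this example and are not taken from `run_cannsim.sh`.

```python
import json
import tempfile
from pathlib import Path


def load_result(json_path):
    """Return the parsed JSON object, or None if missing or malformed."""
    try:
        data = json.loads(Path(json_path).read_text())
    except (OSError, ValueError):
        return None
    return data if isinstance(data, dict) else None


def accept_run(returncode, json_path):
    """Return (ok, note): a valid JSON result outranks the exit code."""
    if load_result(json_path) is None:
        return False, "no valid JSON result"
    if returncode != 0:
        return True, f"exit code {returncode} ignored: JSON result is valid"
    return True, "clean exit"


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text(json.dumps({"status": "PASS"}))
    missing = Path(tmp) / "missing.json"

    segfault_ok = accept_run(-11, good)     # crashed on exit, JSON present
    clean_ok = accept_run(0, good)          # normal exit
    no_output = accept_run(-11, missing)    # crashed, nothing written

print(segfault_ok, clean_ok, no_output)
```

A check like this is only safe if the output file is removed before each run; otherwise a stale result from an earlier run could mask a real failure.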

## Invocation examples

```bash
cd pto-kernels/examples/a5_sim
source "$ASCEND_HOME_PATH/bin/setenv.bash"
export PTO_LIB_PATH=/path/to/pto-kernels/third_party/pto-isa

MSPROF_TIMEOUT=30 ./run_msprof.sh --kernel silu --mode sweep --skip-correctness \
--output-json outputs/silu_sweep_msprof.json

./run_cannsim.sh --kernel swiglu --mode correctness --batch 1 --input-n 256 \
--output-json outputs/smoke_swiglu_cannsim.json
```

## Recommendations

1. **Start with SiLU** (simplest 1D pipeline) under msprof smoke correctness.
2. Use **cannsim** for faster scale sweeps once smoke passes.
3. Use **mix chunk_h_mini** only after pure-vector path is green ([`megagdn-pto/benchmarks/a5_sim`](../../megagdn-pto/benchmarks/a5_sim)).

## References

- Harness README: [`README.md`](README.md)
- 910B chunk_h comparison: [`megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md`](../../megagdn-pto/benchmarks/simulator/cannsim_vs_msprof.md)
1 change: 1 addition & 0 deletions examples/a5_sim/common/__init__.py
@@ -0,0 +1 @@
"""Shared helpers for A5 pure-vector simulator examples."""