Add Ascend950 pure-vector simulator examples for SiLU and SwiGLU.#172

Open
learning-chip wants to merge 3 commits into main from sim_speed

Collaborator

@learning-chip learning-chip commented May 27, 2026

Compares msprof and cannsim simulator speed as a function of input data size.

Takeaway: for inputs up to ~4K seqlen, every simulator run finishes in about a minute or less (at most 75 s on Kunpeng-920 and at most 20 s on AMD EPYC 9654), which is still tolerable for interactive kernel development without a real device.

Results copied from README.md

Kunpeng-920

SiLU msprof vs cannsim (seconds, wall clock):

| label | T | msprof (s) | cannsim (s) | ratio (msprof/cannsim) |
|---|---|---|---|---|
| smoke | 128 | 52 | 42 | 1.2× |
| tiny | 512 | 24 | 15 | 1.6× |
| small | 1024 | 26 | 17 | 1.5× |
| medium | 4096 | 29 | 17 | 1.7× |

SwiGLU msprof vs cannsim:

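| label | T | msprof (s) | cannsim (s) | ratio (msprof/cannsim) |
|---|---|---|---|---|
| smoke | 128 | 75 | 52 | 1.4× |
| tiny | 512 | 49 | 27 | 1.8× |
| small | 1024 | 61 | 29 | 2.1× |
| medium | 4096 | 52 | 22 | 2.4× |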

AMD EPYC 9654

SiLU msprof vs cannsim (seconds, wall clock):

| label | T | msprof (s) | cannsim (s) | ratio (msprof/cannsim) |
|---|---|---|---|---|
| smoke | 128 | 12 | 9 | 1.3× |
| tiny | 512 | 7 | 4 | 1.8× |
| small | 1024 | 7 | 4 | 1.8× |
| medium | 4096 | 12 | 5 | 2.4× |

SwiGLU msprof vs cannsim:

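| label | T | msprof (s) | cannsim (s) | ratio (msprof/cannsim) |
|---|---|---|---|---|
| smoke | 128 | 18 | 11 | 1.6× |
| tiny | 512 | 13 | 6 | 2.2× |
| small | 1024 | 17 | 6 | 2.8× |
| medium | 4096 | 20 | 7 | 2.9× |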

On both hosts, cannsim is faster on wall clock at every size tested for these pure-vector kernels, and the gap widens once T≥512 (1.5–2.9× there, versus 1.2–1.6× at T=128); msprof carries heavier profiling/injection overhead.
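The ratio column can be rechecked directly from the wall-clock numbers. The following is a minimal sketch, not part of the PR: the `RESULTS` layout and the helper names `speedup` and `mismatches` are hypothetical, and the values are copied by hand from the tables above.

```python
# Each entry maps T -> (msprof_s, cannsim_s, published_ratio), copied from the tables.
# The dict layout and function names are illustrative, not taken from the a5_sim harness.
RESULTS = {
    ("Kunpeng-920", "SiLU"): {
        128: (52, 42, 1.2), 512: (24, 15, 1.6), 1024: (26, 17, 1.5), 4096: (29, 17, 1.7),
    },
    ("Kunpeng-920", "SwiGLU"): {
        128: (75, 52, 1.4), 512: (49, 27, 1.8), 1024: (61, 29, 2.1), 4096: (52, 22, 2.4),
    },
    ("AMD EPYC 9654", "SiLU"): {
        128: (12, 9, 1.3), 512: (7, 4, 1.8), 1024: (7, 4, 1.8), 4096: (12, 5, 2.4),
    },
    ("AMD EPYC 9654", "SwiGLU"): {
        128: (18, 11, 1.6), 512: (13, 6, 2.2), 1024: (17, 6, 2.8), 4096: (20, 7, 2.9),
    },
}


def speedup(msprof_s, cannsim_s):
    """Wall-clock ratio msprof/cannsim; values above 1 mean cannsim was faster."""
    return msprof_s / cannsim_s


def mismatches(results, tol=0.06):
    """List (host, kernel, T) entries whose published ratio differs from the
    recomputed one by more than a one-decimal rounding tolerance."""
    bad = []
    for (host, kernel), rows in results.items():
        for t, (msprof_s, cannsim_s, published) in rows.items():
            if abs(speedup(msprof_s, cannsim_s) - published) > tol:
                bad.append((host, kernel, t))
    return bad


# Smallest msprof/cannsim ratio among the T >= 512 rows.
min_ratio_large_t = min(
    speedup(m, c)
    for rows in RESULTS.values()
    for t, (m, c, _) in rows.items()
    if t >= 512
)

print(mismatches(RESULTS))
print(min_ratio_large_t)
```

Every published ratio agrees with the recomputed value to one decimal place, so `mismatches` returns an empty list.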

Introduce a self-contained a5_sim harness with dav-c310-vec kernels, msprof
and cannsim runners, scale-ladder timing docs, and OMP thread sweep scripts.

Co-authored-by: Cursor <cursoragent@cursor.com>
if args.num_elements is None:
    args.num_elements = 128
shape = {"num_elements": args.num_elements}
t = args.num_elements
if args.input_n is None:
    args.input_n = 256
shape = {"batch": args.batch, "input_n": args.input_n}
t = args.batch * (args.input_n // 2)
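The diff fragment above appears to derive the ladder's T value differently per kernel: the element count for SiLU, and `batch * (input_n // 2)` for SwiGLU. A minimal sketch of that mapping follows, assuming the two groups of lines belong to separate per-kernel branches whose conditions are not shown in the fragment. The function names `silu_t` and `swiglu_t` are hypothetical, and the reading that `input_n // 2` reflects SwiGLU splitting its input into two halves is an assumption.

```python
def silu_t(num_elements=None):
    """T for the SiLU kernel: the element count, defaulting to 128 as in the diff."""
    if num_elements is None:
        num_elements = 128
    return num_elements


def swiglu_t(batch, input_n=None):
    """T for the SwiGLU kernel: batch times half of input_n, with input_n
    defaulting to 256 as in the diff. The batch default is not visible in
    the fragment, so it is a required argument here."""
    if input_n is None:
        input_n = 256
    return batch * (input_n // 2)


print(silu_t())           # default SiLU shape
print(swiglu_t(batch=1))  # default input_n with a single batch
```

With the defaults shown in the diff, a single-batch SwiGLU run has `input_n // 2 = 128`, which matches the SiLU default of 128 elements.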
jiaweizhuang added 2 commits May 27, 2026 18:24
Pin pto-isa at v9.0.0 to match CANN 9.0.0 and teach the a5_sim build
helper to resolve headers from the in-repo third_party path first.

Reproduce SiLU/SwiGLU msprof and cannsim sweeps on x86_64 (AMD EPYC
9654). Document both hosts side by side: Kunpeng-920 baseline tables
unchanged, new EPYC numbers (~3–5× faster wall time). Correctness PASS
on smoke shapes for both tools.
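The cross-host figure can be rechecked against the tables in the PR description. This is a minimal sketch, not code from the PR: the `KUNPENG` and `EPYC` dict names are hypothetical, and each list holds the wall-clock seconds for T = 128, 512, 1024, 4096, copied by hand from the tables.

```python
# Wall-clock seconds per (kernel, tool), ordered by T = 128, 512, 1024, 4096.
# Names and layout are illustrative; the numbers come from the PR's tables.
KUNPENG = {
    ("SiLU", "msprof"): [52, 24, 26, 29],
    ("SiLU", "cannsim"): [42, 15, 17, 17],
    ("SwiGLU", "msprof"): [75, 49, 61, 52],
    ("SwiGLU", "cannsim"): [52, 27, 29, 22],
}
EPYC = {
    ("SiLU", "msprof"): [12, 7, 7, 12],
    ("SiLU", "cannsim"): [9, 4, 4, 5],
    ("SwiGLU", "msprof"): [18, 13, 17, 20],
    ("SwiGLU", "cannsim"): [11, 6, 6, 7],
}

# Kunpeng time divided by EPYC time for each matching run.
factors = [
    kunpeng_s / epyc_s
    for key in KUNPENG
    for kunpeng_s, epyc_s in zip(KUNPENG[key], EPYC[key])
]

in_range = sum(1 for f in factors if 3.0 <= f <= 5.0)

print(f"min {min(factors):.2f}x, max {max(factors):.2f}x")
print(f"{in_range} of {len(factors)} runs fall within 3-5x")
```

On these numbers the per-run factor spans roughly 2.4× to 4.8×, with 14 of the 16 runs between 3× and 5×; the two outliers are the msprof runs at T=4096.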