-
Notifications
You must be signed in to change notification settings - Fork 11
msprof simulator on custom A5 kernel + torch_npu wrapper launch #169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
learning-chip
wants to merge
1
commit into
166-kernel-abs-run-using-cannsim
Choose a base branch
from
a5_msprof_sim
base: 166-kernel-abs-run-using-cannsim
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| *.so | ||
| OPPROF_*/ | ||
| kernel_meta/ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,100 @@ | ||
| # a5_abs — JIT `vabs_fp16` with torch_npu + msprof CA simulator | ||
|
|
||
| Refactors the [a5 `main_abs`](../../a5/main_abs.cpp) flow: JIT-compile [`kernel_abs.cpp`](../../../csrc/kernel/kernel_abs.cpp) with `bisheng`, launch `call_vabs_fp16` via **ctypes** on **torch_npu** tensors, and profile under the Ascend950 CA model with **msprof op simulator**. | ||
|
|
||
| Numeric correctness is **not** validated in `run_abs.py` when using the CA simulator — `msprof op simulator` models hardware pipeline behavior only. A successful run compiles the kernel, launches it on NPU tensors, and produces profiler output under `OPPROF_*`. | ||
|
|
||
| ## Layout | ||
|
|
||
| | File | Role | | ||
| |------|------| | ||
| | `jit_util_a5_abs.py` | JIT compile `kernel_abs.cpp` → `libkernel_abs_jit.so`, ctypes wrapper | | ||
| | `run_abs.py` | Build input, launch kernel, synchronize | | ||
| | `run_msprof.sh` | Wrapper: env + `msprof op simulator` | | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Docker image `agent_npu_cann_950:9.0.0` (or equivalent CANN 9.0 + torch_npu 2.9) | ||
| - Host checkout of `pto-kernels` mounted into the container | ||
|
|
||
| ## Reproduce (CA simulator) | ||
|
|
||
| From the host, start the container (mount the repo parent so `pto-kernels` is visible): | ||
|
|
||
| ```bash | ||
| cd /path/to/parent-of-pto-kernels | ||
| docker run -it --rm \ | ||
| --privileged \ | ||
| --network=host \ | ||
| --ipc=host \ | ||
| -v "$(pwd)":/workspace \ | ||
| -w /workspace \ | ||
| --name torch_npu \ | ||
| agent_npu_cann_950:9.0.0 \ | ||
| /bin/bash | ||
| ``` | ||
|
|
||
| Inside the container: | ||
|
|
||
| ```bash | ||
| source /usr/local/Ascend/ascend-toolkit/set_env.sh | ||
| cd /workspace/pto-kernels/examples/jit_cpp/a5_abs | ||
|
|
||
| ./run_msprof.sh | ||
| ``` | ||
|
|
||
| Or manually (same as [`ca_model.md`](../../../../npu_kernels/950_setup/ca_model.md)): | ||
|
|
||
| ```bash | ||
| source /usr/local/Ascend/ascend-toolkit/set_env.sh | ||
| export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/tools/simulator/Ascend950PR_9599/lib:$LD_LIBRARY_PATH | ||
| ulimit -n 65535 | ||
|
|
||
| # msprof splits on spaces; use a script file, not python -c. | ||
| msprof op simulator --soc-version=Ascend950PR_9599 \ | ||
| python ./run_abs.py | ||
| ``` | ||
|
|
||
| ### Expected success signals | ||
|
|
||
| 1. Console: `generated .../libkernel_abs_jit.so`, then `vabs_fp16 kernel launch completed.` | ||
| 2. msprof log: `Profiling on kernel: vabs_fp16_mix_aic` (or similar) and core duration table | ||
| 3. New directory `OPPROF_<timestamp>_*` in this folder with parsed simulator results | ||
|
|
||
| ### Compile flags (Ascend950 / A5) | ||
|
|
||
| JIT build uses Ascend950-oriented flags (aligned with tilelang-ascend PTO `A5` path): | ||
|
|
||
| - `--cce-aicore-arch=dav-c310` | ||
| - `-DREGISTER_BASE` | ||
| - `-std=gnu++17` (required for A5 PTO headers; `-std=c++20` fails on dav-c310) | ||
| - AICore stack LLVM options (`-cce-aicore-stack-size`, etc.) | ||
|
|
||
| Do not use `--npu-arch=dav-2201` here; that targets an older arch profile. | ||
|
|
||
| ## Run without msprof (real device) | ||
|
|
||
| On hardware with a working NPU runtime, you can smoke-test launch only: | ||
|
|
||
| ```bash | ||
| source /usr/local/Ascend/ascend-toolkit/set_env.sh | ||
| cd /workspace/pto-kernels/examples/jit_cpp/a5_abs | ||
| python ./run_abs.py | ||
| ``` | ||
|
|
||
| For numeric checks on device, use the packaged op (`tests/test_abs.py` / `pto_abs`) or the legacy ACL sample: | ||
|
|
||
| ```bash | ||
| cd /workspace/pto-kernels | ||
| make run_abs_a5 # cannsim record + examples/a5/main_abs.cpp | ||
| ``` | ||
|
|
||
| ## Relation to legacy a5 sample | ||
|
|
||
| | Legacy (`make run_abs_a5`) | This example | | ||
| |----------------------------|--------------| | ||
| | `g++` + `libkernel_abs.so` + ACL host buffers | `bisheng` JIT + torch tensors + ctypes | | ||
| | `cannsim record --soc=Ascend950` | `msprof op simulator --soc-version=Ascend950PR_9599` | | ||
| | `examples/a5/main_abs.cpp` | `run_abs.py` + `jit_util_a5_abs.py` | | ||
|
|
||
| Kernel source is shared: `csrc/kernel/kernel_abs.cpp` (`call_vabs_fp16`, shape `8×128`, `blockDim=8`). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| import ctypes | ||
| import os | ||
| import subprocess | ||
|
|
||
| import torch | ||
|
|
||
| REPO_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..", "..")) | ||
| KERNEL_SRC = os.path.join(REPO_ROOT, "csrc", "kernel", "kernel_abs.cpp") | ||
| KERNEL_INC = os.path.join(REPO_ROOT, "csrc", "kernel") | ||
|
|
||
| ASCEND_TOOLKIT_HOME = os.environ.get("ASCEND_TOOLKIT_HOME", "") | ||
| PTO_LIB_PATH = os.environ.get("PTO_LIB_PATH", ASCEND_TOOLKIT_HOME) | ||
|
|
||
| DEFAULT_BLOCK_DIM = 8 | ||
|
|
||
|
|
||
| def _lib_path() -> str: | ||
| return os.path.join(os.path.dirname(__file__), "libkernel_abs_jit.so") | ||
|
|
||
|
|
||
| def compile_cpp(verbose: bool = False, timeout: int = 120) -> str: | ||
| lib_path = _lib_path() | ||
|
|
||
| flags = [ | ||
| "bisheng", | ||
| "--cce-aicore-arch=dav-c310", | ||
| "-DREGISTER_BASE", | ||
| "-O2", | ||
| "-std=gnu++17", | ||
| "-xcce", | ||
| "-fPIC", | ||
| "--shared", | ||
| "-mllvm", | ||
| "-cce-aicore-stack-size=0x8000", | ||
| "-mllvm", | ||
| "-cce-aicore-function-stack-size=0x8000", | ||
| "-Wno-ignored-attributes", | ||
| f"-I{KERNEL_INC}", | ||
| f"-I{PTO_LIB_PATH}/include", | ||
| f"-I{ASCEND_TOOLKIT_HOME}/include", | ||
| f"-I{ASCEND_TOOLKIT_HOME}/pkg_inc", | ||
| f"-I{ASCEND_TOOLKIT_HOME}/pkg_inc/runtime", | ||
| f"-I{ASCEND_TOOLKIT_HOME}/pkg_inc/profiling", | ||
| ] | ||
|
|
||
| command = [*flags, KERNEL_SRC, "-o", lib_path] | ||
| if verbose: | ||
| print("compile command:", " ".join(command)) | ||
|
|
||
| try: | ||
| result = subprocess.run( | ||
| command, | ||
| timeout=timeout, | ||
| check=True, | ||
| capture_output=True, | ||
| text=True, | ||
| ) | ||
| except subprocess.CalledProcessError as e: | ||
| output = e.stdout or "" | ||
| if e.stderr: | ||
| output += e.stderr | ||
| raise RuntimeError( | ||
| f"Compile failed with exit code {e.returncode}:\n{output}" | ||
| ) from e | ||
|
|
||
| if verbose and result.stdout: | ||
| print(result.stdout) | ||
| if verbose: | ||
| print(f"generated {lib_path}") | ||
| return lib_path | ||
|
|
||
|
|
||
| def torch_to_ctypes(tensor): | ||
| return ctypes.c_void_p(tensor.data_ptr()) | ||
|
|
||
|
|
||
| def load_lib(lib_path: str): | ||
| lib_path = os.path.abspath(lib_path) | ||
| lib = ctypes.CDLL(lib_path) | ||
|
|
||
| lib.call_vabs_fp16.argtypes = [ | ||
| ctypes.c_uint32, # blockDim | ||
| ctypes.c_void_p, # stream | ||
| ctypes.c_void_p, # x | ||
| ctypes.c_void_p, # z | ||
| ctypes.c_uint32, # num_elements | ||
| ] | ||
| lib.call_vabs_fp16.restype = None | ||
|
|
||
| def vabs_fp16(x, z, num_elements, block_dim=DEFAULT_BLOCK_DIM, stream_ptr=None): | ||
| if stream_ptr is None: | ||
| stream_ptr = torch.npu.current_stream()._as_parameter_ # noqa: SLF001 | ||
| lib.call_vabs_fp16( | ||
| block_dim, | ||
| stream_ptr, | ||
| torch_to_ctypes(x), | ||
| torch_to_ctypes(z), | ||
| num_elements, | ||
| ) | ||
|
|
||
| return vabs_fp16 | ||
|
|
||
|
|
||
| def jit_compile(verbose: bool = False, force_recompile: bool = False): | ||
| lib_path = _lib_path() | ||
| if force_recompile or not os.path.isfile(lib_path): | ||
| compile_cpp(verbose=verbose) | ||
| return load_lib(lib_path) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| """Run vabs_fp16 via JIT-compiled kernel + torch_npu tensors (a5 example).""" | ||
|
|
||
| import numpy as np | ||
| import torch | ||
| import torch_npu # noqa: F401 | ||
|
|
||
| from jit_util_a5_abs import DEFAULT_BLOCK_DIM, jit_compile | ||
|
|
||
| # Matches examples/a5/main_abs.cpp | ||
| VABS_SHAPE = (8, 128) | ||
| VABS_NUM_ELEMENTS = VABS_SHAPE[0] * VABS_SHAPE[1] | ||
| DEVICE = "npu:0" | ||
|
|
||
|
|
||
| def make_input_x(): | ||
| """Same data as scripts/data_gen_abs.py (seed=42).""" | ||
| rng = np.random.default_rng(seed=42) | ||
| return rng.uniform(-100, 100, VABS_SHAPE).astype(np.float16) | ||
|
|
||
|
|
||
| def main(): | ||
| torch.npu.config.allow_internal_format = False | ||
| torch_npu.npu.set_compile_mode(jit_compile=False) | ||
| torch.npu.set_device(DEVICE) | ||
|
|
||
| x_np = make_input_x() | ||
| # Allocate on CPU then copy to NPU (same pattern as tests/test_abs.py). | ||
| x = torch.from_numpy(x_np) | ||
| x = x.npu() | ||
| z = torch.empty_like(x) | ||
|
|
||
| print(f"[vabs] shape={VABS_SHAPE}, blockDim={DEFAULT_BLOCK_DIM}") | ||
| print("Compiling kernel_abs.cpp ...") | ||
| vabs = jit_compile(verbose=True) | ||
|
|
||
| vabs(x, z, VABS_NUM_ELEMENTS, block_dim=DEFAULT_BLOCK_DIM) | ||
| torch.npu.synchronize() | ||
|
|
||
| # msprof CA simulator models pipeline timing only, not numeric results. | ||
| print("Input X (first 16):", x.flatten()[:16].cpu()) | ||
| print("Output Z (first 16):", z.flatten()[:16].cpu()) | ||
| print("vabs_fp16 kernel launch completed.") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| #!/usr/bin/env bash | ||
| # Run a5 abs JIT example under msprof CA simulator (Ascend950PR_9599). | ||
| set -euo pipefail | ||
|
|
||
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
| cd "$SCRIPT_DIR" | ||
|
|
||
| if [[ -f /usr/local/Ascend/ascend-toolkit/set_env.sh ]]; then | ||
| # shellcheck source=/dev/null | ||
| source /usr/local/Ascend/ascend-toolkit/set_env.sh | ||
| fi | ||
|
|
||
| export LD_LIBRARY_PATH="/usr/local/Ascend/ascend-toolkit/latest/tools/simulator/Ascend950PR_9599/lib:${LD_LIBRARY_PATH:-}" | ||
| ulimit -n 65535 | ||
|
|
||
| # msprof splits on spaces; use a script file, not python -c. | ||
| exec msprof op simulator --soc-version=Ascend950PR_9599 \ | ||
| --output="msprof_res" --kernel-name="vabs_fp16_mix_aic" --launch-count=10 \ | ||
| python ./run_abs.py | ||
|
Comment on lines
+17
to
+19
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Run once without |
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, accuracy test also passes in msprof simulator mode. Unlike
cannsim(biprof) that only simulates cycles.