(cmake) use bisheng by zouzias · Pull Request #83 · huawei-csl/pto-kernels

zouzias · 2026-04-01T21:22:22Z

This MR removes all AscendC dependencies and makes the pto-kernels wheel manylinux-friendly.

Changelog

Rename torch_tri_inv to torch_tri_inv_col_sweep

Status

pto-kernels/venv/lib/python3.10/site-packages/pto_kernels$ tree 
.
├── __init__.py
├── __pycache__
│   ├── __init__.cpython-310.pyc
│   ├── benchmarking.cpython-310.pyc
│   └── profiling.cpython-310.pyc
├── benchmarking.py
├── lib
│   └── libno_workspace_kernel.so
├── profiling.py
└── pto_kernels_ops.cpython-310-x86_64-linux-gnu.so

2 directories, 8 files

and

pto-kernels/venv/lib/python3.10/site-packages/pto_kernels$ nm -a lib/libno_workspace_kernel.so 
000000000002e8d0 d _DYNAMIC
000000000002fb80 d _GLOBAL_OFFSET_TABLE_
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
000000000002fad8 d __TMC_END__
000000000002d110 t __cce_rtKernelLaunchWithFlagV2
000000000002cc40 t __cce_rtLaunch
                 U __cxa_atexit
                 w __cxa_finalize
000000000002fac8 d __dso_handle
                 w __gmon_start__
000000000002d7a0 t _fini
000000000002d784 t _init
000000000002d770 t atexit
000000000002d510 T batch_matrix_square_fp16
000000000002d560 T batch_matrix_square_fp32
                 U fprintf
                 U free
                 U malloc
                 U memcpy
                 U memset
                 U rtDevBinaryRegister
                 U rtDevBinaryUnRegister
                 U rtFunctionRegister
                 U rtKernelLaunch
                 U rtKernelLaunchWithFlagV2
                 U rtLaunch
000000000002d3f0 T simple_matmul_fp16
000000000002d450 T simple_matmul_fp32
                 U stderr
000000000002d610 T tri_inv_rec_unroll_fp16
000000000002d6c0 T tri_inv_trick_fp16
000000000002d1d0 T triv_inv_col_sweep_fp16
000000000002d230 T triv_inv_col_sweep_fp32
000000000002d2f0 T vabs_fp16
000000000002d340 T vabs_fp32
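
The listing above can also be checked mechanically. A minimal sketch, assuming the plain nm text layout shown here (address, type letter, name, with undefined symbols lacking an address); parse_nm and the sample text are illustrative and not part of pto-kernels:

```python
def parse_nm(text):
    """Split `nm` output into exported (T) and undefined (U) symbol names."""
    exported, undefined = [], []
    for line in text.splitlines():
        parts = line.split()
        # Defined symbols have 3 fields (address, type, name); undefined have 2.
        if len(parts) == 3 and parts[1] == "T":
            exported.append(parts[2])
        elif len(parts) == 2 and parts[0] == "U":
            undefined.append(parts[1])
    return exported, undefined


# Sample lines copied from the listing above.
sample = """\
000000000002d110 t __cce_rtKernelLaunchWithFlagV2
                 U rtKernelLaunch
000000000002d2f0 T vabs_fp16
000000000002d340 T vabs_fp32
"""

exported, undefined = parse_nm(sample)
print(exported)   # ['vabs_fp16', 'vabs_fp32']
print(undefined)  # ['rtKernelLaunch']
```

Run over the full listing, this would report the kernel entry points (for example vabs_fp16 and simple_matmul_fp16) as exported, and the rt* runtime calls plus a few libc and C++ ABI functions as the only undefined symbols.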

TODO

TL;DR: the vector kernels deadlock on the new setup.

Tests (x marks a passing test)

pytest tests/test_batch_matrix_square.py
pytest tests/test_csr_gather.py
pytest tests/test_simple_matmul.py
pytest tests/test_tri_swiglu.py
pytest tests/test_tri_inv_col_sweep.py
pytest tests/test_tri_inv_rec_unroll.py
pytest tests/test_tri_inv_trick.py
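
Because a deadlocked kernel would stall the whole run, it can help to launch each test file in its own process with a time limit. A minimal sketch using only the standard library; run_with_timeout and the 300-second limit are assumptions for illustration, and it presumes the hung process can actually be killed:

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return 'passed', 'failed', or 'timeout'."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "timeout"
    return "passed" if result.returncode == 0 else "failed"


def run_suite(paths, timeout_s=300):
    """Run each test file separately so one hang does not block the rest."""
    return {
        path: run_with_timeout([sys.executable, "-m", "pytest", path], timeout_s)
        for path in paths
    }
```

For example, run_suite(["tests/test_simple_matmul.py", "tests/test_tri_inv_trick.py"]) would return one status per file, with "timeout" pointing at a likely deadlock.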

Test vabs in isolation

import ctypes
import os

import torch

# Load the kernel library by absolute path.
lib_path = os.path.abspath("libno_workspace_kernel.so")
lib = ctypes.CDLL(lib_path)

lib.call_vabs_fp16.restype = None
lib.call_vabs_fp16.argtypes = [
    ctypes.c_uint32,  # blockDim
    ctypes.c_void_p,  # stream
    ctypes.c_void_p,  # y
    ctypes.c_void_p,  # x
    ctypes.c_int,  # N
]
stream_ptr = torch.npu.current_stream()._as_parameter_  # noqa


def torch_to_ctypes(tensor):
    # Wrap the tensor's device address as a void pointer.
    return ctypes.c_void_p(tensor.data_ptr())


block_num = 4
length = [block_num, 64]

x = torch.randn(length, device="cpu", dtype=torch.float16).npu()
z = torch.empty_like(x)

lib.call_vabs_fp16(block_num, stream_ptr, torch_to_ctypes(x), torch_to_ctypes(z), x.numel())

zouzias · 2026-04-02T06:19:26Z

This also segfaults

import ctypes
import os

import torch

# Load the kernel library by absolute path.
lib_path = os.path.abspath("libno_workspace_kernel.so")
lib = ctypes.CDLL(lib_path)

lib.vabs_fp16.restype = None
lib.vabs_fp16.argtypes = [
    ctypes.c_uint32,  # blockDim
    ctypes.c_void_p,  # stream
    ctypes.c_void_p,  # y
    ctypes.c_void_p,  # x
    ctypes.c_int,  # N
]
stream_ptr = torch.npu.current_stream()._as_parameter_  # noqa


def torch_to_ctypes(tensor):
    # Wrap the tensor's device address as a void pointer.
    return ctypes.c_void_p(tensor.data_ptr())


block_num = 4
length = [block_num, 64]

x = torch.rand(length, device="cpu", dtype=torch.float16).npu()
z = torch.empty_like(x)

lib.vabs_fp16(block_num, stream_ptr, torch_to_ctypes(x), torch_to_ctypes(z), x.numel())
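
The snippets in this thread differ mainly in the entry point they call (call_vabs_fp16 vs. vabs_fp16). A small helper can make that switch explicit and fail early when a symbol is not exported. A minimal sketch; bind_launcher is a hypothetical helper, and the argument layout is copied from the snippets in this thread rather than confirmed against the kernel source:

```python
import ctypes

# Argument layout as annotated in the thread's snippets (an assumption, not
# verified against the kernel source): blockDim, stream, y, x, N.
LAUNCHER_ARGTYPES = [
    ctypes.c_uint32,
    ctypes.c_void_p,
    ctypes.c_void_p,
    ctypes.c_void_p,
    ctypes.c_int,
]


def bind_launcher(lib, name):
    """Look up `name` in `lib` and attach the shared launcher signature."""
    try:
        fn = getattr(lib, name)  # ctypes raises AttributeError if missing
    except AttributeError as exc:
        raise RuntimeError(f"{name} is not exported by the library") from exc
    fn.restype = None
    fn.argtypes = LAUNCHER_ARGTYPES
    return fn
```

With a loaded library, bind_launcher(lib, "vabs_fp16") returns the configured function, so swapping entry points becomes a one-word change.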

zouzias · 2026-04-02T09:08:52Z

This works!

import ctypes
import os

import torch

# Load the kernel library by absolute path.
lib_path = os.path.abspath("libno_workspace_kernel.so")
lib = ctypes.CDLL(lib_path)

lib.call_vabs_fp16.restype = None
lib.call_vabs_fp16.argtypes = [
    ctypes.c_uint32,  # blockDim
    ctypes.c_void_p,  # stream
    ctypes.c_void_p,  # y
    ctypes.c_void_p,  # x
    ctypes.c_int,  # N
]
stream_ptr = torch.npu.current_stream()._as_parameter_  # noqa


def torch_to_ctypes(tensor):
    # Wrap the tensor's device address as a void pointer.
    return ctypes.c_void_p(tensor.data_ptr())


block_num = 4
length = [block_num, 64]

x = torch.randn(length, device="cpu", dtype=torch.float16).npu()
z = torch.empty_like(x)

lib.call_vabs_fp16(block_num, stream_ptr, torch_to_ctypes(x), torch_to_ctypes(z), x.numel())

zouzias · 2026-04-03T11:31:42Z

Tests pass now, i.e., make clean build_wheel install test succeeds.

@learning-chip let me know if you can reproduce. If so, we can quickly merge and release the manylinux wheels to PyPI.

vloncar · 2026-05-12T12:32:51Z

An alternative to this approach would be to let PyTorch drive the compilation, similar to how #156 does it for CPU. You can set CXX=bisheng, add the extra compile and link flags, and it should work. This would allow a simpler CMakeLists.txt, but one could go a step further and drop it altogether. Instead, we could have a simpler setuptools-based flow that auto-discovers new content in csrc/ without needing to list it in the build scripts. Internally it would use ninja to build the kernels, which is much faster than the current setup.
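
The auto-discovery part of this suggestion could look roughly as follows. A minimal sketch, assuming kernel sources live under csrc/ with a .cpp suffix; discover_sources is a hypothetical helper, and the actual build call (compiler, flags, extension class) is left out because the thread does not specify it:

```python
from pathlib import Path


def discover_sources(root="csrc", suffix=".cpp"):
    """Return all source files under `root`, sorted for reproducible builds."""
    return sorted(str(p) for p in Path(root).rglob(f"*{suffix}"))


if __name__ == "__main__":
    # New files dropped into csrc/ are picked up without editing build scripts.
    for source in discover_sources():
        print(source)
```

The resulting list would then be handed to whichever extension builder the setuptools flow uses, with CXX=bisheng set in the environment as the comment proposes.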

(cmake) use bisheng

d413a66

zouzias linked an issue Apr 1, 2026 that may be closed by this pull request

Simplify cmake structure #3

Open

zouzias added 4 commits April 1, 2026 21:34

fix

702c91f

fix

b45cee6

fix abs

4780099

fix

1217f9c

zouzias requested a review from learning-chip April 2, 2026 05:44

learning-chip reviewed Apr 2, 2026

Comment thread CMakeLists.txt

zouzias added 8 commits April 2, 2026 09:16

fix

3945182

fix

ef0711f

abs pass tests

6e9c716

fix

19cfc07

fix

3853f46

fix

c234e4e

fix sync

57e13b2

fix

068b9bc

zouzias marked this pull request as ready for review April 2, 2026 16:50

zouzias requested a review from learning-chip April 2, 2026 16:50

anastasios added 2 commits April 3, 2026 10:13

Merge branch 'main' into 3-simplify-cmake-structure

594a133

(abs) stream set flag to true

80d6eeb

anastasios added 2 commits April 4, 2026 20:07

Merge branch 'main' into 3-simplify-cmake-structure

0c3be5b

fix

bc5720b

zouzias marked this pull request as draft April 9, 2026 04:57

zouzias mentioned this pull request Apr 16, 2026

Adding kernel for Newton-Schulz inverse #107

Merged

Anastasios Zouzias added 3 commits April 16, 2026 11:33

Merge branch 'main' into 3-simplify-cmake-structure

5008120

WIP

dfc37e7

fix

7cc6789

anastasios and others added 7 commits April 29, 2026 10:00

fix

5c1498e

WIP

ae8c488

fix

8f7e724

fix

2e0622a

(makefile) introduce 'make compile_<kernel_name>' (#145)

621d636

* Fix formatting of Makefile for shared library build
Co-authored-by: anastasios <anastasios.zouzias@huawei.com>

fix

31b67b0

fix

ff155ba

zouzias wants to merge 27 commits into main from 3-simplify-cmake-structure

3 participants
