
(cmake) use bisheng#83

Draft
zouzias wants to merge 27 commits into main from 3-simplify-cmake-structure


zouzias commented Apr 1, 2026

This MR removes all AscendC dependencies and makes the pto-kernels wheel manylinux-friendly.

Changelog

  • Rename torch_tri_inv to torch_tri_inv_col_sweep

Status

pto-kernels/venv/lib/python3.10/site-packages/pto_kernels$ tree 
.
├── __init__.py
├── __pycache__
│   ├── __init__.cpython-310.pyc
│   ├── benchmarking.cpython-310.pyc
│   └── profiling.cpython-310.pyc
├── benchmarking.py
├── lib
│   └── libno_workspace_kernel.so
├── profiling.py
└── pto_kernels_ops.cpython-310-x86_64-linux-gnu.so

2 directories, 8 files

and

pto-kernels/venv/lib/python3.10/site-packages/pto_kernels$ nm -a lib/libno_workspace_kernel.so 
000000000002e8d0 d _DYNAMIC
000000000002fb80 d _GLOBAL_OFFSET_TABLE_
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
000000000002fad8 d __TMC_END__
000000000002d110 t __cce_rtKernelLaunchWithFlagV2
000000000002cc40 t __cce_rtLaunch
                 U __cxa_atexit
                 w __cxa_finalize
000000000002fac8 d __dso_handle
                 w __gmon_start__
000000000002d7a0 t _fini
000000000002d784 t _init
000000000002d770 t atexit
000000000002d510 T batch_matrix_square_fp16
000000000002d560 T batch_matrix_square_fp32
                 U fprintf
                 U free
                 U malloc
                 U memcpy
                 U memset
                 U rtDevBinaryRegister
                 U rtDevBinaryUnRegister
                 U rtFunctionRegister
                 U rtKernelLaunch
                 U rtKernelLaunchWithFlagV2
                 U rtLaunch
000000000002d3f0 T simple_matmul_fp16
000000000002d450 T simple_matmul_fp32
                 U stderr
000000000002d610 T tri_inv_rec_unroll_fp16
000000000002d6c0 T tri_inv_trick_fp16
000000000002d1d0 T triv_inv_col_sweep_fp16
000000000002d230 T triv_inv_col_sweep_fp32
000000000002d2f0 T vabs_fp16
000000000002d340 T vabs_fp32
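
To check mechanically which kernels a build exports, the nm listing above can be filtered for global text symbols (type T). The following is a minimal sketch, not part of this PR; the function name exported_functions is hypothetical, and the sample text is a few lines copied from the listing.

```python
def exported_functions(nm_output):
    """Return names of global text symbols (nm type 'T') from `nm` output."""
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols have three fields (address, type, name);
        # undefined or weak ones have only two (type, name).
        if len(parts) == 3 and parts[1] == "T":
            names.append(parts[2])
    return names


# A few lines copied from the listing above.
sample = """\
000000000002cc40 t __cce_rtLaunch
                 U rtLaunch
000000000002d2f0 T vabs_fp16
000000000002d340 T vabs_fp32
"""
print(exported_functions(sample))
```

Feeding it the full nm output gives the list of entry points that ctypes can look up by name.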

TODO

TL;DR the vector kernels deadlock on the new setup.

Tests (x marks a passing test)

  • pytest tests/test_batch_matrix_square.py
  • pytest tests/test_csr_gather.py
  • pytest tests/test_simple_matmul.py
  • pytest tests/test_tri_swiglu.py
  • pytest tests/test_tri_inv_col_sweep.py
  • pytest tests/test_tri_inv_rec_unroll.py
  • pytest tests/test_tri_inv_trick.py
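
Since some kernels deadlock, it may help to run each test file in its own process with a time limit, so that a hang is reported instead of blocking the whole run. This is a minimal sketch under that assumption; run_with_timeout is a hypothetical helper, not something in the repository.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return 'passed', 'failed', or 'timeout'."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "timeout"
    return "passed" if result.returncode == 0 else "failed"
```

For example, run_with_timeout([sys.executable, "-m", "pytest", "tests/test_simple_matmul.py"], 120) would classify one of the files listed above; the 120 second limit is an arbitrary choice. A process stuck inside a device driver may not respond to being killed, so this is a convenience rather than a guarantee.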

Test vabs in isolation

import os
import ctypes
import torch

# Load the kernel library from the current directory.
lib_path = os.path.abspath("libno_workspace_kernel.so")
lib = ctypes.CDLL(lib_path)

lib.call_vabs_fp16.restype = None
lib.call_vabs_fp16.argtypes = [
    ctypes.c_uint32,  # blockDim
    ctypes.c_void_p,  # stream
    ctypes.c_void_p,  # y
    ctypes.c_void_p,  # x
    ctypes.c_int,  # N
]
stream_ptr = torch.npu.current_stream()._as_parameter_  # noqa


def torch_to_ctypes(tensor):
    return ctypes.c_void_p(tensor.data_ptr())


block_num = 4
length = [block_num, 64]

x = torch.randn(length, device="cpu", dtype=torch.float16).npu()
z = torch.empty_like(x)

lib.call_vabs_fp16(block_num, stream_ptr, torch_to_ctypes(x), torch_to_ctypes(z), x.numel())

@zouzias zouzias linked an issue Apr 1, 2026 that may be closed by this pull request
@zouzias zouzias requested a review from learning-chip April 2, 2026 05:44

zouzias commented Apr 2, 2026

This also segfaults

import os
import ctypes
import torch

# Load the kernel library from the current directory.
lib_path = os.path.abspath("libno_workspace_kernel.so")
lib = ctypes.CDLL(lib_path)

lib.vabs_fp16.restype = None
lib.vabs_fp16.argtypes = [
    ctypes.c_uint32,  # blockDim
    ctypes.c_void_p,  # stream
    ctypes.c_void_p,  # y
    ctypes.c_void_p,  # x
    ctypes.c_int,  # N
]
stream_ptr = torch.npu.current_stream()._as_parameter_  # noqa


def torch_to_ctypes(tensor):
    return ctypes.c_void_p(tensor.data_ptr())


block_num = 4
length = [block_num, 64]

x = torch.rand(length, device="cpu", dtype=torch.float16).npu()
z = torch.empty_like(x)

lib.vabs_fp16(block_num, stream_ptr, torch_to_ctypes(x), torch_to_ctypes(z), x.numel())
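
The scripts in this thread call two different entry points, call_vabs_fp16 and vabs_fp16, while the nm listing in the first comment shows only vabs_fp16 as an exported symbol (that listing may predate later builds). A minimal sketch for checking which name a given build actually exports before setting argtypes follows; resolve_symbol is a hypothetical helper. It relies on ctypes raising AttributeError for a missing symbol, and it says nothing about whether the resolved function has the assumed signature.

```python
def resolve_symbol(lib, candidates):
    """Return (name, function) for the first candidate the library exports."""
    for name in candidates:
        try:
            return name, getattr(lib, name)
        except AttributeError:
            # ctypes raises AttributeError when a symbol is not found.
            continue
    raise AttributeError(f"none of {list(candidates)} found in library")
```

With the library loaded as in the script above, resolve_symbol(lib, ["call_vabs_fp16", "vabs_fp16"]) would return whichever of the two names is present.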


zouzias commented Apr 2, 2026

This works!

import os
import ctypes
import torch

# Load the kernel library from the current directory.
lib_path = os.path.abspath("libno_workspace_kernel.so")
lib = ctypes.CDLL(lib_path)

lib.call_vabs_fp16.restype = None
lib.call_vabs_fp16.argtypes = [
    ctypes.c_uint32,  # blockDim
    ctypes.c_void_p,  # stream
    ctypes.c_void_p,  # y
    ctypes.c_void_p,  # x
    ctypes.c_int,  # N
]
stream_ptr = torch.npu.current_stream()._as_parameter_  # noqa


def torch_to_ctypes(tensor):
    return ctypes.c_void_p(tensor.data_ptr())


block_num = 4
length = [block_num, 64]

x = torch.randn(length, device="cpu", dtype=torch.float16).npu()
z = torch.empty_like(x)

lib.call_vabs_fp16(block_num, stream_ptr, torch_to_ctypes(x), torch_to_ctypes(z), x.numel())

@zouzias zouzias marked this pull request as ready for review April 2, 2026 16:50
@zouzias zouzias requested a review from learning-chip April 2, 2026 16:50

zouzias commented Apr 3, 2026

Tests pass now when running make clean build_wheel install test.

@learning-chip let me know if you can reproduce. If so, we can quickly merge and release the manylinux wheels to pypi.

@zouzias zouzias marked this pull request as draft April 9, 2026 04:57
anastasios and others added 7 commits April 29, 2026 10:00
* Fix formatting of Makefile for shared library build

---------

Co-authored-by: anastasios <anastasios.zouzias@huawei.com>

vloncar commented May 12, 2026

An alternative to this approach would be to let PyTorch do the compilation, much like #156 does for CPU. You can set CXX=bisheng and add the extra compile and link flags, and it should work. This would allow a simpler CMakeLists.txt, but one could go a step further and get rid of it altogether. Instead, we could have a simpler setuptools-based flow that auto-discovers new content in csrc/ without having to list it in the build scripts. It would use ninja internally to build the kernels, which is much faster than the current setup.
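
A rough sketch of what such a flow might look like, under several assumptions: that kernel sources live under csrc/ as .cpp files, that the extension is named pto_kernels_ops as in the installed wheel, and that bisheng can be driven through torch.utils.cpp_extension as suggested. The helpers discover_sources and build_setup_kwargs are hypothetical, and the compile and link flags are left empty because the thread does not state them.

```python
import os
from pathlib import Path


def discover_sources(root="csrc", suffixes=(".cpp",)):
    """Recursively collect source files under root, sorted for stable builds."""
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )


def build_setup_kwargs(root="csrc"):
    """Build keyword arguments for setuptools.setup (assumed layout)."""
    # Imported lazily so discovery can be used without torch installed.
    from torch.utils.cpp_extension import BuildExtension, CppExtension

    # Assumption: bisheng is on PATH and accepted as the C++ compiler.
    os.environ.setdefault("CXX", "bisheng")
    ext = CppExtension(
        name="pto_kernels_ops",
        sources=discover_sources(root),
        extra_compile_args=[],  # flags not given in the thread
        extra_link_args=[],  # flags not given in the thread
    )
    return {"ext_modules": [ext], "cmdclass": {"build_ext": BuildExtension}}
```

A setup.py could then call setup(name="pto_kernels", **build_setup_kwargs()). New files dropped into csrc/ would be picked up without editing the build script, and BuildExtension uses ninja when it is available.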

Successfully merging this pull request may close these issues.

Simplify cmake structure

3 participants