204 changes: 174 additions & 30 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,50 +1,194 @@
<div align="center">

# PTO-DSL
Pythonic interface and JIT compiler for [PTO-ISA](https://github.com/PTO-ISA/pto-isa)
</div>

PTO-DSL provides a programming abstraction similar to [cuTile](https://docs.nvidia.com/cuda/cutile-python/), but native to [NPU](https://www.hiascend.com/).
Python DSL for PTO-ISA kernels, with a public `pto` surface for tensor/tile
authoring, a raw `micro` surface for direct PTO micro instructions, and an A5
library layer that rewrites tile-style helpers in terms of those micro ops.

**Key features:**
- Automatic software pipelining without [manual synchronization](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/API/ascendcopapi/atlasascendc_api_07_0179.html)
- Easily interface with [torch-npu](https://gitcode.com/ascend/pytorch)
- Lightweight, open-source compiler stack using [PTO Assembler](https://github.com/zhangstevenunity/PTOAS)
The current repo targets three authoring levels:

## Installation
- `ptodsl.pto`: ergonomic tensor, view, tile, sync, and control-flow helpers
- `ptodsl.micro`: raw PTO micro instruction access such as `vlds`, `vadd`,
`vsts`, `pset_b32`, and vector register types
- `ptodsl.lib.a5`: readable A5 helper implementations that show how tile-style
operations are written with PTO micro instructions

See [docker/README.md](./docker/README.md) for full reproducible dependencies on NPU.
## Recent Upgrade

Then, install this lightweight DSL package itself:
The recent PTODSL upgrade changed the repo in four important ways:

```bash
# install latest commit
pip install git+https://github.com/huawei-csl/pto-dsl.git
1. `pto.ptr(dtype, space=...)` is now the preferred pointer constructor for
explicit memory spaces such as `GM`, `VEC`, `LEFT`, `RIGHT`, and `ACC`.
2. The public `pto` namespace now includes pythonic builders:
`make_tensor(...)`, `TensorView.slice(...)`, `make_tile_buffer(...)`,
`TileBufferSpec.alloc()`, and `TileBuffer.load_from()/store_to()`.
3. The package root now exposes `ptodsl.micro` as the raw micro-op surface.
4. The A5 library under [`ptodsl/lib/a5`](./ptodsl/lib/a5)
is organized around tile helpers implemented with PTO micro instructions,
and selected pure-micro kernels are validated through PTOAS into
`llvm.hivm.*` intrinsics.

# or stable tag
pip install git+https://github.com/huawei-csl/pto-dsl.git@0.1.0
```
Detailed API notes are in
[`docs/latest_api.md`](./docs/latest_api.md).

## Install

For in-place development:
PTODSL depends on the PTO dialect Python bindings from PTOAS and an MLIR Python
environment. For a reproducible setup, start with
[`docker/README.md`](./docker/README.md).

For local development:

```bash
git clone https://github.com/huawei-csl/pto-dsl.git
git clone https://github.com/PTO-ISA/pto-dsl.git
cd pto-dsl
pip install -e .
```

## Usage
Typical local testing in this repo also needs PTOAS and MLIR on `PYTHONPATH`,
for example:

```bash
PYTHONPATH=/path/to/mlir_core:/path/to/PTOAS/install:/path/to/PTOAS/build/python \
python -m pytest -q tests/frontend tests/regression
```
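Before running the tests, it can help to confirm that the bindings are actually visible to the interpreter. The following is a minimal sketch using only the standard library. The module names `mlir` and `ptodsl` come from the imports used in this README; the helper name `missing_modules` is hypothetical and not part of PTODSL.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the README imports; adjust to your environment.
    missing = missing_modules(["mlir", "ptodsl"])
    if missing:
        print("not importable:", ", ".join(missing))
    else:
        print("all modules found")
```

If `mlir` is reported as missing, the `PYTHONPATH` entries above are the first thing to check.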

## Public API

### 1. Pythonic `pto`

Use `ptodsl.pto` for tensor/view/tile construction:

```python
from mlir.ir import IndexType
from ptodsl import pto, tile, to_ir_module


def meta_data():
    return {
        "ptr_t": pto.ptr(pto.float32),
        "index_t": IndexType.get(),
    }


@to_ir_module(meta_data=meta_data)
def add_tile(src0: "ptr_t", src1: "ptr_t", dst: "ptr_t", valid_row: "index_t", valid_col: "index_t") -> None:
    lhs = pto.make_tensor(src0, shape=[8, 64], dtype=pto.float32)
    rhs = pto.make_tensor(src1, shape=[8, 64], dtype=pto.float32)
    out = pto.make_tensor(dst, shape=[8, 64], dtype=pto.float32)

    lhs_tile = lhs.slice([0, 0], [8, 64])
    rhs_tile = rhs.slice([0, 0], [8, 64])
    out_tile = out.slice([0, 0], [8, 64])

    with pto.vector_section():
        lhs_buf = pto.make_tile_buffer(
            pto.float32,
            [8, 64],
            space="VEC",
            valid_shape=[-1, -1],
        ).alloc(valid_row=valid_row, valid_col=valid_col)
        rhs_buf = pto.make_tile_buffer(
            pto.float32,
            [8, 64],
            space="VEC",
            valid_shape=[-1, -1],
        ).alloc(valid_row=valid_row, valid_col=valid_col)
        out_buf = pto.make_tile_buffer(
            pto.float32,
            [8, 64],
            space="VEC",
            valid_shape=[-1, -1],
        ).alloc(valid_row=valid_row, valid_col=valid_col)

        lhs_buf.load_from(lhs_tile)
        rhs_buf.load_from(rhs_tile)
        tile.add(lhs_buf, rhs_buf, out_buf)
        out_buf.store_to(out_tile)
```

This still emits native PTO tensor/tile IR such as `pto.make_tensor_view`,
`pto.partition_view`, `pto.alloc_tile`, `pto.tload`, `pto.tadd`, and
`pto.tstore`.
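As a mental model of what the kernel above computes, here is a plain-Python sketch that does not use PTODSL at all. It assumes that only the `valid_row` x `valid_col` region of the 8x64 tile is read and written and that elements outside it are left untouched. That is an assumption made for illustration, not a statement taken from the PTO-ISA specification. `add_tile_reference` is a hypothetical name.

```python
ROWS, COLS = 8, 64  # fixed tile envelope, matching shape=[8, 64] above


def add_tile_reference(src0, src1, dst, valid_row, valid_col):
    """Add the valid region of two ROWS x COLS tiles into dst, in place."""
    if not (0 <= valid_row <= ROWS and 0 <= valid_col <= COLS):
        raise ValueError("valid region must fit inside the tile envelope")
    for r in range(valid_row):
        for c in range(valid_col):
            dst[r][c] = src0[r][c] + src1[r][c]
    return dst


lhs = [[1.0] * COLS for _ in range(ROWS)]
rhs = [[2.0] * COLS for _ in range(ROWS)]
out = [[0.0] * COLS for _ in range(ROWS)]

# Only the top-left 4 x 16 block is treated as valid in this call.
add_tile_reference(lhs, rhs, out, valid_row=4, valid_col=16)
```

A reference like this may be useful when comparing device results on the host, provided the assumed valid-region behaviour matches the hardware.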

### 2. Raw `micro`

Use `ptodsl.micro` when you want to write the micro instruction sequence
directly:

```python
from mlir.ir import IndexType
from ptodsl import micro, pto, to_ir_module
from ptodsl.api.scalar import _unwrap


def meta_data():
    return {
        "ptr_t": pto.ptr(pto.float32, space="VEC"),
        "index_t": IndexType.get(),
    }


@to_ir_module(meta_data=meta_data)
def vadd_demo(src0: "ptr_t", src1: "ptr_t", dst: "ptr_t", offset: "index_t") -> None:
    v64f32 = micro.VRegType.get(64, pto.float32)
    mask = micro.pset_b32(micro.MaskType.get(), "PAT_ALL")
    lhs = micro.vlds(v64f32, _unwrap(src0), _unwrap(offset))
    rhs = micro.vlds(v64f32, _unwrap(src1), _unwrap(offset))
    out = micro.vadd(v64f32, lhs, rhs, mask)
    micro.vsts(out, _unwrap(dst), _unwrap(offset), mask)
```

This is the most direct PTODSL surface for VPTO/PTOAS lowering.
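To make the data flow of the load, add, and store calls concrete, the following plain-Python model mimics them on 64-lane vectors. It assumes that the offset counts elements, that `PAT_ALL` enables every lane, and that disabled lanes are neither computed nor stored; none of this is confirmed by the source. The `model_*` function names are hypothetical and are not part of `ptodsl.micro`.

```python
LANES = 64  # matches VRegType.get(64, float32) in the example above


def model_vlds(memory, offset):
    """Load LANES consecutive elements starting at offset."""
    return list(memory[offset:offset + LANES])


def model_vadd(lhs, rhs, mask):
    """Lane-wise add; in this model a disabled lane keeps the lhs value."""
    return [a + b if m else a for a, b, m in zip(lhs, rhs, mask)]


def model_vsts(value, memory, offset, mask):
    """Store only the lanes enabled by mask."""
    for lane, (v, m) in enumerate(zip(value, mask)):
        if m:
            memory[offset + lane] = v


src0 = [float(i) for i in range(128)]
src1 = [10.0] * 128
dst = [0.0] * 128
mask_all = [True] * LANES  # stand-in for the "PAT_ALL" predicate

out = model_vadd(model_vlds(src0, 64), model_vlds(src1, 64), mask_all)
model_vsts(out, dst, 64, mask_all)
```

After the call, only `dst[64:128]` has been written; the first 64 elements keep their initial value.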

### 3. A5 Library

The A5 layer under [`ptodsl/lib/a5`](./ptodsl/lib/a5)
shows how tile-style helpers map to micro instructions:

- `tadd` is written with `pto.vlds`, `pto.vadd`, and `pto.vsts`
- `trow_sum` is written with `pto.vcadd` plus vector combine/store logic
- `tcol_expand`, `tgather`, `tmrgsort`, and `tsort32` are expressed directly in
terms of PTO micro opcodes where supported

See [`ptodsl/lib/a5/README.md`](./ptodsl/lib/a5/README.md)
for the file layout and generation flow.

See [examples](./examples) and [tests](./tests)
## End-to-End Flow

## Contribute
The repo currently tracks two useful flows:

See [contribute_guide.md](./contribute_guide.md)
- PTODSL frontend coverage:
tensor/view/tile and A5 examples emit correct `.pto`
- PTODSL -> PTOAS -> HIVM proof path:
pure micro kernels such as
[`a5_hivm_vadd_demo.pto`](./ptodsl/lib/a5/generated/a5_hivm_vadd_demo.pto)
lower through PTOAS into
[`a5_hivm_vadd_demo.ll`](./ptodsl/lib/a5/generated/a5_hivm_vadd_demo.ll)

Generated examples live in
[`ptodsl/lib/a5/generated`](./ptodsl/lib/a5/generated).

## Tests

The repo currently uses:

- [`tests/frontend`](./tests/frontend) for
frontend IR construction
- [`tests/regression`](./tests/regression)
for A5 library coverage, generated artifact expectations, and public-surface
regressions

Run them with:

```bash
PYTHONPATH=/path/to/mlir_core:/path/to/PTOAS/install:/path/to/PTOAS/build/python \
python -m pytest -q tests/frontend tests/regression
```

## Compare to other frameworks
## Related Files

PTO-DSL aims for **low-level, explicit, NPU-native primitives** that can match the performance of **programming in [hardware intrinsics](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/API/cceintrinsicapi/cceapi_0001.html)**. Compared to other (also very good) kernel programming frameworks, it has a bit different scope by design:
- vs [tilelang-ascend](https://github.com/tile-ai/tilelang-ascend): tilelang can also [use PTO-ISA as codegen backend](https://github.com/tile-ai/tilelang-ascend/blob/76553755da078479a7f60cce9c5f0e9a24d0008b/src/target/codegen_ascend_pto.cc). PTO-DSL intentionally exposes lower-level control, for example L2 swizzling is one-liner `T.use_swizzle` in tilelang, but is a user-defined custom function in PTO-DSL -- see this [matmul optimization example](examples/aot/matmul_optimization_guide/matmul_optim_guide.md). Once PTO-DSL is more stabilized, it might serve as a component like the [CuteDSL backend for tilelang](https://github.com/tile-ai/tilelang/blob/v0.1.8/src/target/codegen_cutedsl.cc).
- vs [triton-ascend](https://gitcode.com/Ascend/triton-ascend): Both frameworks automate software pipelining based on some MLIR dialects for NPU. PTO-DSL exposes more NPU-native memory hierarchy such as `L0`/`L1`/`UB`. Also, `pto.load`/`pto.store` always maps to native efficient DMA instructions, while `tl.load`/`tl.store` tries to do GPU-style memory coalescing.
- vs [Catlass](https://gitcode.com/cann/catlass): Catlass provides expert-optimized template collections, while PTO-DSL is more like the [CuteDSL](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/overview.html) layer of Cutlass, offering explicit low-level primitives.
- vs [PyPTO](https://gitcode.com/cann/pypto): PyPTO is a full [MPMD](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy#Multiple_programs,_multiple_data_streams_(MPMD)) dynamic runtime stack, which also [uses PTO-ISA as lowest-level primitive](https://gitcode.com/cann/pypto/tree/r0.1.1/framework/src/interface/tileop). PyPTO's Tensor API abstraction is closer to PyTorch/JAX level, while a PTO-DSL kernel is still [SPMD](https://en.wikipedia.org/wiki/Single_program,_multiple_data) and is closer to CuTile/CuteDSL level.
- [`docs/latest_api.md`](./docs/latest_api.md)
- [`ptodsl/api/pto.py`](./ptodsl/api/pto.py)
- [`ptodsl/api/micro.py`](./ptodsl/api/micro.py)
- [`ptodsl/lib/a5/README.md`](./ptodsl/lib/a5/README.md)
- [`contribute_guide.md`](./contribute_guide.md)
120 changes: 120 additions & 0 deletions docs/latest_api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# PTODSL Latest API

This document summarizes the public PTODSL surface after the recent A5/micro
upgrade and explains which layer to use for each kind of kernel.

## Public Layers

### `ptodsl.pto`

Use this layer for:

- pointer/type construction
- tensor and partitioned-view authoring
- tile buffer allocation and `tload`/`tstore`
- control flow and synchronization

Key entry points:

- `pto.ptr(dtype, space=None)`
- `pto.TensorType(rank=..., dtype=...)`
- `pto.SubTensorType(shape=..., dtype=...)`
- `pto.TileBufType(shape=..., dtype=..., memory_space=..., valid_shape=..., config=...)`
- `pto.make_tensor(ptr, shape=..., strides=None, dtype=..., type=None, layout=None)`
- `TensorView.slice(offsets, sizes, static_shape=None)`
- `pto.make_tile_buffer(dtype, shape, space=..., valid_shape=None, config=None)`
- `TileBufferSpec.alloc(addr=None, valid_row=None, valid_col=None)`
- `TileBuffer.load_from(view)` / `TileBuffer.store_to(view)`

Type aliases currently exposed through `pto`:

- `bool`
- `float16`
- `float32`
- `bfloat16`
- `int8`
- `int16`
- `int32`
- `uint8`
- `uint16`
- `uint32`

### `ptodsl.micro`

Use this layer when you want raw PTO micro instructions without going through
tile helpers.

Examples:

- `micro.vlds`
- `micro.vadd`
- `micro.vsts`
- `micro.vcadd`
- `micro.vgather2`
- `micro.vmrgsort4`
- `micro.vbitsort`
- `micro.pset_b32`
- `micro.VRegType`
- `micro.MaskType`

This layer is a thin pass-through over the PTO dialect Python bindings, filtered
to the public micro-op surface.

### `ptodsl.lib.a5`

Use the A5 library when you want readable, opcode-focused examples of how an
existing A5 tile helper is expressed with PTO micro instructions.

Examples:

- `a5.tadd`
- `a5.tadds`
- `a5.trow_sum`
- `a5.tcol_expand`
- `a5.tgather`
- `a5.tsort32`

The split modules in [`ptodsl/lib/a5`](../ptodsl/lib/a5)
are organized by tile helper family:

- [`tbinary.py`](../ptodsl/lib/a5/tbinary.py)
- [`tscalar.py`](../ptodsl/lib/a5/tscalar.py)
- [`tunary.py`](../ptodsl/lib/a5/tunary.py)
- [`texpand.py`](../ptodsl/lib/a5/texpand.py)
- [`treduce.py`](../ptodsl/lib/a5/treduce.py)
- [`tsort.py`](../ptodsl/lib/a5/tsort.py)

## Compile-Time vs Runtime Values

PTODSL now follows the same staging model as the PTO C++ tile headers:

- compile-time constants:
dtype, memory space, tile capacity, tile layout/config, specialization knobs
- runtime values:
pointers, offsets, valid row/column bounds, problem sizes

In practice:

- `tile_shape=[ROWS, COLS]` describes the fixed tile envelope
- `valid_row` and `valid_col` describe the runtime active region when the valid
box is dynamic
- `Constexpr[...]` is used in template-style builders such as
`build_templated_elementwise_add`
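The split between the two kinds of values can be illustrated with an ordinary Python closure. This is only an analogy for the staging model, written without PTODSL: it does not use `Constexpr[...]`, and `build_add_kernel` is a hypothetical name. The tile capacity is fixed when the kernel is built, while buffers and the valid region are supplied on every call.

```python
def build_add_kernel(rows, cols):
    """Specialize a kernel on the compile-time tile capacity (rows, cols)."""
    if rows <= 0 or cols <= 0:
        raise ValueError("tile capacity must be positive")

    def kernel(src0, src1, dst, valid_row, valid_col):
        # Runtime values: the buffers and the active region inside the envelope.
        if valid_row > rows or valid_col > cols:
            raise ValueError("valid region exceeds the tile envelope")
        for r in range(valid_row):
            for c in range(valid_col):
                i = r * cols + c  # row-major index within the fixed envelope
                dst[i] = src0[i] + src1[i]

    return kernel


kernel_8x64 = build_add_kernel(8, 64)  # capacity chosen once, at build time

a = [1.0] * (8 * 64)
b = [2.0] * (8 * 64)
out = [0.0] * (8 * 64)
kernel_8x64(a, b, out, valid_row=2, valid_col=3)  # valid region varies per call
```

The same `kernel_8x64` object can be called again with a different valid region without rebuilding it, which is the property the staging model is meant to provide.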

## End-to-End Lowering

The strongest validated path today is:

1. write a pure micro kernel in PTODSL
2. emit `.pto`
3. lower with PTOAS VPTO
4. inspect emitted `llvm.hivm.*` intrinsics
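For step 4, a small script can list which `llvm.hivm.*` symbols appear in the lowered output. The sketch below works on LLVM IR text using only the standard library. The function name `count_hivm_intrinsics` is hypothetical, and the intrinsic names in the sample string are placeholders rather than real HIVM intrinsics.

```python
import re
from collections import Counter

# Matches symbols such as @llvm.hivm.<name> in textual LLVM IR.
HIVM_SYMBOL = re.compile(r"@(llvm\.hivm\.[A-Za-z0-9_.]+)")


def count_hivm_intrinsics(ll_text):
    """Count each distinct llvm.hivm.* symbol referenced in LLVM IR text."""
    return Counter(HIVM_SYMBOL.findall(ll_text))


# Hypothetical IR fragment; the intrinsic names below are placeholders.
sample = """
declare void @llvm.hivm.example.load(ptr)
call void @llvm.hivm.example.load(ptr %0)
call void @llvm.hivm.example.add(ptr %1)
call void @other.function()
"""

counts = count_hivm_intrinsics(sample)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

To apply this to a generated file, read it with `pathlib.Path(path).read_text()` and pass the resulting string to the function.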

Reference artifacts:

- [`a5_hivm_vadd_demo.pto`](../ptodsl/lib/a5/generated/a5_hivm_vadd_demo.pto)
- [`a5_hivm_vadd_demo.ll`](../ptodsl/lib/a5/generated/a5_hivm_vadd_demo.ll)

The higher-level tensor/tile frontend remains useful for PTODSL authoring
and regression coverage, but the pure micro path is the clearest proof route
for PTOAS HIVM lowering.
3 changes: 2 additions & 1 deletion ptodsl/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
from . import pto, scalar, tile
from . import micro, pto, scalar, tile
from .bench import do_bench
from .compiler.ir import to_ir_module
from .compiler.jit import JitWrapper, jit
@@ -10,6 +10,7 @@
"const_expr",
"do_bench",
"jit",
"micro",
"pto",
"range_constexpr",
"scalar",
4 changes: 2 additions & 2 deletions ptodsl/api/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
from . import pto, scalar, tile
from . import micro, pto, scalar, tile

__all__ = ["pto", "scalar", "tile"]
__all__ = ["micro", "pto", "scalar", "tile"]