PTO-ISA · zhoubot · Mar 31, 2026 · Mar 31, 2026
diff --git a/README.md b/README.md
@@ -1,50 +1,194 @@
-<div align="center">
-
 # PTO-DSL
-Pythonic interface and JIT compiler for [PTO-ISA](https://github.com/PTO-ISA/pto-isa)
-</div>
 
-PTO-DSL provides a programming abstraction similar to [cuTile](https://docs.nvidia.com/cuda/cutile-python/), but native to [NPU](https://www.hiascend.com/).
+Python DSL for PTO-ISA kernels, with a public `pto` surface for tensor/tile
+authoring, a raw `micro` surface for direct PTO micro instructions, and an A5
+library layer that rewrites tile-style helpers in terms of those micro ops.
 
-**Key features:**
-- Automatic software pipelining without [manual synchronization](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/API/ascendcopapi/atlasascendc_api_07_0179.html)
-- Easily interface with [torch-npu](https://gitcode.com/ascend/pytorch)
-- Lightweight, open-source compiler stack using [PTO Assembler](https://github.com/zhangstevenunity/PTOAS)
+The current repo targets three authoring levels:
 
-## Installation
+- `ptodsl.pto`: ergonomic tensor, view, tile, sync, and control-flow helpers
+- `ptodsl.micro`: raw PTO micro instruction access such as `vlds`, `vadd`,
+  `vsts`, `pset_b32`, and vector register types
+- `ptodsl.lib.a5`: readable A5 helper implementations that show how tile-style
+  operations are written with PTO micro instructions
 
-See [docker/README.md](./docker/README.md) for full reproducible dependencies on NPU.
+## Recent Upgrade
 
-Then, install this lightweight DSL package itself:
+The recent PTODSL upgrade changed the repo in four important ways:
 
-```bash
-# install latest commit
-pip install git+https://github.com/huawei-csl/pto-dsl.git
+1. `pto.ptr(dtype, space=...)` is now the preferred pointer constructor for
+   explicit memory spaces such as `GM`, `VEC`, `LEFT`, `RIGHT`, and `ACC`.
+2. The public `pto` namespace now includes pythonic builders:
+   `make_tensor(...)`, `TensorView.slice(...)`, `make_tile_buffer(...)`,
+   `TileBufferSpec.alloc()`, and `TileBuffer.load_from()/store_to()`.
+3. The package root now exposes `ptodsl.micro` as the raw micro-op surface.
+4. The A5 library under [`ptodsl/lib/a5`](./ptodsl/lib/a5)
+   is organized around tile helpers implemented with PTO micro instructions,
+   and selected pure-micro kernels are validated through PTOAS into
+   `llvm.hivm.*` intrinsics.
 
-# or stable tag
-pip install git+https://github.com/huawei-csl/pto-dsl.git@0.1.0
-```
+Detailed API notes are in
+[`docs/latest_api.md`](./docs/latest_api.md).
+
+## Install
 
-For in-place development:
+PTODSL depends on the PTO dialect Python bindings from PTOAS and an MLIR Python
+environment. For a reproducible setup, start with
+[`docker/README.md`](./docker/README.md).
+
+For local development:
 
 ```bash
-git clone https://github.com/huawei-csl/pto-dsl.git
+git clone https://github.com/PTO-ISA/pto-dsl.git
 cd pto-dsl
 pip install -e .
 ```
 
-## Usage
+Typical local testing in this repo also needs PTOAS and MLIR on `PYTHONPATH`,
+for example:
+
+```bash
+PYTHONPATH=/path/to/mlir_core:/path/to/PTOAS/install:/path/to/PTOAS/build/python \
+python -m pytest -q tests/frontend tests/regression
+```
+
+## Public API
+
+### 1. Pythonic `pto`
+
+Use `ptodsl.pto` for tensor/view/tile construction:
+
+```python
+from mlir.ir import IndexType
+from ptodsl import pto, tile, to_ir_module
+
+
+def meta_data():
+    return {
+        "ptr_t": pto.ptr(pto.float32),
+        "index_t": IndexType.get(),
+    }
+
+
+@to_ir_module(meta_data=meta_data)
+def add_tile(src0: "ptr_t", src1: "ptr_t", dst: "ptr_t", valid_row: "index_t", valid_col: "index_t") -> None:
+    lhs = pto.make_tensor(src0, shape=[8, 64], dtype=pto.float32)
+    rhs = pto.make_tensor(src1, shape=[8, 64], dtype=pto.float32)
+    out = pto.make_tensor(dst, shape=[8, 64], dtype=pto.float32)
+
+    lhs_tile = lhs.slice([0, 0], [8, 64])
+    rhs_tile = rhs.slice([0, 0], [8, 64])
+    out_tile = out.slice([0, 0], [8, 64])
+
+    with pto.vector_section():
+        lhs_buf = pto.make_tile_buffer(
+            pto.float32,
+            [8, 64],
+            space="VEC",
+            valid_shape=[-1, -1],
+        ).alloc(valid_row=valid_row, valid_col=valid_col)
+        rhs_buf = pto.make_tile_buffer(
+            pto.float32,
+            [8, 64],
+            space="VEC",
+            valid_shape=[-1, -1],
+        ).alloc(valid_row=valid_row, valid_col=valid_col)
+        out_buf = pto.make_tile_buffer(
+            pto.float32,
+            [8, 64],
+            space="VEC",
+            valid_shape=[-1, -1],
+        ).alloc(valid_row=valid_row, valid_col=valid_col)
+
+        lhs_buf.load_from(lhs_tile)
+        rhs_buf.load_from(rhs_tile)
+        tile.add(lhs_buf, rhs_buf, out_buf)
+        out_buf.store_to(out_tile)
+```
+
+This still emits native PTO tensor/tile IR such as `pto.make_tensor_view`,
+`pto.partition_view`, `pto.alloc_tile`, `pto.tload`, `pto.tadd`, and
+`pto.tstore`.
+
+### 2. Raw `micro`
+
+Use `ptodsl.micro` when you want to write the micro instruction sequence
+directly:
+
+```python
+from mlir.ir import IndexType
+from ptodsl import micro, pto, to_ir_module
+from ptodsl.api.scalar import _unwrap
+
+
+def meta_data():
+    return {
+        "ptr_t": pto.ptr(pto.float32, space="VEC"),
+        "index_t": IndexType.get(),
+    }
+
+
+@to_ir_module(meta_data=meta_data)
+def vadd_demo(src0: "ptr_t", src1: "ptr_t", dst: "ptr_t", offset: "index_t") -> None:
+    v64f32 = micro.VRegType.get(64, pto.float32)
+    mask = micro.pset_b32(micro.MaskType.get(), "PAT_ALL")
+    lhs = micro.vlds(v64f32, _unwrap(src0), _unwrap(offset))
+    rhs = micro.vlds(v64f32, _unwrap(src1), _unwrap(offset))
+    out = micro.vadd(v64f32, lhs, rhs, mask)
+    micro.vsts(out, _unwrap(dst), _unwrap(offset), mask)
+```
+
+This is the most direct PTODSL surface for VPTO/PTOAS lowering.
+
+### 3. A5 Library
+
+The A5 layer under [`ptodsl/lib/a5`](./ptodsl/lib/a5)
+shows how tile-style helpers map to micro instructions:
+
+- `tadd` is written with `pto.vlds`, `pto.vadd`, and `pto.vsts`
+- `trow_sum` is written with `pto.vcadd` plus vector combine/store logic
+- `tcol_expand`, `tgather`, `tmrgsort`, and `tsort32` are expressed directly in
+  terms of PTO micro opcodes where supported
+
+See [`ptodsl/lib/a5/README.md`](./ptodsl/lib/a5/README.md)
+for the file layout and generation flow.
 
-See [examples](./examples) and [tests](./tests)
+## End-to-End Flow
 
-## Contribute
+The repo currently tracks two useful flows:
 
-See [contribute_guide.md](./contribute_guide.md)
+- PTODSL frontend coverage:
+  tensor/view/tile and A5 examples emit correct `.pto`
+- PTODSL -> PTOAS -> HIVM proof path:
+  pure micro kernels such as
+  [`a5_hivm_vadd_demo.pto`](./ptodsl/lib/a5/generated/a5_hivm_vadd_demo.pto)
+  lower through PTOAS into
+  [`a5_hivm_vadd_demo.ll`](./ptodsl/lib/a5/generated/a5_hivm_vadd_demo.ll)
+
+Generated examples live in
+[`ptodsl/lib/a5/generated`](./ptodsl/lib/a5/generated).
+
+## Tests
+
+The repo currently uses:
+
+- [`tests/frontend`](./tests/frontend) for
+  frontend IR construction
+- [`tests/regression`](./tests/regression)
+  for A5 library coverage, generated artifact expectations, and public-surface
+  regressions
+
+Run them with:
+
+```bash
+PYTHONPATH=/path/to/mlir_core:/path/to/PTOAS/install:/path/to/PTOAS/build/python \
+python -m pytest -q tests/frontend tests/regression
+```
 
-## Compare to other frameworks
+## Related Files
 
-PTO-DSL aims for **low-level, explicit, NPU-native primitives** that can match the performance of **programming in [hardware intrinsics](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850/API/cceintrinsicapi/cceapi_0001.html)**. Compared to other (also very good) kernel programming frameworks, it has a bit different scope by design:
-- vs [tilelang-ascend](https://github.com/tile-ai/tilelang-ascend): tilelang can also [use PTO-ISA as codegen backend](https://github.com/tile-ai/tilelang-ascend/blob/76553755da078479a7f60cce9c5f0e9a24d0008b/src/target/codegen_ascend_pto.cc). PTO-DSL intentionally exposes lower-level control, for example L2 swizzling is one-liner `T.use_swizzle` in tilelang, but is a user-defined custom function in PTO-DSL -- see this [matmul optimization example](examples/aot/matmul_optimization_guide/matmul_optim_guide.md). Once PTO-DSL is more stabilized, it might serve as a component like the [CuteDSL backend for tilelang](https://github.com/tile-ai/tilelang/blob/v0.1.8/src/target/codegen_cutedsl.cc).
-- vs [triton-ascend](https://gitcode.com/Ascend/triton-ascend): Both frameworks automate software pipelining based on some MLIR dialects for NPU. PTO-DSL exposes more NPU-native memory hierarchy such as `L0`/`L1`/`UB`. Also, `pto.load`/`pto.store` always maps to native efficient DMA instructions, while `tl.load`/`tl.store` tries to do GPU-style memory coalescing.
-- vs [Catlass](https://gitcode.com/cann/catlass): Catlass provides expert-optimized template collections, while PTO-DSL is more like the [CuteDSL](https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/overview.html) layer of Cutlass, offering explicit low-level primitives.
-- vs [PyPTO](https://gitcode.com/cann/pypto): PyPTO is a full [MPMD](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy#Multiple_programs,_multiple_data_streams_(MPMD)) dynamic runtime stack, which also [uses PTO-ISA as lowest-level primitive](https://gitcode.com/cann/pypto/tree/r0.1.1/framework/src/interface/tileop). PyPTO's Tensor API abstraction is closer to PyTorch/JAX level, while a PTO-DSL kernel is still [SPMD](https://en.wikipedia.org/wiki/Single_program,_multiple_data) and is closer to CuTile/CuteDSL level.
+- [`docs/latest_api.md`](./docs/latest_api.md)
+- [`ptodsl/api/pto.py`](./ptodsl/api/pto.py)
+- [`ptodsl/api/micro.py`](./ptodsl/api/micro.py)
+- [`ptodsl/lib/a5/README.md`](./ptodsl/lib/a5/README.md)
+- [`contribute_guide.md`](./contribute_guide.md)
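
To sanity-check what `vadd_demo` in the README's "Raw `micro`" section is expected to compute, a plain-Python reference model can help. The sketch below does not import PTODSL. Its helper names mirror the micro ops only for readability, and the semantics it encodes are assumptions rather than documented PTO-ISA behavior: offsets counted in elements, contiguous 64-lane loads, `PAT_ALL` enabling every lane, and masked-off lanes reading as zero.

```python
from typing import List, MutableSequence, Sequence

LANES = 64  # matches VRegType.get(64, float32) in the README example


def pset_all(lanes: int = LANES) -> List[bool]:
    """Model of an all-true predicate (assumed meaning of "PAT_ALL")."""
    return [True] * lanes


def vlds(buf: Sequence[float], offset: int, lanes: int = LANES) -> List[float]:
    """Model of a contiguous load of `lanes` elements starting at `offset`."""
    return list(buf[offset:offset + lanes])


def vadd(lhs: Sequence[float], rhs: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Lane-wise add; masked-off lanes are modeled as 0.0 (an assumption)."""
    return [a + b if m else 0.0 for a, b, m in zip(lhs, rhs, mask)]


def vsts(vec: Sequence[float], buf: MutableSequence[float], offset: int,
         mask: Sequence[bool]) -> None:
    """Masked contiguous store: only lanes with a true mask bit are written."""
    for lane, (value, m) in enumerate(zip(vec, mask)):
        if m:
            buf[offset + lane] = value


def vadd_demo_reference(src0: Sequence[float], src1: Sequence[float],
                        dst: MutableSequence[float], offset: int) -> None:
    """Same op sequence as the README kernel, on Python lists."""
    mask = pset_all()
    lhs = vlds(src0, offset)
    rhs = vlds(src1, offset)
    out = vadd(lhs, rhs, mask)
    vsts(out, dst, offset, mask)
```

A model like this could serve as the expected-output oracle when comparing against a real device run, provided the assumed semantics are first confirmed against the PTO-ISA documentation.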
diff --git a/docs/latest_api.md b/docs/latest_api.md
@@ -0,0 +1,120 @@
+# PTODSL Latest API
+
+This document summarizes the public PTODSL surface after the recent A5/micro
+upgrade and explains which layer to use for each kind of kernel.
+
+## Public Layers
+
+### `ptodsl.pto`
+
+Use this layer for:
+
+- pointer/type construction
+- tensor and partitioned-view authoring
+- tile buffer allocation and `tload`/`tstore`
+- control flow and synchronization
+
+Key entry points:
+
+- `pto.ptr(dtype, space=None)`
+- `pto.TensorType(rank=..., dtype=...)`
+- `pto.SubTensorType(shape=..., dtype=...)`
+- `pto.TileBufType(shape=..., dtype=..., memory_space=..., valid_shape=..., config=...)`
+- `pto.make_tensor(ptr, shape=..., strides=None, dtype=..., type=None, layout=None)`
+- `TensorView.slice(offsets, sizes, static_shape=None)`
+- `pto.make_tile_buffer(dtype, shape, space=..., valid_shape=None, config=None)`
+- `TileBufferSpec.alloc(addr=None, valid_row=None, valid_col=None)`
+- `TileBuffer.load_from(view)` / `TileBuffer.store_to(view)`
+
+Type aliases currently exposed through `pto`:
+
+- `bool`
+- `float16`
+- `float32`
+- `bfloat16`
+- `int8`
+- `int16`
+- `int32`
+- `uint8`
+- `uint16`
+- `uint32`
+
+### `ptodsl.micro`
+
+Use this layer when you want raw PTO micro instructions without going through
+tile helpers.
+
+Examples:
+
+- `micro.vlds`
+- `micro.vadd`
+- `micro.vsts`
+- `micro.vcadd`
+- `micro.vgather2`
+- `micro.vmrgsort4`
+- `micro.vbitsort`
+- `micro.pset_b32`
+- `micro.VRegType`
+- `micro.MaskType`
+
+This layer is a thin pass-through over the PTO dialect Python bindings, filtered
+to the public micro-op surface.
+
+### `ptodsl.lib.a5`
+
+Use the A5 library when you want readable, opcode-focused examples of how an
+existing A5 tile helper is expressed with PTO micro instructions.
+
+Examples:
+
+- `a5.tadd`
+- `a5.tadds`
+- `a5.trow_sum`
+- `a5.tcol_expand`
+- `a5.tgather`
+- `a5.tsort32`
+
+The split modules in [`ptodsl/lib/a5`](../ptodsl/lib/a5)
+are organized by tile helper family:
+
+- [`tbinary.py`](../ptodsl/lib/a5/tbinary.py)
+- [`tscalar.py`](../ptodsl/lib/a5/tscalar.py)
+- [`tunary.py`](../ptodsl/lib/a5/tunary.py)
+- [`texpand.py`](../ptodsl/lib/a5/texpand.py)
+- [`treduce.py`](../ptodsl/lib/a5/treduce.py)
+- [`tsort.py`](../ptodsl/lib/a5/tsort.py)
+
+## Compile-Time vs Runtime Values
+
+PTODSL now follows the same staging model as the PTO C++ tile headers:
+
+- compile-time constants:
+  dtype, memory space, tile capacity, tile layout/config, specialization knobs
+- runtime values:
+  pointers, offsets, valid row/column bounds, problem sizes
+
+In practice:
+
+- `tile_shape=[ROWS, COLS]` describes the fixed tile envelope
+- `valid_row` and `valid_col` describe the runtime active region when the valid
+  box is dynamic
+- `Constexpr[...]` is used in template-style builders such as
+  `build_templated_elementwise_add`
+
+## End-to-End Lowering
+
+The strongest validated path today is:
+
+1. write a pure micro kernel in PTODSL
+2. emit `.pto`
+3. lower with PTOAS VPTO
+4. inspect emitted `llvm.hivm.*` intrinsics
+
+Reference artifacts:
+
+- [`a5_hivm_vadd_demo.pto`](../ptodsl/lib/a5/generated/a5_hivm_vadd_demo.pto)
+- [`a5_hivm_vadd_demo.ll`](../ptodsl/lib/a5/generated/a5_hivm_vadd_demo.ll)
+
+The higher-level tensor/tile frontend remains useful for PTODSL authoring and
+regression coverage, but the pure micro path is currently the clearest way to
+validate PTOAS HIVM lowering.
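
The "Compile-Time vs Runtime Values" split in `docs/latest_api.md` can be illustrated with a small stand-alone model. This is a sketch in plain Python, not PTODSL's implementation: `TileSpecModel` and `DYNAMIC` are hypothetical names, and the only detail borrowed from the source is the `-1` marker that the README example passes in `valid_shape=[-1, -1]`. In real PTODSL the runtime bounds are IR values rather than Python integers, so a range check like the one below would presumably not be possible at trace time.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

DYNAMIC = -1  # marker for "supplied at runtime", as in valid_shape=[-1, -1]


@dataclass(frozen=True)
class TileSpecModel:
    """Compile-time part: fixed envelope plus a static-or-dynamic valid shape."""

    shape: Tuple[int, int]
    valid_shape: Tuple[int, int]

    def alloc(self, valid_row: Optional[int] = None,
              valid_col: Optional[int] = None) -> Tuple[int, int]:
        """Runtime part: resolve the active region and check it fits the envelope."""
        resolved = []
        for name, static, runtime, cap in (
            ("valid_row", self.valid_shape[0], valid_row, self.shape[0]),
            ("valid_col", self.valid_shape[1], valid_col, self.shape[1]),
        ):
            if static == DYNAMIC:
                # Dynamic bound: the caller must provide it when allocating.
                if runtime is None:
                    raise ValueError(f"{name} is dynamic and must be given at alloc time")
                value = runtime
            else:
                # Static bound: a runtime override must agree with it.
                if runtime is not None and runtime != static:
                    raise ValueError(f"{name} is static ({static}); got {runtime}")
                value = static
            if not 0 <= value <= cap:
                raise ValueError(f"{name}={value} is outside the tile envelope 0..{cap}")
            resolved.append(value)
        return (resolved[0], resolved[1])
```

With this model, `TileSpecModel((8, 64), (DYNAMIC, DYNAMIC)).alloc(valid_row=5, valid_col=32)` resolves to `(5, 32)`, while a fully static spec needs no runtime arguments.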
diff --git a/ptodsl/__init__.py b/ptodsl/__init__.py
@@ -1,4 +1,4 @@
-from . import pto, scalar, tile
+from . import micro, pto, scalar, tile
 from .bench import do_bench
 from .compiler.ir import to_ir_module
 from .compiler.jit import JitWrapper, jit
@@ -10,6 +10,7 @@
     "const_expr",
     "do_bench",
     "jit",
+    "micro",
     "pto",
     "range_constexpr",
     "scalar",

diff --git a/ptodsl/api/__init__.py b/ptodsl/api/__init__.py
@@ -1,3 +1,3 @@
-from . import pto, scalar, tile
+from . import micro, pto, scalar, tile
 
-__all__ = ["pto", "scalar", "tile"]
+__all__ = ["micro", "pto", "scalar", "tile"]
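
Both `__init__.py` hunks add `micro` to the import line and to `__all__` together. A public-surface regression test could guard that pairing. The sketch below is one possible shape for such a check; `missing_exports` is a hypothetical helper, and `fake_api` is an in-memory stand-in for `ptodsl.api` so the example runs without PTODSL installed.

```python
import types
from typing import List


def missing_exports(module: types.ModuleType) -> List[str]:
    """Return names listed in module.__all__ that are not attributes of module."""
    return [name for name in getattr(module, "__all__", []) if not hasattr(module, name)]


# Stand-in for `ptodsl.api`, built in memory for illustration only.
fake_api = types.ModuleType("fake_api")
for sub in ("micro", "pto", "scalar", "tile"):
    setattr(fake_api, sub, types.ModuleType(f"fake_api.{sub}"))
fake_api.__all__ = ["micro", "pto", "scalar", "tile"]
```

In a real test, the same helper would be called on the imported `ptodsl` and `ptodsl.api` modules and asserted to return an empty list.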