Merged
56 commits
bbe4721
update ptoas to https://github.com/huawei-csl/PTOAS/releases/tag/2026…
learning-chip Mar 4, 2026
03525e5
Geglu Dynamic Multicore (#54)
MirkoDeVita98 Mar 4, 2026
9635d5f
fast hadamard example, in both manual-sync and auto-sync versions (#52)
learning-chip Mar 4, 2026
79cd084
Various code clean-ups for hadamard example (#59)
learning-chip Mar 4, 2026
0475779
Performance measurement and tuning for fast hadamard python example (…
learning-chip Mar 5, 2026
bf95136
fix vector index counts for other vector examples (#64)
learning-chip Mar 5, 2026
9dec639
Rowsum dynamic fp32 test (#63)
MirkoDeVita98 Mar 6, 2026
14d6be1
feat: add double buffered matmul matching `torch.matmul` speed (#55)
fiskrt Mar 6, 2026
c9d17b0
elementwise unary fp16 fp32 dynamic multicore tests (#67)
MirkoDeVita98 Mar 7, 2026
1d888fc
Refactor ptodsl package module/namespace layout (#69)
learning-chip Mar 7, 2026
7fa663e
update ptoas to https://github.com/huawei-csl/PTOAS/releases/tag/2026…
learning-chip Mar 8, 2026
80c4cb5
feat: add `TPRINT` and `printf` ops and example (#68)
fiskrt Mar 9, 2026
0306b98
Matmul swizzle cleanup (#72)
learning-chip Mar 9, 2026
e7a6842
Test auto-sync on general-shape matmul with swizzle, compare to manua…
learning-chip Mar 9, 2026
79749eb
fix pip install problem without -e, and cover both editable and non-e…
learning-chip Mar 10, 2026
f646f94
add matplotlib and pandas to docker image (commonly used by benchmark…
learning-chip Mar 10, 2026
8875bca
added min and max for tiles with tests (#78)
MirkoDeVita98 Mar 10, 2026
d923ac2
Step-by-step optimization guide of dynamic general-shape matmul (#79)
learning-chip Mar 10, 2026
9a41ef4
added colsum dynamic multicore test and merged to rowsum
Mar 10, 2026
1c0bec8
wip: generalization of reduce ops, misisng colmin, colmax and rowprod…
Mar 10, 2026
0adadc9
working col min and col max
Mar 11, 2026
132e349
rename to reduce tests
Mar 11, 2026
838492e
added reduce op to list for import visibility
Mar 11, 2026
7f8176a
Merge pull request #80 from huawei-csl/rowsum
MirkoDeVita98 Mar 11, 2026
7800e14
added row/col expand dynamic multicore tests
Mar 11, 2026
0956c8e
reduce number of tests for expand
Mar 11, 2026
3f0860b
Merge pull request #81 from huawei-csl/expand_ops
MirkoDeVita98 Mar 11, 2026
36cd417
Manually polished matmul optimization guide (#82)
learning-chip Mar 13, 2026
45427a4
add Chinese version of matmul optim guide
learning-chip Mar 14, 2026
867f5f2
re-org example dir to multi-level
learning-chip Mar 14, 2026
710a552
run `pre-commit run --all-files`
learning-chip Mar 14, 2026
b0a5520
all pre-commit check to CI
learning-chip Mar 14, 2026
07da842
minor syntax
learning-chip Mar 14, 2026
c5eba2e
move pyproject.toml to root dir to enable easy pip install from git
learning-chip Mar 14, 2026
16e3c7e
add docker links
learning-chip Mar 14, 2026
b9b0c4a
black
learning-chip Mar 14, 2026
0a1c031
update repo links in matmul guide
learning-chip Mar 14, 2026
577097c
polish README for 0.1.0 release
learning-chip Mar 14, 2026
0f5255b
add comparison to pypto
learning-chip Mar 14, 2026
bf107ec
move framework comparison to bottom
learning-chip Mar 15, 2026
8a64455
minimum agent skills to translate PTO-ISA cpp to PTO-DSL python (#89)
learning-chip Mar 16, 2026
0b53c8d
Python version of fast-inverse-trick (#90)
learning-chip Mar 16, 2026
87617a9
ignore artifacts
learning-chip Mar 17, 2026
a94bd08
Enable auto-sync for fast inverse example (#91)
learning-chip Mar 17, 2026
2af0bc7
use on-the-fly clone for external references
Mar 17, 2026
2fa7152
collect fast_inverse as translation example
Mar 17, 2026
1248fab
fix subdir
Mar 17, 2026
9c84cfd
User-friendly guide for fast-inverse-trick in Python (#93)
learning-chip Mar 20, 2026
4526d42
remove old ignore
learning-chip Mar 20, 2026
201ffbb
Add example translation collection check to CI (#94)
learning-chip Mar 20, 2026
1479a8f
mrgsort and sort32 dynamic multicore test + TopK semidynamic example …
MirkoDeVita98 Mar 26, 2026
b836487
Free up resources in examples and other fixes (#97)
vloncar Mar 26, 2026
ef996b0
feat(frontend): add mxfp8 helpers and examples
Mar 29, 2026
1eef916
Merge remote-tracking branch 'origin/main'
Mar 29, 2026
49cefb3
Add A5 PTODSL library and micro coverage
Mar 30, 2026
52523fe
Fix CI and constexpr exports for PTODSL A5 PR
Mar 31, 2026
176 changes: 176 additions & 0 deletions .agent/skills/translate_cpp2py/SKILL.md
@@ -0,0 +1,176 @@
---
name: translate-cpp2py
description: Translate manual PTO-ISA C++ kernels into PTO-DSL Python builders and verification harnesses. Use when converting pto-isa kernel code to ptodsl, generating .pto/.cpp via ptoas, handling manual vs auto sync variants, separating vector vs cube APIs, or adding missing ptodsl API wrappers.
---

# Translate PTO-ISA C++ to PTO-DSL

## Scope

This skill converts a manually written PTO C++ kernel into:
- a **manual-sync** PTO-DSL Python builder (must mirror source C++ behavior),
- an **auto-sync** PTO-DSL variant (same math/control flow, sync removed),
- generated `.pto` and `.cpp`,
- launcher and runtime correctness test scripts.

Primary references are under `references/example_translation`. Only consult long compiler/dialect sources when mapping is missing.

## Required Outputs Per Translation Task

Produce all of the following unless user asks otherwise:
- Python builder for **manual-sync** kernel.
- Python builder for **auto-sync** kernel.
- Compile scripts:
  - manual: `python builder.py > kernel.pto && ptoas kernel.pto -o kernel.cpp`
  - auto: `python builder.py > kernel.pto && ptoas --enable-insert-sync kernel.pto -o kernel.cpp`
- `caller.cpp` kernel launcher with correct ABI and launch geometry.
- `run_*.py` load-and-test script to validate numerical correctness.
- `README.md` with minimal usage commands (compile + run + optional bench), following concise style used in `examples/aot/*/README.md`.
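
The two compile recipes differ only in the `--enable-insert-sync` flag, so a build script can share one helper. The following is a minimal sketch, assuming `builder.py` prints the `.pto` IR to stdout as in the commands above and that `ptoas` is on `PATH`; the helper names `ptoas_command` and `compile_kernel` are hypothetical and not part of ptodsl.

```python
import subprocess
import sys
from pathlib import Path


def ptoas_command(pto_path, cpp_path, auto_sync=False):
    """Return the ptoas argv for the manual-sync or auto-sync recipe."""
    cmd = ["ptoas"]
    if auto_sync:
        # Auto-sync: ask the compiler to insert synchronization.
        cmd.append("--enable-insert-sync")
    cmd += [str(pto_path), "-o", str(cpp_path)]
    return cmd


def compile_kernel(builder="builder.py", stem="kernel", auto_sync=False):
    """Equivalent of `python builder.py > kernel.pto && ptoas ... -o kernel.cpp`."""
    pto_path, cpp_path = Path(f"{stem}.pto"), Path(f"{stem}.cpp")
    with pto_path.open("w") as out:
        # Redirect the builder's stdout into the .pto file.
        subprocess.run([sys.executable, builder], stdout=out, check=True)
    subprocess.run(ptoas_command(pto_path, cpp_path, auto_sync), check=True)
    return cpp_path
```

Calling `compile_kernel(auto_sync=True)` on an auto-sync builder and `compile_kernel()` on the manual-sync builder keeps the two variants on the same code path, with the flag as the only difference.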

## Non-Negotiable Rules

1. Input C++ is manual-sync by default. Port to manual-sync Python first.
2. Then create auto-sync variant by removing explicit sync APIs and compiling with `--enable-insert-sync`.
3. Preserve ABI exactly: function name, argument order/types, launch contract.
4. Match section type exactly: vector (`__DAV_VEC__`) vs cube (`__DAV_CUBE__`).
5. Prefer compact Python; preserve semantics, not C++ verbosity.
6. If wrapper is missing in `ptodsl/api`, add it instead of forcing awkward translation.
7. First check whether the directory `references/example_translation` is empty or contains too few examples. If it is, ask the user to run `scripts/collect_example_translate.py` to generate the full Python-C++ mapping examples.


## Translation Workflow

1. **Classify kernel**
   - Determine section: vector vs cube.
   - Determine sync style: manual vs auto (source C++ is manual).
   - Identify core partitioning pattern (block/subblock/batch split).

2. **Rebuild signature + metadata first**
   - Define `meta_data()` with scalar/index/pointer/tensor/subtensor/tile types.
   - Use `@to_ir_module(meta_data=meta_data)`.
   - Keep argument order identical to C++ kernel ABI.

3. **Port runtime control flow**
   - Use `pto.range`, `pto.if_context`, `pto.cond` for runtime logic.
   - Keep all tail guards and truncation branches.

4. **Port data movement + tile math**
   - Build tensors via `pto.as_tensor`.
   - Create subviews with `pto.slice_view`.
   - Allocate tiles with `pto.alloc_tile`.
   - Map load/store/compute ops 1:1 (see mapping rules below).

5. **Handle synchronization**
   - Manual variant: keep explicit event/barrier calls.
   - Auto variant: remove manual sync calls, keep op order, compile with insert-sync pass.

6. **Generate and verify round-trip**
   - Emit `.pto`, compile to `.cpp`, and sanity-check structural equivalence.
   - Build `.so` with `caller.cpp`.
   - Run Python test script against reference (`torch` or equivalent).

## Sync Modes (Must Explain in Every Task)

- **Manual sync mode**
  - Python uses explicit sync APIs in `ptodsl/api/synchronization.py`.
  - Typical APIs: `record_event`, `wait_event`, `record_wait_pair`, `barrier`.
  - Compile with plain `ptoas` (no `--enable-insert-sync`).
  - Use for direct mirroring of manual C++ or for hand-tuned pipelines.

- **Auto sync mode**
  - Remove explicit sync APIs from Python DSL.
  - Compile with `ptoas --enable-insert-sync`.
  - Compiler inserts hazard-handling synchronization.
  - Use for simpler maintainable variant with same algorithmic behavior.

Rule of thumb: one kernel variant should use one sync strategy only.
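
The rule of thumb can be checked mechanically before compiling. Below is a rough sketch, assuming the builder is available as source text and that the four API names listed above are the only explicit sync calls of interest; `manual_sync_calls` and `check_sync_strategy` are hypothetical helper names, and the text scan is deliberately crude (it would also match calls inside comments or strings).

```python
import re

# Explicit sync API names listed above as typical for manual-sync mode.
MANUAL_SYNC_APIS = ("record_event", "wait_event", "record_wait_pair", "barrier")
_SYNC_CALL = re.compile(r"\b(" + "|".join(MANUAL_SYNC_APIS) + r")\s*\(")


def manual_sync_calls(builder_source):
    """Return the sorted, de-duplicated sync API names called in the source text."""
    return sorted(set(_SYNC_CALL.findall(builder_source)))


def check_sync_strategy(builder_source, auto_sync):
    """Raise if a builder meant for auto-sync still contains explicit sync calls."""
    calls = manual_sync_calls(builder_source)
    if auto_sync and calls:
        raise ValueError(f"auto-sync builder still calls: {', '.join(calls)}")
    return calls
```

Running `check_sync_strategy(source, auto_sync=True)` on the auto-sync builder catches leftover manual sync calls that would otherwise mix the two strategies in one variant.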

## Vector vs Cube Section/API Boundaries

- **Vector kernels**
  - Use `with pto.vector_section():`
  - Lowers to `#if defined(__DAV_VEC__)`.
  - Typical ops: elementwise/reduction/vector dataflow (`tile.add/sub/mul/div/...`).

- **Cube kernels**
  - Use `with pto.cube_section():`
  - Lowers to `#if defined(__DAV_CUBE__)`.
  - Typical ops: matrix engines (`tile.matmul`, `tile.matmul_acc`, `tile.matmul_bias`).

- **API surface filtering**
  - Vector-only example: `tile.add` in `ptodsl/api/tile.py`.
  - Cube-only example: `tile.matmul` in `ptodsl/api/tile.py`.
  - Keep agent search narrow: choose section first, then look only at relevant API family.
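
Because each section lowers to a distinct preprocessor guard, the section type of a generated `.cpp` can be sanity-checked by looking for those macros. This is a small sketch that assumes the guards appear literally in the generated text as described above; `detect_sections` and `section_matches` are hypothetical helper names.

```python
def detect_sections(cpp_source):
    """Report which section guards appear in generated C++ text."""
    found = set()
    if "__DAV_VEC__" in cpp_source:
        found.add("vector")
    if "__DAV_CUBE__" in cpp_source:
        found.add("cube")
    return found


def section_matches(cpp_source, expected):
    """True if the generated code contains exactly the expected section type."""
    return detect_sections(cpp_source) == {expected}
```

A round-trip check can then assert `section_matches(generated_cpp, "vector")` for a kernel whose source C++ used `__DAV_VEC__`.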

## Compact Mapping Rules (Python -> C++)

1. `@to_ir_module` function -> emitted `__global__ AICORE void ...`.
2. `PtrType(dtype)` -> C++ GM pointer arg type.
3. `TensorType/SubTensorType` + `as_tensor/slice_view` -> `GlobalTensor` objects/views.
4. `TileBufType(memory_space=...)` + `alloc_tile` -> tile declarations in corresponding memory space.
5. `pto.get_block_idx/get_block_num/get_subblock_idx/get_subblock_num` -> runtime core/subcore intrinsics.
6. `s.const/s.index_cast/s.ceil_div/s.select/min` -> scalar arithmetic + branch/select expressions.
7. `pto.range(...)` -> runtime loop in IR/C++.
8. Python `range(...)` -> build-time unroll/metaprogramming.
9. `pto.if_context(...)` / `pto.cond(...)` -> runtime conditional branches.
10. Python `if` -> build-time branch while constructing IR.
11. `pto.load` / `pto.store` -> load/store tile movement ops.
12. `tile.add/sub/mul/div/relu/exp/...` -> corresponding PTO compute intrinsics.
13. `tile.matmul*` family -> cube matmul intrinsics.
14. Multicore distribution usually maps via:
    - vector core id = `block_idx * subblock_num + subblock_idx` (there are twice as many vector cores as cube cores, so `subblock_num` equals 2)
    - tiles per core = ceil-div(total tiles, total cores)
    - guarded tail processing for final core(s).
15. Dynamic-shape kernels require explicit bound guards before slicing/loading/storing.
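
The multicore distribution in rule 14 can be modeled on the host in plain Python to see which tiles each core owns. This is an illustrative sketch of the arithmetic only, not ptodsl code; the function names are hypothetical, and the clamping with `min` stands in for the guarded tail processing.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive b."""
    return -(-a // b)


def vector_core_id(block_idx, subblock_idx, subblock_num=2):
    """Flatten (block, subblock) into a single vector core id."""
    return block_idx * subblock_num + subblock_idx


def tile_range(core_id, total_tiles, total_cores):
    """Tiles owned by one core; the final core(s) get a short or empty range."""
    per_core = ceil_div(total_tiles, total_cores)
    start = min(core_id * per_core, total_tiles)  # guard: core past the end
    end = min(start + per_core, total_tiles)      # guard: partial tail
    return range(start, end)


# Example: 4 blocks x 2 subblocks = 8 vector cores sharing 13 tiles.
total_cores = 4 * 2
assignment = {
    vector_core_id(b, s): list(tile_range(vector_core_id(b, s), 13, total_cores))
    for b in range(4)
    for s in range(2)
}
```

With 13 tiles on 8 cores, each core is assigned up to 2 tiles, core 6 gets a single tail tile, and core 7 gets none, which is the case the tail guards must handle.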

## Runtime Semantics Reminder (Critical)

PTO-DSL is Python tracing, not AST rewriting:
- Python-native `if`/`for` executes at build time, similar to C++ compile-time metaprogramming or loop unrolling.
- Only `pto.range`, `pto.if_context`, and `pto.cond` represent runtime control flow in the generated kernel.

Never translate runtime C++ control logic into Python-native `if/range` by mistake.
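
The difference between build-time and runtime loops can be seen with a toy tracer. The `ToyBuilder` class below is a hypothetical stand-in written only for illustration; it is not the ptodsl API and makes no claim about how `pto.range` is actually invoked. It only shows why a Python-native loop ends up unrolled in the recorded IR while a traced loop stays a single loop.

```python
from contextlib import contextmanager


class ToyBuilder:
    """Toy stand-in for a tracing IR builder (NOT the ptodsl API)."""

    def __init__(self):
        self.ops = []

    def emit(self, op):
        # Record one op in the order the Python code executes.
        self.ops.append(op)

    @contextmanager
    def runtime_range(self, n):
        # Emit one loop op; the body below is traced exactly once.
        self.emit(f"for 0..{n} {{")
        yield
        self.emit("}")


unrolled = ToyBuilder()
for _ in range(3):              # Python loop: runs while building the IR
    unrolled.emit("add")

looped = ToyBuilder()
with looped.runtime_range(3):   # traced loop: stays a loop in the IR
    looped.emit("add")

print(unrolled.ops)  # ['add', 'add', 'add']
print(looped.ops)    # ['for 0..3 {', 'add', '}']
```

If the trip count is only known at kernel launch, the unrolled form cannot express it at all, which is why runtime C++ loops must map to the traced constructs.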

## Missing API Wrapper Protocol

If required C++ op has no convenient Python wrapper:

1. Add thin wrapper in the right module:
   - tile/instruction ops -> `ptodsl/api/tile.py`
   - general tensor/control helpers -> `ptodsl/api/pto_general.py`
   - sync helpers -> `ptodsl/api/synchronization.py`
2. Re-export through `ptodsl/api/pto.py` when needed.
3. Keep wrapper minimal: pass through to MLIR Python binding op with light argument normalization.

## Escalation Path (Only When Mapping Is Missing)

Work inside `references/external_repo` and check in this order:
1. Clone the `PTOAS` and `pto-isa` repos
2. Check Dialect op definitions: `PTOOps.td` in `PTOAS` repo
3. C++ codegen lowering: `PTOToEmitC.cpp` in `PTOAS` repo
4. ISA semantics: `pto-inst.hpp` in `pto-isa` repo

If an op exists in the dialect but is not lowered in `PTOToEmitC.cpp`, translation requires PTOAS compiler work (not only DSL wrapper work).
In this case, suggest filing an issue with the PTOAS project (https://github.com/zhangstevenunity/PTOAS).

## Round-Trip Verification Checklist

- [ ] Manual-sync Python version created first and compiles with plain `ptoas`.
- [ ] Auto-sync variant created and compiles with `--enable-insert-sync`.
- [ ] Generated C++ keeps ABI/section/loop/tail semantics.
- [ ] Launcher `caller.cpp` matches kernel symbol and launch parameters.
- [ ] Test script loads `.so`, runs multiple shapes (including tail/non-divisible cases), compares against trusted reference.
- [ ] If multicore kernel: test cases include shapes not multiples of core count.
- [ ] `README.md` documents the exact local commands to compile and run verification.
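
To cover the tail and non-divisible cases in the checklist, the test script can derive its problem sizes from the tile size and core count instead of hard-coding them. This is a minimal sketch under the assumption of a 1-D problem split into fixed-size tiles; `tail_test_sizes` is a hypothetical helper, and the particular choice of sizes is only one reasonable option.

```python
def tail_test_sizes(tile_elems, core_count):
    """Pick 1-D problem sizes around tile and core-count boundaries."""
    full = tile_elems * core_count  # one full tile per core
    candidates = {
        1,                               # smaller than a single tile
        tile_elems - 1,                  # just under one tile
        tile_elems,                      # exactly one tile (fewer tiles than cores)
        tile_elems + 1,                  # one tile plus a 1-element tail
        full - 1,                        # last core gets a partial tile
        full,                            # evenly divisible baseline
        full + 1,                        # tile count not a multiple of core count
        3 * full + tile_elems // 2,      # several rounds plus a partial tile
    }
    return sorted(n for n in candidates if n > 0)
```

Each size returned here would be run through the kernel and compared against the trusted reference, so that both the per-tile tail guard and the per-core tail guard are exercised.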

## Reference Priority

Use these first:
- `references/example_translation/**` (primary mapping corpus)
- `references/example_translation/fast_hadamard/**` (manual vs auto sync pair)
- `references/example_translation/batch_matmul/**` (cube kernels)
- `examples/aot/elementwise/add_dynamic_multicore/*` (caller/test/build pattern)
- `examples/aot/matmul_optimization_guide/matmul_optim_guide.md` (sync and runtime-control semantics)

Consult `references/external_repo/**` only for patterns not covered by examples.
@@ -0,0 +1 @@
*
21 changes: 21 additions & 0 deletions .agent/skills/translate_cpp2py/references/external_repo/README.md
@@ -0,0 +1,21 @@
This directory holds the third-party repos that are used internally by PTO-DSL:
- https://github.com/zhangstevenunity/PTOAS: implements the "ptoas" command line tool, the PTO MLIR dialect and its Python bindings, and the InjectSync pass that inserts set_flag/wait_flag for "auto-sync" mode. Important files are:
  - `PTOAS/include/PTO/IR/PTOOps.td` defines the MLIR PTO dialect
  - `PTOAS/python/pto/dialects/pto.py` has low-level Python wrappers of the PTO MLIR Python binding (more Pythonic wrappers are in the pto-dsl package)
  - `PTOAS/lib/PTO/Transforms/PTOToEmitC.cpp` is the compile pass that converts `*.pto` IR to C++ source code based on PTO-ISA headers.
- https://gitcode.com/cann/pto-isa: header-only library that defines the C++ APIs of PTO-ISA. It is the target API set for the `PTOToEmitC` pass in PTOAS. Important files are:
  - `pto-isa/include/pto/common/pto_instr.hpp` the top-level interface
  - `pto-isa/include/pto/common/*` common type definitions
  - `pto-isa/include/pto/npu/a2a3/*` implementation for current hardware (used in current pto-dsl examples)
  - `pto-isa/include/pto/npu/a5/*` implementation for next-generation hardware (not used in current pto-dsl examples)

This directory is empty by default, and the repos should be cloned on-the-fly when the agent needs to access extra context.

For difficult tasks that require looking into the PTOAS and pto-isa repos, the agent or user can clone them with:

```bash
git clone https://github.com/zhangstevenunity/PTOAS.git
git clone https://gitcode.com/cann/pto-isa.git
```

Remind the user to check that the commit IDs of PTOAS and pto-isa match the test environment (usually a pre-built docker image), to avoid a mismatch between the context and the real execution.