[codex] Add A5 PTODSL library and micro coverage by zhoubot · Pull Request #2 · PTO-ISA/pto-dsl

zhoubot · 2026-03-30T13:20:17Z

Summary

Add a new ptodsl.lib.a5 package that rewrites the public PTO tile surface into Pythonic PTODSL helpers backed by PTO micro instructions where possible.

What Changed

* add ptodsl.lib.a5 with A5-oriented helpers, generated .pto samples, and a small README
* add constexpr-specialized micro lowerings for elementwise, broadcast, reduction, sort32, mrgsort, and indexed gather
* add A5-style validation checks so PTODSL rejects invalid dtype/shape combinations early
* add a checked-in tile-to-micro coverage checklist and generator/update scripts
* add regression coverage for emitted IR and checklist sync
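To illustrate what "rejects invalid dtype/shape combinations early" can look like, here is a minimal sketch of a pre-lowering check. Everything in it is hypothetical: `TileSpec`, `SUPPORTED_DTYPES`, and `check_elementwise` are illustrative names, and the dtype table is an assumption, not the set of A5 constraints this PR encodes.

```python
from dataclasses import dataclass

# Hypothetical rule table: which element types an op accepts.
# These entries are illustrative assumptions, not the PR's actual rules.
SUPPORTED_DTYPES = {
    "add": {"f16", "f32", "i32"},
    "sort32": {"f16", "f32"},
}


@dataclass(frozen=True)
class TileSpec:
    """Minimal stand-in for a tile description (hypothetical)."""

    dtype: str
    rows: int
    cols: int


def check_elementwise(op: str, lhs: TileSpec, rhs: TileSpec) -> None:
    """Raise before any IR is emitted if the operands cannot be combined."""
    allowed = SUPPORTED_DTYPES.get(op)
    if allowed is None:
        raise ValueError(f"unknown op: {op!r}")
    if lhs.dtype != rhs.dtype:
        raise TypeError(f"{op}: dtype mismatch ({lhs.dtype} vs {rhs.dtype})")
    if lhs.dtype not in allowed:
        raise TypeError(f"{op}: unsupported dtype {lhs.dtype}")
    if (lhs.rows, lhs.cols) != (rhs.rows, rhs.cols):
        raise ValueError(
            f"{op}: shape mismatch ({lhs.rows}x{lhs.cols} vs {rhs.rows}x{rhs.cols})"
        )
```

Running the check at helper-call time means a bad combination fails in Python with a readable message, rather than later during lowering or compilation.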

Coverage

Current tile micro coverage in this PR:

* implemented: 26
* partial: 1 (gather mask-pattern form still depends on vsqz, which is not exposed in the current PTO micro Python surface)
* blocked: 4 (matmul, matmul_bias, matmul_acc, extract)
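As a rough sketch of how such a summary could be derived from a checklist, the snippet below tallies entries by status. Only the counts, the partial `gather` entry, and the four blocked op names come from the description above; the names of the 26 implemented ops are not listed there, so placeholders are used, and `summarize` is a hypothetical helper.

```python
from collections import Counter

# Placeholder names for the implemented ops: the PR description gives only the count.
checklist = {f"implemented_op_{i:02d}": "implemented" for i in range(26)}
checklist["gather"] = "partial"  # mask-pattern form still depends on vsqz
for name in ("matmul", "matmul_bias", "matmul_acc", "extract"):
    checklist[name] = "blocked"


def summarize(entries):
    """Count checklist entries per status."""
    return Counter(entries.values())


summary = summarize(checklist)
total = sum(summary.values())
for status in ("implemented", "partial", "blocked"):
    print(f"{status}: {summary[status]}")
print(f"total: {total}")
```

With these numbers the checklist covers 31 tile ops in total, of which 26 are fully lowered to micro instructions.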

Validation

PYTHONPATH=... python -m pytest -q tests/regression
result: 30 passed in 0.35s
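A checklist-sync regression of the kind listed under What Changed can be pictured as "regenerate the checklist and compare it with the checked-in text". The sketch below assumes a simple Markdown table format; `render_checklist` and `is_in_sync` are hypothetical names, and the PR's actual generator script and file layout are not shown here.

```python
from typing import Mapping


def render_checklist(entries: Mapping[str, str]) -> str:
    """Render entries deterministically (sorted by op name) as a Markdown table."""
    lines = ["| op | status |", "| --- | --- |"]
    for name in sorted(entries):
        lines.append(f"| {name} | {entries[name]} |")
    return "\n".join(lines) + "\n"


def is_in_sync(entries: Mapping[str, str], checked_in_text: str) -> bool:
    """True when the checked-in file matches what the generator would emit now."""
    return render_checklist(entries) == checked_in_text


# Example: a stale checked-in file is detected after a status changes.
entries = {"matmul": "blocked", "gather": "partial"}
checked_in = render_checklist(entries)
updated = dict(entries, gather="implemented")
print(is_in_sync(entries, checked_in))   # file matches the current state
print(is_in_sync(updated, checked_in))   # file is stale and needs regeneration
```

Sorting the entries keeps the rendered text stable across runs, so a plain string comparison is enough to detect drift.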

Impact

This gives PTO-DSL a usable A5 library layer, generated .pto artifacts, and an observable micro-coverage contract for future parity work.

…awei-csl#52) * wip: fast hadamard builder * more closely match cpp ref * caller and execution script * default to manual-sync * extra barrier * disable `pto.barrier("STORE_VEC")` for now due to hw-native-sys/PTOAS#187 * do not early-exit * print full mismatch msg * allocate double buffer * extra sync between VEC and MTE3 * try refactor to bulk load-store just like manual cpp ref * add one missing sync * fix subset view indexing * add missing sync again * Revert "add missing sync again" This reverts commit ba25329. * Revert "fix subset view indexing" This reverts commit 9e3316f. * add C++ fast hadamard reference * change compile option fixes run-time error, why? * fallback to legacy `--cce-soc-core-type=VecCore` * commit hadamard_manual_sync.cpp for record * fix early-exit logic in Python builder * update generated cpp and run log * fix n_half * Revert "fix n_half" This reverts commit 49cd872. * syntax sugar * fallback to loading one sample at a time, unit test now all pass * test batch 29 to span over two rounds * remove --test-both option * remove cpp references for a clean PR

* split-out auto vs manual sync branches, so that default auto-sync builder looks clean * remove old run log * reuse code across if and else branches * fix bracket * remove temporary `num_blocks` var * simplify lib path * move kernel side dim check to launch time, to reduce indentation * flatten module builder, do not use closure

* added rowsum dynamic multicore tests for fp32 * added more rowsum tests and removed unnecessary checks in builder * use more general batch sizes --------- Co-authored-by: mirkodevita <[email protected]> Co-authored-by: learning-chip <[email protected]>

…i-csl#55) * WIP: dsl * feat: simplify * feat: add finer grained set/wait flags for more pipeline overlap * fix: func signature compatibility with caller * feat: fp16xfp16->fp16 matmul with fp32 acc tiles * feat: A/C are now 3d tensors * fix: typo * feat: pre-load B * WIP: double buffer * feat: cleanup * wip: not working * feat: add cpp double buf code for debug * fix: typos * feat: remove extra sync in bench * feat: add double buf cpp ref * feat: convert dsl to fp16 * feat: add modulus and support multiple event_ids * feat: double buffer matmul pto-dsl * fix: remove cpp * feat: simplify * feat: add manual sync option to decorator * Revert "feat: add manual sync option to decorator" This reverts commit 8ccbf68. * fix: duplicate and typing * fix: type check * feat: move to new directory * fix: remove sync param * feat: add run script * refactor: cleanup cpp ref * feat: add caller and readme * feat: verify and benchmark, add barrier * feat: add optional dev dependency matplotlib * fix: only plot when specified * fix: only use the signed remainder ops

* refactor ptodsl module namespaces * fix lazy-import * module clean-up * comment on the need to lazy-eval * pto.subset -> tile.subset * further split pto.py file * multi-level dir layout * avoid overwriting sys.modules * lazy-import * pto.for_range -> pto.range

* port the latest minimum code from huawei-csl#70 * fix missing apis in ptodsl package * temporarily add artifacts * make swizzle count run-time parameter * update generated cpp and IR * expose swizzle_direction also as run-time parameter * fix MLIR verify * update generated ir and cpp * style compact * move nested functions out of main kernel function * note on SSA dominance * ignore compile artifacts * benchmark scripts * revert to older working builder * use more informative function name * update ptoas version * move level1_loop_mn_dynamic_tilesize to outside of main function * move out swizzle_zn and swizzle_nz util functions * plot swizzle speed-up and FLOPS ratio * remove fraction plot * ignore benchmark artifacts * reduce line counts * TODO on uni-tile-size demo * add auto-sync version of matmul swizzle * Revert "add auto-sync version of matmul swizzle" This reverts commit 446bee3. --------- Co-authored-by: jiawei_zhuang <[email protected]>

…l sync performance. (huawei-csl#73) * simpler matmul demo * remove swizzle param from simple demo * remove unused var * inline level1_loop_mn function * use one TensorType declare * rename level2_loop_k, put comments * add auto-sync variant * correct compile auto-sync version * ignore artifacts * performance benchmark * simple static swizzle

…ditable install by CI (huawei-csl#77) * fix pip install problem without -e * update CI to run pip install both with or without -e * more clear CI job naming --------- Co-authored-by: jiawei_zhuang <[email protected]>

…ei-csl#79) * remove unused event id * even simpler single-buffer matmul, compare to double buf perf * fix print * more explain in comments * flatten nested loop * correct flop ratio print * correct print * also compare with non swizzle * re-structure to step-by-step optimization * output artifacts dir * refactor: split IR builder into 4 steps * draft optimization guide * add pto syntax explain * add numpy simulation code for step-1 * print swizzle grid in numpy * rename tutorial directory --------- Co-authored-by: jiawei_zhuang <[email protected]>

* ignore artifacts * remove unused script * rename script * remove _256 suffix * rename * rename in guide * improve numpy emulation * better size print, update guide * Manual draft of optimization guide and figures * move early version to experimental dir * move up dir * rename dir to matmul optimization guide * link to optim guide * 90% finish * remove old guide * consistent title * font * move * remove copy * Note on frontend * msprof link * font * swizzle explain * smaller img * plot FLOPs * unify figure path * fix figure suffix * larger font * remark on manual sync * ignore artifacts * add FLOPs plots * fix typo * fix typo * fix typo * fix single-buffer pipeline figure * global grammar and typo clean-up * more explain on swizzle * move figures * typo * update frontend note * typo --------- Co-authored-by: jiawei_zhuang <[email protected]>

…ei-csl#89) * minimum agent skills to translate PTO-ISA cpp to PTO-DSL python * move example generation to Non-Negotiable Rules * more explicit rule for ref example checking * pre-commit run --all-files --------- Co-authored-by: jiawei_zhuang <[email protected]>

* update PTOAS to https://github.com/huawei-csl/PTOAS/releases/tag/0.8 * temporarily commit all example translations (will remove before merging PR) * one-shot translation to python prompt: Translate @kernel_tri_inv_trick.cpp and test script @test_tri_inv_trick.py to @fast_inverse using skill @SKILL.md * force docker rebuild * temporary add generated IR and Cpp (remove before merge) * fix bisheng compile * use newer pto-isa version to avoid bisheng error on TMOV * do not early-exit on fail * use a known working version (padding + write to GM) * add TODO for TMOV * update PTO-ISA header * fix TMOV mismatch * fix tmov order * update generated cpp * fix TMOV type mismatch * update generated cpp * fix caller dtype mismatch * try fix manual sync * inline spill_acc_to_mat * remove translation references * try support smaller blocks like 64x64 * update generated cpp * try build all shapes * fix context * remove dynamic valid shape * note on auto-sync bug * remove artifacts * change error to warning, and use larger ftol * build_artifacts dir * ignore build_artifacts * remove unused MAX_MATRIX_SIZE = 128 * run pre-commit run --all-files --------- Co-authored-by: jiawei_zhuang <[email protected]>

* update PTOAS to https://github.com/zhangstevenunity/PTOAS/releases/tag/v0.9 * auto-sync now works * test more general block-diag size * separate out manual vs auto functions * clean up * precommit check * loop count should cover larger block size * precommit check --------- Co-authored-by: jiawei_zhuang <[email protected]>

learning-chip and others added 30 commits March 4, 2026 10:23

update ptoas to https://github.com/huawei-csl/PTOAS/releases/tag/20260304

bbe4721

Geglu Dynamic Multicore (huawei-csl#54)

03525e5

* added dynamic multicore geglu example * added bench geglu and readme for validate examples --------- Co-authored-by: mirkodevita <[email protected]>

Performance measurement and tuning for fast hadamard python example (huawei-csl#62)

0475779

* clean-up * perf benchmark * switch to `--npu-arch=dav-2201` and adjust builder accordingly. * plot bandwidth (script taken from cpp ref)

fix vector index counts for other vector examples (huawei-csl#64)

bf95136

* fix vector index counts for dynamic add * fix geglu vector index count

elementwise unary fp16 fp32 dynamic multicore tests (huawei-csl#67)

c9d17b0

* added elementwise unary tests for fp16 and fp32 * reduce test case numbers * register require_npu marker --------- Co-authored-by: mirkodevita <[email protected]> Co-authored-by: learning-chip <[email protected]>

update ptoas to https://github.com/huawei-csl/PTOAS/releases/tag/20260307

7fa663e

feat: add TPRINT and printf ops and example (huawei-csl#68)

80c4cb5

* feat: add print tile example * feat: add builder file that will be fixed in future * feat: make compile script more robust, dsl now works * feat: remove cpp file * fix: compile script * fix: reorder print * feat: add printf

add matplotlib and pandas to docker image (commonly used by benchmark scripts)

f646f94

added min and max for tiles with tests (huawei-csl#78)

8875bca

Co-authored-by: mirkodevita <[email protected]>

added colsum dynamic multicore test and merged to rowsum

9a41ef4

wip: generalization of reduce ops, missing colmin, colmax and rowprod and colprod

1c0bec8

working col min and col max

0adadc9

rename to reduce tests

132e349

added reduce op to list for import visibility

838492e

Merge pull request huawei-csl#80 from huawei-csl/rowsum

7f8176a

row/col reduce op dynamic multicore tests

added row/col expand dynamic multicore tests

7800e14

reduce number of tests for expand

0956c8e

Merge pull request huawei-csl#81 from huawei-csl/expand_ops

3f0860b

Expand col and row (with expand div, sub, mul) dynamic multicore tests

add Chinese version of matmul optim guide

45427a4

re-org example dir to multi-level

867f5f2

learning-chip and others added 26 commits March 14, 2026 07:35

run pre-commit run --all-files

710a552

all pre-commit check to CI

b0a5520

minor syntax

07da842

move pyproject.toml to root dir to enable easy pip install from git

c5eba2e

add docker links

16e3c7e

black

b9b0c4a

update repo links in matmul guide

0a1c031

polish README for 0.1.0 release

577097c

add comparison to pypto

0f5255b

move framework comparison to bottom

bf107ec

ignore artifacts

87617a9

use on-the-fly clone for external references

2af0bc7

collect fast_inverse as translation example

2fa7152

fix subdir

1248fab

User-friendly guide for fast-inverse-trick in Python (huawei-csl#93)

9c84cfd

Cleans up huawei-csl#90 huawei-csl#91 for human readers.

remove old ignore

4526d42

Add example translation collection check to CI (huawei-csl#94)

201ffbb

* Add example translation collection check to CI * fix ci path * update ptoas version in CI --------- Co-authored-by: jiawei_zhuang <[email protected]>

mrgsort and sort32 dynamic multicore test + TopK semidynamic example (huawei-csl#96)

1479a8f

* mrgsort and sort32 dynamic multicore tests and semidynamic topk example * updated README for topk example --------- Co-authored-by: mirkodevita <[email protected]>

Free up resources in examples and other fixes (huawei-csl#97)

b836487

feat(frontend): add mxfp8 helpers and examples

ef996b0

Merge remote-tracking branch 'origin/main'

1eef916

# Conflicts: # docker/Dockerfile

Add A5 PTODSL library and micro coverage

49cefb3

Fix CI and constexpr exports for PTODSL A5 PR

52523fe

zhoubot marked this pull request as ready for review March 31, 2026 01:21

zhoubot merged commit e1f267b into main Mar 31, 2026
5 checks passed

zhoubot deleted the codex/ptodsl-a5-lib branch March 31, 2026 01:27

zhoubot merged 56 commits into main from codex/ptodsl-a5-lib


5 participants
