[codex] Add A5 PTODSL library and micro coverage#2

Merged
zhoubot merged 56 commits into main from codex/ptodsl-a5-lib
Mar 31, 2026

@zhoubot zhoubot commented Mar 30, 2026

Summary

Add a new ptodsl.lib.a5 package that rewrites the public PTO tile surface into Pythonic PTODSL helpers backed by PTO micro instructions where possible.

What Changed

  • add ptodsl.lib.a5 with A5-oriented helpers, generated .pto samples, and a small README
  • add constexpr-specialized micro lowerings for elementwise, broadcast, reduction, sort32, mrgsort, and indexed gather
  • add A5-style validation checks so PTODSL rejects invalid dtype/shape combinations early
  • add a checked-in tile-to-micro coverage checklist and generator/update scripts
  • add regression coverage for emitted IR and checklist sync
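To illustrate what "rejecting invalid dtype/shape combinations early" could look like, here is a minimal sketch of a pre-lowering check. It does not use the real `ptodsl.lib.a5` API: `TileSpec`, `check_elementwise`, the supported dtype set, and the 32-byte row alignment are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass

# Hypothetical constraints for illustration only; not taken from the A5
# hardware spec or from ptodsl.lib.a5.
SUPPORTED_DTYPES = {"f16", "f32"}
DTYPE_BYTES = {"f16": 2, "f32": 4}
ROW_ALIGN_BYTES = 32


@dataclass(frozen=True)
class TileSpec:
    """Hypothetical static description of a tile: element type and 2-D shape."""
    dtype: str
    rows: int
    cols: int


def check_elementwise(lhs: TileSpec, rhs: TileSpec) -> None:
    """Raise ValueError before any IR is emitted if the operands are invalid."""
    for tile in (lhs, rhs):
        if tile.dtype not in SUPPORTED_DTYPES:
            raise ValueError(f"unsupported dtype {tile.dtype!r}")
        if tile.rows <= 0 or tile.cols <= 0:
            raise ValueError(f"non-positive shape {(tile.rows, tile.cols)}")
        if (tile.cols * DTYPE_BYTES[tile.dtype]) % ROW_ALIGN_BYTES:
            raise ValueError(
                f"row of {tile.cols} x {tile.dtype} is not "
                f"{ROW_ALIGN_BYTES}-byte aligned"
            )
    if lhs.dtype != rhs.dtype:
        raise ValueError(f"dtype mismatch: {lhs.dtype} vs {rhs.dtype}")
    if (lhs.rows, lhs.cols) != (rhs.rows, rhs.cols):
        raise ValueError("shape mismatch between operands")


# Usage: a valid pair passes silently, an invalid pair fails at build time.
check_elementwise(TileSpec("f32", 16, 64), TileSpec("f32", 16, 64))
try:
    check_elementwise(TileSpec("f32", 16, 64), TileSpec("f16", 16, 64))
except ValueError as err:
    print(f"rejected early: {err}")
```

The point of the pattern is that the error surfaces in Python at kernel-build time, rather than later in the assembler or on the device.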

Coverage

Current tile micro coverage in this PR:

  • implemented: 26
  • partial: 1 (gather mask-pattern form still depends on vsqz, which is not exposed in the current PTO micro Python surface)
  • blocked: 4 (matmul, matmul_bias, matmul_acc, extract)
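As a rough sketch of how such a checklist can be tallied, the snippet below rebuilds the counts reported above. The status counts and the names of the partial and blocked ops come from this description; the dictionary layout, the `summarize` helper, and the placeholder names for the 26 implemented ops are assumptions made for the example and do not reflect the checked-in checklist format.

```python
from collections import Counter

# Placeholder names for the 26 implemented ops (the real names are not
# listed in this description).
checklist = {f"implemented_op_{i:02d}": "implemented" for i in range(26)}

# Named in the PR description: gather is partial (mask-pattern form needs vsqz).
checklist["gather"] = "partial"

# Named in the PR description as blocked.
for op in ("matmul", "matmul_bias", "matmul_acc", "extract"):
    checklist[op] = "blocked"


def summarize(entries):
    """Return per-status counts, the total op count, and the implemented ratio."""
    counts = Counter(entries.values())
    total = sum(counts.values())
    return counts, total, counts["implemented"] / total


counts, total, ratio = summarize(checklist)
print(dict(counts), f"total={total}", f"implemented={ratio:.1%}")
```

With these numbers the checklist tracks 31 tile ops in total, of which 26 (about 84%) are fully lowered to micro instructions.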

Validation

  • PYTHONPATH=... python -m pytest -q tests/regression
  • result: 30 passed in 0.35s

Impact

This gives PTO-DSL a usable A5 library layer, generated .pto artifacts, and an observable micro-coverage contract for future parity work.

learning-chip and others added 30 commits March 4, 2026 10:23
* added dynamic multicore geglu example

* added bench geglu and readme for validate examples

---------

Co-authored-by: mirkodevita <[email protected]>
…awei-csl#52)

* wip: fast hadamard builder

* more closely match cpp ref

* caller and execution script

* default to manual-sync

* extra barrier

* disable `pto.barrier("STORE_VEC")` for now due to hw-native-sys/PTOAS#187

* do not early-exit

* print full mismatch msg

* allocate double buffer

* extra sync between VEC and MTE3

* try refactor to bulk load-store just like manual cpp ref

* add one missing sync

* fix subset view indexing

* add missing sync again

* Revert "add missing sync again"

This reverts commit ba25329.

* Revert "fix subset view indexing"

This reverts commit 9e3316f.

* add C++ fast hadamard reference

* change compile option fixes run-time error, why?

* fallback to legacy `--cce-soc-core-type=VecCore`

* commit hadamard_manual_sync.cpp for record

* fix early-exit logic in Python builder

* update generated cpp and run log

* fix n_half

* Revert "fix n_half"

This reverts commit 49cd872.

* syntax sugar

* fallback to loading one sample at a time, unit test now all pass

* test batch 29 to span over two rounds

* remove --test-both option

* remove cpp references for a clean PR
* split-out auto vs manual sync branches, so that default auto-sync builder looks clean

* remove old run log

* reuse code across if and else branches

* fix bracket

* remove temporary `num_blocks` var

* simplify lib path

* move kernel side dim check to launch time, to reduce indentation

* flatten module builder, do not use closure
…uawei-csl#62)

* clean-up

* perf benchmark

* switch to `--npu-arch=dav-2201` and adjust builder accordingly.

* plot bandwidth (script taken from cpp ref)
* fix vector index counts for dynamic add

* fix geglu vector index count
* added rowsum dynamic multicore tests for fp32

* added more rowsum tests and removed unnecessary checks in builder

* use more general batch sizes

---------

Co-authored-by: mirkodevita <[email protected]>
Co-authored-by: learning-chip <[email protected]>
…i-csl#55)

* WIP: dsl

* feat: simplify

* feat: add finer grained set/wait flags for more pipeline overlap

* fix: func signature compatibility with caller

* feat: fp16xfp16->fp16 matmul with fp32 acc tiles

* feat: A/C are now 3d tensors

* fix: typo

* feat: pre-load B

* WIP: double buffer

* feat: cleanup

* wip: not working

* feat: add cpp double buf code for debug

* fix: typos

* feat: remove extra sync in bench

* feat: add double buf cpp ref

* feat: convert dsl to fp16

* feat: add modulus and support multiple event_ids

* feat: double buffer matmul pto-dsl

* fix: remove cpp

* feat: simplify

* feat: add manual sync option to decorator

* Revert "feat: add manual sync option to decorator"

This reverts commit 8ccbf68.

* fix: duplicate and typing

* fix: type check

* feat: move to new directory

* fix: remove sync param

* feat: add run script

* refactor: cleanup cpp ref

* feat: add caller and readme

* feat: verify and benchmark, add barrier

* feat: add optional dev dependency matplotlib

* fix: only plot when specified

* fix: only use the signed remainder ops
* added elementwise unary tests for fp16 and fp32

* reduce test case numbers

* register require_npu marker

---------

Co-authored-by: mirkodevita <[email protected]>
Co-authored-by: learning-chip <[email protected]>
* refactor ptodsl module namespaces

* fix lazy-import

* module clean-up

* comment on the need to lazy-eval

* pto.subset -> tile.subset

* further split pto.py file

* multi-level dir layout

* avoid overwriting sys.modules

* lazy-import

* pto.for_range -> pto.range
* feat: add print tile example

* feat: add builder file that will be fixed in future

* feat: make compile script more robust, dsl now works

* feat: remove cpp file

* fix: compile script

* fix: reorder print

* feat: add printf
* port the latest minimum code from huawei-csl#70

* fix missing apis in ptodsl package

* temporarily add artifacts

* make swizzle count run-time parameter

* update generated cpp and IR

* expose swizzle_direction also as run-time parameter

* fix MLIR verify

* update generated ir and cpp

* style compact

* move nested functions out of main kernel function

* note on SSA dominance

* ignore compile artifacts

* benchmark scripts

* revert to older working builder

* use more informative function name

* update ptoas version

* move level1_loop_mn_dynamic_tilesize to outside of main function

* move out swizzle_zn and swizzle_nz util functions

* plot swizzle speed-up and FLOPS ratio

* remove fraction plot

* ignore benchmark artifacts

* reduce line counts

* TODO on uni-tile-size demo

* add auto-sync version of matmul swizzle

* Revert "add auto-sync version of matmul swizzle"

This reverts commit 446bee3.

---------

Co-authored-by: jiawei_zhuang <[email protected]>
…l sync performance. (huawei-csl#73)

* simpler matmul demo

* remove swizzle param from simple demo

* remove unused var

* inline level1_loop_mn function

* use one TensorType declare

* rename level2_loop_k, put comments

* add auto-sync variant

* correct compile auto-sync version

* ignore artifacts

* performance benchmark

* simple static swizzle
…ditable install by CI (huawei-csl#77)

* fix pip install problem without -e

* update CI to run pip install both with or without -e

* more clear CI job naming

---------

Co-authored-by: jiawei_zhuang <[email protected]>
…ei-csl#79)

* remove unused event id

* even simpler single-buffer matmul, compare to double buf perf

* fix print

* more explain in comments

* flatten nested loop

* correct flop ratio print

* correct print

* also compare with non swizzle

* re-structure to step-by-step optimization

* output artifacts dir

* refactor: split IR builder into 4 steps

* draft optimization guide

* add pto syntax explain

* add numpy simulation code for step-1

* print swizzle grid in numpy

* rename tutorial directory

---------

Co-authored-by: jiawei_zhuang <[email protected]>
row/col reduce op dynamic multicore tests
Expand col and row (with expand div, sub, mul) dynamic multicore tests
* ignore artifacts

* remove unused script

* rename script

* remove _256 suffix

* rename

* rename in guide

* improve numpy emulation

* better size print, update guide

* Manual draft of optimization guide and figures

* move early version to experimental dir

* move up dir

* rename dir to matmul optimization guide

* link to optim guide

* 90% finish

* remove old guide

* consistent title

* font

* move

* remove copy

* Note on frontend

* msprof link

* font

* swizzle explain

* smaller img

* plot FLOPs

* unify figure path

* fix figure suffix

* larger font

* remark on manual sync

* ignore artifacts

* add FLOPs plots

* fix typo

* fix typo

* fix typo

* fix single-buffer pipeline figure

* global grammar and typo clean-up

* more explain on swizzle

* move figures

* typo

* update frontend note

* typo

---------

Co-authored-by: jiawei_zhuang <[email protected]>
learning-chip and others added 26 commits March 14, 2026 07:35
…ei-csl#89)

* minimum agent skills to translate PTO-ISA cpp to PTO-DSL python

* move example generation to Non-Negotiable Rules

* more explicit rule for ref example checking

* pre-commit run --all-files

---------

Co-authored-by: jiawei_zhuang <[email protected]>
* update PTOAS to https://github.com/huawei-csl/PTOAS/releases/tag/0.8

* temporarily commit all example translations (will remove before merging PR)

* one-shot translation to python

prompt:
Translate @kernel_tri_inv_trick.cpp and test script @test_tri_inv_trick.py to @fast_inverse  using skill @SKILL.md

* force docker rebuild

* temporarily add generated IR and Cpp (remove before merge)

* fix bisheng compile

* use newer pto-isa version to avoid bisheng error on TMOV

* do not early-exit on fail

* use a known working version (padding + write to GM)

* add TODO for TMOV

* update PTO-ISA header

* fix TMOV mismatch

* fix tmov order

* update generated cpp

* fix TMOV type mismatch

* update generated cpp

* fix caller dtype mismatch

* try fix manual sync

* inline spill_acc_to_mat

* remove translation references

* try support smaller blocks like 64x64

* update generated cpp

* try build all shapes

* fix context

* remove dynamic valid shape

* note on auto-sync bug

* remove artifacts

* change error to warning, and use larger ftol

* build_artifacts dir

* ignore build_artifacts

* remove unused MAX_MATRIX_SIZE = 128

* run pre-commit run --all-files

---------

Co-authored-by: jiawei_zhuang <[email protected]>
* update PTOAS to https://github.com/zhangstevenunity/PTOAS/releases/tag/v0.9

* auto-sync now works

* test more general block-diag size

* separate out manual vs auto functions

* clean up

* precommit check

* loop count should cover larger block size

* precommit check

---------

Co-authored-by: jiawei_zhuang <[email protected]>
* Add example translation collection check to CI

* fix ci path

* update ptoas version in CI

---------

Co-authored-by: jiawei_zhuang <[email protected]>
…uawei-csl#96)

* mrgsort and sort32 dynamic multicore tests and semidynamic topk example

* updated README for topk example

---------

Co-authored-by: mirkodevita <[email protected]>
# Conflicts:
#	docker/Dockerfile
@zhoubot zhoubot marked this pull request as ready for review March 31, 2026 01:21
@zhoubot zhoubot merged commit e1f267b into main Mar 31, 2026
5 checks passed
@zhoubot zhoubot deleted the codex/ptodsl-a5-lib branch March 31, 2026 01:27

5 participants