[codex] Add A5 PTODSL library and micro coverage #2
Merged
* added dynamic multicore geglu example * added bench geglu and readme for validate examples --------- Co-authored-by: mirkodevita <[email protected]>
…awei-csl#52) * wip: fast hadamard builder * more closely match cpp ref * caller and execution script * default to manual-sync * extra barrier * disable `pto.barrier("STORE_VEC")` for now due to hw-native-sys/PTOAS#187 * do not early-exit * print full mismatch msg * allocate double buffer * extra sync between VEC and MTE3 * try refactor to bulk load-store just like manual cpp ref * add one missing sync * fix subset view indexing * add missing sync again * Revert "add missing sync again" This reverts commit ba25329. * Revert "fix subset view indexing" This reverts commit 9e3316f. * add C++ fast hadamard reference * change compile option fixes run-time error, why? * fallback to legacy `--cce-soc-core-type=VecCore` * commit hadamard_manual_sync.cpp for record * fix early-exit logic in Python builder * update generated cpp and run log * fix n_half * Revert "fix n_half" This reverts commit 49cd872. * syntax sugar * fallback to loading one sample at a time, unit test now all pass * test batch 29 to span over two rounds * remove --test-both option * remove cpp references for a clean PR
* split-out auto vs manual sync branches, so that default auto-sync builder looks clean * remove old run log * reuse code across if and else branches * fix bracket * remove temporary `num_blocks` var * simplify lib path * move kernel side dim check to launch time, to reduce indentation * flatten module builder, do not use closure
…uawei-csl#62) * clean-up * perf benchmark * switch to `--npu-arch=dav-2201` and adjust builder accordingly. * plot bandwidth (script taken from cpp ref)
* fix vector index counts for dynamic add * fix geglu vector index count
* added rowsum dynamic multicore tests for fp32 * added more rowsum tests and removed unnecessary checks in builder * use more general batch sizes --------- Co-authored-by: mirkodevita <[email protected]> Co-authored-by: learning-chip <[email protected]>
…i-csl#55) * WIP: dsl * feat: simplify * feat: add finer grained set/wait flags for more pipeline overlap * fix: func signature compatability with caller * feat: fp16xfp16->fp16 matmul with fp32 acc tiles * feat: A/C are now 3d tensors * fix: typo * feat: pre-load B * WIP: double buffer * feat: cleanup * wip: not working * feat: add cpp double buf code for debug * fix: typos * feat: remove extra sync in bench * feat: add double buf cpp ref * feat: convert dsl to fp16 * feat: add modulus and support multiple event_ids * feat: double buffer matmul pto-dsl * fix: remove cpp * feat: simplify * feat: add manual sync option to decorator * Revert "feat: add manual sync option to decorator" This reverts commit 8ccbf68. * fix: duplicate and typing * fix: type check * feat: move to new directory * fix: remove sync param * feat: add run script * refactor: cleanup cpp ref * feat: add caller and readme * feat: verify and benchmark, add barrier * feat: add optional dev depencdency matplotlib * fix: only plot when specified * fix: only use the signed remainder ops
* added elementwise unary tests or fp16 and fp32 * reduce test case numbers * register require_npu marker --------- Co-authored-by: mirkodevita <[email protected]> Co-authored-by: learning-chip <[email protected]>
* refactor ptodsl module namespaces * fix lazy-import * module clean-up * comment on the need to lazy-eval * pto.subset -> tile.subset * further split pto.py file * multi-level dir layout * avoid overwriting sys.modules * lazy-import * pto.for_range -> pto.range
* feat: add print tile example * feat: add builder file that will be fixed in future * feat: make compile script more robust, dsl now works * feat: remove cpp file * fix: compile script * fix: reorder print * feat: add printf
* port the latest minimum code from huawei-csl#70 * fix missing apis in ptodsl package * temporarily add artifacts * make swizzle count run-time parameter * update generated cpp and IR * expost swizzle_direction also as run-time parameter * fix MLIR verify * update generated ir and cpp * style compact * move nested functions out of main kernel function * note on SSA dominance * ignore compile artifacts * benchmark scripts * revert to older working builder * use more informative function name * update ptoas version * move level1_loop_mn_dynamic_tilesize to outside of main function * move out swizzle_zn and swizzle_nz util functions * plot swizzle speed-up and FLOPS ratio * remove fraction plot * ignore benchmark artifacts * reduce line counts * TODO on uni-tile-size demo * add auto-sync version of matmul swizzle * Revert "add auto-sync version of matmul swizzle" This reverts commit 446bee3. --------- Co-authored-by: jiawei_zhuang <[email protected]>
…l sync performance. (huawei-csl#73) * simpler matmul demo * remove swizzle param from simple demo * remove unused var * inline level1_loop_mn function * use one TensorType declare * rename level2_loop_k, put comments * add auto-sync variant * correct compile auto-sync version * ignore artifacts * performance benchmark * simple static swizzle
…ditable install by CI (huawei-csl#77) * fix pip install problem without -e * update CI to run pip install both with or without -e * more clear CI job naming --------- Co-authored-by: jiawei_zhuang <[email protected]>
Co-authored-by: mirkodevita <[email protected]>
…ei-csl#79) * remove unused event id * even simpler single-buffer matmul, compare to double buf perf * fix print * more explain in comments * flatten nested loop * correct flop ratio print * correct print * also compare with non swizzle * re-structure to step-by-step optimization * output artifacts dir * refactor: split IR builder into 4 steps * draft optimization guide * add pto syntax explain * add numpy simulation code for step-1 * print swizzle grid in numpy * rename tutorial directory --------- Co-authored-by: jiawei_zhuang <[email protected]>
row/col reduce op dynamic multicore tests
Expand col and row (with expand div, sub, mul) dynamic multicore tests
* ignore artifacts * remove unused script * rename script * remove _256 suffix * rename * rename in guide * improve numpy emulation * better size print, update guide * Manual draft of optimization guide and figures * move early version to experimental dir * move up dir * rename dir to matmul optimization guide * link to optim guide * 90% finish * remove old guide * consistent title * font * move * remove copy * Note on frontend * msprof link * font * swizzle explain * smaller img * plot FLOPs * unify figure path * fix figure suffix * larger font * remark on manual sync * ignore artifacts * add FLOPs plots * fix typo * fix typo * fix typo * fix single-buffer pipeline figure * global grammer and typo clean-up * more explain on swizzle * move figures * typo * update frontend note * typo --------- Co-authored-by: jiawei_zhuang <[email protected]>
…ei-csl#89) * minimum agent skills to translate PTO-ISA cpp to PTO-DSL python * move example generation to Non-Negotiable Rules * more explicit rule for ref example checking * pre-commit run --all-files --------- Co-authored-by: jiawei_zhuang <[email protected]>
* update PTOAS to https://github.com/huawei-csl/PTOAS/releases/tag/0.8 * temporarily commit all example translations (will remove before merging PR) * one-shot translation to python prompt: Translate @kernel_tri_inv_trick.cpp and test script @test_tri_inv_trick.py to @fast_inverse using skill @SKILL.md * force docker rebuild * temporary add generated IR and Cpp (remove before merge) * fix bisheng compile * use newer pto-isa version to avoid bisheng error on TMOV * do not early-exit on fail * use a known working version (padding + write to GM) * add TODO for TMOV * update PTO-ISA header * fix TMOV mismatch * fix tmov order * update generated cpp * fix TMOV type mismatch * update generated cpp * fix caller dtype mismatch * try fix manual sync * inline spill_acc_to_mat * remove translation references * try support smaller blocks like 64x64 * update generated cpp * try build all shapes * fix context * remove dynamic valid shape * note on auto-sync bug * remove artifacts * change error to warning, and use larger ftol * build_artifacts dir * ignore build_artifacts * remove unused MAX_MATRIX_SIZE = 128 * run pre-commit run --all-files --------- Co-authored-by: jiawei_zhuang <[email protected]>
* update PTOAS to https://github.com/zhangstevenunity/PTOAS/releases/tag/v0.9 * auto-sync now works * test more general block-diag size * separate out manual vs auto functions * clean up * precommit check * loop count should cover larger block size * precommit check --------- Co-authored-by: jiawei_zhuang <[email protected]>
Cleans up huawei-csl#90 huawei-csl#91 for human readers.
* Add example translation collection check to CI * fix ci path * update ptoas version in CI --------- Co-authored-by: jiawei_zhuang <[email protected]>
…uawei-csl#96) * mrgsort and sort32 dynamic multicore tests and semidynamic topk example * updated README for topk example --------- Co-authored-by: mirkodevita <[email protected]>
# Conflicts: # docker/Dockerfile
Summary
Add a new `ptodsl.lib.a5` package that rewrites the public PTO tile surface into Pythonic PTODSL helpers backed by PTO micro instructions where possible.

What Changed
* `ptodsl.lib.a5` with A5-oriented helpers, generated `.pto` samples, and a small README
* `sort32`, `mrgsort`, and indexed `gather`

Coverage
Current tile micro coverage in this PR:
* … `gather` mask-pattern form still depends on `vsqz`, which is not exposed in the current PTO micro Python surface)
* … `matmul`, `matmul_bias`, `matmul_acc`, `extract`)

Validation
`PYTHONPATH=... python -m pytest -q tests/regression`
30 passed in 0.35s

Impact
This gives PTO-DSL a usable A5 library layer, generated `.pto` artifacts, and an observable micro-coverage contract for future parity work.
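For readers wondering what an "observable micro-coverage contract" could look like in a regression test, here is a minimal sketch. It is not taken from this PR: `CoverageEntry`, `COVERAGE`, and `check_contract` are hypothetical names, and the statuses in the table are illustrative placeholders. Only the `vsqz` dependency of the mask-pattern `gather` form comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CoverageEntry:
    """One row of a hypothetical tile-op coverage table."""
    op: str                           # tile-level op name
    micro_backed: bool                # True if lowered to PTO micro instructions
    blocked_on: Optional[str] = None  # missing micro instruction, if any


# Illustrative placeholder table, NOT the real coverage list of the PR.
COVERAGE = [
    CoverageEntry("gather (indexed)", micro_backed=True),
    # From the PR text: the mask-pattern form still depends on vsqz.
    CoverageEntry("gather (mask-pattern)", micro_backed=False, blocked_on="vsqz"),
]


def check_contract(entries: List[CoverageEntry]) -> List[str]:
    """Return a list of contract violations (empty list means the table is consistent)."""
    problems = []
    seen = set()
    for entry in entries:
        if entry.op in seen:
            problems.append(f"duplicate entry: {entry.op}")
        seen.add(entry.op)
        if entry.micro_backed and entry.blocked_on is not None:
            problems.append(f"{entry.op}: micro-backed op should not list a blocker")
        if not entry.micro_backed and entry.blocked_on is None:
            problems.append(f"{entry.op}: non-micro-backed op must name its blocker")
    return problems


if __name__ == "__main__":
    for line in check_contract(COVERAGE) or ["coverage table is consistent"]:
        print(line)
```

A test in this style would fail as soon as an op is marked as not micro-backed without saying why, which is one way to keep the parity gap visible in future PRs.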