Skip to content

IR builder prototype for vpto#362

Open
learning-chip wants to merge 258 commits into
mouliangyu:feature-vpto-backendfrom
learning-chip:a5_dsl
Open

IR builder prototype for vpto#362
learning-chip wants to merge 258 commits into
mouliangyu:feature-vpto-backendfrom
learning-chip:a5_dsl

Conversation

@learning-chip
Copy link
Copy Markdown

@learning-chip learning-chip commented May 13, 2026

To reproduce

# assume ptoas/binding is already installed
cd $PTOAS_REPO_ROOT/ptodsl
pip install -e .

python check_ir.py  # check IR equal for all 4 existing examples

cd $PTOAS_REPO_ROOT/ptodsl/examples
python ./tadd_lowlevel.py  # low level binding call that matches test/lit/vpto/expand_tileop_to_vpto_result.pto
python ./tadd_dsl.py  # dsl version that generates same IR as low-level builder
python ./softmax_lowlevel.py # low level binding call that matches test/tilelang_st/npu/a5/src/st/testcase/softmax/softmax.pto
python ./softmax_dsl.py  #  dsl version that generates same IR as low-level builder

API design

As discussed in #360 (comment), I intentionally avoid ast module, but only use normal eager Python wrappers.

See the added ptodsl/README.md file for current API list.

Side notes

The two bash scripts quick_install.sh and set_ptoas_env.sh are just for internal quick development on top of my Docker image learning-chip/agent_docker_npu#8, they can be removed upon merge.

findings on the python binding APIs

  1. VRegType / MaskType — no dedicated Python class: These types (!pto.vreg<64xf32>, !pto.mask) are not exposed as Python classes in the C extension. They are constructed via Type.parse("!pto.vreg<64xf32>") and Type.parse("!pto.mask"), which is valid since the dialect is registered.

  2. Nested modules: Module.create() ignores the active InsertionPoint. To insert an inner module as a child operation, Operation.create("builtin.module", regions=1) is used instead, followed by manually appending a block to its region with inner_op.regions[0].blocks.append().

  3. pto.vecscope region: VecScopeOp creates the op with an empty region (no blocks). A block must be explicitly appended with vecscope_op.body.blocks.append() before inserting ops into it.

  4. Module attributes (pto.target_arch, pto.kernel_kind): Set via op.attributes["name"] = value — string attributes via StringAttr.get(...), custom dialect attributes via Attribute.parse("#pto.kernel_kind").

  5. No need to change or rebuild ptoas source: All required ops (VecScopeOp, VldsOp, VstsOp, VaddOp, PltB32Op, CastPtrOp, AddPtrOp) were already present in the auto-generated _pto_ops_gen.py bindings.

Lok and others added 30 commits April 28, 2026 04:36
- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
Add the merged v0.3 PTO micro-instruction release spec document for A5,
including ISA group references and updated synchronization notes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    Introduce a three-pass pipeline that lowers PTO tile ops to vector-level
    implementations via TileLang DSL templates:

    - ExpandTileOp: invokes TileLang Python DSL to instantiate template
      functions and replaces tile ops with func.call. SpecKey covers all
      operands; tile_buf operands are passed through without bridging.
    - PTOInlineLibCall: extended to recognize tilelang instance functions via
      the  attribute set by the DSL frontend.
    - FoldTileBufIntrinsics: resolves pto.tile_buf_addr / tile_valid_rows /
      tile_valid_cols, including dynamic valid-shape via pto.bind_tile chain
      tracing.
    - MemrefToTileBuf: recovers tile_buf types from memref + bind_tile
      metadata after PlanMemory/InsertSync.
    - PTOViewToMemref: insert pto.bind_tile anchors for tile_buf function
      args so MemrefToTileBuf can recover them.

    Adds new PTO ops (tile_buf_addr/tile_valid_rows/tile_valid_cols),
    ptoas pipeline wiring, design docs, and unit tests.
mouliangyu and others added 18 commits May 11, 2026 00:54
- Add tilelang-dsl cube surface design and implementation
- Add vpto lowering for cube operations
- Update frontend AST, semantic analysis, and lowering passes
- Add tests for tilelang-dsl v1 cube features
- Add design specs and task documentation
…w-native-sys#303)

* feat: enhance verification for multiple PTO operations in A5 architecture

- TLoadOp: add vec dst layout validation (ND/DN/NZ) and shape matching
- TStoreOp: add vec layout consistency and alignment checks
- TAddSOp: enable requireValidRowsEqual validation for A5
- TCvtOp: add A5 type conversion validation
- TPartAddOp/TPartMaxOp/TPartMinOp/TPartMulOp: add verifyPartialValidPattern for A5
- TSelSOp: add mask row-major layout check
- TShlOp/TShrOp: add dst element type validation in verifyShiftLikeBinaryTileOpCommon

* feat: add 10 test cases and update PTO.cpp verification

Test cases added to test/basic:
- tadds_validrow_mismatch.pto
- tcvt_type_combination.pto
- tload_vec_layout_mismatch.pto
- tpartadd_valid_shape.pto
- tpartmax_valid_shape.pto
- tpartmin_valid_shape.pto
- tpartmul_valid_shape.pto
- tshl_dtype_mismatch.pto
- tshr_dtype_mismatch.pto
- tstore_vec_alignment.pto

PTO.cpp updates:
- Simplify TLoadOp NZ validation
- TStoreOp verification refinements

* fix: update TPart*Op test cases and PTO.cpp verification

* fix: update PTO.cpp and remove tcvt_type_combination test

* fix: update PTO.cpp verification logic

* fix: update PTO.cpp and test cases

* fix: update PTO.cpp

* fix: update PTO.cpp verification

* fix: update PTO.cpp

* fix: update PTO.cpp

* fix: update PTO.cpp

* fix: update PTO.cpp

* revert: 回退PTO.cpp校验逻辑到正确版本

回退到commit 0ec558d(最后能通过的版本)

原因:
- 检视意见误解了ISA编译时static_assert vs AS运行时校验的差异
- 盲目照搬ISA规则导致破坏原有设计
- 添加的错误校验导致CI反复失败(5次失败)

删除的错误校验:
- TLoad的ValidCol/shape匹配强制校验
- TLoad的NZ shape[3]/shape[4]强制校验
- TPart的'至少一个src==dst'校验
- TStore过度严格的对齐校验
- fp4特殊处理逻辑

恢复的正确设计:
- TLoad只校验布局类型匹配(ND/DN/NZ)
- TStore标准对齐校验
- TPart只校验dst>=src
- 保留运行时的灵活性和原有设计哲学

* fix: add missing validations for TMinS/TMulS/TMaxS and TPart ops

- TMinS/TMulS/TMaxS: enable requireValidRowsEqualOnA5 to match ISA BinaryInstr behavior
  (ISA uses validRow for loop bounds, mismatch causes out-of-bounds access)
- TPart: add validation that at least one src valid_shape matches dst valid_shape
  (ISA TPartInstr requires this for TPartCopyInstr/TPartOps)
- TStore: improve code readability with explicit dim check and rows/cols naming

* fix: correct TStore validation and revert TPart validation

- TStore: use srcValid (not srcShape) for isSpecialCase and alignment checks
  (valid_shape represents actual data range, shape is tile allocation)
- TPart: remove incorrect 'at least one src == dst' validation
  (ISA allows src0/src1 to be smaller in different dimensions)
- Keep: TMinS/TMulS/TMaxS requireValidRowsEqualOnA5=true (valid fix)

* fix: revert TStore to use srcShape for alignment checks

- TStore alignment check should use srcShape (tile allocation size)
  not srcValid (valid data range)
- Example: cols=256, v_col=255 → 256 satisfies 32-byte alignment, 255 doesn't
- Hardware alignment requirement is based on tile allocation, not data range

* fix: correct indentation for TStore verification code

- Fix missing indentation on line 2285 (int32_t bl declaration)
- Code formatting consistency with surrounding context
- No functional changes, only whitespace fix

* fix: update PTO.cpp

* Fix TPart*Op validation: only use verifyPartialValidPatternLoose for A5

- Added verifyPartialValidPatternLoose function (checks only <= dst, not == dst)
- Modified only A5 branches to use loose validation (4 locations)
- A2A3 branches continue to use strict verifyPartialValidPattern (requires at least one == dst)
- Fixes tpartmin/tpartmax test failures

---------

Co-authored-by: User <user@example.com>
Input: kernel.pto with cube/vector kernel modules, or a single kernel module normalized into that form.

Output: the final VPTO fatobj object at the requested -o path.

Steps:

1. Normalize the VPTO input and lower cube/vector modules to LLVM.

2. Compile cube/vector LLVM IR to device objects and merge them.

3. Generate the host stub internally and repack it with the merged device object into the final fatobj.

Tests: VPTO host validation now links the ptoas-produced fatobj directly; cases use one kernel.pto and no longer need cube.pto or hand-written stub.cpp.
Input: single VPTO module with pto.aicore functions containing pto.section.vector and pto.section.cube regions.

Output: canonical VPTO container with vector/cube child modules carrying pto.kernel_kind, ready for the existing LLVM/fatobj path.

Key steps:

1. Add vpto-split-cv-module to clone the input per core kind, drop the opposite section, and inline the selected section body.

2. Add vpto-normalize-container to wrap single kernel modules and verify the final container shape.

3. Run both passes in the VPTO backend pipeline and convert the mixed cbuf/ubuf SIM case to the section input form.
Input: VPTO and TileLang ST .pto testcases\nOutput: ptoas fatobj validation flow\n\n1. Remove stub.cpp and old ll/repack path\n2. Route TileLang ST and VPTO scripts through fatobj\n3. Normalize testcase inputs to the new module form
* Add HP (HIGH_PRECISION) support for TLog

* Refactor tlog_template: extract precision mode into subfunctions

Extract HIGH_PRECISION and DEFAULT implementations into separate
inline_proc functions for better code organization per review feedback.

* fix(tlog): add missing pto.aicore attribute to _hp functions

---------

Co-authored-by: caojian5 <caojian5@huawei.com>
Move lit-discovered .pto tests out of test/basic into test/lit/pto or test/lit/vpto, and update related documentation paths.

Relocate misplaced VPTO sim cases under the micro-op hierarchy, remove a duplicated nested sim case, and add local guidance for choosing PTOAS test frameworks.
Update VPTO lit tests to inspect explicit VPTO IR with --emit-vpto where they check VPTO-level rewrites.

Route *_vpto_llvm tests through the VPTO LLVM lowering dump and check llvm.hivm calls from LLVM dialect MLIR instead of stale VPTO output.
Relocate FileCheck-based VPTO .pto tests from test/vpto to test/lit/vpto so they are discovered by the lit framework.

Wrap auto-vecscope tests in a vector kernel submodule to match the current VPTO container form.
@learning-chip learning-chip changed the base branch from main to feature-vpto-backend May 13, 2026 14:16
@mouliangyu mouliangyu force-pushed the feature-vpto-backend branch from 5e223fb to 42b74f9 Compare May 14, 2026 00:19
@learning-chip
Copy link
Copy Markdown
Author

learning-chip commented May 14, 2026

Because the base branch mouliangyu:feature-vpto-backend is rebased and force-pushed, the file diff of this PR got messed up.

For an easy rebase, just move the ptodsl subdirectory to the latest commit. Other internal files (like .cpp or .td) are not touched by this PR. @Zhendong404 @mouliangyu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.