add tconcat template and related testcases #310

Open
sundyCoder wants to merge 193 commits into mouliangyu:feature-vpto-backend from sundyCoder:feature-vpto-backend

sundyCoder commented Apr 29, 2026

add tconcat template and related testcases

mouliangyu and others added 30 commits April 28, 2026 04:36
Explain block/subblock runtime queries in workload-partitioning terms and remove redundant supported-forms wording from conversion ops docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add detailed mode parameter documentation (mode=0 vs mode=1)
- Add 'Why get_buf/rls_buf is More Programmer-Friendly' section:
  - No manual priming/draining for ping/pong loops
  - No loop peeling for complex/nested loop dependencies
  - Simpler mental model (buffer ID + program order)
- Add quick example comparison showing set_flag overhead vs get_buf simplicity
- Update Example 2 and 3b with explicit mode=0 in code
- Update comparison table with 'Loop peeling' row
- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair)
- get_buf/rls_buf: 1 ID per buffer (handles both directions automatically)
- 8 per pipe-pair is HW limit, not a formula
- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit)
- get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines
- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc)
- Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions
- 4 set_flag + 4 wait_flag (not 8)
- 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers
- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop
- get_buf/rls_buf: same pattern but acquire/release can stay inside or outside
- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
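
The ID accounting in the notes above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the PR: the function and constant names are hypothetical, and the only facts taken from the commit notes are the limits (8 event IDs per pipe-pair direction, 32 global buffer IDs) and the per-buffer costs (one event ID per direction for set_flag/wait_flag, one buffer ID in total for get_buf/rls_buf).

```python
# Hardware limits as quoted in the commit notes (assumed to apply as stated).
MAX_EVENT_IDS_PER_DIRECTION = 8   # set_flag/wait_flag, per pipe-pair direction
MAX_BUFFER_IDS = 32               # get_buf/rls_buf, global


def event_ids_needed(num_buffers: int, directions: int = 2) -> int:
    """Event IDs for set_flag/wait_flag: one per buffer per direction."""
    if num_buffers > MAX_EVENT_IDS_PER_DIRECTION:
        raise ValueError("more buffers than event IDs in one pipe-pair direction")
    return num_buffers * directions


def buffer_ids_needed(num_buffers: int) -> int:
    """Buffer IDs for get_buf/rls_buf: one per buffer, covering both directions."""
    if num_buffers > MAX_BUFFER_IDS:
        raise ValueError("more buffers than global buffer IDs")
    return num_buffers


ping_pong = 2  # two buffers: ping and pong
print(event_ids_needed(ping_pong))   # 4 = 2 directions x 2 buffers
print(buffer_ids_needed(ping_pong))  # 2
```

For a ping/pong pair this reproduces the "4 IDs = 2 pipe-pair directions × 2 ping/pong buffers" figure, against 2 buffer IDs for the get_buf/rls_buf scheme.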
Add the merged v0.3 PTO micro-instruction release spec document for A5,
including ISA group references and updated synchronization notes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    Introduce a three-pass pipeline that lowers PTO tile ops to vector-level
    implementations via TileLang DSL templates:

    - ExpandTileOp: invokes TileLang Python DSL to instantiate template
      functions and replaces tile ops with func.call. SpecKey covers all
      operands; tile_buf operands are passed through without bridging.
    - PTOInlineLibCall: extended to recognize tilelang instance functions via
      the  attribute set by the DSL frontend.
    - FoldTileBufIntrinsics: resolves pto.tile_buf_addr / tile_valid_rows /
      tile_valid_cols, including dynamic valid-shape via pto.bind_tile chain
      tracing.
    - MemrefToTileBuf: recovers tile_buf types from memref + bind_tile
      metadata after PlanMemory/InsertSync.
    - PTOViewToMemref: insert pto.bind_tile anchors for tile_buf function
      args so MemrefToTileBuf can recover them.

    Adds new PTO ops (tile_buf_addr/tile_valid_rows/tile_valid_cols),
    ptoas pipeline wiring, design docs, and unit tests.
Happybot and others added 27 commits April 28, 2026 04:44
32 semaphores are shared for both C->V and V->C directions:
- sema_id  0-15: communicate with AIV0 (subblock_id=0)
- sema_id 16-31: communicate with AIV1 (subblock_id=1)
-> 16 sema_id pairs available for 1:2 C:V sync per slot
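
The semaphore partitioning above amounts to a simple index mapping. The sketch below is illustrative only: `sema_id` and `sema_pair` are hypothetical helpers, and pairing slot `s` with IDs `s` and `s + 16` is an assumption about how the 16 pairs are formed. The commit note states only the two ID ranges and the pair count.

```python
SEMAS_PER_SUBBLOCK = 16  # 32 semaphores split evenly between AIV0 and AIV1


def sema_id(slot: int, subblock_id: int) -> int:
    """Map a slot (0-15) and a vector subblock (0 or 1) to a semaphore ID."""
    if not 0 <= slot < SEMAS_PER_SUBBLOCK:
        raise ValueError("slot must be in 0..15")
    if subblock_id not in (0, 1):
        raise ValueError("subblock_id must be 0 (AIV0) or 1 (AIV1)")
    return subblock_id * SEMAS_PER_SUBBLOCK + slot


def sema_pair(slot: int) -> tuple:
    """IDs used by one cube core to sync with both vector subblocks (assumed pairing)."""
    return sema_id(slot, 0), sema_id(slot, 1)


print(sema_pair(3))  # (3, 19)
```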
* [feat] add rowsum rowmin rowmax rowargmax rowargmin

* feat(tileop): add TROWARGMAX, TROWARGMIN, TROWMAX, TROWMIN, TROWPROD, and TROWSUM test cases

* refactor(tileops): simplify initialization of min/max values and accumulator in templates

Co-authored-by: Copilot <copilot@github.com>

* feat(tileops): implement one-point store distance selection in row templates

Co-authored-by: Copilot <copilot@github.com>

---------

Co-authored-by: Copilot <copilot@github.com>
* add tcolmax tcolmin tileops lib implementation

* add tcolsum tcolprod tileops lib implementation

* fix license check in lib template

* fix col mask processing and add constraints check for tcolmax/min/sum/prod

* fix license check for tcolmax/min/sum/prod

* add texpand/tfillpad/tfillpad_inplace/tfillpad_expand op

* revise the review comments

* Add uint16-related test cases for tfillpad/tfillpad_expand

* solve the CI problem

* revise review comments

* revise review comments

* revise review comments

---------

Co-authored-by: kangjiaming <kangjiaming@huawei.com>
Co-authored-by: KurrinQu <qukelin1991@163.com>
Co-authored-by: mouliangyu <mouliangyu@huawei.com>
… been passed. (mouliangyu#212)

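* ST tests pass for the trelu, tlrelu, and trandom ops; ST tests for the tsel, tsels, and tprelu ops do not pass yet

* ST tests pass for the trelu, tlrelu, and trandom ops

* revise the trelu and tlrelu ops per review comments

* compilation and ST tests pass for the trelu, tlrelu, tprelu, tsel, and tsels ops

* compilation and ST tests pass for the trelu, tlrelu, tprelu, tsel, and tsels ops

* add license

* revise per review comments; remove the prelu op, to be submitted later

* fix the tsels CI failure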

---------

Co-authored-by: KurrinQu <qukelin1991@163.com>
* add trowexpand op

* trowexpand series op

* trowexpand series op

* trowexpand series op
* feat: add TileOps templates and basic test cases for tcolexpand operations

* test: add tcolexpand operators test cases

* fix: add a TODO noting that tcolexpanddiv needs a high-precision version

* feat: implement tcolexpandexpdif with vexpdif for fp32 and with vsub+vexp for fp16

* fix: add PR386 license headers to template and test files

- Add license headers to 7 tcolexpand*_template.py files
- Add license headers to test case files (CMakeLists.txt, compare.py, launch.cpp, main.cpp, gen_data.py, cases.py)

* feat: register tcolexpand operators in CMakeLists.txt

* fix: replace aclFloat16 with uint16_t in tcolexpand test cases

- Replace aclFloat16 with uint16_t in main.cpp and launch.cpp (16 files)
- Remove duplicate license headers in 5 launch.cpp files
- Fix .pto comments: aclFloat16 -> fp16
- Remove unnecessary #include "acl/acl.h" from launch.cpp files
- Align with tpartmax implementation pattern

---------

Co-authored-by: User <user@example.com>
Co-authored-by: mouliangyu <mouliangyu@huawei.com>
sundyCoder force-pushed the feature-vpto-backend branch from 3c848ec to 415d2bd April 29, 2026 05:23
lanes_i32 = pto.i32(lanes)
for col in range(0, valid_cols1, lanes):
    active_lanes = remained1
    if active_lanes > lanes_i32:
Collaborator

Could the scalar operations here be replaced by passing valid_cols1 as the argument to make_mask? The same applies to line 89.
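
For context, the scalar logic under review appears to clamp the remaining column count to the vector width on each loop step. The sketch below shows that arithmetic in plain Python rather than the TileLang DSL. `active_lanes_per_step` is a hypothetical helper, and whether `make_mask` performs this clamping internally is an assumption this thread does not confirm.

```python
from typing import List


def active_lanes_per_step(valid_cols: int, lanes: int) -> List[int]:
    """Active lane count for each vector step across one row of valid_cols columns."""
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    # Each step covers `lanes` columns; the final step covers only the remainder.
    return [min(lanes, valid_cols - col) for col in range(0, valid_cols, lanes)]


print(active_lanes_per_step(32, 64))   # [32]
print(active_lanes_per_step(100, 64))  # [64, 36]
```

If the mask constructor already saturates at the vector width, the explicit comparison against `lanes_i32` would be redundant, which seems to be the point of the suggestion.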

: !pto.tensor_view<1x1x1x16x64xf32> -> !pto.partition_tensor_view<1x1x1x16x64xf32>

%src0 = pto.alloc_tile
: !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=64, v_row=16, v_col=32,
Collaborator

We recommend using the shorthand type for tile_buf; see here for reference.

mouliangyu force-pushed the feature-vpto-backend branch from 5e223fb to 42b74f9 May 14, 2026 00:19