docs: add v0.5 release spec bundles by KurrinQu · Pull Request #325 · mouliangyu/PTOAS

KurrinQu · 2026-04-30T08:51:22Z

No description provided.


- Add detailed mode parameter documentation (mode=0 vs mode=1)
- Add 'Why get_buf/rls_buf is More Programmer-Friendly' section:
  - No manual priming/draining for ping/pong loops
  - No loop peeling for complex/nested loop dependencies
  - Simpler mental model (buffer ID + program order)
- Add quick example comparison showing set_flag overhead vs get_buf simplicity
- Update Example 2 and 3b with explicit mode=0 in code
- Update comparison table with 'Loop peeling' row


Introduce a three-pass pipeline that lowers PTO tile ops to vector-level implementations via TileLang DSL templates:

- ExpandTileOp: invokes TileLang Python DSL to instantiate template functions and replaces tile ops with func.call. SpecKey covers all operands; tile_buf operands are passed through without bridging.
- PTOInlineLibCall: extended to recognize tilelang instance functions via the attribute set by the DSL frontend.
- FoldTileBufIntrinsics: resolves pto.tile_buf_addr / tile_valid_rows / tile_valid_cols, including dynamic valid-shape via pto.bind_tile chain tracing.
- MemrefToTileBuf: recovers tile_buf types from memref + bind_tile metadata after PlanMemory/InsertSync.
- PTOViewToMemref: insert pto.bind_tile anchors for tile_buf function args so MemrefToTileBuf can recover them.

Adds new PTO ops (tile_buf_addr/tile_valid_rows/tile_valid_cols), ptoas pipeline wiring, design docs, and unit tests.


…amming Model

- Define fractal NZ layout (K1M1M0K0 / K1N1K0N0 / N1M1M0N0) for L1/L0A/L0B/L0C
- Document full GM->L1->L0A/B->L0C->GM data flow pipeline with ASCII diagrams
- Clarify copy_gm_to_cbuf_multi_nd2nz vs dn2nz (nd2nz preferred for GEMM; dn2nz for NCHW/conv; A2/A3 only has nd2nz so nd2nz is backward compatible)
- Clarify L0A layout: FRACTAL_NZ K1M1M0K0 on A5 (FRACTAL_ZZ M1K1M0K0 on A3)
- Clarify load_cbuf_to_ca/cb: each burst = one 512B fractal z-block (16x16 bf16); inner-box transpose for B done on-the-fly during MTE L1->L0B transfer
- Add copy_matrix_cc_to_ub writeback path (A5 only, fixed-point datapath)

Add new subsection under Intra-Cluster Data Paths in Cluster Programming Model:

- Define fractal NZ layout (K1M1M0K0 / K1N1K0N0 / N1M1M0N0) for L1/L0A/L0B/L0C
- Per-buffer NZ layout table with copy ops
- L0A: FRACTAL_NZ K1M1M0K0 on A5 / FRACTAL_ZZ M1K1M0K0 on A3
- Full GM->L1->L0A/B->L0C->GM ASCII pipeline diagram
- load_cbuf_to_ca/cb: each burst = one 512B fractal z-block; B transpose on-the-fly
- copy_matrix_cc_to_ub writeback (A5 only, fixed-point datapath)
- nd2nz preferred for GEMM; dn2nz for NCHW/conv; A2/A3 has no dn2nz (backward compat)
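
As a rough illustration of the K1M1M0K0 naming, the sketch below maps a logical (m, k) element to a linear offset. It assumes the layout name lists dimensions from outermost to innermost and that a fractal block is 16x16 bf16 elements (the 512B figure in the commit notes); the function name `nz_offset` is hypothetical and not part of the PTO toolchain.

```python
# Assumption: K1M1M0K0 is read outermost-to-innermost, with 16x16 bf16 blocks.
M0 = 16          # rows per fractal block
K0 = 16          # columns per fractal block
BF16_BYTES = 2   # bytes per bf16 element


def nz_offset(m: int, k: int, rows: int) -> int:
    """Element offset of logical (m, k) in a K1M1M0K0 buffer (hypothetical helper)."""
    m1_count = -(-rows // M0)   # number of row blocks, rounded up
    k1, k0 = divmod(k, K0)      # block column, column inside the block
    m1, m0 = divmod(m, M0)      # block row, row inside the block
    return ((k1 * m1_count + m1) * M0 + m0) * K0 + k0


print(M0 * K0 * BF16_BYTES)   # 512 bytes per fractal block
print(nz_offset(17, 3, 32))   # 275
print(nz_offset(0, 16, 32))   # 512: first element of the next block column
```

Under this reading, all row blocks of one block column are stored contiguously before the next block column begins.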

mouliangyu and others added 30 commits April 28, 2026 04:36

feat: add vpto backend

97cbb4b

clarify block query docs and trim conversion section

cd451b4

Explain block/subblock runtime queries in workload-partitioning terms and remove redundant supported-forms wording from conversion ops docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: correct event ID explanation in comparison table

46f1a8a

- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair)
- get_buf/rls_buf: 1 ID per buffer (handles both directions automatically)
- 8 per pipe-pair is HW limit, not a formula

fix: clarify event ID management in comparison table

3df56a7

- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit)
- get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines

fix: simplify event ID explanation and drain example

31540f2

- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc)
- Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions

fix: correct set_flag/wait_flag count in quick example

440925c

- 4 set_flag + 4 wait_flag (not 8)
- 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers
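
The ID arithmetic in this fix can be checked with a small sketch. The helper names and the two-direction default are illustrative assumptions, not part of the PTO API; the two limits are the ones quoted in the earlier commit notes (8 event IDs per pipe-pair direction, 32 global buffer IDs).

```python
# Hypothetical helpers; only the counts and limits come from the commit notes.
EVENT_IDS_PER_DIRECTION = 8   # HW limit per pipe-pair direction (set_flag/wait_flag)
GLOBAL_BUFFER_IDS = 32        # HW limit for get_buf/rls_buf buffer IDs


def flag_event_ids(num_buffers: int, num_directions: int = 2) -> int:
    """Event IDs used by set_flag/wait_flag: one per buffer per direction."""
    if num_buffers > EVENT_IDS_PER_DIRECTION:
        raise ValueError("more buffers than event IDs available in one direction")
    return num_buffers * num_directions


def buffer_ids(num_buffers: int) -> int:
    """Buffer IDs used by get_buf/rls_buf: one per shared buffer."""
    if num_buffers > GLOBAL_BUFFER_IDS:
        raise ValueError("more buffers than global buffer IDs")
    return num_buffers


print(flag_event_ids(2))  # ping/pong with forward + reverse directions: 4
print(buffer_ids(2))      # 2
```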

fix: add concrete 1:N example for loop peeling comparison

a7e5451

- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices: must peel set/wait outside loop
- get_buf/rls_buf: same pattern but acquire/release can stay inside or outside

fix: show get_buf/rls_buf inside scf loop for 1:N example

4d5dac8

- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
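
The blocking behavior described here can be pictured with a toy timing model. This is only a sketch under simplifying assumptions (a single load that completes at a fixed time, unit-cost slices, a hypothetical `simulate_slices` function); it does not model the actual hardware semantics of get_buf/rls_buf.

```python
def simulate_slices(load_done_at: int, num_slices: int = 8, slice_cost: int = 1):
    """Return (wait_time, start_time) per slice for a toy 1:N acquire/release loop."""
    now = 0
    timeline = []
    for _ in range(num_slices):
        # Acquire: blocks only while the producer (the single load) is unfinished.
        start = max(now, load_done_at)
        timeline.append((start - now, start))
        # Process the slice, then release the buffer.
        now = start + slice_cost
    return timeline


waits = [wait for wait, _ in simulate_slices(load_done_at=5)]
print(waits)  # [5, 0, 0, 0, 0, 0, 0, 0]
```

Only the first acquire waits for the load; every later acquire finds the buffer already available.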

refactor vpto llvm emitter

dcd8586

add online softmax (q * k is ready) case

227da17

docs: add VPTO spec v0.3 release draft

dbfe96b

Add the merged v0.3 PTO micro-instruction release spec document for A5, including ISA group references and updated synchronization notes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

add PTO-Gym submodule

a77c8b2

feat: add PTO-Gym guide skill

949a928

Add tilelang dsl implementation

495a02c

Update openspec

f5ee676

Support more syntax in tilelang dsl

47a102f

Update openspec

5ac3dba

tilelang dsl first working version

0fdf4d3

Update openspec

0d8e71e

Support more dsl syntax

5d96301

Amend

0945be0

Update docs

bd86d21

docs: add TileOp expand design and demo

11bc085

brought back the lost pass and pto op after rebase

f640129

feat: tile lang dsl

40b2c93

Update DSL user guide

203ea91

Support more tile attributes

51112bb

Support more event IDs

30dd61a

zwd060924 and others added 29 commits April 28, 2026 04:44

[Add] tadd tsub tmul tdiv tmax tmin tshl tshr tor tand txor tcmp trem tfmod

38cc246

[Add] tcmp trem tfmod

454b061

[Delete] tcmp tfmod trem

c515dbe

[Delete] tcmp tfmod trem

e707c7d

[Fix] pass CI

9a4df37

[Fix] delete tcmp tfmod trem in Cmakelists

1a6b535

[Fix] CI error in tmin

2ed48b0

Fix VPTO vcvt and vaxpy mask lowering

4662a3e

Fix installed TileLang resources and disable wheel CI

690e8bc

Fix TileLang soft-math helper lookup in installed layout

9324eab

bugfix: fixup the mask lack (hw-native-sys#291)

352427a

Co-authored-by: mouliangyu <mouliangyu@huawei.com>

[Fix] TPartMin & TPartMax add mem_bar

0a87075

Fix saturation immediate encoding in VPTO LLVM emitter

721a08b

Fix lit cases

969a6b2

fix(dsl): allow mixed index and integer scalar binary ops (hw-native-sys#246)

8b08239

feat(vpto): support vecscope auto infer in the vpto backend

76a92ee

Support cross-file inline_proc import

7aab96a

fix(vpto): support merged align state across scf.if (hw-native-sys#300) (hw-native-sys#308)

8ebbb4c

docs: refresh Tile Instruction specs and release bundles

ce77b9c

docs: move tensor_view_addr into tile spec

344059d

(cherry picked from commit 34b7864)

docs: refresh v0.4 release spec notes

0f5cec9

(cherry picked from commit ecc78bf)

fix(vpto): force V300 ctrl mode for sat-sensitive vcvt

6670954

feat: add cube dma load/store ops

0626c66

bugfix: fixup mix fatobj

33905f4

fix(test/vpto): use pto.acc_store_gm for ACC to GM in cube vpto ST

e8014e9

feat: add left/right_load_mx

f9b523b

Signed-off-by: FangRui <fangrui_95@163.com>

docs: add v0.5 release spec bundles

e9efe7b

mouliangyu force-pushed the feature-vpto-backend branch from 5e223fb to 42b74f9 Compare May 14, 2026 00:19

KurrinQu wants to merge 209 commits into mouliangyu:feature-vpto-backend from KurrinQu:feature-v0.5-release-docs


17 participants
