Skip to content

docs: align VPTO MTE memory naming#341

Draft
WenboCodes wants to merge 227 commits into
mouliangyu:feature-vpto-backendfrom
WenboCodes:codex/vpto-mte-docs
Draft

docs: align VPTO MTE memory naming#341
WenboCodes wants to merge 227 commits into
mouliangyu:feature-vpto-backendfrom
WenboCodes:codex/vpto-mte-docs

Conversation

@WenboCodes
Copy link
Copy Markdown
Collaborator

@WenboCodes WenboCodes commented May 8, 2026

Summary

  • reorganize MTE DMA copy docs around public pto.mte_<src>_<dst> op naming
  • move cube load/store movement docs into the DMA copy chapter and split GM-to-L1 burst/fractal forms
  • align VPTO spec references and address-space conventions with L1/L0A/L0B/L0C/BT names and aliases
旧 wrapper 接口 新 MTE 接口 数据方向 说明
pto.dma_load pto.mte_gm_ub GM -> UB Structured GM-to-UB DMA load wrapper;nburst(...) / loop(...) / pad(...) 保留
pto.dma_store pto.mte_ub_gm UB -> GM Structured UB-to-GM DMA store wrapper;nburst(...) / loop(...) 保留
pto.dma_copy,dst 是 UB pto.mte_ub_ub UB -> UB UB 内部 copy wrapper
pto.dma_copy,dst 是 L1 / mat pto.mte_ub_l1 UB -> L1 UB-to-L1 copy wrapper
pto.cube_load pto.mte_gm_l1_burst GM -> L1 Structured GM-to-L1 burst copy wrapper
pto.cube_load_frac pto.mte_gm_l1_fractal GM -> L1 Structured GM-to-L1 fractal load wrapper;用 nd2nz / dn2nz mode 区分
pto.left_load pto.mte_l1_l0a L1 -> L0A Structured L1-to-L0A wrapper
pto.right_load pto.mte_l1_l0b L1 -> L0B Structured L1-to-L0B wrapper
pto.left_load_mx pto.mte_l1_l0a_mx L1 -> L0A MX-mode L1-to-L0A wrapper
pto.right_load_mx pto.mte_l1_l0b_mx L1 -> L0B MX-mode L1-to-L0B wrapper
pto.bias_load pto.mte_l1_bt L1 -> BT Structured L1-to-BT bias load wrapper
pto.acc_store,dst 是 L1 / mat pto.mte_l0c_l1 L0C -> L1 Structured L0C-to-L1 wrapper
pto.acc_store_gm pto.mte_l0c_gm L0C -> GM Structured L0C-to-GM writeback wrapper
pto.acc_store_ub pto.mte_l0c_ub L0C -> UB Structured L0C-to-UB wrapper

mouliangyu and others added 30 commits April 28, 2026 04:36
Explain block/subblock runtime queries in workload-partitioning terms and remove redundant supported-forms wording from conversion ops docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- Add detailed mode parameter documentation (mode=0 vs mode=1)
- Add 'Why get_buf/rls_buf is More Programmer-Friendly' section:
  - No manual priming/draining for ping/pong loops
  - No loop peeling for complex/nested loop dependencies
  - Simpler mental model (buffer ID + program order)
- Add quick example comparison showing set_flag overhead vs get_buf simplicity
- Update Example 2 and 3b with explicit mode=0 in code
- Update comparison table with 'Loop peeling' row
- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair)
- get_buf/rls_buf: 1 ID per buffer (handles both directions automatically)
- 8 per pipe-pair is HW limit, not a formula
- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit)
- get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines
- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc)
- Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions
- 4 set_flag + 4 wait_flag (not 8)
- 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers
- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop
- get_buf/rls_buf: same pattern but acquire/release can stay inside or outside
- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
Add the merged v0.3 PTO micro-instruction release spec document for A5,
including ISA group references and updated synchronization notes.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
    Introduce a three-pass pipeline that lowers PTO tile ops to vector-level
    implementations via TileLang DSL templates:

    - ExpandTileOp: invokes TileLang Python DSL to instantiate template
      functions and replaces tile ops with func.call. SpecKey covers all
      operands; tile_buf operands are passed through without bridging.
    - PTOInlineLibCall: extended to recognize tilelang instance functions via
      the  attribute set by the DSL frontend.
    - FoldTileBufIntrinsics: resolves pto.tile_buf_addr / tile_valid_rows /
      tile_valid_cols, including dynamic valid-shape via pto.bind_tile chain
      tracing.
    - MemrefToTileBuf: recovers tile_buf types from memref + bind_tile
      metadata after PlanMemory/InsertSync.
    - PTOViewToMemref: insert pto.bind_tile anchors for tile_buf function
      args so MemrefToTileBuf can recover them.

    Adds new PTO ops (tile_buf_addr/tile_valid_rows/tile_valid_cols),
    ptoas pipeline wiring, design docs, and unit tests.
Zhendong404 and others added 25 commits April 30, 2026 14:56
…amming Model

- Define fractal NZ layout (K1M1M0K0 / K1N1K0N0 / N1M1M0N0) for L1/L0A/L0B/L0C
- Document full GM->L1->L0A/B->L0C->GM data flow pipeline with ASCII diagrams
- Clarify copy_gm_to_cbuf_multi_nd2nz vs dn2nz (nd2nz preferred for GEMM; dn2nz
  for NCHW/conv; A2/A3 only has nd2nz so nd2nz is backward compatible)
- Clarify L0A layout: FRACTAL_NZ K1M1M0K0 on A5 (FRACTAL_ZZ M1K1M0K0 on A3)
- Clarify load_cbuf_to_ca/cb: each burst = one 512B fractal z-block (16x16 bf16);
  inner-box transpose for B done on-the-fly during MTE L1->L0B transfer
- Add copy_matrix_cc_to_ub writeback path (A5 only, fixed-point datapath)
Add new subsection under Intra-Cluster Data Paths in Cluster Programming Model:
- Define fractal NZ layout (K1M1M0K0 / K1N1K0N0 / N1M1M0N0) for L1/L0A/L0B/L0C
- Per-buffer NZ layout table with copy ops
- L0A: FRACTAL_NZ K1M1M0K0 on A5 / FRACTAL_ZZ M1K1M0K0 on A3
- Full GM->L1->L0A/B->L0C->GM ASCII pipeline diagram
- load_cbuf_to_ca/cb: each burst = one 512B fractal z-block; B transpose on-the-fly
- copy_matrix_cc_to_ub writeback (A5 only, fixed-point datapath)
- nd2nz preferred for GEMM; dn2nz for NCHW/conv; A2/A3 has no dn2nz (backward compat)
* [Feat] enhance vcvt behavior for rowsum

* [Feat] enhance vcvt behavior for rowsum

* [Feat] enhance vcvt behavior for rowsum

* update attributes
* [Add] tcmp trem tfmod

* [Fix] license-header check

---------

Co-authored-by: zwd060924 <[email protected]>
* fix tfillpad_inplace_template

* fix tfillpad_inplace

* fix tfillpad_inplace

* fix tfillpad_inplace

* fix tfillpad_inplace

* fix tfillpad_inplace

* fix tfillpad_inplace

---------

Co-authored-by: kangjiaming <[email protected]>
…angyu#332)

* Delete useless code and abbreviate tile_buf type in test cases

* Delete useless code and abbreviate tile_buf type in test cases

* Delete useless code and abbreviate tile_buf type in test cases

* Delete useless code and abbreviate tile_buf type in test cases

---------

Co-authored-by: kangjiaming <[email protected]>
* add tcolargmax/min tileops lib implementation

* fix tcolargmax/min implementations and tile_buf declaration
* Add OP For TSort32

* review-fix: tile_buf use compact-mode

* review-fix: tmp param should be optional

---------

Co-authored-by: caojian5 <[email protected]>
Comment thread docs/isa/micro-isa/02-dma-copy.md Outdated
@mouliangyu mouliangyu force-pushed the feature-vpto-backend branch from 5e223fb to 42b74f9 Compare May 14, 2026 00:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.