diff --git a/docs/isa/README_zh.md b/docs/isa/README_zh.md index 083f54508..28565369f 100644 --- a/docs/isa/README_zh.md +++ b/docs/isa/README_zh.md @@ -1,4 +1,4 @@ -

+

PTO Tile Lib

diff --git a/docs/isa/comm/TGET_ASYNC_zh.md b/docs/isa/comm/TGET_ASYNC_zh.md index 06255cd1f..50b4c5c31 100644 --- a/docs/isa/comm/TGET_ASYNC_zh.md +++ b/docs/isa/comm/TGET_ASYNC_zh.md @@ -1,4 +1,4 @@ -# TGET_ASYNC +# pto.tget_async ## 简介 diff --git a/docs/isa/comm/TGET_zh.md b/docs/isa/comm/TGET_zh.md index 987bfe7b4..b03fdb9bc 100644 --- a/docs/isa/comm/TGET_zh.md +++ b/docs/isa/comm/TGET_zh.md @@ -1,4 +1,4 @@ -# pto.tget / TGET +# pto.tget ## 简介 diff --git a/docs/isa/comm/TNOTIFY_zh.md b/docs/isa/comm/TNOTIFY_zh.md index ea27b1ebc..c14bc5d42 100644 --- a/docs/isa/comm/TNOTIFY_zh.md +++ b/docs/isa/comm/TNOTIFY_zh.md @@ -1,4 +1,4 @@ -# TNOTIFY +# pto.tnotify ## 简介 diff --git a/docs/isa/comm/TPUT_ASYNC_zh.md b/docs/isa/comm/TPUT_ASYNC_zh.md index f3b0d32cd..46f721e53 100644 --- a/docs/isa/comm/TPUT_ASYNC_zh.md +++ b/docs/isa/comm/TPUT_ASYNC_zh.md @@ -1,4 +1,4 @@ -# TPUT_ASYNC +# pto.tput_async ## 简介 diff --git a/docs/isa/comm/TPUT_zh.md b/docs/isa/comm/TPUT_zh.md index efc44c847..6e66a4c84 100644 --- a/docs/isa/comm/TPUT_zh.md +++ b/docs/isa/comm/TPUT_zh.md @@ -1,4 +1,4 @@ -# TPUT +# pto.tput ## 简介 diff --git a/docs/isa/comm/TTEST_zh.md b/docs/isa/comm/TTEST_zh.md index 87c5a4424..0d4b28941 100644 --- a/docs/isa/comm/TTEST_zh.md +++ b/docs/isa/comm/TTEST_zh.md @@ -1,4 +1,4 @@ -# TTEST +# pto.ttest ## 简介 diff --git a/docs/isa/comm/TWAIT.md b/docs/isa/comm/TWAIT.md index a21c4ad54..5d1dfb69b 100644 --- a/docs/isa/comm/TWAIT.md +++ b/docs/isa/comm/TWAIT.md @@ -1,4 +1,4 @@ -# pto.twait +# pto.twait `pto.twait` is part of the [Collective Communication](communication-runtime.md) instruction set. 
diff --git a/docs/isa/comm/TWAIT_zh.md b/docs/isa/comm/TWAIT_zh.md index d3f4e952d..404d13ecd 100644 --- a/docs/isa/comm/TWAIT_zh.md +++ b/docs/isa/comm/TWAIT_zh.md @@ -1,4 +1,4 @@ -# TWAIT +# pto.twait ## 简介 diff --git a/docs/isa/conventions.md b/docs/isa/conventions.md index 798033da8..8cb4a91bd 100644 --- a/docs/isa/conventions.md +++ b/docs/isa/conventions.md @@ -1,4 +1,4 @@ -# PTO ISA Conventions +# PTO ISA Conventions Shared conventions for the per-instruction ISA reference pages in `docs/isa/` and the corresponding C++ intrinsics in `include/pto/common/pto_instr.hpp` are defined below. diff --git a/docs/isa/conventions_zh.md b/docs/isa/conventions_zh.md index 6e0ce875e..a502fd245 100644 --- a/docs/isa/conventions_zh.md +++ b/docs/isa/conventions_zh.md @@ -1,4 +1,4 @@ -# PTO ISA 通用约定 +# PTO ISA 通用约定 `docs/isa/` 指令参考文档使用的通用术语与写法如下,并与 `include/pto/common/pto_instr.hpp` 中的 C++ 内建接口保持一致。 diff --git a/docs/isa/cube/README.md b/docs/isa/cube/README.md new file mode 100644 index 000000000..f31bfacba --- /dev/null +++ b/docs/isa/cube/README.md @@ -0,0 +1,78 @@ +# Cube Micro-Instruction Reference + +This section documents the PTO **Cube micro-instruction surface**: the matrix-multiply (MAD) and cube-side data-movement ops that program the cube core (AIC) and its dedicated buffer hierarchy (L1 / L0A / L0B / L0C / BT). + +!!! note "Scope and audience" + Tile-level matrix ops such as `pto.tmatmul` (covered under [Tile ISA Matrix & Matrix-Vector](../tile/matrix-and-matrix-vector.md)) hide most of these primitives behind a tile-shaped interface. The cube micro-instructions documented here are the lower-level surface that compiler back-ends and hand-tuned cube kernels target directly. They make NZ fractal layout, L1/L0 buffer hierarchy, and FIXPIPE writeback explicit. + +## Architectural Background + +| Page | Purpose | +|------|---------| +| [NZ Fractal Layout](./nz-fractal-layout.md) | The fractal NZ format used by L1, L0A, L0B, and L0C. 
Defines the `(k1, m1, m0, k0)` re-indexing and per-buffer layout variants. | +| [Buffer Hierarchy](./buffer-hierarchy.md) | The L1 / L0A / L0B / L0C / BT memory hierarchy: address spaces, sizes, and data-flow contracts. | +| [FIXPIPE Model](./fixpipe-model.md) | The FIXPIPE writeback path: how L0C results are converted back to ND and routed to UB or GM. | + +## Matrix Multiply (MAD) Ops + +The MAD family computes `dst = lhs @ rhs` on tiles staged into the cube's L0A / L0B / L0C buffers. All variants share the same `(M, N, K)` shape parameters and a common set of optional clauses (`unit_flag`, `disable_gemv`, `sat`/`nosat`, `tf32_mode`, `n_dir`). + +| Op | Semantics | +|----|-----------| +| [pto.mad](./ops/mad/mad.md) | Zero-init: `dst = lhs @ rhs` | +| [pto.mad_acc](./ops/mad/mad-acc.md) | Accumulate: `dst = dst + lhs @ rhs` | +| [pto.mad_bias](./ops/mad/mad-bias.md) | Bias-init: `dst = lhs @ rhs + bias[n]` | +| [pto.mad_mx](./ops/mad/mad-mx.md) | Zero-init MX (microscaled) matmul | +| [pto.mad_mx_acc](./ops/mad/mad-mx-acc.md) | Accumulating MX matmul | +| [pto.mad_mx_bias](./ops/mad/mad-mx-bias.md) | Bias-init MX matmul | + +## Cube Data Movement Ops + +These ops move tiles between GM, L1, L0A/L0B, and L0C using grouped `nburst(...)` / `loop(...)` clauses analogous to the [scalar DMA Copy](../scalar/dma-copy.md) surface. 
+ +### GM → L1 + +- [pto.mte_gm_l1](./ops/data-movement/mte-gm-l1.md) — Direct GM→L1 load (no layout transform) +- [pto.mte_gm_l1_frac](./ops/data-movement/mte-gm-l1-frac.md) — GM→L1 with ND→NZ fractal repack + +### L1 ↔ UB + +- [pto.mte_l1_ub](./ops/data-movement/mte-l1-ub.md) — L1→UB transfer (cube-to-vector data path) +- [pto.mte_ub_l1](../scalar/ops/dma-copy/mte-ub-l1.md) — UB→L1 transfer (vector-to-cube data path; lives in the scalar DMA section) + +### L1 → L0A / L0B (cube operand load) + +- [pto.mte_l1_l0a](./ops/data-movement/mte-l1-l0a.md) — Stage L1 NZ tile into L0A (left operand) +- [pto.mte_l1_l0b](./ops/data-movement/mte-l1-l0b.md) — Stage L1 NZ tile into L0B (right operand, K-innermost transpose) +- [pto.mte_l1_l0a_mx](./ops/data-movement/mte-l1-l0a-mx.md) — Load MX scale payload for L0A +- [pto.mte_l1_l0b_mx](./ops/data-movement/mte-l1-l0b-mx.md) — Load MX scale payload for L0B + +### L1 → BT (bias) + +- [pto.mte_l1_bt](./ops/data-movement/mte-l1-bt.md) — Stage bias vector into BT for `pto.mad_bias` / `pto.mad_mx_bias` +- [pto.mte_l1_fb](./ops/data-movement/mte-l1-fb.md) — Stage FIXPIPE-relevant payload (e.g., dequant params) + +### L0C writeback (FIXPIPE) + +- [pto.mte_l0c_l1](./ops/data-movement/mte-l0c-l1.md) — FIXPIPE: L0C → L1 +- [pto.mte_l0c_gm](./ops/data-movement/mte-l0c-gm.md) — FIXPIPE: L0C → GM +- [pto.mte_l0c_ub](./ops/data-movement/mte-l0c-ub.md) — FIXPIPE: L0C → UB + +## Full Cube Pipeline + +```text +GM (ND) L1/cbuf (NZ) L0A/B (NZ) L0C (NZ) GM (ND) + +A[M,K] --mte_gm_l1_frac/mte_gm_l1--> K1 M1 M0 K0 --mte_l1_l0a--> K1 M1 M0 K0 -+ + +-MAD-> N1 M1 M0 N0 --> C[M,N] +B[K,N] --mte_gm_l1_frac/mte_gm_l1--> K1 N1 K0 N0 --mte_l1_l0b--> K1 N1 N0 K0 -+ + ^ + transpose as part of mte_l1_l0b when requested + NOT at GM->L1 +``` + +## Related Sections + +- [Tile ISA: Matrix and Matrix-Vector](../tile/matrix-and-matrix-vector.md) — Tile-level matrix ops +- [Scalar DMA Copy](../scalar/dma-copy.md) — UB-side DMA grouped transfers +- [Pipeline 
Synchronization](../scalar/ops/pipeline-sync/) — Cube/Vector synchronization primitives diff --git a/docs/isa/cube/README_zh.md b/docs/isa/cube/README_zh.md new file mode 100644 index 000000000..ec475c017 --- /dev/null +++ b/docs/isa/cube/README_zh.md @@ -0,0 +1,78 @@ +# Cube 微指令参考 + +本节记录 PTO 的 **Cube 微指令表面**:矩阵乘加(MAD)以及面向 cube core(AIC)和其专用缓冲层级(L1 / L0A / L0B / L0C / BT)的数据搬运指令。 + +!!! note "范围与受众" + Tile 级的矩阵指令(例如 `pto.tmatmul`,见 [Tile ISA 矩阵与矩阵-向量](../tile/matrix-and-matrix-vector_zh.md))把这些底层原语隐藏在 tile 形状的接口之后。本节描述的 cube 微指令是编译器后端与手写 cube 内核直接对接的低层表面,把 NZ fractal 布局、L1/L0 缓冲层级、FIXPIPE 回写显式化。 + +## 架构背景 + +| 页面 | 用途 | +|------|------| +| [NZ Fractal 布局](./nz-fractal-layout_zh.md) | L1、L0A、L0B、L0C 使用的 fractal NZ 格式,定义 `(k1, m1, m0, k0)` 重新索引与各缓冲变种。 | +| [缓冲层级](./buffer-hierarchy_zh.md) | L1 / L0A / L0B / L0C / BT 内存层级:地址空间、大小、数据流契约。 | +| [FIXPIPE 模型](./fixpipe-model_zh.md) | FIXPIPE 回写通路:L0C 结果如何转换回 ND 并路由到 UB 或 GM。 | + +## 矩阵乘加(MAD)指令 + +MAD 家族在 cube 的 L0A / L0B / L0C 缓冲上计算 `dst = lhs @ rhs`。所有变体共享相同的 `(M, N, K)` 形状参数与一组可选 clauses(`unit_flag`、`disable_gemv`、`sat`/`nosat`、`tf32_mode`、`n_dir`)。 + +| 指令 | 语义 | +|------|------| +| [pto.mad](./ops/mad/mad_zh.md) | 零初始化:`dst = lhs @ rhs` | +| [pto.mad_acc](./ops/mad/mad-acc_zh.md) | 累加:`dst = dst + lhs @ rhs` | +| [pto.mad_bias](./ops/mad/mad-bias_zh.md) | 偏置初始化:`dst = lhs @ rhs + bias[n]` | +| [pto.mad_mx](./ops/mad/mad-mx_zh.md) | MX(微缩放)零初始化 matmul | +| [pto.mad_mx_acc](./ops/mad/mad-mx-acc_zh.md) | MX 累加 matmul | +| [pto.mad_mx_bias](./ops/mad/mad-mx-bias_zh.md) | MX 偏置初始化 matmul | + +## Cube 数据搬运指令 + +这些指令在 GM、L1、L0A/L0B、L0C 之间搬运 tile,使用与 [标量 DMA Copy](../scalar/dma-copy_zh.md) 同样的内联 `nburst(...)` / `loop(...)` 子句模型。 + +### GM → L1 + +- [pto.mte_gm_l1](./ops/data-movement/mte-gm-l1_zh.md):直接 GM→L1 加载(不做布局变换) +- [pto.mte_gm_l1_frac](./ops/data-movement/mte-gm-l1-frac_zh.md):GM→L1 并完成 ND→NZ fractal 重排 + +### L1 ↔ UB + +- [pto.mte_l1_ub](./ops/data-movement/mte-l1-ub_zh.md):L1→UB(cube→vector 数据通路) +- 
[pto.mte_ub_l1](../scalar/ops/dma-copy/mte-ub-l1_zh.md):UB→L1(vector→cube 数据通路;位于标量 DMA 节) + +### L1 → L0A / L0B(cube 操作数加载) + +- [pto.mte_l1_l0a](./ops/data-movement/mte-l1-l0a_zh.md):把 L1 NZ tile 加载到 L0A(左操作数) +- [pto.mte_l1_l0b](./ops/data-movement/mte-l1-l0b_zh.md):把 L1 NZ tile 加载到 L0B(右操作数,K-innermost 转置) +- [pto.mte_l1_l0a_mx](./ops/data-movement/mte-l1-l0a-mx_zh.md):为 L0A 加载 MX scale payload +- [pto.mte_l1_l0b_mx](./ops/data-movement/mte-l1-l0b-mx_zh.md):为 L0B 加载 MX scale payload + +### L1 → BT(偏置) + +- [pto.mte_l1_bt](./ops/data-movement/mte-l1-bt_zh.md):把 bias 向量加载到 BT,供 `pto.mad_bias` / `pto.mad_mx_bias` 消费 +- [pto.mte_l1_fb](./ops/data-movement/mte-l1-fb_zh.md):加载 FIXPIPE 相关 payload(例如反量化参数) + +### L0C 回写(FIXPIPE) + +- [pto.mte_l0c_l1](./ops/data-movement/mte-l0c-l1_zh.md):FIXPIPE 回写 L0C → L1 +- [pto.mte_l0c_gm](./ops/data-movement/mte-l0c-gm_zh.md):FIXPIPE 回写 L0C → GM +- [pto.mte_l0c_ub](./ops/data-movement/mte-l0c-ub_zh.md):FIXPIPE 回写 L0C → UB + +## 完整 Cube 流水线 + +```text +GM (ND) L1/cbuf (NZ) L0A/B (NZ) L0C (NZ) GM (ND) + +A[M,K] --mte_gm_l1_frac/mte_gm_l1--> K1 M1 M0 K0 --mte_l1_l0a--> K1 M1 M0 K0 -+ + +-MAD-> N1 M1 M0 N0 --> C[M,N] +B[K,N] --mte_gm_l1_frac/mte_gm_l1--> K1 N1 K0 N0 --mte_l1_l0b--> K1 N1 N0 K0 -+ + ^ + 必要时由 mte_l1_l0b 进行转置 + 不在 GM→L1 阶段做 +``` + +## 相关章节 + +- [Tile ISA:矩阵与矩阵-向量](../tile/matrix-and-matrix-vector_zh.md) — Tile 级矩阵指令 +- [标量 DMA Copy](../scalar/dma-copy_zh.md) — UB 侧分组 DMA 传输 +- [流水线同步](../scalar/ops/pipeline-sync/) — Cube / Vector 同步原语 diff --git a/docs/isa/cube/buffer-hierarchy.md b/docs/isa/cube/buffer-hierarchy.md new file mode 100644 index 000000000..cc2216087 --- /dev/null +++ b/docs/isa/cube/buffer-hierarchy.md @@ -0,0 +1,67 @@ +# Cube Buffer Hierarchy + +The cube core (AIC) operates on a dedicated buffer hierarchy distinct from the Unified Buffer (UB) that Vector blocks use. Cube operands move through `L1` (cbuf) → `L0A` / `L0B` → `L0C` → writeback, with optional `BT` (bias table) and `FB` (FIXPIPE buffer) helpers. 
+ +## Address Spaces + +| Space | Role | Layout | Typical Producer | Typical Consumer | +|-------|------|--------|------------------|------------------| +| `gm` | Global Memory (off-chip HBM/DDR) | ND row-major | host / kernel | DMA loaders | +| `l1` | Cube CBUF, ~1 MB on-chip | NZ fractal | `pto.mte_gm_l1`, `pto.mte_gm_l1_frac`, `pto.mte_ub_l1` | `pto.mte_l1_l0a`, `pto.mte_l1_l0b`, `pto.mte_l1_ub`, `pto.mte_l1_bt` | +| `l0a` | Cube left-operand scratchpad | FRACTAL_NZ (A5) / FRACTAL_ZZ (A3) | `pto.mte_l1_l0a` | `pto.mad*` | +| `l0b` | Cube right-operand scratchpad | FRACTAL_ZN (K innermost) | `pto.mte_l1_l0b` | `pto.mad*` | +| `l0c` | Cube accumulator | FRACTAL_NZ output of MMAD | `pto.mad*` | FIXPIPE writeback (`pto.mte_l0c_*`) | +| `bt` | Bias Table | element-type-matched vector | `pto.mte_l1_bt` | `pto.mad_bias`, `pto.mad_mx_bias` | +| `fb` | FIXPIPE auxiliary buffer | implementation-defined | `pto.mte_l1_fb` | FIXPIPE writeback ops | +| `ub` | Vector Unified Buffer | ND | DMA loaders | vector pipe | + +See [NZ Fractal Layout](./nz-fractal-layout.md) for the precise per-buffer NZ index orders. + +## Data-Flow Contract + +```text + +----------------- AIC issue queues -----------------+ + | MTE2 MTE1 CUBE (MMAD) FIXP | + | | | | | | +GM (ND) --- pto.mte_gm_l1 / pto.mte_gm_l1_frac | | + | | | + v v | + L1 (NZ) <-- pto.mte_ub_l1 --- UB | + | | + +------+-----+---------------------+ | + | | | | + mte_l1_l0a mte_l1_l0b mte_l1_bt / mte_l1_fb | + | | | | + v v | | + L0A L0B | | + | | | | + +-----+------+ | | + | | | + | pto.mad / pto.mad_acc / pto.mad_bias / *_mx* | + | <----------------------+ | + v | + L0C | + | | + +-- pto.mte_l0c_l1 / pto.mte_l0c_gm / pto.mte_l0c_ub ---+ + (FIXPIPE writeback) +``` + +## Alignment and Sizing Conventions + +- All cube buffer pointers (L1 / L0A / L0B / L0C / BT / FB) are 32-byte aligned. +- L0A and L0B fractal tiles are 512B (one 32B-wide × 16-row block in the appropriate inner orientation). 
+- L0C accumulator tiles use the `N1 M1 M0 N0` order so that FIXPIPE can stream out one M-row of results at a time. +- Element-type-derived inner widths (`K0 = N0 = C0 / sizeof(T)`) follow [NZ Fractal Layout](./nz-fractal-layout.md). + +## Synchronization + +The cube programs are issued from the AIC's Scalar Unit (SU) into the MTE2 / MTE1 / CUBE / FIXP issue queues. Synchronization with the Vector blocks happens through the System Controller (SC) semaphores and the dedicated 1:2 fixpipe broadcast path. See: + +- [Pipeline Synchronization](../scalar/ops/pipeline-sync/) for the intra-block (`pto.set_flag` / `pto.wait_flag`) primitives that order MTE2 → MTE1 → CUBE → FIXP within the AIC. +- [Cluster Programming Model](../machine-model/execution-agents.md) for inter-block (`pto.set_intra_block` / `pto.wait_intra_core`) primitives used between AIC and AIV. + +## Related Sections + +- [NZ Fractal Layout](./nz-fractal-layout.md) +- [FIXPIPE Model](./fixpipe-model.md) +- [Cube Data Movement Ops](./README.md#cube-data-movement-ops) diff --git a/docs/isa/cube/buffer-hierarchy_zh.md b/docs/isa/cube/buffer-hierarchy_zh.md new file mode 100644 index 000000000..993bc8b03 --- /dev/null +++ b/docs/isa/cube/buffer-hierarchy_zh.md @@ -0,0 +1,67 @@ +# Cube 缓冲层级 + +Cube core(AIC)操作的是一个独立于 Vector 块 UB 的专用缓冲层级。Cube 操作数依次经过 `L1`(cbuf)→ `L0A` / `L0B` → `L0C` → 回写,可选辅助缓冲为 `BT`(bias table)与 `FB`(FIXPIPE buffer)。 + +## 地址空间 + +| 空间 | 角色 | 布局 | 典型生产者 | 典型消费者 | +|------|------|------|------------|------------| +| `gm` | Global Memory(片外 HBM/DDR) | ND 行优先 | host / kernel | DMA 加载器 | +| `l1` | Cube CBUF,片上约 1 MB | NZ fractal | `pto.mte_gm_l1`、`pto.mte_gm_l1_frac`、`pto.mte_ub_l1` | `pto.mte_l1_l0a`、`pto.mte_l1_l0b`、`pto.mte_l1_ub`、`pto.mte_l1_bt` | +| `l0a` | Cube 左操作数暂存区 | FRACTAL_NZ(A5)/ FRACTAL_ZZ(A3) | `pto.mte_l1_l0a` | `pto.mad*` | +| `l0b` | Cube 右操作数暂存区 | FRACTAL_ZN(K 最内) | `pto.mte_l1_l0b` | `pto.mad*` | +| `l0c` | Cube 累加器 | MMAD 输出的 FRACTAL_NZ | `pto.mad*` | FIXPIPE 回写(`pto.mte_l0c_*`) 
| +| `bt` | Bias Table | 与元素类型匹配的向量 | `pto.mte_l1_bt` | `pto.mad_bias`、`pto.mad_mx_bias` | +| `fb` | FIXPIPE 辅助缓冲 | 实现相关 | `pto.mte_l1_fb` | FIXPIPE 回写指令 | +| `ub` | Vector Unified Buffer | ND | DMA 加载器 | vector 流水线 | + +各缓冲的精确 NZ 索引顺序见 [NZ Fractal 布局](./nz-fractal-layout_zh.md)。 + +## 数据流契约 + +```text + +----------------- AIC 发射队列 -----------------+ + | MTE2 MTE1 CUBE (MMAD) FIXP | + | | | | | | +GM (ND) --- pto.mte_gm_l1 / pto.mte_gm_l1_frac | | + | | | + v v | + L1 (NZ) <-- pto.mte_ub_l1 --- UB | + | | + +------+-----+---------------------+ | + | | | | + mte_l1_l0a mte_l1_l0b mte_l1_bt / mte_l1_fb | + | | | | + v v | | + L0A L0B | | + | | | | + +-----+------+ | | + | | | + | pto.mad / pto.mad_acc / pto.mad_bias / *_mx* | + | <----------------------+ | + v | + L0C | + | | + +-- pto.mte_l0c_l1 / pto.mte_l0c_gm / pto.mte_l0c_ub + + (FIXPIPE 回写) +``` + +## 对齐与尺寸约定 + +- 所有 cube 缓冲指针(L1 / L0A / L0B / L0C / BT / FB)都要求 32 字节对齐。 +- L0A 与 L0B 的 fractal tile 是 512B(一个 32B 宽 × 16 行的 block,按相应的内层朝向)。 +- L0C 累加器 tile 使用 `N1 M1 M0 N0` 顺序,方便 FIXPIPE 每次流式输出一行 M 维结果。 +- 按元素类型派生的内层宽度(`K0 = N0 = C0 / sizeof(T)`)遵循 [NZ Fractal 布局](./nz-fractal-layout_zh.md)。 + +## 同步 + +Cube 程序由 AIC 的 Scalar Unit(SU)发射到 MTE2 / MTE1 / CUBE / FIXP 各自的发射队列。与 Vector 块的同步通过 System Controller(SC)的信号量、以及专用 1:2 fixpipe 广播路径来实现。详见: + +- [流水线同步](../scalar/ops/pipeline-sync/):用于在 AIC 内对 MTE2 → MTE1 → CUBE → FIXP 排序的 `pto.set_flag` / `pto.wait_flag` 原语。 +- [Cluster 编程模型](../machine-model/execution-agents_zh.md):AIC 与 AIV 之间使用的跨块原语(`pto.set_intra_block` / `pto.wait_intra_core`)。 + +## 相关章节 + +- [NZ Fractal 布局](./nz-fractal-layout_zh.md) +- [FIXPIPE 模型](./fixpipe-model_zh.md) +- [Cube 数据搬运指令](./README_zh.md#cube-数据搬运指令) diff --git a/docs/isa/cube/fixpipe-model.md b/docs/isa/cube/fixpipe-model.md new file mode 100644 index 000000000..59544eb76 --- /dev/null +++ b/docs/isa/cube/fixpipe-model.md @@ -0,0 +1,64 @@ +# FIXPIPE Writeback Model + +`FIXPIPE` is the cube core's dedicated writeback / post-processing 
pipeline. It moves `L0C` accumulator results out to `L1`, `UB`, or `GM` while applying the layout conversion (NZ → ND) required by the destination, plus optional dequantization / scale / clip / activation post-processing. + +This page describes the FIXPIPE addressing model used by the three writeback ops: + +- [`pto.mte_l0c_l1`](./ops/data-movement/mte-l0c-l1.md) — L0C → L1 +- [`pto.mte_l0c_gm`](./ops/data-movement/mte-l0c-gm.md) — L0C → GM +- [`pto.mte_l0c_ub`](./ops/data-movement/mte-l0c-ub.md) — L0C → UB + +## Source Layout + +The L0C source tile is laid out as `N1 M1 M0 N0` (FRACTAL_NZ: col-major outer, row-major inner). FIXPIPE addresses one M-row of results at a time and emits them in the destination memory's natural order. + +## NZ → ND Conversion at Writeback + +For each cube fragment in L0C, FIXPIPE applies: + +```text +C_nz[n1][m1][m0][n0] --> C_nd[m1*M0 + m0][n1*N0 + n0] +``` + +The conversion is **fused** with the writeback — no separate explicit transpose step is required. Destination strides are expressed in ND coordinates on the FIXPIPE op. + +## Dual-Destination Broadcast (1 → 2 Cube-to-Vector) + +When the FIXPIPE destination is a Vector block UB, the cube can simultaneously broadcast to both AIV0 and AIV1 UB regions via the dedicated on-chip data path, with the tile split either along the row axis (`DualModeSplitM`) or the column axis (`DualModeSplitN`): + +| Split | AIV0 receives | AIV1 receives | +|-------|---------------|---------------| +| Split-M (rows) | Upper `[M/2, N]` in ND | Lower `[M/2, N]` in ND | +| Split-N (cols) | Left `[M, N/2]` in ND | Right `[M, N/2]` in ND | + +This 1→2 broadcast with in-hardware tile split is the architectural basis for 1:2 Cube-to-Vector tile distribution and is selected as an attribute on `pto.mte_l0c_ub`. 
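
The NZ → ND mapping and the two split modes described above can be modeled on the host. The following is a minimal sketch, not the hardware algorithm: `l0c_nz_to_nd`, `dual_split`, and the mode strings `"split_m"` / `"split_n"` are hypothetical names chosen for illustration, the toy inner sizes are far smaller than the real `M0 = 16`, and `M` and `N` are assumed to be even.

```python
def l0c_nz_to_nd(c_nz, m0_size, n0_size):
    """Apply C_nz[n1][m1][m0][n0] -> C_nd[m1*M0 + m0][n1*N0 + n0]."""
    n1_count, m1_count = len(c_nz), len(c_nz[0])
    c_nd = [[None] * (n1_count * n0_size) for _ in range(m1_count * m0_size)]
    for n1 in range(n1_count):
        for m1 in range(m1_count):
            for m0 in range(m0_size):
                for n0 in range(n0_size):
                    c_nd[m1 * m0_size + m0][n1 * n0_size + n0] = c_nz[n1][m1][m0][n0]
    return c_nd


def dual_split(c_nd, mode):
    """Return the (AIV0, AIV1) halves of an ND tile; mode names are illustrative."""
    rows, cols = len(c_nd), len(c_nd[0])
    if mode == "split_m":   # upper / lower [M/2, N]
        return c_nd[: rows // 2], c_nd[rows // 2 :]
    if mode == "split_n":   # left / right [M, N/2]
        return ([r[: cols // 2] for r in c_nd], [r[cols // 2 :] for r in c_nd])
    raise ValueError(f"unknown split mode: {mode}")


# Toy sizes for readability only (assumption); each element records its own (m, n).
M0, N0 = 2, 2
M1, N1 = 2, 2
c_nz = [[[[(m1 * M0 + m0, n1 * N0 + n0) for n0 in range(N0)]
          for m0 in range(M0)]
         for m1 in range(M1)]
        for n1 in range(N1)]

c_nd = l0c_nz_to_nd(c_nz, M0, N0)
aiv0_m, aiv1_m = dual_split(c_nd, "split_m")
aiv0_n, aiv1_n = dual_split(c_nd, "split_n")
```

Because every element carries its logical `(m, n)` coordinate, checking `c_nd[m][n] == (m, n)` is enough to confirm that the index mapping matches the formula in the text.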
+ +## Burst / Loop Model + +Like the [scalar DMA](../scalar/dma-copy.md) and [cube data-movement](./README.md#cube-data-movement-ops) ops, FIXPIPE writeback uses the grouped `nburst(...)` / `loop(...)` clause form to express row-stride and outer-stride repetition without external configuration registers. + +## Post-Processing Hooks + +FIXPIPE optionally applies the following post-processing along the writeback path, configured via clauses or auxiliary `FB` payload loaded by [`pto.mte_l1_fb`](./ops/data-movement/mte-l1-fb.md): + +- Dequantization (per-channel scale / zero-point) +- Clip / saturate to destination element-type range +- Activation (ReLU / clipped linear, target-defined) + +Per-op pages document which post-processing clauses each `pto.mte_l0c_*` variant accepts. + +## Synchronization Around FIXPIPE + +`FIXP` is one of the four AIC-side issue queues (alongside `MTE2`, `MTE1`, `CUBE`). The standard producer / consumer chain is: + +```text +CUBE (pto.mad*) --set_flag(CUBE -> FIXP)--> FIXP (pto.mte_l0c_*) --> L1 / UB / GM +``` + +After a `pto.mad*` finishes a tile in L0C, the producer must `set_flag` from `PIPE_CUBE` to `PIPE_FIXP` (using one of the configured event IDs); the FIXPIPE consumer issues a matching `wait_flag` before issuing `pto.mte_l0c_*` against the same L0C tile. Failure to synchronize results in a read of in-flight L0C state and is a verifier error. 
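
The producer / consumer ordering above can be illustrated with a host-side analogy. This sketch uses Python threads and a `threading.Event` purely as stand-ins: it is not the PTO API, the names `cube_pipe`, `fixp_pipe`, and `cube_to_fixp` are hypothetical, and real hardware flags are not Python events.

```python
import threading

l0c = {}                             # stand-in for one L0C tile
cube_to_fixp = threading.Event()     # stand-in for one CUBE -> FIXP event ID
written = []                         # what the FIXPIPE stand-in wrote out


def cube_pipe():
    l0c["tile"] = [1, 2, 3, 4]       # pto.mad* finishes the tile in L0C
    cube_to_fixp.set()               # analogous to set_flag(PIPE_CUBE -> PIPE_FIXP)


def fixp_pipe():
    cube_to_fixp.wait()              # analogous to the matching wait_flag
    written.append(list(l0c["tile"]))  # pto.mte_l0c_* reads the finished tile


consumer = threading.Thread(target=fixp_pipe)
producer = threading.Thread(target=cube_pipe)
consumer.start()                     # consumer is started first on purpose
producer.start()
consumer.join()
producer.join()
```

Starting the consumer first shows why the wait matters: without it, the read could run before the tile exists, which corresponds to reading in-flight L0C state.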
+ +## Related Sections + +- [NZ Fractal Layout](./nz-fractal-layout.md) +- [Buffer Hierarchy](./buffer-hierarchy.md) +- [Pipeline Synchronization](../scalar/ops/pipeline-sync/) diff --git a/docs/isa/cube/fixpipe-model_zh.md b/docs/isa/cube/fixpipe-model_zh.md new file mode 100644 index 000000000..e00a9edf6 --- /dev/null +++ b/docs/isa/cube/fixpipe-model_zh.md @@ -0,0 +1,64 @@ +# FIXPIPE 回写模型 + +`FIXPIPE` 是 cube 专用的回写 / 后处理流水线。它把 `L0C` 累加器结果搬到 `L1`、`UB` 或 `GM`,同时把目标所要求的布局变换(NZ → ND)一起做完,并可选地附加反量化 / scale / clip / 激活等后处理。 + +本页描述以下三条回写指令共享的 FIXPIPE 地址模型: + +- [`pto.mte_l0c_l1`](./ops/data-movement/mte-l0c-l1_zh.md):L0C → L1 +- [`pto.mte_l0c_gm`](./ops/data-movement/mte-l0c-gm_zh.md):L0C → GM +- [`pto.mte_l0c_ub`](./ops/data-movement/mte-l0c-ub_zh.md):L0C → UB + +## 源布局 + +L0C 源 tile 布局为 `N1 M1 M0 N0`(FRACTAL_NZ:外层列优先、内层行优先)。FIXPIPE 每次寻址一行 M 维结果,并按目标内存的自然顺序输出。 + +## 回写阶段的 NZ → ND 转换 + +对 L0C 中的每个 cube 片段,FIXPIPE 应用: + +```text +C_nz[n1][m1][m0][n0] --> C_nd[m1*M0 + m0][n1*N0 + n0] +``` + +转换与回写**融合**完成——不需要单独的显式转置。目标 stride 在 FIXPIPE 指令上以 ND 坐标表达。 + +## 双目标广播(1 → 2 Cube-to-Vector) + +当 FIXPIPE 目标是 Vector 块的 UB 时,cube 可以通过专用片上数据通路同时广播到 AIV0 与 AIV1 的 UB 区域,按行轴(`DualModeSplitM`)或列轴(`DualModeSplitN`)切分 tile: + +| 切分 | AIV0 接收 | AIV1 接收 | +|------|-----------|-----------| +| Split-M(按行) | ND 中上半 `[M/2, N]` | ND 中下半 `[M/2, N]` | +| Split-N(按列) | ND 中左半 `[M, N/2]` | ND 中右半 `[M, N/2]` | + +这种 1→2 在硬件中带 tile 切分的广播,是 1:2 Cube-to-Vector tile 分发的架构基础,在 `pto.mte_l0c_ub` 上通过属性选择。 + +## Burst / Loop 模型 + +与 [标量 DMA](../scalar/dma-copy_zh.md) 和 [cube 数据搬运](./README_zh.md#cube-数据搬运指令) 一样,FIXPIPE 回写使用内联 `nburst(...)` / `loop(...)` 子句,无需外部配置寄存器。 + +## 后处理 hook + +FIXPIPE 可以在回写路径上沿途应用以下后处理,通过 clause 或 [`pto.mte_l1_fb`](./ops/data-movement/mte-l1-fb_zh.md) 加载的 `FB` payload 配置: + +- 反量化(每通道 scale / zero-point) +- Clip / 饱和到目标元素类型范围 +- 激活(ReLU / clipped linear,目标相关) + +每条 `pto.mte_l0c_*` 的 per-op 页面会说明它支持哪些后处理 clause。 + +## FIXPIPE 周围的同步 + +`FIXP` 是 AIC 侧四条发射队列之一(其余三条是 
`MTE2`、`MTE1`、`CUBE`)。标准的生产者 / 消费者链是: + +```text +CUBE (pto.mad*) --set_flag(CUBE -> FIXP)--> FIXP (pto.mte_l0c_*) --> L1 / UB / GM +``` + +`pto.mad*` 完成一个 L0C tile 后,生产者必须通过其中一个事件 ID 把 `set_flag` 从 `PIPE_CUBE` 发到 `PIPE_FIXP`;FIXPIPE 消费者在对同一个 L0C tile 发出 `pto.mte_l0c_*` 之前必须发出匹配的 `wait_flag`。否则会读到尚未提交的 L0C 状态,verifier 会报错。 + +## 相关章节 + +- [NZ Fractal 布局](./nz-fractal-layout_zh.md) +- [缓冲层级](./buffer-hierarchy_zh.md) +- [流水线同步](../scalar/ops/pipeline-sync/) diff --git a/docs/isa/cube/nz-fractal-layout.md b/docs/isa/cube/nz-fractal-layout.md new file mode 100644 index 000000000..1a73a6ed8 --- /dev/null +++ b/docs/isa/cube/nz-fractal-layout.md @@ -0,0 +1,85 @@ +# NZ Fractal Layout + +The cube's internal buffers (`L1` / `cbuf`, `L0A`, `L0B`, `L0C`) all use a **fractal NZ layout** rather than row-major ND. Understanding NZ layout is essential when authoring cube data-movement ops or reasoning about MAD operand organization. + +## Definition + +Given the hardware constant `C0 = 32 bytes`, for an element type with byte width `E = sizeof(T)`: + +- Inner tile width: `K0 = N0 = C0 / E` (for example, `K0 = 16` for `f16` and `bf16`; `K0 = 8` for `f32`) +- Inner tile height: `M0 = 16` + +NZ re-indexing for a logical `[M, K]` tensor: + +```text +NZ index: (k1, m1, m0, k0) + where k1 = k / K0, k0 = k % K0 + m1 = m / M0, m0 = m % M0 +Physical layout: K1 x M1 x M0 x K0 (last dimension contiguous) +``` + +The same outer / inner factorization is applied to `[K, N]` tensors, swapping the inner-width axis. 
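
The re-indexing above can be written as a linear element offset. This is a minimal sketch under stated assumptions: `nz_offset` is a hypothetical helper, not part of PTO, and `M` is assumed to be padded up to a multiple of `M0`.

```python
C0_BYTES = 32   # hardware constant C0 from the definition above
M0 = 16         # inner tile height


def nz_offset(m, k, m_extent, elem_bytes):
    """Element offset of logical [m, k] in a K1 x M1 x M0 x K0 buffer.

    Assumes m_extent is padded up to a multiple of M0 (illustrative choice).
    """
    k0_size = C0_BYTES // elem_bytes        # K0 = C0 / sizeof(T)
    m1_count = -(-m_extent // M0)           # ceil(M / M0)
    k1, k0 = divmod(k, k0_size)
    m1, m0 = divmod(m, M0)
    # Last dimension (k0) is contiguous; k1 is the outermost dimension.
    return ((k1 * m1_count + m1) * M0 + m0) * k0_size + k0


# f16 example: a 32 x 32 tensor maps onto offsets 0 .. 1023 exactly once.
offsets_f16 = sorted(nz_offset(m, k, 32, 2) for m in range(32) for k in range(32))
```

For `f16`, moving one step along `k` inside a `K0` group advances the offset by one element, while moving one step along `m` advances it by `K0 = 16` elements.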
+ +## Per-Buffer NZ Layouts + +| Buffer | Logical shape | Physical NZ layout | Notes | +|--------|---------------|--------------------|-------| +| L1 (cbuf) — Tensor A | `[M, K]` | `K1 M1 M0 K0` | Row-major A staged into NZ layout | +| L1 (cbuf) — Tensor B | `[K, N]` | `K1 N1 K0 N0` | Row-major B staged into NZ layout | +| L0A (left operand) | — | `K1 M1 M0 K0` | FRACTAL_NZ on A5 / FRACTAL_ZZ on A3: same NZ order as L1 cbuf | +| L0B (right operand) | — | `K1 N1 N0 K0` | FRACTAL_ZN: row-major outer, col-major inner (K0 innermost) | +| L0C (accumulator) | `[M, N]` | `N1 M1 M0 N0` | Output of MMAD (FRACTAL_NZ: col-major outer, row-major inner) | + +## Why K-Innermost on L0B? + +The cube reduction axis is `K`. L0B requires K innermost (`K1 N1 N0 K0`) so the cube hardware reads all `K0` elements per cycle without striding. + +The inner-box transpose is performed as part of the [`pto.mte_l1_l0b`](./ops/data-movement/mte-l1-l0b.md) structured right-load movement itself; no separate user-visible pass is required. Each 512B fractal Z-block is permuted as it moves from L1 to L0B. 
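
The inner-box transpose from `K1 N1 K0 N0` to `K1 N1 N0 K0` can be modeled with nested lists. This is a sketch of the index permutation only, assuming toy inner sizes; `l1_to_l0b` is a hypothetical helper and says nothing about how the hardware schedules the move.

```python
def l1_to_l0b(b_l1):
    """b_l1[k1][n1][k0][n0] -> b_l0b[k1][n1][n0][k0] (transpose each inner box)."""
    return [
        [
            [list(col) for col in zip(*box)]   # one K0 x N0 box becomes N0 x K0
            for box in n1_blocks
        ]
        for n1_blocks in b_l1
    ]


# Toy sizes for readability only (assumption); each element records its own (k, n).
K0, N0 = 2, 2
K1, N1 = 2, 2
b_l1 = [[[[(k1 * K0 + k0, n1 * N0 + n0) for n0 in range(N0)]
          for k0 in range(K0)]
         for n1 in range(N1)]
        for k1 in range(K1)]

b_l0b = l1_to_l0b(b_l1)
```

After the permutation, the `K0` elements that share one `n` sit next to each other in the innermost dimension, which is the property the paragraph above relies on.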
+ +## Data Flow: GM → L1 → L0A/B → L0C + +```text ++------------------------------------------------------------------------------+ +| GEMM Data Layout: GM -> L1 (NZ) -> L0A/B -> L0C | ++------------------------------------------------------------------------------+ + +STEP 1 - Global Memory (ND, row-major) +-------------------------------------- + Tensor A [M, K] Tensor B [K, N] + (K is the contiguous axis) (N is the contiguous axis) + Physical: A[m*K + k] Physical: B[k*N + n] + +STEP 2 - GM -> L1 (cbuf): ND-to-NZ fractal repack +------------------------------------------------- + A in L1: K1 x M1 x M0 x K0 B in L1: K1 x N1 x K0 x N0 + For each outer block (k1, m1): For each outer block (k1, n1): + inner is M0 rows x K0 cols inner is K0 rows x N0 cols + (16x16 elems contiguous) (16x16 elems contiguous) + Physical: A_nz[k1][m1][m0][k0] Physical: B_nz[k1][n1][k0][n0] + +STEP 3 - L1 -> L0A / L0B +-------------------------- + L0A: cbuf K1 M1 M0 K0 --mte_l1_l0a--> L0A K1 M1 M0 K0 (FRACTAL_NZ on A5) + L0B: cbuf K1 N1 K0 N0 --mte_l1_l0b--> L0B K1 N1 N0 K0 (FRACTAL_ZN, K0 innermost) + +STEP 4 - MAD: L0A x L0B -> L0C +------------------------------- + dst[m, n] = sum k in 0..K-1: lhs[m, k] * rhs[k, n] + L0C layout: N1 M1 M0 N0 + +STEP 5 - L0C writeback (FIXPIPE) +--------------------------------- + FIXPIPE MTE ops (mte_l0c_l1 / mte_l0c_gm / mte_l0c_ub) convert the L0C NZ + result to the requested destination layout (typically ND) and memory space. +``` + +## Authoring Guidance + +When the source GEMM operand is already in a transposed logical layout, express that at the structured load level (`pto.mte_l1_l0a` / `pto.mte_l1_l0b`) instead of relying on a later reinterpretation of the same bytes. Operating on a reinterpreted NZ buffer with the wrong outer / inner factorization is a verifier error and a common source of correctness bugs. 
+ +## Related Sections + +- [Buffer Hierarchy](./buffer-hierarchy.md) +- [FIXPIPE Model](./fixpipe-model.md) +- [Cube MAD Ops](./README.md#matrix-multiply-mad-ops) +- [Cube Data Movement Ops](./README.md#cube-data-movement-ops) diff --git a/docs/isa/cube/nz-fractal-layout_zh.md b/docs/isa/cube/nz-fractal-layout_zh.md new file mode 100644 index 000000000..ec4bf5ba7 --- /dev/null +++ b/docs/isa/cube/nz-fractal-layout_zh.md @@ -0,0 +1,81 @@ +# NZ Fractal 布局 + +Cube 的内部缓冲(`L1` / `cbuf`、`L0A`、`L0B`、`L0C`)都使用 **fractal NZ 布局**,而不是行优先 ND。理解 NZ 布局是编写 cube 数据搬运指令、推理 MAD 操作数组织的前提。 + +## 定义 + +给定硬件常数 `C0 = 32 字节`,对元素字节宽度为 `E = sizeof(T)` 的类型: + +- 内层 tile 宽度:`K0 = N0 = C0 / E`(例如 `f16` / `bf16` 是 `K0 = 16`;`f32` 是 `K0 = 8`) +- 内层 tile 高度:`M0 = 16` + +逻辑 `[M, K]` 张量的 NZ 重新索引: + +```text +NZ 索引:(k1, m1, m0, k0) + 其中 k1 = k / K0, k0 = k % K0 + m1 = m / M0, m0 = m % M0 +物理布局:K1 x M1 x M0 x K0(最后一维连续) +``` + +对 `[K, N]` 张量做同样的外/内层分解,只是内层宽度轴换成 `N0`。 + +## 各缓冲的 NZ 布局 + +| 缓冲 | 逻辑形状 | 物理 NZ 布局 | 备注 | +|------|----------|--------------|------| +| L1(cbuf)- 张量 A | `[M, K]` | `K1 M1 M0 K0` | 行优先 A 被打入 NZ 布局 | +| L1(cbuf)- 张量 B | `[K, N]` | `K1 N1 K0 N0` | 行优先 B 被打入 NZ 布局 | +| L0A(左操作数) | — | `K1 M1 M0 K0` | A5 上为 FRACTAL_NZ / A3 上为 FRACTAL_ZZ:与 L1 cbuf 同 NZ 顺序 | +| L0B(右操作数) | — | `K1 N1 N0 K0` | FRACTAL_ZN:外层行优先,内层列优先(K0 最内) | +| L0C(累加器) | `[M, N]` | `N1 M1 M0 N0` | MMAD 输出(FRACTAL_NZ:外层列优先、内层行优先) | + +## 为什么 L0B 必须 K-innermost? 
+ +Cube 的归约轴是 `K`。L0B 要求 K 在最内层(`K1 N1 N0 K0`),这样 cube 硬件每个 cycle 都能读到完整的 `K0` 个元素而不跨 stride。 + +从 L1 的 `K1 N1 K0 N0` 到 L0B 的 `K1 N1 N0 K0` 的内层 box 转置是由 [`pto.mte_l1_l0b`](./ops/data-movement/mte-l1-l0b_zh.md) 这条结构化右侧加载指令在搬运过程中完成的,用户层面看不到额外的转置 pass。从 L1 搬到 L0B 时,每个 512B fractal Z-block 都会被原位置换。 + +## 数据流:GM → L1 → L0A/B → L0C + +```text ++------------------------------------------------------------------------------+ +| GEMM 数据布局:GM -> L1 (NZ) -> L0A/B -> L0C | ++------------------------------------------------------------------------------+ + +步骤 1 — Global Memory(ND,行优先) +------------------------------------- + 张量 A [M, K] 张量 B [K, N] + 物理:A[m*K + k] 物理:B[k*N + n] + +步骤 2 — GM -> L1(cbuf):ND→NZ fractal 重排 +--------------------------------------------- + L1 中的 A:K1 x M1 x M0 x K0 L1 中的 B:K1 x N1 x K0 x N0 + 物理:A_nz[k1][m1][m0][k0] 物理:B_nz[k1][n1][k0][n0] + +步骤 3 — L1 -> L0A / L0B +-------------------------- + L0A:cbuf K1 M1 M0 K0 --mte_l1_l0a--> L0A K1 M1 M0 K0 (A5 上为 FRACTAL_NZ) + L0B:cbuf K1 N1 K0 N0 --mte_l1_l0b--> L0B K1 N1 N0 K0 (FRACTAL_ZN, K0 最内) + +步骤 4 — MAD:L0A x L0B -> L0C +------------------------------- + dst[m, n] = sum k in 0..K-1: lhs[m, k] * rhs[k, n] + L0C 布局:N1 M1 M0 N0 + +步骤 5 — L0C 回写(FIXPIPE) +----------------------------- + FIXPIPE MTE 指令(mte_l0c_l1 / mte_l0c_gm / mte_l0c_ub)把 L0C NZ 结果转换为 + 所需的目标布局(通常是 ND),并写入指定的内存空间。 +``` + +## 编写指引 + +当源 GEMM 操作数本身已经是某种已转置的逻辑布局时,应该在结构化加载层(`pto.mte_l1_l0a` / `pto.mte_l1_l0b`)显式表达这一点,不要寄希望于事后对同一段字节做不同的 NZ 解释。用错误的外/内层分解去操作一块 NZ 缓冲是 verifier 错误,也是最常见的正确性 bug 来源之一。 + +## 相关章节 + +- [缓冲层级](./buffer-hierarchy_zh.md) +- [FIXPIPE 模型](./fixpipe-model_zh.md) +- [Cube MAD 指令](./README_zh.md#矩阵乘加mad指令) +- [Cube 数据搬运指令](./README_zh.md#cube-数据搬运指令) diff --git a/docs/isa/cube/ops/data-movement/mte-gm-l1-frac.md b/docs/isa/cube/ops/data-movement/mte-gm-l1-frac.md new file mode 100644 index 000000000..4bce9fa51 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-gm-l1-frac.md @@ -0,0 +1,94 @@ +# pto.mte_gm_l1_frac + 
+`pto.mte_gm_l1_frac` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Load a logical 2-D GM region and write one or more L1 **NZ fractal** matrix groups. `nd2nz` reads a logical `src[n, d]` matrix; `dn2nz` reads a logical `src[d, n]` matrix and writes the same logical `N x D` result into NZ layout. + +This is the canonical entry point for staging row-major or column-major GM operands into the cube's [NZ Fractal Layout](../../nz-fractal-layout.md). After `pto.mte_gm_l1_frac`, the L1 tile can feed [`pto.mte_l1_l0a`](./mte-l1-l0a.md) / [`pto.mte_l1_l0b`](./mte-l1-l0b.md). + +## Mechanism + +Reference addressing: + +```text +for g in 0 .. group_count-1: + src_g = src + g * src_outer_stride + dst_g = dst + g * dst_loop4_stride * 32 + + for n in 0 .. n_value-1: + for d in 0 .. d_value-1: + if mode == nd2nz: + value = load(src_g + n * src_inner_stride + d * sizeof(T)) + else: + value = load(src_g + d * src_inner_stride + n * sizeof(T)) + store value into NZ position for logical [n, d] under dst_g + + invalid lanes in the final C0 group are written as zero +``` + +## Syntax + +```mlir +pto.mte_gm_l1_frac %src, %dst, nd2nz|dn2nz, + shape(%n_value, %d_value), + src_layout(%src_inner_stride[, %src_outer_stride]), + dst_group(%group_count, %dst_loop2_stride, %dst_loop3_stride, %dst_loop4_stride), + ctrl(%l2_cache_ctrl, %smallc0_en) + : !pto.ptr, !pto.ptr, ... 
+``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | GM source base pointer | +| `%dst` | ptr | L1 NZ destination base pointer (`!pto.ptr`) | +| `nd2nz` / `dn2nz` | keyword | Source logical layout mode | +| `shape(%n_value, %d_value)` | i64 pair | Logical output shape before NZ packing | +| `src_layout(%src_inner_stride[, %src_outer_stride])` | i64 / optional i64 | Source row/matrix byte strides | +| `dst_group(...)` | i64 tuple | Destination group count and placement strides in C0-size units (1 unit = 32 bytes) | +| `ctrl(%l2_cache_ctrl, %smallc0_en)` | i64, i1 | Cache hint and small-C0 packing enable | + +`src_layout(%src_inner_stride)` describes one logical source matrix. For `nd2nz`, `%src_inner_stride` is the byte distance from `src[n, 0]` to `src[n + 1, 0]`. For `dn2nz`, it is the byte distance from `src[d, 0]` to `src[d + 1, 0]`. When `%src_outer_stride` is present, it is the byte distance between adjacent source matrices; omitted means 0. + +`dst_group(%group_count, %dst_loop2_stride, %dst_loop3_stride, %dst_loop4_stride)` writes `%group_count` logical matrices. Destination strides are measured in C0-size units. These strides place generated NZ blocks relative to `%dst`; they do not select a separate memory block. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes one or more NZ matrix groups into L1. | + +## Side Effects + +Reads GM-visible storage; writes L1-visible storage. Engages the AIC MTE2 pipe. + +## Constraints + +!!! warning "Constraints" + - Source strides are bytes. For row-major 16×16 f16 input, `src_layout(32)` describes consecutive rows. + - Destination strides are C0-size units, **not** bytes and **not** elements. + - `smallc0_en = true` is valid only for target-supported small-C0 cases. The current contract rejects `d_value > 4` in small-C0 mode. + - In normal C0 mode, each destination C0 burst is padded to 32 bytes. 
In small-C0 mode, each destination burst is padded to 4 logical channels; the generated inner-N and C0 destination placement is fixed by that small-C0 packing rule. `%dst_loop4_stride` still places adjacent matrix groups. + - In small-C0 mode, missing logical `N` rows and invalid `D` lanes are written as zero, and the tail of a generated NZ matrix is padded to the 32-byte C0 boundary. + - Destination regions selected by `%dst` and `dst_group(...)` must not overlap. If two generated writes target the same bytes, the final value is not a stable program result. + +## Examples + +```mlir +pto.mte_gm_l1_frac %src, %dst, nd2nz, + shape(%c32_i64, %c16_i64), + src_layout(%c32_i64, %c1024_i64), + dst_group(%c2_i64, %c1_i64, %c16_i64, %c64_i64), + ctrl(%c0_i64, %false) + : !pto.ptr, !pto.ptr, nd2nz, shape i64, i64, + src_layout(i64, i64), dst_group i64, i64, i64, i64, ctrl i64, i1 +``` + +## Related Ops + +- Direct GM→L1 copy (no repack): [pto.mte_gm_l1](./mte-gm-l1.md) +- Consume the NZ tile: [pto.mte_l1_l0a](./mte-l1-l0a.md), [pto.mte_l1_l0b](./mte-l1-l0b.md) +- Layout reference: [NZ Fractal Layout](../../nz-fractal-layout.md) diff --git a/docs/isa/cube/ops/data-movement/mte-gm-l1-frac_zh.md b/docs/isa/cube/ops/data-movement/mte-gm-l1-frac_zh.md new file mode 100644 index 000000000..3eff8af70 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-gm-l1-frac_zh.md @@ -0,0 +1,94 @@ +# pto.mte_gm_l1_frac + +`pto.mte_gm_l1_frac` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +加载逻辑 2D GM 区域并向 L1 写入一个或多个 **NZ fractal** 矩阵组。`nd2nz` 读取一个逻辑 `src[n, d]` 矩阵;`dn2nz` 读取一个逻辑 `src[d, n]` 矩阵,并把相同的逻辑 `N x D` 结果按 NZ 布局写出。 + +这是把行优先 / 列优先 GM 操作数装入 cube 的 [NZ Fractal 布局](../../nz-fractal-layout_zh.md) 的标准入口。`pto.mte_gm_l1_frac` 之后,L1 tile 即可被 [`pto.mte_l1_l0a`](./mte-l1-l0a_zh.md) / [`pto.mte_l1_l0b`](./mte-l1-l0b_zh.md) 消费。 + +## 机制 + +参考地址: + +```text +for g in 0 .. group_count-1: + src_g = src + g * src_outer_stride + dst_g = dst + g * dst_loop4_stride * 32 + + for n in 0 .. 
n_value-1: + for d in 0 .. d_value-1: + if mode == nd2nz: + value = load(src_g + n * src_inner_stride + d * sizeof(T)) + else: + value = load(src_g + d * src_inner_stride + n * sizeof(T)) + store value into NZ position for logical [n, d] under dst_g + + 最后一个 C0 组中的无效 lane 写 0 +``` + +## 语法 + +```mlir +pto.mte_gm_l1_frac %src, %dst, nd2nz|dn2nz, + shape(%n_value, %d_value), + src_layout(%src_inner_stride[, %src_outer_stride]), + dst_group(%group_count, %dst_loop2_stride, %dst_loop3_stride, %dst_loop4_stride), + ctrl(%l2_cache_ctrl, %smallc0_en) + : !pto.ptr, !pto.ptr, ... +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | GM 源基址 | +| `%dst` | ptr | L1 NZ 目标基址(`!pto.ptr`) | +| `nd2nz` / `dn2nz` | 关键字 | 源逻辑布局模式 | +| `shape(%n_value, %d_value)` | i64 对 | NZ 打包前的逻辑输出形状 | +| `src_layout(%src_inner_stride[, %src_outer_stride])` | i64 / 可选 i64 | 源行 / 矩阵字节步长 | +| `dst_group(...)` | i64 元组 | 目标组数与放置步长,单位是 C0(1 个 C0 = 32 字节) | +| `ctrl(%l2_cache_ctrl, %smallc0_en)` | i64, i1 | Cache 提示与 small-C0 打包使能 | + +`src_layout(%src_inner_stride)` 描述一个逻辑源矩阵。对 `nd2nz` 而言,`%src_inner_stride` 是从 `src[n, 0]` 到 `src[n + 1, 0]` 的字节距离;对 `dn2nz` 而言,则是从 `src[d, 0]` 到 `src[d + 1, 0]` 的字节距离。`%src_outer_stride` 表示相邻源矩阵之间的字节距离;不写时为 0。 + +`dst_group(%group_count, %dst_loop2_stride, %dst_loop3_stride, %dst_loop4_stride)` 写出 `%group_count` 个逻辑矩阵。目标步长以 C0 为单位。这些步长把生成的 NZ block 放置在相对 `%dst` 的位置,并不切换到独立的内存块。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把一个或多个 NZ 矩阵组写入 L1。 | + +## 副作用 + +读 GM 可见存储,写 L1 可见存储。占用 AIC MTE2 流水线。 + +## 约束 + +!!! 
warning "约束" + - 源步长是字节。对行优先 16×16 f16 输入,`src_layout(32)` 描述连续的行。 + - 目标步长是 C0,**不是**字节,**不是**元素。 + - `smallc0_en = true` 仅在目标支持的 small-C0 场景下合法;当前契约在 small-C0 模式拒绝 `d_value > 4`。 + - 普通 C0 模式下,每个目标 C0 burst 会被填充到 32 字节;small-C0 模式下,每个目标 burst 会被填充到 4 个逻辑通道,所生成的内层 N 与 C0 放置位置由 small-C0 打包规则固定。`%dst_loop4_stride` 仍然负责放置相邻矩阵组。 + - small-C0 模式下,缺失的逻辑 `N` 行与无效 `D` lane 写 0;生成的 NZ 矩阵尾部按 32 字节 C0 边界做填充。 + - `%dst` 与 `dst_group(...)` 选中的目标区域不得重叠;如果两次写命中相同字节,最终结果不是稳定的程序结果。 + +## 示例 + +```mlir +pto.mte_gm_l1_frac %src, %dst, nd2nz, + shape(%c32_i64, %c16_i64), + src_layout(%c32_i64, %c1024_i64), + dst_group(%c2_i64, %c1_i64, %c16_i64, %c64_i64), + ctrl(%c0_i64, %false) + : !pto.ptr, !pto.ptr, nd2nz, shape i64, i64, + src_layout(i64, i64), dst_group i64, i64, i64, i64, ctrl i64, i1 +``` + +## 相关指令 + +- 直接 GM→L1 拷贝(不重排):[pto.mte_gm_l1](./mte-gm-l1_zh.md) +- 消费该 NZ tile:[pto.mte_l1_l0a](./mte-l1-l0a_zh.md)、[pto.mte_l1_l0b](./mte-l1-l0b_zh.md) +- 布局参考:[NZ Fractal 布局](../../nz-fractal-layout_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-gm-l1.md b/docs/isa/cube/ops/data-movement/mte-gm-l1.md new file mode 100644 index 000000000..0918a3241 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-gm-l1.md @@ -0,0 +1,63 @@ +# pto.mte_gm_l1 + +`pto.mte_gm_l1` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Structured GM→L1 (cube CBUF) copy. Copies grouped byte ranges from `%src` in GM to `%dst` in L1 without performing any layout transform — the source bytes are written to L1 verbatim. + +Use [`pto.mte_gm_l1_frac`](./mte-gm-l1-frac.md) when the source is row-major ND data that needs ND→NZ fractal repack before it can serve as a cube operand. + +## Mechanism + +Like the scalar [`pto.mte_gm_ub`](../../../scalar/ops/dma-copy/copy-gm-to-ubuf.md), this op uses the grouped `nburst(...) [loop(...)]*` model. For each `nburst` row, source and destination advance by `src_stride` / `dst_stride`. 
Optional outer `loop(...)` groups wrap the inner transfer. + +## Syntax + +```mlir +pto.mte_gm_l1 %src, %dst, %len_burst + nburst(%count, %src_stride, %dst_stride) + [loop(%count_i, %src_stride_i, %dst_stride_i)]* + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | GM source base pointer | +| `%dst` | ptr | L1 destination base pointer (`!pto.ptr`) | +| `%len_burst` | i64 | Bytes copied per burst row | +| `nburst(%count, %src_stride, %dst_stride)` | i64 triple | Innermost burst count and byte strides between row starts | +| `loop(%count_i, %src_stride_i, %dst_stride_i)` | i64 triple | Optional outer repetition; byte advances between enclosed patterns | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes data into the L1 destination region. | + +## Side Effects + +Reads GM-visible storage; writes L1-visible storage. Engages the AIC MTE2 pipe. + +## Constraints + +!!! warning "Constraints" + - `nburst(...)` is required. + - Each `loop(...)` group must provide all three operands. + - All strides are bytes. For a contiguous 16-element f16 vector, use `%len_burst = 32`. 
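
The grouped `nburst(...) [loop(...)]*` addressing described above can be sketched as a small Python reference model. This is an illustrative sketch under stated assumptions, not the hardware or compiler implementation: the name `mte_gm_l1_model` and the use of `bytes` / `bytearray` to stand in for GM and L1 are hypothetical, and the model assumes each `loop(...)` group simply adds `index * stride` to the `nburst` row offsets.

```python
from itertools import product


def mte_gm_l1_model(src, dst, len_burst, nburst, loops=()):
    """Copy grouped byte ranges from src to dst with no layout transform.

    nburst and every entry of loops are (count, src_stride, dst_stride)
    triples with strides in bytes. Assumption: offsets from all levels
    add linearly, so nesting order does not change which bytes are copied.
    """
    levels = [nburst, *loops]
    for idx in product(*(range(count) for count, _, _ in levels)):
        src_off = sum(i * s for i, (_, s, _) in zip(idx, levels))
        dst_off = sum(i * d for i, (_, _, d) in zip(idx, levels))
        # Each burst row is copied verbatim.
        dst[dst_off:dst_off + len_burst] = src[src_off:src_off + len_burst]
    return dst


# Mirrors the documented example: %len_burst = 32, nburst(4, 64, 32).
src = bytes(range(256))
dst = bytearray(128)
mte_gm_l1_model(src, dst, 32, (4, 64, 32))
```

With these operands the four 32-byte rows that start at source offsets 0, 64, 128, and 192 land back to back in the destination, which is the usual way to compact a padded GM buffer into L1.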
+ +## Examples + +```mlir +pto.mte_gm_l1 %bias_gm, %l1_bias, %c32_i64 + nburst(%c4_i64, %c64_i64, %c32_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Related Ops + +- ND→NZ repack: [pto.mte_gm_l1_frac](./mte-gm-l1-frac.md) +- L1 → UB: [pto.mte_l1_ub](./mte-l1-ub.md) +- L1 → cube operand tiles: [pto.mte_l1_l0a](./mte-l1-l0a.md), [pto.mte_l1_l0b](./mte-l1-l0b.md) diff --git a/docs/isa/cube/ops/data-movement/mte-gm-l1_zh.md b/docs/isa/cube/ops/data-movement/mte-gm-l1_zh.md new file mode 100644 index 000000000..25568cb58 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-gm-l1_zh.md @@ -0,0 +1,63 @@ +# pto.mte_gm_l1 + +`pto.mte_gm_l1` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +结构化 GM→L1(cube CBUF)拷贝。把 `%src`(GM)中的分组字节区间原样写入 `%dst`(L1),不做任何布局变换——源字节按字面顺序写入 L1。 + +如果源是行优先 ND 数据,需要在送给 cube 之前做 ND→NZ fractal 重排,请用 [`pto.mte_gm_l1_frac`](./mte-gm-l1-frac_zh.md)。 + +## 机制 + +与标量 [`pto.mte_gm_ub`](../../../scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md) 一样,使用分组 `nburst(...) [loop(...)]*` 模型。每个 `nburst` 行结束后,源/目标按 `src_stride` / `dst_stride` 前进;可选的外层 `loop(...)` 把内层传输打包。 + +## 语法 + +```mlir +pto.mte_gm_l1 %src, %dst, %len_burst + nburst(%count, %src_stride, %dst_stride) + [loop(%count_i, %src_stride_i, %dst_stride_i)]* + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | GM 源基址 | +| `%dst` | ptr | L1 目标基址(`!pto.ptr`) | +| `%len_burst` | i64 | 每个 burst 行复制的字节数 | +| `nburst(%count, %src_stride, %dst_stride)` | i64 三元组 | 最内层 burst 个数与字节步长 | +| `loop(%count_i, %src_stride_i, %dst_stride_i)` | i64 三元组 | 可选外层重复;字节步长 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把数据写入 L1 目标区域。 | + +## 副作用 + +读 GM 可见存储,写 L1 可见存储。占用 AIC MTE2 流水线。 + +## 约束 + +!!! 
warning "约束" + - `nburst(...)` 必须存在。 + - 每个 `loop(...)` 子句出现时必须给出完整三元组。 + - 所有步长以字节为单位。对一个连续 16 元素的 f16 向量,使用 `%len_burst = 32`。 + +## 示例 + +```mlir +pto.mte_gm_l1 %bias_gm, %l1_bias, %c32_i64 + nburst(%c4_i64, %c64_i64, %c32_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 相关指令 + +- ND→NZ 重排:[pto.mte_gm_l1_frac](./mte-gm-l1-frac_zh.md) +- L1 → UB:[pto.mte_l1_ub](./mte-l1-ub_zh.md) +- L1 → cube 操作数 tile:[pto.mte_l1_l0a](./mte-l1-l0a_zh.md)、[pto.mte_l1_l0b](./mte-l1-l0b_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l0c-gm.md b/docs/isa/cube/ops/data-movement/mte-l0c-gm.md new file mode 100644 index 000000000..5c3448406 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l0c-gm.md @@ -0,0 +1,73 @@ +# pto.mte_l0c_gm + +`pto.mte_l0c_gm` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). It is one of the three FIXPIPE writeback ops; see [FIXPIPE Model](../../fixpipe-model.md) for the shared writeback pipeline. + +## Summary + +FIXPIPE writeback from `l0c` to GM. The data transform clauses match [`pto.mte_l0c_l1`](./mte-l0c-l1.md); GM-specific operands select the GM write path and optional atomic update behavior. + +## Syntax + +```mlir +pto.mte_l0c_gm %src, %dst, %m, %n, %src_stride, %dst_stride, %sid, %l2_cache_ctrl + [, unit_flag(check_only | check_and_clear)]? + [, pre_quant(%payload, mode = )]? + [, pre_relu([%payload, ]mode = [, clip = %clip])]? + [, nz2nd | nz2dn(%loop0_src_stride) | nz2nz(%split)?] + [, loop3(%count, %src_stride3, %dst_stride3)]? + [, sat | sat(preserve_nan) | nosat]? + [, atomic(type = , op = )]? + : ... 
+``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src`, `%m`, `%n`, `%src_stride` | — | Same as [`pto.mte_l0c_l1`](./mte-l0c-l1.md#inputs) | +| `%dst` | buffer-like | GM destination | +| `%dst_stride` | i64 | GM destination stride in destination elements | +| `%sid` | i64 | GM stream/session hint; does not change written values | +| `%l2_cache_ctrl` | i64 | GM store cache hint; does not change written values | +| `atomic(type = ..., op = ...)` | clause | Optional GM read-modify-write | +| other optional clauses | — | Same as [`pto.mte_l0c_l1`](./mte-l0c-l1.md#syntax) | + +`%sid` and `%l2_cache_ctrl` affect the memory path only — they do not change the logical result, destination layout, numeric conversion, or atomic operation. For target-profile GM writeback, constant `%sid` values must be in `[0, 3]` (use `0` unless the surrounding memory system deliberately assigns a different stream/session hint). Constant `%l2_cache_ctrl` values must fit in the target cache-control hint range `[0, 15]`. + +`atomic(type = T, op = add|max|min)` performs an atomic read-modify-write at each GM destination element. `add` accumulates the converted value into the existing GM value. `max` and `min` compare using `T` and write the selected value. Supported atomic types: `f32`, `f16`, `bf16`, `s32`, `s16`, `s8`. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes converted `M x N` result to GM. | + +## Side Effects + +Reads L0C; writes GM. Engages the AIC FIXP pipe. If `atomic(...)` is present, the GM update is read-modify-write. + +## Constraints + +!!! warning "Constraints" + - `atomic(...)` is valid only on `pto.mte_l0c_gm`. + - `atomic` requires both `type` and `op`. + - Atomic op values are `add`, `max`, and `min`. + - If `%sid` or `%l2_cache_ctrl` is a constant, it must be in the target range described above. + - Other constraints match [`pto.mte_l0c_l1`](./mte-l0c-l1.md#constraints). 
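
The hint ranges and the per-element `atomic(...)` behavior listed above can be sketched in Python. This is a minimal model under stated assumptions: `check_gm_hints` and `atomic_writeback` are hypothetical helper names, GM is modeled as a plain list, and the numeric conversion to the atomic type `T` is ignored, so only the read-modify-write shape is shown.

```python
SID_RANGE = range(0, 4)             # constant %sid must be in [0, 3]
L2_CACHE_CTRL_RANGE = range(0, 16)  # constant %l2_cache_ctrl must be in [0, 15]

ATOMIC_OPS = {
    "add": lambda old, new: old + new,
    "max": max,
    "min": min,
}


def check_gm_hints(sid, l2_cache_ctrl):
    """Reject constant hint values outside the documented target ranges."""
    if sid not in SID_RANGE:
        raise ValueError(f"%sid {sid} is outside [0, 3]")
    if l2_cache_ctrl not in L2_CACHE_CTRL_RANGE:
        raise ValueError(f"%l2_cache_ctrl {l2_cache_ctrl} is outside [0, 15]")


def atomic_writeback(gm, values, op):
    """Combine each converted value with the existing GM element in place."""
    if op not in ATOMIC_OPS:
        raise ValueError(f"unsupported atomic op: {op}")
    combine = ATOMIC_OPS[op]
    for i, new in enumerate(values):
        gm[i] = combine(gm[i], new)  # read-modify-write per element
    return gm


check_gm_hints(0, 0)
accumulated = atomic_writeback([1.0, 5.0, -2.0], [2.0, 3.0, -4.0], "add")
```

The hints are checked separately from the update because, per the text above, `%sid` and `%l2_cache_ctrl` affect only the memory path and never the written values.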
+ +## Examples + +```mlir +pto.mte_l0c_gm %l0c, %out, %c16_i64, %c32_i64, %c16_i64, %c32_i64, + %c0_i64, %c0_i64, + pre_quant(%c1_f32, mode = qf322f16_pre_scalar), + nz2nd, + atomic(type = f16, op = add) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i64, f32 +``` + +## Related Ops + +- FIXPIPE writeback siblings: [pto.mte_l0c_l1](./mte-l0c-l1.md), [pto.mte_l0c_ub](./mte-l0c-ub.md) +- Parameter payload loader: [pto.mte_l1_fb](./mte-l1-fb.md) +- MAD producers: [pto.mad](../mad/mad.md) and variants diff --git a/docs/isa/cube/ops/data-movement/mte-l0c-gm_zh.md b/docs/isa/cube/ops/data-movement/mte-l0c-gm_zh.md new file mode 100644 index 000000000..f9a5d1544 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l0c-gm_zh.md @@ -0,0 +1,73 @@ +# pto.mte_l0c_gm + +`pto.mte_l0c_gm` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。它是三条 FIXPIPE 回写指令之一;共享的回写流水线见 [FIXPIPE 模型](../../fixpipe-model_zh.md)。 + +## 摘要 + +把 L0C 中的结果 FIXPIPE 回写到 GM。数据变换 clauses 与 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md) 完全一致;GM 特有的操作数用于选择 GM 写路径及可选的原子更新行为。 + +## 语法 + +```mlir +pto.mte_l0c_gm %src, %dst, %m, %n, %src_stride, %dst_stride, %sid, %l2_cache_ctrl + [, unit_flag(check_only | check_and_clear)]? + [, pre_quant(%payload, mode = )]? + [, pre_relu([%payload, ]mode = [, clip = %clip])]? + [, nz2nd | nz2dn(%loop0_src_stride) | nz2nz(%split)?] + [, loop3(%count, %src_stride3, %dst_stride3)]? + [, sat | sat(preserve_nan) | nosat]? + [, atomic(type = , op = )]? + : ... 
+``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src`、`%m`、`%n`、`%src_stride` | — | 同 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md#输入) | +| `%dst` | buffer-like | GM 目标 | +| `%dst_stride` | i64 | GM 目标步长,单位为目标元素 | +| `%sid` | i64 | GM stream/session 提示,不影响写入数值 | +| `%l2_cache_ctrl` | i64 | GM store cache 提示,不影响写入数值 | +| `atomic(type = ..., op = ...)` | clause | 可选 GM 读-改-写 | +| 其它可选 clauses | — | 同 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md#语法) | + +`%sid` 与 `%l2_cache_ctrl` 仅影响内存路径,不会改变逻辑结果、目标布局、数值转换或原子操作。对当前 profile,常量 `%sid` 必须在 `[0, 3]`(若内存系统没有特别配置,使用 `0` 即可)。常量 `%l2_cache_ctrl` 必须在目标 cache 控制范围 `[0, 15]` 内。 + +`atomic(type = T, op = add|max|min)` 在每个 GM 目标元素上执行原子读-改-写。`add` 把转换后的值累加到 GM 原值上;`max` 与 `min` 按类型 `T` 比较后写入较大/较小者。支持的原子类型:`f32`、`f16`、`bf16`、`s32`、`s16`、`s8`。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把转换后的 `M x N` 结果写到 GM。 | + +## 副作用 + +读 L0C,写 GM。占用 AIC FIXP 流水线。带 `atomic(...)` 时,GM 更新是原子读-改-写。 + +## 约束 + +!!! warning "约束" + - `atomic(...)` 仅在 `pto.mte_l0c_gm` 上有效。 + - `atomic` 必须同时提供 `type` 和 `op`。 + - 原子 op 取值为 `add`、`max`、`min`。 + - `%sid` 或 `%l2_cache_ctrl` 若为常量,必须落在上文给出的目标取值范围内。 + - 其它约束同 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md#约束)。 + +## 示例 + +```mlir +pto.mte_l0c_gm %l0c, %out, %c16_i64, %c32_i64, %c16_i64, %c32_i64, + %c0_i64, %c0_i64, + pre_quant(%c1_f32, mode = qf322f16_pre_scalar), + nz2nd, + atomic(type = f16, op = add) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i64, f32 +``` + +## 相关指令 + +- FIXPIPE 回写兄弟指令:[pto.mte_l0c_l1](./mte-l0c-l1_zh.md)、[pto.mte_l0c_ub](./mte-l0c-ub_zh.md) +- 参数 payload 装载:[pto.mte_l1_fb](./mte-l1-fb_zh.md) +- MAD 生产者:[pto.mad](../mad/mad_zh.md) 及其变体 diff --git a/docs/isa/cube/ops/data-movement/mte-l0c-l1.md b/docs/isa/cube/ops/data-movement/mte-l0c-l1.md new file mode 100644 index 000000000..e0bbb7317 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l0c-l1.md @@ -0,0 +1,78 @@ +# pto.mte_l0c_l1 + +`pto.mte_l0c_l1` is part of [Cube Data Movement 
Ops](../../README.md#cube-data-movement-ops). It is one of the three FIXPIPE writeback ops; see [FIXPIPE Model](../../fixpipe-model.md) for the shared writeback pipeline, layout modes, and clause semantics. + +## Summary + +FIXPIPE writeback from `l0c` to L1 `l1`. Applies optional pre-quant, pre-ReLU/clip, layout transform, outer-loop repeat, and saturation behavior in canonical order before storing the converted result to L1. + +## Syntax + +```mlir +pto.mte_l0c_l1 %src, %dst, %m, %n, %src_stride, %dst_stride + [, unit_flag(check_only | check_and_clear)]? + [, pre_quant(%payload, mode = )]? + [, pre_relu([%payload, ]mode = [, clip = %clip])]? + [, nz2nd | nz2dn(%loop0_src_stride) | nz2nz(%split)?] + [, loop3(%count, %src_stride3, %dst_stride3)]? + [, sat | sat(preserve_nan) | nosat]? + : ... +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | buffer-like | Accumulator source in `l0c` | +| `%dst` | buffer-like | L1 destination in `l1` | +| `%m` | i64 | Logical M element count | +| `%n` | i64 | Logical N element count | +| `%src_stride` | i64 | Source stride in C0-size units (1 unit = 32 bytes) | +| `%dst_stride` | i64 | Destination stride in destination elements | + +See [FIXPIPE Common Clauses](../../fixpipe-model.md#fixpipe-common-clauses) and [FIXPIPE Layout Model](../../fixpipe-model.md#fixpipe-layout-model) for the optional clauses. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes converted `M x N` result to L1. | + +## Side Effects + +Reads L0C; writes L1. Engages the AIC FIXP pipe. Consumers in L1 must synchronize through pipe events. + +## Constraints + +!!! warning "Constraints" + - Clauses must appear in canonical order: `unit_flag` → `pre_quant` → `pre_relu` → layout → `loop3` → `sat`/`nosat`. + - `pre_quant` requires payload and mode together. + - Vector `pre_quant` modes require a `fb` pointer with `f16`, `bf16`, or `f32` element type. 
+ - Scalar `pre_quant` modes require an `f16`, `bf16`, or `f32` scalar payload. + - `pre_quant` source element type must be `f32` or `i32`, and the selected mode must be compatible with the source and destination element types. + - `no_relu` and `normal_relu` do not accept a payload. + - `scalar_relu` requires an `f16`/`bf16`/`f32` scalar payload. + - `vector_relu` requires a `fb` pointer with `f16`/`bf16`/`f32` element type. + - `clip` can appear only inside `pre_relu(...)`. + - `clip` is supported for destination `f16`, `ui8`, and signed/signless 4/8/16-bit integer destinations; payload must match the destination family. + - `nz2dn` requires `%loop0_src_stride`; `nz2nd` and `nz2nz` do not accept it. + - `unit_flag` must be omitted when `nz2dn(%loop0_src_stride)` uses a value other than 1. + - `nz2nz` requires `f32` destination element type and does not accept `loop3`. + - `sat`, `sat(preserve_nan)`, and `nosat` are mutually exclusive. + +## Examples + +```mlir +pto.mte_l0c_l1 %l0c, %l1_out, %c16_i64, %c32_i64, %c16_i64, %c32_i64, + pre_quant(%c1_f32, mode = qf322f16_pre_scalar), + pre_relu(%c025_f32, mode = scalar_relu), + nz2nd, + sat + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, f32, f32 +``` + +## Related Ops + +- FIXPIPE writeback siblings: [pto.mte_l0c_gm](./mte-l0c-gm.md), [pto.mte_l0c_ub](./mte-l0c-ub.md) +- Parameter payload loader: [pto.mte_l1_fb](./mte-l1-fb.md) +- MAD producers: [pto.mad](../mad/mad.md) and variants diff --git a/docs/isa/cube/ops/data-movement/mte-l0c-l1_zh.md b/docs/isa/cube/ops/data-movement/mte-l0c-l1_zh.md new file mode 100644 index 000000000..570ee9da7 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l0c-l1_zh.md @@ -0,0 +1,78 @@ +# pto.mte_l0c_l1 + +`pto.mte_l0c_l1` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。它是三条 FIXPIPE 回写指令之一;共享的回写流水线、布局模式与 clause 语义见 [FIXPIPE 模型](../../fixpipe-model_zh.md)。 + +## 摘要 + +把 L0C 中的结果 FIXPIPE 回写到 L1 `l1`。在写到 L1 前按规范顺序依次应用可选的 pre-quant、pre-ReLU/clip、布局变换、外层 loop3 重复以及饱和行为。 + +## 语法 + 
+```mlir +pto.mte_l0c_l1 %src, %dst, %m, %n, %src_stride, %dst_stride + [, unit_flag(check_only | check_and_clear)]? + [, pre_quant(%payload, mode = )]? + [, pre_relu([%payload, ]mode = [, clip = %clip])]? + [, nz2nd | nz2dn(%loop0_src_stride) | nz2nz(%split)?] + [, loop3(%count, %src_stride3, %dst_stride3)]? + [, sat | sat(preserve_nan) | nosat]? + : ... +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | buffer-like | L0C 中的累加器源 | +| `%dst` | buffer-like | L1 目标 | +| `%m` | i64 | 逻辑 M 元素数 | +| `%n` | i64 | 逻辑 N 元素数 | +| `%src_stride` | i64 | 源步长,单位 C0(1 个 C0 = 32 字节) | +| `%dst_stride` | i64 | 目标步长,单位为目标元素 | + +可选 clauses 见 [FIXPIPE 通用 Clauses](../../fixpipe-model_zh.md) 与 [FIXPIPE 布局模型](../../fixpipe-model_zh.md)。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把转换后的 `M x N` 结果写到 L1。 | + +## 副作用 + +读 L0C,写 L1。占用 AIC FIXP 流水线。L1 中的下游消费者需通过 pipe 事件同步。 + +## 约束 + +!!! warning "约束" + - Clauses 必须按规范顺序:`unit_flag` → `pre_quant` → `pre_relu` → layout → `loop3` → `sat`/`nosat`。 + - `pre_quant` 要求 payload 与 mode 一起出现。 + - 向量 `pre_quant` 模式要求 `fb` 指针的元素类型为 `f16`、`bf16` 或 `f32`。 + - 标量 `pre_quant` 模式要求 `f16`、`bf16` 或 `f32` 的标量 payload。 + - `pre_quant` 的源元素类型必须是 `f32` 或 `i32`,所选 mode 必须兼容源与目标元素类型。 + - `no_relu` 与 `normal_relu` 不接受 payload。 + - `scalar_relu` 要求 `f16` / `bf16` / `f32` 标量 payload。 + - `vector_relu` 要求 `fb` 指针,元素类型 `f16` / `bf16` / `f32`。 + - `clip` 只能出现在 `pre_relu(...)` 里。 + - `clip` 支持的目标包括 `f16`、`ui8` 以及有符号/无符号 4/8/16 位整型;payload 类型必须匹配目标家族。 + - `nz2dn` 需要 `%loop0_src_stride`;`nz2nd` 与 `nz2nz` 不接受该参数。 + - 当 `nz2dn(%loop0_src_stride)` 取值不为 1 时,必须省略 `unit_flag`。 + - `nz2nz` 要求 `f32` 目标元素类型,且不接受 `loop3`。 + - `sat`、`sat(preserve_nan)`、`nosat` 三者互斥。 + +## 示例 + +```mlir +pto.mte_l0c_l1 %l0c, %l1_out, %c16_i64, %c32_i64, %c16_i64, %c32_i64, + pre_quant(%c1_f32, mode = qf322f16_pre_scalar), + pre_relu(%c025_f32, mode = scalar_relu), + nz2nd, + sat + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, f32, f32 +``` + +## 相关指令 + 
+- FIXPIPE 回写兄弟指令:[pto.mte_l0c_gm](./mte-l0c-gm_zh.md)、[pto.mte_l0c_ub](./mte-l0c-ub_zh.md) +- 参数 payload 装载:[pto.mte_l1_fb](./mte-l1-fb_zh.md) +- MAD 生产者:[pto.mad](../mad/mad_zh.md) 及其变体 diff --git a/docs/isa/cube/ops/data-movement/mte-l0c-ub.md b/docs/isa/cube/ops/data-movement/mte-l0c-ub.md new file mode 100644 index 000000000..8c0ca67cf --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l0c-ub.md @@ -0,0 +1,75 @@ +# pto.mte_l0c_ub + +`pto.mte_l0c_ub` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). It is one of the three FIXPIPE writeback ops; see [FIXPIPE Model](../../fixpipe-model.md) for the shared writeback pipeline. This is also the architectural basis for [1→2 Cube-to-Vector tile distribution](../../fixpipe-model.md#dual-destination-broadcast-1--2-cube-to-vector). + +## Summary + +FIXPIPE writeback from `l0c` to UB. The data transform clauses match [`pto.mte_l0c_l1`](./mte-l0c-l1.md); UB-specific operands select single-destination or dual-destination (split-M / split-N) behavior. + +## Syntax + +```mlir +pto.mte_l0c_ub %src, %dst, %m, %n, %src_stride, %dst_stride, + dst_mode(%sub_blockid | split_m | split_n) + [, unit_flag(check_only | check_and_clear)]? + [, pre_quant(%payload, mode = )]? + [, pre_relu([%payload, ]mode = [, clip = %clip])]? + [, nz2nd | nz2dn(%loop0_src_stride) | nz2nz(%split)?] + [, loop3(%count, %src_stride3, %dst_stride3)]? + [, sat | sat(preserve_nan) | nosat]? + : ... +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src`, `%m`, `%n`, `%src_stride` | — | Same as [`pto.mte_l0c_l1`](./mte-l0c-l1.md#inputs) | +| `%dst` | buffer-like | UB destination | +| `%dst_stride` | i64 | UB destination stride in destination elements | +| `dst_mode(%sub_blockid)` | i64 operand | Single-destination mode. `%sub_blockid` selects UB sub-block `0` or `1`; the value may be dynamic. 
| +| `dst_mode(split_m)` | keyword | Dual-destination mode that splits the logical tile along M. | +| `dst_mode(split_n)` | keyword | Dual-destination mode that splits the logical tile along N. | +| other optional clauses | — | Same as [`pto.mte_l0c_l1`](./mte-l0c-l1.md); `atomic(...)` is **not** supported | + +In `dst_mode(%sub_blockid)`, the whole logical result tile is written to the selected UB sub-block using the selected layout mode and `%dst` as that sub-block's base destination pointer. + +In `dst_mode(split_m)`, the logical tile is split into two M ranges: `[0, m/2)` and `[m/2, m)`. The first range is written to UB sub-block 0 and the second range is written to UB sub-block 1. Each sub-block sees its own destination origin at `%dst`; within each sub-block, the written logical tile has shape `(m / 2) x n`. + +In `dst_mode(split_n)`, the logical tile is split into two N ranges: `[0, n/2)` and `[n/2, n)`. The first range is written to UB sub-block 0 and the second range is written to UB sub-block 1. Each sub-block sees its own destination origin at `%dst`; within each sub-block, the written logical tile has shape `m x (n / 2)`. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes converted `M x N` result to UB, possibly split between AIV0 / AIV1 sub-blocks. | + +## Side Effects + +Reads L0C; writes UB. Engages the AIC FIXP pipe and (for dual-destination modes) the dedicated 1→2 cube-to-vector data path. UB-side consumers on the AIV blocks must synchronize via cross-block sema primitives. + +## Constraints + +!!! warning "Constraints" + - `atomic(...)` is not supported on `pto.mte_l0c_ub`. + - `dst_mode(%sub_blockid)` writes the whole logical tile to one UB sub-block. Runtime `%sub_blockid` values must be `0` or `1`; constant values are checked statically when available. + - `dst_mode(split_m)` splits the logical tile along M into two equal-height sub-block regions. 
`%m` must be even; each sub-block receives an `(m / 2) x n` tile. + - `dst_mode(split_n)` splits the logical tile along N into two equal-width sub-block regions. `%n` must be a multiple of 32; each sub-block receives an `m x (n / 2)` tile. + - Dual-destination split modes are valid only for target-supported normal or `nz2nd` writeback cases with pre-quant, pre-ReLU/clip, and other transform clauses omitted. + - Other constraints match [`pto.mte_l0c_l1`](./mte-l0c-l1.md#constraints). + +## Examples + +```mlir +pto.mte_l0c_ub %l0c, %ub_out, %c16_i64, %c32_i64, %c16_i64, %c32_i64, + dst_mode(%c1_i64), + nz2nd + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64 +``` + +## Related Ops + +- FIXPIPE writeback siblings: [pto.mte_l0c_l1](./mte-l0c-l1.md), [pto.mte_l0c_gm](./mte-l0c-gm.md) +- Parameter payload loader: [pto.mte_l1_fb](./mte-l1-fb.md) +- MAD producers: [pto.mad](../mad/mad.md) and variants +- Cluster broadcast model: [FIXPIPE Model — Dual-Destination Broadcast](../../fixpipe-model.md#dual-destination-broadcast-1--2-cube-to-vector) diff --git a/docs/isa/cube/ops/data-movement/mte-l0c-ub_zh.md b/docs/isa/cube/ops/data-movement/mte-l0c-ub_zh.md new file mode 100644 index 000000000..d048a8db0 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l0c-ub_zh.md @@ -0,0 +1,75 @@ +# pto.mte_l0c_ub + +`pto.mte_l0c_ub` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。它是三条 FIXPIPE 回写指令之一;共享的回写流水线见 [FIXPIPE 模型](../../fixpipe-model_zh.md)。这是 [1→2 Cube-to-Vector tile 分发](../../fixpipe-model_zh.md#双目标广播1--2-cube-to-vector) 的架构基础。 + +## 摘要 + +把 L0C 中的结果 FIXPIPE 回写到 UB。数据变换 clauses 与 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md) 完全一致;UB 特有的操作数用于选择单目标或双目标(split-M / split-N)行为。 + +## 语法 + +```mlir +pto.mte_l0c_ub %src, %dst, %m, %n, %src_stride, %dst_stride, + dst_mode(%sub_blockid | split_m | split_n) + [, unit_flag(check_only | check_and_clear)]? + [, pre_quant(%payload, mode = )]? + [, pre_relu([%payload, ]mode = [, clip = %clip])]? 
+ [, nz2nd | nz2dn(%loop0_src_stride) | nz2nz(%split)?] + [, loop3(%count, %src_stride3, %dst_stride3)]? + [, sat | sat(preserve_nan) | nosat]? + : ... +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src`、`%m`、`%n`、`%src_stride` | — | 同 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md#输入) | +| `%dst` | buffer-like | UB 目标 | +| `%dst_stride` | i64 | UB 目标步长,单位为目标元素 | +| `dst_mode(%sub_blockid)` | i64 operand | 单目标模式。`%sub_blockid` 选择 UB 子块 `0` 或 `1`;该值可以动态。 | +| `dst_mode(split_m)` | keyword | 双目标模式,沿 M 切分逻辑 tile。 | +| `dst_mode(split_n)` | keyword | 双目标模式,沿 N 切分逻辑 tile。 | +| 其它可选 clauses | — | 同 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md);**不**支持 `atomic(...)`。 | + +`dst_mode(%sub_blockid)` 时,整个逻辑结果 tile 都写到选中的 UB 子块;`%dst` 是该子块的基址。 + +`dst_mode(split_m)` 时,逻辑 tile 沿 M 维切成两段 `[0, m/2)` 和 `[m/2, m)`,分别写到 UB 子块 0 和子块 1。每个子块都看到自己的目标原点 `%dst`,每个子块内写入的逻辑 tile 形状为 `(m / 2) x n`。 + +`dst_mode(split_n)` 时,逻辑 tile 沿 N 维切成两段 `[0, n/2)` 和 `[n/2, n)`,分别写到 UB 子块 0 和子块 1。每个子块都看到自己的目标原点 `%dst`,每个子块内写入的逻辑 tile 形状为 `m x (n / 2)`。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把转换后的 `M x N` 结果写到 UB;双目标模式下可同时写到 AIV0 / AIV1 子块。 | + +## 副作用 + +读 L0C,写 UB。占用 AIC FIXP 流水线;双目标模式下还会使用专用 1→2 cube-to-vector 数据通路。AIV 块上的 UB 消费者需要通过跨块信号量原语同步。 + +## 约束 + +!!! 
warning "约束" + - `pto.mte_l0c_ub` 不支持 `atomic(...)`。 + - `dst_mode(%sub_blockid)` 把整个逻辑 tile 写到一个 UB 子块。运行时 `%sub_blockid` 必须为 `0` 或 `1`;常量值在编译期检查。 + - `dst_mode(split_m)` 沿 M 切分成两个等高子块区域;`%m` 必须是偶数,每个子块收到 `(m / 2) x n` tile。 + - `dst_mode(split_n)` 沿 N 切分成两个等宽子块区域;`%n` 必须是 32 的倍数,每个子块收到 `m x (n / 2)` tile。 + - 双目标 split 模式仅在目标支持的 normal 或 `nz2nd` 回写场景下有效,并且必须省略 pre-quant、pre-ReLU/clip 等数据变换 clauses。 + - 其它约束同 [`pto.mte_l0c_l1`](./mte-l0c-l1_zh.md#约束)。 + +## 示例 + +```mlir +pto.mte_l0c_ub %l0c, %ub_out, %c16_i64, %c32_i64, %c16_i64, %c32_i64, + dst_mode(%c1_i64), + nz2nd + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64 +``` + +## 相关指令 + +- FIXPIPE 回写兄弟指令:[pto.mte_l0c_l1](./mte-l0c-l1_zh.md)、[pto.mte_l0c_gm](./mte-l0c-gm_zh.md) +- 参数 payload 装载:[pto.mte_l1_fb](./mte-l1-fb_zh.md) +- MAD 生产者:[pto.mad](../mad/mad_zh.md) 及其变体 +- Cluster 广播模型:[FIXPIPE 模型 — 双目标广播](../../fixpipe-model_zh.md#双目标广播1--2-cube-to-vector) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-bt.md b/docs/isa/cube/ops/data-movement/mte-l1-bt.md new file mode 100644 index 000000000..4fc8cf096 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-bt.md @@ -0,0 +1,59 @@ +# pto.mte_l1_bt + +`pto.mte_l1_bt` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Load an L1 bias payload into the `bt` address space for later [`pto.mad_bias`](../mad/mad-bias.md) or [`pto.mad_mx_bias`](../mad/mad-mx-bias.md) consumption. The consumer interprets the result as an `N`-element bias vector `bias[n]`. + +## Mechanism + +One burst loads `%len_burst` bias-load units from `%src` and writes the corresponding bias values to `%dst`. After each burst except the last, source and destination advance by the burst length plus the corresponding gap. Each unit is the bias-element width for the configured type pair. 
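
The burst-plus-gap advance described above can be sketched as a small Python helper. This is an illustrative model only: `burst_starts` and `mte_l1_bt_offsets` are hypothetical names, and all offsets are counted in bias-load units rather than bytes.

```python
def burst_starts(len_burst, count, gap):
    """Start offset of each burst, in bias-load units.

    After every burst except the last, the pointer advances by the
    burst length plus the gap, so burst i starts at i * (len_burst + gap).
    """
    return [i * (len_burst + gap) for i in range(count)]


def mte_l1_bt_offsets(len_burst, count, src_gap, dst_gap):
    """Pair the source and destination start offsets of every burst."""
    return list(zip(burst_starts(len_burst, count, src_gap),
                    burst_starts(len_burst, count, dst_gap)))


# len_burst = 1 with nburst(4, 0, 0): four back-to-back single-unit bursts.
dense = mte_l1_bt_offsets(1, 4, 0, 0)
# A source gap of 1 unit skips one unit between 2-unit bursts.
gapped = mte_l1_bt_offsets(2, 3, 1, 0)
```

Note that a gap is measured from the end of one burst to the start of the next, so the equivalent start-to-start stride is `len_burst + gap`.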
+ +## Syntax + +```mlir +pto.mte_l1_bt %src, %dst, %len_burst + nburst(%count, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | L1 source pointer in `l1` | +| `%dst` | ptr | Bias destination pointer in `bt` | +| `%len_burst` | i64 | Number of bias-load units per burst | +| `%count` | i64 | Burst count | +| `%src_gap` | i64 | Source gap between bursts, in bias-load units | +| `%dst_gap` | i64 | Destination gap between bursts, in bias-load units | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes bias values into the `bt` destination region. | + +## Side Effects + +Reads L1-visible storage; writes BT-visible storage. The result is consumed by `pto.mad_bias` / `pto.mad_mx_bias` later in the cube pipeline. + +## Constraints + +!!! warning "Constraints" + - Supported type pairs: `f32 -> f32`, `i32 -> i32`, `f16 -> f32`, `bf16 -> f32`. + - For `bf16 -> f32`, compact bf16 source values are always widened to f32 bias values. For `f16 -> f32`, compact f16 source values are widened when the load is used as an f32 bias payload; otherwise the f16 payload is stored in the 32-bit bias slot with unused high bits. + - Load exactly the channel bias values needed by the consumer tile; the bias payload is not result-shaped. 
+ +## Examples + +```mlir +pto.mte_l1_bt %l1_bias, %bt, %c1_i64 nburst(%c4_i64, %c0_i64, %c0_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Related Ops + +- Bias-init MAD: [pto.mad_bias](../mad/mad-bias.md), [pto.mad_mx_bias](../mad/mad-mx-bias.md) +- FIXPIPE auxiliary payload: [pto.mte_l1_fb](./mte-l1-fb.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-bt_zh.md b/docs/isa/cube/ops/data-movement/mte-l1-bt_zh.md new file mode 100644 index 000000000..37ed89a02 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-bt_zh.md @@ -0,0 +1,59 @@ +# pto.mte_l1_bt + +`pto.mte_l1_bt` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +把 L1 中的 bias payload 加载到 `bt` 地址空间,供后续 [`pto.mad_bias`](../mad/mad-bias_zh.md) / [`pto.mad_mx_bias`](../mad/mad-mx-bias_zh.md) 消费。消费者把结果解读为 `N` 元素的 bias 向量 `bias[n]`。 + +## 机制 + +一次 burst 从 `%src` 读 `%len_burst` 个 bias-load 单位,写到 `%dst`。除最后一次外,每次 burst 之后源/目标按 burst 长度加上对应 gap 前进。每个单位的宽度由配置的类型对决定。 + +## 语法 + +```mlir +pto.mte_l1_bt %src, %dst, %len_burst + nburst(%count, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | L1 源指针,位于 `l1` | +| `%dst` | ptr | bias 目标指针,位于 `bt` | +| `%len_burst` | i64 | 每个 burst 加载的 bias-load 单位数 | +| `%count` | i64 | Burst 数量 | +| `%src_gap` | i64 | 跨 burst 的源间隔(bias-load 单位) | +| `%dst_gap` | i64 | 跨 burst 的目标间隔(bias-load 单位) | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把 bias 值写入 `bt` 目标区域。 | + +## 副作用 + +读 L1,写 BT。结果由后续的 `pto.mad_bias` / `pto.mad_mx_bias` 消费。 + +## 约束 + +!!! 
warning "约束" + - 支持的类型对:`f32 -> f32`、`i32 -> i32`、`f16 -> f32`、`bf16 -> f32`。 + - 对 `bf16 -> f32`,bf16 源会被恒定地扩展到 f32 bias 值;对 `f16 -> f32`,作为 f32 bias 使用时会扩展,否则 f16 payload 写入 32-bit bias slot,高位未用。 + - 只加载消费 tile 实际需要的通道 bias;bias payload 不是 result-shaped。 + +## 示例 + +```mlir +pto.mte_l1_bt %l1_bias, %bt, %c1_i64 nburst(%c4_i64, %c0_i64, %c0_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 相关指令 + +- 偏置初始化 MAD:[pto.mad_bias](../mad/mad-bias_zh.md)、[pto.mad_mx_bias](../mad/mad-mx-bias_zh.md) +- FIXPIPE 辅助 payload:[pto.mte_l1_fb](./mte-l1-fb_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-fb.md b/docs/isa/cube/ops/data-movement/mte-l1-fb.md new file mode 100644 index 000000000..cf6aace2b --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-fb.md @@ -0,0 +1,60 @@ +# pto.mte_l1_fb + +`pto.mte_l1_fb` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Load FIXPIPE parameter payloads from L1 into `fb`. Vector `pre_quant(...)` and `pre_relu(...)` clauses in the `pto.mte_l0c_*` writeback family later consume these payloads through `fb` pointers. See [FIXPIPE Model](../../fixpipe-model.md) for the writeback pipeline context. + +## Mechanism + +One burst loads `%len_burst` parameter-load units from `%src` to `%dst`. The copy unit is the parameter-load unit of this op — separate from the row size consumed by `mte_l0c_*` vector payloads. `%len_burst` and the `nburst(...)` gaps are counted in these load units, not in bytes and not in destination elements. + +After `pto.mte_l1_fb` materializes the payload in `fb`, vector pre-ReLU consumers read it as 64B parameter rows and vector pre-quant consumers read it as 128B parameter rows. The payload pointer passed to `mte_l0c_*` must point at the first row for the logical output tile, and rows must follow the same channel/NZ order consumed by that store. 
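
To make the two row sizes concrete, the sketch below computes where a given parameter row would sit relative to the payload pointer passed to `mte_l0c_*`. It is a hedged illustration only: it assumes rows are packed back to back in `fb`, which this page does not state, and `row_byte_offset` is a hypothetical helper.

```python
PRE_RELU_ROW_BYTES = 64    # vector pre-ReLU consumers read 64B parameter rows
PRE_QUANT_ROW_BYTES = 128  # vector pre-quant consumers read 128B parameter rows


def row_byte_offset(row: int, row_bytes: int) -> int:
    """Byte offset of parameter row `row` from the payload pointer.

    Assumption: rows are stored contiguously, first row at offset 0.
    """
    if row < 0:
        raise ValueError("row index must be non-negative")
    return row * row_bytes


# The same logical row lands at different byte offsets for the two consumers.
relu_offset = row_byte_offset(3, PRE_RELU_ROW_BYTES)
quant_offset = row_byte_offset(3, PRE_QUANT_ROW_BYTES)
```

These byte offsets describe the consumer view only. `%len_burst` and the `nburst(...)` gaps of `pto.mte_l1_fb` itself are still counted in parameter-load units.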
+ +## Syntax + +```mlir +pto.mte_l1_fb %src, %dst, %len_burst + nburst(%count, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | L1 source pointer in `l1` | +| `%dst` | ptr | Scaling/parameter destination pointer in `fb` | +| `%len_burst` | i64 | Number of parameter-load units per burst | +| `%count` | i64 | Burst count | +| `%src_gap` | i64 | Source gap between bursts, in parameter-load units | +| `%dst_gap` | i64 | Destination gap between bursts, in parameter-load units | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes FIXPIPE parameter payload into the `fb` destination region. | + +## Side Effects + +Reads L1-visible storage; writes FB-visible storage. The result is consumed by subsequent `pto.mte_l0c_*` writeback ops that reference an `fb` pointer in their `pre_quant` or `pre_relu` clauses. + +## Constraints + +!!! warning "Constraints" + - `%src` must be in `l1`, `%dst` must be in `fb`. + - Vector `pre_quant` and `pre_relu` consumers require parameter data prepared in the row order documented under [FIXPIPE Model](../../fixpipe-model.md). 
+ +## Examples + +```mlir +pto.mte_l1_fb %l1_fp, %fb_fp, %c2_i64 nburst(%c4_i64, %c0_i64, %c0_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Related Ops + +- FIXPIPE writeback consumers: [pto.mte_l0c_l1](./mte-l0c-l1.md), [pto.mte_l0c_gm](./mte-l0c-gm.md), [pto.mte_l0c_ub](./mte-l0c-ub.md) +- FIXPIPE pipeline overview: [FIXPIPE Model](../../fixpipe-model.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-fb_zh.md b/docs/isa/cube/ops/data-movement/mte-l1-fb_zh.md new file mode 100644 index 000000000..371e8cfea --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-fb_zh.md @@ -0,0 +1,60 @@ +# pto.mte_l1_fb + +`pto.mte_l1_fb` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +把 FIXPIPE 参数 payload 从 L1 加载到 `fb`。`pto.mte_l0c_*` 回写指令族里的 `pre_quant(...)` 与 `pre_relu(...)` 子句稍后会通过 `fb` 指针消费这些 payload。FIXPIPE 流水线全貌见 [FIXPIPE 模型](../../fixpipe-model_zh.md)。 + +## 机制 + +一次 burst 从 `%src` 读 `%len_burst` 个 parameter-load 单位,写到 `%dst`。本指令的拷贝单位是它的 parameter-load 单位——与 `mte_l0c_*` 向量 payload 行的大小不同。`%len_burst` 与 `nburst(...)` 的 gap 都以这个单位计数,不是字节,也不是目标元素。 + +`pto.mte_l1_fb` 在 `fb` 中物化 payload 之后,pre-ReLU 向量消费者按 64B 参数行读取,pre-quant 向量消费者按 128B 参数行读取。传给 `mte_l0c_*` 的 payload 指针必须指向该 store 对应逻辑输出 tile 的第一行;后续行按照逻辑累加器元素相同的通道 / NZ 顺序前进。 + +## 语法 + +```mlir +pto.mte_l1_fb %src, %dst, %len_burst + nburst(%count, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | L1 源指针,位于 `l1` | +| `%dst` | ptr | 缩放/参数目标指针,位于 `fb` | +| `%len_burst` | i64 | 每个 burst 加载的 parameter-load 单位数 | +| `%count` | i64 | Burst 数量 | +| `%src_gap` | i64 | 跨 burst 的源间隔(parameter-load 单位) | +| `%dst_gap` | i64 | 跨 burst 的目标间隔(parameter-load 单位) | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把 FIXPIPE 参数 payload 写入 `fb` 目标区域。 | + +## 副作用 + +读 L1,写 FB。结果由后续 `pto.mte_l0c_*` 回写指令里引用 `fb` 指针的 `pre_quant` / `pre_relu` 子句消费。 + +## 约束 + +!!! 
warning "约束" + - `%src` 必须在 `l1`,`%dst` 必须在 `fb`。 + - 向量 `pre_quant` 与 `pre_relu` 消费者要求按 [FIXPIPE 模型](../../fixpipe-model_zh.md) 文档化的行序准备参数数据。 + +## 示例 + +```mlir +pto.mte_l1_fb %l1_fp, %fb_fp, %c2_i64 nburst(%c4_i64, %c0_i64, %c0_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 相关指令 + +- FIXPIPE 回写消费者:[pto.mte_l0c_l1](./mte-l0c-l1_zh.md)、[pto.mte_l0c_gm](./mte-l0c-gm_zh.md)、[pto.mte_l0c_ub](./mte-l0c-ub_zh.md) +- FIXPIPE 流水线概览:[FIXPIPE 模型](../../fixpipe-model_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0a-mx.md b/docs/isa/cube/ops/data-movement/mte-l1-l0a-mx.md new file mode 100644 index 000000000..e101a9881 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0a-mx.md @@ -0,0 +1,59 @@ +# pto.mte_l1_l0a_mx + +`pto.mte_l1_l0a_mx` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Load left-side MX scale fragments for a logical `%m x %k` left data tile. The fragments prepare the scale payload consumed by [`pto.mad_mx`](../mad/mad-mx.md) / [`pto.mad_mx_acc`](../mad/mad-mx-acc.md) / [`pto.mad_mx_bias`](../mad/mad-mx-bias.md). + +## MX Scale Load Model + +Each scale entry applies to one 32-element K group. + +- Left scale logical shape: `[M, ceil(K / 32)]` +- L1 source data is organized as 32B scale fragments in the same logical order as the associated data tile. + +## Syntax + +```mlir +pto.mte_l1_l0a_mx %src, %dst, %m, %k + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | L1 MX scale source in `l1` | +| `%dst` | ptr | Left-side MX payload destination associated with `l0a` | +| `%m` | i64 | M extent of the associated left data tile | +| `%k` | i64 | K extent; scale grouping is by 32 K elements | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the MX scale payload associated with the L0A operand tile. 
| + +## Side Effects + +Reads L1; writes MX scale state associated with L0A. The result is consumed by the next `pto.mad_mx*` op that reads from this `%lhs`. + +## Constraints + +!!! warning "Constraints" + - `%src` must be in `l1`, `%dst` must be in `l0a`. + - `%src` and `%dst` must satisfy 32B MX scale-fragment alignment. + +## Examples + +```mlir +pto.mte_l1_l0a_mx %l1_a_scale, %l0a_scale, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Related Ops + +- Data tile load: [pto.mte_l1_l0a](./mte-l1-l0a.md) +- Right-side scale loader: [pto.mte_l1_l0b_mx](./mte-l1-l0b-mx.md) +- MX MAD consumers: [pto.mad_mx](../mad/mad-mx.md), [pto.mad_mx_acc](../mad/mad-mx-acc.md), [pto.mad_mx_bias](../mad/mad-mx-bias.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0a-mx_zh.md b/docs/isa/cube/ops/data-movement/mte-l1-l0a-mx_zh.md new file mode 100644 index 000000000..d90922c6d --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0a-mx_zh.md @@ -0,0 +1,59 @@ +# pto.mte_l1_l0a_mx + +`pto.mte_l1_l0a_mx` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +为逻辑 `%m x %k` 左数据 tile 加载左侧 MX scale 片段。生成的 payload 供 [`pto.mad_mx`](../mad/mad-mx_zh.md) / [`pto.mad_mx_acc`](../mad/mad-mx-acc_zh.md) / [`pto.mad_mx_bias`](../mad/mad-mx-bias_zh.md) 消费。 + +## MX Scale 加载模型 + +每个 scale 项对应一个 32 元素的 K 组。 + +- 左侧 scale 逻辑形状:`[M, ceil(K / 32)]` +- L1 源数据按 32B scale 片段组织,逻辑顺序与对应数据 tile 一致。 + +## 语法 + +```mlir +pto.mte_l1_l0a_mx %src, %dst, %m, %k + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | L1 中的 MX scale 源 | +| `%dst` | ptr | 与 `l0a` 关联的左侧 MX payload 目标 | +| `%m` | i64 | 对应左数据 tile 的 M 范围 | +| `%k` | i64 | K 范围;scale 按 32 个 K 元素分组 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把与 L0A 操作数 tile 关联的 MX scale payload 写好。 | + +## 副作用 + +读 L1,写与 L0A 关联的 MX scale 状态。结果由后续读取该 `%lhs` 的 `pto.mad_mx*` 消费。 + +## 约束 + +!!! 
warning "约束" + - `%src` 必须在 `l1`,`%dst` 必须在 `l0a`。 + - `%src` 与 `%dst` 必须满足 32B MX scale 片段对齐要求。 + +## 示例 + +```mlir +pto.mte_l1_l0a_mx %l1_a_scale, %l0a_scale, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 相关指令 + +- 数据 tile 加载:[pto.mte_l1_l0a](./mte-l1-l0a_zh.md) +- 右侧 scale 加载:[pto.mte_l1_l0b_mx](./mte-l1-l0b-mx_zh.md) +- MX MAD 消费者:[pto.mad_mx](../mad/mad-mx_zh.md)、[pto.mad_mx_acc](../mad/mad-mx-acc_zh.md)、[pto.mad_mx_bias](../mad/mad-mx-bias_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0a.md b/docs/isa/cube/ops/data-movement/mte-l1-l0a.md new file mode 100644 index 000000000..1ee8a57f6 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0a.md @@ -0,0 +1,61 @@ +# pto.mte_l1_l0a + +`pto.mte_l1_l0a` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Load a logical `%m x %k` left tile from L1 `l1` into `l0a` for `pto.mad*` consumption. The source must already be in cube-fractal NZ layout; this op does not convert arbitrary row-major matrices. Use [`pto.mte_gm_l1_frac`](./mte-gm-l1-frac.md) to repack ND/DN source data first. + +## Mechanism + +The op moves an L1 cube-fractal tile into the L0A operand domain. The destination layout follows [NZ Fractal Layout](../../nz-fractal-layout.md#per-buffer-nz-layouts) for L0A (`K1 M1 M0 K0`, FRACTAL_NZ on A5 / FRACTAL_ZZ on A3). + +If `transpose = true`, the selected logical source tile is transposed before placement in the destination operand domain. Omitting the attribute means `transpose = false`. 
+ +## Syntax + +```mlir +pto.mte_l1_l0a %src, %dst, %m, %k + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | L1 cube-fractal source tile in `l1` | +| `%dst` | ptr | Left operand destination in `l0a` | +| `%m` | i64 | Logical M extent | +| `%k` | i64 | Logical K extent | +| `transpose` | attr | Optional boolean source-tile transpose before destination placement | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the L0A tile that subsequent `pto.mad*` will read. | + +## Side Effects + +Reads L1; writes L0A. Engages the AIC MTE1 pipe. + +## Constraints + +!!! warning "Constraints" + - `%src` must be in `l1`, `%dst` must be in `l0a`. + - `%src` and `%dst` must satisfy the target alignment for Cube tile loads. + - `transpose = true` requires a tile shape supported by the element-type transpose granularity. + +## Examples + +```mlir +pto.mte_l1_l0a %l1_a, %l0a, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Related Ops + +- Right operand load: [pto.mte_l1_l0b](./mte-l1-l0b.md) +- MX scale loader: [pto.mte_l1_l0a_mx](./mte-l1-l0a-mx.md) +- Upstream repack: [pto.mte_gm_l1_frac](./mte-gm-l1-frac.md) +- MAD consumers: [pto.mad](../mad/mad.md), [pto.mad_acc](../mad/mad-acc.md), [pto.mad_bias](../mad/mad-bias.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0a_zh.md b/docs/isa/cube/ops/data-movement/mte-l1-l0a_zh.md new file mode 100644 index 000000000..f796fde09 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0a_zh.md @@ -0,0 +1,61 @@ +# pto.mte_l1_l0a + +`pto.mte_l1_l0a` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +把 L1 `l1` 中的逻辑 `%m x %k` 左 tile 加载到 `l0a`,供 `pto.mad*` 消费。源必须已经是 cube fractal NZ 布局;本指令不会把任意行优先矩阵自动转换。若源是 ND/DN 原始数据,请先用 [`pto.mte_gm_l1_frac`](./mte-gm-l1-frac_zh.md) 完成重排。 + +## 机制 + +将 L1 cube fractal tile 移入 L0A 操作数域。目标布局遵循 [NZ Fractal 布局 — 各缓冲的 NZ 
布局](../../nz-fractal-layout_zh.md#各缓冲的-nz-布局):L0A 为 `K1 M1 M0 K0`(A5 上 FRACTAL_NZ / A3 上 FRACTAL_ZZ)。 + +若 `transpose = true`,被选中的逻辑源 tile 会在置入目标操作数域之前转置;省略该属性等同于 `transpose = false`。 + +## 语法 + +```mlir +pto.mte_l1_l0a %src, %dst, %m, %k + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | L1 中的 cube fractal 源 tile | +| `%dst` | ptr | L0A 左操作数目标 | +| `%m` | i64 | 逻辑 M 范围 | +| `%k` | i64 | 逻辑 K 范围 | +| `transpose` | attr | 可选布尔属性,置入目标域前是否对源 tile 做转置 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把 L0A tile 写好,供后续 `pto.mad*` 读取。 | + +## 副作用 + +读 L1,写 L0A。占用 AIC MTE1 流水线。 + +## 约束 + +!!! warning "约束" + - `%src` 必须在 `l1`,`%dst` 必须在 `l0a`。 + - `%src` 与 `%dst` 必须满足 Cube tile load 的目标对齐要求。 + - `transpose = true` 要求 tile 形状被元素类型的转置粒度支持。 + +## 示例 + +```mlir +pto.mte_l1_l0a %l1_a, %l0a, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 相关指令 + +- 右操作数加载:[pto.mte_l1_l0b](./mte-l1-l0b_zh.md) +- MX scale 加载:[pto.mte_l1_l0a_mx](./mte-l1-l0a-mx_zh.md) +- 上游重排:[pto.mte_gm_l1_frac](./mte-gm-l1-frac_zh.md) +- MAD 消费者:[pto.mad](../mad/mad_zh.md)、[pto.mad_acc](../mad/mad-acc_zh.md)、[pto.mad_bias](../mad/mad-bias_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0b-mx.md b/docs/isa/cube/ops/data-movement/mte-l1-l0b-mx.md new file mode 100644 index 000000000..a8c297903 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0b-mx.md @@ -0,0 +1,59 @@ +# pto.mte_l1_l0b_mx + +`pto.mte_l1_l0b_mx` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Load right-side MX scale fragments for a logical `%k x %n` right data tile. The fragments prepare the scale payload consumed by [`pto.mad_mx`](../mad/mad-mx.md) / [`pto.mad_mx_acc`](../mad/mad-mx-acc.md) / [`pto.mad_mx_bias`](../mad/mad-mx-bias.md). + +## MX Scale Load Model + +Each scale entry applies to one 32-element K group. 
+ +- Right scale logical shape: `[ceil(K / 32), N]` +- L1 source data is organized as 32B scale fragments in the same logical order as the associated data tile. + +## Syntax + +```mlir +pto.mte_l1_l0b_mx %src, %dst, %k, %n + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | L1 MX scale source in `l1` | +| `%dst` | ptr | Right-side MX payload destination associated with `l0b` | +| `%k` | i64 | K extent; scale grouping is by 32 K elements | +| `%n` | i64 | N extent of the associated right data tile | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the MX scale payload associated with the L0B operand tile. | + +## Side Effects + +Reads L1; writes MX scale state associated with L0B. The result is consumed by the next `pto.mad_mx*` op that reads from this `%rhs`. + +## Constraints + +!!! warning "Constraints" + - `%src` must be in `l1`, `%dst` must be in `l0b`. + - `%src` and `%dst` must satisfy 32B MX scale-fragment alignment. 
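
The scale shapes follow directly from the 32-element K grouping. The sketch below is plain Python with hypothetical helper names; it only restates the shape rules from the MX Scale Load Model sections and says nothing about how the 32B fragments are laid out in L1.

```python
MX_GROUP = 32  # one scale entry per 32-element K group


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def left_scale_shape(m: int, k: int) -> tuple:
    """Logical left scale shape [M, ceil(K / 32)]."""
    return (m, ceil_div(k, MX_GROUP))


def right_scale_shape(k: int, n: int) -> tuple:
    """Logical right scale shape [ceil(K / 32), N]."""
    return (ceil_div(k, MX_GROUP), n)


def scale_group(k_index: int) -> int:
    """Index of the K group, and hence the scale entry, covering k_index."""
    return k_index // MX_GROUP


# Extents used by the examples: left tile 16 x 64, right tile 64 x 16.
left_shape = left_scale_shape(16, 64)
right_shape = right_scale_shape(64, 16)
```

A K extent that is not a multiple of 32 still needs a scale entry for the partial last group, which is why the shape uses a ceiling division.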
+ +## Examples + +```mlir +pto.mte_l1_l0b_mx %l1_b_scale, %l0b_scale, %c64_i64, %c16_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Related Ops + +- Data tile load: [pto.mte_l1_l0b](./mte-l1-l0b.md) +- Left-side scale loader: [pto.mte_l1_l0a_mx](./mte-l1-l0a-mx.md) +- MX MAD consumers: [pto.mad_mx](../mad/mad-mx.md), [pto.mad_mx_acc](../mad/mad-mx-acc.md), [pto.mad_mx_bias](../mad/mad-mx-bias.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0b-mx_zh.md b/docs/isa/cube/ops/data-movement/mte-l1-l0b-mx_zh.md new file mode 100644 index 000000000..0689154dd --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0b-mx_zh.md @@ -0,0 +1,59 @@ +# pto.mte_l1_l0b_mx + +`pto.mte_l1_l0b_mx` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +为逻辑 `%k x %n` 右数据 tile 加载右侧 MX scale 片段。生成的 payload 供 [`pto.mad_mx`](../mad/mad-mx_zh.md) / [`pto.mad_mx_acc`](../mad/mad-mx-acc_zh.md) / [`pto.mad_mx_bias`](../mad/mad-mx-bias_zh.md) 消费。 + +## MX Scale 加载模型 + +每个 scale 项对应一个 32 元素的 K 组。 + +- 右侧 scale 逻辑形状:`[ceil(K / 32), N]` +- L1 源数据按 32B scale 片段组织,逻辑顺序与对应数据 tile 一致。 + +## 语法 + +```mlir +pto.mte_l1_l0b_mx %src, %dst, %k, %n + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | L1 中的 MX scale 源 | +| `%dst` | ptr | 与 `l0b` 关联的右侧 MX payload 目标 | +| `%k` | i64 | K 范围;scale 按 32 个 K 元素分组 | +| `%n` | i64 | 对应右数据 tile 的 N 范围 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把与 L0B 操作数 tile 关联的 MX scale payload 写好。 | + +## 副作用 + +读 L1,写与 L0B 关联的 MX scale 状态。结果由后续读取该 `%rhs` 的 `pto.mad_mx*` 消费。 + +## 约束 + +!!! 
warning "约束" + - `%src` 必须在 `l1`,`%dst` 必须在 `l0b`。 + - `%src` 与 `%dst` 必须满足 32B MX scale 片段对齐要求。 + +## 示例 + +```mlir +pto.mte_l1_l0b_mx %l1_b_scale, %l0b_scale, %c64_i64, %c16_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 相关指令 + +- 数据 tile 加载:[pto.mte_l1_l0b](./mte-l1-l0b_zh.md) +- 左侧 scale 加载:[pto.mte_l1_l0a_mx](./mte-l1-l0a-mx_zh.md) +- MX MAD 消费者:[pto.mad_mx](../mad/mad-mx_zh.md)、[pto.mad_mx_acc](../mad/mad-mx-acc_zh.md)、[pto.mad_mx_bias](../mad/mad-mx-bias_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0b.md b/docs/isa/cube/ops/data-movement/mte-l1-l0b.md new file mode 100644 index 000000000..93015243e --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0b.md @@ -0,0 +1,63 @@ +# pto.mte_l1_l0b + +`pto.mte_l1_l0b` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Load a logical `%k x %n` right tile from L1 `l1` into `l0b` for `pto.mad*` consumption. The source must already be in cube-fractal NZ layout; this op does not convert arbitrary row-major matrices. + +L0B uses `K1 N1 N0 K0` FRACTAL_ZN with **K innermost** so the cube hardware reads all `K0` elements per cycle without striding. The inner-box transpose from L1's `K1 N1 K0 N0` to L0B's `K1 N1 N0 K0` is performed as part of this movement; no separate user-visible pass is required. + +## Mechanism + +If `transpose = true`, the selected logical source tile is transposed before placement in the destination operand domain. Omitting the attribute means `transpose = false`. + +See [NZ Fractal Layout — Why K-Innermost on L0B](../../nz-fractal-layout.md#why-k-innermost-on-l0b) for the layout rationale. 
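
The inner-box transpose from `K1 N1 K0 N0` to `K1 N1 N0 K0` can be pictured as an index remap. The following Python sketch is an illustration under stated assumptions: the leftmost dimension is taken as outermost, the box sizes `k0 = n0 = 2` are toy values rather than hardware fractal sizes, and all function names are hypothetical.

```python
def l1_index(k, n, n_tiles, k0, n0):
    """Linear element index in the L1 order K1 N1 K0 N0 (N0 innermost)."""
    k1, ki = divmod(k, k0)
    n1, ni = divmod(n, n0)
    return ((k1 * n_tiles + n1) * k0 + ki) * n0 + ni


def l0b_index(k, n, n_tiles, k0, n0):
    """Linear element index in the L0B order K1 N1 N0 K0 (K0 innermost)."""
    k1, ki = divmod(k, k0)
    n1, ni = divmod(n, n0)
    return ((k1 * n_tiles + n1) * n0 + ni) * k0 + ki


def repack(l1_buf, k_extent, n_extent, k0, n0):
    """Copy a K x N tile from the L1 order into the L0B order."""
    if k_extent % k0 or n_extent % n0:
        raise ValueError("extents must be multiples of the box sizes")
    n_tiles = n_extent // n0
    out = [None] * (k_extent * n_extent)
    for k in range(k_extent):
        for n in range(n_extent):
            src = l1_index(k, n, n_tiles, k0, n0)
            out[l0b_index(k, n, n_tiles, k0, n0)] = l1_buf[src]
    return out


# Toy 4 x 4 tile whose elements are labelled with their logical (k, n).
K, N, K0, N0 = 4, 4, 2, 2
l1_buf = [None] * (K * N)
for k in range(K):
    for n in range(N):
        l1_buf[l1_index(k, n, N // N0, K0, N0)] = (k, n)

l0b_buf = repack(l1_buf, K, N, K0, N0)
```

In the repacked buffer, neighbouring entries differ in `k` rather than `n`, which is the property the Summary relies on: the `K0` elements of one box are contiguous.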
+ +## Syntax + +```mlir +pto.mte_l1_l0b %src, %dst, %k, %n + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%src` | ptr | L1 cube-fractal source tile in `l1` | +| `%dst` | ptr | Right operand destination in `l0b` | +| `%k` | i64 | Logical K extent | +| `%n` | i64 | Logical N extent | +| `transpose` | attr | Optional boolean source-tile transpose before destination placement | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the L0B tile that subsequent `pto.mad*` will read. | + +## Side Effects + +Reads L1; writes L0B. Engages the AIC MTE1 pipe. + +## Constraints + +!!! warning "Constraints" + - `%src` must be in `l1`, `%dst` must be in `l0b`. + - `%src` and `%dst` must satisfy the target alignment for Cube tile loads. + - `transpose = true` requires a tile shape supported by the element-type transpose granularity. + +## Examples + +```mlir +pto.mte_l1_l0b %l1_b, %l0b, %c32_i64, %c16_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## Related Ops + +- Left operand load: [pto.mte_l1_l0a](./mte-l1-l0a.md) +- MX scale loader: [pto.mte_l1_l0b_mx](./mte-l1-l0b-mx.md) +- Upstream repack: [pto.mte_gm_l1_frac](./mte-gm-l1-frac.md) +- MAD consumers: [pto.mad](../mad/mad.md), [pto.mad_acc](../mad/mad-acc.md), [pto.mad_bias](../mad/mad-bias.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-l0b_zh.md b/docs/isa/cube/ops/data-movement/mte-l1-l0b_zh.md new file mode 100644 index 000000000..ea5f186b8 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-l0b_zh.md @@ -0,0 +1,63 @@ +# pto.mte_l1_l0b + +`pto.mte_l1_l0b` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +把 L1 `l1` 中的逻辑 `%k x %n` 右 tile 加载到 `l0b`,供 `pto.mad*` 消费。源必须已经是 cube fractal NZ 布局;本指令不会把任意行优先矩阵自动转换。 + +L0B 使用 `K1 N1 N0 K0` FRACTAL_ZN,**K 在最内层**,这样 cube 硬件每个 cycle 都能读到完整的 `K0` 个元素而不跨 stride。从 L1 的 `K1 N1 K0 N0` 到 L0B 的 `K1 N1 N0 K0` 的内层 box 
转置在本指令搬运过程中完成,用户层看不到额外的转置 pass。 + +## 机制 + +若 `transpose = true`,选中的逻辑源 tile 在置入目标操作数域前转置;省略等同于 `transpose = false`。 + +布局原理见 [NZ Fractal 布局 — 为什么 L0B 必须 K-innermost](../../nz-fractal-layout_zh.md#为什么-l0b-必须-k-innermost)。 + +## 语法 + +```mlir +pto.mte_l1_l0b %src, %dst, %k, %n + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%src` | ptr | L1 中的 cube fractal 源 tile | +| `%dst` | ptr | L0B 右操作数目标 | +| `%k` | i64 | 逻辑 K 范围 | +| `%n` | i64 | 逻辑 N 范围 | +| `transpose` | attr | 可选布尔属性,置入目标域前是否对源 tile 做转置 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把 L0B tile 写好,供后续 `pto.mad*` 读取。 | + +## 副作用 + +读 L1,写 L0B。占用 AIC MTE1 流水线。 + +## 约束 + +!!! warning "约束" + - `%src` 必须在 `l1`,`%dst` 必须在 `l0b`。 + - `%src` 与 `%dst` 必须满足 Cube tile load 的目标对齐要求。 + - `transpose = true` 要求 tile 形状被元素类型的转置粒度支持。 + +## 示例 + +```mlir +pto.mte_l1_l0b %l1_b, %l0b, %c32_i64, %c16_i64 + : !pto.ptr, !pto.ptr, i64, i64 +``` + +## 相关指令 + +- 左操作数加载:[pto.mte_l1_l0a](./mte-l1-l0a_zh.md) +- MX scale 加载:[pto.mte_l1_l0b_mx](./mte-l1-l0b-mx_zh.md) +- 上游重排:[pto.mte_gm_l1_frac](./mte-gm-l1-frac_zh.md) +- MAD 消费者:[pto.mad](../mad/mad_zh.md)、[pto.mad_acc](../mad/mad-acc_zh.md)、[pto.mad_bias](../mad/mad-bias_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-ub.md b/docs/isa/cube/ops/data-movement/mte-l1-ub.md new file mode 100644 index 000000000..22d6674c6 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-ub.md @@ -0,0 +1,54 @@ +# pto.mte_l1_ub + +`pto.mte_l1_ub` is part of [Cube Data Movement Ops](../../README.md#cube-data-movement-ops). + +## Summary + +Structured L1→UB copy. Reads grouped byte ranges from `%src` in L1 and writes them to `%dst` in UB. This is the L1→Vector data path complement to [`pto.mte_ub_l1`](../../../scalar/ops/dma-copy/mte-ub-l1.md). + +## Mechanism + +Uses the same grouped `nburst(...) [loop(...)]*` model as [`pto.mte_gm_l1`](./mte-gm-l1.md). 
For each `nburst` row, the source and destination advance by `src_stride` / `dst_stride`. Outer `loop(...)` groups wrap the inner transfer pattern. + +## Syntax + +```mlir +pto.mte_l1_ub %src, %dst, %len_burst + nburst(%count, %src_stride, %dst_stride) + [loop(%count_i, %src_stride_i, %dst_stride_i)]* + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Inputs + +Same grouped byte model as [`pto.mte_gm_l1`](./mte-gm-l1.md#inputs), with source and destination address spaces reversed to `l1 -> ub`. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes data into the UB destination region. | + +## Side Effects + +Reads L1-visible storage; writes UB-visible storage. The transfer is issued on the AIC side and travels over the cube-to-vector data path; UB consumers on the AIV side must synchronize accordingly. + +## Constraints + +!!! warning "Constraints" + - `%src` must be in `l1`, `%dst` must be in `ub`. + - `nburst(...)` is required. + - Each `loop(...)` group must provide all three operands. + +## Examples + +```mlir +pto.mte_l1_ub %l1_src, %ub_dst, %c64_i64 + nburst(%c2_i64, %c128_i64, %c64_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Related Ops + +- Reverse direction (UB → L1): [pto.mte_ub_l1](../../../scalar/ops/dma-copy/mte-ub-l1.md) +- GM → L1: [pto.mte_gm_l1](./mte-gm-l1.md), [pto.mte_gm_l1_frac](./mte-gm-l1-frac.md) diff --git a/docs/isa/cube/ops/data-movement/mte-l1-ub_zh.md b/docs/isa/cube/ops/data-movement/mte-l1-ub_zh.md new file mode 100644 index 000000000..22ac417e9 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-l1-ub_zh.md @@ -0,0 +1,54 @@ +# pto.mte_l1_ub + +`pto.mte_l1_ub` 属于 [Cube 数据搬运指令](../../README_zh.md#cube-数据搬运指令)。 + +## 摘要 + +结构化 L1→UB 拷贝。从 `%src`(L1)读取分组字节区间并写入 `%dst`(UB)。这是 [`pto.mte_ub_l1`](../../../scalar/ops/dma-copy/mte-ub-l1_zh.md) 的反向通路,即 L1→Vector 数据通路。 + +## 机制 + +使用与 [`pto.mte_gm_l1`](./mte-gm-l1_zh.md) 相同的分组 `nburst(...) 
[loop(...)]*` 模型。每个 `nburst` 行后,源/目标按 `src_stride` / `dst_stride` 前进;外层 `loop(...)` 把内层传输打包。 + +## 语法 + +```mlir +pto.mte_l1_ub %src, %dst, %len_burst + nburst(%count, %src_stride, %dst_stride) + [loop(%count_i, %src_stride_i, %dst_stride_i)]* + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 输入 + +与 [`pto.mte_gm_l1`](./mte-gm-l1_zh.md#输入) 同样的分组字节模型,源/目标地址空间反向为 `l1 -> ub`。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把数据写入 UB 目标区域。 | + +## 副作用 + +读 L1 可见存储,写 UB 可见存储。传输由 AIC 侧发起并经过 cube-to-vector 数据通路;AIV 侧的 UB 消费者需相应同步。 + +## 约束 + +!!! warning "约束" + - `%src` 必须在 `l1`,`%dst` 必须在 `ub`。 + - `nburst(...)` 必须存在。 + - 每个 `loop(...)` 子句出现时必须给出完整三元组。 + +## 示例 + +```mlir +pto.mte_l1_ub %l1_src, %ub_dst, %c64_i64 + nburst(%c2_i64, %c128_i64, %c64_i64) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 相关指令 + +- 反向通路(UB → L1):[pto.mte_ub_l1](../../../scalar/ops/dma-copy/mte-ub-l1_zh.md) +- GM → L1:[pto.mte_gm_l1](./mte-gm-l1_zh.md)、[pto.mte_gm_l1_frac](./mte-gm-l1-frac_zh.md) diff --git a/docs/isa/cube/ops/data-movement/mte-ub-l1.md b/docs/isa/cube/ops/data-movement/mte-ub-l1.md new file mode 100644 index 000000000..5ab5e5e04 --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-ub-l1.md @@ -0,0 +1,17 @@ +# pto.mte_ub_l1 (cube view) + +`pto.mte_ub_l1` is the reverse direction of [`pto.mte_l1_ub`](./mte-l1-ub.md). The canonical reference page for this instruction lives under the scalar DMA Copy section because the AIV-side scalar program issues the transfer: + +- **Reference page**: [pto.mte_ub_l1 — Scalar DMA Copy](../../../scalar/ops/dma-copy/mte-ub-l1.md) + +This stub exists so the cube section index resolves the link cleanly. Refer to the scalar DMA Copy reference page for the full syntax, parameter table, and constraints. + +## Why it lives under scalar DMA + +The UB→L1 transfer is launched from the Vector (AIV) scalar program because the source pointer is in UB — the buffer owned by the Vector block. 
The cube (AIC) consumes the resulting L1 tile through [`pto.mte_l1_l0a`](./mte-l1-l0a.md) / [`pto.mte_l1_l0b`](./mte-l1-l0b.md) once the AIC has been notified that L1 is ready. + +## Related Ops + +- Reverse direction (L1 → UB): [pto.mte_l1_ub](./mte-l1-ub.md) +- Cube operand staging: [pto.mte_l1_l0a](./mte-l1-l0a.md), [pto.mte_l1_l0b](./mte-l1-l0b.md) +- Inter-block sync: [Cluster Programming Model](../../../machine-model/execution-agents.md) diff --git a/docs/isa/cube/ops/data-movement/mte-ub-l1_zh.md b/docs/isa/cube/ops/data-movement/mte-ub-l1_zh.md new file mode 100644 index 000000000..29a963bbf --- /dev/null +++ b/docs/isa/cube/ops/data-movement/mte-ub-l1_zh.md @@ -0,0 +1,17 @@ +# pto.mte_ub_l1(cube 侧视图) + +`pto.mte_ub_l1` 是 [`pto.mte_l1_ub`](./mte-l1-ub_zh.md) 的反向通路。该指令的标准参考页面位于标量 DMA Copy 节,因为它由 AIV 侧的标量程序发起: + +- **参考页面**:[pto.mte_ub_l1 — 标量 DMA Copy](../../../scalar/ops/dma-copy/mte-ub-l1_zh.md) + +本占位页存在的目的,是让 cube 节的索引链接能稳定解析;完整的语法、参数表与约束请直接参考标量 DMA Copy 中的指令参考。 + +## 为什么挂在标量 DMA 下 + +UB→L1 传输由 Vector(AIV)侧的标量程序发起,因为源指针位于 Vector 块拥有的 UB 缓冲。Cube(AIC)侧在收到 L1 就绪通知后,通过 [`pto.mte_l1_l0a`](./mte-l1-l0a_zh.md) / [`pto.mte_l1_l0b`](./mte-l1-l0b_zh.md) 消费产出的 L1 tile。 + +## 相关指令 + +- 反向通路(L1 → UB):[pto.mte_l1_ub](./mte-l1-ub_zh.md) +- Cube 操作数装载:[pto.mte_l1_l0a](./mte-l1-l0a_zh.md)、[pto.mte_l1_l0b](./mte-l1-l0b_zh.md) +- 跨块同步:[Cluster 编程模型](../../../machine-model/execution-agents_zh.md) diff --git a/docs/isa/cube/ops/mad/mad-acc.md b/docs/isa/cube/ops/mad/mad-acc.md new file mode 100644 index 000000000..03a7ea065 --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-acc.md @@ -0,0 +1,58 @@ +# pto.mad_acc + +`pto.mad_acc` is part of the [Cube MAD Ops](../../README.md#matrix-multiply-mad-ops). + +## Summary + +Accumulating cube matrix multiply: `dst[m, n] = dst[m, n] + sum_k(lhs[m, k] * rhs[k, n])`. + +## Mechanism + +Like [`pto.mad`](./mad.md), but adds the freshly-computed product to the existing L0C accumulator state instead of overwriting it. 
Typical use is K-axis tiling, where successive MAD calls accumulate partial sums along K until the full reduction is complete. + +## Syntax + +```mlir +pto.mad_acc %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + tf32_mode(round_even | round_away)? + n_dir? + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Inputs + +Same parameter shape and semantics as [`pto.mad`](./mad.md#inputs). See [MAD Common Clauses](./mad.md#mad-common-clauses) for the optional clauses. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Updates the existing `M x N` tile in L0C in place. No SSA result. | + +## Side Effects + +Engages the CUBE pipe; reads from and writes to L0C. The caller is responsible for ensuring the L0C tile has been initialized (typically by an initial [`pto.mad`](./mad.md) or [`pto.mad_bias`](./mad-bias.md) on the same `%dst` before the first `pto.mad_acc`). + +## Constraints + +Same as [`pto.mad`](./mad.md#constraints). + +## Examples + +```mlir +// K-axis tiling: initial pto.mad then repeated pto.mad_acc. 
+pto.mad %l0a_k0, %l0b_k0, %l0c, %c16_i64, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 + +pto.mad_acc %l0a_k1, %l0b_k1, %l0c, %c16_i64, %c16_i64, %c32_i64 unit_flag(check_only) + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Related Ops + +- Zero-init form: [pto.mad](./mad.md) +- Bias-init form: [pto.mad_bias](./mad-bias.md) +- MX accumulating form: [pto.mad_mx_acc](./mad-mx-acc.md) diff --git a/docs/isa/cube/ops/mad/mad-acc_zh.md b/docs/isa/cube/ops/mad/mad-acc_zh.md new file mode 100644 index 000000000..9891a5fbe --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-acc_zh.md @@ -0,0 +1,58 @@ +# pto.mad_acc + +`pto.mad_acc` 属于 [Cube MAD 指令](../../README_zh.md#矩阵乘加mad指令)。 + +## 摘要 + +累加 cube 矩阵乘加:`dst[m, n] = dst[m, n] + sum_k(lhs[m, k] * rhs[k, n])`。 + +## 机制 + +与 [`pto.mad`](./mad_zh.md) 相同,但把新计算的乘积**累加**到 L0C 已有的累加器状态上,而不是覆盖。典型用法是 K 方向分块:连续的 MAD 沿 K 累加部分和,直到整个归约完成。 + +## 语法 + +```mlir +pto.mad_acc %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + tf32_mode(round_even | round_away)? + n_dir? 
+ : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 输入 + +参数形状与语义同 [`pto.mad`](./mad_zh.md#输入)。可选 clauses 见 [MAD 通用 Clauses](./mad_zh.md#mad-通用-clauses)。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 就地更新 L0C 中已有的 `M x N` tile,没有 SSA 结果。 | + +## 副作用 + +占用 CUBE 流水线,并读写 L0C。调用方需要保证 L0C tile 已被初始化(通常由该 `%dst` 上的首次 [`pto.mad`](./mad_zh.md) 或 [`pto.mad_bias`](./mad-bias_zh.md) 完成,之后才允许多次 `pto.mad_acc`)。 + +## 约束 + +同 [`pto.mad`](./mad_zh.md#约束)。 + +## 示例 + +```mlir +// K 方向分块:先 pto.mad 初始化,再多次 pto.mad_acc。 +pto.mad %l0a_k0, %l0b_k0, %l0c, %c16_i64, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 + +pto.mad_acc %l0a_k1, %l0b_k1, %l0c, %c16_i64, %c16_i64, %c32_i64 unit_flag(check_only) + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 相关指令 + +- 零初始化形式:[pto.mad](./mad_zh.md) +- 偏置初始化形式:[pto.mad_bias](./mad-bias_zh.md) +- MX 累加形式:[pto.mad_mx_acc](./mad-mx-acc_zh.md) diff --git a/docs/isa/cube/ops/mad/mad-bias.md b/docs/isa/cube/ops/mad/mad-bias.md new file mode 100644 index 000000000..8d4cc16d7 --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-bias.md @@ -0,0 +1,64 @@ +# pto.mad_bias + +`pto.mad_bias` is part of the [Cube MAD Ops](../../README.md#matrix-multiply-mad-ops). + +## Summary + +Bias-init cube matrix multiply: `dst[m, n] = sum_k(lhs[m, k] * rhs[k, n]) + bias[n]`. + +## Mechanism + +Like [`pto.mad`](./mad.md), but seeds the accumulator with a per-N bias vector instead of zero. Useful as the first MAD in a K-tiled sequence where the bias is known up front; subsequent partial sums can accumulate via [`pto.mad_acc`](./mad-acc.md). + +## Syntax + +```mlir +pto.mad_bias %lhs, %rhs, %dst, %bias, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + tf32_mode(round_even | round_away)? + n_dir? 
+ : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Inputs + +| Parameter | Type | Description | +|-----------|------|-------------| +| `%lhs`, `%rhs`, `%dst`, `%m`, `%n`, `%k` | — | Same as [`pto.mad`](./mad.md#inputs) | +| `%bias` | `!pto.ptr` | Bias vector in BT, interpreted as `N` values broadcast across M | + +See [MAD Common Clauses](./mad.md#mad-common-clauses) for the optional clauses. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the produced `M x N` tile to L0C with bias-init seed. | + +## Side Effects + +Engages the CUBE pipe, reads `%bias` from BT, writes to L0C. The caller is responsible for staging `%bias` into BT via [`pto.mte_l1_bt`](../data-movement/mte-l1-bt.md) prior to this op. + +## Constraints + +!!! warning "Constraints" + - `%bias` must be in `bt` address space. + - `%bias` element type must match `%dst` element type. + - Only `N` bias values are consumed; `%bias` is not an `M x N` matrix. + - Other constraints match [`pto.mad`](./mad.md#constraints). 
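Reviewer note on the `pto.mad_bias` page above: the bias-init seed and the K-tiled follow-up via `pto.mad_acc` can be sketched as a scalar reference model. This is a minimal sketch under stated assumptions, not the hardware behaviour: the helper names (`mad`, `mad_acc`, `mad_bias`, `k_tiled_gemm_bias`) are hypothetical Python functions, tiles are plain nested lists rather than L0A / L0B / L0C / BT buffers, and NZ layout, unit flags, and element-type rules are ignored.

```python
# Scalar reference model of the non-MX MAD family semantics.
# All names are hypothetical helpers for illustration, not PTO APIs.

def mad_acc(lhs, rhs, dst, m, n, k):
    """Accumulating form: dst[i][j] += sum_k lhs[i][kk] * rhs[kk][j] (in place)."""
    for i in range(m):
        for j in range(n):
            dst[i][j] += sum(lhs[i][kk] * rhs[kk][j] for kk in range(k))


def mad(lhs, rhs, m, n, k):
    """Zero-init form: the accumulator starts at 0 and prior state is overwritten."""
    dst = [[0 for _ in range(n)] for _ in range(m)]
    mad_acc(lhs, rhs, dst, m, n, k)
    return dst


def mad_bias(lhs, rhs, bias, m, n, k):
    """Bias-init form: the accumulator starts at bias[j], broadcast across rows."""
    dst = [[bias[j] for j in range(n)] for _ in range(m)]
    mad_acc(lhs, rhs, dst, m, n, k)
    return dst


def k_tiled_gemm_bias(lhs, rhs, bias, m, n, k, k_tile):
    """K-tiled sequence: one mad_bias on the first K tile, then mad_acc on the rest."""
    def k_slice(mat, k0, k1):
        # Take columns k0..k1-1 of an M x K matrix.
        return [row[k0:k1] for row in mat]

    first = min(k_tile, k)
    dst = mad_bias(k_slice(lhs, 0, first), rhs[0:first], bias, m, n, first)
    for k0 in range(first, k, k_tile):
        k1 = min(k0 + k_tile, k)
        mad_acc(k_slice(lhs, k0, k1), rhs[k0:k1], dst, m, n, k1 - k0)
    return dst
```

With integer inputs the tiled sequence reproduces the single-shot bias-init result exactly; with floating-point inputs the two can differ in rounding because the summation order changes.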
+ +## Examples + +```mlir +pto.mad_bias %l0a, %l0b, %l0c, %bt, %c16_i64, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Related Ops + +- Zero-init form: [pto.mad](./mad.md) +- Accumulating form: [pto.mad_acc](./mad-acc.md) +- MX bias-init form: [pto.mad_mx_bias](./mad-mx-bias.md) +- Bias staging: [pto.mte_l1_bt](../data-movement/mte-l1-bt.md) diff --git a/docs/isa/cube/ops/mad/mad-bias_zh.md b/docs/isa/cube/ops/mad/mad-bias_zh.md new file mode 100644 index 000000000..1cc1fc62b --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-bias_zh.md @@ -0,0 +1,64 @@ +# pto.mad_bias + +`pto.mad_bias` 属于 [Cube MAD 指令](../../README_zh.md#矩阵乘加mad指令)。 + +## 摘要 + +偏置初始化 cube 矩阵乘加:`dst[m, n] = sum_k(lhs[m, k] * rhs[k, n]) + bias[n]`。 + +## 机制 + +与 [`pto.mad`](./mad_zh.md) 相同,但用 N 维 bias 向量作为累加器种子,而不是 0。常用作 K 分块序列里的第一条 MAD;后续部分和可以用 [`pto.mad_acc`](./mad-acc_zh.md) 继续累加。 + +## 语法 + +```mlir +pto.mad_bias %lhs, %rhs, %dst, %bias, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + tf32_mode(round_even | round_away)? + n_dir? + : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 输入 + +| 参数 | 类型 | 描述 | +|-----------|------|------| +| `%lhs`、`%rhs`、`%dst`、`%m`、`%n`、`%k` | — | 同 [`pto.mad`](./mad_zh.md#输入) | +| `%bias` | `!pto.ptr` | BT 中的 bias 向量,解读为 `N` 个值并沿 M 维广播 | + +可选 clauses 见 [MAD 通用 Clauses](./mad_zh.md#mad-通用-clauses)。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把以 bias 为初值的 `M x N` 结果 tile 写到 L0C。 | + +## 副作用 + +占用 CUBE 流水线,从 BT 读 `%bias`,写 L0C。调用方需在本指令之前通过 [`pto.mte_l1_bt`](../data-movement/mte-l1-bt_zh.md) 把 `%bias` 加载到 BT。 + +## 约束 + +!!! 
warning "约束" + - `%bias` 必须位于 `bt` 地址空间。 + - `%bias` 的元素类型必须与 `%dst` 相同。 + - 只消费 `N` 个 bias 值;`%bias` 不是 `M x N` 矩阵。 + - 其它约束同 [`pto.mad`](./mad_zh.md#约束)。 + +## 示例 + +```mlir +pto.mad_bias %l0a, %l0b, %l0c, %bt, %c16_i64, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 相关指令 + +- 零初始化形式:[pto.mad](./mad_zh.md) +- 累加形式:[pto.mad_acc](./mad-acc_zh.md) +- MX 偏置初始化形式:[pto.mad_mx_bias](./mad-mx-bias_zh.md) +- Bias 装载:[pto.mte_l1_bt](../data-movement/mte-l1-bt_zh.md) diff --git a/docs/isa/cube/ops/mad/mad-mx-acc.md b/docs/isa/cube/ops/mad/mad-mx-acc.md new file mode 100644 index 000000000..32c7cd0ce --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-mx-acc.md @@ -0,0 +1,57 @@ +# pto.mad_mx_acc + +`pto.mad_mx_acc` is part of the [Cube MAD Ops](../../README.md#matrix-multiply-mad-ops). + +## Summary + +Accumulating **MX (microscaled)** cube matrix multiply: `dst[m, n] = dst[m, n] + mx_product[m, n]`. + +See [MX Matmul Model](./mad-mx.md#mx-matmul-model) for the per-K-group scaled multiply-accumulate. + +## Mechanism + +Like [`pto.mad_mx`](./mad-mx.md) but adds the MX-scaled product to existing L0C state. Typical use is K-axis tiling for MX GEMM. + +## Syntax + +```mlir +pto.mad_mx_acc %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + n_dir? + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Inputs + +Same parameter shape as [`pto.mad_mx`](./mad-mx.md#inputs). + +See [MAD Common Clauses](./mad.md#mad-common-clauses) for the optional clauses. `tf32_mode(...)` is not accepted. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Updates the existing `M x N` tile in L0C with MX-scaled accumulation. | + +## Side Effects + +Same as [`pto.mad_mx`](./mad-mx.md#side-effects). 
The caller is responsible for ensuring the L0C tile has been initialized (typically by an initial [`pto.mad_mx`](./mad-mx.md) or [`pto.mad_mx_bias`](./mad-mx-bias.md) on the same `%dst`). + +## Constraints + +Same as [`pto.mad_mx`](./mad-mx.md#constraints). + +## Examples + +```mlir +pto.mad_mx_acc %l0a, %l0b, %l0c, %c16_i64, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Related Ops + +- Zero-init MX form: [pto.mad_mx](./mad-mx.md) +- Bias-init MX form: [pto.mad_mx_bias](./mad-mx-bias.md) +- Non-MX accumulating form: [pto.mad_acc](./mad-acc.md) diff --git a/docs/isa/cube/ops/mad/mad-mx-acc_zh.md b/docs/isa/cube/ops/mad/mad-mx-acc_zh.md new file mode 100644 index 000000000..a0d1a42fe --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-mx-acc_zh.md @@ -0,0 +1,55 @@ +# pto.mad_mx_acc + +`pto.mad_mx_acc` 属于 [Cube MAD 指令](../../README_zh.md#矩阵乘加mad指令)。 + +## 摘要 + +累加 **MX(微缩放)** cube 矩阵乘加:`dst[m, n] = dst[m, n] + mx_product[m, n]`。 + +MX 分组乘加的语义见 [MX Matmul 模型](./mad-mx_zh.md#mx-matmul-模型)。 + +## 机制 + +与 [`pto.mad_mx`](./mad-mx_zh.md) 相同,但把 MX 缩放后的乘积**累加**到 L0C 已有状态上。典型用法是 MX GEMM 的 K 方向分块。 + +## 语法 + +```mlir +pto.mad_mx_acc %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + n_dir? 
+ : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 输入 + +参数形状同 [`pto.mad_mx`](./mad-mx_zh.md#输入)。可选 clauses 见 [MAD 通用 Clauses](./mad_zh.md#mad-通用-clauses),但不接受 `tf32_mode(...)`。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 用 MX 缩放结果就地更新 L0C 中已有的 `M x N` tile。 | + +## 副作用 + +同 [`pto.mad_mx`](./mad-mx_zh.md#副作用)。调用方需保证 L0C tile 已被初始化(通常由该 `%dst` 上的首次 [`pto.mad_mx`](./mad-mx_zh.md) 或 [`pto.mad_mx_bias`](./mad-mx-bias_zh.md) 完成)。 + +## 约束 + +同 [`pto.mad_mx`](./mad-mx_zh.md#约束)。 + +## 示例 + +```mlir +pto.mad_mx_acc %l0a, %l0b, %l0c, %c16_i64, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 相关指令 + +- 零初始化 MX 形式:[pto.mad_mx](./mad-mx_zh.md) +- MX 偏置初始化形式:[pto.mad_mx_bias](./mad-mx-bias_zh.md) +- 非 MX 累加形式:[pto.mad_acc](./mad-acc_zh.md) diff --git a/docs/isa/cube/ops/mad/mad-mx-bias.md b/docs/isa/cube/ops/mad/mad-mx-bias.md new file mode 100644 index 000000000..723ff349f --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-mx-bias.md @@ -0,0 +1,61 @@ +# pto.mad_mx_bias + +`pto.mad_mx_bias` is part of the [Cube MAD Ops](../../README.md#matrix-multiply-mad-ops). + +## Summary + +Bias-init **MX (microscaled)** cube matrix multiply: `dst[m, n] = mx_product[m, n] + bias[n]`. + +See [MX Matmul Model](./mad-mx.md#mx-matmul-model) for the per-K-group scaled multiply-accumulate. + +## Mechanism + +Combines the MX scaling of [`pto.mad_mx`](./mad-mx.md) with the bias-init seed of [`pto.mad_bias`](./mad-bias.md). The accumulator starts from `bias[n]` instead of zero. + +## Syntax + +```mlir +pto.mad_mx_bias %lhs, %rhs, %dst, %bias, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + n_dir? + : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Inputs + +Same parameter shape as [`pto.mad_bias`](./mad-bias.md#inputs), with MX `%lhs` / `%rhs` scale payload requirements from [`pto.mad_mx`](./mad-mx.md). + +See [MAD Common Clauses](./mad.md#mad-common-clauses) for the optional clauses. 
`tf32_mode(...)` is not accepted. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the produced `M x N` MX-scaled tile to L0C with bias-init seed. | + +## Side Effects + +Engages the CUBE pipe; reads `%bias` from BT and MX scale payloads associated with `%lhs` / `%rhs`; writes to L0C. + +## Constraints + +!!! warning "Constraints" + - All constraints from [`pto.mad_mx`](./mad-mx.md#constraints) (MX dtype combination, scale payload prerequisites, K grouping rule). + - All `%bias` constraints from [`pto.mad_bias`](./mad-bias.md#constraints): `%bias` must be in `bt` space with element type matching `%dst`; only `N` values are consumed. + +## Examples + +```mlir +pto.mad_mx_bias %l0a, %l0b, %l0c, %bt, %c16_i64, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Related Ops + +- Zero-init MX form: [pto.mad_mx](./mad-mx.md) +- Accumulating MX form: [pto.mad_mx_acc](./mad-mx-acc.md) +- Non-MX bias-init form: [pto.mad_bias](./mad-bias.md) +- Bias staging: [pto.mte_l1_bt](../data-movement/mte-l1-bt.md) +- MX scale loaders: [pto.mte_l1_l0a_mx](../data-movement/mte-l1-l0a-mx.md), [pto.mte_l1_l0b_mx](../data-movement/mte-l1-l0b-mx.md) diff --git a/docs/isa/cube/ops/mad/mad-mx-bias_zh.md b/docs/isa/cube/ops/mad/mad-mx-bias_zh.md new file mode 100644 index 000000000..1ea0cfef6 --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-mx-bias_zh.md @@ -0,0 +1,61 @@ +# pto.mad_mx_bias + +`pto.mad_mx_bias` 属于 [Cube MAD 指令](../../README_zh.md#矩阵乘加mad指令)。 + +## 摘要 + +偏置初始化 **MX(微缩放)** cube 矩阵乘加:`dst[m, n] = mx_product[m, n] + bias[n]`。 + +MX 分组乘加的语义见 [MX Matmul 模型](./mad-mx_zh.md#mx-matmul-模型)。 + +## 机制 + +把 [`pto.mad_mx`](./mad-mx_zh.md) 的 MX 缩放与 [`pto.mad_bias`](./mad-bias_zh.md) 的偏置初始化结合在一起。累加器从 `bias[n]` 开始,而不是 0。 + +## 语法 + +```mlir +pto.mad_mx_bias %lhs, %rhs, %dst, %bias, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + n_dir? 
+ : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 输入 + +参数形状同 [`pto.mad_bias`](./mad-bias_zh.md#输入),且 `%lhs` / `%rhs` 需带有 [`pto.mad_mx`](./mad-mx_zh.md) 所要求的 MX scale payload。 + +可选 clauses 见 [MAD 通用 Clauses](./mad_zh.md#mad-通用-clauses),但不接受 `tf32_mode(...)`。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 以 bias 为种子,把 MX 缩放后的 `M x N` 结果写到 L0C。 | + +## 副作用 + +占用 CUBE 流水线;从 BT 读 `%bias`,从与 `%lhs` / `%rhs` 关联的 MX scale 状态读 scale;写 L0C。 + +## 约束 + +!!! warning "约束" + - [`pto.mad_mx`](./mad-mx_zh.md#约束) 的全部约束(MX 类型组合、scale payload 前置条件、K 分组规则)。 + - [`pto.mad_bias`](./mad-bias_zh.md#约束) 中关于 `%bias` 的所有约束:`%bias` 必须在 `bt` 空间,元素类型与 `%dst` 相同,只消费 `N` 个值。 + +## 示例 + +```mlir +pto.mad_mx_bias %l0a, %l0b, %l0c, %bt, %c16_i64, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 相关指令 + +- 零初始化 MX 形式:[pto.mad_mx](./mad-mx_zh.md) +- MX 累加形式:[pto.mad_mx_acc](./mad-mx-acc_zh.md) +- 非 MX 偏置初始化形式:[pto.mad_bias](./mad-bias_zh.md) +- Bias 装载:[pto.mte_l1_bt](../data-movement/mte-l1-bt_zh.md) +- MX scale 装载:[pto.mte_l1_l0a_mx](../data-movement/mte-l1-l0a-mx_zh.md)、[pto.mte_l1_l0b_mx](../data-movement/mte-l1-l0b-mx_zh.md) diff --git a/docs/isa/cube/ops/mad/mad-mx.md b/docs/isa/cube/ops/mad/mad-mx.md new file mode 100644 index 000000000..0b4427be4 --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-mx.md @@ -0,0 +1,78 @@ +# pto.mad_mx + +`pto.mad_mx` is part of the [Cube MAD Ops](../../README.md#matrix-multiply-mad-ops). + +## Summary + +Zero-init **MX (microscaled)** cube matrix multiply: `dst[m, n] = mx_product[m, n]`. + +## MX Matmul Model + +`pto.mad_mx*` additionally applies microscaling. The scale payloads are loaded with [`pto.mte_l1_l0a_mx`](../data-movement/mte-l1-l0a-mx.md) / [`pto.mte_l1_l0b_mx`](../data-movement/mte-l1-l0b-mx.md) and are associated with the selected `%lhs` / `%rhs` tiles; they are **not** direct operands of `pto.mad_mx*`. 
+ +The K dimension is partitioned into 32-element groups: + +```text +k_group = floor(k / 32) + +mx_product[m, n] = + sum k in 0 .. K-1: + (lhs[m, k] * lhs_scale[m, k_group]) * + (rhs[k, n] * rhs_scale[k_group, n]) +``` + +Current target-profile MX data tiles use `f8E4M3FN`. `%k` must be compatible with MX grouping. On the current target profile, MX matmul consumes K in 64-element multiples, which contain two 32-element scale groups. + +## Mechanism + +Functionally equivalent to [`pto.mad`](./mad.md) but with the MX scaling applied during the multiply-accumulate. Like `pto.mad`, the result overwrites L0C. + +## Syntax + +```mlir +pto.mad_mx %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + n_dir? + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Inputs + +Same parameter shape as [`pto.mad`](./mad.md#inputs). `%lhs` and `%rhs` must additionally have matching MX scale payloads loaded into L0A / L0B before this op is issued. + +See [MAD Common Clauses](./mad.md#mad-common-clauses) for the optional clauses (note: `tf32_mode(...)` is **not** a clause of MX MAD). + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the produced `M x N` MX-scaled tile to L0C. | + +## Side Effects + +Engages the CUBE pipe; reads scale payloads associated with `%lhs` / `%rhs`; writes to L0C. + +## Constraints + +!!! warning "Constraints" + - Operands must use a target-supported MX dtype combination (currently `f8E4M3FN` on the supported profile). + - Matching left and right MX scale payloads must be loaded before this op via [`pto.mte_l1_l0a_mx`](../data-movement/mte-l1-l0a-mx.md) / [`pto.mte_l1_l0b_mx`](../data-movement/mte-l1-l0b-mx.md). + - `%k` must satisfy the MX grouping rule described in [MX Matmul Model](#mx-matmul-model). + - `tf32_mode(...)` is not a clause of MX MAD. + - Other constraints match [`pto.mad`](./mad.md#constraints). 
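Reviewer note on the MX Matmul Model in the `pto.mad_mx` page above: the per-K-group scaling can be sketched as a scalar reference. This is an illustrative sketch only. The function name `mx_product` and the `k_multiple` parameter are hypothetical; Python floats stand in for the `f8E4M3FN` data and for the scale payloads, whose actual encodings are not modelled. The 32-element group size and the 64-element K multiple are taken from the page text for the current target profile.

```python
# Scalar reference for the MX matmul model: one scale per 32-element K group.
# Names and the plain-float representation are assumptions for illustration.

MX_GROUP = 32  # K elements that share one scale value


def mx_product(lhs, lhs_scale, rhs, rhs_scale, m, n, k, k_multiple=64):
    """Return the M x N list-of-lists mx_product.

    lhs:       M x K data values
    lhs_scale: M x (K / 32) scales, indexed [row][k_group]
    rhs:       K x N data values
    rhs_scale: (K / 32) x N scales, indexed [k_group][col]
    """
    if k <= 0 or k % k_multiple != 0:
        raise ValueError("k must be a positive multiple of %d" % k_multiple)
    out = [[0.0 for _ in range(n)] for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for kk in range(k):
                g = kk // MX_GROUP  # k_group = floor(k / 32)
                acc += (lhs[i][kk] * lhs_scale[i][g]) * (rhs[kk][j] * rhs_scale[g][j])
            out[i][j] = acc
    return out
```

The accumulating and bias-init MX forms differ only in the starting value of `acc` (existing `dst[i][j]` or `bias[j]` instead of zero).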
+ +## Examples + +```mlir +pto.mad_mx %l0a, %l0b, %l0c, %c16_i64, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Related Ops + +- Non-MX form: [pto.mad](./mad.md) +- Accumulating MX form: [pto.mad_mx_acc](./mad-mx-acc.md) +- Bias-init MX form: [pto.mad_mx_bias](./mad-mx-bias.md) +- MX scale loaders: [pto.mte_l1_l0a_mx](../data-movement/mte-l1-l0a-mx.md), [pto.mte_l1_l0b_mx](../data-movement/mte-l1-l0b-mx.md) diff --git a/docs/isa/cube/ops/mad/mad-mx_zh.md b/docs/isa/cube/ops/mad/mad-mx_zh.md new file mode 100644 index 000000000..a10fb9684 --- /dev/null +++ b/docs/isa/cube/ops/mad/mad-mx_zh.md @@ -0,0 +1,78 @@ +# pto.mad_mx + +`pto.mad_mx` 属于 [Cube MAD 指令](../../README_zh.md#矩阵乘加mad指令)。 + +## 摘要 + +零初始化 **MX(微缩放)** cube 矩阵乘加:`dst[m, n] = mx_product[m, n]`。 + +## MX Matmul 模型 + +`pto.mad_mx*` 在 MAD 之外额外应用微缩放。Scale payload 通过 [`pto.mte_l1_l0a_mx`](../data-movement/mte-l1-l0a-mx_zh.md) / [`pto.mte_l1_l0b_mx`](../data-movement/mte-l1-l0b-mx_zh.md) 加载,**不**作为本指令的直接操作数。 + +K 维按 32 个元素分组: + +```text +k_group = floor(k / 32) + +mx_product[m, n] = + sum k in 0 .. K-1: + (lhs[m, k] * lhs_scale[m, k_group]) * + (rhs[k, n] * rhs_scale[k_group, n]) +``` + +当前目标 profile 的 MX 数据 tile 使用 `f8E4M3FN`。`%k` 必须满足 MX 分组规则;目前的目标 profile 要求 MX matmul 消费的 K 是 64 的倍数(包含两个 32 元素 scale 组)。 + +## 机制 + +功能上等同于 [`pto.mad`](./mad_zh.md),但在乘加过程中应用 MX scale。与 `pto.mad` 一样,结果覆盖 L0C。 + +## 语法 + +```mlir +pto.mad_mx %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + n_dir? + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 输入 + +参数形状同 [`pto.mad`](./mad_zh.md#输入)。本指令发起前,`%lhs` / `%rhs` 必须已经加载好匹配的 MX scale payload。 + +可选 clauses 见 [MAD 通用 Clauses](./mad_zh.md#mad-通用-clauses)(注意:`tf32_mode(...)` **不**是 MX MAD 的 clause)。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把 MX 缩放后的 `M x N` 结果 tile 写到 L0C。 | + +## 副作用 + +占用 CUBE 流水线;读取与 `%lhs` / `%rhs` 关联的 scale payload;写 L0C。 + +## 约束 + +!!! 
warning "约束" + - 操作数必须使用目标支持的 MX 数据类型组合(当前 profile 是 `f8E4M3FN`)。 + - 在本指令前必须通过 [`pto.mte_l1_l0a_mx`](../data-movement/mte-l1-l0a-mx_zh.md) / [`pto.mte_l1_l0b_mx`](../data-movement/mte-l1-l0b-mx_zh.md) 加载好匹配的左右 MX scale payload。 + - `%k` 必须满足 [MX Matmul 模型](#mx-matmul-模型) 描述的分组规则。 + - `tf32_mode(...)` 不是 MX MAD 的 clause。 + - 其它约束同 [`pto.mad`](./mad_zh.md#约束)。 + +## 示例 + +```mlir +pto.mad_mx %l0a, %l0b, %l0c, %c16_i64, %c16_i64, %c64_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 相关指令 + +- 非 MX 形式:[pto.mad](./mad_zh.md) +- MX 累加形式:[pto.mad_mx_acc](./mad-mx-acc_zh.md) +- MX 偏置初始化形式:[pto.mad_mx_bias](./mad-mx-bias_zh.md) +- MX scale 装载:[pto.mte_l1_l0a_mx](../data-movement/mte-l1-l0a-mx_zh.md)、[pto.mte_l1_l0b_mx](../data-movement/mte-l1-l0b-mx_zh.md) diff --git a/docs/isa/cube/ops/mad/mad.md b/docs/isa/cube/ops/mad/mad.md new file mode 100644 index 000000000..710f91ad0 --- /dev/null +++ b/docs/isa/cube/ops/mad/mad.md @@ -0,0 +1,82 @@ +# pto.mad + +`pto.mad` is part of the [Cube MAD Ops](../../README.md#matrix-multiply-mad-ops). + +## Summary + +Zero-init cube matrix multiply: `dst[m, n] = sum_k(lhs[m, k] * rhs[k, n])`. + +## Mechanism + +Reads tiled operands from L0A and L0B, multiplies them in the cube MMAD pipe, and writes the accumulator tile in L0C. The result overwrites L0C (no accumulation with prior L0C state — use [`pto.mad_acc`](./mad-acc.md) for accumulation, or [`pto.mad_bias`](./mad-bias.md) for bias-init). + +The matrix element types are inferred from `%lhs`, `%rhs`, and `%dst` pointer element types — there is no separate type selector. Unsupported type combinations are invalid programs. + +## Syntax + +```mlir +pto.mad %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + tf32_mode(round_even | round_away)? + n_dir? 
+ : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Inputs + +| Parameter | Type | Description | +|-----------|------|-------------| +| `%lhs` | `!pto.ptr` | Left operand tile in L0A, interpreted as logical `M x K` | +| `%rhs` | `!pto.ptr` | Right operand tile in L0B, interpreted as logical `K x N` | +| `%dst` | `!pto.ptr` | Accumulator destination tile in L0C, interpreted as logical `M x N` | +| `%m` | `i64` | Logical M element count | +| `%n` | `i64` | Logical N element count | +| `%k` | `i64` | Logical K element count | + +See [MAD Common Clauses](#mad-common-clauses) for the optional clauses. + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | Writes the produced `M x N` tile to L0C. No SSA result. | + +## Side Effects + +Engages the CUBE pipe and writes to L0C. Downstream FIXPIPE consumers must synchronize through `pto.set_flag` / `pto.wait_flag` (`PIPE_CUBE` → `PIPE_FIXP`). + +## Constraints + +!!! warning "Constraints" + - `%lhs`, `%rhs`, and `%dst` must be in `l0a`, `l0b`, and `l0c`. + - `%m`, `%n`, and `%k` must be positive and satisfy the target shape limits for the selected element-type combination. + - `tf32_mode(...)` requires `f32` lhs, rhs, and dst element types. + - `sat` / `nosat` requires a floating element-type combination. + - Packed 4-bit integer data requires `%k` to select an even number of K elements. + +## MAD Common Clauses + +| Clause | Values | Effect | +|--------|--------|--------| +| `unit_flag(...)` | `check_only`, `check_and_set` | Participates in producer-side tile synchronization. `check_only` checks that the producer slot can be used. `check_and_set` also publishes the produced `%dst` tile for later consumers. Omit when the schedule does not use unit flags for this tile. | +| `disable_gemv` | flag | Applies only when `%m = 1`. Omitted means GEMV A-vector consumption: `%lhs` must contain the logical `1 x K` row in the target GEMV left-tile organization. 
Present means normal matmul left-tile organization. The mathematical result is still `lhs @ rhs`; only the required `%lhs` organization changes. For `%m != 1`, normal matmul organization is used. | +| `sat` / `nosat` | flags | Floating exceptional-value mode for floating and MX MAD forms. With `sat`, exceptional multiply inputs are normalized before arithmetic (`+/-inf` to finite type extrema, `nan` to 0) and finite overflow saturates to the finite type range. With `nosat`, exceptional inputs are preserved and overflow may produce exceptional outputs. Omit both to use the execution mode selected outside this op. Integer MAD forms do not accept these flags. | +| `tf32_mode(...)` | `round_even`, `round_away` | Valid only for non-MX `f32 x f32 -> f32`. FP32 inputs are rounded to TF32 precision before multiplication; accumulation and output remain FP32. | +| `n_dir` | flag | Requests N-direction result production order for schedules that combine compute with unit flags and later layout movement. It does not change `dst[m, n]`. 
| + +## Examples + +```mlir +pto.mad %l0a, %l0b, %l0c, %c16_i64, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## Related Ops + +- Accumulating form: [pto.mad_acc](./mad-acc.md) +- Bias-init form: [pto.mad_bias](./mad-bias.md) +- MX variants: [pto.mad_mx](./mad-mx.md), [pto.mad_mx_acc](./mad-mx-acc.md), [pto.mad_mx_bias](./mad-mx-bias.md) +- Operand staging: [pto.mte_l1_l0a](../data-movement/mte-l1-l0a.md), [pto.mte_l1_l0b](../data-movement/mte-l1-l0b.md) +- Result writeback: [FIXPIPE Model](../../fixpipe-model.md) diff --git a/docs/isa/cube/ops/mad/mad_zh.md b/docs/isa/cube/ops/mad/mad_zh.md new file mode 100644 index 000000000..89c3ee50b --- /dev/null +++ b/docs/isa/cube/ops/mad/mad_zh.md @@ -0,0 +1,82 @@ +# pto.mad + +`pto.mad` 属于 [Cube MAD 指令](../../README_zh.md#矩阵乘加mad指令)。 + +## 摘要 + +零初始化 cube 矩阵乘加:`dst[m, n] = sum_k(lhs[m, k] * rhs[k, n])`。 + +## 机制 + +从 L0A、L0B 读取已分块的操作数,在 cube MMAD 流水线中相乘后,把累加 tile 写到 L0C。结果**覆盖** L0C(不会与原有 L0C 状态累加;如需累加请使用 [`pto.mad_acc`](./mad-acc_zh.md),需要偏置初始化请用 [`pto.mad_bias`](./mad-bias_zh.md))。 + +矩阵元素类型从 `%lhs`、`%rhs`、`%dst` 三个指针的元素类型推断,没有单独的类型选择器。不被支持的类型组合属于非法程序。 + +## 语法 + +```mlir +pto.mad %lhs, %rhs, %dst, %m, %n, %k + unit_flag(check_only | check_and_set)? + disable_gemv? + (sat | nosat)? + tf32_mode(round_even | round_away)? + n_dir? + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 输入 + +| 参数 | 类型 | 描述 | +|-----------|------|------| +| `%lhs` | `!pto.ptr` | L0A 中的左操作数 tile,逻辑形状 `M x K` | +| `%rhs` | `!pto.ptr` | L0B 中的右操作数 tile,逻辑形状 `K x N` | +| `%dst` | `!pto.ptr` | L0C 中的累加器目标 tile,逻辑形状 `M x N` | +| `%m` | `i64` | 逻辑 M 元素数 | +| `%n` | `i64` | 逻辑 N 元素数 | +| `%k` | `i64` | 逻辑 K 元素数 | + +可选 clauses 见 [MAD 通用 Clauses](#mad-通用-clauses)。 + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 把 `M x N` 结果 tile 写到 L0C,没有 SSA 结果。 | + +## 副作用 + +占用 CUBE 流水线并写 L0C。下游 FIXPIPE 消费者必须通过 `pto.set_flag` / `pto.wait_flag`(`PIPE_CUBE` → `PIPE_FIXP`)做同步。 + +## 约束 + +!!! 
warning "约束" + - `%lhs`、`%rhs`、`%dst` 分别必须在 `l0a`、`l0b`、`l0c`。 + - `%m`、`%n`、`%k` 必须为正,并满足所选元素类型组合的目标形状上限。 + - `tf32_mode(...)` 仅在 `f32` lhs、rhs、dst 上有效。 + - `sat` / `nosat` 仅对浮点元素类型组合有效。 + - 打包 4-bit 整型数据要求 `%k` 选择偶数个 K 元素。 + +## MAD 通用 Clauses + +| Clause | 取值 | 作用 | +|--------|------|------| +| `unit_flag(...)` | `check_only`、`check_and_set` | 参与生产者侧的 tile 同步。`check_only` 检查生产者 slot 可用;`check_and_set` 还会把生产出的 `%dst` tile 发布给后续消费者。如果调度不使用 unit flag,省略即可。 | +| `disable_gemv` | flag | 仅在 `%m = 1` 时生效。省略意味着 GEMV A-vector 消费:`%lhs` 必须按目标的 GEMV 左 tile 组织包含逻辑的 `1 x K` 行;带上则使用普通 matmul 左 tile 组织。`%m != 1` 时使用普通 matmul 组织。 | +| `sat` / `nosat` | flag | 浮点与 MX MAD 的浮点异常值行为。`sat` 时异常乘数会先做规范化(`±inf` 映射到有限类型极值,`nan` 映射到 0),有限溢出饱和到有限范围;`nosat` 保留异常输入,溢出可能产生异常输出。两者都不带时使用外部选择的执行模式。整型 MAD 不接受这两个 flag。 | +| `tf32_mode(...)` | `round_even`、`round_away` | 仅适用于非 MX 的 `f32 x f32 -> f32`。FP32 输入在相乘前先舍入到 TF32 精度;累加与输出仍为 FP32。 | +| `n_dir` | flag | 请求按 N 方向产出结果,方便与 unit flag 调度和后续布局移动配合。不会改变 `dst[m, n]`。 | + +## 示例 + +```mlir +pto.mad %l0a, %l0b, %l0c, %c16_i64, %c16_i64, %c32_i64 + : !pto.ptr, !pto.ptr, !pto.ptr, i64, i64, i64 +``` + +## 相关指令 + +- 累加形式:[pto.mad_acc](./mad-acc_zh.md) +- 偏置初始化形式:[pto.mad_bias](./mad-bias_zh.md) +- MX 变体:[pto.mad_mx](./mad-mx_zh.md)、[pto.mad_mx_acc](./mad-mx-acc_zh.md)、[pto.mad_mx_bias](./mad-mx-bias_zh.md) +- 操作数搬运:[pto.mte_l1_l0a](../data-movement/mte-l1-l0a_zh.md)、[pto.mte_l1_l0b](../data-movement/mte-l1-l0b_zh.md) +- 结果回写:[FIXPIPE 模型](../../fixpipe-model_zh.md) diff --git a/docs/isa/instruction-families/README.md b/docs/isa/instruction-families/README.md index 7c6725d5c..19f6cfc8b 100644 --- a/docs/isa/instruction-families/README.md +++ b/docs/isa/instruction-families/README.md @@ -9,7 +9,7 @@ PTO ISA is organized into five instruction sets, each representing a distinct ha | | Instruction Set | Prefix | Domain | Primary Role | Operand Types | |-|-----------------|--------|--------|-------------|--------------| | | [Tile Instruction 
Set](./tile-families.md) | `pto.t*` | Tile Buffers (Vec/Mat/Acc/Left/Right) | Tile-oriented compute, data movement, layout transforms, synchronization | `!pto.tile<...>`, `!pto.tile_buf<...>`, `!pto.partition_tensor_view<...>` | -| | [Vector Instruction Set](../instruction-families/vector-families.md) | `pto.v*` | Vector Pipeline (V) | Lane-level compute, masking, vector load/store, SFU operations | `!pto.vreg`, `!pto.mask`, `!pto.ptr` | +| | [Vector Instruction Set](../instruction-families/vector-families.md) | `pto.v*` | Vector Pipeline (V) | Lane-level compute, masking, vector load/store, SFU operations | `!pto.vreg`, `!pto.mask`, `!pto.ptr` | | | [Scalar/Control Instruction Set](./scalar-and-control-families.md) | `pto.*` | Scalar Unit, DMA Controller | Configuration, synchronization, DMA setup, predicate generation and load/store | Scalar registers, pipe IDs, event IDs, buffer IDs, predicate masks | | | [Communication Instruction Set](./communication-families.md) | `pto.tbroadcast`, `pto.tget`, etc. | Inter-NPU Interconnect | Collective communication, point-to-point exchange, runtime synchronization | `!pto.group`, tile operands, allocation handles | | | [System Scheduling Instruction Set](../system/README.md) | `pto.tpush`, `pto.tpop`, `pto.tfree` | Runtime-visible scheduling state | TPipe/TMPipe producer-consumer flow and resource lifetime | Tile handles, stream state, resource handles | @@ -121,7 +121,7 @@ Instruction Sets │ ├── Compare and Select → pto.vcmp, pto.vcmps, pto.vsel, pto.vselr, pto.vselrv2 │ ├── Data Rearrangement → pto.vintlv, pto.vdintlv, pto.vslide, pto.vshift, │ │ pto.vpack, pto.vzunpack, pto.vperm, etc. -│ └── SFU and DSA Instructions → pto.vprelu, pto.vexpdiff, pto.vaxpy, pto.vtranspose, +│ └── SFU and DSA Instructions → pto.vprelu, pto.vexpdif, pto.vaxpy, pto.vtranspose, │ pto.vsort32, pto.vmrgsort, etc. 
│ ├── Scalar And Control Instruction Set diff --git a/docs/isa/instruction-families/scalar-and-control-families.md b/docs/isa/instruction-families/scalar-and-control-families.md index 8a0a2181a..a0f0d6e0f 100644 --- a/docs/isa/instruction-families/scalar-and-control-families.md +++ b/docs/isa/instruction-families/scalar-and-control-families.md @@ -41,7 +41,7 @@ Scalar/control instructions produce: - Control state changes (pipeline barriers, control flow) - Event tokens for explicit synchronization (`RecordEvent`) -- Predicate masks (`!pto.mask`) +- Predicate masks (`!pto.mask`) - Configured DMA state ready for transfer - UB buffer handles diff --git a/docs/isa/instruction-families/vector-families.md b/docs/isa/instruction-families/vector-families.md index 7626e90d1..38e73ed3f 100644 --- a/docs/isa/instruction-families/vector-families.md +++ b/docs/isa/instruction-families/vector-families.md @@ -45,7 +45,7 @@ Vector Registers (!pto.vreg) ──► Vector Compute (pto.v*) ──► Ve | | Reduction Instructions | Cross-lane reductions (channelled) | `vcadd`, `vcmax`, `vcmin`, `vcgadd`, `vcgmax` | | | Compare and Select | Comparison and conditional lane selection | `vcmp`, `vcmps`, `vsel`, `vselr`, `vselrv2` | | | Data Rearrangement | Lane permutation, interleaving, packing | `vintlv`, `vdintlv`, `vslide`, `vshift`, `vpack`, `vzunpack` | -| | SFU and DSA Instructions | Special function units and DSA-style operations | `vprelu`, `vexpdiff`, `vaxpy`, `vtranspose`, `vsort32` | +| | SFU and DSA Instructions | Special function units and DSA-style operations | `vprelu`, `vexpdif`, `vaxpy`, `vtranspose`, `vsort32` | ## Inputs @@ -53,7 +53,7 @@ Vector instructions consume combinations of: - Vector registers (`!pto.vreg`) - Scalar registers or immediate operands -- Predicate masks (`!pto.mask`) — selects which lanes participate +- Predicate masks (`!pto.mask`) — selects which lanes participate - Memory addresses (`!pto.ptr`) — for load/store ops - Rounding-mode or distribution-mode 
attributes @@ -99,7 +99,7 @@ Every vector instruction set must state: ## Mask Behavior -Vector operations can be gated by a predicate mask. A predicate mask (`!pto.mask`) with width equal to the vector length `N` selects which lanes participate: +Vector operations can be gated by a predicate mask. A predicate mask (`!pto.mask`) with width equal to the vector length `N` selects which lanes participate: - Lanes where the mask bit is **1**: the operation executes normally. - Lanes where the mask bit is **0**: the operation produces a **defined result** but the specific value depends on the operation: @@ -170,13 +170,13 @@ vlds %vreg, %ub_ptr[%offset] {dist = "NORM"} : !pto.ptr ```mlir %vdst = pto.vadd %vsrc0, %vsrc1, %mask - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ### DPS Form (AS Level 2) ```mlir -pto.vadd ins(%vsrc0, %vsrc1, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) +pto.vadd ins(%vsrc0, %vsrc1, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) outs(%vdst : !pto.vreg<64xf32>) ``` diff --git a/docs/isa/introduction/current-isa-scope.md b/docs/isa/introduction/current-isa-scope.md index 9738d6302..542a5e9ee 100644 --- a/docs/isa/introduction/current-isa-scope.md +++ b/docs/isa/introduction/current-isa-scope.md @@ -26,6 +26,10 @@ That yields **289 named instructions** in the current reference set. ## Tile Instruction Inventory +### View and Tile Buffer + +`make_tensor_view`, `get_tensor_view_dim`, `get_tensor_view_stride`, `tensor_view_addr`, `partition_view`, `alloc_tile`, `subset`, `set_validshape`, `tile_buf_addr` + ### Sync And Config `tsync`, `tassign`, `talias`, `sethf32mode`, `settf32mode`, `setfmatrix`, `set_img2col_rpt`, `set_img2col_padding`, `subview`, `get_scale_addr` @@ -62,7 +66,7 @@ That yields **289 named instructions** in the current reference set. 
### Vector Load-Store -`vgather2`, `vgather2_bc`, `vgatherb`, `vldas`, `vlds`, `vldus`, `vldx2`, `vscatter`, `vsld`, `vsldb`, `vsst`, `vsstb`, `vsta`, `vstar`, `vstas`, `vsts`, `vstu`, `vstur`, `vstus`, `vstx2` +`vgather2`, `vgather2_bc`, `vgatherb`, `vldas`, `vlds`, `vldus`, `vldsx2`, `vscatter`, `vsld`, `vsldb`, `vsst`, `vsstb`, `vsta`, `vstar`, `vstas`, `vsts`, `vstu`, `vstur`, `vstus`, `vstsx2` ### Predicate And Materialization @@ -98,7 +102,7 @@ That yields **289 named instructions** in the current reference set. ### SFU And DSA Operations -`vaddrelu`, `vaddreluconv`, `vaxpy`, `vexpdiff`, `vmrgsort`, `vmula`, `vmulconv`, `vmull`, `vprelu`, `vsort32`, `vsubrelu`, `vtranspose` +`vaddrelu`, `vaddreluconv`, `vaxpy`, `vexpdif`, `vmrgsort`, `vmula`, `vmulconv`, `vmull`, `vprelu`, `vsort32`, `vsubrelu`, `vtranspose` ## Scalar And Control Instruction Inventory diff --git a/docs/isa/introduction/current-isa-scope_zh.md b/docs/isa/introduction/current-isa-scope_zh.md index bd16e5c42..43764cfd7 100644 --- a/docs/isa/introduction/current-isa-scope_zh.md +++ b/docs/isa/introduction/current-isa-scope_zh.md @@ -26,6 +26,10 @@ PTO 定义五套具名指令集,并为每条指令提供明确参考页: ## Tile 指令清单 +### 视图与 Tile Buffer + +`make_tensor_view`、`get_tensor_view_dim`、`get_tensor_view_stride`、`tensor_view_addr`、`partition_view`、`alloc_tile`、`subset`、`set_validshape`、`tile_buf_addr` + ### 同步与配置 `tsync`、`tassign`、`talias`、`sethf32mode`、`settf32mode`、`setfmatrix`、`set_img2col_rpt`、`set_img2col_padding`、`subview`、`get_scale_addr` @@ -62,7 +66,7 @@ PTO 定义五套具名指令集,并为每条指令提供明确参考页: ### 向量加载存储 -`vgather2`、`vgather2_bc`、`vgatherb`、`vldas`、`vlds`、`vldus`、`vldx2`、`vscatter`、`vsld`、`vsldb`、`vsst`、`vsstb`、`vsta`、`vstar`、`vstas`、`vsts`、`vstu`、`vstur`、`vstus`、`vstx2` +`vgather2`、`vgather2_bc`、`vgatherb`、`vldas`、`vlds`、`vldus`、`vldsx2`、`vscatter`、`vsld`、`vsldb`、`vsst`、`vsstb`、`vsta`、`vstar`、`vstas`、`vsts`、`vstu`、`vstur`、`vstus`、`vstsx2` ### 谓词与物化 @@ -98,7 +102,7 @@ PTO 定义五套具名指令集,并为每条指令提供明确参考页: ### SFU 与 DSA 操作 
-`vaddrelu`、`vaddreluconv`、`vaxpy`、`vexpdiff`、`vmrgsort`、`vmula`、`vmulconv`、`vmull`、`vprelu`、`vsort32`、`vsubrelu`、`vtranspose` +`vaddrelu`、`vaddreluconv`、`vaxpy`、`vexpdif`、`vmrgsort`、`vmula`、`vmulconv`、`vmull`、`vprelu`、`vsort32`、`vsubrelu`、`vtranspose` ## 标量与控制指令清单 diff --git a/docs/isa/introduction/what-is-pto-visa.md b/docs/isa/introduction/what-is-pto-visa.md index dc407f7ff..f10caf9bd 100644 --- a/docs/isa/introduction/what-is-pto-visa.md +++ b/docs/isa/introduction/what-is-pto-visa.md @@ -355,7 +355,7 @@ GM ──(copy_gm_to_ubuf)──► UB ──(vlds)──► Vector Register ─ | Data movement | TLOAD/TSTORE (implicit tile↔UB) | copy_gm_to_ubuf / copy_ubuf_to_gm + vlds/vsts | | Synchronization | TSYNC, set_flag/wait_flag | set_flag/wait_flag on vector pipe, mem_bar | | Layout control | Via tile layout parameters | Via distribution mode (NORM, BRC, DS, etc.) | -| Predicate support | No per-lane masking | Yes — `%mask : !pto.mask` on every vector op | +| Predicate support | No per-lane masking | Yes — `%mask : !pto.mask` on every vector op | | Target portability | All profiles | A5 hardware; emulated on CPU/A2/A3 | ## Audience: Who Reads This Manual diff --git a/docs/isa/scalar/control-and-configuration.md b/docs/isa/scalar/control-and-configuration.md index c726cdab3..a2d6f7748 100644 --- a/docs/isa/scalar/control-and-configuration.md +++ b/docs/isa/scalar/control-and-configuration.md @@ -12,7 +12,7 @@ This overview groups all scalar/control operations by their architectural role: - [Pipeline sync](./pipeline-sync.md): explicit producer-consumer edges, buffer-token protocols, and memory barriers. - [DMA copy](./dma-copy.md): loop-size and stride configuration plus GM↔vector-tile-buffer and vector-tile-buffer↔vector-tile-buffer copy operations. -- [Predicate load store](./predicate-load-store.md): moving `!pto.mask` state through UB and handling unaligned predicate-store streams. 
+- [Predicate load store](./predicate-load-store.md): moving `!pto.mask` state through UB and handling unaligned predicate-store streams. - [Predicate generation and algebra](./predicate-generation-and-algebra.md): mask creation, tail masks, boolean combination, and predicate rearrangement. - [Micro-instruction reference](./ops/micro-instruction/README.md): scalar/vector boundary and runtime query operations. diff --git a/docs/isa/scalar/dma-copy.md b/docs/isa/scalar/dma-copy.md index f0e043970..735459463 100644 --- a/docs/isa/scalar/dma-copy.md +++ b/docs/isa/scalar/dma-copy.md @@ -1,15 +1,25 @@ # DMA Copy -These `pto.*` forms configure and execute scalar-side DMA movement between GM and UB or inside UB. They are part of the scalar and control instructions because they define configuration and copy behavior, not vector-register compute. +These `pto.*` forms configure and execute scalar-side DMA movement between GM, UB, and L1. They are part of the scalar and control instructions because they describe DMA configuration and copy behavior, not vector-register compute. ## What This Instruction Set Covers -- nested-loop size and stride registers for GM↔UB transfers -- GM to UB copies -- UB to GM copies -- UB to UB copies +- Grouped GM↔UB transfers with inline burst / loop / pad clauses +- Grouped UB↔UB and UB→L1 copies +- (Pre-v0.6) standalone loop-size and loop-stride configuration registers -## Per-Op Pages +## v0.6 Grouped Transfer Ops + +These are the four public grouped DMA interfaces in the PTO ISA v0.6 micro-instruction surface. Each instruction expresses its repetition structure via inline `nburst(...)` / `loop(...)` clauses on the op itself; standalone loop / stride configuration registers are no longer required. 
+ +- [pto.mte_gm_ub](./ops/dma-copy/copy-gm-to-ubuf.md) — GM → UB, with optional `pad(...)` for 32B-aligned row padding +- [pto.mte_ub_gm](./ops/dma-copy/copy-ubuf-to-gm.md) — UB → GM, strips padding added during load +- [pto.mte_ub_ub](./ops/dma-copy/copy-ubuf-to-ubuf.md) — intra-UB copy in 32B-unit bursts with gap fields +- [pto.mte_ub_l1](./ops/dma-copy/mte-ub-l1.md) — UB → L1 (cube CBUF), 32B-unit bursts with gap fields + +## Deprecated Pre-v0.6 Configuration Ops + +These ops correspond to the older surface where loop counts and per-level strides were programmed via standalone configuration registers and then consumed by a separate copy op. In v0.6 the same information lives inline on the grouped transfer op (`nburst(...)` and outer `loop(...)` clauses). The pages below are retained for historical reference and pre-v0.6 ports. - [pto.set_loop_size_outtoub](./ops/dma-copy/set-loop-size-outtoub.md) - [pto.set_loop2_stride_outtoub](./ops/dma-copy/set-loop2-stride-outtoub.md) @@ -17,11 +27,11 @@ These `pto.*` forms configure and execute scalar-side DMA movement between GM an - [pto.set_loop_size_ubtoout](./ops/dma-copy/set-loop-size-ubtoout.md) - [pto.set_loop2_stride_ubtoout](./ops/dma-copy/set-loop2-stride-ubtoout.md) - [pto.set_loop1_stride_ubtoout](./ops/dma-copy/set-loop1-stride-ubtoout.md) -- [pto.copy_gm_to_ubuf](./ops/dma-copy/copy-gm-to-ubuf.md) -- [pto.copy_ubuf_to_gm](./ops/dma-copy/copy-ubuf-to-gm.md) -- [pto.copy_ubuf_to_ubuf](./ops/dma-copy/copy-ubuf-to-ubuf.md) + +The legacy execution ops `pto.copy_gm_to_ubuf` / `pto.copy_ubuf_to_gm` / `pto.copy_ubuf_to_ubuf` have been replaced by the v0.6 grouped forms `pto.mte_gm_ub` / `pto.mte_ub_gm` / `pto.mte_ub_ub` linked above. Their per-op pages (URL slugs preserved) now document the v0.6 surface. 
## Related Material - [Control and configuration](./control-and-configuration.md) - [Vector Instruction Set: DMA Copy](../vector/dma-copy.md) +- [Pipeline Synchronization](./ops/pipeline-sync/) diff --git a/docs/isa/scalar/dma-copy_zh.md b/docs/isa/scalar/dma-copy_zh.md index fa570a51e..89f0ce3a3 100644 --- a/docs/isa/scalar/dma-copy_zh.md +++ b/docs/isa/scalar/dma-copy_zh.md @@ -1,27 +1,37 @@ # DMA 拷贝 -这些 `pto.*` 形式配置并执行 GM↔UB 以及 UB 内部的标量侧 DMA 搬运。它们属于标量与控制指令,因为它们定义的是配置和搬运行为,而不是向量寄存器计算。 +这些 `pto.*` 形式配置并执行 GM、UB、L1 之间的标量侧 DMA 搬运。它们属于标量与控制指令,因为它们定义的是 DMA 配置和搬运行为,而不是向量寄存器计算。 ## 本指令集覆盖 -- GM↔UB 传输的嵌套循环大小和 stride 配置 -- GM → UB 拷贝 -- UB → GM 拷贝 -- UB → UB 拷贝 +- 使用内联 burst / loop / pad 子句的分组 GM↔UB 传输 +- 分组 UB↔UB 与 UB→L1 拷贝 +- (pre-v0.6 历史)独立的循环大小与循环 stride 配置寄存器 -## per-op 页面 +## v0.6 分组传输指令 -- `pto.set_loop_size_outtoub` -- `pto.set_loop2_stride_outtoub` -- `pto.set_loop1_stride_outtoub` -- `pto.set_loop_size_ubtoout` -- `pto.set_loop2_stride_ubtoout` -- `pto.set_loop1_stride_ubtoout` -- `pto.copy_gm_to_ubuf` -- `pto.copy_ubuf_to_gm` -- `pto.copy_ubuf_to_ubuf` +以下是 PTO ISA v0.6 微指令表面中四条公开的分组 DMA 接口。每条指令都通过内联的 `nburst(...)` / `loop(...)` 子句表达自己的重复结构,不再需要外部独立的循环/步长配置寄存器。 + +- [pto.mte_gm_ub](./ops/dma-copy/copy-gm-to-ubuf_zh.md):GM → UB,附带可选 `pad(...)` 做 32B 对齐行填充 +- [pto.mte_ub_gm](./ops/dma-copy/copy-ubuf-to-gm_zh.md):UB → GM,剥除 load 时增加的 padding +- [pto.mte_ub_ub](./ops/dma-copy/copy-ubuf-to-ubuf_zh.md):UB 内拷贝,以 32B 为单位的 burst + gap 字段 +- [pto.mte_ub_l1](./ops/dma-copy/mte-ub-l1_zh.md):UB → L1(cube CBUF),以 32B 为单位的 burst + gap 字段 + +## Pre-v0.6 已弃用配置指令 + +下列指令对应旧的表面:循环计数与每层步长在独立的配置寄存器里编程,再由单独的拷贝指令消费。v0.6 把这些信息全部放进了分组传输指令本身的 `nburst(...)` 与外层 `loop(...)` 子句。下面这些页面保留作为历史参考与 pre-v0.6 移植用途。 + +- [pto.set_loop_size_outtoub](./ops/dma-copy/set-loop-size-outtoub_zh.md) +- [pto.set_loop2_stride_outtoub](./ops/dma-copy/set-loop2-stride-outtoub_zh.md) +- [pto.set_loop1_stride_outtoub](./ops/dma-copy/set-loop1-stride-outtoub_zh.md) +- 
[pto.set_loop_size_ubtoout](./ops/dma-copy/set-loop-size-ubtoout_zh.md) +- [pto.set_loop2_stride_ubtoout](./ops/dma-copy/set-loop2-stride-ubtoout_zh.md) +- [pto.set_loop1_stride_ubtoout](./ops/dma-copy/set-loop1-stride-ubtoout_zh.md) + +旧的执行指令 `pto.copy_gm_to_ubuf` / `pto.copy_ubuf_to_gm` / `pto.copy_ubuf_to_ubuf` 已被 v0.6 的分组形式 `pto.mte_gm_ub` / `pto.mte_ub_gm` / `pto.mte_ub_ub` 取代(链接见上方)。它们对应的 per-op 页面(URL slug 保留不变)现在直接记录 v0.6 表面。 ## 相关页面 - [控制与配置](./control-and-configuration_zh.md) - [向量 DMA 路径](../vector/dma-copy_zh.md) +- [流水线同步](./ops/pipeline-sync/) diff --git a/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md b/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md index 10ac94e48..1f7deeb20 100644 --- a/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md +++ b/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf.md @@ -1,44 +1,41 @@ -# pto.copy_gm_to_ubuf +# pto.mte_gm_ub -`pto.copy_gm_to_ubuf` is part of the [DMA Copy](../../dma-copy.md) instruction set. +`pto.mte_gm_ub` is part of the [DMA Copy](../../dma-copy.md) instruction set. + +!!! note "PTO ISA v0.6 surface" + The v0.6 PTO micro-instruction surface replaces the earlier `pto.copy_gm_to_ubuf` plus standalone `set_loop_size_*` / `set_loop_stride_*` configuration ops with a single grouped instruction: `pto.mte_gm_ub` with inline `nburst(...)`, optional `loop(...)`, and optional `pad(...)` clauses. The information previously carried in separate loop / stride configuration registers is now expressed directly on the transfer op. ## Summary -Execute a DMA transfer from Global Memory into Unified Buffer using the current GM→UB loop and stride configuration. +Execute a grouped GM→UB DMA transfer. `nburst(...)` defines the innermost repeated burst transfer, optional `loop(...)` groups add outer repetition levels, and optional `pad(...)` controls UB row padding. ## Mechanism -The DMA engine reads `%n_burst` rows from `%gm_src` and writes them to `%ub_dst`. `%len_burst` controls the contiguous byte count copied per row. 
`%left_padding`, `%right_padding`, and `%data_select_bit` control whether the destination row is padded beyond the copied byte range. `%src_stride` and `%dst_stride` specify the row-to-row start offsets for this copy invocation. +The MTE2 engine reads `%n_burst` source rows from `%gm_src` and writes them to `%ub_dst`. Each row transfers `%len_burst` contiguous bytes, and the source / destination stride operands give the start-to-start byte distance from one row to the next. Optional outer `loop(...)` groups wrap `nburst(...)` to express multi-level repetition without external loop-config state. When `pad(...)` is present, UB rows are padded up to the next 32-byte aligned boundary using the supplied fill value. ## Syntax -### PTO Assembly Form - -```text -copy_gm_to_ubuf %gm_src, %ub_dst, %sid, %n_burst, %len_burst, %left_padding, %right_padding, %data_select_bit, %l2_cache_ctl, %src_stride, %dst_stride -``` - -### AS Level 1 (SSA) - ```mlir -pto.copy_gm_to_ubuf %gm_src, %ub_dst, %sid, %n_burst, %len_burst, %left_padding, %right_padding, %data_select_bit, %l2_cache_ctl, %src_stride, %dst_stride : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i1, i64, i64, i64 +pto.mte_gm_ub %gm_src, %ub_dst, %l2_cache_ctl, %len_burst + nburst(%n_burst, %src_stride, %dst_stride) + [loop(%loop_count, %loop_src_stride, %loop_dst_stride)]* + [pad(%pad_value[, %left_padding_count, %right_padding_count])] + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, + [loop i64, i64, i64,]* + [pad T[, i64, i64]] ``` ## Inputs -| Operand | Type | Description | -| --- | --- | --- | -| %gm_src | `!pto.ptr` | GM source pointer | -| %ub_dst | `!pto.ptr` | UB destination pointer | -| %sid | `i64` | DMA stream identifier | -| %n_burst | `i64` | Number of burst rows to transfer | -| %len_burst | `i64` | Contiguous byte count transferred per row | -| %left_padding | `i64` | Left padding byte count applied in the destination row | -| %right_padding | `i64` | Right padding byte count applied in the destination 
row | -| %data_select_bit | `i1` | Controls whether padding bytes are materialized according to the configured pad behavior | -| %l2_cache_ctl | `i64` | Target-specific L2 cache allocation hint | -| %src_stride | `i64` | GM row-to-row start offset in bytes | -| %dst_stride | `i64` | UB row-to-row start offset in bytes | +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%gm_src` | ptr | GM source pointer (`!pto.ptr`) | +| `%ub_dst` | ptr | UB destination pointer (`!pto.ptr`, 32B-aligned) | +| `%l2_cache_ctl` | 2 bits | L2 cache allocate control | +| `%len_burst` | 16 bits | Contiguous bytes transferred per burst row | +| `nburst(%n_burst, %src_stride, %dst_stride)` | 16 bits / 40 bits / 21 bits | Required innermost burst group: count, GM source stride, UB destination stride | +| `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` | 21 bits / 40 bits / 21 bits | Optional outer repetition group: count, GM source stride, UB destination stride | +| `pad(%pad_value[, %left_padding_count, %right_padding_count])` | scalar / 8 bits / 8 bits | Optional padding: fill value, optional left padding count, optional right padding count | ## Expected Outputs @@ -48,19 +45,23 @@ pto.copy_gm_to_ubuf %gm_src, %ub_dst, %sid, %n_burst, %len_burst, %left_padding, ## Side Effects -Reads GM-visible storage, writes UB-visible storage, and consumes the active GM→UB loop and stride configuration. +Reads GM-visible storage and writes UB-visible storage. The MTE2 pipe is engaged for the duration of the transfer; downstream consumers must synchronize through `pto.set_flag` / `pto.wait_flag` (`PIPE_MTE2` → `PIPE_V`). ## Constraints !!! warning "Constraints" - - `%ub_dst` MUST satisfy the UB alignment requirements of the selected target profile. - - `%len_burst` MUST fit within the configured row stride and DMA limits of the selected target profile. - - If padding is enabled, the padded destination footprint MUST still fit in the destination UB region. 
+ - `nburst(...)` is always required. + - Each `loop(...)` group must be provided as a complete triple when present. + - `nburst(...)` is the innermost group. + - `loop(...)` groups are ordered from inner to outer; the first `loop(...)` group wraps `nburst(...)`, and each additional `loop(...)` group wraps all earlier groups. + - `pad(...)` may contain only `%pad_value`; omitted left and right padding counts default to 0. If either left or right count is provided, both must be provided. + - `pad(...)` is independent of the optional `loop(...)` groups. A DMA load may use `nburst(...) pad(...)` without any `loop(...)` group. + - `%ub_dst` MUST be 32-byte aligned. When `pad(...)` is present, each UB row is padded from `%len_burst` up to the 32B-aligned boundary of the UB destination stride, ensuring every row starts at a 32B-aligned offset. ## Exceptions !!! danger "Exceptions" - - The verifier rejects illegal operand shapes, unsupported pipe or event identifiers, and attribute combinations that are not valid for the selected instruction set or target profile. + - The verifier rejects illegal operand shapes, malformed clause groups, and attribute combinations not valid for the selected target profile. - Any additional illegality stated in the constraints section is also part of the contract. ## Target-Profile Restrictions @@ -72,12 +73,29 @@ Reads GM-visible storage, writes UB-visible storage, and consumes the active GM ## Examples ```mlir -pto.copy_gm_to_ubuf %gm_src, %ub_dst, %sid, %n_burst, %len_burst, %left_padding, %right_padding, %data_select_bit, %l2_cache_ctl, %src_stride, %dst_stride : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i1, i64, i64, i64 +// Single-level transfer with padding: rows of `%len_burst` bytes, +// padded to 32B-aligned UB rows using %pad as fill. 
+pto.mte_gm_ub %gm_in, %ub_out, %cache, %len_burst + nburst(%rows, %gm_row_stride, %ub_row_stride) + pad(%pad) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, + pad f16 +``` + +```mlir +// Two-level transfer: rows × tiles, with UB row padding. +pto.mte_gm_ub %gm_in, %ub_out, %cache, %len_burst + nburst(%rows, %gm_row_stride, %ub_row_stride) + loop(%tiles, %gm_tile_stride, %ub_tile_stride) + pad(%pad) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, + loop i64, i64, i64, pad f16 ``` ## Related Ops / Instruction Set Links - Instruction set overview: [DMA Copy](../../dma-copy.md) -- Previous op in instruction set: [pto.set_loop1_stride_ubtoout](./set-loop1-stride-ubtoout.md) -- Next op in instruction set: [pto.copy_ubuf_to_gm](./copy-ubuf-to-gm.md) +- Reverse direction: [pto.mte_ub_gm](./copy-ubuf-to-gm.md) +- Intra-UB copy: [pto.mte_ub_ub](./copy-ubuf-to-ubuf.md) +- Pipeline sync: [pto.set_flag](../pipeline-sync/set-flag.md), [pto.wait_flag](../pipeline-sync/wait-flag.md) - Control-shell overview: [Control and configuration](../../control-and-configuration.md) diff --git a/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md b/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md index 15bdfd079..0e36aa2d0 100644 --- a/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md +++ b/docs/isa/scalar/ops/dma-copy/copy-gm-to-ubuf_zh.md @@ -1,22 +1,100 @@ -# pto.copy_gm_to_ubuf +# pto.mte_gm_ub -把数据从 Global Memory 搬运到 Unified Buffer。 +`pto.mte_gm_ub` 属于 [DMA Copy](../../dma-copy_zh.md) 指令集。 + +!!! 
note "PTO ISA v0.6 表面" + v0.6 PTO 微指令表面已经把旧的 `pto.copy_gm_to_ubuf` 与独立的 `set_loop_size_*` / `set_loop_stride_*` 配置指令合并成一个分组指令 `pto.mte_gm_ub`,并通过内联的 `nburst(...)`、可选的 `loop(...)` 和可选的 `pad(...)` 子句承载所有重复结构。之前由独立循环/步长寄存器记录的信息,现在全部表达在传输指令本身上。 + +## 摘要 + +执行一次分组 GM→UB DMA 传输。`nburst(...)` 描述最内层的重复 burst,`loop(...)` 可选地追加外层重复,`pad(...)` 可选地控制 UB 行填充。 + +## 机制 + +MTE2 引擎从 `%gm_src` 读取 `%n_burst` 行,写入 `%ub_dst`。每行搬运 `%len_burst` 字节的连续数据,源/目标步长操作数给出相邻行起点的字节距离。可选的外层 `loop(...)` 子句把 `nburst(...)` 包裹起来表达多层重复,而无需外部的循环配置状态。当存在 `pad(...)` 时,UB 行会被填充到下一个 32 字节对齐边界,填充值来自 `pad_value`。 ## 语法 ```mlir -pto.copy_gm_to_ubuf %gm_src, %ub_dst, - %sid, %n_burst, %len_burst, %left_padding, %right_padding, - %data_select_bit, %l2_cache_ctl, %src_stride, %dst_stride - : !pto.ptr, !pto.ptr, i64, i64, i64, - i64, i64, i1, i64, i64, i64 +pto.mte_gm_ub %gm_src, %ub_dst, %l2_cache_ctl, %len_burst + nburst(%n_burst, %src_stride, %dst_stride) + [loop(%loop_count, %loop_src_stride, %loop_dst_stride)]* + [pad(%pad_value[, %left_padding_count, %right_padding_count])] + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, + [loop i64, i64, i64,]* + [pad T[, i64, i64]] ``` -## 关键约束 +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%gm_src` | ptr | GM 源指针(`!pto.ptr`) | +| `%ub_dst` | ptr | UB 目标指针(`!pto.ptr`,32B 对齐) | +| `%l2_cache_ctl` | 2 bits | L2 cache 分配控制 | +| `%len_burst` | 16 bits | 每行连续搬运的字节数 | +| `nburst(%n_burst, %src_stride, %dst_stride)` | 16 / 40 / 21 bits | 必备最内层 burst 组:数量、GM 源步长、UB 目标步长 | +| `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` | 21 / 40 / 21 bits | 可选外层重复组:数量、GM 源步长、UB 目标步长 | +| `pad(%pad_value[, %left_padding_count, %right_padding_count])` | 标量 / 8 / 8 bits | 可选填充:填充值,可选左右填充计数 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 此形式不返回 SSA 值;它把数据写入 Unified Buffer。 | + +## 副作用 + +读取 GM 可见存储并写入 UB 可见存储。MTE2 流水线在传输期间被占用;下游消费者需要通过 `pto.set_flag` / `pto.wait_flag`(`PIPE_MTE2` → `PIPE_V`)做显式同步。 + +## 约束 -- 当前 target profile 可能对该形式施加额外限制。 +!!! 
warning "约束" + - `nburst(...)` 必须存在。 + - 每个 `loop(...)` 子句出现时必须给出完整三元组。 + - `nburst(...)` 是最内层。 + - `loop(...)` 子句按从内到外排序;第一个 `loop(...)` 包裹 `nburst(...)`,每多一个 `loop(...)` 又把前面所有层包裹起来。 + - `pad(...)` 可以只包含 `%pad_value`;省略时左右填充计数默认为 0。若提供左右填充任一者,必须两者都提供。 + - `pad(...)` 与可选的 `loop(...)` 互相独立。DMA load 可以只带 `nburst(...) pad(...)` 而没有任何 `loop(...)`。 + - `%ub_dst` 必须 32 字节对齐。当存在 `pad(...)` 时,每个 UB 行会被从 `%len_burst` 填充到 UB 目标步长的 32B 对齐边界,确保每行起点 32B 对齐。 +## 异常 + +!!! danger "异常" + - verifier 会拒绝非法的 operand 形状、错误的子句组以及目标 profile 不支持的属性组合。 + - 约束中列出的其它非法情形同样属于契约。 + +## 目标 Profile 限制 + +??? info "目标 Profile 限制" + - CPU 仿真保留可见拷贝契约,但可能不暴露所有 DMA 重叠风险。 + - A2/A3 和 A5 可能收窄元素大小、行宽或 cache 控制语义。 + +## 示例 + +```mlir +// 单层传输,带 padding:每行 %len_burst 字节,UB 行被填充到 32B 对齐。 +pto.mte_gm_ub %gm_in, %ub_out, %cache, %len_burst + nburst(%rows, %gm_row_stride, %ub_row_stride) + pad(%pad) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, + pad f16 +``` + +```mlir +// 两层传输:rows × tiles,带 UB 行 padding。 +pto.mte_gm_ub %gm_in, %ub_out, %cache, %len_burst + nburst(%rows, %gm_row_stride, %ub_row_stride) + loop(%tiles, %gm_tile_stride, %ub_tile_stride) + pad(%pad) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, + loop i64, i64, i64, pad f16 +``` ## 相关页面 -- [DMA 拷贝](../../dma-copy_zh.md) +- 指令集总览:[DMA Copy](../../dma-copy_zh.md) +- 反向传输:[pto.mte_ub_gm](./copy-ubuf-to-gm_zh.md) +- UB 内拷贝:[pto.mte_ub_ub](./copy-ubuf-to-ubuf_zh.md) +- 流水线同步:[pto.set_flag](../pipeline-sync/set-flag_zh.md)、[pto.wait_flag](../pipeline-sync/wait-flag_zh.md) +- 控制壳总览:[Control and configuration](../../control-and-configuration_zh.md) diff --git a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md index 9d1d4e01b..ee4a92407 100644 --- a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md +++ b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm.md @@ -1,41 +1,37 @@ -# pto.copy_ubuf_to_gm +# pto.mte_ub_gm -`pto.copy_ubuf_to_gm` is part of the [DMA Copy](../../dma-copy.md) instruction set. 
+`pto.mte_ub_gm` is part of the [DMA Copy](../../dma-copy.md) instruction set. + +!!! note "PTO ISA v0.6 surface" + The v0.6 PTO micro-instruction surface replaces the earlier `pto.copy_ubuf_to_gm` plus standalone `set_loop_size_*` / `set_loop_stride_*` configuration ops with a single grouped instruction: `pto.mte_ub_gm` with inline `nburst(...)` and optional `loop(...)` clauses. ## Summary -Execute a DMA transfer from Unified Buffer into Global Memory using the current UB→GM loop and stride configuration. +Execute a grouped UB→GM DMA transfer. `nburst(...)` defines the innermost repeated burst transfer, and optional `loop(...)` groups add outer repetition levels. ## Mechanism -The DMA engine reads `%n_burst` rows from `%ub_src` and writes them to `%gm_dst`. `%len_burst` controls the contiguous byte count copied per row, so padded bytes in the UB row are not written back unless they are part of the burst length. `%src_stride` and `%dst_stride` specify the row-to-row start offsets for this copy invocation. +The MTE3 engine reads `%n_burst` source rows from `%ub_src` and writes them to `%gm_dst`. Each row transfers `%len_burst` contiguous bytes, and the source / destination stride operands give the start-to-start byte distance from one row to the next. Optional outer `loop(...)` groups wrap `nburst(...)` to express multi-level repetition without external loop-config state. Padding bytes added during a previous GM→UB load are stripped: MTE3 reads `%len_burst` bytes from each 32B-aligned UB row and writes only valid data to GM. 
## Syntax -### PTO Assembly Form - -```text -copy_ubuf_to_gm %ub_src, %gm_dst, %sid, %n_burst, %len_burst, %reserved, %dst_stride, %src_stride -``` - -### AS Level 1 (SSA) - ```mlir -pto.copy_ubuf_to_gm %ub_src, %gm_dst, %sid, %n_burst, %len_burst, %reserved, %dst_stride, %src_stride : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i64 +pto.mte_ub_gm %ub_src, %gm_dst, %len_burst + nburst(%n_burst, %src_stride, %dst_stride) + [loop(%loop_count, %loop_src_stride, %loop_dst_stride)]* + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, + [loop i64, i64, i64,]* ``` ## Inputs -| Operand | Type | Description | -| --- | --- | --- | -| %ub_src | `!pto.ptr` | UB source pointer | -| %gm_dst | `!pto.ptr` | GM destination pointer | -| %sid | `i64` | DMA stream identifier | -| %n_burst | `i64` | Number of burst rows to transfer | -| %len_burst | `i64` | Contiguous byte count transferred per row | -| %reserved | `i64` | Reserved field; portable code should pass zero unless a target profile documents another meaning | -| %dst_stride | `i64` | GM row-to-row start offset in bytes | -| %src_stride | `i64` | UB row-to-row start offset in bytes | +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%ub_src` | ptr | UB source pointer (`!pto.ptr`, 32B-aligned) | +| `%gm_dst` | ptr | GM destination pointer (`!pto.ptr`) | +| `%len_burst` | 16 bits | Contiguous bytes transferred per burst row | +| `nburst(%n_burst, %src_stride, %dst_stride)` | 16 bits / 21 bits / 40 bits | Required innermost burst group: count, UB source stride, GM destination stride | +| `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` | 21 bits / 21 bits / 40 bits | Optional outer repetition group: count, UB source stride, GM destination stride | ## Expected Outputs @@ -45,19 +41,21 @@ pto.copy_ubuf_to_gm %ub_src, %gm_dst, %sid, %n_burst, %len_burst, %reserved, %ds ## Side Effects -Reads UB-visible storage, writes GM-visible storage, and consumes the active UB→GM loop and stride configuration. 
+Reads UB-visible storage and writes GM-visible storage. The MTE3 pipe is engaged for the duration of the transfer; downstream consumers (and ordering against further GM writes) must synchronize through `pto.set_flag` / `pto.wait_flag` (`PIPE_V` → `PIPE_MTE3`, and `pto.mem_bar` between back-to-back stores to the same GM address). ## Constraints !!! warning "Constraints" - - `%ub_src` MUST satisfy the UB alignment requirements of the selected target profile. - - `%len_burst` MUST fit within the configured row stride and DMA limits of the selected target profile. - - Only the requested burst bytes are copied from each UB row; padded tail bytes remain local to UB. + - `nburst(...)` is always required. + - Each `loop(...)` group must be provided as a complete triple when present. + - `nburst(...)` is the innermost group. + - `loop(...)` groups are ordered from inner to outer; the first `loop(...)` group wraps `nburst(...)`, and each additional `loop(...)` group wraps all earlier groups. + - `%ub_src` MUST be 32-byte aligned. ## Exceptions !!! danger "Exceptions" - - The verifier rejects illegal operand shapes, unsupported pipe or event identifiers, and attribute combinations that are not valid for the selected instruction set or target profile. + - The verifier rejects illegal operand shapes, malformed clause groups, and attribute combinations not valid for the selected target profile. - Any additional illegality stated in the constraints section is also part of the contract. ## Target-Profile Restrictions @@ -69,12 +67,28 @@ Reads UB-visible storage, writes GM-visible storage, and consumes the active UB ## Examples ```mlir -pto.copy_ubuf_to_gm %ub_src, %gm_dst, %sid, %n_burst, %len_burst, %reserved, %dst_stride, %src_stride : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i64 +// Two-level outbound transfer: rows x tiles. 
+pto.mte_ub_gm %ub_in, %gm_out, %len_burst + nburst(%rows, %ub_row_stride, %gm_row_stride) + loop(%tiles, %ub_tile_stride, %gm_tile_stride) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, + loop i64, i64, i64 +``` + +```mlir +// Three-level outbound transfer: rows x tiles x batches. +pto.mte_ub_gm %ub_in, %gm_out, %len_burst + nburst(%rows, %ub_row_stride, %gm_row_stride) + loop(%tiles, %ub_tile_stride, %gm_tile_stride) + loop(%batches, %ub_batch_stride, %gm_batch_stride) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, + loop i64, i64, i64, loop i64, i64, i64 ``` ## Related Ops / Instruction Set Links - Instruction set overview: [DMA Copy](../../dma-copy.md) -- Previous op in instruction set: [pto.copy_gm_to_ubuf](./copy-gm-to-ubuf.md) -- Next op in instruction set: [pto.copy_ubuf_to_ubuf](./copy-ubuf-to-ubuf.md) +- Reverse direction: [pto.mte_gm_ub](./copy-gm-to-ubuf.md) +- Intra-UB copy: [pto.mte_ub_ub](./copy-ubuf-to-ubuf.md) +- Pipeline sync: [pto.set_flag](../pipeline-sync/set-flag.md), [pto.wait_flag](../pipeline-sync/wait-flag.md), [pto.mem_bar](../pipeline-sync/mem-bar.md) - Control-shell overview: [Control and configuration](../../control-and-configuration.md) diff --git a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm_zh.md b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm_zh.md index f577cacee..2a33339af 100644 --- a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm_zh.md +++ b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-gm_zh.md @@ -1,20 +1,94 @@ -# pto.copy_ubuf_to_gm +# pto.mte_ub_gm -把数据从 Unified Buffer 回写到 Global Memory。 +`pto.mte_ub_gm` 属于 [DMA Copy](../../dma-copy_zh.md) 指令集。 + +!!! 
note "PTO ISA v0.6 表面" + v0.6 PTO 微指令表面已把旧的 `pto.copy_ubuf_to_gm` 与独立的 `set_loop_size_*` / `set_loop_stride_*` 配置指令合并成一个分组指令 `pto.mte_ub_gm`,并通过内联的 `nburst(...)` 与可选的 `loop(...)` 子句承载所有重复结构。 + +## 摘要 + +执行一次分组 UB→GM DMA 传输。`nburst(...)` 描述最内层的重复 burst,`loop(...)` 可选地追加外层重复。 + +## 机制 + +MTE3 引擎从 `%ub_src` 读取 `%n_burst` 行,写入 `%gm_dst`。每行搬运 `%len_burst` 字节的连续数据,源/目标步长操作数给出相邻行起点的字节距离。可选的外层 `loop(...)` 子句把 `nburst(...)` 包裹起来表达多层重复。在 GM→UB 加载时增加的填充字节会被剥除:MTE3 仅从每个 32B 对齐的 UB 行读取 `%len_burst` 个字节,写入 GM 的只是有效数据。 ## 语法 ```mlir -pto.copy_ubuf_to_gm %ub_src, %gm_dst, - %sid, %n_burst, %len_burst, %reserved, %dst_stride, %src_stride - : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64, i64 +pto.mte_ub_gm %ub_src, %gm_dst, %len_burst + nburst(%n_burst, %src_stride, %dst_stride) + [loop(%loop_count, %loop_src_stride, %loop_dst_stride)]* + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, + [loop i64, i64, i64,]* ``` -## 关键约束 +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%ub_src` | ptr | UB 源指针(`!pto.ptr`,32B 对齐) | +| `%gm_dst` | ptr | GM 目标指针(`!pto.ptr`) | +| `%len_burst` | 16 bits | 每行连续搬运的字节数 | +| `nburst(%n_burst, %src_stride, %dst_stride)` | 16 / 21 / 40 bits | 必备最内层 burst 组:数量、UB 源步长、GM 目标步长 | +| `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` | 21 / 21 / 40 bits | 可选外层重复组:数量、UB 源步长、GM 目标步长 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 此形式不返回 SSA 值;它把数据写入 Global Memory。 | + +## 副作用 + +读取 UB 可见存储并写入 GM 可见存储。MTE3 流水线在传输期间被占用;下游消费者(以及对同一 GM 地址的后续写入排序)需要通过 `pto.set_flag` / `pto.wait_flag`(`PIPE_V` → `PIPE_MTE3`),以及在背靠背写同一 GM 地址时通过 `pto.mem_bar` 做显式同步。 + +## 约束 -- 当前 target profile 可能对该形式施加额外限制。 +!!! warning "约束" + - `nburst(...)` 必须存在。 + - 每个 `loop(...)` 子句出现时必须给出完整三元组。 + - `nburst(...)` 是最内层。 + - `loop(...)` 子句按从内到外排序;第一个 `loop(...)` 包裹 `nburst(...)`,每多一个 `loop(...)` 又把前面所有层包裹起来。 + - `%ub_src` 必须 32 字节对齐。 +## 异常 + +!!! danger "异常" + - verifier 会拒绝非法的 operand 形状、错误的子句组以及目标 profile 不支持的属性组合。 + - 约束中列出的其它非法情形同样属于契约。 + +## 目标 Profile 限制 + +??? 
info "目标 Profile 限制" + - CPU 仿真保留可见拷贝契约,但可能不暴露所有 DMA 重叠风险。 + - A2/A3 和 A5 可能收窄元素大小、行宽或 cache 控制语义。 + +## 示例 + +```mlir +// 两层 outbound 传输:rows × tiles。 +pto.mte_ub_gm %ub_in, %gm_out, %len_burst + nburst(%rows, %ub_row_stride, %gm_row_stride) + loop(%tiles, %ub_tile_stride, %gm_tile_stride) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, + loop i64, i64, i64 +``` + +```mlir +// 三层 outbound 传输:rows × tiles × batches。 +pto.mte_ub_gm %ub_in, %gm_out, %len_burst + nburst(%rows, %ub_row_stride, %gm_row_stride) + loop(%tiles, %ub_tile_stride, %gm_tile_stride) + loop(%batches, %ub_batch_stride, %gm_batch_stride) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64, + loop i64, i64, i64, loop i64, i64, i64 +``` ## 相关页面 -- [DMA 拷贝](../../dma-copy_zh.md) +- 指令集总览:[DMA Copy](../../dma-copy_zh.md) +- 反向传输:[pto.mte_gm_ub](./copy-gm-to-ubuf_zh.md) +- UB 内拷贝:[pto.mte_ub_ub](./copy-ubuf-to-ubuf_zh.md) +- 流水线同步:[pto.set_flag](../pipeline-sync/set-flag_zh.md)、[pto.wait_flag](../pipeline-sync/wait-flag_zh.md)、[pto.mem_bar](../pipeline-sync/mem-bar_zh.md) +- 控制壳总览:[Control and configuration](../../control-and-configuration_zh.md) diff --git a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md index 861bbe330..2c9b5a937 100644 --- a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md +++ b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf.md @@ -1,79 +1,74 @@ -# pto.copy_ubuf_to_ubuf +# pto.mte_ub_ub -`pto.copy_ubuf_to_ubuf` is part of the [DMA Copy](../../dma-copy.md) instruction set. +`pto.mte_ub_ub` is part of the [DMA Copy](../../dma-copy.md) instruction set. + +!!! note "PTO ISA v0.6 surface" + The v0.6 PTO micro-instruction surface replaces the earlier `pto.copy_ubuf_to_ubuf` with the grouped `pto.mte_ub_ub` instruction, expressed with an inline `nburst(...)` clause. Burst length, source gap, and destination gap are all encoded in units of 32 bytes. ## Summary -Execute a DMA transfer between two Unified Buffer regions. 
+Execute a grouped intra-UB copy. `nburst(...)` defines the repeated burst transfer between two UB regions. ## Mechanism -The DMA engine reads `%n_burst` rows from `%source` and writes them to `%dest`. `%len_burst` controls the contiguous byte count copied per row, while `%src_stride` and `%dst_stride` specify the row-to-row start offsets for the copy. This form is useful when the producer and consumer both operate in UB space but a DMA-style row copy is still preferred over vector payload instructions. +The MTE engine reads `%n_burst` blocks of `%len_burst * 32` bytes from `%ub_src` and writes them to `%ub_dst`. Between bursts, the source advances by `(len_burst + src_gap) * 32` bytes and the destination advances by `(len_burst + dst_gap) * 32` bytes. `src_gap` and `dst_gap` are the inter-burst gap fields (in 32-byte units) that follow each copied block. ## Syntax -### PTO Assembly Form - -```text -copy_ubuf_to_ubuf %source, %dest, %sid, %n_burst, %len_burst, %src_stride, %dst_stride -``` - -### AS Level 1 (SSA) - ```mlir -pto.copy_ubuf_to_ubuf %source, %dest, %sid, %n_burst, %len_burst, %src_stride, %dst_stride : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64 +pto.mte_ub_ub %ub_src, %ub_dst, %len_burst + nburst(%n_burst, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 ``` ## Inputs -| Operand | Type | Description | -| --- | --- | --- | -| %source | `!pto.ptr` | UB source pointer | -| %dest | `!pto.ptr` | UB destination pointer | -| %sid | `i64` | DMA stream identifier | -| %n_burst | `i64` | Number of burst rows to transfer | -| %len_burst | `i64` | Contiguous byte count transferred per row | -| %src_stride | `i64` | UB source row-to-row start offset in bytes | -| %dst_stride | `i64` | UB destination row-to-row start offset in bytes | +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%ub_src` | ptr | UB source pointer (`!pto.ptr`, 32B-aligned) | +| `%ub_dst` | ptr | UB destination pointer (`!pto.ptr`, 32B-aligned) | +| 
`%len_burst` | 16 bits | Burst length in units of 32 bytes | +| `nburst(%n_burst, %src_gap, %dst_gap)` | 16 bits / 16 bits / 16 bits | Required copy burst group: count, source gap, destination gap | ## Expected Outputs | Result | Type | Description | | --- | --- | --- | -| None | `—` | This form does not return SSA values; it writes data into Unified Buffer memory. | +| None | `—` | This form does not return SSA values; it writes data into the Unified Buffer destination region. | ## Side Effects -Reads UB-visible storage, writes UB-visible storage, and consumes the active UB DMA state for the selected target profile. +Reads UB-visible storage and writes UB-visible storage. The MTE pipe is engaged for the duration of the transfer; downstream consumers must synchronize through the appropriate pipeline-sync primitives. ## Constraints !!! warning "Constraints" - - Source and destination regions MUST both satisfy the UB alignment rules of the selected target profile. - - `%len_burst` MUST fit within both the source and destination row stride. - - If source and destination regions alias, portable code MUST provide ordering that avoids undefined behavior. On A2/A3 and A5: if source and destination regions alias, the copy may proceed in either forward or backward direction and the exact order is not guaranteed; the programmer must ensure the copy is sequenced to avoid data hazard. On CPU simulator: aliasing is resolved by copying to a temporary buffer first. + - UB source and destination addresses must be 32-byte aligned. + - `%len_burst`, `%src_gap`, and `%dst_gap` are encoded in units of 32 bytes. ## Exceptions !!! danger "Exceptions" - - The verifier rejects illegal operand shapes, unsupported pipe or event identifiers, and attribute combinations that are not valid for the selected instruction set or target profile. + - The verifier rejects illegal operand shapes and clause groups not valid for the selected target profile. 
- Any additional illegality stated in the constraints section is also part of the contract. ## Target-Profile Restrictions ??? info "Target-Profile Restrictions" - CPU simulation preserves the visible copy contract but may not expose all DMA overlap hazards. - - A2/A3 and A5 may narrow supported element sizes, row widths, or overlap behavior. + - A2/A3 and A5 may narrow supported element sizes, burst lengths, or gap encodings. ## Examples ```mlir -pto.copy_ubuf_to_ubuf %source, %dest, %sid, %n_burst, %len_burst, %src_stride, %dst_stride : !pto.ptr, !pto.ptr, i64, i64, i64, i64, i64 +pto.mte_ub_ub %ub_src, %ub_dst, %len32b + nburst(%rows, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 ``` ## Related Ops / Instruction Set Links - Instruction set overview: [DMA Copy](../../dma-copy.md) -- Previous op in instruction set: [pto.copy_ubuf_to_gm](./copy-ubuf-to-gm.md) -- Next op in instruction set: (none) +- GM↔UB transfers: [pto.mte_gm_ub](./copy-gm-to-ubuf.md), [pto.mte_ub_gm](./copy-ubuf-to-gm.md) +- UB→L1 transfer (cube path): [pto.mte_ub_l1](./mte-ub-l1.md) - Control-shell overview: [Control and configuration](../../control-and-configuration.md) diff --git a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf_zh.md b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf_zh.md index 71bb265b2..19cf382b6 100644 --- a/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf_zh.md +++ b/docs/isa/scalar/ops/dma-copy/copy-ubuf-to-ubuf_zh.md @@ -1,19 +1,74 @@ -# pto.copy_ubuf_to_ubuf +# pto.mte_ub_ub -在 Unified Buffer 内部做 DMA 拷贝。 +`pto.mte_ub_ub` 属于 [DMA Copy](../../dma-copy_zh.md) 指令集。 + +!!! 
note "PTO ISA v0.6 表面" + v0.6 PTO 微指令表面用分组指令 `pto.mte_ub_ub` 取代了旧的 `pto.copy_ubuf_to_ubuf`,并通过内联的 `nburst(...)` 子句承载 burst 结构。`%len_burst`、`%src_gap`、`%dst_gap` 都以 32 字节为单位编码。 + +## 摘要 + +执行一次分组 UB→UB 拷贝。`nburst(...)` 描述两块 UB 区域之间的重复 burst 传输。 + +## 机制 + +MTE 引擎从 `%ub_src` 读取 `%n_burst` 个块(每块 `%len_burst * 32` 字节),写入 `%ub_dst`。两次 burst 之间,源前进 `(len_burst + src_gap) * 32` 字节,目标前进 `(len_burst + dst_gap) * 32` 字节。`src_gap` 和 `dst_gap` 是跨 burst 的间隔(以 32 字节为单位),用于推进到下一块的起点。 ## 语法 ```mlir -pto.copy_ubuf_to_ubuf %source, %dest, %sid, %n_burst, %len_burst, %src_stride, %dst_stride - : !pto.ptr, !pto.ptr, i64 x5 +pto.mte_ub_ub %ub_src, %ub_dst, %len_burst + nburst(%n_burst, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 ``` -## 关键约束 +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%ub_src` | ptr | UB 源指针(`!pto.ptr`,32B 对齐) | +| `%ub_dst` | ptr | UB 目标指针(`!pto.ptr`,32B 对齐) | +| `%len_burst` | 16 bits | Burst 长度(以 32 字节为单位) | +| `nburst(%n_burst, %src_gap, %dst_gap)` | 16 / 16 / 16 bits | 必备 burst 组:数量、源间隔、目标间隔 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 此形式不返回 SSA 值;它把数据写入 UB 目标区域。 | + +## 副作用 -- 当前 target profile 可能对该形式施加额外限制。 +读取 UB 可见存储并写入 UB 可见存储。MTE 流水线在传输期间被占用;下游消费者需要通过相应的流水线同步原语做显式同步。 +## 约束 + +!!! warning "约束" + - UB 源与目标地址都必须 32 字节对齐。 + - `%len_burst`、`%src_gap`、`%dst_gap` 都以 32 字节为单位编码。 + +## 异常 + +!!! danger "异常" + - verifier 会拒绝非法的 operand 形状以及目标 profile 不支持的子句组合。 + - 约束中列出的其它非法情形同样属于契约。 + +## 目标 Profile 限制 + +??? 
info "目标 Profile 限制" + - CPU 仿真保留可见拷贝契约,但可能不暴露所有 DMA 重叠风险。 + - A2/A3 和 A5 可能收窄元素大小、burst 长度或间隔编码。 + +## 示例 + +```mlir +pto.mte_ub_ub %ub_src, %ub_dst, %len32b + nburst(%rows, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` ## 相关页面 -- [DMA 拷贝](../../dma-copy_zh.md) +- 指令集总览:[DMA Copy](../../dma-copy_zh.md) +- GM↔UB 传输:[pto.mte_gm_ub](./copy-gm-to-ubuf_zh.md)、[pto.mte_ub_gm](./copy-ubuf-to-gm_zh.md) +- UB→L1(cube 路径):[pto.mte_ub_l1](./mte-ub-l1_zh.md) +- 控制壳总览:[Control and configuration](../../control-and-configuration_zh.md) diff --git a/docs/isa/scalar/ops/dma-copy/mte-ub-l1.md b/docs/isa/scalar/ops/dma-copy/mte-ub-l1.md new file mode 100644 index 000000000..01cfc95b7 --- /dev/null +++ b/docs/isa/scalar/ops/dma-copy/mte-ub-l1.md @@ -0,0 +1,73 @@ +# pto.mte_ub_l1 + +`pto.mte_ub_l1` is part of the [DMA Copy](../../dma-copy.md) instruction set. + +## Summary + +Execute a grouped UB→L1 (CBUF) copy. `nburst(...)` defines the repeated burst transfer that stages a UB tile into the cube-side L1 (CBUF) buffer. + +## Mechanism + +The MTE engine reads `%n_burst` blocks of `%len_burst * 32` bytes from `%ub_src` and writes them to `%l1_dst`. Between bursts, the source advances by `(len_burst + src_gap) * 32` bytes and the destination advances by `(len_burst + dst_gap) * 32` bytes. `src_gap` and `dst_gap` are the inter-burst gap fields (in 32-byte units) that follow each copied block. + +`pto.mte_ub_l1` is the architecturally-supported fallback for moving a Vector-produced tile back into the cube's L1 staging buffer. The hardware applies the ND→NZ layout conversion required by L1's fractal format; see [NZ Fractal Layout](../../../cube/nz-fractal-layout.md) (when authored) for details. 
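As a rough illustration of the burst addressing described in the Mechanism text above, the Python sketch below lists the byte offsets each burst would touch. It models the stated arithmetic only and is not the PTO API: the helper name `burst_offsets` and its return shape are hypothetical, and the ND→NZ layout conversion and any hardware field limits are not modeled.

```python
def burst_offsets(n_burst, len_burst, src_gap, dst_gap, unit=32):
    """Return one (src_offset, dst_offset, nbytes) tuple per burst, in bytes.

    Hypothetical helper: it only mirrors the address arithmetic stated in
    the Mechanism text, where every field is counted in 32-byte units.
    """
    src_step = (len_burst + src_gap) * unit  # source advance between bursts
    dst_step = (len_burst + dst_gap) * unit  # destination advance between bursts
    nbytes = len_burst * unit                # payload copied by each burst
    return [(i * src_step, i * dst_step, nbytes) for i in range(n_burst)]


# Example: 3 bursts of 2 units each, a 1-unit source gap, packed destination.
for src, dst, size in burst_offsets(n_burst=3, len_burst=2, src_gap=1, dst_gap=0):
    print(f"src+{src} -> dst+{dst} ({size} bytes)")
```

With a zero destination gap the bursts land back to back, which is the usual way to compact a strided source region into a dense destination.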
+ +## Syntax + +```mlir +pto.mte_ub_l1 %ub_src, %l1_dst, %len_burst + nburst(%n_burst, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Inputs + +| Parameter | Width | Description | +|-----------|-------|-------------| +| `%ub_src` | ptr | UB source pointer (`!pto.ptr`, 32B-aligned) | +| `%l1_dst` | ptr | L1 destination pointer (`!pto.ptr`, 32B-aligned) | +| `%len_burst` | 16 bits | Burst length in units of 32 bytes | +| `nburst(%n_burst, %src_gap, %dst_gap)` | 16 bits / 16 bits / 16 bits | Required copy burst group: count, source gap, destination gap | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | This form does not return SSA values; it writes data into the L1 destination region. | + +## Side Effects + +Reads UB-visible storage and writes L1-visible storage. The MTE pipe is engaged for the duration of the transfer. Downstream cube consumers (e.g., `pto.mte_l1_l0a` / `pto.mte_l1_l0b`) must wait on the appropriate pipeline-sync events before reading from L1. + +## Constraints + +!!! warning "Constraints" + - UB source and L1 destination addresses must be 32-byte aligned. + - `%len_burst`, `%src_gap`, and `%dst_gap` are encoded in units of 32 bytes. + +## Exceptions + +!!! danger "Exceptions" + - The verifier rejects illegal operand shapes and clause groups not valid for the selected target profile. + - Any additional illegality stated in the constraints section is also part of the contract. + +## Target-Profile Restrictions + +??? info "Target-Profile Restrictions" + - The UB↔L1 dedicated data path is an architectural feature of cube-equipped profiles (A2/A3/A5). CPU simulation models the copy contract but does not surface the underlying NZ fractal staging. + - For 1:2 Cube/Vector cooperation, both AIV0 and AIV1 each issue their own `pto.mte_ub_l1` against the same L1 base offset; the cube assembles the two sub-tiles into a single contiguous NZ Mat tile. 
+ +## Examples + +```mlir +pto.mte_ub_l1 %ub_src, %l1_dst, %len32b + nburst(%rows, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [DMA Copy](../../dma-copy.md) +- Intra-UB copy: [pto.mte_ub_ub](./copy-ubuf-to-ubuf.md) +- GM↔UB transfers: [pto.mte_gm_ub](./copy-gm-to-ubuf.md), [pto.mte_ub_gm](./copy-ubuf-to-gm.md) +- Control-shell overview: [Control and configuration](../../control-and-configuration.md) diff --git a/docs/isa/scalar/ops/dma-copy/mte-ub-l1_zh.md b/docs/isa/scalar/ops/dma-copy/mte-ub-l1_zh.md new file mode 100644 index 000000000..997e4e07d --- /dev/null +++ b/docs/isa/scalar/ops/dma-copy/mte-ub-l1_zh.md @@ -0,0 +1,73 @@ +# pto.mte_ub_l1 + +`pto.mte_ub_l1` 属于 [DMA Copy](../../dma-copy_zh.md) 指令集。 + +## 摘要 + +执行一次分组 UB→L1(CBUF)拷贝。`nburst(...)` 描述把 UB tile 暂存到 cube 侧 L1(CBUF)缓冲区的重复 burst。 + +## 机制 + +MTE 引擎从 `%ub_src` 读取 `%n_burst` 个块(每块 `%len_burst * 32` 字节),写入 `%l1_dst`。两次 burst 之间,源前进 `(len_burst + src_gap) * 32` 字节,目标前进 `(len_burst + dst_gap) * 32` 字节。`src_gap` 和 `dst_gap` 是跨 burst 的间隔(以 32 字节为单位)。 + +`pto.mte_ub_l1` 是把 Vector 产出 tile 回灌到 cube 侧 L1 暂存区的架构级回退路径。硬件会做 L1 fractal 布局所需的 ND→NZ 转换;详见(后续会补齐的)[NZ Fractal Layout](../../../cube/nz-fractal-layout_zh.md)。 + +## 语法 + +```mlir +pto.mte_ub_l1 %ub_src, %l1_dst, %len_burst + nburst(%n_burst, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 输入 + +| 参数 | 位宽 | 描述 | +|-----------|-------|-------------| +| `%ub_src` | ptr | UB 源指针(`!pto.ptr`,32B 对齐) | +| `%l1_dst` | ptr | L1 目标指针(`!pto.ptr`,32B 对齐) | +| `%len_burst` | 16 bits | Burst 长度(以 32 字节为单位) | +| `nburst(%n_burst, %src_gap, %dst_gap)` | 16 / 16 / 16 bits | 必备 burst 组:数量、源间隔、目标间隔 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 此形式不返回 SSA 值;它把数据写入 L1 目标区域。 | + +## 副作用 + +读取 UB 可见存储并写入 L1 可见存储。MTE 流水线在传输期间被占用。下游 cube 消费者(例如 `pto.mte_l1_l0a` / `pto.mte_l1_l0b`)必须等待相应的流水线同步事件,再去读 L1。 + +## 约束 + +!!! 
warning "约束" + - UB 源和 L1 目标地址都必须 32 字节对齐。 + - `%len_burst`、`%src_gap`、`%dst_gap` 都以 32 字节为单位编码。 + +## 异常 + +!!! danger "异常" + - verifier 会拒绝非法的 operand 形状以及目标 profile 不支持的子句组合。 + - 约束中列出的其它非法情形同样属于契约。 + +## 目标 Profile 限制 + +??? info "目标 Profile 限制" + - UB↔L1 专用通路是带 cube 的目标 profile(A2/A3/A5)的架构特性。CPU 仿真建模了拷贝契约,但不会暴露底层的 NZ fractal 暂存细节。 + - 在 1:2 Cube/Vector 协同模式下,AIV0 和 AIV1 各自对同一个 L1 基址发起自己的 `pto.mte_ub_l1`;cube 把两个 sub-tile 组合成一个连续的 NZ Mat tile。 + +## 示例 + +```mlir +pto.mte_ub_l1 %ub_src, %l1_dst, %len32b + nburst(%rows, %src_gap, %dst_gap) + : !pto.ptr, !pto.ptr, i64, i64, i64, i64 +``` + +## 相关页面 + +- 指令集总览:[DMA Copy](../../dma-copy_zh.md) +- UB 内拷贝:[pto.mte_ub_ub](./copy-ubuf-to-ubuf_zh.md) +- GM↔UB 传输:[pto.mte_gm_ub](./copy-gm-to-ubuf_zh.md)、[pto.mte_ub_gm](./copy-ubuf-to-gm_zh.md) +- 控制壳总览:[Control and configuration](../../control-and-configuration_zh.md) diff --git a/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md b/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md index c6ad61e65..b1a0b7a5f 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub.md @@ -2,6 +2,9 @@ `pto.set_loop_size_outtoub` is part of the [DMA Copy](../../dma-copy.md) instruction set. +!!! warning "Deprecated in PTO ISA v0.6" + The v0.6 PTO micro-instruction surface no longer uses standalone loop / stride configuration registers for GM→UB DMA. The information previously carried in `pto.set_loop_size_outtoub` (loop counts) is now expressed inline on the grouped transfer op via `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` clauses on [`pto.mte_gm_ub`](./copy-gm-to-ubuf.md). New code should use the grouped form. This page is retained for historical reference and pre-v0.6 ports. + ## Summary Configure the inner and outer loop counts that the GM→UB DMA engine will use for subsequent transfers. 
diff --git a/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub_zh.md b/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub_zh.md index fbd1406d4..9b05f3325 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub_zh.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop-size-outtoub_zh.md @@ -1,5 +1,8 @@ # pto.set_loop_size_outtoub +!!! warning "v0.6 已弃用" + v0.6 PTO 微指令表面不再使用独立的循环/步长配置寄存器。`pto.set_loop_size_outtoub` 原本承载的循环计数信息,现在通过 [`pto.mte_gm_ub`](./copy-gm-to-ubuf_zh.md) 上的内联 `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` 子句直接表达。新代码请使用分组形式。本页保留作为历史参考与 pre-v0.6 移植用途。 + 配置 out-to-ub 方向 DMA 的循环大小。 ## 语法 diff --git a/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md b/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md index 6d67fefa1..c4af924a0 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout.md @@ -2,6 +2,9 @@ `pto.set_loop_size_ubtoout` is part of the [DMA Copy](../../dma-copy.md) instruction set. +!!! warning "Deprecated in PTO ISA v0.6" + The v0.6 PTO micro-instruction surface no longer uses standalone loop / stride configuration registers for UB→GM DMA. The information previously carried in `pto.set_loop_size_ubtoout` (loop counts) is now expressed inline on the grouped transfer op via `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` clauses on [`pto.mte_ub_gm`](./copy-ubuf-to-gm.md). New code should use the grouped form. This page is retained for historical reference and pre-v0.6 ports. + ## Summary Configure the inner and outer loop counts that the UB→GM DMA engine will use for subsequent transfers. diff --git a/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout_zh.md b/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout_zh.md index ba05fffd6..a51a9bb03 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout_zh.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop-size-ubtoout_zh.md @@ -1,5 +1,8 @@ # pto.set_loop_size_ubtoout +!!! 
warning "v0.6 已弃用" + v0.6 PTO 微指令表面不再使用独立的循环/步长配置寄存器。`pto.set_loop_size_ubtoout` 原本承载的循环计数信息,现在通过 [`pto.mte_ub_gm`](./copy-ubuf-to-gm_zh.md) 上的内联 `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` 子句直接表达。新代码请使用分组形式。本页保留作为历史参考与 pre-v0.6 移植用途。 + 配置 ub-to-out 方向 DMA 的循环大小。 ## 语法 diff --git a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md index 6b554932e..b8c21fc04 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub.md @@ -2,6 +2,9 @@ `pto.set_loop1_stride_outtoub` is part of the [DMA Copy](../../dma-copy.md) instruction set. +!!! warning "Deprecated in PTO ISA v0.6" + The v0.6 PTO micro-instruction surface no longer uses standalone loop / stride configuration registers for GM→UB DMA. The stride information previously carried by `pto.set_loop1_stride_outtoub` is now expressed inline on the grouped transfer op as part of the inner `nburst(%n_burst, %src_stride, %dst_stride)` clause on [`pto.mte_gm_ub`](./copy-gm-to-ubuf.md). New code should use the grouped form. This page is retained for historical reference and pre-v0.6 ports. + ## Summary Configure the inner-loop pointer advance used by the GM→UB DMA engine. diff --git a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub_zh.md b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub_zh.md index 09154d877..7e8db1d1b 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub_zh.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-outtoub_zh.md @@ -1,5 +1,8 @@ # pto.set_loop1_stride_outtoub +!!! 
warning "v0.6 已弃用" + v0.6 PTO 微指令表面不再使用独立的循环/步长配置寄存器。`pto.set_loop1_stride_outtoub` 原本承载的步长信息,现在通过 [`pto.mte_gm_ub`](./copy-gm-to-ubuf_zh.md) 上最内层 `nburst(%n_burst, %src_stride, %dst_stride)` 子句直接表达。新代码请使用分组形式。本页保留作为历史参考与 pre-v0.6 移植用途。 + 配置 out-to-ub 方向 DMA 的第一层 stride。 ## 语法 diff --git a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md index bf1c4050c..bb5dfa22b 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout.md @@ -2,6 +2,9 @@ `pto.set_loop1_stride_ubtoout` is part of the [DMA Copy](../../dma-copy.md) instruction set. +!!! warning "Deprecated in PTO ISA v0.6" + The v0.6 PTO micro-instruction surface no longer uses standalone loop / stride configuration registers for UB→GM DMA. The stride information previously carried by `pto.set_loop1_stride_ubtoout` is now expressed inline on the grouped transfer op as part of the inner `nburst(%n_burst, %src_stride, %dst_stride)` clause on [`pto.mte_ub_gm`](./copy-ubuf-to-gm.md). New code should use the grouped form. This page is retained for historical reference and pre-v0.6 ports. + ## Summary Configure the inner-loop pointer advance used by the UB→GM DMA engine. diff --git a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout_zh.md b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout_zh.md index 34d5337dc..020ed41e8 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout_zh.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop1-stride-ubtoout_zh.md @@ -1,5 +1,8 @@ # pto.set_loop1_stride_ubtoout +!!! 
warning "v0.6 已弃用" + v0.6 PTO 微指令表面不再使用独立的循环/步长配置寄存器。`pto.set_loop1_stride_ubtoout` 原本承载的步长信息,现在通过 [`pto.mte_ub_gm`](./copy-ubuf-to-gm_zh.md) 上最内层 `nburst(%n_burst, %src_stride, %dst_stride)` 子句直接表达。新代码请使用分组形式。本页保留作为历史参考与 pre-v0.6 移植用途。 + 配置 ub-to-out 方向 DMA 的第一层 stride。 ## 语法 diff --git a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md index f8ce62a90..f2c16f6eb 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub.md @@ -2,6 +2,9 @@ `pto.set_loop2_stride_outtoub` is part of the [DMA Copy](../../dma-copy.md) instruction set. +!!! warning "Deprecated in PTO ISA v0.6" + The v0.6 PTO micro-instruction surface no longer uses standalone loop / stride configuration registers for GM→UB DMA. The outer-loop stride information previously carried by `pto.set_loop2_stride_outtoub` is now expressed inline on the grouped transfer op as part of the outer `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` clause on [`pto.mte_gm_ub`](./copy-gm-to-ubuf.md). New code should use the grouped form. This page is retained for historical reference and pre-v0.6 ports. + ## Summary Configure the outer-loop pointer advance used by the GM→UB DMA engine. diff --git a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub_zh.md b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub_zh.md index 5014b1b3c..d9f504bd7 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub_zh.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-outtoub_zh.md @@ -1,5 +1,8 @@ # pto.set_loop2_stride_outtoub +!!! 
warning "v0.6 已弃用" + v0.6 PTO 微指令表面不再使用独立的循环/步长配置寄存器。`pto.set_loop2_stride_outtoub` 原本承载的外层步长信息,现在通过 [`pto.mte_gm_ub`](./copy-gm-to-ubuf_zh.md) 上的外层 `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` 子句直接表达。新代码请使用分组形式。本页保留作为历史参考与 pre-v0.6 移植用途。 + 配置 out-to-ub 方向 DMA 的第二层 stride。 ## 语法 diff --git a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md index 030d99aff..d6e90548b 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout.md @@ -2,6 +2,9 @@ `pto.set_loop2_stride_ubtoout` is part of the [DMA Copy](../../dma-copy.md) instruction set. +!!! warning "Deprecated in PTO ISA v0.6" + The v0.6 PTO micro-instruction surface no longer uses standalone loop / stride configuration registers for UB→GM DMA. The outer-loop stride information previously carried by `pto.set_loop2_stride_ubtoout` is now expressed inline on the grouped transfer op as part of the outer `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` clause on [`pto.mte_ub_gm`](./copy-ubuf-to-gm.md). New code should use the grouped form. This page is retained for historical reference and pre-v0.6 ports. + ## Summary Configure the outer-loop pointer advance used by the UB→GM DMA engine. diff --git a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout_zh.md b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout_zh.md index 162d8e213..ca107e699 100644 --- a/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout_zh.md +++ b/docs/isa/scalar/ops/dma-copy/set-loop2-stride-ubtoout_zh.md @@ -1,5 +1,8 @@ # pto.set_loop2_stride_ubtoout +!!! 
warning "v0.6 已弃用" + v0.6 PTO 微指令表面不再使用独立的循环/步长配置寄存器。`pto.set_loop2_stride_ubtoout` 原本承载的外层步长信息,现在通过 [`pto.mte_ub_gm`](./copy-ubuf-to-gm_zh.md) 上的外层 `loop(%loop_count, %loop_src_stride, %loop_dst_stride)` 子句直接表达。新代码请使用分组形式。本页保留作为历史参考与 pre-v0.6 移植用途。 + 配置 ub-to-out 方向 DMA 的第二层 stride。 ## 语法 diff --git a/docs/isa/scalar/ops/micro-instruction/README.md b/docs/isa/scalar/ops/micro-instruction/README.md index 784f805c0..6d16fc01b 100644 --- a/docs/isa/scalar/ops/micro-instruction/README.md +++ b/docs/isa/scalar/ops/micro-instruction/README.md @@ -10,9 +10,10 @@ This section documents the PTO micro-instruction surface for the A5 (Ascend 950) | Group | Description | Operations | |-------|-------------|-----------| | [BlockDim and Runtime Query](./block-dim-query.md) | Block and subblock index/number queries | `pto.get_block_idx`, `pto.get_subblock_idx`, `pto.get_block_num`, `pto.get_subblock_num` | +| [VMS4 Status Query](./vms4-status-query.md) | Read 4-way merge-sort status register | `pto.get_vms4_sr` | | [Pointer Operations](./pointer-operations.md) | Typed pointer construction and arithmetic | `pto.castptr`, `pto.addptr`, `pto.load_scalar`, `pto.store_scalar` | | [Vector Execution Scope](./vecscope.md) | Vector function launch and scope boundary | `pto.vecscope`, `pto.strict_vecscope` | -| [Alignment State Type](./align-type.md) | Unaligned load/store alignment management | `pto.init_align`, `pto.vldas`, `pto.vldus`, `pto.vstus` | +| [Alignment State Type](./align-type.md) | Unaligned load/store alignment management. Load streams start from `pto.vldas`; store streams start from `pto.init_align`. 
| `pto.vldas`, `pto.vldus` (load); `pto.init_align`, `pto.vstus` (store) | ## Scope diff --git a/docs/isa/scalar/ops/micro-instruction/README_zh.md b/docs/isa/scalar/ops/micro-instruction/README_zh.md index 154503bb4..b8404c0bf 100644 --- a/docs/isa/scalar/ops/micro-instruction/README_zh.md +++ b/docs/isa/scalar/ops/micro-instruction/README_zh.md @@ -10,9 +10,10 @@ | 分组 | 说明 | 操作 | |-------|------|------| | [BlockDim 与运行时查询](./block-dim-query_zh.md) | 查询 block / subblock 编号与数量 | `pto.get_block_idx`、`pto.get_subblock_idx`、`pto.get_block_num`、`pto.get_subblock_num` | +| [VMS4 状态查询](./vms4-status-query_zh.md) | 读取 4 路合并排序状态寄存器 | `pto.get_vms4_sr` | | [指针操作](./pointer-operations_zh.md) | 构造类型化指针并做指针算术 | `pto.castptr`、`pto.addptr`、`pto.load_scalar`、`pto.store_scalar` | | [向量执行作用域](./vecscope_zh.md) | 向量函数启动与作用域边界 | `pto.vecscope`、`pto.strict_vecscope` | -| [对齐状态类型](./align-type_zh.md) | 非对齐 load/store 的对齐状态管理 | `pto.init_align`、`pto.vldas`、`pto.vldus`、`pto.vstus` | +| [对齐状态类型](./align-type_zh.md) | 非对齐 load/store 的对齐状态管理。load 流由 `pto.vldas` 起始;store 流由 `pto.init_align` 起始。 | `pto.vldas`、`pto.vldus`(load);`pto.init_align`、`pto.vstus`(store) | ## 覆盖范围 diff --git a/docs/isa/scalar/ops/micro-instruction/align-type.md b/docs/isa/scalar/ops/micro-instruction/align-type.md index 74c6e6be6..814eddf82 100644 --- a/docs/isa/scalar/ops/micro-instruction/align-type.md +++ b/docs/isa/scalar/ops/micro-instruction/align-type.md @@ -31,11 +31,15 @@ The page defines the contract of `!pto.align` and the stream discipline around i ## Alignment State Operations -### `pto.init_align` +### `pto.init_align` — Initialize Store-Side Align Carrier -**Syntax:** `%align = pto.init_align : -> !pto.align` +**Syntax:** `%result = pto.init_align : !pto.align` -**Semantics:** Initialize a new alignment state carrier. +**Semantics:** Initialize store-side align carrier state. 
+ +**Outputs:** `%result` is a fresh zero-initialized align carrier for **store-side** unaligned streams such as `pto.vstus`, `pto.vstur`, `pto.vstar`, `pto.vstas`, and `pto.pstu`. + +**Constraints:** This op is for store-family initialization only. Unaligned load streams still start from `pto.vldas`, not `pto.init_align`. ```c align = init_align(); @@ -43,9 +47,20 @@ align = init_align(); ### `pto.vldas` — Prime Alignment for Unaligned Load -**Syntax:** `%align = pto.vldas %ub : !pto.ptr -> !pto.align` +**Syntax:** `%result = pto.vldas %source : !pto.ptr -> !pto.align` + +**Semantics:** Prime alignment buffer for subsequent unaligned load. + +**Inputs:** `%source` is the UB address whose surrounding aligned block seeds the load alignment state. + +**Outputs:** `%result` is the initialized load-alignment state. -**Semantics:** Prime the alignment buffer for a subsequent unaligned load. The source address's surrounding aligned block seeds the load alignment state. +**Constraints:** + +- This op is the required leading operation for a `pto.vldus` stream using the same alignment state. +- The source address itself need not be 32-byte aligned; hardware truncates it to the aligned block boundary for the priming load. + +**Latency:** **9** cycles. ```mlir %align = pto.vldas %ub : !pto.ptr -> !pto.align @@ -53,22 +68,46 @@ align = init_align(); ### `pto.vldus` — Unaligned Load with Alignment State Update -**Syntax:** `%vec, %align_out = pto.vldus %ub, %align : !pto.ptr, !pto.align -> !pto.vreg, !pto.align` +**Syntax:** `%result, %align_out = pto.vldus %source, %align : !pto.ptr, !pto.align -> !pto.vreg, !pto.align` + +**Semantics:** Unaligned load using primed align state. + +**Inputs:** `%source` is the current UB address; `%align` is the incoming load alignment state primed by `pto.vldas` or a prior `pto.vldus`. -**Semantics:** Perform an unaligned load using the provided alignment state, and produce both the loaded vector and the updated alignment state. 
+**Outputs:** `%result` is the assembled vector value; `%align_out` is the updated alignment state. + +**Constraints:** + +- A matching `pto.vldas` MUST appear before the first dependent `pto.vldus` stream in the same vector loop. +- The installed no-post A5 interface keeps a struct-shaped internal return for lowering convenience, but its no-post `base` field is not meaningful user-visible state. VPTO therefore hides that value and only exposes the updated align carrier. +- Reusing the original `%source` starts a new explicit access point; if the caller wants another no-post access, it should compute the next source pointer explicitly and pair it with the required align setup. + +**Latency:** **9** cycles. ```mlir %vec, %align_out = pto.vldus %ub, %align : !pto.ptr, !pto.align -> !pto.vreg<64xf32>, !pto.align ``` -### `pto.vstus` — Unaligned Store with Alignment State Update +### `pto.vstus` — No-Post Unaligned Store with Scalar Offset + +**Syntax:** `%align_out = pto.vstus %align_in, %offset, %value, %base : !pto.align, i32, !pto.vreg, !pto.ptr -> !pto.align` + +**Semantics:** No-post unaligned store with scalar offset. + +**Inputs:** `%align_in` is the incoming store-alignment state, `%offset` is the scalar displacement, `%value` is the vector being stored, and `%base` is the UB base pointer. + +**Outputs:** `%align_out` is the updated buffered-tail state. + +**Constraints:** -**Syntax:** `%align_out = pto.vstus %align, %offset, %vec, %ub : !pto.align, i32, !pto.vreg, !pto.ptr -> !pto.align` +- This is the scalar-offset stateful form of the unaligned store family. The first `%align_in` in the stream should come from `pto.init_align`. +- This op does **not** mean "store a full vector starting at `%base + %offset`". Instead, `%offset` describes how far the store stream advances at this step, and `%align_out` carries any residual tail that could not be committed yet. +- The no-post surface does not expose an updated base pointer. 
A later flush op (`pto.vstas` / `pto.vstar`) must therefore use an explicit destination/offset pair that identifies the same logical flush point as this `pto.vstus`. -**Semantics:** Perform an unaligned store using the provided alignment state, and produce the updated alignment state. +**Latency:** **9** cycles. ```mlir -%store_align = pto.init_align : -> !pto.align +%store_align = pto.init_align : !pto.align %next_align = pto.vstus %store_align, %offset, %vec, %ub : !pto.align, i32, !pto.vreg<64xf32>, !pto.ptr -> !pto.align ``` @@ -91,7 +130,7 @@ The following example shows the complete unaligned load/store stream lifecycle: %result1 = pto.vabs %v1, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // ─── Store stream ─── -%store_align0 = pto.init_align : -> !pto.align +%store_align0 = pto.init_align : !pto.align %align_out1 = pto.vstus %store_align0, %c32, %result0, %ub_out : !pto.align, i32, !pto.vreg<64xf32>, !pto.ptr -> !pto.align %align_out2 = pto.vstus %align_out1, %c32, %result1, %ub_out : !pto.align, i32, !pto.vreg<64xf32>, !pto.ptr -> !pto.align ``` @@ -101,9 +140,9 @@ The following example shows the complete unaligned load/store stream lifecycle: !!! warning "Constraints" - `pto.vldas` must be the leading operation of an unaligned load stream. - `pto.vldus` must follow `pto.vldas` using the same alignment state. - - `pto.vstus` must be preceded by `pto.init_align` to start a new store alignment stream. + - Store-side unaligned streams (`pto.vstus` and the related `pto.vstur`, `pto.vstar`, `pto.vstas`, `pto.pstu`) must be initialized by `pto.init_align`. `pto.init_align` is **store-side only** — it cannot be used to prime a load stream. - The alignment state must be threaded through all operations in the stream without branching. - - For `pto.vstus`, the `%offset` parameter controls the per-operation stride within the stream. 
+ - For `pto.vstus`, `%offset` controls how far the store stream advances at each step, not the absolute store displacement from `%base`. A later flush op (`pto.vstas` / `pto.vstar`) must reuse the matching destination/offset pair. ## Why Explicit Alignment State? diff --git a/docs/isa/scalar/ops/micro-instruction/align-type_zh.md b/docs/isa/scalar/ops/micro-instruction/align-type_zh.md index 56f7a8bdf..efeb65a22 100644 --- a/docs/isa/scalar/ops/micro-instruction/align-type_zh.md +++ b/docs/isa/scalar/ops/micro-instruction/align-type_zh.md @@ -31,11 +31,15 @@ ## 对齐状态相关操作 -### `pto.init_align` +### `pto.init_align`:初始化 store 端对齐状态载体 -**语法**:`%align = pto.init_align : -> !pto.align` +**语法**:`%result = pto.init_align : !pto.align` -**语义**:初始化一个新的对齐状态。 +**语义**:初始化 store 端对齐状态载体。 + +**输出**:`%result` 是一个全新的、零初始化的对齐状态载体,用于 **store 端** 的非对齐流,例如 `pto.vstus`、`pto.vstur`、`pto.vstar`、`pto.vstas` 和 `pto.pstu`。 + +**约束**:此操作只用于 store 端初始化。非对齐 load 流仍然必须从 `pto.vldas` 开始,而不是 `pto.init_align`。 ```c align = init_align(); @@ -43,9 +47,20 @@ align = init_align(); ### `pto.vldas`:为非对齐 load 预热对齐状态 -**语法**:`%align = pto.vldas %ub : !pto.ptr -> !pto.align` +**语法**:`%result = pto.vldas %source : !pto.ptr -> !pto.align` + +**语义**:为后续的非对齐 load 预热对齐缓冲区。 + +**输入**:`%source` 是 UB 地址,其所在的对齐块用于作为 load 对齐状态的种子。 + +**输出**:`%result` 是初始化后的 load 端对齐状态。 -**语义**:为后续的非对齐 load 预热对齐状态。源地址周围的对齐块会作为后续 load 流的种子状态。 +**约束**: + +- 此操作是同一对齐状态链上 `pto.vldus` 流的必备起始操作。 +- 源地址本身不需要 32 字节对齐;硬件会将其截断到对齐块边界以执行预热 load。 + +**延迟**:**9** 周期。 ```mlir %align = pto.vldas %ub : !pto.ptr -> !pto.align @@ -53,22 +68,46 @@ align = init_align(); ### `pto.vldus`:带对齐状态更新的非对齐 load -**语法**:`%vec, %align_out = pto.vldus %ub, %align : !pto.ptr, !pto.align -> !pto.vreg, !pto.align` +**语法**:`%result, %align_out = pto.vldus %source, %align : !pto.ptr, !pto.align -> !pto.vreg, !pto.align` + +**语义**:使用预热好的对齐状态执行一次非对齐 load。 + +**输入**:`%source` 是当前 UB 地址;`%align` 是由 `pto.vldas` 或前一条 `pto.vldus` 产出的 load 端对齐状态。 -**语义**:使用给定的对齐状态执行一次非对齐 
load,并同时返回加载得到的向量与更新后的对齐状态。 +**输出**:`%result` 是装配出的向量值;`%align_out` 是更新后的对齐状态。 + +**约束**: + +- 同一向量循环中,第一条依赖的 `pto.vldus` 之前必须出现匹配的 `pto.vldas`。 +- A5 no-post 接口在内部保留了一个结构体形式的返回值以便下沉,但其中的 `base` 字段不是用户可见的有意义状态;VPTO 在表面上隐藏该值,仅暴露更新后的对齐状态载体。 +- 重新使用原始的 `%source` 会启动一个新的显式访问点;如果调用方希望再次进行 no-post 访问,应显式计算下一个源指针,并配上必需的对齐状态初始化。 + +**延迟**:**9** 周期。 ```mlir %vec, %align_out = pto.vldus %ub, %align : !pto.ptr, !pto.align -> !pto.vreg<64xf32>, !pto.align ``` -### `pto.vstus`:带对齐状态更新的非对齐 store +### `pto.vstus`:带标量偏移的 no-post 非对齐 store + +**语法**:`%align_out = pto.vstus %align_in, %offset, %value, %base : !pto.align, i32, !pto.vreg, !pto.ptr -> !pto.align` + +**语义**:带标量偏移的 no-post 非对齐 store。 + +**输入**:`%align_in` 是输入的 store 端对齐状态;`%offset` 是标量步长;`%value` 是被存储的向量;`%base` 是 UB 基址。 + +**输出**:`%align_out` 是更新后的缓冲尾部状态。 + +**约束**: -**语法**:`%align_out = pto.vstus %align, %offset, %vec, %ub : !pto.align, i32, !pto.vreg, !pto.ptr -> !pto.align` +- 这是非对齐 store 家族里带标量偏移的有状态形式。同一个流的首个 `%align_in` 应来自 `pto.init_align`。 +- 此指令 **不** 表示「从 `%base + %offset` 开始存储一整个向量」。相反,`%offset` 描述当前这步在流中前进多远,而 `%align_out` 携带尚未提交的尾部残量。 +- no-post 表面不暴露已更新的基址指针。后续 flush 操作(`pto.vstas` / `pto.vstar`)必须显式使用与本条 `pto.vstus` 对应的目的地/偏移对,指明同一个逻辑 flush 点。 -**语义**:使用给定的对齐状态执行一次非对齐 store,并返回更新后的对齐状态。 +**延迟**:**9** 周期。 ```mlir -%store_align = pto.init_align : -> !pto.align +%store_align = pto.init_align : !pto.align %next_align = pto.vstus %store_align, %offset, %vec, %ub : !pto.align, i32, !pto.vreg<64xf32>, !pto.ptr -> !pto.align ``` @@ -88,7 +127,7 @@ align = init_align(); %result1 = pto.vabs %v1, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // ─── Store 流 ─── -%store_align0 = pto.init_align : -> !pto.align +%store_align0 = pto.init_align : !pto.align %align_out1 = pto.vstus %store_align0, %c32, %result0, %ub_out : !pto.align, i32, !pto.vreg<64xf32>, !pto.ptr -> !pto.align %align_out2 = pto.vstus %align_out1, %c32, %result1, %ub_out : !pto.align, i32, !pto.vreg<64xf32>, !pto.ptr -> !pto.align ``` @@ 
-98,9 +137,9 @@ align = init_align(); !!! warning "约束" - `pto.vldas` 必须是非对齐 load 流的起始操作。 - `pto.vldus` 必须接在同一条对齐状态链上的 `pto.vldas` 之后。 - - `pto.vstus` 必须以 `pto.init_align` 启动新的 store 对齐流。 + - store 端的非对齐流(`pto.vstus` 及相关的 `pto.vstur`、`pto.vstar`、`pto.vstas`、`pto.pstu`)必须由 `pto.init_align` 初始化。`pto.init_align` **只用于 store 端**,不能用来预热 load 流。 - 对齐状态必须在线性流中传递,不能随意分叉。 - - 对于 `pto.vstus`,`%offset` 控制每次 store 在流中的步进。 + - 对于 `pto.vstus`,`%offset` 控制每一步在流中前进多远,而不是相对 `%base` 的绝对存储位移。后续 flush 操作(`pto.vstas` / `pto.vstar`)必须复用配套的目的地/偏移对。 ## 为什么要显式化对齐状态 diff --git a/docs/isa/scalar/ops/micro-instruction/vms4-status-query.md b/docs/isa/scalar/ops/micro-instruction/vms4-status-query.md new file mode 100644 index 000000000..f41dab95a --- /dev/null +++ b/docs/isa/scalar/ops/micro-instruction/vms4-status-query.md @@ -0,0 +1,71 @@ +# PTO Micro-Instruction: VMS4 Status Query (`pto.get_vms4_sr`) + +This page documents the PTO micro-instruction runtime query for the `VMS4_SR` status register. The op is part of the PTO micro-instruction surface (A5 Ascend 950 profile). + +## Overview + +`pto.get_vms4_sr` exposes the contents of the `VMS4_SR` hardware register to scalar code. After an exhausted [`pto.vmrgsort4`](../../../vector/ops/sfu-and-dsa-ops/vmrgsort.md) merge-sort operation, `VMS4_SR` records the per-source-list executed counts; reading it lets a kernel reason about how many elements of each input list were consumed. + +## Mechanism + +`pto.get_vms4_sr` is a pure scalar producer. It does not move data, does not synchronize pipelines, and does not change any architectural state. It simply reads the four 16-bit fields of `VMS4_SR` and returns them as four SSA `i16` values. + +The intended pattern is to issue a `pto.vmrgsort4` that may exhaust before fully consuming all inputs, then read `VMS4_SR` to discover how far each source list advanced, and use those counts to drive the next round of sort/merge work. 
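The field split described above can be sketched in plain C. This is a minimal host-side model, not PTO code: `vms4_counts_t`, `vms4_unpack`, and `vms4_remaining` are hypothetical names introduced for illustration, and the sketch assumes source list 0 sits in the least-significant 16 bits, as the register layout on this page gives.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four counts; not a PTO type. */
typedef struct {
    uint16_t list[4];
} vms4_counts_t;

/* Split a raw 64-bit VMS4_SR image into its four 16-bit fields.
 * Assumes list i occupies bits [16*i + 15 : 16*i]. */
static vms4_counts_t vms4_unpack(uint64_t status) {
    vms4_counts_t c;
    for (int i = 0; i < 4; ++i) {
        c.list[i] = (uint16_t)((status >> (16 * i)) & 0xffffu);
    }
    return c;
}

/* Elements of list i not yet consumed, given that list's original length. */
static uint32_t vms4_remaining(const vms4_counts_t *c, int i, uint32_t len) {
    assert(i >= 0 && i < 4);
    assert(c->list[i] <= len);
    return len - c->list[i];
}
```

A kernel-side helper along these lines could turn the four counts into per-list remainders before setting up the next merge round; how those remainders feed `pto.vmrgsort4` is outside what this sketch models.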
+ +## `pto.get_vms4_sr` + +**Syntax:** `%list0, %list1, %list2, %list3 = pto.get_vms4_sr : i16, i16, i16, i16` + +**Semantics:** Read `VMS4_SR` and return the finished element counts for source lists 0, 1, 2, and 3. + +### Inputs + +None. + +### Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%list0` | `i16` | Finished count for source list 0 | +| `%list1` | `i16` | Finished count for source list 1 | +| `%list2` | `i16` | Finished count for source list 2 | +| `%list3` | `i16` | Finished count for source list 3 | + +### Register Layout + +| Bits | Meaning | +|------|---------| +| `[15:0]` | finished count for source list 0 | +| `[31:16]` | finished count for source list 1 | +| `[47:32]` | finished count for source list 2 | +| `[63:48]` | finished count for source list 3 | + +```c +status = VMS4_SR; +list0 = (uint16_t)(status & 0xffff); +list1 = (uint16_t)((status >> 16) & 0xffff); +list2 = (uint16_t)((status >> 32) & 0xffff); +list3 = (uint16_t)((status >> 48) & 0xffff); +``` + +### Constraints + +- The returned values are unsigned 16-bit counts of elements consumed from each source list. +- The intended pattern is to read `VMS4_SR` after an exhausted `pto.vmrgsort4` to determine partial-progress counts. +- The op is a pure scalar producer; it has no architectural side effects. + +### Examples + +```mlir +// After a partial pto.vmrgsort4, read per-list executed counts +%list0, %list1, %list2, %list3 = pto.get_vms4_sr : i16, i16, i16, i16 + +// Use the counts to advance the next sort round +%c0_i64 = arith.extui %list0 : i16 to i64 +// ... 
feed back into the next vmrgsort4 setup +``` + +## Related Operations + +- 4-way merge sort: [`pto.vmrgsort`](../../../vector/ops/sfu-and-dsa-ops/vmrgsort.md) +- Block runtime queries: [BlockDim Query Operations](./block-dim-query.md) diff --git a/docs/isa/scalar/ops/micro-instruction/vms4-status-query_zh.md b/docs/isa/scalar/ops/micro-instruction/vms4-status-query_zh.md new file mode 100644 index 000000000..798a54658 --- /dev/null +++ b/docs/isa/scalar/ops/micro-instruction/vms4-status-query_zh.md @@ -0,0 +1,71 @@ +# PTO 微指令:VMS4 状态查询(`pto.get_vms4_sr`) + +本页说明 PTO 微指令中的 `VMS4_SR` 状态寄存器查询操作。它属于 PTO 微指令表面,对应 A5(Ascend 950)profile。 + +## 概览 + +`pto.get_vms4_sr` 把 `VMS4_SR` 硬件寄存器的内容暴露给标量代码。在一次会耗尽的 [`pto.vmrgsort4`](../../../vector/ops/sfu-and-dsa-ops/vmrgsort_zh.md) 合并排序结束后,`VMS4_SR` 记录了各源列表已执行的元素计数;读出这个寄存器可以让 kernel 知道每条输入列表消费到了哪里。 + +## 机制 + +`pto.get_vms4_sr` 是一条纯标量生产者操作。它不搬数据、不做流水线同步、不会改变任何架构状态。它只是读 `VMS4_SR` 的四个 16 位字段,并以四个 SSA `i16` 值返回。 + +典型用法是:发起一次可能在中途耗尽的 `pto.vmrgsort4`,然后读 `VMS4_SR` 查看各源列表的推进位置,根据这些计数驱动下一轮排序/合并。 + +## `pto.get_vms4_sr` + +**语法**:`%list0, %list1, %list2, %list3 = pto.get_vms4_sr : i16, i16, i16, i16` + +**语义**:读取 `VMS4_SR`,返回源列表 0、1、2、3 的已完成元素计数。 + +### 输入 + +无。 + +### 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%list0` | `i16` | 源列表 0 的已完成计数 | +| `%list1` | `i16` | 源列表 1 的已完成计数 | +| `%list2` | `i16` | 源列表 2 的已完成计数 | +| `%list3` | `i16` | 源列表 3 的已完成计数 | + +### 寄存器位段 + +| 位 | 含义 | +|------|------| +| `[15:0]` | 源列表 0 的已完成计数 | +| `[31:16]` | 源列表 1 的已完成计数 | +| `[47:32]` | 源列表 2 的已完成计数 | +| `[63:48]` | 源列表 3 的已完成计数 | + +```c +status = VMS4_SR; +list0 = (uint16_t)(status & 0xffff); +list1 = (uint16_t)((status >> 16) & 0xffff); +list2 = (uint16_t)((status >> 32) & 0xffff); +list3 = (uint16_t)((status >> 48) & 0xffff); +``` + +### 约束 + +- 返回值是各源列表已消费元素数量的无符号 16 位计数。 +- 典型用法是在一次会耗尽的 `pto.vmrgsort4` 之后读 `VMS4_SR`,以了解部分进度。 +- 本操作是纯标量生产者,没有任何架构副作用。 + +### 示例 + +```mlir +// 在一次部分完成的 pto.vmrgsort4 之后读出各列表已执行计数 +%list0, %list1, %list2, %list3 
= pto.get_vms4_sr : i16, i16, i16, i16 + +// 用这些计数驱动下一轮排序 +%c0_i64 = arith.extui %list0 : i16 to i64 +// ... 喂给下一个 vmrgsort4 的初始化 +``` + +## 相关操作 + +- 4 路合并排序:[`pto.vmrgsort`](../../../vector/ops/sfu-and-dsa-ops/vmrgsort_zh.md) +- Block 运行时查询:[BlockDim 查询操作](./block-dim-query_zh.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md index 987b52a62..edcffebf8 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pand.md @@ -19,19 +19,19 @@ The third operand (`%mask`) in the syntax is an optional masking predicate for t ### PTO Assembly Form ```mlir -%dst = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pand ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.pand ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## C++ Intrinsic @@ -48,15 +48,15 @@ pand(dst, src0, src1, mask); | Operand | Type | Description | |---------|------|-------------| -| `%src0` | `!pto.mask` | First source predicate | -| `%src1` | `!pto.mask` | Second source predicate | -| `%mask` | `!pto.mask` | Optional masking predicate (scalar and control instructions context) | +| `%src0` | `!pto.mask` | First source predicate | +| `%src1` | `!pto.mask` | Second source predicate | +| `%mask` | `!pto.mask` | Optional masking predicate (scalar and control instructions context) | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst` | `!pto.mask` | Bitwise AND of src0 and src1 | +| `%dst` 
| `!pto.mask` | Bitwise AND of src0 and src1 | ## Side Effects @@ -99,16 +99,16 @@ void combine_masks(RegBuf& dst, ```mlir // %cmp_mask: lanes where a[i] < b[i] -%cmp = pto.vcmp %va, %vb, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%cmp = pto.vcmp %va, %vb, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask // %tail_mask: lanes in the remainder region -%tail = pto.pge_b32 %rem : i32 -> !pto.mask +%tail = pto.pge_b32 %rem : i32 -> !pto.mask // Intersection: only process remainder lanes where comparison is true -%active = pto.pand %cmp, %tail, %cmp : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%active = pto.pand %cmp, %tail, %cmp : !pto.mask, !pto.mask, !pto.mask -> !pto.mask // Use in predicated operation -%result = pto.vsel %v_true, %v_false, %active : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsel %v_true, %v_false, %active : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pand_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pand_zh.md index ef1e56d28..cf58d1e16 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pand_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pand_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pand %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask +pand %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%dst = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pand %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pand ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.pand ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## 关键约束 diff --git 
a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b16.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b16.md new file mode 100644 index 000000000..dce4d6028 --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b16.md @@ -0,0 +1,58 @@ +# pto.pdintlv_b16 + +`pto.pdintlv_b16` is part of the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) instruction set. + +## Summary + +Deinterleave two `b16`-granularity predicate sources and materialize the lower and higher result halves as two predicate outputs. + +## Mechanism + +`pto.pdintlv_b16` is the 16-bit-element-granularity variant of the predicate-deinterleave family ([`pto.pdintlv_b8`](./pdintlv-b8.md) / `pto.pdintlv_b16` / [`pto.pdintlv_b32`](./pdintlv-b32.md)). It takes two `!pto.mask` sources and emits the two deinterleaved halves under the same `b16` granularity. The hardware view of the predicate-register image is preserved bit-for-bit; only how the bits are grouped into 16-bit element slots changes. + +## Syntax + +### AS Level 1 (SSA) + +```mlir +%low, %high = pto.pdintlv_b16 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%src0` | `!pto.mask` | First predicate source | +| `%src1` | `!pto.mask` | Second predicate source | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%low` | `!pto.mask` | Lower deinterleaved half produced from `%src0` / `%src1` | +| `%high` | `!pto.mask` | Upper deinterleaved half produced from `%src0` / `%src1` | + +## Side Effects + +None. `pto.pdintlv_b16` is a pure predicate transform; it does not read or write UB, GM, or any architectural state beyond producing its two SSA results. + +## Constraints + +!!! warning "Constraints" + - All operands and results MUST use `!pto.mask`. 
Mixing predicate granularities is illegal; use `pto.pbitcast` first if a producer emits a different granularity. + - The two outputs form an ordered pair (`%low`, `%high`) and that pairing MUST be preserved. + +## Examples + +```mlir +%lo, %hi = pto.pdintlv_b16 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) +- Other variants: [pto.pdintlv_b8](./pdintlv-b8.md), [pto.pdintlv_b32](./pdintlv-b32.md) +- Inverse: [pto.pintlv_b16](./pintlv-b16.md) +- Mask granularity reinterpret: [pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b16_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b16_zh.md new file mode 100644 index 000000000..2a9c9037b --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b16_zh.md @@ -0,0 +1,58 @@ +# pto.pdintlv_b16 + +`pto.pdintlv_b16` 属于 [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) 指令集。 + +## 摘要 + +按 `b16` 粒度对两个谓词源做按位去交错,分别生成低半和高半两路谓词输出。 + +## 机制 + +`pto.pdintlv_b16` 是谓词去交错家族([`pto.pdintlv_b8`](./pdintlv-b8_zh.md) / `pto.pdintlv_b16` / [`pto.pdintlv_b32`](./pdintlv-b32_zh.md))的 16 位元素粒度变体。它接收两个 `!pto.mask` 源,按相同的 `b16` 粒度产生两路去交错后的谓词。底层硬件的谓词寄存器位模式保持不变;改变的只是把这些位按 16 位元素分组的方式。 + +## 语法 + +### AS Level 1(SSA) + +```mlir +%low, %high = pto.pdintlv_b16 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src0` | `!pto.mask` | 第一路谓词源 | +| `%src1` | `!pto.mask` | 第二路谓词源 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%low` | `!pto.mask` | 从 `%src0` / `%src1` 去交错产生的低半 | +| `%high` | `!pto.mask` | 从 `%src0` / `%src1` 去交错产生的高半 | + +## 副作用 + +无。`pto.pdintlv_b16` 是纯粹的谓词变换:不会读写 UB、GM,也不会改变除两路 SSA 结果以外的任何架构状态。 + +## 约束 + +!!! 
warning "约束" + - 所有操作数和结果都必须使用 `!pto.mask`。混合谓词粒度是非法的;如果生产者产生的是另一种粒度,先用 `pto.pbitcast` 重新解释。 + - 两路输出形成一个有序对 (`%low`, `%high`),这种配对关系必须保持。 + +## 示例 + +```mlir +%lo, %hi = pto.pdintlv_b16 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 相关页面 + +- 指令集总览:[谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 其它变体:[pto.pdintlv_b8](./pdintlv-b8_zh.md)、[pto.pdintlv_b32](./pdintlv-b32_zh.md) +- 反向操作:[pto.pintlv_b16](./pintlv-b16_zh.md) +- 谓词粒度重解释:[pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast_zh.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b32.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b32.md new file mode 100644 index 000000000..c994d6d07 --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b32.md @@ -0,0 +1,58 @@ +# pto.pdintlv_b32 + +`pto.pdintlv_b32` is part of the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) instruction set. + +## Summary + +Deinterleave two `b32`-granularity predicate sources and materialize the lower and higher result halves as two predicate outputs. + +## Mechanism + +`pto.pdintlv_b32` is the 32-bit-element-granularity variant of the predicate-deinterleave family ([`pto.pdintlv_b8`](./pdintlv-b8.md) / [`pto.pdintlv_b16`](./pdintlv-b16.md) / `pto.pdintlv_b32`). It takes two `!pto.mask` sources and emits the two deinterleaved halves under the same `b32` granularity. The hardware view of the predicate-register image is preserved bit-for-bit; only how the bits are grouped into 32-bit element slots changes. 
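The paragraph above does not spell out the lane-level mapping, so the following is only a toy model under a stated assumption: conventional "unzip" semantics, where the flags of `%src0` followed by `%src1` are split into even-indexed flags (`%low`) and odd-indexed flags (`%high`). `pdintlv_model` is a hypothetical helper, and the real hardware mapping may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: a predicate is an array of n per-element flags.
 * ASSUMPTION: unzip-style semantics over the concatenation src0 ++ src1;
 * low takes the even-indexed flags, high takes the odd-indexed flags.
 * Outputs must not alias the inputs.
 */
static void pdintlv_model(const bool *src0, const bool *src1, size_t n,
                          bool *low, bool *high) {
    assert(src0 && src1 && low && high);
    for (size_t i = 0; i < n; ++i) {
        /* Indices into the conceptual 2n-flag concatenation. */
        size_t even = 2 * i;
        size_t odd = 2 * i + 1;
        low[i] = even < n ? src0[even] : src1[even - n];
        high[i] = odd < n ? src0[odd] : src1[odd - n];
    }
}
```

Under this assumption the first half of each output comes from `%src0` and the second half from `%src1`, which is why the (`%low`, `%high`) pairing has to be kept in order.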
+ +## Syntax + +### AS Level 1 (SSA) + +```mlir +%low, %high = pto.pdintlv_b32 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%src0` | `!pto.mask` | First predicate source | +| `%src1` | `!pto.mask` | Second predicate source | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%low` | `!pto.mask` | Lower deinterleaved half produced from `%src0` / `%src1` | +| `%high` | `!pto.mask` | Upper deinterleaved half produced from `%src0` / `%src1` | + +## Side Effects + +None. `pto.pdintlv_b32` is a pure predicate transform; it does not read or write UB, GM, or any architectural state beyond producing its two SSA results. + +## Constraints + +!!! warning "Constraints" + - All operands and results MUST use `!pto.mask`. Mixing predicate granularities is illegal; use `pto.pbitcast` first if a producer emits a different granularity. + - The two outputs form an ordered pair (`%low`, `%high`) and that pairing MUST be preserved. 
+ +## Examples + +```mlir +%lo, %hi = pto.pdintlv_b32 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) +- Other variants: [pto.pdintlv_b8](./pdintlv-b8.md), [pto.pdintlv_b16](./pdintlv-b16.md) +- Inverse: [pto.pintlv_b32](./pintlv-b32.md) +- Mask granularity reinterpret: [pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b32_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b32_zh.md new file mode 100644 index 000000000..dc5396179 --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b32_zh.md @@ -0,0 +1,58 @@ +# pto.pdintlv_b32 + +`pto.pdintlv_b32` 属于 [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) 指令集。 + +## 摘要 + +按 `b32` 粒度对两个谓词源做按位去交错,分别生成低半和高半两路谓词输出。 + +## 机制 + +`pto.pdintlv_b32` 是谓词去交错家族([`pto.pdintlv_b8`](./pdintlv-b8_zh.md) / [`pto.pdintlv_b16`](./pdintlv-b16_zh.md) / `pto.pdintlv_b32`)的 32 位元素粒度变体。它接收两个 `!pto.mask` 源,按相同的 `b32` 粒度产生两路去交错后的谓词。底层硬件的谓词寄存器位模式保持不变;改变的只是把这些位按 32 位元素分组的方式。 + +## 语法 + +### AS Level 1(SSA) + +```mlir +%low, %high = pto.pdintlv_b32 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src0` | `!pto.mask` | 第一路谓词源 | +| `%src1` | `!pto.mask` | 第二路谓词源 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%low` | `!pto.mask` | 从 `%src0` / `%src1` 去交错产生的低半 | +| `%high` | `!pto.mask` | 从 `%src0` / `%src1` 去交错产生的高半 | + +## 副作用 + +无。`pto.pdintlv_b32` 是纯粹的谓词变换:不会读写 UB、GM,也不会改变除两路 SSA 结果以外的任何架构状态。 + +## 约束 + +!!! 
warning "约束" + - 所有操作数和结果都必须使用 `!pto.mask`。混合谓词粒度是非法的;如果生产者产生的是另一种粒度,先用 `pto.pbitcast` 重新解释。 + - 两路输出形成一个有序对 (`%low`, `%high`),这种配对关系必须保持。 + +## 示例 + +```mlir +%lo, %hi = pto.pdintlv_b32 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 相关页面 + +- 指令集总览:[谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 其它变体:[pto.pdintlv_b8](./pdintlv-b8_zh.md)、[pto.pdintlv_b16](./pdintlv-b16_zh.md) +- 反向操作:[pto.pintlv_b32](./pintlv-b32_zh.md) +- 谓词粒度重解释:[pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast_zh.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md index bdbe3a985..9688bd843 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8.md @@ -19,19 +19,19 @@ The public call surface therefore models `pto.pdintlv_b8` as a paired-result ope ### PTO Assembly Form ```mlir -%dst0, %dst1 = pto.pdintlv_b8 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%dst0, %dst1 = pto.pdintlv_b8 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst0, %dst1 = pto.pdintlv_b8 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%dst0, %dst1 = pto.pdintlv_b8 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pdintlv_b8 ins(%src0, %src1 : !pto.mask, !pto.mask) outs(%dst0, %dst1 : !pto.mask, !pto.mask) +pto.pdintlv_b8 ins(%src0, %src1 : !pto.mask, !pto.mask) outs(%dst0, %dst1 : !pto.mask, !pto.mask) ``` ## C++ Intrinsic @@ -48,15 +48,15 @@ pdintlv_b8(dst0, dst1, src0, src1); | Operand | Type | Description | |---------|------|-------------| -| `%src0` | `!pto.mask` | First predicate source | -| `%src1` | `!pto.mask` | Second predicate source | +| `%src0` | `!pto.mask` | First predicate source | +| `%src1` | `!pto.mask` | Second predicate source | ## Expected Outputs | 
Result | Type | Description | |--------|------|-------------| -| `%dst0` | `!pto.mask` | Lower result half returned by the deinterleave helper | -| `%dst1` | `!pto.mask` | Upper result half returned by the deinterleave helper | +| `%dst0` | `!pto.mask` | Lower result half returned by the deinterleave helper | +| `%dst1` | `!pto.mask` | Upper result half returned by the deinterleave helper | ## Side Effects @@ -96,12 +96,14 @@ pdintlv_b8(dst0, dst1, src0, src1); ### SSA form ```mlir -%dst0, %dst1 = pto.pdintlv_b8 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%dst0, %dst1 = pto.pdintlv_b8 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` ## Related Ops / Instruction Set Links - Instruction set overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) +- Other variants: [pto.pdintlv_b16](./pdintlv-b16.md), [pto.pdintlv_b32](./pdintlv-b32.md) +- Inverse: [pto.pintlv_b8](./pintlv-b8.md) - Previous op in instruction set: [pto.psel](./psel.md) -- Next op in instruction set: [pto.pintlv_b16](./pintlv-b16.md) +- Next op in instruction set: [pto.pdintlv_b16](./pdintlv-b16.md) - Control-shell overview: [Control and configuration](../../control-and-configuration.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8_zh.md index 4f7791b99..8038f2e02 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pdintlv-b8_zh.md @@ -1,32 +1,58 @@ # pto.pdintlv_b8 -对谓词做按位去交错。 +`pto.pdintlv_b8` 属于 [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) 指令集。 -## 语法 +## 摘要 -### PTO 汇编形式 +按 `b8` 粒度对两个谓词源做按位去交错,分别生成低半和高半两路谓词输出。 -```text -pdintlv_b8 %dst0, %dst1, %src : !pto.mask, !pto.mask, !pto.mask -``` +## 机制 + +`pto.pdintlv_b8` 是谓词去交错家族(`pto.pdintlv_b8` / [`pto.pdintlv_b16`](./pdintlv-b16_zh.md) / [`pto.pdintlv_b32`](./pdintlv-b32_zh.md))的 8 
位元素粒度变体。它接收两个 `!pto.mask` 源,按相同的 `b8` 粒度产生两路去交错后的谓词。底层硬件的谓词寄存器位模式保持不变;改变的只是把这些位按 8 位元素分组的方式。 + +## 语法 ### AS Level 1(SSA) ```mlir -%dst0, %dst1 = pto.pdintlv_b8 %src : !pto.mask -> !pto.mask, !pto.mask +%low, %high = pto.pdintlv_b8 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` -### AS Level 2(DPS) +## 输入 -```mlir -pto.pdintlv_b8 ins(%src : !pto.mask) outs(%dst0, %dst1 : !pto.mask, !pto.mask) -``` +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src0` | `!pto.mask` | 第一路谓词源 | +| `%src1` | `!pto.mask` | 第二路谓词源 | + +## 预期输出 -## 关键约束 +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%low` | `!pto.mask` | 从 `%src0` / `%src1` 去交错产生的低半 | +| `%high` | `!pto.mask` | 从 `%src0` / `%src1` 去交错产生的高半 | -- 参与操作的谓词宽度必须兼容。 -- pattern / partition token 必须属于文档化取值域。 +## 副作用 + +无。`pto.pdintlv_b8` 是纯粹的谓词变换:不会读写 UB、GM,也不会改变除两路 SSA 结果以外的任何架构状态。 + +## 约束 + +!!! warning "约束" + - 所有操作数和结果都必须使用 `!pto.mask`。混合谓词粒度是非法的;如果生产者产生的是另一种粒度,先用 `pto.pbitcast` 重新解释。 + - 两路输出形成一个有序对 (`%low`, `%high`),这种配对关系必须保持。 + +## 示例 + +```mlir +%lo, %hi = pto.pdintlv_b8 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` ## 相关页面 -- [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 指令集总览:[谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 其它变体:[pto.pdintlv_b16](./pdintlv-b16_zh.md)、[pto.pdintlv_b32](./pdintlv-b32_zh.md) +- 反向操作:[pto.pintlv_b8](./pintlv-b8_zh.md) +- 谓词粒度重解释:[pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast_zh.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md index 49ad5479b..8c16cba75 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16.md @@ -17,19 +17,19 @@ This page therefore models `pto.pge_b16` as pattern-based predicate materializat ### PTO Assembly Form ```mlir -%mask = pto.pge_b16 "PAT_VL8" : !pto.mask +%mask = pto.pge_b16 "PAT_VL8" : !pto.mask ``` ### AS 
Level 1 (SSA) ```mlir -%mask = pto.pge_b16 "PAT_VL8" : !pto.mask +%mask = pto.pge_b16 "PAT_VL8" : !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pge_b16 "PAT_VL8" outs(%mask : !pto.mask) +pto.pge_b16 "PAT_VL8" outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -48,7 +48,7 @@ vector_bool mask = pge_b16(__cce_simd::PAT_VL8); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | 16-bit predicate generated from the selected pattern token | +| `%mask` | `!pto.mask` | 16-bit predicate generated from the selected pattern token | ## Side Effects @@ -86,7 +86,7 @@ vector_bool mask = pge_b16(__cce_simd::PAT_VL8); ### SSA form ```mlir -%mask = pto.pge_b16 "PAT_VL8" : !pto.mask +%mask = pto.pge_b16 "PAT_VL8" : !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16_zh.md index b43870fb1..5921c86e2 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b16_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pge_b16 %dst, %scalar : !pto.mask, i16 +pge_b16 %dst, %scalar : !pto.mask, i16 ``` ### AS Level 1(SSA) ```mlir -%mask = pto.pge_b16 %scalar : i16 -> !pto.mask +%mask = pto.pge_b16 %scalar : i16 -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pge_b16 ins(%scalar : i16) outs(%mask : !pto.mask) +pto.pge_b16 ins(%scalar : i16) outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md index 17bc77be4..8bc547e41 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32.md @@ -17,19 +17,19 @@ This page therefore models `pto.pge_b32` as pattern-based predicate materializat ### PTO Assembly Form ```mlir -%mask = pto.pge_b32 
"PAT_VL16" : !pto.mask +%mask = pto.pge_b32 "PAT_VL16" : !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.pge_b32 "PAT_VL16" : !pto.mask +%mask = pto.pge_b32 "PAT_VL16" : !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pge_b32 "PAT_VL16" outs(%mask : !pto.mask) +pto.pge_b32 "PAT_VL16" outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -48,7 +48,7 @@ vector_bool mask = pge_b32(__cce_simd::PAT_VL16); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | 32-bit predicate generated from the selected pattern token | +| `%mask` | `!pto.mask` | 32-bit predicate generated from the selected pattern token | ## Side Effects @@ -86,7 +86,7 @@ vector_bool mask = pge_b32(__cce_simd::PAT_VL16); ### SSA form ```mlir -%mask = pto.pge_b32 "PAT_VL16" : !pto.mask +%mask = pto.pge_b32 "PAT_VL16" : !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32_zh.md index 79f62e6e5..6d865db7a 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b32_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pge_b32 %dst, %scalar : !pto.mask, i32 +pge_b32 %dst, %scalar : !pto.mask, i32 ``` ### AS Level 1(SSA) ```mlir -%mask = pto.pge_b32 %scalar : i32 -> !pto.mask +%mask = pto.pge_b32 %scalar : i32 -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pge_b32 ins(%scalar : i32) outs(%mask : !pto.mask) +pto.pge_b32 ins(%scalar : i32) outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md index 401b2049b..8ed95b7cc 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8.md @@ -17,19 +17,19 @@ This page therefore models `pto.pge_b8` as 
pattern-based predicate materializati ### PTO Assembly Form ```mlir -%mask = pto.pge_b8 "PAT_VL4" : !pto.mask +%mask = pto.pge_b8 "PAT_VL4" : !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.pge_b8 "PAT_VL4" : !pto.mask +%mask = pto.pge_b8 "PAT_VL4" : !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pge_b8 "PAT_VL4" outs(%mask : !pto.mask) +pto.pge_b8 "PAT_VL4" outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -48,7 +48,7 @@ vector_bool mask = pge_b8(__cce_simd::PAT_VL4); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | 8-bit predicate generated from the selected pattern token | +| `%mask` | `!pto.mask` | 8-bit predicate generated from the selected pattern token | ## Side Effects @@ -86,7 +86,7 @@ vector_bool mask = pge_b8(__cce_simd::PAT_VL4); ### SSA form ```mlir -%mask = pto.pge_b8 "PAT_VL4" : !pto.mask +%mask = pto.pge_b8 "PAT_VL4" : !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8_zh.md index c8ca874ce..fbd317795 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pge-b8_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pge_b8 %dst, %scalar : !pto.mask, i8 +pge_b8 %dst, %scalar : !pto.mask, i8 ``` ### AS Level 1(SSA) ```mlir -%mask = pto.pge_b8 %scalar : i8 -> !pto.mask +%mask = pto.pge_b8 %scalar : i8 -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pge_b8 ins(%scalar : i8) outs(%mask : !pto.mask) +pto.pge_b8 ins(%scalar : i8) outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md index 2ccb4e39c..023e373fa 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16.md @@ 
-19,19 +19,19 @@ The public call surface therefore models `pto.pintlv_b16` as a paired-result ope ### PTO Assembly Form ```mlir -%dst0, %dst1 = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%dst0, %dst1 = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst0, %dst1 = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%dst0, %dst1 = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pintlv_b16 ins(%src0, %src1 : !pto.mask, !pto.mask) outs(%dst0, %dst1 : !pto.mask, !pto.mask) +pto.pintlv_b16 ins(%src0, %src1 : !pto.mask, !pto.mask) outs(%dst0, %dst1 : !pto.mask, !pto.mask) ``` ## C++ Intrinsic @@ -48,15 +48,15 @@ pintlv_b16(dst0, dst1, src0, src1); | Operand | Type | Description | |---------|------|-------------| -| `%src0` | `!pto.mask` | First predicate source | -| `%src1` | `!pto.mask` | Second predicate source | +| `%src0` | `!pto.mask` | First predicate source | +| `%src1` | `!pto.mask` | Second predicate source | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst0` | `!pto.mask` | Lower result half returned by the interleave helper | -| `%dst1` | `!pto.mask` | Upper result half returned by the interleave helper | +| `%dst0` | `!pto.mask` | Lower result half returned by the interleave helper | +| `%dst1` | `!pto.mask` | Upper result half returned by the interleave helper | ## Side Effects @@ -96,12 +96,14 @@ pintlv_b16(dst0, dst1, src0, src1); ### SSA form ```mlir -%dst0, %dst1 = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%dst0, %dst1 = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` ## Related Ops / Instruction Set Links - Instruction set overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) -- Previous op in instruction set: 
[pto.pdintlv_b8](./pdintlv-b8.md) -- Next op in instruction set: (none - last in instruction set) +- Other variants: [pto.pintlv_b8](./pintlv-b8.md), [pto.pintlv_b32](./pintlv-b32.md) +- Inverse: [pto.pdintlv_b16](./pdintlv-b16.md) +- Previous op in instruction set: [pto.pintlv_b8](./pintlv-b8.md) +- Next op in instruction set: [pto.pintlv_b32](./pintlv-b32.md) - Control-shell overview: [Control and configuration](../../control-and-configuration.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16_zh.md index 11b82509e..b9453924c 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b16_zh.md @@ -1,32 +1,58 @@ # pto.pintlv_b16 -对谓词做交错组合。 +`pto.pintlv_b16` 属于 [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) 指令集。 -## 语法 +## 摘要 -### PTO 汇编形式 +按 `b16` 粒度对两个谓词源做按位交错,分别生成低半和高半两路谓词输出。 -```text -pintlv_b16 %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask -``` +## 机制 + +`pto.pintlv_b16` 是谓词交错家族([`pto.pintlv_b8`](./pintlv-b8_zh.md) / `pto.pintlv_b16` / [`pto.pintlv_b32`](./pintlv-b32_zh.md))的 16 位元素粒度变体。它接收两个 `!pto.mask` 源,按相同的 `b16` 粒度产生两路交错后的谓词。底层硬件的谓词寄存器位模式保持不变;改变的只是把这些位按 16 位元素分组的方式。 + +## 语法 ### AS Level 1(SSA) ```mlir -%dst = pto.pintlv_b16 %src0, %src1 : !pto.mask, !pto.mask -> !pto.mask +%low, %high = pto.pintlv_b16 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask ``` -### AS Level 2(DPS) +## 输入 -```mlir -pto.pintlv_b16 ins(%src0, %src1 : !pto.mask, !pto.mask) outs(%dst : !pto.mask) -``` +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src0` | `!pto.mask` | 第一路谓词源 | +| `%src1` | `!pto.mask` | 第二路谓词源 | + +## 预期输出 -## 关键约束 +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%low` | `!pto.mask` | 从 `%src0` / `%src1` 交错产生的低半 | +| `%high` | `!pto.mask` | 从 `%src0` / `%src1` 交错产生的高半 | -- 参与操作的谓词宽度必须兼容。 -- pattern / partition token 必须属于文档化取值域。 +## 副作用 + 
+无。`pto.pintlv_b16` 是纯粹的谓词变换:不会读写 UB、GM,也不会改变除两路 SSA 结果以外的任何架构状态。 + +## 约束 + +!!! warning "约束" + - 所有操作数和结果都必须使用 `!pto.mask`。混合谓词粒度是非法的;如果生产者产生的是另一种粒度,先用 `pto.pbitcast` 重新解释。 + - 两路输出形成一个有序对 (`%low`, `%high`),这种配对关系必须保持。 + +## 示例 + +```mlir +%lo, %hi = pto.pintlv_b16 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` ## 相关页面 -- [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 指令集总览:[谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 其它变体:[pto.pintlv_b8](./pintlv-b8_zh.md)、[pto.pintlv_b32](./pintlv-b32_zh.md) +- 反向操作:[pto.pdintlv_b16](./pdintlv-b16_zh.md) +- 谓词粒度重解释:[pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast_zh.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b32.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b32.md new file mode 100644 index 000000000..327702851 --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b32.md @@ -0,0 +1,58 @@ +# pto.pintlv_b32 + +`pto.pintlv_b32` is part of the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) instruction set. + +## Summary + +Interleave two `b32`-granularity predicate sources and materialize the lower and higher result halves as two predicate outputs. + +## Mechanism + +`pto.pintlv_b32` is the 32-bit-element-granularity variant of the predicate-interleave family ([`pto.pintlv_b8`](./pintlv-b8.md) / [`pto.pintlv_b16`](./pintlv-b16.md) / `pto.pintlv_b32`). It takes two `!pto.mask` sources and emits the two interleaved halves under the same `b32` granularity. The hardware view of the predicate-register image is preserved bit-for-bit; only how the bits are grouped into 32-bit element slots changes. 
+
+## Syntax
+
+### AS Level 1 (SSA)
+
+```mlir
+%low, %high = pto.pintlv_b32 %src0, %src1
+    : !pto.mask, !pto.mask -> !pto.mask, !pto.mask
+```
+
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `%src0` | `!pto.mask` | First predicate source |
+| `%src1` | `!pto.mask` | Second predicate source |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `%low` | `!pto.mask` | Lower interleaved half produced from `%src0` / `%src1` |
+| `%high` | `!pto.mask` | Upper interleaved half produced from `%src0` / `%src1` |
+
+## Side Effects
+
+None. `pto.pintlv_b32` is a pure predicate transform; it does not read or write UB, GM, or any architectural state beyond producing its two SSA results.
+
+## Constraints
+
+!!! warning "Constraints"
+    - All operands and results MUST use `!pto.mask`. Mixing predicate granularities is illegal; use `pto.pbitcast` first if a producer emits a different granularity.
+    - The two outputs form an ordered pair (`%low`, `%high`) and that pairing MUST be preserved.
+ +## Examples + +```mlir +%lo, %hi = pto.pintlv_b32 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) +- Other variants: [pto.pintlv_b8](./pintlv-b8.md), [pto.pintlv_b16](./pintlv-b16.md) +- Inverse: [pto.pdintlv_b32](./pdintlv-b32.md) +- Mask granularity reinterpret: [pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b32_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b32_zh.md new file mode 100644 index 000000000..3ea4f674e --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b32_zh.md @@ -0,0 +1,58 @@ +# pto.pintlv_b32 + +`pto.pintlv_b32` 属于 [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) 指令集。 + +## 摘要 + +按 `b32` 粒度对两个谓词源做按位交错,分别生成低半和高半两路谓词输出。 + +## 机制 + +`pto.pintlv_b32` 是谓词交错家族([`pto.pintlv_b8`](./pintlv-b8_zh.md) / [`pto.pintlv_b16`](./pintlv-b16_zh.md) / `pto.pintlv_b32`)的 32 位元素粒度变体。它接收两个 `!pto.mask` 源,按相同的 `b32` 粒度产生两路交错后的谓词。底层硬件的谓词寄存器位模式保持不变;改变的只是把这些位按 32 位元素分组的方式。 + +## 语法 + +### AS Level 1(SSA) + +```mlir +%low, %high = pto.pintlv_b32 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src0` | `!pto.mask` | 第一路谓词源 | +| `%src1` | `!pto.mask` | 第二路谓词源 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%low` | `!pto.mask` | 从 `%src0` / `%src1` 交错产生的低半 | +| `%high` | `!pto.mask` | 从 `%src0` / `%src1` 交错产生的高半 | + +## 副作用 + +无。`pto.pintlv_b32` 是纯粹的谓词变换:不会读写 UB、GM,也不会改变除两路 SSA 结果以外的任何架构状态。 + +## 约束 + +!!! 
warning "约束" + - 所有操作数和结果都必须使用 `!pto.mask`。混合谓词粒度是非法的;如果生产者产生的是另一种粒度,先用 `pto.pbitcast` 重新解释。 + - 两路输出形成一个有序对 (`%low`, `%high`),这种配对关系必须保持。 + +## 示例 + +```mlir +%lo, %hi = pto.pintlv_b32 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 相关页面 + +- 指令集总览:[谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 其它变体:[pto.pintlv_b8](./pintlv-b8_zh.md)、[pto.pintlv_b16](./pintlv-b16_zh.md) +- 反向操作:[pto.pdintlv_b32](./pdintlv-b32_zh.md) +- 谓词粒度重解释:[pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast_zh.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b8.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b8.md new file mode 100644 index 000000000..e143306db --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b8.md @@ -0,0 +1,58 @@ +# pto.pintlv_b8 + +`pto.pintlv_b8` is part of the [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) instruction set. + +## Summary + +Interleave two `b8`-granularity predicate sources and materialize the lower and higher result halves as two predicate outputs. + +## Mechanism + +`pto.pintlv_b8` is the 8-bit-element-granularity variant of the predicate-interleave family (`pto.pintlv_b8` / [`pto.pintlv_b16`](./pintlv-b16.md) / [`pto.pintlv_b32`](./pintlv-b32.md)). It takes two `!pto.mask` sources and emits the two interleaved halves under the same `b8` granularity. The hardware view of the predicate-register image is preserved bit-for-bit; only how the bits are grouped into 8-bit element slots changes. 
+
+## Syntax
+
+### AS Level 1 (SSA)
+
+```mlir
+%low, %high = pto.pintlv_b8 %src0, %src1
+    : !pto.mask, !pto.mask -> !pto.mask, !pto.mask
+```
+
+## Inputs
+
+| Operand | Type | Description |
+|---------|------|-------------|
+| `%src0` | `!pto.mask` | First predicate source |
+| `%src1` | `!pto.mask` | Second predicate source |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `%low` | `!pto.mask` | Lower interleaved half produced from `%src0` / `%src1` |
+| `%high` | `!pto.mask` | Upper interleaved half produced from `%src0` / `%src1` |
+
+## Side Effects
+
+None. `pto.pintlv_b8` is a pure predicate transform; it does not read or write UB, GM, or any architectural state beyond producing its two SSA results.
+
+## Constraints
+
+!!! warning "Constraints"
+    - All operands and results MUST use `!pto.mask`. Mixing predicate granularities is illegal; use `pto.pbitcast` first if a producer emits a different granularity.
+    - The two outputs form an ordered pair (`%low`, `%high`) and that pairing MUST be preserved.
+ +## Examples + +```mlir +%lo, %hi = pto.pintlv_b8 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [Predicate Generation And Algebra](../../predicate-generation-and-algebra.md) +- Other variants: [pto.pintlv_b16](./pintlv-b16.md), [pto.pintlv_b32](./pintlv-b32.md) +- Inverse: [pto.pdintlv_b8](./pdintlv-b8.md) +- Mask granularity reinterpret: [pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b8_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b8_zh.md new file mode 100644 index 000000000..bb70a0761 --- /dev/null +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pintlv-b8_zh.md @@ -0,0 +1,58 @@ +# pto.pintlv_b8 + +`pto.pintlv_b8` 属于 [谓词生成与代数](../../predicate-generation-and-algebra_zh.md) 指令集。 + +## 摘要 + +按 `b8` 粒度对两个谓词源做按位交错,分别生成低半和高半两路谓词输出。 + +## 机制 + +`pto.pintlv_b8` 是谓词交错家族(`pto.pintlv_b8` / [`pto.pintlv_b16`](./pintlv-b16_zh.md) / [`pto.pintlv_b32`](./pintlv-b32_zh.md))的 8 位元素粒度变体。它接收两个 `!pto.mask` 源,按相同的 `b8` 粒度产生两路交错后的谓词。底层硬件的谓词寄存器位模式保持不变;改变的只是把这些位按 8 位元素分组的方式。 + +## 语法 + +### AS Level 1(SSA) + +```mlir +%low, %high = pto.pintlv_b8 %src0, %src1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src0` | `!pto.mask` | 第一路谓词源 | +| `%src1` | `!pto.mask` | 第二路谓词源 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%low` | `!pto.mask` | 从 `%src0` / `%src1` 交错产生的低半 | +| `%high` | `!pto.mask` | 从 `%src0` / `%src1` 交错产生的高半 | + +## 副作用 + +无。`pto.pintlv_b8` 是纯粹的谓词变换:不会读写 UB、GM,也不会改变除两路 SSA 结果以外的任何架构状态。 + +## 约束 + +!!! 
warning "约束" + - 所有操作数和结果都必须使用 `!pto.mask`。混合谓词粒度是非法的;如果生产者产生的是另一种粒度,先用 `pto.pbitcast` 重新解释。 + - 两路输出形成一个有序对 (`%low`, `%high`),这种配对关系必须保持。 + +## 示例 + +```mlir +%lo, %hi = pto.pintlv_b8 %m0, %m1 + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +``` + +## 相关页面 + +- 指令集总览:[谓词生成与代数](../../predicate-generation-and-algebra_zh.md) +- 其它变体:[pto.pintlv_b16](./pintlv-b16_zh.md)、[pto.pintlv_b32](./pintlv-b32_zh.md) +- 反向操作:[pto.pdintlv_b8](./pdintlv-b8_zh.md) +- 谓词粒度重解释:[pto.pbitcast](../../../vector/ops/conversion-ops/pbitcast_zh.md) diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md index 4bc206335..00d9d595f 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16.md @@ -17,19 +17,19 @@ In practice, this is the public CCE helper used for remainder-mask generation: t ### PTO Assembly Form ```mlir -%mask, %scalar_out = pto.plt_b16 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b16 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ### AS Level 1 (SSA) ```mlir -%mask, %scalar_out = pto.plt_b16 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b16 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ### AS Level 2 (DPS) ```mlir -pto.plt_b16 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) +pto.plt_b16 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) ``` ## C++ Intrinsic @@ -50,7 +50,7 @@ vector_bool mask = plt_b16(scalar, __cce_simd::POST_UPDATE); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | 16-bit predicate generated from the current scalar value | +| `%mask` | `!pto.mask` | 16-bit predicate generated from the current scalar value | | `%scalar_out` | `i32` | Scalar value after the intrinsic's post-update step | ## Side Effects @@ -90,7 +90,7 @@ vector_bool mask = 
plt_b16(scalar, __cce_simd::POST_UPDATE); ### SSA form ```mlir -%mask, %scalar_out = pto.plt_b16 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b16 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16_zh.md index 54d77afda..ebbcf26e1 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b16_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -plt_b16 %dst, %scalar_in : !pto.mask, i16 -> !pto.mask, i16 +plt_b16 %dst, %scalar_in : !pto.mask, i16 -> !pto.mask, i16 ``` ### AS Level 1(SSA) ```mlir -%mask, %scalar_out = pto.plt_b16 %scalar_in : i16 -> !pto.mask, i16 +%mask, %scalar_out = pto.plt_b16 %scalar_in : i16 -> !pto.mask, i16 ``` ### AS Level 2(DPS) ```mlir -pto.plt_b16 ins(%scalar_in : i16) outs(%mask, %scalar_out : !pto.mask, i16) +pto.plt_b16 ins(%scalar_in : i16) outs(%mask, %scalar_out : !pto.mask, i16) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md index f9185e976..4ee8c2da0 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32.md @@ -17,19 +17,19 @@ In practice, this is the public CCE helper used for remainder-mask generation: t ### PTO Assembly Form ```mlir -%mask, %scalar_out = pto.plt_b32 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b32 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ### AS Level 1 (SSA) ```mlir -%mask, %scalar_out = pto.plt_b32 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b32 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ### AS Level 2 (DPS) ```mlir -pto.plt_b32 
ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) +pto.plt_b32 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) ``` ## C++ Intrinsic @@ -50,7 +50,7 @@ vector_bool mask = plt_b32(scalar, __cce_simd::POST_UPDATE); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | 32-bit predicate generated from the current scalar value | +| `%mask` | `!pto.mask` | 32-bit predicate generated from the current scalar value | | `%scalar_out` | `i32` | Scalar value after the intrinsic's post-update step | ## Side Effects @@ -90,7 +90,7 @@ vector_bool mask = plt_b32(scalar, __cce_simd::POST_UPDATE); ### SSA form ```mlir -%mask, %scalar_out = pto.plt_b32 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b32 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32_zh.md index ab701fae2..8d9c4f8b0 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b32_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -plt_b32 %dst, %scalar_in : !pto.mask, i32 -> !pto.mask, i32 +plt_b32 %dst, %scalar_in : !pto.mask, i32 -> !pto.mask, i32 ``` ### AS Level 1(SSA) ```mlir -%mask, %scalar_out = pto.plt_b32 %scalar_in : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b32 %scalar_in : i32 -> !pto.mask, i32 ``` ### AS Level 2(DPS) ```mlir -pto.plt_b32 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) +pto.plt_b32 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md index 6d5bfd1b4..87b230569 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md +++ 
b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8.md @@ -17,19 +17,19 @@ In practice, this is the public CCE helper used for remainder-mask generation: t ### PTO Assembly Form ```mlir -%mask, %scalar_out = pto.plt_b8 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b8 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ### AS Level 1 (SSA) ```mlir -%mask, %scalar_out = pto.plt_b8 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b8 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ### AS Level 2 (DPS) ```mlir -pto.plt_b8 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) +pto.plt_b8 ins(%scalar_in : i32) outs(%mask, %scalar_out : !pto.mask, i32) ``` ## C++ Intrinsic @@ -50,7 +50,7 @@ vector_bool mask = plt_b8(scalar, __cce_simd::POST_UPDATE); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | 8-bit predicate generated from the current scalar value | +| `%mask` | `!pto.mask` | 8-bit predicate generated from the current scalar value | | `%scalar_out` | `i32` | Scalar value after the intrinsic's post-update step | ## Side Effects @@ -90,7 +90,7 @@ vector_bool mask = plt_b8(scalar, __cce_simd::POST_UPDATE); ### SSA form ```mlir -%mask, %scalar_out = pto.plt_b8 %scalar_in {post_update} : i32 -> !pto.mask, i32 +%mask, %scalar_out = pto.plt_b8 %scalar_in {post_update} : i32 -> !pto.mask, i32 ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8_zh.md index c46048140..19d563f35 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/plt-b8_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -plt_b8 %dst, %scalar_in : !pto.mask, i8 -> !pto.mask, i8 +plt_b8 %dst, %scalar_in : !pto.mask, i8 -> !pto.mask, i8 ``` ### AS Level 1(SSA) ```mlir -%mask, 
%scalar_out = pto.plt_b8 %scalar_in : i8 -> !pto.mask, i8 +%mask, %scalar_out = pto.plt_b8 %scalar_in : i8 -> !pto.mask, i8 ``` ### AS Level 2(DPS) ```mlir -pto.plt_b8 ins(%scalar_in : i8) outs(%mask, %scalar_out : !pto.mask, i8) +pto.plt_b8 ins(%scalar_in : i8) outs(%mask, %scalar_out : !pto.mask, i8) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md index 0fd048ea4..34de7a802 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot.md @@ -17,19 +17,19 @@ $$ \mathrm{dst}_i = \neg \mathrm{src}_i $$ ### PTO Assembly Form ```mlir -%dst = pto.pnot %src, %mask : !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pnot %src, %mask : !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst = pto.pnot %src, %mask : !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pnot %src, %mask : !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pnot ins(%src, %mask : !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.pnot ins(%src, %mask : !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## C++ Intrinsic @@ -45,14 +45,14 @@ pnot(dst, src, mask); | Operand | Type | Description | |---------|------|-------------| -| `%src` | `!pto.mask` | Source predicate to invert | -| `%mask` | `!pto.mask` | Optional masking predicate | +| `%src` | `!pto.mask` | Source predicate to invert | +| `%mask` | `!pto.mask` | Optional masking predicate | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst` | `!pto.mask` | Bitwise NOT of src | +| `%dst` | `!pto.mask` | Bitwise NOT of src | ## Side Effects @@ -94,16 +94,16 @@ void invert_mask(RegBuf& dst, ```mlir // %cmp: lanes where a[i] < b[i] -%cmp = pto.vcmp %va, %vb, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%cmp = pto.vcmp %va, %vb, %seed, "lt" : !pto.vreg<64xf32>, 
!pto.vreg<64xf32>, !pto.mask -> !pto.mask // %tail: lanes in remainder region -%tail = pto.pge_b32 %rem : i32 -> !pto.mask +%tail = pto.pge_b32 %rem : i32 -> !pto.mask // Complement: lanes NOT in remainder region -%not_tail = pto.pnot %tail, %tail : !pto.mask, !pto.mask -> !pto.mask +%not_tail = pto.pnot %tail, %tail : !pto.mask, !pto.mask -> !pto.mask // Combine: lanes in remainder region AND NOT in comparison result -%active = pto.pand %tail, %not_tail, %tail : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%active = pto.pand %tail, %not_tail, %tail : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot_zh.md index 2704f1434..3789d8031 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pnot_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pnot %dst, %src : !pto.mask, !pto.mask +pnot %dst, %src : !pto.mask, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%dst = pto.pnot %src, %mask : !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pnot %src, %mask : !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pnot ins(%src, %mask : !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.pnot ins(%src, %mask : !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/por.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/por.md index 5918eea43..a5c10f9a6 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/por.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/por.md @@ -17,19 +17,19 @@ $$ \mathrm{dst}_i = \mathrm{src0}_i \lor \mathrm{src1}_i $$ ### PTO Assembly Form ```mlir -%dst = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask 
-> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.por ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.por ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## C++ Intrinsic @@ -46,15 +46,15 @@ por(dst, src0, src1, mask); | Operand | Type | Description | |---------|------|-------------| -| `%src0` | `!pto.mask` | First source predicate | -| `%src1` | `!pto.mask` | Second source predicate | -| `%mask` | `!pto.mask` | Optional masking predicate | +| `%src0` | `!pto.mask` | First source predicate | +| `%src1` | `!pto.mask` | Second source predicate | +| `%mask` | `!pto.mask` | Optional masking predicate | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst` | `!pto.mask` | Bitwise OR of src0 and src1 | +| `%dst` | `!pto.mask` | Bitwise OR of src0 and src1 | ## Side Effects @@ -99,11 +99,11 @@ void union_masks(RegBuf& dst, // %mask_b: lanes where b[i] > threshold_b // Union: lanes satisfying either condition -%combined = pto.por %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%combined = pto.por %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask // Reconstruct full-width predicate from two halves -%lo_combined = pto.por %mask_a_lo, %mask_b_lo, %mask_a_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask -%hi_combined = pto.por %mask_a_hi, %mask_b_hi, %mask_a_hi : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%lo_combined = pto.por %mask_a_lo, %mask_b_lo, %mask_a_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%hi_combined = pto.por %mask_a_hi, %mask_b_hi, %mask_a_hi : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git 
a/docs/isa/scalar/ops/predicate-generation-and-algebra/por_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/por_zh.md index 4fa3fcb3f..e357935d0 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/por_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/por_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -por %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask +por %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%dst = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.por %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.por ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.por ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md index 7018d39ac..bbfde62f2 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack.md @@ -19,19 +19,19 @@ $$ \mathrm{dst}_{2N} = \begin{cases} \mathrm{ZERO}(N) \Vert \mathrm{src}_N & \te ### PTO Assembly Form ```mlir -%dst = pto.ppack %src, "PART" : !pto.mask -> !pto.mask +%dst = pto.ppack %src, "PART" : !pto.mask -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst = pto.ppack %src, "PART" : !pto.mask -> !pto.mask +%dst = pto.ppack %src, "PART" : !pto.mask -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.ppack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) +pto.ppack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) ``` ## C++ Intrinsic @@ -46,14 +46,14 @@ ppack(dst, src, __cce_simd::LOWER); | Operand | Type | Description | |---------|------|-------------| -| `%src` | `!pto.mask` | Source N-bit predicate | +| `%src` | `!pto.mask` | Source N-bit predicate | | `"PART"` | 
string attribute | Partition token: `"LOWER"` or `"HIGHER"` | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst` | `!pto.mask` | 2N-bit predicate with the source in the selected half | +| `%dst` | `!pto.mask` | 2N-bit predicate with the source in the selected half | ## Side Effects @@ -106,26 +106,26 @@ void pack_for_f32(RegBuf& dst, // %hi: lanes 0-14 active (from plt_b32 iteration 2, rem = 15) // Pack %lo into lower half of 64-bit predicate -%full_lo = pto.ppack %lo, "LOWER" : !pto.mask -> !pto.mask +%full_lo = pto.ppack %lo, "LOWER" : !pto.mask -> !pto.mask // Pack %hi into upper half of 64-bit predicate -%full_hi = pto.ppack %hi, "HIGHER" : !pto.mask -> !pto.mask +%full_hi = pto.ppack %hi, "HIGHER" : !pto.mask -> !pto.mask // OR them together to get full 64-lane tail mask -%tail = pto.por %full_lo, %full_hi, %full_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%tail = pto.por %full_lo, %full_hi, %full_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### Construct a full-width mask from two half-width masks ```mlir // Pack lower half -%dst_lower = pto.ppack %src_lower, "LOWER" : !pto.mask -> !pto.mask +%dst_lower = pto.ppack %src_lower, "LOWER" : !pto.mask -> !pto.mask // Pack upper half -%dst_upper = pto.ppack %src_upper, "HIGHER" : !pto.mask -> !pto.mask +%dst_upper = pto.ppack %src_upper, "HIGHER" : !pto.mask -> !pto.mask // Combine with OR to get full-width predicate -%combined = pto.por %dst_lower, %dst_upper, %dst_lower : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%combined = pto.por %dst_lower, %dst_upper, %dst_lower : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack_zh.md index 8954d20d9..86a925035 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack_zh.md +++ 
b/docs/isa/scalar/ops/predicate-generation-and-algebra/ppack_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -ppack %dst, %src, "PART" : !pto.mask, !pto.mask +ppack %dst, %src, "PART" : !pto.mask, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%dst = pto.ppack %src, "PART" : !pto.mask -> !pto.mask +%dst = pto.ppack %src, "PART" : !pto.mask -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.ppack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) +pto.ppack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md index cc3ad7d28..c9cf1e200 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/psel.md @@ -19,19 +19,19 @@ This is a predicate-level ternary select, analogous to vector `vsel` but operati ### PTO Assembly Form ```mlir -%dst = pto.psel %src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.psel %src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst = pto.psel %src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.psel %src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.psel ins(%src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.psel ins(%src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## C++ Intrinsic @@ -48,16 +48,16 @@ psel(dst, src0, src1, mask); | Operand | Type | Description | |---------|------|-------------| -| `%src0` | `!pto.mask` | Predicate selected when corresponding sel bit is 1 | -| `%src1` | `!pto.mask` | Predicate selected when corresponding sel bit is 0 | -| `%sel` | `!pto.mask` | Per-lane selection predicate | -| 
`%mask` | `!pto.mask` | Optional masking predicate | +| `%src0` | `!pto.mask` | Predicate selected when corresponding sel bit is 1 | +| `%src1` | `!pto.mask` | Predicate selected when corresponding sel bit is 0 | +| `%sel` | `!pto.mask` | Per-lane selection predicate | +| `%mask` | `!pto.mask` | Optional masking predicate | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst` | `!pto.mask` | Per-lane selection between src0 and src1 | +| `%dst` | `!pto.mask` | Per-lane selection between src0 and src1 | ## Side Effects @@ -106,8 +106,8 @@ void select_predicate(RegBuf& dst, // If condition is true, use set A; otherwise use set B %active = pto.psel %active_a, %active_b, %condition, %condition - : !pto.mask, !pto.mask, !pto.mask, !pto.mask - -> !pto.mask + : !pto.mask, !pto.mask, !pto.mask, !pto.mask + -> !pto.mask ``` ### Equivalent to boolean expression @@ -118,10 +118,10 @@ The `psel` operation is equivalent to the following boolean expression: // psel %dst, %src0, %src1, %sel // = (src0 AND sel) OR (src1 AND NOT sel) -%sel_inv = pto.pnot %sel, %sel : !pto.mask, !pto.mask -> !pto.mask -%and0 = pto.pand %src0, %sel, %sel : !pto.mask, !pto.mask, !pto.mask -> !pto.mask -%and1 = pto.pand %src1, %sel_inv, %sel : !pto.mask, !pto.mask, !pto.mask -> !pto.mask -%dst = pto.por %and0, %and1, %and0 : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%sel_inv = pto.pnot %sel, %sel : !pto.mask, !pto.mask -> !pto.mask +%and0 = pto.pand %src0, %sel, %sel : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%and1 = pto.pand %src1, %sel_inv, %sel : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.por %and0, %and1, %and0 : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/psel_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/psel_zh.md index d9d9e0b4f..098fb8983 100644 --- 
a/docs/isa/scalar/ops/predicate-generation-and-algebra/psel_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/psel_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -psel %dst, %src0, %src1, %sel : !pto.mask, !pto.mask, !pto.mask, !pto.mask +psel %dst, %src0, %src1, %sel : !pto.mask, !pto.mask, !pto.mask, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%dst = pto.psel %src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.psel %src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.psel ins(%src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.psel ins(%src0, %src1, %sel, %mask : !pto.mask, !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md index ed93bfb8f..dc5d6aac4 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16.md @@ -21,19 +21,19 @@ The pattern token fully determines which bits are set. 
### PTO Assembly Form ```mlir -%mask = pto.pset_b16 "PATTERN" : !pto.mask +%mask = pto.pset_b16 "PATTERN" : !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.pset_b16 "PATTERN" : !pto.mask +%mask = pto.pset_b16 "PATTERN" : !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pset_b16 "PATTERN" outs(%mask : !pto.mask) +pto.pset_b16 "PATTERN" outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -64,7 +64,7 @@ vector_bool mask = pset_b16(__cce_simd::PAT_VL8); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | Constructed 16-bit predicate | +| `%mask` | `!pto.mask` | Constructed 16-bit predicate | ## Side Effects @@ -107,14 +107,14 @@ void set_all_active(RegBuf& dst) { ```mlir // Modular 3 pattern: lanes 3, 7, 11, 15 active -%mod3 = pto.pset_b16 "PAT_M3" : !pto.mask +%mod3 = pto.pset_b16 "PAT_M3" : !pto.mask ``` ### Construct first-half-active mask ```mlir // High half: bits 8–15 active, bits 0–7 inactive -%high = pto.pset_b16 "PAT_H" : !pto.mask +%high = pto.pset_b16 "PAT_H" : !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16_zh.md index cfc0da6a8..863f98f21 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b16_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pset_b16 %dst, "PATTERN" : !pto.mask +pset_b16 %dst, "PATTERN" : !pto.mask ``` ### AS Level 1(SSA) ```mlir -%mask = pto.pset_b16 "PATTERN" : !pto.mask +%mask = pto.pset_b16 "PATTERN" : !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pset_b16 "PATTERN" outs(%mask : !pto.mask) +pto.pset_b16 "PATTERN" outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md index 0fe163c37..9fc96a033 100644 --- 
a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32.md @@ -21,19 +21,19 @@ The `_b32` variant is the widest directly-constructable predicate segment. For w ### PTO Assembly Form ```mlir -%mask = pto.pset_b32 "PATTERN" : !pto.mask +%mask = pto.pset_b32 "PATTERN" : !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.pset_b32 "PATTERN" : !pto.mask +%mask = pto.pset_b32 "PATTERN" : !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pset_b32 "PATTERN" outs(%mask : !pto.mask) +pto.pset_b32 "PATTERN" outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -64,7 +64,7 @@ vector_bool mask = pset_b32(__cce_simd::PAT_VL16); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | Constructed 32-bit predicate | +| `%mask` | `!pto.mask` | Constructed 32-bit predicate | ## Side Effects @@ -107,17 +107,17 @@ void set_all_active(RegBuf& dst) { ```mlir // All lanes active for f32 (64-bit predicate = pack two b32) -%all32 = pto.pset_b32 "PAT_ALL" : !pto.mask -%all64_lo = pto.pset_b32 "PAT_ALL" : !pto.mask -%all64_hi = pto.pset_b32 "PAT_ALL" : !pto.mask -%all64 = pto.ppack %all64_lo, "LOWER" : !pto.mask -> !pto.mask +%all32 = pto.pset_b32 "PAT_ALL" : !pto.mask +%all64_lo = pto.pset_b32 "PAT_ALL" : !pto.mask +%all64_hi = pto.pset_b32 "PAT_ALL" : !pto.mask +%all64 = pto.ppack %all64_lo, "LOWER" : !pto.mask -> !pto.mask ``` ### Construct remainder mask ```mlir // First 12 lanes active (remainder loop) -%remainder = pto.pset_b32 "PAT_VL12" : !pto.mask +%remainder = pto.pset_b32 "PAT_VL12" : !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32_zh.md index 47d0fec54..a64d3324a 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b32_zh.md @@ -7,19 
+7,19 @@ ### PTO 汇编形式 ```text -pset_b32 %dst, "PATTERN" : !pto.mask +pset_b32 %dst, "PATTERN" : !pto.mask ``` ### AS Level 1(SSA) ```mlir -%mask = pto.pset_b32 "PATTERN" : !pto.mask +%mask = pto.pset_b32 "PATTERN" : !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pset_b32 "PATTERN" outs(%mask : !pto.mask) +pto.pset_b32 "PATTERN" outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md index 5ddd860c7..13d8415ef 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8.md @@ -21,19 +21,19 @@ The pattern token fully determines which bits are set. The operation is purely c ### PTO Assembly Form ```mlir -%mask = pto.pset_b8 "PATTERN" : !pto.mask +%mask = pto.pset_b8 "PATTERN" : !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.pset_b8 "PATTERN" : !pto.mask +%mask = pto.pset_b8 "PATTERN" : !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pset_b8 "PATTERN" outs(%mask : !pto.mask) +pto.pset_b8 "PATTERN" outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -63,7 +63,7 @@ vector_bool mask = pset_b8(__cce_simd::PAT_VL4); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | Constructed 8-bit predicate | +| `%mask` | `!pto.mask` | Constructed 8-bit predicate | ## Side Effects @@ -106,13 +106,13 @@ void set_all_active(RegBuf& dst) { ### Construct all-inactive mask ```mlir -%none = pto.pset_b8 "PAT_ALLF" : !pto.mask +%none = pto.pset_b8 "PAT_ALLF" : !pto.mask ``` ### Construct first-3-lanes-active mask ```mlir -%first3 = pto.pset_b8 "PAT_VL3" : !pto.mask +%first3 = pto.pset_b8 "PAT_VL3" : !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8_zh.md index 778c7121b..edfa80f1e 100644 --- 
a/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pset-b8_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pset_b8 %dst, "PATTERN" : !pto.mask +pset_b8 %dst, "PATTERN" : !pto.mask ``` ### AS Level 1(SSA) ```mlir -%mask = pto.pset_b8 "PATTERN" : !pto.mask +%mask = pto.pset_b8 "PATTERN" : !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pset_b8 "PATTERN" outs(%mask : !pto.mask) +pto.pset_b8 "PATTERN" outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md index b917cd2e2..354765c21 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack.md @@ -19,19 +19,19 @@ $$ \mathrm{dst}_N = \begin{cases} \mathrm{LOWER}(\mathrm{src}_{2N}) & \text{if } ### PTO Assembly Form ```mlir -%dst = pto.punpack %src, "PART" : !pto.mask -> !pto.mask +%dst = pto.punpack %src, "PART" : !pto.mask -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst = pto.punpack %src, "PART" : !pto.mask -> !pto.mask +%dst = pto.punpack %src, "PART" : !pto.mask -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.punpack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) +pto.punpack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) ``` ## C++ Intrinsic @@ -46,14 +46,14 @@ punpack(dst, src, __cce_simd::LOWER); | Operand | Type | Description | |---------|------|-------------| -| `%src` | `!pto.mask` | Source 2N-bit predicate | +| `%src` | `!pto.mask` | Source 2N-bit predicate | | `"PART"` | string attribute | Partition token: `"LOWER"` or `"HIGHER"` | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst` | `!pto.mask` | N-bit predicate extracted from the selected half | +| `%dst` | `!pto.mask` | N-bit predicate extracted from the selected half | ## Side Effects @@ -101,18 +101,18 @@ void 
extract_upper(RegBuf& dst, // %full_64: 64-bit predicate from a comparison // Extract lower half -%lo = pto.punpack %full_64, "LOWER" : !pto.mask -> !pto.mask +%lo = pto.punpack %full_64, "LOWER" : !pto.mask -> !pto.mask // Extract upper half -%hi = pto.punpack %full_64, "HIGHER" : !pto.mask -> !pto.mask +%hi = pto.punpack %full_64, "HIGHER" : !pto.mask -> !pto.mask // Modify lower half (e.g., invert) -%lo_inv = pto.pnot %lo, %lo : !pto.mask, !pto.mask -> !pto.mask +%lo_inv = pto.pnot %lo, %lo : !pto.mask, !pto.mask -> !pto.mask // Re-pack into 64-bit predicate -%new_lo = pto.ppack %lo_inv, "LOWER" : !pto.mask -> !pto.mask -%new_hi = pto.ppack %hi, "HIGHER" : !pto.mask -> !pto.mask -%new_full = pto.por %new_lo, %new_hi, %new_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%new_lo = pto.ppack %lo_inv, "LOWER" : !pto.mask -> !pto.mask +%new_hi = pto.ppack %hi, "HIGHER" : !pto.mask -> !pto.mask +%new_full = pto.por %new_lo, %new_hi, %new_lo : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack_zh.md index ae9d4687e..167df65c7 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/punpack_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -punpack %dst, %src, "PART" : !pto.mask, !pto.mask +punpack %dst, %src, "PART" : !pto.mask, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%dst = pto.punpack %src, "PART" : !pto.mask -> !pto.mask +%dst = pto.punpack %src, "PART" : !pto.mask -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.punpack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) +pto.punpack ins(%src, "PART" : !pto.mask) outs(%dst : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md index 
f160bb79a..5629881ca 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor.md @@ -19,19 +19,19 @@ XOR is commonly used to invert one predicate within a mask context: `pxor %p, %i ### PTO Assembly Form ```mlir -%dst = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%dst = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pxor ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.pxor ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## C++ Intrinsic @@ -48,15 +48,15 @@ pxor(dst, src0, src1, mask); | Operand | Type | Description | |---------|------|-------------| -| `%src0` | `!pto.mask` | First source predicate | -| `%src1` | `!pto.mask` | Second source predicate | -| `%mask` | `!pto.mask` | Optional masking predicate | +| `%src0` | `!pto.mask` | First source predicate | +| `%src1` | `!pto.mask` | Second source predicate | +| `%mask` | `!pto.mask` | Optional masking predicate | ## Expected Outputs | Result | Type | Description | |--------|------|-------------| -| `%dst` | `!pto.mask` | Bitwise XOR of src0 and src1 | +| `%dst` | `!pto.mask` | Bitwise XOR of src0 and src1 | ## Side Effects @@ -102,11 +102,11 @@ void invert_with_mask(RegBuf& dst, // %mask_b: lanes active in set B // Symmetric difference: lanes active in exactly one set -%diff = pto.pxor %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%diff = pto.pxor %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask // Intersection: lanes active in both sets (via De Morgan: A AND B = NOT(A XOR B)) -%inv = pto.pnot %diff, %diff : 
!pto.mask, !pto.mask -> !pto.mask -%intersection = pto.pand %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%inv = pto.pnot %diff, %diff : !pto.mask, !pto.mask -> !pto.mask +%intersection = pto.pand %mask_a, %mask_b, %mask_a : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor_zh.md b/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor_zh.md index fa1acbc9e..60baa02d4 100644 --- a/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor_zh.md +++ b/docs/isa/scalar/ops/predicate-generation-and-algebra/pxor_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pxor %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask +pxor %dst, %src0, %src1 : !pto.mask, !pto.mask, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%dst = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask +%dst = pto.pxor %src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pxor ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) +pto.pxor ins(%src0, %src1, %mask : !pto.mask, !pto.mask, !pto.mask) outs(%dst : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-load-store/pld.md b/docs/isa/scalar/ops/predicate-load-store/pld.md index 5542d711a..566eb0df3 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pld.md +++ b/docs/isa/scalar/ops/predicate-load-store/pld.md @@ -8,7 +8,7 @@ Load the full predicate register from a UB location with a register-relative add ## Mechanism -`pto.pld` reads a predicate word from a UB address computed as `base + areg * sizeof(predicate)`, then materializes it as `!pto.mask`. The offset is sourced from a scalar register, making the effective address data-dependent. +`pto.pld` reads a predicate word from a UB address computed as `base + areg * sizeof(predicate)`, then materializes it as `!pto.mask`. 
The offset is sourced from a scalar register, making the effective address data-dependent. For predicate width `Pw`, UB base `base`, and offset register `areg`: @@ -22,19 +22,19 @@ The offset register value is interpreted as a byte displacement in units of 8 by ### PTO Assembly Form ```mlir -%mask = pto.pld %ub_ptr, %areg, "DIST" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pld %ub_ptr, %areg, "DIST" : !pto.ptr, i32 -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.pld %ub_ptr, %areg, "DIST" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pld %ub_ptr, %areg, "DIST" : !pto.ptr, i32 -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pld ins(%ub_ptr, %areg, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) +pto.pld ins(%ub_ptr, %areg, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -58,7 +58,7 @@ pld(dst, base, offset, __cce_simd::NORM); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | Loaded predicate register | +| `%mask` | `!pto.mask` | Loaded predicate register | ## Side Effects @@ -109,10 +109,10 @@ void load_with_offset(RegBuf& dst, ```mlir // UB base at %ub_base; %c1 holds slot index (in 8-byte units) -%mask = pto.pld %ub_base, %c1, "NORM" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pld %ub_base, %c1, "NORM" : !pto.ptr, i32 -> !pto.mask // Use predicate in predicated vector operation -%result = pto.vsel %v_a, %v_b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsel %v_a, %v_b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-load-store/pld_zh.md b/docs/isa/scalar/ops/predicate-load-store/pld_zh.md index d3d7d9e7b..feeba00f9 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pld_zh.md +++ b/docs/isa/scalar/ops/predicate-load-store/pld_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pld %mask, %ub_ptr[%areg], "DIST" : !pto.mask, !pto.ptr, i32 
+pld %mask, %ub_ptr[%areg], "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 1(SSA) ```mlir -%mask = pto.pld %ub_ptr, %areg, "DIST" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pld %ub_ptr, %areg, "DIST" : !pto.ptr, i32 -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pld ins(%ub_ptr, %areg, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) +pto.pld ins(%ub_ptr, %areg, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-load-store/pldi.md b/docs/isa/scalar/ops/predicate-load-store/pldi.md index fae919294..a79aad1c6 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pldi.md +++ b/docs/isa/scalar/ops/predicate-load-store/pldi.md @@ -8,7 +8,7 @@ Load the full predicate register from a UB location with an immediate (compile-t ## Mechanism -`pto.pldi` reads a predicate word from a UB address computed as `base + imm * 8`, then materializes it as `!pto.mask`. The offset is a compile-time immediate, enabling address resolution at assembly time. +`pto.pldi` reads a predicate word from a UB address computed as `base + imm * 8`, then materializes it as `!pto.mask`. The offset is a compile-time immediate, enabling address resolution at assembly time. 
For predicate width `Pw`, UB base `base`, and immediate offset `imm`: @@ -22,19 +22,19 @@ The immediate offset is encoded directly in the instruction word, in units of 8 ### PTO Assembly Form ```mlir -%mask = pto.pldi %ub_ptr, %imm, "DIST" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pldi %ub_ptr, %imm, "DIST" : !pto.ptr, i32 -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.pldi %ub_ptr, %imm, "DIST" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pldi %ub_ptr, %imm, "DIST" : !pto.ptr, i32 -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.pldi ins(%ub_ptr, %imm, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) +pto.pldi ins(%ub_ptr, %imm, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -59,7 +59,7 @@ pldi(dst, base, offset, __cce_simd::NORM, __cce_simd::POST_UPDATE); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | Loaded predicate register | +| `%mask` | `!pto.mask` | Loaded predicate register | ## Side Effects @@ -111,10 +111,10 @@ void load_immediate(RegBuf& dst, ```mlir // Load predicate from slot 2 (2 * 8 = 16 bytes offset) -%mask = pto.pldi %ub_base, 2, "NORM" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pldi %ub_base, 2, "NORM" : !pto.ptr, i32 -> !pto.mask // Use in predicated vector select -%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-load-store/pldi_zh.md b/docs/isa/scalar/ops/predicate-load-store/pldi_zh.md index 1e33b5fff..adde5a9c9 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pldi_zh.md +++ b/docs/isa/scalar/ops/predicate-load-store/pldi_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pldi %mask, %ub_ptr[%imm], "DIST" : !pto.mask, !pto.ptr, i32 +pldi %mask, %ub_ptr[%imm], "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS 
Level 1(SSA) ```mlir -%mask = pto.pldi %ub_ptr, %imm, "DIST" : !pto.ptr, i32 -> !pto.mask +%mask = pto.pldi %ub_ptr, %imm, "DIST" : !pto.ptr, i32 -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.pldi ins(%ub_ptr, %imm, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) +pto.pldi ins(%ub_ptr, %imm, "DIST" : !pto.ptr, i32) outs(%mask : !pto.mask) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-load-store/plds.md b/docs/isa/scalar/ops/predicate-load-store/plds.md index e9a1a358d..c025f2cc2 100644 --- a/docs/isa/scalar/ops/predicate-load-store/plds.md +++ b/docs/isa/scalar/ops/predicate-load-store/plds.md @@ -8,7 +8,7 @@ Load the full predicate register from a contiguous UB location. ## Mechanism -`pto.plds` reads a predicate word from a UB address and materializes it as `!pto.mask`. The operation covers the full predicate width for the active element type (64 bits for f32, 128 bits for f16/bf16, 256 bits for i8/u8). +`pto.plds` reads a predicate word from a UB address and materializes it as `!pto.mask`. The operation covers the full predicate width for the active element type (64 bits for f32, 128 bits for f16/bf16, 256 bits for i8/u8). For predicate width `Pw` and UB address `base`: @@ -21,19 +21,19 @@ The predicate register is updated atomically. 
All bits are meaningful only withi ### PTO Assembly Form ```mlir -%mask = pto.plds %ub_ptr : !pto.ptr -> !pto.mask +%mask = pto.plds %ub_ptr : !pto.ptr -> !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%mask = pto.plds %ub_ptr : !pto.ptr -> !pto.mask +%mask = pto.plds %ub_ptr : !pto.ptr -> !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.plds ins(%ub_ptr : !pto.ptr) outs(%mask : !pto.mask) +pto.plds ins(%ub_ptr : !pto.ptr) outs(%mask : !pto.mask) ``` ## C++ Intrinsic @@ -56,7 +56,7 @@ plds(dst, base, offset, __cce_simd::NORM, __cce_simd::POST_UPDATE); | Result | Type | Description | |--------|------|-------------| -| `%mask` | `!pto.mask` | Loaded predicate register | +| `%mask` | `!pto.mask` | Loaded predicate register | ## Side Effects @@ -103,10 +103,10 @@ void load_saved_mask(RegBuf& dst, Ptr src) { ```mlir // Load predicate from UB slot 0 -%mask = pto.plds %ub_mask_slot0 : !pto.ptr -> !pto.mask +%mask = pto.plds %ub_mask_slot0 : !pto.ptr -> !pto.mask // Use predicate in vector select -%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-load-store/plds_zh.md b/docs/isa/scalar/ops/predicate-load-store/plds_zh.md index 780ef9ac1..2cddcb011 100644 --- a/docs/isa/scalar/ops/predicate-load-store/plds_zh.md +++ b/docs/isa/scalar/ops/predicate-load-store/plds_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -plds %mask, %ub_ptr : !pto.mask, !pto.ptr +plds %mask, %ub_ptr : !pto.mask, !pto.ptr ``` ### AS Level 1(SSA) ```mlir -%mask = pto.plds %ub_ptr : !pto.ptr -> !pto.mask +%mask = pto.plds %ub_ptr : !pto.ptr -> !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.plds ins(%ub_ptr : !pto.ptr) outs(%mask : !pto.mask) +pto.plds ins(%ub_ptr : !pto.ptr) outs(%mask : !pto.mask) ``` ## 关键约束 diff --git 
a/docs/isa/scalar/ops/predicate-load-store/pst.md b/docs/isa/scalar/ops/predicate-load-store/pst.md index a0412416a..62d94a44f 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pst.md +++ b/docs/isa/scalar/ops/predicate-load-store/pst.md @@ -8,7 +8,7 @@ Store the full predicate register to a UB location with a register-relative addr ## Mechanism -`pto.pst` writes a predicate word from `!pto.mask` to a UB address computed as `base + areg * 8`. The offset is sourced from a scalar register, enabling data-dependent addressing. +`pto.pst` writes a predicate word from `!pto.mask` to a UB address computed as `base + areg * 8`. The offset is sourced from a scalar register, enabling data-dependent addressing. For predicate `mask`, UB base `base`, and offset register `areg`: @@ -22,19 +22,19 @@ The predicate register is read atomically. Only bits within the current element- ### PTO Assembly Form ```mlir -pto.pst %mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32 +pto.pst %mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 1 (SSA) ```mlir -pto.pst %mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32 +pto.pst %mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 2 (DPS) ```mlir -pto.pst ins(%mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32) +pto.pst ins(%mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32) ``` ## C++ Intrinsic @@ -50,7 +50,7 @@ pst(src, base, offset, __cce_simd::NORM); | Operand | Type | Description | |---------|------|-------------| -| `%mask` | `!pto.mask` | Predicate register to store | +| `%mask` | `!pto.mask` | Predicate register to store | | `%ub_ptr` | `!pto.ptr` | UB base address | | `%areg` | `i32` | Scalar register holding the byte offset in 8-byte units | | `"DIST"` | string attribute | Distribution mode: `"NORM"` or `"PK"` | @@ -109,10 +109,10 @@ void store_with_offset(RegBuf& src, ```mlir // Generate predicate from comparison -%mask = pto.vcmp %v0, %v1, %seed, "lt" : 
!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask // Store predicate to UB at base + slot * 8 -pto.pst %mask, %ub_base, %slot, "NORM" : !pto.mask, !pto.ptr, i32 +pto.pst %mask, %ub_base, %slot, "NORM" : !pto.mask, !pto.ptr, i32 ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-load-store/pst_zh.md b/docs/isa/scalar/ops/predicate-load-store/pst_zh.md index 78e6c6c24..fe2feae99 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pst_zh.md +++ b/docs/isa/scalar/ops/predicate-load-store/pst_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pst %mask, %ub_ptr[%areg], "DIST" : !pto.mask, !pto.ptr, i32 +pst %mask, %ub_ptr[%areg], "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 1(SSA) ```mlir -pto.pst %mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32 +pto.pst %mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 2(DPS) ```mlir -pto.pst ins(%mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32) +pto.pst ins(%mask, %ub_ptr, %areg, "DIST" : !pto.mask, !pto.ptr, i32) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-load-store/psti.md b/docs/isa/scalar/ops/predicate-load-store/psti.md index 87c1aaaf9..5f93e4c79 100644 --- a/docs/isa/scalar/ops/predicate-load-store/psti.md +++ b/docs/isa/scalar/ops/predicate-load-store/psti.md @@ -8,7 +8,7 @@ Store the full predicate register to a UB location with an immediate (compile-ti ## Mechanism -`pto.psti` writes a predicate word from `!pto.mask` to a UB address computed as `base + imm * 8`. The offset is a compile-time immediate, enabling address resolution at assembly time. +`pto.psti` writes a predicate word from `!pto.mask` to a UB address computed as `base + imm * 8`. The offset is a compile-time immediate, enabling address resolution at assembly time. 
For predicate `mask`, UB base `base`, and immediate offset `imm`: @@ -22,19 +22,19 @@ The immediate offset is encoded directly in the instruction word, in units of 8 ### PTO Assembly Form ```mlir -pto.psti %mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32 +pto.psti %mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 1 (SSA) ```mlir -pto.psti %mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32 +pto.psti %mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 2 (DPS) ```mlir -pto.psti ins(%mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32) +pto.psti ins(%mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32) ``` ## C++ Intrinsic @@ -51,7 +51,7 @@ psti(src, base, offset, __cce_simd::NORM, __cce_simd::POST_UPDATE); | Operand | Type | Description | |---------|------|-------------| -| `%mask` | `!pto.mask` | Predicate register to store | +| `%mask` | `!pto.mask` | Predicate register to store | | `%ub_ptr` | `!pto.ptr` | UB base address | | `%imm` | `i32` | Immediate byte offset in 8-byte units (compile-time constant) | | `"DIST"` | string attribute | Distribution mode: `"NORM"` or `"PK"` | @@ -112,10 +112,10 @@ void store_immediate(RegBuf& src, ```mlir // Generate predicate from comparison -%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask // Store predicate to UB at base + 4 * 8 = base + 32 bytes -pto.psti %mask, %ub_base, 4, "NORM" : !pto.mask, !pto.ptr, i32 +pto.psti %mask, %ub_base, 4, "NORM" : !pto.mask, !pto.ptr, i32 ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-load-store/psti_zh.md b/docs/isa/scalar/ops/predicate-load-store/psti_zh.md index 20533f09c..a79d7990a 100644 --- a/docs/isa/scalar/ops/predicate-load-store/psti_zh.md +++ b/docs/isa/scalar/ops/predicate-load-store/psti_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 
```text -psti %mask, %ub_ptr[%imm], "DIST" : !pto.mask, !pto.ptr, i32 +psti %mask, %ub_ptr[%imm], "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 1(SSA) ```mlir -pto.psti %mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32 +pto.psti %mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32 ``` ### AS Level 2(DPS) ```mlir -pto.psti ins(%mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32) +pto.psti ins(%mask, %ub_ptr, %imm, "DIST" : !pto.mask, !pto.ptr, i32) ``` ## 关键约束 diff --git a/docs/isa/scalar/ops/predicate-load-store/psts.md b/docs/isa/scalar/ops/predicate-load-store/psts.md index 1860a766e..482be69b2 100644 --- a/docs/isa/scalar/ops/predicate-load-store/psts.md +++ b/docs/isa/scalar/ops/predicate-load-store/psts.md @@ -8,7 +8,7 @@ Store the full predicate register to a contiguous UB location. ## Mechanism -`pto.psts` writes a predicate word from `!pto.mask` to a UB address. The operation covers the full predicate width for the active element type (64 bits for f32, 128 bits for f16/bf16, 256 bits for i8/u8). +`pto.psts` writes a predicate word from `!pto.mask` to a UB address. The operation covers the full predicate width for the active element type (64 bits for f32, 128 bits for f16/bf16, 256 bits for i8/u8). For predicate width `Pw` and UB address `base`: @@ -21,19 +21,19 @@ The predicate register is read atomically. 
On A2/A3 and A5 only the low N bits o ### PTO Assembly Form ```mlir -pto.psts %mask, %ub_ptr : !pto.mask, !pto.ptr +pto.psts %mask, %ub_ptr : !pto.mask, !pto.ptr ``` ### AS Level 1 (SSA) ```mlir -pto.psts %mask, %ub_ptr : !pto.mask, !pto.ptr +pto.psts %mask, %ub_ptr : !pto.mask, !pto.ptr ``` ### AS Level 2 (DPS) ```mlir -pto.psts ins(%mask, %ub_ptr : !pto.mask, !pto.ptr) +pto.psts ins(%mask, %ub_ptr : !pto.mask, !pto.ptr) ``` ## C++ Intrinsic @@ -50,7 +50,7 @@ psts(src, base, offset, __cce_simd::NORM, __cce_simd::POST_UPDATE); | Operand | Type | Description | |---------|------|-------------| -| `%mask` | `!pto.mask` | Predicate register to store | +| `%mask` | `!pto.mask` | Predicate register to store | | `%ub_ptr` | `!pto.ptr` | UB destination address (must be 64-bit aligned) | ## Expected Outputs @@ -103,10 +103,10 @@ void save_mask(RegBuf& src, Ptr dst) { ```mlir // Generate comparison mask -%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%mask = pto.vcmp %v0, %v1, %seed, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask // Store predicate to UB for later reuse -pto.psts %mask, %ub_mask_slot0 : !pto.mask, !pto.ptr +pto.psts %mask, %ub_mask_slot0 : !pto.mask, !pto.ptr ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/scalar/ops/predicate-load-store/psts_zh.md b/docs/isa/scalar/ops/predicate-load-store/psts_zh.md index 4bdd78e7c..df409bd6b 100644 --- a/docs/isa/scalar/ops/predicate-load-store/psts_zh.md +++ b/docs/isa/scalar/ops/predicate-load-store/psts_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -psts %mask, %ub_ptr : !pto.mask, !pto.ptr +psts %mask, %ub_ptr : !pto.mask, !pto.ptr ``` ### AS Level 1(SSA) ```mlir -pto.psts %mask, %ub_ptr : !pto.mask, !pto.ptr +pto.psts %mask, %ub_ptr : !pto.mask, !pto.ptr ``` ### AS Level 2(DPS) ```mlir -pto.psts ins(%mask, %ub_ptr : !pto.mask, !pto.ptr) +pto.psts ins(%mask, %ub_ptr : !pto.mask, !pto.ptr) ``` ## 关键约束 diff --git 
a/docs/isa/scalar/ops/predicate-load-store/pstu.md b/docs/isa/scalar/ops/predicate-load-store/pstu.md index 22be7cc92..9755a61a2 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pstu.md +++ b/docs/isa/scalar/ops/predicate-load-store/pstu.md @@ -8,7 +8,7 @@ Stream predicate register to UB with alignment state tracking. High-throughput v ## Mechanism -`pto.pstu` writes a predicate word from `!pto.mask` to a UB address while tracking and updating alignment state. Unlike `psts`, this operation does not require 64-bit alignment and may batch multiple predicate writes into a single DMA transaction. +`pto.pstu` writes a predicate word from `!pto.mask` to a UB address while tracking and updating alignment state. Unlike `psts`, this operation does not require 64-bit alignment and may batch multiple predicate writes into a single DMA transaction. For alignment state `align_in`, predicate `mask`, and base address `base`: @@ -22,19 +22,19 @@ The `%align_out` state carries forward into the next `pstu` call, enabling strea ### PTO Assembly Form ```mlir -%align_out, %base_out = pto.pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr +%align_out, %base_out = pto.pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr ``` ### AS Level 1 (SSA) ```mlir -%align_out, %base_out = pto.pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr +%align_out, %base_out = pto.pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr ``` ### AS Level 2 (DPS) ```mlir -pto.pstu ins(%align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr) +pto.pstu ins(%align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr) outs(%align_out, %base_out : !pto.align, !pto.ptr) ``` @@ -52,7 +52,7 @@ pstu(alignData, src, base); | Operand | Type | Description | |---------|------|-------------| | `%align_in` | `!pto.align` | Alignment state from previous `pstu` or 
`pld`-instruction set operation | -| `%mask` | `!pto.mask` | Predicate register to stream-store | +| `%mask` | `!pto.mask` | Predicate register to stream-store | | `%base_in` | `!pto.ptr` | UB base address (no alignment requirement) | ## Expected Outputs @@ -118,13 +118,13 @@ void stream_masks(Ptr dst_base, ```mlir // Initialize alignment state (e.g., from a dummy load or zero) -%align0 = pto.plds %ub_dummy : !pto.ptr -> !pto.mask +%align0 = pto.plds %ub_dummy : !pto.ptr -> !pto.mask // Stream store first predicate; align_out carries forward -%align1, %base1 = pto.pstu %align0, %mask0, %base0 : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr +%align1, %base1 = pto.pstu %align0, %mask0, %base0 : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr // Stream store second predicate using updated alignment state -%align2, %base2 = pto.pstu %align1, %mask1, %base1 : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr +%align2, %base2 = pto.pstu %align1, %mask1, %base1 : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr ``` !!! 
note "Note" diff --git a/docs/isa/scalar/ops/predicate-load-store/pstu_zh.md b/docs/isa/scalar/ops/predicate-load-store/pstu_zh.md index edb69faeb..baf20da02 100644 --- a/docs/isa/scalar/ops/predicate-load-store/pstu_zh.md +++ b/docs/isa/scalar/ops/predicate-load-store/pstu_zh.md @@ -7,19 +7,19 @@ ### PTO 汇编形式 ```text -pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr +pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr ``` ### AS Level 1(SSA) ```mlir -%align_out, %base_out = pto.pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr +%align_out, %base_out = pto.pstu %align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr -> !pto.align, !pto.ptr ``` ### AS Level 2(DPS) ```mlir -pto.pstu ins(%align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr) +pto.pstu ins(%align_in, %mask, %base_in : !pto.align, !pto.mask, !pto.ptr) outs(%align_out, %base_out : !pto.align, !pto.ptr) ``` diff --git a/docs/isa/scalar/predicate-generation-and-algebra.md b/docs/isa/scalar/predicate-generation-and-algebra.md index fa84051f1..7ae8d98c4 100644 --- a/docs/isa/scalar/predicate-generation-and-algebra.md +++ b/docs/isa/scalar/predicate-generation-and-algebra.md @@ -1,10 +1,10 @@ # Predicate Generation And Algebra -Predicate generation and algebra operations create, combine, pack, unpack, and interleave `!pto.mask` values on the scalar and control instructions. The `!pto.mask` type is the lane-masking mechanism that `pto.v*` vector operations consume. +Predicate generation and algebra operations create, combine, pack, unpack, and interleave `!pto.mask` values on the scalar and control instructions. The `!pto.mask` type is the lane-masking mechanism that `pto.v*` vector operations consume. 
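The lane-masking role described here can be sketched with plain integers. This is an illustrative model only, not PTO code: it assumes a predicate is a bitmask where bit `i` set to 1 means lane `i` is active, and that `psel` selects bit by bit. The helper names (`active_lanes`, `pnot`, `psel`) and the 8-bit width are hypothetical and chosen for readability.

```python
# Illustrative model only: a predicate is an integer bitmask of a given width.
# These helpers are hypothetical and are not part of the PTO toolchain.

def active_lanes(mask, width):
    """Return the lane indices whose predicate bit is 1 (active)."""
    return [i for i in range(width) if (mask >> i) & 1]


def pnot(mask, width):
    """Bitwise NOT, truncated to the predicate width."""
    return ~mask & ((1 << width) - 1)


def psel(mask0, mask1, mask2, width):
    """Per-bit select (assumed): take mask1 where mask0 is 1, else mask2."""
    full = (1 << width) - 1
    return ((mask0 & mask1) | (~mask0 & mask2)) & full


WIDTH = 8  # small width for readability; real widths depend on the element type
a = 0b00001111
b = 0b00110011

print(active_lanes(a & b, WIDTH))  # prints [0, 1]
```

Truncating `pnot` to the predicate width matters in this model because Python integers are unbounded; a fixed-width hardware register would not need the extra step.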
-## The `!pto.mask` Type +## The `!pto.mask` Type -`!pto.mask` is a predicate mask type whose width is tied to the active element type rather than being a fixed number of bits: +`!pto.mask` is a predicate mask type whose width is tied to the active element type rather than being a fixed number of bits: | Element Type | Vector Width N | Predicate Width | |-------------|:-------------:|:--------------:| @@ -25,8 +25,8 @@ A predicate mask with bit value `1` at position `i` means lane `i` is **active** | Predicate unpack | `punpack` | Widen: extract half from a 2N-bit mask | Static (partition token) | | Boolean algebra | `pand`, `por`, `pxor`, `pnot` | AND / OR / XOR / NOT | Dynamic (runtime operands) | | Predicate select | `psel` | `mask0 ? mask1 : mask2` | Dynamic (runtime operands) | -| Deinterleave | `pdintlv_b8` | Split one 2N-bit mask into two N-bit masks | Static | -| Interleave | `pintlv_b16` | Combine two N-bit masks into one 2N-bit mask | Static | +| Deinterleave | `pdintlv_b8`, `pdintlv_b16`, `pdintlv_b32` | Deinterleave two predicate sources into two predicate outputs at the matching granularity | Static | +| Interleave | `pintlv_b8`, `pintlv_b16`, `pintlv_b32` | Interleave two predicate sources into two predicate outputs at the matching granularity | Static | ## Pattern Tokens @@ -55,7 +55,7 @@ A predicate mask with bit value `1` at position `i` means lane `i` is **active** All predicate generation and algebra operations MUST satisfy: -1. **Operand type**: All predicate operands MUST be `!pto.mask`. Mixing predicate operands with scalar or vector register operands is **illegal**. +1. **Operand type**: All predicate operands MUST be `!pto.mask`. Mixing predicate operands with scalar or vector register operands is **illegal**. 2. **Predicate width consistency**: All operands in a single operation MUST share the same predicate width. Operations that mix N-bit and 2N-bit predicates MUST use explicit pack/unpack. 3. 
**Pattern token validity**: Pattern tokens MUST be supported by the target profile. Using a pattern token outside its supported width context is **illegal**. 4. **Scalar operand type**: For `pge_*` and `plt_*` operations, the scalar operand type MUST match the variant suffix (`_b8` → i8, `_b16` → i16, `_b32` → i32). @@ -99,7 +99,11 @@ All predicate generation and algebra operations MUST satisfy: ### Interleave / Deinterleave - [pto.pdintlv_b8](./ops/predicate-generation-and-algebra/pdintlv-b8.md) +- [pto.pdintlv_b16](./ops/predicate-generation-and-algebra/pdintlv-b16.md) +- [pto.pdintlv_b32](./ops/predicate-generation-and-algebra/pdintlv-b32.md) +- [pto.pintlv_b8](./ops/predicate-generation-and-algebra/pintlv-b8.md) - [pto.pintlv_b16](./ops/predicate-generation-and-algebra/pintlv-b16.md) +- [pto.pintlv_b32](./ops/predicate-generation-and-algebra/pintlv-b32.md) ## Related Material diff --git a/docs/isa/scalar/predicate-generation-and-algebra_zh.md b/docs/isa/scalar/predicate-generation-and-algebra_zh.md index 8893c9ea6..2d2f24ef1 100644 --- a/docs/isa/scalar/predicate-generation-and-algebra_zh.md +++ b/docs/isa/scalar/predicate-generation-and-algebra_zh.md @@ -1,10 +1,10 @@ # 谓词生成与代数 -谓词生成与代数操作在标量与控制指令集中创建、组合、打包、解包和交错 `!pto.mask`。`!pto.mask` 是 `pto.v*` 向量操作消费的 lane mask 机制。 +谓词生成与代数操作在标量与控制指令集中创建、组合、打包、解包和交错 `!pto.mask`。`!pto.mask` 是 `pto.v*` 向量操作消费的 lane mask 机制。 -## `!pto.mask` 类型 +## `!pto.mask` 类型 -`!pto.mask` 的宽度与当前元素类型绑定: +`!pto.mask` 的宽度与当前元素类型绑定: | 元素类型 | 向量宽度 N | 谓词宽度 | | --- | :---: | :---: | @@ -22,11 +22,11 @@ | Comparison generation | `pge_*`, `plt_*` | | Predicate pack / unpack | `ppack`, `punpack` | | Boolean algebra | `pand`, `por`, `pxor`, `pnot`, `psel` | -| Interleave / deinterleave | `pdintlv_b8`, `pintlv_b16` | +| Interleave / deinterleave | `pdintlv_b8`、`pdintlv_b16`、`pdintlv_b32`、`pintlv_b8`、`pintlv_b16`、`pintlv_b32` | ## 共享约束 -- 所有谓词操作数必须是 `!pto.mask` +- 所有谓词操作数必须是 `!pto.mask` - 同一操作中的谓词宽度必须一致 - pattern token 必须被当前 profile 支持 - `pge_*` / 
`plt_*` 的标量类型必须与后缀匹配 diff --git a/docs/isa/scalar/predicate-load-store.md b/docs/isa/scalar/predicate-load-store.md index f0b5541ae..6bcd45cad 100644 --- a/docs/isa/scalar/predicate-load-store.md +++ b/docs/isa/scalar/predicate-load-store.md @@ -1,6 +1,6 @@ # Predicate Load Store -Predicate load/store instruction set moves predicate-register state (`!pto.mask`) between UB-visible storage and the architectural predicate instruction set. Predicates are the lane-masking mechanism that `pto.v*` vector operations consume. +Predicate load/store instruction set moves predicate-register state (`!pto.mask`) between UB-visible storage and the architectural predicate instruction set. Predicates are the lane-masking mechanism that `pto.v*` vector operations consume. ## Mechanism @@ -72,13 +72,13 @@ A typical predicate load/store lifecycle: ``` // Kernel entry: load saved predicate -%mask = pto.plds %ub_saved : !pto.ptr -> !pto.mask +%mask = pto.plds %ub_saved : !pto.ptr -> !pto.mask // Use predicate for vector computation -%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsel %v_true, %v_false, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // At kernel exit: save predicate for next kernel -pto.psts %mask, %ub_saved : !pto.mask, !pto.ptr +pto.psts %mask, %ub_saved : !pto.mask, !pto.ptr ``` ## Target-Profile Restrictions diff --git a/docs/isa/scalar/predicate-load-store_zh.md b/docs/isa/scalar/predicate-load-store_zh.md index eed462fcb..50b6460e0 100644 --- a/docs/isa/scalar/predicate-load-store_zh.md +++ b/docs/isa/scalar/predicate-load-store_zh.md @@ -1,6 +1,6 @@ # 谓词加载存储 -谓词加载存储指令集在 UB 可见存储与架构谓词指令集之间搬运 `!pto.mask` 状态。谓词是 `pto.v*` 向量操作消费的 lane mask 机制。 +谓词加载存储指令集在 UB 可见存储与架构谓词指令集之间搬运 `!pto.mask` 状态。谓词是 `pto.v*` 向量操作消费的 lane mask 机制。 ## 机制 diff --git a/docs/isa/scalar/shared-scf.md b/docs/isa/scalar/shared-scf.md index 1f53405f3..c430221f6 100644 --- 
a/docs/isa/scalar/shared-scf.md +++ b/docs/isa/scalar/shared-scf.md @@ -66,10 +66,10 @@ It produces: ```mlir scf.for %i = %c0 to %tile_count step %c1 { %offset = arith.muli %i, %tile_stride : index - %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + %mask = pto.pset_b32 "PAT_ALL" : !pto.mask %v = pto.vlds %ub[%offset] : !pto.ptr -> !pto.vreg<64xf32> - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } ``` diff --git a/docs/isa/state-and-types/location-intent-and-legality.md b/docs/isa/state-and-types/location-intent-and-legality.md index ddb115d61..ceeb7e7ef 100644 --- a/docs/isa/state-and-types/location-intent-and-legality.md +++ b/docs/isa/state-and-types/location-intent-and-legality.md @@ -160,7 +160,7 @@ Different instruction sets have different legality rules beyond the four-stage p ### Vector Compute (vadd, vmul, etc.) - Operands MUST be `!pto.vreg`. -- Mask operand MUST be `!pto.mask` with matching width. +- Mask operand MUST be `!pto.mask` with matching width. - `dtype` MUST be in the vector instruction set type list (varies by profile). 
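The three vector-compute rules above can be read as a simple checklist. The sketch below is a hypothetical model, not the actual verifier: it assumes "matching width" means the predicate width quoted on the `pto.psts` page (64 bits for f32, 128 bits for f16/bf16, 256 bits for i8/u8), and the function name, operand-kind strings, and message texts are invented for illustration.

```python
# Hypothetical legality checklist for a vector compute op (e.g. vadd, vmul).
# Widths follow the figures quoted on the pto.psts page; everything else
# (names, strings, structure) is an assumption made for this sketch.

PREDICATE_WIDTH = {"f32": 64, "f16": 128, "bf16": 128, "i8": 256, "u8": 256}


def vector_compute_violations(operand_kinds, dtype, mask_width, profile_dtypes):
    """Return a list of rule violations; an empty list means the op is legal."""
    problems = []
    if any(kind != "vreg" for kind in operand_kinds):
        problems.append("operands must be !pto.vreg")
    if dtype not in profile_dtypes:
        problems.append("dtype not in the profile's vector type list")
    elif PREDICATE_WIDTH.get(dtype) != mask_width:
        problems.append("mask width does not match dtype")
    return problems


profile = {"f32", "f16"}  # hypothetical profile type list
print(vector_compute_violations(["vreg", "vreg"], "f32", 64, profile))
print(vector_compute_violations(["vreg", "tile"], "f32", 128, profile))
```

Returning every violation rather than stopping at the first one is a design choice for the sketch; a real verifier may report differently.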
## GM-Facing Operands (GlobalTensor) diff --git a/docs/isa/syntax-and-operands/assembly-model.md b/docs/isa/syntax-and-operands/assembly-model.md index f76ff7084..436ca0759 100644 --- a/docs/isa/syntax-and-operands/assembly-model.md +++ b/docs/isa/syntax-and-operands/assembly-model.md @@ -93,7 +93,7 @@ In PTO-AS, a `GlobalTensor` operand appears as a `memref` or `partition_tensor_v A predicate operand is written as a mask register: ``` -%mask : !pto.mask -- predicate operand in SSA form +%mask : !pto.mask -- predicate operand in SSA form ``` Vector instructions that take a mask write it as an explicit operand: @@ -185,14 +185,14 @@ tload %tile, %tensor[%r, %c] : (!pto.tile, !pto.memref) -> !pt **SSA Form (AS Level 1)**: ``` -%result = pto.vadd %src0, %src1, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vadd %src0, %src1, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ### Scalar Compare: Predicate Generation **SSA Form (AS Level 1)**: ``` -%pred = pto.pge_b32 %src0, %src1 : (!pto.vreg<64xi32>, !pto.vreg<64xi32>) -> !pto.mask +%pred = pto.pge_b32 %src0, %src1 : (!pto.vreg<64xi32>, !pto.vreg<64xi32>) -> !pto.mask ``` ## What Textual Spelling Does Not Replace diff --git a/docs/isa/syntax-and-operands/assembly-model_zh.md b/docs/isa/syntax-and-operands/assembly-model_zh.md index d709159b1..810ddcf22 100644 --- a/docs/isa/syntax-and-operands/assembly-model_zh.md +++ b/docs/isa/syntax-and-operands/assembly-model_zh.md @@ -91,7 +91,7 @@ PTO-AS 中的 tile 操作数可以带修饰: ### 谓词操作数 ```text -%mask : !pto.mask +%mask : !pto.mask ``` 向量指令把 mask 作为显式操作数: @@ -168,13 +168,13 @@ tload %tile, %tensor[%r, %c] : (!pto.tile, !pto.memref) -> !pt ### 向量加法 ```text -%result = pto.vadd %src0, %src1, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vadd %src0, %src1, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ### 谓词生成 ```text -%pred = 
pto.pge_b32 %src0, %src1 : (!pto.vreg<64xi32>, !pto.vreg<64xi32>) -> !pto.mask +%pred = pto.pge_b32 %src0, %src1 : (!pto.vreg<64xi32>, !pto.vreg<64xi32>) -> !pto.mask ``` ## 文本拼写不能替代的内容 diff --git a/docs/isa/syntax-and-operands/operands-and-attributes.md b/docs/isa/syntax-and-operands/operands-and-attributes.md index adf9afd2e..f1e6bf9c7 100644 --- a/docs/isa/syntax-and-operands/operands-and-attributes.md +++ b/docs/isa/syntax-and-operands/operands-and-attributes.md @@ -11,7 +11,7 @@ PTO defines seven operand kinds. Each kind maps to a specific SSA type and has d | **Tile** | `!pto.tile<...>` / `!pto.tile_buf<...>` | `Tile` | Tile operand with shape, layout, valid-region metadata | | **GlobalTensor** | `!pto.partition_tensor_view<...>` / `!pto.memref<...>` | `GlobalTensor` | GM-facing view; the source or destination of data movement | | **Scalar** | `i8`–`i64`, `u8`–`u64`, `f16`, `bf16`, `f32` | Built-in C++ types | Immediate values or runtime-computed scalars | -| **Predicate** | `!pto.mask` | (IR-level) | Per-lane mask controlling which lanes participate in vector instructions | +| **Predicate** | `!pto.mask` | (IR-level) | Per-lane mask controlling which lanes participate in vector instructions | | **Event** | `!pto.event` | `RecordEvent` (return type) | Synchronization token; carries ordering information between operations | | **UB Pointer** | `!pto.ptr` | (IR-level) | Pointer into Unified Buffer; used by vector load/store and DMA copy ops | | **GM Pointer** | `!pto.ptr` | `__gm__ T*` | Pointer into Global Memory; used by scalar load/store and DMA copy ops | @@ -59,7 +59,7 @@ Scalar operands are immediate values encoded directly in the instruction or comp ### Predicate Operands -Predicate operands (`!pto.mask`) control which lanes participate in vector operations. They are produced by predicate-generation operations (`pset_b8`, `pge_b32`, `plt_b16`, etc.) and consumed by vector operations. 
+Predicate operands (`!pto.mask`) control which lanes participate in vector operations. They are produced by predicate-generation operations (`pset_b8`, `pge_b32`, `plt_b16`, etc.) and consumed by vector operations. A predicate with all bits set means "all lanes active". A predicate with some bits cleared means "only those lanes participate". diff --git a/docs/isa/syntax-and-operands/operands-and-attributes_zh.md b/docs/isa/syntax-and-operands/operands-and-attributes_zh.md index 4f56fd2bd..30ae31998 100644 --- a/docs/isa/syntax-and-operands/operands-and-attributes_zh.md +++ b/docs/isa/syntax-and-operands/operands-and-attributes_zh.md @@ -11,7 +11,7 @@ PTO 定义七种操作数类别: | **Tile** | `!pto.tile<...>` / `!pto.tile_buf<...>` | `Tile<...>` | 带 shape、layout、valid-region 元数据的 tile | | **GlobalTensor** | `!pto.partition_tensor_view<...>` / `!pto.memref<...>` | `GlobalTensor<...>` | 面向 GM 的视图 | | **Scalar** | `i8`–`i64`, `u8`–`u64`, `f16`, `bf16`, `f32` | 标准 C++ 标量类型 | 立即数或运行时标量 | -| **Predicate** | `!pto.mask` | IR 层 | 控制向量 lane 参与的 mask | +| **Predicate** | `!pto.mask` | IR 层 | 控制向量 lane 参与的 mask | | **Event** | `!pto.event` | `RecordEvent` | 顺序令牌 | | **UB Pointer** | `!pto.ptr` | IR 层 | 指向 UB 的指针 | | **GM Pointer** | `!pto.ptr` | `__gm__ T*` | 指向 GM 的指针 | @@ -44,7 +44,7 @@ Tile 是 `pto.t*` 的主要有效载荷类型。 ### Predicate -谓词 `!pto.mask` 控制向量操作中哪些 lane 参与。 +谓词 `!pto.mask` 控制向量操作中哪些 lane 参与。 ### UB Pointer diff --git a/docs/isa/system/ops/TFREE.md b/docs/isa/system/ops/TFREE.md index b750b2f67..a1827d055 100644 --- a/docs/isa/system/ops/TFREE.md +++ b/docs/isa/system/ops/TFREE.md @@ -1,4 +1,4 @@ -# TFREE +# pto.tfree ## Tile Operation Diagram diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md index 71be5aee1..81cb5ad4a 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tands_zh.md @@ -1,4 +1,4 @@ -# TANDS +# TANDS ## 指令示意图 diff --git 
a/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md index e2a3c2b66..a30b03e91 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tdivs_zh.md @@ -1,4 +1,4 @@ -# TDIVS +# TDIVS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/texp.md b/docs/isa/tile/ops/elementwise-tile-tile/texp.md index 073f18bac..410699a35 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/texp.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/texp.md @@ -114,7 +114,7 @@ R = 16 × 64 / 8 = 128 total ≈ 13 + 26 + 256 + (128-1) × 18 = 2571 cycles ``` -**Note**: `TEXP` is significantly more expensive than `TADD`/`TMUL` due to SFU pipeline. For numerically stable softmax kernels, prefer the vector-level `vexpdiff` fused operation instead. +**Note**: `TEXP` is significantly more expensive than `TADD`/`TMUL` due to SFU pipeline. For numerically stable softmax kernels, prefer the vector-level `vexpdif` fused operation instead. 
--- diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md index 1a2cff4fa..2ea3e4307 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tfmods_zh.md @@ -1,4 +1,4 @@ -# TFMODS +# TFMODS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md index 53c95229c..205fc8bdd 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tmaxs_zh.md @@ -1,4 +1,4 @@ -# TMAXS +# TMAXS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md index 2a57335ea..b1c840b25 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tmins_zh.md @@ -1,4 +1,4 @@ -# TMINS +# TMINS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md index 53d809aeb..f1077e2f7 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tors_zh.md @@ -1,4 +1,4 @@ -# TORS +# TORS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md index 7dec848e9..cba06e88e 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/trems_zh.md @@ -1,4 +1,4 @@ -# TREMS +# TREMS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md index 855327f7a..75f4564e3 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tsels_zh.md @@ -1,4 +1,4 @@ -# TSELS +# TSELS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md 
index c092c29cc..82b684a33 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tsubs_zh.md @@ -1,4 +1,4 @@ -# TSUBS +# TSUBS ## 指令示意图 diff --git a/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md b/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md index 829c3c6f6..931d1562d 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/txors_zh.md @@ -1,4 +1,4 @@ -# TXORS +# TXORS ## 指令示意图 diff --git a/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md b/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md index d6204dadc..a01d96993 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md +++ b/docs/isa/tile/ops/irregular-and-complex/tgather_zh.md @@ -1,4 +1,4 @@ -# TGATHER +# TGATHER ## 指令示意图 diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md index 5f0f9c1e0..8457bbf42 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartadd_zh.md @@ -1,4 +1,4 @@ -# TPARTADD +# TPARTADD ## 指令示意图 diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md index b3f34a7ad..024e3daa3 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartmax_zh.md @@ -1,4 +1,4 @@ -# TPARTMAX +# TPARTMAX ## 指令示意图 diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md index 4f135e22e..205049b7a 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartmin_zh.md @@ -1,4 +1,4 @@ -# TPARTMIN +# TPARTMIN ## 指令示意图 diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md b/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md index 2211a7a4b..b6bf91c4b 100644 --- 
a/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartmul_zh.md @@ -1,4 +1,4 @@ -# TPARTMUL +# TPARTMUL ## 指令示意图 diff --git a/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md b/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md index e4a28d2f6..f869819c7 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md +++ b/docs/isa/tile/ops/irregular-and-complex/tprint_zh.md @@ -1,4 +1,4 @@ -# TPRINT +# TPRINT ## 指令示意图 diff --git a/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md b/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md index 52e4eee68..859792a78 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md +++ b/docs/isa/tile/ops/irregular-and-complex/tsort32_zh.md @@ -1,4 +1,4 @@ -# TSORT32 +# TSORT32 ## 指令示意图 diff --git a/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md index b71e037a8..1ba348660 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/textract_zh.md @@ -1,4 +1,4 @@ -# TEXTRACT +# TEXTRACT ## 指令示意图 diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md index 68a00ae69..83f5d8781 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/tinsert_zh.md @@ -1,4 +1,4 @@ -# TINSERT +# TINSERT ## 指令示意图 diff --git a/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md b/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md index 370c62710..4aebef11e 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/ttrans_zh.md @@ -1,4 +1,4 @@ -# TTRANS +# TTRANS ## 指令示意图 diff --git a/docs/isa/tile/ops/sync-and-config/tsync_zh.md b/docs/isa/tile/ops/sync-and-config/tsync_zh.md index 3d1bbc5ed..86f9e193c 100644 --- 
a/docs/isa/tile/ops/sync-and-config/tsync_zh.md +++ b/docs/isa/tile/ops/sync-and-config/tsync_zh.md @@ -1,4 +1,4 @@ -# TSYNC +# TSYNC ## 指令示意图 diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md index 13ba03282..d9282c110 100644 --- a/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md +++ b/docs/isa/tile/ops/tile-scalar-and-immediate/texpands_zh.md @@ -1,4 +1,4 @@ -# TEXPANDS +# TEXPANDS ## 指令示意图 diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md index 7667941cf..1733b73f2 100644 --- a/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md +++ b/docs/isa/tile/ops/tile-scalar-and-immediate/tlrelu_zh.md @@ -1,4 +1,4 @@ -# TLRELU +# TLRELU ## 指令示意图 diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md index 9cdeda58f..eb41ef27d 100644 --- a/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md +++ b/docs/isa/tile/ops/tile-scalar-and-immediate/tshls_zh.md @@ -1,4 +1,4 @@ -# TSHLS +# TSHLS ## 指令示意图 diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md b/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md index 2be8e5c94..62064b824 100644 --- a/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md +++ b/docs/isa/tile/ops/tile-scalar-and-immediate/tshrs_zh.md @@ -1,4 +1,4 @@ -# TSHRS +# TSHRS ## 指令示意图 diff --git a/docs/isa/tile/ops/view-and-tile-buf/alloc-tile.md b/docs/isa/tile/ops/view-and-tile-buf/alloc-tile.md new file mode 100644 index 000000000..0fda8d7bc --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/alloc-tile.md @@ -0,0 +1,63 @@ +# pto.alloc_tile + +`pto.alloc_tile` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Declare the lifetime of a `!pto.tile_buf<...>`. Each call produces an **independent** tile-buffer instance. 
+ +## Mechanism + +A tile buffer is a bounded, rectangular 2-D region of on-chip memory (UB / L1 / L0A / L0B / L0C / BT / scaling buffer) with an explicit lifetime. `pto.alloc_tile` introduces a fresh SSA value standing for one such instance and lets the implementation decide on the concrete address — or accepts an explicit address via the optional `addr` clause. + +When the result tile type has dynamic `v_row` / `v_col` (`valid=?x?`), the corresponding `valid_row` / `valid_col` operands must be supplied at allocation time so downstream ops see a well-defined valid region. + +The op is pure: no data movement, no synchronization. + +## Syntax + +```mlir +%tb = pto.alloc_tile : !pto.tile_buf<...> +%tb2 = pto.alloc_tile valid_row = %vr valid_col = %vc : !pto.tile_buf +%tb3 = pto.alloc_tile addr = %ad : !pto.tile_buf<...> +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `addr` | `Optional` | Optional explicit start address. If omitted, assigned by the implementation. | +| `valid_row` | `Optional` | Dynamic valid-row count. Required when the result type has `v_row = ?`. | +| `valid_col` | `Optional` | Dynamic valid-col count. Required when the result type has `v_col = ?`. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%tb` | `!pto.tile_buf<...>` | Newly allocated tile buffer instance. | + +## Constraints + +!!! warning "Constraints" + - If result `v_row` / `v_col` are dynamic (`?`), the corresponding operands MUST be present. + - If result `v_row` / `v_col` are static, the corresponding operands MUST be absent. + - Each call produces an **independent** tile-buffer instance, even when called repeatedly with the same arguments. + +## Examples + +```mlir +%tb = pto.alloc_tile : !pto.tile_buf +``` + +```mlir +// Dynamic valid shape: must pass valid_row / valid_col. 
+%tb = pto.alloc_tile valid_row = %vr valid_col = %vc + : !pto.tile_buf +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Update dynamic valid shape: [pto.set_validshape](./set-validshape.md) +- Carve a sub-region: [pto.subset](./subset.md) +- Bridge to vector load/store: [pto.tile_buf_addr](./tile-buf-addr.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/alloc-tile_zh.md b/docs/isa/tile/ops/view-and-tile-buf/alloc-tile_zh.md new file mode 100644 index 000000000..e419b7ddc --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/alloc-tile_zh.md @@ -0,0 +1,63 @@ +# pto.alloc_tile + +`pto.alloc_tile` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +声明一个 `!pto.tile_buf<...>` 的生命周期。每一次调用都会产生一个**相互独立**的 tile buffer 实例。 + +## 机制 + +tile buffer 是片上内存(UB / L1 / L0A / L0B / L0C / BT / scaling buffer)上一个有界的 2D 矩形区域,具有显式的生命周期。`pto.alloc_tile` 引入一个新鲜的 SSA 值来代表这样一个实例,并让实现自动分配具体地址,或通过可选的 `addr` 子句接受显式地址。 + +如果结果 tile 类型的 `v_row` / `v_col` 是动态的(`valid=?x?`),则必须在 alloc 时提供对应的 `valid_row` / `valid_col`,以便下游操作看到一个定义良好的 valid 区域。 + +纯操作:不搬数据、不做同步。 + +## 语法 + +```mlir +%tb = pto.alloc_tile : !pto.tile_buf<...> +%tb2 = pto.alloc_tile valid_row = %vr valid_col = %vc : !pto.tile_buf +%tb3 = pto.alloc_tile addr = %ad : !pto.tile_buf<...> +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `addr` | `Optional` | 可选起始地址。省略时由实现分配。 | +| `valid_row` | `Optional` | 动态 valid-row 数量。当结果类型的 `v_row = ?` 时必须提供。 | +| `valid_col` | `Optional` | 动态 valid-col 数量。当结果类型的 `v_col = ?` 时必须提供。 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%tb` | `!pto.tile_buf<...>` | 新分配的 tile buffer 实例 | + +## 约束 + +!!! 
warning "约束" + - 若结果 `v_row` / `v_col` 是动态(`?`),对应操作数**必须**存在。 + - 若结果 `v_row` / `v_col` 是静态,对应操作数**必须**省略。 + - 每次调用都产生一个**独立**的 tile buffer 实例,即使重复用相同参数调用也不会复用。 + +## 示例 + +```mlir +%tb = pto.alloc_tile : !pto.tile_buf +``` + +```mlir +// 动态 valid 形状:必须传 valid_row / valid_col +%tb = pto.alloc_tile valid_row = %vr valid_col = %vc + : !pto.tile_buf +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 更新动态 valid 形状:[pto.set_validshape](./set-validshape_zh.md) +- 切出子区:[pto.subset](./subset_zh.md) +- 跨向量加载/存储桥梁:[pto.tile_buf_addr](./tile-buf-addr_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-dim.md b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-dim.md new file mode 100644 index 000000000..23efe4194 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-dim.md @@ -0,0 +1,49 @@ +# pto.get_tensor_view_dim + +`pto.get_tensor_view_dim` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Return the runtime size of a specific dimension from a `!pto.tensor_view<...>`. + +## Mechanism + +The op reads the extent of dimension `%idx` from the descriptor produced by [`pto.make_tensor_view`](./make-tensor-view.md). It is pure: no memory access, no synchronization, no architectural side effects. + +## Syntax + +```mlir +%dim = pto.get_tensor_view_dim %tv, %idx : !pto.tensor_view<...> -> index +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%tv` | `!pto.tensor_view<...>` | Logical tensor view. | +| `%idx` | `index` | Dimension index (0-based). | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%dim` | `index` | Runtime size of dimension `%idx`. | + +## Constraints + +!!! warning "Constraints" + - `%idx` MUST be in `[0, rank(%tv))`. + - The op is pure; it does not modify the view or any underlying memory. 
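The behavior described above can be modeled in a few lines. This is a minimal sketch under stated assumptions: `TensorViewModel` is a hypothetical stand-in for the view descriptor, and raising `IndexError` for an out-of-range index is a choice made for the model only, since the page states the range requirement but not what an implementation does when it is violated.

```python
# Hypothetical model of pto.get_tensor_view_dim; not the real implementation.

class TensorViewModel:
    """Stand-in for a tensor-view descriptor that records only the shape."""

    def __init__(self, shape):
        self.shape = tuple(shape)


def get_tensor_view_dim(tv, idx):
    """Return the size of dimension idx (0-based) without touching memory."""
    if not 0 <= idx < len(tv.shape):
        # The page requires idx in [0, rank); raising here is a modeling choice.
        raise IndexError("dimension index out of range")
    return tv.shape[idx]


tv = TensorViewModel((32, 128))
h = get_tensor_view_dim(tv, 0)
w = get_tensor_view_dim(tv, 1)
print(h, w)  # prints 32 128
```

The query only reads the descriptor, which mirrors the "pure" property: calling it repeatedly returns the same value and leaves the view unchanged.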
+ +## Examples + +```mlir +%h = pto.get_tensor_view_dim %tv, %c0 : !pto.tensor_view -> index +%w = pto.get_tensor_view_dim %tv, %c1 : !pto.tensor_view -> index +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Construct a view: [pto.make_tensor_view](./make-tensor-view.md) +- Query stride: [pto.get_tensor_view_stride](./get-tensor-view-stride.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-dim_zh.md b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-dim_zh.md new file mode 100644 index 000000000..096ca54cb --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-dim_zh.md @@ -0,0 +1,49 @@ +# pto.get_tensor_view_dim + +`pto.get_tensor_view_dim` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +从 `!pto.tensor_view<...>` 读出指定维度的运行时大小。 + +## 机制 + +读取由 [`pto.make_tensor_view`](./make-tensor-view_zh.md) 产生的描述符中 `%idx` 维度的尺寸。纯操作:不访问内存,不做同步,无任何架构副作用。 + +## 语法 + +```mlir +%dim = pto.get_tensor_view_dim %tv, %idx : !pto.tensor_view<...> -> index +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%tv` | `!pto.tensor_view<...>` | 逻辑张量视图 | +| `%idx` | `index` | 维度索引(0-based) | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%dim` | `index` | `%idx` 维度的运行时尺寸 | + +## 约束 + +!!! 
warning "约束" + - `%idx` 必须在 `[0, rank(%tv))` 范围内。 + - 纯操作:不改变视图或底层内存。 + +## 示例 + +```mlir +%h = pto.get_tensor_view_dim %tv, %c0 : !pto.tensor_view -> index +%w = pto.get_tensor_view_dim %tv, %c1 : !pto.tensor_view -> index +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 构造视图:[pto.make_tensor_view](./make-tensor-view_zh.md) +- 查询步长:[pto.get_tensor_view_stride](./get-tensor-view-stride_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-stride.md b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-stride.md new file mode 100644 index 000000000..2d1e6cfd4 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-stride.md @@ -0,0 +1,56 @@ +# pto.get_tensor_view_stride + +`pto.get_tensor_view_stride` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Return the logical stride of a specific dimension, measured in **elements** (not bytes), from a `!pto.tensor_view<...>` (or its lowered memref form). + +## Mechanism + +The op reads the per-dimension stride from the descriptor produced by [`pto.make_tensor_view`](./make-tensor-view.md). It is pure: no memory access, no synchronization. + +Because the stride is reported in **elements**, downstream pointer arithmetic computed via [`pto.addptr`](../../../scalar/ops/micro-instruction/pointer-operations.md) (which is also element-based) composes cleanly without an additional `sizeof(T)` multiply. + +## Syntax + +```mlir +%stride = pto.get_tensor_view_stride %tv, %idx : !pto.tensor_view<...> -> index +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%tv` | `!pto.tensor_view<...>` or memref form | Tensor view or its lowered memory-reference form. | +| `%idx` | `index` | Dimension index (0-based). | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%stride` | `index` | Element-stride of dimension `%idx`. | + +## Constraints + +!!! 
warning "Constraints" + - `%idx` MUST be in `[0, rank(%tv))`. + - The returned stride is counted in **elements**, not bytes. Mixing element-stride and byte-offset values without an explicit `sizeof(T)` conversion is a bug. + - The op is pure; it does not modify the view or any underlying memory. + +## Examples + +```mlir +// Stride of the leading dim (rows) for a row-major view. +%s0 = pto.get_tensor_view_stride %tv, %c0 : !pto.tensor_view -> index + +// Stride of the inner dim is 1 (one element). +%s1 = pto.get_tensor_view_stride %tv, %c1 : !pto.tensor_view -> index +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Construct a view: [pto.make_tensor_view](./make-tensor-view.md) +- Query dim size: [pto.get_tensor_view_dim](./get-tensor-view-dim.md) +- Element-offset pointer arithmetic: [pto.addptr](../../../scalar/ops/micro-instruction/pointer-operations.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-stride_zh.md b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-stride_zh.md new file mode 100644 index 000000000..d6f00edd3 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-stride_zh.md @@ -0,0 +1,56 @@ +# pto.get_tensor_view_stride + +`pto.get_tensor_view_stride` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +从 `!pto.tensor_view<...>`(或其下沉后的 memref 形式)读出指定维度的逻辑步长,**以元素为单位**(非字节)。 + +## 机制 + +读取由 [`pto.make_tensor_view`](./make-tensor-view_zh.md) 产生的描述符中 `%idx` 维度的步长。纯操作:不访问内存,不做同步。 + +因为步长以**元素**为单位返回,与 [`pto.addptr`](../../../scalar/ops/micro-instruction/pointer-operations_zh.md)(也是以元素为单位)做指针算术时可以直接组合,无需额外的 `sizeof(T)` 乘法。 + +## 语法 + +```mlir +%stride = pto.get_tensor_view_stride %tv, %idx : !pto.tensor_view<...> -> index +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%tv` | `!pto.tensor_view<...>` 或下沉后的 memref 形式 | 张量视图或其下沉形式 | +| `%idx` | `index` | 维度索引(0-based) | + +## 预期输出 + +| 结果 | 类型 | 描述 | 
+|--------|------|------| +| `%stride` | `index` | `%idx` 维度的元素步长 | + +## 约束 + +!!! warning "约束" + - `%idx` 必须在 `[0, rank(%tv))` 范围内。 + - 返回的步长以**元素**为单位(不是字节)。把元素步长和字节偏移混用而不做显式 `sizeof(T)` 换算是错误。 + - 纯操作:不改变视图或底层内存。 + +## 示例 + +```mlir +// 行优先视图的外层维度步长 +%s0 = pto.get_tensor_view_stride %tv, %c0 : !pto.tensor_view -> index + +// 内层维度步长通常为 1(一个元素) +%s1 = pto.get_tensor_view_stride %tv, %c1 : !pto.tensor_view -> index +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 构造视图:[pto.make_tensor_view](./make-tensor-view_zh.md) +- 查询维度尺寸:[pto.get_tensor_view_dim](./get-tensor-view-dim_zh.md) +- 元素粒度的指针算术:[pto.addptr](../../../scalar/ops/micro-instruction/pointer-operations_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/make-tensor-view.md b/docs/isa/tile/ops/view-and-tile-buf/make-tensor-view.md new file mode 100644 index 000000000..ba42f8337 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/make-tensor-view.md @@ -0,0 +1,57 @@ +# pto.make_tensor_view + +`pto.make_tensor_view` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Construct a global tensor view from a base pointer, runtime shape, and runtime strides. No allocation, no data movement — purely descriptor construction. + +## Mechanism + +A `!pto.tensor_view<...>` is a logical descriptor that carries a base pointer, per-dimension extents, per-dimension element strides, and an optional layout hint. Tile-level ops (`pto.tload`, `pto.tstore`, `pto.partition_view`, …) consume these views as their source-of-truth for global memory addressing. + +`pto.make_tensor_view` packages those four pieces of information into a single SSA value. The op is pure: it materializes the descriptor only and does not touch memory. 
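The descriptor described above can be modelled in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the PTO implementation: `TensorView`, `element_offset`, and `byte_offset` are hypothetical names, and the base pointer is represented as a plain integer byte address. It shows how a base pointer, extents, element strides, and a layout hint travel together as one value, and why the `sizeof(T)` multiply only appears when an element offset is turned into a byte address.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorView:
    """Hypothetical model of a tensor-view descriptor (not a PTO API)."""
    base: int                      # base address in bytes (stand-in for the pointer)
    shape: Tuple[int, ...]         # per-dimension extents
    strides: Tuple[int, ...]       # per-dimension strides, counted in elements
    layout: Optional[str] = None   # optional hint such as "nd", "dn", or "nz"

    def __post_init__(self):
        # One stride per dimension, mirroring the rank-matching constraint.
        if len(self.shape) != len(self.strides):
            raise ValueError("shape and strides must have the same rank")

    def element_offset(self, index):
        """Offset of `index` from the base, in elements."""
        if len(index) != len(self.shape):
            raise ValueError("index rank must match the view rank")
        for i, extent in zip(index, self.shape):
            if not 0 <= i < extent:
                raise IndexError("index out of range")
        return sum(i * s for i, s in zip(index, self.strides))

    def byte_offset(self, index, itemsize):
        """Byte address of `index`; the sizeof(T) multiply happens only here."""
        return self.base + self.element_offset(index) * itemsize


# Row-major 4x8 view of 2-byte elements: strides are (8, 1) in elements.
tv = TensorView(base=0x1000, shape=(4, 8), strides=(8, 1), layout="nd")
print(tv.element_offset((2, 3)))        # 2*8 + 3*1 = 19
print(hex(tv.byte_offset((2, 3), 2)))   # 0x1000 + 19*2 = 0x1026
```

Constructing the model touches no memory, which matches the "descriptor only" behavior the page describes; everything else about the real descriptor's representation is left to the implementation.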
+ +## Syntax + +```mlir +%tv = pto.make_tensor_view %ptr, shape = [%m, %n], strides = [%s0, %s1] + : !pto.tensor_view +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%ptr` | `!pto.ptr` (or matching pointer) | Source pointer; element type must match the result. | +| `shape` | `Variadic` | Dynamic shape dimensions, one entry per result rank. | +| `strides` | `Variadic` | Dynamic strides, counted in **elements** (not bytes), one entry per result rank. | +| `layout` (attr, optional) | `LayoutAttr` | `nd` / `dn` / `nz` hint. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%tv` | `!pto.tensor_view<...>` | Logical view descriptor. | + +## Constraints + +!!! warning "Constraints" + - `%ptr` element type must match the result element type. + - `shape` and `strides` operand counts must match the tensor_view rank. + - If `layout` is provided with static shapes/strides, it must be consistent with the inferred layout. 
+ +## Examples + +```mlir +%tv = pto.make_tensor_view %ptr, shape = [%m, %n], strides = [%s0, %s1] + : !pto.tensor_view +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Query view dim: [pto.get_tensor_view_dim](./get-tensor-view-dim.md) +- Query view stride: [pto.get_tensor_view_stride](./get-tensor-view-stride.md) +- Extract address: [pto.tensor_view_addr](./tensor-view-addr.md) +- Partition a view: [pto.partition_view](./partition-view.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/make-tensor-view_zh.md b/docs/isa/tile/ops/view-and-tile-buf/make-tensor-view_zh.md new file mode 100644 index 000000000..e09bebe91 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/make-tensor-view_zh.md @@ -0,0 +1,57 @@ +# pto.make_tensor_view + +`pto.make_tensor_view` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +由基地址指针、运行时形状和运行时步长构造一个全局张量视图。**不分配内存、不搬数据**——只是构造描述符。 + +## 机制 + +`!pto.tensor_view<...>` 是一个逻辑描述符,承载基址指针、每维度尺寸、每维度元素步长,以及可选的布局提示。tile 层操作(`pto.tload`、`pto.tstore`、`pto.partition_view` 等)以它作为全局内存寻址的唯一来源。 + +`pto.make_tensor_view` 把上述四块信息打包成一个 SSA 值。本操作是纯操作:只生成描述符,不访问内存。 + +## 语法 + +```mlir +%tv = pto.make_tensor_view %ptr, shape = [%m, %n], strides = [%s0, %s1] + : !pto.tensor_view +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%ptr` | `!pto.ptr`(或匹配的指针) | 源指针;元素类型必须与结果一致 | +| `shape` | `Variadic` | 动态形状,按结果 rank 提供每维度一个 | +| `strides` | `Variadic` | 动态步长,**以元素为单位**(非字节),按结果 rank 提供每维度一个 | +| `layout`(属性,可选) | `LayoutAttr` | `nd` / `dn` / `nz` 提示 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%tv` | `!pto.tensor_view<...>` | 逻辑视图描述符 | + +## 约束 + +!!! 
warning "约束" + - `%ptr` 的元素类型必须与结果元素类型一致。 + - `shape` 与 `strides` 的操作数数量必须等于 tensor_view 的 rank。 + - 若提供了 `layout` 而 shape/strides 是静态的,则推断出的布局必须与 `layout` 一致。 + +## 示例 + +```mlir +%tv = pto.make_tensor_view %ptr, shape = [%m, %n], strides = [%s0, %s1] + : !pto.tensor_view +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 查询维度尺寸:[pto.get_tensor_view_dim](./get-tensor-view-dim_zh.md) +- 查询步长:[pto.get_tensor_view_stride](./get-tensor-view-stride_zh.md) +- 取底层地址:[pto.tensor_view_addr](./tensor-view-addr_zh.md) +- 切分子视图:[pto.partition_view](./partition-view_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/partition-view.md b/docs/isa/tile/ops/view-and-tile-buf/partition-view.md new file mode 100644 index 000000000..5eb9d6881 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/partition-view.md @@ -0,0 +1,55 @@ +# pto.partition_view + +`pto.partition_view` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Carve a `!pto.partition_tensor_view<...>` out of a parent `!pto.tensor_view<...>` by specifying per-dimension offsets and sizes. Logical sub-window only — no allocation, no data movement. + +## Mechanism + +`result = source[offsets, sizes]`. The operation captures both static and dynamic shape information into the result partition descriptor. Downstream tile-level ops (e.g., `pto.tload`, `pto.tstore`) can consume the partition view directly without re-deriving offsets at every call site. + +The op is pure: it does not touch memory and does not change the parent view. + +## Syntax + +```mlir +%pv = pto.partition_view %tv, offsets = [%o0, %o1], sizes = [%s0, %s1] + : !pto.tensor_view<...> -> !pto.partition_tensor_view<...> +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%tv` | `!pto.tensor_view<...>` | Input tensor view. | +| `offsets` | `Variadic` | Dynamic offsets along each dimension. 
| +| `sizes` | `Variadic` | Dynamic sizes (extents) of the partition. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%pv` | `!pto.partition_tensor_view<...>` | Logical partition descriptor. | + +## Constraints + +!!! warning "Constraints" + - `offsets` and `sizes` operand counts MUST each match the rank of `%tv`. + - The op is pure; it does not allocate memory or move data. + - Out-of-bounds combinations of `offsets + sizes` against the parent shape are target-defined. + +## Examples + +```mlir +// 16x16 tile starting at (%off0, %off1) inside a 1024x512 view. +%pv = pto.partition_view %tv, offsets = [%off0, %off1], sizes = [%s0, %s1] + : !pto.tensor_view<1024x512xf16> -> !pto.partition_tensor_view<16x16xf16> +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Construct the parent view: [pto.make_tensor_view](./make-tensor-view.md) +- Extract the underlying address: [pto.tensor_view_addr](./tensor-view-addr.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/partition-view_zh.md b/docs/isa/tile/ops/view-and-tile-buf/partition-view_zh.md new file mode 100644 index 000000000..c6cba9a36 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/partition-view_zh.md @@ -0,0 +1,55 @@ +# pto.partition_view + +`pto.partition_view` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +按每维度的偏移和大小从父 `!pto.tensor_view<...>` 中切出一个 `!pto.partition_tensor_view<...>`,即一个逻辑子窗口——**不分配内存、不搬数据**。 + +## 机制 + +`result = source[offsets, sizes]`。这条操作把静态与动态形状信息一并捕获到结果分区描述符中。下游 tile 层操作(如 `pto.tload`、`pto.tstore`)可以直接消费这个分区视图,无需在每个调用点重新计算偏移。 + +纯操作:不访问内存、不改变父视图。 + +## 语法 + +```mlir +%pv = pto.partition_view %tv, offsets = [%o0, %o1], sizes = [%s0, %s1] + : !pto.tensor_view<...> -> !pto.partition_tensor_view<...> +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%tv` | `!pto.tensor_view<...>` | 输入张量视图 | +| `offsets` | `Variadic` | 
各维度的动态偏移 | +| `sizes` | `Variadic` | 分区在各维度上的动态大小 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%pv` | `!pto.partition_tensor_view<...>` | 逻辑分区描述符 | + +## 约束 + +!!! warning "约束" + - `offsets` 与 `sizes` 的操作数数量必须各自等于 `%tv` 的 rank。 + - 纯操作,不分配内存、不搬数据。 + - `offsets + sizes` 超出父视图形状的行为是 target-defined。 + +## 示例 + +```mlir +// 在 1024x512 视图内取一个 16x16 分区,起点为 (%off0, %off1) +%pv = pto.partition_view %tv, offsets = [%off0, %off1], sizes = [%s0, %s1] + : !pto.tensor_view<1024x512xf16> -> !pto.partition_tensor_view<16x16xf16> +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 构造父视图:[pto.make_tensor_view](./make-tensor-view_zh.md) +- 取底层地址:[pto.tensor_view_addr](./tensor-view-addr_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/set-validshape.md b/docs/isa/tile/ops/view-and-tile-buf/set-validshape.md new file mode 100644 index 000000000..5fd7289af --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/set-validshape.md @@ -0,0 +1,53 @@ +# pto.set_validshape + +`pto.set_validshape` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Update the runtime `v_row` / `v_col` metadata on an existing **dynamic** rank-2 tile buffer (allocated with `valid=?x?`). + +## Mechanism + +A tile buffer can be allocated with dynamic valid shape (`valid=?x?`). At allocation time the valid region is unspecified; `pto.set_validshape` writes the runtime `valid_row` / `valid_col` values into the descriptor so subsequent tile-level ops (loads, stores, compute) honor a well-defined valid window. + +The op updates metadata only — it does NOT move data and does NOT change the physical storage layout. The static `R x C` shape of the tile is unchanged; only the valid sub-region inside that shape is updated. 
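The metadata-only behavior described above can be sketched with a small Python model. `TileBuf` and `set_validshape` here are hypothetical stand-ins, not PTO APIs. The bounds check mirrors the rule the page states for constant operands; applying it to runtime values as well is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TileBuf:
    """Hypothetical model of a rank-2 tile buffer with a dynamic valid shape."""
    rows: int                          # static R
    cols: int                          # static C
    valid_row: Optional[int] = None    # None models an unspecified valid dim
    valid_col: Optional[int] = None
    data: List[List[int]] = field(default_factory=list)

    def __post_init__(self):
        # Fill the full static R x C storage when no data was supplied.
        if not self.data:
            self.data = [[0] * self.cols for _ in range(self.rows)]


def set_validshape(tile, valid_row, valid_col):
    """Update only the valid-shape metadata; storage is left untouched."""
    if not (0 <= valid_row <= tile.rows and 0 <= valid_col <= tile.cols):
        raise ValueError("valid shape must lie within the static tile shape")
    tile.valid_row = valid_row
    tile.valid_col = valid_col


tile = TileBuf(rows=16, cols=16)
before = [row[:] for row in tile.data]
set_validshape(tile, 10, 12)
print(tile.valid_row, tile.valid_col)   # 10 12
print((tile.rows, tile.cols))           # (16, 16)
print(tile.data == before)              # True
```

The static shape and the stored data are identical before and after the call; only the valid window changes, which is the property downstream loads, stores, and compute ops rely on.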
+ +## Syntax + +```mlir +pto.set_validshape %src, %valid_row, %valid_col : !pto.tile_buf +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%src` | `!pto.tile_buf` | Dynamic rank-2 tile buffer (both valid dims dynamic). | +| `%valid_row` | `index` | Runtime valid row count. | +| `%valid_col` | `index` | Runtime valid column count. | + +## Expected Outputs + +| Result | Type | Description | +| --- | --- | --- | +| None | `—` | This form has no SSA result; it updates the tile descriptor in place. | + +## Constraints + +!!! warning "Constraints" + - `%src` MUST be rank-2 and use `v_row = ?` and `v_col = ?` on both dimensions. + - Tile programs use `pto.tile_buf`; memref forms are a lowering artifact and are not part of this surface. + - Constant `valid_row` / `valid_col` MUST be non-negative and `<=` the tile's static shape bounds. + +## Examples + +```mlir +%src = pto.alloc_tile : !pto.tile_buf +pto.set_validshape %src, %vr, %vc : !pto.tile_buf +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Allocate the dynamic tile: [pto.alloc_tile](./alloc-tile.md) +- Carve a sub-region with known size: [pto.subset](./subset.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/set-validshape_zh.md b/docs/isa/tile/ops/view-and-tile-buf/set-validshape_zh.md new file mode 100644 index 000000000..aa5d0ca70 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/set-validshape_zh.md @@ -0,0 +1,53 @@ +# pto.set_validshape + +`pto.set_validshape` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +为已存在的**动态** rank-2 tile buffer(用 `valid=?x?` 分配的)写入运行时 `v_row` / `v_col` 元数据。 + +## 机制 + +tile buffer 可以以动态 valid 形状分配(`valid=?x?`),此时 valid 区域在分配时未定。`pto.set_validshape` 把运行时 `valid_row` / `valid_col` 写入描述符,使后续的 tile 层操作(load、store、compute)按一个定义良好的 valid 子区域工作。 + +本操作只更新元数据——**不搬数据、不改变物理存储布局**。tile 的静态 `R x C` 形状不变,只更新其中的 valid 子区。 + +## 语法 + +```mlir 
+pto.set_validshape %src, %valid_row, %valid_col : !pto.tile_buf +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src` | `!pto.tile_buf` | 动态 rank-2 tile buffer(两个 valid 维度都是动态的) | +| `%valid_row` | `index` | 运行时 valid row 数 | +| `%valid_col` | `index` | 运行时 valid col 数 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +| --- | --- | --- | +| 无 | `—` | 此形式无 SSA 结果,原地更新 tile 描述符 | + +## 约束 + +!!! warning "约束" + - `%src` 必须是 rank-2,并且两个维度都使用 `v_row = ?` / `v_col = ?`。 + - tile 程序使用 `pto.tile_buf`;memref 形式是 lowering 的产物,不属于本表面。 + - 常量 `valid_row` / `valid_col` 必须非负,且不超过 tile 静态形状的边界。 + +## 示例 + +```mlir +%src = pto.alloc_tile : !pto.tile_buf +pto.set_validshape %src, %vr, %vc : !pto.tile_buf +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 分配动态 tile:[pto.alloc_tile](./alloc-tile_zh.md) +- 在已知大小下切子区:[pto.subset](./subset_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/subset.md b/docs/isa/tile/ops/view-and-tile-buf/subset.md new file mode 100644 index 000000000..341e806e2 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/subset.md @@ -0,0 +1,63 @@ +# pto.subset + +`pto.subset` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Create a strided view of a parent tile buffer at runtime offsets with static sizes — `result = source[offsets] sizes [rows, cols]`. No data movement. + +## Mechanism + +`pto.subset` produces a child `!pto.tile_buf<...>` that aliases a sub-region of the parent. The runtime `%i`, `%j` give the top-left corner of the sub-region; the static `sizes` attribute fixes the result extents. + +Boxed-layout tile buffers (e.g., fractal NZ tiles on the cube path) carry extra alignment constraints derived from their inner box shape; for them, the subset must align with the box. Non-boxed layouts apply no additional structural checks beyond the parent's element type and address space. + +The op is pure: no allocation, no memory movement. 
+ +## Syntax + +```mlir +%sub = pto.subset %src[%i, %j] sizes [rows, cols] : !pto.tile_buf<...> +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%src` | `!pto.tile_buf<...>` | Parent tile buffer. | +| `offsets` | `Variadic` | Runtime offsets `[i, j]`. | +| `sizes` (attr) | `I64ArrayAttr` | Static shape `[rows, cols]`. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%sub` | `!pto.tile_buf<...>` | Strided sub-view of the parent. Element type, address space, and tile config are inherited from `%src`; `valid_shape` is derived from the parent valid shape and constant offsets when possible. | + +## Constraints + +!!! warning "Constraints" + - Boxed-vs-non-boxed behavior is derived from the source's tile config (`blayout`, `slayout`, `fractal`) and element type. + - For non-boxed layouts (`slayout=none_box`), no additional subset-specific structural checks are enforced. + - For boxed layouts: + - `sizes` MUST have length 2 and both subset sizes MUST be positive. + - Subset sizes MUST be multiples of the inferred inner boxed shape. + - `offsets` MUST have length 2; constant offsets MUST be non-negative and multiples of the inferred inner boxed shape. + - Source tile shape MUST be statically known. + - For boxed row-major tiles: subset MUST keep the full source column extent, and the column offset MUST be the constant `0`. + - For boxed col-major tiles: subset MUST keep the full source row extent, and the row offset MUST be the constant `0`. + - The inferred result reuses the source's element type, address space, and tile config. `valid_shape` is derived from the parent valid shape and constant offsets, or dynamic when offsets are dynamic. 
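The boxed-layout rules listed above can be read as a checklist. The following Python sketch encodes them as one, under stated assumptions: `check_boxed_subset` is a hypothetical helper rather than part of any PTO tool, the inner box shape is passed in explicitly (the real verifier infers it from the tile config and element type), and a runtime offset is modelled as `None`.

```python
def check_boxed_subset(src_shape, offsets, sizes, inner_box, row_major):
    """Return a list of violated rules for a boxed-layout subset (empty if legal).

    `offsets` entries are ints for constant offsets or None for runtime values.
    """
    errors = []
    if len(sizes) != 2 or any(s <= 0 for s in sizes):
        errors.append("sizes must have length 2 and be positive")
    if len(offsets) != 2:
        errors.append("offsets must have length 2")
    if errors:
        return errors

    for dim in range(2):
        if sizes[dim] % inner_box[dim] != 0:
            errors.append(f"size[{dim}] is not a multiple of the inner box")
        off = offsets[dim]
        # Only constant offsets can be checked statically.
        if off is not None and (off < 0 or off % inner_box[dim] != 0):
            errors.append(f"offset[{dim}] must be non-negative and box-aligned")

    # Row-major boxed tiles keep the full column extent (dim 1);
    # col-major boxed tiles keep the full row extent (dim 0).
    full = 1 if row_major else 0
    if sizes[full] != src_shape[full]:
        errors.append(f"subset must keep the full extent of dim {full}")
    if offsets[full] != 0:
        errors.append(f"offset[{full}] must be the constant 0")
    return errors


# Assumed 16x16 inner box on a 64x64 row-major boxed tile.
print(check_boxed_subset((64, 64), (16, 0), (32, 64), (16, 16), True))    # []
print(check_boxed_subset((64, 64), (None, 0), (32, 64), (16, 16), True))  # []
print(check_boxed_subset((64, 64), (0, 16), (32, 32), (16, 16), True))    # two violations
```

Returning a list of violations instead of raising on the first one is a design choice of the sketch: it makes it easy to see that the last call breaks both the full-column-extent rule and the zero-column-offset rule at once.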
+ +## Examples + +```mlir +%sub = pto.subset %src[%i, %j] sizes [32, 32] + : !pto.tile_buf +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Allocate the parent tile: [pto.alloc_tile](./alloc-tile.md) +- Tile-level reinterpretation (different op): [pto.subview](../sync-and-config/subview.md) +- Extract a pointer (vector scope): [pto.tile_buf_addr](./tile-buf-addr.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/subset_zh.md b/docs/isa/tile/ops/view-and-tile-buf/subset_zh.md new file mode 100644 index 000000000..cdc93bf90 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/subset_zh.md @@ -0,0 +1,63 @@ +# pto.subset + +`pto.subset` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +在运行时偏移、静态大小下创建父 tile buffer 的 strided 子视图:`result = source[offsets] sizes [rows, cols]`。**不搬数据。** + +## 机制 + +`pto.subset` 生成一个与父 tile 别名的子 `!pto.tile_buf<...>`。运行时 `%i`、`%j` 给出子区域的左上角,静态 `sizes` 属性固定结果尺寸。 + +带 box 布局的 tile buffer(例如 cube 路径上的 fractal NZ tile)从其内层 box 形状继承额外的对齐约束;对这类 tile,subset 必须与 box 对齐。非 box 布局除继承父 tile 的元素类型和地址空间外,不再施加额外结构性检查。 + +纯操作:不分配内存、不搬数据。 + +## 语法 + +```mlir +%sub = pto.subset %src[%i, %j] sizes [rows, cols] : !pto.tile_buf<...> +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src` | `!pto.tile_buf<...>` | 父 tile buffer | +| `offsets` | `Variadic` | 运行时偏移 `[i, j]` | +| `sizes`(属性) | `I64ArrayAttr` | 静态形状 `[rows, cols]` | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%sub` | `!pto.tile_buf<...>` | 父 tile 的 strided 子视图。元素类型、地址空间、tile config 从 `%src` 继承;常量偏移下 `valid_shape` 由父 valid 形状与偏移推导,动态偏移下结果 valid 形状为动态。 | + +## 约束 + +!!! 
warning "约束" + - box / 非 box 的行为由源 tile config(`blayout`、`slayout`、`fractal`)和元素类型决定。 + - 非 box 布局(`slayout=none_box`)下不施加额外的结构性检查。 + - box 布局下: + - `sizes` 长度必须为 2,且两个 subset 尺寸都必须为正。 + - subset 尺寸必须是推断出的内层 box 形状的整数倍。 + - `offsets` 长度必须为 2;常量偏移必须非负且为内层 box 形状的整数倍。 + - 源 tile 形状必须静态可知。 + - 对 box 行优先 tile:subset 必须保留源 tile 的完整列范围,且列偏移必须为常量 `0`。 + - 对 box 列优先 tile:subset 必须保留源 tile 的完整行范围,且行偏移必须为常量 `0`。 + - 结果继承源 tile 的元素类型、地址空间和 tile config。`valid_shape` 在常量偏移下由父 valid 形状与偏移推导,动态偏移下则保持动态。 + +## 示例 + +```mlir +%sub = pto.subset %src[%i, %j] sizes [32, 32] + : !pto.tile_buf +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 分配父 tile:[pto.alloc_tile](./alloc-tile_zh.md) +- tile 层的重解释(不同操作):[pto.subview](../sync-and-config/subview_zh.md) +- 取指针(向量作用域内):[pto.tile_buf_addr](./tile-buf-addr_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/tensor-view-addr.md b/docs/isa/tile/ops/view-and-tile-buf/tensor-view-addr.md new file mode 100644 index 000000000..defe4e1d0 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/tensor-view-addr.md @@ -0,0 +1,52 @@ +# pto.tensor_view_addr + +`pto.tensor_view_addr` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Extract the underlying address (as a typed PTO pointer or a memref view) from a `!pto.tensor_view<...>` or `!pto.partition_tensor_view<...>` descriptor. Pure op — does not move data. + +## Mechanism + +A tensor view carries both addressing metadata (shape, strides, base pointer) and a logical descriptor. `pto.tensor_view_addr` projects out the address side: it returns the same underlying storage exposed as either a typed GM pointer (`!pto.ptr`) or as a memref view. + +The op is pure. During compiler-internal lowering, the operand may already be rewritten to a memref form; in that case this op is folded away or rewritten to an equivalent memref-to-ptr cast. 
+ +## Syntax + +```mlir +%result = pto.tensor_view_addr %src : !pto.tensor_view<...> -> memref<...> +%result = pto.tensor_view_addr %src : !pto.tensor_view<...> -> !pto.ptr +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%src` | `!pto.tensor_view<...>` or `!pto.partition_tensor_view<...>` | Source view descriptor. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%result` | `memref<...>` or `!pto.ptr` | Underlying address in the requested form. | + +## Constraints + +!!! warning "Constraints" + - The result type MUST be either the lowered memref view or a GM pointer `!pto.ptr` to the same underlying storage. Other result types are rejected. + - The op is pure and does not move data. + +## Examples + +```mlir +// Extract a GM pointer from a tensor view, for use in DMA copy ops. +%base = pto.tensor_view_addr %tv : !pto.tensor_view -> !pto.ptr +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Construct a view: [pto.make_tensor_view](./make-tensor-view.md) +- Partition a view: [pto.partition_view](./partition-view.md) +- Sister op for tile-buffer addresses: [pto.tile_buf_addr](./tile-buf-addr.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/tensor-view-addr_zh.md b/docs/isa/tile/ops/view-and-tile-buf/tensor-view-addr_zh.md new file mode 100644 index 000000000..8422e1aa1 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/tensor-view-addr_zh.md @@ -0,0 +1,52 @@ +# pto.tensor_view_addr + +`pto.tensor_view_addr` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +从 `!pto.tensor_view<...>` 或 `!pto.partition_tensor_view<...>` 描述符中提取底层地址,返回类型化的 PTO 指针或 memref 视图。**纯操作,不搬数据。** + +## 机制 + +张量视图同时承载寻址元信息(形状、步长、基址)和逻辑描述。`pto.tensor_view_addr` 把其中的地址部分投影出来:以类型化 GM 指针(`!pto.ptr`)或 memref 视图的形式暴露同一底层存储。 + +本操作是纯操作。在编译器内部下沉过程中,操作数可能已被改写为 memref 形式;此时本 op 会被折叠或被改写成等价的 memref-to-ptr cast。 + 
+## 语法 + +```mlir +%result = pto.tensor_view_addr %src : !pto.tensor_view<...> -> memref<...> +%result = pto.tensor_view_addr %src : !pto.tensor_view<...> -> !pto.ptr +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%src` | `!pto.tensor_view<...>` 或 `!pto.partition_tensor_view<...>` | 源视图描述符 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%result` | `memref<...>` 或 `!pto.ptr` | 按所请求形式返回的底层地址 | + +## 约束 + +!!! warning "约束" + - 结果类型必须是下沉后的 memref 视图或指向同一底层存储的 GM 指针 `!pto.ptr`,其它类型不接受。 + - 纯操作,不搬数据。 + +## 示例 + +```mlir +// 从 tensor view 提取 GM 指针,用于 DMA 拷贝 +%base = pto.tensor_view_addr %tv : !pto.tensor_view -> !pto.ptr +``` + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 构造视图:[pto.make_tensor_view](./make-tensor-view_zh.md) +- 切分子视图:[pto.partition_view](./partition-view_zh.md) +- tile-buffer 取地址的姊妹操作:[pto.tile_buf_addr](./tile-buf-addr_zh.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/tile-buf-addr.md b/docs/isa/tile/ops/view-and-tile-buf/tile-buf-addr.md new file mode 100644 index 000000000..a62318b70 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/tile-buf-addr.md @@ -0,0 +1,71 @@ +# pto.tile_buf_addr + +`pto.tile_buf_addr` is part of the [View and Tile Buffer](../../view-and-tile-buf.md) instruction set. + +## Summary + +Extract the data-region address of a `!pto.tile_buf<...>` as either a typed PTO pointer (`!pto.ptr`) or a memref view. **This op is the boundary between tile-buffer instructions and pointer-based vector instructions.** + +## Mechanism + +Inside a `pto.vecscope` / `pto.strict_vecscope` body, vector load/store ops (`pto.vlds`, `pto.vsts`, etc.) consume typed pointers, not tile handles. `pto.tile_buf_addr` materializes a `vec`-space pointer (or memref) from a tile handle allocated outside the scope so vector-scope code can read and write the same on-chip data the tile-level code prepared. 
+ +The op is pure: it does not move data, does not allocate, and does not participate in pipeline synchronization. During lowering it typically becomes a no-op or an attribute-driven address constant. + +## Syntax + +```mlir +%ub_ptr = pto.tile_buf_addr %tile : !pto.tile_buf<...> -> !pto.ptr +%ub_ref = pto.tile_buf_addr %tile : !pto.tile_buf<...> -> memref<...> +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%tile` | `!pto.tile_buf<...>` (or tile-bound memref form) | Tile handle whose data-region address is taken. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%ub_ptr` / `%ub_ref` | `!pto.ptr` or `memref<...>` | Typed pointer (e.g., `!pto.ptr`) or memref view of the tile's data region. Memref results use the tile's static shape and address space; pointer results use the tile's element type and memory space. | + +## Constraints + +!!! warning "Constraints" + - Result MUST be either a typed PTO pointer or a memref view; no other result types are accepted. + - When a memref result is requested, the lowered form uses the tile's static shape and address space. + - `pto.tile_buf_addr` is **only legal inside `pto.vecscope` / `pto.strict_vecscope`**. + - Outside a vector scope, tile handles MUST be consumed by tile-level ops (`pto.tload`, `pto.tstore`, `pto.tadd`, …) rather than by address extraction. + - Conversely, tile-level ops MUST NOT appear inside `pto.vecscope`. + +## Examples + +```mlir +%tile = pto.alloc_tile addr = %c0_i64 valid_row = %r + : !pto.tile_buf + +pto.vecscope { + %ub = pto.tile_buf_addr %tile + : !pto.tile_buf -> !pto.ptr + // ... vector-scope loads/stores on %ub ... +} +``` + +## Relationship to Tile vs Micro Surfaces + +| Surface | Consumes | Bridge | +|---|---|---| +| **Tile** (`pto.t*`) | `!pto.tile_buf<...>` | — | +| **Micro / vector** (`pto.v*`, `pto.vlds`, `pto.vsts`) | `!pto.ptr` | `pto.tile_buf_addr` | + +The micro side is fenced by `pto.vecscope`. 
Inside that scope, `pto.tile_buf_addr` is the only legal way to obtain a pointer from a tile handle. Outside the scope, vector ops are illegal and tile ops own the tile handle exclusively. + +## Related Ops / Instruction Set Links + +- Instruction set overview: [View and Tile Buffer](../../view-and-tile-buf.md) +- Allocate the tile: [pto.alloc_tile](./alloc-tile.md) +- Vector execution scope: [pto.vecscope](../../../scalar/ops/micro-instruction/vecscope.md) +- Vector loads/stores that consume the resulting pointer: [Vector Load Store](../../../vector/vector-load-store.md) +- Sister op for tensor views: [pto.tensor_view_addr](./tensor-view-addr.md) diff --git a/docs/isa/tile/ops/view-and-tile-buf/tile-buf-addr_zh.md b/docs/isa/tile/ops/view-and-tile-buf/tile-buf-addr_zh.md new file mode 100644 index 000000000..0ff3a5b64 --- /dev/null +++ b/docs/isa/tile/ops/view-and-tile-buf/tile-buf-addr_zh.md @@ -0,0 +1,71 @@ +# pto.tile_buf_addr + +`pto.tile_buf_addr` 属于 [视图与 Tile Buffer](../../view-and-tile-buf_zh.md) 指令集。 + +## 摘要 + +把 `!pto.tile_buf<...>` 的数据区地址提取为类型化的 PTO 指针(`!pto.ptr`)或 memref 视图。**这条指令是 tile-buffer 指令与基于指针的向量指令之间的桥梁。** + +## 机制 + +在 `pto.vecscope` / `pto.strict_vecscope` 体内,向量加载/存储操作(`pto.vlds`、`pto.vsts` 等)消费的是类型化指针,而不是 tile 句柄。`pto.tile_buf_addr` 从作用域外分配的 tile 句柄中物化出一个 `vec` 地址空间的指针(或 memref),让向量作用域代码可以读写 tile 层准备好的同一片片上数据。 + +纯操作:不搬数据、不分配、不参与流水线同步。下沉时通常变成 no-op 或一条由属性驱动的地址常量。 + +## 语法 + +```mlir +%ub_ptr = pto.tile_buf_addr %tile : !pto.tile_buf<...> -> !pto.ptr +%ub_ref = pto.tile_buf_addr %tile : !pto.tile_buf<...> -> memref<...> +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%tile` | `!pto.tile_buf<...>`(或绑定到 tile 的 memref 形式) | 要取数据区地址的 tile 句柄 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%ub_ptr` / `%ub_ref` | `!pto.ptr` 或 `memref<...>` | tile 数据区的类型化指针(如 `!pto.ptr`)或 memref 视图。memref 结果使用 tile 的静态形状与地址空间;指针结果使用 tile 的元素类型与内存空间。 | + +## 约束 + +!!! 
warning "约束" + - 结果必须是类型化 PTO 指针或 memref 视图,其它结果类型不接受。 + - 当请求 memref 结果时,下沉形式使用 tile 的静态形状与地址空间。 + - `pto.tile_buf_addr` **只在 `pto.vecscope` / `pto.strict_vecscope` 内合法**。 + - 在向量作用域外,tile 句柄**必须**由 tile 层操作(`pto.tload`、`pto.tstore`、`pto.tadd` 等)消费,不能通过取地址来用。 + - 反之,tile 层操作**不得**出现在 `pto.vecscope` 内。 + +## 示例 + +```mlir +%tile = pto.alloc_tile addr = %c0_i64 valid_row = %r + : !pto.tile_buf + +pto.vecscope { + %ub = pto.tile_buf_addr %tile + : !pto.tile_buf -> !pto.ptr + // ... 向量作用域内的 load / store 在 %ub 上进行 ... +} +``` + +## 与 Tile / 微指令两个表面的关系 + +| 表面 | 消费什么 | 桥梁 | +|---|---|---| +| **Tile**(`pto.t*`) | `!pto.tile_buf<...>` | — | +| **微指令 / 向量**(`pto.v*`、`pto.vlds`、`pto.vsts`) | `!pto.ptr` | `pto.tile_buf_addr` | + +微指令侧由 `pto.vecscope` 包裹。在该作用域内,`pto.tile_buf_addr` 是从 tile 句柄拿到指针的**唯一**合法方式。作用域外,向量操作非法,tile 句柄完全由 tile 层操作所有。 + +## 相关页面 + +- 指令集总览:[视图与 Tile Buffer](../../view-and-tile-buf_zh.md) +- 分配 tile:[pto.alloc_tile](./alloc-tile_zh.md) +- 向量执行作用域:[pto.vecscope](../../../scalar/ops/micro-instruction/vecscope_zh.md) +- 消费该指针的向量加载/存储:[向量加载存储](../../../vector/vector-load-store_zh.md) +- tensor view 取地址姊妹操作:[pto.tensor_view_addr](./tensor-view-addr_zh.md) diff --git a/docs/isa/tile/view-and-tile-buf.md b/docs/isa/tile/view-and-tile-buf.md new file mode 100644 index 000000000..eca71e863 --- /dev/null +++ b/docs/isa/tile/view-and-tile-buf.md @@ -0,0 +1,40 @@ +# View and Tile Buffer + +The view-and-tile-buffer operations are the foundation of the PTO tile programming model. They cover four concerns: + +1. **Build descriptors** for global tensors (shape + strides + base pointer) — `pto.make_tensor_view` +2. **Query descriptors** for runtime shape/stride information — `pto.get_tensor_view_dim`, `pto.get_tensor_view_stride` +3. **Partition** a global descriptor into logical sub-windows — `pto.partition_view` +4. 
**Manage on-chip tile buffers** (allocate, sub-set, set valid shape, extract pointer) — `pto.alloc_tile`, `pto.subset`, `pto.set_validshape`, `pto.tile_buf_addr`, `pto.tensor_view_addr` + +All these ops are pure descriptor/handle manipulation: none moves data, allocates memory at runtime, or participates in pipeline synchronization. They establish the addressing and lifetime contract that the tile compute ops (`pto.tload`, `pto.tstore`, `pto.tadd`, `pto.tmatmul`, …) and the vector micro ops (`pto.vlds`, `pto.vsts`, …) consume. + +## Per-Op Pages + +### Tensor View — Global Memory Descriptors + +- [pto.make_tensor_view](./ops/view-and-tile-buf/make-tensor-view.md) — Build a tensor view from a pointer, shape, and strides +- [pto.get_tensor_view_dim](./ops/view-and-tile-buf/get-tensor-view-dim.md) — Read a dimension extent +- [pto.get_tensor_view_stride](./ops/view-and-tile-buf/get-tensor-view-stride.md) — Read an element-stride +- [pto.tensor_view_addr](./ops/view-and-tile-buf/tensor-view-addr.md) — Project the underlying address (memref or `!pto.ptr`) +- [pto.partition_view](./ops/view-and-tile-buf/partition-view.md) — Carve a partition window from a tensor view + +### Tile Buffer — On-Chip Storage + +- [pto.alloc_tile](./ops/view-and-tile-buf/alloc-tile.md) — Declare a new tile buffer lifetime +- [pto.subset](./ops/view-and-tile-buf/subset.md) — Strided sub-region of a parent tile +- [pto.set_validshape](./ops/view-and-tile-buf/set-validshape.md) — Update runtime valid shape on a dynamic tile +- [pto.tile_buf_addr](./ops/view-and-tile-buf/tile-buf-addr.md) — **Tile↔vector bridge:** extract a typed pointer inside `pto.vecscope` + +## Tile ↔ Vector Bridge + +The `!pto.tile_buf<...>` type belongs to the tile surface. Vector micro instructions consume typed pointers `!pto.ptr`. 
The only legal bridge between the two surfaces is [`pto.tile_buf_addr`](./ops/view-and-tile-buf/tile-buf-addr.md), and it is only valid **inside** a [`pto.vecscope`](../scalar/ops/micro-instruction/vecscope.md) region. Outside `pto.vecscope`, tile handles can only be passed to tile-level ops; inside `pto.vecscope`, tile-level ops are illegal and the vector-scope code must work through the pointer obtained from `pto.tile_buf_addr`. + +This split is what makes the two surfaces composable without ambiguity. + +## Related Material + +- [Tile ISA Reference](./README.md) — Tile instruction inventory +- [Memory and Data Movement](./memory-and-data-movement.md) — Tile-level GM ↔ tile DMA +- [Vector Execution Scope (`pto.vecscope`)](../scalar/ops/micro-instruction/vecscope.md) — Where `pto.tile_buf_addr` is legal +- [Pointer Operations](../scalar/ops/micro-instruction/pointer-operations.md) — `pto.addptr` / `pto.castptr` / `pto.load_scalar` / `pto.store_scalar` diff --git a/docs/isa/tile/view-and-tile-buf_zh.md b/docs/isa/tile/view-and-tile-buf_zh.md new file mode 100644 index 000000000..207b1269d --- /dev/null +++ b/docs/isa/tile/view-and-tile-buf_zh.md @@ -0,0 +1,40 @@ +# 视图与 Tile Buffer + +视图与 tile buffer 操作是 PTO tile 编程模型的底层基础,覆盖四类职责: + +1. **构造描述符** —— 为全局张量构造形状 + 步长 + 基址描述符 `pto.make_tensor_view` +2. **查询描述符** —— 读取运行时形状/步长 `pto.get_tensor_view_dim`、`pto.get_tensor_view_stride` +3. **切分子视图** —— 把全局描述符切成逻辑子窗口 `pto.partition_view` +4. 
**管理片上 tile buffer** —— 分配、取子区、设置 valid 形状、取指针 `pto.alloc_tile`、`pto.subset`、`pto.set_validshape`、`pto.tile_buf_addr`、`pto.tensor_view_addr` + +这些都是纯描述符/句柄操作:不搬数据、不在运行时分配内存、不参与流水线同步。它们建立了 tile 计算操作(`pto.tload`、`pto.tstore`、`pto.tadd`、`pto.tmatmul` 等)和向量微指令(`pto.vlds`、`pto.vsts` 等)共同依赖的寻址与生命周期契约。 + +## per-op 页面 + +### Tensor View —— 全局内存描述符 + +- [pto.make_tensor_view](./ops/view-and-tile-buf/make-tensor-view_zh.md):由指针、形状、步长构造张量视图 +- [pto.get_tensor_view_dim](./ops/view-and-tile-buf/get-tensor-view-dim_zh.md):读维度尺寸 +- [pto.get_tensor_view_stride](./ops/view-and-tile-buf/get-tensor-view-stride_zh.md):读元素步长 +- [pto.tensor_view_addr](./ops/view-and-tile-buf/tensor-view-addr_zh.md):投影底层地址(memref 或 `!pto.ptr`) +- [pto.partition_view](./ops/view-and-tile-buf/partition-view_zh.md):从张量视图切出分区窗口 + +### Tile Buffer —— 片上存储 + +- [pto.alloc_tile](./ops/view-and-tile-buf/alloc-tile_zh.md):声明新的 tile buffer 生命周期 +- [pto.subset](./ops/view-and-tile-buf/subset_zh.md):从父 tile 切出 strided 子区 +- [pto.set_validshape](./ops/view-and-tile-buf/set-validshape_zh.md):为动态 tile 设置运行时 valid 形状 +- [pto.tile_buf_addr](./ops/view-and-tile-buf/tile-buf-addr_zh.md):**tile↔向量桥梁**——在 `pto.vecscope` 内取类型化指针 + +## Tile ↔ 向量桥梁 + +`!pto.tile_buf<...>` 类型属于 tile 表面,向量微指令消费的是类型化指针 `!pto.ptr`。两个表面之间**唯一**合法的桥梁是 [`pto.tile_buf_addr`](./ops/view-and-tile-buf/tile-buf-addr_zh.md),并且**只在** [`pto.vecscope`](../scalar/ops/micro-instruction/vecscope_zh.md) 区域内有效。在 `pto.vecscope` 外,tile 句柄只能传给 tile 层操作;在 `pto.vecscope` 内,tile 层操作非法,向量作用域代码必须通过 `pto.tile_buf_addr` 拿到的指针来工作。 + +这种拆分让两个表面可以无歧义地组合在一起。 + +## 相关页面 + +- [Tile ISA 参考](./README_zh.md):tile 指令清单 +- [内存与数据搬运](./memory-and-data-movement_zh.md):tile 层 GM ↔ tile DMA +- [向量执行作用域 (`pto.vecscope`)](../scalar/ops/micro-instruction/vecscope_zh.md):`pto.tile_buf_addr` 合法的作用域 +- [指针操作](../scalar/ops/micro-instruction/pointer-operations_zh.md):`pto.addptr` / `pto.castptr` / `pto.load_scalar` / `pto.store_scalar` diff --git a/docs/isa/vector/README.md 
b/docs/isa/vector/README.md index 702f8c41d..0b93a9fa2 100644 --- a/docs/isa/vector/README.md +++ b/docs/isa/vector/README.md @@ -24,7 +24,7 @@ The `pto.v*` vector micro-instruction set of PTO ISA is organized by instruction | Type | Description | |------|-------------| | `!pto.vreg` | Vector register with N lanes of type T | -| `!pto.mask` | Predicate mask (width matches vector length) | +| `!pto.mask` | Predicate mask (width matches vector length) | | `!pto.scalar` | Scalar register | ### Vector Lengths diff --git a/docs/isa/vector/binary-vector-ops.md b/docs/isa/vector/binary-vector-ops.md index 980e07a13..bddc57ece 100644 --- a/docs/isa/vector/binary-vector-ops.md +++ b/docs/isa/vector/binary-vector-ops.md @@ -36,11 +36,11 @@ pto.rls_buf "PIPE_MTE2", %bufid, %c0 : i64, i64 pto.get_buf "PIPE_V", %bufid, %c0 : i64, i64 pto.vecscope { scf.for %offset = %c0 to %N step %c64 iter_args(%remaining = %N_i32) -> (i32) { - %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 + %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 %lhs = pto.vlds %ub_a[%offset] : !pto.ptr -> !pto.vreg<64xf32> %rhs = pto.vlds %ub_b[%offset] : !pto.ptr -> !pto.vreg<64xf32> - %out = pto.vadd %lhs, %rhs, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %out = pto.vadd %lhs, %rhs, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask scf.yield %next : i32 } } @@ -98,7 +98,7 @@ total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × i ### `pto.vadd` -- **syntax:** `%result = pto.vadd %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vadd %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VADD`; **Latency:** 7 (f32/f16), 7 (i32/i16/i8) - **A2/A3 throughput:** 2 
cycles/repeat; **interval:** 18 cycles @@ -115,7 +115,7 @@ for (int i = 0; i < N; i++) ### `pto.vsub` -- **syntax:** `%result = pto.vsub %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vsub %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VSUB`; **Latency:** 7 (f32/f16), 7 (i32/i16/i8) - **A2/A3 throughput:** 2 cycles/repeat; **interval:** 18 cycles @@ -132,7 +132,7 @@ for (int i = 0; i < N; i++) ### `pto.vmul` -- **syntax:** `%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VMUL`; **Latency:** 8 (f32/f16), 8 (i32/i16) - **A2/A3 throughput:** 2 cycles/repeat; **interval:** 18 cycles @@ -149,7 +149,7 @@ for (int i = 0; i < N; i++) ### `pto.vdiv` -- **syntax:** `%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VDIV`; **Latency:** 17 (f32), 22 (f16) - **A2/A3 throughput:** 2 cycles/repeat (f32), 4 cycles/repeat (f16); **interval:** 18 cycles @@ -167,7 +167,7 @@ for (int i = 0; i < N; i++) ### `pto.vmax` -- **syntax:** `%result = pto.vmax %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vmax %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VMAX`; **Latency:** 7 (f32/f16), 7 (i32/i16/i8) - **A2/A3 throughput:** 2 cycles/repeat; **interval:** 18 cycles @@ -184,7 +184,7 @@ for (int i = 0; i < N; i++) ### `pto.vmin` -- **syntax:** `%result = pto.vmin %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vmin %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VMIN`; **Latency:** 7 (f32/f16), 7 (i32/i16/i8) - **A2/A3 throughput:** 2
cycles/repeat; **interval:** 18 cycles @@ -203,7 +203,7 @@ for (int i = 0; i < N; i++) ### `pto.vand` -- **syntax:** `%result = pto.vand %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vand %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VAND`; **Latency:** 7 (integer types) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -220,7 +220,7 @@ for (int i = 0; i < N; i++) ### `pto.vor` -- **syntax:** `%result = pto.vor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VOR`; **Latency:** 7 (integer types) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -237,7 +237,7 @@ for (int i = 0; i < N; i++) ### `pto.vxor` -- **syntax:** `%result = pto.vxor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vxor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VXOR`; **Latency:** 7 (integer types) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -256,7 +256,7 @@ for (int i = 0; i < N; i++) ### `pto.vshl` -- **syntax:** `%result = pto.vshl %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vshl %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VSHL`; **Latency:** 7 (integer types) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -273,7 +273,7 @@ for (int i = 0; i < N; i++) ### `pto.vshr` -- **syntax:** `%result = pto.vshr %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vshr %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VSHR`; **Latency:** 7 (integer types) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -292,7 +292,7 @@ for (int i = 0; i < N; 
i++) ### `pto.vaddc` -- **syntax:** `%result, %carry = pto.vaddc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` +- **syntax:** `%result, %carry = pto.vaddc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` - **A5 RV:** `RV_VADDC`; **Latency:** 7 (i32, unsigned carry semantics) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -313,7 +313,7 @@ for (int i = 0; i < N; i++) { ### `pto.vsubc` -- **syntax:** `%result, %borrow = pto.vsubc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` +- **syntax:** `%result, %borrow = pto.vsubc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` - **A5 RV:** `RV_VSUBC`; **Latency:** 7 (i32, unsigned borrow semantics) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -335,15 +335,15 @@ for (int i = 0; i < N; i++) { ```mlir // Vector addition -%sum = pto.vadd %a, %b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%sum = pto.vadd %a, %b, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Element-wise multiply -%prod = pto.vmul %x, %y, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%prod = pto.vmul %x, %y, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Clamp to range [min, max] -%clamped_low = pto.vmax %input, %min_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> -%clamped = pto.vmin %clamped_low, %max_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%clamped_low = pto.vmax %input, %min_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%clamped = pto.vmin %clamped_low, %max_vec, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Bit manipulation -%masked = pto.vand %data, %bitmask, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> 
!pto.vreg<64xi32> +%masked = pto.vand %data, %bitmask, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32> ``` diff --git a/docs/isa/vector/binary-vector-ops_zh.md b/docs/isa/vector/binary-vector-ops_zh.md index b8c105d77..f7b99f858 100644 --- a/docs/isa/vector/binary-vector-ops_zh.md +++ b/docs/isa/vector/binary-vector-ops_zh.md @@ -33,12 +33,12 @@ pto.rls_buf "PIPE_MTE2", %bufid, %c0 : i64, i64 pto.get_buf "PIPE_V", %bufid, %c0 : i64, i64 pto.vecscope { scf.for %offset = %c0 to %N step %c64 iter_args(%remaining = %N_i32) -> (i32) { - %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 + %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 %lhs = pto.vlds %ub_a[%offset] : !pto.ptr -> !pto.vreg<64xf32> %rhs = pto.vlds %ub_b[%offset] : !pto.ptr -> !pto.vreg<64xf32> %out = pto.vadd %lhs, %rhs, %mask - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask scf.yield %next : i32 } } @@ -90,7 +90,7 @@ total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × i ### `pto.vadd` -- **语法:** `%result = pto.vadd %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vadd %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 相加。 ```c @@ -100,33 +100,33 @@ for (int i = 0; i < N; i++) ### `pto.vsub` -- **语法:** `%result = pto.vsub %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vsub %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 相减。 `%lhs` 是被减数,`%rhs` 是减数。 ### `pto.vmul` -- **语法:** `%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, 
!pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 相乘。 当前 A5 文档没有把 `i8/u8` 形式纳入这条指令的常规范畴。整数溢出的精确行为由目标平台决定。 ### `pto.vdiv` -- **语法:** `%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 相除。 这是最贵的常见二元运算之一。A5 上 f32 需要 17 周期,f16 需要 22 周期,显著高于乘法。如果精度允许,更推荐通过倒数与乘法来近似。 ### `pto.vmax` -- **语法:** `%result = pto.vmax %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmax %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 取较大值。 ### `pto.vmin` -- **语法:** `%result = pto.vmin %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmin %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 取较小值。 --- @@ -135,17 +135,17 @@ for (int i = 0; i < N; i++) ### `pto.vand` -- **语法:** `%result = pto.vand %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vand %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 按位与。 ### `pto.vor` -- **语法:** `%result = pto.vor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 按位或。 ### `pto.vxor` -- **语法:** `%result = pto.vxor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vxor %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 按位异或。 这三条都只对整数元素类型合法。 @@ -156,14 +156,14 @@ for (int i = 0; i < N; i++) ### `pto.vshl` -- **语法:** `%result = pto.vshl %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vshl %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 左移。 这里右操作数 `%rhs` 不是一个统一的立即数,而是“每个 lane 自带一个位移量”的第二个向量寄存器。 ### `pto.vshr` -- **语法:** 
`%result = pto.vshr %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vshr %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 右移。 对有符号整数通常是算术右移,对无符号整数通常是逻辑右移;真正的行为由元素类型的 signedness 决定。 @@ -176,7 +176,7 @@ for (int i = 0; i < N; i++) ### `pto.vaddc` -- **语法:** `%result, %carry = pto.vaddc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` +- **语法:** `%result, %carry = pto.vaddc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` - **语义:** 带 carry-out 的逐 lane 加法。 ```c @@ -191,7 +191,7 @@ for (int i = 0; i < N; i++) { ### `pto.vsubc` -- **语法:** `%result, %borrow = pto.vsubc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` +- **语法:** `%result, %borrow = pto.vsubc %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.mask` - **语义:** 带 borrow-out 的逐 lane 减法。 ```c @@ -209,18 +209,18 @@ for (int i = 0; i < N; i++) { ```mlir %sum = pto.vadd %a, %b, %mask - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %prod = pto.vmul %x, %y, %mask - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %clamped_low = pto.vmax %input, %min_vec, %mask - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %clamped = pto.vmin %clamped_low, %max_vec, %mask - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %masked = pto.vand %data, %bitmask, %mask - : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32> + : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32> ``` --- diff --git a/docs/isa/vector/compare-select.md 
b/docs/isa/vector/compare-select.md index 41df5aa3f..7cc2fa803 100644 --- a/docs/isa/vector/compare-select.md +++ b/docs/isa/vector/compare-select.md @@ -22,7 +22,7 @@ Operations that compare vectors and conditionally select elements. ### `pto.vcmp` -- **syntax:** `%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg, !pto.vreg, !pto.mask -> !pto.mask` +- **syntax:** `%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg, !pto.vreg, !pto.mask -> !pto.mask` - **semantics:** Element-wise comparison, output predicate mask. ```c @@ -44,8 +44,8 @@ for (int i = 0; i < N; i++) **Example:** ```mlir -%all_active = pto.pset_b32 "PAT_ALL" : !pto.mask -%lt_mask = pto.vcmp %a, %b, %all_active, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%all_active = pto.pset_b32 "PAT_ALL" : !pto.mask +%lt_mask = pto.vcmp %a, %b, %all_active, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask // lt_mask[i] = 1 if a[i] < b[i] ``` @@ -60,7 +60,7 @@ for (int i = 0; i < N; i++) ### `pto.vcmps` -- **syntax:** `%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg, T, !pto.mask -> !pto.mask` +- **syntax:** `%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg, T, !pto.mask -> !pto.mask` - **semantics:** Compare vector against scalar. ```c @@ -72,7 +72,7 @@ for (int i = 0; i < N; i++) **Example:** ```mlir %positive_mask = pto.vcmps %values, %c0_f32, %all_active, "gt" - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask // positive_mask[i] = 1 if values[i] > 0 ``` @@ -88,7 +88,7 @@ for (int i = 0; i < N; i++) ### `pto.vsel` -- **syntax:** `%result = pto.vsel %src0, %src1, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vsel %src0, %src1, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **semantics:** Per-lane select based on mask. ```c @@ -100,7 +100,7 @@ for (int i = 0; i < N; i++) ```mlir // dst = mask ? 
true_vals : false_vals %result = pto.vsel %true_vals, %false_vals, %condition - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` - **inputs:** `%src0` is the true-path vector, `%src1` is the false-path vector, @@ -144,18 +144,18 @@ for (int i = 0; i < N; i++) ```mlir // Clamp negative values to zero (manual ReLU) -%all = pto.pset_b32 "PAT_ALL" : !pto.mask +%all = pto.pset_b32 "PAT_ALL" : !pto.mask %zero = pto.vbr %c0_f32 : f32 -> !pto.vreg<64xf32> -%neg_mask = pto.vcmps %input, %c0_f32, %all, "lt" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask -%clamped = pto.vsel %zero, %input, %neg_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%neg_mask = pto.vcmps %input, %c0_f32, %all, "lt" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask +%clamped = pto.vsel %zero, %input, %neg_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Element-wise max via compare+select -%gt_mask = pto.vcmp %a, %b, %all, "gt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask -%max_ab = pto.vsel %a, %b, %gt_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%gt_mask = pto.vcmp %a, %b, %all, "gt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%max_ab = pto.vsel %a, %b, %gt_mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Threshold filter -%above_thresh = pto.vcmps %scores, %threshold, %all, "ge" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask -%filtered = pto.vsel %scores, %zero, %above_thresh : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%above_thresh = pto.vcmps %scores, %threshold, %all, "ge" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask +%filtered = pto.vsel %scores, %zero, %above_thresh : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` --- @@ -166,16 +166,16 @@ for (int i = 0; i < N; i++) // Softmax 
safe exp: exp(x - max) where x < max returns exp of negative // but we want to clamp to avoid underflow -%all = pto.pset_b32 "PAT_ALL" : !pto.mask +%all = pto.pset_b32 "PAT_ALL" : !pto.mask // 1. Compare against threshold %too_small = pto.vcmps %x_minus_max, %min_exp_arg, %all, "lt" - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask // 2. Clamp values below threshold %clamped = pto.vsel %min_exp_arg_vec, %x_minus_max, %too_small - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // 3. Safe exp -%exp_result = pto.vexp %clamped, %all : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%exp_result = pto.vexp %clamped, %all : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` diff --git a/docs/isa/vector/conversion-ops.md b/docs/isa/vector/conversion-ops.md index a260897f2..1010241de 100644 --- a/docs/isa/vector/conversion-ops.md +++ b/docs/isa/vector/conversion-ops.md @@ -113,7 +113,7 @@ For conversions that change width (e.g., f32→f16), use even/odd parts and comb : !pto.vreg<64xf32> -> !pto.vreg<128xf16> %odd = pto.vcvt %in1 {round_mode = "ROUND_R", sat = "RS_ENABLE", part = "PART_ODD"} : !pto.vreg<64xf32> -> !pto.vreg<128xf16> -%result = pto.vor %even, %odd, %mask : !pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask -> !pto.vreg<128xf16> +%result = pto.vor %even, %odd, %mask : !pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask -> !pto.vreg<128xf16> ``` --- @@ -148,11 +148,40 @@ for (int i = 0; i < N; i++) --- +## `pto.vbitcast` + +- **syntax:** `%result = pto.vbitcast %input : !pto.vreg -> !pto.vreg` +- **semantics:** Bitwise reinterpretation of a vreg vector without changing the underlying bit pattern. Performs a pure type cast that preserves the exact bits of each element, changing only their interpretation (for example, from floating-point to integer). +- **inputs:** `%input` is the source vector register value. 
+- **outputs:** `%result` is the reinterpreted vector register value. +- **constraints and limitations:** + - Both source and result must be `!pto.vreg<...>` types. + - Source and result vectors must have the same total bit width (currently 2048 bits): `N * bitwidth(T0) = M * bitwidth(T1) = 2048`. + - Only integer and floating-point element types are supported. + +See [`pto.vbitcast`](./ops/conversion-ops/vbitcast.md) for full details, type-pair examples, and the comparison with `pto.vcvt`. + +--- + +## `pto.pbitcast` + +- **syntax:** `%result = pto.pbitcast %input : !pto.mask -> !pto.mask` +- **semantics:** Bitwise reinterpretation of a predicate register without changing the underlying predicate-register image. Makes mask-family granularity reinterpretation explicit in VPTO IR when a producer and consumer expect different `!pto.mask<...>` views of the same hardware predicate state. +- **inputs:** `%input` is the source predicate register value. +- **outputs:** `%result` is the reinterpreted predicate register value. +- **constraints and limitations:** + - Both source and result must be `!pto.mask<...>` types. + - `pto.pbitcast` does not materialize or normalize predicate contents; it only changes which mask granularity the surrounding VPTO IR uses to interpret the same predicate bits. + +See [`pto.pbitcast`](./ops/conversion-ops/pbitcast.md) for full details and examples. 
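The bit-preserving behaviour described for `pto.vbitcast` can be modelled on the host with a rough sketch like the one below. It is not part of the PTO toolchain: the helper names (`lanes_for`, `bitcast_shapes_match`, `f32_bits`, `bitcast_self_check`) are hypothetical, and the only facts taken from the text are the 2048-bit register width and the rule `N * bitwidth(T0) = M * bitwidth(T1) = 2048`. The single-lane helper assumes the host uses IEEE-754 binary32 for `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Total vector-register width quoted in the text, in bits. */
#define VREG_BITS 2048u

/* Hypothetical helper: lane count for an element width,
 * or 0 when the width does not evenly divide the register. */
static unsigned lanes_for(unsigned elem_bits) {
    if (elem_bits == 0 || VREG_BITS % elem_bits != 0) {
        return 0;
    }
    return VREG_BITS / elem_bits;
}

/* Hypothetical helper: a source/result pair is a plausible bitcast
 * when both sides describe exactly the same 2048 bits. */
static int bitcast_shapes_match(unsigned n, unsigned bits0,
                                unsigned m, unsigned bits1) {
    return n * bits0 == VREG_BITS && m * bits1 == VREG_BITS;
}

/* Reinterpret one f32 lane as a 32-bit integer without changing its bits.
 * memcpy is the portable way to do this in C; a value cast would convert. */
static uint32_t f32_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Small self-check tying the helpers together. */
static void bitcast_self_check(void) {
    assert(lanes_for(32) == 64);                   /* e.g. 64 x f32  */
    assert(lanes_for(16) == 128);                  /* e.g. 128 x f16 */
    assert(bitcast_shapes_match(64, 32, 128, 16)); /* same 2048 bits */
    assert(f32_bits(1.0f) == 0x3F800000u);         /* IEEE-754 1.0f  */
}
```

The contrast with `pto.vcvt` is the same as the contrast between `memcpy` and a C value cast: the former keeps the bit pattern and changes its interpretation, the latter computes a new value in the target type.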
+ +--- + ## Typical Usage ```mlir // Quantization: f32 → i8 with saturation -%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> %quantized = pto.vcvt %scaled {round_mode = "ROUND_R", sat = "RS_ENABLE"} : !pto.vreg<64xf32> -> !pto.vreg<64xi32> // Then narrow i32 → i8 via pack ops diff --git a/docs/isa/vector/conversion-ops_zh.md b/docs/isa/vector/conversion-ops_zh.md index 32ecd01f7..79f49e2b9 100644 --- a/docs/isa/vector/conversion-ops_zh.md +++ b/docs/isa/vector/conversion-ops_zh.md @@ -7,6 +7,8 @@ - `pto.vci` - `pto.vcvt` - `pto.vtrc` +- `pto.vbitcast` +- `pto.pbitcast` ## 操作数模型 @@ -30,6 +32,14 @@ 把浮点值按指定舍入模式变成“整数值的浮点数”,但不改变元素类型。 +### `pto.vbitcast` + +对 `!pto.vreg<...>` 值做按位重新解释,保持位模式不变(总位宽恒为 2048 bits),只改变元素类型与车道数。源与目标都必须是 `!pto.vreg<...>`,且 `N * bitwidth(T0) = M * bitwidth(T1) = 2048`。仅支持整型和浮点元素类型。详细说明见 [`pto.vbitcast`](./ops/conversion-ops/vbitcast_zh.md)。 + +### `pto.pbitcast` + +对 `!pto.mask<...>` 值做按位重新解释,不改变底层谓词位,仅切换 mask 粒度视图(`b8` / `b16` / `b32`)。常用于生产者与消费者对同一谓词状态采用不同粒度的场景。详细说明见 [`pto.pbitcast`](./ops/conversion-ops/pbitcast_zh.md)。 + ## 舍入模式 | 模式 | 含义 | diff --git a/docs/isa/vector/data-rearrangement.md b/docs/isa/vector/data-rearrangement.md index 2633f2c94..edfcf027f 100644 --- a/docs/isa/vector/data-rearrangement.md +++ b/docs/isa/vector/data-rearrangement.md @@ -106,7 +106,7 @@ for (int i = 0; i < N; i++) ### `pto.vsqz` -- **syntax:** `%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **semantics:** Compress — pack active lanes to front. 
```c @@ -128,7 +128,7 @@ while (j < N) dst[j++] = 0; ### `pto.vusqz` -- **syntax:** `%result = pto.vusqz %mask : !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vusqz %mask : !pto.mask -> !pto.vreg` - **semantics:** Expand — scatter front elements to active positions. ```c @@ -255,15 +255,15 @@ for (int i = 0; i < N/2; i++) // Filter: keep only elements passing condition %pass_mask = pto.vcmps %values, %threshold, %all, "gt" - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask %compacted = pto.vsqz %values, %pass_mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Sliding window sum %prev_window = pto.vslide %curr, %prev, %c1 : !pto.vreg<64xf32>, !pto.vreg<64xf32>, i16 -> !pto.vreg<64xf32> %window_sum = pto.vadd %curr, %prev_window, %all - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Type narrowing via pack %packed_i16 = pto.vpack %wide0_i32, %wide1_i32, %c0 diff --git a/docs/isa/vector/data-rearrangement_zh.md b/docs/isa/vector/data-rearrangement_zh.md index 7d091f846..ada5b03e3 100644 --- a/docs/isa/vector/data-rearrangement_zh.md +++ b/docs/isa/vector/data-rearrangement_zh.md @@ -69,7 +69,7 @@ ### `pto.vsqz` -- **语法:** `%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 把 mask 选中的活跃 lane 压紧到结果前部。 ```c @@ -83,7 +83,7 @@ while (j < N) dst[j++] = 0; ### `pto.vusqz` -- **语法:** `%result = pto.vusqz %mask : !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vusqz %mask : !pto.mask -> !pto.vreg` - **语义:** 把前部压紧流再按 `%mask` 的活跃位置展开回固定形状。 当前指令面把“前部压紧流”的来源隐含在形式之中,因此后端更不能随意改写其放置规则。 @@ -150,14 +150,14 @@ for (int i = 0; i < N; i++) { : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32>, !pto.vreg<64xf32> %pass_mask = pto.vcmps %values, %threshold, 
%all, "gt" - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask %compacted = pto.vsqz %values, %pass_mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %prev_window = pto.vslide %curr, %prev, %c1 : !pto.vreg<64xf32>, !pto.vreg<64xf32>, i16 -> !pto.vreg<64xf32> %window_sum = pto.vadd %curr, %prev_window, %all - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %packed_i16 = pto.vpack %wide0_i32, %wide1_i32, %c0 : !pto.vreg<64xi32>, !pto.vreg<64xi32>, index -> !pto.vreg<128xi16> diff --git a/docs/isa/vector/ops/binary-vector-ops/vadd.md b/docs/isa/vector/ops/binary-vector-ops/vadd.md index 49cb56616..7544b4a86 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vadd.md +++ b/docs/isa/vector/ops/binary-vector-ops/vadd.md @@ -24,19 +24,19 @@ For each lane `i` where the predicate is false (inactive lanes): ### PTO Assembly Form ```mlir -%result = pto.vadd %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vadd %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 1 (SSA) ```mlir -%result = pto.vadd %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vadd %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vadd ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vadd ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -56,7 +56,7 @@ vadd(dst, src0, src1, mask); |---------|------|-------------| | `%lhs` | `!pto.vreg` | Left-hand source vector register | | `%rhs` | `!pto.vreg` | Right-hand source vector register | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | All three 
registers must have the same element type and same vector width `N`. The mask width must match `N`. @@ -144,7 +144,7 @@ VADD(vdst, va, vb, mask); ```mlir // Only lanes where %cond is true participate in addition -%result = pto.vadd %va, %vb, %cond : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16> +%result = pto.vadd %va, %vb, %cond : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16> ``` ### Complete vector-load / compute / vector-store pipeline diff --git a/docs/isa/vector/ops/binary-vector-ops/vadd_zh.md b/docs/isa/vector/ops/binary-vector-ops/vadd_zh.md index c38d7ec53..e7ca731e4 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vadd_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vadd_zh.md @@ -27,13 +27,13 @@ vadd %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vadd %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vadd %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vadd ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vadd ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -53,7 +53,7 @@ PTO_INST RecordEvent VADD(VecDst& dst, const VecLhs& lhs, const VecRhs& rhs, |--------|------|------| | `%lhs` | `!pto.vreg` | 左操作数向量寄存器 | | `%rhs` | `!pto.vreg` | 右操作数向量寄存器 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | 两个源寄存器必须有相同的元素类型和相同的向量宽度 `N`。掩码宽度也必须与 `N` 一致。 @@ -143,7 +143,7 @@ VADD(vdst, va, vb, mask); ```mlir %result = pto.vadd %va, %vb, %cond - : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16> + : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16> ``` ### 完整的 load / compute / store 链 diff --git a/docs/isa/vector/ops/binary-vector-ops/vaddc.md b/docs/isa/vector/ops/binary-vector-ops/vaddc.md index 46c3b2103..2fb4ab703 100644 --- 
a/docs/isa/vector/ops/binary-vector-ops/vaddc.md +++ b/docs/isa/vector/ops/binary-vector-ops/vaddc.md @@ -31,14 +31,14 @@ vaddc %dst, %carry, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result, %carry = pto.vaddc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask +%result, %carry = pto.vaddc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.vaddc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) - outs(%result, %carry : !pto.vreg, !pto.mask) +pto.vaddc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) + outs(%result, %carry : !pto.vreg, !pto.mask) ``` ## Inputs @@ -47,7 +47,7 @@ pto.vaddc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |---------|------|-------------| | `%lhs` | `!pto.vreg` | First addend | | `%rhs` | `!pto.vreg` | Second addend | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. 
@@ -56,7 +56,7 @@ Both source registers MUST have the same element type and the same vector width | Result | Type | Description | |--------|------|-------------| | `%result` | `!pto.vreg` | Lane-wise truncated sum on active lanes; inactive lanes are unmodified | -| `%carry` | `!pto.mask` | Per-lane carry/overflow predicate: lane `i` is 1 if unsigned overflow occurred in lane `i` | +| `%carry` | `!pto.mask` | Per-lane carry/overflow predicate: lane `i` is 1 if unsigned overflow occurred in lane `i` | ## Side Effects @@ -110,7 +110,7 @@ for (int i = 0; i < N; i++) { ```mlir // Single-element addition with carry -%result, %carry = pto.vaddc %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask +%result, %carry = pto.vaddc %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask // Multi-word addition: chain carries into next segment %sum0, %carry0 = pto.vaddc %a0, %b0, %active : ... // low words diff --git a/docs/isa/vector/ops/binary-vector-ops/vaddc_zh.md b/docs/isa/vector/ops/binary-vector-ops/vaddc_zh.md index ed3d23874..63d1b2923 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vaddc_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vaddc_zh.md @@ -30,14 +30,14 @@ vaddc %dst, %carry, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result, %carry = pto.vaddc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask +%result, %carry = pto.vaddc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.vaddc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) - outs(%result, %carry : !pto.vreg, !pto.mask) +pto.vaddc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) + outs(%result, %carry : !pto.vreg, !pto.mask) ``` ## 输入 @@ -46,7 +46,7 @@ pto.vaddc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |--------|------|------| | `%lhs` | `!pto.vreg` | 第一个加数 | | `%rhs` | 
`!pto.vreg` | 第二个加数 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与加法 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与加法 | 两个源寄存器必须具有相同的元素类型和相同的向量宽度 `N`。掩码宽度必须与 `N` 一致。 @@ -55,7 +55,7 @@ pto.vaddc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) | 结果 | 类型 | 说明 | |------|------|------| | `%result` | `!pto.vreg` | 活跃 lane 上得到截断后的和;非活跃 lane 保持原值 | -| `%carry` | `!pto.mask` | 逐 lane 的进位 / 溢出谓词;无符号加法发生溢出时,对应 lane 为 1 | +| `%carry` | `!pto.mask` | 逐 lane 的进位 / 溢出谓词;无符号加法发生溢出时,对应 lane 为 1 | ## 副作用 @@ -106,7 +106,7 @@ for (int i = 0; i < N; i++) { ```mlir %result, %carry = pto.vaddc %a, %b, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask %sum0, %carry0 = pto.vaddc %a0, %b0, %active : ... %sum1, %carry1 = pto.vaddc %a1, %b1, %carry0 : ... diff --git a/docs/isa/vector/ops/binary-vector-ops/vand.md b/docs/isa/vector/ops/binary-vector-ops/vand.md index b31bec336..a907a5607 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vand.md +++ b/docs/isa/vector/ops/binary-vector-ops/vand.md @@ -25,13 +25,13 @@ vand %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vand %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vand %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vand ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vand ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ Supported element types: all integer types (`i8`–`i64`, `u8`–`u64`). 
|---------|------|-------------| | `%lhs` | `!pto.vreg` | Left-hand source vector register | | `%rhs` | `!pto.vreg` | Right-hand source vector register | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same integer element type and the same vector width `N`. The mask width MUST match `N`. @@ -114,11 +114,11 @@ for (int i = 0; i < N; i++) ```mlir // Bitwise AND of two integer vectors -%result = pto.vand %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%result = pto.vand %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> // Mask out specific bits: keep only bits 0-7 %mask = pto.vbroadcast %c255 : i32 -> !pto.vreg<64xi32> -%masked = pto.vand %data, %mask, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%masked = pto.vand %data, %mask, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vand_zh.md b/docs/isa/vector/ops/binary-vector-ops/vand_zh.md index 9087bbc8d..f68f61d9f 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vand_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vand_zh.md @@ -29,13 +29,13 @@ vand %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vand %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vand %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vand ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vand ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -47,7 +47,7 @@ pto.vand ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |--------|------|------| | `%lhs` | `!pto.vreg` | 左操作数向量寄存器 | | 
`%rhs` | `!pto.vreg` | 右操作数向量寄存器 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | 两个源寄存器必须有相同的整数元素类型和相同的向量宽度 `N`。掩码宽度必须与 `N` 一致。 @@ -116,10 +116,10 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vand %a, %b, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> %masked = pto.vand %data, %mask_bits, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/binary-vector-ops/vdiv.md b/docs/isa/vector/ops/binary-vector-ops/vdiv.md index e3db3b3ea..5137d8f64 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vdiv.md +++ b/docs/isa/vector/ops/binary-vector-ops/vdiv.md @@ -21,7 +21,7 @@ vdiv %result, %lhs, %rhs, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `f16, f32 only (no integer division)`. 
diff --git a/docs/isa/vector/ops/binary-vector-ops/vdiv_zh.md b/docs/isa/vector/ops/binary-vector-ops/vdiv_zh.md index 225103295..2a0c6af29 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vdiv_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vdiv_zh.md @@ -27,7 +27,7 @@ vdiv %result, %lhs, %rhs, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vdiv %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化的形式只有 `f16` 与 `f32`。 diff --git a/docs/isa/vector/ops/binary-vector-ops/vmax.md b/docs/isa/vector/ops/binary-vector-ops/vmax.md index b273d936e..74381955f 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vmax.md +++ b/docs/isa/vector/ops/binary-vector-ops/vmax.md @@ -25,13 +25,13 @@ vmax %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vmax %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmax %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vmax ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vmax ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ Supported element types on A5: `i8-i32`, `f16`, `bf16`, `f32`. |---------|------|-------------| | `%lhs` | `!pto.vreg` | First source vector register (first operand of the comparison) | | `%rhs` | `!pto.vreg` | Second source vector register (second operand of the comparison) | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. 
@@ -119,10 +119,10 @@ for (int i = 0; i < N; i++) ```mlir // Element-wise max of two vectors -%result = pto.vmax %a, %b, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vmax %a, %b, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> // Clamp to a minimum value: max(x, lower) -%clamped = pto.vmax %input, %lower, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%clamped = pto.vmax %input, %lower, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vmax_zh.md b/docs/isa/vector/ops/binary-vector-ops/vmax_zh.md index 06c3a7e05..18dc70bf1 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vmax_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vmax_zh.md @@ -31,13 +31,13 @@ vmax %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vmax %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmax %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vmax ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vmax ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -49,7 +49,7 @@ A5 当前记录的支持元素类型包括 `i8-i32`、`f16`、`bf16`、`f32`。 |--------|------|------| | `%lhs` | `!pto.vreg` | 第一个比较操作数 | | `%rhs` | `!pto.vreg` | 第二个比较操作数 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与比较 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与比较 | ## 预期输出 @@ -120,10 +120,10 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vmax %a, %b, %active - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> %clamped = pto.vmax %input, %lower, %active - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : 
(!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/binary-vector-ops/vmin.md b/docs/isa/vector/ops/binary-vector-ops/vmin.md index 5365f9d41..c2e45da71 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vmin.md +++ b/docs/isa/vector/ops/binary-vector-ops/vmin.md @@ -25,13 +25,13 @@ vmin %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vmin %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmin %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vmin ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vmin ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ Supported element types on A5: `i8-i32`, `f16`, `bf16`, `f32`. |---------|------|-------------| | `%lhs` | `!pto.vreg` | First source vector register (first operand of the comparison) | | `%rhs` | `!pto.vreg` | Second source vector register (second operand of the comparison) | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. 
@@ -119,10 +119,10 @@ for (int i = 0; i < N; i++) ```mlir // Element-wise min of two vectors -%result = pto.vmin %a, %b, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vmin %a, %b, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> // Clamp to a maximum value: min(x, upper) -%clamped = pto.vmin %input, %upper, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%clamped = pto.vmin %input, %upper, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vmin_zh.md b/docs/isa/vector/ops/binary-vector-ops/vmin_zh.md index fc29d4a9c..e77fa6783 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vmin_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vmin_zh.md @@ -31,13 +31,13 @@ vmin %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vmin %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmin %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vmin ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vmin ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -49,7 +49,7 @@ A5 当前记录的支持元素类型包括 `i8-i32`、`f16`、`bf16`、`f32`。 |--------|------|------| | `%lhs` | `!pto.vreg` | 第一个比较操作数 | | `%rhs` | `!pto.vreg` | 第二个比较操作数 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与比较 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与比较 | ## 预期输出 @@ -120,10 +120,10 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vmin %a, %b, %active - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> %clamped = pto.vmin %input, %upper, %active - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : 
(!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/binary-vector-ops/vmul.md b/docs/isa/vector/ops/binary-vector-ops/vmul.md index 8787ee1c6..69656ff60 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vmul.md +++ b/docs/isa/vector/ops/binary-vector-ops/vmul.md @@ -21,13 +21,13 @@ vmul %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vmul ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vmul ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` diff --git a/docs/isa/vector/ops/binary-vector-ops/vmul_zh.md b/docs/isa/vector/ops/binary-vector-ops/vmul_zh.md index 5fee1fcc9..5aec3340c 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vmul_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vmul_zh.md @@ -27,13 +27,13 @@ vmul %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vmul %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vmul ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vmul ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` diff --git a/docs/isa/vector/ops/binary-vector-ops/vor.md b/docs/isa/vector/ops/binary-vector-ops/vor.md index 38036ee08..2560358bc 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vor.md +++ b/docs/isa/vector/ops/binary-vector-ops/vor.md @@ -25,13 +25,13 @@ vor %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg 
``` ### AS Level 2 (DPS) ```mlir -pto.vor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ Supported element types: all integer types (`i8`–`i64`, `u8`–`u64`). |---------|------|-------------| | `%lhs` | `!pto.vreg` | Left-hand source vector register | | `%rhs` | `!pto.vreg` | Right-hand source vector register | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same integer element type and the same vector width `N`. The mask width MUST match `N`. @@ -114,11 +114,11 @@ for (int i = 0; i < N; i++) ```mlir // Bitwise OR of two integer vectors -%result = pto.vor %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%result = pto.vor %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> // Set specific bits %flags = pto.vbroadcast %c1 : i32 -> !pto.vreg<64xi32> -%set = pto.vor %data, %flags, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%set = pto.vor %data, %flags, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vor_zh.md b/docs/isa/vector/ops/binary-vector-ops/vor_zh.md index 6d460edde..0ec269020 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vor_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vor_zh.md @@ -29,13 +29,13 @@ vor %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vor 
ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -47,7 +47,7 @@ pto.vor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |--------|------|------| | `%lhs` | `!pto.vreg` | 左操作数向量寄存器 | | `%rhs` | `!pto.vreg` | 右操作数向量寄存器 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -113,10 +113,10 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vor %a, %b, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> %set = pto.vor %data, %flags, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/binary-vector-ops/vshl.md b/docs/isa/vector/ops/binary-vector-ops/vshl.md index 40d4a8b64..fa321bcc6 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vshl.md +++ b/docs/isa/vector/ops/binary-vector-ops/vshl.md @@ -25,13 +25,13 @@ vshl %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vshl %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vshl %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vshl ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vshl ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ Supported element types: all integer types (`i8`–`i64`, `u8`–`u64`). 
|---------|------|-------------| | `%lhs` | `!pto.vreg` | Value to be shifted (left operand) | | `%rhs` | `!pto.vreg` | Per-lane unsigned shift count | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same integer element type and the same vector width `N`. The mask width MUST match `N`. @@ -116,10 +116,10 @@ for (int i = 0; i < N; i++) ```mlir // Left shift by scalar count (broadcast to all lanes) %count = pto.vbroadcast %c3 : i32 -> !pto.vreg<64xi32> -%shifted = pto.vshl %data, %count, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%shifted = pto.vshl %data, %count, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> // Per-lane variable shift -%shifted2 = pto.vshl %data, %counts, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%shifted2 = pto.vshl %data, %counts, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vshl_zh.md b/docs/isa/vector/ops/binary-vector-ops/vshl_zh.md index 1a68ded88..32ae4bf79 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vshl_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vshl_zh.md @@ -31,13 +31,13 @@ vshl %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vshl %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vshl %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vshl ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vshl ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -49,7 +49,7 @@ pto.vshl ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |--------|------|------| | `%lhs` | `!pto.vreg` | 
被左移的值 | | `%rhs` | `!pto.vreg` | 每个 lane 的无符号位移计数 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与位移 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与位移 | ## 预期输出 @@ -117,10 +117,10 @@ for (int i = 0; i < N; i++) ```mlir %count = pto.vbroadcast %c3 : i32 -> !pto.vreg<64xi32> %shifted = pto.vshl %data, %count, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> %shifted2 = pto.vshl %data, %counts, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## 性能 diff --git a/docs/isa/vector/ops/binary-vector-ops/vshr.md b/docs/isa/vector/ops/binary-vector-ops/vshr.md index 81dd36f68..c215bfbca 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vshr.md +++ b/docs/isa/vector/ops/binary-vector-ops/vshr.md @@ -29,13 +29,13 @@ vshr %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vshr %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vshr %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vshr ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vshr ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -47,7 +47,7 @@ Supported element types: all integer types (`i8`–`i64`, `u8`–`u64`). |---------|------|-------------| | `%lhs` | `!pto.vreg` | Value to be shifted (left operand) | | `%rhs` | `!pto.vreg` | Per-lane unsigned shift count | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same integer element type and the same vector width `N`. The mask width MUST match `N`. 
@@ -120,10 +120,10 @@ for (int i = 0; i < N; i++) ```mlir // Right shift by scalar count (broadcast to all lanes) %count = pto.vbroadcast %c2 : i32 -> !pto.vreg<64xi32> -%shifted = pto.vshr %data, %count, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%shifted = pto.vshr %data, %count, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> // Per-lane variable shift -%shifted2 = pto.vshr %data, %counts, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%shifted2 = pto.vshr %data, %counts, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vshr_zh.md b/docs/isa/vector/ops/binary-vector-ops/vshr_zh.md index 15639a406..9361906fa 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vshr_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vshr_zh.md @@ -35,13 +35,13 @@ vshr %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vshr %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vshr %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vshr ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vshr ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -53,7 +53,7 @@ pto.vshr ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |--------|------|------| | `%lhs` | `!pto.vreg` | 被右移的值 | | `%rhs` | `!pto.vreg` | 每个 lane 的无符号位移计数 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与位移 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与位移 | ## 预期输出 @@ -122,10 +122,10 @@ for (int i = 0; i < N; i++) ```mlir %count = pto.vbroadcast %c2 : i32 -> !pto.vreg<64xi32> %shifted = pto.vshr %data, %count, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, 
!pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> %shifted2 = pto.vshr %data, %counts, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## 性能 diff --git a/docs/isa/vector/ops/binary-vector-ops/vsub.md b/docs/isa/vector/ops/binary-vector-ops/vsub.md index 7f20ef18f..4db1866cc 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vsub.md +++ b/docs/isa/vector/ops/binary-vector-ops/vsub.md @@ -25,13 +25,13 @@ vsub %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vsub %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vsub %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vsub ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vsub ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ Supported element types on A5: `i8-i64`, `f16`, `bf16`, `f32`. |---------|------|-------------| | `%lhs` | `!pto.vreg` | Minuend: the value being subtracted from | | `%rhs` | `!pto.vreg` | Subtrahend: the value being subtracted | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. 
@@ -119,10 +119,10 @@ for (int i = 0; i < N; i++) ```mlir // Full-vector subtraction (all lanes active) -%result = pto.vsub %lhs, %rhs, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vsub %lhs, %rhs, %active : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> // Partial predication: only subtract where %cond is true -%diff = pto.vsub %a, %b, %cond : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16> +%diff = pto.vsub %a, %b, %cond : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vsub_zh.md b/docs/isa/vector/ops/binary-vector-ops/vsub_zh.md index 2b6594e6c..6ffefdf5c 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vsub_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vsub_zh.md @@ -25,13 +25,13 @@ vsub %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vsub %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vsub %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vsub ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vsub ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ A5 当前记录的支持类型包括 `i8-i64`、`f16`、`bf16`、`f32`。 |--------|------|------| | `%lhs` | `!pto.vreg` | 被减数 | | `%rhs` | `!pto.vreg` | 减数 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与减法 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与减法 | 两个源寄存器必须有相同的元素类型和相同的向量宽度 `N`。掩码宽度必须与 `N` 相同。 @@ -117,10 +117,10 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vsub %lhs, %rhs, %active - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> %diff = pto.vsub %a, %b, %cond - : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> 
!pto.vreg<128xf16> + : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xf16> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/binary-vector-ops/vsubc.md b/docs/isa/vector/ops/binary-vector-ops/vsubc.md index 74313b087..e260998a7 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vsubc.md +++ b/docs/isa/vector/ops/binary-vector-ops/vsubc.md @@ -31,14 +31,14 @@ vsubc %dst, %borrow, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result, %borrow = pto.vsubc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask +%result, %borrow = pto.vsubc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask ``` ### AS Level 2 (DPS) ```mlir -pto.vsubc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) - outs(%result, %borrow : !pto.vreg, !pto.mask) +pto.vsubc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) + outs(%result, %borrow : !pto.vreg, !pto.mask) ``` ## Inputs @@ -47,7 +47,7 @@ pto.vsubc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |---------|------|-------------| | `%lhs` | `!pto.vreg` | Minuend: the value being subtracted from | | `%rhs` | `!pto.vreg` | Subtrahend: the value being subtracted | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. 
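The `pto.vsubc` operand contract can be sketched as a scalar reference loop. This is a minimal sketch, not the normative definition: the helper name `vsubc_ref`, the byte-per-lane mask representation, and the choice to report a borrow of 0 on inactive lanes are assumptions made for illustration. The page itself only states that inactive result lanes are unmodified and that the borrow bit records unsigned underflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference sketch of pto.vsubc on 32-bit unsigned lanes.
 * mask[i] != 0 marks lane i as active (hypothetical byte-per-lane encoding).
 * Assumption: inactive lanes leave result[i] untouched and report borrow 0. */
void vsubc_ref(const uint32_t *lhs, const uint32_t *rhs, const uint8_t *mask,
               uint32_t *result, uint8_t *borrow, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!mask[i]) {
            borrow[i] = 0;              /* assumed: no borrow on inactive lanes */
            continue;                   /* result[i] keeps its previous value */
        }
        result[i] = lhs[i] - rhs[i];    /* unsigned subtraction wraps modulo 2^32 */
        borrow[i] = (uint8_t)(lhs[i] < rhs[i]);  /* 1 on unsigned underflow */
    }
}
```

For example, an active lane computing `3 - 5` would yield `0xFFFFFFFE` with its borrow bit set, while `5 - 3` yields `2` with the borrow bit clear.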
@@ -56,7 +56,7 @@ Both source registers MUST have the same element type and the same vector width | Result | Type | Description | |--------|------|-------------| | `%result` | `!pto.vreg` | Lane-wise arithmetic difference on active lanes; inactive lanes are unmodified | -| `%borrow` | `!pto.mask` | Per-lane borrow predicate: lane `i` is 1 if unsigned underflow occurred in lane `i` | +| `%borrow` | `!pto.mask` | Per-lane borrow predicate: lane `i` is 1 if unsigned underflow occurred in lane `i` | ## Side Effects @@ -109,7 +109,7 @@ for (int i = 0; i < N; i++) { ```mlir // Single-element subtraction with borrow -%result, %borrow = pto.vsubc %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask +%result, %borrow = pto.vsubc %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask // Multi-word subtraction: chain borrows into next segment %diff0, %borrow0 = pto.vsubc %a0, %b0, %active : ... // low words diff --git a/docs/isa/vector/ops/binary-vector-ops/vsubc_zh.md b/docs/isa/vector/ops/binary-vector-ops/vsubc_zh.md index 3f8ffba53..a39bb9446 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vsubc_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vsubc_zh.md @@ -30,14 +30,14 @@ vsubc %dst, %borrow, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result, %borrow = pto.vsubc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask +%result, %borrow = pto.vsubc %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg, !pto.mask ``` ### AS Level 2(DPS) ```mlir -pto.vsubc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) - outs(%result, %borrow : !pto.vreg, !pto.mask) +pto.vsubc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) + outs(%result, %borrow : !pto.vreg, !pto.mask) ``` ## 输入 @@ -46,7 +46,7 @@ pto.vsubc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |--------|------|------| | `%lhs` | `!pto.vreg` | 被减数 | | 
`%rhs` | `!pto.vreg` | 减数 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与减法 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与减法 | 两个源寄存器必须具有相同的元素类型和相同的向量宽度 `N`。掩码宽度必须与 `N` 一致。 @@ -55,7 +55,7 @@ pto.vsubc ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) | 结果 | 类型 | 说明 | |------|------|------| | `%result` | `!pto.vreg` | 活跃 lane 上得到逐 lane 差值;非活跃 lane 保持原值 | -| `%borrow` | `!pto.mask` | 逐 lane 的借位谓词;无符号下溢时,对应 lane 为 1 | +| `%borrow` | `!pto.mask` | 逐 lane 的借位谓词;无符号下溢时,对应 lane 为 1 | ## 副作用 @@ -105,7 +105,7 @@ for (int i = 0; i < N; i++) { ```mlir %result, %borrow = pto.vsubc %a, %b, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32>, !pto.mask %diff0, %borrow0 = pto.vsubc %a0, %b0, %active : ... %diff1, %borrow1 = pto.vsubc %a1, %b1, %borrow0 : ... diff --git a/docs/isa/vector/ops/binary-vector-ops/vxor.md b/docs/isa/vector/ops/binary-vector-ops/vxor.md index 3ba4e755c..644500fa2 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vxor.md +++ b/docs/isa/vector/ops/binary-vector-ops/vxor.md @@ -25,13 +25,13 @@ vxor %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vxor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vxor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vxor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vxor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -43,7 +43,7 @@ Supported element types: all integer types (`i8`–`i64`, `u8`–`u64`). 
|---------|------|-------------| | `%lhs` | `!pto.vreg` | Left-hand source vector register | | `%rhs` | `!pto.vreg` | Right-hand source vector register | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same integer element type and the same vector width `N`. The mask width MUST match `N`. @@ -114,11 +114,11 @@ for (int i = 0; i < N; i++) ```mlir // Bitwise XOR of two integer vectors -%result = pto.vxor %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%result = pto.vxor %a, %b, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> // Toggle specific bits %toggle = pto.vbroadcast %cmask : i32 -> !pto.vreg<64xi32> -%toggled = pto.vxor %data, %toggle, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%toggled = pto.vxor %data, %toggle, %active : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/binary-vector-ops/vxor_zh.md b/docs/isa/vector/ops/binary-vector-ops/vxor_zh.md index 5793e0391..7ecadedc9 100644 --- a/docs/isa/vector/ops/binary-vector-ops/vxor_zh.md +++ b/docs/isa/vector/ops/binary-vector-ops/vxor_zh.md @@ -29,13 +29,13 @@ vxor %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vxor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vxor %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vxor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) +pto.vxor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -47,7 +47,7 @@ pto.vxor ins(%lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask) |--------|------|------| | `%lhs` | `!pto.vreg` | 左操作数向量寄存器 | | `%rhs` | 
`!pto.vreg` | 右操作数向量寄存器 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -113,10 +113,10 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vxor %a, %b, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> %toggled = pto.vxor %data, %toggle, %active - : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> + : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/compare-select/vcmp.md b/docs/isa/vector/ops/compare-select/vcmp.md index a65185086..4643d7986 100644 --- a/docs/isa/vector/ops/compare-select/vcmp.md +++ b/docs/isa/vector/ops/compare-select/vcmp.md @@ -15,13 +15,13 @@ For each lane `i` where `%seed[i]` is true, `result[i]` is set to the outcome of ### PTO Assembly Form ```text -vcmp %dst, %src0, %src1, %seed, "CMP_MODE" : !pto.mask +vcmp %dst, %src0, %src1, %seed, "CMP_MODE" : !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg, !pto.vreg, !pto.mask -> !pto.mask +%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg, !pto.vreg, !pto.mask -> !pto.mask ``` ## Inputs @@ -30,14 +30,14 @@ vcmp %dst, %src0, %src1, %seed, "CMP_MODE" : !pto.mask | --- | --- | --- | | %src0 | `!pto.vreg` | Left-hand vector operand | | %src1 | `!pto.vreg` | Right-hand vector operand | -| %seed | `!pto.mask` | Incoming predicate mask that limits which lanes are compared | +| %seed | `!pto.mask` | Incoming predicate mask that limits which lanes are compared | | `CMP_MODE` | enum | Comparison predicate such as `eq`, `ne`, `lt`, `le`, `gt`, or `ge` | ## Expected Outputs | Result | Type | Description | | --- | --- | --- | -| %result | `!pto.mask` | Predicate mask whose active bits record the comparison result | +| %result | `!pto.mask` | Predicate mask whose active bits 
record the comparison result | ## Side Effects @@ -71,7 +71,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%lt_mask = pto.vcmp %a, %b, %all_active, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask +%lt_mask = pto.vcmp %a, %b, %all_active, "lt" : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/compare-select/vcmp_zh.md b/docs/isa/vector/ops/compare-select/vcmp_zh.md index d16a7c140..869963922 100644 --- a/docs/isa/vector/ops/compare-select/vcmp_zh.md +++ b/docs/isa/vector/ops/compare-select/vcmp_zh.md @@ -19,13 +19,13 @@ result[i] = seed[i] ? cmp(src0[i], src1[i], CMP_MODE) : 0 ### PTO 汇编形式 ```text -vcmp %dst, %src0, %src1, %seed, "CMP_MODE" : !pto.mask +vcmp %dst, %src0, %src1, %seed, "CMP_MODE" : !pto.mask ``` ### AS Level 1(SSA) ```mlir -%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg, !pto.vreg, !pto.mask -> !pto.mask +%result = pto.vcmp %src0, %src1, %seed, "CMP_MODE" : !pto.vreg, !pto.vreg, !pto.mask -> !pto.mask ``` ## 输入 @@ -34,14 +34,14 @@ vcmp %dst, %src0, %src1, %seed, "CMP_MODE" : !pto.mask |--------|------|------| | `%src0` | `!pto.vreg` | 左操作数向量 | | `%src1` | `!pto.vreg` | 右操作数向量 | -| `%seed` | `!pto.mask` | 限定哪些 lane 真正参与比较的输入谓词 | +| `%seed` | `!pto.mask` | 限定哪些 lane 真正参与比较的输入谓词 | | `CMP_MODE` | 枚举 | 比较模式,如 `eq`、`ne`、`lt`、`le`、`gt`、`ge` | ## 预期输出 | 结果 | 类型 | 说明 | |------|------|------| -| `%result` | `!pto.mask` | 每个活跃 bit 记录比较结果的谓词掩码 | +| `%result` | `!pto.mask` | 每个活跃 bit 记录比较结果的谓词掩码 | ## 副作用 @@ -76,7 +76,7 @@ for (int i = 0; i < N; i++) ```mlir %lt_mask = pto.vcmp %a, %b, %all_active, "lt" - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.mask ``` ## 性能 diff --git a/docs/isa/vector/ops/compare-select/vcmps.md b/docs/isa/vector/ops/compare-select/vcmps.md index 2812f1446..4100e95be 100644 --- a/docs/isa/vector/ops/compare-select/vcmps.md +++ 
b/docs/isa/vector/ops/compare-select/vcmps.md @@ -15,13 +15,13 @@ For each lane `i` where `%seed[i]` is true, `result[i]` is set to the outcome of ### PTO Assembly Form ```text -vcmps %dst, %src, %scalar, %seed, "CMP_MODE" : !pto.mask +vcmps %dst, %src, %scalar, %seed, "CMP_MODE" : !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg, T, !pto.mask -> !pto.mask +%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg, T, !pto.mask -> !pto.mask ``` ## Inputs @@ -30,14 +30,14 @@ vcmps %dst, %src, %scalar, %seed, "CMP_MODE" : !pto.mask | --- | --- | --- | | %src | `!pto.vreg` | Vector operand | | %scalar | `T` | Scalar comparison value broadcast to every active lane | -| %seed | `!pto.mask` | Incoming predicate mask that limits which lanes are compared | +| %seed | `!pto.mask` | Incoming predicate mask that limits which lanes are compared | | `CMP_MODE` | enum | Comparison predicate such as `eq`, `ne`, `lt`, `le`, `gt`, or `ge` | ## Expected Outputs | Result | Type | Description | | --- | --- | --- | -| %result | `!pto.mask` | Predicate mask whose active bits record the comparison result | +| %result | `!pto.mask` | Predicate mask whose active bits record the comparison result | ## Side Effects @@ -71,7 +71,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%positive_mask = pto.vcmps %values, %c0_f32, %all_active, "gt" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask +%positive_mask = pto.vcmps %values, %c0_f32, %all_active, "gt" : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/compare-select/vcmps_zh.md b/docs/isa/vector/ops/compare-select/vcmps_zh.md index d2c619348..2d98c512d 100644 --- a/docs/isa/vector/ops/compare-select/vcmps_zh.md +++ b/docs/isa/vector/ops/compare-select/vcmps_zh.md @@ -19,13 +19,13 @@ result[i] = seed[i] ? 
cmp(src[i], scalar, CMP_MODE) : 0 ### PTO 汇编形式 ```text -vcmps %dst, %src, %scalar, %seed, "CMP_MODE" : !pto.mask +vcmps %dst, %src, %scalar, %seed, "CMP_MODE" : !pto.mask ``` ### AS Level 1(SSA) ```mlir -%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg, T, !pto.mask -> !pto.mask +%result = pto.vcmps %src, %scalar, %seed, "CMP_MODE" : !pto.vreg, T, !pto.mask -> !pto.mask ``` ## 输入 @@ -34,14 +34,14 @@ vcmps %dst, %src, %scalar, %seed, "CMP_MODE" : !pto.mask |--------|------|------| | `%src` | `!pto.vreg` | 向量操作数 | | `%scalar` | `T` | 广播到每个活跃 lane 的标量比较值 | -| `%seed` | `!pto.mask` | 限定哪些 lane 真正参与比较的输入谓词 | +| `%seed` | `!pto.mask` | 限定哪些 lane 真正参与比较的输入谓词 | | `CMP_MODE` | 枚举 | 比较模式,如 `eq`、`ne`、`lt`、`le`、`gt`、`ge` | ## 预期输出 | 结果 | 类型 | 说明 | |------|------|------| -| `%result` | `!pto.mask` | 每个活跃 bit 记录比较结果的谓词掩码 | +| `%result` | `!pto.mask` | 每个活跃 bit 记录比较结果的谓词掩码 | ## 副作用 @@ -76,7 +76,7 @@ for (int i = 0; i < N; i++) ```mlir %positive_mask = pto.vcmps %values, %c0_f32, %all_active, "gt" - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.mask ``` ## 性能 diff --git a/docs/isa/vector/ops/compare-select/vsel.md b/docs/isa/vector/ops/compare-select/vsel.md index 84717f84e..4dc9083ab 100644 --- a/docs/isa/vector/ops/compare-select/vsel.md +++ b/docs/isa/vector/ops/compare-select/vsel.md @@ -21,7 +21,7 @@ vsel %dst, %src_true, %src_false, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vsel %src0, %src1, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vsel %src0, %src1, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vsel %dst, %src_true, %src_false, %mask : !pto.vreg | --- | --- | --- | | %src0 | `!pto.vreg` | Value selected when the mask bit is 1 | | %src1 | `!pto.vreg` | Value selected when the mask bit is 0 | -| %mask | `!pto.mask` | Predicate mask that chooses between `%src0` and `%src1` per lane | +| %mask | `!pto.mask` | Predicate mask that 
chooses between `%src0` and `%src1` per lane | ## Expected Outputs @@ -68,7 +68,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vsel %true_vals, %false_vals, %condition : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsel %true_vals, %false_vals, %condition : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/compare-select/vsel_zh.md b/docs/isa/vector/ops/compare-select/vsel_zh.md index 3d7a5039f..7231f9af0 100644 --- a/docs/isa/vector/ops/compare-select/vsel_zh.md +++ b/docs/isa/vector/ops/compare-select/vsel_zh.md @@ -27,7 +27,7 @@ vsel %dst, %src_true, %src_false, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vsel %src0, %src1, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vsel %src0, %src1, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -36,7 +36,7 @@ vsel %dst, %src_true, %src_false, %mask : !pto.vreg |--------|------|------| | `%src0` | `!pto.vreg` | 当掩码位为 1 时被选中的值 | | `%src1` | `!pto.vreg` | 当掩码位为 0 时被选中的值 | -| `%mask` | `!pto.mask` | 逐 lane 选择 `%src0` 与 `%src1` 的谓词 | +| `%mask` | `!pto.mask` | 逐 lane 选择 `%src0` 与 `%src1` 的谓词 | ## 预期输出 @@ -75,7 +75,7 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vsel %true_vals, %false_vals, %condition - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/conversion-ops/pbitcast.md b/docs/isa/vector/ops/conversion-ops/pbitcast.md new file mode 100644 index 000000000..6b096e921 --- /dev/null +++ b/docs/isa/vector/ops/conversion-ops/pbitcast.md @@ -0,0 +1,58 @@ +# pto.pbitcast + +`pto.pbitcast` is part of the [Conversion Ops](../../conversion-ops.md) instruction set. 
+ +## Summary + +`pto.pbitcast` performs a bitwise reinterpretation of a `!pto.mask<...>` value without changing the underlying predicate-register image. It makes mask-family reinterpretation explicit in VPTO IR for cases where a producer and consumer expect different granularity views (`b8`, `b16`, `b32`) of the same hardware predicate state. + +## Mechanism + +The op is a pure type cast at the mask-register level. No predicate bits are materialized, normalized, or recomputed — VPTO only updates which mask granularity the surrounding IR uses to interpret the same predicate bits. This decouples the mask producer's natural granularity from the consumer's required granularity without inserting an extra hardware operation. + +## Syntax + +```mlir +%result = pto.pbitcast %input : !pto.mask -> !pto.mask +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%input` | `!pto.mask` | Source predicate register value. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%result` | `!pto.mask` | Same predicate bits, reinterpreted under granularity `G1`. | + +## Side Effects + +`pto.pbitcast` has no architectural side effects beyond producing its SSA result. It does not materialize new mask bits or rewrite hardware predicate state. + +## Constraints + +!!! warning "Constraints" + - Both source and result must be `!pto.mask<...>` types. + - `pto.pbitcast` does not materialize or normalize predicate contents; it only changes which mask granularity the surrounding VPTO IR uses to interpret the same predicate bits. + - Use only when the consumer requires a different mask granularity (`b8` / `b16` / `b32`) but the underlying predicate-register image is intended to be reused as-is. If the consumer needs a recomputed predicate, lower or materialize the mask through the appropriate predicate-generation op instead of `pto.pbitcast`. 
+ +## Examples + +### Reinterpret a b16 predicate as b32 before a consumer + +```mlir +%m16 = pto.pintlv_b16 %lhs, %rhs + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%m32 = pto.pbitcast %m16#0 : !pto.mask -> !pto.mask +%result = pto.vsel %a, %b, %m32 + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [Conversion Ops](../../conversion-ops.md) +- Vector-side bitcast: [pto.vbitcast](./vbitcast.md) +- Predicate generation and algebra: [Predicate Generation and Algebra](../../../scalar/ops/predicate-generation-and-algebra/) diff --git a/docs/isa/vector/ops/conversion-ops/pbitcast_zh.md b/docs/isa/vector/ops/conversion-ops/pbitcast_zh.md new file mode 100644 index 000000000..07f10ff0a --- /dev/null +++ b/docs/isa/vector/ops/conversion-ops/pbitcast_zh.md @@ -0,0 +1,58 @@ +# pto.pbitcast + +`pto.pbitcast` 属于 [Conversion Ops](../../conversion-ops_zh.md) 指令集。 + +## 摘要 + +`pto.pbitcast` 对 `!pto.mask<...>` 值执行按位重新解释,不改变底层谓词寄存器映像。当生产者与消费者对同一个硬件谓词状态期待不同的粒度视图(`b8`、`b16`、`b32`)时,本操作把 mask 家族之间的重新解释显式化到 VPTO IR 里。 + +## 机制 + +本操作在 mask 寄存器层面是一次纯类型转换。VPTO 不会重新计算、规范化或物化任何谓词位,只会更新周围 IR 对同一段谓词位所采用的粒度视图。这样可以把 mask 生产者天然的粒度与消费者所需的粒度解耦,而不必插入额外的硬件操作。 + +## 语法 + +```mlir +%result = pto.pbitcast %input : !pto.mask -> !pto.mask +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%input` | `!pto.mask` | 源谓词寄存器。 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%result` | `!pto.mask` | 同一份谓词位,在粒度 `G1` 下重新解释。 | + +## 副作用 + +`pto.pbitcast` 除了产生 SSA 结果以外没有任何架构层面的副作用。它不会重新生成 mask 位,也不会改写硬件谓词状态。 + +## 约束 + +!!! 
warning "约束" + - 源和目标都必须是 `!pto.mask<...>` 类型。 + - `pto.pbitcast` 不会物化或规范化谓词内容;它只更新周围 VPTO IR 对这同一段谓词位采用的粒度视图。 + - 仅在消费者需要不同 mask 粒度(`b8` / `b16` / `b32`)、但底层谓词映像可以原样复用时使用。若消费者需要的是「重新计算后的谓词」,应通过相应的谓词生成操作(而非 `pto.pbitcast`)来下沉或物化 mask。 + +## 示例 + +### 在消费者前把 b16 谓词重新解释为 b32 + +```mlir +%m16 = pto.pintlv_b16 %lhs, %rhs + : !pto.mask, !pto.mask -> !pto.mask, !pto.mask +%m32 = pto.pbitcast %m16#0 : !pto.mask -> !pto.mask +%result = pto.vsel %a, %b, %m32 + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +``` + +## 相关页面 + +- 指令集总览:[Conversion Ops](../../conversion-ops_zh.md) +- 向量侧 bitcast:[pto.vbitcast](./vbitcast_zh.md) +- 谓词生成与代数:[谓词生成与代数](../../../scalar/ops/predicate-generation-and-algebra/) diff --git a/docs/isa/vector/ops/conversion-ops/vbitcast.md b/docs/isa/vector/ops/conversion-ops/vbitcast.md new file mode 100644 index 000000000..01b72f9eb --- /dev/null +++ b/docs/isa/vector/ops/conversion-ops/vbitcast.md @@ -0,0 +1,94 @@ +# pto.vbitcast + +`pto.vbitcast` is part of the [Conversion Ops](../../conversion-ops.md) instruction set. + +## Summary + +`pto.vbitcast` performs a bitwise reinterpretation of a `!pto.vreg<...>` value without changing the underlying bit pattern. The total bit width is preserved (always 2048 bits for a VPTO `vreg`), so only the element type and lane count interpretation change. + +Unlike [pto.vcvt](./vcvt.md), `pto.vbitcast` does not round, saturate, or rescale any value — every bit of the source register is copied unchanged into the result register. + +## Mechanism + +The op is a pure type cast at the vector-register level. No payload bytes are modified; only the surrounding VPTO IR's interpretation of the register changes. This makes type punning between integer and floating-point families explicit in SSA form, instead of being inferred from hidden hardware state. 
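The "pure type cast" behaviour described above has a familiar scalar analogue: copying the bytes of a float array into an integer array without converting values. The sketch below is illustrative only. The helper name `bitcast_f32_to_u32` is hypothetical, and it assumes a 32-bit IEEE-754 `float`, which the static assertion checks only in part (size, not format).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The sketch only makes sense when float and uint32_t have the same size. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Lane-wise bit reinterpretation f32 -> 32-bit integer.
 * No rounding or conversion happens: the payload bytes are copied as-is,
 * which mirrors what pto.vbitcast does at the register level. */
void bitcast_f32_to_u32(const float *src, uint32_t *dst, size_t lanes)
{
    memcpy(dst, src, lanes * sizeof(float));
}
```

On an IEEE-754 target, a lane holding `1.0f` would read back as `0x3F800000` and `-2.0f` as `0xC0000000`. A value conversion such as `pto.vcvt` would instead produce the integers `1` and `-2`.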
+ +## Syntax + +```mlir +%result = pto.vbitcast %input : !pto.vreg -> !pto.vreg +``` + +## Inputs + +| Operand | Type | Description | +|---------|------|-------------| +| `%input` | `!pto.vreg` | Source vector register. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `%result` | `!pto.vreg` | Destination vector register with the same bit pattern, reinterpreted as `MxT1`. | + +## Side Effects + +`pto.vbitcast` has no architectural side effects beyond producing its SSA result. It does not implicitly reserve buffers, signal events, or establish memory fences. + +## Constraints + +!!! warning "Constraints" + - Both source and result must be `!pto.vreg<...>` types. + - Source and result vectors must have the same total bit width (currently 2048 bits): `N * bitwidth(T0) = M * bitwidth(T1) = 2048`. + - Only integer and floating-point element types are supported. + +**Element-bit-width equality examples:** + +- `f32<64>` → `i32<64>` (both 32-bit elements, total 2048 bits) +- `f16<128>` → `i16<128>` (both 16-bit elements, total 2048 bits) +- `bf16<128>` → `ui16<128>` (both 16-bit elements, total 2048 bits) +- `si32<64>` → `ui32<64>` (both 32-bit elements, total 2048 bits) +- `f32<64>` → `i16<128>` (32-bit and 16-bit elements, total 2048 bits) + +The verifier rejects shapes for which the source and destination total bit widths differ. 
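The total-bit-width rule can be checked with simple arithmetic. The following is a sketch of such a check under the constraint stated above (`N * bitwidth(T0) = M * bitwidth(T1) = 2048`). The function name `vbitcast_shape_ok` and the constant `VREG_BITS` are hypothetical and do not describe the actual verifier implementation.

```c
#include <stdbool.h>

/* Total width of a VPTO vector register, as stated in the constraints. */
#define VREG_BITS 2048u

/* Returns true when both the source shape and the destination shape
 * fill exactly one 2048-bit register. */
bool vbitcast_shape_ok(unsigned src_lanes, unsigned src_elem_bits,
                       unsigned dst_lanes, unsigned dst_elem_bits)
{
    return src_lanes * src_elem_bits == VREG_BITS &&
           dst_lanes * dst_elem_bits == VREG_BITS;
}
```

Under this check, 64 lanes of 32 bits to 128 lanes of 16 bits is accepted (64 × 32 = 128 × 16 = 2048), whereas 64 lanes of 32 bits to 64 lanes of 16 bits is rejected because the destination covers only 1024 bits.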
+ +## Comparison with `pto.vcvt` + +| Aspect | `pto.vcvt` | `pto.vbitcast` | +|--------|------------|----------------| +| Bit pattern | May change (rounding, saturation, sign extension) | Preserved exactly | +| Lane count | May change with documented type-pair rules | May change as long as total bit width stays 2048 | +| Rounding / saturation attributes | Supported (`rnd`, `sat`, `part`) | None | +| Predicate operand | Required (`%mask`) | None — bitcast is unconditional | + +## Examples + +### Reinterpret float as integer for bit manipulation + +```mlir +// Prepare a vector of float values +%fvec = pto.vlds %ub[%lane] : !pto.ptr -> !pto.vreg<64xf32> + +// Reinterpret as integer for bitwise operations +%ivec = pto.vbitcast %fvec : !pto.vreg<64xf32> -> !pto.vreg<64xi32> + +// Extract sign bit (bit 31) +%sign_bits = pto.vand %ivec, %sign_mask, %mask + : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32> + +// Reinterpret back to float +%fvec_without_sign = pto.vbitcast %sign_bits : !pto.vreg<64xi32> -> !pto.vreg<64xf32> +``` + +### Type punning between signed and unsigned integer + +```mlir +%signed = pto.vlds %ub[%lane] : !pto.ptr -> !pto.vreg<64xsi32> +%unsigned = pto.vbitcast %signed : !pto.vreg<64xsi32> -> !pto.vreg<64xui32> +// Bits are identical; interpretation changes from signed to unsigned +``` + +## Related Ops / Instruction Set Links + +- Instruction set overview: [Conversion Ops](../../conversion-ops.md) +- Value-changing conversion: [pto.vcvt](./vcvt.md) +- Mask-side bitcast: [pto.pbitcast](./pbitcast.md) diff --git a/docs/isa/vector/ops/conversion-ops/vbitcast_zh.md b/docs/isa/vector/ops/conversion-ops/vbitcast_zh.md new file mode 100644 index 000000000..2f12e3d69 --- /dev/null +++ b/docs/isa/vector/ops/conversion-ops/vbitcast_zh.md @@ -0,0 +1,94 @@ +# pto.vbitcast + +`pto.vbitcast` 属于 [Conversion Ops](../../conversion-ops_zh.md) 指令集。 + +## 摘要 + +`pto.vbitcast` 对 `!pto.vreg<...>` 值执行按位重新解释,不改变底层位模式。VPTO 的 `vreg` 总位宽恒为 2048 
bits,本操作只改变元素类型与车道数解释。 + +与 [pto.vcvt](./vcvt_zh.md) 不同,`pto.vbitcast` 不做舍入、饱和或数值重定标——源寄存器的每一位都原封不动地拷贝到目标寄存器。 + +## 机制 + +本操作在向量寄存器层面是一次纯类型转换。Payload 字节不变,只改变周围 VPTO IR 对该寄存器的解释。这样可以把整型与浮点家族之间的 type punning 在 SSA 中显式化,而不必依赖任何隐含的硬件状态。 + +## 语法 + +```mlir +%result = pto.vbitcast %input : !pto.vreg -> !pto.vreg +``` + +## 输入 + +| 操作数 | 类型 | 描述 | +|---------|------|------| +| `%input` | `!pto.vreg` | 源向量寄存器。 | + +## 预期输出 + +| 结果 | 类型 | 描述 | +|--------|------|------| +| `%result` | `!pto.vreg` | 与源位模式完全相同、被重新解释为 `MxT1` 的目标寄存器。 | + +## 副作用 + +`pto.vbitcast` 除了产生 SSA 结果以外没有任何架构层面的副作用。它不会预留缓冲区、发出事件,也不会建立内存屏障。 + +## 约束 + +!!! warning "约束" + - 源和目标都必须是 `!pto.vreg<...>` 类型。 + - 源与目标的总位宽必须相等(当前为 2048 bits):`N * bitwidth(T0) = M * bitwidth(T1) = 2048`。 + - 仅支持整型和浮点元素类型。 + +**位宽相等的形状示例:** + +- `f32<64>` → `i32<64>`(两侧都是 32 位元素,共 2048 bits) +- `f16<128>` → `i16<128>`(两侧都是 16 位元素,共 2048 bits) +- `bf16<128>` → `ui16<128>`(两侧都是 16 位元素,共 2048 bits) +- `si32<64>` → `ui32<64>`(两侧都是 32 位元素,共 2048 bits) +- `f32<64>` → `i16<128>`(32 位/16 位元素,共 2048 bits) + +verifier 会拒绝总位宽不一致的形状。 + +## 与 `pto.vcvt` 的比较 + +| 维度 | `pto.vcvt` | `pto.vbitcast` | +|--------|------------|----------------| +| 位模式 | 可能改变(舍入、饱和、符号扩展) | 完全保持 | +| 车道数 | 在已记录的类型对规则下可改变 | 在总位宽 2048 不变的前提下可改变 | +| 舍入 / 饱和属性 | 支持 (`rnd`, `sat`, `part`) | 无 | +| 谓词操作数 | 必须提供 `%mask` | 不需要——bitcast 是无条件的 | + +## 示例 + +### 将浮点重新解释为整型以做位操作 + +```mlir +// 准备一个浮点向量 +%fvec = pto.vlds %ub[%lane] : !pto.ptr -> !pto.vreg<64xf32> + +// 重新解释为整型,准备做位操作 +%ivec = pto.vbitcast %fvec : !pto.vreg<64xf32> -> !pto.vreg<64xi32> + +// 取符号位(bit 31) +%sign_bits = pto.vand %ivec, %sign_mask, %mask + : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask -> !pto.vreg<64xi32> + +// 再解释回浮点 +%fvec_without_sign = pto.vbitcast %sign_bits : !pto.vreg<64xi32> -> !pto.vreg<64xf32> +``` + +### 有符号与无符号整型之间的 type punning + +```mlir +%signed = pto.vlds %ub[%lane] : !pto.ptr -> !pto.vreg<64xsi32> +%unsigned = pto.vbitcast %signed : !pto.vreg<64xsi32> -> !pto.vreg<64xui32> +// 
位完全相同,解释从有符号变成无符号 +``` + +## 相关页面 + +- 指令集总览:[Conversion Ops](../../conversion-ops_zh.md) +- 数值变换:[pto.vcvt](./vcvt_zh.md) +- 谓词侧 bitcast:[pto.pbitcast](./pbitcast_zh.md) diff --git a/docs/isa/vector/ops/data-rearrangement/vsqz.md b/docs/isa/vector/ops/data-rearrangement/vsqz.md index e703c3e2f..aeb961802 100644 --- a/docs/isa/vector/ops/data-rearrangement/vsqz.md +++ b/docs/isa/vector/ops/data-rearrangement/vsqz.md @@ -21,7 +21,7 @@ vsqz %dst, %src, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -29,7 +29,7 @@ vsqz %dst, %src, %mask | Operand | Type | Description | | --- | --- | --- | | %src | `!pto.vreg` | Source vector | -| %mask | `!pto.mask` | Predicate mask selecting which lanes are kept | +| %mask | `!pto.mask` | Predicate mask selecting which lanes are kept | ## Expected Outputs diff --git a/docs/isa/vector/ops/data-rearrangement/vsqz_zh.md b/docs/isa/vector/ops/data-rearrangement/vsqz_zh.md index 5740bf4f4..d2bf0f116 100644 --- a/docs/isa/vector/ops/data-rearrangement/vsqz_zh.md +++ b/docs/isa/vector/ops/data-rearrangement/vsqz_zh.md @@ -30,7 +30,7 @@ vsqz %dst, %src, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vsqz %src, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -38,7 +38,7 @@ vsqz %dst, %src, %mask | 操作数 | 类型 | 说明 | |--------|------|------| | `%src` | `!pto.vreg` | 源向量 | -| `%mask` | `!pto.mask` | 选出要保留 lane 的谓词掩码 | +| `%mask` | `!pto.mask` | 选出要保留 lane 的谓词掩码 | ## 预期输出 diff --git a/docs/isa/vector/ops/data-rearrangement/vusqz.md b/docs/isa/vector/ops/data-rearrangement/vusqz.md index b32e29cbe..e344b51f3 100644 --- a/docs/isa/vector/ops/data-rearrangement/vusqz.md +++ b/docs/isa/vector/ops/data-rearrangement/vusqz.md @@ -21,14 +21,14 @@ vusqz %dst, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vusqz %mask : !pto.mask -> 
!pto.vreg +%result = pto.vusqz %mask : !pto.mask -> !pto.vreg ``` ## Inputs | Operand | Type | Description | | --- | --- | --- | -| %mask | `!pto.mask` | Predicate mask that selects the lanes that should receive front-packed elements | +| %mask | `!pto.mask` | Predicate mask that selects the lanes that should receive front-packed elements | ## Expected Outputs diff --git a/docs/isa/vector/ops/data-rearrangement/vusqz_zh.md b/docs/isa/vector/ops/data-rearrangement/vusqz_zh.md index 12197e21a..3e2eba752 100644 --- a/docs/isa/vector/ops/data-rearrangement/vusqz_zh.md +++ b/docs/isa/vector/ops/data-rearrangement/vusqz_zh.md @@ -30,14 +30,14 @@ vusqz %dst, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vusqz %mask : !pto.mask -> !pto.vreg +%result = pto.vusqz %mask : !pto.mask -> !pto.vreg ``` ## 输入 | 操作数 | 类型 | 说明 | |--------|------|------| -| `%mask` | `!pto.mask` | 指定哪些 lane 应接收前部压紧流元素的谓词 | +| `%mask` | `!pto.mask` | 指定哪些 lane 应接收前部压紧流元素的谓词 | ## 预期输出 diff --git a/docs/isa/vector/ops/reduction-ops/vcadd.md b/docs/isa/vector/ops/reduction-ops/vcadd.md index a569949cb..52c5bb21a 100644 --- a/docs/isa/vector/ops/reduction-ops/vcadd.md +++ b/docs/isa/vector/ops/reduction-ops/vcadd.md @@ -27,7 +27,7 @@ vcadd %dst, %src, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Supported element types on A5: `i16-i64`, `f16`, `f32`. 
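The shape of a `pto.vcadd` result (total in lane 0, remaining lanes zero) can be sketched with a scalar reference loop. This is a rough model under stated assumptions: the helper name `vcadd_ref` and the byte-per-lane mask are hypothetical, inactive lanes are assumed to contribute nothing to the sum, and the sequential accumulation order is an illustration only. Hardware may reduce in a different order, so floating-point results can differ in the last bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference sketch of a masked sum reduction.
 * Assumption: lanes with mask[i] == 0 are skipped; the order of
 * accumulation here is sequential and not a statement about hardware. */
void vcadd_ref(const float *src, const uint8_t *mask, float *dst, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (mask[i])
            sum += src[i];   /* accumulate active lanes only */
        dst[i] = 0.0f;       /* every lane is cleared first */
    }
    if (n > 0)
        dst[0] = sum;        /* the total lands in lane 0 */
}
```

With inputs `{1, 2, 3, 4}` and the third lane masked off, lane 0 of the result would hold `7` and the other lanes `0`.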
@@ -94,13 +94,13 @@ for (int i = 1; i < N; i++) ```mlir // Full-vector sum reduction: result in lane 0 -%result = pto.vcadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> +%result = pto.vcadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> ``` ### MLIR — DPS Form ```mlir -pto.vcadd ins(%input, %mask : !pto.vreg<128xf32>, !pto.mask) +pto.vcadd ins(%input, %mask : !pto.vreg<128xf32>, !pto.mask) outs(%result : !pto.vreg<128xf32>) ``` @@ -108,8 +108,8 @@ pto.vcadd ins(%input, %mask : !pto.vreg<128xf32>, !pto.mask) ```mlir // Compute the sum of a 128-element f32 vector tile -%mask = pto vidu %c128 : i1 -> !pto.mask -%sum = pto.vcadd %vec, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> +%mask = pto vidu %c128 : i1 -> !pto.mask +%sum = pto.vcadd %vec, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> // %sum[0] contains the total; %sum[1..127] are zero ``` diff --git a/docs/isa/vector/ops/reduction-ops/vcadd_zh.md b/docs/isa/vector/ops/reduction-ops/vcadd_zh.md index 8e822cca3..3d5c68b48 100644 --- a/docs/isa/vector/ops/reduction-ops/vcadd_zh.md +++ b/docs/isa/vector/ops/reduction-ops/vcadd_zh.md @@ -25,7 +25,7 @@ vcadd %dst, %src, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`i16-i64`、`f16`、`f32`。 @@ -88,7 +88,7 @@ for (int i = 1; i < N; i++) ### MLIR ```mlir -%result = pto.vcadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> +%result = pto.vcadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/reduction-ops/vcgadd.md b/docs/isa/vector/ops/reduction-ops/vcgadd.md index 218e865f9..6955863bc 100644 --- a/docs/isa/vector/ops/reduction-ops/vcgadd.md +++ b/docs/isa/vector/ops/reduction-ops/vcgadd.md @@ -29,7 +29,7 @@ vcgadd %dst, %src, %mask : !pto.vreg ### AS Level 1 (SSA) 
```mlir -%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Supported element types on A5: `i16-i32`, `f16`, `f32`. @@ -105,13 +105,13 @@ for (int g = 0; g < 8; g++) { ```mlir // Lane-group sum reduction: one result per 32-byte VLane -%result = pto.vcgadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> +%result = pto.vcgadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> ``` ### MLIR — DPS Form ```mlir -pto.vcgadd ins(%input, %mask : !pto.vreg<128xf32>, !pto.mask) +pto.vcgadd ins(%input, %mask : !pto.vreg<128xf32>, !pto.mask) outs(%result : !pto.vreg<128xf32>) ``` @@ -119,9 +119,9 @@ pto.vcgadd ins(%input, %mask : !pto.vreg<128xf32>, !pto.mask) ```mlir // Compute row-wise softmax: step 1 — compute exp and lane-group sum -%exp = pto.vexpdiff %row, %c0 : !pto.vreg<128xf32>, f32 -> !pto.vreg<128xf32> -%mask = pto vidu %c128 : i1 -> !pto.mask -%sum = pto.vcgadd %exp, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> +%exp = pto.vexpdif %row, %c0 : !pto.vreg<128xf32>, f32 -> !pto.vreg<128xf32> +%mask = pto vidu %c128 : i1 -> !pto.mask +%sum = pto.vcgadd %exp, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> // %sum[0,8,16,...] 
holds per-VLane exp sums for normalization ``` diff --git a/docs/isa/vector/ops/reduction-ops/vcgadd_zh.md b/docs/isa/vector/ops/reduction-ops/vcgadd_zh.md index 246202266..04a2e7f16 100644 --- a/docs/isa/vector/ops/reduction-ops/vcgadd_zh.md +++ b/docs/isa/vector/ops/reduction-ops/vcgadd_zh.md @@ -25,7 +25,7 @@ vcgadd %dst, %src, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`i16-i32`、`f16`、`f32`。 @@ -91,7 +91,7 @@ for (int g = 0; g < 8; g++) { ### MLIR ```mlir -%result = pto.vcgadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> +%result = pto.vcgadd %input, %mask : !pto.vreg<128xf32>, !pto.mask -> !pto.vreg<128xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/reduction-ops/vcgmax.md b/docs/isa/vector/ops/reduction-ops/vcgmax.md index f24da219b..d10adebfa 100644 --- a/docs/isa/vector/ops/reduction-ops/vcgmax.md +++ b/docs/isa/vector/ops/reduction-ops/vcgmax.md @@ -21,7 +21,7 @@ vcgmax %dst, %src, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -29,7 +29,7 @@ vcgmax %dst, %src, %mask : !pto.vreg | Operand | Type | Description | | --- | --- | --- | | %input | `!pto.vreg` | Source vector register to reduce per VLane group | -| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | +| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | ## Expected Outputs @@ -71,7 +71,7 @@ for (int g = 0; g < GROUPS; g++) { ``` ```mlir -%result = pto.vcgmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vcgmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/reduction-ops/vcgmax_zh.md 
b/docs/isa/vector/ops/reduction-ops/vcgmax_zh.md index 1fbedc75e..9d7211005 100644 --- a/docs/isa/vector/ops/reduction-ops/vcgmax_zh.md +++ b/docs/isa/vector/ops/reduction-ops/vcgmax_zh.md @@ -21,7 +21,7 @@ vcgmax %dst, %src, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -29,7 +29,7 @@ vcgmax %dst, %src, %mask : !pto.vreg | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 在每个 VLane 组内参与归约的源向量 | -| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | +| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | ## 预期输出 @@ -71,7 +71,7 @@ for (int g = 0; g < GROUPS; g++) { ``` ```mlir -%result = pto.vcgmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vcgmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/reduction-ops/vcgmin.md b/docs/isa/vector/ops/reduction-ops/vcgmin.md index 5452f0bcb..9d536aae4 100644 --- a/docs/isa/vector/ops/reduction-ops/vcgmin.md +++ b/docs/isa/vector/ops/reduction-ops/vcgmin.md @@ -21,7 +21,7 @@ vcgmin %dst, %src, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -29,7 +29,7 @@ vcgmin %dst, %src, %mask : !pto.vreg | Operand | Type | Description | | --- | --- | --- | | %input | `!pto.vreg` | Source vector register to reduce per VLane group | -| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | +| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | ## Expected Outputs @@ -71,7 +71,7 @@ for (int g = 0; g < GROUPS; g++) { ``` ```mlir -%result = pto.vcgmin %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vcgmin %input, %mask : !pto.vreg<64xf32>, !pto.mask 
-> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/reduction-ops/vcgmin_zh.md b/docs/isa/vector/ops/reduction-ops/vcgmin_zh.md index 45ca58138..cfbc9db43 100644 --- a/docs/isa/vector/ops/reduction-ops/vcgmin_zh.md +++ b/docs/isa/vector/ops/reduction-ops/vcgmin_zh.md @@ -21,7 +21,7 @@ vcgmin %dst, %src, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -29,7 +29,7 @@ vcgmin %dst, %src, %mask : !pto.vreg | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 在每个 VLane 组内参与归约的源向量 | -| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | +| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | ## 预期输出 @@ -71,7 +71,7 @@ for (int g = 0; g < GROUPS; g++) { ``` ```mlir -%result = pto.vcgmin %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vcgmin %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/reduction-ops/vcmax.md b/docs/isa/vector/ops/reduction-ops/vcmax.md index bc3080f66..3dbafb325 100644 --- a/docs/isa/vector/ops/reduction-ops/vcmax.md +++ b/docs/isa/vector/ops/reduction-ops/vcmax.md @@ -21,7 +21,7 @@ vcmax %dst, %src, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -29,7 +29,7 @@ vcmax %dst, %src, %mask : !pto.vreg | Operand | Type | Description | | --- | --- | --- | | %input | `!pto.vreg` | Source vector register to reduce | -| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | +| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | ## Expected Outputs @@ -72,7 +72,7 @@ result_index = idx; ``` ```mlir -%result = pto.vcmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> 
!pto.vreg<64xf32> +%result = pto.vcmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/reduction-ops/vcmax_zh.md b/docs/isa/vector/ops/reduction-ops/vcmax_zh.md index 30f4d150e..2b9686b97 100644 --- a/docs/isa/vector/ops/reduction-ops/vcmax_zh.md +++ b/docs/isa/vector/ops/reduction-ops/vcmax_zh.md @@ -21,7 +21,7 @@ vcmax %dst, %src, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -29,7 +29,7 @@ vcmax %dst, %src, %mask : !pto.vreg | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 待归约的源向量 | -| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | +| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | ## 预期输出 @@ -72,7 +72,7 @@ result_index = idx; ``` ```mlir -%result = pto.vcmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vcmax %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/reduction-ops/vcmin.md b/docs/isa/vector/ops/reduction-ops/vcmin.md index 9ab1c4d50..28fafc932 100644 --- a/docs/isa/vector/ops/reduction-ops/vcmin.md +++ b/docs/isa/vector/ops/reduction-ops/vcmin.md @@ -21,7 +21,7 @@ vcmin %dst, %src, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -29,7 +29,7 @@ vcmin %dst, %src, %mask : !pto.vreg | Operand | Type | Description | | --- | --- | --- | | %input | `!pto.vreg` | Source vector register to reduce | -| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | +| %mask | `!pto.mask` | Predicate mask; inactive lanes do not participate | ## Expected Outputs @@ -72,7 +72,7 @@ result_index = idx; ``` ```mlir -%result = pto.vcmin 
%input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vcmin %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/reduction-ops/vcmin_zh.md b/docs/isa/vector/ops/reduction-ops/vcmin_zh.md index 413a82822..c72e7d7d6 100644 --- a/docs/isa/vector/ops/reduction-ops/vcmin_zh.md +++ b/docs/isa/vector/ops/reduction-ops/vcmin_zh.md @@ -21,7 +21,7 @@ vcmin %dst, %src, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -29,7 +29,7 @@ vcmin %dst, %src, %mask : !pto.vreg | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 待归约的源向量 | -| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | +| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 不参与归约 | ## 预期输出 @@ -72,7 +72,7 @@ result_index = idx; ``` ```mlir -%result = pto.vcmin %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vcmin %input, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/reduction-ops/vcpadd.md b/docs/isa/vector/ops/reduction-ops/vcpadd.md index 31f219cc8..bf7d099c8 100644 --- a/docs/isa/vector/ops/reduction-ops/vcpadd.md +++ b/docs/isa/vector/ops/reduction-ops/vcpadd.md @@ -21,7 +21,7 @@ vcpadd %dst, %src, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -29,7 +29,7 @@ vcpadd %dst, %src, %mask : !pto.vreg | Operand | Type | Description | | --- | --- | --- | | %input | `!pto.vreg` | Source vector register to scan | -| %mask | `!pto.mask` | Predicate mask; inactive lanes contribute zero | +| %mask | `!pto.mask` | Predicate mask; inactive lanes contribute zero | ## Expected Outputs @@ -71,7 +71,7 @@ for (int 
i = 0; i < N; i++) { ``` ```mlir -%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/reduction-ops/vcpadd_zh.md b/docs/isa/vector/ops/reduction-ops/vcpadd_zh.md index fe201577e..899907d76 100644 --- a/docs/isa/vector/ops/reduction-ops/vcpadd_zh.md +++ b/docs/isa/vector/ops/reduction-ops/vcpadd_zh.md @@ -21,7 +21,7 @@ vcpadd %dst, %src, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -29,7 +29,7 @@ vcpadd %dst, %src, %mask : !pto.vreg | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 做 scan 的源向量 | -| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 贡献 0 | +| `%mask` | `!pto.mask` | 谓词掩码;inactive lane 贡献 0 | ## 预期输出 @@ -71,7 +71,7 @@ for (int i = 0; i < N; i++) { ``` ```mlir -%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md index 8adcd1194..f2e6f41ae 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md @@ -21,7 +21,7 @@ vaddrelu %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vaddrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vaddrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` Documented A5 types: `f16, f32`. @@ -32,7 +32,7 @@ Documented A5 types: `f16, f32`. 
||---------|------|-------------| || `%lhs` | `!pto.vreg` | Left-hand source vector register | || `%rhs` | `!pto.vreg` | Right-hand source vector register | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. @@ -97,7 +97,7 @@ for (int i = 0; i < N; i++) ### MLIR form ```mlir -%result = pto.vaddrelu %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vaddrelu %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ### C++ intrinsic @@ -115,5 +115,5 @@ VADDRELU(vdst, va, vb, mask); ## Related Ops / Instruction Set Links - Instruction set overview: [SFU And DSA Instructions](../../sfu-and-dsa-ops.md) -- Previous op in instruction set: [pto.vexpdiff](./vexpdiff.md) +- Previous op in instruction set: [pto.vexpdif](./vexpdif.md) - Next op in instruction set: [pto.vsubrelu](./vsubrelu.md) diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu_zh.md index 2283ad5f5..25d2cc8c4 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu_zh.md @@ -27,7 +27,7 @@ vaddrelu %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vaddrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vaddrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -38,7 +38,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 |--------|------|------| | `%lhs` | `!pto.vreg` | 左操作数向量 | | `%rhs` | `!pto.vreg` | 右操作数向量 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -95,7 +95,7 @@ for (int i = 0; i < N; i++) ```mlir 
%result = pto.vaddrelu %lhs, %rhs, %mask - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ## 性能 @@ -115,5 +115,5 @@ PTO 微指令页面当前使用的时序来源是 `~/visa.txt` 与最新抓取 ## 相关页面 - 指令集总览:[SFU 与 DSA 操作](../../sfu-and-dsa-ops_zh.md) -- 上一条指令:[pto.vexpdiff](./vexpdiff_zh.md) +- 上一条指令:[pto.vexpdif](./vexpdif_zh.md) - 下一条指令:[pto.vsubrelu](./vsubrelu_zh.md) diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md index f33107235..cda6651fe 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv.md @@ -21,7 +21,7 @@ vaddreluconv %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vaddreluconv %dst, %lhs, %rhs, %mask : !pto.vreg ||---------|------|-------------| || `%lhs` | `!pto.vreg` | Left-hand source vector | || `%rhs` | `!pto.vreg` | Right-hand source vector | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source vectors MUST have the same element type `T0` and the same vector width `N`. The mask width MUST match `N`. 
@@ -103,10 +103,10 @@ for (int i = 0; i < 128; i++) ```mlir // Widening: f16 → f32 -%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg<64xf16>, !pto.vreg<64xf16>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg<64xf16>, !pto.vreg<64xf16>, !pto.mask) -> !pto.vreg<64xf32> // Narrowing: f32 → f16 with saturation -%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf16> +%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf16> ``` ### Common use cases diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv_zh.md index ea4dcc2c1..00327acd8 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vaddreluconv_zh.md @@ -27,7 +27,7 @@ vaddreluconv %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vaddreluconv %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ## 输入 @@ -36,7 +36,7 @@ vaddreluconv %dst, %lhs, %rhs, %mask : !pto.vreg |--------|------|------| | `%lhs` | `!pto.vreg` | 左操作数向量 | | `%rhs` | `!pto.vreg` | 右操作数向量 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md index ed495f914..0799a1696 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md @@ -21,7 +21,7 @@ vaxpy %dst, %x, %y, %alpha, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vaxpy %x, %y, %alpha, %mask : (!pto.vreg, !pto.vreg, T, !pto.mask) -> !pto.vreg +%result = pto.vaxpy %x, %y, %alpha, %mask : (!pto.vreg, !pto.vreg, T, !pto.mask) -> 
!pto.vreg ``` Documented A5 types: `f16, f32`. @@ -33,7 +33,7 @@ Documented A5 types: `f16, f32`. || `%x` | `!pto.vreg` | Scaled vector operand | || `%y` | `!pto.vreg` | Addend vector operand | || `%alpha` | `T` (scalar) | Scalar multiplier | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source vectors MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. @@ -98,7 +98,7 @@ for (int i = 0; i < N; i++) ### MLIR form ```mlir -%result = pto.vaxpy %x, %y, %alpha, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, f32, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vaxpy %x, %y, %alpha, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, f32, !pto.mask) -> !pto.vreg<64xf32> ``` ### C++ intrinsic diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy_zh.md index 621963dd0..09da35818 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy_zh.md @@ -27,7 +27,7 @@ vaxpy %dst, %x, %y, %alpha, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vaxpy %x, %y, %alpha, %mask : (!pto.vreg, !pto.vreg, T, !pto.mask) -> !pto.vreg +%result = pto.vaxpy %x, %y, %alpha, %mask : (!pto.vreg, !pto.vreg, T, !pto.mask) -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -39,7 +39,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 | `%x` | `!pto.vreg` | 被标量缩放的向量 | | `%y` | `!pto.vreg` | 加数向量 | | `%alpha` | `T` | 广播到各 lane 的标量乘子 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -96,7 +96,7 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vaxpy %x, %y, %alpha, %mask - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, f32, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, f32, !pto.mask) -> !pto.vreg<64xf32> ``` ## 性能 diff --git 
a/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif.md similarity index 76% rename from docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md rename to docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif.md index 594d345e6..611240e44 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif.md @@ -1,6 +1,6 @@ -# pto.vexpdiff +# pto.vexpdif -`pto.vexpdiff` is part of the [SFU And DSA Instructions](../../sfu-and-dsa-ops.md) instruction set. +`pto.vexpdif` is part of the [SFU And DSA Instructions](../../sfu-and-dsa-ops.md) instruction set. ## Summary @@ -8,20 +8,20 @@ Fused exp(x - max) for numerically stable softmax. ## Mechanism -`pto.vexpdiff` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the instruction set can be reasoned about at the ISA level. +`pto.vexpdif` is a specialized `pto.v*` operation. It exposes fused, widening, or domain-specific hardware behavior through one stable virtual mnemonic so the instruction set can be reasoned about at the ISA level. ## Syntax ### PTO Assembly Form ```text -vexpdiff %result, %input, %max +vexpdif %result, %input, %max ``` ### AS Level 1 (SSA) ```mlir -%result = pto.vexpdiff %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg +%result = pto.vexpdif %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg ``` Documented A5 types or forms: `f16, f32`. @@ -63,14 +63,14 @@ This operation has no architectural side effect beyond producing its SSA results ### Timing Disclosure The current public VPTO timing material for PTO micro instructions remains limited. -For `pto.vexpdiff`, those public sources describe the instruction semantics, operand legality, and pipeline placement, but they do **not** publish a numeric latency or steady-state throughput. 
+For `pto.vexpdif`, those public sources describe the instruction semantics, operand legality, and pipeline placement, but they do **not** publish a numeric latency or steady-state throughput. | Metric | Status | Source Basis | |--------|--------|--------------| | A5 latency | Not publicly published | Current public VPTO timing material | | Steady-state throughput | Not publicly published | Current public VPTO timing material | -If software scheduling or performance modeling depends on the exact cost of `pto.vexpdiff`, treat that cost as target-profile-specific and measure it on the concrete backend rather than inferring a manual constant. +If software scheduling or performance modeling depends on the exact cost of `pto.vexpdif`, treat that cost as target-profile-specific and measure it on the concrete backend rather than inferring a manual constant. ## Examples diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif_zh.md similarity index 79% rename from docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff_zh.md rename to docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif_zh.md index d2a4fdcd1..47c3def52 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif_zh.md @@ -1,6 +1,6 @@ -# pto.vexpdiff +# pto.vexpdif -`pto.vexpdiff` 属于[SFU 与 DSA 操作](../../sfu-and-dsa-ops_zh.md)指令集。 +`pto.vexpdif` 属于[SFU 与 DSA 操作](../../sfu-and-dsa-ops_zh.md)指令集。 ## 概述 @@ -8,7 +8,7 @@ ## 机制 -`pto.vexpdiff` 把“先减去最大值,再做指数”这两个步骤压成一条指令: +`pto.vexpdif` 把“先减去最大值,再做指数”这两个步骤压成一条指令: ```text dst[i] = exp(input[i] - max[i]) @@ -21,13 +21,13 @@ dst[i] = exp(input[i] - max[i]) ### PTO 汇编形式 ```text -vexpdiff %result, %input, %max +vexpdif %result, %input, %max ``` ### AS Level 1(SSA) ```mlir -%result = pto.vexpdiff %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg +%result = pto.vexpdif %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -55,7 +55,7 @@ A5 
当前文档化支持的类型:`f16`、`f32`。 !!! danger "异常与非法情形" - verifier 会拒绝非法的操作数形状、不支持的元素类型以及不合法的属性组合。 - 数值异常仍由目标 profile 决定。 - - 约束部分列出的额外非法情形,同样属于 `pto.vexpdiff` 的契约。 + - 约束部分列出的额外非法情形,同样属于 `pto.vexpdif` 的契约。 ## 目标 Profile 限制 @@ -67,7 +67,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 ### A5 时延 -当前手册未给出 `vexpdiff` 的独立周期表,但它正是为了替代 `vsub + vexp` 的两段式路径而存在,通常应视为 softmax 路径的优先形式。 +当前手册未给出 `vexpdif` 的独立周期表,但它正是为了替代 `vsub + vexp` 的两段式路径而存在,通常应视为 softmax 路径的优先形式。 ### A2/A3 吞吐 diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md index 7f59c9d5e..4898778bf 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vmula.md @@ -21,7 +21,7 @@ vmula %dst, %add, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vmula %add, %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmula %add, %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ## Inputs @@ -31,7 +31,7 @@ vmula %dst, %add, %lhs, %rhs, %mask : !pto.vreg || `%add` | `!pto.vreg` | Accumulator input vector | || `%lhs` | `!pto.vreg` | Left-hand multiplicand vector | || `%rhs` | `!pto.vreg` | Right-hand multiplicand vector | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | All four operands MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. 
@@ -96,7 +96,7 @@ for (int i = 0; i < N; i++) ### MLIR form ```mlir -%result = pto.vmula %acc, %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vmula %acc, %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ### C++ intrinsic diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vmula_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vmula_zh.md index 10b273a25..f4d7f3ba7 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vmula_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vmula_zh.md @@ -27,7 +27,7 @@ vmula %dst, %add, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vmula %add, %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmula %add, %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ## 输入 @@ -37,7 +37,7 @@ vmula %dst, %add, %lhs, %rhs, %mask : !pto.vreg | `%add` | `!pto.vreg` | 累加输入向量 | | `%lhs` | `!pto.vreg` | 左乘数向量 | | `%rhs` | `!pto.vreg` | 右乘数向量 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -88,7 +88,7 @@ vmula %dst, %add, %lhs, %rhs, %mask : !pto.vreg ```mlir %result = pto.vmula %acc, %lhs, %rhs, %mask - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md index a596b505c..a0d64ef47 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv.md @@ -21,7 +21,7 @@ vmulconv %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vmulconv %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmulconv 
%lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vmulconv %dst, %lhs, %rhs, %mask : !pto.vreg ||---------|------|-------------| || `%lhs` | `!pto.vreg` | Left-hand source vector | || `%rhs` | `!pto.vreg` | Right-hand source vector | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source vectors MUST have the same element type `T0` and the same vector width `N`. The mask width MUST match `N`. @@ -98,7 +98,7 @@ for (int i = 0; i < 128; i++) ### MLIR form ```mlir -%result = pto.vmulconv %lhs, %rhs, %mask : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xi8> +%result = pto.vmulconv %lhs, %rhs, %mask : (!pto.vreg<128xf16>, !pto.vreg<128xf16>, !pto.mask) -> !pto.vreg<128xi8> ``` ## Extended Arithmetic diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv_zh.md index c5d44f876..0ccb55f97 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vmulconv_zh.md @@ -27,7 +27,7 @@ vmulconv %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vmulconv %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmulconv %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` ## 输入 @@ -36,7 +36,7 @@ vmulconv %dst, %lhs, %rhs, %mask : !pto.vreg |--------|------|------| | `%lhs` | `!pto.vreg` | 左操作数向量 | | `%rhs` | `!pto.vreg` | 右操作数向量 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md index d848ebed0..c84b08a0b 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vmull.md @@ -21,7 +21,7 @@ vmull %dst, 
%sub, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vmull %sub, %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vmull %sub, %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` Documented A5 types: `i32/u32 (native 32×32→64 widening multiply)`. @@ -33,7 +33,7 @@ Documented A5 types: `i32/u32 (native 32×32→64 widening multiply)`. || `%sub` | `!pto.vreg` | Subtrahend input vector | || `%lhs` | `!pto.vreg` | Left-hand multiplicand vector | || `%rhs` | `!pto.vreg` | Right-hand multiplicand vector | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | All four operands MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. @@ -99,7 +99,7 @@ for (int i = 0; i < N; i++) ### MLIR form ```mlir -%result = pto.vmull %sub, %lhs, %rhs, %mask : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> +%result = pto.vmull %sub, %lhs, %rhs, %mask : (!pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask) -> !pto.vreg<64xi32> ``` ### C++ intrinsic diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vmull_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vmull_zh.md index 4901472a8..08be2a368 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vmull_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vmull_zh.md @@ -31,7 +31,7 @@ vmull %dst, %sub, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg +%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg ``` A5 当前文档化支持的形式:`i32/u32` 的原生 `32×32→64` 扩宽乘法。 @@ -42,7 +42,7 @@ A5 当前文档化支持的形式:`i32/u32` 的原生 `32×32→64` 扩宽乘 |--------|------|------| | `%lhs` | `!pto.vreg` | 左乘数向量 | | `%rhs` | `!pto.vreg` | 右乘数向量 | -| 
`%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md index 7e926fd1a..9f502b736 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md @@ -21,7 +21,7 @@ vprelu %dst, %src, %alpha, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vprelu %input, %alpha, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vprelu %input, %alpha, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types: `f16, f32`. @@ -32,7 +32,7 @@ Documented A5 types: `f16, f32`. ||---------|------|-------------| || `%input` | `!pto.vreg` | Activation input vector | || `%alpha` | `!pto.vreg` | Per-element slope (alpha) vector | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | ## Expected Outputs @@ -95,10 +95,10 @@ for (int i = 0; i < N; i++) ### MLIR form ```mlir -%result = pto.vprelu %input, %alpha, %mask : (!pto.vreg<64xf16>, !pto.vreg<64xf16>, !pto.mask) -> !pto.vreg<64xf16> +%result = pto.vprelu %input, %alpha, %mask : (!pto.vreg<64xf16>, !pto.vreg<64xf16>, !pto.mask) -> !pto.vreg<64xf16> ``` ## Related Ops / Instruction Set Links - Instruction set overview: [SFU And DSA Instructions](../../sfu-and-dsa-ops.md) -- Next op in instruction set: [pto.vexpdiff](./vexpdiff.md) +- Next op in instruction set: [pto.vexpdif](./vexpdif.md) diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu_zh.md index 5aa198a34..e59c3a17a 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vprelu_zh.md @@ -27,7 +27,7 @@ vprelu %dst, %src, %alpha, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vprelu %input, %alpha, 
%mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vprelu %input, %alpha, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -38,7 +38,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 |--------|------|------| | `%input` | `!pto.vreg` | 激活输入向量 | | `%alpha` | `!pto.vreg` | 每个元素各自的负半轴斜率 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -95,7 +95,7 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vprelu %input, %alpha, %mask - : !pto.vreg<64xf16>, !pto.vreg<64xf16>, !pto.mask -> !pto.vreg<64xf16> + : !pto.vreg<64xf16>, !pto.vreg<64xf16>, !pto.mask -> !pto.vreg<64xf16> ``` ## 性能 @@ -115,4 +115,4 @@ PTO 微指令页面当前使用的时序来源是 `~/visa.txt` 与最新抓取 ## 相关页面 - 指令集总览:[SFU 与 DSA 操作](../../sfu-and-dsa-ops_zh.md) -- 下一条指令:[pto.vexpdiff](./vexpdiff_zh.md) +- 下一条指令:[pto.vexpdif](./vexpdif_zh.md) diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md index 6839d242e..a04c60fb3 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md @@ -21,7 +21,7 @@ vsubrelu %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1 (SSA) ```mlir -%result = pto.vsubrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vsubrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` Documented A5 types: `f16, f32`. @@ -32,7 +32,7 @@ Documented A5 types: `f16, f32`. ||---------|------|-------------| || `%lhs` | `!pto.vreg` | Minuend source vector register | || `%rhs` | `!pto.vreg` | Subtrahend source vector register | -|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +|| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | Both source registers MUST have the same element type and the same vector width `N`. The mask width MUST match `N`. 
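The operand constraints stated for `pto.vsubrelu` (same element type, same width `N`, mask width equal to `N`) can be sketched as a scalar reference model. This is an illustration under stated assumptions, not the documented implementation: the name `vsubrelu_ref` and the `std::vector` containers are hypothetical, the active-lane result `max(lhs - rhs, 0)` is inferred from the op name and the minuend/subtrahend operand descriptions, and leaving inactive lanes unmodified follows what the `vexp` page states for its own inactive lanes rather than anything on the `vsubrelu` page.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference model; not part of the PTO headers.
// Assumed semantics: active lanes compute max(lhs - rhs, 0);
// inactive lanes keep whatever value dst already held.
void vsubrelu_ref(std::vector<float>& dst,
                  const std::vector<float>& lhs,
                  const std::vector<float>& rhs,
                  const std::vector<bool>& mask) {
    // Mirrors the documented constraint: sources and mask share one width N.
    assert(lhs.size() == rhs.size());
    assert(mask.size() == lhs.size());
    assert(dst.size() == lhs.size());

    for (std::size_t i = 0; i < lhs.size(); ++i) {
        if (mask[i]) {
            // Active lane: subtract, then clamp negative results to zero.
            dst[i] = std::max(lhs[i] - rhs[i], 0.0f);
        }
        // Inactive lane: dst[i] is intentionally left untouched (assumption).
    }
}
```

A model like this is mainly useful for checking lane-by-lane expectations in a host-side test before comparing against hardware or simulator output; it says nothing about timing or rounding behavior on A5.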
@@ -97,7 +97,7 @@ for (int i = 0; i < N; i++) ### MLIR form ```mlir -%result = pto.vsubrelu %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> +%result = pto.vsubrelu %lhs, %rhs, %mask : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ### C++ intrinsic diff --git a/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu_zh.md b/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu_zh.md index ce253d193..ee7838197 100644 --- a/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu_zh.md +++ b/docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu_zh.md @@ -27,7 +27,7 @@ vsubrelu %dst, %lhs, %rhs, %mask : !pto.vreg ### AS Level 1(SSA) ```mlir -%result = pto.vsubrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vsubrelu %lhs, %rhs, %mask : (!pto.vreg, !pto.vreg, !pto.mask) -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -38,7 +38,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 |--------|------|------| | `%lhs` | `!pto.vreg` | 被减数向量 | | `%rhs` | `!pto.vreg` | 减数向量 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -95,7 +95,7 @@ for (int i = 0; i < N; i++) ```mlir %result = pto.vsubrelu %lhs, %rhs, %mask - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/unary-vector-ops/vabs.md b/docs/isa/vector/ops/unary-vector-ops/vabs.md index 039a2bcda..9af415325 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vabs.md +++ b/docs/isa/vector/ops/unary-vector-ops/vabs.md @@ -21,7 +21,7 @@ vabs %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `i8-i32, f16, f32`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `i8-i32, f16, f32`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vabs_zh.md b/docs/isa/vector/ops/unary-vector-ops/vabs_zh.md index 374beb021..4ef781a99 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vabs_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vabs_zh.md @@ -21,7 +21,7 @@ vabs %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`i8-i32`、`f16`、`f32`。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:`i8-i32`、`f16`、`f32`。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vbcnt.md b/docs/isa/vector/ops/unary-vector-ops/vbcnt.md index 48e794bb4..dcb621ec4 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vbcnt.md +++ b/docs/isa/vector/ops/unary-vector-ops/vbcnt.md @@ -21,7 +21,7 @@ vbcnt %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `all integer types`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `all integer types`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vbcnt_zh.md b/docs/isa/vector/ops/unary-vector-ops/vbcnt_zh.md index 1f8997108..fb01fec93 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vbcnt_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vbcnt_zh.md @@ -21,7 +21,7 @@ vbcnt %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:全部整数类型。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:全部整数类型。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vcls.md b/docs/isa/vector/ops/unary-vector-ops/vcls.md index 422a38880..941859f86 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vcls.md +++ b/docs/isa/vector/ops/unary-vector-ops/vcls.md @@ -21,7 +21,7 @@ vcls %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `all integer types`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `all integer types`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vcls_zh.md b/docs/isa/vector/ops/unary-vector-ops/vcls_zh.md index 00a9015c2..e5cf08908 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vcls_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vcls_zh.md @@ -21,7 +21,7 @@ vcls %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:全部整数类型。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:全部整数类型。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vexp.md b/docs/isa/vector/ops/unary-vector-ops/vexp.md index 1e203a86a..01ef2c58f 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vexp.md +++ b/docs/isa/vector/ops/unary-vector-ops/vexp.md @@ -23,19 +23,19 @@ Inactive lanes (`mask[i] == 0`): `dst[i]` is **unmodified** (preserves the prior ### PTO Assembly Form ```mlir -%result = pto.vexp %input, %mask : (!pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vexp %input, %mask : (!pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 1 (SSA) ```mlir -%result = pto.vexp %input, %mask : (!pto.vreg, !pto.mask) -> !pto.vreg +%result = pto.vexp %input, %mask : (!pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2 (DPS) ```mlir -pto.vexp ins(%input, %mask : !pto.vreg, !pto.mask) +pto.vexp ins(%input, %mask : !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -70,7 +70,7 @@ Where `N` is the vector 
lane count determined by the element type: | Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; holds the input values | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 are active | ## Expected Outputs @@ -134,7 +134,7 @@ A5 (pipelined): 16 + 15×2 = 46 cycles A2/A3: 13 + 26 + 32 + 270 = 341 cycles ``` -**Performance note**: For numerically stable softmax, prefer `vexpdiff` (fused exp-diff) over `vexp` + `vsub` since it avoids a separate max-subtraction kernel and has better combined throughput. +**Performance note**: For numerically stable softmax, prefer `vexpdif` (fused exp-diff) over `vexp` + `vsub` since it avoids a separate max-subtraction kernel and has better combined throughput. --- @@ -145,8 +145,8 @@ A2/A3: 13 + 26 + 32 + 270 = 341 cycles ```mlir // Softmax: exp(x - max) for numerical stability %max_bc = pto.vlds %ub_max[%c0] {dist = "BRC"} : !pto.ptr -> !pto.vreg<64xf32> -%sub = pto.vsub %x, %max_bc, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> -%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%sub = pto.vsub %x, %max_bc, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ### C++ usage diff --git a/docs/isa/vector/ops/unary-vector-ops/vexp_zh.md b/docs/isa/vector/ops/unary-vector-ops/vexp_zh.md index 394d1d85d..2f229b5d6 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vexp_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vexp_zh.md @@ -21,19 +21,19 @@ $$ \mathrm{dst}_i = \exp(\mathrm{src}_i) $$ ### PTO 汇编形式 ```text -%result = vexp %input, %mask : !pto.vreg, !pto.mask +%result = vexp %input, %mask : !pto.vreg, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%result = pto.vexp %input, %mask : (!pto.vreg, !pto.mask) -> 
!pto.vreg +%result = pto.vexp %input, %mask : (!pto.vreg, !pto.mask) -> !pto.vreg ``` ### AS Level 2(DPS) ```mlir -pto.vexp ins(%input, %mask : !pto.vreg, !pto.mask) +pto.vexp ins(%input, %mask : !pto.vreg, !pto.mask) outs(%result : !pto.vreg) ``` @@ -69,7 +69,7 @@ for (int i = 0; i < N; i++) | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -132,7 +132,7 @@ A5(流水重叠):16 + 15×2 = 46 周期 A2/A3:13 + 26 + 32 + 270 = 341 周期 ``` -**性能说明:** 对数值稳定的 softmax,更推荐使用 `vexpdiff` 融合形式,而不是先 `vsub` 再 `vexp`。 +**性能说明:** 对数值稳定的 softmax,更推荐使用 `vexpdif` 融合形式,而不是先 `vsub` 再 `vexp`。 ## 示例 @@ -141,8 +141,8 @@ A2/A3:13 + 26 + 32 + 270 = 341 周期 ```mlir %max_bc = pto.vlds %ub_max[%c0] {dist = "BRC"} : !pto.ptr -> !pto.vreg<64xf32> %sub = pto.vsub %x, %max_bc, %mask - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> -%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ### C++ 用法 diff --git a/docs/isa/vector/ops/unary-vector-ops/vln.md b/docs/isa/vector/ops/unary-vector-ops/vln.md index 27189d144..a57e431b0 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vln.md +++ b/docs/isa/vector/ops/unary-vector-ops/vln.md @@ -21,7 +21,7 @@ vln %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `f16, f32`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `f16, f32`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs @@ -95,7 +95,7 @@ for (int i = 0; i < N; i++) ### Numerical stability note -For softmax denominator (`log(sum(exp(x - max)))`), use `vexpdiff` fused operation rather than separate `vsub` + `vln` for better combined throughput. +For softmax denominator (`log(sum(exp(x - max)))`), use `vexpdif` fused operation rather than separate `vsub` + `vln` for better combined throughput. --- diff --git a/docs/isa/vector/ops/unary-vector-ops/vln_zh.md b/docs/isa/vector/ops/unary-vector-ops/vln_zh.md index 4b067bca3..9a955460b 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vln_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vln_zh.md @@ -21,7 +21,7 @@ vln %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -92,7 +92,7 @@ for (int i = 0; i < N; i++) ### 数值稳定性说明 -对于 softmax 分母这类 `log(sum(exp(x - max)))` 模式,通常优先通过 `vexpdiff` 等融合路径避免不必要的数值风险。 +对于 softmax 分母这类 `log(sum(exp(x - max)))` 模式,通常优先通过 `vexpdif` 等融合路径避免不必要的数值风险。 ## 相关页面 diff --git a/docs/isa/vector/ops/unary-vector-ops/vmov.md b/docs/isa/vector/ops/unary-vector-ops/vmov.md index 969ff5d9c..8eacf8e20 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vmov.md +++ b/docs/isa/vector/ops/unary-vector-ops/vmov.md @@ -21,7 +21,7 @@ vmov %result, %input, %mask ### AS 
Level 1 (SSA) ```mlir -%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -29,7 +29,7 @@ vmov %result, %input, %mask | Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs @@ -72,14 +72,14 @@ for (int i = 0; i < N; i++) ```mlir // Softmax numerator: exp(x - max) -%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> -%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Reciprocal for division -%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // ReLU activation -%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/unary-vector-ops/vmov_zh.md b/docs/isa/vector/ops/unary-vector-ops/vmov_zh.md index 46c31ced6..e67d4f5ff 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vmov_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vmov_zh.md @@ -21,7 +21,7 @@ vmov %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -29,7 +29,7 @@ vmov 
%result, %input, %mask | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 @@ -71,7 +71,7 @@ for (int i = 0; i < N; i++) ### MLIR 用法 ```mlir -%copy = pto.vmov %src, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%copy = pto.vmov %src, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/unary-vector-ops/vneg.md b/docs/isa/vector/ops/unary-vector-ops/vneg.md index ff2a6d155..ef95c2f01 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vneg.md +++ b/docs/isa/vector/ops/unary-vector-ops/vneg.md @@ -21,7 +21,7 @@ vneg %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vneg %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vneg %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `i8-i32, f16, f32`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `i8-i32, f16, f32`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vneg_zh.md b/docs/isa/vector/ops/unary-vector-ops/vneg_zh.md index a459f3c9e..cfa05d6de 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vneg_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vneg_zh.md @@ -21,7 +21,7 @@ vneg %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vneg %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vneg %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`i8-i32`、`f16`、`f32`。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:`i8-i32`、`f16`、`f32`。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vnot.md b/docs/isa/vector/ops/unary-vector-ops/vnot.md index 986db93f3..0846b6ad5 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vnot.md +++ b/docs/isa/vector/ops/unary-vector-ops/vnot.md @@ -21,7 +21,7 @@ vnot %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `all integer types`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `all integer types`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vnot_zh.md b/docs/isa/vector/ops/unary-vector-ops/vnot_zh.md index 1539a633a..bfc08315c 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vnot_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vnot_zh.md @@ -21,7 +21,7 @@ vnot %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:全部整数类型。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:全部整数类型。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vrec.md b/docs/isa/vector/ops/unary-vector-ops/vrec.md index 41b650f7b..2f8fbe6da 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vrec.md +++ b/docs/isa/vector/ops/unary-vector-ops/vrec.md @@ -21,7 +21,7 @@ vrec %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `f16, f32`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `f16, f32`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vrec_zh.md b/docs/isa/vector/ops/unary-vector-ops/vrec_zh.md index 0376329cb..f973ebdec 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vrec_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vrec_zh.md @@ -21,7 +21,7 @@ vrec %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vrelu.md b/docs/isa/vector/ops/unary-vector-ops/vrelu.md index c85da502b..90a503f52 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vrelu.md +++ b/docs/isa/vector/ops/unary-vector-ops/vrelu.md @@ -21,7 +21,7 @@ vrelu %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `f16, f32`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `f16, f32`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vrelu_zh.md b/docs/isa/vector/ops/unary-vector-ops/vrelu_zh.md index 646baef17..403d58bb4 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vrelu_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vrelu_zh.md @@ -21,7 +21,7 @@ vrelu %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vrsqrt.md b/docs/isa/vector/ops/unary-vector-ops/vrsqrt.md index 1e34d25d1..eec48b735 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vrsqrt.md +++ b/docs/isa/vector/ops/unary-vector-ops/vrsqrt.md @@ -21,7 +21,7 @@ vrsqrt %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `f16, f32`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `f16, f32`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vrsqrt_zh.md b/docs/isa/vector/ops/unary-vector-ops/vrsqrt_zh.md index 29cbf6d26..df8605bcc 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vrsqrt_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vrsqrt_zh.md @@ -21,7 +21,7 @@ vrsqrt %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/unary-vector-ops/vsqrt.md b/docs/isa/vector/ops/unary-vector-ops/vsqrt.md index 9eda9fc0d..313962c26 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vsqrt.md +++ b/docs/isa/vector/ops/unary-vector-ops/vsqrt.md @@ -21,7 +21,7 @@ vsqrt %result, %input, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` Documented A5 types or forms: `f16, f32`. @@ -31,7 +31,7 @@ Documented A5 types or forms: `f16, f32`. 
| Operand | Type | Description | |---------|------|-------------| | `%input` | `!pto.vreg` | Source vector register; read at each active lane `i` | -| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | +| `%mask` | `!pto.mask` | Predicate mask; lanes where mask bit is 1 (true) are active | ## Expected Outputs diff --git a/docs/isa/vector/ops/unary-vector-ops/vsqrt_zh.md b/docs/isa/vector/ops/unary-vector-ops/vsqrt_zh.md index 71c914709..01e2831c5 100644 --- a/docs/isa/vector/ops/unary-vector-ops/vsqrt_zh.md +++ b/docs/isa/vector/ops/unary-vector-ops/vsqrt_zh.md @@ -21,7 +21,7 @@ vsqrt %result, %input, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg ``` A5 当前文档化支持的类型:`f16`、`f32`。 @@ -31,7 +31,7 @@ A5 当前文档化支持的类型:`f16`、`f32`。 | 操作数 | 类型 | 说明 | |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器;在每个活跃 lane 上读取 | -| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | +| `%mask` | `!pto.mask` | 谓词掩码;掩码位为 1 的 lane 为活跃 lane | ## 预期输出 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vaddcs.md b/docs/isa/vector/ops/vec-scalar-ops/vaddcs.md index 3cbe7542c..55316511b 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vaddcs.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vaddcs.md @@ -15,13 +15,13 @@ For each active lane `i`, `sum = lhs[i] + rhs[i] + carry_in[i]`, `result[i] = lo ### PTO Assembly Form ```text -vaddcs %dst, %carry_out, %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.mask +vaddcs %dst, %carry_out, %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.mask ``` ### AS Level 1 (SSA) ```mlir -%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask +%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask ``` ## Inputs @@ -30,15 +30,15 @@ vaddcs %dst, %carry_out, %lhs, %rhs, 
%carry_in, %mask : !pto.vreg, !pto.mas | --- | --- | --- | | %lhs | `!pto.vreg` | Left-hand value vector | | %rhs | `!pto.vreg` | Right-hand value vector | -| %carry_in | `!pto.mask` | Incoming carry bit per lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %carry_in | `!pto.mask` | Incoming carry bit per lane | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs | Result | Type | Description | | --- | --- | --- | | %result | `!pto.vreg` | Lane-wise arithmetic result on the active lanes | -| %carry | `!pto.mask` | Carry-out bit produced for each active lane | +| %carry | `!pto.mask` | Carry-out bit produced for each active lane | ## Side Effects @@ -75,7 +75,7 @@ for (int i = 0; i < N; i++) { ``` ```mlir -%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask +%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vaddcs_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vaddcs_zh.md index 761a43180..22c97d791 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vaddcs_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vaddcs_zh.md @@ -23,13 +23,13 @@ carry[i] = carry_out(sum) ### PTO 汇编形式 ```text -vaddcs %dst, %carry_out, %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.mask +vaddcs %dst, %carry_out, %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.mask ``` ### AS Level 1(SSA) ```mlir -%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask +%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask ``` ## 输入 @@ -38,15 +38,15 @@ vaddcs %dst, %carry_out, %lhs, %rhs, %carry_in, 
%mask : !pto.vreg, !pto.mas |--------|------|------| | `%lhs` | `!pto.vreg` | 左值向量 | | `%rhs` | `!pto.vreg` | 右值向量 | -| `%carry_in` | `!pto.mask` | 每个 lane 的输入进位 bit | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%carry_in` | `!pto.mask` | 每个 lane 的输入进位 bit | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 | 结果 | 类型 | 说明 | |------|------|------| | `%result` | `!pto.vreg` | 活跃 lane 上得到算术结果 | -| `%carry` | `!pto.mask` | 每个活跃 lane 产生的 carry-out bit | +| `%carry` | `!pto.mask` | 每个活跃 lane 产生的 carry-out bit | ## 副作用 @@ -84,7 +84,7 @@ for (int i = 0; i < N; i++) { ```mlir %result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask - : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask + : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask ``` ## 性能 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vadds.md b/docs/isa/vector/ops/vec-scalar-ops/vadds.md index 54d800d5e..9f48e3078 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vadds.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vadds.md @@ -21,7 +21,7 @@ vadds %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vadds %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar operand broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vadds %values, %bias, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vadds %values, %bias, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> 
!pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vadds_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vadds_zh.md index 0dbf5b89e..dfc8e1def 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vadds_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vadds_zh.md @@ -21,7 +21,7 @@ vadds %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vadds %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 广播到每个活跃 lane 的标量操作数 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vadds %values, %bias, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vadds %values, %bias, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vands.md b/docs/isa/vector/ops/vec-scalar-ops/vands.md index d61b08e68..e7cd97030 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vands.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vands.md @@ -21,7 +21,7 @@ vands %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vands %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar bit mask broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 
participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vands %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vands %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vands_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vands_zh.md index a064816f2..cd8146a39 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vands_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vands_zh.md @@ -21,7 +21,7 @@ vands %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vands %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 广播到每个活跃 lane 的标量位掩码 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vands %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vands %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vlrelu.md b/docs/isa/vector/ops/vec-scalar-ops/vlrelu.md index ff608b29e..710a7f4c5 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vlrelu.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vlrelu.md @@ -21,7 +21,7 @@ vlrelu %dst, %src, %slope, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vlrelu %dst, %src, %slope, %mask : 
!pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source activation vector | | %scalar | `T` | Negative-path slope broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vlrelu %activations, %alpha, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vlrelu %activations, %alpha, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vlrelu_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vlrelu_zh.md index b523666e7..bbfe17efa 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vlrelu_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vlrelu_zh.md @@ -27,7 +27,7 @@ vlrelu %dst, %src, %slope, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -36,7 +36,7 @@ vlrelu %dst, %src, %slope, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源激活向量 | | `%scalar` | `T` | 广播到每个活跃 lane 的负半轴斜率 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -76,7 +76,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vlrelu %activations, %alpha, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vlrelu %activations, %alpha, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vmaxs.md b/docs/isa/vector/ops/vec-scalar-ops/vmaxs.md index 949964662..fda906274 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vmaxs.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vmaxs.md @@ -21,7 +21,7 
@@ vmaxs %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vmaxs %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar operand compared against each active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vmaxs %values, %threshold, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vmaxs %values, %threshold, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vmaxs_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vmaxs_zh.md index 91629b502..ff725c9a2 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vmaxs_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vmaxs_zh.md @@ -21,7 +21,7 @@ vmaxs %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vmaxs %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 与每个活跃 lane 比较的标量值 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与比较 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与比较 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vmaxs %values, %threshold, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vmaxs %values, %threshold, %mask : !pto.vreg<64xf32>, 
f32, !pto.mask -> !pto.vreg<64xf32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vmins.md b/docs/isa/vector/ops/vec-scalar-ops/vmins.md index 32fb13df5..9809d6b28 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vmins.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vmins.md @@ -21,7 +21,7 @@ vmins %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vmins %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar operand compared against each active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vmins %values, %limit, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vmins %values, %limit, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vmins_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vmins_zh.md index a4849442a..aae88a72e 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vmins_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vmins_zh.md @@ -21,7 +21,7 @@ vmins %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vmins %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 与每个活跃 lane 比较的标量值 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与比较 | +| 
`%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与比较 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vmins %values, %limit, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vmins %values, %limit, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vmuls.md b/docs/isa/vector/ops/vec-scalar-ops/vmuls.md index 2dab39190..47e3b9482 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vmuls.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vmuls.md @@ -21,7 +21,7 @@ vmuls %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vmuls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vmuls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vmuls %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar multiplier broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vmuls %values, %scale, %mask : !pto.vreg<64xf16>, f16, !pto.mask -> !pto.vreg<64xf16> +%result = pto.vmuls %values, %scale, %mask : !pto.vreg<64xf16>, f16, !pto.mask -> !pto.vreg<64xf16> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vmuls_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vmuls_zh.md index 3676e2c2e..1ff28aa1d 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vmuls_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vmuls_zh.md @@ -21,7 +21,7 @@ vmuls %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vmuls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vmuls %input, %scalar, 
%mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vmuls %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 广播到每个活跃 lane 的标量乘数 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vmuls %values, %scale, %mask : !pto.vreg<64xf16>, f16, !pto.mask -> !pto.vreg<64xf16> +%result = pto.vmuls %values, %scale, %mask : !pto.vreg<64xf16>, f16, !pto.mask -> !pto.vreg<64xf16> ``` ## 性能 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vors.md b/docs/isa/vector/ops/vec-scalar-ops/vors.md index ce461b36d..e2eefcfb0 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vors.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vors.md @@ -21,7 +21,7 @@ vors %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vors %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar bit mask broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vors_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vors_zh.md index 02a0833ab..a4d7c51f4 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vors_zh.md 
+++ b/docs/isa/vector/ops/vec-scalar-ops/vors_zh.md @@ -21,7 +21,7 @@ vors %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vors %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 广播到每个活跃 lane 的标量位掩码 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vshls.md b/docs/isa/vector/ops/vec-scalar-ops/vshls.md index 24b853d2d..358fe18d2 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vshls.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vshls.md @@ -21,7 +21,7 @@ vshls %dst, %src, %shift, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vshls %dst, %src, %shift, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Uniform shift amount broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vshls %values, %shift, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vshls %values, %shift, %mask : 
!pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vshls_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vshls_zh.md index d99c6d8f1..2827e8271 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vshls_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vshls_zh.md @@ -21,7 +21,7 @@ vshls %dst, %src, %shift, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vshls %dst, %src, %shift, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 广播到每个活跃 lane 的统一位移量 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vshls %values, %shift, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vshls %values, %shift, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## 性能 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vshrs.md b/docs/isa/vector/ops/vec-scalar-ops/vshrs.md index 9167fc370..d80754244 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vshrs.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vshrs.md @@ -21,7 +21,7 @@ vshrs %dst, %src, %shift, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vshrs %dst, %src, %shift, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Uniform shift amount broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | 
Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vshrs %values, %shift, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vshrs %values, %shift, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vshrs_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vshrs_zh.md index 0bc936128..0767315eb 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vshrs_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vshrs_zh.md @@ -21,7 +21,7 @@ vshrs %dst, %src, %shift, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vshrs %dst, %src, %shift, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 广播到每个活跃 lane 的统一位移量 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vshrs %values, %shift, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vshrs %values, %shift, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vsubcs.md b/docs/isa/vector/ops/vec-scalar-ops/vsubcs.md index 9f51b179d..89ba65901 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vsubcs.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vsubcs.md @@ -15,13 +15,13 @@ For each active lane `i`, `diff = lhs[i] - rhs[i] - borrow_in[i]`, `result[i] = ### PTO Assembly Form ```text -vsubcs %dst, %borrow_out, %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.mask +vsubcs %dst, %borrow_out, %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.mask ``` ### AS 
Level 1 (SSA) ```mlir -%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask +%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask ``` ## Inputs @@ -30,15 +30,15 @@ vsubcs %dst, %borrow_out, %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.m | --- | --- | --- | | %lhs | `!pto.vreg` | Minuend vector | | %rhs | `!pto.vreg` | Subtrahend vector | -| %borrow_in | `!pto.mask` | Incoming borrow bit per lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %borrow_in | `!pto.mask` | Incoming borrow bit per lane | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs | Result | Type | Description | | --- | --- | --- | | %result | `!pto.vreg` | Lane-wise arithmetic result on the active lanes | -| %borrow | `!pto.mask` | Borrow-out bit produced for each active lane | +| %borrow | `!pto.mask` | Borrow-out bit produced for each active lane | ## Side Effects @@ -75,7 +75,7 @@ for (int i = 0; i < N; i++) { ``` ```mlir -%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask +%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vsubcs_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vsubcs_zh.md index d64c1607e..fd6493ff4 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vsubcs_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vsubcs_zh.md @@ -23,13 +23,13 @@ borrow[i] = borrow_out(diff) ### PTO 汇编形式 ```text -vsubcs %dst, %borrow_out, %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.mask +vsubcs %dst, %borrow_out, %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.mask ``` ### AS 
Level 1(SSA) ```mlir -%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask +%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask ``` ## 输入 @@ -38,15 +38,15 @@ vsubcs %dst, %borrow_out, %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.m |--------|------|------| | `%lhs` | `!pto.vreg` | 被减数向量 | | `%rhs` | `!pto.vreg` | 减数向量 | -| `%borrow_in` | `!pto.mask` | 每个 lane 的输入借位 bit | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%borrow_in` | `!pto.mask` | 每个 lane 的输入借位 bit | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 | 结果 | 类型 | 说明 | |------|------|------| | `%result` | `!pto.vreg` | 活跃 lane 上得到算术结果 | -| `%borrow` | `!pto.mask` | 每个活跃 lane 产生的 borrow-out bit | +| `%borrow` | `!pto.mask` | 每个活跃 lane 产生的 borrow-out bit | ## 副作用 @@ -84,7 +84,7 @@ for (int i = 0; i < N; i++) { ```mlir %result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask - : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask + : !pto.vreg<64xi32>, !pto.vreg<64xi32>, !pto.mask, !pto.mask -> !pto.vreg<64xi32>, !pto.mask ``` ## 性能 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vsubs.md b/docs/isa/vector/ops/vec-scalar-ops/vsubs.md index 2f7bfcdfe..a63411434 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vsubs.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vsubs.md @@ -21,7 +21,7 @@ vsubs %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vsubs %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar operand subtracted from every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes 
with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vsubs %values, %delta, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsubs %values, %delta, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vsubs_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vsubs_zh.md index 6543a1121..ea0d7f768 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vsubs_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vsubs_zh.md @@ -21,7 +21,7 @@ vsubs %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vsubs %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 从每个活跃 lane 上减去的标量值 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vsubs %values, %delta, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%result = pto.vsubs %values, %delta, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> ``` ## 性能 diff --git a/docs/isa/vector/ops/vec-scalar-ops/vxors.md b/docs/isa/vector/ops/vec-scalar-ops/vxors.md index d8ec04dcf..794e1f4dd 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vxors.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vxors.md @@ -21,7 +21,7 @@ vxors %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1 (SSA) ```mlir -%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> 
!pto.vreg ``` ## Inputs @@ -30,7 +30,7 @@ vxors %dst, %src, %scalar, %mask : !pto.vreg, T | --- | --- | --- | | %input | `!pto.vreg` | Source vector register | | %scalar | `T` | Scalar bit mask broadcast to every active lane | -| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | +| %mask | `!pto.mask` | Predicate mask; only lanes with mask bit 1 participate | ## Expected Outputs @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vxors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vxors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vec-scalar-ops/vxors_zh.md b/docs/isa/vector/ops/vec-scalar-ops/vxors_zh.md index 240df0f88..b62a6b41d 100644 --- a/docs/isa/vector/ops/vec-scalar-ops/vxors_zh.md +++ b/docs/isa/vector/ops/vec-scalar-ops/vxors_zh.md @@ -21,7 +21,7 @@ vxors %dst, %src, %scalar, %mask : !pto.vreg, T ### AS Level 1(SSA) ```mlir -%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg +%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg ``` ## 输入 @@ -30,7 +30,7 @@ vxors %dst, %src, %scalar, %mask : !pto.vreg, T |--------|------|------| | `%input` | `!pto.vreg` | 源向量寄存器 | | `%scalar` | `T` | 广播到每个活跃 lane 的标量位掩码 | -| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | +| `%mask` | `!pto.mask` | 谓词掩码;只有掩码位为 1 的 lane 参与运算 | ## 预期输出 @@ -70,7 +70,7 @@ for (int i = 0; i < N; i++) ``` ```mlir -%result = pto.vxors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%result = pto.vxors %values, %bitmask, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` ## 相关页面 diff --git a/docs/isa/vector/ops/vector-load-store/vgather2-bc.md b/docs/isa/vector/ops/vector-load-store/vgather2-bc.md index 21f6e1a2c..4adc22305 100644 --- 
a/docs/isa/vector/ops/vector-load-store/vgather2-bc.md +++ b/docs/isa/vector/ops/vector-load-store/vgather2-bc.md @@ -21,7 +21,7 @@ vgather2_bc %result, %source, %offsets, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg ``` ## Inputs @@ -74,7 +74,7 @@ If software scheduling or performance modeling depends on the exact cost of `pto ## Examples ```mlir -%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vector-load-store/vgather2-bc_zh.md b/docs/isa/vector/ops/vector-load-store/vgather2-bc_zh.md index 7b4960d9d..8d2026435 100644 --- a/docs/isa/vector/ops/vector-load-store/vgather2-bc_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vgather2-bc_zh.md @@ -21,7 +21,7 @@ vgather2_bc %result, %source, %offsets, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg ``` ## 输入 @@ -58,7 +58,7 @@ vgather2_bc %result, %source, %offsets, %mask ## 示例 ```mlir -%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg +%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg ``` ## 性能 diff --git a/docs/isa/vector/ops/vector-load-store/vldx2.md b/docs/isa/vector/ops/vector-load-store/vldsx2.md similarity index 65% rename from docs/isa/vector/ops/vector-load-store/vldx2.md rename to docs/isa/vector/ops/vector-load-store/vldsx2.md index 288b9b42a..bbb94ea04 100644 --- a/docs/isa/vector/ops/vector-load-store/vldx2.md +++ 
b/docs/isa/vector/ops/vector-load-store/vldsx2.md @@ -1,6 +1,6 @@ -# pto.vldx2 +# pto.vldsx2 -`pto.vldx2` is part of the [Vector Load Store](../../vector-load-store.md) instruction set. +`pto.vldsx2` is part of the [Vector Load Store](../../vector-load-store.md) instruction set. ## Summary @@ -8,20 +8,20 @@ Dual load with deinterleave (AoS → SoA conversion). ## Mechanism -`pto.vldx2` is part of the PTO vector memory/data-movement instruction set. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering. +`pto.vldsx2` is part of the PTO vector memory/data-movement instruction set. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering. ## Syntax ### PTO Assembly Form ```text -vldx2 %low, %high, %source[%offset], "DIST" +vldsx2 %low, %high, %source[%offset], "DIST" ``` ### AS Level 1 (SSA) ```mlir -%low, %high = pto.vldx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg +%low, %high = pto.vldsx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg ``` ## Inputs @@ -60,15 +60,14 @@ This operation reads UB-visible storage and returns SSA results. It does not by ### Timing Disclosure -The current public VPTO timing material for PTO micro instructions remains limited. -For `pto.vldx2`, those public sources describe the instruction semantics, operand legality, and pipeline placement, but they do **not** publish a numeric latency or steady-state throughput. +PTO-Gym v0.6 SPEC publishes a uniform 9-cycle latency for all `pto.vldsx2` distribution families on the A5 profile. 
-| Metric | Status | Source Basis | -|--------|--------|--------------| -| A5 latency | Not publicly published | Current public VPTO timing material | +| Metric | Value | Source Basis | +|--------|-------|--------------| +| A5 latency (`BDINTLV`, `DINTLV_B8`, `DINTLV_B16`, `DINTLV_B32`) | **9** cycles | PTO-Gym v0.6 SPEC, §III Vector Load/Store | | Steady-state throughput | Not publicly published | Current public VPTO timing material | -If software scheduling or performance modeling depends on the exact cost of `pto.vldx2`, treat that cost as target-profile-specific and measure it on the concrete backend rather than inferring a manual constant. +Other target profiles (CPU simulation, A2/A3) treat the cost as target-defined; measure on the concrete backend rather than reusing the A5 number. ## Examples @@ -81,7 +80,7 @@ for (int i = 0; i < 64; i++) { ``` ```mlir -%x, %y = pto.vldx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> +%x, %y = pto.vldsx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> ``` ## Detailed Notes @@ -98,7 +97,7 @@ for (int i = 0; i < 64; i++) { **Example — Load interleaved XY pairs into separate X/Y vectors:** ```mlir -%x, %y = pto.vldx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> +%x, %y = pto.vldsx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vector-load-store/vldx2_zh.md b/docs/isa/vector/ops/vector-load-store/vldsx2_zh.md similarity index 57% rename from docs/isa/vector/ops/vector-load-store/vldx2_zh.md rename to docs/isa/vector/ops/vector-load-store/vldsx2_zh.md index 3b3db3baf..988106e7c 100644 --- a/docs/isa/vector/ops/vector-load-store/vldx2_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vldsx2_zh.md @@ -1,6 +1,6 @@ -# pto.vldx2 +# pto.vldsx2 -`pto.vldx2` 属于[向量加载与存储](../../vector-load-store_zh.md)指令集。 
+`pto.vldsx2` 属于[向量加载与存储](../../vector-load-store_zh.md)指令集。 ## 概述 @@ -8,20 +8,20 @@ ## 机制 -`pto.vldx2` 属于 PTO 的向量内存 / 数据搬运指令。它从 UB 中读取交错布局的数据,并一次返回两路结果向量。关键点不只是“读两次”,而是这两路结果构成一个有顺序的语义对。 +`pto.vldsx2` 属于 PTO 的向量内存 / 数据搬运指令。它从 UB 中读取交错布局的数据,并一次返回两路结果向量。关键点不只是“读两次”,而是这两路结果构成一个有顺序的语义对。 ## 语法 ### PTO 汇编形式 ```text -vldx2 %low, %high, %source[%offset], "DIST" +vldsx2 %low, %high, %source[%offset], "DIST" ``` ### AS Level 1(SSA) ```mlir -%low, %high = pto.vldx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg +%low, %high = pto.vldsx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg ``` ## 输入 @@ -48,13 +48,24 @@ vldx2 %low, %high, %source[%offset], "DIST" !!! danger "异常与非法情形" - 使用超出 UB 可见空间的地址,或违反所选分布模式的地址 / 对齐契约,都是非法的。 - - 约束部分列出的额外非法情形,同样属于 `pto.vldx2` 的契约。 + - 约束部分列出的额外非法情形,同样属于 `pto.vldsx2` 的契约。 ## 目标 Profile 限制 ??? info "目标 Profile 限制" - A5 是当前手册里最细的具体 profile;CPU 模拟器和 A2/A3 类目标可以在保留可见 PTO 契约的前提下做等效模拟。 +## 性能 + +PTO-Gym v0.6 SPEC 为 A5 profile 上 `pto.vldsx2` 的所有分布族公布了统一的 9 周期时延。 + +| 指标 | 值 | 来源 | +|------|------|------| +| A5 时延(`BDINTLV`、`DINTLV_B8`、`DINTLV_B16`、`DINTLV_B32`) | **9** 周期 | PTO-Gym v0.6 SPEC §III 向量加载/存储 | +| 稳态吞吐 | 未公开 | 当前公开 VPTO 时序材料 | + +CPU 模拟和 A2/A3 类目标按 target-defined 处理;如需精确成本,请在具体 backend 实测,不要直接套用 A5 数值。 + ## 示例 ```c @@ -66,7 +77,7 @@ for (int i = 0; i < 64; i++) { ``` ```mlir -%x, %y = pto.vldx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> +%x, %y = pto.vldsx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> ``` ## 详细说明 diff --git a/docs/isa/vector/ops/vector-load-store/vldus.md b/docs/isa/vector/ops/vector-load-store/vldus.md index de1e75c02..92f365b0e 100644 --- a/docs/isa/vector/ops/vector-load-store/vldus.md +++ b/docs/isa/vector/ops/vector-load-store/vldus.md @@ -92,4 +92,4 @@ When documentation or scheduling depends on the throughput claim, treat it as a - Instruction set overview: [Vector Load 
Store](../../vector-load-store.md) - Previous op in instruction set: [pto.vldas](./vldas.md) -- Next op in instruction set: [pto.vldx2](./vldx2.md) +- Next op in instruction set: [pto.vldsx2](./vldsx2.md) diff --git a/docs/isa/vector/ops/vector-load-store/vldus_zh.md b/docs/isa/vector/ops/vector-load-store/vldus_zh.md index 2cc25b342..a4debc925 100644 --- a/docs/isa/vector/ops/vector-load-store/vldus_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vldus_zh.md @@ -92,4 +92,4 @@ vldus %result, %align_out, %base_out, %source, %align - 指令集总览:[向量加载与存储](../../vector-load-store_zh.md) - 上一条指令:[pto.vldas](./vldas_zh.md) -- 下一条指令:[pto.vldx2](./vldx2_zh.md) +- 下一条指令:[pto.vldsx2](./vldsx2_zh.md) diff --git a/docs/isa/vector/ops/vector-load-store/vsld.md b/docs/isa/vector/ops/vector-load-store/vsld.md index 1239befa4..fd678b0d2 100644 --- a/docs/isa/vector/ops/vector-load-store/vsld.md +++ b/docs/isa/vector/ops/vector-load-store/vsld.md @@ -82,5 +82,5 @@ If software scheduling or performance modeling depends on the exact cost of `pto ## Related Ops / Instruction Set Links - Instruction set overview: [Vector Load Store](../../vector-load-store.md) -- Previous op in instruction set: [pto.vldx2](./vldx2.md) +- Previous op in instruction set: [pto.vldsx2](./vldsx2.md) - Next op in instruction set: [pto.vsldb](./vsldb.md) diff --git a/docs/isa/vector/ops/vector-load-store/vsld_zh.md b/docs/isa/vector/ops/vector-load-store/vsld_zh.md index 9152b1dd3..ac04a2472 100644 --- a/docs/isa/vector/ops/vector-load-store/vsld_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vsld_zh.md @@ -82,5 +82,5 @@ PTO 微指令页面当前使用的时序来源是 `~/visa.txt` 与最新抓取 ## 相关页面 - 指令集总览:[向量加载与存储](../../vector-load-store_zh.md) -- 上一条指令:[pto.vldx2](./vldx2_zh.md) +- 上一条指令:[pto.vldsx2](./vldsx2_zh.md) - 下一条指令:[pto.vsldb](./vsldb_zh.md) diff --git a/docs/isa/vector/ops/vector-load-store/vsldb.md b/docs/isa/vector/ops/vector-load-store/vsldb.md index 5c05fec4f..224b91778 100644 --- 
a/docs/isa/vector/ops/vector-load-store/vsldb.md +++ b/docs/isa/vector/ops/vector-load-store/vsldb.md @@ -21,7 +21,7 @@ vsldb %result, %source, %offset, %mask ### AS Level 1 (SSA) ```mlir -%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg +%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg ``` ## Inputs @@ -74,7 +74,7 @@ If software scheduling or performance modeling depends on the exact cost of `pto ## Examples ```mlir -%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg +%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vector-load-store/vsldb_zh.md b/docs/isa/vector/ops/vector-load-store/vsldb_zh.md index 594ca0a22..0915beade 100644 --- a/docs/isa/vector/ops/vector-load-store/vsldb_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vsldb_zh.md @@ -21,7 +21,7 @@ vsldb %result, %source, %offset, %mask ### AS Level 1(SSA) ```mlir -%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg +%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg ``` ## 输入 @@ -58,7 +58,7 @@ vsldb %result, %source, %offset, %mask ## 示例 ```mlir -%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg +%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg ``` ## 性能 diff --git a/docs/isa/vector/ops/vector-load-store/vsst.md b/docs/isa/vector/ops/vector-load-store/vsst.md index 0b92e99d5..58d64a86d 100644 --- a/docs/isa/vector/ops/vector-load-store/vsst.md +++ b/docs/isa/vector/ops/vector-load-store/vsst.md @@ -82,5 +82,5 @@ The instruction set overview carries the remaining shared rules for this operati ## Related Ops / Instruction Set Links - Instruction set overview: [Vector Load Store](../../vector-load-store.md) -- Previous op in instruction set: [pto.vstx2](./vstx2.md) 
+- Previous op in instruction set: [pto.vstsx2](./vstsx2.md) - Next op in instruction set: [pto.vsstb](./vsstb.md) diff --git a/docs/isa/vector/ops/vector-load-store/vsst_zh.md b/docs/isa/vector/ops/vector-load-store/vsst_zh.md index f2e326b69..c9d688b46 100644 --- a/docs/isa/vector/ops/vector-load-store/vsst_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vsst_zh.md @@ -77,5 +77,5 @@ PTO 微指令页面当前使用的时序来源是 `~/visa.txt` 与最新抓取 ## 相关页面 - 指令集总览:[向量加载与存储](../../vector-load-store_zh.md) -- 上一条指令:[pto.vstx2](./vstx2_zh.md) +- 上一条指令:[pto.vstsx2](./vstsx2_zh.md) - 下一条指令:[pto.vsstb](./vsstb_zh.md) diff --git a/docs/isa/vector/ops/vector-load-store/vsstb.md b/docs/isa/vector/ops/vector-load-store/vsstb.md index 44cc57049..68385b49f 100644 --- a/docs/isa/vector/ops/vector-load-store/vsstb.md +++ b/docs/isa/vector/ops/vector-load-store/vsstb.md @@ -21,7 +21,7 @@ vsstb %value, %dest, %offset, %mask ### AS Level 1 (SSA) ```mlir -pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask +pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask ``` ## Inputs @@ -72,7 +72,7 @@ If software scheduling or performance modeling depends on the exact cost of `pto ## Examples ```mlir -pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask +pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask ``` ## Related Ops / Instruction Set Links diff --git a/docs/isa/vector/ops/vector-load-store/vsstb_zh.md b/docs/isa/vector/ops/vector-load-store/vsstb_zh.md index bf3840329..c18d6efa7 100644 --- a/docs/isa/vector/ops/vector-load-store/vsstb_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vsstb_zh.md @@ -21,7 +21,7 @@ vsstb %value, %dest, %offset, %mask ### AS Level 1(SSA) ```mlir -pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask +pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask ``` ## 输入 @@ -59,7 +59,7 @@ pto.vsstb %value, %dest, %offset, %mask : 
!pto.vreg, !pto.ptr, i32, ## 示例 ```mlir -pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask +pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask ``` ## 性能 diff --git a/docs/isa/vector/ops/vector-load-store/vsts.md b/docs/isa/vector/ops/vector-load-store/vsts.md index 83d57ec4f..a6e4302b3 100644 --- a/docs/isa/vector/ops/vector-load-store/vsts.md +++ b/docs/isa/vector/ops/vector-load-store/vsts.md @@ -21,7 +21,7 @@ vsts %value, %dest[%offset], %mask {dist = "DIST"} ### AS Level 1 (SSA) ```mlir -pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask +pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask ``` ## Inputs @@ -76,7 +76,7 @@ If software scheduling or performance modeling depends on the exact cost of `pto ## Examples ```mlir -pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask +pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask ``` ## Detailed Notes @@ -92,11 +92,11 @@ pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.p **Example — Contiguous store:** ```mlir -pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask +pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask ``` ## Related Ops / Instruction Set Links - Instruction set overview: [Vector Load Store](../../vector-load-store.md) - Previous op in instruction set: [pto.vgather2_bc](./vgather2-bc.md) -- Next op in instruction set: [pto.vstx2](./vstx2.md) +- Next op in instruction set: [pto.vstsx2](./vstsx2.md) diff --git a/docs/isa/vector/ops/vector-load-store/vsts_zh.md b/docs/isa/vector/ops/vector-load-store/vsts_zh.md index c3396b70a..fb380d729 100644 --- a/docs/isa/vector/ops/vector-load-store/vsts_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vsts_zh.md @@ -21,7 +21,7 @@ vsts %value, 
%dest[%offset], %mask {dist = "DIST"} ### AS Level 1(SSA) ```mlir -pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask +pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask ``` ## 输入 @@ -61,7 +61,7 @@ pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.pt ## 示例 ```mlir -pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask +pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask ``` ## 详细说明 @@ -93,4 +93,4 @@ PTO 微指令页面当前使用的时序来源是 `~/visa.txt` 与最新抓取 - 指令集总览:[向量加载与存储](../../vector-load-store_zh.md) - 上一条指令:[pto.vgather2_bc](./vgather2-bc_zh.md) -- 下一条指令:[pto.vstx2](./vstx2_zh.md) +- 下一条指令:[pto.vstsx2](./vstsx2_zh.md) diff --git a/docs/isa/vector/ops/vector-load-store/vstx2.md b/docs/isa/vector/ops/vector-load-store/vstsx2.md similarity index 68% rename from docs/isa/vector/ops/vector-load-store/vstx2.md rename to docs/isa/vector/ops/vector-load-store/vstsx2.md index 66dc8735a..5e2b1cb17 100644 --- a/docs/isa/vector/ops/vector-load-store/vstx2.md +++ b/docs/isa/vector/ops/vector-load-store/vstsx2.md @@ -1,6 +1,6 @@ -# pto.vstx2 +# pto.vstsx2 -`pto.vstx2` is part of the [Vector Load Store](../../vector-load-store.md) instruction set. +`pto.vstsx2` is part of the [Vector Load Store](../../vector-load-store.md) instruction set. ## Summary @@ -8,20 +8,20 @@ Dual interleaved store (SoA → AoS conversion). ## Mechanism -`pto.vstx2` is part of the PTO vector memory/data-movement instruction set. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering. +`pto.vstsx2` is part of the PTO vector memory/data-movement instruction set. It keeps UB addressing, distribution, mask behavior, and any alignment-state threading explicit in SSA form rather than hiding those details in backend-specific lowering. 
## Syntax ### PTO Assembly Form ```text -vstx2 %low, %high, %dest[%offset], "DIST", %mask +vstsx2 %low, %high, %dest[%offset], "DIST", %mask ``` ### AS Level 1 (SSA) ```mlir -pto.vstx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask +pto.vstsx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask ``` ## Inputs @@ -62,15 +62,14 @@ This operation writes UB-visible memory and/or updates streamed alignment state. ### Timing Disclosure -The current public VPTO timing material for PTO micro instructions remains limited. -For `pto.vstx2`, those public sources describe the instruction semantics, operand legality, and pipeline placement, but they do **not** publish a numeric latency or steady-state throughput. +PTO-Gym v0.6 SPEC publishes a uniform 12-cycle latency for the `INTLV` distribution family of `pto.vstsx2` on the A5 profile. -| Metric | Status | Source Basis | -|--------|--------|--------------| -| A5 latency | Not publicly published | Current public VPTO timing material | +| Metric | Value | Source Basis | +|--------|-------|--------------| +| A5 latency (`INTLV`, all element widths) | **12** cycles | PTO-Gym v0.6 SPEC, §III Vector Load/Store | | Steady-state throughput | Not publicly published | Current public VPTO timing material | -If software scheduling or performance modeling depends on the exact cost of `pto.vstx2`, treat that cost as target-profile-specific and measure it on the concrete backend rather than inferring a manual constant. +Other target profiles (CPU simulation, A2/A3) treat the cost as target-defined; measure on the concrete backend rather than reusing the A5 number. 
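The latency figures quoted in this change (9 cycles for `pto.vldsx2`, 12 cycles for the `INTLV` family of `pto.vstsx2`) apply to the A5 profile only. A minimal sketch of how a cost model might encode that restriction follows. The enum, the sentinel, and the function name are hypothetical and not part of any PTO API; the point is only that non-A5 profiles get "unpublished" instead of silently inheriting the A5 number.

```c
#include <string.h>

/* Hypothetical profile identifiers; not part of any PTO API. */
enum pto_profile { PROFILE_A5, PROFILE_A2A3, PROFILE_CPU_SIM };

/* Sentinel for "no published number, measure on the concrete backend". */
#define LATENCY_UNPUBLISHED (-1)

/* Returns the published latency in cycles for the given op name, or
 * LATENCY_UNPUBLISHED when the cost is target-defined. */
static int published_latency_cycles(enum pto_profile profile, const char *op) {
    if (profile != PROFILE_A5) {
        return LATENCY_UNPUBLISHED;   /* CPU simulation, A2/A3: target-defined */
    }
    if (strcmp(op, "pto.vldsx2") == 0) {
        return 9;    /* BDINTLV, DINTLV_B8, DINTLV_B16, DINTLV_B32 */
    }
    if (strcmp(op, "pto.vstsx2") == 0) {
        return 12;   /* INTLV family, all element widths */
    }
    return LATENCY_UNPUBLISHED;       /* op not covered by this sketch */
}
```

A scheduler built on this would treat the sentinel as a request to benchmark, since steady-state throughput is not published for either op.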
## Examples diff --git a/docs/isa/vector/ops/vector-load-store/vstx2_zh.md b/docs/isa/vector/ops/vector-load-store/vstsx2_zh.md similarity index 57% rename from docs/isa/vector/ops/vector-load-store/vstx2_zh.md rename to docs/isa/vector/ops/vector-load-store/vstsx2_zh.md index a3b3a0cf3..0ba7422f6 100644 --- a/docs/isa/vector/ops/vector-load-store/vstx2_zh.md +++ b/docs/isa/vector/ops/vector-load-store/vstsx2_zh.md @@ -1,6 +1,6 @@ -# pto.vstx2 +# pto.vstsx2 -`pto.vstx2` 属于[向量加载与存储](../../vector-load-store_zh.md)指令集。 +`pto.vstsx2` 属于[向量加载与存储](../../vector-load-store_zh.md)指令集。 ## 概述 @@ -8,20 +8,20 @@ ## 机制 -`pto.vstx2` 属于 PTO 的向量内存 / 数据搬运指令。它把两路源向量按选定交错布局写回 UB。这里的关键语义不是“写两次”,而是“两个源向量构成有顺序的交错对”。 +`pto.vstsx2` 属于 PTO 的向量内存 / 数据搬运指令。它把两路源向量按选定交错布局写回 UB。这里的关键语义不是“写两次”,而是“两个源向量构成有顺序的交错对”。 ## 语法 ### PTO 汇编形式 ```text -vstx2 %low, %high, %dest[%offset], "DIST", %mask +vstsx2 %low, %high, %dest[%offset], "DIST", %mask ``` ### AS Level 1(SSA) ```mlir -pto.vstx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask +pto.vstsx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask ``` ## 输入 @@ -50,7 +50,7 @@ pto.vstx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg !!! 
danger "异常与非法情形" - 使用超出 UB 可见空间的地址,或违反所选分布模式的地址 / 对齐契约,都是非法的。 - - 约束部分列出的额外非法情形,同样属于 `pto.vstx2` 的契约。 + - 约束部分列出的额外非法情形,同样属于 `pto.vstsx2` 的契约。 ## 目标 Profile 限制 @@ -77,15 +77,14 @@ for (int i = 0; i < 64; i++) { ### 时延与吞吐披露 -PTO 微指令页面当前使用的时序来源是 `~/visa.txt` 与最新抓取的 `PTOAS/docs/vpto-spec.md`(`feature_vpto_backend` 分支)。 -对于 `pto.vstx2`,这些公开来源说明了指令语义、操作数合法性和流水线位置,但**没有**发布数字时延或稳态吞吐。 +PTO-Gym v0.6 SPEC 为 A5 profile 上 `pto.vstsx2` 的 `INTLV` 分布族公布了统一的 12 周期时延。 -| 指标 | 状态 | 来源依据 | -|------|------|----------| -| A5 时延 | 公开来源未给出 | `visa.txt`、`PTOAS/docs/vpto-spec.md` | -| 稳态吞吐 | 公开来源未给出 | `visa.txt`、`PTOAS/docs/vpto-spec.md` | +| 指标 | 值 | 来源 | +|------|------|------| +| A5 时延(`INTLV`,所有元素宽度) | **12** 周期 | PTO-Gym v0.6 SPEC §III 向量加载/存储 | +| 稳态吞吐 | 未公开 | 当前公开 VPTO 时序材料 | -如果软件调度或性能建模依赖 `pto.vstx2` 的确切成本,必须在具体 backend 上实测,而不能从当前公开手册里推导出一个并未公布的常数。 +CPU 模拟和 A2/A3 类目标按 target-defined 处理;如需精确成本,请在具体 backend 实测,不要直接套用 A5 数值。 ## 相关页面 diff --git a/docs/isa/vector/pipeline-sync.md b/docs/isa/vector/pipeline-sync.md index 8b51c377e..47355839d 100644 --- a/docs/isa/vector/pipeline-sync.md +++ b/docs/isa/vector/pipeline-sync.md @@ -137,9 +137,9 @@ pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"] scf.for %dummy = %c0 to %c1 step %c1 { %v = pto.vlds %ub_ptr[%lane] : !pto.ptr -> !pto.vreg<64xf32> - %mask = pto.pset_b32 "PAT_ALL" : !pto.mask - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } {llvm.loop.aivector_scope} // Vector signals: "UB output is ready for MTE3" @@ -175,10 +175,10 @@ pto.get_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64 pto.get_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64 scf.for %dummy = %c0 to %c1 step %c1 { - %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + 
%mask = pto.pset_b32 "PAT_ALL" : !pto.mask %v = pto.vlds %ub_ptr[%lane] : !pto.ptr -> !pto.vreg<64xf32> - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } {llvm.loop.aivector_scope} // Vector done reading ub_ptr — release so MTE2 can reuse it in next iteration @@ -253,9 +253,9 @@ scf.for %i = %c0 to %N step %c1 { pto.wait_flag["PIPE_MTE3", "PIPE_V", "EVT_OUT_REV_{pp}"] scf.for %dummy = %c0 to %c1 step %c1 { %v = pto.vlds %ub_in[%pp][%lane] : !pto.ptr -> !pto.vreg<64xf32> - %mask = pto.pset_b32 "PAT_ALL" : !pto.mask - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } {llvm.loop.aivector_scope} // WAR: tell MTE2 "done reading buf_in[i%2]" pto.set_flag["PIPE_V", "PIPE_MTE2", "EVT_IN_REV_{pp}"] @@ -305,9 +305,9 @@ scf.for %i = %c0 to %N step %c1 { pto.get_buf %bufid_out[%pp], "PIPE_V" scf.for %dummy = %c0 to %c1 step %c1 { %v = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr -> !pto.vreg<64xf32> - %mask = pto.pset_b32 "PAT_ALL" : !pto.mask - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } {llvm.loop.aivector_scope} // Release buf[i%2] — MTE2 can reuse in iteration i+2 
(WAR resolved) pto.rls_buf %bufid_buf[%pp], "PIPE_V" diff --git a/docs/isa/vector/pipeline-sync_zh.md b/docs/isa/vector/pipeline-sync_zh.md index a08b4b9d7..c56ce6d60 100644 --- a/docs/isa/vector/pipeline-sync_zh.md +++ b/docs/isa/vector/pipeline-sync_zh.md @@ -116,9 +116,9 @@ pto.set_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"] pto.wait_flag["PIPE_MTE2", "PIPE_V", "EVENT_ID0"] scf.for %dummy = %c0 to %c1 step %c1 { %v = pto.vlds %ub_ptr[%lane] : !pto.ptr -> !pto.vreg<64xf32> - %mask = pto.pset_b32 "PAT_ALL" : !pto.mask - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } {llvm.loop.aivector_scope} pto.set_flag["PIPE_V", "PIPE_MTE3", "EVENT_ID0"] @@ -140,10 +140,10 @@ pto.rls_buf "PIPE_MTE2", %bufid_ub_ptr, %mode : i64, i64 pto.get_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64 pto.get_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64 scf.for %dummy = %c0 to %c1 step %c1 { - %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + %mask = pto.pset_b32 "PAT_ALL" : !pto.mask %v = pto.vlds %ub_ptr[%lane] : !pto.ptr -> !pto.vreg<64xf32> - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } {llvm.loop.aivector_scope} pto.rls_buf "PIPE_V", %bufid_ub_ptr, %mode : i64, i64 pto.rls_buf "PIPE_V", %bufid_ub_out, %mode : i64, i64 @@ -200,9 +200,9 @@ scf.for %i = %c0 to %N step %c1 { pto.get_buf %bufid_out[%pp], "PIPE_V" scf.for %dummy = %c0 to %c1 step %c1 { %v = pto.vlds %ub_buf[%pp][%lane] : !pto.ptr -> !pto.vreg<64xf32> - 
%mask = pto.pset_b32 "PAT_ALL" : !pto.mask - %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %mask = pto.pset_b32 "PAT_ALL" : !pto.mask + %abs = pto.vabs %v, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %abs, %ub_out[%pp][%lane], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask } {llvm.loop.aivector_scope} pto.rls_buf %bufid_buf[%pp], "PIPE_V" pto.rls_buf %bufid_out[%pp], "PIPE_V" diff --git a/docs/isa/vector/predicate-and-materialization.md b/docs/isa/vector/predicate-and-materialization.md index 5a885296e..7486d6025 100644 --- a/docs/isa/vector/predicate-and-materialization.md +++ b/docs/isa/vector/predicate-and-materialization.md @@ -10,11 +10,11 @@ Vector instructions can consume predicate masks produced by scalar load/store op ## Vector ISA Predicate Consumption -Vector instructions consume `!pto.mask` operands for conditional lane execution: +Vector instructions consume `!pto.mask` operands for conditional lane execution: ```mlir %vdst = pto.vadd %vsrc0, %vsrc1, %mask - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` The mask is produced by the scalar predicate load/store operations documented in [Predicate Load/Store](../scalar/predicate-load-store.md). See the scalar page for full semantics including distribution modes, scalar load semantics, and constraint tables. 
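The masked `pto.vadd` form above can be read lane by lane. The following is a scalar sketch of that reading, not the ISA definition: the function name is hypothetical, and the choice to leave inactive lanes of the destination untouched is an assumption, since this page defers inactive-lane behavior to the scalar predicate documentation.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 64  /* matches !pto.vreg<64xf32> in the example above */

/* Hypothetical scalar model of a masked lane-wise add.
 * Assumption: a lane whose mask bit is clear keeps its previous dst value;
 * the real inactive-lane behavior is defined by the ISA pages, not here. */
static void masked_vadd_ref(float dst[LANES], const float a[LANES],
                            const float b[LANES], const bool mask[LANES]) {
    for (size_t i = 0; i < LANES; i++) {
        if (mask[i]) {
            dst[i] = a[i] + b[i];   /* active lane: compute and write */
        }
    }
}
```

Passing an all-true mask reduces this to a plain element-wise add over the 64 lanes.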
diff --git a/docs/isa/vector/predicate-and-materialization_zh.md b/docs/isa/vector/predicate-and-materialization_zh.md index 677ded0a8..248862b91 100644 --- a/docs/isa/vector/predicate-and-materialization_zh.md +++ b/docs/isa/vector/predicate-and-materialization_zh.md @@ -10,11 +10,11 @@ Vector 指令可以消费标量 load/store 操作产生的 predicate mask。Vect ## Vector ISA Predicate 消费 -Vector 指令消费 `!pto.mask` 操作数以实现条件 lane 执行: +Vector 指令消费 `!pto.mask` 操作数以实现条件 lane 执行: ```mlir %vdst = pto.vadd %vsrc0, %vsrc1, %mask - : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> + : (!pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask) -> !pto.vreg<64xf32> ``` Mask 由标量 predicate load/store 操作产生,详见 [Predicate Load/Store(标量)](../scalar/predicate-load-store_zh.md)。 diff --git a/docs/isa/vector/reduction-ops.md b/docs/isa/vector/reduction-ops.md index 818176536..68fb4e017 100644 --- a/docs/isa/vector/reduction-ops.md +++ b/docs/isa/vector/reduction-ops.md @@ -25,10 +25,10 @@ Reduction operations execute inside a `pto.vecscope { ... }` region. Cross-lane ```mlir pto.vecscope { - %active = pto.pset_b32 "PAT_ALL" : !pto.mask + %active = pto.pset_b32 "PAT_ALL" : !pto.mask scf.for %row = %c0 to %row_count step %c1 { %vec = pto.vlds %ub_q[%row] : !pto.ptr -> !pto.vreg<64xf32> - %row_sum_raw = pto.vcadd %vec, %active : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + %row_sum_raw = pto.vcadd %vec, %active : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // row_sum_raw[0] contains the sum pto.vsts %row_sum_raw, %ub_sum[%row], %one_mask {dist = "1PT"} : ... } @@ -77,7 +77,7 @@ pto.vecscope { ### `pto.vcadd` -- **syntax:** `%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 types:** i16-i64, f16, f32 - **semantics:** Sum all elements. Result in lane 0, others zeroed. 
@@ -101,7 +101,7 @@ for (int i = 1; i < N; i++) ### `pto.vcmax` -- **syntax:** `%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 types:** i16-i32, f16, f32 - **semantics:** Find max element with argmax. Result value + index in lane 0. @@ -126,7 +126,7 @@ dst_idx[0] = idx; ### `pto.vcmin` -- **syntax:** `%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 types:** i16-i32, f16, f32 - **semantics:** Find min element with argmin. Result value + index in lane 0. @@ -159,7 +159,7 @@ VLane 4: [32..39] VLane 5: [40..47] VLane 6: [48..55] VLane 7: [56..63] ### `pto.vcgadd` -- **syntax:** `%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 types:** i16-i32, f16, f32 - **semantics:** Sum within each VLane. 8 results at indices 0, 8, 16, 24, 32, 40, 48, 56 (for f32). @@ -187,7 +187,7 @@ for (int g = 0; g < 8; g++) { ### `pto.vcgmax` -- **syntax:** `%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 types:** i16-i32, f16, f32 - **semantics:** Max within each VLane. @@ -213,7 +213,7 @@ for (int g = 0; g < 8; g++) { ### `pto.vcgmin` -- **syntax:** `%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 types:** i16-i32, f16, f32 - **semantics:** Min within each VLane. 
@@ -241,7 +241,7 @@ for (int g = 0; g < 8; g++) { ### `pto.vcpadd` -- **syntax:** `%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 types:** f16, f32 - **semantics:** Inclusive prefix sum (scan). @@ -269,18 +269,18 @@ for (int i = 1; i < N; i++) ```mlir // Softmax: find max for numerical stability -%max_vec = pto.vcmax %logits, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%max_vec = pto.vcmax %logits, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // max is in lane 0, broadcast it %max_broadcast = pto.vlds %ub_tmp[%c0] {dist = "BRC_B32"} : !pto.ptr -> !pto.vreg<64xf32> // Row-wise sum using vcgadd (for 8-row tile) -%row_sums = pto.vcgadd %tile, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%row_sums = pto.vcgadd %tile, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Results at indices 0, 8, 16, 24, 32, 40, 48, 56 // Full vector sum for normalization -%total = pto.vcadd %values, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%total = pto.vcadd %values, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // total[0] contains the sum // Prefix sum for cumulative distribution -%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%cdf = pto.vcpadd %pdf, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` diff --git a/docs/isa/vector/reduction-ops_zh.md b/docs/isa/vector/reduction-ops_zh.md index 27a3b4a5d..ab706cce1 100644 --- a/docs/isa/vector/reduction-ops_zh.md +++ b/docs/isa/vector/reduction-ops_zh.md @@ -27,11 +27,11 @@ ```mlir pto.vecscope { - %active = pto.pset_b32 "PAT_ALL" : !pto.mask + %active = pto.pset_b32 "PAT_ALL" : !pto.mask scf.for %row = %c0 to %row_count step %c1 { %vec = pto.vlds %ub_q[%row] : !pto.ptr -> !pto.vreg<64xf32> %row_sum_raw = pto.vcadd %vec, %active - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : 
!pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> pto.vsts %row_sum_raw, %ub_sum[%row], %one_mask {dist = "1PT"} : ... } } @@ -86,7 +86,7 @@ total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × i ### `pto.vcadd` -- **语法:** `%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 类型:** i16-i64、f16、f32 - **语义:** 对整个向量求和,结果写到 lane 0,其他位置清零。 @@ -107,7 +107,7 @@ for (int i = 1; i < N; i++) ### `pto.vcmax` -- **语法:** `%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 类型:** i16-i32、f16、f32 - **语义:** 在整个向量里找最大值,同时产出 argmax 信息。 @@ -123,7 +123,7 @@ dst_idx[0] = idx; ### `pto.vcmin` -- **语法:** `%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 类型:** i16-i32、f16、f32 - **语义:** 在整个向量里找最小值,同时产出 argmin 信息。 @@ -143,7 +143,7 @@ VLane 4: [32..39] VLane 5: [40..47] VLane 6: [48..55] VLane 7: [56..63] ### `pto.vcgadd` -- **语法:** `%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcgadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 类型:** i16-i32、f16、f32 - **语义:** 在每个 VLane 内求和,每组产生一个结果。 @@ -163,12 +163,12 @@ for (int g = 0; g < 8; g++) { ### `pto.vcgmax` -- **语法:** `%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcgmax %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 在每个 VLane 内求局部最大值。 ### `pto.vcgmin` -- **语法:** `%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcgmin %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 在每个 VLane 内求局部最小值。 这三条指令最容易被误解的地方是:它们不是“把寄存器拆成任意八段”的软件概念,而是严格按硬件 32 字节 VLane 分组。 @@ -179,7 +179,7 @@ for (int g = 0; g < 8; g++) { ### `pto.vcpadd` 
-- **语法:** `%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcpadd %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 类型:** f16、f32 - **语义:** 做 inclusive prefix sum。 @@ -205,7 +205,7 @@ for (int i = 1; i < N; i++) ```mlir // Softmax:先找最大值,做数值稳定化 %max_vec = pto.vcmax %logits, %mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // max 位于 lane 0,可再配合广播用到全向量 %max_broadcast = pto.vlds %ub_tmp[%c0] {dist = "BRC_B32"} @@ -213,15 +213,15 @@ for (int i = 1; i < N; i++) // 用 vcgadd 做按组求和 %row_sums = pto.vcgadd %tile, %mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // 对整向量求和 %total = pto.vcadd %values, %mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // 做前缀和 %cdf = pto.vcpadd %pdf, %mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` --- diff --git a/docs/isa/vector/sfu-and-dsa-ops.md b/docs/isa/vector/sfu-and-dsa-ops.md index 7b60d97b2..b3dcd7d97 100644 --- a/docs/isa/vector/sfu-and-dsa-ops.md +++ b/docs/isa/vector/sfu-and-dsa-ops.md @@ -23,7 +23,7 @@ Fused operations, special functions, and UB-to-UB operations that leverage hardw ### `pto.vlrelu` -- **syntax:** `%result = pto.vlrelu %input, %alpha, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vlrelu %input, %alpha, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **A5 types:** f16, f32 - **semantics:** Leaky ReLU with scalar alpha. @@ -59,9 +59,9 @@ for (int i = 0; i < N; i++) --- -### `pto.vexpdiff` +### `pto.vexpdif` -- **syntax:** `%result = pto.vexpdiff %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg` +- **syntax:** `%result = pto.vexpdif %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg` - **A5 types:** f16, f32 - **semantics:** Fused exp(x - max) for numerically stable softmax. 
@@ -182,7 +182,7 @@ for (int i = 0; i < 128; i++) ### `pto.vmull` -- **syntax:** `%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg` +- **syntax:** `%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg` - **A5 types:** i32/u32 (native 32×32→64 widening multiply) - **semantics:** Widening multiply with high/low results. @@ -204,7 +204,7 @@ for (int i = 0; i < 64; i++) { ### `pto.vmula` -- **syntax:** `%result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **semantics:** Multiply-accumulate. ```c @@ -302,8 +302,8 @@ for (int i = 0; i < N; i++) ## Current Implementation Instruction Set Summary -- `pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg` -- `pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- `pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg` +- `pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - `pto.vci %index {order = "ORDER"} : integer -> !pto.vreg` - `pto.vbitsort %dest, %src, %indices, %repeat_times : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, index` - `pto.vmrgsort4 %dest, %src0, %src1, %src2, %src3, %count, %config : !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, !pto.ptr<...>, i64, i64` @@ -315,10 +315,10 @@ for (int i = 0; i < N; i++) ```mlir // Softmax with fused expdiff %max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr -> !pto.vreg<64xf32> -%exp_stable = pto.vexpdiff %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32> +%exp_stable = pto.vexpdif %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32> // Leaky ReLU activation -%activated 
= pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%activated = pto.vlrelu %linear_out, %alpha_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> // Fused residual add + ReLU %residual = pto.vaddrelu %conv_out, %skip_connection : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32> diff --git a/docs/isa/vector/sfu-and-dsa-ops_zh.md b/docs/isa/vector/sfu-and-dsa-ops_zh.md index 02400a191..ee5613959 100644 --- a/docs/isa/vector/sfu-and-dsa-ops_zh.md +++ b/docs/isa/vector/sfu-and-dsa-ops_zh.md @@ -29,7 +29,7 @@ ### `pto.vlrelu` -- **语法:** `%result = pto.vlrelu %input, %alpha, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vlrelu %input, %alpha, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **A5 类型:** f16、f32 - **语义:** Leaky ReLU,斜率由标量 `%alpha` 给出。 @@ -49,9 +49,9 @@ for (int i = 0; i < N; i++) dst[i] = (src[i] >= 0) ? src[i] : alpha[i] * src[i]; ``` -### `pto.vexpdiff` +### `pto.vexpdif` -- **语法:** `%result = pto.vexpdiff %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg` +- **语法:** `%result = pto.vexpdif %input, %max : !pto.vreg, !pto.vreg -> !pto.vreg` - **A5 类型:** f16、f32 - **语义:** 计算 `exp(x - max)` 的融合形式,典型用途是数值稳定版 softmax。 @@ -112,7 +112,7 @@ for (int i = 0; i < N; i++) ### `pto.vmull` -- **语法:** `%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg` +- **语法:** `%low, %high = pto.vmull %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg, !pto.vreg` - **A5 类型:** i32 / u32 的原生 32×32→64 扩宽乘 - **语义:** 返回扩宽乘积的低半部分和高半部分。 @@ -126,7 +126,7 @@ for (int i = 0; i < 64; i++) { ### `pto.vmula` -- **语法:** `%result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmula %acc, %lhs, %rhs, %mask : !pto.vreg, !pto.vreg, !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 乘加融合。 ```c @@ -216,12 +216,12 @@ for (int i = 0; i < N; i++) // Softmax:先减去最大值,再走 
expdiff %max_broadcast = pto.vlds %ub_max[%c0] {dist = "BRC_B32"} : !pto.ptr -> !pto.vreg<64xf32> -%exp_stable = pto.vexpdiff %logits, %max_broadcast +%exp_stable = pto.vexpdif %logits, %max_broadcast : !pto.vreg<64xf32>, !pto.vreg<64xf32> -> !pto.vreg<64xf32> // Leaky ReLU %activated = pto.vlrelu %linear_out, %alpha_scalar, %mask - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> // 残差加法 + ReLU %residual = pto.vaddrelu %conv_out, %skip_connection diff --git a/docs/isa/vector/unary-vector-ops.md b/docs/isa/vector/unary-vector-ops.md index 5f7402422..596dcffbf 100644 --- a/docs/isa/vector/unary-vector-ops.md +++ b/docs/isa/vector/unary-vector-ops.md @@ -31,10 +31,10 @@ pto.vecscope { %remaining_init = arith.constant 1024 : i32 %_:1 = scf.for %offset = %c0 to %total step %c64 iter_args(%remaining = %remaining_init) -> (i32) { - %mask, %next_remaining = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 + %mask, %next_remaining = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 %vec = pto.vlds %ub_in[%offset] : !pto.ptr -> !pto.vreg<64xf32> - %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask scf.yield %next_remaining : i32 } } @@ -98,7 +98,7 @@ total_cycles = startup + completion + repeats × per_repeat + (repeats - 1) × i ### `pto.vabs` -- **syntax:** `%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VABS_FP`; **Latency:** 5 (f32/f16), 5 (i32/i16/i8) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -115,7 +115,7 @@ for (int i = 0; i < N; i++) ### `pto.vneg` -- **syntax:** `%result = pto.vneg 
%input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vneg %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VMULS` (uses scalar-multiply hardware); **Latency:** 8 (f32/f16), 8 (i32/i16/i8) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -134,7 +134,7 @@ for (int i = 0; i < N; i++) ### `pto.vexp` -- **syntax:** `%result = pto.vexp %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vexp %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VEXP`; **Latency:** 16 (f32), 21 (f16) - **A2/A3 throughput:** 2 cycles/repeat (f32), 4 cycles/repeat (f16); **interval:** 18 cycles @@ -146,13 +146,13 @@ for (int i = 0; i < N; i++) - **inputs:** `%input` is the source vector and `%mask` selects active lanes. - **outputs:** `%result` holds `exp(input[i])` per active lane. - **constraints and limitations:** Only floating-point element types are legal. -- **Performance note:** f32 is significantly faster than f16 on A5 (16 vs 21 cycles). For f16, prefer `vexpdiff` (fused exp-diff) for numerical stability in softmax. +- **Performance note:** f32 is significantly faster than f16 on A5 (16 vs 21 cycles). For f16, prefer `vexpdif` (fused exp-diff) for numerical stability in softmax. 
--- ### `pto.vln` -- **syntax:** `%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VLN`; **Latency:** 18 (f32), 23 (f16) - **A2/A3 throughput:** 2 cycles/repeat (f32), 4 cycles/repeat (f16); **interval:** 18 cycles @@ -169,7 +169,7 @@ for (int i = 0; i < N; i++) ### `pto.vsqrt` -- **syntax:** `%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VSQRT`; **Latency:** 17 (f32), 22 (f16) - **A2/A3 throughput:** 2 cycles/repeat (f32), 4 cycles/repeat (f16); **interval:** 18 cycles @@ -187,7 +187,7 @@ for (int i = 0; i < N; i++) ### `pto.vrsqrt` -- **syntax:** `%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **Latency:** equivalent to `vsqrt`; uses `RV_VRSQRT` hardware path - **A2/A3 throughput:** 2 cycles/repeat (f32), 4 cycles/repeat (f16); **interval:** 18 cycles @@ -204,7 +204,7 @@ for (int i = 0; i < N; i++) ### `pto.vrec` -- **syntax:** `%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **Latency:** synthesized via `vdiv`; throughput matches `vdiv` - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -223,7 +223,7 @@ for (int i = 0; i < N; i++) ### `pto.vrelu` -- **syntax:** `%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VRELU`; **Latency:** 5 (f32/f16) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -243,7 +243,7 @@ for (int i = 0; i < N; i++) ### `pto.vnot` -- **syntax:** `%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- 
**syntax:** `%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VNOT`; **Latency:** 5 (integer types only) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 18 cycles @@ -260,7 +260,7 @@ for (int i = 0; i < N; i++) ### `pto.vbcnt` -- **syntax:** `%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **semantics:** Population count — counts the number of set bits in each lane's element. ```c @@ -276,7 +276,7 @@ for (int i = 0; i < N; i++) ### `pto.vcls` -- **syntax:** `%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **semantics:** Count leading sign bits — for a signed integer, counts how many bits from the MSB are equal to the sign bit. ```c @@ -294,7 +294,7 @@ for (int i = 0; i < N; i++) ### `pto.vmov` -- **syntax:** `%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **A5 RV:** `RV_VLD` (proxy); **Latency:** 9 (f32/f16), 9 (integer) - **A2/A3 throughput:** 1 cycle/repeat; **interval:** 13 cycles (`A2A3_INTERVAL_VCOPY`) @@ -313,12 +313,12 @@ for (int i = 0; i < N; i++) ```mlir // Softmax numerator: exp(x - max) using vexp -%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> -%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%sub = pto.vsub %x, %max_broadcast, %mask : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%exp = pto.vexp %sub, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // Reciprocal for division -%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%sum_rcp = pto.vrec %sum, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> // ReLU 
activation (lowest latency unary on A5: 5 cycles) -%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> +%activated = pto.vrelu %linear_out, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` diff --git a/docs/isa/vector/unary-vector-ops_zh.md b/docs/isa/vector/unary-vector-ops_zh.md index bd8fef607..da6bda987 100644 --- a/docs/isa/vector/unary-vector-ops_zh.md +++ b/docs/isa/vector/unary-vector-ops_zh.md @@ -28,10 +28,10 @@ pto.vecscope { %remaining_init = arith.constant 1024 : i32 %_:1 = scf.for %offset = %c0 to %total step %c64 iter_args(%remaining = %remaining_init) -> (i32) { - %mask, %next_remaining = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 + %mask, %next_remaining = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 %vec = pto.vlds %ub_in[%offset] : !pto.ptr -> !pto.vreg<64xf32> - %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask scf.yield %next_remaining : i32 } } @@ -89,14 +89,14 @@ pto.vecscope { ### `pto.vabs` -- **语法:** `%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vabs %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 对每个活跃 lane 取绝对值。 整数最小负数的溢出行为由目标平台定义。 ### `pto.vneg` -- **语法:** `%result = pto.vneg %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vneg %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 对每个活跃 lane 做算术取负。 在 A5 上它借用标量乘法类硬件路径,因此延迟不像 `vabs` 那么低。 @@ -107,35 +107,35 @@ pto.vecscope { ### `pto.vexp` -- **语法:** `%result = pto.vexp %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vexp %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 计算 `exp(input[i])`。 -只对浮点类型合法。f16 的成本高于 
f32,因此 softmax 等场景通常更偏好融合形式 `vexpdiff`。 +只对浮点类型合法。f16 的成本高于 f32,因此 softmax 等场景通常更偏好融合形式 `vexpdif`。 ### `pto.vln` -- **语法:** `%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vln %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 计算自然对数。 对实数语义而言,活跃输入最好严格大于 0;非正输入的异常或 NaN 行为由目标平台决定。 ### `pto.vsqrt` -- **语法:** `%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 求平方根。 负输入的行为由目标平台定义。 ### `pto.vrsqrt` -- **语法:** `%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vrsqrt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 求倒数平方根。 它与 `vsqrt` 共享一类硬件路径,因此成本也接近。 ### `pto.vrec` -- **语法:** `%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vrec %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 求倒数。 本质上走的是除法路径,因此不该把它当成廉价的一元 ALU 操作。 @@ -146,7 +146,7 @@ pto.vecscope { ### `pto.vrelu` -- **语法:** `%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vrelu %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 执行 `max(input, 0)`。 这是 A5 上延迟最低的一元浮点操作之一。需要带斜率的版本时,应使用 `vlrelu`。 @@ -157,21 +157,21 @@ pto.vecscope { ### `pto.vnot` -- **语法:** `%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vnot %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 按位取反。 仅对整数类型合法。 ### `pto.vbcnt` -- **语法:** `%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vbcnt %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 统计设置位个数。 统计范围是“单个元素的位宽”,不是整个寄存器的总位数。 ### `pto.vcls` -- **语法:** `%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vcls %input, %mask : !pto.vreg, !pto.mask -> 
!pto.vreg` - **语义:** 逐 lane 统计前导符号位数量。 它依赖元素的有符号解释,因此 signedness 是语义的一部分。 @@ -182,7 +182,7 @@ pto.vecscope { ### `pto.vmov` -- **语法:** `%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmov %input, %mask : !pto.vreg, !pto.mask -> !pto.vreg` - **语义:** 受谓词控制的寄存器复制。 无谓词形式是完整寄存器复制;有谓词时则更接近 masked copy。 @@ -193,15 +193,15 @@ pto.vecscope { ```mlir %sub = pto.vsub %x, %max_broadcast, %mask - : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %exp = pto.vexp %sub, %mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %sum_rcp = pto.vrec %sum, %mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> %activated = pto.vrelu %linear_out, %mask - : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> ``` --- diff --git a/docs/isa/vector/vec-scalar-ops.md b/docs/isa/vector/vec-scalar-ops.md index 05b688e0d..3458dc2dc 100644 --- a/docs/isa/vector/vec-scalar-ops.md +++ b/docs/isa/vector/vec-scalar-ops.md @@ -22,7 +22,7 @@ Operations that combine a vector with a scalar value, applying the scalar to eve ### `pto.vadds` -- **syntax:** `%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -40,7 +40,7 @@ for (int i = 0; i < N; i++) ### `pto.vsubs` -- **syntax:** `%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -56,7 +56,7 @@ for (int i = 0; i < N; i++) ### `pto.vmuls` -- **syntax:** `%result = pto.vmuls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> 
!pto.vreg` +- **syntax:** `%result = pto.vmuls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -72,7 +72,7 @@ for (int i = 0; i < N; i++) ### `pto.vmaxs` -- **syntax:** `%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -87,7 +87,7 @@ for (int i = 0; i < N; i++) ### `pto.vmins` -- **syntax:** `%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -104,7 +104,7 @@ for (int i = 0; i < N; i++) ### `pto.vands` -- **syntax:** `%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -119,7 +119,7 @@ for (int i = 0; i < N; i++) ### `pto.vors` -- **syntax:** `%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -134,7 +134,7 @@ for (int i = 0; i < N; i++) ### `pto.vxors` -- **syntax:** `%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -151,7 +151,7 @@ for (int i = 0; i < N; i++) ### `pto.vshls` -- **syntax:** `%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -168,7 +168,7 @@ for (int i = 0; i < N; i++) ### `pto.vshrs` -- **syntax:** `%result = 
pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -184,7 +184,7 @@ for (int i = 0; i < N; i++) ### `pto.vlrelu` -- **syntax:** `%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` ```c for (int i = 0; i < N; i++) @@ -203,7 +203,7 @@ for (int i = 0; i < N; i++) ### `pto.vaddcs` -- **syntax:** `%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` +- **syntax:** `%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` - **semantics:** Add with carry-in and carry-out. ```c @@ -226,7 +226,7 @@ for (int i = 0; i < N; i++) { ### `pto.vsubcs` -- **syntax:** `%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` +- **syntax:** `%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` - **semantics:** Subtract with borrow-in and borrow-out. 
```c @@ -249,15 +249,15 @@ for (int i = 0; i < N; i++) { ```mlir // Add bias to all elements -%biased = pto.vadds %activation, %bias_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%biased = pto.vadds %activation, %bias_scalar, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> // Scale by constant -%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%scaled = pto.vmuls %input, %scale, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> // Clamp to [0, 255] for uint8 quantization -%clamped_low = pto.vmaxs %input, %c0, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> -%clamped = pto.vmins %clamped_low, %c255, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%clamped_low = pto.vmaxs %input, %c0, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> +%clamped = pto.vmins %clamped_low, %c255, %mask : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> // Shift right by fixed amount -%shifted = pto.vshrs %data, %c4, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> +%shifted = pto.vshrs %data, %c4, %mask : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` diff --git a/docs/isa/vector/vec-scalar-ops_zh.md b/docs/isa/vector/vec-scalar-ops_zh.md index b2a6ae5a4..f3a5978ce 100644 --- a/docs/isa/vector/vec-scalar-ops_zh.md +++ b/docs/isa/vector/vec-scalar-ops_zh.md @@ -24,27 +24,27 @@ ### `pto.vadds` -- **语法:** `%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vadds %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 执行 `src[i] + scalar`。 ### `pto.vsubs` -- **语法:** `%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vsubs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 执行 `src[i] - scalar`。 ### `pto.vmuls` -- **语法:** `%result = pto.vmuls %input, 
%scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmuls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 执行 `src[i] * scalar`。 ### `pto.vmaxs` -- **语法:** `%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmaxs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 与同一个标量取最大值。 ### `pto.vmins` -- **语法:** `%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vmins %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 与同一个标量取最小值。 这些指令的关键语义不是“标量先被编译器显式广播,再套用普通二元运算”,而是广播本身就是这条指令的定义。 @@ -55,17 +55,17 @@ ### `pto.vands` -- **语法:** `%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vands %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 与统一标量做按位与。 ### `pto.vors` -- **语法:** `%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 与统一标量做按位或。 ### `pto.vxors` -- **语法:** `%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vxors %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 逐 lane 与统一标量做按位异或。 这三条只对整数元素类型合法。 @@ -76,12 +76,12 @@ ### `pto.vshls` -- **语法:** `%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vshls %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 把统一位移量应用到每个活跃 lane 的左移。 ### `pto.vshrs` -- **语法:** `%result = pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vshrs %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** 把统一位移量应用到每个活跃 lane 的右移。 这两条与 `vshl` / `vshr` 
的差别在于:后者的位移量来自第二个向量寄存器,前者是单一标量统一广播。 @@ -94,7 +94,7 @@ ### `pto.vlrelu` -- **语法:** `%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` +- **语法:** `%result = pto.vlrelu %input, %scalar, %mask : !pto.vreg, T, !pto.mask -> !pto.vreg` - **语义:** Leaky ReLU,其中 `%scalar` 是统一斜率。 ```c @@ -110,7 +110,7 @@ for (int i = 0; i < N; i++) ### `pto.vaddcs` -- **语法:** `%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` +- **语法:** `%result, %carry = pto.vaddcs %lhs, %rhs, %carry_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` - **语义:** 带 carry-in 和 carry-out 的加法。 ```c @@ -123,7 +123,7 @@ for (int i = 0; i < N; i++) { ### `pto.vsubcs` -- **语法:** `%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` +- **语法:** `%result, %borrow = pto.vsubcs %lhs, %rhs, %borrow_in, %mask : !pto.vreg, !pto.vreg, !pto.mask, !pto.mask -> !pto.vreg, !pto.mask` - **语义:** 带 borrow-in 和 borrow-out 的减法。 ```c @@ -141,18 +141,18 @@ for (int i = 0; i < N; i++) { ```mlir %biased = pto.vadds %activation, %bias_scalar, %mask - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> %scaled = pto.vmuls %input, %scale, %mask - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> %clamped_low = pto.vmaxs %input, %c0, %mask - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> %clamped = pto.vmins %clamped_low, %c255, %mask - : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> + : !pto.vreg<64xf32>, f32, !pto.mask -> !pto.vreg<64xf32> %shifted = pto.vshrs %data, %c4, %mask - : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> + : !pto.vreg<64xi32>, i32, !pto.mask -> !pto.vreg<64xi32> ``` --- diff --git 
a/docs/isa/vector/vector-load-store.md b/docs/isa/vector/vector-load-store.md index 9557f096f..ed48363fd 100644 --- a/docs/isa/vector/vector-load-store.md +++ b/docs/isa/vector/vector-load-store.md @@ -41,10 +41,10 @@ module attributes {pto.target_arch = "a5"} { pto.vecscope { %_:1 = scf.for %offset = %c0 to %c1024 step %c64 iter_args(%remaining = %c1024_i32) -> (i32) { - %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 + %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32 %vec = pto.vlds %ub_in[%offset] : !pto.ptr -> !pto.vreg<64xf32> - %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> - pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask + %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32> + pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask scf.yield %next : i32 } } @@ -200,9 +200,9 @@ cycles = ceil(4096 / 128) = 32 cycles ## Dual Loads (Deinterleave) -### `pto.vldx2` +### `pto.vldsx2` -- **syntax:** `%low, %high = pto.vldx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg` +- **syntax:** `%low, %high = pto.vldsx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg` - **semantics:** Dual load with deinterleave (AoS → SoA conversion). 
- **inputs:** `%source` is the UB base pointer, `%offset` is the displacement, and `DIST` @@ -225,7 +225,7 @@ for (int i = 0; i < 64; i++) { **Example — Load interleaved XY pairs into separate X/Y vectors:** ```mlir -%x, %y = pto.vldx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> +%x, %y = pto.vldsx2 %ub[%offset], "DINTLV_B32" : !pto.ptr, index -> !pto.vreg<64xf32>, !pto.vreg<64xf32> ``` --- @@ -250,7 +250,7 @@ for (int i = 0; i < 64; i++) { ### `pto.vsldb` -- **syntax:** `%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg` - **semantics:** Block-strided load for 2D tile access. - **inputs:** `%source` is the UB base pointer, `%offset` is the packed stride/control word, @@ -310,7 +310,7 @@ for (int i = 0; i < active_lanes; i++) ### `pto.vgather2_bc` -- **syntax:** `%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg` +- **syntax:** `%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg` - **semantics:** Gather with broadcast, conditioned by mask. - **inputs:** `%source` is the UB base pointer, `%offsets` contains gather indices, and @@ -328,7 +328,7 @@ for (int i = 0; i < active_lanes; i++) ### `pto.vsts` -- **syntax:** `pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask` +- **syntax:** `pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask` - **semantics:** Vector store with distribution mode. 
- **inputs:** `%value` is the source vector, `%dest` is the UB base pointer, `%offset` is @@ -353,16 +353,16 @@ for (int i = 0; i < active_lanes; i++) **Example — Contiguous store:** ```mlir -pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask +pto.vsts %v, %ub[%offset], %mask {dist = "NORM_B32"} : !pto.vreg<64xf32>, !pto.ptr, !pto.mask ``` --- ## Dual Stores (Interleave) -### `pto.vstx2` +### `pto.vstsx2` -- **syntax:** `pto.vstx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask` +- **syntax:** `pto.vstsx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask` - **semantics:** Dual interleaved store (SoA → AoS conversion). - **inputs:** `%low` and `%high` are the two source vectors, `%dest` is the UB base pointer, @@ -405,7 +405,7 @@ for (int i = 0; i < 64; i++) { ### `pto.vsstb` -- **syntax:** `pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask` +- **syntax:** `pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask` - **semantics:** Block-strided store for 2D tile access. 
 - **inputs:** `%value` is the source vector, `%dest` is the UB base pointer, `%offset` is
diff --git a/docs/isa/vector/vector-load-store_zh.md b/docs/isa/vector/vector-load-store_zh.md
index f730cff06..148edffed 100644
--- a/docs/isa/vector/vector-load-store_zh.md
+++ b/docs/isa/vector/vector-load-store_zh.md
@@ -46,10 +46,10 @@ module attributes {pto.target_arch = "a5"} {
  pto.vecscope {
  %_:1 = scf.for %offset = %c0 to %c1024 step %c64 iter_args(%remaining = %c1024_i32) -> (i32) {
- %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32
+ %mask, %next = pto.plt_b32 %remaining : i32 -> !pto.mask, i32
  %vec = pto.vlds %ub_in[%offset] : !pto.ptr -> !pto.vreg<64xf32>
- %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
- pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask
+ %out = pto.vabs %vec, %mask : !pto.vreg<64xf32>, !pto.mask -> !pto.vreg<64xf32>
+ pto.vsts %out, %ub_out[%offset], %mask : !pto.vreg<64xf32>, !pto.ptr, !pto.mask
  scf.yield %next : i32
  }
  }
@@ -175,9 +175,9 @@ PTO 把非对齐状态显式放在 SSA 里,而不是让后端偷偷维护隐
 ## 双路加载与去交织
-### `pto.vldx2`
+### `pto.vldsx2`
-- **语法:** `%low, %high = pto.vldx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg`
+- **语法:** `%low, %high = pto.vldsx2 %source[%offset], "DIST" : !pto.ptr, index -> !pto.vreg, !pto.vreg`
 - **语义:** 从交织布局中一次装出两路结果,常用于 AoS→SoA 转换。
 合法分布模式包括 `DINTLV_B8`、`DINTLV_B16`、`DINTLV_B32`、`BDINTLV`。
@@ -204,7 +204,7 @@ for (int i = 0; i < 64; i++) {
 ### `pto.vsldb`
-- **语法:** `%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg`
+- **语法:** `%result = pto.vsldb %source, %offset, %mask : !pto.ptr, i32, !pto.mask -> !pto.vreg`
 - **语义:** 按 block stride 从向量 tile buffer 装载,常用于二维 tile 访问。
 这里的 `%offset` 不是普通字节偏移,而是一个编码了 stride / repeat 规律的控制字。被 mask 关闭的 block 会在结果中清零,也不应为该 block 触发地址越界异常。
@@ -238,7 +238,7 @@ for (int i = 0; i < active_lanes; i++)
 ### `pto.vgather2_bc`
-- **语法:** `%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg`
+- **语法:** `%result = pto.vgather2_bc %source, %offsets, %mask : !pto.ptr, !pto.vreg, !pto.mask -> !pto.vreg`
 - **语义:** 带广播语义的 gather,受谓词控制。
 被 mask 关闭的 lane 不参与地址合并,也不会触发地址异常,结果位置按零填充。
@@ -249,7 +249,7 @@ for (int i = 0; i < active_lanes; i++)
 ### `pto.vsts`
-- **语法:** `pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask`
+- **语法:** `pto.vsts %value, %dest[%offset], %mask {dist = "DIST"} : !pto.vreg, !pto.ptr, !pto.mask`
 - **语义:** 按给定分布模式,把向量寄存器写回向量 tile buffer。
 约束:
@@ -265,9 +265,9 @@ for (int i = 0; i < active_lanes; i++)
 | `MRG4CHN_B8` | 四通道合并 |
 | `MRG2CHN_B8/B16` | 双通道合并 |
-### `pto.vstx2`
+### `pto.vstsx2`
-- **语法:** `pto.vstx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask`
+- **语法:** `pto.vstsx2 %low, %high, %dest[%offset], "DIST", %mask : !pto.vreg, !pto.vreg, !pto.ptr, index, !pto.mask`
 - **语义:** 双路交织存储,常用于 SoA→AoS 转换。
 合法分布模式包括 `INTLV_B8`、`INTLV_B16`、`INTLV_B32`。
@@ -292,7 +292,7 @@ for (int i = 0; i < 64; i++) {
 ### `pto.vsstb`
-- **语法:** `pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask`
+- **语法:** `pto.vsstb %value, %dest, %offset, %mask : !pto.vreg, !pto.ptr, i32, !pto.mask`
 - **语义:** 按 block stride 存储二维 tile。
 这里的 `%offset` 同样是控制字,而不是普通的字节偏移。
diff --git a/docs/mkdocs/gen_pages.py b/docs/mkdocs/gen_pages.py
index 2ee3cadb7..8c3fb4055 100644
--- a/docs/mkdocs/gen_pages.py
+++ b/docs/mkdocs/gen_pages.py
@@ -447,14 +447,14 @@ def replace_rel(m: re.Match) -> str:
  "docs/isa/vector/ops/vector-load-store/vlds.md",
  "docs/isa/vector/ops/vector-load-store/vldas.md",
  "docs/isa/vector/ops/vector-load-store/vldus.md",
- "docs/isa/vector/ops/vector-load-store/vldx2.md",
+ "docs/isa/vector/ops/vector-load-store/vldsx2.md",
  "docs/isa/vector/ops/vector-load-store/vsld.md",
  "docs/isa/vector/ops/vector-load-store/vsldb.md",
  "docs/isa/vector/ops/vector-load-store/vgather2.md",
  "docs/isa/vector/ops/vector-load-store/vgatherb.md",
  "docs/isa/vector/ops/vector-load-store/vgather2-bc.md",
  "docs/isa/vector/ops/vector-load-store/vsts.md",
- "docs/isa/vector/ops/vector-load-store/vstx2.md",
+ "docs/isa/vector/ops/vector-load-store/vstsx2.md",
  "docs/isa/vector/ops/vector-load-store/vsst.md",
  "docs/isa/vector/ops/vector-load-store/vsstb.md",
  "docs/isa/vector/ops/vector-load-store/vscatter.md",
@@ -541,7 +541,7 @@ def replace_rel(m: re.Match) -> str:
  "docs/isa/vector/ops/data-rearrangement/vdintlvv2.md",
  "docs/isa/vector/sfu-and-dsa-ops.md",
  "docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md",
- "docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md",
+ "docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif.md",
  "docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md",
  "docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md",
  "docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md",
diff --git a/docs/mkdocs/mkdocs.yml b/docs/mkdocs/mkdocs.yml
index 4386bc0a2..feb803d3c 100644
--- a/docs/mkdocs/mkdocs.yml
+++ b/docs/mkdocs/mkdocs.yml
@@ -120,6 +120,17 @@ nav:
  - System Scheduling Instruction Set: docs/isa/instruction-families/system-scheduling-families.md
  - 8. Tile Instruction Reference:
  - Overview: docs/isa/tile/README.md
+ - View And Tile Buffer:
+ - Instruction Set Contract: docs/isa/tile/view-and-tile-buf.md
+ - pto.make_tensor_view: docs/isa/tile/ops/view-and-tile-buf/make-tensor-view.md
+ - pto.get_tensor_view_dim: docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-dim.md
+ - pto.get_tensor_view_stride: docs/isa/tile/ops/view-and-tile-buf/get-tensor-view-stride.md
+ - pto.tensor_view_addr: docs/isa/tile/ops/view-and-tile-buf/tensor-view-addr.md
+ - pto.partition_view: docs/isa/tile/ops/view-and-tile-buf/partition-view.md
+ - pto.alloc_tile: docs/isa/tile/ops/view-and-tile-buf/alloc-tile.md
+ - pto.subset: docs/isa/tile/ops/view-and-tile-buf/subset.md
+ - pto.set_validshape: docs/isa/tile/ops/view-and-tile-buf/set-validshape.md
+ - pto.tile_buf_addr: docs/isa/tile/ops/view-and-tile-buf/tile-buf-addr.md
  - Sync And Config:
  - Instruction Set Contract: docs/isa/tile/sync-and-config.md
  - pto.tsync: docs/isa/tile/ops/sync-and-config/tsync.md
@@ -273,14 +284,14 @@ nav:
  - pto.vlds: docs/isa/vector/ops/vector-load-store/vlds.md
  - pto.vldas: docs/isa/vector/ops/vector-load-store/vldas.md
  - pto.vldus: docs/isa/vector/ops/vector-load-store/vldus.md
- - pto.vldx2: docs/isa/vector/ops/vector-load-store/vldx2.md
+ - pto.vldsx2: docs/isa/vector/ops/vector-load-store/vldsx2.md
  - pto.vsld: docs/isa/vector/ops/vector-load-store/vsld.md
  - pto.vsldb: docs/isa/vector/ops/vector-load-store/vsldb.md
  - pto.vgather2: docs/isa/vector/ops/vector-load-store/vgather2.md
  - pto.vgatherb: docs/isa/vector/ops/vector-load-store/vgatherb.md
  - pto.vgather2_bc: docs/isa/vector/ops/vector-load-store/vgather2-bc.md
  - pto.vsts: docs/isa/vector/ops/vector-load-store/vsts.md
- - pto.vstx2: docs/isa/vector/ops/vector-load-store/vstx2.md
+ - pto.vstsx2: docs/isa/vector/ops/vector-load-store/vstsx2.md
  - pto.vsst: docs/isa/vector/ops/vector-load-store/vsst.md
  - pto.vsstb: docs/isa/vector/ops/vector-load-store/vsstb.md
  - pto.vscatter: docs/isa/vector/ops/vector-load-store/vscatter.md
@@ -376,7 +387,7 @@ nav:
  - SFU And DSA Instructions:
  - Instruction Set Overview: docs/isa/vector/sfu-and-dsa-ops.md
  - pto.vprelu: docs/isa/vector/ops/sfu-and-dsa-ops/vprelu.md
- - pto.vexpdiff: docs/isa/vector/ops/sfu-and-dsa-ops/vexpdiff.md
+ - pto.vexpdif: docs/isa/vector/ops/sfu-and-dsa-ops/vexpdif.md
  - pto.vaddrelu: docs/isa/vector/ops/sfu-and-dsa-ops/vaddrelu.md
  - pto.vsubrelu: docs/isa/vector/ops/sfu-and-dsa-ops/vsubrelu.md
  - pto.vaxpy: docs/isa/vector/ops/sfu-and-dsa-ops/vaxpy.md
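The renames in this diff (`vldx2` to `vldsx2`, `vstx2` to `vstsx2`, `vexpdiff` to `vexpdif`) have to land in both `gen_pages.py` and the `mkdocs.yml` nav, so a leftover old spelling is easy to miss. Below is a minimal sketch of a check for that, assuming the old names should no longer appear as standalone identifiers in the scanned text. The helper `find_stale_names` and the `RENAMES` table are hypothetical and not part of the repository.

```python
import re

# Old -> new op names as they appear in this diff.
# The table itself is an illustrative assumption, not repository code.
RENAMES = {
    "vldx2": "vldsx2",
    "vstx2": "vstsx2",
    "vexpdiff": "vexpdif",
}

# Match an old name only when it is not embedded in a longer identifier,
# so "pto.vldx2:" and "vldx2.md" match but "vldsx2" and "my_vldx2" do not.
_STALE = re.compile(
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(name) for name in RENAMES)
    + r")(?![A-Za-z0-9_])"
)


def find_stale_names(text):
    """Return (line_number, old_name, new_name) for each stale reference."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _STALE.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits
```

Feeding it the contents of `docs/mkdocs/mkdocs.yml` or `docs/mkdocs/gen_pages.py` after the patch is applied would be expected to yield an empty list if the rename is complete in that file. Any tuple returned points at a line that still uses an old spelling.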