2 changes: 1 addition & 1 deletion docs/isa/README_zh.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
<p align="center">
<p align="center">
<img src="../figures/pto_logo.svg" alt="PTO Tile Lib" width="180" />
</p>

2 changes: 1 addition & 1 deletion docs/isa/comm/TGET_ASYNC_zh.md
@@ -1,4 +1,4 @@
# TGET_ASYNC
# pto.tget_async

## 简介

2 changes: 1 addition & 1 deletion docs/isa/comm/TGET_zh.md
@@ -1,4 +1,4 @@
# pto.tget / TGET
# pto.tget

## 简介

2 changes: 1 addition & 1 deletion docs/isa/comm/TNOTIFY_zh.md
@@ -1,4 +1,4 @@
# TNOTIFY
# pto.tnotify

## 简介

2 changes: 1 addition & 1 deletion docs/isa/comm/TPUT_ASYNC_zh.md
@@ -1,4 +1,4 @@
# TPUT_ASYNC
# pto.tput_async

## 简介

2 changes: 1 addition & 1 deletion docs/isa/comm/TPUT_zh.md
@@ -1,4 +1,4 @@
# TPUT
# pto.tput

## 简介

2 changes: 1 addition & 1 deletion docs/isa/comm/TTEST_zh.md
@@ -1,4 +1,4 @@
# TTEST
# pto.ttest

## 简介

2 changes: 1 addition & 1 deletion docs/isa/comm/TWAIT.md
@@ -1,4 +1,4 @@
# pto.twait
# pto.twait

`pto.twait` is part of the [Collective Communication](communication-runtime.md) instruction set.

2 changes: 1 addition & 1 deletion docs/isa/comm/TWAIT_zh.md
@@ -1,4 +1,4 @@
# TWAIT
# pto.twait

## 简介

2 changes: 1 addition & 1 deletion docs/isa/conventions.md
@@ -1,4 +1,4 @@
# PTO ISA Conventions
# PTO ISA Conventions

Shared conventions for the per-instruction ISA reference pages in `docs/isa/` and the corresponding C++ intrinsics in `include/pto/common/pto_instr.hpp` are defined below.

2 changes: 1 addition & 1 deletion docs/isa/conventions_zh.md
@@ -1,4 +1,4 @@
# PTO ISA 通用约定
# PTO ISA 通用约定

`docs/isa/` 指令参考文档使用的通用术语与写法如下,并与 `include/pto/common/pto_instr.hpp` 中的 C++ 内建接口保持一致。

78 changes: 78 additions & 0 deletions docs/isa/cube/README.md
@@ -0,0 +1,78 @@
# Cube Micro-Instruction Reference

This section documents the PTO **Cube micro-instruction surface**: the matrix-multiply (MAD) and cube-side data-movement ops that program the cube core (AIC) and its dedicated buffer hierarchy (L1 / L0A / L0B / L0C / BT).

!!! note "Scope and audience"
    Tile-level matrix ops such as `pto.tmatmul` (covered under [Tile ISA Matrix & Matrix-Vector](../tile/matrix-and-matrix-vector.md)) hide most of these primitives behind a tile-shaped interface. The cube micro-instructions documented here are the lower-level surface that compiler back-ends and hand-tuned cube kernels target directly. They make the NZ fractal layout, the L1/L0 buffer hierarchy, and FIXPIPE writeback explicit.

## Architectural Background

| Page | Purpose |
|------|---------|
| [NZ Fractal Layout](./nz-fractal-layout.md) | The fractal NZ format used by L1, L0A, L0B, and L0C. Defines the `(k1, m1, m0, k0)` re-indexing and per-buffer layout variants. |
| [Buffer Hierarchy](./buffer-hierarchy.md) | The L1 / L0A / L0B / L0C / BT memory hierarchy: address spaces, sizes, and data-flow contracts. |
| [FIXPIPE Model](./fixpipe-model.md) | The FIXPIPE writeback path: how L0C results are converted back to ND and routed to L1, UB, or GM. |
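
To make the `(k1, m1, m0, k0)` re-indexing concrete, the Python sketch below maps a row-major element `(m, k)` of an `M × K` operand to its linear element offset in that order. It is an illustration only: the helper name `nz_offset`, the default block sizes `M0 = 16` and `K0 = 16`, and the requirement that `M` is already padded to a multiple of `M0` are assumptions made for this example. The linked layout page remains the authoritative definition.

```python
def nz_offset(m: int, k: int, M: int, M0: int = 16, K0: int = 16) -> int:
    """Linear element offset of A[m, k] in (k1, m1, m0, k0) order.

    Hypothetical helper: the M0/K0 defaults are assumptions, and M is
    assumed to be padded to a multiple of M0.
    """
    if M % M0 != 0:
        raise ValueError("M must be padded to a multiple of M0")
    M1 = M // M0            # number of row blocks
    k1, k0 = divmod(k, K0)  # outer / inner column index
    m1, m0 = divmod(m, M0)  # outer / inner row index
    return ((k1 * M1 + m1) * M0 + m0) * K0 + k0


# Offsets for a 32-row operand: stepping k inside a block is contiguous,
# stepping m moves by K0, and crossing a K0 boundary moves by M1 * M0 * K0.
offsets = [
    nz_offset(0, 0, 32),
    nz_offset(0, 1, 32),
    nz_offset(1, 0, 32),
    nz_offset(16, 0, 32),
    nz_offset(0, 16, 32),
]
print(offsets)
```

Under these assumptions the mapping is a bijection over the padded operand, so packing to NZ and unpacking back to ND lose no elements.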

## Matrix Multiply (MAD) Ops

The MAD family computes `dst = lhs @ rhs` on tiles staged into the cube's L0A / L0B / L0C buffers. All variants share the same `(M, N, K)` shape parameters and a common set of optional clauses (`unit_flag`, `disable_gemv`, `sat`/`nosat`, `tf32_mode`, `n_dir`).

| Op | Semantics |
|----|-----------|
| [pto.mad](./ops/mad/mad.md) | Zero-init: `dst = lhs @ rhs` |
| [pto.mad_acc](./ops/mad/mad-acc.md) | Accumulate: `dst = dst + lhs @ rhs` |
| [pto.mad_bias](./ops/mad/mad-bias.md) | Bias-init: `dst = lhs @ rhs + bias[n]` |
| [pto.mad_mx](./ops/mad/mad-mx.md) | Zero-init MX (microscaled) matmul |
| [pto.mad_mx_acc](./ops/mad/mad-mx-acc.md) | Accumulating MX matmul |
| [pto.mad_mx_bias](./ops/mad/mad-mx-bias.md) | Bias-init MX matmul |
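
The three non-MX semantics in the table can be written as a small host-side reference model. The sketch below is plain Python on nested lists and mirrors only the arithmetic in the "Semantics" column. The function names are illustrative stand-ins rather than the C++ intrinsics, and the optional clauses, saturation, and accumulator precision are deliberately ignored.

```python
from typing import List

Matrix = List[List[float]]


def matmul(lhs: Matrix, rhs: Matrix) -> Matrix:
    """Plain triple-loop product of an M x K and a K x N matrix."""
    M, K, N = len(lhs), len(rhs), len(rhs[0])
    return [[sum(lhs[m][k] * rhs[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]


def mad(lhs: Matrix, rhs: Matrix) -> Matrix:
    """Zero-init: dst = lhs @ rhs."""
    return matmul(lhs, rhs)


def mad_acc(dst: Matrix, lhs: Matrix, rhs: Matrix) -> Matrix:
    """Accumulate: dst = dst + lhs @ rhs."""
    prod = matmul(lhs, rhs)
    return [[d + p for d, p in zip(drow, prow)] for drow, prow in zip(dst, prod)]


def mad_bias(lhs: Matrix, rhs: Matrix, bias: List[float]) -> Matrix:
    """Bias-init: dst = lhs @ rhs + bias[n], with bias broadcast along M."""
    prod = matmul(lhs, rhs)
    return [[p + bias[n] for n, p in enumerate(row)] for row in prod]


lhs = [[1, 2], [3, 4]]
rhs = [[5, 6], [7, 8]]

full = mad(lhs, rhs)

# Splitting K into two slices: zero-init on the first, accumulate the second.
first = mad([[1], [3]], [[5, 6]])
split = mad_acc(first, [[2], [4]], [[7, 8]])

biased = mad_bias(lhs, rhs, [10, 20])
print(full, split, biased)
```

The K-split case shows why the zero-init and accumulate forms are paired: a long reduction can start with the zero-init form on the first K slice and continue with the accumulating form on the remaining slices.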

## Cube Data Movement Ops

These ops move tiles between GM, L1, L0A/L0B, and L0C using grouped `nburst(...)` / `loop(...)` clauses analogous to the [scalar DMA Copy](../scalar/dma-copy.md) surface.

### GM → L1

- [pto.mte_gm_l1](./ops/data-movement/mte-gm-l1.md) — Direct GM→L1 load (no layout transform)
- [pto.mte_gm_l1_frac](./ops/data-movement/mte-gm-l1-frac.md) — GM→L1 with ND→NZ fractal repack

### L1 ↔ UB

- [pto.mte_l1_ub](./ops/data-movement/mte-l1-ub.md) — L1→UB transfer (cube-to-vector data path)
- [pto.mte_ub_l1](../scalar/ops/dma-copy/mte-ub-l1.md) — UB→L1 transfer (vector-to-cube data path; lives in the scalar DMA section)

### L1 → L0A / L0B (cube operand load)

- [pto.mte_l1_l0a](./ops/data-movement/mte-l1-l0a.md) — Stage L1 NZ tile into L0A (left operand)
- [pto.mte_l1_l0b](./ops/data-movement/mte-l1-l0b.md) — Stage L1 NZ tile into L0B (right operand, K-innermost transpose)
- [pto.mte_l1_l0a_mx](./ops/data-movement/mte-l1-l0a-mx.md) — Load MX scale payload for L0A
- [pto.mte_l1_l0b_mx](./ops/data-movement/mte-l1-l0b-mx.md) — Load MX scale payload for L0B

### L1 → BT / FB (bias and FIXPIPE parameters)

- [pto.mte_l1_bt](./ops/data-movement/mte-l1-bt.md) — Stage bias vector into BT for `pto.mad_bias` / `pto.mad_mx_bias`
- [pto.mte_l1_fb](./ops/data-movement/mte-l1-fb.md) — Stage FIXPIPE-relevant payload (e.g., dequant params)

### L0C writeback (FIXPIPE)

- [pto.mte_l0c_l1](./ops/data-movement/mte-l0c-l1.md) — FIXPIPE: L0C → L1
- [pto.mte_l0c_gm](./ops/data-movement/mte-l0c-gm.md) — FIXPIPE: L0C → GM
- [pto.mte_l0c_ub](./ops/data-movement/mte-l0c-ub.md) — FIXPIPE: L0C → UB

## Full Cube Pipeline

```text
GM (ND)                              L1/cbuf (NZ)                L0A/B (NZ)           L0C (NZ)        GM (ND)

A[M,K] --mte_gm_l1_frac/mte_gm_l1--> K1 M1 M0 K0 --mte_l1_l0a--> K1 M1 M0 K0 -+
                                                                              +-MAD-> N1 M1 M0 N0 --> C[M,N]
B[K,N] --mte_gm_l1_frac/mte_gm_l1--> K1 N1 K0 N0 --mte_l1_l0b--> K1 N1 N0 K0 -+
                                                        ^
                                 transpose as part of mte_l1_l0b when requested
                                                  NOT at GM->L1
```
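
The right-operand path in the diagram changes the inner order from `K0 N0` in L1 to `N0 K0` in L0B, which amounts to transposing each fractal in place while the outer `K1 N1` order is unchanged. The Python sketch below models only that index change on a flat list. The function name `l1_to_l0b_order` and the tiny block sizes are assumptions chosen for readability, not real fractal dimensions, and the exact hardware behavior is defined on the `pto.mte_l1_l0b` page.

```python
from typing import List, Sequence


def l1_to_l0b_order(src: Sequence[int], K1: int, N1: int, K0: int, N0: int) -> List[int]:
    """Reorder a flat buffer from (k1, n1, k0, n0) to (k1, n1, n0, k0).

    Hypothetical host-side model of the per-fractal transpose.
    """
    if len(src) != K1 * N1 * K0 * N0:
        raise ValueError("buffer length does not match K1 * N1 * K0 * N0")
    dst = [0] * len(src)
    for k1 in range(K1):
        for n1 in range(N1):
            # Each fractal starts at the same offset in both layouts.
            base = (k1 * N1 + n1) * K0 * N0
            for k0 in range(K0):
                for n0 in range(N0):
                    dst[base + n0 * K0 + k0] = src[base + k0 * N0 + n0]
    return dst


# One fractal with K0 = 2 rows and N0 = 3 columns: [[0, 1, 2], [3, 4, 5]].
reordered = l1_to_l0b_order([0, 1, 2, 3, 4, 5], K1=1, N1=1, K0=2, N0=3)
print(reordered)
```

After the reorder, consecutive elements walk along K, which is the "K innermost" property the L0B layout is described as having.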

## Related Sections

- [Tile ISA: Matrix and Matrix-Vector](../tile/matrix-and-matrix-vector.md) — Tile-level matrix ops
- [Scalar DMA Copy](../scalar/dma-copy.md) — UB-side DMA grouped transfers
- [Pipeline Synchronization](../scalar/ops/pipeline-sync/) — Cube/Vector synchronization primitives
78 changes: 78 additions & 0 deletions docs/isa/cube/README_zh.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Cube 微指令参考

本节记录 PTO 的 **Cube 微指令表面**:矩阵乘加(MAD)以及面向 cube core(AIC)和其专用缓冲层级(L1 / L0A / L0B / L0C / BT)的数据搬运指令。

!!! note "范围与受众"
Tile 级的矩阵指令(例如 `pto.tmatmul`,见 [Tile ISA 矩阵与矩阵-向量](../tile/matrix-and-matrix-vector_zh.md))把这些底层原语隐藏在 tile 形状的接口之后。本节描述的 cube 微指令是编译器后端与手写 cube 内核直接对接的低层表面,把 NZ fractal 布局、L1/L0 缓冲层级、FIXPIPE 回写显式化。

## 架构背景

| 页面 | 用途 |
|------|------|
| [NZ Fractal 布局](./nz-fractal-layout_zh.md) | L1、L0A、L0B、L0C 使用的 fractal NZ 格式,定义 `(k1, m1, m0, k0)` 重新索引与各缓冲变种。 |
| [缓冲层级](./buffer-hierarchy_zh.md) | L1 / L0A / L0B / L0C / BT 内存层级:地址空间、大小、数据流契约。 |
| [FIXPIPE 模型](./fixpipe-model_zh.md) | FIXPIPE 回写通路:L0C 结果如何转换回 ND 并路由到 UB 或 GM。 |

## 矩阵乘加(MAD)指令

MAD 家族在 cube 的 L0A / L0B / L0C 缓冲上计算 `dst = lhs @ rhs`。所有变体共享相同的 `(M, N, K)` 形状参数与一组可选 clauses(`unit_flag`、`disable_gemv`、`sat`/`nosat`、`tf32_mode`、`n_dir`)。

| 指令 | 语义 |
|------|------|
| [pto.mad](./ops/mad/mad_zh.md) | 零初始化:`dst = lhs @ rhs` |
| [pto.mad_acc](./ops/mad/mad-acc_zh.md) | 累加:`dst = dst + lhs @ rhs` |
| [pto.mad_bias](./ops/mad/mad-bias_zh.md) | 偏置初始化:`dst = lhs @ rhs + bias[n]` |
| [pto.mad_mx](./ops/mad/mad-mx_zh.md) | MX(微缩放)零初始化 matmul |
| [pto.mad_mx_acc](./ops/mad/mad-mx-acc_zh.md) | MX 累加 matmul |
| [pto.mad_mx_bias](./ops/mad/mad-mx-bias_zh.md) | MX 偏置初始化 matmul |

## Cube 数据搬运指令

这些指令在 GM、L1、L0A/L0B、L0C 之间搬运 tile,使用与 [标量 DMA Copy](../scalar/dma-copy_zh.md) 同样的内联 `nburst(...)` / `loop(...)` 子句模型。

### GM → L1

- [pto.mte_gm_l1](./ops/data-movement/mte-gm-l1_zh.md):直接 GM→L1 加载(不做布局变换)
- [pto.mte_gm_l1_frac](./ops/data-movement/mte-gm-l1-frac_zh.md):GM→L1 并完成 ND→NZ fractal 重排

### L1 ↔ UB

- [pto.mte_l1_ub](./ops/data-movement/mte-l1-ub_zh.md):L1→UB(cube→vector 数据通路)
- [pto.mte_ub_l1](../scalar/ops/dma-copy/mte-ub-l1_zh.md):UB→L1(vector→cube 数据通路;位于标量 DMA 节)

### L1 → L0A / L0B(cube 操作数加载)

- [pto.mte_l1_l0a](./ops/data-movement/mte-l1-l0a_zh.md):把 L1 NZ tile 加载到 L0A(左操作数)
- [pto.mte_l1_l0b](./ops/data-movement/mte-l1-l0b_zh.md):把 L1 NZ tile 加载到 L0B(右操作数,K-innermost 转置)
- [pto.mte_l1_l0a_mx](./ops/data-movement/mte-l1-l0a-mx_zh.md):为 L0A 加载 MX scale payload
- [pto.mte_l1_l0b_mx](./ops/data-movement/mte-l1-l0b-mx_zh.md):为 L0B 加载 MX scale payload

### L1 → BT(偏置)

- [pto.mte_l1_bt](./ops/data-movement/mte-l1-bt_zh.md):把 bias 向量加载到 BT,供 `pto.mad_bias` / `pto.mad_mx_bias` 消费
- [pto.mte_l1_fb](./ops/data-movement/mte-l1-fb_zh.md):加载 FIXPIPE 相关 payload(例如反量化参数)

### L0C 回写(FIXPIPE)

- [pto.mte_l0c_l1](./ops/data-movement/mte-l0c-l1_zh.md):FIXPIPE 回写 L0C → L1
- [pto.mte_l0c_gm](./ops/data-movement/mte-l0c-gm_zh.md):FIXPIPE 回写 L0C → GM
- [pto.mte_l0c_ub](./ops/data-movement/mte-l0c-ub_zh.md):FIXPIPE 回写 L0C → UB

## 完整 Cube 流水线

```text
GM (ND) L1/cbuf (NZ) L0A/B (NZ) L0C (NZ) GM (ND)

A[M,K] --mte_gm_l1_frac/mte_gm_l1--> K1 M1 M0 K0 --mte_l1_l0a--> K1 M1 M0 K0 -+
+-MAD-> N1 M1 M0 N0 --> C[M,N]
B[K,N] --mte_gm_l1_frac/mte_gm_l1--> K1 N1 K0 N0 --mte_l1_l0b--> K1 N1 N0 K0 -+
^
必要时由 mte_l1_l0b 进行转置
不在 GM→L1 阶段做
```

## 相关章节

- [Tile ISA:矩阵与矩阵-向量](../tile/matrix-and-matrix-vector_zh.md) — Tile 级矩阵指令
- [标量 DMA Copy](../scalar/dma-copy_zh.md) — UB 侧分组 DMA 传输
- [流水线同步](../scalar/ops/pipeline-sync/) — Cube / Vector 同步原语
67 changes: 67 additions & 0 deletions docs/isa/cube/buffer-hierarchy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# Cube Buffer Hierarchy

The cube core (AIC) operates on a dedicated buffer hierarchy distinct from the Unified Buffer (UB) that Vector blocks use. Cube operands move through `L1` (cbuf) → `L0A` / `L0B` → `L0C` → writeback, with optional `BT` (bias table) and `FB` (FIXPIPE buffer) helpers.

## Address Spaces

| Space | Role | Layout | Typical Producer | Typical Consumer |
|-------|------|--------|------------------|------------------|
| `gm` | Global Memory (off-chip HBM/DDR) | ND row-major | host / kernel | DMA loaders |
| `l1` | Cube CBUF, ~1 MB on-chip | NZ fractal | `pto.mte_gm_l1`, `pto.mte_gm_l1_frac`, `pto.mte_ub_l1` | `pto.mte_l1_l0a`, `pto.mte_l1_l0b`, `pto.mte_l1_ub`, `pto.mte_l1_bt`, `pto.mte_l1_fb` |
| `l0a` | Cube left-operand scratchpad | FRACTAL_NZ (A5) / FRACTAL_ZZ (A3) | `pto.mte_l1_l0a` | `pto.mad*` |
| `l0b` | Cube right-operand scratchpad | FRACTAL_ZN (K innermost) | `pto.mte_l1_l0b` | `pto.mad*` |
| `l0c` | Cube accumulator | FRACTAL_NZ output of MMAD | `pto.mad*` | FIXPIPE writeback (`pto.mte_l0c_*`) |
| `bt` | Bias Table | element-type-matched vector | `pto.mte_l1_bt` | `pto.mad_bias`, `pto.mad_mx_bias` |
| `fb` | FIXPIPE auxiliary buffer | implementation-defined | `pto.mte_l1_fb` | FIXPIPE writeback ops |
| `ub` | Vector Unified Buffer | ND | DMA loaders | vector pipe |

See [NZ Fractal Layout](./nz-fractal-layout.md) for the precise per-buffer NZ index orders.

## Data-Flow Contract

```text
+----------------- AIC issue queues -----------------+
| MTE2 MTE1 CUBE (MMAD) FIXP |
| | | | | |
GM (ND) --- pto.mte_gm_l1 / pto.mte_gm_l1_frac | |
| | |
v v |
L1 (NZ) <-- pto.mte_ub_l1 --- UB |
| |
+------+-----+---------------------+ |
| | | |
mte_l1_l0a mte_l1_l0b mte_l1_bt / mte_l1_fb |
| | | |
v v | |
L0A L0B | |
| | | |
+-----+------+ | |
| | |
| pto.mad / pto.mad_acc / pto.mad_bias / *_mx* |
| <----------------------+ |
v |
L0C |
| |
+-- pto.mte_l0c_l1 / pto.mte_l0c_gm / pto.mte_l0c_ub ---+
(FIXPIPE writeback)
```

## Alignment and Sizing Conventions

- All cube buffer pointers (L1 / L0A / L0B / L0C / BT / FB) are 32-byte aligned.
- L0A and L0B fractal tiles are 512B (one 32B-wide × 16-row block in the appropriate inner orientation).
- L0C accumulator tiles use the `N1 M1 M0 N0` order so that FIXPIPE can stream out one M-row of results at a time.
- Element-type-derived inner widths (`K0 = N0 = C0 / sizeof(T)`) follow [NZ Fractal Layout](./nz-fractal-layout.md).
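
The sizing rules above reduce to a few lines of arithmetic. The Python sketch below assumes `C0` is 32 bytes, which is inferred from the "32B-wide" fractal row in the list and is not stated explicitly there. The dtype labels and helper names are likewise illustrative.

```python
C0_BYTES = 32       # assumption: inner width in bytes, inferred from the 32B-wide block
FRACTAL_ROWS = 16   # rows per fractal block, from the 32B x 16-row description
ALIGN_BYTES = 32    # required pointer alignment for cube buffers

# Hypothetical dtype labels mapped to element sizes in bytes.
ELEM_BYTES = {"i8": 1, "f16": 2, "bf16": 2, "f32": 4}


def inner_width(dtype: str) -> int:
    """K0 = N0 = C0 / sizeof(T), in elements."""
    return C0_BYTES // ELEM_BYTES[dtype]


def fractal_bytes() -> int:
    """Size of one L0A / L0B fractal tile in bytes."""
    return C0_BYTES * FRACTAL_ROWS


def is_aligned(addr: int) -> bool:
    """True if a buffer address satisfies the 32-byte alignment rule."""
    return addr % ALIGN_BYTES == 0


widths = {dtype: inner_width(dtype) for dtype in ELEM_BYTES}
print(widths, fractal_bytes(), is_aligned(0x40), is_aligned(0x48))
```

With these assumptions, a narrower element type packs more elements per fractal row while the fractal's byte size stays fixed at 512B.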

## Synchronization

Cube programs are issued from the AIC's Scalar Unit (SU) into the MTE2 / MTE1 / CUBE / FIXP issue queues. Synchronization with the Vector blocks goes through the System Controller (SC) semaphores and the dedicated 1:2 FIXPIPE broadcast path. See:

- [Pipeline Synchronization](../scalar/ops/pipeline-sync/) for the intra-block primitives (`pto.set_flag` / `pto.wait_flag`) that order MTE2 → MTE1 → CUBE → FIXP within the AIC.
- [Cluster Programming Model](../machine-model/execution-agents.md) for the inter-block primitives (`pto.set_intra_block` / `pto.wait_intra_core`) used between AIC and AIV.

## Related Sections

- [NZ Fractal Layout](./nz-fractal-layout.md)
- [FIXPIPE Model](./fixpipe-model.md)
- [Cube Data Movement Ops](./README.md#cube-data-movement-ops)
67 changes: 67 additions & 0 deletions docs/isa/cube/buffer-hierarchy_zh.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# Cube 缓冲层级

Cube core(AIC)操作的是一个独立于 Vector 块 UB 的专用缓冲层级。Cube 操作数依次经过 `L1`(cbuf)→ `L0A` / `L0B` → `L0C` → 回写,可选辅助缓冲为 `BT`(bias table)与 `FB`(FIXPIPE buffer)。

## 地址空间

| 空间 | 角色 | 布局 | 典型生产者 | 典型消费者 |
|------|------|------|------------|------------|
| `gm` | Global Memory(片外 HBM/DDR) | ND 行优先 | host / kernel | DMA 加载器 |
| `l1` | Cube CBUF,片上约 1 MB | NZ fractal | `pto.mte_gm_l1`、`pto.mte_gm_l1_frac`、`pto.mte_ub_l1` | `pto.mte_l1_l0a`、`pto.mte_l1_l0b`、`pto.mte_l1_ub`、`pto.mte_l1_bt` |
| `l0a` | Cube 左操作数暂存区 | FRACTAL_NZ(A5)/ FRACTAL_ZZ(A3) | `pto.mte_l1_l0a` | `pto.mad*` |
| `l0b` | Cube 右操作数暂存区 | FRACTAL_ZN(K 最内) | `pto.mte_l1_l0b` | `pto.mad*` |
| `l0c` | Cube 累加器 | MMAD 输出的 FRACTAL_NZ | `pto.mad*` | FIXPIPE 回写(`pto.mte_l0c_*`) |
| `bt` | Bias Table | 与元素类型匹配的向量 | `pto.mte_l1_bt` | `pto.mad_bias`、`pto.mad_mx_bias` |
| `fb` | FIXPIPE 辅助缓冲 | 实现相关 | `pto.mte_l1_fb` | FIXPIPE 回写指令 |
| `ub` | Vector Unified Buffer | ND | DMA 加载器 | vector 流水线 |

各缓冲的精确 NZ 索引顺序见 [NZ Fractal 布局](./nz-fractal-layout_zh.md)。

## 数据流契约

```text
+----------------- AIC 发射队列 -----------------+
| MTE2 MTE1 CUBE (MMAD) FIXP |
| | | | | |
GM (ND) --- pto.mte_gm_l1 / pto.mte_gm_l1_frac | |
| | |
v v |
L1 (NZ) <-- pto.mte_ub_l1 --- UB |
| |
+------+-----+---------------------+ |
| | | |
mte_l1_l0a mte_l1_l0b mte_l1_bt / mte_l1_fb |
| | | |
v v | |
L0A L0B | |
| | | |
+-----+------+ | |
| | |
| pto.mad / pto.mad_acc / pto.mad_bias / *_mx* |
| <----------------------+ |
v |
L0C |
| |
+-- pto.mte_l0c_l1 / pto.mte_l0c_gm / pto.mte_l0c_ub +
(FIXPIPE 回写)
```

## 对齐与尺寸约定

- 所有 cube 缓冲指针(L1 / L0A / L0B / L0C / BT / FB)都要求 32 字节对齐。
- L0A 与 L0B 的 fractal tile 是 512B(一个 32B 宽 × 16 行的 block,按相应的内层朝向)。
- L0C 累加器 tile 使用 `N1 M1 M0 N0` 顺序,方便 FIXPIPE 每次流式输出一行 M 维结果。
- 按元素类型派生的内层宽度(`K0 = N0 = C0 / sizeof(T)`)遵循 [NZ Fractal 布局](./nz-fractal-layout_zh.md)。

## 同步

Cube 程序由 AIC 的 Scalar Unit(SU)发射到 MTE2 / MTE1 / CUBE / FIXP 各自的发射队列。与 Vector 块的同步通过 System Controller(SC)的信号量、以及专用 1:2 fixpipe 广播路径来实现。详见:

- [流水线同步](../scalar/ops/pipeline-sync/):用于在 AIC 内对 MTE2 → MTE1 → CUBE → FIXP 排序的 `pto.set_flag` / `pto.wait_flag` 原语。
- [Cluster 编程模型](../machine-model/execution-agents_zh.md):AIC 与 AIV 之间使用的跨块原语(`pto.set_intra_block` / `pto.wait_intra_core`)。

## 相关章节

- [NZ Fractal 布局](./nz-fractal-layout_zh.md)
- [FIXPIPE 模型](./fixpipe-model_zh.md)
- [Cube 数据搬运指令](./README_zh.md#cube-数据搬运指令)