hw-native-sys · ivanmang · Apr 30, 2026 · Apr 30, 2026 · May 7, 2026 · May 8, 2026
diff --git a/docs/isa/comm/TGATHER.md b/docs/isa/comm/TGATHER.md
@@ -1,22 +1,24 @@
 # pto.tgather
 
-## Introduction
+`pto.tgather` is part of the [Communication](./README.md) instruction set.
+
+## Summary
 
 Gather operation: the calling NPU (root) collects data from all ranks in the parallel group and concatenates the results along **DIM_3** (row dimension) into a local output buffer.
 
 Only the root needs to execute `pto.tgather`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `pto.tgather` on non-root ranks is undefined behavior.
 
 **Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding — the same mechanism used by other PTO-COMM instructions.
 
-## Math Interpretation
+## Mechanism
 
 Each rank $r$ has source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3:
 
 $$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
 
 The destination tensor has shape $(D_0, D_1, D_2, N \times H, W)$.
 
-## Assembly Syntax
+## Syntax
 
 PTO-AS form: see [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md).
 
@@ -44,6 +46,30 @@ PTO_INST RecordEvent GATHER(ParallelGroupType &parallelGroup, GlobalDstData &dst
                             TileData &pingTile, TileData &pongTile, WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Description |
+|---------|-------------|
+| `parallelGroup` | A `ParallelGroup<GPerRank>` enumerating each rank's source buffer; root identified via `GetRootIdx()`. |
+| `dstGlobalData` | Destination GlobalTensor on the root NPU; receives the concatenated result. |
+| `stagingTileData` | UB staging tile used as the GM→UB→GM relay buffer (single-buffer form). |
+| `pingTile`, `pongTile` | Two UB staging tiles for double-buffered (ping-pong) form, enabling MTE2/MTE3 overlap. |
+| `events...` | Optional `RecordEvent` tokens to wait on before issuing. |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | token | Signals completion of the gather across all participating ranks. |
+| `dstGlobalData` | GlobalTensor | First `N × H` rows along DIM_3 hold the concatenated rank data; rows beyond `N × H` (if `dstGlobalData.GetShape(DIM_3) > N × H`) are left unchanged. |
+
+## Side Effects
+
+- Issues remote MTE2 reads from each rank's source buffer and MTE3 writes into the local destination.
+- Cross-core synchronization flags are toggled as part of the rank-fan-in protocol.
+- Non-root ranks must keep their source buffers live until the root signals completion; otherwise behavior is undefined.
+- No implicit fence on unrelated tile traffic.
+
 ## Constraints
 
 !!! warning "Constraints"
@@ -62,6 +88,15 @@ PTO_INST RecordEvent GATHER(ParallelGroupType &parallelGroup, GlobalDstData &dst
         - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's source must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
         - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
 
+## Exceptions
+
+!!! danger "Exceptions"
+    - Calling `pto.tgather` on a non-root NPU is undefined behavior.
+    - Mismatched per-rank tensor shapes / strides yield undefined behavior; no runtime check is guaranteed.
+    - A `dstGlobalData` with `GetShape(DIM_3) < N × H` is rejected by the verifier when shapes are static; with dynamic shapes the call writes only the rows that fit and silently truncates the rest.
+    - A type mismatch between `ParallelGroup::value_type::RawDType` and `TileData::DType` / `GlobalDstData::RawDType` is rejected at compile time via `static_assert`.
+    - Programs must not rely on behavior outside the documented legal domain of this operation.
+
 ## Examples
 
 ### Basic Gather (Single Staging Tile)
@@ -125,3 +160,10 @@ void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_ran
     comm::GATHER(group, dstG, pingTile, pongTile);
 }
 ```
+
+## See Also
+
+- Instruction set overview: [Communication](./README.md)
+- Inverse op: [pto.tscatter](./TSCATTER.md)
+- Related collective ops: [pto.treduce](./TREDUCE.md), [pto.tbroadcast](./TBROADCAST.md)
+- One-sided variants: [pto.tput](./TPUT.md), [pto.tget](./TGET.md)
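The DIM_3 concatenation rule in the TGATHER.md Mechanism section can be modelled on the host as follows. This is a minimal sketch, not the device implementation: `gather_reference` is a hypothetical helper, plain row-major `std::vector<int>` buffers stand in for GlobalTensors, and the leading dimensions $(D_0, D_1, D_2)$ are assumed to be folded into a single `outer` extent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model only. gather_reference is a hypothetical name,
// not part of the PTO API. src[r] is rank r's row-major buffer of shape
// (outer, H, W); the result has shape (outer, N*H, W).
std::vector<int> gather_reference(const std::vector<std::vector<int>>& src,
                                  std::size_t outer, std::size_t H, std::size_t W) {
    const std::size_t N = src.size();
    std::vector<int> dst(outer * N * H * W);
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t d = 0; d < outer; ++d)
            for (std::size_t i = 0; i < H; ++i)
                for (std::size_t j = 0; j < W; ++j)
                    // dst[d, r*H + i, j] = src^(r)[d, i, j]
                    dst[(d * N * H + r * H + i) * W + j] =
                        src[r][(d * H + i) * W + j];
    return dst;
}
```

With two ranks, `outer = 2`, `H = 1`, `W = 2`, each outer slice of the result holds rank 0's row followed by rank 1's row, which is what distinguishes concatenation along DIM_3 from simply appending whole rank buffers.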
diff --git a/docs/isa/comm/TSCATTER.md b/docs/isa/comm/TSCATTER.md
@@ -1,20 +1,22 @@
 # pto.tscatter
 
-## Introduction
+`pto.tscatter` is part of the [Communication](./README.md) instruction set.
+
+## Summary
 
 Scatter operation: the calling NPU (root) distributes data to all ranks in the parallel group by splitting the local source tensor along **DIM_3** (row dimension). This is the inverse of `pto.tgather`.
 
 Only the root needs to execute `pto.tscatter`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `pto.tscatter` on non-root ranks is undefined behavior.
 
 **Large Tile Support**: When the per-rank data exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.
 
-## Math Interpretation
+## Mechanism
 
 The local source tensor has shape $(D_0, D_1, D_2, N \times H, W)$, where $N$ is the number of ranks and each rank receives $H$ rows. After the operation:
 
 $$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
 
-## Assembly Syntax
+## Syntax
 
 PTO-AS form: see [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md).
 
@@ -42,6 +44,30 @@ PTO_INST RecordEvent SCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &sr
                              TileData &pingTile, TileData &pongTile, WaitEvents&... events);
 ```
 
+## Inputs
+
+| Operand | Description |
+|---------|-------------|
+| `parallelGroup` | A `ParallelGroup<GPerRank>` enumerating each rank's destination buffer; root identified via `GetRootIdx()`. |
+| `srcGlobalData` | Source GlobalTensor on the root NPU; concatenation of per-rank slices along DIM_3. |
+| `stagingTileData` | UB staging tile used as the GM→UB→GM relay buffer (single-buffer form). |
+| `pingTile`, `pongTile` | Two UB staging tiles for double-buffered (ping-pong) form, enabling MTE2/MTE3 overlap. |
+| `events...` | Optional `RecordEvent` tokens to wait on before issuing. |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | token | Signals completion of the scatter to all participating ranks. |
+| `parallelGroup.tensors[r]` | GlobalTensor (remote) | Each rank `r`'s destination receives `src[r*H : (r+1)*H, :]` of the source. |
+
+## Side Effects
+
+- Issues local MTE2 reads from `srcGlobalData` and remote MTE3 writes to each rank's destination buffer.
+- Cross-core synchronization flags are toggled as part of the rank-fan-out protocol.
+- Non-root ranks must keep their destination buffers writable until the root signals completion; otherwise behavior is undefined.
+- No implicit fence on unrelated tile traffic.
+
 ## Constraints
 
 !!! warning "Constraints"
@@ -60,6 +86,15 @@ PTO_INST RecordEvent SCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &sr
         - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's destination must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
         - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.
 
+## Exceptions
+
+!!! danger "Exceptions"
+    - Calling `pto.tscatter` on a non-root NPU is undefined behavior.
+    - Mismatched per-rank tensor shapes / strides yield undefined behavior; no runtime check is guaranteed.
+    - A `srcGlobalData` with `GetShape(DIM_3) < N × H` is rejected by the verifier when shapes are static; with dynamic shapes the call reads only the rows that exist, so the trailing ranks receive partial data.
+    - A type mismatch between `ParallelGroup::value_type::RawDType` and `TileData::DType` / `GlobalSrcData::RawDType` is rejected at compile time via `static_assert`.
+    - Programs must not rely on behavior outside the documented legal domain of this operation.
+
 ## Examples
 
 ### Basic Scatter (Single Staging Tile)
@@ -123,3 +158,10 @@ void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int m
     comm::SCATTER(group, srcG, pingTile, pongTile);
 }
 ```
+
+## See Also
+
+- Instruction set overview: [Communication](./README.md)
+- Inverse op: [pto.tgather](./TGATHER.md)
+- Related collective ops: [pto.treduce](./TREDUCE.md), [pto.tbroadcast](./TBROADCAST.md)
+- One-sided variants: [pto.tput](./TPUT.md), [pto.tget](./TGET.md)
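The split rule in the TSCATTER.md Mechanism section is the mirror image of the gather mapping and can be checked on the host in the same way. This is a sketch under stated assumptions: `scatter_reference` is a hypothetical helper, row-major `std::vector<int>` buffers stand in for GlobalTensors, and $(D_0, D_1, D_2)$ are folded into one `outer` extent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model only. scatter_reference is a hypothetical name,
// not part of the PTO API. src is row-major with shape (outer, N*H, W);
// the result holds one (outer, H, W) buffer per rank.
std::vector<std::vector<int>> scatter_reference(const std::vector<int>& src,
                                                std::size_t N, std::size_t outer,
                                                std::size_t H, std::size_t W) {
    std::vector<std::vector<int>> dst(N, std::vector<int>(outer * H * W));
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t d = 0; d < outer; ++d)
            for (std::size_t i = 0; i < H; ++i)
                for (std::size_t j = 0; j < W; ++j)
                    // dst^(r)[d, i, j] = src[d, r*H + i, j]
                    dst[r][(d * H + i) * W + j] =
                        src[(d * N * H + r * H + i) * W + j];
    return dst;
}
```

Feeding the output of a gather model into this function with the same `N`, `outer`, `H`, and `W` should return the original per-rank buffers, which is a cheap way to sanity-check shape arguments before running on device.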
diff --git a/docs/isa/system/ops/TFREE.md b/docs/isa/system/ops/TFREE.md
@@ -4,38 +4,92 @@
 
 ![TFREE tile operation](../../../figures/isa/TFREE.svg)
 
-## Introduction
+## Summary
 
-Release the currently held pipe or FIFO slot back to the producer.
+`pto.tfree` releases the consumer-side reservation on a tile pipe previously acquired by `pto.tpop`, returning the slot to the producer side of the FIFO. It is the primary reclaim hook in the system-side three-phase tile-pipe protocol and is paired with `pto.tpop` on the consumer side.
 
-## Math Interpretation
+## Mechanism
 
-Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
+`pto.tfree` performs a single release transaction against the tile-pipe metadata:
 
-## Assembly Syntax
+1. The current consumer signals it is done with the slot it last popped from `pipe`.
+2. The slot's reference count is decremented; when it drops to zero, the slot is returned to the producer pool.
+3. Any waiter blocked on `pto.tpush` for an empty slot is unblocked.
+
+The op executes on the system pipe and does not move tile data; it only updates pipe-control state. The slot identity is implicit in `pipe` — there is no tile handle operand.
+
+## Syntax
 
 Textual spelling is defined by the PTO ISA syntax-and-operands pages.
 
 ### IR Level 1 (SSA)
 
 ```text
-%dst = pto.tfree ...
+%event = pto.tfree %pipe : (!pto.pipe<...>) -> !pto.record_event
 ```
 
 ### IR Level 2 (DPS)
 
 ```text
-pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>)
+pto.tfree ins(%pipe) outs(%event : !pto.record_event)
 ```
+
 ## C++ Intrinsic
 
-Declared in `include/pto/common/pto_instr.hpp`.
+Declared in `include/pto/common/pto_instr.hpp`:
+
+```cpp
+template <typename Pipe, typename... WaitEvents>
+PTO_INST RecordEvent TFREE(Pipe &pipe, WaitEvents&... events);
+```
+
+## Inputs
+
+| Operand | Description |
+|---------|-------------|
+| `pipe` | The tile pipe (`TMPipe` or compatible) whose most recently popped slot is being released. |
+| `events...` | Optional `RecordEvent` tokens to wait on before issuing. |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | token | Signals that the slot has been released and is reusable by the producer. |
+
+## Side Effects
+
+- Updates pipe-control metadata in place; no tile data is moved.
+- May unblock a producer waiting on `pto.tpush`.
+- After `pto.tfree`, the tile previously returned by `pto.tpop` on the same pipe must not be read or written by the consumer.
 
 ## Constraints
 
 !!! warning "Constraints"
-    Refer to backend-specific legality checks for data type/layout/location/shape constraints.
+    - Each `pto.tpop` on a pipe must be matched by exactly one `pto.tfree` on the same pipe; double-free is undefined.
+    - `pto.tfree` must not be issued from the producer side; it is consumer-side only.
+    - Refer to backend-specific legality checks for data type/layout/location/shape constraints not covered above.
+
+## Exceptions
+
+!!! danger "Exceptions"
+    - Issuing `pto.tfree` on a pipe with no outstanding `pto.tpop` is undefined behavior.
+    - Calling `pto.tfree` more times than `pto.tpop` on the same pipe is undefined behavior.
+    - Calling `pto.tfree` from the producer side is rejected by the verifier.
+    - Programs must not rely on behavior outside the documented legal domain of this operation.
 
 ## Examples
 
+```cpp
+// Consumer side of a TMPipe loop.
+TPOP(pipe, tile);  // tile: consumer-side destination tile declared by the caller
+// ... compute on tile ...
+TFREE(pipe);
+```
+
 See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.
+
+## See Also
+
+- Instruction set overview: [System Ops](../README.md)
+- Producer side: [TPUSH](./TPUSH.md)
+- Consumer-acquire side: [TPOP](./TPOP.md)
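The slot accounting described in the TFREE.md Mechanism section can be illustrated with a single-threaded toy model. Everything here is an assumption made for illustration: `ToyPipe` and its members are hypothetical names, the real pipe uses hardware flags rather than counters, and the model assumes a single consumer so each popped slot has exactly one outstanding reference.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model of slot accounting only; no tile data moves and nothing blocks.
// ToyPipe is a hypothetical type, not the PTO TMPipe.
struct ToyPipe {
    std::size_t free_slots;   // slots the producer may still fill
    std::deque<int> ready;    // pushed payloads not yet popped
    std::size_t held = 0;     // slots popped by the consumer, not yet freed

    explicit ToyPipe(std::size_t slot_num) : free_slots(slot_num) {}

    // Producer side: returns false where a real push would wait for a slot.
    bool try_push(int payload) {
        if (free_slots == 0) return false;
        --free_slots;
        ready.push_back(payload);
        return true;
    }

    // Consumer acquire: returns false where a real pop would wait for data.
    bool try_pop(int& out) {
        if (ready.empty()) return false;
        out = ready.front();
        ready.pop_front();
        ++held;  // the slot stays reserved until free_slot() is called
        return true;
    }

    // Consumer release: hands the slot back to the producer pool.
    void free_slot() {
        assert(held > 0 && "free without an outstanding pop");
        --held;
        ++free_slots;
    }
};
```

In this model a one-slot pipe refuses a second push until the consumer has both popped and freed the first payload, which mirrors the rule that each pop must be matched by exactly one free.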
diff --git a/docs/isa/system/ops/TPOP.md b/docs/isa/system/ops/TPOP.md
@@ -6,6 +6,16 @@
 
 `TPOP` retrieves a tile from a ring FIFO into a consumer pipeline (Vector or Cube). It is the consumer half of the TPipe/TMPipe producer-consumer protocol, paired with [`TPUSH`](./TPUSH.md).
 
+## Mechanism
+
+For every `TPOP` call:
+
+1. The consumer **waits** on the producer's data-ready flag.
+2. It then **pops** the tile data from the FIFO through one of three paths: a `TLOAD` from a GM slot buffer, a `TASSIGN` against a local UB/MAT buffer, or a 32-bit control-signal read for `V2C_CTRL`.
+3. Finally it **frees** the slot by signalling the producer with the matching release flag.
+
+The op does not run on the vector or cube datapaths: data movement (when present) is dispatched on MTE1/MTE2, while flag wait/set runs on the system pipe (PIPE_S / PIPE_FIX / PIPE_MTE2). See [Three-Phase Protocol](#three-phase-protocol) for the per-backend signal table.
+
 ## What TPOP Is Not
 
 `TPOP` is **not** a scalar stack pop or a generic FIFO dequeue. It is a structured tile-movement protocol for Cube-Vector tile passing. It is not available on the CPU simulator.
@@ -140,6 +150,30 @@ void consumer_mat(MatTile& matTile) {
 }
 ```
 
+## Inputs
+
+| Operand | Description |
+|---------|-------------|
+| `pipe` | The `TPipe` / `TMPipe` instance shared with the producer. Carries `FlagID`, `DirType`, slot pointers, and consumer-local buffer addresses. |
+| `tile` | The destination tile (Vec or Mat) into which the popped data is materialised. For `*_UB` / `*_MAT` paths, the tile binds to the consumer-local FIFO buffer; for GM paths it is a regular UB/MAT tile filled by `TLOAD`. |
+| `TileSplitAxis` (template) | Optional split mode (`TILE_NO_SPLIT`, `TILE_UP_DOWN`, `TILE_LEFT_RIGHT`) used to compute the per-subblock GM offset. Must match the producer. |
+
+## Expected Outputs
+
+| Result | Type | Description |
+|--------|------|-------------|
+| `RecordEvent` | token | Signals completion of the wait + load + free sequence. |
+| `tile` | tile | Holds the popped tile after this op completes. For `V2C_CTRL`, the 32-bit control signal is stored in `pipe.cons.fifo.ctrlSignal` instead. |
+| `pipe.cons` | state | Slot index advances by one (`tileIndex % SlotNum`); the released slot becomes available to the producer's next `TPUSH`. |
+
+## Side Effects
+
+- Issues a cross-core / intra-block flag wait (blocks until the producer signals).
+- Writes a release flag back to the producer; this can unblock a producer waiting in its allocation phase.
+- For GM paths: issues an MTE1/MTE2 load.
+- For local-buffer paths (`C2V_UB`, `V2C_MAT`): no DMA, only a `TASSIGN` rebind.
+- Does **not** implicitly fence unrelated tile traffic. Callers must use explicit events for cross-pipe ordering.
+
 ## Constraints
 
 !!! warning "Constraints"
@@ -158,7 +192,18 @@ void consumer_mat(MatTile& matTile) {
     - **A2/A3**: `DIR_C2V`, `DIR_V2C`, `DIR_BOTH`, `DIR_V2C_CTRL`. Synchronization via `wait_flag_dev` and `ffts_cross_core_sync`.
     - **A5**: All direction types. Synchronization via `wait_intra_block` and `set_intra_block`. Additional `*_GM` paths with GM load. Sub-block support (`FlagID + 16`) for 2-Vec-core configurations.
 
-## Common Patterns
+## Exceptions
+
+!!! danger "Exceptions"
+    - Calling `TPOP` without a matching prior `TPUSH` causes the consumer to wait indefinitely (deadlock).
+    - Calling `TPOP` on the CPU simulator is rejected: the op requires NPU inter-core synchronization infrastructure.
+    - `TileSplitAxis` mismatched with the producer's split mode produces undefined data (consumer reads from the wrong GM offset).
+    - `FlagID` reuse with another concurrently-active synchronization op is undefined behavior.
+    - Setting `isWait = false` when the producer has not yet recorded the slot causes the consumer to read stale or partially written data.
+    - Setting `isFree = false` for too many iterations causes the producer to stall in its allocate phase (no deadlock, but throughput collapses).
+    - Programs must not rely on behavior outside the documented legal domain of this operation.
+
+## Examples
 
 ### Pattern 1: Consuming Accumulator Tile (GEMM Post-Processing)