48 changes: 45 additions & 3 deletions docs/isa/comm/TGATHER.md
@@ -1,22 +1,24 @@
# pto.tgather

## Introduction
`pto.tgather` is part of the [Communication](./README.md) instruction set.

## Summary

Gather operation: the calling NPU (root) collects data from all ranks in the parallel group and concatenates the results along **DIM_3** (row dimension) into a local output buffer.

Only the root needs to execute `pto.tgather`. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling `pto.tgather` on non-root ranks is undefined behavior.

**Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding — the same mechanism used by other PTO-COMM instructions.

## Math Interpretation
## Mechanism

Each rank $r$ has source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3:

$$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, d_1, d_2,\; i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$

The destination tensor has shape $(D_0, D_1, D_2, N \times H, W)$.
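
The index mapping above can be checked against a small host-side reference model. The sketch below is not the PTO API: `gather_reference` is a hypothetical helper, and it assumes dense row-major buffers with the leading dimensions $D_0 \times D_1 \times D_2$ flattened into a single `outer` extent.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference model of the gather index mapping (hypothetical helper,
// not part of the PTO API). Each rank's source is a dense row-major array of
// shape (outer, H, W) with outer = D0 * D1 * D2. The result has shape
// (outer, N * H, W), where N is the number of ranks.
std::vector<float> gather_reference(const std::vector<std::vector<float>>& src_per_rank,
                                    std::size_t outer, std::size_t H, std::size_t W) {
    const std::size_t N = src_per_rank.size();
    std::vector<float> dst(outer * N * H * W, 0.0f);
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t o = 0; o < outer; ++o) {
            for (std::size_t i = 0; i < H; ++i) {
                for (std::size_t j = 0; j < W; ++j) {
                    // dst[o, r*H + i, j] = src^(r)[o, i, j]
                    dst[(o * N * H + r * H + i) * W + j] =
                        src_per_rank[r][(o * H + i) * W + j];
                }
            }
        }
    }
    return dst;
}
```

A model like this is mainly useful as a golden reference when validating the output of a device-side gather in a test harness.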

## Assembly Syntax
## Syntax

PTO-AS form: see [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md).

@@ -44,6 +46,30 @@ PTO_INST RecordEvent GATHER(ParallelGroupType &parallelGroup, GlobalDstData &dst
TileData &pingTile, TileData &pongTile, WaitEvents&... events);
```

## Inputs

| Operand | Description |
|---------|-------------|
| `parallelGroup` | A `ParallelGroup<GPerRank>` enumerating each rank's source buffer; root identified via `GetRootIdx()`. |
| `dstGlobalData` | Destination GlobalTensor on the root NPU; receives the concatenated result. |
| `stagingTileData` | UB staging tile used as the GM→UB→GM relay buffer (single-buffer form). |
| `pingTile`, `pongTile` | Two UB staging tiles for double-buffered (ping-pong) form, enabling MTE2/MTE3 overlap. |
| `events...` | Optional `RecordEvent` tokens to wait on before issuing. |

## Expected Outputs

| Result | Type | Description |
|--------|------|-------------|
| `RecordEvent` | token | Signals completion of the gather across all participating ranks. |
| `dstGlobalData` | GlobalTensor | First `N × H` rows along DIM_3 hold the concatenated rank data; rows beyond `N × H` (if `dstGlobalData.GetShape(DIM_3) > N × H`) are left unchanged. |

## Side Effects

- Issues remote MTE2 reads from each rank's source buffer and MTE3 writes into the local destination.
- Cross-core synchronisation flags are toggled as part of the rank-fan-in protocol.
- Non-root ranks must keep their source buffers live until the root signals completion; otherwise behavior is undefined.
- No implicit fence on unrelated tile traffic.

## Constraints

!!! warning "Constraints"
@@ -62,6 +88,15 @@ PTO_INST RecordEvent GATHER(ParallelGroupType &parallelGroup, GlobalDstData &dst
- If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's source must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
- If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.

## Exceptions

!!! danger "Exceptions"
- Calling `pto.tgather` on a non-root NPU is undefined behavior.
- Mismatched per-rank tensor shapes / strides yield undefined behavior; no runtime check is guaranteed.
- Using a `dstGlobalData` shape with `GetShape(DIM_3) < N × H` is rejected by the verifier on static shapes; on dynamic shapes the call writes only what fits and silently truncates.
- A type mismatch between `ParallelGroup::value_type::RawDType` and `TileData::DType` / `GlobalDstData::RawDType` is rejected at compile time via `static_assert`.
- Programs must not rely on behavior outside the documented legal domain of this operation.

## Examples

### Basic Gather (Single Staging Tile)
@@ -125,3 +160,10 @@ void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_ran
comm::GATHER(group, dstG, pingTile, pongTile);
}
```

## See Also

- Instruction set overview: [Communication](./README.md)
- Inverse op: [pto.tscatter](./TSCATTER.md)
- Related collective ops: [pto.treduce](./TREDUCE.md), [pto.tbroadcast](./TBROADCAST.md)
- One-sided variants: [pto.tput](./TPUT.md), [pto.tget](./TGET.md)
48 changes: 45 additions & 3 deletions docs/isa/comm/TSCATTER.md
@@ -1,20 +1,22 @@
# pto.tscatter

## Introduction
`pto.tscatter` is part of the [Communication](./README.md) instruction set.

## Summary

Scatter operation: the calling NPU (root) distributes data to all ranks in the parallel group by splitting the local source tensor along **DIM_3** (row dimension). This is the inverse of `pto.tgather`.

Only the root needs to execute `pto.tscatter`. Non-root ranks only need to ensure their destination buffers are allocated and writable for the duration of the operation. Calling `pto.tscatter` on non-root ranks is undefined behavior.

**Large Tile Support**: When the per-rank data exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding.

## Math Interpretation
## Mechanism

The local source tensor has shape $(D_0, D_1, D_2, N \times H, W)$, where $N$ is the number of ranks and each rank receives $H$ rows. After the operation:

$$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$
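
As with gather, the mapping can be expressed as a small host-side reference model. This is a sketch only: `scatter_reference` is a hypothetical helper rather than a PTO call, and it assumes dense row-major buffers with $D_0 \times D_1 \times D_2$ flattened into `outer`.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference model of the scatter index mapping (hypothetical helper,
// not part of the PTO API). The source is a dense row-major array of shape
// (outer, N * H, W); each of the N ranks receives a dense (outer, H, W) slice.
std::vector<std::vector<float>> scatter_reference(const std::vector<float>& src,
                                                  std::size_t N, std::size_t outer,
                                                  std::size_t H, std::size_t W) {
    std::vector<std::vector<float>> dst(N, std::vector<float>(outer * H * W, 0.0f));
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t o = 0; o < outer; ++o) {
            for (std::size_t i = 0; i < H; ++i) {
                for (std::size_t j = 0; j < W; ++j) {
                    // dst^(r)[o, i, j] = src[o, r*H + i, j]
                    dst[r][(o * H + i) * W + j] =
                        src[(o * N * H + r * H + i) * W + j];
                }
            }
        }
    }
    return dst;
}
```

Applying this model to the output of the corresponding gather model returns the original per-rank buffers, which is a convenient round-trip check.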

## Assembly Syntax
## Syntax

PTO-AS form: see [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md).

@@ -42,6 +44,30 @@ PTO_INST RecordEvent SCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &sr
TileData &pingTile, TileData &pongTile, WaitEvents&... events);
```

## Inputs

| Operand | Description |
|---------|-------------|
| `parallelGroup` | A `ParallelGroup<GPerRank>` enumerating each rank's destination buffer; root identified via `GetRootIdx()`. |
| `srcGlobalData` | Source GlobalTensor on the root NPU; concatenation of per-rank slices along DIM_3. |
| `stagingTileData` | UB staging tile used as the GM→UB→GM relay buffer (single-buffer form). |
| `pingTile`, `pongTile` | Two UB staging tiles for double-buffered (ping-pong) form, enabling MTE2/MTE3 overlap. |
| `events...` | Optional `RecordEvent` tokens to wait on before issuing. |

## Expected Outputs

| Result | Type | Description |
|--------|------|-------------|
| `RecordEvent` | token | Signals completion of the scatter to all participating ranks. |
| `parallelGroup.tensors[r]` | GlobalTensor (remote) | Each rank `r`'s destination receives `src[r*H : (r+1)*H, :]` of the source. |

## Side Effects

- Issues local MTE2 reads from `srcGlobalData` and remote MTE3 writes to each rank's destination buffer.
- Cross-core synchronisation flags are toggled as part of the rank-fan-out protocol.
- Non-root ranks must keep their destination buffers writable until the root signals completion; otherwise behavior is undefined.
- No implicit fence on unrelated tile traffic.

## Constraints

!!! warning "Constraints"
@@ -60,6 +86,15 @@ PTO_INST RecordEvent SCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &sr
- If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's destination must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support.
- If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support.

## Exceptions

!!! danger "Exceptions"
- Calling `pto.tscatter` on a non-root NPU is undefined behavior.
- Mismatched per-rank tensor shapes / strides yield undefined behavior; no runtime check is guaranteed.
- Using a `srcGlobalData` shape with `GetShape(DIM_3) < N × H` is rejected by the verifier on static shapes; on dynamic shapes the call reads only what fits and remaining ranks receive partial data.
- A type mismatch between `ParallelGroup::value_type::RawDType` and `TileData::DType` / `GlobalSrcData::RawDType` is rejected at compile time via `static_assert`.
- Programs must not rely on behavior outside the documented legal domain of this operation.

## Examples

### Basic Scatter (Single Staging Tile)
@@ -123,3 +158,10 @@ void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int m
comm::SCATTER(group, srcG, pingTile, pongTile);
}
```

## See Also

- Instruction set overview: [Communication](./README.md)
- Inverse op: [pto.tgather](./TGATHER.md)
- Related collective ops: [pto.treduce](./TREDUCE.md), [pto.tbroadcast](./TBROADCAST.md)
- One-sided variants: [pto.tput](./TPUT.md), [pto.tget](./TGET.md)
72 changes: 63 additions & 9 deletions docs/isa/system/ops/TFREE.md
@@ -4,38 +4,92 @@

![TFREE tile operation](../../../figures/isa/TFREE.svg)

## Introduction
## Summary

Release the currently held pipe or FIFO slot back to the producer.
`pto.tfree` releases the consumer-side reservation on a tile pipe previously acquired by `pto.tpop`, returning the slot to the producer side of the FIFO. It is the primary reclaim hook in the system-side three-phase tile-pipe protocol and is paired with `pto.tpop` on the consumer side.

## Math Interpretation
## Mechanism

Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region.
`pto.tfree` performs a single release transaction against the tile-pipe metadata:

## Assembly Syntax
1. The current consumer signals it is done with the slot it last popped from `pipe`.
2. The slot's reference count is decremented; when it drops to zero, the slot is returned to the producer pool.
3. Any waiter blocked on `pto.tpush` for an empty slot is unblocked.

The op executes on the system pipe and does not move tile data; it only updates pipe-control state. The slot identity is implicit in `pipe` — there is no tile handle operand.
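
The slot accounting described above can be illustrated with a toy, single-threaded host model. Everything here is hypothetical (`ToyPipe`, `try_push`, `try_pop`, `free_slot` are not PTO names), each slot is assumed to have a single consumer so the reference count is at most one, and calls that would block or be undefined on hardware simply return `false`.

```cpp
#include <cstddef>

// Toy model of tile-pipe slot accounting (hypothetical, host-side only).
// No tile data is modelled; only the control state that a release updates.
class ToyPipe {
public:
    explicit ToyPipe(std::size_t slots) : free_(slots), ready_(0), held_(0) {}

    // Producer: claims a free slot. Returns false where hardware would block.
    bool try_push() {
        if (free_ == 0) return false;
        --free_;
        ++ready_;
        return true;
    }

    // Consumer: acquires a ready slot and holds it until free_slot().
    bool try_pop() {
        if (ready_ == 0) return false;
        --ready_;
        ++held_;
        return true;
    }

    // Consumer: releases one held slot back to the producer pool.
    // Returns false if nothing is held (undefined behavior on real hardware).
    bool free_slot() {
        if (held_ == 0) return false;
        --held_;
        ++free_;
        return true;
    }

    std::size_t free_slots() const { return free_; }

private:
    std::size_t free_;   // slots available to the producer
    std::size_t ready_;  // slots pushed but not yet popped
    std::size_t held_;   // slots popped but not yet released
};
```

In this model a full pipe rejects further pushes until the consumer releases a held slot, which mirrors step 3 above: the release is what lets a stalled producer make progress.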

## Syntax

Textual spelling is defined by the PTO ISA syntax-and-operands pages.

### IR Level 1 (SSA)

```text
%dst = pto.tfree ...
%event = pto.tfree %pipe : (!pto.pipe<...>) -> !pto.record_event
```

### IR Level 2 (DPS)

```text
pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>)
pto.tfree ins(%pipe) outs(%event : !pto.record_event)
```

## C++ Intrinsic

Declared in `include/pto/common/pto_instr.hpp`.
Declared in `include/pto/common/pto_instr.hpp`:

```cpp
template <typename Pipe, typename... WaitEvents>
PTO_INST RecordEvent TFREE(Pipe &pipe, WaitEvents&... events);
```

## Inputs

| Operand | Description |
|---------|-------------|
| `pipe` | The tile pipe (`TMPipe` or compatible) whose most recently popped slot is being released. |
| `events...` | Optional `RecordEvent` tokens to wait on before issuing. |

## Expected Outputs

| Result | Type | Description |
|--------|------|-------------|
| `RecordEvent` | token | Signals that the slot has been released and is reusable by the producer. |

## Side Effects

- Updates pipe-control metadata in place; no tile data is moved.
- May unblock a producer waiting on `pto.tpush`.
- After `pto.tfree`, the tile previously returned by `pto.tpop` on the same pipe must not be read or written by the consumer.

## Constraints

!!! warning "Constraints"
Refer to backend-specific legality checks for data type/layout/location/shape constraints.
- Each `pto.tpop` on a pipe must be matched by exactly one `pto.tfree` on the same pipe; double-free is undefined.
- `pto.tfree` must not be issued from the producer side; it is consumer-side only.
- Refer to backend-specific legality checks for data type/layout/location/shape constraints not covered above.
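
One way to keep each acquire paired with exactly one release in C++ code is a scope guard. The sketch below is generic and deliberately does not call `TPOP` or `TFREE`: `ScopedRelease` and `make_scoped_release` are hypothetical names, and whether lambdas and destructors are usable in a given device toolchain is an assumption to verify before adopting the pattern.

```cpp
#include <utility>

// Generic scope guard (hypothetical helper): invokes `release` exactly once
// when the guard goes out of scope, unless ownership was moved elsewhere.
template <typename Release>
class ScopedRelease {
public:
    explicit ScopedRelease(Release release)
        : release_(std::move(release)), active_(true) {}

    ScopedRelease(const ScopedRelease&) = delete;
    ScopedRelease& operator=(const ScopedRelease&) = delete;

    // Moving transfers the obligation to release; the source is disarmed.
    ScopedRelease(ScopedRelease&& other)
        : release_(std::move(other.release_)), active_(other.active_) {
        other.active_ = false;
    }

    ~ScopedRelease() {
        if (active_) release_();
    }

private:
    Release release_;
    bool active_;
};

template <typename Release>
ScopedRelease<Release> make_scoped_release(Release release) {
    return ScopedRelease<Release>(std::move(release));
}
```

A consumer loop could, under those assumptions, create the guard immediately after the pop with a callable that performs the release, so that early exits from the loop body cannot skip it or run it twice.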

## Exceptions

!!! danger "Exceptions"
- Issuing `pto.tfree` on a pipe with no outstanding `pto.tpop` is undefined behavior.
- Calling `pto.tfree` more times than `pto.tpop` on the same pipe is undefined behavior.
- Calling `pto.tfree` from the producer side is rejected by the verifier.
- Programs must not rely on behavior outside the documented legal domain of this operation.

## Examples

```cpp
// Consumer side of a TMPipe loop.
auto tile = TPOP(pipe);
// ... compute on tile ...
TFREE(pipe);
```

See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns.

## See Also

- Instruction set overview: [System Ops](../README.md)
- Producer side: [TPUSH](./TPUSH.md)
- Consumer-acquire side: [TPOP](./TPOP.md)
47 changes: 46 additions & 1 deletion docs/isa/system/ops/TPOP.md
@@ -6,6 +6,16 @@

`TPOP` retrieves a tile from a ring FIFO into a consumer pipeline (Vector or Cube). It is the consumer half of the TPipe/TMPipe producer-consumer protocol, paired with [`TPUSH`](./TPUSH.md).

## Mechanism

For every `TPOP` call:

1. The consumer **waits** on the producer's data-ready flag.
2. It then **pops** the tile data from the FIFO — either via a `TLOAD` from a GM slot buffer, a `TASSIGN` against a local UB/MAT buffer, or a 32-bit control-signal read for `V2C_CTRL`.
3. Finally it **frees** the slot by signalling the producer with the matching release flag.

The op does not run on the vector or cube datapaths: data movement (when present) is dispatched on MTE1/MTE2, while flag wait/set runs on the system pipe (PIPE_S / PIPE_FIX / PIPE_MTE2). See [Three-Phase Protocol](#three-phase-protocol) for the per-backend signal table.
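
The wait, pop, and free phases can be sketched as a toy ring FIFO on the host. This is an illustrative model only: `ToyRingFifo` and its members are hypothetical, a per-slot boolean stands in for the hardware flags, a 32-bit integer stands in for the tile payload, and calls that would block on hardware return `false` instead.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy single-threaded ring FIFO (hypothetical, host-side only).
template <std::size_t SlotNum>
struct ToyRingFifo {
    std::array<std::int32_t, SlotNum> data{};  // stand-in for tile payloads
    std::array<bool, SlotNum> ready{};         // per-slot data-ready flag
    std::size_t push_index = 0;
    std::size_t pop_index = 0;

    // Producer side: write the slot, then raise its data-ready flag.
    bool try_push(std::int32_t value) {
        const std::size_t slot = push_index % SlotNum;
        if (ready[slot]) return false;  // slot not yet released by the consumer
        data[slot] = value;
        ready[slot] = true;
        ++push_index;
        return true;
    }

    // Consumer side: the three phases of a pop.
    bool try_pop(std::int32_t& out) {
        const std::size_t slot = pop_index % SlotNum;
        if (!ready[slot]) return false;  // phase 1: wait on data-ready
        out = data[slot];                // phase 2: pop the payload
        ready[slot] = false;             // phase 3: free the slot
        ++pop_index;
        return true;
    }
};
```

With two slots, a third push is rejected until one pop completes, after which the producer reuses slot 0, matching the `tileIndex % SlotNum` wrap-around of a ring.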

## What TPOP Is Not

`TPOP` is **not** a scalar stack pop or a generic FIFO dequeue. It is a structured tile-movement protocol for Cube-Vector tile passing. It is not available on the CPU simulator.
@@ -140,6 +150,30 @@ void consumer_mat(MatTile& matTile) {
}
```

## Inputs

| Operand | Description |
|---------|-------------|
| `pipe` | The `TPipe` / `TMPipe` instance shared with the producer. Carries `FlagID`, `DirType`, slot pointers, and consumer-local buffer addresses. |
| `tile` | The destination tile (Vec or Mat) into which the popped data is materialised. For `*_UB` / `*_MAT` paths, the tile binds to the consumer-local FIFO buffer; for GM paths it is a regular UB/MAT tile filled by `TLOAD`. |
| `TileSplitAxis` (template) | Optional split mode (`TILE_NO_SPLIT`, `TILE_UP_DOWN`, `TILE_LEFT_RIGHT`) used to compute the per-subblock GM offset. Must match the producer. |

## Expected Outputs

| Result | Type | Description |
|--------|------|-------------|
| `RecordEvent` | token | Signals completion of the wait + load + free sequence. |
| `tile` | tile | Holds the popped tile after this op completes. For `V2C_CTRL`, the 32-bit control signal is stored in `pipe.cons.fifo.ctrlSignal` instead. |
| `pipe.cons` | state | Slot index advances by one (`tileIndex % SlotNum`); the released slot becomes available to the producer's next `TPUSH`. |

## Side Effects

- Issues a cross-core / intra-block flag wait (blocks until the producer signals).
- Writes a release flag back to the producer; this can unblock a producer waiting in its allocation phase.
- For GM paths: issues an MTE1/MTE2 load.
- For local-buffer paths (`C2V_UB`, `V2C_MAT`): no DMA, only a `TASSIGN` rebind.
- Does **not** implicitly fence unrelated tile traffic. Callers must use explicit events for cross-pipe ordering.

## Constraints

!!! warning "Constraints"
@@ -158,7 +192,18 @@ void consumer_mat(MatTile& matTile) {
- **A2/A3**: `DIR_C2V`, `DIR_V2C`, `DIR_BOTH`, `DIR_V2C_CTRL`. Synchronization via `wait_flag_dev` and `ffts_cross_core_sync`.
- **A5**: All direction types. Synchronization via `wait_intra_block` and `set_intra_block`. Additional `*_GM` paths with GM load. Sub-block support (`FlagID + 16`) for 2-Vec-core configurations.

## Common Patterns
## Exceptions

!!! danger "Exceptions"
- Calling `TPOP` without a matching prior `TPUSH` causes the consumer to wait indefinitely (deadlock).
- Calling `TPOP` on the CPU simulator is rejected: the op requires NPU inter-core synchronization infrastructure.
- `TileSplitAxis` mismatched with the producer's split mode produces undefined data (consumer reads from the wrong GM offset).
- `FlagID` reuse with another concurrently-active synchronization op is undefined behavior.
- Setting `isWait = false` when the producer has not yet recorded the slot reads stale or partially-written data.
- Setting `isFree = false` for too many iterations causes the producer to stall in its allocate phase (no deadlock, but throughput collapses).
- Programs must not rely on behavior outside the documented legal domain of this operation.

## Examples

### Pattern 1: Consuming Accumulator Tile (GEMM Post-Processing)
