diff --git a/docs/isa/comm/TGATHER.md b/docs/isa/comm/TGATHER.md index 4d9f75a9d..3898fe175 100644 --- a/docs/isa/comm/TGATHER.md +++ b/docs/isa/comm/TGATHER.md @@ -1,6 +1,8 @@ # pto.tgather -## Introduction +`pto.tgather` is part of the [Communication](./README.md) instruction set. + +## Summary Gather operation: the calling NPU (root) collects data from all ranks in the parallel group and concatenates the results along **DIM_3** (row dimension) into a local output buffer. @@ -8,7 +10,7 @@ Only the root needs to execute `pto.tgather`. Non-root ranks only need to ensure **Large Tile Support**: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding — the same mechanism used by other PTO-COMM instructions. -## Math Interpretation +## Mechanism Each rank $r$ has source data of shape $(D_0, D_1, D_2, H, W)$. The gather concatenates all $N$ ranks along DIM_3: @@ -16,7 +18,7 @@ $$\mathrm{dst}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} = \mathrm{src}^{(r)}_{d_0, The destination tensor has shape $(D_0, D_1, D_2, N \times H, W)$. -## Assembly Syntax +## Syntax PTO-AS form: see [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md). @@ -44,6 +46,30 @@ PTO_INST RecordEvent GATHER(ParallelGroupType &parallelGroup, GlobalDstData &dst TileData &pingTile, TileData &pongTile, WaitEvents&... events); ``` +## Inputs + +| Operand | Description | +|---------|-------------| +| `parallelGroup` | A `ParallelGroup` enumerating each rank's source buffer; root identified via `GetRootIdx()`. | +| `dstGlobalData` | Destination GlobalTensor on the root NPU; receives the concatenated result. | +| `stagingTileData` | UB staging tile used as the GM→UB→GM relay buffer (single-buffer form). | +| `pingTile`, `pongTile` | Two UB staging tiles for double-buffered (ping-pong) form, enabling MTE2/MTE3 overlap. | +| `events...` | Optional `RecordEvent` tokens to wait on before issuing.
| + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals completion of the gather across all participating ranks. | +| `dstGlobalData` | GlobalTensor | First `N × H` rows along DIM_3 hold the concatenated rank data; rows beyond `N × H` (if `dstGlobalData.GetShape(DIM_3) > N × H`) are left unchanged. | + +## Side Effects + +- Issues remote MTE2 reads from each rank's source buffer and MTE3 writes into the local destination. +- Cross-core synchronisation flags are toggled as part of the rank-fan-in protocol. +- Non-root ranks must keep their source buffers live until the root signals completion; otherwise behavior is undefined. +- No implicit fence on unrelated tile traffic. + ## Constraints !!! warning "Constraints" @@ -62,6 +88,15 @@ PTO_INST RecordEvent GATHER(ParallelGroupType &parallelGroup, GlobalDstData &dst - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's source must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support. - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support. +## Exceptions + +!!! danger "Exceptions" + - Calling `pto.tgather` on a non-root NPU is undefined behavior. + - Mismatched per-rank tensor shapes / strides yield undefined behavior; no runtime check is guaranteed. + - Using a `dstGlobalData` shape with `GetShape(DIM_3) < N × H` is rejected by the verifier on static shapes; on dynamic shapes the call writes only what fits and silently truncates. + - Type-mismatch between `ParallelGroup::value_type::RawDType` and `TileData::DType` / `GlobalDstData::RawDType` is rejected at compile time via `static_assert`. + - Programs must not rely on behavior outside the documented legal domain of this operation.
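The row mapping from the Mechanism section, together with the `N × H` destination requirement listed above, can be checked against a small host-side reference model. The sketch below is plain C++ and does not use the PTO API: `gather_rows_reference` and its flat row-major layout are hypothetical names chosen for illustration, and the leading dimensions $D_0, D_1, D_2$ are assumed to be 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for dst[r*H + i][j] = src_r[i][j].
// Hypothetical helper, not part of the PTO API. Each rank's source is a
// row-major H x W buffer; dst is row-major with dst_rows rows of width W.
// Rows of dst beyond N*H are left untouched.
template <typename T>
void gather_rows_reference(const std::vector<std::vector<T>>& rank_src,
                           std::size_t H, std::size_t W,
                           std::vector<T>& dst, std::size_t dst_rows) {
    const std::size_t N = rank_src.size();
    assert(dst_rows >= N * H);                // mirrors the documented shape requirement
    assert(dst.size() == dst_rows * W);
    for (std::size_t r = 0; r < N; ++r) {
        assert(rank_src[r].size() == H * W);  // every rank must share one shape
        for (std::size_t i = 0; i < H; ++i) {
            for (std::size_t j = 0; j < W; ++j) {
                dst[(r * H + i) * W + j] = rank_src[r][i * W + j];
            }
        }
    }
}
```

Comparing the device result against such a reference on small shapes is one way to catch mismatched per-rank shapes or an undersized destination, neither of which is guaranteed to be detected at run time.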
+ ## Examples ### Basic Gather (Single Staging Tile) @@ -125,3 +160,10 @@ void gather_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_ran comm::GATHER(group, dstG, pingTile, pongTile); } ``` + +## See Also + +- Instruction set overview: [Communication](./README.md) +- Inverse op: [pto.tscatter](./TSCATTER.md) +- Related collective ops: [pto.treduce](./TREDUCE.md), [pto.tbroadcast](./TBROADCAST.md) +- One-sided variants: [pto.tput](./TPUT.md), [pto.tget](./TGET.md) diff --git a/docs/isa/comm/TSCATTER.md b/docs/isa/comm/TSCATTER.md index 3ccc5a0b1..4d579620f 100644 --- a/docs/isa/comm/TSCATTER.md +++ b/docs/isa/comm/TSCATTER.md @@ -1,6 +1,8 @@ # pto.tscatter -## Introduction +`pto.tscatter` is part of the [Communication](./README.md) instruction set. + +## Summary Scatter operation: the calling NPU (root) distributes data to all ranks in the parallel group by splitting the local source tensor along **DIM_3** (row dimension). This is the inverse of `pto.tgather`. @@ -8,13 +10,13 @@ Only the root needs to execute `pto.tscatter`. Non-root ranks only need to ensur **Large Tile Support**: When the per-rank data exceeds the UB tile capacity in rows and/or columns, the transfer is automatically chunked via 2D sliding. -## Math Interpretation +## Mechanism The local source tensor has shape $(D_0, D_1, D_2, N \times H, W)$, where $N$ is the number of ranks and each rank receives $H$ rows. After the operation: $$\mathrm{dst}^{(r)}_{d_0, d_1, d_2,\; i,\; j} = \mathrm{src}^{\mathrm{local}}_{d_0, d_1, d_2,\; r \cdot H + i,\; j} \quad \forall\, r \in [0, N),\; i \in [0, H),\; j \in [0, W)$$ -## Assembly Syntax +## Syntax PTO-AS form: see [Assembly Spelling And Operands](../syntax-and-operands/assembly-model.md). @@ -42,6 +44,30 @@ PTO_INST RecordEvent SCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &sr TileData &pingTile, TileData &pongTile, WaitEvents&...
events); ``` +## Inputs + +| Operand | Description | +|---------|-------------| +| `parallelGroup` | A `ParallelGroup` enumerating each rank's destination buffer; root identified via `GetRootIdx()`. | +| `srcGlobalData` | Source GlobalTensor on the root NPU; concatenation of per-rank slices along DIM_3. | +| `stagingTileData` | UB staging tile used as the GM→UB→GM relay buffer (single-buffer form). | +| `pingTile`, `pongTile` | Two UB staging tiles for double-buffered (ping-pong) form, enabling MTE2/MTE3 overlap. | +| `events...` | Optional `RecordEvent` tokens to wait on before issuing. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals completion of the scatter to all participating ranks. | +| `parallelGroup.tensors[r]` | GlobalTensor (remote) | Each rank `r`'s destination receives `src[r*H : (r+1)*H, :]` of the source. | + +## Side Effects + +- Issues local MTE2 reads from `srcGlobalData` and remote MTE3 writes to each rank's destination buffer. +- Cross-core synchronisation flags are toggled as part of the rank-fan-out protocol. +- Non-root ranks must keep their destination buffers writable until the root signals completion; otherwise behavior is undefined. +- No implicit fence on unrelated tile traffic. + ## Constraints !!! warning "Constraints" @@ -60,6 +86,15 @@ PTO_INST RecordEvent SCATTER(ParallelGroupType &parallelGroup, GlobalSrcData &sr - If `TileData` has static `ValidRow`, `GetShape(DIM_3)` of each rank's destination must be divisible by `ValidRow`. Use a Tile with `DYNAMIC` ValidRow for partial row support. - If `TileData` has static `ValidCol`, `GetShape(DIM_4)` must be divisible by `ValidCol`. Use a Tile with `DYNAMIC` ValidCol for partial column support. +## Exceptions + +!!! danger "Exceptions" + - Calling `pto.tscatter` on a non-root NPU is undefined behavior. + - Mismatched per-rank tensor shapes / strides yield undefined behavior; no runtime check is guaranteed.
+ - Using a `srcGlobalData` shape with `GetShape(DIM_3) < N × H` is rejected by the verifier on static shapes; on dynamic shapes the call reads only what fits and remaining ranks receive partial data. + - Type-mismatch between `ParallelGroup::value_type::RawDType` and `TileData::DType` / `GlobalSrcData::RawDType` is rejected at compile time via `static_assert`. + - Programs must not rely on behavior outside the documented legal domain of this operation. + ## Examples ### Basic Scatter (Single Staging Tile) @@ -123,3 +158,10 @@ void scatter_pingpong(__gm__ T* local_data, __gm__ T* group_addrs[NRANKS], int m comm::SCATTER(group, srcG, pingTile, pongTile); } ``` + +## See Also + +- Instruction set overview: [Communication](./README.md) +- Inverse op: [pto.tgather](./TGATHER.md) +- Related collective ops: [pto.treduce](./TREDUCE.md), [pto.tbroadcast](./TBROADCAST.md) +- One-sided variants: [pto.tput](./TPUT.md), [pto.tget](./TGET.md) diff --git a/docs/isa/system/ops/TFREE.md b/docs/isa/system/ops/TFREE.md index b750b2f67..02f7e1705 100644 --- a/docs/isa/system/ops/TFREE.md +++ b/docs/isa/system/ops/TFREE.md @@ -4,38 +4,92 @@ ![TFREE tile operation](../../../figures/isa/TFREE.svg) -## Introduction +## Summary -Release the currently held pipe or FIFO slot back to the producer. +`pto.tfree` releases the consumer-side reservation on a tile pipe previously acquired by `pto.tpop`, returning the slot to the producer side of the FIFO. It is the primary reclaim hook in the system-side three-phase tile-pipe protocol and is paired with `pto.tpop` on the consumer side. -## Math Interpretation +## Mechanism -Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region. +`pto.tfree` performs a single release transaction against the tile-pipe metadata: -## Assembly Syntax +1. The current consumer signals it is done with the slot it last popped from `pipe`. +2. 
The slot's reference is decremented; when it drops to zero, the slot is returned to the producer pool. +3. Any waiter blocked on `pto.tpush` for an empty slot is unblocked. + +The op executes on the system pipe and does not move tile data; it only updates pipe-control state. The slot identity is implicit in `pipe` — there is no tile handle operand. + +## Syntax Textual spelling is defined by the PTO ISA syntax-and-operands pages. ### IR Level 1 (SSA) ```text -%dst = pto.tfree ... +%event = pto.tfree %pipe : (!pto.pipe<...>) -> !pto.record_event ``` ### IR Level 2 (DPS) ```text -pto.tfree ins(...) outs(%dst : !pto.tile_buf<...>) +pto.tfree ins(%pipe) outs(%event : !pto.record_event) ``` + ## C++ Intrinsic -Declared in `include/pto/common/pto_instr.hpp`. +Declared in `include/pto/common/pto_instr.hpp`: + +```cpp +template +PTO_INST RecordEvent TFREE(Pipe &pipe, WaitEvents&... events); +``` + +## Inputs + +| Operand | Description | +|---------|-------------| +| `pipe` | The tile pipe (`TMPipe` or compatible) whose most recently popped slot is being released. | +| `events...` | Optional `RecordEvent` tokens to wait on before issuing. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals that the slot has been released and is reusable by the producer. | + +## Side Effects + +- Updates pipe-control metadata in place; no tile data is moved. +- May unblock a producer waiting on `pto.tpush`. +- After `pto.tfree`, the tile previously returned by `pto.tpop` on the same pipe must not be read or written by the consumer. ## Constraints !!! warning "Constraints" - Refer to backend-specific legality checks for data type/layout/location/shape constraints. + - Each `pto.tpop` on a pipe must be matched by exactly one `pto.tfree` on the same pipe; double-free is undefined. + - `pto.tfree` must not be issued from the producer side; it is consumer-side only. 
+ - Refer to backend-specific legality checks for data type/layout/location/shape constraints not covered above. + +## Exceptions + +!!! danger "Exceptions" + - Issuing `pto.tfree` on a pipe with no outstanding `pto.tpop` is undefined behavior. + - Calling `pto.tfree` more times than `pto.tpop` on the same pipe is undefined behavior. + - Calling `pto.tfree` from the producer side is rejected by the verifier. + - Programs must not rely on behavior outside the documented legal domain of this operation. ## Examples +```cpp +// Consumer side of a TMPipe loop. +auto tile = TPOP(pipe); +// ... compute on tile ... +TFREE(pipe); +``` + See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns. + +## See Also + +- Instruction set overview: [System Ops](../README.md) +- Producer side: [TPUSH](./TPUSH.md) +- Consumer-acquire side: [TPOP](./TPOP.md) diff --git a/docs/isa/system/ops/TPOP.md b/docs/isa/system/ops/TPOP.md index 4517c38c6..7bea12b30 100644 --- a/docs/isa/system/ops/TPOP.md +++ b/docs/isa/system/ops/TPOP.md @@ -6,6 +6,16 @@ `TPOP` retrieves a tile from a ring FIFO into a consumer pipeline (Vector or Cube). It is the consumer half of the TPipe/TMPipe producer-consumer protocol, paired with [`TPUSH`](./TPUSH.md). +## Mechanism + +For every `TPOP` call: + +1. The consumer **waits** on the producer's data-ready flag. +2. It then **pops** the tile data from the FIFO — either via a `TLOAD` from a GM slot buffer, a `TASSIGN` against a local UB/MAT buffer, or a 32-bit control-signal read for `V2C_CTRL`. +3. Finally it **frees** the slot by signalling the producer with the matching release flag. + +The op does not run on the vector or cube datapaths: data movement (when present) is dispatched on MTE1/MTE2, while flag wait/set runs on the system pipe (PIPE_S / PIPE_FIX / PIPE_MTE2). See [Three-Phase Protocol](#three-phase-protocol) for the per-backend signal table. 
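The three steps above can be sketched with a single-threaded toy model of a ring FIFO. This is an illustration under stated assumptions, not the real `TPipe` layout: `ToyPipe`, `push`, and `pop` are hypothetical names, one boolean per slot stands in for both the data-ready flag and the release flag, and a `false` return marks the point where real hardware would block.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy ring FIFO with one flag per slot. Hypothetical model only.
template <std::size_t SlotNum>
struct ToyPipe {
    std::array<int, SlotNum> slot{};    // stand-in for tile payloads
    std::array<bool, SlotNum> ready{};  // true: filled by producer, not yet freed
    std::size_t prodIndex = 0;          // producer's running tile index
    std::size_t consIndex = 0;          // consumer's running tile index

    // Producer side, reduced to the minimum needed to feed the consumer.
    bool push(int value) {
        const std::size_t s = prodIndex % SlotNum;
        if (ready[s]) return false;     // slot not freed yet: producer would stall
        slot[s] = value;
        ready[s] = true;                // record data-ready
        ++prodIndex;
        return true;
    }

    // Consumer side: wait -> pop -> free.
    bool pop(int& out) {
        const std::size_t s = consIndex % SlotNum;
        if (!ready[s]) return false;    // 1. wait: no data yet, consumer would block
        out = slot[s];                  // 2. pop the payload
        ready[s] = false;               // 3. free the slot for the producer
        ++consIndex;
        return true;
    }
};
```

With `ToyPipe<2>`, a third `push` fails until one `pop` has run, and a `pop` on an empty pipe fails; these correspond to the producer stall and consumer wait described for the real protocol.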
+ ## What TPOP Is Not `TPOP` is **not** a scalar stack pop or a generic FIFO dequeue. It is a structured tile-movement protocol for Cube-Vector tile passing. It is not available on the CPU simulator. @@ -140,6 +150,30 @@ void consumer_mat(MatTile& matTile) { } ``` +## Inputs + +| Operand | Description | +|---------|-------------| +| `pipe` | The `TPipe` / `TMPipe` instance shared with the producer. Carries `FlagID`, `DirType`, slot pointers, and consumer-local buffer addresses. | +| `tile` | The destination tile (Vec or Mat) into which the popped data is materialised. For `*_UB` / `*_MAT` paths, the tile binds to the consumer-local FIFO buffer; for GM paths it is a regular UB/MAT tile filled by `TLOAD`. | +| `TileSplitAxis` (template) | Optional split mode (`TILE_NO_SPLIT`, `TILE_UP_DOWN`, `TILE_LEFT_RIGHT`) used to compute the per-subblock GM offset. Must match the producer. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals completion of the wait + load + free sequence. | +| `tile` | tile | Holds the popped tile after this op completes. For `V2C_CTRL`, the 32-bit control signal is stored in `pipe.cons.fifo.ctrlSignal` instead. | +| `pipe.cons` | state | Slot index advances by one (`tileIndex % SlotNum`); the released slot becomes available to the producer's next `TPUSH`. | + +## Side Effects + +- Issues a cross-core / intra-block flag wait (blocks until the producer signals). +- Writes a release flag back to the producer; this can unblock a producer waiting in its allocation phase. +- For GM paths: issues an MTE1/MTE2 load. +- For local-buffer paths (`C2V_UB`, `V2C_MAT`): no DMA, only a `TASSIGN` rebind. +- Does **not** implicitly fence unrelated tile traffic. Callers must use explicit events for cross-pipe ordering. + ## Constraints !!! warning "Constraints" @@ -158,7 +192,18 @@ void consumer_mat(MatTile& matTile) { - **A2/A3**: `DIR_C2V`, `DIR_V2C`, `DIR_BOTH`, `DIR_V2C_CTRL`. 
Synchronization via `wait_flag_dev` and `ffts_cross_core_sync`. - **A5**: All direction types. Synchronization via `wait_intra_block` and `set_intra_block`. Additional `*_GM` paths with GM load. Sub-block support (`FlagID + 16`) for 2-Vec-core configurations. -## Common Patterns +## Exceptions + +!!! danger "Exceptions" + - Calling `TPOP` without a matching prior `TPUSH` causes the consumer to wait indefinitely (deadlock). + - Calling `TPOP` on the CPU simulator is rejected: the op requires NPU inter-core synchronization infrastructure. + - `TileSplitAxis` mismatched with the producer's split mode produces undefined data (consumer reads from the wrong GM offset). + - `FlagID` reuse with another concurrently-active synchronization op is undefined behavior. + - Setting `isWait = false` when the producer has not yet recorded the slot reads stale or partially-written data. + - Setting `isFree = false` for too many iterations causes the producer to stall in its allocate phase (no deadlock, but throughput collapses). + - Programs must not rely on behavior outside the documented legal domain of this operation. + +## Examples ### Pattern 1: Consuming Accumulator Tile (GEMM Post-Processing) diff --git a/docs/isa/system/ops/TPUSH.md b/docs/isa/system/ops/TPUSH.md index 5c2cce09c..e37b4919b 100644 --- a/docs/isa/system/ops/TPUSH.md +++ b/docs/isa/system/ops/TPUSH.md @@ -6,6 +6,16 @@ `TPUSH` moves a tile from a producer pipeline (Cube or Vector) into a ring FIFO for consumption by a paired pipeline. It is the producer half of the TPipe/TMPipe producer-consumer protocol. +## Mechanism + +For every `TPUSH` call: + +1. The producer **allocates** by waiting on the consumer's slot-free flag. +2. It then **pushes** the tile data into the FIFO — either via a `TSTORE` to a GM slot buffer, a direct `TMOV`/`TINSERT` into the consumer's local UB/MAT buffer, or a 32-bit control-signal write for `V2C_CTRL`. +3. 
Finally it **records** the data-ready signal so the consumer's `TPOP` can proceed. + +The op runs on FIXP / MTE3 / system pipes depending on the direction; see [Three-Phase Protocol](#three-phase-protocol) for the per-backend signal table. + ## What TPUSH Is Not `TPUSH` is **not** a scalar stack push or a generic FIFO enqueue. It is a structured tile-movement protocol for Cube-Vector tile passing. It is not available on the CPU simulator. @@ -183,6 +193,31 @@ void producer_vec(VecTile& vTile) { } ``` +## Inputs + +| Operand | Description | +|---------|-------------| +| `pipe` | The `TPipe` / `TMPipe` instance shared with the consumer. Carries `FlagID`, `DirType`, slot pointers, and consumer-local buffer addresses. | +| `tile` | The source tile (Acc or Vec) whose contents are pushed into the FIFO. For `*_UB` / `*_MAT` paths, the consumer-local buffer is the destination; for GM paths the slot is in global memory. | +| `TileSplitAxis` (template) | Optional split mode (`TILE_NO_SPLIT`, `TILE_UP_DOWN`, `TILE_LEFT_RIGHT`) used to compute the per-subblock GM offset. Must match the consumer. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals completion of the allocate + push + record sequence. | +| `pipe` | state | A FIFO slot is filled with the pushed tile data, the slot index advances (`tileIndex % SlotNum`), and the data-ready flag for the consumer is raised. | +| `pipe.cons.fifo.ctrlSignal` | scalar (V2C_CTRL only) | Receives the 32-bit control word taken from the producer tile's first element. | + +## Side Effects + +- Issues a cross-core / intra-block flag wait in the allocate phase (blocks until the consumer frees a slot). +- Writes a data-ready flag for the consumer; this can unblock a consumer waiting in `TPOP`. +- For GM paths: issues an MTE3 store. +- For FIXP paths (`C2V` accumulator drain on A5): issues a fix-pipe drain into the destination buffer. 
+- For local-buffer paths (`C2V_UB`, `V2C_MAT`): writes directly into the consumer's UB/MAT buffer via `TMOV` / `TINSERT`. +- Does **not** implicitly fence unrelated tile traffic. Callers must use explicit events for cross-pipe ordering. + ## Constraints !!! warning "Constraints" @@ -201,7 +236,18 @@ void producer_vec(VecTile& vTile) { - **A2/A3**: Supports `DIR_C2V`, `DIR_V2C`, `DIR_BOTH`, `DIR_V2C_CTRL`. FIFO paths: GM and local UB/MAT. Does not support `DIR_*_GM` variants. - **A5**: Supports all direction types including `DIR_C2V_GM`, `DIR_V2C_GM`, `DIR_BOTH_GM`. FIFO paths: GM, VEC_FIFO, MAT_FIFO, CTRL_FIFO. Intra-block synchronization uses `set_intra_block`/`wait_intra_block` instead of cross-core `ffts_*`. -## Common Patterns +## Exceptions + +!!! danger "Exceptions" + - Calling `TPUSH` without an eventual matching `TPOP` causes the producer to stall in the allocate phase once all `SlotNum` slots are full. + - Calling `TPUSH` on the CPU simulator is rejected: the op requires NPU inter-core synchronization infrastructure. + - `DirType` incompatible with `TileProd::Loc` / `TileDataCons::Loc` is rejected at compile time via `static_assert`. + - `FlagID` reuse with another concurrently-active synchronization op is undefined behavior. + - Setting `isAllocate = false` when the slot has not actually been freed overwrites in-flight data the consumer has not yet read. + - Setting `isRecord = false` for too many iterations leaves the consumer waiting indefinitely (deadlock). + - Programs must not rely on behavior outside the documented legal domain of this operation. 
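The stall and overwrite cases listed above can be made concrete with a small host-side model of the producer's allocate phase. The sketch is hypothetical and single-threaded: `ToyProducer`, `consumerFree`, and the `overwritten` counter are illustration names, the `isAllocate` parameter only mimics the flag named in the text, and a `false` return marks where real hardware would block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of producer-side slot accounting. Not the real TPipe state.
struct ToyProducer {
    std::vector<int> slots;       // stand-in for FIFO slot payloads
    std::size_t inFlight = 0;     // pushed but not yet freed by the consumer
    std::size_t tileIndex = 0;    // running tile index, slot = tileIndex % SlotNum
    std::size_t overwritten = 0;  // unread slots clobbered by skipping allocate

    explicit ToyProducer(std::size_t slotNum) : slots(slotNum, 0) {}

    bool push(int value, bool isAllocate = true) {
        const bool full = (inFlight == slots.size());
        if (full && isAllocate) return false;  // allocate phase: wait for a free slot
        if (full) {
            ++overwritten;                     // allocate skipped: unread data is lost
        } else {
            ++inFlight;
        }
        slots[tileIndex % slots.size()] = value;
        ++tileIndex;
        return true;
    }

    // Models the consumer releasing its oldest slot.
    void consumerFree() {
        assert(inFlight > 0);
        --inFlight;
    }
};
```

In this model a producer with two slots accepts two pushes, refuses the third until `consumerFree()` runs, and clobbers an unread slot if it pushes with `isAllocate = false` while full.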
+ +## Examples ### Pattern 1: Acc → Vec Tile Passing (GEMM Post-Processing) diff --git a/docs/isa/tile/ops/elementwise-tile-tile/tcvt.md b/docs/isa/tile/ops/elementwise-tile-tile/tcvt.md index fc3cd830a..39e2fdd74 100644 --- a/docs/isa/tile/ops/elementwise-tile-tile/tcvt.md +++ b/docs/isa/tile/ops/elementwise-tile-tile/tcvt.md @@ -109,12 +109,14 @@ No architectural side effects beyond producing the destination tile. Does not im - When a conversion path requires explicit scratch storage, callers MUST use one of the `tmp`-tile overloads. - Disabling saturation may change overflow behavior for some backend/type paths, especially low-precision integer conversions. -## Cases That Are Not Allowed +## Exceptions -!!! danger "Cases That Are Not Allowed" +!!! danger "Exceptions" + - Illegal operand tuples, unsupported types, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set. - **MUST NOT** use a type pair not supported by the target profile. - **MUST NOT** use a rounding mode not supported for the given type pair. - **MUST NOT** assume that disabling saturation still clamps overflow to the destination range. + - Programs must not rely on behavior outside the documented legal domain of this operation. 
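The saturation caveat above can be illustrated on the host with a 32-bit to 8-bit integer narrowing. The sketch below contrasts a clamping conversion with one possible non-saturating result (keeping the low 8 bits). Both helpers are hypothetical and unrelated to the `TCVT` implementation; what a backend actually produces with saturation off depends on the backend and type pair, so the wrap shown here is an assumption for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Saturating narrow: clamp to the int8 range [-128, 127].
inline int narrow_saturate_i8(std::int32_t v) {
    if (v > 127) return 127;
    if (v < -128) return -128;
    return static_cast<int>(v);
}

// One possible non-saturating narrow: keep the low 8 bits and reinterpret
// them as a two's complement value. Illustration only; not a statement
// about any particular backend.
inline int narrow_wrap_i8(std::int32_t v) {
    const int low = static_cast<int>(static_cast<std::uint32_t>(v) & 0xFFu);
    return low >= 128 ? low - 256 : low;
}
```

For an in-range input such as 100 both helpers agree. For 300 the clamped result is 127 while the wrapped result is 44, which is why code that needs clamping should not assume it survives once saturation is disabled.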
## Target-Profile Restrictions @@ -156,7 +158,7 @@ TCVT(dst, src, tmp, RoundMode::CAST_TRUNC, SaturationMode::OFF); pto.tcvt ins(%src {rmode = #pto.round_mode}: !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Elementwise Tile Tile](../../elementwise-tile-tile.md) - Previous op in instruction set: [pto.tsubc](./tsubc.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tci.md b/docs/isa/tile/ops/irregular-and-complex/tci.md index 7c8411f81..d9087f00c 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tci.md +++ b/docs/isa/tile/ops/irregular-and-complex/tci.md @@ -155,8 +155,8 @@ void example_manual() { pto.tci ins(%scalar {descending = false} : dtype) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) -- Previous op in instruction set: [pto.tgather](./tgather.md) +- Previous op in instruction set: [pto.tscatter](./tscatter.md) - Next op in instruction set: [pto.ttri](./ttri.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tgather.md b/docs/isa/tile/ops/irregular-and-complex/tgather.md index d3707089b..45bb07e70 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tgather.md +++ b/docs/isa/tile/ops/irregular-and-complex/tgather.md @@ -182,8 +182,8 @@ void example_manual() { pto.tgather ins(%src, %indices : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.tsort32](./tsort32.md) -- Next op in instruction set: [pto.tci](./tci.md) +- Next op in instruction set: [pto.tgatherb](./tgatherb.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tgatherb.md b/docs/isa/tile/ops/irregular-and-complex/tgatherb.md index 29f286e38..242c39e8e 100644 --- 
a/docs/isa/tile/ops/irregular-and-complex/tgatherb.md +++ b/docs/isa/tile/ops/irregular-and-complex/tgatherb.md @@ -164,8 +164,8 @@ void example_manual() { pto.tgatherb ins(%src, %offsets : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) -- Previous op in instruction set: [pto.tpartmin](./tpartmin.md) +- Previous op in instruction set: [pto.tgather](./tgather.md) - Next op in instruction set: [pto.tscatter](./tscatter.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/thistogram.md b/docs/isa/tile/ops/irregular-and-complex/thistogram.md index f2065aa40..a4fb33384 100644 --- a/docs/isa/tile/ops/irregular-and-complex/thistogram.md +++ b/docs/isa/tile/ops/irregular-and-complex/thistogram.md @@ -4,38 +4,104 @@ ![THISTOGRAM tile operation](../../../../figures/isa/THISTOGRAM.svg) -## Introduction +## Summary -Accumulate histogram bin counts from source values using an index tile. +`pto.thistogram` accumulates radix-style histogram bin counts from a source tile into a destination bin tile, selecting **one byte lane** of each multi-byte source element as the bin index. The selected byte is chosen by the `HistByte byte` template parameter (`BYTE_0`/`BYTE_1` for `uint16_t`, plus `BYTE_2`/`BYTE_3` for wider types). For non-MSB passes a separate `idx` tile is consumed to mask the contribution of each row, so `pto.thistogram` chains naturally into a multi-pass radix sort. This op is **A5-only** and is exposed under the `Statistics` category of the irregular-and-complex group. -## Math Interpretation +## Mechanism -Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region. +For every source element `src_i` (within the valid region), let `b = extract_byte(src_i)` be the bin index extracted from the selected byte lane. 
The destination bin tile is updated by: -## Assembly Syntax +$$\mathrm{dst}_b \mathrel{+}= 1 \quad \forall\, i \in [0,\, R \cdot C)$$ + +When `byte` selects a non-MSB lane (e.g. `BYTE_0`), the per-row `idx` tile is used to gate which previously bucketed rows still contribute, enabling multi-pass MSB→LSB radix sorting. When `byte` selects the MSB (e.g. `BYTE_1` for `uint16_t`), `idx` is unused but must still be supplied as a valid operand. + +The op executes on the irregular path: the vector pipe consumes the source tile and the FIXP / scalar-update path applies the per-bin increments. Because increments to the same bin are inherently serialised, the inner loop is throughput-limited by the bin-write port rather than by source bandwidth. + +## Syntax Textual spelling is defined by the PTO ISA syntax-and-operands pages. ### IR Level 1 (SSA) ```text -%dst = pto.thistogram ... +%event = pto.thistogram %src, %idx, %dst_in {byte = #pto.hist_byte} + : (!pto.tile<...>, !pto.tile<...>, !pto.tile<...>) -> !pto.record_event ``` ### IR Level 2 (DPS) ```text -pto.thistogram ins(...) outs(%dst : !pto.tile_buf<...>) +pto.thistogram {byte = #pto.hist_byte} + ins(%src, %idx) outs(%dst : !pto.tile_buf<...>) ``` + ## C++ Intrinsic -Declared in `include/pto/common/pto_instr.hpp`. +Declared in `include/pto/common/pto_instr.hpp` (gated to A5 / CPU simulator): + +```cpp +template +PTO_INST RecordEvent THISTOGRAM(TileDataDst &dst, TileDataSrc &src, TileDataIdx &idx, + WaitEvents&... events); +``` + +`HistByte` is defined in `include/pto/common/type.hpp` and selects which byte lane of each source element is histogrammed (`BYTE_0` is the LSB, `BYTE_3` the MSB). + +## Inputs + +| Operand | Description | +|---------|-------------| +| `byte` (template) | `HistByte` enum selecting the byte lane used as the bin index. Must satisfy the source-type constraint (e.g. only `BYTE_0`/`BYTE_1` for `uint16_t`). | +| `src` | Source tile whose elements supply the bin index byte.
| +| `idx` | Per-row index/mask tile that gates contributions for non-MSB passes; supplied (and typically loaded) but unused on the MSB pass. Element type is `uint8_t`. | +| `events...` | Optional `RecordEvent` tokens to wait on before issuing. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals completion of the histogram update. | +| `dst` | tile | Holds the running per-bin counts. The op **adds** to existing values; clear `dst` first to obtain a fresh histogram. | + +## Side Effects + +- Reads `src`, accumulates into `dst` in place — `dst` is **not** cleared by the op. +- No DMA traffic; runs entirely within the AI core. +- Does not implicitly fence unrelated tile traffic. ## Constraints !!! warning "Constraints" - Refer to backend-specific legality checks for data type/layout/location/shape constraints. + - Backend support: **A5 only**. A2/A3 reject this op at lowering time. + - `dst` element type must be a counter-friendly integer type (`uint32_t` in the reference path) and must contain 256 bins per row (one per possible byte value). + - `src` element type determines which `HistByte` values are legal: `uint16_t` accepts `BYTE_0`/`BYTE_1`; wider types extend the set up to `BYTE_3`. A `static_assert` rejects illegal `byte` selections. + - `idx` is a per-row `uint8_t` tile aligned to `BLOCK_BYTE_SIZE`; it must be loaded for non-MSB passes and may be left uninitialised but still passed for the MSB pass. + - `dst` must be initialised by the caller (e.g. via `TASSIGN`, `TFILL`, or `TMOV`) before the first accumulation. + - Refer to backend-specific legality checks for data type/layout/location/shape constraints not covered above. + +## Exceptions + +!!! danger "Exceptions" + - Lowering on A2/A3 backends is rejected by the verifier. + - A `HistByte` value that is not legal for the source element type (e.g. `BYTE_2`/`BYTE_3` on `uint16_t`) is rejected at compile time via `static_assert`. 
+ - Programs must not rely on behavior outside the documented legal domain of this operation. ## Examples -See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns. +```cpp +// MSB radix pass over a uint16_t tile: idx is unused but still required. +THISTOGRAM(dstTile, srcTile, idxTile); + +// Subsequent LSB pass: idx must be loaded with the per-row mask first. +TLOAD(idxTile, idxGlobal); +THISTOGRAM(dstTile, srcTile, idxTile); +``` + +## See Also + +- Instruction set overview: [Irregular / Complex](../../irregular-and-complex.md) +- Previous: [pto.trandom](./trandom.md) +- Related stats / sort ops: [pto.tsort32](./tsort32.md), [pto.tmrgsort](./tmrgsort.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartadd.md b/docs/isa/tile/ops/irregular-and-complex/tpartadd.md index 3245a21d8..642d83a43 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tpartadd.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartadd.md @@ -162,7 +162,7 @@ void example_manual() { pto.tpartadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.ttri](./ttri.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmax.md b/docs/isa/tile/ops/irregular-and-complex/tpartmax.md index 61b3f8e79..125dd5665 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tpartmax.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartmax.md @@ -162,7 +162,7 @@ void example_manual() { pto.tpartmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.tpartmul](./tpartmul.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmin.md 
b/docs/isa/tile/ops/irregular-and-complex/tpartmin.md index 755535af1..bab92e4ac 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tpartmin.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartmin.md @@ -162,8 +162,8 @@ void example_manual() { pto.tpartmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.tpartmax](./tpartmax.md) -- Next op in instruction set: [pto.tgatherb](./tgatherb.md) +- Next op in instruction set: [pto.tquant](./tquant.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tpartmul.md b/docs/isa/tile/ops/irregular-and-complex/tpartmul.md index 52962757a..03be7a1df 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tpartmul.md +++ b/docs/isa/tile/ops/irregular-and-complex/tpartmul.md @@ -172,7 +172,7 @@ void example_manual() { pto.tpartmul ins(%src0, %src1 : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.tpartadd](./tpartadd.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tquant.md b/docs/isa/tile/ops/irregular-and-complex/tquant.md index c966c44a9..5bc67a4ee 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tquant.md +++ b/docs/isa/tile/ops/irregular-and-complex/tquant.md @@ -123,6 +123,8 @@ See related examples in `docs/isa/` and `docs/coding/tutorials/`. 
pto.tquant ins(%src, %qp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also -- Instruction set overview: [Irregular And Complex](../../../tile/irregular-and-complex.md) +- Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) +- Previous op in instruction set: [pto.tpartmin](./tpartmin.md) +- Next op in instruction set: [pto.tdequant](./tdequant.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tscatter.md b/docs/isa/tile/ops/irregular-and-complex/tscatter.md index 3581b98cf..b18bb34ed 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tscatter.md +++ b/docs/isa/tile/ops/irregular-and-complex/tscatter.md @@ -163,7 +163,8 @@ void example_manual() { pto.tscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.tgatherb](./tgatherb.md) +- Next op in instruction set: [pto.tci](./tci.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/tsort32.md b/docs/isa/tile/ops/irregular-and-complex/tsort32.md index 772a9d53c..8e17291ae 100644 --- a/docs/isa/tile/ops/irregular-and-complex/tsort32.md +++ b/docs/isa/tile/ops/irregular-and-complex/tsort32.md @@ -175,7 +175,7 @@ void example_manual() { pto.tsort32 ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.tmrgsort](./tmrgsort.md) diff --git a/docs/isa/tile/ops/irregular-and-complex/ttri.md b/docs/isa/tile/ops/irregular-and-complex/ttri.md index 00ef1867f..9b68097b5 100644 --- a/docs/isa/tile/ops/irregular-and-complex/ttri.md +++ b/docs/isa/tile/ops/irregular-and-complex/ttri.md 
@@ -124,7 +124,7 @@ See related examples in `docs/isa/` and `docs/coding/tutorials/`. pto.ttri ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Irregular And Complex](../../irregular-and-complex.md) - Previous op in instruction set: [pto.tci](./tci.md) diff --git a/docs/isa/tile/ops/layout-and-rearrangement/textract.md b/docs/isa/tile/ops/layout-and-rearrangement/textract.md index 7e2fa560b..c88db84fc 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/textract.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/textract.md @@ -146,6 +146,19 @@ PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData - `reluMode` (optional) — `ReluPreMode::{NoRelu, NormalRelu}`. - `preQuantScalar` (scalar-quant variant only) — scalar quantization factor. +## Expected Outputs + +| Result | Type | Description | +|---|---|---| +| `RecordEvent` | token | Signals completion of the extraction. | +| `dst` | tile | Holds the `(DstRows, DstCols)` sub-window of `src` starting at `(indexRow, indexCol)`. For fix-pipe variants, values are quantized through the fix-pipe path. | + +## Side Effects + +- **Standard variants**: No architectural side effects beyond producing the destination tile. +- **Fix-pipe variant (`TEXTRACT_FP`)**: Programs the FPC sideband state with the offset `indexCol` before the data is routed through the fix-pipe quantization pipeline. +- Does not implicitly fence unrelated tile traffic. + ## Constraints !!! 
warning "Constraints" @@ -154,7 +167,15 @@ PTO_INST RecordEvent TEXTRACT_FP(DstTileData &dst, SrcTileData &src, FpTileData - **Fp tile location**: `FpTileData::Loc` must be `TileType::Scaling` (A2/A3 and A5 both enforce this via `static_assert`) - **Fix-pipe path**: The backend offsets the FPC address by `indexCol` (counted in units of the fp tile's element width) before configuring the fix-pipe -## Common Patterns +## Exceptions + +!!! danger "Exceptions" + - Illegal operand tuples, unsupported tile-type pairs, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set. + - Out-of-range `(indexRow, indexCol)` (window exceeds `src` bounds) is rejected at compile time when shapes are static, otherwise undefined at runtime. + - `TEXTRACT_FP` with `FpTileData::Loc != TileType::Scaling` is rejected by `static_assert`. + - Programs must not rely on behavior outside the documented legal domain of this operation. + +## Examples ### Pattern 1: Extract Left Block from Matrix (GEMM Setup) diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md index 1f222bc99..6c257ea21 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-expand.md @@ -110,8 +110,8 @@ See related examples in `docs/isa/` and `docs/coding/tutorials/`. 
pto.tfillpad_expand ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Layout And Rearrangement](../../layout-and-rearrangement.md) - Previous op in instruction set: [pto.tfillpad_inplace](./tfillpad-inplace.md) -- Next op in instruction set: [pto.tmov](./tmov.md) +- Next op in instruction set: [pto.timg2col](./timg2col.md) diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md index 90c9f7aeb..b9d34f837 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad-inplace.md @@ -110,7 +110,7 @@ See related examples in `docs/isa/` and `docs/coding/tutorials/`. pto.tfillpad_inplace ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Layout And Rearrangement](../../layout-and-rearrangement.md) - Previous op in instruction set: [pto.tfillpad](./tfillpad.md) diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md index afd89e251..cae66e819 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/tfillpad.md @@ -162,8 +162,8 @@ void example2() { pto.tfillpad ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Layout And Rearrangement](../../layout-and-rearrangement.md) -- Previous op in instruction set: [pto.tinsert_fp](./tinsert.md) +- Previous op in instruction set: [pto.tinsert](./tinsert.md) - Next op in instruction set: [pto.tfillpad_inplace](./tfillpad-inplace.md) diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tinsert.md b/docs/isa/tile/ops/layout-and-rearrangement/tinsert.md index 
611eee11e..5701e1d5c 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/tinsert.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/tinsert.md @@ -165,6 +165,19 @@ PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, - `reluMode` (optional) — `ReluPreMode::{NoRelu, NormalRelu}`. - `preQuantScalar` (scalar-quant variant only) — scalar quantization factor. +## Expected Outputs + +| Result | Type | Description | +|---|---|---| +| `RecordEvent` | token | Signals completion of the insertion. | +| `dst` | tile | The `(SrcRows, SrcCols)` sub-region of `dst` starting at `(indexRow, indexCol)` is overwritten with `src` (or with quantized `src` for `TINSERT_FP`); the rest of `dst` is unchanged. | + +## Side Effects + +- **Standard variants**: No architectural side effects beyond updating the destination region. +- **Fix-pipe variant (`TINSERT_FP`)**: Programs the FPC sideband state before the fix-pipe quantization. On the CPU simulator the `fp` parameter is ignored and the call falls back to standard `TINSERT`. +- Does not implicitly fence unrelated tile traffic. + ## Constraints !!! warning "Constraints" @@ -174,7 +187,16 @@ PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src, - **A5 fix-pipe**: Destination must be `TileType::Mat` with `BLayout::ColMajor + SLayout::RowMajor`; source must be `float` or `int32_t` `Acc` - **Cpu simulator**: `TINSERT_FP` accepts the interface but ignores the `fp` parameter, falling back to standard `TINSERT` -## Common Patterns +## Exceptions + +!!! danger "Exceptions" + - Illegal operand tuples, unsupported tile-type pairs, invalid layout combinations, or unsupported target-profile modes are rejected by the verifier or by the selected backend instruction set. + - Out-of-range `(indexRow, indexCol)` (window exceeds `dst` bounds) is rejected at compile time when shapes are static, otherwise undefined at runtime. + - `TINSERT_FP` with `FpTileData::Loc != TileType::Scaling` is rejected by `static_assert`. 
+ - On A5, `TINSERT_FP` requires destination `TileType::Mat` with `BLayout::ColMajor + SLayout::RowMajor`; other layouts are rejected. + - Programs must not rely on behavior outside the documented legal domain of this operation. + +## Examples ### Pattern 1: Accumulator Insert into Matrix diff --git a/docs/isa/tile/ops/layout-and-rearrangement/tpack.md b/docs/isa/tile/ops/layout-and-rearrangement/tpack.md index 0d013a15c..3a07de581 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/tpack.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/tpack.md @@ -4,38 +4,89 @@ ![TPACK tile operation](../../../../figures/isa/TPACK.svg) -## Introduction +## Summary -Pack or convert tile elements into a narrower destination representation. +`pto.tpack` packs (or narrows) tile elements into a more compact destination representation — for example, packing two FP16 lanes into a single FP32 word, or compressing INT16 lanes into INT8. It is the dual of `pto.tunpack` and is commonly used to prepare quantised activations for subsequent layout-converting moves. -## Math Interpretation +## Mechanism -Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region. +For every destination element index `i` in the valid region, `tpack` consumes one or more contiguous source lanes and writes a packed word: -## Assembly Syntax +$$\mathrm{dst}_i = \mathrm{Pack}(\mathrm{src}_{k \cdot i},\, \dots,\, \mathrm{src}_{k \cdot i + k - 1})$$ + +where $k$ is the packing ratio set by `(SrcDType, DstDType)`. The op runs on the **vector pipe** (PIPE_V) as a structured shuffle/narrow. + +## Syntax Textual spelling is defined by the PTO ISA syntax-and-operands pages. ### IR Level 1 (SSA) ```text -%dst = pto.tpack ... +%dst, %event = pto.tpack %src : (!pto.tile) -> (!pto.tile, !pto.record_event) ``` ### IR Level 2 (DPS) ```text -pto.tpack ins(...) 
outs(%dst : !pto.tile_buf<...>) +%event = pto.tpack ins(%src) outs(%dst : !pto.tile_buf<...>) -> !pto.record_event ``` + ## C++ Intrinsic -Declared in `include/pto/common/pto_instr.hpp`. +!!! warning "No public C++ intrinsic yet" + `pto.tpack` is reserved in the ISA but **does not currently expose a public `TPACK(...)` declaration in `include/pto/common/pto_instr.hpp`** or in the per-backend headers. The signature shown below is the intended contract once the intrinsic lands; until then, `tpack` is reachable only through internal lowering paths and tests, and user code must not depend on the C++ form. + +Intended (not-yet-public) signature: + +```cpp +template <typename DstTileData, typename SrcTileData, typename... WaitEvents> +PTO_INST RecordEvent TPACK(DstTileData &dst, SrcTileData &src, WaitEvents&... events); +``` + +## Inputs + +| Operand | Description | +|---------|-------------| +| `src` | Source tile carrying the wider/looser element type. | +| `events...` | Optional `RecordEvent` tokens to wait on before issuing. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals completion of the pack. | +| `dst` | tile | Holds the packed/narrowed representation over the valid region. | + +## Side Effects + +- Issues a single PIPE_V instruction; no DMA traffic. +- Does not implicitly fence unrelated tile traffic. ## Constraints !!! warning "Constraints" - Refer to backend-specific legality checks for data type/layout/location/shape constraints. + - `(SrcDType, DstDType)` must be a supported packing pair on the target backend; unsupported pairs are rejected by `static_assert`. + - The destination element count must equal the `src` element count divided by the packing ratio `k`. + - `dst` and `src` must not alias for non-1:1 packing ratios; use a separate buffer. + - Refer to backend-specific legality checks for data type/layout/location/shape constraints not covered above. + +## Exceptions + +!!!
danger "Exceptions" + - Unsupported `(SrcDType, DstDType)` pairs are rejected at compile time via `static_assert`. + - Aliasing `dst` and `src` for ratios `k > 1` is undefined behavior. + - Programs must not rely on behavior outside the documented legal domain of this operation. ## Examples +No public C++ intrinsic is declared yet (see the C++ Intrinsic section above), so user kernels cannot call `TPACK(...)` directly. Once the intrinsic lands, follow the canonical pattern of declaring source/destination tiles with the matching packing ratio and issuing a single `TPACK(dst, src);` on the vector pipe. + See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns. + +## See Also + +- Instruction set overview: [Layout and Rearrangement](../../layout-and-rearrangement.md) +- Previous: [pto.tconcat](./tconcat.md) +- Next: [pto.textract](./textract.md) +- Related quantization op: [pto.tquant](../irregular-and-complex/tquant.md) diff --git a/docs/isa/tile/ops/layout-and-rearrangement/ttrans.md b/docs/isa/tile/ops/layout-and-rearrangement/ttrans.md index b62a7fda9..8c7b4fb40 100644 --- a/docs/isa/tile/ops/layout-and-rearrangement/ttrans.md +++ b/docs/isa/tile/ops/layout-and-rearrangement/ttrans.md @@ -167,7 +167,8 @@ void example_manual() { pto.ttrans ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Layout And Rearrangement](../../layout-and-rearrangement.md) - Previous op in instruction set: [pto.treshape](./treshape.md) +- Next op in instruction set: [pto.tconcat](./tconcat.md) diff --git a/docs/isa/tile/ops/memory-and-data-movement/mgather.md b/docs/isa/tile/ops/memory-and-data-movement/mgather.md index cb50a5408..ed24d8919 100644 --- a/docs/isa/tile/ops/memory-and-data-movement/mgather.md +++ b/docs/isa/tile/ops/memory-and-data-movement/mgather.md @@ -127,8 +127,8 @@ See related examples in `docs/isa/` and `docs/coding/tutorials/`. 
pto.mgather ins(%mem, %idx : !pto.partition_tensor_view, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Memory And Data Movement](../../memory-and-data-movement.md) -- Previous op in instruction set: [pto.tstore_fp](./tstore.md) +- Previous op in instruction set: [pto.tstore](./tstore.md) - Next op in instruction set: [pto.mscatter](./mscatter.md) diff --git a/docs/isa/tile/ops/memory-and-data-movement/mscatter.md b/docs/isa/tile/ops/memory-and-data-movement/mscatter.md index 7fc3b8eef..46bb24240 100644 --- a/docs/isa/tile/ops/memory-and-data-movement/mscatter.md +++ b/docs/isa/tile/ops/memory-and-data-movement/mscatter.md @@ -133,7 +133,7 @@ mscatter %src, %mem, %idx : !pto.memref<...>, !pto.tile<...>, !pto.tile<...> pto.mscatter ins(%src, %idx : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Memory And Data Movement](../../memory-and-data-movement.md) - Previous op in instruction set: [pto.mgather](./mgather.md) diff --git a/docs/isa/tile/ops/memory-and-data-movement/tprefetch.md b/docs/isa/tile/ops/memory-and-data-movement/tprefetch.md index cf38f5802..614af13b0 100644 --- a/docs/isa/tile/ops/memory-and-data-movement/tprefetch.md +++ b/docs/isa/tile/ops/memory-and-data-movement/tprefetch.md @@ -119,7 +119,7 @@ See related examples in `docs/isa/` and `docs/coding/tutorials/`. 
pto.tprefetch ins(%src : !pto.global<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Memory And Data Movement](../../memory-and-data-movement.md) - Previous op in instruction set: [pto.tload](./tload.md) diff --git a/docs/isa/tile/ops/memory-and-data-movement/tstore.md b/docs/isa/tile/ops/memory-and-data-movement/tstore.md index 85a858069..fad9ec9ad 100644 --- a/docs/isa/tile/ops/memory-and-data-movement/tstore.md +++ b/docs/isa/tile/ops/memory-and-data-movement/tstore.md @@ -27,6 +27,26 @@ The auxiliary `fp` tile is the **sideband configuration tile** consumed by the b $$ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{Quantize}\!\left(\mathrm{src}_{i,j};\ \mathrm{fp}\right) $$ +## C++ Intrinsic + +Declared in `include/pto/common/pto_instr.hpp`. Two storage-path variants are exposed; their full signatures are listed in the [Variants](#variants) section below. + +```cpp +// Standard store +template +PTO_INST RecordEvent TSTORE(GlobalData &dst, TileData &src, WaitEvents &... events); + +// Fix-pipe quantized store (Acc only) +template +PTO_INST RecordEvent TSTORE_FP(GlobalData &dst, TileData &src, FpTileData &fp, + WaitEvents &... events); +``` + ## Variants ### Variant 1: Standard Store @@ -177,7 +197,7 @@ After the store completes, the data is written to `dst`. With atomic modes, valu - Programs must not rely on behavior outside the documented legal domain. - Calling `TSTORE_FP` on a non-accumulator tile is rejected by the backend. 
-## Common Patterns +## Examples ### Pattern 1: Basic Vector Tile Store diff --git a/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md b/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md index 40e40c778..eb9a6aef1 100644 --- a/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md +++ b/docs/isa/tile/ops/reduce-and-expand/tcolexpandadd.md @@ -155,7 +155,7 @@ void example_manual() { pto.tcolexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) - Previous op in instruction set: [pto.tcolexpandmul](./tcolexpandmul.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md b/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md index e5ae02945..0a26662c0 100644 --- a/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md +++ b/docs/isa/tile/ops/reduce-and-expand/tcolexpanddiv.md @@ -155,8 +155,8 @@ void example_manual() { pto.tcolexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.tcolexpand](./tcolexpand.md) -- Next op in instruction set: [pto.tcolexpandmul](./tcolexpandmul.md) +- Previous op in instruction set: [pto.tcolexpandmul](./tcolexpandmul.md) +- Next op in instruction set: [pto.tcolexpandmax](./tcolexpandmax.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md b/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md index 4128cbac4..4e6892e24 100644 --- a/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md +++ b/docs/isa/tile/ops/reduce-and-expand/tcolexpandexpdif.md @@ -155,7 +155,7 @@ void example_manual() { pto.tcolexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## 
Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.tcolexpandsub](./tcolexpandsub.md) +- Previous op in instruction set: [pto.tcolexpandmin](./tcolexpandmin.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md b/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md index 5927a8d49..7159bf5f6 100644 --- a/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md +++ b/docs/isa/tile/ops/reduce-and-expand/tcolexpandmax.md @@ -155,8 +155,8 @@ void example_manual() { pto.tcolexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.tcolexpandadd](./tcolexpandadd.md) +- Previous op in instruction set: [pto.tcolexpanddiv](./tcolexpanddiv.md) - Next op in instruction set: [pto.tcolexpandmin](./tcolexpandmin.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md b/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md index 7dcdd736d..6e2a7f6cf 100644 --- a/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md +++ b/docs/isa/tile/ops/reduce-and-expand/tcolexpandmin.md @@ -155,8 +155,8 @@ void example_manual() { pto.tcolexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) - Previous op in instruction set: [pto.tcolexpandmax](./tcolexpandmax.md) -- Next op in instruction set: [pto.tcolexpandsub](./tcolexpandsub.md) +- Next op in instruction set: [pto.tcolexpandexpdif](./tcolexpandexpdif.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md b/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md index 7ea5d4dc6..3bee4a96d 100644 --- 
a/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md +++ b/docs/isa/tile/ops/reduce-and-expand/tcolexpandmul.md @@ -155,8 +155,8 @@ void example_manual() { pto.tcolexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.tcolexpanddiv](./tcolexpanddiv.md) -- Next op in instruction set: [pto.tcolexpandadd](./tcolexpandadd.md) +- Previous op in instruction set: [pto.tcolexpandsub](./tcolexpandsub.md) +- Next op in instruction set: [pto.tcolexpanddiv](./tcolexpanddiv.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md b/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md index 89894672b..2e09b884c 100644 --- a/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md +++ b/docs/isa/tile/ops/reduce-and-expand/tcolexpandsub.md @@ -155,8 +155,8 @@ void example_manual() { pto.tcolexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.tcolexpandmin](./tcolexpandmin.md) -- Next op in instruction set: [pto.tcolexpandexpdif](./tcolexpandexpdif.md) +- Previous op in instruction set: [pto.tcolexpandadd](./tcolexpandadd.md) +- Next op in instruction set: [pto.tcolexpandmul](./tcolexpandmul.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md b/docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md index 456e61a42..fb622e08d 100644 --- a/docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md +++ b/docs/isa/tile/ops/reduce-and-expand/trowexpandadd.md @@ -164,8 +164,8 @@ void example_manual() { pto.trowexpandadd ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / 
Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.trowexpandsub](./trowexpandsub.md) -- Next op in instruction set: [pto.trowexpandmax](./trowexpandmax.md) +- Previous op in instruction set: [pto.trowexpand](./trowexpand.md) +- Next op in instruction set: [pto.trowexpandsub](./trowexpandsub.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md b/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md index 13e411490..a04a591cf 100644 --- a/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md +++ b/docs/isa/tile/ops/reduce-and-expand/trowexpanddiv.md @@ -151,8 +151,8 @@ void example_manual() { pto.trowexpanddiv ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.trowexpand](./trowexpand.md) -- Next op in instruction set: [pto.trowexpandmul](./trowexpandmul.md) +- Previous op in instruction set: [pto.trowexpandmul](./trowexpandmul.md) +- Next op in instruction set: [pto.trowexpandmax](./trowexpandmax.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md b/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md index bf0ca547d..9947898ef 100644 --- a/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md +++ b/docs/isa/tile/ops/reduce-and-expand/trowexpandexpdif.md @@ -164,8 +164,8 @@ void example_manual() { pto.trowexpandexpdif ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) - Previous op in instruction set: [pto.trowexpandmin](./trowexpandmin.md) -- Next op in instruction set: [pto.tcolmin](./tcolmin.md) +- Next op in instruction set: 
[pto.tcolexpand](./tcolexpand.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md b/docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md index f0332d2b9..e22051b52 100644 --- a/docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md +++ b/docs/isa/tile/ops/reduce-and-expand/trowexpandmax.md @@ -164,8 +164,8 @@ void example_manual() { pto.trowexpandmax ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.trowexpandadd](./trowexpandadd.md) +- Previous op in instruction set: [pto.trowexpanddiv](./trowexpanddiv.md) - Next op in instruction set: [pto.trowexpandmin](./trowexpandmin.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md b/docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md index 315d1c88e..c76b345e3 100644 --- a/docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md +++ b/docs/isa/tile/ops/reduce-and-expand/trowexpandmin.md @@ -164,7 +164,7 @@ void example_manual() { pto.trowexpandmin ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) - Previous op in instruction set: [pto.trowexpandmax](./trowexpandmax.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md b/docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md index a55973ced..72ef81ca5 100644 --- a/docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md +++ b/docs/isa/tile/ops/reduce-and-expand/trowexpandmul.md @@ -151,8 +151,8 @@ void example_manual() { pto.trowexpandmul ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And 
Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.trowexpanddiv](./trowexpanddiv.md) -- Next op in instruction set: [pto.trowexpandsub](./trowexpandsub.md) +- Previous op in instruction set: [pto.trowexpandsub](./trowexpandsub.md) +- Next op in instruction set: [pto.trowexpanddiv](./trowexpanddiv.md) diff --git a/docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md b/docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md index 113caa688..93f124368 100644 --- a/docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md +++ b/docs/isa/tile/ops/reduce-and-expand/trowexpandsub.md @@ -151,8 +151,8 @@ void example_manual() { pto.trowexpandsub ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) ``` -## Related Ops / Instruction Set Links +## See Also - Instruction set overview: [Reduce And Expand](../../reduce-and-expand.md) -- Previous op in instruction set: [pto.trowexpandmul](./trowexpandmul.md) -- Next op in instruction set: [pto.trowexpandadd](./trowexpandadd.md) +- Previous op in instruction set: [pto.trowexpandadd](./trowexpandadd.md) +- Next op in instruction set: [pto.trowexpandmul](./trowexpandmul.md) diff --git a/docs/isa/tile/ops/tile-scalar-and-immediate/taxpy.md b/docs/isa/tile/ops/tile-scalar-and-immediate/taxpy.md index a2d6e7b61..8bf7ba590 100644 --- a/docs/isa/tile/ops/tile-scalar-and-immediate/taxpy.md +++ b/docs/isa/tile/ops/tile-scalar-and-immediate/taxpy.md @@ -4,38 +4,94 @@ ![TAXPY tile operation](../../../../figures/isa/TAXPY.svg) -## Introduction +## Summary -AXPY-style fused update: multiply a tile by a scalar and accumulate into the destination tile. +`pto.taxpy` is the tile-level form of the BLAS AXPY primitive. It updates `dst` in place with `dst = dst + src0 * scalar` elementwise across the destination valid region in a single fused vector operation, avoiding the temporary tile that a separate TMULS + TADD pair would materialise. 
-## Math Interpretation +## Mechanism -Semantics are instruction-specific. Unless stated otherwise, behavior is defined over the destination valid region. +For every element index `i` in the destination valid region: -## Assembly Syntax +$$\mathrm{dst}_i \mathrel{+}= \alpha \cdot \mathrm{src0}_i$$ + +where $\alpha$ is the scalar argument. The op executes on the **vector pipe** (PIPE_V) as a fused multiply-add against a broadcast scalar, accumulating into `dst` with no intermediate tile written back. + +## Syntax Textual spelling is defined by the PTO ISA syntax-and-operands pages. ### IR Level 1 (SSA) ```text -%dst = pto.taxpy ... +%dst, %event = pto.taxpy %dst_in, %src0, %scalar : (!pto.tile<...>, !pto.tile<...>, ) -> (!pto.tile<...>, !pto.record_event) ``` ### IR Level 2 (DPS) ```text -pto.taxpy ins(...) outs(%dst : !pto.tile_buf<...>) +%event = pto.taxpy ins(%src0, %scalar) outs(%dst : !pto.tile_buf<...>) -> !pto.record_event ``` + ## C++ Intrinsic -Declared in `include/pto/common/pto_instr.hpp`. +Declared in `include/pto/common/pto_instr.hpp`: + +```cpp +template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents> +PTO_INST RecordEvent TAXPY(TileDataDst &dst, TileDataSrc &src0, + typename TileDataSrc::DType scalar, + WaitEvents&... events); +``` + +## Inputs + +| Operand | Description | +|---------|-------------| +| `dst` | Accumulator tile; read and written in place (`dst += src0 * scalar`). | +| `src0` | Tile multiplied by `scalar`. | +| `scalar` | Broadcast scalar coefficient with element type `TileDataSrc::DType`. | +| `events...` | Optional `RecordEvent` tokens to wait on before issuing. | + +## Expected Outputs + +| Result | Type | Description | +|--------|------|-------------| +| `RecordEvent` | token | Signals completion of the fused multiply-add. | +| `dst` | tile | Holds `dst + src0 * scalar` over the valid region; padding region is unmodified. | + +## Side Effects + +- Issues a single PIPE_V instruction; no DMA traffic. +- Does not implicitly fence unrelated tile traffic.
+- Reads `scalar` once (broadcast); does not consume index resources. ## Constraints !!! warning "Constraints" - Refer to backend-specific legality checks for data type/layout/location/shape constraints. + - `dst` and `src0` must share the same shape, layout, and element type. + - `scalar` must match `TileDataSrc::DType`; no implicit narrowing conversions are performed. + - `dst` is read-modify-written in place; the prior contents of `dst` are part of the result. + - Refer to backend-specific legality checks for data type/layout/location/shape constraints not covered above. + +## Exceptions + +!!! danger "Exceptions" + - Mismatched element types between `src0` and `dst` are rejected at compile time via `static_assert`. + - A `scalar` whose type cannot be broadcast-converted to the tile element type is rejected at compile time. + - Programs must not rely on behavior outside the documented legal domain of this operation. ## Examples +```cpp +// dst += 0.5f * src0, fused multiply-accumulate in place. +TAXPY(dst, src0, 0.5f); +``` + See related instruction pages in `docs/isa/` for concrete Auto/Manual usage patterns. 
+ +## See Also + +- Instruction set overview: [Tile-Scalar / Immediate](../../tile-scalar-and-immediate.md) +- Previous: [pto.tadds](./tadds.md) +- Next: [pto.tsubs](./tsubs.md) +- Non-fused composition: [pto.tmuls](./tmuls.md) + [pto.tadds](./tadds.md) diff --git a/docs/isa/tile/tile-scalar-and-immediate.md b/docs/isa/tile/tile-scalar-and-immediate.md index fcd889289..85e2dc289 100644 --- a/docs/isa/tile/tile-scalar-and-immediate.md +++ b/docs/isa/tile/tile-scalar-and-immediate.md @@ -7,7 +7,7 @@ Tile-scalar operations combine a tile operand with a scalar value or immediate o | Operation | Description | Category | C++ Intrinsic | |-----------|-------------|----------|----------------| | [pto.tadds](./ops/tile-scalar-and-immediate/tadds.md) | Elementwise addition with scalar | Binary | `TADDS(dst, src, scalar)` | -| [pto.taxpy](./ops/tile-scalar-and-immediate/taxpy.md) | AXPY-style fused tile-scalar update | Fused binary | `TAXPY(dst, src0, scalar, src1)` | +| [pto.taxpy](./ops/tile-scalar-and-immediate/taxpy.md) | AXPY-style fused tile-scalar update | Fused binary | `TAXPY(dst, src0, scalar)` | | [pto.tsubs](./ops/tile-scalar-and-immediate/tsubs.md) | Elementwise subtraction with scalar | Binary | `TSUBS(dst, src, scalar)` | | [pto.tmuls](./ops/tile-scalar-and-immediate/tmuls.md) | Elementwise multiplication with scalar | Binary | `TMULS(dst, src, scalar)` | | [pto.tpows](./ops/tile-scalar-and-immediate/tpows.md) | Elementwise power with scalar exponent | Binary | `TPOWS(dst, base, exp, tmp)` |