120 changes: 62 additions & 58 deletions docs/en/dev/passes/18-lower_transpose_load_param_layout.md
@@ -1,32 +1,36 @@
# LowerTransposeLoadParamLayout Pass

Lowers ``tile.load(..., transpose=True)`` to canonical-form DN parameter layout (RFC #1300 P6).
Lowers ``tile.load(..., transpose=True)`` by emitting an explicit
``tensor.as_layout`` view inside the InCore body (RFC #1300 P6).

## Overview

Before this pass, ``tile.load(transpose=True)`` is the user's way of saying "I want
the column-major view of this source tensor at the load site". After this pass, that
intent is encoded into the InCore parameter's TensorType itself — the source/load
combo is rewritten to RFC #1300 §3.3 canonical form so codegen, verifier, and
downstream passes consume a single, self-consistent ``(shape, stride, layout)`` triple.
Before this pass, ``tile.load(transpose=True)`` is the user's way of saying "I
want the column-major view of this source tensor at the load site". After this
pass, that intent is encoded into a body-local ``tensor.as_layout`` view at the
top of the InCore body so codegen, verifier, and downstream passes consume a
single, self-consistent ``(shape, stride, layout)`` triple.

For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``:

- ``p``'s TensorType is promoted from ``[..., a, b] ND`` to ``[..., b, a] DN`` —
the trailing-pair shape swap plus the DN layout tag. The new TensorView carries
an empty stride; ``MaterializeTensorStrides`` (which runs later in the default
pipeline, after ``CanonicalizeIOOrder``) fills it with the packed canonical
strides.
- Every ``tile.load(p, offsets, shapes, valid_shapes, ..., transpose=True)`` whose
source is a promoted parameter is rewritten so the three tuples' trailing pair
is swapped to canonical coords and the ``transpose=True`` kwarg is dropped.
``DeduceTileLoadType`` reads the source's DN layout to derive the Mat tile-view
- The InCore body is **prepended** with ``p_dn = tensor.as_layout(p, layout=DN)``.
The new Var ``p_dn`` carries the canonical ``[..., b, a] DN`` view (trailing-pair
shape swap + DN layout tag with packed canonical strides set by the
``tensor.as_layout`` deduce-type).
- Body uses of ``p`` are substituted with ``p_dn``. ``p``'s parameter
signature is left unchanged — the orch side keeps passing its original
row-major ND tensor (which matches the runtime torch tensor's layout).
- Every ``tile.load(p, offsets, shapes, valid_shapes, ..., transpose=True)``
whose source is a promoted parameter is rewritten to ``tile.load(p_dn, ...)``,
with the three tuples' trailing pair swapped to canonical coords and
``transpose=True`` flipped to ``transpose=False``.
``DeduceTileLoadType`` reads ``p_dn``'s DN layout to derive the Mat tile-view
layout that the legacy ``transpose=True`` swap produced — the two signals are
equivalent (§4.2 canonical pair).
- Every non-InCore call site that targets a promoted callee wraps the promoted
argument in ``tensor.as_layout(arg, DN)`` (RFC #1300 P4). The bridging op is
pure metadata — it emits no PTOAS instruction; ``make_tensor_view`` consumes
the new view directly.

Non-InCore (orch) functions are not touched. The DN reinterpret is a
single-function concern owned by the InCore body that needs it, which keeps the
cross-function boundary trivial: orch always passes a row-major ND tensor.

**Requirements**:

@@ -37,9 +41,7 @@ For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``

**When to use**: 18th pass in the ``Default`` strategy, after
``InferTileMemorySpace`` and before ``ResolveBackendOpLayouts``. The 2D shape
produced by ``FlattenTileNdTo2D`` is a precondition. ``MaterializeTensorStrides``
runs later in the pipeline (after ``CanonicalizeIOOrder``) to materialize the
DN-packed canonical strides on the promoted parameters.
produced by ``FlattenTileNdTo2D`` is a precondition.

## API

@@ -64,26 +66,26 @@ For each InCore function f:
set P_nt = {param idx with tile.load(p, ..., transpose=False/absent)}
reject P_t ∩ P_nt (mixed-use)
for each idx in P_t:
promote f.params[idx].type: [..., a, b] ND → [..., b, a] DN (empty stride)
substitute old Var → new Var in body
rewrite each tile.load(promoted_param, off, shp, vs, transpose=True) in body:
let p = f.params[idx]
skip if p is already DN-tagged (the user-written / pre-canonical case)
build p_dn := tensor.as_layout(p, layout=DN) — type derived by op deducer
prepend (p_dn = ...) AssignStmt to body
record p → p_dn in substitution map
substitute body uses of every promoted p with p_dn
rewrite each tile.load(p_dn, off, shp, vs, transpose=True) in body:
swap last two dims of off / shp / vs
drop transpose=True kwarg

For each non-InCore function:
walk body; for every Call whose op is a GlobalVar of a promoted callee:
wrap each promoted-slot arg with tensor.as_layout(arg, DN)
(Non-InCore functions are untouched.)
```
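A plain-Python sketch of two pieces of bookkeeping from the pseudocode above: the mixed-use rejection (``P_t ∩ P_nt``) and the trailing-pair coordinate swap applied to ``off`` / ``shp`` / ``vs``. The helper names are hypothetical; the real pass walks IR nodes, not tuples.

```python
def swap_trailing_pair(coords):
    """Swap the last two entries of an offsets/shapes/valid_shapes tuple."""
    if len(coords) < 2:
        raise ValueError("rank >= 2 required")
    return coords[:-2] + (coords[-1], coords[-2])

def transposed_param_set(loads):
    """loads: list of (param_idx, transpose_flag) pairs observed in the body.
    Returns the set of promotable params; rejects mixed use of one param."""
    p_t = {i for i, t in loads if t}
    p_nt = {i for i, t in loads if not t}
    mixed = p_t & p_nt
    if mixed:
        raise ValueError(f"mixed transpose=True/False on params {sorted(mixed)}")
    return p_t

# A transposed load of a [32, 128] source rewrites its coords to DN order:
assert swap_trailing_pair((32, 128)) == (128, 32)
assert transposed_param_set([(1, True), (0, False)]) == {1}
```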

**Complexity:** O(N log N) — one body walk per function plus one program-wide call-site
walk. Map lookups (``promotions_by_callee_name``) are ``log N`` per call.
**Complexity:** O(N log N) — one body walk per InCore function.

| Behavior | Trigger |
| -------- | ------- |
| Promote param to ``[..., b, a] DN`` | InCore param is source of ``tile.load(..., transpose=True)`` |
| Prepend ``p_dn = tensor.as_layout(p, DN)`` and rewrite tile.load | InCore param is source of ``tile.load(..., transpose=True)`` |
| Skip param | Already DN, or no transposed load |
| Skip whole function | Function is Orchestration / Opaque / Group |
| Wrap call-site arg in ``tensor.as_layout`` | Non-InCore call to a promoted callee |
| Reject | Mixed transpose=True / transpose=False on same param |
| Reject | DN + explicit physical stride source (would compose as double transpose) |

@@ -118,28 +120,28 @@ class Before:
def matmul_incore(
self,
a: pl.Tensor[[64, 128], pl.FP32],
b: pl.Tensor[[128, 32], pl.FP32, pl.DN], # ← shape swapped + DN tag
b: pl.Tensor[[32, 128], pl.FP32], # ← param signature unchanged
c: pl.Out[pl.Tensor[[64, 32], pl.FP32]],
) -> pl.Tensor[[64, 32], pl.FP32]:
b_dn = tensor.as_layout(b, layout=DN) # ← prepended view
# type: [128, 32] DN
tile_a = pl.load(a, [0, 0], [64, 128], target_memory=pl.MemorySpace.Mat)
tile_b = pl.load(b, [0, 0], [128, 32], target_memory=pl.MemorySpace.Mat)
# ↑ no transpose kwarg
# ↑ shapes swapped to canonical coords
tile_b = pl.load(b_dn, [0, 0], [128, 32], target_memory=pl.MemorySpace.Mat)
# ↑ source switched to b_dn
# ↑ shapes swapped to canonical coords
# ↑ no transpose kwarg
...

@pl.function(type=pl.FunctionType.Orchestration)
def orchestrator(self, a, b):
c = pl.create_tensor([64, 32], dtype=pl.FP32)
# b is wrapped in tensor.as_layout to bridge ND → DN at the call site:
bridged_b = tensor.as_layout(b, pl.DN) # type: [128, 32] DN
return self.matmul_incore(a, bridged_b, c)
return self.matmul_incore(a, b, c) # ← unchanged
```

``a`` is loaded without transpose, so it is unchanged. ``b`` is promoted in the
InCore signature, all body loads of ``b`` are rewritten to canonical coords with
no transpose, and the orchestrator's call site wraps ``b`` in
``tensor.as_layout`` to bridge ``[32, 128] ND`` → ``[128, 32] DN`` over the same
physical buffer.
``a`` is loaded without transpose, so it is unchanged. ``b``'s param signature
is preserved; the kernel internally derives a DN view via ``tensor.as_layout``
and references that view in its ``tile.load``. The orchestrator is not
touched — it passes its own row-major ``b`` straight through.
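As an analogy (NumPy, not the ``pl`` runtime), the ND-to-DN reinterpret behaves like a transpose view: the shape and strides of the trailing pair swap, but no data moves and no buffer is allocated.

```python
import numpy as np

b = np.arange(32 * 128, dtype=np.float32).reshape(32, 128)  # row-major ND
b_dn = b.T  # [128, 32] view: shape swapped, strides swapped, no copy

assert b_dn.shape == (128, 32)
assert np.shares_memory(b, b_dn)     # same physical buffer
assert b_dn.strides == (4, 128 * 4)  # column-major over the original bytes
```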

## Implementation

@@ -163,29 +165,31 @@

| Function type | Action |
| ------------- | ------ |
| InCore (InCore, AIC, AIV) | Scanned, possibly promoted |
| Orchestration / Group / Opaque | Scanned for call sites; promoted-arg wrapped in ``tensor.as_layout`` |
| InCore (InCore, AIC, AIV) | Scanned, body prepended with ``tensor.as_layout`` views as needed |
| Orchestration / Group / Opaque | Untouched |

| Parameter state | Action |
| --------------- | ------ |
| Sourced by ``tile.load(..., transpose=True)``, layout != DN, rank ≥ 2 | Promoted (shape swap + DN tag) |
| Sourced by ``tile.load(..., transpose=True)``, already DN | Idempotent: unchanged |
| Sourced by ``tile.load(..., transpose=True)``, layout != DN, rank ≥ 2 | ``tensor.as_layout`` view prepended; body uses substituted |
| Sourced by ``tile.load(..., transpose=True)``, already DN | Skipped: ``DeduceTileLoadType`` already handles DN-source XOR transpose |
| Mixed transpose=True / transpose=False on same param | ``CHECK`` failure |
| Not sourced by any transposed load | Unchanged |
| Rank < 2 candidate | ``CHECK`` failure |

## Interaction with ``tensor.as_layout`` (P4) and ``MaterializeTensorStrides`` (P3)
## Interaction with ``tensor.as_layout`` (P4)

This pass is the first real consumer of ``tensor.as_layout`` in the default
pipeline. The bridging op is single-purpose: it flips the layout tag and derives
the new shape from §4.2 canonical pair semantics — callers never write the
target shape, so the call-site rewriter cannot get it wrong.
This pass is the first consumer of ``tensor.as_layout`` in the default
pipeline. The bridging op is single-purpose: it flips the layout tag and
derives the new shape from §4.2 canonical pair semantics, then attaches the
packed canonical strides via ``CanonicalizeView``. Codegen lowers
``tensor.as_layout`` to a fresh ``pto.make_tensor_view`` bound to the input
tensor's underlying SSA buffer with the LHS's
``(shape, stride, layout)`` triple — no PTOAS instruction is emitted; the
result is a pure metadata reinterpret.

Downstream, ``MaterializeTensorStrides`` fills the empty stride slot on each
promoted parameter with the packed canonical DN strides (RFC §2.4). The
combination of P6 + P3 is what gives codegen a self-consistent
``(shape, stride, layout)`` triple — no further ``dn_swap`` / ``get_shape_source_idx``
fix-ups are needed in the codegen path for promoted parameters.
Per RFC §4.2, the InCore-side reinterpret does not violate the "InCore cannot
create tensors" invariant: ``tensor.as_layout`` allocates nothing, it
re-describes the input's existing physical buffer.
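A sketch of what the packed canonical strides could look like for a DN view, under the assumption that DN reinterprets only the trailing dimension pair as column-major and packs everything gap-free. The authoritative rule is RFC #1300 §2.4; the function below is illustrative only.

```python
def packed_canonical_strides(shape, layout="ND"):
    """Packed (gap-free) element strides. Assumption: a DN view
    [..., b, a] strides its trailing pair as (1, b); ND is row-major."""
    strides = [0] * len(shape)
    if layout == "DN":
        strides[-2], strides[-1] = 1, shape[-2]
    else:
        strides[-2], strides[-1] = shape[-1], 1
    inner = shape[-1] * shape[-2]
    # Leading (batch) dims stay row-major and packed in both layouts.
    for i in range(len(shape) - 3, -1, -1):
        strides[i] = inner
        inner *= shape[i]
    return tuple(strides)

# The promoted [128, 32] DN view over a row-major [32, 128] buffer:
assert packed_canonical_strides((128, 32), "DN") == (1, 128)
assert packed_canonical_strides((64, 128)) == (128, 1)
```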

## Interaction with ``tensor.transpose`` at Orchestration

2 changes: 1 addition & 1 deletion docs/en/dev/passes/26-materialize_tensor_strides.md
@@ -66,7 +66,7 @@ The pass is **idempotent**: re-running on already-materialized IR is a no-op, si

## Example

**Before** — InCore param with empty-stride DN view (e.g. produced by a future `LowerTransposeLoadParamLayout` rewrite):
**Before** — InCore param with empty-stride DN view (user-written `pl.Tensor[..., pl.DN]` without an explicit stride hint):

```python
@pl.function(type=pl.FunctionType.InCore)