Merged
4 changes: 2 additions & 2 deletions .claude/rules/pass-doc-ordering.md
@@ -30,15 +30,15 @@ Developers read pass docs sequentially to understand the compilation pipeline. I
| 15 | `15-flatten_tile_nd_to_2d.md` | 15th pass |
| 16 | `16-auto_tile_matmul_l0.md` | 16th pass |
| 17 | `17-infer_tile_memory_space.md` | 17th pass |
-| 18 | `18-resolve_transpose_layout.md` | 18th pass |
+| 18 | `18-lower_transpose_load_param_layout.md` | 18th pass (RFC #1300 P6 — replaces ResolveTransposeLayout) |
| 19 | `19-resolve_backend_op_layouts.md` | 19th pass |
| 20 | `20-expand_mixed_kernel.md` | 20th pass |
| 21 | `21-inject_gm_pipe_buffer.md` | Runs immediately after `ExpandMixedKernel` (backend-gated, Ascend910B) |
| 22 | `22-split_vector_kernel.md` | 22nd pass |
| 23 | `23-normalize_return_order.md` | 23rd pass |
| 24 | `24-lower_pipeline_loops.md` | 24th pass |
| 25 | `25-canonicalize_io_order.md` | 25th pass |
-| 26 | `26-materialize_tensor_strides.md` | 26th pass (RFC #1300 P3 — registered, not yet wired into Default; activates with P6/P7) |
+| 26 | `26-materialize_tensor_strides.md` | 26th pass (RFC #1300 P3 — wired into Default starting from P6) |
| 27 | `27-init_memref.md` | 27th pass |
| 28 | `28-memory_reuse.md` | 28th pass |
| 29 | `29-legalize_pto_buffer_reuse.md` | 29th pass |
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -165,7 +165,7 @@ set(PYPTO_SOURCES
src/ir/transforms/pass_context.cpp
src/ir/transforms/passes.cpp
src/ir/transforms/resolve_backend_op_layouts_pass.cpp
-src/ir/transforms/resolve_transpose_layout_pass.cpp
+src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
src/ir/transforms/python_printer.cpp
src/ir/transforms/simplify_pass.cpp
src/ir/transforms/inject_gm_pipe_buffer_pass.cpp
4 changes: 2 additions & 2 deletions docs/en/dev/passes/00-pass_manager.md
@@ -373,7 +373,7 @@ The PTO-oriented tile stage shared by `Default` and `DebugTileOptimization` is:
2. [`FlattenTileNdTo2D`](15-flatten_tile_nd_to_2d.md)
3. [`AutoTileMatmulL0`](16-auto_tile_matmul_l0.md)
4. `InferTileMemorySpace`
-5. `ResolveTransposeLayout`
+5. [`LowerTransposeLoadParamLayout`](18-lower_transpose_load_param_layout.md) (RFC #1300 P6 — replaces `ResolveTransposeLayout`)
6. [`ResolveBackendOpLayouts`](19-resolve_backend_op_layouts.md)
7. `NormalizeStmtStructure`
8. `ExpandMixedKernel`
@@ -382,7 +382,7 @@ The PTO-oriented tile stage shared by `Default` and `DebugTileOptimization` is:
11. `NormalizeReturnOrder`
12. [`LowerPipelineLoops`](24-lower_pipeline_loops.md)
13. [`CanonicalizeIOOrder`](25-canonicalize_io_order.md)
-14. [`MaterializeTensorStrides`](26-materialize_tensor_strides.md) — registered, not yet wired into the default pipeline (will activate alongside the codegen cleanup in RFC #1300 P6/P7)
+14. [`MaterializeTensorStrides`](26-materialize_tensor_strides.md) — wired into the default pipeline starting from RFC #1300 P6
15. `InitMemRef`
16. `MemoryReuse`
17. [`LegalizePTOBufferReuse`](29-legalize_pto_buffer_reuse.md)
2 changes: 1 addition & 1 deletion docs/en/dev/passes/17-infer_tile_memory_space.md
@@ -15,7 +15,7 @@ After this pass every `TileType` in InCore functions carries a concrete `memory_
- InCore / Orchestration outlining must be done (`SplitIncoreOrch`)
- Statement structure must be normalized (`NormalizedStmtStructure`)

-**When to use**: Run immediately after `FlattenTileNdTo2D` and before `ResolveTransposeLayout` / `ResolveBackendOpLayouts` / `ExpandMixedKernel`. It is the canonical point at which tile memory becomes a contract that downstream passes (especially `ExpandMixedKernel`'s mixed-kernel detection and `InitMemRef`'s buffer allocation) read.
+**When to use**: Run immediately after `FlattenTileNdTo2D` and before `LowerTransposeLoadParamLayout` / `ResolveBackendOpLayouts` / `ExpandMixedKernel`. It is the canonical point at which tile memory becomes a contract that downstream passes (especially `ExpandMixedKernel`'s mixed-kernel detection and `InitMemRef`'s buffer allocation) read.

## API

200 changes: 200 additions & 0 deletions docs/en/dev/passes/18-lower_transpose_load_param_layout.md
@@ -0,0 +1,200 @@
# LowerTransposeLoadParamLayout Pass

Lowers ``tile.load(..., transpose=True)`` to canonical-form DN parameter layout (RFC #1300 P6).

## Overview

Before this pass, ``tile.load(transpose=True)`` is the user's way of saying "I want
the column-major view of this source tensor at the load site". After this pass, that
intent is encoded into the InCore parameter's TensorType itself — the source/load
combo is rewritten to RFC #1300 §3.3 canonical form so codegen, verifier, and
downstream passes consume a single, self-consistent ``(shape, stride, layout)`` triple.

For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``:

- ``p``'s TensorType is promoted from ``[..., a, b] ND`` to ``[..., b, a] DN`` —
the trailing-pair shape swap plus the DN layout tag. The new TensorView carries
an empty stride; ``MaterializeTensorStrides`` (which runs later in the default
pipeline, after ``CanonicalizeIOOrder``) fills it with the packed canonical
strides.
- Every ``tile.load(p, offsets, shapes, valid_shapes, ..., transpose=True)`` whose
source is a promoted parameter is rewritten so the three tuples' trailing pair
is swapped to canonical coords and the ``transpose=True`` kwarg is dropped.
``DeduceTileLoadType`` reads the source's DN layout to derive the Mat tile-view
layout that the legacy ``transpose=True`` swap produced — the two signals are
equivalent (§4.2 canonical pair).
- Every non-InCore call site that targets a promoted callee wraps the promoted
argument in ``tensor.as_layout(arg, DN)`` (RFC #1300 P4). The bridging op is
pure metadata — it emits no PTOAS instruction; ``make_tensor_view`` consumes
the new view directly.

**Requirements**:

- Input IR must be in SSA form
- InCore functions must already be split out (``SplitIncoreOrch``)
- Tile ops must be present and 2D (``IncoreTileOps``, ``TileOps2D``)
- Promoted parameters must have rank ≥ 2

**When to use**: 18th pass in the ``Default`` strategy, after
``InferTileMemorySpace`` and before ``ResolveBackendOpLayouts``. The 2D shape
produced by ``FlattenTileNdTo2D`` is a precondition. ``MaterializeTensorStrides``
runs later in the pipeline (after ``CanonicalizeIOOrder``) to materialize the
DN-packed canonical strides on the promoted parameters.

## API

| C++ | Python | Level |
| --- | ------ | ----- |
| ``pass::LowerTransposeLoadParamLayout()`` | ``passes.lower_transpose_load_param_layout()`` | Program-level |

**Python usage**:

```python
from pypto.pypto_core import passes

p = passes.lower_transpose_load_param_layout()
program_canonical = p(program)
```

## Algorithm

```text
For each InCore function f:
scan body → set P_t = {param idx with tile.load(p, ..., transpose=True)}
set P_nt = {param idx with tile.load(p, ..., transpose=False/absent)}
reject P_t ∩ P_nt (mixed-use)
for each idx in P_t:
promote f.params[idx].type: [..., a, b] ND → [..., b, a] DN (empty stride)
substitute old Var → new Var in body
rewrite each tile.load(promoted_param, off, shp, vs, transpose=True) in body:
swap last two dims of off / shp / vs
drop transpose=True kwarg

For each non-InCore function:
walk body; for every Call whose op is a GlobalVar of a promoted callee:
wrap each promoted-slot arg with tensor.as_layout(arg, DN)
```

**Complexity:** O(N log N) — one body walk per function plus one program-wide call-site
walk. Map lookups (``promotions_by_callee_name``) are ``log N`` per call.
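The scan-and-rewrite steps above can be sketched in plain Python. This is illustrative only: `Load`, `classify_params`, and `rewrite_load` are invented names for this example, and the real pass operates on the C++ IR, not on Python objects.

```python
from dataclasses import dataclass

@dataclass
class Load:
    """Toy stand-in for a tile.load site (hypothetical, for illustration)."""
    param: str            # name of the InCore parameter being loaded
    offsets: list
    shapes: list
    valid_shapes: list
    transpose: bool = False

def classify_params(loads):
    """Partition params into P_t (transposed loads) and P_nt, rejecting mixed use."""
    p_t = {ld.param for ld in loads if ld.transpose}
    p_nt = {ld.param for ld in loads if not ld.transpose}
    mixed = p_t & p_nt
    if mixed:
        raise ValueError(f"mixed transpose use on params: {sorted(mixed)}")
    return p_t

def swap_trailing_pair(xs):
    """The trailing-pair swap of RFC #1300 section 3.3: [..., a, b] -> [..., b, a]."""
    assert len(xs) >= 2, "promoted params must have rank >= 2"
    return xs[:-2] + [xs[-1], xs[-2]]

def rewrite_load(ld):
    """Rewrite one transposed load to canonical coords and drop the kwarg."""
    return Load(ld.param,
                swap_trailing_pair(ld.offsets),
                swap_trailing_pair(ld.shapes),
                swap_trailing_pair(ld.valid_shapes),
                transpose=False)
```

Promotion of the parameter's TensorType follows the same `swap_trailing_pair` rule on the shape, plus the DN layout tag and an empty stride slot.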

| Behavior | Trigger |
| -------- | ------- |
| Promote param to ``[..., b, a] DN`` | InCore param is source of ``tile.load(..., transpose=True)`` |
| Skip param | Already DN, or no transposed load |
| Skip whole function | Function is Orchestration / Opaque / Group |
| Wrap call-site arg in ``tensor.as_layout`` | Non-InCore call to a promoted callee |
| Reject | Mixed transpose=True / transpose=False on same param |
| Reject | DN + explicit physical stride source (would compose as double transpose) |

## Example

**Before**:

```python
@pl.program
class Before:
@pl.function(type=pl.FunctionType.InCore)
def matmul_incore(
self,
a: pl.Tensor[[64, 128], pl.FP32],
b: pl.Tensor[[32, 128], pl.FP32],
c: pl.Out[pl.Tensor[[64, 32], pl.FP32]],
) -> pl.Tensor[[64, 32], pl.FP32]:
tile_a = pl.load(a, [0, 0], [64, 128], target_memory=pl.MemorySpace.Mat)
tile_b = pl.load(b, [0, 0], [32, 128], target_memory=pl.MemorySpace.Mat, transpose=True)
...

@pl.function(type=pl.FunctionType.Orchestration)
def orchestrator(self, a, b):
c = pl.create_tensor([64, 32], dtype=pl.FP32)
return self.matmul_incore(a, b, c)
```

**After** (semantic — ``tensor.as_layout`` is an internal IR op, not exposed in pl.*):

```text
@pl.function(type=pl.FunctionType.InCore)
def matmul_incore(
self,
a: pl.Tensor[[64, 128], pl.FP32],
b: pl.Tensor[[128, 32], pl.FP32, pl.DN], # ← shape swapped + DN tag
c: pl.Out[pl.Tensor[[64, 32], pl.FP32]],
) -> pl.Tensor[[64, 32], pl.FP32]:
tile_a = pl.load(a, [0, 0], [64, 128], target_memory=pl.MemorySpace.Mat)
tile_b = pl.load(b, [0, 0], [128, 32], target_memory=pl.MemorySpace.Mat)
# ↑ no transpose kwarg
# ↑ shapes swapped to canonical coords
...

@pl.function(type=pl.FunctionType.Orchestration)
def orchestrator(self, a, b):
c = pl.create_tensor([64, 32], dtype=pl.FP32)
# b is wrapped in tensor.as_layout to bridge ND → DN at the call site:
bridged_b = tensor.as_layout(b, pl.DN) # type: [128, 32] DN
return self.matmul_incore(a, bridged_b, c)
```

``a`` is loaded without transpose, so it is unchanged. ``b`` is promoted in the
InCore signature, all body loads of ``b`` are rewritten to canonical coords with
no transpose, and the orchestrator's call site wraps ``b`` in
``tensor.as_layout`` to bridge ``[32, 128] ND`` → ``[128, 32] DN`` over the same
physical buffer.
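A minimal sketch of why the call-site bridge is pure metadata, assuming DN simply means the column-major view of the same buffer (the index helpers below are invented for this example):

```python
# One physical row-major buffer for a [32, 128] tensor.
rows, cols = 32, 128
buf = list(range(rows * cols))

def nd_at(i, j):
    """[32, 128] ND view: row-major strides (128, 1)."""
    return buf[i * cols + j]

def dn_at(i, j):
    """[128, 32] DN view of the *same* buffer: strides (1, 128)."""
    return buf[i + j * cols]

# The DN view's (i, j) element is the ND view's (j, i) element;
# no data was copied or moved, only the (shape, stride, layout) metadata changed.
assert dn_at(5, 3) == nd_at(3, 5)
```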

## Implementation

**Header**: ``include/pypto/ir/transforms/passes.h``

**Implementation**: ``src/ir/transforms/lower_transpose_load_param_layout_pass.cpp``

**Python binding**: ``python/bindings/modules/passes.cpp``

**Tests**: ``tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py``

## Pass Properties

| Property | Value |
| -------- | ----- |
| Required | SSAForm, IncoreTileOps, SplitIncoreOrch, TileOps2D |
| Produced | SSAForm, IncoreTileOps, SplitIncoreOrch, TileOps2D |
| Invalidated | — |

## Scope

| Function type | Action |
| ------------- | ------ |
| InCore (InCore, AIC, AIV) | Scanned, possibly promoted |
| Orchestration / Group / Opaque | Scanned for call sites; promoted-arg wrapped in ``tensor.as_layout`` |

| Parameter state | Action |
| --------------- | ------ |
| Sourced by ``tile.load(..., transpose=True)``, layout != DN, rank ≥ 2 | Promoted (shape swap + DN tag) |
| Sourced by ``tile.load(..., transpose=True)``, already DN | Idempotent — unchanged |
| Mixed transpose=True / transpose=False on same param | ``CHECK`` failure |
| Not sourced by any transposed load | Unchanged |
| Rank < 2 candidate | ``CHECK`` failure |

## Interaction with ``tensor.as_layout`` (P4) and ``MaterializeTensorStrides`` (P3)

This pass is the first real consumer of ``tensor.as_layout`` in the default
pipeline. The bridging op is single-purpose: it flips the layout tag and derives
the new shape from §4.2 canonical pair semantics — callers never write the
target shape, so the call-site rewriter cannot get it wrong.

Downstream, ``MaterializeTensorStrides`` fills the empty stride slot on each
promoted parameter with the packed canonical DN strides (RFC §2.4). The
combination of P6 + P3 is what gives codegen a self-consistent
``(shape, stride, layout)`` triple — no further ``dn_swap`` / ``get_shape_source_idx``
fix-ups are needed in the codegen path for promoted parameters.
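To make the stride handoff concrete, here is a hedged sketch of what packed strides could look like for the two layouts, assuming "packed" means contiguous over the underlying buffer. The authoritative rule is RFC #1300 section 2.4 and the `MaterializeTensorStrides` implementation; `packed_strides_nd` and `packed_strides_dn` are hypothetical helpers.

```python
def packed_strides_nd(shape):
    """Row-major packed strides: last dim is contiguous."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def packed_strides_dn(shape):
    """Packed strides for a trailing-pair DN tensor with canonical shape
    [..., b, a]: dim b is contiguous, dim a strides over b, leading dims
    stride row-major over the underlying [..., a, b] buffer."""
    underlying = shape[:-2] + [shape[-1], shape[-2]]  # the ND buffer shape
    return packed_strides_nd(underlying)[:-2] + [1, shape[-2]]
```

For the example above, the promoted `[128, 32] DN` parameter would get strides `[1, 128]`, which addresses the original row-major `[32, 128]` buffer column-first, exactly the transposed walk.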

## Interaction with ``tensor.transpose`` at Orchestration

A parameter whose source TensorView carries both ``layout = DN`` *and* an
explicit non-empty ``stride`` is the signature of a ``tensor.transpose`` result.
This pass rejects ``tile.load(transpose=True)`` on such parameters with a
``CHECK`` failure — the two encodings would compose as a double transpose at
codegen time and emit wrong addresses. Slice-derived inputs (explicit strides +
``layout = ND``, attached by ``OptimizeOrchTensors``) are unaffected.

Workaround for the rejected case: drop one of the two transpose layers in the
source program.