120 changes: 62 additions & 58 deletions docs/en/dev/passes/18-lower_transpose_load_param_layout.md
@@ -1,32 +1,36 @@
# LowerTransposeLoadParamLayout Pass

Lowers ``tile.load(..., transpose=True)`` to canonical-form DN parameter layout (RFC #1300 P6).
Lowers ``tile.load(..., transpose=True)`` by emitting an explicit
``tensor.as_layout`` view inside the InCore body (RFC #1300 P6).

## Overview

Before this pass, ``tile.load(transpose=True)`` is the user's way of saying "I want
the column-major view of this source tensor at the load site". After this pass, that
intent is encoded into the InCore parameter's TensorType itself — the source/load
combo is rewritten to RFC #1300 §3.3 canonical form so codegen, verifier, and
downstream passes consume a single, self-consistent ``(shape, stride, layout)`` triple.
Before this pass, ``tile.load(transpose=True)`` is the user's way of saying "I
want the column-major view of this source tensor at the load site". After this
pass, that intent is encoded into a body-local ``tensor.as_layout`` view at the
top of the InCore body so codegen, verifier, and downstream passes consume a
single, self-consistent ``(shape, stride, layout)`` triple.

For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``:

- ``p``'s TensorType is promoted from ``[..., a, b] ND`` to ``[..., b, a] DN`` —
the trailing-pair shape swap plus the DN layout tag. The new TensorView carries
an empty stride; ``MaterializeTensorStrides`` (which runs later in the default
pipeline, after ``CanonicalizeIOOrder``) fills it with the packed canonical
strides.
- Every ``tile.load(p, offsets, shapes, valid_shapes, ..., transpose=True)`` whose
source is a promoted parameter is rewritten so the three tuples' trailing pair
is swapped to canonical coords and the ``transpose=True`` kwarg is dropped.
``DeduceTileLoadType`` reads the source's DN layout to derive the Mat tile-view
- The InCore body is **prepended** with ``p_dn = tensor.as_layout(p, layout=DN)``.
The new Var ``p_dn`` carries the canonical ``[..., b, a] DN`` view (trailing-pair
shape swap + DN layout tag with packed canonical strides set by the
``tensor.as_layout`` deduce-type).
- Body uses of ``p`` are substituted with ``p_dn``. ``p``'s parameter
signature is left unchanged — the orch side keeps passing its original
row-major ND tensor (which matches the runtime torch tensor's layout).
- Every ``tile.load(p, offsets, shapes, valid_shapes, ..., transpose=True)``
whose source is a promoted parameter is rewritten to ``tile.load(p_dn, ...)``,
with the three tuples' trailing pair swapped to canonical coords and
``transpose=True`` flipped to ``transpose=False``.
``DeduceTileLoadType`` reads ``p_dn``'s DN layout to derive the Mat tile-view
layout that the legacy ``transpose=True`` swap produced — the two signals are
equivalent (§4.2 canonical pair).
- Every non-InCore call site that targets a promoted callee wraps the promoted
argument in ``tensor.as_layout(arg, DN)`` (RFC #1300 P4). The bridging op is
pure metadata — it emits no PTOAS instruction; ``make_tensor_view`` consumes
the new view directly.

Non-InCore (orch) functions are not touched. The DN reinterpret is a
single-function concern owned by the InCore body that needs it, which keeps the
cross-function boundary trivial: orch always passes a row-major ND tensor.

**Requirements**:

@@ -37,9 +41,7 @@ For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``

**When to use**: 18th pass in the ``Default`` strategy, after
``InferTileMemorySpace`` and before ``ResolveBackendOpLayouts``. The 2D shape
produced by ``FlattenTileNdTo2D`` is a precondition. ``MaterializeTensorStrides``
runs later in the pipeline (after ``CanonicalizeIOOrder``) to materialize the
DN-packed canonical strides on the promoted parameters.
produced by ``FlattenTileNdTo2D`` is a precondition.

## API

@@ -64,26 +66,26 @@ For each InCore function f:
set P_nt = {param idx with tile.load(p, ..., transpose=False/absent)}
reject P_t ∩ P_nt (mixed-use)
for each idx in P_t:
promote f.params[idx].type: [..., a, b] ND → [..., b, a] DN (empty stride)
substitute old Var → new Var in body
rewrite each tile.load(promoted_param, off, shp, vs, transpose=True) in body:
let p = f.params[idx]
skip if p is already DN-tagged (the user-written / pre-canonical case)
build p_dn := tensor.as_layout(p, layout=DN) — type derived by op deducer
prepend (p_dn = ...) AssignStmt to body
record p → p_dn in substitution map
substitute body uses of every promoted p with p_dn
rewrite each tile.load(p_dn, off, shp, vs, transpose=True) in body:
swap last two dims of off / shp / vs
drop transpose=True kwarg

For each non-InCore function:
walk body; for every Call whose op is a GlobalVar of a promoted callee:
wrap each promoted-slot arg with tensor.as_layout(arg, DN)
(Non-InCore functions are untouched.)
```
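A plain-Python sketch of two pieces of bookkeeping from the pseudocode above: the mixed-use rejection (``P_t ∩ P_nt``) and the trailing-pair coordinate swap applied to ``off`` / ``shp`` / ``vs``. The helper names are hypothetical; the real pass walks IR nodes, not tuples.

```python
def swap_trailing_pair(coords):
    """Swap the last two entries of an offsets/shapes/valid_shapes tuple."""
    if len(coords) < 2:
        raise ValueError("rank >= 2 required")
    return coords[:-2] + (coords[-1], coords[-2])

def transposed_param_set(loads):
    """loads: list of (param_idx, transpose_flag) pairs observed in the body.
    Returns the set of promotable params; rejects mixed use of one param."""
    p_t = {i for i, t in loads if t}
    p_nt = {i for i, t in loads if not t}
    mixed = p_t & p_nt
    if mixed:
        raise ValueError(f"mixed transpose=True/False on params {sorted(mixed)}")
    return p_t

# A transposed load of a [32, 128] source rewrites its coords to DN order:
assert swap_trailing_pair((32, 128)) == (128, 32)
assert transposed_param_set([(1, True), (0, False)]) == {1}
```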

**Complexity:** O(N log N) — one body walk per function plus one program-wide call-site
walk. Map lookups (``promotions_by_callee_name``) are ``log N`` per call.
**Complexity:** O(N log N) — one body walk per InCore function.

| Behavior | Trigger |
| -------- | ------- |
| Promote param to ``[..., b, a] DN`` | InCore param is source of ``tile.load(..., transpose=True)`` |
| Prepend ``p_dn = tensor.as_layout(p, DN)`` and rewrite tile.load | InCore param is source of ``tile.load(..., transpose=True)`` |
| Skip param | Already DN, or no transposed load |
| Skip whole function | Function is Orchestration / Opaque / Group |
| Wrap call-site arg in ``tensor.as_layout`` | Non-InCore call to a promoted callee |
| Reject | Mixed transpose=True / transpose=False on same param |
| Reject | DN + explicit physical stride source (would compose as double transpose) |

@@ -118,28 +120,28 @@ class Before:
def matmul_incore(
self,
a: pl.Tensor[[64, 128], pl.FP32],
b: pl.Tensor[[128, 32], pl.FP32, pl.DN], # ← shape swapped + DN tag
b: pl.Tensor[[32, 128], pl.FP32], # ← param signature unchanged
c: pl.Out[pl.Tensor[[64, 32], pl.FP32]],
) -> pl.Tensor[[64, 32], pl.FP32]:
b_dn = tensor.as_layout(b, layout=DN) # ← prepended view
# type: [128, 32] DN
tile_a = pl.load(a, [0, 0], [64, 128], target_memory=pl.MemorySpace.Mat)
tile_b = pl.load(b, [0, 0], [128, 32], target_memory=pl.MemorySpace.Mat)
# ↑ no transpose kwarg
# ↑ shapes swapped to canonical coords
tile_b = pl.load(b_dn, [0, 0], [128, 32], target_memory=pl.MemorySpace.Mat)
# ↑ source switched to b_dn
# ↑ shapes swapped to canonical coords
# ↑ no transpose kwarg
...

@pl.function(type=pl.FunctionType.Orchestration)
def orchestrator(self, a, b):
c = pl.create_tensor([64, 32], dtype=pl.FP32)
# b is wrapped in tensor.as_layout to bridge ND → DN at the call site:
bridged_b = tensor.as_layout(b, pl.DN) # type: [128, 32] DN
return self.matmul_incore(a, bridged_b, c)
return self.matmul_incore(a, b, c) # ← unchanged
```

``a`` is loaded without transpose, so it is unchanged. ``b`` is promoted in the
InCore signature, all body loads of ``b`` are rewritten to canonical coords with
no transpose, and the orchestrator's call site wraps ``b`` in
``tensor.as_layout`` to bridge ``[32, 128] ND`` → ``[128, 32] DN`` over the same
physical buffer.
``a`` is loaded without transpose, so it is unchanged. ``b``'s param signature
is preserved; the kernel internally derives a DN view via ``tensor.as_layout``
and references that view in its ``tile.load``. The orchestrator is not
touched — it passes its own row-major ``b`` straight through.
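As an analogy (NumPy, not the ``pl`` runtime), the ND-to-DN reinterpret behaves like a transpose view: the shape and strides of the trailing pair swap, but no data moves and no buffer is allocated.

```python
import numpy as np

b = np.arange(32 * 128, dtype=np.float32).reshape(32, 128)  # row-major ND
b_dn = b.T  # [128, 32] view: shape swapped, strides swapped, no copy

assert b_dn.shape == (128, 32)
assert np.shares_memory(b, b_dn)     # same physical buffer
assert b_dn.strides == (4, 128 * 4)  # column-major over the original bytes
```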

## Implementation

@@ -163,29 +165,31 @@

| Function type | Action |
| ------------- | ------ |
| InCore (InCore, AIC, AIV) | Scanned, possibly promoted |
| Orchestration / Group / Opaque | Scanned for call sites; promoted-arg wrapped in ``tensor.as_layout`` |
| InCore (InCore, AIC, AIV) | Scanned, body prepended with ``tensor.as_layout`` views as needed |
| Orchestration / Group / Opaque | Untouched |

| Parameter state | Action |
| --------------- | ------ |
| Sourced by ``tile.load(..., transpose=True)``, layout != DN, rank ≥ 2 | Promoted (shape swap + DN tag) |
| Sourced by ``tile.load(..., transpose=True)``, already DN | Idempotent: unchanged |
| Sourced by ``tile.load(..., transpose=True)``, layout != DN, rank ≥ 2 | ``tensor.as_layout`` view prepended; body uses substituted |
| Sourced by ``tile.load(..., transpose=True)``, already DN | Skipped: ``DeduceTileLoadType`` already handles DN-source XOR transpose |
| Mixed transpose=True / transpose=False on same param | ``CHECK`` failure |
| Not sourced by any transposed load | Unchanged |
| Rank < 2 candidate | ``CHECK`` failure |

## Interaction with ``tensor.as_layout`` (P4) and ``MaterializeTensorStrides`` (P3)
## Interaction with ``tensor.as_layout`` (P4)

This pass is the first real consumer of ``tensor.as_layout`` in the default
pipeline. The bridging op is single-purpose: it flips the layout tag and derives
the new shape from §4.2 canonical pair semantics — callers never write the
target shape, so the call-site rewriter cannot get it wrong.
This pass is the first consumer of ``tensor.as_layout`` in the default
pipeline. The bridging op is single-purpose: it flips the layout tag and
derives the new shape from §4.2 canonical pair semantics, then attaches the
packed canonical strides via ``CanonicalizeView``. Codegen lowers
``tensor.as_layout`` to a fresh ``pto.make_tensor_view`` bound to the input
tensor's underlying SSA buffer with the LHS's
``(shape, stride, layout)`` triple — no PTOAS instruction is emitted; the
result is a pure metadata reinterpret.

Downstream, ``MaterializeTensorStrides`` fills the empty stride slot on each
promoted parameter with the packed canonical DN strides (RFC §2.4). The
combination of P6 + P3 is what gives codegen a self-consistent
``(shape, stride, layout)`` triple — no further ``dn_swap`` / ``get_shape_source_idx``
fix-ups are needed in the codegen path for promoted parameters.
Per RFC §4.2, the InCore-side reinterpret does not violate the "InCore cannot
create tensors" invariant: ``tensor.as_layout`` allocates nothing, it
re-describes the input's existing physical buffer.
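A sketch of what the packed canonical strides could look like for a DN view, under the assumption that DN reinterprets only the trailing dimension pair as column-major and packs everything gap-free. The authoritative rule is RFC #1300 §2.4; the function below is illustrative only.

```python
def packed_canonical_strides(shape, layout="ND"):
    """Packed (gap-free) element strides. Assumption: a DN view
    [..., b, a] strides its trailing pair as (1, b); ND is row-major."""
    strides = [0] * len(shape)
    if layout == "DN":
        strides[-2], strides[-1] = 1, shape[-2]
    else:
        strides[-2], strides[-1] = shape[-1], 1
    inner = shape[-1] * shape[-2]
    # Leading (batch) dims stay row-major and packed in both layouts.
    for i in range(len(shape) - 3, -1, -1):
        strides[i] = inner
        inner *= shape[i]
    return tuple(strides)

# The promoted [128, 32] DN view over a row-major [32, 128] buffer:
assert packed_canonical_strides((128, 32), "DN") == (1, 128)
assert packed_canonical_strides((64, 128)) == (128, 1)
```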

## Interaction with ``tensor.transpose`` at Orchestration

2 changes: 1 addition & 1 deletion docs/en/dev/passes/26-materialize_tensor_strides.md
@@ -66,7 +66,7 @@ The pass is **idempotent**: re-running on already-materialized IR is a no-op, si

## Example

**Before** — InCore param with empty-stride DN view (e.g. produced by a future `LowerTransposeLoadParamLayout` rewrite):
**Before** — InCore param with empty-stride DN view (user-written `pl.Tensor[..., pl.DN]` without an explicit stride hint):

```python
@pl.function(type=pl.FunctionType.InCore)