feat(ir): tensor.as_layout op + LowerTransposeLoadParamLayout + activate MaterializeTensorStrides (#1300 P4 + P6) #1324
Conversation
Note: Reviews paused. This branch is under active development, so CodeRabbit automatically paused its review to avoid an influx of comments from new commits.
Code Review
This pull request implements the tensor.as_layout internal IR operator for metadata-only reinterpretation of tensor shapes and layouts, supporting row-major ND and DN-packed equivalence. The changes include C++ view semantics, Python bindings, type deduction, and simplification rules. Feedback points out a bug in the orchestration codegen where identity reinterprets incorrectly emit runtime transposes and 1D tensors cause crashes. A code suggestion is provided to ensure identity cases are handled as no-ops and rank checks are only applied during layout changes.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/zh-cn/dev/passes/00-pass_manager.md (1)

383-383: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale annotation: `MaterializeTensorStrides` is now wired into the default pipeline. Line 383 still reads "已注册但尚未接入默认 pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)" ("registered but not yet wired into the default pipeline; to be enabled with the RFC `#1300` P6/P7 codegen cleanup"), but the list itself includes it as step 13, and per the PR objectives this PR (P6) activates `MaterializeTensorStrides` in the default pipeline. Please drop the stale parenthetical so the doc doesn't contradict its own enumeration (and matches `docs/en/dev/passes/17-lower_transpose_load_param_layout.md`, which already treats it as part of the running pipeline).

📝 Suggested wording fix

```diff
-13. [`MaterializeTensorStrides`](25-materialize_tensor_strides.md) —— 已注册但尚未接入默认 pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)
+13. [`MaterializeTensorStrides`](25-materialize_tensor_strides.md) —— 在 `CanonicalizeIOOrder` 之后为 P6 提升出的 DN 参数物化 packed canonical strides
```

Also worth grepping the English `docs/en/dev/passes/00-pass_manager.md` (not in this review) for the same stale phrasing.

```bash
#!/bin/bash
# Make sure the English counterpart didn't keep the stale "not yet wired in" note.
fd -t f '00-pass_manager.md' docs
rg -nP -C2 'MaterializeTensorStrides' docs/en/dev/passes/00-pass_manager.md docs/zh-cn/dev/passes/00-pass_manager.md 2>/dev/null
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/zh-cn/dev/passes/00-pass_manager.md` at line 383: the doc text for MaterializeTensorStrides is stale; update the line listing `MaterializeTensorStrides` (step 13) by removing the parenthetical "已注册但尚未接入默认 pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)" so the entry no longer claims it isn't wired into the default pipeline; keep the entry as-is otherwise to match the activated status used elsewhere (e.g., docs/en/dev/passes/17-lower_transpose_load_param_layout.md).

docs/en/dev/passes/00-pass_manager.md (1)
383-383: ⚠️ Potential issue | 🟠 Major

Update line 383 to reflect that MaterializeTensorStrides is now active in the default pipeline.

The code in `python/pypto/ir/pass_manager.py` confirms that `MaterializeTensorStrides` is now part of the default pipeline (active since RFC `#1300` P6), and `docs/en/dev/passes/25-materialize_tensor_strides.md` correctly documents this status. Line 383 of `docs/en/dev/passes/00-pass_manager.md` is outdated and should be updated to remove "not yet wired into the default pipeline" and instead indicate that it is active in the default pipeline between `CanonicalizeIOOrder` and `InitMemRef`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/en/dev/passes/00-pass_manager.md` at line 383, Update the documentation entry for MaterializeTensorStrides to reflect that it is now active in the default pipeline: remove "not yet wired into the default pipeline" and state that MaterializeTensorStrides is active (since RFC `#1300` P6) and placed between CanonicalizeIOOrder and InitMemRef in the default pass sequence; ensure the description matches the status in python/pypto/ir/pass_manager.py and the details in 25-materialize_tensor_strides.md.
🧹 Nitpick comments (1)
src/ir/op/tile_ops/memory.cpp (1)
158-173: ⚡ Quick win

Use the `TensorType::IsDNLayout()` convenience method instead of the inline check.

The DN detection at lines 164-165 correctly reads from `tensor_view_->layout`, but duplicates the logic already defined in `TensorType::IsDNLayout()` (`include/pypto/ir/type.h`, lines 477-479). Simplify to:

```cpp
bool source_is_dn = tensor_type->IsDNLayout();
```

This improves readability and eliminates duplication. The underlying check is correct: DN is canonically stored only in `tensor_view_->layout`, and every DN-tagged TensorType reaching `tile.load` always materializes `tensor_view_` (per `PromoteToCanonicalDN()` and the deserialization pathways).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/ir/op/tile_ops/memory.cpp` around lines 158 - 173, Replace the inline DN-layout check used to set the local variable source_is_dn with the TensorType::IsDNLayout() convenience method; locate where source_is_dn is computed (currently using tensor_type->tensor_view_.has_value() && tensor_type->tensor_view_->layout == TensorLayout::DN) and change it to call tensor_type->IsDNLayout() so the code uses the existing TensorType helper and removes duplicated logic in memory.cpp within the tile.load handling (the block that then uses source_is_dn to decide swapping tile_view.blayout/slayout).
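The helper-vs-inline equivalence this nitpick relies on can be modeled in a few lines of Python (a purely illustrative stand-in for the C++ `TensorType`; the class and field names below are assumptions for the sketch, not the real pypto types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TensorView:
    layout: str

@dataclass
class TensorType:
    tensor_view: Optional[TensorView] = None

    def is_dn_layout(self) -> bool:
        # Mirrors the intent of TensorType::IsDNLayout(): DN is canonically
        # stored only in the optional tensor view's layout tag.
        return self.tensor_view is not None and self.tensor_view.layout == "DN"

# The inline check and the helper agree on all three cases:
for t in (TensorType(), TensorType(TensorView("ND")), TensorType(TensorView("DN"))):
    inline = t.tensor_view is not None and t.tensor_view.layout == "DN"
    assert inline == t.is_dn_layout()
```

The point of the suggested change is exactly this equivalence: the helper encapsulates the null-check plus tag comparison, so call sites cannot drift.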
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/en/dev/passes/17-lower_transpose_load_param_layout.md`:
- Around line 15-18: Remove the misleading phrase "the next default pass" when
referring to MaterializeTensorStrides in this document and instead state that
MaterializeTensorStrides must run later in the pipeline (after passes such as
CanonicalizeIOOrder); update both the earlier mention (around the paragraph
describing the promoted TensorType and empty stride) and the later mention
(lines ~37–41) so they consistently say MaterializeTensorStrides runs
downstream/after CanonicalizeIOOrder rather than immediately next. Ensure
references to MaterializeTensorStrides and CanonicalizeIOOrder remain intact so
readers can locate the correct ordering.
In `@docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md`:
- Around line 11-22: The pipeline-order text is stale: update the doc to state
that LowerTransposeLoadParamLayout runs before ResolveBackendOpLayouts (as
registered in pass_manager.py), that MaterializeTensorStrides is not the "next
pass" but is inserted much later after CanonicalizeIOOrder, and correct the pass
index (it's the 18th pass in the full Default pipeline, not 17th); mention it
still runs after InferTileMemorySpace and before ResolveBackendOpLayouts and
that MaterializeTensorStrides must run later to materialize DN-packed canonical
stride.
In `@src/ir/op/tensor_ops/transform.cpp`:
- Around line 353-355: The current as_layout path rebuilds a fresh canonical
view via tensor_view_semantics::CanonicalizeView and drops any existing
valid_shape/pad metadata; update the logic creating new_view in transform.cpp so
it preserves src_type's existing view metadata (valid_shape and pad) when
present before constructing the new TensorType: retrieve the original view from
src_type (e.g., src_type->view or similar), copy/merge its valid_shape and pad
into the canonicalized new_view (or attach them to the optional view passed to
TensorType) so that tensor.as_layout remains lossless for sliced or fill-padded
tensors.
In `@src/ir/transforms/lower_transpose_load_param_layout_pass.cpp`:
- Around line 145-154: LowerTransposeLoadParamLayout currently requires
call->args_.size() == 4 and forcefully As<MakeTuple> on args_[3], which fails
for the supported 3-arg form tile.load(tensor, offsets, shapes). Change the
verification to accept either 3 or 4 args (check call->args_.size() == 3 ||
call->args_.size() == 4), set offsets = As<MakeTuple>(call->args_[1]) and shapes
= As<MakeTuple>(call->args_[2]) as before, and compute valid_shapes only if
args_.size() == 4 (e.g. valid_shapes = (call->args_.size() == 4 ?
As<MakeTuple>(call->args_[3]) : nullptr)). Update the INTERNAL_CHECK_SPAN logic
to require offsets && shapes, and if valid_shapes is non-null require it to be a
MakeTuple, so the 3-arg form is accepted and the 4-arg form still validated.
In `@src/ir/transforms/materialize_tensor_strides_pass.cpp`:
- Around line 168-179: The direct reconstruction of Call in
MaterializeTensorStrides drops op->attrs_ so metadata is lost; update the Call
rebuild to preserve attributes by passing op->attrs_ into the Call constructor
(i.e., when creating the new std::make_shared<Call>(...), include op->attrs_
alongside op->op_, std::move(new_args), op->kwargs_, std::move(new_return_type),
op->span_) so the new Call keeps the original attrs_ set.
In `@src/ir/transforms/simplify_pass.cpp`:
- Around line 158-163: Update the comment above VisitExpr_ to reflect current
behavior: remove the outdated "shape-bearing" form and the claim about folding
chains, and instead state that SimplifyAsLayout() only removes identity
reinterprets (i.e., as_layout(x, x.shape, x.layout) → x) but does not collapse
nested as_layout chains; reference the VisitExpr_ comment and SimplifyAsLayout()
as the locations to update so the pass contract is consistent with the
implementation.
In `@tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py`:
- Line 251: The assertion currently assumes a_arg.op is non-None; change it to
first assert isinstance(a_arg, ir.Call) and a_arg.op is not None, then assert
a_arg.op.name == "tensor.as_layout" (mirroring the defensive pattern used for
b_arg). Update the assertion around a_arg in
test_lower_transpose_load_param_layout_pass to check a_arg.op is not None before
accessing op.name so we avoid potential AttributeError.
- Around line 297-298: The two assertions that check call.args[0].op.name and
call.args[1].op.name assume .op is non-null; add null-safety checks similar to
earlier patterns (e.g., lines that check isinstance(call.args[i], ir.Call) and
call.args[i].op is not None) before accessing .op.name so replace/augment the
assertions with checks that call.args[0].op is not None and call.args[1].op is
not None and then assert call.args[0].op.name == "tensor.as_layout" and
call.args[1].op.name == "tensor.as_layout".
---
Outside diff comments:
In `@docs/en/dev/passes/00-pass_manager.md`:
- Line 383: Update the documentation entry for MaterializeTensorStrides to
reflect that it is now active in the default pipeline: remove "not yet wired
into the default pipeline" and state that MaterializeTensorStrides is active
(since RFC `#1300` P6) and placed between CanonicalizeIOOrder and InitMemRef in
the default pass sequence; ensure the description matches the status in
python/pypto/ir/pass_manager.py and the details in
25-materialize_tensor_strides.md.
In `@docs/zh-cn/dev/passes/00-pass_manager.md`:
- Line 383: The doc text for MaterializeTensorStrides is stale: in
docs/zh-cn/dev/passes/00-pass_manager.md update the line listing
`MaterializeTensorStrides` (step 13) by removing the parenthetical "已注册但尚未接入默认
pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)" so the entry no longer claims it
isn’t wired into the default pipeline; keep the entry as-is otherwise to match
the activated status used elsewhere (e.g.,
docs/en/dev/passes/17-lower_transpose_load_param_layout.md).
---
Nitpick comments:
In `@src/ir/op/tile_ops/memory.cpp`:
- Around line 158-173: Replace the inline DN-layout check used to set the local
variable source_is_dn with the TensorType::IsDNLayout() convenience method;
locate where source_is_dn is computed (currently using
tensor_type->tensor_view_.has_value() && tensor_type->tensor_view_->layout ==
TensorLayout::DN) and change it to call tensor_type->IsDNLayout() so the code
uses the existing TensorType helper and removes duplicated logic in memory.cpp
within the tile.load handling (the block that then uses source_is_dn to decide
swapping tile_view.blayout/slayout).
📒 Files selected for processing (37)
- .claude/rules/pass-doc-ordering.md
- CMakeLists.txt
- docs/en/dev/passes/00-pass_manager.md
- docs/en/dev/passes/16-infer_tile_memory_space.md
- docs/en/dev/passes/17-lower_transpose_load_param_layout.md
- docs/en/dev/passes/17-resolve_transpose_layout.md
- docs/en/dev/passes/18-resolve_backend_op_layouts.md
- docs/en/dev/passes/25-materialize_tensor_strides.md
- docs/en/user/01-language_guide.md
- docs/zh-cn/dev/passes/00-pass_manager.md
- docs/zh-cn/dev/passes/16-infer_tile_memory_space.md
- docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md
- docs/zh-cn/dev/passes/17-resolve_transpose_layout.md
- docs/zh-cn/dev/passes/18-resolve_backend_op_layouts.md
- docs/zh-cn/dev/passes/25-materialize_tensor_strides.md
- docs/zh-cn/user/01-language_guide.md
- include/pypto/ir/transforms/pass_properties.h
- include/pypto/ir/transforms/passes.h
- include/pypto/ir/transforms/utils/tensor_view_semantics.h
- python/bindings/modules/passes.cpp
- python/pypto/ir/op/tensor_ops.py
- python/pypto/ir/pass_manager.py
- python/pypto/pypto_core/passes.pyi
- src/codegen/tensor_op_codegen.cpp
- src/ir/op/tensor_ops/transform.cpp
- src/ir/op/tile_ops/memory.cpp
- src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
- src/ir/transforms/materialize_tensor_strides_pass.cpp
- src/ir/transforms/resolve_transpose_layout_pass.cpp
- src/ir/transforms/simplify_pass.cpp
- tests/ut/codegen/test_pto_codegen.py
- tests/ut/codegen/test_pto_codegen_cross_core.py
- tests/ut/ir/operators/test_tensor_as_layout.py
- tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
- tests/ut/ir/transforms/test_pass_manager.py
- tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
- tests/ut/ir/transforms/test_simplify_pass.py
💤 Files with no reviewable changes (4)
- docs/zh-cn/dev/passes/17-resolve_transpose_layout.md
- src/ir/transforms/resolve_transpose_layout_pass.cpp
- docs/en/dev/passes/17-resolve_transpose_layout.md
- tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
CI fix (root cause of pypto-lib-model + system-tests failures):

- P6's call-site injector previously inlined `tensor.as_layout(arg, DN)` into the kernel-call args directly, which the orchestration codegen rejects with `Call to '<callee>' arg N is neither a variable nor a recognized constant literal`. Refactor `CallSiteAsLayoutInjector` to operate at the statement level: for every AssignStmt / EvalStmt / ReturnStmt whose RHS targets a promoted callee, emit one `bridged_<param> = tensor.as_layout(arg, DN)` AssignStmt immediately before the call statement and replace the inline Call arg with the bound Var. The net IR is SSA-form and matches what the orchestration codegen consumes per arg slot (Var | const-literal).

Review comments addressed:

- gemini #1: codegen for `tensor.as_layout` now special-cases the identity flip (target layout == source layout) and emits a plain `Tensor result = input;` alias instead of a spurious `.transpose()`. Simplify still folds these before codegen in the default pipeline, but the codegen is now robust against ad-hoc compile paths that skip Simplify.
- coderabbitai hw-native-sys#2 / hw-native-sys#3: drop the "next default pass" wording in the en/zh-cn doc 17 — `MaterializeTensorStrides` runs later in the pipeline (after `CanonicalizeIOOrder`), not immediately after. The zh-cn doc's "17th pass" text is also clarified — the 17 is the docs/passes/ slot, not a literal pipeline call-count.
- coderabbitai hw-native-sys#4: `DeduceTensorAsLayoutType` now preserves the source TensorView's `valid_shape` (with trailing-pair swap on cross-layout flips) and `pad` through `tensor.as_layout`. Previously these fields were dropped, making the reinterpret silently lossy for sliced or fill-padded inputs.
- coderabbitai hw-native-sys#6: the `MaterializeTensorStrides` direct-ctor rebuild path now forwards `op->attrs_`. The previous version preserved type and kwargs but dropped attrs, which would have silently discarded call metadata (arg_directions, manual_dep_edges) attached by earlier passes.
- coderabbitai hw-native-sys#7: update the stale comment block above `VisitExpr_` in `simplify_pass.cpp` — it still described the dropped shape-bearing `as_layout(x, shape, layout)` form and the never-implemented chain folding. The new comment accurately describes the single identity-elimination rule and explains why chain folding is deferred.
- coderabbitai hw-native-sys#8 / hw-native-sys#9: the unit-test pattern that previously inspected an inline `tensor.as_layout` Call as a kernel-call arg no longer applies after the SSA refactor above. Tests now look up the bridge via `_find_assign_rhs(orch, var)` and guard `op is not None` before reading `op.name` (matching the defensive pattern already used in the B^T test).

Skipped (with reason):

- coderabbitai hw-native-sys#5: "Handle 3-arg `tile.load`". `tile.load` registers four mandatory args (tensor, offsets, shapes, valid_shapes) and the Python builder always materializes `valid_shapes` (defaults to `shapes` when the caller omits it). Once IR is constructed, every `tile.load` is 4-arg — the 3-arg form only exists at the DSL surface. The internal check stays as-is.
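The statement-level bridging in the CI fix can be illustrated with a small Python sketch; names such as `bridge_promoted_args` are illustrative, not the actual `CallSiteAsLayoutInjector` API. Each promoted slot gets one `bridged_<param>` assignment emitted before the call, and the call arg becomes a plain Var:

```python
def bridge_promoted_args(call_args, promoted_slots, param_names):
    """Return (prelude assignments, rewritten args) so every promoted
    slot receives a Var instead of an inline tensor.as_layout Call."""
    prelude, new_args = [], []
    for i, arg in enumerate(call_args):
        if i in promoted_slots:
            var = f"bridged_{param_names[i]}"
            # One AssignStmt per promoted slot, placed before the call.
            prelude.append(f"{var} = tensor.as_layout({arg}, layout=DN)")
            new_args.append(var)
        else:
            new_args.append(arg)
    return prelude, new_args

prelude, args = bridge_promoted_args(["qi", "kj", "out"], {1}, ["q", "k", "o"])
print(prelude)  # ['bridged_k = tensor.as_layout(kj, layout=DN)']
print(args)     # ['qi', 'bridged_k', 'out']
```

After this rewrite every kernel-call arg slot is a Var or a constant literal, which is exactly the shape the orchestration codegen accepts.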
Force-pushed ab19bfa to 5fb0a67.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/codegen/pto/pto_codegen.cpp`:
- Around line 593-689: The branch that sets is_column_vector currently allows
rank==3 and later only fills two stride_names, leaving stride_names[2] empty and
producing malformed MLIR; fix by restricting the column-vector detection to rank
== 2 (i.e., change the condition in the is_column_vector check to require rank
== 2) so layout = ir::TensorLayout::DN is only forced for 2-D [M,1] tensors and
the existing stride fallback (stride_names[0], stride_names[1]) remains correct;
alternatively, if rank-3 column-vectors must be supported, extend the
is_column_vector handling to fully populate stride_names for all dims using
shape_dim_names/emit_stride_mul before the stride emission, but the simpler fix
is to limit is_column_vector to rank==2.
📒 Files selected for processing (38)
- .claude/rules/pass-doc-ordering.md
- CMakeLists.txt
- docs/en/dev/passes/00-pass_manager.md
- docs/en/dev/passes/17-infer_tile_memory_space.md
- docs/en/dev/passes/18-lower_transpose_load_param_layout.md
- docs/en/dev/passes/19-resolve_backend_op_layouts.md
- docs/en/dev/passes/26-materialize_tensor_strides.md
- docs/en/user/01-language_guide.md
- docs/zh-cn/dev/passes/00-pass_manager.md
- docs/zh-cn/dev/passes/17-infer_tile_memory_space.md
- docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
- docs/zh-cn/dev/passes/19-resolve_backend_op_layouts.md
- docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
- docs/zh-cn/user/01-language_guide.md
- include/pypto/ir/transforms/pass_properties.h
- include/pypto/ir/transforms/passes.h
- include/pypto/ir/transforms/utils/tensor_view_semantics.h
- python/bindings/modules/passes.cpp
- python/pypto/ir/op/tensor_ops.py
- python/pypto/ir/pass_manager.py
- python/pypto/pypto_core/passes.pyi
- src/backend/common/pto_ops_common.cpp
- src/codegen/orchestration/orchestration_codegen.cpp
- src/codegen/pto/pto_codegen.cpp
- src/codegen/tensor_op_codegen.cpp
- src/ir/op/tensor_ops/transform.cpp
- src/ir/op/tile_ops/memory.cpp
- src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
- src/ir/transforms/materialize_tensor_strides_pass.cpp
- src/ir/transforms/resolve_transpose_layout_pass.cpp
- src/ir/transforms/simplify_pass.cpp
- tests/ut/codegen/test_pto_codegen.py
- tests/ut/codegen/test_pto_codegen_cross_core.py
- tests/ut/ir/operators/test_tensor_as_layout.py
- tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
- tests/ut/ir/transforms/test_pass_manager.py
- tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
- tests/ut/ir/transforms/test_simplify_pass.py
💤 Files with no reviewable changes (2)
- src/ir/transforms/resolve_transpose_layout_pass.cpp
- tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
✅ Files skipped from review due to trivial changes (13)
- docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
- docs/en/dev/passes/26-materialize_tensor_strides.md
- docs/en/dev/passes/17-infer_tile_memory_space.md
- docs/zh-cn/dev/passes/19-resolve_backend_op_layouts.md
- docs/en/dev/passes/18-lower_transpose_load_param_layout.md
- docs/zh-cn/user/01-language_guide.md
- docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
- docs/zh-cn/dev/passes/17-infer_tile_memory_space.md
- include/pypto/ir/transforms/utils/tensor_view_semantics.h
- docs/en/dev/passes/00-pass_manager.md
- docs/en/user/01-language_guide.md
- docs/en/dev/passes/19-resolve_backend_op_layouts.md
- .claude/rules/pass-doc-ordering.md
🚧 Files skipped from review as they are similar to previous changes (17)
- CMakeLists.txt
- include/pypto/ir/transforms/pass_properties.h
- tests/ut/codegen/test_pto_codegen_cross_core.py
- docs/zh-cn/dev/passes/00-pass_manager.md
- src/ir/transforms/simplify_pass.cpp
- python/pypto/ir/op/tensor_ops.py
- include/pypto/ir/transforms/passes.h
- src/ir/op/tile_ops/memory.cpp
- src/ir/transforms/materialize_tensor_strides_pass.cpp
- src/ir/op/tensor_ops/transform.cpp
- python/pypto/pypto_core/passes.pyi
- tests/ut/codegen/test_pto_codegen.py
- tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
- python/bindings/modules/passes.cpp
- src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
- tests/ut/ir/transforms/test_pass_manager.py
- python/pypto/ir/pass_manager.py
Addresses coderabbitai's review on ``EmitMakeTensorViews``: the ``is_column_vector`` check previously matched ``rank == 2 || rank == 3``, but the column-vector stride fallback only populates ``stride_names[0]`` and ``stride_names[1]``. For a hypothetical rank-3 ``[_, M, 1]`` input the third stride slot would be empty and the codegen would emit malformed MLIR. The legacy column-vector behavior was always rank-2 in practice (PTOAS infers DN for ``[M, 1]`` specifically) and all in-tree test coverage is rank-2. Restrict the condition to ``rank == 2`` and add a comment explaining the constraint, mirroring coderabbitai's suggested minimal fix.
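The restricted detection amounts to this predicate (a sketch mirroring the check described above, not the exact codegen code): only a rank-2 `[M, 1]` shape qualifies, so the two-slot stride fallback always suffices.

```python
def is_column_vector(shape):
    # DN column-vector fast path: exactly rank 2 with a unit last dim.
    # A rank-3 [_, M, 1] input no longer matches, so the fallback that
    # only fills stride_names[0] and stride_names[1] can never leave a
    # third stride slot empty.
    return len(shape) == 2 and shape[-1] == 1

assert is_column_vector([128, 1])
assert not is_column_vector([4, 128, 1])   # rank 3: rejected after the fix
assert not is_column_vector([128, 64])     # not a column vector
```

This matches the legacy behavior (PTOAS infers DN for `[M, 1]` specifically) while closing the malformed-MLIR hole for hypothetical rank-3 inputs.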
…a2a3 paged_attention)
Adds ``set_output_memory_inherit_input()`` to the ``tensor.as_layout`` op
registration. Without it, ``InitMemRef`` saw the op as a regular tensor
producer and minted a fresh MemRef (with its own ``mem_ddr_*`` allocation)
for the result. The orchestration codegen, meanwhile, lowers
``tensor.as_layout`` to a plain alias ``Tensor result = input;`` — so the
runtime tensor's data pointer points to the *input's* buffer, while the IR
declares a *separate* buffer.
Concrete failure mode (non-square paged-attention on a2a3): for kj of shape
``[64, 128]`` ND promoted to ``[128, 64] DN``, the orch IR looked like
```
kj_dn_view: pl.Tensor[[128, 64], ..., pl.MemRef("mem_ddr_3", 0, 16384)] =
pl.tensor.as_layout(kj, layout=DN)
self.qk_kernel(qi, kj_dn_view, out)
```
The kernel binary's ``make_tensor_view`` produced the right (shape, stride,
layout) triple, but the orch passed a stale buffer pointer (the
freshly-allocated mem_ddr_3 was never written to; only the alias to kj's
mem_ddr_1 held real data). Square cases happened to pass because the
trailing-pair shape swap is identity, and PTOAS's runtime address arithmetic
collapsed to the same byte range either way.
After this fix, the orch IR shows ``kj_dn_view`` sharing the input's MemRef
``mem_ddr_1`` — no spurious allocation, no aliasing mismatch.
### Validation
- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped
- Local IR inspection confirms ``kj_dn_view`` inherits ``mem_ddr_1`` (was
``mem_ddr_3`` before this fix).
Introduce `tensor.as_layout` - a pure metadata reinterpret op that points at the same physical memory as its source but exposes a different (shape, stride, layout) triple to consumers (RFC hw-native-sys#1300 §3.3). The op emits no PTOAS instruction at codegen; downstream `make_tensor_view` consumes the new view directly. It is internal-only: not exposed via `pypto.language`. Future passes (notably LowerTransposeLoadParamLayout in P6) inject `tensor.as_layout` at orch ↔ InCore call sites to bridge equivalent (shape, stride, layout) views.

DeduceTensorAsLayoutType enforces three validity invariants:

- The total element count of src and target must match (when both are statically known).
- layout must not be NZ (NZ is tile-only / fractal).
- The reinterpret must reduce to an RFC §4.2 canonical pair - currently row-major `[..., a, b]` ND ≡ `[..., b, a]` DN-packed. Other reinterprets are rejected; the helper AsLayoutOffsetMapEquivalent in tensor_view_semantics.h can be extended in follow-ups when concrete use cases appear.

Tests: 9 unit cases covering ND↔DN 2D/3D static round trips, idempotent self-reinterpret, element-count mismatch, NZ rejection, invalid offset-map rejection, symbolic-shape ExprPtr-identity matching, and the op-registry sanity check.
…#1300 P4-b)

Extend the Simplify pass to drop ``tensor.as_layout`` calls that are no-op metadata reinterprets, i.e. the requested ``(shape, layout)`` already matches what the source carries. The Call is replaced by its ``src`` arg directly, so downstream consumers stop walking through a useless reinterpret.

Why this matters: future passes (e.g. ``LowerTransposeLoadParamLayout`` in P6) may insert ``tensor.as_layout`` at every call-site bridge. Without folding, identity reinterprets ride through the pipeline and clutter codegen. With this rule, the no-op cases drop out at the Simplify boundary.

Chain folding (``as_layout(as_layout(x, ...), ...)`` → ``as_layout(x, ...)``) is intentionally left out: after SSA the outer Call's source is a Var bound to the inner Call (not the inner Call inline), so naive pointer inspection cannot see across the binding. A dedicated SSA-aware chain optimizer can be added if real pipelines produce such chains.

Tests: 3 cases - identity elimination on bare ND, preservation of a substantive shape change, preservation of a same-shape layout change.
…ys#1300 P4 review)

Tighten the tensor.as_layout signature in response to a design review: the op now ONLY flips the layout tag - shape changes that come with an ND ↔ DN flip are mechanical (RFC §4.2 canonical pair: trailing-two-dim swap) and derived from the source. Callers no longer pass a target shape.

Why the change: shape changes are tensor.reshape's job. Letting as_layout also accept a target shape blurred the responsibility, created room for caller error (mismatched shape vs canonical pair), and required the AsLayoutOffsetMapEquivalent helper to validate arbitrary (src, target) combinations. With the layout-tag-only design, the canonical pair is intrinsic to the op and the helper disappears.

API:
- Before: `as_layout(src, shape, layout=...)`
- After: `as_layout(src, layout=...)`

Behavior:
- Same layout (identity) -> shape unchanged
- Cross layout (ND <-> DN, rank >= 2) -> trailing 2 dims swapped
- NZ target -> rejected
- Strided sub-view source -> rejected (use slice/reshape first)

Cleanup:
- Drop the AsLayoutOffsetMapEquivalent helper.
- Drop detail::RowMajorEquivalentShape and detail::ShapeListsEquivalent (no remaining consumers; the namespace detail rump now contains only the truly-internal CheckCanonicalView helpers, restoring the convention).
- The Simplify identity rule reduces to a layout-only comparison.
- Codegen handler unchanged - still `Tensor::transpose(N-2, N-1)` for cross-layout flips; identity cases never reach codegen because Simplify folds them.

Tests: test_tensor_as_layout.py rewritten (9 cases) - adds an explicit strided-source rejection, drops the obsolete element-count and offset-map cases. test_simplify_pass.py::TestAsLayoutFolding adjusted for the new signature (2 cases).
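Under the layout-tag-only signature, the result shape is fully derived from the source. A small sketch of that derivation (an illustrative Python model of the shape rule, not the C++ DeduceTensorAsLayoutType implementation):

```python
def as_layout_result_shape(shape, src_layout, dst_layout):
    """Identity keeps the shape; an ND <-> DN flip swaps the trailing
    two dims (RFC §4.2 canonical pair). NZ targets are rejected."""
    if dst_layout == "NZ":
        raise ValueError("tensor.as_layout: NZ is tile-only/fractal")
    if src_layout == dst_layout:
        return list(shape)                        # identity reinterpret
    if len(shape) < 2:
        raise ValueError("cross-layout flip needs rank >= 2")
    return list(shape[:-2]) + [shape[-1], shape[-2]]  # trailing-pair swap

assert as_layout_result_shape([64, 128], "ND", "DN") == [128, 64]
assert as_layout_result_shape([2, 48, 64], "DN", "DN") == [2, 48, 64]
```

Because the target shape is computed rather than passed in, the mismatched-shape caller error the review worried about is unrepresentable.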
…trides (RFC hw-native-sys#1300 P6) Replaces ResolveTransposeLayout with LowerTransposeLoadParamLayout — the first pipeline pass that produces RFC hw-native-sys#1300 canonical-form IR — and wires MaterializeTensorStrides into the default pipeline so the codegen-entry IR satisfies the (shape, stride, layout) self-consistency contract.

Pass behaviour (per RFC hw-native-sys#1300 §3.3 + §4.2):
- For every InCore parameter loaded via tile.load(transpose=True), promote the TensorType from `[..., a, b] ND` to `[..., b, a] DN` (trailing-pair shape swap + DN layout tag with empty stride). MaterializeTensorStrides fills the packed canonical strides later in the pipeline.
- Every tile.load on a promoted parameter has its offsets / shapes / valid_shapes tuples swapped at the trailing pair and its `transpose` kwarg set to False (the slot is kept so print -> reparse round-trips faithfully, since the tile.load op registers `transpose` as a default-false attribute).
- DeduceTileLoadType now derives the Mat tile-view layout from the source's DN tag (XOR with the transpose kwarg) so the legacy `transpose=True` swap and the new `DN source + transpose=False` form produce the same TileType.
- Every non-InCore call site to a promoted callee wraps its promoted-slot arg with `tensor.as_layout(arg, DN)` (P4) to bridge orch-side ND tensors to the InCore DN parameter type.
- Mixed-use parameters (loaded with both transpose=True and transpose=False) are rejected with pypto::ValueError.

Pipeline wiring:
- pass_manager.py default tile-PTO pipeline now inserts MaterializeTensorStrides between CanonicalizeIOOrder and InitMemRef. With P6 producing canonical-form DN parameters, the materialized strides match the IR shape directly — codegen takes the `has_explicit_stride` branch and bypasses the legacy `dn_swap` / `get_shape_source_idx` path.
- MaterializeTensorStrides now rebuilds Calls via direct ctor (preserving the intentional type set by FlattenTileNdTo2D's manual rank-2 override) instead of routing through OpRegistry::Create, which would have re-deduced a rank-3 TileType from the rank-3 source-window args and silently undone the flattening.

Test changes:
- test_lower_transpose_load_param_layout_pass.py (new): 8 cases — B^T, A^T, AB^T, no-op, idempotent, mixed-use rejection, partial-load. Built with programmatic assertions (not `@pl.program` for After) since `tensor.as_layout` is internal-only and not exposed in `pl.*`.
- test_pass_manager.py: the default and DebugTileOptimization expected pass lists now include LowerTransposeLoadParamLayout and MaterializeTensorStrides.
- test_pto_codegen_3d_dn_tensor_view_uses_canonical_stride (renamed): now asserts the RFC hw-native-sys#1300 canonical (shape, stride, layout) triple — shape preserved as written (`[2, 48, 64]`), strides `[3072, 1, 48]` (DN-packed: stride[n-2]=1, stride[n-1]=shape[n-2]=48, stride[n-3]=shape[n-2]*shape[n-1]=3072), layout=dn. The old expectation was the legacy `dn_swap` form (`[2, 64, 48]` shape, `[3072, 1, 64]` stride), which the canonical pipeline intentionally replaces.

Cross-layer / docs:
- Renamed src/ir/transforms/resolve_transpose_layout_pass.cpp → lower_transpose_load_param_layout_pass.cpp (git mv preserves history).
- Renamed docs/{en,zh-cn}/dev/passes/17-resolve_transpose_layout.md → 17-lower_transpose_load_param_layout.md and rewrote.
- pass_properties.h: kLowerTransposeLoadParamLayoutProperties.
- passes.h / passes.cpp binding / passes.pyi / pass-doc-ordering.md updated.
- All doc cross-references and the docs/{en,zh-cn}/user/01-language_guide.md user-facing pipeline list now reference the new pass name.
CI fix (root cause of pypto-lib-model + system-tests failures):
- P6's call-site injector previously inlined `tensor.as_layout(arg, DN)` into the kernel-call args directly, which the orchestration codegen rejects with `Call to '<callee>' arg N is neither a variable nor a recognized constant literal`. Refactor `CallSiteAsLayoutInjector` to operate at the statement level: for every AssignStmt / EvalStmt / ReturnStmt whose RHS targets a promoted callee, emit one `bridged_<param> = tensor.as_layout(arg, DN)` AssignStmt immediately before the call statement and replace the inline Call arg with the bound Var. The net IR is SSA-form and matches what orchestration codegen consumes per arg slot (Var | const-literal).

Review comments addressed:
- gemini #1: codegen for `tensor.as_layout` now special-cases the identity flip (target layout == source layout) and emits a plain `Tensor result = input;` alias instead of a spurious `.transpose()`. Simplify still folds these before codegen in the default pipeline, but the codegen is now robust against ad-hoc compile paths that skip Simplify.
- coderabbitai hw-native-sys#2 / hw-native-sys#3: drop the "next default pass" wording in en/zh-cn doc 17 — `MaterializeTensorStrides` runs later in the pipeline (after `CanonicalizeIOOrder`), not immediately after. The zh-cn doc's "17th pass" text is also clarified — the 17 is the docs/passes/ slot, not a literal pipeline call-count.
- coderabbitai hw-native-sys#4: `DeduceTensorAsLayoutType` now preserves the source TensorView's `valid_shape` (with trailing-pair swap on cross-layout flips) and `pad` through `tensor.as_layout`. Previously these fields were dropped, making the reinterpret silently lossy for sliced or fill-padded inputs.
- coderabbitai hw-native-sys#6: the `MaterializeTensorStrides` direct-ctor rebuild path now forwards `op->attrs_`. The previous version preserved type and kwargs but dropped attrs, which would have silently discarded call metadata (arg_directions, manual_dep_edges) attached by earlier passes.
- coderabbitai hw-native-sys#7: update the stale comment block above `VisitExpr_` in `simplify_pass.cpp` — it still described the dropped shape-bearing `as_layout(x, shape, layout)` form and the never-implemented chain folding. The new comment accurately describes the single identity-elimination rule and explains why chain folding is deferred.
- coderabbitai hw-native-sys#8 / hw-native-sys#9: the unit-test pattern that previously inspected an inline `tensor.as_layout` Call as a kernel-call arg no longer applies after the SSA refactor above. Tests now look up the bridge via `_find_assign_rhs(orch, var)` and guard `op is not None` before reading `op.name` (matching the defensive pattern already used in the B^T test).

Skipped (with reason):
- coderabbitai hw-native-sys#5: "Handle 3-arg `tile.load`". `tile.load` registers four mandatory args (tensor, offsets, shapes, valid_shapes) and the Python builder always materializes `valid_shapes` (defaults to `shapes` when the caller omits it). Once the IR is constructed, every `tile.load` is 4-arg — the 3-arg form only exists at the DSL surface. The internal check stays as-is.
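The per-load rewrite that P6 applies to a promoted parameter is mechanical and can be sketched on plain tuples. This is a stand-in model under the semantics described above, not the real pass code:

```python
def rewrite_promoted_tile_load(offsets, shapes, valid_shapes, transpose):
    """Rewrite a tile.load on a DN-promoted parameter: swap the trailing
    pair of every coordinate tuple and force transpose=False (the kwarg
    slot is kept so print -> reparse round-trips faithfully)."""
    assert transpose, "only transpose=True loads are rewritten"

    def swap_trailing(coords):
        out = list(coords)
        out[-2], out[-1] = out[-1], out[-2]
        return out

    return (swap_trailing(offsets), swap_trailing(shapes),
            swap_trailing(valid_shapes), False)

offs, shps, vshps, tr = rewrite_promoted_tile_load(
    offsets=[0, 16], shapes=[32, 64], valid_shapes=[32, 48], transpose=True)
assert (offs, shps, vshps, tr) == ([16, 0], [64, 32], [48, 32], False)
```

A load that was already `transpose=False` on an unpromoted parameter is left untouched; only loads on promoted parameters go through this rewrite.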
…es in wrapper-reorder

Two related fixes for paged-attention ST failures on a2a3 introduced by the P6 orch-side ``tensor.as_layout`` injection.

**1. ``tensor.as_layout`` codegen now emits a plain alias.**

Previously the orchestration handler lowered ``tensor.as_layout(input, DN)`` to ``input.transpose(N-2, N-1)``. The runtime ``Tensor::transpose`` swaps ``shape`` / ``raw_shape`` / **offsets**, which is correct for ``tensor.transpose`` (a physical-index permutation) but wrong for ``tensor.as_layout`` — the IR-level semantics is "reinterpret the same physical bytes under a different layout tag", so the runtime tensor's ``offsets`` must stay in source coordinates.

Concrete failure mode (paged_attention on a2a3): the orchestration passes ``kj`` with ``offsets = [block_offset, 0]``; the spurious ``.transpose(0, 1)`` swapped them to ``[0, block_offset]``, shifting the base address by a factor of ``raw_shape[1]`` and silently corrupting every kernel that received the bridged view. The downstream kernel already encodes the canonical ``(shape, stride, layout)`` triple via its IR-declared param type, so re-emitting it at runtime was redundant — the alias preserves the orch-side ``offsets`` exactly while the kernel-side ``make_tensor_view`` applies the canonical interpretation.

This addresses the 7 paged-attention numerical-mismatch failures on a2a3 (``test_paged_attention_ptoas``, ``test_paged_attention_unaligned_ptoas``, the dynamic-paged variants, and ``test_dyn_orch_paged_attention``). The a5sim variant already passed because a5's PTOAS layout is more forgiving.

**2. Wrapper-reorder chases ``tensor.as_layout`` aliases.**

For SPMD / Group / orchestration wrapper functions, codegen splices an outer caller's args through the wrapper's parameter list to the inner-call's parameter list. With P6 injecting ``bridged_kj = tensor.as_layout(kj, DN)`` before the inner call inside the wrapper body, the inner-call arg becomes ``bridged_kj`` — a wrapper-local Var, not a wrapper parameter — and ``BuildWrapperReorderedParams`` failed with ``"inner call arg N does not map to any wrapper parameter"``.

Fix: collect a per-wrapper alias map by walking the wrapper body for ``AssignStmt(v, tensor.as_layout(src, ...))`` pairs, then chase each inner-call arg through the chain back to the wrapper parameter. The lowering on the other side (#1) is the plain alias, so this "see-through" mapping is semantically equivalent — the wrapper splice still routes the outer arg to the same memory.

This addresses the 3 SPMD paged_attention compile failures (``test_paged_attention_spmd_ptoas`` variants).
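The alias-map walk can be sketched on a toy statement representation. The tuple encoding of statements below is a hypothetical stand-in for the real IR, chosen only to show the chase-back logic:

```python
def build_alias_map(wrapper_body):
    """Collect v -> src for every AssignStmt(v, tensor.as_layout(src, ...))
    in the wrapper body. Statements are modeled as simple tuples here:
    ("assign", lhs, rhs) or ("eval", call)."""
    aliases = {}
    for stmt in wrapper_body:
        if stmt[0] == "assign" and stmt[2][0] == "tensor.as_layout":
            lhs, (_, src, _layout) = stmt[1], stmt[2]
            aliases[lhs] = src
    return aliases

def resolve_alias_chain(arg, aliases):
    """Chase an inner-call arg through as_layout aliases back to the
    wrapper parameter it was derived from."""
    while arg in aliases:
        arg = aliases[arg]
    return arg

body = [("assign", "bridged_kj", ("tensor.as_layout", "kj", "DN")),
        ("eval", ("call", "qk_kernel", ["qi", "bridged_kj", "out"]))]
amap = build_alias_map(body)
assert resolve_alias_chain("bridged_kj", amap) == "kj"   # chased to param
assert resolve_alias_chain("qi", amap) == "qi"           # non-alias passthrough
```

Because the lowering on the other side is a plain alias, routing the outer arg to the resolved source is byte-for-byte equivalent.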
…w-native-sys#1300 P7) Cleans up the PTO codegen to consume the IR's canonical ``(shape, stride, layout)`` triple verbatim. No more dual code paths between the legacy "DN + empty stride → implicit shape/offset swap" and the canonical "DN + explicit stride → emit-as-is" forms — the latter is now the only path. After P6 + ``MaterializeTensorStrides``, all DN-tagged tensors arrive at codegen with canonical strides materialized, so the swap path is dead code.

### What changes

- **``src/codegen/pto/pto_codegen.cpp::EmitMakeTensorViews``**
  - Removes ``get_shape_source_idx`` (the implicit trailing-pair swap helper) and the dual stride-derivation paths.
  - Single derivation: explicit stride when present, otherwise canonical DN strides (RFC §2.3: ``stride[-2]=1``, ``stride[-1]=shape[-2]``, outer strides walk row-major over the DN-block volume) or canonical ND strides (``stride[-1]=1``, ``stride[k]=stride[k+1]*shape[k+1]``).
  - Keeps the ``[M, 1]`` column-vector legacy carve-out (PTOAS infers DN for that shape regardless of IR layout tag).
  - Precomputes shape dim SSA names up-front so dynamic-shape casts (``EmitCastToIndex``) emit their setup ops *before* the ``pto.make_tensor_view`` line instead of interleaving inside it.
- **``src/backend/common/pto_ops_common.cpp``**
  - ``MakeTileLoadCodegenPTO`` and ``MakeTileStoreCodegenPTO`` drop their ``dn_swap`` branches that swapped the trailing pair of ``offsets`` / ``valid_shapes`` / ``shapes`` tuples. The IR-level lowering (P6 ``LowerTransposeLoadParamLayout``) now produces all coordinates in canonical form, so the codegen transcribes them verbatim.

### Why now

This unifies the codegen entry contract: every layout-aware transform is finalized at the IR level (P3 canonical TensorView, P4 ``tensor.as_layout``, P6 ``LowerTransposeLoadParamLayout``), and the codegen has no remaining layout logic of its own.

Concrete benefits:
- Removes the asymmetry that caused the non-square paged-attention regressions on a2a3 — both legacy ``dn_swap`` and canonical paths emit the same MLIR for square cases, but diverged for non-square shapes once ``MaterializeTensorStrides`` was activated in the default pipeline.
- ``.pto`` output is now deterministically reproducible from the IR's canonical triple — no codegen-layer interpretation steps in between.

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped (no regressions)

The 2D codegen optimization (``stride[-2] = shape[-1]`` directly, no spurious ``arith.muli %c1, shape[-1]``) preserves the byte-for-byte ``.pto`` output expected by existing 2D MLIR golden-string tests.
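The canonical stride derivation described above is easy to state concretely. A sketch of the fallback rule (the function is a stand-in for the C++ codegen path, with string layout tags for illustration):

```python
def canonical_strides(shape, layout):
    """Canonical packed strides per the RFC convention described above.

    ND (row-major): stride[-1] = 1, stride[k] = stride[k+1] * shape[k+1].
    DN-packed:      stride[-2] = 1, stride[-1] = shape[-2], and outer
                    strides walk row-major over the DN-block volume.
    """
    n = len(shape)
    s = [0] * n
    if layout == "ND":
        s[-1] = 1
        for k in range(n - 2, -1, -1):
            s[k] = s[k + 1] * shape[k + 1]
    else:  # DN
        s[-2], s[-1] = 1, shape[-2]
        block = shape[-2] * shape[-1]   # volume of one DN block
        for k in range(n - 3, -1, -1):
            s[k] = block
            block *= shape[k]
    return s

# The renamed 3D test's expectation: [2, 48, 64] DN -> [3072, 1, 48].
assert canonical_strides([2, 48, 64], "DN") == [3072, 1, 48]
assert canonical_strides([2, 48, 64], "ND") == [3072, 64, 1]
```

When the IR already carries an explicit stride (the post-`MaterializeTensorStrides` case), codegen emits it verbatim and this fallback never runs.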
Addresses coderabbitai's review on ``EmitMakeTensorViews``: the ``is_column_vector`` check previously matched ``rank == 2 || rank == 3``, but the column-vector stride fallback only populates ``stride_names[0]`` and ``stride_names[1]``. For a hypothetical rank-3 ``[_, M, 1]`` input the third stride slot would be empty and the codegen would emit malformed MLIR. The legacy column-vector behavior was always rank-2 in practice (PTOAS infers DN for ``[M, 1]`` specifically) and all in-tree test coverage is rank-2. Restrict the condition to ``rank == 2`` and add a comment explaining the constraint, mirroring coderabbitai's suggested minimal fix.
…a2a3 paged_attention)
Adds ``set_output_memory_inherit_input()`` to the ``tensor.as_layout`` op
registration. Without it, ``InitMemRef`` saw the op as a regular tensor
producer and minted a fresh MemRef (with its own ``mem_ddr_*`` allocation)
for the result. The orchestration codegen, meanwhile, lowers
``tensor.as_layout`` to a plain alias ``Tensor result = input;`` — so the
runtime tensor's data pointer points to the *input's* buffer, while the IR
declares a *separate* buffer.
Concrete failure mode (non-square paged-attention on a2a3): for kj of shape
``[64, 128]`` ND promoted to ``[128, 64] DN``, the orch IR looked like
```
kj_dn_view: pl.Tensor[[128, 64], ..., pl.MemRef("mem_ddr_3", 0, 16384)] =
pl.tensor.as_layout(kj, layout=DN)
self.qk_kernel(qi, kj_dn_view, out)
```
The kernel binary's ``make_tensor_view`` produced the right (shape, stride,
layout) triple, but the orch passed a stale buffer pointer (the
freshly-allocated mem_ddr_3 was never written to; only the alias to kj's
mem_ddr_1 held real data). Square cases happened to pass because the
trailing-pair shape swap is identity, and PTOAS's runtime address arithmetic
collapsed to the same byte range either way.
After this fix, the orch IR shows ``kj_dn_view`` sharing the input's MemRef
``mem_ddr_1`` — no spurious allocation, no aliasing mismatch.
### Validation
- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped
- Local IR inspection confirms ``kj_dn_view`` inherits ``mem_ddr_1`` (was
``mem_ddr_3`` before this fix).
…s) — RFC hw-native-sys#1300

User's analysis pinpointed the actual root cause of the non-square paged-attention regressions on a2a3: the PTOAS-generated kernel wrapper reads dynamic dim values directly from the runtime ``Tensor`` struct's ``shapes[i]``, indexed under the **IR-declared post-P6 shape order**. For ``key_cache`` promoted from ``[256, 128] ND`` to ``[128, 256] DN``, the kernel expects:

```
KV_HEAD_DIM_DYN    = key_cache_t->shapes[0]  // → 128
KEY_CACHE_ROWS_DYN = key_cache_t->shapes[1]  // → 256
```

But my prior plain-alias codegen left ``shapes`` in the pre-swap (ND) order — the kernel read ``shapes[0]=256`` and computed DN strides off the wrong axis. Square cases happened to survive because the swap is identity.

**Fix:** the orch codegen for ``tensor.as_layout`` now swaps the trailing-pair ``shapes`` so the runtime tensor matches the IR-declared post-swap order. ``raw_shapes`` and ``offsets`` stay in the source (pre-swap) coord system because PTOAS uses ``raw_shape``-derived strides plus ``offsets`` to compute ``start_offset`` (the byte offset of the view into the physical buffer) — and that base address must continue pointing at the original ND-coord slice (e.g. paged-attention's ``offsets = [block_offset, 0]`` on the row-major ``key_cache``). If ``is_raw_eq_shapes`` is true, ``raw_shapes`` is materialized from the current ``shapes`` *before* the swap so the subsequent ``shapes`` mutation does not pollute the raw-shape-derived stride arithmetic.

The identity flip (target layout == source layout) still lowers to a plain ``Tensor result = input;`` alias — no swap needed.

### Why not use ``Tensor::transpose(N-2, N-1)``

That runtime helper additionally swaps ``raw_shapes`` and ``offsets``, which is correct for ``tensor.transpose`` (a physical-index permutation) but wrong for ``tensor.as_layout`` (a metadata reinterpret of the same bytes). For paged-attention, swapping ``offsets = [block_offset, 0]`` to ``[0, block_offset]`` shifted the base address by a factor of ``raw_shape[1]`` and silently corrupted reads. Our new lowering targets the precise mutation needed: a shape-only swap.

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped
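The reason ``raw_shapes`` / ``offsets`` must stay in source coordinates falls out of the start-offset arithmetic described above. A sketch of that computation (field names modeled after the description, not the actual runtime struct):

```python
def start_offset(raw_shape, offsets):
    """Element offset of a view into its physical buffer: row-major
    strides derived from raw_shape, dotted with per-dim offsets."""
    stride = [1] * len(raw_shape)
    for k in range(len(raw_shape) - 2, -1, -1):
        stride[k] = stride[k + 1] * raw_shape[k + 1]
    return sum(s * o for s, o in zip(stride, offsets))

raw = [256, 128]                   # key_cache physical raw_shape (ND)
good = start_offset(raw, [3, 0])   # block_offset = 3 rows into the cache
bad = start_offset(raw, [0, 3])    # after a spurious offsets swap
assert good == 3 * 128             # points at the intended slice
assert bad == 3                    # base shifted by a factor of raw_shape[1]
```

Swapping only ``shapes`` leaves this base-address computation intact while the kernel still reads its dynamic dims in the IR-declared post-swap order.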
- MaterializeTensorStrides: remap VarPtrs in manual_dep_edges / user_manual_dep_edges attrs when rebuilding Calls, so attr entries follow the fresh Vars minted for materialized TensorViews. Without this, SSAVerify catches "used outside its defining scope" and orchestration codegen later raises "manual_dep_edge has no producer task" on manual-scope pipelines (exposed by test_manual_scope_{seq_outer_parallel,parallel_outer_seq}_inner_two_stage_pipeline).
- pto_codegen.cpp: extend the ``[..., M, 1]`` column-vector carve-out from rank == 2 to rank >= 2 with a stride derivation that fills all rank slots (legacy PTOAS convention: ``stride[rank-2]=1``, ``stride[rank-1]=shape[rank-1]``, ``stride[rank-3]=shape[rank-2]``, outer dims walk over M). Restores main's behaviour for rank-3 ``[B, N, 1]`` (and matches the ColMajor BLayout that ``DeduceTileLoadType`` already produces for trailing-dim-1 ``tile.load``s), fixing the ``TLoad isSameLayout`` PTOAS compile failure surfaced by test_tensor_expand_clone[a2a3-2] (broadcast_dim=2, input ``[B, N, 1]``).
…body (RFC #1300) (#1339)

## Summary

Moves the P6 ``tensor.as_layout`` bridge from the orch call site to the top of the InCore body — end-to-end equivalent but **-132 LOC** net, and removes a cluster of incidental complexity in orchestration codegen. See #1300 [discussion comment](#1300 (comment)) for the design rationale and consensus question.

## What changes

For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``:

**Before (current main, post-#1324):**
- InCore param signature is promoted to ``[..., b, a] DN``.
- Orch call site is wrapped: ``bridged = tensor.as_layout(arg, DN); incore(bridged, ...)``.
- Orchestration codegen has to chase aliases through the bridge via ``BuildWrapperAliasMap`` + ``ResolveAliasChain`` to recover the original wrapper param.

**After (this PR):**
- InCore param signature is **untouched** (stays ``[..., a, b] ND``, matching the runtime torch tensor).
- InCore body is **prepended** with ``p_dn = tensor.as_layout(p, layout=DN)``; body uses of ``p`` are substituted with ``p_dn``.
- The matching ``tile.load`` is rewritten to read from ``p_dn`` with the trailing pair of offsets/shapes/valid_shapes swapped and ``transpose=False``.
- Orch is left completely alone — the orchestrator's call args are wrapper params directly.

## Diff stats

| File | Change |
|------|--------|
| ``src/ir/transforms/lower_transpose_load_param_layout_pass.cpp`` | -174 / +14 — deletes ``CallSiteAsLayoutInjector`` + Phase 2 + ``PromoteToCanonicalDN``; new ``LowerInCoreFunction`` prepends to body |
| ``src/codegen/orchestration/orchestration_codegen.cpp`` | -60 / 0 — deletes ``BuildWrapperAliasMap`` / ``ResolveAliasChain`` / alias-chasing fallback |
| ``src/backend/common/pto_ops_common.cpp`` | 0 / +86 — registers ``tensor.as_layout`` PTO codegen (emits one ``pto.make_tensor_view`` sharing the input's base) |
| Tests + docs + bindings | -188 / +291 — 5 pass-test bodies rewritten to assert the new IR shape; pass 18 docs (en/zh-cn) rewritten; pass 26 example caption updated; passes.h Doxygen + nanobind docstring + pyi stub updated |

Total: **-485 / +391 = -94 LOC net** (the +291 in tests/docs is mostly added explanatory comments and structured assertions — the production code delta is **-297 / +165 = -132 LOC**).

## Why this is acceptable per RFC §4.2

RFC §4.2's "InCore cannot create tensors" invariant targets ops that **allocate a byte buffer** (``tensor.create``). ``tensor.as_layout`` is a pure metadata reinterpret — it allocates nothing; it just re-describes the input's existing physical buffer.

The four-layer boundary (§5) becomes cleaner under this design:
- **Runtime / Orch**: row-major ND physical buffer (matches runtime).
- **Cross-function boundary**: always row-major ND (no layout reinterpret).
- **Inside an InCore body**: derive the DN view via ``tensor.as_layout``; this is a single-function internal detail.
- **`.pto`**: codegen consumes whatever canonical triple the InCore body sees.

## Validation

- ``cmake --build build --parallel`` — clean.
- ``pytest tests/ut/ -n auto --maxprocesses 8`` — **4602 passed / 41 skipped / 0 failed**.
- Golden-string ``.pto`` codegen tests pass — output is byte-identical to main.
- End-to-end matmul B^T, paged-attention (single + multi-config), orchestration codegen tests all pass.

## Test plan

- [x] All existing unit tests pass.
- [x] Pass-specific tests rewritten to validate the new IR shape (body-prepended ``tensor.as_layout`` binding + ``tile.load`` reading from the binding LHS + orch left alone).
- [x] ``cmake --build`` clean.
- [ ] CI: clang-tidy, pre-commit, unit-tests (macos + ubuntu), fuzz-tests-sim, system-tests, system-tests-a5sim, pypto-lib-model.

## Discussion

Open for RFC author / reviewers to weigh in. The design tradeoff is "signature is the contract" (current main) vs. "cross-function boundary is the runtime-faithful boundary, DN view is a per-kernel detail" (this PR). The latter eliminates downstream codegen complexity at the cost of slightly less honest InCore signatures.
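The body-prepend rewrite above can be shown in miniature. Statements are modeled as plain tuples here — a hypothetical stand-in for the real IR nodes, and the sketch deliberately omits the accompanying `tile.load` coordinate swap:

```python
def prepend_dn_bridge(param, body):
    """Prepend `p_dn = tensor.as_layout(p, layout=DN)` to an InCore body
    and substitute body uses of p with p_dn. Statements are (op, *args)
    tuples in this sketch, not real IR nodes."""
    bridged = f"{param}_dn"

    def subst(stmt):
        # Replace every direct use of the parameter name in a statement.
        return tuple(bridged if a == param else a for a in stmt)

    new_body = [("assign", bridged, ("tensor.as_layout", param, "DN"))]
    new_body += [subst(s) for s in body]
    return new_body

body = [("tile.load", "p", [0, 0], [32, 64]),
        ("tile.store", "out", "p")]
out = prepend_dn_bridge("p", body)
assert out[0] == ("assign", "p_dn", ("tensor.as_layout", "p", "DN"))
assert out[1] == ("tile.load", "p_dn", [0, 0], [32, 64])
assert out[2] == ("tile.store", "out", "p_dn")
```

The parameter itself keeps its ND type, so the cross-function boundary stays runtime-faithful; only the body sees the DN view.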
## Summary
P4 + P6 of the RFC #1300 canonical TensorView roadmap. Lands the foundational `tensor.as_layout` virtual op (P4), then uses it in `LowerTransposeLoadParamLayout` (P6) — the first pipeline pass that produces canonical-form IR — and activates `MaterializeTensorStrides` in the default pipeline.

After this PR, every InCore parameter loaded via `tile.load(transpose=True)` is promoted at the IR level to canonical-form DN (RFC §3.3 + §4.2), all body loads on that param are expressed in canonical coords with `transpose=False`, and every non-InCore call site bridges its arg through `tensor.as_layout(arg, DN)`. Codegen reads `(shape, stride, layout)` directly from the materialized canonical TensorView and bypasses the legacy `dn_swap` / `get_shape_source_idx` path for these tensors.

## Commits
- P4-a: `DeduceTensorAsLayoutType` (validity invariants: NZ rejection, RFC §4.2 canonical-pair shape derivation, packed-source check) + Python IR builder + 9 unit tests.
- P4-b: Simplify identity fold (`as_layout(x, x.layout)` → `x`) + 2 tests. Chain folding deferred (after-SSA Var indirection).
- Codegen: `Tensor::transpose(N-2, N-1)`, matching the canonical reinterpret pair.
- Review follow-up: drop the `shape` parameter from `tensor.as_layout`: the target shape is uniquely determined by the §4.2 canonical-pair rule, so callers (P6 in particular) never compute it themselves.
- P6: `ResolveTransposeLayout` → `LowerTransposeLoadParamLayout`: promote params, swap body trailing-pair coords, drop `transpose=True`, inject `tensor.as_layout` at every non-InCore call site. Activate `MaterializeTensorStrides` in the default pipeline.

## P6 pass behaviour (RFC #1300 §3.3 + §4.2)
- Param promotion: `[..., a, b] ND` → `[..., b, a] DN` (trailing-pair swap + DN tag, empty stride — filled later by MaterializeTensorStrides)
- `tile.load` rewrite: `transpose` kwarg flipped to `False` (kept in the slot so print-reparse round-trips faithfully)
- Call-site bridge: `tensor.as_layout(arg, DN)`
- `DeduceTileLoadType`: `DN source + transpose=False` now derives the same Mat tile-view layout as the legacy `ND-source + transpose=True` (XOR), so the two forms produce identical `TileType`
- Rejected: mixed-use params (both `transpose=True` and `transpose=False`); `tensor.transpose` result with explicit physical strides + DN (would compose as a double transpose)

## Pipeline wiring
- `pass_manager.py` default tile-PTO pipeline now inserts `MaterializeTensorStrides` between `CanonicalizeIOOrder` and `InitMemRef`. With P6 producing canonical-form DN parameters, the materialized strides match the IR shape directly — codegen takes the `has_explicit_stride` branch and bypasses the legacy `dn_swap` / `get_shape_source_idx` path for those tensors.
- `MaterializeTensorStrides` now rebuilds Calls via direct ctor (preserving the intentional rank-2 type set by `FlattenTileNdTo2D`) rather than routing through `OpRegistry::Create` (which would have re-deduced a rank-3 `TileType` from the rank-3 source-window args and silently undone the flattening).

## ⚠ Test expectation change
`test_pto_codegen_3d_dn_tensor_view_uses_last_dim_stride` is renamed to `test_pto_codegen_3d_dn_tensor_view_uses_canonical_stride` and re-targeted to assert the RFC canonical form:

| | old (legacy) | new (canonical) |
|---|---|---|
| shape | `[2, 64, 48]` (post-dn_swap) | `[2, 48, 64]` (IR-as-written, no swap) |
| last-dim stride | `64` (source last dim) | `48` (= `shape[n-2]` per RFC §2.3) |
| outer stride | `3072` (= 48 × 64) — emitted via `arith.muli` | `3072` — materialized as a compile-time constant by `MaterializeTensorStrides` |
pl.Tensor[[2, 48, 64], pl.FP32, pl.DN]) is unchanged; only the codegen lowering is now canonical-direct.Test plan
- `cmake --build build --parallel` — clean
- `pytest tests/ut/ir/operators/test_tensor_as_layout.py` — 9 passed (P4-a)
- `pytest tests/ut/ir/transforms/test_simplify_pass.py::TestAsLayoutFolding` — 2 passed (P4-b)
- `pytest tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py` — 8 passed (P6: B^T / A^T / AB^T / no-op / idempotent / mixed-use rejection / partial-load)
- `pytest tests/ut/` — 4522 passed / 33 skipped, no regressions
- `pytest tests/lint/check_english_only.py tests/lint/check_headers.py` — clean

## Design decisions (per RFC issue threads)
- Not exposed in `pypto.language`. Only IR-level passes construct `tensor.as_layout`.
- Equivalence rule: `[..., a, b]` ND ≡ `[..., b, a]` DN-packed. NZ rejected on TensorType.
- No shape parameter on `as_layout`: shape-changing reinterprets stay with `tensor.reshape`. `as_layout` is layout-tag-only; the trailing-pair shape swap is mechanical.

## Roadmap
- `tensor.as_layout` op
- `tensor.slice` / `tensor.reshape` inherit parent layout family
- `LowerTransposeLoadParamLayout` + `MaterializeTensorStrides` in default pipeline
- Codegen cleanup (drop legacy `dn_swap` / `has_explicit_stride` / `get_shape_source_idx` branches now that P6+P3 supply explicit strides)
- User surface (`pl.DN`, `pl.load(transpose=True)`)
- `pl.move(layout=)` kwarg

🤖 Generated with Claude Code