
feat(ir): tensor.as_layout op + LowerTransposeLoadParamLayout + activate MaterializeTensorStrides (#1300 P4 + P6)#1324

Merged

Hzfengsy merged 13 commits into hw-native-sys:main from lyfne123:issue-1300-tensor-as-layout on May 12, 2026.

Conversation

@lyfne123 (Collaborator) commented May 9, 2026

Summary

P4 + P6 of the RFC #1300 canonical TensorView roadmap. Lands the foundational tensor.as_layout virtual op (P4), then uses it in LowerTransposeLoadParamLayout (P6) — the first pipeline pass that produces canonical-form IR — and activates MaterializeTensorStrides in the default pipeline.

After this PR, every InCore parameter loaded via tile.load(transpose=True) is promoted at the IR level to canonical-form DN (RFC §3.3 + §4.2), all body loads on that param are expressed in canonical coords with transpose=False, and every non-InCore call site bridges its arg through tensor.as_layout(arg, DN). Codegen reads (shape, stride, layout) directly from the materialized canonical TensorView and bypasses the legacy dn_swap / get_shape_source_idx path for these tensors.

Commits

  • P4-a (bff1be59) — Op definition + DeduceTensorAsLayoutType (validity invariants: NZ rejection, RFC §4.2 canonical-pair shape derivation, packed-source check) + Python IR builder + 9 unit tests.
  • P4-b (71686cc8) — Simplify pass identity-elimination rule (as_layout(x, x.layout) → x) + 2 tests. Chain folding deferred (after-SSA Var indirection).
  • P4-c (eacf40ff) — Orchestration codegen handler emits Tensor::transpose(N-2, N-1), matching the canonical reinterpret pair.
  • P4 refactor (52e51d4e) — Drop the shape parameter from tensor.as_layout: the target shape is uniquely determined by the §4.2 canonical-pair rule, so callers (P6 in particular) never compute it themselves.
  • P6 (86072218) — ResolveTransposeLayout → LowerTransposeLoadParamLayout: promote params, swap body trailing-pair coords, drop transpose=True, inject tensor.as_layout at every non-InCore call site. Activate MaterializeTensorStrides in the default pipeline.
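The P4 refactor above relies on the target shape being uniquely derivable. A minimal sketch of that §4.2 canonical-pair rule (the helper name `canonical_pair_shape` is illustrative, not the real IR API): an ND↔DN flip swaps the trailing dimension pair, an identity flip keeps the shape, and NZ is rejected up front.

```python
def canonical_pair_shape(src_shape, src_layout, dst_layout):
    """Derive the as_layout result shape by swapping the trailing pair.

    Identity flips (same layout) keep the shape; ND<->DN flips swap the
    last two dims. NZ is rejected, mirroring the op's validity invariant.
    """
    if "NZ" in (src_layout, dst_layout):
        raise ValueError("tensor.as_layout rejects NZ layouts")
    if len(src_shape) < 2:
        raise ValueError("canonical-pair rule needs rank >= 2")
    if src_layout == dst_layout:  # identity reinterpret: shape unchanged
        return list(src_shape)
    *batch, a, b = src_shape      # [..., a, b] -> [..., b, a]
    return [*batch, b, a]

print(canonical_pair_shape([2, 48, 64], "ND", "DN"))  # -> [2, 64, 48]
```

Because the result is fully determined by the source, callers such as the P6 pass never have to compute or pass a shape argument themselves.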

P6 pass behaviour (RFC #1300 §3.3 + §4.2)

| Action | Detail |
| --- | --- |
| Param TensorType promotion | [..., a, b] ND → [..., b, a] DN (trailing-pair swap + DN tag, empty stride — filled later by MaterializeTensorStrides) |
| Body tile.load rewrite | offsets / shapes / valid_shapes trailing pair swapped to canonical coords; transpose kwarg flipped to False (kept in the slot so print-reparse round-trips faithfully) |
| Call-site bridging | Every non-InCore arg passed to a promoted slot is wrapped with tensor.as_layout(arg, DN) |
| DeduceTileLoadType | DN-source + transpose=False now derives the same Mat tile-view layout as the legacy ND-source + transpose=True (XOR), so the two forms produce identical TileType |
| Rejection | Mixed-use param (loaded with both transpose=True and transpose=False); tensor.transpose result with explicit physical strides + DN (would compose as a double transpose) |
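The DeduceTileLoadType behaviour above reduces to a single XOR. A sketch (hypothetical helper name) of the condition: the Mat tile-view layout is swapped exactly when the transpose flag and the DN-ness of the source disagree, which is what makes the legacy and canonical forms produce identical TileType.

```python
def needs_mat_layout_swap(transpose: bool, source_is_dn: bool) -> bool:
    # Mirrors the (transpose != source_is_dn) condition in DeduceTileLoadType.
    return transpose != source_is_dn  # XOR

# legacy form:    ND source, transpose=True  -> swap
# canonical form: DN source, transpose=False -> swap (same TileType)
# plain load:     ND source, transpose=False -> no swap
# double flip:    DN source, transpose=True  -> no swap
assert needs_mat_layout_swap(True, False)
assert needs_mat_layout_swap(False, True)
assert not needs_mat_layout_swap(False, False)
assert not needs_mat_layout_swap(True, True)
```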

Pipeline wiring

  • pass_manager.py default tile-PTO pipeline now inserts MaterializeTensorStrides between CanonicalizeIOOrder and InitMemRef. With P6 producing canonical-form DN parameters, the materialized strides match the IR shape directly — codegen takes the has_explicit_stride branch and bypasses the legacy dn_swap / get_shape_source_idx path for those tensors.
  • MaterializeTensorStrides now rebuilds Calls via direct ctor (preserving the intentional rank-2 type set by FlattenTileNdTo2D) rather than routing through OpRegistry::Create (which would have re-deduced a rank-3 TileType from the rank-3 source-window args and silently undone the flattening).

⚠ Test expectation change

test_pto_codegen_3d_dn_tensor_view_uses_last_dim_stride is renamed to test_pto_codegen_3d_dn_tensor_view_uses_canonical_stride and re-targeted to assert the RFC canonical form:

| Field | Legacy (pre-P6) | Canonical (post-P6) |
| --- | --- | --- |
| Emitted shape | [2, 64, 48] (post-dn_swap) | [2, 48, 64] (IR-as-written, no swap) |
| Stride[-1] | 64 (source last dim) | 48 (= shape[n-2] per RFC §2.3) |
| Stride[-3] | 3072 (= 48 × 64) — emitted via arith.muli | 3072 — materialized as a compile-time constant by MaterializeTensorStrides |

The user-facing IR (pl.Tensor[[2, 48, 64], pl.FP32, pl.DN]) is unchanged; only the codegen lowering is now canonical-direct.
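A minimal sketch of how the canonical DN-packed strides above could be derived (the helper name `dn_packed_strides` and exact rule placement are assumptions; the trailing-pair rule follows RFC #1300 §2.3 as quoted in the table): the trailing pair is stored transposed, so stride[-2] = 1 and stride[-1] = shape[-2], with batch strides built row-major from the packed trailing-pair size.

```python
def dn_packed_strides(shape):
    assert len(shape) >= 2, "DN needs a trailing dimension pair"
    *batch, a, b = shape
    strides = [1, a]             # strides for dims -2 and -1 (pair is transposed)
    size = a * b                 # packed element count of the trailing pair
    for dim in reversed(batch):  # outer (batch) dims stay row-major
        strides.insert(0, size)
        size *= dim
    return strides

# The PR's 3D DN example, shape [2, 48, 64]:
# stride[-1] = 48 (= shape[-2]), stride[-3] = 3072 (= 48 * 64)
print(dn_packed_strides([2, 48, 64]))  # -> [3072, 1, 48]
```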

Test plan

  • cmake --build build --parallel — clean
  • pytest tests/ut/ir/operators/test_tensor_as_layout.py — 9 passed (P4-a)
  • pytest tests/ut/ir/transforms/test_simplify_pass.py::TestAsLayoutFolding — 2 passed (P4-b)
  • pytest tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py — 8 passed (P6: B^T / A^T / AB^T / no-op / idempotent / mixed-use rejection / partial-load)
  • pytest tests/ut/ — 4522 passed / 33 skipped, no regressions
  • pytest tests/lint/check_english_only.py tests/lint/check_headers.py — clean

Design decisions (per RFC issue threads)

  • Internal-only (Q1): not exposed via pypto.language. Only IR-level passes construct tensor.as_layout.
  • Restricted reinterprets (RFC §4.2): currently only row-major [..., a, b] ND ≡ [..., b, a] DN-packed. NZ rejected on TensorType.
  • Single-responsibility as_layout: shape-changing reinterprets stay with tensor.reshape. as_layout is layout-tag-only; the trailing-pair shape swap is mechanical.
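The single-responsibility design pairs with the P4-b Simplify rule. A hedged sketch of that identity-elimination rule (the `Tensor` stand-in and function name are illustrative, not the real IR API): as_layout folds to its input only when the requested layout equals the source's layout; substantive flips are kept, and nested chains are deliberately left alone.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    shape: list
    layout: str

def simplify_as_layout(src: Tensor, target_layout: str):
    """Identity reinterprets fold to the source; real flips stay as calls."""
    if src.layout == target_layout:  # as_layout(x, x.layout) -> x
        return src
    # Substantive flip: keep a symbolic call node (chain folding deferred).
    return ("tensor.as_layout", src, target_layout)

x = Tensor([2, 48, 64], "DN")
assert simplify_as_layout(x, "DN") is x  # identity folded away
assert simplify_as_layout(x, "ND") != x  # layout flip preserved
```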

Roadmap

| Phase | Topic | Status |
| --- | --- | --- |
| P1–P3 | Canonical TensorView foundation | Landed in #1311 |
| P4 | tensor.as_layout op | This PR |
| P5 | tensor.slice / tensor.reshape inherit parent layout family | Independent — can land any time |
| P6 | LowerTransposeLoadParamLayout + MaterializeTensorStrides in default pipeline | This PR |
| P7 | Codegen cleanup (drop the legacy dn_swap / has_explicit_stride / get_shape_source_idx branches now that P6+P3 supply explicit strides) | Next |
| P8 | DSL deprecation (pl.DN, pl.load(transpose=True)) | After P7 |
| P9 | pl.move(layout=) kwarg | Independent |

🤖 Generated with Claude Code

coderabbitai Bot commented May 9, 2026

Note

Reviews paused: CodeRabbit automatically paused automatic reviews because this branch is under active development.
Walkthrough

This PR replaces the ResolveTransposeLayout pass with LowerTransposeLoadParamLayout, introducing a new tensor.as_layout IR operation for metadata-only layout flipping. The transformation promotes InCore function parameters from ND to canonical DN layout when used with tile.load(..., transpose=True), swaps trailing coordinate dimensions, and injects tensor.as_layout bridges at non-InCore call sites. Backend codegen no longer performs implicit DN coordinate swapping, assuming pre-canonicalized (shape, stride, layout) triples from the lowering pass.

Changes

RFC #1300 Layout Canonicalization

  • tensor.as_layout IR Op: Type Deduction & Core Logic (include/pypto/ir/transforms/utils/tensor_view_semantics.h, src/ir/op/tensor_ops/transform.cpp, python/pypto/ir/op/tensor_ops.py): New internal IR operation for ND↔DN layout flipping; adds #include <utility> support; type deduction swaps trailing dimensions, validates canonical strides, forbids NZ layouts.
  • tensor.as_layout Codegen & Testing (src/codegen/tensor_op_codegen.cpp, tests/ut/ir/operators/test_tensor_as_layout.py): Orchestration codegen emits a metadata-only alias without runtime swap; comprehensive test suite validates type inference, layout flipping, identity folding, rank/NZ rejection, and strided sub-view rejection.
  • Tile Load Type Inference: DN-Source Handling (src/ir/op/tile_ops/memory.cpp): Updates DeduceTileLoadType to detect DN-tagged sources; treats DN and explicit transpose as XOR-compatible via the (transpose != source_is_dn) condition for Mat memory layout swapping.
  • LowerTransposeLoadParamLayout Pass: Phase 1 InCore (src/ir/transforms/lower_transpose_load_param_layout_pass.cpp, lines 1–183): Phase 1 scans InCore functions, rejects mixed transpose modes, promotes parameters to canonical DN with swapped shapes, rewrites tile.load coordinate tuples, and drops the transpose=True kwarg.
  • LowerTransposeLoadParamLayout Pass: Phase 2 Call-Site (src/ir/transforms/lower_transpose_load_param_layout_pass.cpp, lines 263–455): Phase 2 injects tensor.as_layout(arg, DN) bindings at non-InCore call sites targeting promoted callees; skips injection when the argument is already DN-canonical.
  • LowerTransposeLoadParamLayout Pass Tests (tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py): Validates both phases: B-load, A-load, both-params, non-square, no-op, idempotence, mixed-mode rejection, and partial-load promotion via SSA binding inspection.
  • Old Pass Removal (src/ir/transforms/resolve_transpose_layout_pass.cpp, tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py): Deletes the entire ResolveTransposeLayout implementation and test module.
  • Pass Registration & Public API (include/pypto/ir/transforms/pass_properties.h, include/pypto/ir/transforms/passes.h, python/bindings/modules/passes.cpp, python/pypto/pypto_core/passes.pyi, CMakeLists.txt): Registers LowerTransposeLoadParamLayout in properties, C++ headers, Python bindings, and stubs; updates CMakeLists to link the new implementation.
  • Pass Pipeline Integration (python/pypto/ir/pass_manager.py, tests/ut/ir/transforms/test_pass_manager.py, tests/ut/codegen/test_pto_codegen_cross_core.py): Updates tile_pto_passes to use the new pass; inserts MaterializeTensorStrides before InitMemRef; updates test pass-sequence expectations.
  • Backend PTO Codegen Simplification (src/backend/common/pto_ops_common.cpp): Removes implicit DN last-two-dim coordinate swapping from tile.load/tile.store codegen; assumes IR tuples are already canonical.
  • PTO Codegen: EmitMakeTensorViews (src/codegen/pto/pto_codegen.cpp): Rewritten to directly materialize layout-aware (shape, stride, layout) triples without prior DN/ND shape swaps; precomputes SSA names, derives strides by layout semantics, and preserves the [M,1] column-vector special case.
  • Orchestration & MaterializeTensorStrides (src/codegen/orchestration/orchestration_codegen.cpp, src/ir/transforms/materialize_tensor_strides_pass.cpp): Adds wrapper alias-map helpers for tensor.as_layout layout bridges; updates MaterializeTensorStrides to always rebuild Calls with the direct constructor, avoiding OpRegistry re-deduction.
  • Simplify Pass: Identity Folding (src/ir/transforms/simplify_pass.cpp, tests/ut/ir/transforms/test_simplify_pass.py): Adds SimplifyAsLayout to eliminate identity as_layout calls; tests validate identity elimination and substantive layout-flip preservation.
  • Codegen Test Updates (tests/ut/codegen/test_pto_codegen.py): Updates the 3D DN tensor test to expect canonical stride/shape (no implicit swap); renames the test and updates assertions for logical shape order and canonical DN strides.
  • Documentation: Pass Definitions, English & Chinese (docs/en/dev/passes/18-lower_transpose_load_param_layout.md, docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md): Adds comprehensive pass documentation describing algorithm, scope, interactions, before/after examples, and RFC #1300 P6 behavior.
  • Documentation: Pass Pipeline & Ordering (docs/*/dev/passes/00-pass_manager.md, docs/*/dev/passes/17-*.md, docs/*/dev/passes/19-*.md, docs/*/dev/passes/26-*.md, docs/*/user/01-language_guide.md, .claude/rules/pass-doc-ordering.md): Updates pipeline descriptions, MaterializeTensorStrides status to "active since P6", the pass-ordering index, and user documentation.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested Labels

enhancement

Suggested Reviewers

  • Hzfengsy

Poem

A rabbit hops through transpose lands,
Where dimensions swap by careful hands,
DN layouts blessed, coordinates aligned,
Canonical forms, so well-designed! 🐰
Layout bridges span the call-site gap,
Strides materialized—wisdom's map! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.46%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main changes: introducing the tensor.as_layout op, the LowerTransposeLoadParamLayout pass, and activating MaterializeTensorStrides, with RFC phase identifiers (P4 + P6) providing context. |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset. It explains the summary, commits, pass behavior, pipeline wiring, test expectations, design decisions, and roadmap—all aligned with the file-level changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request implements the tensor.as_layout internal IR operator for metadata-only reinterpretation of tensor shapes and layouts, supporting row-major ND and DN-packed equivalence. The changes include C++ view semantics, Python bindings, type deduction, and simplification rules. Feedback points out a bug in the orchestration codegen where identity reinterprets incorrectly emit runtime transposes and 1D tensors cause crashes. A code suggestion is provided to ensure identity cases are handled as no-ops and rank checks are only applied during layout changes.

Comment thread src/codegen/tensor_op_codegen.cpp Outdated
@lyfne123 lyfne123 changed the title feat(ir): tensor.as_layout virtual op (#1300 P4) feat(ir): tensor.as_layout op + LowerTransposeLoadParamLayout + activate MaterializeTensorStrides (#1300 P4 + P6) May 11, 2026
coderabbitai Bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/zh-cn/dev/passes/00-pass_manager.md (1)

383-383: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale annotation: MaterializeTensorStrides is now wired into the default pipeline.

Line 383 still reads "已注册但尚未接入默认 pipeline(将随 RFC #1300 P6/P7 的 codegen 清理一起启用)", but the list itself includes it as step 13, and per the PR objectives this PR (P6) activates MaterializeTensorStrides in the default pipeline. Please drop the stale parenthetical so the doc doesn't contradict its own enumeration (and matches docs/en/dev/passes/17-lower_transpose_load_param_layout.md, which already treats it as part of the running pipeline).

📝 Suggested wording fix
-13. [`MaterializeTensorStrides`](25-materialize_tensor_strides.md) —— 已注册但尚未接入默认 pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)
+13. [`MaterializeTensorStrides`](25-materialize_tensor_strides.md) —— 在 `CanonicalizeIOOrder` 之后为 P6 提升出的 DN 参数物化 packed canonical strides

Also worth grepping the English docs/en/dev/passes/00-pass_manager.md (not in this review) for the same stale phrasing.

#!/bin/bash
# Make sure the English counterpart didn't keep the stale "not yet wired in" note.
fd -t f '00-pass_manager.md' docs
rg -nP -C2 'MaterializeTensorStrides' docs/en/dev/passes/00-pass_manager.md docs/zh-cn/dev/passes/00-pass_manager.md 2>/dev/null
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/zh-cn/dev/passes/00-pass_manager.md` at line 383, The doc text for
MaterializeTensorStrides is stale: in docs/zh-cn/dev/passes/00-pass_manager.md
update the line listing `MaterializeTensorStrides` (step 13) by removing the
parenthetical "已注册但尚未接入默认 pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)" so the
entry no longer claims it isn’t wired into the default pipeline; keep the entry
as-is otherwise to match the activated status used elsewhere (e.g.,
docs/en/dev/passes/17-lower_transpose_load_param_layout.md).
docs/en/dev/passes/00-pass_manager.md (1)

383-383: ⚠️ Potential issue | 🟠 Major

Update line 383 to reflect MaterializeTensorStrides is now active in the default pipeline.

The code in python/pypto/ir/pass_manager.py confirms that MaterializeTensorStrides is now part of the default pipeline (active since RFC #1300 P6), and docs/en/dev/passes/25-materialize_tensor_strides.md correctly documents this status. Line 383 of docs/en/dev/passes/00-pass_manager.md is outdated and should be updated to remove "not yet wired into the default pipeline" and instead indicate that it is active in the default pipeline between CanonicalizeIOOrder and InitMemRef.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/en/dev/passes/00-pass_manager.md` at line 383, Update the documentation
entry for MaterializeTensorStrides to reflect that it is now active in the
default pipeline: remove "not yet wired into the default pipeline" and state
that MaterializeTensorStrides is active (since RFC `#1300` P6) and placed between
CanonicalizeIOOrder and InitMemRef in the default pass sequence; ensure the
description matches the status in python/pypto/ir/pass_manager.py and the
details in 25-materialize_tensor_strides.md.
🧹 Nitpick comments (1)
src/ir/op/tile_ops/memory.cpp (1)

158-173: ⚡ Quick win

Use TensorType::IsDNLayout() convenience method instead of inline check.

The DN detection at line 164-165 correctly reads from tensor_view_->layout, but duplicates the logic already defined in TensorType::IsDNLayout() (include/pypto/ir/type.h line 477-479). Simplify to:

bool source_is_dn = tensor_type->IsDNLayout();

This improves readability and eliminates duplication. The underlying check is correct: DN is canonically stored only in tensor_view_->layout, and every DN-tagged TensorType reaching tile.load always materializes tensor_view_ (per PromoteToCanonicalDN() and deserialization pathways).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ir/op/tile_ops/memory.cpp` around lines 158 - 173, Replace the inline
DN-layout check used to set the local variable source_is_dn with the
TensorType::IsDNLayout() convenience method; locate where source_is_dn is
computed (currently using tensor_type->tensor_view_.has_value() &&
tensor_type->tensor_view_->layout == TensorLayout::DN) and change it to call
tensor_type->IsDNLayout() so the code uses the existing TensorType helper and
removes duplicated logic in memory.cpp within the tile.load handling (the block
that then uses source_is_dn to decide swapping tile_view.blayout/slayout).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/dev/passes/17-lower_transpose_load_param_layout.md`:
- Around line 15-18: Remove the misleading phrase "the next default pass" when
referring to MaterializeTensorStrides in this document and instead state that
MaterializeTensorStrides must run later in the pipeline (after passes such as
CanonicalizeIOOrder); update both the earlier mention (around the paragraph
describing the promoted TensorType and empty stride) and the later mention
(lines ~37–41) so they consistently say MaterializeTensorStrides runs
downstream/after CanonicalizeIOOrder rather than immediately next. Ensure
references to MaterializeTensorStrides and CanonicalizeIOOrder remain intact so
readers can locate the correct ordering.

In `@docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md`:
- Around line 11-22: The pipeline-order text is stale: update the doc to state
that LowerTransposeLoadParamLayout runs before ResolveBackendOpLayouts (as
registered in pass_manager.py), that MaterializeTensorStrides is not the "next
pass" but is inserted much later after CanonicalizeIOOrder, and correct the pass
index (it's the 18th pass in the full Default pipeline, not 17th); mention it
still runs after InferTileMemorySpace and before ResolveBackendOpLayouts and
that MaterializeTensorStrides must run later to materialize DN-packed canonical
stride.

In `@src/ir/op/tensor_ops/transform.cpp`:
- Around line 353-355: The current as_layout path rebuilds a fresh canonical
view via tensor_view_semantics::CanonicalizeView and drops any existing
valid_shape/pad metadata; update the logic creating new_view in transform.cpp so
it preserves src_type's existing view metadata (valid_shape and pad) when
present before constructing the new TensorType: retrieve the original view from
src_type (e.g., src_type->view or similar), copy/merge its valid_shape and pad
into the canonicalized new_view (or attach them to the optional view passed to
TensorType) so that tensor.as_layout remains lossless for sliced or fill-padded
tensors.

In `@src/ir/transforms/lower_transpose_load_param_layout_pass.cpp`:
- Around line 145-154: LowerTransposeLoadParamLayout currently requires
call->args_.size() == 4 and forcefully As<MakeTuple> on args_[3], which fails
for the supported 3-arg form tile.load(tensor, offsets, shapes). Change the
verification to accept either 3 or 4 args (check call->args_.size() == 3 ||
call->args_.size() == 4), set offsets = As<MakeTuple>(call->args_[1]) and shapes
= As<MakeTuple>(call->args_[2]) as before, and compute valid_shapes only if
args_.size() == 4 (e.g. valid_shapes = (call->args_.size() == 4 ?
As<MakeTuple>(call->args_[3]) : nullptr)). Update the INTERNAL_CHECK_SPAN logic
to require offsets && shapes, and if valid_shapes is non-null require it to be a
MakeTuple, so the 3-arg form is accepted and the 4-arg form still validated.

In `@src/ir/transforms/materialize_tensor_strides_pass.cpp`:
- Around line 168-179: The direct reconstruction of Call in
MaterializeTensorStrides drops op->attrs_ so metadata is lost; update the Call
rebuild to preserve attributes by passing op->attrs_ into the Call constructor
(i.e., when creating the new std::make_shared<Call>(...), include op->attrs_
alongside op->op_, std::move(new_args), op->kwargs_, std::move(new_return_type),
op->span_) so the new Call keeps the original attrs_ set.

In `@src/ir/transforms/simplify_pass.cpp`:
- Around line 158-163: Update the comment above VisitExpr_ to reflect current
behavior: remove the outdated "shape-bearing" form and the claim about folding
chains, and instead state that SimplifyAsLayout() only removes identity
reinterprets (i.e., as_layout(x, x.shape, x.layout) → x) but does not collapse
nested as_layout chains; reference the VisitExpr_ comment and SimplifyAsLayout()
as the locations to update so the pass contract is consistent with the
implementation.

In `@tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py`:
- Line 251: The assertion currently assumes a_arg.op is non-None; change it to
first assert isinstance(a_arg, ir.Call) and a_arg.op is not None, then assert
a_arg.op.name == "tensor.as_layout" (mirroring the defensive pattern used for
b_arg). Update the assertion around a_arg in
test_lower_transpose_load_param_layout_pass to check a_arg.op is not None before
accessing op.name so we avoid potential AttributeError.
- Around line 297-298: The two assertions that check call.args[0].op.name and
call.args[1].op.name assume .op is non-null; add null-safety checks similar to
earlier patterns (e.g., lines that check isinstance(call.args[i], ir.Call) and
call.args[i].op is not None) before accessing .op.name so replace/augment the
assertions with checks that call.args[0].op is not None and call.args[1].op is
not None and then assert call.args[0].op.name == "tensor.as_layout" and
call.args[1].op.name == "tensor.as_layout".

---

Outside diff comments:
In `@docs/en/dev/passes/00-pass_manager.md`:
- Line 383: Update the documentation entry for MaterializeTensorStrides to
reflect that it is now active in the default pipeline: remove "not yet wired
into the default pipeline" and state that MaterializeTensorStrides is active
(since RFC `#1300` P6) and placed between CanonicalizeIOOrder and InitMemRef in
the default pass sequence; ensure the description matches the status in
python/pypto/ir/pass_manager.py and the details in
25-materialize_tensor_strides.md.

In `@docs/zh-cn/dev/passes/00-pass_manager.md`:
- Line 383: The doc text for MaterializeTensorStrides is stale: in
docs/zh-cn/dev/passes/00-pass_manager.md update the line listing
`MaterializeTensorStrides` (step 13) by removing the parenthetical "已注册但尚未接入默认
pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)" so the entry no longer claims it
isn’t wired into the default pipeline; keep the entry as-is otherwise to match
the activated status used elsewhere (e.g.,
docs/en/dev/passes/17-lower_transpose_load_param_layout.md).

---

Nitpick comments:
In `@src/ir/op/tile_ops/memory.cpp`:
- Around line 158-173: Replace the inline DN-layout check used to set the local
variable source_is_dn with the TensorType::IsDNLayout() convenience method;
locate where source_is_dn is computed (currently using
tensor_type->tensor_view_.has_value() && tensor_type->tensor_view_->layout ==
TensorLayout::DN) and change it to call tensor_type->IsDNLayout() so the code
uses the existing TensorType helper and removes duplicated logic in memory.cpp
within the tile.load handling (the block that then uses source_is_dn to decide
swapping tile_view.blayout/slayout).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 980489e7-e270-42a1-9d75-337dbea41d7e

📥 Commits

Reviewing files that changed from the base of the PR and between a2af40c and 8607221.

📒 Files selected for processing (37)
  • .claude/rules/pass-doc-ordering.md
  • CMakeLists.txt
  • docs/en/dev/passes/00-pass_manager.md
  • docs/en/dev/passes/16-infer_tile_memory_space.md
  • docs/en/dev/passes/17-lower_transpose_load_param_layout.md
  • docs/en/dev/passes/17-resolve_transpose_layout.md
  • docs/en/dev/passes/18-resolve_backend_op_layouts.md
  • docs/en/dev/passes/25-materialize_tensor_strides.md
  • docs/en/user/01-language_guide.md
  • docs/zh-cn/dev/passes/00-pass_manager.md
  • docs/zh-cn/dev/passes/16-infer_tile_memory_space.md
  • docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md
  • docs/zh-cn/dev/passes/17-resolve_transpose_layout.md
  • docs/zh-cn/dev/passes/18-resolve_backend_op_layouts.md
  • docs/zh-cn/dev/passes/25-materialize_tensor_strides.md
  • docs/zh-cn/user/01-language_guide.md
  • include/pypto/ir/transforms/pass_properties.h
  • include/pypto/ir/transforms/passes.h
  • include/pypto/ir/transforms/utils/tensor_view_semantics.h
  • python/bindings/modules/passes.cpp
  • python/pypto/ir/op/tensor_ops.py
  • python/pypto/ir/pass_manager.py
  • python/pypto/pypto_core/passes.pyi
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/op/tensor_ops/transform.cpp
  • src/ir/op/tile_ops/memory.cpp
  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
  • src/ir/transforms/materialize_tensor_strides_pass.cpp
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • src/ir/transforms/simplify_pass.cpp
  • tests/ut/codegen/test_pto_codegen.py
  • tests/ut/codegen/test_pto_codegen_cross_core.py
  • tests/ut/ir/operators/test_tensor_as_layout.py
  • tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
  • tests/ut/ir/transforms/test_pass_manager.py
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
  • tests/ut/ir/transforms/test_simplify_pass.py
💤 Files with no reviewable changes (4)
  • docs/zh-cn/dev/passes/17-resolve_transpose_layout.md
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • docs/en/dev/passes/17-resolve_transpose_layout.md
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py

Comment thread docs/en/dev/passes/17-lower_transpose_load_param_layout.md Outdated
Comment thread docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md Outdated
Comment thread src/ir/op/tensor_ops/transform.cpp
Comment thread src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
Comment thread src/ir/transforms/materialize_tensor_strides_pass.cpp Outdated
Comment thread src/ir/transforms/simplify_pass.cpp
Comment thread tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py Outdated
Comment thread tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py Outdated
lyfne123 added a commit to lyfne123/pypto that referenced this pull request May 11, 2026
CI fix (root cause of pypto-lib-model + system-tests failures):

- P6's call-site injector previously inlined `tensor.as_layout(arg, DN)` into
  the kernel-call args directly, which the orchestration codegen rejects with
  `Call to '<callee>' arg N is neither a variable nor a recognized constant
  literal`. Refactor `CallSiteAsLayoutInjector` to operate at the statement
  level: for every AssignStmt / EvalStmt / ReturnStmt whose RHS targets a
  promoted callee, emit one `bridged_<param> = tensor.as_layout(arg, DN)`
  AssignStmt immediately before the call statement and replace the inline
  Call arg with the bound Var. Net IR is SSA-form and matches what
  orchestration codegen consumes per arg slot (Var | const-literal).

Review comments addressed:

- gemini #1: codegen `tensor.as_layout` now special-cases the identity flip
  (target layout == source layout) and emits a plain `Tensor result = input;`
  alias instead of a spurious `.transpose()`. Simplify still folds these
  before codegen in the default pipeline, but the codegen is now robust
  against ad-hoc compile paths that skip Simplify.
- coderabbitai hw-native-sys#2 / hw-native-sys#3: drop the "next default pass" wording in en/zh-cn doc
  17 — `MaterializeTensorStrides` runs later in the pipeline (after
  `CanonicalizeIOOrder`), not immediately after. The zh-cn doc's "17th pass"
  text is also clarified — the 17 is the docs/passes/ slot, not a literal
  pipeline call-count.
- coderabbitai hw-native-sys#4: `DeduceTensorAsLayoutType` now preserves the source
  TensorView's `valid_shape` (with trailing-pair swap on cross-layout flips)
  and `pad` through `tensor.as_layout`. Previously these fields were dropped,
  making the reinterpret silently lossy for sliced or fill-padded inputs.
- coderabbitai hw-native-sys#6: `MaterializeTensorStrides` direct-ctor rebuild path now
  forwards `op->attrs_`. The previous version preserved type and kwargs but
  dropped attrs, which would have silently discarded call metadata
  (arg_directions, manual_dep_edges) attached by earlier passes.
- coderabbitai hw-native-sys#7: update the stale comment block above `VisitExpr_` in
  `simplify_pass.cpp` — it still described the dropped shape-bearing
  `as_layout(x, shape, layout)` form and the never-implemented chain
  folding. New comment accurately describes the single identity-elimination
  rule and explains why chain folding is deferred.
- coderabbitai hw-native-sys#8 / hw-native-sys#9: the unit-test pattern that previously inspected an
  inline `tensor.as_layout` Call as a kernel-call arg no longer applies
  after the SSA refactor above. Tests now look up the bridge via
  `_find_assign_rhs(orch, var)` and guard `op is not None` before reading
  `op.name` (matching the defensive pattern already used in the B^T test).

Skipped (with reason):

- coderabbitai hw-native-sys#5: "Handle 3-arg `tile.load`". `tile.load` registers four
  mandatory args (tensor, offsets, shapes, valid_shapes) and the Python
  builder always materializes `valid_shapes` (defaults to `shapes` when the
  caller omits it). Once IR is constructed, every `tile.load` is 4-arg —
  the 3-arg form only exists at the DSL surface. The internal check stays
  as-is.
@lyfne123 lyfne123 force-pushed the issue-1300-tensor-as-layout branch from ab19bfa to 5fb0a67 Compare May 11, 2026 03:38
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/codegen/pto/pto_codegen.cpp`:
- Around lines 593-689: the branch that sets `is_column_vector` currently allows
rank==3 but later fills only two `stride_names`, leaving `stride_names[2]` empty
and producing malformed MLIR. Fix by restricting the column-vector detection to
rank == 2 (i.e., require rank == 2 in the `is_column_vector` check) so
`layout = ir::TensorLayout::DN` is only forced for 2-D [M,1] tensors and the
existing stride fallback (`stride_names[0]`, `stride_names[1]`) remains correct.
Alternatively, if rank-3 column vectors must be supported, extend the
`is_column_vector` handling to fully populate `stride_names` for all dims using
`shape_dim_names`/`emit_stride_mul` before the stride emission; the simpler fix
is to limit `is_column_vector` to rank==2.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 75a30cdc-18ba-4b78-bc6a-3a49ca63363f

📥 Commits

Reviewing files that changed from the base of the PR and between 8607221 and 1fda505.

📒 Files selected for processing (38)
  • .claude/rules/pass-doc-ordering.md
  • CMakeLists.txt
  • docs/en/dev/passes/00-pass_manager.md
  • docs/en/dev/passes/17-infer_tile_memory_space.md
  • docs/en/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/en/dev/passes/19-resolve_backend_op_layouts.md
  • docs/en/dev/passes/26-materialize_tensor_strides.md
  • docs/en/user/01-language_guide.md
  • docs/zh-cn/dev/passes/00-pass_manager.md
  • docs/zh-cn/dev/passes/17-infer_tile_memory_space.md
  • docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/zh-cn/dev/passes/19-resolve_backend_op_layouts.md
  • docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
  • docs/zh-cn/user/01-language_guide.md
  • include/pypto/ir/transforms/pass_properties.h
  • include/pypto/ir/transforms/passes.h
  • include/pypto/ir/transforms/utils/tensor_view_semantics.h
  • python/bindings/modules/passes.cpp
  • python/pypto/ir/op/tensor_ops.py
  • python/pypto/ir/pass_manager.py
  • python/pypto/pypto_core/passes.pyi
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/codegen/pto/pto_codegen.cpp
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/op/tensor_ops/transform.cpp
  • src/ir/op/tile_ops/memory.cpp
  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
  • src/ir/transforms/materialize_tensor_strides_pass.cpp
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • src/ir/transforms/simplify_pass.cpp
  • tests/ut/codegen/test_pto_codegen.py
  • tests/ut/codegen/test_pto_codegen_cross_core.py
  • tests/ut/ir/operators/test_tensor_as_layout.py
  • tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
  • tests/ut/ir/transforms/test_pass_manager.py
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
  • tests/ut/ir/transforms/test_simplify_pass.py
💤 Files with no reviewable changes (2)
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
✅ Files skipped from review due to trivial changes (13)
  • docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
  • docs/en/dev/passes/26-materialize_tensor_strides.md
  • docs/en/dev/passes/17-infer_tile_memory_space.md
  • docs/zh-cn/dev/passes/19-resolve_backend_op_layouts.md
  • docs/en/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/zh-cn/user/01-language_guide.md
  • docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/zh-cn/dev/passes/17-infer_tile_memory_space.md
  • include/pypto/ir/transforms/utils/tensor_view_semantics.h
  • docs/en/dev/passes/00-pass_manager.md
  • docs/en/user/01-language_guide.md
  • docs/en/dev/passes/19-resolve_backend_op_layouts.md
  • .claude/rules/pass-doc-ordering.md
🚧 Files skipped from review as they are similar to previous changes (17)
  • CMakeLists.txt
  • include/pypto/ir/transforms/pass_properties.h
  • tests/ut/codegen/test_pto_codegen_cross_core.py
  • docs/zh-cn/dev/passes/00-pass_manager.md
  • src/ir/transforms/simplify_pass.cpp
  • python/pypto/ir/op/tensor_ops.py
  • include/pypto/ir/transforms/passes.h
  • src/ir/op/tile_ops/memory.cpp
  • src/ir/transforms/materialize_tensor_strides_pass.cpp
  • src/ir/op/tensor_ops/transform.cpp
  • python/pypto/pypto_core/passes.pyi
  • tests/ut/codegen/test_pto_codegen.py
  • tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
  • python/bindings/modules/passes.cpp
  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
  • tests/ut/ir/transforms/test_pass_manager.py
  • python/pypto/ir/pass_manager.py

Comment thread src/codegen/pto/pto_codegen.cpp Outdated
lyfne123 added a commit to lyfne123/pypto that referenced this pull request May 11, 2026
Addresses coderabbitai's review on ``EmitMakeTensorViews``: the
``is_column_vector`` check previously matched ``rank == 2 || rank == 3``,
but the column-vector stride fallback only populates ``stride_names[0]``
and ``stride_names[1]``. For a hypothetical rank-3 ``[_, M, 1]`` input
the third stride slot would be empty and the codegen would emit malformed
MLIR.

The legacy column-vector behavior was always rank-2 in practice (PTOAS
infers DN for ``[M, 1]`` specifically) and all in-tree test coverage is
rank-2. Restrict the condition to ``rank == 2`` and add a comment
explaining the constraint, mirroring coderabbitai's suggested minimal
fix.
lyfne123 added a commit to lyfne123/pypto that referenced this pull request May 11, 2026
…a2a3 paged_attention)

Adds ``set_output_memory_inherit_input()`` to the ``tensor.as_layout`` op
registration. Without it, ``InitMemRef`` saw the op as a regular tensor
producer and minted a fresh MemRef (with its own ``mem_ddr_*`` allocation)
for the result. The orchestration codegen, meanwhile, lowers
``tensor.as_layout`` to a plain alias ``Tensor result = input;`` — so the
runtime tensor's data pointer points to the *input's* buffer, while the IR
declares a *separate* buffer.

Concrete failure mode (non-square paged-attention on a2a3): for kj of shape
``[64, 128]`` ND promoted to ``[128, 64] DN``, the orch IR looked like

```
kj_dn_view: pl.Tensor[[128, 64], ..., pl.MemRef("mem_ddr_3", 0, 16384)] =
    pl.tensor.as_layout(kj, layout=DN)
self.qk_kernel(qi, kj_dn_view, out)
```

The kernel binary's ``make_tensor_view`` produced the right (shape, stride,
layout) triple, but the orch passed a stale buffer pointer (the
freshly-allocated mem_ddr_3 was never written to; only the alias to kj's
mem_ddr_1 held real data). Square cases happened to pass because the
trailing-pair shape swap is identity, and PTOAS's runtime address arithmetic
collapsed to the same byte range either way.

After this fix, the orch IR shows ``kj_dn_view`` sharing the input's MemRef
``mem_ddr_1`` — no spurious allocation, no aliasing mismatch.

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped
- Local IR inspection confirms ``kj_dn_view`` inherits ``mem_ddr_1`` (was
  ``mem_ddr_3`` before this fix).
lyfne123 added 11 commits May 12, 2026 08:11
Introduce tensor.as_layout - a pure metadata reinterpret op that points
at the same physical memory as its source but exposes a different
(shape, stride, layout) triple to consumers (RFC hw-native-sys#1300 §3.3).

The op emits no PTOAS instruction at codegen; downstream
make_tensor_view consumes the new view directly. It is internal-only:
not exposed via pypto.language. Future passes (notably
LowerTransposeLoadParamLayout in P6) inject tensor.as_layout at orch ↔
InCore call sites to bridge equivalent (shape, stride, layout) views.

DeduceTensorAsLayoutType enforces three validity invariants:

- Total element count of src and target must match (when both are
  statically known).
- layout must not be NZ (NZ is tile-only / fractal).
- The reinterpret must reduce to a RFC §4.2 canonical pair - currently
  row-major [..., a, b] ND ≡ [..., b, a] DN-packed. Other reinterprets
  are rejected; the helper AsLayoutOffsetMapEquivalent in
  tensor_view_semantics.h can be extended in follow-ups when concrete
  use cases appear.

Tests: 9 unit cases covering ND↔DN 2D/3D static round trips, idempotent
self-reinterpret, element-count mismatch, NZ rejection, invalid
offset-map rejection, symbolic-shape ExprPtr-identity matching, and the
op-registry sanity check.
…#1300 P4-b)

Extend the Simplify pass to drop ``tensor.as_layout`` calls that are
no-op metadata reinterprets — i.e. the requested ``(shape, layout)``
already matches what the source carries. The Call is replaced by its
``src`` arg directly, so downstream consumers stop walking through a
useless reinterpret.

Why this matters: future passes (e.g. ``LowerTransposeLoadParamLayout``
in P6) may insert ``tensor.as_layout`` at every call-site bridge.
Without folding, identity reinterprets ride through the pipeline and
clutter codegen. With this rule, the no-op cases drop out at the
Simplify boundary.

Chain folding (``as_layout(as_layout(x, ...), ...)`` → ``as_layout(x, ...)``)
is intentionally left out: after SSA the outer Call's source is a Var
bound to the inner Call (not the inner Call inline), so naive pointer
inspection cannot see across the binding. A dedicated SSA-aware chain
optimizer can be added if real pipelines produce such chains.

Tests: 3 cases — identity elimination on bare ND, preservation of a
substantive shape change, preservation of a same-shape layout change.
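The identity-elimination rule can be sketched in a few lines of Python (tuple encoding is illustrative only; the real pass pattern-matches IR Call nodes inside `VisitExpr_`):

```python
# Toy fold: as_layout(x, L) reduces to x when L already matches x's layout;
# substantive flips are preserved untouched.
def fold_as_layout(expr):
    op, src, target_layout = expr      # expr == ("as_layout", src, layout)
    assert op == "as_layout"
    src_layout = src[2]                # src == ("tensor", name, layout)
    return src if target_layout == src_layout else expr
```

A chain `as_layout(as_layout(x, DN), ND)` does not fold here for the reason stated above: after SSA the outer source is a Var bound to the inner Call, not the inner Call itself.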
…ys#1300 P4 review)

Tighten the tensor.as_layout signature in response to a design review:
the op now ONLY flips the layout tag - shape changes that come with a
ND ↔ DN flip are mechanical (RFC §4.2 canonical pair: trailing-two-dim
swap) and derived from the source. Callers no longer pass a target
shape.

Why the change: shape changes are tensor.reshape's job. Letting
as_layout also accept a target shape blurred the responsibility,
created room for caller error (mismatched shape vs canonical pair),
and required the AsLayoutOffsetMapEquivalent helper to validate
arbitrary (src, target) combinations. With the layout-tag-only design,
the canonical pair is intrinsic to the op and the helper disappears.

API:
  Before:  as_layout(src, shape, layout=...)
  After:   as_layout(src, layout=...)

Behavior:
  - Same layout (identity)             -> shape unchanged
  - Cross layout (ND <-> DN, rank>=2)  -> trailing 2 dims swapped
  - NZ target                          -> rejected
  - Strided sub-view source            -> rejected (use slice/reshape first)

Cleanup:
  - Drop AsLayoutOffsetMapEquivalent helper.
  - Drop detail::RowMajorEquivalentShape and detail::ShapeListsEquivalent
    (no remaining consumers; the namespace detail rump now contains only
    the truly-internal CheckCanonicalView helpers, restoring the
    convention).
  - Simplify identity rule simplifies to a layout-only comparison.
  - Codegen handler unchanged - still Tensor::transpose(N-2, N-1) for
    cross-layout flips; identity cases never reach codegen because
    Simplify folds them.

Tests: test_tensor_as_layout.py rewritten (9 cases) - adds an explicit
strided-source rejection, drops obsolete element-count and offset-map
cases. test_simplify_pass.py::TestAsLayoutFolding adjusted for the new
signature (2 cases).
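The behavior table above reduces to a small shape rule; a Python sketch (a stand-in for the C++ `DeduceTensorAsLayoutType`, not the implementation itself):

```python
def deduce_as_layout_shape(shape, src_layout, target_layout):
    """Derive the result shape of as_layout(src, layout=target_layout)."""
    if target_layout == "NZ":
        raise ValueError("tensor.as_layout: NZ target rejected (tile-only / fractal)")
    if target_layout == src_layout:
        return list(shape)                  # identity: shape unchanged
    if len(shape) < 2:
        raise ValueError("cross-layout flip requires rank >= 2")
    out = list(shape)
    out[-2], out[-1] = out[-1], out[-2]     # RFC §4.2 canonical-pair swap
    return out
```

The strided-sub-view rejection is a separate check on the source TensorView's strides and is not modeled here.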
…trides (RFC hw-native-sys#1300 P6)

Replaces ResolveTransposeLayout with LowerTransposeLoadParamLayout — the first
pipeline pass that produces RFC hw-native-sys#1300 canonical-form IR — and wires
MaterializeTensorStrides into the default pipeline so the codegen-entry IR
satisfies the (shape, stride, layout) self-consistency contract.

Pass behaviour (per RFC hw-native-sys#1300 §3.3 + §4.2):

- For every InCore parameter loaded via tile.load(transpose=True), promote the
  TensorType from `[..., a, b] ND` to `[..., b, a] DN` (trailing-pair shape swap
  + DN layout tag with empty stride). MaterializeTensorStrides fills the
  packed canonical strides later in the pipeline.
- Every tile.load on a promoted parameter has its offsets / shapes /
  valid_shapes tuples swapped at the trailing pair and its `transpose` kwarg
  set to False (the slot is kept so print -> reparse round-trips faithfully,
  since the tile.load op registers `transpose` as a default-false attribute).
- DeduceTileLoadType now derives the Mat tile-view layout from the source's
  DN tag (XOR with the transpose kwarg) so the legacy `transpose=True` swap
  and the new `DN source + transpose=False` form produce the same TileType.
- Every non-InCore call site to a promoted callee wraps its promoted-slot arg
  with `tensor.as_layout(arg, DN)` (P4) to bridge orch-side ND tensors to the
  InCore DN parameter type.
- Mixed-use parameters (loaded with both transpose=True and transpose=False)
  are rejected with pypto::ValueError.
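
The per-load coordinate rewrite and the XOR layout rule can be sketched as toy Python helpers (illustrative names; the real pass rewrites IR Call nodes and the real deduction lives in `DeduceTileLoadType`):

```python
def lower_promoted_tile_load(offsets, shapes, valid_shapes, transpose):
    """Rewrite a tile.load on a DN-promoted param into canonical coords:
    swap the trailing pair of each coordinate tuple, clear transpose."""
    def swap2(t):
        t = list(t)
        t[-2], t[-1] = t[-1], t[-2]
        return t
    assert transpose, "only transpose=True loads are promoted"
    return swap2(offsets), swap2(shapes), swap2(valid_shapes), False

def deduce_tile_layout(source_is_dn, transpose):
    """Mat tile-view layout = source DN tag XOR transpose kwarg, so the
    legacy (ND + transpose=True) and canonical (DN + transpose=False)
    forms deduce the same TileType."""
    return "DN" if (source_is_dn ^ transpose) else "ND"
```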

Pipeline wiring:

- pass_manager.py default tile-PTO pipeline now inserts MaterializeTensorStrides
  between CanonicalizeIOOrder and InitMemRef. With P6 producing canonical-form
  DN parameters, the materialized strides match the IR shape directly — codegen
  takes the `has_explicit_stride` branch and bypasses the legacy `dn_swap` /
  `get_shape_source_idx` path.
- MaterializeTensorStrides now rebuilds Calls via direct ctor (preserving the
  intentional type set by FlattenTileNdTo2D's manual rank-2 override) instead
  of routing through OpRegistry::Create, which would have re-deduced a rank-3
  TileType from the rank-3 source-window args and silently undone the
  flattening.

Test changes:

- test_lower_transpose_load_param_layout_pass.py (new): 8 cases — B^T, A^T,
  AB^T, no-op, idempotent, mixed-use rejection, partial-load. Built with
  programmatic assertions (not `@pl.program` for After) since `tensor.as_layout`
  is internal-only and not exposed in `pl.*`.
- test_pass_manager.py: the default and DebugTileOptimization expected pass
  lists now include LowerTransposeLoadParamLayout and MaterializeTensorStrides.
- test_pto_codegen_3d_dn_tensor_view_uses_canonical_stride (renamed): now
  asserts the RFC hw-native-sys#1300 canonical (shape, stride, layout) triple — shape
  preserved as written (`[2, 48, 64]`), strides `[3072, 1, 48]` (DN-packed:
  stride[n-2]=1, stride[n-1]=shape[n-2]=48, stride[n-3]=shape[n-2]*shape[n-1]=3072),
  layout=dn. Old expectation was the legacy `dn_swap` form
  (`[2, 64, 48]` shape, `[3072, 1, 64]` stride) which the canonical pipeline
  intentionally replaces.

Cross-layer / docs:

- Renamed src/ir/transforms/resolve_transpose_layout_pass.cpp →
  lower_transpose_load_param_layout_pass.cpp (git mv preserves history).
- Renamed docs/{en,zh-cn}/dev/passes/17-resolve_transpose_layout.md →
  17-lower_transpose_load_param_layout.md and rewrote.
- pass_properties.h: kLowerTransposeLoadParamLayoutProperties.
- passes.h / passes.cpp binding / passes.pyi / pass-doc-ordering.md updated.
- All doc cross-references and the docs/{en,zh-cn}/user/01-language_guide.md
  user-facing pipeline list now reference the new pass name.
…es in wrapper-reorder

Two related fixes for paged-attention ST failures on a2a3 introduced by the
P6 orch-side ``tensor.as_layout`` injection.

**1. ``tensor.as_layout`` codegen now emits a plain alias.**

Previously the orchestration handler lowered ``tensor.as_layout(input, DN)``
to ``input.transpose(N-2, N-1)``. The runtime ``Tensor::transpose`` swaps
``shape`` / ``raw_shape`` / **offsets**, which is correct for
``tensor.transpose`` (a physical-index permutation) but wrong for
``tensor.as_layout`` — the IR-level semantics is "reinterpret the same
physical bytes under a different layout tag", so the runtime tensor's
``offsets`` must stay in source coordinates.

Concrete failure mode (paged_attention on a2a3): the orchestration passes
``kj`` with ``offsets = [block_offset, 0]``; the spurious ``.transpose(0, 1)``
swapped them to ``[0, block_offset]``, shifting the base address by a factor
of ``raw_shape[1]`` and silently corrupting every kernel that received the
bridged view. The downstream kernel already encodes the canonical
``(shape, stride, layout)`` triple via its IR-declared param type, so
re-emitting it at runtime was redundant — the alias preserves the orch-side
``offsets`` exactly while the kernel-side ``make_tensor_view`` applies the
canonical interpretation.

This addresses the 7 paged-attention numerical-mismatch failures on a2a3
(``test_paged_attention_ptoas``, ``test_paged_attention_unaligned_ptoas``,
the dynamic-paged variants, and ``test_dyn_orch_paged_attention``). The
a5sim variant already passed because a5's PTOAS layout is more forgiving.

**2. Wrapper-reorder chases ``tensor.as_layout`` aliases.**

For SPMD / Group / orchestration wrapper functions, codegen splices an outer
caller's args through the wrapper's parameter list to the inner-call's
parameter list. With P6 injecting ``bridged_kj = tensor.as_layout(kj, DN)``
before the inner call inside the wrapper body, the inner-call arg becomes
``bridged_kj`` — a wrapper-local Var, not a wrapper parameter — and
``BuildWrapperReorderedParams`` failed with
``"inner call arg N does not map to any wrapper parameter"``.

Fix: collect a per-wrapper alias map by walking the wrapper body for
``AssignStmt(v, tensor.as_layout(src, ...))`` pairs, then chase each inner-
call arg through the chain back to the wrapper parameter. The lowering on
the other side (#1) is the plain alias, so this "see-through" mapping is
semantically equivalent — the wrapper splice still routes the outer arg to
the same memory.

This addresses the 3 SPMD paged_attention compile failures
(``test_paged_attention_spmd_ptoas`` variants).
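The alias-chasing step can be sketched as (toy representation with strings for Vars; the real map is built by walking the wrapper body's AssignStmts):

```python
def chase_to_param(arg, alias_map, params):
    """Follow bridged_x = tensor.as_layout(src, ...) links until a wrapper
    parameter is reached; raise if the chain dead-ends."""
    seen = set()
    while arg not in params:
        if arg in seen or arg not in alias_map:
            raise ValueError(f"inner call arg {arg!r} does not map to any wrapper parameter")
        seen.add(arg)
        arg = alias_map[arg]
    return arg
```

With `alias_map = {"bridged_kj": "kj"}` and wrapper params `{"qi", "kj"}`, the inner-call arg `bridged_kj` resolves to the parameter `kj`, restoring the splice.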
…w-native-sys#1300 P7)

Cleans up the PTO codegen to consume the IR's canonical
``(shape, stride, layout)`` triple verbatim. No more dual code paths
between the legacy "DN + empty stride → implicit shape/offset swap" and the
canonical "DN + explicit stride → emit-as-is" forms — the latter is now the
only path. After P6 + ``MaterializeTensorStrides``, all DN-tagged tensors
arrive at codegen with canonical strides materialized, so the swap path is
dead code.

### What changes

- **``src/codegen/pto/pto_codegen.cpp::EmitMakeTensorViews``**
  - Removes ``get_shape_source_idx`` (the implicit trailing-pair swap helper)
    and the dual stride-derivation paths.
  - Single derivation: explicit stride when present, otherwise canonical
    DN strides (RFC §2.3: ``stride[-2]=1``, ``stride[-1]=shape[-2]``, outer
    strides walk row-major over the DN-block volume) or canonical ND
    strides (``stride[-1]=1``, ``stride[k]=stride[k+1]*shape[k+1]``).
  - Keeps the ``[M, 1]`` column-vector legacy carve-out (PTOAS infers DN
    for that shape regardless of IR layout tag).
  - Precomputes shape dim SSA names up-front so dynamic-shape casts
    (``EmitCastToIndex``) emit their setup ops *before* the
    ``pto.make_tensor_view`` line instead of interleaving inside it.

- **``src/backend/common/pto_ops_common.cpp``**
  - ``MakeTileLoadCodegenPTO`` and ``MakeTileStoreCodegenPTO`` drop their
    ``dn_swap`` branches that swapped the trailing pair of
    ``offsets`` / ``valid_shapes`` / ``shapes`` tuples. The IR-level
    lowering (P6 ``LowerTransposeLoadParamLayout``) now produces all
    coordinates in canonical form, so the codegen transcribes them
    verbatim.
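
The single stride derivation described above, as a standalone Python sketch (hypothetical helper name; static shapes only — the real codegen also handles dynamic dims via SSA names):

```python
def canonical_strides(shape, layout):
    """Canonical packed strides.
    ND: stride[-1]=1, stride[k]=stride[k+1]*shape[k+1].
    DN: stride[-2]=1, stride[-1]=shape[-2], outer strides walk row-major
    over the DN-block volume (RFC §2.3)."""
    n = len(shape)
    s = [0] * n
    if layout == "ND" or n < 2:
        acc = 1
        for k in range(n - 1, -1, -1):
            s[k] = acc
            acc *= shape[k]
        return s
    s[-2], s[-1] = 1, shape[-2]
    acc = shape[-2] * shape[-1]        # DN-block volume
    for k in range(n - 3, -1, -1):
        s[k] = acc
        acc *= shape[k]
    return s
```

This reproduces the triple asserted by the renamed 3-D test: shape `[2, 48, 64]` DN gives strides `[3072, 1, 48]`.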

### Why now

This unifies the codegen entry contract: every layout-aware transform is
finalized at the IR level (P3 canonical TensorView, P4 ``tensor.as_layout``,
P6 ``LowerTransposeLoadParamLayout``), and the codegen has no remaining
layout logic of its own. Concrete benefits:

- Removes the asymmetry that caused the non-square paged-attention
  regressions on a2a3 — both legacy ``dn_swap`` and canonical paths emit
  the same MLIR for square cases, but diverged for non-square shapes once
  ``MaterializeTensorStrides`` was activated in the default pipeline.
- ``.pto`` output is now deterministically reproducible from the IR's
  canonical triple — no codegen-layer interpretation steps in between.

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped (no regressions)

The 2D codegen optimization (``stride[-2] = shape[-1]`` directly, no
spurious ``arith.muli %c1, shape[-1]``) preserves the byte-for-byte
``.pto`` output expected by existing 2D MLIR golden-string tests.
…s) — RFC hw-native-sys#1300

User's analysis pinpointed the actual root cause of the non-square
paged-attention regressions on a2a3:

The PTOAS-generated kernel wrapper reads dynamic dim values directly from
the runtime ``Tensor`` struct's ``shapes[i]``, indexed under the
**IR-declared post-P6 shape order**. For ``key_cache`` promoted from
``[256, 128] ND`` to ``[128, 256] DN``, the kernel expects:

```
KV_HEAD_DIM_DYN    = key_cache_t->shapes[0]  // → 128
KEY_CACHE_ROWS_DYN = key_cache_t->shapes[1]  // → 256
```

But my prior plain-alias codegen left ``shapes`` in the pre-swap (ND)
order — the kernel read ``shapes[0]=256`` and computed DN strides off the
wrong axis. Square cases happened to survive because the swap is identity.

**Fix:** the orch codegen for ``tensor.as_layout`` now swaps the
trailing-pair ``shapes`` so the runtime tensor matches the IR-declared
post-swap order. ``raw_shapes`` and ``offsets`` stay in the source
(pre-swap) coord system because PTOAS uses ``raw_shape``-derived strides plus
``offsets`` to compute ``start_offset`` (byte offset of the view into the
physical buffer) — and that base address must continue pointing at the
original ND-coord slice (e.g. paged-attention's
``offsets = [block_offset, 0]`` on the row-major ``key_cache``). If
``is_raw_eq_shapes`` is true, ``raw_shapes`` is materialized from the
current ``shapes`` *before* the swap so the subsequent ``shapes`` mutation
does not pollute the raw-shape-derived stride arithmetic.
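The shape-only mutation can be sketched as follows (dict fields mirror the names used above; this is an illustration, not the actual codegen):

```python
# Hedged sketch of the described lowering: swap the trailing shape pair,
# leave raw_shapes / offsets in the source (pre-swap) coord system.
def lower_as_layout(t: dict) -> dict:
    if t["is_raw_eq_shapes"]:
        # Snapshot raw_shapes from shapes *before* mutating shapes, so the
        # raw-shape-derived stride arithmetic keeps the pre-swap order.
        t["raw_shapes"] = list(t["shapes"])
        t["is_raw_eq_shapes"] = False  # assumption: flag cleared on snapshot
    t["shapes"][-2], t["shapes"][-1] = t["shapes"][-1], t["shapes"][-2]
    # offsets untouched: PTOAS derives start_offset from raw_shapes + offsets,
    # and the base address must keep pointing at the original ND-coord slice.
    return t

key_cache = {"shapes": [256, 128], "raw_shapes": None,
             "is_raw_eq_shapes": True, "offsets": [3, 0]}
lower_as_layout(key_cache)
assert key_cache["shapes"] == [128, 256]      # kernel reads post-swap order
assert key_cache["raw_shapes"] == [256, 128]  # strides stay pre-swap
assert key_cache["offsets"] == [3, 0]         # base address unchanged
```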

The identity flip (target layout == source layout) still lowers to a
plain ``Tensor result = input;`` alias — no swap needed.

### Why not use ``Tensor::transpose(N-2, N-1)``

That runtime helper additionally swaps ``raw_shapes`` and ``offsets``,
which is correct for ``tensor.transpose`` (a physical-index permutation)
but wrong for ``tensor.as_layout`` (a metadata reinterpret of the same
bytes). In paged-attention, swapping ``offsets = [block_offset, 0]`` to
``[0, block_offset]`` shifted the base address by a factor of
``raw_shape[1]`` and silently corrupted reads.
Our new lowering targets the precise mutation needed: shape-only swap.
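A toy ``start_offset`` computation (assuming row-major strides derived from ``raw_shapes``, as described above) shows why swapping the offsets corrupts the base address:

```python
# Illustration only: start_offset = sum(offsets[i] * stride[i]), with strides
# derived row-major from raw_shapes (the stated PTOAS convention).
def start_offset(raw_shapes, offsets):
    stride, acc = [0] * len(raw_shapes), 1
    for i in range(len(raw_shapes) - 1, -1, -1):
        stride[i] = acc
        acc *= raw_shapes[i]
    return sum(o * s for o, s in zip(offsets, stride))

raw = [256, 128]                 # key_cache, row-major
block_offset = 7
assert start_offset(raw, [block_offset, 0]) == 7 * 128  # correct slice base
assert start_offset(raw, [0, block_offset]) == 7        # swapped: wrong base
```

The two results differ by exactly a factor of ``raw_shape[1]``, matching the observed corruption.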

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped
@lyfne123 lyfne123 force-pushed the issue-1300-tensor-as-layout branch from b6d7493 to 5840271 on May 12, 2026 00:11
lyfne123 added 2 commits May 12, 2026 08:48
- MaterializeTensorStrides: remap VarPtrs in manual_dep_edges / user_manual_dep_edges attrs when rebuilding Calls, so attr entries follow the fresh Vars minted for materialized TensorViews. Without this, SSAVerify catches "used outside its defining scope" and orchestration codegen later raises "manual_dep_edge has no producer task" on manual-scope pipelines (exposed by test_manual_scope_{seq_outer_parallel,parallel_outer_seq}_inner_two_stage_pipeline).
- pto_codegen.cpp: extend the ``[..., M, 1]`` column-vector carve-out from rank == 2 to rank >= 2 with a stride derivation that fills all rank slots (legacy PTOAS convention: ``stride[rank-2]=1``, ``stride[rank-1]=shape[rank-1]``, ``stride[rank-3]=shape[rank-2]``, outer dims walk over M). Restores main's behaviour for rank-3 ``[B, N, 1]`` (and matches the ColMajor BLayout that ``DeduceTileLoadType`` already produces for trailing-dim-1 ``tile.load``s), fixing the ``TLoad isSameLayout`` PTOAS compile failure surfaced by test_tensor_expand_clone[a2a3-2] (broadcast_dim=2, input ``[B, N, 1]``).
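The stride convention quoted in the second bullet can be sketched as a hypothetical helper (following the stated legacy PTOAS rules; not the real ``pto_codegen.cpp``):

```python
def column_vector_strides(shape):
    # Legacy PTOAS convention for trailing-dim-1 tensors [..., M, 1] as
    # described above: stride[rank-2]=1, stride[rank-1]=shape[rank-1],
    # outer dims walk over M (then over each enclosing batch dim).
    rank = len(shape)
    assert rank >= 2 and shape[-1] == 1
    stride = [0] * rank
    stride[rank - 1] = shape[rank - 1]   # == 1 for a column vector
    stride[rank - 2] = 1
    acc = shape[rank - 2]                # stride[rank-3] = shape[rank-2]
    for i in range(rank - 3, -1, -1):
        stride[i] = acc
        acc *= shape[i]
    return stride

assert column_vector_strides([4, 1]) == [1, 1]        # original rank-2 case
assert column_vector_strides([8, 5, 1]) == [5, 1, 1]  # rank-3 [B, N, 1]
```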
@Hzfengsy Hzfengsy merged commit 8b89309 into hw-native-sys:main May 12, 2026
9 checks passed
Hzfengsy pushed a commit that referenced this pull request May 12, 2026
…body (RFC #1300) (#1339)

## Summary

Moves the P6 ``tensor.as_layout`` bridge from the orch call site to the top of the InCore body, end-to-end equivalent but **-132 LOC** net and removes a cluster of incidental complexity in orchestration codegen. See #1300 [discussion comment](#1300 (comment)) for the design rationale and consensus question.

## What changes

For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``:

**Before (current main, post-#1324):**
- InCore param signature is promoted to ``[..., b, a] DN``.
- Orch call site is wrapped: ``bridged = tensor.as_layout(arg, DN); incore(bridged, ...)``.
- Orchestration codegen has to chase aliases through the bridge via ``BuildWrapperAliasMap`` + ``ResolveAliasChain`` to recover the original wrapper param.

**After (this PR):**
- InCore param signature is **untouched** (stays ``[..., a, b] ND``, matching the runtime torch tensor).
- InCore body is **prepended** with ``p_dn = tensor.as_layout(p, layout=DN)``; body uses of ``p`` are substituted with ``p_dn``.
- The matching ``tile.load`` is rewritten to read from ``p_dn`` with the trailing pair of offsets/shapes/valid_shapes swapped and ``transpose=False``.
- Orch is left completely alone — the orchestrator's call args are wrapper params directly.
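Sketched side by side (hypothetical listing in the style of the earlier orch IR snippet; names invented):

```
# Before (post-#1324): bridge at the orch call site
bridged = pl.tensor.as_layout(arg, layout=DN)   # arg reinterpreted to DN
self.qk_kernel(qi, bridged, out)                # InCore param declared DN

# After (this PR): bridge prepended inside the InCore body
def qk_kernel(qi, p, out):                      # p stays [..., a, b] ND
    p_dn = pl.tensor.as_layout(p, layout=DN)
    t = pl.tile.load(p_dn, ...)                 # trailing offsets/shapes/
                                                # valid_shapes swapped,
                                                # transpose=False
self.qk_kernel(qi, arg, out)                    # orch untouched
```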

## Diff stats

| File | Change |
|------|--------|
| ``src/ir/transforms/lower_transpose_load_param_layout_pass.cpp`` | -174 / +14 — deletes ``CallSiteAsLayoutInjector`` + Phase 2 + ``PromoteToCanonicalDN``; new ``LowerInCoreFunction`` prepends to body |
| ``src/codegen/orchestration/orchestration_codegen.cpp`` | -60 / 0 — deletes ``BuildWrapperAliasMap`` / ``ResolveAliasChain`` / alias-chasing fallback |
| ``src/backend/common/pto_ops_common.cpp`` | 0 / +86 — registers ``tensor.as_layout`` PTO codegen (emits one ``pto.make_tensor_view`` sharing the input's base) |
| Tests + docs + bindings | -188 / +291 — 5 pass-test bodies rewritten to assert the new IR shape; pass 18 docs (en/zh-cn) rewritten; pass 26 example caption updated; passes.h Doxygen + nanobind docstring + pyi stub updated |

Total: **-485 / +391 = -94 LOC net** (the +291 in tests/docs is mostly added explanatory comments and structured assertions — the production code delta is **-297 / +165 = -132 LOC**).

## Why this is acceptable per RFC §4.2

RFC §4.2's "InCore cannot create tensors" invariant targets ops that **allocate a byte buffer** (``tensor.create``). ``tensor.as_layout`` is a pure metadata reinterpret — it allocates nothing, it just re-describes the input's existing physical buffer. The four-layer boundary (§5) becomes cleaner under this design:

- **Runtime / Orch**: row-major ND physical buffer (matches runtime).
- **Cross-function boundary**: always row-major ND (no layout reinterpret).
- **Inside an InCore body**: derive the DN view via ``tensor.as_layout``; this is a single-function internal detail.
- **`.pto`**: codegen consumes whatever canonical triple the InCore body sees.

## Validation

- ``cmake --build build --parallel`` — clean.
- ``pytest tests/ut/ -n auto --maxprocesses 8`` — **4602 passed / 41 skipped / 0 failed**.
- Golden-string ``.pto`` codegen tests pass — output is byte-identical to main.
- End-to-end matmul B^T, paged-attention (single + multi-config), orchestration codegen tests all pass.

## Test plan

- [x] All existing unit tests pass.
- [x] Pass-specific tests rewritten to validate new IR shape (body-prepended ``tensor.as_layout`` binding + ``tile.load`` reading from binding LHS + orch left alone).
- [x] ``cmake --build`` clean.
- [ ] CI: clang-tidy, pre-commit, unit-tests (macos + ubuntu), fuzz-tests-sim, system-tests, system-tests-a5sim, pypto-lib-model.

## Discussion

Open for RFC author / reviewers to weigh in. The design tradeoff is "signature is the contract" (current main) vs. "cross-function boundary is the runtime-faithful boundary, DN view is a per-kernel detail" (this PR). The latter eliminates downstream codegen complexity at the cost of slightly less honest InCore signatures.