
feat(ir): tensor.as_layout op + LowerTransposeLoadParamLayout + activate MaterializeTensorStrides (#1300 P4 + P6)#1324

Merged

Hzfengsy merged 13 commits into hw-native-sys:main from lyfne123:issue-1300-tensor-as-layout on May 12, 2026.

Conversation

@lyfne123 (Collaborator) commented May 9, 2026

Summary

P4 + P6 of the RFC #1300 canonical TensorView roadmap. Lands the foundational tensor.as_layout virtual op (P4), then uses it in LowerTransposeLoadParamLayout (P6) — the first pipeline pass that produces canonical-form IR — and activates MaterializeTensorStrides in the default pipeline.

After this PR, every InCore parameter loaded via tile.load(transpose=True) is promoted at the IR level to canonical-form DN (RFC §3.3 + §4.2), all body loads on that param are expressed in canonical coords with transpose=False, and every non-InCore call site bridges its arg through tensor.as_layout(arg, DN). Codegen reads (shape, stride, layout) directly from the materialized canonical TensorView and bypasses the legacy dn_swap / get_shape_source_idx path for these tensors.

Commits

  • P4-a (bff1be59) — Op definition + DeduceTensorAsLayoutType (validity invariants: NZ rejection, RFC §4.2 canonical-pair shape derivation, packed-source check) + Python IR builder + 9 unit tests.
  • P4-b (71686cc8) — Simplify pass identity-elimination rule (as_layout(x, x.layout) → x) + 2 tests. Chain folding deferred (after-SSA Var indirection).
  • P4-c (eacf40ff) — Orchestration codegen handler emits Tensor::transpose(N-2, N-1), matching the canonical reinterpret pair.
  • P4 refactor (52e51d4e) — Drop the shape parameter from tensor.as_layout: the target shape is uniquely determined by the §4.2 canonical-pair rule, so callers (P6 in particular) never compute it themselves.
  • P6 (86072218) — ResolveTransposeLayout → LowerTransposeLoadParamLayout: promote params, swap body trailing-pair coords, drop transpose=True, inject tensor.as_layout at every non-InCore call site. Activate MaterializeTensorStrides in the default pipeline.
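The P4 refactor above relies on the target shape being uniquely derivable. A minimal sketch of that §4.2 canonical-pair rule (the helper name `canonical_pair_shape` is illustrative, not the real IR API): an ND↔DN flip swaps the trailing dimension pair, an identity flip keeps the shape, and NZ is rejected up front.

```python
def canonical_pair_shape(src_shape, src_layout, dst_layout):
    """Derive the as_layout result shape by swapping the trailing pair.

    Identity flips (same layout) keep the shape; ND<->DN flips swap the
    last two dims. NZ is rejected, mirroring the op's validity invariant.
    """
    if "NZ" in (src_layout, dst_layout):
        raise ValueError("tensor.as_layout rejects NZ layouts")
    if len(src_shape) < 2:
        raise ValueError("canonical-pair rule needs rank >= 2")
    if src_layout == dst_layout:  # identity reinterpret: shape unchanged
        return list(src_shape)
    *batch, a, b = src_shape      # [..., a, b] -> [..., b, a]
    return [*batch, b, a]

print(canonical_pair_shape([2, 48, 64], "ND", "DN"))  # -> [2, 64, 48]
```

Because the result is fully determined by the source, callers such as the P6 pass never have to compute or pass a shape argument themselves.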

P6 pass behaviour (RFC #1300 §3.3 + §4.2)

| Action | Detail |
| --- | --- |
| Param TensorType promotion | [..., a, b] ND → [..., b, a] DN (trailing-pair swap + DN tag, empty stride — filled later by MaterializeTensorStrides) |
| Body tile.load rewrite | offsets / shapes / valid_shapes trailing pair swapped to canonical coords; transpose kwarg flipped to False (kept in the slot so print-reparse round-trips faithfully) |
| Call-site bridging | Every non-InCore arg passed to a promoted slot is wrapped with tensor.as_layout(arg, DN) |
| DeduceTileLoadType | DN-source + transpose=False now derives the same Mat tile-view layout as the legacy ND-source + transpose=True (XOR), so the two forms produce identical TileType |
| Rejection | Mixed-use param (loaded with both transpose=True and transpose=False); tensor.transpose result with explicit physical strides + DN (would compose as a double transpose) |
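The DeduceTileLoadType behaviour above reduces to a single XOR. A sketch (hypothetical helper name) of the condition: the Mat tile-view layout is swapped exactly when the transpose flag and the DN-ness of the source disagree, which is what makes the legacy and canonical forms produce identical TileType.

```python
def needs_mat_layout_swap(transpose: bool, source_is_dn: bool) -> bool:
    # Mirrors the (transpose != source_is_dn) condition in DeduceTileLoadType.
    return transpose != source_is_dn  # XOR

# legacy form:    ND source, transpose=True  -> swap
# canonical form: DN source, transpose=False -> swap (same TileType)
# plain load:     ND source, transpose=False -> no swap
# double flip:    DN source, transpose=True  -> no swap
assert needs_mat_layout_swap(True, False)
assert needs_mat_layout_swap(False, True)
assert not needs_mat_layout_swap(False, False)
assert not needs_mat_layout_swap(True, True)
```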

Pipeline wiring

  • pass_manager.py default tile-PTO pipeline now inserts MaterializeTensorStrides between CanonicalizeIOOrder and InitMemRef. With P6 producing canonical-form DN parameters, the materialized strides match the IR shape directly — codegen takes the has_explicit_stride branch and bypasses the legacy dn_swap / get_shape_source_idx path for those tensors.
  • MaterializeTensorStrides now rebuilds Calls via direct ctor (preserving the intentional rank-2 type set by FlattenTileNdTo2D) rather than routing through OpRegistry::Create (which would have re-deduced a rank-3 TileType from the rank-3 source-window args and silently undone the flattening).

⚠ Test expectation change

test_pto_codegen_3d_dn_tensor_view_uses_last_dim_stride is renamed to test_pto_codegen_3d_dn_tensor_view_uses_canonical_stride and re-targeted to assert the RFC canonical form:

| Field | Legacy (pre-P6) | Canonical (post-P6) |
| --- | --- | --- |
| Emitted shape | [2, 64, 48] (post-dn_swap) | [2, 48, 64] (IR-as-written, no swap) |
| Stride[-1] | 64 (source last dim) | 48 (= shape[n-2] per RFC §2.3) |
| Stride[-3] | 3072 (= 48 × 64) — emitted via arith.muli | 3072 — materialized as a compile-time constant by MaterializeTensorStrides |

The user-facing IR (pl.Tensor[[2, 48, 64], pl.FP32, pl.DN]) is unchanged; only the codegen lowering is now canonical-direct.
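A minimal sketch of how the canonical DN-packed strides above could be derived (the helper name `dn_packed_strides` and exact rule placement are assumptions; the trailing-pair rule follows RFC #1300 §2.3 as quoted in the table): the trailing pair is stored transposed, so stride[-2] = 1 and stride[-1] = shape[-2], with batch strides built row-major from the packed trailing-pair size.

```python
def dn_packed_strides(shape):
    assert len(shape) >= 2, "DN needs a trailing dimension pair"
    *batch, a, b = shape
    strides = [1, a]             # strides for dims -2 and -1 (pair is transposed)
    size = a * b                 # packed element count of the trailing pair
    for dim in reversed(batch):  # outer (batch) dims stay row-major
        strides.insert(0, size)
        size *= dim
    return strides

# The PR's 3D DN example, shape [2, 48, 64]:
# stride[-1] = 48 (= shape[-2]), stride[-3] = 3072 (= 48 * 64)
print(dn_packed_strides([2, 48, 64]))  # -> [3072, 1, 48]
```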

Test plan

  • cmake --build build --parallel — clean
  • pytest tests/ut/ir/operators/test_tensor_as_layout.py — 9 passed (P4-a)
  • pytest tests/ut/ir/transforms/test_simplify_pass.py::TestAsLayoutFolding — 2 passed (P4-b)
  • pytest tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py — 8 passed (P6: B^T / A^T / AB^T / no-op / idempotent / mixed-use rejection / partial-load)
  • pytest tests/ut/ — 4522 passed / 33 skipped, no regressions
  • pytest tests/lint/check_english_only.py tests/lint/check_headers.py — clean

Design decisions (per RFC issue threads)

  • Internal-only (Q1): not exposed via pypto.language. Only IR-level passes construct tensor.as_layout.
  • Restricted reinterprets (RFC §4.2): currently only row-major [..., a, b] ND ≡ [..., b, a] DN-packed. NZ rejected on TensorType.
  • Single-responsibility as_layout: shape-changing reinterprets stay with tensor.reshape. as_layout is layout-tag-only; the trailing-pair shape swap is mechanical.
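The single-responsibility design pairs with the P4-b Simplify rule. A hedged sketch of that identity-elimination rule (the `Tensor` stand-in and function name are illustrative, not the real IR API): as_layout folds to its input only when the requested layout equals the source's layout; substantive flips are kept, and nested chains are deliberately left alone.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    shape: list
    layout: str

def simplify_as_layout(src: Tensor, target_layout: str):
    """Identity reinterprets fold to the source; real flips stay as calls."""
    if src.layout == target_layout:  # as_layout(x, x.layout) -> x
        return src
    # Substantive flip: keep a symbolic call node (chain folding deferred).
    return ("tensor.as_layout", src, target_layout)

x = Tensor([2, 48, 64], "DN")
assert simplify_as_layout(x, "DN") is x  # identity folded away
assert simplify_as_layout(x, "ND") != x  # layout flip preserved
```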

Roadmap

| Phase | Topic | Status |
| --- | --- | --- |
| P1–P3 | Canonical TensorView foundation | Landed in #1311 |
| P4 | tensor.as_layout op | This PR |
| P5 | tensor.slice / tensor.reshape inherit parent layout family | Independent — can land any time |
| P6 | LowerTransposeLoadParamLayout + MaterializeTensorStrides in default pipeline | This PR |
| P7 | Codegen cleanup (drop the legacy dn_swap / has_explicit_stride / get_shape_source_idx branches now that P6+P3 supply explicit strides) | Next |
| P8 | DSL deprecation (pl.DN, pl.load(transpose=True)) | After P7 |
| P9 | pl.move(layout=) kwarg | Independent |

🤖 Generated with Claude Code

coderabbitai Bot commented May 9, 2026

Note

Reviews paused: CodeRabbit automatically paused automatic reviews because this branch is under active development.
Walkthrough

This PR replaces the ResolveTransposeLayout pass with LowerTransposeLoadParamLayout, introducing a new tensor.as_layout IR operation for metadata-only layout flipping. The transformation promotes InCore function parameters from ND to canonical DN layout when used with tile.load(..., transpose=True), swaps trailing coordinate dimensions, and injects tensor.as_layout bridges at non-InCore call sites. Backend codegen no longer performs implicit DN coordinate swapping, assuming pre-canonicalized (shape, stride, layout) triples from the lowering pass.

Changes

RFC #1300 Layout Canonicalization

  • tensor.as_layout IR Op: Type Deduction & Core Logic (include/pypto/ir/transforms/utils/tensor_view_semantics.h, src/ir/op/tensor_ops/transform.cpp, python/pypto/ir/op/tensor_ops.py): New internal IR operation for ND↔DN layout flipping; adds #include <utility> support; type deduction swaps trailing dimensions, validates canonical strides, forbids NZ layouts.
  • tensor.as_layout Codegen & Testing (src/codegen/tensor_op_codegen.cpp, tests/ut/ir/operators/test_tensor_as_layout.py): Orchestration codegen emits a metadata-only alias without runtime swap; comprehensive test suite validates type inference, layout flipping, identity folding, rank/NZ rejection, and strided sub-view rejection.
  • Tile Load Type Inference: DN-Source Handling (src/ir/op/tile_ops/memory.cpp): Updates DeduceTileLoadType to detect DN-tagged sources; treats DN and explicit transpose as XOR-compatible via the (transpose != source_is_dn) condition for Mat memory layout swapping.
  • LowerTransposeLoadParamLayout Pass: Phase 1 InCore (src/ir/transforms/lower_transpose_load_param_layout_pass.cpp, lines 1–183): Phase 1 scans InCore functions, rejects mixed transpose modes, promotes parameters to canonical DN with swapped shapes, rewrites tile.load coordinate tuples, and drops the transpose=True kwarg.
  • LowerTransposeLoadParamLayout Pass: Phase 2 Call-Site (src/ir/transforms/lower_transpose_load_param_layout_pass.cpp, lines 263–455): Phase 2 injects tensor.as_layout(arg, DN) bindings at non-InCore call sites targeting promoted callees; skips injection when the argument is already DN-canonical.
  • LowerTransposeLoadParamLayout Pass Tests (tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py): Validates both phases: B-load, A-load, both-params, non-square, no-op, idempotence, mixed-mode rejection, and partial-load promotion via SSA binding inspection.
  • Old Pass Removal (src/ir/transforms/resolve_transpose_layout_pass.cpp, tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py): Deletes the entire ResolveTransposeLayout implementation and test module.
  • Pass Registration & Public API (include/pypto/ir/transforms/pass_properties.h, include/pypto/ir/transforms/passes.h, python/bindings/modules/passes.cpp, python/pypto/pypto_core/passes.pyi, CMakeLists.txt): Registers LowerTransposeLoadParamLayout in properties, C++ headers, Python bindings, and stubs; updates CMakeLists to link the new implementation.
  • Pass Pipeline Integration (python/pypto/ir/pass_manager.py, tests/ut/ir/transforms/test_pass_manager.py, tests/ut/codegen/test_pto_codegen_cross_core.py): Updates tile_pto_passes to use the new pass; inserts MaterializeTensorStrides before InitMemRef; updates test pass-sequence expectations.
  • Backend PTO Codegen Simplification (src/backend/common/pto_ops_common.cpp): Removes implicit DN last-two-dim coordinate swapping from tile.load/tile.store codegen; assumes IR tuples are already canonical.
  • PTO Codegen: EmitMakeTensorViews (src/codegen/pto/pto_codegen.cpp): Rewritten to directly materialize layout-aware (shape, stride, layout) triples without prior DN/ND shape swaps; precomputes SSA names, derives strides by layout semantics, and preserves the [M,1] column-vector special case.
  • Orchestration & MaterializeTensorStrides (src/codegen/orchestration/orchestration_codegen.cpp, src/ir/transforms/materialize_tensor_strides_pass.cpp): Adds wrapper alias-map helpers for tensor.as_layout layout bridges; updates MaterializeTensorStrides to always rebuild Calls with the direct constructor, avoiding OpRegistry re-deduction.
  • Simplify Pass: Identity Folding (src/ir/transforms/simplify_pass.cpp, tests/ut/ir/transforms/test_simplify_pass.py): Adds SimplifyAsLayout to eliminate identity as_layout calls; tests validate identity elimination and substantive layout-flip preservation.
  • Codegen Test Updates (tests/ut/codegen/test_pto_codegen.py): Updates the 3D DN tensor test to expect canonical stride/shape (no implicit swap); renames the test and updates assertions for logical shape order and canonical DN strides.
  • Documentation: Pass Definitions, English & Chinese (docs/en/dev/passes/18-lower_transpose_load_param_layout.md, docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md): Adds comprehensive pass documentation describing algorithm, scope, interactions, before/after examples, and RFC #1300 P6 behavior.
  • Documentation: Pass Pipeline & Ordering (docs/*/dev/passes/00-pass_manager.md, docs/*/dev/passes/17-*.md, docs/*/dev/passes/19-*.md, docs/*/dev/passes/26-*.md, docs/*/user/01-language_guide.md, .claude/rules/pass-doc-ordering.md): Updates pipeline descriptions, MaterializeTensorStrides status to "active since P6", the pass-ordering index, and user documentation.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested Labels

enhancement

Suggested Reviewers

  • Hzfengsy

Poem

A rabbit hops through transpose lands,
Where dimensions swap by careful hands,
DN layouts blessed, coordinates aligned,
Canonical forms, so well-designed! 🐰
Layout bridges span the call-site gap,
Strides materialized—wisdom's map! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.46%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main changes: introducing the tensor.as_layout op, the LowerTransposeLoadParamLayout pass, and activating MaterializeTensorStrides, with RFC phase identifiers (P4 + P6) providing context. |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset. It explains the summary, commits, pass behavior, pipeline wiring, test expectations, design decisions, and roadmap—all aligned with the file-level changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request implements the tensor.as_layout internal IR operator for metadata-only reinterpretation of tensor shapes and layouts, supporting row-major ND and DN-packed equivalence. The changes include C++ view semantics, Python bindings, type deduction, and simplification rules. Feedback points out a bug in the orchestration codegen where identity reinterprets incorrectly emit runtime transposes and 1D tensors cause crashes. A code suggestion is provided to ensure identity cases are handled as no-ops and rank checks are only applied during layout changes.

Comment thread src/codegen/tensor_op_codegen.cpp Outdated
@lyfne123 lyfne123 changed the title feat(ir): tensor.as_layout virtual op (#1300 P4) feat(ir): tensor.as_layout op + LowerTransposeLoadParamLayout + activate MaterializeTensorStrides (#1300 P4 + P6) May 11, 2026
coderabbitai Bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/zh-cn/dev/passes/00-pass_manager.md (1)

383-383: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale annotation: MaterializeTensorStrides is now wired into the default pipeline.

Line 383 still reads "已注册但尚未接入默认 pipeline(将随 RFC #1300 P6/P7 的 codegen 清理一起启用)", but the list itself includes it as step 13, and per the PR objectives this PR (P6) activates MaterializeTensorStrides in the default pipeline. Please drop the stale parenthetical so the doc doesn't contradict its own enumeration (and matches docs/en/dev/passes/17-lower_transpose_load_param_layout.md, which already treats it as part of the running pipeline).

📝 Suggested wording fix
-13. [`MaterializeTensorStrides`](25-materialize_tensor_strides.md) —— 已注册但尚未接入默认 pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)
+13. [`MaterializeTensorStrides`](25-materialize_tensor_strides.md) —— 在 `CanonicalizeIOOrder` 之后为 P6 提升出的 DN 参数物化 packed canonical strides

Also worth grepping the English docs/en/dev/passes/00-pass_manager.md (not in this review) for the same stale phrasing.

#!/bin/bash
# Make sure the English counterpart didn't keep the stale "not yet wired in" note.
fd -t f '00-pass_manager.md' docs
rg -nP -C2 'MaterializeTensorStrides' docs/en/dev/passes/00-pass_manager.md docs/zh-cn/dev/passes/00-pass_manager.md 2>/dev/null
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/zh-cn/dev/passes/00-pass_manager.md` at line 383, The doc text for
MaterializeTensorStrides is stale: in docs/zh-cn/dev/passes/00-pass_manager.md
update the line listing `MaterializeTensorStrides` (step 13) by removing the
parenthetical "已注册但尚未接入默认 pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)" so the
entry no longer claims it isn’t wired into the default pipeline; keep the entry
as-is otherwise to match the activated status used elsewhere (e.g.,
docs/en/dev/passes/17-lower_transpose_load_param_layout.md).
docs/en/dev/passes/00-pass_manager.md (1)

383-383: ⚠️ Potential issue | 🟠 Major

Update line 383 to reflect MaterializeTensorStrides is now active in the default pipeline.

The code in python/pypto/ir/pass_manager.py confirms that MaterializeTensorStrides is now part of the default pipeline (active since RFC #1300 P6), and docs/en/dev/passes/25-materialize_tensor_strides.md correctly documents this status. Line 383 of docs/en/dev/passes/00-pass_manager.md is outdated and should be updated to remove "not yet wired into the default pipeline" and instead indicate that it is active in the default pipeline between CanonicalizeIOOrder and InitMemRef.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/en/dev/passes/00-pass_manager.md` at line 383, Update the documentation
entry for MaterializeTensorStrides to reflect that it is now active in the
default pipeline: remove "not yet wired into the default pipeline" and state
that MaterializeTensorStrides is active (since RFC `#1300` P6) and placed between
CanonicalizeIOOrder and InitMemRef in the default pass sequence; ensure the
description matches the status in python/pypto/ir/pass_manager.py and the
details in 25-materialize_tensor_strides.md.
🧹 Nitpick comments (1)
src/ir/op/tile_ops/memory.cpp (1)

158-173: ⚡ Quick win

Use TensorType::IsDNLayout() convenience method instead of inline check.

The DN detection at line 164-165 correctly reads from tensor_view_->layout, but duplicates the logic already defined in TensorType::IsDNLayout() (include/pypto/ir/type.h line 477-479). Simplify to:

bool source_is_dn = tensor_type->IsDNLayout();

This improves readability and eliminates duplication. The underlying check is correct: DN is canonically stored only in tensor_view_->layout, and every DN-tagged TensorType reaching tile.load always materializes tensor_view_ (per PromoteToCanonicalDN() and deserialization pathways).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/ir/op/tile_ops/memory.cpp` around lines 158 - 173, Replace the inline
DN-layout check used to set the local variable source_is_dn with the
TensorType::IsDNLayout() convenience method; locate where source_is_dn is
computed (currently using tensor_type->tensor_view_.has_value() &&
tensor_type->tensor_view_->layout == TensorLayout::DN) and change it to call
tensor_type->IsDNLayout() so the code uses the existing TensorType helper and
removes duplicated logic in memory.cpp within the tile.load handling (the block
that then uses source_is_dn to decide swapping tile_view.blayout/slayout).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/dev/passes/17-lower_transpose_load_param_layout.md`:
- Around line 15-18: Remove the misleading phrase "the next default pass" when
referring to MaterializeTensorStrides in this document and instead state that
MaterializeTensorStrides must run later in the pipeline (after passes such as
CanonicalizeIOOrder); update both the earlier mention (around the paragraph
describing the promoted TensorType and empty stride) and the later mention
(lines ~37–41) so they consistently say MaterializeTensorStrides runs
downstream/after CanonicalizeIOOrder rather than immediately next. Ensure
references to MaterializeTensorStrides and CanonicalizeIOOrder remain intact so
readers can locate the correct ordering.

In `@docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md`:
- Around line 11-22: The pipeline-order text is stale: update the doc to state
that LowerTransposeLoadParamLayout runs before ResolveBackendOpLayouts (as
registered in pass_manager.py), that MaterializeTensorStrides is not the "next
pass" but is inserted much later after CanonicalizeIOOrder, and correct the pass
index (it's the 18th pass in the full Default pipeline, not 17th); mention it
still runs after InferTileMemorySpace and before ResolveBackendOpLayouts and
that MaterializeTensorStrides must run later to materialize DN-packed canonical
stride.

In `@src/ir/op/tensor_ops/transform.cpp`:
- Around line 353-355: The current as_layout path rebuilds a fresh canonical
view via tensor_view_semantics::CanonicalizeView and drops any existing
valid_shape/pad metadata; update the logic creating new_view in transform.cpp so
it preserves src_type's existing view metadata (valid_shape and pad) when
present before constructing the new TensorType: retrieve the original view from
src_type (e.g., src_type->view or similar), copy/merge its valid_shape and pad
into the canonicalized new_view (or attach them to the optional view passed to
TensorType) so that tensor.as_layout remains lossless for sliced or fill-padded
tensors.

In `@src/ir/transforms/lower_transpose_load_param_layout_pass.cpp`:
- Around line 145-154: LowerTransposeLoadParamLayout currently requires
call->args_.size() == 4 and forcefully As<MakeTuple> on args_[3], which fails
for the supported 3-arg form tile.load(tensor, offsets, shapes). Change the
verification to accept either 3 or 4 args (check call->args_.size() == 3 ||
call->args_.size() == 4), set offsets = As<MakeTuple>(call->args_[1]) and shapes
= As<MakeTuple>(call->args_[2]) as before, and compute valid_shapes only if
args_.size() == 4 (e.g. valid_shapes = (call->args_.size() == 4 ?
As<MakeTuple>(call->args_[3]) : nullptr)). Update the INTERNAL_CHECK_SPAN logic
to require offsets && shapes, and if valid_shapes is non-null require it to be a
MakeTuple, so the 3-arg form is accepted and the 4-arg form still validated.

In `@src/ir/transforms/materialize_tensor_strides_pass.cpp`:
- Around line 168-179: The direct reconstruction of Call in
MaterializeTensorStrides drops op->attrs_ so metadata is lost; update the Call
rebuild to preserve attributes by passing op->attrs_ into the Call constructor
(i.e., when creating the new std::make_shared<Call>(...), include op->attrs_
alongside op->op_, std::move(new_args), op->kwargs_, std::move(new_return_type),
op->span_) so the new Call keeps the original attrs_ set.

In `@src/ir/transforms/simplify_pass.cpp`:
- Around line 158-163: Update the comment above VisitExpr_ to reflect current
behavior: remove the outdated "shape-bearing" form and the claim about folding
chains, and instead state that SimplifyAsLayout() only removes identity
reinterprets (i.e., as_layout(x, x.shape, x.layout) → x) but does not collapse
nested as_layout chains; reference the VisitExpr_ comment and SimplifyAsLayout()
as the locations to update so the pass contract is consistent with the
implementation.

In `@tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py`:
- Line 251: The assertion currently assumes a_arg.op is non-None; change it to
first assert isinstance(a_arg, ir.Call) and a_arg.op is not None, then assert
a_arg.op.name == "tensor.as_layout" (mirroring the defensive pattern used for
b_arg). Update the assertion around a_arg in
test_lower_transpose_load_param_layout_pass to check a_arg.op is not None before
accessing op.name so we avoid potential AttributeError.
- Around line 297-298: The two assertions that check call.args[0].op.name and
call.args[1].op.name assume .op is non-null; add null-safety checks similar to
earlier patterns (e.g., lines that check isinstance(call.args[i], ir.Call) and
call.args[i].op is not None) before accessing .op.name so replace/augment the
assertions with checks that call.args[0].op is not None and call.args[1].op is
not None and then assert call.args[0].op.name == "tensor.as_layout" and
call.args[1].op.name == "tensor.as_layout".

---

Outside diff comments:
In `@docs/en/dev/passes/00-pass_manager.md`:
- Line 383: Update the documentation entry for MaterializeTensorStrides to
reflect that it is now active in the default pipeline: remove "not yet wired
into the default pipeline" and state that MaterializeTensorStrides is active
(since RFC `#1300` P6) and placed between CanonicalizeIOOrder and InitMemRef in
the default pass sequence; ensure the description matches the status in
python/pypto/ir/pass_manager.py and the details in
25-materialize_tensor_strides.md.

In `@docs/zh-cn/dev/passes/00-pass_manager.md`:
- Line 383: The doc text for MaterializeTensorStrides is stale: in
docs/zh-cn/dev/passes/00-pass_manager.md update the line listing
`MaterializeTensorStrides` (step 13) by removing the parenthetical "已注册但尚未接入默认
pipeline(将随 RFC `#1300` P6/P7 的 codegen 清理一起启用)" so the entry no longer claims it
isn’t wired into the default pipeline; keep the entry as-is otherwise to match
the activated status used elsewhere (e.g.,
docs/en/dev/passes/17-lower_transpose_load_param_layout.md).

---

Nitpick comments:
In `@src/ir/op/tile_ops/memory.cpp`:
- Around line 158-173: Replace the inline DN-layout check used to set the local
variable source_is_dn with the TensorType::IsDNLayout() convenience method;
locate where source_is_dn is computed (currently using
tensor_type->tensor_view_.has_value() && tensor_type->tensor_view_->layout ==
TensorLayout::DN) and change it to call tensor_type->IsDNLayout() so the code
uses the existing TensorType helper and removes duplicated logic in memory.cpp
within the tile.load handling (the block that then uses source_is_dn to decide
swapping tile_view.blayout/slayout).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 980489e7-e270-42a1-9d75-337dbea41d7e

📥 Commits

Reviewing files that changed from the base of the PR and between a2af40c and 8607221.

📒 Files selected for processing (37)
  • .claude/rules/pass-doc-ordering.md
  • CMakeLists.txt
  • docs/en/dev/passes/00-pass_manager.md
  • docs/en/dev/passes/16-infer_tile_memory_space.md
  • docs/en/dev/passes/17-lower_transpose_load_param_layout.md
  • docs/en/dev/passes/17-resolve_transpose_layout.md
  • docs/en/dev/passes/18-resolve_backend_op_layouts.md
  • docs/en/dev/passes/25-materialize_tensor_strides.md
  • docs/en/user/01-language_guide.md
  • docs/zh-cn/dev/passes/00-pass_manager.md
  • docs/zh-cn/dev/passes/16-infer_tile_memory_space.md
  • docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md
  • docs/zh-cn/dev/passes/17-resolve_transpose_layout.md
  • docs/zh-cn/dev/passes/18-resolve_backend_op_layouts.md
  • docs/zh-cn/dev/passes/25-materialize_tensor_strides.md
  • docs/zh-cn/user/01-language_guide.md
  • include/pypto/ir/transforms/pass_properties.h
  • include/pypto/ir/transforms/passes.h
  • include/pypto/ir/transforms/utils/tensor_view_semantics.h
  • python/bindings/modules/passes.cpp
  • python/pypto/ir/op/tensor_ops.py
  • python/pypto/ir/pass_manager.py
  • python/pypto/pypto_core/passes.pyi
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/op/tensor_ops/transform.cpp
  • src/ir/op/tile_ops/memory.cpp
  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
  • src/ir/transforms/materialize_tensor_strides_pass.cpp
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • src/ir/transforms/simplify_pass.cpp
  • tests/ut/codegen/test_pto_codegen.py
  • tests/ut/codegen/test_pto_codegen_cross_core.py
  • tests/ut/ir/operators/test_tensor_as_layout.py
  • tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
  • tests/ut/ir/transforms/test_pass_manager.py
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
  • tests/ut/ir/transforms/test_simplify_pass.py
💤 Files with no reviewable changes (4)
  • docs/zh-cn/dev/passes/17-resolve_transpose_layout.md
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • docs/en/dev/passes/17-resolve_transpose_layout.md
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py

Comment thread docs/en/dev/passes/17-lower_transpose_load_param_layout.md Outdated
Comment thread docs/zh-cn/dev/passes/17-lower_transpose_load_param_layout.md Outdated
Comment thread src/ir/op/tensor_ops/transform.cpp
Comment thread src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
Comment thread src/ir/transforms/materialize_tensor_strides_pass.cpp Outdated
Comment thread src/ir/transforms/simplify_pass.cpp
Comment thread tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py Outdated
Comment thread tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py Outdated
lyfne123 added a commit to lyfne123/pypto that referenced this pull request May 11, 2026
CI fix (root cause of pypto-lib-model + system-tests failures):

- P6's call-site injector previously inlined `tensor.as_layout(arg, DN)` into
  the kernel-call args directly, which the orchestration codegen rejects with
  `Call to '<callee>' arg N is neither a variable nor a recognized constant
  literal`. Refactor `CallSiteAsLayoutInjector` to operate at the statement
  level: for every AssignStmt / EvalStmt / ReturnStmt whose RHS targets a
  promoted callee, emit one `bridged_<param> = tensor.as_layout(arg, DN)`
  AssignStmt immediately before the call statement and replace the inline
  Call arg with the bound Var. Net IR is SSA-form and matches what
  orchestration codegen consumes per arg slot (Var | const-literal).

Review comments addressed:

- gemini #1: codegen `tensor.as_layout` now special-cases the identity flip
  (target layout == source layout) and emits a plain `Tensor result = input;`
  alias instead of a spurious `.transpose()`. Simplify still folds these
  before codegen in the default pipeline, but the codegen is now robust
  against ad-hoc compile paths that skip Simplify.
- coderabbitai hw-native-sys#2 / hw-native-sys#3: drop the "next default pass" wording in en/zh-cn doc
  17 — `MaterializeTensorStrides` runs later in the pipeline (after
  `CanonicalizeIOOrder`), not immediately after. The zh-cn doc's "17th pass"
  text is also clarified — the 17 is the docs/passes/ slot, not a literal
  pipeline call-count.
- coderabbitai hw-native-sys#4: `DeduceTensorAsLayoutType` now preserves the source
  TensorView's `valid_shape` (with trailing-pair swap on cross-layout flips)
  and `pad` through `tensor.as_layout`. Previously these fields were dropped,
  making the reinterpret silently lossy for sliced or fill-padded inputs.
- coderabbitai hw-native-sys#6: `MaterializeTensorStrides` direct-ctor rebuild path now
  forwards `op->attrs_`. The previous version preserved type and kwargs but
  dropped attrs, which would have silently discarded call metadata
  (arg_directions, manual_dep_edges) attached by earlier passes.
- coderabbitai hw-native-sys#7: update the stale comment block above `VisitExpr_` in
  `simplify_pass.cpp` — it still described the dropped shape-bearing
  `as_layout(x, shape, layout)` form and the never-implemented chain
  folding. New comment accurately describes the single identity-elimination
  rule and explains why chain folding is deferred.
- coderabbitai hw-native-sys#8 / hw-native-sys#9: the unit-test pattern that previously inspected an
  inline `tensor.as_layout` Call as a kernel-call arg no longer applies
  after the SSA refactor above. Tests now look up the bridge via
  `_find_assign_rhs(orch, var)` and guard `op is not None` before reading
  `op.name` (matching the defensive pattern already used in the B^T test).

Skipped (with reason):

- coderabbitai hw-native-sys#5: "Handle 3-arg `tile.load`". `tile.load` registers four
  mandatory args (tensor, offsets, shapes, valid_shapes) and the Python
  builder always materializes `valid_shapes` (defaults to `shapes` when the
  caller omits it). Once IR is constructed, every `tile.load` is 4-arg —
  the 3-arg form only exists at the DSL surface. The internal check stays
  as-is.
@lyfne123 lyfne123 force-pushed the issue-1300-tensor-as-layout branch from ab19bfa to 5fb0a67 Compare May 11, 2026 03:38
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/codegen/pto/pto_codegen.cpp`:
- Around lines 593-689: the branch that sets `is_column_vector` currently allows
rank==3 but later fills only two `stride_names`, leaving `stride_names[2]` empty
and producing malformed MLIR. Fix by restricting the column-vector detection to
rank == 2 (i.e., require rank == 2 in the `is_column_vector` check) so
`layout = ir::TensorLayout::DN` is only forced for 2-D [M,1] tensors and the
existing stride fallback (`stride_names[0]`, `stride_names[1]`) remains correct.
Alternatively, if rank-3 column vectors must be supported, extend the
`is_column_vector` handling to fully populate `stride_names` for all dims using
`shape_dim_names`/`emit_stride_mul` before the stride emission; the simpler fix
is to limit `is_column_vector` to rank==2.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 75a30cdc-18ba-4b78-bc6a-3a49ca63363f

📥 Commits

Reviewing files that changed from the base of the PR and between 8607221 and 1fda505.

📒 Files selected for processing (38)
  • .claude/rules/pass-doc-ordering.md
  • CMakeLists.txt
  • docs/en/dev/passes/00-pass_manager.md
  • docs/en/dev/passes/17-infer_tile_memory_space.md
  • docs/en/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/en/dev/passes/19-resolve_backend_op_layouts.md
  • docs/en/dev/passes/26-materialize_tensor_strides.md
  • docs/en/user/01-language_guide.md
  • docs/zh-cn/dev/passes/00-pass_manager.md
  • docs/zh-cn/dev/passes/17-infer_tile_memory_space.md
  • docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/zh-cn/dev/passes/19-resolve_backend_op_layouts.md
  • docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
  • docs/zh-cn/user/01-language_guide.md
  • include/pypto/ir/transforms/pass_properties.h
  • include/pypto/ir/transforms/passes.h
  • include/pypto/ir/transforms/utils/tensor_view_semantics.h
  • python/bindings/modules/passes.cpp
  • python/pypto/ir/op/tensor_ops.py
  • python/pypto/ir/pass_manager.py
  • python/pypto/pypto_core/passes.pyi
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/codegen/pto/pto_codegen.cpp
  • src/codegen/tensor_op_codegen.cpp
  • src/ir/op/tensor_ops/transform.cpp
  • src/ir/op/tile_ops/memory.cpp
  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
  • src/ir/transforms/materialize_tensor_strides_pass.cpp
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • src/ir/transforms/simplify_pass.cpp
  • tests/ut/codegen/test_pto_codegen.py
  • tests/ut/codegen/test_pto_codegen_cross_core.py
  • tests/ut/ir/operators/test_tensor_as_layout.py
  • tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
  • tests/ut/ir/transforms/test_pass_manager.py
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
  • tests/ut/ir/transforms/test_simplify_pass.py
💤 Files with no reviewable changes (2)
  • src/ir/transforms/resolve_transpose_layout_pass.cpp
  • tests/ut/ir/transforms/test_resolve_transpose_layout_pass.py
✅ Files skipped from review due to trivial changes (13)
  • docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
  • docs/en/dev/passes/26-materialize_tensor_strides.md
  • docs/en/dev/passes/17-infer_tile_memory_space.md
  • docs/zh-cn/dev/passes/19-resolve_backend_op_layouts.md
  • docs/en/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/zh-cn/user/01-language_guide.md
  • docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/zh-cn/dev/passes/17-infer_tile_memory_space.md
  • include/pypto/ir/transforms/utils/tensor_view_semantics.h
  • docs/en/dev/passes/00-pass_manager.md
  • docs/en/user/01-language_guide.md
  • docs/en/dev/passes/19-resolve_backend_op_layouts.md
  • .claude/rules/pass-doc-ordering.md
🚧 Files skipped from review as they are similar to previous changes (17)
  • CMakeLists.txt
  • include/pypto/ir/transforms/pass_properties.h
  • tests/ut/codegen/test_pto_codegen_cross_core.py
  • docs/zh-cn/dev/passes/00-pass_manager.md
  • src/ir/transforms/simplify_pass.cpp
  • python/pypto/ir/op/tensor_ops.py
  • include/pypto/ir/transforms/passes.h
  • src/ir/op/tile_ops/memory.cpp
  • src/ir/transforms/materialize_tensor_strides_pass.cpp
  • src/ir/op/tensor_ops/transform.cpp
  • python/pypto/pypto_core/passes.pyi
  • tests/ut/codegen/test_pto_codegen.py
  • tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
  • python/bindings/modules/passes.cpp
  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
  • tests/ut/ir/transforms/test_pass_manager.py
  • python/pypto/ir/pass_manager.py

Comment thread src/codegen/pto/pto_codegen.cpp Outdated
lyfne123 added a commit to lyfne123/pypto that referenced this pull request May 11, 2026
Addresses coderabbitai's review on ``EmitMakeTensorViews``: the
``is_column_vector`` check previously matched ``rank == 2 || rank == 3``,
but the column-vector stride fallback only populates ``stride_names[0]``
and ``stride_names[1]``. For a hypothetical rank-3 ``[_, M, 1]`` input
the third stride slot would be empty and the codegen would emit malformed
MLIR.

The legacy column-vector behavior was always rank-2 in practice (PTOAS
infers DN for ``[M, 1]`` specifically) and all in-tree test coverage is
rank-2. Restrict the condition to ``rank == 2`` and add a comment
explaining the constraint, mirroring coderabbitai's suggested minimal
fix.
lyfne123 added a commit to lyfne123/pypto that referenced this pull request May 11, 2026
…a2a3 paged_attention)

Adds ``set_output_memory_inherit_input()`` to the ``tensor.as_layout`` op
registration. Without it, ``InitMemRef`` saw the op as a regular tensor
producer and minted a fresh MemRef (with its own ``mem_ddr_*`` allocation)
for the result. The orchestration codegen, meanwhile, lowers
``tensor.as_layout`` to a plain alias ``Tensor result = input;`` — so the
runtime tensor's data pointer points to the *input's* buffer, while the IR
declares a *separate* buffer.

Concrete failure mode (non-square paged-attention on a2a3): for kj of shape
``[64, 128]`` ND promoted to ``[128, 64] DN``, the orch IR looked like

```
kj_dn_view: pl.Tensor[[128, 64], ..., pl.MemRef("mem_ddr_3", 0, 16384)] =
    pl.tensor.as_layout(kj, layout=DN)
self.qk_kernel(qi, kj_dn_view, out)
```

The kernel binary's ``make_tensor_view`` produced the right (shape, stride,
layout) triple, but the orch passed a stale buffer pointer (the
freshly-allocated mem_ddr_3 was never written to; only the alias to kj's
mem_ddr_1 held real data). Square cases happened to pass because the
trailing-pair shape swap is identity, and PTOAS's runtime address arithmetic
collapsed to the same byte range either way.

After this fix, the orch IR shows ``kj_dn_view`` sharing the input's MemRef
``mem_ddr_1`` — no spurious allocation, no aliasing mismatch.

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped
- Local IR inspection confirms ``kj_dn_view`` inherits ``mem_ddr_1`` (was
  ``mem_ddr_3`` before this fix).
lyfne123 added 11 commits May 12, 2026 08:11
Introduce tensor.as_layout - a pure metadata reinterpret op that points
at the same physical memory as its source but exposes a different
(shape, stride, layout) triple to consumers (RFC hw-native-sys#1300 §3.3).

The op emits no PTOAS instruction at codegen; downstream
make_tensor_view consumes the new view directly. It is internal-only:
not exposed via pypto.language. Future passes (notably
LowerTransposeLoadParamLayout in P6) inject tensor.as_layout at orch ↔
InCore call sites to bridge equivalent (shape, stride, layout) views.

DeduceTensorAsLayoutType enforces three validity invariants:

- Total element count of src and target must match (when both are
  statically known).
- layout must not be NZ (NZ is tile-only / fractal).
- The reinterpret must reduce to a RFC §4.2 canonical pair - currently
  row-major [..., a, b] ND ≡ [..., b, a] DN-packed. Other reinterprets
  are rejected; the helper AsLayoutOffsetMapEquivalent in
  tensor_view_semantics.h can be extended in follow-ups when concrete
  use cases appear.

Tests: 9 unit cases covering ND↔DN 2D/3D static round trips, idempotent
self-reinterpret, element-count mismatch, NZ rejection, invalid
offset-map rejection, symbolic-shape ExprPtr-identity matching, and the
op-registry sanity check.
…#1300 P4-b)

Extend the Simplify pass to drop ``tensor.as_layout`` calls that are
no-op metadata reinterprets — i.e. the requested ``(shape, layout)``
already matches what the source carries. The Call is replaced by its
``src`` arg directly, so downstream consumers stop walking through a
useless reinterpret.

Why this matters: future passes (e.g. ``LowerTransposeLoadParamLayout``
in P6) may insert ``tensor.as_layout`` at every call-site bridge.
Without folding, identity reinterprets ride through the pipeline and
clutter codegen. With this rule, the no-op cases drop out at the
Simplify boundary.

Chain folding (``as_layout(as_layout(x, ...), ...)`` → ``as_layout(x, ...)``)
is intentionally left out: after SSA the outer Call's source is a Var
bound to the inner Call (not the inner Call inline), so naive pointer
inspection cannot see across the binding. A dedicated SSA-aware chain
optimizer can be added if real pipelines produce such chains.

Tests: 3 cases — identity elimination on bare ND, preservation of a
substantive shape change, preservation of a same-shape layout change.
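The identity-elimination rule can be sketched in a few lines of Python (tuple encoding is illustrative only; the real pass pattern-matches IR Call nodes inside `VisitExpr_`):

```python
# Toy fold: as_layout(x, L) reduces to x when L already matches x's layout;
# substantive flips are preserved untouched.
def fold_as_layout(expr):
    op, src, target_layout = expr      # expr == ("as_layout", src, layout)
    assert op == "as_layout"
    src_layout = src[2]                # src == ("tensor", name, layout)
    return src if target_layout == src_layout else expr
```

A chain `as_layout(as_layout(x, DN), ND)` does not fold here for the reason stated above: after SSA the outer source is a Var bound to the inner Call, not the inner Call itself.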
…ys#1300 P4 review)

Tighten the tensor.as_layout signature in response to a design review:
the op now ONLY flips the layout tag - shape changes that come with a
ND ↔ DN flip are mechanical (RFC §4.2 canonical pair: trailing-two-dim
swap) and derived from the source. Callers no longer pass a target
shape.

Why the change: shape changes are tensor.reshape's job. Letting
as_layout also accept a target shape blurred the responsibility,
created room for caller error (mismatched shape vs canonical pair),
and required the AsLayoutOffsetMapEquivalent helper to validate
arbitrary (src, target) combinations. With the layout-tag-only design,
the canonical pair is intrinsic to the op and the helper disappears.

API:
  Before:  as_layout(src, shape, layout=...)
  After:   as_layout(src, layout=...)

Behavior:
  - Same layout (identity)             -> shape unchanged
  - Cross layout (ND <-> DN, rank>=2)  -> trailing 2 dims swapped
  - NZ target                          -> rejected
  - Strided sub-view source            -> rejected (use slice/reshape first)

Cleanup:
  - Drop AsLayoutOffsetMapEquivalent helper.
  - Drop detail::RowMajorEquivalentShape and detail::ShapeListsEquivalent
    (no remaining consumers; the namespace detail rump now contains only
    the truly-internal CheckCanonicalView helpers, restoring the
    convention).
  - Simplify identity rule simplifies to a layout-only comparison.
  - Codegen handler unchanged - still Tensor::transpose(N-2, N-1) for
    cross-layout flips; identity cases never reach codegen because
    Simplify folds them.

Tests: test_tensor_as_layout.py rewritten (9 cases) - adds an explicit
strided-source rejection, drops obsolete element-count and offset-map
cases. test_simplify_pass.py::TestAsLayoutFolding adjusted for the new
signature (2 cases).
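The behavior table above reduces to a small shape rule; a Python sketch (a stand-in for the C++ `DeduceTensorAsLayoutType`, not the implementation itself):

```python
def deduce_as_layout_shape(shape, src_layout, target_layout):
    """Derive the result shape of as_layout(src, layout=target_layout)."""
    if target_layout == "NZ":
        raise ValueError("tensor.as_layout: NZ target rejected (tile-only / fractal)")
    if target_layout == src_layout:
        return list(shape)                  # identity: shape unchanged
    if len(shape) < 2:
        raise ValueError("cross-layout flip requires rank >= 2")
    out = list(shape)
    out[-2], out[-1] = out[-1], out[-2]     # RFC §4.2 canonical-pair swap
    return out
```

The strided-sub-view rejection is a separate check on the source TensorView's strides and is not modeled here.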
…trides (RFC hw-native-sys#1300 P6)

Replaces ResolveTransposeLayout with LowerTransposeLoadParamLayout — the first
pipeline pass that produces RFC hw-native-sys#1300 canonical-form IR — and wires
MaterializeTensorStrides into the default pipeline so the codegen-entry IR
satisfies the (shape, stride, layout) self-consistency contract.

Pass behaviour (per RFC hw-native-sys#1300 §3.3 + §4.2):

- For every InCore parameter loaded via tile.load(transpose=True), promote the
  TensorType from `[..., a, b] ND` to `[..., b, a] DN` (trailing-pair shape swap
  + DN layout tag with empty stride). MaterializeTensorStrides fills the
  packed canonical strides later in the pipeline.
- Every tile.load on a promoted parameter has its offsets / shapes /
  valid_shapes tuples swapped at the trailing pair and its `transpose` kwarg
  set to False (the slot is kept so print -> reparse round-trips faithfully,
  since the tile.load op registers `transpose` as a default-false attribute).
- DeduceTileLoadType now derives the Mat tile-view layout from the source's
  DN tag (XOR with the transpose kwarg) so the legacy `transpose=True` swap
  and the new `DN source + transpose=False` form produce the same TileType.
- Every non-InCore call site to a promoted callee wraps its promoted-slot arg
  with `tensor.as_layout(arg, DN)` (P4) to bridge orch-side ND tensors to the
  InCore DN parameter type.
- Mixed-use parameters (loaded with both transpose=True and transpose=False)
  are rejected with pypto::ValueError.
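
The per-load coordinate rewrite and the XOR layout rule can be sketched as toy Python helpers (illustrative names; the real pass rewrites IR Call nodes and the real deduction lives in `DeduceTileLoadType`):

```python
def lower_promoted_tile_load(offsets, shapes, valid_shapes, transpose):
    """Rewrite a tile.load on a DN-promoted param into canonical coords:
    swap the trailing pair of each coordinate tuple, clear transpose."""
    def swap2(t):
        t = list(t)
        t[-2], t[-1] = t[-1], t[-2]
        return t
    assert transpose, "only transpose=True loads are promoted"
    return swap2(offsets), swap2(shapes), swap2(valid_shapes), False

def deduce_tile_layout(source_is_dn, transpose):
    """Mat tile-view layout = source DN tag XOR transpose kwarg, so the
    legacy (ND + transpose=True) and canonical (DN + transpose=False)
    forms deduce the same TileType."""
    return "DN" if (source_is_dn ^ transpose) else "ND"
```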

Pipeline wiring:

- pass_manager.py default tile-PTO pipeline now inserts MaterializeTensorStrides
  between CanonicalizeIOOrder and InitMemRef. With P6 producing canonical-form
  DN parameters, the materialized strides match the IR shape directly — codegen
  takes the `has_explicit_stride` branch and bypasses the legacy `dn_swap` /
  `get_shape_source_idx` path.
- MaterializeTensorStrides now rebuilds Calls via direct ctor (preserving the
  intentional type set by FlattenTileNdTo2D's manual rank-2 override) instead
  of routing through OpRegistry::Create, which would have re-deduced a rank-3
  TileType from the rank-3 source-window args and silently undone the
  flattening.

Test changes:

- test_lower_transpose_load_param_layout_pass.py (new): 8 cases — B^T, A^T,
  AB^T, no-op, idempotent, mixed-use rejection, partial-load. Built with
  programmatic assertions (not `@pl.program` for After) since `tensor.as_layout`
  is internal-only and not exposed in `pl.*`.
- test_pass_manager.py: the default and DebugTileOptimization expected pass
  lists now include LowerTransposeLoadParamLayout and MaterializeTensorStrides.
- test_pto_codegen_3d_dn_tensor_view_uses_canonical_stride (renamed): now
  asserts the RFC hw-native-sys#1300 canonical (shape, stride, layout) triple — shape
  preserved as written (`[2, 48, 64]`), strides `[3072, 1, 48]` (DN-packed:
  stride[n-2]=1, stride[n-1]=shape[n-2]=48, stride[n-3]=shape[n-2]*shape[n-1]=3072),
  layout=dn. Old expectation was the legacy `dn_swap` form
  (`[2, 64, 48]` shape, `[3072, 1, 64]` stride) which the canonical pipeline
  intentionally replaces.

Cross-layer / docs:

- Renamed src/ir/transforms/resolve_transpose_layout_pass.cpp →
  lower_transpose_load_param_layout_pass.cpp (git mv preserves history).
- Renamed docs/{en,zh-cn}/dev/passes/17-resolve_transpose_layout.md →
  17-lower_transpose_load_param_layout.md and rewrote.
- pass_properties.h: kLowerTransposeLoadParamLayoutProperties.
- passes.h / passes.cpp binding / passes.pyi / pass-doc-ordering.md updated.
- All doc cross-references and the docs/{en,zh-cn}/user/01-language_guide.md
  user-facing pipeline list now reference the new pass name.
…es in wrapper-reorder

Two related fixes for paged-attention ST failures on a2a3 introduced by the
P6 orch-side ``tensor.as_layout`` injection.

**1. ``tensor.as_layout`` codegen now emits a plain alias.**

Previously the orchestration handler lowered ``tensor.as_layout(input, DN)``
to ``input.transpose(N-2, N-1)``. The runtime ``Tensor::transpose`` swaps
``shape`` / ``raw_shape`` / **offsets**, which is correct for
``tensor.transpose`` (a physical-index permutation) but wrong for
``tensor.as_layout`` — the IR-level semantics is "reinterpret the same
physical bytes under a different layout tag", so the runtime tensor's
``offsets`` must stay in source coordinates.

Concrete failure mode (paged_attention on a2a3): the orchestration passes
``kj`` with ``offsets = [block_offset, 0]``; the spurious ``.transpose(0, 1)``
swapped them to ``[0, block_offset]``, shifting the base address by a factor
of ``raw_shape[1]`` and silently corrupting every kernel that received the
bridged view. The downstream kernel already encodes the canonical
``(shape, stride, layout)`` triple via its IR-declared param type, so
re-emitting it at runtime was redundant — the alias preserves the orch-side
``offsets`` exactly while the kernel-side ``make_tensor_view`` applies the
canonical interpretation.

This addresses the 7 paged-attention numerical-mismatch failures on a2a3
(``test_paged_attention_ptoas``, ``test_paged_attention_unaligned_ptoas``,
the dynamic-paged variants, and ``test_dyn_orch_paged_attention``). The
a5sim variant already passed because a5's PTOAS layout is more forgiving.

**2. Wrapper-reorder chases ``tensor.as_layout`` aliases.**

For SPMD / Group / orchestration wrapper functions, codegen splices an outer
caller's args through the wrapper's parameter list to the inner-call's
parameter list. With P6 injecting ``bridged_kj = tensor.as_layout(kj, DN)``
before the inner call inside the wrapper body, the inner-call arg becomes
``bridged_kj`` — a wrapper-local Var, not a wrapper parameter — and
``BuildWrapperReorderedParams`` failed with
``"inner call arg N does not map to any wrapper parameter"``.

Fix: collect a per-wrapper alias map by walking the wrapper body for
``AssignStmt(v, tensor.as_layout(src, ...))`` pairs, then chase each inner-
call arg through the chain back to the wrapper parameter. The lowering on
the other side (#1) is the plain alias, so this "see-through" mapping is
semantically equivalent — the wrapper splice still routes the outer arg to
the same memory.

This addresses the 3 SPMD paged_attention compile failures
(``test_paged_attention_spmd_ptoas`` variants).
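The alias-chasing step can be sketched as (toy representation with strings for Vars; the real map is built by walking the wrapper body's AssignStmts):

```python
def chase_to_param(arg, alias_map, params):
    """Follow bridged_x = tensor.as_layout(src, ...) links until a wrapper
    parameter is reached; raise if the chain dead-ends."""
    seen = set()
    while arg not in params:
        if arg in seen or arg not in alias_map:
            raise ValueError(f"inner call arg {arg!r} does not map to any wrapper parameter")
        seen.add(arg)
        arg = alias_map[arg]
    return arg
```

With `alias_map = {"bridged_kj": "kj"}` and wrapper params `{"qi", "kj"}`, the inner-call arg `bridged_kj` resolves to the parameter `kj`, restoring the splice.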
…w-native-sys#1300 P7)

Cleans up the PTO codegen to consume the IR's canonical
``(shape, stride, layout)`` triple verbatim. No more dual code paths
between the legacy "DN + empty stride → implicit shape/offset swap" and the
canonical "DN + explicit stride → emit-as-is" forms — the latter is now the
only path. After P6 + ``MaterializeTensorStrides``, all DN-tagged tensors
arrive at codegen with canonical strides materialized, so the swap path is
dead code.

### What changes

- **``src/codegen/pto/pto_codegen.cpp::EmitMakeTensorViews``**
  - Removes ``get_shape_source_idx`` (the implicit trailing-pair swap helper)
    and the dual stride-derivation paths.
  - Single derivation: explicit stride when present, otherwise canonical
    DN strides (RFC §2.3: ``stride[-2]=1``, ``stride[-1]=shape[-2]``, outer
    strides walk row-major over the DN-block volume) or canonical ND
    strides (``stride[-1]=1``, ``stride[k]=stride[k+1]*shape[k+1]``).
  - Keeps the ``[M, 1]`` column-vector legacy carve-out (PTOAS infers DN
    for that shape regardless of IR layout tag).
  - Precomputes shape dim SSA names up-front so dynamic-shape casts
    (``EmitCastToIndex``) emit their setup ops *before* the
    ``pto.make_tensor_view`` line instead of interleaving inside it.

- **``src/backend/common/pto_ops_common.cpp``**
  - ``MakeTileLoadCodegenPTO`` and ``MakeTileStoreCodegenPTO`` drop their
    ``dn_swap`` branches that swapped the trailing pair of
    ``offsets`` / ``valid_shapes`` / ``shapes`` tuples. The IR-level
    lowering (P6 ``LowerTransposeLoadParamLayout``) now produces all
    coordinates in canonical form, so the codegen transcribes them
    verbatim.
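
The single stride derivation described above, as a standalone Python sketch (hypothetical helper name; static shapes only — the real codegen also handles dynamic dims via SSA names):

```python
def canonical_strides(shape, layout):
    """Canonical packed strides.
    ND: stride[-1]=1, stride[k]=stride[k+1]*shape[k+1].
    DN: stride[-2]=1, stride[-1]=shape[-2], outer strides walk row-major
    over the DN-block volume (RFC §2.3)."""
    n = len(shape)
    s = [0] * n
    if layout == "ND" or n < 2:
        acc = 1
        for k in range(n - 1, -1, -1):
            s[k] = acc
            acc *= shape[k]
        return s
    s[-2], s[-1] = 1, shape[-2]
    acc = shape[-2] * shape[-1]        # DN-block volume
    for k in range(n - 3, -1, -1):
        s[k] = acc
        acc *= shape[k]
    return s
```

This reproduces the triple asserted by the renamed 3-D test: shape `[2, 48, 64]` DN gives strides `[3072, 1, 48]`.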

### Why now

This unifies the codegen entry contract: every layout-aware transform is
finalized at the IR level (P3 canonical TensorView, P4 ``tensor.as_layout``,
P6 ``LowerTransposeLoadParamLayout``), and the codegen has no remaining
layout logic of its own. Concrete benefits:

- Removes the asymmetry that caused the non-square paged-attention
  regressions on a2a3 — both legacy ``dn_swap`` and canonical paths emit
  the same MLIR for square cases, but diverged for non-square shapes once
  ``MaterializeTensorStrides`` was activated in the default pipeline.
- ``.pto`` output is now deterministically reproducible from the IR's
  canonical triple — no codegen-layer interpretation steps in between.

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped (no regressions)

The 2D codegen optimization (``stride[-2] = shape[-1]`` directly, no
spurious ``arith.muli %c1, shape[-1]``) preserves the byte-for-byte
``.pto`` output expected by existing 2D MLIR golden-string tests.
…s) — RFC hw-native-sys#1300

User's analysis pinpointed the actual root cause of the non-square
paged-attention regressions on a2a3:

The PTOAS-generated kernel wrapper reads dynamic dim values directly from
the runtime ``Tensor`` struct's ``shapes[i]``, indexed under the
**IR-declared post-P6 shape order**. For ``key_cache`` promoted from
``[256, 128] ND`` to ``[128, 256] DN``, the kernel expects:

```
KV_HEAD_DIM_DYN    = key_cache_t->shapes[0]  // → 128
KEY_CACHE_ROWS_DYN = key_cache_t->shapes[1]  // → 256
```

But my prior plain-alias codegen left ``shapes`` in the pre-swap (ND)
order — the kernel read ``shapes[0]=256`` and computed DN strides off the
wrong axis. Square cases happened to survive because the swap is identity.

**Fix:** the orch codegen for ``tensor.as_layout`` now swaps the
trailing-pair ``shapes`` so the runtime tensor matches the IR-declared
post-swap order. ``raw_shapes`` and ``offsets`` stay in the source
(pre-swap) coord system because PTOAS uses ``raw_shape``-derived strides plus
``offsets`` to compute ``start_offset`` (byte offset of the view into the
physical buffer) — and that base address must continue pointing at the
original ND-coord slice (e.g. paged-attention's
``offsets = [block_offset, 0]`` on the row-major ``key_cache``). If
``is_raw_eq_shapes`` is true, ``raw_shapes`` is materialized from the
current ``shapes`` *before* the swap so the subsequent ``shapes`` mutation
does not pollute the raw-shape-derived stride arithmetic.
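The shape-only mutation can be sketched as follows (dict fields mirror the names used above; this is an illustration, not the actual codegen):

```python
# Hedged sketch of the described lowering: swap the trailing shape pair,
# leave raw_shapes / offsets in the source (pre-swap) coord system.
def lower_as_layout(t: dict) -> dict:
    if t["is_raw_eq_shapes"]:
        # Snapshot raw_shapes from shapes *before* mutating shapes, so the
        # raw-shape-derived stride arithmetic keeps the pre-swap order.
        t["raw_shapes"] = list(t["shapes"])
        t["is_raw_eq_shapes"] = False  # assumption: flag cleared on snapshot
    t["shapes"][-2], t["shapes"][-1] = t["shapes"][-1], t["shapes"][-2]
    # offsets untouched: PTOAS derives start_offset from raw_shapes + offsets,
    # and the base address must keep pointing at the original ND-coord slice.
    return t

key_cache = {"shapes": [256, 128], "raw_shapes": None,
             "is_raw_eq_shapes": True, "offsets": [3, 0]}
lower_as_layout(key_cache)
assert key_cache["shapes"] == [128, 256]      # kernel reads post-swap order
assert key_cache["raw_shapes"] == [256, 128]  # strides stay pre-swap
assert key_cache["offsets"] == [3, 0]         # base address unchanged
```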

The identity flip (target layout == source layout) still lowers to a
plain ``Tensor result = input;`` alias — no swap needed.

### Why not use ``Tensor::transpose(N-2, N-1)``

That runtime helper additionally swaps ``raw_shapes`` and ``offsets``,
which is correct for ``tensor.transpose`` (a physical-index permutation)
but wrong for ``tensor.as_layout`` (a metadata reinterpret of the same
bytes). In paged-attention, swapping ``offsets = [block_offset, 0]`` to
``[0, block_offset]`` shifted the base address by a factor of
``raw_shape[1]`` and silently corrupted reads.
Our new lowering targets the precise mutation needed: shape-only swap.
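A toy ``start_offset`` computation (assuming row-major strides derived from ``raw_shapes``, as described above) shows why swapping the offsets corrupts the base address:

```python
# Illustration only: start_offset = sum(offsets[i] * stride[i]), with strides
# derived row-major from raw_shapes (the stated PTOAS convention).
def start_offset(raw_shapes, offsets):
    stride, acc = [0] * len(raw_shapes), 1
    for i in range(len(raw_shapes) - 1, -1, -1):
        stride[i] = acc
        acc *= raw_shapes[i]
    return sum(o * s for o, s in zip(offsets, stride))

raw = [256, 128]                 # key_cache, row-major
block_offset = 7
assert start_offset(raw, [block_offset, 0]) == 7 * 128  # correct slice base
assert start_offset(raw, [0, block_offset]) == 7        # swapped: wrong base
```

The two results differ by exactly a factor of ``raw_shape[1]``, matching the observed corruption.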

### Validation

- ``cmake --build build --parallel`` — clean
- ``pytest tests/ut/`` — 4562 passed / 33 skipped
@lyfne123 lyfne123 force-pushed the issue-1300-tensor-as-layout branch from b6d7493 to 5840271 on May 12, 2026 00:11
lyfne123 added 2 commits May 12, 2026 08:48
- MaterializeTensorStrides: remap VarPtrs in manual_dep_edges / user_manual_dep_edges attrs when rebuilding Calls, so attr entries follow the fresh Vars minted for materialized TensorViews. Without this, SSAVerify catches "used outside its defining scope" and orchestration codegen later raises "manual_dep_edge has no producer task" on manual-scope pipelines (exposed by test_manual_scope_{seq_outer_parallel,parallel_outer_seq}_inner_two_stage_pipeline).
- pto_codegen.cpp: extend the ``[..., M, 1]`` column-vector carve-out from rank == 2 to rank >= 2 with a stride derivation that fills all rank slots (legacy PTOAS convention: ``stride[rank-2]=1``, ``stride[rank-1]=shape[rank-1]``, ``stride[rank-3]=shape[rank-2]``, outer dims walk over M). Restores main's behaviour for rank-3 ``[B, N, 1]`` (and matches the ColMajor BLayout that ``DeduceTileLoadType`` already produces for trailing-dim-1 ``tile.load``s), fixing the ``TLoad isSameLayout`` PTOAS compile failure surfaced by test_tensor_expand_clone[a2a3-2] (broadcast_dim=2, input ``[B, N, 1]``).
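The stride convention quoted in the second bullet can be sketched as a hypothetical helper (following the stated legacy PTOAS rules; not the real ``pto_codegen.cpp``):

```python
def column_vector_strides(shape):
    # Legacy PTOAS convention for trailing-dim-1 tensors [..., M, 1] as
    # described above: stride[rank-2]=1, stride[rank-1]=shape[rank-1],
    # outer dims walk over M (then over each enclosing batch dim).
    rank = len(shape)
    assert rank >= 2 and shape[-1] == 1
    stride = [0] * rank
    stride[rank - 1] = shape[rank - 1]   # == 1 for a column vector
    stride[rank - 2] = 1
    acc = shape[rank - 2]                # stride[rank-3] = shape[rank-2]
    for i in range(rank - 3, -1, -1):
        stride[i] = acc
        acc *= shape[i]
    return stride

assert column_vector_strides([4, 1]) == [1, 1]        # original rank-2 case
assert column_vector_strides([8, 5, 1]) == [5, 1, 1]  # rank-3 [B, N, 1]
```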
@Hzfengsy Hzfengsy merged commit 8b89309 into hw-native-sys:main May 12, 2026
9 checks passed
Hzfengsy pushed a commit that referenced this pull request May 12, 2026
…body (RFC #1300) (#1339)

## Summary

Moves the P6 ``tensor.as_layout`` bridge from the orch call site to the top of the InCore body, end-to-end equivalent but **-132 LOC** net and removes a cluster of incidental complexity in orchestration codegen. See #1300 [discussion comment](#1300 (comment)) for the design rationale and consensus question.

## What changes

For each InCore parameter ``p`` loaded via ``tile.load(p, ..., transpose=True)``:

**Before (current main, post-#1324):**
- InCore param signature is promoted to ``[..., b, a] DN``.
- Orch call site is wrapped: ``bridged = tensor.as_layout(arg, DN); incore(bridged, ...)``.
- Orchestration codegen has to chase aliases through the bridge via ``BuildWrapperAliasMap`` + ``ResolveAliasChain`` to recover the original wrapper param.

**After (this PR):**
- InCore param signature is **untouched** (stays ``[..., a, b] ND``, matching the runtime torch tensor).
- InCore body is **prepended** with ``p_dn = tensor.as_layout(p, layout=DN)``; body uses of ``p`` are substituted with ``p_dn``.
- The matching ``tile.load`` is rewritten to read from ``p_dn`` with the trailing pair of offsets/shapes/valid_shapes swapped and ``transpose=False``.
- Orch is left completely alone — the orchestrator's call args are wrapper params directly.
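Sketched side by side (hypothetical listing in the style of the earlier orch IR snippet; names invented):

```
# Before (post-#1324): bridge at the orch call site
bridged = pl.tensor.as_layout(arg, layout=DN)   # arg reinterpreted to DN
self.qk_kernel(qi, bridged, out)                # InCore param declared DN

# After (this PR): bridge prepended inside the InCore body
def qk_kernel(qi, p, out):                      # p stays [..., a, b] ND
    p_dn = pl.tensor.as_layout(p, layout=DN)
    t = pl.tile.load(p_dn, ...)                 # trailing offsets/shapes/
                                                # valid_shapes swapped,
                                                # transpose=False
self.qk_kernel(qi, arg, out)                    # orch untouched
```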

## Diff stats

| File | Change |
|------|--------|
| ``src/ir/transforms/lower_transpose_load_param_layout_pass.cpp`` | -174 / +14 — deletes ``CallSiteAsLayoutInjector`` + Phase 2 + ``PromoteToCanonicalDN``; new ``LowerInCoreFunction`` prepends to body |
| ``src/codegen/orchestration/orchestration_codegen.cpp`` | -60 / 0 — deletes ``BuildWrapperAliasMap`` / ``ResolveAliasChain`` / alias-chasing fallback |
| ``src/backend/common/pto_ops_common.cpp`` | 0 / +86 — registers ``tensor.as_layout`` PTO codegen (emits one ``pto.make_tensor_view`` sharing the input's base) |
| Tests + docs + bindings | -188 / +291 — 5 pass-test bodies rewritten to assert the new IR shape; pass 18 docs (en/zh-cn) rewritten; pass 26 example caption updated; passes.h Doxygen + nanobind docstring + pyi stub updated |

Total: **-485 / +391 = -94 LOC net** (the +291 in tests/docs is mostly added explanatory comments and structured assertions — the production code delta is **-297 / +165 = -132 LOC**).

## Why this is acceptable per RFC §4.2

RFC §4.2's "InCore cannot create tensors" invariant targets ops that **allocate a byte buffer** (``tensor.create``). ``tensor.as_layout`` is a pure metadata reinterpret — it allocates nothing, it just re-describes the input's existing physical buffer. The four-layer boundary (§5) becomes cleaner under this design:

- **Runtime / Orch**: row-major ND physical buffer (matches runtime).
- **Cross-function boundary**: always row-major ND (no layout reinterpret).
- **Inside an InCore body**: derive the DN view via ``tensor.as_layout``; this is a single-function internal detail.
- **`.pto`**: codegen consumes whatever canonical triple the InCore body sees.

## Validation

- ``cmake --build build --parallel`` — clean.
- ``pytest tests/ut/ -n auto --maxprocesses 8`` — **4602 passed / 41 skipped / 0 failed**.
- Golden-string ``.pto`` codegen tests pass — output is byte-identical to main.
- End-to-end matmul B^T, paged-attention (single + multi-config), orchestration codegen tests all pass.

## Test plan

- [x] All existing unit tests pass.
- [x] Pass-specific tests rewritten to validate new IR shape (body-prepended ``tensor.as_layout`` binding + ``tile.load`` reading from binding LHS + orch left alone).
- [x] ``cmake --build`` clean.
- [ ] CI: clang-tidy, pre-commit, unit-tests (macos + ubuntu), fuzz-tests-sim, system-tests, system-tests-a5sim, pypto-lib-model.

## Discussion

Open for RFC author / reviewers to weigh in. The design tradeoff is "signature is the contract" (current main) vs. "cross-function boundary is the runtime-faithful boundary, DN view is a per-kernel detail" (this PR). The latter eliminates downstream codegen complexity at the cost of slightly less honest InCore signatures.