
refactor(ir): move P6 tensor.as_layout from orch call-site to InCore body (RFC #1300)#1339

Merged
Hzfengsy merged 2 commits into
hw-native-sys:mainfrom
lyfne123:rfc1300/p6-as-layout-incore
May 12, 2026

Conversation

@lyfne123
Collaborator

Summary

Moves the P6 tensor.as_layout bridge from the orch call site to the top of the InCore body. The result is end-to-end equivalent, but the change is -132 LOC net in production code and removes a cluster of incidental complexity in orchestration codegen. See the #1300 discussion comment for the design rationale and the consensus question.

What changes

For each InCore parameter p loaded via tile.load(p, ..., transpose=True):

Before (current main, post-#1324):

  • InCore param signature is promoted to [..., b, a] DN.
  • Orch call site is wrapped: bridged = tensor.as_layout(arg, DN); incore(bridged, ...).
  • Orchestration codegen has to chase aliases through the bridge via BuildWrapperAliasMap + ResolveAliasChain to recover the original wrapper param.
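For intuition, the alias chasing that the old design forced on orchestration codegen can be pictured as a toy resolver. This is a minimal sketch: the function name echoes the real ResolveAliasChain helper, but the string/dict encoding is invented for illustration and is not the actual IR API.

```python
def resolve_alias_chain(var, alias_map):
    """Follow bridge-introduced aliases until an original wrapper param is reached."""
    seen = set()
    while var in alias_map:
        if var in seen:
            raise ValueError(f"alias cycle at {var}")
        seen.add(var)
        var = alias_map[var]
    return var

# bridged = tensor.as_layout(arg, DN) at the call site produced one alias hop
# per bridged param, which codegen then had to see through.
alias_map = {"bridged_b": "b"}
assert resolve_alias_chain("bridged_b", alias_map) == "b"
assert resolve_alias_chain("a", alias_map) == "a"  # non-bridged args map to themselves
```

Deleting the bridge makes the alias map empty by construction, which is why the helpers can be removed outright.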

After (this PR):

  • InCore param signature is untouched (stays [..., a, b] ND, matching the runtime torch tensor).
  • InCore body is prepended with p_dn = tensor.as_layout(p, layout=DN); body uses of p are substituted with p_dn.
  • The matching tile.load is rewritten to read from p_dn with the trailing pair of offsets/shapes/valid_shapes swapped and transpose=False.
  • Orch is left completely alone — the orchestrator's call args are wrapper params directly.
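The steps above can be sketched on an invented dict-based mini-IR. The real pass operates on the C++ IR; `lower_incore_body` and the statement encoding here are hypothetical, illustrating only the prepend-substitute-swap shape of the rewrite.

```python
def lower_incore_body(body, param, dn_var):
    """Prepend a DN view binding for `param`, substitute body uses of it,
    and rewrite the matching transpose load. Toy encoding, not real IR."""
    new_body = [{"op": "tensor.as_layout", "lhs": dn_var, "args": [param], "layout": "DN"}]
    for stmt in body:
        stmt = dict(stmt)
        # Substitute uses of the original param with the DN view var.
        stmt["args"] = [dn_var if a == param else a for a in stmt.get("args", [])]
        if stmt.get("op") == "tile.load" and stmt.get("transpose") and stmt["args"][0] == dn_var:
            # Swap the trailing pair of offsets/shapes/valid_shapes, drop the kwarg.
            for key in ("offsets", "shapes", "valid_shapes"):
                vals = stmt[key]
                stmt[key] = vals[:-2] + [vals[-1], vals[-2]]
            stmt["transpose"] = False
        new_body.append(stmt)
    return new_body

body = [{"op": "tile.load", "args": ["b"], "offsets": [0, "i", "j"],
         "shapes": [1, "K", "N"], "valid_shapes": [1, "K", "N"], "transpose": True}]
out = lower_incore_body(body, "b", "b_dn")
assert out[0]["op"] == "tensor.as_layout" and out[0]["args"] == ["b"]
assert out[1]["args"] == ["b_dn"] and out[1]["transpose"] is False
assert out[1]["offsets"] == [0, "j", "i"] and out[1]["shapes"] == [1, "N", "K"]
```

Note that the param signature never appears in the rewrite: only the body changes, which is the whole point of this design.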

Diff stats

  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp — -174 / +14: deletes CallSiteAsLayoutInjector + Phase 2 + PromoteToCanonicalDN; new LowerInCoreFunction prepends to the body.
  • src/codegen/orchestration/orchestration_codegen.cpp — -60 / +0: deletes BuildWrapperAliasMap / ResolveAliasChain / alias-chasing fallback.
  • src/backend/common/pto_ops_common.cpp — -0 / +86: registers tensor.as_layout PTO codegen (emits one pto.make_tensor_view sharing the input's base).
  • Tests + docs + bindings — -188 / +291: 5 pass-test bodies rewritten to assert the new IR shape; pass 18 docs (en/zh-cn) rewritten; pass 26 example caption updated; passes.h Doxygen + nanobind docstring + pyi stub updated.

Total: -485 / +391 = -94 LOC net (the +291 in tests/docs is mostly added explanatory comments and structured assertions — the production code delta is -297 / +165 = -132 LOC).

Why this is acceptable per RFC §4.2

RFC §4.2's "InCore cannot create tensors" invariant targets ops that allocate a byte buffer (tensor.create). tensor.as_layout is a pure metadata reinterpret — it allocates nothing, it just re-describes the input's existing physical buffer. The four-layer boundary (§5) becomes cleaner under this design:

  • Runtime / Orch: row-major ND physical buffer (matches runtime).
  • Cross-function boundary: always row-major ND (no layout reinterpret).
  • Inside an InCore body: derive the DN view via tensor.as_layout; this is a single-function internal detail.
  • .pto: codegen consumes whatever canonical triple the InCore body sees.
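The "pure metadata reinterpret" claim has a familiar NumPy analogue: a transposed view re-describes the same bytes with swapped shape/strides and allocates nothing. This is only an analogy for what tensor.as_layout does, not the project's actual implementation.

```python
import numpy as np

# A row-major "ND" buffer that owns its data.
nd = np.arange(12, dtype=np.float32).reshape(3, 4).copy()
# A stride-swapped "DN" view: same bytes, new metadata, no allocation.
dn = nd.T

assert dn.base is nd                   # shares the underlying buffer
assert dn.strides == nd.strides[::-1]  # only shape/stride metadata changed
assert dn[2, 1] == nd[1, 2]            # same element through either view
```

Because no byte buffer is created, inserting such a view inside the InCore body does not touch the "InCore cannot create tensors" invariant of RFC §4.2.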

Validation

  • cmake --build build --parallel — clean.
  • pytest tests/ut/ -n auto --maxprocesses 8: 4602 passed / 41 skipped / 0 failed.
  • Golden-string .pto codegen tests pass — output is byte-identical to main.
  • End-to-end matmul B^T, paged-attention (single + multi-config), orchestration codegen tests all pass.

Test plan

  • All existing unit tests pass.
  • Pass-specific tests rewritten to validate new IR shape (body-prepended tensor.as_layout binding + tile.load reading from binding LHS + orch left alone).
  • cmake --build build --parallel — clean.
  • CI: clang-tidy, pre-commit, unit-tests (macos + ubuntu), fuzz-tests-sim, system-tests, system-tests-a5sim, pypto-lib-model.

Discussion

Open for RFC author / reviewers to weigh in. The design tradeoff is "signature is the contract" (current main) vs. "cross-function boundary is the runtime-faithful boundary, DN view is a per-kernel detail" (this PR). The latter eliminates downstream codegen complexity at the cost of slightly less honest InCore signatures.

lyfne123 added 2 commits May 12, 2026 10:57
Alternative design for RFC hw-native-sys#1300 P6: keep InCore param signatures unchanged
and put the layout reinterpret inside the InCore body instead of at the orch
caller. The end-to-end semantics are the same, but the structural cost of
the bridge shifts entirely into PTO codegen (one new op handler ~80 LOC) and
removes the orch-side complexity (BuildWrapperAliasMap + ResolveAliasChain
+ CallSiteAsLayoutInjector ~110 LOC + half the pass).

Files:
- lower_transpose_load_param_layout_pass.cpp: PromoteInCoreFunction is
  replaced by LowerInCoreFunction. New pass: scan InCore body for
  transpose=True loads of params; for each such param ``b``, prepend
  ``b_dn = tensor.as_layout(b, layout=DN)`` to the body and substitute
  body uses of ``b`` with ``b_dn`` before swapping the trailing pair of
  offsets/shapes/valid_shapes on the matching tile.loads. Param signatures
  are untouched. ``CallSiteAsLayoutInjector`` and the entire Phase 2
  (walk non-InCore functions and inject bridges at every call site) are
  removed.
- pto_ops_common.cpp: register ``tensor.as_layout`` as a PTO backend op.
  The codegen lowers it to a fresh ``pto.make_tensor_view`` bound to the
  input's underlying buffer SSA (the function parameter), using the LHS's
  (shape, stride, layout) triple. Downstream tile.load resolves through
  ``RegisterTensorView``.
- orchestration_codegen.cpp: ``BuildWrapperAliasMap`` and
  ``ResolveAliasChain`` are deleted; ``BuildWrapperReorderedParams`` drops
  its alias-chasing fallback. Inner-call args now map directly to wrapper
  params (no orch-side bridge to see through).

Net diff: -297 / +165 lines across the three files.

Unit tests (excluding the pass-specific assertions which encode the old
design's expected param-promotion behaviour): 4583 pass / 41 skipped /
0 failures. The 5 pass-specific test failures all assert ``b.shape == [K, N]``
which is no longer the post-pass param shape — they need rewriting to assert
the new IR structure (body prepended with ``b_dn = as_layout(b, DN)`` of
type ``[K, N] DN``).

End-to-end codegen tests (paged_attn, paged_attn_multi_config, the full
tile-pto pipeline) all pass — the .pto output is byte-identical to the
original design.
Companion to the prior pass/codegen refactor commit. Updates:

- tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py:
  All 5 affected tests rewritten to validate the new IR shape — the InCore
  param signature is preserved, a body-prepended
  ``b_dn = tensor.as_layout(b, layout=DN)`` AssignStmt is the new contract,
  and the matching tile.load reads from ``b_dn`` (the body-local Var) with
  the trailing pair of offsets/shapes/valid_shapes swapped. Orch call sites
  are asserted to pass wrapper params directly with no tensor.as_layout
  bridge. New helpers: ``_find_as_layout_binding`` /
  ``_has_as_layout_for``; obsolete ``_find_assign_rhs`` removed. All 8 pass
  tests now pass.

- docs/en/dev/passes/18-lower_transpose_load_param_layout.md and the zh-cn
  mirror: rewritten to describe the body-prepend design — param signatures
  unchanged, ``b_dn = tensor.as_layout(b, DN)`` prepended at the top of the
  InCore body, orch left untouched. The example diff in the doc now matches
  what the pass actually emits.

- docs/{en,zh-cn}/dev/passes/26-materialize_tensor_strides.md: drop the
  "produced by a future LowerTransposeLoadParamLayout rewrite" hint in the
  example caption — under the new design, P18's body-local
  ``tensor.as_layout`` LHS already carries CanonicalizeView-filled strides,
  so the empty-stride DN-view example is most naturally the user-written
  ``pl.Tensor[..., pl.DN]`` case.

- include/pypto/ir/transforms/passes.h,
  python/bindings/modules/passes.cpp, python/pypto/pypto_core/passes.pyi:
  C++ Doxygen, nanobind docstring, and Python stub doc updated to match the
  new pass behaviour.

Validation: cmake --build build --parallel (clean); pytest tests/ut/ -n auto
→ 4591 passed / 41 skipped / 0 failed.
@coderabbitai

coderabbitai Bot commented May 12, 2026


📝 Walkthrough

The LowerTransposeLoadParamLayout pass is refactored to insert body-local tensor.as_layout(..., layout=DN) views for InCore function parameters instead of promoting their types and wrapping non-InCore call sites. Parameter signatures remain unchanged; the pass rewrites tile.load(..., transpose=True) by swapping trailing offset/shape pairs and removing the kwarg.

Changes

Transpose Layout Encoding Refactor

  • Pass specification and documentation — docs/en/dev/passes/18-lower_transpose_load_param_layout.md, docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md, docs/*/dev/passes/26-materialize_tensor_strides.md, include/pypto/ir/transforms/passes.h, python/bindings/modules/passes.cpp, python/pypto/pypto_core/passes.pyi: overview, algorithm, example, and scope sections rewritten for all language versions and bindings to describe body-local DN view insertion, unchanged parameter signatures, non-InCore function preservation, and tile.load coordinate swapping. MaterializeTensorStrides doc updated to clarify the DN view source as user-written.
  • IR transform pass implementation — src/ir/transforms/lower_transpose_load_param_layout_pass.cpp: core rewrite removes type-promotion and call-site-injection logic. The new LowerInCoreFunction flow scans for transpose-loaded parameters, prepends tensor.as_layout bindings, substitutes body uses, and rewrites tile.load calls via a rewritten TileLoadBodyRewriter keyed on DN view variables.
  • Backend codegen for tensor.as_layout — src/backend/common/pto_ops_common.cpp: new op registration validates tensor view metadata, registers buffers for downstream tile.load lookups, materializes shape/stride operands as index SSA values, and emits pto.make_tensor_view with layout attributes derived from TensorLayout.
  • Orchestration codegen simplification — src/codegen/orchestration/orchestration_codegen.cpp: removes the BuildWrapperAliasMap and ResolveAliasChain helpers and the alias-chasing fallback path from BuildWrapperReorderedParams, simplifying task parameter routing to direct mapping only, since non-InCore functions are no longer wrapped.
  • Test suite updates — tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py: removes orchestration-side SSA binding helpers, adds InCore-body tensor.as_layout binding detection helpers, and rewrites all transpose promotion test assertions to validate body-local view insertion, unchanged parameter signatures, DN-view-based tile.load rewriting, and unmodified orchestration call sites.

🎯 Review effort: 3 (Moderate) | ⏱️ Estimated review time: ~25 minutes

Possibly related PRs

  • hw-native-sys/pypto#745: Both PRs modify DN-layout handling pipeline codegen for tensor.as_layout and tile.load offset handling in pto_ops_common.cpp and InCore parameter/transpose-layout transform logic.
  • hw-native-sys/pypto#753: Both PRs modify tile.load transpose handling by swapping shape/offset elements and adjusting tile.load and type-deduction code paths.
  • hw-native-sys/pypto#1324: Both PRs add tensor.as_layout and modify LowerTransposeLoadParamLayout pass, but this PR shifts from call-site injection/promotion to body-local view insertion.

Suggested reviewers

  • Hzfengsy

Poem

🐰 A layout view, now local to the body bound,
No wrapping at the call site to be found,
The pass inserts a DN frame so neat,
Swaps the shapes to make transpose complete,
Non-InCore functions? Left untouched, serene!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 61.90%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check — ✅ Passed: the title clearly summarizes the main change, moving tensor.as_layout from the orchestrator call site to the InCore body as part of the RFC #1300 implementation.
  • Description check — ✅ Passed: the description is directly related to the changeset, providing detailed rationale, before/after comparisons, diff statistics, and validation results that explain the refactoring changes across multiple files.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py (1)

95-116: ⚡ Quick win

Enforce “prepended at top of body” in helper semantics.

Line 96 says the helper validates a body-prepended binding, but the implementation only checks existence/uniqueness anywhere in the body. This weakens coverage for the pass contract. Please also assert placement at the beginning of func.body.

Proposed tightening
 def _find_as_layout_binding(func, input_var):
@@
-    for stmt in _iter_stmts(func.body):
+    for stmt in _iter_stmts(func.body):
         if not isinstance(stmt, ir.AssignStmt):
             continue
@@
     assert len(matches) == 1, (
         f"expected exactly one tensor.as_layout binding for {input_var.name_hint}, found {len(matches)}"
     )
+    # Keep the contract strict: binding must be prepended in the top-level body.
+    assert isinstance(func.body, ir.SeqStmts) and func.body.stmts, "function body must be non-empty SeqStmts"
+    first_stmt = func.body.stmts[0]
+    assert isinstance(first_stmt, ir.AssignStmt), "first statement must be AssignStmt for prepended as_layout"
+    first_rhs = first_stmt.value
+    assert (
+        isinstance(first_rhs, ir.Call)
+        and first_rhs.op is not None
+        and first_rhs.op.name == "tensor.as_layout"
+        and first_rhs.args
+        and isinstance(first_rhs.args[0], ir.Var)
+        and first_rhs.args[0] is input_var
+    ), "tensor.as_layout binding must be prepended at top of function body"
     return matches[0]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py` around
lines 95 - 116, The helper _find_as_layout_binding currently only ensures a
unique tensor.as_layout binding anywhere in func.body; change it to also assert
that the matching AssignStmt is prepended at the start of func.body by recording
the matching stmt (the ir.AssignStmt whose rhs is an ir.Call to
"tensor.as_layout" with args[0] == input_var) and after collecting matches
assert len(matches) == 1 and that the matched stmt is the very first statement
in func.body (i.e., func.body[0] is the matched AssignStmt); reference
_find_as_layout_binding, func.body, ir.AssignStmt, ir.Call, and the
"tensor.as_layout" call when making this change.


📥 Commits

Reviewing files that changed from the base of the PR and between 2f76f2f and 59b0257.

📒 Files selected for processing (11)
  • docs/en/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/en/dev/passes/26-materialize_tensor_strides.md
  • docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
  • docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
  • include/pypto/ir/transforms/passes.h
  • python/bindings/modules/passes.cpp
  • python/pypto/pypto_core/passes.pyi
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/orchestration/orchestration_codegen.cpp
  • src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
  • tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
💤 Files with no reviewable changes (1)
  • src/codegen/orchestration/orchestration_codegen.cpp

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the LowerTransposeLoadParamLayout pass to use a body-local tensor.as_layout view instead of promoting function parameter signatures. This change ensures that orchestration functions remain untouched and parameter signatures stay consistent with their original row-major ND layout. The pass now prepends an explicit layout reinterpret at the top of the InCore body, substitutes internal uses, and rewrites tile.load calls to use canonical coordinates. Corresponding documentation, Python bindings, and unit tests have been updated to reflect this architectural shift. A review comment suggests using a safer casting pattern for the codegen_base in the backend registration to improve robustness and maintain consistency with project patterns.

// ``tile.load`` lookups via ``GetOrCreateTensorView`` find the LHS through
// the ``RegisterTensorView`` call below.
reg("tensor.as_layout", [](const ir::CallPtr& op, codegen::CodegenBase& codegen_base) {
auto& codegen = dynamic_cast<codegen::PTOCodegen&>(codegen_base);

Severity: medium

The dynamic_cast to codegen::PTOCodegen& assumes that the provided codegen_base is always an instance of PTOCodegen. While this is likely true for all current PTO-based backends, it's safer to use As<codegen::PTOCodegen>(codegen_base) if available, or at least add a check before the cast to avoid a std::bad_cast exception if a non-PTO backend ever attempts to use this registration.

References
  1. When checking for a type that has a base class, a single check for the base class (using the As pattern) is preferred for consistency and simplicity.

@Hzfengsy Hzfengsy merged commit e1e7a58 into hw-native-sys:main May 12, 2026
9 checks passed