refactor(ir): move P6 tensor.as_layout from orch call-site to InCore body (RFC #1300) #1339
Conversation
Alternative design for RFC hw-native-sys#1300 P6: keep InCore param signatures unchanged and put the layout reinterpret inside the InCore body instead of at the orch caller. The end-to-end semantics are the same, but the structural cost of the bridge shifts entirely into PTO codegen (one new op handler, ~80 LOC) and removes the orch-side complexity (``BuildWrapperAliasMap`` + ``ResolveAliasChain`` + ``CallSiteAsLayoutInjector``, ~110 LOC plus half the pass).

Files:
- lower_transpose_load_param_layout_pass.cpp: ``PromoteInCoreFunction`` is replaced by ``LowerInCoreFunction``. The new pass scans the InCore body for ``transpose=True`` loads of params; for each such param ``b``, it prepends ``b_dn = tensor.as_layout(b, layout=DN)`` to the body and substitutes body uses of ``b`` with ``b_dn`` before swapping the trailing pair of offsets/shapes/valid_shapes on the matching ``tile.load``s. Param signatures are untouched. ``CallSiteAsLayoutInjector`` and the entire Phase 2 (walk non-InCore functions and inject bridges at every call site) are removed.
- pto_ops_common.cpp: register ``tensor.as_layout`` as a PTO backend op. The codegen lowers it to a fresh ``pto.make_tensor_view`` bound to the input's underlying buffer SSA (the function parameter), using the LHS's (shape, stride, layout) triple. Downstream ``tile.load`` resolves through ``RegisterTensorView``.
- orchestration_codegen.cpp: ``BuildWrapperAliasMap`` and ``ResolveAliasChain`` are deleted; ``BuildWrapperReorderedParams`` drops its alias-chasing fallback. Inner-call args now map directly to wrapper params (no orch-side bridge to see through).

Net diff: -297 / +165 lines across the three files. Unit tests (excluding the pass-specific assertions that encode the old design's expected param-promotion behaviour): 4583 passed / 41 skipped / 0 failures.
The 5 pass-specific test failures all assert ``b.shape == [K, N]`` which is no longer the post-pass param shape — they need rewriting to assert the new IR structure (body prepended with ``b_dn = as_layout(b, DN)`` of type ``[K, N] DN``). End-to-end codegen tests (paged_attn, paged_attn_multi_config, the full tile-pto pipeline) all pass — the .pto output is byte-identical to the original design.
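The "trailing pair swap" the pass performs on a redirected ``tile.load`` can be sketched standalone. This is purely illustrative (the function name and list-based coordinates below are assumptions for the sketch; the real pass rewrites IR nodes, not Python lists):

```python
def swap_trailing_pair(offsets, shapes, valid_shapes):
    """Swap the last two entries of each coordinate list, as a transposed
    tile.load does when redirected to the DN view of its source param.
    Illustrative only: the actual pass operates on IR call arguments."""
    def swapped(xs):
        if len(xs) < 2:
            raise ValueError("need at least two trailing dims to swap")
        return xs[:-2] + [xs[-1], xs[-2]]

    return swapped(offsets), swapped(shapes), swapped(valid_shapes)

# A row-major load at trailing offset (16, 32) becomes a (32, 16) load on the DN view.
print(swap_trailing_pair([0, 16, 32], [1, 64, 128], [1, 64, 100]))
# → ([0, 32, 16], [1, 128, 64], [1, 100, 64])
```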
Companion to the prior pass/codegen refactor commit. Updates:
- tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py:
All 5 affected tests rewritten to validate the new IR shape — the InCore
param signature is preserved, a body-prepended
``b_dn = tensor.as_layout(b, layout=DN)`` AssignStmt is the new contract,
and the matching tile.load reads from ``b_dn`` (the body-local Var) with
the trailing pair of offsets/shapes/valid_shapes swapped. Orch call sites
are asserted to pass wrapper params directly with no tensor.as_layout
bridge. New helpers: ``_find_as_layout_binding`` /
``_has_as_layout_for``; obsolete ``_find_assign_rhs`` removed. All 8 pass
tests now pass.
- docs/en/dev/passes/18-lower_transpose_load_param_layout.md and the zh-cn
mirror: rewritten to describe the body-prepend design — param signatures
unchanged, ``b_dn = tensor.as_layout(b, DN)`` prepended at the top of the
InCore body, orch left untouched. The example diff in the doc now matches
what the pass actually emits.
- docs/{en,zh-cn}/dev/passes/26-materialize_tensor_strides.md: drop the
"produced by a future LowerTransposeLoadParamLayout rewrite" hint in the
example caption — under the new design, P18's body-local
``tensor.as_layout`` LHS already carries CanonicalizeView-filled strides,
so the empty-stride DN-view example is most naturally the user-written
``pl.Tensor[..., pl.DN]`` case.
- include/pypto/ir/transforms/passes.h,
python/bindings/modules/passes.cpp, python/pypto/pypto_core/passes.pyi:
C++ Doxygen, nanobind docstring, and Python stub doc updated to match the
new pass behaviour.
Validation: cmake --build build --parallel (clean); pytest tests/ut/ -n auto
→ 4591 passed / 41 skipped / 0 failed.
🧹 Nitpick comments (1)
tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py (1)
Lines 95-116: ⚡ Quick win: Enforce "prepended at top of body" in helper semantics.

Line 96 says the helper validates a body-prepended binding, but the implementation only checks existence/uniqueness anywhere in the body. This weakens coverage for the pass contract. Please also assert placement at the beginning of `func.body`.

Proposed tightening:

```diff
 def _find_as_layout_binding(func, input_var):
@@
     for stmt in _iter_stmts(func.body):
         if not isinstance(stmt, ir.AssignStmt):
             continue
@@
     assert len(matches) == 1, (
         f"expected exactly one tensor.as_layout binding for {input_var.name_hint}, found {len(matches)}"
     )
+    # Keep the contract strict: binding must be prepended in the top-level body.
+    assert isinstance(func.body, ir.SeqStmts) and func.body.stmts, "function body must be non-empty SeqStmts"
+    first_stmt = func.body.stmts[0]
+    assert isinstance(first_stmt, ir.AssignStmt), "first statement must be AssignStmt for prepended as_layout"
+    first_rhs = first_stmt.value
+    assert (
+        isinstance(first_rhs, ir.Call)
+        and first_rhs.op is not None
+        and first_rhs.op.name == "tensor.as_layout"
+        and first_rhs.args
+        and isinstance(first_rhs.args[0], ir.Var)
+        and first_rhs.args[0] is input_var
+    ), "tensor.as_layout binding must be prepended at top of function body"
     return matches[0]
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py` around lines 95 - 116, The helper _find_as_layout_binding currently only ensures a unique tensor.as_layout binding anywhere in func.body; change it to also assert that the matching AssignStmt is prepended at the start of func.body by recording the matching stmt (the ir.AssignStmt whose rhs is an ir.Call to "tensor.as_layout" with args[0] == input_var) and after collecting matches assert len(matches) == 1 and that the matched stmt is the very first statement in func.body (i.e., func.body[0] is the matched AssignStmt); reference _find_as_layout_binding, func.body, ir.AssignStmt, ir.Call, and the "tensor.as_layout" call when making this change.
📒 Files selected for processing (11)
- docs/en/dev/passes/18-lower_transpose_load_param_layout.md
- docs/en/dev/passes/26-materialize_tensor_strides.md
- docs/zh-cn/dev/passes/18-lower_transpose_load_param_layout.md
- docs/zh-cn/dev/passes/26-materialize_tensor_strides.md
- include/pypto/ir/transforms/passes.h
- python/bindings/modules/passes.cpp
- python/pypto/pypto_core/passes.pyi
- src/backend/common/pto_ops_common.cpp
- src/codegen/orchestration/orchestration_codegen.cpp
- src/ir/transforms/lower_transpose_load_param_layout_pass.cpp
- tests/ut/ir/transforms/test_lower_transpose_load_param_layout_pass.py
💤 Files with no reviewable changes (1)
- src/codegen/orchestration/orchestration_codegen.cpp
Code Review
This pull request refactors the LowerTransposeLoadParamLayout pass to use a body-local tensor.as_layout view instead of promoting function parameter signatures. This change ensures that orchestration functions remain untouched and parameter signatures stay consistent with their original row-major ND layout. The pass now prepends an explicit layout reinterpret at the top of the InCore body, substitutes internal uses, and rewrites tile.load calls to use canonical coordinates. Corresponding documentation, Python bindings, and unit tests have been updated to reflect this architectural shift. A review comment suggests using a safer casting pattern for the codegen_base in the backend registration to improve robustness and maintain consistency with project patterns.
```cpp
// ``tile.load`` lookups via ``GetOrCreateTensorView`` find the LHS through
// the ``RegisterTensorView`` call below.
reg("tensor.as_layout", [](const ir::CallPtr& op, codegen::CodegenBase& codegen_base) {
  auto& codegen = dynamic_cast<codegen::PTOCodegen&>(codegen_base);
```
The dynamic_cast to codegen::PTOCodegen& assumes that the provided codegen_base is always an instance of PTOCodegen. While this is likely true for all current PTO-based backends, it's safer to use As<codegen::PTOCodegen>(codegen_base) if available, or at least add a check before the cast to avoid a std::bad_cast exception if a non-PTO backend ever attempts to use this registration.
References:
- When checking for a type that has a base class, a single check for the base class (using the As pattern) is preferred for consistency and simplicity.
Summary
Moves the P6 `tensor.as_layout` bridge from the orch call site to the top of the InCore body. End-to-end equivalent, but -132 LOC net, and it removes a cluster of incidental complexity in orchestration codegen. See the #1300 discussion comment for the design rationale and consensus question.

What changes

For each InCore parameter `p` loaded via `tile.load(p, ..., transpose=True)`:

Before (current main, post-#1324):
- The InCore param signature is promoted to `[..., b, a] DN`.
- Every orch call site injects `bridged = tensor.as_layout(arg, DN); incore(bridged, ...)`.
- Orchestration codegen needs `BuildWrapperAliasMap` + `ResolveAliasChain` to recover the original wrapper param.

After (this PR):
- The param signature stays `[..., a, b] ND` (matching the runtime torch tensor).
- The pass prepends `p_dn = tensor.as_layout(p, layout=DN)`; body uses of `p` are substituted with `p_dn`.
- The matching `tile.load` is rewritten to read from `p_dn` with the trailing pair of offsets/shapes/valid_shapes swapped and `transpose=False`.
Diff stats:
- `src/ir/transforms/lower_transpose_load_param_layout_pass.cpp`: removes `CallSiteAsLayoutInjector` + Phase 2 + `PromoteToCanonicalDN`; the new `LowerInCoreFunction` prepends to the body.
- `src/codegen/orchestration/orchestration_codegen.cpp`: removes `BuildWrapperAliasMap` / `ResolveAliasChain` / the alias-chasing fallback.
- `src/backend/common/pto_ops_common.cpp`: adds `tensor.as_layout` PTO codegen (emits one `pto.make_tensor_view` sharing the input's base).

Total: -485 / +391 = -94 LOC net (the +291 in tests/docs is mostly added explanatory comments and structured assertions; the production-code delta is -297 / +165 = -132 LOC).
Why this is acceptable per RFC §4.2
RFC §4.2's "InCore cannot create tensors" invariant targets ops that allocate a byte buffer (`tensor.create`). `tensor.as_layout` is a pure metadata reinterpret: it allocates nothing; it just re-describes the input's existing physical buffer. The four-layer boundary (§5) becomes cleaner under this design:
- Only the InCore body sees `tensor.as_layout`; this is a single-function internal detail.
- `.pto`: codegen consumes whatever canonical triple the InCore body sees.
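The "pure metadata reinterpret" point can be illustrated in plain Python: a (shape, strides) view over a shared flat buffer is transposed by swapping metadata alone, with no copy. This is a from-scratch analogy, not the project's code; the class and method names are invented for the sketch:

```python
class StridedView:
    """A (shape, strides) description over a shared flat buffer.

    Transposing swaps metadata only; the buffer object is untouched,
    mirroring what a DN tensor.as_layout view does to an ND param."""
    def __init__(self, buf, shape, strides):
        self.buf, self.shape, self.strides = buf, shape, strides

    def __getitem__(self, idx):
        flat = sum(i * s for i, s in zip(idx, self.strides))
        return self.buf[flat]

    def transposed(self):
        # Metadata-only reinterpret: same buffer, reversed shape/strides.
        return StridedView(self.buf, self.shape[::-1], self.strides[::-1])

buf = list(range(12))                  # one physical allocation
nd = StridedView(buf, (3, 4), (4, 1))  # row-major [3, 4] "ND" view
dn = nd.transposed()                   # "DN" view: nothing allocated

assert dn.buf is nd.buf                # same underlying buffer
assert dn[1, 2] == nd[2, 1] == 9       # same element through either view
print(dn.shape, dn.strides)            # → (4, 3) (1, 4)
```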
Validation
- `cmake --build build --parallel`: clean.
- `pytest tests/ut/ -n auto --maxprocesses 8`: 4602 passed / 41 skipped / 0 failed.
- `.pto` codegen tests pass; output is byte-identical to main.

Test plan
- Pass unit tests assert the new IR structure (`tensor.as_layout` binding + `tile.load` reading from the binding LHS + orch left alone).
- `cmake --build` clean.

Discussion
Open for RFC author / reviewers to weigh in. The design tradeoff is "signature is the contract" (current main) vs. "cross-function boundary is the runtime-faithful boundary, DN view is a per-kernel detail" (this PR). The latter eliminates downstream codegen complexity at the cost of slightly less honest InCore signatures.