Summary
PyPTO currently lacks first-class framework support for multi-output ops at the op-registration / type-inference / codegen layers. While individual multi-output ops (e.g. tile.gather_compare returning TupleType{dst, cdst}) have been wired up case-by-case, the surrounding infrastructure makes it easy to leak hardware DPS (destination-passing-style) details into the user-facing op signature, and downstream passes (init_memref, memory_reuse, codegen) need bespoke handling for each new multi-output op.
We need a clean, generalized convention + framework support so any future tile/tensor op that produces N outputs (e.g. reduce-with-count, scan-with-mask, sort-with-indices) can be registered with only its inputs and have outputs expressed via TupleType automatically.
Motivation / Use Case
PTOAS exposes a number of intrinsics that physically require multiple destination tiles (DPS form), e.g. TGATHER_IMPL(src, idx, tmp, dst, cdst). Today, when wrapping these as PyPTO ops, there is no clear framework guideline, which leads to two problems:
- Leaky abstractions. The path of least resistance is to register the op as tile.gather_compare(src, kvalue, tmp, dst, cdst), i.e. exposing the DPS dst/cdst as input arguments. This makes the op look like a 5-input op when the real semantics is 3 inputs + 2 outputs, and forces users to pre-allocate destination tiles, which is the responsibility of init_memref / memory_reuse, not the user.
- Fragile downstream pipeline. Each multi-output op currently re-implements:
  - Type deduction returning a TupleType from inputs + attrs
  - MakeTuple / TupleGetItemExpr plumbing in tensor-to-tile conversion
  - init_memref allocation per TupleType element
  - Codegen emit pattern that finds the bound buffer for each output
This is duplicated across ops, easy to get wrong, and blocks adoption of PTOAS DPS-style intrinsics.
The current effort to land tile.gather_compare on branch tuple_result has surfaced these gaps. Without a unified framework, every new multi-output op (and there will be more — sort-with-indices, scan-with-count, etc.) will pay the same tax and risk introducing the same leaky-API mistake.
Proposed API / Behavior
Convention (op-registration layer):
Multi-output ops MUST be registered with inputs only. Outputs are derived as a TupleType by f_deduce_type. DPS dst arguments must NEVER appear in the op's add_argument list.
// ❌ Wrong: DPS dst leaked as input
REGISTER_OP("tile.gather_compare")
    .add_argument("src", ...)
    .add_argument("kvalue", ...)
    .add_argument("tmp", ...)
    .add_argument("dst", "...")    // leak!
    .add_argument("cdst", "...");  // leak!
// ✅ Correct: inputs only; outputs via TupleType
REGISTER_OP("tile.gather_compare")
    .add_argument("src", ...)
    .add_argument("kvalue", ...)
    .add_argument("tmp", ...)
    .f_deduce_type([](const auto& args, const auto& kwargs, ...) {
      // Derive dst/cdst TileType from args + kwargs (out_cols, count_dtype),
      // wrap in TupleType.
      return std::make_shared<TupleType>(std::vector<TypePtr>{
          DeduceDstType(args, kwargs),
          DeduceCdstType(args, kwargs),
      });
    });
DSL surface:
# User-visible: 3 inputs, 2 outputs (Pythonic tuple unpack)
dst, cdst = pl.tile.gather_compare(src, kvalue, tmp, out_cols=K, count_dtype=pl.UINT32)
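The convention can be illustrated with a small standalone model. This is only a sketch, not PyPTO code: TileType, TupleType, OpDef, call, and the shape rule inside deduce_gather_compare are all hypothetical stand-ins, invented here to show an op that declares inputs only and derives both outputs through its type-deduction callback.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TileType:
    """Toy stand-in for a tile type: a 2-D shape plus an element dtype name."""
    shape: Tuple[int, int]
    dtype: str


@dataclass(frozen=True)
class TupleType:
    """Toy stand-in for the IR TupleType: an ordered group of element types."""
    elements: Tuple[TileType, ...]


@dataclass
class OpDef:
    """Hypothetical op record: declared inputs plus a type-deduction callback."""
    name: str
    inputs: List[str]
    deduce_type: Callable[[List[TileType], Dict[str, object]], TupleType]


def deduce_gather_compare(args, kwargs):
    # Assumed shape rule, for illustration only: both outputs keep the row
    # count of src; dst has out_cols columns, cdst has one count column.
    src = args[0]
    rows = src.shape[0]
    dst = TileType((rows, kwargs["out_cols"]), src.dtype)
    cdst = TileType((rows, 1), kwargs["count_dtype"])
    return TupleType((dst, cdst))


# Inputs only: dst/cdst never appear in the argument list.
GATHER_COMPARE = OpDef(
    name="tile.gather_compare",
    inputs=["src", "kvalue", "tmp"],
    deduce_type=deduce_gather_compare,
)


def call(op, args, **kwargs):
    """Check arity against the declared inputs, then return the deduced type."""
    if len(args) != len(op.inputs):
        raise TypeError(f"{op.name} expects {len(op.inputs)} inputs, got {len(args)}")
    return op.deduce_type(list(args), kwargs)


src = TileType((16, 64), "FP32")
kvalue = TileType((16, 1), "FP32")
tmp = TileType((16, 64), "FP32")
result = call(GATHER_COMPARE, [src, kvalue, tmp], out_cols=8, count_dtype="UINT32")
dst_type, cdst_type = result.elements  # mirrors `dst, cdst = ...` in the DSL
```

In this model, passing five positional arguments (the leaky DPS form) fails the arity check, because the destinations are not part of the declared signature.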
Framework work needed:
- init_memref pass: when a Call's return type is TupleType, allocate one MemRef per element automatically and bind each via TupleGetItemExpr rebinds.
- memory_reuse pass: extend reuse analysis to treat TupleType-element MemRefs as independent reuse candidates.
- Codegen base: provide a uniform helper to extract "the buffer bound to output i" of a multi-output op, replacing per-op pattern-matching of MakeTuple / TupleGetItemExpr.
- Op registry validation: emit a warning (or hard error) when an op with a TupleType return also declares DefField outputs in add_argument (catches the leak).
- Documentation: update docs/en/dev/ir/operators.md (and the Chinese mirror) with a "Multi-output ops" section codifying the rule.
- Tests: UT covering register + deduce + tensor-to-tile conversion + codegen for at least one multi-output op end-to-end (the existing tile.gather_compare is the natural candidate).
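Three of the items above (registry validation, per-element allocation, and uniform output lookup) can be sketched in a few lines. This is a toy model under stated assumptions, not the PyPTO implementation: MemRef, validate_op, allocate_outputs, and output_buffer are hypothetical names, the name-based leak check stands in for the proposed DefField check, and the byte sizes are made up. Alignment and reuse are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MemRef:
    """Toy buffer descriptor: byte offset and size in a flat scratch space."""
    offset: int
    size: int


def validate_op(name: str, inputs: List[str], num_outputs: int,
                dps_names: Tuple[str, ...] = ("dst", "cdst", "out")) -> List[str]:
    """Return warnings for a multi-output op that also lists DPS-style inputs.

    Matching on argument names is a placeholder heuristic; the proposal is to
    key on DefField outputs declared through add_argument.
    """
    if num_outputs < 2:
        return []
    return [f"{name}: argument '{arg}' looks like a leaked DPS output"
            for arg in inputs if arg in dps_names]


def allocate_outputs(element_sizes: List[int], start: int = 0) -> List[MemRef]:
    """Allocate one MemRef per tuple element, packed back to back."""
    memrefs, offset = [], start
    for size in element_sizes:
        memrefs.append(MemRef(offset, size))
        offset += size
    return memrefs


def output_buffer(bindings: Dict[str, List[MemRef]], call_id: str, index: int) -> MemRef:
    """Uniform lookup of 'the buffer bound to output i' for a given call."""
    return bindings[call_id][index]


# The leaky registration is flagged; the inputs-only one is clean.
leaky = validate_op("tile.gather_compare", ["src", "kvalue", "tmp", "dst", "cdst"], 2)
clean = validate_op("tile.gather_compare", ["src", "kvalue", "tmp"], 2)

# Two outputs: dst of 512 bytes and cdst of 64 bytes (illustrative sizes).
bindings = {"call0": allocate_outputs([512, 64])}
dst_buf = output_buffer(bindings, "call0", 0)
cdst_buf = output_buffer(bindings, "call0", 1)
```

Keeping the lookup behind one helper means codegen for each op asks for output i by index instead of pattern-matching the tuple plumbing itself.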
Alternatives Considered
- Keep DPS dst as inputs. Rejected: leaks hardware concerns into IR and DSL; surprises users; conflicts with init_memref ownership.
- Multiple separate single-output ops. Rejected: PTOAS intrinsic semantics tie the outputs together (e.g. dst and cdst in TGATHER_IMPL are produced by the same hardware instruction); splitting them would complicate codegen and break semantics.
- Out-parameters via Python keyword args (pl.tile.foo(..., out=(d, c))). Rejected: still requires user pre-allocation; doesn't compose with PyPTO's SSA / init_memref model.
Additional Context
- Current branch with the first concrete multi-output op: tuple_result (commit 7656e786).
- Related design discussion captured in user feedback memory: "Multi-output op signatures should distinguish inputs from outputs".
- Relevant files for the framework changes:
  - include/pypto/ir/op/op_def.h: op registration validation
  - src/ir/op/type_inference.cpp: TupleType deduction helpers
  - src/ir/transforms/op_conversion_registry.cpp: tensor→tile conversion plumbing for multi-output ops
  - src/ir/transforms/init_memref.cpp: TupleType-element allocation
  - src/ir/transforms/memory_reuse.cpp: reuse analysis for tuple elements
  - src/codegen/codegen_base.cpp, src/codegen/tensor_op_codegen.cpp: uniform output-buffer lookup
- First op exercising the framework: tile.gather_compare / tensor.gather_compare.