[Feature] Improve multi-output op framework support (TupleType outputs) #1326

@Little-oil


Summary

PyPTO currently lacks first-class framework support for multi-output ops at the op-registration / type-inference / codegen layers. While individual multi-output ops (e.g. tile.gather_compare returning TupleType{dst, cdst}) have been wired up case-by-case, the surrounding infrastructure makes it easy to leak hardware DPS (destination-passing-style) details into the user-facing op signature, and downstream passes (init_memref, memory_reuse, codegen) need bespoke handling for each new multi-output op.

We need a clean, generalized convention + framework support so any future tile/tensor op that produces N outputs (e.g. reduce-with-count, scan-with-mask, sort-with-indices) can be registered with only its inputs and have outputs expressed via TupleType automatically.

Motivation / Use Case

PTOAS exposes a number of intrinsics that physically require multiple destination tiles (DPS form), e.g. TGATHER_IMPL(src, idx, tmp, dst, cdst). Today, when wrapping these as PyPTO ops, there is no clear framework guideline, which leads to two problems:

  1. Leaky abstractions. The path of least resistance is to register the op as tile.gather_compare(src, kvalue, tmp, dst, cdst) — i.e. exposing the DPS dst/cdst as input arguments. This makes the op look like a 5-input op when the real semantics is 3 inputs + 2 outputs, and forces users to pre-allocate destination tiles, which is the responsibility of init_memref / memory_reuse, not the user.

  2. Fragile downstream pipeline. Each multi-output op currently re-implements:

    • Type deduction returning a TupleType from inputs + attrs
    • MakeTuple / TupleGetItemExpr plumbing in tensor-to-tile conversion
    • init_memref allocation per TupleType element
    • Codegen emit pattern that finds the bound buffer for each output

    This is duplicated across ops, easy to get wrong, and blocks adoption of PTOAS DPS-style intrinsics.

The current effort to land tile.gather_compare on branch tuple_result has surfaced these gaps. Without a unified framework, every new multi-output op (and there will be more — sort-with-indices, scan-with-count, etc.) will pay the same tax and risk introducing the same leaky-API mistake.

Proposed API / Behavior

Convention (op-registration layer):

Multi-output ops MUST be registered with inputs only. Outputs are derived as a TupleType by f_deduce_type. DPS dst arguments must NEVER appear in the op's add_argument list.

// ❌ Wrong: DPS dst leaked as input
REGISTER_OP("tile.gather_compare")
    .add_argument("src", ...)
    .add_argument("kvalue", ...)
    .add_argument("tmp", ...)
    .add_argument("dst", "...")    // leak!
    .add_argument("cdst", "...");  // leak!

// ✅ Correct: inputs only; outputs via TupleType
REGISTER_OP("tile.gather_compare")
    .add_argument("src", ...)
    .add_argument("kvalue", ...)
    .add_argument("tmp", ...)
    .f_deduce_type([](const auto& args, const auto& kwargs, ...) {
      // Derive dst/cdst TileType from args + kwargs (out_cols, count_dtype),
      // wrap in TupleType.
      return std::make_shared<TupleType>(std::vector<TypePtr>{
          DeduceDstType(args, kwargs),
          DeduceCdstType(args, kwargs),
      });
    });
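
The convention can also be modelled outside the C++ registry. The following is a minimal Python sketch, not PyPTO's actual API: `TileType`, `TupleType`, `OpDef`, `leaked_outputs`, and the output shapes chosen in `deduce_gather_compare` are all hypothetical stand-ins used only to illustrate "inputs only, outputs derived as a tuple type" plus a name-based leak check.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class TileType:  # hypothetical stand-in, not PyPTO's real class
    shape: Tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class TupleType:  # hypothetical stand-in
    elements: Tuple[TileType, ...]


@dataclass
class OpDef:
    name: str
    arguments: List[str] = field(default_factory=list)
    deduce_type: Optional[Callable] = None

    def add_argument(self, name: str) -> "OpDef":
        self.arguments.append(name)
        return self

    def f_deduce_type(self, fn: Callable) -> "OpDef":
        self.deduce_type = fn
        return self


# Assumption: this toy check recognises DPS destinations purely by name.
DPS_NAMES = ("dst", "cdst")


def leaked_outputs(op: OpDef) -> List[str]:
    """Return declared arguments that look like DPS destinations."""
    return [a for a in op.arguments if a in DPS_NAMES]


def deduce_gather_compare(args, kwargs) -> TupleType:
    src = args[0]
    rows = src.shape[0]
    # Illustrative shapes only: dst is (rows, out_cols), cdst is (rows, 1).
    dst = TileType((rows, kwargs["out_cols"]), src.dtype)
    cdst = TileType((rows, 1), kwargs["count_dtype"])
    return TupleType((dst, cdst))


# Inputs-only registration: outputs come from the deduction function.
good = (
    OpDef("tile.gather_compare")
    .add_argument("src").add_argument("kvalue").add_argument("tmp")
    .f_deduce_type(deduce_gather_compare)
)

# Leaky registration: DPS destinations declared as inputs.
bad = OpDef("tile.gather_compare")
for name in ("src", "kvalue", "tmp", "dst", "cdst"):
    bad.add_argument(name)

result = good.deduce_type(
    [TileType((16, 64), "FP32"), None, None],
    {"out_cols": 8, "count_dtype": "UINT32"},
)
```

Here `leaked_outputs(good)` is empty while `leaked_outputs(bad)` reports `dst` and `cdst`. A real registry check would presumably key off argument metadata rather than names.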

DSL surface:

# User-visible: 3 inputs, 2 outputs (Pythonic tuple unpack)
dst, cdst = pl.tile.gather_compare(src, kvalue, tmp, out_cols=K, count_dtype=pl.UINT32)
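
One way the tuple unpack could map onto the IR is for the call expression to be iterable and yield one tuple-element projection per output. The sketch below assumes this design; `Call`, `TupleGetItem`, and `num_outputs` are hypothetical names and not PyPTO's actual classes.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Call:
    op: str
    num_outputs: int  # assumed to equal the arity of the deduced TupleType

    def __iter__(self) -> Iterator["TupleGetItem"]:
        # Python's tuple unpacking calls __iter__, so `a, b = call`
        # produces one projection node per output element.
        for i in range(self.num_outputs):
            yield TupleGetItem(self, i)


@dataclass(frozen=True)
class TupleGetItem:
    call: Call
    index: int


call = Call("tile.gather_compare", num_outputs=2)
dst, cdst = call
```

After unpacking, `dst` and `cdst` both refer back to the same call and differ only in `index`, which is the information later passes need to find the right buffer.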

Framework work needed:

  1. init_memref pass: when a Call's return type is TupleType, automatically allocate one MemRef per element and bind each one through its TupleGetItemExpr rebinding.
  2. memory_reuse pass: extend reuse analysis to treat TupleType-element MemRefs as independent reuse candidates.
  3. Codegen base: provide a uniform helper to extract "the buffer bound to output i" of a multi-output op, replacing per-op pattern-matching of MakeTuple / TupleGetItemExpr.
  4. Op registry validation: emit a warning (or hard error) when an op with a TupleType return also declares DefField outputs in add_argument (this catches the leak).
  5. Documentation: update docs/en/dev/ir/operators.md (and its Chinese mirror) with a "Multi-output ops" section codifying the rule.
  6. Tests: unit tests covering register + deduce + tensor-to-tile conversion + codegen for at least one multi-output op end-to-end (the existing tile.gather_compare is the natural candidate).
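
Items 1 and 3 above can be sketched together: allocate one buffer per tuple element when the call is visited, record it under `(call, index)`, and let codegen look it up through a single helper. This is a toy model under stated assumptions; `MemRef`, `Call`, `OutputBuffers`, and the `output_nbytes` field are hypothetical and do not reflect the real pass or codegen interfaces.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MemRef:
    id: int
    nbytes: int


@dataclass(frozen=True)
class Call:
    name: str                       # SSA name of the call result
    output_nbytes: Tuple[int, ...]  # byte size of each TupleType element


class OutputBuffers:
    """Tracks which MemRef is bound to output i of each multi-output call."""

    def __init__(self) -> None:
        self._ids = count()
        self._table: Dict[Tuple[str, int], MemRef] = {}

    def init_memref(self, call: Call) -> List[MemRef]:
        # One fresh MemRef per tuple element, in element order.
        refs = []
        for i, nbytes in enumerate(call.output_nbytes):
            ref = MemRef(next(self._ids), nbytes)
            self._table[(call.name, i)] = ref
            refs.append(ref)
        return refs

    def output_buffer(self, call: Call, index: int) -> MemRef:
        # Uniform lookup used by codegen instead of per-op pattern matching.
        return self._table[(call.name, index)]


bufs = OutputBuffers()
call = Call("t0", output_nbytes=(16 * 8 * 4, 16 * 4))
refs = bufs.init_memref(call)
```

Because each element gets its own `MemRef`, a reuse analysis could treat them as separate candidates with separate lifetimes, which is what item 2 asks for.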

Alternatives Considered

  • Keep DPS dst as inputs. Rejected: leaks hardware concerns into IR and DSL; surprises users; conflicts with init_memref ownership.
  • Multiple separate single-output ops. Rejected: PTOAS intrinsic semantics tie the outputs together (e.g. dst and cdst in TGATHER_IMPL are produced by the same hardware instruction); splitting them would complicate codegen and break semantics.
  • Out-parameters via Python keyword args (pl.tile.foo(..., out=(d, c))). Rejected: still requires user pre-allocation; doesn't compose with PyPTO's SSA / init_memref model.

Additional Context

  • Current branch with the first concrete multi-output op: tuple_result (commit 7656e786).
  • Related design discussion captured in user feedback memory: "Multi-output op signatures should distinguish inputs from outputs".
  • Relevant files for the framework changes:
    • include/pypto/ir/op/op_def.h — op registration validation
    • src/ir/op/type_inference.cpp — TupleType deduction helpers
    • src/ir/transforms/op_conversion_registry.cpp — tensor→tile conversion plumbing for multi-output ops
    • src/ir/transforms/init_memref.cpp — TupleType-element allocation
    • src/ir/transforms/memory_reuse.cpp — reuse analysis for tuple elements
    • src/codegen/codegen_base.cpp, src/codegen/tensor_op_codegen.cpp — uniform output-buffer lookup
  • First op exercising the framework: tile.gather_compare / tensor.gather_compare.

Labels: enhancement (New feature or request)
