[Feature] Improve multi-output op framework support (TupleType outputs) #1326

### Summary

PyPTO currently lacks first-class framework support for **multi-output ops** at the op-registration / type-inference / codegen layers. While individual multi-output ops (e.g. `tile.gather_compare` returning `TupleType{dst, cdst}`) have been wired up case-by-case, the surrounding infrastructure makes it easy to leak hardware DPS (destination-passing-style) details into the user-facing op signature, and downstream passes (`init_memref`, `memory_reuse`, codegen) need bespoke handling for each new multi-output op.

We need a clean, generalized convention + framework support so any future tile/tensor op that produces N outputs (e.g. reduce-with-count, scan-with-mask, sort-with-indices) can be registered with **only its inputs** and have outputs expressed via `TupleType` automatically.

### Motivation / Use Case

PTOAS exposes a number of intrinsics that physically require multiple destination tiles (DPS form), e.g. `TGATHER_IMPL(src, idx, tmp, dst, cdst)`. Today, when wrapping these as PyPTO ops, there is no clear framework guideline, which leads to two problems:

1. **Leaky abstractions.** The path of least resistance is to register the op as `tile.gather_compare(src, kvalue, tmp, dst, cdst)`, i.e. to expose the DPS `dst`/`cdst` as **input arguments**. This makes the op look like a 5-input op when its real semantics are 3 inputs + 2 outputs, and it forces users to pre-allocate destination tiles, a job that belongs to `init_memref` / `memory_reuse`, not the user.

2. **Fragile downstream pipeline.** Each multi-output op currently re-implements:
   - Type deduction returning a `TupleType` from inputs + attrs
   - `MakeTuple` / `TupleGetItemExpr` plumbing in tensor-to-tile conversion
   - `init_memref` allocation per TupleType element
   - Codegen emit pattern that finds the bound buffer for each output

   This is duplicated across ops, easy to get wrong, and blocks adoption of PTOAS DPS-style intrinsics.

The current effort to land `tile.gather_compare` on branch `tuple_result` has surfaced these gaps. Without a unified framework, every new multi-output op (and there will be more — sort-with-indices, scan-with-count, etc.) will pay the same tax and risk introducing the same leaky-API mistake.

### Proposed API / Behavior

**Convention (op-registration layer):**

Multi-output ops MUST be registered with inputs only. Outputs are derived as a `TupleType` by `f_deduce_type`. DPS dst arguments must NEVER appear in the op's `add_argument` list.

```cpp
// ❌ Wrong — DPS dst leaked as input
REGISTER_OP("tile.gather_compare")
    .add_argument("src", ...)
    .add_argument("kvalue", ...)
    .add_argument("tmp", ...)
    .add_argument("dst", "...")    // leak!
    .add_argument("cdst", "...");  // leak!

// ✅ Correct — inputs only; outputs via TupleType
REGISTER_OP("tile.gather_compare")
    .add_argument("src", ...)
    .add_argument("kvalue", ...)
    .add_argument("tmp", ...)
    .f_deduce_type([](const auto& args, const auto& kwargs, ...) {
      // Derive dst/cdst TileType from args + kwargs (out_cols, count_dtype),
      // wrap in TupleType.
      return std::make_shared<TupleType>(std::vector<TypePtr>{
          DeduceDstType(args, kwargs),
          DeduceCdstType(args, kwargs),
      });
    });
```
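The convention above, together with the registry check proposed later in this issue, can be modelled in a few lines of Python. This is a toy sketch, not PyPTO's registry: `OpDef`, `TileType`, `TupleType`, `deduce_gather_compare`, and `find_leaked_outputs` are hypothetical stand-ins, the output shapes are made up for illustration, and detecting DPS destinations by argument name is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


# Hypothetical stand-ins for the IR types; not the real PyPTO classes.
@dataclass(frozen=True)
class TileType:
    shape: Tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class TupleType:
    elements: Tuple[TileType, ...]


# Assumption for this sketch: DPS destinations are recognised by name.
DPS_LIKE_NAMES = frozenset({"dst", "cdst"})


@dataclass
class OpDef:
    name: str
    arguments: List[str] = field(default_factory=list)
    deduce_type: Optional[Callable] = None

    def add_argument(self, arg_name: str) -> "OpDef":
        self.arguments.append(arg_name)
        return self

    def f_deduce_type(self, fn: Callable) -> "OpDef":
        self.deduce_type = fn
        return self


def deduce_gather_compare(args, kwargs):
    # Illustrative shapes only; the real deduction rules are not given here.
    src = args[0]
    rows = src.shape[0]
    dst = TileType((rows, kwargs["out_cols"]), src.dtype)
    cdst = TileType((rows, 1), kwargs["count_dtype"])
    return TupleType((dst, cdst))


def find_leaked_outputs(op, sample_args, sample_kwargs):
    """List arguments that look like DPS outputs on an op returning a TupleType."""
    ret = op.deduce_type(sample_args, sample_kwargs)
    if not isinstance(ret, TupleType):
        return []
    return [a for a in op.arguments if a in DPS_LIKE_NAMES]


src = TileType((16, 64), "FP32")
sample_args = [src, TileType((16, 1), "FP32"), TileType((16, 64), "FP32")]
sample_kwargs = {"out_cols": 8, "count_dtype": "UINT32"}

# Inputs only: outputs come from the deduced TupleType.
good = (OpDef("tile.gather_compare")
        .add_argument("src").add_argument("kvalue").add_argument("tmp")
        .f_deduce_type(deduce_gather_compare))

# Same op with the DPS destinations leaked into the argument list.
bad = (OpDef("tile.gather_compare")
       .add_argument("src").add_argument("kvalue").add_argument("tmp")
       .add_argument("dst").add_argument("cdst")
       .f_deduce_type(deduce_gather_compare))
```

Here `find_leaked_outputs(good, sample_args, sample_kwargs)` yields an empty list, while the same call on `bad` flags `dst` and `cdst`. A real check would presumably key off an explicit marker in the registry rather than argument names.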

**DSL surface:**

```python
# User-visible: 3 inputs, 2 outputs (Pythonic tuple unpack)
dst, cdst = pl.tile.gather_compare(src, kvalue, tmp, out_cols=K, count_dtype=pl.UINT32)
```

**Framework work needed:**

1. **`init_memref` pass** — when a `Call`'s return type is `TupleType`, automatically allocate one MemRef per element and bind each MemRef to the variable produced by the corresponding `TupleGetItemExpr`.
2. **`memory_reuse` pass** — extend reuse analysis to handle TupleType-element MemRefs as independent reuse candidates.
3. **Codegen base** — provide a uniform helper to extract "the buffer bound to output i" of a multi-output op, replacing per-op pattern-matching of `MakeTuple` / `TupleGetItemExpr`.
4. **Op registry validation** — emit a warning (or hard error) when an op with `TupleType` return also declares `DefField` outputs in `add_argument` (catches the leak).
5. **Documentation** — update `docs/en/dev/ir/operators.md` (and Chinese mirror) with a "Multi-output ops" section codifying the rule.
6. **Tests** — UT covering register + deduce + tensor-to-tile conversion + codegen for at least one multi-output op end-to-end (existing `tile.gather_compare` is the natural candidate).
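
Items 1 and 3 above can be sketched together as a toy model: one MemRef allocated per `TupleType` element, plus a single lookup helper that codegen could call instead of pattern-matching per op. Everything here is hypothetical (`MemRef`, `MemRefTable`, `init_call`, `output_buffer`, the dtype size table) and only illustrates the intended shape of the helpers, not the actual pass implementation.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


# Hypothetical stand-ins for the IR types; not the real PyPTO classes.
@dataclass(frozen=True)
class TileType:
    shape: Tuple[int, ...]
    dtype: str


@dataclass(frozen=True)
class TupleType:
    elements: Tuple[TileType, ...]


@dataclass(frozen=True)
class MemRef:
    id: int
    nbytes: int


# Assumed element sizes, for illustration only.
DTYPE_BYTES = {"FP16": 2, "FP32": 4, "UINT32": 4}


def tile_nbytes(tile: TileType) -> int:
    size = DTYPE_BYTES[tile.dtype]
    for dim in tile.shape:
        size *= dim
    return size


class MemRefTable:
    """Tracks the MemRefs allocated for each call's outputs."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self._outputs: Dict[str, List[MemRef]] = {}

    def init_call(self, call_name: str,
                  ret_type: Union[TileType, TupleType]) -> List[MemRef]:
        # A single-output call is treated as a one-element tuple, so both
        # cases share one code path.
        if isinstance(ret_type, TupleType):
            elems = ret_type.elements
        else:
            elems = (ret_type,)
        refs = [MemRef(next(self._ids), tile_nbytes(t)) for t in elems]
        self._outputs[call_name] = refs
        return refs

    def output_buffer(self, call_name: str, index: int) -> MemRef:
        # Uniform "buffer bound to output i" lookup for codegen.
        return self._outputs[call_name][index]


table = MemRefTable()
ret = TupleType((TileType((4, 8), "FP32"), TileType((4, 1), "UINT32")))
refs = table.init_call("gather_compare_0", ret)
cdst_buf = table.output_buffer("gather_compare_0", 1)
```

Because each element gets its own `MemRef`, a reuse analysis could treat `refs[0]` and `refs[1]` as independent candidates, which is the behaviour item 2 asks for.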

### Alternatives Considered

- **Keep DPS dst as inputs.** Rejected: leaks hardware concerns into IR and DSL; surprises users; conflicts with `init_memref` ownership.
- **Multiple separate single-output ops.** Rejected: PTOAS intrinsic semantics tie the outputs together (e.g. `dst` and `cdst` in `TGATHER_IMPL` are produced by the same hardware instruction); splitting them would complicate codegen and break semantics.
- **Out-parameters via Python keyword args (`pl.tile.foo(..., out=(d, c))`).** Rejected: still requires user pre-allocation; doesn't compose with PyPTO's SSA / `init_memref` model.

### Additional Context

- Current branch with the first concrete multi-output op: `tuple_result` (commit `7656e786`).
- Related design discussion captured in user feedback memory: "Multi-output op signatures should distinguish inputs from outputs".
- Relevant files for the framework changes:
  - `include/pypto/ir/op/op_def.h` — op registration validation
  - `src/ir/op/type_inference.cpp` — TupleType deduction helpers
  - `src/ir/transforms/op_conversion_registry.cpp` — tensor→tile conversion plumbing for multi-output ops
  - `src/ir/transforms/init_memref.cpp` — TupleType-element allocation
  - `src/ir/transforms/memory_reuse.cpp` — reuse analysis for tuple elements
  - `src/codegen/codegen_base.cpp`, `src/codegen/tensor_op_codegen.cpp` — uniform output-buffer lookup
- First op exercising the framework: `tile.gather_compare` / `tensor.gather_compare`.
