[Bug] pto.trowexpandmul then set_validshape + tfillpad: wrong result on a2a3, SIGSEGV on a2a3sim; supersedes #652

### Component

AICore vector codegen — lowering of `pto.alloc_tile` op-output tiles and the `pto.trowexpandmul` → `pto.set_validshape` → `pto.tfillpad` sequence.

### Summary

A `[16, 256]` FP32 row-major vec tile produced by `pto.trowexpandmul`, then narrowed with `pto.set_validshape` and zero-padded with `pto.tfillpad` in the same kernel, comes back **wrong on a2a3 silicon** and **segfaults a2a3sim**:

- **a2a3 (silicon)**: the surviving valid row (row 0 when `valid_row = 1`) has col 0 correct, cols ~1–15 garbage, cols ~16–255 ≈ 0; rows ≥ `valid_row` are correctly 0. Validation against the torch golden fails with 255 of 4096 elements mismatched, consistent with every element of row 0 except col 0 being wrong.
- **a2a3sim**: the runtime crashes — `SIGSEGV` / core dump during `[RUN] runtime`.

This supersedes #652. There I mis-attributed the failure to the `[N, 1]` col_major scale tile and to the `tfillpad`/`trowexpandmul` layout combination, and reported the original-kernel symptom (an indefinite **hang** on the a2a3 board, clean pass on a2a3sim). After bisecting (below), the trigger is narrower and unrelated to the `[N, 1]` tile. The exact symptom pair seems to depend on the surrounding code (hang vs. wrong result on board; pass vs. segfault in sim), but in every variant the a2a3 and a2a3sim outcomes differ and the a2a3 result is incorrect.

### Minimal reproduction

`pypto-lib/models/deepseek/v4/moe_expert_hchunk_fillpad_repro.py`:

```bash
python moe_expert_hchunk_fillpad_repro.py -p a2a3sim    # SIGSEGV (core dump) during runtime
python moe_expert_hchunk_fillpad_repro.py -p a2a3 -d 0  # FAIL — 'out' mismatch, 255/4096 elems
```

Kernel (single `pl.at(CORE_GROUP)`):

```python
vr          = pl.min(16, valid_rows)                  # = 1
h_chunk_raw = pl.row_expand_mul(gated, scale_col)     # [16, 256] FP32; scale_col is [16, 1] col_major
h_chunk     = pl.fillpad(pl.tensor.set_validshape(h_chunk_raw, vr, 256), pad_value=pl.PadValue.zero)
out[:, :]   = h_chunk
```
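For reference, the result the kernel is expected to produce can be sketched in plain Python. This is an illustrative stand-in, not the repro's actual torch golden: `golden`, `count_mismatches`, the input values and the `observed` pattern are all assumptions made up for the example. It also shows why a row 0 that is wrong everywhere except col 0 gives exactly the reported mismatch count.

```python
ROWS, COLS = 16, 256


def golden(gated, scale_col, valid_rows):
    """Reference for row_expand_mul -> set_validshape(vr, 256) -> fillpad(zero)."""
    vr = min(ROWS, valid_rows)
    out = [[0.0] * COLS for _ in range(ROWS)]   # padding rows stay zero
    for r in range(vr):                         # only the valid rows survive
        s = scale_col[r][0]                     # per-row scale, [16, 1] tile
        for c in range(COLS):
            out[r][c] = gated[r][c] * s
    return out


def count_mismatches(actual, expected, tol=1e-5):
    """Number of elements that differ by more than tol."""
    return sum(
        abs(a - e) > tol
        for row_a, row_e in zip(actual, expected)
        for a, e in zip(row_a, row_e)
    )


# Deterministic, strictly positive inputs so a zeroed element never matches by accident.
gated = [[1.0 + r + c / COLS for c in range(COLS)] for r in range(ROWS)]
scale_col = [[2.0 + r] for r in range(ROWS)]
expected = golden(gated, scale_col, valid_rows=1)

# Hypothetical stand-in for the board output: row 0 wrong everywhere except col 0.
observed = [row[:] for row in expected]
for c in range(1, COLS):
    observed[0][c] = 0.0

mismatches = count_mismatches(observed, expected)
print(f"{mismatches}/{ROWS * COLS}")
```

With `valid_rows = 1` only row 0 carries data, so 255 wrong elements out of 16 × 256 = 4096 means the whole valid row except its first element is corrupt.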

Generated `.pto` (attached `repro_A_with_memoryreuse.pto`):

```
%h_chunk_raw__tile = pto.alloc_tile addr = 0     valid_row = 16 valid_col = 256 ... pad=0
pto.trowexpandmul ins(%gated__tile, %scale_col__tile) outs(%h_chunk_raw__tile)
pto.set_validshape %h_chunk_raw__tile, 1, 256
%h_chunk__tile     = pto.alloc_tile addr = 16448 valid_row = 16 valid_col = 256 ... pad=1
pto.tfillpad ins(%h_chunk_raw__tile) outs(%h_chunk__tile)
pto.tstore   ins(%h_chunk__tile) outs(...)
```

### Suspicious codegen

ptoas emits the `trowexpandmul` output tile `%h_chunk_raw__tile` as a **bare default-constructed** `Tile<...> v21;` — unlike the load-target tiles and the `tfillpad` output tile, which are emitted as `Tile<...>(16, 256)` carrying the `valid_row`/`valid_col` from their `pto.alloc_tile` (attached `repro_A_with_memoryreuse.cpp`):

```cpp
Tile<..., 512, PadValue::Null, ...> v11 = Tile<...>(v7, v6);  // gated      (load target  -> ctor(16,256))
Tile<..., 512, PadValue::Null, ...> v16 = Tile<...>(v7, v5);  // scale_col  (load target  -> ctor(16,1))
...
Tile<..., 512, PadValue::Null, ...> v21;                      // h_chunk_raw <-- BARE DEFAULT CTOR, no (16,256)
TASSIGN(v21, v22);                                            // v22 = 0
TROWEXPANDMUL(v21, v11, v16);                                 // reads v21.GetValidRow()/GetValidCol() — uninitialized
v21.SetValidShape(v5, v6);                                    // (1, 256) — runs AFTER trowexpandmul
Tile<..., 512, PadValue::Zero, ...> v23 = Tile<...>(v7, v6);  // h_chunk    (fillpad output -> ctor(16,256))
TASSIGN(v23, v24);
pipe_barrier(PIPE_V);
TFILLPAD(v23, v21);
```

`pto::Tile`'s default constructor does not initialize `RowMaskInternal` / `ColMaskInternal`; it initializes only `data_`, and only in auto mode (see `pto-isa include/pto/common/pto_tile.hpp:1448`). `TROWEXPANDMUL_IMPL` reads `validCol = dst.GetValidCol(); validRow = dst.GetValidRow();` up front and uses those values to drive the whole computation: a `vbrcb` of the per-row scalar into the `TMP_UB_OFFSET` scratch, then a `vmul` loop over `validRow` × `validCol`. So `trowexpandmul` runs over an **uninitialized valid shape**, i.e. a garbage extent and possibly an out-of-range loop, which is consistent with both the on-board garbage and the sim SIGSEGV. The subsequent `set_validshape` only fixes the metadata, not the half-written data, and `tfillpad` then faithfully copies the corrupt row 0.

Likely fix (any of): ptoas should construct an op-output `alloc_tile` with its declared `valid_row`/`valid_col` (as it already does for load targets and the `tfillpad` output); or hoist the `set_validshape` ahead of the producing op; or `trowexpandmul` should not consume `dst`'s valid shape.
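The suspected mechanism and the first two candidate fixes can be modelled with a toy tile in plain Python. This is only a sketch under stated assumptions: the `Tile` class, `row_expand_mul`, `fillpad` and `run` below are hypothetical stand-ins, not pto-isa code, and the `garbage` extent is an arbitrary in-range value chosen so the script stays safe to run. On real hardware the uninitialized extent could be anything, including out of range, and unwritten tile memory would not be zero.

```python
class Tile:
    """Toy stand-in for a tile: a fixed [rows, cols] buffer plus a runtime valid shape."""

    def __init__(self, rows, cols, valid_row=None, valid_col=None):
        self.rows, self.cols = rows, cols
        self.data = [[0.0] * cols for _ in range(rows)]
        # None models the uninitialized valid shape of a bare default-constructed tile.
        self.valid_row, self.valid_col = valid_row, valid_col

    def set_validshape(self, valid_row, valid_col):
        self.valid_row, self.valid_col = valid_row, valid_col


def row_expand_mul(dst, src, scale, garbage=(3, 5)):
    """Multiply each row of src by its per-row scale, over dst's valid extent."""
    # The loop extent is read from dst, as the issue describes for TROWEXPANDMUL_IMPL.
    # `garbage` is an arbitrary stand-in for whatever an uninitialized field holds.
    vr = dst.valid_row if dst.valid_row is not None else garbage[0]
    vc = dst.valid_col if dst.valid_col is not None else garbage[1]
    for r in range(vr):
        for c in range(vc):
            dst.data[r][c] = src.data[r][c] * scale.data[r][0]


def fillpad(dst, src):
    """Copy src's valid region into dst and zero everything outside it."""
    for r in range(dst.rows):
        for c in range(dst.cols):
            valid = r < src.valid_row and c < src.valid_col
            dst.data[r][c] = src.data[r][c] if valid else 0.0


def run(make_dst, narrow_first=False):
    """Run the op sequence and return row 0 of the padded output."""
    gated = Tile(16, 256, 16, 256)
    scale = Tile(16, 1, 16, 1)
    for r in range(16):
        scale.data[r][0] = 2.0
        for c in range(256):
            gated.data[r][c] = 1.0
    raw = make_dst()
    if narrow_first:                 # candidate fix: hoist set_validshape
        raw.set_validshape(1, 256)
    row_expand_mul(raw, gated, scale)
    raw.set_validshape(1, 256)       # too late to repair data in the bare case
    out = Tile(16, 256, 16, 256)
    fillpad(out, raw)
    return out.data[0]


bare = run(lambda: Tile(16, 256))                       # models `Tile<...> v21;`
ctor = run(lambda: Tile(16, 256, 16, 256))              # models `Tile<...>(16, 256)`
hoisted = run(lambda: Tile(16, 256), narrow_first=True)
print(sum(bare), sum(ctor), sum(hoisted))
```

In this model the bare tile yields a row 0 that is only partially written, while constructing the output with its declared shape or narrowing it before the producing op both give the full row. It says nothing about which fix is preferable inside ptoas.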

### Bisection — what is *not* the cause

- **Not the in-place buffer reuse.** With pypto's `MemoryReuse` pass, the `trowexpandmul` output aliases `gated` (both at tile addr 0); with `MemoryReuse` + `LegalizePTOBufferReuse` disabled it gets a distinct address (16448) — **still fails identically on a2a3** (attached `repro_B_no_memoryreuse.pto` / `.cpp`).
- **Not `set_validshape` / `tfillpad` per se.** `pl.fillpad(pl.tensor.set_validshape(pl.slice(gated, [16,256], [0,0]), vr, 256))` — i.e. a *loaded* tile narrowed + padded — **passes** on a2a3 (`models/deepseek/v4/moe_expert_slice_repro.py`).
- **Not the multiply itself.** Replacing `pl.row_expand_mul` with a plain `pl.mul` (and a `[16,256]` `scale_col`) keeps the same in-place alias and the same `set_validshape` + `tfillpad` tail but emits `TMUL` instead of `TROWEXPANDMUL`, and it **passes** on a2a3.
- ⇒ The trigger is specifically **`pto.trowexpandmul` writing a freshly `alloc_tile`'d output, immediately followed by `set_validshape` + `tfillpad` in the same kernel**. Splitting `row_expand_mul` and the `set_validshape` + `fillpad` step into separate `pl.at` blocks (so the `row_expand_mul` result round-trips through GM before `set_validshape`/`fillpad`) works around it; that is what `pypto-lib/models/deepseek/v4/moe_expert.py` does now (passes on both a2a3 and a2a3sim).

### Expected behavior

a2a3 produces the correct result and a2a3sim does not crash — or ptoas errors/warns at compile time instead of silently emitting a kernel that returns wrong data on board / SIGSEGVs the simulator.
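Until something like that exists in ptoas, a text-level check over the generated `.cpp` can flag the pattern. The script below is a rough heuristic, not a ptoas feature: it assumes the generated code has the one-statement-per-line shape quoted in this issue, that `TASSIGN` only binds an address and never reads the valid shape, and that the first argument of every other `T*` call is the destination tile. `find_unshaped_outputs` and `ADDRESS_ONLY` are names made up for the sketch.

```python
import re

# `Tile<...> vN;` with no initializer: a bare default-constructed tile.
DECL_BARE = re.compile(r"^\s*Tile<[^;=]*>\s+(\w+)\s*;")
# `TSOMEOP(vN, ...`: an op call whose first argument is taken to be the destination.
OP_CALL = re.compile(r"^\s*(T[A-Z]+)\((\w+)\s*,")
# `vN.SetValidShape(...)`: from here on the tile has a defined valid shape.
SET_SHAPE = re.compile(r"^\s*(\w+)\.SetValidShape\(")

# Assumption: these calls only bind an address and do not read the valid shape.
ADDRESS_ONLY = {"TASSIGN"}


def find_unshaped_outputs(cpp_text):
    """Return (line, op, tile) for ops writing a tile whose valid shape was never set."""
    bare = set()
    findings = []
    for lineno, line in enumerate(cpp_text.splitlines(), start=1):
        m = DECL_BARE.match(line)
        if m:
            bare.add(m.group(1))
            continue
        m = SET_SHAPE.match(line)
        if m:
            bare.discard(m.group(1))
            continue
        m = OP_CALL.match(line)
        if m and m.group(1) not in ADDRESS_ONLY and m.group(2) in bare:
            findings.append((lineno, m.group(1), m.group(2)))
    return findings


# The excerpt from the "Suspicious codegen" section, comments stripped.
SAMPLE = """\
Tile<..., 512, PadValue::Null, ...> v11 = Tile<...>(v7, v6);
Tile<..., 512, PadValue::Null, ...> v16 = Tile<...>(v7, v5);
Tile<..., 512, PadValue::Null, ...> v21;
TASSIGN(v21, v22);
TROWEXPANDMUL(v21, v11, v16);
v21.SetValidShape(v5, v6);
Tile<..., 512, PadValue::Zero, ...> v23 = Tile<...>(v7, v6);
TASSIGN(v23, v24);
pipe_barrier(PIPE_V);
TFILLPAD(v23, v21);
"""

findings = find_unshaped_outputs(SAMPLE)
for lineno, op, tile in findings:
    print(f"line {lineno}: {op} writes {tile} before its valid shape is set")
```

On the excerpt this flags only the `TROWEXPANDMUL` call; the `TFILLPAD` output is not flagged because its tile is constructed with an explicit shape.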

### Environment

- pypto-lib: HEAD — `models/deepseek/v4/moe_expert.py`, `moe_expert_hchunk_fillpad_repro.py`, `moe_expert_slice_repro.py`
- pypto: HEAD
- ptoas: v0.36
- pto-isa: `main` @ d8ef4ddc
- Target: a2a3

### Attachments (added separately)

- `repro_A_with_memoryreuse.pto` / `.cpp` — default pipeline (`trowexpandmul` output aliases `gated` at tile addr 0)
- `repro_B_no_memoryreuse.pto` / `.cpp` — `MemoryReuse` + `LegalizePTOBufferReuse` disabled (distinct addresses)

Both fail identically on a2a3 (and SIGSEGV a2a3sim).

