
[Bug] pto.trowexpandmul then set_validshape + tfillpad: wrong result on a2a3, SIGSEGV on a2a3sim; supersedes #652 #660

@zhangqi-chen

Component

AICore vector codegen: lowering of pto.alloc_tile op-output tiles and the pto.trowexpandmul → pto.set_validshape → pto.tfillpad sequence.

Summary

A [16, 256] FP32 row-major vec tile produced by pto.trowexpandmul, then narrowed with pto.set_validshape and zero-padded with pto.tfillpad in the same kernel, comes back wrong on a2a3 silicon and segfaults a2a3sim:

  • a2a3 (silicon): the surviving valid row (row 0 when valid_row = 1) has col 0 correct, cols ~1–15 garbage, cols ~16–255 ≈ 0; rows ≥ valid_row are correctly 0. Validation against the torch golden fails (255/4096 elements).
  • a2a3sim: the runtime crashes — SIGSEGV / core dump during [RUN] runtime.

This supersedes #652. There I mis-attributed the failure to the [N, 1] col_major scale tile and to the tfillpad/trowexpandmul layout combination, and reported the original-kernel symptom (an indefinite hang on a2a3 board, clean pass on a2a3sim). After bisecting (below), the trigger is narrower and unrelated to the [N, 1] tile. The exact a2a3-vs-a2a3sim symptom seems to depend on the surrounding code (hang vs wrong result on board; pass vs segfault in sim), but in every form a2a3 ≠ a2a3sim and a2a3 is incorrect.

Minimal reproduction

pypto-lib/models/deepseek/v4/moe_expert_hchunk_fillpad_repro.py:

python moe_expert_hchunk_fillpad_repro.py -p a2a3sim    # SIGSEGV (core dump) during runtime
python moe_expert_hchunk_fillpad_repro.py -p a2a3 -d 0  # FAIL — 'out' mismatch, 255/4096 elems

Kernel (single pl.at(CORE_GROUP)):

vr          = pl.min(16, valid_rows)                  # = 1
h_chunk_raw = pl.row_expand_mul(gated, scale_col)     # [16, 256] FP32; scale_col is [16, 1] col_major
h_chunk     = pl.fillpad(pl.tensor.set_validshape(h_chunk_raw, vr, 256), pad_value=pl.PadValue.zero)
out[:, :]   = h_chunk
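For reference, the result this kernel is expected to produce can be sketched in plain Python. This is only an illustration of the intended semantics as described in this issue (per-row scaling, then zeroing everything outside the valid region); the helper names are hypothetical stand-ins, not the pypto API, and the input values are made up.

```python
def row_expand_mul(tile, scale_col):
    """Multiply every element of row r by the per-row scalar scale_col[r][0]."""
    return [[x * scale_col[r][0] for x in row] for r, row in enumerate(tile)]


def narrow_and_zero_pad(tile, valid_rows, valid_cols):
    """Keep the top-left valid_rows x valid_cols region; zero everything else.

    Stand-in for set_validshape followed by fillpad with a zero pad value.
    """
    return [
        [x if (r < valid_rows and c < valid_cols) else 0.0
         for c, x in enumerate(row)]
        for r, row in enumerate(tile)
    ]


def reference(gated, scale_col, valid_rows):
    """Expected kernel output for a [16, 256] tile and a [16, 1] scale column."""
    vr = min(16, valid_rows)
    cols = len(gated[0])
    return narrow_and_zero_pad(row_expand_mul(gated, scale_col), vr, cols)


ROWS, COLS = 16, 256
# Made-up inputs, chosen only so that each element is easy to check by hand.
gated = [[float(c + 1) for c in range(COLS)] for r in range(ROWS)]
scale_col = [[float(r + 2)] for r in range(ROWS)]

out = reference(gated, scale_col, valid_rows=1)
```

With valid_rows = 1, row 0 holds the scaled values across all 256 columns and rows 1 through 15 are zero, which is the shape of result the torch golden checks for.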

Generated .pto (attached repro_A_with_memoryreuse.pto):

%h_chunk_raw__tile = pto.alloc_tile addr = 0     valid_row = 16 valid_col = 256 ... pad=0
pto.trowexpandmul ins(%gated__tile, %scale_col__tile) outs(%h_chunk_raw__tile)
pto.set_validshape %h_chunk_raw__tile, 1, 256
%h_chunk__tile     = pto.alloc_tile addr = 16448 valid_row = 16 valid_col = 256 ... pad=1
pto.tfillpad ins(%h_chunk_raw__tile) outs(%h_chunk__tile)
pto.tstore   ins(%h_chunk__tile) outs(...)

Suspicious codegen

ptoas emits the trowexpandmul output tile %h_chunk_raw__tile as a bare default-constructed Tile<...> v21; — unlike the load-target tiles and the tfillpad output tile, which are emitted as Tile<...>(16, 256) carrying the valid_row/valid_col from their pto.alloc_tile (attached repro_A_with_memoryreuse.cpp):

Tile<..., 512, PadValue::Null, ...> v11 = Tile<...>(v7, v6);  // gated      (load target  -> ctor(16,256))
Tile<..., 512, PadValue::Null, ...> v16 = Tile<...>(v7, v5);  // scale_col  (load target  -> ctor(16,1))
...
Tile<..., 512, PadValue::Null, ...> v21;                      // h_chunk_raw <-- BARE DEFAULT CTOR, no (16,256)
TASSIGN(v21, v22);                                            // v22 = 0
TROWEXPANDMUL(v21, v11, v16);                                 // reads v21.GetValidRow()/GetValidCol() — uninitialized
v21.SetValidShape(v5, v6);                                    // (1, 256) — runs AFTER trowexpandmul
Tile<..., 512, PadValue::Zero, ...> v23 = Tile<...>(v7, v6);  // h_chunk    (fillpad output -> ctor(16,256))
TASSIGN(v23, v24);
pipe_barrier(PIPE_V);
TFILLPAD(v23, v21);

pto::Tile's default constructor does not initialize RowMaskInternal / ColMaskInternal (it initializes only data_, and only in auto mode; see pto-isa include/pto/common/pto_tile.hpp:1448). TROWEXPANDMUL_IMPL reads validCol = dst.GetValidCol() and validRow = dst.GetValidRow() up front and uses them to drive the whole computation: a vbrcb of the per-row scalar into the TMP_UB_OFFSET scratch, followed by a vmul loop over validRow × validCol. trowexpandmul therefore runs over an uninitialized valid shape, i.e. a garbage extent and possibly an out-of-range loop, which is consistent with both the on-board garbage and the sim SIGSEGV. The subsequent set_validshape only fixes the metadata, not the half-written data, and tfillpad then faithfully copies the corrupt row 0.
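The ordering problem can be illustrated with a toy model in plain Python. Everything here (ToyTile, toy_row_expand_mul, run) is hypothetical and is not pto-isa code: an unset valid shape is modeled as None and raises on use, whereas in the generated C++ the read would return indeterminate values.

```python
class ToyTile:
    """Toy stand-in for a tile: a buffer plus a valid shape that may be unset."""

    def __init__(self, rows, cols, valid=None):
        self.rows, self.cols = rows, cols
        self.valid = valid  # None models the bare default-constructed tile
        self.data = [[0.0] * cols for _ in range(rows)]

    def set_validshape(self, valid_row, valid_col):
        # Only the metadata changes; data already written stays as it is.
        self.valid = (valid_row, valid_col)


def toy_row_expand_mul(dst, src, scale):
    """Scale each row of src by scale[r][0]; the extent is taken from dst."""
    if dst.valid is None:
        raise ValueError("dst valid shape is unset when the op runs")
    valid_row, valid_col = dst.valid
    for r in range(valid_row):
        for c in range(valid_col):
            dst.data[r][c] = src.data[r][c] * scale.data[r][0]


def run(shape_set_before_op):
    src = ToyTile(16, 256, valid=(16, 256))
    scale = ToyTile(16, 1, valid=(16, 1))
    for r in range(16):
        scale.data[r][0] = float(r + 1)
        for c in range(256):
            src.data[r][c] = float(c + 1)
    # True: dst is constructed with its declared (16, 256) valid shape.
    # False: dst is left unshaped, as in the emitted code for h_chunk_raw.
    dst = ToyTile(16, 256, valid=(16, 256) if shape_set_before_op else None)
    toy_row_expand_mul(dst, src, scale)
    dst.set_validshape(1, 256)  # runs after the producing op in both cases
    return dst
```

Calling run(True) completes and leaves the tile with valid shape (1, 256), while run(False) raises at the point where the real kernel would instead proceed with whatever extent happens to be in the uninitialized fields.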

Likely fix (any of): ptoas should construct an op-output alloc_tile with its declared valid_row/valid_col (as it already does for load targets and the tfillpad output); or hoist the set_validshape ahead of the producing op; or trowexpandmul should not consume dst's valid shape.

Bisection — what is not the cause

  • Not the in-place buffer reuse. With pypto's MemoryReuse pass, the trowexpandmul output aliases gated (both at tile addr 0); with MemoryReuse + LegalizePTOBufferReuse disabled it gets a distinct address (16448) — still fails identically on a2a3 (attached repro_B_no_memoryreuse.pto / .cpp).
  • Not set_validshape / tfillpad per se. pl.fillpad(pl.tensor.set_validshape(pl.slice(gated, [16,256], [0,0]), vr, 256)) — i.e. a loaded tile narrowed + padded — passes on a2a3 (models/deepseek/v4/moe_expert_slice_repro.py).
  • Not mul. Replacing pl.row_expand_mul with a plain pl.mul (and a [16,256] scale_col) keeps the same in-place alias and the same set_validshape + tfillpad tail but emits TMUL instead of TROWEXPANDMUL, and it passes on a2a3.
  • ⇒ The trigger is specifically pto.trowexpandmul writing a freshly alloc_tile'd output, immediately followed by set_validshape + tfillpad in the same kernel. Splitting row_expand_mul and the mask into separate pl.at blocks (so the row_expand_mul result round-trips through GM before set_validshape/fillpad) works around it — that is what pypto-lib/models/deepseek/v4/moe_expert.py does now (passes on both a2a3 and a2a3sim).

Expected behavior

a2a3 produces the correct result and a2a3sim does not crash — or ptoas errors/warns at compile time instead of silently emitting a kernel that returns wrong data on board / SIGSEGVs the simulator.
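A check of this kind can be approximated outside ptoas by scanning the generated C++ text. The sketch below is a heuristic written against the snippet format shown in this issue, not a ptoas feature: the regexes, the function name, and the DST_SHAPE_READERS set are all assumptions, and only TROWEXPANDMUL is listed because it is the only op this issue establishes as reading dst's valid shape.

```python
import re

# A tile declared with no constructor arguments, e.g. "Tile<...> v21;"
BARE_DECL = re.compile(r"Tile<.*>\s+(\w+)\s*;")
# A later metadata fix-up, e.g. "v21.SetValidShape(v5, v6);"
SET_SHAPE = re.compile(r"(\w+)\.SetValidShape\(")
# An upper-case op call and its first (destination) argument
OP_CALL = re.compile(r"([A-Z_]+)\(\s*(\w+)\s*[,)]")

# Ops assumed to read the destination's valid shape (extend with care).
DST_SHAPE_READERS = {"TROWEXPANDMUL"}


def find_unshaped_dst(cpp_text):
    """Return (line, op, tile) for ops that write a tile with no valid shape yet."""
    unshaped = set()
    findings = []
    for lineno, raw in enumerate(cpp_text.splitlines(), start=1):
        code = raw.split("//", 1)[0].strip()  # drop trailing comments
        m = BARE_DECL.fullmatch(code)
        if m:
            unshaped.add(m.group(1))
            continue
        m = SET_SHAPE.match(code)
        if m:
            unshaped.discard(m.group(1))
            continue
        m = OP_CALL.match(code)
        if m and m.group(1) in DST_SHAPE_READERS and m.group(2) in unshaped:
            findings.append((lineno, m.group(1), m.group(2)))
    return findings


EMITTED = """\
Tile<..., 512, PadValue::Null, ...> v11 = Tile<...>(v7, v6);
Tile<..., 512, PadValue::Null, ...> v16 = Tile<...>(v7, v5);
Tile<..., 512, PadValue::Null, ...> v21;
TASSIGN(v21, v22);
TROWEXPANDMUL(v21, v11, v16);
v21.SetValidShape(v5, v6);
TFILLPAD(v23, v21);
"""

findings = find_unshaped_dst(EMITTED)
```

On the emitted sequence this flags the TROWEXPANDMUL call on v21; if SetValidShape is moved ahead of the op, or the tile is constructed with its shape, nothing is reported.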

Environment

  • pypto-lib: HEAD — models/deepseek/v4/moe_expert.py, moe_expert_hchunk_fillpad_repro.py, moe_expert_slice_repro.py
  • pypto: HEAD
  • ptoas: v0.36
  • pto-isa: main @ d8ef4ddc
  • Target: a2a3

Attachments (added separately)

  • repro_A_with_memoryreuse.pto / .cpp: default pipeline (trowexpandmul output aliases gated at tile addr 0)
  • repro_B_no_memoryreuse.pto / .cpp: MemoryReuse + LegalizePTOBufferReuse disabled (distinct addresses)

Both fail identically on a2a3 (and SIGSEGV a2a3sim).
