Component
AICore vector codegen — lowering of pto.alloc_tile op-output tiles and the pto.trowexpandmul → pto.set_validshape → pto.tfillpad sequence.
Summary
A [16, 256] FP32 row-major vec tile produced by pto.trowexpandmul, then narrowed with pto.set_validshape and zero-padded with pto.tfillpad in the same kernel, comes back wrong on a2a3 silicon and segfaults a2a3sim:
- a2a3 (silicon): the surviving valid row (row 0 when valid_row = 1) has col 0 correct, cols ~1–15 garbage, cols ~16–255 ≈ 0; rows ≥ valid_row are correctly 0. Validation against the torch golden fails (255/4096 elements).
- a2a3sim: the runtime crashes with a SIGSEGV / core dump during [RUN] runtime.
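The 255/4096 figure is what one would expect if exactly one row is damaged everywhere except its first column. A minimal sketch of that arithmetic, assuming the golden row 0 is nonzero in every column (the values and the helper name `count_mismatches` are illustrative and not taken from the repro script):

```python
ROWS, COLS = 16, 256
VALID_ROW = 1  # rows >= VALID_ROW are expected to be zero-padded


def count_mismatches(actual, golden, tol=1e-6):
    """Count elements whose absolute difference exceeds tol."""
    return sum(
        1
        for r in range(ROWS)
        for c in range(COLS)
        if abs(actual[r][c] - golden[r][c]) > tol
    )


# Illustrative golden: row 0 carries nonzero data, all other rows are zero.
golden = [
    [float(c + 1) if r < VALID_ROW else 0.0 for c in range(COLS)]
    for r in range(ROWS)
]

# Observed pattern from the report: col 0 correct, cols 1-15 garbage,
# cols 16-255 approximately 0, padded rows correct.
actual = [row[:] for row in golden]
for c in range(1, 16):
    actual[0][c] = -12345.0  # stand-in for garbage
for c in range(16, COLS):
    actual[0][c] = 0.0

mismatches = count_mismatches(actual, golden)
print(f"{mismatches}/{ROWS * COLS}")  # 255/4096
```
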
This supersedes #652. There I mis-attributed the failure to the [N, 1] col_major scale tile and to the tfillpad/trowexpandmul layout combination, and reported the original-kernel symptom (an indefinite hang on a2a3 board, clean pass on a2a3sim). After bisecting (below), the trigger is narrower and unrelated to the [N, 1] tile. The exact a2a3-vs-a2a3sim symptom seems to depend on the surrounding code (hang vs wrong result on board; pass vs segfault in sim), but in every form a2a3 ≠ a2a3sim and a2a3 is incorrect.
Minimal reproduction
pypto-lib/models/deepseek/v4/moe_expert_hchunk_fillpad_repro.py:
python moe_expert_hchunk_fillpad_repro.py -p a2a3sim # SIGSEGV (core dump) during runtime
python moe_expert_hchunk_fillpad_repro.py -p a2a3 -d 0 # FAIL — 'out' mismatch, 255/4096 elems
Kernel (single pl.at(CORE_GROUP)):
vr = pl.min(16, valid_rows) # = 1
h_chunk_raw = pl.row_expand_mul(gated, scale_col) # [16, 256] FP32; scale_col is [16, 1] col_major
h_chunk = pl.fillpad(pl.tensor.set_validshape(h_chunk_raw, vr, 256), pad_value=pl.PadValue.zero)
out[:, :] = h_chunk
Generated .pto (attached repro_A_with_memoryreuse.pto):
%h_chunk_raw__tile = pto.alloc_tile addr = 0 valid_row = 16 valid_col = 256 ... pad=0
pto.trowexpandmul ins(%gated__tile, %scale_col__tile) outs(%h_chunk_raw__tile)
pto.set_validshape %h_chunk_raw__tile, 1, 256
%h_chunk__tile = pto.alloc_tile addr = 16448 valid_row = 16 valid_col = 256 ... pad=1
pto.tfillpad ins(%h_chunk_raw__tile) outs(%h_chunk__tile)
pto.tstore ins(%h_chunk__tile) outs(...)
Suspicious codegen
ptoas emits the trowexpandmul output tile %h_chunk_raw__tile as a bare default-constructed Tile<...> v21; — unlike the load-target tiles and the tfillpad output tile, which are emitted as Tile<...>(16, 256) carrying the valid_row/valid_col from their pto.alloc_tile (attached repro_A_with_memoryreuse.cpp):
Tile<..., 512, PadValue::Null, ...> v11 = Tile<...>(v7, v6); // gated (load target -> ctor(16,256))
Tile<..., 512, PadValue::Null, ...> v16 = Tile<...>(v7, v5); // scale_col (load target -> ctor(16,1))
...
Tile<..., 512, PadValue::Null, ...> v21; // h_chunk_raw <-- BARE DEFAULT CTOR, no (16,256)
TASSIGN(v21, v22); // v22 = 0
TROWEXPANDMUL(v21, v11, v16); // reads v21.GetValidRow()/GetValidCol() — uninitialized
v21.SetValidShape(v5, v6); // (1, 256) — runs AFTER trowexpandmul
Tile<..., 512, PadValue::Zero, ...> v23 = Tile<...>(v7, v6); // h_chunk (fillpad output -> ctor(16,256))
TASSIGN(v23, v24);
pipe_barrier(PIPE_V);
TFILLPAD(v23, v21);
pto::Tile's default constructor does not initialize RowMaskInternal / ColMaskInternal (only data_, and only in auto mode — see pto-isa include/pto/common/pto_tile.hpp:1448). TROWEXPANDMUL_IMPL does validCol = dst.GetValidCol(); validRow = dst.GetValidRow(); up front and uses those to drive the whole computation (vbrcb of the per-row scalar into the TMP_UB_OFFSET scratch + a vmul loop over validRow × validCol). So trowexpandmul runs over an uninitialized valid shape — a garbage extent / out-of-range loop (consistent with both the on-board garbage and the sim SIGSEGV) — and the subsequent set_validshape only fixes the metadata, not the half-written data. tfillpad then faithfully copies the corrupt row 0.
Likely fix (any of): ptoas should construct an op-output alloc_tile with its declared valid_row/valid_col (as it already does for load targets and the tfillpad output); or hoist the set_validshape ahead of the producing op; or trowexpandmul should not consume dst's valid shape.
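The suspected mechanism and the first proposed fix can be modelled in a few lines of plain Python. This is a toy model under stated assumptions, not the pto-isa implementation: `Tile`, `row_expand_mul`, `fillpad_zero` and `run_kernel` are hypothetical stand-ins, the "uninitialized" valid shape is simulated with an arbitrarily chosen stale value of (1, 1), and the toy buffer starts zeroed where real hardware would hold whatever was there before.

```python
from dataclasses import dataclass, field

ROWS, COLS = 16, 256


@dataclass
class Tile:
    """Toy stand-in for a tile: valid-shape metadata plus a full buffer."""
    valid_row: int
    valid_col: int
    data: list = field(
        default_factory=lambda: [[0.0] * COLS for _ in range(ROWS)]
    )


def row_expand_mul(dst, src, scale):
    """Scale each row of src by scale[r]; extents are read from dst up front."""
    rows, cols = dst.valid_row, dst.valid_col
    if not (0 <= rows <= ROWS and 0 <= cols <= COLS):
        # Stand-in for an out-of-range loop (the simulator SIGSEGV).
        raise IndexError(f"extent {rows}x{cols} exceeds tile {ROWS}x{COLS}")
    for r in range(rows):
        for c in range(cols):
            dst.data[r][c] = src.data[r][c] * scale[r]


def fillpad_zero(dst, src):
    """Copy src's valid region into dst and zero everything outside it."""
    for r in range(ROWS):
        for c in range(COLS):
            inside = r < src.valid_row and c < src.valid_col
            dst.data[r][c] = src.data[r][c] if inside else 0.0


def run_kernel(initial_shape, valid_rows=1):
    """Same op order as the generated code: mul, then set valid shape, then pad."""
    gated = Tile(ROWS, COLS, [[float(c + 1) for c in range(COLS)] for _ in range(ROWS)])
    scale = [2.0] * ROWS
    raw = Tile(*initial_shape)                       # op-output tile
    row_expand_mul(raw, gated, scale)                # consumes raw's valid shape
    raw.valid_row, raw.valid_col = valid_rows, COLS  # set_validshape runs afterwards
    out = Tile(ROWS, COLS)
    fillpad_zero(out, raw)
    return out.data


expected = [[2.0 * (c + 1) if r < 1 else 0.0 for c in range(COLS)] for r in range(ROWS)]
buggy = run_kernel((1, 1))        # stale valid shape: only one element is computed
fixed = run_kernel((ROWS, COLS))  # constructed with the declared valid_row/valid_col
print(buggy == expected, fixed == expected)  # False True
```

In this model a too-small stale extent yields a partially written row 0 and a too-large one raises, which parallels the wrong-data and crash symptoms; the actual garbage values on hardware are of course not reproduced.
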
Bisection — what is not the cause
- Not the in-place buffer reuse. With pypto's MemoryReuse pass, the trowexpandmul output aliases gated (both at tile addr 0); with MemoryReuse + LegalizePTOBufferReuse disabled it gets a distinct address (16448) and still fails identically on a2a3 (attached repro_B_no_memoryreuse.pto / .cpp).
- Not set_validshape / tfillpad per se. pl.fillpad(pl.tensor.set_validshape(pl.slice(gated, [16,256], [0,0]), vr, 256)), i.e. a loaded tile narrowed + padded, passes on a2a3 (models/deepseek/v4/moe_expert_slice_repro.py).
- Not mul. Replacing pl.row_expand_mul with a plain pl.mul (and a [16,256] scale_col), with the same in-place alias and the same set_validshape + tfillpad tail but emitting TMUL instead of TROWEXPANDMUL, passes on a2a3.
- ⇒ The trigger is specifically pto.trowexpandmul writing a freshly alloc_tile'd output, immediately followed by set_validshape + tfillpad in the same kernel. Splitting row_expand_mul and the mask into separate pl.at blocks (so the row_expand_mul result round-trips through GM before set_validshape/fillpad) works around it; that is what pypto-lib/models/deepseek/v4/moe_expert.py does now (passes on both a2a3 and a2a3sim).
Expected behavior
a2a3 produces the correct result and a2a3sim does not crash — or ptoas errors/warns at compile time instead of silently emitting a kernel that returns wrong data on board / SIGSEGVs the simulator.
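Until such a diagnostic exists, a text-level check over the generated .cpp can flag the pattern. The following is a rough sketch, not a ptoas feature: the function name `find_unshaped_outputs` is hypothetical, the regexes assume the code shape seen in the excerpt above (one statement per line, destination tile as the first argument of an upper-case op, TASSIGN only binding an address), and real generated code may need looser patterns.

```python
import re

# Declaration with no initializer, e.g. "Tile<...> v21;"
BARE_DECL = re.compile(r"^\s*Tile<[^;=]*>\s+(\w+)\s*;")
# Upper-case op call; the first argument is assumed to be the destination tile.
OP_CALL = re.compile(r"^\s*([A-Z][A-Z0-9_]*)\(\s*(\w+)\s*[,)]")
# Explicit valid-shape assignment, e.g. "v21.SetValidShape(v5, v6);"
SET_SHAPE = re.compile(r"^\s*(\w+)\.SetValidShape\(")


def find_unshaped_outputs(cpp_text):
    """Return (tile, op, line) for ops writing a tile that has no valid shape yet."""
    pending = {}   # bare-declared tiles whose valid shape has not been set
    findings = []
    for lineno, line in enumerate(cpp_text.splitlines(), 1):
        if m := BARE_DECL.match(line):
            pending[m.group(1)] = lineno
        elif m := SET_SHAPE.match(line):
            pending.pop(m.group(1), None)
        elif m := OP_CALL.match(line):
            op, dst = m.groups()
            if op != "TASSIGN" and dst in pending:
                findings.append((dst, op, lineno))
                pending.pop(dst)
    return findings


# Lines adapted from the excerpt in this report (template arguments elided as there).
SAMPLE = "\n".join([
    "Tile<..., 512, PadValue::Null, ...> v11 = Tile<...>(v7, v6);",
    "Tile<..., 512, PadValue::Null, ...> v21;",
    "TASSIGN(v21, v22);",
    "TROWEXPANDMUL(v21, v11, v16);",
    "v21.SetValidShape(v5, v6);",
    "Tile<..., 512, PadValue::Zero, ...> v23 = Tile<...>(v7, v6);",
    "TASSIGN(v23, v24);",
    "pipe_barrier(PIPE_V);",
    "TFILLPAD(v23, v21);",
])

findings = find_unshaped_outputs(SAMPLE)
print(findings)  # [('v21', 'TROWEXPANDMUL', 4)]
```
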
Environment
- pypto-lib: HEAD (models/deepseek/v4/moe_expert.py, moe_expert_hchunk_fillpad_repro.py, moe_expert_slice_repro.py)
- pypto: HEAD
- ptoas: v0.36
- pto-isa: main @ d8ef4ddc
- Target: a2a3
Attachments (added separately)
repro_A_with_memoryreuse.pto / .cpp — default pipeline (trowexpandmul output aliases gated at tile addr 0)
repro_B_no_memoryreuse.pto / .cpp — MemoryReuse + LegalizePTOBufferReuse disabled (distinct addresses)
Both fail identically on a2a3 (and SIGSEGV a2a3sim).