The MLIR is always the same:
module {
func.func @_kernel(%arg0: !pto.ptr<f16>, %arg1: !pto.ptr<f16>, %arg2: !pto.ptr<f16>, %arg3: !pto.ptr<f16>, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: f32, %arg9: f32, %arg10: f32, %arg11: f32, %arg12: f32, %arg13: f32) {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c256 = arith.constant 256 : index
%c8 = arith.constant 8 : index
%cst = arith.constant 0.000000e+00 : f32
%cst_0 = arith.constant 1.000000e+00 : f32
%0 = arith.index_cast %arg4 : i32 to index
%1 = arith.index_cast %arg5 : i32 to index
%2 = arith.index_cast %arg6 : i32 to index
%3 = arith.index_cast %arg7 : i32 to index
pto.section.vector {
%4 = arith.cmpi sgt, %1, %c0 : index
%5 = arith.cmpi sgt, %2, %c0 : index
%6 = arith.andi %4, %5 : i1
%7 = arith.cmpi sge, %c256, %1 : index
%8 = arith.andi %6, %7 : i1
%9 = arith.cmpi sge, %c256, %2 : index
%10 = arith.andi %8, %9 : i1
%11 = arith.remsi %1, %c8 : index
%12 = arith.cmpi eq, %11, %c0 : index
%13 = arith.andi %10, %12 : i1
scf.if %13 {
%14 = pto.get_block_idx
%15 = pto.get_subblock_idx
%16 = pto.get_subblock_num
%17 = pto.get_block_num
%18 = arith.muli %14, %16 : i64
%19 = arith.addi %18, %15 : i64
%20 = arith.index_cast %19 : i64 to index
%21 = arith.muli %17, %16 : i64
%22 = arith.index_cast %21 : i64 to index
%23 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%24 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%25 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%26 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%27 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%28 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%29 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%30 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%31 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
%32 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>
%33 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>
%34 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>
%35 = pto.alloc_tile : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
%36 = pto.alloc_tile : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
%37 = pto.alloc_tile : !pto.tile_buf<vec, 1x8xf32>
%38 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
%39 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
%40 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
%41 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>
%42 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>
%43 = arith.muli %0, %1 : index
%44 = pto.make_tensor_view %arg0, shape = [%43, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
%45 = pto.make_tensor_view %arg1, shape = [%43, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
%46 = pto.make_tensor_view %arg2, shape = [%0, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
%47 = pto.make_tensor_view %arg3, shape = [%0, %1], strides = [%1, %c1] : !pto.tensor_view<?x?xf16>
scf.for %arg14 = %20 to %0 step %22 {
pto.tmuls ins(%23, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tadds ins(%23, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%24, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tadds ins(%24, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%25, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tadds ins(%25, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
%48 = arith.muli %arg14, %1 : index
%49 = arith.addi %3, %c1 : index
scf.for %arg15 = %c0 to %49 step %c1 {
pto.tmuls ins(%26, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%27, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
scf.for %arg16 = %c0 to %1 step %c8 {
%53 = arith.addi %48, %arg16 : index
%54 = pto.partition_view %44, offsets = [%53, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
pto.tload ins(%54 : !pto.partition_tensor_view<8x256xf16>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
pto.tcvt ins(%32 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
%55 = pto.subview %24[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
pto.tmuls ins(%55, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%37 : !pto.tile_buf<vec, 1x8xf32>)
%56 = pto.treshape %37 : !pto.tile_buf<vec, 1x8xf32> -> !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
pto.trowexpanddiv ins(%33, %56 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
pto.tcolexpandmul ins(%33, %25 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
pto.trowsum ins(%33, %34 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%35 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>)
%57 = pto.treshape %35 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32>
%58 = pto.subview %28[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
pto.tmuls ins(%57, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%58 : !pto.tile_buf<vec, 1x8xf32>)
pto.tcolsum ins(%33, %34 {isBinary = true} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tadd ins(%26, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmul ins(%33, %33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
pto.trowsum ins(%33, %34 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%36 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>)
%59 = pto.treshape %36 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32>
%60 = pto.subview %29[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
pto.tmuls ins(%59, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%60 : !pto.tile_buf<vec, 1x8xf32>)
pto.tcolsum ins(%33, %34 {isBinary = true} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tadd ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
}
pto.tmul ins(%28, %28 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%31, %arg11 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tsub ins(%29, %31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%29, %arg13 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmaxs ins(%29, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tsqrt ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmul ins(%26, %26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%30, %arg10 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tsub ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%27, %arg12 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmaxs ins(%27, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tsqrt ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
%52 = arith.cmpi eq, %arg15, %c0 : index
scf.if %52 {
pto.trowmin ins(%29, %31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%39 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>)
pto.trowmin ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%40 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>)
%53 = pto.treshape %39 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
%54 = pto.treshape %40 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
%55 = pto.treshape %38 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
pto.tmin ins(%53, %54 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>, !pto.tile_buf<vec, 1x8xf32, valid=?x?>) outs(%55 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>)
pto.tadds ins(%55, %arg9 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>, f32) outs(%55 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>)
} else {
pto.trowexpanddiv ins(%29, %38 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
%cst_1 = arith.constant 9.99999996E-13 : f32
pto.tmaxs ins(%29, %cst_1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tlog ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%29, %arg8 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.texp ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmul ins(%24, %29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.trowexpanddiv ins(%27, %38 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
%cst_2 = arith.constant 9.99999996E-13 : f32
pto.tmaxs ins(%27, %cst_2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tlog ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmuls ins(%27, %arg8 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.texp ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.tmul ins(%23, %27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
pto.trecip ins(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
}
}
scf.for %arg15 = %c0 to %1 step %c8 {
%52 = arith.addi %48, %arg15 : index
%53 = pto.partition_view %44, offsets = [%52, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
pto.tload ins(%53 : !pto.partition_tensor_view<8x256xf16>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
pto.tcvt ins(%32 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
%54 = pto.subview %24[%c0, %arg15] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
pto.tmuls ins(%54, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%37 : !pto.tile_buf<vec, 1x8xf32>)
%55 = pto.treshape %37 : !pto.tile_buf<vec, 1x8xf32> -> !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
pto.trowexpanddiv ins(%33, %55 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
pto.tcolexpandmul ins(%33, %25 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
%56 = arith.addi %48, %arg15 : index
%57 = pto.partition_view %45, offsets = [%56, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
pto.tstore ins(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%57 : !pto.partition_tensor_view<8x256xf16>)
}
pto.tcvt ins(%23 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%41 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>)
%50 = pto.partition_view %46, offsets = [%arg14, %c0], sizes = [%c1, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<1x256xf16>
pto.tstore ins(%41 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>) outs(%50 : !pto.partition_tensor_view<1x256xf16>)
pto.tcvt ins(%24 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%42 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>)
%51 = pto.partition_view %47, offsets = [%arg14, %c0], sizes = [%c1, %1] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<1x256xf16>
pto.tstore ins(%42 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>) outs(%51 : !pto.partition_tensor_view<1x256xf16>)
}
}
}
return
}
}
Component
PTO Dialect / ODS (include/PTO/IR)
Description
Running ptoas repeatedly on a byte-identical input .pto file produces different .cpp outputs on each invocation. The differences are not cosmetic: the UB tile base addresses (pto.pointer_cast(%cN_i64)) get reshuffled, and some of the resulting layouts compute numerically wrong results at runtime, i.e. on those layouts two logically distinct tiles whose lifetimes overlap end up sharing storage and stomp on each other.
Concretely, on our Sinkhorn kernel:
- md5sum sinkhorn.pto is stable (same bytes every time the Python builder runs).
- md5sum of ptoas's output .cpp differs across consecutive invocations.
- The .so built from the "lucky" layout passes 66/66 correctness cases at rtol=2e-3, atol=1e-3.
- A .so built from an "unlucky" layout fails 12/66 cases at the same tolerance, and still fails 17/66 cases at the very loose rtol=5e-2, atol=1e-2 (so this is real wrong arithmetic, not a precision regression).
The failures are concentrated on specific dynamic shapes (N=4, K=32, L=64 and N=4, K=64, L=32 for our kernel). This is consistent with an allocator-aliasing bug whose manifestation depends on which loop iteration first happens to write to the colliding tile.
This makes builds non-reproducible and silently miscompiled.
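The aliasing condition described above (two tiles that overlap both in storage and in lifetime) can be stated as a small check. This is only a sketch of the invariant, not ptoas's allocator: the Tile record, the tile names, the byte offsets and the op indices are all hypothetical values chosen for illustration. The only grounded number is that a 1x256xf32 tile occupies 256 * 4 = 1024 bytes.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Tile:
    name: str
    offset: int     # base address in bytes (hypothetical)
    size: int       # bytes occupied
    first_use: int  # index of the first op touching the tile (hypothetical)
    last_use: int   # index of the last op touching the tile, inclusive


def ranges_overlap(a_start, a_end, b_start, b_end):
    """Half-open interval test: [a_start, a_end) against [b_start, b_end)."""
    return a_start < b_end and b_start < a_end


def find_aliasing(tiles):
    """Return name pairs that overlap in both storage and lifetime."""
    bad = []
    for a, b in combinations(tiles, 2):
        shares_storage = ranges_overlap(a.offset, a.offset + a.size,
                                        b.offset, b.offset + b.size)
        live_together = ranges_overlap(a.first_use, a.last_use + 1,
                                       b.first_use, b.last_use + 1)
        if shares_storage and live_together:
            bad.append((a.name, b.name))
    return bad


# Two made-up layouts for the same three 1024-byte tiles.
# In "lucky", t30 reuses t23's storage only after t23 is dead, which is fine.
lucky = [Tile("t23", 0, 1024, 0, 10),
         Tile("t24", 1024, 1024, 2, 12),
         Tile("t30", 0, 1024, 11, 20)]
# In "unlucky", t24 is placed at offset 512 and straddles live neighbours.
unlucky = [Tile("t23", 0, 1024, 0, 10),
           Tile("t24", 512, 1024, 2, 12),
           Tile("t30", 0, 1024, 11, 20)]

print(find_aliasing(lucky))    # []
print(find_aliasing(unlucky))  # [('t23', 't24'), ('t24', 't30')]
```

Storage reuse on its own is legitimate (t23 and t30 in the first layout); only the combination of shared bytes and overlapping live ranges is a bug.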
Reproduction (minimal)
The .pto is generated by the PTODSL Sinkhorn builder in huawei-csl/pto-dsl#117:
examples/aot/sinkhorn_dynamic_multicore/sinkhorn_builder.py
sinkhorn.pto
Diffing the outputs shows that the divergence is in the UB pointer-cast constants emitted at the top of the kernel.
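A determinism check of this kind can be scripted roughly as follows. This is a sketch under stated assumptions: the exact ptoas command line is not reproduced here, so the helper takes a caller-supplied make_cmd function that returns the argv for one run, and demo_cmd is a stand-in that only exists to make the sketch runnable. It hashes with SHA-256 rather than md5sum; any stable digest serves the same purpose.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def output_digests(make_cmd, runs=3):
    """Run a command `runs` times and return the SHA-256 of each output file.

    `make_cmd(out_path)` must return the argv list of a command that
    writes its result to `out_path`.
    """
    digests = []
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(runs):
            out = Path(tmp) / f"run{i}.out"
            subprocess.run(make_cmd(out), check=True)
            digests.append(hashlib.sha256(out.read_bytes()).hexdigest())
    return digests


def is_deterministic(make_cmd, runs=3):
    """True when every run produced byte-identical output."""
    return len(set(output_digests(make_cmd, runs))) == 1


def demo_cmd(out_path):
    # Stand-in for the real compiler invocation (hypothetical): a child
    # Python process that always writes the same bytes.
    code = "import sys, pathlib; pathlib.Path(sys.argv[1]).write_text('same bytes')"
    return [sys.executable, "-c", code, str(out_path)]


print(is_deterministic(demo_cmd))  # True
```

To apply it to this issue, replace demo_cmd with a function that returns the actual ptoas invocation for sinkhorn.pto; a False result reproduces the non-determinism without needing any hardware.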
To observe the functional consequence (one layout correct, another broken), build twice and run the correctness tests.
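For readers unfamiliar with the two tolerance settings, the sketch below shows why a failure at the loose setting points to wrong arithmetic rather than rounding. It assumes the tests use the usual allclose-style criterion, |actual - expected| <= atol + rtol * |expected|; the issue does not state which comparison function is used, and the sample values are hypothetical.

```python
def within_tolerance(actual, expected, rtol, atol):
    """allclose-style check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


strict = dict(rtol=2e-3, atol=1e-3)  # tolerances quoted in this issue
loose = dict(rtol=5e-2, atol=1e-2)

expected = 1.0
rounding_noise = 1.0015  # hypothetical: small fp16-scale error
precision_loss = 1.02    # hypothetical: a genuine precision regression
clobbered = 1.25         # hypothetical: value read from an overwritten tile

print(within_tolerance(rounding_noise, expected, **strict))  # True
print(within_tolerance(precision_loss, expected, **strict))  # False
print(within_tolerance(precision_loss, expected, **loose))   # True
print(within_tolerance(clobbered, expected, **loose))        # False
```

A precision regression typically fails the strict setting but passes the loose one; a value that fails both is far outside what rounding can explain.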
The 12 failing cases are always the same shapes: every (order, seed) combination of (N=4, K=32, L=64) and (N=4, K=64, L=32). The MLIR is always the same.
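The shape pattern can be confirmed by grouping the mismatch rows from the test summary in the error logs. The rows below are copied from that summary; the field order (N, K, L, order, seed, status) is an assumption inferred from the wording "(order, seed) combination", not something the log states.

```python
from collections import Counter

# Mismatch rows from the test summary at rtol=2e-3, atol=1e-3.
# Assumed field order: (N, K, L, order, seed, status).
mismatches = [
    (4, 32, 64, 1, 0, "mismatch"), (4, 32, 64, 1, 42, "mismatch"),
    (4, 32, 64, 5, 0, "mismatch"), (4, 32, 64, 5, 42, "mismatch"),
    (4, 32, 64, 10, 0, "mismatch"), (4, 32, 64, 10, 42, "mismatch"),
    (4, 64, 32, 1, 0, "mismatch"), (4, 64, 32, 1, 42, "mismatch"),
    (4, 64, 32, 5, 0, "mismatch"), (4, 64, 32, 5, 42, "mismatch"),
    (4, 64, 32, 10, 0, "mismatch"), (4, 64, 32, 10, 42, "mismatch"),
]

# Count failures per (N, K, L) shape.
by_shape = Counter(row[:3] for row in mismatches)

# Collect the (order, seed) pairs that fail for each shape.
combos = {
    shape: sorted((row[3], row[4]) for row in mismatches if row[:3] == shape)
    for shape in by_shape
}

print(dict(by_shape))  # {(4, 32, 64): 6, (4, 64, 32): 6}
print(combos[(4, 32, 64)] == combos[(4, 64, 32)])  # True
```

Both shapes fail for the same six (order, seed) pairs, so the failure depends on the shape alone and not on the input data or iteration count.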
Expected behavior
- ptoas is deterministic: same input bytes → same output bytes (identical .cpp).
- [...] scf.if guard, with no aliasing between tiles whose lifetimes overlap.
Actual behavior / error logs
- [...] .cpp outputs in three consecutive invocations (md5s above).
summary:
tolerances: rtol=2e-3, atol=1e-3
...
(4, 32, 64, 1, 0, 'mismatch')
(4, 32, 64, 1, 42, 'mismatch')
(4, 32, 64, 5, 0, 'mismatch')
(4, 32, 64, 5, 42, 'mismatch')
(4, 32, 64, 10, 0, 'mismatch')
(4, 32, 64, 10, 42, 'mismatch')
(4, 64, 32, 1, 0, 'mismatch')
(4, 64, 32, 1, 42, 'mismatch')
(4, 64, 32, 5, 0, 'mismatch')
(4, 64, 32, 5, 42, 'mismatch')
(4, 64, 32, 10, 0, 'mismatch')
(4, 64, 32, 10, 42, 'mismatch')
totals: {'match': 54, 'mismatch': 12, 'skip': 0}
bisheng:
Git commit
fd2107e
Host platform
Linux (aarch64)
Target Ascend arch (if relevant)
a3
PTOAS build level (if relevant)
None