[Bug] Non-deterministic UB allocator: identical MLIR input produces different .cpp outputs across runs, some of which compute incorrect results #541

@MirkoDeVita98

Component

PTO Dialect / ODS (include/PTO/IR)

Description

Running ptoas repeatedly on a byte-identical input .pto file produces a different .cpp output on each invocation. The differences are not cosmetic: the UB tile base addresses (pto.pointer_cast(%cN_i64)) are reshuffled, and some of the resulting layouts compute numerically wrong results at runtime. In those layouts, two logically distinct tiles with overlapping lifetimes appear to share storage and overwrite each other.

Concretely, on our Sinkhorn kernel:

  • md5sum sinkhorn.pto is stable (same bytes every time the Python builder runs).
  • md5sum of ptoas's output .cpp differs across consecutive invocations.
  • The .so built from the "lucky" layout passes 66/66 correctness cases at rtol=2e-3, atol=1e-3.
  • The .so built from an "unlucky" layout fails 12/66 cases at the same tolerance, and still fails 17/66 cases at the very loose rtol=5e-2, atol=1e-2 (so the arithmetic is genuinely wrong; this is not a precision regression).

The failures are concentrated on specific dynamic shapes (N=4, K=32, L=64 and N=4, K=64, L=32 for our kernel). This is consistent with an allocator-aliasing bug whose manifestation depends on which loop iteration first happens to write to the colliding tile.

As a result, builds are non-reproducible and some of them are silently miscompiled.

Reproduction (minimal)

The .pto is generated by the PTODSL Sinkhorn builder in huawei-csl/pto-dsl#117:

  • Builder: examples/aot/sinkhorn_dynamic_multicore/sinkhorn_builder.py
  • Generated IR (attached below): sinkhorn.pto

cd /mounted_home/pto-dsl/examples/aot/sinkhorn_dynamic_multicore

# 1. Confirm the input .pto is deterministic.
python3 sinkhorn_builder.py > /tmp/a.pto
python3 sinkhorn_builder.py > /tmp/b.pto
md5sum /tmp/a.pto /tmp/b.pto
#   5117399319df1c521c3fe0ee9f750e99  /tmp/a.pto
#   5117399319df1c521c3fe0ee9f750e99  /tmp/b.pto       <-- IDENTICAL

# 2. Run ptoas three times on the *same* input.
for i in 1 2 3; do
    ptoas --enable-insert-sync /tmp/a.pto -o /tmp/r$i.cpp
    md5sum /tmp/r$i.cpp
done
# Observed:
#   f4c0d4a6d4e654a9f0fe8b57f8735ce0  /tmp/r1.cpp
#   330760db3d1d225daadf92967cfc0b6e  /tmp/r2.cpp
#   dfe9db622c2d46caf0563c572cd6a233  /tmp/r3.cpp     <-- ALL DIFFERENT
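
The same check can be scripted so that it is easy to repeat with more runs or to wire into CI. The following is a minimal sketch, not part of ptoas or pto-dsl: the helper names (output_digests, is_deterministic, ptoas_cmd) are made up for illustration, and the only ptoas flags used are the ones from the loop above.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def output_digests(make_cmd, runs=3):
    """Run a command `runs` times and return the MD5 digest of each output.

    `make_cmd(out_path)` must return the argv list of a command that writes
    its result to `out_path`.
    """
    digests = []
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(runs):
            out = Path(tmp) / f"r{i}.cpp"
            subprocess.run(make_cmd(str(out)), check=True)
            data = out.read_bytes()
            # MD5 is only used to compare bytes, as md5sum does above.
            digests.append(hashlib.md5(data, usedforsecurity=False).hexdigest())
    return digests


def is_deterministic(make_cmd, runs=3):
    """True if every run produced byte-identical output."""
    return len(set(output_digests(make_cmd, runs))) == 1


def ptoas_cmd(pto_path):
    """Build the same ptoas invocation as the shell loop above."""
    return lambda out: ["ptoas", "--enable-insert-sync", pto_path, "-o", out]
```

With ptoas on PATH, is_deterministic(ptoas_cmd("/tmp/a.pto"), runs=10) would be expected to return False for the behavior reported here and True once the output is reproducible.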

Diffing the outputs shows the divergence is in the UB pointer-cast constants emitted at the top of the kernel, e.g.:

-  const int64_t v28 = 15392;
-  const int64_t v29 = 16416;
-  const int64_t v30 = 25632;
+  const int64_t v28 = 1024;
+  const int64_t v29 = 29888;
+  const int64_t v30 = 3104;
   ...
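
To compare two emitted files without reading a full diff, the integer constants can be extracted and compared by name. This is a rough sketch that assumes the constants always have the `const int64_t vN = <integer>;` form shown above and that each name appears once per file; it does not know which constants are UB addresses, so it reports every constant that differs.

```python
import re

# Matches lines such as `const int64_t v28 = 15392;` in the emitted .cpp.
CONST_RE = re.compile(r"const\s+int64_t\s+(v\d+)\s*=\s*(-?\d+)\s*;")


def int64_constants(cpp_text):
    """Map each `const int64_t vN = <int>;` name to its integer value."""
    return {name: int(value) for name, value in CONST_RE.findall(cpp_text)}


def changed_constants(cpp_a, cpp_b):
    """Return {name: (value_in_a, value_in_b)} for constants that differ."""
    a, b = int64_constants(cpp_a), int64_constants(cpp_b)
    names = sorted(set(a) | set(b), key=lambda n: int(n[1:]))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# The two variants from the diff above.
run_a = """
  const int64_t v28 = 15392;
  const int64_t v29 = 16416;
  const int64_t v30 = 25632;
"""
run_b = """
  const int64_t v28 = 1024;
  const int64_t v29 = 29888;
  const int64_t v30 = 3104;
"""
print(changed_constants(run_a, run_b))
# {'v28': (15392, 1024), 'v29': (16416, 29888), 'v30': (25632, 3104)}
```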

To observe the functional consequence (one layout correct, another broken), build twice and run the correctness tests:

bash compile.sh                                         # bad layout (example)
python ./run_sinkhorn.py --lib ./sinkhorn_lib.so --rtol 2e-3 --atol 1e-3
#   totals: {'match': 54, 'mismatch': 12, 'skip': 0}
bash compile.sh                                         # try again — sometimes good, sometimes bad
python ./run_sinkhorn.py --lib ./sinkhorn_lib.so --rtol 2e-3 --atol 1e-3
#   totals: {'match': 66, 'mismatch':  0, 'skip': 0}

The 12 failing cases are always the same shapes: every (order, seed) combination of (N=4, K=32, L=64) and (N=4, K=64, L=32).
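
The grouping can be checked mechanically from the test summary. The sketch below copies the 12 mismatch rows from the failing run's summary (listed under "Actual behavior / error logs") and counts them per shape. The field order (N, K, L, order, seed, status) is inferred from the text of this report, not taken from run_sinkhorn.py.

```python
from collections import Counter

# Rows from the failing run's summary.
# Assumed field order: (N, K, L, order, seed, status).
rows = [
    (4, 32, 64, 1, 0, "mismatch"),
    (4, 32, 64, 1, 42, "mismatch"),
    (4, 32, 64, 5, 0, "mismatch"),
    (4, 32, 64, 5, 42, "mismatch"),
    (4, 32, 64, 10, 0, "mismatch"),
    (4, 32, 64, 10, 42, "mismatch"),
    (4, 64, 32, 1, 0, "mismatch"),
    (4, 64, 32, 1, 42, "mismatch"),
    (4, 64, 32, 5, 0, "mismatch"),
    (4, 64, 32, 5, 42, "mismatch"),
    (4, 64, 32, 10, 0, "mismatch"),
    (4, 64, 32, 10, 42, "mismatch"),
]


def mismatches_by_shape(rows):
    """Count mismatching cases per (N, K, L) shape."""
    return Counter(
        (n, k, l) for n, k, l, _order, _seed, status in rows if status == "mismatch"
    )


print(dict(mismatches_by_shape(rows)))
# {(4, 32, 64): 6, (4, 64, 32): 6}
```

Each shape fails for all 3 orders × 2 seeds, which suggests the failure depends on the shape rather than on the input data.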

The MLIR is always the same:

module {
  func.func @_kernel(%arg0: !pto.ptr<f16>, %arg1: !pto.ptr<f16>, %arg2: !pto.ptr<f16>, %arg3: !pto.ptr<f16>, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: f32, %arg9: f32, %arg10: f32, %arg11: f32, %arg12: f32, %arg13: f32) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c256 = arith.constant 256 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 1.000000e+00 : f32
    %0 = arith.index_cast %arg4 : i32 to index
    %1 = arith.index_cast %arg5 : i32 to index
    %2 = arith.index_cast %arg6 : i32 to index
    %3 = arith.index_cast %arg7 : i32 to index
    pto.section.vector {
      %4 = arith.cmpi sgt, %1, %c0 : index
      %5 = arith.cmpi sgt, %2, %c0 : index
      %6 = arith.andi %4, %5 : i1
      %7 = arith.cmpi sge, %c256, %1 : index
      %8 = arith.andi %6, %7 : i1
      %9 = arith.cmpi sge, %c256, %2 : index
      %10 = arith.andi %8, %9 : i1
      %11 = arith.remsi %1, %c8 : index
      %12 = arith.cmpi eq, %11, %c0 : index
      %13 = arith.andi %10, %12 : i1
      scf.if %13 {
        %14 = pto.get_block_idx
        %15 = pto.get_subblock_idx
        %16 = pto.get_subblock_num
        %17 = pto.get_block_num
        %18 = arith.muli %14, %16 : i64
        %19 = arith.addi %18, %15 : i64
        %20 = arith.index_cast %19 : i64 to index
        %21 = arith.muli %17, %16 : i64
        %22 = arith.index_cast %21 : i64 to index
        %23 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %24 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %25 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %26 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %27 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %28 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %29 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %30 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %31 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %32 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>
        %33 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>
        %34 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>
        %35 = pto.alloc_tile : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
        %36 = pto.alloc_tile : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
        %37 = pto.alloc_tile : !pto.tile_buf<vec, 1x8xf32>
        %38 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
        %39 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
        %40 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
        %41 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>
        %42 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>
        %43 = arith.muli %0, %1 : index
        %44 = pto.make_tensor_view %arg0, shape = [%43, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
        %45 = pto.make_tensor_view %arg1, shape = [%43, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
        %46 = pto.make_tensor_view %arg2, shape = [%0, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
        %47 = pto.make_tensor_view %arg3, shape = [%0, %1], strides = [%1, %c1] : !pto.tensor_view<?x?xf16>
        scf.for %arg14 = %20 to %0 step %22 {
          pto.tmuls ins(%23, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tadds ins(%23, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tmuls ins(%24, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tadds ins(%24, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tmuls ins(%25, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tadds ins(%25, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          %48 = arith.muli %arg14, %1 : index
          %49 = arith.addi %3, %c1 : index
          scf.for %arg15 = %c0 to %49 step %c1 {
            pto.tmuls ins(%26, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%27, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            scf.for %arg16 = %c0 to %1 step %c8 {
              %53 = arith.addi %48, %arg16 : index
              %54 = pto.partition_view %44, offsets = [%53, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
              pto.tload ins(%54 : !pto.partition_tensor_view<8x256xf16>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
              pto.tcvt ins(%32 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              %55 = pto.subview %24[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
              pto.tmuls ins(%55, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%37 : !pto.tile_buf<vec, 1x8xf32>)
              %56 = pto.treshape %37 : !pto.tile_buf<vec, 1x8xf32> -> !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
              pto.trowexpanddiv ins(%33, %56 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              pto.tcolexpandmul ins(%33, %25 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              pto.trowsum ins(%33, %34 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%35 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>)
              %57 = pto.treshape %35 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32>
              %58 = pto.subview %28[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
              pto.tmuls ins(%57, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%58 : !pto.tile_buf<vec, 1x8xf32>)
              pto.tcolsum ins(%33, %34 {isBinary = true} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tadd ins(%26, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmul ins(%33, %33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              pto.trowsum ins(%33, %34 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%36 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>)
              %59 = pto.treshape %36 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32>
              %60 = pto.subview %29[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
              pto.tmuls ins(%59, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%60 : !pto.tile_buf<vec, 1x8xf32>)
              pto.tcolsum ins(%33, %34 {isBinary = true} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tadd ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            }
            pto.tmul ins(%28, %28 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%31, %arg11 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsub ins(%29, %31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%29, %arg13 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmaxs ins(%29, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsqrt ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmul ins(%26, %26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%30, %arg10 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsub ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%27, %arg12 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmaxs ins(%27, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsqrt ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            %52 = arith.cmpi eq, %arg15, %c0 : index
            scf.if %52 {
              pto.trowmin ins(%29, %31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%39 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>)
              pto.trowmin ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%40 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>)
              %53 = pto.treshape %39 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
              %54 = pto.treshape %40 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
              %55 = pto.treshape %38 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
              pto.tmin ins(%53, %54 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>, !pto.tile_buf<vec, 1x8xf32, valid=?x?>) outs(%55 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>)
              pto.tadds ins(%55, %arg9 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>, f32) outs(%55 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>)
            } else {
              pto.trowexpanddiv ins(%29, %38 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              %cst_1 = arith.constant 9.99999996E-13 : f32
              pto.tmaxs ins(%29, %cst_1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tlog ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmuls ins(%29, %arg8 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.texp ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmul ins(%24, %29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.trowexpanddiv ins(%27, %38 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              %cst_2 = arith.constant 9.99999996E-13 : f32
              pto.tmaxs ins(%27, %cst_2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tlog ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmuls ins(%27, %arg8 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.texp ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmul ins(%23, %27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.trecip ins(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            }
          }
          scf.for %arg15 = %c0 to %1 step %c8 {
            %52 = arith.addi %48, %arg15 : index
            %53 = pto.partition_view %44, offsets = [%52, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
            pto.tload ins(%53 : !pto.partition_tensor_view<8x256xf16>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
            pto.tcvt ins(%32 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
            %54 = pto.subview %24[%c0, %arg15] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
            pto.tmuls ins(%54, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%37 : !pto.tile_buf<vec, 1x8xf32>)
            %55 = pto.treshape %37 : !pto.tile_buf<vec, 1x8xf32> -> !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
            pto.trowexpanddiv ins(%33, %55 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
            pto.tcolexpandmul ins(%33, %25 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
            pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
            %56 = arith.addi %48, %arg15 : index
            %57 = pto.partition_view %45, offsets = [%56, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
            pto.tstore ins(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%57 : !pto.partition_tensor_view<8x256xf16>)
          }
          pto.tcvt ins(%23 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%41 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>)
          %50 = pto.partition_view %46, offsets = [%arg14, %c0], sizes = [%c1, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<1x256xf16>
          pto.tstore ins(%41 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>) outs(%50 : !pto.partition_tensor_view<1x256xf16>)
          pto.tcvt ins(%24 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%42 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>)
          %51 = pto.partition_view %47, offsets = [%arg14, %c0], sizes = [%c1, %1] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<1x256xf16>
          pto.tstore ins(%42 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>) outs(%51 : !pto.partition_tensor_view<1x256xf16>)
        }
      }
    }
    return
  }
}
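
For reference when reading the emitted base addresses, the raw storage each pto.alloc_tile needs can be estimated from its static shape and element type. This is a back-of-the-envelope sketch: it assumes every tile occupies its full static shape regardless of valid_row/valid_col, and it ignores any alignment or padding the allocator may add.

```python
from math import prod

DTYPE_BYTES = {"f16": 2, "f32": 4}

# (static shape, element type, number of tiles) per tile type in the IR above.
TILES = [
    ((1, 256), "f32", 9),  # %23..%31
    ((8, 256), "f16", 1),  # %32
    ((8, 256), "f32", 2),  # %33, %34
    ((8, 1), "f32", 5),    # %35, %36, %38, %39, %40
    ((1, 8), "f32", 1),    # %37
    ((1, 256), "f16", 2),  # %41, %42
]


def tile_bytes(shape, dtype):
    """Bytes for one tile, assuming a dense buffer with no padding."""
    return prod(shape) * DTYPE_BYTES[dtype]


def total_bytes(tiles):
    """Bytes needed if no two tiles share storage."""
    return sum(tile_bytes(shape, dtype) * count for shape, dtype, count in tiles)


print(tile_bytes((1, 256), "f32"))  # 1024
print(tile_bytes((8, 256), "f32"))  # 8192
print(total_bytes(TILES))           # 30912
```

Under these assumptions the 20 tiles need 30912 bytes when none of them share storage.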

Expected behavior

  1. ptoas is deterministic: same input bytes → same output bytes (identical .cpp).
  2. Whatever UB layout the allocator picks must be functionally correct on every dynamic shape that satisfies the kernel's scf.if guard, with no aliasing between tiles whose lifetimes overlap.
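
The second property can be stated as a check over a layout: no two tiles may overlap both in address range and in lifetime. The following is an illustrative sketch only. It is not how ptoas represents tiles, and the Tile fields and example values are hypothetical. Lifetimes are given as indices into a linearized op sequence; a tile used inside a loop has to be treated as live for the whole loop, because later iterations read values written by earlier ones.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Tile:
    name: str
    base: int   # UB byte offset
    size: int   # bytes
    first: int  # index of the first op that touches the tile
    last: int   # index of the last op that touches the tile (inclusive)


def _overlap(a_lo, a_hi, b_lo, b_hi):
    """True if the half-open intervals [a_lo, a_hi) and [b_lo, b_hi) intersect."""
    return a_lo < b_hi and b_lo < a_hi


def aliasing_pairs(tiles):
    """Return pairs of tiles that share bytes while both are live."""
    bad = []
    for a, b in combinations(tiles, 2):
        shares_memory = _overlap(a.base, a.base + a.size, b.base, b.base + b.size)
        both_live = _overlap(a.first, a.last + 1, b.first, b.last + 1)
        if shares_memory and both_live:
            bad.append((a.name, b.name))
    return bad


# Hypothetical layout: t1 collides with t0 while both are live;
# t2 reuses t0's bytes only after t0 and t1 are dead, which is fine.
layout = [
    Tile("t0", base=0, size=1024, first=0, last=10),
    Tile("t1", base=512, size=1024, first=5, last=12),
    Tile("t2", base=0, size=1024, first=13, last=20),
]
print(aliasing_pairs(layout))  # [('t0', 't1')]
```

A correct allocator should make a check of this kind return an empty list for every layout it emits.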

Actual behavior / error logs

  1. Identical input → three different .cpp outputs in three consecutive invocations (md5s above).
  2. Some emitted layouts compute wrong results. With the broken layout:
   summary:
     tolerances: rtol=2e-3, atol=1e-3
     ...
     (4, 32, 64, 1, 0, 'mismatch')
     (4, 32, 64, 1, 42, 'mismatch')
     (4, 32, 64, 5, 0, 'mismatch')
     (4, 32, 64, 5, 42, 'mismatch')
     (4, 32, 64, 10, 0, 'mismatch')
     (4, 32, 64, 10, 42, 'mismatch')
     (4, 64, 32, 1, 0, 'mismatch')
     (4, 64, 32, 1, 42, 'mismatch')
     (4, 64, 32, 5, 0, 'mismatch')
     (4, 64, 32, 5, 42, 'mismatch')
     (4, 64, 32, 10, 0, 'mismatch')
     (4, 64, 32, 10, 42, 'mismatch')
     totals: {'match': 54, 'mismatch': 12, 'skip': 0}
  3. The following warning (which is probably related) is generated at compile time by bisheng:
In file included from ./caller.cpp:4:
./sinkhorn.cpp:75:30: warning: & has lower precedence than <=; <= will be evaluated first [-Wparentheses]
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                             ^~~~~~~~~~~
./sinkhorn.cpp:75:30: note: place parentheses around the '<=' expression to silence this warning
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                             ^
                               (        )
./sinkhorn.cpp:75:30: note: place parentheses around the & expression to evaluate it first
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                             ^
       (                         )
./sinkhorn.cpp:75:55: warning: & has lower precedence than ==; == will be evaluated first [-Wparentheses]
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                                                      ^~~~~~~~~~~~~~~~~
./sinkhorn.cpp:75:55: note: place parentheses around the '==' expression to silence this warning
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                                                      ^
                                                        (              )
./sinkhorn.cpp:75:55: note: place parentheses around the & expression to evaluate it first
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                                                      ^
      (                                                         )
2 warnings generated.

Git commit

fd2107e

Host platform

Linux (aarch64)

Target Ascend arch (if relevant)

a3

PTOAS build level (if relevant)

None
