[Bug] Non-deterministic UB allocator: identical MLIR input produces different `.cpp` outputs across runs, some of which compute incorrect results

### Component

PTO Dialect / ODS (include/PTO/IR)

### Description

Running `ptoas` repeatedly on a **byte-identical** input `.pto` file produces **different `.cpp` outputs** across invocations. The differences are not cosmetic: the UB tile base addresses (`pto.pointer_cast(%cN_i64)`) get reshuffled, and some of the resulting layouts compute **numerically wrong results** at runtime. This suggests that, on those layouts, two logically distinct tiles with overlapping lifetimes end up sharing storage and overwrite each other.

Concretely, on our Sinkhorn kernel:

- `md5sum sinkhorn.pto` is stable (same bytes every time the Python builder runs).
- `md5sum` of `ptoas`'s output `.cpp` differs across consecutive invocations.
- The `.so` built from the "lucky" layout passes 66/66 correctness cases at `rtol=2e-3, atol=1e-3`.
- The `.so` built from an "unlucky" layout fails 12/66 cases at the same tolerance, and **still fails 17/66 cases at the very loose `rtol=5e-2, atol=1e-2`** (so this is real wrong-arithmetic, not a precision regression).
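
To illustrate why a failure at the loose tolerance points to wrong arithmetic rather than rounding drift, here is a minimal sketch of an elementwise closeness test. It assumes `run_sinkhorn.py` uses the common `|actual - expected| <= atol + rtol * |expected|` criterion (the one `numpy.isclose` documents); that is an assumption about the script, and `within_tolerance` is a hypothetical helper, not code from the repository. The numeric values are illustrative only, not measurements from the kernel.

```python
def within_tolerance(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Assumed criterion: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


TIGHT = dict(rtol=2e-3, atol=1e-3)  # tolerance used for the 66-case run
LOOSE = dict(rtol=5e-2, atol=1e-2)  # the "very loose" tolerance

expected = 1.0
drifted = 1.02    # illustrative: a 2% error, plausible for reduced precision
clobbered = 1.5   # illustrative: a value overwritten by an unrelated tile

print(within_tolerance(drifted, expected, **TIGHT))    # False
print(within_tolerance(drifted, expected, **LOOSE))    # True
print(within_tolerance(clobbered, expected, **LOOSE))  # False
```

A precision-level drift passes once the tolerance is relaxed, whereas a clobbered value keeps failing, which matches the behaviour reported for the "unlucky" layout.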

The failures are concentrated on specific dynamic shapes (`N=4, K=32, L=64` and `N=4, K=64, L=32` for our kernel). This is consistent with an allocator-aliasing bug whose manifestation depends on which loop iteration first happens to write to the colliding tile.

This makes builds non-reproducible, and the resulting miscompilation is silent.

### Reproduction (minimal)

The `.pto` is generated by the PTODSL Sinkhorn builder in https://github.com/huawei-csl/pto-dsl/pull/117:

- Builder: `examples/aot/sinkhorn_dynamic_multicore/sinkhorn_builder.py`
- Generated IR (attached below): `sinkhorn.pto`

```bash
cd /mounted_home/pto-dsl/examples/aot/sinkhorn_dynamic_multicore

# 1. Confirm the input .pto is deterministic.
python3 sinkhorn_builder.py > /tmp/a.pto
python3 sinkhorn_builder.py > /tmp/b.pto
md5sum /tmp/a.pto /tmp/b.pto
#   5117399319df1c521c3fe0ee9f750e99  /tmp/a.pto
#   5117399319df1c521c3fe0ee9f750e99  /tmp/b.pto       <-- IDENTICAL

# 2. Run ptoas three times on the *same* input.
for i in 1 2 3; do
    ptoas --enable-insert-sync /tmp/a.pto -o /tmp/r$i.cpp
    md5sum /tmp/r$i.cpp
done
# Observed:
#   f4c0d4a6d4e654a9f0fe8b57f8735ce0  /tmp/r1.cpp
#   330760db3d1d225daadf92967cfc0b6e  /tmp/r2.cpp
#   dfe9db622c2d46caf0563c572cd6a233  /tmp/r3.cpp     <-- ALL DIFFERENT
```
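
For a larger sample than three runs, the same check can be scripted. The sketch below counts the distinct MD5 digests over repeated runs. `digest_runs`, `run_ptoas`, and `is_deterministic` are hypothetical helper names, and the `/tmp/det_<i>.cpp` scratch path is an assumption; the `ptoas` command line is the one used in the shell loop above.

```python
import hashlib
import subprocess
from collections import Counter
from pathlib import Path
from typing import Callable


def digest_runs(run_once: Callable[[int], bytes], runs: int = 10) -> Counter:
    """Call run_once(i) for each run index and count the MD5 digests seen."""
    return Counter(hashlib.md5(run_once(i)).hexdigest() for i in range(runs))


def run_ptoas(i: int, src: str = "/tmp/a.pto") -> bytes:
    """Run ptoas once with the flags from the reproduction; return the .cpp bytes."""
    out = Path(f"/tmp/det_{i}.cpp")  # hypothetical scratch path
    subprocess.run(["ptoas", "--enable-insert-sync", src, "-o", str(out)], check=True)
    return out.read_bytes()


def is_deterministic(counts: Counter) -> bool:
    """A deterministic tool yields exactly one distinct digest."""
    return len(counts) == 1
```

Calling `digest_runs(run_ptoas, runs=20)` and passing the result to `is_deterministic` gives a single pass/fail answer; the counter itself shows how many distinct layouts were emitted and how often each occurred.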

Diffing the outputs shows the divergence is in the UB pointer-cast constants emitted at the top of the kernel, e.g.:

```diff
-  const int64_t v28 = 15392;
-  const int64_t v29 = 16416;
-  const int64_t v30 = 25632;
+  const int64_t v28 = 1024;
+  const int64_t v29 = 29888;
+  const int64_t v30 = 3104;
   ...
```
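
To list every constant that moved between two emitted files without reading the full diff, the declarations can be extracted with a regular expression. This is a sketch that assumes the constants always appear in the `const int64_t vN = <integer>;` form shown above; `read_constants` and `changed_constants` are hypothetical helper names.

```python
import re
from typing import Dict, Optional, Tuple

# Matches declarations such as: const int64_t v28 = 15392;
CONST_RE = re.compile(r"const\s+int64_t\s+(v\d+)\s*=\s*(-?\d+)\s*;")


def read_constants(cpp_text: str) -> Dict[str, int]:
    """Map each vN constant name to its integer value."""
    return {name: int(value) for name, value in CONST_RE.findall(cpp_text)}


def changed_constants(a_text: str, b_text: str) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return {name: (value_in_a, value_in_b)} for constants that differ."""
    a, b = read_constants(a_text), read_constants(b_text)
    names = sorted(a.keys() | b.keys(), key=lambda n: int(n[1:]))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


run_a = """
  const int64_t v28 = 15392;
  const int64_t v29 = 16416;
  const int64_t v30 = 25632;
"""
run_b = """
  const int64_t v28 = 1024;
  const int64_t v29 = 29888;
  const int64_t v30 = 3104;
"""
print(changed_constants(run_a, run_b))
# {'v28': (15392, 1024), 'v29': (16416, 29888), 'v30': (25632, 3104)}
```

In practice the two strings would come from files such as `/tmp/r1.cpp` and `/tmp/r2.cpp`, for example via `Path(...).read_text()`.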

To observe the **functional** consequence (one layout correct, another broken), build twice and run the correctness tests:

```bash
bash compile.sh                                         # bad layout (example)
python ./run_sinkhorn.py --lib ./sinkhorn_lib.so --rtol 2e-3 --atol 1e-3
#   totals: {'match': 54, 'mismatch': 12, 'skip': 0}
```
```bash
bash compile.sh                                         # try again — sometimes good, sometimes bad
python ./run_sinkhorn.py --lib ./sinkhorn_lib.so --rtol 2e-3 --atol 1e-3
#   totals: {'match': 66, 'mismatch':  0, 'skip': 0}
```

The 12 failing cases are always the same shapes: every (order, seed) combination of `(N=4, K=32, L=64)` and `(N=4, K=64, L=32)`.
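
One way to confirm that the failures cluster by shape is to group the summary tuples printed by `run_sinkhorn.py`. The sketch below assumes each tuple is laid out as `(N, K, L, order, seed, status)`, which is inferred from the shapes and the "(order, seed)" wording in this report rather than from the script itself; `mismatches_by_shape` is a hypothetical helper.

```python
import ast
from collections import Counter


def mismatches_by_shape(summary_text: str) -> Counter:
    """Count 'mismatch' entries per (N, K, L), ignoring non-tuple lines."""
    counts: Counter = Counter()
    for line in summary_text.splitlines():
        line = line.strip()
        if not (line.startswith("(") and line.endswith(")")):
            continue  # skip headers, "..." and the totals line
        n, k, l, _order, _seed, status = ast.literal_eval(line)
        if status == "mismatch":
            counts[(n, k, l)] += 1
    return counts


# A few lines copied from the failing run's summary
sample = """
     (4, 32, 64, 1, 0, 'mismatch')
     (4, 32, 64, 1, 42, 'mismatch')
     (4, 64, 32, 1, 0, 'mismatch')
     (4, 64, 32, 1, 42, 'mismatch')
     totals: {'match': 54, 'mismatch': 12, 'skip': 0}
"""
print(dict(mismatches_by_shape(sample)))
# {(4, 32, 64): 2, (4, 64, 32): 2}
```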

The MLIR is always the same:

```mlir
module {
  func.func @_kernel(%arg0: !pto.ptr<f16>, %arg1: !pto.ptr<f16>, %arg2: !pto.ptr<f16>, %arg3: !pto.ptr<f16>, %arg4: i32, %arg5: i32, %arg6: i32, %arg7: i32, %arg8: f32, %arg9: f32, %arg10: f32, %arg11: f32, %arg12: f32, %arg13: f32) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c256 = arith.constant 256 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 1.000000e+00 : f32
    %0 = arith.index_cast %arg4 : i32 to index
    %1 = arith.index_cast %arg5 : i32 to index
    %2 = arith.index_cast %arg6 : i32 to index
    %3 = arith.index_cast %arg7 : i32 to index
    pto.section.vector {
      %4 = arith.cmpi sgt, %1, %c0 : index
      %5 = arith.cmpi sgt, %2, %c0 : index
      %6 = arith.andi %4, %5 : i1
      %7 = arith.cmpi sge, %c256, %1 : index
      %8 = arith.andi %6, %7 : i1
      %9 = arith.cmpi sge, %c256, %2 : index
      %10 = arith.andi %8, %9 : i1
      %11 = arith.remsi %1, %c8 : index
      %12 = arith.cmpi eq, %11, %c0 : index
      %13 = arith.andi %10, %12 : i1
      scf.if %13 {
        %14 = pto.get_block_idx
        %15 = pto.get_subblock_idx
        %16 = pto.get_subblock_num
        %17 = pto.get_block_num
        %18 = arith.muli %14, %16 : i64
        %19 = arith.addi %18, %15 : i64
        %20 = arith.index_cast %19 : i64 to index
        %21 = arith.muli %17, %16 : i64
        %22 = arith.index_cast %21 : i64 to index
        %23 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %24 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %25 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %26 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %27 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %28 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %29 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %30 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %31 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>
        %32 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>
        %33 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>
        %34 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>
        %35 = pto.alloc_tile : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
        %36 = pto.alloc_tile : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
        %37 = pto.alloc_tile : !pto.tile_buf<vec, 1x8xf32>
        %38 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
        %39 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
        %40 = pto.alloc_tile valid_row = %c1 valid_col = %c1 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>
        %41 = pto.alloc_tile valid_col = %2 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>
        %42 = pto.alloc_tile valid_col = %1 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>
        %43 = arith.muli %0, %1 : index
        %44 = pto.make_tensor_view %arg0, shape = [%43, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
        %45 = pto.make_tensor_view %arg1, shape = [%43, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
        %46 = pto.make_tensor_view %arg2, shape = [%0, %2], strides = [%2, %c1] : !pto.tensor_view<?x?xf16>
        %47 = pto.make_tensor_view %arg3, shape = [%0, %1], strides = [%1, %c1] : !pto.tensor_view<?x?xf16>
        scf.for %arg14 = %20 to %0 step %22 {
          pto.tmuls ins(%23, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tadds ins(%23, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tmuls ins(%24, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tadds ins(%24, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tmuls ins(%25, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          pto.tadds ins(%25, %cst_0 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
          %48 = arith.muli %arg14, %1 : index
          %49 = arith.addi %3, %c1 : index
          scf.for %arg15 = %c0 to %49 step %c1 {
            pto.tmuls ins(%26, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%27, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            scf.for %arg16 = %c0 to %1 step %c8 {
              %53 = arith.addi %48, %arg16 : index
              %54 = pto.partition_view %44, offsets = [%53, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
              pto.tload ins(%54 : !pto.partition_tensor_view<8x256xf16>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
              pto.tcvt ins(%32 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              %55 = pto.subview %24[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
              pto.tmuls ins(%55, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%37 : !pto.tile_buf<vec, 1x8xf32>)
              %56 = pto.treshape %37 : !pto.tile_buf<vec, 1x8xf32> -> !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
              pto.trowexpanddiv ins(%33, %56 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              pto.tcolexpandmul ins(%33, %25 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              pto.trowsum ins(%33, %34 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%35 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>)
              %57 = pto.treshape %35 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32>
              %58 = pto.subview %28[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
              pto.tmuls ins(%57, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%58 : !pto.tile_buf<vec, 1x8xf32>)
              pto.tcolsum ins(%33, %34 {isBinary = true} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tadd ins(%26, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmul ins(%33, %33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
              pto.trowsum ins(%33, %34 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%36 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major>)
              %59 = pto.treshape %36 : !pto.tile_buf<vec, 8x1xf32, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32>
              %60 = pto.subview %29[%c0, %arg16] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
              pto.tmuls ins(%59, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%60 : !pto.tile_buf<vec, 1x8xf32>)
              pto.tcolsum ins(%33, %34 {isBinary = true} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tadd ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            }
            pto.tmul ins(%28, %28 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%31, %arg11 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsub ins(%29, %31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%29, %arg13 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmaxs ins(%29, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsqrt ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmul ins(%26, %26 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%30, %arg10 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsub ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmuls ins(%27, %arg12 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tmaxs ins(%27, %cst : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            pto.tsqrt ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            %52 = arith.cmpi eq, %arg15, %c0 : index
            scf.if %52 {
              pto.trowmin ins(%29, %31 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%39 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>)
              pto.trowmin ins(%27, %30 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%40 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>)
              %53 = pto.treshape %39 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
              %54 = pto.treshape %40 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
              %55 = pto.treshape %38 : !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major> -> !pto.tile_buf<vec, 1x8xf32, valid=?x?>
              pto.tmin ins(%53, %54 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>, !pto.tile_buf<vec, 1x8xf32, valid=?x?>) outs(%55 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>)
              pto.tadds ins(%55, %arg9 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>, f32) outs(%55 : !pto.tile_buf<vec, 1x8xf32, valid=?x?>)
            } else {
              pto.trowexpanddiv ins(%29, %38 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              %cst_1 = arith.constant 9.99999996E-13 : f32
              pto.tmaxs ins(%29, %cst_1 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tlog ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmuls ins(%29, %arg8 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.texp ins(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmul ins(%24, %29 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%24 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.trowexpanddiv ins(%27, %38 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 8x1xf32, valid=?x?, blayout=col_major>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              %cst_2 = arith.constant 9.99999996E-13 : f32
              pto.tmaxs ins(%27, %cst_2 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tlog ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmuls ins(%27, %arg8 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, f32) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.texp ins(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.tmul ins(%23, %27 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
              pto.trecip ins(%23 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%25 : !pto.tile_buf<vec, 1x256xf32, valid=1x?>)
            }
          }
          scf.for %arg15 = %c0 to %1 step %c8 {
            %52 = arith.addi %48, %arg15 : index
            %53 = pto.partition_view %44, offsets = [%52, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
            pto.tload ins(%53 : !pto.partition_tensor_view<8x256xf16>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
            pto.tcvt ins(%32 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
            %54 = pto.subview %24[%c0, %arg15] sizes [1, 8] : !pto.tile_buf<vec, 1x256xf32, valid=1x?> -> !pto.tile_buf<vec, 1x8xf32>
            pto.tmuls ins(%54, %cst_0 : !pto.tile_buf<vec, 1x8xf32>, f32) outs(%37 : !pto.tile_buf<vec, 1x8xf32>)
            %55 = pto.treshape %37 : !pto.tile_buf<vec, 1x8xf32> -> !pto.tile_buf<vec, 8x1xf32, blayout=col_major>
            pto.trowexpanddiv ins(%33, %55 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 8x1xf32, blayout=col_major>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
            pto.tcolexpandmul ins(%33, %25 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>, !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%33 : !pto.tile_buf<vec, 8x256xf32, valid=8x?>)
            pto.tcvt ins(%33 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 8x256xf32, valid=8x?>) outs(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>)
            %56 = arith.addi %48, %arg15 : index
            %57 = pto.partition_view %45, offsets = [%56, %c0], sizes = [%c8, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<8x256xf16>
            pto.tstore ins(%32 : !pto.tile_buf<vec, 8x256xf16, valid=8x?>) outs(%57 : !pto.partition_tensor_view<8x256xf16>)
          }
          pto.tcvt ins(%23 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%41 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>)
          %50 = pto.partition_view %46, offsets = [%arg14, %c0], sizes = [%c1, %2] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<1x256xf16>
          pto.tstore ins(%41 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>) outs(%50 : !pto.partition_tensor_view<1x256xf16>)
          pto.tcvt ins(%24 {rmode = #pto<round_mode CAST_RINT>} : !pto.tile_buf<vec, 1x256xf32, valid=1x?>) outs(%42 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>)
          %51 = pto.partition_view %47, offsets = [%arg14, %c0], sizes = [%c1, %1] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<1x256xf16>
          pto.tstore ins(%42 : !pto.tile_buf<vec, 1x256xf16, valid=1x?>) outs(%51 : !pto.partition_tensor_view<1x256xf16>)
        }
      }
    }
    return
  }
}
```

### Expected behavior

1. `ptoas` is deterministic: same input bytes → same output bytes (identical `.cpp`).
2. Whatever UB layout the allocator picks must be functionally correct on **every** dynamic shape that satisfies the kernel's `scf.if` guard, with no aliasing between tiles whose lifetimes overlap.
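
The second expectation can be checked mechanically once the base address and size of each tile are known. The following is a minimal sketch, not the allocator's actual logic: it assumes each tile occupies a contiguous, unpadded byte range of `rows * cols * element_size`, and it flags every intersecting pair regardless of lifetime, so a reported pair is only a bug if the two tiles are live at the same time. The tile names and base addresses in the example are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def tile_bytes(rows: int, cols: int, elem_bytes: int) -> int:
    """Size of a tile, assuming a contiguous layout without padding."""
    return rows * cols * elem_bytes


def find_overlaps(tiles: Dict[str, Tuple[int, int]]) -> List[Tuple[str, str]]:
    """tiles maps name -> (base, size) in bytes; return pairs whose ranges intersect."""
    overlaps = []
    for (name_a, (base_a, size_a)), (name_b, (base_b, size_b)) in combinations(
        sorted(tiles.items()), 2
    ):
        # Half-open ranges [base, base + size) intersect iff each starts before the other ends.
        if base_a < base_b + size_b and base_b < base_a + size_a:
            overlaps.append((name_a, name_b))
    return overlaps


# Hypothetical layout: two 1x256xf32 tiles and one 8x256xf32 tile
layout = {
    "t23": (0, tile_bytes(1, 256, 4)),     # bytes [0, 1024)
    "t24": (1024, tile_bytes(1, 256, 4)),  # bytes [1024, 2048)
    "t33": (1536, tile_bytes(8, 256, 4)),  # bytes [1536, 9728), collides with t24
}
print(find_overlaps(layout))
# [('t24', 't33')]
```

Running such a check on the base addresses of a "lucky" and an "unlucky" layout would show whether the failing layout contains an intersecting pair that the passing one does not.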


### Actual behavior / error logs

1. Identical input → three different `.cpp` outputs in three consecutive invocations (md5s above).
2. Some emitted layouts compute wrong results. With the broken layout:

```text
   summary:
     tolerances: rtol=2e-3, atol=1e-3
     ...
     (4, 32, 64, 1, 0, 'mismatch')
     (4, 32, 64, 1, 42, 'mismatch')
     (4, 32, 64, 5, 0, 'mismatch')
     (4, 32, 64, 5, 42, 'mismatch')
     (4, 32, 64, 10, 0, 'mismatch')
     (4, 32, 64, 10, 42, 'mismatch')
     (4, 64, 32, 1, 0, 'mismatch')
     (4, 64, 32, 1, 42, 'mismatch')
     (4, 64, 32, 5, 0, 'mismatch')
     (4, 64, 32, 5, 42, 'mismatch')
     (4, 64, 32, 10, 0, 'mismatch')
     (4, 64, 32, 10, 42, 'mismatch')
     totals: {'match': 54, 'mismatch': 12, 'skip': 0}
   ```

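3. The following warning is also generated at compile time by `bisheng`. It may be related, although C++ evaluates `<=` and `==` before `&`, so the emitted condition appears to parse the same way as the `arith.andi` chain over `arith.cmpi` results in the MLIR guard: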

```text
In file included from ./caller.cpp:4:
./sinkhorn.cpp:75:30: warning: & has lower precedence than <=; <= will be evaluated first [-Wparentheses]
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                             ^~~~~~~~~~~
./sinkhorn.cpp:75:30: note: place parentheses around the '<=' expression to silence this warning
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                             ^
                               (        )
./sinkhorn.cpp:75:30: note: place parentheses around the & expression to evaluate it first
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                             ^
       (                         )
./sinkhorn.cpp:75:55: warning: & has lower precedence than ==; == will be evaluated first [-Wparentheses]
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                                                      ^~~~~~~~~~~~~~~~~
./sinkhorn.cpp:75:55: note: place parentheses around the '==' expression to silence this warning
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                                                      ^
                                                        (              )
./sinkhorn.cpp:75:55: note: place parentheses around the & expression to evaluate it first
  if (((v6 > v26 & v7 > v26) & v6 <= v24 & v7 <= v24) & v6 % v23 == v26) {
                                                      ^
      (                                                         )
2 warnings generated.
```

### Git commit

fd2107e4f88a1171dc8702161f0f77dd22d29271

### Host platform

Linux (aarch64)

### Target Ascend arch (if relevant)

a3

### PTOAS build level (if relevant)

None
