[Feat] Implement doubly-stochastic Sinkhorn normalization kernel by Mocchibird · Pull Request #134 · huawei-csl/pto-kernels

Mocchibird · 2026-04-21T15:52:42Z

Sinkhorn pto-isa vs torch

[Plots: Speedup, Bandwidth]

learning-chip · 2026-04-21T20:15:52Z

+    TASSIGN(hFlat, UbOfs::MAT_HALF);
+    TASSIGN(fFlat, UbOfs::MAT_FP32);
+    TCVT(fFlat, hFlat, RoundMode::CAST_NONE);


Can the computation be all done in FP16 without upcasting to FP32?
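One way to explore this question offline is to emulate FP16 rounding on the host and compare against a double-precision run. The sketch below is only that: it assumes plain alternating row/column normalization with `eps` added to each sum (the kernel's exact first step and `eps` placement are not shown in this thread), and `to_half`, `sinkhorn_emulated`, and the sample matrix are hypothetical names and data, not part of the PR. It relies on the `struct` format character `'e'` (IEEE 754 binary16) to round after every arithmetic operation.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def identity(x):
    """No extra rounding: keep full double precision."""
    return x


def sinkhorn_emulated(mat, repeat, eps, rnd):
    """Alternate row and column normalization, applying rnd after each op.

    Assumption: plain row/column normalization with eps added to each sum;
    this is an illustration, not the kernel's exact sequence.
    """
    k = len(mat)
    m = [[rnd(v) for v in row] for row in mat]
    for _ in range(repeat):
        # Row step: divide each row by (row sum + eps).
        for i in range(k):
            s = 0.0
            for j in range(k):
                s = rnd(s + m[i][j])
            d = rnd(s + eps)
            for j in range(k):
                m[i][j] = rnd(m[i][j] / d)
        # Column step: divide each column by (column sum + eps).
        for j in range(k):
            s = 0.0
            for i in range(k):
                s = rnd(s + m[i][j])
            d = rnd(s + eps)
            for i in range(k):
                m[i][j] = rnd(m[i][j] / d)
    return m


# Hypothetical 4x4 positive input, matching the mHC size n = 4.
A = [[1.0, 2.0, 3.0, 4.0],
     [2.0, 9.0, 1.0, 5.0],
     [7.0, 3.0, 6.0, 2.0],
     [4.0, 1.0, 8.0, 3.0]]

ref = sinkhorn_emulated(A, repeat=20, eps=1e-6, rnd=identity)
half = sinkhorn_emulated(A, repeat=20, eps=1e-6, rnd=to_half)

max_abs_diff = max(abs(ref[i][j] - half[i][j]) for i in range(4) for j in range(4))
half_row_sums = [sum(row) for row in half]
print("max |fp16 - fp64| =", max_abs_diff)
print("fp16 row sums     =", half_row_sums)
```

One thing this makes visible: the spacing of FP16 values just above 1.0 is about 9.8e-4, so an `eps` of 1e-6 added to a sum of order 1 is rounded away in an all-FP16 path. Whether that matters for the kernel depends on how `eps` is actually used there.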

learning-chip · 2026-04-21T20:27:42Z

+    Shape2D<T> outShape(K, K);
+    DynStride outStride(K);
+    Global2D<T, MAX_DIM> gOut(gm_out, outShape, outStride);
+
+    wait_flag(PIPE_MTE3, PIPE_V, EVENT_ID0);
+    set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
+    wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
+    TSTORE(gOut, outHalf);


A single TSTORE handles only one [K, K] matrix. In the mHC paper n = 4, so the matrix is only 4x4, just 32 bytes in FP16. The DMA needs transfers of >= 16 KiB to reach high bandwidth utilization, so the matrices need to be processed in batches.

(You can also confirm in the tilelang example_mhc_pre.py that hc_mult = 4.)
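The arithmetic behind this comment can be checked with a short sketch. It assumes 2-byte FP16 elements and takes the 16 KiB figure from the comment above as the target transfer size; `matrix_bytes` and `matrices_per_transfer` are hypothetical helpers, not part of the kernel.

```python
import math

FP16_BYTES = 2                 # assumption: elements are half precision
TARGET_DMA_BYTES = 16 * 1024   # threshold quoted in the review comment


def matrix_bytes(k, elem_bytes=FP16_BYTES):
    """Size in bytes of one contiguous K x K matrix."""
    return k * k * elem_bytes


def matrices_per_transfer(k, elem_bytes=FP16_BYTES, target=TARGET_DMA_BYTES):
    """Smallest number of K x K matrices whose combined size reaches target bytes."""
    return math.ceil(target / matrix_bytes(k, elem_bytes))


for k in (4, 8, 16, 32, 64, 128):
    print(f"K={k:3d}: {matrix_bytes(k):6d} B per matrix, "
          f"{matrices_per_transfer(k):4d} matrices per >= 16 KiB transfer")
```

For K = 4 this gives 512 matrices per transfer, which is the scale of batching the comment is asking for.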

learning-chip · 2026-04-21T20:33:43Z

+    # Override default grids for sinkhorn: batch=N (matrices), K=dim
+    batches = args.batches if args.batches else [1, 4, 8, 16, 32, 64]
+    dims = args.hidden_dims if args.hidden_dims else [4, 8, 16, 32, 64, 128]


To emulate the mHC setting, the dim should be as small as 4, while the batch can be very large. Check the input shape of:

res_mix = mixes[:, 2 * hc_mult :].view(-1, hc_mult, hc_mult)
res_mix = sinkhorn_normalize_ref(res_mix, repeat=sinkhorn_repeat, eps=hc_sinkhorn_eps)

from the tilelang example.

Also check tilelang-gpu's bandwidth util ratio for those shapes, to get a reasonable expectation for NPU perf.
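To make the shape and the bandwidth-utilization ratio concrete, here is a minimal sketch. It assumes the kernel reads and writes each matrix exactly once and that elements are 2-byte FP16. The helper names, the batch of 8192 matrices, the 10 µs timing, and the 1 TB/s peak are all placeholders for illustration, not measurements or hardware specifications.

```python
def view_as_matrices(flat, hc_mult):
    """Pure-Python stand-in for `.view(-1, hc_mult, hc_mult)` on a flat list."""
    size = hc_mult * hc_mult
    if len(flat) % size != 0:
        raise ValueError("length must be a multiple of hc_mult * hc_mult")
    return [
        [flat[b + r * hc_mult : b + (r + 1) * hc_mult] for r in range(hc_mult)]
        for b in range(0, len(flat), size)
    ]


def min_traffic_bytes(n, k, elem_bytes=2):
    """Bytes for reading and writing n K x K matrices once each."""
    return 2 * n * k * k * elem_bytes


def bandwidth_utilization(n, k, seconds, peak_bytes_per_s, elem_bytes=2):
    """Achieved fraction of peak bandwidth for one kernel launch."""
    return min_traffic_bytes(n, k, elem_bytes) / seconds / peak_bytes_per_s


# Hypothetical example: hc_mult = 4 and a flat buffer holding 8192 matrices.
hc_mult = 4
flat = [float(i % 7 + 1) for i in range(8192 * hc_mult * hc_mult)]
mats = view_as_matrices(flat, hc_mult)
print(len(mats), len(mats[0]), len(mats[0][0]))
print(min_traffic_bytes(len(mats), hc_mult), "bytes moved at minimum")
# The timing and peak figures below are placeholders, not measurements.
print(bandwidth_utilization(len(mats), hc_mult, seconds=10e-6, peak_bytes_per_s=1e12))
```

Computing the same ratio for the GPU and NPU runs on identical (N, K) shapes gives a like-for-like comparison, since the minimum traffic depends only on the shape and element size.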

learning-chip · 2026-04-22T12:34:12Z

+AICORE void sinkhorn(__gm__ T *in, __gm__ T *out, uint32_t N, uint32_t K,
+                     uint32_t repeat, float eps) {


repeat can be a compile-time template parameter / const, just like for the hadamard kernel

learning-chip · 2026-04-22T12:35:15Z

+      for (uint32_t it = 1; it < repeat; ++it) {
+        TASSIGN(v, VC);
+        TROWSUM(v, m, t);
+        pipe_barrier(PIPE_V);
+        TROWEXPANDDIV(m, m, v);
+        pipe_barrier(PIPE_V);
+        CN();
+      }


This loop can be statically unrolled to save some scalar computes

- Introduced DISPATCH_SHAPES to cover various dispatch paths in kernel_sinkhorn.cpp based on batch size (N) and K values.
- Added DISPATCH_CASES for efficient testing of different (batch, K) combinations.
- Expanded DENSE_SHAPES for broader numerical regression coverage.
- Consolidated TEST_CASES to eliminate duplicates from DISPATCH and DENSE shapes.
- Updated test_output_is_doubly_stochastic to validate across representative shapes for each dispatch path.

Mocchibird added 3 commits April 21, 2026 15:39

[Feat] Implement doubly-stochastic Sinkhorn normalization kernel and …

d9c72dd

…associated benchmarks, tests, and documentation

better plotting

a88d98e

[chore] linting

48f6ec1

learning-chip reviewed Apr 21, 2026

Comment thread examples/jit_cpp/sinkhorn/test_sinkhorn.py Outdated

learning-chip reviewed Apr 21, 2026

learning-chip requested changes Apr 21, 2026

learning-chip reviewed Apr 21, 2026

learning-chip mentioned this pull request Apr 22, 2026

Synkhorn dynamic multicore example huawei-csl/pto-dsl#117

Open

[Feat] Update Sinkhorn benchmark and plotting functions for improved …

b51dbe1

…batch handling and visualization

learning-chip reviewed Apr 22, 2026

Mocchibird wants to merge 5 commits into huawei-csl:main from Mocchibird:feat/sinkhorn
