
[Sync][Refactor] Merge upstream TileLang, multiple refactor and support VMM & multimem#58

Open
Rachmanino wants to merge 995 commits into main from sync/tilelang-0c83691

Conversation

Collaborator

@Rachmanino Rachmanino commented May 6, 2026

  • Merge upstream TileLang
  • Remove the pybind11 dependency and use tvm_ffi for the distributed-utils bindings, for consistency
  • Remove the standalone tilescale_ext package and fold it into the tilescale shared_memory folder
  • Support a VMM allocator with optional multicast VA
  • Support the multimem.{ld_reduce, st, red} PTX instructions and lower to them via LayoutInference, with AutoVectorization up to 128 bits
  • Add examples to illustrate usage
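The 128-bit vectorization target for multimem accesses can be sketched as a simple lane calculation: pack as many elements as fit into one 128-bit access. This is an illustrative model only; `multimem_lanes` and `MULTIMEM_ACCESS_BITS` are hypothetical names, not TileScale API.

```python
# Illustrative sketch (not TileScale code): how many elements one 128-bit
# multimem.{ld_reduce, st, red} access can cover for a given element width.
MULTIMEM_ACCESS_BITS = 128

def multimem_lanes(dtype_bits: int) -> int:
    """Number of elements a single 128-bit multimem access spans."""
    assert MULTIMEM_ACCESS_BITS % dtype_bits == 0
    return MULTIMEM_ACCESS_BITS // dtype_bits

print(multimem_lanes(32))  # fp32: 4 lanes per access
print(multimem_lanes(16))  # fp16/bf16: 8 lanes per access
```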

LeiWang1999 and others added 30 commits February 9, 2026 15:36
tl: downgrade parallel loops for unroll/vectorize

When vectorization resolves to size=1, keep semantics but rewrite ForKind::kParallel to kSerial so downstream transforms (e.g. pragma-unroll) can apply.

Also downgrade kParallel in VectorizeRewriter outputs, and rename/apply PragmaUnrollLoop consistently after partition/vectorize.
…tible wheels) (tile-ai#1821)

* Add lazy-loading stubs for CUDART and NVRTC

Build and ship libcudart_stub.so and libnvrtc_stub.so, then force TVM to link against them so the wheel does not hard-depend on libcudart.so.<major> / libnvrtc.so.<major>. This allows a single wheel to run across CUDA major versions where only libcudart/libnvrtc 12 or 13 is present.
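The lazy-loading idea behind the stubs can be sketched in Python: defer loading until the first symbol is used, trying each supported soname in order so nothing hard-links against one specific major version. This is a minimal illustration of the pattern (using libm as a stand-in for libcudart), not the actual stub implementation.

```python
# Illustrative sketch of a lazy dlopen stub: the real library is resolved on
# first use, trying candidate sonames in order (as the stubs try CUDA 12/13
# majors). LazyLib is a hypothetical name; libm stands in for libcudart.
import ctypes
import ctypes.util

class LazyLib:
    def __init__(self, sonames):
        self._sonames = sonames
        self._lib = None

    def _load(self):
        if self._lib is None:
            for name in self._sonames:
                try:
                    self._lib = ctypes.CDLL(name)
                    break
                except OSError:
                    continue
            else:
                raise OSError(f"none of {self._sonames} could be loaded")
        return self._lib

    def __getattr__(self, sym):
        # Resolve the symbol lazily, only when it is first called.
        return getattr(self._load(), sym)

# First candidate is missing, so the loader falls through to the next one:
libm = LazyLib(["libm_nonexistent.so", ctypes.util.find_library("m") or "libm.so.6"])
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]
print(libm.cos(0.0))  # 1.0
```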

* lint fix

* Guard cudart_stub against CUDA 11 headers

Add compile-time checks for CUDA 12+ runtime headers to avoid signature mismatches (e.g. cudaGraphInstantiate) when the stub is built with older toolkits.

* Update CMakeLists and CUDA stubs for version compatibility

- Adjusted comments in CMakeLists.txt to reflect the correct major versions for libcudart and NVRTC.
- Modified cudart.cc to support CUDA 11.x, including changes to the function pointer typedefs and the GraphInstantiate function to handle both legacy and new signatures.
- Updated nvrtc.cc to include support for NVRTC versions 11.0 to 13.x, ensuring compatibility across different CUDA environments.

* Add CUDA stub library support for portable wheels

- Introduced an option to use POSIX dlopen-based CUDA stub libraries for better compatibility across different CUDA Toolkit versions and CPU-only machines.
- Updated CMakeLists.txt to conditionally enable CUDA stubs based on the platform.
- Added compile-time checks in CUDA, CUDART, and NVRTC stub implementations to ensure they are only built on POSIX systems, providing clear error messages for Windows users.
- Enhanced documentation within the CMake configuration for clarity on the use of CUDA stubs.
* Enhance augmented assignment methods in BaseBuilder and Builder classes

- Updated `aug_assign` method to include an optional `name` parameter for better handling of variable names during augmented assignments.
- Improved handling of immutable variables in the `Builder` class to treat augmented assignments as re-bindings, providing warnings for potential issues.
- Added comments to clarify the changes and ensure user expectations are met regarding variable assignments.

* lint fix
)

Merge libtilelang and libtilelang_modules

Co-authored-by: kurisu6912 <[email protected]>
…ion in ThreadSync (tile-ai#1829)

* Refactor OptimizeForTarget function by removing debug print statements and updating module state visualization

* Add test for sync hoisting in non-uniform if within loop using shared memory

This commit introduces a new test case to verify the correct behavior of sync hoisting when a non-uniform if statement is present inside a loop that utilizes shared memory. The test ensures that the synchronization occurs before the if statement, confirming the expected transformation in the module's intermediate representation.

* fix

* Refactor thread variable handling in thread_storage_sync.cc

This commit removes the optional thread variable lookup by tag and replaces it with direct indexing based on the expected position of threadIdx variables. This change addresses a bug where the last three elements of env_threads_ were incorrectly assumed to always correspond to threadIdx.x/y/z, improving the accuracy of thread variable access in the synchronization logic.

* Update testing/python/transform/test_tilelang_transform_thread_sync.py

Co-authored-by: Copilot <[email protected]>

---------

Co-authored-by: Copilot <[email protected]>
… injection (tile-ai#1828)

reset cur_expect_idx_ when entering new scope
* Fix tilelang global load/store template

* Fix tilelang global load/store template

* Minor fix

---------

Co-authored-by: wangxiangwen <[email protected]>
…ess_ptr` (tile-ai#1827)

* [Refactor] Update atomic operations to use access_ptr for buffer access

* Replaced instances of address_of with access_ptr in atomic operations to improve clarity and maintainability.
* Enhanced atomic operation implementations in various files, including atomic_add, atomic_max, and atomic_min.
* Updated related logic in loop vectorization and tile language to support new access patterns.
* Added a utility function to extract destination scalar dtype for atomic operations, supporting both BufferLoad and tvm_access_ptr.
* Improved kernel cache logic to include a build-stamp for the TileLang runtime library, ensuring cache validity during development.

* lint fix

* Fix access_ptr layout rewrite and atomic vectorization

* lint fix

* Refactor atomic operations to remove const qualifiers from pointer parameters in atomicAddx2 and ToFloat functions. This change enhances type flexibility and consistency across atomic operations.

* Refactor buffer access logic in lower_tile_op and vectorize_loop to enhance clarity and maintainability. Removed unused code and improved offset calculations for tile access. Updated atomic operation handling to streamline dtype extraction and ensure correct vectorization behavior.

* Add logging for loop layout and lowered loop in atomic_add; implement tvm_access_ptr handling in loop_vectorize and lower_tile_op

This commit introduces logging statements to track the loop layout and the result of the lowered loop in the AtomicAddNode. Additionally, it implements a new method to handle tvm_access_ptr calls in the loop_vectorize and lower_tile_op files, enhancing the handling of memory access patterns and ensuring correct vectorization behavior.

* fix

* lint fix

* fix

* Update ToFloat2 and ToFloat4 functions to use const qualifiers for pointer parameters in atomic.h; modify test script to call test_atomic_addx2_float instead of main.

* Refactor tvm_access_ptr handling in loop_vectorize to improve argument validation and buffer resolution logic. Ensure at least three arguments are provided and streamline the search for matching buffers in layout_map_. Update vector size calculations to enhance clarity and maintainability.

* Add handling for tvm_access_ptr in vectorize_loop to support offset substitution and buffer creation

This commit introduces logic to handle tvm_access_ptr calls, allowing for the substitution of loop variables in offsets and the creation of dummy buffers. The new method ensures correct base address calculation and enhances the overall handling of memory access patterns in vectorization.

* Add handling for tvm_access_ptr offset validation in loop_vectorize

This commit enhances the loop_vectorize functionality by adding checks for the element offset in tvm_access_ptr calls. It ensures that offsets are invariant within the vector boundary, improving the robustness of vectorization logic and memory access patterns.
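The invariance condition can be stated concretely: an offset may be vectorized only if it takes the same value across all lanes of one vector, i.e. offset(i) == offset(i - i % vec) for every i. A minimal sketch under that assumption (`offset_invariant_in_vector` is an illustrative helper, not the loop_vectorize implementation):

```python
# Illustrative check: is the access_ptr offset constant within each
# vec-wide group of lanes? (Hypothetical helper, not TileLang code.)
def offset_invariant_in_vector(offset, extent, vec):
    return all(offset(i) == offset(i - i % vec) for i in range(extent))

# i // 4 is invariant within 4-wide vectors; i // 2 is not.
print(offset_invariant_in_vector(lambda i: i // 4, 16, 4))  # True
print(offset_invariant_in_vector(lambda i: i // 2, 16, 4))  # False
```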

* fix

* Enhance tvm_access_ptr handling in loop_vectorize by adding offset validation checks. This update improves the robustness of vectorization logic and ensures offsets remain invariant within vector boundaries.

* Enhance tvm_access_ptr handling in loop_vectorize by adding offset validation checks. This update ensures offsets remain invariant within vector boundaries, improving memory access robustness and vectorization logic.

* fix

* Add tl.access_ptr support in TileLang

This commit introduces the `tl.access_ptr` operation, which carries pointer access metadata in the frontend. It allows for better handling of buffer loads by retaining necessary information for downstream analysis. The new operation is integrated into various components, including the loop vectorization and memory access legalizations, ensuring compatibility with existing functionality while enhancing the robustness of memory access patterns.

* lint fix

* fix

* Add support for tl.access_ptr in atomic operations

This commit updates the AtomicAddNode and AtomicOpBaseNode implementations to utilize the new tl.access_ptr operation for building access pointers to destination elements. It enhances the handling of buffer loads by incorporating type checks for index data types, improving the robustness of memory access patterns. Additionally, it removes unnecessary logging statements in the lowering process, streamlining the codebase.

* lint fix

* Refactor buffer load extraction in vectorize_loop to check for non-empty arguments in access_ptr calls. This change improves the robustness of buffer load handling by ensuring valid argument presence before processing.

* Remove unused fallback logic in KernelCache and clean up import statements in test_tilelang_language_access_ptr.py for improved code clarity.

* Refactor memory access in BitNet and dequantization examples to use tl.access_ptr

This update replaces T.address_of calls with T.access_ptr in multiple files, enhancing memory access handling by specifying read and write permissions. This change improves code clarity and aligns with recent enhancements in pointer access management.
… support (tile-ai#1839)

* [Feature] Add packed FP32x2 math intrinsics and support for CUDA/HIP

This commit introduces new packed FP32x2 math operations: fadd2, fmul2, and fma2, which leverage PTX instructions on supported architectures. The changes include:

- Definitions and implementations of fadd2, fmul2, and fma2 in both CUDA and HIP codegen files.
- New Python API functions for these operations, ensuring they validate input types and handle fallbacks for unsupported architectures.
- Documentation updates to reflect the new intrinsics in the math language module.

These enhancements improve performance for vectorized floating-point operations in TileLang.
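The fallback semantics for unsupported architectures can be sketched lane-wise: each packed op applies the scalar operation to both halves of the (x, y) pair. Function names mirror the intrinsics but this is an illustrative model, not the TileLang implementation.

```python
# Illustrative scalar fallbacks for the packed FP32x2 ops: apply the
# scalar operation to each lane of the pair. (Sketch, not TileLang code.)
def fadd2(a, b):
    return (a[0] + b[0], a[1] + b[1])

def fmul2(a, b):
    return (a[0] * b[0], a[1] * b[1])

def fma2(a, b, c):
    return (a[0] * b[0] + c[0], a[1] * b[1] + c[1])

print(fma2((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)))  # (8.0, 14.0)
```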

* revert hip changes

* annotate cuda only

---------

Co-authored-by: Zhiwen Mo <[email protected]>
…mbar` (tile-ai#1774)

* [BugFix] use BufferLoad for mbar in tcgen5

* add 1SM WS gemm example

* lint

* use shuffle elect to issue tma

* support Buffer to umma_arrive

* update tcgen5 ws examples

* fix

* lint

* revert the use of shuffle_elect

* fix review issues

* disable loop unswitching as workaround for a bug

* lint
tile-ai#1762)

* [Feature] Add new hierarchical reduce with less workspace size and new ptx

* Lint

* Add intrinsic functions and more types

* Fix sm100a redux instruction

* Fix data type

* [Test] Add tests for reduce operations

* [Refactor] Unify AllReduce barrier mechanism via policy template

Replace the duplicated run()/run_hopper() methods in AllReduce with a
single run() that accepts a Barrier policy template parameter
(SyncThreadsBarrier or NamedBarrier<N>). Extract the shared inter-warp
reduction logic into a warp_inter_reduce helper, and split the dispatch
into private hierarchical_reduce/butterfly_reduce methods.

Update codegen in reduce.cc and finalize_reducer.cc to emit
NamedBarrier<all_threads> for SM >= 90 targets instead of the old
all_threads + run_hopper pattern.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
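The policy-template idea above has a simple Python analogue: instead of duplicated run()/run_hopper() methods, a single reduction takes the barrier as a parameter, so SyncThreadsBarrier vs. NamedBarrier becomes a caller choice rather than copied code. A minimal single-threaded sketch (names are illustrative, not the C++ template):

```python
# Illustrative strategy-pattern analogue of the Barrier policy template:
# one run() whose synchronization step is supplied by the caller.
def all_reduce(values, barrier):
    """Tree reduction; `barrier` stands in for SyncThreadsBarrier/NamedBarrier."""
    step, n = 1, len(values)
    while step < n:
        for i in range(0, n, 2 * step):
            if i + step < n:
                values[i] += values[i + step]
        barrier()  # the policy decides which barrier instruction is emitted
        step *= 2
    return values[0]

calls = []
print(all_reduce([1, 2, 3, 4], barrier=lambda: calls.append("bar.sync")))  # 10
print(len(calls))  # 2 barrier invocations for 4 inputs
```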

* Add handling for non-float types in warp_reduce function

This update introduces a new case in the warp_reduce function to cast non-float types to float using static_cast. This enhancement improves type flexibility and ensures compatibility with a broader range of input types during reduction operations.

* [Refactor] Remove hierarchical reduce because of performance regression

* [BugFix] Buffer for reduction

---------

Co-authored-by: LeiWang1999 <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
* use nvrtc/cudart come with torch

* remove cuda 13 job

* add assert

* update error message

* resolve issues

* vibe doc

* Refactor CUDART and NVRTC API creation by removing null handle checks. This change simplifies the API initialization process in both cudart.cc and nvrtc.cc, ensuring that the function pointer lookups proceed without early returns.

* lint

* add env back

* Add test against nightly torch

* fix

* test with other torch

* fix

* fix

* fix

* Print all version check at end

* fix tvm

* Refactor CUDA and NVRTC stubs to improve version compatibility and lazy loading mechanism. Update documentation to clarify the purpose and implementation of stubs for better understanding.

* lint
* Optimize templates for half/bfloat16

* Fix lint

* Minor fix

* Enhance ReduceOpNode to handle max/min/absmax cases with clear=false. Update initialization logic for temporary buffers and improve reduction operations in reduce.cc. Add print statements for debugging in test_tilelang_language_reduce.py.

---------

Co-authored-by: LeiWang1999 <[email protected]>
* [BugFix] Fix eager mode where there is no tensor args

* fix

---------

Co-authored-by: LeiWang1999 <[email protected]>
* adapt amd fa kernel

* fix some bugs

* make lint happy

* Refactor tensor dtype handling in supply_tensors_gpu function and simplify loop condition in fast_flashattn function. The dtype mapping has been removed in favor of a direct conversion method, improving clarity and error handling.

* clean code

---------

Co-authored-by: LeiWang1999 <[email protected]>
* [Example] Add deepseek mHC sinkhorn backward

* Remove example_mhc_res.py file from deepseek_mhc examples directory

* Remove example_mhc_res.py file from deepseek_mhc examples directory

---------

Co-authored-by: LeiWang1999 <[email protected]>
Modify acc_o accumulation operation in README

Update accumulation operation for acc_o in README.
…roxy` (tile-ai#1850)

* Refactor async intrinsic handling in inject_fence_proxy.cc to exclude tma_load from dependency checks. Update comments for clarity on async intrinsics and their categorization according to NVIDIA documentation.

* Update src/transform/inject_fence_proxy.cc

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* fix(cutedsl): add rsqrt/rsqrtf support in codegen

- Add rsqrt/rsqrtf to fastmath function mapping
- Implement ieee_frsqrt using tl.rsqrt
- Export rsqrt from contrib/cutedsl/math.py

Fixes: norm/test_rms_norm.py

* fix(cutedsl): add atomic_add support and fix shared.dyn access

- Add atomic_add_elem_op and atomic_add_ret_elem_op handling
- Exclude shared.dyn from direct indexing to avoid out-of-bounds access
  (shared.dyn shape is set to (1,) but accessed with arbitrary indices)

Fixes: gemv/test_example_gemv.py

* fix(cutedsl): add builtin::ptx_ldmatrix support

Implement builtin::ptx_ldmatrix by reusing tl.ptx_ldmatrix_x* functions.

* feat(cutedsl): add ptx_mma support with all MMA types

Add complete PTX MMA intrinsic support for CuTeDSL backend:

- Implement ptx_mma codegen in codegen_cutedsl.cc
- Add GetVarPtr_ helper for local buffer pointer extraction
- Create ptx_mma.py with inline PTX assembly wrappers for:
  - FP16: m16n8k16 -> f16/f32
  - BF16: m16n8k16 -> f32
  - INT8/UINT8: m16n8k32 -> i32
  - INT4/UINT4: m16n8k32 -> i32
  - TF32: m16n8k4, m16n8k8 -> f32
  - FP64: m8n8k4 -> f64
  - FP8 e4m3/e5m2: m16n8k32 -> f16/f32
- Fix BroadcastNode to emit typed values for make_filled_tensor
- Update ptx_ldmatrix to use GetVarPtr_ for local buffers

* feat(cutedsl): comprehensive atomic operations support

- Add atomic.py module with full atomic operations:
  - AtomicAdd: float32 (nvvm.atomicrmw), float16 (inline PTX with i16 bitcast),
    int32/int64 (nvvm.atomicrmw)
  - AtomicAddx2/x4: vectorized float32 via PTX atom.global.add.v2/v4.f32
  - AtomicMax/Min: int32/int64 (nvvm.atomicrmw), float32 (CAS loop via inline PTX)
  - AtomicLoad/Store: all memory orderings via inline PTX (relaxed/acquire/release/seq_cst)
- Add codegen support for atomic_load/store/max/min/addx2/addx4 in codegen_cutedsl.cc
- Move AtomicAdd from __init__.py to atomic.py, extend with float16/int64 support

* feat(cutedsl): add fabsf/fabs support in codegen and math module

* feat(cutedsl): add wait_wgmma support via nvvm.wgmma_wait_group_sync_aligned

* feat(cutedsl): support i32 min/max

* fix(cutedsl): handle FP8 narrow precision alignment in codegen

CuTeDSL requires all FP8 vector operations to be 32-bit aligned, but
TIR generates float8x2 (16-bit) vectors. Additionally, MLIR vector
extractelement on FP8 types fails with unrealized_conversion_cast.

Fix CastNode, BufferLoadNode, and BufferStoreNode to:
- Pad unaligned FP8 casts to aligned width (e.g. f32x4 -> fp8x4)
- Return rmem tensor names (not .load()) so downstream uses rmem
  element access instead of MLIR vector extractelement
- Use padded aligned stores for local rmem, scalar element stores
  for global/shared memory (to handle non-aligned offsets)

* fix(cutedsl): use pointer arithmetic for shared.barrier buffer access

_Pointer type from tl.alloc_smem() does not support subscript [i],
but supports pointer arithmetic (ptr + i). This fixes warp_specialize
tests that use mbarrier.

- GetBufferRef_: use (vid + index) for shared.barrier instead of vid[index]
- BufferLoadNode: skip .load() suffix for shared.barrier (returns pointer, not value)

Fixes 3/4 warp_specialize tests.

* fix(cutedsl): handle single coordinate in tma_load

When tma_load receives a single Int32 coordinate (for 1D tensors) instead
of a tuple, wrap it in a tuple before iterating. This fixes the error:
'Int32' object is not iterable

Fixes flash_decoding/test_example_flash_decoding.py

* feat(cutedsl): add decode_i4u/s_to_f16 and warp primitives

- Add quantize.py with decode_i4u_to_f16 and decode_i4s_to_f16
  using inline PTX (lop3.b32 + sub.f16x2)
- Add warp.py with __activemask, __shfl_down_sync, __shfl_sync
  using CuTeDSL's shuffle_sync functions
- Update codegen to map decode_* and __* functions to tl.* namespace

Fixes dequant_gemv_fp16xint4 test.

* feat(cutedsl): add decode_fp4_to_bf16_twiddling

Implement FP4 to BF16 conversion using bit twiddling technique.
- Use prmt.b32 for endianness handling
- Use mul.bf16x2 for scaling
- Pack output with correct high/low BF16 layout

Fixes test_example_dequant_gemm_bf16_mxfp4_hopper.

* fix(cutedsl): support reinterpret for arbitrary expressions and fix shift type promotion

- reinterpret: handle non-BufferLoad args with tl.bitcast()
- shift_right/shift_left: wrap result in dtype() to preserve type
- CuTeDSL Int8>>4 returns Int32 (Python promotion), CUDA keeps int8_t
- Add bitcast() function using llvm.bitcast MLIR op
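The promotion gap is easy to demonstrate: Python widens small integers on shift, while CUDA's int8_t arithmetic wraps at 8 bits; wrapping the traced result back into the narrow dtype models the emitted dtype(...) cast. `to_int8` here is an illustrative helper, not CuTeDSL API.

```python
# Illustrative model of the shift type-promotion fix: wrap the result back
# into int8 range, as the emitted dtype(...) cast does. (Sketch only.)
def to_int8(x: int) -> int:
    x &= 0xFF
    return x - 0x100 if x >= 0x80 else x

a = 18                     # 0x12, fits in int8
promoted = a << 4          # Python promotes: 288
wrapped = to_int8(a << 4)  # int8_t semantics: wraps to 32 (0x20)
print(promoted, wrapped)   # 288 32
```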

* feat: disable v2->v1 fallback

* fix(cutedsl): TensorSSA for store (Select/scalar) and TF32 MMA dispatch

- ArithValue: emit tl.where() for Select and wrap scalar store RHS in
  make_filled_tensor().load() so CuTeDSL Tensor.store() gets TensorSSA.
- TF32: accept fp32/f32/float32 for m16n8k4 and m16n8k8 in ptx_mma dispatch
  (e.g. deepseek_mhc).

* feat(cutedsl): implement CumSum1D and CumSum2D for cumulative sum operations

- Add CumSum1D and CumSum2D classes in reduce.py based on CUDA templates
- Add __shfl_up_sync wrapper in warp.py using CuTeDSL's shuffle_sync_up
- Implement warp-level prefix sum (forward/reverse) using shuffle operations
- Use @cute.jit decorator and nested if for DSL-compatible control flow
- Fixes gdn/test_example_cumsum_compilation test (7/8 gdn tests now pass)

* feat(cutedsl): implement tma_reduce for TMA atomic add reduction

* fix(cutedsl): handle if_then_else as tl.where and unify Select/where arm dtypes

- Override builtin::if_then_else in CallNode to emit tl.where() instead of
  Python ternary (which yields ArithValue, not TensorSSA)
- Use make_rmem_tensor + element assignment instead of make_filled_tensor
  (which creates unsupported vector<1xT> types for int)
- Unify arm dtypes in both SelectNode and if_then_else: when true/false arms
  differ (e.g. BFloat16 vs Float32), cast both to common type via .to()
- Fix BufferStore: skip make_filled_tensor wrapping for Select/if_then_else
  values (already TensorSSA); extract [0] for scalar element assignment
- Fix GetBufferRef_: extract Ramp base for make_tensor_at_offset offset
  parameter to avoid emitting tuple offsets

Fixes: cast/group_per_split_token, topk, deepseek_mhc, flash varlen

* fix(cutedsl): convert non-integer offsets to Int64 in make_tensor_at_offset

Complex arithmetic on cutlass.Int64 values (involving tl.min, //, %) can
produce Float64 ArithValue that CuTeDSL Pointer.__add__ does not accept.
Detect non-integer offsets via is_integer() and wrap with cutlass.Int64().

Fixes blocksparse_attention varlen_indice and varlen_mask (6/6 passed).

* fix(cutedsl): fix narrow unaligned load/store for sub-32-bit types

Three issues in the narrow unaligned (e.g. uint8x2, fp8x2) code paths:

1. Load from shared/global: using aligned_lanes (4) elements caused OOB
   near buffer boundaries and MisalignedAddress for non-aligned offsets.
   Fix: load exactly value_lanes (2) elements with no cute.assume.

2. Load from rmem: cute.assume(offset, divby=aligned_lanes) silently
   truncates offsets not divisible by aligned_lanes (e.g. offset=6 with
   divby=4 → 4). Fix: drop cute.assume for rmem (no hardware constraint).

3. Store to rmem: padded (aligned_lanes,) .store() writes beyond
   value_lanes elements, causing overlapping writes and OOB at tensor
   boundaries. Also triggers CuTeDSL's 32-bit alignment check for FP8.
   Fix: use element-by-element assignment (vid[i]=val).

Fixes dequantize_gemm w4a8 (cudaErrorMisalignedAddress → correct results).
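The silent truncation in issue 2 above can be modelled directly: treating cute.assume(offset, divby=n) as rounding the offset down to a multiple of n makes the bug visible. This is an illustrative model of the observed behavior, not the CuTeDSL implementation.

```python
# Illustrative model of cute.assume(offset, divby=n) silently rounding a
# non-divisible offset down — e.g. offset 6 with divby=4 becomes 4.
def assume_divby(offset: int, divby: int) -> int:
    return offset - offset % divby

print(assume_divby(6, 4))  # 4: the rmem offset is silently truncated
print(assume_divby(8, 4))  # 8: already aligned, unchanged
```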

* fix(cutedsl): handle signless int types in tl.min/tl.max fallback

* fix(cutedsl): handle _Pointer src_values in AtomicAddx2/x4

* fix(cutedsl): remove double PrintIndent in cp_async_gs_conditional codegen

* fix(example): support SM100 in attention_sink bwd configs

Change get_bwd_configs() to use >= 90 instead of == 90 so that SM100
(Blackwell) uses the same config as SM90 (Hopper): block_M=128,
block_N=32, num_stages=2, threads=256.

* feat: implement WGMMA V2 (ss + rs) for CuTeDSL backend

Implement explicit WGMMA descriptor management ops in the CuTeDSL backend,
replacing LOG(FATAL) stubs with working codegen for all 8 WGMMA-related ops.

New Python files:
- typing.py: WGMMA type map (string dtype -> nvvm.WGMMATypes)
- gemm_V2.py: GmmaDescriptor class, initialize/increase descriptor,
  warpgroup sync ops, wgmma_ss (via nvvm.wgmma_mma_async),
  wgmma_rs (via PTX inline asm)

Key findings and fixes:
- nvvm.mma_smem_desc does not correctly pack the layout_type/swizzle
  field in the WGMMA descriptor; replaced with manual bit packing
- wgmma_rs implemented using llvm.inline_asm with dynamically generated
  PTX and constraint strings based on M/N/K Constexpr parameters

C++ codegen changes (codegen_cutedsl.cc):
- warpgroup_arrive/commit_batch/wait/fence_operand
- ptx_wgmma_ss (15 args) and ptx_wgmma_rs (14 args)
- initialize_wgmma_descriptor (arg reorder: TIR->Python)
- increase_descriptor_offset

All GEMM tests (4/4) and flash attention tests (14/14) pass.

* fix(cutedsl): replace break with guard variable pattern for CuTeDSL compatibility

* feat(cutedsl): implement tcgen05 (SM100/Blackwell) MMA support

Add CuTeDSL codegen and Python runtime support for tcgen05 MMA operations:

- New gemm_tcgen05.py: Tcgen05SmemDescriptor, descriptor initialization,
  MMA variants (ss/ws_ss/ts), mbarrier arrive, TMEM alloc/dealloc/load
- All MMA and commit instructions guarded by elect_one_sync via inline PTX
  (@q predicate on individual instructions, not on block scopes)
- TMEM load uses recursive power-of-2 splitting with direct MLIR emission
- C++ codegen: replace 6 LOG(FATAL) stubs with proper code generation for
  tcgen05 MMA/descriptor/TMEM ops, handle descriptor scope allocation
- Fix shared.barrier pointer access (skip .iterator for alloc_smem results)
- Fix mbarrier_wait: use blocking PTX loop instead of single try_wait call

* fix(cutedsl): correct tcgen05_ld 64b/128b/256b PTX variants and cleanup

- Fix tcgen05_ld_32dp64bNx/128bNx/256bNx: use correct PTX instructions
  (16x64b/16x128b/16x256b) with two-half decomposition instead of
  hardcoded 32x32b
- Generalize _emit_tmem_ld_segment to accept ptx_type and regs_per_x
- Remove dead expression in ptx_mma.py (c_ptr + a_offset)
- Remove unused GemmCuTeDSL and is_cutedsl_target imports

* fix(cutedsl): review cleanup — warp shuffle, mbar timeout, atomic addr space, dead code

- Fix warp.py: correct mask_and_clamp formulas to match CUDA semantics
  for __shfl_down_sync, __shfl_up_sync, __shfl_sync (width param was ignored)
- Fix mbar.py: use timeout_ns parameter instead of hardcoded 0x989680
- Fix atomic.py: use generic address space (ld.f32, atom.cas.b32) instead of
  .global for AtomicMax/AtomicMin CAS loops, supporting shared memory pointers
- Add ICHECK for non-unit Ramp stride in codegen_cutedsl.cc GetBufferRef_
- Remove unreachable duplicate elif gemm_inst.is_tcgen5mma() branch
- Remove unused _to_ir_value helper from ptx_mma.py
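The CUDA semantics referenced in the first bullet: the sub-warp width is encoded in the packed "mask and clamp" operand of PTX shfl.sync, with the lane clamp in bits [4:0] and the segment mask in bits [12:8]. A sketch of the formulas, assuming warpSize = 32 (helper names are illustrative):

```python
# Illustrative mask_and_clamp formulas matching CUDA's __shfl_*_sync
# semantics for a sub-warp width (warpSize assumed to be 32).
WARP_SIZE = 32

def shfl_down_c(width: int) -> int:
    # segment mask in bits [12:8], clamp to the segment's last lane
    return ((WARP_SIZE - width) << 8) | (WARP_SIZE - 1)

def shfl_up_c(width: int) -> int:
    # for up-shuffles the clamp field stays 0 (lane 0 of the segment)
    return (WARP_SIZE - width) << 8

print(hex(shfl_down_c(32)))  # 0x1f: full warp, no segmentation
print(hex(shfl_down_c(16)))  # 0x101f
print(hex(shfl_up_c(16)))    # 0x1000
```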

* fix(cutedsl): skip BulkLoad1D for CuTeDSL + fix uint broadcast type mismatch

Two fixes for dequant/groupedgemm_bf16_mxfp4_hopper on CuTeDSL:

1. Skip BulkLoad1D/BulkStore1D for CuTeDSL target in copy.cc:
   cp_async_bulk_shared_cluster_global (raw 1D TMA) combined with WGMMA
   in the same kernel triggers a ptxas ICE in the NVPTX backend.
   Falls through to descriptor-based BulkLoad/BulkStore instead.

2. Use signed int type for BroadcastNode fill values in codegen_cutedsl.cc:
   CuTeDSL/MLIR normalizes unsigned integer tensor loads to signed types
   (Uint8 pointer -> i8 elements). BroadcastNode fill values now use
   matching signed types to avoid Int8/Uint8 mismatch in tl.where().

* fix(cutedsl): support fp16/bf16 in AtomicAddx2 via atom.add.noftz.v2.f16/bf16

* ci(cutedsl): add CuTeDSL examples to CI pipeline

Add a new GitHub Actions step that runs all examples with
TILELANG_TARGET=cutedsl. Known failures (sparse ops) are marked as
xfail and unsupported tests (convolution, stream-K, flash_decoding FA3)
are auto-skipped via pytest_collection_modifyitems in conftest.py.
The step uses continue-on-error: true initially until stabilized.

* chore: fix pre-commit

* fix(adapter): use name-based fallback for tir.Var dynamic symbolic lookup

tir.Var uses object identity for hash/eq, so when param_shapes (from
the original PrimFunc) and dynamic_symbolic_map (from the transformed
PrimFunc in ir_module) contain different tir.Var objects with the same
name, direct dict lookup fails with KeyError.

Add a secondary name-based index (_dynamic_symbolic_name_map) and a
_lookup_dynamic_symbolic() helper that falls back to name matching.
This fixes KeyError: batch under pytest-xdist parallel execution.

Affected adapters: cutedsl, nvrtc. (cython and tvm_ffi already had
a str()-based workaround.)
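The identity-hash pitfall and the name-based fallback can be sketched in a few lines. `IdentityVar` is an illustrative stand-in for tir.Var (which hashes by object identity); the helper mirrors the described `_lookup_dynamic_symbolic`, not the actual adapter code.

```python
# Illustrative sketch of the name-based fallback: identity-keyed dicts miss
# for a distinct same-named object, so a secondary name index recovers it.
class IdentityVar:
    def __init__(self, name):
        self.name = name  # default hash/eq: object identity, like tir.Var

def lookup_dynamic_symbolic(var, id_map, name_map):
    if var in id_map:          # fast path: the very same object
        return id_map[var]
    return name_map[var.name]  # fallback: match by variable name

batch1, batch2 = IdentityVar("batch"), IdentityVar("batch")
id_map = {batch1: 0}
name_map = {v.name: i for v, i in id_map.items()}
# batch2 is a different object with the same name; id_map alone would KeyError.
print(lookup_dynamic_symbolic(batch2, id_map, name_map))  # 0
```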

* chore: remove dead GemmCuTeDSL class and update stale V1 comments

- Delete `gemm_cutedsl.py` (no references anywhere in codebase)
- Remove outdated comment in `gemm_op.py`
- Update `run_local_ci_test.sh` comment: CuTeDSL no longer requires V1
- Keep `TILELANG_USE_GEMM_V1` env var as a debugging escape hatch

* refactor(cutedsl): reorganize contrib/cutedsl module structure

- Rename gemm_V1.py/gemm_V2.py to snake_case (gemm_v1.py/gemm_v2.py)
- Merge typing.py (19 lines) and mbar.py (53 lines) into utils.py and cpasync.py
- Extract utility functions from __init__.py into new utils.py module
- Slim __init__.py to pure re-exports
- Add __all__ to all submodules for explicit public API documentation

* fix(ci): use .empty() instead of .size() >= 1 in pipeline_planning.cc

* fix: coderabbit comments

* fix(cutedsl): correct WGMMA result register count for f16/bf16 output

wgmma_ss/wgmma_rs used M*N//128 for result registers, which is only
correct for 32-bit output types. For f16/bf16, each i32 register holds
2 packed elements, halving the count. Derive from cutlass type .width
instead of hardcoding.

Also update barrier test assertion to match codegen's pointer-arithmetic
syntax: (barriers + i) instead of barriers[i].
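The corrected register count can be written out explicitly: a warpgroup's 128 threads split the M*N accumulator, and each 32-bit result register holds 32/width packed elements. The formula below is reconstructed from the commit message (names are illustrative):

```python
# Illustrative WGMMA result-register count: per-thread elements divided by
# how many output elements fit in one 32-bit register.
def wgmma_result_regs(M: int, N: int, out_width_bits: int) -> int:
    elems_per_thread = (M * N) // 128   # 128 threads per warpgroup
    elems_per_reg = 32 // out_width_bits
    return elems_per_thread // elems_per_reg

print(wgmma_result_regs(64, 64, 32))  # f32 accum: 32 regs (the old M*N//128)
print(wgmma_result_regs(64, 64, 16))  # f16/bf16: 16 regs — half the count
```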

* refactor(cutedsl): templatize ptx_mma.py with factory function

Replace 13 near-identical @dsl_user_op PTX MMA functions (~840 lines)
with a single _make_ptx_mma factory and a config table (~120 lines).
Also correctly interpolates a_layout/b_layout into the PTX mnemonic
instead of hardcoding row.col.

* fix(cutedsl): fix NaN CAS loop, wgmma_rs int accum, and stale pipeline state

- atomic.py: Use integer-domain comparison (setp.ne.b32) in AtomicMax/Min
  CAS loops to prevent infinite spin when values are NaN
- gemm_v2.py: Use "=r"/i32 constraints for wgmma_rs when C_dtype is s32
  (int8 GEMM) instead of hardcoded "=f"/f32
- reduce.py: Add assert H <= 32 guard for CumSum2D dim=0 column-wise path
- pipeline_planning.cc: Clear pending tcgen05 state unconditionally after
  arrive to prevent stale entries from leaking to future arrives
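Why the integer-domain comparison in the first bullet matters: a float CAS loop that retries while `old != assumed` never terminates once a NaN lands in the buffer, because NaN != NaN is always true; comparing the raw bit patterns (the setp.ne.b32 idea) does terminate. A single-threaded illustration:

```python
# Single-threaded model of the NaN CAS-loop hang: the float comparison
# keeps the loop spinning, the bit-pattern comparison lets it exit.
import math
import struct

def bits(f: float) -> int:
    return struct.unpack("<I", struct.pack("<f", f))[0]

stored = math.nan
assumed = stored                 # snapshot read at the top of the CAS loop

float_compare_spins = assumed != stored            # True forever for NaN
bit_compare_spins = bits(assumed) != bits(stored)  # False: loop can exit

print(float_compare_spins, bit_compare_spins)  # True False
```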

* fix(cutedsl): fix ptx_mma brace escaping and wgmma_rs DSL compatibility

- ptx_mma.py: Double-brace register lists in PTX template so
  .format(a_layout=, b_layout=) doesn't interpret {$0, ...} as
  format fields (KeyError regression from templatization)
- gemm_v2.py: Replace if/else branches inside DSL-traced loops with
  ternary expressions to avoid CuTeDSL "operands type change inside
  dynamic if" error
- ci.yml: Remove continue-on-error from cutedsl CI, reduce maxfail
  to 3, increase parallelism to 4

* feat(cutedsl): implement ptx_mma_sp (sparse MMA) codegen and runtime

* fix(cutedsl): guard saturate param, move CuTeDSL CI step after core tests

- ptx_mma/ptx_mma_sp: raise NotImplementedError if saturate=True
- warpgroup_fence_operand: add comment explaining no-op rationale
- ci.yml: move CuTeDSL examples after CUDA/ROCm/Metal test steps
  so a CuTeDSL failure no longer blocks core tests

* feat(cutedsl): implement 16 previously unsupported ops

Add support for thread index queries (get_lane_idx, get_warp_idx,
get_warp_idx_sync, get_warp_group_idx), exp10 math function,
warp-level reductions (sum, max, min, bitand, bitor), and
IEEE-754 compliant arithmetic with explicit rounding modes
(add, sub, mul, fma, rcp, sqrt, div).

Reduces unsupported op count from 26 to 10. The remaining ops
are either high-complexity (tma_load_im2col, tl_gemm_sp),
require wrapper changes (sync_grid), or are legacy/unused.

* fix(cutedsl): wrap non-int32 integer literals with CuTeDSL type constructors

CuTeDSL's as_numeric() converts bare Python ints to Int32, causing
"Type mismatch" errors when storing to narrower tensors (e.g. Int16).
Override IntImmNode to emit cutlass.Int16(0) instead of bare 0 for
non-int32 literals. Also add a BufferStore narrowing guard as safety net.

Fixes compress_kernel Int32→Int16 error in gemm_sp/custom_compress.

* feat(cutedsl): implement sync_grid with cooperative kernel launch

Add grid-level synchronization via a software barrier using
llvm.mlir.global counter + inline PTX atomic spin-wait. The CuTeDSL
wrapper detects sync_grid usage and automatically switches to
cuLaunchCooperativeKernel to guarantee all blocks are resident.

- grid_sync.py: module-global counter + PTX barrier protocol
- wrapper.py: CPP_COOPERATIVE_KERNEL_LAUNCH_TEMPLATE
- codegen_cutedsl.cc: emit tl.sync_grid() instead of LOG(FATAL)

---------

Co-authored-by: yuxic <[email protected]>
* tir: add T.cdiv alias for T.ceildiv

* docs: Update InjectFenceProxy documentation and enhance code comments

- Clarified the description of `tl.InjectFenceProxy` to specify the transition between generic and async proxy operations.
- Improved explanations of the pass's functionality, including state tracking and the handling of TMA store synchronization.
- Added details on the new `ProxyStateSet` class and its role in managing proxy states.
- Updated usage instructions for proxy hints to include new options for custom operations.
- Enhanced test coverage for handling unknown external calls and proxy hint overrides.

* fix

* enhance

* fix

* remove tl.proxy_hint

* refactor tma_store_arrive and tma_store_wait

* refactor

* fix

* InjectFenceProxy: hoist fence out of pure-async loops

* InjectFenceProxy: hoist fence for if/while pure-async regions

* Remove LowerTileOp transform test
…ffers (tile-ai#1870)

* [BugFix] tvm_ffi: handle unused nullable buffers with shared symbolic shapes

* [BugFix] ArgBinder: relax shared-shape binding for unused nullable buffers
…ile-ai#1834)

* [Enhancement] Integrate arith::Analyzer into Loop Vectorizer for improved analysis

* Revert "[Enhancement] Integrate arith::Analyzer into Loop Vectorizer for improved analysis"

This reverts commit 0c1df35.

* Update submodule 'tvm' to latest commit 806ec091

---------

Co-authored-by: LeiWang1999 <[email protected]>
Rachmanino and others added 8 commits May 6, 2026 22:48
- Clarify that src/transform/ and src/tl_templates/ are not all exclusive
- Document build-import-run loop with common failure/root cause/fix table
- List distributed codegen additions that must survive merge
- Document TileScale-specific utils that fail silently
- Explain device mismatch trap (parse_device, torch.set_default_device)
- Add checklist items for distributed-specific validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Detect tl::multimem:: call_extern calls and set need_multimem_h_=true
so that multimem.h is included in generated kernels.
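The include-guard logic amounts to scanning extern callee names for the namespace prefix (a hedged sketch; the helper name and header path below are illustrative, not the actual codegen fields):

```python
def collect_includes(extern_calls):
    """Flag headers needed by the callees seen during codegen.

    Any callee under the tl::multimem:: namespace means the
    generated source must pull in the multimem template header.
    Illustrative stand-in for the need_multimem_h_ flag logic.
    """
    headers = []
    if any(c.startswith("tl::multimem::") for c in extern_calls):
        headers.append("#include <tl_templates/cuda/multimem.h>")
    return headers

print(collect_includes(["tl::multimem::ld_reduce", "memcpy"]))
```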

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Restore uncommitted TileScale TMA features lost during upstream sync:
- multimem.cc/h: kTmaStore/kTmaRedStore modes, LowerBulkCopy
- multimem.py: multimem_tma_store(), MultimemReduceOp.NONE
- inject_fence_proxy.cc: IsMultimemBulkCall detection
- multimem.h template: TMA bulk async templates
- Fix LowerParallelLoop call signature for new upstream API

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add dist.barrier() and update default parameters.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
These were accidentally overwritten in the previous commit.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add TYPE_CHECKING import for BaseAllocator in tensor.py
- Add dout/inouts to spelling wordlist (upstream typos)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Upstream overwrite removed the TileScale DeepEP submodule URL.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Rachmanino Rachmanino changed the title [Sync] Merge upstream TileLang [Sync][Refactor] Merge upstream TileLang, multiple refactor and support VMM & multimem May 6, 2026
Rachmanino and others added 3 commits May 7, 2026 00:07
CMake's FindCUDAToolkit creates CUDA::cuda_driver with an IMPORTED_LOCATION,
but cuda-driver-devel places libcuda.so in lib64/stubs, not lib64.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Rachmanino Rachmanino force-pushed the sync/tilelang-0c83691 branch from 10bcf7b to a9d7396 on May 6, 2026 16:34
- Upgrade cibuildwheel from v3.3 to v3.4 (matches upstream)
- Add symlink for CUDA driver library so CMake's FindCUDAToolkit
  can locate CUDA::cuda_driver in the manylinux build environment

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Rachmanino Rachmanino force-pushed the sync/tilelang-0c83691 branch from a9d7396 to f7ade80 on May 6, 2026 16:34
@Rachmanino Rachmanino force-pushed the sync/tilelang-0c83691 branch from 6196934 to a3b9554 on May 6, 2026 16:37
Rachmanino and others added 13 commits May 7, 2026 00:54
These TileScale distributed primitives were lost when builtin.py was
taken from upstream. The underlying C++ op registrations in sync.cc/h
are still present.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add back 16 functions lost when builtin.py was overwritten by upstream:
- fence_cta, fence_gpu, fence_sys
- ld, st (load/store with scope/semantic/PE params)
- atom_add (atomic addition)
- alloc_barrier_gpu, init_barrier_gpu, arrive_barrier_gpu
- wait_barrier_gpu, sync_barrier_gpu
- barrier_blocks, sync_blocks
- warp_any, warp_all

Also restore multimem_tma_store, multimem_signal in distributed/__init__.py
and add missing address_of/alloc_buffer imports.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Test jobs do not need lint tools (pre-commit, clang-format, codespell,
ruff). Decoupling avoids unnecessary dependency resolution failures.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The mirror caused dependency resolution failures when used with a proxy.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Remove || true from CI test commands (fixes error suppression)
- Add CUDA codegen handlers for tl.get_rank, tl.get_num_ranks,
  tl.get_remote_base_ptr, tl.get_uintptr_t, fence_cta/gpu/sys
- Wrap pointer args with get_uintptr_t() in remote_copy Lower methods
  so generated code uses uint64_t consistently
- Add cp_block/cp_warp template functions in distributed.h

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Change meta_data from extern declaration to definition so the symbol
  exists in the compiled cubin without requiring NVSHMEM runtime hook
- Move ICHECK_EQ after CUDA error check in GetGlobal
- Add retry logic for stale CUmodule handles (multiprocessing spawn)
- Temporarily skip nvshmem hook to isolate context issues

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Switch from the driver API (cuModuleLoadData/cuModuleUnload) to the runtime
library API (cudaLibraryLoadData/cudaLibraryUnload) in tilescale_cuda_module. The
runtime API automatically manages CUDA context, fixing
CUDA_ERROR_INVALID_CONTEXT in torch.multiprocessing.spawn subprocesses.
Also adds cudaSetDevice before module operations to ensure correct device.

Co-Authored-By: Claude Opus 4.6 <[email protected]>