[Sync][Refactor] Merge upstream TileLang, multiple refactor and support VMM & multimem#58
Open
Rachmanino wants to merge 995 commits into main from
Conversation
tl: downgrade parallel loops for unroll/vectorize
When vectorization resolves to size=1, keep semantics but rewrite ForKind::kParallel to kSerial so downstream transforms (e.g. pragma-unroll) can apply.
Also downgrade kParallel in VectorizeRewriter outputs, and rename/apply PragmaUnrollLoop consistently after partition/vectorize.
…tible wheels) (tile-ai#1821)
* Add lazy-loading stubs for CUDART and NVRTC
  Build and ship libcudart_stub.so and libnvrtc_stub.so, then force TVM to link against them so the wheel does not hard-depend on libcudart.so.<major> / libnvrtc.so.<major>. This allows a single wheel to run across CUDA major versions where only libcudart/libnvrtc 12 or 13 is present.
* lint fix
* Guard cudart_stub against CUDA 11 headers
  Add compile-time checks for CUDA 12+ runtime headers to avoid signature mismatches (e.g. cudaGraphInstantiate) when the stub is built with older toolkits.
* Update CMakeLists and CUDA stubs for version compatibility
  - Adjusted comments in CMakeLists.txt to reflect the correct major versions for libcudart and NVRTC.
  - Modified cudart.cc to support CUDA 11.x, including changes to the function pointer typedefs and the GraphInstantiate function to handle both legacy and new signatures.
  - Updated nvrtc.cc to include support for NVRTC versions 11.0 to 13.x, ensuring compatibility across different CUDA environments.
* Add CUDA stub library support for portable wheels
  - Introduced an option to use POSIX dlopen-based CUDA stub libraries for better compatibility across different CUDA Toolkit versions and CPU-only machines.
  - Updated CMakeLists.txt to conditionally enable CUDA stubs based on the platform.
  - Added compile-time checks in CUDA, CUDART, and NVRTC stub implementations to ensure they are only built on POSIX systems, providing clear error messages for Windows users.
  - Enhanced documentation within the CMake configuration for clarity on the use of CUDA stubs.
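The dlopen-style stub idea above (defer library resolution until first use so the wheel never hard-links a specific soname) can be sketched on the host with ctypes. This is an illustrative proxy, not the actual libcudart_stub.so/libnvrtc_stub.so implementation; the candidate-soname list and class name are assumptions.

```python
import ctypes

class LazyLibrary:
    """Defer dlopen until a symbol is first used, so importing never
    hard-fails when no matching library version is present (the same
    idea behind libcudart_stub.so / libnvrtc_stub.so)."""

    def __init__(self, candidates):
        # Candidate sonames tried in order, e.g. ["libnvrtc.so.13", "libnvrtc.so.12"].
        self._candidates = candidates
        self._handle = None

    def _load(self):
        if self._handle is None:
            errors = []
            for name in self._candidates:
                try:
                    self._handle = ctypes.CDLL(name)
                    break
                except OSError as exc:
                    errors.append(str(exc))
            if self._handle is None:
                raise OSError("no candidate library found: " + "; ".join(errors))
        return self._handle

    def __getattr__(self, symbol):
        # Resolution happens here, on first real use of the library.
        return getattr(self._load(), symbol)

# For demonstration, resolve strlen from the running process
# (CDLL(None) on POSIX) instead of a CUDA library.
libc = LazyLibrary([None])
```

Until the first attribute access, nothing is loaded, which is what lets a CPU-only or differently-versioned machine import the wheel cleanly.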
…#1824) add side effect check in const int bound
* Enhance augmented assignment methods in BaseBuilder and Builder classes
  - Updated `aug_assign` method to include an optional `name` parameter for better handling of variable names during augmented assignments.
  - Improved handling of immutable variables in the `Builder` class to treat augmented assignments as re-bindings, providing warnings for potential issues.
  - Added comments to clarify the changes and ensure user expectations are met regarding variable assignments.
* lint fix
) Merge libtilelang and libtilelang_modules Co-authored-by: kurisu6912 <[email protected]>
…ion in ThreadSync (tile-ai#1829)
* Refactor OptimizeForTarget function by removing debug print statements and updating module state visualization
* Add test for sync hoisting in non-uniform if within loop using shared memory
  This commit introduces a new test case to verify the correct behavior of sync hoisting when a non-uniform if statement is present inside a loop that utilizes shared memory. The test ensures that the synchronization occurs before the if statement, confirming the expected transformation in the module's intermediate representation.
* fix
* Refactor thread variable handling in thread_storage_sync.cc
  This commit removes the optional thread variable lookup by tag and replaces it with direct indexing based on the expected position of threadIdx variables. This change addresses a bug where the last three elements of env_threads_ were incorrectly assumed to always correspond to threadIdx.x/y/z, improving the accuracy of thread variable access in the synchronization logic.
* Update testing/python/transform/test_tilelang_transform_thread_sync.py
Co-authored-by: Copilot <[email protected]>
…nstallation guide (tile-ai#1830)
… injection (tile-ai#1828) reset cur_expect_idx_ when entering new scope
…zation for stable actions (tile-ai#1835) lint fix
* Fix tilelang global load/store template
* Minor fix
Co-authored-by: wangxiangwen <[email protected]>
…ess_ptr` (tile-ai#1827)
* [Refactor] Update atomic operations to use access_ptr for buffer access
  - Replaced instances of address_of with access_ptr in atomic operations to improve clarity and maintainability.
  - Enhanced atomic operation implementations in various files, including atomic_add, atomic_max, and atomic_min.
  - Updated related logic in loop vectorization and tile language to support new access patterns.
  - Added a utility function to extract destination scalar dtype for atomic operations, supporting both BufferLoad and tvm_access_ptr.
  - Improved kernel cache logic to include a build-stamp for the TileLang runtime library, ensuring cache validity during development.
* lint fix
* Fix access_ptr layout rewrite and atomic vectorization
* lint fix
* Refactor atomic operations to remove const qualifiers from pointer parameters in atomicAddx2 and ToFloat functions. This change enhances type flexibility and consistency across atomic operations.
* Refactor buffer access logic in lower_tile_op and vectorize_loop to enhance clarity and maintainability. Removed unused code and improved offset calculations for tile access. Updated atomic operation handling to streamline dtype extraction and ensure correct vectorization behavior.
* Add logging for loop layout and lowered loop in atomic_add; implement tvm_access_ptr handling in loop_vectorize and lower_tile_op
  This commit introduces logging statements to track the loop layout and the result of the lowered loop in the AtomicAddNode. Additionally, it implements a new method to handle tvm_access_ptr calls in the loop_vectorize and lower_tile_op files, enhancing the handling of memory access patterns and ensuring correct vectorization behavior.
* fix
* lint fix
* fix
* Update ToFloat2 and ToFloat4 functions to use const qualifiers for pointer parameters in atomic.h; modify test script to call test_atomic_addx2_float instead of main.
* Refactor tvm_access_ptr handling in loop_vectorize to improve argument validation and buffer resolution logic. Ensure at least three arguments are provided and streamline the search for matching buffers in layout_map_. Update vector size calculations to enhance clarity and maintainability.
* Add handling for tvm_access_ptr in vectorize_loop to support offset substitution and buffer creation
  This commit introduces logic to handle tvm_access_ptr calls, allowing for the substitution of loop variables in offsets and the creation of dummy buffers. The new method ensures correct base address calculation and enhances the overall handling of memory access patterns in vectorization.
* Add handling for tvm_access_ptr offset validation in loop_vectorize
  This commit enhances the loop_vectorize functionality by adding checks for the element offset in tvm_access_ptr calls. It ensures that offsets are invariant within the vector boundary, improving the robustness of vectorization logic and memory access patterns.
* fix
* Enhance tvm_access_ptr handling in loop_vectorize by adding offset validation checks. This update ensures offsets remain invariant within vector boundaries, improving the robustness of vectorization logic and memory access.
* fix
* Add tl.access_ptr support in TileLang
  This commit introduces the `tl.access_ptr` operation, which carries pointer access metadata in the frontend. It allows for better handling of buffer loads by retaining necessary information for downstream analysis. The new operation is integrated into various components, including loop vectorization and memory access legalization, ensuring compatibility with existing functionality while enhancing the robustness of memory access patterns.
* lint fix
* fix
* Add support for tl.access_ptr in atomic operations
  This commit updates the AtomicAddNode and AtomicOpBaseNode implementations to utilize the new tl.access_ptr operation for building access pointers to destination elements. It enhances the handling of buffer loads by incorporating type checks for index data types, improving the robustness of memory access patterns. Additionally, it removes unnecessary logging statements in the lowering process, streamlining the codebase.
* lint fix
* Refactor buffer load extraction in vectorize_loop to check for non-empty arguments in access_ptr calls. This change improves the robustness of buffer load handling by ensuring valid argument presence before processing.
* Remove unused fallback logic in KernelCache and clean up import statements in test_tilelang_language_access_ptr.py for improved code clarity.
* Refactor memory access in BitNet and dequantization examples to use tl.access_ptr
  This update replaces T.address_of calls with T.access_ptr in multiple files, enhancing memory access handling by specifying read and write permissions. This change improves code clarity and aligns with recent enhancements in pointer access management.
… support (tile-ai#1839)
* [Feature] Add packed FP32x2 math intrinsics and support for CUDA/HIP
  This commit introduces new packed FP32x2 math operations: fadd2, fmul2, and fma2, which leverage PTX instructions on supported architectures. The changes include:
  - Definitions and implementations of fadd2, fmul2, and fma2 in both CUDA and HIP codegen files.
  - New Python API functions for these operations, ensuring they validate input types and handle fallbacks for unsupported architectures.
  - Documentation updates to reflect the new intrinsics in the math language module.
  These enhancements improve performance for vectorized floating-point operations in TileLang.
* revert hip changes
* annotate cuda only
Co-authored-by: Zhiwen Mo <[email protected]>
…mbar` (tile-ai#1774)
* [BugFix] use BufferLoad for mbar in tcgen5
* add 1SM WS gemm example
* lint
* use shuffle elect to issue tma
* support Buffer to umma_arrive
* update tcgen5 ws examples
* fix
* lint
* revert the use of shuffle_elect
* fix review issues
* disable loop unswitching as workaround for a bug
* lint
tile-ai#1762)
* [Feature] Add new hierarchical reduce with less workspace size and new ptx
* Lint
* Add intrinsic functions and more types
* Fix sm100a redux instruction
* Fix data type
* [Test] Add tests for reduce operations
* [Refactor] Unify AllReduce barrier mechanism via policy template
  Replace the duplicated run()/run_hopper() methods in AllReduce with a single run() that accepts a Barrier policy template parameter (SyncThreadsBarrier or NamedBarrier<N>). Extract the shared inter-warp reduction logic into a warp_inter_reduce helper, and split the dispatch into private hierarchical_reduce/butterfly_reduce methods. Update codegen in reduce.cc and finalize_reducer.cc to emit NamedBarrier<all_threads> for SM >= 90 targets instead of the old all_threads + run_hopper pattern.
* Add handling for non-float types in warp_reduce function
  This update introduces a new case in the warp_reduce function to cast non-float types to float using static_cast. This enhancement improves type flexibility and ensures compatibility with a broader range of input types during reduction operations.
* [Refactor] Remove hierarchical reduce because of performance regression
* [BugFix] Buffer for reduction
Co-authored-by: LeiWang1999 <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
* Use the nvrtc/cudart that come with torch
* remove cuda 13 job
* add assert
* update error message
* resolve issues
* vibe doc
* Refactor CUDART and NVRTC API creation by removing null handle checks. This change simplifies the API initialization process in both cudart.cc and nvrtc.cc, ensuring that the function pointer lookups proceed without early returns.
* lint
* add env back
* Add test against nightly torch
* fix
* test with other torch
* fix
* fix
* fix
* Print all version checks at the end
* fix tvm
* Refactor CUDA and NVRTC stubs to improve version compatibility and lazy loading mechanism. Update documentation to clarify the purpose and implementation of stubs for better understanding.
* lint
* Optimize templates for half/bfloat16
* Fix lint
* Minor fix
* Enhance ReduceOpNode to handle max/min/absmax cases with clear=false. Update initialization logic for temporary buffers and improve reduction operations in reduce.cc. Add print statements for debugging in test_tilelang_language_reduce.py.
Co-authored-by: LeiWang1999 <[email protected]>
* [BugFix] Fix eager mode when there are no tensor args
* fix
Co-authored-by: LeiWang1999 <[email protected]>
* adapt amd fa kernel
* fix some bugs
* make lint happy
* Refactor tensor dtype handling in supply_tensors_gpu function and simplify loop condition in fast_flashattn function. The dtype mapping has been removed in favor of a direct conversion method, improving clarity and error handling.
* clean code
Co-authored-by: LeiWang1999 <[email protected]>
* [Example] Add deepseek mHC sinkhorn backward
* Remove example_mhc_res.py file from deepseek_mhc examples directory
Co-authored-by: LeiWang1999 <[email protected]>
Bump version to 0.1.8
Update the accumulation operation for acc_o in the README.
…roxy` (tile-ai#1850)
* Refactor async intrinsic handling in inject_fence_proxy.cc to exclude tma_load from dependency checks. Update comments for clarity on async intrinsics and their categorization according to NVIDIA documentation.
* Update src/transform/inject_fence_proxy.cc
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* fix(cutedsl): add rsqrt/rsqrtf support in codegen
- Add rsqrt/rsqrtf to fastmath function mapping
- Implement ieee_frsqrt using tl.rsqrt
- Export rsqrt from contrib/cutedsl/math.py
Fixes: norm/test_rms_norm.py
* fix(cutedsl): add atomic_add support and fix shared.dyn access
- Add atomic_add_elem_op and atomic_add_ret_elem_op handling
- Exclude shared.dyn from direct indexing to avoid out-of-bounds access
(shared.dyn shape is set to (1,) but accessed with arbitrary indices)
Fixes: gemv/test_example_gemv.py
* fix(cutedsl): add builtin::ptx_ldmatrix support
Implement builtin::ptx_ldmatrix by reusing tl.ptx_ldmatrix_x* functions.
* feat(cutedsl): add ptx_mma support with all MMA types
Add complete PTX MMA intrinsic support for CuTeDSL backend:
- Implement ptx_mma codegen in codegen_cutedsl.cc
- Add GetVarPtr_ helper for local buffer pointer extraction
- Create ptx_mma.py with inline PTX assembly wrappers for:
- FP16: m16n8k16 -> f16/f32
- BF16: m16n8k16 -> f32
- INT8/UINT8: m16n8k32 -> i32
- INT4/UINT4: m16n8k32 -> i32
- TF32: m16n8k4, m16n8k8 -> f32
- FP64: m8n8k4 -> f64
- FP8 e4m3/e5m2: m16n8k32 -> f16/f32
- Fix BroadcastNode to emit typed values for make_filled_tensor
- Update ptx_ldmatrix to use GetVarPtr_ for local buffers
* feat(cutedsl): comprehensive atomic operations support
- Add atomic.py module with full atomic operations:
- AtomicAdd: float32 (nvvm.atomicrmw), float16 (inline PTX with i16 bitcast),
int32/int64 (nvvm.atomicrmw)
- AtomicAddx2/x4: vectorized float32 via PTX atom.global.add.v2/v4.f32
- AtomicMax/Min: int32/int64 (nvvm.atomicrmw), float32 (CAS loop via inline PTX)
- AtomicLoad/Store: all memory orderings via inline PTX (relaxed/acquire/release/seq_cst)
- Add codegen support for atomic_load/store/max/min/addx2/addx4 in codegen_cutedsl.cc
- Move AtomicAdd from __init__.py to atomic.py, extend with float16/int64 support
* feat(cutedsl): add fabsf/fabs support in codegen and math module
* feat(cutedsl): add wait_wgmma support via nvvm.wgmma_wait_group_sync_aligned
* feat(cutedsl): support i32 min/max
* fix(cutedsl): handle FP8 narrow precision alignment in codegen
CuTeDSL requires all FP8 vector operations to be 32-bit aligned, but
TIR generates float8x2 (16-bit) vectors. Additionally, MLIR vector
extractelement on FP8 types fails with unrealized_conversion_cast.
Fix CastNode, BufferLoadNode, and BufferStoreNode to:
- Pad unaligned FP8 casts to aligned width (e.g. f32x4 -> fp8x4)
- Return rmem tensor names (not .load()) so downstream uses rmem
element access instead of MLIR vector extractelement
- Use padded aligned stores for local rmem, scalar element stores
for global/shared memory (to handle non-aligned offsets)
* fix(cutedsl): use pointer arithmetic for shared.barrier buffer access
_Pointer type from tl.alloc_smem() does not support subscript [i],
but supports pointer arithmetic (ptr + i). This fixes warp_specialize
tests that use mbarrier.
- GetBufferRef_: use (vid + index) for shared.barrier instead of vid[index]
- BufferLoadNode: skip .load() suffix for shared.barrier (returns pointer, not value)
Fixes 3/4 warp_specialize tests.
* fix(cutedsl): handle single coordinate in tma_load
When tma_load receives a single Int32 coordinate (for 1D tensors) instead
of a tuple, wrap it in a tuple before iterating. This fixes the error:
'Int32' object is not iterable
Fixes flash_decoding/test_example_flash_decoding.py
* feat(cutedsl): add decode_i4u/s_to_f16 and warp primitives
- Add quantize.py with decode_i4u_to_f16 and decode_i4s_to_f16
using inline PTX (lop3.b32 + sub.f16x2)
- Add warp.py with __activemask, __shfl_down_sync, __shfl_sync
using CuTeDSL's shuffle_sync functions
- Update codegen to map decode_* and __* functions to tl.* namespace
Fixes dequant_gemv_fp16xint4 test.
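The int4 decode above is implemented on the device with inline PTX (lop3.b32 + sub.f16x2); its reference semantics are just nibble extraction from a 32-bit word, low nibble first, which can be checked on the host (a semantic sketch, not the PTX path):

```python
def decode_i4u(packed: int, n: int = 8):
    """Unpack n unsigned 4-bit fields from a 32-bit word, low nibble first."""
    return [(packed >> (4 * i)) & 0xF for i in range(n)]

def decode_i4s(packed: int, n: int = 8):
    """Signed variant: each nibble is two's-complement, so values in
    [8, 15] map to [-8, -1]."""
    return [v - 16 if v >= 8 else v for v in decode_i4u(packed, n)]
```

The lop3-based device code computes the same mapping in parallel on packed f16x2 lanes rather than one nibble at a time.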
* feat(cutedsl): add decode_fp4_to_bf16_twiddling
Implement FP4 to BF16 conversion using bit twiddling technique.
- Use prmt.b32 for endianness handling
- Use mul.bf16x2 for scaling
- Pack output with correct high/low BF16 layout
Fixes test_example_dequant_gemm_bf16_mxfp4_hopper.
* fix(cutedsl): support reinterpret for arbitrary expressions and fix shift type promotion
- reinterpret: handle non-BufferLoad args with tl.bitcast()
- shift_right/shift_left: wrap result in dtype() to preserve type
- CuTeDSL Int8>>4 returns Int32 (Python promotion), CUDA keeps int8_t
- Add bitcast() function using llvm.bitcast MLIR op
* feat: disable v2->v1 fallback
* fix(cutedsl): TensorSSA for store (Select/scalar) and TF32 MMA dispatch
- ArithValue: emit tl.where() for Select and wrap scalar store RHS in
make_filled_tensor().load() so CuTeDSL Tensor.store() gets TensorSSA.
- TF32: accept fp32/f32/float32 for m16n8k4 and m16n8k8 in ptx_mma dispatch
(e.g. deepseek_mhc).
* feat(cutedsl): implement CumSum1D and CumSum2D for cumulative sum operations
- Add CumSum1D and CumSum2D classes in reduce.py based on CUDA templates
- Add __shfl_up_sync wrapper in warp.py using CuTeDSL's shuffle_sync_up
- Implement warp-level prefix sum (forward/reverse) using shuffle operations
- Use @cute.jit decorator and nested if for DSL-compatible control flow
- Fixes gdn/test_example_cumsum_compilation test (7/8 gdn tests now pass)
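The shuffle-based warp prefix sum above can be modeled on the host: a Kogge-Stone inclusive scan where at each step lane i reads lane i - offset (what __shfl_down_sync/__shfl_up_sync provide on the device) and the offset doubles. This is a simulation of the access pattern, not the device code:

```python
import itertools

def warp_inclusive_scan(vals, width=32):
    """Kogge-Stone inclusive prefix sum over one 'warp' of width lanes:
    at each step, lane i adds the value held by lane i - offset
    (the __shfl_up_sync source lane), doubling offset until it spans width."""
    lanes = list(vals)
    offset = 1
    while offset < width:
        prev = lanes[:]  # all lanes read before any lane writes, like a shuffle
        for lane in range(width):
            if lane >= offset:  # lanes below the offset keep their value
                lanes[lane] = prev[lane] + prev[lane - offset]
        offset *= 2
    return lanes
```

The reverse (suffix-sum) direction just swaps the shuffle source to lane i + offset.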
* feat(cutedsl): implement tma_reduce for TMA atomic add reduction
* fix(cutedsl): handle if_then_else as tl.where and unify Select/where arm dtypes
- Override builtin::if_then_else in CallNode to emit tl.where() instead of
Python ternary (which yields ArithValue, not TensorSSA)
- Use make_rmem_tensor + element assignment instead of make_filled_tensor
(which creates unsupported vector<1xT> types for int)
- Unify arm dtypes in both SelectNode and if_then_else: when true/false arms
differ (e.g. BFloat16 vs Float32), cast both to common type via .to()
- Fix BufferStore: skip make_filled_tensor wrapping for Select/if_then_else
values (already TensorSSA); extract [0] for scalar element assignment
- Fix GetBufferRef_: extract Ramp base for make_tensor_at_offset offset
parameter to avoid emitting tuple offsets
Fixes: cast/group_per_split_token, topk, deepseek_mhc, flash varlen
* fix(cutedsl): convert non-integer offsets to Int64 in make_tensor_at_offset
Complex arithmetic on cutlass.Int64 values (involving tl.min, //, %) can
produce Float64 ArithValue that CuTeDSL Pointer.__add__ does not accept.
Detect non-integer offsets via is_integer() and wrap with cutlass.Int64().
Fixes blocksparse_attention varlen_indice and varlen_mask (6/6 passed).
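The offset-coercion rule above amounts to: if mixed arithmetic produced a floating value that is numerically integral, force it back to an integer before pointer addition. A host-side sketch (hypothetical helper; the real code wraps with cutlass.Int64):

```python
def coerce_offset(off):
    """Pointer arithmetic only accepts integer offsets. If arithmetic
    involving min, //, or % produced a float that is numerically
    integral, convert it back to int; reject anything fractional."""
    if isinstance(off, float):
        if not off.is_integer():
            raise TypeError(f"non-integral pointer offset: {off}")
        return int(off)
    return off
```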
* fix(cutedsl): fix narrow unaligned load/store for sub-32-bit types
Three issues in the narrow unaligned (e.g. uint8x2, fp8x2) code paths:
1. Load from shared/global: using aligned_lanes (4) elements caused OOB
near buffer boundaries and MisalignedAddress for non-aligned offsets.
Fix: load exactly value_lanes (2) elements with no cute.assume.
2. Load from rmem: cute.assume(offset, divby=aligned_lanes) silently
truncates offsets not divisible by aligned_lanes (e.g. offset=6 with
divby=4 → 4). Fix: drop cute.assume for rmem (no hardware constraint).
3. Store to rmem: padded (aligned_lanes,) .store() writes beyond
value_lanes elements, causing overlapping writes and OOB at tensor
boundaries. Also triggers CuTeDSL's 32-bit alignment check for FP8.
Fix: use element-by-element assignment (vid[i]=val).
Fixes dequantize_gemm w4a8 (cudaErrorMisalignedAddress → correct results).
* fix(cutedsl): handle signless int types in tl.min/tl.max fallback
* fix(cutedsl): handle _Pointer src_values in AtomicAddx2/x4
* fix(cutedsl): remove double PrintIndent in cp_async_gs_conditional codegen
* fix(example): support SM100 in attention_sink bwd configs
Change get_bwd_configs() to use >= 90 instead of == 90 so that SM100
(Blackwell) uses the same config as SM90 (Hopper): block_M=128,
block_N=32, num_stages=2, threads=256.
* feat: implement WGMMA V2 (ss + rs) for CuTeDSL backend
Implement explicit WGMMA descriptor management ops in the CuTeDSL backend,
replacing LOG(FATAL) stubs with working codegen for all 8 WGMMA-related ops.
New Python files:
- typing.py: WGMMA type map (string dtype -> nvvm.WGMMATypes)
- gemm_V2.py: GmmaDescriptor class, initialize/increase descriptor,
warpgroup sync ops, wgmma_ss (via nvvm.wgmma_mma_async),
wgmma_rs (via PTX inline asm)
Key findings and fixes:
- nvvm.mma_smem_desc does not correctly pack the layout_type/swizzle
field in the WGMMA descriptor; replaced with manual bit packing
- wgmma_rs implemented using llvm.inline_asm with dynamically generated
PTX and constraint strings based on M/N/K Constexpr parameters
C++ codegen changes (codegen_cutedsl.cc):
- warpgroup_arrive/commit_batch/wait/fence_operand
- ptx_wgmma_ss (15 args) and ptx_wgmma_rs (14 args)
- initialize_wgmma_descriptor (arg reorder: TIR->Python)
- increase_descriptor_offset
All GEMM tests (4/4) and flash attention tests (14/14) pass.
* fix(cutedsl): replace break with guard variable pattern for CuTeDSL compatibility
* feat(cutedsl): implement tcgen05 (SM100/Blackwell) MMA support
Add CuTeDSL codegen and Python runtime support for tcgen05 MMA operations:
- New gemm_tcgen05.py: Tcgen05SmemDescriptor, descriptor initialization,
MMA variants (ss/ws_ss/ts), mbarrier arrive, TMEM alloc/dealloc/load
- All MMA and commit instructions guarded by elect_one_sync via inline PTX
(@q predicate on individual instructions, not on block scopes)
- TMEM load uses recursive power-of-2 splitting with direct MLIR emission
- C++ codegen: replace 6 LOG(FATAL) stubs with proper code generation for
tcgen05 MMA/descriptor/TMEM ops, handle descriptor scope allocation
- Fix shared.barrier pointer access (skip .iterator for alloc_smem results)
- Fix mbarrier_wait: use blocking PTX loop instead of single try_wait call
* fix(cutedsl): correct tcgen05_ld 64b/128b/256b PTX variants and cleanup
- Fix tcgen05_ld_32dp64bNx/128bNx/256bNx: use correct PTX instructions
(16x64b/16x128b/16x256b) with two-half decomposition instead of
hardcoded 32x32b
- Generalize _emit_tmem_ld_segment to accept ptx_type and regs_per_x
- Remove dead expression in ptx_mma.py (c_ptr + a_offset)
- Remove unused GemmCuTeDSL and is_cutedsl_target imports
* fix(cutedsl): review cleanup — warp shuffle, mbar timeout, atomic addr space, dead code
- Fix warp.py: correct mask_and_clamp formulas to match CUDA semantics
for __shfl_down_sync, __shfl_up_sync, __shfl_sync (width param was ignored)
- Fix mbar.py: use timeout_ns parameter instead of hardcoded 0x989680
- Fix atomic.py: use generic address space (ld.f32, atom.cas.b32) instead of
.global for AtomicMax/AtomicMin CAS loops, supporting shared memory pointers
- Add ICHECK for non-unit Ramp stride in codegen_cutedsl.cc GetBufferRef_
- Remove unreachable duplicate elif gemm_inst.is_tcgen5mma() branch
- Remove unused _to_ir_value helper from ptx_mma.py
* fix(cutedsl): skip BulkLoad1D for CuTeDSL + fix uint broadcast type mismatch
Two fixes for dequant/groupedgemm_bf16_mxfp4_hopper on CuTeDSL:
1. Skip BulkLoad1D/BulkStore1D for CuTeDSL target in copy.cc:
cp_async_bulk_shared_cluster_global (raw 1D TMA) combined with WGMMA
in the same kernel triggers a ptxas ICE in the NVPTX backend.
Falls through to descriptor-based BulkLoad/BulkStore instead.
2. Use signed int type for BroadcastNode fill values in codegen_cutedsl.cc:
CuTeDSL/MLIR normalizes unsigned integer tensor loads to signed types
(Uint8 pointer -> i8 elements). BroadcastNode fill values now use
matching signed types to avoid Int8/Uint8 mismatch in tl.where().
* fix(cutedsl): support fp16/bf16 in AtomicAddx2 via atom.add.noftz.v2.f16/bf16
* ci(cutedsl): add CuTeDSL examples to CI pipeline
Add a new GitHub Actions step that runs all examples with
TILELANG_TARGET=cutedsl. Known failures (sparse ops) are marked as
xfail and unsupported tests (convolution, stream-K, flash_decoding FA3)
are auto-skipped via pytest_collection_modifyitems in conftest.py.
The step uses continue-on-error: true initially until stabilized.
* chore: fix pre-commit
* fix(adapter): use name-based fallback for tir.Var dynamic symbolic lookup
tir.Var uses object identity for hash/eq, so when param_shapes (from
the original PrimFunc) and dynamic_symbolic_map (from the transformed
PrimFunc in ir_module) contain different tir.Var objects with the same
name, direct dict lookup fails with KeyError.
Add a secondary name-based index (_dynamic_symbolic_name_map) and a
_lookup_dynamic_symbolic() helper that falls back to name matching.
This fixes KeyError: batch under pytest-xdist parallel execution.
Affected adapters: cutedsl, nvrtc. (cython and tvm_ffi already had
a str()-based workaround.)
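The failure mode and the fallback can be demonstrated with a minimal stand-in for tir.Var (which hashes and compares by object identity), plus a secondary name-keyed index. Names below are illustrative, not the adapter's actual identifiers:

```python
class Var:
    """Stand-in for tir.Var: default hash/eq are object identity, so two
    Vars named "batch" from different PrimFuncs never compare equal."""
    def __init__(self, name):
        self.name = name

def lookup_dynamic_symbolic(var, id_map, name_map):
    """Identity lookup first; fall back to the name-based index when the
    map was built from a different (transformed) PrimFunc's Vars."""
    if var in id_map:
        return id_map[var]
    return name_map[var.name]

batch_transformed = Var("batch")  # Var held by dynamic_symbolic_map
batch_original = Var("batch")     # same name, distinct object (param_shapes)
id_map = {batch_transformed: 0}
name_map = {v.name: idx for v, idx in id_map.items()}
```

Without the name index, `id_map[batch_original]` raises KeyError even though both Vars are named "batch".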
* chore: remove dead GemmCuTeDSL class and update stale V1 comments
- Delete `gemm_cutedsl.py` (no references anywhere in codebase)
- Remove outdated comment in `gemm_op.py`
- Update `run_local_ci_test.sh` comment: CuTeDSL no longer requires V1
- Keep `TILELANG_USE_GEMM_V1` env var as a debugging escape hatch
* refactor(cutedsl): reorganize contrib/cutedsl module structure
- Rename gemm_V1.py/gemm_V2.py to snake_case (gemm_v1.py/gemm_v2.py)
- Merge typing.py (19 lines) and mbar.py (53 lines) into utils.py and cpasync.py
- Extract utility functions from __init__.py into new utils.py module
- Slim __init__.py to pure re-exports
- Add __all__ to all submodules for explicit public API documentation
* fix(ci): use .empty() instead of .size() >= 1 in pipeline_planning.cc
* fix: coderabbit comments
* fix(cutedsl): correct WGMMA result register count for f16/bf16 output
wgmma_ss/wgmma_rs used M*N//128 for result registers, which is only
correct for 32-bit output types. For f16/bf16, each i32 register holds
2 packed elements, halving the count. Derive from cutlass type .width
instead of hardcoding.
Also update barrier test assertion to match codegen's pointer-arithmetic
syntax: (barriers + i) instead of barriers[i].
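The corrected register count can be written as a small formula (a sketch assuming the 128-thread warpgroup and 32-bit accumulator registers described above):

```python
def wgmma_result_regs(m: int, n: int, elem_bits: int) -> int:
    """Result registers per thread for an MxN WGMMA accumulator tile:
    the warpgroup's 128 threads split m*n elements, and each 32-bit
    register packs 32 // elem_bits output elements (2 for f16/bf16,
    1 for f32/s32)."""
    elems_per_thread = (m * n) // 128
    return elems_per_thread * elem_bits // 32
```

For a 64x64 tile this gives 32 registers with f32 output but 16 with f16/bf16, which is exactly the halving the hardcoded M*N//128 missed.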
* refactor(cutedsl): templatize ptx_mma.py with factory function
Replace 13 near-identical @dsl_user_op PTX MMA functions (~840 lines)
with a single _make_ptx_mma factory and a config table (~120 lines).
Also correctly interpolates a_layout/b_layout into the PTX mnemonic
instead of hardcoding row.col.
* fix(cutedsl): fix NaN CAS loop, wgmma_rs int accum, and stale pipeline state
- atomic.py: Use integer-domain comparison (setp.ne.b32) in AtomicMax/Min
CAS loops to prevent infinite spin when values are NaN
- gemm_v2.py: Use "=r"/i32 constraints for wgmma_rs when C_dtype is s32
(int8 GEMM) instead of hardcoded "=f"/f32
- reduce.py: Add assert H <= 32 guard for CumSum2D dim=0 column-wise path
- pipeline_planning.cc: Clear pending tcgen05 state unconditionally after
arrive to prevent stale entries from leaking to future arrives
* fix(cutedsl): fix ptx_mma brace escaping and wgmma_rs DSL compatibility
- ptx_mma.py: Double-brace register lists in PTX template so
.format(a_layout=, b_layout=) doesn't interpret {$0, ...} as
format fields (KeyError regression from templatization)
- gemm_v2.py: Replace if/else branches inside DSL-traced loops with
ternary expressions to avoid CuTeDSL "operands type change inside
dynamic if" error
- ci.yml: Remove continue-on-error from cutedsl CI, reduce maxfail
to 3, increase parallelism to 4
* feat(cutedsl): implement ptx_mma_sp (sparse MMA) codegen and runtime
* fix(cutedsl): guard saturate param, move CuTeDSL CI step after core tests
- ptx_mma/ptx_mma_sp: raise NotImplementedError if saturate=True
- warpgroup_fence_operand: add comment explaining no-op rationale
- ci.yml: move CuTeDSL examples after CUDA/ROCm/Metal test steps
so a CuTeDSL failure no longer blocks core tests
* feat(cutedsl): implement 16 previously unsupported ops
Add support for thread index queries (get_lane_idx, get_warp_idx,
get_warp_idx_sync, get_warp_group_idx), exp10 math function,
warp-level reductions (sum, max, min, bitand, bitor), and
IEEE-754 compliant arithmetic with explicit rounding modes
(add, sub, mul, fma, rcp, sqrt, div).
Reduces unsupported op count from 26 to 10. The remaining ops
are either high-complexity (tma_load_im2col, tl_gemm_sp),
require wrapper changes (sync_grid), or are legacy/unused.
* fix(cutedsl): wrap non-int32 integer literals with CuTeDSL type constructors
CuTeDSL's as_numeric() converts bare Python ints to Int32, causing
"Type mismatch" errors when storing to narrower tensors (e.g. Int16).
Override IntImmNode to emit cutlass.Int16(0) instead of bare 0 for
non-int32 literals. Also add a BufferStore narrowing guard as safety net.
Fixes compress_kernel Int32→Int16 error in gemm_sp/custom_compress.
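The literal-wrapping rule reduces to a small codegen helper (an illustrative sketch; only the cutlass.Int16(0) spelling comes from the commit above, the helper name is assumed):

```python
def emit_int_literal(value: int, dtype: str) -> str:
    """int32 literals may stay bare, since as_numeric() already treats
    bare Python ints as Int32; any other integer width is wrapped in
    its type constructor so stores into narrower tensors type-check."""
    if dtype == "int32":
        return str(value)
    # "int16" -> "Int16", "uint8" -> "Uint8", etc.
    return f"cutlass.{dtype.capitalize()}({value})"
```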
* feat(cutedsl): implement sync_grid with cooperative kernel launch
Add grid-level synchronization via a software barrier using
llvm.mlir.global counter + inline PTX atomic spin-wait. The CuTeDSL
wrapper detects sync_grid usage and automatically switches to
cuLaunchCooperativeKernel to guarantee all blocks are resident.
- grid_sync.py: module-global counter + PTX barrier protocol
- wrapper.py: CPP_COOPERATIVE_KERNEL_LAUNCH_TEMPLATE
- codegen_cutedsl.cc: emit tl.sync_grid() instead of LOG(FATAL)
---------
Co-authored-by: yuxic <[email protected]>
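The sync_grid protocol above (global counter + atomic spin-wait, with cooperative launch guaranteeing all blocks are resident) can be modeled with host threads. This sense-reversing sketch shows the barrier logic only; the lock stands in for the PTX atomic RMW and is not the device implementation:

```python
import threading

class SpinGridBarrier:
    """Counter-based grid barrier: every 'block' atomically increments a
    global counter; the last arrival resets it and flips a sense flag
    that the other spinners are watching."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.counter = 0
        self.sense = 0
        self.lock = threading.Lock()  # stands in for the atomic increment

    def sync(self):
        with self.lock:
            my_sense = self.sense
            self.counter += 1
            last = self.counter == self.num_blocks
            if last:
                self.counter = 0      # reset for the next barrier phase
                self.sense ^= 1       # release everyone spinning on sense
        if not last:
            while self.sense == my_sense:
                pass  # spin until the last arrival flips the sense
```

Cooperative launch matters because a spin barrier deadlocks if some participating blocks are never scheduled; the host-thread model sidesteps that by construction.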
* tir: add T.cdiv alias for T.ceildiv
* docs: Update InjectFenceProxy documentation and enhance code comments
  - Clarified the description of `tl.InjectFenceProxy` to specify the transition between generic and async proxy operations.
  - Improved explanations of the pass's functionality, including state tracking and the handling of TMA store synchronization.
  - Added details on the new `ProxyStateSet` class and its role in managing proxy states.
  - Updated usage instructions for proxy hints to include new options for custom operations.
  - Enhanced test coverage for handling unknown external calls and proxy hint overrides.
* fix
* enhance
* fix
* remove tl.proxy_hint
* refactor tma_store_arrive and tma_store_wait
* refactor
* fix
* InjectFenceProxy: hoist fence out of pure-async loops
* InjectFenceProxy: hoist fence for if/while pure-async regions
* Remove LowerTileOp transform test
…ffers (tile-ai#1870)
* [BugFix] tvm_ffi: handle unused nullable buffers with shared symbolic shapes
* [BugFix] ArgBinder: relax shared-shape binding for unused nullable buffers
Build tilelang without host toolchain
…ile-ai#1834)
* [Enhancement] Integrate arith::Analyzer into Loop Vectorizer for improved analysis
* Revert "[Enhancement] Integrate arith::Analyzer into Loop Vectorizer for improved analysis"
  This reverts commit 0c1df35.
* Update submodule 'tvm' to latest commit 806ec091
---------
Co-authored-by: LeiWang1999 <[email protected]>
- Clarify that src/transform/ and src/tl_templates/ are not all exclusive
- Document the build-import-run loop with a common failure/root cause/fix table
- List distributed codegen additions that must survive merge
- Document TileScale-specific utils that fail silently
- Explain the device mismatch trap (parse_device, torch.set_default_device)
- Add checklist items for distributed-specific validation

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Detect tl::multimem:: call_extern to set need_multimem_h_=true so multimem.h is included in generated kernels. Co-Authored-By: Claude Opus 4.6 <[email protected]>
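The detection amounts to a namespace-prefix scan over call_extern targets. A hypothetical Python sketch of the rule (the actual check is a C++ codegen visitor setting `need_multimem_h_`):

```python
def needs_multimem_header(extern_call_names: list[str]) -> bool:
    """Return True if any extern call targets the tl::multimem::
    namespace, signalling that multimem.h must be emitted in the
    generated kernel's preamble. Illustrative only; the real visitor
    walks the TIR call_extern nodes during CUDA codegen.
    """
    return any(name.startswith("tl::multimem::") for name in extern_call_names)
```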
Restore uncommitted TileScale TMA features lost during the upstream sync:
- multimem.cc/h: kTmaStore/kTmaRedStore modes, LowerBulkCopy
- multimem.py: multimem_tma_store(), MultimemReduceOp.NONE
- inject_fence_proxy.cc: IsMultimemBulkCall detection
- multimem.h template: TMA bulk async templates
- Fix the LowerParallelLoop call signature for the new upstream API

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add dist.barrier() and update default parameters. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
These were accidentally overwritten in the previous commit. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add TYPE_CHECKING import for BaseAllocator in tensor.py - Add dout/inouts to spelling wordlist (upstream typos) Co-Authored-By: Claude Opus 4.6 <[email protected]>
Upstream overwrite removed the TileScale DeepEP submodule URL. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
CMake's FindCUDAToolkit creates CUDA::cuda_driver with IMPORTED_LOCATION, but cuda-driver-devel places libcuda.so in lib64/stubs, not lib64. Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Upgrade cibuildwheel from v3.3 to v3.4 (matches upstream)
- Add a symlink for the CUDA driver library so CMake's FindCUDAToolkit can locate CUDA::cuda_driver in the manylinux build environment

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
These TileScale distributed primitives were lost when builtin.py was taken from upstream. The underlying C++ op registrations in sync.cc/h are still present. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Add back 16 functions lost when builtin.py was overwritten by upstream:
- fence_cta, fence_gpu, fence_sys
- ld, st (load/store with scope/semantic/PE params)
- atom_add (atomic addition)
- alloc_barrier_gpu, init_barrier_gpu, arrive_barrier_gpu
- wait_barrier_gpu, sync_barrier_gpu
- barrier_blocks, sync_blocks
- warp_any, warp_all

Also restore multimem_tma_store and multimem_signal in distributed/__init__.py, and add the missing address_of/alloc_buffer imports.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…l quotes) Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Test jobs do not need lint tools (pre-commit, clang-format, codespell, ruff). Decoupling avoids unnecessary dependency resolution failures. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ments" This reverts commit 0acced7.
…t.txt" This reverts commit e4aee5f.
The mirror caused dependency resolution failures when used with a proxy. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Remove `|| true` from CI test commands (fixes error suppression)
- Add CUDA codegen handlers for tl.get_rank, tl.get_num_ranks, tl.get_remote_base_ptr, tl.get_uintptr_t, and fence_cta/gpu/sys
- Wrap pointer args with get_uintptr_t() in remote_copy Lower methods so generated code uses uint64_t consistently
- Add cp_block/cp_warp template functions in distributed.h

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Change meta_data from an extern declaration to a definition so the symbol exists in the compiled cubin without requiring the NVSHMEM runtime hook
- Move ICHECK_EQ after the CUDA error check in GetGlobal
- Add retry logic for stale CUmodule handles (multiprocessing spawn)
- Temporarily skip the nvshmem hook to isolate context issues

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Switch from driver API (cuModuleLoadData/cuModuleUnload) to runtime API (cudaModuleLoadData/cudaModuleUnload) in tilescale_cuda_module. The runtime API automatically manages CUDA context, fixing CUDA_ERROR_INVALID_CONTEXT in torch.multiprocessing.spawn subprocesses. Also adds cudaSetDevice before module operations to ensure correct device. Co-Authored-By: Claude Opus 4.6 <[email protected]>