Commits
995 commits
d25f53c
Fix: treat kParallel as serial when vectorizing (#1819)
LeiWang1999 Feb 9, 2026
c65dfae
[Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compa…
LeiWang1999 Feb 9, 2026
bda8ec7
[Analyzer] Add SideEffect Checking in ConstIntBound Analyzer (#1824)
kurisu6912 Feb 9, 2026
c99879a
[Bugfix] Fix ast builder error for `value -= 1` (#1825)
LeiWang1999 Feb 9, 2026
854edec
[Release][Build] Merge libtilelang and libtilelang_modules (#1814)
oraluben Feb 9, 2026
f8b4d9e
[Bugfix] Fix threadIdx variable lookup by thread_tag instead of posit…
LeiWang1999 Feb 9, 2026
2d4f3e6
[Docs] Update nightly build installation instructions in README and I…
xwhzz Feb 10, 2026
523bc05
[BugFix] Reset cur_expect_idx_ correctly for multi-kernel TMA barrier…
ColmaLiu Feb 11, 2026
e666d2d
[Refactor] Treat `local.var` as `local` buffers when deciding vectori…
LeiWang1999 Feb 11, 2026
0d780a9
Fix tilelang global load/store template (#1837)
LJC00118 Feb 11, 2026
4ff81c7
[Refactor] Introduce `T.access_of` to combine `T.address_of` and `acc…
LeiWang1999 Feb 11, 2026
6d10d16
[CUDA][Feature] Add packed FP32x2 math intrinsics and auto vectorized…
LeiWang1999 Feb 12, 2026
1f16d9a
[Example][BugFix] 1SM GEMM example on Blackwell and fix handling of `…
Rachmanino Feb 12, 2026
48a8f4a
[Feature] Hierarchical reduction and warp reduction intrinsics suppor…
tzj-fxz Feb 13, 2026
1c742ae
[Dist][Release] Use one wheel for different CUDA version (#1826)
oraluben Feb 14, 2026
5e3c6b3
[Enhancement] Optimize templates for half/bfloat16 (#1845)
LJC00118 Feb 14, 2026
4956b58
ThreadSync: avoid barriers between atomic ops (#1852)
LeiWang1999 Feb 15, 2026
ffa0cdd
[BugFix] Fix eager mode where there is no tensor args (#1851)
Rachmanino Feb 15, 2026
110ef30
[AMD] Fix bugs about AMD FA kernel (#1701)
danielhua23 Feb 16, 2026
627b579
Add an example: mHC residual projection backward (#1758)
Da1sypetals Feb 16, 2026
41b2552
[Release] Bump version into v0.1.8 (#1853)
LeiWang1999 Feb 16, 2026
e6fe4e1
tir: add T.cdiv alias for T.ceildiv (#1856)
LeiWang1999 Feb 19, 2026
e9a25ce
[Typo] Modify acc_o accumulation operation in README (#1860)
bucket-xv Feb 19, 2026
fbbda0f
[Codegen] Metal codegen on Linux (#1857)
oraluben Feb 19, 2026
a60d619
[Enhancement] Enhance the conditions for async proxy in `InjectFenceP…
Rachmanino Feb 20, 2026
391cf5b
[Enhancement] GEMM V2 on SM90/SM100 CuTeDSL backend (#1855)
lucifer1004 Feb 22, 2026
9f25954
[Refactor] Refactor Pass InjectFenceProxy (#1863)
LeiWang1999 Feb 23, 2026
7ab2b2b
[BugFix] ArgBinder: relax shared-shape binding for unused nullable bu…
LeiWang1999 Feb 24, 2026
a61da71
[Build] Build tilelang without host toolchain (#1833)
oraluben Feb 24, 2026
777df3e
[LoopVectorize] Loop Independent Var Optimization in IfThenElse Expr …
kurisu6912 Feb 24, 2026
44bc8e8
[Refactor][Tools] Add view argument to plot_layout defaulting to stan…
LeiWang1999 Feb 24, 2026
a22fa61
layout: add Layout.repeat for tiling atom layouts (#1875)
LeiWang1999 Feb 24, 2026
9bd62f5
[Layout] Add Layout.expand to lift a layout into higher dimensions an…
LeiWang1999 Feb 25, 2026
911385f
[BugFix] Fix Hopper TMA lowering without warp specialization (#1840)
Henry-Jessie Feb 25, 2026
808b0ce
[Feature] Introduce higher-dimensional gemm layout support (#1798)
LeiWang1999 Feb 25, 2026
026c709
[Build] Disable gtest in tvm (#1877)
oraluben Feb 25, 2026
0fa78e2
[AMD] Fix gfx950 ci and add 16x16x32_bf16/fp16 instructions support (…
benenzhu Feb 25, 2026
2a85f09
[FIX] Fix kernel file suffix for cutedsl (#1865)
jeromeku Feb 25, 2026
28a976a
[FIX] Fix flattened buffer elem_offset to avoid double-count in acces…
bolairookie Feb 27, 2026
7fcd62b
[Enhancement] Clarify the semantic rule of copy operator and add shap…
SiriusNEO Feb 27, 2026
2e854b0
[Feature] Support cluster launch, query, synchronization and barrier …
Rachmanino Feb 28, 2026
eb27557
[CUDA] Support tcgen5mma gemm ts (#1866)
Hale423 Feb 28, 2026
48c2e99
[CI]: Bump actions/upload-artifact from 6 to 7 (#1888)
dependabot[bot] Mar 2, 2026
d94e4de
[CI]: Bump actions/download-artifact from 7 to 8 (#1889)
dependabot[bot] Mar 2, 2026
f708d0d
[CI] [pre-commit.ci] autoupdate (#1891)
pre-commit-ci[bot] Mar 3, 2026
df5ade9
Refactor CUDA version checks for compute 9.0 (#1893)
LeiWang1999 Mar 3, 2026
c214f0a
[BugFix] Fix type mismatch when lowering to AtomicAddx2 template (#1898)
SiriusNEO Mar 4, 2026
00267e4
[BugFix] add target context and avoid redundant re-lowering in TLCPUS…
xyyy1420 Mar 5, 2026
266f2f7
[Feature] Add DumpIR PassConfig in TileLang side (#1903)
SiriusNEO Mar 5, 2026
26ac2fe
[BugFix] Add vector type definitions to common.h for CPU codegen (#1901)
xyyy1420 Mar 5, 2026
c394e37
Avoid cvt instruction in FP4 before cuda 13.0 (#1880)
bucket-xv Mar 5, 2026
0fb658e
feat: configurable compiler temp file cleanup (#1900)
LeiWang1999 Mar 5, 2026
4929ad8
[Refactor] Improve cp.async lowering and add async_copy op (#1887)
LeiWang1999 Mar 6, 2026
9bae56f
[BugFix] Fix ROCm/HIP kernel launch using CUDA-only API (#1905)
Rachmanino Mar 7, 2026
0765d1d
[CI]: Bump pypa/cibuildwheel from 3.3 to 3.4 (#1914)
dependabot[bot] Mar 9, 2026
5601cf2
feat: add ROCm/HIP stub libraries for lazy loading (mirrors CUDA stub…
LeiWang1999 Mar 10, 2026
c0858ef
[Analysis] Refactor FragmentLoopChecker visiting style (#1884)
SiriusNEO Mar 10, 2026
ff1b8cc
[Feature] Add T.gemm support for CPU target (#1904)
xyyy1420 Mar 10, 2026
cd03ed1
[Bugfix] Minor fix for warp specialized gemm swizzling (#1920)
LeiWang1999 Mar 11, 2026
9bc282e
[Refactor] Align infer_shared_layout method in GemmTCGEN5 with WGMMA …
LeiWang1999 Mar 11, 2026
806cdd7
testing: prefer hipBLAS on ROCm in pytest setup (#1924)
LeiWang1999 Mar 11, 2026
bdbb32c
Support ptr-table grouped GEMM kernels (#1923)
LeiWang1999 Mar 12, 2026
59db777
[Enhancement] Only skip parallel loop partitioning when all stores a…
LJC00118 Mar 12, 2026
ccfc127
[Feature] Add CUDA intrinsic for isfinite operation (#1925)
LeiWang1999 Mar 12, 2026
7b6b64f
[Enhancement] Add eager-mode support for tilelang.autotune (#1906)
ColmaLiu Mar 12, 2026
e7a7510
[Docs] Add notes for new skip partitioning parallel loops strategy (#…
SiriusNEO Mar 12, 2026
d338c4f
[Bugfix] Fix concurrent TempDirectory creation during CUDA compilatio…
LeiWang1999 Mar 13, 2026
8fe367c
[Runtime] Improve TMA descriptor diagnostics (#1931)
LeiWang1999 Mar 13, 2026
94959b8
Add machine architecture in cache key (#1933)
kurisu6912 Mar 15, 2026
fee041c
Fix predicated cp.async pipeline scheduling (#1937)
LeiWang1999 Mar 17, 2026
ded6a99
[Feature] Add Producer-Consumer Warp Specialization and T.tma_copy() …
LeiWang1999 Mar 18, 2026
f8dc61c
test: reduce CI runtime for slow Python suites (#1932)
LeiWang1999 Mar 18, 2026
f1d4205
[BugFix] Update usage of tma load in SM100 manual warp-specialized ex…
Rachmanino Mar 18, 2026
9246b82
[Refactor] Replace create_list_of_mbarrier with buffer-based T.alloc_…
LeiWang1999 Mar 18, 2026
5a538f3
Support packed subtype views during layout reshape (#1947)
LeiWang1999 Mar 19, 2026
12b45b3
[Enhancement] Use stronger prover in `ProveFragmentContains` to avoid…
LJC00118 Mar 20, 2026
b5bd642
[Refactor] Separate gemm into explicit `wgmma_gemm` and `tcgen05_gemm…
LeiWang1999 Mar 20, 2026
715ca30
[Bugfix] Handle int64 offsets in ThreadSync for tvm_access_ptr (#1952)
LeiWang1999 Mar 20, 2026
9eba9fe
[Refactor] Simplify mbar validation in GEMM initialization (#1955)
LeiWang1999 Mar 20, 2026
718ea65
[Bugfix] Visit PrimExpr values in CallNode annotations during expr mu…
LeiWang1999 Mar 21, 2026
05dba65
Fix T.gemm() on SM75 Turing GPUs by including SM75 MMA headers (#1956)
Greal-dev Mar 22, 2026
841b6bd
[PIpeline] Enable software pipelining when warp specialization is una…
LeiWang1999 Mar 22, 2026
6d7baa7
[Example] Flash Attention SM100 (#1910)
Hale423 Mar 22, 2026
71a7940
[AMD][Radeon] Upgrade Rocm version to be 7.2 and add the support of R…
zhangnju Mar 23, 2026
045d317
[Bugfix] Fix thread race in getPlaceholder during par_compile (#1961)
kurisu6912 Mar 23, 2026
d0090b4
[Feature] Support alloc global workspace (#1940)
SiriusNEO Mar 23, 2026
b3020fa
[Enhancement] Enhance compatibility for older torch versions and dyna…
Rachmanino Mar 23, 2026
dd0cd3e
[Feature] 2-SM support for TMA, TMEM and TCGEN5MMA on Blackwell (#1882)
Rachmanino Mar 24, 2026
a2d6e01
[Bugfix] Fix double buffer versioning when TMA is used without warp s…
LeiWang1999 Mar 24, 2026
b3a32ac
[Bugfix] Fix vectorize planner ignoring cast source type bit width (#…
LeiWang1999 Mar 24, 2026
5635863
[Bugfix] Tolerate size-1 dim strides in RelaxedStrideCheck for DLPack…
Rachmanino Mar 24, 2026
3ba86c1
[BugFix] Fix bugs in `gemm_streamk` example on SM90 (#1969)
Rachmanino Mar 24, 2026
58c0e0f
Refactor producer-consumer WS access tracking for WGMMA-local state (…
LeiWang1999 Mar 25, 2026
e2c8833
Fix wrapped pre-loop TMA prefixes in producer-consumer WS (#1975)
LeiWang1999 Mar 26, 2026
bbabab3
[BugFix] Use content hash instead of mtime for libtilelang cache key …
LeiWang1999 Mar 26, 2026
8804982
[Feature] Introduce annotation for `minBlocksPerMultiprocessor` in `_…
Rachmanino Mar 26, 2026
c31ff58
Unified packed x2 intrinsics with multi-dtype support and bug fixes (…
bucket-xv Mar 26, 2026
dc60ab8
[Bugfix] Fix alloc_var re-bind warning when assigned with comparison …
kurisu6912 Mar 27, 2026
bdf436d
[Feature] Support TMA store in T.tma_copy() (#1981)
LeiWang1999 Mar 27, 2026
3a956e0
fix(merge_shmem): allow shared memory reuse for buffers with disjoint…
reoLantern Mar 29, 2026
6ca795a
[example] use alloc_global in split-kv decode kernel (#1991)
botbw Mar 30, 2026
1ede0bd
Introduce T.deallocate_tmem and T.transpose (#1971)
LeiWang1999 Mar 30, 2026
a9b8d53
Add `annotations` parameter to `alloc_buffer` in `tilelang/language/a…
Copilot Mar 30, 2026
1c561f6
[Bugfix] Raise error on zero grid dimension instead of silent clamp (…
LeiWang1999 Mar 30, 2026
6a859f1
[BugFix] Fix missing barrier init attrs when TMA is disabled (#1995)
Rachmanino Mar 31, 2026
0f7c214
[BugFix] Add missing fences in GEMM SM100 examples and canonicalize t…
Rachmanino Mar 31, 2026
eb6f05c
[Refactor] Refactor CUDA atomic helpers (#2001)
SiriusNEO Mar 31, 2026
8c3b043
[Bugfix] Fix CuTeDSL autotune cache invalid ELF header (#1967) (#1972)
kurisu6912 Mar 31, 2026
a82fa71
fix: fix copy+cast vectorize loop to use wider vector load/store inst…
Achazwl Mar 31, 2026
e45ecf7
[Feature] Support T.annotate_compile_flags, T.annotate_pass_configs, …
kurisu6912 Apr 2, 2026
6e6295f
[BugFix] Fix CI failures: clean /tmp on self-hosted runners and skip …
kurisu6912 Apr 2, 2026
5f70374
[Test] Add 1D TMA regression test for issue #1842 (#2005)
kurisu6912 Apr 3, 2026
6fc3afa
[BugFix] Fix auto vectorization for binary operations after wider cop…
Achazwl Apr 4, 2026
01c714d
fix: add cudaGetLastError check after cuLaunchKernel in TVM FFI backe…
kurisu6912 Apr 4, 2026
bb79425
[CI] Remove legacy dequantize gemm test (#2013)
LeiWang1999 Apr 5, 2026
1ff58bc
[CI] [pre-commit.ci] autoupdate (#2014)
pre-commit-ci[bot] Apr 6, 2026
4f75940
[BugFix] Enhance CUDA vectorization for binary operations (#2015)
LeiWang1999 Apr 7, 2026
868c740
[Docs] fix arrow direction in ir_transform_diagram.png (#2016)
kermanx Apr 7, 2026
cabdb86
[codex] Fuse packed x2 mul-add into fma2 in CUDA codegen (#2017)
LeiWang1999 Apr 7, 2026
37c1c0c
[codex] Reduce slow pytest runtime in testing/python (#2018)
LeiWang1999 Apr 7, 2026
3ee0988
[Refactor][Pipeline] Run pipeline rewriting before layout inference a…
LeiWang1999 Apr 7, 2026
1cb8fde
Bump transformers from 4.53.0 to 5.0.0rc3 in /examples/bitnet-1.58b (…
dependabot[bot] Apr 8, 2026
469a847
pin apache-tvm-ffi<0.1.10 (derived_object regression) (#2020)
oraluben Apr 8, 2026
86e37b7
Fix serial loop phase dtype mismatch in LowerTileOp (#2022)
LeiWang1999 Apr 8, 2026
a5e3f19
Re-enable deprecated `TL_DISABLE_TMA_LOWER` pass config for TMA store…
LJC00118 Apr 10, 2026
b1a88bf
[Misc] Remove mistakenly introduced temp file (#2027)
SiriusNEO Apr 10, 2026
853e805
[Codegen] Add lexical_alloc_scope for scoped local variable lifetime …
LeiWang1999 Apr 11, 2026
90299d6
[Bugfix] Fix incorrect sync hoist for fragment buffer conditions in T…
LeiWang1999 Apr 11, 2026
d619164
add .agents/skills/build/SKILL.md for build conventions (#2019)
oraluben Apr 11, 2026
7a515b5
[AMD][gfx950] Add gfx950 support for DeepGeem example (#2028)
zhangnju Apr 12, 2026
5d729ee
[Refactor] Remove GEMM v1 and promote gemm_py to be the canonical gem…
LeiWang1999 Apr 13, 2026
35d8139
[CI]: Bump actions/github-script from 8 to 9 (#2036)
dependabot[bot] Apr 13, 2026
8243f7e
Nan propagation option for bf16 and half16 (#1958)
haoran35-jpg Apr 13, 2026
19236b4
[Feature] Add TIR builtins for warp-level vote and block-level predic…
sepcnt Apr 13, 2026
b3d5981
[API] Default warp-lane mask to 0xFFFFFFFF for warp-sync builtins (#2…
LeiWang1999 Apr 13, 2026
a8bafa6
fix: suppress false positive conflict write warning when dst index de…
kurisu6912 Apr 14, 2026
74fc980
[Refactor] Refactor `DecoupleTypeCast` Pass (#2026)
LJC00118 Apr 14, 2026
f309d81
[Bugfix][Subtype] Fix scalar fp4 store/load codegen for non-packed bu…
kurisu6912 Apr 14, 2026
a16ff86
[Feature] autodd: add __freeze__ annotation to protect code regions f…
kurisu6912 Apr 15, 2026
d2e02e1
[BugFix] Skip MMA shared buffer layout inference when layout already …
kurisu6912 Apr 15, 2026
7893678
[Refactor] Remove obsolete RewriteWgmmaSync pass (#2046)
LeiWang1999 Apr 15, 2026
f12fb47
[Refactor] Move target gating into InjectFenceProxy pass entry (#2047)
LeiWang1999 Apr 15, 2026
235ad7e
Add regression test for 1D TMA load compilation and execution (#1989)
huyhoang171106 Apr 15, 2026
891109e
[Transform] Add InjectTcgen05Fence pass (#2003)
LeiWang1999 Apr 16, 2026
d098f51
[Enhancement] Use atomic directory rename for cache writes (#1982)
LeiWang1999 Apr 16, 2026
cb00a60
Replace syntactic loop-var checks with invariance checks (#2050)
LJC00118 Apr 17, 2026
27f1f81
[Feature][Example] Introduce CLC tile schedule and add example for sm…
Rachmanino Apr 17, 2026
77cbe6d
[Feature] Introduce T.CUDASourceCodeKernel (#1970)
SiriusNEO Apr 17, 2026
9c5fe44
[BugFix] Keep shared-prelude local vars in producer-consumer WS (#2055)
Rachmanino Apr 17, 2026
04468a3
[Bugfix] Fix stage-expanded annotated-layout aliases in LayoutInferen…
TerminusAkivili Apr 17, 2026
e1d6388
[Cache] Refactor cache namespace layout (#2057)
LeiWang1999 Apr 18, 2026
0924dab
[Bugfix] Use shared::cta instead of shared::cluster for non-cluster T…
qqq-tao Apr 18, 2026
f8e7059
fix: improve warning output in eager frontend (#2064)
kurisu6912 Apr 20, 2026
99ee74b
[CUDA] Support int4 `T.gemm` (#2063)
LeiWang1999 Apr 20, 2026
c797e41
[Bugfix] Correct index calculation in Software Pipeline pass (#2070)
Rachmanino Apr 20, 2026
96c649f
Add frontmatter for the build skill (#2068)
VitalyAnkh Apr 21, 2026
b744da1
Refactor ptx_ldmatrix to use tl.access_ptr with simplified signature …
LeiWang1999 Apr 21, 2026
046b1bd
[FFI] Remove upper version bound on apache-tvm-ffi (#2071)
LeiWang1999 Apr 21, 2026
9c95a42
[Refactor] Phaseout legacy util `map_torch_type` with `T.dtype.as_tor…
LeiWang1999 Apr 21, 2026
64bd742
[Bugfix] Fix reduce layout (#2074)
bucket-xv Apr 21, 2026
948d38a
[Refactor] Disable unhelpful warning print (#2077)
LeiWang1999 Apr 21, 2026
15309f5
[CUDA] Improve int4 GEMM lowering and packed codegen support (#2073)
LeiWang1999 Apr 21, 2026
4e7d126
Bump pytest --numprocesses from 4 to 8 across all platforms (#2076)
LeiWang1999 Apr 21, 2026
4b3127a
[Enhancement] Enhance alloc_var function to handle _ptr_sentinel dtyp…
LeiWang1999 Apr 22, 2026
6fee850
[Release] Bump version into 0.1.9 (#2060)
LeiWang1999 Apr 22, 2026
441c3b0
[Refactor] Strip build machine paths from LOG messages in wheel relea…
LeiWang1999 Apr 22, 2026
a640a89
[AMD][Radeon] Add the Support of RDNA3/RDNA3.5(gfx11) WMMA (#2044)
jiawei-real Apr 22, 2026
55cf9c1
[codex] Remove dead transform pass leftovers (#2083)
LeiWang1999 Apr 22, 2026
9aba41f
[Bugfix] Enable `.shared::cta` in TMA copy paths only on CUDA 12.8+ (…
ColmaLiu Apr 22, 2026
b88bbcb
[AMD][gfx950] Add ds_read_tr16_b64 / ds_read_tr8_b64 support for gfx9…
zhangnju Apr 23, 2026
6d0bffb
[AMD][Gfx950] Add the support of 160K LDS and copy.async (#2058)
zhangnju Apr 23, 2026
0f29f9c
[BugFix] Relax loop wait and adjust trailing drain behavior in async …
Rachmanino Apr 23, 2026
264efe2
[Feature] Block-scaled GEMM support for MXFP8 on Blackwell (#1945)
Rachmanino Apr 24, 2026
057e5ba
[Host CodeGen][Refactor] Cleanup namespace and remove useless C templ…
SiriusNEO Apr 25, 2026
3f16e50
Add opt-out for prelower semantic checks for DeepSeek V4 Flash on ARM…
foraxe Apr 25, 2026
0ee6345
[Example] Add HISA: hierarchical sparse attention indexer (#2069)
xuyufei-a Apr 25, 2026
8f4a08f
[Language] Small cleanup and notes for alloc global (#2100)
SiriusNEO Apr 25, 2026
8e12157
[Enhancement] Optimize hopper fp8 deepgemm tile size (#2103)
Rachmanino Apr 26, 2026
ffdf514
[CUDA][SM100] Include cuda_fp6.h when emitting FP6 types (#2102)
TerminusAkivili Apr 26, 2026
6a29c76
feat: support cdna4 v_mfma_i32_16x16x64_i8 & v_mfma_i32_32x32x32_i8 (…
Paran0idy Apr 26, 2026
53a4c98
[AMD] [gfx950]Fix multiple HIP codegen bugs to support TileKernel (…
zhangnju Apr 26, 2026
225ead6
[Language][UX] User-friendly error report when incorrectly indexing b…
SiriusNEO Apr 27, 2026
5680e39
[TMA] Support FP4 TensorMap TMA copies (#2107)
LeiWang1999 Apr 28, 2026
2eb877b
[Example] Add MXFP8 blockscaled grouped gemm examples with transB sup…
Rachmanino Apr 28, 2026
6548c05
[Feature] Batched AllReduce for better T.reduce performance (#1976)
kurisu6912 Apr 28, 2026
5d09b5d
fix: add missing TvmLogDebugSettings::ParseSpec and VerboseEnabledImp…
kurisu6912 Apr 28, 2026
b0bec1f
[Refactor][Build] Separate CMakeLists into different backends (#2114)
SiriusNEO Apr 28, 2026
6c6e170
[Enhancement][CUDA][SM100] Report unsupported FP6 vector types earlie…
TerminusAkivili Apr 28, 2026
4639c27
[AMD][CI issue] add gfx950 guard to fix the CI issues (#2105)
zhangnju Apr 29, 2026
cabc702
[BugFix] Fix redundant runtime bounds checks for BufferLoad indices i…
SiriusNEO Apr 29, 2026
b4c913b
[Fix] Unable to allocate shared memory buffer from tail (#2106)
Denverjin Apr 30, 2026
936ae92
[FIX] Fix kernel file suffix for cutedsl when only target is set (#2128)
ur4t Apr 30, 2026
9adc8dc
Change disable_out_of_bound_warning default to True (#2131)
kurisu6912 Apr 30, 2026
2eec5f0
[Typo] Fix typos in comments and example README (#2133)
yurekami Apr 30, 2026
9452312
[codex] Fix 1D TMA store layout inference (#2137)
LeiWang1999 May 3, 2026
129b400
[Fix][Build] Disable Cython PEP-489 multi-phase init for the cython w…
yurekami May 3, 2026
d135bd1
fix: TMA alignment to 1024 bytes on Blackwell (#2134)
kasper0406 May 3, 2026
d7e7fa3
draft vmm
Rachmanino May 5, 2026
10c419e
refactor, add configuration script, single node test passed
Rachmanino May 5, 2026
26d6329
refactor: remove standalone ts_ext, unify into tilelang/distributerd/…
Rachmanino May 5, 2026
84c5f81
[CI] [pre-commit.ci] autoupdate (#2149)
pre-commit-ci[bot] May 5, 2026
03d35c0
minor fix
Rachmanino May 5, 2026
107cade
draft
Rachmanino May 5, 2026
dc7142c
lint
Rachmanino May 5, 2026
caef91f
cleanup
Rachmanino May 5, 2026
3e87f1e
[TMA] Fix TMA descriptor init placement (#2151)
LeiWang1999 May 6, 2026
2905fde
[Refactor] Refactor register annotation lowering (#2088)
Rachmanino May 6, 2026
0c83691
[Feature][Fix] Extend TCGEN5 F8F6F4 dtype plumbing (#2126)
TerminusAkivili May 6, 2026
a1d9972
[Sync] Merge upstream TileLang at 0c836912 (Extend TCGEN5 F8F6F4 dtyp…
Rachmanino May 6, 2026
c370d93
fix: remove stale includes of src/target/cuda.h
Rachmanino May 6, 2026
b6743c1
fix: align TileScale transforms with upstream APIs
Rachmanino May 6, 2026
a76f202
fix: add missing includes to annotate_warp_group_reg_alloc.cc
Rachmanino May 6, 2026
d3135cf
fix: restore upstream annotate_warp_group_reg_alloc.cc
Rachmanino May 6, 2026
b0f91de
chore: bump version to 0.1.9.post1 (sync from upstream 0.1.9)
Rachmanino May 6, 2026
92e8e47
fix: remove duplicate tl.get_tcgen5_mma_meta registration from gemm_p…
Rachmanino May 6, 2026
26660cd
fix: remove lazy_jit import (no longer in upstream jit module)
Rachmanino May 6, 2026
8e05ca1
fix: add tensor() function and separate try/except for distributed im…
Rachmanino May 6, 2026
a392cb5
fix: add parse_device() to tilelang.utils.target (needed by tensor.py)
Rachmanino May 6, 2026
5c31316
fix: ensure torch device matches CUDA device in _init_table
Rachmanino May 6, 2026
3420d90
fix: set CUDA device before init_process_group in init_dist
Rachmanino May 6, 2026
91aa978
fix: remove torch.set_default_device from init_dist to avoid all_gath…
Rachmanino May 6, 2026
8be3659
fix: parse_device("cuda") returns current_device() not hardcoded 0
Rachmanino May 6, 2026
c970f42
fix: restore upstream versions of shared transform/op files
Rachmanino May 6, 2026
54c0458
fix: restore TileScale distributed features after upstream sync
Rachmanino May 6, 2026
91ed486
docs: add Section 10 with merge lessons learned from PR #58
Rachmanino May 6, 2026
e895075
fix: add multimem call_extern detection for need_multimem_h_ flag
Rachmanino May 6, 2026
22e6a7a
fix: restore TMA store/multimem features from wt/vmm working tree
Rachmanino May 6, 2026
2f34040
fix: apply wt/vmm example parameter changes
Rachmanino May 6, 2026
eb8bf9f
fix: remove remaining merge conflict marker in flash_attention example
Rachmanino May 6, 2026
9460e95
fix: re-restore inject_fence_proxy.cc and multimem.h from wt/vmm
Rachmanino May 6, 2026
8a70f86
fix: ruff F821 and codespell issues
Rachmanino May 6, 2026
d7bb25a
fix: restore DeepEP submodule entry in .gitmodules
Rachmanino May 6, 2026
4dd3c03
fix: sort spelling_wordlist.txt for file-contents-sorter hook
Rachmanino May 6, 2026
7b50b04
fix: remove nvshmem_issue.md as it is no longer needed
Rachmanino May 6, 2026
21a01f0
fix: symlink cuda driver stub for cibuildwheel manylinux CI
Rachmanino May 6, 2026
f7ade80
fix: update cibuildwheel to v3.4 and symlink cuda driver stub
Rachmanino May 6, 2026
a3b9554
fix: apply pre-commit fixes (end-of-file, ruff format)
Rachmanino May 6, 2026
b3c7261
fix: add back fence_sys, inc_max_nreg, dec_max_nreg to builtin.py
Rachmanino May 6, 2026
59ef17a
fix: restore all TileScale distributed builtin functions
Rachmanino May 6, 2026
73b9749
fix: resolve ruff lint errors (F811 duplicate defs, UP037/F821 Litera…
Rachmanino May 7, 2026
e4aee5f
fix: remove version pin from clang-format in requirements-lint.txt
Rachmanino May 7, 2026
0acced7
fix: remove requirements-lint.txt from test and build requirements
Rachmanino May 7, 2026
9192b8c
Revert "fix: remove requirements-lint.txt from test and build require…
Rachmanino May 7, 2026
60bdf86
Revert "fix: remove version pin from clang-format in requirements-lin…
Rachmanino May 7, 2026
a3e45d9
fix: remove Tsinghua mirror from CI pip install commands
Rachmanino May 7, 2026
7490730
fix: restore requires_distributed lost during TileLang merge
Rachmanino May 7, 2026
b0b39cb
fix: restore distributed op codegen lowered during TileLang merge
Rachmanino May 7, 2026
055c9b8
style: apply clang-format
Rachmanino May 7, 2026
32e2d1c
fix: define meta_data in cubin, fix GetGlobal error handling
Rachmanino May 7, 2026
308a500
fix: use CUDA runtime API for module loading to fix context errors
Rachmanino May 7, 2026
88 changes: 88 additions & 0 deletions .agents/skills/tilelang-build/SKILL.md
@@ -0,0 +1,88 @@
---
name: tilelang-build
description: Repository-specific build, rebuild, install, and test instructions for tilelang. Use when working in the tilelang repository and the correct commands are needed for building from source, reinstalling after changes, or running project tests.
---

# Build & Install

## Installing / Rebuilding tilelang

The standard way to build and install:

```bash
pip install .
```

Or with verbose output for debugging build issues:

```bash
pip install . -v
```

`uv pip install .` also works if `uv` is available, but `uv` is not required.

Build dependencies are declared in `pyproject.toml` and resolved automatically during `pip install .`.

If `ccache` is available, repeated builds only recompile changed C++ files.

## Alternative: Development Build with `--no-build-isolation`

If you need faster iteration (e.g. calling `cmake` directly to recompile C++ without re-running the full pip install), install build dependencies first:

```bash
pip install -r requirements-dev.txt
pip install --no-build-isolation .
```

After this, you can invoke `cmake --build build` directly to recompile only changed C++ files. This is useful when iterating on C++ code.

## Alternative: cmake + PYTHONPATH (recommended for C++ development)

For the fastest C++ iteration, bypass pip entirely and drive cmake directly:

```bash
# Configure (auto-detects CUDA; git submodules are initialised automatically)
cmake -S . -B build

# Build
cmake --build build -j$(nproc)

# Make the local tilelang package importable
export PYTHONPATH=$(pwd):$PYTHONPATH
```

After the initial configure, recompiling is just `cmake --build build -j$(nproc)`. The runtime automatically discovers native libraries from `build/lib/` when it detects a dev checkout (see `tilelang/env.py`).
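
One way to confirm that a build actually produced native libraries is to list the shared objects under `build/lib/`. The sketch below is a hypothetical helper and is not taken from `tilelang/env.py`; the function name `list_native_libs` and the set of file suffixes are assumptions made for illustration.

```python
from pathlib import Path

# Shared-library suffixes on Linux, macOS, and Windows. This set is an
# assumption about what the build emits; adjust it for your platform.
LIB_SUFFIXES = {".so", ".dylib", ".dll"}


def list_native_libs(repo_root: str) -> list[str]:
    """Return sorted file names of shared libraries in <repo_root>/build/lib."""
    lib_dir = Path(repo_root) / "build" / "lib"
    if not lib_dir.is_dir():
        # No build directory yet: nothing to report.
        return []
    return sorted(
        p.name for p in lib_dir.iterdir() if p.is_file() and p.suffix in LIB_SUFFIXES
    )


if __name__ == "__main__":
    libs = list_native_libs(".")
    print(libs if libs else "no native libraries found under build/lib")
```

An empty result after `cmake --build build` suggests the build did not complete or wrote its output somewhere else.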

Useful cmake options:

| Flag | Purpose |
|------|---------|
| `-DUSE_CUDA=ON/OFF` | Enable/disable CUDA backend (ON by default) |
| `-DUSE_ROCM=ON` | Enable ROCm/HIP backend |
| `-DUSE_METAL=ON` | Enable Metal backend (default on macOS) |
| `-DCMAKE_BUILD_TYPE=Debug` | Debug build with `TVM_LOG_DEBUG` enabled |

## Editable Installs

**Never use `pip install -e .`** (editable install). When Python runs from the repo root, the local `./tilelang` directory is imported instead of the installed copy, because the current directory is on `sys.path` by default. An editable install is therefore unnecessary, and it can cause import confusion with this project's layout.
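
To check which copy of a package Python would pick up from the current working directory, `importlib.util.find_spec` reports the resolved location of a top-level name without running the package. This is a minimal sketch: `locate_package` is a hypothetical helper, and `json` is used as a stand-in so the example runs without tilelang installed.

```python
import importlib.util
from typing import Optional


def locate_package(name: str) -> Optional[str]:
    """Return the origin Python would load top-level `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # For a regular package this is the path to its __init__.py.
    return spec.origin


if __name__ == "__main__":
    # Replace "json" with "tilelang" when running inside the repository.
    print(locate_package("json"))
```

If the printed path points into the repository checkout rather than `site-packages`, the local directory is the one being imported.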

## Running Tests

Most tests require a GPU.

```bash
python -m pytest testing/python/ -x
```

Run a specific test file or test case:

```bash
python -m pytest testing/python/language/test_tilelang_language_copy.py -x
python -m pytest testing/python/language/test_tilelang_language_copy.py -x -k "test_name"
```

For Metal-specific tests (requires macOS with Apple Silicon):

```bash
python -m pytest testing/python/metal/ -x
```
12 changes: 6 additions & 6 deletions .github/workflows/ci.yml
@@ -201,10 +201,10 @@ jobs:
if [[ "${UV_INDEX}" == *"/nightly/"* ]]; then
uv pip install --prerelease=allow -v torch
fi
- uv pip install -v -r requirements-test.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
+ uv pip install -v -r requirements-test.txt
echo "import torch; print(f'torch: {torch.__version__}')" | uv run --no-project --script -
if [[ "${{ matrix.runner.toolkit }}" == *"CUDA"* ]]; then
- uv pip install --no-build-isolation-package=flash-attn -v -r requirements-test-cuda.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
+ uv pip install --no-build-isolation-package=flash-attn -v -r requirements-test-cuda.txt
echo "import flash_attn; print(f'flash_attn: {flash_attn.__version__}')" | uv run --no-project --script -
# elif [[ "${{ matrix.runner.toolkit }}" == *"ROCm"* ]]; then
# uv pip install -v -r requirements-test-rocm.txt
@@ -304,12 +304,12 @@ jobs:
# Run distributed tests (marked with @requires_distributed) with TILELANG_USE_DISTRIBUTED=1
# DeepEP tests requires fullmesh nvl or internode environment, we disable for now
echo "Running distributed examples with TILELANG_USE_DISTRIBUTED=1:"
- TILELANG_USE_DISTRIBUTED=1 "${PYTEST[@]}" --maxfail=3 --numprocesses=1 -m distributed --ignore-glob='*deepep*' . || true
+ TILELANG_USE_DISTRIBUTED=1 "${PYTEST[@]}" --maxfail=3 --numprocesses=1 -m distributed --ignore-glob='*deepep*' .

# Run remaining example tests (non-distributed)
# Temporarily disable problematic tests: sink, vs_sparse
echo "Running non-distributed examples:"
- "${PYTEST[@]}" --maxfail=3 --numprocesses=2 -m "not distributed" -k "not sink and not vs_sparse" . || true
+ "${PYTEST[@]}" --maxfail=3 --numprocesses=2 -m "not distributed" -k "not sink and not vs_sparse" .

# NVIDIA CUDA tests
- name: Run CUDA tests with Python ${{ matrix.python-version }} (${{ matrix.runner.toolkit }})
@@ -325,12 +325,12 @@ jobs:

# Run distributed tests (marked with @requires_distributed) with TILELANG_USE_DISTRIBUTED=1
echo "Running distributed tests with TILELANG_USE_DISTRIBUTED=1:"
- TILELANG_USE_DISTRIBUTED=1 "${PYTEST[@]}" --maxfail=3 --numprocesses=1 -m distributed . || true
+ TILELANG_USE_DISTRIBUTED=1 "${PYTEST[@]}" --maxfail=3 --numprocesses=1 -m distributed .

# Run remaining tests (non-distributed)
# Temporarily disable problematic tests: tilelibrary_gemm, jit_gemm_ctypes
echo "Running non-distributed tests:"
- "${PYTEST[@]}" --maxfail=3 --numprocesses=2 -m "not distributed" -k "not tilelibrary_gemm and not jit_gemm_ctypes" . || true
+ "${PYTEST[@]}" --maxfail=3 --numprocesses=2 -m "not distributed" -k "not tilelibrary_gemm and not jit_gemm_ctypes" .

- name: List generated files
if: ${{ !cancelled() }}
2 changes: 1 addition & 1 deletion .github/workflows/dist.yml
@@ -125,7 +125,7 @@ jobs:
fi

- name: Build wheels
- uses: pypa/cibuildwheel@v3.3
+ uses: pypa/cibuildwheel@v3.4
with:
package-dir: .
output-dir: wheelhouse
1 change: 1 addition & 0 deletions .gitignore
@@ -133,3 +133,4 @@ maint/host_checks/logs/*

# perf regression test
.perf_regression/
+ nvshmem_issue.md
2 changes: 1 addition & 1 deletion 3rdparty/composable_kernel
2 changes: 1 addition & 1 deletion 3rdparty/tvm
Submodule tvm updated from 23bce0 to 0e15b2