forked from tile-ai/tilelang
support more than 2 warp groups #9
Open
AutumnKite wants to merge 191 commits into silentCoder-dev:auto-schedule from AutumnKite:auto-schedule
Commits
1ede0bd (LeiWang1999) Introduce T.deallocate_tmem and T.transpose (#1971)
a9b8d53 (Copilot) Add `annotations` parameter to `alloc_buffer` in `tilelang/language/a…
1c561f6 (LeiWang1999) [Bugfix] Raise error on zero grid dimension instead of silent clamp (…
6a859f1 (Rachmanino) [BugFix] Fix missing barrier init attrs when TMA is disabled (#1995)
0f7c214 (Rachmanino) [BugFix] Add missing fences in GEMM SM100 examples and canonicalize t…
eb6f05c (SiriusNEO) [Refactor] Refactor CUDA atomic helpers (#2001)
8c3b043 (kurisu6912) [Bugfix] Fix CuTeDSL autotune cache invalid ELF header (#1967) (#1972)
a82fa71 (Achazwl) fix: fix copy+cast vectorize loop to use wider vector load/store inst…
e45ecf7 (kurisu6912) [Feature] Support T.annotate_compile_flags, T.annotate_pass_configs, …
6e6295f (kurisu6912) [BugFix] Fix CI failures: clean /tmp on self-hosted runners and skip …
5f70374 (kurisu6912) [Test] Add 1D TMA regression test for issue #1842 (#2005)
6fc3afa (Achazwl) [BugFix] Fix auto vectorization for binary operations after wider cop…
01c714d (kurisu6912) fix: add cudaGetLastError check after cuLaunchKernel in TVM FFI backe…
bb79425 (LeiWang1999) [CI] Remove legacy dequantize gemm test (#2013)
1ff58bc (pre-commit-ci[bot]) [CI] [pre-commit.ci] autoupdate (#2014)
4f75940 (LeiWang1999) [BugFix] Enhance CUDA vectorization for binary operations (#2015)
868c740 (kermanx) [Docs] fix arrow direction in ir_transform_diagram.png (#2016)
cabdb86 (LeiWang1999) [codex] Fuse packed x2 mul-add into fma2 in CUDA codegen (#2017)
37c1c0c (LeiWang1999) [codex] Reduce slow pytest runtime in testing/python (#2018)
3ee0988 (LeiWang1999) [Refactor][Pipeline] Run pipeline rewriting before layout inference a…
39adf6a (AutumnKite) remove wg_wait in gemm_auto_tcgen5mma.py
0402120 (AutumnKite) support more than 2 warp groups
b021c2a (AutumnKite) change `before` and `after` to map
1cb8fde (dependabot[bot]) Bump transformers from 4.53.0 to 5.0.0rc3 in /examples/bitnet-1.58b (…
469a847 (oraluben) pin apache-tvm-ffi<0.1.10 (derived_object regression) (#2020)
86e37b7 (LeiWang1999) Fix serial loop phase dtype mismatch in LowerTileOp (#2022)
9a15696 (silentCoder-dev) Fix typo
7668827 (AutumnKite) Merge branch 'auto-schedule' of https://github.com/silentCoder-dev/ti…
d37fd42 (AutumnKite) implement a naive ws on F3
f75ceac (AutumnKite) Merge branch 'auto-schedule' of https://github.com/silentCoder-dev/ti…
9806233 (AutumnKite) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
5c7352e (AutumnKite) fix typo and run format
1381a30 (AutumnKite) Merge branch 'auto-schedule' of https://github.com/silentCoder-dev/ti…
3b47b87 (AutumnKite) modify return value of NaiveBuild
313ed08 (AutumnKite) Merge branch 'auto-schedule' of https://github.com/silentCoder-dev/ti…
101a672 (AutumnKite) fix barrier logic
35f13d2 (AutumnKite) run format
a5e3f19 (LJC00118) Re-enable deprecated `TL_DISABLE_TMA_LOWER` pass config for TMA store…
b1a88bf (SiriusNEO) [Misc] Remove mistakenly introduced temp file (#2027)
9b4e0a2 (Denverjin) Fix warpgroup partition
4db4433 (AutumnKite) Merge pull request #1 from Denverjin/auto-schedule-myh
b31aaa9 (AutumnKite) fix barrier logic
7be6332 (AutumnKite) Merge branch 'auto-schedule' of https://github.com/silentCoder-dev/ti…
853e805 (LeiWang1999) [Codegen] Add lexical_alloc_scope for scoped local variable lifetime …
90299d6 (LeiWang1999) [Bugfix] Fix incorrect sync hoist for fragment buffer conditions in T…
d619164 (oraluben) add .agents/skills/build/SKILL.md for build conventions (#2019)
7a515b5 (zhangnju) [AMD][gfx950] Add gfx950 support for DeepGeem example (#2028)
11dc3e6 (Denverjin) Fix shared memory buffer reuse
5d729ee (LeiWang1999) [Refactor] Remove GEMM v1 and promote gemm_py to be the canonical gem…
35d8139 (dependabot[bot]) [CI]: Bump actions/github-script from 8 to 9 (#2036)
8243f7e (haoran35-jpg) Nan propagation option for bf16 and half16 (#1958)
fc5001f (Denverjin) FIx naive loop var duplication bug
19236b4 (sepcnt) [Feature] Add TIR builtins for warp-level vote and block-level predic…
b3d5981 (LeiWang1999) [API] Default warp-lane mask to 0xFFFFFFFF for warp-sync builtins (#2…
a8bafa6 (kurisu6912) fix: suppress false positive conflict write warning when dst index de…
74fc980 (LJC00118) [Refactor] Refactor `DecoupleTypeCast` Pass (#2026)
f309d81 (kurisu6912) [Bugfix][Subtype] Fix scalar fp4 store/load codegen for non-packed bu…
380fb5e (Denverjin) Support local var fragment
39c3c06 (AutumnKite) Merge pull request #2 from Denverjin/auto-schedule-myh
a16ff86 (kurisu6912) [Feature] autodd: add __freeze__ annotation to protect code regions f…
d2e02e1 (kurisu6912) [BugFix] Skip MMA shared buffer layout inference when layout already …
c93778f (AutumnKite) support tcgen05_gemm
7893678 (LeiWang1999) [Refactor] Remove obsolete RewriteWgmmaSync pass (#2046)
e3d214d (Denverjin) Add if node
011d880 (AutumnKite) Merge pull request #3 from Denverjin/auto-schedule-myh
3c7f4a0 (AutumnKite) remove debug info
db3d859 (AutumnKite) run format
f12fb47 (LeiWang1999) [Refactor] Move target gating into InjectFenceProxy pass entry (#2047)
7ae8d98 (Denverjin) fix loop break detection
dbadd77 (Denverjin) fix control node break & format
510d773 (AutumnKite) Merge pull request #4 from Denverjin/auto-schedule-myh
7ed8266 (Denverjin) Fix if task collection bug
235ad7e (huyhoang171106) Add regression test for 1D TMA load compilation and execution (#1989)
9e78312 (AutumnKite) Merge pull request #5 from Denverjin/auto-schedule-myh
891109e (LeiWang1999) [Transform] Add InjectTcgen05Fence pass (#2003)
8f67446 (AutumnKite) fix loop break
844d04e (Denverjin) fix tma load detection
45f5da4 (Denverjin) fix pro/epilogue identification
9970ac9 (Denverjin) remove debug output & format
1660c85 (AutumnKite) reimplement barrier logic
25b0eb8 (Denverjin) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
37f05ee (Denverjin) partly fix pro/epilogue logic for barrier
e84cee9 (AutumnKite) Merge pull request #6 from Denverjin/auto-schedule-myh
a4aa5f3 (Denverjin) refactor ir structure clone
33e5c36 (Denverjin) fix bug
cee084c (Denverjin) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
36f51af (AutumnKite) Merge pull request #7 from Denverjin/auto-schedule-myh
7d1e6e3 (AutumnKite) fix bug
d098f51 (LeiWang1999) [Enhancement] Use atomic directory rename for cache writes (#1982)
a70fa26 (Denverjin) refactor shared memory buffer merge
561adfe (AutumnKite) Merge pull request #8 from Denverjin/auto-schedule-myh
189c99f (Denverjin) upload latency & ii
68f4710 (AutumnKite) Merge pull request #9 from Denverjin/auto-schedule-myh
cb00a60 (LJC00118) Replace syntactic loop-var checks with invariance checks (#2050)
27f1f81 (Rachmanino) [Feature][Example] Introduce CLC tile schedule and add example for sm…
77cbe6d (SiriusNEO) [Feature] Introduce T.CUDASourceCodeKernel (#1970)
aa0700b (AutumnKite) run format
72d7748 (AutumnKite) fix II of IfNode
4bf8678 (AutumnKite) fix barrier
9c5fe44 (Rachmanino) [BugFix] Keep shared-prelude local vars in producer-consumer WS (#2055)
04468a3 (TerminusAkivili) [Bugfix] Fix stage-expanded annotated-layout aliases in LayoutInferen…
6364f5d (Denverjin) fix pro/epilogue let stmt copy
557bcc1 (Denverjin) fix z3 small n error
3bc1c01 (Denverjin) fix double kernel issue
105b73b (Denverjin) format
b18c60a (Denverjin) undo failed merge
6ca1415 (AutumnKite) Merge pull request #10 from Denverjin/auto-schedule-myh
aa877ab (AutumnKite) check dependency in prologue
b2abf0a (Denverjin) Merge commit '891109ea879839bbfe2ad937db208f2d7c6f1ce0' into auto-sch…
c96dd9e (Denverjin) fix header missing
e1d6388 (LeiWang1999) [Cache] Refactor cache namespace layout (#2057)
0924dab (qqq-tao) [Bugfix] Use shared::cta instead of shared::cluster for non-cluster T…
b13cdf3 (AutumnKite) change the interface to support tasks with wg_id=-1
acadde0 (AutumnKite) remove unused declarations
f8e7059 (kurisu6912) fix: improve warning output in eager frontend (#2064)
36e0015 (Denverjin) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
f7c6f43 (AutumnKite) fix read/write regions
b10e6fb (Denverjin) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
99ee74b (LeiWang1999) [CUDA] Support int4 `T.gemm` (#2063)
c797e41 (Rachmanino) [Bugfix] Correct index calculation in Software Pipeline pass (#2070)
96c649f (VitalyAnkh) Add frontmatter for the build skill (#2068)
b744da1 (LeiWang1999) Refactor ptx_ldmatrix to use tl.access_ptr with simplified signature …
046b1bd (LeiWang1999) [FFI] Remove upper version bound on apache-tvm-ffi (#2071)
9c95a42 (LeiWang1999) [Refactor] Phaseout legacy util `map_torch_type` with `T.dtype.as_tor…
64bd742 (bucket-xv) [Bugfix] Fix reduce layout (#2074)
948d38a (LeiWang1999) [Refactor] Disable unhelpful warning print (#2077)
15309f5 (LeiWang1999) [CUDA] Improve int4 GEMM lowering and packed codegen support (#2073)
4e7d126 (LeiWang1999) Bump pytest --numprocesses from 4 to 8 across all platforms (#2076)
38cac96 (AutumnKite) fix dead-lock bug
4b3127a (LeiWang1999) [Enhancement] Enhance alloc_var function to handle _ptr_sentinel dtyp…
6fee850 (LeiWang1999) [Release] Bump version into 0.1.9 (#2060)
1f831f3 (Denverjin) fix register fragment reuse
cac434c (Denverjin) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
3c3cf53 (AutumnKite) Merge pull request #11 from Denverjin/auto-schedule-myh
441c3b0 (LeiWang1999) [Refactor] Strip build machine paths from LOG messages in wheel relea…
a640a89 (jiawei-real) [AMD][Radeon] Add the Support of RDNA3/RDNA3.5(gfx11) WMMA (#2044)
3aeb962 (Denverjin) fix let stmt clone bug
4eed399 (AutumnKite) add the innermost task to sync infos
55cf9c1 (LeiWang1999) [codex] Remove dead transform pass leftovers (#2083)
9aba41f (ColmaLiu) [Bugfix] Enable `.shared::cta` in TMA copy paths only on CUDA 12.8+ (…
b6e75b1 (Denverjin) fix hopper neutral stage
6e2fb56 (Denverjin) change layout map & remove unused letstmt
89e6812 (Denverjin) disable auto scheduling when using thread vars
be0df30 (AutumnKite) Merge pull request #12 from Denverjin/auto-schedule-myh
f3f6e74 (AutumnKite) find first/last tasks of a buffer access and reduce syncs by checking…
59c9b05 (AutumnKite) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
bdb20c2 (AutumnKite) format
ff067b0 (Denverjin) fix attr warp partition
b88bbcb (zhangnju) [AMD][gfx950] Add ds_read_tr16_b64 / ds_read_tr8_b64 support for gfx9…
95f1d29 (Denverjin) fix let missing bug
65738cd (AutumnKite) Merge pull request #13 from Denverjin/auto-schedule-myh
6d0bffb (zhangnju) [AMD][Gfx950] Add the support of 160K LDS and copy.async (#2058)
10b7f1f (AutumnKite) add double-thread constraint
4cec4ba (AutumnKite) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
ef6a431 (Denverjin) fix local var copy
0f29f9c (Rachmanino) [BugFix] Relax loop wait and adjust trailing drain behavior in async …
e99d35a (Denverjin) fix naive ir structure bug
2ca5e06 (AutumnKite) Merge pull request #14 from Denverjin/auto-schedule-myh
86989d7 (AutumnKite) format
a552916 (AutumnKite) fix: remove unused let
c18c623 (Denverjin) remove redundant letstmts
55cbe0e (Denverjin) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
0edb76c (AutumnKite) move the rewrites forward
09d9071 (AutumnKite) add WAW dependence & avoid duplicated dependence when iter=1
9897fe5 (Denverjin) fix barrier around let missing
3c5578d (Denverjin) Merge branch 'auto-schedule' of https://github.com/AutumnKite/tilelan…
b46d709 (AutumnKite) Merge pull request #15 from Denverjin/auto-schedule-myh
01bf798 (AutumnKite) remove cross-warpgroup dependency for register buffers
d255c0a (Denverjin) fix reused buffer analysis
3df8b46 (Denverjin) check kernel using barrier & format
468d51b (AutumnKite) Merge pull request #16 from Denverjin/auto-schedule-myh
e18b8e6 (AutumnKite) format
264efe2 (Rachmanino) [Feature] Block-scaled GEMM support for MXFP8 on Blackwell (#1945)
057e5ba (SiriusNEO) [Host CodeGen][Refactor] Cleanup namespace and remove useless C templ…
3f16e50 (foraxe) Add opt-out for prelower semantic checks for DeepSeek V4 Flash on ARM…
0ee6345 (xuyufei-a) [Example] Add HISA: hierarchical sparse attention indexer (#2069)
8f4a08f (SiriusNEO) [Language] Small cleanup and notes for alloc global (#2100)
8e12157 (Rachmanino) [Enhancement] Optimize hopper fp8 deepgemm tile size (#2103)
ffdf514 (TerminusAkivili) [CUDA][SM100] Include cuda_fp6.h when emitting FP6 types (#2102)
6a29c76 (Paran0idy) feat: support cdna4 v_mfma_i32_16x16x64_i8 & v_mfma_i32_32x32x32_i8 (…
53a4c98 (zhangnju) [AMD] [gfx950]Fix multiple HIP codegen bugs to support TileKernel (…
73a54d2 (Denverjin) Merge remote-tracking branch 'tilelang/main' into auto-schedule-myh
a2f6a45 (Denverjin) fix T.make_tensor buffer missing
b60e8ce (AutumnKite) Merge pull request #17 from Denverjin/auto-schedule
3490f3a (AutumnKite) add constraints for warpgroup partition & always analyze and insert b…
2903cd1 (AutumnKite) assign tma store to consumer side
28aa5ad (Denverjin) Fix let & barrier bugs
b29d135 (AutumnKite) Merge pull request #18 from Denverjin/auto-schedule
10503ba (AutumnKite) add top-level barriers
f62744a (AutumnKite) format
4f8823b (AutumnKite) fix empty for bug
🧩 Analysis chain
🏁 Script executed:
cat -n src/transform/auto_schedule/schedule_builder.h | head -60
Repository: silentCoder-dev/tilelang
Length of output: 2072
🏁 Script executed:
Repository: silentCoder-dev/tilelang
Length of output: 50
🏁 Script executed:
Repository: silentCoder-dev/tilelang
Length of output: 511
Add the direct include for `uint8_t`.
`WarpSpecializeArch` uses `uint8_t` as its underlying type, but the header does not directly include `<cstdint>`. Relying on transitive includes risks breakage if include chains change in unrelated headers.
✏️ Proposed fix
     #include <tvm/tir/stmt_functor.h>
    +#include <cstdint>
     #include <map>
     #include <memory>