Required prerequisites
Motivation
Tile scheduling, i.e. the process of deciding which tile the current CTA should work on next, has emerged as a critical part of kernel optimization. For example, FA3 and FA4 use a longest-time-first tile scheduler to mitigate the workload imbalance that arises with causal or varlen inputs.
However, TileLang currently provides no such utility (comparable to the tile scheduler in CUTLASS) and relies on users to implement tile scheduling themselves (e.g., by computing a remapped `blockIdx` explicitly in the kernel). Although a primitive for persistent kernels is provided, users still need to rewrite the kernel body when converting from a non-persistent kernel.
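To make the imbalance concrete, here is a rough host-side sketch in plain Python (not TileLang code, and not how FA3/FA4 actually implement it). It assumes a simplified cost model where, under a causal mask, query tile `i` costs `i + 1` units, and it simulates CTAs picking up tiles as they become free. The function names `causal_tile_costs`, `longest_first_order`, and `makespan` are made up for illustration.

```python
import heapq

def causal_tile_costs(num_q_tiles):
    # Assumed cost model: under a causal mask, query tile i attends to
    # KV tiles 0..i, so its cost is proportional to i + 1.
    return [i + 1 for i in range(num_q_tiles)]

def longest_first_order(costs):
    # Remapping table: launch position -> original tile index,
    # with the most expensive tiles scheduled first.
    return sorted(range(len(costs)), key=lambda i: -costs[i])

def makespan(order, costs, num_ctas):
    # Simulate: each tile in `order` is taken by whichever CTA frees up first.
    free_at = [0] * num_ctas
    heapq.heapify(free_at)
    for tile in order:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + costs[tile])
    return max(free_at)

costs = causal_tile_costs(8)
naive = makespan(list(range(8)), costs, num_ctas=4)                # 12
lpt = makespan(longest_first_order(costs), costs, num_ctas=4)      # 9
print(naive, lpt)
```

In this toy setup the in-order schedule finishes at 12 time units while the longest-first order finishes at 9, which is the ideal value (36 total units over 4 CTAs). The remapping table is essentially what users currently have to hand-write as a remapped `blockIdx`.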
Solution
We should probably automate this process, so that once a tile scheduling strategy or 'persistent' is annotated, the kernel is transformed accordingly. (I'm not sure how difficult this would be.)
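As a rough illustration of what such a transformation would have to produce, the plain-Python sketch below contrasts the tile-to-CTA mapping of a non-persistent launch with a persistent one that uses a simple static stride. This is only a model of the index mapping, not TileLang's API or its existing persistent primitive; `non_persistent_tiles` and `persistent_tiles` are hypothetical names, and a static stride is just one possible strategy.

```python
def non_persistent_tiles(num_tiles):
    # One CTA per tile: CTA `block_idx` handles exactly tile `block_idx`.
    return {block_idx: [block_idx] for block_idx in range(num_tiles)}

def persistent_tiles(num_tiles, num_ctas):
    # Fixed grid of `num_ctas` CTAs; each CTA loops over tiles with a
    # stride of `num_ctas` until the tiles are exhausted.
    return {cta: list(range(cta, num_tiles, num_ctas)) for cta in range(num_ctas)}

print(non_persistent_tiles(4))   # {0: [0], 1: [1], 2: [2], 3: [3]}
print(persistent_tiles(10, 4))   # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```

Ideally the kernel body would be written once against a single tile index, and the annotation would decide whether that index comes directly from the block index or from a loop like the one modeled here.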
Alternatives
No response
Additional context
No response