Chunkwise gated linear attention reaching 60~80 TFLOP/s, with step-by-step optimization records by learning-chip · Pull Request #88 · huawei-csl/pto-kernels

learning-chip wants to merge 31 commits into main from linear_attn

Commits on Apr 15, 2026

minimum static linear_attention example

support dynamic B and L dims

minimum benchmark script

a shorter version with faster performance

precompute mask to reduce scalar overhead

add lessons and action plans

use chunk size 128 to get higher TFLOPs

L0 ping pong buffer

two-slot C-V pipelining for higher FLOPs

L1 prefetching for slightly higher FLOPs

explore remaining optim items, and clean-up lib path

step-by-step optimization history

simplify step 1 and 2 kernel source to be more reader friendly

educational comments

move duplicated code to common utils

refresh README and optimization lessons

compare performance to triton reference

improve triton baseline by taking from vllm-ascend

match custom triton kernel perf to vllm extracted one

add PTO chunk=128 cases to comparison table

minor change to PTO-ISA comments and benchmark settings

support scalar gating factor, BSND input, var-length batch, to match triton kernel feature list

handle tail elements using partial LOAD/STORE without python-side padding

pipelining improvements for BSND version

update documents for optimized BSND performance

compare pre-computed mask in triton

fast on-the-fly mask construction

use pto-isa TTRI instead of intrinsics

no need to set Path, just PTO_LIB_PATH=/workdir/pto-isa

enable on-the-fly fast mask construction for the main optimized varlen BSND kernel

fix compile error with pto-isa master around April 03
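
For readers unfamiliar with the algorithm named in the PR title, here is a minimal NumPy sketch of chunkwise gated linear attention for a single head with a scalar gating factor per token. It is not the PTO kernel from this PR; the function names, the log-space gate `log_a`, and the default chunk size are assumptions chosen for illustration. The sketch shows the two ideas that several commits refer to: a causal (lower-triangular) mask applied inside each chunk, and a running state carried between chunks. The last chunk may be shorter than `chunk`, so no padding is needed.

```python
import numpy as np

def gla_recurrent(q, k, v, log_a):
    """Token-by-token reference: S_t = a_t * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    seq_len, d_k = q.shape
    state = np.zeros((d_k, v.shape[1]))
    out = np.empty_like(v)
    for t in range(seq_len):
        state = np.exp(log_a[t]) * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

def gla_chunkwise(q, k, v, log_a, chunk=4):
    """Chunkwise form (hypothetical helper): matmuls inside a chunk, state across chunks."""
    seq_len, d_k = q.shape
    state = np.zeros((d_k, v.shape[1]))
    out = np.empty_like(v)
    for start in range(0, seq_len, chunk):
        end = min(start + chunk, seq_len)  # tail chunk may be shorter
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        b = np.cumsum(log_a[start:end])  # cumulative log-decay within the chunk
        n = end - start
        causal = np.tril(np.ones((n, n), dtype=bool))
        # decay[i, j] = exp(b_i - b_j) for j <= i, and 0 above the diagonal
        diff = np.where(causal, b[:, None] - b[None, :], 0.0)
        decay = np.where(causal, np.exp(diff), 0.0)
        intra = ((qc @ kc.T) * decay) @ vc           # contribution from this chunk
        inter = np.exp(b)[:, None] * (qc @ state)    # contribution from earlier chunks
        out[start:end] = intra + inter
        # carry the state to the end of this chunk
        state = np.exp(b[-1]) * state + (kc * np.exp(b[-1] - b)[:, None]).T @ vc
    return out
```

Both functions take `q` and `k` of shape `(L, d_k)`, `v` of shape `(L, d_v)`, and `log_a` of shape `(L,)` with non-positive entries. They should agree up to floating-point error, which makes the recurrent version a convenient correctness check for any chunk size.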