Any performance results versus other NSA implementations?

Hi, thanks for the insightful blog! Do you have any throughput results comparing this against other implementations, such as:
- the FLA one you mentioned;
- the one in https://github.com/lucidrains/native-sparse-attention-pytorch