
Flash attention #1651

Merged: 3 commits merged on Apr 9, 2024
Conversation

@minhthuc2502 (Collaborator) commented Mar 28, 2024

I am currently working on the integration of flash attention. It is based on the kernel developed in the original repo.

For the current version, I don't see any improvement in performance compared with the standard MHA. Tested on an A100 GPU.

Something to consider here:

  • Implement flash attention with the KV cache fused into the kernel.
  • Accept a small performance loss from the additional transpose of QKV needed to align its shape between the standard and flash attention paths (see the layout sketch after this list).
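
A minimal sketch of the transpose mentioned in the second point, assuming the standard MHA path keeps QKV in [batch, heads, time, head_dim] while the flash-attention kernel expects [batch, time, heads, head_dim]; the exact layouts and names are illustrative assumptions, not taken from the CTranslate2 sources:

```python
import numpy as np

# Illustrative shapes only, not the actual CTranslate2 tensors.
batch, heads, time, head_dim = 2, 8, 128, 64

# Standard MHA path: QKV laid out as [batch, heads, time, head_dim].
q_standard = np.random.rand(batch, heads, time, head_dim).astype(np.float16)

# The flash-attention kernel (as in the original repo) is assumed to take
# [batch, time, heads, head_dim], hence the extra transpose on this path.
q_flash = np.ascontiguousarray(q_standard.transpose(0, 2, 1, 3))

print(q_standard.shape)  # (2, 8, 128, 64)
print(q_flash.shape)     # (2, 128, 8, 64)
```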

Update:

  • With long prompts, there is a good improvement in performance (a rough benchmark sketch follows this list).
  • The package size increases after integrating this feature.
  • Compilation time increases due to the heavy use of templates (but this improves runtime performance).
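
A rough way to reproduce the long-prompt comparison from the Python side, assuming a wheel built from this branch, a converted decoder-only model, and that flash attention is toggled through a `flash_attention` constructor flag; the flag name, model path, and dummy prompt tokens are assumptions, not confirmed by this PR:

```python
import time
import ctranslate2

# Hypothetical model path; any converted decoder-only model would do.
model_path = "llama-2-7b-ct2"

# A long dummy prompt to exercise the prefill, where the gain was observed.
prompt = ["<s>"] + ["▁token"] * 2000

for use_flash in (False, True):
    # The flash_attention flag is assumed here; check the release notes of
    # the version you build for the exact option name.
    generator = ctranslate2.Generator(
        model_path, device="cuda", flash_attention=use_flash
    )
    start = time.time()
    generator.generate_batch([prompt], max_length=32)
    print(f"flash_attention={use_flash}: {time.time() - start:.2f}s")
    del generator  # free the GPU model before the next configuration
```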

@minhthuc2502 added and then removed the "help wanted (Extra attention is needed)" label on Mar 28, 2024
@BBC-Esq commented Apr 2, 2024

Super excited about this. Let me know if you need someone with Windows and an RTX 4090 to test.

@Purfview commented Apr 4, 2024

For the current version, I don't see any improvement in performance

For performance you probably should switch to cuDNN 9.x

From the cuDNN 9.0.0 release notes:

FP16 and BF16 fused flash attention engine performance has been significantly improved for NVIDIA GPUs:
  • Speed-up of up to 50% over cuDNN 8.9.7 on Hopper GPUs.
  • Speed-up of up to 100% over cuDNN 8.9.7 on Ampere GPUs.

@minhthuc2502 (Collaborator, Author) commented Apr 4, 2024

For performance you probably should switch to cuDNN 9.x

I think cuDNN is currently used only for the conv ops, so it wouldn't affect the performance of flash attention. By the way, I do see an improvement when trying with longer inputs. I will do some more tests and improve the compilation time.

@minhthuc2502 force-pushed the dev/flash_attention branch from 0314a82 to 845a327 on April 8, 2024 at 09:31
@minhthuc2502 force-pushed the dev/flash_attention branch from d22ca71 to e6e8f95 on April 8, 2024 at 16:03
@minhthuc2502 changed the title from "[WIP] flash attention" to "Flash attention" on Apr 9, 2024
@minhthuc2502 merged commit 7d63eea into OpenNMT:master on Apr 9, 2024
17 checks passed