
[Feature] support PagedAttention in cuda attention.cc #1317

Open
michaelfeil opened this issue Jun 23, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@michaelfeil
Contributor

vLLM implements a mechanism called "PagedAttention", which enables fast generation of long sequences by managing the KV cache in fixed-size pages.
This might be quite a large feature request.

The blog post at https://vllm.ai/ and this implementation, https://github.com/vllm-project/vllm/blob/665c48963be11b2e5cb7209cd25f884129e5c284/vllm/model_executor/layers/attention.py#L16, may give more insight.
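For illustration only, here is a minimal C++ sketch of the bookkeeping idea behind PagedAttention: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand rather than reserved for the maximum sequence length. Names such as `BLOCK_SIZE`, `BlockAllocator`, and `block_table` are hypothetical and are not part of vLLM or CTranslate2; this is not the attention kernel itself.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Sketch of paged KV-cache bookkeeping (illustrative names, not vLLM/CTranslate2 API).
constexpr int BLOCK_SIZE = 16;  // tokens stored per physical cache block

struct BlockAllocator {
  std::vector<int> free_blocks;

  explicit BlockAllocator(int num_blocks) {
    for (int i = num_blocks - 1; i >= 0; --i)
      free_blocks.push_back(i);
  }

  int allocate() {
    int block = free_blocks.back();
    free_blocks.pop_back();
    return block;
  }
};

struct Sequence {
  std::vector<int> block_table;  // logical block index -> physical block index
  int length = 0;                // number of tokens cached so far

  // Append one token, allocating a new physical block only when the last one is full.
  void append_token(BlockAllocator& allocator) {
    if (length % BLOCK_SIZE == 0)
      block_table.push_back(allocator.allocate());
    ++length;
  }

  // Translate a logical token position into a flat slot in the paged cache.
  int physical_slot(int token_pos) const {
    int physical_block = block_table[token_pos / BLOCK_SIZE];
    return physical_block * BLOCK_SIZE + token_pos % BLOCK_SIZE;
  }
};

int main() {
  BlockAllocator allocator(/*num_blocks=*/64);
  Sequence seq;

  // Cache 40 tokens: only ceil(40 / 16) = 3 blocks are allocated.
  for (int t = 0; t < 40; ++t)
    seq.append_token(allocator);

  std::cout << "blocks used: " << seq.block_table.size() << "\n";
  std::cout << "slot of token 37: " << seq.physical_slot(37) << "\n";
  return 0;
}
```

The attention kernel would then gather keys and values through this block table instead of reading from one contiguous per-sequence buffer, which is the part that would need to live in the CUDA attention.cc implementation.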

@michaelfeil michaelfeil changed the title [Feature] PagedAttention [Feature] support PagedAttention in cuda attention.cc Jun 23, 2023
@guillaumekln guillaumekln added the enhancement New feature or request label Jul 3, 2023