-
The calculation of QK^T and of the attention scores with V in the attention op are both matrix products. K and V are per-token projections of that layer's input, but each layer's input is the previous layer's attention output, which mixes in information from all earlier tokens; so once a token in the middle differs, the K and V of every following token also differ from the second layer onward, even if the following tokens (and hence Q) are the same. Also, paged attention and prefix caching maintain the KV cache in blocks: if any token in a block differs, the entire block cannot be reused. Please refer to the document for details.
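Here is a minimal sketch (not vLLM's actual hashing code) of the block-reuse rule described above, assuming a hypothetical block size of 4 and toy token IDs: each block's cache key covers the tokens inside the block *and* every token before it, so one changed token invalidates its own block and all blocks after it.

```python
# Toy illustration of block-level prefix caching keys; not vLLM's implementation.
import hashlib

BLOCK_SIZE = 4  # hypothetical block size for illustration

def block_keys(token_ids):
    """Return one cache key per full block, keyed on (prefix tokens + block tokens)."""
    keys = []
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        prefix_and_block = tuple(token_ids[:start + BLOCK_SIZE])
        keys.append(hashlib.sha256(repr(prefix_and_block).encode()).hexdigest()[:12])
    return keys

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   # sentence with "love"
b = [1, 2, 3, 4, 5, 99, 7, 8, 9, 10, 11, 12]  # same sentence, one middle token changed

for i, (ka, kb) in enumerate(zip(block_keys(a), block_keys(b))):
    print(f"block {i}: {'shared' if ka == kb else 'NOT shared'}")
# block 0: shared       (entirely before the changed token)
# block 1: NOT shared   (contains the changed token)
# block 2: NOT shared   (its tokens match, but its prefix differs, so its key differs)
```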
-
Based on this documentation, each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block. So suppose there are two sentences that are identical except for one token in the middle (say "love" in one and "hate" in the other).
Then prefix sharing cannot happen for these two complete sentences. But I don't understand why that's the case. My understanding is that each KV cache entry is generated by multiplying an input $x$ with $U_k$ and $U_v$ in each head. That output doesn't capture any dependency between tokens until the attention scores are computed. So the KV caches for the two sentences above should really only differ at "love" and "hate", and all the other KV cache entries would be the same? So why would a different token in the middle of two similar prefixes prevent prefix sharing from happening? Thanks!
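A minimal numpy sketch (a toy single-head, weight-shared two-layer stack, not vLLM code) can check this reasoning: at the first layer, K = x @ U_k really is a per-token projection, so only the changed position's K/V differ; but each layer's input is the previous layer's causal-attention output, which mixes in earlier tokens, so from the second layer onward the K/V of every position *after* the changed one differ as well.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
U_q, U_k, U_v = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x):
    """Single-head causal attention; returns (output, K, V)."""
    q, k, v = x @ U_q, x @ U_k, x @ U_v
    scores = (q @ k.T) / np.sqrt(d)
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, k, v

tokens_a = rng.standard_normal((6, d))   # toy embeddings for sentence A
tokens_b = tokens_a.copy()
tokens_b[2] = rng.standard_normal(d)     # change one token in the middle

xa, xb = tokens_a, tokens_b
for layer in range(2):
    out_a, k_a, _ = causal_attention(xa)
    out_b, k_b, _ = causal_attention(xb)
    same_k = np.isclose(k_a, k_b).all(axis=-1)
    print(f"layer {layer}: K rows identical per position -> {same_k}")
    xa, xb = out_a, out_b
# layer 0: only position 2 differs (K is a per-token projection of the inputs)
# layer 1: positions 2..5 all differ (their inputs now depend on the changed token)
```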