-
The calculation of QK^T and of the attention scores with V in the attention op are both matrix products. K and V are per-token projections of that layer's input, but each layer's input is the previous layer's attention output, which mixes in information from all earlier tokens; so once a token in the middle differs, the K and V of every following token also differ from the second layer onward, even if the following tokens (and hence Q) are the same. Also, paged attention and prefix caching maintain the KV cache in blocks: if any token in a block differs, the entire block cannot be reused. Please refer to the document for details.
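Here is a minimal sketch (not vLLM's actual hashing code) of the block-reuse rule described above, assuming a hypothetical block size of 4 and toy token IDs: each block's cache key covers the tokens inside the block *and* every token before it, so one changed token invalidates its own block and all blocks after it.

```python
# Toy illustration of block-level prefix caching keys; not vLLM's implementation.
import hashlib

BLOCK_SIZE = 4  # hypothetical block size for illustration

def block_keys(token_ids):
    """Return one cache key per full block, keyed on (prefix tokens + block tokens)."""
    keys = []
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        prefix_and_block = tuple(token_ids[:start + BLOCK_SIZE])
        keys.append(hashlib.sha256(repr(prefix_and_block).encode()).hexdigest()[:12])
    return keys

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   # sentence with "love"
b = [1, 2, 3, 4, 5, 99, 7, 8, 9, 10, 11, 12]  # same sentence, one middle token changed

for i, (ka, kb) in enumerate(zip(block_keys(a), block_keys(b))):
    print(f"block {i}: {'shared' if ka == kb else 'NOT shared'}")
# block 0: shared       (entirely before the changed token)
# block 1: NOT shared   (contains the changed token)
# block 2: NOT shared   (its tokens match, but its prefix differs, so its key differs)
```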
-
Based on this documentation, each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block. So suppose there are two sentences that are identical except for one token in the middle (say "love" in one and "hate" in the other).
Then prefix sharing cannot happen for these two complete sentences. But I don't understand why that's the case. My understanding is that each KV cache entry is generated by multiplying an input $x$ with $U_k$ and $U_v$ in each head. That output doesn't capture any dependency between tokens until the attention scores are computed. So the KV caches for the two sentences above should really only differ at "love" and "hate", and all the other KV cache entries would be the same? So why would a different token in the middle of two similar prefixes prevent prefix sharing from happening? Thanks!
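A minimal numpy sketch (a toy single-head, weight-shared two-layer stack, not vLLM code) can check this reasoning: at the first layer, K = x @ U_k really is a per-token projection, so only the changed position's K/V differ; but each layer's input is the previous layer's causal-attention output, which mixes in earlier tokens, so from the second layer onward the K/V of every position *after* the changed one differ as well.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
U_q, U_k, U_v = (rng.standard_normal((d, d)) for _ in range(3))

def causal_attention(x):
    """Single-head causal attention; returns (output, K, V)."""
    q, k, v = x @ U_q, x @ U_k, x @ U_v
    scores = (q @ k.T) / np.sqrt(d)
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, k, v

tokens_a = rng.standard_normal((6, d))   # toy embeddings for sentence A
tokens_b = tokens_a.copy()
tokens_b[2] = rng.standard_normal(d)     # change one token in the middle

xa, xb = tokens_a, tokens_b
for layer in range(2):
    out_a, k_a, _ = causal_attention(xa)
    out_b, k_b, _ = causal_attention(xb)
    same_k = np.isclose(k_a, k_b).all(axis=-1)
    print(f"layer {layer}: K rows identical per position -> {same_k}")
    xa, xb = out_a, out_b
# layer 0: only position 2 differs (K is a per-token projection of the inputs)
# layer 1: positions 2..5 all differ (their inputs now depend on the changed token)
```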