site/docs/concepts/optimization-techniques/kvcache-eviction-algorithm.md (+5 -1)
@@ -15,7 +15,7 @@ The KV cache for each sequence is divided into three logical areas:
* Start Area: Initial tokens that are never evicted
* Evictable Area: Tokens that can be evicted based on importance scores
-* Recent Area: Most recent tokens that are preserved (never evicted)
+* Recent Area: Most recent tokens that are preserved (not evicted while in this area, but naturally migrating toward the evictable area as the text generation goes on)
The sizes of all three areas can be configured by modifying corresponding fields in a `CacheEvictionConfig` struct, which itself is a part of the pipeline-wide `SchedulerConfig`.
As the generation starts, the blocks in respective logical areas are filled token-by-token, and once at least one block past the "recent" area is filled, eviction may take place.
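
To make the three-area layout concrete, here is a minimal Python sketch of how token positions could be split into the start, evictable, and recent areas as a sequence grows. It is an illustration only, not the library's implementation: `AreaSizes`, its field names, and `split_areas` are hypothetical, and the split is done per token rather than per block to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class AreaSizes:
    # Hypothetical stand-in for the size-related fields of CacheEvictionConfig.
    start_size: int   # tokens at the beginning that are never evicted
    recent_size: int  # most recent tokens that are preserved for now


def split_areas(num_tokens, sizes):
    """Split positions 0..num_tokens-1 into (start, evictable, recent)."""
    positions = list(range(num_tokens))
    start = positions[:sizes.start_size]
    rest = positions[sizes.start_size:]
    # Everything in `rest` that does not fit into the recent area is evictable.
    cut = max(0, len(rest) - sizes.recent_size)
    return start, rest[:cut], rest[cut:]


sizes = AreaSizes(start_size=4, recent_size=8)

# Early in generation nothing has spilled past the recent area yet.
start, evictable, recent = split_areas(10, sizes)
print(len(start), len(evictable), len(recent))

# Later, a token that used to be "recent" has migrated into the evictable area.
start, evictable, recent = split_areas(20, sizes)
print(9 in evictable)
```

In this sketch the token at position 9 sits in the recent area when 10 tokens have been generated and in the evictable area once 20 tokens have been generated, which is the migration described for the recent area.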
@@ -55,4 +55,8 @@ This may impact the ability of the model to correctly recognize the relative pos
Cache rotation seeks to alleviate this by "re-rotating" corresponding blocks so that the blocks that remain after each eviction are once again "continuous" in terms of the effective RoPE embedding.
It can be enabled by setting the `CacheEvictionConfig.apply_rotation` field to `true` (default is `false`).
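
The idea behind "re-rotating" can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the library's code: it models plain RoPE on a single 2D pair (represented as a complex number) with a single frequency `theta`, and the helper names `rope` and `rerotate` are hypothetical. After eviction, each surviving key is rotated by the difference between its new contiguous position and its original position.

```python
import cmath


def rope(pair, position, theta):
    """Apply a RoPE rotation to one 2D pair stored as a complex number."""
    return pair * cmath.exp(1j * position * theta)


def rerotate(cached_keys, kept_positions, theta):
    """Re-rotate surviving keys so their positions become 0, 1, 2, ..."""
    result = []
    for new_pos, (old_pos, key) in enumerate(zip(kept_positions, cached_keys)):
        # Rotating by (new_pos - old_pos) undoes the gap left by evicted tokens.
        result.append(key * cmath.exp(1j * (new_pos - old_pos) * theta))
    return result


theta = 0.1
raw_keys = [complex(i + 1, 0.5 * i) for i in range(8)]
cache = [rope(k, pos, theta) for pos, k in enumerate(raw_keys)]

# Evict positions 2, 3 and 4; the remaining positions are no longer contiguous.
kept_positions = [0, 1, 5, 6, 7]
kept_keys = [cache[p] for p in kept_positions]

rotated = rerotate(kept_keys, kept_positions, theta)

# Each re-rotated key matches the raw key embedded at its new position.
for new_pos, old_pos in enumerate(kept_positions):
    expected = rope(raw_keys[old_pos], new_pos, theta)
    print(abs(rotated[new_pos] - expected) < 1e-9)
```

Because the correction is derived from this one rotation formula, a model that applies positional embeddings differently would need a different correction.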
+## Current limitations
+* Cache rotation is only targeted for the regular, linear LLaMa-like RoPE application and may degrade accuracy on models that use other RoPE schemes.
+* Cache rotation is currently only supported for the models with uniform V embedding sizes across the layers.