Implement arbitrary context caching allocation for llama.cpp #16117
lingyezhixing started this conversation in Ideas
Replies: 1 comment
-
Yes, this is something that should be improved. I have some thoughts on how to support this, though nothing too concrete yet. I will eventually come around to it sometime in the future.
-
Currently, llama.cpp uses a concurrency model that partitions the context evenly across slots. However, a growing number of applications and interactive frameworks, such as Claude Code, are adopting a "long-context execution model + short-context background model" approach. Taking Claude Code as an example: in programming tasks, after each user instruction is executed, the model performs minor background tasks that need no more than about 4k tokens of context. Without concurrency enabled, these small tasks overwrite the previous KV cache, so every new user instruction starts with a full prefill. Given that llama.cpp's prefill performance is already suboptimal, this significantly degrades the user experience.
If two parallel slots are enabled, the current evenly partitioned context model results in severe resource waste: the background slot reserves half of the KV cache even though it only ever needs a few thousand tokens. I therefore believe llama.cpp needs support for unequal (per-slot) context allocation. Unfortunately, I am not proficient in programming myself. If this discussion garners enough attention, I hope someone can take on this task.
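To make the idea more concrete, here is a minimal C++ sketch contrasting today's even split with a user-specified per-slot allocation. This is hypothetical helper code, not actual llama.cpp internals, and the `--parallel-ctx` option mentioned in the comments is invented, not an existing flag.

```cpp
// Illustration only -- hypothetical helpers, not actual llama.cpp code.
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <vector>

struct slot_ctx {
    int32_t id;
    int32_t n_ctx;   // context tokens reserved for this slot
};

// Roughly the current behavior: with -c 32768 -np 2 each slot gets 16384
// tokens, even if one slot only ever runs ~4k-token background tasks.
static std::vector<slot_ctx> partition_even(int32_t n_ctx_total, int32_t n_slots) {
    std::vector<slot_ctx> slots;
    for (int32_t i = 0; i < n_slots; ++i) {
        slots.push_back({ i, n_ctx_total / n_slots });
    }
    return slots;
}

// Proposed alternative: let the user specify each slot's context explicitly,
// e.g. via a (hypothetical) --parallel-ctx 28672,4096 option.
static std::vector<slot_ctx> partition_custom(int32_t n_ctx_total,
                                              const std::vector<int32_t> & per_slot) {
    const int64_t sum = std::accumulate(per_slot.begin(), per_slot.end(), int64_t{0});
    if (sum > n_ctx_total) {
        throw std::runtime_error("per-slot contexts exceed the total KV cache");
    }
    std::vector<slot_ctx> slots;
    for (size_t i = 0; i < per_slot.size(); ++i) {
        slots.push_back({ (int32_t) i, per_slot[i] });
    }
    return slots;
}

int main() {
    // 32k total: 28k for the long-context coding session, 4k for the
    // short background tasks, instead of a wasteful 16k/16k split.
    auto even   = partition_even(32768, 2);
    auto custom = partition_custom(32768, { 28672, 4096 });
    (void) even;
    (void) custom;
    return 0;
}
```

A real implementation would of course have to hook into the server's slot and KV-cache management rather than a standalone helper, but the user-facing idea is just this: keep the total cache size, stop splitting it evenly.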