Implement arbitrary context caching allocation for llama.cpp #16117
lingyezhixing started this conversation in Ideas
Replies: 1 comment
-
Yes, this is something that should be improved. I have some thoughts on how to support this, though nothing too concrete yet. I will eventually come around to it sometime in the future.
-
Currently, llama.cpp uses a concurrency model that partitions the context evenly across slots. However, a growing number of applications and interactive frameworks, such as Claude Code, are adopting a "long-context execution model + short-context background model" approach. Taking Claude Code as an example: in programming tasks, after each user instruction is executed, the model performs minor background tasks that need no more than about 4k tokens of context. Without concurrency enabled, these small tasks overwrite the previous KV cache, so every new user instruction starts with a full prefill. Given that llama.cpp's prefill performance is already suboptimal, this significantly degrades the user experience.
If two parallel slots are enabled, the current evenly partitioned context model results in severe resource waste: the background slot reserves half of the KV cache even though it only ever needs a few thousand tokens. I therefore believe llama.cpp needs support for unequal (per-slot) context allocation. Unfortunately, I am not proficient in programming myself. If this discussion garners enough attention, I hope someone can take on this task.
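To make the idea more concrete, here is a minimal C++ sketch contrasting today's even split with a user-specified per-slot allocation. This is hypothetical helper code, not actual llama.cpp internals, and the `--parallel-ctx` option mentioned in the comments is invented, not an existing flag.

```cpp
// Illustration only -- hypothetical helpers, not actual llama.cpp code.
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <vector>

struct slot_ctx {
    int32_t id;
    int32_t n_ctx;   // context tokens reserved for this slot
};

// Roughly the current behavior: with -c 32768 -np 2 each slot gets 16384
// tokens, even if one slot only ever runs ~4k-token background tasks.
static std::vector<slot_ctx> partition_even(int32_t n_ctx_total, int32_t n_slots) {
    std::vector<slot_ctx> slots;
    for (int32_t i = 0; i < n_slots; ++i) {
        slots.push_back({ i, n_ctx_total / n_slots });
    }
    return slots;
}

// Proposed alternative: let the user specify each slot's context explicitly,
// e.g. via a (hypothetical) --parallel-ctx 28672,4096 option.
static std::vector<slot_ctx> partition_custom(int32_t n_ctx_total,
                                              const std::vector<int32_t> & per_slot) {
    const int64_t sum = std::accumulate(per_slot.begin(), per_slot.end(), int64_t{0});
    if (sum > n_ctx_total) {
        throw std::runtime_error("per-slot contexts exceed the total KV cache");
    }
    std::vector<slot_ctx> slots;
    for (size_t i = 0; i < per_slot.size(); ++i) {
        slots.push_back({ (int32_t) i, per_slot[i] });
    }
    return slots;
}

int main() {
    // 32k total: 28k for the long-context coding session, 4k for the
    // short background tasks, instead of a wasteful 16k/16k split.
    auto even   = partition_even(32768, 2);
    auto custom = partition_custom(32768, { 28672, 4096 });
    (void) even;
    (void) custom;
    return 0;
}
```

A real implementation would of course have to hook into the server's slot and KV-cache management rather than a standalone helper, but the user-facing idea is just this: keep the total cache size, stop splitting it evenly.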