rally to refactor the disk prefix caching #1420
magikRUKKOLA
started this conversation in
Ideas
First of all, I would like to express my gratitude towards the authors of the KTransformers framework.
But...
Apparently we are having issues with prefix caching...
The problem is that it's really hard to navigate the snapshots of the KV cache.
New data gets appended (! WTF?!!!) to a binary file.
Not a text file, not a JSON file, not an XML file, but a binary file.
I can understand the benefits of a binary format.
Efficiency, of course.
But...
If you are advocating for efficiency...
It seems quite logical to also provide tools to manage this binary format.
Unfortunately, right now, management of the KV cache snapshots, which are produced by the prefill cache module of the balance_serve backend of KTransformers, is NONEXISTENT! Whether the async query manager was or wasn't able to find some prefix is all the user gets to see. This is not what we want. We want full control!
So...
I think I'm speaking for the whole community here.
And it basically means that we need clear documentation of the KV cache format...
And an additional API to save (or restore) the state of the LLM.
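To make the request a bit more concrete, here is a rough sketch of what a documented, inspectable append-only snapshot file could look like. This is purely illustrative: the record layout, the `KVSN` magic, and the `append_snapshot` / `list_snapshots` names are my own assumptions, not the actual KTransformers on-disk format or API. The point is only that a binary file can stay efficient and still be listable if every record carries a small header.

```python
import hashlib
import struct

# Hypothetical record header, NOT the real KTransformers layout:
# magic (4 bytes), version (u16), token count (u32),
# payload length (u64), SHA-256 of the token prefix (32 bytes).
HEADER = struct.Struct("<4sHIQ32s")
MAGIC = b"KVSN"
VERSION = 1


def prefix_hash(token_ids):
    """Hash a token prefix so a snapshot can be identified without reading its payload."""
    raw = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(raw).digest()


def append_snapshot(path, token_ids, payload):
    """Append one self-describing record (header + opaque KV bytes) to the file."""
    header = HEADER.pack(
        MAGIC, VERSION, len(token_ids), len(payload), prefix_hash(token_ids)
    )
    with open(path, "ab") as f:
        f.write(header)
        f.write(payload)


def list_snapshots(path):
    """Walk the file header by header, skipping payloads, and return an index."""
    index = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            raw = f.read(HEADER.size)
            if not raw:
                break  # clean end of file
            if len(raw) < HEADER.size:
                raise ValueError(f"truncated header at offset {offset}")
            magic, version, n_tokens, size, digest = HEADER.unpack(raw)
            if magic != MAGIC:
                raise ValueError(f"bad magic at offset {offset}")
            index.append(
                {
                    "offset": offset,
                    "version": version,
                    "tokens": n_tokens,
                    "payload_bytes": size,
                    "prefix_sha256": digest.hex(),
                }
            )
            f.seek(size, 1)  # skip the payload without loading it
    return index
```

With something like this, a user could call `list_snapshots("cache.bin")` and see which prefixes are stored, how many tokens each covers, and where each one sits in the file, instead of guessing from a hit-or-miss log line.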
Thank you for your attention.
train noise