The problem

Both the `async_rag` and `rag` functions add the system prompt after the list of user/assistant messages. This prevents leveraging prompt caching in OpenAI models, and it may be an out-of-distribution (OOD) chat format.
Naive Solution

Move the system prompt to the top. This doesn't completely solve the KV-cache eviction problem, though, since the `system_prompt` contains the retrieved contexts, which vary with every query.
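A sketch of the naive fix (again with assumed names): the whole `system_prompt`, retrieved contexts included, moves to position zero, so a different retrieval result still changes the prefix from the first token onward:

```python
# Hypothetical sketch of the naive fix (names assumed).
def build_messages(system_prompt, history):
    # The full prompt (instructions + contexts) now leads the conversation;
    # because the contexts change per query, the stable cacheable prefix
    # is still effectively empty.
    return [{"role": "system", "content": system_prompt}] + history

history = [{"role": "user", "content": "What is RAG?"}]
messages = build_messages("Instructions... Context: <retrieved passages>", history)
assert [m["role"] for m in messages] == ["system", "user"]
```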
A (possible) better solution

We can split the `system_prompt` into two parts: the instructions part and the context part. By doing so, we can move the context part closer to the user message and preserve the model's KV-cache.

Note: it is possible that not all providers allow multiple system messages.
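A sketch of the split (function name, roles, and message shapes are assumptions): the static instructions form a stable, cacheable prefix, and only the per-query context message sits next to the current user turn:

```python
# Hypothetical sketch of the split system prompt (names assumed).
def build_messages(instructions, history, contexts, question):
    return (
        [{"role": "system", "content": instructions}]  # stable, cacheable prefix
        + history                                      # prior user/assistant turns
        + [
            {"role": "system", "content": contexts},   # per-query retrieved context
            {"role": "user", "content": question},     # current query
        ]
    )

messages = build_messages(
    instructions="Answer using only the provided context.",
    history=[
        {"role": "user", "content": "What is RAG?"},
        {"role": "assistant", "content": "Retrieval-augmented generation."},
    ],
    contexts="Context: <retrieved passages>",
    question="How does it affect KV-cache reuse?",
)
assert [m["role"] for m in messages] == ["system", "user", "assistant", "system", "user"]
```

With this layout, only the last two messages change between queries over the same conversation, so the instructions and prior turns can stay cached. If a provider rejects a second system message, putting the contexts in a user message is a possible fallback.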