Order of the system message may prevent caching #49

Closed
undo76 opened this issue Nov 23, 2024 · 2 comments

undo76 (Contributor) commented Nov 23, 2024

The problem

Both the async_rag and rag functions add the system prompt after the list of user/assistant messages. This prevents OpenAI models from leveraging prompt caching, and it may also be an out-of-distribution (OOD) chat format.

    async_stream = await acompletion(
        model=config.llm,
        messages=[
            *(messages or []),  # chat history
            {"role": "system", "content": system_prompt},  # system prompt (with retrieved context) placed after the history
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )

Naive Solution

Move the system prompt to the top. This doesn't completely solve the KV-cache eviction problem, though, because the system_prompt contains the retrieved contexts, which vary with every query.

    async_stream = await acompletion(
        model=config.llm,
        messages=[
            {"role": "system", "content": system_prompt},  # moved to the top, but still contains the per-query context
            *(messages or []),
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )

A (possible) better solution

We can split the system_prompt into two parts: the instructions and the retrieved context. By doing so, we can move the context closer to the user message and preserve the model's KV-cache for the instructions and the chat history.

    async_stream = await acompletion(
        model=config.llm,
        messages=[
            {"role": "system", "content": system_prompt},  # static instructions
            *(messages or []),  # chat history
            {"role": "system", "content": context_content},  # per-query retrieved context
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )

Note: It is possible that not all providers allow multiple system messages.
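
If a provider rejects multiple system messages, one fallback is to fold the second system message into the user message that follows it. A minimal sketch, assuming a hypothetical merge_extra_system_messages helper (not part of this repo):

    def merge_extra_system_messages(messages: list[dict]) -> list[dict]:
        # Hypothetical fallback: keep only the first system message and fold any
        # later system message (e.g. the retrieved context) into the next user message.
        merged: list[dict] = []
        pending: str | None = None
        for message in messages:
            if message["role"] == "system" and merged:
                pending = message["content"]
            elif pending is not None and message["role"] == "user":
                merged.append({"role": "user", "content": f"{pending}\n\n{message['content']}"})
                pending = None
            else:
                merged.append(message)
        return merged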

lsorber (Member) commented Nov 25, 2024

Thanks for submitting this issue @undo76!

Could you explain how OpenAI applies caching exactly? Or do you have a reference where I can read up on this?

What would you think about:

        messages=[
            # Static system prompt
            {"role": "system", "content": system_prompt},
            # Message history
            *(messages or []),
            # Modified user prompt that contains the user's question and the RAG context
            {"role": "user", "content": modified_user_prompt},
        ],

Should we put the RAG instructions in the system_prompt, or closer to the query and RAG context in modified_user_prompt?
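
For reference, a minimal sketch of how modified_user_prompt could be assembled from the retrieved context and the user's question (the delimiters below are illustrative assumptions, not a fixed format):

    # Illustrative only: append the retrieved context to the user's question so that
    # everything before this message (system prompt + history) stays a stable,
    # cacheable prefix.
    modified_user_prompt = f"Context:\n{context_content}\n\nQuestion:\n{prompt}"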

lsorber (Member) commented Dec 3, 2024

The v0.3.0 release resulting from #52 fixes this. In a nutshell, we have implemented a prompt caching-aware message array structure.
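
As a rough illustration of the principle discussed in this thread (not the actual code introduced by #52), the idea is to keep the static parts of the prompt as an unchanged prefix and attach only the per-query context at the end:

    cache_aware_messages = [
        {"role": "system", "content": system_prompt},  # static instructions: identical across queries
        *(messages or []),  # chat history: grows, but its prefix stays unchanged
        {"role": "user", "content": f"{context_content}\n\n{prompt}"},  # per-query context + question
    ]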

lsorber closed this as completed Dec 3, 2024