Order of the system message may prevent caching #49

Closed
undo76 opened this issue Nov 23, 2024 · 2 comments

undo76 (Contributor) commented Nov 23, 2024

The problem

Both the async_rag and rag functions add the system prompt after the list of user/assistant messages. This prevents OpenAI models from leveraging prompt caching, and it may also be an out-of-distribution (OOD) chat format.

    async_stream = await acompletion(
        model=config.llm,
        messages=[
            *(messages or []),  # chat history
            {"role": "system", "content": system_prompt},  # system prompt (with retrieved context) placed after the history
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )

Naive Solution

Move the system prompt to the top. This doesn't completely solve the KV-cache eviction problem, though, because the system_prompt contains the retrieved contexts, which vary with every query.

    async_stream = await acompletion(
        model=config.llm,
        messages=[
            {"role": "system", "content": system_prompt},  # moved to the top, but still contains the per-query context
            *(messages or []),
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )

A (possible) better solution

We can split the system_prompt into two parts: the instructions and the retrieved context. By doing so, we can move the context closer to the user message and preserve the model's KV-cache for the instructions and the chat history.

    async_stream = await acompletion(
        model=config.llm,
        messages=[
            {"role": "system", "content": system_prompt},  # static instructions
            *(messages or []),  # chat history
            {"role": "system", "content": context_content},  # per-query retrieved context
            {"role": "user", "content": prompt},
        ],
        stream=True,
    )

Note: It is possible that not all providers allow multiple system messages.
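
If a provider rejects multiple system messages, one fallback is to fold the second system message into the user message that follows it. A minimal sketch, assuming a hypothetical merge_extra_system_messages helper (not part of this repo):

    def merge_extra_system_messages(messages: list[dict]) -> list[dict]:
        # Hypothetical fallback: keep only the first system message and fold any
        # later system message (e.g. the retrieved context) into the next user message.
        merged: list[dict] = []
        pending: str | None = None
        for message in messages:
            if message["role"] == "system" and merged:
                pending = message["content"]
            elif pending is not None and message["role"] == "user":
                merged.append({"role": "user", "content": f"{pending}\n\n{message['content']}"})
                pending = None
            else:
                merged.append(message)
        return merged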

lsorber (Member) commented Nov 25, 2024

Thanks for submitting this issue @undo76!

Could you explain how OpenAI applies caching exactly? Or do you have a reference where I can read up on this?

What would you think about:

        messages=[
            # Static system prompt
            {"role": "system", "content": system_prompt},
            # Message history
            *(messages or []),
            # Modified user prompt that contains the user's question and the RAG context
            {"role": "user", "content": modified_user_prompt},
        ],

Should we put the RAG instructions in the system_prompt, or closer to the query and RAG context in modified_user_prompt?
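
For reference, a minimal sketch of how modified_user_prompt could be assembled from the retrieved context and the user's question (the delimiters below are illustrative assumptions, not a fixed format):

    # Illustrative only: append the retrieved context to the user's question so that
    # everything before this message (system prompt + history) stays a stable,
    # cacheable prefix.
    modified_user_prompt = f"Context:\n{context_content}\n\nQuestion:\n{prompt}"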

lsorber (Member) commented Dec 3, 2024

The v0.3.0 release resulting from #52 fixes this. In a nutshell, we have implemented a prompt caching-aware message array structure.
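
As a rough illustration of the principle discussed in this thread (not the actual code introduced by #52), the idea is to keep the static parts of the prompt as an unchanged prefix and attach only the per-query context at the end:

    cache_aware_messages = [
        {"role": "system", "content": system_prompt},  # static instructions: identical across queries
        *(messages or []),  # chat history: grows, but its prefix stays unchanged
        {"role": "user", "content": f"{context_content}\n\n{prompt}"},  # per-query context + question
    ]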

lsorber closed this as completed Dec 3, 2024