
[Bug]: Incorrect Output in ReAct Mode of LlamaIndex Chat Engine #17322

Open
whisper-bye opened this issue Dec 19, 2024 · 8 comments
Labels: bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments
whisper-bye commented Dec 19, 2024

Bug Description

When using the ReAct mode of the LlamaIndex chat engine, the output contains duplicated and extra characters that are not expected.

Version

0.12.5-0.12.7

Steps to Reproduce

from llama_index.core.llms import ChatMessage

# The last message is the new user input; everything before it becomes the chat history.
message = messages[-1]["content"]
chat_history = [
    ChatMessage(role=m["role"], content=m["content"])
    for m in messages[:-1]
]

chat_engine = self.index.as_chat_engine()
streaming_response = await chat_engine.astream_chat(
    message=message, chat_history=chat_history
)

async for token in streaming_response.async_response_gen():
    print(token)
1. Set up set_global_handler("simple").
2. Send some questions.
3. Then send the message 你好 ("Hello").
4. The expected log output:
   你好!请问你需要了解哪些信息呢?我可以帮助你查询相关信息。
   ("Hello! What information would you like to know? I can help you look up the relevant information.")
5. The actual output in ReAct mode:
   你好!请问你好!请问你需要了解哪些信息呢?我可以帮助你查询相关信息。
   (note the duplicated 你好!请问 prefix)

All other ChatModes work fine.
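For reference, a minimal sketch of the setup in the steps above with the ReAct mode selected explicitly (Document.example() stands in for a real corpus; swap in your own index and LLM):

from llama_index.core import VectorStoreIndex, Document, set_global_handler
from llama_index.core.chat_engine.types import ChatMode

set_global_handler("simple")  # step 1: log the raw LLM prompts and responses

index = VectorStoreIndex.from_documents([Document.example()])
chat_engine = index.as_chat_engine(chat_mode=ChatMode.REACT, verbose=True)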

Relevant Logs/Tracebacks


dosubot bot commented Dec 19, 2024

The issue you're experiencing with duplicated and extra characters in the ReAct mode of the LlamaIndex chat engine could be related to how the astream_chat method processes messages and generates responses. The astream_chat method uses an asynchronous generator to yield response tokens incrementally, which might lead to duplication if the message handling or memory update logic is not correctly implemented [1].

Here are a few things you might want to check or try:

  1. Message Handling: Ensure that the chat history and message handling logic correctly manage the state and do not inadvertently duplicate messages. The chat history should be accurately set and updated to prevent repeated processing of the same message.

  2. Response Generation: Verify that the response synthesizer and the asynchronous generator (async_response_gen) are correctly configured to handle streaming responses without duplicating tokens. The generator should yield each token only once [2].

  3. Version-Specific Changes: Since the issue occurs between versions 0.12.5 and 0.12.7, reviewing the changelogs or commits for these versions might provide insights into any changes that could affect message processing or response generation.

If these steps do not resolve the issue, further investigation into the specific implementation details of the ReAct mode in these versions might be necessary.



whisper-bye commented Dec 19, 2024

@logan-markewich

I think I roughly understand the issue. Since 你好 ("Hello") is a simple message, the LLM didn't follow the ReAct prompt template but directly output the result: 你好!请问你需要了解哪些信息呢?我可以帮助你查询相关信息。

At https://github.com/run-llama/llama_index/blob/v0.12.6/llama-index-core/llama_index/core/agent/react/step.py#L750:

1. In the OpenAI-protocol-compatible streaming output, the content of the final chunk is empty.
2. latest_chunk and missed_chunks_storage are modified in _infer_stream_chunk_is_final:
   missed_chunks_storage is ['你好', '!', '请问']
   latest_chunk is '你好!请问。。。' (it already starts with the same text as the missed chunks)

Eventually the result is reassembled from both, leading to the strange output described in this issue:

# add back the chunks that were missed
response_stream = self._async_add_back_chunk_to_stream(
    chunks=[*missed_chunks_storage, latest_chunk], chat_stream=chat_stream
)
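To make the failure mode concrete, here is a small standalone sketch of that re-assembly (the values extend the truncated latest_chunk from the observation above to the full sentence; this is only an illustration, not the library code):

# Deltas buffered while the agent decides whether the LLM followed the ReAct format,
# plus the latest chunk, which already accumulates that same prefix.
missed_chunks_storage = ["你好", "!", "请问"]
latest_chunk = "你好!请问你需要了解哪些信息呢?我可以帮助你查询相关信息。"

# Yielding the missed chunks first and then the accumulated latest chunk
# repeats the buffered prefix once.
reassembled = "".join(missed_chunks_storage) + latest_chunk
print(reassembled)
# -> 你好!请问你好!请问你需要了解哪些信息呢?我可以帮助你查询相关信息。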

@Akash-Kumar-Sen

Hello @whisper-bye
I have been trying to reproduce the issue mentioned above.

The test I have been using is as follows:

from llama_index.core import VectorStoreIndex, Document
from llama_index.core.chat_engine.types import ChatMode
from llama_index.core.llms import ChatMessage, MessageRole


async def test_react_chat_agent():
    index = VectorStoreIndex.from_documents(
        [Document.example()]
    )

    chat_history = [
        ChatMessage(role=MessageRole.USER, content="What is LlamaIndex?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="LlamaIndex is a 'data framework' to help you build LLM apps. It provides tools for data ingestion, structuring, and advanced retrieval/query interfaces."),
        
        ChatMessage(role=MessageRole.USER, content="How does LlamaIndex augment LLMs with private data?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="LlamaIndex offers data connectors to ingest your existing data sources and formats, and provides ways to structure your data so it can be easily used with LLMs."),
        
        ChatMessage(role=MessageRole.USER, content="What kind of data sources can LlamaIndex ingest?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="LlamaIndex can ingest data from APIs, PDFs, docs, SQL, and other formats."),
        
        ChatMessage(role=MessageRole.USER, content="Can LlamaIndex be integrated with other application frameworks?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="Yes, LlamaIndex allows easy integrations with various application frameworks like LangChain, Flask, Docker, ChatGPT, and more."),
        
        ChatMessage(role=MessageRole.USER, content="Is LlamaIndex suitable for both beginners and advanced users?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="Yes, LlamaIndex provides tools for both beginner and advanced users. Beginners can use the high-level API to ingest and query data in 5 lines of code, while advanced users can customize and extend any module to fit their needs.")
    ]

    chat_engine = index.as_chat_engine(chat_mode=ChatMode.REACT, verbose=True)
    message = "What is an llm"
    response = chat_engine.chat(message, chat_history=chat_history)

    print("----------------------   Response   ----------------------")
    print(response)
    print("----------------------------------------------------------")

    streaming_response = await chat_engine.astream_chat(
        message=message, chat_history=chat_history
    )

    print("----------------------   Streaming response   ----------------------")
    async for token in streaming_response.async_response_gen():
        print(token)
    print("-------------------------------------------------------------------")

The embedding model and LLM I have used are as follows:

>>> import os
>>> import asyncio
>>> from llama_index.core import Settings
>>> from llama_index.llms.openai import OpenAI
>>> from llama_index.embeddings.openai import OpenAIEmbedding
>>> Settings.llm = OpenAI(
...         model="gpt-4o-mini",
...         api_key=os.environ["OPENAI_API_KEY"]
...     )
>>> Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

And the response is as follows:

>>> asyncio.run(test_react_chat_agent())
Added user message to memory: What is an llm
----------------------   Response   ----------------------
An LLM, or Large Language Model, is a type of artificial intelligence model designed to understand, generate, and manipulate human language. These models are trained on vast amounts of text data and can perform a variety of language-related tasks, such as translation, summarization, question answering, and text generation. Examples of LLMs include OpenAI's GPT series, Google's BERT, and others.
----------------------------------------------------------
Added user message to memory: What is an llm
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"What is an LLM?"}
Got output: An LLM is a large language model that is a powerful technology for knowledge generation and reasoning. It is pre-trained on extensive amounts of publicly available data, enabling it to understand and generate human-like text.
========================

----------------------   Streaming response   ----------------------
An
 L
LM
,
 or
 Large
 Language
 Model
,
 is
 a
 powerful
 technology
 for
 knowledge
 generation
 and
 reasoning
.
 It
 is
 pre
-trained
 on
 extensive
 amounts
 of
 publicly
 available
 data
,
 enabling
 it
 to
 understand
 and
 generate
 human
-like
 text
.
-------------------------------------------------------------------
>>> 

Since I could not reproduce this issue, would you please point out any reproduction steps I may have missed?

@whisper-bye

@Akash-Kumar-Sen
This may be related to using an OpenAI-compatible API (DeepSeek) together with Chinese input messages.
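For anyone trying to reproduce with a similar setup, a hedged sketch of pointing LlamaIndex at an OpenAI-compatible endpoint (assumes the llama-index-llms-openai-like package is installed; the base URL and model name here are illustrative placeholders, not confirmed values from this report):

import os

from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike  # pip install llama-index-llms-openai-like

# Illustrative placeholders -- substitute your provider's actual base URL and model name.
Settings.llm = OpenAILike(
    model="deepseek-chat",
    api_base="https://api.deepseek.com/v1",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    is_chat_model=True,
)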


arunpkm commented Jan 22, 2025

Any update on this issue? I am getting this repeating-word issue often. I am using the OpenAI gpt-4o model. For example:

CouldCould you please specify the name of the company you are referring to? This will help me provide a more accurate response.


LiamK commented Feb 15, 2025

I am also experiencing this issue, similar to @arunpkm. I'm using llama-index==0.12.12. There is no duplication when not streaming, only when streaming. I have not tested the async version.

# llm, tools, memory, and system_context are defined elsewhere in my application.
agent = ReActAgent.from_tools(
    llm=llm,
    tools=tools,
    memory=memory,
    verbose=True,
    context=system_context,
)
agent.chat(query)  # works

# causes duplication
for chunk in agent.stream_chat(query).response_gen:
    yield chunk

Usually it seems to duplicate the first word/token. For example,
"TheThe answer to your question is..." or
"II don't know the answer to your question."
There is also consistently a space character preceding the response when streaming. I'm guessing that may be related.

I've noticed the issue with both gpt-4o and mistral-latest-large.

Would really appreciate someone taking a look at the problem!
I'll be happy to test it out, but I'm sure there's someone more capable at digging into LlamaIndex streaming internals than I.


LiamK commented Feb 20, 2025

I did a bit of investigating, and although I didn't fix it, I did find where the duplication occurs. Here is a test program. Note that it doesn't happen every time, and I can't figure out what is different; I'm guessing the streaming response from the LLM sometimes varies slightly, and that's what triggers it. I put some debugging statements in ReActAgentWorker._add_back_chunk_to_stream() and have attached a file with more information about that.
I hope this will help someone who knows more about the streaming implementation to fix it!

#!/usr/bin/env python3
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI
from llama_index.llms.mistralai import MistralAI
from llama_index.core.tools import FunctionTool


def search_performers(query_string: str) -> str:
    """
    Search for a performer by name with query_string, and return their biographical data
    and their upcoming performances, if any, with dates. If query_string is empty, return list of all
    performers who have upcoming events.
    """
    return """ID: 1
Name: Joe Generico
Description: Joe Generico is a very funny guy!
"""

def create_agent():
    _tools = [
        FunctionTool.from_defaults(fn=search_performers)
    ]

    #_llm = OpenAI(model="gpt-4o")
    _llm = MistralAI(model='mistral-large-latest')

    agent = ReActAgent.from_tools(
        _tools,
        llm=_llm,
        verbose=True,
        context='Act as an expert at comically roasting comedians.'
    )
    return agent

def test_react_streaming():
    agent = create_agent()
    response = agent.stream_chat("Please roast Joe Generico, mercilessly!")
    for token in response.response_gen:
        print(f'{token}', end="")
    print()

if __name__ == '__main__':
    test_react_streaming()

This sometimes returns a streaming response that starts with "JoeJoe Generico..." or "OhOh, Joe Generico...", but sometimes it works fine.

duplication_debugging.txt


Afaneor commented Mar 20, 2025

I've been experiencing the same duplication issue and have traced the root cause. As @whisper-bye correctly identified, the problem occurs when the LLM doesn't follow the ReAct format and directly outputs a response.

The bug is in the _async_add_back_chunk_to_stream method where missed chunks are added back to the stream. When an LLM doesn't follow the ReAct format (common with simple prompts or non-English inputs), the missed_chunks_storage contains the initial tokens that are then duplicated when combined with latest_chunk.

Here's my fix that works by checking if the text from previous chunks is already contained in the latest chunk:

async def _async_add_back_chunk_to_stream(
    self,
    chunks: List[ChatResponse],
    chat_stream: AsyncGenerator[ChatResponse, None],
) -> AsyncGenerator[ChatResponse, None]:
    """Add back chunks to stream asynchronously."""
    if chunks and len(chunks) > 1:
        # Get the last chunk
        last_chunk = chunks[-1]
        
        # Collect text from all previous chunks
        prev_chunks_text = ''
        for i in range(len(chunks) - 1):
            prev_chunks_text += chunks[i].delta or ''
        
        # Check if the text from previous chunks is contained in the last chunk
        last_chunk_text = last_chunk.delta or ''
        
        if prev_chunks_text and prev_chunks_text in last_chunk_text:
            # Duplication detected - return only the last chunk
            yield last_chunk
        else:
            # No duplication - return all chunks
            for chunk in chunks:
                yield chunk
    else:
        # If there's only one chunk or no chunks, just return them
        for chunk in chunks:
            yield chunk
    
    # Continue with the stream
    async for chunk in chat_stream:
        yield chunk

This method replaces _async_add_back_chunk_to_stream in the ReActAgentWorker class.

This solution solves the duplication issue by:

  1. Collecting text from all chunks except the last one
  2. Checking if this text is contained in the last chunk (which indicates duplication)
  3. If duplication is detected, only yielding the last chunk
  4. Otherwise, yielding all chunks

I've tested this with Russian-language input and a GPT model, and it successfully prevents the duplication behavior.
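Until a fix lands upstream, one way to try a change like this without forking the library is to monkey-patch the worker class. A hedged sketch (the import path matches the step.py file linked earlier; patched_async_add_back_chunk_to_stream is a placeholder name for the corrected coroutine above, and this assumes the method name still matches your installed llama-index-core version):

from llama_index.core.agent.react.step import ReActAgentWorker

# patched_async_add_back_chunk_to_stream is the corrected coroutine from this comment,
# defined at module level with the same (self, chunks, chat_stream) signature.
ReActAgentWorker._async_add_back_chunk_to_stream = patched_async_add_back_chunk_to_stream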
