[Bug]: Incorrect Output in ReAct Mode of LlamaIndex Chat Engine #17322
Comments
The issue you're experiencing with duplicated and extra characters in the ReAct mode of the LlamaIndex chat engine could be related to how the streaming chunks are reassembled. Here are a few things you might want to check or try:

If these steps do not resolve the issue, further investigation into the specific implementation details of the ReAct mode in these versions might be necessary.
I think I roughly understand the issue. Since the LLM did not follow the ReAct format and directly produced a response, some chunks were buffered as "missed" while the output format was being checked. Eventually, the result was reassembled, leading to the strange output described in my issue:

```python
# add back the chunks that were missed
response_stream = self._async_add_back_chunk_to_stream(
    chunks=[*missed_chunks_storage, latest_chunk], chat_stream=chat_stream
)
```
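A minimal sketch (not LlamaIndex source; `Chunk` is a simplified stand-in for `ChatResponse`) of why replaying the buffered chunks can duplicate text when the provider's latest delta already contains them:

```python
# Minimal sketch, not LlamaIndex source: Chunk stands in for ChatResponse.
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class Chunk:
    delta: str


async def provider_stream() -> AsyncGenerator[Chunk, None]:
    # Hypothetical provider whose second delta repeats the first token.
    for delta in ["An", "An LLM, or Large Language Model, ..."]:
        yield Chunk(delta)


async def add_back(missed: List[Chunk], stream: AsyncGenerator[Chunk, None]):
    # Naive re-assembly, analogous to the add-back call above: every
    # buffered chunk is replayed before the live stream continues.
    for chunk in missed:
        yield chunk
    async for chunk in stream:
        yield chunk


async def main() -> None:
    stream = provider_stream()
    missed = [await stream.__anext__()]   # buffered while format-checking
    latest = await stream.__anext__()     # its delta already contains "An"
    async for chunk in add_back([*missed, latest], stream):
        print(chunk.delta, end="")
    print()  # prints "AnAn LLM, or Large Language Model, ..." -- duplicated


asyncio.run(main())
```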
Hello @whisper-bye, the test I have been using is as follows:

```python
import asyncio

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.chat_engine.types import ChatMode
from llama_index.core.llms import ChatMessage, MessageRole


async def test_react_chat_agent():
    index = VectorStoreIndex.from_documents([Document.example()])
    chat_history = [
        ChatMessage(role=MessageRole.USER, content="What is LlamaIndex?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="LlamaIndex is a 'data framework' to help you build LLM apps. It provides tools for data ingestion, structuring, and advanced retrieval/query interfaces."),
        ChatMessage(role=MessageRole.USER, content="How does LlamaIndex augment LLMs with private data?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="LlamaIndex offers data connectors to ingest your existing data sources and formats, and provides ways to structure your data so it can be easily used with LLMs."),
        ChatMessage(role=MessageRole.USER, content="What kind of data sources can LlamaIndex ingest?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="LlamaIndex can ingest data from APIs, PDFs, docs, SQL, and other formats."),
        ChatMessage(role=MessageRole.USER, content="Can LlamaIndex be integrated with other application frameworks?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="Yes, LlamaIndex allows easy integrations with various application frameworks like LangChain, Flask, Docker, ChatGPT, and more."),
        ChatMessage(role=MessageRole.USER, content="Is LlamaIndex suitable for both beginners and advanced users?"),
        ChatMessage(role=MessageRole.ASSISTANT, content="Yes, LlamaIndex provides tools for both beginner and advanced users. Beginners can use the high-level API to ingest and query data in 5 lines of code, while advanced users can customize and extend any module to fit their needs."),
    ]
    chat_engine = index.as_chat_engine(chat_mode=ChatMode.REACT, verbose=True)
    message = "What is an llm"
    response = chat_engine.chat(message, chat_history=chat_history)
    print("---------------------- Response ----------------------")
    print(response)
    print("----------------------------------------------------------")
    streaming_response = await chat_engine.astream_chat(
        message=message, chat_history=chat_history
    )
    print("---------------------- Streaming response ----------------------")
    async for token in streaming_response.async_response_gen():
        print(token)
    print("-------------------------------------------------------------------")
```

The embedding model and LLM I have used are as follows -
And the response is as follows:

```
>>> asyncio.run(test_react_chat_agent())
Added user message to memory: What is an llm
---------------------- Response ----------------------
An LLM, or Large Language Model, is a type of artificial intelligence model designed to understand, generate, and manipulate human language. These models are trained on vast amounts of text data and can perform a variety of language-related tasks, such as translation, summarization, question answering, and text generation. Examples of LLMs include OpenAI's GPT series, Google's BERT, and others.
----------------------------------------------------------
Added user message to memory: What is an llm
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"What is an LLM?"}
Got output: An LLM is a large language model that is a powerful technology for knowledge generation and reasoning. It is pre-trained on extensive amounts of publicly available data, enabling it to understand and generate human-like text.
========================
---------------------- Streaming response ----------------------
An
L
LM
,
or
Large
Language
Model
,
is
a
powerful
technology
for
knowledge
generation
and
reasoning
.
It
is
pre
-trained
on
extensive
amounts
of
publicly
available
data
,
enabling
it
to
understand
and
generate
human
-like
text
.
-------------------------------------------------------------------
>>>
```

As it seems I could not reproduce this issue, could you please point out any reproduction steps that I may have missed?
@Akash-Kumar-Sen Any update on this issue?
I am also experiencing this issue, similar to @arunpkm. Usually it seems to duplicate the first word/token. I've noticed the issue with both gpt-4o and mistral-large-latest. Would really appreciate someone taking a look at the problem!
I did a bit of investigating, and although I didn't fix it, I did find where the duplication is occurring. Note that it doesn't happen every time, and I can't figure out what is different; I'm guessing the streaming response from the LLM sometimes varies slightly, and that's what triggers it. In my test program I put some debugging statements in ReActAgentWorker._add_back_chunk_to_stream(), and I've attached a file with more information about that.

The test sometimes returns a streaming response that starts with "JoeJoe Generico..." or "OhOh, Joe Generico..."
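One way to add that kind of instrumentation without editing the installed package is to monkey-patch the method at runtime. This is a sketch under two assumptions: that `ReActAgentWorker` is importable from `llama_index.core.agent` in the 0.12.x line, and that the sync method takes `chunks` and `chat_stream` like the async variant quoted in the next comment:

```python
# Sketch: log each chunk's delta as it is added back. Assumes the
# llama-index 0.12.x import path and that the sync method mirrors the
# async signature (chunks, chat_stream).
from llama_index.core.agent import ReActAgentWorker

_original = ReActAgentWorker._add_back_chunk_to_stream


def _logged_add_back(self, chunks, chat_stream):
    for chunk in chunks:
        print(f"[add-back] delta={chunk.delta!r}")
    return _original(self, chunks=chunks, chat_stream=chat_stream)


ReActAgentWorker._add_back_chunk_to_stream = _logged_add_back
```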
I've been experiencing the same duplication issue and have traced the root cause. As @whisper-bye correctly identified, the problem occurs when the LLM doesn't follow the ReAct format and directly outputs a response. The bug is in the `_async_add_back_chunk_to_stream` method of the `ReActAgentWorker` class. Here's my fix, which works by checking whether the text from previous chunks is already contained in the latest chunk:

```python
async def _async_add_back_chunk_to_stream(
    self,
    chunks: List[ChatResponse],
    chat_stream: AsyncGenerator[ChatResponse, None],
) -> AsyncGenerator[ChatResponse, None]:
    """Add back chunks to stream asynchronously."""
    if chunks and len(chunks) > 1:
        # Get the last chunk
        last_chunk = chunks[-1]
        # Collect text from all previous chunks
        prev_chunks_text = ''
        for i in range(len(chunks) - 1):
            prev_chunks_text += chunks[i].delta or ''
        # Check if the text from previous chunks is contained in the last chunk
        last_chunk_text = last_chunk.delta or ''
        if prev_chunks_text and prev_chunks_text in last_chunk_text:
            # Duplication detected - return only the last chunk
            yield last_chunk
        else:
            # No duplication - return all chunks
            for chunk in chunks:
                yield chunk
    else:
        # If there's only one chunk or no chunks, just return them
        for chunk in chunks:
            yield chunk
    # Continue with the stream
    async for chunk in chat_stream:
        yield chunk
```

This solution solves the duplication issue by detecting when the latest chunk already contains the text of the earlier "missed" chunks and, in that case, yielding only the latest chunk instead of replaying both.

I've tested this with Russian-language prompts and a GPT model, and it successfully prevents the duplication behavior.
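The containment heuristic can be sanity-checked in isolation by exercising the branch logic synchronously with a stub chunk type (a sketch; `StubChunk` stands in for `ChatResponse`):

```python
# Exercise the duplication-detection branch with a stub chunk type.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubChunk:
    delta: Optional[str]


def dedupe(chunks: List[StubChunk]) -> List[StubChunk]:
    # Mirrors the patched method's branch logic, synchronously.
    if chunks and len(chunks) > 1:
        prev_text = "".join(c.delta or "" for c in chunks[:-1])
        last_text = chunks[-1].delta or ""
        if prev_text and prev_text in last_text:
            return [chunks[-1]]  # duplication detected: keep only the last chunk
    return chunks


# Duplicating stream: the last chunk's delta already contains the buffered "Joe".
assert dedupe([StubChunk("Joe"), StubChunk("Joe Generico is...")]) == [
    StubChunk("Joe Generico is...")
]
# Well-behaved stream: the chunks are disjoint, so all of them are kept.
assert dedupe([StubChunk("Joe"), StubChunk(" Generico")]) == [
    StubChunk("Joe"),
    StubChunk(" Generico"),
]
```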
Bug Description
When using the ReAct mode of the LlamaIndex chat engine, the output contains duplicated and extra characters that are not expected.
Version
0.12.5-0.12.7
Steps to Reproduce

1. set_global_handler("simple")
2. Send the message 你好 ("Hello") to the ReAct chat engine.

All other ChatMode values work fine. A minimal repro assembled from these steps is sketched below.
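(A sketch assembled from the steps above; the embedding/LLM setup is omitted here, as it was in the original report.)

```python
# Minimal repro assembled from the steps above; embedding/LLM setup omitted,
# as in the original report.
from llama_index.core import Document, VectorStoreIndex, set_global_handler
from llama_index.core.chat_engine.types import ChatMode

set_global_handler("simple")  # step 1: print LLM inputs/outputs

index = VectorStoreIndex.from_documents([Document.example()])
chat_engine = index.as_chat_engine(chat_mode=ChatMode.REACT, verbose=True)

# Step 2: send 你好 ("Hello"). With ChatMode.REACT the streamed output shows
# duplicated/extra characters; other chat modes behave normally.
response = chat_engine.stream_chat("你好")
for token in response.response_gen:
    print(token, end="", flush=True)
```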
Relevant Logs/Tracebacks