I have a textual RAG system that I am evolving towards multimodal content, so I want to be able to pass both textual chunks and base64-encoded images in the context.
I just can't wrap my head around the various Message and Prompt objects available in LangChain, and how to make them work properly with multimodal content. Specifically, the HumanMessagePromptTemplate object.
This is a (very simplified) version of what I am currently doing:
```python
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, PromptTemplate

MY_SYSTEM_PROMPT = """Answer the question in the XML node <question> using the context information provided in XML node <context>. Make sure to write the answer in the language provided in XML node <target_language>."""

MY_STRING_TEMPLATE = """<question>{question}</question><context>{context}</context><target_language>{language}</target_language>"""

answerer_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(MY_SYSTEM_PROMPT),
        HumanMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['question', 'context', 'language'],
                template=MY_STRING_TEMPLATE,
            )
        ),
    ]
)

model = ...  # my BaseChatModel implementation
runnable = answerer_template | model
output = await runnable.ainvoke(
    {
        'context': ...,   # a string with the various retrieved textual chunks
        'question': ...,  # the user query
        'language': ...,  # the target language (matches the {language} template variable)
    }
)
```
As you can see I am using HumanMessagePromptTemplate with a template that uses a lot of XML tags. I use those tags because the LLM I am using seems to "understand" them pretty well. At runtime I inject the proper contents in the template and build the message to send to the LLM.
My problem is that I now need to support multimodal contents in the context, and I cannot find any information on how to make that work with HumanMessagePromptTemplate.
I saw in the multimodal prompts documentation that the idea is to add separate content items to the user message, either as plain dicts or using the newer message content blocks; I have sketched both right after this paragraph. But neither approach gives me the flexibility and the fine-grained control of HumanMessagePromptTemplate, where I can write a full prompt template and fill it in at runtime.
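Both variants, as best as I understand them from the docs — here image_b64 and model are placeholders for my actual base64 string and chat model, and I am not 100% sure the newer block format is exactly right:

```python
from langchain_core.messages import HumanMessage

image_b64 = "..."  # placeholder: one of my base64-encoded images (no data-URI prefix)

# Dict-style content blocks, as shown in the multimodal docs:
message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ]
)

# The newer provider-agnostic content blocks (if I read the docs right):
message_new = HumanMessage(
    content=[
        {"type": "text", "text": "Describe this image."},
        {"type": "image", "source_type": "base64", "data": image_b64, "mime_type": "image/jpeg"},
    ]
)

output = await model.ainvoke([message])
```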
I tried to inject the base64 strings in the HumanMessagePromptTemplate as part of the {context} input variable: after all, they are still strings! So I tried something like:
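Something along these lines — retrieved_chunks, retrieved_images_b64, user_query and target_language stand in for my real retrieval results and inputs, and the base64 strings are of course much longer:

```python
# Concatenate the textual chunks and the raw base64 image data into the single
# {context} string, then fill the same ChatPromptTemplate as before.
context_parts = list(retrieved_chunks)
context_parts += [f"<image>{img_b64}</image>" for img_b64 in retrieved_images_b64]

output = await runnable.ainvoke(
    {
        'context': "\n".join(context_parts),
        'question': user_query,
        'language': target_language,
    }
)
```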
But it didn't work: I got a token limit exceeded error. The weird thing is I do not get that error using the HumanMessage approach with exactly the same images. So I guess the two approaches have some significant differences under the hood.
My best shot at the moment seems to be to stop using prompts altogether, and to change my code into something like:
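Roughly this — again retrieved_chunks, retrieved_images_b64, user_query and target_language are placeholders for my real objects:

```python
from langchain_core.messages import HumanMessage, SystemMessage

content = [
    {"type": "text", "text": f"<question>{user_query}</question>"},
    {"type": "text", "text": f"<context>"},
]
# textual chunks go inside <context> as plain text items...
content += [{"type": "text", "text": chunk} for chunk in retrieved_chunks]
# ...and images as separate image items
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}
    for img_b64 in retrieved_images_b64
]
content += [
    {"type": "text", "text": f"</context>"},
    {"type": "text", "text": f"<target_language>{target_language}</target_language>"},
]

output = await model.ainvoke(
    [SystemMessage(MY_SYSTEM_PROMPT), HumanMessage(content=content)]
)
```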
But I am worried: does it make sense to add the XML tags as separate items, like {"type": "text", "text": f"<context>"} and {"type": "text", "text": f"</context>"}? I have no idea how they are handled by LangChain under the hood (also, it looks totally silly).
Sorry for the wall of text. As you may have guessed, I am not really an expert with LangChain, so I am sure there is a lot I am missing. But I could not find any useful examples in the documentation or elsewhere (e.g. Stack Overflow), so I would really appreciate some help 🙇‍♂️