I have a textual RAG system that I am evolving towards multimodal content, so I want to be able to pass both textual chunks and base64-encoded images in the context.
I just can't wrap my head around the various Message and Prompt objects available in LangChain, and how to make them work properly with multimodal content. Specifically, the HumanMessagePromptTemplate object.
This is a (very simplified) version of what I am currently doing:
```python
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, PromptTemplate

MY_SYSTEM_PROMPT = """Answer the question in the XML node <question> using the context information provided in XML node <context>. Make sure to write the answer in the language provided in XML node <target_language>."""

MY_STRING_TEMPLATE = """<question>{question}</question><context>{context}</context><target_language>{language}</target_language>"""

answerer_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(MY_SYSTEM_PROMPT),
        HumanMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['question', 'context', 'language'],
                template=MY_STRING_TEMPLATE,
            )
        ),
    ]
)

model = ...  # my BaseChatModel implementation
runnable = answerer_template | model
output = await runnable.ainvoke(
    {
        'context': ...,   # a string with the various retrieved textual chunks
        'question': ...,  # the user query
        'language': ...,  # the target language (matches the {language} template variable)
    }
)
```
As you can see I am using HumanMessagePromptTemplate with a template that uses a lot of XML tags. I use those tags because the LLM I am using seems to "understand" them pretty well. At runtime I inject the proper contents in the template and build the message to send to the LLM.
My problem is that I now need to support multimodal contents in the context, and I cannot find any information on how to make that work with HumanMessagePromptTemplate.
I saw in the multimodal prompts documentation that the idea is to add separate content items to the user message, either as plain dicts or using the newer message content blocks; I have sketched both right after this paragraph. But neither approach gives me the flexibility and the fine-grained control of HumanMessagePromptTemplate, where I can write a full prompt template and fill it in at runtime.
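Both variants, as best as I understand them from the docs — here image_b64 and model are placeholders for my actual base64 string and chat model, and I am not 100% sure the newer block format is exactly right:

```python
from langchain_core.messages import HumanMessage

image_b64 = "..."  # placeholder: one of my base64-encoded images (no data-URI prefix)

# Dict-style content blocks, as shown in the multimodal docs:
message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ]
)

# The newer provider-agnostic content blocks (if I read the docs right):
message_new = HumanMessage(
    content=[
        {"type": "text", "text": "Describe this image."},
        {"type": "image", "source_type": "base64", "data": image_b64, "mime_type": "image/jpeg"},
    ]
)

output = await model.ainvoke([message])
```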
I tried to inject the base64 strings in the HumanMessagePromptTemplate as part of the {context} input variable: after all, they are still strings! So I tried something like:
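Something along these lines — retrieved_chunks, retrieved_images_b64, user_query and target_language stand in for my real retrieval results and inputs, and the base64 strings are of course much longer:

```python
# Concatenate the textual chunks and the raw base64 image data into the single
# {context} string, then fill the same ChatPromptTemplate as before.
context_parts = list(retrieved_chunks)
context_parts += [f"<image>{img_b64}</image>" for img_b64 in retrieved_images_b64]

output = await runnable.ainvoke(
    {
        'context': "\n".join(context_parts),
        'question': user_query,
        'language': target_language,
    }
)
```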
But it didn't work: I got a token limit exceeded error. The weird thing is I do not get that error using the HumanMessage approach with exactly the same images. So I guess the two approaches have some significant differences under the hood.
My best shot at the moment seems to be to stop using prompts altogether, and to change my code into something like:
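Roughly this — again retrieved_chunks, retrieved_images_b64, user_query and target_language are placeholders for my real objects:

```python
from langchain_core.messages import HumanMessage, SystemMessage

content = [
    {"type": "text", "text": f"<question>{user_query}</question>"},
    {"type": "text", "text": f"<context>"},
]
# textual chunks go inside <context> as plain text items...
content += [{"type": "text", "text": chunk} for chunk in retrieved_chunks]
# ...and images as separate image items
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}}
    for img_b64 in retrieved_images_b64
]
content += [
    {"type": "text", "text": f"</context>"},
    {"type": "text", "text": f"<target_language>{target_language}</target_language>"},
]

output = await model.ainvoke(
    [SystemMessage(MY_SYSTEM_PROMPT), HumanMessage(content=content)]
)
```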
But I am worried: does it make sense to add the XML tags as separate items, like {"type": "text", "text": f"<context>"} and {"type": "text", "text": f"</context>"}? I have no idea how they are handled by LangChain under the hood (also, it looks totally silly).
Sorry for the wall of text. As you may have guessed, I am not really an expert with LangChain, so I am sure there is a lot I am missing. But I could not find any useful examples in the documentation or elsewhere (e.g. Stack Overflow), so I would really appreciate some help 🙇‍♂️