
Use chat templates for vision models #173

Draft · wants to merge 2 commits into main from vision-chat-templates
Conversation

DePasqualeOrg
Contributor

This is a test of my PR to Swift Jinja, which should enable chat templates to be used for vision language models that provide one. I've started to set things up, but I need some pointers on how to integrate the image into the messages.

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch from 5c8ccfa to 3ac6296 Compare January 9, 2025 18:53
@DePasqualeOrg
Contributor Author

@davidkoski, I made some changes, and it seems to work in VLMEval. Do you have any thoughts on this?

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch from 3ac6296 to 4547cf1 Compare January 9, 2025 21:43
    content += Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()
    content += "<|vision_end|>"
    }
    messages[lastIndex]["content"] = content
Collaborator

Given the chat template from Qwen2:

  "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",

is this code required? It seems like the template is going to do exactly that.

Though maybe we are not getting the full chat template here?

Contributor Author

Yes, this is required. The chat template doesn't do exactly this, and the app crashes if I comment this out.

Collaborator

Are values like image_count being sent in? I didn't see where that would be -- perhaps these extra variables would be required. Yes, if the prompt doesn't match up with what the model expects, there will be ... trouble :-)

Contributor Author

It looks like image_count is set and mutated internally in the template. But I think the problem might be that there are no messages of type image being created when asMessages is called. I will have to dig into this in more detail, since I'm not very familiar with the new API. Or maybe you know of a quick fix.

Collaborator

No, I am not familiar with it either. The Python code is here:

    model_to_format = {
        "idefics2": "message_list_with_image",
        "qwen2_vl": "message_list_with_image",
...

    message_formats = {
        "message_list_with_image": lambda: add_image_tokens(
            {"role": role, "content": [{"type": "text", "text": prompt}]}, ""
        ),

and the image token piece is here:

    def add_image_tokens(message, token_format):
        if role == "system":
            return message
        if role == "user" and not skip_image_token:
            if isinstance(message["content"], list):
                if model_name == "pixtral":
                    message["content"] = [{"type": "image"}] * num_images + message[
                        "content"
                    ]
                else:
                    message["content"].extend([{"type": "image"}] * num_images)
            else:
                if model_name == "phi3_v":
                    message["content"] = f"{token_format}{message['content']}"
                else:
                    message["content"] = (
                        f"{token_format * num_images}{message['content']}"
                    )
        if role == "assistant" and model_name == "pixtral":
            message["content"] = message["content"][0]["content"]
        return message

Additionally, the transformers layer also augments the messages; the latter step isn't in the template at all.

Contributor Author

DePasqualeOrg commented Jan 14, 2025

Thank you. I just pushed a commit that I think gets part of the way to a solution.

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 14, 2025

I think UserInput will need to be changed to include messages that look like this:

{
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'What is in this image?'},
        {'type': 'image', 'image_url': 'example.jpg'}
    ]
}
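For illustration, here is a minimal Python sketch of building such structured messages, with content as a list of typed parts rather than a plain string (the helper name is hypothetical, not an actual API):

```python
# Hypothetical helper: build a chat message whose content is a list of
# typed parts (text and image), matching the shape the Qwen2 chat
# template iterates over. Names here are illustrative only.
def make_user_message(text, image_urls=()):
    content = [{"type": "text", "text": text}]
    content += [{"type": "image", "image_url": url} for url in image_urls]
    return {"role": "user", "content": content}

message = make_user_message("What is in this image?", ["example.jpg"])
```

The Swift side would need the analogous shape, which is why widening the message type from [String: String] to [String: Any] is a prerequisite.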

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 14, 2025

The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

However, in order for the model to work, it looks like we need to replace the single <|image_pad|> with repeated padding like this for each image:

    let mergeLength = config.mergeSize * config.mergeSize
    let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()
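As a rough sketch of that expansion step (in Python for brevity; the function name is illustrative, not an actual API), each <|image_pad|> placeholder emitted by the template gets replaced with thw.product / mergeLength copies:

```python
from math import prod

# Illustrative sketch of the padding expansion: replace each single
# <|image_pad|> placeholder with one copy per merged patch, where
# frames[i] is the (t, h, w) patch grid for image i and
# merge_length = merge_size * merge_size.
def expand_image_pads(prompt, frames, merge_size):
    merge_length = merge_size * merge_size
    parts = prompt.split("<|image_pad|>")
    out = [parts[0]]
    for i, rest in enumerate(parts[1:]):
        count = prod(frames[i]) // merge_length if i < len(frames) else 1
        out.append("<|image_pad|>" * count + rest)
    return "".join(out)
```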

@DePasqualeOrg
Contributor Author

I now have something that works, although it still needs to take into account the case where multiple images are included.

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 0f68fd2 to ed03ae5 Compare January 15, 2025 09:19
@DePasqualeOrg
Contributor Author

@davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type [THW] frames?

@davidkoski
Collaborator

@davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type [THW] frames?

It sounds OK to me, though they aren't the frames themselves but the positions of the frames in one of the arrays (maybe not in the final array). I'd try frames or framePositions and see how it goes.

@davidkoski
Collaborator

The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

However, in order for the model to work, it looks like we need to replace the single <|image_pad|> with repeated padding like this for each image:

    let mergeLength = config.mergeSize * config.mergeSize
    let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()

Right, that is this part:

I think the sequence on the Python side is roughly:

  1. add image tokens (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/prompt_utils.py)
  2. transformers / processing (expand image tokens)
  3. tokenize

One issue we have on the Swift side is that steps 1 and 3 occur in the same function in swift-transformers, and we don't have a hook for step 2.
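To make the gap concrete, the sequence could be sketched like this (Python pseudocode of the pipeline; all function names here are hypothetical), with step 2 exposed as an explicit hook between templating and tokenization:

```python
# Hypothetical pipeline: the three steps as separate stages, with step 2
# (processor-specific token expansion) injectable as a hook. In
# swift-transformers today, steps 1 and 3 happen inside one function.
def add_image_tokens(messages, num_images):
    # Step 1: append one image part per image to each user message.
    for message in messages:
        if message["role"] == "user":
            message["content"].extend([{"type": "image"}] * num_images)
    return messages

def render_prompt(messages, num_images, apply_template, expand_pads, tokenize):
    messages = add_image_tokens(messages, num_images)
    prompt = apply_template(messages)  # emits one <|image_pad|> per image part
    prompt = expand_pads(prompt)       # step 2: the missing hook
    return tokenize(prompt)            # step 3
```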

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from e9c7a02 to 8cb233b Compare January 19, 2025 14:18
@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 4 times, most recently from 8959c45 to 3e50263 Compare January 27, 2025 19:32
@davidkoski
Collaborator

@DePasqualeOrg it looks like the swift-transformers side (which includes Jinja) is ready to go and would solve some issues with text models.

Do you want to prepare a PR for picking that up (since it is mostly your work)? If you are busy I can get that ready.

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 28, 2025

I think #185 accomplishes that. Xcode is showing the latest patch versions of the packages when I open mlx-swift-examples. Or is there something I'm missing?

huggingface/swift-transformers#151 still needs to be merged before this PR, since it expands the type of a message from [String: String] to [String: Any].

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 28, 2025

I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.

@davidkoski
Collaborator

I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.

Yeah, I noticed that too. At least the responses seemed correct per Google Translate :-)
