Use chat templates for vision models #173
base: main
Conversation
Force-pushed from 5c8ccfa to 3ac6296
@davidkoski, I made some changes, and it seems to work in VLMEval. Do you have any thoughts on this?
Force-pushed from 3ac6296 to 4547cf1
    content += Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()
    content += "<|vision_end|>"
}
messages[lastIndex]["content"] = content
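For reference, a small sketch of the padding arithmetic in the snippet above, with made-up patch-grid numbers (`thw.product` is the t·h·w patch count from the image processor; `mergeLength` is `mergeSize` squared):

```python
# Hypothetical patch grid: t frames of h x w patches, merged in 2x2 groups.
t, h, w = 1, 16, 16             # assumed thw values, for illustration only
merge_size = 2                  # corresponds to config.mergeSize
merge_length = merge_size * merge_size
num_image_pads = (t * h * w) // merge_length
content = "<|vision_start|>" + "<|image_pad|>" * num_image_pads + "<|vision_end|>"
print(num_image_pads)  # 64
```

So each merged 2x2 group of patches contributes one `<|image_pad|>` token between the vision start/end markers.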
Given the chat template from Qwen2:
"chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}",
is this code required? It seems like the template is going to do exactly that.
Though maybe we are not getting the full chat template here?
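To make the question concrete, here is a minimal sketch (assuming the `jinja2` package is available, and using a simplified slice of the template quoted above) of how the template expands a structured image entry into the vision tokens:

```python
from jinja2 import Template

# A trimmed-down fragment of the Qwen2-VL chat template, for illustration.
chat_template = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n"
    "{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n"
    "{% else %}{% for content in message['content'] %}"
    "{% if content['type'] == 'image' %}<|vision_start|><|image_pad|><|vision_end|>"
    "{% elif 'text' in content %}{{ content['text'] }}"
    "{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}"
)

messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image."}]}
]
rendered = Template(chat_template).render(messages=messages)
print(rendered)
```

Note the list branch only fires when `content` is a list; a plain string message never produces the vision tokens, which may be the crux of the issue here.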
Yes, this is required. The chat template doesn't do exactly this, and the app crashes if I comment this out.
Are values like image_count being sent in? I didn't see where that would be -- perhaps these extra variables would be required. Yes, if the prompt doesn't match up with what the model expects there will be ... trouble :-)
It looks like image_count is set and mutated internally in the template. But I think the problem might be that no messages of type image are being created when asMessages is called. I will have to dig into this in more detail, since I'm not very familiar with the new API. Or maybe you know of a quick fix.
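For reference, this is the structured shape the template's list branch expects a user message to take (a sketch; the text value is made up), as opposed to a flat string:

```python
# One {"type": "image"} entry per image, alongside the text entry. The
# template's non-string branch only fires when "content" is a list like this.
message_with_image = {
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is in this picture?"},
    ],
}
image_entries = [c for c in message_with_image["content"] if c.get("type") == "image"]
print(len(image_entries))  # 1
```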
No, I am not familiar with it either. The python code is here:
model_to_format = {
    "idefics2": "message_list_with_image",
    "qwen2_vl": "message_list_with_image",
    ...

message_formats = {
    "message_list_with_image": lambda: add_image_tokens(
        {"role": role, "content": [{"type": "text", "text": prompt}]}, ""
    ),
and the image token piece is here:
def add_image_tokens(message, token_format):
    if role == "system":
        return message
    if role == "user" and not skip_image_token:
        if isinstance(message["content"], list):
            if model_name == "pixtral":
                message["content"] = [{"type": "image"}] * num_images + message[
                    "content"
                ]
            else:
                message["content"].extend([{"type": "image"}] * num_images)
        else:
            if model_name == "phi3_v":
                message["content"] = f"{token_format}{message['content']}"
            else:
                message["content"] = (
                    f"{token_format * num_images}{message['content']}"
                )
    if role == "assistant" and model_name == "pixtral":
        message["content"] = message["content"][0]["content"]
    return message
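As a self-contained illustration of what the qwen2_vl branch above does (the helper name here is hypothetical), it simply appends one image entry per image to the user message's content list:

```python
# Sketch of the qwen2_vl path in add_image_tokens: extend list-typed content
# with one {"type": "image"} entry per image.
def add_image_entries(message, num_images):
    if message["role"] == "user" and isinstance(message["content"], list):
        message["content"].extend([{"type": "image"}] * num_images)
    return message

msg = {"role": "user", "content": [{"type": "text", "text": "What is this?"}]}
msg = add_image_entries(msg, 2)
print(len(msg["content"]))  # 3
```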
Additionally the transformers layer also augments the messages:
The latter isn't in the template at all.
Thank you. I just pushed a commit that I think gets part of the way to a solution.
The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:
However, in order for the model to work, it looks like we need to replace the single <|image_pad|> token with the repeated padding:

let mergeLength = config.mergeSize * config.mergeSize
let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()
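The post-templating step being described can be sketched like this (hypothetical numbers; the point is that the template emits one placeholder and it gets expanded afterwards):

```python
# The template emits a single <|image_pad|> placeholder; expand it to one pad
# token per merged patch group before tokenizing.
prompt = "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this.<|im_end|>\n"
thw_product = 256            # assumed t*h*w patch count, for illustration
merge_length = 4             # merge_size squared
expanded = prompt.replace(
    "<|image_pad|>", "<|image_pad|>" * (thw_product // merge_length), 1
)
print(expanded.count("<|image_pad|>"))  # 64
```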
Force-pushed from 347ccc5 to 806c7f2
I now have something that works, although it still needs to take into account the case where multiple images are included.
Force-pushed from 0f68fd2 to ed03ae5
@davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type
It sounds ok to me, though they aren't the frames themselves but the positions of the frames in one of the arrays (maybe not in the final array). I think try
Right, that is this part: I think the sequence from the python side is roughly:
One issue we have on the swift side is that steps 1 and 3 occur in the same function in swift-transformers, and we don't have a hook for step 2.
Force-pushed from e9c7a02 to 8cb233b
Force-pushed from 8959c45 to 3e50263
@DePasqualeOrg it looks like the swift-transformers side (which includes Jinja) is ready to go and would solve some issues with text models. Do you want to prepare a PR for picking that up (since it is mostly your work)? If you are busy I can get that ready.
I think #185 accomplishes that. Xcode is showing the latest patch versions of the packages when I open mlx-swift-examples. Or is there something I'm missing? huggingface/swift-transformers#151 still needs to be merged before this PR, since it expands the type of a message from
Force-pushed from 3e50263 to 031e47f
Force-pushed from 031e47f to db97052
I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.
Yeah, I noticed that too. At least the responses seemed correct per google translate :-)
This is a test of my PR to Swift Jinja, which should enable chat templates to be used for vision language models that have a chat template. I've started to set things up, but I need some pointers on how to integrate the image into the messages.