Add Qwen 2.5 VL #222
Conversation
```swift
// Create attention mask
let attentionMask = full(
    (1, sequenceLength, sequenceLength),
```
Suggested change:

```diff
- (1, sequenceLength, sequenceLength),
+ [1, sequenceLength, sequenceLength],
```
```swift
// Create attention mask
let attentionMask = full(
    (1, sequenceLength, sequenceLength),
    MLXArray.finfo(q.dtype).min,
```
This one is a little trickier -- maybe we need to add something to DType, but for now perhaps one of these:
- -Float16.greatestFiniteMagnitude
- -Float32.greatestFiniteMagnitude
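A minimal stand-in for that idea, assuming MLX Swift's DType enum; maskFillValue is a made-up helper name, not an existing API:

```swift
import MLX

// Hypothetical helper: choose a large negative fill value for the additive
// attention mask from the query dtype, standing in for a finfo(dtype).min API.
func maskFillValue(for dtype: DType) -> Float {
    switch dtype {
    case .float16: return -Float(Float16.greatestFiniteMagnitude)
    default: return -Float.greatestFiniteMagnitude
    }
}
```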
```swift
let attentionMask = full(
    (1, sequenceLength, sequenceLength),
    MLXArray.finfo(q.dtype).min,
    dtype: q.dtype)
```
There isn't a variant that takes a dtype:, only a type: (e.g. Float32.self). Filed: ml-explore/mlx-swift#199
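For example, with the type: form mentioned above (a sketch only, using the same names as the snippet under review):

```swift
// The element type is specified with a Swift type (Float32.self),
// not a runtime dtype value.
let attentionMask = full(
    [1, sequenceLength, sequenceLength],
    values: -Float32.greatestFiniteMagnitude,
    type: Float32.self)
```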
Thanks for your feedback, but I still couldn't even test this because of the build errors I mentioned. The implementation is still a work in progress. Do you know how to resolve the build errors?
Ah, now I'm seeing the errors in the Qwen 2.5 VL implementation. Something must have been wrong with Xcode. I'll try to pick it up from here and factor out the shared parts with Qwen 2 VL.
I just pushed a change with fixes to make it build. There are some pieces that are missing to do it exactly right, see ml-explore/mlx-swift#199. In particular this one:

```swift
// Create attention mask
let attentionMask = full(
    [1, sequenceLength, sequenceLength],
    values: -Float32.greatestFiniteMagnitude)
```

I don't know if we need Float16 here or Float32, but this will at least build. This part is not ideal as it requires evaluation in the middle of evaluation:

```swift
// Update mask for each sequence
for i in 1 ..< cuSeqlens.size {
    let start = cuSeqlens[i - 1].item(Int.self)
    let end = cuSeqlens[i].item(Int.self)
    attentionMask[0..., start ..< end, start ..< end] = MLXArray(0)
}
```

I think @awni may have had a technique for avoiding this (that we used in …)
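One way to avoid the .item(...) calls, as a sketch only (assuming cuSeqlens is an MLXArray of cumulative boundaries starting at 0, and using mlx-swift ops as I understand them, not code from the PR): give every position a segment id and compare ids pairwise, so the whole mask is built lazily without pulling scalars out mid-computation.

```swift
// Segment id per position: count how many boundaries (excluding the leading 0)
// are <= that position. Positions in the same window share an id.
let positions = MLXArray(0 ..< sequenceLength)
let segmentIds = sum(
    (expandedDimensions(positions, axis: 1) .>= cuSeqlens[1...]).asType(.int32),
    axis: 1)

// 0 where query and key fall in the same window, a large negative value elsewhere.
let sameWindow =
    expandedDimensions(segmentIds, axis: 1) .== expandedDimensions(segmentIds, axis: 0)
let attentionMask = which(
    expandedDimensions(sameWindow, axis: 0),
    MLXArray(Float32(0)),
    MLXArray(-Float32.greatestFiniteMagnitude))
```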
Inference now works. I'll see if I can factor out the parts that are shared with Qwen 2 VL.
Force-pushed from 0737268 to 72a45d9.
How does this relate to #197? Is this a replacement due to the mentioned issues with the window processing?
I've factored out the shared parts and verified that this works with Qwen 2 VL and 2.5 VL for images, videos, and combinations of the two. I didn't base this on #197, so if @smdesai has any feedback, please let me know. As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?
@DePasqualeOrg Looks like we refactored around the same time. I've gone through the code, and aside from your windowing fix, which I borrowed, the code is identical; the only difference is that I may have refactored a little more. Looks good to me.
I don't know how to estimate the VLM memory use, but we can compute the size needed for the frames and it will be related to that. A video has a duration and dimensions on the video track. We know our sampling rate and any downsampling, so we can estimate the memory from that. It looks like the video processing (this is from the SMol PR, but the logic is the same in Qwen) does this:
That produces an array of large frames and then resamples them. They could be 4k frames. I was going to propose something in #206, but perhaps we should make a new PR: we should be resampling frames as they are collected, so the function that produces frames actually produces an array of small frames. This could significantly reduce memory use. Let me cut an issue for that. #223
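A sketch of the "resample as they are collected" idea (illustrative, not the project's actual video code): AVAssetImageGenerator can decode straight to a bounded size via maximumSize, so full-resolution frames never have to be held in memory.

```swift
import AVFoundation
import CoreGraphics

// Collect frames at `fps`, decoding each one at no more than `maxSide` pixels
// on its longer edge. The downscale happens at decode time, not afterwards.
func collectFrames(from url: URL, fps: Double, maxSide: CGFloat) async throws -> [CGImage] {
    let asset = AVURLAsset(url: url)
    let duration = try await asset.load(.duration).seconds

    let generator = AVAssetImageGenerator(asset: asset)
    generator.appliesPreferredTrackTransform = true
    generator.maximumSize = CGSize(width: maxSide, height: maxSide)

    var frames: [CGImage] = []
    var t = 0.0
    while t < duration {
        let (image, _) = try await generator.image(at: CMTime(seconds: t, preferredTimescale: 600))
        frames.append(image)
        t += 1.0 / fps
    }
    return frames
}
```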
I asked Claude 3.7 Sonnet to evaluate both solutions. To be clear, I don't want to take much credit for my solution, since Sonnet helped a lot with it. "Your solution" refers to this PR and "his solution" refers to #197.
I have tested this (and the previous PR #197 on which this was based) and am getting an error when providing an image (a PNG, not video) to the … Is the user at the library level responsible for adding …
@deet, this PR is not based on #197. It is based on the existing Qwen 2 VL implementation in Swift and the Qwen 2.5 VL implementation in Python. Also, I tried using a PNG as input, and it works in my app Local Chat. Maybe there's an issue with how you're passing in the images. However, I noticed that I'm getting maximum buffer length crashes when a photo or video is in portrait orientation, so we'll need to make sure they're getting scaled down to an appropriate size when the width is less than the height as well.
I've pushed a new commit that fixes the max buffer crashes with images and videos in portrait orientation. @davidkoski, do you think the limit of 224 pixels makes sense?
Force-pushed from 7b6f35e to 102aa97.
```swift
    scale = size.height / extent.height
}
// Use the same scaling approach regardless of orientation
let scale = min(size.width / extent.width, size.height / extent.height)
```
Yes, makes sense -- the previous approach would fail for images with extreme aspect ratios, or just aspect ratios that didn't align with the target size.
We may need "shortestEdge" as well, along with a center crop. I will write up an issue for this; we can refactor in a future PR.
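For reference, a "shortest edge" resize plus center crop usually looks something like this (a sketch with Core Image, not the refactor being proposed here):

```swift
import CoreImage

// Scale so the shorter side matches `shortestEdge`, then crop the center
// region of `cropSize`. The longer side is what gets trimmed.
func resizeShortestEdgeAndCenterCrop(
    _ image: CIImage, shortestEdge: CGFloat, cropSize: CGSize
) -> CIImage {
    let extent = image.extent
    let scale = shortestEdge / min(extent.width, extent.height)
    let scaled = image.transformed(by: CGAffineTransform(scaleX: scale, y: scale))

    let origin = CGPoint(
        x: scaled.extent.midX - cropSize.width / 2,
        y: scaled.extent.midY - cropSize.height / 2)
    return scaled.cropped(to: CGRect(origin: origin, size: cropSize))
}
```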
See discussion in #200 -- Qwen2 requires the exact computed size and this won't always deliver it. I have set up some unit tests so we can run all of these through.
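The "exact computed size" constraint comes from the processor rounding the target to multiples of patchSize * mergeSize (28 for Qwen2-VL) within a pixel budget. A rough sketch of that rounding, loosely modeled on the Python smart_resize (the reference floors/ceils where this just rounds, and the pixel limits are parameters here, not the real defaults):

```swift
// Round the requested size to exact multiples of `factor`, shrinking or
// growing uniformly if the result falls outside [minPixels, maxPixels].
func exactTargetSize(
    width: Int, height: Int, factor: Int, minPixels: Int, maxPixels: Int
) -> (width: Int, height: Int) {
    func roundedToFactor(_ value: Double) -> Int {
        max(factor, Int((value / Double(factor)).rounded()) * factor)
    }
    var h = roundedToFactor(Double(height))
    var w = roundedToFactor(Double(width))
    if h * w > maxPixels {
        let shrink = (Double(h * w) / Double(maxPixels)).squareRoot()
        h = roundedToFactor(Double(h) / shrink)
        w = roundedToFactor(Double(w) / shrink)
    } else if h * w < minPixels {
        let grow = (Double(minPixels) / Double(h * w)).squareRoot()
        h = roundedToFactor(Double(h) * grow)
        w = roundedToFactor(Double(w) * grow)
    }
    return (w, h)
}
```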
Force-pushed from 102aa97 to 9356549.
```swift
hiddenStates = block(
    hiddenStates,
    attentionMask: attentionMask,
```
I moved the attentionMask computation up to this level -- there are really only two different attention masks, and we can use them over and over instead of computing per layer (32 layers). Locally that gives a 25 -> 33 tokens/s improvement.
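Roughly the shape of that change (identifiers like fullAttentionBlockIndexes are illustrative, not the PR's exact names): build the two masks once, outside the layer loop, and hand the right one to each block.

```swift
import MLX

// Precompute both masks once, then reuse them for every block instead of
// rebuilding a mask per layer.
func runVisionBlocks(
    _ blocks: [(MLXArray, MLXArray) -> MLXArray],
    hiddenStates: MLXArray,
    fullMask: MLXArray,
    windowMask: MLXArray,
    fullAttentionBlockIndexes: Set<Int>
) -> MLXArray {
    var h = hiddenStates
    for (i, block) in blocks.enumerated() {
        let mask = fullAttentionBlockIndexes.contains(i) ? fullMask : windowMask
        h = block(h, mask)
    }
    return h
}
```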
This model works well on the sample image (and other flower images I try) but fails for me on one of my images. The python version's output looks good, but the swift version's does not. Every test image (other than flowers, for some reason :-) ) gives a response that makes me think the image is garbled.
I tested it last night on a screenshot, and it gave a very detailed and accurate response about the contents of the image, which wasn't the case before the most recent changes I made.
```swift
    channel * config.temporalPatchSize * config.patchSize * config.patchSize
    actualGridT * (gridH / config.mergeSize) * (gridW / config.mergeSize),
    channel * config.temporalPatchSize * (config.mergeSize * config.patchSize)
        * (config.mergeSize * config.patchSize)
```
I am curious about this change (also in Qwen25VL). From the python transformers:

```python
flatten_patches = patches.reshape(
    grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
)
```

which matches the lines right above. We are getting the wrong shape out of this (compared to python), though I can't tell that it makes any difference in the model (the layout of the bytes is the same, despite the change in shape).
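For comparison, the MLX Swift equivalent of that transformers reshape would be something like the following (a sketch; the grid/config names are assumed from the surrounding Swift code, not verified):

```swift
// Same flattening as the Python reference: one row per patch,
// channel * temporalPatchSize * patchSize * patchSize values per row.
let flattenedPatches = patches.reshaped(
    gridT * gridH * gridW,
    channel * config.temporalPatchSize * config.patchSize * config.patchSize)
```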
Hey, how can I help?
Try checking the grid you are getting.
Also, we expand it in mlx-vlm:
@DePasqualeOrg here is a great example.
You simply tag us and say "this/last major item" is missing.
A better approach would be:
- What is the problem? (With a reproducible example script)
- What are the shapes, debug logs or traceback?
- What have you tried so far?
@Blaizzy, I asked for suggestions about this part of the processor, which @davidkoski asked about. I'm out of my depth here, so I'm going to step back from trying to implement this model.
Yes, but you didn't mention what shapes you are getting, any debug logs I can look into, or a link to the source you are referencing.
I looked into the transformers version and I can't find the part of the code you speak of.
This is what I mean by please be detailed.
Yeah, screenshots work ok for me as well -- it reads the text from them.
Commit 4add690 works well -- I tweaked the mask code between then and now, let me see if that is it.
OK, the Bool mask didn't work as expected but Int8 works fine.
This looks great and is working great (sorry for my damage!). I think the pending tasks are:
```swift
/// Qwen2.5VL VLM `UserInputProcessor`.
///
/// This is meant to be used with ``Qwen25VL`` and is typically created by ``VLMModelFactory``.
public class Qwen25VLProcessor: UserInputProcessor
```
Should this move into QwenVL and be shared between the models? On the python side both share Qwen2VLImageProcessor, but there is a separate Qwen2_5_VLProcessor for prompt manipulation.
I considered this, but I thought it might make the code too convoluted, since we have to define a protocol and accommodate differences in the two models' configurations.
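For illustration, the protocol in question might look something like this (hypothetical, not code from the PR; only the configuration fields visible elsewhere in this thread are listed):

```swift
// A shared configuration surface both processors could conform to, so the
// image/video preprocessing path could live in one shared type.
protocol QwenVLProcessorConfiguration {
    var patchSize: Int { get }
    var mergeSize: Int { get }
    var temporalPatchSize: Int { get }
}
```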
ok, sounds good
```swift
    tokenizer: any Tokenizer
) throws -> [Int] {
    // Replace single padding token with correct number for each image or video frame
    let placeholderTokens = try tokenizer.encode(
```
Here is the output of the template render (inside swift-transformers):

```swift
let rendered = try template.render(context)
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nDescribe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n"
```

This is the template that resolves:

```
(lldb) po selectedChatTemplate
"{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message[\'role\'] != \'system\' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message[\'role\'] }}\n{% if message[\'content\'] is string %}{{ message[\'content\'] }}<|im_end|>\n{% else %}{% for content in message[\'content\'] %}{% if content[\'type\'] == \'image\' or \'image\' in content or \'image_url\' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content[\'type\'] == \'video\' or \'video\' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif \'text\' in content %}{{ content[\'text\'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
```

It looks like this comes from chat_template.json.
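For context, the "replace single padding token" step amounts to expanding each <|image_pad|> the template emits into one token per merged patch. A sketch (identifiers are illustrative; the grid math follows the Qwen2-VL convention of one token per mergeSize × mergeSize patch group):

```swift
// Expand each image-pad placeholder into gridT * gridH * gridW / mergeSize^2
// copies, consuming one grid per placeholder in order.
func expandImagePads(
    _ tokens: [Int], imagePadToken: Int, grids: [(t: Int, h: Int, w: Int)], mergeSize: Int
) -> [Int] {
    var gridIterator = grids.makeIterator()
    return tokens.flatMap { token -> [Int] in
        guard token == imagePadToken, let grid = gridIterator.next() else { return [token] }
        let count = grid.t * grid.h * grid.w / (mergeSize * mergeSize)
        return Array(repeating: imagePadToken, count: count)
    }
}
```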
We investigated the Bool mask -- it looks like this is an issue fixed after v0.21.0 of mlx::core, so we can revisit this once we pick up the latest in mlx-swift.
@davidkoski, there are now conflicts with main, and I'm not sure how to resolve some of them. My attempt to do that resulted in many errors. Could you rebase this branch and resolve the conflicts?
Will do, but I won't get to it until tomorrow at the earliest. When I looked at the web view it was just the localized error messages, but a full rebase might collide in many places.
@DePasqualeOrg I ended up just resolving the conflict with the web editor -- that was easier than doing the full rebase. So we have a clean merge here when it is ready. If we really need a rebase (and I am not sure we do at this point), then I suggest we probably want to squash the commits on the branch to make the rebase easier. I think these items still remain: #222 (comment) and then this should be ready to merge.
Thank you! If we're okay with having separate processors for Qwen 2 and 2.5 VL, which would keep the code simpler, then I think we just need to resolve the first item on the list. Removing or revising the debug statements is easy to do once we decide what logging solution (if any) to go with.
I think I'm going to step back from trying to implement models, because I don't have the expertise or the time to finish the job, and we haven't made any progress in weeks. Hopefully others can take it from here.
I wouldn't be demotivated! It's just that we have work and many responsibilities. We do this in our free time. I would instead advise you to be precise and detailed in your queries. Help us, help you :)
I tried to resolve the conflicts that have accumulated, but I don't understand all the changes, so I'll leave this to others. I don't have the capacity to finish this PR.
One of the conflicts I see is this:
vs
I think we should probably defer a default resize to the model (UserInputProcessor) itself -- the UI code, which may want to just load a model, probably shouldn't be concerned with this unless it is under user control. I will see about resolving the other conflicts and cleaning up any debug/logs in the code. @DePasqualeOrg are there any outstanding pieces that you think need to be done before we wrap this up?
Thanks, @davidkoski. I'm not sure about all the details of the processor, which I think don't fully align with the Python implementation. But it seemed to be working well enough the last time I tested it. I think this is better than not having a working implementation in Swift, and it can be improved if someone wants to take that on.
Yeah, the processor part in python is split between mlx-vlm and transformers, so we don't have a perfect 1:1 mapping.
Got it synced with …
Thank you very much to @DePasqualeOrg @smdesai (who had another port of this) and @Blaizzy for the original model and support! Awesome contribution!
@davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?