Add Qwen 2.5 VL #222

Merged: 21 commits merged into ml-explore:main on Apr 14, 2025
Conversation

@DePasqualeOrg (Contributor)

@davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?


// Create attention mask
let attentionMask = full(
(1, sequenceLength, sequenceLength),
Collaborator:

Suggested change:
-    (1, sequenceLength, sequenceLength),
+    [1, sequenceLength, sequenceLength],

// Create attention mask
let attentionMask = full(
(1, sequenceLength, sequenceLength),
MLXArray.finfo(q.dtype).min,
Collaborator:

This one is a little trickier -- maybe we need to add something to DType, but for now perhaps one of these:

  • -Float16.greatestFiniteMagnitude
  • -Float32.greatestFiniteMagnitude

let attentionMask = full(
(1, sequenceLength, sequenceLength),
MLXArray.finfo(q.dtype).min,
dtype: q.dtype)
Collaborator:

There isn't a variant that takes a dtype, only a type: (e.g. Float32.self).

Filed: ml-explore/mlx-swift#199

@DePasqualeOrg (Contributor Author)

Thanks for your feedback, but I still couldn't even test this because of the build errors I mentioned. The implementation is still a work in progress. Do you know how to resolve the build errors?

@DePasqualeOrg (Contributor Author)

Ah, now I'm seeing the errors in the Qwen 2.5 VL implementation. Something must have been wrong with Xcode. I'll try to pick it up from here and factor out the shared parts with Qwen 2 VL.

@davidkoski (Collaborator)

> @davidkoski, I'm working on Qwen 2.5 VL, and I'm getting a build error in VLMModelFactory.swift that I don't understand. I've defined everything just like Qwen 2 VL, but it can't find the modules. Are you able to see what the problem is?

I just pushed a change with fixes to make it build. There are some pieces that are missing to do it exactly right, see ml-explore/mlx-swift#199

In particular this one:

            // Create attention mask
            let attentionMask = full(
                [1, sequenceLength, sequenceLength],
                values: -Float32.greatestFiniteMagnitude)

I don't know if we need Float16 here or Float32, but this will at least build.

This part is not ideal as it requires evaluation in the middle of evaluation:

            // Update mask for each sequence
            for i in 1 ..< cuSeqlens.size {
                let start = cuSeqlens[i - 1].item(Int.self)
                let end = cuSeqlens[i].item(Int.self)
                attentionMask[0..., start ..< end, start ..< end] = MLXArray(0)
            }

I think @awni may have had a technique for avoiding this (that we used in Qwen2VL)
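As a hypothetical illustration (not the technique used in Qwen2VL), the block-diagonal mask could also be built once on the host, since the cumulative sequence lengths are derived from the frame grid and are available as plain Swift integers; that avoids the .item() round trips during evaluation:

    // Hypothetical sketch: construct the full mask in Swift and upload it as a
    // single MLXArray, instead of mutating it per sequence during evaluation.
    func blockDiagonalMask(boundaries: [Int], sequenceLength: Int) -> MLXArray {
        let negInf = -Float32.greatestFiniteMagnitude
        var data = [Float32](repeating: negInf, count: sequenceLength * sequenceLength)
        for i in 1 ..< boundaries.count {
            let (start, end) = (boundaries[i - 1], boundaries[i])
            for row in start ..< end {
                for col in start ..< end {
                    data[row * sequenceLength + col] = 0
                }
            }
        }
        return MLXArray(data, [1, sequenceLength, sequenceLength])
    }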

@DePasqualeOrg (Contributor Author)

Inference now works. I'll see if I can factor out the parts that are shared with Qwen 2 VL.

@davidkoski (Collaborator)

How does this relate to #197? Is this a replacement due to the mentioned issues with the window processing?

@smdesai & @DePasqualeOrg

@DePasqualeOrg (Contributor Author) commented Mar 4, 2025

I've factored out the shared parts and verified that this works with Qwen 2 VL and 2.5 VL for images, videos, and combinations of the two. I didn't base this on #197, so if @smdesai has any feedback, please let me know.

As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?

@DePasqualeOrg marked this pull request as ready for review on March 4, 2025, 22:11
@smdesai (Contributor) commented Mar 5, 2025

@DePasqualeOrg Looks like we refactored around the same time. I've gone through the code, and aside from your windowing fix (which I borrowed), the code is identical; the only difference is that I may have refactored a little more. Looks good to me.

@davidkoski (Collaborator) commented Mar 5, 2025

> As a separate issue, I noticed that some of the videos I used exceeded the maximum buffer length of my MacBook Pro, which causes the app to crash. I guess it's the app's responsibility to check the input against the device's maximum buffer length and available RAM. How can we estimate the required memory and buffer length of a given image or video?

I don't know how to estimate the VLM memory use, but we can compute the size needed for the frames and it will be related to that. A video has a duration and dimensions on the video track. We know our sampling rate and any downsampling, so we can estimate the memory from that.
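A rough back-of-the-envelope sketch of that estimate (hypothetical helper; the assumption of four Float32 channels per pixel before downsampling is mine):

    // Rough upper bound on the buffer needed to hold the sampled frames at
    // their original resolution, assuming 4 channels of Float32 per pixel.
    func estimatedFrameBufferBytes(
        durationSeconds: Double, targetFPS: Double, maxFrames: Int,
        width: Int, height: Int
    ) -> Int {
        let sampledFrames = min(maxFrames, Int(durationSeconds * targetFPS))
        let bytesPerPixel = 4 * MemoryLayout<Float32>.size
        return sampledFrames * width * height * bytesPerPixel
    }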

It looks like the video processing (this is from the SMol PR, but the logic is the same in Qwen) does this:

            var videoFrameResult = try await MediaProcessing.asCIImageSequence(
                video, maxFrames: maxVideoFrames, targetFPS: targetVideoFPS)

            var processedFrames: [MLXArray] = []
            for frame in videoFrameResult.frames {
                let image =
                    frame
                    .toSRGB()
                    .resampled(
                        to: CGSize(width: fixedImageSize, height: fixedImageSize), method: .lanczos
                    )
                    .normalized(mean: config.imageMeanTuple, std: config.imageStdTuple)
                    .asMLXArray()
                processedFrames.append(image)
            }

That produces an array of large frames and then resamples them. They could be 4k frames. I was going to propose something in #206, but perhaps we should make a new PR: we should be resampling frames as they are collected, so the function that produces frames actually produces an array of small frames.

This could significantly reduce memory use. Let me cut an issue for that. #223
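A minimal sketch of that idea, assuming a hypothetical frame producer (framesFromVideo does not exist in the library) that hands each CIImage to a closure before decoding the next one, so only the small processed arrays are retained:

    var processedFrames: [MLXArray] = []
    // framesFromVideo is an assumed helper that calls the closure once per
    // sampled frame; each frame is downsampled before the next is decoded.
    try await framesFromVideo(video, maxFrames: maxVideoFrames, targetFPS: targetVideoFPS) { frame in
        let image = frame
            .toSRGB()
            .resampled(
                to: CGSize(width: fixedImageSize, height: fixedImageSize), method: .lanczos)
            .normalized(mean: config.imageMeanTuple, std: config.imageStdTuple)
            .asMLXArray()
        processedFrames.append(image)
    }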

@DePasqualeOrg (Contributor Author) commented Mar 5, 2025

I asked Claude 3.7 Sonnet to evaluate both solutions. To be clear, I don't want to take much credit for my solution, since Sonnet helped a lot with it. "Your solution" refers to this PR and "his solution" refers to #197.

After comparing both solutions, I would say your solution is better overall. Here's why:

Strengths of your solution:

  1. Better organization and structure:

    • You've created a shared QwenVL namespace for common utilities used by both Qwen2VL and Qwen25VL models
    • Your code has cleaner separation of concerns with more consistent naming
  2. More consistent implementation:

    • You use RMSNorm in the Qwen25VL vision blocks, which is more appropriate than the LayerNorm used in his solution
    • Your implementation of the processor configuration handles the size parameters more consistently
  3. Better error handling and robustness:

    • Your implementation of the window attention mechanism is more complete and accurate
  4. More maintainable code:

    • You've extracted common functionality into the QwenVL namespace, making it easier to maintain both models
    • Your code has better documentation and more consistent naming conventions
  5. More accurate implementation:

    • Your implementation more closely matches the original Python implementation, particularly in the vision model components

Areas where his solution has some advantages:

  1. His solution uses more explicit module naming in some places, which can make it slightly easier to follow the model architecture at first glance.

  2. His implementation of the Qwen25VLProcessor is slightly more concise, though yours is more consistent with the rest of the codebase.

Overall assessment:

Your solution is superior because it demonstrates better software engineering practices through:

  • Better code organization and reuse
  • More consistent implementation
  • More accurate implementation of the model architecture
  • Better maintainability for future updates

The shared QwenVL namespace is particularly valuable as it would make it easier to implement other models in the Qwen family in the future, reducing code duplication and potential for errors.

@deet commented Mar 6, 2025

I have tested this (and the previous PR #197 on which this was based) and am getting an error when providing an image (a PNG, not a video) to the UserInput: "Number of placeholder tokens does not match number of frames".

Is the user at the library level responsible for adding <|image_pad|> to the message sequence? Or is this related to the chat_template changes?

@DePasqualeOrg (Contributor Author) commented Mar 6, 2025

@deet, this PR is not based on #197. It is based on the existing Qwen 2 VL implementation in Swift and the Qwen 2.5 VL implementation in Python.

Also, I tried using a PNG as input, and it works in my app Local Chat. Maybe there's an issue with how you're passing in the images.

However, I noticed that I'm getting maximum buffer length crashes when a photo or video is in portrait orientation, so we'll need to make sure they're getting scaled down to an appropriate size also when the width is less than the height.

@DePasqualeOrg (Contributor Author)

I've pushed a new commit that fixes the max buffer crashes with images and videos in portrait orientation. @davidkoski, do you think the limit of 224 pixels makes sense?

scale = size.height / extent.height
}
// Use the same scaling approach regardless of orientation
let scale = min(size.width / extent.width, size.height / extent.height)
Collaborator:

Yes, makes sense -- the previous code would fail for images with extreme aspect ratios, or just aspect ratios that didn't align with the target size.
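A quick worked example with illustrative numbers, a 3024x4032 portrait photo scaled into a 1024x1024 target:

    let extent = CGSize(width: 3024, height: 4032)
    let size = CGSize(width: 1024, height: 1024)
    // min(1024/3024, 1024/4032) = 1024/4032, about 0.254
    let scale = min(size.width / extent.width, size.height / extent.height)
    // Scaled result is 768 x 1024: the longer edge is what gets bounded, so
    // portrait inputs no longer blow past the maximum buffer length.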

Collaborator:

We may need "shortestEdge" as well, along with a center crop. I will write up an issue for this, we can refactor in a future PR

Collaborator:

See discussion in #200 -- Qwen2 requires the exact computed size and this won't always deliver it. I have set up some unit tests so we can run all of these through.

@deet commented Mar 10, 2025

I'm not seeing the reshape error anymore.

But generally on this branch, I've seen some quality issues in output.

I've tried the latest commit and compared it with various states of main.

(outputs below are all for the same image with same prompt)

From commit 983eaac

describe this image in english. be very detailed and precise The image is a screenshot of a GitHub repository. The repository contains several commits, including one titled "Support Phi-4-mini" by someone named David. The commit message states that the commit supports Phi-4-mini, which is 216. The commit is marked as "Verified" and the commit timestamp is 2025-02-25.------

From commit c743526

["role": "user", "content": [["type": "text", "text": "describe this image in english. be very detailed and precise"], ["type": "image"]]] The image shows a screenshot of a GitHub repository, specifically a commit history. The commit history is for a file named `README.md` in a project named `LLMMModelFactory`. The commit commit number is 1,83, and the commit message is "Update LLMModelFactory.swift". The commit is marked as "Verified".------

From top of main, commit 3885b92:

["role": "user", "content": [["type": "text", "text": "describe this image in english. be very detailed and precise"], ["type": "image"]]] The image is a screenshot of a GitHub repository, displayed in the "Pull requests" tab. The repository appears to be related to a project named "DePasqualeOrg (Anthony)" or "DePasqualeOrg (Anthony)" as per the Git tag. The project is currently on GitHub, and there are several commits, including one that was recently added and marked as "verified" by the author.------

From THIS commit 9356549

["role": "user", "content": [["type": "text", "text": "describe this image in english. be very detailed and precise"], ["type": "image"]]] The image shows a document management software application, specifically a spreadsheet or database application, displaying a list of items. The list appears to be sorted based on some criteria, possibly date or priority. Here is a detailed description:

- The application is a spreadsheet or database software application.
- The screen shows a list of items, which seems to be a document or file management application.
- The list is sorted in some way, possibly by date or priority.
- The items in the list include various texts, possibly file names or titles, along with associated data such as dates, times, or other identifiers.
- The list is vertically aligned, with each item in a separate row.
- The text appears to be in a language that uses a character set, possibly Chinese or Japanese.
- There are no graphical elements such as images or icons, only text and numbers, indicating it is a text-based application.
- The interface is clean, with no distractions or additional elements.
- The list includes a few items, such as "20190415", "20190414", and "20190412", which might represent dates.
- The application is running on a computer, as indicated by the appearance of the interface------

You can see that the commits from main know it's a GitHub page, and their outputs are all reasonably similar. But this branch is quite off:

All from mlx-community/Qwen2-VL-2B-Instruct-4bit on this image:

[image: mlx_test_quality_screenshot_github_safari]

I've observed a similar effect with larger and better (7B Qwen2) models too, which previously gave quite good results on other types of images but seem similarly affected.


hiddenStates = block(
hiddenStates,
attentionMask: attentionMask,
Collaborator:

I moved the attentionMask computation up to this level -- there are really only two different attention masks, and we can use them over and over instead of computing one per layer (32 layers). Locally that gives a 25 -> 33 tokens/s improvement.
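A hedged sketch of what that hoisting looks like; apart from attentionMask, all names here (makeAttentionMask, cuWindowSeqlens, fullAttentionBlockIndexes, rotaryPositionEmbedding) are assumptions for illustration, not the merged code:

    // Compute the two masks once, then reuse them across all 32 vision blocks.
    let fullAttentionMask = makeAttentionMask(cuSeqlens, sequenceLength: sequenceLength)
    let windowAttentionMask = makeAttentionMask(cuWindowSeqlens, sequenceLength: sequenceLength)

    for (i, block) in blocks.enumerated() {
        // Qwen 2.5 VL uses full attention only in a few designated blocks;
        // every other block uses the windowed mask.
        let mask = fullAttentionBlockIndexes.contains(i) ? fullAttentionMask : windowAttentionMask
        hiddenStates = block(
            hiddenStates, attentionMask: mask, rotaryPositionEmbedding: rotaryPositionEmbedding)
    }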

@davidkoski (Collaborator)

This model works well on the sample image (and other flower images I try) but fails for me on one of my images:

[image: IMG_0691]

The python version:

python -m mlx_vlm.generate --model mlx-community/Qwen2.5-VL-3B-Instruct-4bit --max-tokens 100 --prompt "Describe the image in English" --resize-shape 1024 --image /Users/dkoski/Desktop/IMG_0691.jpeg

The image shows a small, light brown dog with curly fur. The dog is standing on a light-colored carpet and is wearing a blue and white striped collar. The background is blurred, but it appears to be an indoor setting with some furniture and possibly some toys or bags in the background. The dog has a curious expression on its face.

which looks good, but the Swift version:

The image appears to be a combination of two different patterns. The upper part of the image shows a pattern consisting of various shades of brown and beige, resembling a pile or clump of what looks like paper or fabric. The lower part of the image features a repeating pattern with a tropical theme. This pattern includes images of pineapples, palm leaves, and other tropical elements, arranged on a light blue background. The pattern appears to repeat in a grid-like manner, covering the entire lower half of

Every test image (other than flowers for some reason :-) ) gives a response that makes me think the image is garbled.

@davidkoski (Collaborator)

Actually, a different test image, but similar:

[image: IMG_4962]

Here is the image at the end of the MediaProcessing pipeline:

[image]

That looks OK. Perhaps something in the "patch" generation? (assuming it is a problem with the input prep)

@DePasqualeOrg (Contributor Author)

I tested it last night on a screenshot, and it gave a very detailed and accurate response about the contents of the image, which wasn't the case before the most recent changes I made.

channel * config.temporalPatchSize * config.patchSize * config.patchSize
actualGridT * (gridH / config.mergeSize) * (gridW / config.mergeSize),
channel * config.temporalPatchSize * (config.mergeSize * config.patchSize)
* (config.mergeSize * config.patchSize)
Collaborator:

I am curious about this change (also in Qwen25VL). From the python transformers:

        flatten_patches = patches.reshape(
            grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
        )

which matches the lines right above. We are getting the wrong shape out of this (compared to Python), though I can't tell that it makes any difference in the model (the layout of the bytes is the same, despite the change in shape).
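For reference, a sketch of the reshape that would match the transformers line above, using the names from the surrounding Swift (whether the merge-size grouping in the current code actually matters downstream is exactly the open question here):

    // Matches the transformers shape: one row per patch, each row holding
    // channel * temporalPatchSize * patchSize * patchSize values.
    let flattenedPatches = patches.reshaped(
        actualGridT * gridH * gridW,
        channel * config.temporalPatchSize * config.patchSize * config.patchSize)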

Contributor Author:

Cc @Blaizzy, @pcuenca: This is the last major item remaining to be resolved for this PR. Do you have any suggestions?

Contributor Author:

Paging @Blaizzy and @pcuenca. It would be great if we could unblock this, since we're close to the finish line.

@Blaizzy commented Apr 4, 2025:

Hey

How can I help?

Try checking the grid you are getting.

Also, we expand it in mlx-vlm:

https://github.com/Blaizzy/mlx-vlm/blob/48c8a55e8a74642c4a633619bdd25683f42b2325/mlx_vlm/models/qwen2_5_vl/vision.py#L257

@Blaizzy commented Apr 4, 2025:

@DePasqualeOrg here is a great example.

You simply tag us and say "this/last major item" is missing.

A better approach would be:

  • What is the problem? (With a reproducible example script)
  • What are the shapes, debug logs or traceback?
  • What have you tried so far?

Contributor Author:

@Blaizzy, I asked for suggestions about this part of the processor, which @davidkoski asked about. I'm out of my depth here, so I'm going to step back from trying to implement this model.

@Blaizzy commented Apr 4, 2025:

Yes, but you didn't mention what shapes you are getting, any debug logs I can look into, or a link to the source you are referencing.

I looked into the transformers version and I can't find the part of the code you speak of.

https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py

@Blaizzy:

This is what I mean by please be detailed.

@davidkoski (Collaborator)

> I tested it last night on a screenshot, and it gave a very detailed and accurate response about the contents of the image, which wasn't the case before the most recent changes I made.

Yeah, screenshots work ok for me as well -- it reads the text from them.

@davidkoski (Collaborator)

Commit 4add690 works well -- I tweaked the mask code between then and now, let me see if that is it.

@davidkoski (Collaborator)

OK, the Bool mask didn't work as expected but Int8 works fine.

@davidkoski (Collaborator) commented Mar 12, 2025

This looks great and is working great (sorry for my damage!). I think the pending tasks are:

  • consider this question on the shape of the image -- do we need that? should we revert?
  • remove the various debug prints OR convert to logging
  • revert any dev/debug changes to VLMEval
  • consider moving Qwen2VLProcessor to QwenVL and making both models use it -- on the python side both are using Qwen2VLImageProcessor but there is a separate Qwen2_5_VLProcessor (augmenting the prompt)
  • ready to merge?

/// Qwen2.5VL VLM `UserInputProcessor`.
///
/// This is meant to be used with ``Qwen25VL`` and is typically created by ``VLMModelFactory``.
public class Qwen25VLProcessor: UserInputProcessor {
Collaborator:

Should this move into QwenVL and be shared between the models? On the python side both share Qwen2VLImageProcessor but there is a separate Qwen2_5_VLProcessor for prompt manipulation.

Contributor Author:

I considered this, but I thought it might make the code too convoluted, since we have to define a protocol and accommodate differences in the two models' configurations.

Collaborator:

ok, sounds good

tokenizer: any Tokenizer
) throws -> [Int] {
// Replace single padding token with correct number for each image or video frame
let placeholderTokens = try tokenizer.encode(
Collaborator:

Here is the output of the template render (inside swift-transformers):

        let rendered = try template.render(context)
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nDescribe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n"

this is the template that resolves:

(lldb) po selectedChatTemplate
"{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message[\'role\'] != \'system\' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message[\'role\'] }}\n{% if message[\'content\'] is string %}{{ message[\'content\'] }}<|im_end|>\n{% else %}{% for content in message[\'content\'] %}{% if content[\'type\'] == \'image\' or \'image\' in content or \'image_url\' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content[\'type\'] == \'video\' or \'video\' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif \'text\' in content %}{{ content[\'text\'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"

It looks like this comes from chat_template.json
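For context, a hedged sketch of the placeholder expansion the processor performs after that render (expandPlaceholder is a hypothetical helper, not the merged implementation): each single image_pad token is replaced by one pad token per merged patch, i.e. (t * h * w) / (mergeSize * mergeSize) tokens per image or video.

    // Hypothetical helper: replace each single padding token with the number
    // of pad tokens implied by that image's or video's grid.
    func expandPlaceholder(
        promptTokens: [Int], padToken: Int,
        grids: [(t: Int, h: Int, w: Int)], mergeSize: Int
    ) -> [Int] {
        var nextGrid = grids.makeIterator()
        return promptTokens.flatMap { token -> [Int] in
            guard token == padToken, let grid = nextGrid.next() else { return [token] }
            let count = (grid.t * grid.h * grid.w) / (mergeSize * mergeSize)
            return Array(repeating: padToken, count: count)
        }
    }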

@davidkoski (Collaborator)

We investigated the Bool mask -- it looks like this is an issue fixed after v0.21.0 of mlx::core, so we can revisit this once we pick up the latest in mlx-swift.

@DePasqualeOrg (Contributor Author)

@davidkoski, there are now conflicts with main, and I'm not sure how to resolve some of them. My attempt to do that resulted in many errors. Could you rebase this branch and resolve the conflicts?

@davidkoski (Collaborator)

> @davidkoski, there are now conflicts with main, and I'm not sure how to resolve some of them. My attempt to do that resulted in many errors. Could you rebase this branch and resolve the conflicts?

Will do, but I won't get to it until tomorrow at the earliest. When I looked at the web view it was just the localized error messages, but a full rebase might collide in many places.

@davidkoski (Collaborator)

@DePasqualeOrg I ended up just resolving the conflict with the web editor -- that was easier than doing the full rebase. So we have a clean merge here when it is ready.

If we really need a rebase (and I am not sure we do at this point) then I suggest we probably want to squash the commits on the branch to make the rebase easier.

I think these items still remain: #222 (comment) and then this should be ready to merge.

@DePasqualeOrg (Contributor Author)

Thank you! If we're okay with having separate processors for Qwen 2 and 2.5 VL, which would keep the code simpler, then I think we just need to resolve the first item on the list. Removing or revising the debug statements is easy to do once we decide what logging solution (if any) to go with.

@DePasqualeOrg (Contributor Author)

I think I'm going to step back from trying to implement models, because I don't have the expertise or the time to finish the job, and we haven't made any progress in weeks. Hopefully others can take it from here.

@Blaizzy commented Apr 4, 2025

I wouldn't be demotivated!

It's just that we have work and many responsibilities. We do this in our free time.

I would instead advise you to be precise and detailed in your queries. Help us, help you :)

@DePasqualeOrg (Contributor Author)

I tried to resolve the conflicts that have accumulated, but I don't understand all the changes, so I'll leave this to others. I don't have the capacity to finish this PR.

@davidkoski (Collaborator)

One of the conflicts I see is this:

            userInput.processing.resize = .init(width: 1344, height: 1344)

vs

            userInput.processing.resize = .init(width: 448, height: 448)

I think we should probably defer a default resize to the model (UserInputProcessor) itself -- the UI code, which may want to just load a model, probably shouldn't be concerned with this unless it is under user control.
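A small sketch of the idea (assumed shape of the API; whether processing.resize is optional today is an assumption on my part):

    // Inside the model's UserInputProcessor: only fall back to a
    // model-appropriate default when the caller didn't request a resize.
    if userInput.processing.resize == nil {
        userInput.processing.resize = .init(width: 448, height: 448)  // assumed default
    }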

I will see about resolving the other conflicts and cleaning up any debug/logs in the code.

@DePasqualeOrg are there any outstanding pieces that you think need to be done before we wrap this up?

@DePasqualeOrg (Contributor Author)

Thanks, @davidkoski. I'm not sure about all the details of the processor, which I think don't fully align with the Python implementation. But it seemed to be working well enough the last time I tested it. I think this is better than not having a working implementation in Swift, and it can be improved if someone wants to take that on.

@davidkoski (Collaborator)

Yeah, the processor part in python is split between mlx-vlm and transformers, so we don't have a perfect 1:1 mapping.

@davidkoski (Collaborator)

Got it synced with main -- there was a bit of refactoring on the video processing and I applied that to Qwen2.5. Hopefully I got all of that synced up without damage. I will test this this afternoon.

@davidkoski (Collaborator) left a review comment:

Thank you very much to @DePasqualeOrg @smdesai (who had another port of this) and @Blaizzy for the original model and support! Awesome contribution!

@davidkoski merged commit efb9ed7 into ml-explore:main on Apr 14, 2025. 3 checks passed.