
Add video support for Qwen2-VL #187

Open · wants to merge 1 commit into main from the video branch

Conversation
@nvnsho commented on Jan 28, 2025

Overview
This PR adds video processing capabilities to Qwen2-VL by:

  • Extending UserInput and LMInput to handle video input (see the sketch after this list)
  • Adding frame extraction functionality in the MediaProcessing module
  • Implementing video processing support and necessary tokens in Qwen2-VL
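
To make the first bullet concrete, here is a minimal sketch of how `UserInput` might grow a video case. The `Video` enum, its cases, and the `videos` field are illustrative assumptions rather than names taken verbatim from the diff:

```swift
import AVFoundation
import CoreImage
import Foundation

// Hypothetical shape of the extension described above. `Video` and the
// `videos` field are assumptions, not names from the diff.
public enum Video {
    case url(URL)          // video file to load from disk
    case avAsset(AVAsset)  // an already-constructed asset
}

public struct UserInput {
    public var prompt: String
    public var images: [CIImage] = []
    public var videos: [Video] = []  // new: video attachments for Qwen2-VL
}
```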

Performance

  • Hardware: MacBook Pro (M2 Pro, 32GB RAM)
  • Test Video: 30 seconds, 320×180 resolution (without resizing to 448×448 in ContentView)

Results
The measured time covers the whole inference procedure, including frame extraction and processing.

@nvnsho force-pushed the video branch 3 times, most recently from 4bb020c to 9e894fd on January 28, 2025 at 07:53
```swift
// Collect the frames
var ciImages: [CIImage] = []
for sampledTime in sampledTimes {
    guard let generatedImage = try? await generator.image(at: sampledTime) else {
```
@davidkoski (Collaborator):
The `for: [CMTime]` variant (`images(for:)`) might be more appropriate here -- I think `image(at:)` may have to redo work each time.
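
For reference, a hedged sketch of the suggested batched variant. `images(for:)` (macOS 13 / iOS 16 and later) yields results as an async sequence, so AVFoundation can decode the requested timestamps in one sequential pass; the function name and setup here are illustrative:

```swift
import AVFoundation
import CoreImage

// Batched frame extraction, per the review suggestion above.
func extractFrames(from asset: AVAsset, at times: [CMTime]) async -> [CIImage] {
    let generator = AVAssetImageGenerator(asset: asset)
    generator.appliesPreferredTrackTransform = true

    var ciImages: [CIImage] = []
    // images(for:) decodes all requested times in a single pass and
    // yields each result as it becomes available.
    for await result in generator.images(for: times) {
        // result.image throws when a particular frame could not be
        // generated; skip those frames rather than failing the batch.
        if let cgImage = try? result.image {
            ciImages.append(CIImage(cgImage: cgImage))
        }
    }
    return ciImages
}
```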

```swift
    guard let generatedImage = try? await generator.image(at: sampledTime) else {
        continue
    }
    guard let colorSpace = CGColorSpace(name: CGColorSpace.sRGB),
        let convertedImage = generatedImage.image.copy(colorSpace: colorSpace)
    else {
```
@davidkoski (Collaborator):
I wonder if it would be more efficient to let Core Image do this? Doing the colorspace conversion in CG will (likely) run on the CPU, while Core Image already performs colorspace conversions, e.g. into its working space.

I think these lines could just be omitted, making the CIImage directly from the provided image.

Another option (if the performance isn't acceptable) is to use an AVAssetReader/AVAssetReaderOutput and consume CMSampleBuffers. It is possible that the CGImageRef here is backed by an IOSurface, but the CMSampleBuffer will almost certainly be IOSurface-backed (the direct output of the decoder). The image-generator API is much simpler, though :-)
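
A hedged sketch of that AVAssetReader alternative, under the assumptions above: the decoder's own IOSurface-backed pixel buffers come straight out of the reader, and `CIImage` can wrap them without a CPU copy. Frame-sampling logic is left out; this reads every frame of the first video track, and the function name is illustrative:

```swift
import AVFoundation
import CoreImage

func readAllFrames(from asset: AVAsset) async throws -> [CIImage] {
    guard let track = try await asset.loadTracks(withMediaType: .video).first else {
        return []
    }

    let reader = try AVAssetReader(asset: asset)
    let output = AVAssetReaderTrackOutput(
        track: track,
        outputSettings: [
            kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
        ])
    reader.add(output)
    guard reader.startReading() else {
        throw reader.error ?? AVError(.unknown)
    }

    var frames: [CIImage] = []
    // Each sample buffer holds a decoded frame; wrap its pixel buffer
    // in a CIImage without copying.
    while let sampleBuffer = output.copyNextSampleBuffer() {
        if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
            frames.append(CIImage(cvPixelBuffer: pixelBuffer))
        }
    }
    return frames
}
```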

print("thw: \(gridThw)")
print("pixel vaues, size and actual:")
print(pixelValues.size)
print(pixelValues)
@davidkoski (Collaborator):
Remove debug code or switch to logs
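
As one way to act on this, a hedged sketch of swapping the prints for `os.Logger`; the subsystem and category strings, and the placeholder value, are assumptions:

```swift
import os

// Placeholder standing in for the real value at this call site.
let gridThw = [1, 16, 16]

// Subsystem/category strings are assumptions; pick ones that match the
// package's conventions.
let logger = Logger(subsystem: "mlx-swift-examples", category: "Qwen2VL")
logger.debug("thw: \(String(describing: gridThw), privacy: .public)")
```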

@davidkoski (Collaborator) commented:

This looks great! Check out my comments and see what you think.

The CI checks failed on swift-format -- please run the pre-commit hooks.

Thank you!
