
Add video support for Qwen2-VL #187

Open · wants to merge 1 commit into main from the video branch

Conversation
@nvnsho commented on Jan 28, 2025

Overview
This PR adds video processing capabilities to Qwen2-VL by:

  • Extending UserInput and LMInput to handle video input (see the sketch after this list)
  • Adding frame extraction functionality in the MediaProcessing module
  • Implementing video processing support and necessary tokens in Qwen2-VL
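
To make the first bullet concrete, here is a minimal sketch of how `UserInput` might grow a video case. The `Video` enum, its cases, and the `videos` field are illustrative assumptions rather than names taken verbatim from the diff:

```swift
import AVFoundation
import CoreImage
import Foundation

// Hypothetical shape of the extension described above. `Video` and the
// `videos` field are assumptions, not names from the diff.
public enum Video {
    case url(URL)          // video file to load from disk
    case avAsset(AVAsset)  // an already-constructed asset
}

public struct UserInput {
    public var prompt: String
    public var images: [CIImage] = []
    public var videos: [Video] = []  // new: video attachments for Qwen2-VL
}
```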

Performance

  • Hardware: MacBook Pro (M2 Pro, 32GB RAM)
  • Test Video: 30 seconds, 320×180 resolution (without resizing to 448×448 in ContentView)

Results
The measured time covers the whole inference procedure, including frame extraction and processing.

@nvnsho force-pushed the video branch 3 times, most recently from 4bb020c to 9e894fd on January 28, 2025 at 07:53
```swift
// Collect the frames
var ciImages: [CIImage] = []
for sampledTime in sampledTimes {
    guard let generatedImage = try? await generator.image(at: sampledTime) else {
```
@davidkoski (Collaborator):
The `for: [CMTime]` variant (`images(for:)`) might be more appropriate here -- I think `image(at:)` may have to redo work each time.
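
For reference, a hedged sketch of the suggested batched variant. `images(for:)` (macOS 13 / iOS 16 and later) yields results as an async sequence, so AVFoundation can decode the requested timestamps in one sequential pass; the function name and setup here are illustrative:

```swift
import AVFoundation
import CoreImage

// Batched frame extraction, per the review suggestion above.
func extractFrames(from asset: AVAsset, at times: [CMTime]) async -> [CIImage] {
    let generator = AVAssetImageGenerator(asset: asset)
    generator.appliesPreferredTrackTransform = true

    var ciImages: [CIImage] = []
    // images(for:) decodes all requested times in a single pass and
    // yields each result as it becomes available.
    for await result in generator.images(for: times) {
        // result.image throws when a particular frame could not be
        // generated; skip those frames rather than failing the batch.
        if let cgImage = try? result.image {
            ciImages.append(CIImage(cgImage: cgImage))
        }
    }
    return ciImages
}
```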

```swift
    guard let generatedImage = try? await generator.image(at: sampledTime) else {
        continue
    }
    guard let colorSpace = CGColorSpace(name: CGColorSpace.sRGB),
        let convertedImage = generatedImage.image.copy(colorSpace: colorSpace)
    else {
```
@davidkoski (Collaborator):
I wonder if it would be more efficient to let Core Image do this? Doing the colorspace conversion in CG will (likely) run on the CPU, while Core Image already performs colorspace conversions, e.g. into its working space.

I think these lines could just be omitted, making the CIImage directly from the provided image.

Another option (if the performance isn't acceptable) is to use an AVAssetReader/AVAssetReaderOutput and consume CMSampleBuffers. It is possible that the CGImageRef here is backed by an IOSurface, but the CMSampleBuffer will almost certainly be IOSurface-backed (the direct output of the decoder). The image-generator API is much simpler, though :-)
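
A hedged sketch of that AVAssetReader alternative, under the assumptions above: the decoder's own IOSurface-backed pixel buffers come straight out of the reader, and `CIImage` can wrap them without a CPU copy. Frame-sampling logic is left out; this reads every frame of the first video track, and the function name is illustrative:

```swift
import AVFoundation
import CoreImage

func readAllFrames(from asset: AVAsset) async throws -> [CIImage] {
    guard let track = try await asset.loadTracks(withMediaType: .video).first else {
        return []
    }

    let reader = try AVAssetReader(asset: asset)
    let output = AVAssetReaderTrackOutput(
        track: track,
        outputSettings: [
            kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA
        ])
    reader.add(output)
    guard reader.startReading() else {
        throw reader.error ?? AVError(.unknown)
    }

    var frames: [CIImage] = []
    // Each sample buffer holds a decoded frame; wrap its pixel buffer
    // in a CIImage without copying.
    while let sampleBuffer = output.copyNextSampleBuffer() {
        if let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) {
            frames.append(CIImage(cvPixelBuffer: pixelBuffer))
        }
    }
    return frames
}
```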

print("thw: \(gridThw)")
print("pixel vaues, size and actual:")
print(pixelValues.size)
print(pixelValues)
@davidkoski (Collaborator):
Remove debug code or switch to logs
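
As one way to act on this, a hedged sketch of swapping the prints for `os.Logger`; the subsystem and category strings, and the placeholder value, are assumptions:

```swift
import os

// Placeholder standing in for the real value at this call site.
let gridThw = [1, 16, 16]

// Subsystem/category strings are assumptions; pick ones that match the
// package's conventions.
let logger = Logger(subsystem: "mlx-swift-examples", category: "Qwen2VL")
logger.debug("thw: \(String(describing: gridThw), privacy: .public)")
```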

@davidkoski (Collaborator) commented:

This looks great! Check out my comments and see what you think.

The CI checks failed on swift-format -- please run the pre-commit hooks.

Thank you!
