-
Notifications
You must be signed in to change notification settings - Fork 30.8k
Qwen 2.5 Omni: apply video defaults #37660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the |
"position_id_per_seconds": 25, | ||
"use_audio_in_video": False, | ||
"min_pixels": 128 * 28 * 28, | ||
"max_pixels": 768 * 28 * 28, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, also noticed but didn't want to enforce as it's dynamic in their repo, depending on video length. I agree this is better than nothing and a longer term solution would be to add it in self.video_processor
videos_inputs = self.image_processor(images=None, videos=videos, **output_kwargs["videos_kwargs"]) | ||
if fps is None: | ||
fps = [2.0] * len(videos) | ||
fps = [fps] * len(videos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically unrelated, but I don't think the input kwarg is expected as a list in this method.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, can merge after it's marked ready for review :)
position_id_per_seconds = output_kwargs["videos_kwargs"].pop("position_id_per_seconds") | ||
use_audio_in_video = output_kwargs["videos_kwargs"].pop("use_audio_in_video") | ||
fps = output_kwargs["videos_kwargs"].pop("fps", None) | ||
fps = output_kwargs["videos_kwargs"].pop("fps", 2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, then we can put it in video_kwargs.defaults
. Missed this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but when I tested I think it was not being properly forwarded to all the places where it's needed. I'll take a quick look, otherwise we can merge this and handle the fps in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that consistently using the video fps provided by the user, or defaulting to the value in video_kwargs.defaults
, merits some additional discussion, I'll open a new PR. We can merge this one meanwhile!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to #37687 as well, users should be able to overwrite the value indeed. And the naming diverged without us noticing 😢
"position_id_per_seconds": 25, | ||
"use_audio_in_video": False, | ||
"min_pixels": 128 * 28 * 28, | ||
"max_pixels": 768 * 28 * 28, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, also noticed but didn't want to enforce as it's dynamic in their repo, depending on video length. I agree this is better than nothing and a longer term solution would be to add it in self.video_processor
…nsformers into qwen-omni-video-defaults
I updated the tests to account for the new default sizes, let me know if this is ok to merge @zucchini-nlp! We can continue fps handling in #37687. |
It's already there, sorry. |
* Apply video defaults for min_pixels and max_pixels * fps kwarg should not be a list * Update test to account for new resizing
What does this PR do?
Applies
min_pixels
andmax_pixels
values to video processor.The values were taken from the original processing codebase, which uses a different set for video than it does for images.
In our case, the image processor would always default to the image case, which results in frames resized to very large sizes, possibly causing OOMs, and preparing inputs with shapes not seen by the model during training.
Reproduction
Consider the following snippet:
[886116, 1176]
[60480, 1176]
[57600, 1176]
The difference between this PR and the reference is because the original codebase selects
40
frames for this video, while we select41
.Alternatives
preprocessor_config.json
.