Qwen 2.5 Omni: apply video defaults #37660

pcuenca · 2025-04-21T19:56:11Z

What does this PR do?

Applies default min_pixels and max_pixels values to the video processor.

The values were taken from the original processing codebase, which uses a different set for video than it does for images.

In our case, the image processor always fell back to the image defaults, so video frames were resized to very large sizes. This could cause OOMs and produced inputs with shapes the model never saw during training.
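For readers unfamiliar with how a pixel budget bounds the resized frame, here is a minimal sketch of that kind of clamping. It is not the code used by the processor: the function name `clamp_frame_size` is hypothetical, and the rounding factor of 28 is an assumption read off the `28 * 28` term in the defaults added by this PR.

```python
import math

FACTOR = 28                        # assumed rounding unit, taken from the 28 * 28 term
VIDEO_MIN_PIXELS = 128 * 28 * 28   # video default added in this PR
VIDEO_MAX_PIXELS = 768 * 28 * 28   # video default added in this PR


def clamp_frame_size(height, width, factor=FACTOR,
                     min_pixels=VIDEO_MIN_PIXELS, max_pixels=VIDEO_MAX_PIXELS):
    """Hypothetical helper: round each side to a multiple of `factor`,
    then rescale so the area falls inside [min_pixels, max_pixels]."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


h, w = clamp_frame_size(1080, 1920)
print(h, w, h * w <= VIDEO_MAX_PIXELS)
```

With a much larger image-oriented max_pixels, the first branch would rarely trigger for typical video frames, which is consistent with the oversized inputs described above.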

Reproduction

Consider the following snippet:

# Only the processor is needed to reproduce the input shapes.
from transformers import Qwen2_5OmniProcessor

model_id = "Qwen/Qwen2.5-Omni-7B"
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=2,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
)

print(inputs["pixel_values_videos"].shape)

Before this PR: [886116, 1176]
After this PR: [60480, 1176]
Reference, using qwen_omni_utils.process_mm_info: [57600, 1176]

The remaining difference between this PR and the reference arises because the original codebase selects 40 frames for this video, while we select 41.
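The reported shapes can be checked with a little arithmetic. The sketch below assumes 14x14 patches, 3 channels, and a temporal patch size of 2, with odd frame counts padded up to the next multiple of 2. Those constants are assumptions, not stated in this PR, though they are consistent with the second dimension of 1176.

```python
import math

PATCH_SIZE = 14          # assumed spatial patch size
TEMPORAL_PATCH_SIZE = 2  # assumed number of frames per temporal patch
CHANNELS = 3


def feature_dim():
    """Length of one flattened patch (second dimension of pixel_values_videos)."""
    return CHANNELS * TEMPORAL_PATCH_SIZE * PATCH_SIZE * PATCH_SIZE


def num_rows(num_frames, patches_per_group):
    """Rows = temporal groups (frames padded to a multiple of 2) x patches per group."""
    groups = math.ceil(num_frames / TEMPORAL_PATCH_SIZE)
    return groups * patches_per_group


# Patches per temporal group, derived from the reference run (40 frames).
patches_per_group = 57600 // (40 // TEMPORAL_PATCH_SIZE)

print(feature_dim())                    # 1176
print(patches_per_group)                # 2880
print(num_rows(40, patches_per_group))  # 57600 (reference)
print(num_rows(41, patches_per_group))  # 60480 (this PR)

# Per-frame pixel area implied by each run, compared with max_pixels.
print(patches_per_group * PATCH_SIZE**2 <= 768 * 28 * 28)  # True
print((886116 // 21) * PATCH_SIZE**2)   # 8270416 pixels per frame before this PR
```

Under these assumptions the per-frame resolution after this PR matches the reference exactly, and the extra 2880 rows come from the one additional temporal group.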

Alternatives

Use different config values for image and video processing and persist them to preprocessor_config.json.

github-actions · 2025-04-21T19:56:27Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

pcuenca · 2025-04-21T19:57:04Z

src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py

            "position_id_per_seconds": 25,
            "use_audio_in_video": False,
+            "min_pixels": 128 * 28 * 28,
+            "max_pixels": 768 * 28 * 28,


From https://github.com/QwenLM/Qwen2.5-Omni/blob/7c8dddb38d52a58ce57e778e10fa0eaf26e078e9/qwen-omni-utils/src/qwen_omni_utils/v2_5/vision_process.py#L30

pcuenca · 2025-04-21T19:58:19Z

src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py

            videos_inputs = self.image_processor(images=None, videos=videos, **output_kwargs["videos_kwargs"])
-            if fps is None:
-                fps = [2.0] * len(videos)
+            fps = [fps] * len(videos)


This is technically unrelated, but I don't think the input kwarg is expected as a list in this method.
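As an illustration of the scalar-to-list expansion in the hunk above, here is a small sketch. `broadcast_fps` and `seconds_per_temporal_grid` are hypothetical helpers, not functions in transformers, and the temporal patch size of 2 is an assumption.

```python
TEMPORAL_PATCH_SIZE = 2  # assumed number of frames per temporal patch


def broadcast_fps(fps, num_videos, default=2.0):
    """Hypothetical helper: expand one scalar fps into a per-video list."""
    if fps is None:
        fps = default
    if isinstance(fps, (list, tuple)):
        # Multiplying a list would silently produce a nested list.
        raise TypeError("fps must be a single number, not a list")
    return [float(fps)] * num_videos


def seconds_per_temporal_grid(fps_per_video):
    """Seconds of video covered by one temporal patch, for each video."""
    return [TEMPORAL_PATCH_SIZE / fps for fps in fps_per_video]


fps_list = broadcast_fps(2, num_videos=3)
print(fps_list)                             # [2.0, 2.0, 2.0]
print(seconds_per_temporal_grid(fps_list))  # [1.0, 1.0, 1.0]
```

Rejecting a list explicitly is a design choice of this sketch. The change in this PR simply treats the kwarg as a scalar.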

HuggingFaceDocBuilderDev · 2025-04-21T20:22:23Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp

Looks good to me, can merge after it's marked ready for review :)

zucchini-nlp · 2025-04-22T11:22:36Z

src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py

        position_id_per_seconds = output_kwargs["videos_kwargs"].pop("position_id_per_seconds")
        use_audio_in_video = output_kwargs["videos_kwargs"].pop("use_audio_in_video")
-        fps = output_kwargs["videos_kwargs"].pop("fps", None)
+        fps = output_kwargs["videos_kwargs"].pop("fps", 2.0)


ah, then we can put it in video_kwargs.defaults. Missed this one

Yes, but when I tested I think it was not being properly forwarded to all the places where it's needed. I'll take a quick look, otherwise we can merge this and handle the fps in another PR.

I think that consistently using the video fps provided by the user, or defaulting to the value in video_kwargs.defaults, merits some additional discussion, I'll open a new PR. We can merge this one meanwhile!

Related to #37687 as well, users should be able to overwrite the value indeed. And the naming diverged without us noticing 😢
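The behaviour discussed in this thread (use the user's value when given, otherwise fall back to a default) can be sketched as a plain dictionary merge. This is an illustration only. `VIDEO_DEFAULTS` and `resolve_video_kwargs` are hypothetical names, and placing fps in the defaults is the proposal from the discussion, not something this PR does.

```python
# Hypothetical defaults; min_pixels and max_pixels mirror the values added in this PR.
VIDEO_DEFAULTS = {
    "fps": 2.0,
    "min_pixels": 128 * 28 * 28,
    "max_pixels": 768 * 28 * 28,
}


def resolve_video_kwargs(user_kwargs, defaults=VIDEO_DEFAULTS):
    """Hypothetical helper: user-supplied values override defaults; None means 'not set'."""
    resolved = dict(defaults)
    resolved.update({k: v for k, v in user_kwargs.items() if v is not None})
    return resolved


print(resolve_video_kwargs({})["fps"])             # 2.0
print(resolve_video_kwargs({"fps": 1.0})["fps"])   # 1.0
print(resolve_video_kwargs({"fps": None})["fps"])  # 2.0
```

Resolving the value once and passing the result everywhere it is needed avoids the situation described above, where the default was not forwarded consistently.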

zucchini-nlp · 2025-04-22T11:23:50Z

src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py

            "position_id_per_seconds": 25,
            "use_audio_in_video": False,
+            "min_pixels": 128 * 28 * 28,
+            "max_pixels": 768 * 28 * 28,


Indeed, I also noticed this but didn't want to enforce it, since it's dynamic in their repo and depends on video length. I agree this is better than nothing; a longer-term solution would be to add it in self.video_processor.

pcuenca · 2025-04-23T15:03:35Z

I updated the tests to account for the new default sizes, let me know if this is ok to merge @zucchini-nlp! We can continue fps handling in #37687.

pcuenca · 2025-04-28T08:13:27Z

~~Question for @LysandreJik: should this be cherry-picked into the preview release?~~ (As well as #37687 when it's ready)

It's already there, sorry.

* Apply video defaults for min_pixels and max_pixels
* fps kwarg should not be a list
* Update test to account for new resizing

pcuenca added 2 commits April 21, 2025 21:19

Apply video defaults for min_pixels and max_pixels

5d1d2f4

fps kwarg should not be a list

3614fb1

github-actions bot marked this pull request as draft April 21, 2025 19:56

pcuenca requested review from Cyrilvallez and zucchini-nlp April 21, 2025 19:56

pcuenca mentioned this pull request Apr 22, 2025

nits on any-to-any task huggingface/huggingface.js#1372

Merged

zucchini-nlp approved these changes Apr 23, 2025

pcuenca marked this pull request as ready for review April 23, 2025 13:39

pcuenca added 3 commits April 23, 2025 15:41

Merge branch 'main' into qwen-omni-video-defaults

ea05767

Update test to account for new resizing

ee6fea1

Merge branch 'qwen-omni-video-defaults' of github.com:huggingface/tra…

cc418bb

…nsformers into qwen-omni-video-defaults

zucchini-nlp merged commit 63c6331 into main Apr 23, 2025
12 checks passed

zucchini-nlp deleted the qwen-omni-video-defaults branch April 23, 2025 15:08
