@pcuenca pcuenca commented Apr 21, 2025

What does this PR do?

Applies video-specific min_pixels and max_pixels defaults to the video processor.

The values were taken from the original processing codebase, which uses different values for video than for images.

In our case, the image processor always fell back to the image defaults, so video frames were resized to very large sizes. This can cause OOMs and produces inputs with shapes the model never saw during training.
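To illustrate why the cap matters, here is a minimal sketch of pixel-budget resizing in the style used by Qwen2-VL-family processors. The helper name `resize_to_budget`, the rounding factor of 28, and the 2160x3840 input frame are assumptions for illustration; only the min_pixels and max_pixels values come from this PR.

```python
import math

VIDEO_MIN_PIXELS = 128 * 28 * 28  # video default applied by this PR
VIDEO_MAX_PIXELS = 768 * 28 * 28  # video default applied by this PR


def resize_to_budget(height, width, min_pixels, max_pixels, factor=28):
    """Hypothetical helper: choose a (height, width) that is a multiple of
    `factor` and whose area falls within [min_pixels, max_pixels]."""
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / scale / factor) * factor
        w_bar = math.floor(width / scale / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * scale / factor) * factor
        w_bar = math.ceil(width * scale / factor) * factor
    return h_bar, w_bar


# Assumed 4K frame; the PR does not state the source resolution.
capped = resize_to_budget(2160, 3840, VIDEO_MIN_PIXELS, VIDEO_MAX_PIXELS)
print(capped)  # (560, 1008)
```

With the video cap, each frame in this example shrinks to well under a tenth of its original area, which is what keeps memory use in check.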

Reproduction

Consider the following snippet:

from transformers import Qwen2_5OmniProcessor

model_id = "Qwen/Qwen2.5-Omni-7B"
processor = Qwen2_5OmniProcessor.from_pretrained(model_id)

conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"},
            {"type": "text", "text": "What can you hear and see in this video?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    load_audio_from_video=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_fps=2,

    # kwargs to be passed to `Qwen2_5OmniProcessor`
    padding=True,
    use_audio_in_video=True,
)

print(inputs["pixel_values_videos"].shape)
  • Before this PR: [886116, 1176]
  • After this PR: [60480, 1176]
  • Reference, using qwen_omni_utils.process_mm_info: [57600, 1176]

The remaining difference between this PR and the reference arises because the original codebase selects 40 frames for this video, while we select 41.
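The reported shapes can be checked with a little arithmetic. The sketch below assumes a Qwen2-VL-style patch layout (14x14 spatial patches, a temporal patch size of 2 with odd frame counts padded up, 3 channels) and per-frame patch grids of 40x72 after this PR and 154x274 before it. These are assumptions that happen to reproduce the numbers above, not values stated in this PR.

```python
import math

PATCH = 14          # assumed spatial patch size
TEMPORAL_PATCH = 2  # assumed number of frames merged per temporal patch
CHANNELS = 3


def expected_shape(num_frames, grid_h, grid_w):
    """Rows are one per (temporal, height, width) patch; columns are the
    flattened pixels of a single patch."""
    grid_t = math.ceil(num_frames / TEMPORAL_PATCH)
    rows = grid_t * grid_h * grid_w
    cols = CHANNELS * TEMPORAL_PATCH * PATCH * PATCH
    return [rows, cols]


print(expected_shape(41, 154, 274))  # before this PR: [886116, 1176]
print(expected_shape(41, 40, 72))    # after this PR:  [60480, 1176]
print(expected_shape(40, 40, 72))    # reference:      [57600, 1176]
```

Under these assumptions, the one extra frame rounds up to one extra temporal patch, which accounts for the 2880-row gap between this PR and the reference.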

Alternatives

  • Use different config values for image and video processing and persist them to preprocessor_config.json.

@github-actions github-actions bot marked this pull request as draft April 21, 2025 19:56

  "position_id_per_seconds": 25,
  "use_audio_in_video": False,
+ "min_pixels": 128 * 28 * 28,
+ "max_pixels": 768 * 28 * 28,
Member

Indeed, I noticed this too but didn't want to enforce it, since in their repo the value is dynamic and depends on video length. I agree this is better than nothing; a longer-term solution would be to add it in self.video_processor.
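For context, a length-dependent cap could look roughly like the sketch below: a fixed pixel budget for the whole clip is shared across frames and then clamped to the per-frame limits. The helper name `per_frame_max_pixels` and the total budget value are hypothetical and are not taken from this PR or from the original repo.

```python
VIDEO_MIN_PIXELS = 128 * 28 * 28  # per-frame floor from this PR
VIDEO_MAX_PIXELS = 768 * 28 * 28  # per-frame ceiling from this PR
# Illustrative budget for the whole clip; NOT a value from this PR.
TOTAL_PIXELS_BUDGET = 16384 * 28 * 28


def per_frame_max_pixels(num_frames, total_budget=TOTAL_PIXELS_BUDGET):
    """Hypothetical helper: longer videos get a smaller per-frame cap."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    share = total_budget // num_frames
    return max(VIDEO_MIN_PIXELS, min(VIDEO_MAX_PIXELS, share))


for n in (10, 64, 1000):
    print(n, per_frame_max_pixels(n))
```

Short clips hit the static ceiling, so the fixed defaults in this PR behave the same as a dynamic scheme there; the two would only diverge for long videos.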

  videos_inputs = self.image_processor(images=None, videos=videos, **output_kwargs["videos_kwargs"])
- if fps is None:
-     fps = [2.0] * len(videos)
+ fps = [fps] * len(videos)
@pcuenca pcuenca Apr 21, 2025

This is technically unrelated, but I don't think the input kwarg is expected as a list in this method.
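A minimal sketch of the distinction, using a hypothetical helper `per_video_fps`: the caller passes a single scalar fps, and the expansion to one value per video happens internally. Rejecting a list outright is a choice made for this illustration, not necessarily what the processor does.

```python
def per_video_fps(fps, num_videos, default=2.0):
    """Hypothetical helper: broadcast a scalar fps to one entry per video."""
    if fps is None:
        fps = default
    if isinstance(fps, (list, tuple)):
        # Illustration only: treat a list as a caller error.
        raise TypeError("fps is expected to be a scalar, not a list")
    return [float(fps)] * num_videos


print(per_video_fps(2, 3))     # [2.0, 2.0, 2.0]
print(per_video_fps(None, 2))  # [2.0, 2.0]
```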

@zucchini-nlp zucchini-nlp left a comment

Looks good to me, can merge after it's marked ready for review :)

  position_id_per_seconds = output_kwargs["videos_kwargs"].pop("position_id_per_seconds")
  use_audio_in_video = output_kwargs["videos_kwargs"].pop("use_audio_in_video")
- fps = output_kwargs["videos_kwargs"].pop("fps", None)
+ fps = output_kwargs["videos_kwargs"].pop("fps", 2.0)
Member

ah, then we can put it in video_kwargs.defaults. Missed this one

Member Author

Yes, but when I tested I think it was not being properly forwarded to all the places where it's needed. I'll take a quick look, otherwise we can merge this and handle the fps in another PR.

Member Author

I think that consistently using the video fps provided by the user, or defaulting to the value in video_kwargs.defaults, merits some additional discussion, so I'll open a new PR. We can merge this one in the meantime!

@zucchini-nlp zucchini-nlp Apr 23, 2025

Related to #37687 as well. Users should indeed be able to override the value. And the naming diverged without us noticing 😢
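The precedence being discussed can be sketched as a plain dictionary merge, where explicit user values win over defaults. The names `VIDEO_DEFAULTS` and `resolve_video_kwargs` are hypothetical, and placing fps in the defaults is the proposal from this thread rather than something this PR implements.

```python
VIDEO_DEFAULTS = {
    "fps": 2.0,  # proposed in the thread, not part of this PR
    "min_pixels": 128 * 28 * 28,
    "max_pixels": 768 * 28 * 28,
}


def resolve_video_kwargs(user_kwargs=None, defaults=VIDEO_DEFAULTS):
    """Hypothetical helper: start from defaults, let user values override."""
    resolved = dict(defaults)  # copy so the defaults are never mutated
    for key, value in (user_kwargs or {}).items():
        if value is not None:
            resolved[key] = value
    return resolved


print(resolve_video_kwargs({"fps": 4.0})["fps"])  # 4.0
print(resolve_video_kwargs()["fps"])              # 2.0
```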

@pcuenca pcuenca marked this pull request as ready for review April 23, 2025 13:39

pcuenca commented Apr 23, 2025

I updated the tests to account for the new default sizes, let me know if this is ok to merge @zucchini-nlp! We can continue fps handling in #37687.

@zucchini-nlp zucchini-nlp merged commit 63c6331 into main Apr 23, 2025
12 checks passed
@zucchini-nlp zucchini-nlp deleted the qwen-omni-video-defaults branch April 23, 2025 15:08

pcuenca commented Apr 28, 2025

Question for @LysandreJik: should this be cherry-picked into the preview release? (As well as #37687 when it's ready)

It's already there, sorry.

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* Apply video defaults for min_pixels and max_pixels

* fps kwarg should not be a list

* Update test to account for new resizing