Skip to content

Conversation

eustlb
Copy link
Contributor

@eustlb eustlb commented Apr 22, 2025

What does this PR do?

For multimodal models, it is often required to pass kwargs indicating how the passed inputs have been sampled for the Processor: fps for video inputs, sampling_rate for audio inputs.

Such values can be set as default values for in ModelProcessorKwargs, nevertheless they are not passed to the processor's __call__ method when using apply_chat_template, resulting in silent errors.

For example in Qwen2.5-VL:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "example.mp4"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    video_fps=1,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
) 

→ images will be sampled at 1 fps

Yet such value (fps=1) is not passed in kwargs in:

out = self(
    text=prompt,
    images=batch_images if batch_images else None,
    videos=batch_videos if batch_videos else None,
    audio=batch_audios if batch_audios else None,
    **kwargs,
)

making that the default value fps=2 is used in the processor's __call__ method.

This is also the case for audio processors that need to know at which sampling_rate the audio has been sampled.

Fix attempt

I provide an attempt to fix by initialising such values to the specified default value in the ModelProcessorKwargs if provided, and passing this value to the Processors __call__ kwargs.

cc @zucchini-nlp and @molbap since we talked offline about this

@eustlb eustlb requested a review from zucchini-nlp April 22, 2025 17:16
Copy link
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft April 22, 2025 17:16
@eustlb eustlb marked this pull request as ready for review April 22, 2025 17:16
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Member

@pcuenca pcuenca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zucchini-nlp for the ping, I was starting to test a simpler solution just for Qwen, by forwarding video_fps from the subclass implementation of apply_chat_template. I like that this is more general and works for any video / audio models, very cool!

I think we should add "fps": 2.0 to Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs. This way loading the model will still use the default (2 fps), even if the user does not provide a video_fps kwarg in their call to apply_chat_template. If we don't do this, the processor will still receive 2.0, but self._load_video_for_model won't, and it will read all the individual frames.

Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR but I am not sure I got it right. IIUC we want to pass the fps/sampling_rate used to sample frames/audio to the processor's call as well, because qwen uses it to further process videos. In that case I don't fully understand why we need to modify kwargs used for in load_video. Might have missed something

Also, I would like us to deprecate on of the kwargs, probably it will the make changes in this PR minimal. WDYT on keeping only fps and deprecating video_fps?

fps": 2.0 to Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs

+1 on this comment

Comment on lines +1451 to +1453
# handle the two naming conventions for fps: video_fps in load kwargs and fps in processor kwargs
processor_key = key if key != "video_fps" else "fps"
default_value = processor_defaults_kwargs.get(processor_key, getattr(kwarg_type_defaults, key, None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, totally missed this one, better if we align in naming for future models and use either video_fps or fps. Since we already added qwen with fps, imo we can converge under that name and deprecate video_fps in this PR

Comment on lines +1435 to +1446
# Get the kwargs type annotation from __call__, if any
typing_processor_kwargs_class = self.__call__.__annotations__.get("kwargs")
processor_defaults_kwargs = {}

# Retrieve processor default kwargs
if typing_processor_kwargs_class:
processor_kwargs_class = typing_processor_kwargs_class.__args__[0]
processor_defaults = getattr(processor_kwargs_class, "_defaults", {})

# This combines all default values from different categories (like text_kwargs, images_kwargs, etc.)
processor_defaults_kwargs = {k: v for values in processor_defaults.values() for k, v in values.items()}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is getting way too overengineered imo and it's not this PR's fault. Prob it works for a while but a longer term solution I had is to use dataclasses or similar to split out kwargs per type

Couldn't get time to draft it though

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this is too complicated 😅 I considered an attempt to reuse _merge_kwargs (which hides a lot of similarly complex logic), but it was not obvious at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A short term option would be to just special-case the missing params (fps and sampling_rate) instead of doing it in a general way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, same, but _merge_kwargs wasn't much generalizable unless rewritten 🥲

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I wanted to go for a general approach but it's indeed overengineered. Since it is only relevant for fps and sampling_rate, I am gonna special-case if. Thanks both for the feedback!

fps=mm_load_kwargs.get("video_fps", None),
backend=mm_load_kwargs["video_load_backend"],
**kwargs,
**{k: v for k, v in kwargs.items() if k != "fps"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik qwen doesn't have a special load_for_model defined, why do we need this?

@zucchini-nlp
Copy link
Member

And also, I am planning to move the whole video sampling into video_processors after it is merged. There are some ideas on how it will look like with SmolVLM (the most complicated sampling logic) in this PR. I will make this idea open to discussion after it's unblocked.

Until then we can start by deprecating similar kwargs and passing sampling args further to model as well

@pcuenca
Copy link
Member

pcuenca commented Apr 24, 2025

I don't fully understand why we need to modify kwargs used for in load_video.

I think it's because, if the user provides a value for fps, we want to apply it during video loading so the correct number of frames is retrieved. Otherwise, it would load all frames (current behaviour; undesired, imo), or the new default we set in Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs, which could be different to the value supplied by the user.

@pcuenca
Copy link
Member

pcuenca commented Apr 24, 2025

And also, I am planning to move the whole video sampling into video_processors after it is merged. There are some ideas on how it will look like with SmolVLM (the most complicated sampling logic) in this PR. I will make this idea open to discussion after it's unblocked.

Nice! Feel free to ping when you are ready, happy to test drive.

@zucchini-nlp
Copy link
Member

if the user provides a value for fps, we want to apply it during video loading so the correct number of frames is retrieved

Oh right, haven't thought of it. Making fps the only correct arg and deprecating video_fps solves the issue then, I think

@eustlb
Copy link
Contributor Author

eustlb commented Apr 25, 2025

Yep perfect, handling the deprecation of video_fps in this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants