Handle audio/ video default arguments in processor's apply_chat_template #37687

eustlb · 2025-04-22T17:16:24Z

What does this PR do?

For multimodal models, it is often required to pass kwargs indicating how the passed inputs have been sampled for the Processor: fps for video inputs, sampling_rate for audio inputs.

Such values can be set as default values for in ModelProcessorKwargs, nevertheless they are not passed to the processor's __call__ method when using apply_chat_template, resulting in silent errors.

For example in Qwen2.5-VL:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "example.mp4"},
            {"type": "text", "text": "What happened in the video?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    video_fps=1,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
)

→ images will be sampled at 1 fps

Yet such value (fps=1) is not passed in kwargs in:

out = self(
    text=prompt,
    images=batch_images if batch_images else None,
    videos=batch_videos if batch_videos else None,
    audio=batch_audios if batch_audios else None,
    **kwargs,
)

making that the default value fps=2 is used in the processor's __call__ method.

This is also the case for audio processors that need to know at which sampling_rate the audio has been sampled.

Fix attempt

I provide an attempt to fix by initialising such values to the specified default value in the ModelProcessorKwargs if provided, and passing this value to the Processors __call__ kwargs.

cc @zucchini-nlp and @molbap since we talked offline about this

github-actions · 2025-04-22T17:16:38Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

HuggingFaceDocBuilderDev · 2025-04-22T17:30:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

pcuenca

Thanks @zucchini-nlp for the ping, I was starting to test a simpler solution just for Qwen, by forwarding video_fps from the subclass implementation of apply_chat_template. I like that this is more general and works for any video / audio models, very cool!

I think we should add "fps": 2.0 to Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs. This way loading the model will still use the default (2 fps), even if the user does not provide a video_fps kwarg in their call to apply_chat_template. If we don't do this, the processor will still receive 2.0, but self._load_video_for_model won't, and it will read all the individual frames.

zucchini-nlp

Thanks for the PR but I am not sure I got it right. IIUC we want to pass the fps/sampling_rate used to sample frames/audio to the processor's call as well, because qwen uses it to further process videos. In that case I don't fully understand why we need to modify kwargs used for in load_video. Might have missed something

Also, I would like us to deprecate on of the kwargs, probably it will the make changes in this PR minimal. WDYT on keeping only fps and deprecating video_fps?

fps": 2.0 to Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs

+1 on this comment

zucchini-nlp · 2025-04-23T13:44:58Z

src/transformers/processing_utils.py

+                # handle the two naming conventions for fps: video_fps in load kwargs and fps in processor kwargs
+                processor_key = key if key != "video_fps" else "fps"
+                default_value = processor_defaults_kwargs.get(processor_key, getattr(kwarg_type_defaults, key, None))


Ugh, totally missed this one, better if we align in naming for future models and use either video_fps or fps. Since we already added qwen with fps, imo we can converge under that name and deprecate video_fps in this PR

zucchini-nlp · 2025-04-23T13:46:57Z

src/transformers/processing_utils.py

+        # Get the kwargs type annotation from __call__, if any
+        typing_processor_kwargs_class = self.__call__.__annotations__.get("kwargs")
+        processor_defaults_kwargs = {}
+
+        # Retrieve processor default kwargs
+        if typing_processor_kwargs_class:
+            processor_kwargs_class = typing_processor_kwargs_class.__args__[0]
+            processor_defaults = getattr(processor_kwargs_class, "_defaults", {})
+
+            # This combines all default values from different categories (like text_kwargs, images_kwargs, etc.)
+            processor_defaults_kwargs = {k: v for values in processor_defaults.values() for k, v in values.items()}
+


this is getting way too overengineered imo and it's not this PR's fault. Prob it works for a while but a longer term solution I had is to use dataclasses or similar to split out kwargs per type

Couldn't get time to draft it though

+1, this is too complicated 😅 I considered an attempt to reuse _merge_kwargs (which hides a lot of similarly complex logic), but it was not obvious at all.

A short term option would be to just special-case the missing params (fps and sampling_rate) instead of doing it in a general way.

yeah, same, but _merge_kwargs wasn't much generalizable unless rewritten 🥲

Yep, I wanted to go for a general approach but it's indeed overengineered. Since it is only relevant for fps and sampling_rate, I am gonna special-case if. Thanks both for the feedback!

zucchini-nlp · 2025-04-23T13:49:59Z

src/transformers/processing_utils.py

                                fps=mm_load_kwargs.get("video_fps", None),
                                backend=mm_load_kwargs["video_load_backend"],
-                                **kwargs,
+                                **{k: v for k, v in kwargs.items() if k != "fps"},


afaik qwen doesn't have a special load_for_model defined, why do we need this?

zucchini-nlp · 2025-04-24T08:59:27Z

And also, I am planning to move the whole video sampling into video_processors after it is merged. There are some ideas on how it will look like with SmolVLM (the most complicated sampling logic) in this PR. I will make this idea open to discussion after it's unblocked.

Until then we can start by deprecating similar kwargs and passing sampling args further to model as well

pcuenca · 2025-04-24T13:14:06Z

I don't fully understand why we need to modify kwargs used for in load_video.

I think it's because, if the user provides a value for fps, we want to apply it during video loading so the correct number of frames is retrieved. Otherwise, it would load all frames (current behaviour; undesired, imo), or the new default we set in Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs, which could be different to the value supplied by the user.

pcuenca · 2025-04-24T13:15:25Z

And also, I am planning to move the whole video sampling into video_processors after it is merged. There are some ideas on how it will look like with SmolVLM (the most complicated sampling logic) in this PR. I will make this idea open to discussion after it's unblocked.

Nice! Feel free to ping when you are ready, happy to test drive.

zucchini-nlp · 2025-04-24T13:24:47Z

if the user provides a value for fps, we want to apply it during video loading so the correct number of frames is retrieved

Oh right, haven't thought of it. Making fps the only correct arg and deprecating video_fps solves the issue then, I think

eustlb · 2025-04-25T17:08:53Z

Yep perfect, handling the deprecation of video_fps in this PR!

eustlb added 2 commits April 22, 2025 18:11

handle default kwargs

4cff439

handle video_fps

7e90ed8

eustlb requested a review from zucchini-nlp April 22, 2025 17:16

github-actions bot marked this pull request as draft April 22, 2025 17:16

eustlb marked this pull request as ready for review April 22, 2025 17:16

Merge branch 'main' into chat-template-pass-proc-required-kwargs

4ee36c0

zucchini-nlp mentioned this pull request Apr 23, 2025

Qwen 2.5 Omni: apply video defaults #37660

Merged

pcuenca reviewed Apr 23, 2025

View reviewed changes

zucchini-nlp reviewed Apr 24, 2025

View reviewed changes

Handle audio/ video default arguments in processor's apply_chat_template #37687

Are you sure you want to change the base?

Handle audio/ video default arguments in processor's apply_chat_template #37687

Conversation

eustlb commented Apr 22, 2025

What does this PR do?

Fix attempt

Uh oh!

github-actions bot commented Apr 22, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Apr 22, 2025

Uh oh!

pcuenca left a comment

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp left a comment

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp Apr 23, 2025

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp Apr 23, 2025

Choose a reason for hiding this comment

Uh oh!

pcuenca Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

pcuenca Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

eustlb Apr 25, 2025

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp Apr 23, 2025

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp commented Apr 24, 2025

Uh oh!

pcuenca commented Apr 24, 2025

Uh oh!

pcuenca commented Apr 24, 2025

Uh oh!

zucchini-nlp commented Apr 24, 2025

Uh oh!

eustlb commented Apr 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

eustlb commented Apr 25, 2025 •

edited

Loading