Handle audio/ video default arguments in processor's apply_chat_template #37687
base: main
Thanks @zucchini-nlp for the ping, I was starting to test a simpler solution just for Qwen, by forwarding `video_fps` from the subclass implementation of `apply_chat_template`. I like that this is more general and works for any video / audio models, very cool!
I think we should add `"fps": 2.0` to `Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs`. This way loading the model will still use the default (2 fps), even if the user does not provide a `video_fps` kwarg in their call to `apply_chat_template`. If we don't do this, the processor will still receive `2.0`, but `self._load_video_for_model` won't, and it will read all the individual frames.
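A minimal sketch of the suggestion above, using a stand-in class rather than the real transformers one. `ToyProcessorKwargs` and `default_video_fps` are hypothetical names introduced for illustration; only the `_defaults` / `videos_kwargs` / `fps` layout is taken from the comment.

```python
from typing import Any, Dict, Optional


class ToyProcessorKwargs:
    """Hypothetical stand-in for a model-specific ProcessorKwargs class."""

    _defaults: Dict[str, Dict[str, Any]] = {
        "text_kwargs": {"padding": False},
        # The suggested addition: a default sampling rate for videos.
        "videos_kwargs": {"fps": 2.0},
    }


def default_video_fps(kwargs_class: type, user_fps: Optional[float] = None) -> Optional[float]:
    """Return the user's fps if given, else the class default, else None."""
    if user_fps is not None:
        return user_fps
    defaults = getattr(kwargs_class, "_defaults", {})
    return defaults.get("videos_kwargs", {}).get("fps")


print(default_video_fps(ToyProcessorKwargs))       # falls back to the class default
print(default_video_fps(ToyProcessorKwargs, 1.0))  # user value takes precedence
```

With a default registered this way, both the loading step and the processor call can read the same value when the user passes nothing.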
Thanks for the PR but I am not sure I got it right. IIUC we want to pass the `fps`/`sampling_rate` used to sample frames/audio to the processor's call as well, because qwen uses it to further process videos. In that case I don't fully understand why we need to modify the `kwargs` used in `load_video`. Might have missed something.
Also, I would like us to deprecate one of the kwargs; it will probably make the changes in this PR minimal. WDYT on keeping only `fps` and deprecating `video_fps`?
> "fps": 2.0 to `Qwen2_5OmniProcessorKwargs._defaults.videos_kwargs`
+1 on this comment
# handle the two naming conventions for fps: video_fps in load kwargs and fps in processor kwargs
processor_key = key if key != "video_fps" else "fps"
default_value = processor_defaults_kwargs.get(processor_key, getattr(kwarg_type_defaults, key, None))
Ugh, totally missed this one, better if we align in naming for future models and use either `video_fps` or `fps`. Since we already added qwen with `fps`, imo we can converge under that name and deprecate `video_fps` in this PR.
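One way the proposed deprecation could look, as a rough sketch only: `normalize_fps_kwarg` is a hypothetical helper, not a transformers function, and the warning category and message are assumptions.

```python
import warnings
from typing import Any, Dict


def normalize_fps_kwarg(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `kwargs` where the legacy `video_fps` key is mapped to `fps`."""
    out = dict(kwargs)
    if "video_fps" in out:
        legacy = out.pop("video_fps")
        warnings.warn(
            "`video_fps` is deprecated, use `fps` instead.",
            FutureWarning,
            stacklevel=2,
        )
        # An explicitly passed `fps` wins over the legacy name.
        out.setdefault("fps", legacy)
    return out
```

Calling `normalize_fps_kwarg({"video_fps": 1.0})` would return `{"fps": 1.0}` and emit a `FutureWarning`, so downstream code only ever has to look for `fps`.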
# Get the kwargs type annotation from __call__, if any
typing_processor_kwargs_class = self.__call__.__annotations__.get("kwargs")
processor_defaults_kwargs = {}

# Retrieve processor default kwargs
if typing_processor_kwargs_class:
    processor_kwargs_class = typing_processor_kwargs_class.__args__[0]
    processor_defaults = getattr(processor_kwargs_class, "_defaults", {})

    # This combines all default values from different categories (like text_kwargs, images_kwargs, etc.)
    processor_defaults_kwargs = {k: v for values in processor_defaults.values() for k, v in values.items()}
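To make the flattening step in the snippet above concrete, here is a small standalone illustration. The dictionaries are made-up example data shaped like a `_defaults` mapping; the comprehension is the same one used in the diff.

```python
from typing import Any, Dict

# Hypothetical defaults, grouped per category as in a `_defaults` dict.
processor_defaults: Dict[str, Dict[str, Any]] = {
    "text_kwargs": {"padding": False},
    "videos_kwargs": {"fps": 2.0},
    "audio_kwargs": {"sampling_rate": 16000},
}

# Same comprehension as in the diff: merge every category into one flat dict.
flat = {k: v for values in processor_defaults.values() for k, v in values.items()}
print(flat)

# Caveat: if two categories share a key, the later category silently wins.
colliding = {"text_kwargs": {"padding": False}, "audio_kwargs": {"padding": True}}
flat_colliding = {k: v for values in colliding.values() for k, v in values.items()}
print(flat_colliding)
```

The collision case is one reason a general flattening approach can be fragile, since the category a value came from is lost after merging.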
this is getting way too overengineered imo and it's not this PR's fault. Prob it works for a while but a longer term solution I had is to use dataclasses or similar to split out kwargs per type
Couldn't get time to draft it though
+1, this is too complicated 😅 I considered an attempt to reuse `_merge_kwargs` (which hides a lot of similarly complex logic), but it was not obvious at all.
A short term option would be to just special-case the missing params (fps and sampling_rate) instead of doing it in a general way.
yeah, same, but `_merge_kwargs` wasn't very generalizable unless rewritten 🥲
Yep, I wanted to go for a general approach but it's indeed overengineered. Since it is only relevant for `fps` and `sampling_rate`, I am gonna special-case it. Thanks both for the feedback!
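A rough sketch of what special-casing the two parameters might look like, as opposed to the general flattening approach. `SAMPLING_KEYS` and `forward_sampling_kwargs` are hypothetical names, and the mapping of each key to a category is an assumption made for this example, not the code that ended up in the PR.

```python
from typing import Any, Dict

# Only these sampling parameters are forwarded; every other kwarg is left alone.
# The category each key is looked up in is an assumption for this sketch.
SAMPLING_KEYS = {"fps": "videos_kwargs", "sampling_rate": "audio_kwargs"}


def forward_sampling_kwargs(
    load_kwargs: Dict[str, Any],
    processor_defaults: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Collect fps / sampling_rate from user kwargs, falling back to processor defaults."""
    forwarded: Dict[str, Any] = {}
    for key, category in SAMPLING_KEYS.items():
        value = load_kwargs.get(key)
        if value is None:
            value = processor_defaults.get(category, {}).get(key)
        if value is not None:
            forwarded[key] = value
    return forwarded
```

The returned dict can then be merged into the kwargs passed to the processor's `__call__`, so the value used for loading and the value used for processing stay in sync.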
fps=mm_load_kwargs.get("video_fps", None),
backend=mm_load_kwargs["video_load_backend"],
**kwargs,
**{k: v for k, v in kwargs.items() if k != "fps"},
afaik qwen doesn't have a special `load_for_model` defined, why do we need this?
And also, I am planning to move the whole video sampling into Until then we can start by deprecating similar kwargs and passing sampling args further to model as well
I think it's because, if the user provides a value for
Nice! Feel free to ping when you are ready, happy to test drive.
Oh right, haven't thought of it. Making
Yep perfect, handling the deprecation of
What does this PR do?
For multimodal models, it is often required to pass kwargs indicating how the passed inputs have been sampled for the Processor:
- `fps` for video inputs,
- `sampling_rate` for audio inputs.

Such values can be set as default values in `ModelProcessorKwargs`; nevertheless, they are not passed to the processor's `__call__` method when using `apply_chat_template`, resulting in silent errors.

For example in `Qwen2.5-VL`:
→ images will be sampled at 1 fps

Yet this value (`fps=1`) is not passed in `kwargs` in:
so the default value `fps=2` is used in the processor's `__call__` method instead.

This is also the case for audio processors that need to know at which `sampling_rate` the audio has been sampled.

Fix attempt
I provide an attempt to fix this by initialising such values to the default specified in the `ModelProcessorKwargs`, if provided, and passing this value to the processor's `__call__` kwargs.

cc @zucchini-nlp and @molbap since we talked offline about this
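The silent mismatch described above can be illustrated with a toy example that does not use transformers at all. Everything here is hypothetical (`sample_frame_indices`, `frame_timestamps`, `DEFAULT_FPS`); it only shows how a processor that assumes its own default `fps` ends up with wrong timing information when the loading step used a different rate.

```python
from typing import List

DEFAULT_FPS = 2.0  # hypothetical processor default


def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> List[int]:
    """Pick frame indices so the clip is sampled at roughly `target_fps`."""
    step = native_fps / target_fps
    count = int(total_frames / step)
    return [int(i * step) for i in range(count)]


def frame_timestamps(num_frames: int, fps: float = DEFAULT_FPS) -> List[float]:
    """Timestamps (in seconds) that the processor assumes for each sampled frame."""
    return [i / fps for i in range(num_frames)]


# A 10 s clip recorded at 30 fps and loaded at 1 fps yields 10 frames.
frames = sample_frame_indices(total_frames=300, native_fps=30.0, target_fps=1.0)

wrong = frame_timestamps(len(frames))           # fps not forwarded: default 2.0 is assumed
right = frame_timestamps(len(frames), fps=1.0)  # fps forwarded from the loading step

print(wrong[-1], right[-1])
```

No exception is raised in the `wrong` case, which is why the problem is easy to miss: the frames are fine, but the timing the processor derives from them is off by a factor of two.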