[CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. #30125
base: main
Conversation
Code Review
This pull request is a good step towards refactoring the attention mechanisms and making the codebase more modular by introducing MMEncoderAttention as a CustomOp. The unification of the vision attention backend logic is also a welcome improvement.
I've found a critical bug in vllm/attention/layer.py where a variable was renamed but not all its usages were updated, which would cause a runtime error. I've also pointed out an instance of code duplication in the new mm_encoder_attention.py file that should be addressed to improve maintainability.
Once these issues are resolved, this PR will be a solid contribution to the project's architecture.
💡 Codex Review
vllm/vllm/model_executor/models/paddleocr_vl.py
Lines 612 to 616 in a995c14
```python
self.attn_backend = attn_backend
self.attn_backend, self.flash_attn_varlen_func = (
    maybe_get_vit_flash_attn_backend(
        self.attn_backend,
        attn_backend_override=attn_backend_override,
```
maybe_get_vit_flash_attn_backend now only accepts the backend and returns a single function, but the PaddleOCR vision attention still unpacks two return values and passes attn_backend_override. Instantiating this module will now raise TypeError: maybe_get_vit_flash_attn_backend() got an unexpected keyword argument 'attn_backend_override', preventing the model from loading.
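A minimal sketch of how that call site might be updated to match the new signature described above (assuming the single return value is the varlen flash-attention callable; this is an illustration, not the PR's actual fix):

```python
# Sketch: pass only the backend and unpack a single return value,
# per the new maybe_get_vit_flash_attn_backend signature described
# in the review comment above.
self.attn_backend = attn_backend
self.flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
    self.attn_backend
)
```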
This pull request has merge conflicts that must be resolved before it can be merged.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Isotr0py left a comment
Thanks for this effort! I left some initial comments and will look into this further tomorrow. PTAL!
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Isotr0py <[email protected]>
TODO: Models to check (both FA/SDPA)
(We really need UT for …)
I can help add related UT.
Thanks, let's add the test in a follow-up PR before we migrate the remaining models. Only a few models use Qwen2-style attention, with small variations, so manual validation is still fine for this PR :)
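As a sketch of what such a unit test might look like (illustrative names and shapes; a real test would exercise the actual `MMEncoderAttention` op and compare its FA and SDPA paths):

```python
# Illustrative backend-consistency test: check that a fused SDPA path
# matches a plain reference attention implementation on random inputs.
import torch
import torch.nn.functional as F


def reference_attention(q, k, v, scale):
    # Naive softmax attention, used as the ground truth.
    attn = (q @ k.transpose(-2, -1)) * scale
    return attn.softmax(dim=-1) @ v


def test_sdpa_matches_reference():
    torch.manual_seed(0)
    b, h, s, d = 2, 4, 16, 32
    q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
    out = F.scaled_dot_product_attention(q, k, v, scale=d**-0.5)
    ref = reference_attention(q, k, v, d**-0.5)
    torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```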
Signed-off-by: Isotr0py <[email protected]>
Purpose
To avoid maintaining a variety of modeling files in vllm-ascend, we propose to remove all files in the `models` dir in vllm-ascend. After this, the only thing a vLLM plugin needs to do when adding a new model is to register its custom device-specific OOT ops with vLLM. To achieve this, some refactoring is needed in both vLLM and vllm-ascend, such as extracting some general layers as `CustomOp`; find more details at vllm-project/vllm-ascend#4084.

Following #27919 and #27147, this PR unifies the logic for getting `vit_attn_backend` and extracts `MMEncoderAttention` as a `CustomOp`.

To be specific, the vision attention backend should only be checked and overridden in the platform-specific implementation; we should not override this logic anywhere else, such as in `model_executor/models/<model_name>.py`. In addition, I have moved the scattered forward dispatch logic into this `CustomOp` to avoid checking `current_platform` anywhere else.

To minimize the impact, I only replaced the backend of `QwenVisionAttention` with this `CustomOp`, and I have tested this PR on both Ascend A2 NPU and NVIDIA A100 GPU (TODO). I will modify the other modeling files and delete the old `MultiHeadAttention` in the future if this PR is merged.
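For illustration, here is a minimal sketch of the dispatch pattern this enables. The class body is a simplified assumption, not the PR's actual implementation; only `CustomOp`, its `register` decorator, and the `forward_*` dispatch hooks are existing vLLM concepts:

```python
# Simplified sketch of the CustomOp dispatch pattern described above.
# The op implements forward_native as a portable fallback; vLLM's
# CustomOp machinery routes calls to forward_cuda / forward_oot (the
# hook used by out-of-tree platforms such as vllm-ascend) on the
# right device, so model code never checks current_platform itself.
import torch
import torch.nn.functional as F

from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("mm_encoder_attention_sketch")
class MMEncoderAttentionSketch(CustomOp):
    def __init__(self, head_dim: int) -> None:
        super().__init__()
        self.scale = head_dim**-0.5

    def forward_native(
        self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor
    ) -> torch.Tensor:
        # (batch, heads, seq, head_dim) inputs; SDPA fallback path.
        return F.scaled_dot_product_attention(q, k, v, scale=self.scale)
```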
Test Plan
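For reference, a minimal offline-inference smoke test of the kind used to validate these models might look like the sketch below. The model name, prompt, and parameters are illustrative assumptions, not the exact script used here, and a full multimodal test would attach image inputs:

```python
# Illustrative smoke test: load a Qwen2.5-VL checkpoint and confirm
# that greedy generation produces output through the new attention op.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", max_model_len=4096)
outputs = llm.generate(
    "Describe a sunny day at the beach.",  # text-only probe; real runs attach images
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```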
Test Result
✅ Ascend A2 NPU
Run Qwen2.5-VL:
Output:
Run Qwen3-VL:
Output:
NVIDIA A100 GPU
TO BE DONE...
Essential Elements of an Effective PR Description Checklist
… `supported_models.md` and `examples` for a new model.
Signed-off-by: shen-shanshan <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: tjtanaa <[email protected]>