[Bugfix] Disable the dispatch_ffn_combine kernel in MTP path #4751
Conversation
Code Review
This pull request introduces a bugfix that disables the fused MoE kernel during the dummy_run of the MTP (Multi-Token Prediction) proposer. This is accomplished by checking whether the selected MoE communication method is FUSED_ALLTOALL and reverting to the standard ALLTOALL method if it is. The change is localized and specifically targets the dummy_run, which is crucial for graph capture. The modification correctly addresses a likely bug with the fused kernel in this context, and the implementation is sound. No issues were found in the proposed changes.
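A minimal sketch of the fallback described above. The enum and function names here (`MoECommType`, `select_moe_comm_type`, `in_mtp_path`) are illustrative assumptions, not the repository's actual identifiers:

```python
from enum import Enum, auto


class MoECommType(Enum):
    """Illustrative stand-in for the project's MoE communication enum."""
    ALLTOALL = auto()
    FUSED_ALLTOALL = auto()  # the fused dispatch_ffn_combine kernel


def select_moe_comm_type(preferred: MoECommType,
                         in_mtp_path: bool) -> MoECommType:
    """Revert to the plain all-to-all method on the MTP path.

    The fused kernel is disabled there because the MTP proposer's
    dummy_run drives graph capture, where the fused path misbehaves.
    """
    if in_mtp_path and preferred is MoECommType.FUSED_ALLTOALL:
        return MoECommType.ALLTOALL
    return preferred
```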
👋 Hi! Thank you for contributing to the vLLM Ascend project. To speed up your PR merge: if CI fails, you can run the linting and testing checks locally by following Contributing and Testing.
[Bugfix] Disable the dispatch_ffn_combine kernel in MTP path (vllm-project#4751)
- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e
Signed-off-by: mojave2 <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
What this PR does / why we need it?
This PR fixes a smoke test failure. It adjusts mtp_proposer and model_runner_v1 to route MTP decoding through the non-fused MoE implementation while keeping the overall inference flow unchanged.
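For illustration only, reusing the hypothetical `select_moe_comm_type` sketch from the review comment above, the intended routing would behave like this:

```python
preferred = MoECommType.FUSED_ALLTOALL

# The regular model runner keeps the fused kernel ...
assert select_moe_comm_type(preferred, in_mtp_path=False) is \
    MoECommType.FUSED_ALLTOALL

# ... while the MTP path falls back to the standard all-to-all method.
assert select_moe_comm_type(preferred, in_mtp_path=True) is \
    MoECommType.ALLTOALL
```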
Does this PR introduce any user-facing change?
How was this patch tested?
This PR will be verified by the smoke tests.