
Conversation

@linfeng-yuan (Collaborator) commented Dec 5, 2025

What this PR does / why we need it?

Currently, the all_reduce operation in _sync_metadata_across_dp is performed with the gloo backend, which is extremely time-consuming when DPEngineCores are on different nodes. This cost cannot be hidden by async scheduling in multi-node scenarios with speculative decoding (e.g., EAGLE, MTP).

This PR eliminates the all_reduce operation for decode (D) nodes and changes the input parameters of the MoEDispatch & MoeCombine operators to make MC2EP support different num_tokens across all ranks.
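
For illustration, a minimal sketch of the two paths (not the actual vllm-ascend code; the function names, dp_group_cpu, and the ReduceOp choice are assumptions made for the example):

import torch
import torch.distributed as dist

def sync_dp_metadata(num_tokens: int, dp_size: int, dp_group_cpu) -> torch.Tensor:
    # Baseline path: every DP rank publishes its token count through a gloo
    # all_reduce on a CPU tensor; across nodes this blocks on the slow gloo
    # backend and the cost cannot be hidden by async scheduling.
    num_tokens_across_dp = torch.zeros(dp_size, dtype=torch.int32, device="cpu")
    num_tokens_across_dp[dist.get_rank(group=dp_group_cpu)] = num_tokens
    dist.all_reduce(num_tokens_across_dp, op=dist.ReduceOp.MAX, group=dp_group_cpu)
    return num_tokens_across_dp

def sync_dp_metadata_kv_consumer(num_tokens: int, dp_size: int) -> torch.Tensor:
    # Fast path on a D node: fill the tensor locally with this rank's own token
    # count, so no cross-node collective is issued; MC2 dispatch/combine then
    # tolerates a different num_tokens on every rank.
    return torch.tensor([num_tokens] * dp_size, dtype=torch.int32, device="cpu")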

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested with PD disaggregation (2P: DP2TP8EP16, 1D: DP8TP4EP32) scenarios with async scheduling enabled. This PR removes the cross-node all_reduce on the gloo backend and further reduces latency while preserving accuracy.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces performance optimizations for MoE models in a distributed setting, primarily for kv_consumer nodes. It achieves this by replacing a costly all_reduce operation, allowing different ranks to handle varying numbers of tokens, which should improve throughput. This is supported by pre-calculating a global batch size for MoE operators. While the changes are promising for performance, I've identified a couple of areas that need attention to ensure correctness and robustness. Specifically, there's a behavioral change in how random experts are selected for load balancing that might be a bug, and a critical assumption about the MoE communication method that could lead to runtime failures if not enforced.

Comment on lines 917 to 977
if self.is_kv_consumer and not self.in_profile_run:
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill

critical

This optimization to skip all_reduce for num_tokens is a great performance improvement. However, it relies on the assumption that the MoE communication method will be MC2, as noted in the comment. If a different communication method like AllGather is used (e.g., if num_tokens exceeds mc2_tokens_capacity), it will lead to a runtime failure because AllGather requires tensors of the same size across ranks.

To make this more robust, I suggest adding an assertion to ensure that MC2 is indeed the selected communication method when this optimization is active. This will prevent silent failures in unexpected scenarios.

Suggested change

Original:

if self.is_kv_consumer and not self.in_profile_run:
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill

Suggested:

if self.is_kv_consumer and not self.in_profile_run:
    assert self._select_moe_comm_method(num_tokens) == MoECommType.MC2, \
        "Skipping all_reduce for num_tokens is only supported with MC2 MoE communication."
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill

github-actions bot commented Dec 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by completing the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

github-actions bot commented Dec 6, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@linfeng-yuan force-pushed the replace_all_reduce branch 2 times, most recently from 2f7f987 to 39fe3f1 on December 9, 2025 06:00
@linfeng-yuan (Collaborator, Author) commented Dec 9, 2025

TODO List:

  1. Extract the tokens_capacity_per_rank and tokens_capacity calculation into utils.py and reuse it in NPUModelRunner; (do it later, see the sketch after this list)
  2. Add constraints of recompute_scheduler; (DONE)
  3. Delete mc2_mask; (do it later)
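
A hypothetical shape for the helper described in item 1; the real calculation in NPUModelRunner may differ, and the function name, parameters, and alignment value are all assumptions:

def compute_tokens_capacity(max_num_tokens_per_rank: int,
                            dp_size: int,
                            alignment: int = 128) -> tuple[int, int]:
    # Pad the per-rank token bound up to a fixed alignment, then derive the
    # global capacity reserved across the DP group so MC2 buffers stay
    # consistent even when actual num_tokens differs per rank.
    tokens_capacity_per_rank = -(-max_num_tokens_per_rank // alignment) * alignment
    tokens_capacity = tokens_capacity_per_rank * dp_size
    return tokens_capacity_per_rank, tokens_capacity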

Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: linfeng-yuan <[email protected]>
@rjg-lyh added the ready (read for review) and ready-for-test (start test by label for PR) labels on Dec 11, 2025
@linfeng-yuan (Collaborator, Author) commented

Full CI checks passed (https://github.com/vllm-project/vllm-ascend/actions/runs/20119492240/job/57736426413). Let me add a constraint on the recompute scheduler to decide whether to skip the all_reduce. Note that RecomputeScheduler is compatible with vLLM v0.12.0 after #4859 is merged.
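
A rough sketch of what such a guard could look like; the function name, parameters, and the exact conditions (in particular how the recompute-scheduler constraint and the MC2 check are queried) are assumptions, not real vllm-ascend APIs:

def can_skip_dp_metadata_all_reduce(is_kv_consumer: bool,
                                    in_profile_run: bool,
                                    recompute_constraint_ok: bool,
                                    moe_comm_is_mc2: bool) -> bool:
    # Take the fast path only on decode (kv_consumer) ranks outside profiling,
    # when the recompute-scheduler constraint holds and MC2 (which tolerates
    # per-rank num_tokens) is the selected MoE communication method.
    return (is_kv_consumer and not in_profile_run
            and recompute_constraint_ok and moe_comm_is_mc2)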

@linfeng-yuan removed the ready-for-test (start test by label for PR) and ready (read for review) labels on Dec 11, 2025
@linfeng-yuan changed the title from "[WIP][perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks" to "[perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks" on Dec 12, 2025
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

