
Conversation

@linfeng-yuan (Collaborator) commented Dec 5, 2025

What this PR does / why we need it?

Currently, the all_reduce operation in _sync_metadata_across_dp is performed with the gloo backend, which is extremely time-consuming when DPEngineCores are on different nodes. This cost cannot be hidden by async scheduling in multi-node scenarios with speculative decoding (e.g., EAGLE, MTP).

This PR eliminates the all_reduce operation for decode (D) nodes and changes the input parameters of the MoEDispatch & MoeCombine operators to make MC2EP support different num_tokens across all ranks.
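
For illustration, a minimal sketch of the two paths (not the actual vllm-ascend code; the function names, dp_group_cpu, and the ReduceOp choice are assumptions made for the example):

import torch
import torch.distributed as dist

def sync_dp_metadata(num_tokens: int, dp_size: int, dp_group_cpu) -> torch.Tensor:
    # Baseline path: every DP rank publishes its token count through a gloo
    # all_reduce on a CPU tensor; across nodes this blocks on the slow gloo
    # backend and the cost cannot be hidden by async scheduling.
    num_tokens_across_dp = torch.zeros(dp_size, dtype=torch.int32, device="cpu")
    num_tokens_across_dp[dist.get_rank(group=dp_group_cpu)] = num_tokens
    dist.all_reduce(num_tokens_across_dp, op=dist.ReduceOp.MAX, group=dp_group_cpu)
    return num_tokens_across_dp

def sync_dp_metadata_kv_consumer(num_tokens: int, dp_size: int) -> torch.Tensor:
    # Fast path on a D node: fill the tensor locally with this rank's own token
    # count, so no cross-node collective is issued; MC2 dispatch/combine then
    # tolerates a different num_tokens on every rank.
    return torch.tensor([num_tokens] * dp_size, dtype=torch.int32, device="cpu")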

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested with PD disaggregation (2P: DP2TP8EP16, 1D: DP8TP4EP32) scenarios with async scheduling enabled. This PR removes the cross-node all_reduce on the gloo backend and further reduces latency while preserving accuracy.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces performance optimizations for MoE models in a distributed setting, primarily for kv_consumer nodes. It achieves this by replacing a costly all_reduce operation, allowing different ranks to handle varying numbers of tokens, which should improve throughput. This is supported by pre-calculating a global batch size for MoE operators. While the changes are promising for performance, I've identified a couple of areas that need attention to ensure correctness and robustness. Specifically, there's a behavioral change in how random experts are selected for load balancing that might be a bug, and a critical assumption about the MoE communication method that could lead to runtime failures if not enforced.

Comment on lines 917 to 977
if self.is_kv_consumer and not self.in_profile_run:
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill

critical

This optimization to skip all_reduce for num_tokens is a great performance improvement. However, it relies on the assumption that the MoE communication method will be MC2, as noted in the comment. If a different communication method like AllGather is used (e.g., if num_tokens exceeds mc2_tokens_capacity), it will lead to a runtime failure because AllGather requires tensors of the same size across ranks.

To make this more robust, I suggest adding an assertion to ensure that MC2 is indeed the selected communication method when this optimization is active. This will prevent silent failures in unexpected scenarios.

Suggested change

Original:

if self.is_kv_consumer and not self.in_profile_run:
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill

Suggested:

if self.is_kv_consumer and not self.in_profile_run:
    assert self._select_moe_comm_method(num_tokens) == MoECommType.MC2, \
        "Skipping all_reduce for num_tokens is only supported with MC2 MoE communication."
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill

github-actions bot commented Dec 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by completing the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

github-actions bot commented Dec 6, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@linfeng-yuan force-pushed the replace_all_reduce branch 2 times, most recently from 2f7f987 to 39fe3f1 on December 9, 2025 06:00
@linfeng-yuan (Collaborator, Author) commented Dec 9, 2025

TODO List:

  1. Extract the tokens_capacity_per_rank and tokens_capacity calculation into utils.py and reuse it in NPUModelRunner; (do it later, see the sketch after this list)
  2. Add constraints of recompute_scheduler; (DONE)
  3. Delete mc2_mask; (do it later)
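
A hypothetical shape for the helper described in item 1; the real calculation in NPUModelRunner may differ, and the function name, parameters, and alignment value are all assumptions:

def compute_tokens_capacity(max_num_tokens_per_rank: int,
                            dp_size: int,
                            alignment: int = 128) -> tuple[int, int]:
    # Pad the per-rank token bound up to a fixed alignment, then derive the
    # global capacity reserved across the DP group so MC2 buffers stay
    # consistent even when actual num_tokens differs per rank.
    tokens_capacity_per_rank = -(-max_num_tokens_per_rank // alignment) * alignment
    tokens_capacity = tokens_capacity_per_rank * dp_size
    return tokens_capacity_per_rank, tokens_capacity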

Signed-off-by: linfeng-yuan <[email protected]>
Signed-off-by: linfeng-yuan <[email protected]>
@rjg-lyh added the ready (read for review) and ready-for-test (start test by label for PR) labels on Dec 11, 2025
@linfeng-yuan (Collaborator, Author) commented

Full CI checks passed (https://github.com/vllm-project/vllm-ascend/actions/runs/20119492240/job/57736426413). Let me add a constraint on the recompute scheduler to decide whether to skip the all_reduce. Note that RecomputeScheduler is compatible with vLLM v0.12.0 after #4859 is merged.
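
A rough sketch of what such a guard could look like; the function name, parameters, and the exact conditions (in particular how the recompute-scheduler constraint and the MC2 check are queried) are assumptions, not real vllm-ascend APIs:

def can_skip_dp_metadata_all_reduce(is_kv_consumer: bool,
                                    in_profile_run: bool,
                                    recompute_constraint_ok: bool,
                                    moe_comm_is_mc2: bool) -> bool:
    # Take the fast path only on decode (kv_consumer) ranks outside profiling,
    # when the recompute-scheduler constraint holds and MC2 (which tolerates
    # per-rank num_tokens) is the selected MoE communication method.
    return (is_kv_consumer and not in_profile_run
            and recompute_constraint_ok and moe_comm_is_mc2)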

@linfeng-yuan removed the ready-for-test (start test by label for PR) and ready (read for review) labels on Dec 11, 2025
@linfeng-yuan changed the title from "[WIP][perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks" to "[perf] replace all_reduce for kv_consumer and support different num_tokens among all ranks" on Dec 12, 2025
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

