[Hardware][AMD][Kernel] mori all2all backend integration #26013
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI. You can ask your reviewers to trigger select CI tests on top of fastcheck. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request integrates the mori all2all backend to enhance performance on ROCm platforms. The changes are extensive, touching communication, model execution layers, and configuration. My review focuses on the correctness and robustness of the new integration. I've identified a few critical issues related to state management, incorrect caching, and potential runtime errors that should be addressed to ensure the stability and correctness of this new feature.
In my environment, pre-commit hung... I will apply the pre-commit results and respond to any reviews or code-assist comments tomorrow.
Note that only low latency mode is available on mori (https://github.com/ROCm/mori) Co-authored-by: Inhyeok Bang <[email protected]> Co-authored-by: Dongmin Ra <[email protected]> Co-authored-by: Jimin Park <[email protected]> Co-authored-by: Geonwoo Choi <[email protected]> Signed-off-by: HakJu Kim <[email protected]>
caused by version difference Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: HakJu Kim <[email protected]> Signed-off-by: HakJu Kim <[email protected]>
accepted suggestion Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: HakJu Kim <[email protected]> Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
Force-pushed from 49a7cfe to 07095ff
Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
@mgoin
Signed-off-by: HakJu Kim <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: HakJu Kim <[email protected]>
| deepep_low_latency | batched | fp8 | G(128),A,T<sup>3</sup> | Y | Y | [`DeepEPHTPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ht_prepare_finalize.DeepEPHTPrepareAndFinalize] |
| flashinfer_all2allv | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferAllToAllMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferAllToAllMoEPrepareAndFinalize] |
| flashinfer<sup>4</sup> | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferCutlassMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferCutlassMoEPrepareAndFinalize] |
| flashinfer<sup>4</sup> | standard | nvfp4,fp8 | G,A,T | N | N | [`FlashInferCutlassMoEPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize.FlashInferCutlassMoEPrepareAndFinalize] |
Please remove this line (duplicate of prior line)
This is not part of this PR, and it's not related.
May I fix it in this PR?
If so, I will remove it.
5. This is a no-op dispatcher that can be used to pair with any modular experts to produce a modular kernel that runs w/o dispatch or combine. These cannot be selected via environment variable. These are generally used for testing or adapting an expert subclass to the `fused_experts` API.
6. This depends on the experts implementation.
7. Currently, MoRI supports low-latency mode only.
8. This depends on the experts implementation; currently mori supports aiter.
Is this meant as an explanation, or as a direct answer about fp8?
Yes, we integrated mori with the aiter MoE only.
We found it natural to follow the quant type and quant format of the ROCm aiter MoE, which is described in this doc at around line 103.
)

# Initialize mori shmem with the registered group
mori.shmem.shmem_torch_process_group_init(group_name)
Are 2 TP4 instances on the same node supported?
Good point! We missed that possibility.
If this does not work for two different instances on the same node, we will think about how to give each instance a unique group_name.
Fixed it so that multiple instances each use their own unique group name, which includes their PPID (parent process ID).
This works because all ranks in one instance share the same PPID.
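For illustration, a minimal sketch of the kind of PPID-based group naming described above; the helper name and name format are hypothetical, only os.getppid() and the shmem_torch_process_group_init call from the diff are taken from this thread.

```python
import os


def make_mori_group_name(base: str = "mori_shmem") -> str:
    """Build a per-instance group name (hypothetical helper, not mori's API)."""
    # All ranks of one vLLM instance are spawned by the same parent process,
    # so the PPID distinguishes e.g. two TP4 servers running on one node.
    return f"{base}_{os.getppid()}"


# The resulting name would then be registered with torch.distributed and passed
# to mori.shmem.shmem_torch_process_group_init(group_name), as in the diff above.
group_name = make_mori_group_name()
```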
except Exception as e:
    logger.error("[rank %s] mori shmem init failed: %s", self.rank, e)
    # Don't fail completely - mark as initialized to avoid retry loops
    self._shmem_initialized = True
Can we handle this differently instead of setting it to True?
@ihbang, would you take a look at this?
I was also worried about this handling.
Fixed it to raise an exception when shmem initialization fails.
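A rough sketch of the revised handling (raise instead of pretending the init succeeded); the class and helper names are hypothetical, while the logger message and _shmem_initialized flag follow the diff above.

```python
import logging

logger = logging.getLogger(__name__)


class MoriShmemStateSketch:
    """Toy stand-in for the all2all manager state discussed in this thread."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self._shmem_initialized = False

    def init_shmem(self, group_name: str) -> None:
        try:
            self._do_shmem_init(group_name)  # hypothetical wrapper around mori.shmem
            self._shmem_initialized = True
        except Exception as e:
            logger.error("[rank %s] mori shmem init failed: %s", self.rank, e)
            # Re-raise instead of marking the state as initialized, so a broken
            # shmem setup fails fast rather than silently misbehaving later.
            raise

    def _do_shmem_init(self, group_name: str) -> None:
        # Placeholder for mori.shmem.shmem_torch_process_group_init(group_name).
        raise NotImplementedError
```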
    data_type: torch.dtype = torch.bfloat16,
    quant_dtype: torch.dtype | None = None,
):
    import mori
Please use a specific import instead of importing the whole package.
Fixed (some other imports are also fixed).
| """Create mori EpDispatchCombineConfig""" | ||
| import mori.ops.dispatch_combine as mori_ops | ||
| from mori.ops.dispatch_combine import EpDispatchCombineKernelType | ||
|
|
Could we add checks for data_type and quant_dtype before proceeding?
A quant_dtype check is added.
A data_type check is not included because mori doesn't seem to have a specific dtype restriction.
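As an illustration, a sketch of what the added quant_dtype guard could look like; the function name and the exact set of accepted dtypes are assumptions, not the code that landed.

```python
import torch

from vllm.platforms import current_platform


def _check_mori_quant_dtype(quant_dtype: torch.dtype | None) -> None:
    # Hypothetical guard: accept either unquantized activations (None) or the
    # platform's fp8 dtype before building the mori EpDispatchCombineConfig.
    if quant_dtype is not None and quant_dtype != current_platform.fp8_dtype():
        raise ValueError(
            f"mori dispatch/combine does not support quant_dtype={quant_dtype}"
        )
```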
from contextlib import contextmanager
from typing import Any

from vllm.model_executor.layers.fused_moe.aiter_experts import AiterExperts
Please add a ROCm check.
A ROCm check is added.
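Purely as an illustration, one way such a ROCm gate could look around the import shown in the diff; whether the guard ended up at the import site or elsewhere is an assumption.

```python
from vllm.platforms import current_platform

if current_platform.is_rocm():
    # aiter (and therefore the aiter-backed experts) is only available on ROCm.
    from vllm.model_executor.layers.fused_moe.aiter_experts import AiterExperts
```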
@@ -0,0 +1,121 @@
# SPDX-License-Identifier: Apache-2.0
rename this file to aiter_mori_experts.py
This file is now renamed to aiter_mori_experts.py
(and the AiterExperts class is also renamed to AiterMoriExperts).
elif moe.use_mori_kernels:
    use_fp8_dispatch = (
        quant_config is not None
        and quant_config.quant_dtype == current_platform.fp8_dtype()
this works for gfx950 (OCP fp8), not gfx942
We only tested mori all2all with gfx942 (MI300X), but it works fine with fp8 fnuz.
    quant_type = QuantType.per_1x128
else:
    quant_type = QuantType.per_Token
Please add more options, e.g. .per_Tensor.
Fixed it to use per_Tensor instead of per_Token (same as rocm_aiter_fused_experts).
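A small sketch of the revised selection, assuming QuantType is the aiter enum referenced in the diff above; the block_quantized flag is illustrative only and stands in for however the block-quantized case is actually detected.

```python
from aiter import QuantType  # assumed import path for the enum used in the diff


def select_mori_quant_type(block_quantized: bool) -> QuantType:
    # Block-quantized (1x128) scales keep per_1x128; everything else now uses
    # per_Tensor, mirroring rocm_aiter_fused_experts as described above.
    if block_quantized:
        return QuantType.per_1x128
    return QuantType.per_Tensor
```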
# Check if input is already pre-quantized (from mori dispatch)
input_is_pre_quantized = (
    a1_scale is not None and hidden_states.dtype == torch.float8_e4m3fnuz
Please make this work for both the gfx942 and gfx950 fp8 types.
Fixed it to use current_platform.fp8_dtype() instead of torch.float8_e4m3fnuz.
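A minimal sketch of the platform-agnostic version of the check above; the variable names follow the diff, while the standalone function wrapper is just for illustration.

```python
import torch

from vllm.platforms import current_platform


def input_is_pre_quantized(
    hidden_states: torch.Tensor, a1_scale: torch.Tensor | None
) -> bool:
    # current_platform.fp8_dtype() resolves to the platform-appropriate fp8 type
    # (fnuz on gfx942, OCP fp8 on gfx950), so one check covers both.
    return a1_scale is not None and hidden_states.dtype == current_platform.fp8_dtype()
```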
vllm/utils/__init__.py
def has_mori() -> bool:
    """Whether the optional `mori` package is available."""

    return _has_module("mori")
Consider adding a platform check.
Fixed it to also check current_platform.is_rocm().
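A sketch of the revised helper combining the module probe with the platform gate mentioned above; _has_module is simplified here and may differ from the actual utility in vllm/utils.

```python
import importlib.util

from vllm.platforms import current_platform


def _has_module(name: str) -> bool:
    # Simplified stand-in for the existing helper used by has_mori().
    return importlib.util.find_spec(name) is not None


def has_mori() -> bool:
    """Whether the optional `mori` package is available (ROCm-only)."""
    # mori targets ROCm, so gate on the platform as well as on the import.
    return current_platform.is_rocm() and _has_module("mori")
```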
Signed-off-by: HakJu Kim <[email protected]>
this happened because of code conflict(semantically) Signed-off-by: HakJu Kim <[email protected]>
NOTE: we made progress on the CUDA graph problem (capture OK, but replay hangs). The Test Result section is updated.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: ihbang <[email protected]>
I added "json_config" to EpDispatchCombineConfig and to the dispatch/combine methods of EpDispatchCombineOp so that its configs can be changed easily.
Signed-off-by: ihbang <[email protected]>
Signed-off-by: ihbang <[email protected]>
@whitememory, could you fix the conflict? Thanks.
Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
Signed-off-by: HakJu Kim <[email protected]>
@HAIAI |
@HAIAI
Purpose
This PR integrates the mori (https://github.com/ROCm/mori) kernel as an all2all backend of vLLM to boost EP performance on ROCm. This is work by a team from Moreh Inc.
Note that only low-latency mode is available on mori (internal static buffer size).
To activate mori, install it from https://github.com/ROCm/mori and add export VLLM_ALL2ALL_BACKEND="mori" to the server startup script.
Test Plan
server
curl test
benchmark script
Test Result
Introduction
Since mori supports only low-latency mode, the mori backend must use chunked forward (repeated MoE). Prefill performance can therefore be worse than a2a backends that do not use chunked forward. (We set VLLM_MOE_DP_CHUNK_SIZE=1024, but the test ISL was 2000 per request.)
Thus, performance is measured using the ITL metric.
CUDA graph capture seems to have a problem at main 1405f0c, so we passed --enforce-eager as a server argument.
Performance result
mori
default (allgather_reducescatter)
naive
Etc tests
4-1. test at 0.11.0rc1
The curl test passed with all a2a backends tested on ROCm (naive, allgather_reducescatter, mori).
CUDA graph capture passed with all a2a backends tested on ROCm (allgather_reducescatter, mori) - the naive backend is expected to fail.
With this fix from mori, CUDA graph replay is also fine.
4-2. Test at main 1405f0c -> main at commit 1405f0c seems to have a problem.
The curl test DID NOT PASS with any of the a2a backends tested on ROCm (naive, allgather_reducescatter, mori).
CUDA graph capture DID NOT PASS with any of the a2a backends tested on ROCm (naive, allgather_reducescatter, mori) - the naive backend is expected to fail.
4-2-1.
In the curl test at main 1405f0c, the server appeared to respond to "You are a helpful assistant." rather than to "Who won the world series in 2020?".
The server at 0.11.0rc1 responded to "Who won the world series in 2020?" correctly.
Future works
5-1. Even at 0.11.0rc1, CUDA graph replay seems to have a problem: capture completes, but it hangs at replay. We will investigate this.
5-2. Integrate mori with DBO.
5-3. When mori HT mode is released, apply it to enhance prefill performance.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.