Conversation

@dongmin-ra (Contributor) commented Oct 15, 2025

Motivation

Fix possible result value error in internode and intranode EP

Technical Details

In this PR, the CUDA graph hang issue was resolved, but there are still parts where result errors can occur.

  • Issue 1: In the intranode kernel, the code that increments the crossDeviceBarrierFlag value was missing.
    • Because of this, running Mori on vLLM on a single node produces incorrect results.
  • Issue 2: In the internode kernel, a potential read-write data race can occur (a simplified sketch of the racy pattern is shown after this list).
    • Global thread 0 increments crossDeviceBarrierFlag by 1 at the end of the internode combine kernel, but there is no barrier before this.
    • Since other parts of the kernel read this value, thread 0 could increment it while another thread block is still reading it, leading to a race condition.
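
To make the race concrete, here is a minimal HIP-style sketch of the problematic pattern. It is illustrative only, not mori's actual kernel: crossDeviceBarrierFlag is the flag discussed in this PR, but the kernel name, peerArrivalFlags, and the spin-wait structure are simplified assumptions.

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative sketch of the racy pattern described in Issue 2 (not mori's actual kernel).
__global__ void combineRacy(uint32_t* crossDeviceBarrierFlag,
                            uint32_t* peerArrivalFlags, int numRanks) {
  // Each block waits for all peers to reach the current round; the expected
  // round is re-read from global memory on every check.
  if (threadIdx.x == 0) {
    for (int r = 0; r < numRanks; ++r) {
      while (((volatile uint32_t*)peerArrivalFlags)[r] <
             *(volatile uint32_t*)crossDeviceBarrierFlag) {
        /* spin */
      }
    }
  }
  __syncthreads();

  // ... combine work ...

  // Global thread 0 advances the flag for the next launch. There is no
  // barrier before this write, so a thread block that has not yet finished
  // its wait above may read the incremented value instead of the value this
  // launch started with: this is the read-write race described in Issue 2.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    *crossDeviceBarrierFlag = *crossDeviceBarrierFlag + 1;
  }
}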

The changes are as follows (a sketch of the fixed pattern is shown after this list):

  1. Incremented the crossDeviceBarrierFlag value in the intranode combine kernel.
  2. Modified the internode kernel to read crossDeviceBarrierFlag into a local variable at the start of the kernel.
    • This ensures that even if global thread 0 increments the value at the end of the kernel, other thread blocks cannot read the updated value prematurely.
    • Additionally, by using the pre-stored variable instead of reading memory each time, unnecessary memory accesses are eliminated.
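
Under the same illustrative assumptions as the sketch above (again, this is not the actual kernel code), the fixed pattern looks roughly like this:

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative sketch of the fixed pattern (not mori's actual kernel).
__global__ void combineFixed(uint32_t* crossDeviceBarrierFlag,
                             uint32_t* peerArrivalFlags, int numRanks) {
  // Read the flag into a register once, at the start of the kernel. All later
  // comparisons use this local copy, so the increment performed by global
  // thread 0 at the end of the kernel can never be observed by other blocks
  // of the same launch, and repeated global-memory reads are avoided.
  const uint32_t curFlag = *(volatile uint32_t*)crossDeviceBarrierFlag;

  if (threadIdx.x == 0) {
    for (int r = 0; r < numRanks; ++r) {
      while (((volatile uint32_t*)peerArrivalFlags)[r] < curFlag) {
        /* spin */
      }
    }
  }
  __syncthreads();

  // ... combine work ...

  // Advance the flag for the next launch; the intranode combine kernel now
  // performs the same increment, which addresses Issue 1.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    *crossDeviceBarrierFlag = curFlag + 1;
  }
}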

@jhchouuu (Collaborator)

Hi @dongmin-ra, thanks to the Moreh team!

About Issue 1: Did you run the test and see the incorrect result? Because for the intranode kernel we set call_reset=False, the reset operation will not be invoked. We will conduct further tests.
About Issue 2: LGTM, this will indeed enhance robustness.

FYI, we will be updating to a new kernel soon.

@dongmin-ra (Contributor, Author) commented Oct 17, 2025

> About Issue 1: Did you run the test and see the incorrect result? Because for the intranode kernel we set call_reset=False, the reset operation will not be invoked. We will conduct further tests.

Yes, I found that the response comes out incorrect when sending a curl request in the vLLM + Mori environment.
You can test this in vLLM as well. Here is the curl command I used for testing:

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "max_tokens": 200
}'

The response contained random output strings. (FYI, the vLLM server was launched with 8 DP ranks and EP enabled.)
I suspect that some ranks may have performed dispatch/combine at different layers.

  • For example, rank 0 might have performed dispatch/combine at MoE layer 3, while rank 1 did it at MoE layer 4.

When a curl request is sent to vLLM, only one DP rank handles the request while the others run a dummy batch with 1 token. This difference in token counts may cause different layers to be processed across DP ranks at the same time, leading to incorrect results.

I’ll create a unit test to reproduce this.

@dongmin-ra (Contributor, Author) commented Oct 23, 2025

@jhchouuu I created PR #92 to enhance the EP unit tests.
After that PR is merged, you can set the num_reps parameter to a value greater than 1 (e.g., 8) to reproduce the incorrect result.

@TianDi101 (Collaborator) commented Oct 27, 2025

@dongmin-ra Sorry for the late reply; the team's bandwidth has been saturated recently. We will start testing and reviewing the series of PRs raised by the Moreh team this week. Thanks again!

@jhchouuu (Collaborator)

@dongmin-ra Thanks for this commit. I ran further tests for EP8 in vLLM and also found similar issues; your fix helps us solve the problem. I will merge PRs #87 and #93 first, and #92 will be merged as soon as it is tested.
Thanks to the Moreh team again.

@jhchouuu merged commit d9a2b9d into ROCm:main on Oct 27, 2025