Conversation

@dongmin-ra (Contributor) commented Oct 15, 2025

Motivation

Fix possible result value error in internode and intranode EP

Technical Details

In this PR, the CUDA graph hang issue was resolved, but there are still parts where result errors can occur.

  • Issue 1: In the intranode kernel, the code that increments the crossDeviceBarrierFlag value was missing.
    • Because of this, running Mori on vLLM on a single node produces incorrect results.
  • Issue 2: In the internode kernel, a potential read-write data race can occur (a simplified sketch of the racy pattern is shown after this list).
    • Global thread 0 increments crossDeviceBarrierFlag by 1 at the end of the internode combine kernel, but there is no barrier before this.
    • Since other parts of the kernel read this value, thread 0 could increment it while another thread block is still reading it, leading to a race condition.
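
To make the race concrete, here is a minimal HIP-style sketch of the problematic pattern. It is illustrative only, not mori's actual kernel: crossDeviceBarrierFlag is the flag discussed in this PR, but the kernel name, peerArrivalFlags, and the spin-wait structure are simplified assumptions.

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative sketch of the racy pattern described in Issue 2 (not mori's actual kernel).
__global__ void combineRacy(uint32_t* crossDeviceBarrierFlag,
                            uint32_t* peerArrivalFlags, int numRanks) {
  // Each block waits for all peers to reach the current round; the expected
  // round is re-read from global memory on every check.
  if (threadIdx.x == 0) {
    for (int r = 0; r < numRanks; ++r) {
      while (((volatile uint32_t*)peerArrivalFlags)[r] <
             *(volatile uint32_t*)crossDeviceBarrierFlag) {
        /* spin */
      }
    }
  }
  __syncthreads();

  // ... combine work ...

  // Global thread 0 advances the flag for the next launch. There is no
  // barrier before this write, so a thread block that has not yet finished
  // its wait above may read the incremented value instead of the value this
  // launch started with: this is the read-write race described in Issue 2.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    *crossDeviceBarrierFlag = *crossDeviceBarrierFlag + 1;
  }
}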

The changes are as follows (a sketch of the fixed pattern is shown after this list):

  1. Incremented the crossDeviceBarrierFlag value in the intranode combine kernel.
  2. Modified the internode kernel to read crossDeviceBarrierFlag into a local variable at the start of the kernel.
    • This ensures that even if global thread 0 increments the value at the end of the kernel, other thread blocks cannot read the updated value prematurely.
    • Additionally, by using the pre-stored variable instead of reading memory each time, unnecessary memory accesses are eliminated.
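
Under the same illustrative assumptions as the sketch above (again, this is not the actual kernel code), the fixed pattern looks roughly like this:

#include <hip/hip_runtime.h>
#include <cstdint>

// Illustrative sketch of the fixed pattern (not mori's actual kernel).
__global__ void combineFixed(uint32_t* crossDeviceBarrierFlag,
                             uint32_t* peerArrivalFlags, int numRanks) {
  // Read the flag into a register once, at the start of the kernel. All later
  // comparisons use this local copy, so the increment performed by global
  // thread 0 at the end of the kernel can never be observed by other blocks
  // of the same launch, and repeated global-memory reads are avoided.
  const uint32_t curFlag = *(volatile uint32_t*)crossDeviceBarrierFlag;

  if (threadIdx.x == 0) {
    for (int r = 0; r < numRanks; ++r) {
      while (((volatile uint32_t*)peerArrivalFlags)[r] < curFlag) {
        /* spin */
      }
    }
  }
  __syncthreads();

  // ... combine work ...

  // Advance the flag for the next launch; the intranode combine kernel now
  // performs the same increment, which addresses Issue 1.
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    *crossDeviceBarrierFlag = curFlag + 1;
  }
}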

@jhchouuu (Collaborator)

Hi @dongmin-ra, thanks to the Moreh team!

About Issue 1: Did you run the test and see the incorrect result? Because for the intranode kernel we set call_reset=False, the reset operation will not be invoked. We will conduct further tests.
About Issue 2: LGTM, this will indeed enhance robustness.

FYI, we will be updating to a new kernel soon.

@dongmin-ra (Contributor, Author) commented Oct 17, 2025

> About Issue 1: Did you run the test and see the incorrect result? Because for the intranode kernel we set call_reset=False, the reset operation will not be invoked. We will conduct further tests.

Yes, I found that the response comes out incorrect when sending a curl request in the vLLM + Mori environment.
You can test this in vLLM as well. Here is the curl command I used for testing:

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "max_tokens": 200
}'

The response contained random output strings. (FYI, the vLLM server was launched with 8 DP ranks and EP enabled.)
I suspect that some ranks may have performed dispatch/combine at different layers.

  • For example, rank 0 might have performed dispatch/combine at MoE layer 3, while rank 1 did it at MoE layer 4.

When a curl request is sent to vLLM, only one DP rank handles the request while the others run a dummy batch with 1 token. This difference in token counts may cause different layers to be processed across DP ranks at the same time, leading to incorrect results.

I’ll create a unit test to reproduce this.

@dongmin-ra (Contributor, Author) commented Oct 23, 2025

@jhchouuu I created PR #92 to enhance the EP unit tests.
After that PR is merged, you can set the num_reps parameter to a value greater than 1 (e.g., 8) to reproduce the incorrect result.

@TianDi101 (Collaborator) commented Oct 27, 2025

@dongmin-ra Sorry for the late reply; the team's bandwidth has been saturated recently. We will start testing and reviewing the series of PRs raised by the Moreh team this week. Thanks again!

@jhchouuu (Collaborator)

@dongmin-ra Thanks for this commit. I ran further tests for EP8 in vLLM and also found similar issues; your fix helps us solve the problem. I will merge PRs #87 and #93 first, and #92 will be merged as soon as it is tested.
Thanks to the Moreh team again.

@jhchouuu merged commit d9a2b9d into ROCm:main on Oct 27, 2025