
Conversation

@dongmin-ra dongmin-ra commented Oct 2, 2025

Motivation

Fixed an invalid warpSize in host (.cpp) code.

Technical Details

Problem

  • Right after this PR, a GPU memory access fault occurred during internode EP execution.

Cause

  • The cause is that this PR changed the logic so that warpSize is set to 64 only when __GFX8__ or __GFX9__ is defined, and to 32 otherwise.
    • These macros are defined automatically by the compiler (amd clang++) when compiling device code.
    • However, when compiling host code (.cpp files), they are not defined.
  • As a result, warpSize is 64 inside kernel code but 32 in host code (e.g., dispatch_combine.cpp).
  • When launching the dispatch and combine kernels, the block dimension is set to warpSize * actualWarpNumPerBlock.
    • Since warpSize was incorrectly set to 32 in the host code, the block dimension ended up being half of the intended value, which caused the fault (see the sketch below).
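
A minimal sketch of the divergence, assuming the warp size is selected by a preprocessor check like the one described above; kWarpSize and the constant 16 are illustrative names and values, not necessarily mori's exact identifiers:

// Compile once as device code (macro defined) and once as a plain host .cpp
// (macro absent) to see the two different values.
#include <cstdio>

#if defined(__GFX8__) || defined(__GFX9__)
constexpr int kWarpSize = 64;   // device pass: macro defined by amd clang++
#else
constexpr int kWarpSize = 32;   // host-only .cpp pass: macro missing, wrong value
#endif

int main() {
    constexpr int actualWarpNumPerBlock = 16;
    // The dispatch/combine launch sizes its blocks as warpSize * actualWarpNumPerBlock.
    // Compiled as host code this gives 32 * 16 = 512 threads instead of the intended
    // 64 * 16 = 1024, so the kernel's per-warp indexing runs past its buffers.
    std::printf("host-computed block dim = %d\n", kWarpSize * actualWarpNumPerBlock);
    return 0;
}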

Fix

  • Explicitly define __GFX8__ or __GFX9__ in the CMake configuration based on the detected GPU architecture.
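
A rough sketch of that kind of change, assuming the target architecture is exposed through CMAKE_HIP_ARCHITECTURES; the actual variable names and CMakeLists.txt layout in this repository may differ:

# Pass the same family macro to host (.cpp) compilation that the device
# compiler defines implicitly, so host and device agree on warpSize.
if(CMAKE_HIP_ARCHITECTURES MATCHES "gfx9")
  add_compile_definitions(__GFX9__)
elseif(CMAKE_HIP_ARCHITECTURES MATCHES "gfx8")
  add_compile_definitions(__GFX8__)
endif()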

Test Plan

  1. Apply the following changes to the examples/ops/dispatch_combine/test_dispatch_combine_internode.py file
diff --git a/examples/ops/dispatch_combine/test_dispatch_combine_internode.py b/examples/ops/dispatch_combine/test_dispatch_combine_internode.py
index 55d4ef2..bcf87e8 100644
--- a/examples/ops/dispatch_combine/test_dispatch_combine_internode.py
+++ b/examples/ops/dispatch_combine/test_dispatch_combine_internode.py
@@ -45,7 +45,7 @@ class EpDispatchCombineTestCase:
             num_experts_per_rank=16,
             # num_experts_per_rank=256 // world_size,
             num_experts_per_token=8,
-            warp_num_per_block=16,
+            warp_num_per_block=1,
             block_num=64,
             max_token_type_size=2,
             kernel_type=mori.ops.EpDispatchCombineKernelType.InterNode,
  2. Execute the example script:
export MORI_DISABLE_P2P=1
torchrun --local-ranks-filter 0 \
                --role rank \
                --nnodes=1 \
                --node_rank=0 \
                --nproc_per_node=1 \
                --master_addr=127.0.0.1 \
                --master_port=1234 \
                examples/ops/dispatch_combine/test_dispatch_combine_internode.py --max-tokens 16

Test Result

  • Before modification: a memory access fault occurs.
Memory access fault by GPU node-3 (Agent handle: 0x820f5b0) on address 0x7efde4c00000. Reason: Unknown.
Memory access fault by GPU node-7 (Agent handle: 0xa625520) on address 0x7f6177600000. Reason: Unknown.
Memory access fault by GPU node-4 (Agent handle: 0x9df4740) on address 0x7f8066e00000. Reason: Unknown.
Memory access fault by GPU node-2 (Agent handle: 0x86d2ed0) on address 0x7ee328c00000. Reason: Unknown.
Memory access fault by GPU node-6 (Agent handle: 0x9f13da0) on address 0x7fae86a00000. Reason: Unknown.
Memory access fault by GPU node-8 (Agent handle: 0x9aa4c00) on address 0x7ef5c8c00000. Reason: Unknown.
Memory access fault by GPU node-9 (Agent handle: 0x976c110) on address 0x7f0b94c00000. Reason: Unknown.
Memory access fault by GPU node-5 (Agent handle: 0x8cb5700) on address 0x7fb4e7600000. Reason: Unknown.
  • After modification: no error occurs.

@hnts03-moreh hnts03-moreh marked this pull request as ready for review October 2, 2025 07:29

@hnts03-moreh hnts03-moreh left a comment

LGTM

@kyuhyeon-an kyuhyeon-an left a comment

👍
