[DEV] feat(MoE): Refactor cuda_graph_scope by buptzyb · Pull Request #1917 · NVIDIA/Megatron-LM

buptzyb · 2025-10-24T07:28:31Z

Main branch PR: #1920.

With this PR, --cuda-graph-scope in --cuda-graph-impl=transformer_engine mode supports combinations of the following six values:

attn: captures operations in TransformerLayer._forward_attention().
mlp: captures operations in TransformerLayer._forward_mlp() for a dense layer.
moe: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
moe_router: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). Note that if shared experts overlap is not used, it also captures the shared experts.
moe_preprocess: captures operations in MoELayer.preprocess(). Must be used together with moe_router.
mamba: captures the mamba layer.
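
The constraints in the list above can be sketched as a small check. This is only an illustration: `validate_cuda_graph_scope` and `VALID_SCOPES` are hypothetical names, not functions from Megatron-LM, and the sketch encodes only the two rules stated in this description (the six allowed values, and moe_preprocess requiring moe_router).

```python
# Hypothetical sketch: names below are illustrative, not Megatron-LM APIs.
VALID_SCOPES = {"attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"}


def validate_cuda_graph_scope(scope):
    """Check a list of scope values against the rules stated in the PR description."""
    scope = list(scope)
    # Rule 1: only the six documented values are accepted.
    unknown = sorted(set(scope) - VALID_SCOPES)
    if unknown:
        raise ValueError(f"unknown cuda graph scope value(s): {unknown}")
    # Rule 2: moe_preprocess must be used together with moe_router.
    if "moe_preprocess" in scope and "moe_router" not in scope:
        raise ValueError("moe_preprocess must be used together with moe_router")
    return scope


if __name__ == "__main__":
    print(validate_cuda_graph_scope(["attn", "moe_router", "moe_preprocess"]))
```

Any further validation the real argument parser performs is not covered here.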

Example 1:

For a dense model, set --cuda-graph-scope attn mlp to capture the whole Transformer layer, --cuda-graph-scope attn to capture only the attention part, or --cuda-graph-scope mlp to capture only the MLP part. The non-graphed part runs through the normal (non-graphed) forward pass.

Example 2:

For an MoE model, set --cuda-graph-scope attn moe_router moe_preprocess to capture operations from the beginning of the Transformer layer through the preprocess method of the MoE token dispatcher. However, if you are using the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set --cuda-graph-scope attn moe_router, which captures up to the MoE router, because the preprocess method then contains a CUDA synchronization. Finally, if you are using the alltoall dispatcher with drop-padding, you can set --cuda-graph-scope attn moe directly to capture the whole layer, because that path is sync-free.
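
The decision in Example 2 can be written as a lookup. This is a sketch under assumptions: `suggest_moe_scope` is a hypothetical helper, and the `option` strings are just the labels used in the paragraph above, not actual command-line flags.

```python
# Hypothetical sketch of the Example 2 decision; not a Megatron-LM API.
def suggest_moe_scope(dispatcher, option=None):
    """Return the scope values Example 2 recommends for an MoE model.

    `option` uses the labels from the description:
    "drop-no-padding", "router-padding-for-fp8", or "drop-padding".
    """
    if dispatcher == "alltoall":
        if option in ("drop-no-padding", "router-padding-for-fp8"):
            # preprocess contains a CUDA synchronization, so stop at the router.
            return ["attn", "moe_router"]
        if option == "drop-padding":
            # Sync-free path: the whole layer can be captured.
            return ["attn", "moe"]
    # General recommendation from the first sentence of Example 2.
    return ["attn", "moe_router", "moe_preprocess"]


if __name__ == "__main__":
    print(suggest_moe_scope("alltoall", "drop-padding"))
```

Cases the description does not mention fall through to the general recommendation here, which is a choice of this sketch rather than something the PR states.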

Example 3:

For a model that contains both dense and moe layers like DeepSeek, the dense layer will check for "mlp" and the moe layer will check for "moe*". For example, setting --cuda-graph-scope attn mlp moe_router moe_preprocess captures the whole (attn+mlp) dense layer, and the attn+router+preprocess part of the moe layer.
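
A minimal sketch of how the two layer kinds read the same scope list, assuming a hypothetical `graphed_parts` helper (not part of Megatron-LM) that takes the layer kind as a plain string:

```python
# Hypothetical sketch: shows which scope values each layer kind looks at.
def graphed_parts(layer_kind, scope):
    """Return the scope values that apply to a dense or MoE Transformer layer."""
    scope = set(scope)
    parts = ["attn"] if "attn" in scope else []
    if layer_kind == "dense":
        # Dense layers check only for "mlp".
        if "mlp" in scope:
            parts.append("mlp")
    elif layer_kind == "moe":
        # MoE layers check only for the "moe*" values.
        parts += [s for s in ("moe", "moe_router", "moe_preprocess") if s in scope]
    else:
        raise ValueError(f"unsupported layer kind: {layer_kind}")
    return parts


if __name__ == "__main__":
    scope = ["attn", "mlp", "moe_router", "moe_preprocess"]
    print(graphed_parts("dense", scope))
    print(graphed_parts("moe", scope))
```

With the scope from Example 3, the dense layer picks up attn and mlp, while the MoE layer picks up attn, moe_router, and moe_preprocess.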

Example 4:

A Mamba model differs from a traditional TransformerLayer-based model: the mamba, attention, mlp, and moe components all live in different layer objects. For a mamba+moe model, you can set --cuda-graph-scope mamba attn moe_router to capture the corresponding layers. Alternatively, set --cuda-graph-scope attn moe_router if you don't want the mamba layers to be graphed.

copy-pr-bot · 2025-10-24T07:28:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

pablo-garay · 2025-10-27T04:24:28Z

many checks/tests failing

pablo-garay · 2025-10-27T04:24:40Z

many checks/tests failing

yanring · 2025-10-28T09:44:29Z

Hey @buptzyb, is there any reason we didn't add extra UTs and only included a functional test? UTs should cover all new features, e.g. test combinations like "moe_preprocess, moe_router".

Was about to add in Kunlun's MR (this one), but Kunlun told me it's a dense model... Do you know where I can find a MoE model in UT? Is there any file I can refer to? Or do you think it would be better for Kunlun's MR to use a MoE model instead?

Perhaps you can construct MoE layers and test them like other MoE UTs (e.g. test_moe_layer_discrepancy.py). You shouldn't need to construct a whole model.

I don't recommend adding it to Kunlun's MR; it would be better in this MR or a separate MR.

buptzyb · 2025-10-29T14:46:10Z

/ok to test 7185dc5

Signed-off-by: Robin Zhang <robinz@nvidia.com>

buptzyb · 2025-10-29T14:57:01Z

/ok to test 41ab474

buptzyb · 2025-10-30T04:39:03Z

/ok to test 8c590f2

Signed-off-by: Robin Zhang <robinz@nvidia.com>

buptzyb · 2025-10-30T04:45:26Z

/ok to test 6bb7d06

buptzyb · 2025-10-30T06:15:23Z


Added in test_cuda_graphs.py

yanring

LGTM. Please also start a functional test pipeline on GitLab and ping me once it passes.

yanring · 2025-10-30T07:27:41Z

+    def setup_method(self, method):
+        self.seq_length = 512
+        self.micro_batch_size = 2
+        os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = '1'


Please restore the env changes on exit.
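
One way to address this, sketched with pytest's `setup_method`/`teardown_method` hooks. The class name `TestEnvRestoreSketch` and the `_saved_max_connections` attribute are made up for illustration; the actual test class in the PR may restore the variable differently.

```python
import os


class TestEnvRestoreSketch:
    """Hypothetical test class: saves and restores CUDA_DEVICE_MAX_CONNECTIONS."""

    def setup_method(self, method):
        # Remember the previous value (None if the variable was unset).
        self._saved_max_connections = os.environ.get("CUDA_DEVICE_MAX_CONNECTIONS")
        os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

    def teardown_method(self, method):
        # Restore the environment exactly as it was before setup_method ran.
        if self._saved_max_connections is None:
            os.environ.pop("CUDA_DEVICE_MAX_CONNECTIONS", None)
        else:
            os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = self._saved_max_connections

    def test_env_is_set(self):
        assert os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] == "1"
```

pytest's built-in `monkeypatch` fixture (`monkeypatch.setenv`) is an alternative that undoes the change automatically at the end of each test.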

yanring · 2025-10-30T07:28:43Z

+        args = parse_args()
+        args.num_layers = 4
+        args.mtp_num_layers = 1
+        args.vocab_size = 128800


The vocab size is too large. UT + pytest has a memory leak issue, so please use a smaller model if possible.

Set to 1024.
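
For a rough sense of scale, the word-embedding table grows linearly with vocab size. The sketch below compares the two vocab sizes from this thread; `hidden_size = 256` is a hypothetical value chosen only for illustration, since the test's hidden size is not shown here.

```python
# Back-of-the-envelope comparison; hidden_size is a hypothetical placeholder.
def embedding_params(vocab_size, hidden_size):
    """Number of parameters in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


hidden_size = 256  # assumption, not taken from the PR
large = embedding_params(128800, hidden_size)
small = embedding_params(1024, hidden_size)

print(f"vocab 128800: {large:,} embedding parameters")
print(f"vocab 1024:   {small:,} embedding parameters")
print(f"ratio: {large / small:.1f}x")
```

The ratio does not depend on the hidden size: going from 128800 to 1024 shrinks the embedding table by roughly a factor of 126.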

buptzyb · 2025-10-30T13:23:16Z

LGTM. Please also start a functional test pipeline on GitLab and ping me once it passes.

pipeline passed https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/pipelines/37580916 @yanring

jiemingz

LGTM!

yanring · 2025-10-31T00:02:29Z

/ok to test f7a3dcc

copy-pr-bot · 2025-10-31T00:02:32Z

/ok to test f7a3dcc

@yanring, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

buptzyb · 2025-10-31T00:17:59Z

/ok to test 4c92ae7

buptzyb · 2025-10-31T09:14:06Z

/ok to test 2037bf4

buptzyb requested review from a team as code owners October 24, 2025 07:28

buptzyb mentioned this pull request Oct 24, 2025

feat(MoE): Refactor cuda_graph_scope #1920

Merged

buptzyb self-assigned this Oct 24, 2025

buptzyb added the module: moe and Expert Review [deprecated] labels Oct 24, 2025

buptzyb added this to the Core 0.15 milestone Oct 24, 2025

buptzyb requested review from yanring and yaox12 October 24, 2025 08:10

buptzyb force-pushed the dev_cudagraph branch from f9d33c5 to a37b63a Compare October 24, 2025 12:58


yanring reviewed Oct 28, 2025


Comment thread megatron/training/arguments.py Outdated

Comment thread megatron/core/transformer/moe/moe_utils.py

Comment thread megatron/core/transformer/moe/moe_layer.py Outdated

yanring reviewed Oct 28, 2025


Comment thread megatron/core/transformer/moe/moe_layer.py Outdated

zhongbozhu mentioned this pull request Oct 28, 2025

[MAIN][NVFP4] Support NVFP4 MOE with Proper Padding #1985

Merged

6 tasks

buptzyb added 7 commits October 29, 2025 07:53

refactore cuda graph scope

f7a3dcc

Signed-off-by: Robin Zhang <robinz@nvidia.com>

support hybridEP

23f243e

Signed-off-by: Robin Zhang <robinz@nvidia.com>

assert expandable_segments and NCCL_GRAPH_REGISTER

ae2ac3f

Signed-off-by: Robin Zhang <robinz@nvidia.com>

add cudagraph functional test

fee7a6e

Signed-off-by: Robin Zhang <robinz@nvidia.com>

refactor moe_layer

aff3cd5

Signed-off-by: Robin Zhang <robinz@nvidia.com>

refactor moe_layer again

ce67c01

Signed-off-by: Robin Zhang <robinz@nvidia.com>

update copyright

2271c13

Signed-off-by: Robin Zhang <robinz@nvidia.com>

update code style and add unit test

6bb7d06

Signed-off-by: Robin Zhang <robinz@nvidia.com>

yanring reviewed Oct 30, 2025


minor update

4c92ae7

jiemingz approved these changes Oct 30, 2025


yanring approved these changes Oct 31, 2025


copyright

2037bf4

buptzyb mentioned this pull request Nov 3, 2025

cherry-pick: Refactor cuda_graph_scope (#1917) into core_dev_r0.15.0 #2117

Merged

6 tasks

ananthsub mentioned this pull request Nov 6, 2025

Bridge instantiate_utils: drop unexpected config keys with warning NVIDIA-NeMo/Megatron-Bridge#1203

Merged

buptzyb mentioned this pull request Nov 26, 2025

[Dev] feat(MoE): Refactor cuda_graph_scope - part2 #2353

Merged

6 tasks
