feat(MoE): Refactor cuda_graph_scope by buptzyb · Pull Request #1920 · NVIDIA/Megatron-LM

buptzyb · 2025-10-24T07:40:07Z

Dev branch PRs: #1917, #2353, and #2694.

With this PR, --cuda-graph-scope in --cuda-graph-impl=transformer_engine mode now supports combinations of the following six values:

attn: captures operations in TransformerLayer._forward_attention().
mlp: captures operations in TransformerLayer._forward_mlp() for a dense layer.
moe: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
moe_router: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). Note that if shared experts overlap is not used, it also captures the shared experts.
moe_preprocess: captures operations in MoELayer.preprocess(). Must be used together with moe_router.
mamba: captures the mamba layer.
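The constraints in the list above can be written as a small validation routine. This is an illustrative sketch only: `VALID_SCOPES` and `validate_cuda_graph_scope` are hypothetical names rather than Megatron-LM's actual implementation, and the sketch encodes only the single dependency stated above (moe_preprocess requires moe_router).

```python
# Hypothetical sketch, not Megatron-LM code: validate a list of scope values.
VALID_SCOPES = {"attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"}


def validate_cuda_graph_scope(scopes):
    """Return the scopes as a set, or raise ValueError if they are inconsistent."""
    scopes = list(scopes)
    unknown = [s for s in scopes if s not in VALID_SCOPES]
    if unknown:
        raise ValueError(f"unknown cuda graph scope(s): {unknown}")
    # The PR description states that moe_preprocess must accompany moe_router.
    if "moe_preprocess" in scopes and "moe_router" not in scopes:
        raise ValueError("moe_preprocess must be used together with moe_router")
    return set(scopes)


if __name__ == "__main__":
    print(sorted(validate_cuda_graph_scope(["attn", "moe_router", "moe_preprocess"])))
```

Any further restrictions the real argument parser may enforce are not covered here.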

Example 1:

For a dense model, set --cuda-graph-scope attn mlp to capture the whole Transformer layer, or set --cuda-graph-scope attn to capture the attention part, or set --cuda-graph-scope mlp to capture the mlp part. The non-graphed part will go to the normal pass.

Example 2:

For an MoE model, set --cuda-graph-scope attn moe_router moe_preprocess to capture operations from the beginning of the Transformer layer through the preprocess method of the MoE token dispatcher. However, if you are using the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set --cuda-graph-scope attn moe_router, which captures up to the MoE router, because the preprocess method then contains a CUDA synchronization. Finally, if you are using the alltoall dispatcher with drop-padding, you can set --cuda-graph-scope attn moe directly to capture the whole layer, as it is sync-free.
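The decision described in Example 2 can be summarized as a lookup. The sketch below is an assumption-laden illustration: `suggest_moe_scope`, its parameters, and the string values are hypothetical and do not mirror real Megatron-LM arguments. Where the description is silent (for instance, drop-padding combined with router-padding-for-fp8), the sketch picks the more conservative scope, which is a guess rather than documented behavior.

```python
# Hypothetical sketch of the Example 2 guidance; names are illustrative only.
def suggest_moe_scope(dispatcher, token_policy=None, router_padding_for_fp8=False):
    """Suggest --cuda-graph-scope values for an MoE model.

    dispatcher: e.g. "alltoall" (other values fall through to the default).
    token_policy: "drop-no-padding", "drop-padding", or None if unspecified.
    """
    if dispatcher == "alltoall":
        # These options imply a CUDA synchronization in preprocess,
        # so capture only up to the router.
        if token_policy == "drop-no-padding" or router_padding_for_fp8:
            return ["attn", "moe_router"]
        # Described as sync-free, so the whole MoE layer can be captured.
        if token_policy == "drop-padding":
            return ["attn", "moe"]
    # First recommendation in Example 2.
    return ["attn", "moe_router", "moe_preprocess"]


if __name__ == "__main__":
    print(" ".join(suggest_moe_scope("alltoall", "drop-padding")))
```

Configurations that the description does not mention fall through to the first recommendation, which may not be valid for every setup.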

Example 3:

For a model that contains both dense and moe layers like DeepSeek, the dense layer will check for "mlp" and the moe layer will check for "moe*". For example, setting --cuda-graph-scope attn mlp moe_router moe_preprocess captures the whole (attn+mlp) dense layer, and the attn+router+preprocess part of the moe layer.
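The per-layer lookup in Example 3 can be sketched as follows. `graphed_parts` and the `layer_kind` strings are hypothetical names used for illustration; the real check lives inside the layer classes and may be structured differently.

```python
# Hypothetical sketch: which parts of a layer are captured for a given scope list.
def graphed_parts(layer_kind, scopes):
    """Return the captured parts for a 'dense' or 'moe' Transformer layer."""
    scopes = set(scopes)
    parts = []
    if "attn" in scopes:
        parts.append("attn")
    if layer_kind == "dense":
        # Dense layers only look at "mlp".
        if "mlp" in scopes:
            parts.append("mlp")
    elif layer_kind == "moe":
        # MoE layers only look at the "moe*" values.
        if "moe" in scopes:
            parts.append("moe")
        else:
            for name in ("moe_router", "moe_preprocess"):
                if name in scopes:
                    parts.append(name)
    else:
        raise ValueError(f"unknown layer kind: {layer_kind!r}")
    return parts


if __name__ == "__main__":
    scope = ["attn", "mlp", "moe_router", "moe_preprocess"]
    print(graphed_parts("dense", scope))  # ['attn', 'mlp']
    print(graphed_parts("moe", scope))    # ['attn', 'moe_router', 'moe_preprocess']
```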

Example 4:

A Mamba model differs from a traditional TransformerLayer-based model: the mamba, attention, mlp, and moe parts all live in different layer objects. For a mamba+moe model, you can set --cuda-graph-scope mamba attn moe_router to capture the corresponding layers, or set --cuda-graph-scope attn moe_router if you don't want the mamba layers to be graphed.

copy-pr-bot · 2025-10-24T07:40:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Skylion007 · 2025-10-24T18:45:11Z

+                    valid_cudagraph_attrs
+                ), f"attr_outputs: {len(attr_outputs)} != {len(valid_cudagraph_attrs)}"
+                for i, attr_name in enumerate(valid_cudagraph_attrs):
+                    hier_attr_name = attr_name.split('.')


Suggested change

-                    hier_attr_name = attr_name.split('.')
+                    hier_attr_name = attr_name.rsplit('.', maxsplit=1)

buptzyb · 2025-10-26

I want it to be compatible with multiple levels of hierarchy. Your suggestion works for now because we currently have at most two levels, as in _comm_manager.token_probs. I expect that we may need to capture attributes at deeper levels in the future as the code becomes more complex.
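The difference between the two calls only shows up beyond two levels. The sketch below compares them and walks a dotted path with getattr. `set_nested_attr` and the path "a.b.c" are hypothetical; how the PR actually consumes hier_attr_name is not shown in this thread, so the traversal is an assumption.

```python
from functools import reduce
from types import SimpleNamespace


def set_nested_attr(root, dotted_name, value):
    """Set root.a.b.c = value for dotted_name 'a.b.c' (hypothetical helper)."""
    *parents, leaf = dotted_name.split('.')
    # Walk every intermediate attribute, however deep the path is.
    owner = reduce(getattr, parents, root)
    setattr(owner, leaf, value)


# With two levels, both calls give the same result.
name = "_comm_manager.token_probs"
print(name.split('.'))                # ['_comm_manager', 'token_probs']
print(name.rsplit('.', maxsplit=1))   # ['_comm_manager', 'token_probs']

# With three levels they diverge: rsplit keeps the parent path as one string.
deeper = "a.b.c"
print(deeper.split('.'))               # ['a', 'b', 'c']
print(deeper.rsplit('.', maxsplit=1))  # ['a.b', 'c']

layer = SimpleNamespace(_comm_manager=SimpleNamespace(token_probs=None))
set_nested_attr(layer, "_comm_manager.token_probs", 0.5)
print(layer._comm_manager.token_probs)  # 0.5
```

Note that a plain getattr(obj, 'a.b') does not resolve nested attributes, so the rsplit form would still need a traversal step for the parent path.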

buptzyb · 2025-10-27T10:05:39Z

/ok to test 3b85592

buptzyb · 2025-12-02T13:27:48Z

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? @kvareddy @santhnm2 could you help review? Thanks!

kvareddy · 2025-12-04T05:37:24Z

@lmcafee-nvidia @mathemakitten can you please sign off on this MR?

buptzyb · 2025-12-07T04:22:45Z

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? Thanks!

buptzyb · 2025-12-11T11:22:22Z

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? @yanring could you help review on behalf of mixture-of-experts-devtech? Thanks!

buptzyb · 2025-12-18T10:55:39Z

/ok to test 0698a59

buptzyb · 2025-12-18T11:51:38Z

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? @yanring could you help review on behalf of mixture-of-experts-devtech? Thanks!

Phlip79 · 2025-12-30T22:55:29Z

/ok to test 0698a59

buptzyb requested review from a team as code owners October 24, 2025 07:40

buptzyb mentioned this pull request Oct 24, 2025

[DEV] feat(MoE): Refactor cuda_graph_scope #1917

Merged

buptzyb self-assigned this Oct 24, 2025

buptzyb added the module: moe and Expert Review [deprecated] labels Oct 24, 2025

buptzyb added this to the Core 0.15 milestone Oct 24, 2025

Skylion007 reviewed Oct 24, 2025

ko3n1g approved these changes Oct 26, 2025

buptzyb force-pushed the robinz/refactor_cuda_graph_scope branch from 9b2c179 to 3b85592 October 27, 2025 10:04

copy-pr-bot Bot temporarily deployed to nemo-ci October 27, 2025 10:05 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci October 27, 2025 10:06 Inactive

copy-pr-bot Bot temporarily deployed to test October 27, 2025 10:06 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci October 27, 2025 10:15 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci October 27, 2025 10:30 Inactive

yanring mentioned this pull request Dec 3, 2025

[ROADMAP][Updated on April 07] Megatron Core MoE Roadmap #1729

Open

48 tasks

jiemingz reviewed Dec 4, 2025

Comment thread megatron/core/transformer/moe/moe_utils.py

sidsingh-nvidia approved these changes Dec 4, 2025

Add functools.wraps

f9857d2

Signed-off-by: Robin Zhang <robinz@nvidia.com>

buptzyb added 2 commits December 4, 2025 17:34

Merge branch 'main' into robinz/refactor_cuda_graph_scope

f975fd3

update test_fp8_param cudagraph ut

625e5b7

Signed-off-by: Robin Zhang <robinz@nvidia.com>

lmcafee-nvidia approved these changes Dec 5, 2025

mathemakitten approved these changes Dec 7, 2025

kvareddy approved these changes Dec 7, 2025

buptzyb added 3 commits December 7, 2025 20:05

Merge branch 'main' into robinz/refactor_cuda_graph_scope

14a9668

Merge branch 'main' into robinz/refactor_cuda_graph_scope

96953f4

Merge branch 'main' into robinz/refactor_cuda_graph_scope

4361b7f

buptzyb added 2 commits December 17, 2025 04:32

improve recompute checks

0ccb425

Signed-off-by: Robin Zhang <robinz@nvidia.com>

Merge branch 'main' into robinz/refactor_cuda_graph_scope

de638ad

buptzyb mentioned this pull request Dec 17, 2025

[Dev] TE cudagraph recompute #2694

Merged

6 tasks

buptzyb added 3 commits December 17, 2025 21:14

Merge branch 'main' into robinz/refactor_cuda_graph_scope

b7abdd7

fix backward compatibility

85583d0

Signed-off-by: Robin Zhang <robinz@nvidia.com>

disable hybridep ut

0698a59

Signed-off-by: Robin Zhang <robinz@nvidia.com>

yaoyu-33 approved these changes Dec 30, 2025

ericharper approved these changes Dec 30, 2025

duncanriach approved these changes Dec 30, 2025

tdene mentioned this pull request Dec 30, 2025

Reflect the changes made by #1920 in RL #2780

Merged

6 tasks

ananthsub mentioned this pull request Jan 29, 2026

[sync] Refactor cuda_graph_scope to use enums NVIDIA-NeMo/Megatron-Bridge#2121

Merged

5 tasks

yanring merged 29 commits into NVIDIA:main from buptzyb:robinz/refactor_cuda_graph_scope

16 participants
