feat(MoE): Refactor cuda_graph_scope #1920

Merged
yanring merged 29 commits into NVIDIA:main from buptzyb:robinz/refactor_cuda_graph_scope on Dec 30, 2025
buptzyb commented Oct 24, 2025

Contributor

Dev branch PRs: #1917, #2353, and #2694.

With this PR, --cuda-graph-scope in --cuda-graph-impl=transformer_engine mode now supports combinations of the following six values:

  1. attn: captures operations in TransformerLayer._forward_attention().
  2. mlp: captures operations in TransformerLayer._forward_mlp() for a dense layer.
  3. moe: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
  4. moe_router: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). If shared-expert overlap is not used, this scope also captures the shared experts.
  5. moe_preprocess: captures operations in MoELayer.preprocess(). Must be used together with moe_router.
  6. mamba: captures the mamba layer.
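
The constraint in item 5 can be checked mechanically. The following is a minimal sketch, not the code from this PR: `VALID_SCOPES` and `validate_scopes` are hypothetical names, and the only rules encoded are the ones stated in the list above.

```python
# Hypothetical helper; names are illustrative, not taken from Megatron-LM.
VALID_SCOPES = ("attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba")


def validate_scopes(scopes):
    """Return the scopes as a list after checking the rules listed above."""
    scopes = list(scopes)
    unknown = [s for s in scopes if s not in VALID_SCOPES]
    if unknown:
        raise ValueError(f"unknown cuda graph scope(s): {unknown}")
    # Rule from item 5: moe_preprocess must be used together with moe_router.
    if "moe_preprocess" in scopes and "moe_router" not in scopes:
        raise ValueError("moe_preprocess requires moe_router")
    return scopes
```

Calling `validate_scopes(["attn", "moe_router", "moe_preprocess"])` returns the list unchanged, while `validate_scopes(["moe_preprocess"])` raises `ValueError`.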
  • Example 1:

For a dense model, set --cuda-graph-scope attn mlp to capture the whole Transformer layer, set --cuda-graph-scope attn to capture only the attention part, or set --cuda-graph-scope mlp to capture only the MLP part. The non-graphed part runs in the normal, non-graphed forward pass.

  • Example 2:

For an MoE model, set --cuda-graph-scope attn moe_router moe_preprocess to capture operations from the beginning of the Transformer layer through the preprocess method of the MoE token dispatcher. However, if you use the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set --cuda-graph-scope attn moe_router, which captures up to the MoE router, because the preprocess method then contains a CUDA synchronization. Finally, if you use the alltoall dispatcher with drop-padding, you can set --cuda-graph-scope attn moe to capture the whole layer, because that path is sync-free.
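
The three cases in Example 2 amount to a small decision table. Here is a rough sketch of it, assuming made-up names throughout: `suggest_moe_scopes`, `drop_mode`, and its string labels are hypothetical and do not correspond to real Megatron-LM arguments. When both a sync-inducing option and drop-padding are set, the sketch conservatively stops at the router, which is an assumption and not something the description states.

```python
# Hypothetical helper: the function and argument names are illustrative only.
def suggest_moe_scopes(dispatcher, drop_mode=None, router_padding_for_fp8=False):
    """Pick a --cuda-graph-scope value list for an MoE layer.

    drop_mode is None, "drop_no_padding", or "drop_padding" (labels made up here).
    """
    if dispatcher == "alltoall":
        # preprocess() contains a CUDA synchronization in these two cases,
        # so capture has to stop at the router.
        if drop_mode == "drop_no_padding" or router_padding_for_fp8:
            return ["attn", "moe_router"]
        # Drop-padding is sync-free, so the whole MoE layer can be captured.
        if drop_mode == "drop_padding":
            return ["attn", "moe"]
    # Default case from the first sentence of Example 2.
    return ["attn", "moe_router", "moe_preprocess"]
```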

  • Example 3:

For a model that contains both dense and MoE layers, such as DeepSeek, the dense layers check for "mlp" and the MoE layers check for "moe*". For example, setting --cuda-graph-scope attn mlp moe_router moe_preprocess captures the whole (attn+mlp) dense layer, and the attn+router+preprocess part of the MoE layer.
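
The per-layer filtering in Example 3 can be pictured as follows. This is only a sketch of the idea: `graphed_parts` is a hypothetical function, and the real layer code may select its scopes differently.

```python
# Hypothetical illustration of how each layer type reads the shared scope list.
def graphed_parts(is_moe_layer, scopes):
    """Return the scope values a single Transformer layer would act on."""
    parts = [s for s in scopes if s == "attn"]
    if is_moe_layer:
        # MoE layers look only at the "moe*" values.
        parts += [s for s in scopes if s.startswith("moe")]
    else:
        # Dense layers look only at "mlp".
        parts += [s for s in scopes if s == "mlp"]
    return parts


scopes = ["attn", "mlp", "moe_router", "moe_preprocess"]
dense_parts = graphed_parts(False, scopes)  # ["attn", "mlp"]
moe_parts = graphed_parts(True, scopes)     # ["attn", "moe_router", "moe_preprocess"]
```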

  • Example 4:

A Mamba model differs from a traditional TransformerLayer-based model: the mamba, attention, MLP, and MoE parts all live in separate layer objects. For a mamba+moe model, you can set --cuda-graph-scope mamba attn moe_router to capture the corresponding layers, or set --cuda-graph-scope attn moe_router if you don't want the Mamba layers to be graphed.

buptzyb requested review from a team as code owners October 24, 2025 07:40
buptzyb self-assigned this Oct 24, 2025
buptzyb added the module: moe and Expert Review [deprecated] labels Oct 24, 2025
buptzyb added this to the Core 0.15 milestone Oct 24, 2025

    valid_cudagraph_attrs
), f"attr_outputs: {len(attr_outputs)} != {len(valid_cudagraph_attrs)}"
for i, attr_name in enumerate(valid_cudagraph_attrs):
    hier_attr_name = attr_name.split('.')

Contributor

Suggested change
- hier_attr_name = attr_name.split('.')
+ hier_attr_name = attr_name.rsplit('.', maxsplit=1)

Contributor Author

I want it to be compatible with multiple levels of hierarchy. Your suggestion works for now because we currently have at most two levels, as in _comm_manager.token_probs. I expect we may need to capture attributes at a deeper level in the future as the code becomes more complex.
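
The difference between the two variants shows up once a path has three components. The sketch below is illustrative only: `set_hier_attr` and the toy `layer` object are made up for this example, and only `_comm_manager.token_probs` is a name taken from the thread.

```python
from functools import reduce
from types import SimpleNamespace


def set_hier_attr(obj, attr_name, value):
    """Set obj.a.b.c = value given attr_name 'a.b.c' (any depth)."""
    *parents, leaf = attr_name.split(".")
    # Walk down every intermediate object, then set the final attribute.
    owner = reduce(getattr, parents, obj)
    setattr(owner, leaf, value)


# Toy object tree; "buffers.probs" is a hypothetical deeper attribute.
layer = SimpleNamespace(
    _comm_manager=SimpleNamespace(token_probs=None, buffers=SimpleNamespace(probs=None))
)

set_hier_attr(layer, "_comm_manager.token_probs", 1.0)
set_hier_attr(layer, "_comm_manager.buffers.probs", 2.0)

# rsplit with maxsplit=1 keeps the parent path as one string, which a single
# getattr() call cannot resolve once there are more than two levels.
two_level = "_comm_manager.token_probs".rsplit(".", maxsplit=1)
three_level = "_comm_manager.buffers.probs".rsplit(".", maxsplit=1)
```

With two levels, `rsplit` yields a parent name that `getattr` can resolve directly. With three levels, the parent part is `"_comm_manager.buffers"`, which is not a single attribute name, so a full `split('.')` plus a walk over the components is the variant that generalizes.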

buptzyb force-pushed the robinz/refactor_cuda_graph_scope branch from 9b2c179 to 3b85592 October 27, 2025 10:04
buptzyb commented Oct 27, 2025

Contributor Author

/ok to test 3b85592

buptzyb commented Dec 2, 2025

Contributor Author

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? @kvareddy @santhnm2 could you help review? Thanks!

Comment thread megatron/core/transformer/moe/moe_utils.py
Signed-off-by: Robin Zhang <robinz@nvidia.com>
kvareddy commented Dec 4, 2025

Contributor

@lmcafee-nvidia @mathemakitten can you please sign off on this MR?

buptzyb commented Dec 7, 2025

Contributor Author

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? Thanks!

buptzyb commented Dec 11, 2025

Contributor Author

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? @yanring could you help review on behalf of mixture-of-experts-devtech? Thanks!

buptzyb mentioned this pull request Dec 17, 2025 (6 tasks)
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
buptzyb commented Dec 18, 2025

Contributor Author

/ok to test 0698a59

buptzyb commented Dec 18, 2025

Contributor Author

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? @yanring could you help review on behalf of mixture-of-experts-devtech? Thanks!

Phlip79 commented Dec 30, 2025

Member

/ok to test 0698a59

Labels

complexity: high, dev2main: mbridge, Final Review, module: moe
