[DEV] feat(MoE): Refactor cuda_graph_scope #1917

Merged
buptzyb merged 10 commits into NVIDIA:dev from buptzyb:dev_cudagraph
Oct 31, 2025

Conversation

@buptzyb

@buptzyb buptzyb commented Oct 24, 2025

Contributor

Main branch PR: #1920.

With this PR, --cuda-graph-scope in --cuda-graph-impl=transformer_engine mode now supports combinations of the following six values:

  1. attn: captures operations in TransformerLayer._forward_attention().
  2. mlp: captures operations in TransformerLayer._forward_mlp() for a dense layer.
  3. moe: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
  4. moe_router: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). If shared-expert overlap is not used, this scope also captures the shared experts.
  5. moe_preprocess: captures operations in MoELayer.preprocess(). Must be used together with moe_router.
  6. mamba: captures the mamba layer.
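
The constraints in the list above can be sketched as a small validator. This is only an illustration: `VALID_SCOPES` and `validate_cuda_graph_scope` are hypothetical names, not the actual argument-checking code in this PR, and the only cross-value rule enforced is the one stated above (moe_preprocess requires moe_router).

```python
# Hypothetical sketch; names are illustrative and not taken from Megatron-LM.
VALID_SCOPES = ("attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba")


def validate_cuda_graph_scope(scopes):
    """Check a list of scope values against the rules stated in the PR description."""
    scopes = list(scopes)
    unknown = [s for s in scopes if s not in VALID_SCOPES]
    if unknown:
        raise ValueError(f"unknown cuda graph scope(s): {unknown}")
    # Rule from the description: moe_preprocess must be used together with moe_router.
    if "moe_preprocess" in scopes and "moe_router" not in scopes:
        raise ValueError("moe_preprocess must be used together with moe_router")
    return scopes
```

For instance, `validate_cuda_graph_scope(["attn", "moe_preprocess"])` would raise under this sketch because moe_router is missing.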
  • Example 1:

For a dense model, set --cuda-graph-scope attn mlp to capture the whole Transformer layer, --cuda-graph-scope attn to capture only the attention part, or --cuda-graph-scope mlp to capture only the mlp part. The non-graphed part runs through the normal pass.

  • Example 2:

For a MoE model, set --cuda-graph-scope attn moe_router moe_preprocess to capture operations from the beginning of the Transformer layer through the preprocess method of the MoE token dispatcher. However, if you use the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set --cuda-graph-scope attn moe_router to capture up to the MoE router, because the preprocess method then involves a CUDA synchronization. Finally, if you use the alltoall dispatcher with drop-padding, you can set --cuda-graph-scope attn moe directly to capture the whole layer, since that path is sync-free.
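
The decision described in Example 2 can be written down as a tiny helper. This is a sketch under assumptions: `suggest_moe_scope`, `drop_mode`, and `router_padding_for_fp8` are hypothetical names chosen to mirror the wording above, not real Megatron-LM arguments. When several options are combined, the sketch conservatively checks the options that introduce a synchronization first, which is also an assumption.

```python
# Hypothetical helper mirroring Example 2; not part of Megatron-LM.
def suggest_moe_scope(dispatcher, drop_mode=None, router_padding_for_fp8=False):
    """Return the widest --cuda-graph-scope that Example 2 describes for a MoE layer."""
    if dispatcher == "alltoall":
        # preprocess() synchronizes in these cases, so stop capturing at the router.
        if drop_mode == "drop-no-padding" or router_padding_for_fp8:
            return ["attn", "moe_router"]
        # Described as sync-free, so the whole MoE layer can be captured.
        if drop_mode == "drop-padding":
            return ["attn", "moe"]
    # Default case from Example 2: capture through the dispatcher's preprocess method.
    return ["attn", "moe_router", "moe_preprocess"]
```

The returned list corresponds to the values you would pass after --cuda-graph-scope on the command line.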

  • Example 3:

For a model that contains both dense and MoE layers, such as DeepSeek, each dense layer checks for "mlp" and each MoE layer checks for "moe*". For example, setting --cuda-graph-scope attn mlp moe_router moe_preprocess captures the whole (attn+mlp) dense layer, and the attn+router+preprocess part of the MoE layer.
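
A minimal sketch of the per-layer lookup described in Example 3, assuming each layer simply filters the shared scope list by its own type. `graphed_parts` and `is_moe_layer` are hypothetical names for illustration, not the PR's implementation.

```python
# Hypothetical sketch of how a layer could pick its parts out of the shared scope list.
def graphed_parts(is_moe_layer, scopes):
    """Return the scope values that apply to one layer, per Example 3."""
    parts = []
    if "attn" in scopes:
        parts.append("attn")
    if is_moe_layer:
        # MoE layers look at the "moe*" values and ignore "mlp".
        parts += [s for s in scopes if s.startswith("moe")]
    elif "mlp" in scopes:
        # Dense layers look at "mlp" and ignore the "moe*" values.
        parts.append("mlp")
    return parts


scopes = ["attn", "mlp", "moe_router", "moe_preprocess"]
dense_parts = graphed_parts(False, scopes)  # parts captured in a dense layer
moe_parts = graphed_parts(True, scopes)     # parts captured in a MoE layer
```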

  • Example 4:

A Mamba model differs from a traditional TransformerLayer-based model: the mamba, attention, mlp, and moe components all live in different layer objects. For a mamba+moe model, you can set --cuda-graph-scope mamba attn moe_router to capture the corresponding layers, or set --cuda-graph-scope attn moe_router if you don't want the mamba layers to be graphed.

@buptzyb buptzyb requested review from a team as code owners October 24, 2025 07:28
@copy-pr-bot

copy-pr-bot Bot commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@buptzyb buptzyb self-assigned this Oct 24, 2025
@buptzyb buptzyb added the module: moe and Expert Review labels Oct 24, 2025
@buptzyb buptzyb added this to the Core 0.15 milestone Oct 24, 2025
@buptzyb buptzyb requested review from yanring and yaox12 October 24, 2025 08:10
@pablo-garay

Contributor

many checks/tests failing

@yanring

yanring commented Oct 28, 2025

Contributor

Hey @buptzyb, is there any reason we didn’t add extra UTs and only included a functional test? UTs should cover all new features. e.g. test combinations like "moe_preprocess, moe_router"

Was about to add it in Kunlun's MR (this one), but Kunlun told me it's a dense model... Do you know where I can find a MoE model in the UTs? Is there any file I can refer to? Or do you think it's better that Kunlun's MR uses a MoE model instead?

Perhaps you can construct MoE layers and test them like other MoE UTs (e.g. test_moe_layer_discrepancy.py); you shouldn't need to construct a whole model.

I don't recommend adding it to Kunlun's MR; it would be better in this MR or a separate MR.

Comment thread megatron/training/arguments.py Outdated
Comment thread megatron/core/transformer/moe/moe_utils.py
Comment thread megatron/core/transformer/moe/moe_layer.py Outdated
Comment thread megatron/core/transformer/moe/moe_layer.py Outdated
@buptzyb

buptzyb commented Oct 29, 2025

Contributor Author

/ok to test 7185dc5

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
@buptzyb

buptzyb commented Oct 29, 2025

Contributor Author

/ok to test 41ab474

@buptzyb

buptzyb commented Oct 30, 2025

Contributor Author

/ok to test 8c590f2

Signed-off-by: Robin Zhang <robinz@nvidia.com>
@buptzyb

buptzyb commented Oct 30, 2025

Contributor Author

/ok to test 6bb7d06

@buptzyb

buptzyb commented Oct 30, 2025

Contributor Author

> Hey @buptzyb, is there any reason we didn’t add extra UTs and only included a functional test? UTs should cover all new features. e.g. test combinations like "moe_preprocess, moe_router"
>
> Was about to add it in Kunlun's MR (this one), but Kunlun told me it's a dense model... Do you know where I can find a MoE model in the UTs? Is there any file I can refer to? Or do you think it's better that Kunlun's MR uses a MoE model instead?
>
> Perhaps you can construct MoE layers and test them like other MoE UTs (e.g. test_moe_layer_discrepancy.py); you shouldn't need to construct a whole model.
>
> I don't recommend adding it to Kunlun's MR; it would be better in this MR or a separate MR.

Added in test_cuda_graphs.py

@yanring yanring left a comment

Contributor

LGTM. Please also start a functional test pipeline on GitLab and ping me once it passes.

def setup_method(self, method):
    self.seq_length = 512
    self.micro_batch_size = 2
    os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = '1'

Contributor

Please restore the env changes on exit.

Contributor Author

done
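
For reference, one common way to restore an environment variable after a test is to save the previous value in setup_method and put it back in teardown_method. The sketch below only illustrates that pattern; the class name and the `_saved_max_connections` attribute are hypothetical, and this is not necessarily how the change was made in this PR.

```python
import os


class EnvRestoringCase:
    """Illustrative pytest-style class that restores the env var it overrides."""

    ENV_KEY = "CUDA_DEVICE_MAX_CONNECTIONS"

    def setup_method(self, method):
        # Remember the previous value (None means the variable was unset).
        self._saved_max_connections = os.environ.get(self.ENV_KEY)
        os.environ[self.ENV_KEY] = "1"

    def teardown_method(self, method):
        # Restore the exact previous state, including "unset".
        if self._saved_max_connections is None:
            os.environ.pop(self.ENV_KEY, None)
        else:
            os.environ[self.ENV_KEY] = self._saved_max_connections
```

pytest calls setup_method before and teardown_method after each test method of a class, so the override stays confined to that test.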

args = parse_args()
args.num_layers = 4
args.mtp_num_layers = 1
args.vocab_size = 128800

Contributor

The vocab size is too large. UT + pytest has a memory leak issue, so please use a smaller model if possible.

Contributor Author

set to 1024

@buptzyb

buptzyb commented Oct 30, 2025

Contributor Author

> LGTM. Please also start a functional test pipeline on GitLab and ping me once it passes.

Pipeline passed: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/pipelines/37580916 @yanring

@jiemingz jiemingz left a comment

Contributor

LGTM!

@yanring

yanring commented Oct 31, 2025

Contributor

/ok to test f7a3dcc

@copy-pr-bot

copy-pr-bot Bot commented Oct 31, 2025

/ok to test f7a3dcc

@yanring, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@buptzyb

buptzyb commented Oct 31, 2025

Contributor Author

/ok to test 4c92ae7

@buptzyb

buptzyb commented Oct 31, 2025

Contributor Author

/ok to test 2037bf4

Labels

core_dev_r0.15.0, Expert Review, module: moe, Run tests

7 participants