[DEV] feat(MoE): Refactor cuda_graph_scope#1917
f9d33c5 to a37b63a
|
many checks/tests failing |
Perhaps you can construct MoE layers and test them like the other MoE UTs (e.g. test_moe_layer_discrepancy.py); you shouldn't need to construct a whole model. I don't recommend adding it to Kunlun's MR; it would be better in this MR or a separate MR.
|
/ok to test 7185dc5 |
Signed-off-by: Robin Zhang <robinz@nvidia.com>
|
/ok to test 41ab474 |
|
/ok to test 8c590f2 |
Signed-off-by: Robin Zhang <robinz@nvidia.com>
|
/ok to test 6bb7d06 |
Added in test_cuda_graphs.py |
yanring
left a comment
LGTM. Please also start a functional test pipeline on GitLab and ping me once it passes.
def setup_method(self, method):
    self.seq_length = 512
    self.micro_batch_size = 2
    os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = '1'
Please restore the env changes on exit.
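One way to act on this request is to save the previous value in `setup_method` and put it back in `teardown_method`. The following is a minimal sketch, not the code from this PR: the class name `EnvRestoringTest` and the `_saved_value` attribute are hypothetical, and only the environment handling is shown.

```python
import os


class EnvRestoringTest:
    """Hypothetical test class; only the env save/restore pattern matters here."""

    _ENV_KEY = 'CUDA_DEVICE_MAX_CONNECTIONS'

    def setup_method(self, method):
        # Remember the prior value (None means the variable was unset).
        self._saved_value = os.environ.get(self._ENV_KEY)
        os.environ[self._ENV_KEY] = '1'

    def teardown_method(self, method):
        # Restore the environment exactly as it was before the test.
        if self._saved_value is None:
            os.environ.pop(self._ENV_KEY, None)
        else:
            os.environ[self._ENV_KEY] = self._saved_value
```

In a pytest test function, the built-in `monkeypatch` fixture (`monkeypatch.setenv(...)`) would achieve the same effect and undo the change automatically.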
    args = parse_args()
    args.num_layers = 4
    args.mtp_num_layers = 1
    args.vocab_size = 128800
The vocab size is too large. UT + pytest has a memory leak issue, so please use a smaller model if possible.
pipeline passed https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/pipelines/37580916 @yanring |
|
/ok to test f7a3dcc |
@yanring, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test 4c92ae7 |
|
/ok to test 2037bf4 |
Main branch PR: #1920.
With this PR, `--cuda-graph-scope` in `--cuda-graph-impl=transformer_engine` mode now supports combinations of the six values:
- `attn`: captures operations in TransformerLayer._forward_attention().
- `mlp`: captures operations in TransformerLayer._forward_mlp() for a dense layer.
- `moe`: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
- `moe_router`: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). Note that if shared experts overlap is not used, it also captures the shared experts.
- `moe_preprocess`: captures operations in MoELayer.preprocess(). Must be used together with `moe_router`.
- `mamba`: captures the mamba layer.

For a dense model, set `--cuda-graph-scope attn mlp` to capture the whole Transformer layer, set `--cuda-graph-scope attn` to capture only the attention part, or set `--cuda-graph-scope mlp` to capture only the MLP part. The non-graphed part goes through the normal pass.

For a MoE model, set `--cuda-graph-scope attn moe_router moe_preprocess` to capture operations from the beginning of the Transformer layer to the `preprocess` method in the MoE token dispatcher. However, if you are using the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set `--cuda-graph-scope attn moe_router` to capture up to the MoE router, because there is a CUDA synchronization in the preprocess method. Finally, if you are using the alltoall dispatcher with drop-padding, you can directly set `--cuda-graph-scope attn moe` to capture the whole layer, because it is sync-free.

For a model that contains both dense and MoE layers, like DeepSeek, the dense layer checks for "mlp" and the MoE layer checks for "moe*". For example, setting `--cuda-graph-scope attn mlp moe_router moe_preprocess` captures the whole (attn+mlp) dense layer, and the attn+router+preprocess part of the MoE layer.

A Mamba model is different from a traditional TransformerLayer-based model: the mamba, attention, mlp, and moe parts are all in different layer objects. For a mamba+moe model, you can set `--cuda-graph-scope mamba attn moe_router` to capture the corresponding layers, or set `--cuda-graph-scope attn moe_router` if you don't want the mamba layers to be graphed.
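
The scope rules described above can be illustrated with a small stand-alone sketch. This is not the argument-validation code from this PR: `VALID_SCOPES`, `validate_cuda_graph_scope`, and `scopes_for_layer` are hypothetical names. It encodes only what the description states: the six allowed values, the requirement that `moe_preprocess` be combined with `moe_router`, and that a dense layer checks `mlp` while a MoE layer checks `moe*`.

```python
# Hypothetical helpers; names and structure are assumptions for illustration.
VALID_SCOPES = ("attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba")


def validate_cuda_graph_scope(scopes):
    """Reject unknown values and moe_preprocess without moe_router."""
    scopes = list(scopes)
    unknown = [s for s in scopes if s not in VALID_SCOPES]
    if unknown:
        raise ValueError(f"unknown cuda graph scope(s): {unknown}")
    if "moe_preprocess" in scopes and "moe_router" not in scopes:
        raise ValueError("moe_preprocess must be used together with moe_router")
    return scopes


def scopes_for_layer(scopes, is_moe):
    """Pick the scopes a Transformer layer looks at.

    A dense layer checks `mlp`; a MoE layer checks `moe*`; both check `attn`.
    """
    picked = []
    for s in validate_cuda_graph_scope(scopes):
        if s == "attn":
            picked.append(s)
        elif is_moe and s.startswith("moe"):
            picked.append(s)
        elif not is_moe and s == "mlp":
            picked.append(s)
    return picked
```

With the mixed dense/MoE example from the description, `scopes_for_layer(["attn", "mlp", "moe_router", "moe_preprocess"], is_moe=False)` selects `attn` and `mlp`, while `is_moe=True` selects `attn`, `moe_router`, and `moe_preprocess`.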