feat(MoE): Refactor cuda_graph_scope #1920
valid_cudagraph_attrs
), f"attr_outputs: {len(attr_outputs)} != {len(valid_cudagraph_attrs)}"
for i, attr_name in enumerate(valid_cudagraph_attrs):
    hier_attr_name = attr_name.split('.')
Suggested change:
- hier_attr_name = attr_name.split('.')
+ hier_attr_name = attr_name.rsplit('.', maxsplit=1)
I want it to be compatible with multiple levels of hierarchy. Your suggestion works for now because we currently have at most two levels, as in _comm_manager.token_probs. I expect we will need to capture attributes at deeper levels in the future as the code becomes more complex.
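To make the difference concrete, here is a minimal sketch, not the code in this PR. The `resolve_parent` helper, the `SimpleNamespace` object tree, and the three-level name `_comm_manager.inner.buf` are hypothetical and exist only for illustration. Splitting on every dot lets the lookup walk any depth, while `rsplit('.', maxsplit=1)` leaves a dotted parent path that a single `getattr` cannot resolve.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_parent(root, dotted_name):
    """Walk every component except the last; return (parent_object, leaf_name)."""
    *parents, leaf = dotted_name.split('.')
    # reduce applies getattr once per parent component, so any depth works.
    parent = reduce(getattr, parents, root)
    return parent, leaf


# Hypothetical object tree standing in for a layer with nested attributes.
layer = SimpleNamespace(
    _comm_manager=SimpleNamespace(
        token_probs=None,
        inner=SimpleNamespace(buf=None),  # hypothetical third level
    )
)

# Two levels: the case that exists today.
parent, leaf = resolve_parent(layer, "_comm_manager.token_probs")
setattr(parent, leaf, 1.0)

# Three levels: handled by the same code path.
parent, leaf = resolve_parent(layer, "_comm_manager.inner.buf")
setattr(parent, leaf, 2.0)

# With rsplit(maxsplit=1) the parent part is still a dotted string,
# so a single getattr on it fails for names deeper than two levels.
parent_path, leaf = "_comm_manager.inner.buf".rsplit('.', maxsplit=1)
try:
    getattr(layer, parent_path)
    rsplit_resolved = True
except AttributeError:
    rsplit_resolved = False
```

For two-level names both approaches behave the same; they only diverge once a name has three or more components.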
9b2c179 to 3b85592
/ok to test 3b85592
Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? @kvareddy @santhnm2 could you help review? Thanks!
Signed-off-by: Robin Zhang <robinz@nvidia.com>
@lmcafee-nvidia @mathemakitten can you please sign off on this MR?
Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? Thanks!
Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? @yanring could you help review on behalf of mixture-of-experts-devtech? Thanks!
/ok to test 0698a59
Dev branch PRs: #1917, #2353, #2694.
With this PR, `--cuda-graph-scope` in `--cuda-graph-impl=transformer_engine` mode now supports combinations of the six values:

- `attn`: captures operations in TransformerLayer._forward_attention().
- `mlp`: captures operations in TransformerLayer._forward_mlp() for a dense layer.
- `moe`: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
- `moe_router`: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). Note that if shared experts overlap is not used, it also captures the shared experts.
- `moe_preprocess`: captures operations in MoELayer.preprocess(). Must be used together with `moe_router`.
- `mamba`: captures the mamba layer.

For a dense model, set `--cuda-graph-scope attn mlp` to capture the whole Transformer layer, `--cuda-graph-scope attn` to capture only the attention part, or `--cuda-graph-scope mlp` to capture only the MLP part. The non-graphed part goes through the normal pass.

For a MoE model, set `--cuda-graph-scope attn moe_router moe_preprocess` to capture operations from the beginning of the Transformer layer through the `preprocess` method in the MoE token dispatcher. However, if you are using the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set `--cuda-graph-scope attn moe_router` to capture up to the MoE router, because there will be a CUDA synchronization in the preprocess method. Finally, if you are using the alltoall dispatcher with drop-padding, you can directly set `--cuda-graph-scope attn moe` to capture the whole layer, as it is sync-free.

For a model that contains both dense and MoE layers, like DeepSeek, the dense layers check for "mlp" and the MoE layers check for "moe*". For example, setting `--cuda-graph-scope attn mlp moe_router moe_preprocess` captures the whole (attn+mlp) dense layer, and the attn+router+preprocess part of the MoE layer.

A Mamba model is different from a traditional TransformerLayer-based model: the mamba, attention, mlp, and moe parts are all in different layer objects. For a mamba+moe model, you can set `--cuda-graph-scope mamba attn moe_router` to capture the corresponding layers, or `--cuda-graph-scope attn moe_router` if you don't want the mamba layers to be graphed.
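
The rules above can be sketched in a few lines of Python. This is an illustrative stand-in, not the code in this PR: the `parse_scopes` and `scopes_for_layer` helpers, the `argparse` setup, and the `"dense"`/`"moe"` layer-kind strings are assumptions made for the example. It covers only the constraint stated in the description (`moe_preprocess` requires `moe_router`) and the way a dense or MoE TransformerLayer picks out the scope values it checks for; mamba layers and dispatcher-specific restrictions are left out.

```python
import argparse

# The six scope values listed in the PR description.
VALID_SCOPES = ("attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba")


def parse_scopes(argv):
    """Parse a space-separated scope list and check the one stated dependency."""
    parser = argparse.ArgumentParser()
    # Hypothetical stand-in for the real argument definition.
    parser.add_argument(
        "--cuda-graph-scope", nargs="+", choices=VALID_SCOPES, default=[]
    )
    scopes = set(parser.parse_args(argv).cuda_graph_scope)
    if "moe_preprocess" in scopes and "moe_router" not in scopes:
        raise ValueError("moe_preprocess must be used together with moe_router")
    return scopes


def scopes_for_layer(scopes, layer_kind):
    """Return the scope values a TransformerLayer of the given kind looks at."""
    if layer_kind == "dense":
        # Dense layers check for "attn" and "mlp".
        return scopes & {"attn", "mlp"}
    if layer_kind == "moe":
        # MoE layers check for "attn" and the "moe*" values.
        return scopes & {"attn", "moe", "moe_router", "moe_preprocess"}
    raise ValueError(f"unknown layer kind: {layer_kind}")


# Mixed dense + MoE model, as in the DeepSeek example above.
mixed = parse_scopes(
    ["--cuda-graph-scope", "attn", "mlp", "moe_router", "moe_preprocess"]
)
dense_part = scopes_for_layer(mixed, "dense")
moe_part = scopes_for_layer(mixed, "moe")
```

With the mixed configuration, the dense layer sees `attn` and `mlp`, while the MoE layer sees `attn`, `moe_router`, and `moe_preprocess`, matching the example in the description.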