[Dev] feat(MoE): Refactor cuda_graph_scope - part2#2353
Conversation
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Signed-off-by: Robin Zhang <robinz@nvidia.com>
|
/ok to test df48cc1 |
|
/ok to test e3292c1 |
There was a problem hiding this comment.
LGTM, here are some AI-generated comments for you to review and adjust as needed:
megatron/core/transformer/cuda_graphs.py
Line 1715-1736:
The
delete_cuda_graphs()method looks useful! Suggestions:
- Add a warning log when the TE version is too low to actually delete graphs
- Consider calling
torch.cuda.empty_cache()after deleting graphs to release GPU memory
Line 1219:
The enumerated
_variable is unused. Consider usingfor layers in self.callables_per_chunk:directly.
megatron/core/transformer/moe/token_dispatcher.py
Line 1078-1082:
What is the intent of this change? Please add a comment explaining why these values need to be retained in
drop_and_padmode. This is important for understanding CUDA graph compatibility.
megatron/core/transformer/transformer_layer.py
Line 624-627:
👍 Good fix for the missing NVTX marker! Consider adding a brief comment explaining why the pop needs to happen early in this code path.
megatron/training/arguments.py
Line 1495:
The lambda function is quite complex. Consider extracting it into a named function for better readability and testability. Also consider handling case-insensitive inputs.
def parse_cuda_graph_scope(scope):
if not isinstance(scope, str):
return scope
scope_lower = scope.lower()
if scope_lower == "full":
return "full"
try:
return CudaGraphScope[scope]
except KeyError:
valid_scopes = [s.name for s in CudaGraphScope] + ["full"]
raise argparse.ArgumentTypeError(
f"Invalid cuda_graph_scope '{scope}'. Valid values are: {valid_scopes}"
)megatron/core/transformer/transformer_config.py
Line 1593-1614:
Good backward compatibility handling! Suggestions:
- Consider extracting this logic into a separate helper function for better maintainability
- Add unit tests covering various input formats (string, list, enum, comma-separated string, etc.)
megatron/core/transformer/enums.py
Line 67-80:
Consider adding docstrings for each enum value to clarify their usage:
class CudaGraphScope(enum.Enum): """Cuda Graph Scope - defines which parts of the model to capture.""" full_iteration = 1 # Captures the entire training/inference iteration attn = 2 # Captures attention operations mlp = 3 # Captures MLP operations (dense layers only) moe = 4 # Captures full MoE layer including router and experts moe_router = 5 # Captures only MoE router and preprocessing moe_preprocess = 6 # Captures MoE preprocessing (requires moe_router) mamba = 7 # Captures Mamba layer operations
megatron/core/transformer/moe/fused_a2a.py
Line 320-327:
The new
reset_hybrid_ep_buffer()function should include documentation about when it should be called and thread-safety considerations:def reset_hybrid_ep_buffer(): ''' Reset the HybridEP buffer. Should be called: - When changing parallelism configurations - During test teardown to release resources Note: Not thread-safe. Ensure no concurrent operations are accessing the buffer when calling this function. ''' global _hybrid_ep_buffer _hybrid_ep_buffer = None
| '"local": capture the CUDA graph using MCore local implementation. --cuda-graph-scope=\"full_iteration\" enables whole iteration CUDA graph. ' | ||
| '"transformer_engine": capture the CUDA graph using TE make_graphed_callables().') | ||
| group.add_argument('--cuda-graph-scope', nargs='+', type=str, default=[], | ||
| group.add_argument('--cuda-graph-scope', nargs='+', type=lambda scope: CudaGraphScope[scope] if isinstance(scope, str) and scope != "full" else scope, default=[], |
There was a problem hiding this comment.
"isinstance(scope, str)" seems to be redundant since argparse will always pass str.
| for _, layers in enumerate(self.callables_per_chunk): | ||
| for layer in layers: | ||
| for graph in layer.cuda_graphs: | ||
| if is_te_min_version("2.10.0"): |
There was a problem hiding this comment.
te version check is repeated inside loop, maybe only one pre-check is enough.
| @pytest.mark.skipif( | ||
| not (HAVE_TE and is_te_min_version("1.14.0")), | ||
| reason="Partial CUDA graph support requires TransformerEngine version >= 1.14.0", | ||
| not (HAVE_TE and is_te_min_version("2.10.0")), | ||
| reason="Partial CUDA graph UT support requires TransformerEngine version >= 2.10.0", |
There was a problem hiding this comment.
Just curious that the version requirement jumping is only due to UT code compatibility or functional requirement🤔
There was a problem hiding this comment.
In TE 2.10.0 we added the reset() API for a graph object (this commit NVIDIA/TransformerEngine@262c184) . Without the reset() called after training, there is a probability that the UT hangs at exiting.
Signed-off-by: Robin Zhang <robinz@nvidia.com>
|
/ok to test efb8f5d |
|
/ok to test 1e2e708 |
What does this PR do ?
main PR #1920.
part1 is #1917.
This part mainly changes the cuda_graph_scope to a enum structure, and fixes cudagraph UTs.
Contribution process
flowchart LR A[Pre-checks] --> B[PR Tests] subgraph Code Review/Approval C1[Expert Review] --> C2[Final Review] end B --> C1 C2 --> D[Merge]Pre-checks
Core 0.8)Code review
The following process is enforced via the CODEOWNERS file for changes into
megatron/core. For changes outside ofmegatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.For MRs into `main` branch
(Step 1): Add PR label
Expert Review(Step 2): Collect the expert reviewers reviews
Expert Reviewlabel when your PR is ready for review.Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Final Reviewlabel(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into
core_r*release branches, after this PR has been merged, selectCherry-pickto open a new PR into the release branch.For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.Merging your PR
Any member of core-adlr and
core-nemowill be able to merge your PR.