Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training #2763
@yaox12 This PR integrates support for the fused permute+pad and unpermute+unpad APIs in Megatron-LM. Appreciate your review.
This can remove explicit padding/unpadding around GroupedMLP, which improves throughput and reduces peak memory usage.
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Hi @yaox12
Thanks. I'll review it later.
/ok to test 911548e
/ok to test ea197bc
Signed-off-by: xiaoxi-wangfj <690912414@qq.com> Co-authored-by: Xin Yao <xiny@nvidia.com>
What does this PR do?
Integrate TransformerEngine fused MoE permute+pad and unpermute+unpad FP8 fast-path (NVIDIA/TransformerEngine#1921) into Megatron-LM MoE token dispatching, eliminating explicit padding/unpadding steps in the experts path to improve performance and reduce peak GPU memory usage.
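For readers unfamiliar with the operation being fused, the following is a minimal pure-Python sketch of what "permute+pad" and "unpermute+unpad" compute. It is not TransformerEngine's kernel or API. The function names `permute_and_pad` and `unpermute_and_unpad`, the `PAD` sentinel, and the reading of `pad_offsets[e]` as "number of padding rows written before expert `e`'s chunk" are all assumptions made for illustration, and the alignment value used in the example is arbitrary.

```python
PAD = None  # hypothetical sentinel standing in for a zero-filled padding row


def round_up(n, align_size):
    """Smallest multiple of align_size that is >= n."""
    return -(-n // align_size) * align_size


def permute_and_pad(tokens, expert_ids, num_experts, align_size):
    """Group tokens by expert and pad each group to a multiple of align_size."""
    # Stable sort of token indices by expert id (the "permute" step).
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    tokens_per_expert = [0] * num_experts
    for e in expert_ids:
        tokens_per_expert[e] += 1

    padded, pad_offsets = [], []
    pos = 0         # read position in the permuted, unpadded order
    pad_so_far = 0  # padding rows written before the current expert
    for n in tokens_per_expert:
        pad_offsets.append(pad_so_far)
        padded.extend(tokens[i] for i in order[pos:pos + n])
        n_pad = round_up(n, align_size) - n
        padded.extend([PAD] * n_pad)  # the "pad" step, done in the same pass
        pos += n
        pad_so_far += n_pad
    return padded, order, tokens_per_expert, pad_offsets


def unpermute_and_unpad(padded, order, tokens_per_expert, pad_offsets):
    """Inverse of permute_and_pad: skip padding rows and restore token order."""
    restored = [None] * len(order)
    pos = 0
    for e, n in enumerate(tokens_per_expert):
        start = pos + pad_offsets[e]  # where expert e's real rows begin
        for k in range(n):
            restored[order[pos + k]] = padded[start + k]
        pos += n
    return restored


# Round trip with 5 tokens, 2 experts, and an (arbitrary) alignment of 4.
tokens = ["a", "b", "c", "d", "e"]
padded, order, counts, pad_offsets = permute_and_pad(tokens, [1, 0, 1, 0, 1], 2, 4)
restored = unpermute_and_unpad(padded, order, counts, pad_offsets)
```

In this sketch the padding rows are written while the tokens are being gathered, so no separate padded copy of the permuted buffer is made afterwards. That is the intuition behind the memory and throughput claim above, under the stated assumptions.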
Background / Motivation
TransformerEngine PR NVIDIA/TransformerEngine#1921 introduces:
- `moe_permute_and_pad_with_probs` (fused `moe_permute_with_probs` + `Fp8Padding`)
- `moe_unpermute(..., pad_offsets=...)` (fused `moe_unpermute` + `Fp8Unpadding`)
These TE-side fused kernels require Megatron-LM changes to:
- Pass `tokens_per_expert` and FP8 alignment size into permute
- Carry `pad_offsets` through dispatch and feed it back into unpermute
Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
Pre-checks
Core 0.8)
Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into `main` branch
(Step 1): Add PR label `Expert Review`
(Step 2): Collect the expert reviewers' reviews
- Attach the `Expert Review` label when your PR is ready for review.
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
- Add the `Final Review` label
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either `eharper@nvidia.com` or `zijiey@nvidia.com`.
Merging your PR
Any member of `core-adlr` and `core-nemo` will be able to merge your PR.