Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training by xiaoxi-wangfj · Pull Request #2763 · NVIDIA/Megatron-LM

xiaoxi-wangfj · 2025-12-26T07:05:28Z

What does this PR do?

Integrates the TransformerEngine fused MoE permute+pad and unpermute+unpad FP8 fast path (NVIDIA/TransformerEngine#1921) into Megatron-LM MoE token dispatching. This eliminates the explicit padding/unpadding steps in the experts path, which improves performance and reduces peak GPU memory usage.

Background / Motivation

TransformerEngine PR NVIDIA/TransformerEngine#1921 introduces:

moe_permute_and_pad_with_probs (fused moe_permute_with_probs + Fp8Padding)
extended moe_unpermute(..., pad_offsets=...) (fused moe_unpermute + Fp8Unpadding)

These TE-side fused kernels require Megatron-LM changes to:

pass tokens_per_expert and FP8 alignment size into permute
carry pad_offsets through dispatch and feed it back into unpermute
gate the new path behind a configuration switch / availability checks
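To make the data flow above concrete, here is a minimal pure-Python sketch of what a combined permute+pad followed by unpermute+unpad does conceptually. It is not the TransformerEngine kernel and does not reproduce its API: the function names, the top-1 routing, and the `row_map` bookkeeping (standing in for whatever TE carries via `pad_offsets`) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Row = List[float]


def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (0 stays 0)."""
    return ((n + multiple - 1) // multiple) * multiple


def permute_and_pad(
    tokens: Sequence[Row], expert_ids: Sequence[int], num_experts: int, align_size: int
) -> Tuple[List[Row], List[int], List[Tuple[int, int]]]:
    """Group tokens by expert and zero-pad each group to a multiple of align_size.

    Hypothetical helper: returns the padded buffer, the padded per-expert
    counts, and a (source_index, padded_row) map used to undo the operation.
    """
    hidden = len(tokens[0]) if tokens else 0
    tokens_per_expert = [0] * num_experts
    for e in expert_ids:
        tokens_per_expert[e] += 1
    padded_counts = [round_up(n, align_size) for n in tokens_per_expert]

    # Stable sort keeps the original token order within each expert.
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])

    padded: List[Row] = []
    row_map: List[Tuple[int, int]] = []
    cursor = 0
    for e in range(num_experts):
        for src in order[cursor:cursor + tokens_per_expert[e]]:
            row_map.append((src, len(padded)))
            padded.append(list(tokens[src]))
        cursor += tokens_per_expert[e]
        # Append zero rows so this expert's chunk is aligned.
        pad_rows = padded_counts[e] - tokens_per_expert[e]
        padded.extend([0.0] * hidden for _ in range(pad_rows))
    return padded, padded_counts, row_map


def unpermute_and_unpad(
    padded: Sequence[Row], row_map: Sequence[Tuple[int, int]], num_tokens: int
) -> List[Row]:
    """Drop the padding rows and restore the original token order."""
    restored: List[Row] = [[] for _ in range(num_tokens)]
    for src, dst in row_map:
        restored[src] = list(padded[dst])
    return restored
```

In this sketch the padding is written while the permuted buffer is being built, so no separate padded copy is made afterwards; that is the intuition behind the memory saving claimed for the fused path, though the real kernels operate on GPU tensors and may lay out the buffer differently.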

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

copy-pr-bot · 2025-12-26T07:05:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

xiaoxi-wangfj · 2025-12-29T03:48:06Z

@yaox12 This PR integrates support for the fused permute+pad and unpermute+unpad APIs in Megatron-LM. Appreciate your review.

xiaoxi-wangfj · 2026-01-06T10:25:55Z

Hi @yaox12
The moe-permute-padding-for-quantization flag has been removed.
The system now defaults to fused_permute_and_pad_with_probs.
When this kernel is not available (e.g., due to TE version constraints),
it transparently falls back to fused_permute_with_probs + FP8 padding.
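
The fallback described in this comment can be sketched as a small selection helper. This is only an illustration of the pattern, not the code in `moe_utils.py`: `make_permute_fn` and the toy stand-ins below are hypothetical, and the real availability check (for example, a TE version test) is represented here simply by passing `None` for the fused callable.

```python
from typing import Callable, List, Optional

Permuted = List[int]


def make_permute_fn(
    fused_permute_and_pad: Optional[Callable[[List[int], int], Permuted]],
    permute: Callable[[List[int]], Permuted],
    pad: Callable[[Permuted, int], Permuted],
) -> Callable[[List[int], int], Permuted]:
    """Prefer the fused callable; otherwise compose permute followed by pad."""
    if fused_permute_and_pad is not None:
        return fused_permute_and_pad

    def unfused(tokens: List[int], align_size: int) -> Permuted:
        # Two separate steps: permute first, then pad the result.
        return pad(permute(tokens), align_size)

    return unfused


# Toy stand-ins (assumptions for illustration only).
def toy_permute(tokens: List[int]) -> Permuted:
    return sorted(tokens)


def toy_pad(tokens: Permuted, align_size: int) -> Permuted:
    extra = (align_size - len(tokens) % align_size) % align_size
    return tokens + [0] * extra


def toy_fused(tokens: List[int], align_size: int) -> Permuted:
    return toy_pad(toy_permute(tokens), align_size)
```

Callers use the returned function the same way on either path, which is what makes the fallback transparent to the rest of the dispatcher.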

yaox12 · 2026-01-09T03:28:19Z

> Hi @yaox12 The moe-permute-padding-for-quantization flag has been removed. The system now defaults to fused_permute_and_pad_with_probs. When this kernel is not available (e.g., due to TE version constraints), it transparently falls back to fused_permute_with_probs + FP8 padding.

Thanks. I'll review it later.

yaox12 · 2026-02-04T09:15:53Z

/ok to test 911548e

yaox12 · 2026-02-05T01:30:59Z

/ok to test ea197bc

xiaoxi-wangfj requested review from a team as code owners December 26, 2025 07:05

github-actions Bot requested a review from Phlip79 December 26, 2025 07:05

github-actions Bot added the community-request label Dec 26, 2025

xiaoxi-wangfj force-pushed the fused_perm_pad branch from 36fe5e9 to 4cfb6a6 December 29, 2025 03:39

Fuse permute+pad and unpermute+unpad ops for FP8/FP4 precision

3f261d7

This can remove explicit padding/unpadding around GroupedMLP, which improves throughput and reduces peak memory usage.

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

xiaoxi-wangfj force-pushed the fused_perm_pad branch from 4cfb6a6 to 3f261d7 December 31, 2025 10:19

zhongbozhu reviewed Jan 2, 2026

Comment thread megatron/training/arguments.py Outdated

zhongbozhu mentioned this pull request Jan 3, 2026

[PyTorch][NVFP4][MOE] NVFP4 Grouped Quantize with Hadamard Transform NVIDIA/TransformerEngine#2411

Merged

18 tasks

yanring requested a review from yaox12 January 4, 2026 07:38

yaox12 added the Expert Review [deprecated] label Jan 4, 2026

yaox12 reviewed Jan 6, 2026

Comment thread megatron/core/transformer/moe/moe_utils.py Outdated

yaox12 requested changes Jan 6, 2026

Comment thread megatron/core/transformer/moe/moe_utils.py

Comment thread megatron/core/transformer/moe/moe_utils.py Outdated

Comment thread megatron/core/transformer/transformer_config.py Outdated

Comment thread megatron/training/arguments.py Outdated

xiaoxi-wangfj force-pushed the fused_perm_pad branch from f5bda0e to 9e7a356 January 6, 2026 10:07

set fused_permute_pad to default

76aff2e

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>

xiaoxi-wangfj force-pushed the fused_perm_pad branch from 9e7a356 to 76aff2e January 6, 2026 10:21

Merge branch 'main' into fused_perm_pad

e90c413

Merge branch 'main' into fused_perm_pad

deb6bd3

Update moe_utils.py

6d46c57

yaox12 approved these changes Jan 14, 2026

yaox12 added 2 commits February 4, 2026 17:15

Update moe_utils.py

9057392

Merge branch 'main' into fused_perm_pad

911548e

copy-pr-bot Bot temporarily deployed to nemo-ci February 4, 2026 09:16 Inactive

copy-pr-bot Bot temporarily deployed to test February 4, 2026 09:16 Inactive

Merge branch 'main' into fused_perm_pad

ea197bc

copy-pr-bot Bot temporarily deployed to nemo-ci February 5, 2026 01:31 Inactive

copy-pr-bot Bot temporarily deployed to test February 5, 2026 01:31 Inactive

yaox12 added this pull request to the merge queue Feb 6, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 6, 2026

yaox12 added this pull request to the merge queue Feb 6, 2026

github-merge-queue Bot pushed a commit that referenced this pull request Feb 6, 2026

Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training (#2763)

6fbf108

Signed-off-by: xiaoxi-wangfj <690912414@qq.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

github-merge-queue Bot pushed a commit that referenced this pull request Feb 6, 2026

Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training (#2763)

2a43408

Signed-off-by: xiaoxi-wangfj <690912414@qq.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 6, 2026

yaox12 added this pull request to the merge queue Feb 6, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Feb 6, 2026

yaox12 added this pull request to the merge queue Feb 7, 2026

Merged via the queue into NVIDIA:main with commit 554ce49 Feb 7, 2026
70 of 73 checks passed

yaox12 deleted the fused_perm_pad branch February 7, 2026 16:06

daiyaanarfeen pushed a commit to daiyaanarfeen/Megatron-LM that referenced this pull request Feb 23, 2026

Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training (NVIDIA…

e6f853e

…#2763) Signed-off-by: xiaoxi-wangfj <690912414@qq.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

BoxiangW pushed a commit to BoxiangW/Megatron-LM that referenced this pull request Mar 4, 2026

Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training (NVIDIA…

464c4bf

…#2763) Signed-off-by: xiaoxi-wangfj <690912414@qq.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

yangbofun pushed a commit to xlm-research/Megatron-LM that referenced this pull request May 22, 2026

Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training (NVIDIA…

aed82e9

…#2763) Signed-off-by: xiaoxi-wangfj <690912414@qq.com> Co-authored-by: Xin Yao <xiny@nvidia.com>

yaox12 merged 14 commits into NVIDIA:main from 021ai:fused_perm_pad

7 participants
