Enable Casting-Free FP8-Flow-MoE Blockwise FP8 Dataflow by xiaoxi-wangfj · Pull Request #1 · 021ai/Megatron-LM

xiaoxi-wangfj · 2025-12-26T03:48:18Z

What does this PR do?

This PR introduces a set of FP8-centric MoE dataflow optimizations that improve FP8 communication performance and reduce CPU overhead through operator fusion, while maintaining numerical stability in existing FP8 training workflows.

Background / Motivation

The individual optimizations are described in detail in the joint PR #1.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

Fusing the permute+pad and unpermute+unpad ops can remove the explicit padding/unpadding around GroupedMLP, which improves throughput and reduces peak memory usage.

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
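To make the fusion idea concrete, here is a minimal pure-Python sketch, not the kernel from this PR. It assumes top-1 routing and a hypothetical alignment of 16 rows per expert segment; the function names `fused_permute_pad` and `fused_unpermute_unpad` are illustrative only. Tokens are written straight into their padded destination rows in one pass, so no separate pad or unpad step (and no intermediate unpadded buffer) is needed.

```python
def padded_offsets(tokens_per_expert, align=16):
    """Start row of each expert's segment when every segment is rounded up to `align` rows."""
    offsets, total = [], 0
    for n in tokens_per_expert:
        offsets.append(total)
        total += -(-n // align) * align  # ceil(n / align) * align
    return offsets, total


def fused_permute_pad(tokens, expert_ids, num_experts, align=16, pad_value=0.0):
    """Group tokens by expert and pad each group in a single pass.

    Assumes top-1 routing (one expert id per token); `align` is an assumed value.
    Returns the padded buffer, the destination row of each token, and per-expert counts.
    """
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets, total = padded_offsets(counts, align)

    hidden = len(tokens[0])
    out = [[pad_value] * hidden for _ in range(total)]  # padding rows are pre-filled
    row_map = [0] * len(tokens)
    cursor = list(offsets)  # next free row inside each expert's segment
    for t, e in enumerate(expert_ids):
        out[cursor[e]] = list(tokens[t])
        row_map[t] = cursor[e]
        cursor[e] += 1
    return out, row_map, counts


def fused_unpermute_unpad(expert_out, row_map):
    """Gather rows back into the original token order, skipping padding rows."""
    return [list(expert_out[r]) for r in row_map]


if __name__ == "__main__":
    tokens = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    out, row_map, counts = fused_permute_pad(tokens, [1, 0, 1, 1, 0], num_experts=2, align=4)
    print(row_map)  # [4, 0, 5, 6, 1]
    print(fused_unpermute_unpad(out, row_map) == tokens)  # True
```

In a real implementation the same index computation would run inside a GPU kernel; the point of the sketch is only that the padded layout can be produced directly from the routing map.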

1. Quantize activations to FP8 before DeepEP token dispatch.
2. Feed FP8 activations directly into the expert up-projection.
3. Fuse the activation function with quantization.
4. Add fine-grained recompute for `moe_expert`.

Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Co-authored-by: dantesuu@gmail.com
Co-authored-by: xzhu@zhejianglab.org
Co-authored-by: 123sssmmm@gmail.com
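The quantization steps above can be illustrated with a small pure-Python simulation of blockwise FP8 (E4M3) scaling. This is a sketch under stated assumptions, not the Transformer Engine or Megatron-LM code: the block width of 128, the choice of SwiGLU as the activation, and all function names are hypothetical. The only format facts relied on are that E4M3 has 3 mantissa bits, a minimum normal exponent of -6, and a largest finite value of 448.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value
BLOCK = 128           # assumed block width; the PR text does not state it


def round_to_e4m3(v):
    """Simulate rounding a float to the nearest E4M3-representable value."""
    if v == 0.0:
        return 0.0
    _, exp = math.frexp(abs(v))   # abs(v) = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)          # below 2**-6 the subnormal spacing applies
    step = 2.0 ** (e - 3)         # 3 mantissa bits: 8 steps per binade
    q = round(v / step) * step    # Python's round() is round-half-to-even
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, q))


def quantize_blockwise(row, block=BLOCK):
    """Quantize one row with a separate scale per block of `block` elements."""
    q, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(x) for x in chunk)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        q.extend(round_to_e4m3(x / scale) for x in chunk)
    return q, scales


def dequantize_blockwise(q, scales, block=BLOCK):
    """Recover approximate values by multiplying each element by its block scale."""
    return [v * scales[i // block] for i, v in enumerate(q)]


def silu(x):
    return x / (1.0 + math.exp(-x))


def fused_act_quantize(gate, up, block=BLOCK):
    """Apply SwiGLU (an assumed activation) and quantize block by block.

    Only one block of full-precision activations exists at a time, which is the
    memory benefit a fused activation+quantize op is after.
    """
    q, scales = [], []
    for start in range(0, len(gate), block):
        act = [silu(g) * u
               for g, u in zip(gate[start:start + block], up[start:start + block])]
        qb, sb = quantize_blockwise(act, block)
        q.extend(qb)
        scales.extend(sb)
    return q, scales


if __name__ == "__main__":
    row = [0.01 * i for i in range(-128, 128)]
    q, scales = quantize_blockwise(row)
    deq = dequantize_blockwise(q, scales)
    print(len(scales), max(abs(a - b) for a, b in zip(row, deq)))
```

In a casting-free dataflow, the `(q, scales)` pair produced before dispatch would be what travels through the all-to-all and into the expert GEMM, rather than being dequantized and requantized at each stage.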

xiaoxi-wangfj added 2 commits December 26, 2025 02:37

Fuse permute+pad and unpermute+unpad ops for FP8/FP4 precision

500deb6


xiaoxi-wangfj changed the title from "Fp8 flow moe" to "Enable Casting-Free FP8-Flow-MoE Blockwise FP8 Dataflow" on Dec 26, 2025

This was referenced Dec 26, 2025

[PyTorch]Add Casting-Free FP8-Flow-MoE Blockwise Optimizations 021ai/TransformerEngine#1 (Closed)

[PyTorch]Add Casting-Free FP8-Flow-MoE Blockwise Optimizations NVIDIA/TransformerEngine#2544 (Open)

xiaoxi-wangfj closed this Dec 29, 2025

xiaoxi-wangfj wants to merge 2 commits into FP8-Flow-MoE-MD from FP8-Flow-MoE
