
Huvu/reve optim #102

Closed

huvunvidia wants to merge 141 commits into main from huvu/reve_optim

Conversation

@huvunvidia
Contributor

No description provided.

chtruong814 and others added 30 commits June 13, 2025 14:12
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Ethan He <[email protected]>
Signed-off-by: Ethan He <[email protected]>
The sparse_attention directory was incorrectly configured as a submodule
without a corresponding .gitmodules file, causing CI/CD checkout failures.
This commit converts it to a regular directory with tracked files.

Signed-off-by: Ethan He <[email protected]>
cp
Signed-off-by: Ethan He <[email protected]>
add diffusion; physical AI projects
Signed-off-by: oliver könig <[email protected]>
…efault-templates

Delete .github/ISSUE_TEMPLATE directory
chore: Update cherry-pick workflow to use v0.63.0
Initial commit related to dfm repo structure

Very nice structure; merging this will let the other MR landings build on it.
linnanwang and others added 21 commits November 20, 2025 16:39
* update

Signed-off-by: linnan wang <[email protected]>

* update

Signed-off-by: linnan wang <[email protected]>

* update

Signed-off-by: linnan wang <[email protected]>

* update

Signed-off-by: linnan wang <[email protected]>

* update

Signed-off-by: linnan wang <[email protected]>

---------

Signed-off-by: linnan wang <[email protected]>
* first commit

* workable code

* workable thd

* clean up, remove all CP for sbhd, CP now is only for thd

* run outside of Mbridge

* Update example scripts and add new data module for multimodal datasets

- Added comments to clarify file purposes in example_commands.sh, inference_wan.py, pretrain_wan.py, wan_provider.py, wan_step.py, and wan.py.
- Introduced EnergonMultiModalDataModule for handling multimodal datasets in nemo_vfm.
- Created SequentialMegatronSampler for efficient sequential sampling in large datasets.
- Added new files for DIT attention and base data handling.

This commit enhances documentation and introduces new functionalities for better data management and processing.
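The SequentialMegatronSampler mentioned above draws contiguous index batches in order, which avoids random-access overhead on very large datasets. A minimal standalone sketch of that idea (hypothetical class name and signature; the real sampler in the repo integrates with Megatron's dataloader machinery and may differ):

```python
# Hypothetical sketch of sequential batch sampling; not the repo's implementation.
from typing import Iterator, List


class SequentialBatchSampler:
    """Yield contiguous index batches in order, optionally keeping the remainder."""

    def __init__(self, total_samples: int, batch_size: int, drop_last: bool = True):
        self.total_samples = total_samples
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self) -> Iterator[List[int]]:
        batch: List[int] = []
        for idx in range(self.total_samples):
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch and not self.drop_last:
            yield batch  # emit the final short batch only when requested
```

Sequential sampling trades shuffling for locality, which is why it suits streaming-style multimodal datasets.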

* workable code before refactoring

* refactor attention submodules + reorder files locations

* update refactor

* update refactor

* reorganize files

* reorganize files

* refactoring code

* add README for perf test

* using vae, t5, scheduler from Diffusers

* update repo, remove Wan's GitHub modules

* fix Ruff

* fix ruff + copyright

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* fix Ruff + Lint

* merged main + address comments

* remove example_commands.md, Google waits until mid Nov

* refactor inference_configs + mockdatamodule

* add dit_embeddings.py

* fix lint ruff

* add 'average_gradients_across_tp_domain' to torch.nn for when running sequence_parallelism
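The rationale for the commit above: with sequence parallelism, each tensor-parallel rank processes a different sequence shard, so gradients of parameters that are replicated across the TP domain drift apart unless they are averaged (an all-reduce divided by the TP size). A toy illustration of that averaging with plain lists standing in for per-rank gradient tensors (hypothetical function; the repo's version operates on `torch.nn` parameters and a real TP process group):

```python
# Illustrative only: simulates averaging replicated-parameter gradients
# across a tensor-parallel domain, as an all-reduce / tp_size would.
def average_gradients_across_tp_domain(per_rank_grads):
    """per_rank_grads: one equal-length gradient list per TP rank."""
    tp_size = len(per_rank_grads)
    averaged = [sum(g) / tp_size for g in zip(*per_rank_grads)]
    # Every rank receives the same averaged gradient, mirroring an all-reduce.
    return [list(averaged) for _ in range(tp_size)]
```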

* add english negative prompt

* fix ruff lint

* Update uv.lock for deps: diffusers==0.35.1, easydict, imageio

* update dfm/src/megatron/data/dit

* change english negative prompt

* seq_packing now appears to work

* refactor with Sajad's PR - DiT data to common dir

* fix Ruff, lint

* fix Ruff, lint

* fix Ruff, lint

* workable mock datamodule (no path setting needed); updated training algorithm + hyperparameters to align with Linnan; tested training with anime-dataset finetuning

* bring wan_task encoders features to common, sharing with dit

* lint, ruff

* lint, ruff

* lint, ruff

* fix CP error (input of thd_split_inputs_cp to be cu_seqlens_q_padded instead of cu_seqlens_q)
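The CP fix above matters because, in the packed thd layout, `cu_seqlens_q` marks cumulative boundaries of the actual sequences while the padded variant marks boundaries after each sequence is padded so its length divides evenly across context-parallel ranks; splitting on the unpadded offsets misaligns every rank's chunk. A simplified sketch of per-rank splitting on padded boundaries (hypothetical helper; real Megatron CP uses a load-balanced two-chunk split per rank):

```python
# Hedged illustration of splitting a packed (thd) batch for context parallelism
# using *padded* cumulative sequence lengths. Not the repo's thd_split_inputs_cp.
def split_packed_for_cp(tokens, cu_seqlens_padded, cp_size, rank):
    """Return this CP rank's contiguous chunk of every padded sequence."""
    chunks = []
    for start, end in zip(cu_seqlens_padded[:-1], cu_seqlens_padded[1:]):
        seq = tokens[start:end]
        per_rank = len(seq) // cp_size  # padded length divides evenly by cp_size
        chunks.extend(seq[rank * per_rank:(rank + 1) * per_rank])
    return chunks
```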

* update README_perf_test.md

* fix lint, ruff

* update uv.lock, merge main

* uv.lock

* uv.lock

* uv.lock

* update uv.lock [using ci]

* Performance improvements to Wan

* Perf optimizations

* Tiny fix

* Remove CP disable as packed sequences not supported

* Fix comment

* Minor fixes. Revert video_latent comparison

* Fix missed check

* Lint fix

* H100 mock pretraining perf config

* Rename config file

* Lint check

Signed-off-by: Parth Mannan <[email protected]>

* Adding GB200 perf config

Signed-off-by: Parth Mannan <[email protected]>

* GB300 perf config

Signed-off-by: Parth Mannan <[email protected]>

* Refactor Energon data module to return wrapped dataloaders and add EnergonDataloader class for cyclic iteration. Introduce WAN pretrain mock data configuration for testing.

* Enhance DiffusionTaskEncoder to handle None attributes in stacking and concatenation methods. Add WAN pretrain mock data configuration for testing purposes.

* Refactor data processing in dit_data_step to simplify batch retrieval and update WAN pretrain configuration to include train_iters.

* Add op fusions

Signed-off-by: Parth Mannan <[email protected]>

* Update H100 config

Signed-off-by: Parth Mannan <[email protected]>

* Fix lint

Signed-off-by: Parth Mannan <[email protected]>

* Resolve conflict

Signed-off-by: Parth Mannan <[email protected]>

* Fix for mock dataloader test

Signed-off-by: Parth Mannan <[email protected]>

* Fix Dummyiter

Signed-off-by: Parth Mannan <[email protected]>

* Fix test

Signed-off-by: Parth Mannan <[email protected]>

* Make RoPE test only GPU

Signed-off-by: Parth Mannan <[email protected]>

* Rope cuda fix

Signed-off-by: Parth Mannan <[email protected]>

---------

Signed-off-by: Parth Mannan <[email protected]>
Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Signed-off-by: linnan wang <[email protected]>
* add docs

* updated README for Wan

* update README wan

* relocate readme

---------

Co-authored-by: Huy Vu2 <[email protected]>
* Add DiT Readme.

Signed-off-by: sajadn <[email protected]>

* Update DiT readme.

Signed-off-by: Sajad Norouzi <[email protected]>

* Minor wording update.

Signed-off-by: Sajad Norouzi <[email protected]>

---------

Signed-off-by: sajadn <[email protected]>
Signed-off-by: Sajad Norouzi <[email protected]>
* initial commit, workable code

* add example

* fix lint

* fix lint

* bring all wan related codes to DFM

* add tests

* lint

---------

Co-authored-by: Huy Vu2 <[email protected]>
* Initial README commit

* Update README and add performance summary documentation

- Corrected the link in the README for the performance summary to point to the correct file.
- Introduced a new `performance-summary.md` document detailing performance benchmarks for large language models using DFM, including nomenclature, performance metrics, and system configurations.

* add DiT megatron links.

Signed-off-by: sajadn <[email protected]>

* Performance Docs update

Signed-off-by: Parth Mannan <[email protected]>

* Performance Docs update fix

Signed-off-by: Parth Mannan <[email protected]>

* Update README to enhance clarity and accuracy

- Removed redundant description of the framework.
- Clarified the relationship between Megatron Bridge and Megatron Core in the Dual-Path Architecture section.

* Enhance README with detailed performance optimizations and parallelism descriptions

- Updated the Megatron Bridge Path section to include 6D parallelism details.
- Added state-of-the-art performance optimizations to the Dual Training Paths section.
- Clarified parallelism terminology in the comparison table for better understanding.

* Update perf doc

Signed-off-by: Parth Mannan <[email protected]>

* update

Signed-off-by: linnan wang <[email protected]>

* Update README with fine-tuning command

Removed TODO comment and added a command for fine-tuning a video diffusion model.

* Apply suggestion from @akoumpa

* Apply suggestion from @akoumpa

* Apply suggestion from @akoumpa

* Update README, Wan-related.

Updated command syntax and improved clarity in README.

* Apply suggestion from @akoumpa

* Fixing typo @akoumpa

* fix automodel section

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* fix

Signed-off-by: Alexandros Koumparoulis <[email protected]>

* update DFM-specific readme

Signed-off-by: Pablo Garay <[email protected]>

* Update performance-summary.md

Thanks a lot @linnanwang for the bench numbers.

* Update performance-summary.md

* Update performance-summary.md

* Update README.md

Co-authored-by: Wenwen Gao <[email protected]>

* Update README.md

Co-authored-by: Wenwen Gao <[email protected]>

* Update README.md

Co-authored-by: Wenwen Gao <[email protected]>

* Update README.md

Co-authored-by: Wenwen Gao <[email protected]>

* Refactor README.md and performance-summary.md for clarity and conciseness

- Simplified descriptions of Megatron Bridge and AutoModel paths in README.md.
- Removed outdated comparison table to streamline content.
- Updated performance-summary.md to generalize model references and improve clarity.

Co-authored-by: Wenwen Gao <[email protected]>

* Fix typo in README.md: changed "Built" to "Build" in the container section header for consistency.

---------

Signed-off-by: sajadn <[email protected]>
Signed-off-by: Parth Mannan <[email protected]>
Signed-off-by: linnan wang <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>
Co-authored-by: sajadn <[email protected]>
Co-authored-by: Parth Mannan <[email protected]>
Co-authored-by: linnan wang <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Huy Vu <[email protected]>
Co-authored-by: Alexandros Koumparoulis <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
Co-authored-by: Wenwen Gao <[email protected]>
* report for public version

* fix image size

* Update report.md for Wan 2.1 convergence comparison, correcting formatting and ensuring clarity in experiment overview and caveats regarding training loss fluctuations between Diffusers and Megatron-Core implementations.

---------

Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
- Introduced a new document detailing the comparison between Diffusers (Automodel path) and Megatron-Core (Megatron-Bridge path) for Wan 2.1.
- Included experiment overview, dataset specifications, training setup, and results with visual training curves.
- Added two binary images illustrating loss vs. steps for both text-to-image and text-to-video stages.

This documentation aims to provide insights into the model's performance and training dynamics during the partial convergence test.
* edm and data preprocess tests.

Signed-off-by: sajadn <[email protected]>

* Minor cleanings for DiT.

Signed-off-by: Sajad Norouzi <[email protected]>

* add dit unit test.

Signed-off-by: Sajad Norouzi <[email protected]>

* add iter to the DiffusionDataModule.

Signed-off-by: sajadn <[email protected]>

* add missing copyright.

Signed-off-by: sajadn <[email protected]>

* use 'no caption' if caption is not present.

Signed-off-by: sajadn <[email protected]>

* fix dit inference bug. Add wandb to inference code.

Signed-off-by: sajadn <[email protected]>

* update the DiT configs to be aligned with the original paper.

Signed-off-by: sajadn <[email protected]>

* add wandb[video] and mediapy to uv.

Signed-off-by: sajadn <[email protected]>

* adjust pos_ids in mock_dataset to have batch dimension, fuse adaLN layers, use DiTSelfAttention.

Signed-off-by: sajadn <[email protected]>
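For context on the fused adaLN layers mentioned above: a DiT block conditions its normalized activations with an adaptive layer-norm shift and scale, `x * (1 + scale) + shift`. A bare-bones standalone form of that modulation (hypothetical helper; the fused version in the repo folds the conditioning projections together for speed):

```python
# Sketch of adaLN modulation as used in DiT-style blocks; illustrative only.
def ada_ln_modulate(normed_x, shift, scale):
    """Apply adaptive layer-norm conditioning: x * (1 + scale) + shift."""
    return [h * (1.0 + s) + b for h, s, b in zip(normed_x, scale, shift)]
```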

* fix the diffusion sample size bug.

Signed-off-by: sajadn <[email protected]>

* fix broken tests.

Signed-off-by: sajadn <[email protected]>

---------

Signed-off-by: sajadn <[email protected]>
Signed-off-by: Sajad Norouzi <[email protected]>
Co-authored-by: Abhinav Garg <[email protected]>
- Add pre-flight job to detect docs-only changes using FW-CI-templates
- Skip cicd-wait-in-queue, unit tests, and e2e tests when docs_only is true
- Skip copyright-check when docs_only is true
- Skip build-test-publish-wheel when docs_only is true
- Linting and ruff checks remain enabled for all PRs

Signed-off-by: Pablo Garay <[email protected]>
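The gating described above hinges on classifying a change set as documentation-only. A hypothetical helper showing the shape of that check (the actual detection is done by the FW-CI-templates pre-flight job in the workflow, and its doc patterns may differ):

```python
# Hypothetical docs-only classifier; the real check lives in FW-CI-templates.
DOC_SUFFIXES = (".md", ".rst", ".txt")
DOC_DIRS = ("docs/",)


def is_docs_only(changed_files):
    """True if every changed file is documentation, so heavy CI jobs can be skipped."""
    return bool(changed_files) and all(
        f.startswith(DOC_DIRS) or f.endswith(DOC_SUFFIXES) for f in changed_files
    )
```

Linting and copyright-style checks can still run unconditionally, as the commit notes for lint/ruff.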
* initial commit

* update import EnergonMultiModalDataModule

* update submodule Megatron-Bridge

* Update uv.lock [skip ci]

* update uv.lock

* small update

* small update

---------

Co-authored-by: Huy Vu2 <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Adding support for hunyuan finetuning

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Ensuring that activation checkpointing is gated with a flag

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Make the flow matching pipeline logic model agnostic

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Adding copyright to dataset processing file

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Linting fixes

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Fix linting

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* lintfix

Signed-off-by: Pablo Garay <[email protected]>

* lintfix

Signed-off-by: Pablo Garay <[email protected]>

* Update automodel dependencies

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Remove unused import

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Setting the minimum diffusers package version

Signed-off-by: Pranav Prashant Thombre <[email protected]>

---------

Signed-off-by: Pranav Prashant Thombre <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
* workable prepare_dataset_wan.py: tested to match automodel's preprocess_resize.py; encode-decode verified

* fix lint

* change location of prepare_dataset_wan.py

---------

Co-authored-by: Huy Vu2 <[email protected]>
* Add more comprehensive testing for the automodel path

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Adding unit tests for the flow matching pipeline

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Adding functional test for Wan

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Fixing linting errors

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Linting fixes

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Increase test timeout

Signed-off-by: Pranav Prashant Thombre <[email protected]>

* Remove flash attention3 as the default attention backend during training

Signed-off-by: Pranav Prashant Thombre <[email protected]>

---------

Signed-off-by: Pranav Prashant Thombre <[email protected]>
Added instructions for converting HuggingFace checkpoints to Megatron format and vice versa, including necessary commands and notes on exported checkpoints.
@copy-pr-bot

copy-pr-bot bot commented Feb 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Labels: none yet

Projects: none yet

10 participants