Thanks for your interest in contributing to Megatron-Bridge!
You can either follow the steps below to set up the environment from scratch, or use the NeMo Framework container, which provides a pre-built environment and makes these steps unnecessary.
Build the Docker container:

```bash
docker build \
  -f docker/Dockerfile.ci \
  -t megatron-bridge \
  .
```

To start a shell in the container to interactively run/develop:

```bash
docker run --rm -it -w /workdir -v $(pwd):/opt/Megatron-Bridge \
  --entrypoint bash \
  --gpus all \
  megatron-bridge
```

If you are using VSCode/Cursor, you can also use Dev Containers. A `devcontainer.json` example to get you started is included at the end of this guide.
If you're an external contributor, you'll need to fork the repository:

- Create a fork: Click the "Fork" button on the GitHub repository page or follow this direct link to fork.
- Clone your fork:

  ```bash
  git clone https://github.com/YOUR-USERNAME/Megatron-Bridge megatron-bridge
  cd megatron-bridge
  ```

- Add upstream remote to keep your fork updated:

  ```bash
  git remote add upstream https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
  ```

- Install pre-commit:

  ```bash
  # Requires `uv` to be installed
  uv run --group dev pre-commit install
  ```

- Keep your fork updated before starting new work:

  ```bash
  git fetch upstream
  git checkout main
  git merge upstream/main
  git push origin main
  ```

- Create a new branch for your changes:

  ```bash
  git checkout main
  git switch -c your-feature-name
  ```

- Make your changes and commit them:

  ```bash
  git add .
  git commit --signoff -m "Your descriptive commit message"
  ```

  We require signing commits with `--signoff` (or `-s` for short). See Signing Your Work for details.

- Push to your fork:

  ```bash
  git push origin your-feature-name
  ```

- Create a pull request from your fork's branch to the main repository's `main` branch through the GitHub web interface.
If you have write access to the repository (NVIDIA contributors):
- Clone the repository directly:

  ```bash
  git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge megatron-bridge
  cd megatron-bridge
  ```

- Install pre-commit from the project root directory:

  ```bash
  # Requires `uv` to be installed
  uv run --group dev pre-commit install
  ```

- Create a new branch for your changes:

  ```bash
  git switch -c your-feature-name
  ```

- Make your changes and commit them:

  ```bash
  git add .
  git commit --signoff -m "Your descriptive commit message"
  ```

- Push your branch to the repository:

  ```bash
  git push origin your-feature-name
  ```

- Create a pull request from your branch to the `main` branch.
Format your commit messages and PR titles as:
```
[{areas}] {type}: {description}
```

Areas (use the most relevant ones; separate multiple with `,`):

- `model` - Model implementations and HF bridge logic
- `recipe` - Training recipes and launch configs
- `training` - Training loop, callbacks, and runtime integration
- `data` - Dataset builders, preprocessing, and samplers
- `ckpt` - Checkpoint conversion, loading, export, and save paths
- `peft` - PEFT methods (LoRA, adapters) and adapter export
- `perf` - Performance optimizations and throughput improvements
- `distill` - Knowledge distillation
- `prune` - Pruning and sparsity
- `quant` - Quantization (PTQ, QAT, FP8 recipes)
- `diffusion` - Diffusion model implementations and training
- `ci` - CI, automation, and workflow infrastructure
- `docs` - Documentation, examples, and contributor guidance
- `build` - Dependencies, packaging, and environment setup
- `misc` - Cross-cutting utilities and other changes
Types:
- `feat` - New feature
- `fix` - Bug fix
- `refactor` - Code refactoring without changing functionality
- `chore` - Maintenance tasks
- `test` - Adding or updating tests
Breaking Changes: If your PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
Examples:
```
[model] feat: Add Qwen3 model bridge
[recipe, docs] feat: Add Llama 3.1 70B recipe with documentation
[ckpt] fix: Handle missing keys in HF checkpoint conversion
[BREAKING][training] refactor: Change optimizer config structure
[ci, build] chore: Update ruff version
```
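To make the title convention concrete, here is a small sketch of a validation helper. It is hypothetical (not part of the repository); the area and type lists mirror the ones documented above.

```python
import re

# Hypothetical checker for the "[{areas}] {type}: {description}" convention.
AREAS = {"model", "recipe", "training", "data", "ckpt", "peft", "perf",
         "distill", "prune", "quant", "diffusion", "ci", "docs", "build", "misc"}
TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Optional [BREAKING] prefix, then "[areas] type: description".
TITLE_RE = re.compile(r"^(\[BREAKING\])?\[([^\]]+)\] (\w+): (.+)$")

def is_valid_title(title: str) -> bool:
    match = TITLE_RE.match(title)
    if not match:
        return False
    areas = [a.strip() for a in match.group(2).split(",")]
    return all(a in AREAS for a in areas) and match.group(3) in TYPES

print(is_valid_title("[model] feat: Add Qwen3 model bridge"))                     # True
print(is_valid_title("[BREAKING][training] refactor: Change optimizer config structure"))  # True
print(is_valid_title("fix stuff"))                                                # False
```

Such a check could be wired into a pre-commit hook, but that is a design choice, not something this repository currently documents.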
When you create a pull request, add labels immediately so reviewers and CI can route it correctly. At minimum, apply:
- One type label: `bug`, `feature`, `docs`, or `ci`
- One or more area labels: `area:model`, `area:recipe`, `area:training`, `area:data`, `area:ckpt`, `area:peft`, `area:perf`, `area:distill`, `area:diffusion`, `area:prune`, `area:quant`, `area:build`, or `area:misc`

Add these labels when they apply:

- `docs-only` - if the PR touches only documentation (no code changes); this skips most CI jobs
- `needs-review` - when the PR is ready for review
- `needs-more-tests` - if the change needs additional test coverage; triggers both L0 and L1 CI
- `high-complexity` - if the PR is large, touches many files, or is prone to merge conflicts
Add risk labels when applicable:

- `breaking-change` - if any public API, CLI argument, config key, or function signature changes
- `needs-more-tests` - if the change needs additional test coverage (also triggers L1 CI)
Properly labeled PRs get faster reviews and avoid sitting in the triage queue.
Megatron Bridge uses a small governance taxonomy so maintainers, oncall, and automation can reason about issues and PRs consistently:
- New issues should start with `needs-triage` and leave triage with one `type` label plus one `area` label.
- PRs should use one primary `area:*` value in the PR template. State labels such as `needs-author`, `blocked`, and `ready-to-merge` are for routing active work, not for replacing review status or CI details.
- Release labels such as `r0.3.0`, community labels, and `needs-follow-up` are still valid, but they are orthogonal to the main governance taxonomy.
Use exactly one type label per issue or PR after triage:
| Label | Use for |
|---|---|
| `bug` | Incorrect behavior, regressions, or broken workflows |
| `feature` | New capabilities, enhancements, or enablement work |
| `support` | Questions, help requests, or user guidance gaps |
| `docs` | Documentation-only updates or documentation debt |
| `ci` | CI, automation, test queue, or workflow infrastructure work |
Use at most one primary state label from this set at a time (see exceptions below):
| Label | Meaning |
|---|---|
| `needs-triage` | New item needs classification and ownership |
| `needs-review` | PR is ready for code review and waiting on a reviewer |
| `needs-author` | Author action is required before review or merge can continue |
| `needs-follow-up` | Issue or PR has finished initial triage/review and needs further follow-up |
| `blocked` | Work cannot move forward until an external dependency is cleared |
| `ready-to-merge` | PR is approved, current, and only waiting for CI to pass before merge |
Allowed combinations: `needs-author` + `needs-follow-up` and `needs-follow-up` + `blocked` can co-exist (e.g., waiting on the author while oncall keeps tracking, or a blocked item that oncall should keep watching across handoffs).
Apply only when risk affects review or merge behavior:
| Label | Meaning |
|---|---|
| `breaking-change` | Public behavior or API compatibility changes |
| `high-complexity` | Harder to merge: prone to conflicts and needs additional test coverage |
| `needs-more-tests` | Requires additional test coverage; triggers both L0 and L1 CI test tiers |
Use one primary area label after triage:
| Label | Scope |
|---|---|
| `area:model` | Model implementations and HF bridge logic |
| `area:recipe` | Training recipes and launch configs |
| `area:training` | Training loop, callbacks, and runtime integration |
| `area:data` | Dataset builders, preprocessing, and samplers |
| `area:ckpt` | Checkpoint conversion, loading, export, and save paths |
| `area:peft` | PEFT methods (LoRA, adapters) and adapter export |
| `area:perf` | Performance optimizations, kernel integration, and throughput improvements |
| `area:distill` | Knowledge distillation |
| `area:diffusion` | Diffusion model implementations and training |
| `area:prune` | Pruning and sparsity |
| `area:quant` | Quantization (PTQ, QAT, FP8 recipes) |
| `area:build` | Dependencies, packaging, images, and environment setup |
| `area:misc` | Cross-cutting utilities, logging, helpers, and other changes that do not fit a primary domain |
This taxonomy does not replace every existing label:
- Keep release labels such as `r0.3.0` as independent scheduling signals.
- Keep `community-request` and other community-related labels as independent intake signals.
- Use `needs-follow-up` when an issue or PR should stay explicitly visible to the oncaller across handoffs.
- Avoid creating new status synonyms when an existing label in this taxonomy already fits.
- New issues should start with `needs-triage`.
- Issues should leave triage with one `type` label and one `area` label.
- An issue keeps `needs-triage` until a maintainer has responded or assigned it. Adding type and area labels is classification; the issue leaves `needs-triage` only when a maintainer engages (responds, assigns, or explicitly routes it).
- After a maintainer engages, transition to `needs-follow-up` (deferred work oncall should track), `needs-author` (waiting on reporter for more info), `blocked` (external dependency), or no state label (actively being worked on).
- PRs should not use `needs-triage`. Use `needs-review`, `needs-author`, `blocked`, or `ready-to-merge` only when they help route work.
- `high-complexity` starts as a manual maintainer label, not an automated heuristic.
- `needs-follow-up` should usually point to a linked issue instead of staying on a merged PR.
- `needs-follow-up` is the visibility label for deferred work that should stay on the oncall radar.
- `needs-follow-up` can be combined with `blocked` when the oncaller should keep watching a blocked item.
- If a PR is marked `breaking-change`, do not treat it as auto-mergeable even if CI is green.
These four views are the core daily queues maintainers and oncall should watch.
- Triage queue
  - Scope: open issues labeled `needs-triage`
  - Goal: assign one `type` and one `area`
  - Suggested query: `is:issue is:open label:"needs-triage" sort:updated-asc`
- Merge queue
  - Scope: open PRs labeled `ready-to-merge`
  - Goal: surface PRs that should merge without rereading every CI detail
  - Suggested query: `is:pr is:open label:"ready-to-merge" draft:false sort:updated-asc`
- Blocked and follow-up queue
  - Scope: open issues and PRs labeled `blocked` or `needs-follow-up`
  - Goal: make blockers and deferred work visible across handoffs
  - Suggested query: `is:open (label:"blocked" OR label:"needs-follow-up") sort:updated-asc`
- High-complexity queue
  - Scope: open PRs labeled `high-complexity`
  - Goal: proactively review, rebase, and ensure adequate test coverage before conflicts waste CI and reviewer time
  - Suggested query: `is:pr is:open label:"high-complexity" sort:updated-asc`
If you mirror these queues into a GitHub Project, keep the columns and sort keys small:
- item title
- primary area
- owner or assignee
- age
- last updated time
- release label
- current state
We use pytest for writing both unit and functional tests.
Unit tests aim to test functions in isolation. They generally do not depend on artifacts such as Hugging Face checkpoints or larger datasets; the one exception is a small toy dataset consisting of tokenizers.

Unit tests are stored in `tests/unit_tests`. Please add your test to an existing folder, or create a new one if none matches.
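As an illustration of the style (the function under test here is hypothetical, not from the codebase), a unit test exercises one function with plain assertions and no external artifacts; pytest discovers any function named `test_*`:

```python
# Hypothetical example in the tests/unit_tests style: a pure function
# tested in isolation, with no checkpoints or datasets involved.

def chunk(items, size):
    """Split a sequence into consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

def test_chunk_splits_evenly():
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]

def test_chunk_handles_remainder():
    assert chunk([1, 2, 3], 2) == [[1, 2], [3]]

if __name__ == "__main__":
    # pytest runs these automatically; calling them directly also works.
    test_chunk_splits_evenly()
    test_chunk_handles_remainder()
    print("all unit checks passed")
```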
Preferred: Unit tests are strongly recommended over functional tests. CI resources are limited, so every functional test slot has a real cost. Cover as much logic as possible with unit tests before reaching for a functional test.
Functional tests are integration tests that perform model training or operate on larger artifacts; we use pytest for these as well. In some cases, it may be desirable to run your test (or parts of it) in a subprocess to avoid process contamination; we use `subprocess.run` inside the pytest function for this. Please add your test to one of the predefined folders. If none of the folders matches semantically, please reach out to @nvidia-nemo/automation in your PR for consultation.
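The subprocess pattern can be sketched as follows (the test name and child script are illustrative): the test launches a fresh Python interpreter via `subprocess.run`, so any global state the child mutates cannot leak back into the pytest process.

```python
import subprocess
import sys

def test_runs_in_clean_interpreter():
    # Run the risky part in a child interpreter so global state
    # (CUDA context, registries, env tweaks) cannot contaminate
    # the pytest process.
    script = "import sys; print('child ok'); sys.exit(0)"
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
        timeout=60,
    )
    assert result.returncode == 0
    assert "child ok" in result.stdout

if __name__ == "__main__":
    test_runs_in_clean_interpreter()
    print("subprocess check passed")
```

In a real functional test the child command would typically invoke a training script rather than an inline `-c` snippet.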
GPU limit: Functional tests must use at most 2 GPUs. Do not add tests that require more than 2 GPUs; they will not fit in the CI resource budget.
Functional tests are placed in tiered launcher scripts inside tests/functional_tests/. Each tier runs in a separate CI job, allowing faster PR feedback while keeping thorough coverage on nightly runs.
| Tier | Prefix | Trigger | Purpose |
|---|---|---|---|
| L0 | `L0_Launch_*.sh` | Every PR, main push, schedule | Core smoke tests; must be fast and stable |
| L1 | `L1_Launch_*.sh` | Main push + schedule; PRs labeled `needs-more-tests` | Broader model/recipe coverage |
| L2 | `L2_Launch_*.sh` | Schedule / `workflow_dispatch` only | VL models, checkpoint conversion, heavy quantization |
When adding a new launcher script, always start with the L0 tier so it runs on every PR. A maintainer will adjust the tier later if the test is too slow or better suited for nightly coverage. You must also update .github/workflows/cicd-main.yml to include it in the corresponding job matrix:
```yaml
# Example: adding an L1 test
- script: L1_Launch_your_new_test
```

Without this step, your new launcher script will not be picked up by CI.
We use uv for managing dependencies. For reproducible builds, our project tracks the generated uv.lock file in the repository.
On a weekly basis, the CI attempts an update of the lock file to test against upstream dependencies.
Adding required (non-optional) dependencies is strongly discouraged and will be strictly reviewed. Every required dependency inflates the package and container image for all downstream consumers, most of whom will not need it. Prefer optional dependencies under an extra group whenever possible.

If your feature requires a dependency that is not already in `pyproject.toml`, submit the dependency change as a separate PR first. Do not bundle dependency additions with feature code; this keeps reviews focused and makes CI failures easier to diagnose.
- Add the dependency to `pyproject.toml` (either via `uv add` or by editing the file directly):

  ```bash
  # Preferred: optional dependency under an extra group
  uv add --optional --extra $EXTRA $DEPENDENCY

  # Required dependency (needs strong justification; affects all downstream)
  uv add $DEPENDENCY
  ```

  `EXTRA` refers to the subgroup of extra dependencies to which you're adding the new dependency. Example: for adding a TRT-LLM-specific dependency, run `uv add --optional --extra trtllm $DEPENDENCY`.

- Regenerate the lock file:

  ```bash
  uv lock
  ```

- Commit both files and open a PR:

  ```bash
  git add pyproject.toml uv.lock
  git commit -s -m "[build] chore: Add $DEPENDENCY"
  git push
  ```

- Once the dependency PR is merged, rebase your feature branch onto `main` and open the feature PR.
We use ruff for linting and formatting. CI does not auto-fix linting and formatting issues, but most issues can be fixed by running the following commands:

```bash
uv run ruff check --fix .
uv run ruff format .
```

Note: If `ruff` is missing, please follow the installation guide.
Every feature PR must evaluate whether documentation and tests need to be added or updated. This applies to all changes, not just large features. Before submitting a PR, check:
- Docs: Does this change need a new doc page, or does an existing doc need updating?
- Tests: Does this change need new unit or functional tests, or do existing tests need updating?
For new key features (e.g., enabling a new model, enabling a new parallelism strategy), documentation is required. The documentation should:
- Explain the motivation and purpose of the feature
- Outline the technical approach and architecture
- Provide clear usage examples and instructions for users
- Document internal implementation details where appropriate
This ensures that all significant changes are well-thought-out and properly documented for future reference. Comprehensive documentation serves two critical purposes:
- User Adoption: Helps users understand how to effectively use the library's features in their projects
- Developer Extensibility: Enables developers to understand the internal architecture and implementation details, making it easier to modify, extend, or adapt the code for their specific use cases
Quality documentation is essential for both the usability of Megatron-Bridge and its ability to be customized by the community.
Refactoring PRs that rename symbols, move files, or change paths are high risk for creating stale references. When a refactor changes any public name or path, you must check whether the following need corresponding updates:
- Docs: Any docs that reference the old names, paths, config keys, or CLI arguments
- Docstrings: Docstrings in the codebase that mention the old names or paths
- Scripts: Example and training scripts under `scripts/` that import or reference the old names or paths
A refactor PR that renames something without updating all references will break users silently. Reviewers should verify these are addressed before approving.
- Follow the existing code style and conventions (see skills/code-style/SKILL.md)
- Write tests for new features
- Update documentation to reflect your changes
- Ensure all tests pass before submitting a PR
- Do not add arbitrary defaults for configs; be as explicit as possible
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or that you have rights to submit it under the same license or a compatible license.
- Any contribution which contains commits that are not signed off will not be accepted.
- To sign off on a commit, simply use the `--signoff` (or `-s`) option when committing your changes:

  ```bash
  git commit -s -m "Add cool feature."
  ```

  This will append the following to your commit message:

  ```
  Signed-off-by: Your Name <your@email.com>
  ```

- Full text of the DCO:

  ```
  Developer Certificate of Origin
  Version 1.1

  Copyright (C) 2004, 2006 The Linux Foundation and its contributors.

  Everyone is permitted to copy and distribute verbatim copies of this
  license document, but changing it is not allowed.

  Developer's Certificate of Origin 1.1

  By making a contribution to this project, I certify that:

  (a) The contribution was created in whole or in part by me and I have
      the right to submit it under the open source license indicated in
      the file; or

  (b) The contribution is based upon previous work that, to the best of
      my knowledge, is covered under an appropriate open source license
      and I have the right under that license to submit that work with
      modifications, whether created in whole or in part by me, under
      the same open source license (unless I am permitted to submit
      under a different license), as indicated in the file; or

  (c) The contribution was provided directly to me by some other person
      who certified (a), (b) or (c) and I have not modified it.

  (d) I understand and agree that this project and the contribution are
      public and that a record of the contribution (including all
      personal information I submit with it, including my sign-off) is
      maintained indefinitely and may be redistributed consistent with
      this project or the open source license(s) involved.
  ```
There are two ways to trigger CI tests on your pull request:
If your GitHub user is configured to use signed commits, CI tests will run automatically when you push commits to your pull request.
Note: Signed commits are different from signing off on commits (which uses the `-s` flag mentioned in the Signing Your Work section).
If you don't have signed commits set up, you can still trigger CI tests manually by commenting on your pull request:
```
/ok to test <commit-SHA>
```

For example:

```
/ok to test a1b2c3d4e5f6
```
Important: You'll need to add this comment for each new commit you push to ensure CI tests run on the latest changes.
You can find the commit SHA in several ways:
- View your pull request's commit history on GitHub
- Run `git log --oneline -1` in your local repository
- Check the commit details in your Git client
Please see our documentation for a detailed guide on contributing new models.
```jsonc
{
    "name": "megatron-bridge-dev",
    "image": "megatron-bridge:latest",
    "runArgs": [
        "--gpus", "all",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        "--shm-size=24g",
        "--privileged",
        "--pid=host"
    ]
    // NOTE: Here is an example of how you can set up some common mounts, environment
    // variables, and your shell. Feel free to adapt it to your development workflow,
    // and remember to replace the paths with your username.
    //"mounts": [
    //    {"source": "/home/yourusername", "target": "/home/yourusername", "type": "bind"},
    //    {"source": "/home/yourusername/.ssh", "target": "/root/yourusername-ssh", "type": "bind"}
    //],
    //"containerEnv": {
    //    "HF_TOKEN_PATH": "/home/yourusername/.cache/huggingface/token",
    //    "HF_HOME": "/home/yourusername/.cache/huggingface",
    //    "HF_DATASETS_CACHE": "/home/yourusername/.cache/huggingface/datasets",
    //    "WANDB_API_KEY": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    //},
    //
    // This (1) marks all directories safe, (2) copies in ssh keys, (3) sources the user's bashrc file.
    //"postStartCommand": "git config --global --add safe.directory '*' && cp -r /root/yourusername-ssh/* /root/.ssh/ && source /home/yourusername/.bashrc"
}
```