[CICD] support Metax MACA workflow #48

Merged
Darryl233 merged 160 commits into flagos-ai:main from BrianPei:main
Apr 2, 2026

Conversation

@qqjxzxq qqjxzxq commented Mar 17, 2026

Description

This PR implements and integrates the Metax (MACA) workflow into TransformerEngine-FL. It enables automated CI/CD pipelines, functional training tests, and unit tests specifically optimized for Metax hardware environments.

Key updates in this version: TransformerEngine (TE) now compiles successfully on Metax, and the Metax test workflow is aligned with NVIDIA's standard QA workflows.

Fixes # (issue_number_if_applicable)

Type of change

  • New feature (non-breaking change which adds functionality)
  • Infra/Build change (changes to CI/CD workflows or build scripts)
  • Documentation change
  • Bug fix
  • Code refactoring

Changes

1. Build & Compilation

  • TE Build Completion: TransformerEngine now compiles and builds successfully on the Metax platform.
  • Workflow Alignment: Designed the Metax testing workflow based on NVIDIA's qa-l0-te-cpp-unittest-pytorch-lint standard to ensure parity with upstream quality gates.

2. CI/CD Infrastructure & Test Modules

  • Metax Platform Support: Added configs/metax.yml to define Metax-specific runner labels, images, and device configurations.
  • Verified Workflow Modules: The following modules have been implemented and verified on the Metax platform:
    • pytorch-lint: Static code analysis and linting.
    • pytorch-debug: Debug-level build and basic functional verification.
    • pytorch-unittest: Core unit testing for Metax-adapted operators.
  • Workflow Modularization:
    • Introduced configs/all_tests_common.yml and configs/unit_tests_common.yml for reusable test logic.
    • Added configs/all_tests_metax.yml as the dedicated entry point for Metax functional testing.
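
The layering described above (shared test definitions in a common file, plus a platform file that supplies runner labels, images, and device settings) can be sketched as a recursive merge. This is only an illustration of the idea: the keys (`tests`, `container`, `runner_labels`, `device`), the image name, and the merge helper are all hypothetical and are not taken from the actual `configs/metax.yml` or `configs/unit_tests_common.yml`.

```python
from copy import deepcopy


def merge_config(common: dict, platform: dict) -> dict:
    """Return a new dict in which platform values override common ones.

    Nested dicts are merged key by key; any other value type is replaced.
    Neither input is modified.
    """
    merged = deepcopy(common)
    for key, value in platform.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-in for a shared file such as configs/unit_tests_common.yml.
COMMON = {
    "tests": ["pytorch-lint", "pytorch-debug", "pytorch-unittest"],
    "container": {"image": None, "pull_policy": "always"},
}

# Hypothetical stand-in for a platform file such as configs/metax.yml.
METAX = {
    "runner_labels": ["self-hosted", "metax"],
    "container": {
        "image": "localhost:5000/te-metax:latest",  # assumed image name
        "pull_policy": "never",
    },
    "device": "C500",
}

if __name__ == "__main__":
    print(merge_config(COMMON, METAX))
```

Keeping the test list in the common part and only the hardware details in the platform part is what lets another backend reuse the same test logic with a different override file.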

3. Environment & Runtime Fixes

  • Image Management: Set image-pull-policy: never and --pull never so that jobs use images already available from the local registry (localhost:5000) instead of attempting a remote pull, which reduces startup time in local cluster environments.
  • Dynamic Resource Scaling:
    • Adapted torchrun and training scripts to support dynamic GPU/Accelerator counts (specifically for C500 clusters).
    • Removed hardcoded GPU host configurations to improve portability across different Metax nodes.
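
A minimal sketch of how the two runtime changes above might combine: a container launch that never pulls, wrapping a `torchrun` call whose process count comes from the environment rather than a hardcoded value. The environment variable `ACCELERATOR_COUNT`, the image name, and `train.py` are assumptions made for illustration; the PR's actual scripts may detect the device count differently. Only the `docker run --pull never` and `torchrun --nproc_per_node` flags are real CLI options.

```python
from __future__ import annotations

import os
import shlex


def accelerator_count(env: dict | None = None, default: int = 1) -> int:
    """Read the device count from a (hypothetical) ACCELERATOR_COUNT variable.

    Falls back to `default` when the variable is missing, non-numeric, or zero.
    """
    env = os.environ if env is None else env
    raw = env.get("ACCELERATOR_COUNT", "")
    return int(raw) if raw.isdecimal() and int(raw) > 0 else default


def torchrun_cmd(script: str, nproc: int) -> list[str]:
    """Build a single-node torchrun command with one process per device."""
    return ["torchrun", "--standalone", f"--nproc_per_node={nproc}", script]


def docker_run_cmd(image: str, command: list[str]) -> list[str]:
    """Build a `docker run` call that fails instead of pulling a missing image."""
    return ["docker", "run", "--rm", "--pull", "never", image, *command]


if __name__ == "__main__":
    inner = torchrun_cmd("train.py", accelerator_count())
    # Image name below is an assumed example of a local-registry tag.
    print(shlex.join(docker_run_cmd("localhost:5000/te-metax:latest", inner)))
```

Passing the count in as data keeps the same script usable on nodes with different numbers of C500 devices, which is the portability goal stated above.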

4. Cleanup

  • Removed legacy CUDA/Ascend specific configurations from the Metax workflow path to prevent environment contamination.

Hardware/Environment Verified

  • Platform: Metax MACA
  • Accelerator: C500
  • Registry: Local Registry (localhost:5000)

TODO / Next Steps

  • Integrate the Metax-specific adaptation workflow into the central platform.
  • Generate and upload comprehensive Benchmark and Performance test reports.

Checklist:

  • I have read and followed the contributing guidelines.
  • The functionality is complete and verified on Metax hardware.
  • I have commented my code, particularly in hardware-specific adaptation areas.
  • My changes generate no new warnings.
  • I have added/updated tests that prove my feature works on the MACA platform.
  • New and existing unit tests (Lint, Debug, Unittest) pass locally in the Metax environment.

CLAassistant commented Mar 17, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ qqjxzxq
❌ peiyu


peiyu does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@qqjxzxq qqjxzxq closed this Mar 17, 2026
@qqjxzxq qqjxzxq reopened this Mar 17, 2026
Author

qqjxzxq commented Mar 31, 2026

I have fixed the formatting issues, please trigger the CI again @Darryl233

@Darryl233 Darryl233 changed the title from "ci: support Metax MACA workflow" to "[CICD] support Metax MACA workflow" Apr 1, 2026
Collaborator

@Darryl233 Darryl233 left a comment

LGTM

Collaborator

@BrianPei BrianPei left a comment

LGTM

@xmhubj xmhubj requested review from xmhubj April 2, 2026 03:01
Collaborator

xmhubj commented Apr 2, 2026

@BrianPei please sign the CLA.

@Darryl233 Darryl233 merged commit ebcfadc into flagos-ai:main Apr 2, 2026
16 of 17 checks passed
6 participants