CI: GitHub Action migration from Jenkins CI #322
Closed
Changes from 88 commits
120 commits
87d437e
First drawt of the workflow
leo-automation e5c86af
Update transformer-engine-ci.yml
leo-automation 335b09f
Fixes
leo-automation 8136739
Update transformer-engine-ci.yml
leo-automation 440a781
Update transformer-engine-ci.yml
leo-automation 578ddce
toLower
leo-automation 99863b1
CI fixes
leo-automation b984353
Added build-only runners
leo-automation 63d5046
Debug docker credentials
leo-automation 370ff19
Debug docker credentials
leo-automation 4981dc1
Node js is missing on the node
leo-automation c96d76f
Remove devcontainer
leo-automation 82ea4e5
More fixes
leo-automation bae63ab
Typo
leo-automation 29955e9
Permissions inside the container
leo-automation 227ca35
sudo
leo-automation 9fc2342
Update transformer-engine-ci.yml
leo-automation 6d3a025
Update transformer-engine-ci.yml
leo-automation 19a78cd
Update transformer-engine-ci.yml
leo-automation 93a42ee
Update transformer-engine-ci.yml
leo-automation 40f8d08
Update transformer-engine-ci.yml
leo-automation d050431
Update transformer-engine-ci.yml
leo-automation 5aeaf38
Update transformer-engine-ci.yml
leo-automation 53633e5
Update transformer-engine-ci.yml
leo-automation 8989cd5
Update transformer-engine-ci.yml
leo-automation 06c3c30
Update transformer-engine-ci.yml
leo-automation 866b582
Update transformer-engine-ci.yml
leo-automation 0c78dad
CI: add CMAKE_CXX_COMPILER and CMAKE_CC_COMPILER
aaea1ac
DCMAKE_CXX_COMPILER
7694965
ls -l
18f6cb1
view setup.py
9e95632
add dcmake cxx compiler to setup.py
9a46d85
absolute path to dcmake_cxx_compiler
6f92c32
hip_arch
9ca4b79
hip-runtime-amd and hip-dev
a726165
dhip root dir
271fadb
rocm_path
7634219
add HIP_PATH
8e30435
add setuptools and wheel
0a50f6e
tree
c8fad75
tree
662c74a
rm hip dir
7b18142
back to basic
9c24673
rm build-essential
4b7d574
Run CI
leo-automation 064461b
Fix the path
leo-automation 1b402b9
Test rootless
leo-automation 37a880d
Debug
leo-automation 5c2ea46
sudo
leo-automation 36596e9
sudo debug
leo-automation 4bde899
sudo
leo-automation 8b03bea
AMDGPU_TARGETS
leo-automation dd2feaa
Fix
leo-automation 8e34a89
Debugiing
leo-automation 10317db
Debugging
leo-automation a8afdd0
Debugging
leo-automation 7f4f53e
PATH problem perhaps
leo-automation 0457361
Debugging
leo-automation ad32476
Debug
leo-automation b18d52d
Runner change
leo-automation e998ae4
Fix cplus
leo-automation c56ba18
Fix
leo-automation 5840c36
Fix
leo-automation 1037792
Correct cmaker path
leo-automation bbe956c
Debug
leo-automation c27bf79
Cmake link
leo-automation 9159ef0
Remove -f
leo-automation 5e9f3b1
Debug
leo-automation d847e3d
PATH
leo-automation 6122314
Refactoring to run container as a step
leo-automation b1cd2e1
build-only
leo-automation 28b192d
Houskeeping
leo-automation 3a3cff9
All in one stage
leo-automation efae349
GPU runner doesn't have access to internal network, so moved all netw…
leo-automation 60e5d75
Split docker image
leo-automation b2c4c15
Fix /opt/cmake
leo-automation 052f6d8
Housekeeping
leo-automation 79d8666
Typo
leo-automation f7e3442
Pip install incorrect path
leo-automation f0cc321
Houskeeping
leo-automation ada7ec6
Resolving comments
leo-automation 80bfdcb
New line at the end
leo-automation 9672ff1
Small changes
leo-automation 3a0fb27
Consolidate under one runner with internal network access
leo-automation 9c232d8
Fix FFI import. Add distributed tests hang workaround (#347)
ipanfilo 47eb7b7
Update rocm-ci.yml
mkunredd 28dc543
Submodules
leo-automation a037a09
Make TE ROCm wheels building image directly from manylinix image (#340)
ipanfilo 083fcfc
Added timeout
leo-automation 21ff8d6
Resolve comments
leo-automation 72ec76b
[CI] Hotfix test_gemm_autotune update (#353)
VeeraRajasekhar b092058
MXFP8 test scale off by 1 fix (#338)
alextmagro 9187659
[CI] Removed Jax jit workaround, replaced with XLA_FLAGS=--xla_gpu_en…
VeeraRajasekhar 22815fa
Run CI
leo-automation 2eb4714
Runner update
leo-automation dabb812
Update runners
leo-automation d8beee1
Debug
leo-automation 5a7c74d
Debug
leo-automation 83735d8
Remove HIP macros around std:: math functions (#343)
alextmagro a007240
Run CI
leo-automation b4da0d5
Update runners label
leo-automation 533bee5
[TE] [AITER] Add prebuilt AITER download and upload flow (#335)
VeeraRajasekhar 96d48ce
std::max type mismatch hotfix (#361)
alextmagro 6b8a47d
CI: allow numpy 2.0 (#366)
ipanfilo 9a987f8
Relax tolerance to pass 29x29x17389NT GEMM on MI350 (#365)
ipanfilo 90c04bc
FIX Occasional import error when only building for a single framework…
Micky774 7887013
Integrate aiter HD192_HD128 backward kernels (#364)
VeeraRajasekhar e9c7361
Use .info/version for ROCm verison (#368)
ipanfilo 87fece2
Enable aligned vectorized memory ops for MXFP8 cast (#342)
alextmagro 32e2d1d
[ROCm] align the softmax aux shape with NVTE upstream (#371)
wangye805 5685b2c
Te2.4 fsdp2 fp8 allgather autocast (#349)
sudhu2k 3a6c5b6
Test CI plus small changes
leo-automation ba31fd8
Run CI
leo-automation 956fa26
Update runners label
leo-automation cd612d7
Update runners
leo-automation 6bbd03c
[TE] Implement Triton current scaling (#341)
matthiasdiener c95f9db
Update benchmark script to support fwd_v3 and a16 (#373)
VeeraRajasekhar 9eaaf4c
Enable AITER V3 kernels by default (#372)
ipanfilo 031d73b
Add new logic from Jenkins and continue-on-error: true for tests
leo-automation 94174a4
Merge branch 'leo/migrate-ci-to-gha' of https://github.com/ROCm/Trans…
leo-automation
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,210 @@ | ||
| # Copyright (c) 2024-2025, Advanced Micro Devices, Inc. All rights reserved. | ||
| # | ||
| # See LICENSE for license information. | ||
| | ||
| name: TransformerEngine CI | ||
| | ||
| on: | ||
| push: | ||
| branches: | ||
| - 'dev' | ||
| - 'release_v1.*_rocm' | ||
| - 'release_v2.*_rocm' | ||
| pull_request: | ||
| branches: | ||
| - 'dev' | ||
| - 'release_v1.**_rocm' | ||
| - 'release_v2.**_rocm' | ||
| workflow_dispatch: | ||
| inputs: | ||
| test_level: | ||
| description: 'Test Level (1-3)' | ||
| required: true | ||
| default: '1' | ||
| | ||
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.ref }} | ||
| cancel-in-progress: true | ||
| | ||
| jobs: | ||
| build_and_test: | ||
| name: Build and Test on GPU | ||
| timeout-minutes: 720 | ||
| runs-on: [linux-mi308-4] | ||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| submodules: 'recursive' | ||
| | ||
| - name: Select Docker Image Tag | ||
| id: select-image | ||
| env: | ||
| DEV_IMAGE: ${{ vars.DEV_DOCKER_IMAGE }} | ||
| REL_IMAGE: ${{ vars.REL613_DOCKER_IMAGE }} | ||
| run: | | ||
| BRANCH_NAME="${{ github.base_ref || github.ref_name }}" | ||
| echo "Determining image for branch: $BRANCH_NAME" | ||
| DEV_DOCKER_IMAGE="$DEV_IMAGE" | ||
| REL613_DOCKER_IMAGE="$REL_IMAGE" | ||
| IMAGE_TO_USE="$DEV_DOCKER_IMAGE" | ||
| if [[ $BRANCH_NAME =~ ^release_v([0-9]+)\.([0-9]+)_rocm$ ]]; then | ||
| MAJOR_VERSION=${BASH_REMATCH[1]} | ||
| MINOR_VERSION=${BASH_REMATCH[2]} | ||
| if (( MAJOR_VERSION == 1 )); then | ||
| if (( MINOR_VERSION == 13 || MINOR_VERSION == 14 )); then IMAGE_TO_USE="$REL613_DOCKER_IMAGE"; fi | ||
| fi | ||
| fi | ||
| echo "Selected image: $IMAGE_TO_USE" | ||
| echo "image-tag=$IMAGE_TO_USE" >> "$GITHUB_OUTPUT" | ||
| | ||
| - name: Log in to Docker registry | ||
| uses: docker/login-action@v3 | ||
| with: | ||
| username: ${{ secrets.ARTIFACTORY_USER }} | ||
| password: ${{ secrets.ARTIFACTORY_PAT }} | ||
| | ||
| - name: Pull Docker Image | ||
| run: | | ||
| docker pull ${{ steps.select-image.outputs.image-tag }} | ||
| | ||
| - name: Run Container | ||
| run: | | ||
| docker run -dt \ | ||
| --name te-runner \ | ||
| --network=host \ | ||
| --device=/dev/dri --device=/dev/kfd \ | ||
| --shm-size=16G \ | ||
| --pid=host \ | ||
| --group-add $(getent group render | cut -d: -f3) \ | ||
| --group-add $(getent group video | cut -d: -f3) \ | ||
| -v "${{ github.workspace }}:/workspace" \ | ||
| -w /workspace \ | ||
| ${{ steps.select-image.outputs.image-tag }} | ||
| | ||
| - name: Determine GPU Architecture via rocminfo | ||
| id: gpu-arch | ||
| run: | | ||
| # Run rocminfo inside the container and capture the output | ||
| ARCH=$(docker exec te-runner bash -c "rocminfo | grep -m 1 -oP 'gfx[0-9a-fA-F]+'") | ||
| if [ -z "$ARCH" ]; then | ||
| echo "::error::Could not determine GPU architecture using rocminfo inside the container." | ||
| # Optional: Print full rocminfo output for debugging | ||
| docker exec te-runner rocminfo | ||
| exit 1 | ||
| fi | ||
| echo "Detected GPU Arch: $ARCH" | ||
| echo "arch=$ARCH" >> "$GITHUB_OUTPUT" | ||
| | ||
| - name: Build Project | ||
| run: | | ||
| docker exec \ | ||
| -e GPU_ARCH=${{ steps.gpu-arch.outputs.arch }} \ | ||
| te-runner bash -c "$(cat <<'EOF' | ||
| set -ex | ||
| | ||
| export HIP_PATH="" | ||
| export PYTORCH_ROCM_ARCH=$GPU_ARCH | ||
| export NVTE_ROCM_ARCH=$GPU_ARCH | ||
| pip wheel . -v -w /workspace/dist | ||
| pip install /workspace/dist/*.whl | ||
| EOF | ||
| )" | ||
| | ||
| - name: Run sGPU tests | ||
| run: | | ||
| docker exec te-runner bash -c "$(cat <<'EOF' | ||
| #!/usr/bin/bash | ||
| set -x -o pipefail | ||
| ulimit -c 0 # Disable core dumps | ||
| | ||
| # debug output | ||
| ls -d /opt/rocm* | ||
| python --version | ||
| pip list | grep -E "transformer_e|torch|jax|numpy|ml_dtypes|typing_ext" | ||
| | ||
| HIP_VISIBLE_DEVICES=1 ci/pytorch.sh > /workspace/torch_sgpu.log 2>&1 & | ||
| torch_pid=$!; echo Pytorch test pid $! | ||
| | ||
| HIP_VISIBLE_DEVICES=2 ci/jax.sh > /workspace/jax_sgpu.log 2>&1 & | ||
| jax_pid=$!; echo JAX test pid $! | ||
| | ||
| HIP_VISIBLE_DEVICES=3 ci/core.sh > /workspace/core_sgpu.log 2>&1 & | ||
| core_pid=$!; echo Core test pid $! | ||
| | ||
| wait $core_pid; core_rc=$? | ||
| wait $jax_pid; jax_rc=$? | ||
| wait $torch_pid; torch_rc=$? | ||
| | ||
| if [ $torch_rc -ne 0 ]; then echo "Pytorch sGPU test FAILED."; fi | ||
| if [ $jax_rc -ne 0 ]; then echo "JAX sGPU test FAILED."; fi | ||
| if [ $core_rc -ne 0 ]; then echo "Core sGPU test FAILED."; fi | ||
| | ||
| [ "$torch_rc" -eq 0 ] && [ "$jax_rc" -eq 0 ] && [ "$core_rc" -eq 0 ] | ||
| EOF | ||
| )" | ||
| | ||
| - name: Run mGPU tests | ||
| run: | | ||
| docker exec te-runner bash -c "$(cat <<'EOF' | ||
| #!/usr/bin/bash | ||
| set -x -o pipefail | ||
| ulimit -c 0 # Disable core dumps | ||
| | ||
| ci/pytorch.sh > /workspace/torch_mgpu.log 2>&1; torch_rc=$? | ||
| ci/jax.sh > /workspace/jax_mgpu.log 2>&1; jax_rc=$? | ||
| | ||
| if [ $torch_rc -ne 0 ]; then echo "Pytorch mGPU test FAILED."; fi | ||
| if [ $jax_rc -ne 0 ]; then echo "JAX mGPU test FAILED."; fi | ||
| | ||
| [ "$torch_rc" -eq 0 ] && [ "$jax_rc" -eq 0 ] | ||
| EOF | ||
| )" | ||
| | ||
| - name: Run Examples | ||
| run: | | ||
| docker exec te-runner bash -c "$(cat <<'EOF' | ||
| #!/usr/bin/bash | ||
| set -ex -o pipefail | ||
| ulimit -c 0 # Disable core dumps | ||
| | ||
| cd /workspace/examples/pytorch/mnist | ||
| python main.py | ||
| python main.py --use-te | ||
| python main.py --use-fp8 | ||
| | ||
| cd /workspace/examples/jax/mnist | ||
| pip3 install -r requirements.txt | ||
| python test_single_gpu_mnist.py | ||
| python test_single_gpu_mnist.py --use-te | ||
| python test_single_gpu_mnist.py --use-fp8 | ||
| | ||
| cd /workspace/examples/jax/encoder | ||
| pip3 install -r requirements.txt | ||
| python test_single_gpu_encoder.py | ||
| python test_single_gpu_encoder.py --use-fp8 | ||
| EOF | ||
| )" | ||
| | ||
| - name: Copy logs and reports from container | ||
| if: always() | ||
| run: | | ||
| docker cp te-runner:/workspace/torch_sgpu.log ./torch_sgpu.log || true | ||
| docker cp te-runner:/workspace/jax_sgpu.log ./jax_sgpu.log || true | ||
| docker cp te-runner:/workspace/core_sgpu.log ./core_sgpu.log || true | ||
| docker cp te-runner:/workspace/torch_mgpu.log ./torch_mgpu.log || true | ||
| docker cp te-runner:/workspace/jax_mgpu.log ./jax_mgpu.log || true | ||
| | ||
| - name: Upload logs and test reports | ||
| if: always() | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: logs-and-reports | ||
| path: | | ||
| *.log | ||
| if-no-files-found: ignore | ||
| retention-days: 1 | ||
| | ||
| - name: Cleanup container | ||
| if: always() | ||
| run: docker rm -f te-runner || true | ||
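
The `Select Docker Image Tag` step picks the release image only for `release_v1.13_rocm` and `release_v1.14_rocm`, and falls back to the dev image for every other branch. A minimal sketch of that rule as a standalone function, useful for checking the mapping outside of CI, could look like the following. The function name `select_image` and the placeholder image names `dev-image` and `rel613-image` are assumptions for illustration; the workflow itself uses a regex with `BASH_REMATCH`, whereas this sketch swaps in a plain `case` match, which should give the same result for branch names in their canonical form.

```shell
#!/bin/sh
# Sketch only: mirrors the branch-to-image rule from the workflow step.
# select_image and the image names passed to it are hypothetical.

select_image() {
  branch=$1
  dev_image=$2
  rel_image=$3
  case $branch in
    # Only these two release branches use the release image.
    release_v1.13_rocm|release_v1.14_rocm)
      printf '%s\n' "$rel_image"
      ;;
    # Everything else (dev, other release branches) uses the dev image.
    *)
      printf '%s\n' "$dev_image"
      ;;
  esac
}

# Print the mapping for a few representative branch names.
for b in dev release_v1.13_rocm release_v1.14_rocm release_v2.4_rocm; do
  printf '%s -> %s\n' "$b" "$(select_image "$b" dev-image rel613-image)"
done
```

A `case` table like this is easier to extend when another release line needs its own image, at the cost of listing each branch explicitly rather than comparing parsed version numbers.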
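
The `Run sGPU tests` step in the workflow starts three test suites in the background, waits on each PID, and fails the step only after every suite has finished and reported. The sketch below shows that pattern in isolation. `run_fake_suite` is a hypothetical stand-in for `ci/pytorch.sh`, `ci/jax.sh` and `ci/core.sh`, and the return codes are made up for the demonstration. Unlike the workflow, it captures each status with `|| rc=$?`, so it would also behave the same if `set -e` were enabled.

```shell
#!/bin/sh
# Sketch only: run several jobs in parallel and collect each exit status.
# run_fake_suite is a hypothetical stand-in for the real ci/*.sh scripts.

run_fake_suite() {
  # $1 = seconds to sleep, $2 = exit status to report
  sleep "$1"
  return "$2"
}

# Start all suites in the background and remember their PIDs.
run_fake_suite 0 0 &
torch_pid=$!
run_fake_suite 0 3 &
jax_pid=$!
run_fake_suite 0 0 &
core_pid=$!

# Wait for each PID; "|| rc=$?" records a failure without aborting the script.
torch_rc=0; wait "$torch_pid" || torch_rc=$?
jax_rc=0;   wait "$jax_pid"   || jax_rc=$?
core_rc=0;  wait "$core_pid"  || core_rc=$?

# Report every failing suite, then compute one overall result.
failed=0
for pair in "torch:$torch_rc" "jax:$jax_rc" "core:$core_rc"; do
  name=${pair%%:*}
  rc=${pair##*:}
  if [ "$rc" -ne 0 ]; then
    echo "$name suite FAILED (rc=$rc)"
    failed=1
  fi
done
echo "overall failed=$failed"
```

Waiting on every PID before deciding the result means a failure in one suite does not hide the logs or status of the others, which matches the intent of the step in the workflow.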