Closed
87d437e
First drawt of the workflow
leo-automation Sep 17, 2025
e5c86af
Update transformer-engine-ci.yml
leo-automation Sep 26, 2025
335b09f
Fixes
leo-automation Sep 26, 2025
8136739
Update transformer-engine-ci.yml
leo-automation Sep 26, 2025
440a781
Update transformer-engine-ci.yml
leo-automation Sep 26, 2025
578ddce
toLower
leo-automation Sep 26, 2025
99863b1
CI fixes
leo-automation Sep 26, 2025
b984353
Added build-only runners
leo-automation Sep 29, 2025
63d5046
Debug docker credentials
leo-automation Sep 30, 2025
370ff19
Debug docker credentials
leo-automation Sep 30, 2025
4981dc1
Node js is missing on the node
leo-automation Sep 30, 2025
c96d76f
Remove devcontainer
leo-automation Sep 30, 2025
82ea4e5
More fixes
leo-automation Sep 30, 2025
bae63ab
Typo
leo-automation Sep 30, 2025
29955e9
Permissions inside the container
leo-automation Sep 30, 2025
227ca35
sudo
leo-automation Sep 30, 2025
9fc2342
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
6d3a025
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
19a78cd
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
93a42ee
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
40f8d08
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
d050431
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
5aeaf38
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
53633e5
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
8989cd5
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
06c3c30
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
866b582
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
0c78dad
CI: add CMAKE_CXX_COMPILER and CMAKE_CC_COMPILER
Oct 2, 2025
aaea1ac
DCMAKE_CXX_COMPILER
Oct 2, 2025
7694965
ls -l
Oct 2, 2025
18f6cb1
view setup.py
Oct 2, 2025
9e95632
add dcmake cxx compiler to setup.py
Oct 2, 2025
9a46d85
absolute path to dcmake_cxx_compiler
Oct 2, 2025
6f92c32
hip_arch
Oct 2, 2025
9ca4b79
hip-runtime-amd and hip-dev
Oct 2, 2025
a726165
dhip root dir
Oct 2, 2025
271fadb
rocm_path
Oct 2, 2025
7634219
add HIP_PATH
Oct 3, 2025
8e30435
add setuptools and wheel
Oct 3, 2025
0a50f6e
tree
Oct 3, 2025
c8fad75
tree
Oct 3, 2025
662c74a
rm hip dir
Oct 3, 2025
7b18142
back to basic
Oct 3, 2025
9c24673
rm build-essential
Oct 3, 2025
4b7d574
Run CI
leo-automation Oct 15, 2025
064461b
Fix the path
leo-automation Oct 15, 2025
1b402b9
Test rootless
leo-automation Oct 15, 2025
37a880d
Debug
leo-automation Oct 15, 2025
5c2ea46
sudo
leo-automation Oct 15, 2025
36596e9
sudo debug
leo-automation Oct 15, 2025
4bde899
sudo
leo-automation Oct 15, 2025
8b03bea
AMDGPU_TARGETS
leo-automation Oct 15, 2025
dd2feaa
Fix
leo-automation Oct 15, 2025
8e34a89
Debugiing
leo-automation Oct 15, 2025
10317db
Debugging
leo-automation Oct 15, 2025
a8afdd0
Debugging
leo-automation Oct 15, 2025
7f4f53e
PATH problem perhaps
leo-automation Oct 15, 2025
0457361
Debugging
leo-automation Oct 15, 2025
ad32476
Debug
leo-automation Oct 16, 2025
b18d52d
Runner change
leo-automation Oct 16, 2025
e998ae4
Fix cplus
leo-automation Oct 16, 2025
c56ba18
Fix
leo-automation Oct 16, 2025
5840c36
Fix
leo-automation Oct 16, 2025
1037792
Correct cmaker path
leo-automation Oct 16, 2025
bbe956c
Debug
leo-automation Oct 16, 2025
c27bf79
Cmake link
leo-automation Oct 16, 2025
9159ef0
Remove -f
leo-automation Oct 16, 2025
5e9f3b1
Debug
leo-automation Oct 16, 2025
d847e3d
PATH
leo-automation Oct 16, 2025
6122314
Refactoring to run container as a step
leo-automation Oct 16, 2025
b1cd2e1
build-only
leo-automation Oct 16, 2025
28b192d
Houskeeping
leo-automation Oct 16, 2025
3a3cff9
All in one stage
leo-automation Oct 17, 2025
efae349
GPU runner doesn't have access to internal network, so moved all netw…
leo-automation Oct 17, 2025
60e5d75
Split docker image
leo-automation Oct 20, 2025
b2c4c15
Fix /opt/cmake
leo-automation Oct 20, 2025
052f6d8
Housekeeping
leo-automation Oct 20, 2025
79d8666
Typo
leo-automation Oct 20, 2025
f7e3442
Pip install incorrect path
leo-automation Oct 21, 2025
f0cc321
Houskeeping
leo-automation Oct 21, 2025
ada7ec6
Resolving comments
leo-automation Oct 22, 2025
80bfdcb
New line at the end
leo-automation Oct 22, 2025
9672ff1
Small changes
leo-automation Oct 22, 2025
3a0fb27
Consolidate under one runner with internal network access
leo-automation Oct 23, 2025
9c232d8
Fix FFI import. Add distributed tests hang workaround (#347)
ipanfilo Oct 23, 2025
47eb7b7
Update rocm-ci.yml
mkunredd Oct 23, 2025
28dc543
Submodules
leo-automation Oct 23, 2025
a037a09
Make TE ROCm wheels building image directly from manylinix image (#340)
ipanfilo Oct 27, 2025
083fcfc
Added timeout
leo-automation Oct 29, 2025
21ff8d6
Resolve comments
leo-automation Oct 31, 2025
72ec76b
[CI] Hotfix test_gemm_autotune update (#353)
VeeraRajasekhar Oct 31, 2025
b092058
MXFP8 test scale off by 1 fix (#338)
alextmagro Oct 31, 2025
9187659
[CI] Removed Jax jit workaround, replaced with XLA_FLAGS=--xla_gpu_en…
VeeraRajasekhar Oct 31, 2025
22815fa
Run CI
leo-automation Nov 3, 2025
2eb4714
Runner update
leo-automation Nov 3, 2025
dabb812
Update runners
leo-automation Nov 3, 2025
d8beee1
Debug
leo-automation Nov 3, 2025
5a7c74d
Debug
leo-automation Nov 3, 2025
83735d8
Remove HIP macros around std:: math functions (#343)
alextmagro Nov 3, 2025
a007240
Run CI
leo-automation Nov 4, 2025
b4da0d5
Update runners label
leo-automation Nov 4, 2025
533bee5
[TE] [AITER] Add prebuilt AITER download and upload flow (#335)
VeeraRajasekhar Nov 5, 2025
96d48ce
std::max type mismatch hotfix (#361)
alextmagro Nov 7, 2025
6b8a47d
CI: allow numpy 2.0 (#366)
ipanfilo Nov 7, 2025
9a987f8
Relax tolerance to pass 29x29x17389NT GEMM on MI350 (#365)
ipanfilo Nov 8, 2025
90c04bc
FIX Occasional import error when only building for a single framework…
Micky774 Nov 10, 2025
7887013
Integrate aiter HD192_HD128 backward kernels (#364)
VeeraRajasekhar Nov 11, 2025
e9c7361
Use .info/version for ROCm verison (#368)
ipanfilo Nov 12, 2025
87fece2
Enable aligned vectorized memory ops for MXFP8 cast (#342)
alextmagro Nov 13, 2025
32e2d1d
[ROCm] align the softmax aux shape with NVTE upstream (#371)
wangye805 Nov 14, 2025
5685b2c
Te2.4 fsdp2 fp8 allgather autocast (#349)
sudhu2k Nov 17, 2025
3a6c5b6
Test CI plus small changes
leo-automation Nov 17, 2025
ba31fd8
Run CI
leo-automation Nov 17, 2025
956fa26
Update runners label
leo-automation Nov 17, 2025
cd612d7
Update runners
leo-automation Nov 17, 2025
6bbd03c
[TE] Implement Triton current scaling (#341)
matthiasdiener Nov 18, 2025
c95f9db
Update benchmark script to support fwd_v3 and a16 (#373)
VeeraRajasekhar Nov 18, 2025
9eaaf4c
Enable AITER V3 kernels by default (#372)
ipanfilo Nov 19, 2025
031d73b
Add new logic from Jenkins and continue-on-error: true for tests
leo-automation Nov 19, 2025
94174a4
Merge branch 'leo/migrate-ci-to-gha' of https://github.com/ROCm/Trans…
leo-automation Nov 19, 2025
210 changes: 210 additions & 0 deletions .github/workflows/rocm-ci.yml
@@ -0,0 +1,210 @@
# Copyright (c) 2024-2025, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.

name: TransformerEngine CI

on:
  push:
    branches:
      - 'dev'
      - 'release_v1.*_rocm'
      - 'release_v2.*_rocm'
  pull_request:
    branches:
      - 'dev'
      - 'release_v1.**_rocm'
      - 'release_v2.**_rocm'
  workflow_dispatch:
    inputs:
      test_level:
        description: 'Test Level (1-3)'
        required: true
        default: '1'

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build_and_test:
    name: Build and Test on GPU
    timeout-minutes: 720
    runs-on: [linux-mi308-4]
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          submodules: 'recursive'

      - name: Select Docker Image Tag
        id: select-image
        env:
          DEV_IMAGE: ${{ vars.DEV_DOCKER_IMAGE }}
          REL_IMAGE: ${{ vars.REL613_DOCKER_IMAGE }}
        run: |
          BRANCH_NAME="${{ github.base_ref || github.ref_name }}"
          echo "Determining image for branch: $BRANCH_NAME"
          DEV_DOCKER_IMAGE="$DEV_IMAGE"
          REL613_DOCKER_IMAGE="$REL_IMAGE"
          IMAGE_TO_USE="$DEV_DOCKER_IMAGE"
          if [[ $BRANCH_NAME =~ ^release_v([0-9]+)\.([0-9]+)_rocm$ ]]; then
            MAJOR_VERSION=${BASH_REMATCH[1]}
            MINOR_VERSION=${BASH_REMATCH[2]}
            if (( MAJOR_VERSION == 1 )); then
              if (( MINOR_VERSION == 13 || MINOR_VERSION == 14 )); then IMAGE_TO_USE="$REL613_DOCKER_IMAGE"; fi
            fi
          fi
          echo "Selected image: $IMAGE_TO_USE"
          echo "image-tag=$IMAGE_TO_USE" >> $GITHUB_OUTPUT

      - name: Log in to Docker registry
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.ARTIFACTORY_USER }}
          password: ${{ secrets.ARTIFACTORY_PAT }}

      - name: Pull and Save Docker Image
        run: |
          docker pull ${{ steps.select-image.outputs.image-tag }}

      - name: Run Container
        run: |
          docker run -dt \
            --name te-runner \
            --network=host \
            --device=/dev/dri --device=/dev/kfd \
            --shm-size=16G \
            --pid=host \
            --group-add $(getent group render | cut -d: -f3) \
            --group-add $(getent group video | cut -d: -f3) \
            -v "${{ github.workspace }}:/workspace" \
            -w /workspace \
            ${{ steps.select-image.outputs.image-tag }}

      - name: Determine GPU Architecture via rocminfo
        id: gpu-arch
        run: |
          # Run rocminfo inside the container and capture the output
          ARCH=$(docker exec te-runner bash -c "rocminfo | grep -m 1 -oP 'gfx[0-9a-fA-F]+'")
          if [ -z "$ARCH" ]; then
            echo "::error::Could not determine GPU architecture using rocminfo inside the container."
            # Optional: Print full rocminfo output for debugging
            docker exec te-runner rocminfo
            exit 1
          fi
          echo "Detected GPU Arch: $ARCH"
          echo "arch=$ARCH" >> $GITHUB_OUTPUT

      - name: Build Project
        run: |
          docker exec \
            -e GPU_ARCH=${{ steps.gpu-arch.outputs.arch }} \
            te-runner bash -c "$(cat <<'EOF'
          set -ex

          export HIP_PATH=""
          export PYTORCH_ROCM_ARCH=$GPU_ARCH
          export NVTE_ROCM_ARCH=$GPU_ARCH
          pip wheel . -v -w /workspace/dist
          pip install /workspace/dist/*.whl
          EOF
          )"

      - name: Run sGPU tests
        run: |
          docker exec te-runner bash -c "$(cat <<'EOF'
          #!/usr/bin/bash
          set -x -o pipefail
          ulimit -c 0 # Disable core dumps

          # debug output
          ls -d /opt/rocm*
          python --version
          pip list | egrep "transformer_e|torch|jax|numpy|ml_dtypes|typing_ext"

          HIP_VISIBLE_DEVICES=1 ci/pytorch.sh > /workspace/torch_sgpu.log 2>&1 &
          torch_pid=$!; echo Pytorch test pid $!

          HIP_VISIBLE_DEVICES=2 ci/jax.sh > /workspace/jax_sgpu.log 2>&1 &
          jax_pid=$!; echo JAX test pid $!

          HIP_VISIBLE_DEVICES=3 ci/core.sh > /workspace/core_sgpu.log 2>&1 &
          core_pid=$!; echo Core test pid $!

          wait $core_pid; core_rc=$?
          wait $jax_pid; jax_rc=$?
          wait $torch_pid; torch_rc=$?

          if [ $torch_rc -ne 0 ]; then echo "Pytorch sGPU test FAILED."; fi
          if [ $jax_rc -ne 0 ]; then echo "JAX sGPU test FAILED."; fi
          if [ $core_rc -ne 0 ]; then echo "Core sGPU test FAILED."; fi

          test $torch_rc -eq 0 -a $jax_rc -eq 0 -a $core_rc -eq 0
          EOF
          )"

      - name: Run mGPU tests
        run: |
          docker exec te-runner bash -c "$(cat <<'EOF'
          #!/usr/bin/bash
          set -x -o pipefail
          ulimit -c 0 # Disable core dumps

          ci/pytorch.sh > /workspace/torch_mgpu.log 2>&1; torch_rc=$?
          ci/jax.sh > /workspace/jax_mgpu.log 2>&1; jax_rc=$?

          if [ $torch_rc -ne 0 ]; then echo "Pytorch mGPU test FAILED."; fi
          if [ $jax_rc -ne 0 ]; then echo "JAX mGPU test FAILED."; fi

          test $torch_rc -eq 0 -a $jax_rc -eq 0
          EOF
          )"

      - name: Run Examples
        run: |
          docker exec te-runner bash -c "$(cat <<'EOF'
          #!/usr/bin/bash
          set -ex -o pipefail
          ulimit -c 0 # Disable core dumps

          cd /workspace/examples/pytorch/mnist
          python main.py
          python main.py --use-te
          python main.py --use-fp8

          cd /workspace/examples/jax/mnist
          pip3 install -r requirements.txt
          python test_single_gpu_mnist.py
          python test_single_gpu_mnist.py --use-te
          python test_single_gpu_mnist.py --use-fp8

          cd /workspace/examples/jax/encoder
          pip3 install -r requirements.txt
          python test_single_gpu_encoder.py
          python test_single_gpu_encoder.py --use-fp8
          EOF
          )"

      - name: Copy logs and reports from container
        if: always()
        run: |
          docker cp te-runner:/workspace/torch_sgpu.log ./torch_sgpu.log || true
          docker cp te-runner:/workspace/jax_sgpu.log ./jax_sgpu.log || true
          docker cp te-runner:/workspace/core_sgpu.log ./core_sgpu.log || true
          docker cp te-runner:/workspace/torch_mgpu.log ./torch_mgpu.log || true
          docker cp te-runner:/workspace/jax_mgpu.log ./jax_mgpu.log || true

      - name: Upload logs and test reports
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: logs-and-reports
          path: |
            *.log
          if-no-files-found: ignore
          retention-days: 1

      - name: Cleanup container
        if: always()
        run: docker rm -f te-runner || true
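
Review note: the branch-to-image mapping in the "Select Docker Image Tag" step can be checked locally without a runner. The sketch below is a simplified, POSIX-compatible restatement and not the workflow's own code: `select_image` is a hypothetical helper, the image names are placeholders, and a `case` pattern replaces the `[[ =~ ]]` regex. It assumes that only `release_v1.13_rocm` and `release_v1.14_rocm` map to the release image, which is what the step's version checks imply for ordinary branch names.

```shell
# Hypothetical helper (not part of the workflow):
#   select_image BRANCH DEV_IMAGE REL_IMAGE
# Prints the image that the "Select Docker Image Tag" step would pick.
select_image() {
  branch=$1 dev_image=$2 rel_image=$3
  case "$branch" in
    # Only the 1.13 and 1.14 release branches use the release image.
    release_v1.13_rocm|release_v1.14_rocm) printf '%s\n' "$rel_image" ;;
    # Every other branch (dev, other releases, feature branches) uses dev.
    *) printf '%s\n' "$dev_image" ;;
  esac
}

# Placeholder image names, for illustration only.
select_image dev dev-img rel613-img                 # prints dev-img
select_image release_v1.13_rocm dev-img rel613-img  # prints rel613-img
select_image release_v2.4_rocm dev-img rel613-img   # prints dev-img
```

Keeping the mapping in one function like this makes it straightforward to add a further release image later without touching the rest of the step.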
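
Review note: the "Run sGPU tests" step starts three suites in the background, waits for each, and fails only after all of them have finished. A minimal sketch of that pattern follows, under stated assumptions: `run_suite` is a hypothetical stand-in for `ci/pytorch.sh`, `ci/jax.sh` and `ci/core.sh` that returns whatever status it is given, and `run_parallel` is an illustrative wrapper that does not exist in the workflow. Unlike the step, the sketch collects statuses with `wait ... || rc=$?`, so it also works when the calling shell has `set -e` enabled.

```shell
# Hypothetical stand-in for a test suite script: returns the status in $1.
run_suite() {
  return "$1"
}

# Illustrative wrapper: run three suites concurrently, report each failure,
# and succeed only if all three succeeded.
run_parallel() {
  run_suite "$1" & torch_pid=$!
  run_suite "$2" & jax_pid=$!
  run_suite "$3" & core_pid=$!

  torch_rc=0; jax_rc=0; core_rc=0
  # Wait for every job so that one failure does not hide the others.
  wait "$core_pid"  || core_rc=$?
  wait "$jax_pid"   || jax_rc=$?
  wait "$torch_pid" || torch_rc=$?

  [ "$torch_rc" -eq 0 ] || echo "PyTorch suite FAILED."
  [ "$jax_rc" -eq 0 ]   || echo "JAX suite FAILED."
  [ "$core_rc" -eq 0 ]  || echo "Core suite FAILED."

  [ "$torch_rc" -eq 0 ] && [ "$jax_rc" -eq 0 ] && [ "$core_rc" -eq 0 ]
}

run_parallel 0 0 0 && echo "all suites passed"
run_parallel 0 1 0 || echo "overall status: failure"
```

Waiting on each PID separately, rather than calling a bare `wait`, is what lets the step name the failing framework in the job log.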