Closed
120 commits
87d437e
First drawt of the workflow
leo-automation Sep 17, 2025
e5c86af
Update transformer-engine-ci.yml
leo-automation Sep 26, 2025
335b09f
Fixes
leo-automation Sep 26, 2025
8136739
Update transformer-engine-ci.yml
leo-automation Sep 26, 2025
440a781
Update transformer-engine-ci.yml
leo-automation Sep 26, 2025
578ddce
toLower
leo-automation Sep 26, 2025
99863b1
CI fixes
leo-automation Sep 26, 2025
b984353
Added build-only runners
leo-automation Sep 29, 2025
63d5046
Debug docker credentials
leo-automation Sep 30, 2025
370ff19
Debug docker credentials
leo-automation Sep 30, 2025
4981dc1
Node js is missing on the node
leo-automation Sep 30, 2025
c96d76f
Remove devcontainer
leo-automation Sep 30, 2025
82ea4e5
More fixes
leo-automation Sep 30, 2025
bae63ab
Typo
leo-automation Sep 30, 2025
29955e9
Permissions inside the container
leo-automation Sep 30, 2025
227ca35
sudo
leo-automation Sep 30, 2025
9fc2342
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
6d3a025
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
19a78cd
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
93a42ee
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
40f8d08
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
d050431
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
5aeaf38
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
53633e5
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
8989cd5
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
06c3c30
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
866b582
Update transformer-engine-ci.yml
leo-automation Sep 30, 2025
0c78dad
CI: add CMAKE_CXX_COMPILER and CMAKE_CC_COMPILER
Oct 2, 2025
aaea1ac
DCMAKE_CXX_COMPILER
Oct 2, 2025
7694965
ls -l
Oct 2, 2025
18f6cb1
view setup.py
Oct 2, 2025
9e95632
add dcmake cxx compiler to setup.py
Oct 2, 2025
9a46d85
absolute path to dcmake_cxx_compiler
Oct 2, 2025
6f92c32
hip_arch
Oct 2, 2025
9ca4b79
hip-runtime-amd and hip-dev
Oct 2, 2025
a726165
dhip root dir
Oct 2, 2025
271fadb
rocm_path
Oct 2, 2025
7634219
add HIP_PATH
Oct 3, 2025
8e30435
add setuptools and wheel
Oct 3, 2025
0a50f6e
tree
Oct 3, 2025
c8fad75
tree
Oct 3, 2025
662c74a
rm hip dir
Oct 3, 2025
7b18142
back to basic
Oct 3, 2025
9c24673
rm build-essential
Oct 3, 2025
4b7d574
Run CI
leo-automation Oct 15, 2025
064461b
Fix the path
leo-automation Oct 15, 2025
1b402b9
Test rootless
leo-automation Oct 15, 2025
37a880d
Debug
leo-automation Oct 15, 2025
5c2ea46
sudo
leo-automation Oct 15, 2025
36596e9
sudo debug
leo-automation Oct 15, 2025
4bde899
sudo
leo-automation Oct 15, 2025
8b03bea
AMDGPU_TARGETS
leo-automation Oct 15, 2025
dd2feaa
Fix
leo-automation Oct 15, 2025
8e34a89
Debugiing
leo-automation Oct 15, 2025
10317db
Debugging
leo-automation Oct 15, 2025
a8afdd0
Debugging
leo-automation Oct 15, 2025
7f4f53e
PATH problem perhaps
leo-automation Oct 15, 2025
0457361
Debugging
leo-automation Oct 15, 2025
ad32476
Debug
leo-automation Oct 16, 2025
b18d52d
Runner change
leo-automation Oct 16, 2025
e998ae4
Fix cplus
leo-automation Oct 16, 2025
c56ba18
Fix
leo-automation Oct 16, 2025
5840c36
Fix
leo-automation Oct 16, 2025
1037792
Correct cmaker path
leo-automation Oct 16, 2025
bbe956c
Debug
leo-automation Oct 16, 2025
c27bf79
Cmake link
leo-automation Oct 16, 2025
9159ef0
Remove -f
leo-automation Oct 16, 2025
5e9f3b1
Debug
leo-automation Oct 16, 2025
d847e3d
PATH
leo-automation Oct 16, 2025
6122314
Refactoring to run container as a step
leo-automation Oct 16, 2025
b1cd2e1
build-only
leo-automation Oct 16, 2025
28b192d
Houskeeping
leo-automation Oct 16, 2025
3a3cff9
All in one stage
leo-automation Oct 17, 2025
efae349
GPU runner doesn't have access to internal network, so moved all netw…
leo-automation Oct 17, 2025
60e5d75
Split docker image
leo-automation Oct 20, 2025
b2c4c15
Fix /opt/cmake
leo-automation Oct 20, 2025
052f6d8
Housekeeping
leo-automation Oct 20, 2025
79d8666
Typo
leo-automation Oct 20, 2025
f7e3442
Pip install incorrect path
leo-automation Oct 21, 2025
f0cc321
Houskeeping
leo-automation Oct 21, 2025
ada7ec6
Resolving comments
leo-automation Oct 22, 2025
80bfdcb
New line at the end
leo-automation Oct 22, 2025
9672ff1
Small changes
leo-automation Oct 22, 2025
3a0fb27
Consolidate under one runner with internal network access
leo-automation Oct 23, 2025
9c232d8
Fix FFI import. Add distributed tests hang workaround (#347)
ipanfilo Oct 23, 2025
47eb7b7
Update rocm-ci.yml
mkunredd Oct 23, 2025
28dc543
Submodules
leo-automation Oct 23, 2025
a037a09
Make TE ROCm wheels building image directly from manylinix image (#340)
ipanfilo Oct 27, 2025
083fcfc
Added timeout
leo-automation Oct 29, 2025
21ff8d6
Resolve comments
leo-automation Oct 31, 2025
72ec76b
[CI] Hotfix test_gemm_autotune update (#353)
VeeraRajasekhar Oct 31, 2025
b092058
MXFP8 test scale off by 1 fix (#338)
alextmagro Oct 31, 2025
9187659
[CI] Removed Jax jit workaround, replaced with XLA_FLAGS=--xla_gpu_en…
VeeraRajasekhar Oct 31, 2025
22815fa
Run CI
leo-automation Nov 3, 2025
2eb4714
Runner update
leo-automation Nov 3, 2025
dabb812
Update runners
leo-automation Nov 3, 2025
d8beee1
Debug
leo-automation Nov 3, 2025
5a7c74d
Debug
leo-automation Nov 3, 2025
83735d8
Remove HIP macros around std:: math functions (#343)
alextmagro Nov 3, 2025
a007240
Run CI
leo-automation Nov 4, 2025
b4da0d5
Update runners label
leo-automation Nov 4, 2025
533bee5
[TE] [AITER] Add prebuilt AITER download and upload flow (#335)
VeeraRajasekhar Nov 5, 2025
96d48ce
std::max type mismatch hotfix (#361)
alextmagro Nov 7, 2025
6b8a47d
CI: allow numpy 2.0 (#366)
ipanfilo Nov 7, 2025
9a987f8
Relax tolerance to pass 29x29x17389NT GEMM on MI350 (#365)
ipanfilo Nov 8, 2025
90c04bc
FIX Occasional import error when only building for a single framework…
Micky774 Nov 10, 2025
7887013
Integrate aiter HD192_HD128 backward kernels (#364)
VeeraRajasekhar Nov 11, 2025
e9c7361
Use .info/version for ROCm verison (#368)
ipanfilo Nov 12, 2025
87fece2
Enable aligned vectorized memory ops for MXFP8 cast (#342)
alextmagro Nov 13, 2025
32e2d1d
[ROCm] align the softmax aux shape with NVTE upstream (#371)
wangye805 Nov 14, 2025
5685b2c
Te2.4 fsdp2 fp8 allgather autocast (#349)
sudhu2k Nov 17, 2025
3a6c5b6
Test CI plus small changes
leo-automation Nov 17, 2025
ba31fd8
Run CI
leo-automation Nov 17, 2025
956fa26
Update runners label
leo-automation Nov 17, 2025
cd612d7
Update runners
leo-automation Nov 17, 2025
6bbd03c
[TE] Implement Triton current scaling (#341)
matthiasdiener Nov 18, 2025
c95f9db
Update benchmark script to support fwd_v3 and a16 (#373)
VeeraRajasekhar Nov 18, 2025
9eaaf4c
Enable AITER V3 kernels by default (#372)
ipanfilo Nov 19, 2025
031d73b
Add new logic from Jenkins and continue-on-error: true for tests
leo-automation Nov 19, 2025
94174a4
Merge branch 'leo/migrate-ci-to-gha' of https://github.com/ROCm/Trans…
leo-automation Nov 19, 2025
293 changes: 293 additions & 0 deletions .github/workflows/rocm-ci.yml
@@ -0,0 +1,293 @@
# Copyright (c) 2024-2025, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.

name: TransformerEngine CI

on:
  push:
    branches:
      - 'dev'
      - 'release_v1.*_rocm'
      - 'release_v2.*_rocm'
  pull_request:
    branches:
      - 'dev'
      - 'release_v1.**_rocm'
      - 'release_v2.**_rocm'
  workflow_dispatch:
    inputs:
      test_level:
        description: 'Test Level (1-3)'
        required: true
        default: '1'
      skip_dev_merge:
        description: 'Skip merging dev branch'
        type: boolean
        default: false

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build_and_test:
    name: Build and Test on GPU
    timeout-minutes: 720
    runs-on: linux-mi325-4
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          submodules: 'recursive'

      - name: Merge origin/dev
        # Only run on PRs targeting dev, or on manual runs that did not skip the merge.
        # skip_dev_merge is a boolean input, so it is negated directly rather than
        # compared with the string 'true' (that comparison would never match).
        if: |
          (github.event_name == 'pull_request' && github.base_ref == 'dev') ||
          (github.event_name == 'workflow_dispatch' && !inputs.skip_dev_merge && github.ref == 'refs/heads/dev')
        run: |
          echo "Attempting to merge origin/dev..."
          git config --global user.email "amd@amd.com"
          git config --global user.name "AMD CI"

          # Fetch dev specifically
          git fetch origin dev

          # Attempt the merge; a conflict makes git exit non-zero, which fails the job
          git merge origin/dev
          echo "Merge successful."

      - name: Select Docker Image Tag
        id: select-image
        env:
          DEV_IMAGE: ${{ vars.DEV_DOCKER_IMAGE }}
          REL_IMAGE: ${{ vars.REL613_DOCKER_IMAGE }}
          # Passed through env so the branch name is never expanded inside the script text
          BRANCH_NAME: ${{ github.base_ref || github.ref_name }}
        run: |
          echo "Determining image for branch: $BRANCH_NAME"
          DEV_DOCKER_IMAGE="$DEV_IMAGE"
          REL613_DOCKER_IMAGE="$REL_IMAGE"
          IMAGE_TO_USE="$DEV_DOCKER_IMAGE"
          if [[ $BRANCH_NAME =~ ^release_v([0-9]+)\.([0-9]+)_rocm$ ]]; then
            MAJOR_VERSION=${BASH_REMATCH[1]}
            MINOR_VERSION=${BASH_REMATCH[2]}
            if (( MAJOR_VERSION == 1 )); then
              if (( MINOR_VERSION == 13 || MINOR_VERSION == 14 )); then IMAGE_TO_USE="$REL613_DOCKER_IMAGE"; fi
            fi
          fi
          echo "Selected image: $IMAGE_TO_USE"
          echo "image-tag=$IMAGE_TO_USE" >> "$GITHUB_OUTPUT"

      - name: Pull Docker Image
        run: |
          docker pull "${{ steps.select-image.outputs.image-tag }}"

      - name: Run Container
        run: |
          docker run -dt \
            --name te-runner \
            --network=host \
            --device=/dev/dri --device=/dev/kfd \
            --shm-size=16G \
            --pid=host \
            --group-add "$(getent group render | cut -d: -f3)" \
            --group-add "$(getent group video | cut -d: -f3)" \
            -v "${{ github.workspace }}:/workspace" \
            -w /workspace \
            "${{ steps.select-image.outputs.image-tag }}"

      - name: Determine GPU Architecture via rocminfo
        id: gpu-arch
        run: |
          # Run rocminfo inside the container and capture the first gfx target it reports.
          # "|| true" keeps the default "bash -e" step shell from aborting on a failed
          # match before the diagnostics below can run.
          ARCH=$(docker exec te-runner bash -c "rocminfo | grep -m 1 -oP 'gfx[0-9a-fA-F]+'") || true
          if [ -z "$ARCH" ]; then
            echo "::error::Could not determine GPU architecture using rocminfo inside the container."
            # Print the full rocminfo output for debugging
            docker exec te-runner rocminfo || true
            exit 1
          fi
          echo "Detected GPU Arch: $ARCH"
          echo "arch=$ARCH" >> "$GITHUB_OUTPUT"

      - name: Build Project
        run: |
          docker exec \
            -e GPU_ARCH="${{ steps.gpu-arch.outputs.arch }}" \
            te-runner bash -c "$(cat <<'EOF'
          set -ex

          export HIP_PATH=""
          export PYTORCH_ROCM_ARCH=$GPU_ARCH
          export NVTE_ROCM_ARCH=$GPU_ARCH
          export NVTE_AITER_PREBUILT_BASE_URL=https://compute-artifactory.amd.com:5000/artifactory/rocm-generic-local/te-ci/aiter-prebuilts
          pip install ninja
          pip install -v . 2>&1
          EOF
          )"

      - name: Run sGPU tests
        id: sgpu-tests
        continue-on-error: true
        run: |
          docker exec te-runner bash -c "$(cat <<'EOF'
          #!/usr/bin/bash
          set -x -o pipefail
          ulimit -c 0 # Disable core dumps

          # debug output
          ls -d /opt/rocm*
          python --version
          pip list | grep -E "transformer_e|torch|jax|numpy|ml_dtypes|typing_ext"

          # Run the three suites in parallel, each pinned to its own GPU
          HIP_VISIBLE_DEVICES=1 ci/pytorch.sh > /workspace/torch_sgpu.log 2>&1 &
          torch_pid=$!; echo Pytorch test pid $!

          HIP_VISIBLE_DEVICES=2 ci/jax.sh > /workspace/jax_sgpu.log 2>&1 &
          jax_pid=$!; echo JAX test pid $!

          HIP_VISIBLE_DEVICES=3 ci/core.sh > /workspace/core_sgpu.log 2>&1 &
          core_pid=$!; echo Core test pid $!

          # Collect each suite's exit code
          wait $core_pid; core_rc=$?
          wait $jax_pid; jax_rc=$?
          wait $torch_pid; torch_rc=$?

          # Check PyTorch
          if [ $torch_rc -ne 0 ]; then
            echo "::group::[FAILED] PyTorch sGPU Log"
            cat /workspace/torch_sgpu.log
            echo "::endgroup::"
            echo "::error::Pytorch sGPU test FAILED."
          fi

          # Check JAX
          if [ $jax_rc -ne 0 ]; then
            echo "::group::[FAILED] JAX sGPU Log"
            cat /workspace/jax_sgpu.log
            echo "::endgroup::"
            echo "::error::JAX sGPU test FAILED."
          fi

          # Check Core
          if [ $core_rc -ne 0 ]; then
            echo "::group::[FAILED] Core sGPU Log"
            cat /workspace/core_sgpu.log
            echo "::endgroup::"
            echo "::error::Core sGPU test FAILED."
          fi

          # Fail the step if any suite failed
          test $torch_rc -eq 0 -a $jax_rc -eq 0 -a $core_rc -eq 0
          EOF
          )"

      - name: Run mGPU tests
        id: mgpu-tests
        continue-on-error: true
        run: |
          docker exec te-runner bash -c "$(cat <<'EOF'
          #!/usr/bin/bash
          set -x -o pipefail
          ulimit -c 0 # Disable core dumps

          # Run PyTorch
          ci/pytorch.sh > /workspace/torch_mgpu.log 2>&1
          torch_rc=$?

          # Run JAX
          ci/jax.sh > /workspace/jax_mgpu.log 2>&1
          jax_rc=$?

          if [ $torch_rc -ne 0 ]; then
            echo "::group::[FAILED] PyTorch mGPU Log"
            cat /workspace/torch_mgpu.log
            echo "::endgroup::"
            echo "::error::Pytorch mGPU test FAILED."
          fi

          if [ $jax_rc -ne 0 ]; then
            echo "::group::[FAILED] JAX mGPU Log"
            cat /workspace/jax_mgpu.log
            echo "::endgroup::"
            echo "::error::JAX mGPU test FAILED."
          fi

          test $torch_rc -eq 0 -a $jax_rc -eq 0
          EOF
          )"

      - name: Run Examples
        id: examples-tests
        continue-on-error: true
        run: |
          docker exec te-runner bash -c "$(cat <<'EOF'
          #!/usr/bin/bash
          set -ex -o pipefail
          ulimit -c 0 # Disable core dumps

          cd /workspace/examples/pytorch/mnist
          python main.py 2>&1 | tee /workspace/examples.log
          python main.py --use-te 2>&1 | tee -a /workspace/examples.log
          python main.py --use-fp8 2>&1 | tee -a /workspace/examples.log

          cd /workspace/examples/jax/mnist
          pip3 install -r requirements.txt
          python test_single_gpu_mnist.py 2>&1 | tee -a /workspace/examples.log
          python test_single_gpu_mnist.py --use-te 2>&1 | tee -a /workspace/examples.log
          python test_single_gpu_mnist.py --use-fp8 2>&1 | tee -a /workspace/examples.log

          cd /workspace/examples/jax/encoder
          pip3 install -r requirements.txt
          python test_single_gpu_encoder.py 2>&1 | tee -a /workspace/examples.log
          python test_single_gpu_encoder.py --use-fp8 2>&1 | tee -a /workspace/examples.log
          EOF
          )"

      - name: Check Test Failure Status
        if: always()
        run: |
          # Check the outcomes of the individual test steps.
          # "outcome" is 'failure' even when continue-on-error is true
          # (unlike "conclusion", which is reported as 'success' in that case).
          EXIT_STATUS=0

          if [[ "${{ steps.sgpu-tests.outcome }}" == "failure" ]]; then
            echo "::error::sGPU Tests Failed."
            EXIT_STATUS=1
          fi

          if [[ "${{ steps.mgpu-tests.outcome }}" == "failure" ]]; then
            echo "::error::mGPU Tests Failed."
            EXIT_STATUS=1
          fi

          if [[ "${{ steps.examples-tests.outcome }}" == "failure" ]]; then
            echo "::error::Example Tests Failed."
            EXIT_STATUS=1
          fi

          # Fail the job if any errors were detected
          if [[ "$EXIT_STATUS" == "1" ]]; then
            exit 1
          fi

      - name: Copy logs and reports from container
        if: always()
        run: |
          docker cp te-runner:/workspace/torch_sgpu.log ./torch_sgpu.log || true
          docker cp te-runner:/workspace/jax_sgpu.log ./jax_sgpu.log || true
          docker cp te-runner:/workspace/core_sgpu.log ./core_sgpu.log || true
          docker cp te-runner:/workspace/torch_mgpu.log ./torch_mgpu.log || true
          docker cp te-runner:/workspace/jax_mgpu.log ./jax_mgpu.log || true

      - name: Upload logs and test reports
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: logs-and-reports
          path: |
            *.log
          if-no-files-found: ignore
          retention-days: 5

      - name: Cleanup container
        if: always()
        run: docker rm -f te-runner || true
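
Note on the `Select Docker Image Tag` step: its branch-to-image mapping can be checked locally without a runner. The sketch below is only an illustration, not part of the PR. The function name `select_image` and the image names `dev-img` and `rel-img` are hypothetical stand-ins for the `DEV_DOCKER_IMAGE` and `REL613_DOCKER_IMAGE` repository variables. It uses a `case` statement instead of the regex, which should behave the same for canonically named branches such as `release_v1.13_rocm`.

```shell
#!/bin/sh
# Hypothetical helper mirroring the workflow's image selection:
# release_v1.13_rocm and release_v1.14_rocm use the release image,
# every other branch falls back to the dev image.
select_image() {
  branch=$1
  dev_image=$2
  rel_image=$3
  case "$branch" in
    release_v1.13_rocm|release_v1.14_rocm) printf '%s\n' "$rel_image" ;;
    *) printf '%s\n' "$dev_image" ;;
  esac
}

# Example invocations with placeholder image names
select_image dev dev-img rel-img
select_image release_v1.13_rocm dev-img rel-img
select_image release_v2.4_rocm dev-img rel-img
```

A lookup in this form makes it easy to see which release branches are special-cased. The regex in the workflow is more general, since it could later handle version ranges.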
2 changes: 1 addition & 1 deletion 3rdparty/aiter
Submodule aiter updated 967 files
2 changes: 1 addition & 1 deletion 3rdparty/hipify_torch
14 changes: 7 additions & 7 deletions README.rst
@@ -264,15 +264,15 @@ Note that when using `THD` format tensors with CK Fused Attention, one should pa
to indicate that there is no padding between sequences. Otherwise, passing proper tensors will indicate padding between sequences. This is the case
for both the `FusedAttention` and `DotProductAttention` modules.

-FA v3 Kernels in CK Backend
+AITER FA v3 Kernels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-ROCm TE provides experimental support for flash-attention v3 fwd/bwd kernels using the ck backend for limited fused attention configs.
-To enable FA v3 kernels, the following environment variables can be used:
+ROCm TE supports flash-attention v3 fwd/bwd kernels on gfx942 and gfx950 using AITER backend.
+This functionality can be controlled by the following environment variables:

-* NVTE_CK_USES_FWD_V3 - by default 0, if set to 1, some cases will call the fwd v3 kernel, only applicable to the gfx942 architecture;
-* NVTE_CK_USES_BWD_V3 - by default 0, if set to 1, some cases will call the bwd v3 dqdkdv kernel;
-* NVTE_CK_IS_V3_ATOMIC_FP32 - by default 1, if set to 0 will use atomic fp16/bf16(w/o convert_dq kernel) in bwd pass when NVTE_CK_USES_BWD_V3 is set to 1;
-* NVTE_CK_HOW_V3_BF16_CVT - by default 1, float to bf16 convert type when bwd_v3 is set to 1, 0:RTNE; 1:RTNA; 2:RTZ, only applicable to the gfx942 architecture.
+* NVTE_CK_USES_FWD_V3 - by default 1, if set to 0, v3 kernels will not be used for fwd pass;
+* NVTE_CK_USES_BWD_V3 - by default 1, if set to 0, v3 kernels will not be used for bwd pass;
+* NVTE_CK_IS_V3_ATOMIC_FP32 - by default 1, if set to 0 will use atomic fp16/bf16(w/o convert_dq kernel) in bwd pass when v3 is enabled;
+* NVTE_CK_HOW_V3_BF16_CVT - by default 1, float to bf16 convert type when v3 is enabled, 0:RTNE; 1:RTNA; 2:RTZ, only applicable to the gfx942 architecture.
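
Note: with the defaults flipped to 1, the v3 kernels become opt-out rather than opt-in. As a minimal sketch based only on the variable descriptions in this hunk, a user who wants the previous non-v3 behavior for both passes could export the following before launching a training script (how the variables are read at runtime is not shown in this diff):

```shell
# Opt out of the AITER FA v3 kernels for both the forward and backward pass.
# Variable names and the meaning of 0 are taken from the README text above.
export NVTE_CK_USES_FWD_V3=0
export NVTE_CK_USES_BWD_V3=0
```

The remaining two variables, `NVTE_CK_IS_V3_ATOMIC_FP32` and `NVTE_CK_HOW_V3_BF16_CVT`, are described as taking effect only when v3 is enabled, so they should not matter with these settings.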

Float to BFloat16 Conversion in CK Backend (gfx942 only)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^