Updated test logging and timeouts #697

Triggered via pull request June 5, 2026 15:41
Status Cancelled
Total duration 3h 57m 35s
Artifacts 7

rocm-ci-dispatch.yml

on: pull_request
determine_level: 5s
CI Level 3 / Select Docker Image: 5s
CI Level 3 / build / Build ROCm Docker image and TransformerEngine wheels: 30m 43s
Matrix: dispatch / mgpu_tests
Matrix: dispatch / sgpu_tests

Annotations

15 errors, 8 warnings, and 1 notice
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-thd-cp_1_1-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_1_1', 'qkv_format=thd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-thd-cp_1_0-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_1_0', 'qkv_format=thd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_3_4-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_3_4', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_3_2-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_3_2', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_2_3-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_2_3', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_2_2-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_2_2', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_2_0-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_2_0', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_1_4-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_1_4', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_1_1-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_1_1', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
failed: tests.pytorch.attention.test_attention_with_cp
test_cp_with_fused_attention[False-None-False-False-False-p2p-sbhd-cp_1_0-bf16]::subprocess.CalledProcessError: Command '['python3', '-m', 'torch.distributed.launch', '--nproc-per-node=2', '/workspace/tests/pytorch/attention/run_attention_with_cp.py', 'dtype=bf16', 'model=cp_1_0', 'qkv_format=sbhd', 'kernel_backend=FusedAttention', 'cp_comm_type=p2p', 'fp8_bwd=False', 'fp8_dpa=False', 'fp8_mha=False', 'scaling_mode=None', 'f16_O=False', 'is_training=True', 'log_level=WARNING']' returned non-zero exit status 1.
CI Level 3 / mGPU Torch (mi35x)
Process completed with exit code 1.
CI Level 3 / mGPU Torch (mi35x)
PyTorch mGPU tests FAILED.
CI Level 3 / sGPU Tests (mi35x)
Canceling since a higher priority waiting request for PR Automatic CI-refs/pull/608/merge exists
CI Level 3 / sGPU Tests (mi35x)
The operation was canceled.
PR Automatic CI
Canceling since a higher priority waiting request for PR Automatic CI-refs/pull/608/merge exists
CI Level 3 / build / Build ROCm Docker image and TransformerEngine wheels
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/upload-artifact@v4. Actions will be forced to run with Node.js 24 by default starting June 16th, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
CI Level 3 / mGPU JAX (mi35x)
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/download-artifact@v4, actions/upload-artifact@v4. Actions will be forced to run with Node.js 24 by default starting June 16th, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
CI Level 3 / mGPU JAX (mi30x)
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/download-artifact@v4, actions/upload-artifact@v4. Actions will be forced to run with Node.js 24 by default starting June 16th, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
CI Level 3 / mGPU Torch (mi30x)
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/download-artifact@v4, actions/upload-artifact@v4. Actions will be forced to run with Node.js 24 by default starting June 16th, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
CI Level 3 / mGPU Torch (mi35x)
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/download-artifact@v4, actions/upload-artifact@v4. Actions will be forced to run with Node.js 24 by default starting June 16th, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
CI Level 3 / mGPU Torch (mi35x)
31 more failures omitted from annotations; see the job summary for the full list.
CI Level 3 / sGPU Tests (mi30x)
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/download-artifact@v4, actions/upload-artifact@v4. Actions will be forced to run with Node.js 24 by default starting June 16th, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
CI Level 3 / sGPU Tests (mi35x)
Node.js 20 actions are deprecated. The following actions are running on Node.js 20 and may not work as expected: actions/download-artifact@v4, actions/upload-artifact@v4. Actions will be forced to run with Node.js 24 by default starting June 16th, 2026. Node.js 20 will be removed from the runner on September 16th, 2026. Please check if updated versions of these actions are available that support Node.js 24. To opt into Node.js 24 now, set the FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true environment variable on the runner or in your workflow file. Once Node.js 24 becomes the default, you can temporarily opt out by setting ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true. For more information see: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
CI Level 3 / Select Docker Image
Using ci/ci_config.json from dev
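
The failed test_attention_with_cp annotations above each embed the full launcher command, so the parameters that vary between failures (model, qkv_format) can be pulled out mechanically when triaging. The following is a minimal sketch, assuming the annotation text has been copied into a Python string; the helper name parse_failed_command is hypothetical and is not part of the test suite or the CI scripts.

```python
import ast
import re


def parse_failed_command(annotation: str) -> dict:
    """Return the key=value launcher arguments found in one annotation line.

    Hypothetical triage helper: it only reads the text of the annotation and
    does not run or import anything from the test suite.
    """
    match = re.search(
        r"Command '(\[.*?\])' returned non-zero exit status", annotation
    )
    if match is None:
        raise ValueError("no command list found in annotation")
    # The command is printed as a Python list literal, so it can be parsed safely.
    argv = ast.literal_eval(match.group(1))
    # Keep script arguments such as 'model=cp_1_1'; skip launcher flags such as
    # '--nproc-per-node=2' and positional entries without '='.
    return dict(
        arg.split("=", 1)
        for arg in argv
        if "=" in arg and not arg.startswith("-")
    )


# Abbreviated copy of one annotation from this run (trailing arguments omitted).
sample = (
    "test_cp_with_fused_attention"
    "[False-None-False-False-False-p2p-thd-cp_1_1-bf16]"
    "::subprocess.CalledProcessError: Command '['python3', '-m', "
    "'torch.distributed.launch', '--nproc-per-node=2', "
    "'/workspace/tests/pytorch/attention/run_attention_with_cp.py', "
    "'dtype=bf16', 'model=cp_1_1', 'qkv_format=thd', "
    "'kernel_backend=FusedAttention', 'cp_comm_type=p2p']' "
    "returned non-zero exit status 1."
)

params = parse_failed_command(sample)
print(params["model"], params["qkv_format"], params["cp_comm_type"])
```

Applied to all ten listed annotations, this kind of parsing makes it easy to see that every visible failure shares kernel_backend=FusedAttention, cp_comm_type=p2p and dtype=bf16, and differs only in model and qkv_format. The 31 omitted failures would need to be checked in the job summary.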

Artifacts

Produced during runtime
Name | Size | Digest
logs-mgpu-mi30x-jax (Expired) | 48.8 KB | sha256:37d4cdc0e4ddc5079e385e2d54c99c0d6ccbe6535686a1b4c69c45daf92bf150
logs-mgpu-mi30x-pytorch (Expired) | 113 KB | sha256:12bed27efb59fedaa54b6cbf2a42e82b2c9a328cbcdc5cbfc43fda9911b7589e
logs-mgpu-mi35x-jax (Expired) | 53 KB | sha256:080befbd928585bdfdf04237b8153e42741c9cbfbae19150acca03d9cba5e8ea
logs-mgpu-mi35x-pytorch (Expired) | 137 KB | sha256:4ebb64d83ef85b57e50b649bef660af925032db4e3743116203dd8f727462f26
logs-sgpu-mi30x (Expired) | 3.42 MB | sha256:a0337a44069940d31dde88467c7683bc82bb2af6997394c1bfb9d5e5b4ef3037
logs-sgpu-mi35x (Expired) | 2.71 MB | sha256:974bb7e5ea0478f02e6fe25e2977aabe809d782cc17e4a9dde6c22228f65fd2b
te-rocm-wheels (Expired) | 722 MB | sha256:a4c4f3320051701647fb7b52f9072752d789dd6e0da4c9e09b4356edd09eaf87
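
Each artifact is listed with a sha256 digest, which allows a downloaded copy to be checked against the run page. The sketch below is one way to do that with the Python standard library. It assumes the digest covers the artifact archive exactly as downloaded, and the function names are illustrative only. The artifacts of this particular run are marked Expired, so the check applies to future runs or to copies saved earlier.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path, listed: str) -> bool:
    """Compare a local file against a digest written as 'sha256:<hex>'."""
    expected = listed.strip().removeprefix("sha256:").lower()
    return sha256_of_file(path) == expected
```

For example, matches_listed_digest("te-rocm-wheels.zip", "sha256:a4c4f332...") would return True only if the local archive is byte-identical to the uploaded one; the file name and the shortened digest here are placeholders.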