Add gtfn_gpu to distributed CI pipeline #1012

Open
msimberg wants to merge 82 commits into C2SM:main from msimberg:distributed-tests-dace-gpu


@msimberg
Contributor

@msimberg msimberg commented Jan 28, 2026

Adds the gtfn_gpu backend to the distributed CI pipeline. dace_gpu is still left out because compilation takes too long.

The base image is upgraded because it's possible, but not strictly necessary. The CPU-only version of the pipeline needed 25.04 (24.04 and 25.10 did not work for various reasons). However, since OpenMPI and libfabric are now built manually in the container, the base image version is less of a constraint. 24.04 doesn't have matching GCC/CUDA versions and 26.04 doesn't exist yet, but the pipeline should eventually use 26.04.

OpenMPI and libfabric are built manually for Slingshot support because getting the Ubuntu repository packages to work with GPU support did not seem possible/easy. The installation is based on https://github.com/eth-cscs/cray-network-stack.

GHEX needs an upgrade, because there's a bug in how strides are calculated for GPU buffers. @philip-paul-mueller has already fixed this in ghex-org/GHEX#190 but we should wait for that to be merged (and probably test in icon-exclaim first).

This also fixes a few cupy/numpy incompatibilities. revert_repeated_index_to_invalid was updated to only deal with numpy for now, as the connectivities are always numpy arrays. test_halo_exchange_for_sparse_field is marked embedded_only. The non-MPI test was already marked embedded_only.
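
To illustrate what a numpy-only version of such a helper might look like, here is a minimal sketch. It is not the code from this PR: the function name `revert_repeated_index_to_invalid_sketch`, the sentinel `INVALID_INDEX = -1`, and the assumed behaviour (within each row of a connectivity table, keep the first occurrence of an index and mark later repeats as invalid) are all assumptions made for illustration.

```python
import numpy as np

# Assumed sentinel for "no neighbour"; the real constant may differ.
INVALID_INDEX = -1


def revert_repeated_index_to_invalid_sketch(connectivity: np.ndarray) -> np.ndarray:
    """Return a copy in which repeated indices within a row are set to INVALID_INDEX.

    Hypothetical numpy-only helper: the first occurrence of each index in a
    row is kept, every later occurrence in the same row is replaced.
    """
    result = connectivity.copy()
    for row in result:  # each row is a view into `result`
        seen = set()
        for j, value in enumerate(row):
            key = int(value)
            if key in seen:
                row[j] = INVALID_INDEX
            else:
                seen.add(key)
    return result


example = np.array([[0, 1, 1], [2, 2, 2], [3, 4, 5]])
print(revert_repeated_index_to_invalid_sketch(example))
```

Because the sketch only uses plain numpy indexing and Python sets, it sidesteps the cupy/numpy differences entirely, which matches the reasoning that connectivities are always numpy arrays.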

This does not try to unify the default and distributed CI pipeline definitions. That should, however, be done sooner or later as well.

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg force-pushed the distributed-tests-dace-gpu branch from f310db9 to fb3927e Compare January 28, 2026 15:04
@msimberg msimberg force-pushed the distributed-tests-dace-gpu branch from fb3927e to 1fd6389 Compare January 28, 2026 15:16
@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg force-pushed the distributed-tests-dace-gpu branch from ab4ac8f to 8f04d36 Compare January 29, 2026 13:38
@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg changed the title Add GPU backends to distributed CI pipeline Add gtfn_gpu to distributed CI pipeline Mar 25, 2026
@msimberg
Contributor Author

cscs-ci run default

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

There may be a problem with the distributed+cuda venv since my latest merge of main... Let's see.

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg force-pushed the distributed-tests-dace-gpu branch from 1717e27 to 5652ce8 Compare March 25, 2026 10:38
@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@github-actions

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

@msimberg
Contributor Author

I changed the distributed gtfn_gpu/common job back to the normal partition, because there's a maximum time limit of one hour on the shared partition. I've added a TODO for that as well. Once we get the timings below an hour, we can change that back to the shared partition.

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run default
