Add gtfn_gpu to distributed CI pipeline#1012
Conversation
cscs-ci run distributed
This reverts commit e30c2f7.
Make revert_repeated_index_to_invalid numpy-only as it's not usefully vectorized
cscs-ci run default
There may be a problem with the distributed+cuda venv since my latest merge of main... Let's see.
Mandatory Tests: please make sure you run these tests via comment before you merge!
Optional Tests:
To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatests for static field computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
I changed the distributed gtfn_gpu/common job back to the normal partition, because the shared partition has a maximum time limit of one hour. I've added a TODO for that as well: once we get the timings below an hour, we can move it back to the shared partition.
Adds the gtfn_gpu backend to the distributed CI pipeline. dace_gpu is still left out because compilation takes too long.
The base image is upgraded because it is now possible, though not strictly necessary. The CPU-only version of the pipeline required 25.04 (24.04 and 25.10 did not work, for different reasons). However, since OpenMPI and libfabric are now built manually in the container, the base image version is less of a constraint: 24.04 lacks matching GCC/CUDA versions and 26.04 does not exist yet, but the pipeline should eventually move to 26.04.
OpenMPI and libfabric are built manually for Slingshot support, because the Ubuntu repository packages could not easily be made to work with GPU support. The installation is based on https://github.com/eth-cscs/cray-network-stack.
GHEX needs an upgrade because there's a bug in how strides are calculated for GPU buffers. @philip-paul-mueller has already fixed this in ghex-org/GHEX#190, but we should wait for that to be merged (and probably test it in icon-exclaim first).
This also fixes a few cupy/numpy incompatibilities.
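A common pattern for such incompatibilities (a hedged sketch only; the actual fixes in this PR may differ) is to normalize device arrays to host arrays before handing them to numpy-only code. `as_numpy` here is a hypothetical helper name; cupy is treated as optional:

```python
import numpy as np

try:
    # cupy is optional; fall back to numpy-only behavior if it is missing
    import cupy as cp
except ImportError:
    cp = None


def as_numpy(arr):
    """Return a numpy array for `arr`, copying device data to host if needed."""
    if cp is not None and isinstance(arr, cp.ndarray):
        return arr.get()  # explicit device-to-host copy
    return np.asarray(arr)
```

On a CPU-only system the cupy branch is never taken, so the helper behaves like `np.asarray`.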
`revert_repeated_index_to_invalid` was updated to deal only with numpy for now, as the connectivities are always numpy arrays. `test_halo_exchange_for_sparse_field` is marked `embedded_only`; the non-MPI test was already marked embedded-only. This does not try to unify the default and distributed CI pipeline definitions; that should, however, be done sooner or later as well.
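To illustrate the idea behind the numpy-only helper (a sketch under assumptions, not the PR's actual implementation: the sentinel value `-1` and the exact semantics are hypothetical), "reverting repeated indices to invalid" can be read as: within each row of a 2D neighbor table, keep the first occurrence of each index and replace later repeats with an invalid marker. The per-row bookkeeping is why this does not vectorize usefully:

```python
import numpy as np

INVALID_INDEX = -1  # hypothetical sentinel for "no neighbor"


def revert_repeated_index_to_invalid(neighbor_table: np.ndarray) -> np.ndarray:
    """Replace repeated indices within each row of a 2D connectivity table
    with INVALID_INDEX, keeping only the first occurrence per row.
    Plain Python loops are fine here since connectivities are numpy arrays,
    and the per-row deduplication does not vectorize usefully anyway."""
    out = neighbor_table.copy()
    for row in out:
        seen = set()
        for j, idx in enumerate(row):
            if idx in seen:
                row[j] = INVALID_INDEX
            else:
                seen.add(int(idx))
    return out
```

The input table is left untouched; only the returned copy carries the sentinel values.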