Add gtfn_gpu to distributed CI pipeline#1012
Conversation
cscs-ci run distributed
This reverts commit e30c2f7.
Make revert_repeated_index_to_invalid numpy-only as it's not usefully vectorized
cscs-ci run default
There may be a problem with the distributed+cuda venv since my latest merge of main... Let's see.
Mandatory Tests: please make sure you run these tests via comment before you merge!
Optional Tests:
To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatests for static field computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
I changed the distributed gtfn_gpu/common job back to the normal partition, because the shared partition has a maximum time limit of one hour. I've added a TODO for that as well: once we get the timings below an hour, we can move it back to the shared partition.
Adds the gtfn_gpu backend to the distributed CI pipeline. dace_gpu is still left out because compilation takes too long.
The base image is upgraded because it is now possible, though not strictly necessary. The CPU-only version of the pipeline required 25.04 (24.04 and 25.10 did not work, for different reasons). However, since OpenMPI and libfabric are now built manually in the container, the base image version is less of a constraint: 24.04 lacks matching GCC/CUDA versions and 26.04 does not exist yet, but the pipeline should eventually move to 26.04.
OpenMPI and libfabric are built manually for Slingshot support, because the Ubuntu repository packages could not easily be made to work with GPU support. The installation is based on https://github.com/eth-cscs/cray-network-stack.
GHEX needs an upgrade because there's a bug in how strides are calculated for GPU buffers. @philip-paul-mueller has already fixed this in ghex-org/GHEX#190, but we should wait for that to be merged (and probably test it in icon-exclaim first).
This also fixes a few cupy/numpy incompatibilities.
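A common pattern for such incompatibilities (a hedged sketch only; the actual fixes in this PR may differ) is to normalize device arrays to host arrays before handing them to numpy-only code. `as_numpy` here is a hypothetical helper name; cupy is treated as optional:

```python
import numpy as np

try:
    # cupy is optional; fall back to numpy-only behavior if it is missing
    import cupy as cp
except ImportError:
    cp = None


def as_numpy(arr):
    """Return a numpy array for `arr`, copying device data to host if needed."""
    if cp is not None and isinstance(arr, cp.ndarray):
        return arr.get()  # explicit device-to-host copy
    return np.asarray(arr)
```

On a CPU-only system the cupy branch is never taken, so the helper behaves like `np.asarray`.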
`revert_repeated_index_to_invalid` was updated to deal only with numpy for now, as the connectivities are always numpy arrays. `test_halo_exchange_for_sparse_field` is marked `embedded_only`; the non-MPI test was already marked embedded-only. This does not try to unify the default and distributed CI pipeline definitions; that should, however, be done sooner or later as well.
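To illustrate the idea behind the numpy-only helper (a sketch under assumptions, not the PR's actual implementation: the sentinel value `-1` and the exact semantics are hypothetical), "reverting repeated indices to invalid" can be read as: within each row of a 2D neighbor table, keep the first occurrence of each index and replace later repeats with an invalid marker. The per-row bookkeeping is why this does not vectorize usefully:

```python
import numpy as np

INVALID_INDEX = -1  # hypothetical sentinel for "no neighbor"


def revert_repeated_index_to_invalid(neighbor_table: np.ndarray) -> np.ndarray:
    """Replace repeated indices within each row of a 2D connectivity table
    with INVALID_INDEX, keeping only the first occurrence per row.
    Plain Python loops are fine here since connectivities are numpy arrays,
    and the per-row deduplication does not vectorize usefully anyway."""
    out = neighbor_table.copy()
    for row in out:
        seen = set()
        for j, idx in enumerate(row):
            if idx in seen:
                row[j] = INVALID_INDEX
            else:
                seen.add(int(idx))
    return out
```

The input table is left untouched; only the returned copy carries the sentinel values.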