Add gtfn_gpu to distributed CI pipeline #1012
Open
msimberg wants to merge 82 commits into C2SM:main from msimberg:distributed-tests-dace-gpu
Changes from 30 commits
Commits
1fd6389 Attempt to add cuda support to distributed ci pipeline (msimberg)
cbb1891 Add cuda12 extra (msimberg)
bbb151c Add nvidia-cuda-toolkit (msimberg)
b9be7fb Revert "refactor: testing infrastructure (#1002)" (msimberg)
731283a Use cxi hook in ci (msimberg)
ea2b3aa Try mpich (msimberg)
8f04d36 Reduce tests (msimberg)
9f96b70 Try using manually built openmpi (msimberg)
9fce9b5 Debugging (msimberg)
c6a767e Remove debug prints (msimberg)
0b9d26b Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
adb1ee6 Unrevert test download changes (msimberg)
b0321e7 Numpy/cupy issues (msimberg)
c62979c Enable shm, lnx, xpmem support in libfabric (msimberg)
b4071d0 Linting (msimberg)
6eb3d8d Enable GPU support for GHEX (msimberg)
28b1b1b Set appropriate gcc for cuda (msimberg)
73a5b5b Explicitly set OpenMPI settings (msimberg)
d8e90e4 Don't dlopen cuda and gdrcopy (msimberg)
67cfdb5 Update comments and clean up options (msimberg)
c81af9e Try ubuntu lts release for distributed ci (msimberg)
790612a Set gpu binding through SLURM_GPUS_PER_TASK (msimberg)
64482e8 Enable all tests again (msimberg)
b3eef3a Clean up names in distributed.yml (msimberg)
d6f71d6 Update base image to ubuntu 25.10 (msimberg)
7b68f7b Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
518bbde Mark distributed compute_geofac_div test embedded only, like single-r… (msimberg)
c1eed7f Use philip's async-mpi branch (fixes gpu buffer stride computation) (msimberg)
d08b60c Increase time limit for distributed dace tests (msimberg)
148850c Increase time limit for distributed dace_gpu common tests (msimberg)
c6d0042 Merge branch 'main' into distributed-tests-dace-gpu (jcanton)
0c727f5 sorry2 (jcanton)
0c51fa2 Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
5b6ddf8 Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
f415370 Merge branch 'main' into distributed-tests-dace-gpu (nfarabullini)
47d0e63 Merge branch 'main' into distributed-tests-dace-gpu (nfarabullini)
ce21e8f modified np strict references with broader array_ns (nfarabullini)
74938f3 Merge branch 'main' into distributed-tests-dace-gpu (nfarabullini)
878db70 Update interpolation_fields.py (nfarabullini)
c449030 ran pre-commit (nfarabullini)
9460369 removed additional but unused return val (nfarabullini)
c3606ae Update interpolation_fields.py (nfarabullini)
81375ca ran pre-commit (nfarabullini)
6362e62 small fix to tuple (nfarabullini)
0ed42b7 Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
000efca Fix numpy/cupy inconsistency in test_parallel_grid_manager.py (msimberg)
b0c8f5e Loosen rbf tolerance again for gpu (msimberg)
f7f7dcd Fix allocator argument (msimberg)
1744318 Specify backend for all metrics fields (msimberg)
5995058 Add missing allocator to test_parallel_grid_refinement.py (msimberg)
23170e2 Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
2ff0109 Add another missing allocator (msimberg)
f802f98 More allocators (msimberg)
8d78f0b Increase timeout in distributed tests (msimberg)
addef83 Format files (msimberg)
5d97f1d Increase timelimit further (msimberg)
9b153ff More consistency for cupy/numpy, use cupy more extensively in serialb… (msimberg)
ea3304f Very long distributed gpu ci time limit (msimberg)
a5bac95 Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
89424c9 Move check_local_global_field helper to common file for reuse elsewhere (msimberg)
79dcdb3 Add customizable tolerance to check_local_global_field (msimberg)
d555a26 numpy/cupy fixes (msimberg)
57d1694 Slightly loosen test_parallel_grid_manager.py tolerances again (msimberg)
61a3f45 print failures immediately in ci (msimberg)
fb209bb Fix formatting and linter warnings (msimberg)
a5c633f Make some field tests unit tests in test_parallel_grid_manager.py (msimberg)
20c0ea1 Don't test r01b01 grid anymore (msimberg)
7cccfef Split geometry fields test in test_parallel_grid_manager.py into unit… (msimberg)
0b063cc Only run integration tests in distributed CI (msimberg)
eb01d9a Apply suggestion from @msimberg (msimberg)
fd93fb8 Test only dace_gpu/common in distributed pipeline (msimberg)
702d0bd Try persistent cache and more workers on distributed CI pipeline (msimberg)
a7f60f0 Test only gtfn_gpu (msimberg)
5652ce8 Update distributed config (msimberg)
5ba050c Upgrade mpi4py (msimberg)
3a37c9c Remove explicitl GT4PY_BUILD_JOBS from distributed pipeline (msimberg)
9c57322 Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
5beb757 Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
99811d0 Decrease distributed gpu timelimit (msimberg)
d6dcc6c Use normal partition for long distributed CI jobs (msimberg)
fabff2f Remove gpus per task entry from distributed ci configuration (msimberg)
ee1115c Merge remote-tracking branch 'origin/main' into distributed-tests-dac… (msimberg)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,27 +1,124 @@ | ||
| -FROM ubuntu:25.04 | ||
| +FROM ubuntu:25.10 | ||
| | ||
| ENV LANG C.UTF-8 | ||
| ENV LC_ALL C.UTF-8 | ||
| | ||
| ARG DEBIAN_FRONTEND=noninteractive | ||
| -RUN apt-get update -qq && apt-get install -qq -y --no-install-recommends \ | ||
| -strace \ | ||
| -build-essential \ | ||
| -tar \ | ||
| -wget \ | ||
| -curl \ | ||
| -libboost-dev \ | ||
| -libnuma-dev \ | ||
| -libopenmpi-dev \ | ||
| -ca-certificates \ | ||
| -libssl-dev \ | ||
| -autoconf \ | ||
| -automake \ | ||
| -libtool \ | ||
| -pkg-config \ | ||
| -libreadline-dev \ | ||
| -git && \ | ||
| +RUN apt-get update && \ | ||
| +apt-get install -y --no-install-recommends \ | ||
| +autoconf \ | ||
| +automake \ | ||
| +build-essential \ | ||
| +ca-certificates \ | ||
| +curl \ | ||
| +git \ | ||
| +libboost-dev \ | ||
| +libconfig-dev \ | ||
| +libcurl4-openssl-dev \ | ||
| +libfuse-dev \ | ||
| +libjson-c-dev \ | ||
| +libnl-3-dev \ | ||
| +libnuma-dev \ | ||
| +libreadline-dev \ | ||
| +libsensors-dev \ | ||
| +libssl-dev \ | ||
| +libtool \ | ||
| +libuv1-dev \ | ||
| +libyaml-dev \ | ||
| +nvidia-cuda-dev \ | ||
| +nvidia-cuda-toolkit \ | ||
| +nvidia-cuda-toolkit-gcc \ | ||
| +pkg-config \ | ||
| +python3 \ | ||
| +strace \ | ||
| +tar \ | ||
| +wget && \ | ||
| rm -rf /var/lib/apt/lists/* | ||
| | ||
| +ENV CC=/usr/bin/cuda-gcc | ||
| +ENV CXX=/usr/bin/cuda-g++ | ||
| +ENV CUDAHOSTCXX=/usr/bin/cuda-g++ | ||
| | ||
| +# Install OpenMPI configured with libfabric, libcxi, and gdrcopy support for use | ||
| +# on Alps. This is based on examples in | ||
| +# https://github.com/eth-cscs/cray-network-stack. | ||
| +ARG gdrcopy_version=2.5.1 | ||
| +RUN set -eux; \ | ||
| +git clone --depth 1 --branch "v${gdrcopy_version}" https://github.com/NVIDIA/gdrcopy.git; \ | ||
| +cd gdrcopy; \ | ||
| +make lib -j"$(nproc)" lib_install; \ | ||
| +cd /; \ | ||
| +rm -rf /gdrcopy; \ | ||
| +ldconfig | ||
| | ||
| +ARG cassini_headers_version=release/shs-13.0.0 | ||
| +RUN set -eux; \ | ||
| +git clone --depth 1 --branch "${cassini_headers_version}" https://github.com/HewlettPackard/shs-cassini-headers.git; \ | ||
| +cd shs-cassini-headers; \ | ||
| +cp -r include/* /usr/include/; \ | ||
| +cp -r share/* /usr/share/; \ | ||
| +rm -rf /shs-cassini-headers | ||
| | ||
| +ARG cxi_driver_version=release/shs-13.0.0 | ||
| +RUN set -eux; \ | ||
| +git clone --depth 1 --branch "${cxi_driver_version}" https://github.com/HewlettPackard/shs-cxi-driver.git; \ | ||
| +cd shs-cxi-driver; \ | ||
| +cp -r include/* /usr/include/; \ | ||
| +rm -rf /shs-cxi-driver | ||
| | ||
| +ARG libcxi_version=release/shs-13.0.0 | ||
| +RUN set -eux; \ | ||
| +git clone --depth 1 --branch "${libcxi_version}" https://github.com/HewlettPackard/shs-libcxi.git; \ | ||
| +cd shs-libcxi; \ | ||
| +./autogen.sh; \ | ||
| +./configure \ | ||
| +--with-cuda; \ | ||
| +make -j"$(nproc)" install; \ | ||
| +cd /; \ | ||
| +rm -rf /shs-libcxi; \ | ||
| +ldconfig | ||
| | ||
| +ARG xpmem_version=0d0bad4e1d07b38d53ecc8f20786bb1328c446da | ||
| +RUN set -eux; \ | ||
| +git clone https://github.com/hpc/xpmem.git; \ | ||
| +cd xpmem; \ | ||
| +git checkout "${xpmem_version}"; \ | ||
| +./autogen.sh; \ | ||
| +./configure --disable-kernel-module; \ | ||
| +make -j"$(nproc)" install; \ | ||
| +cd /; \ | ||
| +rm -rf /xpmem; \ | ||
| +ldconfig | ||
| | ||
| +# NOTE: xpmem is not found correctly without setting the prefix explicitly in | ||
| +# --enable-xpmem | ||
| +ARG libfabric_version=v2.4.0 | ||
| +RUN set -eux; \ | ||
| +git clone --depth 1 --branch "${libfabric_version}" https://github.com/ofiwg/libfabric.git; \ | ||
| +cd libfabric; \ | ||
| +./autogen.sh; \ | ||
| +./configure \ | ||
| +--with-cuda \ | ||
| +--enable-xpmem=/usr \ | ||
| +--enable-tcp \ | ||
| +--enable-cxi; \ | ||
| +make -j"$(nproc)" install; \ | ||
| +cd /; \ | ||
| +rm -rf /libfabric; \ | ||
| +ldconfig | ||
| | ||
| +ARG openmpi_version=5.0.9 | ||
| +RUN set -eux; \ | ||
| +curl -fsSL "https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-${openmpi_version}.tar.gz" -o /tmp/ompi.tar.gz; \ | ||
| +tar -C /tmp -xzf /tmp/ompi.tar.gz; \ | ||
| +cd "/tmp/openmpi-${openmpi_version}"; \ | ||
| +./configure \ | ||
| +--with-ofi \ | ||
| +--with-cuda=/usr; \ | ||
| +make -j"$(nproc)" install; \ | ||
| +cd /; \ | ||
| +rm -rf "/tmp/openmpi-${openmpi_version}" /tmp/ompi.tar.gz; \ | ||
| +ldconfig | ||
| | ||
| # Install uv: https://docs.astral.sh/uv/guides/integration/docker | ||
| COPY --from=ghcr.io/astral-sh/uv:0.9.24@sha256:816fdce3387ed2142e37d2e56e1b1b97ccc1ea87731ba199dc8a25c04e4997c5 /uv /uvx /bin/ |
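
The Dockerfile above builds OpenMPI against libfabric and CUDA, so one way to sanity-check the resulting image is to ask the installed tools what they were built with. The following is only a sketch and is not part of this PR. It assumes `ompi_info` and `fi_info` end up on `PATH` (the `configure` calls above set no `--prefix`, so the default install location is assumed), and the helper names `ompi_reports_cuda`, `fabric_lists_provider`, and `run_checks` are hypothetical. The `cxi` provider may not be listed on a machine without Slingshot hardware, so a missing `cxi` entry outside Alps is not necessarily a build error.

```shell
#!/bin/sh
# Sketch: sanity checks for the MPI stack built in the image.
# Helper names are hypothetical and not taken from the PR.

# Succeed if the given `ompi_info --parsable --all` output reports CUDA support.
ompi_reports_cuda() {
    printf '%s\n' "$1" | grep -q 'mpi_built_with_cuda_support:value:true'
}

# Succeed if the given `fi_info -l` output lists the named provider ($2).
fabric_lists_provider() {
    printf '%s\n' "$1" | grep -q "^$2:"
}

# Run the checks against the tools installed in the current environment,
# skipping any tool that is not available.
run_checks() {
    if command -v ompi_info >/dev/null 2>&1; then
        if ompi_reports_cuda "$(ompi_info --parsable --all 2>/dev/null)"; then
            echo "openmpi: CUDA support reported"
        else
            echo "openmpi: CUDA support NOT reported"
        fi
    else
        echo "openmpi: ompi_info not found, skipping"
    fi

    if command -v fi_info >/dev/null 2>&1; then
        providers="$(fi_info -l 2>/dev/null)"
        for provider in cxi tcp; do
            if fabric_lists_provider "$providers" "$provider"; then
                echo "libfabric: provider $provider listed"
            else
                echo "libfabric: provider $provider NOT listed"
            fi
        done
    else
        echo "libfabric: fi_info not found, skipping"
    fi
}

run_checks
```

Run inside the container, this prints one line per check. Keeping the parsing in small functions that take the tool output as an argument makes them easy to exercise without the tools installed.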