Limited-area transforms on GPU #279

ddegrauwe · 2025-06-24T08:25:30Z

This PR adds the capability to perform limited-area spectral transforms on GPU, based on work originally done by Meteo-France and NVIDIA.

- uses the same HIP FFT interface as the global transforms;
- uses OpenACC for the computational kernels;
- builds both GPU and CPU benchmark programs;
- CUDA graphs are disabled for the limited-area case, because using them gives occasional crashes.

ddegrauwe · 2025-06-24T08:26:44Z

This limited-area contribution required two small modifications in the global transforms:

- The loop in trltom_pack_unpack: the wavenumber loop should run up to MAXVAL(G%NMEN) instead of NSMAX. Otherwise the results are wrong for a LAM grid where NLAT<NLON.
- LAM transforms require distinct FFT plans for the zonal and meridional directions. This is done by passing negative values for kfield to hicfft.hip.cpp, which then takes the absolute value to get the actual number of fields. A bit dirty, but it minimizes the impact on the global transforms.
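
To illustrate the sign convention described above, here is a minimal Python sketch. It is not the actual hicfft.hip.cpp code: the `PlanCache` class, its `get_plan` method and the dictionary standing in for an FFT plan are all hypothetical. Only the idea comes from this PR: a negative kfield marks the meridional direction, and the absolute value gives the number of fields.

```python
from typing import Dict, Tuple


class PlanCache:
    """Hypothetical cache holding one FFT 'plan' per (direction, length, fields)."""

    def __init__(self) -> None:
        self._plans: Dict[Tuple[str, int, int], dict] = {}

    def get_plan(self, nfft: int, kfield: int) -> dict:
        # Assumption for this sketch: the sign of kfield selects the direction,
        # and abs(kfield) recovers the actual number of fields.
        direction = "meridional" if kfield < 0 else "zonal"
        nfields = abs(kfield)
        key = (direction, nfft, nfields)
        if key not in self._plans:
            # A plain dict stands in for a real FFT plan handle.
            self._plans[key] = {
                "direction": direction,
                "nfft": nfft,
                "nfields": nfields,
            }
        return self._plans[key]


cache = PlanCache()
zonal_plan = cache.get_plan(64, 10)
meridional_plan = cache.get_plan(64, -10)
```

With the same transform length and field count, the two calls return distinct plans, while callers that only ever pass positive values (the global transforms) see no change in behaviour.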

dhaumont · 2025-06-24T08:37:48Z

src/etrans/common/internal/ellips.F90

Shouldn't this PARKIND1 to EC_PARKIND change be applied automatically to the generated files by the sed transformation?

dhaumont · 2025-06-24T08:42:06Z

src/etrans/gpu/CMakeLists.txt

There is a lot of duplicated code in the CMake files between GPU, CPU, LAM and global. We should at some point try to reduce this duplication to the bare minimum.

samhatfield · 2025-06-24T08:55:36Z

Thanks for this @ddegrauwe. Some quick suggestions before I start a proper review:

It looks like you're using tab indentation in a few places, which makes it hard to read depending on how the editor shows it. Could you replace tabs with spaces? We use two-space indentation in CMake, and usually two spaces in Fortran (usually...). I spotted issues in these files:
- CMakeLists.txt
- src/programs/CMakeLists.txt
- src/programs/ectrans-lam-benchmark.F90

Is it strictly always true that MAXVAL(G%NMEN) = NSMAX? I can make a counterexample: if you set up trans with NSMAX = 511 and NDGL = 250 and with an octahedral grid in G%NLOEN, then MAXVAL(G%NMEN) = ((20 + 4*249) - 1) / 2 = 507, which is less than NSMAX (a linear grid would be chosen due to the limited number of latitudes; see here).

src/programs/CMakeLists.txt

ddegrauwe · 2025-06-24T09:44:38Z

Hi @samhatfield

Thanks for the OpenMP fix!

> It looks like you're using tab indentation in a few places which makes it hard to read depending on how the editor shows it. Could you replace tabs with spaces? We use two space indentation in CMake, and usually two spaces in Fortran (usually...). I spotted issues in these files
> - CMakeLists.txt
> - src/programs/CMakeLists.txt
> - src/programs/ectrans-lam-benchmark.F90

Sorry about those. Should be fine now.

> Is it strictly always true that MAXVAL(G%NMEN) = NSMAX? I can make a counterexample: if you setup trans with NSMAX = 511 and NDGL = 250 and with an octahedral grid in G%NLOEN, then MAXVAL(G%NMEN) = ((20 + 4*249) - 1) / 2 = 507 which is less than NSMAX (a linear grid would be chosen due to the limited number of latitudes see here).

Indeed, MAXVAL(G%NMEN) is not the same as NSMAX. But for a global model, G%NMEN never exceeds NSMAX, so looping until NSMAX is fine. There's a condition IF (JM <= G_NMEN(IGLG)) THEN inside the loop, so having the loop bounds too wide (i.e. using NSMAX instead of MAXVAL(G%NMENMAX)) does no harm in the global case.

For a LAM domain, however, we can have a rectangular domain which is larger in the zonal direction than in the meridional direction. In that case, we have NSMAX<G%NMEN. Looping only until NSMAX will give wrong results because some waves are omitted.
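
The difference between the two cases can be sketched in a few lines of Python. This is only an illustration with made-up numbers: `packed_waves`, the `nmen_*` lists and the `nsmax_*` values are hypothetical stand-ins for the guarded loop, G%NMEN and NSMAX, not values taken from ecTrans.

```python
def packed_waves(nmen, upper):
    """(latitude, wavenumber) pairs visited by a rectangular loop over
    jm = 0..upper that is guarded by `jm <= nmen[lat]`."""
    return [
        (lat, jm)
        for lat in range(len(nmen))
        for jm in range(upper + 1)
        if jm <= nmen[lat]
    ]


# Global-like case (made-up numbers): no nmen value exceeds nsmax,
# so the guard makes the wider bound harmless.
nsmax_global = 7
nmen_global = [3, 5, 7, 7, 5, 3]
same_in_global_case = (
    packed_waves(nmen_global, nsmax_global)
    == packed_waves(nmen_global, max(nmen_global))
)

# LAM-like case (made-up numbers): nmen exceeds nsmax on every latitude,
# so a loop bounded by nsmax silently skips the highest wavenumbers.
nsmax_lam = 4
nmen_lam = [6, 6, 6, 6]
full = set(packed_waves(nmen_lam, max(nmen_lam)))
truncated = set(packed_waves(nmen_lam, nsmax_lam))
missing = sorted(full - truncated)
```

In the global-like case both bounds visit the same pairs. In the LAM-like case `missing` holds wavenumbers 5 and 6 on each of the four latitudes, which corresponds to the omitted waves mentioned above.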

samhatfield · 2025-06-24T10:33:33Z

> Sorry about those. Should be fine now.

No problem - looks like you spotted some preexisting tabs as well. They are very hard to spot...

> Indeed, MAXVAL(G%NMEN) is not the same as NSMAX. But for a global model, G%NMEN is always smaller than NSMAX, so looping until NSMAX is fine. There's a condition IF (JM <= G_NMEN(IGLG)) THEN inside the loop, so having the loop bounds too wide (i.e. using NSMAX instead of MAXVAL(G%NMENMAX)) doesn't harm in the global case.
>
> For a LAM domain, however, we can have a rectangular domain which is larger in the zonal direction than in the meridional direction. In that case, we have NSMAX<G%NMEN. Looping only until NSMAX will give wrong results because some waves are omitted.

Makes sense. Given the IF that you mention, why don't we simply rewrite the loop as

DO KGL=1,D_NDGL_FS
  DO JM=0,G_NMEN(OFFSET_VAR+KGL-1)
    DO JF=1,KF_FS
      IOFF_LAT = KF_FS*D_NSTAGTF(KGL)+(JF-1)*(D_NSTAGTF(KGL+1)-D_NSTAGTF(KGL))
      SCAL = 1._JPRBT/REAL(G_NLOEN(OFFSET_VAR+KGL-1),JPRBT)
      ISTA  = 2_JPIB*D_NPNTGTB0(JM,KGL)*KF_FS
      FOUBUF_IN(ISTA+2*JF-1) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+1)
      FOUBUF_IN(ISTA+2*JF  ) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+2)
    ENDDO
  ENDDO
ENDDO

? Then we don't need IGLG nor NMEN_MAX.

ddegrauwe · 2025-06-24T12:14:21Z

> Makes sense. Given the IF that you mention, why don't we simply rewrite the loop as
>
> DO KGL=1,D_NDGL_FS
>   DO JM=0,G_NMEN(OFFSET_VAR+KGL-1)
>     DO JF=1,KF_FS
>       IOFF_LAT = KF_FS*D_NSTAGTF(KGL)+(JF-1)*(D_NSTAGTF(KGL+1)-D_NSTAGTF(KGL))
>       SCAL = 1._JPRBT/REAL(G_NLOEN(OFFSET_VAR+KGL-1),JPRBT)
>       ISTA  = 2_JPIB*D_NPNTGTB0(JM,KGL)*KF_FS
>       FOUBUF_IN(ISTA+2*JF-1) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+1)
>       FOUBUF_IN(ISTA+2*JF  ) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+2)
>     ENDDO
>   ENDDO
> ENDDO
>
> ? Then we don't need IGLG nor NMEN_MAX.

In that case it would no longer be possible to collapse the loops for OpenACC or OpenMP.
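
A small Python sketch of why the guarded rectangular form is friendlier to collapsing, on the assumption that a collapse clause needs inner loop bounds that do not depend on the outer loop index. The function names and the `nmen` values are hypothetical. With fixed bounds, the loop nest maps onto one flat iteration space whose index can be decoded with a division and a remainder. With a per-latitude bound there is no fixed row width to decode against.

```python
def rectangular_collapsed(nmen):
    """Fixed bounds plus a guard: one flat index covers the whole nest."""
    width = max(nmen) + 1            # same inner extent for every latitude
    visited = []
    for flat in range(len(nmen) * width):
        kgl, jm = divmod(flat, width)  # decode the flat index
        if jm <= nmen[kgl]:            # guard replaces the variable bound
            visited.append((kgl, jm))
    return visited


def triangular_nested(nmen):
    """Inner bound depends on the outer index: no fixed row width exists."""
    return [
        (kgl, jm)
        for kgl in range(len(nmen))
        for jm in range(nmen[kgl] + 1)
    ]


nmen = [2, 4, 4, 1]  # made-up per-latitude wavenumber limits
collapsed = rectangular_collapsed(nmen)
nested = triangular_nested(nmen)
```

Both forms visit the same (latitude, wavenumber) pairs. The rectangular one does some wasted iterations that the guard rejects, which is the price paid for a flat iteration space.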

samhatfield · 2025-06-24T12:35:00Z

> In that case it would no longer be possible to collapse the loops for OpenACC or OpenMP.

Ah yes, good point. Then we should leave it.

src/etrans/gpu/internal/suemp_trans_preleg_mod.F90

src/programs/ectrans-lam-benchmark.F90

ddegrauwe · 2025-07-02T12:26:01Z

Many thanks for your suggestions so far @samhatfield; I included them in a new commit.
The CI tests failed due to further evolution of the develop branch, so I merged the develop branch into this one and fixed the issue (including etrans library in transi).

wdeconinck · 2025-07-02T13:23:28Z

@samhatfield are you working on fixing the adjoint test to have a fixed seed?
https://github.com/ecmwf-ifs/ectrans/actions/runs/16025873962/job/45213547025?pr=279#step:12:6873

samhatfield · 2025-07-02T15:39:50Z

> @samhatfield are you working on fixing the adjoint test to have a fixed seed? https://github.com/ecmwf-ifs/ectrans/actions/runs/16025873962/job/45213547025?pr=279#step:12:6873

I thought I had already, but apparently I didn't include the direct transform test. Maybe that wasn't in develop at the time. Let me open another PR.

samhatfield · 2025-07-02T15:43:29Z

Here you go: #284.

wdeconinck · 2025-09-30T09:41:55Z

What is the status here?
Currently there are some merge conflicts.

ddegrauwe · 2026-01-23T13:26:54Z

Hi @samhatfield, @wdeconinck,

I just updated this pull request to the current develop branch. As far as I can tell, all checks pass and no merge conflicts remain.
Anything still blocking the merge here?

samhatfield · 2026-01-23T13:34:37Z

> Hi @samhatfield, @wdeconinck,
>
> I just updated this pull request to the current develop branch. As far as I can tell, all checks pass and no merge conflicts remain. Anything still blocking the merge here?

Hey Daan, nothing major to block the PR if I remember right (I guess I would have said so 6 months ago if there was). I'd like to have another look before approving. However, I'll only look at the "top level" files, like the CMake files etc., and just confirm that nothing you're adding here interferes with global transform functionality or anything we've introduced in the last 6 months.

As for everything in src/etrans/gpu, if it's okay with you, I'll not look closely at the changes there and trust you that everything is working.

src/programs/CMakeLists.txt

src/etrans/gpu/CMakeLists.txt

src/etrans/common/CMakeLists.txt

samhatfield · 2026-01-23T14:51:14Z

src/transi/CMakeLists.txt

+set(transi_PRIVATE_LIBRARIES trans_dp)
+if( HAVE_ETRANS )
+  list( APPEND transi_PRIVATE_LIBRARIES
+    etrans_dp
+  )
+endif()
+


Does this code actually do something?

It seems unused, and is probably a leftover from the merge.
Instead, etrans_dp has been added directly to the list of PRIVATE_LIBS further down, via a generator expression.

Checking this shows me that the GPU version transi_gpu_dp does not yet link with etrans_gpu_dp.

If that is not straightforward, I would leave it for a follow-up PR.

samhatfield

Approved subject to resolution of question above.

ddegrauwe added 14 commits June 16, 2025 17:06

WIP: added lam gpu sources

12dd036

WIP lam library now compiles; removed adjoint code for GPU

162449b

WIP. Fixed CMakeLists.txt to fix gpu compilation. Modified hicfft to identify meridional transform through negative kfields.

8cfa23b

fixes

1f1eb19

WIP: CRASHES WITH SIGSEGV. Using constant-address arrays for fft (necessary for graphs).

b8be65c

in-place fft in eftdir. Works when cuda graphs are disabled.

a8256ba

in-place fft in eledir; optimized buffered allocation in eltdir

d937912

in-place fft in eleinv

5bd5bea

in-place fft in eftinv

7c68b91

Disable combination of ETRANS and GPU_GRAPHS_FFT, which doesn't work

d2bd7e6

created common (single/double, cpu/gpu) folder for LAM transforms.

d19a1e1

small fix for single-precision

163de4e

Merge branch 'develop' into lam_gpu

d23dd7f

restore compilation options

5f73b25

github-actions bot added the contributor label Jun 24, 2025

dhaumont reviewed Jun 24, 2025

samhatfield reviewed Jun 24, 2025

src/programs/CMakeLists.txt

ddegrauwe and others added 4 commits June 24, 2025 11:17

Update src/programs/CMakeLists.txt

a295ca0

Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>

removed tabs

62b6d04

removed tabs

3d0d329

more cosmetics

51f884b

samhatfield reviewed Jun 26, 2025

src/etrans/gpu/internal/suemp_trans_preleg_mod.F90

samhatfield reviewed Jun 26, 2025

src/programs/ectrans-lam-benchmark.F90

samhatfield reviewed Jun 26, 2025

src/programs/ectrans-lam-benchmark.F90

ddegrauwe added 2 commits July 2, 2025 12:31

Merge remote-tracking branch 'origin/develop' into tmp

df3fd51

include etrans in transi private libs

3b37fd9

ddegrauwe and others added 2 commits July 2, 2025 14:35

Apply more suggestions from code review

d606ec1

Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>

Even more suggestions from review

8f05fe7

wdeconinck force-pushed the develop branch from aad830b to f1d16a6 August 26, 2025 09:27

merged develop

9fefd21

samhatfield self-requested a review January 23, 2026 13:35

samhatfield added the approved-for-ci label Jan 23, 2026

Merge branch 'develop' into degrauwe_lam_gpu

cd120de

github-actions bot removed the approved-for-ci label Jan 23, 2026

samhatfield added the approved-for-ci label Jan 23, 2026