Skip to content

Conversation

@ddegrauwe
Copy link
Collaborator

Adding capability of performing limited-area spectral transforms on GPU, based on work originally done by Meteo-France and NVIDIA.

  • using same HIP FFT interface as the global transforms;
  • using OpenACC for computational kernels;
  • GPU and CPU benchmark programs are built;
  • (using CUDA graphs gives occasional crashes, so it's disabled for the limited-area case)

@ddegrauwe
Copy link
Collaborator Author

ddegrauwe commented Jun 24, 2025

This limited-area contribution required two small modifications in the global transforms:

  • loop in trltom_pack_unpack, where wavenumber loop should go to MAXVAL(G%NMEN) instead of until NSMAX. Otherwise results are wrong for a LAM grid where NLAT<NLON
  • LAM transforms require to have distinct FFT plans for the zonal and meridional directions. This is done by providing negative values for kfield to hicfft.hip.cpp, which then takes the absolute value to have the actual number of fields. A bit dirty, but it minimizes the impact on the global transforms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this PARKIND1 to EC_PARKIND be automatically applied to a generated files by the sed transformation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of duplicated code in the cmake files between gpu, cpu, lam and global. We should at some point try to reduce this duplication to the bare minimum

@samhatfield
Copy link
Collaborator

samhatfield commented Jun 24, 2025

Thanks for this @ddegrauwe. Some quick suggestions before I start a proper review:

  • It looks like you're using tab indentation in a few places which makes it hard to read depending on how the editor shows it. Could you replace tabs with spaces? We use two space indentation in CMake, and usually two spaces in Fortran (usually...). I spotted issues in these files
    • CMakeLists.txt
    • src/programs/CMakeLists.txt
    • src/programs/ectrans-lam-benchmark.F90
  • Is it strictly always true that MAXVAL(G%NMEN) = NSMAX? I can make a counterexample: if you setup trans with NSMAX = 511 and NDGL = 250 and with an octahedral grid in G%NLOEN, then MAXVAL(G%NMEN) = ((20 + 4*249) - 1) / 2 = 507 which is less than NSMAX (a linear grid would be chosen due to the limited number of latitudes see here).

ddegrauwe and others added 4 commits June 24, 2025 11:17
@ddegrauwe
Copy link
Collaborator Author

Hi @samhatfield

Thanks for the OpenMP fix!

  • It looks like you're using tab indentation in a few places which makes it hard to read depending on how the editor shows it. Could you replace tabs with spaces? We use two space indentation in CMake, and usually two spaces in Fortran (usually...). I spotted issues in these files

    • CMakeLists.txt
    • src/programs/CMakeLists.txt
    • src/programs/ectrans-lam-benchmark.F90

Sorry about those. Should be fine now.

  • Is it strictly always true that MAXVAL(G%NMEN) = NSMAX? I can make a counterexample: if you setup trans with NSMAX = 511 and NDGL = 250 and with an octahedral grid in G%NLOEN, then MAXVAL(G%NMEN) = ((20 + 4*249) - 1) / 2 = 507 which is less than NSMAX (a linear grid would be chosen due to the limited number of latitudes see here).

Indeed, MAXVAL(G%NMEN) is not the same as NSMAX. But for a global model, G%NMEN is always smaller than NSMAX, so looping until NSMAX is fine . There's a condition IF (JM <= G_NMEN(IGLG)) THEN inside the loop, so having the loop bounds too wide (i.e. using NSMAX instead of MAXVAL(G%NMENMAX)) doesn't harm in the global case.

For a LAM domain, however, we can have a rectangular domain which is larger in the zonal direction than in the meridional direction. In that case, we have NSMAX<G%NMEN. Looping only until NSMAX will give wrong results because some waves are omitted.

@samhatfield
Copy link
Collaborator

Sorry about those. Should be fine now.

No problem - looks like you spotted some preexisting tabs as well. They are very hard to spot...

Indeed, MAXVAL(G%NMEN) is not the same as NSMAX. But for a global model, G%NMEN is always smaller than NSMAX, so looping until NSMAX is fine . There's a condition IF (JM <= G_NMEN(IGLG)) THEN inside the loop, so having the loop bounds too wide (i.e. using NSMAX instead of MAXVAL(G%NMENMAX)) doesn't harm in the global case.

For a LAM domain, however, we can have a rectangular domain which is larger in the zonal direction than in the meridional direction. In that case, we have NSMAX<G%NMEN. Looping only until NSMAX will give wrong results because some waves are omitted.

Makes sense. Given the IF that you mention, why don't we simply rewrite the loop as

DO KGL=1,D_NDGL_FS
  DO JM=0,G_NMEN(OFFSET_VAR+KGL-1)
    DO JF=1,KF_FS
      IOFF_LAT = KF_FS*D_NSTAGTF(KGL)+(JF-1)*(D_NSTAGTF(KGL+1)-D_NSTAGTF(KGL))
      SCAL = 1._JPRBT/REAL(G_NLOEN(OFFSET_VAR+KGL-1),JPRBT)
      ISTA  = 2_JPIB*D_NPNTGTB0(JM,KGL)*KF_FS
      FOUBUF_IN(ISTA+2*JF-1) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+1)
      FOUBUF_IN(ISTA+2*JF  ) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+2)
    ENDDO
  ENDDO
ENDDO

? Then we don't need IGLG nor NMEN_MAX.

@ddegrauwe
Copy link
Collaborator Author

Makes sense. Given the IF that you mention, why don't we simply rewrite the loop as

DO KGL=1,D_NDGL_FS
  DO JM=0,G_NMEN(OFFSET_VAR+KGL-1)
    DO JF=1,KF_FS
      IOFF_LAT = KF_FS*D_NSTAGTF(KGL)+(JF-1)*(D_NSTAGTF(KGL+1)-D_NSTAGTF(KGL))
      SCAL = 1._JPRBT/REAL(G_NLOEN(OFFSET_VAR+KGL-1),JPRBT)
      ISTA  = 2_JPIB*D_NPNTGTB0(JM,KGL)*KF_FS
      FOUBUF_IN(ISTA+2*JF-1) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+1)
      FOUBUF_IN(ISTA+2*JF  ) = SCAL * PREEL_COMPLEX(IOFF_LAT+2*JM+2)
    ENDDO
  ENDDO
ENDDO

? Then we don't need IGLG nor NMEN_MAX.

In that case it would no longer be possible to collapse the loops for OpenACC or OpenMP.

@samhatfield
Copy link
Collaborator

In that case it would no longer be possible to collapse the loops for OpenACC or OpenMP.

Ah yes, good point. Then we should leave it.

@ddegrauwe
Copy link
Collaborator Author

Many thanks for your suggestions so far @samhatfield; I included them in a new commit.
The CI tests failed due to further evolution of the develop branch, so I merged the develop branch into this one and fixed the issue (including etrans library in transi).

ddegrauwe and others added 2 commits July 2, 2025 14:35
Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>
@wdeconinck
Copy link
Collaborator

@samhatfield are you working on fixing the adjoint test to have a fixed seed?
https://github.com/ecmwf-ifs/ectrans/actions/runs/16025873962/job/45213547025?pr=279#step:12:6873

@samhatfield
Copy link
Collaborator

@samhatfield are you working on fixing the adjoint test to have a fixed seed? https://github.com/ecmwf-ifs/ectrans/actions/runs/16025873962/job/45213547025?pr=279#step:12:6873

I thought I had already, but apparently I didn't include the direct transform test. Maybe that wasn't in develop at the time. Let me open another PR.

@samhatfield
Copy link
Collaborator

Here you go: #284.

@wdeconinck
Copy link
Collaborator

What is the status here?
Currently there are some merge conflicts.

@ddegrauwe
Copy link
Collaborator Author

Hi @samhatfield , @wdeconinck ,

I just updated this pull request to the current develop branch. As far as I can tell, all checks pass and no merge conflicts remain.
Anything still blocking the merge here?

@samhatfield
Copy link
Collaborator

samhatfield commented Jan 23, 2026

Hi @samhatfield , @wdeconinck ,

I just updated this pull request to the current develop branch. As far as I can tell, all checks pass and no merge conflicts remain. Anything still blocking the merge here?

Hey Daan, nothing major to block the PR if I remember right (I guess I would have said so 6 months ago if there was). I'd like to have another look again before approving. However I'll only look at the "top level" files, like the CMake files etc., and just confirm nothing you're adding here interferes with global transform functionality or anything we've introduced in the last 6 months.

As for everything in src/etrans/gpu, if it's okay with you, I'll not look closely at the changes there and trust you that everything is working.

Comment on lines +15 to +21
set(transi_PRIVATE_LIBRARIES trans_dp)
if( HAVE_ETRANS )
list( APPEND transi_PRIVATE_LIBRARIES
etrans_dp
)
endif()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this code actually do something?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unused, and probably a leftover of the merge.
Instead etrans_dp has been added directly in the list of PRIVATE_LIBS further via a generator expression

Checking this shows me that the GPU version transi_gpu_dp does not yet link with etrans_gpu_dp

If not straightforward, I would leave this for a follow-up PR.

Copy link
Collaborator

@samhatfield samhatfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved subject to resolution of question above.

@samhatfield samhatfield added this to the 1.8.0 milestone Jan 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants