
Conversation

@samhatfield
Collaborator

This is a slightly less brute-force alternative to PR #334, and it also lays the groundwork for eventually relying entirely on MPL in the GPU code path. Let me explain...

With this branch, if you disable GPU_AWARE_MPI, an MPI library is not required by ecTrans. No such library will be linked against and there will be no calls to MPI in any compiled object code. Whether MPI is called "under the hood" of MPL depends entirely on whether you compiled FIAT with or without MPI. In the latter case, the MPI serial fallback will be used. This means you can test on GPU platforms without an MPI installation by simply building FIAT without MPI and disabling GPU_AWARE_MPI.

For now, GPU_AWARE_MPI requires direct calls to MPI, so only for that configuration do we need to link against MPI::MPI_Fortran explicitly. Eventually we should have support for passing GPU buffers to MPL, and when that happens we can finally delete all references to MPI from ecTrans and rely entirely on MPL, much as we already do for the CPU version.
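To illustrate the build-side consequence, here is a minimal CMake sketch of the idea (the target name ectrans_gpu and the USE_GPU_AWARE_MPI definition are illustrative, not the actual code in this PR):

# Illustrative only: MPI is searched for and linked solely when GPU-aware MPI
# is requested. In every other configuration all communication goes through
# FIAT's MPL, so no MPI library is needed at configure or link time.
if( HAVE_GPU_AWARE_MPI )
  find_package( MPI REQUIRED COMPONENTS Fortran )
  target_link_libraries( ectrans_gpu PRIVATE MPI::MPI_Fortran )
  target_compile_definitions( ectrans_gpu PRIVATE USE_GPU_AWARE_MPI )
endif()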

@wdeconinck
Collaborator

This is a great step!

  1. I would keep using HAVE_MPI. It is customary and shorter

  2. Since you're on this now, I was thinking we could immediately take advantage of MPL with the MPI_F08 backend as part of this PR.
    I have created a fiat PR, Export availability of MPL_F08 to downstream packages fiat#74, to be merged a.s.a.p., which you can query to see whether fiat was compiled with the MPI_F08 API. You can already use the variable fiat_HAVE_MPL_F08 even if that PR is not merged, as it will evaluate to FALSE when not defined.

    • If fiat has MPI_F08, then we can use MPL directly even for GPU-aware MPI, and we can already test this.
    • If fiat was not compiled with MPI_F08 (so previous releases, or MPL_F77_DEPRECATED=ON), we need to keep calling MPI_F08 explicitly for now.

So the logic needs to be a bit different for this to work.
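Roughly something like this (just a sketch; ECTRANS_USE_RAW_MPI is an illustrative name, not something in the PR):

# Sketch of the suggested configure-time branching, not actual PR code.
if( fiat_HAVE_MPL_F08 )
  # fiat's MPL exposes the MPI_F08 API, so GPU buffers can go through MPL
  # and no direct MPI calls are needed, even for GPU-aware MPI.
  set( ECTRANS_USE_RAW_MPI OFF )
else()
  # Older fiat releases or MPL_F77_DEPRECATED=ON: keep the explicit mpi_f08
  # code path for the GPU-aware configuration for now.
  set( ECTRANS_USE_RAW_MPI ON )
endif()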

@samhatfield
Collaborator Author

samhatfield commented Nov 24, 2025

  1. I would keep using HAVE_MPI. It is customary and shorter

There are preexisting references to ectrans_HAVE_MPI, e.g. in transi and in ectrans-import.cmake.in. Is it the case that this variable is automatically set by ecbuild_add_option( FEATURE MPI ... )? If so, now that that option doesn't exist anymore, I would have to replace those instances with HAVE_MPI (and set( HAVE_MPI ${fiat_HAVE_MPI} )). Not a problem, but then I wonder if it's better simply to delete the line from ectrans-import.cmake.in, since MPI is not a feature of ecTrans anymore.
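For illustration, the two options would look roughly like this (a sketch only, nothing committed):

# Option (a): mirror FIAT's MPI state so downstream consumers (transi,
# ectrans-import.cmake.in) still see a HAVE_MPI / ectrans_HAVE_MPI value.
set( HAVE_MPI ${fiat_HAVE_MPI} )
# Option (b): delete the ectrans_HAVE_MPI line from ectrans-import.cmake.in
# altogether, since MPI is no longer a feature of ecTrans itself.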

@samhatfield
Collaborator Author

samhatfield commented Nov 24, 2025

Following offline discussions with @wdeconinck, I've added support for the MPI_F08 feature (on by default) of FIAT. This further reduces the set of configurations where it's necessary to call MPI directly (what I call "raw" MPI). In fact, the only remaining such configuration is when ecTrans is built against a FIAT version earlier than {next version to be released} (a new release with MPI_F08 compatibility hasn't been made yet).

If we were in future to make {next FIAT version to be released} the minimum supported FIAT version, we could simply delete all raw MPI calls.
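To make that concrete, a purely hypothetical sketch of the remaining gate (the version string is a placeholder for {next version to be released}, and ECTRANS_USE_RAW_MPI is an illustrative name):

# Placeholder version check, for illustration only.
if( fiat_VERSION VERSION_LESS "<next FIAT release>" )
  # FIAT predates the MPI_F08-capable MPL: keep the raw MPI code path.
  set( ECTRANS_USE_RAW_MPI ON )
else()
  set( ECTRANS_USE_RAW_MPI OFF )
endif()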

I will do some testing to make sure everything is working, before this can be merged.

@samhatfield
Collaborator Author

Problems on LUMI... I wonder if we have to add an exception for CCE.

@wdeconinck
Collaborator

Problems on LUMI... I wonder if we have to add an exception for CCE.

I think this is the Cray issue biting us again: #157 (comment)
The MPI_F08 API for Cray at least seems broken... An exception seems warranted, but we should also check whether this has been fixed in the meantime on LUMI with a newer CCE.
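If an exception does turn out to be necessary, it could look something like this (speculative sketch only; nothing is decided yet, and ECTRANS_USE_RAW_MPI is an illustrative name):

# Speculative: avoid the MPL_F08 path when the Cray Fortran compiler is
# detected, until the MPI_F08 issue on LUMI is resolved.
if( CMAKE_Fortran_COMPILER_ID MATCHES "Cray" )
  set( ECTRANS_USE_RAW_MPI ON )
endif()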

@samhatfield
Collaborator Author

Unfortunately I think we will have to enable MPL_F77_DEPRECATED when building FIAT on LUMI. I get numerous MPI errors even when testing ecTrans 1.7.0, as long as FIAT:develop is used. I'll document and "fix" this in a separate PR.

@samhatfield
Collaborator Author

Wow, what a nightmare. After a lot of tedious debugging, I noticed that I had accidentally removed GPU_AWARE_MPI. That's why the LUMI adjoint test failed (in fact, you could argue the other tests were failing too, just silently). When I put it back (correctly), the AC GPU tests started failing. The issue is "cannot find MPL_RECV/SEND", which may indicate a problem with passing GPU buffers to those subroutines. It looks like we may have to fall back on MPI_F77 for NVHPC.

@samhatfield
Collaborator Author

During debugging I noticed some issues with TRMTOLAD and TRLTOMAD, which at one point I thought were the culprits, but that turned out to be a red herring. Still, we should fix those, so I've opened another PR (#340) and rebased this branch against that one.

@samhatfield
Collaborator Author

The plot thickens: TRGTOL builds fine on AC GPU with MPL_F08. TRLTOG does not, even though in both cases MPL_RECV and MPL_SEND are called in the same way with the same type of arguments.

Comment on lines 125 to 129
#ifdef PARKINDTRANS_SINGLE
#define TRMTOL_DTYPE MPI_REAL4
#define TRMTOLAD_DTYPE MPI_REAL4
#else
#define TRMTOL_DTYPE MPI_REAL8
#define TRMTOLAD_DTYPE MPI_REAL8
#endif
@wdeconinck
Collaborator

wdeconinck commented Nov 27, 2025

This section still needs to be guarded by #ifdef USE_RAW_MPI because MPI_REAL4 and MPI_REAL8 are normally not available. I wonder why this worked.
Do we have a similar construct in the other tr*to* routines?

@samhatfield
Collaborator Author

TRMTOLAD_DTYPE is only referenced inside #ifdef USE_RAW_MPI regions, which is why there's no compile error.

Labels: enhancement (New feature or request), gpu