Skip to content

Conversation

@samhatfield
Copy link
Contributor

MPL_SETDFLT_COMM doesn't update the value of MPL_COMM (which is simply the one-indexed rank of
"this rank" in the currently set communicator). It updates MPL_NUMPROC but not MPL_COMM. Here's
an example of this behaviour:

PROGRAM TEST_PROGRAM

USE PARKIND1,        ONLY : JPIM
USE MPL_MODULE,      ONLY : MPL_INIT, MPL_COMM, MPL_COMM_SPLIT, MPL_SETDFLT_COMM, MPL_END, &
  &                         MPL_MYRANK, MPL_NPROC

IMPLICIT NONE

INTEGER(KIND=JPIM) :: IERROR
INTEGER(KIND=JPIM) :: SPLIT_NUM_RANKS, SPLIT_RANK
INTEGER(KIND=JPIM) :: SPLIT_COLOUR, SPLIT_KEY
INTEGER(KIND=JPIM) :: SPLIT_COMM
INTEGER(KIND=JPIM) :: DUMMY_COMM

CALL MPL_INIT()

! First rank in group 0, others in group 1
SPLIT_COLOUR = MERGE(0, 1, MPL_MYRANK() == 1)
SPLIT_KEY = MPL_MYRANK()
WRITE(6,*) "Rank ", SPLIT_KEY, ", will be in group", split_colour

! Split communicator according to rank colour
CALL MPL_COMM_SPLIT(MPL_COMM, SPLIT_COLOUR, SPLIT_KEY, SPLIT_COMM, IERROR)
CALL MPL_SETDFLT_COMM(SPLIT_COMM, DUMMY_COMM)

! Get rank and comm size after splitting
SPLIT_RANK = MPL_MYRANK()
SPLIT_NUM_RANKS = MPL_NPROC()

WRITE(6,*) "Rank ", SPLIT_RANK, "in group", SPLIT_COLOUR, "in comm with size", SPLIT_NUM_RANKS

CALL MPL_END()

END PROGRAM TEST_PROGRAM

This prints:

Rank            1 , will be in group           0
Rank            2 , will be in group           1
Rank            3 , will be in group           1
Rank            4 , will be in group           1
Rank            1 in group           0 in comm with size           1
Rank            2 in group           1 in comm with size           3
Rank            3 in group           1 in comm with size           3
Rank            4 in group           1 in comm with size           3

"Rank 4 in group 1 in comm with size 3" -> there's your problem.

This PR fixes this by having MPL_SETDFLT_COMM also update MPL_COMM. The same program prints:

Rank            1 , will be in group           0
Rank            2 , will be in group           1
Rank            3 , will be in group           1
Rank            4 , will be in group           1
Rank            1 in group           0 in comm with size           1
Rank            1 in group           1 in comm with size           3
Rank            2 in group           1 in comm with size           3
Rank            3 in group           1 in comm with size           3

@mo-joshuacolclough
Copy link

Can confirm I run into this issue in ecmwf-ifs/ectrans/pull/318, and that this PR fixes it. Thanks

@wdeconinck
Copy link
Collaborator

Thanks @samhatfield could you make this program into a test?

@wdeconinck
Copy link
Collaborator

Could you also check that this works with the option -DENABLE_MPL_F77_DEPRECATED=ON ?
That may mean doing the same changes in the parallel internal directory.

@wdeconinck wdeconinck requested a review from marsdeno November 11, 2025 13:23
@samhatfield
Copy link
Contributor Author

Done. I wasn't sure what all the extra CMake infrastructure was for for the existing MPL tests (e.g. test_mpl_no_output.cmake) so I just ignored that and declared the test like I do in ecTrans. Hope that's okay.

@samhatfield samhatfield force-pushed the fix_mpl_set_dflt_comm branch from 5ae301a to b04e58c Compare November 11, 2025 14:39
Copy link
Collaborator

@wdeconinck wdeconinck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @samhatfield for this fix and extra test!

Copy link
Collaborator

@wdeconinck wdeconinck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get this error when configuring fiat with -DENABLE_MPL_F77_DEPRECATED=ON
@marsdeno , some inconsistency with MPL_F77 and MPL_F08 ?

/***/fiat/tests/test_mpl_split_comm.F90:34:90:

   34 |   CALL MPL_MESSAGE("TEST_MPL_SPLIT_COMM", "MPL_COMM_SPLIT failed", IERROR, LDABORT=.TRUE.)
      |                                                                                          1
Error: Type mismatch in argument 'kerror' at (1); passed CHARACTER(19) to INTEGER(4)

@samhatfield
Copy link
Contributor Author

Just a case of shifting the arguments of MPL_F77/MPL_MESSAGE so they match those of MPI_F08/MPL_MESSAGE? Looks like @marsdeno wanted to put all non-optional arguments first when refactoring for MPL_F08.

@wdeconinck
Copy link
Collaborator

Just a case of shifting the arguments of MPL_F77/MPL_MESSAGE so they match those of MPI_F08/MPL_MESSAGE? Looks like @marsdeno wanted to put all non-optional arguments first when refactoring for MPL_F08.

I guess you can name the IERROR argument to avoid this error

@samhatfield
Copy link
Contributor Author

I guess you can name the IERROR argument to avoid this error

Yeah true. However is the "API" of MPL_F77 not supposed to match exactly that of MPL_F08?

@wdeconinck
Copy link
Collaborator

Yes, but that should be fixed in another PR, or reviewed by @marsdeno without blocking this PR from going in.

@samhatfield
Copy link
Contributor Author

The problem is that CDMESSAGE is not a named argument, and I don't think it's possible to put an unnamed argument after a named argument. This

CALL MPL_MESSAGE("TEST_MPL_SPLIT_COMM", CDSTRING="MPL_COMM_SPLIT failed", KERROR=IERROR, &
  &              LDABORT=.TRUE.)

doesn't work either:

fiat/tests/test_mpl_split_comm.F90:35:34:

   35 |     &              LDABORT=.TRUE.)
      |                                  1
Error: Type mismatch in argument 'kerror' at (1); passed CHARACTER(19) to INTEGER(4)
fiat/tests/test_mpl_split_comm.F90:34:83:

   34 |   CALL MPL_MESSAGE("TEST_MPL_SPLIT_COMM", CDSTRING="MPL_COMM_SPLIT failed", KERROR=IERROR, &
      |                                                                                   1
Error: Keyword argument 'kerror' at (1) is already associated with another actual argument

(Error when enabling MPL_F77_DEPRECATED.)

@wdeconinck
Copy link
Collaborator

You have to also name the CDMESSAGE argument:

CALL MPL_MESSAGE(CDMESSAGE="TEST_MPL_SPLIT_COMM", CDSTRING="MPL_COMM_SPLIT failed", KERROR=IERROR, &
  &              LDABORT=.TRUE.)

I am not sure that MPL_MESSAGE is meant as an external routine; in fact why did you not choose to use MPL_ABORT as elsewhere in the same test?

@wdeconinck
Copy link
Collaborator

I have fixed the MPL_F77_DEPRECATED functionality.
This shows we need for the time being also CI testing for this deprecated configuration.

@wdeconinck
Copy link
Collaborator

The problem is that CDMESSAGE is not a named argument, and I don't think it's possible to put an unnamed argument after a named argument.

FYI, you can name every argument in Fortran, even if non-optional, and then you can move it around as you want.

@samhatfield
Copy link
Contributor Author

samhatfield commented Nov 18, 2025

I am not sure that MPL_MESSAGE is meant as an external routine; in fact why did you not choose to use MPL_ABORT as elsewhere in the same test?

Hmm, I'm not sure. I think I was copying the style I saw elsewhere. MPL_MESSAGE is useful for when you want to control whether you abort or not from another logical I suppose (LDABORT). In this case we just want to abort, so should I switch to MPL_ABORT?

@samhatfield
Copy link
Contributor Author

FYI, you can name every argument in Fortran, even if non-optional, and then you can move it around as you want.

I did not know that!

@wdeconinck
Copy link
Collaborator

In this case we just want to abort, so should I switch to MPL_ABORT?

@samhatfield yes please

@samhatfield
Copy link
Contributor Author

@samhatfield yes please

Done.

@wdeconinck wdeconinck merged commit a9995a0 into ecmwf-ifs:develop Nov 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants