
Sync from NCAR/main (many combined updates) #369

Open
grantfirl wants to merge 59 commits into ufs-community:ufs/dev from grantfirl:NCAR-main-sync-20260401

Conversation

@grantfirl (Collaborator) commented Apr 2, 2026

Description of Changes:

This PR brings development from NCAR/main (several separate PRs) to the ufs/dev branch. It primarily contains development from NRL and is not expected to change RT results.

NCAR/main #1183:
Improve file handling in two places: close the namelist file in lnd_iau_mod_set_control if the land IAU section is not found and INTERNAL_FILE_NML is not used, and check that the file open call in cires_ugwpv0_mod_init is successful; if it is not, the scheme returns with an appropriate error message and error code, following CCPP requirements.
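A minimal sketch of the CCPP-compliant open check described here (only errmsg, errflg, and the error convention come from the text above; the subroutine and variable names are illustrative):

```fortran
! Sketch of a CCPP-style file-open check; names other than errmsg/errflg
! are illustrative, not the actual cires_ugwpv0_mod_init code.
subroutine example_init(input_file, errmsg, errflg)
   character(len=*), intent(in)  :: input_file
   character(len=*), intent(out) :: errmsg
   integer,          intent(out) :: errflg

   integer :: funit, ios

   errmsg = ''
   errflg = 0

   open(newunit=funit, file=trim(input_file), status='old', iostat=ios)
   if (ios /= 0) then
      ! report the failure to the host model instead of stopping
      write(errmsg,'(3a,i0)') 'example_init: could not open ', &
                              trim(input_file), ', iostat = ', ios
      errflg = 1
      return
   end if
   ! ... read the data here ...
   close(funit)
end subroutine example_init
```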

NCAR/main #1188:
Prevent division by zero when applying the free convection adjustment in NSSTM by requiring the heat content in the diurnal thermocline, xt, to be positive. This is consistent with the earlier calculation of the convective adjustment depth d_conv in the convdepth subroutine.
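In sketch form (only xt comes from the text above; the other names are illustrative placeholders, not the actual NSST code):

```fortran
! Apply the free convection adjustment only for positive heat content xt,
! mirroring the guard described above; q_conv and tconv are illustrative.
subroutine apply_fca(xt, q_conv, tconv)
   real, intent(in)  :: xt      ! heat content in diurnal thermocline
   real, intent(in)  :: q_conv  ! illustrative forcing term
   real, intent(out) :: tconv   ! illustrative adjustment

   tconv = 0.0
   if (xt > 0.0) then           ! xt > 0 makes the division below safe
      tconv = q_conv / xt
   end if
end subroutine apply_fca
```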

NCAR/main #1189:

  • Remove the unused variable latr, as well as the unused local variables exists, ios, dxsg, and k, from the GWD schemes.
  • Remove white space, empty lines, and tabs (all done in a separate commit).
  • Remove duplicated intent = out lines from two meta files.

NCAR/main #1199:
Update radlw_main.F90: remove comments at the end of the file for compatibility with LLVM 22.

NCAR/main #1201:
Remove unused variables and improve the scheme description in cnvc90.f (removing a duplicated description).

NCAR/main #1202:

  • Improve readability, inspired by updates in GWD/cires_ugwpv1_oro.F90 (making comparison between v0 and v1 easier).
  • Convert GWD/ugwp_driver_v0.F to GWD/ugwp_driver_v0.F90 and update the related meta files.
  • Remove the constants con_g and con_omega, which are passed in but not used, and remove unused old code.
  • Add errflg and errmsg.

NCAR/main #1187:
This PR modifies how data is read in the CCPP init and timestep_init phases. Instead of reading the data serially on every single MPI task, the data is read by the MPI root rank and then broadcast. This is implemented for all code except the GOCART aerosols (NEPTUNE doesn't use these, so we have no way to test them; the new o3 and h2o code also still needs to be checked).

The implementation takes the path described in NCAR#1106: an MPI broadcast wrapper is added in a new module, mpiutils, which wraps around the (now type-dependent) MPI interfaces in mpi_f08.

The CCPP MPI broadcast routines in this PR use a ccpp_abort function to stop the model in the event of an MPI error. This does not follow CCPP requirements; it was done to avoid having to pass errmsg and errflg all the way down and then back out to the host model to abort. Compliance with the current CCPP rules can be implemented, but it is worth discussing whether alternative methods are preferable and/or simplify the code. To note: the authoritative code in NCAR ccpp-physics in many places simply calls stop to abort the model, which is much worse than using MPI_ABORT and of course also not CCPP compliant. In NEPTUNE, we've used a function equivalent to ccpp_abort in these places.
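A minimal sketch of such a wrapper (the names mpiutils, ccpp_bcast, and ccpp_abort come from the text above; the supported types and argument order shown are assumptions):

```fortran
! Sketch of a type-dispatching broadcast wrapper around the mpi_f08
! interfaces; only two specific procedures are shown for brevity.
module mpiutils
   use mpi_f08, only: MPI_Comm, MPI_Bcast, MPI_Abort, MPI_COMM_WORLD, &
                      MPI_INTEGER, MPI_REAL, MPI_SUCCESS
   implicit none
   private
   public :: ccpp_bcast

   ! generic interface dispatching on the type of the buffer argument
   interface ccpp_bcast
      module procedure bcast_int_0d
      module procedure bcast_real_1d
   end interface ccpp_bcast

contains

   subroutine bcast_int_0d(val, root, comm)
      integer,        intent(inout) :: val
      integer,        intent(in)    :: root
      type(MPI_Comm), intent(in)    :: comm
      integer :: ierr
      call MPI_Bcast(val, 1, MPI_INTEGER, root, comm, ierr)
      if (ierr /= MPI_SUCCESS) call ccpp_abort('ccpp_bcast: MPI_Bcast failed')
   end subroutine bcast_int_0d

   subroutine bcast_real_1d(arr, root, comm)
      real,           intent(inout) :: arr(:)
      integer,        intent(in)    :: root
      type(MPI_Comm), intent(in)    :: comm
      integer :: ierr
      call MPI_Bcast(arr, size(arr), MPI_REAL, root, comm, ierr)
      if (ierr /= MPI_SUCCESS) call ccpp_abort('ccpp_bcast: MPI_Bcast failed')
   end subroutine bcast_real_1d

   subroutine ccpp_abort(msg)
      ! aborts on MPI errors; as noted above, this bypasses the usual
      ! errmsg/errflg plumbing and is therefore not CCPP compliant
      character(len=*), intent(in) :: msg
      integer :: ierr
      write(*,'(a)') trim(msg)
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
   end subroutine ccpp_abort

end module mpiutils
```

A scheme then makes the same call form, e.g. call ccpp_bcast(nlev, mpiroot, mpicomm), for any supported buffer type after the root rank has read the data.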

NCAR/main #1197:
Add the ability to build with the ip library if it is found. The sp library is being replaced by ip, so this is required. Note that in spack-stack the ip package builds with the OpenMP flag, so OpenMP gets turned on in the CMake build. This means that CMAKE_Fortran_FLAGS_OPENMP_OFF needs to be set by the host model, since the RRTMGP files currently break if compiled with OpenMP flags; this variable makes sure OpenMP isn't used for the RRTMGP files.

NCAR/main #1205:
Similar changes to NCAR/main #1187, but targeted at NOAA model-specific interstitials; also removes some problematic ccpp_bcast calls in aerinterp.F90.

Also point the MYNN-SFC submodule to its ccpp/dev branch.

Tests Conducted:

SCM RTs, UFS RTs, and NEPTUNE testing for the individual PRs along the way using NCAR/main

Dependencies:

None

Documentation:

N/A

Issue (optional):

Fixes #372

Contributors (optional):

@matusmartini @climbfuji @scrasmussen

hertneky and others added 30 commits November 24, 2025 18:42
This PR changes the option relative_path in the CCPP metadata to dependencies_path as discussed in NCAR/ccpp-framework#685.
…, return with a meaningful error message and flag
@AnningCheng-NOAA (Collaborator) left a comment


fine with me

@grantfirl (Collaborator, Author) commented

@climbfuji It looks like most/all of the nested UFS RTs are still hanging when using these code changes. Can you think of any reason why only the nested tests would hang?

I traced the point where the model is stopping back to UFSATM/fv3/module_fcst_grid_comp.F90/fcst_initialize here: https://github.com/NOAA-EMC/ufsatm/blob/0f4fe59702e81be34c5b6bce5ab31b2e6004d804/fv3/module_fcst_grid_comp.F90#L937

This happens when n=2 in the loop. It seems like ESMF is basically setting an error and returning, but the model doesn't actually stop. The run directory on Ursa is: /scratch3/BMC/gmtb/Grant.Firl/stmp2/Grant.Firl/FV3_RT/rt_884137/hafs_regional_1nest_atm_intel

I further looked into the ESMF log file for this process (000) and found an error stack, but this is as far as I went, since this is well outside of the realm of physics.

20260427 171000.089 ERROR PET000 ESMCI_DistGrid.C:1441 ESMCI::DistGrid::create() Invalid argument - deCount must match between provided DELayout and provided regDecomp
20260427 171000.089 ERROR PET000 ESMCI_DistGrid_F.C:147 c_esmc_distgridcreaterd() Invalid argument - Internal subroutine call returned Error
20260427 171000.090 ERROR PET000 ESMF_DistGrid.F90:1242 ESMF_DistGridCreateRD() Invalid argument - Internal subroutine call returned Error
20260427 171000.090 ERROR PET000 ESMF_Grid.F90:29822 ESMF_GridCreateDistgridReg Invalid argument - Internal subroutine call returned Error
20260427 171000.091 ERROR PET000 ESMF_Grid.F90:10892 ESMF_GridCreateNoPeriDimR Invalid argument - Internal subroutine call returned Error
20260427 171000.091 ERROR PET000 module_fcst_grid_comp.F90:261 Invalid argument - Passing error in return code
20260427 171000.091 ERROR PET000 module_fcst_grid_comp.F90:944 Invalid argument - Passing error in return code
20260427 171000.091 ERROR PET000 ufsatm_cap.F90:675 Invalid argument - Passing error in return code
20260427 171000.091 ERROR PET000 ATM:src/addon/NUOPC/src/NUOPC_ModelBase.F90:692 Invalid argument - Passing error in return code
20260427 171000.091 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:2918 Invalid argument - Phase 'IPDvXp01' Initialize for modelComp 1: ATM did not return ESMF_SUCCESS
20260427 171000.091 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:1365 Invalid argument - Passing error in return code
20260427 171000.091 ERROR PET000 UFS Driver Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Invalid argument - Passing error in return code
20260427 171000.091 ERROR PET000 UFS.F90:397 Invalid argument - Aborting UFS

@grantfirl (Collaborator, Author) commented Apr 27, 2026

@climbfuji There does appear to be some MPI broadcasting happening immediately above where the model stops. Would there be some kind of broadcasting conflict between what is going on in the physics vs what is happening in module_fcst_grid_comp.F90?

@climbfuji commented

@climbfuji There does appear to be some MPI broadcasting happening immediately above where the model stops. Would there be some kind of broadcasting conflict between what is going on in the physics vs what is happening in module_fcst_grid_comp.F90?

This rings a bell. @dustinswales had to add an ugly workaround to the RRTMGP (?) code to pass in another MPI communicator (long before the mpi_bcast changes were made). I never had the chance to look at this, but my guess is that the MPI communicator passed in is not the correct one. In a situation with nests, my guess is that you have multiple MPI communicators (similar to a coupled model).

@grantfirl (Collaborator, Author) commented

@climbfuji There does appear to be some MPI broadcasting happening immediately above where the model stops. Would there be some kind of broadcasting conflict between what is going on in the physics vs what is happening in module_fcst_grid_comp.F90?

This rings a bell. @dustinswales had to add an ugly workaround to the RRTMGP (?) code to pass in another MPI communicator (long before the mpi_bcast changes were made). I never had the chance to look at this, but my guess is that the MPI communicator passed in is not the correct one. In a situation with nests, my guess is that you have multiple MPI communicators (similar to a coupled model).

Ya, this came to mind for me as well (with respect to the RRTMGP/nested thing). I too was wondering if they had the same root problem.

@dustinswales (Collaborator) commented

For GP to work with nested configs, I needed to add initialization flags to any module that was reading data and broadcasting. Otherwise, each nested domain would try to read the data, which was a no-no.
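In sketch form (all names here are assumptions, not the actual RRTMGP code), the pattern looks like:

```fortran
! Illustrative module-level initialization guard of the kind described
! above; module and routine names are hypothetical.
module gp_lut_load
   implicit none
   private
   public :: load_luts_once
   logical, save :: initialized = .false.
contains
   subroutine load_luts_once(mpirank, mpiroot)
      integer, intent(in) :: mpirank, mpiroot
      if (initialized) return          ! later (nested) domains skip the read
      if (mpirank == mpiroot) then
         ! ... root rank reads the lookup tables from file ...
      end if
      ! ... broadcast the tables to all ranks ...
      initialized = .true.
   end subroutine load_luts_once
end module gp_lut_load
```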

@grantfirl (Collaborator, Author) commented

@dustinswales The initialization flags for reading/broadcasting were reverted, though, and the workaround of setting the local RRTMGP mpiroot = 0 was reinstated upon merge. Issue #352 was added to fix this in the future. Apparently, the future is now! It definitely seems like a clue that setting mpiroot = 0 before broadcasting in the physics (at least in RRTMGP) "works", albeit perhaps for the wrong reason.

It seems like there is a conflict in the MPI root PE used for broadcasting somewhere in the UFS/FV3 code vs. the physics, and only when nesting is active. I'm out of my depth here, but a quick search brings up the following:

Mismatched Root Arguments: Every process in the communicator must pass the same integer value for the root argument. If Rank 0 calls MPI_Bcast(..., root=0, ...) while Rank 1 calls MPI_Bcast(..., root=1, ...), the MPI environment may hang because Rank 1 is waiting to send data while Rank 0 is expecting to send data, with no one acting as the receiver for either.

In Fortran, an MPI_Bcast root conflict occurs when the processes involved in the collective call do not agree on which process is the source (root) of the data. This is a common logical error that leads to undefined behavior, such as hanging (deadlocks), garbage data, or program crashes.
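A minimal self-contained illustration of that failure mode (not UFS code):

```fortran
program bcast_root_mismatch
   ! Illustration only: every rank must pass the same root to MPI_Bcast.
   ! Here rank 0 uses root=0 while the others use root=1, so two ranks
   ! believe they are the sender and the collective never completes.
   use mpi_f08
   implicit none
   integer :: rank, val, root, ierr

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   val  = rank
   root = merge(0, 1, rank == 0)   ! BUG: rank-dependent root
   call MPI_Bcast(val, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
   call MPI_Finalize(ierr)
end program bcast_root_mismatch
```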

So, it seems like the correct way to fix this is to find all mpi_bcast calls in the UFS and make sure that they're all using the correct MPI communicator and MPI root PE?

I'm doing a quick test where I'm running one of the HAFS nested RTs after switching the immediately preceding MPI_Bcast call in module_fcst_grid_comp to use mpp_root_pe(), which is what is passed in to the physics as "mpi_root". Of course, if this works, RRTMGP would need to go back to using mpi_root instead of 0.

@grantfirl (Collaborator, Author) commented

I'm doing a quick test where I'm running one of the HAFS nested RTs after switching the immediately preceding MPI_Bcast call in module_fcst_grid_comp to use mpp_root_pe(), which is what is passed in to the physics as "mpi_root". Of course, if this works, RRTMGP would need to go back to using mpi_root instead of 0.

No joy from this. Another test that I could try is to pass in 0 as the mpi_root for all ccpp_bcast calls and see if that "hack" works. Not that I want to keep that in; it's just a data point to help debug this.

@dustinswales (Collaborator) commented

@dustinswales The initialization flags for reading/broadcasting were reverted, though, and the workaround of setting the local RRTMGP mpiroot = 0 was reinstated upon merge. Issue #352 was added to fix this in the future. Apparently, the future is now! It definitely seems like a clue that setting mpiroot = 0 before broadcasting in the physics (at least in RRTMGP) "works", albeit perhaps for the wrong reason.

It seems like there is a conflict in the MPI root PE used for broadcasting somewhere in the UFS/FV3 code vs physics, only when nesting is active? I'm out of my depth here, but a quick search brings up the following:

Mismatched Root Arguments: Every process in the communicator must pass the same integer value for the root argument. If Rank 0 calls MPI_Bcast(..., root=0, ...) while Rank 1 calls MPI_Bcast(..., root=1, ...), the MPI environment may hang because Rank 1 is waiting to send data while Rank 0 is expecting to send data, with no one acting as the receiver for either.
In Fortran, an MPI_Bcast root conflict occurs when the processes involved in the collective call do not agree on which process is the source (root) of the data. This is a common logical error that leads to undefined behavior, such as hanging (deadlocks), garbage data, or program crashes.

So, it seems like the correct way to fix this is to find all mpi_bcast calls in the UFS and make sure that they're all using the correct MPI communicator and MPI root PE?

I'm doing a quick test where I'm running one of the HAFS nested RTs after switching the immediately preceding MPI_Bcast call in module_fcst_grid_comp to use mpp_root_pe(), which is what is passed in to the physics as "mpi_root". Of course, if this works, RRTMGP would need to go back to using mpi_root instead of 0.

@grantfirl My apologies. I forgot they reverted the init flag solution for GP. Snap.
I feel like you are on the right track by tracking down inconsistencies with the MPI communicator. It sure seems like a host configuration problem and not a physics issue.

@grantfirl (Collaborator, Author) commented

@grantfirl My apologies. I forgot they reverted the init flag solution for GP. Snap. I feel like you are on the right track by tracking down inconsistencies with the MPI communicator. It sure seems like a host configuration problem and not a physics issue.

Ya, it sure seems like some kind of inconsistency in the MPI root used for broadcasting. I'm asking Dusan for help to see if he remembers what might be different about the nested situation that could cause this, pointing toward a fix. It sure seems like the physics code is OK: it's trying to use the MPI communicator and root that it is given, but that conflicts with the host.



Development

Successfully merging this pull request may close these issues.

MYNN submodule pointing to Grant's branch
