Sync from NCAR/main (many combined updates) #369
grantfirl wants to merge 59 commits into ufs-community:ufs/dev
Conversation
UFS/dev PR327
This PR changes the option relative_path in the CCPP metadata to dependencies_path as discussed in NCAR/ccpp-framework#685.
ufs/dev sync (NCAR#329)
duplicate lines from meta files
Ufs dev pr336
UFS-dev PR#298
ufs-dev PRs 349 343
@climbfuji It looks like most/all of the nested UFS RTs are still hanging when using these code changes. Can you think of any reason why only the nested tests would hang? I traced the point where the model is stopping back to UFSATM/fv3/module_fcst_grid_comp.F90/fcst_initialize here: https://github.com/NOAA-EMC/ufsatm/blob/0f4fe59702e81be34c5b6bce5ab31b2e6004d804/fv3/module_fcst_grid_comp.F90#L937 This happens when n=2 in the loop. It seems like ESMF is basically setting an error and returning, but the model doesn't actually stop. The run directory on Ursa is: /scratch3/BMC/gmtb/Grant.Firl/stmp2/Grant.Firl/FV3_RT/rt_884137/hafs_regional_1nest_atm_intel

I looked further into the ESMF log file for this process (000) and found an error stack, but this is as far as I went, since this is well outside the realm of physics:

20260427 171000.089 ERROR PET000 ESMCI_DistGrid.C:1441 ESMCI::DistGrid::create() Invalid argument - deCount must match between provided DELayout
@climbfuji There does appear to be some MPI broadcasting happening immediately above where the model stops. Would there be some kind of broadcasting conflict between what is going on in the physics vs what is happening in module_fcst_grid_comp.F90?
This rings a bell. @dustinswales had to add an ugly workaround to the RRTMGP (?) code to pass in another MPI communicator (long before the mpi_bcast changes were made). I never had the chance to look at this, but my guess is that the MPI communicator passed in is not the correct one. In a situation with nests, my guess is that you have multiple MPI communicators (similar to a coupled model).
Ya, this came to mind for me as well (with respect to the RRTMGP/nested thing). I too was wondering if they had the same root problem.
For GP to work with nested configs, I needed to add initialization flags to any module that was reading data and broadcasting. |
@dustinswales The initialization flags for reading/broadcasting were reverted, though, and the workaround of setting the local RRTMGP mpiroot = 0 was reinstated upon merge. Issue #352 was opened to fix this in the future. Apparently, the future is now! It definitely seems like a clue that setting mpiroot = 0 before broadcasting in the physics (at least in RRTMGP) "works", albeit perhaps for the wrong reason. It seems like there is a conflict between the MPI root PE used for broadcasting somewhere in the UFS/FV3 code and the one used in the physics, only when nesting is active. I'm out of my depth here, but a quick search brings up the following:
So, it seems like the correct way to fix this is to find all mpi_bcast calls in the UFS and make sure that they're all using the correct MPI communicator and MPI root PE. I'm doing a quick test where I'm running one of the HAFS nested RTs with the immediately preceding MPI_bcast call in module_fcst_grid_comp switched to use mpp_root_pe(), which is what is passed in to the physics as "mpi_root". Of course, if this works, RRTMGP would need to go back to using mpi_root instead of 0.
No joy from this. Another test that I could try is to pass in 0 as the mpi_root for all ccpp_bcast calls and see if that "hack" works. Not that I want to keep that in; it's just a data point to help debug this.
@grantfirl My apologies. I forgot they reverted the init flag solution for GP. Snap. |
Ya, it sure seems like some kind of inconsistency in the MPI root used for broadcasting. I'm asking Dusan for help to see if he remembers what is different about the nested situation that could cause this, pointing to a fix. It sure seems like the physics code is OK: it's trying to use the MPI communicator and root that it is given, but that conflicts with what the host is using.
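To make the suspected failure mode concrete, here is a hypothetical sketch (routine and variable names are illustrative, not the actual UFS code): if the host broadcasts from rank 0 of a nest's communicator while the physics is handed a different root for the same communicator, the two collectives do not match and the participating ranks hang.

```fortran
! Illustrative only: mismatched broadcast roots on the same communicator.
subroutine host_side(fcst_mpi_comm)
  use mpi_f08
  type(MPI_Comm), intent(in) :: fcst_mpi_comm
  integer :: buf(10)
  ! The host broadcasts from rank 0 of its communicator ...
  call MPI_Bcast(buf, 10, MPI_INTEGER, 0, fcst_mpi_comm)
end subroutine host_side

subroutine physics_side(mpi_comm, mpi_root)
  use mpi_f08
  type(MPI_Comm), intent(in) :: mpi_comm
  integer,        intent(in) :: mpi_root   ! e.g. mpp_root_pe() from the host
  integer :: buf(10)
  ! ... while the physics broadcasts from mpi_root. If mpi_root /= 0 on the
  ! nest's communicator, the collectives don't match and the ranks hang.
  call MPI_Bcast(buf, 10, MPI_INTEGER, mpi_root, mpi_comm)
end subroutine physics_side
```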
Description of Changes:
This PR brings development from NCAR/main (several separate PRs) to the ufs/dev branch. It contains primarily development from NRL and is not expected to change RT results.
NCAR/main #1183:
Close the namelist file in lnd_iau_mod_set_control if the land IAU section is not found and INTERNAL_FILE_NML is not used, and check that the file open call in cires_ugwpv0_mod_init is successful; if not, return with an appropriate error message and error code, following CCPP requirements.
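The error-handling pattern described above can be sketched as follows (routine and file names are illustrative, not the actual lnd_iau_mod or cires_ugwpv0 code): a CCPP-compliant init phase sets errmsg/errflg and returns on a failed open instead of stopping the model.

```fortran
! Illustrative sketch of CCPP-style error handling on a failed file open.
subroutine example_mod_init(input_file, errmsg, errflg)
  implicit none
  character(len=*), intent(in)  :: input_file
  character(len=*), intent(out) :: errmsg
  integer,          intent(out) :: errflg
  integer :: nlunit, ios

  errmsg = ''
  errflg = 0

  open(newunit=nlunit, file=trim(input_file), status='old', &
       action='read', iostat=ios)
  if (ios /= 0) then
     ! Return with a meaningful message and nonzero flag; do not stop here.
     errmsg = 'example_mod_init: could not open file ' // trim(input_file)
     errflg = 1
     return
  end if
  ! ... read data ...
  close(nlunit)   ! always close the unit, also when a section is not found
end subroutine example_mod_init
```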
NCAR/main #1188:
Prevent division by zero when applying the free convection adjustment in NSST: require the heat content in the diurnal thermocline (xt) to be positive. This is consistent with the earlier calculation of the depth for convective adjustment (d_conv) in the convdepth subroutine.
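The guard can be sketched as follows (illustrative only; not the exact NSST source):

```fortran
! Illustrative sketch: apply the free convection adjustment only when the
! diurnal thermocline heat content xt is positive; the adjustment terms
! divide by xt, so xt <= 0 would otherwise cause a division by zero.
if (xt > 0.0) then
   ! ... free convection adjustment (terms proportional to 1/xt) ...
end if
```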
NCAR/main #1189:
NCAR/main #1199:
Update radlw_main.F90: remove comments at end of file for LLVM 22
NCAR/main #1201:
Remove unused variables and improve scheme description in cnvc90.f (remove duplicate description).
NCAR/main #1202:
NCAR/main #1187:
This PR modifies how data is read in the CCPP init and timestep_init phases. Instead of every MPI task reading the data serially, the data is read by the MPI root rank and then broadcast. This is implemented for all code except the GOCART aerosols (NEPTUNE doesn't use these, hence we have no way to test; also still to check: the new o3 and h2o code).
The implementation takes the path described in NCAR#1106: an MPI broadcast wrapper is added in a new module, mpiutils, which wraps around the (now type-dependent) MPI interfaces in mpi_f08.
The CCPP MPI broadcast routines in this PR use a ccpp_abort function to stop the model in the event of an MPI error. This does not follow CCPP requirements; it avoids having to pass errmsg and errflg all the way down and then back out to the host model to abort. Compliance with the current CCPP rules can be implemented, but it is worth discussing whether alternative methods are preferable and/or simplify the code. To note: the authoritative code in NCAR ccpp-physics in many places simply calls stop to abort the model. That's much worse than using MPI_ABORT and of course also not CCPP compliant. In NEPTUNE, we've used a function equivalent to ccpp_abort in these places.
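A minimal sketch of the wrapper approach described above (the module layout and procedure names beyond mpiutils, ccpp_bcast, and ccpp_abort are assumptions, not the PR's actual code): a generic interface dispatches to type-specific routines that wrap the mpi_f08 MPI_Bcast and call ccpp_abort on an MPI error.

```fortran
! Illustrative sketch of a generic MPI broadcast wrapper over mpi_f08.
module mpiutils_sketch
  use mpi_f08
  implicit none
  private
  public :: ccpp_bcast

  ! One generic name hides the type-dependent mpi_f08 interfaces.
  interface ccpp_bcast
     module procedure bcast_int_0d, bcast_real_1d
  end interface ccpp_bcast

contains

  subroutine bcast_int_0d(val, root, comm)
    integer,        intent(inout) :: val
    integer,        intent(in)    :: root
    type(MPI_Comm), intent(in)    :: comm
    integer :: ierr
    call MPI_Bcast(val, 1, MPI_INTEGER, root, comm, ierr)
    if (ierr /= MPI_SUCCESS) call ccpp_abort('MPI_Bcast failed (int_0d)')
  end subroutine bcast_int_0d

  subroutine bcast_real_1d(val, root, comm)
    real,           intent(inout) :: val(:)
    integer,        intent(in)    :: root
    type(MPI_Comm), intent(in)    :: comm
    integer :: ierr
    call MPI_Bcast(val, size(val), MPI_REAL, root, comm, ierr)
    if (ierr /= MPI_SUCCESS) call ccpp_abort('MPI_Bcast failed (real_1d)')
  end subroutine bcast_real_1d

  ! Stops the model on MPI errors instead of propagating errmsg/errflg.
  subroutine ccpp_abort(msg)
    character(len=*), intent(in) :: msg
    integer :: ierr
    write(0,'(a)') trim(msg)
    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end subroutine ccpp_abort

end module mpiutils_sketch
```

Extending support to another type or rank is then a matter of adding one more module procedure to the generic interface; call sites stay unchanged.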
NCAR/main #1197:
Add the ability to build with the ip library if it is found. The sp library is being replaced by ip, so this is required. Note that in spack-stack the ip package builds with the OpenMP flag, so OpenMP gets turned on in the CMake build. This means CMAKE_Fortran_FLAGS_OPENMP_OFF needs to be set by the host model, since the RRTMGP files currently break if compiled with OpenMP flags; this variable makes sure OpenMP isn't used for the RRTMGP files.
NCAR/main #1205:
Similar changes to NCAR/main NCAR#1187, but targeted toward NOAA model-specific interstitials; also removed some problematic ccpp_bcast calls in aerinterp.F90
Also point to the ccpp/dev branch of the MYNN-SFC submodule.
Tests Conducted:
SCM RTs, UFS RTs, NEPTUNE testing for individual PRs along the way using NCAR/main
Dependencies:
None
Documentation:
N/A
Issue (optional):
Fixes #372
Contributors (optional):
@matusmartini @climbfuji @scrasmussen