Sync NCAR/main branch of CCPP physics with ufs/dev branch #3190
grantfirl wants to merge 11 commits into ufs-community:develop
Conversation
There are several time-outs in my Ursa RT run. I think that this may be related to our resources being overused and fairshare reflecting that, throttling our runs.
I'll rerun the tests on Ursa as a secondary check. Also, a new GNU warning was introduced in this PR; can we get that fixed as part of this PR? Should just be an …
Sure, I'll make that change and update the PR.
It looks like the timeouts on the HAFS tests are real, unfortunately. I repeated one of the HAFS tests to make sure, and it does seem to hang early on in the run. Every other test passed. My first guess is that something in the CCPP updates is conflicting with the nests. This will need to be resolved before bringing in these changes.
@dpsarmie Thanks very much for rerunning those. I'll see if I can debug the HAFS tests that are timing out. I'm guessing that it's another issue with the new ccpp_bcast functionality, perhaps triggered by the nests.
@grantfirl Just wanted to confirm that this PR is ready for review? If so, I'll request UFSATM reviews.
No, there are still problems with the nested tests. I'll probably need help debugging these. I'm going to ask Dom, who made most of the changes related to mpi_bcast, to help.
@jkbk2004 @gspetro-NOAA Do you know of anyone who could help debug this nesting problem? I described what I'm seeing for @climbfuji here: ufs-community/ccpp-physics#369 (comment)
Perhaps @DusanJovic-NOAA would have some idea, since it's happening in UFSATM? That said, @NickSzapiro-NOAA suggested that you could potentially revert some of the CCPP changes in your PR (like NCAR/ccpp-physics#1187) to isolate where the failures might be coming from.
@DusanJovic-NOAA Do you know/remember what happens with the MPI root PE in a nested situation? It seems as though there is a conflict with the MPI root PE when nesting is turned on. The MPI_Bcast calls added in this PR try to use the mpi_root value given to physics (which is set from FV3's mpp_root_pe() function) rather than the hard-coded 0 value used in various places throughout the UFSATM code. I'm wondering if this is causing the conflict/hanging behavior noticed in this PR? Any insights are appreciated!
Root PE on the nested domain is the rank of the first MPI task that runs the nest. For example, if a parent domain runs on 80 tasks and the nest runs on 60 tasks, the first 80 tasks (ranks 0,..,79) will have root PE = 0, and the remaining 60 tasks (ranks 80,..,139) will have root PE = 80.
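To make the failure mode concrete, here is a minimal sketch (not code from this PR or from the model) of what happens when parent and nest tasks share one communicator but pass different roots to a collective:

```fortran
program bcast_root_mismatch
  ! Minimal sketch: ranks 0-79 mimic the parent domain (root PE = 0) and
  ! ranks 80-139 mimic the nest (root PE = 80). MPI requires every rank
  ! in a collective to pass the SAME root on a given communicator, so
  ! the two groups wait on different broadcasts and the run hangs.
  use mpi
  implicit none
  integer :: ierr, myrank, root
  real    :: buf

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  root = merge(0, 80, myrank < 80)   ! what mpp_root_pe() would return

  buf = 0.0
  ! Mismatched roots across ranks: erroneous MPI usage, typically a hang
  call MPI_Bcast(buf, 1, MPI_REAL, root, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program bcast_root_mismatch
```

Launched on 140 tasks, this reproduces the hang; the same call with a single agreed-upon root on all ranks completes immediately.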
I think I see where the problem is. In atmos_model.F90 we currently set the root handed to physics (Init_parm%master) from mpp_root_pe(), which differs between the parent and the nest. I think the solution is either: 1) pass 0 as the MPI root to physics on all tasks, or 2) create per-domain sub-communicators (with matching roots) for the collective calls.
I'm not sure about the rules for how CCPP treats multiple domains, or if it lacks a concept of multiple domains and treats all columns uniformly, just as a list/array of columns. This hasn't been an issue so far, probably because ccpp-physics hardcodes '0' for the root of MPI collective calls. Basically, solution 1) reverts to that behavior.
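For illustration, a runnable sketch of option 2 under the same assumed rank layout (this is not what the PR ended up doing; per the discussion below, the fix chosen was option 1, setting the root to 0):

```fortran
program per_domain_bcast
  ! Sketch of option 2: split the world communicator by domain so each
  ! domain can safely use root = 0 within its own communicator.
  ! Assumed layout for illustration: ranks 0-79 parent, 80-139 nest.
  use mpi
  implicit none
  integer :: ierr, myrank, domain_id, domain_comm
  real    :: buf

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

  domain_id = merge(0, 1, myrank < 80)   ! color: 0 = parent, 1 = nest
  call MPI_Comm_split(MPI_COMM_WORLD, domain_id, myrank, domain_comm, ierr)

  buf = merge(42.0, 0.0, myrank == 0 .or. myrank == 80)
  ! Root 0 is now consistent within each domain's communicator: no hang
  call MPI_Bcast(buf, 1, MPI_REAL, 0, domain_comm, ierr)

  call MPI_Comm_free(domain_comm, ierr)
  call MPI_Finalize(ierr)
end program per_domain_bcast
```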
@DusanJovic-NOAA I didn't follow this closely, but regarding the CCPP question: This depends on how the multiple domains are implemented. Is there a separate "copy" (instance) of CCPP for the global domain (the six tiles) and the nested domain (one tile)? Or are the grid points of the nested domain simply appended to those of the global domain?
@climbfuji I'm not sure I fully understand what you mean by 'a separate "copy" (instance) of CCPP for the global domain (the six tiles) and the nested domain (one tile)'. If I understand correctly how this works, CCPP runs over a list of columns on each MPI task, and each task runs its own copy/instance of CCPP. Does CCPP 'know' which (geographical) domain that list of points belongs to? I do not think it does. If it does 'know,' how does it know? The notion of domains, and the potential need for CCPP to 'know' this, is relevant if, for example, collective calls should only communicate with other MPI tasks belonging to the same domain, say to find a per-domain average, minimum, or maximum, or something like that. I don't know if this is or ever will be needed in CCPP; in that case, domain communicators will be required, I think.
@DusanJovic-NOAA @climbfuji I also had the idea of setting Init_parm%master to 0 as a test, and I can confirm that it does indeed work (the model no longer hangs). I'm just not sure that it is the preferred solution. Dom, if a different host model handles MPI communicators and roots differently, as long as all ccpp_bcast calls (and any remaining straight mpi_bcast calls) are set up to use whatever communicator/root they're given, instead of hard-coded values, is that OK from your perspective?
Yes, absolutely. I am pretty sure that's how it works in NEPTUNE. We have a specific MPI communicator with a specific MPI root, and CCPP should (and hopefully does) use those exclusively. Never MPI_COMM_WORLD or 0 by default.
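The pattern being agreed on here, sketched as a hypothetical wrapper (the actual ccpp_bcast interface in ccpp-physics may look different): the communicator and root are always caller-supplied, never defaulted.

```fortran
! Hypothetical sketch of a host-agnostic broadcast: the physics never
! assumes MPI_COMM_WORLD or root 0; both come from the host model
! (the mpi_root and communicator discussed above).
subroutine phys_bcast_real(buf, n, mpicomm, mpiroot, ierr)
  use mpi
  implicit none
  integer, intent(in)    :: n, mpicomm, mpiroot
  real,    intent(inout) :: buf(n)
  integer, intent(out)   :: ierr

  call MPI_Bcast(buf, n, MPI_REAL, mpiroot, mpicomm, ierr)
end subroutine phys_bcast_real
```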
@gspetro-NOAA I think that this should be ready to re-enter the UFS merge queue. I think that the nested hanging problem has been fixed. I've re-run Ursa RTs and everything is passing with the exception of control_csawmg_gnu with a time-out and control_p8_ugwpv1_tempo_aerosol_hail_intel with memory issues, both of which I'm hoping are transitory and not related to this PR. I'll upload the new log from Ursa when it finishes up; it's stuck on the last test in the queue since my project's fairshare is so low right now. That said, I'd like to reintroduce 2 commits from @dustinswales related to the RRTMGP scheme that were reverted in ufs-community/ccpp-physics#332 and related to ufs-community/ccpp-physics#352. I think that these were reverted due to the same issue that we ran into in this PR, with the model hanging when using nests. These 2 commits should now work with HAFS with RRTMGP, although I haven't tested it yet (also due to the low fairshare). I'm wondering if @BinLiu-NOAA might be willing to test RRTMGP with HAFS if I cherry-pick those 2 commits and put them into ufs-community/ccpp-physics#369?
Out of curiosity, how did you fix the MPI communicator problem?
Setting Init_parm%master to 0 (AKA the "mpi_root" known by the CCPP).
Ok, thanks.
@grantfirl Could we add NCAR/MYNN-SFC#27 as a dependency for your CCPP PR? I just filed ufs-community/ccpp-physics#372; basically, ufs/dev is still pointing at your feature/tendency_cleanup MYNN branch because we didn't see the PR listed at the WM level in #2810 to have you revert .gitmodules, update the hash, and merge that PR.
Yes, I was planning on doing that, since this was already pointed out to me by someone else.
@gspetro-NOAA The developers of MYNN-SFC went ahead and merged that PR, since it was forgotten during the last round. I made sure that ccpp-physics is now pointing to the correct commit hash of the correct fork/branch of MYNN-SFC, so I'm not adding the PR as a dependency. I made a note that I did this in the ccpp-physics PR and referenced the issue that you created.
Note: The Ursa pre-test log that was uploaded had compilation failures for MPAS. I forgot to update the MPAS-Model submodule before testing, so I just did this and am re-running the MPAS-specific tests only. I'll upload that log to the comments when finished.
@BinLiu-NOAA This PR most likely fixes ufs-community/ccpp-physics#352 too, because the mpi_root passed to RRTMGP is now consistent with the rest of UFS, even for nesting configurations. There isn't a HAFS nested test with RRTMGP to confirm this, but if you wanted to try, it would be appreciated, and we could perhaps close out the issue.
MPAS RT logs:
@grantfirl @NickSzapiro-NOAA There is an increase of one warning (warning #5194: Source line truncated) for four tests:
There is also an increase of 1 remark for several tests, but it appears to be the "ifort: remark #10448: Intel(R) Fortran Compiler Classic (ifort) is now deprecated" remark that will go away with the switch to LLVM. UPDATE: You can check my develop and 3190 run_dirs here:
Thanks @gspetro-NOAA! Do you know the line numbers, to make sure it's the comment that's truncated? It would be nice to fix all of these.
It looks like 2 warnings were added in physics/Radiation/radiation_gases.f (possibly lines 286 & 535?) and 1 warning was removed from physics/Radiation/radiation_surface.f, for a net 1 added. I'm not positive which lines, because I was just grepping counts of warnings (and now counts of warnings in specific files) and comparing with vimdiff. Those lines don't look like they're changed in ufs-community/ccpp-physics#369 though... 🤔
@gspetro-NOAA @NickSzapiro-NOAA I went ahead and cleaned up all lines longer than 72 columns in the radiation_gases.f file, so we should be good to go on the warnings/remarks front.
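For context, warning #5194 comes up in fixed-form sources like radiation_gases.f because anything past column 72 is silently ignored. A hypothetical example (not the actual radiation_gases.f code, and with made-up variable names) of the kind of cleanup involved:

```fortran
!  In fixed-form Fortran, columns beyond 72 are dropped, which is what
!  "warning #5194: Source line truncated" flags. The fix is to break
!  long statements with a continuation character in column 6 so every
!  line fits within 72 columns:
      gasvmr(i,k) = gasvmr(i,k) + seasonal_cycle(imon)
     &            + trend_rate * real(iyear - iyear0)
```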
Great! I ran a compile, and it has reduced the warnings for those tests. Thank you! We should be good to go on this PR as soon as 3207 is merged.
@grantfirl Please sync with develop and address the minor comment in the UFSATM sub-PR, and we can begin testing for this PR. 🙂
Commit Queue Requirements:
test_changes.list indicates which tests, if any, are changed by this PR. Commit test_changes.list, even if it is empty.
Description:
This is primarily just a CCPP physics update, although it contains some host-side metadata changes to work with the ccpp-physics changes. See ufs-community/ccpp-physics#369 for a description of the CCPP-physics changes.
Commit Message:
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
Documentation:
ccpp_bcast capability.
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
This does allow CCPP to use the IP library rather than the SP library, which is a prerequisite for using spack-stack 2.0+.
Testing Log: