
Bring in the answer changing (for derecho_intel) ccs_config update #3111


Draft
wants to merge 6 commits into
base: master


ekluzek
Collaborator

@ekluzek ekluzek commented May 6, 2025

Description of changes

Update ccs_config to the version where derecho_intel switches to the new intel-oneapi compiler, which changes answers for derecho_intel.

Specific notes

Contributors other than yourself, if any:

CTSM Issues Fixed (include github issue #):
Fixes #2476
Fixes #3120

Are answers expected to change (and if so in what way)? Yes, just for derecho_intel

Any User Interface Changes (namelist or namelist defaults changes)? No

Does this create a need to change or add documentation? Did you do so? No No

Testing performed, if any: Will run regular testing.
Right now I'm just working on getting derecho_gnu to work with mpi-serial.

@ekluzek ekluzek added this to the cesm3_0_beta07 milestone May 6, 2025
@ekluzek ekluzek self-assigned this May 6, 2025
@ekluzek ekluzek added the enhancement new capability or improved behavior of existing capability label May 6, 2025
@ekluzek ekluzek added non-bfb Changes answers (incl. adding tests) modernization E.g., for improving ability to perform on new computing architectures labels May 6, 2025
@ekluzek ekluzek marked this pull request as draft May 6, 2025 18:34
Collaborator Author

ekluzek commented May 6, 2025

This might not come in directly to master; it may instead come in as part of a tag with other changes. Having a PR here helps me document what I'm finding with this change. It also documents when the answer changes happen and shows that submodule changes after this point are bit-for-bit.

Updated ccs_config versions break derecho_gnu with mpi-serial. So I'm using the test:

SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50SpRs.derecho_gnu.clm-ptsRLA

to find which ccs_config versions work and which don't, so I can figure out how to fix the ccs_config problem.

For more notes on that see the ccs_config issue here: ESMCI/ccs_config_cesm#233

Right now I'm figuring out problems with ccs_config for derecho_gnu. So far, as I update ccs_config, this is what I see for derecho_gnu with mpi-serial:

| Version | Status | Notes |
| --- | --- | --- |
| 1.0.20 | PASS | ctsm5.3.041 version |
| 1.0.21 | X | |
| 1.0.22 | X | |
| 1.0.23 | X | |
| 1.0.24 | X | |
| 1.0.25 | X | |
| 1.0.26 | PASS | The gnu update is reverted later |
| 1.0.27 | PASS | |
| 1.0.28 | PASS | |
| 1.0.29 | PASS | |
| 1.0.30 | PASS | |
| 1.0.31 | X | Module build problem starts, as well as a problem in the link step |
| 1.0.31 (fix module) | X | Link problem |
| 1.0.31 (fix module + esmf-8.8.0) | X | Can't load, no gcc12.2.0 |
| 1.0.31 (fix module + rm -debug pio/esmf) | PASS | See below |
| 1.0.32 (+previous) | PASS | We need to get to here for CISM. This is the version in cesm3_0_alpha06d. But there are derecho_gnu threading problems. |
| 1.0.32 (+previous + libsci/23.02.1.1) | PASS | This seems to work |
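
The version scan above can be summarized by looking for the points where the status flips between consecutive ccs_config versions. The following is only a minimal sketch: the `results` list copies the unmodified rows of the table (the rows with extra fixes applied are left out), and `status_flips` is a hypothetical helper name, not part of CIME or ccs_config.

```python
# Status of the mpi-serial derecho_gnu test for each unmodified ccs_config
# version, copied from the table above ("X" means the test did not pass).
results = [
    ("1.0.20", "PASS"),
    ("1.0.21", "X"),
    ("1.0.22", "X"),
    ("1.0.23", "X"),
    ("1.0.24", "X"),
    ("1.0.25", "X"),
    ("1.0.26", "PASS"),
    ("1.0.27", "PASS"),
    ("1.0.28", "PASS"),
    ("1.0.29", "PASS"),
    ("1.0.30", "PASS"),
    ("1.0.31", "X"),
]


def status_flips(rows):
    """Return (previous_version, version, new_status) for each status change."""
    flips = []
    for (prev_ver, prev_status), (ver, status) in zip(rows, rows[1:]):
        if status != prev_status:
            flips.append((prev_ver, ver, status))
    return flips


flips = status_flips(results)
for prev_ver, ver, status in flips:
    print(f"{prev_ver} -> {ver}: now {status}")
```

Each flip marks a pair of adjacent versions whose ccs_config differences are worth inspecting first.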

Collaborator Author

ekluzek commented May 7, 2025

Latest testing shows:

  • Izumi testing passed as expected -- without even any babysitting!
  • Some issues with some derecho_gnu DEBUG tests

Here are the tests that failed:

ERP_D_Ld3_P64x2.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default	(NLCOMP RUN)		
ERP_D_P64x2_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-default	(NLCOMP RUN)		
ERP_D_P64x2_Ld3.f10_f10_mg37.I1850Clm60BgcCrop.derecho_gnu.clm-mimics	(NLCOMP RUN)		
ERP_D_P64x2_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default	(NLCOMP RUN)		
ERP_D_P64x2_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-snowveg_norad	(NLCOMP RUN)		
ERP_P64x2_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-extra_outputs	(NLCOMP RUN)		
ERP_P64x2_D_Ld5.f10_f10_mg37.I2000Clm50Sp.derecho_gnu.clm-default	(NLCOMP RUN)		
ERP_P64x2_Ld396.f10_f10_mg37.IHistClm60Bgc.derecho_gnu.clm-monthly	(NLCOMP RUN)

All but the 2nd, 5th, and 6th tests in the list above failed because they ran out of wallclock time.
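
When sorting through a list like this, it can help to split each test name into its parts. The sketch below assumes the usual CIME layout `TESTTYPE[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER[.TESTMODS]`, with a `D` modifier marking a DEBUG build; `ParsedTest` and `parse_test_name` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class ParsedTest(NamedTuple):
    test_type: str
    modifiers: Tuple[str, ...]
    grid: str
    compset: str
    machine: str
    compiler: str
    testmods: Optional[str]

    @property
    def is_debug(self) -> bool:
        # Assumption: a bare "D" modifier means a DEBUG build.
        return "D" in self.modifiers


def parse_test_name(name: str) -> ParsedTest:
    """Split a test name into its dot- and underscore-separated fields."""
    parts = name.split(".", 4)
    if len(parts) < 4:
        raise ValueError(f"not enough fields in test name: {name!r}")
    type_and_mods, grid, compset, mach_comp = parts[:4]
    testmods = parts[4] if len(parts) > 4 else None
    test_type, *mods = type_and_mods.split("_")
    # The compiler is taken to be everything after the last underscore.
    machine, compiler = mach_comp.rsplit("_", 1)
    return ParsedTest(test_type, tuple(mods), grid, compset,
                      machine, compiler, testmods)


parsed = parse_test_name(
    "ERP_D_Ld3_P64x2.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default"
)
print(parsed.compiler, parsed.is_debug, parsed.testmods)
```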

Collaborator Author

ekluzek commented May 7, 2025

The runtimes for the ctsm5.3.041 tests are much less than the wallclock limits, so these tests really shouldn't have run out of wallclock time.

Jim suggested that I only need to remove -debug from pio, not from esmf, so I'll try that. I could also limit the removal from pio to mpi-serial.

PASS ERP_D_Ld3_P64x2.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default RUN time=112
PASS ERP_D_P64x2_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-default RUN time=105
PASS ERP_D_P64x2_Ld3.f10_f10_mg37.I1850Clm60BgcCrop.derecho_gnu.clm-mimics RUN time=161
PASS ERP_D_P64x2_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default RUN time=111
PASS ERP_D_P64x2_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-snowveg_norad RUN time=121
PASS ERP_P64x2_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-extra_outputs RUN time=106
PASS ERP_P64x2_D_Ld5.f10_f10_mg37.I2000Clm50Sp.derecho_gnu.clm-default RUN time=112
PASS ERP_P64x2_Ld396.f10_f10_mg37.IHistClm60Bgc.derecho_gnu.clm-monthly RUN time=947
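
The RUN lines above can be checked against a wallclock limit with a few lines of Python. This is a rough sketch only: it assumes the `time=` value is in seconds, and the 1200-second limit is a made-up placeholder, since the actual limits are not given here. `parse_run_times` and `over_fraction` are hypothetical helper names.

```python
import re

# Matches lines of the form "PASS <test name> RUN time=<integer>".
RUN_LINE = re.compile(r"^PASS\s+(\S+)\s+RUN\s+time=(\d+)$")


def parse_run_times(lines):
    """Map test name -> run time for every matching RUN line."""
    times = {}
    for line in lines:
        match = RUN_LINE.match(line.strip())
        if match:
            times[match.group(1)] = int(match.group(2))
    return times


def over_fraction(times, limit_seconds, fraction=0.5):
    """Tests whose run time exceeds the given fraction of the limit."""
    return sorted(name for name, secs in times.items()
                  if secs > fraction * limit_seconds)


sample = [
    "PASS ERP_D_Ld3_P64x2.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default RUN time=112",
    "PASS ERP_P64x2_Ld396.f10_f10_mg37.IHistClm60Bgc.derecho_gnu.clm-monthly RUN time=947",
]
times = parse_run_times(sample)
# Hypothetical limit of 1200 s; replace with the real wallclock setting.
flagged = over_fraction(times, limit_seconds=1200)
print(flagged)
```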

Collaborator Author

ekluzek commented May 7, 2025

The tests that fail early all die with an error in the urban building temperature code (UrbBuildTempOleson2015Mod.F90), where endrun is called. There is also a segfault inside the error handling itself (in shr_log_error).

For example, for the last of these (the 6th test in the list above): ERP_P64x2_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-extra_outputs

cesm.log:

dec0089.hsn.de.hpc.ucar.edu 55: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
dec0089.hsn.de.hpc.ucar.edu 55: 
dec0089.hsn.de.hpc.ucar.edu 55: Backtrace for this error:
dec0089.hsn.de.hpc.ucar.edu 55: #0  0x14b3bb5a9d4f in ???
dec0089.hsn.de.hpc.ucar.edu 55: 	at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
dec0089.hsn.de.hpc.ucar.edu 55: #1  0x14b3bb6dfd13 in ???
dec0089.hsn.de.hpc.ucar.edu 55: 	at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:247
dec0089.hsn.de.hpc.ucar.edu 55: #2  0x14b3c956be6b in ???
dec0089.hsn.de.hpc.ucar.edu 55: #3  0x14b3c94f389a in ???
dec0089.hsn.de.hpc.ucar.edu 55: #4  0x14b3c94f56e2 in ???
dec0089.hsn.de.hpc.ucar.edu 55: #5  0x20c21b2 in __shr_log_mod_MOD_shr_log_error
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/share/src/shr_log_mod.F90:148
dec0089.hsn.de.hpc.ucar.edu 55: #6  0x1fbf782 in __shr_abort_mod_MOD_shr_abort_abort
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/share/src/shr_abort_mod.F90:63
dec0089.hsn.de.hpc.ucar.edu 55: #7  0x6c134e in __abortutils_MOD_endrun_write_point_context
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/src/main/abortutils.F90:98
dec0089.hsn.de.hpc.ucar.edu 55: #8  0x189ef2c in __urbbuildtempoleson2015mod_MOD_buildingtemperature
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/src/biogeophys/UrbBuildTempOleson2015Mod.F90:907
dec0089.hsn.de.hpc.ucar.edu 55: #9  0x16d3702 in __soiltemperaturemod_MOD_soiltemperature
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/src/biogeophys/SoilTemperatureMod.F90:544
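
A backtrace like the one above is easier to read with the host/rank prefix stripped and each frame paired with its source location. The sketch below assumes the line layout shown in this log (`<host> <rank>: #N 0xADDR in SYMBOL`, followed by an optional `at FILE:LINE` line) and the gfortran `__<module>_MOD_<procedure>` symbol naming; `parse_backtrace` and `demangle` are hypothetical helper names.

```python
import re

PREFIX = re.compile(r"^\S+\s+\d+:\s?")               # "<host> <rank>: "
FRAME = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+)")
LOCATION = re.compile(r"^\s*at (\S+):(\d+)$")


def demangle(symbol):
    """Turn '__mod_MOD_proc' into 'mod::proc'; leave other symbols alone."""
    if symbol.startswith("__") and "_MOD_" in symbol:
        module, proc = symbol[2:].split("_MOD_", 1)
        return f"{module}::{proc}"
    return symbol


def parse_backtrace(lines):
    """Return one dict per frame, with file/line filled in when present."""
    frames = []
    for raw in lines:
        text = PREFIX.sub("", raw.rstrip("\n"), count=1)
        match = FRAME.match(text)
        if match:
            frames.append({"index": int(match.group(1)),
                           "symbol": demangle(match.group(2)),
                           "file": None, "line": None})
            continue
        match = LOCATION.match(text)
        if match and frames:
            frames[-1]["file"] = match.group(1)
            frames[-1]["line"] = int(match.group(2))
    return frames


sample = [
    "dec0089.hsn.de.hpc.ucar.edu 55: #4  0x14b3c94f56e2 in ???",
    "dec0089.hsn.de.hpc.ucar.edu 55: #5  0x20c21b2 in __shr_log_mod_MOD_shr_log_error",
    "dec0089.hsn.de.hpc.ucar.edu 55: \tat /glade/work/erik/ctsm_worktrees/external_updates/share/src/shr_log_mod.F90:148",
]
frames = parse_backtrace(sample)
for frame in frames:
    print(frame["index"], frame["symbol"], frame["file"], frame["line"])
```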

Collaborator Author

ekluzek commented May 9, 2025

OK, I seem to have something working now for the list of failed tests, so I'm running aux_clm again.

@ekluzek ekluzek moved this from Todo to Done in LMWG: Sprint Planning Board May 9, 2025
Collaborator Author

ekluzek commented May 9, 2025

OK, testing looks as expected. All tests are b4b on Izumi, and the gnu tests are b4b on Derecho. None of the Intel tests are b4b, except the FUNIT and PFS tests, which have no history files to compare:

FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel.GC.ctsm5341ccschangeacl_int/TestStatus:PASS FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.derecho_intel BASELINE
PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.derecho_intel.GC.ctsm5341ccschangeacl_int/TestStatus:PASS PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.derecho_intel BASELINE ctsm5.3.041:

The nvhpc tests on derecho also show differences:

SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop
SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen

This is likely expected, because the ccs_config update made several changes to the nvhpc environment, including the compiler version:

diff /glade/derecho/scratch/samrabin/tests_0424-161819de/SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop.GC.0424-161819de_nvh/.env_mach_specific.sh .
6c6
< module load cesmdev/1.0 ncarenv/23.09
---
> module load cesmdev/1.0 ncarenv/24.12
8c8
< module load conda/latest nco craype nvhpc/24.3 ncarcompilers/1.0.0 cmake cray-mpich/8.1.27 netcdf-mpi/4.9.2 parallel-netcdf/1.12.3 parallelio/2.6.2 esmf/8.6.0
---
> module load conda/latest nco craype cmake nvhpc/24.11 ncarcompilers/1.0.0 cray-mpich/8.1.29 netcdf-mpi/4.9.2 parallel-netcdf/1.14.0 parallelio/2.6.4 esmf/8.8.0
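
The version changes in the diff above can be pulled out by comparing the two `module load` lines name by name. This is a small sketch, assuming each token is either `name` or `name/version`; `parse_modules` and `module_changes` are hypothetical helper names.

```python
def parse_modules(line):
    """Map module name -> version (or None) for one 'module load ...' line."""
    tokens = line.lstrip("<> ").split()   # drop the diff marker, if any
    if tokens[:2] != ["module", "load"]:
        raise ValueError(f"not a 'module load' line: {line!r}")
    modules = {}
    for token in tokens[2:]:
        name, _, version = token.partition("/")
        modules[name] = version or None
    return modules


def module_changes(old_line, new_line):
    """Return {name: (old_version, new_version)} for modules that differ."""
    old, new = parse_modules(old_line), parse_modules(new_line)
    return {name: (old.get(name), new.get(name))
            for name in sorted(old.keys() | new.keys())
            if old.get(name) != new.get(name)}


old_line = ("< module load conda/latest nco craype nvhpc/24.3 ncarcompilers/1.0.0 "
            "cmake cray-mpich/8.1.27 netcdf-mpi/4.9.2 parallel-netcdf/1.12.3 "
            "parallelio/2.6.2 esmf/8.6.0")
new_line = ("> module load conda/latest nco craype cmake nvhpc/24.11 "
            "ncarcompilers/1.0.0 cray-mpich/8.1.29 netcdf-mpi/4.9.2 "
            "parallel-netcdf/1.14.0 parallelio/2.6.4 esmf/8.8.0")

changes = module_changes(old_line, new_line)
for name, (old_version, new_version) in changes.items():
    print(f"{name}: {old_version} -> {new_version}")
```

Because the comparison is by module name, the reordering of `cmake` between the two lines is not reported as a change.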

Looking at the differences for some Intel cases: for shorter cases they appear to be near roundoff level, but they show up in at least 100 history variables. For longer cases, for example the 5-year case:

ERS_Ly5_P128x1.f10_f10_mg37.IHistClm45BgcCrop.derecho_intel.clm-cropMonthOutput

the differences are large and cover most of the history file.

This is all as would be expected for this change.
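
One rough way to separate "near roundoff level" from "large" differences is to compare the largest relative difference in a field against a small multiple of machine epsilon. This is an illustrative sketch only, not the comparison tool used for these tests: the factor of 1000 is an arbitrary assumption, and `max_relative_difference` and `classify` are hypothetical names operating on plain lists of floats.

```python
import sys


def max_relative_difference(baseline, candidate):
    """Largest |a - b| / max(|a|, |b|) over paired values."""
    if len(baseline) != len(candidate):
        raise ValueError("fields must have the same length")
    worst = 0.0
    for a, b in zip(baseline, candidate):
        scale = max(abs(a), abs(b))
        if scale > 0.0:                      # skip pairs that are both zero
            worst = max(worst, abs(a - b) / scale)
    return worst


def classify(baseline, candidate, roundoff_factor=1000.0):
    """Label a field difference as identical, roundoff-level, or large."""
    rel = max_relative_difference(baseline, candidate)
    if rel == 0.0:
        return "identical"
    # Assumed threshold: a modest multiple of double-precision epsilon.
    if rel <= roundoff_factor * sys.float_info.epsilon:
        return "roundoff-level"
    return "large"


print(classify([1.0, 2.0], [1.0, 2.0]))
print(classify([1.0], [1.0 + 1e-15]))
print(classify([1.0], [1.1]))
```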

Labels
enhancement new capability or improved behavior of existing capability modernization E.g., for improving ability to perform on new computing architectures non-bfb Changes answers (incl. adding tests)
Projects
Status: Done
1 participant