
Endrun error that happens with update to cray-libsci/23.09.1.1 library in UrbanBuilding for threaded DEBUG on GNU compiler on Derecho #3120


Open
ekluzek opened this issue May 8, 2025 · 2 comments · May be fixed by #3111
Labels:
bug: something is working incorrectly
investigation: Needs to be verified and more investigation into what's going on.
science: Enhancement to or bug impacting science

Comments

Collaborator

ekluzek commented May 8, 2025

I have tests that die with an error in the urban building temperature code (UrbBuildTempOleson2015Mod.F90). These tests pass in ctsm5.3.041 but fail with the updated cray-libsci/23.09.1.1 library.

Tests that fail are:

ERP_D_P64x2_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-default
ERP_D_P64x2_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-snowveg_norad
ERP_P64x2_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-extra_outputs

For example, for the last one, ERP_P64x2_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_gnu.clm-extra_outputs:

cesm.log:

dec0089.hsn.de.hpc.ucar.edu 55: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
dec0089.hsn.de.hpc.ucar.edu 55: 
dec0089.hsn.de.hpc.ucar.edu 55: Backtrace for this error:
dec0089.hsn.de.hpc.ucar.edu 55: #0  0x14b3bb5a9d4f in ???
dec0089.hsn.de.hpc.ucar.edu 55: 	at /usr/src/debug/glibc-2.31-150300.41.1.x86_64/signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
dec0089.hsn.de.hpc.ucar.edu 55: #1  0x14b3bb6dfd13 in ???
dec0089.hsn.de.hpc.ucar.edu 55: 	at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:247
dec0089.hsn.de.hpc.ucar.edu 55: #2  0x14b3c956be6b in ???
dec0089.hsn.de.hpc.ucar.edu 55: #3  0x14b3c94f389a in ???
dec0089.hsn.de.hpc.ucar.edu 55: #4  0x14b3c94f56e2 in ???
dec0089.hsn.de.hpc.ucar.edu 55: #5  0x20c21b2 in __shr_log_mod_MOD_shr_log_error
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/share/src/shr_log_mod.F90:148
dec0089.hsn.de.hpc.ucar.edu 55: #6  0x1fbf782 in __shr_abort_mod_MOD_shr_abort_abort
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/share/src/shr_abort_mod.F90:63
dec0089.hsn.de.hpc.ucar.edu 55: #7  0x6c134e in __abortutils_MOD_endrun_write_point_context
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/src/main/abortutils.F90:98
dec0089.hsn.de.hpc.ucar.edu 55: #8  0x189ef2c in __urbbuildtempoleson2015mod_MOD_buildingtemperature
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/src/biogeophys/UrbBuildTempOleson2015Mod.F90:907
dec0089.hsn.de.hpc.ucar.edu 55: #9  0x16d3702 in __soiltemperaturemod_MOD_soiltemperature
dec0089.hsn.de.hpc.ucar.edu 55: 	at /glade/work/erik/ctsm_worktrees/external_updates/src/biogeophys/SoilTemperatureMod.F90:544
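
When reading a backtrace like the one above, it can help to pull out only the frames that carry a Fortran module and source location. The following is a minimal sketch, assuming the `hostname rank: #N 0xADDR in SYMBOL` layout seen in this log and gfortran's `__module_MOD_procedure` symbol naming. The helper names `parse_backtrace` and `demangle` are hypothetical and not part of CTSM or CIME.

```python
import re

# Three lines copied from the cesm.log excerpt above. The regexes are an
# assumption based on this excerpt, not a documented log format.
LOG = """\
dec0089.hsn.de.hpc.ucar.edu 55: #4  0x14b3c94f56e2 in ???
dec0089.hsn.de.hpc.ucar.edu 55: #8  0x189ef2c in __urbbuildtempoleson2015mod_MOD_buildingtemperature
dec0089.hsn.de.hpc.ucar.edu 55: \tat /glade/work/erik/ctsm_worktrees/external_updates/src/biogeophys/UrbBuildTempOleson2015Mod.F90:907
"""

FRAME = re.compile(r"#(\d+)\s+0x[0-9a-f]+ in (\S+)")
LOCATION = re.compile(r"\bat (\S+):(\d+)\s*$")
MANGLED = re.compile(r"__(\w+?)_MOD_(\w+)")


def parse_backtrace(text):
    """Return one dict per frame, attaching the 'at file:line' line if present."""
    frames = []
    for line in text.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append({
                "index": int(match.group(1)),
                "symbol": match.group(2),
                "file": None,
                "line": None,
            })
            continue
        match = LOCATION.search(line)
        if match and frames:
            # A location line belongs to the frame printed just before it.
            frames[-1]["file"] = match.group(1)
            frames[-1]["line"] = int(match.group(2))
    return frames


def demangle(symbol):
    """Split a gfortran-style symbol into (module, procedure), or None."""
    match = MANGLED.fullmatch(symbol)
    return (match.group(1), match.group(2)) if match else None


for frame in parse_backtrace(LOG):
    names = demangle(frame["symbol"])
    if names:
        print(frame["index"], names, frame["file"], frame["line"])
```

Applied to the full log, this would list frames #5 through #9, which show the abort being raised from BuildingTemperature and the segmentation fault occurring afterwards, inside the logging call made during the abort.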

Originally posted by @ekluzek in #3111 (comment)

ekluzek self-assigned this May 8, 2025
ekluzek added the investigation, bug, and science labels May 8, 2025
ekluzek added this to the cesm3_0_beta07 milestone May 8, 2025
Collaborator Author

ekluzek commented May 8, 2025

I don't know if this is related, but these tests also hang.

ERP_D_Ld3_P64x2.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default (NLCOMP RUN)
ERP_D_P64x2_Ld3.f10_f10_mg37.I1850Clm60BgcCrop.derecho_gnu.clm-mimics (NLCOMP RUN)
ERP_D_P64x2_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default (NLCOMP RUN)
ERP_P64x2_D_Ld5.f10_f10_mg37.I2000Clm50Sp.derecho_gnu.clm-default (NLCOMP RUN)
ERP_P64x2_Ld396.f10_f10_mg37.IHistClm60Bgc.derecho_gnu.clm-monthly (NLCOMP RUN)
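
To compare the failing and hanging tests, the names can be split into their parts. The sketch below assumes the usual CIME layout `TESTTYPE[_MODIFIER...].GRID.COMPSET.MACHINE_COMPILER.TESTMODS`, and the reading of the modifiers in the comments (`D` for debug, `P64x2` for 64 tasks with 2 threads each, `Ld3` for a 3-day run) is likewise an assumption. `parse_test_name` is a hypothetical helper, not a CIME function.

```python
def parse_test_name(entry):
    """Split a test name into its parts, ignoring a trailing '(NLCOMP RUN)'."""
    name = entry.split()[0]
    case, grid, compset, machine_compiler, testmods = name.split(".")
    test_type, *modifiers = case.split("_")
    machine, compiler = machine_compiler.split("_", 1)
    return {
        "test_type": test_type,          # e.g. ERP
        "modifiers": sorted(modifiers),  # e.g. D (debug), Ld3, P64x2
        "grid": grid,
        "compset": compset,
        "machine": machine,
        "compiler": compiler,
        "testmods": testmods,
    }


first = parse_test_name(
    "ERP_D_Ld3_P64x2.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default (NLCOMP RUN)"
)
third = parse_test_name(
    "ERP_D_P64x2_Ld3.f10_f10_mg37.I2000Clm50BgcCru.derecho_gnu.clm-default (NLCOMP RUN)"
)

# With the modifiers sorted, the first and third entries in the list parse identically.
print(first == third)
```

Under these assumptions, the first and third entries differ only in the order of their modifiers. Whether the test system treats them as the same test is not stated in this thread.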

Collaborator Author

ekluzek commented May 9, 2025

OK, running with the older version, cray-libsci/23.02.0, the top case works. The release notes for the next libsci version mention a fix for a hang, although it is in a LAPACK subroutine (cpotrf) that we don't seem to use in CTSM.

See https://github.com/PE-Cray/cpe-changelog/blob/main/ex/cpe-23.12-sles15-sp5-ReleaseNotes.txt

which has:

Cray LibSci 23.12.5
=====================

  Release Date:
  -------------
    December 2023


  Purpose:
  --------
    Cray LibSci 23.12.5 provides scientific libraries for Cray
    HPC systems. Cray LibSci is supported on the host CPU but
    not on the accelerator of these systems.

    The Cray LibSci 23.12.5 release provides the following:
        * support for RHEL gcc-toolset-10
        * **fix for hang in Lapack cpotrf**

Projects
Status: Done