Cray specific detection and enhanced netcdf pkg-config support #2704

Merged
asalmgren merged 45 commits into erf-model:development from jmsexton03:add_craype_defaults_cmake
Nov 14, 2025

@jmsexton03 (Collaborator) commented Nov 10, 2025

Summary

Comprehensive build system improvements for Cray/HPC systems with automatic detection and configuration. Adds 7 auto-fixes for common Cray build issues (CUDA+EKAT, fcompare, GPU-aware MPI, NetCDF). Includes developer utilities for organized testing, enhanced diagnostics with CMake 3.25+ log levels, and machine-specific profiles for Perlmutter, Frontier, Polaris, and Aurora. Detailed documentation is being developed in #2708 and will include a simple table of working builds.


Details

Cray/HPC System Support

  • Auto-detects Cray environments and applies 7 build fixes (CUDA flags, MPI GTL libraries, NetCDF paths, HDF5 configuration)
  • Prevents MPI detection from hanging when bare compiler wrappers are used; configures MPI manually for Cray MPICH 8.x
  • Machine profiles with standardized module loading (Perlmutter, Frontier, Polaris, Aurora)
  • Auto-detects GPU architectures from CRAY_ACCEL_TARGET and environment variables
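
As a rough illustration of the last bullet, a shell helper could map `CRAY_ACCEL_TARGET` to a backend and architecture. This is a minimal sketch and not the PR's CMake logic: the function name `detect_gpu_arch` is hypothetical, and the assumed value formats (`nvidia80`, `amd_gfx90a`) follow the usual `craype-accel-*` module naming and should be checked on the target system.

```shell
# Hypothetical helper: translate a CRAY_ACCEL_TARGET value into "<backend> <arch>".
# Reads the first argument if given, otherwise the CRAY_ACCEL_TARGET variable.
detect_gpu_arch() {
    local target="${1:-${CRAY_ACCEL_TARGET:-}}"
    case "$target" in
        nvidia[0-9]*) echo "cuda ${target#nvidia}" ;;  # assumed form: nvidia80 -> cuda 80
        amd_gfx*)     echo "hip ${target#amd_}" ;;     # assumed form: amd_gfx90a -> hip gfx90a
        "")           echo "none" ;;                   # no accelerator module loaded
        *)            echo "unknown ${target}" ;;      # leave unrecognized values to the user
    esac
}
```

Calling `detect_gpu_arch` with no argument falls back to the environment variable, so it can be dropped into a configure script before the `cmake` call.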

Build System Enhancements

  • Added distclean target (removes CMake cache/artifacts) and uninstall target with manifest tracking
  • Expanded .gitignore for build artifacts (build_/, install_/, CMakeCache.txt, generated files)
  • Wrapper scripts for clean+build workflows (interactive and CI/auto modes)
  • Robust ERF_DIR auto-detection with multi-method fallback and verification
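
The multi-method fallback for `ERF_DIR` can be pictured roughly as follows. This is a sketch under stated assumptions, not the script shipped in the PR: the function names `is_erf_root` and `find_erf_dir` are hypothetical, and the verification step assumes an ERF checkout has a top-level `CMakeLists.txt` and a `Source/` directory.

```shell
# Hypothetical verification: an ERF checkout is assumed to contain a top-level
# CMakeLists.txt and a Source/ directory.
is_erf_root() {
    [ -f "$1/CMakeLists.txt" ] && [ -d "$1/Source" ]
}

# Print the ERF root directory, trying several methods in order.
find_erf_dir() {
    local start candidate
    start="$(cd "${1:-.}" && pwd)" || return 1

    # Method 1: a user-provided ERF_DIR wins if it passes verification.
    if [ -n "${ERF_DIR:-}" ] && is_erf_root "$ERF_DIR"; then
        echo "$ERF_DIR"
        return 0
    fi

    # Method 2: ask git for the top of the working tree.
    candidate="$(git -C "$start" rev-parse --show-toplevel 2>/dev/null)" || candidate=""
    if [ -n "$candidate" ] && is_erf_root "$candidate"; then
        echo "$candidate"
        return 0
    fi

    # Method 3: walk up from the starting directory until a match is found.
    candidate="$start"
    while [ "$candidate" != "/" ]; do
        if is_erf_root "$candidate"; then
            echo "$candidate"
            return 0
        fi
        candidate="$(dirname "$candidate")"
    done
    return 1
}
```

Checking the user-provided value first keeps the detection from overriding an explicit choice, and every candidate goes through the same verification before it is accepted.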

NetCDF & Dependency Detection

  • Cascading pkg-config fallback for NetCDF variants (netcdf → netcdf-cxx4_parallel → netcdf_parallel)
  • GMake adds MPICH_DIR to PKG_CONFIG_PATH; NOAHMP tries netcdf-fortran → netcdf-fortran_parallel
  • Enhanced FindNetCDF with detection logging and helpful error messages
  • Auto-suggests module load commands on detection failures
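
The cascading lookup in the first bullet amounts to trying each pkg-config module name in turn. A minimal shell sketch of that idea, assuming only `pkg-config --exists` (the function name `find_netcdf_pkg` and the wording of the hint are hypothetical, and the real logic lives in the CMake and GMake files):

```shell
# Try each pkg-config module name in order and print the first one that exists.
# PKG_CONFIG may be overridden (for example, in tests); the default is pkg-config.
find_netcdf_pkg() {
    local pc="${PKG_CONFIG:-pkg-config}" name
    for name in netcdf netcdf-cxx4_parallel netcdf_parallel; do
        if "$pc" --exists "$name" 2>/dev/null; then
            echo "$name"
            return 0
        fi
    done
    # Mirror the "helpful error" idea: suggest a next step on stderr.
    echo "NetCDF not found via pkg-config; try loading a NetCDF module" \
         "and check PKG_CONFIG_PATH" >&2
    return 1
}
```

A caller could then use the result directly, for example `pkg="$(find_netcdf_pkg)" && pkg-config --cflags --libs "$pkg"`.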

Enhanced Diagnostics

  • CMake 3.25+ log levels (--log-level=VERBOSE/DEBUG/TRACE) with hierarchical message context
  • Generates cray_detected_config.cmake reference file showing auto-detected settings
  • Detection attempt logging in FindNetCDF and MPI configuration
  • Helpful error messages with resolution steps and auto-suggested machine profiles
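
To see the extra diagnostics, the configure step is run with a log level and, optionally, `--log-context`. The helper below is only a sketch that assembles such a command line: the name `erf_configure_cmd` is hypothetical, the `build_erf` directory and `..` source path are borrowed from the example script in this PR, and the accepted levels are the values CMake documents for `--log-level`.

```shell
# Print a cmake configure command with the requested log level.
# The accepted levels are the values CMake documents for --log-level.
erf_configure_cmd() {
    local level="${1:-STATUS}"
    case "$level" in
        ERROR|WARNING|NOTICE|STATUS|VERBOSE|DEBUG|TRACE) ;;
        *) echo "unsupported log level: $level" >&2; return 1 ;;
    esac
    # --log-context prefixes each message with its CMAKE_MESSAGE_CONTEXT.
    echo "cmake --log-level=$level --log-context -B build_erf .."
}
```

Running the printed command from the `Build` directory, with any `-D` options added, shows the detection messages at the chosen verbosity.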

AMLattanzi and others added 30 commits November 6, 2025 19:24
…del#2702)

* Apply FillBoundary for momenta in slow_rhs_post.

* check on domain boundaries not box boundaries

* replace bx lo/hi by domain lo/hi

* revert incorrect change in last commit

* fix test on small cells

* Added wrappers for non-const EB factory members.

* Use cell-centered grids for eb_aux_ area fraction and face centroids.

* Corrected rayleigh damping thickness in inputs_FittedMesh.

* Remove FB from slow_rhs_post, since we do this in erf_slow_rhs_pre.

---------

Co-authored-by: Ann Almgren <[email protected]>
@jmsexton03 jmsexton03 mentioned this pull request Nov 12, 2025
@jmsexton03 (Collaborator, Author) commented

Questions

  1. Should this auto-detection be on by default? It tries not to override any user-provided variables, and a CMake flag has been added to turn it on or off.
  2. Any feedback on the completeness of the flags below from a kitchen-sink / physics / IO perspective?
  3. After I've rerun all the tests with the current code and updated the table with the succeeding hashes (in this comment and in Docs PR #2708), can someone test something similar to the following on Kestrel:
source Build/machines/perlmutter_erf.profile
./cmake.sh
make distclean
./cmake_cuda.sh
make distclean
./cmake_with_kokkos_many_cuda.sh
rm -rf build_erf

Goal

The goal of this PR is to let a script like the following work across systems, where all you have to change is the physics and IO flags of interest and the specific GPU backend you're requesting (as in https://github.com/jmsexton03/ERF/blob/add_craype_defaults_cmake/Build/cmake.sh or https://github.com/jmsexton03/ERF/blob/add_craype_defaults_cmake/Build/cmake_with_kokkos_many.sh):

#!/bin/bash

#Example cmake configuration script that assumes cray detection

cmake -DCMAKE_INSTALL_PREFIX:PATH=./install_erf \
      -DMPIEXEC_PREFLAGS:STRING=--oversubscribe \
      -DCMAKE_BUILD_TYPE:STRING=Release \
      -DERF_DIM:STRING=3 \
      -DERF_ENABLE_FFT:BOOL=ON \
      -DERF_ENABLE_NETCDF:BOOL=ON \
      -DERF_ENABLE_HDF5:BOOL=ON \
      -DERF_ENABLE_RRTMGP:BOOL=ON \
      -DERF_ENABLE_SHOC:BOOL=OFF \
      -DERF_ENABLE_MPI:BOOL=ON \
      -DERF_ENABLE_CUDA:BOOL=OFF \
      -DERF_ENABLE_HIP:BOOL=OFF \
      -DERF_ENABLE_SYCL:BOOL=OFF \
      -DERF_ENABLE_TESTS:BOOL=ON \
      -DERF_ENABLE_FCOMPARE:BOOL=ON \
      -DERF_ENABLE_DOCUMENTATION:BOOL=OFF \
      -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=ON \
      -B build_erf ..

cmake --build build_erf -j10 -v
cmake --install build_erf --prefix=install_erf

Since SHOC and P3 each require an additional setup step, I'm aiming to test those separately.
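
Since only the backend flags differ between the CPU and GPU variants of the configuration script above, the switch can be sketched as a small helper. The function name `backend_flags` is hypothetical. The three `-DERF_ENABLE_*` options are the ones used in the script, and the sketch assumes at most one of them should be ON at a time.

```shell
# Print the three backend flags with at most one set to ON.
# Accepts: cpu (default), cuda, hip, sycl.
backend_flags() {
    local cuda=OFF hip=OFF sycl=OFF
    case "${1:-cpu}" in
        cpu)  ;;
        cuda) cuda=ON ;;
        hip)  hip=ON ;;
        sycl) sycl=ON ;;
        *) echo "unknown backend: $1" >&2; return 1 ;;
    esac
    echo "-DERF_ENABLE_CUDA:BOOL=$cuda -DERF_ENABLE_HIP:BOOL=$hip -DERF_ENABLE_SYCL:BOOL=$sycl"
}
```

In a configure script the output would be expanded unquoted so that it splits into separate arguments, for example `cmake $(backend_flags hip) -B build_erf ..` alongside the other options.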

Docs table

ERF provides several build scripts optimized for different systems and architectures. This table shows which scripts have been tested and verified on each system. Verified builds are marked with the git commit hash where they were last tested.

Build Script Perlmutter Frontier Aurora Polaris Kestrel RegtestCPU RegtestGPU
cmake.sh Untested Untested Untested Untested Untested Untested Untested
cmake_with_kokkos_many.sh Untested Untested Untested Untested Untested Untested Untested
cmake_with_kokkos_many_cuda.sh Untested Untested Untested Untested
cmake_with_kokkos_many_noradiation_hip.sh Untested
cmake_with_kokkos_many_sycl.sh Untested
Perlmutter/build_erf_with_shoc_cuda_Perlmutter.sh Untested Untested Untested Untested Untested Untested

Note: The build_erf_with_shoc_cuda_Perlmutter.sh script is being tested for cross-site compatibility with auto-detection enabled. A simplified version may work across CUDA-enabled HPC sites (Perlmutter, Polaris, Kestrel) with -DCRAY_AUTO_DETECTION=ON.

larenspear and others added 5 commits November 12, 2025 14:42
* NetCDF/RRTGMP/Particles CI

* Remove commented out Spack commands

Removed commented-out lines for spack view commands.

* windows with ms-mpi attempt 1

* windows with ms-mpi attempt 1

* windows with ms-mpi attempt 2

* add option for MPI in cmake

* fix test path

* fix indentation

* fix paths

* Ctest with powershell?

* Ctest with powershell 2

* Simplifying installing MS-MPI + more

* Change shell and MPI executable path

* MPI wrapper script

* More MPI test changes (hopeful)

* MS-MPI test

* MS-MPI from correct source

* MS-MPI from correct source 2

* MSI vs EXE

* Exe nonewwindow

* Remove exe

* Remove exe 2

* Compile only, not run

* Binary artifact

* Binary artifact 2

* Whole build directory

Updated artifact upload paths to include the entire build directory.

* Modify Windows MPI workflow and enhance README

Updated job configuration for Windows MPI workflow to support both OFF and ON variants for MPI. Added detailed installation instructions and troubleshooting steps in the README.

* Style fixes

* Remove non-MPI builds

* Put regular windows.yml back to prior state

---------

Co-authored-by: Aaron M. Lattanzi <[email protected]>
@jmsexton03 (Collaborator, Author) commented

Build Script Perlmutter Frontier Aurora Polaris Kestrel RegtestCPU RegtestGPU
cmake.sh f8665c2 (ABL) f8665c2 (ABL) f8665c2 f8665c2 (ABL) Untested f8665c2 (ABL) f8665c2 (ABL)
cmake_with_kokkos_many.sh Untested Untested Untested Untested Untested f8665c2 (ABL) Untested
cmake_with_kokkos_many_cuda.sh f8665c2 (ABL) Untested Untested f8665c2 (ABL)
cmake_with_kokkos_many_noradiation_hip.sh f8665c2 (ABL bad fcompare result)
cmake_with_kokkos_many_hip.sh f8665c2 + rrtmpg_workarounds_kokkos
cmake_with_kokkos_many_sycl.sh f8665c2 + rrtmpg_workarounds_kokkos
Perlmutter/build_erf_with_shoc_cuda_Perlmutter.sh 3c7d1d0 + use_shoc_fix (ABL) Untested
build_erf_with_shoc.sh 3c7d1d0 + use_shoc_fix Untested

@jmsexton03 jmsexton03 marked this pull request as ready for review November 14, 2025 20:15
@asalmgren asalmgren merged commit 8146cc4 into erf-model:development Nov 14, 2025
23 of 25 checks passed
wiersema1 pushed a commit to wiersema1/ERF that referenced this pull request Nov 17, 2025
…odel#2704)

* Move scripts for PM into own directory and make some examples for GPU and CPU.

* Add netcdf-cxx4_parallel and other fallbacks and a more descriptive error message

* Add similar noahmp netcdf-fortran logic

* Minor path fixing and additional validation script

* Automatically make ./wrapper_clean_build.sh ./cmake_cuda.sh work on perlmutter

* Remove module versions, tweak minimum and cuda

* Add more cuda kokkos detection

* Haven't tested compiler detection workaround

* Update compiler detection

* Update printopts

* Add mpicxx hang workaround

* Make distclean broader

* Add distclean target

* Add make uninstall for cmake

* Update distclean and uninstall

* Add ERF_DIR detection to shoc script

* Add machines

* Add script with no flags

* Documentation updates (erf-model#2703)

* Use cell-centered grid for EB area fraction and face centroid (erf-model#2702)

* Apply FillBoundary for momenta in slow_rhs_post.

* check on domain boundaries not box boundaries

* replace bx lo/hi by domain lo/hi

* revert incorrect change in last commit

* fix test on small cells

* Added wrappers for non-const EB factory members.

* Use cell-centered grids for eb_aux_ area fraction and face centroids.

* Corrected rayleigh damping thickness in inputs_FittedMesh.

* Remove FB from slow_rhs_post, since we do this in erf_slow_rhs_pre.

---------

Co-authored-by: Ann Almgren <[email protected]>

* Fix upwind real bcs. (erf-model#2705)

* Add pkg-config deps if needed for netcdf

* Separate shoc script

* Tweak FindNetCDF.cmake for other options

* Style

* Cleanup chars

* Add rocm detection

* More general mpi library name detection

* HDF5 detection improvements with hip

* Add modules

* Landmaks fix for metgrid. (erf-model#2706)

* Style

* Add more logging

* Update log levels

* Add message context to main cmake

* Test log-context and fix nesting

* Move new warning

* Correcting hurricane intensification output (erf-model#2707)

Co-authored-by: Mahesh Natarajan <[email protected]>

* Add more verbose errors and warnings

* Make a config file, replace accidently removed host

* MSVC with MS-MPI + Downloadable Binary (erf-model#2709)

* NetCDF/RRTGMP/Particles CI

* Remove commented out Spack commands

Removed commented-out lines for spack view commands.

* windows with ms-mpi attempt 1

* windows with ms-mpi attempt 1

* windows with ms-mpi attempt 2

* add option for MPI in cmake

* fix test path

* fix indentation

* fix paths

* Ctest with powershell?

* Ctest with powershell 2

* Simplifying installing MS-MPI + more

* Change shell and MPI executable path

* MPI wrapper script

* More MPI test changes (hopeful)

* MS-MPI test

* MS-MPI from correct source

* MS-MPI from correct source 2

* MSI vs EXE

* Exe nonewwindow

* Remove exe

* Remove exe 2

* Compile only, not run

* Binary artifact

* Binary artifact 2

* Whole build directory

Updated artifact upload paths to include the entire build directory.

* Modify Windows MPI workflow and enhance README

Updated job configuration for Windows MPI workflow to support both OFF and ON variants for MPI. Added detailed installation instructions and troubleshooting steps in the README.

* Style fixes

* Remove non-MPI builds

* Put regular windows.yml back to prior state

---------

Co-authored-by: Aaron M. Lattanzi <[email protected]>

* More cmake module checking

* Add better reconfigure catching

* Add build_with_shoc

---------

Co-authored-by: AMLattanzi <[email protected]>
Co-authored-by: Akash Dhruv <[email protected]>
Co-authored-by: Soonpil Kang <[email protected]>
Co-authored-by: Ann Almgren <[email protected]>
Co-authored-by: Aaron M. Lattanzi <[email protected]>
Co-authored-by: Mahesh Natarajan <[email protected]>
Co-authored-by: Mahesh Natarajan <[email protected]>
Co-authored-by: Laren Spear <[email protected]>
7 participants