Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ifpack2: rbiluk test failures in kokkos integration builds #12981

Closed
ndellingwood opened this issue May 2, 2024 · 11 comments
Closed

Ifpack2: rbiluk test failures in kokkos integration builds #12981

ndellingwood opened this issue May 2, 2024 · 11 comments
Assignees
Labels
pkg: Ifpack2 pkg: KokkosKernels type: bug The primary issue is a bug in Trilinos code or tests

Comments

@ndellingwood
Copy link
Contributor

Bug Report

@jgfouca @trilinos/ifpack2

Description

The Trilinos nightly Kokkos integration tests are failing in the Ifpack2_RBILUK_hb_belos_block_serial_MPI_1 test, @jgfouca can you take a look following the recent block spiluk #12908 and ifpack2 rbiluk #12911 updates?

Here are links to the failing jobs:

The jobs run on cee ascic resources (related script files for the gcc/8.3.0 build https://trilinos-cdash.sandia.gov/build/1511033/files), but I can send some slightly modified scripts for kokkos-dev-2 if that is helpful

The Trilinos + Kokkos integration testing was down for awhile (a kokkos_swap issue kokkos/kokkos#6960) that was resolved yesterday which is likely why this didn't show up earlier.

Steps to Reproduce

  1. SHA1: develop branches
  2. Configure script: see https://trilinos-cdash.sandia.gov/build/1511033/files
# Notes for repo setup
git clone -b develop https://github.com/trilinos/Trilinos.git
TRILINOS_DIR=$PWD/Trilinos
git clone -b develop https://github.com/kokkos/kokkos.git
KOKKOS_DIR=$PWD/kokkos
cd $TRILINOS_DIR
ln -s ${PWD}/../kokkos kokkos
cd ..
git clone -b develop https://github.com/kokkos/kokkos-kernels.git
KOKKOSKERNELS_DIR=$PWD/kokkos-kernels
cd $TRILINOS_DIR
ln -s ${PWD}/../kokkos-kernels kokkos-kernels
cd ..

@ndellingwood ndellingwood added type: bug The primary issue is a bug in Trilinos code or tests pkg: Ifpack2 pkg: KokkosKernels labels May 2, 2024
Copy link

github-actions bot commented May 2, 2024

Automatic mention of the @trilinos/ifpack2 team

1 similar comment
Copy link

github-actions bot commented May 2, 2024

Automatic mention of the @trilinos/ifpack2 team

@jgfouca jgfouca self-assigned this May 2, 2024
@jgfouca
Copy link
Contributor

jgfouca commented May 2, 2024

@ndellingwood , I was not able to reproduce that test fail on my machine. Is there an easy way to see what the env was for the integration test?

@ndellingwood
Copy link
Contributor Author

@jgfouca this the link containing the reproducer cmake files etc. for the gcc/8.3.0 job setup by frameworks: https://trilinos-cdash.sandia.gov/build/1511033/files

That link come from this job: https://trilinos-cdash.sandia.gov/build/1511033 and some reproducer instructions are here https://github.com/trilinos/Trilinos/wiki/Reproducing-PR-Testing-Errors

If you have access to the ascic machines I'm guessing you can follow the instructions pretty directly. Otherwise, I can send you some slightly modified configuration scripts to try and reproduce on kokkos-dev-2

@ndellingwood
Copy link
Contributor Author

@jgfouca in case it is helpful, the gist is that genconfig (I believe) generates the generatedPRFragment.cmake file containing configuration details of the job, which relies on packageEnables.cmake to read the file; the cmake line pulls these files in as (snipped from configure_command.txt ):

cmake -D Trilinos_CONFIGURE_OPTIONS_FILE=/scratch/trilinos/jenkins/ascic158/workspace/Trilinos_nightly_pipeline/generatedPRFragment.cmake -C "/scratch/trilinos/jenkins/ascic158/workspace/Trilinos_nightly_pipeline/packageEnables.cmake" -G "Ninja" -DKokkos_SOURCE_DIR_OVERRIDE:string=kokkos -DKokkosKernels_SOURCE_DIR_OVERRIDE:string=kokkos-kernels $TRILINOS_SRC

If you don't have access to ascic, you can try reproducing using these sems modules (I've done this in the past on kokkos-dev-2):

module load sems-cmake sems-gcc/8.3.0 sems-openmpi/1.10.7 sems-ninja
module load sems-zlib/1.2.11 sems-hdf5/1.10.7 sems-netcdf-c/4.7.3 sems-parallel-netcdf/1.12.1 sems-boost/1.69.0
module load sems-superlu/4.3 sems-scotch/6.0.3 sems-parmetis/4.0.3 sems-metis/5.1.0

@jgfouca
Copy link
Contributor

jgfouca commented May 2, 2024

@ndellingwood , I made a mistake and this error is actually easy for me to reproduce. Here's what I've found so far:

  1. If you just use the kokkos-kernels in Trilinos, the test works
  2. If you use latest develop kokkos-kernels, the test fails
  3. If you use the commit on develop where the block spiluk fixes went in, the test fails

So , this is pretty confusing. The block-spiluk impl file is the same for all 3 of the above tests.

@ndellingwood
Copy link
Contributor Author

@jgfouca that is odd, though the kokkos-kernels commit at the block spiluk fixes (3.) should be on top of other changes/diffs to kokkos-kernels made between 4.3.0 and the block spiluk fix?

@jgfouca
Copy link
Contributor

jgfouca commented May 2, 2024

@ndellingwood , @brian-kelley , I've traced the cause of the fail to this change:

--- a/sparse/src/KokkosSparse_BsrMatrix.hpp
+++ b/sparse/src/KokkosSparse_BsrMatrix.hpp
@@ -34,6 +34,7 @@
 #include "Kokkos_ArithTraits.hpp"
 #include "KokkosSparse_CrsMatrix.hpp"
 #include "KokkosKernels_Error.hpp"
+#include "KokkosKernels_default_types.hpp"
 
 namespace KokkosSparse {
 
@@ -325,9 +326,7 @@ struct BsrRowViewConst {
 /// storage for sparse matrices, as described, for example, in Saad
 /// (2nd ed.).
 template <class ScalarType, class OrdinalType, class Device,
-          class MemoryTraits = void,
-          class SizeType     = typename Kokkos::ViewTraits<OrdinalType*, Device,
-                                                       void, void>::size_type>
+          class MemoryTraits = void, class SizeType = default_size_type>

Which came in through this commit:

commit 8756faaa64fc5192e24634386a1051e1e2730c83
Author: brian-kelley <[email protected]>
Date:   Tue Mar 26 13:36:56 2024 -0600

    Use default_size_type as default offset in matrix types (#2149)
    
    Now a declaration like CrsMatrix<Scalar, Ordinal, Device>
    will by default use an ETI'd type combination (as int is the default
    ETI'd offset)

I don't see what's wrong with Brian's change, but I've double checked and that seems to be the cause. Since only RBILUK is failing, it must be the way I'm using Bsr in that file.

@brian-kelley
Copy link
Contributor

brian-kelley commented May 2, 2024

@ndellingwood It looks like

-DKokkosKernels_INST_OFFSET_SIZE_T=ON
-DKokkosKernels_INST_OFFSET_INT=OFF

aren't included as flags in this build, but we need them to build Trilinos for now (that goes for all KK develop/Trilinos integration builds).
We flipped the defaults in KK develop (now int on, size_t off) in kokkos/kokkos-kernels#2140 but Tpetra still uses only size_t.
Although, if there's really a type mismatch, I'm not sure how this would even compile. Maybe there's a Teuchos::OrdinalTraits<size_t>::invalid() value getting used in RBILUK somewhere that can't be represented as int? Building with a UB sanitizer might tell you if and where that's happening.

Or maybe on the KK side, default_size_type is hardcoded somewhere instead of getting the type from the matrix/handle.

BTW, @jhux2 is now working on allowing offsets other than size_t in the Tpetra stack.

@jgfouca
Copy link
Contributor

jgfouca commented May 2, 2024

@brian-kelley , I've further isolated the problem to a hacky thing I needed to do to call Sequential::trsv on CUDA:

      // Kokkos kernels impl. For now, the only block trsv available is Sequential                                                                                                                                             
      // and must be done on host.                                                                                                                                                                                             
      using row_map_type = typename local_matrix_host_type::row_map_type;
      using index_type   = typename local_matrix_host_type::index_type;
      using values_type  = typename local_matrix_host_type::values_type;

      auto X_view = X.getLocalViewHost(Tpetra::Access::ReadOnly);
      auto Y_view = Y.getLocalViewHost(Tpetra::Access::ReadWrite);

      auto L_row_ptrs_host = L_block_->getCrsGraph().getLocalRowPtrsHost();
      auto L_entries_host = L_block_->getCrsGraph().getLocalIndicesHost();
      auto U_row_ptrs_host = U_block_->getCrsGraph().getLocalRowPtrsHost();
      auto U_entries_host = U_block_->getCrsGraph().getLocalIndicesHost();
      auto L_values_host = L_block_->getValuesHost();
      auto U_values_host = U_block_->getValuesHost();

      row_map_type* L_row_ptrs_host_ri = reinterpret_cast<row_map_type*>(&L_row_ptrs_host);
      index_type* L_entries_host_ri    = reinterpret_cast<index_type*>(&L_entries_host);
      row_map_type* U_row_ptrs_host_ri = reinterpret_cast<row_map_type*>(&U_row_ptrs_host);
      index_type* U_entries_host_ri    = reinterpret_cast<index_type*>(&U_entries_host);
      values_type* L_values_host_ri    = reinterpret_cast<values_type*>(&L_values_host);
      values_type* U_values_host_ri    = reinterpret_cast<values_type*>(&U_values_host);

      const auto numRows = L_block_->getLocalNumRows();
      local_matrix_host_type L_block_local_host("L_block_local_host", numRows, numRows, L_entries_host.size(), *L_values_host_ri, *L_row_ptrs_host_ri, *L_entries_host_ri, blockSize_);
      local_matrix_host_type U_block_local_host("U_block_local_host", numRows, numRows, U_entries_host.size(), *U_values_host_ri, *U_row_ptrs_host_ri, *U_entries_host_ri, blockSize_);

      if (mode == Teuchos::NO_TRANS) {
        KokkosSparse::trsv("L", "N", "N", L_block_local_host, X_view, Y_view);
        KokkosSparse::trsv("U", "N", "N", U_block_local_host, Y_view, Y_view);
        KokkosBlas::axpby(alpha, Y_view, beta, Y_view);
      }
      else {
        KokkosSparse::trsv("U", "T", "N", U_block_local_host, X_view, Y_view);
        KokkosSparse::trsv("L", "T", "N", L_block_local_host, Y_view, Y_view);
        KokkosBlas::axpby(alpha, Y_view, beta, Y_view);
      }

To sum up, L_Block and U_Block are device BSRs but I need host BSRs in order to call trsv. I ran into all kinds of problems dealing with the Views I was working with not being quite the right type as to be acceptable to the BsrMatrix constructor, so I ended up doing the reinterpret_cast above. It looked to me like the problem was View<const T*> vs. View<T*>, so I thought I was just casting away const-ness, but maybe the types were mismatched too. Maybe you can think of a better way to do this.

@ndellingwood
Copy link
Contributor Author

@jgfouca @brian-kelley thanks for looking into this and resolving! The fix only needed to be applied to the kokkos-kernels@develop branch, not directly in Trilinos (Jim fixed with kokkos/kokkos-kernels#2196) but will be included with the 4.3.01 patch release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg: Ifpack2 pkg: KokkosKernels type: bug The primary issue is a bug in Trilinos code or tests
Projects
None yet
Development

No branches or pull requests

3 participants