Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DEBUG] Try to isolate the issue in SBLS with CI #363

Closed
wants to merge 2 commits into from

Conversation

amontoison
Copy link
Member

No description provided.

@amontoison amontoison changed the title Update sblstf.c [DEBUG] Try to isolate the issue in SBLS with CI Jan 3, 2025
@codecov-commenter
Copy link

codecov-commenter commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.

Project coverage is 18.99%. Comparing base (a86af94) to head (b5f4c46).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
src/sls/sls.F90 96.42% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #363       +/-   ##
===========================================
- Coverage   43.38%   18.99%   -24.40%     
===========================================
  Files         313      149      -164     
  Lines      161619    41829   -119790     
  Branches    55978    13779    -42199     
===========================================
- Hits        70123     7944    -62179     
+ Misses      73963    31602    -42361     
+ Partials    17533     2283    -15250     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@amontoison
Copy link
Member Author

amontoison commented Jan 3, 2025

@nimgould I can reproduce the issue with CI.
It seems that we have the exit status = -9 many times but because we don't return the value of status at the end of the tests, we missed them.
testlog.txt

Conclusion: The issue is not in the Julia interface. 🎉
Something is wrong in the C / Fortran part of GALAHAD.

@nimgould
Copy link
Contributor

nimgould commented Jan 4, 2025

@nimgould I can reproduce the issue with CI. It seems that we have the exit status = -9 many times but because we don't return the value of status at the end of the tests, we missed them. testlog.txt

Conclusion: The issue is not in the Julia interface. 🎉 Something is wrong in the C / Fortran part of GALAHAD.

Possibly. All we can say for sure is that the make and meson versions behave differently. I get SLS: analysis complete: status = 0, ordering = 7 for all real and integer types from the C tests. Your tests are reporting an array allocation error (status = -1) which is very rare to say the least (I have never seen it before unless I have asked for a huge size!!). We need to pin down where this is happening. As I mentioned before, setting control.sls_control.print_level = 1 might give us a clue

@amontoison
Copy link
Member Author

Nick, I don’t encounter any errors with my local version of GALAHAD compiled using Meson and GCC v11.4.
I’m not sure which version of GCC you’re using, but it seems the errors occur with GCC12 and GCC13.

I modified the PR to compile and test only SBLS (reducing the time from 30 minutes to 3 minutes), which allowed me to gather new information about where the issue might occur.
On Windows, I encountered this unauthorized memory access:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff95132964b -- memset at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
in expression starting at D:\a\GALAHAD\GALAHAD\GALAHAD.jl\test\test_sbls.jl:200
memset at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
RtlFreeHeap at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
free_base at C:\Windows\System32\ucrtbase.dll (unknown line)
sbls_form_and_factorize at D:\a\GALAHAD\GALAHAD\builddir\../src/sbls\sbls.F90:1892
sbls_factorize_matrix at D:\a\GALAHAD\GALAHAD\builddir\../src/sbls\sbls.F90:10238
sbls_factorize_matrix_s at D:\a\GALAHAD\GALAHAD\builddir\../src/sbls/C\sbls_ciface.F90:655
sbls_factorize_matrix at D:\a\GALAHAD\GALAHAD\GALAHAD.jl\src\wrappers\sbls.jl:302
test_sbls at D:\a\GALAHAD\GALAHAD\GALAHAD.jl\test\test_sbls.jl:74

@nimgould
Copy link
Contributor

nimgould commented Jan 5, 2025

I was using gfortran 10 as this is the latest that Matlab supports.

I plan to do a virgin install this morning, and will test gfortrans 10-13, my distro doesn't have 14. I am sure we agree that there is a lurking issue, and if I can locate it with 12, then I can fix it. More when I find anything.

I'll also have a look at the Windows issue, although that may be harder to deal with. Clearly something is going on in sbls!

@amontoison
Copy link
Member Author

@nimgould inform.sls_inform.bad_alloc is empty :(

@nimgould
Copy link
Contributor

nimgould commented Jan 5, 2025

OK, the only way we will track this is to put print statements into the fortran. Very annoying. We need to find out why sls_analyse is terminating with status = -1. I can try, but I'll need to move to the branch you are using; I presume I can edit files there?

@nimgould
Copy link
Contributor

nimgould commented Jan 5, 2025

(Tomorrow though, now)

@nimgould
Copy link
Contributor

nimgould commented Jan 6, 2025

Righty ho, I have put in write statements to compare the makefile and meson outputs.
The status = -1 output occurs after the call of ssids_analyse, and is obtained from the
ssids's inform structure. I printed the entire inform structure. The first seven components of the derived type should be 32bit (ip_) and the next two and 64bit (long_). For the makefile version, this is the case. For the meson version, all components are 64bit, and this corrupts the call. Are you passing the integer_64 flag to part of of the meson build?

makefile version

ssids_inform = 0 0 0 0 5 1 5 5 0 15 55 0 1 0 0 0 0 0 0 0 0 0 0 0 1 55 0

meson version:

ssids_inform = -50 0 0 0 0 0 0 0 0 0 0 0 0 0 5014 0 0 0 0 0 0 0 0 0 0 0 0

@amontoison
Copy link
Member Author

Yes, the flag -DINTEGER_64 is passed to all compilers (C, C++ and Fortran).

@amontoison
Copy link
Member Author

amontoison commented Jan 6, 2025

@nimgould
In the file ssids_ciface.F90, the structure spral_ssids_inform is:

    TYPE, BIND( C ) :: spral_ssids_inform
       INTEGER ( KIND = ipc_ ) :: flag
       INTEGER ( KIND = ipc_ ) :: matrix_dup
       INTEGER ( KIND = ipc_ ) :: matrix_missing_diag
       INTEGER ( KIND = ipc_ ) :: matrix_outrange
       INTEGER ( KIND = ipc_ ) :: matrix_rank
       INTEGER ( KIND = ipc_ ) :: maxdepth
       INTEGER ( KIND = ipc_ ) :: maxfront
       INTEGER ( KIND = ipc_ ) :: num_delay
       INTEGER ( KIND = longc_ ) :: num_factor
       INTEGER ( KIND = longc_ ) :: num_flops
       INTEGER ( KIND = ipc_ ) :: num_neg
       INTEGER ( KIND = ipc_ ) :: num_sup
       INTEGER ( KIND = ipc_ ) :: num_two
       INTEGER ( KIND = ipc_ ) :: stat
!    type(auction_inform) :: auction
       INTEGER ( KIND = ipc_ ) :: cuda_error
       INTEGER ( KIND = ipc_ ) :: cublas_error
       INTEGER ( KIND = ipc_ ) :: not_first_pass
       INTEGER ( KIND = ipc_ ) :: not_second_pass
       INTEGER ( KIND = ipc_ ) :: nparts
       INTEGER ( KIND = longc_ ) :: cpu_flops
       INTEGER ( KIND = longc_ ) :: gpu_flops
    END TYPE spral_ssids_inform

I don't understand why the first seven components should be Int32 and not Int64.

Do you correctly pass the flag -DINTEGER_64 to the C++ compiler in the makefile version?

@nimgould
Copy link
Contributor

nimgould commented Jan 6, 2025

I do both 32 and 64 bit. I had presumed that you would use test the 32 bit integer version as that is what the actions test name was. But ok, If yours is 64 bit, then I get (the reasonable)

ssids_inform = 0 0 0 0 5 1 5 5 0 15 55 0 1 0 0 0 0 0 0 0 0 0 0 0 1
instead

I'll need to dig deeper into ssids to see why your numbers are seemingly incorrect. I'm busy mostly until tomorrow afternoon, so it'll have to wait. Just to check, are you using any other non-default flags?

@amontoison
Copy link
Member Author

This is my list of flags:

# METIS
if host_machine.system() == 'windows'
  add_global_arguments('-DUSE_GKREGEX', language : 'c')
endif

if host_machine.system() == 'linux'
  add_global_arguments('-DSPRAL_HAVE_SCHED_GETCPU', language : 'cpp')
else
  add_global_arguments('-DSPRAL_NO_SCHED_GETCPU', language : 'cpp')
endif

# HWLOC
if libhwloc.found() and has_hwloch
  add_global_arguments('-DSPRAL_HAVE_HWLOC', language : 'cpp')
else
  add_global_arguments('-DSPRAL_NO_HWLOC', language : 'cpp')
endif

# HSL
if libhsl.found()
  add_global_arguments('-DLANCELOT_USE_MA57', language : 'fortran')
endif
extra_args_single = ['-DREAL_32', '-DMULTIPRECISION']
extra_args_double = ['-DMULTIPRECISION']
extra_args_quadruple = ['-DREAL_128', '-DGALAHAD_BLAS', '-DGALAHAD_LAPACK', '-DDUMMY_QMUMPS', '-DMULTIPRECISION']

if not libblas.found()
  extra_args_single += '-DGALAHAD_BLAS'
  extra_args_double += '-DGALAHAD_BLAS'
endif
if not liblapack.found()
  extra_args_single += '-DGALAHAD_LAPACK'
  extra_args_double += '-DGALAHAD_LAPACK'
endif
if not libsmumps.found()
  extra_args_single += '-DDUMMY_SMUMPS'
endif
if not libdmumps.found()
  extra_args_double += '-DDUMMY_DMUMPS'
endif
if not (libblas_name == 'mkl_rt' or liblapack_name == 'mkl_rt')
  extra_args_single += '-DDUMMY_MKL_PARDISO'
  extra_args_double += '-DDUMMY_MKL_PARDISO'
  extra_args_quadruple += '-DDUMMY_MKL_PARDISO'
endif
if not libpardiso.found()
  extra_args_single += '-DDUMMY_PARDISO'
  extra_args_double += '-DDUMMY_PARDISO'
  extra_args_quadruple += '-DDUMMY_PARDISO'
endif
if not libpastixf.found()
  extra_args_single += '-DDUMMY_PASTIXF'
  extra_args_double += '-DDUMMY_PASTIXF'
  extra_args_quadruple += '-DDUMMY_PASTIXF'
endif
if not libspmf.found()
  extra_args_single += '-DDUMMY_SPMF'
  extra_args_double += '-DDUMMY_SPMF'
  extra_args_quadruple += '-DDUMMY_SPMF'
endif
if not libwsmp.found()
  extra_args_single += '-DDUMMY_WSMP'
  extra_args_double += '-DDUMMY_WSMP'
  extra_args_quadruple += '-DDUMMY_WSMP'
endif
if not libhsl.found()
  extra_args_single += '-DDUMMY_HSL'
  extra_args_double += '-DDUMMY_HSL'
  extra_args_quadruple += '-DDUMMY_HSL'
endif

# MPI
extra_args_single += '-DDUMMY_MPI'
extra_args_double += '-DDUMMY_MPI'
extra_args_quadruple += '-DDUMMY_MPI'

# Compile GALAHAD with 64-bit integer
if int64
  extra_args_single += '-DINTEGER_64'
  extra_args_double += '-DINTEGER_64'
  extra_args_quadruple += '-DINTEGER_64'
endif

I am invited for two days at INRIA Sophia-Antipolis (near Nice) tomorrow and Wednesday so busy until I come back on Thursday. 🌊 🍸 🏖️

@nimgould
Copy link
Contributor

nimgould commented Jan 6, 2025

Thanks. As we discussed last week, the MULTIPRECISION flag is only there as an option if C users want to have different procedure names per real,int setup. I believe that you said this wasn't needed for Julia, it isn't used by Python and not (currently) documented for C

@amontoison amontoison force-pushed the amontoison-patch-1 branch 2 times, most recently from c5a2c2c to 91e19f4 Compare January 6, 2025 21:50
@amontoison
Copy link
Member Author

Thanks. As we discussed last week, the MULTIPRECISION flag is only there as an option if C users want to have different procedure names per real,int setup. I believe that you said this wasn't needed for Julia, it isn't used by Python and not (currently) documented for C

I needed to update the Julia wrappers to not use this flag.
I did it in #364.
The Meson build system has now an option -Dmultiprecision that is false by default.
I rebased this branch so don't forget to do git pull before any modification tomorrow.

@nimgould
Copy link
Contributor

nimgould commented Jan 7, 2025

I have isolated the issue in ssids_analyse, an it is all down to a call to spral's
guess_topology that relies on a proper implementation of libhwloc. For the makefile version, this function returns st =0, but for the meson one it is st-5040. Is this being
set correctly in the meson version?

@nimgould
Copy link
Contributor

nimgould commented Jan 7, 2025

In the makefile version, HWLOC is turned on and off manually, i.e.,

ifeq "$(HWLOC)" "un"
DHWLOC = -DSPRAL_NO_HWLOC
else
DHWLOC = -DSPRAL_HAVE_HWLOC
endif

where the $HWLOC variable is set to either '' or 'un' in the makefile preprocessor. For gcc it is '', so the proper HWLOC is used. For meson it seems to be computed via

HWLOC

if libhwloc.found() and has_hwloch
add_global_arguments('-DSPRAL_HAVE_HWLOC', language : 'cpp')
else
add_global_arguments('-DSPRAL_NO_HWLOC', language : 'cpp')
endif

So maybe this test simply gets it wrong?

@nimgould
Copy link
Contributor

nimgould commented Jan 7, 2025

Since the 32bit test pass, but the 64 bit ones don't, what does meson add for this argument in either case? I'm sure you meson experts can find out very easily

@nimgould
Copy link
Contributor

nimgould commented Jan 7, 2025

Right, I have noew seen an st=5040 locally (via the makefile). So this is nothing to do with meson (sorry!!) and everything to do with the hwloc functiions. I suspect that everything in hwloc is int rather than int/int64, and that the spral call is promoting it to int64 for the _64 runs - this is just a hunch at the moment! Certainly the nregion variable in the _64 version sometimes returns 0 and sometimes 140733193388033. I shall track this down

@nimgould
Copy link
Contributor

nimgould commented Jan 7, 2025

OK, hwloc bug is now squashed, as suspected it has to use 32bit integers. My mistake when adapting for i64. The windows character ( len = 0) :: prefix issue remains, but that is a real compiler bug according to my reading of the standard

@nimgould
Copy link
Contributor

nimgould commented Jan 7, 2025

Yup, it's definitely a compiler bug. I suggest we disable this sbls test on Windows (but it's possible that the same bug will affect other packages too)

@amontoison
Copy link
Member Author

@nimgould I am back from Nice.
I backported your fix for SSIDS to the master branch with this commit: d56eb41

Regarding the Windows bug, it only occurs with GCC 12 an GCC 13.
It works fine with GCC 9, GCC 10, and GCC 11.
By default, our CI runs with GCC 11 on Windows, so everything is good on our side.

Do you plan to add anything else for release 5.1.0, or should we go ahead and tag it?

@amontoison amontoison closed this Jan 9, 2025
@amontoison amontoison deleted the amontoison-patch-1 branch January 9, 2025 09:12
@nimgould
Copy link
Contributor

nimgould commented Jan 9, 2025

Thanks, Alexis. Nothing more for 5.1, so could you please tag it. I'll push the latest Julia (etc) docs to their repo once you done this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants