[DEBUG] Try to isolate the issue in SBLS with CI #363
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #363 +/- ##
===========================================
- Coverage 43.38% 18.99% -24.40%
===========================================
Files 313 149 -164
Lines 161619 41829 -119790
Branches 55978 13779 -42199
===========================================
- Hits 70123 7944 -62179
+ Misses 73963 31602 -42361
+ Partials 17533 2283 -15250
☔ View full report in Codecov by Sentry. |
@nimgould I can reproduce the issue with CI. Conclusion: The issue is not in the Julia interface. 🎉 |
Possibly. All we can say for sure is that the make and meson versions behave differently. I get SLS: analysis complete: status = 0, ordering = 7 for all real and integer types from the C tests. Your tests are reporting an array allocation error (status = -1) which is very rare to say the least (I have never seen it before unless I have asked for a huge size!!). We need to pin down where this is happening. As I mentioned before, setting control.sls_control.print_level = 1 might give us a clue |
dd83682 to 49b282b
Nick, I don’t encounter any errors with my local version of GALAHAD compiled using Meson and GCC v11.4. I modified the PR to compile and test only SBLS (reducing the time from 30 minutes to 3 minutes), which allowed me to gather new information about where the issue might occur.
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff95132964b -- memset at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
in expression starting at D:\a\GALAHAD\GALAHAD\GALAHAD.jl\test\test_sbls.jl:200
memset at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
RtlFreeHeap at C:\Windows\SYSTEM32\ntdll.dll (unknown line)
free_base at C:\Windows\System32\ucrtbase.dll (unknown line)
sbls_form_and_factorize at D:\a\GALAHAD\GALAHAD\builddir\../src/sbls\sbls.F90:1892
sbls_factorize_matrix at D:\a\GALAHAD\GALAHAD\builddir\../src/sbls\sbls.F90:10238
sbls_factorize_matrix_s at D:\a\GALAHAD\GALAHAD\builddir\../src/sbls/C\sbls_ciface.F90:655
sbls_factorize_matrix at D:\a\GALAHAD\GALAHAD\GALAHAD.jl\src\wrappers\sbls.jl:302
test_sbls at D:\a\GALAHAD\GALAHAD\GALAHAD.jl\test\test_sbls.jl:74 |
I was using gfortran 10, as this is the latest that Matlab supports. I plan to do a clean install this morning, and will test gfortran 10 to 13 (my distro doesn't have 14). I am sure we agree that there is a lurking issue, and if I can locate it with 12, then I can fix it. More when I find anything. I'll also have a look at the Windows issue, although that may be harder to deal with. Clearly something is going on in sbls! |
@nimgould |
6f3678e to e9a6443
OK, the only way we will track this down is to put print statements into the Fortran. Very annoying. We need to find out why sls_analyse is terminating with status = -1. I can try, but I'll need to move to the branch you are using; I presume I can edit files there? |
(Tomorrow though, now) |
Righty ho, I have put in write statements to compare the makefile and meson outputs.
makefile version: ssids_inform = 0 0 0 0 5 1 5 5 0 15 55 0 1 0 0 0 0 0 0 0 0 0 0 0 1 55 0
meson version: ssids_inform = -50 0 0 0 0 0 0 0 0 0 0 0 0 0 5014 0 0 0 0 0 0 0 0 0 0 0 0 |
Yes, the flag |
@nimgould
TYPE, BIND( C ) :: spral_ssids_inform
INTEGER ( KIND = ipc_ ) :: flag
INTEGER ( KIND = ipc_ ) :: matrix_dup
INTEGER ( KIND = ipc_ ) :: matrix_missing_diag
INTEGER ( KIND = ipc_ ) :: matrix_outrange
INTEGER ( KIND = ipc_ ) :: matrix_rank
INTEGER ( KIND = ipc_ ) :: maxdepth
INTEGER ( KIND = ipc_ ) :: maxfront
INTEGER ( KIND = ipc_ ) :: num_delay
INTEGER ( KIND = longc_ ) :: num_factor
INTEGER ( KIND = longc_ ) :: num_flops
INTEGER ( KIND = ipc_ ) :: num_neg
INTEGER ( KIND = ipc_ ) :: num_sup
INTEGER ( KIND = ipc_ ) :: num_two
INTEGER ( KIND = ipc_ ) :: stat
! type(auction_inform) :: auction
INTEGER ( KIND = ipc_ ) :: cuda_error
INTEGER ( KIND = ipc_ ) :: cublas_error
INTEGER ( KIND = ipc_ ) :: not_first_pass
INTEGER ( KIND = ipc_ ) :: not_second_pass
INTEGER ( KIND = ipc_ ) :: nparts
INTEGER ( KIND = longc_ ) :: cpu_flops
INTEGER ( KIND = longc_ ) :: gpu_flops
END TYPE spral_ssids_inform
I don't understand why the first seven components should be Do you correctly pass the flag |
I do both 32 and 64 bit. I had presumed that you would test the 32-bit integer version, as that is what the actions test name was. But OK, if yours is 64 bit, then I get (the reasonable)
ssids_inform = 0 0 0 0 5 1 5 5 0 15 55 0 1 0 0 0 0 0 0 0 0 0 0 0 1
I'll need to dig deeper into ssids to see why your numbers are seemingly incorrect. I'm busy mostly until tomorrow afternoon, so it'll have to wait. Just to check, are you using any other non-default flags? |
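Whether both sides agree on the integer width matters because every ipc_ field in the BIND( C ) type above changes size with it. A minimal C sketch of the two layouts, assuming ipc_ maps to int32_t or int64_t and longc_ maps to int64_t; the macro, struct, and helper names are illustrative and are not GALAHAD's or SPRAL's own:

```c
#include <stddef.h>
#include <stdint.h>

/* Field list copied from the Fortran TYPE above (the commented-out
   auction component is omitted). The two macro parameters play the
   role of ipc_ and longc_; both mappings are assumptions. */
#define SSIDS_INFORM_FIELDS(ipc_t, longc_t) \
    ipc_t   flag;                \
    ipc_t   matrix_dup;          \
    ipc_t   matrix_missing_diag; \
    ipc_t   matrix_outrange;     \
    ipc_t   matrix_rank;         \
    ipc_t   maxdepth;            \
    ipc_t   maxfront;            \
    ipc_t   num_delay;           \
    longc_t num_factor;          \
    longc_t num_flops;           \
    ipc_t   num_neg;             \
    ipc_t   num_sup;             \
    ipc_t   num_two;             \
    ipc_t   stat;                \
    ipc_t   cuda_error;          \
    ipc_t   cublas_error;        \
    ipc_t   not_first_pass;      \
    ipc_t   not_second_pass;     \
    ipc_t   nparts;              \
    longc_t cpu_flops;           \
    longc_t gpu_flops;

/* Layout seen by a build with 32-bit default integers. */
struct inform_i32 { SSIDS_INFORM_FIELDS(int32_t, int64_t) };

/* Layout seen by a build with 64-bit default integers. */
struct inform_i64 { SSIDS_INFORM_FIELDS(int64_t, int64_t) };

/* Byte offset of the stat component in each layout. */
size_t stat_offset_i32(void) { return offsetof(struct inform_i32, stat); }
size_t stat_offset_i64(void) { return offsetof(struct inform_i64, stat); }
```

If the library is built with one width and the caller with the other, a component such as stat is read from the wrong offset, so the printed values look shifted or nonsensical even though nothing failed.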
This is my list of flags:
I am invited for two days at INRIA Sophia-Antipolis (near Nice) tomorrow and Wednesday so busy until I come back on Thursday. 🌊 🍸 🏖️ |
a90d3f1 to a11730f
Thanks. As we discussed last week, the MULTIPRECISION flag is only there as an option if C users want to have different procedure names per (real, int) setup. I believe that you said this wasn't needed for Julia; it isn't used by Python and is not (currently) documented for C.
c5a2c2c to 91e19f4
I needed to update the Julia wrappers to not use this flag. |
I have isolated the issue in ssids_analyse, and it is all down to a call to spral's |
In the makefile version, HWLOC is turned on and off manually, i.e., ifeq "$(HWLOC)" "un", where the $HWLOC variable is set to either '' or 'un' in the makefile preprocessor. For gcc it is '', so the proper HWLOC is used. For meson it seems to be computed via HWLOC: if libhwloc.found() and has_hwloch. So maybe this test simply gets it wrong?
Since the 32-bit tests pass, but the 64-bit ones don't, what does meson add for this argument in either case? I'm sure you Meson experts can find out very easily.
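The HWLOC switch discussed above is a compile-time choice, so the same source can be built with or without hwloc. A hedged C sketch of such a guard follows; the macro name HAVE_HWLOC and the function count_numa_regions are assumptions made for illustration, and the fallback of one region is a choice made here rather than something taken from SPRAL:

```c
#ifdef HAVE_HWLOC            /* assumed macro name, set by the build system */
#include <hwloc.h>
#endif

/* Return the number of NUMA regions, or 1 when hwloc is unavailable
   or the topology cannot be loaded. The result is a plain int, which
   is also what hwloc_get_nbobjs_by_type returns. */
int count_numa_regions(void)
{
#ifdef HAVE_HWLOC
    hwloc_topology_t topology;
    if (hwloc_topology_init(&topology) != 0)
        return 1;
    if (hwloc_topology_load(topology) != 0) {
        hwloc_topology_destroy(topology);
        return 1;
    }
    int n = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
    hwloc_topology_destroy(topology);
    return n > 0 ? n : 1;
#else
    return 1;                /* stub path: behave as a single region */
#endif
}
```

Printing which branch was compiled in (for example from a test program) is one way to check what a given build system actually decided.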
Right, I have now seen an st=5040 locally (via the makefile). So this is nothing to do with meson (sorry!!) and everything to do with the hwloc functions. I suspect that everything in hwloc is int rather than int/int64, and that the spral call is promoting it to int64 for the _64 runs - this is just a hunch at the moment! Certainly the nregion variable in the _64 version sometimes returns 0 and sometimes 140733193388033. I shall track this down |
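The odd value fits this hunch: 140733193388033 is 0x00007FFF00000001, i.e. a 32-bit 1 in the low half with stale bytes in the high half. A small C sketch of that effect, assuming a little-endian machine such as x86-64; fake_get_nregion is a hypothetical stand-in and not an hwloc or SPRAL function:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a C routine that reports a count through a
   pointer to a 32-bit int: it writes exactly 4 bytes and nothing more. */
void fake_get_nregion(void *out)
{
    int32_t n = 1;
    memcpy(out, &n, sizeof n);
}

/* Caller that wrongly reserves a 64-bit integer for the result.
   `stale` models whatever bytes were already in that storage. */
int64_t nregion_seen_by_int64_caller(int64_t stale)
{
    int64_t nregion = stale;
    fake_get_nregion(&nregion);   /* only 4 of the 8 bytes are overwritten */
    return nregion;
}

/* Correct caller: receive into a 32-bit int, then widen explicitly. */
int64_t nregion_seen_by_int32_caller(void)
{
    int32_t nregion = 0;
    fake_get_nregion(&nregion);
    return (int64_t) nregion;
}
```

With stale upper bytes of 0x00007FFF, the 64-bit caller sees 140733193388033 on a little-endian host, while the 32-bit caller always sees 1. Which stale bytes are present depends on what was previously in that memory, which would explain a result that changes from run to run.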
OK, the hwloc bug is now squashed: as suspected, it has to use 32-bit integers. My mistake when adapting for i64. The Windows character ( len = 0) :: prefix issue remains, but that is a real compiler bug according to my reading of the standard |
Yup, it's definitely a compiler bug. I suggest we disable this sbls test on Windows (but it's possible that the same bug will affect other packages too) |
97ec154 to b5f4c46
@nimgould I am back from Nice. Regarding the Windows bug, it only occurs with GCC 12 and GCC 13. Do you plan to add anything else for release 5.1.0, or should we go ahead and tag it? |
Thanks, Alexis. Nothing more for 5.1, so could you please tag it? I'll push the latest Julia (etc.) docs to their repo once you have done this. |