Amesos2_KLU2 loadA_impl widespread Nalu failures #13737

Open
spdomin opened this issue Jan 21, 2025 · 12 comments
Labels
pkg: Amesos2 type: bug The primary issue is a bug in Trilinos code or tests

Comments

@spdomin
Contributor

spdomin commented Jan 21, 2025

Bug Report

Widespread new throws across many tests in our Nalu test suite. No real pattern that I can see.

Description

[ascic0204:3563317] *** Process received signal ***
[ascic0204:3563317] Signal: Aborted (6)
[ascic0204:3563317] Signal code: (-6)
terminate called after throwing an instance of 'std::runtime_error'
what(): /fgs/spdomin/nightly/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:488:
Throw number = 1

Throw test that evaluated to true: nnz_ret != as<local_ordinal_type>(this->globalNumNonZeros_)
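
For readers unfamiliar with the check that fires here, the following is a minimal Python sketch of the idea, not the Amesos2 source. It assumes the solver gathers every rank's matrix entries into one place and then verifies that the number of nonzeros it received (`nnz_ret`) matches the global count recorded beforehand. The function names and the `(row, col, value)` layout are hypothetical.

```python
def gather_rows(rank_entries):
    """Concatenate per-rank lists of (row, col, value) entries into one list.

    Hypothetical stand-in for gathering a distributed matrix onto one rank.
    """
    gathered = []
    for entries in rank_entries:
        gathered.extend(entries)
    return gathered


def check_nnz(gathered, global_num_nonzeros):
    """Raise if the gathered nonzero count differs from the expected global count."""
    nnz_ret = len(gathered)
    if nnz_ret != global_num_nonzeros:
        raise RuntimeError(
            f"nnz_ret ({nnz_ret}) != globalNumNonZeros ({global_num_nonzeros})"
        )
    return nnz_ret
```

If the gathered count and the recorded global count disagree for any reason, the check throws, which corresponds to the `std::runtime_error` in the log above.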

Steps to Reproduce

Good:

NaluCFD/Nalu SHA1: aa35b4d3d1dd9cc2d63ea79e1a1d34c3970ed25e
Trilinos/develop SHA1: 0678446

Bad:

NaluCFD/Nalu SHA1: aa35b4d3d1dd9cc2d63ea79e1a1d34c3970ed25e
Trilinos/develop SHA1: 5b94adf
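
With one known-good and one known-bad Trilinos SHA1, the offending commit can in principle be located by bisection. The sketch below is only an illustration of that search in Python: the intermediate commit names and the `is_bad` predicate are hypothetical placeholders, and in practice `is_bad` would mean building Trilinos and Nalu at that commit and running a failing test.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history between the good and bad SHA1s reported above.
history = ["0678446", "placeholder-1", "placeholder-2", "placeholder-3", "5b94adf"]
known_bad = {"placeholder-2", "placeholder-3", "5b94adf"}  # made up for illustration
culprit = first_bad(history, lambda sha: sha in known_bad)
```

The same search is what `git bisect` automates on a real repository.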

@spdomin spdomin added the type: bug The primary issue is a bug in Trilinos code or tests label Jan 21, 2025
@spdomin
Contributor Author

spdomin commented Jan 21, 2025

Some cases are complex, i.e., higher-order, overset, etc., while some are simple flows:

The following tests FAILED:
6 - dgNonConformal3dFluids (Failed)
15 - ductElemWedge (Failed)
16 - ductWedge (Failed)
19 - elemBackStepLRSST (Failed)
20 - elemClosedDomain (Failed)
23 - elemPipeCHT (Failed)
28 - 1x2x10Tet10 (Failed)
29 - ductTet10 (Failed)
34 - hoVortex (Failed)
41 - nonConformalWithPeriodic (Failed)
42 - nonConformalWithPeriodicConsolidated (Failed) // very complex case...
49 - oversetFluids (Failed)
50 - oversetFluidsEdge (Failed)
52 - periodic3dElemNp4 (Failed)
53 - periodic3dElemNp8 (Failed)
55 - periodic3dEdgeNp4 (Failed)
75 - 2d_quad4_channel (Failed) // simple as you get...
76 - 2d_quad9_couette (Failed)
83 - 3d_tet4_taylor_green_0p2 (Failed)
84 - 3d_hex8_turb_channel (Failed)

@cgcgcg
Contributor

cgcgcg commented Jan 21, 2025

@iyamazaki

@iyamazaki
Contributor

@spdomin. Sorry for the issues. I am looking into it. Can you give me a suggestion or instructions on how I could reproduce the errors, or are there any logs from the runs that I could look into? My intention was that, at least for the complex case, it would take the original code path.

@spdomin
Contributor Author

spdomin commented Jan 21, 2025

I would be happy to help you set up a Nalu build and run the simplest case, 2d_quad4_channel.

The most straightforward path is to follow:

https://nalu.readthedocs.io/en/latest/source/user/build_manually.html#linux-and-osx

I wrapped this in a simple script and it should build somewhat easily.

The Trilinos config we use is under: https://github.com/NaluCFD/Nalu/blob/master/build/do-configTrilinos_release while the Nalu config is: https://github.com/NaluCFD/Nalu/blob/master/build/do-configNalu_release

Depending on your environment, you might be able to use all of my installations for TPLs. Write me at my work email address for more (Sandia or Stanford).

In the meantime, it might be nice to revert so that we do not lose coverage while we sort this out.

@spdomin
Contributor Author

spdomin commented Jan 21, 2025

Not surprisingly, 2d_quad4_channel passes with one MPI rank and fails for any count higher.

@iyamazaki
Contributor

Thank you, @spdomin. I managed to reproduce the errors. Meanwhile, we created a PR to revert the changes.

@spdomin
Contributor Author

spdomin commented Jan 21, 2025

Great - let me know if I can help - especially if you find something that Nalu is doing that it probably should not.

@iyamazaki
Contributor

Thank you, @spdomin. I fixed a bug in the PR, and now I see the same results when running the regression tests with and without the changes. Unfortunately, my Nalu build against the pre-change Trilinos (i.e., 0678446) has some failures, so I am not sure whether I have fixed all the issues, e.g.:

20: ..elemClosedDomain........... FAILED:   s  max diff:  .00000000695591166900011567339468
1/1 Test #20: elemClosedDomain .................***Failed    6.42 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
regression    =  12.84 sec*proc (1 test)

Total Test time (real) =   6.42 sec

The following tests FAILED:
	 20 - elemClosedDomain (Failed)
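
The `max diff` value in the output above suggests that the regression test compares computed values against a stored reference and fails when the largest difference exceeds a tolerance. A minimal Python sketch of such a comparison follows. It is an assumption about the general approach, not Nalu's actual pass/fail script, and the tolerance and sample values are made up for illustration.

```python
def max_abs_diff(current, reference):
    """Largest absolute difference between two equally long sequences of values."""
    if len(current) != len(reference):
        raise ValueError("current and reference must have the same length")
    return max((abs(c - r) for c, r in zip(current, reference)), default=0.0)


def passes(current, reference, tol):
    """True when every value agrees with the reference to within tol."""
    return max_abs_diff(current, reference) <= tol


# Made-up values: a difference near 7e-9 passes at tol=1e-8 but fails at tol=1e-10.
reference = [1.0, 2.0]
current = [1.0, 2.0 + 7e-9]
loose_ok = passes(current, reference, 1e-8)
tight_ok = passes(current, reference, 1e-10)
```

Under this kind of check, a difference of roughly 7e-9 can be a failure or a pass depending only on the chosen tolerance, which is why a small `max diff` does not by itself indicate a solver bug.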

@iyamazaki
Contributor

@spdomin. I merged PR 13740. Please let me know if it still causes any issue.

@spdomin
Contributor Author

spdomin commented Jan 23, 2025

Sorry, meetings all day and out of pocket.

After your revert, last night was clean (see way below).

The elemClosed** is a special case. For a low-Mach flow, when closed, the elliptic solve can be singular. However, when relaxing the low-Mach assumption, while adding new terms that allow for a low-speed compressible use case, the system should be fine (and result in a successful test as seen in the last nightly test suite).

Send me the error you are seeing and we can work through it. I should be free tomorrow:)

100% tests passed, 0 tests failed out of 84

Label Time Summary:
laboratory = 937.64 sec*proc (10 tests)
performance = 7996.26 sec*proc (3 tests)
regression = 13226.85 sec*proc (68 tests)
unit = 9.56 sec*proc (2 tests)
verification = 289.04 sec*proc (1 test)

Total Test time (real) = 858.13 sec
NaluCFD/Nalu SHA1: aa35b4d3d1dd9cc2d63ea79e1a1d34c3970ed25e
Trilinos/develop SHA1: 9f25b1c

@iyamazaki
Contributor

Thank you, @spdomin. We merged a new PR with a fix yesterday. Can you let me know if any issue remains?

@spdomin
Contributor Author

spdomin commented Jan 24, 2025

We had one diff: FAILED:
75 - 2d_quad4_channel (Failed)

Luckily, this configuration has an analytical solution. I will run it out to full convergence tomorrow (Friday) and report back.
