Discover issues (follow-up to PR https://github.com/JCSDA/spack-stack/pull/993) #1011
Comments
To test number 3 on my end I need to make changes to … In the meantime, @climbfuji will provide some OOPS timing statistics which may help us figure out the problem with the Intel compiler. Tagging @mathomp4 for visibility.
Please use the modules from PR #1017 - thanks!
@climbfuji, will there be a miniconda module for SCU17? I see there is a stack-python version, but Dan created GEOS-ESM/jedi_bundle using …
No, there won't be one. We use the native/OS python3 to "drive" spack, and for everything else (building GEOS etc.) we use the spack-built python3.
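As a rough sketch of that split (the module names below are placeholders, not the actual SCU17 module set), the interpreter in use can be checked before and after loading the stack:

# the OS-provided python3 is what drives spack itself
which python3                                  # e.g. /usr/bin/python3
# after loading the spack-stack modules (hypothetical names), builds use the spack-built python3
module load stack-intel stack-intel-mpi python
which python3                                  # should now resolve into the spack-stack installation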
I created an Intel JEDI build with the following SLES15 modules and confirmed the extreme slowness issue with a high-resolution SOCA experiment. The only difference I notice is that the stack-intel-mpi is …
Also, when I load these modules, the native pip3 version is 3.6, which is used for the Swell installation. However, when I create a …
@Dooruk Regarding Python: I see this, and that is entirely expected and correct. There's no need for a …
Regarding the slowness: can you set the following environment variables and try again? I got those from @mathomp4, and they will be part of the …
Note: These Intel MPI options are what we've found for Intel MPI + GEOSgcm + SLES15. At the moment SLES15 = Milan, but it's possible that when the Cascade Lakes get on there we'll have to be even more specific. Of all of them, it's possible that NCCS might put the PSM3 line in their Intel MPI modulefile, but I don't think they have yet.
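The actual variable list from that comment is not captured in this thread. As an illustrative sketch only (the variable below is an assumption based on the PSM3 line mentioned above, which is later singled out as the decisive setting), the kind of setting involved looks like:

# illustrative only: select the PSM3 libfabric provider for Intel MPI on the SLES15/Milan nodes
export I_MPI_OFI_PROVIDER=psm3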
These helped tremendously, thanks @mathomp4! Yesterday I hit the 1-hour walltime limit without these env variables, and the same variational executable now takes 270 seconds. Silly question, but would these env variables help improve performance if they were set while building JEDI? OK, in that case I will use …
@Dooruk I'll make those env vars default in the milan stack compiler module in my next spack-stack PR. Sounds good about Python/venv, that's what we do, too.
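A minimal sketch of that Python/venv workflow, assuming the spack-stack modules are already loaded so that python3 resolves to the spack-built interpreter (the environment name is arbitrary):

# create and activate a virtual environment on top of the spack-built python3
python3 -m venv swell-venv
source swell-venv/bin/activate
# pip inside the venv now installs against the spack-built python3, not the native 3.6 one
pip3 install --upgrade pip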
If I had to guess, the PSM3 might have been the important one, though dang, that's a biiiig difference. I wonder if other flags made a difference as well? Probably not worth the effort to do a benchmark sweep of every flag :)
I just built a spack-stack on discover-mil with intel@2021.6.0 instead of 2021.10.0 - will run some tests later today and let you know how/if that changes the runtime
right now discover-mil has come to a crawl (gpfs issues again?)
thanks
yes, even my simple pip installs take 10 minutes, it is so frustrating. On Discover, from Tuesday till Thursday there are more people active and that slows the system down, I noticed. I would suggest doing Discover work on Mondays and Fridays 😄
Yeah. We aren't sure why SCU17 sometimes has these issues. Another fun one: sometimes it seems like the network out of discover uses a weird route, and clones of MAPL can take 30+ minutes. It's why I've moved to blobless clones when I can, since those can take seconds comparatively!
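For reference, a blobless (blob-filtered) clone downloads file contents on demand rather than the full history up front, which is why it can take seconds instead of 30+ minutes; a sketch, assuming the usual GEOS-ESM MAPL repository URL:

# blobless clone: full commit/tree history, file blobs fetched lazily at checkout time
git clone --filter=blob:none https://github.com/GEOS-ESM/MAPL.git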
You are right. I just tested running with all the other flags but without the following one, and I get the same issue. This seems to be the magic touch 🪄:
@Dooruk I am doing a few timing comparisons. I have a large variational task that, with the I_MPI settings above, finishes in 1900s on discover-mil with intel@… (and the walltime in EWOK is set to 1hr 15min, therefore this looks rather fast). I am going to do the same run on discover (cas) with 2021.5.0 (I think it's 5).
The second cycle with intel 2021.10.0 on scu17 finished even faster (1700s).
@Dooruk @mathomp4 Here is a poor man's comparison of three different experiments, where the large variational task (18 nodes, 12 tasks per node due to memory limitations on scu16; maybe potential for optimization on scu17) ran twice. Note: I used the I_MPI settings from @mathomp4 for SCU17. The takeaways:
I am going to close this issue as resolved, because the factor-of-many difference in runtime we saw initially is addressed by the I_MPI settings. But I'll open another issue for the 22% runtime difference between SCU16 and SCU17 and the factor-of-2 memory increase on SCU17 with 2021.10.0 vs 2021.6.0.
Describe the bug
In testing PR #993, the following issues were discovered that need to be addressed:
1. Need to add LDFLAGS="-L/usr/local/other/gcc/11.2.0/lib64" to the cmake command when building jedi-bundle. There was an additional/similar error later on when building mapl during the make step.
2. Need to run ecflow_ui like this: LD_PRELOAD="/usr/local/other/gcc/12.3.0/lib64/libstdc++.so" ecflow_ui
3. skylab-aero-weather hangs in the variational task (somewhere in bump?) and runs are in general VERY VERY slow
To Reproduce
See above
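For concreteness, a hedged sketch of how the two build/run workarounds above would be applied; the jedi-bundle source path is an illustrative placeholder, not taken from the report:

# workaround 1 (assumed invocation): point the linker at the GCC 11.2.0 runtime libraries at configure time
LDFLAGS="-L/usr/local/other/gcc/11.2.0/lib64" cmake /path/to/jedi-bundle   # path is a placeholder
# workaround 2: preload the newer libstdc++ before launching the ecFlow GUI
LD_PRELOAD="/usr/local/other/gcc/12.3.0/lib64/libstdc++.so" ecflow_ui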
Expected behavior
Both workarounds shouldn't be necessary
System:
Discover SCU16 and SCU17 with Intel
Additional context
n/a