FlatMap: Use SSE2 intrinsics #1616
#include "axom/core/ArrayView.hpp"
#include "axom/core/utilities/BitUtilities.hpp"

#if defined(_MSC_VER)
Do we need to do anything special to get the intrinsics, e.g. compile with the -march flag?
My understanding is that nothing special is required if Axom is compiled as a 64-bit library, since support for up to SSE2 is a part of the x86-64 spec. For 32-bit x86 systems, you would need to specify -march=....
I confirmed with my changes from #1614 that I am using the new SSE2 intrinsics w/ the rzwhippet-clang host-config
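To illustrate the point about 64-bit builds: GCC and Clang define `__SSE2__` whenever SSE2 code generation is enabled, which is the default for x86-64 targets. MSVC defines `_M_X64` for x64 targets and sets `_M_IX86_FP` to 2 on 32-bit builds compiled with `/arch:SSE2` or later. Below is a minimal sketch of a compile-time check built on those macros. The names `EXAMPLE_HAS_SSE2`, `exampleHasSse2` and `countZeroBytes16` are hypothetical and only for illustration; they are not taken from Axom.

```cpp
#include <cstdint>

// Hypothetical feature macro for this sketch (not Axom's).
#if defined(__SSE2__) || defined(_M_X64) || \
  (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>  // SSE2 intrinsics
  #define EXAMPLE_HAS_SSE2 1
#else
  #define EXAMPLE_HAS_SSE2 0
#endif

// True when this translation unit was compiled with SSE2 enabled.
constexpr bool exampleHasSse2() { return EXAMPLE_HAS_SSE2 != 0; }

// Counts how many of the 16 bytes at `bytes` are zero.
// Uses SSE2 when available, otherwise a plain scalar loop.
inline int countZeroBytes16(const std::uint8_t* bytes)
{
#if EXAMPLE_HAS_SSE2
  const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes));
  // One mask bit per byte lane that compared equal to zero.
  const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
  int count = 0;
  for(int m = mask; m != 0; m &= m - 1)  // clear lowest set bit each step
  {
    ++count;
  }
  return count;
#else
  int count = 0;
  for(int i = 0; i < 16; ++i)
  {
    count += (bytes[i] == 0) ? 1 : 0;
  }
  return count;
#endif
}
```

On a 32-bit x86 build without an SSE2-enabling flag, the scalar branch would be selected instead, which matches the behavior described above.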
kennyweiss
left a comment
Thanks @publixsubfan
Arlie-Capps
left a comment
Very cool, Max. Thanks!
@publixsubfan Nice work!
Summary
When available, uses SSE2 operations for GroupBucket::getEmptyBucket() and GroupBucket::visitHashBucket(). This should speed up lookup and non-batched insertion operations on the CPU.
Performance
We see a roughly 3x speedup at small element counts, which drops to 2x at 100k-900k elements, and to 1.3-1.5x at 1M-9M elements.
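For readers unfamiliar with the technique named in the Summary, here is a rough sketch of how SSE2 can scan a group of metadata bytes in one step: compare all 16 bytes against a target with `_mm_cmpeq_epi8`, then collapse the result into a bitmask with `_mm_movemask_epi8`. Everything here is an assumption for illustration. `ExampleGroup`, `matchMask`, `firstEmptySlot` and `visitMatches` are hypothetical names, and the layout (16 one-byte slots, with 0 meaning empty) is not necessarily Axom's actual `GroupBucket` layout.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || \
  (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #include <emmintrin.h>  // SSE2 intrinsics
  #define EXAMPLE_USE_SSE2 1
#else
  #define EXAMPLE_USE_SSE2 0
#endif

// Hypothetical layout: 16 one-byte metadata slots, 0 marks an empty slot.
struct ExampleGroup
{
  std::uint8_t metadata[16];
};

// Bit i of the result is set when metadata[i] == value.
inline int matchMask(const ExampleGroup& g, std::uint8_t value)
{
#if EXAMPLE_USE_SSE2
  const __m128i slots =
    _mm_loadu_si128(reinterpret_cast<const __m128i*>(g.metadata));
  const __m128i needle = _mm_set1_epi8(static_cast<char>(value));
  return _mm_movemask_epi8(_mm_cmpeq_epi8(slots, needle));
#else
  int mask = 0;  // scalar fallback: test one slot at a time
  for(int i = 0; i < 16; ++i)
  {
    if(g.metadata[i] == value) mask |= (1 << i);
  }
  return mask;
#endif
}

// Index of the lowest set bit; `mask` must be nonzero.
inline int lowestSetBit(int mask)
{
  int index = 0;
  while((mask & 1) == 0)
  {
    mask >>= 1;
    ++index;
  }
  return index;
}

// Index of the first empty slot, or -1 if the group is full.
inline int firstEmptySlot(const ExampleGroup& g)
{
  const int mask = matchMask(g, 0);
  return mask == 0 ? -1 : lowestSetBit(mask);
}

// Calls visit(index) for every slot whose metadata equals reducedHash.
template <typename Visitor>
inline void visitMatches(const ExampleGroup& g,
                         std::uint8_t reducedHash,
                         Visitor&& visit)
{
  for(int mask = matchMask(g, reducedHash); mask != 0; mask &= mask - 1)
  {
    visit(lowestSetBit(mask));
  }
}
```

The appeal of this approach is that a single compare replaces up to 16 scalar byte tests. That would be consistent with the larger gains reported at small element counts, where memory latency is less likely to dominate, though the PR itself does not state the cause.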