auto vectorize for any cpu and platform (x86_64, aarch64 ...) instead of only avx2 #1591
Thanks for sharing this approach; it seems like an excellent improvement for more generic code. I will want to carefully examine performance impacts before implementing it, but thanks so much!
It simply cannot be slower than using pure integral types: as long as the whole calculation sequence uses proper vector types, all calculations will be done in the SIMD unit.
Yes, it is on Arm that I often see SWAR (SIMD within a register) being faster than SIMD intrinsics.
Moved comment from the PR, as it belongs here: in my production code I was hiding explicit intrinsic calls behind force-inlined functions written generically, with x86_64 and aarch64 variants. Examples:

```cpp
template<int... args, typename vector_type>
constexpr vector_type shuffle_vector(vector_type a, vector_type b) noexcept
{
#if defined(__clang__)
    return __builtin_shufflevector(a, b, args...);
#else
    using element_type = typename std::remove_reference<typename std::remove_cv<decltype(a[0])>::type>::type;
    return __builtin_shuffle(a, b, vector_type{static_cast<element_type>(args)...});
#endif
}
```

or

```cpp
#if defined(__ARM_NEON)
[[nodiscard, gnu::always_inline, gnu::const]]
inline float64x2_t max_pd(float64x2_t a, float64x2_t b) noexcept
{
    // vmaxq_f64 is the element-wise maximum, matching _mm_max_pd below
    // (vpmaxq_f64 would be the *pairwise* maximum across lanes).
    return vmaxq_f64(a, b);
}
#elif defined(__SSE2__)
using float64x2_t = __m128d;
[[nodiscard, gnu::always_inline, gnu::const]]
inline float64x2_t max_pd(float64x2_t a, float64x2_t b) noexcept
{
    return _mm_max_pd(a, b);
}
#else
using float64x2_t = double __attribute__((vector_size(16)));
[[nodiscard, gnu::always_inline, gnu::const]]
inline float64x2_t max_pd(float64x2_t a, float64x2_t b) noexcept
{
    // The original comment was truncated here; a plain per-lane fallback
    // is one possible completion.
    float64x2_t r;
    r[0] = a[0] > b[0] ? a[0] : b[0];
    r[1] = a[1] > b[1] ? a[1] : b[1];
    return r;
}
#endif
```

and that way I was able to avoid using direct intrinsics.
So I wanted to point out that normal production-code functions should always look like generic code, even when they use some intrinsics via wrappers. That way you avoid the ugly part you have with this AVX2 #ifdef code block.
Here you have a (probably invalid) quick proof of concept showing that using pure intrinsics is a waste of time with current compilers and that auto-vectorization works nicely.
For the last 15 years, in a large project, I was step by step removing intrinsics and replacing them with generic vector code, not introducing them.
All we need is to prepare special types (the int64x4_t declared in the example is suitable for x86_64, but for aarch64 I would declare a two-half type with operators):

```cpp
struct uint64x4_t { uint64x2_t low; uint64x2_t high; };
```

And the compiler does the magic. Of course some amendments are needed to be compatible not only with clang but also with gcc and msvc, but they are trivial.
For the example code with 32-byte vectors you get proper vectorization with the best possible instruction set for each target (znver5, penryn, etc.); see the godbolt link for details.