JIT: Accelerate Vector.Dot for all base types #111853
Conversation
I was just looking at the SSE4.1/AVX2 fallback for long multiply, and I think we should just replace it with the SSE2 one I added here (extended to AVX2 as well, of course). The current fallback only has two multiplications compared to three for the new one, but one of those is a pmulld, which is comparatively expensive.
Quick benchmark, tested on main vs. local with the SSE4.1 fallback removed:
[SimpleJob, DisassemblyDiagnoser]
public unsafe class LongBench
{
private const int nitems = 1 << 10;
private long* data;
[GlobalSetup]
public void Setup()
{
const int len = sizeof(long) * nitems;
data = (long*)NativeMemory.AlignedAlloc(len, 16);
Random.Shared.NextBytes(new Span<byte>(data, len));
}
[Benchmark]
public Vector128<long> Multiply()
{
long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
var res = Vector128<long>.Zero;
while (ptr < end)
{
res += Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
ptr += Vector128<long>.Count;
}
return res;
}
}
[Benchmark results: Skylake]
[Benchmark results: Meteor Lake]
Turns out the SSE2-only version is faster on AMD as well.
[Benchmark results: Zen 5]
Here's the disasm for SSE4.1:
; LongBench.Multiply()
mov rax,[rcx+8]
lea rcx,[rax+1FF0]
xorps xmm0,xmm0
cmp rax,rcx
jae short M00_L01
M00_L00:
movdqa xmm1,[rax]
movdqa xmm2,[rax+10]
movaps xmm3,xmm1
pmuludq xmm3,xmm2
pshufd xmm2,xmm2,0B1
pmulld xmm1,xmm2
xorps xmm2,xmm2
phaddd xmm1,xmm2
pshufd xmm1,xmm1,73
paddq xmm1,xmm3
paddq xmm0,xmm1
add rax,10
cmp rax,rcx
jb short M00_L00
M00_L01:
movups [rdx],xmm0
mov rax,rdx
ret
; Total bytes of code 82
And here's the SSE2 replacement:
; LongBench.Multiply()
mov rax,[rcx+8]
lea rcx,[rax+1FF0]
xorps xmm0,xmm0
cmp rax,rcx
jae short M00_L01
M00_L00:
movdqa xmm1,[rax]
movdqa xmm2,[rax+10]
movaps xmm3,xmm1
pmuludq xmm3,xmm2
movaps xmm4,xmm2
psrlq xmm4,20
pmuludq xmm4,xmm1
psrlq xmm1,20
pmuludq xmm1,xmm2
paddq xmm1,xmm4
psllq xmm1,20
paddq xmm1,xmm3
paddq xmm0,xmm1
add rax,10
cmp rax,rcx
jb short M00_L00
M00_L01:
movups [rdx],xmm0
mov rax,rdx
ret
; Total bytes of code 89
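For reference, here is a rough sketch of what the two instruction sequences compute, written with System.Runtime.Intrinsics. This is only an illustration of the 64-bit multiply decomposition (a*b = lo*lo + ((lo*hi + hi*lo) << 32)), not the JIT's actual expansion, and the helper names are made up:
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
static class Mul64Sketch
{
    // Shape of the current SSE4.1 fallback: pmuludq for the low halves, then pmulld +
    // phaddd to form the cross products, shuffled into the high dwords with pshufd.
    public static Vector128<ulong> MultiplySse41(Vector128<ulong> a, Vector128<ulong> b)
    {
        Vector128<ulong> low = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());                // lo(a) * lo(b)
        Vector128<uint> bSwapped = Sse2.Shuffle(b.AsUInt32(), 0b10_11_00_01);            // swap hi/lo dwords of b (pshufd 0xB1)
        Vector128<uint> cross = Sse41.MultiplyLow(a.AsUInt32(), bSwapped);               // lo(a)*hi(b), hi(a)*lo(b) per lane (pmulld)
        Vector128<int> sums = Ssse3.HorizontalAdd(cross.AsInt32(), Vector128<int>.Zero); // lo*hi + hi*lo per lane (phaddd)
        Vector128<uint> high = Sse2.Shuffle(sums.AsUInt32(), 0b01_11_00_11);             // place the sums in the high dwords (pshufd 0x73)
        return Sse2.Add(low, high.AsUInt64());
    }
    // Shape of the proposed SSE2-only replacement: three pmuludq plus shifts and adds,
    // with no pmulld or phaddd.
    public static Vector128<ulong> MultiplySse2(Vector128<ulong> a, Vector128<ulong> b)
    {
        Vector128<ulong> low = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());                                    // lo(a) * lo(b)
        Vector128<ulong> t1 = Sse2.Multiply(Sse2.ShiftRightLogical(b, 32).AsUInt32(), a.AsUInt32());         // lo(a) * hi(b)
        Vector128<ulong> t2 = Sse2.Multiply(Sse2.ShiftRightLogical(a, 32).AsUInt32(), b.AsUInt32());         // hi(a) * lo(b)
        Vector128<ulong> mid = Sse2.ShiftLeftLogical(Sse2.Add(t1, t2), 32);                                  // cross sum << 32
        return Sse2.Add(low, mid);
    }
}
The low 64 bits of the product are the same for signed and unsigned operands, so the ulong sketch covers the long case too; the SSE2-only version trades one extra pmuludq for avoiding pmulld and phaddd.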
@EgorBot -amd -intel --envvars DOTNET_EnableAVX512F:0
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Runtime.InteropServices;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
public unsafe class LongBench
{
private const int nitems = 1 << 10;
private long* data;
[GlobalSetup]
public void Setup()
{
const int len = sizeof(long) * nitems;
data = (long*)NativeMemory.AlignedAlloc(len, 64);
Random.Shared.NextBytes(new Span<byte>(data, len));
}
[Benchmark]
public Vector128<long> Multiply128()
{
long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
var res = Vector128<long>.Zero;
while (ptr < end)
{
res ^= Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
ptr += Vector128<long>.Count;
}
return res;
}
[Benchmark]
public Vector256<long> Multiply256()
{
long* ptr = data, end = ptr + nitems - Vector256<long>.Count;
var res = Vector256<long>.Zero;
while (ptr < end)
{
res ^= Vector256.LoadAligned(ptr) * Vector256.LoadAligned(ptr + Vector256<long>.Count);
ptr += Vector256<long>.Count;
}
return res;
}
[Benchmark]
public Vector<long> MultiplyVectorT()
{
long* ptr = data, end = ptr + nitems - Vector<long>.Count;
var res = Vector<long>.Zero;
while (ptr < end)
{
res ^= Vector.Load(ptr) * Vector.Load(ptr + Vector<long>.Count);
ptr += Vector<long>.Count;
}
return res;
}
}
cc @EgorBo I believe you were the last to touch most of this
/azp run Fuzzlyn, runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-avx512
Azure Pipelines successfully started running 3 pipeline(s).
LGTM, thanks!
Resolves #85207
- Removes the fallback expansions for the op_Multiply and MultiplyAddEstimate intrinsics, since these can always be accelerated now.
- Allows Vector256.Sum to be treated as intrinsic (only AVX instructions are used), so Dot can be treated as intrinsic for all types.
- Treats Vector512.Dot as intrinsic.
Diffs look good. The only regressions are due to inlining or the slightly larger (but faster) SSE2 multiply code.
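As a usage illustration (not code from this PR), this is the kind of pattern that benefits: a dot product over long elements written against Vector256.Dot, which per the description above can now be treated as intrinsic. The helper below is hypothetical.
using System;
using System.Runtime.Intrinsics;
static class DotSketch
{
    // Hypothetical helper: sums a[i] * b[i] with wrapping arithmetic, using
    // Vector256.Dot for the bulk of the work and a scalar loop for the tail.
    public static long Dot(ReadOnlySpan<long> a, ReadOnlySpan<long> b)
    {
        long sum = 0;
        int i = 0;
        if (Vector256.IsHardwareAccelerated)
        {
            for (; i <= a.Length - Vector256<long>.Count; i += Vector256<long>.Count)
            {
                sum += Vector256.Dot(Vector256.Create(a.Slice(i)),
                                     Vector256.Create(b.Slice(i)));
            }
        }
        for (; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }
}
The same Dot API shape also exists on Vector128, Vector512, and Vector<T>.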