
JIT: Accelerate Vector.Dot for all base types #111853

Merged 7 commits into dotnet:main from the vdot branch, Mar 11, 2025
Conversation

saucecontrol (Member) commented Jan 27, 2025

Resolves #85207

  • Replaces the SSE4.1 fallback for long vector multiply with a faster SSE2 version and removes the restrictions on the op_Multiply and MultiplyAddEstimate intrinsics, since these can now always be accelerated.
  • Removes the AVX2 requirement for Vector256.Sum to be treated as intrinsic (only AVX instructions are used).
  • Removes the restrictions on the byte and long types so that Dot can be treated as intrinsic for all base types (see the example below).
  • Adds Vector512.Dot as intrinsic.

Diffs look good. The only regressions are due to inlining or the slightly larger (but faster) SSE2 multiply code.
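
For illustration, a minimal sketch of the kind of call this now accelerates (the values are made up for the example; Vector128.Dot and Vector256.Sum are the actual APIs affected):

using System;
using System.Runtime.Intrinsics;

// Dot is now treated as intrinsic for every base type, including long and byte.
Vector128<long> a = Vector128.Create(1L, 2L);
Vector128<long> b = Vector128.Create(3L, 4L);
long dot = Vector128.Dot(a, b); // 1*3 + 2*4 = 11, computed in vector registers

// Vector256.Sum no longer requires AVX2 to be treated as intrinsic; AVX suffices.
double sum = Vector256.Sum(Vector256.Create(1.0, 2.0, 3.0, 4.0)); // 10

Console.WriteLine($"{dot} {sum}");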

saucecontrol marked this pull request as ready for review January 27, 2025 20:27

saucecontrol (Member, Author) commented Jan 27, 2025

I was looking at the SSE4.1/AVX2 fallback for long multiply, and I think we should replace it with the SSE2 one I added here (extended to AVX2 as well, of course).

The current fallback has only two multiplications, compared to three in the new one, but one of them is a pmulld, which is slow on Intel (10 cycles vs. 5 for pmuludq), and it also needs a phaddd and two pshufds, which can bottleneck on older Intel parts.
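
For reference, here's a minimal C# sketch of the SSE2-only decomposition (my own illustration using the Sse2 intrinsics; the LongMul/Multiply names are hypothetical, and this is not the JIT's literal expansion). Splitting each 64-bit lane into 32-bit halves hi:lo, a * b mod 2^64 = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32), where each 32x32->64 product maps to a pmuludq:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class LongMul
{
    // Three pmuludq, two shifts, and two adds: the sequence in the disasm below.
    public static Vector128<ulong> Multiply(Vector128<ulong> a, Vector128<ulong> b)
    {
        // pmuludq reads only the low 32 bits of each 64-bit lane:
        Vector128<ulong> low = Sse2.Multiply(a.AsUInt32(), b.AsUInt32());           // lo(a)*lo(b)

        Vector128<ulong> hiB = Sse2.ShiftRightLogical(b, 32);                       // psrlq
        Vector128<ulong> hiA = Sse2.ShiftRightLogical(a, 32);                       // psrlq
        Vector128<ulong> cross1 = Sse2.Multiply(a.AsUInt32(), hiB.AsUInt32());      // lo(a)*hi(b)
        Vector128<ulong> cross2 = Sse2.Multiply(hiA.AsUInt32(), b.AsUInt32());      // hi(a)*lo(b)

        // (cross1 + cross2) << 32 supplies the high-half contributions mod 2^64.
        Vector128<ulong> mid = Sse2.ShiftLeftLogical(Sse2.Add(cross1, cross2), 32); // paddq, psllq
        return Sse2.Add(low, mid);                                                  // paddq
    }
}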

Quick benchmark, tested on main vs. a local build with the SSE4.1 fallback removed:

using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

using BenchmarkDotNet.Attributes;

[SimpleJob, DisassemblyDiagnoser]
public unsafe class LongBench
{
    private const int nitems = 1 << 10;
    private long* data;

    [GlobalSetup]
    public void Setup()
    {
        const int len = sizeof(long) * nitems;
        data = (long*)NativeMemory.AlignedAlloc(len, 16);
        Random.Shared.NextBytes(new Span<byte>(data, len));
    }

    [Benchmark]
    public Vector128<long> Multiply()
    {
        long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
        var res = Vector128<long>.Zero;

        while (ptr < end)
        {
            res += Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
            ptr += Vector128<long>.Count;
        }

        return res;
    }
}

Skylake

| Method   | Job        | Toolchain                   |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|--------- |----------- |---------------------------- |---------:|--------:|--------:|------:|--------:|----------:|
| Multiply | Job-SDELDU | \core_root_main\corerun.exe | 618.0 ns | 9.43 ns | 8.36 ns |  1.00 |    0.02 |      82 B |
| Multiply | Job-UZJSFT | \core_root_vdot\corerun.exe | 486.0 ns | 7.40 ns | 6.18 ns |  0.79 |    0.01 |      89 B |

Meteor Lake

| Method   | Job        | Toolchain                   |     Mean |    Error |   StdDev | Ratio | RatioSD | Code Size |
|--------- |----------- |---------------------------- |---------:|---------:|---------:|------:|--------:|----------:|
| Multiply | Job-IBHRLY | \core_root_main\corerun.exe | 539.4 ns | 20.20 ns | 59.55 ns |  1.01 |    0.16 |      82 B |
| Multiply | Job-WVQFRW | \core_root_vdot\corerun.exe | 455.4 ns |  8.86 ns | 17.69 ns |  0.85 |    0.10 |      89 B |

Turns out the SSE2-only version is faster on AMD as well.

Zen 5

| Method   | Job        | Toolchain                   |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|--------- |----------- |---------------------------- |---------:|--------:|--------:|------:|--------:|----------:|
| Multiply | Job-NUXFIL | \core_root_main\corerun.exe | 309.7 ns | 0.99 ns | 0.93 ns |  1.00 |    0.00 |      82 B |
| Multiply | Job-PQZTMG | \core_root_vdot\corerun.exe | 243.9 ns | 0.83 ns | 0.78 ns |  0.79 |    0.00 |      89 B |

Here's the disasm for the SSE4.1 version:

; LongBench.Multiply()
       mov       rax,[rcx+8]
       lea       rcx,[rax+1FF0]
       xorps     xmm0,xmm0
       cmp       rax,rcx
       jae       short M00_L01
M00_L00:
       movdqa    xmm1,[rax]
       movdqa    xmm2,[rax+10]
       movaps    xmm3,xmm1
       pmuludq   xmm3,xmm2
       pshufd    xmm2,xmm2,0B1
       pmulld    xmm1,xmm2
       xorps     xmm2,xmm2
       phaddd    xmm1,xmm2
       pshufd    xmm1,xmm1,73
       paddq     xmm1,xmm3
       paddq     xmm0,xmm1
       add       rax,10
       cmp       rax,rcx
       jb        short M00_L00
M00_L01:
       movups    [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 82

And here's the SSE2 replacement:

; LongBench.Multiply()
       mov       rax,[rcx+8]
       lea       rcx,[rax+1FF0]
       xorps     xmm0,xmm0
       cmp       rax,rcx
       jae       short M00_L01
M00_L00:
       movdqa    xmm1,[rax]
       movdqa    xmm2,[rax+10]
       movaps    xmm3,xmm1
       pmuludq   xmm3,xmm2
       movaps    xmm4,xmm2
       psrlq     xmm4,20
       pmuludq   xmm4,xmm1
       psrlq     xmm1,20
       pmuludq   xmm1,xmm2
       paddq     xmm1,xmm4
       psllq     xmm1,20
       paddq     xmm1,xmm3
       paddq     xmm0,xmm1
       add       rax,10
       cmp       rax,rcx
       jb        short M00_L00
M00_L01:
       movups    [rdx],xmm0
       mov       rax,rdx
       ret
; Total bytes of code 89

saucecontrol (Member, Author) commented:
@EgorBot -amd -intel --envvars DOTNET_EnableAVX512F:0

using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Runtime.InteropServices;
using System.Runtime.CompilerServices;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

public unsafe class LongBench
{
    private const int nitems = 1 << 10;
    private long* data;

    [GlobalSetup]
    public void Setup()
    {
        const int len = sizeof(long) * nitems;
        data = (long*)NativeMemory.AlignedAlloc(len, 64);
        Random.Shared.NextBytes(new Span<byte>(data, len));
    }

    [Benchmark]
    public Vector128<long> Multiply128()
    {
        long* ptr = data, end = ptr + nitems - Vector128<long>.Count;
        var res = Vector128<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector128.LoadAligned(ptr) * Vector128.LoadAligned(ptr + Vector128<long>.Count);
            ptr += Vector128<long>.Count;
        }

        return res;
    }

    [Benchmark]
    public Vector256<long> Multiply256()
    {
        long* ptr = data, end = ptr + nitems - Vector256<long>.Count;
        var res = Vector256<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector256.LoadAligned(ptr) * Vector256.LoadAligned(ptr + Vector256<long>.Count);
            ptr += Vector256<long>.Count;
        }

        return res;
    }

    [Benchmark]
    public Vector<long> MultiplyVectorT()
    {
        long* ptr = data, end = ptr + nitems - Vector<long>.Count;
        var res = Vector<long>.Zero;

        while (ptr < end)
        {
            res ^= Vector.Load(ptr) * Vector.Load(ptr + Vector<long>.Count);
            ptr += Vector<long>.Count;
        }

        return res;
    }
}

saucecontrol (Member, Author) commented:
cc @EgorBo I believe you were the last to touch most of this

EgorBo self-requested a review March 10, 2025 19:43
EgorBo (Member) commented Mar 11, 2025

/azp run Fuzzlyn, runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-avx512

Azure Pipelines successfully started running 3 pipeline(s).

EgorBo (Member) left a comment:


LGTM, thanks!

EgorBo merged commit f565711 into dotnet:main Mar 11, 2025 (126 of 139 checks passed)

saucecontrol deleted the vdot branch March 11, 2025 20:59
Labels

area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
community-contribution: Indicates that the PR has been added by a community member
Development

Successfully merging this pull request may close: Finish Avx512 specific lightup for Vector128/256/512<T> (#85207)