[JIT] [APX] Enable additional General Purpose Registers. #108799

Draft
wants to merge 17 commits into
base: main

DeepakRajendrakumaran
Contributor

This PR is built on top of #108796

What this PR does

  1. Add eGPRs to the set of available registers on x64 in the JIT, plus related changes to turn them on or off based on APX availability.
    Link to related commit
  2. Add an LSRA_LIMIT_EXT_GPR_SET register stress mode that forces eGPR usage when possible.
    Link to related commit
  3. Some minor changes to turn on Rex2 encoding with eGPRs.
    Link to related commit
  4. Temporary changes to mask away eGPRs for currently unsupported instructions, primarily those requiring eEVEX, plus imul and movszx. (This commit will be removed once these are supported, but it was essential for testing.)
    Link to related commit
  5. Minor flags to get altjit to work (need to check whether this conflicts with Ruihan's changes).
    Link to related commit
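
The gating described in items 1, 2 and 4 can be pictured with a small bitmask model. This is only an illustrative sketch, not the JIT's code: the names `LOW_GPR_MASK`, `EXT_GPR_MASK`, `available_gprs`, `stress_limit_to_egpr` and `strip_egpr` are hypothetical, and the model assumes one bit per register (bit i = register i) without excluding non-allocatable registers such as rsp.

```python
# Hypothetical model: one bit per general-purpose register, bit i = register i.
LOW_GPR_MASK = (1 << 16) - 1                    # registers 0-15 (rax..r15)
EXT_GPR_MASK = ((1 << 32) - 1) & ~LOW_GPR_MASK  # registers 16-31 (r16..r31, the eGPRs)


def available_gprs(apx_available: bool) -> int:
    """Item 1: eGPRs join the available set only when APX is available."""
    mask = LOW_GPR_MASK
    if apx_available:
        mask |= EXT_GPR_MASK
    return mask


def stress_limit_to_egpr(candidates: int, apx_available: bool) -> int:
    """Item 2: a stress mode that prefers eGPRs 'when possible'.

    If no eGPR is among the candidates, the original set is returned
    unchanged so that allocation can still succeed.
    """
    if not apx_available:
        return candidates
    limited = candidates & EXT_GPR_MASK
    return limited if limited else candidates


def strip_egpr(candidates: int) -> int:
    """Item 4: mask eGPRs away for an instruction that cannot encode them yet."""
    return candidates & ~EXT_GPR_MASK


if __name__ == "__main__":
    print(hex(available_gprs(apx_available=True)))
    print(hex(stress_limit_to_egpr(0x00030003, apx_available=True)))
    print(hex(strip_egpr(0x00030003)))
```

The fallback in `stress_limit_to_egpr` is an assumption about what "when possible" means; the actual stress mode may behave differently.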

Testing

  • Ran tests under SDE (specifically src/tests/JIT) using Ruihan's script
  • Ran SuperPMI for src/tests/JIT using the altjit feature

Analysis of superpmi results

Summary from JitAnalyze


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 823288813
Total bytes of diff: 823634058
Total bytes of delta: 345245 (0.04 % of base)
Total relative delta: NaN
    diff is a regression.
    relative diff is a regression.
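
As a quick sanity check, the headline delta and percentage can be re-derived from the two totals above. This is a standalone sketch, not how JitAnalyze computes its summary; the helper name `code_size_delta` is made up for illustration.

```python
def code_size_delta(base_bytes: int, diff_bytes: int) -> tuple[int, float]:
    """Return (delta in bytes, delta as a percentage of base)."""
    delta = diff_bytes - base_bytes
    pct_of_base = 100.0 * delta / base_bytes
    return delta, pct_of_base


if __name__ == "__main__":
    # Totals copied from the JitAnalyze summary above.
    delta, pct = code_size_delta(823288813, 823634058)
    print(f"delta = {delta} bytes ({pct:.2f} % of base)")
```

A positive delta means the diff build emits more code than the base, which is why the summary labels it a regression.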
Detail diffs


Top file regressions (bytes):
       98472 : JIT\Methodical\Methodical_do\Methodical_do.dasm (3.18% of base)
       98472 : JIT\Methodical\Methodical_ro\Methodical_ro.dasm (3.18% of base)
       87180 : JIT\Methodical\Methodical_d1\Methodical_d1.dasm (3.77% of base)
       87180 : JIT\Methodical\Methodical_r1\Methodical_r1.dasm (3.75% of base)
       11382 : JIT\Methodical\Methodical_r2\Methodical_r2.dasm (0.87% of base)
       11382 : JIT\Methodical\Methodical_d2\Methodical_d2.dasm (0.89% of base)
        3422 : JIT\HardwareIntrinsics\Arm\Sve\Sve_ro\Sve_ro.dasm (0.05% of base)
        1599 : JIT\HardwareIntrinsics\Arm\AdvSimd\AdvSimd_ro\AdvSimd_ro.dasm (0.02% of base)
        1271 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx512_ro\X86_Avx512F_ro.dasm (0.04% of base)
        1271 : JIT\HardwareIntrinsics\X86_Avx512\Avx512F\Avx512F_ro\X86_Avx512F_ro.dasm (0.04% of base)
        1205 : JIT\HardwareIntrinsics\X86\Sse2\Sse2_ro\X86_Sse2_ro.dasm (0.06% of base)
        1205 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_ro\X86_Sse2_ro.dasm (0.06% of base)
        1166 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx10v1_ro\X86_Avx10v1_Vector128_ro.dasm (0.09% of base)
        1166 : JIT\HardwareIntrinsics\X86_Avx10v1\Avx10v1_Vector128\Avx10v1_Vector128_ro\X86_Avx10v1_Vector128_ro.dasm (0.09% of base)
        1091 : JIT\HardwareIntrinsics\X86_Avx\Avx2\Avx2_ro\X86_Avx2_ro.dasm (0.05% of base)
        1091 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx_ro\X86_Avx2_ro.dasm (0.05% of base)
        1091 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx_r\X86_Avx2_ro.dasm (0.05% of base)
        1016 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx10v1_ro\X86_Avx10v1_Vector256_ro.dasm (0.08% of base)
        1016 : JIT\HardwareIntrinsics\X86_Avx10v1\Avx10v1_Vector256\Avx10v1_Vector256_ro\X86_Avx10v1_Vector256_ro.dasm (0.08% of base)
         879 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_Avx_r\X86_Avx1_ro.dasm (0.07% of base)

Top file improvements (bytes):
       -9932 : JIT\HardwareIntrinsics\HardwareIntrinsics_General_ro\Vector512_1_ro.dasm (-0.40% of base)
       -9932 : JIT\HardwareIntrinsics\General\Vector512_1\Vector512_1_ro\Vector512_1_ro.dasm (-0.40% of base)
       -9435 : JIT\HardwareIntrinsics\HardwareIntrinsics_General_ro\Vector512_ro.dasm (-0.23% of base)
       -9435 : JIT\HardwareIntrinsics\General\Vector512\Vector512_ro\Vector512_ro.dasm (-0.23% of base)
       -7973 : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_ro\X86_Sse2_handwritten_ro.dasm (-1.92% of base)
       -7973 : JIT\HardwareIntrinsics\X86\Sse2\Sse2_handwritten_ro\X86_Sse2_handwritten_ro.dasm (-1.92% of base)
       -4763 : JIT\Regression\JitBlue\GitHub_17777\GitHub_17777\GitHub_17777.dasm (-1.32% of base)
       -2935 : JIT\HardwareIntrinsics\HardwareIntrinsics_General_ro\Vector128_ro.dasm (-0.09% of base)
       -2935 : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm (-0.09% of base)
       -2496 : JIT\HardwareIntrinsics\HardwareIntrinsics_General_ro\Vector64_ro.dasm (-0.07% of base)
       -2496 : JIT\HardwareIntrinsics\General\Vector64\Vector64_ro\Vector64_ro.dasm (-0.07% of base)
       -2311 : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm (-0.07% of base)
       -2311 : JIT\HardwareIntrinsics\HardwareIntrinsics_General_ro\Vector256_ro.dasm (-0.07% of base)
       -1088 : JIT\Methodical\Arrays\huge\huge_b_r\huge_b_r.dasm (-19.19% of base)
       -1088 : JIT\Methodical\Arrays\huge\huge_i4_r\huge_i4_r.dasm (-18.74% of base)
       -1088 : JIT\Methodical\Arrays\huge\huge_r4_r\huge_r4_r.dasm (-18.14% of base)
       -1088 : JIT\Methodical\Arrays\huge\huge_r8_r\huge_r8_r.dasm (-18.32% of base)
       -1088 : JIT\Methodical\Methodical_r1\huge_i4_r.dasm (-18.74% of base)
       -1088 : JIT\Methodical\Methodical_r1\huge_r4_r.dasm (-18.14% of base)
       -1088 : JIT\Methodical\Methodical_r1\huge_r8_r.dasm (-18.32% of base)

852 total files with Code Size differences (180 improved, 672 regressed), 4485 unchanged.

Top method regressions (bytes):
        9850 ( 9.50% of base) : JIT\Methodical\Methodical_do\Methodical_do.dasm - i4rem:TestEntryPoint():int (FullOpts)
        9850 ( 9.50% of base) : JIT\Methodical\Methodical_d1\Methodical_d1.dasm - i4rem:TestEntryPoint():int (FullOpts)
        9850 ( 9.50% of base) : JIT\Methodical\Methodical_ro\Methodical_ro.dasm - i4rem:TestEntryPoint():int (FullOpts)
        9850 ( 9.50% of base) : JIT\Methodical\Methodical_r1\Methodical_r1.dasm - i4rem:TestEntryPoint():int (FullOpts)
        8982 ( 8.50% of base) : JIT\Methodical\Methodical_do\Methodical_do.dasm - i8rem:TestEntryPoint():int (FullOpts)
        8982 ( 8.50% of base) : JIT\Methodical\Methodical_d1\Methodical_d1.dasm - i8rem:TestEntryPoint():int (FullOpts)
        8982 ( 8.50% of base) : JIT\Methodical\Methodical_ro\Methodical_ro.dasm - i8rem:TestEntryPoint():int (FullOpts)
        8982 ( 8.50% of base) : JIT\Methodical\Methodical_r1\Methodical_r1.dasm - i8rem:TestEntryPoint():int (FullOpts)
        8133 ( 7.81% of base) : JIT\Methodical\Methodical_do\Methodical_do.dasm - u4div:TestEntryPoint():int (FullOpts)
        8133 ( 7.81% of base) : JIT\Methodical\Methodical_d1\Methodical_d1.dasm - u4div:TestEntryPoint():int (FullOpts)
        8133 ( 7.81% of base) : JIT\Methodical\Methodical_ro\Methodical_ro.dasm - u4div:TestEntryPoint():int (FullOpts)
        8133 ( 7.81% of base) : JIT\Methodical\Methodical_r1\Methodical_r1.dasm - u4div:TestEntryPoint():int (FullOpts)
        8034 ( 7.99% of base) : JIT\Methodical\Methodical_do\Methodical_do.dasm - i4div:TestEntryPoint():int (FullOpts)
        8034 ( 7.99% of base) : JIT\Methodical\Methodical_d1\Methodical_d1.dasm - i4div:TestEntryPoint():int (FullOpts)
        8034 ( 7.99% of base) : JIT\Methodical\Methodical_ro\Methodical_ro.dasm - i4div:TestEntryPoint():int (FullOpts)
        8034 ( 7.99% of base) : JIT\Methodical\Methodical_r1\Methodical_r1.dasm - i4div:TestEntryPoint():int (FullOpts)
        8010 ( 8.28% of base) : JIT\Methodical\Methodical_do\Methodical_do.dasm - r8div:TestEntryPoint():int (FullOpts)
        8010 ( 8.28% of base) : JIT\Methodical\Methodical_d1\Methodical_d1.dasm - r8div:TestEntryPoint():int (FullOpts)
        8010 ( 8.28% of base) : JIT\Methodical\Methodical_ro\Methodical_ro.dasm - r8div:TestEntryPoint():int (FullOpts)
        8010 ( 8.28% of base) : JIT\Methodical\Methodical_r1\Methodical_r1.dasm - r8div:TestEntryPoint():int (FullOpts)

Top method improvements (bytes):
       -4763 (-1.33% of base) : JIT\Regression\JitBlue\GitHub_17777\GitHub_17777\GitHub_17777.dasm - Repro.Program:Test(int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int):int (FullOpts)
       -1088 (-19.19% of base) : JIT\Methodical\Arrays\huge\huge_b_r\huge_b_r.dasm - JitTest_huge_b_huge_il.Test:Main():int (FullOpts)
       -1088 (-19.19% of base) : JIT\Methodical\Methodical_r1\huge_b_r.dasm - JitTest_huge_b_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.74% of base) : JIT\Methodical\Arrays\huge\huge_i4_r\huge_i4_r.dasm - JitTest_huge_i4_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.74% of base) : JIT\Methodical\Methodical_r1\huge_i4_r.dasm - JitTest_huge_i4_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.14% of base) : JIT\Methodical\Arrays\huge\huge_r4_r\huge_r4_r.dasm - JitTest_huge_r4_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.14% of base) : JIT\Methodical\Methodical_r1\huge_r4_r.dasm - JitTest_huge_r4_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.32% of base) : JIT\Methodical\Arrays\huge\huge_r8_r\huge_r8_r.dasm - JitTest_huge_r8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.32% of base) : JIT\Methodical\Methodical_r1\huge_r8_r.dasm - JitTest_huge_r8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.59% of base) : JIT\Methodical\Methodical_r1\huge_u8_r.dasm - JitTest_huge_u8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.59% of base) : JIT\Methodical\Arrays\huge\huge_u8_r\huge_u8_r.dasm - JitTest_huge_u8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.29% of base) : JIT\Methodical\Methodical_r2\hugedim_r.dasm - JitTest_hugedim_arrays_il.Test:Main():int (FullOpts)
       -1088 (-18.29% of base) : JIT\Methodical\int64\arrays\hugedim_r\hugedim_r.dasm - JitTest_hugedim_arrays_il.Test:Main():int (FullOpts)
        -781 (-11.87% of base) : JIT\Methodical\VT\port\huge_gcref_r\huge_gcref_r.dasm - JitTest_huge_gcref_port_il.Test:Main():int (FullOpts)
        -781 (-11.87% of base) : JIT\Methodical\Methodical_r2\huge_gcref_r.dasm - JitTest_huge_gcref_port_il.Test:Main():int (FullOpts)
        -781 (-11.87% of base) : JIT\Methodical\Methodical_r1\huge_struct_r.dasm - JitTest_huge_struct_huge_il.Test:Main():int (FullOpts)
        -781 (-11.87% of base) : JIT\Methodical\Arrays\huge\huge_struct_r\huge_struct_r.dasm - JitTest_huge_struct_huge_il.Test:Main():int (FullOpts)
        -749 (-11.58% of base) : JIT\Methodical\Arrays\huge\huge_objref_r\huge_objref_r.dasm - JitTest_huge_objref_huge_il.Test:Main():int (FullOpts)
        -749 (-11.58% of base) : JIT\Methodical\Methodical_r1\huge_objref_r.dasm - JitTest_huge_objref_huge_il.Test:Main():int (FullOpts)
        -361 (-11.14% of base) : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_ro\X86_Sse2_handwritten_ro.dasm - IntelHardwareIntrinsicTest.SSE2.TestTableSse2`2[long,long]:CheckUnpack(IntelHardwareIntrinsicTest.SSE2.CheckMethodSixteenOfAll`2[long,long]):ubyte:this (FullOpts)

Top method regressions (percentages):
         362 (27.38% of base) : JIT\HardwareIntrinsics\Arm\AdvSimd\AdvSimd_ro\AdvSimd_ro.dasm - JIT.HardwareIntrinsics.Arm._AdvSimd.VectorLookupExtension_4Test__VectorTableLookupExtensionByte:.ctor():this (FullOpts)
         362 (27.30% of base) : JIT\HardwareIntrinsics\Arm\AdvSimd.Arm64\AdvSimd.Arm64_ro\AdvSimd.Arm64_ro.dasm - JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_4Test__VectorTableLookupExtensionByte:.ctor():this (FullOpts)
         362 (27.12% of base) : JIT\HardwareIntrinsics\Arm\AdvSimd\AdvSimd_ro\AdvSimd_ro.dasm - JIT.HardwareIntrinsics.Arm._AdvSimd.VectorLookupExtension_4Test__VectorTableLookupExtensionSByte:.ctor():this (FullOpts)
         362 (27.04% of base) : JIT\HardwareIntrinsics\Arm\AdvSimd.Arm64\AdvSimd.Arm64_ro\AdvSimd.Arm64_ro.dasm - JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_4Test__VectorTableLookupExtensionSByte:.ctor():this (FullOpts)
         357 (14.89% of base) : JIT\Methodical\Methodical_others\Methodical_others.dasm - Test_baduwinfo1:bar(System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String):int (FullOpts)
          54 (12.68% of base) : JIT\Methodical\Methodical_others\Methodical_others.dasm - structinreg.Program2:test23(structinreg.Test23):int (FullOpts)
         137 (12.24% of base) : JIT\superpmi\superpmicollect\Bytemark\Bytemark.dasm - IDEAEncryption:cipher_idea(ubyte[],ubyte[],int,ushort[]) (FullOpts)
         137 (12.24% of base) : JIT\Performance\CodeQuality\Bytemark\Bytemark\Bytemark.dasm - IDEAEncryption:cipher_idea(ubyte[],ubyte[],int,ushort[]) (FullOpts)
          23 (11.17% of base) : JIT\Performance\JIT.performance\fannkuch-redux-9.dasm - BenchmarksGame.FannkuchRedux_9:FirstPermutation(ulong,ulong,ulong,int,int) (FullOpts)
          23 (11.17% of base) : JIT\Performance\CodeQuality\BenchmarksGame\fannkuch-redux\fannkuch-redux-9\fannkuch-redux-9.dasm - BenchmarksGame.FannkuchRedux_9:FirstPermutation(ulong,ulong,ulong,int,int) (FullOpts)
          20 (10.31% of base) : JIT\Directed\tailcall\more_tailcalls\more_tailcalls.dasm - Program:IL_STUB_InstantiatingStub(System.Object,System.Object,System.Object,System.Object,System.Object,System.Object,System.Object,System.Object,int,int,System.Span`1[int],int):int (FullOpts)
        1595 ( 9.63% of base) : JIT\Methodical\Methodical_r2\Methodical_r2.dasm - r4NaNsub:TestEntryPoint():int (FullOpts)
        1595 ( 9.63% of base) : JIT\Methodical\Methodical_do\Methodical_do.dasm - r4NaNsub:TestEntryPoint():int (FullOpts)
        1595 ( 9.63% of base) : JIT\Methodical\Methodical_ro\Methodical_ro.dasm - r4NaNsub:TestEntryPoint():int (FullOpts)
        1595 ( 9.63% of base) : JIT\Methodical\Methodical_d2\Methodical_d2.dasm - r4NaNsub:TestEntryPoint():int (FullOpts)
         106 ( 9.52% of base) : JIT\Directed\array-il\_Arrayscomplex3\_Arrayscomplex3.dasm - Complex2_Array_Test:Main():int (FullOpts)
         106 ( 9.52% of base) : JIT\Directed\Directed_3\_Arrayscomplex3.dasm - Complex2_Array_Test:Main():int (FullOpts)
        9850 ( 9.50% of base) : JIT\Methodical\Methodical_do\Methodical_do.dasm - i4rem:TestEntryPoint():int (FullOpts)
        9850 ( 9.50% of base) : JIT\Methodical\Methodical_d1\Methodical_d1.dasm - i4rem:TestEntryPoint():int (FullOpts)
        9850 ( 9.50% of base) : JIT\Methodical\Methodical_ro\Methodical_ro.dasm - i4rem:TestEntryPoint():int (FullOpts)

Top method improvements (percentages):
       -1088 (-19.19% of base) : JIT\Methodical\Arrays\huge\huge_b_r\huge_b_r.dasm - JitTest_huge_b_huge_il.Test:Main():int (FullOpts)
       -1088 (-19.19% of base) : JIT\Methodical\Methodical_r1\huge_b_r.dasm - JitTest_huge_b_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.74% of base) : JIT\Methodical\Arrays\huge\huge_i4_r\huge_i4_r.dasm - JitTest_huge_i4_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.74% of base) : JIT\Methodical\Methodical_r1\huge_i4_r.dasm - JitTest_huge_i4_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.59% of base) : JIT\Methodical\Methodical_r1\huge_u8_r.dasm - JitTest_huge_u8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.59% of base) : JIT\Methodical\Arrays\huge\huge_u8_r\huge_u8_r.dasm - JitTest_huge_u8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.32% of base) : JIT\Methodical\Arrays\huge\huge_r8_r\huge_r8_r.dasm - JitTest_huge_r8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.32% of base) : JIT\Methodical\Methodical_r1\huge_r8_r.dasm - JitTest_huge_r8_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.29% of base) : JIT\Methodical\Methodical_r2\hugedim_r.dasm - JitTest_hugedim_arrays_il.Test:Main():int (FullOpts)
       -1088 (-18.29% of base) : JIT\Methodical\int64\arrays\hugedim_r\hugedim_r.dasm - JitTest_hugedim_arrays_il.Test:Main():int (FullOpts)
       -1088 (-18.14% of base) : JIT\Methodical\Arrays\huge\huge_r4_r\huge_r4_r.dasm - JitTest_huge_r4_huge_il.Test:Main():int (FullOpts)
       -1088 (-18.14% of base) : JIT\Methodical\Methodical_r1\huge_r4_r.dasm - JitTest_huge_r4_huge_il.Test:Main():int (FullOpts)
        -112 (-16.26% of base) : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_ro\X86_Sse2_handwritten_ro.dasm - IntelHardwareIntrinsicTest.SSE2.TestTableSse2`2[ubyte,long]:GetEightOneDataPoint(int):System.ValueTuple`4[System.ValueTuple`8[ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,System.ValueTuple`1[ubyte]],System.ValueTuple`8[ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,System.ValueTuple`1[ubyte]],long,long]:this (FullOpts)
        -112 (-16.26% of base) : JIT\HardwareIntrinsics\X86\Sse2\Sse2_handwritten_ro\X86_Sse2_handwritten_ro.dasm - IntelHardwareIntrinsicTest.SSE2.TestTableSse2`2[ubyte,long]:GetEightOneDataPoint(int):System.ValueTuple`4[System.ValueTuple`8[ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,System.ValueTuple`1[ubyte]],System.ValueTuple`8[ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,System.ValueTuple`1[ubyte]],long,long]:this (FullOpts)
        -112 (-15.80% of base) : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_ro\X86_Sse2_handwritten_ro.dasm - IntelHardwareIntrinsicTest.SSE2.TestTableSse2`2[short,long]:GetEightOneDataPoint(int):System.ValueTuple`4[System.ValueTuple`8[short,short,short,short,short,short,short,System.ValueTuple`1[short]],System.ValueTuple`8[short,short,short,short,short,short,short,System.ValueTuple`1[short]],long,long]:this (FullOpts)
        -112 (-15.80% of base) : JIT\HardwareIntrinsics\X86\Sse2\Sse2_handwritten_ro\X86_Sse2_handwritten_ro.dasm - IntelHardwareIntrinsicTest.SSE2.TestTableSse2`2[short,long]:GetEightOneDataPoint(int):System.ValueTuple`4[System.ValueTuple`8[short,short,short,short,short,short,short,System.ValueTuple`1[short]],System.ValueTuple`8[short,short,short,short,short,short,short,System.ValueTuple`1[short]],long,long]:this (FullOpts)
        -297 (-15.66% of base) : JIT\Performance\JIT.performance\MDMulMatrix.dasm - Benchstone.MDBenchI.MDMulMatrix:Inner(int[,],int[,],int[,]) (FullOpts)
        -297 (-15.66% of base) : JIT\Performance\CodeQuality\Benchstones\MDBenchI\MDMulMatrix\MDMulMatrix\MDMulMatrix.dasm - Benchstone.MDBenchI.MDMulMatrix:Inner(int[,],int[,],int[,]) (FullOpts)
        -349 (-14.64% of base) : JIT\HardwareIntrinsics\HardwareIntrinsics_X86_ro\X86_Sse2_handwritten_ro.dasm - IntelHardwareIntrinsicTest.SSE2.TestTableSse2`2[ubyte,long]:CheckPackSaturate(IntelHardwareIntrinsicTest.SSE2.CheckMethodSixteen`2[ubyte,long]):ubyte:this (FullOpts)
        -349 (-14.64% of base) : JIT\HardwareIntrinsics\X86\Sse2\Sse2_handwritten_ro\X86_Sse2_handwritten_ro.dasm - IntelHardwareIntrinsicTest.SSE2.TestTableSse2`2[ubyte,long]:CheckPackSaturate(IntelHardwareIntrinsicTest.SSE2.CheckMethodSixteen`2[ubyte,long]):ubyte:this (FullOpts)

15016 total methods with Code Size differences (5105 improved, 9911 regressed), 1031138 unchanged.


Why do Arm tests such as \HardwareIntrinsics\Arm\AdvSimd\AdvSimd_ro\AdvSimd_ro.dasm show up here and generate x86 code? My theory is that, since we are compiling for x86 using altjit, the JIT takes the software fallback path for Arm intrinsics and generates x86 code. See how IsSupported() evaluates to false below.

image

I'm ignoring these for now

Some interesting code samples highlighting changes introduced due to enabling additional GPRs

Case 1

A very simple case in which r16 is used.

See V03 loc0.
In this case, a spill is eliminated and the instruction count drops from 20 to 16. The cost of using this eGPR is a slightly larger encoding due to the Rex2 prefix. We do not currently factor this cost into register allocation.
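
To make the encoding cost concrete, here is a deliberately simplified model of how many prefix bytes an instruction needs depending on the highest register number it touches. It relies on two facts about the encodings: a REX prefix is one byte, and a REX2 prefix is two bytes (0xD5 followed by a payload byte). Everything else is an assumption for illustration: the function name `prefix_bytes` is hypothetical, and the model ignores byte-register special cases, VEX/EVEX-encoded instructions, and the rest of the instruction bytes.

```python
def prefix_bytes(reg_numbers, operand_is_64bit=False):
    """Return the REX/REX2 prefix size in bytes under a simplified model.

    reg_numbers: hardware register numbers used by the instruction
                 (0-7 legacy, 8-15 need REX, 16-31 need REX2).
    """
    highest = max(reg_numbers, default=0)
    if highest >= 16:
        return 2  # REX2: 0xD5 plus one payload byte
    if highest >= 8 or operand_is_64bit:
        return 1  # REX: a single byte in the range 0x40..0x4F
    return 0      # no REX-family prefix needed


if __name__ == "__main__":
    # rax = 0, r8 = 8, r16 = 16 in hardware register numbering.
    print(prefix_bytes([0, 8]))   # e.g. test eax, r8d
    print(prefix_bytes([16, 8]))  # e.g. the same test with r16 instead of eax
```

Under this model, moving one operand from a legacy register to an eGPR costs one or two extra bytes per instruction, which is the trade-off against the removed spill code.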

<details>
<summary><span style="color:green">-3</span> (<span style="color:green">-4.92%</span>) : 9473.dasm - System.Threading.Tasks.Task:AtomicStateUpdate(int,int):ubyte:this (Tier1)</summary>
<div style="margin-left:1em">

```diff
@@ -1,3 +1,5 @@
+
+ Deepak methName = AtomicStateUpdate 
 ; Assembly listing for method System.Threading.Tasks.Task:AtomicStateUpdate(int,int):ubyte:this (Tier1)
 ; Emitting BLENDED_CODE for X64 with AVX512 - Windows
 ; Tier1 code
@@ -11,49 +13,45 @@
 ;  V00 this         [V00,T00] (  5,  4   )     ref  ->  rcx         this class-hnd single-def <System.Threading.Tasks.Task>
 ;  V01 arg1         [V01,T02] (  4,  3   )     int  ->  rdx         single-def
 ;  V02 arg2         [V02,T03] (  4,  3   )     int  ->   r8         single-def
-;  V03 loc0         [V03,T01] (  5,  5   )     int  ->  [rsp+0x04]  spill-single-def
+;  V03 loc0         [V03,T01] (  5,  5   )     int  ->  r16        
 ;# V04 OutArgs      [V04    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;
-; Lcl frame size = 8
+; Lcl frame size = 0
 
 G_M2073_IG01:        ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
-       push     rax
-						;; size=1 bbWeight=1 PerfScore 1.00
+						;; size=0 bbWeight=1 PerfScore 0.00
 G_M2073_IG02:        ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byref, isz
        ; gcrRegs +[rcx]
-       mov      eax, dword ptr [rcx+0x34]
-       mov      dword ptr [rsp+0x04], eax
-       test     eax, r8d
+       mov      r16, dword ptr [rcx+0x34]
+       test     r16, r8d
        jne      SHORT G_M2073_IG05
-       lea      r10, bword ptr [rcx+0x34]
-       ; byrRegs +[r10]
-       mov      r9d, eax
-       or       r9d, edx
+       lea      r17, bword ptr [rcx+0x34]
+       ; byrRegs +[r17]
+       mov      r18, r16
+       or       r18, edx
+       mov      eax, r16
        lock     
-       cmpxchg  dword ptr [r10], r9d
-       cmp      eax, dword ptr [rsp+0x04]
+       cmpxchg  dword ptr [r17], r18
+       cmp      eax, r16
        jne      SHORT G_M2073_IG04
        mov      eax, 1
-						;; size=38 bbWeight=1 PerfScore 26.50
+						;; size=46 bbWeight=1 PerfScore 24.00
 G_M2073_IG03:        ; bbWeight=1, epilog, nogc, extend
-       add      rsp, 8
        ret      
-						;; size=5 bbWeight=1 PerfScore 1.25
+						;; size=1 bbWeight=1 PerfScore 1.00
 G_M2073_IG04:        ; bbWeight=0, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byref, epilog, nogc
-       ; byrRegs -[r10]
-       add      rsp, 8
+       ; byrRegs -[r17]
        tail.jmp [System.Threading.Tasks.Task:AtomicStateUpdateSlow(int,int):ubyte:this]
-						;; size=10 bbWeight=0 PerfScore 0.00
+						;; size=6 bbWeight=0 PerfScore 0.00
 G_M2073_IG05:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
        ; gcrRegs -[rcx]
        xor      eax, eax
-						;; size=2 bbWeight=0 PerfScore 0.00
+						;; size=4 bbWeight=0 PerfScore 0.00
 G_M2073_IG06:        ; bbWeight=0, epilog, nogc, extend
-       add      rsp, 8
        ret      
-						;; size=5 bbWeight=0 PerfScore 0.00
+						;; size=1 bbWeight=0 PerfScore 0.00
 
-; Total bytes of code 61, prolog size 1, PerfScore 28.75, instruction count 20, allocated bytes for code 61 (MethodHash=e528f7e6) for method System.Threading.Tasks.Task:AtomicStateUpdate(int,int):ubyte:this (Tier1)
+; Total bytes of code 58, prolog size 0, PerfScore 25.00, instruction count 16, allocated bytes for code 58 (MethodHash=e528f7e6) for method System.Threading.Tasks.Task:AtomicStateUpdate(int,int):ubyte:this (Tier1)
 ; ============================================================

Case 2

An example of a regression caused by missing eEVEX support and by instructions that do not yet support eGPRs.

In this example, we use imul. We have not yet enabled eGPR usage for imul. This means that if an input to imul is in an eGPR, we insert a mov to move it to a lower GPR, which further adds to register usage.

 ; Assembly listing for method AssignRect:first_assignments(int[,],short[,]):int (FullOpts)
 ; Emitting BLENDED_CODE for X64 with AVX512 - Windows
 ; FullOpts code
@@ -9,577 +11,486 @@
 ;
 ;  V00 arg0         [V00,T06] ( 13, 362   )     ref  ->  rcx         class-hnd single-def <int[,]>
 ;  V01 arg1         [V01,T10] ( 17, 213   )     ref  ->  rdx         class-hnd single-def <short[,]>
-;  V02 loc0         [V02,T04] ( 28, 528.50)   short  ->  registers  
-;  V03 loc1         [V03,T03] ( 27, 677   )   short  ->  registers  
+;  V02 loc0         [V02,T04] ( 28, 528.50)   short  ->  r17        
+;  V03 loc1         [V03,T03] ( 27, 677   )   short  ->  r18        
 ;  V04 loc2         [V04,T02] ( 28, 810   )   short  ->  registers  
-;  V05 loc3         [V05,T25] (  6,  25   )   short  ->  [rsp+0x60] 
-;  V06 loc4         [V06,T27] (  9,  21   )   short  ->  [rsp+0x5C] 
-;  V07 loc5         [V07,T12] (  8, 168   )   short  ->  r14        
-;  V08 loc6         [V08,T11] ( 13, 168.25)     int  ->  registers  
+;  V05 loc3         [V05,T25] (  6,  25   )   short  ->  r25        
+;  V06 loc4         [V06,T27] (  9,  21   )   short  ->  r20        
+;  V07 loc5         [V07,T12] (  8, 168   )   short  ->  r26        
+;  V08 loc6         [V08,T11] ( 13, 168.25)     int  ->  r16        
 ;  V09 OutArgs      [V09    ] (  1,   1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;  V10 tmp1         [V10,T01] ( 45,2118   )     int  ->  registers   "MD array shared temp"
 ;  V11 tmp2         [V11,T00] ( 48,2166   )     int  ->  registers   "MD array shared temp"
-;  V12 cse0         [V12,T18] (  3,  40   )     int  ->  [rsp+0x58]  spill-single-def "CSE #14: aggressive"
-;  V13 cse1         [V13,T19] (  3,  40   )     int  ->  [rsp+0x54]  spill-single-def "CSE #18: aggressive"
-;  V14 cse2         [V14,T29] (  3,  10   )     int  ->  [rsp+0x50]  spill-single-def "CSE #24: aggressive"
-;  V15 cse3         [V15,T23] (  5,  26   )     int  ->  [rsp+0x4C]  multi-def "CSE #21: aggressive"
-;  V16 cse4         [V16,T20] (  3,  40   )     int  ->  [rsp+0x48]  spill-single-def "CSE #22: aggressive"
-;  V17 cse5         [V17,T17] (  2,  68   )     int  ->  [rsp+0x44]  spill-single-def hoist "CSE #06: aggressive"
-;  V18 cse6         [V18,T21] (  2,   8   )     int  ->  [rsp+0x40]  spill-single-def "CSE #13: aggressive"
-;  V19 cse7         [V19,T22] (  2,   8   )     int  ->  [rsp+0x3C]  spill-single-def "CSE #17: aggressive"
-;  V20 cse8         [V20,T26] (  2,  17   )     int  ->  [rsp+0x38]  spill-single-def hoist "CSE #19: aggressive"
-;  V21 cse9         [V21,T28] (  2,  17   )     int  ->  r11         hoist "CSE #02: aggressive"
-;  V22 cse10        [V22,T30] (  2,   2   )     int  ->  [rsp+0x34]  spill-single-def "CSE #23: aggressive"
-;  V23 cse11        [V23,T24] (  4,  18   )     int  ->  [rsp+0x30]  multi-def "CSE #20: aggressive"
-;  V24 cse12        [V24,T05] ( 15, 512   )     int  ->  registers   "CSE #08: aggressive"
-;  V25 cse13        [V25,T07] ( 21, 330   )     int  ->  rdi         "CSE #04: aggressive"
-;  V26 cse14        [V26,T14] (  7, 145   )     int  ->  r15         "CSE #05: aggressive"
-;  V27 cse15        [V27,T09] (  7, 220   )     int  ->  r12         hoist "CSE #07: aggressive"
-;  V28 cse16        [V28,T16] ( 10, 123   )     int  ->  [rsp+0x2C]  "CSE #01: aggressive"
-;  V29 cse17        [V29,T15] ( 10, 138   )     int  ->  rbx         hoist "CSE #03: aggressive"
-;  V30 cse18        [V30,T13] ( 10, 153   )     int  ->  rbp         "CSE #09: aggressive"
-;  V31 cse19        [V31,T08] (  8, 288   )     int  ->  r13         "CSE #12: aggressive"
-;  TEMP_01                                      int  ->  [rsp+0x64]
+;  V12 cse0         [V12,T18] (  3,  40   )     int  ->  r26         "CSE #14: aggressive"
+;  V13 cse1         [V13,T19] (  3,  40   )     int  ->  rbp         "CSE #18: aggressive"
+;  V14 cse2         [V14,T29] (  3,  10   )     int  ->  [rsp+0x2C]  spill-single-def "CSE #24: aggressive"
+;  V15 cse3         [V15,T23] (  5,  26   )     int  ->  registers   multi-def "CSE #21: aggressive"
+;  V16 cse4         [V16,T20] (  3,  40   )     int  ->  rbp         "CSE #22: aggressive"
+;  V17 cse5         [V17,T17] (  2,  68   )     int  ->  r28         hoist "CSE #06: aggressive"
+;  V18 cse6         [V18,T21] (  2,   8   )     int  ->  r18         "CSE #13: aggressive"
+;  V19 cse7         [V19,T22] (  2,   8   )     int  ->  r17         "CSE #17: aggressive"
+;  V20 cse8         [V20,T26] (  2,  17   )     int  ->  r25         hoist "CSE #19: aggressive"
+;  V21 cse9         [V21,T28] (  2,  17   )     int  ->  r20         hoist "CSE #02: aggressive"
+;  V22 cse10        [V22,T30] (  2,   2   )     int  ->  [rsp+0x28]  spill-single-def "CSE #23: aggressive"
+;  V23 cse11        [V23,T24] (  4,  18   )     int  ->  r24         multi-def "CSE #20: aggressive"
+;  V24 cse12        [V24,T05] ( 15, 512   )     int  ->  r30         "CSE #08: aggressive"
+;  V25 cse13        [V25,T07] ( 21, 330   )     int  ->  r22         "CSE #04: aggressive"
+;  V26 cse14        [V26,T14] (  7, 145   )     int  ->  r27         "CSE #05: aggressive"
+;  V27 cse15        [V27,T09] (  7, 220   )     int  ->  r29         hoist "CSE #07: aggressive"
+;  V28 cse16        [V28,T16] ( 10, 123   )     int  ->  r19         "CSE #01: aggressive"
+;  V29 cse17        [V29,T15] ( 10, 138   )     int  ->  r21         hoist "CSE #03: aggressive"
+;  V30 cse18        [V30,T13] ( 10, 153   )     int  ->  r23         "CSE #09: aggressive"
+;  V31 cse19        [V31,T08] (  8, 288   )     int  ->  r31         "CSE #12: aggressive"
 ;
-; Lcl frame size = 104
+; Lcl frame size = 48
 
 G_M36001_IG01:        ; bbWeight=0.25, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-       push     r15
-       push     r14
-       push     r13
-       push     r12
-       push     rdi
-       push     rsi
        push     rbp
-       push     rbx
-       sub      rsp, 104
-						;; size=16 bbWeight=0.25 PerfScore 2.06
+       sub      rsp, 48
+						;; size=8 bbWeight=0.25 PerfScore 0.31
 G_M36001_IG02:        ; bbWeight=0.25, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref
        ; gcrRegs +[rcx rdx]
-       xor      eax, eax
-       xor      r8d, r8d
-						;; size=5 bbWeight=0.25 PerfScore 0.12
+       xor      r16, r16
+       xor      r17, r17
+						;; size=8 bbWeight=0.25 PerfScore 0.12
 G_M36001_IG03:        ; bbWeight=1, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref
-       xor      r10d, r10d
-       mov      r9d, dword ptr [rdx+0x18]
-       mov      r11d, r8d
-       sub      r11d, r9d
-       mov      ebx, dword ptr [rdx+0x10]
-						;; size=16 bbWeight=1 PerfScore 4.75
+       xor      r18, r18
+       mov      r19, dword ptr [rdx+0x18]
+       mov      r20, r17
+       sub      r20, r19
+       mov      r21, dword ptr [rdx+0x10]
+						;; size=22 bbWeight=1 PerfScore 4.75
 G_M36001_IG04:        ; bbWeight=16, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref, isz
-       mov      esi, r11d
-       cmp      esi, ebx
-       jae      G_M36001_IG59
-       mov      edi, dword ptr [rdx+0x14]
-       imul     esi, edi
-       mov      ebp, dword ptr [rdx+0x1C]
-       mov      r14d, r10d
-       sub      r14d, ebp
-       cmp      r14d, edi
-       jae      G_M36001_IG59
-       add      r14d, esi
-       mov      esi, r14d
-       mov      word  ptr [rdx+2*rsi+0x20], 0
-       inc      r10d
-       movsx    r10, r10w
-       cmp      r10d, 101
+       mov      ebp, r20
+       cmp      ebp, r21
+       jae      G_M36001_IG49
+       mov      r22, dword ptr [rdx+0x14]
+       mov      eax, r22
+       imul     ebp, eax
+       mov      r23, dword ptr [rdx+0x1C]
+       mov      r24, r18
+       sub      r24, r23
+       cmp      r24, r22
+       jae      G_M36001_IG49
+       add      ebp, r24
+       mov      eax, ebp
+       mov      word  ptr [rdx+2*rax+0x20], 0
+       inc      r18
+       movsx    r18, r18w
+       cmp      r18, 101
        jl       SHORT G_M36001_IG04
-						;; size=61 bbWeight=16 PerfScore 200.00
+						;; size=81 bbWeight=16 PerfScore 204.00
 G_M36001_IG05:        ; bbWeight=4, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref, isz
-       inc      r8d
-       movsx    r8, r8w
-       cmp      r8d, 101
-       mov      dword ptr [rsp+0x2C], r9d
+       inc      r17
+       movsx    r17, r17w
+       cmp      r17, 101
        jl       SHORT G_M36001_IG03
-						;; size=18 bbWeight=4 PerfScore 11.00
+						;; size=15 bbWeight=4 PerfScore 7.00
 G_M36001_IG06:        ; bbWeight=1, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref
-       xor      r11d, r11d
-       mov      dword ptr [rsp+0x5C], r11d
-						;; size=8 bbWeight=1 PerfScore 1.25
+       xor      r20, r20
+						;; size=4 bbWeight=1 PerfScore 0.25
 G_M36001_IG07:        ; bbWeight=1, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref
-       xor      esi, esi
-       mov      dword ptr [rsp+0x60], esi
-       xor      r8d, r8d
-						;; size=9 bbWeight=1 PerfScore 1.50
+       xor      r25, r25
+       xor      r17, r17
+						;; size=8 bbWeight=1 PerfScore 0.50
 G_M36001_IG08:        ; bbWeight=4, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref
-       xor      r14d, r14d
-       xor      r10d, r10d
-       mov      r15d, dword ptr [rcx+0x18]
-       mov      r13d, r8d
-       sub      r13d, r15d
-       mov      dword ptr [rsp+0x44], r13d
-       mov      r12d, dword ptr [rcx+0x10]
-						;; size=25 bbWeight=4 PerfScore 24.00
+       xor      r26, r26
+       xor      r18, r18
+       mov      r27, dword ptr [rcx+0x18]
+       mov      r28, r17
+       sub      r28, r27
+       mov      r29, dword ptr [rcx+0x10]
+						;; size=26 bbWeight=4 PerfScore 20.00
 G_M36001_IG09:        ; bbWeight=64, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref, isz
-       mov      r11d, r13d
-       cmp      r11d, r12d
-       jae      G_M36001_IG59
-       mov      esi, dword ptr [rcx+0x14]
-       imul     r11d, esi
-       mov      r13d, dword ptr [rcx+0x1C]
-       mov      r9d, r10d
-       sub      r9d, r13d
-       cmp      r9d, esi
-       jae      G_M36001_IG59
-       add      r11d, r9d
-       mov      r9d, r11d
-       cmp      dword ptr [rcx+4*r9+0x20], 0
+       mov      ebp, r28
+       cmp      ebp, r29
+       jae      G_M36001_IG49
+       mov      r30, dword ptr [rcx+0x14]
+       mov      eax, r30
+       imul     eax, ebp
+       mov      r31, dword ptr [rcx+0x1C]
+       mov      r24, r18
+       sub      r24, r31
+       cmp      r24, r30
+       jae      G_M36001_IG49
+       add      eax, r24
+       cmp      dword ptr [rcx+4*rax+0x20], 0
        jne      SHORT G_M36001_IG11
-						;; size=52 bbWeight=64 PerfScore 880.00
+						;; size=62 bbWeight=64 PerfScore 880.00
 G_M36001_IG10:        ; bbWeight=32, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref, isz
-       mov      r11d, r8d
-       sub      r11d, dword ptr [rsp+0x2C]
-       cmp      r11d, ebx
-       jae      G_M36001_IG59
-       imul     r11d, edi
-       mov      r9d, r10d
-       sub      r9d, ebp
-       cmp      r9d, edi
-       jae      G_M36001_IG59
-       add      r11d, r9d
-       mov      r9d, r11d
-       cmp      word  ptr [rdx+2*r9+0x20], 0
+       mov      ebp, r17
+       sub      ebp, r19
+       cmp      ebp, r21
+       jae      G_M36001_IG49
+       mov      eax, r22
+       imul     eax, ebp
+       mov      r24, r18
+       sub      r24, r23
+       cmp      r24, r22
+       jae      G_M36001_IG49
+       add      eax, r24
+       cmp      word  ptr [rdx+2*rax+0x20], 0
        jne      SHORT G_M36001_IG11
-       inc      r14d
-       movsx    r14, r14w
-       mov      eax, r10d
-						;; size=61 bbWeight=32 PerfScore 400.00
+       inc      r26
+       movsx    r26, r26w
+       mov      r16, r18
+						;; size=69 bbWeight=32 PerfScore 344.00
 G_M36001_IG11:        ; bbWeight=64, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref
-       inc      r10d
-       movsx    r10, r10w
...
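
The per-group size increases in the listing above are consistent with prefix costs alone. In IG02, for example, `xor eax, eax` plus `xor r8d, r8d` is 5 bytes, while `xor r16, r16` plus `xor r17, r17` is 8 bytes. A minimal sketch of that accounting follows, assuming 32-bit register-register operations with a one-byte opcode; `prefixBytes` and `regRegSize` are hypothetical helpers for illustration, not JIT functions.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not JIT code): prefix bytes needed by a 32-bit
// register-register instruction, based on the highest register index used.
//   registers 0-7   : no prefix
//   registers 8-15  : 1-byte REX prefix
//   registers 16-31 : 2-byte REX2 prefix
int prefixBytes(int maxRegIndex)
{
    assert(maxRegIndex >= 0 && maxRegIndex < 32);
    if (maxRegIndex >= 16)
    {
        return 2;
    }
    return (maxRegIndex >= 8) ? 1 : 0;
}

// Size of a 32-bit `op reg, reg` instruction:
// prefix bytes + 1 opcode byte + 1 ModRM byte.
int regRegSize(int dstReg, int srcReg)
{
    return prefixBytes(std::max(dstReg, srcReg)) + 2;
}
```

With this model, `regRegSize(0, 0) + regRegSize(8, 8)` gives the 5 bytes of the base IG02, and `regRegSize(16, 16) + regRegSize(17, 17)` gives the 8 bytes of the diff.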

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 11, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 11, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@BruceForstall BruceForstall added the apx Related to the Intel Advanced Performance Extensions (APX) label Oct 15, 2024
@DeepakRajendrakumaran DeepakRajendrakumaran marked this pull request as ready for review October 21, 2024 22:08
@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Nov 4, 2024
@JulieLeeMSFT
Member

CC @jakobbotsch and @tannergooding for code review.

@jakobbotsch
Member

@DeepakRajendrakumaran What is the status of this PR? It's marked as ready but the description says it's built on top of #108796 that is not marked as ready.

@DeepakRajendrakumaran
Contributor Author

> @DeepakRajendrakumaran What is the status of this PR? It's marked as ready but the description says it's built on top of #108796 that is not marked as ready.

Thanks for pointing that out. It has some dependencies on other PRs - specifically the Rex2 encoding PR. Considering that, do you have a suggestion on how to mark this for now?

@DeepakRajendrakumaran
Contributor Author

DeepakRajendrakumaran commented Nov 20, 2024

@kunalspathak

Now that the CPUID changes have merged, I ran the superpmi throughput (TP) measurements and hit a problem:

[image]

I ran the scripts Kunal shared a while back to debug why this is happening.

The following is for libraries

Base: 798636572986, Diff: 837269651550, +4.8374%

?processBlockStartLocations@LinearScan@@AEAAXPEAUBasicBlock@@@Z                                                                                            : 7483341082 : +105.48%  : 15.71% : +0.9370%
?allocateRegistersMinimal@LinearScan@@QEAAXXZ                                                                                                              : 5166096591 : +51.73%   : 10.84% : +0.6469%
?allocateRegisters@LinearScan@@QEAAXXZ                                                                                                                     : 3501980510 : +32.45%   : 7.35%  : +0.4385%
?processKills@LinearScan@@AEAAXPEAVRefPosition@@@Z                                                                                                         : 2761837171 : +53.97%   : 5.80%  : +0.3458%
?genConsumeReg@CodeGen@@IEAA?AW4_regNumber_enum@@PEAUGenTree@@@Z                                                                                           : 2114364155 : +56.59%   : 4.44%  : +0.2647%
?TakesRex2Prefix@emitter@@QEBA_NPEBUinstrDesc@1@@Z                                                                                                         : 1652787168 : NA        : 3.47%  : +0.2070%
?freeRegisters@LinearScan@@AEAAXUregMaskTP@@@Z                                                                                                             : 1645251557 : +62.83%   : 3.45%  : +0.2060%
?mergeRegisterPreferences@Interval@@QEAAX_K@Z                                                                                                              : 1424229795 : +2637.42% : 2.99%  : +0.1783%
?AddX86PrefixIfNeeded@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                                      : 1332532027 : NA        : 2.80%  : +0.1669%
?AddX86PrefixIfNeededAndNotPresent@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                         : 1247317388 : NA        : 2.62%  : +0.1562%
?gcMarkRegPtrVal@GCInfo@@QEAAXW4_regNumber_enum@@W4var_types@@@Z                                                                                           : 1236233831 : +174.95%  : 2.59%  : +0.1548%
??$select@$0A@@RegisterSelection@LinearScan@@QEAA_KPEAVInterval@@PEAVRefPosition@@@Z                                                                       : 1044477092 : +10.11%   : 2.19%  : +0.1308%
?assignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                                                            : 749700826  : +42.11%   : 1.57%  : +0.0939%
?genCodeForBBlist@CodeGen@@IEAAXXZ                                                                                                                         : 707125092  : +11.03%   : 1.48%  : +0.0885%
?allocateRegMinimal@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z                                                                 : 704654429  : +15.88%   : 1.48%  : +0.0882%
?buildKillPositionsForNode@LinearScan@@AEAA_NPEAUGenTree@@IUregMaskTP@@@Z                                                                                  : 658845785  : +64.48%   : 1.38%  : +0.0825%
?emitOutputInstr@emitter@@IEAA_KPEAUinsGroup@@PEAUinstrDesc@1@PEAPEAE@Z                                                                                    : 658192653  : +9.65%    : 1.38%  : +0.0824%
?emitGCregDeadUpd@emitter@@QEAAXW4_regNumber_enum@@PEAE@Z                                                                                                  : 629879757  : +107.83%  : 1.32%  : +0.0789%
?updateAssignedInterval@LinearScan@@AEAAXPEAVRegRecord@@PEAVInterval@@@Z                                                                                   : 546122060  : +24.24%   : 1.15%  : +0.0684%
?emitStackPopLargeStk@emitter@@QEAAXPEAE_NEI@Z                                                                                                             : 525848563  : +104.66%  : 1.10%  : +0.0658%
?emitGetAdjustedSize@emitter@@QEBAIPEAUinstrDesc@1@_K@Z                                                                                                    : 487696755  : +31.37%   : 1.02%  : +0.0611%
?emitGCregLiveUpd@emitter@@QEAAXW4GCtype@@W4_regNumber_enum@@PEAE@Z                                                                                        : 451135285  : +59.41%   : 0.95%  : +0.0565%
?buildPhysRegRecords@LinearScan@@AEAAXXZ                                                                                                                   : 417375644  : +52.32%   : 0.88%  : +0.0523%
?AddRexWPrefix@emitter@@QEAA_KPEBUinstrDesc@1@_K@Z                                                                                                         : 337122934  : +62.86%   : 0.71%  : +0.0422%
?TakesEvexPrefix@emitter@@QEBA_NPEBUinstrDesc@1@@Z                                                                                                         : 326871135  : +13.69%   : 0.69%  : +0.0409%
?newRefPosition@LinearScan@@AEAAPEAVRefPosition@@PEAVInterval@@IW4RefType@@PEAUGenTree@@_KI@Z                                                              : 289859613  : +3.27%    : 0.61%  : +0.0363%
??0LinearScan@@QEAA@PEAVCompiler@@@Z                                                                                                                       : 287558884  : +56.87%   : 0.60%  : +0.0360%
?emitOutputRexOrSimdPrefixIfNeeded@emitter@@QEAAIW4instruction@@PEAEAEA_K@Z                                                                                : 276256843  : +10.64%   : 0.58%  : +0.0346%
?emitIns_Call@emitter@@QEAAXW4EmitCallType@1@PEAUCORINFO_METHOD_STRUCT_@@PEAX_JW4emitAttr@@AEBQEA_KUregMaskTP@@6AEBVDebugInfo@@W4_regNumber_enum@@8I3_N9@Z : 251568991  : +17.79%   : 0.53%  : +0.0315%
?resetAllRegistersState@LinearScan@@AEAAXXZ                                                                                                                : 250671960  : +48.42%   : 0.53%  : +0.0314%
?emitUpdateLiveGCregs@emitter@@QEAAXW4GCtype@@UregMaskTP@@PEAE@Z                                                                                           : 236180536  : +61.03%   : 0.50%  : +0.0296%
?BuildNode@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                                                : 211945171  : +3.63%    : 0.44%  : +0.0265%
?genUpdateRegLife@CodeGenInterface@@QEAAXPEBVLclVarDsc@@_N1@Z                                                                                              : 208334297  : +146.29%  : 0.44%  : +0.0261%
?unassignPhysReg@LinearScan@@AEAAXPEAVRegRecord@@PEAVRefPosition@@@Z                                                                                       : 204285611  : +8.70%    : 0.43%  : +0.0256%
?BuildCall@LinearScan@@AEAAHPEAUGenTreeCall@@@Z                                                                                                            : 201972715  : +19.34%   : 0.42%  : +0.0253%
?genProduceReg@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                                               : 156386903  : +5.43%    : 0.33%  : +0.0196%
?emitGetGCRegsSavedOrModified@emitter@@QEAA?AUregMaskTP@@PEAUCORINFO_METHOD_STRUCT_@@@Z                                                                    : 155613852  : NA        : 0.33%  : +0.0195%
??$resolveRegisters@$00@LinearScan@@QEAAXXZ                                                                                                                : 154302371  : +4.84%    : 0.32%  : +0.0193%
??$compChangeLife@$00@Compiler@@QEAAXAEBQEA_K@Z                                                                                                            : 150051997  : +21.15%   : 0.31%  : +0.0188%
?genPushCalleeSavedRegisters@CodeGen@@IEAAXXZ                                                                                                              : 136488370  : +268.86%  : 0.29%  : +0.0171%
?BuildRMWUses@LinearScan@@AEAAHPEAUGenTree@@00_K1@Z                                                                                                        : 119460162  : NA        : 0.25%  : +0.0150%
?emitInsSize@emitter@@QEAAIPEAUinstrDesc@1@_K_N@Z                                                                                                          : 113904280  : +11.97%   : 0.24%  : +0.0143%
??$resolveRegisters@$0A@@LinearScan@@QEAAXXZ                                                                                                               : 99510485   : +3.12%    : 0.21%  : +0.0125%
?BuildIndir@LinearScan@@AEAAHPEAUGenTreeIndir@@@Z                                                                                                          : 96091030   : +48.43%   : 0.20%  : +0.0120%
?compInitOptions@Compiler@@IEAAXPEAVJitFlags@@@Z                                                                                                           : 89279923   : +9.31%    : 0.19%  : +0.0112%
?instGen_Set_Reg_To_Imm@CodeGen@@QEAAXW4emitAttr@@W4_regNumber_enum@@_JW4insFlags@@@Z                                                                      : 78050532   : +26.93%   : 0.16%  : +0.0098%
?resolveLocalRef@LinearScan@@AEAAXPEAUBasicBlock@@PEAUGenTreeLclVar@@PEAVRefPosition@@@Z                                                                   : 74859133   : +3.74%    : 0.16%  : +0.0094%
??$allocateReg@$0A@@LinearScan@@AEAA?AW4_regNumber_enum@@PEAVInterval@@PEAVRefPosition@@@Z                                                                 : 74540254   : +6.75%    : 0.16%  : +0.0093%
memset                                                                                                                                                     : 73679442   : +1.13%    : 0.15%  : +0.0092%
?emitOutputRI@emitter@@QEAAPEAEPEAEPEAUinstrDesc@1@@Z                                                                                                      : 67994420   : +6.26%    : 0.14%  : +0.0085%
?insEncodeReg012@emitter@@QEAAIPEBUinstrDesc@1@W4_regNumber_enum@@W4emitAttr@@PEA_K@Z                                                                      : 65961952   : +6.54%    : 0.14%  : +0.0083%
?genSetRegToConst@CodeGen@@IEAAXW4_regNumber_enum@@W4var_types@@PEAUGenTree@@@Z                                                                            : 63572129   : +16.91%   : 0.13%  : +0.0080%
?emitInsSizeSV@emitter@@QEAAIPEAUinstrDesc@1@_KHH@Z                                                                                                        : 58355992   : +5.91%    : 0.12%  : +0.0073%
?BuildDefWithKills@LinearScan@@AEAAXPEAUGenTree@@H_KUregMaskTP@@@Z                                                                                         : 56553707   : +40.78%   : 0.12%  : +0.0071%
?BuildCast@LinearScan@@AEAAHPEAUGenTreeCast@@@Z                                                                                                            : 56461244   : NA        : 0.12%  : +0.0071%
?BuildStoreLocDef@LinearScan@@AEAAXPEAUGenTreeLclVarCommon@@PEAVLclVarDsc@@PEAVRefPosition@@H@Z                                                            : 53688042   : +14.79%   : 0.11%  : +0.0067%
?emitOutputRR@emitter@@QEAAPEAEPEAEPEAUinstrDesc@1@@Z                                                                                                      : 53279956   : +3.55%    : 0.11%  : +0.0067%
?genCallInstruction@CodeGen@@IEAAXPEAUGenTreeCall@@@Z                                                                                                      : 50084108   : +5.82%    : 0.11%  : +0.0063%
?emitHandleMemOp@emitter@@AEAAXPEAUGenTreeIndir@@PEAUinstrDesc@1@W4insFormat@1@W4instruction@@@Z                                                           : -58626864  : -10.34%   : 0.12%  : -0.0073%
?getMatchingConstants@LinearScan@@AEAA_K_KPEAVInterval@@PEAVRefPosition@@@Z                                                                                : -79107557  : -100.00%  : 0.17%  : -0.0099%
?emitSizeOfInsDsc_CNS@emitter@@AEBA_KPEAUinstrDesc@1@@Z                                                                                                    : -90499395  : -98.48%   : 0.19%  : -0.0113%
?BuildRMWUses@LinearScan@@AEAAHPEAUGenTree@@00_K@Z                                                                                                         : -120949346 : -100.00%  : 0.25%  : -0.0151%
?BuildGCWriteBarrier@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                                      : -146449406 : -100.00%  : 0.31%  : -0.0183%
?associateRefPosWithInterval@LinearScan@@AEAAXPEAVRefPosition@@@Z                                                                                          : -188074386 : -3.81%    : 0.39%  : -0.0235%
?addKillForRegs@LinearScan@@AEAAXUregMaskTP@@I@Z                                                                                                           : -213435792 : -100.00%  : 0.45%  : -0.0267%
?BuildSimple@LinearScan@@AEAAHPEAUGenTree@@@Z                                                                                                              : -345016623 : -99.92%   : 0.72%  : -0.0432%
?genCodeForTreeNode@CodeGen@@IEAAXPEAUGenTree@@@Z                                                                                                          : -414160388 : -6.66%    : 0.87%  : -0.0519%
?updateRegisterPreferences@Interval@@QEAAX_K@Z                                                                                                             : -580317174 : -100.00%  : 1.22%  : -0.0727%
?AddSimdPrefixIfNeededAndNotPresent@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                        : -885312893 : -100.00%  : 1.86%  : -0.1109%
?AddSimdPrefixIfNeeded@emitter@@QEAA_KPEBUinstrDesc@1@_KW4emitAttr@@@Z                                                                                     : -984986225 : -100.00%  : 2.07%  : -0.1233%
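
As a reading aid for the numbers above: the headline figure is the difference between Diff and Base relative to Base, and the last column of each row appears to be that function's delta relative to the same Base total. That reading is inferred from the values, not stated in the comment. A minimal sketch, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Percentage change of `diff` relative to `base` (positive means a regression).
double relativeDeltaPercent(std::int64_t base, std::int64_t diff)
{
    assert(base > 0);
    return 100.0 * static_cast<double>(diff - base) / static_cast<double>(base);
}

// One function's delta expressed as a percentage of the base total.
double shareOfBasePercent(std::int64_t functionDelta, std::int64_t base)
{
    assert(base > 0);
    return 100.0 * static_cast<double>(functionDelta) / static_cast<double>(base);
}
```

For example, `relativeDeltaPercent(798636572986, 837269651550)` is roughly 4.8374, matching the headline, and `shareOfBasePercent(7483341082, 798636572986)` is roughly 0.9370, matching the last column of the `processBlockStartLocations` row.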

@DeepakRajendrakumaran
Contributor Author

DeepakRajendrakumaran commented Nov 26, 2024

To further confirm that the Rex2 changes are not causing the TP regression, I measured them separately. We can safely conclude that the TP regression comes from eGPR enablement.

The following compares with and without the Rex2 changes (without the register allocation changes):

Overall (+0.08% to +0.23%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.16% |
| coreclr_tests.run.windows.x64.checked.mch | +0.23% |
| libraries.crossgen2.windows.x64.checked.mch | +0.14% |
| libraries.pmi.windows.x64.checked.mch | +0.11% |
| libraries_tests.run.windows.x64.Release.mch | +0.18% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +0.12% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +0.08% |

MinOpts (+0.28% to +0.48%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.48% |
| coreclr_tests.run.windows.x64.checked.mch | +0.36% |
| libraries.crossgen2.windows.x64.checked.mch | +0.38% |
| libraries.pmi.windows.x64.checked.mch | +0.37% |
| libraries_tests.run.windows.x64.Release.mch | +0.47% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +0.40% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +0.28% |

FullOpts (+0.08% to +0.14%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +0.10% |
| coreclr_tests.run.windows.x64.checked.mch | +0.14% |
| libraries.crossgen2.windows.x64.checked.mch | +0.14% |
| libraries.pmi.windows.x64.checked.mch | +0.11% |
| libraries_tests.run.windows.x64.Release.mch | +0.10% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +0.12% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +0.08% |

With Rex2 as base and eGPR changes as diff

Overall (+3.60% to +4.65%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +4.33% |
| coreclr_tests.run.windows.x64.checked.mch | +4.65% |
| libraries.crossgen2.windows.x64.checked.mch | +4.29% |
| libraries.pmi.windows.x64.checked.mch | +3.76% |
| libraries_tests.run.windows.x64.Release.mch | +4.65% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +3.60% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +3.66% |

MinOpts (+6.09% to +8.79%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +8.27% |
| coreclr_tests.run.windows.x64.checked.mch | +6.09% |
| libraries.crossgen2.windows.x64.checked.mch | +7.27% |
| libraries.pmi.windows.x64.checked.mch | +6.86% |
| libraries_tests.run.windows.x64.Release.mch | +8.42% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +7.16% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +8.79% |

FullOpts (+3.47% to +4.29%)

| Collection | PDIFF |
| --- | --- |
| aspnet.run.windows.x64.checked.mch | +3.59% |
| coreclr_tests.run.windows.x64.checked.mch | +3.60% |
| libraries.crossgen2.windows.x64.checked.mch | +4.29% |
| libraries.pmi.windows.x64.checked.mch | +3.76% |
| libraries_tests.run.windows.x64.Release.mch | +3.47% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +3.52% |
| smoke_tests.nativeaot.windows.x64.checked.mch | +3.66% |

Member

@kunalspathak kunalspathak left a comment

Went through a first pass; we still need to evaluate where the TP regression is coming from. However, I still see some asmdiffs... can you please fix them?

@@ -12534,6 +12555,9 @@ void LinearScan::verifyResolutionMove(GenTree* resolutionMove, LsraLocation curr
LinearScan::RegisterSelection::RegisterSelection(LinearScan* linearScan)
{
    this->linearScan = linearScan;
#if defined(TARGET_AMD64)
    rbmAllInt = linearScan->compiler->get_RBM_ALLINT();

Any reason why we need it here instead of in the LinearScan ctor (which you are already doing)?

@@ -742,6 +743,7 @@ class emitter
// The instrDescCGCA struct's member keeping the GC-ness of the first return register is _idcSecondRetRegGCType.
GCtype _idGCref : 2; // GCref operand? (value is a "GCtype")

#if !defined(TARGET_AMD64)

Any reason for this change?

@@ -62,7 +62,12 @@ bool regMaskTP::IsRegNumInMask(regNumber reg, var_types type) const
//
void regMaskTP::AddGprRegs(SingleTypeRegSet gprRegs)
{
    // RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support.
#if defined(TARGET_AMD64)
    assert((gprRegs == RBM_NONE) || ((gprRegs & RBM_ALLINT_STATIC_ALL) != RBM_NONE));

On non-APX machines, the GPRs will still be 0-15, and with this assert we would allow a float register to be set, right?


// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. Deprecated????
#if defined(TARGET_AMD64)
sprintf_s(regmask, cchRegMask, REG_MASK_INT_FMT, (mask & RBM_ALLINT_STATIC_ALL).getLow());

Why do we need RBM_ALLINT_STATIC_ALL here? It should just use RBM_ALLINT, which should return the right mask depending on whether the high int registers are available.
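
The distinction being asked about can be sketched as follows. This is a simplified model, not the JIT's actual code: `RegSet`, `CompilerModel`, `intPartForDump`, and the bit layout (bits 0-15 for the legacy GPRs, bits 16-31 for the eGPRs) are all assumptions made for illustration and do not reflect the real register numbering.

```cpp
#include <cassert>
#include <cstdint>

using RegSet = std::uint64_t; // hypothetical stand-in for SingleTypeRegSet

constexpr RegSet kLowIntRegs   = 0x0000FFFFull;              // model: r0-r15
constexpr RegSet kHighIntRegs  = 0xFFFF0000ull;              // model: r16-r31
constexpr RegSet kAllIntStatic = kLowIntRegs | kHighIntRegs; // compile-time superset

struct CompilerModel
{
    bool apxAvailable;

    // Runtime-dependent mask: includes the high registers only when APX is available.
    RegSet allInt() const
    {
        return apxAvailable ? kAllIntStatic : kLowIntRegs;
    }
};

// Integer part of a mask for display, filtered by the runtime mask
// instead of the static superset.
RegSet intPartForDump(const CompilerModel& comp, RegSet mask)
{
    RegSet result = mask & comp.allInt();
    assert((result & ~kAllIntStatic) == 0); // the runtime mask never exceeds the static one
    return result;
}
```

In this model, a mask with a stray high-register bit is filtered out when APX is unavailable, whereas masking with the static superset would let it through.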

// RBM_ALLINT is not known at compile time on TARGET_AMD64 since it's dependent on APX support. These are used by GC
// exclusively
#if defined(TARGET_AMD64)
printf(REG_MASK_INT_FMT, (mask & RBM_ALLINT_STATIC_ALL).getLow());

Likewise here... can we just use RBM_ALLINT?

@@ -3136,4 +3347,51 @@ inline SingleTypeRegSet LinearScan::BuildEvexIncompatibleMask(GenTree* tree)
#endif
}

inline bool LinearScan::DoesThisUseGPR(GenTree* op)

Can you add method docs for this method and the one below?

return false;
}

inline SingleTypeRegSet LinearScan::BuildApxIncompatibleGPRMask(GenTree* tree, bool forceLowGpr)

What is the goal of this method?

SingleTypeRegSet op1Candidates = candidates;
SingleTypeRegSet op2Candidates = candidates;
int srcCount = 0;
// SingleTypeRegSet op1Candidates = candidates;

There are a lot of such commented-out lines in this file. Can you please delete them?


// We are dealing exclusively with HWIntrinsics here
return (op->AsHWIntrinsic()->OperIsBroadcastScalar() ||
(op->AsHWIntrinsic()->OperIsMemoryLoad() && DoesThisUseGPR(op->AsHWIntrinsic()->Op(1))));

Do we only care whether Op(1) uses a GPR, and not any other operand?

else
{
// ToDo-APX : imul currently doesn't have rex2 support. So, cannot use R16-R31.
dstCandidates = BuildApxIncompatibleGPRMask(tree, true);

Calling BuildApxIncompatibleGPRMask for many nodes seems expensive. I wonder if we can do something like:

  1. At the top, just set SingleTypeRegSet incompatibleGprMask = compiler->canUseApxEncoding() ? lowGPRRegs() : RBM_NONE;
  2. Places where you are passing forceLowGpr = true can instead just use incompatibleGprMask.
  3. Places where you are not forcing low GPRs can just use DoesThisUseGPR(tree) ? incompatibleGprMask : RBM_NONE

Also, it might be worth caching the value of lowGPRRegs(), because currently it is evaluated every time as (availableIntRegs & RBM_LOWINT.GetIntRegSet()), and I see lowGPRRegs() is used in a lot of places.
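
The three steps and the caching idea above could be sketched roughly as below. This is an illustrative model under stated assumptions, not the PR's code: `LinearScanModel`, `RegSet`, `kNone` (standing in for RBM_NONE, meaning no extra constraint), and the bit layout with the low GPRs in bits 0-15 are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using RegSet = std::uint64_t; // hypothetical stand-in for SingleTypeRegSet

constexpr RegSet kNone       = 0;             // models RBM_NONE
constexpr RegSet kLowIntRegs = 0x0000FFFFull; // model: low GPRs occupy bits 0-15

class LinearScanModel
{
public:
    LinearScanModel(RegSet availableIntRegs, bool canUseApxEncoding)
        : m_lowGprRegs(availableIntRegs & kLowIntRegs) // computed once, then cached
        , m_incompatibleGprMask(canUseApxEncoding ? m_lowGprRegs : kNone)
    {
        assert((m_lowGprRegs & ~kLowIntRegs) == 0);
    }

    // Cached value instead of recomputing the intersection on every call.
    RegSet lowGprRegs() const
    {
        return m_lowGprRegs;
    }

    // Step 2: call sites that previously passed forceLowGpr = true.
    RegSet forcedLowGprCandidates() const
    {
        return m_incompatibleGprMask;
    }

    // Step 3: call sites that only restrict when the node uses a GPR.
    RegSet candidatesIfUsesGpr(bool usesGpr) const
    {
        return usesGpr ? m_incompatibleGprMask : kNone;
    }

private:
    RegSet m_lowGprRegs;          // declared first so it is initialized first
    RegSet m_incompatibleGprMask; // step 1: computed once per LinearScan instance
};
```

The point of the design is that the APX check and the mask intersection happen once in the constructor, so each node pays only for a member read and, at most, one branch.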
