Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JIT] Codegen issue for InlineArray in record #110968

Open
sfiruch opened this issue Dec 27, 2024 · 2 comments
Open

[JIT] Codegen issue for InlineArray in record #110968

sfiruch opened this issue Dec 27, 2024 · 2 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@sfiruch
Copy link
Contributor

sfiruch commented Dec 27, 2024

Overview

I found a particular case where access to InlineArrays are surprisingly slow in my application. Here's a repro, also on SharpLab:

[InlineArray(16)]
struct InlineArray
{
    private object _element0;
}

class AClass(InlineArray inlineArray)
{
    object Read(int i) => inlineArray[i];
}

record class ARecordClass(InlineArray inlineArray)
{
    object Read(int i) => inlineArray[i];
}

record class ARecordClassWorkaround(InlineArray _a)
{
    readonly InlineArray inlineArray = _a;
    
    object Read(int i) => inlineArray[i];
}

Good codegen, for class

AClass.Read(Int32)
    L0000: sub rsp, 0x28
    L0004: cmp [rcx], cl
    L0006: add rcx, 8
    L000a: cmp edx, 0x10
    L000d: jae short L001a
    L000f: mov eax, edx
    L0011: mov rax, [rcx+rax*8]
    L0015: add rsp, 0x28
    L0019: ret
    L001a: call 0x00007ffd1f961728
    L001f: int3

(Detail: I don't think this should allocate stack space. Probably left after the rest was optimized.)

Suboptimal codegen, for record class

ARecordClass.Read(Int32)
    L0000: push rbx
    L0001: sub rsp, 0xa0
    L0008: vzeroupper
    L000b: vxorps xmm4, xmm4, xmm4
    L000f: vmovdqu [rsp+0x20], ymm4
    L0015: vmovdqu [rsp+0x40], ymm4
    L001b: vmovdqu [rsp+0x60], ymm4
    L0021: vmovdqu [rsp+0x80], ymm4
    L002a: mov ebx, edx
    L002c: lea rdx, [rcx+8]
    L0030: cmp [rdx], dl
    L0032: lea rcx, [rsp+0x20]
    L0037: mov r8d, 0x80
    L003d: call 0x00007ffd28e8a570 ;BulkMoveWithWriteBarrier
    L0042: cmp ebx, 0x10
    L0045: jae short L005b
    L0047: lea rax, [rsp+0x20]
    L004c: mov ecx, ebx
    L004e: mov rax, [rax+rcx*8]
    L0052: add rsp, 0xa0
    L0059: pop rbx
    L005a: ret
    L005b: call 0x00007ffd1f961728
    L0060: int3

Analysis

I could only reproduce this problem by combining InlineArray with record class. Other types don't lead to this codegen. It seems that the field access via getter is problematic. Because the getter returns the whole InlineArray the JIT appears to everything to the stack before accessing it. If you make the InlineArray bigger, the problem becomes worse/more obvious.

The stack probe for an 64k-sized InlineArray look like this:

    L0014: vxorps xmm4, xmm4, xmm4
    L0018: vmovdqa [rsp+0x20], xmm4
    L001e: mov rax, 0xffffffffffff0010
    L0028: vmovdqa [rsp+rax+0x10020], xmm4
    L0031: vmovdqa [rsp+rax+0x10030], xmm4
    L003a: vmovdqa [rsp+rax+0x10040], xmm4

The rsp+rax+0x10020 encoding is curious, as it could be reformulated as rsp+0x30 instead. I'm not so sure this is important though.

Workaround

The ARecordClassWorkaround version results in fast code.

Regression

Tested with .NET 9.0.0. Unclear if regression or not.

@sfiruch sfiruch added the tenet-performance Performance related issue label Dec 27, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 27, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 27, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member

EgorBo commented Dec 27, 2024

Yeah, the problem that the whole struct is copied (with a bulk write barrier) to then access its element by index. I am not sure why we emit a write barrier here for stack. Perhaps, because the struct is too big and the bulk write barrier does copy faster than the whatever jit emits inline (rep stos?).

Codegen on arm64 looks better (because JIT is allowed to do the copy with SIMD):

; Method ConsoleApp11.Program+ARecordClass:Read(int):System.Object:this (FullOpts)
G_M52317_IG01:
            stp     fp, lr, [sp, #-0x90]!
            mov     fp, sp
            add     x9, fp, #16
            movi    v16.16b, #0
            stp     q16, q16, [x9]
            stp     q16, q16, [x9, #0x20]
            stp     q16, q16, [x9, #0x40]
            stp     q16, q16, [x9, #0x60]
G_M52317_IG02:
            sub     x2, x0, #8
            ldp     q16, q17, [x2, #0x10]
            stp     q16, q17, [fp, #0x10]
            ldp     q16, q17, [x2, #0x30]
            stp     q16, q17, [fp, #0x30]
            ldp     q16, q17, [x2, #0x50]
            stp     q16, q17, [fp, #0x50]
            ldp     q16, q17, [x2, #0x70]
            stp     q16, q17, [fp, #0x70]
G_M52317_IG03:
            cmp     w1, #16
            bhs     G_M52317_IG05
            add     x0, fp, #16	// [V02 loc0]
            ldr     x0, [x0, w1, UXTW #3]
G_M52317_IG04:
            ldp     fp, lr, [sp], #0x90
            ret     lr
G_M52317_IG05:
            bl      CORINFO_HELP_RNGCHKFAIL
            brk     #0
; Total bytes of code: 100

@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Dec 27, 2024
@EgorBo EgorBo added this to the Future milestone Dec 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

2 participants