[JIT] Codegen issue for `InlineArray` in `record` #110968

sfiruch · 2024-12-27T16:00:54Z

Overview

I found a particular case where access to InlineArrays are surprisingly slow in my application. Here's a repro, also on SharpLab:

[InlineArray(16)]
struct InlineArray
{
    private object _element0;
}

class AClass(InlineArray inlineArray)
{
    object Read(int i) => inlineArray[i];
}

record class ARecordClass(InlineArray inlineArray)
{
    object Read(int i) => inlineArray[i];
}

record class ARecordClassWorkaround(InlineArray _a)
{
    readonly InlineArray inlineArray = _a;
    
    object Read(int i) => inlineArray[i];
}

Good codegen, for `class`

AClass.Read(Int32)
    L0000: sub rsp, 0x28
    L0004: cmp [rcx], cl
    L0006: add rcx, 8
    L000a: cmp edx, 0x10
    L000d: jae short L001a
    L000f: mov eax, edx
    L0011: mov rax, [rcx+rax*8]
    L0015: add rsp, 0x28
    L0019: ret
    L001a: call 0x00007ffd1f961728
    L001f: int3

(Detail: I don't think this should allocate stack space. Probably left after the rest was optimized.)

Suboptimal codegen, for `record class`

ARecordClass.Read(Int32)
    L0000: push rbx
    L0001: sub rsp, 0xa0
    L0008: vzeroupper
    L000b: vxorps xmm4, xmm4, xmm4
    L000f: vmovdqu [rsp+0x20], ymm4
    L0015: vmovdqu [rsp+0x40], ymm4
    L001b: vmovdqu [rsp+0x60], ymm4
    L0021: vmovdqu [rsp+0x80], ymm4
    L002a: mov ebx, edx
    L002c: lea rdx, [rcx+8]
    L0030: cmp [rdx], dl
    L0032: lea rcx, [rsp+0x20]
    L0037: mov r8d, 0x80
    L003d: call 0x00007ffd28e8a570 ;BulkMoveWithWriteBarrier
    L0042: cmp ebx, 0x10
    L0045: jae short L005b
    L0047: lea rax, [rsp+0x20]
    L004c: mov ecx, ebx
    L004e: mov rax, [rax+rcx*8]
    L0052: add rsp, 0xa0
    L0059: pop rbx
    L005a: ret
    L005b: call 0x00007ffd1f961728
    L0060: int3

Analysis

I could only reproduce this problem by combining InlineArray with record class. Other types don't lead to this codegen. It seems that the field access via getter is problematic. Because the getter returns the whole InlineArray the JIT appears to everything to the stack before accessing it. If you make the InlineArray bigger, the problem becomes worse/more obvious.

The stack probe for an 64k-sized InlineArray look like this:

    L0014: vxorps xmm4, xmm4, xmm4
    L0018: vmovdqa [rsp+0x20], xmm4
    L001e: mov rax, 0xffffffffffff0010
    L0028: vmovdqa [rsp+rax+0x10020], xmm4
    L0031: vmovdqa [rsp+rax+0x10030], xmm4
    L003a: vmovdqa [rsp+rax+0x10040], xmm4

The rsp+rax+0x10020 encoding is curious, as it could be reformulated as rsp+0x30 instead. I'm not so sure this is important though.

Workaround

The ARecordClassWorkaround version results in fast code.

Regression

Tested with .NET 9.0.0. Unclear if regression or not.

The text was updated successfully, but these errors were encountered:

dotnet-policy-service · 2024-12-27T16:01:29Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

EgorBo · 2024-12-27T18:49:04Z

Yeah, the problem that the whole struct is copied (with a bulk write barrier) to then access its element by index. I am not sure why we emit a write barrier here for stack. Perhaps, because the struct is too big and the bulk write barrier does copy faster than the whatever jit emits inline (rep stos?).

Codegen on arm64 looks better (because JIT is allowed to do the copy with SIMD):

; Method ConsoleApp11.Program+ARecordClass:Read(int):System.Object:this (FullOpts)
G_M52317_IG01:
            stp     fp, lr, [sp, #-0x90]!
            mov     fp, sp
            add     x9, fp, #16
            movi    v16.16b, #0
            stp     q16, q16, [x9]
            stp     q16, q16, [x9, #0x20]
            stp     q16, q16, [x9, #0x40]
            stp     q16, q16, [x9, #0x60]
G_M52317_IG02:
            sub     x2, x0, #8
            ldp     q16, q17, [x2, #0x10]
            stp     q16, q17, [fp, #0x10]
            ldp     q16, q17, [x2, #0x30]
            stp     q16, q17, [fp, #0x30]
            ldp     q16, q17, [x2, #0x50]
            stp     q16, q17, [fp, #0x50]
            ldp     q16, q17, [x2, #0x70]
            stp     q16, q17, [fp, #0x70]
G_M52317_IG03:
            cmp     w1, #16
            bhs     G_M52317_IG05
            add     x0, fp, #16	// [V02 loc0]
            ldr     x0, [x0, w1, UXTW #3]
G_M52317_IG04:
            ldp     fp, lr, [sp], #0x90
            ret     lr
G_M52317_IG05:
            bl      CORINFO_HELP_RNGCHKFAIL
            brk     #0
; Total bytes of code: 100

sfiruch added the tenet-performance Performance related issue label Dec 27, 2024

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 27, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 27, 2024

EgorBo removed the untriaged New issue has not been triaged by the area owner label Dec 27, 2024

EgorBo added this to the Future milestone Dec 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JIT] Codegen issue for `InlineArray` in `record` #110968

[JIT] Codegen issue for `InlineArray` in `record` #110968

sfiruch commented Dec 27, 2024

dotnet-policy-service bot commented Dec 27, 2024

EgorBo commented Dec 27, 2024 •

edited

Loading

[JIT] Codegen issue for InlineArray in record #110968

[JIT] Codegen issue for InlineArray in record #110968

Comments

sfiruch commented Dec 27, 2024

Overview

Good codegen, for class

Suboptimal codegen, for record class

Analysis

Workaround

Regression

dotnet-policy-service bot commented Dec 27, 2024

EgorBo commented Dec 27, 2024 • edited Loading

[JIT] Codegen issue for `InlineArray` in `record` #110968

[JIT] Codegen issue for `InlineArray` in `record` #110968

Good codegen, for `class`

Suboptimal codegen, for `record class`

EgorBo commented Dec 27, 2024 •

edited

Loading