Skip to content

iterator for_each performance regression #112911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
dgiger42 opened this issue Jun 21, 2023 · 19 comments
Open

iterator for_each performance regression #112911

dgiger42 opened this issue Jun 21, 2023 · 19 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@dgiger42
Copy link
Contributor

dgiger42 commented Jun 21, 2023

Code

I tried this code:

pub fn main() {
    let n = 100 as usize;
    println!("n: {}", n);
    let mut v = vec![0; n];

    for _ in 0..100_000_000 {
        v.iter_mut().for_each(|x| {
            *x = 4;
        })
    }
}

I expected to see this happen: Performance at least matching previous versions of Rust

Instead, this happened: On my machine, the execution time on nightly is around 2x as long as on Rust 1.69. The time reported by perf stat went from 0.35 to 0.67 seconds, and the instruction counts went up from 6.7 billion to 8.9 billion.

Version it worked on

It most recently worked on: 1.69

Version with regression

rustc --version --verbose:

rustc 1.72.0-nightly (46514218f 2023-06-20)
binary: rustc
commit-hash: 46514218f6f31ad3a1510ecc32af47e9e486c27d
commit-date: 2023-06-20
host: x86_64-unknown-linux-gnu
release: 1.72.0-nightly
LLVM version: 16.0.5


The regression also affects Rust 1.70, but the performance difference is not as big as on nightly.

Backtrace

Backtrace

<backtrace>

@dgiger42 dgiger42 added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Jun 21, 2023
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Jun 21, 2023
@Jules-Bertholet
Copy link
Contributor

@rustbot label I-slow

@rustbot rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jun 22, 2023
@saethlin
Copy link
Member

Looks to me like a change in the degree of loop unrolling.

@rustbot label +A-LLVM

@rustbot rustbot added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Jun 22, 2023
@apiraino
Copy link
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-high +T-compiler

@rustbot rustbot added P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Jun 22, 2023
@scottmcm
Copy link
Member

for_each is a fold, so cc #106343 @the8472

@the8472
Copy link
Member

the8472 commented Jun 22, 2023

The regression also affects Rust 1.70, but the performance difference is not as big as on nightly.

There may be multiple regressions.

@the8472
Copy link
Member

the8472 commented Jun 22, 2023

Not seeing any differences in the assembly between stable, beta and nightly. Only 1.69 produces more unrolling.

https://rust.godbolt.org/z/f4EqeGY9E

@dgiger42
Copy link
Contributor Author

I reran the benchmark on a bunch of different versions to confirm that there are two separate regressions: One from 1.69 to 1.70, and one from beta to nightly.

The one from 1.69 to 1.70 increases the time on my machine from 0.33 seconds to 0.48, and is the one that also changed the number of instructions. I suspect this is explained by the change in loop unrolling.

The regression from beta to nightly increased the average time from 0.48 seconds to 0.67, without any change to the number of instructions. I'm not sure what to make of this one.

Here are the commands I used to test this, along with the outputs from perf stat:

cargo +1.69 b -r  && perf stat -r10 target/release/fast-primes

Performance counter stats for 'target/release/fast-primes' (10 runs):

            328.21 msec task-clock                #    0.997 CPUs utilized            ( +-  0.12% )
                 2      context-switches          #    6.080 /sec                     ( +-130.39% )
                 0      cpu-migrations            #    0.000 /sec                   
                79      page-faults               #  240.166 /sec                     ( +-  0.77% )
     1,502,154,041      cycles                    #    4.567 GHz                      ( +-  0.03% )
     6,701,977,226      instructions              #    4.46  insn per cycle           ( +-  0.00% )
     1,300,390,144      branches                  #    3.953 G/sec                    ( +-  0.00% )
            11,383      branch-misses             #    0.00% of all branches          ( +-  5.04% )
     7,510,312,100      slots                     #   22.832 G/sec                    ( +-  0.01% )
     5,478,110,002      topdown-retiring          #     72.8% retiring                ( +-  0.13% )
       117,808,817      topdown-bad-spec          #      1.1% bad speculation         ( +- 15.84% )
       500,687,473      topdown-fe-bound          #      6.7% frontend bound          ( +-  2.12% )
     1,413,705,807      topdown-be-bound          #     19.4% backend bound           ( +-  0.83% )

          0.329231 +- 0.000408 seconds time elapsed  ( +-  0.12% )
cargo +1.70 b -r  && perf stat -r10 target/release/fast-primes

Performance counter stats for 'target/release/fast-primes' (10 runs):

            482.28 msec task-clock                #    0.999 CPUs utilized            ( +-  0.07% )
                 1      context-switches          #    2.073 /sec                     ( +- 24.94% )
                 0      cpu-migrations            #    0.000 /sec                   
                84      page-faults               #  174.159 /sec                     ( +-  0.51% )
     2,207,773,319      cycles                    #    4.577 GHz                      ( +-  0.00% )
     8,902,223,483      instructions              #    4.03  insn per cycle           ( +-  0.00% )
     2,000,434,702      branches                  #    4.148 G/sec                    ( +-  0.00% )
           280,880      branch-misses             #    0.01% of all branches          ( +-  0.02% )
    11,037,936,360      slots                     #   22.885 G/sec                    ( +-  0.00% )
     6,969,050,015      topdown-retiring          #     63.1% retiring                ( +-  0.00% )
        86,572,049      topdown-bad-spec          #      0.8% bad speculation         ( +-  0.00% )
     3,982,314,294      topdown-fe-bound          #     36.1% frontend bound          ( +-  0.00% )
                 0      topdown-be-bound          #      0.0% backend bound         

          0.482615 +- 0.000351 seconds time elapsed  ( +-  0.07% )
cargo +beta b -r  && perf stat -r10 target/release/fast-primes

Performance counter stats for 'target/release/fast-primes' (10 runs):

            482.02 msec task-clock                #    0.998 CPUs utilized            ( +-  0.07% )
                 2      context-switches          #    4.145 /sec                     ( +- 21.67% )
                 0      cpu-migrations            #    0.000 /sec                   
                82      page-faults               #  169.936 /sec                     ( +-  0.79% )
     2,207,399,550      cycles                    #    4.575 GHz                      ( +-  0.02% )
     8,902,207,469      instructions              #    4.03  insn per cycle           ( +-  0.00% )
     2,000,430,623      branches                  #    4.146 G/sec                    ( +-  0.00% )
           281,018      branch-misses             #    0.01% of all branches          ( +-  6.68% )
    11,036,997,750      slots                     #   22.873 G/sec                    ( +-  0.02% )
     6,968,457,402      topdown-retiring          #     63.1% retiring                ( +-  0.06% )
        86,564,688      topdown-bad-spec          #      0.8% bad speculation         ( +-  5.00% )
     3,981,975,658      topdown-fe-bound          #     36.1% frontend bound          ( +-  0.02% )
                 0      topdown-be-bound          #      0.0% backend bound         

          0.482823 +- 0.000388 seconds time elapsed  ( +-  0.08% )
cargo +nightly b -r  && perf stat -r10 target/release/fast-primes

Performance counter stats for 'target/release/fast-primes' (10 runs):

            666.67 msec task-clock                #    0.989 CPUs utilized            ( +-  0.42% )
                 7      context-switches          #   10.386 /sec                     ( +- 10.00% )
                 0      cpu-migrations            #    0.000 /sec                   
                81      page-faults               #  120.177 /sec                     ( +-  0.72% )
     3,052,452,237      cycles                    #    4.529 GHz                      ( +-  0.44% )
     8,902,480,122      instructions              #    2.89  insn per cycle           ( +-  0.00% )
     2,000,480,403      branches                  #    2.968 G/sec                    ( +-  0.00% )
            12,684      branch-misses             #    0.00% of all branches          ( +-542.86% )
    15,262,057,470      slots                     #   22.644 G/sec                    ( +-  0.42% )
     6,942,739,868      topdown-retiring          #     45.1% retiring                ( +-  0.11% )
       119,702,411      topdown-bad-spec          #      0.7% bad speculation         ( +- 11.78% )
     8,199,615,189      topdown-fe-bound          #     54.1% frontend bound          ( +-  0.80% )
                 0      topdown-be-bound          #      0.0% backend bound         

           0.67430 +- 0.00282 seconds time elapsed  ( +-  0.42% )

@saethlin
Copy link
Member

What is "the benchmark" here? I see your executable is called fast-primes, and the code in the issue description has nothing to do with primes. I cannot reproduce any significant runtime changes of the code in the issue description across Rust versions.

@dgiger42
Copy link
Contributor Author

What is "the benchmark" here? I see your executable is called fast-primes, and the code in the issue description has nothing to do with primes. I cannot reproduce any significant runtime changes of the code in the issue description across Rust versions.

I was referring to the example code at the beginning of the issue. The executable is called fast-primes because the example is minimized from a prime sieve I've been working on for a while, which I noticed got slower with newer versions of Rust. Perhaps it's only a regression on some CPUs? In case it matters, I'm using an 11th gen Intel CPU.

@saethlin
Copy link
Member

Yeah I definitely have a different CPU, a 3970X. That's unfortunate that this is so CPU-dependent.

@the8472
Copy link
Member

the8472 commented Jun 22, 2023

The assembly is produced for the x86-64 baseline by default and I don't know what kind of tuning that implies. What happens if you use -Ctarget-cpu=native?

@the8472
Copy link
Member

the8472 commented Jun 22, 2023

can you check with objdump or the assembly view in perf report whether it's actually producing the same assembly as godbolt? Minor changes can affect codegen.

@dgiger42
Copy link
Contributor Author

The assembly is produced for the x86-64 baseline by default and I don't know what kind of tuning that implies. What happens if you use -Ctarget-cpu=native?

With -Ctarget-cpu=native, 1.69 gets slower and nightly gets faster, so that they're the same speed, although now the nightly version is fewer instructions.

These benchmark results are particularly surprising to me, because when I originally noticed a regression in my (way more complicated) prime sieve on newer versions of Rust, I was already compiling with target-cpu=native. I had stopped doing that specifically for this issue in hopes that it would make the issue more reproducible.

RUSTFLAGS='-Ctarget-cpu=native' cargo +1.69 b -r  && perf stat -r10 target/release/fast-primes

  Performance counter stats for 'target/release/fast-primes' (10 runs):

            444.11 msec task-clock                #    0.997 CPUs utilized            ( +-  0.31% )
                 2      context-switches          #    4.492 /sec                     ( +- 16.67% )
                 0      cpu-migrations            #    0.000 /sec                   
                82      page-faults               #  184.166 /sec                     ( +-  0.45% )
     2,002,695,022      cycles                    #    4.498 GHz                      ( +-  0.06% )
     5,702,163,026      instructions              #    2.85  insn per cycle           ( +-  0.00% )
     1,400,423,677      branches                  #    3.145 G/sec                    ( +-  0.00% )
            11,923      branch-misses             #    0.00% of all branches          ( +-  3.50% )
    10,011,089,765      slots                     #   22.484 G/sec                    ( +-  0.05% )
     4,475,546,012      topdown-retiring          #     44.7% retiring                ( +-  0.05% )
        78,518,351      topdown-bad-spec          #      1.1% bad speculation         ( +- 15.00% )
       667,405,984      topdown-fe-bound          #      7.2% frontend bound          ( +-  1.89% )
     4,789,619,416      topdown-be-bound          #     47.1% backend bound           ( +-  0.29% )

           0.44554 +- 0.00143 seconds time elapsed  ( +-  0.32% )
RUSTFLAGS='-Ctarget-cpu=native' cargo +nightly b -r  && perf stat -r10 target/release/fast-primes

 Performance counter stats for 'target/release/fast-primes' (10 runs):

            447.01 msec task-clock                #    1.006 CPUs utilized            ( +-  0.14% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                82      page-faults               #  184.714 /sec                     ( +-  0.66% )
     2,002,370,480      cycles                    #    4.511 GHz                      ( +-  0.00% )
     4,752,149,548      instructions              #    2.37  insn per cycle           ( +-  0.00% )
     1,000,420,577      branches                  #    2.254 G/sec                    ( +-  0.00% )
            11,668      branch-misses             #    0.00% of all branches          ( +-  1.39% )
    10,007,696,855      slots                     #   22.543 G/sec                    ( +-  0.01% )
     3,767,603,521      topdown-retiring          #     37.5% retiring                ( +-  0.15% )
        39,245,870      topdown-bad-spec          #      1.1% bad speculation         ( +- 53.36% )
        78,491,740      topdown-fe-bound          #      0.5% frontend bound          ( +-  7.21% )
     6,122,355,723      topdown-be-bound          #     60.8% backend bound           ( +-  0.12% )

          0.444332 +- 0.000606 seconds time elapsed  ( +-  0.14% )

@dgiger42
Copy link
Contributor Author

can you check with objdump or the assembly view in perf report whether it's actually producing the same assembly as godbolt? Minor changes can affect codegen.

Here's the assembly from main, produced by perf report -Mintel. I think there's a significant difference between this and the version on Godbolt, although I'm not positive.

Percent│                                                                                                                                                                                                                 ◆
       │                                                                                                                                                                                                                 ▒
       │                                                                                                                                                                                                                 ▒
       │     Disassembly of section .text:                                                                                                                                                                               ▒
       │                                                                                                                                                                                                                 ▒
       │     00000000000088a0 <fast_primes::main>:                                                                                                                                                                       ▒
       │     _ZN11fast_primes4main17h635d4793fd95f7dbE():                                                                                                                                                                ▒
       │       push   r15                                                                                                                                                                                                ▒
       │       push   r14                                                                                                                                                                                                ▒
       │       push   rbx                                                                                                                                                                                                ▒
       │       sub    rsp,0x50                                                                                                                                                                                           ▒
       │       mov    QWORD PTR [rsp+0x8],0x64                                                                                                                                                                           ▒
       │       lea    rax,[rsp+0x8]                                                                                                                                                                                      ▒
       │       mov    QWORD PTR [rsp+0x10],rax                                                                                                                                                                           ▒
       │       lea    rax,[rip+0x3905d]                                                                                                                                                                                  ▒
       │       mov    QWORD PTR [rsp+0x18],rax                                                                                                                                                                           ▒
       │       lea    rax,[rip+0x48911]                                                                                                                                                                                  ▒
       │       mov    QWORD PTR [rsp+0x20],rax                                                                                                                                                                           ▒
       │       mov    QWORD PTR [rsp+0x28],0x2                                                                                                                                                                           ▒
       │       mov    QWORD PTR [rsp+0x40],0x0                                                                                                                                                                           ▒
       │       lea    rax,[rsp+0x10]                                                                                                                                                                                     ▒
       │       mov    QWORD PTR [rsp+0x30],rax                                                                                                                                                                           ▒
       │       mov    QWORD PTR [rsp+0x38],0x1                                                                                                                                                                           ▒
       │       lea    rdi,[rsp+0x20]                                                                                                                                                                                     ▒
       │     → call   QWORD PTR [rip+0x4b48c]        # 53d90 <_GLOBAL_OFFSET_TABLE_+0x4a8>                                                                                                                               ▒
       │       mov    r15,QWORD PTR [rsp+0x8]                                                                                                                                                                            ▒
       │       test   r15,r15                                                                                                                                                                                            ▒
       │     ↓ je     150                                                                                                                                                                                                ▒
       │       xor    r14d,r14d                                                                                                                                                                                          ▒
       │       mov    rax,r15                                                                                                                                                                                            ▒
       │       shr    rax,0x3d                                                                                                                                                                                           ▒
       │       sete   al                                                                                                                                                                                                 ▒
       │     ↓ jne    174                                                                                                                                                                                                ▒
       │       lea    rbx,[r15*4+0x0]                                                                                                                                                                                    ▒
       │       mov    r14b,al                                                                                                                                                                                            ▒
       │       shl    r14,0x2                                                                                                                                                                                            ▒
       │       test   rbx,rbx                                                                                                                                                                                            ▒
       │     ↓ je     15a                                                                                                                                                                                                ▒
       │       mov    rdi,rbx                                                                                                                                                                                            ▒
       │       mov    rsi,r14                                                                                                                                                                                            ▒
       │     → call   QWORD PTR [rip+0x4b217]        # 53b60 <_GLOBAL_OFFSET_TABLE_+0x278>                                                                                                                               ▒
       │       mov    rdi,rax                                                                                                                                                                                            ▒
       │       test   rdi,rdi                                                                                                                                                                                            ▒
       │     ↓ je     166                                                                                                                                                                                                ▒
       │ b5:   lea    rax,[rdi+r15*4]                                                                                                                                                                                    ▒
       │       dec    r15                                                                                                                                                                                                ▒
       │       movabs rcx,0x3fffffffffffffff                                                                                                                                                                             ▒
       │       and    rcx,r15                                                                                                                                                                                            ▒
       │       lea    rdx,[rcx+0x1]                                                                                                                                                                                      ▒
       │       mov    rsi,rdx                                                                                                                                                                                            ▒
       │       and    rsi,0xfffffffffffffff8                                                                                                                                                                             ▒
       │       lea    r8,[rdi+rsi*4]                                                                                                                                                                                     ▒
       │       xor    r9d,r9d                                                                                                                                                                                            ▒
       │       movaps xmm0,XMMWORD PTR [rip+0x3967e]                                                                                                                                                                     ▒
       │     ↓ jmp    fc                                                                                                                                                                                                 ▒
       │       data16 data16 nop WORD PTR cs:[rax+rax*1+0x0]                                                                                                                                                             ▒
       │ f0:   inc    r9d                                                                                                                                                                                                ▒
       │       cmp    r9d,0x5f5e100                                                                                                                                                                                      ▒
       │     ↓ je     142                                                                                                                                                                                                ▒
       │ fc:   mov    r10,rdi                                                                                                                                                                                            ▒
  3.58 │       cmp    rcx,0x7                                                                                                                                                                                            ▒
       │     ↓ jb     130                                                                                                                                                                                                ▒
       │       xor    r10d,r10d                                                                                                                                                                                          ▒
       │       nop                                                                                                                                                                                                       ▒
 32.08 │110:   movups XMMWORD PTR [rdi+r10*4],xmm0                                                                                                                                                                       ▒
  3.18 │       movups XMMWORD PTR [rdi+r10*4+0x10],xmm0                                                                                                                                                                  ▒
       │       add    r10,0x8                                                                                                                                                                                            ▒
  3.23 │       cmp    rsi,r10                                                                                                                                                                                            ▒
 36.97 │     ↑ jne    110                                                                                                                                                                                                ▒
       │       mov    r10,r8                                                                                                                                                                                             ▒
       │       cmp    rdx,rsi                                                                                                                                                                                            ▒
       │     ↑ je     f0                                                                                                                                                                                                 ▒
       │       nop                                                                                                                                                                                                       ▒
  8.60 │130:   mov    DWORD PTR [r10],0x4                                                                                                                                                                                ▒
       │       add    r10,0x4                                                                                                                                                                                            ▒
       │       cmp    r10,rax                                                                                                                                                                                            ▒
  9.25 │     ↑ jne    130                                                                                                                                                                                                ▒
  3.11 │     ↑ jmp    f0                                                                                                                                                                                                 ▒
       │142:   mov    edx,0x4                                                                                                                                                                                            ▒
       │       mov    rsi,rbx                                                                                                                                                                                            ▒
       │     → call   QWORD PTR [rip+0x4b160]        # 53b50 <_GLOBAL_OFFSET_TABLE_+0x268>                                                                                                                               ▒
       │150:   add    rsp,0x50                                                                                                                                                                                           ▒
       │       pop    rbx                                                                                                                                                                                                ▒
       │       pop    r14                                                                                                                                                                                                ▒
       │       pop    r15                                                                                                                                                                                                ▒
       │     ← ret                                                                                                                                                                                                       ▒
       │15a:   mov    rdi,r14                                                                                                                                                                                            ▒
       │       test   rdi,rdi                                                                                                                                                                                            ▒
       │     ↑ jne    b5                                                                                                                                                                                                 ▒
       │166:   mov    rdi,r14                                                                                                                                                                                            ▒
       │       mov    rsi,rbx                                                                                                                                                                                            ▒
       │     → call   QWORD PTR [rip+0x4b4e6]        # 53ef8 <_GLOBAL_OFFSET_TABLE_+0x610>                                                                                                                               ▒
       │       ud2                                                                                                                                                                                                       ▒
       │174: → call   QWORD PTR [rip+0x4b4c6]        # 53ee0 <_GLOBAL_OFFSET_TABLE_+0x5f8>                                                                                                                               ▒
       │       ud2                                                                                                                                                                                                       ```

@dgiger42
Copy link
Contributor Author

I tried extracting the relevant code to a function and outlining it to make the assembly more readable, and outlining it somehow made it as fast on nightly as on beta (still slower than 1.69), without changing the instruction counts.

#[inline(never)]  // outlining this makes nightly as fast as beta
pub fn do_stuff(v: &mut Vec<i32>) {
    for _ in 0..100_000_000 {
        v.iter_mut().for_each(|x| {
            *x = 4;
        })
    }
}

fn main() {
    let n = 100 as usize;
    println!("n: {}", n);
    let mut v = vec![0; n];

    do_stuff(&mut v);
}

@the8472
Copy link
Member

the8472 commented Jun 23, 2023

If the instructions counts stay the same but cycle counts change you'll want to do perf record -e cycles,<additional counters> <command> and see where things differ. If the assembly of the hot loop doesn't change at all it might be something more obscure metrics like execution port usage, cache misses, alignment, branch predictor misses or whatever.
I see both IPC and pipeline slot usage changing in your x86-64 baseline perf stat results.
Maybe intel vtune can provide more insight?

@apiraino apiraino added P-medium Medium priority and removed P-high High priority labels Jun 23, 2023
@nikic
Copy link
Contributor

nikic commented Jun 24, 2023

The difference in unrolling here is because LLVM 16 will no longer unroll loops that have been vectorized. The vectorizer is responsible for interleaving if it finds it profitable.

I haven't looked into why the vectorizer decides that further interleaving is not profitable, but based on @saethlin's report that this not profitable for the znver2 architecture, it may well be a reasonable choice when targeting -C target-cpu=generic.

@the8472
Copy link
Member

the8472 commented Jan 29, 2024

for_each is a fold, so cc #106343 @the8472

Actually, slice::Iter has its own for_each impl so it doesn't inherit the fold changes. I'll check if the same optimizations would also help for_each.

@the8472
Copy link
Member

the8472 commented Jan 30, 2024

#120477 only had some miniscule differences and didn't affect the unrolling, so it doesn't seem worth it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants