
Iterator inlining/optimization regression in 1.72 release #115601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
alexpyattaev opened this issue Sep 6, 2023 · 5 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-heavy Issue: Problems and improvements with respect to binary size of generated code. P-medium Medium priority regression-untriaged Untriaged performance or correctness regression. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@alexpyattaev

Code

I tried this code to showcase zero-cost abstractions, and was quite surprised to see a whole bunch of asm compared to the hand-rolled version:

pub fn some_iterators_nonsense(z: &[i64]) -> i64 {
    z.iter()
        .map(|e| e * e + 3)
        .filter(|e| e % 2 == 0)
        .sum()
}

The hand-rolled version generates far less asm:

pub fn some_nonsense(z: &[i64]) -> i64 {
    let mut acc = 0;
    for e in z {
        let t = e * e + 3;
        if t % 2 == 0 {
            acc += t;
        }
    }
    acc
}

According to godbolt, the iterator version generates roughly 3x more asm than the hand-rolled version. Removing either the map() or the filter() step (and updating the hand-rolled version accordingly) makes the issue go away.
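For reference, here is a sketch of the reduced variants described above, which do not exhibit the asm blow-up. The function names are my own, not from the original report:

```rust
// Reduced variants used to narrow down the trigger; names are illustrative.

// map + sum alone does not show the size regression.
pub fn map_only(z: &[i64]) -> i64 {
    z.iter().map(|e| e * e + 3).sum()
}

// filter + sum alone also stays small.
pub fn filter_only(z: &[i64]) -> i64 {
    z.iter().copied().filter(|e| e % 2 == 0).sum()
}
```

Only the combination of map, filter, and sum appears to trigger the extra code.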

Version it worked on

It most recently worked on: Rust 1.71.
It appears to have worked for many releases before that as well.

Version with regression

Regression observed on

  • Rust 1.72
  • Rust nightly

Godbolt

The exact code to reproduce is on Matt Godbolt's site:
https://godbolt.org/z/6jb8K3adK

@alexpyattaev alexpyattaev added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Sep 6, 2023
@rustbot rustbot added needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Sep 6, 2023
@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-heavy Issue: Problems and improvements with respect to binary size of generated code. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Sep 6, 2023
@boulanlo

boulanlo commented Sep 6, 2023

searched nightlies: from nightly-2023-05-26 to nightly-2023-07-07
regressed nightly: nightly-2023-06-16
searched commit range: 8c74a5d...114fb86
regressed commit: 4996b56

bisected with cargo-bisect-rustc v0.6.5

Host triple: x86_64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --start=2023-05-26 --end=2023-07-07 --script script.sh --preserve -vv 

Related script.sh (might need a bit of adjustment to reproduce, especially the asm size limit; I just used the size of the output pre-regression):

#!/usr/bin/env bash

set -e

cargo +"$RUSTUP_TOOLCHAIN" rustc --release -- --emit asm
test $(cat target-$RUSTUP_TOOLCHAIN/x86_64-unknown-linux-gnu/release/deps/*.s | wc -l) -le 34

It seems the related merge request (#106343) directly affects fold(), which sum() uses. I hope this bisection is not misleading due to poor heuristics! I checked manually: the difference between before and after that commit matches what I see on godbolt, so that should be OK. I'm still not 100% sure it's correct, though, as this is my first bisection on asm output! :)
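For background on why a change to fold() matters here: Iterator::sum() for integers reduces to a fold with addition, so fold's codegen feeds directly into this code path. A rough sketch of the equivalence (not the actual libcore implementation):

```rust
// .sum() on an integer iterator boils down to a fold with `+`;
// this hand-written fold takes the same code path that the
// bisected commit (#106343) affects.
pub fn sum_via_fold(z: &[i64]) -> i64 {
    z.iter()
        .map(|e| e * e + 3)
        .filter(|e| e % 2 == 0)
        .fold(0i64, |acc, t| acc + t)
}
```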

@alexpyattaev
Author

Thanks for taking care of this! This is probably not killing the performance of any particular app (or maybe it is?), but fixing it would go a long way towards keeping binary sizes under control.

Some additional info related to specifying -C target-cpu=native that might be helpful:

  • Specifying -C target-cpu=native with rustc 1.72 dramatically flips the results: the iterator version now generates smaller asm than the imperative version. The imperative version balloons to a whopping 520 instructions, which is clearly too many for the function in question.
  • Specifying -C target-cpu=native on 1.71 produces identical native code for both versions, as expected.
  • To the best of my understanding of AVX assembly, -C target-cpu=native on 1.72 produces asm identical to 1.71 for the iterator version of the program.

@the8472
Member

the8472 commented Sep 6, 2023

Have you benchmarked it? It looks like it tries to unroll the loop, doing two steps at a time, and I think also skipping over odd items. So it might be faster than the simple version, unless it's a misguided unroll, which happens.
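For readers unfamiliar with what such an unroll looks like, here is a rough hand-written sketch of a two-elements-per-iteration version of the simple loop. This is purely illustrative, not what LLVM actually emits:

```rust
// Hand-written two-way unroll of the simple loop, for illustration only.
pub fn some_nonsense_unrolled(z: &[i64]) -> i64 {
    let mut acc = 0;
    let mut chunks = z.chunks_exact(2);
    for pair in &mut chunks {
        // Process two elements per iteration, as an unrolled loop would.
        for &e in pair {
            let t = e * e + 3;
            if t % 2 == 0 {
                acc += t;
            }
        }
    }
    // Handle a possible trailing element when the length is odd.
    for &e in chunks.remainder() {
        let t = e * e + 3;
        if t % 2 == 0 {
            acc += t;
        }
    }
    acc
}
```

An unroll like this trades extra code size (the duplicated loop body plus remainder handling) for fewer branches per element, which is exactly the trade-off being questioned here.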

@apiraino
Contributor

apiraino commented Sep 7, 2023

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Sep 7, 2023
@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Sep 7, 2023
@alexpyattaev
Author

alexpyattaev commented Sep 7, 2023

> Have you benchmarked it? It looks like it tries to unroll the loop, doing two steps at a time, and I think also skipping over odd items. So it might be faster than the simple version, unless it's a misguided unroll, which happens.

AFAIK there is no appreciable difference in performance between the versions, across compiler releases or with respect to the target-cpu flag, so the issue is primarily one of binary size.
I've also tried enabling and disabling inlining; it changes very little.

Chances are good that you're right and it is indeed a misguided unroll.

Below is the crude bench I used to verify that.

//#[inline(never)]
pub fn some_iterators_nonsense(z: &[i64]) -> i64 {
    z.iter()
        .map(|e| e * e + 3)
        .filter(|e| e % 2 == 0)
        .sum()
}

//#[inline(never)]
pub fn some_nonsense(z: &[i64]) -> i64 {
    let mut acc = 0;
    for e in z {
        let t = e * e + 3;
        if t % 2 == 0 {
            acc += t;
        }
    }
    acc
}

fn main() {
    let mut data: Vec<i64> = (1..50000).collect();
    const N: usize = 10000;

    let s = std::time::Instant::now();
    let sum1: i64 = (0..N)
        .map(|e| {
            data[e] += (e * 2) as i64;
            some_nonsense(&data)
        })
        .sum();
    let d = s.elapsed();
    println!("imperative {d:?}, {sum1}");

    let s = std::time::Instant::now();
    let sum1: i64 = (0..N)
        .map(|e| {
            data[e] += (e * 2) as i64;
            some_iterators_nonsense(&data)
        })
        .sum();
    let d = s.elapsed();
    println!("iterators {d:?}, {sum1}");
}
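As a sanity check to go with the benchmark, the two versions can be asserted to agree on a range of inputs. This check is my addition, not part of the original report:

```rust
pub fn some_iterators_nonsense(z: &[i64]) -> i64 {
    z.iter().map(|e| e * e + 3).filter(|e| e % 2 == 0).sum()
}

pub fn some_nonsense(z: &[i64]) -> i64 {
    let mut acc = 0;
    for e in z {
        let t = e * e + 3;
        if t % 2 == 0 {
            acc += t;
        }
    }
    acc
}

fn main() {
    // Verify the two implementations agree on slices of length 0..100,
    // so any codegen difference is size/speed only, not semantics.
    for n in 0..100i64 {
        let data: Vec<i64> = (1..=n).collect();
        assert_eq!(some_iterators_nonsense(&data), some_nonsense(&data));
    }
}
```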

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023

7 participants