Iterator inlining/optimization regression in 1.72 release #115601
Comments
searched nightlies: from nightly-2023-05-26 to nightly-2023-07-07
bisected with cargo-bisect-rustc v0.6.5
Host triple: x86_64-unknown-linux-gnu

```
cargo bisect-rustc --start=2023-05-26 --end=2023-07-07 --script script.sh --preserve -vv
```

script.sh:

```bash
#!/usr/bin/env bash
set -e
cargo +"$RUSTUP_TOOLCHAIN" rustc --release -- --emit asm
test $(cat target-$RUSTUP_TOOLCHAIN/x86_64-unknown-linux-gnu/release/deps/*.s | wc -l) -le 34
```

Seems like the related merge request (#106343) directly affects […]
Thanks for taking care of this! I know this is probably not killing the performance of any particular app (or maybe it is?), but fixing it would go a long way toward keeping binary sizes under control. Some additional info related to specifying -C target-cpu=native that might be helpful: […]
Have you benchmarked it? It looks like it tries to unroll the loop, doing two steps at a time and also skipping over odd items, I think. So it might be faster than the simple version; unless it's a misguided unroll, which happens.
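For illustration, a manual two-way unroll of the same reduction might look like the sketch below. This is a hypothetical reconstruction of the kind of transformation the optimizer may be attempting, not the actual generated code; the function name and structure are invented:

```rust
// Hypothetical sketch of a 2x-unrolled sum, illustrating the kind of
// transformation the optimizer may be applying. Not the actual codegen.
pub fn sum_unrolled(z: &[i64]) -> i64 {
    let mut acc0 = 0;
    let mut acc1 = 0;
    let mut chunks = z.chunks_exact(2);
    // Process two elements per iteration, accumulating into two
    // independent accumulators to break the dependency chain.
    for pair in &mut chunks {
        let t0 = pair[0] * pair[0] + 3;
        let t1 = pair[1] * pair[1] + 3;
        if t0 % 2 == 0 { acc0 += t0; }
        if t1 % 2 == 0 { acc1 += t1; }
    }
    // Handle the leftover element when the slice length is odd.
    let mut acc = acc0 + acc1;
    for &e in chunks.remainder() {
        let t = e * e + 3;
        if t % 2 == 0 { acc += t; }
    }
    acc
}

fn main() {
    let data: Vec<i64> = (1..10).collect();
    // e*e + 3 is even exactly when e is odd: 1→4, 3→12, 5→28, 7→52, 9→84
    println!("{}", sum_unrolled(&data)); // prints 180
}
```

The extra prologue, epilogue, and remainder handling in a shape like this would account for a good share of the additional asm even when the hot loop itself is faster.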
WG-prioritization assigning priority (Zulip discussion). @rustbot label -I-prioritize +P-medium
AFAIK there is no appreciable difference in performance between the versions, either across compiler releases or w.r.t. the target-cpu flag, so the issue is primarily binary size. Chances are good that you're right and it is indeed a misguided unroll. Below is the crude bench I used to verify that.

```rust
//#[inline(never)]
pub fn some_iterators_nonsense(z: &[i64]) -> i64 {
    z.iter()
        .map(|e| e * e + 3)
        .filter(|e| e % 2 == 0)
        .sum()
}

//#[inline(never)]
pub fn some_nonsense(z: &[i64]) -> i64 {
    let mut acc = 0;
    for e in z {
        let t = e * e + 3;
        if t % 2 == 0 {
            acc += t;
        }
    }
    acc
}

fn main() {
    let mut data = vec![];
    for i in 1i64..50000 {
        data.push(i);
    }
    const N: usize = 10000;

    let s = std::time::Instant::now();
    let sum1: i64 = (0..N)
        .map(|e| {
            // Mutate the input each round so the call cannot be hoisted.
            data[e] += (e * 2) as i64;
            some_nonsense(&data)
        })
        .sum();
    let d = s.elapsed();
    println!("imperative {d:?}, {sum1}");

    let s = std::time::Instant::now();
    let sum1: i64 = (0..N)
        .map(|e| {
            data[e] += (e * 2) as i64;
            some_iterators_nonsense(&data)
        })
        .sum();
    let d = s.elapsed();
    println!("iterators {d:?}, {sum1}");
}
```
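One caveat with a crude bench like this: the optimizer can still specialize or fold away parts of the measured calls. A minimal sketch using `std::hint::black_box` (stable since Rust 1.66) to keep the input opaque to the optimizer; the iteration count and structure here are illustrative:

```rust
use std::hint::black_box;
use std::time::Instant;

fn some_nonsense(z: &[i64]) -> i64 {
    let mut acc = 0;
    for e in z {
        let t = e * e + 3;
        if t % 2 == 0 {
            acc += t;
        }
    }
    acc
}

fn main() {
    let data: Vec<i64> = (1..50000).collect();
    let s = Instant::now();
    let mut sum = 0i64;
    for _ in 0..100 {
        // black_box hides the input from the optimizer so the call
        // cannot be hoisted or constant-folded out of the loop.
        sum += some_nonsense(black_box(&data));
    }
    println!("{:?} sum={sum}", s.elapsed());
}
```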
Code

I tried this code to showcase zero-cost abstractions, and was quite surprised to see a whole pile of asm compared to the hand-rolled version. According to Godbolt, the iterator version generates roughly 3x more asm than the hand-rolled version. Removing the map() operation or the filter() operation (and updating the hand-rolled version accordingly) makes the issue go away.
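Since the two versions compute the same thing on every input, the asm difference is purely a codegen artifact rather than a semantic one. A quick sanity check of that equivalence (function bodies copied from the reproduction above):

```rust
pub fn some_iterators_nonsense(z: &[i64]) -> i64 {
    z.iter().map(|e| e * e + 3).filter(|e| e % 2 == 0).sum()
}

pub fn some_nonsense(z: &[i64]) -> i64 {
    let mut acc = 0;
    for e in z {
        let t = e * e + 3;
        if t % 2 == 0 {
            acc += t;
        }
    }
    acc
}

fn main() {
    // Both versions must agree on every input; the regression is only
    // about how much code the compiler emits, not about the result.
    let data: Vec<i64> = (1..100).collect();
    assert_eq!(some_nonsense(&data), some_iterators_nonsense(&data));
    println!("results match: {}", some_nonsense(&data));
}
```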
Version it worked on
It most recently worked on: Rust 1.71
It also appears to have worked for many releases before that as well.
Version with regression

Regression observed on Rust 1.72.
Godbolt
The exact code to reproduce is on Matt Godbolt's site:
https://godbolt.org/z/6jb8K3adK