iterator for_each performance regression #112911
Comments
@rustbot label I-slow |
Looks to me like a change in the degree of loop unrolling. @rustbot label +A-LLVM |
WG-prioritization assigning priority (Zulip discussion). @rustbot label -I-prioritize +P-high +T-compiler |
There may be multiple regressions. |
Not seeing any differences in the assembly between stable, beta and nightly. Only 1.69 produces more unrolling. |
I reran the benchmark on a bunch of different versions to confirm that there are two separate regressions: One from 1.69 to 1.70, and one from beta to nightly. The one from 1.69 to 1.70 increases the time on my machine from 0.33 seconds to 0.48, and is the one that also changed the number of instructions. I suspect this is explained by the change in loop unrolling. The regression from beta to nightly increased the average time from 0.48 seconds to 0.67, without any change to the number of instructions. I'm not sure what to make of this one. Here are the commands I used to test this, along with the outputs from
|
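To put the two regressions side by side, the timings quoted above (0.33 s, 0.48 s and 0.67 s) can be turned into slowdown factors. The following is only a small arithmetic sketch; the helper name `slowdown` is made up for illustration and is not part of any tool used in this thread.

```rust
/// Hypothetical helper: how many times longer the new run takes than the old one.
pub fn slowdown(old_secs: f64, new_secs: f64) -> f64 {
    new_secs / old_secs
}

fn main() {
    // Wall-clock times reported in the comment above, in seconds.
    let t_1_69 = 0.33;
    let t_1_70_and_beta = 0.48;
    let t_nightly = 0.67;

    println!("1.69 -> 1.70:    {:.2}x", slowdown(t_1_69, t_1_70_and_beta));
    println!("beta -> nightly: {:.2}x", slowdown(t_1_70_and_beta, t_nightly));
    println!("1.69 -> nightly: {:.2}x", slowdown(t_1_69, t_nightly));
}
```

On these numbers each step is a slowdown of roughly 1.4x to 1.45x, and the two together account for the roughly 2x figure in the original report.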
What is "the benchmark" here? I see your executable is called |
I was referring to the example code at the beginning of the issue. The executable is called |
Yeah I definitely have a different CPU, a 3970X. That's unfortunate that this is so CPU-dependent. |
The assembly is produced for the x86-64 baseline by default and I don't know what kind of tuning that implies. What happens if you use |
can you check with objdump or the assembly view in |
With These benchmark results are particularly surprising to me, because when I originally noticed a regression in my (way more complicated) prime sieve on newer versions of Rust, I was already compiling with
|
Here's the assembly from
|
I tried extracting the relevant code to a function and outlining it to make the assembly more readable, and outlining it somehow made it as fast on nightly as on beta (still slower than 1.69), without changing the instruction counts.

    #[inline(never)] // outlining this makes nightly as fast as beta
    pub fn do_stuff(v: &mut Vec<i32>) {
        for _ in 0..100_000_000 {
            v.iter_mut().for_each(|x| {
                *x = 4;
            })
        }
    }

    fn main() {
        let n = 100 as usize;
        println!("n: {}", n);
        let mut v = vec![0; n];
        do_stuff(&mut v);
    }
|
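For anyone without `perf` at hand, a rough in-process timing of the same loop might look like the sketch below. It is an assumption-laden variant, not the code that was measured in this thread: the repeat count is made a parameter, and the names `fill_repeatedly` and `time_fill` are invented here. Wall-clock numbers from `Instant` are also noisier than `perf stat` and say nothing about instruction counts.

```rust
use std::time::{Duration, Instant};

/// Same inner loop as `do_stuff` above, with the repeat count as a parameter
/// (hypothetical name; the original hard-codes 100_000_000).
#[inline(never)]
pub fn fill_repeatedly(v: &mut Vec<i32>, repeats: u64) {
    for _ in 0..repeats {
        v.iter_mut().for_each(|x| {
            *x = 4;
        });
    }
}

/// Builds a zeroed vector of length `n`, runs the loop, and returns the
/// elapsed wall-clock time together with the final vector.
pub fn time_fill(n: usize, repeats: u64) -> (Duration, Vec<i32>) {
    let mut v = vec![0; n];
    let start = Instant::now();
    fill_repeatedly(&mut v, repeats);
    (start.elapsed(), v)
}

fn main() {
    // Smaller repeat count than the original so the sketch finishes quickly.
    let (elapsed, v) = time_fill(100, 1_000_000);
    println!("n: {}, elapsed: {:?}", v.len(), elapsed);
}
```

Building this with `--release` on each toolchain of interest and comparing the printed times gives a first impression, but the `perf stat` numbers quoted in the thread remain the more reliable measurement.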
If the instructions counts stay the same but cycle counts change you'll want to do |
The difference in unrolling here is because LLVM 16 will no longer unroll loops that have been vectorized. The vectorizer is responsible for interleaving if it finds it profitable. I haven't looked into why the vectorizer decides that further interleaving is not profitable, but based on @saethlin's report that this is not profitable for the znver2 architecture, it may well be a reasonable choice when targeting |
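As a source-level analogy for what "interleaving" means here, the sketch below contrasts a plain fill loop with a hand-interleaved one that issues four independent stores per iteration. This is only an illustration: LLVM interleaves vector operations in its own IR, the factor of 4 is an arbitrary choice, and the function names are hypothetical. It is not a suggested fix for the regression.

```rust
/// Plain loop: one store per iteration, as in the code from this issue.
pub fn fill_simple(v: &mut [i32], value: i32) {
    for x in v.iter_mut() {
        *x = value;
    }
}

/// Hand-interleaved variant: four independent stores per iteration,
/// followed by a scalar loop over the leftover elements.
pub fn fill_interleaved(v: &mut [i32], value: i32) {
    let mut chunks = v.chunks_exact_mut(4);
    for chunk in &mut chunks {
        chunk[0] = value;
        chunk[1] = value;
        chunk[2] = value;
        chunk[3] = value;
    }
    // Elements that did not fit into a full chunk of four.
    for x in chunks.into_remainder() {
        *x = value;
    }
}

fn main() {
    let mut a = vec![0; 103];
    let mut b = vec![0; 103];
    fill_simple(&mut a, 4);
    fill_interleaved(&mut b, 4);
    println!("results match: {}", a == b);
}
```

Both functions produce the same result; whether the interleaved shape is faster depends on the target CPU, which is consistent with the CPU-dependent results reported above.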
#120477 only had some minuscule differences and didn't affect the unrolling, so it doesn't seem worth it. |
Code
I tried this code:
I expected to see this happen: Performance at least matching previous versions of Rust
Instead, this happened: On my machine, the execution time on nightly is around 2x as long as on Rust 1.69. The time reported by perf stat went from 0.35 to 0.67 seconds, and the instruction counts went up from 6.7 billion to 8.9 billion.
Version it worked on
It most recently worked on: 1.69
Version with regression
rustc --version --verbose:
The regression also affects Rust 1.70, but the performance difference is not as big as on nightly.
Backtrace