Add fold_mut alternative to Iterator trait #76746

Closed
wants to merge 3 commits
93 changes: 93 additions & 0 deletions library/core/benches/iter.rs
@@ -345,3 +345,96 @@ fn bench_partial_cmp(b: &mut Bencher) {
fn bench_lt(b: &mut Bencher) {
    b.iter(|| (0..100000).map(black_box).lt((0..100000).map(black_box)))
}

#[bench]
fn bench_fold_fold_mut_vec(b: &mut Bencher) {
    b.iter(|| {
        (0..100000).map(black_box).fold(Vec::new(), |mut v, n| {
            if n % 2 == 0 {
                v.push(n * 3);
            }
            v
        })
    });
}

#[bench]
fn bench_fold_fold_mut_vec_mut(b: &mut Bencher) {
    b.iter(|| {
        (0..100000).map(black_box).fold_mut(Vec::new(), |v, n| {
            if n % 2 == 0 {
                v.push(n * 3);
            }
        })
    });
}

#[bench]
fn bench_fold_fold_mut_hashmap(b: &mut Bencher) {
    use std::collections::HashMap;

    b.iter(|| {
        (0..100000).map(black_box).fold(HashMap::new(), |mut hm, n| {
            *hm.entry(n % 3).or_insert(0) += 1;
            hm
        })
    });
}

#[bench]
fn bench_fold_fold_mut_hashmap_mut(b: &mut Bencher) {
    use std::collections::HashMap;

    b.iter(|| {
        (0..100000).map(black_box).fold_mut(HashMap::new(), |hm, n| {
            *hm.entry(n % 3).or_insert(0) += 1;
        })
    });
}

#[bench]
fn bench_fold_fold_mut_num(b: &mut Bencher) {
    b.iter(|| (0..100000).map(black_box).fold(0, |sum, n| sum + n));
}

#[bench]
fn bench_fold_fold_mut_num_mut(b: &mut Bencher) {
    b.iter(|| {
        (0..100000).map(black_box).fold_mut(0, |sum, n| {
            *sum += n;
        })
    });
}

#[bench]
fn bench_fold_fold_mut_chain(b: &mut Bencher) {
    b.iter(|| (0i64..1000000).chain(0..1000000).map(black_box).fold(0, |sum, n| sum + n));
}

#[bench]
fn bench_fold_fold_mut_chain_mut(b: &mut Bencher) {
    b.iter(|| {
        (0i64..1000000).chain(0..1000000).map(black_box).fold_mut(0, |sum, n| {
            *sum += n;
        })
    });
}

#[bench]
fn bench_fold_fold_mut_chain_flat_map(b: &mut Bencher) {
    b.iter(|| {
        (0i64..1000000)
            .flat_map(|x| once(x).chain(once(x)))
            .map(black_box)
            .fold(0, |sum, n| sum + n)
    });
}

#[bench]
fn bench_fold_fold_mut_chain_flat_map_mut(b: &mut Bencher) {
    b.iter(|| {
        (0i64..1000000).flat_map(|x| once(x).chain(once(x))).map(black_box).fold_mut(0, |sum, n| {
            *sum += n;
        })
    });
}
38 changes: 38 additions & 0 deletions library/core/src/iter/traits/iterator.rs
@@ -1990,6 +1990,44 @@ pub trait Iterator {
        accum
    }

    /// An iterator method that applies a function, producing a single, final value.
    ///
    /// `fold_mut()` is very similar to [`fold()`], except that the closure
    /// takes a `&mut` reference to the 'accumulator' and does not need to return a new value.
    ///
    /// [`fold()`]: Iterator::fold
    ///
    /// # Examples
    ///
    /// Basic usage:
    ///
    /// ```
    /// #![feature(iterator_fold_mut)]
    /// use std::collections::HashMap;
    ///
    /// let word = "abracadabra";
    ///
    /// // the count of each letter in a HashMap
    /// let counts = word.chars().fold_mut(HashMap::new(), |map, c| {
    ///     *map.entry(c).or_insert(0) += 1;
    /// });
    ///
    /// assert_eq!(counts[&'a'], 5);
    /// ```
    #[inline]
    #[unstable(feature = "iterator_fold_mut", issue = "76751")]
    fn fold_mut<B, F>(mut self, init: B, mut f: F) -> B
    where
        Self: Sized,
        F: FnMut(&mut B, Self::Item),
    {
        let mut accum = init;
        while let Some(x) = self.next() {
            f(&mut accum, x);
        }
Member:
I think the default should be implemented in terms of fold, so it automatically takes advantage of all the iterators that already specialize this path.

self.fold(init, |mut acc, x| { f(&mut acc, x); acc })

Even better if that isolates the closure generics like #62429, so it only depends on Self::Item rather than all of Self (plus B and F of course).
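Combining the two suggestions, a minimal sketch might look like the following. It is written as a free function with illustrative names (`fold_mut_via_fold` and `adapt` are not from the PR), and the nested `adapt` function is the #62429-style closure factory: the closure it returns is generic only over the accumulator, closure, and item types, never over the whole iterator type.

```rust
// Hypothetical sketch: fold_mut expressed in terms of fold, with the
// closure built by a small factory so its generics do not mention the
// iterator type itself (the #62429 trick). Names are illustrative.
fn fold_mut_via_fold<I, B, F>(iter: I, init: B, f: F) -> B
where
    I: Iterator,
    F: FnMut(&mut B, I::Item),
{
    // Closure factory: the returned closure depends only on B and T.
    fn adapt<B, T>(mut f: impl FnMut(&mut B, T)) -> impl FnMut(B, T) -> B {
        move |mut acc, x| {
            f(&mut acc, x);
            acc
        }
    }
    iter.fold(init, adapt(f))
}

fn main() {
    // Same shape as the PR's Vec benchmark: push the triple of each even number.
    let result = fold_mut_via_fold(0..10, Vec::new(), |v, n| {
        if n % 2 == 0 {
            v.push(n * 3);
        }
    });
    assert_eq!(result, vec![0, 6, 12, 18, 24]);
}
```

Because the body delegates to `fold`, any iterator that overrides `fold` (such as `Chain` or `FlatMap`) keeps its fast path automatically.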

Member:

Try with inputs like Chain or FlatMap to see the performance difference of custom fold.

Contributor Author:

> I think the default should be implemented in terms of fold, so it automatically takes advantage of all the iterators that already specialize this path.

I like this idea! I tried it here, and the results seem to indicate that this brought back the performance "issue" (assuming the benchmark is built right, Criterion is doing its thing correctly, and I'm interpreting the results correctly):

[criterion plot image]

Should I push forward with that? I was originally making fold_mut for the performance, but we could instead make it about the shape of the closure and just wait on improvements to fold if we think that's valuable.

> Try with inputs like Chain or FlatMap to see the performance difference of custom fold.

Good call, I'll whip up a few of these when I get the chance!

Contributor Author (@mlodato517, Sep 16, 2020):

Eureka! I added some benchmarks (with terrible names and terrible duplication that should be replaced with a macro) in 2921101, and it appears that fold_mut is much slower when using chain and flat_map!

[benchmark results image]

Unfortunately they're a little bit out of order - I will definitely pick better names if we decide to go through with this PR - but basically, for chain and flat_map, fold_mut is 50%-100% slower.

For a non-chain, non-flat-map fold on a HashMap, they're about the same (except fold_mut has a huge variance, so I'm not sure about that one).

For a non-chain, non-flat-map fold on a number, fold_mut is slower now (interesting that it has switched - I guess that means they're basically "the same within precision"?).

For a non-chain, non-flat-map fold on a Vec, fold_mut is faster by some huge margin that is suspicious.

I think maybe I'll port these benchmarks back to my other repo so I can use Criterion for (maybe) more consistent data? Or should we just pull the plug on this? Or redirect the effort towards the ergonomics and not worry about performance?

Member:

> it appears that fold_mut is much slower when using chain and flat_map!

That's expected because of the while let (essentially a for loop), as cuviper said/implied. If you just replace the (conceptual) for loop with a .for_each(...), I suspect you'd recover the performance:

         let mut accum = init;
-        while let Some(x) = self.next() {
-            f(&mut accum, x);
-        }
+        self.for_each(|x| f(&mut accum, x));

(insert comment about #62429 here and how that shouldn't be the implementation in core, but it'd be fine for benchmarking)
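Put together, the suggested body reads as follows. This is a sketch for benchmarking only, as noted above; the free-function form and the name `fold_mut_via_for_each` are illustrative, not part of the PR.

```rust
// Sketch of the for_each-based variant from the diff above. Because the
// default for_each ultimately defers to fold, iterators like Chain and
// FlatMap keep their specialized fast paths. Names are illustrative.
fn fold_mut_via_for_each<I, B, F>(iter: I, init: B, mut f: F) -> B
where
    I: Iterator,
    F: FnMut(&mut B, I::Item),
{
    let mut accum = init;
    // The closure borrows accum mutably for the duration of for_each only.
    iter.for_each(|x| f(&mut accum, x));
    accum
}

fn main() {
    // Sum across a Chain - the case where the specialization matters.
    let sum = fold_mut_via_for_each((0i64..3).chain(3..6), 0, |s, n| *s += n);
    assert_eq!(sum, 15);
}
```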

Contributor Author:

Hmm, that went a little over my head - let me see if I understand:

  1. #62429 ("Reduce the genericity of closures in the iterator traits") introduced some specialized iterator methods (fold among them) for certain iterators. This reduced the number of types the iterator closure was generic over.
  2. By implementing a separate fold_mut that doesn't use fold, fold_mut misses out on the specific fold implementations for those iterators.
  3. By switching it to a for_each, fold_mut may recover the performance because, I assume, there are similarly specialized for_each implementations for several iterators.

Interested to know if any of those are in the right ballpark!

And for expedience (maybe), assuming those points are in the correct ballpark, may I ask:

  1. How does the custom implementation that reduces the generic types on the closure improve performance? I can understand the comments about reducing the number of monomorphizations and code size, but it's not immediately obvious how this plays into runtime performance. Or is it just smaller = faster here? Or is it that, with fewer trait bounds, the compiler is able to do some additional optimization?
  2. How should this be reflected in this PR? My original hope was to have a fold that was "closer" to a "zero cost abstraction". It's seeming more and more like that isn't really possible (except maybe with the for_each construction above). Should I bail on the "performance" of fold_mut and double down on the ergonomics of the closure by defining fold_mut in terms of fold? Then fold_mut would maybe become more of a zero cost abstraction as fold is specialized on other iterators like Vec::IntoIter or something.

Member (@cuviper, Sep 17, 2020):

The concerns of #62429 are mostly separate from the concerns of fold specialization. The overlap is just that using fold or for_each often introduces a generic closure.

The point of #62429 is that closures inherit all of the generic type parameters of the surrounding scope. In your example that would be B, F, and Self -- but the closure doesn't really need the whole iterator type, just Self::Item. So you can end up generating the same closure for <B, F, vec::IntoIter<i32>>, <B, F, Cloned<hash_set::Iter<'_, i32>>>, <B, F, Map<Chain<Range<i32>, Once<i32>>, map-closure{...}>>, etc., when our closure could be just <B, F, Item = i32>. This is a compile-time concern for excessive monomorphizations, and can also make it exceed type-length limits. There has been some compiler work to trim unused parameters, but it still doesn't know how to reduce that Self to Self::Item. So the trick in #62429 is to use a "closure factory" function with tighter control over the generic parameters.

The fold specialization is more about runtime performance, which is why it changes your benchmarks. For example, Chain::next() has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas Chain::fold() can just fold all of the first iterator and then all of the second. The default for_each is just a fold with an empty () accumulator, which is why it also benefits here.
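That last point can be illustrated with a toy chain - a simplified sketch, not the actual std source. Its `next()` must re-check which half is active on every call, while an overridden `fold` drains each half wholesale:

```rust
// Toy chain showing why Chain benefits from a custom fold (simplified
// sketch, not the real std implementation).
struct MyChain<A, B> {
    first: Option<A>,
    second: B,
}

impl<A, B> Iterator for MyChain<A, B>
where
    A: Iterator,
    B: Iterator<Item = A::Item>,
{
    type Item = A::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Per-item overhead: every call re-checks which half is active.
        if let Some(a) = &mut self.first {
            if let Some(x) = a.next() {
                return Some(x);
            }
            self.first = None;
        }
        self.second.next()
    }

    fn fold<Acc, F>(self, init: Acc, mut f: F) -> Acc
    where
        F: FnMut(Acc, Self::Item) -> Acc,
    {
        // Specialized path: fold all of the first half, then all of the
        // second, with no per-item state checks.
        let mut acc = init;
        if let Some(a) = self.first {
            acc = a.fold(acc, &mut f);
        }
        self.second.fold(acc, f)
    }
}

fn main() {
    let chain = MyChain { first: Some(0..3), second: 3..6 };
    let sum: i32 = chain.fold(0, |acc, x| acc + x);
    assert_eq!(sum, 15);
}
```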

Contributor Author:

Ahh, thank you for that explanation!

> The fold specialization is more about runtime performance, which is why it changes your benchmarks. For example, Chain::next() has to check its state every time it is called, whether to pull an item from its first or second iterator, whereas Chain::fold() can just fold all of the first iterator and then all of the second.

Awesome, this makes a ton of sense.

> The default for_each is just a fold with an empty () accumulator, which is why it also benefits here.

Woah, did not see that coming! So I originally read that thinking, "Great, I'll rewrite fold_mut using for_each as mentioned above and I'll get the benefits of fold improvements, plus maybe Rust can tell in for_each that moving and reassigning () is a no-op, and we'll keep the performance improvements on 'simple' iterators".
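The quoted default can be mirrored in a standalone sketch - since () is zero-sized, threading it through fold generates no code for the accumulator at all. The free function `for_each_via_fold` is a hypothetical name for illustration:

```rust
// Mirror of the default for_each described above: just a fold carrying a
// zero-sized () accumulator, so it inherits any fold specialization for
// free. Free-function form and name are illustrative.
fn for_each_via_fold<I, F>(iter: I, mut f: F)
where
    I: Iterator,
    F: FnMut(I::Item),
{
    iter.fold((), |(), x| f(x));
}

fn main() {
    let mut collected = Vec::new();
    for_each_via_fold((1..=4).map(|x| x * x), |x| collected.push(x));
    assert_eq!(collected, vec![1, 4, 9, 16]);
}
```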

I'm running out of benchmark steam (is there a better way to do this than running ./x.py bench -i library/core/ --test-args mark_bench_fold? It takes about a half hour on my machine), but I gave this a shot - I defined three methods:

  1. fold_mut which uses while let
  2. fold_mut_fold which uses fold under the hood
  3. fold_mut_each which uses for_each under the hood

And I ran 8 benchmarks: 4 operating on (0..100000).map(black_box) (named _simple) and 4 operating on (0i64..1000000).chain(0..1000000).map(black_box) (named _chain). Each test was (hopefully) calculating the sum of all the even numbers:

test iter::mark_bench_fold_chain                                ... bench:   2,287,252 ns/iter (+/- 160,890)
test iter::mark_bench_fold_chain_mut                            ... bench:   1,969,051 ns/iter (+/- 75,847)
test iter::mark_bench_fold_chain_mut_each                       ... bench:   2,516,875 ns/iter (+/- 123,425)
test iter::mark_bench_fold_chain_mut_fold                       ... bench:   2,363,194 ns/iter (+/- 123,658)
test iter::mark_bench_fold_simple                               ... bench:      57,271 ns/iter (+/- 3,691)
test iter::mark_bench_fold_simple_mut                           ... bench:      58,071 ns/iter (+/- 3,410)
test iter::mark_bench_fold_simple_mut_each                      ... bench:     540,887 ns/iter (+/- 5,221)
test iter::mark_bench_fold_simple_mut_fold                      ... bench:      57,627 ns/iter (+/- 4,896)

So I think, based on all these shifty benchmarks moving around so much... maybe these are all within statistical uncertainty (combined with whatever my computer is doing at any given time). This was super fun to play around with, but I think I'm going to bow out of the "maybe fold_mut will be faster!" argument :-D

Member:

> maybe Rust can tell in for_each that moving and reassigning () is a no-op

It's a zero sized type (ZST), so there's literally nothing to do in codegen.

I'm really surprised that it did poorly in your benchmark, but my guess is that there was some unlucky inlining (or lack thereof), especially if parts of that benchmark got split across codegen-units.

        accum
    }

    /// The same as [`fold()`], but uses the first element in the
    /// iterator as the initial value, folding every subsequent element into it.
    /// If the iterator is empty, return [`None`]; otherwise, return the result
12 changes: 12 additions & 0 deletions library/core/tests/iter.rs
@@ -3222,3 +3222,15 @@ fn test_flatten_non_fused_inner() {
    assert_eq!(iter.next(), Some(1));
    assert_eq!(iter.next(), None);
}

#[test]
fn test_fold_mut() {
    let nums = [1, 2, 3, 4, 5];
    let result = nums.iter().fold_mut(Vec::new(), |v, i| {
        if i % 2 == 0 {
            v.push(i * 3);
        }
    });

    assert_eq!(result, vec![6, 12]);
}