Reduce heap allocations #28
base: main
Conversation
In the 2^32 benchmark, this avoids about 500k temporary heap allocations during key generation over a ~30s run. These likely have only a small performance cost, but we can avoid them without making the code much more complicated.
Each rayon worker job had to allocate the full `packed_leaf_input`. We now use `for_each_init` to preallocate a vector for every rayon worker instead. We overwrite the entire vector in every job, so there is no need to even `fill(0)` it per job. This drops another ~100k allocations when running the 2^32 bench over 30s, bringing us down to only 3k temporary allocations total in that time frame.
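For illustration, a minimal sketch of this `for_each_init` pattern; the element type, buffer size, and loop body are placeholders, not the project's actual code:

```rust
use rayon::prelude::*;

fn main() {
    let leaves: Vec<u64> = (0..1_000).collect();

    // `for_each_init` gives each rayon worker job its own scratch buffer.
    // Note the init closure may still run more than once per thread because
    // of how rayon splits and steals work (rayon-rs/rayon#742).
    leaves.par_iter().for_each_init(
        || vec![0u64; 64], // per-worker scratch buffer (placeholder size)
        |buf, leaf| {
            // Every slot is overwritten, so no `fill(0)` between jobs is needed.
            for (i, slot) in buf.iter_mut().enumerate() {
                *slot = leaf.wrapping_add(i as u64);
            }
            // ... hash `buf` into a tree leaf here ...
            std::hint::black_box(&buf[..]);
        },
    );
}
```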
This way we essentially avoid all allocations, i.e. we get a single allocation per thread. `for_each_init` is known to allocate multiple times due to the rayon work stealing / splitting approach. See: rayon-rs/rayon#742
No need for a `Vec` in these two branches, as we know at compile time how much data each input requires. This is only relevant if `apply` is part of a hot code path, which is normally unlikely. Still, the code is not significantly longer, only a bit uglier :( It gets rid of a large number of allocations when running the 2^8 benchmark case.
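As a rough illustration of the idea (the actual branches and types in `apply` differ; the names and sizes below are made up):

```rust
// Illustration only: when the input length is a compile-time constant,
// a stack array can replace a heap-allocated Vec.
const INPUT_LEN: usize = 24; // made-up width, not the real one

fn pack_fixed(a: &[u32; 16], b: &[u32; 8]) -> [u32; INPUT_LEN] {
    let mut input = [0u32; INPUT_LEN]; // lives on the stack: no allocation
    input[..16].copy_from_slice(a);
    input[16..].copy_from_slice(b);
    input
}

// The general case with a runtime number of values still needs a Vec.
fn pack_dynamic(parts: &[&[u32]]) -> Vec<u32> {
    parts.iter().flat_map(|p| p.iter().copied()).collect()
}

fn main() {
    let fixed = pack_fixed(&[1; 16], &[2; 8]);
    let dynamic = pack_dynamic(&[&[1, 2, 3], &[4]]);
    assert_eq!(fixed.len(), INPUT_LEN);
    assert_eq!(dynamic.len(), 4);
}
```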
Can't hurt to have this in here.
Somehow this is a case where cargo fmt has no opinion about it. Earlier when using `for_each_init` the indentation was changed, but this part didn't want to "come back" to what it was before...
following the benchmarks for the smallest and largest case
examples/single_keygen.rs
Outdated
I think we can remove this file, no? Just used for the benchmark I imagine, but it shouldn't be pushed to main, no?
Sure, we can do that. I personally find it nice to have files for easy profiling like that, but they are also easy to write again. I'll remove them.
examples/single_keygen_2_32.rs
Outdated
I think we can remove this file, no? Just used for the benchmark I imagine, but it shouldn't be pushed to main, no?
Cargo.toml
Outdated
```toml
p3-koala-bear = { git = "https://github.com/Plonky3/Plonky3.git", rev = "a33a312" }
p3-symmetric = { git = "https://github.com/Plonky3/Plonky3.git", rev = "a33a312" }

thread_local = "1.1.9"
```
I find it very risky to rely on this kind of external crate. We should minimize the number of external deps, and this one looks like a personal project, so I find it pretty risky. I would love to avoid this and would prefer another alternative, such as `for_each_init` or something equivalent in std.
Yeah, that's what I thought, and why I kept the `for_each_init` approach. I'll have a look at the std `thread_local!` macro: https://doc.rust-lang.org/std/macro.thread_local.html.
Using the stdlib macro now.
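Roughly, the std-only pattern looks like the following sketch; the buffer type, size, and loop body are placeholders rather than the real `compute_tree_leaves` code:

```rust
use rayon::prelude::*;
use std::cell::RefCell;

// One scratch buffer per OS thread, allocated lazily on first use
// (placeholder type and size, not the real `packed_leaf_input`).
thread_local! {
    static SCRATCH: RefCell<Vec<u64>> = RefCell::new(vec![0u64; 64]);
}

fn main() {
    (0..1_000u32).into_par_iter().for_each(|leaf| {
        SCRATCH.with(|buf| {
            let mut buf = buf.borrow_mut();
            // The buffer is fully overwritten in each job, so no zeroing is needed.
            for (i, slot) in buf.iter_mut().enumerate() {
                *slot = u64::from(leaf) + i as u64;
            }
            // ... hash `buf` into a tree leaf here ...
            std::hint::black_box(&buf[..]);
        });
    });
}
```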
Pushed the changes addressing all comments!
```rust
for (i, x) in it.remainder().iter().enumerate() {
    state[i] += *x;
}
// was a remainder, so permute. No need to mutate `state` as we *add* only anyway
```
What do you mean here by "no need to mutate state"? Since you pass a mutable reference to `state` to the permutation?
The point is that before, we zero-padded. But because we only add to the state, we don't have to add the zeroes for the previously padded data.
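In other words, adding a zero-padded block to the state is equivalent to adding only the elements that actually exist. A toy illustration of that equivalence, using plain integers instead of the real field elements:

```rust
// Toy illustration: absorbing a short final block into a sponge-like state
// by addition. Padding with zeroes and adding is the same as adding only the
// elements that exist, since x + 0 == x.
fn absorb_remainder(state: &mut [u64; 8], remainder: &[u64]) {
    for (i, x) in remainder.iter().enumerate() {
        state[i] += *x;
    }
    // positions remainder.len().. are simply left untouched
}

fn main() {
    let mut a = [1u64; 8];
    let mut b = [1u64; 8];

    // old approach: pad to the full width with zeroes, then add
    let padded = [5, 6, 7, 0, 0, 0, 0, 0u64];
    for (s, x) in a.iter_mut().zip(padded.iter()) {
        *s += *x;
    }

    // new approach: add only the real elements
    absorb_remainder(&mut b, &[5, 6, 7]);

    assert_eq!(a, b);
}
```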
```rust
if out_idx < OUT_LEN {
    perm.permute_mut(&mut state);
}
```
To match exactly the logic we had before, I think we don't need the if statement here. This piece of code is an important factor for security, so we should not reduce the number of permutations. Due to the `if` after `out_idx += chunk_size`, I think the number of permutations was not exactly the same.
A good exercise for this kind of sensitive refactoring is to check the outputs of a call before and after the modification; they should be identical, since we didn't change any logic here. A minimal sketch of such a check follows the suggested change below.
Suggested change:
```diff
-if out_idx < OUT_LEN {
-    perm.permute_mut(&mut state);
-}
+perm.permute_mut(&mut state);
```
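A minimal sketch of such a before/after regression check, with hypothetical stand-in hash functions rather than the real keygen/`apply` entry points:

```rust
// Hypothetical regression check: the refactored code path must produce
// bit-identical output to the original one for the same inputs.
fn hash_before(input: &[u64]) -> [u64; 4] {
    // stand-in for the original implementation (e.g. an older git rev)
    [input.iter().copied().sum(), 0, 0, 0]
}

fn hash_after(input: &[u64]) -> [u64; 4] {
    // stand-in for the refactored implementation
    [input.iter().copied().sum(), 0, 0, 0]
}

fn main() {
    // Exercise lengths around the block/remainder boundaries.
    for len in [0usize, 1, 7, 8, 9, 64] {
        let input: Vec<u64> = (0..len as u64).collect();
        assert_eq!(
            hash_before(&input),
            hash_after(&input),
            "mismatch for input length {len}"
        );
    }
}
```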
Maybe I screwed that up. Will check next year, sorry.
After looking at performance profiling yesterday, I decided to have a look at heap profiling today. I didn't expect to find much that actually affects performance, but I did find some unnecessary heap allocations. The new code is barely longer.
Essentially we now avoid heap allocations in:
- `poseidon_sponge` `apply`: in 2/3 code branches (cannot easily do so in the general hashing of N runtime values)
- `compute_tree_leaves`: now avoids reallocating the `packed_leaf_input` vector for each rayon job. We now use the `thread_local` crate for thread-local storage, so each thread only allocates the vector once. Because we fully overwrite the vector anyway, we don't even need to set it back to zero. Initially I used rayon's `for_each_init` here. While that works, it does not guarantee a single init per worker, because of how rayon does work stealing and splitting: `for_each_init` calls `init` several times per thread (rayon-rs/rayon#742). The code that used `for_each_init` is still here in commit Vindaar@e3e5ceb.

I've added two simple examples, which only do the keygen step for 2^8 and 2^32 elements, to make it easier to heap profile and performance profile exactly that step and nothing else.
For the 2^8 case, before these fixes there were about 1900 total allocations; of those we get rid of ~160. In the 2^32 case we go from about 58,000 temporary allocations to 41.
Heaptrack output for 2^32 before
Heaptrack output for 2^32 after
Note on performance
As one can expect, malloc / memmove etc. showed up only as small fractions in the performance profiling (#27). It is therefore no surprise that performance is essentially identical for the current benchmark setup. But depending on how the code is used in other setups, and given that the code is not significantly more complex, it may still be considered a win.