Geofilter customize hasher #69
Merged
12 commits
440afa1 initial cut of customizable buildhashers in geo filters crate (itsibitzi)
77d67b5 update hardcoded test values (itsibitzi)
4259f65 wip save point (itsibitzi)
7e35a6b clean up type parameters (itsibitzi)
7f450d7 new clippy rule (itsibitzi)
86e7582 docs (itsibitzi)
885d8db fix test lints (itsibitzi)
be373ee reworks (itsibitzi)
4eb1f73 Further reworks (itsibitzi)
132a244 more clippy (itsibitzi)
ea774b5 remove readme test and make readme doc tests (itsibitzi)
b9b0bae fix spelling mistake (itsibitzi)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# Choosing a hash function | ||
|
||
## Reproducibility | ||
|
||
This library uses hash functions to assign values to buckets deterministically. The same item | ||
will hash to the same value, and modify the same bit in the geofilter. | ||
|
||
When comparing geofilters it is important that the same hash functions, using the same seed | ||
values, have been used for *both* filters. Attempting to compare geofilters which have been | ||
produced using different hash functions or the same hash function with different seeds will | ||
produce nonsensical results. | ||
|
||
Similar to the Rust standard library, this crate uses the `BuildHasher` trait and creates | ||
a new `Hasher` for every item processed. | ||
|
||
To help prevent mistakes caused by mismatched hash functions or seeds, we introduce a trait, | ||
`ReproducibleBuildHasher`, which you must implement if you wish to use a custom hash function. | ||
By marking a `BuildHasher` with this trait you are asserting that every instance created with | ||
`Default::default` hashes identical items to the same `u64` value across multiple calls | ||
to `BuildHasher::hash_one`. | ||
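The property described above can be checked with the standard library alone. The following is a minimal sketch, not part of this crate: the helper names `defaults_agree`, `fixed_agrees` and `random_agrees` are made up for illustration. It compares the output of two independently default-constructed `BuildHasher`s on the same item.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::hash::{BuildHasher, BuildHasherDefault, Hash};

/// Hypothetical helper: returns true if two independently
/// default-constructed `BuildHasher`s of type `B` hash `item`
/// to the same `u64`.
pub fn defaults_agree<B, T>(item: &T) -> bool
where
    B: BuildHasher + Default,
    T: Hash,
{
    B::default().hash_one(item) == B::default().hash_one(item)
}

/// `BuildHasherDefault<DefaultHasher>` carries no per-instance state,
/// so every default instance hashes an item the same way.
pub fn fixed_agrees() -> bool {
    defaults_agree::<BuildHasherDefault<DefaultHasher>, _>(&"some item")
}

/// `RandomState::default()` picks fresh random keys per instance, so
/// two instances are expected to disagree (barring a 64-bit collision).
pub fn random_agrees() -> bool {
    defaults_agree::<RandomState, _>(&"some item")
}
```

A type for which a check like `defaults_agree` can fail is not a suitable candidate for `ReproducibleBuildHasher`.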
|
||
The following is an example of some incorrect code which produces nonsense results: | ||
|
||
```rust | ||
use std::hash::RandomState; | ||
|
||
// Implement our marker trait for `RandomState`. | ||
// You should _NOT_ do this as `RandomState::default` does not produce | ||
// reproducible hashers. | ||
impl ReproducibleBuildHasher for RandomState {} | ||
|
||
#[test] | ||
fn test_different_hash_functions() { | ||
    // The last parameter in this FixedConfig means we're using RandomState as the BuildHasher | ||
    pub type FixedConfigRandom = FixedConfig<Diff, u16, 7, 112, 12, RandomState>; | ||
|
||
    let mut a = GeoDiffCount::new(FixedConfigRandom::default()); | ||
    let mut b = GeoDiffCount::new(FixedConfigRandom::default()); | ||
|
||
    // Add our values | ||
    for n in 0..100 { | ||
        a.push(n); | ||
        b.push(n); | ||
    } | ||
|
||
    // We have inserted the same items into both filters so we'd expect the | ||
    // symmetric difference to be zero if all is well. | ||
    let diff_size = a.size_with_sketch(&b); | ||
|
||
    // But all is not well. This assertion fails! | ||
    assert_eq!(diff_size, 0.0); | ||
} | ||
``` | ||
|
||
The actual value returned in this example is ~200. This makes sense: each filter hashed its items with a | ||
differently seeded `RandomState`, so the geofilter sees two unrelated sets of 100 unique values and | ||
approximates the symmetric difference as ~200. If we were to rerun the above example with a genuinely | ||
reproducible `BuildHasher` then the resulting diff size would be `0`. | ||
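The arithmetic behind the ~200 figure can be reproduced exactly without a geofilter. The sketch below is illustrative only and uses made-up helper names (`hashed_set`, `exact_diff_size`). It hashes `0..100` twice, each time with a freshly default-constructed `BuildHasher`, and counts the exact symmetric difference of the two sets of hash values.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::HashSet;
use std::hash::{BuildHasher, BuildHasherDefault};

/// Hashes 0..100 with a freshly default-constructed `B` and collects
/// the resulting hash values into a set.
fn hashed_set<B: BuildHasher + Default>() -> HashSet<u64> {
    let build = B::default();
    (0..100u64).map(|n| build.hash_one(n)).collect()
}

/// Exact size of the symmetric difference between two sets of hash
/// values, each produced by its own default instance of `B`.
pub fn exact_diff_size<B: BuildHasher + Default>() -> usize {
    let a = hashed_set::<B>();
    let b = hashed_set::<B>();
    a.symmetric_difference(&b).count()
}

/// Reproducible hasher: both sets are identical, so nothing differs.
pub fn diff_with_fixed_hasher() -> usize {
    exact_diff_size::<BuildHasherDefault<DefaultHasher>>()
}

/// `RandomState`: the two sets are expected to share no values, so the
/// difference is 100 + 100 (barring an unlikely 64-bit collision).
pub fn diff_with_random_state() -> usize {
    exact_diff_size::<RandomState>()
}
```

The geofilter only estimates this quantity, which is why the value it reports is approximately, rather than exactly, 200.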
|
||
In debug builds, as part of the config's `eq` implementation, our library asserts that the `BuildHasher`s | ||
produce the same `u64` value when given the same input. This check is not enabled in release builds. | ||
|
||
## Stability | ||
|
||
Following from this, it might be important that your hash functions and seed values are stable, meaning | ||
that they won't change from one release to another. | ||
|
||
The default function provided in this library is *NOT* stable as it is based on the Rust standard library's | ||
`DefaultHasher`, which does not have a specified algorithm and may change across releases of Rust. | ||
|
||
Stability is especially important to consider if you are using serialized geofilters which may have | ||
been created with a previous version of the Rust standard library. | ||
|
||
This library provides an implementation of `ReproducibleBuildHasher` for the `FnvBuildHasher` provided | ||
by the `fnv` crate version `1.0`. This is a _stable_ hash function in that it won't change unexpectedly, | ||
but it doesn't have good diffusion properties. This means that if your input items have low entropy (for | ||
example, numbers from `0..10000`) you will find that the geofilter is not able to produce accurate estimations. | ||
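One concrete weakness of FNV-style hashing can be shown with a hand-written 64-bit FNV-1a. The sketch below does not use the `fnv` crate and assumes each integer is fed to the hash as eight little-endian bytes. Because every step is an XOR followed by a multiplication by an odd constant, the lowest output bit depends only on the lowest bit of each input byte, so it can be predicted without running the hash at all. Which bits the geofilter actually consumes is not covered here.

```rust
/// 64-bit FNV-1a over a byte slice, written out by hand for illustration.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Lowest bit of the hash of `n`, encoded as eight little-endian bytes
/// (an assumption made for this sketch).
pub fn lowest_hash_bit(n: u64) -> u64 {
    fnv1a_64(&n.to_le_bytes()) & 1
}

/// Checks, for every `n` in `0..10_000`, that the lowest hash bit equals
/// 1 (the lowest bit of the offset basis) XORed with the lowest bit of
/// each input byte. No other input bit ever reaches that output bit.
pub fn low_bit_is_predictable() -> bool {
    (0..10_000u64).all(|n| {
        let predicted = n
            .to_le_bytes()
            .iter()
            .fold(1u64, |acc, b| acc ^ u64::from(b & 1));
        lowest_hash_bit(n) == predicted
    })
}
```

A hash with good diffusion would not allow any output bit to be predicted from so few input bits.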
|
||
## Uniformity and Diffusion | ||
|
||
In order to produce accurate estimations it is important that your hash function is able to produce evenly | ||
distributed outputs for your input items. | ||
|
||
This property must be balanced against the performance requirements of your system as stronger hashing | ||
algorithms are often slower. | ||
|
||
Depending on your input data, different functions may be more or less appropriate. For example, if your input | ||
items have high entropy (e.g. SHA256 values) then the diffusion of your hash function might matter less. | ||
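To make "evenly distributed" more tangible, the following sketch buckets the numbers `0..10_000` by the top four bits of their hash and counts the items per bucket. `IdentityHasher` is a hypothetical hasher that performs no mixing, and bucketing by the top four bits is an arbitrary choice for the illustration, not a description of how the geofilter selects buckets.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{BuildHasher, BuildHasherDefault, Hasher};

/// Hypothetical hasher with no diffusion: it returns the last `u64`
/// written to it, unchanged.
#[derive(Default)]
pub struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Fallback for non-`u64` input: shift the bytes in without mixing.
        for &b in bytes {
            self.0 = (self.0 << 8) | u64::from(b);
        }
    }

    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Counts how many of the inputs `0..10_000` fall into each of 16
/// buckets, where the bucket is the top four bits of the hash.
pub fn bucket_counts<B: BuildHasher + Default>() -> [usize; 16] {
    let build = B::default();
    let mut counts = [0usize; 16];
    for n in 0..10_000u64 {
        let bucket = (build.hash_one(n) >> 60) as usize;
        counts[bucket] += 1;
    }
    counts
}

/// Bucket counts without any mixing: every input lands in bucket 0.
pub fn identity_counts() -> [usize; 16] {
    bucket_counts::<BuildHasherDefault<IdentityHasher>>()
}

/// Bucket counts with the standard library's `DefaultHasher`.
pub fn default_hasher_counts() -> [usize; 16] {
    bucket_counts::<BuildHasherDefault<DefaultHasher>>()
}
```

With `IdentityHasher` all 10,000 low-entropy inputs share the same top bits and land in a single bucket, whereas `DefaultHasher` is expected to spread them at roughly 625 per bucket.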
|
||
## Implementing your own `ReproducibleBuildHasher` type | ||
|
||
If you are using a hash function defined in another crate, you will not be able to implement | ||
`ReproducibleBuildHasher` on that type directly due to Rust's orphan rules. The easiest way to get around this | ||
is to create a newtype which proxies the underlying `BuildHasher`. | ||
|
||
In addition to `BuildHasher`, `ReproducibleBuildHasher` needs `Default` and `Clone`, which are usually implemented | ||
on `BuildHasher`s, so you can probably just `#[derive(...)]` those. If your `BuildHasher` doesn't have those | ||
traits then you may need to implement them manually. | ||
|
||
Here is an example of how to use a newtype to mark your `BuildHasher` as reproducible: | ||
|
||
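```rust | ||
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher}; | ||
|
||
// Newtype around a foreign `BuildHasher` so that the marker trait can be implemented locally. | ||
#[derive(Clone, Default)] | ||
pub struct MyBuildHasher(BuildHasherDefault<DefaultHasher>); | ||
|
||
impl BuildHasher for MyBuildHasher { | ||
    type Hasher = DefaultHasher; | ||
|
||
    // Delegate to the wrapped `BuildHasher`. | ||
    fn build_hasher(&self) -> Self::Hasher { | ||
        self.0.build_hasher() | ||
    } | ||
} | ||
|
||
// `ReproducibleBuildHasher` must be in scope, imported from this crate. | ||
impl ReproducibleBuildHasher for MyBuildHasher {} | ||
``` |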
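When the underlying hash function takes a seed, the `Default` implementation is where reproducibility is decided. The sketch below is a self-contained illustration, not code from this crate: `SeededFnv`, `FixedSeedBuildHasher`, the seed constant, and the stand-in trait `ReproducibleMarker` (used in place of `ReproducibleBuildHasher` so that the example compiles on its own) are all hypothetical. `Default` pins the seed to a constant, so every default instance hashes identically.

```rust
use std::hash::{BuildHasher, Hasher};

/// Hypothetical seeded hasher: FNV-1a whose starting state is XORed
/// with a seed.
pub struct SeededFnv(u64);

impl Hasher for SeededFnv {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// `BuildHasher` that carries a seed.
#[derive(Clone)]
pub struct FixedSeedBuildHasher {
    seed: u64,
}

impl FixedSeedBuildHasher {
    /// Builds an instance with an explicit seed. Instances with different
    /// seeds must not be mixed when comparing filters.
    pub fn with_seed(seed: u64) -> Self {
        Self { seed }
    }
}

impl Default for FixedSeedBuildHasher {
    /// Pins the seed to an arbitrary constant so that all default
    /// instances agree with each other.
    fn default() -> Self {
        Self { seed: 0x5eed }
    }
}

impl BuildHasher for FixedSeedBuildHasher {
    type Hasher = SeededFnv;

    fn build_hasher(&self) -> SeededFnv {
        SeededFnv(0xcbf2_9ce4_8422_2325 ^ self.seed)
    }
}

/// Stand-in for the crate's marker trait, with the same supertraits.
pub trait ReproducibleMarker: BuildHasher + Default + Clone {}

impl ReproducibleMarker for FixedSeedBuildHasher {}
```

If `Default` instead drew the seed from a random source, the marker trait's promise would be broken in the same way as in the `RandomState` example.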
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher, Hasher as _}; | ||
|
||
use fnv::FnvBuildHasher; | ||
|
||
/// Trait for a hasher factory that can be used to produce hashers | ||
/// for use with geometric filters. | ||
/// | ||
/// It is a superset of [`BuildHasher`], enforcing additional requirements | ||
/// on the hasher builder that are required for the geometric filters and | ||
/// surrounding code. | ||
/// | ||
/// When performing operations across two different geometric filters, | ||
/// the hashers must be equal, i.e. they must produce the same hash for the | ||
/// same input. | ||
pub trait ReproducibleBuildHasher: BuildHasher + Default + Clone { | ||
    #[inline] | ||
    fn debug_assert_hashers_eq() { | ||
        // In debug builds we check that hash outputs are the same for | ||
        // two default-constructed instances. The library user should only | ||
        // have implemented our build hasher trait if this is already true, | ||
        // but we check here in case they have implemented the trait in error. | ||
        debug_assert_eq!( | ||
            Self::default().build_hasher().finish(), | ||
            Self::default().build_hasher().finish(), | ||
            "Hashers produced by ReproducibleBuildHasher do not produce the same output with the same input" | ||
        ); | ||
    } | ||
} | ||
|
||
/// Note that this `BuildHasher` has a consistent implementation of `Default` | ||
/// but is NOT stable across releases of Rust. It is therefore dangerous | ||
/// to use if you plan on serializing the geofilters and reusing them, | ||
/// because you can serialize a filter made with one version and | ||
/// deserialize it with another version of the hasher factory. | ||
pub type UnstableDefaultBuildHasher = BuildHasherDefault<DefaultHasher>; | ||
|
||
impl ReproducibleBuildHasher for UnstableDefaultBuildHasher {} | ||
impl ReproducibleBuildHasher for FnvBuildHasher {} |