-
Notifications
You must be signed in to change notification settings - Fork 9
Geofilter customize hasher #69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
crates/geo_filters/src/diff_count.rs
Outdated
impl<C: GeoConfig<Diff>> GeoDiffCount<'_, C> { | ||
pub fn new(config: C) -> Self { | ||
impl<C: GeoConfig<Diff>, H: BuildHasher + Clone> GeoDiffCount<'_, C, H> { | ||
pub fn new(config: C, hash_builder: H) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can make this:
pub fn new<H: Default>(config: C) -> Self {
...
hash_builder::HashBuilder::Default(),
}
pub fn with_hasher(config: C, hash_builder: H) -> Self {
...
}
Update: The new commit introduces a trait that library users must mark their The advantage of this is because we're not holding any state any more we can use an associated type on the config object and create a new The downside is from the library users perspective because it means they will need to create some newtype for the |
// Note that this `BuildHasher` has a consistent implementation of `Default` | ||
// but is NOT stable across releases of Rust. It is therefore dangerous | ||
// to use if you plan on serializing the geofilters and reusing them due | ||
// to the fact that you can serialize a filter made with one version and | ||
// deserialize with another version of the hasher factor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: change into doc comments?
@@ -17,7 +18,7 @@ use crate::Diff; | |||
// | |||
// scripts/accuracy -n 10000 geo_diff/u16/b=7/bytes=112/msb={8,12,16,20} | |||
// | |||
pub type GeoDiffConfig7 = FixedConfig<Diff, u16, 7, 112, 12>; | |||
pub type GeoDiffConfig7 = FixedConfig<Diff, u16, 7, 112, 12, DefaultBuildHasher>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should GeoDiffConfig7
have the BuildHasher as generic with DefaultBuildHasher
set as the default?
(same for the others?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about this, and presumably the accuracy figures from the comments are based on a combination of the config parameters and the specific hasher. E.g. previously I had FnvBuildHasher
in here and the accuracy was way off when working against a series of incrementing numbers.
If we provide the pre-canned configs in order to meet a specified error - does it confuse things to allow people to swap out the hash function and potentially get a different error?
I suspect it is the case where any sufficiently good hash function/input combination should perform about the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you use a bad hash function, then you get essentially no guarantees...
For all good ones, it should be about the same on average.
The amount of outliers might obviously differ 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I thought so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inference doesn't fall back to defaults so this will introduce a few extra type annotations, mostly in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds the ability to customize the hashing function used by geometric filters in the geo_filters
library. The change introduces a new trait ReproducibleBuildHasher
that enforces consistent hashing across devices, which is essential for meaningful comparisons between filters created on different systems.
Key changes include:
- Introduction of
ReproducibleBuildHasher
trait and associated hasher types - Addition of a hasher type parameter to the
Count
trait and configuration types - Comprehensive documentation on choosing appropriate hash functions
- Updates to all existing tests and examples to use the new hasher parameter
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
crates/geo_filters/src/build_hasher.rs |
New module defining the ReproducibleBuildHasher trait and default implementations |
crates/geo_filters/src/lib.rs |
Updated Count trait to include hasher type parameter and use configurable hashing |
crates/geo_filters/src/config.rs |
Added hasher type parameter to configuration traits and structs |
crates/geo_filters/src/distinct_count.rs |
Updated to use new hasher-parameterized Count trait |
crates/geo_filters/src/diff_count.rs |
Updated to use new hasher-parameterized Count trait |
crates/geo_filters/docs/choosing-a-hash-function.md |
New documentation explaining hash function selection considerations |
Various test and evaluation files | Updated to use DefaultBuildHasher type parameter |
crates/geo_filters/src/lib.rs
Outdated
@@ -29,13 +32,14 @@ pub struct Distinct {} | |||
impl Method for Distinct {} | |||
|
|||
/// Trait for types solving the set cardinality estimation problem. | |||
pub trait Count<M: Method> { | |||
pub trait Count<M: Method, H: ReproducibleBuildHasher> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it would be simpler if H
would be a type member (or whatever it is called) instead of a generic.
Or is there another reason for having it as generic argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout, I've made it an associated type :)
crates/geo_filters/src/lib.rs
Outdated
let build_hasher = H::default(); | ||
self.push_hash(build_hasher.hash_one(item)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could also remove the default implementation to completely get rid of the BuildHasher
dependency here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I've removed this as a default function.
- Parametize config type aliases - Remove dependency on BuildHasher from Count - Add readme test - Clarify in type alias that default build hasher is unstable
// Mirrors what we have in the README. | ||
// If this breaks then it should be fixed an the readme updated. | ||
#[test] | ||
fn readme() { | ||
use geo_filters::distinct_count::GeoDistinctCount13; | ||
use geo_filters::Count; | ||
|
||
let mut c1 = GeoDistinctCount13::default(); | ||
c1.push(1); | ||
c1.push(2); | ||
|
||
let mut c2 = GeoDistinctCount13::default(); | ||
c2.push(2); | ||
c2.push(3); | ||
|
||
let estimated_size = c1.size_with_sketch(&c2); | ||
assert!((3.0_f32 * 0.9..=3.0_f32 * 1.1).contains(&estimated_size)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved into a crate documentation, then it would be compiled (if set up correctly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it using this trick:
#[doc = include_str!("../README.md")]
#[cfg(doctest)]
pub struct ReadmeDocTests;
This seems to need to be in lib.rs
because there are some issues in the rust tool chain that prevent doc tests from running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Closes #68
What does this change?
This PR adds the ability for a user of the
geo_filters
library to modify the hashing function used by theCount
trait. This is required because if you're going to be generating a filter on two different devices across, for example, a network connection then you want your hashing function to be consistent. A client who uses hasher version 1 and a server using hasher version 2 would result in a meaningless comparison.I've included some documentation on choosing hashing functions. It would be great to get a review of that to make sure my language is all correct.
How does this work
This is implemented by introducing a new trait
ReproducibleBuildHasher
which users can mark theirBuildHasher
s with to promise that they produce consistent hashers (identical seed state) with multiple calls todefault
.Having this done in the type level makes the code cleaner because it means I can add an associated type to the config trait and any time we need a hasher we just create one using the default method. A pervious version used concrete build hashers and it resulted in a lot of type parameters being passed around the place. As far as I can tell build hashers tend to be zero sized and free to instantiate so this shouldn't incur any overheads.