Skip to content

Geofilter customize hasher #69

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

itsibitzi
Copy link

@itsibitzi itsibitzi commented Jul 18, 2025

Closes #68

What does this change?

This PR adds the ability for a user of the geo_filters library to modify the hashing function used by the Count trait. This is required because if you're going to be generating a filter on two different devices across, for example, a network connection then you want your hashing function to be consistent. A client who uses hasher version 1 and a server using hasher version 2 would result in a meaningless comparison.

I've included some documentation on choosing hashing functions. It would be great to get a review of that to make sure my language is all correct.

How does this work

This is implemented by introducing a new trait ReproducibleBuildHasher which users can mark their BuildHashers with to promise that they produce consistent hashers (identical seed state) with multiple calls to default.

Having this done in the type level makes the code cleaner because it means I can add an associated type to the config trait and any time we need a hasher we just create one using the default method. A pervious version used concrete build hashers and it resulted in a lot of type parameters being passed around the place. As far as I can tell build hashers tend to be zero sized and free to instantiate so this shouldn't incur any overheads.

impl<C: GeoConfig<Diff>> GeoDiffCount<'_, C> {
pub fn new(config: C) -> Self {
impl<C: GeoConfig<Diff>, H: BuildHasher + Clone> GeoDiffCount<'_, C, H> {
pub fn new(config: C, hash_builder: H) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can make this:

pub fn new<H: Default>(config: C) -> Self {
    ...
       hash_builder::HashBuilder::Default(),
}

pub fn with_hasher(config: C, hash_builder: H) -> Self {
   ...
}

@itsibitzi
Copy link
Author

Update:

The new commit introduces a trait that library users must mark their BuildHashers with in order to promise to us that the two Hashers produced using the default method will produce the same output for the same input.

The advantage of this is because we're not holding any state any more we can use an associated type on the config object and create a new BuildHasher whenever we need it. This makes the type parameters a lot cleaner and should not have performance impact since generally building a BuildHasher should be a no-op. I will verify this against benchmarks though.

The downside is from the library users perspective because it means they will need to create some newtype for the BuildHasher in order to get around the orphan problem (their BuildHasher is likely from a library, and our trait is also from a library). We could create either another crate to provide these newtypes or we could introduce feature flags to this crate for common hasher implementations.

Comment on lines 30 to 34
// Note that this `BuildHasher` has a consistent implementation of `Default`
// but is NOT stable across releases of Rust. It is therefore dangerous
// to use if you plan on serializing the geofilters and reusing them due
// to the fact that you can serialize a filter made with one version and
// deserialize with another version of the hasher factor.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change into doc comments?

@@ -17,7 +18,7 @@ use crate::Diff;
//
// scripts/accuracy -n 10000 geo_diff/u16/b=7/bytes=112/msb={8,12,16,20}
//
pub type GeoDiffConfig7 = FixedConfig<Diff, u16, 7, 112, 12>;
pub type GeoDiffConfig7 = FixedConfig<Diff, u16, 7, 112, 12, DefaultBuildHasher>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should GeoDiffConfig7 have the BuildHasher as generic with DefaultBuildHasher set as the default?
(same for the others?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this, and presumably the accuracy figures from the comments are based on a combination of the config parameters and the specific hasher. E.g. previously I had FnvBuildHasher in here and the accuracy was way off when working against a series of incrementing numbers.

If we provide the pre-canned configs in order to meet a specified error - does it confuse things to allow people to swap out the hash function and potentially get a different error?

I suspect it is the case where any sufficiently good hash function/input combination should perform about the same?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you use a bad hash function, then you get essentially no guarantees...
For all good ones, it should be about the same on average.
The amount of outliers might obviously differ 🤷

Copy link
Author

@itsibitzi itsibitzi Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I thought so.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inference doesn't fall back to defaults so this will introduce a few extra type annotations, mostly in the tests.

@itsibitzi itsibitzi marked this pull request as ready for review July 21, 2025 13:36
@Copilot Copilot AI review requested due to automatic review settings July 21, 2025 13:36
@itsibitzi itsibitzi requested a review from a team as a code owner July 21, 2025 13:36
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds the ability to customize the hashing function used by geometric filters in the geo_filters library. The change introduces a new trait ReproducibleBuildHasher that enforces consistent hashing across devices, which is essential for meaningful comparisons between filters created on different systems.

Key changes include:

  • Introduction of ReproducibleBuildHasher trait and associated hasher types
  • Addition of a hasher type parameter to the Count trait and configuration types
  • Comprehensive documentation on choosing appropriate hash functions
  • Updates to all existing tests and examples to use the new hasher parameter

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
crates/geo_filters/src/build_hasher.rs New module defining the ReproducibleBuildHasher trait and default implementations
crates/geo_filters/src/lib.rs Updated Count trait to include hasher type parameter and use configurable hashing
crates/geo_filters/src/config.rs Added hasher type parameter to configuration traits and structs
crates/geo_filters/src/distinct_count.rs Updated to use new hasher-parameterized Count trait
crates/geo_filters/src/diff_count.rs Updated to use new hasher-parameterized Count trait
crates/geo_filters/docs/choosing-a-hash-function.md New documentation explaining hash function selection considerations
Various test and evaluation files Updated to use DefaultBuildHasher type parameter

@@ -29,13 +32,14 @@ pub struct Distinct {}
impl Method for Distinct {}

/// Trait for types solving the set cardinality estimation problem.
pub trait Count<M: Method> {
pub trait Count<M: Method, H: ReproducibleBuildHasher> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it would be simpler if H would be a type member (or whatever it is called) instead of a generic.
Or is there another reason for having it as generic argument?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good shout, I've made it an associated type :)

Comment on lines 41 to 42
let build_hasher = H::default();
self.push_hash(build_hasher.hash_one(item));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could also remove the default implementation to completely get rid of the BuildHasher dependency here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

I've removed this as a default function.

- Parametize config type aliases
- Remove dependency on BuildHasher from Count
- Add readme test
- Clarify in type alias that default build hasher is unstable
Comment on lines 3 to 20
// Mirrors what we have in the README.
// If this breaks then it should be fixed an the readme updated.
#[test]
fn readme() {
use geo_filters::distinct_count::GeoDistinctCount13;
use geo_filters::Count;

let mut c1 = GeoDistinctCount13::default();
c1.push(1);
c1.push(2);

let mut c2 = GeoDistinctCount13::default();
c2.push(2);
c2.push(3);

let estimated_size = c1.size_with_sketch(&c2);
assert!((3.0_f32 * 0.9..=3.0_f32 * 1.1).contains(&estimated_size));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved into a crate documentation, then it would be compiled (if set up correctly).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it using this trick:

#[doc = include_str!("../README.md")]
#[cfg(doctest)]
pub struct ReadmeDocTests;

This seems to need to be in lib.rs because there are some issues in the rust tool chain that prevent doc tests from running.

Copy link
Collaborator

@aneubeck aneubeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@itsibitzi itsibitzi enabled auto-merge July 22, 2025 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

geo_filters: Allow user to specify hasher
2 participants