Geofilter customize hasher #69

Merged
merged 12 commits into from
Jul 23, 2025
115 changes: 115 additions & 0 deletions crates/geo_filters/docs/choosing-a-hash-function.md
@@ -0,0 +1,115 @@
# Choosing a hash function

## Reproducibility

This library uses hash functions to assign values to buckets deterministically. The same item
will hash to the same value and modify the same bit in the geofilter.

When comparing geofilters, it is important that the same hash function, with the same seed
values, was used for *both* filters. Comparing geofilters produced with different hash
functions, or with the same hash function but different seeds, will produce nonsensical
results.

Similar to the Rust standard library, this crate uses the `BuildHasher` trait and creates
a new `Hasher` for every item processed.
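This per-item pattern can be sketched with nothing but `std` (no `geo_filters` types are used here; the equivalence with `BuildHasher::hash_one` is standard library behavior):

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher, Hash, Hasher};

/// Hashes `item` the long way: build a fresh `Hasher`, feed the item,
/// finish. Conceptually, this happens once per item processed.
fn hash_with_fresh_hasher<B: BuildHasher>(bh: &B, item: &str) -> u64 {
    let mut hasher = bh.build_hasher();
    item.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let bh = BuildHasherDefault::<DefaultHasher>::default();
    // The long way is equivalent to the `hash_one` convenience method.
    assert_eq!(hash_with_fresh_hasher(&bh, "item"), bh.hash_one("item"));
}
```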

To help prevent mistakes caused by mismatched hash functions or seeds, we introduce a trait,
`ReproducibleBuildHasher`, which you must implement if you wish to use a custom hashing function.
By marking a `BuildHasher` with this trait you're asserting that `Hasher`s produced via
`Default::default` will hash identical items to the same `u64` value across multiple calls
to `BuildHasher::hash_one`.

The following is an example of incorrect code that produces nonsense results:

```rust
use std::hash::RandomState;

// Implement our marker trait for `RandomState`.
// You should _NOT_ do this as `RandomState::default` does not produce
// reproducible hashers.
impl ReproducibleBuildHasher for RandomState {}

#[test]
fn test_different_hash_functions() {
// The last parameter in this FixedConfig means we're using RandomState as the BuildHasher
pub type FixedConfigRandom = FixedConfig<Diff, u16, 7, 112, 12, RandomState>;

let mut a = GeoDiffCount::new(FixedConfigRandom::default());
let mut b = GeoDiffCount::new(FixedConfigRandom::default());

// Add our values
for n in 0..100 {
a.push(n);
b.push(n);
}

// We have inserted the same items into both filters so we'd expect the
// symmetric difference to be zero if all is well.
let diff_size = a.size_with_sketch(&b);

// But all is not well. This assertion fails!
assert_eq!(diff_size, 0.0);
}
```

The actual value returned in this example is ~200. This makes sense because the geofilter
thinks that there are 100 unique values in each of the filters, so the difference is approximated
as being ~200. If we were to rerun the above example with a genuinely reproducible `BuildHasher`
then the resulting diff size would be `0`.

In debug builds, as part of the config's `eq` implementation, the library asserts that the `BuildHasher`s
produce the same `u64` value when given the same input; this check is not enabled in release builds.
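The property that this assertion checks can be demonstrated with a std-only sketch (nothing here comes from `geo_filters` itself; `defaults_agree` is a hypothetical helper for illustration):

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher, RandomState};

/// Returns true if two independently default-constructed build hashers
/// of type `H` agree on a sample input -- the property a reproducible
/// build hasher must uphold.
fn defaults_agree<H: BuildHasher + Default>() -> bool {
    H::default().hash_one("item") == H::default().hash_one("item")
}

fn main() {
    // A fixed-seed build hasher is reproducible across instances.
    assert!(defaults_agree::<BuildHasherDefault<DefaultHasher>>());

    // `RandomState` is seeded randomly per instance, so two defaults
    // will (almost certainly) disagree; this is exactly the mistake a
    // debug assertion of this kind is designed to catch.
    assert!(!defaults_agree::<RandomState>());
}
```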

## Stability

Following from this, it may be important that your hash functions and seed values are stable, meaning
that they won't change from one release to another.

The default hash function provided by this library is *NOT* stable, as it is based on the Rust standard
library's `DefaultHasher`, which does not have a specified algorithm and may change across releases of Rust.

Stability is especially important to consider if you are working with serialized geofilters, which may
have been created with a previous version of the Rust standard library.

This library provides an implementation of `ReproducibleBuildHasher` for the `FnvBuildHasher` provided
by the `fnv` crate version `1.0`. This is a _stable_ hash function in that it won't change unexpectedly,
but it doesn't have good diffusion properties. This means that if your input items have low entropy (for
example, numbers from `0..10000`), the geofilter may not be able to produce accurate estimations.
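FNV's stability comes from being a fully specified algorithm. A from-scratch sketch of 64-bit FNV-1a (the variant the `fnv` crate implements) makes this concrete — the constants below are fixed by the FNV specification, so the output for a given input can never change:

```rust
/// 64-bit FNV-1a: a fully specified algorithm, so its output for a
/// given input never changes -- unlike `DefaultHasher`.
fn fnv1a_64(data: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

fn main() {
    // The empty input hashes to the offset basis, per the specification.
    assert_eq!(fnv1a_64(b""), 0xcbf2_9ce4_8422_2325);
    // Deterministic: the same bytes always hash to the same value,
    // independent of process, platform, or Rust version.
    assert_eq!(fnv1a_64(b"geofilter"), fnv1a_64(b"geofilter"));
}
```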

## Uniformity and Diffusion

In order to produce accurate estimations, it is important that your hash function produces evenly
distributed outputs for your input items.

This property must be balanced against the performance requirements of your system as stronger hashing
algorithms are often slower.

Depending on your input data, different functions may be more or less appropriate. For example, if your input
items have high entropy (e.g. SHA256 values) then the diffusion of your hash function might matter less.
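One quick, informal way to eyeball uniformity on your own input distribution — not a substitute for a proper statistical test, and not part of this library — is to bucket the hashes of representative items by their top bits and check that the counts come out roughly even:

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher};

/// Counts how many of `n` sequential integers land in each of 16 bins
/// keyed by the top 4 bits of their hash. A roughly flat histogram is
/// a (weak) indication of uniform output on this input distribution.
fn histogram(n: u64) -> [u32; 16] {
    let bh = BuildHasherDefault::<DefaultHasher>::default();
    let mut bins = [0u32; 16];
    for i in 0..n {
        let h = bh.hash_one(i);
        bins[(h >> 60) as usize] += 1;
    }
    bins
}

fn main() {
    let bins = histogram(16_000);
    // With a well-diffusing hash, every bin should hold roughly 1000
    // items, even though the inputs were just 0..16000.
    println!("{bins:?}");
    assert!(bins.iter().all(|&c| c > 0));
}
```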

## Implementing your own `ReproducibleBuildHasher` type

If you are using a hash function that you have not implemented yourself, you will not be able to implement
`ReproducibleBuildHasher` on that type directly, due to Rust's orphan rules. The easiest way around this
is to create a newtype that proxies the underlying `BuildHasher`.

In addition to `BuildHasher`, `ReproducibleBuildHasher` requires `Default` and `Clone`, which are usually
implemented on `BuildHasher`s, so you can probably just `#[derive(...)]` them. If your `BuildHasher`
doesn't implement those traits, you may need to implement them manually.

Here is an example of how to use a newtype to mark your `BuildHasher` as reproducible.

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher};

#[derive(Clone, Default)]
pub struct MyBuildHasher(BuildHasherDefault<DefaultHasher>);

impl BuildHasher for MyBuildHasher {
type Hasher = DefaultHasher;

fn build_hasher(&self) -> Self::Hasher {
self.0.build_hasher()
}
}

impl ReproducibleBuildHasher for MyBuildHasher {}
```
23 changes: 15 additions & 8 deletions crates/geo_filters/evaluation/accuracy.rs
@@ -2,6 +2,7 @@ use std::fs::File;
use std::path::PathBuf;

use clap::Parser;
use geo_filters::build_hasher::UnstableDefaultBuildHasher;
use geo_filters::config::VariableConfig;
use itertools::Itertools;
use once_cell::sync::Lazy;
@@ -156,19 +157,22 @@ static SIMULATION_CONFIG_FROM_STR: Lazy<Vec<SimulationConfigParser>> = Lazy::new
let [b, bytes, msb] = capture_usizes(&c, [2, 3, 4]);
match t {
BucketType::U8 => {
let c = VariableConfig::<_, u8>::new(b, bytes, msb);
let c = VariableConfig::<_, u8, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
}
BucketType::U16 => {
let c = VariableConfig::<_, u16>::new(b, bytes, msb);
let c =
VariableConfig::<_, u16, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
}
BucketType::U32 => {
let c = VariableConfig::<_, u32>::new(b, bytes, msb);
let c =
VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
}
BucketType::U64 => {
let c = VariableConfig::<_, u64>::new(b, bytes, msb);
let c =
VariableConfig::<_, u64, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
}
}
@@ -185,19 +189,22 @@ static SIMULATION_CONFIG_FROM_STR: Lazy<Vec<SimulationConfigParser>> = Lazy::new

match t {
BucketType::U8 => {
let c = VariableConfig::<_, u8>::new(b, bytes, msb);
let c = VariableConfig::<_, u8, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
}
BucketType::U16 => {
let c = VariableConfig::<_, u16>::new(b, bytes, msb);
let c =
VariableConfig::<_, u16, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
}
BucketType::U32 => {
let c = VariableConfig::<_, u32>::new(b, bytes, msb);
let c =
VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
}
BucketType::U64 => {
let c = VariableConfig::<_, u64>::new(b, bytes, msb);
let c =
VariableConfig::<_, u64, UnstableDefaultBuildHasher>::new(b, bytes, msb);
Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
}
}
7 changes: 4 additions & 3 deletions crates/geo_filters/evaluation/performance.rs
@@ -1,4 +1,5 @@
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use geo_filters::build_hasher::UnstableDefaultBuildHasher;
use geo_filters::config::VariableConfig;
use geo_filters::diff_count::{GeoDiffCount, GeoDiffCount13};
use geo_filters::distinct_count::GeoDistinctCount13;
@@ -20,7 +21,7 @@ fn criterion_benchmark(c: &mut Criterion) {
})
});
group.bench_function("geo_diff_count_var_13", |b| {
let c = VariableConfig::<_, u32>::new(13, 7680, 256);
let c = VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(13, 7680, 256);
b.iter(move || {
let mut gc = GeoDiffCount::new(c.clone());
for i in 0..*size {
@@ -59,7 +60,7 @@ fn criterion_benchmark(c: &mut Criterion) {
})
});
group.bench_function("geo_diff_count_var_13", |b| {
let c = VariableConfig::<_, u32>::new(13, 7680, 256);
let c = VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(13, 7680, 256);
b.iter(move || {
let mut gc = GeoDiffCount::new(c.clone());
for i in 0..*size {
@@ -104,7 +105,7 @@ fn criterion_benchmark(c: &mut Criterion) {
})
});
group.bench_function("geo_diff_count_var_13", |b| {
let c = VariableConfig::<_, u32>::new(13, 7680, 256);
let c = VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(13, 7680, 256);
b.iter(move || {
let mut gc1 = GeoDiffCount::new(c.clone());
let mut gc2 = GeoDiffCount::new(c.clone());
38 changes: 38 additions & 0 deletions crates/geo_filters/src/build_hasher.rs
@@ -0,0 +1,38 @@
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher, Hasher as _};

use fnv::FnvBuildHasher;

/// Trait for a hasher factory that can be used to produce hashers
/// for use with geometric filters.
///
/// It is a superset of [`BuildHasher`], enforcing additional requirements
/// on the hasher builder that are required for the geometric filters and
/// surrounding code.
///
/// When performing operations across two different geometric filters,
/// the hashers must be equal, i.e. they must produce the same hash for the
/// same input.
pub trait ReproducibleBuildHasher: BuildHasher + Default + Clone {
#[inline]
fn debug_assert_hashers_eq() {
// In debug builds we check that two default-constructed build
// hashers produce the same hash output. The library user should
// only have implemented our build hasher trait if this is already
// true, but we check here in case they have implemented the trait
// in error.
debug_assert_eq!(
Self::default().build_hasher().finish(),
Self::default().build_hasher().finish(),
"Hashers produced by ReproducibleBuildHasher do not produce the same output with the same input"
);
}
}

/// Note that this `BuildHasher` has a consistent implementation of `Default`
/// but is NOT stable across releases of Rust. It is therefore dangerous
/// to use if you plan on serializing geofilters and reusing them: a filter
/// serialized with one version of the hasher factory may be deserialized
/// with another.
pub type UnstableDefaultBuildHasher = BuildHasherDefault<DefaultHasher>;

impl ReproducibleBuildHasher for UnstableDefaultBuildHasher {}
impl ReproducibleBuildHasher for FnvBuildHasher {}
73 changes: 60 additions & 13 deletions crates/geo_filters/src/config.rs
@@ -2,7 +2,7 @@

use std::{marker::PhantomData, sync::Arc};

use crate::Method;
use crate::{build_hasher::ReproducibleBuildHasher, Method};

mod bitchunks;
mod buckets;
@@ -30,8 +30,9 @@ use once_cell::sync::Lazy;
/// Those conversions can be shared across multiple geo filter instances. This way, the
/// conversions can also be optimized via e.g. lookup tables without paying the cost with every
/// new geo filter instance again and again.
pub trait GeoConfig<M: Method>: Clone + Eq + Sized + Send + Sync {
pub trait GeoConfig<M: Method>: Clone + Eq + Sized {
type BucketType: IsBucketType + 'static;
type BuildHasher: ReproducibleBuildHasher;

/// The number of most-significant bits that are stored sparsely as positions.
fn max_msb_len(&self) -> usize;
@@ -79,9 +80,16 @@ pub trait GeoConfig<M: Method>: Clone + Eq + Sized + Send + Sync {
/// Instantiating this type may panic if `T` is too small to hold the maximum possible
/// bucket id determined by `B`, or `B` is larger than the largest statically defined
/// lookup table.
#[derive(Clone, Eq, PartialEq)]
pub struct FixedConfig<M: Method, T, const B: usize, const BYTES: usize, const MSB: usize> {
_phantom: PhantomData<(M, T)>,
#[derive(Clone)]
pub struct FixedConfig<
M: Method,
T,
const B: usize,
const BYTES: usize,
const MSB: usize,
H: ReproducibleBuildHasher,
> {
_phantom: PhantomData<(M, T, H)>,
}

impl<
@@ -90,9 +98,11 @@ impl<
const B: usize,
const BYTES: usize,
const MSB: usize,
> GeoConfig<M> for FixedConfig<M, T, B, BYTES, MSB>
H: ReproducibleBuildHasher,
> GeoConfig<M> for FixedConfig<M, T, B, BYTES, MSB, H>
{
type BucketType = T;
type BuildHasher = H;

#[inline]
fn max_msb_len(&self) -> usize {
@@ -148,42 +158,76 @@ impl<
const B: usize,
const BYTES: usize,
const MSB: usize,
> Default for FixedConfig<M, T, B, BYTES, MSB>
H: ReproducibleBuildHasher,
> Default for FixedConfig<M, T, B, BYTES, MSB, H>
{
fn default() -> Self {
assert_bucket_type_large_enough::<T>(B);
assert_buckets_within_estimation_bound(B, BYTES * BITS_PER_BYTE);

assert!(
B < M::get_lookups().len(),
"B = {} is not available for fixed config, requires B < {}",
B,
M::get_lookups().len()
);

Self {
_phantom: PhantomData,
}
}
}

impl<
M: Method + Lookups,
T: IsBucketType,
const B: usize,
const BYTES: usize,
const MSB: usize,
H: ReproducibleBuildHasher,
> PartialEq for FixedConfig<M, T, B, BYTES, MSB, H>
{
fn eq(&self, _other: &Self) -> bool {
H::debug_assert_hashers_eq();

// The values of the fixed config are provided at compile time
// so no runtime computation is required
true
}
}

impl<
M: Method + Lookups,
T: IsBucketType,
const B: usize,
const BYTES: usize,
const MSB: usize,
H: ReproducibleBuildHasher,
> Eq for FixedConfig<M, T, B, BYTES, MSB, H>
{
}

/// Geometric filter configuration using dynamic lookup tables.
#[derive(Clone)]
pub struct VariableConfig<M: Method, T> {
pub struct VariableConfig<M: Method, T, H: ReproducibleBuildHasher> {
b: usize,
bytes: usize,
msb: usize,
_phantom: PhantomData<(M, T)>,
_phantom: PhantomData<(M, T, H)>,
lookup: Arc<Lookup>,
}

impl<M: Method, T> Eq for VariableConfig<M, T> {}
impl<M: Method, T, H: ReproducibleBuildHasher> Eq for VariableConfig<M, T, H> {}

impl<M: Method, T> PartialEq for VariableConfig<M, T> {
impl<M: Method, T, H: ReproducibleBuildHasher> PartialEq for VariableConfig<M, T, H> {
fn eq(&self, other: &Self) -> bool {
H::debug_assert_hashers_eq();

self.b == other.b && self.bytes == other.bytes && self.msb == other.msb
}
}

impl<M: Method + Lookups, T: IsBucketType> VariableConfig<M, T> {
impl<M: Method + Lookups, T: IsBucketType, H: ReproducibleBuildHasher> VariableConfig<M, T, H> {
/// Returns a new configuration value. See [`FixedConfig`] for the meaning
/// of the parameters. This functions computes a new lookup table every time
/// it is invoked, so make sure to share the resulting value as much as possible.
@@ -205,8 +249,11 @@ impl<M: Method + Lookups, T: IsBucketType> VariableConfig<M, T> {
}
}

impl<M: Method, T: IsBucketType + 'static> GeoConfig<M> for VariableConfig<M, T> {
impl<M: Method, T: IsBucketType + 'static, H: ReproducibleBuildHasher> GeoConfig<M>
for VariableConfig<M, T, H>
{
type BucketType = T;
type BuildHasher = H;

#[inline]
fn max_msb_len(&self) -> usize {