Is the number of bands correct? #209

Closed
ZJaume opened this issue Jun 7, 2023 · 3 comments

ZJaume commented Jun 7, 2023

When initializing a MinHash object and a MinHashLSH index with the same parameters, I noticed that the hashranges derived from the number of bands do not cover all the permutations in the MinHash.

For example:

In [18]: from datasketch import MinHash, MinHashLSH

In [19]: lsh = MinHashLSH(threshold=0.8, num_perm=128)

In [20]: lsh.hashranges
Out[20]:
[(0, 13),
 (13, 26),
 (26, 39),
 (39, 52),
 (52, 65),
 (65, 78),
 (78, 91),
 (91, 104),
 (104, 117)]

In [21]: m1 = MinHash(num_perm=128)

In [22]: m1.hashvalues.shape
Out[22]: (128,)

The last value of the hashranges is 117, so if I'm not wrong, the insert function

Hs = [self._H(minhash.hashvalues[start:end])
      for start, end in self.hashranges]
self.keys.insert(key, *Hs, buffer=buffer)
for H, hashtable in zip(Hs, self.hashtables):
    hashtable.insert(H, key, buffer=buffer)

does not add the last 11 values of m1.hashvalues. I guess that this is expected, given that the number of bands is an estimation, but I just wanted to ask in case there is something wrong.
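
The arithmetic behind the 11 unused values can be checked directly from the ranges printed above. This is a minimal sketch that uses only the numbers in this report and makes no datasketch calls; the variable names are chosen here for illustration.

```python
# Band ranges copied from the lsh.hashranges output above (9 bands of 13 values).
hashranges = [(0, 13), (13, 26), (26, 39), (39, 52), (52, 65),
              (65, 78), (78, 91), (91, 104), (104, 117)]
num_perm = 128  # length of m1.hashvalues

# Collect every hash-value index that falls inside some band.
covered = set()
for start, end in hashranges:
    covered.update(range(start, end))

# Indices of m1.hashvalues that no band ever reads.
unused = sorted(set(range(num_perm)) - covered)
print(len(hashranges), "bands,", len(covered), "values used,", len(unused), "unused")
print("unused indices:", unused[0], "to", unused[-1])
```

So 9 bands of 13 values use indices 0 to 116, and indices 117 to 127 are never passed to the insert code.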

ekzhu (Owner) commented Jun 7, 2023

This is related to this issue: #186.

Perhaps this one as well regarding optimizing the hyper parameters: #200

ekzhu (Owner) commented Jun 7, 2023

In short, the behavior is expected due to hyper-parameter optimization given the threshold. If you want to avoid it you can set the number of bands and band size manually.
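
To make the trade-off concrete, here is a rough sketch of the kind of search that could pick the number of bands b and the band size r for a given threshold. It is an illustration under assumptions, not datasketch's actual code: the function names (`collision_prob`, `integrate`, `pick_params`, `band_ranges`), the midpoint-rule integration, and the equal weighting of false positives and false negatives are all chosen here for the example. The only constraint it relies on is b * r <= num_perm, which is why some hash values can be left over.

```python
def collision_prob(s, b, r):
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b

def integrate(f, lo, hi, steps=200):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

def pick_params(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    """Hypothetical search: try every (b, r) with b * r <= num_perm and keep
    the pair with the lowest weighted false-positive + false-negative area."""
    best, best_error = None, float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            # Area under the collision curve below the threshold (false positives).
            fp = integrate(lambda s: collision_prob(s, b, r), 0.0, threshold)
            # Area above the collision curve at or above the threshold (false negatives).
            fn = integrate(lambda s: 1.0 - collision_prob(s, b, r), threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best, best_error = (b, r), error
    return best

def band_ranges(b, r):
    """Index ranges of b consecutive bands of r hash values each."""
    return [(i * r, (i + 1) * r) for i in range(b)]

b, r = pick_params(threshold=0.8, num_perm=128)
print("searched:", (b, r), "unused values:", 128 - b * r)

# A manual choice with b * r == num_perm uses every hash value.
manual = band_ranges(16, 8)
print("manual last band:", manual[-1])
```

In datasketch itself, the manual route is, as far as I know, the `params=(b, r)` argument of `MinHashLSH`, for example `MinHashLSH(num_perm=128, params=(16, 8))`. Keep in mind that fixing b and r by hand also fixes the similarity level at which candidates start to collide, so the choice should be checked against the threshold you actually want.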

ZJaume (Author) commented Jun 8, 2023

Many thanks for your advice!

ZJaume closed this as completed Jun 8, 2023