Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Obtain hashvalues by key from MinHashLSH #186

Open
pavelnemirovsky opened this issue Jun 4, 2022 · 10 comments
Open

Obtain hashvalues by key from MinHashLSH #186

pavelnemirovsky opened this issue Jun 4, 2022 · 10 comments
Labels

Comments

@pavelnemirovsky
Copy link

Guys, I am desperately looking for the ability to obtain hash values as an array of int(s) based on provided key? Any direction? Thanks in advance,
P

@ekzhu
Copy link
Owner

ekzhu commented Jun 7, 2022

MinHashLSH is designed for looking up keys given hashvalues (i.e., MinHash), but does not natively support the reverse lookup. I think a simple dictionary for key-> hash values would be helpful.

@pavelnemirovsky
Copy link
Author

@ekzhu can you give an example of that code? I spent some time and didn't get how to make it happen in the right way? Appreciate your help

@pavelnemirovsky
Copy link
Author

@ekzhu I wrote the code that recovers original min-hashes of a document obtained from MinHashLSH (Cassandra storage), but I found a frustrating issue (maybe expected) that not all items from the array of min-hash permutations are properly stored in LSH index (regardless it is Cassandra storage or not).

The conclusions are looks as following:

Index With Num of perm: 128, Bands: 11, Items in Band: 11
5.46875 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Num of perm: 120, Bands: 12, Items in Band: 10
0.0 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Numof perm: 64, Bands: 7, Items in Band: 9
1.5625 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Num of perm: 32, Bands: 4, Items in Band: 8
0.0 % of min-hashes will be lost / won't be stored in LSH storage
--
Index With Num of perm: 16, Bands: 2, Items in Band: 6
25.0 % of min-hashes will be lost / won't be stored in LSH storage

The script was used for testing the above behavior is here

Root cause is derived from _optimal_param function which exists in MinHashLSH class.

Thanks and sorry if this is the expected behavior of LSH implementation (didn't have a chance to deep dive into it)

@pavelnemirovsky
Copy link
Author

@ekzhu your advice is highly appreciated, PING

@ekzhu
Copy link
Owner

ekzhu commented Jun 20, 2022

Thanks for the interesting benchmark. Yes, your observation is correct. LSH asks for a fixed band-size. So, if the optimizer returns a band-size that doesn't divide the number of permutation functions evently, some minimum hash values will be lost.

If we were to constraint the optimization space to only band-sizes that are integer divisors of num_perm, then we would have less accurate index on average.

@pavelnemirovsky
Copy link
Author

@ekzhu understood, so if I'll use 8 bands with 16 items in this case accuracy of prediction will be a little less accurate right?

@ekzhu
Copy link
Owner

ekzhu commented Jun 27, 2022

If your num_perm = 128 and you use 8 x 16 your will be using all the hashvalues, but your accuracy may not be better than using 12 x 10. This depends on the threshold you use, and the type of data have. Since we cannot predict what data you are going to put in the index, the best-effort optimization is performed with threshold only.

@pavelnemirovsky
Copy link
Author

@ekzhu understood, thx
but it will be nice that _optimal_param will pick the values which will allow to restore min-hash values per-key basis, don't you think so?

@ekzhu
Copy link
Owner

ekzhu commented Mar 21, 2023

a similar discussion regarding _optimal_param. #200

@ekzhu
Copy link
Owner

ekzhu commented Mar 21, 2023

@pavelnemirovsky True, maybe a consideration is to refactor the hyper-parameter optimization out of MinHashLSH so user can choose what objective function they would like to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants