When initializing a MinHash object and a MinHashLSH index with the same parameters, I noticed that the hashranges derived from the number of bands do not cover all the permutations in the MinHash.
Doing this
In [18]: from datasketch import MinHash, MinHashLSH
In [19]: lsh = MinHashLSH(threshold=0.8, num_perm=128)
In [20]: lsh.hashranges
Out[20]:
[(0, 13),
(13, 26),
(26, 39),
(39, 52),
(52, 65),
(65, 78),
(78, 91),
(91, 104),
(104, 117)]
In [21]: m1 = MinHash(num_perm=128)
In [22]: m1.hashvalues.shape
Out[22]: (128,)
The last value of the hashranges is 117, so if I'm not wrong, the insert function does not add the last 11 values of m1.hashvalues. I guess that this is expected, given that the number of bands is an estimate, but I just wanted to ask in case there is something wrong.
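The arithmetic in the question can be checked without the library. This is a minimal sketch, assuming the bands are contiguous and equally wide as in the output above (9 bands of 13 values each). `uncovered_positions` is a hypothetical helper written for illustration, not part of datasketch.

```python
def uncovered_positions(hashranges, num_perm):
    """Return the hash value positions that no band covers."""
    covered = set()
    for start, end in hashranges:
        # Each range is half-open: start is included, end is excluded.
        covered.update(range(start, end))
    return [i for i in range(num_perm) if i not in covered]


# Rebuild the ranges shown in Out[20]: 9 bands of width 13.
hashranges = [(i * 13, (i + 1) * 13) for i in range(9)]
missing = uncovered_positions(hashranges, 128)

print(len(missing), missing[0], missing[-1])  # 11 117 127
```

So positions 117 through 127 of a 128-value MinHash fall outside every band.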
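In short, the behavior is expected: the number of bands and the band size are chosen by hyper-parameter optimization for the given threshold, so their product can be smaller than num_perm. If you want to avoid it, you can set the number of bands and the band size manually.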
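To make the manual option concrete, here is a rough sketch that lists the (bands, band size) pairs whose product is exactly num_perm, so that every hash value lands in some band. `exact_band_params` and `approx_threshold` are hypothetical helpers, not datasketch functions. The threshold estimate `(1/b) ** (1/r)` is the usual rule of thumb for where the LSH S-curve rises, used here as an assumption. It is not the optimization the library runs.

```python
def exact_band_params(num_perm):
    """List (b, r) pairs with b * r == num_perm, where b is the number of bands."""
    return [(b, num_perm // b) for b in range(1, num_perm + 1) if num_perm % b == 0]


def approx_threshold(b, r):
    """Rule-of-thumb Jaccard threshold for b bands of r values each."""
    return (1.0 / b) ** (1.0 / r)


pairs = exact_band_params(128)
for b, r in pairs:
    # Show each exact pair next to the similarity threshold it roughly implies.
    print(b, r, round(approx_threshold(b, r), 3))
```

As far as I know, a chosen pair can be passed to the index as `MinHashLSH(num_perm=128, params=(b, r))`, but check this against the version you have installed. Note that a pair that uses all 128 values may sit further from the requested threshold of 0.8 than the optimized pair does.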
Code referenced in the question (the insert function): datasketch/datasketch/lsh.py, lines 169 to 173 in fd9e56b