Good question. The idea here is to "cheat" by hashing each token (each "word" in your example) to an integer. The hash space then serves as a vocabulary that is already completely known: all integers from 0 to 2^32 - 1. So no vocabulary has to be built from the documents beforehand.
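To make that concrete, here is a minimal sketch of the idea in plain Python. It is not datasketch's actual implementation: the helper names (`token_hash`, `make_permutations`, `minhash_signature`, `estimate_jaccard`), the choice of SHA-1 truncated to 32 bits, and the prime used for the affine "permutations" are all assumptions made for illustration. The point is only that each token is mapped straight to an integer, so no shared vocabulary is ever constructed.

```python
import hashlib
import random

MAX_HASH = (1 << 32) - 1   # largest value in the assumed 32-bit hash space
PRIME = (1 << 61) - 1      # a Mersenne prime larger than MAX_HASH (illustrative choice)


def token_hash(token: str) -> int:
    """Map a token to a 32-bit integer; no vocabulary lookup is involved."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(num_perm: int, seed: int = 1):
    """Draw (a, b) pairs for hash functions of the form (a * h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_perm)]


def minhash_signature(tokens, perms):
    """For each (a, b), keep the minimum permuted hash over the distinct tokens."""
    hashes = [token_hash(t) for t in set(tokens)]
    return [min(((a * h + b) % PRIME) & MAX_HASH for h in hashes)
            for a, b in perms]


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two documents agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


perms = make_permutations(128)
doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox sleeps near the lazy dog".split()
sig_a = minhash_signature(doc_a, perms)
sig_b = minhash_signature(doc_b, perms)
print(estimate_jaccard(sig_a, sig_b))
```

Each document is processed on its own, using only its own tokens and the shared `perms`. Two signatures are comparable because they were built with the same hash functions, not because the documents share a vocabulary.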
Say we have three documents A, B and C. Each document might contain different words.
According to the documentation of datasketch.MinHash, we can get a min-hash for A with
from datasketch import MinHash
minhash = MinHash(num_perm=128)
minhash.update(A.encode('utf-8'))
vector = minhash.digest()
But don't we need to create a vocabulary consisting of all the words from A, B and C before computing the vector?