I am working with a large dataset of approximately 1 billion sets. Each set contains integers between 0 and one billion, and set sizes are typically in the hundreds (the sets represent a graph).
I am looking to compress them while preserving the approximate number of common items (or a metric that correlates with it).
My only use case is to quickly compare two set IDs.
Ideally, I want to map each set to a short vector so that the dot product approximates the number of common elements, and then put these vectors into a numpy array or a faiss index. My main concerns are speed and RAM usage. Could I use your library for this task, and do you have any advice? Thank you for your time and help.
Thanks for contributing your issue. You can try our MinHashLSH index with the Redis storage backend for scale. With that many sets, depending on their sizes, you will probably still need a beefy machine with enough memory to hold the MinHashLSH index backed by Redis.
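Here is a minimal sketch of what that could look like with datasketch (the 0.5 threshold and the Redis host/port are placeholders, and it assumes a local Redis instance is running):

```python
from datasketch import MinHash, MinHashLSH

def make_minhash(s, num_perm=64):
    # Hash every integer in the set into a fixed-size MinHash sketch.
    m = MinHash(num_perm=num_perm)
    for x in s:
        m.update(str(x).encode('utf8'))
    return m

m1 = make_minhash({1, 2, 3, 4})
m2 = make_minhash({2, 3, 4, 5})

# MinHash estimates Jaccard similarity J = |A n B| / |A u B|.
# If you also store the exact set sizes, the intersection size
# follows as |A n B| ~= J * (|A| + |B|) / (1 + J).
j = m1.jaccard(m2)
print(j)

# LSH index over the sketches, backed by Redis for scale.
lsh = MinHashLSH(
    threshold=0.5,  # placeholder Jaccard threshold
    num_perm=64,
    storage_config={
        'type': 'redis',
        'redis': {'host': 'localhost', 'port': 6379},
    },
)
lsh.insert('set-1', m1)
lsh.insert('set-2', m2)
print(lsh.query(m1))  # keys of candidate sets similar to m1
```

Note that MinHash approximates Jaccard similarity rather than raw intersection size, so for your dot-product use case you would convert between the two as in the comment above.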
Say you have N sets: the memory usage of all MinHash sketches will be at minimum N * num_perm * 8 bytes, and the memory size of the index is approximately equal to the total size of all the MinHash sketches. With N = 10^9 and num_perm = 64, that is 512 GB of memory needed for Redis.
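For reference, the back-of-envelope arithmetic:

```python
# Each of the num_perm permutation values in a sketch is a 64-bit (8-byte) integer.
n_sets = 10**9
num_perm = 64
total_bytes = n_sets * num_perm * 8
print(f"{total_bytes / 1e9:.0f} GB")  # -> 512 GB
```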