Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advice for compression of a big graph #196

Open
ubbikk opened this issue Jan 22, 2023 · 3 comments
Open

Advice for compression of a big graph #196

ubbikk opened this issue Jan 22, 2023 · 3 comments
Labels

Comments

@ubbikk
Copy link

ubbikk commented Jan 22, 2023

I am working with a large dataset consisting of approximately 1 billion sets, where each set contains integers between 0 and one billion, with sizes typically in the hundreds(these sets represent a graph).
I am looking to compress them while preserving the approximate number of common items(or a metric that correlates with it).
My only use case is to have the ability to quickly compare two set ids.
Ideally, I want to map each set to a short vector, so that the dot product will be the approximate number of common elements, and then use this to create a numpy array or put it into faiss. My main concerns are speed and RAM usage. Could I use your library for this task, and do you have any advice? Thank you for your time and help.

@ekzhu
Copy link
Owner

ekzhu commented Jan 24, 2023

Thanks for contributing your issue. You can try our MinHashLSH index with Redis storage backend for scale. With that many sets, depending on the size of sets, you probably still need a beefy machine with enough memory to hold the MinHashLSH index backed by Redis.

Let's say you have N sets, the memory usage of all MinHash sketches will be at minimum N*num_perm*8 bytes. The memory size of the index is approximately equal to the total size of all MinHash sketches. Let's say N = 10^9, num_perm = 64, this is 512 gb memory needed for Redis.

@ekzhu ekzhu added the question label Mar 21, 2023
@icsa
Copy link

icsa commented Jul 19, 2023

Consider Roaring Bitmaps to represent each [compressed] set of integers.

https://github.com/RoaringBitmap

@ekzhu
Copy link
Owner

ekzhu commented Aug 17, 2023

Interesting. Have you tried using this type of integer compression for MinHash signatures?

Consider Roaring Bitmaps to represent each [compressed] set of integers.

https://github.com/RoaringBitmap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants