I am working with a large dataset of approximately 1 billion sets. Each set contains integers between 0 and one billion, and set sizes are typically in the hundreds (the sets represent a graph).
I am looking to compress them while preserving the approximate number of common items (or a metric that correlates with it).
My only use case is to quickly compare two set IDs.
Ideally, I want to map each set to a short vector so that the dot product approximates the number of common elements, and then put these vectors into a numpy array or a faiss index. My main concerns are speed and RAM usage. Could I use your library for this task, and do you have any advice? Thank you for your time and help.
Thanks for contributing your issue. You can try our MinHashLSH index with the Redis storage backend for scale. With that many sets, depending on their sizes, you will probably still need a beefy machine with enough memory to hold the MinHashLSH index backed by Redis.
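Here is a minimal sketch of what that could look like with datasketch (the 0.5 threshold and the Redis host/port are placeholders, and it assumes a local Redis instance is running):

```python
from datasketch import MinHash, MinHashLSH

def make_minhash(s, num_perm=64):
    # Hash every integer in the set into a fixed-size MinHash sketch.
    m = MinHash(num_perm=num_perm)
    for x in s:
        m.update(str(x).encode('utf8'))
    return m

m1 = make_minhash({1, 2, 3, 4})
m2 = make_minhash({2, 3, 4, 5})

# MinHash estimates Jaccard similarity J = |A n B| / |A u B|.
# If you also store the exact set sizes, the intersection size
# follows as |A n B| ~= J * (|A| + |B|) / (1 + J).
j = m1.jaccard(m2)
print(j)

# LSH index over the sketches, backed by Redis for scale.
lsh = MinHashLSH(
    threshold=0.5,  # placeholder Jaccard threshold
    num_perm=64,
    storage_config={
        'type': 'redis',
        'redis': {'host': 'localhost', 'port': 6379},
    },
)
lsh.insert('set-1', m1)
lsh.insert('set-2', m2)
print(lsh.query(m1))  # keys of candidate sets similar to m1
```

Note that MinHash approximates Jaccard similarity rather than raw intersection size, so for your dot-product use case you would convert between the two as in the comment above.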
Say you have N sets: the memory usage of all MinHash sketches will be at minimum N * num_perm * 8 bytes, and the memory size of the index is approximately equal to the total size of all the MinHash sketches. With N = 10^9 and num_perm = 64, that is 512 GB of memory needed for Redis.
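For reference, the back-of-envelope arithmetic:

```python
# Each of the num_perm permutation values in a sketch is a 64-bit (8-byte) integer.
n_sets = 10**9
num_perm = 64
total_bytes = n_sets * num_perm * 8
print(f"{total_bytes / 1e9:.0f} GB")  # -> 512 GB
```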