Could MinHash supports using Redis or other database as the storage? #244

Open
rocke2020 opened this issue Dec 11, 2024 · 1 comment

rocke2020 commented Dec 11, 2024

Dear @ekzhu,
We have started using v1.6.5 and plan to use Redis as the storage layer in production. Thanks for this feature: "MinHash LSH supports using Redis as the storage layer for handling large index and providing optional persistence as part of a production environment."
In real usage, we would also like to use Redis or similar middleware to store the MinHash objects themselves.

As the MinHashLSH doc example below shows, after an LSH query we still need the MinHash objects to estimate Jaccard similarity and sort the results.
So we hope to store the MinHash objects in Redis or another NoSQL database as well.
However, the MinHash doc says "Since version 1.1.1, MinHash will only support serialization using pickle". Does that mean we can only store a MinHash with Python's pickle and cannot use Redis? If so, could datasketch let MinHash support Redis or another database as storage?
Thanks very much in advance!

https://ekzhu.com/datasketch/documentation.html#datasketch.MinHash
https://ekzhu.com/datasketch/documentation.html#datasketch.MinHashLSH

from datasketch import MinHash, MinHashLSH
import numpy as np

# Generate 100 random MinHashes.
minhashes = MinHash.bulk(
    np.random.randint(low=0, high=30, size=(100, 10)),
    num_perm=128
)

# Create LSH index.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for i, m in enumerate(minhashes):
    lsh.insert(i, m)

# Get the initial results from LSH.
query = minhashes[0]
results = lsh.query(query)

# Rank results using Jaccard similarity estimated by MinHash.
results = [(query.jaccard(minhashes[key]), key) for key in results]
results.sort(reverse=True)
print(results)
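
The ranking step above looks up `minhashes[key]` for every candidate, so each MinHash has to be retrievable by key after the query. A minimal sketch of one way to keep them in a key-value store follows, assuming a MinHash can be rebuilt from its seed and its hash values. The byte layout and the names `pack_minhash` and `unpack_minhash` are hypothetical and not part of datasketch, and a plain dict stands in for the Redis client so the example runs without a server.

```python
import struct

# Hypothetical byte layout (an assumption of this sketch, not a datasketch format):
# little-endian int64 seed, uint32 count, then `count` uint64 hash values.
_HEADER = struct.Struct("<qI")


def pack_minhash(seed, hashvalues):
    """Encode a seed and a sequence of hash values as bytes."""
    values = [int(v) for v in hashvalues]
    body = struct.pack(f"<{len(values)}Q", *values)
    return _HEADER.pack(seed, len(values)) + body


def unpack_minhash(blob):
    """Decode bytes written by pack_minhash back into (seed, hash values)."""
    seed, count = _HEADER.unpack_from(blob, 0)
    values = struct.unpack_from(f"<{count}Q", blob, _HEADER.size)
    return seed, list(values)


# A dict stands in for the key-value store; with a real Redis client the
# two lines below would become a set and a get on the same key.
store = {}
store[b"minhash:0"] = pack_minhash(1, [3, 2**32 - 1, 42])
seed, values = unpack_minhash(store[b"minhash:0"])
```

With datasketch, the inputs would presumably be `m.seed` and `m.hashvalues`, and the decoded pair could be passed back as `MinHash(seed=seed, hashvalues=values)`. The output of `pickle.dumps(m)` is also plain bytes, so a pickled MinHash can be stored as a Redis value in the same way. datasketch's `LeanMinHash` additionally offers `serialize` and `deserialize` methods that may fit better than a hand-rolled layout.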
rocke2020 changed the title from "Could MinHash supports using Redis as the storage?" to "Could MinHash supports using Redis or other database as the storage?" on Dec 11, 2024
Owner

ekzhu commented Dec 12, 2024

I believe you don't need to use pickle if your MinHash is already stored inside Redis. You can create a new LSH index with the same basename in the storage_config for Redis.
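
Reusing a basename could look roughly like the sketch below. The `storage_config` shape follows the Redis example in the datasketch documentation. The helper names `redis_storage_config` and `open_index`, the basename `b"my_index"`, and the host and port are assumptions for illustration. `open_index` needs datasketch installed and a reachable Redis server, so it is defined here but not called.

```python
def redis_storage_config(basename, host="localhost", port=6379):
    """Build a storage_config dict for a Redis-backed MinHashLSH index."""
    return {
        "type": "redis",
        "basename": basename,  # the same basename points at the same stored index
        "redis": {"host": host, "port": port},
    }


def open_index(basename, threshold=0.5, num_perm=128):
    """Create a MinHashLSH bound to the Redis index named by basename."""
    # Imported lazily so the config helper works without datasketch installed.
    from datasketch import MinHashLSH

    return MinHashLSH(
        threshold=threshold,
        num_perm=num_perm,
        storage_config=redis_storage_config(basename),
    )


# Hypothetical basename; a later process would pass the same value.
config = redis_storage_config(b"my_index")
```

The `threshold` and `num_perm` values should presumably match those used when the index was first populated, since they determine how the hash tables are laid out.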
