Advice for minhash with sparse dataset #193

Open
mathephysicist opened this issue Jan 5, 2023 · 1 comment
@mathephysicist

I have a dataset that is very sparse. That is, it has many null fields and multiple variations of the same entity.

Essentially,
FN, LN, field1, field2, ..., fieldk, ..., fieldN
filled, filled, null, null, ..., Value, null, ..., null (this is entity 1)
filled (maybe with a typo, or more info than the row above), filled, null, ..., Value (with typo), (maybe this one is filled), ..., null (same entity as 1)
filled (maybe with a typo, or more info than the row above), filled, null, ..., Value2 (different than above), (maybe this one is filled), ..., null (same entity as 1)
Then we have other entities entirely.

I've been using the minhashensemble code and have tried a few variations: indexing per column (to deal with nulls better), and concatenating all fields together with the word "null" (or just a space) for empty fields, then evaluating different containment scores. Each variation seems to perform slightly better in some situations and slightly worse in others. Does anyone know of a better way to approach this type of problem, or can you recommend a resource for digging deeper into what might work here?
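For illustration, the "indexing per column" variation could be sketched roughly as below. This is only a guess at the approach, not the code actually used: the helper name `record_to_tokens`, the character 3-gram choice, and the sample field values are all assumptions. Each token is tagged with its column name, and null fields contribute no tokens, so two records do not look similar merely because they share empty fields.

```python
def record_to_tokens(record, k=3):
    """Turn a dict of field -> value into a set of field-tagged character k-grams.

    Hypothetical helper: null (None or blank) fields are skipped entirely
    instead of being encoded as the word "null".
    """
    tokens = set()
    for field, value in record.items():
        if value is None or not str(value).strip():
            continue  # null field: contributes nothing to the set
        text = str(value).strip().lower()
        if len(text) < k:
            # Value shorter than k: keep it as a single token.
            tokens.add(f"{field}:{text}")
            continue
        for i in range(len(text) - k + 1):
            tokens.add(f"{field}:{text[i:i + k]}")
    return tokens


# Made-up rows in the shape described above.
row_a = {"FN": "John", "LN": "Smith", "field1": None, "fieldk": "Value"}
row_b = {"FN": "Jon", "LN": "Smith", "field1": None, "fieldk": "Valeu"}
shared = record_to_tokens(row_a) & record_to_tokens(row_b)
```

Character k-grams tolerate small typos better than whole-word tokens, at the cost of larger sets; whether that trade-off helps on this dataset would need to be measured.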

@ekzhu
Owner

ekzhu commented Jan 9, 2023

Thanks for posting the question. It would help if you could clarify:

  1. What is the input to the Jaccard/containment similarity function, without using MinHash?
  2. What is the intended scenario, e.g., search, one-off similarity estimation, etc.?
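To make the first question concrete, here is a minimal sketch of the exact (non-MinHash) Jaccard and containment scores on plain Python sets. The function names and the convention of returning 0.0 for empty inputs are assumptions made for this example; the point is that whatever sets go into these functions are the "input" being asked about.

```python
def jaccard(a, b):
    """Exact Jaccard similarity: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def containment(query, candidate):
    """Exact containment of query in candidate: |Q & C| / |Q|."""
    query, candidate = set(query), set(candidate)
    if not query:
        return 0.0  # assumed convention for an empty query
    return len(query & candidate) / len(query)


# A sparse record (few tokens) fully contained in a richer one:
sparse = {"FN:john", "LN:smith"}
rich = {"FN:john", "LN:smith", "field1:x", "fieldk:value"}
print(jaccard(sparse, rich))      # 0.5
print(containment(sparse, rich))  # 1.0
```

Containment is asymmetric, so a record with few filled fields can score 1.0 against a richer record for the same entity even when the Jaccard score is low.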

@ekzhu ekzhu added the question label Mar 21, 2023