Question: reproducible results #65

Open
ivan-marroquin opened this issue Oct 11, 2018 · 5 comments

Comments

@ivan-marroquin

Hi,

Is there a way to fix a seed in order to get reproducible results?

Many thanks,
Ivan

@yurymalkov
Member

Hi @ivan-marroquin, the results are reproducible if you use only one thread during index construction.
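
To illustrate why the thread count matters, here is a minimal sketch, not the library's code. It assumes a toy incremental graph in which each point links to its nearest previously inserted points; `build_graph` and `sq_dist` are hypothetical helpers written for this example. One thread gives a fixed insertion order and therefore the same graph every time. Several threads can interleave insertions differently on each run, which is modeled here simply by reversing the order.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_graph(points, order, m=2):
    """Toy incremental graph (not HNSW): each point links to the m nearest
    points inserted before it, so the result depends on insertion order."""
    inserted = []
    links = {}
    for idx in order:
        # Rank already-inserted points by distance; ties are broken by id.
        ranked = sorted(inserted, key=lambda j: (sq_dist(points[idx], points[j]), j))
        links[idx] = ranked[:m]
        inserted.append(idx)
    return links


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(50)]

in_order = list(range(50))
g1 = build_graph(points, in_order)        # fixed order, as with one thread
g2 = build_graph(points, in_order)        # same order again
g3 = build_graph(points, in_order[::-1])  # stand-in for a different interleaving

print("same order, same graph:", g1 == g2)
print("different order, same graph:", g1 == g3)
```

The first comparison is true and the second is false: the point inserted first has no earlier points to link to, so changing which point comes first already changes the graph.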

@ivan-marroquin
Author

Hi @yurymalkov, thanks for the clarification. Do you think it is possible to ensure reproducible results when using several CPUs? It would be a great addition to a great package!

Thanks for everything,

Ivan

@searchivarius
Member

@ivan-marroquin it is possible, but it is complicated. It also prevents incremental one-vector updates: you will likely need to process everything in batches. There is also no obvious benefit to this. If the indexing parameters M and efConstruction are sufficiently large, what we usually see are very small variations in performance.

@ivan-marroquin
Author

Hi @searchivarius, I will try using multiple batches. Since my input file has N >> D (N: observations, D: features), I have to use M = 5 and efConstruction = 10 to deal with this large amount of data. Is it possible that I will get relatively large variations?

@searchivarius
Member

Hi @ivan-marroquin, that is possible. It is also possible that the method will not work well at all. I have seen some data sets where this happened. HNSW is relatively robust, but there are no guarantees. It is best to build the index and test it.

BTW, for large amounts of data, you can easily trade off some efficiency for performance if you index in chunks. Say, index chunks of no more than 10M records each, then combine the results. If you have, say, 1B records, the difference in indexing time can be an order of magnitude, so you would be able to trade it off for increased accuracy (i.e., by setting larger M and efConstruction).
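
A rough sketch of the "index in chunks, then combine results" idea, under stated assumptions: each chunk is searched with a brute-force scan standing in for a separate per-chunk HNSW index, and `search_chunk` and `search_all` are hypothetical names for this example rather than library functions. The data sizes are scaled down so the snippet runs instantly.

```python
import heapq
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_chunk(chunk, query, k):
    """Brute-force stand-in for querying one per-chunk index.
    Returns up to k (distance, global_id) pairs, nearest first."""
    return heapq.nsmallest(k, ((sq_dist(vec, query), gid) for gid, vec in chunk))


def search_all(chunks, query, k):
    """Query every chunk for its k best candidates, then keep the k best overall."""
    candidates = []
    for chunk in chunks:
        candidates.extend(search_chunk(chunk, query, k))
    return heapq.nsmallest(k, candidates)


rng = random.Random(0)
data = [(gid, (rng.random(), rng.random())) for gid in range(200)]

# Split the records into fixed-size chunks; each would get its own index.
chunk_size = 50
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

query = (0.5, 0.5)
merged = search_all(chunks, query, k=5)
exact = heapq.nsmallest(5, ((sq_dist(vec, query), gid) for gid, vec in data))

print("merged result matches a full scan:", merged == exact)
```

With an exact search inside each chunk, the merged top-k matches a full scan. With an approximate index per chunk, the merged result is only as accurate as the per-chunk searches, and every query has to visit every chunk, which is the efficiency being traded away.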
