When distance computation is expensive how to gradually build graph #194
Comments
Others may have better advice, but here is my perspective (FWIW I have spent a reasonable chunk of my spare time looking at how approximate nearest neighbors, and nearest neighbor descent specifically, can give sub-optimal results for high-dimensional datasets). The problem I have seen with high-dimensional datasets is the existence of hubs, which in turn implies the existence of anti-hubs: objects that appear in the neighbor lists of few or no other points, so information about their true neighbors propagates poorly through the neighbor-of-neighbors exploration.
This is bad news for the efficiency of nearest neighbor descent, which already tends to repeat distance calculations a lot. Apart from that, tweaking the nearest neighbor descent parameters to be more thorough doesn't seem to be very cost effective compared to using the default build parameters and then creating the search graph. Depending on your distance function, you could also try reducing the dimensionality of the data with something cheap like a truncated SVD, run nearest neighbor descent on that, and then run nearest neighbor descent on the high-dimensional data using the knn graph from the previous step as the initialization.
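
A rough sketch of that two-stage idea, assuming a recent pynndescent that exposes an `init_graph` argument for warm-starting (check your installed version); the data, metrics, and parameter values here are illustrative only, not a prescribed recipe:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from pynndescent import NNDescent

# Stand-in for the real high-dimensional data.
X = np.random.rand(10000, 500).astype(np.float32)

# Stage 1: nearest neighbor descent on a cheap low-dimensional projection.
X_low = TruncatedSVD(n_components=30).fit_transform(X)
nnd_low = NNDescent(X_low, metric="euclidean", n_neighbors=30)
init_indices, _ = nnd_low.neighbor_graph

# Stage 2: refine on the original data, warm-started from the cheap graph
# so the expensive metric is evaluated on fewer candidate pairs.
nnd_high = NNDescent(X, metric="cosine", n_neighbors=30,
                     init_graph=init_indices)
indices, distances = nnd_high.neighbor_graph
```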
Hello James, thanks for the response and detailed explanation. In my case, dimensionality is not the problem. Distance computation is expensive not because the data has many dimensions but because of the genomic sequence manipulation involved (some sequence-alignment-related steps). For the case you mentioned, where nearest neighbor descent tends to repeat distance calculations a lot, is there a typical figure for what fraction of all pairwise distances ends up being computed to get a very good knn graph? 90%? Or essentially all of them? I am comparing it to HNSW because HNSW uses only a very small fraction of the pairwise distances to build its graph (I noticed that in HNSW a distance is computed only when that particular pair is needed), which makes it very efficient when distance computation is very expensive. However, it only works for metric distances like Hamming or Euclidean distance. The distance I am using this time for the knn graph is not a metric, so HNSW does not work. Thanks, Jianshu
I doubt there is any way to know, because it will be highly dataset dependent. The more hubness the true nearest neighbor graph displays, the more repeated calculations there will be: by definition, hubs appear in the neighbor lists of other objects at a much higher frequency than other observations. That means they get involved in a large number of distance calculations, many of which are redundant. If hubness is an issue for your dataset, it can be detected even when the nearest neighbor graph is quite rough, e.g. it will be apparent after one iteration of NND. Unfortunately, at that point you have already carried out a large number of distance calculations, and there isn't a good way to fix the problem.
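
As a cheap check, hubness shows up in the k-occurrence counts of even a rough knn graph, i.e. how often each point appears in other points' neighbor lists. A small sketch, assuming `indices` is the (n_samples, n_neighbors) index array returned by `neighbor_graph`:

```python
import numpy as np

def k_occurrence(indices: np.ndarray) -> np.ndarray:
    """Count how many neighbor lists each point appears in.

    `indices` is the (n_samples, n_neighbors) array of neighbor indices,
    e.g. the first element of NNDescent(...).neighbor_graph.
    """
    return np.bincount(indices.ravel(), minlength=indices.shape[0])

# A heavily right-skewed distribution (a few points with very large counts,
# many points with counts near zero) suggests hubs and anti-hubs:
# counts = k_occurrence(indices)
# print(counts.max(), np.median(counts), (counts == 0).sum())
```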
Hello pynndescent team,
I am in a situation where distance computation is very expensive (genomic sequences), so computing the all-versus-all distance matrix as input is not feasible (it takes forever). I am wondering how to compute distances only when necessary while building NNDescent(), i.e. build the knn graph gradually without computing every pairwise distance in the dataset, only the ones that are actually needed; for example, for two data points that are far apart in the knn graph, we do not need their distance. Any idea how to do this with the original data as input and our own distance function?
Thanks,
Jianshu
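
For reference, pynndescent only evaluates the metric on the candidate pairs it actually examines while building the graph, not all-versus-all, and it accepts a custom (even non-metric) dissimilarity as a callable. The caveat is that a user-supplied callable has to be numba-compilable, which may rule out distances that call out to external alignment tools; the distance below is purely a placeholder:

```python
import numpy as np
import numba
from pynndescent import NNDescent

@numba.njit()
def my_distance(x, y):
    # Placeholder for a real (possibly non-metric) dissimilarity.
    # pynndescent calls this only for candidate pairs, never all-vs-all.
    return np.sum(np.abs(x - y))

data = np.random.rand(5000, 64).astype(np.float32)
index = NNDescent(data, metric=my_distance, n_neighbors=20)
neighbor_indices, neighbor_dists = index.neighbor_graph
```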