Chris,
Thank you for the tutorial and code. I found yours to be one of the more understandable explanations. MinHashing seems extremely clever.
In your runHashMinExample, you commented that you used the direct calculation for the similarities, which I have no problem understanding. However, I have been searching for the "MinHash approach" to computing the similarities. I was wondering if you had written this, or if you could point me in the right direction. I have a huge dataset and would like to use the MinHash approach.
I am assuming that this would be faster. I am also assuming that the storage code (for the triangle matrix) would remain the same.
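To make the question concrete, here is roughly what I picture the MinHash approach looking like. This is only a sketch under my own assumptions, not your code: the function names (minhash_signature, estimate_similarity, triangle_index, estimate_all_pairs) are hypothetical, shingles are assumed to be non-empty sets of integer IDs, and the hash functions are assumed to have the form (a*x + b) % prime.

```python
def minhash_signature(shingles, hash_params, prime):
    # One signature component per hash function h(x) = (a*x + b) % prime:
    # the minimum hash value over all shingle IDs in the document.
    # Assumes `shingles` is a non-empty set of integers.
    return [min((a * x + b) % prime for x in shingles) for a, b in hash_params]


def estimate_similarity(sig1, sig2):
    # Estimated Jaccard similarity: the fraction of signature positions
    # in which the two documents agree.
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


def triangle_index(i, j, n):
    # Position of the pair (i, j), with i < j, in a flat array that stores
    # only the upper triangle of an n x n matrix (no diagonal).
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def estimate_all_pairs(signatures):
    # Compare every pair of signatures and store the estimates in the
    # flat triangle layout.
    n = len(signatures)
    sims = [0.0] * (n * (n - 1) // 2)
    for i in range(n):
        for j in range(i + 1, n):
            sims[triangle_index(i, j, n)] = estimate_similarity(
                signatures[i], signatures[j]
            )
    return sims
```

If this is right, each comparison touches only the fixed-length signatures rather than the full shingle sets, which is where I would expect the speedup to come from. The number of pairs to compare stays the same, so the triangle-matrix storage would not need to change.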
Thank you,
Ben