Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparsehash (friendly fork of imohash) #7

Closed
oliverpool opened this issue Feb 11, 2020 · 9 comments
Closed

Sparsehash (friendly fork of imohash) #7

oliverpool opened this issue Feb 11, 2020 · 9 comments

Comments

@oliverpool
Copy link

oliverpool commented Feb 11, 2020

@kalafut thanks for this great package! (I am especially grateful for the tests, which allowed me to refactor without breaking the algo!)

I really like the idea of hashing only a small part of the file and plan to use it for my next project.
I took a look at the code and did some refactoring (which ended into a fork: https://github.com/oliverpool/sparsehash - documentation).

Main differences:

  • the hash function must be provided on the creation of the Hasher
    • so there is no external dependency anymore for the root package (only stdlib)
    • the cmd and the example show how easy it is to plug murmur3
  • it covers some of the V2 ideas (Some V2 ideas #6)
    • the hashing function can be easily swapped for highway hash (but the output will still be truncated to 16 bytes)
    • the hashing function can be changed depending on the context
  • Whole file hash is still available (with SampleSize = 0)

I also added error handling (in case the file reading triggers some error).

Feel free to share you thoughts.

@oliverpool
Copy link
Author

oliverpool commented Feb 12, 2020

Updates:

  • I removed the inclusion of the size in the hash (must be handled by the user)
  • the hash returns a byte slice (conversion to array left to the user)

So it is now really different (and much smaller) - but the tests are still passing! (with some adaptation to re-include the input size)

@kalafut
Copy link
Owner

kalafut commented Feb 16, 2020

@oliverpool Cool! Glad you're getting some use out of the library. We're you hoping for a merge back into imohash, or do you think things have diverged, or may do so? I wouldn't mind cutting a v2 at some point with some of the enhancement mentioned in #6 or your fork, though I'm not sure when I'll be able to get to that. Though this may be fortuitous... as perhaps you'll discover some other things as you work on your project.

@oliverpool
Copy link
Author

Now that I refactored it further and changed a lot of bits, I am not sure if merging back would be worth it.
Along the go spirit, it is better to keep it as another package (since it is a totally different API).

(your package and its license are included there, since most of the code comes from here)

@kalafut
Copy link
Owner

kalafut commented Feb 22, 2020

Sounds good. I'll keep an eye on your project, especially should I get around to some of the v2 mods. For now I'm going to close the issue. Thanks!

@kalafut kalafut closed this as completed Feb 22, 2020
@coolaj86
Copy link

coolaj86 commented Jan 31, 2021

@oliverpool Would you turn on Github Issues for SparseHash? I wanted to have some discussion with you, as you've done many of the things that I was considering doing, and there are a few more options I'm wondering if you'd be willing to consider - such as making the following variable:

  • chunk size
  • number of chunks
  • small file exception size

16k is probably enough for all common formats (jpeg, hvec, mpeg, mp4, etc), but I'd like to have some sort of reference to reasonably prove that or to have it adjustable.

Also, I'm curious as to what the two of you are working on.

@kalafut
Copy link
Owner

kalafut commented Jan 31, 2021

@oliverpool @coolaj86 It has been a while since we chatted about this, and I'm curious how the SparseHash updates worked out for you? I'm definitely willing to consider pulling some of the options back into imohash, perhaps as a v2 (or maybe accomplished just with functional options).

@oliverpool
Copy link
Author

@coolaj86 I opened the issues there.

As you can see on the documentation, the SampleSize and the SizeThreshold can already be tuned (they are not global variables, but struct fields).

I am currently not considering allowing the number of chunks to be customizable:

  • adding this logic is non-trivial
  • if you have some changing file, where only specific part of the content change (without the beginning/middle/end and file-size being affected), then you should really hash the whole file

@kalafut I am quite happy with my fork: it doesn't have any external dependency (which allowed me to use a fork of murmur3). I also added error-handling everywhere.

I think functional options are very cool (I will not use them, because the 3 parameters of my Hasher are required and some sane defaults are provided by New)

It should be feasible to refactor your current version to use a v2 implementation in the background (I am not sure if it is worth it).

@kalafut
Copy link
Owner

kalafut commented Feb 1, 2021

@oliverpool Nice. At a minimum I might survey the mmh3 landscape to see if that dep should be upgraded.

@kalafut
Copy link
Owner

kalafut commented Jan 28, 2024

@oliverpool @coolaj86 In case either of you are still interested in this space, I'm scoping out a few backwards compatible updates, including user-configurable chunk count. See #11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants