Sparsehash (friendly fork of imohash) #7

oliverpool · 2020-02-11T20:59:08Z

@kalafut thanks for this great package! (I am especially grateful for the tests, which allowed me to refactor without breaking the algo!)

I really like the idea of hashing only a small part of the file and plan to use it for my next project.
I took a look at the code and did some refactoring (which ended into a fork: https://github.com/oliverpool/sparsehash - documentation).

Main differences:

the hash function must be provided on the creation of the Hasher
- so there is no external dependency anymore for the root package (only stdlib)
- the cmd and the example show how easy it is to plug murmur3
it covers some of the V2 ideas (Some V2 ideas #6)
- the hashing function can be easily swapped for highway hash (but the output will still be truncated to 16 bytes)
- the hashing function can be changed depending on the context
Whole file hash is still available (with SampleSize = 0)

I also added error handling (in case the file reading triggers some error).

Feel free to share you thoughts.

The text was updated successfully, but these errors were encountered:

oliverpool · 2020-02-12T09:44:09Z

Updates:

I removed the inclusion of the size in the hash (must be handled by the user)
the hash returns a byte slice (conversion to array left to the user)

So it is now really different (and much smaller) - but the tests are still passing! (with some adaptation to re-include the input size)

kalafut · 2020-02-16T05:35:07Z

@oliverpool Cool! Glad you're getting some use out of the library. We're you hoping for a merge back into imohash, or do you think things have diverged, or may do so? I wouldn't mind cutting a v2 at some point with some of the enhancement mentioned in #6 or your fork, though I'm not sure when I'll be able to get to that. Though this may be fortuitous... as perhaps you'll discover some other things as you work on your project.

oliverpool · 2020-02-21T08:39:46Z

Now that I refactored it further and changed a lot of bits, I am not sure if merging back would be worth it.
Along the go spirit, it is better to keep it as another package (since it is a totally different API).

(your package and its license are included there, since most of the code comes from here)

kalafut · 2020-02-22T18:05:36Z

Sounds good. I'll keep an eye on your project, especially should I get around to some of the v2 mods. For now I'm going to close the issue. Thanks!

coolaj86 · 2021-01-31T20:56:47Z

@oliverpool Would you turn on Github Issues for SparseHash? I wanted to have some discussion with you, as you've done many of the things that I was considering doing, and there are a few more options I'm wondering if you'd be willing to consider - such as making the following variable:

chunk size
number of chunks
small file exception size

16k is probably enough for all common formats (jpeg, hvec, mpeg, mp4, etc), but I'd like to have some sort of reference to reasonably prove that or to have it adjustable.

Also, I'm curious as to what the two of you are working on.

kalafut · 2021-01-31T21:08:50Z

@oliverpool @coolaj86 It has been a while since we chatted about this, and I'm curious how the SparseHash updates worked out for you? I'm definitely willing to consider pulling some of the options back into imohash, perhaps as a v2 (or maybe accomplished just with functional options).

oliverpool · 2021-02-01T08:21:30Z

@coolaj86 I opened the issues there.

As you can see on the documentation, the SampleSize and the SizeThreshold can already be tuned (they are not global variables, but struct fields).

I am currently not considering allowing the number of chunks to be customizable:

adding this logic is non-trivial
if you have some changing file, where only specific part of the content change (without the beginning/middle/end and file-size being affected), then you should really hash the whole file

@kalafut I am quite happy with my fork: it doesn't have any external dependency (which allowed me to use a fork of murmur3). I also added error-handling everywhere.

I think functional options are very cool (I will not use them, because the 3 parameters of my Hasher are required and some sane defaults are provided by New)

It should be feasible to refactor your current version to use a v2 implementation in the background (I am not sure if it is worth it).

kalafut · 2021-02-01T15:00:10Z

@oliverpool Nice. At a minimum I might survey the mmh3 landscape to see if that dep should be upgraded.

kalafut · 2024-01-28T04:31:26Z

@oliverpool @coolaj86 In case either of you are still interested in this space, I'm scoping out a few backwards compatible updates, including user-configurable chunk count. See #11

kalafut closed this as completed Feb 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sparsehash (friendly fork of imohash) #7

Sparsehash (friendly fork of imohash) #7

oliverpool commented Feb 11, 2020 •

edited

Loading

oliverpool commented Feb 12, 2020 •

edited

Loading

kalafut commented Feb 16, 2020

oliverpool commented Feb 21, 2020

kalafut commented Feb 22, 2020

coolaj86 commented Jan 31, 2021 •

edited

Loading

kalafut commented Jan 31, 2021

oliverpool commented Feb 1, 2021

kalafut commented Feb 1, 2021

kalafut commented Jan 28, 2024

Sparsehash (friendly fork of imohash) #7

Sparsehash (friendly fork of imohash) #7

Comments

oliverpool commented Feb 11, 2020 • edited Loading

oliverpool commented Feb 12, 2020 • edited Loading

kalafut commented Feb 16, 2020

oliverpool commented Feb 21, 2020

kalafut commented Feb 22, 2020

coolaj86 commented Jan 31, 2021 • edited Loading

kalafut commented Jan 31, 2021

oliverpool commented Feb 1, 2021

kalafut commented Feb 1, 2021

kalafut commented Jan 28, 2024

oliverpool commented Feb 11, 2020 •

edited

Loading

oliverpool commented Feb 12, 2020 •

edited

Loading

coolaj86 commented Jan 31, 2021 •

edited

Loading