Discussion about number of chunks in a sample approach. #14
Comments
Hi! In your example above, you're just reading consecutive data from the beginning. Was that the intention, or just a bug in the example? I envisioned something more like:
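Something along these lines, as a purely hypothetical sketch (the snippet originally posted in this comment is not preserved here; `sample_spread`, its parameters and its defaults are illustrative, not imohash code): read `sample_size` bytes at the start, middle and end of the file instead of consecutive blocks from the beginning.

```python
# Hypothetical illustration only, not the snippet originally posted here.
# Reads sample_size bytes at the start, middle, and end of the file instead
# of consecutive blocks from the beginning.
import os

def sample_spread(path, sample_size=16 * 1024):
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size <= 3 * sample_size:
            return f.read()                     # small file: read it all
        chunks = [f.read(sample_size)]          # start
        f.seek(size // 2)
        chunks.append(f.read(sample_size))      # middle
        f.seek(size - sample_size)
        chunks.append(f.read(sample_size))      # end
        return b"".join(chunks)
```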
Yes, you're right. On my first try I used a ratio expressed as a simple integer n, with a layout like |[Read_R_bytes][Bypass_B_bytes][Chunk_R_bytes][Bypass_B_bytes][Chunk_R_bytes][Bypass_B_bytes]|. With n=4, we read a block of 4 x sample_size bytes, bypass one sample_size block, and so on. And I dislike reinventing the wheel, you know. If one wants to use multiple chunks, I think it is a better idea to choose a read/no-read ratio. I'm working with JPEGs, you know. But I think it is more accurate (in a deduplication sense) to sample file-wide: with 50%, i.e. n=2, you read half and bypass half; with n=10, you bypass (seek ahead) 9 blocks and read 1; with n=1 you force the full, non-sampled method. Notice that 1/1 = 1 = 100%, which is the full (non-sampled) method.

So, to do it well, one can use the read array and think of the last part of the array as being bypassed. I worked with the size of the file; sure, it deduplicates well. But if you also force reading the first sample_size bytes of the file and the last ones, say, you get:

- with n=0: size + first sample_size bytes of the file + a middle sample_size chunk (seek plus offset) + last sample_size bytes of the file
- with n=1: size + i + m + o (those same first/middle/last chunks) + half of the file read
- etc.

I think I will, in another piece of work, use my dataset to find the best ratio. My first try is to deliver a script with no dependencies beyond hashlib and sqlite. I read about the approach used in Bloomier filters, to build a multi-step scheme and fit the correct ratio everywhere. But I think it would be a better method to use a ratio over the whole file.
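For reference, a minimal sketch of the 1-in-n read/bypass ratio described above, assuming n means "read one sample_size block out of every n blocks", so n=1 degenerates to a full sequential read. The function name, parameters and the choice of SHA-256 are illustrative, not part of imohash or of the script mentioned above.

```python
# Minimal sketch of the 1-in-n read/bypass ratio (illustrative names only).
import hashlib
import os

def ratio_hash(path, n=10, sample_size=16 * 1024):
    """Read one sample_size block out of every n; n=1 reads the whole file."""
    h = hashlib.sha256()
    size = os.path.getsize(path)
    h.update(size.to_bytes(8, "little"))        # keep the file size in the mix
    with open(path, "rb") as f:
        while True:
            block = f.read(sample_size)         # read one block ...
            if not block:
                break
            h.update(block)
            if n > 1:
                f.seek((n - 1) * sample_size, os.SEEK_CUR)  # ... bypass n-1 blocks
    return h.hexdigest()
```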
ChatGPT was faster than a long dig through my git history ...
I've started capturing some thoughts on this (and other updates) in the Go library, which is the primary source (py-imohash follows). kalafut/imohash#11
OK, noted :) I'll try to compete in a friendly way, since we have the same goal, and I think the multi-layered Bloom filter approach is much better in terms of speed, by an order of magnitude, when we use a pre-hash deduplication.
I think the imo-based solution could compete as the middle step, between a simple Bloom-filter stack and a more general, collision-robust hash.
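A minimal sketch of that kind of cascade, under the following assumptions: a small Bloom filter over sampled hashes serves as the cheap first layer (a miss means the file is definitely not in the indexed corpus), a sampled hash serves as the middle step, and a full cryptographic hash is computed only for suspected duplicates. `TinyBloom`, `quick_sample_hash` and the helper functions are illustrative, not taken from either project.

```python
# Illustrative cascade only; none of these names come from imohash or a Bloom library.
import hashlib
import os

class TinyBloom:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""
    def __init__(self, bits=1 << 20, k=4):
        self.bits, self.k = bits, k
        self.array = bytearray(bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            d = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(d, "little") % self.bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def quick_sample_hash(path, sample_size=16 * 1024):
    # Cheap stand-in for the sampled (imo-style) middle step: file size plus
    # the first sample_size bytes. Any sampled scheme could be plugged in here.
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(sample_size)
    return hashlib.blake2b(size.to_bytes(8, "little") + head, digest_size=16).digest()

def is_probably_indexed(path, bloom):
    # Layers 1 + 2: a sampled hash that misses the Bloom filter cannot be in
    # the indexed corpus, so the expensive full hash is skipped entirely.
    return quick_sample_hash(path) in bloom

def confirm_duplicate(path, full_hash_index):
    # Layer 3: full cryptographic hash, run only for suspected duplicates.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return full_hash_index.get(h.hexdigest())  # original path, or None if unique
```

The point of the Bloom layer in such a pipeline is memory: millions of sampled hashes fit in a fixed bit array, at the cost of occasional false positives that the full hash then resolves.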
Hi there, and thank you for py-imohash.
I was working on a similar approach based on a parametric read of files (a huge list of photos, here again), in order to deduplicate them.
I was testing by reading with the same imo method, at the start, middle and end of the file, seeking the best trade-off between the number of chunks and the size of each chunk.
By the way, the size of each chunk is not a game changer, because here I read from a cheap RAID5-based SSD array and the CPU is overkill. But I found that the number of chunks is a game changer for the number of clusters found.
I wonder whether the imohash approach has been tested ... in a similar parametric way.
Imo in sampled mode is used in order to prehash a huge (say, like HUGE) list of files. Then, within the clusters of collisions, another, more formal general hash is applied.
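As a minimal sketch of that two-stage workflow (assuming `imohash.hashfile(path)` as shown in the py-imohash README, with SHA-256 standing in for the "more formal" hash; the helper name and structure are illustrative):

```python
# Sketch of the two-stage workflow: cheap sampled prehash over every file,
# then a full SHA-256 only inside clusters where the prehash collided.
import hashlib
from collections import defaultdict

import imohash  # assumes imohash.hashfile(path) as in the py-imohash README

def find_duplicates(paths):
    clusters = defaultdict(list)
    for path in paths:                       # stage 1: prehash everything
        clusters[imohash.hashfile(path)].append(path)

    duplicates = defaultdict(list)
    for group in clusters.values():
        if len(group) < 2:                   # unique prehash: no exact duplicate possible
            continue
        for path in group:                   # stage 2: confirm with a full hash
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            duplicates[h.hexdigest()].append(path)
    return {k: v for k, v in duplicates.items() if len(v) > 1}
```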
edit
I will expose a parametric reader made in Python 3.
(Yes, it was a defect in my copy/paste; I lost my commit.)
If you consider another crypto primitive, the method is the same; it just adds many chunks instead of only three.
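A rough sketch of that generalization: k evenly spaced chunks instead of three, fed to whatever hashlib primitive you prefer. Names and defaults here are illustrative, not imohash's.

```python
# Illustrative generalization: k evenly spaced chunks, any hashlib primitive.
import hashlib
import os

def many_chunk_hash(path, num_chunks=8, sample_size=16 * 1024, algo="sha256"):
    h = hashlib.new(algo)
    size = os.path.getsize(path)
    h.update(size.to_bytes(8, "little"))          # keep the file size in the mix
    with open(path, "rb") as f:
        if size <= num_chunks * sample_size:
            h.update(f.read())                    # small file: hash it fully
        else:
            step = (size - sample_size) // max(num_chunks - 1, 1)
            for i in range(num_chunks):           # chunk 0 at offset 0, last near EOF
                f.seek(i * step)
                h.update(f.read(sample_size))
    return h.hexdigest()
```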