
Discussion about number of chunks in a sample approach. #14

Closed · matDOTviguier opened this issue Jan 26, 2024 · 6 comments

matDOTviguier commented Jan 26, 2024

Hi there, and thank you for py-imohash.

I was working on a similar approach based on a parametric read of files (again, a huge list of photos), in order to deduplicate them.

I was testing reads with the same imo method, at the start, middle and end of each file, looking for the best trade-off between the number of chunks and the size of each chunk.

Incidentally, the size of each chunk is not a game changer here, because I read from a cheap RAID5-backed SSD and the CPU is more than sufficient. But I found that the number of chunks is a game changer for the number of clusters found.

I wonder whether the imohash approach has been tested in a similar parametric way. Imohash in sampled mode is used to pre-hash a huge (say, HUGE) list of files; then, within the clusters of collisions, another, more formal general-purpose hash is applied.

edit
I will post a parametric reader written in Python 3.
(Yes, it was a defect in my copy/paste; I lost my commit.)

If you consider another crypto primitive, the method stays the same; it just adds many chunks instead of only three.
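
For context, a minimal sketch of the two-stage pipeline described above (sampled pre-hash, then a full hash only inside collision clusters). It assumes py-imohash's hashfile() helper with hexdigest=True and uses hashlib.blake2b for the confirmation pass; the deduplicate/full_hash names are only illustrative.

import hashlib
from collections import defaultdict

from imohash import hashfile  # sampled pre-hash


def full_hash(path, block_size=1 << 20):
    # Confirmation hash over the whole file.
    h = hashlib.blake2b()
    with open(path, 'rb') as f:
        while chunk := f.read(block_size):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    # Stage 1: cluster files by their sampled pre-hash.
    clusters = defaultdict(list)
    for path in paths:
        clusters[hashfile(path, hexdigest=True)].append(path)

    # Stage 2: confirm duplicates with a full hash, only where the pre-hash collides.
    confirmed = defaultdict(list)
    for candidates in clusters.values():
        if len(candidates) < 2:
            continue  # unique pre-hash, no full read needed
        for path in candidates:
            confirmed[full_hash(path)].append(path)
    return [group for group in confirmed.values() if len(group) > 1]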

kalafut (Owner) commented Jan 26, 2024

Hi! I think a chunk_count parameter is a very reasonable addition. I can see use cases that would want to optimize for either low chunk count (e.g., if each round trip to a remote storage is costly) or high count (w/possibly small chunks) to scatter across the file better. Most of the work would probably be specifying exactly how the parameters translate into which byte ranges are read so it is consistent across implementations.

In your example above, you're just reading consecutive data from the beginning. Was that the intention or just a bug in the example? I envisioned something more like:

Instead of just:

|[Chunk]-----------------------[Chunk]-----------------------------[Chunk]|

one could optionally do:

|[Chunk]----[Chunk]----[Chunk]----[Chunk]----[Chunk]----[Chunk]----[Chunk]|
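
As an illustration of how such parameters might translate into byte ranges, here is a minimal sketch under one possible convention (chunk_count and chunk_size are hypothetical parameter names, and this offset rule is just an example, not an imohash specification): place the chunks at evenly spaced offsets so the first chunk starts at byte 0 and the last one ends at the last byte of the file.

def chunk_offsets(file_size, chunk_count, chunk_size):
    # Degenerate case: file too small to scatter chunks, fall back to a single chunk at 0.
    if chunk_count < 2 or file_size <= chunk_count * chunk_size:
        return [0]
    span = file_size - chunk_size
    return [round(i * span / (chunk_count - 1)) for i in range(chunk_count)]


# Example: a 1000-byte file, 7 chunks of 10 bytes
# -> [0, 165, 330, 495, 660, 825, 990]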

matDOTviguier (Author) commented Jan 26, 2024

Yes, you're right.

On a first attempt, I settled on a ratio given by a simple integer n, with R = nB and your sample size.
(I dislike RnB music, but Read&Bypass was the first name of my script.)

|[Read_R_bytes][Bypass_B_bytes][Read_R_bytes][Bypass_B_bytes][Read_R_bytes][Bypass_B_bytes]|

With n = 4, we read a block of 4 × sample_size bytes, bypass one sample_size block, and so on.
I was trying to fit the R size to my favourite crypto primitive when I found your imohash.

And I dislike reinventing the wheel, you know.

If one wants to use multiple chunks, I think it is a better idea to choose a read/no-read ratio, because 128 chunks of 128 bytes are more effective at avoiding collisions than 4 chunks of 4096 bytes.

I'm working with JPEGs, you know. But I think it is more accurate (for deduplication purposes) to sample across the whole file.
Hence a simple ratio.

With 50%, or n = 2, you read half and bypass half.
With 25%, or n = 4, you bypass three blocks and read one.
With 10%, or n = 10, you bypass nine blocks and read one.
But you can always choose another rule, such as reading only the last elements of an array.

With n = 10, you bypass (seek ahead over) nine blocks and read one.
It is a more accurate method, while also being quicker.
Why?

With n = 1 you force the full (not sampled) method.
With n = 2, you read half of your file.
With n = 3, you read 33% of your file.
With n in general, you read 100/n % of your file.

Notice that 1/1 = 1 = 100% corresponds to the full (not sampled) method.

So, to do it properly, one can use the read array and also think of reading the last part of the array that would otherwise be bypassed.

I worked with the size of the file. Sure, it deduplicates well. But if you also force reading the first bytes of the file and the last ones, say with a sample size, you get 👍

size + the first sample_size bytes of the file + a middle seek reading sample_size bytes + the last sample_size bytes of the file, with n = 0

size + i + m + o + a half read, with n = 1

etc ...
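
A minimal sketch of that n = 0 key, assuming a sample_size parameter and concatenating the file size with the first, middle and last samples (the helper name and the 16 KiB default are only illustrative; imohash itself hashes its samples with MurmurHash3 and encodes the file size into the result rather than concatenating raw bytes).

import os


def imo_style_key(path, sample_size=16 * 1024):
    # Prehash key: file size + first, middle and last sample_size bytes.
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        if size <= 3 * sample_size:
            # Small file: just use the whole content.
            return size.to_bytes(8, 'little') + f.read()
        first = f.read(sample_size)
        f.seek(size // 2)
        middle = f.read(sample_size)
        f.seek(-sample_size, 2)  # sample_size bytes before the end
        last = f.read(sample_size)
    return size.to_bytes(8, 'little') + first + middle + last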

I think I will, for another project, use my dataset to find the best ratio.
I use imohash as a first pass to deduplicate, while hashlib.blake2b() does the hard work behind it.

My first goal is to deliver a script with no dependencies beyond hashlib and sqlite.
It is out of scope here, but I think "I" can "beat" imohash in terms of speed and collision ratio with a multi-level ratio reading using the same algorithm (mmh3 is another primitive ready to compete).

I have read about the approach used in Bloomier filters, in order to build a multi-step scheme and fit the correct ratio at each level.
I will be proud if I can compete with an improved imohash that has a chunk count.

But I still think a ratio over the whole file would be the better method.
(Be prepared to trade off reading against looping when iterating over chunk counts.)

matDOTviguier (Author) commented:

def read_file_in_blocks(file_name, read_size, bypass_size):
    """Read the file as alternating [read R bytes][bypass B bytes] blocks
    and return the concatenation of the bytes that were read."""
    result_bytes = b''
    with open(file_name, 'rb') as file:
        while True:
            # Read R bytes
            data = file.read(read_size)

            # An empty read means we have reached the end of the file
            if not data:
                break

            # Add the read data to the result byte object
            result_bytes += data

            # Bypass (seek ahead over) B bytes from the current position
            file.seek(bypass_size, 1)

    return result_bytes

ChatGPT was faster than a deep dig through my git repos ...
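
A quick usage sketch, assuming hashlib.blake2b as the pre-hash over the sampled bytes, as mentioned above ('photo.jpg' and the 16 KiB sample size are placeholders):

import hashlib

# n = 4: read one sample_size block, bypass three, so roughly 25% of the file is read
sample_size = 16 * 1024
sampled = read_file_in_blocks('photo.jpg', read_size=sample_size,
                              bypass_size=3 * sample_size)
prehash = hashlib.blake2b(sampled).hexdigest()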

kalafut (Owner) commented Jan 28, 2024

I've started capturing some thoughts on this (and other updates) in the Go library, which is the primary source (py-imohash follows). kalafut/imohash#11

matDOTviguier (Author) commented:

OK, so it's settled :)

I'll try to compete in a friendly way, since we have the same goal, and I think the multi-layered Bloom filter approach is faster by an order of magnitude when used for pre-hash deduplication.

matDOTviguier (Author) commented:

I think the imo-based solution could compete in the middle step, between a simple Bloom-filter stack and a more general, collision-robust hash.
