
Discussion about number of chunks in a sample approach. #14

Closed · matDOTviguier opened this issue Jan 26, 2024 · 6 comments

matDOTviguier commented Jan 26, 2024

Hi there, and thank you for py-imohash.

I was working on a similar approach based on a parametric read of files (again, a huge list of photos), in order to deduplicate them.

I was testing reads with the same imo method, at the start, middle and end of each file, looking for the best trade-off between the number of chunks and the size of each chunk.

Incidentally, the size of each chunk is not a game changer here, because I read from a cheap RAID5-backed SSD and the CPU is more than sufficient. But I found that the number of chunks is a game changer for the number of clusters found.

I wonder whether the imohash approach has been tested in a similar parametric way. Imohash in sampled mode is used to pre-hash a huge (say, HUGE) list of files; then, within the clusters of collisions, another, more formal general-purpose hash is applied.

edit
I will post a parametric reader written in Python 3.
(Yes, it was a defect in my copy/paste; I lost my commit.)

If you consider another crypto primitive, the method stays the same; it just adds many chunks instead of only three.
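
For context, a minimal sketch of the two-stage pipeline described above (sampled pre-hash, then a full hash only inside collision clusters). It assumes py-imohash's hashfile() helper with hexdigest=True and uses hashlib.blake2b for the confirmation pass; the deduplicate/full_hash names are only illustrative.

import hashlib
from collections import defaultdict

from imohash import hashfile  # sampled pre-hash


def full_hash(path, block_size=1 << 20):
    # Confirmation hash over the whole file.
    h = hashlib.blake2b()
    with open(path, 'rb') as f:
        while chunk := f.read(block_size):
            h.update(chunk)
    return h.hexdigest()


def deduplicate(paths):
    # Stage 1: cluster files by their sampled pre-hash.
    clusters = defaultdict(list)
    for path in paths:
        clusters[hashfile(path, hexdigest=True)].append(path)

    # Stage 2: confirm duplicates with a full hash, only where the pre-hash collides.
    confirmed = defaultdict(list)
    for candidates in clusters.values():
        if len(candidates) < 2:
            continue  # unique pre-hash, no full read needed
        for path in candidates:
            confirmed[full_hash(path)].append(path)
    return [group for group in confirmed.values() if len(group) > 1]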

kalafut (Owner) commented Jan 26, 2024

Hi! I think a chunk_count parameter is a very reasonable addition. I can see use cases that would want to optimize for either low chunk count (e.g., if each round trip to a remote storage is costly) or high count (w/possibly small chunks) to scatter across the file better. Most of the work would probably be specifying exactly how the parameters translate into which byte ranges are read so it is consistent across implementations.

In your example above, you're just reading consecutive data from the beginning. Was that the intention or just a bug in the example? I envisioned something more like:

Instead of just:

|[Chunk]-----------------------[Chunk]-----------------------------[Chunk]|

one could optionally do:

|[Chunk]----[Chunk]----[Chunk]----[Chunk]----[Chunk]----[Chunk]----[Chunk]|
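
As an illustration of how such parameters might translate into byte ranges, here is a minimal sketch under one possible convention (chunk_count and chunk_size are hypothetical parameter names, and this offset rule is just an example, not an imohash specification): place the chunks at evenly spaced offsets so the first chunk starts at byte 0 and the last one ends at the last byte of the file.

def chunk_offsets(file_size, chunk_count, chunk_size):
    # Degenerate case: file too small to scatter chunks, fall back to a single chunk at 0.
    if chunk_count < 2 or file_size <= chunk_count * chunk_size:
        return [0]
    span = file_size - chunk_size
    return [round(i * span / (chunk_count - 1)) for i in range(chunk_count)]


# Example: a 1000-byte file, 7 chunks of 10 bytes
# -> [0, 165, 330, 495, 660, 825, 990]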

matDOTviguier (Author) commented Jan 26, 2024

Yes, you're right.

On a first attempt, I settled on a ratio given by a simple integer n, with R = nB and your sample size.
(I dislike RnB music, but Read&Bypass was the first name of my script.)

|[Read_R_bytes][Bypass_B_bytes][Read_R_bytes][Bypass_B_bytes][Read_R_bytes][Bypass_B_bytes]|

With n = 4, we read a block of 4 × sample_size bytes, bypass one sample_size block, and so on.
I was trying to fit the R size to my favourite crypto primitive when I found your imohash.

And I dislike reinventing the wheel, you know.

If one wants to use multiple chunks, I think it is a better idea to choose a read/no-read ratio, because 128 chunks of 128 bytes are more effective at avoiding collisions than 4 chunks of 4096 bytes.

I'm working with JPEGs, you know. But I think it is more accurate (for deduplication purposes) to sample across the whole file.
Hence a simple ratio.

With 50%, or n = 2, you read half and bypass half.
With 25%, or n = 4, you bypass three blocks and read one.
With 10%, or n = 10, you bypass nine blocks and read one.
But you can always choose another rule, such as reading only the last elements of an array.

With n = 10, you bypass (seek ahead over) nine blocks and read one.
It is a more accurate method, while also being quicker.
Why?

With n = 1 you force the full (not sampled) method.
With n = 2, you read half of your file.
With n = 3, you read 33% of your file.
With n in general, you read 100/n % of your file.

Notice that 1/1 = 1 = 100% corresponds to the full (not sampled) method.

So, to do it properly, one can use the read array and also think of reading the last part of the array that would otherwise be bypassed.

I worked with the size of the file. Sure, it deduplicates well. But if you also force reading the first bytes of the file and the last ones, say with a sample size, you get 👍

size + the first sample_size bytes of the file + a middle seek reading sample_size bytes + the last sample_size bytes of the file, with n = 0

size + i + m + o + a half read, with n = 1

etc ...
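
A minimal sketch of that n = 0 key, assuming a sample_size parameter and concatenating the file size with the first, middle and last samples (the helper name and the 16 KiB default are only illustrative; imohash itself hashes its samples with MurmurHash3 and encodes the file size into the result rather than concatenating raw bytes).

import os


def imo_style_key(path, sample_size=16 * 1024):
    # Prehash key: file size + first, middle and last sample_size bytes.
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        if size <= 3 * sample_size:
            # Small file: just use the whole content.
            return size.to_bytes(8, 'little') + f.read()
        first = f.read(sample_size)
        f.seek(size // 2)
        middle = f.read(sample_size)
        f.seek(-sample_size, 2)  # sample_size bytes before the end
        last = f.read(sample_size)
    return size.to_bytes(8, 'little') + first + middle + last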

I think I will, for another project, use my dataset to find the best ratio.
I use imohash as a first pass to deduplicate, while hashlib.blake2b() does the hard work behind it.

My first goal is to deliver a script with no dependencies beyond hashlib and sqlite.
It is out of scope here, but I think "I" can "beat" imohash in terms of speed and collision ratio with a multi-level ratio reading using the same algorithm (mmh3 is another primitive ready to compete).

I have read about the approach used in Bloomier filters, in order to build a multi-step scheme and fit the correct ratio at each level.
I will be proud if I can compete with an improved imohash that has a chunk count.

But I still think a ratio over the whole file would be the better method.
(Be prepared to trade off reading against looping when iterating over chunk counts.)

matDOTviguier (Author) commented:

def read_file_in_blocks(file_name, read_size, bypass_size):
    """Read the file as alternating [read R bytes][bypass B bytes] blocks
    and return the concatenation of the bytes that were read."""
    result_bytes = b''
    with open(file_name, 'rb') as file:
        while True:
            # Read R bytes
            data = file.read(read_size)

            # An empty read means we have reached the end of the file
            if not data:
                break

            # Add the read data to the result byte object
            result_bytes += data

            # Bypass (seek ahead over) B bytes from the current position
            file.seek(bypass_size, 1)

    return result_bytes

ChatGPT was faster than a deep dig through my git repos ...
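
A quick usage sketch, assuming hashlib.blake2b as the pre-hash over the sampled bytes, as mentioned above ('photo.jpg' and the 16 KiB sample size are placeholders):

import hashlib

# n = 4: read one sample_size block, bypass three, so roughly 25% of the file is read
sample_size = 16 * 1024
sampled = read_file_in_blocks('photo.jpg', read_size=sample_size,
                              bypass_size=3 * sample_size)
prehash = hashlib.blake2b(sampled).hexdigest()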

kalafut (Owner) commented Jan 28, 2024

I've started capturing some thoughts on this (and other updates) in the Go library, which is the primary source (py-imohash follows). kalafut/imohash#11

matDOTviguier (Author) commented:

OK, so it's settled :)

I'll try to compete in a friendly way, since we have the same goal, and I think the multi-layered Bloom filter approach is faster by an order of magnitude when used for pre-hash deduplication.

matDOTviguier (Author) commented:

I think the imo-based solution could compete in the middle step, between a simple Bloom-filter stack and a more general, collision-robust hash.
