Feedback, showdown against 3 other tools #251

Open
Sanmayce opened this issue Nov 19, 2023 · 4 comments

Comments

Sanmayce commented Nov 19, 2023

Hi @pkolaczk
could you explain why your superfast tool reports a different count than the other tools?
All contenders are on GitHub and downloadable.

This scriptlet (attached) shows the differences between 'rmlint' and 'DIFFTREE' on the latest Linux kernel tree.
Bottom line: the first one gives 26+391=417 duplicates, whereas my script gives 434. What causes the discrepancy? My email: [email protected]

First, it is good to run more such tools, the more the merrier.
The tool below scans only files of 1 byte or larger, while there are 26 files of 0 bytes (see further below), which means 25 more duplicates.
Adding those to the 409 it reports gives 409+25=434 duplicates, so DIFFTREE looks closer to the right count.
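
The "26 empty files means 25 duplicates" step treats every file beyond the first in a group as a duplicate. A minimal sketch for checking that number independently with plain `find`; the function name `count_empty_dups` is made up for illustration, and file names containing newlines would be miscounted:

```shell
# Count zero-byte regular files under a directory and print how many of
# them are redundant copies, i.e. all but one.
count_empty_dups() {
    empty_count=$(find "$1" -type f -size 0 | wc -l)
    if [ "$empty_count" -gt 1 ]; then
        echo $((empty_count - 1))
    else
        echo 0
    fi
}
```

Usage would be along the lines of `count_empty_dups TreeUnderDeduplication/`.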

[root@djudjeto2 tree_bench]# echo 3 > /proc/sys/vm/drop_caches
[root@djudjeto2 tree_bench]# ./linux_czkawka_cli dup -m 1 -d TreeUnderDeduplication/
Results of searching ["/home/sanmayce/WorkTemp/tree_bench/TreeUnderDeduplication"] with excluded directories [] and excluded items []
-------------------------------------------------Files with same hashes-------------------------------------------------
Found 409 duplicated files which in 274 groups which takes 2.06 MiB.

Testdataset: linux-6.6.1 tree (untarred archive to TreeUnderDeduplication/)
OS: Fedora release 38 (Thirty Eight) x86_64
Host: 20LRS04700 ThinkPad 11e 5th Gen
Kernel: 6.2.12-300.fc38.x86_64
CPU: Intel Celeron N4100 (4) @ 2.400GHz
SSD: nvme Transcend 1TB bufferless
Filesystem: ext4

+---------------------------+-------------------------+------------------+------------------+
| Deduplicator              |                    Time | Memory Footprint | Duplicates Found |
+---------------------------+-------------------------+------------------+------------------+
| fclones v.0.34.0          |                  6.60 s |        26,384 KB |              384 |
| linux_czkawka_cli v.6.1.0 |                  7.69 s |       118,448 KB |              434 |
| rmlint v.2.10.1           |                 11.95 s |        61,952 KB |              391 |
| DIFFTREE r.4++            | 1*60*60+49*60+51=6591 s |        88,768 KB |              434 |
+---------------------------+-------------------------+------------------+------------------+
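
Since the tools in the table disagree, an independent baseline may help arbitrate. The sketch below is not how any of the listed tools work; it simply hashes every regular file with `sha256sum` and counts each file beyond the first per hash as one duplicate. The function name `count_dups_by_hash` is hypothetical. It includes hidden and empty files, skips symlinks, and counts hard links as duplicates, so it will not match tools that apply different rules:

```shell
# Print the number of redundant copies under a directory:
# for every group of files sharing a SHA-256 hash, count group size minus one.
count_dups_by_hash() {
    find "$1" -type f -exec sha256sum {} + |
        awk '{ seen[$1]++ }
             END {
                 dups = 0
                 for (h in seen) if (seen[h] > 1) dups += seen[h] - 1
                 print dups + 0
             }'
}
```

Usage would be along the lines of `count_dups_by_hash TreeUnderDeduplication/`. Hashing everything is slow compared with tools that first group by size, so this is only meant as a cross-check.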

The actual scriptlet in use:

# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./DIFFTREE_BLAKE3_r4++.sh TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v rmlint TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./linux_czkawka_cli dup -m 1 -d TreeUnderDeduplication/
# echo 3 > /proc/sys/vm/drop_caches
# /bin/time -v ./fclones-0.34.0-linux-musl-x86_64 group TreeUnderDeduplication/

The full script 'SpeedShowdown.sh' is attached.
SpeedShowdown.sh.tar.gz

pkolaczk (Owner) commented Nov 19, 2023

fclones doesn't scan hidden files by default. You must add the --hidden flag to make it equivalent. Other things to check are the settings for following links and for max/min file sizes. Different tools have different defaults, so it is good to set them explicitly.
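
To see how much each of these defaults could matter for a given tree, the affected files can be counted directly. This is a sketch using only `find`, not fclones itself; the function names are made up, and it assumes "hidden" means a file whose own name, or the name of a directory above it inside the scanned tree, starts with a dot:

```shell
# Count regular files that are hidden relative to the scan root.
# Running find from inside the root keeps dots in the root's own path
# from matching the pattern.
count_hidden_files() {
    (cd "$1" && find . -type f -path '*/.*' | wc -l)
}

# Count symbolic links, which tools may follow, skip, or report differently.
count_symlinks() {
    (cd "$1" && find . -type l | wc -l)
}
```

Usage would be along the lines of `count_hidden_files TreeUnderDeduplication/` and `count_symlinks TreeUnderDeduplication/`.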

Sanmayce (Author) commented Nov 19, 2023

Oh, after adding --hidden -s 0 the duplicate count is 433, still 1 short. Shouldn't it be 434?

pkolaczk (Owner) commented

Maybe one is a hard link? Hard links are not considered duplicates by default, unless you tell it to treat them differently.

Sanmayce (Author) commented

> Maybe one is a hard link? Hard links are not considered duplicates by default, unless you tell it to treat them differently.

Not sure. As far as I know, this is how hard links are found, no?

$ find -type f -links +1

Running the above in the root folder of the kernel 6.6.2 tree resulted in an empty list.
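
As a cross-check of the empty result, hard-linked files can also be listed together with their device and inode numbers, so that paths sharing an inode end up on adjacent lines. A sketch assuming GNU find (for `-printf`); the function name `list_hardlinks` is made up for illustration:

```shell
# List regular files with more than one link as "device:inode linkcount path",
# sorted so that paths pointing at the same inode are grouped together.
list_hardlinks() {
    find "$1" -type f -links +1 -printf '%D:%i %n %p\n' | sort
}
```

Usage would be along the lines of `list_hardlinks .` from the root of the tree; no output means no file in the tree has a link count above 1.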
