Limit deduplication to files that have not been deduplicated yet #195

Open
patrickwolf opened this issue Apr 10, 2023 · 2 comments
Labels
enhancement New feature or request

@patrickwolf

This is a feature request for the fclones dedupe feature.

Currently, each run creates new reflinks for files even if they have already been deduplicated. As a result, the estimates of how much space is wasted are also off.

Seems like there are at least two solutions:

  1. Record in the cache that a file has been deduplicated and do not attempt it again (this could also fix the storage estimate)
  2. Check the extents of each file to verify whether they have already been deduplicated, and only attempt it again if they aren't fully deduplicated yet
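
To illustrate solution 1, here is a minimal Python sketch. It assumes a plain JSON file and does not reflect the actual fclones cache format, which this issue does not describe. `DedupeCache`, `file_key`, and the key layout (device, inode, size, mtime) are hypothetical names and choices made only for this example.

```python
import json
import os
from pathlib import Path


def file_key(path):
    """Identify one version of a file by device, inode, size and mtime.

    This key layout is an assumption for the sketch, not the fclones format.
    """
    st = os.stat(path)
    return f"{st.st_dev}:{st.st_ino}:{st.st_size}:{st.st_mtime_ns}"


class DedupeCache:
    """Hypothetical on-disk record of files that were already deduplicated."""

    def __init__(self, cache_path):
        self.cache_path = Path(cache_path)
        if self.cache_path.exists():
            self.done = set(json.loads(self.cache_path.read_text()))
        else:
            self.done = set()

    def needs_dedupe(self, path):
        # A file whose key is unchanged since it was recorded is skipped.
        return file_key(path) not in self.done

    def mark_deduped(self, path):
        # Call this AFTER the reflink step, since that step may change
        # the file's metadata and therefore its key.
        self.done.add(file_key(path))

    def save(self):
        self.cache_path.write_text(json.dumps(sorted(self.done)))
```

A dedupe run would call `needs_dedupe()` before touching a file, `mark_deduped()` once the reflink succeeds, and `save()` at the end. The same record could be used to exclude already shared files from the wasted-space estimate. Note that this only tracks single files, not which file they share blocks with, so it is a simplification.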

For solution 2, here are some ways that could work:

root@ubuntu1:/ex2/_Data# filefrag -v fclones.json fclones2.json
Filesystem type is: 9123683e
File size of fclones.json is 458923952 (112042 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 32208319704..32208385239:  65536:             shared
   1:    65536..  112041: 32208385681..32208432186:  46506: 32208385240: last,shared,eof
fclones.json: 2 extents found
File size of fclones2.json is 458923952 (112042 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 32208319704..32208385239:  65536:             shared
   1:    65536..  112041: 32208385681..32208432186:  46506: 32208385240: last,shared,eof
fclones2.json: 2 extents found
root@ubuntu1:/ex2/_Data#

Ref:

The cache might be easier to start with, while checking the extents would be cooler :) and more future-proof.

Thanks for considering it

@th1000s
Contributor

th1000s commented Apr 18, 2023

Using the (existing) cache is the practical approach, since it would not require adding more low-level Linux syscalls.

@patrickwolf
Author

@pkolaczk what do you think of adding deduplication information to the cache?

pkolaczk added the enhancement label Jun 4, 2023