ddar vs. rdiff-backup

Advantages

  • rdiff-backup de-duplication only works between versions of a file with the same name, since that is the only case in which it calls librsync to generate a delta. If you rename a large file, rdiff-backup won’t notice and will upload it again. If you have a single large file stored in two different places, rdiff-backup will store it twice. This is a limitation of the rsync algorithm. ddar correctly detects these cases and stores the underlying data only once; a sketch of the idea follows this list.

  • rdiff-backup only de-duplicates against the immediately preceding backup, so if a file disappears and later reappears between backups, rdiff-backup cannot detect and optimise this case. ddar makes no distinction and de-duplicates against all previously stored data.

  • rdiff-backup is inefficient at restoring anything but the last backup, since it must run backwards in time and resolve each reverse delta to reconstruct older data. So the further back into the past you want to go, the less efficient it is, as all relevant newer data needs to be processed first.

  • On a similar note, with rdiff-backup you cannot remove arbitrary backups, since each backup depends on all newer backups. This makes it impossible to keep backups with decreasing granularity as you go back in time. With ddar, each backup can be created and deleted independently and efficiently. This means the backups don’t even need to be related: for example, you can store backups of multiple machines in a single archive, with ddar de-duplicating across them all.

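To see why ddar can do all of this, here is a minimal sketch of the content-addressed storage idea behind these advantages. It is illustrative only, not ddar’s actual on-disk format: chunks here are fixed-size and named by their SHA-256 hash (ddar’s real chunking and hashing differ), and store_file is a hypothetical helper invented for this example.

    #!/usr/bin/python
    # Toy content-addressed store: illustrative only, not ddar's real format.
    # Every chunk is written under the name of its own hash, so identical
    # data is stored exactly once no matter which file or backup it came from.

    import hashlib, os

    CHUNK_SIZE = 64 * 1024  # fixed size for brevity; real chunkers are smarter

    def store_file(store_dir, path):
        """Store one file; return the list of chunk hashes that rebuild it."""
        if not os.path.isdir(store_dir):
            os.makedirs(store_dir)
        recipe = []
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                h = hashlib.sha256(chunk).hexdigest()
                obj = os.path.join(store_dir, h)
                if not os.path.exists(obj):  # duplicate data costs nothing
                    with open(obj, 'wb') as out:
                        out.write(chunk)
                recipe.append(h)
        return recipe

Because a chunk’s name depends only on its content, a renamed or copied file resolves to objects that already exist, and each backup is just an independent list of chunk names, so any one backup can be deleted without touching the others.
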
Disadvantages

  • rdiff-backup maintains the most recent backup as the original flat files, so the rdiff-backup tool itself is only needed to retrieve older backups. ddar archives are opaque, and you need ddar to extract from them, together with any tools you used when storing the archive, such as tar and gzip. Having said that, the ddar format is straightforward. Here’s a short Python script that will do an extraction with no dependencies apart from Python itself:

    #!/usr/bin/python
    # Usage: $0 archive-path member-name output-filename

    import binascii, sqlite3, sys

    def extract(archive, member, filename):
        # An archive is a directory: a sqlite database ('db') maps each member
        # to an ordered list of chunk hashes, and the chunks themselves live
        # under objects/<first two hex digits of hash>/<remaining hex digits>.
        with open(filename, 'wb') as out_f:
            db = sqlite3.connect('%s/db' % archive)
            cursor = db.execute('''select hash from chunk
                                   where member_id = (select id from member where name=?)
                                   order by offset''', (member,))
            for row in cursor:
                h = binascii.hexlify(row[0]).decode('ascii')
                bucket, obj = h[:2], h[2:]
                # Concatenating the chunks in offset order rebuilds the member.
                with open('%s/objects/%s/%s' % (archive, bucket, obj),
                          'rb') as chunk_f:
                    out_f.write(chunk_f.read())
            db.close()

    if __name__ == '__main__':
        extract(*sys.argv[1:4])

  • Thanks to rdiff-backup’s flat storage as described above, you can extract any file from the most recent backup instantly. Using ddar together with tar and gzip, the system would have to read through the entire archive to extract a single file. This situation could improve when the new Unix archive format is finished; the ddar format will not need to change to support it, since it can already seek on reads efficiently (a sketch of such a seeking read follows below).

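As an illustration of that seeking, here is a minimal sketch that reads an arbitrary byte range of a member, under the same archive-layout assumptions as the extraction script above. read_range is a hypothetical helper, not a ddar command or API, and it infers each chunk’s length from the size of its object file.

    #!/usr/bin/python
    # Sketch only: ranged reads against the archive layout assumed above.
    # The chunk table's offset column tells us where each chunk starts, so
    # we can open just the objects covering [start, start+length) and seek
    # within the first one, instead of reading the whole member.

    import binascii, os, sqlite3

    def read_range(archive, member, start, length):
        db = sqlite3.connect('%s/db' % archive)
        cursor = db.execute('''select hash, offset from chunk
                               where member_id = (select id from member where name=?)
                               order by offset''', (member,))
        pieces = []
        remaining = length
        for blob, offset in cursor:
            if remaining <= 0:
                break
            h = binascii.hexlify(blob).decode('ascii')
            obj = '%s/objects/%s/%s' % (archive, h[:2], h[2:])
            size = os.path.getsize(obj)
            if offset + size <= start:
                continue  # this chunk ends before the requested range begins
            with open(obj, 'rb') as chunk_f:
                chunk_f.seek(max(0, start - offset))
                data = chunk_f.read(remaining)
                pieces.append(data)
                remaining -= len(data)
        db.close()
        return b''.join(pieces)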