I've been digging into how Apple got Time Machine to work on OS X. I've been using it for two weeks (as opposed to my old "rsync the ~/ directory" solution, which I stuck with because I was too lazy to implement a versioned one) and find it works as advertised so far. My one concern is how Time Machine handles files that change often, but only in very small ways. First, it's worth pointing out that Time Machine uses what Linux users call hard links:
Backup systems typically make a full clone backup, and then copy only the differential or incremental changes. Differential backups capture everything that has changed since the last full backup, while incremental backups only copy what's changed since the last partial backup. Full backups obviously consume too much disk space to do every hour, but differential or incremental backups don't capture the whole picture in a single shot. Time Machine appears to do both: capture full backups every hour without taking up all the disk space this would require. How does it do this?
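As a rough illustration of the distinction drawn above, here is a minimal Python sketch of which files each strategy would pick up. The function name, the path-to-mtime dictionary, and the use of plain numbers as timestamps are all assumptions made for the example; real backup tools track changes in their own ways.

```python
def files_to_copy(mtimes, last_full, last_backup, mode):
    """Return the paths a backup run would copy (hypothetical helper).

    mtimes:      dict mapping path -> modification time (any comparable number)
    last_full:   time of the last full backup
    last_backup: time of the most recent backup of any kind
    mode:        "differential" or "incremental"
    """
    if mode == "differential":
        cutoff = last_full      # everything changed since the last full backup
    elif mode == "incremental":
        cutoff = last_backup    # only what changed since the last partial backup
    else:
        raise ValueError("mode must be 'differential' or 'incremental'")
    return sorted(path for path, mtime in mtimes.items() if mtime > cutoff)


# Full backup at t=3, a partial backup at t=7, files modified at t=1, 5 and 9.
example = {"a.txt": 1, "b.txt": 5, "c.txt": 9}
differential = files_to_copy(example, 3, 7, "differential")
incremental = files_to_copy(example, 3, 7, "incremental")
```

In this toy case the differential run copies b.txt and c.txt, while the incremental run copies only c.txt, which is why neither one is a complete picture on its own.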
An intelligent backup system using differential backups would also have to parse all the various backups in order to present the user with a composite view of the files that can be restored at any given time. The user might want the version of a file from two hours ago, or from two weeks ago. Accommodating this kind of flexibility typically requires managing a complex database of backup file transactions. If that metadata database is lost, restoring files from the backups becomes far more complex, and requires an arduous and lengthy rebuilding of the database.
To solve both problems, Time Machine does something new and different that actually required Apple to make changes to the underlying Mac file system, HFS+. The new change is referred to as multi-links, which are similar to the "hard links" familiar to Unix users and potentially available when using NTFS on Windows. Hard links differ from "soft links" (also known as symbolic links), which simply act as placeholders pointing to another file.
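The difference between hard and soft links is easy to see on any Unix-like system. The sketch below is plain Python using ordinary file-level hard links; it does not touch Apple's multi-links, and the function and file names are made up for the example.

```python
import os


def compare_links(workdir):
    """Create one hard link and one symbolic link to the same file, then
    delete the original name and see which link still works."""
    original = os.path.join(workdir, "original.txt")
    hard = os.path.join(workdir, "hard.txt")
    soft = os.path.join(workdir, "soft.txt")

    with open(original, "w") as f:
        f.write("hello")

    os.link(original, hard)      # second name for the same data on disk
    os.symlink(original, soft)   # placeholder that points at a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink

    os.remove(original)          # drop the first name only

    with open(hard) as f:
        hard_content = f.read()  # data survives under the hard link
    soft_resolves = os.path.exists(soft)  # follows the symlink to a missing path

    return same_inode, link_count, hard_content, soft_resolves
```

Run against an empty scratch directory, the hard link keeps the data readable after the original name is removed, while the symbolic link is left dangling. That property is what lets many backup folders share one copy of an unchanged file.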
Now let's say I've got a large file, such as my iTunes library XML file. Right now it's about 2MB and growing. If I add one track to my library, that file changes, and the next backup that runs copies the file over, in its entirety, to the backup drive. A 2KB change requires 2MB of space. I understand that storing only the changes would mean significant resources are needed to restore a file that had been modified dozens of times. But you're not restoring files daily; you're eating up more drive space daily.
I also know that 2MB aggregate over a year isn't going to kill you. But the same would happen to much larger files. Granted, most people aren't going to be editing large files (read: multimedia), and if they were, one would hope that they'd exclude that directory and back up manually, or turn Time Machine off while doing such work. (Don't forget to exclude things like your BitTorrent directory, which would contain constantly changing large files.)
But what if you use encrypted disk images to secure your data? Each DMG file is treated as one big file by Time Machine, which means that if you have a DMG file that holds 100 MB of data and you add a 1 MB image to it, Time Machine will recopy the modified 101 MB DMG file in its entirety and leave the 100 MB DMG file available in all the old backups. It would be great if Time Machine could interact with DMGs on a more advanced level, rather than treating them like simple files. If you only have one or two DMG files with sensitive information, this isn't so bad. But if you use FileVault, your entire home directory is one big fat encrypted DMG.
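To put rough numbers on the iTunes and disk image examples above, here is a back-of-the-envelope calculation in Python. The change frequencies (one iTunes change per day, one DMG change per week) are my own assumptions, the file sizes are held constant, and the "delta" figure is an idealized one that stores only the changed bytes, which an encrypted image may not allow in practice.

```python
def yearly_backup_cost(file_size, change_size, changes_per_year):
    """Return (whole_file_bytes, delta_bytes) consumed over a year.

    whole_file_bytes: every change triggers a copy of the entire file.
    delta_bytes:      idealized scheme that stores only the changed bytes.
    """
    whole_file = file_size * changes_per_year
    delta_only = change_size * changes_per_year
    return whole_file, delta_only


MB = 1_000_000
KB = 1_000

# 2 MB library file, 2 KB change, assumed once a day.
itunes = yearly_backup_cost(2 * MB, 2 * KB, 365)

# 101 MB disk image, 1 MB change, assumed once a week.
dmg = yearly_backup_cost(101 * MB, 1 * MB, 52)
```

Under those assumptions the iTunes file costs about 730 MB a year in whole-file copies against less than 1 MB of actual changes, and the disk image costs roughly 5.25 GB against 52 MB.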
Certainly worth noting is that the Linux camp has been using things like rsync and rdiff to back things up for years. They never did have a fancy GUI to get the data back, which is the big selling point of Time Machine, but now one is being worked on. It is a differential backup solution with versioning, built on LVM and rdiff, with a neat GUI tool for browsing previous versions:
I finally got around to completing this--it's been a busy week. Anyway, I did decide to add revision previewing to this--as such it can preview past revisions of any kind of file which is supported by Konqueror (plaintext documents, word documents, videos, images, etc). It can also restore from any of these previous revisions. Unfortunately, this is not searchable at present--you need to know the name and path of the file you're looking for in order to preview or restore.
This is really neat! (I've also locally mirrored the TARDIS demo.)
Mike Rubel came up with this one a long time ago
You are right, Linux (and before that Unix) has been doing this for a really long time. The seminal article on this was written years ago by Mike Rubel. See http://www.mikerubel.org/computers/rsync_snapshots/
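For anyone curious what the hard-link snapshot idea looks like in practice, here is a minimal Python sketch. It is not Mike Rubel's script and not how Time Machine is implemented; the function name and arguments are hypothetical, and it compares file contents byte for byte where rsync would normally check size and modification time.

```python
import filecmp
import os
import shutil


def make_snapshot(source, previous, dest):
    """Build dest so it looks like a full copy of source.

    Files identical to the ones in the previous snapshot are hard-linked
    to it and take no extra space; new or changed files are copied.
    Pass previous=None for the first snapshot.
    Returns (linked, copied) counts.
    """
    linked = copied = 0
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(dest, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            new = os.path.normpath(os.path.join(dest, rel, name))
            old = os.path.normpath(os.path.join(previous, rel, name)) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)        # unchanged: share the existing data
                linked += 1
            else:
                shutil.copy2(src, new)   # new or changed: store a fresh copy
                copied += 1
    return linked, copied
```

Each snapshot directory can be browsed as if it were a full backup, and deleting an old snapshot only frees the data that no newer snapshot still links to. The catch is the one raised in the post: a file that changes at all is stored again in full.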