Validating backups and ensuring file integrity

I recently started planning for my upgrade from Snow Leopard to Lion (those are versions of the Macintosh operating system). I decided that I had enough cruft on my system after several in situ upgrades and years of use to warrant a so-called "wipe and load" — that is, I wanted to format my hard drive, do a fresh install of the operating system, then copy my applications and data back onto the system. That got me thinking about data integrity and the need to validate my backups. How would I know that my image files would be undamaged after being copied to an external hard drive and back again? Hard drives are very reliable these days, so problems were unlikely, but I couldn't take the chance of an undetected corruption somewhere along the way. That issue has bitten me once before, and I don't plan to let it happen again.

The answer


The obvious solution in my mind was cryptographic hashing. If I fingerprint (hash) each file and append the result to a master fingerprint file, I can verify with a high degree of confidence whether any of the files were corrupted during the copy from one drive to another, or corrupted at rest on a given drive. This technique is in common use within the computer science field and has stood the test of time. For example, Oracle's ZFS filesystem uses hashes to detect all manner of trouble, including bit rot. Hashes allow web sites to validate your password without storing a copy of it, and they are also used for digitally signing documents and files, among myriad other uses. There are many different hashing algorithms and many tools which implement them. I started searching around for tools which would make the process easy.
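To make the idea concrete, here is a minimal sketch using shasum, a tool I'll come back to below (the file name is just a placeholder):

$ shasum -a 256 IMG_0001.CR2 > IMG_0001.CR2.sha256   # record the file's fingerprint
$ shasum -a 256 -c IMG_0001.CR2.sha256               # later, after copying: re-hash and compare
IMG_0001.CR2: OK

If so much as a single bit of the file had changed in the meantime, that last line would say FAILED instead of OK.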

During my search, I came across a great resource from the well-known and respected American Society of Media Photographers and the Library of Congress called dpBestFlow. It has a section on data validation which explains the subject more thoroughly than I can. Also, Steve Friedl has put together "An Illustrated Guide to Cryptographic Hashes" with diagrams and examples.

The tools


Fair warning: the concepts I'm going to talk about in this post are universal no matter what operating system you use, but the implementation details are going to be Mac/Unix/Linux-specific and fairly geeky. That's not to say that you can't apply the principles I will explain to a Windows computer; you'll just have to do a little digging to find the tools. If you stay with me, I'm confident you'll learn something you can use no matter your operating system or geek quotient. Also, when I use the term "directory" in this post, please understand that some people use the term "folder" instead.

The first tool that looked promising was ChronoSync from Econ Technologies. This is a utility for synchronizing directories, either by copying one directory to another (unidirectional) or by synchronizing the changes in two directories with each other so that changes in one directory are replicated to the other and vice versa (bidirectional). One of the features the user can enable is a byte-by-byte comparison of the original files and the copies to ensure that a copy is not corrupt. This was similar to what I was looking for. However, at the time I found this utility, I had already duplicated my internal hard drive to an external and was just looking for a way to verify the integrity of the duplicate. I will probably buy ChronoSync for future use, but it didn't appear to be what I was looking for at that particular moment.

The next tool I found was a Java utility called Compare Folders by Keith Fenske. Since it's a Java program, it will work on Mac, Windows, Linux, Solaris, or just about any other platform. It sounded like exactly what I was looking for and, indeed, it was. However, it was slow (a common problem with Java apps) and crashed on me at least once while I was comparing a directory with 86,000+ files (in many subdirectories). It's a handy utility that I will keep around for smaller jobs.

I have MacPorts installed, so I decided to do a little searching there. I came up with a utility called cfv that creates hash files and/or verifies files against hash files. It seemed rather handy and reasonably speedy. Unfortunately, it reported that a lot of files were different when in fact they were not. Mind you, it did catch some that were indeed different but the rate of false positives was too high for my liking. I initially wrote it off for this reason, but I'll come back to that below.

The answer, or at least an answer, was under my nose the whole time. A program (actually a Perl script) called shasum comes with OS X. It's a command-line tool that you use in Terminal.app, so it's not the most user-friendly choice for people who are only comfortable with GUIs. If, however, you are a CLI (command line interface) junkie like I am, it's a good choice for an integrity-checking tool. One advantage it has over cfv is the ability to use hashes of various lengths, up to 512 bits. I would venture to say that only the truly paranoid need hashes that long for verifying file integrity, but the option is there if you want it; cfv tops out at sha1 (along with md5 and CRC32). Another of shasum's advantages is the relative ease with which one can fine-tune the set of files it works on. Now, if you're not a Unix geek, the following example won't make much sense to you, but if you are one, check this out:

$ find . \( \( -name .fseventsd -o -name .Spotlight-V100 -o -name .TemporaryItems -o -name .FBCLockFolder -o -name \*.lrplugin -o -name \*.lrdata \) -prune -o -type d \) -exec find {} \( -name .BridgeCache -o -name .BridgeCacheT -o -name .DS_Store -o -name .FBCSemaphoreFile -o -name .FBCIndex -o -name .localized \) -prune -o -type f -depth 1 -print0 \; | xargs -0 shasum -a 512 -b >> ../Pictures.shasum
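Since that one-liner is a bit of an eye chart, here is the exact same pipeline spread across several lines (the excluded names are specific to my photo library, so adjust them for your own):

# Outer find: walk the tree of directories, pruning descent into the cache
# and Lightroom bundle directories I'm not interested in.
# Inner find (run once per directory via -exec): print the regular files
# at depth 1, skipping Bridge/Finder metadata files.
# -print0 and xargs -0 protect paths that contain spaces or other odd characters.
$ find . \( \( -name .fseventsd -o -name .Spotlight-V100 \
            -o -name .TemporaryItems -o -name .FBCLockFolder \
            -o -name \*.lrplugin -o -name \*.lrdata \) -prune \
         -o -type d \) \
    -exec find {} \( -name .BridgeCache -o -name .BridgeCacheT \
                  -o -name .DS_Store -o -name .FBCSemaphoreFile \
                  -o -name .FBCIndex -o -name .localized \) -prune \
               -o -type f -depth 1 -print0 \; \
    | xargs -0 shasum -a 512 -b >> ../Pictures.shasum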

In plain English, here's what the pipeline does. The first find walks all the directories under the current one, pruning the ones I don't want included in the fingerprinting process. For each directory that find hands it, the -exec option runs a second find that lists the regular files directly inside it, again excluding those I don't want to fingerprint. This compound find feeds a null-separated list of file paths to xargs, which runs shasum over them and appends the output to the file "Pictures.shasum" in the parent directory. Note the use of find's -print0 and xargs's -0 options to protect any path components that contain special characters. This is not the best approach from a performance perspective, given the overhead of spawning a second find for every directory and starting Perl to interpret the shasum script for each batch of files xargs hands it, but OS X does a pretty good job of caching things, so the performance hit is negligible.

By excluding directories and file types that we're not interested in, we can use sha512 and still finish sooner than fingerprinting every file (including those we aren't interested in) with cfv's sha1. On the other hand, if we don't mind the shorter hash algorithm and the lack of granularity in selecting what we fingerprint, then cfv provides more convenience, a running status line (which doesn't show in the examples below), and a nice summary line. cfv also offers md5 and CRC32 algorithms if you need or prefer them. Here are performance comparisons in case you are interested. Keep in mind that I'm using sha512 in the shasum example below, whereas I'm using sha1 in the cfv example. The times are comparable despite the drastically different computational demands of the two algorithms, which shows the value of being selective about which files to fingerprint.

The shasum approach...

$ time (find . \( \( -name .fseventsd -o -name .Spotlight-V100 -o -name .TemporaryItems -o -name .FBCLockFolder -o -name \*.lrplugin -o -name \*.lrdata \) -prune -o -type d \) -exec find {} \( -name .BridgeCache -o -name .BridgeCacheT -o -name .DS_Store -o -name .FBCSemaphoreFile -o -name .FBCIndex -o -name .localized \) -prune -o -type f -depth 1 -print0 \; | xargs -0 shasum -a 512 -b >> ../Pictures.shasum)

real 58m41.193s
user 27m52.426s
sys 3m47.569s

Using cfv to create the hash file...

$ time cfv -rr -C -t sha1 -f /tmp/sumtest.cfv
skipping already visited dir Other Library/Data.noindex (234881026L, 503448)
skipping already visited dir iPhoto Library/Data.noindex (234881026L, 446866)
skipping already visited dir iPhoto Library/Originals (234881026L, 439465)
skipping already visited dir iPhoto Library/Previews (234881026L, 444525)
skipping already visited dir iPhoto Library/Thumbnails (234881026L, 446866)
/tmp/sumtest.cfv: 82351 files, 82351 OK. 3560.550 seconds, 69247.6K/s

real 59m20.813s
user 15m28.333s
sys 2m55.563s

And now to verify the files…

$ time cfv -f /tmp/sumtest.cfv
/tmp/sumtest.cfv: 82351 files, 82351 OK. 3670.371 seconds, 67175.7K/s

real 61m12.710s
user 15m16.687s
sys 2m50.618s
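For completeness, verifying against the shasum file works the same way with shasum's check mode; I didn't bother timing it. Run it from the directory the hashes were generated in so the relative paths resolve (the cd path below is an assumption about where the fingerprinted tree lives):

$ cd ~/Pictures                          # wherever the fingerprinted directory tree lives
$ shasum -a 512 -c ../Pictures.shasum    # re-hashes each listed file; prints OK per file, FAILED on a mismatch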

One important thing to note: you mustn't try to validate files immediately after you write them unless the utility you are using knows how to disable or flush the disk's and operating system's caches. Otherwise, you run the risk of validating what's in the cache, which may not be exactly what ends up on the disk's platter. ZFS has a definite advantage in that regard, but alas, it's only available natively on Solaris. One can use ZFS on Linux and Mac OS through FUSE, but the integration with the operating system is manual at best and user-unfriendly at worst (through no fault of the FUSE project).
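If the tool you're using can't flush those caches itself, one low-tech workaround on OS X is to unmount and remount the destination volume between writing and verifying, which should force the verification reads to come from the drive rather than from memory. A sketch, with a made-up volume name and device identifier (diskutil list will show yours):

$ diskutil unmount /Volumes/Backup    # flushes pending writes and detaches the volume
$ diskutil mount disk2s2              # reattaches it; its file data should no longer be sitting in the cache

Then run cfv or shasum against the remounted copy as shown above.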

I said I would come back to cfv. I determined after my initial trouble with false failures that the problem was in the USB IDE adapter I was using. I had no trouble whatsoever validating files on internal drives or external Firewire drives, nor did I have any files that failed to validate. I will probably use cfv in the future whenever I don't need to be selective about the files I'm backing up.

Geekery aside, I'll close by saying that if you aren't already validating the integrity of your backups, you really should start. You don't have to use anything as sophisticated or mind-numbing as what I've shown you above, but I heartily recommend you use something to ensure that the backups you've gone through the effort to make will actually be able to save your bacon if you need them.