Validating backups and ensuring file integrity

I recently started planning for my upgrade from Snow Leopard to Lion (those are versions of the Macintosh operating system). I decided that I had enough cruft on my system after several in situ upgrades and years of use to warrant a so-called "wipe and load" — that is, I wanted to format my hard drive, do a fresh install of the operating system, then copy my applications and data back onto the system. That got me thinking about data integrity and the need to validate my backups. How would I know that my image files would be undamaged after being copied to an external hard drive and back again? Hard drives are very reliable these days, so problems were unlikely, but I couldn't take the chance of having an undetected corruption somewhere along the way. That issue has bitten me once before and I don't plan to let it happen again.

The answer

The obvious solution in my mind was cryptographic hashing. If I fingerprint (hash) each file and append the result to a master fingerprint file, I can verify with a high degree of confidence whether any of the files were corrupted during the copy from one drive to another or corrupted at rest on a given drive. This technique is in common use throughout computer science and has stood the test of time. For example, Oracle's ZFS filesystem uses hashes to detect all manner of trouble, including bit rot. Hashes allow web sites to validate your password without storing a copy of it. They are also used for digitally signing documents and files, among a myriad of other uses. There are many different hashing algorithms, and many tools which implement those algorithms, so I started searching around for tools that would make the process easy.

During my search, I came across a great resource from the well-known and respected American Society of Media Photographers and the Library of Congress called dpBestFlow. It has a section on data validation which explains the subject more thoroughly than I can. Also, Steve Friedl has put together "An Illustrated Guide to Cryptographic Hashes" with diagrams and examples.

The tools

Fair warning: the concepts I'm going to talk about in this post are universal no matter what operating system you use, but the implementation details are going to be Mac/Unix/Linux-specific and fairly geeky. That's not to say that you can't apply the principles I will explain to a Windows computer; you'll just have to do a little digging to find the tools. If you stay with me, I'm confident you'll learn something you can use no matter your operating system or geek quotient. Also, when I use the term "directory" in this post, please understand that some people use the term "folder" instead.

The first tool that looked promising was ChronoSync from Econ Technologies. This is a utility for synchronizing directories, either by copying one directory to another (unidirectional) or by synchronizing the changes in two directories with each other so that changes in one directory are replicated to the other and vice versa (bidirectional). One of the features the user can enable is a byte-by-byte comparison of the original files and the copies to ensure that a copy is not corrupt. This was similar to what I was looking for. However, at the time I found this utility, I had already duplicated my internal hard drive to an external and was just looking for a way to verify the integrity of the duplicate. I will probably buy ChronoSync for future use, but it didn't appear to be what I was looking for at that particular moment.

The next tool I found was a Java utility called Compare Folders by Keith Fenske. Since it's a Java program, it will work on Mac, Windows, Linux, Solaris, or just about any other platform. It sounded like exactly what I was looking for and, indeed, it was. However, it was slow (a common problem with Java apps) and crashed on me at least once while I was comparing a directory with 86,000+ files (in many subdirectories). It's a handy utility that I will keep around for smaller jobs.

I have MacPorts installed, so I decided to do a little searching there. I came up with a utility called cfv that creates hash files and/or verifies files against hash files. It seemed rather handy and reasonably speedy. Unfortunately, it reported that a lot of files were different when in fact they were not. Mind you, it did catch some that were indeed different but the rate of false positives was too high for my liking. I initially wrote it off for this reason, but I'll come back to that below.

The answer, or at least an answer, was under my nose the whole time. A program (actually a Perl script) called shasum comes with OS X. It's a command-line tool that you run in the Terminal, so it's not the most user-friendly choice for people who are only comfortable with GUIs. If, however, you are a CLI (command line interface) junkie like I am, it's a good choice for an integrity-checking tool. One advantage it has over cfv is the ability to use hashes of various lengths up to 512 bits; cfv only offers sha1 (along with md5 and CRC). I would venture to say that only the truly paranoid need hashes that long for verifying file integrity, but the option is there if you want it. Another of shasum's advantages is the relative ease with which one can fine-tune the set of files it works on. Now, if you're not a Unix geek then the following example won't make much sense to you, but if you are one, check this out:

$ find . \( \( -name .fseventsd -o -name .Spotlight-V100 -o -name .TemporaryItems -o -name .FBCLockFolder -o -name \*.lrplugin -o -name \*.lrdata \) -prune -o -type d \) -exec find {} \( -name .BridgeCache -o -name .BridgeCacheT -o -name .DS_Store -o -name .FBCSemaphoreFile -o -name .FBCIndex -o -name .localized \) -prune -o -type f -depth 1 -print0 \; | xargs -0 shasum -a 512 -b >> ../Pictures.shasum

That's admittedly a bit abstruse, so here's what I'm doing. The first find command finds all the directories under the current one and excludes those I don't want to include in the fingerprinting process. In each of the directories (less exclusions) that find finds, I then run another find via the -exec option to list all the files the directory contains, again excluding those I don't want to fingerprint. This compound find feeds a null-delimited list of file paths to xargs, which runs shasum on them and appends the output to the file "Pictures.shasum" in the parent directory. Note the use of find's -print0 and xargs's -0 options to protect any path components that contain special characters.

This is not the best approach from a performance perspective because of the overhead of starting Perl and interpreting the shasum script for each batch of files, but OS X does a pretty good job of caching things so the performance hit is negligible. By excluding directories and file types that we're not interested in, we can use sha512 and still finish sooner than fingerprinting every file (including those we aren't interested in) with cfv's sha1. On the other hand, if we don't mind the shorter hash algorithm and the lack of granularity in selecting what we fingerprint, then cfv provides more convenience, a running status line (which doesn't show in the examples below), and a nice summary line. cfv also offers md5 and CRC32 algorithms if you need or prefer them.

Here are performance comparisons in case you are interested. Keep in mind that I'm using sha512 in the shasum example below, whereas I'm using sha1 in the cfv example. The times are comparable despite the drastically different computational demands of these algorithms, which shows the value of being selective about which files to fingerprint.

The shasum approach...

$ time (find . \( \( -name .fseventsd -o -name .Spotlight-V100 -o -name .TemporaryItems -o -name .FBCLockFolder -o -name \*.lrplugin -o -name \*.lrdata \) -prune -o -type d \) -exec find {} \( -name .BridgeCache -o -name .BridgeCacheT -o -name .DS_Store -o -name .FBCSemaphoreFile -o -name .FBCIndex -o -name .localized \) -prune -o -type f -depth 1 -print0 \; | xargs -0 shasum -a 512 -b >> ../Pictures.shasum)

real 58m41.193s
user 27m52.426s
sys 3m47.569s

Using cfv to create the hash file...

$ time cfv -rr -C -t sha1 -f /tmp/sumtest.cfv
skipping already visited dir Other Library/Data.noindex (234881026L, 503448)
skipping already visited dir iPhoto Library/Data.noindex (234881026L, 446866)
skipping already visited dir iPhoto Library/Originals (234881026L, 439465)
skipping already visited dir iPhoto Library/Previews (234881026L, 444525)
skipping already visited dir iPhoto Library/Thumbnails (234881026L, 446866)
/tmp/sumtest.cfv: 82351 files, 82351 OK. 3560.550 seconds, 69247.6K/s

real 59m20.813s
user 15m28.333s
sys 2m55.563s

And now to verify the files…

$ time cfv -f /tmp/sumtest.cfv
/tmp/sumtest.cfv: 82351 files, 82351 OK. 3670.371 seconds, 67175.7K/s

real 61m12.710s
user 15m16.687s
sys 2m50.618s

One important thing to note: you mustn't try to validate files immediately after you write them unless the utility you are using knows how to disable or flush the disk's and operating system's caches. Otherwise, you run the risk of validating what's in the cache which may not be exactly what ends up on the disk's platter. ZFS has a definite advantage in that regard but alas, it's only available natively on Solaris. One can use ZFS on Linux and Mac OS through FUSE but the integration with the operating system is manual at best and user-unfriendly at worst (through no fault of the FUSE project).
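On OS X, one way to reduce (though not eliminate — the drive's internal write cache is still outside your control) the risk of verifying stale cached data is to flush the filesystem buffers before the verification pass. This is a sketch, not a guarantee:

```shell
# Flush pending writes, then drop the in-memory disk cache so that the
# verification pass re-reads data from the drive rather than from RAM.
# (purge is OS X-specific; it ships with the developer tools and may
# require sudo on later releases.)
sync
purge
shasum -c ../Pictures.shasum
```

The most robust option remains what I did: verify after unmounting and remounting the drive, or on a later boot, so nothing from the original write can still be cached.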

I said I would come back to cfv. I determined after my initial trouble with false failures that the problem was in the USB IDE adapter I was using. I had no trouble whatsoever validating files on internal drives or external Firewire drives, nor did I have any files that failed to validate. I will probably use cfv in the future whenever I don't need to be selective about the files I'm backing up.

Geekery aside, I'll close by saying that if you aren't already validating the integrity of your backups, you really should start. You don't have to use anything as sophisticated or mind-numbing as what I've shown you above, but I heartily recommend you use something to ensure that the backups you've gone through the effort to make will actually be able to save your bacon if you need them.



Meet Gunner. He's a charming two-month-old who was a joy to work with. Many children this young would be cranky about the strange people and noises and the flash going off all the time, but Gunner was a trooper.

Redesigned site online

If you read this blog via RSS feed or by going directly to the blog page, you probably haven't noticed that I've redesigned the rest of the site. It's a long overdue update that I hope will better serve the goals of my photography.

Blossom, day 4

I really like the stately appearance of the building in this shot. It looks like a queen bee with worker bees tending to it. I have a perspective-corrected version but it just doesn't have the impact that this one does.


Silver blur

The assignment topic in my local photo club's monthly contest this month was "blur." The club web site had a rather long and specific description of what blur is (as if we don't already know) which concluded with, "whatever the photographer wants to do to create areas that are not in focus in the composition." That limitation ruffled my feathers a bit. When I first joined the club, the assignments were one- or two-word topics and that's all. The photographer had complete creative license to interpret the topic however he or she wanted. Of course, with great power comes great responsibility — it was up to the photographer to research the meaning(s) of any terms he wasn't familiar with and depict them in a way that was creative but obvious enough for the judges to recognize.

The club got an influx of new members 2-3 years ago and many of them were relatively new to photography. They were eager to experiment and learn. As perhaps a byproduct of the digital-age instant gratification mindset, they also wanted to jump right in and participate in the program without taking the time to learn the nuances of our system. I can't say I blame them; I did the same thing when I first joined. However, I didn't ring the alarm bells and call the system flawed when the judges didn't "get" my approaches. I just tried harder. It ended up making me a better photographer.

At the behest of a vocal minority, the powers that be decided to attach rather specific and limiting narrative descriptions to the topics so the newer members would have a better feel for what the judges would look for. Their intentions were good, but the result was that the range of creativity in the entries dropped considerably and the entries became mostly rote and predictable. I've seen a wider creative range in the last few months but I still think the narratives are too limiting. Besides, the people who complained the loudest don't even participate anymore. So, I decided to make a point. My interpretation of "whatever the photographer wants to do to create areas that are not in focus in the composition" was to turn in a photo that was entirely sharp and in focus. I got a Silver award for my entry. It shouldn't be too difficult to see where the element of "blur" comes in, and thankfully the judges didn't rigidly hold themselves and the photographers to the published description.