I recently received a call from my wife saying that some of her pictures (she is a semi-professional photographer) were not displaying correctly and she had a show coming up and needed to get them printed and framed. The preview window in Photoshop was displaying only part of the picture and said there was an error opening the file. If you’re like me, a data corruption issue may be a big deal if it involves a client, but it’s an even bigger deal when it’s your spouse on the other end of the line.
That is how this trip down the data corruption path started for me. Those of us who have been in the computer industry for any amount of time know that it is not a matter of ‘if’ your files are going to get corrupted or lost, it is a matter of ‘when.’ I have found that this is especially true with home PCs.
I told my wife not to worry since I have multiple backup copies and could fix the problem when I got home. As it turned out, the problem was not so easily fixed, which got me thinking about file systems, hardware and data corruption.
The Analysis
I left work thinking this was going to be easy. I have a backup at home connected to the PC, another backup in a safety deposit box, and last but not least, an off-site internet backup through a company called Carbonite. On the way home, I stopped by the bank and picked up the drive from the safety deposit box. I figured I was now prepared for any eventuality. I rotated the two USB hard drives monthly between the home and the box.
The first step was to figure out why the drive got corrupted in the first place. I had to replace a power supply a few months earlier, and the files in question hadn’t been used since. I have a RAID-0 two-disk stripe with SATA-2 250 GB drives for performance. I wondered if the RAID controller had a problem that corrupted the files. I thought it best to begin by running a complete set of diagnostics on the hardware, only to find nothing. Then I ran a complete Microsoft error check of the file system (chkdsk, the Windows counterpart of fsck) and immediately found lots of errors at each of the five stages. To me, more than one error is a lot, given that NTFS is a journaled file system and journals are supposed to prevent data corruption and speed recovery. I had not run a complete error check in many months and realized that this was my fault for having bad disk hygiene. Could the corruption have been caused by a hardware or software problem from months ago that I had propagated?
I felt I had completed the analysis, since I found no hardware problems and did a complete fsck. I still had no idea what the problem was or why it was occurring, my deadline for recovery was fast approaching, and the fsck had turned up more corrupted files. Not a great deal more, but another 10 or so. I told my wife I would have her files by the time she woke up in the morning, and it was now nearly 8 at night. Tick tick tick.
The Restore
As I started the restore from the local USB hard drive, I had the sinking feeling that if the drive had been corrupted a while back, all my backups had done was copy bad data from the hard drive in the system to the external drive. That feeling turned out to be correct, but I still had a backup on the other external hard drive that was about a month old. Oops: the files on that drive turned out to be bad too.
I looked at the timestamps of the files, and according to the file system, they had not been changed in a long time. It had been more than a year since the pictures were taken, but we knew the files were perfect a few months ago. Time was ticking away. It was now nearly 10 p.m. and I had to get up in a little more than six hours. Since both backup copies were bad, my only choice was to go to the Carbonite internet copy. I downloaded the first and most important file, and after a few minutes of internet time, opened it in Photoshop and found it was good. I then restored the other files I had found to be bad, and the internet copies of those were good too. I finally made it to bed shortly after 11 p.m.
What It All Means
I had a poor night’s sleep thinking about what all of this meant and how I could have done things differently, and why things got corrupted. Here’s what I learned.
- I learned that I should be doing incremental backups. I had been doing full backups and removing the old backup file each time, since my USB hard drive was only about 75 percent of the size of my hard drive. We use only about half of the space at this point, so I figured a USB hard drive of that size was fine. If I had done incrementals, I could have gone back to when the file was created. I wasn’t doing incrementals since I was a proponent of speed, and having lots of incrementals takes time to both manage and restore. I found that for full restoration after a crash, full backups and restores are great, but incrementals are needed in case of corruption. The nagging issue is how do I know the files are good? Incrementals could have saved me, but I have no idea how many I would have needed to ensure that the files were good. It might have been over a year of incrementals.
- I learned that two corrupt copies are no substitute for one good copy. Just because I’m paranoid and put a copy in the safety deposit box does not mean that I have not propagated a bad file. Someone could steal our PC, the house could be destroyed or some other disaster could strike (I told you I was paranoid), and the disk in the safety deposit box would still have corrupted data on it. This is not to say that I could not get back 99.9 percent of that data (about 0.1 percent of the files were corrupted), but for some people, that might not be good enough. Like me.
- Having an off-site internet copy is a good thing. It took almost two months to get everything backed up over the internet, often leaving the PC on overnight. Not very energy efficient, but it saved me and likely my marriage, since I am responsible for the PC and for making sure the user (my wife) can use it when she needs it.
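To make the incremental idea in the first lesson concrete, here is a minimal Python sketch. It is not what I was running at the time, and the names in it (`incremental_backup`, `latest_version`, the dated generation labels) are made up for illustration. It assumes each backup run gets its own directory whose name sorts in time order, copies only files that are new or whose contents differ from the most recent stored version, and never touches older generations.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def latest_version(backup_root, rel):
    """Find the newest stored copy of rel, or None if it was never backed up.

    Assumes generation directory names sort chronologically (hypothetical
    convention, for example "2024-01", "2024-02").
    """
    generations = sorted(
        (d for d in backup_root.iterdir() if d.is_dir()), reverse=True
    )
    for gen in generations:
        candidate = gen / rel
        if candidate.is_file():
            return candidate
    return None


def incremental_backup(source, backup_root, label):
    """Copy new or changed files from source into backup_root/label.

    Returns the relative paths that were copied. Earlier generations are
    left alone, so an older (possibly still good) version stays available.
    """
    source, backup_root = Path(source), Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        previous = latest_version(backup_root, rel)
        if previous is not None and file_digest(previous) == file_digest(src):
            continue  # unchanged since the last stored version
        dest = backup_root / label / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(rel)
    return copied
```

Note the trade-off: a scheme like this will happily store a corrupted file as a "changed" version. What it buys is that the earlier generation is still there to go back to, at the cost of more space and more generations to manage.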
Some of this is obvious and some is not. The big issues I am still pondering are:
- How can I know a file is good before I commit it to backup? I had no idea I was backing up bad data. I had good backup procedures and I tested my backups, but how could I know that one file out of 1,000 was bad? Clearly, I could not open every file on the system. Yes, I had bad disk hygiene for not running the MS tools, but should that really matter that much with a journaling file system? I knew that it should, but what could I have done differently other than incrementals? Sun Microsystems is trying to address the data corruption issue by checksumming the file with its ZFS file system. The FC/SAS community is trying to address this with the T10 DIF (Data Integrity Field) standard, but neither of these was going to help me.
- What caused the corruption? To this day I do not know. We have found a few other files that were corrupted, but as far as we can tell, no new ones. It could have been anything: the power supply that went bad, causing problems with power fluctuations and corruption in the controller or drive; a file system sync problem caused by the power supply or something else; the SATA controller might have had a problem; the disk drive might have had a problem; or it might have been all of the above.
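One partial answer to the first question is to do at the file level what ZFS does inside the file system: record a checksum for every file and compare it on the next backup run. The Python sketch below is only an illustration under my own assumptions, and the function names (`build_manifest`, `suspect_files`) are hypothetical. It cannot tell you a file was good to begin with. It can only flag files whose contents changed even though their modification time did not, which is what silent corruption looks like from the outside.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file under root to its checksum, size and modification time."""
    root = Path(root)
    manifest = {}
    for p in sorted(root.rglob("*")):
        if p.is_file():
            st = p.stat()
            manifest[p.relative_to(root).as_posix()] = {
                "sha256": sha256_of(p),
                "size": st.st_size,
                "mtime_ns": st.st_mtime_ns,
            }
    return manifest


def suspect_files(old, new):
    """List files whose contents changed while their mtime stayed the same.

    A normal edit updates the modification time; a checksum change with an
    untouched mtime suggests the bytes changed underneath the file system.
    """
    suspects = []
    for name, before in old.items():
        after = new.get(name)
        if after is None:
            continue  # deleted files are a separate question
        same_mtime = after["mtime_ns"] == before["mtime_ns"]
        if same_mtime and after["sha256"] != before["sha256"]:
            suspects.append(name)
    return suspects
```

In practice the manifest would be saved (for example with `json.dump`) alongside each backup, and any file reported by `suspect_files` would be held back and checked by hand before the older copy is overwritten.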
I will never know what caused the data corruption and I am glad it is behind me. With any luck, you’ll learn from my mistakes and prepare for data corruption before it happens.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 27 years of experience in high-performance computing and storage.