Over Memorial Day weekend, I experienced what every user of a computer system fears the most: a hard drive crash. For the next few hours, I hoped that it was some sort of OS error and I did not have to worry about data restoration. I turned out to be wrong and needed a complete restore.
Not to worry, my system administrator said. All you have to do is install your recovery CD and our backup/restore software and, like magic, the system will be restored.
So I ran to the office and then back home to reload Windows and start the process of recovering my system. After 36 hours of downloading at cable modem speeds, the system said it was ready to re-boot and I would get all my data back. A new hard drive would arrive the next day, so I thought I would be able to reload again with the CD-ROMs that were on order and I was all set. No loss of productivity for this consultant.
My hopes were high — until I got the message "Cannot Restore System Please Reinstall." I called our system administrator at home (even though it was 4:45 A.M. — after all, there's nothing more worthless than a consultant without data), and he said he did not understand why it did not work, since he had tested the process a few months earlier as part of the final decision to buy the package. He said he would get back to me ASAP, but in the meantime I could get access to my data via a Web interface.
What do all of these problems with a small company have to do with storage users? This is not an isolated problem — it happens all the time to all sorts of users, and while you usually can recover your data after some effort, it always takes way more time than expected. That's why it's important to make sure that your backup system works before you need it.
Where We Went Wrong
The problem we found even in my small company is that testing restoration of data is difficult and costly. It is usually done once and then forgotten.
In our case, we were evaluating different backup/restoration options for employees who travel. We did some significant backup and restore testing, but when we installed the final version of the software, we did not test it again. It appears that a simple parameter was not set correctly, so we could not do an automatic restoration. We could get our physical data back, but we could not restore the machine state. In my case, we kept beating our heads against the wall trying to restore the machine state, but it wasn't going to happen. It took more than two weeks to get answers from the company handling our backup/restore environment. Fortunately, once the new disk drive showed up, I restored my system and my data myself.
So what did I learn from this experience, from both a policy and a professional point of view?
I already knew the following:
- Backups are only as good as your restoration.
- Restores are only as good as the media they are written to.
- You should architect backup from the perspective of restoring the data, not of backing it up; we discussed this last week in Getting Backup Right. Restoration is the requirement.
What I learned was:
- Testing backup policy needs to be done after every single change to the backup/restore environment. This means that even changes that seem meaningless need to be tested.
- Very few organizations build into the cost of a backup/restore environment the cost of testing that environment regularly, with or without changes. This is especially true for smaller organizations, because developing a backup/restore environment is already expensive for them.
- Some of the companies that develop backup/restore software and provide off-site support for small and mid-sized businesses (SMBs) have a good sales story and good demos, but how good is the support? Find out as best you can before you need to know; regular testing will help. In our case, the company we dealt with was involved in configuring the software used to back up my system, yet they were not able to figure out the problem for more than two weeks.
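The "test after every single change" rule above can be made mechanical. What follows is only a minimal sketch, not any vendor's feature: it assumes you keep a simple record of the component versions that were in place at the last successful restore test. The component names, the version strings and the 90-day maximum age are all hypothetical choices for illustration.

```python
import hashlib
from datetime import date, timedelta
from typing import Dict, Optional


def environment_fingerprint(components: Dict[str, str]) -> str:
    """Hash component names and versions into one comparable string."""
    canonical = "\n".join(
        f"{name}={version}" for name, version in sorted(components.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def restore_test_due(
    current: Dict[str, str],
    tested_fingerprint: Optional[str],
    last_test: Optional[date],
    today: date,
    max_age: timedelta = timedelta(days=90),  # assumed policy, not a standard
) -> bool:
    """Return True if a restore test should be run now."""
    # Never tested: a test is always due.
    if tested_fingerprint is None or last_test is None:
        return True
    # Any change at all to the recorded environment triggers a re-test.
    if environment_fingerprint(current) != tested_fingerprint:
        return True
    # No change, but regular testing still applies.
    return today - last_test > max_age
```

For example, recording `{"backup_client": "4.1", "os_patch_level": "2024-05"}` at test time and later seeing a different patch level makes `restore_test_due` return `True`, however recent the last test was.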
While it would be nice to blame vendors for everything, we have to take some responsibility ourselves. So here is a checklist of items to consider for backup/restore environments and why they should be considered:
- Like Environments: In most cases, I have found that people test a few desktops and a laptop or two, but they do not test any operational systems because these systems are generally in use and testing is disruptive. Wrong answer. Go out and buy an extra disk drive or two and test real running systems over a weekend. This will give you a far greater level of confidence in the company and your procedures.
- Testing Changes: If you follow the previous point, you will be able to test like environments and have a level of confidence that the systems work in an operational environment. So if any change is made to that environment from the status quo (and I mean any change at all), it should be re-tested. And this is in addition to regular testing. This means any software updates from the backup/restore vendor, MS patches, Linux patches, antivirus and firewall updates: any and every patch. This might lead to a change in site patch policy, but getting your data back is important enough to warrant it.
- Vendor Restoration: A number of SMB packages support off-site backup methods. This is often done via the Internet, but regardless of which of the following methods you use, each method you use should be tested at least once a year. These are the common SMB methods:
  - Block-based and kept on site so you can restore a whole system block by block;
  - File-based and kept on site so you can restore your important data;
  - Block-based and kept off site so you can restore via the Internet or by contacting the vendor and getting your data on media overnight; and
  - File-based and kept off site so you can restore your important data via the Internet or media.
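For the file-based methods, one way to check a test restore is to compare checksums of the restored files against the originals. The following is a rough sketch under stated assumptions, not part of any backup product: it assumes both the source tree and the restored tree are reachable as local directories, and the function names are invented for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(source: Path, restored: Path) -> Dict[str, List[str]]:
    """Report files that are missing, unexpected, or changed after a restore."""
    src, dst = tree_digests(source), tree_digests(restored)
    return {
        "missing": sorted(set(src) - set(dst)),
        "unexpected": sorted(set(dst) - set(src)),
        "changed": sorted(k for k in set(src) & set(dst) if src[k] != dst[k]),
    }


def restore_ok(report: Dict[str, List[str]]) -> bool:
    """A restore passes only if every category in the report is empty."""
    return not any(report.values())
```

A check like this only confirms that file contents came back intact. It says nothing about machine state, which is exactly the part that failed in my case, so a block-based or full-system restore still has to be tested by actually booting the restored system.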
Sooner or later, your hardware or software is going to break down and you are going to need to restore your data. You could send your bad hard drive to one of the places that takes the drive apart and reads the data block by block. This works pretty well, but is very expensive. Backups are your most effective method of ensuring that your data is not lost, but without testing your restore policy, you do not know if your backups are worth anything or if you can meet your restoration requirements.
I was down for the better part of a week, and for a consultant that can be a lifetime. Imagine being a tax accountant whose system crashed on April 10 and cost you a week, or facing some other time-critical business disaster. The restoration process and procedures must be tested no matter what the cost, since the alternative could threaten the survival of your business.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 25 years of experience in high-performance computing and storage.