Getting Backup Right
I was recently at a customer site working to architect a backup solution for a large environment that contained scientific data. At this site, a majority of the data is under HSM control, but departmental NAS servers for home directories and smaller site-specific tasks were not being backed up.
The customer and I started talking, and I realized that both of us were approaching the backup problem backwards. We were looking at the problem from the reference point of how to back up the data, which is the wrong approach. The problem with looking at things from the backup side first is that you might wind up planning an architecture that cannot support your restoration policy. So let's explore the reasons why I think restoration is the most important thing that needs to be planned, and some of the issues involved in planning and creating a restore policy.
As regular readers of this column know, I'm a big believer in gathering requirements at the start of the process. When I was visiting this customer, I started to gather their backup requirements and then their restoration requirements.
The discussion led to some detailed information on which machines were critical servers and which were not. It turned out that the service agreement with the hardware vendor determined whether a machine was critical. If a machine had four-hour hardware response around the clock, it was critical; if it had next-day hardware service on weekdays only, it was not. I asked what their service-level agreements with the user community were for software restoration, including the operating system and user data, in the event of a disaster. The customer had not thought about how fast they had to restore the data.
It dawned on me that the issue was not backup time, but the time it would take to restore the data. Backup is pretty easy: you can buy software from a variety of companies that can deal with big and small backup and restore problems. Assembling the hardware and other software needed to restore the environment, and meeting the organization's expectations for coming back after a disaster, is far more difficult. Also, we have all deleted a file that we wanted to get back, so you need to consider not only the restoration of the whole system, but also the process within the software for restoring a single file.
Gathering requirements for restoring data should address two issues:
- How are you going to restore the operating system and what are the expectations for how long this will take?
- How much user data will need to be restored and what are the expectations for how long this will take?
I once worked with a customer that wanted to use backup and restore for a critical system that required 99.995% uptime. I told them that it would be impossible to get the 4TB of data restored in the required time frame without hundreds of tape drives. Once they understood the technology issues, they changed from backup/restore to HSM because it met their operational requirements better. With the HSM system, they could get back online faster even though not all of the data was available on the primary storage, because the customer determined that not all the data was needed when the system was booted. A complete understanding of the service-level agreements you have with the users is one of the first and most important steps in the architectural planning process.
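To make the arithmetic behind that kind of conclusion concrete, here is a minimal sketch that turns an uptime target into a yearly downtime budget and then into the sustained throughput a full restore would need. The per-drive rate of 15 MB/s is purely an illustrative assumption, not a figure from that customer, and the sketch pessimistically ignores tape load and positioning time while optimistically spending the entire yearly budget on one restore.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Minutes of downtime allowed per year at the given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def required_throughput_mb_s(data_tb: float, window_minutes: float) -> float:
    """Sustained MB/s needed to move data_tb (decimal TB) within the window."""
    return data_tb * 1_000_000 / (window_minutes * 60)


def drives_needed(throughput_mb_s: float, drive_mb_s: float) -> int:
    """Number of drives streaming in parallel to reach the target throughput."""
    return math.ceil(throughput_mb_s / drive_mb_s)


budget = downtime_budget_minutes(99.995)    # roughly 26 minutes per year
rate = required_throughput_mb_s(4, budget)  # roughly 2,500 MB/s
drives = drives_needed(rate, 15)            # 15 MB/s per drive is an assumption
print(f"{budget:.1f} min/year, {rate:.0f} MB/s, {drives} drives")
```

Even with the whole year's allowance spent on a single event, the drive count lands in the hundreds, which is the kind of result that makes an alternative such as HSM worth discussing.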
Some of the areas that must be planned are:
- Network configuration;
- The number of incremental backups compared to full backups; and
- The backup medium.
As part of the architecture process, the architect must consider a number of areas of network architecture. I am assuming that you are not backing up each machine with its own tape drive, but doing a network backup for a group of machines. Some of the network issues include:
- How much data will we be backing up, and at what intervals?
- What is the current and planned network infrastructure?
- What is the current and expected excess capacity available for backups?
The key to all of these network issues is bandwidth and latency. If you have a large amount of data, you are going to need a large amount of bandwidth to support the backup process. If you have many users that are active when you are using large amounts of network bandwidth, you have the potential for a significant latency problem for the user community. Add to this what happens if a machine needs to be restored in the middle of the day, and you could affect the whole business by using up all of the available network bandwidth for the restore.
Understanding the network topology and usage patterns and the impact that both an inopportune restore and your standard backup will have is something that must be considered. Sooner or later you will have to do a restore at the worst possible time. If this is a critical business issue, then you need to consider an architecture concept we often use called "Engineering to Peak." What this means is that you develop an architecture with a set of worst-case scenarios in mind. These scenarios are documented and agreed to by all parties.
A good example might be that you use four trunked GigE links for your network interface to the backup/restore system to meet your required restoration time. As part of this architecture, you have a single GigE link as a hot standby in case one of the four fails. If two GigE links fail, you cannot meet your restore service-level agreement. Since the chance of two failing is very low, as long as this is documented and you have "engineered to peak," you have met your service-level agreement with management, because everyone agreed that the failure of two GigE links would not be planned for.
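The GigE example above can be sketched as a small calculation. The 2,000GB restore size, the 1.5-hour restore target and the 80% link efficiency are all hypothetical values chosen for illustration; real trunk efficiency depends on the protocol, the switch and the hosts, and has to be measured.

```python
def usable_links(active: int, standby: int, failed: int) -> int:
    """Links still carrying traffic after failures, with standbys filling in."""
    return max(0, min(active, active + standby - failed))


def restore_hours(data_gb: float, links: int,
                  link_gbit_s: float = 1.0, efficiency: float = 0.8) -> float:
    """Hours to move data_gb over the trunk; efficiency is an assumed factor."""
    if links == 0:
        return float("inf")
    mb_per_s = links * link_gbit_s * 1000 / 8 * efficiency
    return data_gb * 1000 / mb_per_s / 3600


SLA_HOURS = 1.5   # hypothetical restore target
DATA_GB = 2000    # hypothetical amount of data to restore

for failed in range(3):
    links = usable_links(active=4, standby=1, failed=failed)
    hours = restore_hours(DATA_GB, links)
    status = "meets" if hours <= SLA_HOURS else "misses"
    print(f"{failed} failed: {links} links, {hours:.2f} h, {status} the SLA")
```

Under these assumptions, zero or one failure still leaves four working links and the target is met; a second failure drops the trunk to three links and the target is missed, which is exactly the scenario that should be written down and agreed to in advance.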
Incremental and Full Backups
Incremental backups allow the backup process to take far less time, but they lengthen the restoration process. As part of the backup and restoration planning process, you need to determine how much time you have to restore a machine. Given that information, you can plan how many incremental backups you can have before you have to do a full backup. Remember to include tape pick, load, position, rewind, unload and robot placement of tape as part of this time calculation.
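As a rough sketch of that time calculation, the function below adds a fixed per-tape handling overhead (pick, load, position, rewind, unload and robot placement lumped together) to the streaming time of the full backup and each incremental. It assumes one tape mount per backup set and a single drive, and every number in the example is hypothetical.

```python
from typing import Optional


def restore_minutes(full_gb: float, incr_gb: float, n_incr: int,
                    drive_mb_s: float, tape_overhead_min: float) -> float:
    """Minutes to restore one full backup plus n_incr incrementals."""
    stream_min = (full_gb + n_incr * incr_gb) * 1000 / drive_mb_s / 60
    handling_min = (1 + n_incr) * tape_overhead_min  # one mount per backup set
    return stream_min + handling_min


def max_incrementals(full_gb: float, incr_gb: float, drive_mb_s: float,
                     tape_overhead_min: float, window_min: float) -> Optional[int]:
    """Most incrementals that still fit the window; None if the full alone misses it."""
    base = restore_minutes(full_gb, incr_gb, 0, drive_mb_s, tape_overhead_min)
    if base > window_min:
        return None
    per_incr = incr_gb * 1000 / drive_mb_s / 60 + tape_overhead_min
    if per_incr <= 0:
        raise ValueError("each incremental must add some restore time")
    return int((window_min - base) // per_incr)


# Hypothetical machine: 500GB full, 25GB incrementals, a 30 MB/s drive,
# 5 minutes of handling per tape and a 6-hour restore window.
print(max_incrementals(500, 25, 30, 5, 360))
```

With these made-up figures the full restore alone consumes most of the window, so only a handful of incrementals fit before another full backup is due.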
What is the backup medium? For the most part, tape is still the backup medium of choice. If you have not read "Tale of the Tape: Beware of Wind Quality," it might be a good idea to review the issues with tape performance and the importance of good wind quality.
MAID technology, or massive array of idle disks, is being used more and more in the large-environment backup market. One of the tradeoffs with MAID is that MAID vendors do not currently support the hardware compression that tape drives provide. Of course, your compression mileage may vary depending on data type.
Whatever technology you use, it is important to ensure that the technology is cost-effective and meets your performance requirements. Buying lower-cost DLT drives instead of higher-cost drives from the likes of IBM and StorageTek might not make sense when you consider that the better compression and higher performance of the more expensive drives allow you to have fewer drives, fewer tapes and potentially a smaller robot.
There are no simple answers, because without good knowledge of your requirements, your data and the compressibility of that data, making decisions in this area is difficult. Add to this that some backup packages have the option of compression within the backup software, and you have a very confusing situation. The only way to figure this out is through real-world data.
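One way to start collecting that real-world data is to sample your own files and see how well they compress. The sketch below uses Python's zlib only as a stand-in: it is not the algorithm a tape drive implements in hardware, so treat the ratio as a rough indicator. The drive rate, data size and window in the example are hypothetical, and the drive-count function optimistically assumes the effective rate scales with the compression ratio.

```python
import math
import zlib


def compression_ratio(sample: bytes, level: int = 1) -> float:
    """Original size divided by zlib-compressed size for one sample."""
    if not sample:
        return 1.0
    return len(sample) / len(zlib.compress(sample, level))


def drives_for_window(data_gb: float, window_hours: float,
                      native_mb_s: float, ratio: float) -> int:
    """Drives needed to stream data_gb within window_hours.

    Ratios below 1 (data that grows when compressed) are treated as 1.
    """
    effective_mb_s = native_mb_s * max(ratio, 1.0)
    required_mb_s = data_gb * 1000 / (window_hours * 3600)
    return math.ceil(required_mb_s / effective_mb_s)


# A made-up, highly repetitive sample; real samples should come from your data.
sample = b"2005-03-01 sensor=42 status=OK\n" * 4096
print(f"sample ratio: {compression_ratio(sample):.1f}")

# Hypothetical: 10,000GB in an 8-hour window on 30 MB/s drives.
print(drives_for_window(10_000, 8, 30, ratio=1.0))  # incompressible data
print(drives_for_window(10_000, 8, 30, ratio=2.0))  # data that compresses 2:1
```

Running the ratio function over a representative sample of each data type gives a first estimate of how many drives and tapes compression might save, before you commit to a particular drive technology.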
Architecting for backup correctly means basing your architecture on your restore requirements. Those requirements will often vary for each machine in an environment, depending on how critical that machine is to your business. Many IT departments define a set of services for all the machines in the environment. This is often based on the hardware maintenance plans for the machines, since those plans are usually a very good measure of how critical a machine is.
Whatever you decide, it is important to look at the restore requirements first before you consider the backup issues. You will need to test your restoration time at least once a year to make sure that the process works as planned, but we'll save that topic for another column.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 25 years of experience in high-performance computing and storage.