Since 9/11, all types of organizations have been talking about, struggling with, and implementing various disaster recovery plans. Even before 9/11, many sites with critical data had already implemented plans that allowed management to sleep at night, knowing their companies could recover from a disaster.
Providing a disaster recovery plan for your home, small office, large office, or even a large multi-national corporation is generally different than doing so for a large High Performance Computing (HPC) site. HPC sites typically have hundreds of terabytes or more of data under Hierarchical Storage Management (HSM), so they cannot simply rely on mirrored RAID.
Background
If you have been reading this column for a while, you might remember this table. It looks at storage performance and density changes over the last 30 years for high performance tape and disk, using the best technology available at the time. The disk specifications were gleaned from Seagate's web site (for its Fibre Channel drives) and the tape specifications from StorageTek's web site.
| Technology Change Over Last 30 Years | Number of Times Increase |
| --- | --- |
| Tape Density | 1333 |
| Tape Transfer Rate | 24 |
| Disk Density | 2250 |
| Disk RAID-5 8+1 Density LUN | 18000 |
| Disk Transfer Rate Single Average | 21 |
| Disk RAID Transfer Rate LUN Average | 133 |
Tape technology clearly has not kept pace with the increase in density and performance of disk technology, nor have tape load and position times improved enough to make writing small files any faster. I became aware of a site that was going to use remote tape mirroring over ATM OC-3 to write the third copy of their data at their DR facility.
I had my doubts that remote tape mirroring would work, given that the tape would not be streaming. If the drive could stream, the data rate of the StorageTek T9940B with compression was going to be about 45 MB/sec. Given that ATM OC-3 could only run at a peak of about 15 MB/sec, and taking into account all of the overhead associated with ATM, TCP, and congestion, we estimated a sustained rate of about 5-8 MB/sec. I had a bad feeling that we could have problems.
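To make the mismatch concrete, here is a back-of-envelope sketch using the figures above; the overhead factor is my assumption standing in for the ATM, TCP, and congestion losses, not a measured value.

```python
# Back-of-envelope check of the OC-3 remote tape mirroring plan, using the
# figures quoted above. OVERHEAD_FACTOR is an assumed stand-in for ATM/TCP
# framing and congestion losses, not a measured number.

OC3_PEAK_MB_S = 15.0        # approximate usable peak of ATM OC-3
OVERHEAD_FACTOR = 0.4       # assumed fraction of peak actually sustained
DRIVE_STREAM_MB_S = 45.0    # StorageTek T9940B rate with compression

sustained = OC3_PEAK_MB_S * OVERHEAD_FACTOR
shortfall = DRIVE_STREAM_MB_S / sustained

print(f"Sustained WAN rate : ~{sustained:.0f} MB/sec")
print(f"Drive needs        : {DRIVE_STREAM_MB_S:.0f} MB/sec to stream")
print(f"Shortfall          : ~{shortfall:.1f}x too slow -- the drive shoe-shines instead of streaming")
```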
I decided to contact one of the world's leading media experts (and a personal friend of mine) at Imation, Jim Goins, who told me that for most tape drives, both the media and the drive expect the data to be streamed. He suggested that if the data were written the way the customer proposed over OC-3, we would have to re-tension each tape after it was written. This was virtually impossible, though, as the HSM application did not support this function, and we would have to write and maintain special code to re-tension tapes after they were written.
So writing remote tapes without streaming the data was out of the question, and the additional cost of higher speed network connections, which would allow the data to stream, was not going to be cost effective given the amount of data generated each day. This presented a dilemma for the site. What should they do?
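For a sense of scale, here is a hypothetical sizing sketch; the daily archive volume and the sustained rates for the faster circuits are illustrative assumptions, since the site's actual figures are not given here.

```python
# Hypothetical sizing sketch: how long does one day's worth of new archive
# data take to move off-site? The daily volume and the sustained link rates
# below are illustrative assumptions, not the site's actual numbers.

DAILY_ARCHIVE_GB = 2000.0           # assumed new data per day
links = {
    "OC-3  (~6 MB/sec sustained)": 6.0,
    "OC-12 (~30 MB/sec sustained)": 30.0,
    "OC-48 (~120 MB/sec sustained)": 120.0,
}

for name, mb_s in links.items():
    hours = DAILY_ARCHIVE_GB * 1024 / mb_s / 3600
    print(f"{name}: ~{hours:.1f} hours per day of transfer")
# Only the faster (and far more expensive) circuits leave headroom to also
# stream a ~45 MB/sec drive, which is the crux of the cost argument.
```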
DR Issues for HPC Sites
Most HPC sites are in a difficult situation as a result of:
- Having huge amounts of data generated over time under HSM control
- The inability to mirror RAIDs, as most of the data is stored on tape, and the RAIDs are just a cache used by the HSM system
- The inability to use remote tape drives over IP network connections with compressible data unless enough expensive network bandwidth is available to keep the drives streaming, which is rarely cost effective given the typical total amount of data per day
Most sites have a few choices in terms of establishing an effective strategy for Disaster Recovery, including:
- Move a copy of the HSM tapes to an off-site location
- Create a second copy of the HSM software and hardware that is then replicated to an off-site location using the HSM software (most packages support this type of functionality in one shape or form)
- Build an environment that is effectively indestructible
HSM Tapes Off-Site
Moving a copy of your HSM tapes off-site means that you need to have a working methodology that allows you to recover the tapes assuming your site ceases to function or exist. For HSM products, this means you need:
- The server hardware
- HBAs
- Storage
- The tape robot and drives
- The HSM software
- The HSM file system metadata and/or tape data
Some HSMs write in a format and/or provide tools that allow you to read the tapes without the file system and/or tape metadata, but that means you have to read all of the tapes in and then re-archive them in order to bring the data back under the HSM's control. Doing this for a petabyte of data would not be what most people call fun and would take a very long time.
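As a rough illustration of "a very long time," here is a sketch of the re-ingest alone; the drive count and the average per-drive rate are assumptions for illustration.

```python
# Rough estimate of re-reading a petabyte from tape. The drive count and the
# average per-drive rate (which bakes in load, seek, and small-file overhead)
# are assumptions for illustration only.

PETABYTE_MB = 1024 ** 3          # 1 PB expressed in MB
DRIVE_RATE_MB_S = 30.0           # assumed effective per-drive rate
DRIVES = 8                       # assumed drives working in parallel

days = PETABYTE_MB / (DRIVE_RATE_MB_S * DRIVES) / 86400
print(f"~{days:.0f} days of continuous reading just to get the data back")
# ...and that ignores the time to re-archive everything under the HSM again.
```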
This method does provide protection against the malicious insider, because the tapes are off-site and not under the control of the HSM software. The problem with this method is that it is often quite difficult to maintain synchronization of the HSM file system and/or tape metadata with the tapes that are moved off-site.
It’s labor intensive to use this method and requires regular testing to ensure the procedures work. Additionally, it’s hard to get back up and running quickly after a disaster given that you must bring the server and other hardware and software up, connect to the tapes, and likely install the tapes into a robot.
HSM Off-Site Mirror
Most HSM software packages allow some type of distribution of the data from the main server. Copies of the files can be moved over a WAN and stored at another site where they are archived. This has some potential advantages, especially if the remote site can be made to appear within the same IP subnet:
- The remote site can be anywhere, whether 1 kilometer or 10,000 kilometers away. All you need is the WAN bandwidth, and since you are writing to another server rather than directly to a remote tape drive, latency is not a big issue (see the sketch after this list)
- Network security is likely maintained given that the remote system is within your same subnet
- The remote site is fully functional in case of a disaster and will not need hardware, software, or manual intervention
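To put a number on the latency point, here is a small bandwidth-delay sketch; the RTT values and the target replication rate are assumptions. Because the HSM copy goes to another server as a bulk, asynchronous transfer, distance mostly dictates TCP window and buffer sizes rather than whether the scheme works at all.

```python
# Bandwidth-delay product sketch: how much data must be "in flight" to keep a
# long-haul link busy at a target rate? RTTs and the rate are assumptions.

TARGET_MB_S = 10.0   # assumed replication rate to sustain

for km, rtt_ms in [(1, 0.1), (1000, 15.0), (10000, 120.0)]:
    window_kb = TARGET_MB_S * (rtt_ms / 1000.0) * 1024   # KB needed in flight
    print(f"{km:>6} km (~{rtt_ms:5.1f} ms RTT): ~{window_kb:,.0f} KB TCP window")
# Distance raises the window/buffer requirement, but since this is a bulk copy
# to a remote server rather than a synchronous mirror, it does not change
# whether the approach works.
```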
On the other hand, this method does have some issues that must be addressed and/or architected around, such as:
- The WAN connection should be encrypted. A number of switches support this technology, so this should not be hard
- Who is going to maintain the system, hardware, and software at the remote site?
- Who has access to the remote site?
Even if the proper security precautions are taken, a malicious insider could still gain access to both sites and destroy your data. The only good news is that with most of the data on tape, you can always read all of the tapes back in, but as stated earlier, for a petabyte or more of data this is not fun.
Disaster-Proof/Resistant Facility
I live on the northern edge of the tornado belt in Minnesota and am aware of a number of local companies that have “disaster-proof facilities.” Most of these sites are located underground and are surrounded by lots and lots of concrete and steel. They have separate generators from the main site and are connected to multiple power grids. They should have multiple WAN connections as well, but as Northwest Airlines found out a few years ago, having separate lines in the same conduit does not help if someone accidentally cuts the conduit.
The cost of these types of facilities is expensive, with the price varying depending on what disasters are lurking. You have many types of disasters to consider, including:
- Tornadoes
- Hurricanes
- Floods
- Earthquakes
- Lightning
- Other Acts of God
- Terrorism
Which types of disaster can you protect against, and how much will it cost to do so? I believe that no structure is completely safe anywhere at any time, but how great is the risk? Can you accept odds over 10 years of 1 billion to 1, or do you need 1 trillion to 1? In most cases, the latter is never achievable even with a hardened structure, so what is an acceptable risk to your organization?
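As a hedged illustration of what those odds imply, the annual probabilities below are made-up inputs; the arithmetic simply compounds them over a decade.

```python
# Illustrative risk arithmetic: compounding an assumed annual chance of a
# site-destroying event into cumulative odds over a 10-year window.

YEARS = 10
for label, annual_p in [("1 in 1,000,000 per year", 1e-6),
                        ("1 in 100,000,000 per year", 1e-8)]:
    cumulative = 1 - (1 - annual_p) ** YEARS
    print(f"{label}: ~1 in {1 / cumulative:,.0f} over {YEARS} years")
# Odds of 1 in a trillion over a decade would require an annual probability of
# roughly 1e-13, which no real facility can honestly claim.
```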
Hardened structures have significant advantages in the area of management, specifically:
- All of the data is locally managed
- All of the hardware, software, and control is local, which reduces cost
- WAN connections are not needed, which improves security
- Testing disaster recovery is far easier
- Recovery is much faster
Unfortunately, you still have the problem of the internal malicious user who could destroy the system.
Conclusions
Disaster recovery for large HPC sites using HSM has no easy answers. There was an old saying about RAID in the early and mid 1990s: you can have it fast, cheap, or reliable — pick any two. The same can be said for disaster recovery today: you can have it simple, cheap, or easy to recover — pick any two. Over time, as with RAID, I believe this will change, but for now you will have to make difficult choices and compromises.
Clearly understanding the features of the HSM being used (or considered) is a critical part of any disaster recovery plan. Different vendors have different features, which can dictate some of your choices. This choice can also become a big gotcha, in that if you develop a DR plan around a specific HSM, migrating from that HSM becomes very difficult, both in terms of moving the data and the potential for having to devise a new disaster recovery plan.
You need to make sure that the HSM will meet your needs for today as well as for the future. That means you must understand the HSM vendor’s plans for hardware support, software support, features, performance, and scalability, and ensure that their plans match yours. Migration from one HSM vendor to another is at best difficult and at worst could become your biggest nightmare.