Overcoming Disaster Recovery Obstacles for HPC Sites


Since 9/11, all types of organizations have been talking about, struggling with, and implementing disaster recovery plans. Even before 9/11, many sites with critical data had already implemented plans that allowed management to sleep at night, knowing their companies could recover from a disaster.

Providing a disaster recovery plan for your home, small office, large office, or even a large multi-national corporation is generally different from doing so for a large High Performance Computing (HPC) site. HPC sites typically have hundreds of terabytes or more of data under Hierarchical Storage Management (HSM) control on tape, so mirrored RAID alone cannot protect the data. The question, then, is how do you provide disaster recovery for sites that cannot rely on mirrored RAID hardware and that have huge tape libraries?

Background

If you have been reading this column for a while, you might remember this table. It looks at storage performance and density changes over the last 30 years for high performance tape and disk, using the best technology available at the time. The Fibre Channel disk drive specifications were gleaned from Seagate’s web site and the tape specifications from StorageTek’s web site.

Technology Change Over the Last 30 Years        Number of Times Increase
Tape Density                                    1333
Tape Transfer Rate                              24
Disk Density                                    2250
Disk RAID-5 8+1 LUN Density                     18000
Disk Transfer Rate (single drive, average)      21
Disk RAID LUN Transfer Rate (average)           133

Tape technology clearly has not kept pace with the increases in density and performance of disk technology, nor have tape load and position times improved enough to make writing small files appreciably faster. I became aware of a site that was going to use remote tape mirroring over ATM OC-3 to write the third copy of their data at their disaster recovery (DR) facility.

I had my doubts that remote tape mirroring would work, given that the tape would not be streaming. If the drive could have streamed, the data rate with compression of the StorageTek T9940B drive would have been about 45 MB/sec. Given that ATM OC-3 could only run at a peak of about 15 MB/sec, and taking into account all of the overhead associated with ATM, TCP, and congestion, we estimated a sustained rate of about 5-8 MB/sec. I had a bad feeling that we would have problems.
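
To make the mismatch concrete, here is a minimal back-of-the-envelope sketch in Python using the figures above. The drive and link rates come from the paragraph above; the efficiency factor is an assumption chosen only to land in the 5-8 MB/sec range we estimated.

```python
# Rough check of why the T9940B could not stream over ATM OC-3.
# The efficiency factor is an assumption for illustration only.

oc3_peak_mb_s = 15.0              # effective peak of ATM OC-3 quoted above
assumed_efficiency = 0.45         # assumed loss to ATM, TCP, and congestion
oc3_sustained_mb_s = oc3_peak_mb_s * assumed_efficiency   # ~6.8 MB/sec

t9940b_compressed_mb_s = 45.0     # T9940B streaming rate with compression

shortfall = t9940b_compressed_mb_s / oc3_sustained_mb_s
print(f"Sustained OC-3 estimate: {oc3_sustained_mb_s:.1f} MB/sec")
print(f"The drive needs roughly {shortfall:.0f}x that rate to keep streaming")
```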

I decided to contact one of the world’s leading media experts (and a personal friend of mine) at Imation, Jim Goins, who told me that most tape drives and media are designed with the expectation that the data will be streamed. He suggested that we would have to re-tension the tape after writing the data the way the customer proposed over OC-3. This was virtually impossible, though, as the HSM application did not support this function, and we would have had to write and maintain special code to re-tension tapes after they were written.

So writing remote tapes without streaming the data was out of the question, and the additional cost of higher speed network connections, which would have allowed the data to stream, was not going to be cost effective given the amount of data generated each day. This presented a dilemma for the site. What should they do?


DR Issues for HPC Sites

Most HPC sites are in a difficult situation as a result of:

  1. Having huge amounts of data generated over time under HSM control
  2. The inability to protect the data by mirroring RAID, as most of the data is stored on tape and the RAID is just a cache used by the HSM system
  3. The inability to stream compressible data to remote tape drives over IP network connections unless enough expensive network bandwidth is available, which is not cost effective given the typical total amount of data generated per day

Most sites have a few choices in terms of establishing an effective strategy for Disaster Recovery, including:

  1. Move a copy of the HSM tapes to an off-site location
  2. Create a second copy of the HSM software and hardware that is then replicated to an off-site location using the HSM software (most packages support this type of functionality in one form or another)
  3. Build an environment that is effectively indestructible

HSM Tapes Off-Site

Moving a copy of your HSM tapes off-site means that you need a working methodology that allows you to recover from the tapes if your site ceases to function or exist. For HSM products, this means you need:

  1. The server hardware
  2. HBAs
  3. Storage
  4. The tape robot and drives
  5. The HSM software
  6. The HSM file system metadata and/or tape data

Some HSMs write in a format and/or provide tools that allow you to read the tapes without the file system and/or tape metadata, but that means you have to read all of the tapes in and then re-archive the data in order to have it back under HSM control. Doing this for a petabyte of data would not be what most people call fun, and it would take a very long time.
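
A rough estimate makes the point. The drive count, per-drive rate, and efficiency below are assumptions for illustration only; the real numbers depend on the robot, the HSM, and the file sizes involved.

```python
# Hypothetical estimate of re-reading a petabyte of data from tape.
# All parameters are assumptions for illustration.

DATA_TB = 1000                 # one petabyte, expressed in terabytes
drives = 8                     # assumed tape drives working in parallel
drive_rate_mb_s = 30.0         # assumed sustained read rate per drive
efficiency = 0.5               # assumed loss to mounts, positioning, small files

aggregate_mb_s = drives * drive_rate_mb_s * efficiency
seconds = (DATA_TB * 1_000_000) / aggregate_mb_s   # TB -> MB
days = seconds / 86_400
print(f"Aggregate rate: {aggregate_mb_s:.0f} MB/sec, about {days:.0f} days "
      f"just to read the data back, before re-archiving any of it")
```

With these assumptions, the read-back alone takes roughly three months, which is why shipping tapes off-site is not by itself a quick path back to a working HSM.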

This method does provide protection against the malicious insider, because the tapes are off-site and not under the control of the HSM software. The problem with this method is that it is often quite difficult to maintain synchronization of the HSM file system and/or tape metadata with the tapes that are moved off-site.

This method is labor intensive and requires regular testing to ensure the procedures actually work. Additionally, it’s hard to get back up and running quickly after a disaster, given that you must bring up the server and other hardware and software, connect to the tapes, and likely load the tapes into a robot.

HSM Off-Site Mirror

Most HSM software packages allow some type of distribution of the data from the main server. Copies of the files can be moved over a WAN and stored at another site where they are archived. This has some potential advantages, especially if the remote site can be made to appear within the same IP subnet:

  1. The remote site can be anywhere, whether 1 kilometer or 10,000 kilometers away. All you need is the WAN bandwidth, and since you are writing to another server rather than streaming a tape drive, latency is not a big issue (see the sketch after this list)
  2. Network security is likely maintained, given that the remote system is within your same subnet
  3. The remote site is fully functional in case of a disaster and will not need additional hardware, software, or manual intervention
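
The "latency is not a big issue" point comes down to TCP window sizing: as long as the sender can keep a bandwidth-delay product's worth of data in flight, distance mostly affects buffering rather than throughput. The link rate and round-trip times below are assumed values for illustration.

```python
# Sketch of why distance matters less for server-to-server replication:
# throughput holds up as long as the TCP window covers the bandwidth-delay
# product. Link rate and RTTs are assumed values for illustration.

def min_window_bytes(link_mb_s: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_mb_s * 1_000_000 * (rtt_ms / 1000.0)

link_mb_s = 10.0   # assumed usable WAN bandwidth in MB/sec
for distance_km, rtt_ms in [(1, 1), (1000, 20), (10000, 150)]:
    window_kb = min_window_bytes(link_mb_s, rtt_ms) / 1024
    print(f"{distance_km:>6} km, RTT {rtt_ms:>3} ms: "
          f"need ~{window_kb:.0f} KB of TCP window to keep the link full")
```

Contrast that with the remote tape drive case earlier, where the drive itself must see a steady 45 MB/sec to keep streaming; no amount of window tuning fixes a link that simply is not fast enough.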

On the other hand, this method does have some issues that must be addressed and/or architected around, such as:

  1. The WAN connection should be encrypted. A number of switches support this technology, so this should not be hard
  2. Who is going to maintain the system, hardware, and software at the remote site?
  3. Who has access to the remote site?

Even if the proper security precautions are taken, a malicious inside user could still get access to both sites and destroy your data. The only good news is that with most of the data on tape, you can always read all of the tapes back in, but as stated earlier, for a petabyte or more of data this is not fun.


Disaster-Proof/Resistant Facility

I live on the northern edge of the tornado belt in Minnesota and am aware of a number of local companies that have “disaster-proof” facilities. Most of these sites are located underground and are surrounded by lots and lots of concrete and steel. They have their own generators, separate from the main site, and are connected to multiple power grids. They should have multiple WAN connections as well, but as Northwest Airlines found out a few years ago, having separate lines in the same conduit does not help if someone accidentally cuts the conduit.

These types of facilities are expensive, with the price varying depending on which disasters you are protecting against. You have many types of disasters to consider, including:

  • Tornadoes
  • Hurricanes
  • Floods
  • Earthquakes
  • Lightning
  • Other Acts of God
  • Terrorism

Which types of disaster can you protect against, and how much will it cost to do so? I believe that no structure is completely safe anywhere at any time, but how great is the risk? Can you accept a 10-year risk of 1 in a billion, or do you need 1 in a trillion? In most cases, the latter is never achievable, even with a hardened structure, but what is an acceptable risk to your organization?
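
For readers who think in annual terms, the 10-year figures compound from the annual ones. The annual probabilities below are purely hypothetical and are there only to show the arithmetic.

```python
# Convert a hypothetical annual probability of site destruction into a
# 10-year risk. The annual figures are made-up values for illustration.

def ten_year_risk(annual_probability: float, years: int = 10) -> float:
    """Probability of at least one destructive event over the given period."""
    return 1.0 - (1.0 - annual_probability) ** years

for annual in (1e-10, 1e-12):
    print(f"annual risk {annual:.0e} -> 10-year risk ~{ten_year_risk(annual):.1e}")
```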

Hardened structures have significant advantages in the area of management, specifically:

  1. All of the data is locally managed
  2. All of the hardware, software, and control is local, which reduces cost
  3. WAN connections are not needed, which improves security
  4. Testing disaster recovery is far easier
  5. Recovery is much faster

Unfortunately, you still have the problem of the internal malicious user who could destroy the system.

Conclusions

Disaster recovery for large HPC sites using HSM has no easy answers. There was an old saying about RAID in the early and mid 1990s: you can have it fast, cheap, or reliable; pick any two. The same can be said for disaster recovery today: you can have it simple, cheap, or easy to recover from; pick any two. Over time, as with RAID, I believe this will change, but for now you will have to make difficult choices and compromises.

Clearly understanding the features of the HSM being used (or considered) is a critical part of any disaster recovery plan. Different vendors have different features, which can dictate some of your choices. This choice can also become a big gotcha, in that if you develop a DR plan around a specific HSM, migrating from that HSM becomes very difficult, both in terms of moving the data and the potential for having to devise a new disaster recovery plan.

You need to make sure that the HSM will meet your needs for today as well as for the future. That means you must understand the HSM vendor’s plans for hardware support, software support, features, performance, and scalability, and ensure that their plans match yours. Migration from one HSM vendor to another is at best difficult and at worst could become your biggest nightmare.



Henry Newman
Henry Newman has been a contributor to TechnologyAdvice websites for more than 20 years. His career in high-performance computing, storage and security dates to the early 1980s, when Cray was the name of a supercomputing company rather than an entry in Urban Dictionary. After nearly four decades of architecting IT systems, he recently retired as CTO of a storage company’s Federal group, but he rather quickly lost a bet that he wouldn't be able to stay retired by taking a consulting gig in his first month of retirement.
