Since 9/11 all types of organizations have been talking about, struggling with, and implementing various disaster recovery plans. Even before 9/11 many sites with critical data had already implemented plans that allowed management to sleep at night knowing their companies could recover from a disaster.
Providing a disaster recovery plan for your home, small office, large office, or even a large multi-national corporation is generally different from doing so for a large High Performance Computing (HPC) site. With hundreds of terabytes or more of data under Hierarchical Storage Management (HSM), these sites cannot simply rely on mirrored RAID.
Background
If you have been reading this column for a while, you might remember this table. It looks at storage performance and density changes over the last 30 years for high performance tape and disk, using the best technology available at the time. The specifications were gleaned from Seagate's web site (for the Fibre Channel disk information) and StorageTek's web site (for the tape information).
| Technology Change Over Last 30 Years | Number of Times Increase |
| --- | --- |
| Tape Density | 1333 |
| Tape Transfer Rate | 24 |
| Disk Density | 2250 |
| Disk RAID-5 8+1 LUN Density | 18000 |
| Disk Transfer Rate, Single Drive Average | 21 |
| Disk Transfer Rate, RAID LUN Average | 133 |
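The RAID rows in the table follow from the single-device rows. A quick consistency check, assuming (as the 8+1 RAID-5 geometry implies) that the LUN-level density gain is the single-drive gain multiplied by the eight data drives in the stripe:

```python
# Growth factors from the table (times increase over ~30 years)
single_disk_density = 2250
raid5_lun_density = 18000

# An 8+1 RAID-5 LUN stripes data across 8 data drives, so the
# LUN density gain is 8x the single-drive gain.
data_drives = 8
assert raid5_lun_density == data_drives * single_disk_density

single_disk_rate = 21
raid_lun_rate = 133
# Transfer rate scales sub-linearly: roughly 6.3x rather than the
# ideal 8x, reflecting controller and parity overhead.
print(raid_lun_rate / single_disk_rate)  # ≈ 6.33
```

Density aggregates cleanly across the stripe; transfer rate does not, which is one more reason LUN bandwidth has lagged LUN capacity.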
Tape technology clearly has not kept pace with the increases in density and performance of disk technology, nor have tape load and position times improved enough to make writing small files faster. I became aware of a site that was going to use remote tape mirroring over ATM OC-3 to write the third copy of its data at its DR facility.
I doubted that remote tape mirroring would work, given that the tape would not be streaming. If the drive could have streamed, the data rate with compression of the StorageTek T9940B drive would have been about 45 MB/sec. Given that ATM OC-3 can only run at a peak of about 15 MB/sec, and taking into account all of the overhead associated with ATM, TCP, and congestion, we estimated a sustained rate of about 5-8 MB/sec. I had a bad feeling that we could have problems.
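A back-of-the-envelope calculation shows the scale of the mismatch. The 45 MB/sec drive rate and the 5-8 MB/sec sustained estimate come from the numbers above; the daily data volume is a purely hypothetical figure for illustration:

```python
# Rates from the article (MB/sec)
drive_streaming_rate = 45   # StorageTek T9940B with compression
oc3_sustained = 6.5         # midpoint of the 5-8 MB/sec estimate after
                            # ATM/TCP overhead and congestion

# The link delivers only a fraction of what the drive needs to stream:
shortfall = oc3_sustained / drive_streaming_rate
print(f"Link supplies {shortfall:.0%} of the streaming rate")  # 14%

# Hypothetical 500 GB of new HSM data per day (illustrative only):
daily_gb = 500
hours_to_mirror = daily_gb * 1024 / oc3_sustained / 3600
print(f"{hours_to_mirror:.1f} hours/day just to copy")  # ~21.9 hours
```

With the link supplying only about a seventh of the drive's streaming rate, the drive would spend most of its time stopped and repositioning, which is exactly the failure mode described next.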
I decided to contact one of the world's leading media experts (and a personal friend of mine) at Imation, Jim Goins, who told me that for most tape drives, both the tape and the drive expect the data to be streamed. He suggested that if we wrote the data the way the customer proposed over OC-3, we would have to re-tension each tape after writing it. This was virtually impossible, though, as the HSM application did not support this function, and we would have had to write and maintain special code to re-tension tapes after they were written.
So writing remote tapes without streaming the data was out of the question, and the additional cost of higher-speed network connections that would allow the data to stream was not going to be cost-effective given the amount of data generated each day. This presented a dilemma for the site. What should they do?
DR Issues for HPC Sites
Most HPC sites are in a difficult situation as a result of:
Most sites have a few choices in terms of establishing an effective strategy for Disaster Recovery, including:
HSM Tapes Off-Site
Moving a copy of your HSM tapes off-site means that you need to have a working methodology that allows you to recover the tapes assuming your site ceases to function or exist. For HSM products, this means you need:
Some HSMs write in a format and/or provide tools that allow you to read the tapes without the file system and/or tape metadata, but that means you have to read all of the tapes in and then re-archive them in order to bring the data back under the HSM's control. Doing this for a petabyte of data would not be what most people call fun and would take a very long time.
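To put "a very long time" in perspective, a rough estimate of the read-back alone; only the petabyte figure comes from the text, while the drive rate and drive count are assumptions for illustration:

```python
# Rough time to read 1 PB back through an HSM, assuming (hypothetically)
# drives streaming at 30 MB/sec and a pool of 10 drives running flat out.
petabyte_mb = 1024 ** 3   # 1 PB expressed in MB
drive_rate = 30           # MB/sec per drive (assumed)
drives = 10               # concurrent drives (assumed)

seconds = petabyte_mb / (drive_rate * drives)
days = seconds / 86400
print(f"{days:.0f} days of continuous reading")  # ~41 days
```

And that is only the read pass; re-archiving the data under the HSM doubles the tape work, before counting mount, position, and operator time.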
This method does provide protection against the malicious insider, because the tapes are off-site and not under the control of the HSM software. The problem with this method is that it is often quite difficult to maintain synchronization of the HSM file system and/or tape metadata with the tapes that are moved off-site.
This method is labor intensive and requires regular testing to ensure the procedures work. Additionally, it's hard to get back up and running quickly after a disaster, given that you must bring up the server and other hardware and software, connect to the tapes, and likely install the tapes into a robot.
HSM Off-Site Mirror
Most HSM software packages allow some type of distribution of the data from the main server. Copies of the files can be moved over a WAN and stored at another site where they are archived. This has some potential advantages, especially if the remote site can be made to appear within the same IP subnet:
On the other hand, this method does have some issues that must be addressed and/or architected around, such as:
Even if the proper security precautions are taken, the malicious inside user could still get access to both sites and destroy your data. The only good news is that with most of the data on tape, you can always read all of the tapes in, but as stated earlier, for a petabyte or more of data this is not fun.
Disaster-Proof/Resistant Facility
I live on the northern edge of the tornado belt in Minnesota and am aware of a number of local companies that have “disaster-proof facilities.” Most of these sites are located underground and are surrounded by lots and lots of concrete and steel. They have separate generators from the main site and are connected to multiple power grids. They should have multiple WAN connections as well, but as Northwest Airlines found out a few years ago, having separate lines in the same conduit does not help if someone accidentally cuts the conduit.
These types of facilities are expensive, with the price varying depending on which disasters are lurking. You have many types of disasters to consider, including:
Which types of disaster can you protect against, and how much will it cost to do so? I believe that no structure anywhere is completely safe at all times, but how great is the actual risk? Can you accept a risk over 10 years of 1 billion to 1, or do you need 1 trillion to 1? In most cases the latter is never achievable, even with a hardened structure, but what is an acceptable risk to your organization?
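One way to reason about whether such odds justify the cost is to convert the 10-year risk target into an expected annual loss. A minimal sketch, in which the asset value is a hypothetical figure:

```python
# Convert a 10-year catastrophic-loss probability into an expected
# annual loss, to make "1 billion to 1" concrete.
ten_year_risk = 1e-9               # 1 in a billion over 10 years
annual_risk = ten_year_risk / 10   # good approximation for tiny risks

asset_value = 100_000_000          # $100M at stake (assumed figure)
expected_annual_loss = annual_risk * asset_value
print(f"Expected annual loss: ${expected_annual_loss:.4f}")  # $0.0100
```

At those odds the expected loss is a penny a year, so any hardening dollars are really buying peace of mind and regulatory cover rather than a measurable reduction in expected loss; the honest question is which risk level your organization can defend.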
Hardened structures have significant advantages in the area of management, specifically:
Unfortunately, you do still have the problem of the internal malicious user that could destroy the system.
Conclusions
Disaster recovery for large HPC sites using HSM has no easy answers. There was an old saying about RAID in the early and mid 1990s: you can have it fast, cheap, or reliable — pick any two. The same can be said for disaster recovery today, in that you can have it simple, cheap, or easy to recover — pick any two. Over time, as with RAID, I believe this will change, but for now you will have to make difficult choices and compromises.
Clearly understanding the features of the HSM being used (or considered) is a critical part of any disaster recovery plan. Different vendors have different features, which can dictate some of your choices. This choice can also become a big gotcha, in that if you develop a DR plan around a specific HSM, migrating from that HSM becomes very difficult, both in terms of moving the data and the potential for having to devise a new disaster recovery plan.
You need to make sure that the HSM will meet your needs for today as well as for the future. That means you must understand the HSM vendor’s plans for hardware support, software support, features, performance, and scalability, and ensure that their plans match yours. Migration from one HSM vendor to another is at best difficult and at worst could become your biggest nightmare.
Henry Newman has been a contributor to TechnologyAdvice websites for more than 20 years. His career in high-performance computing, storage and security dates to the early 1980s, when Cray was the name of a supercomputing company rather than an entry in Urban Dictionary. After nearly four decades of architecting IT systems, he recently retired as CTO of a storage company’s Federal group, but he rather quickly lost a bet that he wouldn't be able to stay retired by taking a consulting gig in his first month of retirement.