Overcoming Disaster Recovery Obstacles for HPC Sites



Since 9/11 all types of organizations have been talking about, struggling with, and implementing various disaster recovery plans. Even before 9/11 many sites with critical data had already implemented plans that allowed management to sleep at night knowing their companies could recover from a disaster.

Providing a disaster recovery plan for a home, a small or large office, or even a large multinational corporation is generally different from doing so for a large High Performance Computing (HPC) site. HPC sites typically cannot use mirrored RAID because they have hundreds of terabytes or more of data on tape under Hierarchical Storage Management (HSM) control. So the question is, how do you provide disaster recovery for sites that do not use mirrored RAID hardware and that have huge tape libraries?


If you have been reading this column for a while, you might remember this table. It looks at storage performance and density changes over the last 30 years for high performance tape and disk, using the best technology available at the time. The Fibre Channel drive specifications were gleaned from Seagate's web site (for disk information) and StorageTek's web site (for the tape information).

Technology Change Over Last 30 Years      Number of Times Increase
Tape Density                              1333
Tape Transfer Rate                        24
Disk Density                              2250
Disk RAID-5 8+1 Density LUN               18000
Disk Transfer Rate Single Average         21
Disk RAID Transfer Rate LUN Average       133
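
The growth factors in the table can be compared directly by dividing each disk figure by the matching tape figure. The following is only a minimal sketch of that arithmetic; the dictionary layout and the function name disk_vs_tape are illustrative choices, and the numbers are copied from the table.

```python
# Growth factors over roughly 30 years, copied from the table above.
TAPE = {"density": 1333, "transfer_rate": 24}
DISK = {
    "single disk": {"density": 2250, "transfer_rate": 21},
    "RAID-5 8+1 LUN": {"density": 18000, "transfer_rate": 133},
}


def disk_vs_tape(disk_growth, tape_growth):
    """Return how many times larger the disk growth factor is than the tape one."""
    return disk_growth / tape_growth


for name, growth in DISK.items():
    for metric in ("density", "transfer_rate"):
        ratio = disk_vs_tape(growth[metric], TAPE[metric])
        print(f"{name} {metric}: {ratio:.2f}x the tape growth")
```

A ratio above 1 means the disk figure grew faster than the tape figure on that measure; a ratio below 1 means tape grew faster.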

Tape technology clearly has not kept pace with the increase in density and performance of disk technology, nor have tape load and position times improved enough to speed up the writing of small files. With that in mind, I became aware of a site that was going to use remote tape mirroring over ATM OC-3 to write the third copy of its data at its DR facility.

I had my doubts that remote tape mirroring would work, given that the tape would not be streaming. Had the drive been able to stream, the data rate of the StorageTek T9940B drive with compression would have been about 45 MB/sec. Given that ATM OC-3 could run at a peak of only about 15 MB/sec, and taking into account the overhead associated with ATM and TCP as well as congestion, we estimated a sustained rate of about 5-8 MB/sec. I had a bad feeling that we could have problems.
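
The mismatch between the drive and the link is easy to check with the figures quoted above. This is a rough sketch, not a model of the drive: the function names are illustrative, and can_stream assumes the drive streams only when the link delivers at least the full streaming rate, which ignores any buffering in the drive.

```python
DRIVE_STREAM_MB_S = 45.0         # T9940B with compression, from the text
OC3_PEAK_MB_S = 15.0             # approximate OC-3 peak, from the text
OC3_SUSTAINED_MB_S = (5.0, 8.0)  # estimated sustained range, from the text


def link_fraction(drive_mb_s, link_mb_s):
    """Fraction of the drive's streaming rate that the link can deliver."""
    return link_mb_s / drive_mb_s


def can_stream(drive_mb_s, link_mb_s):
    """Simplifying assumption: streaming needs the full drive rate from the link."""
    return link_mb_s >= drive_mb_s


cases = [("peak", OC3_PEAK_MB_S)]
cases += [("sustained", rate) for rate in OC3_SUSTAINED_MB_S]

for label, rate in cases:
    share = link_fraction(DRIVE_STREAM_MB_S, rate)
    verdict = "streams" if can_stream(DRIVE_STREAM_MB_S, rate) else "does not stream"
    print(f"OC-3 {label} at {rate:.0f} MB/sec: {share:.0%} of drive rate, {verdict}")
```

Even at the OC-3 peak the link supplies only about a third of what the drive needs, and the estimated sustained rates fall well below that.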

I decided to contact Jim Goins at Imation, one of the world’s leading media experts (and a personal friend of mine). He told me that for most tape drives, both the tape and the drive are designed with the expectation that data will be streamed. He suggested that we would have to re-tension each tape after writing it over OC-3 in the way the customer proposed. This was virtually impossible, though, as the HSM application did not support this function, and we would have had to write and maintain special code to re-tension tapes after they were written.

So writing remote tapes without streaming the data was out of the question, and higher-speed network connections, which would have allowed the data to stream, were not going to be cost effective given the amount of data generated per day. This presented a dilemma for the site. What should they do?

