Overcoming Disaster Recovery Obstacles for HPC Sites
DR Issues for HPC Sites
Most HPC sites are in a difficult situation as a result of:
- Having huge amounts of data generated over time under HSM control
- The inability to mirror RAIDs, as most of the data is stored on tape, and the RAIDs are just a cache used by the HSM system
- The inability to use remote tape drives with compressible data over IP network connections unless enough network bandwidth is available; that bandwidth is expensive and rarely cost-effective given the total amount of data typically generated per day
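To make the bandwidth point concrete, the following is a rough back-of-the-envelope sketch of the sustained link rate needed to ship one day's worth of new data off-site. The function name, the daily volumes, and the 80% link efficiency figure are illustrative assumptions, not measurements from any particular site, and the estimate uses the uncompressed data size.

```python
def required_mbps(tb_per_day, window_hours=24.0, link_efficiency=0.8):
    """Estimate the sustained link rate (Mbit/s) needed to move one day's
    data within the given transfer window.

    All inputs are hypothetical planning numbers:
      tb_per_day      -- new data per day, in decimal terabytes
      window_hours    -- hours per day available for the transfer
      link_efficiency -- fraction of the nominal link rate actually usable
    """
    if tb_per_day < 0 or window_hours <= 0 or not 0 < link_efficiency <= 1:
        raise ValueError("inputs out of range")
    bits = tb_per_day * 1e12 * 8          # decimal terabytes -> bits
    seconds = window_hours * 3600         # transfer window in seconds
    return bits / seconds / link_efficiency / 1e6


if __name__ == "__main__":
    # Example daily volumes chosen only for illustration.
    for tb in (1, 10, 50):
        print(f"{tb:>3} TB/day -> {required_mbps(tb):,.0f} Mbit/s sustained")
```

Comparing the result with the cost of a WAN circuit of that size gives a first indication of whether remote tape over IP is affordable for a given site.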
Most sites have a few choices for establishing an effective Disaster Recovery strategy, including:
- Move a copy of the HSM tapes to an off-site location
- Set up a second copy of the HSM hardware and software at an off-site location and replicate the data to it using the HSM software (most packages support this type of functionality in one form or another)
- Build an environment that is effectively indestructible
HSM Tapes Off-Site
Moving a copy of your HSM tapes off-site means that you need a working methodology that allows you to recover the data on those tapes if your site ceases to function or exist. For HSM products, this means you need:
- The server hardware
- The tape robot and drives
- The HSM software
- The HSM file system metadata and/or tape metadata
Some HSMs write in a format and/or provide tools that allow you to read the tapes without the file system and/or tape metadata, but that means you have to read all of the tapes in and then re-archive them in order to bring the data back under HSM control. Doing this for a petabyte of data would take a very long time and is not what most people would call fun.
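As a rough illustration of why reading a petabyte back from tape is slow, the sketch below estimates the elapsed time for the read pass alone. The drive count, per-drive transfer rate, and utilization factor are hypothetical values to be replaced with your own; the estimate ignores tape mounts, positioning, and the re-archive step, so real recoveries will take longer.

```python
def tape_read_days(total_tb, drives, mb_per_sec_per_drive, utilization=0.7):
    """Estimate the days needed to read total_tb (decimal terabytes) from tape.

    drives               -- number of tape drives reading in parallel
    mb_per_sec_per_drive -- sustained transfer rate per drive, in MB/s
    utilization          -- fraction of time the drives actually stream data
    All values are assumptions supplied by the caller.
    """
    if drives <= 0 or mb_per_sec_per_drive <= 0 or not 0 < utilization <= 1:
        raise ValueError("inputs out of range")
    total_bytes = total_tb * 1e12
    aggregate_bytes_per_sec = drives * mb_per_sec_per_drive * 1e6 * utilization
    return total_bytes / aggregate_bytes_per_sec / 86400  # seconds -> days


if __name__ == "__main__":
    # One petabyte, ten drives at 100 MB/s each (illustrative numbers only).
    print(f"{tape_read_days(1000, 10, 100):.1f} days for the read pass alone")
```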
This method does provide protection against the malicious insider, because the tapes are off-site and not under the control of the HSM software. The problem with this method is that it is often quite difficult to maintain synchronization of the HSM file system and/or tape metadata with the tapes that are moved off-site.
This method is labor intensive and requires regular testing to ensure the procedures work. Additionally, it is hard to get back up and running quickly after a disaster, given that you must bring up the server and the other hardware and software, connect to the tapes, and likely load the tapes into a robot.
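One piece of the regular testing described above can be automated: checking that the tape volumes referenced by the saved HSM metadata match the tapes actually held off-site. The sketch below assumes you can export both lists as plain volume labels; how you obtain them depends on your HSM package and vaulting process, and the labels shown are made up.

```python
def reconcile(metadata_volumes, offsite_volumes):
    """Compare the volume labels known to the saved HSM metadata with the
    labels recorded in the off-site vault inventory.

    Both arguments are iterables of label strings (hypothetical exports).
    Returns the labels that appear on only one side.
    """
    known = set(metadata_volumes)
    vaulted = set(offsite_volumes)
    return {
        # Referenced by the metadata but not found in the vault inventory.
        "missing_offsite": sorted(known - vaulted),
        # In the vault but unknown to the metadata copy (stale metadata?).
        "unknown_to_metadata": sorted(vaulted - known),
    }


if __name__ == "__main__":
    report = reconcile(
        ["A00001", "A00002", "A00003"],   # labels from the metadata export
        ["A00002", "A00003", "A00009"],   # labels from the vault inventory
    )
    for problem, labels in report.items():
        print(problem, labels)
```

An empty result on both sides does not prove the tapes are readable, so a check like this complements, rather than replaces, periodic recovery drills.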
HSM Off-Site Mirror
Most HSM software packages allow some type of distribution of the data from the main server. Copies of the files can be moved over a WAN and stored at another site where they are archived. This has some potential advantages, especially if the remote site can be made to appear within the same IP subnet:
- The remote site can be anywhere, whether 1 kilometer or 10,000 kilometers away. All you need is the WAN bandwidth, and since you are writing to another server, the latency is not a big issue
- Network security is likely maintained, given that the remote system is within the same subnet as your main site
- The remote site is fully functional in case of a disaster and will not need additional hardware, software, or manual intervention
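Because the remote copy is written to another server rather than directly to a remote tape drive, the WAN link only has to keep up with the data on average. A simple way to reason about this is to track how much data is waiting to be replicated hour by hour. The sketch below does that for a hypothetical ingest profile and link capacity; both are assumptions for illustration, and the model ignores protocol overhead and retries.

```python
def replication_backlog(ingest_gb_per_hour, link_gb_per_hour):
    """Walk an hourly ingest profile and return (peak_backlog, final_backlog)
    in gigabytes, given how much the WAN link can move per hour.

    ingest_gb_per_hour -- sequence of new data per hour (hypothetical values)
    link_gb_per_hour   -- usable WAN capacity per hour (hypothetical value)
    """
    if link_gb_per_hour < 0:
        raise ValueError("link capacity must be non-negative")
    backlog = 0.0
    peak = 0.0
    for produced in ingest_gb_per_hour:
        # New data joins the queue; the link drains up to its hourly capacity.
        backlog = max(0.0, backlog + produced - link_gb_per_hour)
        peak = max(peak, backlog)
    return peak, backlog


if __name__ == "__main__":
    profile = [100, 300, 300, 50, 0]      # GB written in each hour
    peak, remaining = replication_backlog(profile, link_gb_per_hour=150)
    print(f"peak backlog {peak:.0f} GB, still unreplicated {remaining:.0f} GB")
```

The peak backlog is the amount of data that exists only at the main site at the worst moment, which is a useful number to weigh against the cost of a faster link.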
On the other hand, this method does have some issues that must be addressed in the architecture, such as:
- The WAN connection should be encrypted. A number of switches support this technology, so this should not be hard
- Who is going to maintain the system, hardware, and software at the remote site?
- Who has access to the remote site?
Even if the proper security precautions are taken, a malicious insider could still get access to both sites and destroy your data. The only good news is that with most of the data on tape, you can always read all of the tapes back in, but as stated earlier, for a petabyte or more of data this is not fun.