Overcoming Disaster Recovery Obstacles for HPC Sites Page 3
I live on the northern edge of the tornado belt in Minnesota and am aware of a number of local companies that have “disaster-proof facilities.” Most of these sites are located underground and are surrounded by lots and lots of concrete and steel. They have separate generators from the main site and are connected to multiple power grids. They should have multiple WAN connections as well, but as Northwest Airlines found out a few years ago, having separate lines in the same conduit does not help if someone accidentally cuts the conduit.
These types of facilities are expensive, with the price depending on which disasters are lurking. You have many types of disasters to consider, including:
- Other Acts of God
Which types of disaster can you protect against, and how much will it cost to do so? I believe that no structure is completely safe anywhere at any time, but how great is the potential for a disaster? Can you handle a 10-year risk of 1 in 1 billion, or do you need 1 in 1 trillion? In most cases, the latter is not achievable even with a hardened structure, so what is an acceptable risk to your organization?
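To make the acceptable-risk question a little more concrete, here is a minimal Python sketch of the arithmetic. It assumes that a disaster in one year is independent of a disaster in any other year, which is a simplification, and the function names, the annual probability, and the dollar figure are all hypothetical values chosen for illustration, not numbers from any real site.

```python
import math


def cumulative_risk(annual_probability: float, years: int) -> float:
    """Probability of at least one disaster in `years` years.

    Assumes each year is independent (a simplifying assumption).
    Uses log1p/expm1 so very small probabilities keep their precision.
    """
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if annual_probability == 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed in a numerically stable way
    return -math.expm1(years * math.log1p(-annual_probability))


def expected_loss(probability: float, loss_if_disaster: float) -> float:
    """Expected loss: probability of the disaster times what it would cost."""
    return probability * loss_if_disaster


# Hypothetical inputs: a 1-in-10-billion chance per year, and a site whose
# total loss would cost 500 million (in whatever currency you budget in).
ten_year_risk = cumulative_risk(1e-10, 10)
print(f"10-year risk: {ten_year_risk:.3e}")
print(f"Expected loss over 10 years: {expected_loss(ten_year_risk, 500e6):.2f}")
```

Comparing the expected loss with the extra cost of a hardened facility is one rough way to frame the decision, although it says nothing about whether your organization could survive the loss if it did happen.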
Hardened structures have significant advantages in the area of management, specifically:
- All of the data is locally managed
- All of the hardware, software, and control are local, which reduces cost
- WAN connections are not needed, which improves security
- Testing disaster recovery is far easier
- Recovery is much faster
Unfortunately, you still have the problem of a malicious internal user who could destroy the system.
Disaster recovery for large HPC sites using HSM has no easy answers. There was an old saying about RAID in the early and mid-1990s: you can have it fast, cheap, or reliable; pick any two. The same can be said for disaster recovery today: you can have it simple, cheap, or easy to recover; pick any two. Over time, as with RAID, I believe this will change, but for now you will have to make difficult choices and compromises.
Clearly understanding the features of the HSM being used (or considered) is a critical part of any disaster recovery plan. Different vendors offer different features, which can dictate some of your choices. This choice can also become a big gotcha: if you develop a DR plan around a specific HSM, migrating away from that HSM becomes very difficult, both because the data must be moved and because you may have to devise a new disaster recovery plan.
You need to make sure that the HSM will meet your needs for today as well as for the future. That means you must understand the HSM vendor’s plans for hardware support, software support, features, performance, and scalability, and ensure that their plans match yours. Migration from one HSM vendor to another is at best difficult and at worst could become your biggest nightmare.