The Day the Cloud Died: Planning for Cloud Failure - Page 2
What is the critical data needed to keep your business up and running? It might be an order and shipping database if you are in retail, or patient records if you are a doctor. That, yikes, brings up HIPAA requirements if you are moving your data around. The key here is that you need a plan. It would be great if the plan could include at least the following:
1. A prioritization of your data so you can download it to your own systems: What is most important and what is least important, and the reasons why. You are likely going to have to prioritize your data movement so that the highest-priority business-critical data gets moved first.
2. A plan B that you can execute against: having another cloud vendor move your data out of the failing cloud. If your cloud provider fails, who do you go to for secondary access, and who do you contract with within that organization to move your data? This is a well-known method in the backup world, as organizations can buy services that provide recovery and operational environments.
Cloud vendors do this all of the time with multiple locations, so it is likely that if enough people ask cloud vendor Y to support failover from cloud vendor X, you can buy an insurance policy. Just make sure that that policy is legally binding with penalties and meets your business requirements.
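The prioritization in item 1 can be sketched as a simple evacuation schedule: sort the data sets by business priority and add up how long each takes to move. This is a minimal sketch, not a planning tool. The `Dataset` class, the `evacuation_schedule` function, the data set names, their sizes and the 1 Gbit/sec of usable bandwidth are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_tb: float   # decimal terabytes (10**12 bytes), an assumption
    priority: int    # 1 = most business-critical


def evacuation_schedule(datasets, usable_gbit_per_sec):
    """Return (name, cumulative_hours) pairs in the order data would move."""
    bytes_per_sec = usable_gbit_per_sec * 1e9 / 8
    schedule = []
    elapsed_sec = 0.0
    # Highest-priority data (lowest number) is moved first.
    for ds in sorted(datasets, key=lambda d: d.priority):
        elapsed_sec += ds.size_tb * 1e12 / bytes_per_sec
        schedule.append((ds.name, elapsed_sec / 3600))
    return schedule


# Hypothetical inventory for a retail business.
inventory = [
    Dataset("cold-archive", 4.5, priority=3),
    Dataset("orders-and-shipping", 0.45, priority=1),
    Dataset("analytics", 0.9, priority=2),
]

for name, hours in evacuation_schedule(inventory, usable_gbit_per_sec=1.0):
    print(f"{name}: done after {hours:.1f} hours")
```

A schedule like this makes it easier to see whether the most important data would be out before the available bandwidth, or the provider, disappears.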
So what if you can’t do either, because the cloud vendor will not allow you to get your data out fast enough or you do not have enough storage space? Or what if you cannot find a plan B cloud vendor? It is highly unlikely that you are going to procure and install hardware and upgrade your network in time to get your data out.
Planning for Cloud Failure
There are a few technology realities that you need to consider in your planning.
Data density is growing faster than network speeds are growing, and data costs are dropping faster than network costs. So how long does it take to move, say, 1 PB of data?
Assume TCP/IP can use 80% of the bandwidth, which is a generous assumption considering latency, network error, retries, contention, congestion and lots of other things like storage bandwidth, metadata for each file or object, etc.
So the below table is likely best case for each of the network bandwidth types, from OC-12 to OC-768, and I doubt many companies can afford dedicated OC-768 channels.
So if a year has 8,760 hours (365*24) in it, moving 1 PB over a dedicated OC-192 channel (~10 Gbit/sec) takes about 11 days. But remember that you and everyone else are going to be trying to do the same thing at the same time as this cloud company spins down. Who knows what bandwidth is going to be available with creditors circling.
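The arithmetic behind that estimate can be sketched in a few lines. This is a rough calculation under stated assumptions: 1 PB is taken as 10^15 bytes, the 80% usable-bandwidth figure from above is applied to the raw SONET line rate (OC-n is n times 51.84 Mbit/sec), and the function name `transfer_days` is made up for this example.

```python
OC1_MBIT_PER_SEC = 51.84  # SONET OC-1 line rate; OC-n is n times this
ONE_PB = 10**15           # decimal petabyte, an assumption


def transfer_days(data_bytes, oc_level, efficiency=0.8):
    """Best-case days to move data_bytes over a dedicated OC-n channel."""
    line_rate_bits = oc_level * OC1_MBIT_PER_SEC * 1e6
    usable_bits = line_rate_bits * efficiency  # 80% usable, as assumed above
    seconds = data_bytes * 8 / usable_bits
    return seconds / 86400


for level in (12, 48, 192, 768):
    print(f"OC-{level}: {transfer_days(ONE_PB, level):.1f} days")
```

Under these assumptions OC-192 comes out between 11 and 12 days, in line with the figure above, and each step down in channel size multiplies the time by four.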
Disk drive performance is not growing as fast as density. Back in 1991, the time to read a disk drive (500 MB SCSI enterprise drive) was 125 seconds compared to 34,883 seconds today, an increase of 279x.
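Those read times are just capacity divided by sustained transfer rate. The sketch below reproduces them under assumptions that are inferred rather than stated here: a 4 MB/sec rate for the 1991 drive (which is what 500 MB in 125 seconds implies) and a 6 TB drive at 172 MB/sec for today (172 MB/sec is the current rate quoted later in the article, and 6 TB is the capacity that matches 34,883 seconds). The function name `full_read_seconds` is illustrative.

```python
def full_read_seconds(capacity_bytes, bytes_per_sec):
    """Time to read an entire drive end to end at its sustained rate."""
    return capacity_bytes / bytes_per_sec


# 1991: 500 MB drive, assumed 4 MB/sec (implied by the 125-second figure).
old_sec = full_read_seconds(500e6, 4e6)

# Today: assumed 6 TB drive at 172 MB/sec (implied by the 34,883-second figure).
new_sec = full_read_seconds(6e12, 172e6)

print(f"1991 drive: {old_sec:.0f} s")
print(f"Current drive: {new_sec:.0f} s ({new_sec / 3600:.1f} hours)")
print(f"Increase: {new_sec / old_sec:.0f}x")
```

Capacity in this comparison grew 12,000-fold while the transfer rate grew only 43-fold, which is where the roughly 279x increase in read time comes from.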
Even if the network bandwidth were available, it is still going to take months to read all of the disk drives in large cloud environments. Even if the whole environment were SSDs, it is still going to take a long time, as it is likely that all drives cannot be run at full rate given SATA and SAS controller bandwidth issues.
It takes fewer disk drives to saturate OC networks today, and that presents load balancing problems. Even though disk drives are not getting faster relative to density, they are getting faster with each generation. Just five years ago, 3.5-inch disk drives were running at about 112 MB/sec, and today they are at 172 MB/sec.
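To put a number on "fewer disk drives," the sketch below counts how many drives streaming at full rate it takes to fill an OC channel. It is a back-of-the-envelope estimate that assumes the raw SONET line rate (OC-n is n times 51.84 Mbit/sec), decimal megabytes, and drives that sustain their quoted rate with no controller or protocol overhead. The function name `drives_to_saturate` is made up for this example.

```python
import math

OC1_MBIT_PER_SEC = 51.84  # SONET OC-1 line rate; OC-n is n times this


def drives_to_saturate(oc_level, drive_mb_per_sec):
    """Smallest number of drives whose combined rate fills an OC-n channel."""
    link_mb_per_sec = oc_level * OC1_MBIT_PER_SEC / 8  # Mbit/sec -> MB/sec
    return math.ceil(link_mb_per_sec / drive_mb_per_sec)


for level in (12, 48, 192, 768):
    older = drives_to_saturate(level, 112)    # drive rate five years ago
    current = drives_to_saturate(level, 172)  # drive rate today
    print(f"OC-{level}: {older} drives then, {current} drives now")
```

With so few drives able to fill even a large channel, a handful of busy drives can become the bottleneck, which is the load balancing problem described above.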
The CPU requirements for decrypting the data, validating the erasure codes and sending the data out over the network are likely going to be a problem given the volume of data that needs to be moved in a short period of time. Given that for most vendors you have no idea how much CPU is needed for hashes and/or erasure codes, it is likely going to be a problem, but who knows how big it will be?
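One way to get a feel for the scale is to measure hashing throughput on a single core and extrapolate. The sketch below is only a rough illustration: it uses SHA-256 from Python's standard `hashlib` as a stand-in, since the hash a given vendor uses is unknown, and it ignores decryption, erasure-code validation and network processing entirely. The function names and the 1 PB (10^15 bytes) target are assumptions for this example.

```python
import hashlib
import time


def measure_sha256_bytes_per_sec(sample_mib=64):
    """Measure single-core SHA-256 throughput on an in-memory buffer."""
    buf = bytes(1024 * 1024)  # 1 MiB of zeros
    digest = hashlib.sha256()
    start = time.perf_counter()
    for _ in range(sample_mib):
        digest.update(buf)
    elapsed = time.perf_counter() - start
    return sample_mib * 1024 * 1024 / elapsed


def core_days_to_hash(total_bytes, bytes_per_sec_per_core):
    """Core-days needed to hash total_bytes at the given per-core rate."""
    return total_bytes / bytes_per_sec_per_core / 86400


rate = measure_sha256_bytes_per_sec()
print(f"Measured: {rate / 1e6:.0f} MB/sec per core")
print(f"Hashing 1 PB: about {core_days_to_hash(10**15, rate):.1f} core-days")
```

The result varies widely with the CPU and the hash implementation, and the real workload would add decryption and erasure-code checks on top of it.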
There are two obvious choices: ignore the issue and don’t worry about it, or at the other end of the spectrum, do not use clouds. Neither is likely a good idea, so what is the alternative?
The alternative is having a copy of your business-critical data somewhere else where you can get it and use it. Depending on your business, this might have to be done in real time if the data changes often. For some businesses, the data might not change often, so a periodic copy would likely work. What about using multiple providers so that you would not get caught in the failure of a single provider?
What happens the day the music dies? If no one is making money and the economics do not work out, will anyone stay in business? If one of the cloud biggies leaves and the rest are not making money, will it prompt them all to leave? Rhetorical question, of course, as no one knows the answer, but it is something that people need to think about and plan for.
Photo courtesy of Shutterstock.