Lately I’ve been hearing people state that they need their file system or storage solution to have disaster recovery capability. I’ve just been listening and haven’t really participated in the discussion, but I have noticed one thing: the general theme is “we need DR.”
However, when you start asking questions such as “What do you mean by disaster recovery?” or “What data should be recovered?” or my favorite, “What kinds of disasters?” you find that the emperor is not wearing any clothes.
When you start probing beneath the surface of “We need DR,” you find that the definition or even the use of the phrase “Disaster Recovery” is like swimming in the Hudson River — you can’t see a thing. Basic questions like “what disaster?” or “what do you want to recover?” or “how quickly do you want to recover?” often have no answers.
To understand whether you need a DR solution, and what conditions it should be designed for, you should start by asking yourself some simple questions. This planning for DR should not be taken lightly. The answers can have a big impact on the scale of your DR solution and, ultimately, the cost.
The best place to start is to ask yourself about the meaning of disaster recovery.
What data do you want to recover or protect?
What data do you want to protect using DR? Another way to think about the question is, what data do you want to recover in the event of a disaster? Which data is absolutely critical to keeping your business, research institution or university functioning in the event of a disaster?
For example, do you want to recover all the data including emails, customer data (or student data), financial records, shipping records, general working data, and archive data? Or do you only want to recover data that ensures business continuity, such as customer information, financial records, and shipping records? These two answers are typically at opposite ends of the spectrum, from “protect all data!” at one end to “protect only what is critical” at the other. Deciding where you fall on the spectrum is important because the simple rule of thumb is that the more data you want to recover (or protect), the longer it will take to protect and recover, and the more it will cost.
The best advice I can give is to focus on what DR is supposed to do — allow you to recover from a disaster. Which data allows you to keep functioning? Which data can be lost without detriment to the company?
One subtle point in this discussion is that you will have to keep in mind any compliance or regulatory information that might be required. Just because you can recover customer data and financial and shipping records does not mean that you are allowed to skip compliance and regulation requirements.
A great place to start is to make a simple list of the kinds of data you store today. It doesn’t have to be detailed but rather just high-level. Then create a second list of the data you need to keep the company functioning. Again, just make a high-level list but you could add data to the list that gives “added functionality” to the company. Then you correlate the two lists and discover which data is absolutely necessary and which is “nice to have”.
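The two-list exercise can be sketched with plain Python sets. This is only an illustration: the category names are taken from the examples earlier in this article, and which categories count as critical or “nice to have” is an assumption made for the sketch. Your own lists will differ.

```python
# Hypothetical high-level inventory; category names are examples only.
stored_today = {
    "email", "customer data", "financial records",
    "shipping records", "general working data", "archive data",
}

# Data the company needs in order to keep functioning (assumed for illustration).
needed_to_function = {"customer data", "financial records", "shipping records"}

# Data that only gives "added functionality" (assumed for illustration).
added_functionality = {"email"}


def correlate(stored, critical, nice):
    """Split the stored data categories into critical, nice-to-have, and the rest."""
    return {
        "critical": sorted(stored & critical),
        "nice_to_have": sorted((stored & nice) - critical),
        "unclassified": sorted(stored - critical - nice),
        # Data you believe you need but have no record of storing.
        "missing": sorted(critical - stored),
    }


result = correlate(stored_today, needed_to_function, added_functionality)
for label, categories in result.items():
    print(f"{label}: {categories}")
```

Anything that lands in the “unclassified” bucket is a candidate for leaving out of the DR plan, subject to the compliance and regulatory requirements mentioned above.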
The second step might be to then discover where this data is being stored today and how the company is protecting it today (if it is). It might take a while to discover where the data is located but ultimately the exercise is useful because you now know which data is important and where it is. You can even use that information to start consolidating storage (that’s another story).
Which Disasters?
A second question that is inextricably tied to the first question is “what do you mean by ‘disaster’?” Specifically, what disasters do you want to recover from?
The list of possible disasters is huge, so let’s create our own list by starting with a small disaster and then increasing the magnitude of the disaster. For each disaster, I’m going to list the “scope” (how many people it affects within the company and even outside the company), the time scale for recovering from the disaster, the monetary impact of the disaster including a comment on the loss of data, and the ultimate cost of recovering from the disaster for the case of no DR capability. Also, I use the phrase “data center” to mean an actual data center or a co-location facility or something on-premise inside the company.
- The cleaning crew pulls the plug from your desktop so the data on your desktop is not accessible at that moment
  - Scope: One person
  - Time scale: A few minutes
  - Impact: Minimal with no loss of data (i.e. just a delay in accessing data)
  - Cost: $0
- The hard drive in your desktop failed
  - Scope: One person
  - Time scale: A few minutes to a few hours
  - Impact: Minimal with possible loss of data
  - Cost: $50 – $1,000 ($1,000 if data needs to be recovered from drive)
- The centralized storage for your company is not accessible (perhaps a power loss)
  - Scope: Potentially a large number of people
  - Time scale: A few minutes to a couple of hours
  - Impact: Minimal with possible loss of data in flight
  - Cost: Small
- Failed drive(s) in the company centralized storage
  - Scope: Potentially a large number of people
  - Time scale: A few hours to a few days
  - Impact: Minimal. If recovery doesn’t go well, data could be lost
  - Cost: Small but there is a performance impact in the case of a RAID rebuild
- The company’s centralized storage is damaged due to some sort of accident (e.g. car hits the data center, tornado, disgruntled employee)
  - Scope: Potentially a large number of people
  - Time scale: Hours, Days, Weeks, Months (varies)
  - Impact: Minimal ($0) to very large ($$$,$$$)
  - Cost: Small ($) to very large ($$$,$$$) or even larger due to loss of revenue during down time. The amount varies.
- The company centralized storage is irreparably damaged (i.e. has to be replaced)
  - Scope: Potentially a large number of people
  - Time scale: Weeks to Months or longer depending upon how long it takes to get new storage (assumes the data center is still functioning)
  - Impact: Very large impact to a catastrophic impact (loss of all data)
  - Cost: Very large ($$$,$$$) to huge ($$$,$$$,$$$)
- The data center experiences a power loss for an extended period of time (hours or days)
  - Scope: Potentially a large number of people including the entire company
  - Time scale: Hours to days
  - Impact: Large to very large impact with possible loss of data in flight
  - Cost: Large ($$,$$$) to Huge ($$$,$$$,$$$)
- The data center was damaged beyond repair (e.g. being blown up or hit by a meteor)
  - Scope: Potentially a large number of people including the entire company
  - Time scale: Months to years (varies). Have to rebuild data center and buy new hardware.
  - Impact: Very large to catastrophic
  - Cost: Very large ($$$,$$$) to huge ($$$,$$$,$$$)
- The state where the data center was located is hit by a meteor or lost power
  - Scope: Potentially a large number of people including the entire company
  - Time scale: Months to years (varies). Have to build a new data center in a new state and buy new hardware.
  - Impact: Extremely large to catastrophic
  - Cost: Massive ($,$$$,$$$) to huge ($$$,$$$,$$$)
- The country where the data center is located is hit by a meteor or otherwise destroyed
  - Scope: A very large number of people (massive scale)
  - Time scale: Extremely long to infinite (can’t recover data)
  - Impact: Massive to Catastrophic
  - Cost: Company could die (other companies as well)
- The planet gets blown up
  - Scope: Everyone
  - Time scale: Infinity (game over)
  - Impact: Catastrophic
  - Cost: Game over
I think I’m going to stop with the planet getting blown up because I’m not sure people have thought about what happens when all the people are gone since there is no reason to recover the data (unless the Martians want it for some reason).
Given this scale of disasters starting from something slight and annoying, such as the cleaning crew pulling the plug on your desktop, to the ultimate disaster of the planet being destroyed, you should decide at which point on the scale you need to create a disaster recovery plan. For example, you might want to start a DR plan based on the centralized storage being damaged beyond repair. This means that the data is not accessible and is either not recoverable from the surviving hardware or will be extremely expensive and time-consuming to recover. This type of failure can cause massive business interruption and could cost a company a huge amount of money. Having a plan where a copy of the company’s data is maintained at an alternative site and your servers can access the data with a minimum of interruption is not an unreasonable starting point.
You can pick a different starting point for your DR plan based on your needs. Instead of starting with an accident that destroys your centralized storage and leaves the data unrecoverable, you may want to pick something a little further “up” the list. For example, you might want to consider starting with losing access to the centralized data. This could be very important for business continuity.
Either starting point is fine. The point is that you need to pick a starting point and develop your DR plan to match it. But what many people fail to think about is the point at which you simply give up hope of recovering the data. For example, do you want to make sure you can recover from a meteor hitting your primary data center? Or do you want to be able to recover from a power loss in the primary data center?
Picking the point at which the loss of data is beyond the resources of the company to recover creates an “upper bound” on the required resources (people and money). Without this step, DR planning becomes a black hole into which you pour money. If you don’t plan on an upper limit, then you could end up thinking about putting backup data centers in other countries or even putting them on Mars (watch the radiation and the dust). This can become expensive very quickly. Companies do this but you need to carefully evaluate whether you can afford it and what implications it has for operations. I don’t want to be overly dramatic in my examples but I do want to make the point that you need to think about the point at which the company can’t survive because of the data loss.
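One way to make the starting point and the upper bound explicit is to write the scale down in order and select a slice of it. The sketch below paraphrases the disaster list from this section; the short labels and the `dr_scope` helper are hypothetical names invented for illustration, and the chosen bounds are just one possible answer.

```python
# Disasters ordered from least to most severe, paraphrased from the list above.
DISASTER_SCALE = [
    "desktop unplugged",
    "desktop drive failure",
    "centralized storage inaccessible",
    "failed drives in centralized storage",
    "centralized storage damaged",
    "centralized storage irreparably damaged",
    "extended data center power loss",
    "data center destroyed",
    "state-wide disaster",
    "country-wide disaster",
    "planet destroyed",
]


def dr_scope(lower, upper, scale=DISASTER_SCALE):
    """Return the disasters the DR plan covers, from lower to upper inclusive."""
    start, end = scale.index(lower), scale.index(upper)
    if start > end:
        raise ValueError("lower bound must not be more severe than upper bound")
    return scale[start:end + 1]


# Example choice: start at unrecoverable storage, give up beyond a destroyed data center.
plan = dr_scope("centralized storage irreparably damaged", "data center destroyed")
print(plan)
```

Everything below the lower bound is handled by ordinary operations (backups, RAID, a spare desktop), and everything above the upper bound is a disaster the company has decided it cannot afford to plan for.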
How quickly do you want to recover?
Let’s assume you have an idea of how much data you need to protect and how quickly it grows and you know the level of disaster you want to withstand. A logical next step in planning for DR is to ask the question, “How quickly do I want to recover the data?” The answer to this question will allow you to start estimating the amount of hardware you need for your DR solution and/or what kind of DR solution you need. Let’s try a simple scenario to illustrate how this might impact your solution.
Let’s assume I’m a small regional company with about 2,000 employees. I have a data center that stores about 200TB of centralized data of which I need to protect 100TB. For business continuity let’s assume I can’t afford to lose access to the data for more than 2 minutes. This last statement sets a boundary on the data recovery speed.
If my primary data is inaccessible I have two minutes to recover the data. That speed is pretty fast. I see two options for a DR solution: (1) copy the data from the secondary data center to the primary data center (if I can), or (2) switch over my business applications to use the data in a secondary data center. In the first case, I will need to copy 100TB in 120 seconds resulting in a sustained throughput of 833 GB/s. Getting that much throughput between data centers is going to be really, really costly if it’s even possible. For the second case, I just need to have my customer-facing servers and my important business servers start using the storage in the secondary data center. This shouldn’t be too difficult to do, but I need to have some sort of system in place to detect the failure of the primary data center and then automatically fail the servers over to the secondary storage. This isn’t something I would call “cheap” but it’s definitely something that most companies can easily implement.
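The throughput figure for the first option is simple arithmetic, and it is worth rerunning with your own numbers. The sketch below assumes decimal units (1 TB = 1,000 GB); the function name is hypothetical, and the 24-hour window is included only as a made-up comparison point, not as part of the scenario.

```python
def required_throughput_gb_per_s(data_tb, window_s):
    """Sustained throughput (decimal GB/s) needed to copy data_tb terabytes in window_s seconds."""
    if window_s <= 0:
        raise ValueError("recovery window must be positive")
    return data_tb * 1000 / window_s  # 1 TB = 1000 GB (decimal units)


# The scenario from the text: 100TB restored within 2 minutes.
two_minutes = required_throughput_gb_per_s(100, 120)
# A hypothetical, far more relaxed window for comparison.
one_day = required_throughput_gb_per_s(100, 24 * 3600)

print(f"2-minute window: {two_minutes:.0f} GB/s")
print(f"24-hour window:  {one_day:.2f} GB/s")
```

Relaxing the recovery window from minutes to a day drops the requirement from hundreds of GB/s to roughly 1 GB/s, which is why the recovery time has such a large effect on the shape and cost of the solution.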
By deciding how quickly I want to recover from my “disaster” I have started to shape the DR solution and costs. But don’t forget to define the upper bound of your DR solution. In my contrived example, let’s assume my upper bound is the complete loss of my data center due to some disaster but my secondary data center is located 500 miles away. If the disaster extends beyond 500 miles from my data center, I’m going to assume that the company is completely destroyed (apologies for being so morbid). But it is important to set an upper limit on the disaster from which I want to recover.
I’ve now set some boundaries on a DR solution by defining how much data is important, how much data needs to be covered, and how fast I want to recover. But before fleshing out your DR solution, there is one more important issue to address.
Async versus sync
One obvious DR solution is to simply make a copy of your important data to a secondary storage system, possibly in a secondary data center. The question we now face is how often or how quickly do we need to copy the data? Does the secondary copy of the data have to match the primary data 100 percent? Or can the secondary data lag behind the primary data by some amount of time? Answering this question can have a big impact on the cost and performance of your storage system.
The first case, where the data in the secondary storage is identical to the primary data, is synchronous replication. Synchronous replication can have an impact on performance because the primary and the secondary storage both need to acknowledge that the data has been written before the write returns to the kernel. This can slow things down a bit, but in return the data is guaranteed to have been acknowledged by both storage systems.
The second case is called asynchronous replication and involves a time delay between the primary and secondary storage. The primary storage always has the latest version of the data while the version of the data on the secondary storage is behind the primary by some amount of time. This amount of time can range from something very small, perhaps milliseconds, to something very large, perhaps days or even longer. The size of the delay is up to you. Can you tolerate a difference between the data on the primary storage and the secondary?
Remember that you don’t have to have synchronous replication for all types of data. You can have synchronous replication for some data and asynchronous for other data.
One nice result of asynchronous replication is that, generally, it is faster than synchronous replication and it is cheaper. But these are general rules of thumb and specific cases can differ.
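The difference between the two modes can be shown with a toy model. This is a sketch under heavy assumptions: dictionaries stand in for real storage, there is no network or latency, and the `ReplicatedStore` class is a hypothetical name invented here rather than any product’s API. It only illustrates when the secondary copy is updated relative to the write being acknowledged.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary/secondary pair; dictionaries stand in for real storage."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}
        self.secondary = {}
        self.pending = deque()  # writes not yet applied to the secondary

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Acknowledge only after both copies hold the data.
            self.secondary[key] = value
        else:
            # Acknowledge immediately; the secondary catches up later.
            self.pending.append((key, value))

    def replicate(self):
        """Drain queued writes to the secondary (the asynchronous catch-up step)."""
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value

    def at_risk(self):
        """Keys whose latest value would be lost if the primary failed right now."""
        return sorted(k for k, v in self.primary.items() if self.secondary.get(k) != v)


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)
for store in (sync_store, async_store):
    store.write("order-1", "shipped")
    store.write("order-2", "pending")

print(sync_store.at_risk())   # nothing at risk
print(async_store.at_risk())  # both writes at risk until replicate() runs
async_store.replicate()
print(async_store.at_risk())  # nothing at risk after catch-up
```

In the asynchronous case, whatever is still in the pending queue when the primary fails is the data you lose, so the delay you choose is effectively a decision about how much recent data you can afford to lose.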
Summary
The phrase “disaster recovery” means exactly what it says – recovery from a disaster. What happens if you lose access to your primary storage or if you lose the data in your primary storage? Being able to recover from this scenario and keep your business functioning is a beautiful thing. But you can’t just go out and buy a “DR solution.” You actually have to think about and plan for what data you want to protect (hint: the answer is not “everything”).
You have to plan for a range of disasters from which you want to recover. Having the office cleaning crew accidentally pull the plug on your desktop is not a good starting point for disaster recovery, regardless of the number of cat pictures on the drive. By the same token, surviving the end of the world is also probably not a good disaster from which to recover since no one will be around to use the recovered data.
As you plan for DR, just remember that the broader the range of disasters from which you want to recover and the more data you want to protect, the more expensive the solution is likely to be. Notice that I haven’t talked about specific technologies for DR so the actual expenses can vary, but the general trend is always, “more money for more DR.”