Calculating the Copies
Hard error rates and device failures are only part of the equation. Among the many other things to consider are:
- Silent data corruption
- A bad lot of media
- Natural disaster
- Network failure that prevents replication
- Human error
- Intentional data damage
- A combination of these factors
We will now look at each of these.
Silent Data Corruption
A Bad Lot of Media
This has not happened in a while in the disk drive industry, but it has happened before for both disk and tape. If both copies are on media from the same lot, there is always the risk that both sit on media with a manufacturing defect. If every copy is on the same type of media, be sure the copies come from at least two different media lots.
Natural Disaster
Whether you live in an earthquake zone, tornado zone, hurricane zone, flood zone or trouble zone of your choice, almost every major population area in the country could be a target. If you have only two copies of your data and one of them gets destroyed, you will be replicating from only one copy. Given the media reliability and the amount of data, that might be a problem. Of course, you could have a computer center built into a missile silo that was designed to survive a nuclear attack, but most enterprises do not have a computer center that can survive a disaster such as an F5 tornado.
Network Failure -- Preventing Replication
Having two copies of your data via replication is only as good as your network. There are three potential issues:
- Do you have enough network bandwidth to replicate your incoming data?
- Do you have enough network bandwidth to replicate your incoming data and re-replicate to failed devices?
- Do you have enough network bandwidth to replicate all of your data in the event of a disaster?
Clearly, keeping enough bandwidth on hand for the third case is not practical given the cost, but some planning is needed.
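To get a feel for why the third question is so expensive, a rough transfer-time estimate helps. The sketch below is only illustrative: the function name `replication_days`, the 10 Gb/s link and the 80 percent sustained efficiency are assumptions chosen for the example, not figures from any particular site.

```python
def replication_days(data_bytes: int, link_bits_per_sec: float,
                     efficiency: float = 0.8) -> float:
    """Estimate the days needed to copy data_bytes over a network link.

    efficiency is the assumed fraction of the raw link rate that is
    actually sustained (protocol overhead, contention, and so on).
    """
    if link_bits_per_sec <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link rate must be positive and efficiency in (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day


PB = 1024 ** 5  # bytes in one petabyte (binary)

# Hypothetical case: re-replicating 1 PB over a dedicated 10 Gb/s link.
days = replication_days(1 * PB, 10e9)
print(f"{days:.1f} days")  # about 13 days under these assumptions
```

Under these assumptions a single petabyte ties up the link for roughly two weeks, and the time scales linearly with archive size, so the same link would need far longer for a multi-petabyte archive.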
Human Error
Everyone makes mistakes, and archives can be lost via human error. Depending on the software that is chosen, problems tend to occur when you have only two copies of the data. How you make sure a human error does not take out all of your copies is generally a function of the software and your testing procedures.
Intentional Data Damage
Whether it be an employee with an ax to grind or someone hacking into your system to change or destroy data, having multiple copies of your data is critical. Each copy must be checksummed to ensure the data has not been tampered with or silently corrupted.
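As a minimal sketch of that kind of check, the following compares each copy of a file against a known-good digest. It assumes SHA-256 and assumes the reference digest is stored somewhere separate from the copies themselves; the helper names `file_sha256` and `find_bad_copies` are hypothetical and not taken from any archive product.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archive files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_copies(copy_paths: list[str], expected_hex: str) -> list[str]:
    """Return the paths whose contents no longer match the expected digest."""
    return [path for path in copy_paths if file_sha256(path) != expected_hex]
```

A copy that shows up in the returned list has changed since the digest was recorded, whether through tampering or silent corruption, and would need to be restored from a copy that still matches.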
Combination of Factors
Likely the worst possible scenario is that a combination of factors happens at the same time. Most people plan for one thing to happen but not a combination. This must be a consideration when you are determining how many copies you want.
So how many copies of the data do you need, on what media, and at what locations? Some of it depends on the size of your archive. If you have 1 PB of data, you might be able to keep it safe with two copies on enterprise RAIDed SATA drives. On the other hand, if you have 50 PB (50*1024*1024*1024*1024*1024 bytes) of data and want 99.9999999 percent reliability (about 56,294,995 bytes of data lost in 50 PB), two copies on enterprise tape might not be enough because some lost bytes might overlap on the two copies. The number of copies depends on how much risk you want to tolerate and, of course, your budget.
You might be willing to archive far more data with a higher risk of loss, and that might be your corporate policy. On the other hand, if you are a drug company, the FDA requires you to keep all drug trial information, and you lose some of the data, then, as Ricky said to Lucy, "you have some explaining to do." In large archives (over 50 PB) with high reliability requirements, two copies might not be enough if you want, say, 99.9999999999999 percent (15 9s) of data reliability, or 56 bytes lost for 50 PB. I am not sure that even three copies are enough, given the myriad issues and impacts. The media type also comes into play: Three copies on non-RAIDed consumer drives are a recipe for disaster, while three copies on enterprise tape are likely close to what you need from a media perspective. However, if all three copies are in a hurricane zone or you have an employee intent on destruction, all bets are off.
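The byte counts quoted for a 50 PB archive follow from simple arithmetic: a reliability of N nines leaves a loss fraction of 10^-N. The short sketch below reproduces that calculation with integer math; the function name `expected_bytes_lost` is made up for illustration, and the results land within a byte of the figures in the text.

```python
def expected_bytes_lost(total_bytes: int, nines: int) -> int:
    """Whole bytes lost at a reliability of `nines` nines.

    For example, nines=9 means 99.9999999 percent, a loss fraction of 10**-9.
    Integer division avoids floating-point rounding on very large archives.
    """
    if nines < 0:
        raise ValueError("nines must be non-negative")
    return total_bytes // 10 ** nines


archive_bytes = 50 * 1024 ** 5  # 50 PB expressed in bytes

print(expected_bytes_lost(archive_bytes, 9))   # 56294995
print(expected_bytes_lost(archive_bytes, 15))  # 56
```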
Given all the variables, there are no good answers for how many copies you need based on the media type used and the amount of data you have. Some variables, like human error or intentional damage, are not really possible to quantify, but things like WORM media can surely help. Others, like disasters, you might be able to quantify, but even that is neither easy nor cheap to figure out. Everyone in the process must be aware of the risks and issues and make the best choices based on budget.
So back to the question I get asked often: Are two copies on low-cost, low-reliability media better than one copy on enterprise media? My answer for large archives is that, from a media reliability perspective, one copy on enterprise media is better than two on low-cost, low-reliability media, because media failures are more probable than natural disasters, malicious employees and the like.
Everyone must know the limitations, and 100 percent data reliability is very costly, if not impossible to achieve, for large archives. As Sir Francis Bacon said, knowledge is power.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.