The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology’s limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive […]
The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology’s limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable.
The bottom line is this: Disk density has increased far more than performance and hard error rates haven’t changed much, creating much greater RAID rebuild times and a much higher risk of data loss. In short, it’s a scenario that will eventually require a solution, if not a whole new way of storing and protecting data.
We’ll start with a short history of RAID, or at least the last 15 years of it, and then discuss the problems in greater detail and offer some possible solutions.
Some of the first RAID systems I worked on used RAID-5 and 4GB drives. These drives ran at a peak of 9 MB/sec. This, of course, was not the first RAID system, but 1994 is a good baseline year. You’ll need to click on the image below for how the RAID rebuild picture has changed in the last 15 years.
![]() |
| RAID reconstruction time, 1994-2009 (Click for larger image) |
A few caveats and notes on the data:
We all know that density is growing faster than bandwidth. A good rule of thumb is that each generation of drives will improve bandwidth by 20 percent. The problem is that density is growing far faster and has been for years. While density percentages might be slowing now from 100 percent to 50 percent or less, drive performance is pretty fixed at about 20 percent improvement per generation.
Using the sample data in the table above, RAID rebuilds have gone up more than 300 percent over the last 15 years. If we change the formula from 90 percent of bandwidth to 10 percent of disk bandwidth — which might be the case if the device is heavily used with application I/O, thanks in part to the growing use of server virtualization — then rebuild gets ugly pretty, as in the RAID rebuild table below.
![]() |
| RAID Rebuild for Application I/O (Click for larger image) |
The hard error rate for disk drives has not, for the most part, improved with the density. For example, the hard error rate for 9GB drives was 1 in 10E14 bits, and that error rate has improved by an order of magnitude to 1 in 10E15 for the current generation of enterprise SATAand 1 in 10E16 for the current generation of FC/SASdrives. The problem is that the drive densities have increased at a faster rate.
|
What this means for you is that even for enterprise FC/SAS drives, the density is increasing faster than the hard error rate. This is especially true for enterprise SATA, where the density increased by a factor of about 375 over the last 15 years while the hard error rate improved only 10 times. This affects the RAID group, making it less reliable given the higher probability of hitting the hard error rate during a rebuild.
|
Think about it: Only 8.88 PB of data has to move before you hit the hard error rate, and remember that these types of values are numbers that vendors say you cannot likely exceed. Consumer-grade SATA drives are pretty scary and must be a consideration when talking to low-end RAID vendors who might use these drives. A single 8+2 rebuild with 1.5 TB drives reads and writes about 28.5 TB (read 1.5 TB*9 drives, write 1.5 TB*10 drives). That means that in a perfect world using vendor specifications, we can expect one out of every 312 rebuilds to fail today with data loss. Moving to the next-generation 2 TB drives, this value drops to every 234 rebuilds, and with 4 TB drives it drops to every 117 rebuilds.
Some considerations:
All of this means that the hard error rate is beginning to become a big factor in MTDL (Mean Time before Data Loss) as a part of rebuild, given the amount of data moved and the density of the drives.
Some vendors claim that file systems can address the storage reliability problem, but I don’t buy that argument, given the latency inherent in large operating systems and file systems, and the need for high performance (see File System Management Is Headed for Trouble and SSDs, pNFS Will Test RAID Controller Design). And running software RAID-5 or RAID-6 equivalent does not address the underlying issues with the drive. Yes, you could mirror to get out of the disk reliability penalty box, but that does not address the cost issue.
The disk reliability density problem is getting worse, and fast. Some vendors are using techniques such as write logging — keeping track of write on another disk during rebuild to allow the rebuild to occur faster — to get around the growing problem. Will this solve the problem for the long term, or is this the equivalent of the RAID-5 to RAID-6 fix that just delayed the inevitable problem? Personally, I think it will turn out to be just another short-term fix. The real fix must be based on new technology such as OSD, where the disk knows what is stored on it and only has to read and write the objects being managed, not the whole device, or something like declustered RAID. In essence, the disk drive layer needs to have more knowledge of what is storage, or fixed RAID devices must be rethought. Or both.
There are some technologies that are available today and on the way that could help alleviate the problem. The sooner they arrive, the better chance we have of avoiding MTDL.
Henry Newman, CTO of Instrumental Inc. and a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years experience in high-performance computing and storage.
See more articles by Henry Newman.
For more on the future of RAID, see The Future of RAID Storage Technology
Follow Enterprise Storage Forum on Twitter
Henry Newman has been a contributor to TechnologyAdvice websites for more than 20 years. His career in high-performance computing, storage and security dates to the early 1980s, when Cray was the name of a supercomputing company rather than an entry in Urban Dictionary. After nearly four decades of architecting IT systems, he recently retired as CTO of a storage company’s Federal group, but he rather quickly lost a bet that he wouldn't be able to stay retired by taking a consulting gig in his first month of retirement.
Enterprise Storage Forum offers practical information on data storage and protection from several different perspectives: hardware, software, on-premises services and cloud services. It also includes storage security and deep looks into various storage technologies, including object storage and modern parallel file systems. ESF is an ideal website for enterprise storage admins, CTOs and storage architects to reference in order to stay informed about the latest products, services and trends in the storage industry.
Property of TechnologyAdvice. © 2025 TechnologyAdvice. All Rights Reserved
Advertiser Disclosure: Some of the products that appear on this site are from companies from which TechnologyAdvice receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. TechnologyAdvice does not include all companies or all types of products available in the marketplace.