The Case for RAID 10
Almost entirely overlooked in discussions of RAID reliability – an all too seldom discussed topic as it is – is the question of parity computation reliability.
With RAID 1 or RAID 10 there is no "calculation" done to create a stripe with parity. Data is simply written, unchanged, to each drive in a mirrored pair. When a drive fails, its partner picks up the load and performance is slightly degraded until the failed drive is replaced. There is no rebuilding process that impacts existing drive members. Not so with parity stripes.
RAID arrays with parity have operations that involve calculating what is and what should be on the drives. While this calculation is very simple it provides an opportunity for things to go wrong.
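To make the "very simple" calculation concrete, here is a minimal sketch of single-parity protection in the style of RAID 5, written in Python. It assumes plain XOR parity over one stripe of equal-sized blocks; the helper name `xor_blocks` and the sample block contents are illustrative only, and real controllers add striping layout, caching and write ordering on top of this.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks (illustrative contents) plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second drive: XOR the survivors with the parity block
# to recompute what the missing block "should be".
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The reconstruction is only as trustworthy as its inputs: every surviving block and the parity block must be current, which is exactly the assumption that can break down in practice.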
An array controller that fails with RAID 1 or RAID 10 could in theory write bad data over the contents of the drives, but there is no process by which the controller makes drive changes on its own. So this is extremely unlikely to ever occur, as there is never a "rebuild" process except in creating a mirror.
When arrays with parity perform a rebuild operation, they step through the entire contents of the array and write the missing data back to the replaced drive. In and of itself this is relatively simple and should be no cause for worry.
What I and others have seen first hand is a slightly different scenario, involving disks that have lost connectivity to the array because of loose connectors. Drives can commonly "shake" loose over time as they sit in a server, especially after several years of service in an always-on system.
What can happen in extreme scenarios is that good data on drives can be overwritten by bad parity data when an array controller believes that one or more drives have failed in succession and been brought back online for rebuild. In this case the drives themselves have not failed and there is no data loss. All that is required is that the drives be reseated, in theory.
On hot swap systems the management of drive rebuilding is often automatic, based on the removal and replacement of a failed drive. So this process of losing and replacing a drive may occur without any human intervention, and a rebuilding process can begin. During this process the drive system is at risk. Should this same event occur again, the drive array may, based upon the status of the drives, begin striping bad data across the drives, overwriting the good file system.
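The failure mode described here can be illustrated with a deliberately simplified Python sketch. It assumes XOR parity over a single stripe, and it assumes a hypothetical sequence of events: one drive (`d1`) drops out and returns holding stale contents that the controller wrongly treats as current, after which the controller decides a second, perfectly healthy drive (`d0`) has failed and rebuilds it. The drive names and contents are invented for illustration, and real controllers track far more state than this.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: three data blocks plus parity computed from them.
drives = {"d0": b"GOOD", "d1": b"DATA", "d2": b"HERE"}
parity = xor_blocks(drives.values())

# Assumption: d1 lost its connection and came back with stale contents,
# but the controller believes it is up to date.
drives["d1"] = b"OLD!"

# The controller now wrongly concludes that d0 has failed and "rebuilds" it
# from the other members and parity, even though d0 still held good data.
original_d0 = drives["d0"]
drives["d0"] = xor_blocks([drives["d1"], drives["d2"], parity])

print(drives["d0"] == original_d0)
```

In this sketch no drive has physically failed, yet the block on `d0` no longer matches what was originally written, because the rebuild trusted an input that was not actually consistent with the parity.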
Few sights are more depressing for a server administrator than a system with no failed drives losing an entire array to an unnecessary rebuild operation.
In theory this type of situation should not occur, and safeguards are in place to protect against it. But a low-level drive controller's determination of a drive's current and previous status, and of the quality of the data residing on that drive, is not as simple as it may seem, and mistakes can occur.
While this situation is unlikely, it does happen, and it adds a risk to RAID 5 and RAID 6 systems that is nearly impossible to calculate. We must consider the risk of parity failure in addition to the traditional risk calculated from the number of drive losses that an array can survive out of a pool. As drives become more reliable, the relative significance of the parity failure risk becomes greater.
Additionally, RAID 5 and RAID 6 introduce system overhead due to parity calculation, which is often handled by dedicated RAID hardware. This calculation introduces latency into the drive subsystem that varies dramatically by implementation, both in hardware and in software. This makes it impossible to state general performance numbers for RAID levels against one another, as each implementation will be unique.
Possibly the biggest problem with RAID choices today is that the ease with which metrics for storage efficiency and drive loss survivability can be obtained masks the big picture of reliability and performance, as those statistics are almost entirely unavailable. One of the dangers of metrics is that people will focus upon factors that can be easily measured and ignore those that cannot be easily measured, regardless of their potential for impact.
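The easily obtained metrics really are trivial to compute, which is part of their appeal. The Python sketch below uses a hypothetical helper, `raid_metrics`, that returns the usable fraction of raw capacity and the number of drive losses an array is guaranteed to survive. It assumes identical drives and, for RAID 10, two-way mirrors; note that nothing in it says anything about parity failure risk or real-world performance.

```python
def raid_metrics(level, n_drives):
    """Return (storage_efficiency, guaranteed_survivable_drive_losses).

    Assumes identical drives; RAID 10 is assumed to use two-way mirrors.
    """
    if level == 5:
        if n_drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (n_drives - 1) / n_drives, 1
    if level == 6:
        if n_drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (n_drives - 2) / n_drives, 2
    if level == 10:
        if n_drives < 4 or n_drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        # Only one loss is guaranteed survivable; further losses are survived
        # only if they land in different mirror pairs.
        return 0.5, 1
    raise ValueError("unsupported RAID level")


for level in (5, 6, 10):
    efficiency, losses = raid_metrics(level, 8)
    print(f"RAID {level}: {efficiency:.1%} usable, survives {losses} guaranteed drive loss(es)")
```

On these two numbers alone, parity RAID looks like the obvious choice, which is how factors that are harder to measure end up being ignored.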
While all modern RAID levels have their place, it is critical that they be considered within context and with an understanding as to the entire scope of the risks. We should work hard to shift our industry from a default of RAID 5 to a default of RAID 10. Drives are cheap and data loss is expensive.
Article courtesy of Datamation