Making RAID Work into the Future - Page 3
Let's take a look at what happens if we lose a drive in a Dynamic Disk Pool. Figure 2 below, provided courtesy of NetApp, illustrates what happens when we lose drive 6 (D6).
Figure 2: What happens if you lose a drive in a DDP (copyright NetApp and used with permission)
This loss interrupts a number of D-Stripes that had D-Pieces on the failed drive, so the affected D-Stripes must be reconstructed. In general, a Dynamic Disk Pool always has some spare capacity in case of drive failure. Also, you don't have to allocate all of the space in a DDP to volumes, which leaves some additional spare capacity as well. This spare capacity is used during the reconstruction. Note that reconstruction in DDP is different from classic RAID reconstruction because the D-Pieces and D-Stripes have to be regenerated while everything is rebalanced across the drives to preserve the pseudo-randomness of the data distribution.
In a "normal" RAID-6 with a 10+2 layout, only the 11 surviving drives participate in the reconstruction when one drive fails, and everything regenerated from them is written to a single spare drive "target." The reconstruction time is limited by the number of drives and the write speed of that one target drive.
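To see why the single target drive matters, here is a minimal back-of-the-envelope sketch. It assumes the spare drive is the only bottleneck and that it sustains a steady sequential write rate. The 100 MB/s figure is an illustrative assumption, not a NetApp number, and `min_rebuild_hours` is a hypothetical helper name.

```python
def min_rebuild_hours(capacity_bytes, write_bytes_per_sec):
    """Lower bound on rebuild time when one spare drive absorbs every write."""
    return capacity_bytes / write_bytes_per_sec / 3600

# Assumed values: a 1TB (10**12 bytes) spare and 100 MB/s sustained writes.
hours = min_rebuild_hours(1e12, 100e6)
print(f"at least {hours:.1f} hours")  # prints "at least 2.8 hours"
```

This is only a floor. It ignores parity computation, host I/O competing with the rebuild, and controller limits, so measured rebuild times on real hardware can be considerably longer.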
In the case of DDP with the same number of drives, the entire pool of drives participates in the reconstruction. This means we get a larger number of drives working on the reconstruction for both reads and writes. In addition, the regeneration and rebalancing happen in parallel, further improving performance. All of this means DDP reconstruction goes faster than classic RAID-6 reconstruction.
A key thing to notice is that unlike classic RAID reconstruction, DDP's reconstruction only reads the required D-Pieces for the affected D-Stripes. The D-Pieces in the unaffected D-Stripes are not read. This means we don't have to read entire drives, keeping us much further away from the URE danger zone.
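The effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not NetApp's placement algorithm: it assumes each D-Stripe holds 10 D-Pieces (8 data plus 2 parity) placed on distinct drives chosen uniformly at random, and the pool size of 24 drives is arbitrary. The function names are hypothetical.

```python
import random

PIECES_PER_STRIPE = 10  # assumption: 8 data + 2 parity D-Pieces per D-Stripe

def build_stripes(num_drives, num_stripes, seed=0):
    """Place each D-Stripe's D-Pieces on distinct, randomly chosen drives."""
    rng = random.Random(seed)
    return [rng.sample(range(num_drives), PIECES_PER_STRIPE)
            for _ in range(num_stripes)]

def affected_fraction(stripes, failed_drive):
    """Fraction of D-Stripes that had a D-Piece on the failed drive."""
    hit = sum(1 for drives in stripes if failed_drive in drives)
    return hit / len(stripes)

stripes = build_stripes(num_drives=24, num_stripes=10_000)
frac = affected_fraction(stripes, failed_drive=6)
print(f"D-Stripes touched by the failure: {frac:.1%}")
```

Under these assumptions, roughly 10 out of every 24 D-Stripes have a piece on the failed drive. The remaining D-Stripes need no reads at all, and that gap widens as the pool grows.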
During the reconstruction, priority is given to any D-Stripe missing two D-Pieces to lessen the chance of another failure that might make recovery of the affected D-Stripes impossible. Remember that the D-Stripes are built using RAID-6. Now that two pieces of the RAID-6 group are gone, you are vulnerable to data loss in the event of a failure of a third piece. Since the data is in the controller, both D-Pieces are regenerated at the same time, fully restoring the D-Stripe. This makes the window of vulnerability for any D-Stripe that has lost two D-Pieces very small.
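The prioritization described above can be sketched as a simple sort. This is an illustration only: the stripe layout is made-up toy data, `rebuild_order` is a hypothetical helper, and the real controller logic is not documented here.

```python
def rebuild_order(stripes, failed_drives):
    """Return indices of damaged D-Stripes, most-degraded first.

    stripes: list of lists, each holding the drive numbers of one D-Stripe.
    failed_drives: iterable of drive numbers that have failed.
    """
    failed = set(failed_drives)
    damaged = []
    for idx, drives in enumerate(stripes):
        missing = len(failed.intersection(drives))
        if missing:
            damaged.append((missing, idx))
    # Stripes missing two D-Pieces sort ahead of those missing one.
    damaged.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in damaged]

# Toy layout with three D-Pieces per stripe, purely for readability.
toy_stripes = [[0, 1, 2], [3, 4, 5], [2, 3, 6], [7, 8, 9]]
print(rebuild_order(toy_stripes, failed_drives=[2, 3]))  # prints [2, 0, 1]
```

Stripe 2 lost pieces on both failed drives, so it is rebuilt first. Stripes 0 and 1 each lost one piece and follow, while stripe 3 is untouched and never enters the queue.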
In the case of lost drives, the mean number of D-Stripes with two affected D-Pieces is fairly low. This means that reconstruction of the "critical" D-Pieces can happen quickly (remember that they are only 512MB in size). The regenerations happen so quickly that additional drives could fail within minutes without data loss. This is very important to realize: using DDP, you can lose more than two drives. So while RAID-6 is used at the lower level, the design of DDP allows us to tolerate the loss of more than two drives without data loss, provided the critical D-Stripes are regenerated between failures.
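How low that number is depends on the pool size. As a rough sketch, assume each D-Stripe places w D-Pieces on distinct drives chosen uniformly at random from N drives (w = 10 is an assumption here, and uniform placement is a simplification of the pseudo-random layout). The chance that a given D-Stripe has pieces on both of two failed drives is then w(w-1) / (N(N-1)). The helper name below is hypothetical.

```python
from fractions import Fraction

def doubly_affected_fraction(num_drives, pieces_per_stripe=10):
    """Expected fraction of D-Stripes with pieces on both of two failed drives."""
    w, n = pieces_per_stripe, num_drives
    return Fraction(w * (w - 1), n * (n - 1))

for n in (12, 24, 60):
    frac = doubly_affected_fraction(n)
    print(f"{n} drives: {float(frac):.1%} of D-Stripes are critical")
```

Under this model the critical fraction falls quickly as the pool grows. In a minimum-size pool most D-Stripes span nearly every drive, so the benefit is smallest there and increases with larger pools.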
NetApp says that with twelve 1TB drives (twelve is the minimum number of drives in a DDP), the rebuild time for a classic RAID-6 group (10+2) is almost eleven hours for a particular chassis. The corresponding time for a DDP can be as low as seven hours for a complete reconstruction. While this may not seem like a huge improvement, it is about a 36% decrease in rebuild time.
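The 36% figure follows directly from the two quoted times, treating "almost eleven" as eleven hours:

```python
classic_hours = 11  # classic RAID-6 (10+2) rebuild time quoted above
ddp_hours = 7       # best-case DDP rebuild time quoted above

decrease = (classic_hours - ddp_hours) / classic_hours
print(f"{decrease:.1%}")  # prints "36.4%"
```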
Also, remember that this is the time for a complete recovery. RAID-6 redundancy is restored on some of the impacted D-Stripes very quickly, giving you some protection against further drive failures. For a classic RAID-6 group you have to wait eleven hours to get complete RAID-6 protection, whereas with DDP some D-Stripes regain complete RAID-6 protection very quickly and the number of D-Stripes with full RAID-6 protection increases with time.
You can compare classic RAID-6 with DDP in terms of resiliency fairly easily. As soon as one drive is lost, the classic RAID-6 group operates in degraded mode (it can tolerate only one more drive failure), and it doesn't regain full redundancy until the very end, when the reconstruction is finished. DDP regenerates and rebalances very quickly, so the portion of the DDP that has RAID-6 protection grows with time.