Hard drive reliability has certainly generated research. A particularly noteworthy example is the 2007 paper by Bianca Schroeder and Garth Gibson, which examined the failure rates of drives in real-world systems. Several of the systems were high-performance computing (HPC) systems, but some were not. The paper's observations were very interesting and have gone a long way in influencing how people think about hard drive reliability.
Recently, Dr. Schroeder and her co-authors released a new paper that discusses flash drive reliability. The observations in this paper are equally eye-opening.
In this article, I want to review the hard drive reliability paper to understand the methodologies used and how the conclusions were reached. Then I want to turn to the new paper on flash drive reliability.
To summarize the flash drive reliability findings:
- SLC drives are not generally more reliable than MLC drives.
- Flash drives have a much lower ARR (Annual Replacement Rate) than hard drives.
- However, 20 percent of the flash drives developed uncorrectable errors over a four-year period, which is much higher than for hard drives.
- Between 30 and 80 percent of flash drives develop bad blocks during their lifetime, possibly leading to data loss. By comparison, only 3.5 percent of hard drives develop bad sectors over a 32-month period. The number of sectors on a hard drive is orders of magnitude larger than the number of blocks or chips on an SSD, and sectors are smaller than flash blocks, so a bad sector has far less impact than a bad block (i.e., the impact on a hard drive is less than on a flash drive).
- 2-7 percent of the flash drives develop bad chips, which again can lead to data loss.
Hard Drive Reliability Study
The hard drive reliability paper is truly one of the seminal papers in storage. Even though the paper was written nine years ago, its observations about real-world disk failures are worth reviewing. The study included about 100,000 drives from seven sites, four of which were HPC centers and three of which were large Internet Service Providers (ISPs). The drive types included FC (Fibre Channel), SCSI and SATA.
Drive manufacturers specify the reliability of their products using two metrics: (1) the Annualized Failure Rate (AFR), which is the percentage of disk drives in a population that fail in a test, scaled to a per-year estimate, and (2) the Mean Time to Failure (MTTF), which is the number of power-on hours per year divided by the AFR. From the manufacturers' specifications (datasheets), the drives in the study had an MTTF of between 1,000,000 and 1,500,000 hours, suggesting an annual failure rate of between 0.58 percent and 0.88 percent.
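As a quick sanity check of those figures, the datasheet MTTF can be converted to an implied AFR using the relationship above. A minimal sketch, assuming the drives are powered on 24x7 (8,760 hours per year):

```python
# Convert a datasheet MTTF (in hours) to the implied annualized failure
# rate (AFR), assuming 24x7 operation: AFR = (power-on hours/year) / MTTF.
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours per year

def afr_percent(mttf_hours: float) -> float:
    """AFR in percent implied by a given MTTF in hours."""
    return 100.0 * HOURS_PER_YEAR / mttf_hours

print(f"{afr_percent(1_000_000):.2f}%")  # -> 0.88% for a 1,000,000-hour MTTF
print(f"{afr_percent(1_500_000):.2f}%")  # -> 0.58% for a 1,500,000-hour MTTF
```

This reproduces the 0.58-0.88 percent range quoted above from the datasheet MTTFs.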
In the paper, Dr. Schroeder and Dr. Gibson reported the Annual Replacement Rate (ARR), which is similar in concept to AFR but counts the number of drives replaced rather than the number failed. To a customer, a drive needs to be replaced if it is identified as the "likely culprit" of a problem and the customer's own tests confirm that the drive is faulty. Note that a replaced drive is not necessarily a "failed" drive in the eyes of the manufacturer, even though the customer tested it and was unable to continue using it (hence the word "replaced").
At the time, common wisdom said that FC and SCSI drives were more reliable than SATA drives. Dr. Schroeder and Dr. Gibson computed the ARR for each drive type for each center and plotted it versus the AFR from the drive manufacturers. The results are shown in Figure 1 below from their paper.
Figure 1 - ARR for the seven centers along with AFR data from drive manufacturers
The drive types in Figure 1 are listed below in Table 1, which is also taken from their paper.
Table 1 - Table of drive types
Note: Only disks within the nominal lifetime of five years are included in Figure 1 (i.e., there is no bar for the COM3 drives that were deployed in 1998 because they are older than 5 years). The third bar for COM3 in the graph is cut off to make the chart easier to read (its ARR is 13.5 percent).
For all of the drives, the overall weighted ARR is 3.01 percent, and individual ARRs range from about 0.5 percent to 13.5 percent. Recall that the drive manufacturers' datasheets implied an AFR of between 0.58 percent and 0.88 percent. The weighted ARR is therefore about 3.4 times larger than the 0.88 percent AFR implied by a 1,000,000-hour MTTF, and the worst ARR values are up to a factor of 15 larger than the datasheet figures.
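The multipliers quoted above fall directly out of the reported rates. A small sketch of the arithmetic:

```python
# Comparing the observed replacement rates against the datasheet AFR.
datasheet_afr = 0.88   # percent, implied by a 1,000,000-hour MTTF
weighted_arr = 3.01    # percent, overall weighted ARR across all drives
worst_arr = 13.5       # percent, highest ARR observed in the study

print(f"{weighted_arr / datasheet_afr:.1f}x")  # -> 3.4x the datasheet rate
print(f"{worst_arr / datasheet_afr:.1f}x")     # -> 15.3x the datasheet rate
```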
Notice that the drives with the highest ARR were FC drives, which were thought to be among the most reliable drives. The "HPC4" center, which only reported data for SATA drives, had the lowest ARR, and in one case it was actually lower than the manufacturer's datasheet figure (0.58 percent, or an MTTF of 1,500,000 hours). On the other hand, the SATA drives at the HPC3 center didn't fare as well and had an ARR slightly above the weighted average.
SCSI drives were the other drive type widely considered to be reliable. However, Figure 1 illustrates that almost all of them had replacement rates close to the weighted average, except for the second drive type in HPC1 and the drives in HPC2.
Some other observations the authors made were:
- For older systems (5-8 years of age), data sheet MTTFs underestimated replacement rates by as much as a factor of 30.
- Even during the first few years of a system’s lifetime (less than 3 years), when wear-out is not expected to be a significant factor, the difference between datasheet MTTF and observed time to disk replacement was as large as a factor of 6.
- Contrary to common and proposed models, hard drive replacement rates do not enter a steady state after the first year of operation. Instead replacement rates seem to increase steadily over time.
The observations made in the paper are very important because it was the first time a public examination of drive replacement statistics was performed over a large population (100,000 drives). The results were a bit unexpected but pointed out some differences between real-world experiences and what datasheets say.
Reliability of Flash drives in Production
Dr. Schroeder published a new paper on drive reliability at FAST '16, but this time it was about SSDs (flash drives). Dr. Schroeder, along with two researchers at Google, presented a paper entitled "Flash Reliability in Production: The Expected and the Unexpected."
For this paper, they examined the drive reports for ten different drive models over millions of drive days (lots of drives over several years) from the Google fleet. They examined three different drive types: (1) SLC, (2) eMLC (enterprise MLC), and (3) MLC, over a range of feature sizes (24nm to 50nm). They only used statistics from drives that had been in production for a minimum of four years, with typically about six years of production use. These drives used commodity flash chips from four different vendors. In some cases there was data on two generations of the same drive in the population, allowing them to at least investigate the impact of feature size.