Download the authoritative guide: Enterprise Data Storage 2018: Optimizing Your Storage Infrastructure
RBER (read bit error rates)
In the paper, the authors state that the standard metric to evaluate flash reliability is the raw bit effort rate (RBER) of a drive. This is defined as the number of corrupted bits per number of total bits read, which includes correctable as well as uncorrectable corruption events.
The paper has a very extensive discussion about RBER and comparing to other errors in the group of drives. They concluded that RBERs were mostly recovered by ECC or occasionally by read retries. Therefore they can be considered transient errors and perhaps were not as interesting to drive reliability as original thought.
Moreover, for drives past their P/E ratio limits, the RBER did not increase as dramatically as was first thought.
Figure 3 - Median RBER for drive families as a function of the P/E ratio
One of the unexpected outcomes from the data analysis was the high level of uncorrectable errors (UEs). The standard measure to report UEs is the number of Uncorrectable Bit Errors per total number of bits read (UBER - not the ride service).
They examined if there were any correlation between the number of bits read and the number of uncorrectable errors and found none. As a result, the authors stated that UBER is not a good measure for flash drive reliability. They also found that RBER is a bad indicator of UEs.
The authors did observe that the daily UE rate has some correlation with P/E cycles. Figure 4 shows a plot of the daily UE Probability as a function of the P/E ratio.
Figure 4 - Daily UE Probability as a function of the P/E ratio
This is the only correlation they found. If a drive had a UE one month, there was a 30 percent chance it would see another one the next month. This can be used as a metric for monitoring drives.
The last interesting observations made in the paper that I want to mention are around bad blocks on the drive. Recall that on a flash drive, the block is the lowest level where erase operations take place. Many modern drives have several blocks reserved in case blocks on the drive go bad (the so-called reserved space). The authors of the study looked at the number of bad blocks on drives when they arrived from the factory ("initial bad blocks") and the number of bad blocks that developed over time.
They defined the a block as "bad" if it had a final read error, a write error, or an erase error and consequently re-mappped to a different block on the drive. When this happens the drive controller marks the block as bad, and it is never used again. Also any data that was on the block that could have been recovered is recovered and written to the replacement block. But note that it may not be possible to recover all the data from the bad block, which is really data corruption.
Table 4 below, which is taken from the paper, presents the number of bad blocks for the various drive types.
Table 4 - Average bad block count for drives in study
The top half of the table lists the bad block data on each drive model that developed bad blocks in the field (as the drive was in production). It also lists the fraction of drives that developed bad blocks, the median and the average number of bad blocks for those drives that had bad blocks.
The bottom half of the table presents statistics for drives that arrived with bad blocks from the factory (abbreviated as "fact."). The percentage of drives that come from the factory with bad blocks is extremely large. Virtually every single drive came from the factory with bad blocks. Once class of drives (SLC-A) had every single drive arrive with bad blocks. The median number of bad blocks varied from as low as 50 (SLC-A) to as many of 3,450 (SLC-B). This seems like a surprisingly large number of bad blocks from the factory.
According to the authors, depending upon the model, between 30-80 percent of the drives develop bad blocks in the field (in production).
The authors also looked at the number of bad blocks per drive that are accumulated for drives that started with bad blocks. The median number of bad blocks for drives that started with bad blocks was 2-4, depending upon the drive model. Figure 5 from the paper illustrates this.
Figure 5 - The median number of bad blocks a drive will develop, as a function of how many bad blocks it has already developed.
The figure plots the median number of bad blocks that the drive develops on the y-axis, as a function of how many bad blocks a drive has already experienced. The solid blue lines are for the MLC drives and the dashed red lines are for the SLC drives.
The authors found that for MLC drives there was a sharp increase after the second bad block was detected. That is, close to 50 percent of those drives that develop two bad blocks will develop close to 200 or more bad blocks in total.
The authors offer the opinion that bad block counts on the order of hundreds are likely due to chip failure. Therefore they conclude that after experiencing a "handful" of bad blocks, there is a high chance for developing a chip failure. Therefore this metric could be used as an indicator of needing to copy the data from a failing chip to another chip in the drive.
The first paper that Dr. Schroeder wrote with Dr. Gibson on drive reliability is one of the most important papers for data storage. The new paper about the reliability of flash drives is equally as important. The study created a great deal of new and unexpected information.
There were several important observations made from the flash study:
- One of the first observations is that SLC drives are not generally more reliable than MLC drives.
- Another observation is that flash drives have a much lower ARR (Annual Replacement Rate) compared to hard drives (this is good news).
- On the downside, 20 percent of the flash drives developed uncorrectable errors in a four-year period. This is much higher than hard drives.
- 30-80 percent of the flash drives develop bad blocks during their lifetime, possibly leading to loss of data.
- For hard drives, only 3.5 percent of them develop bad sectors in a 32-month period. The number of sectors on a hard drive are magnitudes larger than the number of either blocks or chips on an SSD. These sectors are smaller than flash drive blocks. Therefore when a sector goes bad, the impact is much less than if a block goes bad (i.e. the impact on the hard drives is less than for a flash drive).
- 2-7 percent of the drives develop bad chips, which again can lead to data loss
A simple summary is that flash drives experience significantly lower replacement rates than hard drives. However, flash drives experience significantly higher rates of uncorrectable errors than hard drives. This can mean the potential loss of data, so steps should be taken to ensure no loss of data.
Photo courtesy of Shutterstock.