Drive Reliability Studies - Page 2
All of the drives were designed for high performance and used a custom PCIe interface. The drive types, taken from the paper, are listed below in Table 2.
Table 2 - Table of flash drive types
For every drive, they had access to daily counts of several error types; daily workload statistics, including the number of read, write and erase operations; and the number of bad blocks developed during the day. Note that the read, write and erase counts include both operations initiated by user processes (reading or writing to the drive) and internal operations such as garbage collection. They also had logs of when a chip in a drive was declared failed and when a drive itself was swapped out.
The most common concern about flash drives is that they wear out because of the limited number of Program/Erase (P/E) cycles the chips can sustain. A rule of thumb is that SLC (Single-Level Cell) drives have a limit of 100,000 P/E cycles, eMLC drives have a limit of 10,000 P/E cycles, and MLC drives have a limit of 3,000 P/E cycles.
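As a rough illustration of how wear is judged against these limits, the sketch below divides an average P/E cycle count by the rule-of-thumb limit for the flash type. The function name `pe_percent_used` and the two cycle counts are illustrative assumptions, not values taken from the study's table.

```python
# Rule-of-thumb P/E cycle limits quoted in the text, keyed by flash type.
PE_LIMITS = {"SLC": 100_000, "eMLC": 10_000, "MLC": 3_000}


def pe_percent_used(flash_type: str, average_pe_cycles: float) -> float:
    """Return the share of the rated P/E budget consumed, in percent."""
    limit = PE_LIMITS[flash_type]
    return 100.0 * average_pe_cycles / limit


if __name__ == "__main__":
    # Hypothetical average cycle counts, chosen only for illustration.
    print(round(pe_percent_used("MLC", 948), 1))
    print(round(pe_percent_used("SLC", 185), 3))
```

A drive that has consumed only a small share of its budget after several years is unlikely to hit the P/E limit within its service life.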
Table 3 gives the average number of P/E cycles used over the life of the drives in the study (3-4 years) for each drive type.
|Model Name||Generation||Vendor||Flash Type||Lithography (nm)||PE Cycle Limit||Average P/E Cycles||Percentage used|
Over the three to four years these drive classes were in use, the largest average percentage of P/E cycles used was 31.6 percent and the smallest was 0.185 percent. After that much use, these figures are far below the P/E limits. The obvious conclusion is that P/E cycle exhaustion is not a real concern.
As with the hard drive study, the next statistic examined was the annual replacement rate (ARR) of the flash drives. The chart is shown below in Figure 2 (taken from the paper).
Figure 2 - ARR for eight flash drive types versus hard drives
Around 1-2 percent of the flash drives were replaced annually, versus a hard drive average of around 4.6 percent. This is a factor of roughly 2.3 to 4.6 in favor of flash drives. Remember that this is the annual replacement rate and not the failure rate.
To dig deeper, they had access to various error types for analysis. The errors were divided into two classes: (1) transparent errors, where the error was masked from the user, and (2) non-transparent errors, where the user encountered an error. The list below summarizes these errors:
- Transparent errors:
  - Correctable errors - During a read, an error is detected and corrected by the drive's ECC
  - Read errors - A read operation experiences a non-ECC error but after a retry, the read succeeds
  - Write errors - A write operation experiences a non-ECC error but after a retry, the write succeeds
  - Erase errors - An erase operation on a block fails (this doesn't impact the user, so it counts as a transparent error)
- Non-transparent errors:
  - Uncorrectable errors - A read operation that ECC cannot correct
  - Final read error - A read operation that cannot be corrected even after multiple retries
  - Final write error - A write operation that cannot be corrected even after multiple retries
  - Meta error - An error accessing metadata on the drive itself
  - Timeout error - An operation that timed out after 3 seconds
Transparent errors are correctable so that the user does not see them in normal operations except perhaps for a brief delay in the I/O. Non-transparent errors will cause an application to either crash or report an error and stop.
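The two error classes above can be captured in a small lookup structure. The following is a minimal sketch, assuming hypothetical names (`ErrorType`, `is_transparent`, `user_visible`) that do not come from the study; it only encodes the classification listed above.

```python
from enum import Enum


class ErrorType(Enum):
    # Transparent errors: masked from the user.
    CORRECTABLE = "correctable error"
    READ = "read error"
    WRITE = "write error"
    ERASE = "erase error"
    # Non-transparent errors: visible to the user.
    UNCORRECTABLE = "uncorrectable error"
    FINAL_READ = "final read error"
    FINAL_WRITE = "final write error"
    META = "meta error"
    TIMEOUT = "timeout error"


TRANSPARENT = frozenset(
    {ErrorType.CORRECTABLE, ErrorType.READ, ErrorType.WRITE, ErrorType.ERASE}
)


def is_transparent(error: ErrorType) -> bool:
    """True if the drive masks this error from the user."""
    return error in TRANSPARENT


def user_visible(errors):
    """Keep only the errors an application would actually encounter."""
    return [e for e in errors if not is_transparent(e)]
```

For example, `user_visible([ErrorType.CORRECTABLE, ErrorType.TIMEOUT])` keeps only the timeout error, since the correctable error is handled inside the drive.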
Non-transparent errors are ones that cannot be corrected even with ECC and multiple retries. The authors found that most non-transparent errors are final read errors, also known as unrecoverable read errors (UREs). Depending upon the model of the drive, between 20 and 63 percent of drives experienced at least one such error while in production. In addition, between 2 and 6 out of every 1,000 drive days were affected.
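The two figures quoted here measure different things: the share of drives that ever saw the error, and the share of drive days on which it occurred. The sketch below shows one way both could be computed from daily per-drive error counts. The record layout and the name `prevalence` are assumptions made for illustration, not the format of the study's logs.

```python
def prevalence(daily_records, error_name):
    """Return (fraction of drives affected, fraction of drive days affected).

    daily_records is an iterable of (drive_id, day, counts) tuples, one per
    drive per day, where counts maps an error name to that day's count.
    This layout is a hypothetical one chosen for the example.
    """
    drives = set()
    affected_drives = set()
    drive_days = 0
    affected_days = 0
    for drive_id, _day, counts in daily_records:
        drives.add(drive_id)
        drive_days += 1
        if counts.get(error_name, 0) > 0:
            affected_drives.add(drive_id)
            affected_days += 1
    if drive_days == 0:
        raise ValueError("no drive days recorded")
    return len(affected_drives) / len(drives), affected_days / drive_days


if __name__ == "__main__":
    # Made-up records: drive "a" sees one final read error on day 2.
    records = [
        ("a", 1, {}),
        ("a", 2, {"final_read": 1}),
        ("a", 3, {}),
        ("b", 1, {}),
        ("b", 2, {"correctable": 12}),
    ]
    print(prevalence(records, "final_read"))
```

A single bad day marks a drive as affected for its whole life, which is why the per-drive percentage can be high while the per-drive-day rate stays small.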
These UREs are almost exclusively due to bit corruptions that ECC cannot correct. Some people call this bit-rot (bits going bad). If a drive encounters a URE, the stored data cannot be read. This either results in a failed read in the user's code, or if the drives are in a RAID group that has replication, then the data is read from a different drive.
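The fallback behavior described above, reading from another copy when one drive returns a URE, can be sketched as follows. This is a simplified illustration, not how any particular RAID implementation works; `UnrecoverableReadError` and `read_with_fallback` are hypothetical names, and each replica is modeled as a plain callable.

```python
class UnrecoverableReadError(Exception):
    """Hypothetical error raised when a replica cannot return a block."""


def read_with_fallback(replicas, block_id):
    """Try each replica in order and return the first successful read.

    replicas is a sequence of callables that take a block id and either
    return the data or raise UnrecoverableReadError.
    """
    last_error = None
    for replica in replicas:
        try:
            return replica(block_id)
        except UnrecoverableReadError as err:
            last_error = err  # remember the failure and try the next copy
    raise UnrecoverableReadError(
        f"block {block_id} unreadable on all replicas"
    ) from last_error


if __name__ == "__main__":
    def failing_drive(block_id):
        raise UnrecoverableReadError(f"URE on block {block_id}")

    def healthy_drive(block_id):
        return b"data"

    print(read_with_fallback([failing_drive, healthy_drive], 7))
```

Without a second copy, the same URE surfaces directly as a failed read in the application.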
The authors found that final read errors (read errors after multiple retries) are about two orders of magnitude more frequent in terms of drive days than any other non-transparent (non-recoverable) error.
By contrast, the authors wrote that write errors rarely turned into non-transparent (non-recoverable) errors. They found that, depending upon the drive model, 1.5 to 2 percent of the drives and 1-5 out of 10,000 drive days experienced a final write error. It's fairly safe to say that the statistics are so low because a write that fails on a drive can be redirected to a different location on the drive (assuming, of course, that the drive isn't full). A final write error really indicates a larger-scale hardware problem than just a single chip on the drive. These types of errors need to be watched carefully.
Drive metadata errors happen at a frequency similar to write errors. Just like write errors, they happen at a much lower rate than read errors. Timeouts and response errors, indicative of metadata problems, typically affect less than 1 percent of the drives and less than 1 in 100,000 drive days. This makes metadata errors the lowest-frequency errors encountered.
Transparent errors are errors you don't see as a user but that nonetheless happen within the drive. They are almost always handled by an ECC correction or a retry within the drive. Correctable errors, which are handled by ECC, are the most common type of transparent error found in the study. According to the study, virtually every flash drive had at least one correctable error during its life.
The majority of drive days, around 61-90 percent, experienced correctable errors. After correctable errors, the most common transparent errors were write errors and erase errors. Typically 6-10 percent of drives had one of these two errors, but for some models 40-68 percent of the drives were affected. Even so, fewer than 5 in 10,000 drive days experienced these errors.