It's a fairly well known fact that solid state disk (SSD) performance can suffer over time. This was quite common in early SSDs, but newer controllers have helped reduce this problem through a variety of techniques. In part one of this two-part look at SSDs, we examine the origins of the performance problem and some potential solutions.
SSDs and Performance
Performance degradation stems almost entirely from how SSDs are constructed and how file systems and applications interact with them.
In my last article, I presented the basic concepts for constructing an SSD. When SSDs are written to (programmed), they are written in units of pages (typically 4KB). But SSDs are erased in units of blocks, which are much larger than pages (the previous article used an example where the block size was 128 pages or 512KB). This difference in units for writing and erasing is a key to understanding why SSD performance can degrade over time.
Table 1 below, with data from this article, illustrates the performance differences between reading, erasing, and writing for both Single Level Cell (SLC) and Multi Level Cell (MLC) NAND flash.
| Operation | SLC NAND flash | MLC NAND flash |
|---|---|---|
| Random Read | 25 μs | 50 μs |
| Erase | 2 ms per block | 2 ms per block |
| Programming (Write) | 250 μs | 900 μs |
Notice that the read I/O operation (the first row) is about 10 times faster than the write I/O operation (last row) for SLC, and roughly 18 times faster for MLC. But more importantly, notice that the erase I/O operation is much slower than either the write or the read. For SLC-based SSDs, erasing a block is about 8 times slower than writing a page. Even more striking, a read I/O operation is about 80 times faster than an erase operation on SLC-based SSDs. This gap in I/O operation times goes to the core of the performance problems people have encountered with SSDs over time.
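The ratios follow directly from Table 1. A few lines of Python make them explicit (the latency figures are the illustrative ones from the table, not measurements of any particular drive):

```python
# Illustrative latencies from Table 1, in microseconds.
SLC = {"read": 25, "write": 250, "erase": 2000}  # erase: 2 ms per block
MLC = {"read": 50, "write": 900, "erase": 2000}

def ratios(chip):
    """How much slower writes and erases are relative to reads."""
    return {
        "write_vs_read": chip["write"] / chip["read"],
        "erase_vs_read": chip["erase"] / chip["read"],
        "erase_vs_write": chip["erase"] / chip["write"],
    }

print(ratios(SLC))  # write is 10x a read; erase is 80x a read and 8x a write
print(ratios(MLC))  # write is 18x a read
```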
Let's assume we need to erase some data from a block, a few pages for example, but recall that SSDs must erase entire blocks. Typically, the data in the block is first read from the NAND chips into a cache. The appropriate pages are removed from the cached copy, and any new pages destined for the block are merged into the cached data. The entire block on the SSD is then erased, and the updated block data in the cache is written back to the block. This means that a simple 4KB (one page) change in the data can require the reading and writing of 512KB of data within the SSD. This is sometimes termed the “read-modify-erase-write” process: the block is read, the data is modified within the cache, the block is erased, and finally the updated data is written back to the SSD.
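As a concrete sketch, the following Python models the read-modify-erase-write cycle on the idealized SSD from the previous article (128 pages of 4KB per block); the function and its bookkeeping are hypothetical, purely to show the data movement:

```python
PAGE_KB = 4
PAGES_PER_BLOCK = 128
BLOCK_KB = PAGE_KB * PAGES_PER_BLOCK  # 512 KB

def update_page(block, page_index, new_page):
    """Change one page via read-modify-erase-write.
    Returns the rewritten block and the KB physically written."""
    cache = list(block)               # read: copy the whole block into a cache
    cache[page_index] = new_page      # modify: patch the one page in the cache
    block = [None] * PAGES_PER_BLOCK  # erase: the entire block is wiped
    block = cache                     # write: the cached image is flushed back
    return block, BLOCK_KB

block, kb = update_page(["old"] * PAGES_PER_BLOCK, 7, "new")
print(kb)  # a single 4 KB change costs 512 KB of internal writes
```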
The problem with the read-modify-erase-write process is that the erase step is much slower than the other steps, hurting overall performance. Because applications and file systems both write data in chunks of varying sizes, it is very common for an SSD's data to be spread across many blocks. Consequently, any time a page needs to be updated, because the data has changed or been erased, the SSD goes through the read-modify-erase-write cycle, greatly slowing overall performance.
SSD designers have eased the problem by maintaining a pool of unused blocks. The updated block in the cache is written to a clean block from the pool, while the old block is flagged for erasing and erased later, typically during a garbage collection cycle within the SSD. This removes the slow erase step from the write path, so the write itself completes much faster. At some point, however, usually during garbage collection, the controller still has to spend time erasing blocks, and this takes much longer than reading or writing data, once again slowing the overall throughput (performance) of the SSD. Depending on the controller, there may be logic to keep the pool of clean blocks as large as possible to improve performance; if the pool runs low, the read-modify-erase-write cycle can happen more often than desired.
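The pool-of-clean-blocks idea can be sketched as a toy controller; this is a hypothetical model, not any real firmware, and the class and method names are invented for illustration:

```python
from collections import deque

class PoolController:
    """Toy sketch of a controller that redirects each block update to a
    clean block from a pool and defers the slow erase to garbage collection."""

    def __init__(self, clean_blocks):
        self.clean = deque(range(clean_blocks))  # erased, ready-to-write blocks
        self.dirty = deque()                     # stale blocks awaiting erase
        self.contents = {}                       # block id -> stored data

    def rewrite(self, old_block_id, data):
        if not self.clean:
            self.garbage_collect()       # pool exhausted: forced, slow path
        new_id = self.clean.popleft()    # fast path: program a clean block
        self.contents[new_id] = data
        self.dirty.append(old_block_id)  # old block flagged, not erased yet
        return new_id

    def garbage_collect(self):
        while self.dirty:                 # the costly erases happen here,
            stale = self.dirty.popleft()  # off the write path
            self.contents.pop(stale, None)
            self.clean.append(stale)

ctrl = PoolController(clean_blocks=4)
new_id = ctrl.rewrite(old_block_id=99, data=b"updated block image")
print(new_id, list(ctrl.dirty))  # 0 [99]
```

The write returns as soon as the clean block is programmed; block 99 sits in the dirty queue until a garbage-collection pass pays the erase cost.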
The effect of the read-modify-erase-write process is that the amount of data actually written to the SSD can be greater than the amount of data sent to the SSD from the application. In the worst case, a simple 4KB write can cause 512KB worth of data to be written. The ratio of the writes happening inside the SSD to the amount of application data written is called the write amplification factor. In the best case scenario, the write amplification factor is 1: for example, 4KB of application data results in 4KB of writes by the SSD. In the worst case, the write amplification factor is 128 (at least in our idealized SSD with 128 pages per block; the exact value depends upon how the SSD is constructed). The write amplification factor can be used as a measure of the impact of the read-modify-erase-write cycle on performance, but it is not something typically exposed to users.
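The arithmetic is simple enough to write down directly (the 4KB page and 512KB block sizes are the idealized figures used throughout this article):

```python
def write_amplification(host_kb, nand_kb):
    """Write amplification factor: internal NAND writes / host writes."""
    return nand_kb / host_kb

# Best case: a 4 KB host write costs exactly 4 KB on the NAND.
print(write_amplification(4, 4))    # 1.0
# Worst case in the idealized SSD: 4 KB triggers a full 512 KB block rewrite.
print(write_amplification(4, 512))  # 128.0
```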
The write amplification factor for any SSD is a function of the design of the SSD, the controller, the file system, and the exact application mix. Therefore it is impossible to give an average factor for a particular SSD.
What makes the write amplification problem worse is that, over time, a file system can become fragmented because data is added, removed, and changed within the file system by multiple applications. This can result in data being scattered all over the blocks within the SSD and, without a reasonable pool of clean blocks, a simple write will have a large write amplification factor resulting in slow performance.
While having little to do with performance, a write amplification factor greater than 1 can also affect the longevity of the SSD. Recall that SSDs have a limited number of write/erase cycles. A write amplification factor greater than 1 means that more data than necessary is being written, consuming more write/erase cycles and reducing longevity.
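The longevity cost can be estimated with a back-of-the-envelope calculation; the drive figures below (100 GB capacity, 3,000 program/erase cycles) are hypothetical numbers chosen purely for illustration:

```python
def lifetime_host_writes_tb(capacity_gb, pe_cycles, waf):
    """Total host data (TB) writable before the rated program/erase
    cycles are exhausted, for a given write amplification factor."""
    total_nand_writes_gb = capacity_gb * pe_cycles
    return total_nand_writes_gb / waf / 1000  # GB -> TB

# Hypothetical 100 GB drive rated for 3,000 P/E cycles:
print(lifetime_host_writes_tb(100, 3000, waf=1))  # 300.0 TB of host writes
print(lifetime_host_writes_tb(100, 3000, waf=4))  # 75.0 TB -- 4x less life
```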
Technologies to Improve SSD Write Performance
Don’t dismiss SSDs because of write performance issues. The read-modify-erase-write cycle has been a known problem for some time, and engineers and SSD designers have been working on techniques to reduce it (though because of the design of SSDs, you can never keep the write amplification factor at 1 all of the time). One of the first solutions to the problem is called write combining.
Write combining is a simple concept, but it heaps more work onto the SSD controller. In write combining, multiple writes are collected by the controller before being written to the block(s) within the SSD. The goal is to combine several small writes into a single larger write, in the hope that neighboring pages of data are likely to be changed at the same time and that these pages really belong to the same file. Write combining can greatly reduce the write amplification factor, bringing it closer to 1 and improving write performance, but its effectiveness depends on how the data is sent to the drive and whether the data chunks are part of the same file or are likely to be changed/erased at the same time.
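A minimal sketch of the idea, assuming the idealized 4KB pages from earlier (the function is hypothetical; real controllers also weigh locality, timing, and which writes belong together):

```python
import math

PAGE_KB = 4

def combined_program_kb(pending_writes_kb):
    """Coalesce queued small writes into as few full-page programs
    as possible, returning the KB actually programmed to NAND."""
    total_kb = sum(pending_writes_kb)
    pages = math.ceil(total_kb / PAGE_KB)  # pages actually programmed
    return pages * PAGE_KB

# Eight queued 1 KB writes combined into two 4 KB page programs (8 KB),
# instead of eight separate page programs (32 KB) if written one by one.
combined = combined_program_kb([1] * 8)
naive = 8 * PAGE_KB
print(combined, naive)  # 8 32
```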
Of course, one could try to be very clever and modify key applications to write data in block-sized chunks and to make sure files are an integer multiple of the block size. Erasing data would then flag entire blocks (i.e., all of their pages) for erasure. However, this is likely to be too much work, applies only to SSD storage anyway (i.e., it doesn't benefit spinning disk storage), and would vary depending upon the block size of the SSDs in use. Overall, though, write combining is definitely good to have in an SSD controller, even if it may not always help.