It's a fairly well known fact that solid state disk (SSD) performance can suffer over time. This was quite common in early SSDs, but newer controllers have helped reduce this problem through a variety of techniques. In part one of this two-part look at SSDs, we examine the origins of the performance problem and some potential solutions.
SSDs and Performance
Performance degradation problems are a result of how SSDs are constructed and how file systems and applications interact with them. Almost all of the problems stem from the design of SSDs.
In my last article, I presented the basic concepts for constructing an SSD. When SSDs are written to (programmed), they are written in units of pages (typically 4KB). But SSDs are erased in units of blocks, which are much larger than pages (the previous article used an example where the block size was 128 pages or 512KB). This difference in units for writing and erasing is a key to understanding why SSD performance can degrade over time.
Table 1 below, with data from this article, illustrates the performance differences between reading, erasing, and writing for both Single Level Cell (SLC) and Multi Level Cell (MLC) NAND flash.
|Operation||SLC NAND flash||MLC NAND flash|
|Random Read||25 μs||50 μs|
|Erase||2 ms per block||2 ms per block|
|Programming (Write)||250 μs||900 μs|
Notice that the read I/O operation (the first row) is about 10 times faster than the write I/O operation (last row) for SLC, and about 18 times faster for MLC. But more importantly, notice that the erase I/O operation is much slower than either the write or the read I/O operation. For SLC-based SSDs, erasing a block takes about 8 times as long as writing a page. Even more spectacular, a read I/O operation is about 80 times faster than the erase operation for SLC-based SSDs. This difference in the time it takes to complete I/O operations goes to the core of the performance problems people have encountered with SSDs over time.
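The ratios discussed above can be recomputed directly from Table 1. The short Python sketch below does only that arithmetic; the dictionary layout and the function name `ratio` are illustrative choices, not something taken from any SSD tool.

```python
# Latencies from Table 1, in microseconds.
# Read and program are per page; erase is per block.
LATENCY_US = {
    "SLC": {"read": 25, "program": 250, "erase": 2000},
    "MLC": {"read": 50, "program": 900, "erase": 2000},
}


def ratio(cell_type, slow_op, fast_op):
    """Return how many times longer slow_op takes than fast_op."""
    timings = LATENCY_US[cell_type]
    return timings[slow_op] / timings[fast_op]


if __name__ == "__main__":
    for cell in LATENCY_US:
        print(
            cell,
            "program/read:", ratio(cell, "program", "read"),
            "erase/program:", round(ratio(cell, "erase", "program"), 1),
            "erase/read:", ratio(cell, "erase", "read"),
        )
```

Keep in mind that the erase figure covers a whole block while the read and program figures cover a single page, so these ratios compare operations of different sizes.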
Let's assume we need to erase some of the data from a block, a few pages for example, but recall that SSDs have to erase in blocks. Typically, the data within the block is first read from the NAND chips and written to a cache. The appropriate pages are then removed from the cached copy, and any new pages destined for the block are added to it. Next, the entire block on the SSD is erased, and the updated block data in the cache is written back to the block on the SSD. This means that a simple 4KB (one page) change in the data can require the reading and writing of 512KB of data within the SSD. This is sometimes termed the “read-modify-erase-write” process, where the data is read, modified within the cache, erased from the SSD, and finally written back to the SSD.
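To make the four steps concrete, here is a minimal Python sketch of the cycle on the idealized block from the text (128 pages of 4KB). It is a toy model, not controller firmware: the block is just a list of page payloads, and the function name is hypothetical.

```python
PAGE_SIZE_KB = 4
PAGES_PER_BLOCK = 128  # the idealized block from the text: 128 x 4KB = 512KB


def read_modify_erase_write(block, updates):
    """Apply page updates to one block the slow way.

    block   -- list of page payloads standing in for the NAND block
    updates -- dict mapping page index -> new payload
    Returns (kb_read, kb_written) for the whole operation.
    """
    cache = list(block)                     # read: copy the whole block to cache
    for index, payload in updates.items():
        cache[index] = payload              # modify: change pages in the cache
    block[:] = [None] * len(block)          # erase: the whole block is wiped
    block[:] = cache                        # write: program every page back
    kb_moved = len(block) * PAGE_SIZE_KB
    return kb_moved, kb_moved
```

Calling it with a single-page update, for example `read_modify_erase_write(["old"] * PAGES_PER_BLOCK, {3: "new"})`, still reports 512KB read and 512KB written, which is the point of the paragraph above.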
The problem with the read-modify-erase-write process is that the erase step is much slower than the other steps, hurting overall performance. Given that applications can write data in various chunks and that file systems can also write data in various chunk sizes, it is very common for SSDs to have data spread all over the blocks. Consequently, any time a page needs to be updated because the data has changed or the data has been erased, the SSD goes through the read-modify-erase-write cycle, greatly slowing overall performance.
SSD designers have eased the problem by utilizing a pool of unused blocks. The updated block in the cache is written to a clean block from the pool, while the old block is flagged for erasing and erased at some later point, typically during a garbage collection cycle within the SSD. This reduces the time the I/O operation takes, since the erase part of the cycle is removed from the time it takes to write the data. The cost is only deferred, however: at some point, usually during garbage collection, the controller has to spend time erasing the block, and this takes much longer than writing or reading data, once again slowing the overall throughput (performance) of the SSD. Depending on the SSD controller, there may be logic that keeps the number of clean blocks in the pool as large as possible to increase performance. However, this might cause the read-modify-erase-write cycle to happen more often than desired.
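The idea of redirecting a rewrite to a clean block and paying for the erase later can be sketched as follows. This is a simplified illustration under stated assumptions: the class `BlockPool`, its method names, and the use of plain integer block ids are all hypothetical, and a real controller tracks far more state.

```python
from collections import deque


class BlockPool:
    """Toy model: a rewrite takes a clean block now; the erase is paid later."""

    def __init__(self, clean_block_ids):
        self.clean = deque(clean_block_ids)  # erased blocks, ready to program
        self.pending_erase = []              # stale blocks awaiting collection

    def rewrite(self, old_block_id):
        """Redirect a rewrite; return (block_id_used, erased_inline)."""
        if self.clean:
            new_id = self.clean.popleft()
            self.pending_erase.append(old_block_id)  # erase is deferred
            return new_id, False
        # Pool exhausted: the slow erase lands in the write path.
        return old_block_id, True

    def garbage_collect(self):
        """Erase every flagged block and return it to the clean pool."""
        erased = len(self.pending_erase)
        self.clean.extend(self.pending_erase)
        self.pending_erase.clear()
        return erased
```

With a pool of two clean blocks, the first two rewrites avoid the inline erase and the third does not, until `garbage_collect()` has had a chance to run.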
The effect of the read-modify-erase-write process is that the amount of data that is actually written to the SSD can be greater than the amount of data sent to the SSD from the application. The worst case is that a simple 4KB write could cause 512KB worth of data to be written. The ratio of the amount of data written inside the SSD to the amount of application data to be written is called the write amplification factor. In the best case scenario, the write amplification factor is 1 where, for example, 4KB of application data results in 4KB of writes by the SSD. In the worst case, the write amplification factor is 128, at least in our idealized SSD that has 128 pages per block (the exact value depends upon how the SSD is constructed). The write amplification factor can often be used as a measure of the impact of the read-modify-erase-write cycle on performance but is not something typically available to users.
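The definition above reduces to a single division. The Python sketch below just restates it with the two cases from the text; the function name is illustrative.

```python
def write_amplification(nand_kb_written, host_kb_written):
    """Ratio of data written inside the SSD to data the application sent."""
    if host_kb_written <= 0:
        raise ValueError("host_kb_written must be positive")
    return nand_kb_written / host_kb_written


if __name__ == "__main__":
    print(write_amplification(4, 4))    # best case from the text
    print(write_amplification(512, 4))  # worst case for a 128-page block
```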
The write amplification factor for any SSD is a function of the design of the SSD, the controller, the file system, and the exact application mix. Therefore it is impossible to give an average factor for a particular SSD.
What makes the write amplification problem worse is that, over time, a file system can become fragmented because data is added, removed, and changed within the file system by multiple applications. This can result in data being scattered all over the blocks within the SSD and, without a reasonable pool of clean blocks, a simple write will have a large write amplification factor resulting in slow performance.
While having little to do with performance, a write amplification factor greater than 1 can impact the longevity of the SSD. Recall that SSDs have a limited number of write/erase cycles. A write amplification factor greater than 1 means that more data than needed is being written, which uses up more write/erase cycles and reduces longevity.
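A back-of-the-envelope way to see the longevity effect is to divide the total write budget of the flash by the write amplification factor. The sketch below assumes perfect wear leveling, and the capacity and cycle count used in the example are hypothetical round numbers chosen for illustration, not figures for any real drive.

```python
def host_writes_tb(capacity_gb, pe_cycles, write_amplification):
    """Rough total host data (in TB) writable before cells hit rated cycles.

    Assumes perfect wear leveling, so every block ages at the same rate.
    """
    if write_amplification < 1:
        raise ValueError("in this article's model the factor is at least 1")
    return capacity_gb * pe_cycles / write_amplification / 1000  # 1 TB = 1000 GB


if __name__ == "__main__":
    # Hypothetical drive: 64GB of flash rated for 10,000 write/erase cycles.
    for waf in (1.0, 4.0):
        print("WAF", waf, "->", host_writes_tb(64, 10_000, waf), "TB of host writes")
```

Under these assumptions, a factor of 4 cuts the amount of application data the drive can absorb to a quarter.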
Technologies to Improve SSD Write Performance
Don’t dismiss SSDs because of write performance issues. The read-modify-erase-write problem has been known for some time, and engineers and SSD designers have been working on techniques for reducing it (because of the design of SSDs, you can never get the write amplification factor to 1 all of the time). One of the first solutions to the problem is called write combining.
Write combining is a simple concept, but it heaps more work onto the SSD controller. In write combining, multiple writes are collected by the controller before being written to the block(s) within the SSD. The goal is to combine several small writes into a single larger write, in the hope that neighboring pages of data are likely to be changed at the same time and that these pages really belong to the same file. It can bring the write amplification factor much closer to 1, which improves write performance, but the benefit depends on how the data is sent to the drive and whether the data chunks are part of the same file or are likely to be changed or erased at the same time.
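A minimal sketch of the buffering idea, in Python, might look like this. The class `WriteCombiner` and its methods are hypothetical names, and the model ignores everything a real controller must handle (cache limits, power loss, flush timing); it only shows small writes being grouped by the block they fall in.

```python
class WriteCombiner:
    """Buffer small page writes and hand them out grouped by block."""

    def __init__(self, pages_per_block=128):
        self.pages_per_block = pages_per_block
        self.pending = {}  # logical page number -> latest payload

    def write(self, page_number, payload):
        # A repeated write to the same page replaces the buffered copy,
        # so only the latest version ever reaches the flash.
        self.pending[page_number] = payload

    def flush(self):
        """Return {block_number: {page_offset: payload}} and empty the buffer."""
        by_block = {}
        for page_number, payload in sorted(self.pending.items()):
            block, offset = divmod(page_number, self.pages_per_block)
            by_block.setdefault(block, {})[offset] = payload
        self.pending.clear()
        return by_block
```

Four small writes that touch pages 0, 1, 1 again, and 130 come out of `flush()` as two block-level operations instead of four, which is where the reduction in write amplification comes from.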
Of course, one could try to be very clever and modify key applications to write data in block-size chunks and make sure files are an integer multiple of the block size. This means that any data erasing flags result in the entire block being flagged (i.e. all pages are erased). However, this is likely to be too much work, applies only to SSD storage anyway (i.e. it doesn't affect spinning disk storage), and could vary depending upon the block size of the SSD. Overall, write combining is definitely good to have in an SSD controller, but it may not always help.
Another technique for boosting SSD performance is to keep a certain number of blocks in reserve without exposing them to the OS. For example, if the SSD has 75GB of total space, perhaps only 65GB of it will be exposed to the OS. These reserved blocks can be used for the general block pool to help performance without the OS knowing. The reserved blocks increase the size of the block pool, making it much less likely that the pool will run out of available empty blocks. This would let the write cycle just write, instead of read-modify-erase-write. At the very least it becomes read-modify-write and the "old" blocks are flagged for erasure outside of the write cycle. In contrast, if there were no empty blocks available, then the controller would have to use the read-modify-erase-write process.
This simple concept is called over-provisioning. It has benefits for both SSD performance and longevity. In the case of longevity, if a particular block within the SSD has been used more than other blocks (i.e. it has a higher number of write/erase cycles), then it can be switched with a block from the reserved pool that has much less usage. This helps with overall wear leveling of the SSD. On the downside, over-provisioning means that you don't get to use all of the space on your SSD.
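Both halves of the idea can be sketched in a few lines of Python. The first function computes the reserve for the 75GB/65GB example using one common convention (reserved space relative to exposed space); the second picks the least-worn reserve block for a swap. Both function names and the erase counts in the usage example are hypothetical.

```python
def over_provisioning_percent(physical_gb, exposed_gb):
    """Reserved space as a percentage of the space exposed to the OS."""
    if not 0 < exposed_gb <= physical_gb:
        raise ValueError("exposed capacity must be positive and fit in physical")
    return (physical_gb - exposed_gb) / exposed_gb * 100


def pick_replacement(erase_counts, reserve_ids):
    """Return the reserve block id with the fewest write/erase cycles."""
    return min(reserve_ids, key=lambda block_id: erase_counts[block_id])


if __name__ == "__main__":
    print(round(over_provisioning_percent(75, 65), 1), "% over-provisioned")
    # Block 0 is heavily worn; blocks 7 and 8 sit in the reserve pool.
    print("swap in block", pick_replacement({0: 900, 7: 3, 8: 1}, [7, 8]))
```

Some vendors quote the reserve relative to physical capacity instead, which gives a smaller percentage for the same drive.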
Another long-awaited technique is something called a TRIM command. Recall that one of the big performance problems comes when a write is performed to a page that has not been erased. The entire block that contains that page has to be read into cache (read), the new data is then merged with the existing data in the block (modify), the original block on the SSD is erased (erase), and finally the new block in cache is written to the block (write). This read-modify-erase-write process takes much more time than just a write would on the SSD. The TRIM command tells the SSD controller when a page is no longer needed so that it can be flagged for erasing. Then the SSD controller can write the new data to a "clean" page on a block so that the entire read-modify-erase-write cycle is avoided (the cycle just becomes "write"). Thus, the write performance is improved. Without TRIM, the SSD controller does not know when a page can be erased. The only indication that the controller has is when it writes a modified block to a clean block from the block pool. It then knows that the “old” block has pages that can be erased. In essence, TRIM is giving “hints” to the controller about the status of the data that it can use to improve performance and longevity.
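The difference TRIM makes can be illustrated with a toy model of a single erase block. This is a sketch under heavy simplifying assumptions: the class `ToyBlock` and its page states ("clean", "valid", "stale") are invented for illustration, and the block is only erased in the background once nothing valid is left in it.

```python
class ToyBlock:
    """One erase block, tracked page by page as clean, valid or stale."""

    def __init__(self, pages=4):
        self.state = ["clean"] * pages

    def write(self, page):
        """Program a page and report which path the write had to take."""
        if self.state[page] == "clean":
            path = "write"
        else:
            path = "read-modify-erase-write"
        self.state[page] = "valid"
        return path

    def trim(self, pages):
        """The file system's hint: these pages no longer hold needed data."""
        for page in pages:
            if self.state[page] == "valid":
                self.state[page] = "stale"

    def garbage_collect(self):
        """Erase the block in the background if it holds only stale data."""
        if "stale" in self.state and "valid" not in self.state:
            self.state = ["clean"] * len(self.state)
            return True
        return False
```

If a file fills the block and is then deleted, a controller that never receives `trim()` still sees four valid pages, cannot collect the block, and the next write to page 0 takes the slow path. With the hint, the block is erased ahead of time and the same write is a plain "write".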
However, the TRIM command has issues of its own. The first issue is that the SSD controller needs to erase the flagged pages (i.e. garbage collection). Hopefully there is enough time, capability, and cache for the SSD controller to go through the read-modify-erase-write cycle on the flagged blocks. The SSD controller will typically start with the blocks with the largest number of flagged pages. It might even take the used pages from these blocks and put them on other blocks, allowing the entire block to be just erased (i.e. no modify-write steps). This process can use a number of blocks in the block pool. The second issue is that as the capacity on the SSD is used, there are fewer free blocks in the block pool. This can put more pressure on the SSD controller since the TRIM command will start flagging pages within the reduced set of blocks, resulting in fewer totally clean blocks. Consequently, as free space is reduced, you are more likely to encounter a read-modify-erase-write cycle while writing data (something the TRIM command is designed to help alleviate).
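The garbage collection policy described above (start with the block that has the most flagged pages, move its still-used pages elsewhere, then erase it) can be sketched as follows. This is an illustrative greedy model, not any vendor's algorithm; the function names and the string page states are assumptions.

```python
def pick_victim(blocks):
    """Return the id of the block with the most stale pages, or None.

    blocks -- dict mapping block id -> list of page states
              ("valid", "stale" or "clean")
    """
    best = max(blocks, key=lambda b: blocks[b].count("stale"), default=None)
    if best is None or blocks[best].count("stale") == 0:
        return None
    return best


def collect(blocks, victim, spare):
    """Move valid pages from victim into spare, then erase victim.

    Returns the number of pages that had to be copied.
    """
    valid = blocks[victim].count("valid")
    free_slots = [i for i, s in enumerate(blocks[spare]) if s == "clean"]
    if valid > len(free_slots):
        raise ValueError("spare block has too few clean pages")
    for i in free_slots[:valid]:
        blocks[spare][i] = "valid"                    # relocate used pages
    blocks[victim] = ["clean"] * len(blocks[victim])  # whole-block erase
    return valid
```

The copy count returned by `collect()` is extra internal writing that the application never asked for, which is why garbage collection itself consumes blocks from the pool and contributes to write amplification.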
The third issue, which is really related to all of the above, is that the SSD controller needs to have enough horsepower, time, and cache to start the process of garbage collection. The likelihood that there will be blocks with pages flagged by the TRIM command increases if the SSD is being heavily used. If the controller doesn't have time to perform garbage collection, the probability of encountering a read-modify-erase-write cycle increases, with the associated reduction in write performance. Alternatively, the designers of the SSD controller can insert logic that forces the SSD controller to perform garbage collection at certain times, but the result is the same: a reduction in write performance.
To make TRIM work effectively, the file system has to understand when pages are deleted and when to send an appropriate TRIM command. The OS must be able to send the TRIM command to the drive controller, and the drive controller has to understand the command and act accordingly. TRIM appeared in Windows before Linux, but the more recent kernel versions understand the TRIM command and can pass it to the drive controller. Specifically, any kernel from 2.6.33 and up understands the TRIM command. Many file systems also understand the TRIM command. For example, ext4 understands TRIM, as does btrfs (since 2.6.32). Other file systems are gaining TRIM capability as well.
Like write combining and over-provisioning, TRIM is not a cure-all for write performance issues. The controller may have difficulty keeping up with TRIM commands if you push enough data to the SSD. If you modify an existing file without first erasing it, you are still likely to run into the read-modify-erase-write performance problem – and TRIM can't help.
SSD write performance degradation over time is rooted in the read-modify-erase-write cycle because, fundamentally, writes happen on a page level and erases happen on a block basis. The result is the write amplification factor, where a write amplification factor of 1 means that the SSD writes exactly the amount of data the application requests. Write amplification factors greater than 1 necessitate that the SSD perform some “housekeeping” so that application data can be written. This extra housekeeping can slow write performance and reduce the longevity of an SSD.
On the bright side, SSD engineers and designers have been working on techniques to fix these problems for some time. Write combining, over-provisioning, and the TRIM command, are all techniques that can help reduce the impact of the read-modify-erase-write cycle on performance. However, there are conditions under which these techniques may not help as much as we would like.
In addition, a file system can become fragmented as it ages and that fragmentation can trickle down to underlying SSDs, leading to reduced performance, but there’s not much engineers or designers can do to fix that problem.
Part two of this article series will present some benchmark results to examine the impact of age on SSD performance. We'll take a brand-spanking new Intel X25-E SSD (enterprise class) and run some benchmarks against it, and then we'll torture the poor SSD with more tests and rerun the benchmarks to see how well the performance holds up. The results are really interesting.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.