SAS vs. SATA - Page 5
ZFS verifies checksums on reads. When it writes data to storage, it computes a checksum of each block and writes it along with the data to the storage devices. The checksum is stored in the pointer to the block, not in the block itself. A checksum of the block pointer is in turn computed and stored in its own pointer, and this continues all the way up the tree to the root node, which also has a checksum.
When a data block is read, its checksum is computed and compared to the checksum stored in the block pointer. If the checksums match, the data is passed from the file system to the calling function. If they do not match, the data is corrected using either mirroring or RAID (depending upon how ZFS is configured).
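To make the read path concrete, here is a minimal Python sketch of the flow just described. The `BlockPointer` class and `read_block` function are hypothetical simplifications (real ZFS block pointers hold much more, and ZFS defaults to fletcher checksums rather than SHA-256):

```python
import hashlib
from typing import Optional

def checksum(data: bytes) -> bytes:
    # ZFS defaults to fletcher4 (sha256 is an option); SHA-256 keeps
    # this sketch simple.
    return hashlib.sha256(data).digest()

class BlockPointer:
    """Hypothetical, simplified block pointer: it stores the checksum
    of the block it points to, not of itself."""
    def __init__(self, block: bytes):
        self.block = block
        self.cksum = checksum(block)

def read_block(bp: BlockPointer, mirror: Optional[bytes] = None) -> bytes:
    """Read path: recompute the checksum and compare it to the one
    stored in the pointer; on mismatch, fall back to the mirror copy."""
    if checksum(bp.block) == bp.cksum:
        return bp.block
    if mirror is not None and checksum(mirror) == bp.cksum:
        bp.block = mirror  # self-heal the bad copy
        return mirror
    raise IOError("checksum mismatch with no good copy: restore from backup")
```

If the primary copy is corrupted on disk but the mirror verifies against the stored checksum, the read succeeds and the bad copy is repaired in place.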
Remember that the checksums are computed on blocks, not on the entire file, allowing the bad block(s) to be reconstructed when the checksums don't match, provided the information needed to rebuild them is available. If the blocks are mirrored, the mirror copy of the block is used and checked for integrity. If the blocks are stored using RAID, the data is reconstructed as with any RAID data: from the remaining data blocks and the parity blocks. However, a key point to remember is that in the case of multiple checksum failures the file is considered corrupt and must be restored from a backup.
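As a sketch of the RAID-style rebuild just described, the following Python snippet reconstructs one lost data block in a stripe from the survivors plus a single XOR parity block (RAID-5 style; RAID-6 adds a second, differently computed parity):

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def make_parity(data_blocks):
    # Parity is simply the XOR of all data blocks in the stripe.
    return xor_blocks(data_blocks)

def reconstruct_missing(surviving_blocks, parity):
    # XORing the survivors with the parity cancels the survivors out,
    # leaving exactly the one missing block.
    return xor_blocks(surviving_blocks + [parity])
```

Because XOR is its own inverse, losing any single block in the stripe is recoverable; losing two (with only one parity) is not, which is why multiple failures force a restore from backup.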
ZFS can help data integrity in some respects. It computes the checksum in memory before the data is passed to the drives, and it is very unlikely that the checksum will be corrupted while in memory. After computing the checksums, ZFS writes the data to the drives over the channel and writes the checksums into the block pointers.
Because the data has traveled through the channel, it is possible for it to be corrupted by an SDC. In that case ZFS will write corrupted data (the data, the checksum, or possibly both). When the data is read back, ZFS can recover the correct data because it will detect either a corrupted checksum (stored in the block pointer) or corrupted data. In either case, it restores the data from a mirror or from RAID.
The key point is that the only way to discover whether the data is bad is to read it again. ZFS has a feature called "scrubbing" that walks the data tree and checks both the checksums in the block pointers and the data itself. If it detects problems, the data is corrected. But scrubbing consumes CPU and memory resources, and storage performance is reduced to some degree, even though scrubbing runs in the background.
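Conceptually, a scrub is just a depth-first walk of the block tree. The sketch below uses a hypothetical node layout (`data`, the checksum stored for it in its parent's pointer, a repair source, and children); in real ZFS the repair source is a mirror or parity reconstruction, not a kept-aside good copy:

```python
import hashlib

def cksum(data: bytes) -> bytes:
    # SHA-256 stands in for ZFS's fletcher4/sha256 checksums.
    return hashlib.sha256(data).digest()

def scrub(node, repaired):
    """Walk the block tree, verifying each block against its stored
    checksum; on mismatch, repair the block and record its name."""
    if cksum(node["data"]) != node["cksum"]:
        node["data"] = node["repair_source"]  # stand-in for mirror/RAID repair
        repaired.append(node["name"])
    for child in node.get("children", []):
        scrub(child, repaired)
```

Every node in the tree is read and verified, which is exactly why a scrub costs I/O, CPU, and memory in proportion to the amount of data stored.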
If a drive suffers a hard error (see the first section) before ZFS has scrubbed data that was corrupted by an SDC in the SATA channel, it is very possible that the data cannot be recovered. The checksums could have been used to correct the corruption, but a drive holding the block and its block pointer is now dead, making recovery very difficult.
Given the hard error rate of Consumer SATA drives discussed in the first section, the size of typical RAID groups, and the SATA channel SDC rate, this combination of events is a distinct possibility (unless you scrub data at a very high rate so that newly landed data is scrubbed almost immediately, which limits the performance of the file system).
Therefore ZFS can "help" the SATA channel by reducing the effective SDC rate, because it can recover data corrupted by the SATA channel; but to do this, all of the data that is written must also be read back (to correct the data). This means that to write a chunk of data you have to compute the checksum in memory, write it with the data to the storage system, re-read the data and checksum, compare the stored checksum to a freshly computed one, and possibly recover the corrupted data, compute a new checksum, and write it to disk. That is a great deal of work just to write a chunk of data.
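The write-read-verify cycle above can be sketched as follows. The `device` dict and `channel` function are hypothetical stand-ins for the storage medium and a possibly flaky SATA channel:

```python
import hashlib

def write_verified(device, addr, data, channel, max_tries=3):
    """Compute the checksum in memory, write through the channel,
    read back and compare, and rewrite from memory on mismatch."""
    good = hashlib.sha256(data).digest()
    for _ in range(max_tries):
        device[addr] = channel(data)   # write; channel may corrupt silently
        stored = device[addr]          # re-read what actually landed
        if hashlib.sha256(stored).digest() == good:
            return True                # verified clean on media
        # silent corruption detected: retry the write from memory
    return False
```

Note the cost: every write implies at least one extra read plus a checksum computation, which is exactly the overhead the text describes.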
Another consideration for SAS vs. SATA is performance. Right now SATA has a 6 Gbps interface. Instead of doubling the interface to 12 Gbps, the decision was made to switch to something called SATA Express. This is a new interface that supports either SATA or PCI Express storage devices. SATA Express should start to appear in consumer systems in 2014, but peak performance can vary widely, from as low as 6 Gbps for legacy SATA devices to 8-16 Gbps for PCI Express devices (e.g., PCIe SSDs).
However, there are companies currently selling SAS drives with a 12 Gbps interface. Moreover, in a few years, there will be 24 Gbps SAS drives.
SATA vs. SAS: Summary and Observations
Let's recap. To begin with, SATA drives have a much higher hard error rate than SAS drives: Consumer SATA drives are 100 times more likely to encounter a hard error than Enterprise SAS drives, and SATA/SAS Nearline Enterprise drives have a hard error rate that is only 10 times worse than Enterprise SAS drives. Because of this, RAID group sizes must be limited when Consumer SATA drives are used, or you run the risk of a multi-disk failure that even something like RAID-6 cannot survive. There are plenty of stories of people who have used Consumer SATA drives in larger RAID groups where the array is constantly in the middle of a rebuild, and performance suffers accordingly.
The SATA channel has a much higher incidence rate of silent data corruption (SDC) than the SAS channel. In fact, the SATA channel is four orders of magnitude worse than the SAS channel for SDC rates. At the data rates of today's larger systems, you are likely to encounter a few silent data corruptions per year, even running at only 0.5 GiB/s over a SATA channel (about 1.4 per year). On the other hand, the SAS channel allows a much higher data rate without encountering an SDC: you would need to run the SAS channel at about 1 TiB/s for a year before you might encounter one (theoretically 0.3 per year).
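The per-year figures quoted above fall out of simple arithmetic. The sketch below assumes per-bit undetected-error rates of 1e-17 for the SATA channel and 1e-21 for SAS, which is consistent with the four-orders-of-magnitude gap and reproduces the numbers in the text (the exact rates come from the earlier sections):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def expected_sdcs_per_year(bytes_per_sec, per_bit_error_rate):
    """Expected silent corruptions over a year of sustained transfer."""
    bits_per_year = bytes_per_sec * 8 * SECONDS_PER_YEAR
    return bits_per_year * per_bit_error_rate

sata = expected_sdcs_per_year(0.5 * 2**30, 1e-17)  # 0.5 GiB/s -> ~1.4/year
sas  = expected_sdcs_per_year(1.0 * 2**40, 1e-21)  # 1 TiB/s   -> ~0.3/year
```

The expectation scales linearly with throughput, so doubling the sustained data rate doubles the number of silent corruptions you should expect.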
Using T10-DIF, the SDC rate for the SAS channel can be reduced to the point that we are unlikely ever to encounter an SDC in a year until we start pushing above the 100 TiB/s data rate range. Adding T10-DIX is even better, because it addresses data integrity from the application to the HBA (T10-DIF covers data integrity from the HBA to the drive). But changes to POSIX are required for T10-DIX to happen.
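For reference, the 8-byte T10-DIF tuple appended to each 512-byte sector carries a 2-byte guard tag (a CRC of the sector data), a 2-byte application tag, and a 4-byte reference tag. A simple bit-at-a-time sketch of the guard-tag CRC (polynomial 0x8BB7, zero initial value, no bit reflection, no final XOR):

```python
def crc16_t10dif(data: bytes) -> int:
    """CRC-16/T10-DIF: polynomial 0x8BB7, init 0x0000, no reflection,
    no final XOR. This CRC forms the 2-byte DIF guard tag."""
    crc = 0
    for byte in data:
        crc ^= byte << 8           # feed the next byte into the register
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc
```

Real implementations use table-driven or hardware-assisted versions of this CRC, but the mathematics is the same: any corruption of the sector between the HBA and the drive is overwhelmingly likely to change the guard tag and be caught.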
But T10-DIF and T10-DIX cannot be used with the SATA channel, so we are stuck with a fairly high SDC rate when using it. This is fine for home systems that have a couple of SATA drives, but for the enterprise world, or for systems with a reasonable amount of capacity, SATA drives and the SATA channel are a bad combination (lots of drive rebuilds and lots of silent data corruption).
File systems that do proper checksums, such as ZFS, can help with data integrity issues because they store a checksum with each data block, but they are not perfect. In the case of ZFS, to check for data corruption you have to read the data again. This really cuts into performance and increases CPU usage (remember that ZFS uses software RAID). We don't know the ultimate impact on the SDC rate, but it can help; unfortunately, I don't have any estimates of the reduction in SDC when ZFS is used.
Increasingly, there are storage solutions that use a smaller caching tier in front of a larger-capacity but slower tier. The classic example is using SSDs in front of spinning disks. The goal of this configuration is to effectively utilize much faster but typically costlier SSDs in front of slower but much larger capacity spinning drives. Conceptually, writes go first to the SSDs and are then migrated to the slower disks per some policy. Data to be read is also pulled into the SSDs as needed, so that reads are much faster than if they came from the disks. But in this configuration the overall data integrity of the solution is limited by the weakest link, as previously discussed.
If you are wondering about using PCI Express SSDs instead of SATA SSDs, you can do that, but unfortunately I don't know the SDC rate for PCIe drives and I can't find anything that has been published. Moreover, I don't believe there is a way to dual-port these drives so that they can be shared between two servers for resiliency (in many cases, if the cache goes down, the entire storage solution goes down).
If you have made it to the end of the article, congratulations; it is a little longer than I had hoped, but I wanted to present some technical facts rather than hand waving and arguing. It's pretty obvious that for reasonably large storage solutions where data integrity is important, SATA is not the way to go. But that doesn't mean SATA is pointless: I use SATA drives in my home desktop very successfully, but I don't have a great deal of data and I don't push much data through the SATA channel. Take the time to understand your data integrity needs and what kind of solution meets them.