SAS vs. SATA
SAS versus SATA is a perennial argument in enterprise storage. Some people say it's OK to use SATA for enterprise storage, and some say that you need to use SAS. In this article I'm going to address two aspects of the SAS vs. SATA argument. The first is about the drives themselves, SATA drives and SAS drives. The second is about data integrity in regard to SATA channels and SAS channels (channels are the connections from the drives to the Host Bus Adapter, or HBA).
SAS vs. SATA: Drives - Hard Error Rate
The hard error rate has been written about several times, including in one of Henry Newman's recent articles. It is defined as the number of bits that are read before the probability of hitting a read error (i.e., a sector that can't be read) reaches 100%. When a drive encounters a read error, it simply means that the data on that sector cannot be read. Most hardware will retry the read several times, but after a certain number of retries and/or a certain period of time, the read fails, the drive reports the sector as unreadable, and the drive is treated as failed.
Below is a table from Henry's previously mentioned article that lists the hard error rates for various drive types and how much data, in petabytes, would have to be read before encountering an unreadable sector.
Table 1: Hard error rate for various storage media
The first row in the table, labeled "SATA Consumer," covers drives that typically have only a SATA interface (there are no versions with a SAS interface). Here is an example of a SATA consumer drive spec sheet. Notice that the hard error rate, referred to as "Nonrecoverable Read Errors per Bits Read, Max" in the linked document, is 10E14 as shown in the table above.
The second class of drives, labeled as "SATA/SAS Nearline Enterprise" in the above table, can have a SATA or SAS interface (same drive for either interface). For example, Seagate has two enterprise drives, where the first one has a SAS 12 Gbps interface and the second one has a SATA 6 Gbps interface. The drives are otherwise the same, and both have the same hard error rate, 10E15.
The third class of drives, listed in the third row of the table as "Enterprise SAS/FC," typically only has a SAS interface. For example, Seagate has a 10.5K drive with a SAS interface (no SATA interface). The hard error rate for these drives is 10E16.
What the table tells us is that Consumer SATA drives are 100 times more likely than Enterprise SAS drives to encounter a read error. If you read 10TB of data from Consumer SATA drives, the probability of encountering a read error approaches 100% (virtually guaranteed to get an unreadable sector resulting in a failed drive).
SATA/SAS Nearline Enterprise drives improve the hard error rate by a factor of 10, but they are still 10 times more likely to encounter a hard read error (inability to read a sector) relative to an Enterprise SAS drive. For these drives, the probability of a hard error approaches 100% after reading roughly 111 TB of data (0.11 PB).
On the other hand, with Enterprise SAS drives considerably more data can be read before encountering a read error. For Enterprise SAS drives, about 1.1 PB of data can be read before approaching a 100% probability of hitting an unreadable sector (hard error).
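The data volumes quoted above follow from a simple unit conversion. The sketch below is only an illustration of that arithmetic: it assumes the three error rates mean one error per 10^14, 10^15, and 10^16 bits read, and that the article's petabyte figures use binary units (2^50 bytes), which is an inference from the quoted 0.11 PB and 1.1 PB values rather than something the article states. The function name is hypothetical.

```python
# Bits read per unrecoverable read error for each drive class.
# Assumption: the quoted rates mean 1 error per 10^14, 10^15, 10^16 bits.
BITS_PER_ERROR = {
    "SATA Consumer": 10**14,
    "SATA/SAS Nearline Enterprise": 10**15,
    "Enterprise SAS/FC": 10**16,
}


def petabytes_per_error(bits, binary=True):
    """Convert a bits-per-error figure into petabytes of data read.

    binary=True uses 2**50 bytes per PB (assumed to match the article's
    figures); binary=False uses 10**15 bytes per PB.
    """
    bytes_read = bits / 8
    return bytes_read / (2**50 if binary else 10**15)


for name, bits in BITS_PER_ERROR.items():
    print(f"{name}: about {petabytes_per_error(bits):.3f} PB per hard error")
```

With binary units this gives roughly 0.011, 0.11, and 1.1 PB for the three classes. With decimal units the figures are slightly larger (0.0125, 0.125, and 1.25 PB), so the conclusions do not depend on which convention is used.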
At the point where you encounter a hard error the controller assumes the drive has failed. Assuming the drive was part of a RAID group the controller will start a RAID rebuild using a spare drive. Classic RAID groups will have to read all the disks that remain in the RAID group to rebuild the failed drive. This means they have to read 100% of the remaining drives even if there is no data on portions of the drive.
For instance, if we have a RAID-6 group with 10 total drives and we lose a drive, then 100% of the nine remaining drives have to be read to rebuild the failed drive and regain the RAID-6 protection. This is true even if the file system using the RAID-6 group has no data in it.
For example, if we are using ten 4TB Consumer SATA drives in a RAID-6 group, there is a total of 40TB of raw capacity. Given the information in the previous table, when about 10TB of data is read there is almost a 100% chance of encountering a hard disk error. The drive on which the error occurred is then failed, causing a rebuild. In the case of the ten-disk RAID-6 group, this means that there are now nine drives and we can only lose one more drive before losing data protection (recall that RAID-6 allows you to lose two drives before the next lost drive results in unrecoverable data loss).
In this scenario, I'm going to assume there is a hot-spare drive that can be used for the rebuild in the RAID group. In a classic RAID-6, all of the remaining nine drives will have to be read (a total of 36TB of data) for the rebuild. The problem is that during the rebuild the probability of hitting another hard error reaches 100% when just 10TB of that 36TB has been read. When this happens there is now a double drive failure and the RAID group is down to eight drives.
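The rebuild arithmetic in this scenario can be checked with a short sketch. The 36TB read volume comes straight from the text (nine surviving 4TB drives). The probability estimate is an added assumption, not the article's method: it treats read errors as independent events occurring at the quoted maximum rate (a Poisson model), whereas the article uses the simpler rule that an error is expected once the bits-per-error figure has been read. The function names are hypothetical, and drive sizes are taken as decimal terabytes.

```python
import math


def rebuild_read_tb(total_drives, drive_tb, failed=1):
    """Data a classic RAID rebuild must read: every surviving drive in full."""
    return (total_drives - failed) * drive_tb


def expected_read_errors(read_tb, bits_per_error):
    """Expected hard errors while reading read_tb (decimal TB) of data."""
    bits_read = read_tb * 10**12 * 8
    return bits_read / bits_per_error


def prob_at_least_one_error(read_tb, bits_per_error):
    """Chance of one or more hard errors, assuming independent errors
    at the quoted rate (Poisson model; an assumption of this sketch)."""
    return 1 - math.exp(-expected_read_errors(read_tb, bits_per_error))


read_tb = rebuild_read_tb(total_drives=10, drive_tb=4)  # 36 TB
for name, bits in [("SATA Consumer", 10**14),
                   ("SATA/SAS Nearline Enterprise", 10**15),
                   ("Enterprise SAS/FC", 10**16)]:
    errors = expected_read_errors(read_tb, bits)
    prob = prob_at_least_one_error(read_tb, bits)
    print(f"{name}: {errors:.2f} expected errors, P(at least one) = {prob:.1%}")
```

Under these assumptions, the 36TB rebuild on Consumer SATA drives corresponds to nearly three expected hard errors, which is consistent with the point that a second failure during the rebuild is very likely. The same rebuild on the other two drive classes yields an expected error count well below one.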