Let’s Bid Adieu to Block Devices and SCSI


The concept of block devices has been around for a long time, so long it’s hard to pin down when the technology first appeared. I checked with some friends who are old-timers (although they might object to that term), and they thought that 35 years was a good guess. That is a very long time for a technology concept.

SCSI has been around for a long time too. There is an excellent historical account of SCSI here. The history is interesting, and it launched more than one company, but the standard was published in 1986, which makes it nearly 19 years old. Since that time, modest changes have been made to support interface changes, new device types, and some changes for error recovery and performance, but nothing has really changed in the basic concepts of the protocol. The last 18 years have been evolutionary at best.

So we have storage devices that have been working the same way for 35 years, and a common protocol to talk to devices that has been around for 18 years, with very little in the way of changes. Folks, I might be missing something, but that is a very long time for a set of technologies.

There are two areas that I see as big problems with these dated technologies:

  1. End-to-end lack of coordination and knowledge
  2. RAID rebuild

Coordination and Knowledge Issues

File systems and block devices are not well coordinated. What file systems are supposed to do is virtualize the storage so that you do not have to keep track of or maintain an understanding of the underlying storage topology. Well, we have achieved that, and things do not work very well. Here are some examples:

  • File systems cannot pass the location or locations of a file to a RAID device, so the RAID device cannot determine which blocks to read ahead into its cache so that the next request is satisfied from the RAID cache.
  • File systems do not track access patterns in order to read ahead by skip increments. If a file is read by reading 64 KB, skipping 128 KB, and reading the next 64 KB, the file system will not issue a readahead, and neither can the RAID controller (see the sketch after this list).
  • Many file systems stripe data so that blocks are not sequential on LUNs, and RAID devices have no chance to read data ahead since the blocks requested are not sequential.
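
To make the skip-increment case concrete, here is a minimal sketch of such a strided read in C. The file name is hypothetical, and the 64 KB read / 128 KB skip figures simply echo the example above; the point is that the pattern is perfectly regular from the application's side, yet neither the file system nor the RAID controller will recognize it as a readahead candidate.

```c
/* Strided read: read 64 KB, skip 128 KB, repeat.
 * The pattern is regular, but because consecutive requests are not at
 * sequential file offsets, typical file system readahead never kicks in.
 * The file name and sizes are illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define READ_SIZE (64 * 1024)
#define SKIP_SIZE (128 * 1024)

int main(void)
{
    char *buf = malloc(READ_SIZE);
    int fd = open("datafile", O_RDONLY);   /* hypothetical file */
    off_t offset = 0;

    if (fd < 0 || buf == NULL) {
        perror("setup");
        return 1;
    }

    /* Each iteration reads 64 KB and then jumps ahead 128 KB. */
    while (pread(fd, buf, READ_SIZE, offset) > 0)
        offset += READ_SIZE + SKIP_SIZE;

    close(fd);
    free(buf);
    return 0;
}
```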

The end result is that files are not allocated sequentially and RAID readahead algorithms can’t function properly, causing RAID controller performance degradation.

We’ll go into more detail on these issues below.

Location of a File

If the file is sequentially allocated on a single LUN and read sequentially, then a RAID device’s readahead cache will work just fine. More often than not, though, that is not the case. Most file systems I have seen have multiple write streams creating and adding to existing files. Given that almost all file systems use a first-fit algorithm for allocation, data is not sequentially allocated. See this article for more information on file system allocation.

If files are not sequentially allocated, simple block devices have no idea how to read the data ahead. A number of file systems try to get around this by allocating large blocks of space and then releasing the unused portion when the file is closed, or by similar techniques that preallocate data space. From what I have seen, these techniques help, but over time the allocation maps become fragmented unless the files are all the same size, so you are back to square one.
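
For what it is worth, an application can also ask for contiguous space up front rather than leaving allocation entirely to the file system. A minimal sketch using posix_fallocate() follows; the file name and size are assumptions, and whether the reservation actually ends up contiguous still depends on the state of the file system's free-space map.

```c
/* Preallocate space for a file before writing it, giving the file system
 * a chance to find one large extent instead of scattering blocks as
 * multiple writers compete for space. Name and size are illustrative only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const off_t file_size = 1024L * 1024 * 1024;   /* 1 GB, assumed */
    int fd = open("output.dat", O_WRONLY | O_CREAT, 0644);
    int err;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Reserve the full size now; later writes fill in already-allocated blocks. */
    err = posix_fallocate(fd, 0, file_size);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        return 1;
    }

    /* ... write the data ... */
    close(fd);
    return 0;
}
```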

File System and Access Patterns

From what I have seen, in both database and HPC environments, files are often read with skip increments. Most file systems, when reading data with buffered I/O, issue readaheads only if the requests are for sequential addresses. Remember, just because an address is sequential in the file system does not mean that the addresses are sequential on the devices. Given how the latency gap between CPUs and storage devices has widened over the last 25 years, you need to have large I/O requests that allow the RAID device to read ahead sequential blocks (see Storage I/O and the Laws of Physics for more information on this issue).

If you are reading with direct I/O (opening the file with O_DIRECT), the issue is the same. File systems just do not issue readaheads except for sequential block addresses. The application could issue an asynchronous readahead of its own, and for some databases this does happen, but file systems just do not have the intelligence built in to do this on the application's behalf.
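
As an illustration of what an application-issued readahead can look like, here is a minimal sketch under direct I/O: open the file with O_DIRECT (which requires aligned buffers), read the current block, and post an asynchronous read for the next block you know you will need. The file name, block size, offsets, and alignment are all assumptions, and on Linux the POSIX AIO calls may need -lrt at link time.

```c
/* Application-driven readahead under direct I/O: while processing one
 * block, post an asynchronous read for the next block we know we will
 * need, since the file system will not do this for a strided pattern.
 * File name, block size, and offsets are illustrative only.
 */
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE (64 * 1024)  /* must be a multiple of the device sector size */

int main(void)
{
    int fd = open("datafile", O_RDONLY | O_DIRECT);   /* hypothetical file */
    void *cur, *next;
    struct aiocb cb;
    const struct aiocb *list[1];

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* O_DIRECT requires aligned buffers. */
    if (posix_memalign(&cur, 4096, BLOCK_SIZE) || posix_memalign(&next, 4096, BLOCK_SIZE))
        return 1;

    /* Read the first block synchronously. */
    pread(fd, cur, BLOCK_SIZE, 0);

    /* Post an asynchronous read for the block needed next
     * (here, the strided offset from the earlier example). */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = next;
    cb.aio_nbytes = BLOCK_SIZE;
    cb.aio_offset = BLOCK_SIZE + 128 * 1024;
    aio_read(&cb);

    /* ... process cur while the next block is read in the background ... */

    list[0] = &cb;
    aio_suspend(list, 1, NULL);   /* wait for the prefetch to complete */
    close(fd);
    return 0;
}
```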

Striping

Many file systems use a volume manager to stripe the data across all of the devices in the file system. This defeats any potential for sequential allocation of blocks on the individual disks in a LUN, and therefore defeats the readahead cache on the RAID. It should be noted that a number of file systems have added round-robin allocation (see the Physics article cited above for more details) as an additional allocation method. Most Linux volume managers, and file systems that are combined with volume managers, do not support round-robin allocation, which means that most Linux I/O will not use the RAID cache efficiently.
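
To see why striping breaks sequential addresses on the disks, here is a small sketch of the usual stripe-mapping arithmetic. The stripe width and element size are assumptions; the point is that consecutive logical blocks land on different disks, so no single disk ever sees a run of requests long enough to trigger readahead.

```c
/* Map a logical block number to (disk, block-on-disk) under simple striping.
 * With 8 disks and a one-block stripe element, logical blocks 0..7 land on
 * eight different disks, and each disk only ever sees every eighth block.
 * The stripe width and element size are illustrative only.
 */
#include <stdio.h>

#define DISKS          8
#define STRIPE_ELEMENT 1   /* blocks per stripe element, kept small for clarity */

int main(void)
{
    for (long logical = 0; logical < 16; logical++) {
        long stripe  = logical / (DISKS * STRIPE_ELEMENT);
        int  disk    = (int)((logical / STRIPE_ELEMENT) % DISKS);
        long on_disk = stripe * STRIPE_ELEMENT + (logical % STRIPE_ELEMENT);

        printf("logical block %2ld -> disk %d, block %ld on that disk\n",
               logical, disk, on_disk);
    }
    return 0;
}
```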

RAID Rebuild

If a disk within a RAID LUN goes bad, the RAID set must be rebuilt. With RAID-5, this means reading the data from the remaining good disks and writing it out again to those disks plus the hot spare that replaces the failed drive. Take the example of a 2Gb FC RAID-5 8+1 with 146 GB drives and two 2Gb channels connecting the disks. To rebuild, most RAIDs read a stripe in and then write the same stripe out, one stripe at a time. Therefore you will have:

  • 400 MB/sec of bandwidth available for reading and writing at 100% efficiency
  • 1.168 TB to read (eight good disks, each at 146 GB)
  • 1.314 TB to write out (nine disks, since the parity is written as well, each at 146 GB)

You cannot always read and write at 100% efficiency, so the table below estimates the time at various efficiency levels. The efficiency you actually achieve depends on factors such as the segment or stripe element size per disk (bigger is better), other I/O being done on the RAID, rebuild tunables within the RAID, and other vendor-dependent factors.

Efficiency    Estimated Read Time (sec)    Estimated Write Time (sec)    Estimated Total Time (sec)
100%                   3062                         3445                          6506
 90%                   3402                         3827                          7229
 75%                   4082                         4593                          8675
 50%                   6124                         6889                         13013
 25%                  12247                        13778                         26026
 10%                  30618                        34446                         65064

Table 1: 146 GB drive rebuild time at various efficiencies.
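
If you want to reproduce estimates like these for your own configuration, the arithmetic is simply data moved divided by usable bandwidth, where usable bandwidth is the channel bandwidth scaled by the efficiency factor. Here is a minimal sketch; the disk count, capacity, and bandwidth are the ones assumed above, and depending on how you count gigabytes and channel overhead the output will differ slightly from the tables, but the structure of the calculation is the same.

```c
/* Rough rebuild-time estimate: time = data to move / (bandwidth * efficiency).
 * The disk count, capacity, and bandwidth follow the 8+1 x 146 GB example;
 * substitute your own numbers. The results are estimates only.
 */
#include <stdio.h>

int main(void)
{
    const double read_disks   = 8;              /* surviving disks to read */
    const double write_disks  = 9;              /* disks written back out */
    const double capacity_mb  = 146.0 * 1024;   /* per-disk capacity in MB (assumed) */
    const double bandwidth_mb = 400.0;          /* MB/sec across both channels */
    const double efficiencies[] = { 1.00, 0.90, 0.75, 0.50, 0.25, 0.10 };

    for (int i = 0; i < 6; i++) {
        double eff   = efficiencies[i];
        double read  = read_disks  * capacity_mb / (bandwidth_mb * eff);
        double write = write_disks * capacity_mb / (bandwidth_mb * eff);
        printf("%3.0f%% efficiency: read %.0f s, write %.0f s, total %.0f s\n",
               eff * 100, read, write, read + write);
    }
    return 0;
}
```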

Having only 50 percent efficiency is certainly not unheard of, and having your RAID take more than three hours to rebuild is a long time to live with application performance degradation and exposure to another disk failure. Now think about what happens with 400 GB SATA drives in a 15+1 instead of an 8+1, a configuration commonly used on some of the multimedia systems I have worked on:

Efficiency    Estimated Read Time (sec)    Estimated Write Time (sec)    Estimated Total Time (sec)
100%                  15360                        16384                         31744
 90%                  17067                        18204                         35271
 75%                  20480                        21845                         42325
 50%                  30720                        32768                         63488
 25%                  61440                        65536                        126976
 10%                 153600                       163840                        317440

Table 2: 400 GB drive rebuild time at various efficiencies.

More than 17 hours of rebuild time at 50 percent efficiency is unacceptable. This problem is going to get worse, not better, over time.

Conclusions

I hope I have stated my case well enough that I won’t get too much hate mail, but in my heart of hearts I believe the time has come for SCSI and block devices to be replaced. In the next article, we will cover the technology that I think will replace both of them and that, if adopted, could also replace file systems as we know them. All of this would be a good thing, in my opinion. That technology is called Object Storage Device, or OSD.
