Cooling Storage 'Hot Spots'
A "hot spot" in storage architecture isn't nearly as racy as it sounds. In fact, it's quite the opposite: It's a part of the disk system with unusually high activity, usually characterized by long wait times for I/O requests and long waits for the data those requests return. Hot spots in a storage architecture are not desirable, of course, and storage architects and administrators work hard to reduce their number and effect.
Hot spots are trouble because most applications issue I/O to many devices to satisfy a single application request, so the response is limited by the slowest physical I/O. If one part of the request has high latency, say a read of a database index, the application response depends on that highest-latency I/O generated by the SQL call. The result, of course, is a big drop in performance.
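To make that dependency concrete, here is a minimal sketch in Python. It assumes the physical I/Os behind one application request are issued in parallel, so the request completes only when the slowest one does. The device names and latencies are invented for illustration and are not measurements.

```python
def application_response_ms(device_latencies_ms):
    """Response time of one application request whose physical I/Os
    run in parallel: it finishes when the slowest I/O finishes."""
    return max(device_latencies_ms.values())

# Hypothetical latencies (ms) for the physical I/Os behind one SQL call.
balanced = {"data_lun_0": 8.0, "data_lun_1": 9.0, "index_lun": 8.5}
hot_spot = {"data_lun_0": 8.0, "data_lun_1": 9.0, "index_lun": 60.0}

print(application_response_ms(balanced))  # 9.0
print(application_response_ms(hot_spot))  # 60.0
```

In this toy case, one hot device holding the index sets the response time for the whole request, no matter how quickly the other devices answer.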
So what can you do about these pesky hot spots? Well, almost everyone on the planet says the best way to address hot spots is to spread the data out so that you use more disk drives, but I believe that in many cases this is absolutely the wrong thing to do. I am going to explore the pros and cons of this approach and suggest another way of looking at the problem. Remember that something is not true just because everyone believes it. The world is not flat.
A Brief History of Hot Spots
Back in the early 1990s, RAID was in its infancy on open systems, although it had seen some use on IBM mainframes. EMC was the RAID leader, and the Veritas volume manager was coming into wide use. The confluence of these two products, in my opinion, led to what I call "the hot spot theory" of storage architecture. At the time it might have been the correct solution to the problem, but we now have other tools that might provide better solutions. Let's dive into the way things were back then and why it made sense to do things that way.
The file systems of the time in open systems were standard UNIX file systems that had small allocations and that mixed file system metadata and data areas. On the mainframe side, IBM MVS dominated with its record-oriented file system. The point is that most UNIX file systems were structured so that data was not necessarily allocated sequentially, and MVS allocation was based on records, so databases were not allocated sequentially. Many database users on UNIX systems during this time used raw devices.
On the storage hardware side, Seagate had introduced SCSI drives, and they were taking the world by storm, replacing IPI-3 and other drive types. The table below shows the progression of SCSI drives.
Compare that progress to today:
Some of the key points from the comparison are:
- The amount of time it takes to read the whole device has gone from 125 seconds to 2400 seconds, an increase of 19.2 times.
- The seek plus latency time has gone from 20.8 milliseconds to 7.69 milliseconds, an improvement of about 2.7 times.
- Storage density has gone up 600 times.
- Bandwidth has gone up 31.25 times.
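The ratios in the list above can be checked with a few lines of Python. This is only a sanity check on the quoted figures. It assumes that "storage density" stands in for drive capacity, so that the time to read a whole drive grows as capacity growth divided by bandwidth growth.

```python
# Figures quoted in the list above.
full_read_1991_s, full_read_now_s = 125.0, 2400.0
seek_latency_1991_ms, seek_latency_now_ms = 20.8, 7.69
capacity_growth = 600.0    # "storage density" treated as capacity (assumption)
bandwidth_growth = 31.25

full_read_ratio = full_read_now_s / full_read_1991_s
seek_latency_ratio = seek_latency_1991_ms / seek_latency_now_ms

# Time to read a whole drive = capacity / bandwidth, so its growth should
# equal the capacity growth divided by the bandwidth growth.
implied_full_read_ratio = capacity_growth / bandwidth_growth

print(full_read_ratio)               # 19.2
print(round(seek_latency_ratio, 2))  # 2.7
print(implied_full_read_ratio)       # 19.2
```

Under that assumption the figures agree with each other: capacity grew about 19 times faster than the bandwidth needed to read it, while seek plus latency improved by less than a factor of three.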
Disks have gotten a little faster since 2004, but not much. Taken together, these facts mean that it takes much longer to read all of the data on a drive. Take your average 8 KB database I/O. In 1991, with the average seek and latency, it took about 0.004 seconds to read that data. Today it takes about 0.0008 seconds, roughly a fivefold improvement. For small I/O requests, this means that seek and latency times will likely dominate performance, which is no surprise since disk drives are mechanical devices. There are two points to make:
- When RAID, volume managers and file systems started under UNIX systems, spreading data across lots of drives was the only way to get performance, since the drives were slow compared to seek+latency times.
- During these early days, RAID cache sizes were at a premium and therefore aggressive readahead and writebehind activities were not possible.
So we had a world where disk transfers were reasonably fast relative to seek and latency, file systems broke data up into small chunks, volume manager and RAID allocations were not designed to be very large (less than 128 KB), aggressive readahead was not practical, and databases and applications were relatively small. It's no wonder that the only way to solve this was to spread things out as much as possible; it was an entirely appropriate method and the only one possible. That confluence of reasons probably held until about 2001 or 2002. By then we had:
- File systems that could allocate data sequentially in large allocations of more than 32 MB;
- RAID caches for mid-range products over 2 GB, and for enterprise products over 64 GB; and
- Volume managers that could allocate gigabytes before switching to another device.
We've also had hardware changes for disk drives where transfer rates improved far more than seek and latency times.
Cooling Hot Spots
Any good storage person knows that one size does not fit all, but the hot spot theory of storage architecture is treated as one of the 10 storage commandments. I am not saying that it is wrong, but think of the problem this way. Let's say I have a file system and volume manager allocating 64 MB or more of sequentially allocated data. Even with a RAID-5 8+1 and, say, a 512 KB stripe per drive, that would be 16 full stripes, or 16 sequential 512 KB allocations on each drive. Think about that. The current generation of Seagate hard drives has 16 MB of cache per drive. If every request is random, readahead by the drive does not help performance. Also, in the time an average seek plus latency takes, the drive could have transferred 558 KB for a read or 608 KB for a write. That is a great deal of transfer capacity lost on every I/O request.
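The arithmetic behind the RAID-5 example above, and behind the idea of transfer capacity lost to seek and latency, can be sketched in Python. The function names are my own. The 75 MB/s sustained rate is a hypothetical value chosen only for illustration, and the 7.69 ms seek plus latency is borrowed from the drive comparison earlier in the article. The sketch does not try to reproduce the 558 KB and 608 KB figures, which depend on drive parameters not given here.

```python
KB = 1024
MB = 1024 * KB

def full_stripes(allocation_bytes, data_drives, chunk_bytes):
    """Number of full stripes (one chunk on every data drive) in an allocation."""
    return allocation_bytes // (data_drives * chunk_bytes)

def transfer_forgone_bytes(rate_bytes_per_s, seek_plus_latency_s):
    """Data the drive could have streamed during one seek plus latency."""
    return rate_bytes_per_s * seek_plus_latency_s

# RAID-5 8+1: 8 data drives, 512 KB per drive, 64 MB sequential allocation.
print(full_stripes(64 * MB, 8, 512 * KB))  # 16

# Hypothetical drive: 75 MB/s sustained, 7.69 ms average seek plus latency.
forgone_kb = transfer_forgone_bytes(75 * MB, 0.00769) / KB
print(round(forgone_kb))  # 591
```

With those assumed values, each random I/O gives up roughly the amount of data held in one 512 KB chunk, which is why sequential layout plus readahead is worth a look.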
There is one more area that must be looked at before sequential allocation can be considered, and that is whether the I/O requests themselves are sequential, an area that even I am not 100 percent sure of. The hot spot theory states that I/O requests are random. I am sure that is not correct 100 percent of the time, but I am not sure what percentage of the time it is correct. I know that in the HPC world, much of the I/O from the application is sequential. I know that backups are often sequential. I also know that some database accesses are random, but some of the databases I have seen and traced actually read indices sequentially for a while, skip ahead by some increment, and then read sequentially again. I have seen this type of behavior many times on many different types of databases, but not on all databases. If you put this on RAID-1, and your allocations can be sequential, then readahead will provide a significant performance improvement when requests are sequential, and best of all, you can do this with far less hardware since disk efficiency improves.
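One way to answer the question of how sequential your I/O really is would be to measure it from a trace. The sketch below is a minimal Python illustration, not a tracing tool: it assumes you already have a list of (offset, length) pairs in issue order, and it counts a request as sequential only when it starts exactly where the previous one ended. The synthetic trace generator is hypothetical and simply mimics the read-sequentially-then-skip pattern described above.

```python
def sequential_fraction(trace):
    """Fraction of requests that start exactly where the previous one ended.

    trace is a list of (offset_bytes, length_bytes) tuples in issue order.
    """
    if len(trace) < 2:
        return 0.0
    sequential = 0
    for (prev_off, prev_len), (off, _) in zip(trace, trace[1:]):
        if off == prev_off + prev_len:
            sequential += 1
    return sequential / (len(trace) - 1)

def skip_sequential_trace(runs, run_length, io_size, skip):
    """Synthetic trace: run_length sequential reads, then a skip, repeated."""
    trace, offset = [], 0
    for _ in range(runs):
        for _ in range(run_length):
            trace.append((offset, io_size))
            offset += io_size
        offset += skip
    return trace

# Four runs of eight 8 KB reads, each run separated by a 1 MB skip.
trace = skip_sequential_trace(runs=4, run_length=8, io_size=8192, skip=1 << 20)
print(round(sequential_fraction(trace), 3))  # 0.903
```

A high fraction on a real trace would suggest that sequential allocation and readahead deserve a trial, while a fraction near zero would support the random-access assumption for that workload.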
There will always be hot spotters who proclaim that their way is the only way to architect storage for performance, but if you take the next step down, analyze your data access patterns, and use a file system and volume manager to lay out your data properly, you can get better performance with less hardware and far better scaling. If the world moves to OSD, which I hope it does, object technology should have the intelligence to perform readaheads and writebehinds based on the data access patterns. Whatever happens, I think people need to rethink the problem and consider other potential solutions based on today's technology.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 27 years of experience in high-performance computing and storage.