It was not that long ago that a big database was something less than 2 TB. Today large databases can mean 100 or more terabytes, a growth of about 50 times in less than a decade. We all know that storage performance has not increased at nearly the same rate (see Storage Headed for Trouble). Given our concerns about storage performance and all of the issues surrounding it, including file system fragmentation, HBA/HCA command queue and so on, what do you do when you have a very large database and do not want to wait forever to get the answers you need? I was just asked this question by a customer, and I thought you might be interested in the answer.
The customer in question had put a large database on a large enterprise RAID controller. The controller was configured with RAID-5 (the number of disks in the RAID set will remain anonymous to protect the RAID vendor in question). The first question I asked the customer was why RAID-5 and not RAID-1? The answer was that given the size of the database, they could not afford the cost of RAID-1 since it would require two additional enterprise controllers. That’s a big database.
The database itself was not fixed size, since there was an ingest stream adding new data all the time. The customer also said they never re-index the database since it couldn’t be completed in a reasonable amount of time. I asked a few more questions and then found out the reason I was called: The customer realized that the performance of the enterprise controller did not meet the operational requirements of the database.
They showed me some measurements, and the number of requests from the hosts far exceeded the capability of the enterprise controller. The large cache on the controller — 256 GB — was almost of no value, since there were less than a few percent of cache hits. The RAID stripes were not powers of two, and most of the I/O requests were small and powers of two. Basically, even though the enterprise controller might have been easy to manage, mirror and carve up, it could never meet the performance requirements of this database.
The customer had decided to move to midrange RAID controllers for a number of reasons. They expected to be able to scale the performance of the database better as the size increased by adding more controllers, and an equal cost in midrange controllers provided more performance on both the front end (cache to host) and back end (cache to disk) and a large command queue than did the enterprise controllers.
Still, I wondered why I had been called to this meeting, and I soon found out. They asked me to tell them how they should configure 200 TB of storage space on eight midrange controllers. I was somehow supposed to give them a list of RAID tunables, LUN configurations and file system settings off the top of my head with no information other than the hardware and software. An impossible task for anyone. So I started to ask a bunch of questions. Here was part of my first salvo:
- What are the size of each of the index files?
- How many index files are there?
- How many are created and how often?
- How often are they searched?
- How many users are searching them at the same time?
- Are the index files read sequentially or randomly?
I got answers to the first five questions pretty easily, but then at question six they said index reads are all random I/O. But when I asked them how they knew that, I got a number of different responses, ranging from “Well, the database monitor said the blocks were random” to “all database I/O are random.” That might be an acceptable answer for some, but when dealing with critical performance issues you need proof. So I asked them if anyone had ever run the Solaris truss() command on the index read process. Since this was a Sun Solaris system, truss is a utility that list all of the system calls, which would give the address for every read and write request. The truss() command runs on Solaris and Linux. Most other operating systems have a similar type of utility. Of course, the answer was that they weren’t really sure of the access patterns for the index reads since they had have never looked at it.
In short, they were asking me to configure all of the new storage and file systems when I had no idea what the database patterns did to the storage. Of course, if you configure the storage to stripe all the data, all the data would be random access, but over the years I have seen a number of databases where the index reads in terms of block address access were sequential access followed by a skip to a bigger address followed by sequential access. This isn’t true for every case I have seen, but I have seen it enough to know that it is something to seriously consider. Much of the reason this is not considered is that for most of the 1990s and much of this decade, most file systems striped data across RAID controllers, and performance was based on the performance of random IOPS for all of the drives and the RAID controller. This is no longer necessary, since many file systems and volume managers can work together to allow sequential allocation of the block address. This allows the RAID controller to readahead, since RAID controllers only readahead based on sequential blocks and the file system topology cannot be passed to the RAID controller since neither SCSI nor SATA have the mechanism to pass topology built into the protocol. There is hope on the horizon, at least for SCSI, with the new OSD standard, but so far OSD is still in the development phase. The problem is we still live in a block-based world.
The Answer’s in the Data Path
As I have said in the past, you need to review the whole data path before you have a full understanding of the I/O requests to the storage. Understanding the full data path requires you to understand the application, the file system, the volume manager layout and, last but not least, the storage layout.
Big databases with large indexes that are searched often require special attention to the I/O request sizes, which are often based on the database allocation and the file system layout. If you have a database block size of 8 KB and file system and RAID allocations of 32 KB, depending on a number of factors, you could be accessing far more data than the 8 KB with the application basic block.
The approach I take with big databases is no different than the approach I take with everything else, given the limitations we have with technology. That is:
- Find out what the application is doing and if it can be changed to make bigger requests. Bigger requests only make sense if you are using all of the data in the request.
- Based on the answer, tune the whole data path, including the operating system, file system, volume manager, HBAs and storage to optimize the data path.
You need to keep in mind all of the issues and limitations of current technology, such as RAID limitations for sequential I/O. The techniques are the same for storage systems small and large.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 26 years experience in high-performance computing and storage.
See more articles by Henry Newman.